Summary
In this lab, we will use the scan_bedg.py script to break up the Encode Sp1 and reversed ChIP seq files (which are in .bedg format), and create a standard .bed file. We will do a few things to look at that file, but ultimately you will upload a version of the BED file(s) to the UCSC browser, and download associated DNA sequences, which will then be used to search for consensus binding site sequences.
mkdir ~/biol4230/hwk10 cd ~/biol4230/hwk10 ln -s ../hwk9/wgEncode_*.bedg .Alternatively, the files are in /nv/md_rdlib2/biol4230/bed-tools/, and you can link or copy the files from there.
scan_bedg.py -t 2.0 -w 10 -n Sp1 wgEncode_ChIP_Sp1.bedg > wgEncode_Sp1_020.bedThe scan_bed.py script looks for successive points in the .bedg file in a window of -w 10 (10) residues with an average score of 2.0 in the window (-t 2.0) per location. Where the average has gone above 2.0, it writes a start coordinate. When the average score in the window drops below 2.0, it ends the segment, and writes out the name, the score across the window, and '+' for the strand.
How many segments are created?
awk '{sum += ($3 - $2); n++} END{print sum/n}' wgEncode_Sp1_020.bedThis number is very high, because there are a small number of very long regions. You can get a more sensible value by excluding long regions:
awk '{len = $3-$2; if (len < 1000) {sum += ($3 - $2); n++}} END{print sum/n}' wgEncode_Sp1_020.bedawk makes it very easy to isolate the subset of lines with reasonable region lengths.
awk '{len = $3 - $2; if (len < 250 && len > 25) { print $_}}' wgEncode_Sp1_020.bed > wgEncode_Sp1_ranged.bedRun this command and check the average length of the Sp1 region.
bedtools slop -l 250 -r 50 -i tss.bed -g hg19.chromsizes > tss_minus250plus50.bedCheck the top of the tss_minus250plus50.bed file, and tss.bed to be sure that the new file is correct using head tss.bed and head tss_minus250plus50.bed. (You could also check the average segment length with awk. What should it be?)
bedtools intersect -a wgEncode_Sp1_ranged.bed -b tss_minus250plus50.bed -wa | uniq > wgEncode_Sp1_tss.bedHow many Sp1 regions did you start with? How many do you have now? What is the average length of the Sp1 regions near transcription start sites?
Select:
group: Custom Tracks
output format: sequence
output file: sp1.fasta
Download the file to your laptop, and take a look at it.
For MEME to work, each of the FASTA entries must have a different "name". By default, all the sequences downloaded from the UCSC browser custom track have headers of the form:
>hg19_ct_UserTrack_3545_Xfactor range=chr1:567551-567600 5'pad=0 3'pad=0 strand=+ repeatMasking=none >hg19_ct_UserTrack_3545_Xfactor range=chr1:569896-569940 5'pad=0 3'pad=0 strand=+ repeatMasking=none >hg19_ct_UserTrack_3545_Xfactor range=chr1:1334695-1334759 5'pad=0 3'pad=0 strand=+ repeatMasking=none >hg19_ct_UserTrack_3545_Xfactor range=chr1:1334782-1334913 5'pad=0 3'pad=0 strand=+ repeatMasking=none >hg19_ct_UserTrack_3545_Xfactor range=chr1:1850756-1850804 5'pad=0 3'pad=0 strand=+ repeatMasking=none >hg19_ct_UserTrack_3545_Xfactor range=chr1:2245833-2245867 5'pad=0 3'pad=0 strand=+ repeatMasking=none >hg19_ct_UserTrack_3545_Xfactor range=chr1:2245887-2245928 5'pad=0 3'pad=0 strand=+ repeatMasking=none >hg19_ct_UserTrack_3545_Xfactor range=chr1:2245964-2245990 5'pad=0 3'pad=0 strand=+ repeatMasking=none >hg19_ct_UserTrack_3545_Xfactor range=chr1:2322970-2323022 5'pad=0 3'pad=0 strand=+ repeatMasking=noneYou need to remove the constant text between ">" and "range=". On franklin, this can be done with:
cat sp1.fasta | sed -e 's/hg19_ct_UserTrack_3545_Sp1 range=//g' > sp1_fixed.fasta
Due Monday, May 9, 5:00 PM.