In this lab, we take a set of SP1 binding site coordinates, downloaded from UCSC and the ENCODE project, This lab goes through the steps for the SP1 transcription factor binding data. In the Final Project, you will do exactly the same steps, but with a different transcription factor.
mkdir ~/biol4230/hwk10 cd ~/biol4230/hwk10 ln -s $HPC_SLIB/data/bed-tools/wgEncode_Sp1.bed .
This file was created extracting the Sp1 sites from the wgEncodeRegTfbsClusteredV3.bed.gz file, which is described more fully in the link.
How many Sp1 binding regions are found in the file?
awk '{sum += ($3 - $2); n++} END{print sum/n}' wgEncode_Sp1.bed
awk makes it very easy to isolate the subset of lines with reasonable region lengths.
awk '{len = $3 - $2; if (len < 250 && len > 10) { print $_}}' wgEncode_Sp1.bed > wgEncode_Sp1_ranged.bedRun this command and check the average length of the Sp1 region.
bedtools slop -l 250 -r 50 -i tss.bed -g hg19.chromsizes > tss_minus250plus50.bedCheck the top of the tss_minus250plus50.bed file, and tss.bed to be sure that the new file is correct using head tss.bed and head tss_minus250plus50.bed. (You could also check the average segment length with awk. What should it be?)
bedtools intersect -a wgEncode_Sp1_ranged.bed -b tss_minus250plus50.bed -wa | uniq > wgEncode_Sp1_tss.bedHow many Sp1 regions did you start with? How many do you have now? What is the average length of the Sp1 regions near transcription start sites?
sort -n -k 5 -r wgEncode_Sp1_tss.bed | head -n 150 > wgEncode_Sp1_tss_top150.bed sort -n -k 5 -r wgEncode_Sp1_tss.bed | tail -n 150 > wgEncode_Sp1_tss_bot150.bed
awk '{sum += $3 - $2; n++} END{print sum, n}' wgEncode_Sp1_tss_top150.bed awk '{sum += $3 - $2; n++} END{print sum, n}' wgEncode_Sp1_tss_bot150.bed
A simpler solution is to use the fastaFromBed program to extract the sequences from a local copy of the Human genome:
fastaFromBed -fi $HPC_SLIB/data/hg19/hg19.fa -bed wgEncode_Sp1_tss_top150.bed -fo wgEncode_Sp1_tss_top150.fasta fastaFromBed -fi $HPC_SLIB/data/hg19/hg19.fa -bed wgEncode_Sp1_tss_bot150.bed -fo wgEncode_Sp1_tss_bot150.fasta
How many significant motifs did you find?
Were any of them close to an Sp1 site: GGCGGG?
Again, how many significant motifs, any Sp1 sites?
Due Monday, May 7, 5:00 PM.