Biol4230 - BED Tools for Motifs

From BEDTools to MEME

Summary

In this lab, we take a set of SP1 binding site coordinates, downloaded from UCSC and the ENCODE project, This lab goes through the steps for the SP1 transcription factor binding data. In the Final Project, you will do exactly the same steps, but with a different transcription factor.

Getting started

Make a new ~/biol4230/hwk10 directory, and link the ${HPC_SLIB}/data/bed-tools/wgEncode_Sp1.bed file into it:
```
mkdir ~/biol4230/hwk10
cd ~/biol4230/hwk10
ln -s $HPC_SLIB/data/bed-tools/wgEncode_Sp1.bed .
```
This file was created extracting the Sp1 sites from the wgEncodeRegTfbsClusteredV3.bed.gz file, which is described more fully in the link.
How many Sp1 binding regions are found in the file?
You can determine the average length of the segments with the command:
```
awk '{sum += ($3 - $2); n++} END{print sum/n}' wgEncode_Sp1.bed 
```
awk makes it very easy to isolate the subset of lines with reasonable region lengths.
```
awk '{len = $3 - $2; if (len < 250 && len > 10) { print $_}}' wgEncode_Sp1.bed > wgEncode_Sp1_ranged.bed
```
Run this command and check the average length of the Sp1 region.

Finding sites near TSS's
We seek to identify an Sp1 binding site motif from ChIP-Seq regions that are near transcription start sites (TSS's). To do this, you will need the tss.bed and hg19.chromsizes files you used in last week's exercises. Then, create a .bed file that lists the coordinates -250 - +50 around the tss.bed sites:
```
bedtools slop -l 250 -r 50 -i tss.bed -g hg19.chromsizes > tss_minus250plus50.bed
```
Check the top of the tss_minus250plus50.bed file, and tss.bed to be sure that the new file is correct using head tss.bed and head tss_minus250plus50.bed. (You could also check the average segment length with awk. What should it be?)
Now that we have the Sp1 binding regions, and the TSS regions, we can intersect them to get the subset of Sp1 sites near transcription start sites. We would like to recover the original Sp1 regions, so we will use -wa:
```
bedtools intersect -a wgEncode_Sp1_ranged.bed -b tss_minus250plus50.bed -wa | uniq > wgEncode_Sp1_tss.bed
```
How many Sp1 regions did you start with? How many do you have now? What is the average length of the Sp1 regions near transcription start sites?
Isolating the best and worst sites
The MEME program has constrains on how many sequences it allows, and how long they are (you cannot have more than 1000 sequences, and the total length of sequences cannot be longer than 60,000 nt). In the final project, you will look at 600 sequences with the best and worst transcription factor occupancy scores. Since we only have 374 sequences in the correct length reange and near TSS's, take the top 150 and the bottom 150:
```
sort -n -k 5 -r wgEncode_Sp1_tss.bed | head -n 150 > wgEncode_Sp1_tss_top150.bed
sort -n -k 5 -r wgEncode_Sp1_tss.bed | tail -n 150 > wgEncode_Sp1_tss_bot150.bed
```
Check to make certain that you have not exceeded the 60,000 nt limit on the lengths of the sequences for MEME:
```
awk '{sum += $3 - $2; n++} END{print sum, n}' wgEncode_Sp1_tss_top150.bed
awk '{sum += $3 - $2; n++} END{print sum, n}' wgEncode_Sp1_tss_bot150.bed
```
Getting the FASTA sequences from the bed coordinates
Now that we have a set of Sp1 sites near transcription start sites, we need to download the genome sequences. One strategy is to use the UCSC browser, edit the sequences that come down from the browser to make sure they have different names, and then upload the sequences to MEME.
A simpler solution is to use the fastaFromBed program to extract the sequences from a local copy of the Human genome:
```
fastaFromBed -fi $HPC_SLIB/data/hg19/hg19.fa -bed wgEncode_Sp1_tss_top150.bed  -fo wgEncode_Sp1_tss_top150.fasta
fastaFromBed -fi $HPC_SLIB/data/hg19/hg19.fa -bed wgEncode_Sp1_tss_bot150.bed  -fo wgEncode_Sp1_tss_bot150.fasta
```

Download to and send to MEME

Now that we have a set of FASTA sequences of Sp1 sites near transcription start sites, we need to download the sequences and send those sequences to MEME to get a consensus site.

Download the wgEncode_Sp1_tss_top150.fasta and wgEncode_Sp1_tss_bot150.fasta files to a hwk10 directory on your laptop.
Go to the MEME site, select the MEME function, and upload your top150 Sp1/TSS sequences, looking for zero or one motif per sequence, with motif len gths between 4 and 12.
How many significant motifs did you find?
Were any of them close to an Sp1 site: GGCGGG?
Do the same thing for your bot150 Sp1/TSS sequences.
Again, how many significant motifs, any Sp1 sites?

Homework 10 (hwk10, due Monday, April 23, 5:00 PM.

Create the files listed above.
Report the numbers of segments and average length of segments for the initial Sp1 binding regions, the "_ranged" Sp1 regions, and the Sp1 regions near transcription start sites.
Answer the last two questions above (significant motifs, are they Sp1) for both top150 and bot150 sequences.
Upload the motif images to the Collab web site.

Final Project - Project assignment

Due Monday, May 7, 5:00 PM.

Course home page