Biol4230 - BED Tools for Motifs

From BEDTools to MEME

Summary

NOTICE
This exercise is under revision. Rather than use scan_bedg.py, described below, to find ChIP-seq peaks, you should use macs2. But I have not (yet) found or installed macs2 on interactive.hpc, so I cannot provide those instructions yet. I hope to have this done today, after which this page will be updated and you should use macs2.

In this lab, we will use the scan_bedg.py script to break up the Encode Sp1 and reversed ChIP seq files (which are in .bedg format), and create a standard .bed file. We will do a few things to look at that file, but ultimately you will upload a version of the BED file(s) to the UCSC browser, and download associated DNA sequences, which will then be used to search for consensus binding site sequences.

Getting started; Converting .bedg files to .bed

Most of the files you need for this exercise have already been downloaded your ~/biol4230/hwk9 directory. Make a new ~/biol4230/hwk10 directory, and link the wgEncode_*.bedg files into it:
```
mkdir ~/biol4230/hwk10
cd ~/biol4230/hwk10
ln -s ../hwk9/wgEncode_*.bedg .
```
Alternatively, the files are in /nv/md_rdlib2/biol4230/bed-tools/, and you can link or copy the files from there.
Use the scan_bedg.py program to convert the wgEncode_ChIP_Sp1.bedg file to a .bed file in your hwk10 directory:
```
scan_bedg.py -t 2.0 -w 10 -n Sp1 wgEncode_ChIP_Sp1.bedg > wgEncode_Sp1_020.bed
```
The scan_bed.py script looks for successive points in the .bedg file in a window of -w 10 (10) residues with an average score of 2.0 in the window (-t 2.0) per location. Where the average has gone above 2.0, it writes a start coordinate. When the average score in the window drops below 2.0, it ends the segment, and writes out the name, the score across the window, and '+' for the strand.
How many segments are created?
You can determine the average length of the segments with the command:
```
awk '{sum += ($3 - $2); n++} END{print sum/n}' wgEncode_Sp1_020.bed 
```
This number is very high, because there are a small number of very long regions. You can get a more sensible value by excluding long regions:
```
awk '{len = $3-$2; if (len < 1000) {sum += ($3 - $2); n++}} END{print sum/n}' wgEncode_Sp1_020.bed 
```
awk makes it very easy to isolate the subset of lines with reasonable region lengths.
```
awk '{len = $3 - $2; if (len < 250 && len > 25) { print $_}}' wgEncode_Sp1_020.bed > wgEncode_Sp1_ranged.bed
```
Run this command and check the average length of the Sp1 region.
We seek to identify an Sp1 binding site motif from ChIP-Seq regions that are near transcription start sites (TSS's). To do this, you will need the tss.bed and hg19.chromsizes files you used in last week's exercises. Then, create a .bed file that lists the coordinates -250 - +50 around the tss.bed sites:
```
bedtools slop -l 250 -r 50 -i tss.bed -g hg19.chromsizes > tss_minus250plus50.bed
```
Check the top of the tss_minus250plus50.bed file, and tss.bed to be sure that the new file is correct using head tss.bed and head tss_minus250plus50.bed. (You could also check the average segment length with awk. What should it be?)
Now that we have the Sp1 binding regions, and the TSS regions, we can intersect them to get the subset of Sp1 sites near transcription start sites. We would like to recover the original Sp1 regions, so we will use -wa:
```
bedtools intersect -a wgEncode_Sp1_ranged.bed -b tss_minus250plus50.bed -wa | uniq > wgEncode_Sp1_tss.bed
```
How many Sp1 regions did you start with? How many do you have now? What is the average length of the Sp1 regions near transcription start sites?

Now that we have a set of Sp1 sites near transcription start sites, we need to download the genome sequences using UCSC, and then send those sequences to MEME to get a consensus site. We will use the Add Custom Tracks page at UCSC to get the FASTA sequences we need.

Transfer your file wgEncode_Sp1_tss.bed file to your local computer.
Go to the Add Custom Tracks page at UCSC and on the Paste URLs or data: line, select Choose File.

Once the file is successfully uploaded, you can use the Table Browser to download the FASTA sequences associated with your regions.

Select:
group: Custom Tracks
output format: sequence
output file: sp1.fasta

Download the file to your laptop, and take a look at it.

Now we have a file of FASTA sequences from Sp1 sites near transcription start sites.

For MEME to work, each of the FASTA entries must have a different "name". By default, all the sequences downloaded from the UCSC browser custom track have headers of the form:

>hg19_ct_UserTrack_3545_Xfactor range=chr1:567551-567600 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>hg19_ct_UserTrack_3545_Xfactor range=chr1:569896-569940 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>hg19_ct_UserTrack_3545_Xfactor range=chr1:1334695-1334759 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>hg19_ct_UserTrack_3545_Xfactor range=chr1:1334782-1334913 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>hg19_ct_UserTrack_3545_Xfactor range=chr1:1850756-1850804 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>hg19_ct_UserTrack_3545_Xfactor range=chr1:2245833-2245867 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>hg19_ct_UserTrack_3545_Xfactor range=chr1:2245887-2245928 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>hg19_ct_UserTrack_3545_Xfactor range=chr1:2245964-2245990 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>hg19_ct_UserTrack_3545_Xfactor range=chr1:2322970-2323022 5'pad=0 3'pad=0 strand=+ repeatMasking=none

You need to remove the constant text between ">" and "range=". On franklin, this can be done with:

cat sp1.fasta | sed -e 's/hg19_ct_UserTrack_3545_Sp1 range=//g' > sp1_fixed.fasta

Go to the MEME site, select the MEME function, and upload your Sp1/TSS sequences, looking for zero or one motif per sequence, with motif lengths between 4 and 12.

Homework 10 (hwk10, due Monday, April 25, 5:00 PM.

Create the files listed above.
Report the numbers of segments and average length of segments for the initial Sp1 binding regions, the Sp1 binding regions < 1000 nt, the "_ranged" Sp1 regions, and the Sp1 regions near transcription start sites.
In addition to the MEME analysis on TSS adjacent Sp1 binding regions, do the same analysis on the rev dataset (start with wgEncode_ChIP_rev.bedg and follow the same steps).
Provide links to the MEME results, or upload the images of the motif plots.

Final Project - Project assignment

Due Monday, May 9, 5:00 PM.

Course home page