Biol4230 - Final Project

Final Project - From BEDTools to MEME

Summary

The final term project (Due Monday, May 7, at 5:00 PM):

The overall goal is to identify the consensus sequence for a transcription factor binding site using the steps we used in the BED Tools2 lab.

You are not expected to write a python program to perform all these steps. You ARE expected to provide a description.txt file that contains the necessary shell scripts/commands to perform each of the steps, and a files.txt that describes the contents of the files you created during the analysis steps. At the end the project, you will have some *.html files from MEME. The differences between the different *.html files should also be documented in the files.txt file.

Working with a new set of transcription factor ChIP-seq data, do the folowing:

(Option 1 of 2):
Pick a transcription factor ChIP-seq data set from the /nv/md_rdlib2/biol4230/ChIP-seq/ directory. Datasets are available for: Atf2, Ctcf, Egr1Atf2, Fos1, Jund, Taf1, and Yy1. Note that all of these datasets have replicates, and the analysis should be done on both of the replicate datasets.
Copy the two replicate files to your ~/biol4230/final directory.
I do not know what the best thresholds are for peak finding for these datasets. You will need to use the /seqprg/bin/scan_bedg.py to partition the .bedg file into potential binding regions, but before you do this, you should run the /seqprg/bin/stat_bedg.py program to get a sense of the distribution of scores in the 10 nt windows. For wgEncode_Atf2_Rep1.bedg, the numbers are:
```
 3 $ stat_bedg.py wgEncode_Atf2_Rep1.bedg 
nlines:  1244240
window:  10
sum_score:  586526.53499
min_win_score:  1.641222
max_win_score:  1963.678
ave_win_score:  4.7139340882
```
Either modify the stat_bedg.py program, or use 'R', to look window score deciles. Pick a window score theshold that suppresses 50%, or 90% of the ChIP-seq peaks. Report the distribution of window scores in description.txt.
Run /seqprg/bin/scan_bedg.py -t xx wgEncode_yyy_Rep1.bedg, where xx is your threshold, and yyy is the abbreviation for your transcription factor. Record the threshold you used in your description.txt file.
Check to see how many binding regions are created, and calculate the average length of your regions using the awk script from the BED Tools2 lab. Use the related script to calculate the average length selecting only segments shorter than 500 nt.
Finally, create two sets of ChIP-seq .bed files, one with regions that are between 25 and 100 nt long, and a second betwee 25 and 200 nt. Since you are doing this for both replicates of the ChIP-seq data, you will have four files to analyze.
Use grep to identify the set of ChIP peaks for a specific transcription factor in the /nv/md_rdlib2/biol4230/bed-tools/wgEncodeRegTfbsClusteredV3.bed and save that set of peaks to your ~/biol4230/final/ directory. Note that the wgEncodeRegTfbsClusteredV3.bed file is not in the bed format required by bedtools. It does not have the strand (+/- cut -f 1-5 your_grepped_factor.bed | awk '{print $0,"\t+"}' > your_factor_fixed.bed
Also create two sets of tss files, one specifying regions from -250 to +50 around the TSS site, and a second -125 + 25.
With each of the ChIP-seq .bed files, do the following: Option 1: Intersect the four .bed files with the two +/-tss files, to produce eight .bed bed files of ChIP-seq sites near TSS sites, Option 2: intersect your one ChIP-seq modified bed file with the two TSS +/- files to produce two ChIP-seq-TSS files. How many of the ChIP-seq sites are within the TSS ranges you used? How many are not? Build a third set of ChIP-seq sites that are more than 1000 nt upstream or 500 nt downstream, of a TSS site (not near any TSS site) (now 10 files total for option 1, and 3 for option 2). (Be certain to use the uniq command to ensure that each intersected .bed file has unique coordinates. If there are more than 600 .bed file entries in any of these files (the maximum for MEME), use the sample_bed.py program to sample -n 600 or fewer bed regions at random: sample_bed.py -n 600 wgEncode_yy1_tss.bed > wgEncode_yy1_tss_1K.bed Alternatively, select the 600 with the highest ChIP occupancy scores by sorting the bed file on the 5-th field using: sort -k 5 -n your_bed_file | head -n 600 > your_top500.bed (Skip to next step, do not do this.) Now use these .bed files to download a set of FASTA sequences from the UCSC browser. If you download the sequences from UCSC, you will need to run run the /seqprg/bin/trim_ucsc_header.py program to remove the constant region of the FASTA description line. Instead of downloading the FASTA sequences through the UCSC table interface, I have downloaded the hg19.fa sequence to /nv/md_rdlib2/data/hg19/hg19.fa, so you should be able to use the BEDTools program: fastaFromBed -fi /nv/md_rdlib2/biol4230/data/hg19/hg19.fa -bed your_chip_tf.bed -fo your_chip_tf.fasta Note that this output file does not need to have the fasta headers edited. This should make it much easier to script a solution to the final project.
Go to the MEME site, and use both the MEME tool and the MEME-ChIP tools to identify the consensus sites found in your datasets.
In your description.txt file, address the following questions: Did you find the consensus binding site with both MEME and MEME-ChIP? (Find the consensus site using Google/Wikipedia). You looked for the consensus binding site in 10 (option 1) or three (option 2) different datasets. Make a table with 10 or 3 rows for each of your datasets, and provide columns reporting the E()-value for (1) the correct consensus, (2) (possibly) significant incorrect consensus, using both MEME and MEME-Chip. If you did not have a significant E()-value for either the correct or an incorrect motif for one of the MEME methods, enter NA Based on your results, is the consensus binding site enriched near transcription start sites? Is it nearer (-125/+25) or farther from the TSS? Justify your answer.
Put the description.txt, files.txt and the internediate .bed and .fasta files in ~/biol4230/final directory.

Course home page