Final Project - From BEDTools to MEME

Summary

The final term project (Due Monday, May 7, at 5:00 PM):

The overall goal is to identify the consensus sequence for a transcription factor binding site using a variant of the steps we used in the BED Tools2 lab.

You are not expected to write a python program or shell script to perform all these steps. You ARE expected to provide a description.txt file that contains the necessary shell scripts/commands to perform each of the steps, and a files.txt that describes the contents of the files you created during the analysis steps. At the end the project, you will have some *.html files from MEME. The differences between the different *.html files should also be documented in the files.txt file, and you should download the meme.txt output, as well as the PNG graphics for the domains that you found (remember that your MEME results only persist a week).

Working with the "polished" set of bed intervals in:,

$HPC_SLIB/biol4230/data/bed-tools/wgEncodeRegTfbsClusteredV3.bed
do the folowing:
  1. Use grep to identify the set of ChIP peaks for a specific transcription factor in the $HPC_SLIB/data/bed-tools/wgEncodeRegTfbsClusteredV3.bed and save that set of peaks to your ~/biol4230/final/ directory.

    Note that the wgEncodeRegTfbsClusteredV3.bed file is not in the bed format required by bedtools -- it does not have the strand (+/-) field. You can add that field, and remove the last two fields, which are also not part of the standard bed format, with the command line:

    cut -f 1-5 your_grepped_factor.bed | awk '{print $0,"\t+"}' > your_factor_fixed.bed
    

  2. Also create three sets of tss files, one specifying regions from -250 to +50 around the TSS site (tss_m250p50.bed), and a second -125 + 25 (tss_m125p25.bed), and a third for intervals that are more than 1000 nt upstream, and 500 downstream of a TSS (tss_m1000p500.bed). You will use the third set of coordinates with the bedtools intersect -v option to identify binding sites that do NOT intersect with the tss_m1000p500.bed coordinates.

  3. With your grep-ed/awk-ed ChIP-seq .bed files, do the following:
    1. Write an 'R' script that reads in the BED file for your transcription factor, and do a summary() of the lengths of the bed intervals (column 3 minus column 2), and of the intensities (column 5).

      You can run 'R' on interactive.hpc, or download the bed file to your computer and use 'R' studio. But provide an 'R' script "bed_stats.r" that provides these summaries.

    2. Intersect your one ChIP-seq modified bed file with the three TSS +/- files to produce three ChIP-seq-TSS (or two TSS and one non-TSS) files. To get the regions outside the tss_m1000p500.bed regions, use the bedtools intersect -v command to invert the bedtools intersect selection.

      For each intersection, look at the beginning of your tss_m??p??.bed file and your resulting intersection file to make sure that you are getting the binding sites you expect.

      At the end of this process, you should have 3 separate intersection files, one for binding sites that are (1) very close, (2) close, and (3) far from transcription start sites (TSSs).

      How many of the ChIP-seq sites are within the TSS ranges you used? How many are not?

      (Be certain to use the uniq command to ensure that each intersected .bed file has unique coordinates.

    3. Use your 'R' script from the previous step to summarize() the distribution of lengths and binding intensities resulting from the intersections to the three TSS interval (closer, close, far) files.

    4. For each of the three ChIP/TSS intersections, Select the 600 intervals with the highest ChIP occupancy scores by sorting the bed file on the 5-th field using:
        sort -k 5 -n -r your_bed_file | head -n 600 > your_top600.bed
      

      Also select the 600 intervals with the lowest scores:

        sort -k 5 -n -r your_bed_file | tail -n 600 > your_low600.bed
      

      If you have fewer than 1200 bed intervals for your transcription factor, then divide your number of bed intervals in two, and separate the top half and bottom half (don't worry if the number in the top and bottom halves is not identical).

      Run an awk script, as you did in BEDTools2/homework 10, to make certain that you have less than 60.000 nt in each of your sample sets. If you do not, then either:

      1. repeat the sort | head/tail step above to get fewer sequences that are likely to add up to 60,000 nt.
      2. run the sample_bed.py program to down sample your 600 sequences from the top and bottom sets.
      You will need to check to make certain that you have less than 60,000 nt for each of the six intersection files (top, bottom vs very-close, close, and far) you have created.
    5. Use the fastaFromBed program to extract the sequences from a local copy of the Human genome to extract six sets of fasta sequences from your six transcription-factor/TSS intersection files.
      fastaFromBed -fi $HPC_SLIB/data/hg19/hg19.fa -bed your_chip_tf.bed -fo your_chip_tf.fasta
      
      You will need to do this six times, for each of your ChIP/TSS/top-half/bottom-half bed files. This output file does not need to have the fasta headers edited.
  4. Go to the MEME site, and use both the MEME tool and the MEME-ChIP tools to identify the consensus sites found in your datasets.

    Download the meme text output into your final/ directory, and submit the PNG files summarizing the binding motifs to the final project page (so that I can see the results after MEME has deleted them).

  5. In your description.txt file, address the following questions:
    1. Did you find the consensus binding site with both MEME and MEME-ChIP? (Find the consensus site using Google/Wikipedia).
    2. You looked for the consensus binding site in six different datasets -- the top and bottom binding intervals from three distances from TSS's - very close, farther, and not-near a TSS. Make a table 6 rows, and provide columns reporting the E()-value for (1) the correct consensus, (2) (possibly) significant incorrect consensus, using both MEME and MEME-Chip. If you did not have a significant E()-value for either the correct or an incorrect motif for one of the MEME methods, enter NA
    3. Based on your results, is the consensus binding site enriched near transcription start sites? Is it nearer (-125/+25) or farther from the TSS? Justify your answer.

  6. Put the description.txt, files.txt and the internediate .bed and .fasta files in ~/biol4230/final directory.

Course home page