Final Project - From BEDTools to MEME


The final term project (Due Monday, May 8, at 5:00 PM):

The overall goal is to identify the consensus sequence for a transcription factor binding site using a variant of the steps we used in the BED Tools2 lab.

You are not expected to write a python program or shell script to perform all these steps. You ARE expected to provide a description.txt file that contains the necessary shell scripts/commands to perform each of the steps, and a files.txt that describes the contents of the files you created during the analysis steps. At the end the project, you will have some *.html files from MEME. The differences between the different *.html files should also be documented in the files.txt file, and you should download the meme.txt output, as well as the PNG graphics for the domains that you found (remember that your MEME results only persist a week).

Working with the "polished" set of bed intervals in:,

do the folowing:
  1. Use grep to identify the set of ChIP peaks for a specific transcription factor in the /nv/md_rdlib2/biol4230/bed-tools/wgEncodeRegTfbsClusteredV3.bed and save that set of peaks to your ~/biol4230/final/ directory.

    Note that the wgEncodeRegTfbsClusteredV3.bed file is not in the bed format required by bedtools -- it does not have the strand (+/-) field. You can add that field, and remove the last two fields, which are also not part of the standard bed format, with the command line:

    cut -f 1-5 your_grepped_factor.bed | awk '{print $0,"\t+"}' > your_factor_fixed.bed

  2. Also create three sets of tss files, one specifying regions from -250 to +50 around the TSS site, and a second -125 + 25, and a third for intervals that are more than 1000 nt upstream, and 500 downstream of a TSS.

  3. With your grep-ed/awk-ed ChIP-seq .bed files, do the following:
    1. Write an 'R' script that reads in the BED file for your transcription factor, and do a summary() of the lengths of the bed intervals (column 3 minus column 2), and of the intensities (column 5).

      You can run 'R' on interactive.hpc, or download the bed file to your computer and use 'R' studio. But prove a 'R' script "bed_stats.r" that provides these summaries.

    2. Intersect your one ChIP-seq modified bed file with the three TSS +/- files to produce three ChIP-seq-TSS (or two TSS and one non-TSS) files.

      How many of the ChIP-seq sites are within the TSS ranges you used? How many are not?

      (Be certain to use the uniq command to ensure that each intersected .bed file has unique coordinates.

    3. Use your 'R' script from the previous step to summarize() the distribution of lengths and binding intensities resulting from the intersections to the three TSS interval (closer, close, far) files.

    4. For each of the three ChIP/TSS intersections, Select the 600 intervals with the highest ChIP occupancy scores by sorting the bed file on the 5-th field using:
        sort -k 5 -n -r your_bed_file | head -n 600 > your_top600.bed

      Also select the 600 intervals with the lowest scores:

        sort -k 5 -n your_bed_file | head -n 600 > your_low600.bed

      If you have fewer than 1200 bed intervals for your transcription factor, then divide your number of bed intervals in two, and separate the top half and bottom half (don't worry if the number in the top and bottom halves is not identical).

    5. Instead of downloading the FASTA sequences through the UCSC table interface, I have downloaded the hg19.fa sequence to /nv/md_rdlib2/data/hg19/hg19.fa, so you should be able to use the BEDTools program:
      fastaFromBed -fi /nv/md_rdlib2/biol4230/data/hg19/hg19.fa -bed your_chip_tf.bed -fo your_chip_tf.fasta
      You will need to do this six times, for each of your ChIP/TSS/top-half/bottom-half bed files. Note that this output file does not need to have the fasta headers edited. This should make it much easier to script a solution to the final project.
  4. Go to the MEME site, and use both the MEME tool and the MEME-ChIP tools to identify the consensus sites found in your datasets.

    Download the meme text output into your final/ directory, and submit the PNG files summarizing the binding motifs to the final project page (so that I can see the results after MEME has deleted them).

  5. In your description.txt file, address the following questions:
    1. Did you find the consensus binding site with both MEME and MEME-ChIP? (Find the consensus site using Google/Wikipedia).
    2. You looked for the consensus binding site in six different datasets -- the top and bottom binding intervals from three distances from TSS's - very close, farther, and not-near a TSS. Make a table 6 rows, and provide columns reporting the E()-value for (1) the correct consensus, (2) (possibly) significant incorrect consensus, using both MEME and MEME-Chip. If you did not have a significant E()-value for either the correct or an incorrect motif for one of the MEME methods, enter NA
    3. Based on your results, is the consensus binding site enriched near transcription start sites? Is it nearer (-125/+25) or farther from the TSS? Justify your answer.

  6. Put the description.txt, files.txt and the internediate .bed and .fasta files in ~/biol4230/final directory.

Course home page