Using MEME to identify common motifs in aligned DNA sequences

In this exercise, you will begin with a set of coordinates (for example, a .bed file produced by a ChIP-seq analysis), download the DNA sequence associated with those coordinates, and search for motifs shared by those sequences. We provide a set of .bed coordinates (ZNF286A_fdr0_bed.txt) and a set of FASTA sequences (ZNF286A_fdr0_summits_seq.fa).
Downloading sequences from .bed coordinates

Use the .bed file coordinates (ZNF286A_fdr0_bed.txt) to download a set of FASTA-format sequences from the UCSC Genome Browser. (Note that you can also do this with Galaxy or Ensembl, but the process is different.)
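Before pasting the coordinates into the browser, it can help to check that the file parses cleanly. The sketch below is only an illustration: the function names and the example records are made up, and it assumes a standard BED layout (chromosome, 0-based start, exclusive end in the first three whitespace-separated columns).

```python
from typing import List, NamedTuple


class BedInterval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # 0-based, exclusive (BED convention)


def parse_bed(text: str) -> List[BedInterval]:
    """Parse the first three BED columns, skipping blank, track, browser and comment lines."""
    intervals = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"expected at least 3 columns, got: {line!r}")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start < 0 or end <= start:
            raise ValueError(f"invalid interval: {line!r}")
        intervals.append(BedInterval(chrom, start, end))
    return intervals


def to_browser_position(iv: BedInterval) -> str:
    """Convert a BED interval to a 1-based, inclusive chr:start-end position string."""
    return f"{iv.chrom}:{iv.start + 1}-{iv.end}"


# Made-up example records, NOT taken from ZNF286A_fdr0_bed.txt
example = "track name=example\nchr1\t100\t200\tpeak1\nchr2\t5000\t5101\tpeak2\n"
intervals = parse_bed(example)
for iv in intervals:
    print(to_browser_position(iv), "length:", iv.end - iv.start)
```

To check the real file, read it with `open("ZNF286A_fdr0_bed.txt").read()` and pass the text to `parse_bed`.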

  1. From the UCSC browser main page, select genome browser from the menu at left; the human genome will open by default (or, if you have used the browser recently, it will open to the last genome you accessed). You can change the genome and the assembly by selecting the correct clade, genome, and assembly, using the toggles at the top of the page. For this exercise, use mammal, human, and the Mar. 2006 assembly (NCBI36/hg18).
  2. Now, click the add custom tracks button below the genome selection buttons, and paste the .bed formatted data from ZNF286A_fdr0_bed.txt into the window; then select SUBMIT. Note that you could also go directly from this custom track to the UCSC browser; this is a way to view coordinates from your own dataset (e.g. ChIP-seq or RNA-seq peaks) in the context of all the other data in the browser.
  3. To retrieve the sequence for these coordinates, select the go to table browser button at the right; a new window will open. The table browser can do a lot of useful things; here we will use it only to get the sequence for our coordinates. If you go in through this route, all the buttons should be preset, except the ones at the bottom of the form.
  4. For output format, select sequence; for file type returned, select plain text. A new window will open where you can choose various options for your sequence (e.g. repeat masking). Note that for MEME and similar programs it is important to mask repeats to N; otherwise, sequences in repetitive elements will dominate your motif list.
  5. When you are done, select get sequence. A FASTA file will appear; save this as plain text. You will need to modify the UCSC header that comes with the sequences before using them with MEME. You can use the program here (check Extract CHR:coordinates from UCSC) to reformat the FASTA files for MEME. (Alternatively, you can just use a global find-and-replace in Word.)
You can also go through Galaxy to fetch sequences from UCSC or from BioMart.
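The header reformatting in step 5 can also be scripted. The following is a minimal sketch, not the program linked above: it assumes each UCSC header contains a `range=chr:start-end` field (as in the made-up example header in the code), and the function name `simplify_headers` is hypothetical. Headers without such a field are left unchanged.

```python
import re

# Assumed UCSC table browser header layout (an assumption, check your own file), e.g.:
# >hg18_ct_UserTrack_3545_peak1 range=chr1:101-200 5'pad=0 3'pad=0 strand=+ repeatMasking=N
RANGE_RE = re.compile(r"range=(\S+:\d+-\d+)")


def simplify_headers(fasta_text: str) -> str:
    """Replace each FASTA header with '>chr:start-end' when a range= field is present."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            match = RANGE_RE.search(line)
            if match:
                line = ">" + match.group(1)
        out.append(line)  # sequence lines pass through untouched
    return "\n".join(out) + "\n"


# Made-up example input, not real ZNF286A data
example_fasta = (
    ">hg18_ct_UserTrack_3545_peak1 range=chr1:101-200 5'pad=0 3'pad=0 strand=+ repeatMasking=N\n"
    "ACGTNNNNACGT\n"
    ">custom\n"
    "ACGT\n"
)
reformatted = simplify_headers(example_fasta)
print(reformatted, end="")
```

In practice you would read the file saved from the table browser, pass its text through `simplify_headers`, and write the result to a new file for upload.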


Sending sequences to MEME:
  1. Open the MEME site and select the MEME icon to go to the data entry page. Enter your email address, and browse to open the reformatted FASTA sequence file, or just paste its contents into the window.
  2. At left, select either Zero or one per sequence or Any number of repetitions (the two choices will give you slightly different answers). Leave everything else as default and click start search. An email with a link to your results will arrive within a few minutes to a few hours. The results consist of position weight matrices for motifs that are predicted to be over-represented in your data set.
  3. To find out what the sequence motif resembles, scroll below the motif information lists and select send to TomTom; a new window will open. In TomTom, select Transfac as the database and give your data a title; then select submit. A new window will open with matrices of known transcription factors that resemble the motif in your data.
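To make the term position weight matrix concrete, here is a small sketch that builds one from a handful of already-aligned sites. This is not how MEME discovers motifs (MEME must also find the sites); it only illustrates the kind of matrix reported. The example sites, the pseudocount of 1, and the uniform background of 0.25 are all assumptions chosen for illustration.

```python
import math
from typing import Dict, List

BASES = "ACGT"


def position_frequency_matrix(sites: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Per-position base frequencies from equal-length aligned sites."""
    if not sites:
        raise ValueError("need at least one site")
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for pos in range(width):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            base = site[pos].upper()
            if base in counts:  # N and other ambiguity codes are ignored
                counts[base] += 1
        total = sum(counts.values())
        matrix.append({b: counts[b] / total for b in BASES})
    return matrix


def log_odds_matrix(freqs: List[Dict[str, float]], background: float = 0.25) -> List[Dict[str, float]]:
    """Log2 ratio of observed frequency to background; needs non-zero frequencies (pseudocount > 0)."""
    return [{b: math.log2(col[b] / background) for b in BASES} for col in freqs]


# Made-up aligned sites, for illustration only
sites = ["ACGT", "ACGA", "ACGT", "TCGT"]
pfm = position_frequency_matrix(sites)
pwm = log_odds_matrix(pfm)
for i, col in enumerate(pwm, start=1):
    print(i, {b: round(score, 2) for b, score in col.items()})
```

Positive scores mark bases that are enriched at a position relative to background; TomTom compares matrices of this general kind against those of known transcription factors.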

CSHL Computational Genomics