In this exercise, you will begin with a set of coordinates (for
example, .bed file output after a ChIP-seq analysis),
download the DNA sequence associated with those coordinates, and
search for motifs shared by those sequences. We provide a set of
bed-file coordinates (ZNF286A_fdr0_bed.txt) and FASTA sequences (ZNF286A_fdr0_summits_seq.fa).
Downloading sequences from .bed coordinates
Use the bed file coordinates (ZNF286A_fdr0_bed.txt) to download a set of
FASTA-format sequences from the UCSC Genome Browser. (Note that you
can also do this with Galaxy or Ensembl, but the process is different.)
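Before pasting coordinates into the browser, it can help to confirm that the file really is in .bed layout. The following is a minimal sketch, assuming whitespace-separated columns with chromosome, start, and end in the first three fields, and assuming that lines beginning with "track", "browser", or "#" are header lines to skip. The example coordinates and the names BedInterval and parse_bed are made up for illustration; they are not taken from ZNF286A_fdr0_bed.txt.

```python
from typing import List, NamedTuple


class BedInterval(NamedTuple):
    chrom: str
    start: int  # BED starts are 0-based
    end: int    # BED ends are exclusive


def parse_bed(text: str) -> List[BedInterval]:
    """Read the first three columns of each data line of BED-formatted text."""
    intervals = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and optional header/comment lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start >= end:
            raise ValueError(f"start must be smaller than end: {line!r}")
        intervals.append(BedInterval(chrom, start, end))
    return intervals


# Hypothetical peaks, for illustration only.
example = """track name=example_peaks
chr1\t1000\t1200\tpeak1
chr2\t5000\t5150\tpeak2
"""

intervals = parse_bed(example)
for iv in intervals:
    # Because the end is exclusive, the interval length is simply end - start.
    print(iv.chrom, iv.start, iv.end, iv.end - iv.start)
```

To check the real file, read it with open("ZNF286A_fdr0_bed.txt").read() and pass the text to parse_bed.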
-
From the UCSC browser main page,
select genome browser from the menu at left; the human
genome will open by default (or if you've used the browser recently it
will open to the last genome you accessed). To change the genome,
select the clade, genome, and assembly using the menus at the top of
the page. For this exercise, use mammal, human, and the Mar. 2006
assembly (NCBI36/hg18).
-
Now, click the add custom tracks button below the genome
selection buttons, and paste the .bed-formatted data from
ZNF286A_fdr0_bed.txt into the window;
then select SUBMIT.
Note that you could also go directly from this custom track to the UCSC
browser; this is a way to view coordinates from your own dataset
(e.g. ChIP-seq or RNA-seq peaks) in the context of all the other data in
the browser.
-
To retrieve the sequence for these coordinates, select the
go to table browser button at the right; a new window
will open. The table browser can do a lot of useful things; we will
just use it here to get the sequence for our coordinates. If you go in
through this route, all the buttons should be preset, except the ones
at the bottom of the form.
-
For output format select sequence; and for file type returned, select
plain text. A new window will open where you can
choose various options for your sequence (e.g. repeat masking).
Note that for MEME and similar programs it is important to mask
repeats to N; otherwise, sequences in
repetitive elements will dominate your motif list.
-
When you are done, select get sequence. A FASTA file will appear; save
this as plain text. You will need to modify the UCSC header that
comes with the sequences to use them for MEME. You can use the program
here (check Extract CHR:coordinates from UCSC) to reformat the FASTA files for MEME. (Alternatively, you can just use a global find-and-replace in Word.)
You can also go through Galaxy to fetch sequences from UCSC or from BioMart.
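The header clean-up and repeat masking described in the steps above can also be scripted. This is a rough sketch, not the course's reformatting program. It assumes that each UCSC header carries a field of the form range=chrN:start-end and that repeats were returned in lower case; if you already chose masking to N in the browser, the masking step changes nothing. The function name clean_ucsc_fasta and the example record are invented for illustration.

```python
import re

# Assumed UCSC header field, e.g. "range=chr1:1001-1020".
RANGE_RE = re.compile(r"range=(\S+)")


def clean_ucsc_fasta(text: str, mask_lowercase: bool = True) -> str:
    """Shorten headers to chr:start-end and turn lower-case bases into N."""
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            match = RANGE_RE.search(line)
            # Leave the header untouched if no range= field is present.
            out.append((">" + match.group(1)) if match else line)
        elif mask_lowercase:
            # Lower-case letters mark repeats when soft-masking was requested.
            out.append(re.sub(r"[a-z]", "N", line))
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical record in the assumed UCSC layout, for illustration only.
example = (
    ">hg18_ct_UserTrack_3545_peak1 range=chr1:1001-1020 "
    "5'pad=0 3'pad=0 strand=+ repeatMasking=lower\n"
    "ACGTacgtACGTTTGACCAA\n"
)

cleaned = clean_ucsc_fasta(example)
print(cleaned, end="")
```

Check the first few headers of your own download before relying on the pattern, since the header layout is an assumption here.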
Sending sequences to MEME
-
Open the MEME site and select the MEME icon to go to the data entry page.
Enter your email address, and browse to open the FASTA sequence file (MEME needs the sequences, not the .bed coordinates), or just paste the sequences into the window.
-
At left, select either Zero or one per sequence or Any number of repetitions (the two choices will give you slightly different answers).
Leave everything else as default and click start search.
An email with a link to your results will arrive within a few minutes to hours.
The results consist of position weight matrices for motifs that are predicted to be over-represented in your data set.
-
To find out what the sequence motif resembles, scroll below the motif information lists and select send to TomTom; a new window will open.
In TomTom, select TRANSFAC as the database and give your data a title; then select submit. A new window will open with matrices of known transcription factors that resemble the motif in your data.
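To make the idea of a position weight matrix concrete, here is a small sketch that builds one from a handful of aligned sites. It is not MEME's algorithm (MEME has to discover the sites itself); it only shows what the reported matrices encode. The sites are invented, and the pseudocount of 1 and the uniform background of 0.25 per base are simplifying assumptions.

```python
import math
from typing import Dict, List

BASES = "ACGT"


def position_weight_matrix(
    sites: List[str], pseudocount: float = 1.0, background: float = 0.25
) -> List[Dict[str, float]]:
    """Return one dict of log2-odds scores per aligned motif position."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    # Denominator after adding one pseudocount per base.
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        column = [site[pos].upper() for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


# Hypothetical aligned binding sites, for illustration only.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGATTCA"]

pwm = position_weight_matrix(sites)
# The consensus takes the best-scoring base at each position.
consensus = "".join(max(column, key=column.get) for column in pwm)
print(consensus)
for pos, column in enumerate(pwm, start=1):
    print(pos, {base: round(score, 2) for base, score in column.items()})
```

Positive scores mark bases that are more frequent at a position than the background predicts; negative scores mark bases that are depleted there.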
CSHL Computational Genomics