RSAT and PATSER - mapping motifs


The following exercises use the RSAT/PATSER web site.
PATSER
  1. On the left side under Matrices and then under Pattern matching select patser [discontinued]
  2. In the Sequence part choose fasta and upload the file TF1-top50.fasta (you will need to download this file to your computer first).
  3. In the Patser options choose consensus as the Format and tick vertical
  4. Copy/paste the matrix given in file TF1_Matrix.txt into the Matrix field.

    For Lower threshold estimation choose adjusted information content (auto)*

    You can leave everything else as default

  5. Click Go

    The Information page will open, giving you a list of your sequences and their lengths, as well as information about your matrix

    Further down, a table lists the sequences in which the motif was found, with position, score, and ln(P) for each hit.

  6. Go on with Feature Map

    Change the Display limits to go from -800 to 0, and click Go.
    This will take a while

    You will see a figure of each of your sequences with the TF1 binding sites (PSSM hits) shown in blue. Right now it shows all hits, independent of significance
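To make concrete what a PSSM hit is, here is a minimal Python sketch of matrix scanning. It is not patser's actual implementation: the toy count matrix (not the TF1 matrix), the pseudocount of 1, the uniform background of 0.25, and the function names are all assumptions chosen for illustration.

```python
import math

# Toy count matrix for a 4-bp motif (hypothetical, NOT the TF1 matrix):
# one list of counts per base, one entry per motif position.
COUNTS = {
    "A": [8, 0, 0, 1],
    "C": [0, 9, 0, 1],
    "G": [1, 0, 10, 0],
    "T": [1, 1, 0, 8],
}

def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Convert counts to log-odds weights ln(f / background) per position."""
    width = len(counts["A"])
    weights = []
    for i in range(width):
        total = sum(counts[b][i] for b in "ACGT") + 4 * pseudocount
        weights.append({
            b: math.log(((counts[b][i] + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return weights

def scan(sequence, weights):
    """Score every window of the sequence; return (start, word, score) tuples."""
    width = len(weights)
    hits = []
    for start in range(len(sequence) - width + 1):
        word = sequence[start:start + width]
        if any(b not in "ACGT" for b in word):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(weights[i][b] for i, b in enumerate(word))
        hits.append((start, word, score))
    return hits

weights = log_odds_matrix(COUNTS)
hits = scan("TTACGTACGAAACGT", weights)
best = max(hits, key=lambda h: h[2])  # highest-scoring window
print(best)
```

Every window gets a score, which is why the unfiltered feature map is crowded: a threshold on score or ln(P) is what turns "all windows" into "predicted sites".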

To get only significant TF1 binding sites:
  1. Go back to the table that lists all hits to see what a good ln(p)-value cutoff would be
  2. For instance above the table you will find: in ln(cutoff p-value) based on sample size adjusted information content: -8.782
  3. Go back one more window; change the Lower threshold estimation to maximum ln(p)-value and enter -8.782

    Click Go

    See that the table with hits is much shorter now

  4. Go on with Feature Map

    Change the Display limits to go from -800 to 0

    And Go

    The figure will now only display sequences with significant hits
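The effect of the maximum ln(p)-value threshold can be sketched in Python as a simple filter. The hit records below are made up for illustration (sequence names, positions, and scores are hypothetical); only the cutoff -8.782 comes from the exercise.

```python
import math

LN_P_CUTOFF = -8.782  # cutoff quoted above the hit table in this exercise

# Hypothetical hit records; all values are invented for illustration.
hits = [
    {"sequence": "gene_A", "position": -512, "score": 9.8, "ln_p": -10.41},
    {"sequence": "gene_A", "position": -97, "score": 5.1, "ln_p": -6.02},
    {"sequence": "gene_B", "position": -640, "score": 8.7, "ln_p": -8.95},
    {"sequence": "gene_C", "position": -233, "score": 4.4, "ln_p": -5.50},
]

def significant_hits(hits, ln_p_cutoff):
    """Keep hits whose ln(P) is at or below the cutoff (more negative = rarer by chance)."""
    return [hit for hit in hits if hit["ln_p"] <= ln_p_cutoff]

kept = significant_hits(hits, LN_P_CUTOFF)
for hit in kept:
    print(hit["sequence"], hit["position"], hit["ln_p"])

# ln(P) is a natural logarithm, so the cutoff converts back to a plain p-value with exp().
print(math.exp(LN_P_CUTOFF))
```

Sequences without any hit below the cutoff (gene_C in this toy data) drop out entirely, which matches the shorter table and sparser figure seen in the exercise.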

*Calculates scores automatically. This method takes into account the information content of the matrix and the size of the sequence set to choose a good compromise between sensitivity and specificity. The matching positions probably contain several false positives. Higher scores indicate binding sites with higher probability (low ln(P)).
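As a rough illustration of what "information content of the matrix" means, the Python sketch below computes the plain relative entropy of a count matrix against a background. This is not patser's sample-size-adjusted calculation; the use of natural logarithms, the uniform background, and the function names are assumptions.

```python
import math

def column_information(column_counts, background=None):
    """Relative entropy sum_b f_b * ln(f_b / p_b) of one matrix column."""
    if background is None:
        background = {b: 0.25 for b in "ACGT"}  # assumed uniform background
    total = sum(column_counts.values())
    info = 0.0
    for base, count in column_counts.items():
        if count == 0:
            continue  # the term 0 * ln(0) is taken as 0
        freq = count / total
        info += freq * math.log(freq / background[base])
    return info

def matrix_information(columns, background=None):
    """Sum the per-column information over all motif positions."""
    return sum(column_information(c, background) for c in columns)

# Hypothetical two-column matrix: one fully conserved column, one uninformative column.
columns = [
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 5, "C": 5, "G": 5, "T": 5},
]
print(matrix_information(columns))
```

A fully conserved column contributes ln(4) and a column matching the background contributes 0, so a matrix with many conserved positions tolerates a stricter threshold.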
Matrix scan

The Matrix scan function can be used to look for binding sites of multiple TFs and cis-regulatory modules:

  1. On the left side under Matrices and then under Pattern matching select matrix-scan (full option)
  2. In the Sequence part choose fasta and upload the file TF1-top50-seq.fa
  3. In the Matrix field copy/paste the matrices of the 4 TFs from the file 4TFs_Matrix.pscm

    As Background model choose Markov order 1 (see note a)

    From the boxes below pick Individual sequences and tick site, pval, rank, and limits. Set the lower threshold for p-value to 0 and the upper threshold to 0.0001

    Go
    This will take a while

    The table shows you the individual hits for each TF binding site in each sequence, together with location, p-value (see note b), etc.

  4. Go on with Feature Map

    Change the Display limits to go from -800 to 0

    And Go
    This will take a while

    You will see a figure of each of your sequences and the binding sites (PSSM hits) in different colors

  5. Are any of the binding sites clustered more than expected by chance, i.e. do they form cis-regulatory modules (CRERs = cis-regulatory enriched regions)?
  6. Go back to the first window and choose CRERs instead of individual sites

    Change the lower threshold of crer_sig to 0 and leave everything else at default

    Note that these are very permissive parameters with high false positive rate, but we chose them to have a first look if we might have any CRERs at all

    Go

    The result table is huge (because of our permissive parameters) but we found some CRERs

  7. Go on with Feature Map

    Change the Display limits to go from -800 to 0 and un-check the box for legend

    And Go

    You will see a lot of CRERs depicted as red boxes in your sequences

  8. Now run it with more stringent parameters to lower the number of false positives

    Set the lower threshold for p-value to 0 and the upper threshold to 0.0001, and set crer_sig to 2 (see note c)

    Go on with Feature Map

    Change the Display limits to go from -800 to 0

    And Go

    You will see the individual binding sites and predicted CRERs

Some individual binding sites are now hidden under the CRERs; at the bottom of the page you can select which motifs to display; get rid of the CRERs to see where all individual sites are
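The idea behind CRER detection, sites clustered more than expected by chance, can be sketched in Python as follows. This is a simplified stand-in and not the algorithm matrix-scan uses: the window size, the minimum site count, the toy positions, and the binomial model with its number of trials are all assumptions.

```python
import math

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def site_clusters(positions, window, min_sites):
    """Return (start, end, n_sites) for groups of sites that span at most `window` bp."""
    positions = sorted(positions)
    clusters = []
    last_end_index = -1
    for i in range(len(positions)):
        j = i
        # extend the group while the next site still fits inside the window
        while j + 1 < len(positions) and positions[j + 1] - positions[i] <= window:
            j += 1
        n_sites = j - i + 1
        # skip groups fully contained in a cluster that was already reported
        if n_sites >= min_sites and j > last_end_index:
            clusters.append((positions[i], positions[j], n_sites))
            last_end_index = j
    return clusters

# Hypothetical site positions relative to the gene start.
sites = [-750, -740, -720, -300, -100, -90]
clusters = site_clusters(sites, window=50, min_sites=3)
print(clusters)

# Chance of >= 3 hits in a 50-bp window, assuming 4 matrices, 2 strands,
# and a per-position site p-value of 0.0001 (400 independent trials).
print(binomial_tail(3, 50 * 2 * 4, 0.0001))
```

The smaller the per-site p-value, the less likely a dense cluster arises by chance, which is why tightening the site threshold and raising crer_sig together reduce the number of reported CRERs.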

a 1 means that your background model accounts for the frequencies of di-nucleotides like CpG; 0 would just count all 4 nucleotides independently of each other; 2 would account for tri-nucleotides
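The difference between background orders in note a can be sketched in Python. This is a minimal illustration, not how RSAT builds its background files; the function name and the toy sequence are hypothetical.

```python
from collections import Counter

def markov_background(sequences, order):
    """Estimate P(base | preceding `order` bases) from the given sequences."""
    context_counts = Counter()
    transition_counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context = seq[i - order:i]  # empty string when order is 0
            base = seq[i]
            transition_counts[(context, base)] += 1
            context_counts[context] += 1
    return {
        (context, base): count / context_counts[context]
        for (context, base), count in transition_counts.items()
    }

toy = ["ACGCGCGT"]  # hypothetical sequence
order0 = markov_background(toy, 0)  # single-nucleotide frequencies
order1 = markov_background(toy, 1)  # depends on the previous base (di-nucleotides)
print(order0[("", "C")])
print(order1[("C", "G")])
```

With order 0 the probability of G is the same everywhere; with order 1 it depends on whether the previous base was C, which is how CpG depletion or enrichment enters the model.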

b The p-value for each site and PSSM tells how likely it is to get the score by chance. Note that your p-value threshold determines the number of false positives you allow/expect: e.g. p<0.001 gives one false prediction every 1kb

c with this p-value you expect less than 1 false positive site within 5.5kb and with crer_sig = 2 you expect 1 false positive for 100 tested CRERs
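The arithmetic behind notes b and c can be sketched in Python. The helper names are hypothetical; the calculation assumes one scored position per base pair on a single strand for a single matrix unless told otherwise, and it assumes crer_sig follows the convention sig = -log10(E-value).

```python
def expected_false_positives(p_value, seq_length, strands=1, n_matrices=1):
    """Expected number of chance hits: p-value times the number of scored positions."""
    return p_value * seq_length * strands * n_matrices

def sig_to_e_value(sig):
    """Convert a significance score to an E-value, assuming sig = -log10(E-value)."""
    return 10 ** (-sig)

# Note b: p < 0.001 over 1 kb gives about one false prediction.
print(expected_false_positives(0.001, 1000))

# Note c: p < 0.0001 over 5.5 kb stays below one expected false positive.
print(expected_false_positives(0.0001, 5500))

# crer_sig = 2 corresponds to an E-value of 0.01, i.e. 1 false positive per 100 tested CRERs.
print(sig_to_e_value(2))
```

Scanning both strands or several matrices multiplies the expected count, so the same p-value threshold becomes more permissive as the search space grows.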


Further exploring RSAT
On the left side close to the bottom you will find Tutorials. Scroll down and read how to retrieve sequences. Retrieve the sequence 1000 bp upstream of human PAX6, then search for binding sites of the 4 TFs (4TFs_Matrix.pscm) within the PAX6 UTR.