Functional Site identification

RSAT and PATSER - mapping motifs

The following exercises use the RSAT/PATSER web site.

PATSER

On the left side under Matrices and then under Pattern matching select patser [discontinued]
In the Sequence part choose fasta and upload the file TF1-top50.fasta (you will need to download this file to your computer first)
In the Patser options choose consensus as the Format and click on vertical
Copy/paste the matrix given in file TF1_Matrix.txt into the Matrix field.
For Lower threshold estimation choose adjusted information content(auto)^*
You can leave everything else as default
Click Go
The Information page will open, giving you a list of your sequences and their lengths, as well as information about your matrix
Further down it shows a table listing the sequences in which the motif was found, including position, score, and ln(P).
Go on with Feature Map
Change the Display limits to go from -800 to 0, and click Go.
This will take a while
You will see a figure of each of your sequences with the TF1 binding sites (PSSM hits) shown in blue. Right now it shows all hits, regardless of significance
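
The hits in the table come from sliding the matrix along each sequence and scoring every window. The following is a minimal Python sketch of that idea, not patser itself. It assumes a made-up 4-column count matrix (not the TF1 matrix), a uniform background and a pseudocount of 1, and it scans the forward strand only; patser's own weighting and its ln(P) calculation are more involved.

```python
import math

# Hypothetical count matrix (rows A, C, G, T; 4 columns). NOT the TF1 matrix.
COUNTS = {
    "A": [8, 0, 0, 1],
    "C": [0, 9, 0, 1],
    "G": [1, 0, 10, 0],
    "T": [1, 1, 0, 8],
}
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform
PSEUDOCOUNT = 1.0  # assumed value


def build_pssm(counts, background, pseudocount):
    """Turn counts into natural-log odds weights, one dict per matrix column."""
    width = len(counts["A"])
    pssm = []
    for i in range(width):
        total = sum(counts[b][i] for b in "ACGT") + pseudocount
        column = {}
        for b in "ACGT":
            freq = (counts[b][i] + pseudocount * background[b]) / total
            column[b] = math.log(freq / background[b])
        pssm.append(column)
    return pssm


def scan(sequence, pssm):
    """Score every window on the forward strand; return (start, word, score)."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        word = sequence[start:start + width]
        score = sum(pssm[i][base] for i, base in enumerate(word))
        hits.append((start, word, score))
    return hits


pssm = build_pssm(COUNTS, BACKGROUND, PSEUDOCOUNT)
hits = scan("TTACGTACGGA", pssm)
best = max(hits, key=lambda h: h[2])
print(best)
```

Without a threshold every window gets a score, which is why the first feature map shows hits of all significance levels.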

To get only significant TF1 binding sites:

Go back to the table that lists all hits to see what a good ln(p)-value cutoff would be
For instance, above the table you will find: ln(cutoff p-value) based on sample size adjusted information content: -8.782
Go back one more window; change the Lower threshold estimation to maximum ln(p)-value and enter -8.782
Click Go
See that the table with hits is much shorter now
Go on with Feature Map
Change the Display limits to go from -800 to 0
And Go
The figure will now only display sequences with significant hits
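
Setting a maximum ln(p)-value keeps only the hits whose ln(P) is at or below the cutoff. A small Python sketch of the same filter, applied to a hit table outside the web form; the hit rows are invented for illustration, and only the cutoff -8.782 comes from the exercise:

```python
import math

LN_P_CUTOFF = -8.782  # cutoff reported above the hit table in this exercise

# Hypothetical hits: (sequence_id, position, score, ln_p). Values are made up.
hits = [
    ("seq01", -512, 9.4, -10.31),
    ("seq01", -130, 5.2, -6.02),
    ("seq02", -745, 8.1, -8.95),
    ("seq03", -60, 4.7, -5.40),
]

# Keep hits that are at least as significant as the cutoff (more negative ln(P)).
significant = [h for h in hits if h[3] <= LN_P_CUTOFF]

for seq_id, pos, score, ln_p in significant:
    print(seq_id, pos, score, ln_p, f"p={math.exp(ln_p):.2e}")

# The cutoff expressed as an ordinary p-value.
cutoff_p = math.exp(LN_P_CUTOFF)
```

Because the values are natural logarithms, a cutoff of -8.782 corresponds to a p-value of roughly 1.5e-4.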

^*Calculates scores automatically. This method takes into account the information content of the matrix and the size of the sequence set to choose a good compromise between sensitivity and specificity. The matching positions probably contain several false positives. Higher scores indicate binding sites with higher probability (low ln(P)).

Matrix scan

The Matrix scan function can be used to look for binding sites of multiple TFs and for cis-regulatory modules:

On the left side under Matrices and then under Pattern matching select matrix-scan (full option)
In the Sequence part choose fasta and upload the file TF1-top50-seq.fa
In the Matrix field copy/paste the matrices of the 4 TFs from the file 4TFs_Matrix.pscm
As Background model choose Markov order 1^a
From the boxes below pick Individual sequences and click on site, pval, rank, and limits. Set the Lower threshold for p-value to 0 and the upper threshold to 0.0001
Go
This will take a while
The table shows you the individual hits for each TF binding site in each sequence together with location, p-value^b etc.
Go on with Feature Map
Change the Display limits to go from -800 to 0
And Go
This will take a while
You will see a figure of each of your sequences and the binding sites (PSSM hits) in different colors
Are any of the binding sites clustered more than expected by chance, i.e. do they form cis-regulatory modules (CRERs = cis-regulatory element enriched regions)?
Go back to the first window and choose CRERs instead of individual sites
Change the lower threshold of crer_sig to 0 and leave everything else at default
Note that these are very permissive parameters with a high false-positive rate; we choose them to get a first look at whether there are any CRERs at all
Go
The result table is huge (because of our permissive parameters) but we found some CRERs
Go on with Feature Map
Change the Display limits to go from -800 to 0 and un-check the box for legend
And Go
You will see a lot of CRERs depicted as red boxes in your sequences
Now run it with more stringent parameters to lower the number of false positives
Set the Lower threshold for p-value to 0 and the upper threshold to 0.0001, and set crer_sig to 2^c
Go on with Feature Map
Change the Display limits to go from -800 to 0
And Go
You will see the individual binding sites and predicted CRERs

Some individual binding sites are now hidden under the CRERs. At the bottom of the page you can select which motifs to display; deselect the CRERs to see where all the individual sites are.

^a 1 means that your background model accounts for the frequencies of di-nucleotides like CpG; 0 would just count all 4 nucleotides independently of each other; 2 would account for tri-nucleotides.
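
To make the Markov order concrete, here is a minimal Python sketch that estimates a background model of a given order from a set of sequences. The function name markov_model and the two toy sequences are hypothetical; RSAT estimates its background models from much larger sequence sets and adds its own handling of pseudo-frequencies.

```python
from collections import Counter, defaultdict


def markov_model(sequences, order):
    """Estimate P(next base | preceding `order` bases) from the sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            prefix = seq[i - order:i]  # empty string when order is 0
            counts[prefix][seq[i]] += 1
    model = {}
    for prefix, nexts in counts.items():
        total = sum(nexts.values())
        model[prefix] = {base: n / total for base, n in nexts.items()}
    return model


# Toy sequences, invented for illustration.
seqs = ["ACGCGCGTAT", "TTACGCGA"]

order0 = markov_model(seqs, 0)  # single-nucleotide frequencies
order1 = markov_model(seqs, 1)  # depends on the preceding base (di-nucleotides)

print(order0[""])
print(order1["C"])
print(order1["G"])
```

With order 0 there is a single table of base frequencies; with order 1 there is one table per preceding base, so a C followed by a G is scored differently from a G in any other context.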

^b The p-value for each site and PSSM tells how likely it is to get the score by chance. Note that your p-value threshold determines the number of false positives you allow/expect: e.g. p<0.001 gives one false prediction every 1 kb
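
The arithmetic behind this rule of thumb is the p-value threshold multiplied by the number of scored positions. A rough Python sketch follows; the helper expected_false_hits is hypothetical, and the second call plugs in assumed values for this exercise (50 sequences of 800 bp, 4 matrices, both strands, motif width 10), none of which are confirmed by the course files.

```python
def expected_false_hits(p_value, seq_length, motif_width,
                        n_sequences=1, n_matrices=1, strands=1):
    """Expected number of chance hits: p-value times number of tests."""
    positions = max(seq_length - motif_width + 1, 0)  # windows per sequence
    return p_value * positions * n_sequences * n_matrices * strands


# One matrix, one strand, 1 kb of scored positions at p < 0.001.
per_kb = expected_false_hits(0.001, 1000, 1)

# Assumed setup for this exercise (all values are guesses, see above).
exercise = expected_false_hits(0.0001, 800, 10,
                               n_sequences=50, n_matrices=4, strands=2)

print(per_kb)
print(exercise)
```

The count grows with every additional sequence, matrix and strand, which is one reason to tighten the threshold when scanning with several matrices.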

^c With this p-value you expect fewer than 1 false positive site within 5.5 kb, and with crer_sig = 2 you expect 1 false positive per 100 tested CRERs
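
The link between crer_sig and the expected number of false positives can be written as a one-line conversion, assuming that the significance is defined as sig = -log10(E-value), as it is elsewhere in RSAT. The function name sig_to_evalue is hypothetical.

```python
def sig_to_evalue(sig):
    """Convert a significance score to an E-value, assuming sig = -log10(E)."""
    return 10.0 ** (-sig)


permissive = sig_to_evalue(0)  # threshold used in the first CRER run
stringent = sig_to_evalue(2)   # threshold used in the second CRER run

print(permissive)
print(stringent)
```

Under this assumption, crer_sig = 0 accepts CRERs that are expected about once by chance, while crer_sig = 2 corresponds to an E-value of 0.01, i.e. about 1 false positive per 100 tested CRERs.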

Further exploring RSAT

On the left side, close to the bottom, you will find Tutorials. Scroll down and read how to retrieve sequences. Retrieve the sequence 1000 bp upstream of human PAX6, then search for binding sites of the 4 TFs (4TFs_Matrix.pscm) within the PAX6 UTR.
