In the links
below, [pgm]
indicates a link with most of the information filled in; e.g. the
program name, query, and
library. [seq] links go to the
NCBI, for more information about the sequence.
Identifying homologs and non-homologs; effects of scoring matrices
and algorithms
1. Use the FASTA search page [pgm] to compare Drosophila
glutathione transferase
GSTT1_DROME [seq] (gi|121694) to the PIR1
Annotated protein sequence database. Be sure to press , not .
Take a look at the output.
How long is the query sequence?
How many sequences are in the PIR1 database?
What scoring matrix was used?
What were the gap penalties?
What are each of the numbers after the description of the library sequence? Which one is best for inferring homology?
Looking at an alignment, where are the boundaries of the alignment (the best local region)?
What is the highest scoring non-homolog? (The non-homolog with the highest alignment score, or the lowest E()-value.) How would you confirm
that your candidate non-homolog was truly unrelated?
Note that this drosophila glutathione transferase shares significant
similarity with both sequences from bacteria (SSPA_ECO57, stringent
starvation protein) and mammals. How might you test whether the
stringent starvation protein is homologous to glutathione
transferases? (Hint - compare your candidate non-homolog with SwissProt for a more reliable test.)
Compare the expectation (E()) value for the distant
relationship between GSTT1_DROME and
GSTP1_HUMAN (class-mu). How would you demonstrate that
GSTT1_DROME is homologous to GSTP1_HUMAN?
Answer: Try searching PIR1 with an intermediate (but significant) mammalian sequence.
Examine how the expectation value changes with different scoring
matrices (BlastP62, PAM250, MD40) and different gap
penalties. (The default scoring matrix for the FASTA programs is
BLOSUM50, with gap penalties of -10 to open a gap and -2 for each
residue in the gap - e.g. -12 for a one residue gap).
What happens to the E()-value for the 100% identical sequence with the different matrices and different gap penalties?
What happens to the E()-value for distant homologs, like GSTA1_RAT with the different matrices and different gap penalties?
What happens to the E()-value for the highest scoring unrelated
sequence with the different matrices?
Try the search with ssearch [pgm] (Smith-Waterman). Again, look
at the E()-values for distant homologs and the highest scoring
unrelated sequence.
2. Do the same search (121694) using the Course
BLAST [pgm] WWW page.
Take a look at the output.
How long is the query sequence?
How many sequences are in the PIR1 database?
What scoring matrix was used?
What were the gap penalties?
What are the numbers after the description of the library sequence? Which one is best for inferring homology?
Looking at an alignment, where are the boundaries of the alignment (the best local region)?
What is the highest scoring non-homolog?
How do the blastp E()-values compare with the
FASTA (blosum62) E()-values for the distantly related
mammalian and plant sequences?
Try a limited search at the NCBI/BLAST web site. Compare Drosophila GSTT1_DROME (121694) to the Human Refseq proteins using BLASTP [pgm].
What is the highest scoring unrelated sequence (hint, it has an E()-value < 0.001).
How would you demonstrate that this "significant" unrelated sequence is in fact unrelated.
3. Working with short sequences -- when the scoring matrix matters
The default scoring matrices/gap-penalties for BLAST (BLOSUM62,
-11/-1) and FASTA (BLOSUM50, 10/-2) are chosen to identify the most
distant homologs possible. BLOSUM50 provides about 0.48 bits of
significance per aligned amino acid, on average (ungapped). BLOSUM62
provides about 0.70 bits/aligned position (ungapped). In a search
against SwissProt (~450,000 proteins, 1.7 x 108 residues) with an
"average" length query (400 aa), an alignment must score about 45 bits
to produce E() < 0.001. For BLOSUM62, this means the alignment must
be 45 bits/0.70 bits/aa = 64 residues long. While protein sequences are
almost always longer than 64 residues, metagenomic DNA reads are often
shorter than 200 nt (=66 aa max). For shorter DNA reads, shallower
scoring matrices are required. BLAST (BLASTP, BLASTX) only offers one
shallow scoring matrix, PAM30, which provides about 2.6 bits per
position. FASTA/FASTX provides MD40 (2.2 bits/position, MD20 - 2.9,
and MD10 - 3.4). A 70 nt DNA sequence can produce 60 bits with PAM30
or 68 bits with MD20.
Try the
FASTA [pgm] search in question (1) using a
fragment of gstt1_drome (51-100) (be sure to check the box "Use subset
range" on the Query sequence" line.
What is the most distantly related homolog with E() < 0.001? E()<0.01?
How long is it's alignment? What is the percent identity?
What would be the "right" scoring matrix for a 50 residue query to
produce 40 bits of significance (40 bits, rather than 45 bits, because
PIR1 only has 13,000, rather than 450,000 sequences - 1/32 as many, or 5
bits). (BLOSUM80 and PAM120 both produce about 0.91 bits of score per
position.)
Note that you can also use the "percent identity" to estimate the most
effective matrix. 80% identity -> MD20, 60% identity -> MD40, 40% ->
BLOSUM80.
Try the
FASTA [pgm] search in question (1) using a
fragment of gstt1_drome (51-80, or 90 nt). Again, be sure to check the
box "Use subset range" on the Query sequence" line.
Now what is the most sensitive scoring matrix?
What is the percent identity of the most distant alignment?
What sequences are significant if you use BLOSUM62 -11/-1 (BlastP62) as the scoring matrix? (BLAST default.)
Comparison of Protein:Protein, translated DNA:protein to DNA:DNA searches - more sensitive DNA searches4. In the next three exercises, we will try to find
gstt1_drome homologs in the Arabidopsis genome, using
(a) protein:protein (BLASTP), (b) DNA:protein (BLASTX), (c)
protein:DNA (TBLASTN), and (d,e) DNA:DNA (BLASTN) searches.
In each of the exercises below, the BLASTP, BLASTX etc. links are pre-set to search Arabidopsis sequences.
What are the E()-values for Arabidopsis THETA1, phi, zeta-class 1and TAU
BLASTX
Try the same search using the GSTT1_DROME cDNA DMGST [seq] (gi|8033) against Arabidopsis proteins using
BLASTX [pgm]. If the search fails, try turning off low complexity filtering.
What are the E()-values for Arabidopsis THETA-1, phi, zeta-class 1and TAU?
What are the E()-values for Arabidopsis THETA-1, phi, zeta-class 1and TAU?
Finally, try the DNA:DNA comparison.
Use
BLASTN [pgm] to compare dmgst
(gi|8033) to the Reference mRNA sequences (refseq_rna) sequences in Arabidopsis.
Are there detectable Arabidopsis homologues?
Confirming statistical estimates with shuffles
5. Use the PRSS shuffle [pgm] program to evaluate the statistical significance of a match.
What is the E()-value? What equivalent database size is used to calculate the E()-value? Why isn't the database size 1? What should it be?
Use PRSS [pgm] to
compare OPSD_HUMAN [seq] (gi|129207)
to US27_HCMVA [seq] (gi|137159) Human Herpes GPCR homolog. Compare with uniform or window shuffling. How does the E()-value change? Why?
Significant similarities within sequences - domain duplication
6. Exploring domains with local alignments --- Calmodulin
Use lalign [pgm] to examine local
similarities between calmodulin CALM_HUMAN and itself.
Use plalign [pgm] to plot the same alignment. How many repeats are present in this sequence.
What happens to the domain alignment plot when you use a shallower scoring matrix (try BlastP62, MD20).
7. Exploring domains with local alignments --- Cortactin (SRC8_HUMAN)
Use lalign [pgm] to examine local
similarities between SRC8_HUMAN and itself.
Note the average percent identity for the some of the most significant alignments.
Use plalign [pgm] to plot the same alignment. How many
repeats are shown in this alignment? How long do they appear to be?
Based on the percent identities you saw in part (a), what would the appropriate scoring matrix be to accurately identify the cortactin domains?
What happens to the domain alignment plot when you use a shallower
scoring matrix (try BlastP62, MD40). How do the BlastP62, PAM120, and MD40
alignments differ? Why are they different? Which matrix appears to
best identify all the repeats found in the PFAM diagram?
You can look at the PFAM annotation of this protein at PFAM: SRC8_HUMAN [pfam] (Cortactin). How many repeats are shown in this diagram? How long are they?
For more complex domain alignments, try mwkw, or mouse RNA
polymerase (rpb1_mouse residues 1500- ) against itself. Try
the rpb1_mouse alignment using the MD20 scoring matrix as
well as BLOSUM50.