Computational Genomics - Workshop I - Similarity Searching
These exercises use programs on the FASTA WWW Search page and the Computational Genomics BLAST WWW Search page.
Identifying homologs and non-homologs; effects of scoring matrices
and algorithms
1. Use the FASTA search page [pgm] to compare Drosophila
glutathione transferase
GSTT1_DROME (gi|121694) [seq] to the PIR1
Annotated protein sequence database.
- Take a look at the output.
-
How long is the query sequence?
- How many sequences are in the PIR1 database?
- What scoring matrix was used?
- What were the gap penalties?
- What are each of the numbers after the description of the library sequence? Which one is best for inferring homology?
- Looking at an alignment, where are the boundaries of the alignment (the best local region)?
-
What is the highest scoring non-homolog? (The non-homolog with the highest alignment score, or the lowest E()-value.) How would you confirm
that your candidate non-homolog was truly unrelated?
-
Note that this drosophila glutathione transferase shares significant
similarity with both sequences from bacteria (SSPA_SHIFL, stringent
starvation protein) and mammals. How might you test whether the
stringent starvation protein is homologous to glutathione
transferases? (Hint - compare your candidate non-homolog with SwissProt for a more reliable test.)
- Compare the expectation (E()) value for the distant
relationship between GSTT1_DROME and
GSTM2_RAT (class-mu). How would you demonstrate that
GSTT1_DROME is homologous to GSTM2_RAT?
- Examine how the expectation value changes with different scoring
matrices (BLOSUM62, BlastP62, PAM250) and different gap
penalties. (The default scoring matrix for the FASTA programs is
BLOSUM50, with gap penalties of -10 to open a gap and -2 for each
residue in the gap - e.g. -12 for a one residue gap).
-
What happens to the E()-value for the 100% identical sequence with the different matrices and different gap penalties?
-
What happens to the E()-value for distant homologs, like GSTA1_RAT with the different matrices and different gap penalties?
-
What happens to the E()-value for the highest scoring unrelated
sequence with the different matrices?
- Try the search with ssearch [pgm] (Smith-Waterman). Again, look
at the E()-values for distant homologs and the highest scoring
unrelated sequence.
2. Do the same search (121694) using the Course
BLAST [pgm] WWW page.
- Take a look at the output.
-
How long is the query sequence?
- How many sequences are in the PIR1 database?
- What scoring matrix was used?
- What were the gap penalties?
- What are the numbers after the description of the library sequence? Which one is best for inferring homology?
- Looking at an alignment, where are the boundaries of the alignment (the best local region)?
-
What is the highest scoring non-homolog?
- How do the blastp E()-values compare with the
FASTA (blosum62) E()-values for the distantly related
mammalian and plant sequences?
Comparison of Protein:Protein, translated DNA:protein to DNA:DNA searches - more sensitive DNA searches
3. In the next three exercises, we will try to find
gstt1_drome homologs in the Arabidopsis genome, using
(a) protein:protein (BLASTP), (b) DNA:protein (BLASTX), (c)
protein:DNA (TBLASTN), and (d,e) DNA:DNA (BLASTN) searches.
In each of the exercises below, the BLASTP, BLASTX etc. links are pre-set to search Arabidopsis sequences.
- BLASTP
Compare the GSTT1_DROME (gi|121694) [seq] protein sequence to Arabidopsis proteins using BLASTP [pgm].
What are the E()-values for Arabidopsis THETA1, phi, zeta-1and lambda 1?
-
BLASTX
Try the same search using the GSTT1_DROME cDNA DMGST (gi|8033) [seq] against Arabidopsis proteins using
BLASTX [pgm].
What are the E()-values for Arabidopsis THETA1, phi, zeta-1and lambda 1?
-
TBLASTN. Use
GSTT1_DROME (gi|121694) [seq] against translated Arabidopsis DNA using
TBLASTN [pgm].
What are the E()-values for Arabidopsis THETA1, phi, zeta-1and lambda 1?
- Finally, try the DNA:DNA comparison.
Use
BLASTN [pgm] to compare dmgst
(gi|8033) to the Reference mRNA sequences (refseq_rna) sequences in Arabidopsis.
Are there detectable Arabidopsis homologues?
- Check the BLAST DNA match/mismatch penalties used for the search.
Confirming statistical estimates with shuffles
4. Use the PRSS shuffle program to evaluate the statistical significance of a match.
-
Compare
GSTT1_DROME (gi|121694) to
GSTA4_RAT (gi|121714) using PRSS
What is the E()-value? What equivalent database size is used to calculate the E()-value? Why isn't the database size 1? What should it be?
-
Use PRSS to
compare OPSD_HUMAN (gi|129207)
to US27_HCMVA (gi|137159) Human Herpes GPCR homolog. Compare with or without window shuffling. How does the E()-value change? Why?
-
Use the FASTA search page [pgm] to compare rat synapsin-1
SYN1_RAT (gi|6686305) [seq] to the PIR1
Annotated protein sequence database.
- What is the E()-value against the human IL3 receptor IL3RB_HUMAN [seq]?
-
Use PRSS [pgm] to compare SYN1_RAT with IL3RB_HUMAN. Now what is the E()-value? What happens with the window shuffle?
-
Look at the two sequences using pseg [pgm].
Significant similarities within sequences - domain duplication
5. Exploring domains with local alignments --- Calmodulin
- Use lalign [pgm] to examine local
similarities between calmodulin CALM_HUMAN and itself.
- Use plalign [pgm] to plot the same alignment. How many repeats are present in this sequence.
-
What happens to the domain alignment plot when you use a shallower scoring matrix (try BP62, MD20).
6. Exploring domains with local alignments --- Cortactin (SRC8_HUMAN)
- Use lalign [pgm] to examine local
similarities between SRC8_HUMAN and itself.
- Use plalign [pgm] to plot the same alignment. How many
repeats are shown in this alignment? How long do they appear to be?
-
You can look at the PFAM annotation of this protein at PFAM: SRC8_HUMAN [seq] (Cortactin). How many repeats are shown in this diagram? How long are they?
- What happens to the domain alignment plot when you use a shallower
scoring matrix (try BP62, MD40). Which matrix appears to best
identify all the repeats found in the PFAM diagram?
- Using lalign [pgm], note the end alignment
coordinates for the second sequence when the MD40 matrix is used. Note
the end coordinates when a deeper matrix (BL50, BP62 is used).
7. Exploring domains with local alignments --- Death Associated Protein Kinase 1 (DAPK1)
- Use lalign [pgm] to examine local
similarities between DAPK1_HUMAN and itself.
- Use plalign [pgm] to plot the same alignment. How many
repeats are present in this sequence. Try zooming in [pgm] by doing the alignment plot using
the subset of the sequence from 350-650
- What happens to the domain alignment plot when you use a shallower scoring matrix (try BP62, MD20).
You can look at the PFAM annotation of this protein at: DAPK1_HUMAN Pfam [seq]
For more complex domain alignments, try mwkw, or mouse RNA
polymerase (rpb1_mouse residues 1500- ) against itself. Try
the rpb1_mouse alignment using the MD20 scoring matrix as
well as BLOSUM50.
Where to get the FASTA package:
faculty.virginia.edu/wrpearson/fasta/
The "normal" FASTA WWW site:
Contact Bill Pearson: wrp@virginia.edu
Course Home Page