EBI -- Patterns, Similarities, and Workshop - Similarity Searching
June 10, 2008
fasta.bioch.virginia.edu/ebi08/
These exercises use programs on the FASTA WWW Search page and the EBI 08 BLAST WWW Search page.
Identifying homologs and non-homologs; effects of scoring matrices
and algorithms
- Use the FASTA search page to compare Drosophila
glutathione transferase
GSTT1_DROME (gi|121694) to the PIR1
Annotated protein sequence database.
-
What is the highest scoring non-homolog? (How would you confirm
that your candidate non-homolog was truly unrelated?)
-
Note that this Drosophila glutathione transferase shares significant
similarity with both sequences from bacteria (SSPA_ECO57, stringent
starvation protein) and mammals. How might you test whether the
stringent starvation protein is homologous to glutathione
transferases? (Hint - search SwissProt for a more
comprehensive view of the family)
- Compare the expectation (E()) value for the distant
relationship between GSTT1_DROME and
GSTM2_RAT (class-mu). How would you demonstrate that
GSTT1_DROME is homologous to GSTM2_RAT?
- Examine how the expectation value changes with different scoring
matrices (BLOSUM62, BlastP62, PAM250) and different gap
penalties. (The default scoring matrix for the FASTA programs is
BLOSUM50, with gap penalties of -10 to open a gap and -2 for each
residue in the gap - e.g. -12 for a one residue gap).
What happens to the E()-value for the highest scoring unrelated
sequence with the different matrices?
Look at the distribution of scores and the
E()-value of the highest scoring unrelated sequence when the
gap-open/gap-ext penalties are small (-7/-1).
- Try the search with ssearch (Smith-Waterman). Again, look
at the E()-values for distant homologs and the highest scoring
unrelated sequence.
- (optional) Try the search with ktup=1 (What is ktup?).
FASTA uses the ktup parameter to adjust the
sensitivity and speed of the search. With ktup=2,
FASTA looks for "pairs" of matched identical residues to find
regions of similarity. ktup=1 looks for singly-aligned
residues, and thus takes longer.
- (optional) The latest version (35) of the FASTA package
also provides global:global and global:local protein database
searches. The statistical estimates for these programs have not been
fully validated.
Try the search with ggsearch (Global query:Global library)
and glsearch (Global query:Local
library). Note that global sequence comparisons can have negative
scores (local comparisons must have scores > 0). Are global
searches more or less sensitive than local searches? Are the
statistical estimates as accurate?
-
Do the same search (121694) using the Course
BLAST WWW page.
-
What is the highest scoring non-homolog?
- How do the blastp E()-values compare with the
FASTA (blosum62) E()-values for the distantly related
mammalian and plant sequences?
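The gap-penalty arithmetic and the local/global distinction from the FASTA exercises above can be made concrete with a small sketch. This is not the FASTA, ssearch, or ggsearch code: it assumes a made-up match/mismatch score (+5/-4) in place of a real matrix such as BLOSUM50, and the names gap_cost and align_score are hypothetical. Only the -10/-2 gap penalties come from the text.

```python
NEG_INF = float("-inf")

def gap_cost(length, gap_open=-10, gap_ext=-2):
    """Penalty for one gap of `length` residues: open once, extend per residue."""
    return gap_open + gap_ext * length

def align_score(a, b, local=True, match=5, mismatch=-4, gap_open=-10, gap_ext=-2):
    """Best alignment score with affine gaps (score only, no traceback).

    match/mismatch are illustrative stand-ins for a real scoring matrix.
    """
    first = gap_open + gap_ext            # cost of the first residue in a gap
    m = len(b)
    # Row 0: zeros for a local alignment, a leading gap for a global one.
    h_prev = [0] + [0 if local else gap_cost(j, gap_open, gap_ext)
                    for j in range(1, m + 1)]
    f_prev = [NEG_INF] * (m + 1)          # best score ending in a vertical gap
    best = 0
    for i in range(1, len(a) + 1):
        h = [0 if local else gap_cost(i, gap_open, gap_ext)] + [0] * m
        f = [NEG_INF] * (m + 1)
        e = NEG_INF                       # best score ending in a horizontal gap
        for j in range(1, m + 1):
            e = max(h[j - 1] + first, e + gap_ext)
            f[j] = max(h_prev[j] + first, f_prev[j] + gap_ext)
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[j] = max(h_prev[j - 1] + s, e, f[j])
            if local:
                h[j] = max(h[j], 0)       # a local score never drops below 0
                best = max(best, h[j])
        h_prev, f_prev = h, f
    return best if local else h_prev[m]

print(gap_cost(1))                                  # -12, as in the text
print(align_score("AAAA", "WWWW", local=True))      # 0: no local alignment
print(align_score("AAAA", "WWWW", local=False))     # -16: forced end-to-end
```

With unrelated sequences the global score is negative because every residue must be aligned, while the local score simply stays at zero.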
Comparison of protein:protein and translated DNA:protein searches to DNA:DNA searches - more sensitive DNA searches
-
In the next four exercises, we will try to find
GSTT1_DROME homologs in the Arabidopsis genome, using
(a) protein:protein (BLASTP), (b) DNA:protein (BLASTX), (c)
protein:DNA (TBLASTN), and (d) DNA:DNA (BLASTN) searches.
In each of the exercises below, the BLASTP, BLASTX etc. links are pre-set to search Arabidopsis sequences.
- BLASTP
Compare the GSTT1_DROME (gi|121694) protein sequence to Arabidopsis thaliana proteins (select Database: refseq_protein, Organism: Arabidopsis thaliana) using NCBI BLASTP.
What are the E()-values for Arabidopsis ATGSTT1, ATGSTF10, ATGSTZ1, and ATGSTU4?
-
BLASTX
Try the same search using the GSTT1_DROME cDNA DMGST (gi|8033) against Arabidopsis proteins using
NCBI BLASTX. (Once again, select Database: refseq_protein, Organism: Arabidopsis thaliana.)
What are the E()-values for Arabidopsis ATGSTT1, ATGSTF10, ATGSTZ1, and ATGSTU4?
-
TBLASTN
Compare GSTT1_DROME (gi|121694) to translated Arabidopsis DNA using
NCBI TBLASTN.
Select Database: reference mRNA and Organism: Arabidopsis thaliana.
What are the E()-values for Arabidopsis ATGSTT1, ATGSTF10, ATGSTZ1, and ATGSTU4?
- Finally, try the DNA:DNA comparison.
Use NCBI BLASTN to compare DMGST
(gi|8033) to the DNA sequences in Arabidopsis. Once again, select Database: reference mRNA and Organism: Arabidopsis thaliana.
Are there detectable Arabidopsis homologs?
Are all statistically significant matches homologous?
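One reason translated searches can detect relationships that DNA:DNA searches miss is that synonymous codon changes alter the DNA but not the encoded protein. The sketch below illustrates the six-frame translation idea behind BLASTX and TBLASTN using the standard genetic code. It is not the BLAST implementation; the function names and the two short DNA sequences are made up for illustration.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated with T, C, A, G at each position.
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def translate(dna, offset=0):
    """Translate complete codons starting at `offset`; unknown codons become X."""
    dna = dna.upper()
    codons = (dna[p:p + 3] for p in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def six_frames(dna):
    """Map frame number (+1..+3 forward, -1..-3 reverse) to its translation."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna, offset)
        frames[-(offset + 1)] = translate(rc, offset)
    return frames

# Two hypothetical DNA sequences that use different codons for the same peptide.
dna_a = "ATGGCTCTGAAAGGT"
dna_b = "ATGGCCTTAAAGGGC"
identical = sum(x == y for x, y in zip(dna_a, dna_b))
print(f"DNA identity: {identical}/{len(dna_a)}")
print(translate(dna_a), translate(dna_b))
```

The two DNA strings differ at several positions, yet frame +1 gives the same peptide for both, which is the signal a protein-level comparison can still see.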
-
Search statistics with low complexity regions:
-
Use the fasta program to search the
PIR1 database with grou_drome. Do one search excluding
low-complexity regions (the default). What is the highest scoring
unrelated sequence? Its E()-value? What is the E()-value of
GBB1_DROME?
-
Do a second search including low complexity regions (un-check the
Exclude low complexity (seg) box). Compare the E()-value of the
highest scoring unrelated sequence and the GTP-binding regulatory
protein GBB1_DROME.
-
Do the same search with
BLASTP. Compare the results (high scoring sequences and homolog scores) with and without using the filter: low complexity option.
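To see why low-complexity regions are worth excluding, it can help to look at how such regions might be flagged. The sketch below marks windows with low Shannon entropy. It is a simplified stand-in, not the seg algorithm used by the search pages; the window length, the threshold, and the function names are assumptions chosen for illustration.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy (bits per residue) of the residue composition."""
    n = len(window)
    counts = Counter(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace residues in every low-entropy window with 'x'.

    window and threshold are illustrative values, not validated settings.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= threshold:
            for k in range(start, start + window):
                masked[k] = "x"
    return "".join(masked)

# A made-up sequence: a diverse segment followed by a poly-Q run.
example = "ACDEFGHIKLMN" + "Q" * 12
print(mask_low_complexity(example))
```

Because whole windows are masked, the mask can extend a few residues into the flanking sequence. A biased region left unmasked tends to produce high scores against any other sequence with the same bias, whether or not the two are homologous.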
Confirming statistical estimates with shuffles
-
Use the PRSS shuffle program to evaluate the statistical significance of a match.
-
Compare
GSTT1_DROME (gi|121694) to
GSTA4_RAT (gi|121714) using PRSS.
What is the E()-value? What database size is used to calculate the E()-value? Why?
-
Compare SKIL_HUMAN (gi|134594) to KINH_STRPU (kinesin heavy chain) using PRSS. Compare the results with and without window shuffling.
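The shuffling idea behind these exercises can be sketched in a few lines. The code below is a rough illustration, not the PRSS program: it reports the plain fraction of shuffled scores that reach the real score instead of fitting a statistical distribution, and it scores with a toy Smith-Waterman (match +5, mismatch -4, linear gap -8) in place of a real matrix. All function names and parameter values are assumptions.

```python
import random

def local_score(a, b, match=5, mismatch=-4, gap=-8):
    """Smith-Waterman score with a simple linear gap penalty (toy values)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

def uniform_shuffle(seq, rng):
    """Shuffle all residues: preserves overall composition only."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

def window_shuffle(seq, rng, window=10):
    """Shuffle within consecutive windows: also preserves local composition."""
    out = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        rng.shuffle(chunk)
        out.extend(chunk)
    return "".join(out)

def shuffle_p_value(query, subject, shuffles=200, window=None, seed=0):
    """Fraction of shuffled subjects scoring at least as well as the real one."""
    rng = random.Random(seed)
    real = local_score(query, subject)
    hits = 0
    for _ in range(shuffles):
        if window is None:
            shuffled = uniform_shuffle(subject, rng)
        else:
            shuffled = window_shuffle(subject, rng, window)
        if local_score(query, shuffled) >= real:
            hits += 1
    return (hits + 1) / (shuffles + 1)   # add one so the estimate is never 0
```

Calling shuffle_p_value(query, subject, window=10) instead of the default switches to window shuffling. Window shuffling keeps locally biased composition in place, so a match that owes its score to such bias should look less significant under it than under uniform shuffling.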
Significant similarities within sequences - domain duplication
- Exploring domains with local alignments --- Calmodulin
- Use lalign to examine local
similarities between calmodulin CALM_HUMAN and itself.
- Use plalign to plot the same alignment. How many repeats are present in this sequence?
-
What happens to the domain alignment plot when you use a shallower scoring matrix (try BP62, MD20)?
-
Exploring domains with local alignments --- Death Associated Protein Kinase 1 (DAPK1)
- Use lalign to examine local
similarities between DAPK1_HUMAN and itself.
- Use plalign to plot the same alignment. How many
repeats are present in this sequence? Try zooming in by plotting the alignment using
the subset of the sequence from 350-650.
- What happens to the domain alignment plot when you use a shallower scoring matrix (try BP62, MD20)?
You can look at the PFAM annotation of this protein at: DAPK1_HUMAN Pfam
For more complex domain alignments, try mwkw, or mouse RNA
polymerase (rpb1_mouse residues 1500-) against itself. Try
the rpb1_mouse alignment using the MD20 scoring matrix as
well as BLOSUM50.
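Internal repeats show up in a self-comparison as matches away from the main diagonal, and the distance from the diagonal gives the repeat spacing. The sketch below tallies exact word matches by diagonal offset, in the style of a dot plot. It is a simplification and not what lalign or plalign compute (they report scored local alignments); the function name, the word length, and the synthetic test sequence are assumptions.

```python
from collections import Counter, defaultdict

def repeat_offsets(seq, k=3):
    """Count identical k-residue words in a self-comparison by diagonal offset.

    Offset d means the same word occurs at positions i and i + d (d > 0),
    so the main diagonal (d = 0) is excluded.
    """
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    offsets = Counter()
    for hits in positions.values():
        for x in range(len(hits)):
            for y in range(x + 1, len(hits)):
                offsets[hits[y] - hits[x]] += 1
    return offsets

# Synthetic sequence: a 10-residue unit repeated three times (not a real protein).
synthetic = "ACDEFGHIKL" * 3
for offset, count in repeat_offsets(synthetic).most_common():
    print(offset, count)
```

Offsets with many word matches correspond to the off-diagonal lines in an alignment plot. Real repeats have diverged, so exact words miss much of what a scored local alignment finds.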
How to get the FASTA package.
The "normal" FASTA WWW site:
Contact Bill Pearson: wrp@virginia.edu