To do these exercises in a Zoom breakout room, I recommend that
one person in the group run the searches and share their screen with
the rest of the group during class time. At the same time, every
member of the group should have a window open to this page, so
that you can read the questions here, follow the search results
on the shared Zoom screen, and discuss the questions.
In the links below, [pgm] indicates a link with most of the search
information already filled in, e.g. the program name, query, and
library. [seq] links go to the NCBI for more information about the
sequence. In general, you should click [pgm] links, but not [seq] links.
Identifying homologs and non-homologs; effects of scoring matrices and algorithms; using domain annotations
Use the FASTA search page [pgm] to compare Honey bee
glutathione transferase D1 NP_001171499/ H9KLY5_APIME [seq] (gi|295842263) to the PIR1 Annotated protein sequence database. Be sure to press , not .
Take a look at the output.
How long is the query sequence?
How many sequences are in the PIR1 database?
What scoring matrix was used?
What were the gap penalties? (what is the penalty for a one-residue gap? two residues? -- this is a "trick" question)
What does each of the numbers after the description of the library sequence mean? Which one is best for inferring homology?
How similar is the highest scoring sequence? What is the difference between %_id and %_sim? Why is there no 100% identity match?
Looking at an alignment, where are the boundaries of the alignment (the best local region)? How many gaps are in the best alignment? The second best?
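The gap-penalty question above is easier to reason about with the affine gap model written out. The sketch below is a minimal illustration, assuming the convention in which every gapped residue, including the first, pays the extension penalty on top of a single open penalty. The function name gap_cost is hypothetical, and the -11/-1 values are the ones named for the BLASTP62 search later in this exercise. Check the search output and the program documentation to confirm which convention your search reports.

```python
def gap_cost(length, gap_open, gap_extend):
    """Total penalty for one gap of `length` residues under an affine model.

    Assumed convention: cost = gap_open + length * gap_extend,
    so even a one-residue gap pays both the open and the extend penalty.
    """
    if length < 1:
        raise ValueError("a gap must contain at least one residue")
    return gap_open + length * gap_extend


# Penalties written as open/extend, e.g. -11/-1
for k in (1, 2, 3):
    print(k, gap_cost(k, gap_open=-11, gap_extend=-1))
```

Under this assumption, a one-residue gap costs more than the open penalty alone, which is why reading the first number as "the cost of a gap" can be misleading.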
Homologs, non-homologs, and the statistical control.
What is the highest (worst) E()-value shown? What should the
highest (worst) E()-value calculated in the search be (approximately)?
Which alignment has the worst statistically significant (E()<0.001) score? Do you think this sequence is likely to be homologous?
What is the highest scoring (most significant) non-homolog? (The non-homolog with the
highest alignment score, or the lowest E()-value.) Why do you think it is not homologous?
Look for positive evidence (e.g. a non-homologous domain) for non-homology.
You can use the domain diagrams (colors) to identify distant homologs, and, by elimination, the highest scoring non-homolog. You can also use the Sequence Lookup link to Uniprot to look at Domains and Families.
If the statistical estimates are accurate, what should the E()-value for the
highest non-homolog (the highest score by chance) be? (This is a
control for statistical accuracy.)
What is the E()-value of the most distant homolog shown (based on displayed domain content)? Could there be more distant homologs in the database?
How would you confirm that your candidate non-homolog was truly
unrelated? (Hint - compare your candidate non-homolog
with SwissProt or QFO78/Uniprot Ref for a more comprehensive test.)
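The E()-value questions above rest on one relationship: the E()-value is the number of times a score at least this good is expected by chance in a search of the whole database. The sketch below assumes the simple model E = p * D, where p is the pairwise probability of the score and D is the number of library sequences. The function names and the database size of 10,000 are hypothetical, chosen only for illustration; use the size reported in your own search output.

```python
def e_value(p_value, db_size):
    """Expected number of chance hits at this score in a database of db_size sequences."""
    return p_value * db_size


def expected_best_chance_e_value(db_size):
    """Mean E()-value of the best-scoring unrelated sequence.

    Assumes unrelated sequences have independent, uniformly distributed
    p-values; the smallest of N such values has mean 1 / (N + 1).
    """
    return db_size / (db_size + 1)


DB_SIZE = 10_000  # hypothetical database size

print(e_value(1.0, DB_SIZE))                   # worst possible score in the search
print(e_value(1e-7, DB_SIZE))                  # a hit with a very small pairwise p-value
print(expected_best_chance_e_value(DB_SIZE))   # best score expected by chance
```

If the best non-homolog in a real search has an E()-value far below what this model predicts, that suggests either the statistics are off or the sequence is not really a non-homolog.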
Domains and alignment regions
There are three parts to the domain display: the domain structure of
the query (top) sequence (if available), the domain structure of the library (bottom)
sequence, and the domain alignment boundaries in the middle (inside the
alignment box). The boundaries and colors of the aligned domains
match the Region: sub-alignment scores.
Note that the alignment of Honey bee GSTD1
and SSPA_ECO57 includes portions of both the N-terminal and
C-terminal domains, but neither domain is completely aligned. Why do
you think the alignments do not include the complete domains?
Is your explanation for the partial domain alignment consistent with
the argument that domains have a characteristic length? How might you
test whether a complete domain is present?
Repeat the GSTD1 search [pgm] using the BLASTP62/-11/-1
scoring matrix.
Re-examine the honey bee/SSPA_ECO57 alignment.
Are both Glutathione transferase domains present in the honey bee protein?
Look at the alignments to the homologs above and below SSPA_ECO57.
Based on those alignments, do you think the Glutathione-S-Trfase C-like domain is
really missing from the honey bee protein?
Why did the alignment become shorter?
Why would a domain appear to be present in the first (BLOSUM50) search, but not in the second (BLOSUM62)?
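One possible mechanism behind the last two questions can be explored with a toy model. The sketch below is not a BLOSUM50 or BLOSUM62 search: it uses hypothetical 16-residue sequences, simple match/mismatch scores in place of real matrices, and an ungapped best-segment search in place of a full local alignment. All names and values are assumptions for illustration. It shows only that a local alignment ends where extending it would lower the score, so a scoring scheme that penalizes mismatches more heavily can drop a lower-identity region.

```python
def best_ungapped_segment(query, subject, match, mismatch):
    """Return (score, start, end) of the best-scoring contiguous segment.

    Positions are compared one-to-one (no gaps); `end` is exclusive.
    """
    best_score, best_start, best_end = 0, 0, 0
    run_score, run_start = 0, 0
    for i, (q, s) in enumerate(zip(query, subject)):
        run_score += match if q == s else mismatch
        if run_score <= 0:
            # A non-positive running score cannot help any later segment: restart.
            run_score, run_start = 0, i + 1
        elif run_score > best_score:
            best_score, best_start, best_end = run_score, run_start, i + 1
    return best_score, best_start, best_end


# Hypothetical sequences: a fully conserved 7-residue core followed by
# a 9-residue region with 5 matches and 4 mismatches.
query = "MKVLAAGDERTSPQWN"
subject = "MKVLAAGAEKTSGQFN"

print(best_ungapped_segment(query, subject, match=2, mismatch=-1))  # lenient scoring
print(best_ungapped_segment(query, subject, match=2, mismatch=-4))  # stricter scoring
```

With the lenient scores the best segment spans all 16 positions; with the stricter scores it stops at the end of the conserved core, even though the sequences themselves have not changed.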