BIMS6000 -- 2020 - Similarity Searching Exercises

To do these exercises in a Zoom breakout room, I recommend that one person in the group run the searches and share their screen with the rest of the group during class time. Every member of the group should also keep a window open to this page, so that you can read the questions here, follow the search results on the shared screen, and discuss the questions and answers together.

These exercises use programs on the FASTA WWW Search page and the BLAST WWW Search page.

In the links below, [pgm] indicates a link with most of the information filled in; e.g. the program name, query, and library. [seq] links go to the NCBI, for more information about the sequence. In general, you should click [pgm] links, but not [seq] links.

Identifying homologs and non-homologs; effects of scoring matrices and algorithms; using domain annotations

Use the FASTA search page [pgm] to compare Honey bee glutathione transferase D1 NP_001171499/ H9KLY5_APIME [seq] (gi|295842263) to the PIR1 Annotated protein sequence database. Be sure to press , not .

  1. Take a look at the output.

    1. How long is the query sequence?
    2. How many sequences are in the PIR1 database?
    3. What scoring matrix was used?
    4. What were the gap penalties? (what is the penalty for a one-residue gap? two residues? -- this is a "trick" question)
    5. What are each of the numbers after the description of the library sequence? Which one is best for inferring homology?
    6. How similar is the highest scoring sequence? What is the difference between %_id and %_sim? Why is there no 100% identity match?
    7. Looking at an alignment, where are the boundaries of the alignment (the best local region)? How many gaps are in the best alignment? The second best?
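
For the gap penalty question, it can help to write the arithmetic out. The sketch below assumes an affine gap penalty and shows the two common ways a program can report it: an "open" value that does not include the first residue, or a "first residue" value that does. The function name and the flag are made up for illustration, and the 11/1 values are simply the ones named in the BLASTP62/-11/-1 search further down. Check the documentation of the program you ran to see which convention it reports.

```python
def gap_cost(gap_open, gap_extend, length, open_includes_first=False):
    """Cost (as a positive number) of one gap spanning `length` residues.

    open_includes_first=False: cost = open + extend * length
    open_includes_first=True:  cost = open + extend * (length - 1)
    Both are affine penalties; they differ only in how the first
    gapped residue is counted.
    """
    if length < 1:
        raise ValueError("gap length must be at least 1")
    if open_includes_first:
        return gap_open + gap_extend * (length - 1)
    return gap_open + gap_extend * length


# Compare the two conventions for gaps of one, two, and three residues.
for k in (1, 2, 3):
    separate = gap_cost(11, 1, k)
    first_residue = gap_cost(11, 1, k, open_includes_first=True)
    print(k, separate, first_residue)
```

The same pair of reported numbers gives different costs under the two conventions, which is why the one-residue case is worth working out by hand.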

  2. Homologs, non-homologs, and the statistical control.

    1. What is the highest (worst) E()-value shown? What should the highest (worst) E()-value calculated in the search be (approximately)?
    2. Which alignment has the worst statistically significant (E()<0.001) score? Do you think this sequence is likely to be homologous?
    3. What is the highest scoring (most significant) non-homolog? (The non-homolog with the highest alignment score, or the lowest E()-value.) Why do you think it is not homologous? Look for positive evidence (e.g. a non-homologous domain) for non-homology.

      You can use the domain diagrams (colors) to identify distant homologs, and, by elimination, the highest scoring non-homolog. You can also use the Sequence Lookup link to Uniprot to look at Domains and Families.

    4. If the statistical estimates are accurate, what should the E()-value for the highest non-homolog (the highest score by chance) be? (This is a control for statistical accuracy.)
    5. What is the E()-value of the most distant homolog shown (based on displayed domain content)? Could there be more distant homologs in the database?
    6. How would you confirm that your candidate non-homolog was truly unrelated? (Hint - compare your candidate non-homolog with SwissProt or QFO78/Uniprot Ref for a more comprehensive test.)
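
The questions in this part rest on one relationship: an E()-value is the expected number of library sequences that would score at least as well by chance, so for a fixed pairwise probability it grows in proportion to the number of sequences searched. The sketch below is a minimal illustration of that scaling. The function names are hypothetical and the library sizes are invented round numbers, not the actual sizes of PIR1 or SwissProt.

```python
import math


def evalue(pairwise_p, n_library_seqs):
    """Expected number of chance hits at least this good in the whole search."""
    return pairwise_p * n_library_seqs


def rescale_evalue(e, old_n_seqs, new_n_seqs):
    """E()-value the same alignment score would get in a different-sized library."""
    return e * new_n_seqs / old_n_seqs


def prob_at_least_one(e):
    """Poisson probability of seeing one or more chance hits, given E()."""
    return -math.expm1(-e)


# Invented sizes, for illustration only.
small_library = 10_000
large_library = 500_000

e_small = evalue(1e-6, small_library)
e_large = rescale_evalue(e_small, small_library, large_library)
print(e_small, e_large, prob_at_least_one(e_large))
```

The same alignment score becomes less significant in a larger library, which is worth keeping in mind when you re-run a candidate non-homolog against a more comprehensive database.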

  3. Domains and alignment regions

    1. There are three parts to the domain display: the domain structure of the query (top) sequence (if available), the domain structure of the library (bottom) sequence, and the domain alignment boundaries in the middle (inside the alignment box). The boundaries and colors of the domains shown inside the alignment box match the Region: sub-alignment scores.
    2. Note that the alignment of Honey bee GSTD1 and SSPA_ECO57 includes portions of both the N-terminal and C-terminal domains, but neither domain is completely aligned. Why do you think the alignments do not include the complete domains?
    3. Is your explanation for the partial domain alignment consistent with the argument that domains have a characteristic length? How might you test whether a complete domain is present?
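
One rough way to quantify "partially aligned" is to ask what fraction of an annotated domain falls inside the alignment boundaries. The sketch below assumes 1-based, inclusive coordinates on a single sequence. The function name and all coordinates are hypothetical, not values from the GSTD1 or SSPA_ECO57 annotations.

```python
def domain_coverage(aln_start, aln_end, dom_start, dom_end):
    """Fraction of the domain (1-based, inclusive) that lies inside the alignment."""
    if aln_start > aln_end or dom_start > dom_end:
        raise ValueError("start must not exceed end")
    overlap_start = max(aln_start, dom_start)
    overlap_end = min(aln_end, dom_end)
    overlap_len = max(0, overlap_end - overlap_start + 1)
    return overlap_len / (dom_end - dom_start + 1)


# Hypothetical coordinates: an alignment covering residues 21-150 of a
# sequence whose annotated domain spans residues 101-200.
print(domain_coverage(21, 150, 101, 200))
```

A low coverage value only says that the alignment stopped early. It does not say whether the rest of the domain is absent or simply too diverged to extend the local alignment.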

  4. Repeat the GSTD1 search [pgm] using the BLASTP62/-11/-1 scoring matrix that BLAST uses.
    Re-examine the honey bee/SSPA_ECO57 alignment.

    1. Are both Glutathione transferase domains present in the honey bee protein?
    2. Look at the alignments to the homologs above and below SSPA_ECO57. Based on those alignments, do you think the Glutathione-S-Trfase C-like domain is really missing from the honey bee protein?
    3. Why did the alignment become shorter?
    4. Why would a domain appear to be present in the first (BLOSUM50) search, but not in the second (BLOSUM62)?
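
One possible way to think about the last two questions is that a local alignment reports the highest-scoring contiguous region, and where that region ends depends on how heavily the scoring scheme penalizes weakly conserved columns. The sketch below is a toy, ungapped illustration of that idea. The two scoring schemes are invented numbers, not BLOSUM50 or BLOSUM62 values, and the column string (I = identical, S = similar, X = mismatch) is made up.

```python
# Invented per-column scores; NOT real BLOSUM values.
LENIENT = {"I": 5, "S": 2, "X": -1}    # tolerates weakly conserved columns
STRINGENT = {"I": 5, "S": 1, "X": -3}  # penalizes mismatches more heavily


def best_local_segment(column_scores):
    """Return (score, start, end) of the maximal-scoring run; end is exclusive."""
    best = (0, 0, 0)
    run_score, run_start = 0, 0
    for i, s in enumerate(column_scores):
        if run_score <= 0:
            # A run that has dropped to zero or below is not worth extending.
            run_score, run_start = 0, i
        run_score += s
        if run_score > best[0]:
            best = (run_score, run_start, i + 1)
    return best


def score_columns(columns, scheme):
    """Translate a string of column classes into per-column scores."""
    return [scheme[c] for c in columns]


# A well-conserved core followed by a weakly conserved flank.
columns = "IISIIISI" + "XSXSXISXSS"

print(best_local_segment(score_columns(columns, LENIENT)))    # (45, 0, 18)
print(best_local_segment(score_columns(columns, STRINGENT)))  # (32, 0, 8)
```

With the lenient scheme the best region runs through the weak flank. With the stringent scheme it stops at the end of the core, even though the underlying columns are identical. Real searches also involve gaps and full substitution matrices, so treat this only as intuition for why alignment boundaries can move when the matrix changes.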

Contact Bill Pearson: