BIMS6000 - Similarity searching exercises

These exercises use programs on the FASTA WWW Search page and the BLAST WWW Search page.

In the links below, [pgm] indicates a link with most of the information filled in; e.g. the program name, query, and library. [seq] links go to the NCBI, for more information about the sequence. In general, you should click [pgm] links, but not [seq] links.

Identifying homologs and non-homologs; effects of scoring matrices and algorithms; using domain annotations

1. Use the FASTA search page [pgm] to compare Honey bee glutathione transferase D1 NP_001171499/ H9KLY5_APIME [seq] (gi|295842263) to the PIR1 Annotated protein sequence database. Be sure to press , not .

  1. Take a look at the output.

    1. How long is the query sequence?
    2. How many sequences are in the PIR1 database?
    3. What scoring matrix was used?
    4. What were the gap penalties? (what is the penalty for a one-residue gap? two residues?)
    5. What are each of the numbers after the description of the library sequence? Which one is best for inferring homology?
    6. How similar is the highest scoring sequence? What is the difference between %_id and %_sim? Why is there no 100% identity match?
    7. Looking at an alignment, where are the boundaries of the alignment (the best local region)? How many gaps are in the best alignment? The second best?
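As a rough aid for the gap-penalty question above, here is a minimal sketch of how an affine gap penalty is usually totalled. It assumes the convention in which a gap pays the open penalty once plus the extend penalty for every residue in the gap; programs differ on this, so check the documentation for the one you are running. The function name `gap_cost` is made up for illustration, and the -11/-1 values are simply the ones named later in this exercise for the BLAST-style matrix, not necessarily what your first search used.

```python
def gap_cost(length, gap_open, gap_extend):
    """Total penalty for one gap of `length` residues under an affine model.

    Assumed convention: gap_open is charged once per gap, and gap_extend
    is charged for every residue in the gap, including the first.
    """
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * length


# Illustrative penalties only; read the real values from your search output.
example_open, example_extend = -11, -1
for n_residues in (1, 2, 3):
    print(n_residues, gap_cost(n_residues, example_open, example_extend))
```

Under this convention, each extra residue in a gap costs far less than opening a new gap, which is why alignments tend to show a few longer gaps rather than many one-residue gaps.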

  2. Homologs, non-homologs, and the statistical control.

    1. What is the highest scoring non-homolog? (The non-homolog with the highest alignment score, or the lowest E()-value.) If the statistical estimates are accurate, what should the E()-value for the highest non-homolog (the highest score by chance) be? (This is a control for statistical accuracy.)

      You can use the domain diagrams (colors) to identify distant homologs, and, by elimination, the highest scoring non-homolog. You can also use the Sequence Lookup link to Uniprot to look at Domains and Families.

    2. What is the E()-value of the most distant homolog shown (based on displayed domain content)? Could there be more distant homologs?
    3. How would you confirm that your candidate non-homolog was truly unrelated? (Hint - compare your candidate non-homolog with SwissProt or QFO78/Uniprot Ref for a more comprehensive test.)
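The statistical control in this part can be sketched numerically. The sketch below assumes the common approximation that the expectation value is the per-comparison probability multiplied by the number of library sequences searched, and that the number of chance hits is roughly Poisson distributed. The function names and the library sizes are hypothetical and chosen only for illustration.

```python
import math


def expect_value(p_value, n_library_sequences):
    """E() = p * D: expected number of chance scores this good in D comparisons."""
    return p_value * n_library_sequences


def prob_at_least_one(e_value):
    """Chance of seeing one or more such scores, assuming Poisson-distributed hits."""
    return 1.0 - math.exp(-e_value)


# Hypothetical library sizes; substitute the counts reported in your output.
small_library = 10_000
large_library = 500_000

# A per-comparison probability of 1/D gives E() = 1 in a library of D sequences.
p = 1.0 / small_library
print(expect_value(p, small_library), prob_at_least_one(expect_value(p, small_library)))

# The same alignment score is less significant in a larger library.
print(expect_value(p, large_library))
```

This is why the same pairwise alignment can have a different E()-value when you repeat the comparison against a more comprehensive library.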

  3. Domains and alignment regions

    1. There are three parts to the domain display, the domain structure of the query (top) sequence (if available), the domain structure of the library (bottom) sequence, and the domain alignment boundaries in the middle (inside the alignment box). The boundaries and color of the alignment domain coloring match the Region: sub-alignment scores.
    2. Note that the alignment of Honey bee GSTD1 and SSPA_ECO57 includes portions of both the N-terminal and C-terminal domains, but neither domain is completely aligned. Why do you think the alignments do not include the complete domains?
    3. Is your explanation for the partial domain alignment consistent with the argument that domains have a characteristic length? How might you test whether a complete domain is present?

      In the sub-alignment scores, the Q value is -10 * log10(p) for the sub-alignment score, so Q=30.0 corresponds to p = 0.001 (and Q > 30.0 to p < 0.001).
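The Q-to-p conversion described in the note above can be checked with a few lines of Python. This is a minimal sketch that assumes the logarithm is base 10, which is what the Q=30.0 example implies; the function names are made up for illustration.

```python
import math


def q_to_p(q):
    """Convert a Q value to a probability, assuming Q = -10 * log10(p)."""
    return 10 ** (-q / 10.0)


def p_to_q(p):
    """Convert a probability (0 < p <= 1) to a Q value."""
    return -10.0 * math.log10(p)


for q in (10.0, 20.0, 30.0):
    print(q, q_to_p(q))
```

Each additional 10 units of Q corresponds to a ten-fold smaller probability, so small differences in Q between two regions reflect modest differences in significance.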

  4. Repeat the GSTD1 search [pgm] using the BLASTP62/-11/-1 scoring matrix that BLAST uses. Re-examine the SSPA_ECO57 alignment. Are both Glutathione transferase domains present? Look at the alignments to the homologs above and below SSPA_ECO57. Based on those alignments, do you think the Glutathione-S-Trfase C-like domain is really missing? Why did the alignment become so much shorter?

  5. One of the candidate non-homologs is sp|Q9SI20|EF1D2_ARATH, with an E()-value of 0.11.
    1. Does the domain structure of EF1D2_ARATH suggest that it could be a glutathione transferase homolog?
    2. Use the General re-search link to explore the domains contained in EF1D2_ARATH homologs found in SwissProt. Use the "Use subset range" option to limit the search to the N-terminal region.
    3. Does this secondary search support homology or non-homology? What is the percent identity of the first significant alignment with an annotated domain? What is the E()-value?
    4. The portion of the protein you are searching with is very short. Based on the percent identity that you saw in the previous search, what would be a better scoring matrix? Why?
    5. Try a search with that scoring matrix. What is the new E()-value with the scoring matrix you chose?
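One way to reason about why a short query calls for a different matrix is to estimate how many bits of score a significant hit needs, and how many aligned positions it takes to collect them. The sketch below assumes the standard bit-score relationship E = m * n * 2^(-S), with m the query length and n the total library length in residues. The query length, library size, and bits-per-position figures are hypothetical placeholders, not measured properties of any particular matrix.

```python
import math


def bits_needed(query_len, library_residues, target_e):
    """Bit score S giving E = target_e, from E = m * n * 2**(-S)."""
    return math.log2(query_len * library_residues / target_e)


def min_alignment_length(bits, bits_per_position):
    """Aligned positions needed to accumulate `bits` at a given average yield."""
    return math.ceil(bits / bits_per_position)


# Hypothetical numbers for illustration only.
query_len = 50              # a short N-terminal subset
library_residues = 2e8      # total residues in the library
needed = bits_needed(query_len, library_residues, target_e=0.001)
print(f"bits needed: {needed:.1f}")

# Hypothetical average information content per aligned position.
for label, bits_per_position in (("shallower matrix", 2.0), ("deeper matrix", 0.4)):
    print(label, min_alignment_length(needed, bits_per_position))
```

If the required alignment length exceeds the length of the region you are searching with, that matrix cannot produce a significant score for it, however real the homology is.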

2. Exploring domains and over-extension with local alignments -- death associated protein kinase (DAPK1_HUMAN)
  1. Look up the domain structure of DAPK1_HUMAN at Pfam [pgm].
    1. What are the major (PfamA) domain regions on the protein?
    2. Which of the domains is repeated?
    3. In a local (LALIGN) alignment, where would you expect to see overlapping domains like those in Calmodulin (CALM_HUMAN) and Cortactin (SRC8_HUMAN)?

  2. Use lalign/plalign [pgm] to examine local similarities between DAPK1_HUMAN and itself. Check the options to "annotate sequence 1 domains" and "annotate sequence 2 domains". Annotate one of the sequences with "Interpro Domains/UniProt features", and the other with "Uniprot Domains/Uniprot Features". Do you see the domains you expected from Pfam? Do they map in the same places?

  3. Repeat the LALIGN/PLALIGN analysis lalign/plalign [pgm], but select the subset of the protein where the repeated domains are found (350-700) on both the query (first) and subject (second) sequence. Looking at the first or second non-identical self-alignment:
    1. What is the overall percent identity of the alignment?
    2. What is the range in identity across the different aligned ankyrin domains?
    3. Do the ends of the first alignment correspond to the domain boundaries?
    4. How long are the ankyrin domains?
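To compare overall identity with the identity of individual domains, it can help to compute percent identity over chosen column ranges of an alignment. The sketch below is one possible definition: it counts only columns where neither sequence has a gap, which may differ from the denominator your alignment program uses. The two aligned strings are invented toy data, not DAPK1_HUMAN sequence, and the column ranges stand in for domain boundaries.

```python
def percent_identity(aligned_a, aligned_b, start=0, end=None):
    """Percent identity over alignment columns [start, end).

    Assumption: the denominator is the number of columns in the window
    where neither sequence has a gap character ('-').
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = list(zip(aligned_a, aligned_b))[start:end]
    compared = [(a, b) for a, b in columns if a != "-" and b != "-"]
    if not compared:
        return 0.0
    matches = sum(1 for a, b in compared if a == b)
    return 100.0 * matches / len(compared)


# Toy alignment, invented for illustration.
seq_a = "ACDEFGHIKL"
seq_b = "ACDQFG-IKV"

print(round(percent_identity(seq_a, seq_b), 1))         # whole alignment
print(round(percent_identity(seq_a, seq_b, 0, 5), 1))   # first hypothetical region
print(round(percent_identity(seq_a, seq_b, 5, 10), 1))  # second hypothetical region
```

The overall figure is an average, so individual regions can sit well above or below it; that spread is what the question about the range of identities is asking you to notice.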

  4. Based on the percent identities you saw in part 3, what would the appropriate scoring matrix be to accurately identify the ankyrin domains?
    1. Using a "correct" scoring matrix, are the alignment boundaries more accurate?
    2. What is the percent identity of the alignment (did you pick the right matrix?)

3. Exploring domains and alignment over-extension -- cortactin (SRC8_HUMAN)

Compare SRC8_HUMAN [pgm] (human cortactin) to the SwissProt protein sequence database.

  1. Looking at the colored rectangles to the right of the list of best scores, what are the green domains and blue domains?
    1. How many proteins have homologous green domains?
    2. How many significant alignments only have a blue domain? Do you think those proteins are homologous?

  2. Looking at the top five alignments, how many cortactin orthologs do you see? (ortholog, same protein, different species).

  3. In the SRC8 HUMAN:CHICK alignment, both the query and the subject (library) sequences align seven cortactin domains and an SH3 domain. In addition, two regions (one before the cortactin domain cluster and one after) are well conserved, but do not have annotated domains (NODOM). Are these non-domain (NODOM) regions as well conserved as the annotated domains?

  4. Look at the SRC8_HUMAN:HCLS1_MOUSE alignment. How many cortactin domains does HCLS1_MOUSE contain? How much score does the NODOM between the cortactin domains and the SH3 domain contribute? Why is it included in the alignment? Is it likely to be homologous?

  5. Is the NODOM between the cortactin domains and the SH3 domain likely to be homologous in the SRC8_HUMAN:DBNLB_XENLA alignment? What data would convince you that the sequences were homologous?

  6. What scoring matrix should be used to reduce over-extension from the SH3 domain?

4. Searching for sequences with known structure -- death associated protein kinase (DAPK1_HUMAN)

Search [pgm] the protein structure database (PDB Structures - NCBI) using the DAPK1_HUMAN protein.

  1. How much of the protein has a known structure?
  2. To double check your answer, search [pgm] the PDB structure sequences using the three domain regions (Kinase, Ankyrin, Death) identified by Pfam and the local domain plots.

    Are there homologs to the Death domain with known structures?

  3. Try searching the protein structure database with a glutathione S-transferase sequence (e.g. GSTT1_DROME [pgm] or GSTM1_HUMAN [pgm]) or a calmodulin sequence (CALM_HUMAN [pgm]), annotating both the query and PDB database. How well do the Interpro domains line up with the structural domains?
