fasta-demo

Computational Genomics ECG2025 -- Similarity Searching Exercises

These exercises use programs on the FASTA WWW Search page and the BLAST WWW Search page.

In the links below, [pgm] indicates a link with most of the information filled in; e.g. the program name, query, and library. [seq] links go to the NCBI, for more information about the sequence. In general, you should click [pgm] links, but not [seq] links.

Identifying homologs and non-homologs; effects of scoring matrices and algorithms; using domain annotations

Use the FASTA search page [pgm] to compare Honey bee glutathione transferase D1 NP_001171499/ H9KLY5_APIME [seq] (gi|295842263) to the PIR1 Annotated protein sequence database. Be sure to press , not .

Take a look at the output.
1. How long is the query sequence?
2. How many sequences are in the PIR1 database?
3. What scoring matrix was used?
4. What were the gap penalties? (what is the penalty for a one-residue gap? two residues? -- this is a "trick" question)
5. What are each of the numbers after the description of the library sequence? Which one is best for inferring homology?
6. How similar is the highest scoring sequence? What is the difference between %_id and %_sim? Why is there no 100% identity match?
7. Looking at an alignment, where are the boundaries of the alignment (the best local region)? How many gaps are in the best alignment? The second best?
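The gap-penalty question above turns on how an open/extend pair is counted. The following is a minimal sketch, assuming the affine convention in which the open penalty is charged once per gap and the extension penalty is charged for every residue in the gap, including the first. The function name `gap_cost` is illustrative, and the 11/1 values are simply the pair quoted later for the BLASTP62 search; check the search output for the penalties actually used.

```python
def gap_cost(length, gap_open, gap_extend):
    """Total cost of one gap of `length` residues under an affine model.

    Assumption: the first residue in the gap costs gap_open + gap_extend,
    and each additional residue costs gap_extend.
    Penalties are passed as positive costs.
    """
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + length * gap_extend


if __name__ == "__main__":
    # Illustrative penalties only (open 11, extend 1).
    for k in (1, 2, 3):
        print(f"gap of {k} residue(s): cost {gap_cost(k, 11, 1)}")
```

Under this convention a one-residue gap never costs just the "open" value, which is why reading the two numbers as "first residue / second residue" gives the wrong answer.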
Homologs, non-homologs, and the statistical control.
1. What is the highest (worst) E()-value shown? What should the highest (worst) E()-value calculated in the search be (approximately)?
2. Which alignment has the worst statistically significant (E()<0.001) score? Do you think this sequence is likely to be homologous?
3. What is the highest scoring (most significant) non-homolog? (The non-homolog with the highest alignment score, or the lowest E()-value.) Why do you think it is not homologous? Look for positive evidence (e.g. a non-homologous domain) for non-homology.
  You can use the domain diagrams (colors) to identify distant homologs, and, by elimination, the highest scoring non-homolog. You can also use the Sequence Lookup link to Uniprot to look at Domains and Families.
4. If the statistical estimates are accurate, what should the E()-value for the highest non-homolog (the highest score by chance) be? (This is a control for statistical accuracy.)
5. What is the E()-value of the most distant homolog shown (based on displayed domain content)? Could there be more distant homologs in the database?
  1. The highest scoring sequence that does not have glutathione transferase in the description is SSPA_ECO57. How can we be certain that this is really a glutathione transferase homolog? Do a search with SSPA_ECO57 [pgm] against a larger sequence database, e.g. SwissProt or QFO78/Uniprot Ref, for a more comprehensive perspective.
    Do you think SSPA_ECOLI is a glutathione transferase homolog?
  2. To find the highest scoring non-homolog, go down the list of high scoring sequences and search those sequences against QFO_uniprot/ref (980K).
    Confirm that your candidate non-homolog is unrelated by using the Re-search w/subject link to search the QFO_uniprot/ref (980K) database with Pfam domain annotation. Search for GST in the results. Do any of the significant hits contain a GST domain?
  3. There are several proteins with "glutathione transferase" in their descriptions that do not have significant scores against the honey bee protein. Why not? How would you show that they are probably glutathione transferase homologs?
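The statistical-control questions above rest on reading an E()-value as an expected count of chance hits. The sketch below assumes the simple model E = p x (number of library sequences), with the chance of seeing at least one such hit given by 1 - exp(-E). The function names and the 10,000-sequence library are hypothetical; the 980,000 figure is the QFO_uniprot/ref size quoted in the exercise.

```python
import math


def expected_chance_hits(p_value, n_sequences):
    """E()-value: expected number of unrelated sequences scoring this well."""
    return p_value * n_sequences


def prob_at_least_one(e_value):
    """Probability of one or more chance hits, assuming a Poisson model."""
    return 1.0 - math.exp(-e_value)


def rescale_evalue(e_value, old_db_size, new_db_size):
    """Same alignment score, different library size (sizes in sequences)."""
    return e_value * new_db_size / old_db_size


if __name__ == "__main__":
    # Hypothetical small library of 10,000 sequences.
    e_small = expected_chance_hits(1e-7, 10_000)
    e_large = rescale_evalue(e_small, 10_000, 980_000)
    print(f"E() in small library: {e_small:.3g}")
    print(f"E() in 980K library:  {e_large:.3g}")
    print(f"P(at least one chance hit at E()=1): {prob_at_least_one(1.0):.2f}")
```

If the estimates are accurate, the best-scoring unrelated sequence in a search should have an E()-value near 1, and the same alignment becomes less significant as the library grows.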
Domains and alignment regions
1. There are three parts to the domain display: the domain structure of the query (top) sequence (if available), the domain structure of the library (bottom) sequence, and the domain alignment boundaries in the middle (inside the alignment box). The boundaries and colors of the domains shown in the alignment match the Region: sub-alignment scores.
2. Note that the alignment of Honey bee GSTD1 and SSPA_ECO57 includes portions of both the N-terminal and C-terminal domains, but neither domain is completely aligned. Why do you think the alignments do not include the complete domains?
3. Is your explanation for the partial domain alignment consistent with the argument that domains have a characteristic length? How might you test whether a complete domain is present?
Repeat the GSTD1 search [pgm] using the BLASTP62/-11/-1 scoring matrix that BLAST uses.
Re-examine the honey bee/SSPA_ECO57 alignment.
1. Are both Glutathione transferase domains present in the honey bee protein?
2. Look at the alignments to the homologs above and below SSPA_ECO57. Based on those alignments, do you think the Glutathione-S-Trfase C-like domain is really missing from the honey bee protein?
3. Why did the alignment become shorter?
4. Why would a domain appear to be present in the first (BLOSUM50) search, but not in the second (BLOSUM62)?
Do the same Honey bee GSTD1 search (295842263) using the Course BLAST [pgm] WWW page (turn on Pfam Annotation on the PIR1 database).
1. Take a look at the output.
  1. How long is the query sequence?
  2. How many sequences are in the PIR1 database?
  3. What scoring matrix was used?
  4. What were the gap penalties?
  5. What are the numbers after the description of the library sequence? Which one is best for inferring homology?
  6. Looking at an alignment, where are the boundaries of the alignment (the best local region)?
2. What is the highest scoring non-homolog?
3. How do the BLASTP E()-values compare with the FASTA (BLASTP62) E()-values for the distantly related mammalian and plant sequences?
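When comparing E()-values between programs, it can help to put scores on the common bit scale first. This is a minimal sketch of the standard relationship E = m x n x 2^(-bits), ignoring the edge-effect corrections that real programs apply to the query and library lengths. The function name and all lengths used here are hypothetical; they are not taken from the search output.

```python
def evalue_from_bits(bits, query_len, db_residues):
    """Expected chance hits for a bit score in a search space of m x n.

    Assumption: no effective-length (edge) correction is applied.
    """
    return query_len * db_residues * 2.0 ** (-bits)


if __name__ == "__main__":
    # Hypothetical query of 200 residues and library of 5,000,000 residues.
    for bits in (25, 30, 40, 50):
        e = evalue_from_bits(bits, 200, 5_000_000)
        print(f"{bits} bits -> E() ~ {e:.3g}")
```

Each additional bit halves the E()-value, so a difference of 10 bits between two programs' scores for the same pair corresponds to roughly a 1000-fold difference in significance.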
Exploring domains with local alignments --- Calmodulin
1. Use lalign/plalign [pgm] to examine local similarities between calmodulin CALM1_HUMAN and itself.
2. Trace the top scoring (non-identical) alignment on the right along the plot of the alignment on the left.
3. How many domains are aligned in the top-scoring alignment?
4. Do the same alignment [pgm], but check the "annotate sequence" boxes to highlight the annotated EF-hand domains.
5. What happens to the domain alignment plot when you use the VT40 scoring matrix? Do you see the same sets of alignments with VT40? With shallow scoring matrices, are the domains fully aligned?
6. What parts of the domains align with VT20?
Exploring domains with local alignments --- Cortactin (SRC8_HUMAN)
1. Use lalign/plalign [pgm] to examine local similarities between SRC8_HUMAN and itself. Check the options to "annotate sequence 1 domains" and "annotate sequence 2 domains".
  1. Identify the third highest scoring alignment on both the alignment plot and the text alignment.
  2. What is the percent identity of the alignment?
  3. How many domains are aligned in this alignment?
  4. Estimate the fraction identical in the first 20 residues of the alignment. In the last 100 residues.
  5. Do the ends of the alignment correspond to the domain boundaries?
2. Based on the percent identities you saw in the sequence alignment (a), what would the appropriate scoring matrix be to accurately identify the cortactin domains?
  1. Using a "correct" scoring matrix, identify the third highest scoring alignment on both the alignment plot and the text alignment.
  2. What is the percent identity of the alignment? (Did you pick the right matrix?)
  3. How many domains are aligned?
  4. Do the ends of the alignment correspond to the domain boundaries?
3. Try some other scoring matrices. What happens to the relationship between domain ends and alignment ends when the scoring matrix is too deep (BP62)? Too shallow (VT10)?
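Estimating the fraction identical in different parts of an alignment, as asked above, can be done by counting matching columns in a window. The sketch below assumes the two aligned sequences are given as equal-length strings with `-` for gaps, and that gap columns are left out of the denominator (programs differ on this). The function name and the toy sequences are hypothetical, not taken from the cortactin alignment.

```python
def percent_identity(aln_a, aln_b, start=0, end=None):
    """Percent identity over alignment columns [start, end).

    Assumption: columns containing a gap ('-') in either sequence
    are excluded from both the numerator and the denominator.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    columns = list(zip(aln_a, aln_b))[start:end]
    aligned = [(a, b) for a, b in columns if a != "-" and b != "-"]
    if not aligned:
        return 0.0
    identical = sum(1 for a, b in aligned if a == b)
    return 100.0 * identical / len(aligned)


if __name__ == "__main__":
    # Toy alignment for illustration only.
    top = "ACDEFGHIKL"
    bot = "ACDQFG-IKV"
    print(f"whole alignment: {percent_identity(top, bot):.1f}%")
    print(f"first 5 columns: {percent_identity(top, bot, 0, 5):.1f}%")
    print(f"last 5 columns:  {percent_identity(top, bot, -5):.1f}%")
```

Comparing windows at the ends of an alignment with the alignment as a whole is one way to see whether a deep matrix has extended the alignment into a region with little real similarity.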
Working with short sequences -- when the scoring matrix matters -- (Thursday presentation)
1. Use FASTX [pgm] to compare the Honey bee GSTD1 mRNA (NM_001178028, gi|295842262) to the PIR1 database. How does the sensitivity of this translated DNA vs protein search compare with your earlier protein:protein search? Why might a translated DNA vs protein search be less sensitive than a protein:protein library search? (Hint: how long is the mRNA?)
2. Now do the same search, but use only exon 3 of the Honey bee GSTD1 gene, which corresponds to nt 456-597. Use the Subset-range to select the exon 3 nucleotides from GSTD1 and run FASTX [pgm] again.
  1. What is the E()-value of the most distantly related homolog with BLOSUM50? What is the percent identity for this alignment?
  2. How long is the longest possible translated amino-acid sequence? How long is the protein alignment?
  3. Why do you think the search is so much less sensitive?
  4. What is the percent identity for the GSTA1_RAT alignment?
  5. Looking at percent identities in the Scoring matrix drop-down, what might be a better scoring matrix for the GSTA1_RAT alignment?
  6. What is the E()-value for that alignment with the new scoring matrix?
  7. What is the E()-value of the highest scoring non-homolog with that scoring matrix?
3. Repeat the same translated FASTX [pgm] search, using exon 2 of the Honey bee GSTD1 gene, which corresponds to nt 397-455.
  1. What is the best E()-value with BLOSUM50? What is the percent identity for this alignment?
  2. How long is the longest possible translated amino-acid sequence? How long is the protein alignment?
  3. What would be a more appropriate scoring matrix based on the percent identity?
  4. What are the E()-values with that scoring matrix?
  5. What is the E()-value of the highest scoring non-homolog with that scoring matrix?
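The short-exon questions above come down to how few residues are available to carry the score. This sketch computes the longest possible translation of an inclusive nucleotide range and, assuming the uncorrected relationship E = m x n x 2^(-bits), the bit score needed to reach a given E()-value. The function names and the 5,000,000-residue library size are hypothetical; the nucleotide ranges are the exon coordinates given in the exercise.

```python
import math


def max_translated_length(nt_start, nt_end):
    """Most complete codons that fit in the inclusive range nt_start..nt_end."""
    n_nt = nt_end - nt_start + 1
    return n_nt // 3


def bits_needed(query_len_aa, db_residues, evalue):
    """Bit score required for the target E()-value (no edge correction)."""
    return math.log2(query_len_aa * db_residues / evalue)


if __name__ == "__main__":
    db_residues = 5_000_000  # hypothetical library size in residues
    for name, (start, end) in {"exon 3": (456, 597), "exon 2": (397, 455)}.items():
        n_aa = max_translated_length(start, end)
        bits = bits_needed(n_aa, db_residues, 0.001)
        print(f"{name}: up to {n_aa} aa, needs {bits:.1f} bits "
              f"({bits / n_aa:.2f} bits per residue)")
```

The required bits per aligned residue rise sharply as the query gets shorter, which is the motivation for choosing a shallower matrix that delivers more information per position at the expected percent identity.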

Where to get the FASTA package: github.com/wrpearson/fasta36

The "normal" FASTA WWW site:

Contact Bill Pearson: wrp@virginia.edu