In the links below, [pgm]
indicates a link with most of the information filled in (e.g. the
program name, query, and
library). [seq] links go to the
NCBI for more information about the sequence. In general, you should
click [pgm] links, but not [seq] links.
Identifying homologs and non-homologs; effects of scoring matrices and algorithms; using domain annotations
1. Use the FASTA search page [pgm] to compare Honey bee
glutathione transferase D1 NP_001171499/ H9KLY5_APIME [seq] (gi|295842263) to the PIR1 Annotated protein sequence database. Be sure to press , not .
Take a look at the output.
How long is the query sequence?
How many sequences are in the PIR1 database?
What scoring matrix was used?
What were the gap penalties? (what is the penalty for a one-residue gap? two residues?)
What are each of the numbers after the description of the library sequence? Which one is best for inferring homology?
How similar is the highest scoring sequence? What is the difference between %_id and %_sim? Why is there no 100% identity match?
Looking at an alignment, where are the boundaries of the alignment (the best local region)? How many gaps are in the best alignment? The second best?
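The gap-penalty question above is easier to reason about with the arithmetic written out. The sketch below assumes an affine model in which a gap of n residues costs open + n * extend; whether a given program counts the first gapped residue inside the open penalty is a convention to check against the search output. The function name is illustrative, and the example values are borrowed from the BLASTP62/-11/-1 label used later in this exercise, not from the default search.

```python
def gap_cost(n_residues, gap_open, gap_extend):
    """Total penalty for a single gap of n_residues under an affine model.

    Assumed convention: cost = gap_open + n_residues * gap_extend,
    with both penalties written as negative numbers.
    """
    if n_residues < 1:
        raise ValueError("a gap must span at least one residue")
    return gap_open + n_residues * gap_extend


# Example values only; read the real penalties from the search output header.
for n in (1, 2, 5):
    print(n, gap_cost(n, gap_open=-11, gap_extend=-1))
```

Under this convention each additional residue in a gap costs much less than the first one, which is why a few long gaps are usually cheaper than many short ones.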
Homologs, non-homologs, and the statistical control.
What is the highest scoring non-homolog? (The non-homolog with the
highest alignment score, or the lowest E()-value.)
If the statistical estimates are accurate, what should the E()-value for the highest non-homolog (the highest score by chance) be? (This is a control for statistical accuracy.)
You can use the domain diagrams (colors) to identify distant homologs, and, by elimination, the highest scoring non-homolog. You can also use the Sequence Lookup link to UniProt to look at Domains and Families.
What is the E()-value of the most distant homolog shown (based on displayed domain content)? Could there be more distant homologs?
How would you confirm that your candidate non-homolog was truly
unrelated? (Hint - compare your candidate non-homolog
with SwissProt or QFO78/Uniprot Ref for a more comprehensive test.)
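The statistical-control question above depends on how the probability of a score in one comparison relates to the number of such scores expected across a whole search. The sketch below assumes the simple relation E = p * D, where D is the number of library sequences searched, plus a Poisson model for the chance of seeing at least one such score. The function names and the numbers are hypothetical, not values from the PIR1 search.

```python
import math


def expected_hits(p_single, n_library_seqs):
    """Expected number of chance scores this good in a search of D sequences."""
    return p_single * n_library_seqs


def prob_at_least_one(e_value):
    """Poisson probability of one or more chance scores, given expectation E."""
    return -math.expm1(-e_value)  # same as 1 - exp(-E), more accurate for small E


# Hypothetical numbers: a per-comparison probability and a library size.
e = expected_hits(p_single=1e-4, n_library_seqs=10_000)
print(f"E = {e:.3g}, P(at least one) = {prob_at_least_one(e):.3f}")
```

If unrelated sequences consistently appear with E()-values far below what this relation predicts for the best chance score, the statistical estimates are probably too optimistic.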
Domains and alignment regions
There are three parts to the domain display: the domain structure of
the query (top) sequence (if available), the domain structure of the library (bottom)
sequence, and the domain alignment boundaries in the middle (inside the
alignment box). The boundaries and colors of the domains in the alignment
box match the Region: sub-alignment scores.
Note that the alignment of Honey bee GSTD1
and SSPA_ECO57 includes portions of both the N-terminal and
C-terminal domains, but neither domain is completely aligned. Why do
you think the alignments do not include the complete domains?
Is your explanation for the partial domain alignment consistent with
the argument that domains have a characteristic length? How might you
test whether a complete domain is present?
In the sub-alignment scores, the Q value is -10 * log10(p) for
the sub-alignment score, so Q=30.0 corresponds to p = 0.001 (and Q > 30.0 means p < 0.001).
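The Q value conversion can be sketched in a few lines. The code assumes the logarithm is base 10, which is the reading consistent with Q=30.0 matching p near 0.001; the function names are illustrative.

```python
import math


def q_from_p(p):
    """Q value for a probability p, assuming Q = -10 * log10(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -10.0 * math.log10(p)


def p_from_q(q):
    """Inverse conversion: the probability that corresponds to a Q value."""
    return 10.0 ** (-q / 10.0)


# Each 10 units of Q is one order of magnitude in p.
for q in (10.0, 20.0, 30.0):
    print(q, p_from_q(q))
```

Reading the Region: lines this way makes it quick to see which sub-alignments carry most of the statistical support and which contribute almost nothing.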
Repeat the GSTD1 search [pgm] using the BLASTP62/-11/-1
scoring matrix that
BLAST uses. Re-examine the SSPA_ECO57 alignment. Are
both Glutathione transferase domains present? Look at the alignments
to the homologs above and below SSPA_ECO57. Based on those
alignments, do you think the Glutathione-S-Trfase C-like domain is
really missing? Why did the alignment become so much shorter?
One of the candidate non-homologs is sp|Q9SI20|EF1D2_ARATH,
with an E()-value of 0.11.
Does the domain structure of EF1D2_ARATH suggest that it could be a glutathione
General Research to explore the domains contained
in EF1D2_ARATH homologs found in SwissProt. Use the "Use subset range" option to limit the search to the N-terminal region.
Does this secondary search support homology or non-homology? What is the percent identity of the first significant alignment with an annotated domain? What is the E()-value?
The portion of the protein you are searching with is very short. Based on the percent identity that you saw in the previous search, what would be a better scoring matrix? Why?
Try a search with that scoring matrix. What is the new E()-value with
the scoring matrix you chose?
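One way to see why a short query region constrains the choice of scoring matrix is to estimate how many bits of score a significant alignment needs. The sketch below assumes the standard bit-score relation E = m * n * 2^(-S), with m the query length and n the total residues in the library. The function names and the numbers are hypothetical, not values from this search.

```python
import math


def bits_required(query_len, library_residues, e_value):
    """Bit score needed to reach e_value, assuming E = m * n * 2**(-bits)."""
    return math.log2(query_len * library_residues / e_value)


def bits_per_position(query_len, library_residues, e_value, aligned_len):
    """Average bits each aligned position must contribute to reach e_value."""
    return bits_required(query_len, library_residues, e_value) / aligned_len


# Hypothetical library of 1e8 residues, E() threshold of 0.001.
for aligned_len in (50, 100, 200):
    need = bits_per_position(aligned_len, 100_000_000, 0.001, aligned_len)
    print(aligned_len, round(need, 2))
```

Shorter alignments must deliver more bits per aligned position, so a matrix that targets a higher percent identity (and yields more information per position) can reach significance where a matrix tuned for very distant relationships cannot.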
2. Exploring domains and over-extension with local alignments -- death associated protein kinase (DAPK1_HUMAN)
Look up the domain structure of DAPK1_HUMAN
at Pfam [pgm].
What are the major (PfamA) domain regions on the protein?
Which of the domains is repeated?
In a local (LALIGN) alignment, where would you expect to see
overlapping domains like those in Calmodulin (CALM_HUMAN) and
Use lalign/plalign [pgm] to examine local
similarities between DAPK1_HUMAN and itself. Check the
options to "annotate sequence 1 domains" and "annotate sequence 2
domains". Annotate one of the sequences with "Interpro Domains/UniProt features", and the other with "Uniprot Domains/Uniprot Features".
Do you see the domains you expected from Pfam? Do they map in the same places?
Repeat the LALIGN/PLALIGN
analysis lalign/plalign [pgm], but select the
subset of the protein where the repeated domains are found (350-700) on both the query (first) and subject (second) sequence.
Looking at the first or second non-identical self-alignment:
What is the overall percent identity of the alignment?
What is the range in identity across the different aligned ankyrin domains?
Do the ends of the first alignment correspond to the domain boundaries?
How long are the ankyrin domains?
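The overall and per-domain percent identities asked about above can be computed directly from a pair of aligned strings. This is a minimal sketch: the two strings are invented toy data, the region boundaries are hypothetical alignment-column indices, and gap columns are excluded from the denominator, which is an assumption since programs differ on that point.

```python
def percent_identity(aln_a, aln_b, start=0, end=None):
    """Percent identity over alignment columns [start, end), ignoring gap columns."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned strings must have the same length")
    columns = zip(aln_a[start:end], aln_b[start:end])
    aligned = [(a, b) for a, b in columns if a != "-" and b != "-"]
    if not aligned:
        return 0.0
    identical = sum(a == b for a, b in aligned)
    return 100.0 * identical / len(aligned)


# Toy alignment (not real DAPK1 data); '-' marks a gap.
seq_a = "ACDEFGHIKL"
seq_b = "ACDQFG-IKV"

print(round(percent_identity(seq_a, seq_b), 1))         # whole alignment
for name, (s, e) in {"region 1": (0, 5), "region 2": (5, 10)}.items():
    print(name, round(percent_identity(seq_a, seq_b, s, e), 1))
```

Comparing the per-region values with the overall value shows whether one well-conserved repeat is carrying an alignment that extends into poorly conserved neighbors.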
Based on the percent identities you saw in part (c), what would the
appropriate scoring matrix be to accurately identify the ankyrin
Using a "correct" scoring matrix, are the alignment boundaries more accurate?
What is the percent identity of the alignment (did you pick the right matrix?)
3. Exploring domains and alignment over-extension -- cortactin (SRC8_HUMAN)
Compare SRC8_HUMAN [pgm] (human cortactin) to the SwissProt protein sequence database.
Looking at the colored rectangles to the right of the list of
best scores, what are the green domains and blue domains?
How many proteins have homologous green domains?
How many significant alignments only have a blue domain? Do you think those proteins are homologous?
Looking at the top five alignments, how many cortactin orthologs
do you see? (ortholog, same protein, different species).
In the SRC8 HUMAN:CHICK alignment, both the query and the subject (library) sequences align seven cortactin domains and an SH3 domain. In addition, two regions (one before the cortactin domain cluster and one after) are well conserved, but do not have annotated domains (NODOM). Are these non-domain (NODOM) regions as well conserved as the annotated domains?
Look at the SRC8_HUMAN:HCLS1_MOUSE alignment. How many cortactin
domains does HCLS1_MOUSE contain? How much score does the NODOM between the cortactin domains and the SH3 domain contribute? Why is it included in the alignment? Is it likely to be homologous?
Is the NODOM between the cortactin domains and the SH3 domain likely
to be homologous in the SRC8_HUMAN:DBNLB_XENLA alignment? What data would convince you that the sequences were homologous?
What scoring matrix should be used to reduce over-extension from the SH3 domain?
4. Searching for sequences with known structure -- death associated protein kinase (DAPK1_HUMAN)
Search [pgm] the protein structure database (PDB Structures - NCBI) using the DAPK1_HUMAN protein.
How much of the protein has a known structure?
To double check your
answer, search [pgm] the PDB structure sequences using
the three domain regions (Kinase, Ankyrin, Death) identified by Pfam
or the local domain plots.
Are there homologs to the Death domain with known structures?
Try searching the protein structure database with a glutathione S-transferase sequence (e.g. GSTT1_DROME [pgm] or GSTM1_HUMAN [pgm]) or a calmodulin sequence (CALM_HUMAN [pgm]), annotating both the query and PDB database. How well do the InterPro domains line up with the structural domains?