In the links
below, [pgm]
indicates a link with most of the information filled in; e.g. the
program name, query, and
library. [seq] links go to the
NCBI, for more information about the sequence. In general, you should
click [pgm] links, but not [seq] links.
Identifying homologs and non-homologs; effects of scoring matrices and algorithms; using domain annotations
What were the gap penalties? (what is the penalty for a one-residue gap? two residues?)
What are each of the numbers after the description of the library sequence? Which one is best for inferring homology?
Looking at an alignment, where are the boundaries of the alignment (the best local region)? How many gaps are in the best alignment? The second best?
What is the highest scoring non-homolog? (The non-homolog with the highest alignment score, or the lowest E()-value.) How would you confirm that your candidate non-homolog was truly unrelated?
Homology (and non-homology) can also be inferred by domain relationships. Try the same
Honey bee GSTD1 search [pgm] search using the Annotate Query and
Database Annotations: set to show .
Does the domain display change your mind about the highest scoring non-homolog?
There are three parts to the domain display, the domain structure of
the query (top) sequence, the domain structure of the library (bottom)
sequence, and the domain alignment score in the middle (inside the
alignment box). The boundaries and color of the alignment domain
coloring match the Region: sub-alignment scores.
Note that the alignment of Honey bee GSTD1
and SSPA_ECO57 includes portions of both the N-terminal and
C-terminal domains, but neither domain is completely aligned. Why do
you think the alignments do not include the complete domains?
Is your explanation for the partial domain alignment consistent the
the argument that domains are atomic? How might you test whether a
complete domain is present?
Note that this Honey bee glutathione transferase shares significant
similarity with both sequences from bacteria (SSPA_ECO57, stringent
starvation protein) and mammals. How might you test whether the
stringent starvation protein is homologous to glutathione
transferases? (Hint - compare your candidate non-homolog
with SwissProt for a more comprehensive test.)
One of the candidate non-homologs is sp|Q9SI20|EF1D2_ARATH,
with an E()-value of 0.11.
Does the domain structure of EF1D2_ARATH suggest that it could be a glutatione
transferase homolog?
Use the
General Research to explore the domains contained
in EF1D2_ARATH homologs found in SwissProt.
Does this secondary search support homology or non-homology?
Compare the Drosophila glutathione transferase GSTT1_DROME [pgm] sequence to the sequences with solved 3D structures (Library: making certain that the Annotations: is set.)
Are the structural domains in GSTT1_DROME consistent with the InterPro domains?
Looking pdb|2DSAA or pdb|1EV4A, how many "Glutaredoxin" domains appear to be present?
Looking at pdb|3NIVA, what do you think the "correct" structural domain annotation should be? How could you confirm you hypothesis?
Compare the expectation (E()) value for the distant
relationship between honey bee GSTD1 and
GSTP1_HUMAN (class-pi). How would you demonstrate that
Honey bee GSTD1 is homologous to GSTP1_HUMAN?
Examine how the expectation value changes with different scoring
matrices (BlastP62, VT160) and different gap penalties. (The default
scoring matrix for the FASTA programs is BLOSUM50, with gap penalties
of -10 to open a gap and -2 for each residue in the gap - e.g. -12 for
a one residue gap).
What happens to the E()-value for the close Honey bee GSTD1 vs
GSTT1_DROME alignment with the different matrices and different gap
penalties? If the score decreases, does the E()-value always get
worse (increase)?
What happens to the E()-value for distant homologs,
like GSTA1_RAT with the different matrices and different gap
penalties?
What happens to the E()-value for the highest scoring unrelated
sequence with the different matrices?
2. Do the same Honey bee GSTD1 search (295842263) using the Course
BLAST [pgm] WWW page.
Take a look at the output.
How long is the query sequence?
How many sequences are in the PIR1 database?
What scoring matrix was used?
What were the gap penalties?
What are the numbers after the description of the library sequence? Which one is best for inferring homology?
Looking at an alignment, where are the boundaries of the alignment (the best local region)?
What is the highest scoring non-homolog?
How do the BLASTP E()-values compare with the
FASTA (BLASTP62) E()-values for the distantly related
mammalian and plant sequences?
Try a limited search at the NCBI/BLAST web site. Compare Honey bee GSTD1 (295842263) to the Human Refseq proteins using BLASTP [pgm].
What is the highest scoring non-homolog?
What is the most distant, statistically significant (E() < 0.001) human homolog?
Use the FASTA Web site to compare Honey bee GSTD1 to human proteins with FASTA
Significant similarities within sequences - domain duplication and
over-extension - searching sequences with known structures3. Exploring domains with local alignments -- death associated protein kinase (DAPK1_HUMAN)
Look up the domain structure of DAPK1_HUMAN
at Pfam [pgm].
What are the major (PfamA) domain regions on the protein?
Which of the domains is repeated?
In a local (LALIGN) alignment, where would you expect to see
overlapping domains like those in Calmodulin (CALM1_HUMAN) and
Cortactin (SRC8_HUMAN)?
Use lalign/plalign [pgm] to examine local
similarities between DAPK1_HUMAN and itself. Select
"Annotate Query Sequence" to use "Interpro domains/no features", and
do the same for "Annotate Target Sequence". Do you see the domains
you expected from Pfam? Do they map in the same places? (You can
confirm by selecting "Pfam26 Domains" for the Target sequence, while
using "Interpro domains" for the Query.)
Repeat the LALIGN/PLALIGN
analysis lalign/plalign [pgm], but select the
subset of the protein where the repeated domains are found (within 50
residues). Looking at the first or second non-identical self-alignment:
What is the overall percent identity of the alignment?
What is the range in identity accross the different aligned Ankryin domains?
Do the ends of the first alignment correspond to the domain boundaries?
How long are the ankyrin domains?
Based on the percent identities you saw in part (n), what would the
appropriate scoring matrix be to accurately identify the ankyrin
domains?
Using a "correct" scoring matrix, are the alignment boundaries more accurate?
What is the percent identity of the alignment (did you pick the right matrix?)
4. -- Searching for sequences with known structure -- death associated protein kinase (DAPK1_HUMAN)
Search [pgm] the protein structure database (PDB Structures - NCBI) using the DAPK1_HUMAN protein.
How much of the protein has a known structure?
To double check your
answer, search [pgm] the PDB structure sequences using
the three domain regions (Kinase, Ankyrin, Death) identified by Pfam
the local domain plots.
Are there homologs to the Death domain with known structures?
Try searching the protein structure database with a glutathione S-transferase sequence (e.g. GSTT1_DROME [pgm] or GSTM1_HUMAN pgm]) or a calmodulin sequence (CALM1_HUMAN [pgm]), annotating both the query and PDB database. How well do the Interpro domains line up with the structural domains.
5. Working with short sequences -- when the scoring matrix matters II
Use
the FASTX
[pgm] to compare the Honey bee GSTD1 mRNA (NM_001178028,
gi|GI:295842262) to the PIR1 database. How does the sensitivity of
this translated DNA vs protein search compare with your earlier
protein:protein search?
Now do the same search, but use only exon 3 of the Honey bee GSTD1 gene, which corresponds to nt 456-597. Use the Subset-range to select the exon 3 nucleotides from GSTD1 and run FASTX [pgm] again.
What is the best E()-value with BLOSUM50? What is the percent identity for this alignment?
How long is the longest possible translated amino-acid sequence? How long is the protein alignment?
What would be a more appropriate scoring matrix based on the percent identity?
Do a search with the more appropriate scoring matrix. What is the E()-value with that scoring matrix?
Does the domain structure correspond to the exon structure?
Repeat the same translated FASTX search, using exon 2 of the Honey bee GSTD1 gene, which corresponds to nt 397-455 FASTX [pgm].
What is the best E()-value with BLOSUM50? What is the percent identity for this alignment?
How long is the longest possible translated amino-acid sequence? How long is the protein alignment?
What would be a more appropriate scoring matrix based on the percent identity?
What is the E()-value with that scoring matrix?
Programming Exercise 1 -- Comparing matrices
In the next steps, you will write a script, using BioPerl, that
automatically runs the searches done in Exercise 2, and tabulates
E()-values, bit scores, alignment length, and percent identity.
Use the program bp_blastp.pl in your home directory.
This script compares a sequence to a database using BioPerl's RemoteBlast
function to search the NCBI set of Human refseq proteins using BLOSUM62 by
default. This exercise will refer to the GSTM1_HUMAN and GSTT1_DROME sequences, which you should download as well.
You can change the scoring matrix by specifying --matrix
PAM30; like the BLASTP web site, available matrices are BLOSUM45,
BLOSUM62, BLOSUM80, PAM70, and PAM30.
The bp_blastp_seq.pl script will do a search at the
NCBI BLAST site; you can modify this script to simply parse the
results of a BLAST search (or FASTA search) that you have done
locally. Our local copy of SwissProt proteins is
called swissprot.
After downloading GSTM1_HUMAN into the
file gstm1_human.fa, run the
script: bp_blastp_seq.pl gstm1_human.fa >
gstm1_human.bp62 saving the output to a
file gstm1_human.bp62 and answer the questions from
question 2a(ii) above. Run the script using the PAM70 and PAM30
scoring matrices, saving the output to files (.pam70, .pam30) and look
at E()-values, bits, and alignment lengths.
Write a new script that can look at the
three bp_blastp_seq.pl output files and merges the results
from the three searches.
Read in each of the initial search files (gstm1_human.bp62,
.pam70, .pam30), and save the results from each search in a
Hash, using the accession as a key, so that the E()-value, bits,
alignment length, and fraction identical are saved for each library
hit and for each matrix. For example, make a structure:
{ 'NP_671489' => { descr => "glutathione S-transferase ...", BLOSUM62 => { bits=> 317.0, expect=>7e-110, ...}, PAM30 => { bits=> 498.0, expect=>8e-162, ...} }
Then, display a table that reports the E()-value, bit score, alignment
length, and fraction identical for each library sequence/target found
with BLOSUM62, with blanks for targets that were not found.
What happens to the bit score, alignment length, and fraction identical for
closely related sequences as the scoring matrix becomes more
shallow? For distantly related sequences?
bp_blastp_seq.pl includes code to restrict the search to
sequences matching an ENTREZ_QUERY. When this option is used, the
E()-values are not calculated correctly.
Try your script that tabulates scores and alignment lengths for different scoring matrices using GSTT1_DROME.
Programming Exercise 2 -- High-scoring unrelated sequences
Many computational biologists, who use percent identity, rather than
E()-value, to infer homology, believe that if you do enough searches
and look for transitive homology, eventually non-homologous sequences
will be linked. This exercise is designed to test this hypothesis, by
looking for high-scoring non-homologous sequences using the strategy
outlined in Problem 1. We will search with a query, and then try to
find the highest scoring non-homolog by doing reverse-searches with
the candidate non-homologs.
Use the program bp_blastp.pl in your home directory.
bp_blastp.pl is
identical to bp_blastp_seq.pl, except that instead of taking
a FASTA sequence file as an argument, it takes an accession
(e.g. GSTT1_DROME or NP_000843. Since this
exercise runs secondary searches based on hits from an initial search,
you need a script that runs a search using the accession from a
previous search.
Make a modified script (bp_blastp_swiss.pl) that takes an
accession (query or target) and compares that sequence to swissprot,
saving the list of accessions and E()-values to a file. (
bp_blastp_swiss.pl query_acc > query.hits).
For each of the initial hits to the query in the original bp_blastp.pl search against human Refseq proteins, use the bp_blastp_swiss.pl script to identify homologs in swissprot. Run bp_blastp_swiss.pl against all the initial human RefSeq hits with 1e-6 < E() < 10.0.
Then, for each of the library/target SwissProt results, write a script
that reads in the hit accessions with significant E()-values, and
looks for sequences that can be used to link the initial query
sequence to the candidate non-homology library/target sequences. If
the initial query sequence and the library/target sequence both hit
some of the same sequences in SwissProt with good statistical
significance, the library/target sequence is a transitive homolog.
Using this SwissProt transitive search strategy, find the highest
scoring non-homolog library/target sequence from the initial search
with either GSTM1_HUMAN or GSTT1_DROME
against human RefSeq.
This strategy overlooks one critical feature of transitive homology.
What other alignment property should be included in the transitivity
test?
Comparison of Protein:Protein, translated DNA:protein to DNA:DNA searches - more sensitive DNA searches6. In the next three exercises, we will try to find
gstt1_drome homologs in the Arabidopsis genome, using
(a) protein:protein (BLASTP), (b) DNA:protein (BLASTX), (c)
protein:DNA (TBLASTN), and (d,e) DNA:DNA (BLASTN) searches.
In each of the exercises below, the BLASTP, BLASTX etc. links are pre-set to search Arabidopsis sequences.
What are the E()-values for Arabidopsis THETA1, phi, zeta-class 1and TAU
BLASTX
Try the same search using the GSTT1_DROME cDNA DMGST [seq] (gi|8033) against Arabidopsis proteins using
BLASTX [pgm]. If the search fails, try turning off low complexity filtering.
What are the E()-values for Arabidopsis THETA-1, phi, zeta-class 1and TAU?
What are the E()-values for Arabidopsis THETA-1, phi, zeta-class 1and TAU?
Finally, try the DNA:DNA comparison.
Use
BLASTN [pgm] to compare dmgst
(gi|8033) to the Reference mRNA sequences (refseq_rna) sequences in Arabidopsis.