fasta.bioch.virginia.edu/mol_evol

Programming for Biology

These exercises use programs on the FASTA WWW Search page[pgm] and the BLAST WWW Search page [pgm].

In the links below, [pgm] indicates a link with most of the information filled in; e.g. the program name, query, and library. [seq] links go to the NCBI, for more information about the sequence. In general, you should click [pgm] links, but not [seq] links.

Identifying homologs and non-homologs; effects of scoring matrices and algorithms; using domain annotations

1. Use the FASTA search page [pgm] to compare Honey bee glutathione transferase D1 NP_001171499/ H9KLY5_APIME [seq] (gi|295842263) to the PIR1 Annotated protein sequence database. Be sure to press , not .

Take a look at the output.
1. How long is the query sequence?
2. How many sequences are in the PIR1 database?
3. What scoring matrix was used?
4. What were the gap penalties? (what is the penalty for a one-residue gap? two residues?)
5. What are each of the numbers after the description of the library sequence? Which one is best for inferring homology?
6. Looking at an alignment, where are the boundaries of the alignment (the best local region)? How many gaps are in the best alignment? The second best?
What is the highest scoring non-homolog? (The non-homolog with the highest alignment score, or the lowest E()-value.) How would you confirm that your candidate non-homolog was truly unrelated?
Homology (and non-homology) can also be inferred by domain relationships. Try the same Honey bee GSTD1 search [pgm] search using the Annotate Query and Database Annotations: set to show .
1. Does the domain display change your mind about the highest scoring non-homolog?
2. There are three parts to the domain display, the domain structure of the query (top) sequence, the domain structure of the library (bottom) sequence, and the domain alignment score in the middle (inside the alignment box). The boundaries and color of the alignment domain coloring match the Region: sub-alignment scores.
3. Note that the alignment of Honey bee GSTD1 and SSPA_ECO57 includes portions of both the N-terminal and C-terminal domains, but neither domain is completely aligned. Why do you think the alignments do not include the complete domains?
4. Is your explanation for the partial domain alignment consistent the the argument that domains are atomic? How might you test whether a complete domain is present?
Note that this Honey bee glutathione transferase shares significant similarity with both sequences from bacteria (SSPA_ECO57, stringent starvation protein) and mammals. How might you test whether the stringent starvation protein is homologous to glutathione transferases? (Hint - compare your candidate non-homolog with SwissProt for a more comprehensive test.)
One of the candidate non-homologs is sp|Q9SI20|EF1D2_ARATH, with an E()-value of 0.11.
1. Does the domain structure of EF1D2_ARATH suggest that it could be a glutatione transferase homolog?
2. Use the General Research to explore the domains contained in EF1D2_ARATH homologs found in SwissProt.
3. Does this secondary search support homology or non-homology?
Compare the Drosophila glutathione transferase GSTT1_DROME [pgm] sequence to the sequences with solved 3D structures (Library: making certain that the Annotations: is set.)
1. Are the structural domains in GSTT1_DROME consistent with the InterPro domains?
2. Looking pdb|2DSAA or pdb|1EV4A, how many "Glutaredoxin" domains appear to be present?
3. Looking at pdb|3NIVA, what do you think the "correct" structural domain annotation should be? How could you confirm you hypothesis?
Compare the expectation (E()) value for the distant relationship between honey bee GSTD1 and GSTP1_HUMAN (class-pi). How would you demonstrate that Honey bee GSTD1 is homologous to GSTP1_HUMAN?
Examine how the expectation value changes with different scoring matrices (BlastP62, VT160) and different gap penalties. (The default scoring matrix for the FASTA programs is BLOSUM50, with gap penalties of -10 to open a gap and -2 for each residue in the gap - e.g. -12 for a one residue gap).
1. What happens to the E()-value for the close Honey bee GSTD1 vs GSTT1_DROME alignment with the different matrices and different gap penalties? If the score decreases, does the E()-value always get worse (increase)?
2. What happens to the E()-value for distant homologs, like GSTA1_RAT with the different matrices and different gap penalties?
3. What happens to the E()-value for the highest scoring unrelated sequence with the different matrices?

2. Do the same Honey bee GSTD1 search (295842263) using the Course BLAST [pgm] WWW page.

Take a look at the output.
1. How long is the query sequence?
2. How many sequences are in the PIR1 database?
3. What scoring matrix was used?
4. What were the gap penalties?
5. What are the numbers after the description of the library sequence? Which one is best for inferring homology?
6. Looking at an alignment, where are the boundaries of the alignment (the best local region)?
What is the highest scoring non-homolog?
How do the BLASTP E()-values compare with the FASTA (BLASTP62) E()-values for the distantly related mammalian and plant sequences?
Try a limited search at the NCBI/BLAST web site. Compare Honey bee GSTD1 (295842263) to the Human Refseq proteins using BLASTP [pgm].
1. What is the highest scoring non-homolog?
2. What is the most distant, statistically significant (E() < 0.001) human homolog?
3. Use the FASTA Web site to compare Honey bee GSTD1 to human proteins with FASTA

Significant similarities within sequences - domain duplication and over-extension - searching sequences with known structures

3. Exploring domains with local alignments -- death associated protein kinase (DAPK1_HUMAN)

Look up the domain structure of DAPK1_HUMAN at Pfam [pgm].
1. What are the major (PfamA) domain regions on the protein?
2. Which of the domains is repeated?
3. In a local (LALIGN) alignment, where would you expect to see overlapping domains like those in Calmodulin (CALM1_HUMAN) and Cortactin (SRC8_HUMAN)?
Use lalign/plalign [pgm] to examine local similarities between DAPK1_HUMAN and itself. Select "Annotate Query Sequence" to use "Interpro domains/no features", and do the same for "Annotate Target Sequence". Do you see the domains you expected from Pfam? Do they map in the same places? (You can confirm by selecting "Pfam26 Domains" for the Target sequence, while using "Interpro domains" for the Query.)
Repeat the LALIGN/PLALIGN analysis lalign/plalign [pgm], but select the subset of the protein where the repeated domains are found (within 50 residues). Looking at the first or second non-identical self-alignment:
1. What is the overall percent identity of the alignment?
2. What is the range in identity accross the different aligned Ankryin domains?
3. Do the ends of the first alignment correspond to the domain boundaries?
4. How long are the ankyrin domains?
Based on the percent identities you saw in part (n), what would the appropriate scoring matrix be to accurately identify the ankyrin domains?
1. Using a "correct" scoring matrix, are the alignment boundaries more accurate?
2. What is the percent identity of the alignment (did you pick the right matrix?)

4. -- Searching for sequences with known structure -- death associated protein kinase (DAPK1_HUMAN)

Search [pgm] the protein structure database (PDB Structures - NCBI) using the DAPK1_HUMAN protein.

How much of the protein has a known structure?
To double check your answer, search [pgm] the PDB structure sequences using the three domain regions (Kinase, Ankyrin, Death) identified by Pfam the local domain plots.
Are there homologs to the Death domain with known structures?
Try searching the protein structure database with a glutathione S-transferase sequence (e.g. GSTT1_DROME [pgm] or GSTM1_HUMAN pgm]) or a calmodulin sequence (CALM1_HUMAN [pgm]), annotating both the query and PDB database. How well do the Interpro domains line up with the structural domains.

5. Working with short sequences -- when the scoring matrix matters II

Use the FASTX [pgm] to compare the Honey bee GSTD1 mRNA (NM_001178028, gi|GI:295842262) to the PIR1 database. How does the sensitivity of this translated DNA vs protein search compare with your earlier protein:protein search?
Now do the same search, but use only exon 3 of the Honey bee GSTD1 gene, which corresponds to nt 456-597. Use the Subset-range to select the exon 3 nucleotides from GSTD1 and run FASTX [pgm] again.
1. What is the best E()-value with BLOSUM50? What is the percent identity for this alignment?
2. How long is the longest possible translated amino-acid sequence? How long is the protein alignment?
3. What would be a more appropriate scoring matrix based on the percent identity?
4. Do a search with the more appropriate scoring matrix. What is the E()-value with that scoring matrix?
5. Does the domain structure correspond to the exon structure?
Repeat the same translated FASTX search, using exon 2 of the Honey bee GSTD1 gene, which corresponds to nt 397-455 FASTX [pgm].
1. What is the best E()-value with BLOSUM50? What is the percent identity for this alignment?
2. How long is the longest possible translated amino-acid sequence? How long is the protein alignment?
3. What would be a more appropriate scoring matrix based on the percent identity?
4. What is the E()-value with that scoring matrix?

Programming Exercise 1 -- Comparing matrices

In the next steps, you will write a script, using BioPerl, that automatically runs the searches done in Exercise 2, and tabulates E()-values, bit scores, alignment length, and percent identity.

Use the program bp_blastp.pl in your home directory.
This script compares a sequence to a database using BioPerl's RemoteBlast function to search the NCBI set of Human refseq proteins using BLOSUM62 by default. This exercise will refer to the GSTM1_HUMAN and GSTT1_DROME sequences, which you should download as well.
```
wget http://fasta.bioch.virginia.edu/mol_evol/gstt1_drome.fa
wget http://fasta.bioch.virginia.edu/mol_evol/gstm1_human.fa
```
You can change the scoring matrix by specifying --matrix PAM30; like the BLASTP web site, available matrices are BLOSUM45, BLOSUM62, BLOSUM80, PAM70, and PAM30.
The bp_blastp_seq.pl script will do a search at the NCBI BLAST site; you can modify this script to simply parse the results of a BLAST search (or FASTA search) that you have done locally. Our local copy of SwissProt proteins is called swissprot.
After downloading GSTM1_HUMAN into the file gstm1_human.fa, run the script:
bp_blastp_seq.pl gstm1_human.fa > gstm1_human.bp62
saving the output to a file gstm1_human.bp62 and answer the questions from question 2a(ii) above. Run the script using the PAM70 and PAM30 scoring matrices, saving the output to files (.pam70, .pam30) and look at E()-values, bits, and alignment lengths.
Write a new script that can look at the three bp_blastp_seq.pl output files and merges the results from the three searches.
1. Read in each of the initial search files (gstm1_human.bp62, .pam70, .pam30), and save the results from each search in a Hash, using the accession as a key, so that the E()-value, bits, alignment length, and fraction identical are saved for each library hit and for each matrix. For example, make a structure:
  { 'NP_671489' => { descr => "glutathione S-transferase ...", BLOSUM62 => { bits=> 317.0, expect=>7e-110, ...}, PAM30 => { bits=> 498.0, expect=>8e-162, ...} }
  Then, display a table that reports the E()-value, bit score, alignment length, and fraction identical for each library sequence/target found with BLOSUM62, with blanks for targets that were not found.
  [ merge_hits.pl ]
```
wget http://fasta.bioch.virginia.edu/mol_evol/merge_hits.shtml -O merge_hits.pl
```
2. What happens to the bit score, alignment length, and fraction identical for closely related sequences as the scoring matrix becomes more shallow? For distantly related sequences?
bp_blastp_seq.pl includes code to restrict the search to sequences matching an ENTREZ_QUERY. When this option is used, the E()-values are not calculated correctly.
Try your script that tabulates scores and alignment lengths for different scoring matrices using GSTT1_DROME.

Programming Exercise 2 -- High-scoring unrelated sequences

Many computational biologists, who use percent identity, rather than E()-value, to infer homology, believe that if you do enough searches and look for transitive homology, eventually non-homologous sequences will be linked. This exercise is designed to test this hypothesis, by looking for high-scoring non-homologous sequences using the strategy outlined in Problem 1. We will search with a query, and then try to find the highest scoring non-homolog by doing reverse-searches with the candidate non-homologs.

Use the program bp_blastp.pl in your home directory. bp_blastp.pl is identical to bp_blastp_seq.pl, except that instead of taking a FASTA sequence file as an argument, it takes an accession (e.g. GSTT1_DROME or NP_000843. Since this exercise runs secondary searches based on hits from an initial search, you need a script that runs a search using the accession from a previous search.
Make a modified script (bp_blastp_swiss.pl) that takes an accession (query or target) and compares that sequence to swissprot, saving the list of accessions and E()-values to a file. (bp_blastp_swiss.pl query_acc > query.hits).
For each of the initial hits to the query in the original bp_blastp.pl search against human Refseq proteins, use the bp_blastp_swiss.pl script to identify homologs in swissprot. Run bp_blastp_swiss.pl against all the initial human RefSeq hits with 1e-6 < E() < 10.0.
Then, for each of the library/target SwissProt results, write a script that reads in the hit accessions with significant E()-values, and looks for sequences that can be used to link the initial query sequence to the candidate non-homology library/target sequences. If the initial query sequence and the library/target sequence both hit some of the same sequences in SwissProt with good statistical significance, the library/target sequence is a transitive homolog.
Using this SwissProt transitive search strategy, find the highest scoring non-homolog library/target sequence from the initial search with either GSTM1_HUMAN or GSTT1_DROME against human RefSeq.
This strategy overlooks one critical feature of transitive homology. What other alignment property should be included in the transitivity test?

Comparison of Protein:Protein, translated DNA:protein to DNA:DNA searches - more sensitive DNA searches

6. In the next three exercises, we will try to find gstt1_drome homologs in the Arabidopsis genome, using (a) protein:protein (BLASTP), (b) DNA:protein (BLASTX), (c) protein:DNA (TBLASTN), and (d,e) DNA:DNA (BLASTN) searches.

In each of the exercises below, the BLASTP, BLASTX etc. links are pre-set to search Arabidopsis sequences.

BLASTP Compare the GSTT1_DROME [seq] (gi|121694) protein sequence to Arabidopsis proteins using BLASTP [pgm].
What are the E()-values for Arabidopsis THETA1, phi, zeta-class 1and TAU
BLASTX
Try the same search using the GSTT1_DROME cDNA DMGST [seq] (gi|8033) against Arabidopsis proteins using BLASTX [pgm]. If the search fails, try turning off low complexity filtering.
What are the E()-values for Arabidopsis THETA-1, phi, zeta-class 1and TAU?
TBLASTN. Use GSTT1_DROME [seq] (gi|121694) against translated Arabidopsis DNA using TBLASTN [pgm].
What are the E()-values for Arabidopsis THETA-1, phi, zeta-class 1and TAU?
Finally, try the DNA:DNA comparison. Use BLASTN [pgm] to compare dmgst (gi|8033) to the Reference mRNA sequences (refseq_rna) sequences in Arabidopsis.
Are there detectable Arabidopsis homologues?

Where to get the FASTA package: github.com/wrpearson/fasta36

The "normal" FASTA WWW site:

Contact Bill Pearson: wrp@virginia.edu