Biol4230 Mid-term project - due 5:00 PM, March 31

The goal of the project is to have you use many of the methods that were introduced in the first half of the course, in particular similarity searching, automated scripting of sequence downloads, multiple sequence alignment, and evolutionary tree reconstruction. Overall, the project is to start with a possibly protein sequence, identify 20 - 30 homologs, including homologs from rat and mouse (for a human query), but also homologs that share less than 30% sequence identity, and then build evolutionary trees using both protein and DNA sequences.


  1. Identify a human protein family for searching and tree building. Do NOT use class-mu glutathione transferases. You should use a protein family that has a relatively constant length. Some of the families you might consider are:
    MAP Kinase:sp|P45985|MP2K4_HUMAN
    Lactoglutathione lyase:sp|Q04760|LGUL_HUMAN
    Trypsin-like serine Proteases:sp|P07477|TRY1_HUMAN
    Subtilisin-like serine proteases:sp|Q8NBP7|PCSK9_HUMAN (isolate Peptidase_S8 domain)
    glucose transporter: sp|P11166|GTR1_HUMAN
    DS protein phosphatase: sp|P28562|DUS1_HUMAN
    cytochrome P450: sp|P04798|CP1A1_HUMAN
    proline isomerase: sp|Q9NWM8|FKB14_HUMAN
    If you wish to use your own protein family, the protein should be between 100 - 500 amino-acids, should contain a single domain (or multiple domains that are usually found together), and have distant homologs (<30% identity with E()<1e-6).

    Find a set of 20 - 30 clear homologs (E < 1E-6), some of which share less than 30% identity, preferably with homologs from human, mouse, and rat (as well as more distant organisms).

  2. Build a multiple sequence alignment of the proteins (or homologous domains) using muscle. Look at the multiple sequence alignment to ensure that there are not large numbers of gaps across the alignment (there will be some gaps in the distant homologs, but hopefully they will be clustered).

  3. Download the set of mRNA (cDNA) sequences that code for your proteins. This is easiesr if you work with RefSeq proteins, because every RefSeq protein (NP_) has a RefSeq mRNA sequence (NM_) that can be translated into the protein. Uniprot proteins do not always have cDNAs.

  4. Use the muscle protein alignment to direct a mRNA (cDNA) DNA alignment using tranalign.

  5. Build distance and parsimony trees for both your protein and DNA alignments, and a DNA maximum likelihood tree.

  6. Use the consense program to evaluate the consistency of the five trees: protein distance, DNA distance, protein parsimony, DNA parsimony, and DNA maximum likelihood.

  7. Bootstrap 1: Use seqboot to generate a 100 sequence bootstrap dataset for the protein sequences, and then evaluate the robustness of the protdist/fitch tree using the boostrap approach.
  8. Bootstrap 2: Use seqboot to generate a 100 sequence bootstrap dataset for the DNA sequences, and then evaluate the robustness of the dnaml tree using the boostrap approach.

  9. Compare the three different consensus trees that you made. Tree 1: consensus of 5 phylogenetic methods (distance/parsimony prot/DNA + dnaml). Tree 2: 100 bootstraps of protein sequences/protdist/fitch. Tree 3: 100 bootstraps of DNA sequences/dnamlk.
  10. The final tree: Take the consensus of consensus tree (from step 9 above), and use it as an -intree file to direct the production of a final tree, either using ffitch with protein distances, or dnaml with a DNA multiple alignment. Also, compare how well the consensus tree-driven fit to the data compares with the unconstrained (no input tree) fit to the data (which you did earlier).

(stay tuned for updates). If you are a graduate student, then you must do some additional work.

  1. Identify a set of your sequences that share 60-90% identity (throw away the distant homologs).

  2. Run the codeml program on the aligned DNA codons of your reduced dataset, and compare M0 (one uniform rate), to M1 and M2, and M7, M8, to test to see whether some of the sites in your dataset support either positive or negative selection. For more information, look at the pamlDOC.pdf document on the Collab/Resources site, and take a look at the gstm_codeml.ctl file, as well as ajm_gstm.phy and ajm_gstm.tree in the same folder. The gstm_codeml.ctl file shows how to test for 5 models, M0, M1, M2, M7, and M8, in one run. The log-likelihoods, and proportions in each of the omega-rate categories can be seen in the output file, as well as any sites under strong positive selection (if they exist).
  3. Report the probabilities (frequencies) and omega's for each model (M0,1,2,7,8) as well as the log-likelihoods for the different models. Using the discussion in the pamlDOC.pdf document on page 29ff, determine the number of free parameters and use the Likelihood Ratio test to see if the different models have significant differences in likelihood.

IMPORTANT: You should write a script that does every step of the analysis. I should be able to copy that script to an empty directory, run the script (and possibly an associated accession list) and reproduce your results. You do NOT have to script the initial BLAST/SSEARCH search, but, given a list of protein accessions, the script should reproduce your results.