Biol4230 Mid-term project - due 5:00 PM, March 31
The goal of the project is to have you use many of the methods that were introduced in the first half of the course, in particular similarity searching, automated scripting of sequence downloads, multiple sequence alignment, and evolutionary tree reconstruction.
Overall, the project is to start with a possibly protein sequence, identify 20 - 30 homologs, including homologs from rat and mouse (for a human query), but also homologs that share less than 30% sequence identity, and then build evolutionary trees using both protein and DNA sequences.
Identify a human protein family for searching and tree building. Do NOT use class-mu glutathione transferases. You should use a protein family that has a relatively constant length. Some of the families you might consider are:
If you wish to use your own protein family, the protein should be between 100 - 500 amino-acids, should contain a single domain (or multiple domains that are usually found together), and have distant homologs (<30% identity with E()<1e-6).
|Trypsin-like serine Proteases:||sp|P07477|TRY1_HUMAN|
|Subtilisin-like serine proteases:||sp|Q8NBP7|PCSK9_HUMAN (isolate Peptidase_S8 domain)|
|glucose transporter: ||sp|P11166|GTR1_HUMAN|
|DS protein phosphatase: ||sp|P28562|DUS1_HUMAN|
|cytochrome P450: ||sp|P04798|CP1A1_HUMAN|
|proline isomerase: ||sp|Q9NWM8|FKB14_HUMAN|
Find a set of 20 - 30 clear homologs (E < 1E-6), some of which share less than 30% identity, preferably with homologs from human, mouse, and rat (as well as more distant organisms).
Build a multiple sequence alignment of the proteins (or homologous domains) using muscle. Look at the multiple sequence alignment to ensure that there are not large numbers of gaps across the alignment (there will be some gaps in the distant homologs, but hopefully they will be clustered).
Download the set of mRNA (cDNA) sequences that code for your proteins.
This is easiesr if you work with RefSeq proteins, because every RefSeq
protein (NP_) has a RefSeq mRNA sequence (NM_) that can be translated
into the protein. Uniprot proteins do not always have cDNAs.
Use the muscle protein alignment to direct a mRNA (cDNA) DNA alignment using tranalign.
Build distance and parsimony trees for both your protein and DNA alignments, and a DNA maximum likelihood tree.
Use the consense program to evaluate the consistency of the five trees: protein distance, DNA distance, protein parsimony, DNA parsimony, and DNA maximum likelihood.
Bootstrap 1: Use seqboot to generate a 100 sequence bootstrap dataset for the protein sequences, and then evaluate the robustness of the protdist/fitch tree using the boostrap approach.
Bootstrap 2: Use seqboot to generate a 100 sequence bootstrap dataset for the DNA sequences, and then evaluate the robustness of the dnaml tree using the boostrap approach.
Compare the three different consensus trees that you made. Tree 1: consensus of 5 phylogenetic methods (distance/parsimony prot/DNA + dnaml). Tree 2: 100 bootstraps of protein sequences/protdist/fitch. Tree 3: 100 bootstraps of DNA sequences/dnamlk.
The final tree: Take the consensus of consensus tree (from step
9 above), and use it as an -intree file to direct the
production of a final tree, either using ffitch with protein
distances, or dnaml with a DNA multiple alignment. Also, compare how
well the consensus tree-driven fit to the data compares with the
unconstrained (no input tree) fit to the data (which you did earlier).
(stay tuned for updates). If you are a graduate student, then you must do some additional work.
Identify a set of your sequences that share 60-90% identity (throw
away the distant homologs).
Run the codeml program on the aligned DNA codons of your
reduced dataset, and compare M0 (one uniform rate), to M1 and M2, and
M7, M8, to test to see whether some of the sites in your dataset
support either positive or negative selection. For more information,
look at the pamlDOC.pdf document on the Collab/Resources site, and
take a look at the gstm_codeml.ctl file, as well
as ajm_gstm.phy and ajm_gstm.tree in the same
folder. The gstm_codeml.ctl file shows how to test for 5
models, M0, M1, M2, M7, and M8, in one run. The log-likelihoods, and
proportions in each of the omega-rate categories can be seen in the
output file, as well as any sites under strong positive selection (if
Report the probabilities (frequencies) and omega's for each model
(M0,1,2,7,8) as well as the log-likelihoods for the different
models. Using the discussion in the pamlDOC.pdf document on page
29ff, determine the number of free parameters and use the Likelihood
Ratio test to see if the different models have significant
differences in likelihood.
You should write a script that does every step of the analysis. I should be able to copy that script to an empty directory, run the script (and possibly an associated accession list) and reproduce your results. You do NOT have to script the initial BLAST/SSEARCH search, but, given a list of protein accessions, the script should reproduce your results.