Biol4230 - PHYLIP Exercises (Using EMBOSS)

PHYLIP Exercises

PHYLIP (the PHYLogeny Inference Package) provides a set of "classic" phylogeny programs that have been available since 1980 (see the PHYLIP home page).

Unfortunately, in part because they were written in the 1980s, the user interface is quite primitive, and in some ways hostile. Fortunately, the PHYLIP programs have been repackaged as part of the EMBOSS software package, which provides a much more modern command-line interface around the PHYLIP programs. In addition, EMBOSS provides some other very helpful programs for producing files in the correct format.

This workshop will use the EMBOSS programs on interactive.hpc to construct evolutionary trees using protein and DNA sequences. It is possible to run the workshop on hpc, but you will NOT be able to use the EMBOSS versions of the programs.

This series of exercises will be your homework for Wednesday, March 14. Please do the exercises in a new biol4230/hwk6 directory. Though we will do this exercise interactively today, please create a phylip.sh shell script file that shows exactly the steps you used to do the analyses.

Before you can use the EMBOSS programs, you will need to ensure that seqprg/emboss/bin is in your path. Check to see that the EMBOSS programs are in your path by looking at help on one of them:
```
seqret -help -verbose
```
All of the EMBOSS programs have a -help option, which you will need to use to learn how to specify the program input and output file names and other options.
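Since the later steps depend on a dozen different programs, it can save time to check all of them at once. The following is a minimal sketch; the helper name `missing_programs` is hypothetical and not part of EMBOSS, and the program list is simply the set of commands named in this exercise.

```shell
# Print the name of every listed program that cannot be found on PATH.
# (missing_programs is a hypothetical helper, not an EMBOSS command.)
missing_programs() {
    for prog in "$@"; do
        command -v "$prog" >/dev/null 2>&1 || echo "$prog"
    done
}

# Check the programs used in this exercise; no output means all were found.
missing_programs muscle seqret tranalign fprotdist fdnadist ffitch fkitsch \
    fprotpars fdnapars fdnaml fdnamlk fconsense
```

If any names are printed, fix your path before continuing.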
On interactive.hpc.virginia.edu, copy the files gstm.alib and gstm.nlib from ${SLIB2}/biol4230/data/phylip to a new hwk6 directory.
Align the gstm.alib sequences using
```
muscle -stable -in gstm.alib -out gstm.a_aln
```
(the -stable option ensures that the output alignment is in the same sequence order as the input)
By default, muscle writes out the result in FASTA format, which you can use to produce the DNA alignment. You may also want to write out the alignment in ClustalW format (option -clw) to look at alignment conservation.
Looking at either the FASTA or ClustalW format multiple sequence alignment, how many gaps do you see? Do you think a different alignment program would produce a different multiple sequence alignment?
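If you want a rough number to go with your visual inspection, gap characters in the FASTA alignment can be tallied per sequence. This is only a sketch: the function name `count_gaps` is hypothetical, it assumes gaps are written as `-`, and it counts individual gap characters rather than separate gap regions.

```shell
# Print "name<TAB>number of '-' characters" for each sequence in a FASTA alignment.
# (count_gaps is a hypothetical helper; it counts gap characters, not gap runs.)
count_gaps() {
    awk '
        /^>/ {
            if (name != "") print name "\t" gaps
            name = substr($1, 2)   # sequence ID without the leading ">"
            gaps = 0
            next
        }
        { gaps += gsub(/-/, "") }  # gsub returns the number of "-" removed
        END { if (name != "") print name "\t" gaps }
    ' "$1"
}
```

For example, `count_gaps gstm.a_aln` would list each sequence in the muscle output with its gap count.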
Use the tranalign program to align the DNA sequences in gstm.nlib so that they follow the protein alignment in gstm.a_aln.

```
tranalign -asequence gstm.nlib -bsequence gstm.a_aln -outseq gstm.n_aln
```
Look at the gstm.n_aln file. Is it in PHYLIP format?
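One quick way to tell the formats apart: a PHYLIP alignment file starts with a line holding two integers (the number of sequences and the alignment length), whereas a FASTA file starts with a `>` header. The sketch below tests only that first line, so it is a sanity check rather than a full format validator; the function name `looks_like_phylip` is hypothetical.

```shell
# Succeed (exit status 0) if the first line of the file is two integers,
# as in a PHYLIP header; fail otherwise.
# (looks_like_phylip is a hypothetical helper and checks the first line only.)
looks_like_phylip() {
    head -n 1 "$1" | awk '
        NF == 2 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { ok = 1 }
        END { exit !ok }
    '
}
```

It can be used in a script as `if looks_like_phylip gstm.n_aln; then echo "PHYLIP"; else echo "not PHYLIP"; fi`.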
Use the seqret program:
```
seqret  -osformat2 phylip -sequence gstm.a_aln -outseq gstm.a_phy
```
to reformat the FASTA-format gstm.a_aln and gstm.n_aln alignments into PHYLIP format (gstm.a_phy, gstm.n_phy); run it once for each file.
Use the fprotdist program to build a matrix of protein distances from gstm.a_phy.
Use the fdnadist program to build a matrix of DNA distances from gstm.n_phy.
```
fprotdist -sequence gstm.a_phy -outfile gstm.a_dist
fdnadist -sequence gstm.n_phy -outfile gstm.n_dist -method f
```
Use the ffitch and fkitsch programs to build trees from the protein and DNA distance files. For ffitch (but not fkitsch), you should specify an outgroup: -outgrno 19
```
ffitch -datafile gstm.a_dist -outtreefile gstm.a_dist_tree -outfile gstm.a_dist_log -outgrno 19
```
When you run the program, it will ask for an (optional) -intreefile, which you do not need (or have). Just hit return, or create a file with a blank line in it (not empty, it must have one newline). If you call it "blank-line.txt", you can run:
```
ffitch -datafile gstm.a_dist -outtreefile gstm.a_dist_tree -outfile gstm.a_dist_log -outgrno 19 < blank-line.txt
```
And the program will run properly.
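The one-line file described above can be created from the shell without opening an editor. This is a small sketch using `printf`; the byte count is printed only so you can confirm that the file is not empty.

```shell
# Write exactly one newline character to blank-line.txt.
printf '\n' > blank-line.txt

# Confirm the file holds a single byte (the newline).
wc -c < blank-line.txt
```

Putting this near the top of phylip.sh means the redirected ffitch command can run unattended.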
1. Looking at the program output (-outfile option), do both the protein distance and DNA distance trees look the same? Do the ffitch and fkitsch trees look the same?
2. The data set includes several paralogous human, mouse, and rat glutathione S-transferases. Can you identify the mouse/rat orthologs?
3. Can you identify any mouse/human orthologs? What evolutionary events might cause the human/mouse orthologs to be more difficult to identify?
Use the fprotpars and fdnapars programs to build trees from the protein and DNA alignment files.
Use the fdnaml and fdnamlk programs to build trees from the DNA alignment file.
Use the fconsense program to compare the trees. To do this, you must first combine all the tree files you produced with the different programs into one file (substitute the names of your own tree files for the ones shown here):
```
cat gst_m.pdist_tree gst_m.ddist_tree gst_m.ppars_tree gst_m.dpars_tree gst_m.dml_tree > gst_m.all_trees
fconsense -intreefile gst_m.all_trees
```
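Before running fconsense, it may be worth confirming that the combined file holds as many trees as you expect. Each Newick tree ends with a semicolon, so counting semicolons gives the number of trees. This is a sketch that assumes no sequence name contains a `;`, and the function name `count_trees` is hypothetical.

```shell
# Print the number of Newick trees in a file by counting ";" terminators.
# (count_trees is a hypothetical helper; assumes no ";" inside taxon names.)
count_trees() {
    tr -cd ';' < "$1" | wc -c | tr -d ' '
}
```

With the five tree files concatenated in the example above, `count_trees gst_m.all_trees` should report 5.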
1. Which parts of the tree are found by all 3 methods on both DNA and protein datasets?
2. Does one method (distance/parsimony/maximum likelihood) do a better job of assigning mouse/human orthologs than the others?

Homework 6 (hwk6), due Wednesday, March 14, at noon, should provide a script, with comments, that does each of the analyses listed above. A second file, answers.txt, should answer the additional questions asked in these steps:

Step 3 (muscle alignment): multiple alignment and gaps

Step 7 (ffitch and fkitsch trees): are the trees the same, which are the orthologs

Step 10 (fconsense): which parts of the tree are consistent, which method identifies more orthologs
