NAME SEQEVOLVER PURPOSE SeqEvolver simulates the evolution of a collection of DNA or protein sequences under Jones/Dayhoff's PAM model, with user-defined substitution rates and evolution time. USAGE Usage: seqevolver.pl --seq_file [sequence_file] --out_file [out_file] --rates_file [rates_file] --time [time] [--force/--noforce] [--benner/--pascarella] [--nucleotide/--protein] [--biased/--uniform] [--rate_var_on/--rate_var_param '0.45 10'] [--nodashes] [--loss_gain] [--loss_gain_rate 0.00298] Restrictions/usage: 1. SeqEvolver only works on sequence libraries in the FASTA sequence format. 2. "sequence_file" is the full name of the file that contains the "ancestral" sequences. 3. "outfile" is the name of the file you want the evolved sequences to be written to. 4. "t" is time, and can be any real number, and ratesfile is a file that has as many evolutionary rates as there are sequences. You can use whatever units you want as long as (time * rate) yields a PAM distance. 5. The default is to use force and overwrite existing output files, and not ask for user interaction. Use the --noforce flag to ask for permission to overwrite existing files and to turn off interactive features 6. Use the --benner or --pascarella flag to include indels. The --benner flag uses a model based on Benner, et al. J. Mol. Biol. (1993) 229, 1065-1082. The --pascarella flag uses a model based on Pascarella, et al., J. Mol. Biol. (1992) 224, 461-471. Deletions will be -'s and inserts will be lower case letters. 7. Use the --nucleotide or --protein flags to force the program to interpret the sequence files as nucleotide or protein sequences respectively (in the absence of these flags, the program will guess at the sequence type based on ACTG content). Note that sequences should be either all nucleotides or all proteins, because that is how SeqEvolver will treat them. 8. The --uniform (uniform) or --biased (biased) flags are for DNA - the --uniform flag tells SeqEvolver to use a probability matrix in which transitions and transversion occur with equal frequency, while the --biased flag indicates that SeqEvolver should use a probability matrix in which transition occur about 3 times as often as transversions. For details, see States, et al. (1991) Methods: A Companion to Methods in Enzymology. 3(1), 66-70. 9. Currently SeqEvolver has no indel model implemented for DNA, so the --benner and --nucleotide flags may not be used together (this will generate an error). 10. Use --rate_var_on to turn on random rate variation. By default a window size of 10 will be used (meaning the rate will change randomly every 10 amino acids) and the standard deviation of the rates chosen for each window will be (0.45 * PAM distance). If you want to supply your own standard deviation for the rates and your own window size, use the --rate_var_param 'SD WINDOWSIZE', for example: --rate_var_param '0.3 15' That would set the standard deviation of the rates to be 0.3 * PAM distance and the window size to 15. The --rate_var_param flag implicitly sets the --rate_var_on flag. The default values of 0.45 * PAM distance comes from an experiment I (jr) did - I looked at several thousand alignments and observed how the percent id (and thus the PAM distance) varied in window along the alignment. It turns out the standard deviation of the percent id varied almost exactly linearly as a function of percent id of the entire alignment. Maybe someday I'll actually publish this! To implement random rate variation, the sequence is divided up into windows (default is 10 amino acids), and each window is assigned a mutation rate. The mutation rate is drawn from a normal distribution around the PAM distance for the whole sequence. The default standard deviation for the normal distribution is 0.45 * PAM distance for the whole sequence. The user can substitute 0.45 with any fraction he wants for the rate variation. No support for a 'constant' normal distribution irrespective of PAM distance is currently supported. If you want to do this, you're going to have to do separate SeqEvolver runs for each PAM distance, changing the 'fraction' parameter for each run so that that the standard deviation (fraction * PAM distance) is constant. NOTES =head2 Point mutations (substitutions) Point mutations are generated by first generating a distance specific probability matrix for each distance. This is done by matrix multiplication of the appropriate PAM100, PAM10 and PAM1 matrices (e.g. PAM14 = PAM10 X PAM1 X PAM1 X PAM1 X PAM1). Then, for each residue, a random number between 0 and 1 is selected and for each column (corresponding to the identity of the unmutated residue), the probabilities are added together until the random number is less or equal to the sum of the probabilities, and the residue is changed to the residue in that row; in other words, the residue is changed to the first residue corresponding to the row for which the cumulative sum is greater than the random number. (If that doesn't make sense, see pseudocode below.) Note that if SeqEvolver doesn't recognize a residue, it leaves it unchanged. It will, however, at least tell you that it doesn't recognize the residue, provided you are listening to STDERR. The probability matrices for protein are based on Jones D.T., et al. (1992) CABIOS 8(3), 275-282 - these matrices are basically an update of Margaret Dayhoff's original PAM matrices. Two DNA matrices are available, both based on States, et al. (1991) Methods: A Companion to Methods in Enzymology. 3(1), 66-70. The two different DNA matrices correspond to two different models of DNA mutation. In the uniform model, transitions and transversions occur with equal probability. In the biased model, transitions occur about three times as often as transversions. The latter model is a bit more realistic. I will add more elaborate models as my needs (or other's requests) dictate; feel free to contact me (jtr4v@virginia.edu) if you'd like me to incorporate another model. Insertions/Deletions (Indels) Two insertions/deletion (indel) models are implemented for protein sequences - that of Benner et al., (Benner, et al. J. Mol. Biol. (1993) 229, 1065-1082) and that of Pascarella, et al. (Pascarella, et al., J. Mol. Biol. (1992) 224, 461-471). A few comments/caveats: - In the Benner model, the probability of an indel (but NOT indel size) increases with PAM distance. - After deciding to allow an indel, the algorithm randomly chooses between insertions and deletions using a coin toss (50/50). - The Benner model assumes a Zipfian distribution of indel sizes: p(indel len=k) = constant / (k**exponent). In contrast to an exponential distribution, the average length generated by a Zipfian distribution can vary wildly with each trial. This means that the Zipfian distribution is much more likely to occasionally produce an extremely long indel. - The amino acids in inserts are chosen randomly based on the amino-acid frequencies in SwissProt. Pseudocode
#construct distance specific scoring matrix for (0 .. hundreds) { if (matrix is undefined) { matrix = PAM100; next; } matrix = matrix * PAM100; } for (0 .. tens) { if (matrix is undefined) { matrix = PAM10; next; } matrix = matrix * PAM10; } for (0 .. ones) { if (matrix is undefined) { matrix = PAM10; next; } matrix = matrix * PAM1; } #point mutations foreach ( residue ) { probability[0 .. num_letters - 1] = column of probabilities for residue from current_matrix; random = rand(1); for n( 0 .. num_letters - 1 ) { sum = sum + probability[n]; if ( random <= sum ) { new_residue = residue[n]; break; } } } #insertions/deletions foreach ( residue ) { random = rand(1); threshold = 0.0224 - 0.0219 * exp( -0.0102 ? target_pam_distance); # Benner et al. - or - threshold = ; # Pascarella et al. if ( random < threshold ) { length = (choose length from Zipfian distribution); # Benner, et al. - or - length = (choose length from ... ) # Pascarella, et al. random = rand(1); if ( random > 0.5 ) # insertion { insert (length) number of random amino acids; } else { delete (length) number of amino acids; } } }FEEDBACK jtr4v@virginia.edu REPORTING BUGS jtr4v@virginia.edu AUTHORS Justin Reese, Todd Wood REVISION LOG Version 0.2: Changed the random rate selecting algorithm to choose random PAM numbers rather than % identities. 10/30/97 Version 0.3: Changing random number generation to read a set of random numbers generated from Splus, these are rates of mutation Version 0.4: Changed it to generate a new PAM matrix for each sequence. Version 1.0: Altered evolver to allow random sequences to be inserted and deleted. Version 1.1: Created a routine for self-calculating the mutational probability matrix from a PAM1 matrix, so I don't have to call a separate program each time. I'm also removing the randomizing/pseudo-transfer module, since it is terribly unrealistic. 2/2/99 Also added a -f flag to automatically overwrite existing files. evolver will never overwrite a file that does not have write permission, even with the -f flag. Version 1.2: Now introduces random rate variation with the "s" parameter. Version 1.21: Added multiple PAM matrices to increase speed. Version 1.22: Took out the "random rate variation" crap since it was dumb. 6/23/99 Version 2.0: Fixed a very subtle problem with the way distance specific scoring matrices were calculated; before 2.0, the distances were corrected for the fact that the starting point is PAM1 by subtracting 1 from the distances. However, if the distance happens to be a multiple of 10, the corrected distance is "-1 PAM1's", which the loop for $n1 ignores and therefore the correction never gets done (i.e. it can't subtract distance from substitution matrices). In 2.0 on, 1 PAM is subtracted from the desired distance ($dist[$k]) instead of from the number of PAM1's needed ($n1). Version 3.0: Changes: Changed random seed from default (time) to more complex seed involving time and PID. Created a subroutine (makeindel) to include indel's into mutation model based on the indel model of Benner, Cohen and Gonnet (J. Mol. Biol. 1993 229, 1065-1082). Use the -g to allow indel's. Version 3.5: - Moved matrices into code, made makematrices subroutine to load matrices into variable - Changed code so that if rates file is not found, it will prompt user to default to rates of '1' for all proteins in sequence file - Made -w flag for web interface. Sequence_file is read in through STDIN (cat, echo, <, etc), only the time argument and ratesfile is supplied. This saves from having to write the sequences to a temp file... Version 4.0: - Implemented DNA mutation 3/2001 - Comments: Do not mix nucleotide and protein sequences together in the same sequence file. The program will interpret the sequences as either all nucleotide or all protein sequences, so mixing the two will result in one type (either nucleotides or proteins) not being mutated correctly. - V.4.08 Implemented indel model of Pascarella, et al., based on J. Mol. Biol. (1992) 224, 461-471. CHANGED TO SEQEVOLVER, Version 0.5 - March 2002 - Various minor fixes Version 0.6 - Added -d option which supresses -'s when writing out sequences Version 0.65 - Changed formula for calculating distances slightly - used to int(time) before multiplying times rate. This (silently) prevented non-integer times. Now dist = int(time * rate), which allows non-integer times (with accompanying rounding error dangers). Version 0.7 - Added -r options for random rate variation within a sequence. IMPORTED INTO CVS (consult CVS log for changes implemented after 3/11/03)