NAME
    SEQEVOLVER

PURPOSE
    SeqEvolver simulates the evolution of a collection of DNA or protein
    sequences under Jones/Dayhoff's PAM model, with user-defined
    substitution rates and evolution time.

USAGE
    Usage: seqevolver.pl --seq_file [sequence_file] --out_file [out_file]
    --rates_file [rates_file] --time [time] [--force/--noforce]
    [--benner/--pascarella] [--nucleotide/--protein] [--biased/--uniform]
    [--rate_var_on/--rate_var_param '0.45 10'] [--nodashes] [--loss_gain]
    [--loss_gain_rate 0.00298]

    Restrictions/usage: 1. SeqEvolver only works on sequence libraries in
    the FASTA sequence format. 2. "sequence_file" is the full name of the
    file that contains the "ancestral" sequences. 3. "outfile" is the name
    of the file you want the evolved sequences to be written to. 4. "t" is
    time, and can be any real number, and ratesfile is a file that has as
    many evolutionary rates as there are sequences. You can use whatever
    units you want as long as (time * rate) yields a PAM distance. 5. The
    default is to use force and overwrite existing output files, and not ask
    for user interaction. Use the --noforce flag to ask for permission to
    overwrite existing files and to turn off interactive features 6. Use the
    --benner or --pascarella flag to include indels. The --benner flag uses
    a model based on Benner, et al. J. Mol. Biol. (1993) 229, 1065-1082. The
    --pascarella flag uses a model based on Pascarella, et al., J. Mol.
    Biol. (1992) 224, 461-471. Deletions will be -'s and inserts will be
    lower case letters. 7. Use the --nucleotide or --protein flags to force
    the program to interpret the sequence files as nucleotide or protein
    sequences respectively (in the absence of these flags, the program will
    guess at the sequence type based on ACTG content). Note that sequences
    should be either all nucleotides or all proteins, because that is how
    SeqEvolver will treat them. 8. The --uniform (uniform) or --biased
    (biased) flags are for DNA - the --uniform flag tells SeqEvolver to use
    a probability matrix in which transitions and transversion occur with
    equal frequency, while the --biased flag indicates that SeqEvolver
    should use a probability matrix in which transition occur about 3 times
    as often as transversions. For details, see States, et al. (1991)
    Methods: A Companion to Methods in Enzymology. 3(1), 66-70. 9. Currently
    SeqEvolver has no indel model implemented for DNA, so the --benner and
    --nucleotide flags may not be used together (this will generate an
    error). 10. Use --rate_var_on to turn on random rate variation. By
    default a window size of 10 will be used (meaning the rate will change
    randomly every 10 amino acids) and the standard deviation of the rates
    chosen for each window will be (0.45 * PAM distance). If you want to
    supply your own standard deviation for the rates and your own window
    size, use the --rate_var_param 'SD WINDOWSIZE', for example:

        --rate_var_param '0.3 15'

        That would set the standard deviation of the rates to be 0.3 * PAM distance
        and the window size to 15. The --rate_var_param flag implicitly sets the
        --rate_var_on flag.

        The default values of 0.45 * PAM distance comes from an experiment I (jr)
        did - I looked at several thousand alignments and observed how the percent
        id (and thus the PAM distance) varied in window along the alignment. It
        turns out the standard deviation of the percent id varied almost exactly
        linearly as a function of percent id of the entire alignment. Maybe someday
        I'll actually publish this!

        To implement random rate variation, the sequence is divided up into
        windows (default is 10 amino acids), and each window is assigned a
        mutation rate. The mutation rate is drawn from a normal distribution
        around the PAM distance for the whole sequence. The default standard
        deviation for the normal distribution is 0.45 * PAM distance for the
        whole sequence. The user can substitute 0.45 with any fraction he wants
        for the rate variation. No support for a 'constant' normal distribution
        irrespective of PAM distance is currently supported. If you want to do this,
        you're going to have to do separate SeqEvolver runs for each PAM distance,
        changing the 'fraction' parameter for each run so that that the standard
        deviation (fraction * PAM distance) is constant.

NOTES
=head2 Point mutations (substitutions)
    Point mutations are generated by first generating a distance specific
    probability matrix for each distance. This is done by matrix
    multiplication of the appropriate PAM100, PAM10 and PAM1 matrices (e.g.
    PAM14 = PAM10 X PAM1 X PAM1 X PAM1 X PAM1). Then, for each residue, a
    random number between 0 and 1 is selected and for each column
    (corresponding to the identity of the unmutated residue), the
    probabilities are added together until the random number is less or
    equal to the sum of the probabilities, and the residue is changed to the
    residue in that row; in other words, the residue is changed to the first
    residue corresponding to the row for which the cumulative sum is greater
    than the random number. (If that doesn't make sense, see pseudocode
    below.) Note that if SeqEvolver doesn't recognize a residue, it leaves
    it unchanged. It will, however, at least tell you that it doesn't
    recognize the residue, provided you are listening to STDERR.

    The probability matrices for protein are based on Jones D.T., et al.
    (1992) CABIOS 8(3), 275-282 - these matrices are basically an update of
    Margaret Dayhoff's original PAM matrices. Two DNA matrices are
    available, both based on States, et al. (1991) Methods: A Companion to
    Methods in Enzymology. 3(1), 66-70. The two different DNA matrices
    correspond to two different models of DNA mutation. In the uniform
    model, transitions and transversions occur with equal probability. In
    the biased model, transitions occur about three times as often as
    transversions. The latter model is a bit more realistic. I will add more
    elaborate models as my needs (or other's requests) dictate; feel free to
    contact me (jtr4v@virginia.edu) if you'd like me to incorporate another
    model.

  Insertions/Deletions (Indels)
    Two insertions/deletion (indel) models are implemented for protein
    sequences - that of Benner et al., (Benner, et al. J. Mol. Biol. (1993)
    229, 1065-1082) and that of Pascarella, et al. (Pascarella, et al., J.
    Mol. Biol. (1992) 224, 461-471).

    A few comments/caveats: - In the Benner model, the probability of an
    indel (but NOT indel size) increases with PAM distance. - After deciding
    to allow an indel, the algorithm randomly chooses between insertions and
    deletions using a coin toss (50/50). - The Benner model assumes a
    Zipfian distribution of indel sizes: p(indel len=k) = constant /
    (k**exponent). In contrast to an exponential distribution, the average
    length generated by a Zipfian distribution can vary wildly with each
    trial. This means that the Zipfian distribution is much more likely to
    occasionally produce an extremely long indel. - The amino acids in
    inserts are chosen randomly based on the amino-acid frequencies in
    SwissProt.

  Pseudocode
    <pre> #construct distance specific scoring matrix for (0 .. hundreds) {
    if (matrix is undefined) { matrix = PAM100; next; } matrix = matrix *
    PAM100; } for (0 .. tens) { if (matrix is undefined) { matrix = PAM10;
    next; } matrix = matrix * PAM10; } for (0 .. ones) { if (matrix is
    undefined) { matrix = PAM10; next; } matrix = matrix * PAM1; }

    #point mutations foreach ( residue ) { probability[0 .. num_letters - 1]
    = column of probabilities for residue from current_matrix; random =
    rand(1); for n( 0 .. num_letters - 1 ) { sum = sum + probability[n]; if
    ( random <= sum ) { new_residue = residue[n]; break; } } }

    #insertions/deletions foreach ( residue ) { random = rand(1); threshold
    = 0.0224 - 0.0219 * exp( -0.0102 ? target_pam_distance); # Benner et al.
    - or - threshold = ; # Pascarella et al.

            if ( random <  threshold )
            {
                    length = (choose length from Zipfian distribution); # Benner, et al.
                         - or - 
                    length = (choose length from ... ) # Pascarella, et al.
                    random = rand(1);
                    if ( random > 0.5 ) # insertion
                    {
                            insert (length) number of random amino acids;
                    }
                    else
                    {
                            delete (length) number of amino acids;
                    }
            }
    }

    </pre>

FEEDBACK
    jtr4v@virginia.edu

REPORTING BUGS
    jtr4v@virginia.edu

AUTHORS
    Justin Reese, Todd Wood

REVISION LOG
    Version 0.2: Changed the random rate selecting algorithm to choose
    random PAM numbers rather than % identities. 10/30/97

    Version 0.3: Changing random number generation to read a set of random
    numbers generated from Splus, these are rates of mutation

    Version 0.4: Changed it to generate a new PAM matrix for each sequence.

    Version 1.0: Altered evolver to allow random sequences to be inserted
    and deleted.

    Version 1.1: Created a routine for self-calculating the mutational
    probability matrix from a PAM1 matrix, so I don't have to call a
    separate program each time. I'm also removing the
    randomizing/pseudo-transfer module, since it is terribly unrealistic.
    2/2/99 Also added a -f flag to automatically overwrite existing files.
    evolver will never overwrite a file that does not have write permission,
    even with the -f flag.

    Version 1.2: Now introduces random rate variation with the "s"
    parameter.

    Version 1.21: Added multiple PAM matrices to increase speed.

    Version 1.22: Took out the "random rate variation" crap since it was
    dumb. 6/23/99

    Version 2.0: Fixed a very subtle problem with the way distance specific
    scoring matrices were calculated; before 2.0, the distances were
    corrected for the fact that the starting point is PAM1 by subtracting 1
    from the distances. However, if the distance happens to be a multiple of
    10, the corrected distance is "-1 PAM1's", which the loop for $n1
    ignores and therefore the correction never gets done (i.e. it can't
    subtract distance from substitution matrices). In 2.0 on, 1 PAM is
    subtracted from the desired distance ($dist[$k]) instead of from the
    number of PAM1's needed ($n1).

    Version 3.0: Changes: Changed random seed from default (time) to more
    complex seed involving time and PID.

    Created a subroutine (makeindel) to include indel's into mutation model
    based on the indel model of Benner, Cohen and Gonnet (J. Mol. Biol. 1993
    229, 1065-1082). Use the -g to allow indel's.

    Version 3.5: - Moved matrices into code, made makematrices subroutine to
    load matrices into variable - Changed code so that if rates file is not
    found, it will prompt user to default to rates of '1' for all proteins
    in sequence file - Made -w flag for web interface. Sequence_file is read
    in through STDIN (cat, echo, <, etc), only the time argument and
    ratesfile is supplied. This saves from having to write the sequences to
    a temp file...

    Version 4.0: - Implemented DNA mutation 3/2001 - Comments: Do not mix
    nucleotide and protein sequences together in the same sequence file. The
    program will interpret the sequences as either all nucleotide or all
    protein sequences, so mixing the two will result in one type (either
    nucleotides or proteins) not being mutated correctly. - V.4.08
    Implemented indel model of Pascarella, et al., based on J. Mol. Biol.
    (1992) 224, 461-471.

    CHANGED TO SEQEVOLVER, Version 0.5 - March 2002 - Various minor fixes

    Version 0.6 - Added -d option which supresses -'s when writing out
    sequences

    Version 0.65 - Changed formula for calculating distances slightly - used
    to int(time) before multiplying times rate. This (silently) prevented
    non-integer times. Now dist = int(time * rate), which allows non-integer
    times (with accompanying rounding error dangers).

    Version 0.7 - Added -r options for random rate variation within a
    sequence.

    IMPORTED INTO CVS (consult CVS log for changes implemented after
    3/11/03)