FASTA/SSEARCH/[T]FASTX/Y/LALIGN
Section: User Commands (1)
Updated: local
FASTA Introduction
NAME
fasta35, fasta35_t* | scan a protein or DNA sequence library for similar
sequences
fastx35, fastx35_t | compare a DNA sequence to a protein sequence
database, comparing the translated DNA sequence in forward and
reverse frames.
tfastx35, tfastx35_t | compare a protein sequence to a DNA sequence
database, calculating similarities with frameshifts to the forward and
reverse orientations.
fasty35, fasty35_t | compare a DNA sequence to a protein sequence
database, comparing the translated DNA sequence in forward and reverse
frames.
tfasty35, tfasty35_t | compare a protein sequence to a DNA sequence
database, calculating similarities with frameshifts to the forward and
reverse orientations.
fasts35, fasts35_t | compare unordered peptides to a protein sequence
database
fastm35, fastm35_t | compare ordered peptides (or short DNA sequences)
to a protein (DNA) sequence database
tfasts35, tfasts35_t | compare unordered peptides to a translated DNA
sequence database
fastf35, fastf35_t | compare mixed peptides to a protein sequence database
tfastf35, tfastf35_t | compare mixed peptides to a translated DNA
sequence database
ssearch35, ssearch35_t | compare a protein or DNA sequence to a
sequence database using the Smith-Waterman algorithm.
ggsearch35, ggsearch35_t | compare a protein or DNA sequence to a
sequence database using a global alignment (Needleman-Wunsch)
glsearch35, glsearch35_t | compare a protein or DNA sequence to a
sequence database with alignments that are global in the query and
local in the database sequence (global-local).
lalign35 | produce multiple non-overlapping alignments for protein
and DNA sequences using the Huang and Miller sim algorithm for the
Waterman-Eggert algorithm.
prss35, prfx35 | discontinued; all the FASTA programs will estimate
statistical significance using 500 shuffled sequence scores if two
sequences are compared.
*_t programs are threaded, to run faster on multi-core processors.
DESCRIPTION
Release 3.5 of the FASTA package provides a modular set of sequence
comparison programs that can run on conventional single processor
computers or in parallel on multiprocessor computers. More than a
dozen programs - fasta35, fastx35/tfastx35, fasty35/tfasty35,
fasts35/tfasts35, fastm35, fastf35/tfastf35, ssearch35, ggsearch35,
and glsearch35 - are currently available.
All of the comparison programs share a set of basic command line
options; additional options are available for individual comparison
functions.
Threaded versions of the FASTA programs (fasta35_t, ssearch35_t, etc.)
will run in parallel on modern Linux and Unix multi-core or
multi-processor computers. Accelerated versions of the Smith-Waterman
algorithm are available for processors with Intel SSE2 or PowerPC
Altivec extensions; they can speed up Smith-Waterman calculations
10- to 20-fold.
In addition to the serial and threaded versions of the FASTA programs,
PVM and MPI parallel versions are available as pv35compfa,
mp35compfa, pv35compsw, mp35compsw, etc. For more
information, see pvcomp.1 and readme.pvm_mpi. The PVM/MPI
program versions use the same command line options as the serial and
threaded FASTA program versions.
Running the FASTA programs
Although the FASTA programs can be run interactively, prompting for a query
file and a library, it is usually more convenient to run them from the
command line of a Unix or MacOSX terminal or a Windows shell. Thus,
fasta35_t -q -option1 -option2 -option3 query.file library.file > fasta.output
runs the threaded version of the fasta35 program, without asking for
any input (-q), setting various parameter and output options, and
comparing the sequences in query.file to the sequences in
library.file. Optional arguments to the FASTA programs must
precede the query.file, library.file, and optional ktup
arguments. The FASTA programs provide an option (-O) for
sending output to a file, but generally it is better to simply
redirect output with the ">" shell symbol.
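The argument order described above can be sketched as follows. This is a minimal illustration, not a recommended parameter set: the file names are placeholders, and the -E and ktup values are only examples. The search is attempted only when the program and both input files are actually present; otherwise the script just prints the command line.

```shell
# File names are hypothetical placeholders.
query=query.file
library=library.file
ktup=2                      # optional ktup argument, always given last

# Options first, then the query file, the library file, and ktup.
cmd="fasta35_t -q -E 10.0 $query $library $ktup"
echo "$cmd"

# Run only if the program and both input files are present.
if command -v fasta35_t >/dev/null 2>&1 && [ -f "$query" ] && [ -f "$library" ]; then
    $cmd > fasta.output     # redirect the report instead of using -O
fi
```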
FASTA program options
The default scoring matrix and gap penalties used by each of the
programs have been selected for high sensitivity searches with the
various algorithms. The default program behavior can be modified by
providing command line options before the query.file and
library.file arguments. Command line options can also be used in
interactive mode.
Command line arguments come in several classes.
(1) Commands that specify the comparison type. FASTA, FASTS, FASTM,
SSEARCH, GGSEARCH, and GLSEARCH can compare either protein or DNA
sequences, and attempt to recognize the comparison type by looking at the
residue composition. -n, -p specify DNA (nucleotide) or
protein comparison, respectively. -U specifies RNA comparison.
(2) Commands that limit the set of sequences compared: -1,
-3, -M.
(3) Commands that modify the scoring parameters: -f gap-open penalty, -g
gap-extend penalty, -h inter-codon frame-shift, -j
within-codon frame-shift, -s scoring-matrix, -r
match/mismatch score, -x X:X score.
(4) Commands that modify the algorithm (mostly FASTA and [T]FASTX/Y):
-c, -w, -y, -o. The -S option can be used to
ignore lower-case (low complexity) residues during the initial score
calculation.
(5) Commands that modify the output: -A, -b number, -C
width, -d number, -L, -m 0-11, -w
line-width, -W context-width, -X offset1,offset2.
(6) Commands that affect statistical estimates: -Z, -k.
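As a rough illustration of how options from the classes above combine, the sketch below builds one ssearch35 command line step by step and prints it. The file names (query.aa, library.aa) and the specific values (101-200, BL62, 1000) are assumptions chosen for the example, not recommended settings.

```shell
# Start from an empty option list, then add options class by class.
set --
set -- "$@" -q              # do not prompt for input
set -- "$@" -p              # (1) comparison type: protein
set -- "$@" -M 101-200      # (2) only library sequences of 101-200 residues
set -- "$@" -s BL62         # (3) scoring matrix: BLOSUM62
set -- "$@" -S              # (4) ignore lower-case residues in the initial scan
set -- "$@" -m 9 -L         # (5) output: extra score columns, long descriptions
set -- "$@" -k 1000         # (6) statistics: number of shuffles

# All options precede the query and library file names.
cmd="ssearch35 $* query.aa library.aa"
echo "$cmd"
```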
Option summary:
- -1
-
Sort by "init1" score (obsolete)
- -3
-
(TFASTX/Y35 only) use only forward frame translations
- -a #
-
"SHOWALL" option attempts to align all of both sequences in FASTA and SSEARCH.
- -A
-
(FASTA35 DNA comparison only) force Smith-Waterman alignment for
output. Smith-Waterman is the default for FASTA protein alignment and
[T]FASTX/Y, but not for DNA comparisons with FASTA.
- -b #
-
number of best scores to show (must be < expectation cutoff if -E is given).
By default, this option is no longer used; all scores better than the
expectation (E()) cutoff are listed.
- -B
-
show z-scores rather than bit scores (for compatibility with much
older versions).
- -c #
-
threshold for band optimization (FASTA, [T]FASTX/Y)
- -C #
-
length of name abbreviation in alignments, default = 6. Must be less
than 20.
- -d #
-
number of best alignments to show (must be < expectation (-E) cutoff)
- -D
-
turn on debugging mode. Enables checks on sequence alphabet that
cause problems with tfastx35, tfasty35 (only available after compile
time option).
- -E #
-
expectation value upper limit for score and alignment display.
Defaults are 10.0 for FASTA35 and SSEARCH35 protein searches, 5.0 for
translated DNA/protein comparisons, and 2.0 for DNA/DNA searches.
- -f #
-
penalty for opening a gap.
- -F #
-
expectation value lower limit for score and alignment display.
-F 1e-6 prevents library sequences with E()-values lower than 1e-6
from being displayed. Use to shift focus to more distant
relationships.
- -g #
-
penalty for additional residues in a gap
- -h #
-
([T]FASTX/Y only) penalty for a frameshift between two codons.
- -j #
-
([T]FASTY only) penalty for a frameshift within a codon.
- -H
-
turn off histogram display. (The meaning of -H is reversed with the
PVM/MPI parallel versions, where the histogram display is off by default).
- -i
-
(FASTA DNA, [T]FASTX/Y) compare against
only the reverse complement of the library sequence.
- -k #
-
specify number of shuffles for statistical parameter estimation (default=500).
- -l str
-
specify FASTLIBS file
- -L
-
report long sequence description in alignments (up to 200 characters).
- -m 0,1,2,3,4,5,6,9,10,11
-
alignment display options. -m 0, 1, 2, 3
display different types of alignments. -m 4 provides an
alignment "map" on the query. -m 5 combines the alignment map
and a -m 0 alignment. -m 6 provides an HTML output.
- -m 9
-
does not change the alignment output, but provides
alignment coordinate and percent identity information with the best
scores report. -m 9c adds encoded alignment information to the
-m 9; -m 9i provides only percent identity and alignment
length information with the best scores. With current versions of the
FASTA programs, independent -m options can be combined;
e.g. -m 1 -m 9c -m 6.
- -m 11
-
provide lav format output from lalign35. It does not
currently affect other alignment algorithms. The lav2ps and
lav2svg programs can be used to convert lav format output
to postscript/SVG alignment "dot-plots".
- -M #-#
-
molecular weight (residue) cutoffs. -M "101-200" examines only sequences that are 101-200 residues long.
- -n
-
force query to nucleotide sequence
- -N #
-
break long library sequences into blocks of # residues. Useful for
bacterial genomes, which have only one sequence entry. -N 2000 works
well for bacterial genomes.
- -o
-
(FASTA) turn fasta band optimization off during initial phase. This was
the behavior of fasta1.x versions (obsolete).
- -O file
-
send output to file.
- -p
-
Force query sequence type to protein.
- -P "file type"
-
specify a PSI-BLAST PSSM file of type "type". Available types are:
0 - ascii PSSM file, produced by blastpgp -Q file.pssm
1 - binary (architecture dependent) PSSM file, produced by blastpgp -C file.pssm -u 0
2 - binary ASN.1 (architecture independent) PSSM file, produced by blastpgp -C file.pssm -u 2
- -q/-Q
-
quiet option; do not prompt for input
- -r "+n/-m"
-
(DNA only) values for match/mismatch for DNA comparisons. +n is
used for the maximum positive value and -m is used for the
maximum negative value. Values between max and min are rescaled, but
residue pairs having the value -1 continue to be -1.
- -R file
-
save all scores to statistics file (previously -r file)
- -s name
-
specify substitution matrix. BLOSUM50 is used by default;
PAM250, PAM120, and BLOSUM62 can be specified by setting -s P120,
P250, or BL62. With this version, many more scoring matrices are
available, including BLOSUM80 (BL80), and MDM10, MDM20, MDM40 (Jones,
Taylor, and Thornton, 1992 CABIOS 8:275-282; specified as -s M10, -s
M20, -s M40). Alternatively, BLASTP1.4 format scoring matrix files can
be specified. BL80, BL62, and P120 are scaled in 1/2 bit units; all
the other matrices use 1/3 bit units. DNA scoring matrices can also
be specified with the "-r" option.
- -S
-
treat lower case letters in the query or database as low complexity
regions that are equivalent to 'X' during the initial database scan,
but are treated as normal residues for the final alignment display.
Statistical estimates are based on the 'X'ed out sequence used during
the initial search. Protein databases (and query sequences) can be
generated in the appropriate format using John Wootton's "pseg"
program, available from ftp://ncbi.nlm.nih.gov/pub/seg/pseg. Once you
have compiled the "pseg" program, use the command:
-
pseg database.fasta -z 1 -q > database.lc_seg
- -t #
-
Translation table - [t]fastx35 and [t]fasty35 support the BLAST
translation tables. See the
NCBI Genetic Code site.
In addition, you can score for the end of a protein match with '-t t',
which will add "*" to the end of your query sequences (but your
protein library sequences must also have '*'). Built in protein
matrices know about '*:*' matches; if you want to use '-t t' with your
own matrix, you will need to include '*' in the matrix.
- -T #
-
(threaded, parallel only) number of threads or workers to use (set by
default to 4 at compile time).
- -U
-
Do RNA sequence comparisons: treat 'T' as 'U', allow G:U base pairs (by
scoring "G-A" and "T-C" as "G-G" -1). Search only one strand.
- -V "?$%*"
-
Allow special annotation characters in query sequence. These characters
will be displayed in the alignments on the coordinate number line.
- -w #
-
line width for similarity score, sequence alignment, output.
- -W #
-
context length (default is 1/2 of line width -w) for programs,
like fasta and ssearch, that provide additional sequence context.
- -x #match,#mismatch
-
scores used for 'X:X', 'N:N', and '*:*' matches and for the corresponding
mismatches, in place of the values specified in the scoring matrix. If
only one value is given, it is used for both values.
- -X "#,#"
-
offsets query, library sequence for numbering alignments
- -y #
-
Width for band optimization; by default 16 for DNA and protein ktup=2;
32 for protein ktup=1.
- -z #
-
Specify statistical calculation. Default is -z 1 for local
similarity searches, which uses regression against the length of the
library sequence. -z -1 disables statistics. -z 0 estimates
significance without normalizing for sequence length. -z 2 provides
maximum likelihood estimates for lambda and K, censoring the 250
lowest and 250 highest scores. -z 3 uses Altschul and Gish's
statistical estimates for specific protein BLOSUM scoring matrices and
gap penalties. -z 4,5: an alternate regression method. -z 6 uses a
composition based maximum likelihood estimate based on the method of
Mott (1992) Bull. Math. Biol. 54:59-75. -z 11,12,14,15,16: compute
the regression against scores of randomly shuffled copies of the
library sequences. Twice as many comparisons are performed, but
accurate estimates can be generated from databases of related
sequences. -z 11 uses the -z 1 regression strategy, etc.
- -Z db_size
-
Set the apparent database size used for expectation value calculations
(used for protein/protein FASTA and SSEARCH, and for [T]FASTX/Y).
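A few of the options summarized above are easier to follow as complete command lines. The sketch below is a dry run: show is a hypothetical helper (not part of FASTA) that only prints its arguments, and all file names are placeholders. Dropping the show prefix and redirecting with ">" would run the real search.

```shell
# Dry-run helper (not part of FASTA): prints the command line it is given.
show() { echo "$*"; }

# Distant relationships only: hide hits with E() below 1e-6, keep E() up to 10.
distant=$(show fasta35 -q -F 1e-6 -E 10.0 query.aa library.aa)

# Independent -m options combined in one run.
report=$(show ssearch35 -q -m 1 -m 9c -m 6 query.aa library.aa)

# DNA query against a single-entry bacterial genome, split into 2000-residue blocks.
genome=$(show fasta35 -q -n -N 2000 query.nt genome.nt)

# Search a pseg-masked library, treating lower-case residues as 'X' in the scan.
masked=$(show fasta35 -q -S query.aa database.lc_seg)

printf '%s\n' "$distant" "$report" "$genome" "$masked"
```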
Reading sequences from STDIN
The FASTA programs have been modified to accept a query sequence from
the unix "stdin" data stream. This makes it much easier to use
fasta35 and its relatives as part of a WWW page. To indicate that
stdin is to be used, use "@" as the query sequence file name. "@" can
also be used to specify a subset of the query sequence to be used,
e.g.:
cat query.aa | fasta35 -q @:50-150 s
would search the 's' database with residues 50-150 of query.aa. FASTA
cannot automatically detect the sequence type (protein vs DNA) when
"stdin" is used and assumes protein comparisons by default; the '-n'
option is required for DNA for STDIN queries.
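For a DNA query read from stdin, the same form applies with -n added. The following is a minimal sketch under assumed file names (query.nt, library.nt); it prints the pipeline and runs it only when fasta35 and both files are available.

```shell
# Hypothetical DNA query and library files.
query=query.nt
library=library.nt
range=50-150

# "@" reads the query from stdin; ":50-150" selects a subset of it.
stdin_arg="@:$range"

# -n is needed because the sequence type is not detected on stdin.
cmd="fasta35 -q -n $stdin_arg $library"
echo "cat $query | $cmd"

if command -v fasta35 >/dev/null 2>&1 && [ -f "$query" ] && [ -f "$library" ]; then
    cat "$query" | $cmd > fasta.output
fi
```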
Environment variables:
- FASTLIBS
-
location of library choice file (-l FASTLIBS)
- SMATRIX
-
default scoring matrix (-s SMATRIX)
- SRCH_URL
-
the format string used to define the option to re-search the
database.
- REF_URL
-
the format string used to define the option to lookup the library
sequence in entrez, or some other database.
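The first two variables might be set in a shell startup file along these lines. This is only a sketch: the path is a placeholder, and it assumes that SMATRIX accepts the same matrix names as the -s option (here BL62), which the descriptions above suggest but do not state outright.

```shell
# Hypothetical location of the library choice file.
FASTLIBS=/path/to/fastlibs
# Assumption: same matrix names as -s (BL62 = BLOSUM62).
SMATRIX=BL62
export FASTLIBS SMATRIX

# The roughly equivalent explicit command line, per "-l FASTLIBS" and "-s SMATRIX".
equiv="-l $FASTLIBS -s $SMATRIX"
echo "$equiv"
```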
AUTHOR
Bill Pearson
wrp@virginia.EDU