The fasta-36.3.8e/psisearch2/ directory now provides psisearch2_msa.pl and psisearch2_msa.py, functionally identical scripts for iterative searching with psiblast or ssearch36. psisearch2_msa.pl offers an option, --query_seed, that can dramatically reduce false positives caused by alignment overextension, with very little loss of search sensitivity.
The fasta-36.3.8d/scripts/ directory now provides a script, annot_blast_btop2.pl, that allows annotations and sub-alignment scoring on BLAST alignments that use the tabular format with BTOP alignment encoding.
    1 [ - GST_N
    88 ] -
    90 [ - GST_C
    208 ] -

Since the closing "]" was associated with the previous "[", domains could not overlap.
The new format is:

    1 - 88 GST_N
    90 - 208 GST_C

which allows annotations of the form:

    1 - 88 GST_N
    75 - 123 GST-middle
    90 - 208 GST_C
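The practical difference of the new format is that each record carries both its start and its end, so overlapping domains can be represented directly. The following is a minimal sketch, not part of the FASTA distribution, that parses records of the form shown above and lists the overlapping pairs. It assumes one whitespace-separated "start - end name" record per line; the names Domain, parse_domains and overlapping_pairs are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Domain(NamedTuple):
    start: int
    end: int
    name: str


def parse_domains(lines: List[str]) -> List[Domain]:
    """Parse 'start - end name' records (assumed: one record per line)."""
    domains = []
    for line in lines:
        fields = line.split()
        # Skip anything that does not look like a 'start - end name' record.
        if len(fields) < 4 or fields[1] != "-":
            continue
        domains.append(Domain(int(fields[0]), int(fields[2]), fields[3]))
    return domains


def overlapping_pairs(domains: List[Domain]) -> List[Tuple[str, str]]:
    """Return the names of every pair of domains that share a residue."""
    pairs = []
    for i, a in enumerate(domains):
        for b in domains[i + 1:]:
            if a.start <= b.end and b.start <= a.end:
                pairs.append((a.name, b.name))
    return pairs


records = ["1 - 88 GST_N", "75 - 123 GST-middle", "90 - 208 GST_C"]
for pair in overlapping_pairs(parse_domains(records)):
    print(pair)
```

With the three-domain example, GST-middle overlaps both of its neighbours, while GST_N and GST_C do not overlap each other.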
Scripts in the fasta-36.3.8/scripts directory, e.g. ann_pfam_www_e.pl (Pfam) and ann_up_www2_e.pl (Uniprot), support this new format. If the domain annotations provided by Pfam or Uniprot overlap, then overlapping domains are provided. The new _e.pl scripts can be directed to provide non-overlapping domains, using the boundary averaging strategy in the older scripts, by specifying the --no-over option.
FASTA version 36.3.6f extends previous versions in several ways:
An option, -XI, causes the alignment programs to report 100% identity only when there are no mismatches. In previous versions, one mismatch in 10,000 would round up to 100.0% identity; with -XI, the identity will be reported as 99.9%.
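The arithmetic behind this change can be illustrated with a short sketch. It is not the FASTA implementation: it simply assumes that the older behavior rounds to one decimal place and that -XI behaves like truncation to one decimal place, which reproduces the 100.0% and 99.9% figures quoted above. The function name percent_identity and its strict flag are hypothetical.

```python
def percent_identity(n_identical: int, n_aligned: int, strict: bool = False) -> float:
    """Percent identity to one decimal place.

    strict=False rounds to the nearest tenth (assumed older behavior);
    strict=True truncates, so 100.0 appears only with zero mismatches.
    """
    if strict:
        # Integer arithmetic: number of whole tenths of a percent.
        tenths = (1000 * n_identical) // n_aligned
        return tenths / 10.0
    return round(100.0 * n_identical / n_aligned, 1)


# One mismatch in 10,000 aligned positions.
print("%.1f" % percent_identity(9999, 10000))
print("%.1f" % percent_identity(9999, 10000, strict=True))
```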
Additional bug fixes are documented in fasta-36.3.6f/doc/readme.v36
FASTA version 36.3.6 provides two new features:
(fasta-36.3.5 January 2013)
The NCBI's transition from BLAST to BLAST+ several years ago broke the ability of ssearch36 to use PSSMs, because psiblast did not produce the binary ASN.1 PSSMs that ssearch36 could parse. With the January 2013 fasta-36.3.5f release, ssearch36 can read binary ASN.1 PSSM files produced by the NCBI datatool utility. See fasta_guide.pdf for more information (look for the -P option).
fasta36 presents a short help message, and fasta36 -help presents a complete list of options. To see the interactive prompts, use fasta36 -I.
Likewise, the score histogram is no longer shown by default; use the -H option to show the histogram (or compile with -DSHOW_HIST for previous behavior).
The _t (fasta36_t) versions of the programs are built automatically on Linux/MacOSX machines and named fasta36, etc. (the programs are threaded by default, and only one program version is built).
Documentation has been significantly revised and updated. See doc/fasta_guide.pdf for a description of the programs and options.
For fasta36, ssearch36, and [t]fast[xy]36, if the library sequence contains additional significant alignments, they will be displayed with the alignment output, and as part of -m 9 output (the initial list of high scores).
By default, the statistical threshold for alternate alignments (HSPs) is the E()-threshold / 10.0. For proteins, the default expect threshold is E() < 10.0, so the secondary threshold for showing alternate alignments is E() < 1.0. For translated comparisons, the E()-thresholds are 5.0/0.5; for DNA:DNA, 2.0/0.2.
Both the primary and secondary E()-thresholds are set with the -E "prim sec" command line option. If the secondary value is between zero and 1.0, it is taken as the actual threshold. If it is > 1.0, it is taken as a divisor for the primary threshold. If it is negative, alternative alignments are disabled and only the best alignment is shown.
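The rules for interpreting the secondary value can be summarized in a few lines. This is a sketch of the rules as described here, not code from the FASTA source; the function name secondary_threshold is hypothetical, and treating exactly 0 and exactly 1.0 as actual thresholds is an assumption, since the text only specifies "between zero and 1.0" and "> 1.0".

```python
from typing import Optional


def secondary_threshold(primary: float, secondary: Optional[float] = None) -> Optional[float]:
    """Return the E() threshold for alternate alignments, or None if disabled."""
    if secondary is None:
        # Default: the secondary threshold is the primary threshold / 10.0.
        return primary / 10.0
    if secondary < 0:
        # Negative: alternate alignments are disabled.
        return None
    if secondary > 1.0:
        # Greater than 1.0: the value is a divisor for the primary threshold.
        return primary / secondary
    # Otherwise the value is used as the threshold itself
    # (boundary values 0 and 1.0 land here by assumption).
    return secondary


print(secondary_threshold(10.0))        # protein default
print(secondary_threshold(10.0, 0.5))   # explicit threshold
print(secondary_threshold(10.0, 20.0))  # divisor
print(secondary_threshold(10.0, -1.0))  # disabled
```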
The -z 21, 22, and 26 options provide a second E()-value estimate based on shuffles of the highest scoring sequences.
-m 8 provides the same output format as tabular BLAST; -m 8C mimics tabular BLAST with comment lines. -m 9C provides CIGAR encoded alignments.
(fasta-36.3.4) Alignment option -m B provides BLAST-like alignments (no context, coordinates at the beginning and end of the alignment line, Query/Sbjct).
(fasta36, [t]fast[xy]36) By default (fasta36.3), fasta36 and [t]fast[xy]36 can use a strategy similar to BLAST's to set the thresholds for combining ungapped regions and performing band alignments. This dramatically reduces the number of band alignments performed, for a speed increase of 2 - 3X. The original statistical thresholds can be enabled with the -c O (upper-case letter 'O') command line option.
Protein and translated protein alignment programs can also use ktup=3
for increased speed, though ktup=2 is still the default.
Statistical thresholds can dramatically reduce the number of "optimized" scores, from which statistical estimates are calculated. To address this problem, the statistical estimation procedure has been adjusted to correct for the fraction of scores that were optimized. This process can dramatically improve statistical accuracy for some matrices and gap penalties, e.g. BLOSUM62 -11/-1.
With the new joining thresholds, the -c "E-opt E-join" options have expanded meanings. -c "E-opt E-join" calculates a threshold designed (but not guaranteed) to do band optimization and joining for that fraction of sequences. Thus, -c "0.02 0.1" seeks to do band optimization (E-opt) on 2% of alignments, and joining on 10% of alignments. -c "40 10" sets the gap threshold as in earlier versions.
An option (-e expand_script.sh) is available that allows the set of sequences that are aligned to be larger than the set of sequences searched. When the -e expand_script.sh option is used, the expand_script.sh script is run with an input argument that is a file of accession numbers and E()-values; this information can be used to produce a fasta-formatted list of additional sequences, which will then be compared and aligned (if they are significant), and included in the list of high scoring sequences and the alignments. The expanded set of sequences does not change the database size or statistical parameters; it simply expands the set of high-scoring sequences.
The -m F option can be used to produce multiple output formats in different files from the same search. For example, -m "F9c,10 m9c10.output" -m "FBB blastBB.output" produces two output files in addition to the normally formatted output sent to stdout. The m9c10.output file contains -m 9c score descriptions and -m 10 alignments, while blastBB.output contains BLAST-like output (-m BB).
The options -1, -B, -o, -x, and -y have become extended options, available via the -X (upper case X) option. The old -X off1,off2 option is now -o off1,off2.
By default, the program will read up to 2 GB (32-bit systems) or 12 GB (64-bit systems) of the database into memory for multi-query searches. The amount of memory available for databases can be set with the -XM4G option.
(fasta-36.3.2) ggsearch36 (global/global) and glsearch36 now incorporate SSE2 accelerated global alignment, developed by Michael Farrar. These programs are now about 20-fold faster.
fasta-36.2.1 (and later versions) are fully threaded, both for searches and for alignments. The programs routinely run 12 - 15X faster on dual quad-core machines with "hyperthreading".
In translated sequence comparisons, annotations are only available for the protein sequence.
Add ability to search a subset of a library using a file name and a list of accession/gi numbers. This version introduces a new filetype, 10, which consists of a first line with a target filename, format, and accession number format-type, and optionally the accession number format in the database, followed by a list of accession numbers. For example:

    </slib2/blast/swissprot.lseg 0:2 4|
    3121763
    51701705
    7404340
    74735515
    ...

tells the program that the target database is swissprot.lseg, which is in FASTA (library type 0) format.
The accession format comes after the ":". Currently, there are four accession formats, two that require ordered accessions (:1, :2), and two that hash the accessions (:3, :4) so they do not need to be ordered. The number and character after the accession format (e.g. "4|") indicate the offset of the beginning of the accession and the character that terminates the accession. Thus, in the typical NCBI Fasta definition line:

    >gi|1170095|sp|P46419|GSTM1_DERPT Glutathione S-transferase (GST class-mu)

the offset is 4 and the termination character is '|'. For databases distributed in FASTA format from the European Bioinformatics Institute, the offset depends on the name of the database, e.g.

    >SW:104K_THEAN Q4U9M9 104 kDa microneme/rhoptry antigen precursor (p104).

and the delimiter is ' ' (space, the default).
Accession formats 1 and 3 expect strings; accession formats 2 and 4 work with integers (e.g. gi numbers).
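To make the header line and the offset/terminator convention concrete, here is a minimal sketch that parses the first line of a type 10 file and applies the offset and terminator to a definition line. It is an illustration under stated assumptions, not the parser used by the FASTA programs: the names SubsetHeader, parse_subset_header and extract_accession are hypothetical, the offset is counted from the start of the definition line including the leading ">", and the offset is left unset when the optional field is missing.

```python
import re
from typing import NamedTuple, Optional


class SubsetHeader(NamedTuple):
    filename: str
    lib_type: int
    acc_format: int
    offset: Optional[int]
    terminator: str


def parse_subset_header(line: str) -> SubsetHeader:
    """Parse a header such as '</slib2/blast/swissprot.lseg 0:2 4|'."""
    fields = line.strip().lstrip("<").split()
    lib_type, _, acc_format = fields[1].partition(":")
    offset, terminator = None, " "  # space is the default delimiter
    if len(fields) > 2:
        match = re.fullmatch(r"(\d+)(.?)", fields[2])
        if match is None:
            raise ValueError("unrecognized accession field: %r" % fields[2])
        offset = int(match.group(1))
        terminator = match.group(2) or " "
    return SubsetHeader(fields[0], int(lib_type), int(acc_format), offset, terminator)


def extract_accession(defline: str, offset: int, terminator: str) -> str:
    """Return the text from offset up to (not including) the terminator."""
    end = defline.find(terminator, offset)
    return defline[offset:] if end < 0 else defline[offset:end]


header = parse_subset_header("</slib2/blast/swissprot.lseg 0:2 4|")
ncbi = ">gi|1170095|sp|P46419|GSTM1_DERPT Glutathione S-transferase (GST class-mu)"
print(header)
print(extract_accession(ncbi, header.offset, header.terminator))
```

Under the same assumptions, an offset of 4 with the default space delimiter picks out 104K_THEAN from the EBI-style definition line shown above.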
A new program is available, lav2svg, which creates SVG (Scalable Vector Graphics) output. In addition, ps_lav, which was introduced May 30, 2007, has been replaced by lav2ps. SVG files are more easily edited with Adobe Illustrator than PostScript (lav2ps) files.
    lalign35 -q mchu.aa:1-74 mchu.aa:75-148

Note, however, that the subset range applied to the library will be applied to every sequence in the library, not just the first, and that the same subset range is applied to each sequence. This probably makes sense only if the library contains a single sequence (this is also true for the query sequence file).
fasta34 with development version fasta35.
Add Mueller and Vingron (2000) J. Comp. Biol. 7:761-776 VT160 matrix, "-s VT160", and OPTIMA_5 (Kann et al. (2000) Proteins 41:498-503).
ggsearch35(_t) and glsearch35(_t) can now use PSSMs.
ps_lav (now lav2ps or lav2svg), which can be used to plot the lav output of lalign35 -m 11. lalign35 -m 11 | lav2ps replaces plalign (from FASTA2).

    >>gi|121716|sp|P10649|GSTM1_MOUSE Glutathione S-transfer (218 aa)
     s-w opt: 1497 Z-score: 1857.5 bits: 350.8 E(): 8.3e-97
    Smith-Waterman score: 1497; 100.0% identity (100.0% similar) in 218 aa overlap (1-218:1-218)
    ^^^^^^^^^^^^^^

where the highlighted text was either "Smith-Waterman" or "banded Smith-Waterman". In fact, scores were calculated in other ways, including global/local for fasts and fastf. With the addition of ggsearch35, glsearch35, and lalign35, there are many more ways to calculate alignments: "Smith-Waterman" (ssearch and protein fasta), "banded Smith-Waterman" (DNA fasta), "Waterman-Eggert", "trans. Smith-Waterman", "global/local", "trans. global/local", "global/global (N-W)". The last option is a global/global alignment, but with the affine gap penalties used in the Smith-Waterman algorithm.
ggsearch35(_t) and glsearch35(_t) are now available. ggsearch35(_t) calculates an alignment score that is global in the query and global in the library; glsearch35(_t) calculates an alignment that is global in the query and local in the library sequence. The latter program is designed for global alignments to domains.
Both programs assume that scores are normally distributed. This appears to be an excellent approximation for ggsearch35 scores, but the distribution is somewhat skewed for global/local (glsearch) scores. ggsearch35(_t) only compares the query to library sequences that are between 80% and 125% of the length of the query; glsearch limits comparisons to library sequences that are longer than 80% of the query. Initial results suggest that there is relatively little length dependence of scores over this range (scores go down dramatically outside these ranges).
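The length restrictions can be written as two small predicates. This is only a sketch of the ranges stated above, not the test used inside the programs; the function names are hypothetical, and treating the 80% and 125% limits of ggsearch as inclusive is an assumption.

```python
def ggsearch_length_ok(query_len: int, lib_len: int) -> bool:
    """True if lib_len is between 80% and 125% of query_len (assumed inclusive)."""
    # Integer form of 0.8 * query_len <= lib_len <= 1.25 * query_len.
    return 4 * query_len <= 5 * lib_len and 4 * lib_len <= 5 * query_len


def glsearch_length_ok(query_len: int, lib_len: int) -> bool:
    """True if lib_len is longer than 80% of query_len."""
    return 5 * lib_len > 4 * query_len


# A 218-residue query against library sequences of several lengths.
for lib_len in (150, 200, 300):
    print(lib_len, ggsearch_length_ok(218, lib_len), glsearch_length_ok(218, lib_len))
```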
The lalign (SIM) algorithm has been moved from FASTA21 to FASTA35. A plalign equivalent is also available using lalign -m 11 | lav2ps or | lav2svg.
The statistical estimates for lalign35 should be much more accurate than those from the earlier lalign, because lambda and K are estimated from shuffles.
In addition, all programs can now generate accurate statistical estimates with shuffles if the library has fewer than 500 sequences. If the library contains more than 500 sequences and the sequences are related, then the -z 11 option should be used.