FASTA36 annotation file format

Run FASTA Programs         Compare Two sequences

Recent versions of the FASTA36 programs (FASTA, SSEARCH, GGSEARCH, FASTX, etc.) can annotate the final alignment using external information. For standard FASTA searches against the PIR1 and SwissProt databases, the program can run a script that queries UniProt, InterPro, or Pfam for the locations of active sites and domain boundaries.

On the Search Databases, Find duplications, and Compare two sequences pages, you can specify the annotations to be applied to your sequences by uploading files in the appropriate format. That format is:

>Accession_info      # must match first line of query or library sequence
position (number)<tab>symbol<tab>value<tab>description
...
For example:
>sp|P09488|GSTM1_HUMAN
1       -       88      Glutathione_S-Trfase_N :1
7       V       F       Mutagen: Reduces catalytic activity 100-fold.
23      *       -       MOD_RES: Phosphotyrosine (By similarity).
33      *       -       MOD_RES: Phosphotyrosine (By similarity).
34      *       -       MOD_RES: Phosphothreonine (By similarity).
90      -       208     Glutathione_S_Trfase/Cl_chnl_C :2
108     V       Q       Mutagen: Reduces catalytic activity by half.
108     V       S       Mutagen: Changes the properties of the enzyme toward some substrates.
109     V       I       Mutagen: Reduces catalytic activity by half.
116     #       -       BINDING: Substrate.
116     V       A       Mutagen: Reduces catalytic activity 10-fold.
116     V       F       Mutagen: Slight increase of catalytic activity.
173     V       N       in allele GSTM1B; dbSNP:rs1065411.
210     V       T       in dbSNP:rs449856.
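
As an illustration of the layout above, here is a minimal Python sketch of a reader for this format. It is not part of the FASTA36 distribution: the names `Annotation` and `parse_annotations` are hypothetical, and the sketch assumes the four columns are separated by single tab characters, as the format line states.

```python
from collections import namedtuple

# Hypothetical record type for one annotation line (not a FASTA36 API).
Annotation = namedtuple("Annotation", ["position", "symbol", "value", "description"])


def parse_annotations(text):
    """Return {accession_line: [Annotation, ...]} for annotation-file text.

    Assumes each data line is position<tab>symbol<tab>value<tab>description.
    """
    records = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: everything after '>' identifies the sequence.
            current = line[1:].strip()
            records[current] = []
            continue
        if current is None:
            raise ValueError("annotation line before any '>' header: %r" % line)
        fields = line.split("\t", 3)  # the description may itself contain spaces
        if len(fields) != 4:
            raise ValueError("expected 4 tab-separated fields: %r" % line)
        position, symbol, value, description = fields
        records[current].append(
            Annotation(int(position), symbol, value, description.strip())
        )
    return records


sample = (
    ">sp|P09488|GSTM1_HUMAN\n"
    "1\t-\t88\tGlutathione_S-Trfase_N :1\n"
    "7\tV\tF\tMutagen: Reduces catalytic activity 100-fold.\n"
    "23\t*\t-\tMOD_RES: Phosphotyrosine (By similarity).\n"
)

for accession, annotations in parse_annotations(sample).items():
    print(accession, len(annotations))
```

Running a file through a reader like this before uploading it is a quick way to catch lines where spaces were typed instead of tabs.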

The symbol can be any printable non-alphabetic character, e.g. *, #, ^, or %.

Three symbols have a special meaning: 'V' (uppercase V), '[', and ']'. 'V' marks a variant position, and the <value> column gives the alternate residue at that position. Each line can hold only one alternate residue; to list several, repeat the <position> on successive lines, as at positions 108 and 116 in the example above.
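
Because each 'V' line carries a single alternate residue, a position with several variants appears on several lines. The following Python sketch groups them back together; `collect_variants` is a hypothetical helper, and tab-separated columns are assumed.

```python
from collections import defaultdict


def collect_variants(annotation_lines):
    """Map each variant position to the list of alternate residues given for it.

    Only lines whose symbol column is 'V' are considered; header lines and
    other annotation types are ignored.
    """
    variants = defaultdict(list)
    for line in annotation_lines:
        fields = line.rstrip("\n").split("\t", 3)
        if len(fields) < 3 or fields[1] != "V":
            continue
        variants[int(fields[0])].append(fields[2])
    return dict(variants)


lines = [
    "108\tV\tQ\tMutagen: Reduces catalytic activity by half.",
    "108\tV\tS\tMutagen: Changes the properties of the enzyme toward some substrates.",
    "109\tV\tI\tMutagen: Reduces catalytic activity by half.",
    "116\t#\t-\tBINDING: Substrate.",
    "116\tV\tA\tMutagen: Reduces catalytic activity 10-fold.",
]
print(collect_variants(lines))
```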

'[' and ']' specify the beginning and end of a region or domain, which can be used to score sub-alignments. '[' -- ']' regions must not overlap: the position of each ']' must be smaller than the position of the next '['. In the example above, which writes each domain on a single line with '-' as the symbol and the end position in the <value> column, the Glutathione_S-Trfase_N domain ends at position 88, before the start of the Glutathione_S_Trfase/Cl_chnl_C domain at position 90. If the region/domain description is followed by ':number', where 1 <= number <= 8, the domain is colored with the corresponding HTML color: 1=lightgreen, 2=lightblue, 3=pink, 4=cyan, 5=tan, 6=gold, 7=plum, 8=darkgreen. If you label a domain NODOM, it is colored slategrey.
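
The non-overlap rule and the color numbering can be checked before uploading a file. The Python sketch below is illustrative only: `find_overlaps` and `domain_color` are hypothetical names, domains are passed in as (start, end, description) tuples however they were read, and treating any description that begins with NODOM as the slategrey case is an assumption.

```python
import re

# Color table as listed in the text above.
DOMAIN_COLORS = {
    1: "lightgreen", 2: "lightblue", 3: "pink", 4: "cyan",
    5: "tan", 6: "gold", 7: "plum", 8: "darkgreen",
}


def domain_color(description):
    """Return the HTML color named by a trailing ':number', or None if absent."""
    if description.strip().startswith("NODOM"):  # assumption: prefix match
        return "slategrey"
    match = re.search(r":([1-8])\s*$", description)
    return DOMAIN_COLORS[int(match.group(1))] if match else None


def find_overlaps(domains):
    """Return neighbouring (previous, next) pairs that break the rule.

    After sorting by start position, each domain must end before the next
    one starts, i.e. previous end < next start.
    """
    ordered = sorted(domains)
    return [
        (prev, nxt)
        for prev, nxt in zip(ordered, ordered[1:])
        if prev[1] >= nxt[0]
    ]


domains = [
    (1, 88, "Glutathione_S-Trfase_N :1"),
    (90, 208, "Glutathione_S_Trfase/Cl_chnl_C :2"),
]
print(find_overlaps(domains))
print([domain_color(d[2]) for d in domains])
```

An empty list from `find_overlaps` means every domain ends before the next one begins, as with the two GSTM1 domains here.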

If you do not see your annotations (particularly on the second sequence, i.e. the target or library sequence), it probably means that the sequence's description line did not match the '>' line in your annotation file exactly (including upper/lower case).
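
A quick way to diagnose a missing annotation is to compare the identifiers yourself. The Python sketch below is a hypothetical helper, not the server's matching logic: it only tells an exact match apart from a match that differs in upper/lower case, and it leaves it to you to supply the identifier exactly as it appears in your sequence.

```python
def diagnose_header(sequence_id, annotation_ids):
    """Report how a sequence identifier relates to the '>' lines of an annotation file."""
    if sequence_id in annotation_ids:
        return "exact match"
    # Index the annotation identifiers by their lower-cased form.
    lowered = {a.lower(): a for a in annotation_ids}
    if sequence_id.lower() in lowered:
        return "case mismatch: annotation file has %r" % lowered[sequence_id.lower()]
    return "no match"


annotation_ids = ["sp|P09488|GSTM1_HUMAN"]
print(diagnose_header("sp|P09488|GSTM1_HUMAN", annotation_ids))
print(diagnose_header("sp|p09488|gstm1_human", annotation_ids))
```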

