Dept. of Biochemistry and Molecular Genetics
Box 800733
Charlottesville, VA 22908
FAX: (434)-924-5069
The FASTA programs can be used to search protein and DNA sequence databases, and to confirm the statistical significance of a match by comparing the alignment score to a distribution of scores produced by shuffled sequences. Programs are also available to display local alignments.
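The shuffle-based significance check described above can be sketched in a few lines of Python. This is a minimal illustration, not the FASTA programs' own code or statistical procedure: the function names `local_score` and `shuffle_p_value`, the simple match/mismatch/gap scores, and the plain empirical count (rather than a fitted score distribution) are all assumptions chosen for brevity.

```python
import random


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not FASTA defaults.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def shuffle_p_value(query, subject, n_shuffles=200, seed=0):
    """Compare the observed score with scores against shuffled subjects.

    Shuffling keeps the length and residue composition of the subject
    but destroys its order, so the shuffled scores approximate what
    unrelated sequences of the same composition would produce.
    """
    rng = random.Random(seed)
    observed = local_score(query, subject)
    residues = list(subject)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        if local_score(query, "".join(residues)) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least_as_good + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    score, p = shuffle_p_value("MKVLAAGIVGLLLAQ", "MKVLAAGIVGLLLAQ")
    print(f"observed score = {score}, empirical p ~ {p:.4f}")
```

With an empirical count, the smallest p-value that can be reported is 1/(n_shuffles + 1), so a real analysis would need many shuffles or a fitted distribution to resolve very small probabilities.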
The FASTA programs extend the BLAST local similarity programs by offering rigorously optimal local and global alignments, a much wider range of alignment scoring matrices (Pearson, 2013), translated alignments that allow frame-shifts, and alignments with oligo-peptides and oligo-nucleotides. In addition, the FASTA programs can highlight functional residues in alignments and provide sub-alignment scoring to reduce alignment over-extension (Mills and Pearson, 2013; Gonzalez and Pearson, 2010).
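To make the distinction between optimal local and global alignment concrete, the sketch below computes both scores for the same pair of sequences with textbook dynamic programming (Needleman-Wunsch for global, Smith-Waterman for local). The function name `align_scores` and the match/mismatch/gap values are illustrative assumptions; the FASTA programs use full scoring matrices and different gap penalties, which this sketch does not attempt to reproduce.

```python
def align_scores(a, b, match=2, mismatch=-1, gap=-2):
    """Return (global_score, local_score) for sequences a and b.

    Global: every residue of both sequences must be aligned or gapped.
    Local: the best-scoring pair of subsequences; scores never drop below 0.
    A linear gap penalty is assumed for simplicity.
    """
    n = len(b)
    g_prev = [j * gap for j in range(n + 1)]  # global row 0: leading gaps
    l_prev = [0] * (n + 1)                    # local row 0: free start
    best_local = 0
    for i, ca in enumerate(a, start=1):
        g_curr = [i * gap]
        l_curr = [0]
        for j, cb in enumerate(b, start=1):
            s = match if ca == cb else mismatch
            g = max(g_prev[j - 1] + s, g_prev[j] + gap, g_curr[j - 1] + gap)
            loc = max(0, l_prev[j - 1] + s, l_prev[j] + gap, l_curr[j - 1] + gap)
            g_curr.append(g)
            l_curr.append(loc)
            best_local = max(best_local, loc)
        g_prev, l_prev = g_curr, l_curr
    return g_prev[n], best_local


if __name__ == "__main__":
    # A short motif embedded in a longer sequence: the local score sees
    # only the motif, while the global score pays for the flanking gaps.
    print(align_scores("ACGT", "GGACGTGG"))  # (0, 8)
```

The example shows why the choice matters: the embedded motif scores 8 locally, but the four unmatched flanking residues cancel that score entirely in the global alignment.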
We have a long-standing interest in exploiting protein sequence information, both to better understand how new protein sequences arise and to understand the relationship between protein sequence and protein structure. Since the description of the FASTP program in 1985, our group has been developing more effective methods for identifying distantly related protein sequences. Over the past 10 years, state-of-the-art methods have improved to the point where proteins that have diverged from a common ancestor in the past billion years are likely to be detected by sequence similarity searching. We hope to push that threshold back beyond 2 billion years (near the time when prokaryotes and eukaryotes diverged), but it is already possible to identify novel proteins that are likely to have emerged in the last 500 to 800 million years. If we can identify proteins that emerged in the last 100 to 250 million years, it may be possible to identify the mechanisms by which new proteins are formed.
ISMB 2000 Tutorial on Protein Evolution and Protein Sequence Comparison (PDF file)
Insana, G., Martin, M. J., Pearson, W. R. (2024) "Improved selection of canonical proteins for reference proteomes" NAR Genom Bioinform. 10.1093/nargab/lqae066 PMCID: PMC11165316 [PDF]
Triant, D. A., Pearson, W. R. (2022) "Comparison of detection methods and genome quality when quantifying nuclear mitochondrial insertions in vertebrate genomes" Front Genet. 13:984513 doi: 10.3389/fgene.2022.984513 PMCID: PMC9723244 [PDF]
Pearson, W. R., Li, W., and Lopez, R. (2017) "Query-seeded iterative similarity searching improves selectivity 5-20-fold" Nuc. Acids Res. 10.1093/nar/gkw1207 [PDF]
Triant, D. A. and Pearson, W. R. (2015) "Most partial domains in proteins are alignment and annotation artifacts" Genome Biology 16:99 [Entrez] [Journal] [PDF]
Furnham N, Holliday GL, de Beer TA, Jacobsen JO, Pearson WR, Thornton JM. (2013) "The Catalytic Site Atlas 2.0: cataloging catalytic sites and residues identified in enzymes." Nucleic Acids Res. 42:D485-489 [Entrez] [PDF]
Mills, L. J. and Pearson, W. R. (2013) "Adjusting Scoring Matrices to Correct Overextended Alignments" Bioinformatics. 29:3007-3013 doi: 10.1093/bioinformatics/btt517. [Entrez] [PDF]
Pearson, W. R. (2013) "Selecting the Right Similarity-Scoring Matrix" Curr. Prot. Bioinformatics Chapter 3: Unit 3.5 doi: 10.1002/0471250953.bi0305s43. [Entrez] [PDF]
Pearson, W. R. (2013) "An Introduction to Similarity ("Homology") Searching" Curr. Prot. Bioinformatics Chapter 3: Unit 3.1 doi: 10.1002/0471250953.bi0301s42. [Entrez] [PDF]
Li W, McWilliam H, Goujon M, Cowley A, Lopez R, Pearson WR. (2012) "PSI-Search: iterative HOE-reduced profile SSEARCH searching." Bioinformatics. [Entrez] [PDF]
Holliday GL, Andreini C, Fischer JD, Rahman SA, Almonacid DE, Williams ST, Pearson WR. (2012) "MACiE: exploring the diversity of biochemical reactions." Nucleic Acids Res. 2012 Jan;40(Database issue):D783-9. [Entrez] [PDF]
M. W. Gonzalez and W. R. Pearson (2010) "RefProtDom: A protein database with improved domain boundaries and homology relationships" Bioinformatics 26:2361-2362 [Entrez] [PDF]
M. L. Sierk, M. E. Smoot, E. J. Bass, and W. R. Pearson (2010) "Improving pairwise sequence alignment accuracy using near-optimal alignments" BMC Bioinformatics 11:146 doi:10.1186/1471-2105-11-146 [Entrez] [PDF]
M. W. Gonzalez and W. R. Pearson (2010) "Homologous over-extension: a challenge for iterative similarity searches" Nuc. Acids Research 38:2177-2189 [Entrez] [PDF]
D. T. Lavelle and W. R. Pearson (2010) "Globally, unrelated protein sequences appear random" Bioinformatics 26:310-318 [Entrez] [PDF]
W. R. Pearson and M. L. Sierk (2005) "The limits of protein sequence comparison?" Curr Opin Struct Biol. 15:254-260. [Entrez] [PDF]
W. R. Pearson and T. C. Wood (2001) "Statistical significance in biological sequence comparison" in Handbook of Statistical Genetics, D. J. Balding, M. Bishop, and C. Cannings eds. London: Wiley, pp. 39-65
T. C. Wood and W. R. Pearson (1999) "Evolution of Protein Sequences and Structures" J. Mol. Biol. 291:977-995
Pearson, W. R. (1998) "Empirical statistical estimates for sequence similarity scores" J. Mol. Biol. 276:71-84 [Entrez]
Xu, S.-j., Wang, Y.-p., Roe, B., Pearson, W. R. (1998) Characterization of the Human Class Mu Glutathione S-Transferase Gene Cluster and the GSTM1 Deletion. J. Biol. Chem. 273:3517-3527. [Entrez]
Pearson, W. R., Vorachek, W. R., Xu, S., Berger, R., Hart, I., Vannais, D., and Patterson, D. (1993) Identification of class-mu glutathione transferase genes GSTM1 - GSTM5 on human chromosome 1p13. Am. J. Human Genet. 53:220-233. [Entrez]
Daly, A. K., Thomas, D. J., Cooper, J., Pearson, W. R., Neal, D. E., and Idle, J. R. (1993) Homozygous deletion of the glutathione S-transferase M1 (GSTM1) gene is a risk factor in bladder cancer. Brit. Med. J. 307:481-482. [Entrez]