Renaming FASTA sequences

Renaming FASTA sequences - rename_fastaseq.py

tranalign requires that the sequences in your DNA sequence file have the same names as the sequences in your protein sequence file (at least the first word, or accession, of the sequence). If the names don't match, you will get an error message.

I have written a short program that takes a list of accessions and renames the sequences in a DNA sequence file using those accessions.

~wrp/biol4230/proj1/rename_fastaseq.py --acc_file gstm.a_accs gstm.nlib

This replaces the description lines in gstm.nlib of the form:

>ref|NM_000561.3| Homo sapiens glutathione S-transferase mu 1 (GSTM1)

with

>gstm1_hum >ref|NM_000561.3| Homo sapiens glutathione S-transferase mu 1 (GSTM1)

if "gstm1_hum" is the first accession in gstm.a_accs.
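The effect on a single description line can be sketched in a few lines of Python. This is only an illustration of the behavior shown in the example above, not the code in rename_fastaseq.py; the function name rename_header is hypothetical.

```python
def rename_header(header: str, new_acc: str) -> str:
    """Prepend new_acc to a FASTA description line.

    Hypothetical helper: it mimics the example output above, where the
    original line (including its leading '>') is kept after the new name.
    """
    if not header.startswith(">"):
        raise ValueError("not a FASTA description line: " + header)
    return ">" + new_acc + " " + header


old = ">ref|NM_000561.3| Homo sapiens glutathione S-transferase mu 1 (GSTM1)"
print(rename_header(old, "gstm1_hum"))
```

Because the new name becomes the first word of the line, programs that read only the first word (such as tranalign) see "gstm1_hum", while the original RefSeq description stays available for reference.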

You can use this script to rename the sequences in any FASTA file, protein or DNA, which should make it easier to (1) use informative names (UniProt IDs) for your sequences and (2) make sure you have the same names for both protein and DNA sequences.

You must provide the name of an accession file with --acc_file on the command line, together with the FASTA file of sequences to be renamed. The accession file can simply contain a list of accessions, e.g.:

GSTM1_HUMAN
GSTM2_HUMAN
GSTM1_MOUSE
...

Or it can contain the FASTA definition lines from a template (e.g. a protein sequence file). For example:

grep '^>' gstm.alib > gstm.a_accs
~/biol4230/proj1/rename_fastaseq.py --acc_file gstm.a_accs gstm.nlib > gstm.nlib_renamed

will take the text between '>' and the first space as the new set of accession names. This should make it easier to ensure that the protein names and DNA names are the same by simply running grep on the protein FASTA file.
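Both accession file formats reduce to the same rule: drop a leading '>' if there is one and keep the first word. A minimal sketch of that rule follows, assuming that any whitespace (not only a space) ends the accession; parse_accession is a hypothetical name, not a function from rename_fastaseq.py.

```python
def parse_accession(line: str) -> str:
    """Return the accession from a plain name or a FASTA definition line.

    Hypothetical helper; assumes the accession is the first
    whitespace-delimited word after an optional leading '>'.
    """
    text = line.strip()
    if text.startswith(">"):
        text = text[1:]
    words = text.split()
    return words[0] if words else ""


# A plain accession and a definition line give the same kind of result.
print(parse_accession("GSTM1_HUMAN\n"))
print(parse_accession(">gstm1_hum some protein description\n"))
```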

rename_fastaseq.py is very simple: it assumes that the order of names in the --acc_file file is the same as the order of sequences in the FASTA file argument (gstm.nlib). If the two files are in different orders, sequences will be given the wrong names.
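Since names are matched purely by position, it can be worth looking at the pairing before renaming. The sketch below is a separate check, not part of rename_fastaseq.py; pair_names and first_word are hypothetical helpers, and the sequence names in the example are made up. It lists each new name next to the current first word of the corresponding sequence and stops if the two files have different numbers of entries.

```python
from typing import Iterable, List, Tuple


def first_word(line: str) -> str:
    """First whitespace-delimited word of a line, without a leading '>'."""
    words = line.strip().lstrip(">").split()
    return words[0] if words else ""


def pair_names(acc_lines: Iterable[str],
               fasta_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair the i-th new accession with the i-th sequence name."""
    new_names = [first_word(ln) for ln in acc_lines if first_word(ln)]
    old_names = [first_word(ln) for ln in fasta_lines if ln.startswith(">")]
    if len(new_names) != len(old_names):
        raise ValueError(
            f"{len(new_names)} accessions but {len(old_names)} sequences")
    return list(zip(new_names, old_names))


# Made-up example data; with real files, pass open file handles instead.
accs = ["gstm1_hum\n", "gstm2_hum\n"]
fasta = [">seqA first mRNA\n", "ATGCCC\n", ">seqB second mRNA\n", "ATGGGG\n"]
for new, old in pair_names(accs, fasta):
    print(new, old, sep="\t")
```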

You can use this script to take a file of NP_/XP_ proteins and their corresponding NM_/XM_ mRNAs and convert the RefSeq accessions to consistent UniProt IDs.

In addition, rename_fastaseq.py can be used to ensure that phylip accessions are no more than 10 characters long.

Unfortunately, the various phylip/fphylip programs are inconsistent about this, but the names (accessions) of the sequences should be 10 characters or fewer, and should not include '.'s ('_' is OK).
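The two naming rules above are easy to check before running a phylip program. A minimal sketch, assuming only those two rules (at most 10 characters, no '.'); phylip_name_problems is a hypothetical helper and does not reflect any check built into phylip or rename_fastaseq.py.

```python
from typing import List


def phylip_name_problems(name: str, max_len: int = 10) -> List[str]:
    """Return a list of reasons why name may be unsafe for phylip."""
    problems = []
    if len(name) > max_len:
        problems.append(f"longer than {max_len} characters")
    if "." in name:
        problems.append("contains '.'")
    return problems


for name in ["gstm1_hum", "GSTM1_HUMAN", "NM_000561.3"]:
    issues = phylip_name_problems(name)
    print(name, "OK" if not issues else "; ".join(issues), sep="\t")
```

Note that a full UniProt ID such as GSTM1_HUMAN is 11 characters long, so a shortened form such as gstm1_hum is needed to stay within the limit.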

Last modified: Tuesday, 27-Mar-2018 14:26:28 EDT