In the links below, [pgm] indicates a link with most of the information filled in; e.g. the program name, query, and library. [seq] links go to the NCBI, for more information about the sequence. In general, you should click [pgm] links, but not [seq] links.
Take a look at the output.
Login (ssh) to interactive.hpc.virginia.edu
We will use a customized .bashrc initialization file to provide access the blast programs and sequence databases.
If you already have a .bashrc (ls -la .bashrc, then you should make a copy of it before replacing it with the biol4230 version. Later, you can merge the new and old files using an editor. If your existing .bash_profile file "sources" .bashrc, you do not need to change it.
cd # make sure you are in your home directory cp ~wrp/biol4230/bash/.bash_profile . cp ~wrp/biol4230/bash/.bashrc . source ~/.bashrc echo $PATHOnce you have copied these files, when you login to interactive.hpc, you will be running on a separate host machine.
echo $PATH | grep -i seqprg
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=NP_001171499&rettype=fasta&retmode=text" > honeybee_gst.aaOr copy the ~wrp/biol4230/hwk2/honeybee_gst.aa file into your own homework (biol4230/hwk2) directory.
Database | Blast name | FASTA abbreviation | # sequences |
---|---|---|---|
PIR1 | pir1 | a | ~13,000 |
SwissProt | uniprot_sprot | s | ~500,000 |
QFO78 | qfo78 | q | ~1,000,000 |
You can see the names of some of the blastp databases by looking in the /data/slib/bl_dbs directory: (but pir1 does not have this format)
ls ${BLAST_DB}/*.pal # protein databases ls ${BLAST_DB}/*.nal # DNA databases
You can see the list of libraries that fasta36 knows about either by looking for ${FASTA_DB}/*.lseg files:
ls ${FASTA_DB}/*.lseg # protein databases onlyor by running the program in the interactive (fasta36 -I) mode, or by In interactive After you enter the name of your query sequence (honeybee_gst.aa), you will be given a list of sequence libraries, and their abbreviations. (-I interactive mode is useful for learning about available sequence libraries, but normally you should run in non-interactive mode.)
fasta36 -h fasta36 -help > fasta.helpLikewise:
blastp -h blastp -help > blastp.help
fasta36 honeybee_gst.aa a > honeybee_v_pir1.fa_result
Look at the honeybee_v_pir1.fa_result file. Does the library hit summary look exactly the same? How it is different?
blastp -query honeybee_gst.aa -db ${BLAST_DB}/pir1 > honeybee_v_pir1.bp_result
Note that with blastp, the query and library sequence files are specified with arguments (-query, -db). With fasta36 (and the other FASTA programs), they are specified as the first and second non-option commands. With fasta, all options must precede the query and library file names.
fasta36 -m 8 query.file library > fasta_output.tab blastp -outfmt 6 -query query.file -db library.file > blast_output.tabRun the fasta36 and blastp honeybee_gst.aa vs PIR1 searches producing tabular output and saving it to a file.
You will parse these tabular results files for this weekend's homework
NP_001171499.1 sp|P20432|GSTT1_DROME 58.05 206 86 1 3 208 2 206 1.2e-56 215.0 NP_001171499.1 sp|P04907|GSTF3_MAIZE 35.10 170 98 19 2 158 3 166 2e-09 58.2While the blastp -outfmt 6 output looks like this:
NP_001171499.1 P20432 57.767 206 86 1 3 208 2 206 3.19e-87 255 NP_001171499.1 P04907 27.381 168 107 2 2 158 3 166 6.51e-11 58.5To get blastp output that looks the same as fasta36 output, use the command option:
blastp -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore' -query ... -db ...
cut -f 2 fasta_output.file | cut -d \| -f 2Produces a set of accessions. If you do a search against SwissProt, you may need to remove the version number with "| cut -d . -f 1".
awk '{if ($11 > 0.1 && $11 <= 2.0) print $2}' output_file.name | cut -d \| -f 2