Molecular Evolution - Similarity Searching Exercises from the command line


These exercises use programs on the class-XX compute cluster. To run the exercises from the command line, you must first log in with: ssh username@class-xx (where xx is one of 02 - 04, 06, 07, 090)

To make certain the programs are set up properly, type:

fasta36  or
fasta36 -help
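The same check can be extended to all five search programs used in these exercises. This is a minimal sketch, assuming a POSIX shell. The function name check_programs is hypothetical, and it only tests whether each program is on your PATH (the BLAST programs may not be found until after the module load step described below).

```shell
# Report whether each search program used in these exercises is on the PATH.
# check_programs is a hypothetical helper name, not part of FASTA or BLAST.
check_programs() {
    for prog in fasta36 ssearch36 fastx36 blastp blastx; do
        if command -v "$prog" >/dev/null 2>&1; then
            echo "$prog: found"
        else
            echo "$prog: NOT found"
        fi
    done
}

check_programs
```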

  1. Searching with FASTA

    To do a similarity search from the command line, we need: (1) a program; (2) a query sequence; and (3) a database. The programs and databases have been installed. Today, we will use five search programs: fasta36, ssearch36, blastp (protein:protein), and fastx36/blastx (DNA:protein).

    1. You can transfer a query sequence from your laptop, OR you can extract a sequence from the databases on the server using the blastdbcmd program.

      Before using blastdbcmd, you must type

      module load bioware
      
      blastdbcmd -entry gstt1_drome > gstt1_drome.aa
      
      will find the SwissProt entry gstt1_drome and write it to the file gstt1_drome.aa.
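Before searching, it can help to confirm that the extraction actually produced a sequence file. The sketch below assumes the file is in FASTA format (a first line starting with '>'). The helper name is_fasta is hypothetical.

```shell
# Return success if the file is non-empty and its first line is a FASTA header.
# is_fasta is a hypothetical helper name.
is_fasta() {
    [ -s "$1" ] && head -n 1 "$1" | grep -q '^>'
}

if is_fasta gstt1_drome.aa; then
    echo "gstt1_drome.aa looks like a FASTA file"
else
    echo "gstt1_drome.aa is missing, empty, or not FASTA"
fi
```

If the check fails, look at the file with more gstt1_drome.aa; an error message from blastdbcmd usually means the entry name was mistyped or the module was not loaded.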

    2. The FASTA programs use the command line syntax: program -opt1 -opt2 query library. Thus:
      fasta36 gstt1_drome.aa /class/shared/seq_db/pir1.lseg > gstt1_drome.fa_out
      
      will compare the gstt1_drome.aa sequence to the pir1.lseg database. You can also search SwissProt with '/class/shared/seq_db/swissprot.lseg'. (You can abbreviate the pir1 database as 'a' and the SwissProt database as 's'.)

    3. Run the search, then look at the output by opening it in an editor, transferring the file to your laptop, or typing more gstt1_drome.fa_out.

    4. Compare the command at the top of your output file with the commands run on the web site.

      To see a complete list of fasta36 options, type:

      fasta36 -help
      
    5. Try searching with a different scoring matrix; the option -s BP62 sets the scoring matrix and gap penalties to match blastp. Be certain to save new searches in different file names. Try to make the filenames descriptive, e.g. gstt1_drome_swissprot_BP62.fa_out for:
      fasta36 -s BP62 gstt1_drome.aa s > gstt1_drome_swissprot_BP62.fa_out
      
      (Note that the scoring matrix option -s BP62 MUST precede the query and library file names.)

      Try a shallow scoring matrix, e.g. -s MD40 or -s MD20.

      A complete list of options is available here.
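The matrix comparisons in step 5 can be run in one loop that also keeps the file names descriptive. This is a sketch only: out_name is a hypothetical helper, the name pattern simply follows the gstt1_drome_swissprot_BP62.fa_out example, and the loop prints the command instead of running it when fasta36 is not on the PATH.

```shell
# Build an output file name from the query file, a database label, and a matrix.
# out_name is a hypothetical helper; the pattern follows the example in step 5.
out_name() {
    query_base=$(basename "$1" .aa)
    echo "${query_base}_${2}_${3}.fa_out"
}

# Search SwissProt ('s') once per scoring matrix.
# If fasta36 is not on the PATH, print the command that would be run.
for matrix in BP62 MD40 MD20; do
    out=$(out_name gstt1_drome.aa swissprot "$matrix")
    if command -v fasta36 >/dev/null 2>&1; then
        fasta36 -s "$matrix" gstt1_drome.aa s > "$out"
    else
        echo "would run: fasta36 -s $matrix gstt1_drome.aa s > $out"
    fi
done
```

Each search writes to its own file, so the results for BP62, MD40, and MD20 can be compared side by side afterwards.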

    6. Try doing the same search with ssearch36 (Smith-Waterman):
      ssearch36 gstt1_drome.aa /class/shared/seq_db/pir1.lseg > gstt1_drome.ss_out
      
      How do the results compare with fasta36? Do you find more significant scores?
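One way to answer the "more significant scores" question is to count the hits below an E-value threshold in each result. The sketch below assumes the searches were rerun with BLAST-style tabular output (12 tab-separated columns, E-value in column 11), which recent fasta36 and ssearch36 versions are expected to produce with -m 8; check fasta36 -help to confirm. The helper count_sig, the .fa_tab/.ss_tab file names, and the 0.001 threshold are all illustrative assumptions.

```shell
# Count hit lines with an E-value at or below a threshold in BLAST-style
# tabular output (12 tab-separated columns, E-value in column 11).
# count_sig is a hypothetical helper; lines starting with '#' are skipped.
count_sig() {
    awk -F '\t' -v max="$2" \
        '!/^#/ && NF >= 12 && ($11 + 0) <= (max + 0) { n++ } END { print n + 0 }' "$1"
}

# Assumed file names: tabular results saved beforehand, e.g. with
#   fasta36 -m 8 gstt1_drome.aa a > gstt1_drome.fa_tab
#   ssearch36 -m 8 gstt1_drome.aa a > gstt1_drome.ss_tab
for f in gstt1_drome.fa_tab gstt1_drome.ss_tab; do
    if [ -f "$f" ]; then
        echo "$f: $(count_sig "$f" 0.001) hits with E() <= 0.001"
    fi
done
```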

    7. Try ssearch36 with alternate scoring matrices (e.g. -s BP62).

  2. Searching with BLAST.

    BLAST uses a different command line syntax; every argument has a name, e.g. -query gstt1_drome.aa -db swissprot. To see a complete list of blastp options, type:

    blastp -help
    
    1. To do a similar search, type:
      blastp -query gstt1_drome.aa -db /class/shared/seq_db/pir1 > gstt1_drome.bl_out
      
    2. To search swissprot, type:
      blastp -query gstt1_drome.aa -db swissprot > gstt1_drome_swissprot.bl_out