Biol4230 - Similarity Searching Exercises

These exercises use programs on the FASTA WWW Search page[pgm] and the BLAST WWW Search page [pgm].

In the links below, [pgm] indicates a link with most of the information filled in; e.g. the program name, query, and library. [seq] links go to the NCBI, for more information about the sequence. In general, you should click [pgm] links, but not [seq] links.

Identifying homologs and non-homologs; using domain annotations
  1. Use the FASTA search page [pgm] to compare Honey bee glutathione transferase D1 (GSTD1) NP_001171499/ H9KLY5_APIME [seq] to the PIR1 Annotated protein sequence database. Be sure to press , not .

    Take a look at the output.

    1. How long is the query sequence?
    2. How many sequences are in the PIR1 database?
    3. What scoring matrix was used?
    4. What were the gap penalties? (what is the penalty for a one-residue gap? two residues?)
    5. What are each of the numbers after the description of the library sequence? Which one is best for inferring homology?
    6. How similar is the highest scoring sequence? What is the difference between %_id and %_sim? Why is there no 100% identity match?
    7. Looking at an alignment, where are the boundaries of the alignment (the best local region)? How many gaps are in the best alignment? The second best?
    8. What is the E()-value, percent identity, and alignment length of the worst scoring sequence shown? Are there sequences in the database with worse scores?
  2. Homology (and non-homology) can also be inferred from domain relationships. There are three parts to the domain display, the domain structure of the query (top) sequence (if available), the domain structure of the library (bottom) sequence, and the domain alignment boundaries in the middle (inside the alignment box). The boundaries and color of the alignment domain coloring match the Region: sub-alignment scores.
    1. Ignoring the domain annotations, what is the highest scoring non-homologous sequence (the highest scoring -- lowest E()-value -- sequence that is unlikely to share any structural similarity with the honey bee query sequence?
    2. Test your candidate non-homolog(s) by comparing it to SwissProt using the General re-search link. Does it show significant similarity to any glutathione transferases?
    3. Is your testing by re-searching consistent with the domain annotation coloring?
    4. Note that the alignment of Honey bee GSTD1 and SSPA_ECO57 includes portions of both the N-terminal and C-terminal domains, but neither domain is completely aligned. Why do might the alignments not include the complete domains?
    5. Is your explanation for the partial domain alignment consistent the the argument that domains have a characteristic length? How might you test whether a complete domain is present?
  3. Repeat the honeybee GSTD1 search [pgm] using the BLASTP62/-11/-1 scoring matrix that BLAST uses. Re-examine the GSTD1:SSPA_ECO57 alignment. Are both glutathione transferase domains present? Look at the alignments to the homologs above and below SSPA_ECO57. Based on those aligments, do you think the Glutathione-S-Trfase C-like domain is really missing? Why did the alignment become so much shorter?
  4. One of the candidate non-homologs is sp|Q9SI20|EF1D2_ARATH, with an E()-value of 0.11.
    1. Does the domain structure of EF1D2_ARATH suggest that it could be a glutatione transferase homolog?
    2. Use the General Research to explore the domains contained in EF1D2_ARATH homologs found in SwissProt.
    3. Does this second search support homology or non-homology?

    blastp and fasta36 on the command line
  5. Command line (script-able) similarity searching;

    Login (ssh) to (or possibly

    On both machines, we will use the .bashrc initialization file to provide access the blast programs and sequence databases.

    1. Copy the customized .bashrc and .bash_profile files to your account.

      If you already have a .bashrc (ls -la .bashrc, then you should make a copy of it before replacing it with the biol4230 version. Later, you can merge the new and old files using an editor. If your existing .bash_profile file "sources" .bashrc, you do not need to change it.

      cd    # make sure you are in your home directory
      cp ~wrp/biol4230/bash/.bash_profile .
      cp ~wrp/biol4230/bash/.bashrc .
      source ~/.bashrc
      echo $PATH
      Once you have copied these files, when you login to interactive.hpc, you will be running on a separate host machine.
      As a result, YOU MUST LOGOUT (^D) TWICE to fully logout. If you see udc- in your prompt, you are not fully logged out.

      On interactive.hpc your path will include python and ncbi-blast.

      echo $PATH | grep -i ncbi
  6. Create a biol4230/hwk2 directory (mkdir) and cd into it.
  7. Download the Honey Bee sequence from the NCBI using curl:
    curl "" > honeybee_gst.aa
    Or copy the ~wrp/biol4230/hwk2/honeybee_gst.aa file into your own homework (biol4230/hwk2) directory.
  8. The following databases are available for searching with both the fasta36 / ssearch36 and blastp programs. All blast names must be preceeded by ${BLAST_DB}/, e.g. pir1 is referred to as ${BLAST_DB}/pir1.
    # sequences
    PIR1pir1a ~13,000
    SwissProtswissprotq ~500,000

    You can see the names of some of the blastp databases by looking in the /data/slib/bl_dbs directory: (but pir1 does not have this format)

    ls ${BLAST_DB}/*.pal  # protein databases
    ls ${BLAST_DB}/*.nal  # DNA databases

    You can see the list of libraries that fasta36 knows about either by looking for ${FASTA_DB}/*.lseg files:

    ls ${FASTA_DB}/*.lseg # protein databases only
    or by running the program in the interactive (fasta36 -I) mode, or by In interactive After you enter the name of your query sequence (honeybee_gst.aa), you will be given a list of sequence libraries, and their abbreviations. (-I interactive mode is useful for learning about available sequence libraries, but normally you should run in non-interactive mode.)
  9. Running fasta36 and blastp non-interactively.
    1. Look at the help commands. To see how to run the programs, and the various command line options, type:
      fasta36 -h
      fasta36 -help >
      blastp -h
      blastp -help >
    2. Do the same FASTA search you did on the web site by typing:
      fasta36 honeybee_gst.aa a > honeybee_v_pir1.fa_result

      Look at the honeybee_v_pir1.fa_result file. Does the library hit summary look exactly the same? How it is different?

    3. Do the search with blastp:
      blastp -query honeybee_gst.aa -db ${BLAST_DB}/pir1 > honeybee_v_pir1.bp_result

      Note that with blastp, the query and library sequence files are specified with arguments (-query, -db). With fasta36 (and the other FASTA programs), they are specified as the first and second non-option commands. With fasta, all options must precede the query and library file names.

    4. Both fasta36 and blastp can display search results in different formats. The simplest format to parse with other programs is tabular format:
      fasta36 -m 8 query.file library >
      blastp -outfmt 6 -query query.file -db library.file >
      Run the fasta36 and blastp honeybee_gst.aa vs PIR1 searches producing tabular output and saving it to a file.

      You will parse these tabular results files for this weekend's homework

    blastp output parsing tricks for your homework:
    1. By default, the fasta36 -m 8 output looks like this:
      NP_001171499.1	sp|P20432|GSTT1_DROME	58.05	206	86	1	3	208	2	206	1.2e-56	215.0
      NP_001171499.1	sp|P04907|GSTF3_MAIZE	35.10	170	98	19	2	158	3	166	2e-09	58.2
      While the blastp -outfmt 6 output looks like this:
      NP_001171499.1	P20432	57.767	206	86	1	3	208	2	206	3.19e-87	255
      NP_001171499.1	P04907	27.381	168	107	2	2	158	3	166	6.51e-11	58.5
      To get blastp output that looks the same as fasta36 output, use the command option:
      blastp -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore' -query ... -db ...
    2. You can use the "cut -f 2" command to isolate the second column, and pipe (|) it to "cut -d \|" again to separate the parts of sp|P20432 |GSTT1DROME:
        cut -f 2 fasta_output.file | cut -d \| -f 2
      Produces a set of accessions. If you do a search against SwissProt, you may need to remove the version number with "| cut -d . -f 1".
    3. You can use the "awk" command to isolate the subset of hits that you want ($11 has the E()-value; $2 has the subject sequence id):
      awk '{if ($11 > 0.001 && $11 <= 2.0) print $2}' | cut -d \| -f 2

Biol4230 Schedule