fasta.bioch.virginia.edu/mol_evol/pfb_python_matrices.html

The goals of this exercise are to:

  1. Do some simple parsing of BLAST tabular format output files to extract some information
  2. Evaluate the accuracy of similarity statistics
  3. Demonstrate to yourself that scoring matrices have different target identities and alignment lengths.

On fasta.bioch.virginia.edu/mol_evol/data are sets of results from either SSEARCH or BLASTP searching either a 200 or 800 amino acid random protein sequence against the QFO78 library of 78 Uniprot Reference Proteomes.

There are 4 sets of results for each of the two searching algorithms, ssearch and blastp., a set of searches with one random sequence, either 200 aa or 800 aa long with either ssearch (ss_rand5-) or blastp(blast_rand5-), using one of 7 scoring matrices for ssearch and one of 4 for blastp. In addition, there are the same set of searches using 10 query sequences (randall instead of only one (rand5 -- the fifth random sequence).

Your goal this afternoon will be to write a script that reads each of the sets of data from either the SSEARCH or BLASTP outputs and produces a table with each of the scoring matrices as row, and the percent identity, alignment length, and E()-values for columns, for the top hit from each of the searches.

To do this, you will need to:

  1. Download a set of SSEARCH or BLASTP results.

  2. Write a program that will take take an argument from the command line, which you can use to specify either rand5-200 or rand5-800, and concatenate it with a scoring matrix name (BL50, BP62, etc. for SSEARCH, BLOSUM62, BLOSUM80, etc. for BLASTP, so that you can open and each result file and associate the results with a scoring matrix.

  3. To parse the BLASTP tabular output file, you must:
    1. remove the newline character
    2. skip lines beginning with "#"
    3. use line.split('\t') to break each result line into its parts, which are: qseqid, sseqid, percid, alen, mismat, gaps, q_start, q_end, s_start, s_end, evalue, bits
      consider breaking the line up and saving the results to a dictionary with:
      this_data=dict(zip(field_names, line.split('\t')))
    4. As soon as you have a result line, save this_data, close the file, and move to the next result file.

  4. Save the results in a dictionary using the matrix name as the key, and then print out the values you want ('percid', 'alen', and 'evalue').

  5. Does the alignment length, percent identity, or evalue depend on the query sequence length?

  6. Compare the SSEARCH results with the BLAST results. Which program gives a better range of alignment lengths and percent identities?
  7. For a more challenging exercise, parse the results from the ss_randall-200... and ss_randall-800... searches, which give the results for 10 random query sequences. Calculate the median percid, alen, and evalue from the 10 results for each scoring matrix.

results.html