Biol4230 - Python/Matrix homework 4 - DUE Monday, Feb. 12, 5:00 PM


In addition to putting the Python programs and shell scripts for this homework in biol4230/hwk4, also create a file hwk4.notes that links the name of each program to the question it answers and contains the output of each Python program.
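
One way to build hwk4.notes is to have a small script append each program's name, the question it answers, and its output. The sketch below is only an illustration: the entry layout and the function name `append_note` are assumptions, not a required format.

```python
def append_note(notes_path, question, program, output):
    """Append one entry: the question, the program that answers it, and its output."""
    with open(notes_path, "a") as notes:
        # Header line linking the program name to the question (hypothetical layout).
        notes.write(f"Question {question}: {program}\n")
        # The program's output, followed by one blank line to separate entries.
        notes.write(output.rstrip("\n") + "\n\n")
```

The `output` string could be captured with `subprocess.run([...], capture_output=True, text=True).stdout`, or simply pasted in by hand.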


Possibly helpful suggestions
  1. Answer questions 3e,f,g and 4e,f from Friday's PSSM lab.
  2. Download ten 200 aa and ten 800 aa random protein sequences from http://www.bioinformatics.org/sms2/random_protein.html. You might put the two sets of ten random sequences in two different directories, or give them file names that distinguish them.

  3. Write a Python program that searches SwissProt with each of 5 scoring matrices from either ssearch36 (e.g. BL50, BP62, VT160, VT80, VT40) or blastp (BL45, BL62, BL80, PAM70, PAM30) and, for each matrix, calculates three sets of averages over 10 searches (one per shorter sequence). This exercise is easier with ssearch36.

    1. Calculate the average alignment length and percent identity for the best match (one average),
    2. for the second-best match (a second average),
    3. and for the 5th-best match (a third average).
    The result should be 5 sets of 3 average alignment lengths, and 5 sets of 3 average percent identities. (To produce 5 hits from blast, you may need to increase the E()-value threshold.)

  4. Repeat step 3 for the longer query sequences, reporting only the averages for the best hit. Do the average percent identities and alignment lengths differ between the longer and shorter query sequences?

    Why might the alignment lengths differ for some matrices, but not others?

  5. How do the average alignment lengths and percent identities compare to the values shown here?
  6. Going back to the alignment lengths and percent identities that you saved from questions 2g, 3g, and 4g, how do the percent identity and alignment length change as the PSSM becomes more diverse? Estimate the approximate scoring matrix equivalents for the three PSSMs that you searched with.
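
The per-rank averaging asked for in the scoring-matrix searches can be sketched as follows. This is a minimal sketch under stated assumptions, not the required solution: it assumes each search was saved in BLAST tabular format (tab-separated, percent identity in column 3 and alignment length in column 4, as written by `ssearch36 -m 8` or `blastp -outfmt 6`), with one query per output, one line per hit, and hits sorted best-first. The function names `hits_at_ranks` and `average_by_rank` are hypothetical.

```python
from statistics import mean


def hits_at_ranks(tab_text, ranks=(1, 2, 5)):
    """Return {rank: (percent_identity, alignment_length)} for one search.

    Assumes BLAST tabular output with the best hit on the first data line.
    """
    hits = []
    for line in tab_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        hits.append((float(fields[2]), int(fields[3])))
    # Ranks beyond the number of reported hits are left out.
    return {r: hits[r - 1] for r in ranks if r <= len(hits)}


def average_by_rank(searches, ranks=(1, 2, 5)):
    """Average percent identity and alignment length at each rank.

    `searches` is a list of tabular outputs, one string per query sequence.
    Returns {rank: (mean_percent_identity, mean_alignment_length)}.
    """
    per_query = [hits_at_ranks(text, ranks) for text in searches]
    averages = {}
    for r in ranks:
        found = [q[r] for q in per_query if r in q]
        if found:  # only average ranks that at least one search reached
            averages[r] = (mean(p for p, _ in found),
                           mean(n for _, n in found))
    return averages
```

Calling `average_by_rank` once per scoring matrix would give the five sets of three averages; for the longer queries only the best hit is needed, e.g. `average_by_rank(long_outputs, ranks=(1,))`. If a search reports fewer than five hits, that query is simply left out of the rank-5 average, so it is worth checking how many searches contributed to each value.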

