ECG2024 - PSSMs/PSI-BLAST

Computational Genomics, Dec 5, 2024

PSI-BLAST and PSSMs

These exercises use the PSI-SEARCH2 web site, as well as the ECG2024 BLAST, ECG2024 CHAPS, and ECG2024 PSI-BLAST WWW pages.

The new PSI-SEARCH2 web site does iterative PSI-SEARCH searches and highlights domain content.

ECG2024 CHAPS allows you to enter a set of sequences, generate a multiple alignment, and use that multiple alignment for a PSI-BLAST search.

Additional information on the CHAPS program, which takes a set of sequences, produces a multiple alignment, and then uses the multiple alignment with PSI-BLAST, can be found here.

Iterative similarity searching

Use the PSISEARCH2 search page [pgm] to compare Honey bee glutathione transferase D1 NP_001171499/ H9KLY5_APIME [seq] (gi|295842263) to the PIR1 Annotated protein sequence database.
1. Take a look at the output.
  1. How many homologs have E()-value < 0.001? How many homologs do not share statistically significant similarity after the first iteration?
  2. What scoring matrix was used for the first iteration?
2. Run 2, 3, 4 and 5 iterations.
  1. Watch how the E()-value of a non-significant homolog changes the first time it is significant, and the first time it is included in the PSSM.
  2. Track the number of homologs with E()<0.001, and the number of homologs that are not statistically significant. Does the PSSM improve homolog detection?
  3. Look at the percent identity for some of the more distant homologs as the iterations progress. How low are the percent identities for the most distant homologs with E()<10^-6?
  4. What happens to the domain coverage of SSPA_ECO57 as the iterations progress?
  5. What is the E()-value of the highest non-homolog (different domains) at each iteration?
3. Re-run the search from the first iteration: PSISEARCH2 search page [pgm], but, after the first iteration search, check the box to the left of the DCAM_YEAST protein to include a non-homologous sequence in the profile, and run two more iterations. What is the E()-value of DCAM_YEAST in the second iteration? In the third?
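The bookkeeping in step 2 (counting hits with E() < 0.001 at each iteration, and noting when a sequence first becomes significant) can be sketched in a few lines of Python. This is only an illustration: the function name `summarize_iterations`, the hit identifiers, and the E()-values are hypothetical placeholders, not PSI-SEARCH2 output; substitute the values you read from your own result pages.

```python
# Hypothetical per-iteration results: {iteration: {hit_id: E()-value}}.
# The identifiers and numbers below are placeholders, not real search output.
example = {
    1: {"hitA": 1e-30, "hitB": 0.5, "hitC": 3.0},
    2: {"hitA": 1e-40, "hitB": 1e-5, "hitC": 0.8},
    3: {"hitA": 1e-45, "hitB": 1e-12, "hitC": 2e-4},
}


def summarize_iterations(results, threshold=0.001):
    """Count significant hits per iteration and list the newly significant ones."""
    seen = set()  # hits that were significant in any earlier iteration
    summary = []
    for iteration in sorted(results):
        hits = results[iteration]
        significant = {h for h, e in hits.items() if e < threshold}
        summary.append({
            "iteration": iteration,
            "n_significant": len(significant),
            "n_not_significant": len(hits) - len(significant),
            "newly_significant": sorted(significant - seen),
        })
        seen |= significant
    return summary


for row in summarize_iterations(example):
    print(row)
```

A table like this makes it easier to see at which iteration a distant homolog (or an included non-homolog such as DCAM_YEAST) crosses the threshold.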
Is there a missing GST-C domain in sp|P30151|EF1B_XENLA? Copy the first 100 aa of sp|P30151|EF1B_XENLA here and paste the sequence into the PSISEARCH2 search page [pgm] and use the PSI-SEARCH2 program to search the QFO78 protein sequence database. Take a look at the output.
1. Are there any annotated domains in the statistically significant matches?
2. Run iterations 2 and 3 (or more). How many C.GST-C Pfam domains are found with statistically significant matches?
3. How might you "independently" test the hypothesis that EF1B_XENLA contains a C.GST-C domain?
Looking at profiles/PSSMs -- the effect of diversity
Using the CHAPS WWW page, make a multiple alignment and generate a PSSM using the two sequences: gstm1_human, gstm2_human run CHAPS. After generating the alignment with Run ClustalW Now, select Generate PSSM Now.
Examine the PSSM (position specific scoring matrix). Compare the values to BLOSUM62 by identifying some highly conserved positions (':'), and look at the matrix at those positions.
The weights of each residue are shown on the right half of the PSSM.
Try the same process with: gstm1_human, gstm2_human, gstm3_human, gstm1_mouse run CHAPS. Does the scoring matrix or weighting change much?
Try the sequences gstm1_human, gstm3_human, gstp1_human, gsta1_human, gstt1_human, hpgds_human, run CHAPS. Now look at the scoring matrix and weighting. (Again, look at highly conserved sites and compare to BLOSUM62.)
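One way to think about why diversity matters is to look at how much weight the observed residue frequencies in a column receive relative to the pseudocounts derived from BLOSUM62. The sketch below is a simplified illustration, not the CHAPS or PSI-BLAST implementation: it assumes the mixing scheme described for the original PSI-BLAST, in which observed frequencies get weight alpha/(alpha + beta) with alpha equal to the number of distinct residue types minus one, and it assumes beta = 10. Real PSI-BLAST also applies sequence weighting and computes diversity over a reduced alignment; the function names and the example columns are hypothetical.

```python
def column_frequencies(column):
    """Observed residue frequencies in one alignment column (gaps ignored)."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return {}
    total = len(residues)
    return {r: residues.count(r) / total for r in sorted(set(residues))}


def observed_weight(column, beta=10.0):
    """Relative weight of observed frequencies versus pseudocounts.

    Assumption: alpha = (number of distinct residue types) - 1 and beta = 10,
    a simplified per-column version of the original PSI-BLAST scheme.
    """
    distinct = {r for r in column if r != "-"}
    alpha = max(len(distinct) - 1, 0)
    return alpha / (alpha + beta)


# Hypothetical columns: one residue per aligned sequence.
for column in ["WW", "WWWW", "WYFWLF"]:
    print(column, column_frequencies(column), round(observed_weight(column), 3))
```

Under these assumptions a column in which every sequence has the same residue gets weight 0 for the observed data, so its scores stay close to the corresponding BLOSUM62 row, while a more diverse column shifts the scores toward the observed frequencies.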

Iterative searching with PSI-BLAST
Using the ECG2024 PSI-BLAST [pgm] page, compare Honey bee glutathione transferase D1 NP_001171499/ H9KLY5_APIME [seq] (gi|295842263) to the PIR1 Annotated protein sequence database. Set Iterations to 5 and E() cutoff to 1e-4. Are the E()-values for GSTA1_RAT and GSTA4_RAT the same as the ones you saw in question 4?
Using the ECG2024 PSI-BLAST [pgm] page, compare Honey bee glutathione transferase D1 NP_001171499/ H9KLY5_APIME to the PIR1 database again, this time setting the E() cutoff to 0.01. Does PSI-BLAST ever include a non-glutathione transferase homolog?
The honey bee GST has the same domain structure as gstt1_drome at Pfam. Compare that to the domain structure of the most distant statistically significant sequence you found (again at Pfam).
Try the same search, setting the E() cutoff to 0.2. What is the final E()-value for SYEP_HUMAN Bifunctional glutamate/proline--tRNA ligase?
Do the same series of searches using sp|P08100|OPSD_HUMAN [pgm]. Set the E() cutoff to 1e-4 and search the PIR1 database for 10 iterations. Compare the converged results for a search with and without composition-based statistics.
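The exercises use the ECG2024 PSI-BLAST web page, but the same settings (number of iterations, inclusion E() cutoff, composition-based statistics on or off) correspond to options of the NCBI BLAST+ `psiblast` command-line program. The sketch below only builds the command line and prints it; it assumes a local BLAST+ installation and a locally formatted database, and the file name `opsd_human.fasta` and database name `pir1` are hypothetical placeholders.

```python
import shlex


def psiblast_command(query, db, iterations=5, inclusion_evalue=1e-4,
                     comp_based_stats=True):
    """Build an argument list for NCBI BLAST+ psiblast (nothing is executed)."""
    return [
        "psiblast",
        "-query", query,                       # FASTA file with the query sequence
        "-db", db,                             # name of a formatted protein database
        "-num_iterations", str(iterations),
        "-inclusion_ethresh", str(inclusion_evalue),  # E() cutoff for PSSM inclusion
        # 2 is the psiblast default adjustment; 0 turns composition-based statistics off
        "-comp_based_stats", "2" if comp_based_stats else "0",
    ]


# Hypothetical file and database names; compare the two converged searches.
with_cbs = psiblast_command("opsd_human.fasta", "pir1", iterations=10)
without_cbs = psiblast_command("opsd_human.fasta", "pir1", iterations=10,
                               comp_based_stats=False)
print(shlex.join(with_cbs))
print(shlex.join(without_cbs))
```

Passing either list to `subprocess.run` would start the search on a machine where `psiblast` and the database are available.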
Try searching using the PSSMs you generated in the CHAPS/PSSM section. Search the swissprot database, which has been annotated to indicate most GST homologs.
1. Search with two sequences: gstm1_human, gstm2_human run CHAPS
2. Search with four sequences: gstm1_human, gstm2_human, gstm3_human, gstm1_mouse run CHAPS.
3. Try the sequences gstm1_human, gstm3_human, gstp1_human, gsta1_human, gstt1_human, gsto1_human, hpgds_human, run CHAPS.
In each of the searches, try to determine how broad the initial search was, and watch out for high-scoring unrelated sequences.
