Biol4230 - PSSMs/HMMs

fasta.bioch.virginia.edu/labs/biol4230/hmmer_demo.html

These exercises use the UVa BLAST, UVa CHAPS, and UVa PSI-SEARCH2 WWW pages.

UVa CHAPS allows you to enter a set of sequences, generate a multiple alignment, and use that multiple alignment for a PSI-SEARCH2 search.

Additional information on the CHAPS program, which takes a set of sequences, produces a multiple alignment, and then uses the multiple alignment with PSI-SEARCH2, can be found here.

Looking at HMMs -- the effect of diversity

Building an HMM (2 sequences)

Using the CHAPS WWW page, make a multiple alignment and generate an HMM using the two sequences gstm1_human and gstm2_human: run CHAPS [pgm].
1. Build a multiple sequence alignment by selecting and
2. Next, go down to the bottom third of the screen and select
3. Take a look at the HMM.
  1. What do you think the letters in the row HMM A C ... indicate? How many letters are in that row? What are the numbers down the left side of the window?
  2. There are three rows associated with each number on the left: two rows of 20, and one row of 7. With what part of the HMM are the 7 numbers associated?
  3. In all three rows, do large numbers appear to be "good" or "bad"?
4. Now run an HMMSearch search against PIR1 with the HMM you just built using
5. How many glutathione transferase homologs do you see?
6. What is the E()-value of the most distant glutathione transferase?
7. HMMSearch does not provide a percent identity. Why not?
8. Go back and also run the MSA of those two sequences against PIR1 using PSI-SEARCH2. How does HMMSearch compare with PSI-SEARCH2?
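
To check your reading of the 7-number rows in step 3, the following is a minimal sketch that converts one such row into probabilities. It assumes the HMM is displayed in the HMMER3 ASCII profile format, where each value is stored as a negative natural log probability and `*` marks a probability of zero; the transition order and the example row are assumptions for illustration, not values taken from the lab's HMM.

```python
import math

# Transition order assumed from the HMMER3 ASCII profile format.
TRANSITIONS = ("m->m", "m->i", "m->d", "i->m", "i->i", "d->m", "d->d")


def to_probability(field):
    """Convert one stored field to a probability; '*' marks a zero probability."""
    if field == "*":
        return 0.0
    return math.exp(-float(field))


def parse_transition_row(line):
    """Map the 7 fields of a transition row to named probabilities."""
    fields = line.split()
    if len(fields) != len(TRANSITIONS):
        raise ValueError(f"expected {len(TRANSITIONS)} fields, got {len(fields)}")
    return {name: to_probability(f) for name, f in zip(TRANSITIONS, fields)}


# Hypothetical transition row; paste a row from your own HMM instead.
row = "0.01000  5.00000  5.70000  0.61958  0.77255  0.48576  0.95510"
probs = parse_transition_row(row)
for name, p in probs.items():
    print(f"{name}: {p:.4f}")
```

The same `to_probability` helper can be applied to the two 20-number rows, which makes it easier to compare positions once the values are on a 0 to 1 scale.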

Searching with a diverse HMM

Try the sequences gstm1_human, gstm3_human, gstp1_human, gsta1_human, gstt1_human, gsto1_human, hpgds_human: run CHAPS [pgm].
Repeat the previous steps for building a Multiple Sequence Alignment and sending it to HMMSearch.
1. Build a multiple sequence alignment by selecting and .
  Take a quick glance at the multiple sequence alignment; how many large gapped regions do you see?
2. Next, select in the bottom third of the page.
3. Examine the HMM in detail.
  1. Look to the far right of the lettered columns of the HMM. Does the HMM start at residue 1? If not, where does it start?
  2. Look back at the multiple sequence alignment, and find some locations with gaps. (Each block of the multiple sequence alignment is 50 columns wide, so the second block starts at column 51.)
  3. Look through the positions in the HMM at the 7-number rows, which correspond to match/insert/delete transitions. What happens in that row at positions with gaps?
4. Now run an HMMSearch search against PIR1 with the HMM you just built ().
5. How many glutathione transferase homologs do you see?
6. What is the E()-value of the most distant glutathione transferase?
7. Look at the near-significant alignment with SYEP_HUMAN. What is the alignment length? Do you think SYEP_HUMAN is likely to be a homolog?
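
Finding the gapped regions by eye can be tedious in a long alignment. The sketch below is one way to list the gapped columns and translate each column number into a display block, using the 50-column block width stated in step 3. The toy alignment and the function names are hypothetical; note also that HMM position numbers need not equal alignment column numbers, since columns that are mostly gaps may not be assigned to a match state.

```python
BLOCK_WIDTH = 50  # columns per displayed block, as stated in the exercise


def gap_fractions(alignment, gap_chars="-."):
    """Return the fraction of gap characters in each alignment column."""
    lengths = {len(seq) for seq in alignment}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n = len(alignment)
    return [sum(seq[i] in gap_chars for seq in alignment) / n
            for i in range(lengths.pop())]


def locate(column):
    """Translate a 1-based column number into (block, offset within block)."""
    block, offset = divmod(column - 1, BLOCK_WIDTH)
    return block + 1, offset + 1


# Toy alignment (hypothetical); paste the rows of your own alignment instead.
toy = [
    "MPMILGYW--DIRGL",
    "MPMTLGYWNIDIRGL",
    "MS--LGYW--NIRGL",
]
for col, frac in enumerate(gap_fractions(toy), start=1):
    if frac > 0:
        block, offset = locate(col)
        print(f"column {col} (block {block}, offset {offset}): {frac:.0%} gaps")
```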
Using the multiple sequence alignment you just made, also build a PSSM, send it to PSI-SEARCH2, and search against the PIR1 database. Which strategy is more successful, PSI-SEARCH2 or HMMSearch?

Effects of non-homologs

Try the sequences gstm1_human, gstp1_human, gstt1_human, narj_eco57, dyr_bpt4, and tpis_rabit (the last three are non-glutathione transferases): run CHAPS [pgm].
Repeat the previous steps for building a Multiple Sequence Alignment and sending it to PSI-SEARCH2 (or PSI-BLAST if it doesn't work).
1. Build a multiple sequence alignment by selecting and .
  Looking at the multiple sequence alignment, does it look very different from the alignment of the previous set of sequences?
2. Next, select
3. Now run an HMMSearch search against PIR1 with the HMM you just built ().
4. How many glutathione transferase homologs do you see?
5. Can you tell the glutathione S-transferase homologs from non-homologs?
6. What is the E()-value of the most distant glutathione transferase?
Using the contaminated multiple sequence alignment you just made, also build a PSSM, send it to PSI-SEARCH2, and search against the PIR1 database. Is either method less sensitive to non-homologous contamination?
Why do you think a contaminated PSSM/HMM is so good at finding homologs (producing low E()-values) to sequences in the multiple sequence alignment, but does not produce significant E()-values to other proteins?
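
One way to explore this last question is to score a single alignment column with and without non-homologous residues mixed in. The sketch below computes simple log-odds scores for one column. It is a simplified illustration only: the uniform background frequency, the fixed pseudocount, and the example columns are all assumptions, and PSI-SEARCH2 and HMMER use sequence weighting and more elaborate pseudocount schemes.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequency


def column_scores(residues, pseudocount=1.0):
    """Log-odds score (in bits) for each amino acid in one alignment column."""
    counts = Counter(r for r in residues if r in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: math.log2(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in AMINO_ACIDS}


# Hypothetical columns: three homologs alone, then with three non-homologs added.
clean = "WWW"
contaminated = "WWWAKG"
for label, column in (("clean", clean), ("contaminated", contaminated)):
    scores = column_scores(column)
    print(label, {aa: round(scores[aa], 2) for aa in "WAKGL"})
```

Compare the score of the conserved residue in the two columns, and the scores of the residues contributed by the non-homologs, then consider what that implies for sequences that are in the alignment versus those that are not.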
