Biol4230 - Accession translation/XML homework

Biol4230 - Python/Accession linking homework 5 - DUE Monday, Feb. 19, 5:00 PM

In addition to putting the Python programs for this homework in biol4230/hwk5, create a file hwk5.notes that identifies which program answers which question and contains the output of each Python program.

Possibly helpful suggestions

Answer questions 2 and 5 from the lab in hwk5.notes
Write a Python script that uses esearch.fcgi to search the NCBI and download all the human RefSeq protein accessions for GSTM*. Use the search term:
```
GSTM*+AND+human[organism]+AND+srcdb_refseq[prop]
```
and the strategy shown in Thursday's handout, slide 33.
1. Combine the esummary strategy to get protein lengths (on slide 31 of the revised Thursday handout) with the XML-based esearch strategy (handout, slide 33) to print the accession ['Caption'], length ['Length'], and description ['Title'] for each of the sequences found by the search.
2. Use regular expressions to restrict the output to accessions of proteins that contain "glutathione" in their ['Title'] and do not contain PREDICTED in their ['Title'].
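The search and filtering steps above can be sketched roughly as follows. This is a minimal illustration, not necessarily the approach on the handout slides: the helper names (`esearch_url`, `fetch`, `parse_ids`, `keep_title`) are hypothetical, the records at the bottom are invented placeholders, and the summary records are assumed to be dicts with the 'Caption', 'Length', and 'Title' keys named in the assignment.

```python
import re
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import quote

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def esearch_url(term, db="protein", retmax=500):
    """Build an esearch.fcgi URL. The term is already '+'-joined, so
    '+' and '*' are left alone and only the brackets are percent-encoded."""
    return f"{EUTILS}esearch.fcgi?db={db}&retmax={retmax}&term={quote(term, safe='+*')}"

def fetch(url):
    """Download a URL and return its body as text (needs network access)."""
    with urllib.request.urlopen(url) as handle:
        return handle.read().decode("utf-8")

def parse_ids(xml_text):
    """Pull the <Id> values out of the <IdList> of an esearch XML result."""
    root = ET.fromstring(xml_text)
    return [elem.text for elem in root.findall("./IdList/Id")]

def keep_title(title):
    """True if the title mentions glutathione and is not a PREDICTED entry."""
    has_gst = re.search(r"glutathione", title, re.IGNORECASE) is not None
    predicted = re.search(r"PREDICTED", title) is not None
    return has_gst and not predicted

# Invented placeholder records, shaped like the esummary fields in question 1.
records = [
    {"Caption": "NP_EXAMPLE1", "Length": 218, "Title": "glutathione S-transferase Mu 1"},
    {"Caption": "XP_EXAMPLE2", "Length": 200, "Title": "PREDICTED: glutathione S-transferase Mu 2"},
]
for rec in records:
    if keep_title(rec["Title"]):
        print(rec["Caption"], rec["Length"], rec["Title"], sep="\t")
```

In a real run, `parse_ids(fetch(esearch_url(term)))` would supply the IDs to pass to esummary, and the esummary results would replace the placeholder `records`.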
Map each of the genuine "glutathione"-containing (not "PREDICTED") RefSeq accessions to UniProt accessions at the UniProt ID mapping site.
1. Are all the human proteins present in UniProt?
2. Are the mapped proteins the same length?
3. Do the mapped proteins have identical sequences?
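Once the mapping and the sequences are in hand, the three questions above reduce to simple comparisons. The sketch below assumes you have already built three dicts yourself (RefSeq accession to UniProt accession, and accession to sequence for each database); the function name and the toy sequences are hypothetical, and no particular download method is implied.

```python
def compare_mapping(mapping, refseq_seqs, uniprot_seqs):
    """Return one row per RefSeq accession:
    (refseq_acc, uniprot_acc, same_length, identical).
    The two comparison fields are None when no UniProt sequence is available."""
    rows = []
    for ref_acc, ref_seq in refseq_seqs.items():
        uni_acc = mapping.get(ref_acc)          # None if the accession did not map
        uni_seq = uniprot_seqs.get(uni_acc)     # None if unmapped or not downloaded
        if uni_seq is None:
            rows.append((ref_acc, uni_acc, None, None))
        else:
            rows.append((ref_acc, uni_acc,
                         len(ref_seq) == len(uni_seq),
                         ref_seq == uni_seq))
    return rows

# Toy placeholder data, only to show the shape of the inputs.
mapping = {"REF_A": "UNI_A", "REF_B": "UNI_B"}
refseq_seqs = {"REF_A": "MPMILG", "REF_B": "MPMTLG", "REF_C": "MKKK"}
uniprot_seqs = {"UNI_A": "MPMILG", "UNI_B": "MPMSLG"}

for row in compare_mapping(mapping, refseq_seqs, uniprot_seqs):
    print(*row, sep="\t")
```

Note that two proteins can have the same length without being identical, which is why the last two columns are reported separately.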
Look up the domain content for each of the UniProt accessions in Pfam.
For each of the human proteins that can be mapped to UniProt and Pfam, how many of the proteins have Pfam domains that are less than 50% of the Pfam family model length?
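One way to tally the 50% question is sketched below. It assumes "less than 50% of the model length" means that the aligned span on the Pfam model (model end minus model start, plus one) is under half the model length; if you interpret the question differently, adjust the coverage calculation. The dict keys (`acc`, `hmm_start`, `hmm_end`, `model_length`) are hypothetical names for values you would copy from the Pfam domain tables, and the numbers are invented.

```python
def model_coverage(hmm_start, hmm_end, model_length):
    """Fraction of the Pfam model spanned by one domain hit."""
    return (hmm_end - hmm_start + 1) / model_length

def proteins_with_short_domains(hits, threshold=0.5):
    """Return the sorted accessions that have at least one domain hit
    covering less than `threshold` of its Pfam model."""
    short = set()
    for hit in hits:
        frac = model_coverage(hit["hmm_start"], hit["hmm_end"], hit["model_length"])
        if frac < threshold:
            short.add(hit["acc"])
    return sorted(short)

# Invented placeholder hits, one dict per domain found on a protein.
hits = [
    {"acc": "UNI_A", "hmm_start": 1, "hmm_end": 76, "model_length": 76},
    {"acc": "UNI_A", "hmm_start": 10, "hmm_end": 40, "model_length": 95},
    {"acc": "UNI_B", "hmm_start": 1, "hmm_end": 50, "model_length": 100},
]
short = proteins_with_short_domains(hits)
print(len(short), short)
```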
