CSHL Computational Genomics

NCBI Exercises, Part 1


These exercises all begin at the NCBI Home Page.
   I. Entrez Searching: Controlled Vocabularies
Goal: Understand how different Entrez databases translate the query "cancer"
  1. Term translation in Entrez PubMed
    1. On the NCBI Home Page, enter cancer in the search box and click Go. (Leave the database menu set at All Databases.)
    2. Click on the number of hits to PubMed.
    3. Click the Details tab above the search results.
    4. How did Entrez translate this query? What new terms do you see? What field limits are used? Look in particular for the field [MeSH Terms].
    5. Medical Subject Headings (MeSH) is a controlled vocabulary used to index all PubMed abstracts.
    6. Edit your query in the text box so that only the term "neoplasms"[MeSH Terms] remains. Click the Search button under the text box to run the modified search. The resulting set contains only abstracts relevant to neoplastic disease.
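
The fielded query above can also be written down outside the browser. The following is a minimal sketch, not part of the exercise: it assumes NCBI's E-utilities ESearch service as the programmatic counterpart of the Entrez search box, and it only builds the request URL without sending it. The helper name esearch_url is hypothetical.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch request URL (hypothetical helper)."""
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode percent-escapes the quotes and brackets of the field tag.
    return ESEARCH + "?" + urlencode(params)


url = esearch_url("pubmed", '"neoplasms"[MeSH Terms]')
print(url)
```

The field tag travels inside the term parameter, so the same helper would work for any Entrez database and field combination used later in these exercises.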

  2. Term translation in Entrez Protein
    1. Select Protein from the Database pulldown menu. Clear the search box and type cancer. Click Go.
    2. Click the Details tab above the search results.
    3. How did Entrez translate this query? What field limits are used here and not in PubMed?
    4. Entrez Taxonomy, searched with the field [organism], is a controlled vocabulary used to index all molecular biology data at NCBI. In the sequence databases (e.g., nucleotide, protein, genome, popset, snp), Entrez translates this query to retrieve records from the crustacean genus Cancer rather than records corresponding to neoplastic disease.
    5. To retrieve only sequences from the genus Cancer, edit your query in the text box so that only the term "cancer"[organism] remains. Click the Search button under the text box to run the modified search.
    6. To retrieve sequences more related to neoplastic disease, change the query to cancer[title] and click Go. This query retrieves records that contain the word "cancer" in their definition lines (titles).

  3. Term translation in Entrez Taxonomy
    1. Select Taxonomy from the Database pulldown menu. Clear the search box and type cancer. Click Go.
    2. You should see the single record for the genus Cancer. Click the Details tab above the search results.
    3. How did Entrez translate this query? What field limits are used?
    4. In Taxonomy, Entrez simply searches for unfielded query terms using [All Names].
    5. Clear your search box and search Taxonomy for rock crab. Click the name of the resulting hit. Do you see why the query "rock crab" found this record?
    6. Entrez Taxonomy automatically associates scientific names with common names annotated on the taxonomy record.

   II. Exploring Entrez Taxonomy
Goal: Locate all data at NCBI for a particular species (wine grape)
  1. Find the record for wine grapes in Entrez Taxonomy
    1. On the NCBI Home Page, select Taxonomy from the Database menu.
    2. Enter wine grape in the search box and click Go.
    3. Click the record for wine grapes (Vitis vinifera).

  2. Locate genome data for wine grapes
    1. The table on the right lists all data for this species. There is one Genome Sequence. Click on the number. What genomic sequence is this?
    2. Go back to the wine grape taxonomy page. Click on the number of genes. What kinds of genes are these? How are they related to the genomic sequence you found? Find out by choosing Genome Links from the Display pulldown menu.
    3. In the Links menu, select Taxonomy. Click on the name to open the record.

  3. Locate nucleotide sequence data for wine grapes
    1. Which database has the most data for this organism?
    2. Click on the number of records in Nucleotide Core. Looking at the tabs above your results, how many of these sequences are mRNAs?

  4. View data across the entire genus Vitis
    1. Again go back to the wine grape taxonomy page. Now click on Vitis in the lineage in the center of the page (the last node). This displays all species within the genus Vitis for which NCBI has data.
    2. Check the Nucleotide and Protein boxes at the top of the page (check other ones, too, if you want). Click the Display button. You should now see colored numbers indicating the number of records in each database for each taxon. Which other grape species have a large amount of data (>1000 records)?

   III. Nucleotide Data in Entrez: Using Limits, Fields and Links
Goals:
  • View in FASTA format the curated mRNAs for zebrafish estrogen receptors
  • Find the Gene records for these receptors
  • Retrieve and identify all nucleotide records associated with a given receptor gene
  1. Retrieve all zebrafish mRNAs
    1. On the NCBI Home Page, set the Database pulldown to CoreNucleotide and enter zebrafish[organism] in the search box. Click Go.
    2. Click on the Limits tab above the results.
    3. Set the Molecule pulldown menu to mRNA and the Only from menu to RefSeq (for NCBI Reference Sequences). Click Go.
    4. This set corresponds to the current transcriptome for zebrafish (a non-redundant set of all mRNAs).

  2. Limit the set to estrogen receptors
    1. Click the Preview/Index tab (to the right of Limits).
    2. Select Title from the All Fields menu, and type estrogen receptor in the text box to the right. Click the Index button.
    3. Click the first term in the list, estrogen receptor, and then click the AND button above the list. Notice that the term "estrogen receptor"[title] has been added to your query in the search box at the top of the page.
    4. Click Preview to see how many records this query retrieves.
    5. Click on the number of records retrieved (to the right of the query in your history).
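
The query assembled through Limits and Preview/Index in the steps above is, in the end, a single Boolean string. Here is a minimal sketch of composing it by hand. It assumes that biomol_mrna[properties] and refseq[filter] are the text equivalents of the Molecule and Only from menus; check the Details tab for the exact translation your own session produces. The helper and_query is hypothetical.

```python
def and_query(*terms):
    """Join already-fielded Entrez terms with AND (hypothetical helper)."""
    return " AND ".join(terms)


query = and_query(
    "zebrafish[organism]",
    "biomol_mrna[properties]",     # assumed equivalent of Molecule = mRNA
    "refseq[filter]",              # assumed equivalent of Only from = RefSeq
    '"estrogen receptor"[title]',  # the term added from the Title index
)
print(query)
```

Typing the resulting string directly into the search box should be a shortcut to the same set, provided the assumed tags match what Details reports.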

  3. View the FASTA sequences for the estrogen receptor mRNAs
    1. Which of these records are actually estrogen receptors (not predicted)? These should correspond to the three records that are NM RefSeqs.
    2. Click the checkboxes to the left of each of the three estrogen receptors. Then choose FASTA from the Display menu above the results (just below the Preview/Index tab). FASTA sequences for the three records should appear.
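
RefSeq accession prefixes make the curated/predicted distinction easy to apply to a list of hits: NM_ marks a curated mRNA, while XM_ marks a predicted (model) mRNA. The sketch below shows the filter; the accessions in it are placeholders for illustration, not the actual zebrafish results, and is_curated_mrna is a hypothetical helper.

```python
def is_curated_mrna(accession):
    """True for RefSeq NM_ accessions (curated mRNAs), False otherwise."""
    return accession.upper().startswith("NM_")


# Placeholder accessions for illustration only, not real search results.
hits = ["NM_000001.1", "XM_000002.1", "NM_000003.2", "XM_000004.1"]

curated = [acc for acc in hits if is_curated_mrna(acc)]
print(curated)  # ['NM_000001.1', 'NM_000003.2']
```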

  4. Find the Gene records for the estrogen receptor mRNAs
    1. Go back in your browser to the Entrez results page. Make sure the three estrogen receptor mRNAs are still checked.
    2. Select Gene Links from the Display menu above the results. You will now be in Entrez Gene, and the three linked Gene records will appear.

  5. Retrieve all nucleotide records associated with a single estrogen receptor gene
    1. Locate the gene record for the type 1 receptor with symbol esr1.
    2. Click CoreNucleotide in the Links menu to the right of the esr1 gene.
    3. Identify records for the chromosome, genomic contig, BAC clone (DKEY-147L14), curated mRNA, and primary mRNAs from DDBJ, EMBL, and GenBank.
    4. In the Reports menu to the right of the BAC clone, click Revision History. How many times has the sequence of this record been updated? Compare version 9 to version 8 by clicking appropriate radio buttons under columns I and II and then clicking Show. Notice in particular the definition line, sequence length, GenBank division, and COMMENT. What changed?

   IV. Protein Data in Entrez: Sequences, Structures and Domains
Goals: For the protein target of the drug gleevec, do the following:
  • compare the bound conformations of the drug and the normal substrate
  • find the curated protein sequence
  • find the taxonomic distribution for all proteins in Entrez that share the same domain architecture as the target
  1. Retrieve a structure with bound gleevec
    1. On the NCBI Home Page, type gleevec in the search box and click Go.
    2. Click on the hits to PubChem Compound.
    3. Click on the Protein3D tab above the results to limit the records to those bound to a 3D structure.
    4. Click on the image of the resulting chemical (CID 5291), and then on the Protein Structures link to the right of the image on the summary page.
    5. While any of these structures could be used, we will focus on 1IEP. Click on that accession to open the record.

  2. Locate the drug binding site using Cn3D
    1. Note that the MMDB-ID for this structure is 16291. We will need it later. Also, make a note of what species this structure is from.
    2. Click the structure image to launch Cn3D.
    3. Basic Cn3D controls:
      • To rotate, click the left mouse button and drag
      • To translate (move), hold down Shift and click/drag
      • To zoom, hold down Ctrl and click/drag
    4. In the structure window, select Style / Coloring Shortcuts / Molecule. The gleevec molecules in each of the two chains should now be easily visible as ball-and-stick models.
    5. Zoom in on one of the gleevec molecules and double-click it. It should turn yellow.
    6. In the structure window, select Show/Hide / Select by distance / Residues Only. Set the distance cutoff to 3.2 Angstroms and click OK.
    7. The residues in contact with gleevec should now be highlighted yellow. Point but do not click your mouse over the letter of each highlighted residue, and find its residue number in the lower left corner of the sequence window. Make a note of which residues contact gleevec (use the loc numbers, e.g., loc 64).
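
The distance selection used above can be pictured as a simple geometric test. The sketch below is an illustration under stated assumptions, not Cn3D's actual implementation: it assumes a residue counts as "in contact" when any of its atoms lies within the cutoff of any ligand atom. The coordinates and residue numbers are made up, and residues_within is a hypothetical helper (math.dist requires Python 3.8 or later).

```python
import math


def residues_within(ligand_atoms, protein_atoms, cutoff=3.2):
    """Return sorted residue numbers having any atom within `cutoff`
    Angstroms of any ligand atom.

    ligand_atoms:  list of (x, y, z) tuples
    protein_atoms: list of (residue_number, (x, y, z)) tuples
    """
    close = set()
    for resnum, atom in protein_atoms:
        if any(math.dist(atom, lig) <= cutoff for lig in ligand_atoms):
            close.add(resnum)
    return sorted(close)


# Made-up coordinates for illustration only.
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
protein = [
    (10, (0.0, 3.0, 0.0)),  # 3.0 A from the first ligand atom: contact
    (11, (0.0, 5.0, 0.0)),  # more than 3.2 A from both ligand atoms
    (12, (4.5, 0.0, 0.0)),  # 3.0 A from the second ligand atom: contact
    (10, (9.0, 9.0, 9.0)),  # a distant atom of residue 10, ignored
]

print(residues_within(ligand, protein))  # [10, 12]
```

Raising the cutoff widens the shell of selected residues, which is why the exercise fixes it at 3.2 Angstroms to capture only close contacts.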
    8. Quit Cn3D when you're done.

  3. View the protein target's sequence aligned to a curated domain model
    1. On the structure summary page for 1IEP, click the red bar labeled TyrKc under chain A. This will insert the 1IEP sequence into the tyrosine kinase domain model.
    2. On the cd00192 summary page, click the Structure View button on the left side of the page under the Structure heading.
    3. Cn3D should now show several aligned structures and sequences for the tyrosine kinase domain. The second sequence will be that of 1IEP chain A (1iepa), labeled "query".

  4. Import the protein target's structure
    1. In the sequence window, select Edit / Enable Editor.
    2. In the sequence window, select Imports / Show Imports.
    3. In the imports window, select Edit / Import Structure.
    4. Choose Via Network and click OK.
    5. Enter 16291 (the MMDB-ID for 1IEP) and click OK.
    6. Choose 1IEP_A and click OK.
    7. Click OK in the message box that appears indicating that we will need to save the file to see the imported structure.

  5. Align the protein target's structure to the model
    1. The alignment that appears is the VAST structural alignment between 1FGI and 1IEP. If you scroll across the alignment, you will see a few red-shaded regions indicating alignment problems that need to be resolved to fit the domain model.
    2. In the imports window, select Algorithms / Block Align Single and then click anywhere on the pair of sequences.
    3. In the window that appears, uncheck Global alignment and click OK.
    4. In the imports window, select Alignments / Merge All.
    5. Close the imports window.
    6. To view the imported structure, we need to save the current file and reload it into Cn3D.
    7. In the structure window, select File / Save As..., and click Yes to the two questions. Save the file to disk (remember the path!).
    8. In the structure window, select File / Open, browse to your file and open it.
    9. Choose File / Realign Structures to align the new structure.

  6. View the bound conformations of gleevec and ATP in the structural alignment
    1. To see the binding sites more easily, we will view only two of these structures: 1IR3 (with bound ATP) and the imported structure, 1IEP, with bound gleevec.
    2. In the structure window, select Show/Hide / Pick Structures....
    3. Click on the PDB codes so that only 1IR3_A and 1IEP_A are highlighted in dark blue. (The subdomains of these chains will also be highlighted.) Click Done.
    4. In the structure window, select Style / Coloring Shortcuts / Molecule. Gleevec should now appear as a light blue molecule, while ATP will appear brown. Zoom in to see them.
    5. In the structure window, select CDD / CDD Overview, then click Show Annotations Panel. Select ATP binding pocket in the Annotations list on the left and click Highlight.
    6. Now double-click gleevec and ATP, one after another, to highlight both ligands as well as the binding site. In the structure window, select Show/Hide / Show Selected Residues. Now only the two ligands and the binding site should be shown. Click anywhere in the sequence window to remove the highlighting.
    7. What portions of gleevec overlap ATP in the binding site? The binding site residues should be colored pink in the sequence alignment. If you position your mouse over the residues in the second row (query, 1IEP) and look at the residue positions in the lower left, do you find any that you found in close contact with gleevec in step 2 above?
    8. Quit Cn3D when you're done.

  7. Find the curated protein sequence for the target
    1. Go back in your browser to the structure summary page for 1IEP.
    2. Click the Protein link to the left of chain A.
    3. We will now find the most similar RefSeq protein to this sequence. Click BLink to the right, next to the Links menu.
    4. Select REFSEQ from the Keep only menu just above the list of proteins. Click Select.
    5. The three top sequences all have the same BLAST score. What is unusual about these sequences? What kingdoms are they from? The majority of these proteins are from metazoans, as is the query (1IEP). How do you explain that two sequences from a different kingdom are two of the three most similar sequences to the query? Since we are trying to find the gene associated with the structure, click on the sequence among the top three that comes from the same species as the structure (NP_033724).

  8. Retrieve all proteins in Entrez with the same domain architecture as the target
    1. Click the Conserved Domains link in the upper right. What other domains besides the tyrosine kinase does this protein contain?
    2. Click the Search for similar domain architectures button.
    3. Scroll down on the page and click the boxes to the left of the four domains in the query:
      • cd00173 (Src homology 2)
      • cd00192 (tyrosine kinase)
      • pfam00018 (Src homology 3)
      • pfam08919 (F-actin binding)
    4. Then click Subset by selected domains (not Taxonomy!!). This limits the proteins to only those that contain all four selected domains.
    5. Only one architecture should remain, representing over 50 sequences. Click on the number of sequences.
    6. On the next page, click the Look Up Sequences in Entrez button below the graphic. These sequences are all the proteins that contain exactly the same domain architecture as the gleevec target.
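
Subsetting by selected domains amounts to keeping only the architectures whose domain list includes every checked domain. A minimal sketch of that filter follows. The four domain accessions come from the exercise, but the architecture names and their contents are hypothetical, as is the helper subset_by_domains.

```python
# The four domains checked in the exercise.
selected = {"cd00173", "cd00192", "pfam00018", "pfam08919"}

# Hypothetical architectures (ordered domain lists) for illustration only.
architectures = {
    "arch_A": ["pfam00018", "cd00173", "cd00192", "pfam08919"],
    "arch_B": ["pfam00018", "cd00173", "cd00192"],
    "arch_C": ["cd00192"],
}


def subset_by_domains(archs, required):
    """Keep architectures that contain every required domain."""
    return [name for name, domains in archs.items()
            if required <= set(domains)]


print(subset_by_domains(architectures, selected))  # ['arch_A']
```

Note that this test is an AND across the checked domains: an architecture that lacks even one of them, such as arch_B here, is dropped.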

  9. View the taxonomic distribution of these proteins
    1. Select Taxonomy Links from the Display menu.
    2. To view these taxa on a tree, select Common Tree from the Display menu.
    3. Explore branches of the tree by checking boxes to the left of the node names and clicking Choose at the top.
    4. What taxonomic node contains all of these organisms? What can you conclude about the distribution of these kinases among known organisms?
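
The question of which node contains all of the organisms is a search for the deepest node shared by every lineage. The sketch below shows one simple way to compute it, assuming each lineage is given as a root-to-leaf list of names. The names are generic placeholders so as not to give away the answer, and deepest_common_node is a hypothetical helper rather than anything the Common Tree tool exposes.

```python
def deepest_common_node(lineages):
    """Return the last node shared by all root-to-leaf lineages, or None."""
    common = None
    # zip walks the lineages level by level, stopping at the shortest one.
    for nodes in zip(*lineages):
        if len(set(nodes)) != 1:
            break  # the lineages diverge at this level
        common = nodes[0]
    return common


# Placeholder lineages for illustration only.
lineages = [
    ["root", "clade1", "clade2", "species_a"],
    ["root", "clade1", "clade2", "species_b"],
    ["root", "clade1", "clade3", "species_c"],
]

print(deepest_common_node(lineages))  # clade1
```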