CSHL Computational Genomics

NCBI Exercises, Part 2


These exercises all begin at the NCBI Home Page.
   I. Investigating SNPs and Phenotypes
Goals:
  • Identify the gene associated with a nucleotide sequence
  • View the annotation of this gene in the human, mouse and rat genomes
  • Find a human phenotype associated with polymorphisms in this gene
  • Find the SNP associated with the phenotype and the population frequencies of its alleles
  • Interpret the SNP in the context of the protein domain and its PSSM
  • Search for instances of the SNP in the human EST database
  1. Identify the gene associated with a nucleotide sequence
    1. On the NCBI Home Page, enter BQ669567 in the search box. Click Go. (Leave the database menu set to All Databases.)
    2. To which of the three component nucleotide databases (CoreNuc, EST, or GSS) does this sequence belong? Click that database, then the accession, and review the record. What species and tissue type are the source of this sequence?
    3. Click Links in the upper right and choose UniGene.
    4. Click the UniGene cluster number to open the record. With what gene is this sequence associated?
    5. Scroll down to the Gene Expression section and click Expression Profile. In what tissues and disease states is this gene most highly expressed?

  2. View the annotation of this gene in the human, mouse and rat genomes
    1. Go back in your browser to the UniGene page. Click Links in the upper right and choose Gene.
    2. Take a moment to review the Summary section. What is the biological function of this gene?
    3. The Genomic regions, transcripts, and products section lists the NM/NP pairs for each annotated transcript variant for a gene. How many transcript variants are annotated for this gene? Locate the RefSeq accessions for the mRNA(s) and protein(s).
    4. To view this gene in the Map Viewer, click the See ... in MapViewer link in the Genomic Context section. A large number of maps are shown, with the Gene map on the far right.
    5. To view syntenic regions in the mouse and rat genomes, we only need the Gene maps from these three species. Click Maps & Options either on the left side bar or on the upper right. Using the Remove button, remove all maps from the right menu except the Gene map (click on the map, then click Remove).
    6. Now select mouse from the Org menu above the left map menu. Use Add to add the mouse Gene map. Repeat this to add the rat Gene map.
    7. Use the Move UP and Move DOWN buttons to move the selected maps relative to one another (the bottom map will become the rightmost map). Click Apply to redraw the maps, then click OK to close the window.
    8. To draw the three maps to the same scale (bp/pixel), check the Synteny 1:1 box in the left side bar. You may want to zoom out to see all three genes. Position your mouse at the vertical midpoint of one of the genes and click on the gray vertical line. Choose Zoom out x2 from the menu that appears. Which gene is the smallest in bp length?
    9. Look for bold numbers to the left of each map near the vertical midpoints. These indicate the chromosome numbers in the mouse and rat genomes. What mouse and rat chromosomes are syntenic to the displayed region of the human genome?

  3. Find a human phenotype associated with polymorphisms in this gene
    1. To the right of the gene symbol on the human map, click the OMIM link.
    2. OMIM (Online Mendelian Inheritance in Man) is a database of human genes and disease phenotypes. An OMIM number preceded by a "%" indicates a confirmed phenotype or phenotypic locus for which the molecular basis is not known. A "+" indicates an entry that describes both a gene of known sequence and a phenotype.
    3. Click the OMIM number preceded by a "+".
    4. In the table of contents in the left side bar, click Allelic Variants. Review the description of variant .0001. Keep in mind the biological function of the gene you found on the Gene page. What could be the link between this gene and these phenotypes?
    5. Make a note of the amino acid change, found in square brackets in the title of the variant.
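
The amino acid change noted in the last step can be broken into reference residue, position, and variant residue. The sketch below assumes the label has the form "GENE, XXXnnnYYY" with three-letter residue codes; the label "GENE1, ALA10VAL" and the function name parse_variant are hypothetical and are not taken from the OMIM record.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def parse_variant(label):
    """Return (reference residue, position, variant residue) in one-letter codes.

    Assumes the change is written as XXXnnnYYY at the end of the label.
    """
    match = re.search(r"([A-Za-z]{3})(\d+)([A-Za-z]{3})\s*$", label.strip())
    if match is None:
        raise ValueError(f"unrecognized variant label: {label!r}")
    ref = match.group(1).upper()
    pos = int(match.group(2))
    alt = match.group(3).upper()
    return THREE_TO_ONE[ref], pos, THREE_TO_ONE[alt]


# Hypothetical label, used only to show the format.
print(parse_variant("GENE1, ALA10VAL"))
```

Having the position as an integer makes it easy to compare against the amino acid position reported in the dbSNP GeneView table in the next section.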

  4. Find the associated SNP and the population frequencies of its alleles
    1. Click the dbSNP link to the right of variant .0001.
    2. You should see the summary for rs1051740.
    3. The page displays the allele itself in the top right. Further down the page is the GeneView section. Locate the accession numbers for the mRNA (NM) and protein (NP) sequences in the GeneView table for the reference build of the human genome. What is the position of the amino acid in the NP sequence that is changed? What are the wild-type and mutant residues? Does this match what was displayed in OMIM?
    4. Scroll down to the Population Diversity section. This section displays genotype and allele frequencies from several populations, including European, African American, Asian, and Sub-Saharan African. Click on the population names to learn more about them if you like.
    5. In which kinds of populations is the ancestral allele especially enriched? Which populations contain the highest frequencies of the mutant allele?
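
Allele frequencies such as those in the Population Diversity section follow directly from genotype counts: each homozygote contributes two copies of an allele and each heterozygote contributes one. The sketch below is illustrative only; the population names and genotype counts are made up and are not dbSNP data.

```python
def allele_frequencies(n_ref_hom, n_het, n_alt_hom):
    """Return (reference, alternate) allele frequencies from genotype counts."""
    total_alleles = 2 * (n_ref_hom + n_het + n_alt_hom)
    if total_alleles == 0:
        raise ValueError("no genotypes supplied")
    ref_freq = (2 * n_ref_hom + n_het) / total_alleles
    alt_freq = (2 * n_alt_hom + n_het) / total_alleles
    return ref_freq, alt_freq


# Hypothetical genotype counts: (ref/ref, ref/alt, alt/alt) individuals.
populations = {
    "PopA": (50, 40, 10),
    "PopB": (20, 50, 30),
}

for name, counts in populations.items():
    ref_freq, alt_freq = allele_frequencies(*counts)
    print(f"{name}: ref={ref_freq:.2f} alt={alt_freq:.2f}")
```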

  5. Interpret the SNP in the context of the protein domain and its PSSM
    1. First make sure that you know the residue number of the SNP in the NP sequence (amino acid position in the GeneView table).
    2. In the NCBI Resource Links section (just above Population Diversity), click on the NP accession number under 3D structure mapping. On the next page, the SNP id will be highlighted in red. Make a note of the position of the SNP in the query (Protein Context) and the structure (Structure Neighbor). Check the box to the right of the SNP under the Cn3D column, then click the Selected button under the list.
    3. The position of the SNP will be marked by a small green triangle under the query sequence bar. In what conserved domain (CD) does the SNP occur? Click on the colored bar representing that CD.
    4. Review the text summary at the top of the page. Does this description make sense given what we know about the function of the gene you found in step 1? Does this provide any insight into a possible molecular basis of the phenotype you found in step 3?
    5. Locate the column of the SNP in the multiple alignment using the position of the SNP in the query (second row of the alignment). What amino acids are present at this position in this subset of 10 sequences? Verify that the SNP position in the structure you noted above is the same as the position number of this column in the master sequence (the top sequence, 1QO7_A).
    6. We can investigate the allowed residues at this position by viewing the PSSM for the CD. Open the PSSM Viewer. Under "PSSM to View" enter the accession for the CD in the box. Under "Protein to Align to PSSM" select Accession or GI and enter the accession of the query (NP_000111). Click the Stacked Bar View button. Now enter the position number of the SNP in the query sequence (NP_000111) in the box to the right of the Jump button. Set the consensus menu to query and click Jump.
    7. The SNP position should now be the leftmost column in the display. Green residues have positive scores in the PSSM, and red residues have negative scores. What residues have positive scores? To confirm, click the bar graphic above the column to view the frequencies and scores. What can you conclude about what residues are allowed at this position? What score does the mutant residue in the SNP receive? What does this suggest?
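
To see why some residues score positive and others negative, it can help to compute log-odds scores for a single alignment column. The sketch below is a simplified illustration, assuming a score of log2(observed frequency / background frequency) with a small pseudocount. The four-letter alphabet and all frequencies are made up, and real PSSMs use a 20-residue alphabet with scaled, rounded scores.

```python
import math


def log_odds_column(observed, background, pseudocount=0.01):
    """Score each residue as log2(adjusted observed frequency / background)."""
    # Renormalize so the pseudocount-adjusted frequencies still sum to 1.
    norm = 1.0 + pseudocount * len(background)
    scores = {}
    for residue, bg_freq in background.items():
        adjusted = (observed.get(residue, 0.0) + pseudocount) / norm
        scores[residue] = math.log2(adjusted / bg_freq)
    return scores


# Hypothetical toy alphabet with a uniform background.
background = {"L": 0.25, "I": 0.25, "D": 0.25, "G": 0.25}
# Hypothetical column: only L and I are seen in the aligned sequences.
observed = {"L": 0.6, "I": 0.4}

for residue, score in sorted(log_odds_column(observed, background).items()):
    sign = "positive" if score > 0 else "negative"
    print(f"{residue}: {score:+.2f} ({sign})")
```

Residues seen more often than the background expects score above zero, and residues never seen in the column score well below zero, which is the pattern the green and red residues in the viewer reflect.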

  6. Search for instances of the SNP in the human EST database
    1. We will do this using a translated BLAST search, using the RefSeq protein for the gene as query (NP_000111). First, go to the NCBI Home Page and click on the BLAST link at the top of the page. Click tblastn.
    2. Enter NP_000111 in the search box. Change the Database menu to Expressed sequence tags. Enter human in the Organism box. Click the BLAST button to begin your search.
    3. When your search is finished, click Reformat these Results at the top of the page. In the Alignment View menu, select flat query-anchored with dots for identities. This will create a multiple alignment view with identical residues represented as periods (.). Click the View Report button.
    4. Scroll down to the query-anchored alignment (after the list of hits). Find the SNP position in the alignment. Do you see evidence for the SNP in the EST data? Find the EST we started with (BQ669567) in the alignment. What allele does the EST contain for this SNP?
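
The "dots for identities" view used above can be reproduced in a few lines, which makes it clear what the periods mean. This is a minimal sketch with made-up sequences that are already aligned to the query; it does not reproduce BLAST's own formatter.

```python
def dots_for_identities(query, subject):
    """Replace subject residues identical to the query with '.'."""
    if len(query) != len(subject):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(
        "." if s == q and q != "-" else s
        for q, s in zip(query, subject)
    )


# Hypothetical aligned sequences, not taken from any BLAST result.
query = "MKTAYIAK"
hits = {
    "hit1": "MKTAYIAK",
    "hit2": "MKTAHIAK",
    "hit3": "MKSAYIAR",
}

print(f"query {query}")
for name, seq in hits.items():
    print(f"{name:5} {dots_for_identities(query, seq)}")
```

In this view a substitution stands out as a letter in a column of periods, which is what to look for at the SNP position.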


   II. Exploring Genomes with BLAST
Goals:
  • Search for an annotated chicken gene homologous to human FOXP2
  • Search for chicken protein homologs using BLASTp
  • Search the chicken genome for homologs using genomic BLAST
  • Compare the chicken homologs to the human protein
  1. Search for an annotated chicken gene homologous to human FOXP2
    1. On the NCBI Home Page, choose Gene from the Database menu and type foxp2[sym] AND human[orgn] in the search box. Click Go.
    2. Click the gene symbol to open the record. In the Summary paragraph, the gene product is described as containing a DNA binding domain and a polymer tract of a single amino acid. Which amino acid? Further down on the page in the RefSeq section, you should find descriptions of the four transcript variants for this gene. In which part of the protein do the variants differ?
    3. In the list of links in the upper right, click Homologene. Click on the title to open the record.
    4. In the Genes section at the top, is there a gene for chicken (G. gallus) listed? Click on the name of the gene to load the record.
    5. How many transcript variants are shown for the chicken gene? Scroll down to the Related Sequences section. Is there any support for these transcripts in GenBank?
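
The Gene query typed in step 1 can also be expressed as a request to the NCBI E-utilities ESearch service. The sketch below only builds the URL and sends nothing. The function name build_gene_search_url is hypothetical, and the response format is not shown here.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_search_url(symbol, organism):
    """Build an ESearch URL for a gene symbol limited to one organism."""
    # Same field tags as typed into the search box: [sym] and [orgn].
    term = f"{symbol}[sym] AND {organism}[orgn]"
    return EUTILS_ESEARCH + "?" + urlencode({"db": "gene", "term": term})


url = build_gene_search_url("foxp2", "human")
print(url)
```

Note that the brackets and spaces in the query are percent-encoded in the URL; the search term itself is unchanged.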

  2. Search for chicken protein homologs using BLASTp
    1. Return to the NCBI Home Page and click on the BLAST link at the top of the page. Then click on protein blast (blastp).
    2. Enter NP_055306 (the human FOXP2 protein) in the search box. Choose Non-redundant protein sequences as the database.
    3. Enter chicken in the Organism box. Click the BLAST button to start the search.
    4. When your search is done, scroll down to the table showing the hits. Several of the top hits have very small e-values. Are these proteins FOXP2? Are there other proteins as well? Make a note of which proteins have the highest scores.
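
To get a feel for how scores and e-values relate, the sketch below uses the standard relation E = m × n × 2^(−S'), where S' is the bit score, m is the query length, and n is the database length. The lengths are made up, and BLAST itself uses effective lengths that are slightly shorter, so treat the numbers as rough.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search space: a 700-residue query against 1e9 database residues.
query_length = 700
database_length = 1_000_000_000

for bit_score in (30, 50, 100):
    e = expect_value(bit_score, query_length, database_length)
    print(f"bit score {bit_score:>3}: E = {e:.2e}")
```

For a fixed search space, a higher bit score always gives a lower e-value, and the same bit score gives a larger e-value as the database grows.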

  3. Search the chicken genome for homologs using genomic BLAST
    1. Return to the NCBI Home Page and click on the BLAST link at the top of the page. In the BLAST Assembled Genomes section, click Gallus gallus.
    2. Type NP_055306 in the search box and choose TBLASTN from the Program menu. Click Begin Search.
    3. Click View Report to format your results. When your search is done, click the Genome View button above the graphic. Looking at the table below the genome graphic, which chromosome contains the best hit?
    4. In the table, click on the Score link to sort the results by score. This sorts by score, and not e-value! Click on the accession of the contig with the lowest e-value.
    5. The BLAST hits will be shaded red on the right and will be indicated by small colored bars on the various maps. The best hits are in a cluster near the top of the chromosome. Click on the gray line of any map at the position of the BLAST hits, and then choose Show 1M (1 Megabase) from the menu. The BLAST hits should be a bit more spread out. Zoom in more by clicking on a gray map line in the middle of the cluster of hits, and choose Zoom in x4 from the menu.
    6. The Model, RNA, Gene, and Contig maps are shown. Is there a gene annotated at the positions of the BLAST hits? Are there annotated RNAs? Is there a predicted gene model? How do the BLAST hits compare with any of these annotated items (gene, RNA, model)?
    7. Click on the gene name (on the Gene map) to see a box of detailed information. What gene is this?
    8. Which exons contain the BLAST hits? Are they in the N-terminal or C-terminal region of the protein product?
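
TBLASTN, chosen in step 2, compares a protein query against the nucleotide database translated in all six reading frames, which is why the hits fall on exons rather than on the whole gene. The sketch below shows six-frame translation with the standard genetic code. The DNA string is made up, and the function names are illustrative rather than part of BLAST.

```python
from itertools import product

# Standard genetic code, codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return translations for frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Hypothetical DNA fragment, used only for illustration.
for frame, protein in six_frame_translations("ATGGCCAAATGA").items():
    print(frame, protein)
```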

  4. View evidence supporting the gene annotation found by genomic BLAST
    1. Click on Maps & Options. Using the Add button, add the rnaGga (chicken mRNA) and the ugGga (chicken UniGene) maps. Click Apply to redraw the maps.
    2. What mRNA support is there for this gene annotation? Is there a full-length transcript? Where is there EST support from UniGene?
    3. The transcripts in the RefSeq_RNA map all begin with XM. Can you now explain why?

  5. Examine a second homolog found by genomic BLAST
    1. Scroll up to the top of the page and click the BLAST link at the top.
    2. Click Recent Results and then click the Request ID of your chicken genomic BLAST (it should be the one at the top).
    3. As before, click the Genome View button and the Score link to sort the results by score. Look carefully and click the contig accession with the second lowest e-value (not score!).
    4. Repeat the analysis above for this set of BLAST hits. What gene is hit? What supporting evidence is there in GenBank for this annotation?

  6. Compare the chicken homologs to the human protein
    1. Return to the NCBI Home Page and click the BLAST link at the top of the page. Use Recent Results to retrieve your BLASTp search against chicken nr sequences (it should be the second Request ID in the list).
    2. We will now compare several of the chicken homologs to the human FOXP2 protein. Open a new browser window and navigate to the NCBI Home Page. Click BLAST at the top of the page, then click protein blast.
    3. Enter NP_055306 into the search box. Enter the accessions of at least the top three hits from your blastp search in the Entrez Query box (separate them with spaces). This will limit the database to only these sequences.
    4. Run the search. When the search is done, click Reformat these Results at the top of the page. In the Alignment view menu, choose flat query-anchored with dots for identities. Click View Report.
    5. Where do the sequence differences occur between these proteins? Is this consistent with the results from genomic BLAST? Are there consistent differences between the human and chicken sequences?