ECG2016 - Over Representation Analysis (ORA)

Computational Genomics, Oct 29, 2016

For this workshop, you will use the NCBI GEO2R tool to run a simple differential gene expression analysis on a dataset of your own choice from GEO. Use the "Advanced Search Builder" GEO query function to select:

DataSet Type: "expression profiling by array"
Organism: "human"
Number of Samples: 6-10 (pick one number, 6 or 8 or 10)
A disease, tissue, cell type, etc. of interest to you (e.g. "melanoma") [all fields]

Now, paste the GEO Dataset accession number into GEO2R:

https://www.ncbi.nlm.nih.gov/geo/geo2r/

Use the GEO2R application to set up two experimental groups relevant to the study you chose, and assign each sample to the appropriate group. Click the "Top 250" button to see the table of most differentially expressed genes (rank sorted by statistical significance). Compare the raw P values with the adj. P values and consider whether you found any genes of "interest".

Download the tabulation of all results by clicking "save all results".

Import this tab-separated file into Galaxy, where you will use the Galaxy text transformation tools to cut out columns of interest, remove unwanted header rows, remove unwanted leading/trailing quotes, etc.
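If you want to check what the Galaxy cut/trim steps should produce, the same cleanup can be sketched in a few lines of Python. This is only an illustration, not part of the workshop tooling: the function name clean_geo2r_table is hypothetical, and the column names ("Gene.symbol", "P.Value", "adj.P.Val", "logFC") are an assumption about the GEO2R export, so check them against the header of your own file.

```python
import csv
import io

def clean_geo2r_table(text, keep=("Gene.symbol", "P.Value", "adj.P.Val", "logFC")):
    """Keep only the columns in `keep` and strip surrounding quotes.

    The column names are assumptions about the GEO2R export; adjust them
    to match the header row of the file you downloaded.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t", quotechar='"')
    header = [h.strip().strip('"') for h in next(reader)]
    idx = [header.index(name) for name in keep]
    rows = []
    for fields in reader:
        if len(fields) < len(header):
            continue  # skip blank or truncated lines
        values = [fields[i].strip().strip('"') for i in idx]
        rows.append(dict(zip(keep, values)))
    return rows

# Made-up rows in the assumed layout, for illustration only.
sample = (
    '"ID"\t"adj.P.Val"\t"P.Value"\t"logFC"\t"Gene.symbol"\n'
    '"probe_1"\t"0.02"\t"0.0001"\t"2.5"\t"GENE_A"\n'
    '\n'
    '"probe_2"\t"0.60"\t"0.3000"\t"-0.1"\t"GENE_B"\n'
)
rows = clean_geo2r_table(sample)
for row in rows:
    print(row)
```

The values stay as strings here; convert them with float() before plotting or thresholding.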

Using the Galaxy scatterplot tool, generate a plot of FDRs (adj. P values) vs. raw P values. For bonus points: generate a plot of -log(P value) vs. log(Fold Change) using Galaxy.
What is the relationship between FDR and P values in your results? Do you have any significant DEGs? What is the relationship between the fold change and significance of true DEGs? And of non-DEGs?
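To build intuition for the FDR vs. P value plot, here is a minimal sketch of the Benjamini-Hochberg adjustment, which is assumed here to be the correction behind the adj. P values (check the options of your GEO2R run). It also shows the coordinate transform for the bonus plot, assuming the fold-change column is already on a log scale. The function names and the input P values are made up for illustration.

```python
import math

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest so that the adjusted
    # values never decrease as the raw P values increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def volcano_point(pvalue, log_fold_change):
    """Return (x, y) = (log fold change, -log10 P) for one gene."""
    return (log_fold_change, -math.log10(pvalue))

raw = [0.001, 0.008, 0.039, 0.041, 0.60]  # made-up raw P values
fdr = benjamini_hochberg(raw)
for p, q in zip(raw, fdr):
    print(f"P = {p:<6}  adj. P = {q:.5f}")
```

Every adjusted value is at least as large as its raw P value, and neighbouring genes can share the same adjusted value, which is why the scatterplot tends to show plateaus.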
Cut out the gene symbol column and export out of Galaxy into a file "genes.tab"; upload this file to the GOrilla application at:
http://cbl-gorilla.cs.technion.ac.il/
Run the analysis by choosing "All" ontologies in Step 4, and consider changing the P value threshold to 1e-4 or even lower; click the "show in REViGO" checkbox. This "single list" analysis uses all of the genes, ranked by statistical significance, to look for over-representation regardless of DEG status or FDR. Once you've completed this and answered the questions below, go back and repeat the GOrilla analysis with two separate gene lists: one for DEGs better than some FDR threshold of interest (say 10%), and one with all other genes (the "background" list).
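The two-list split can be sketched as follows. This is a hedged illustration only: split_gene_lists is a hypothetical helper, the (symbol, FDR) pairs are made up, and the choice to keep the best (smallest) FDR when several probes map to the same gene is an assumption, not something GOrilla or GEO2R prescribes.

```python
def split_gene_lists(rows, fdr_threshold=0.10):
    """Split (gene symbol, FDR) pairs into target and background lists.

    Assumption: when several probes map to one gene, the gene is judged
    by its smallest FDR. Probes without a gene symbol are dropped.
    """
    best = {}
    for symbol, fdr in rows:
        if not symbol:
            continue
        best[symbol] = min(fdr, best.get(symbol, 1.0))
    target = sorted(g for g, f in best.items() if f < fdr_threshold)
    background = sorted(g for g, f in best.items() if f >= fdr_threshold)
    return target, background

# Made-up example rows: (gene symbol, adj. P value)
rows = [
    ("GENE_A", 0.02),
    ("GENE_B", 0.60),
    ("", 0.01),        # probe with no gene symbol
    ("GENE_A", 0.50),  # second probe for the same gene
    ("GENE_C", 0.09),
]
target, background = split_gene_lists(rows)
print("target:", target)
print("background:", background)
```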
Here is an example of some differentially expressed genes: ORA_results.tabular

If you have problems getting the gene lists via Galaxy, you can use the file ORA_target.tabular for the target gene set and ORA_background.tabular for your background. Inspect the GO enrichment plots; do you see "crosstalk" between closely related/nested terms? For the most significantly enriched terms, what is the extent (magnitude) of the enrichment? Do the GO term and associated gene list suggest candidate hypotheses to you?
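When judging the magnitude of an enrichment, it can help to compute one by hand. The sketch below uses the standard hypergeometric tail test for over-representation, with enrichment taken as (b/n) / (B/N). The symbols are assumptions for this illustration: N is the total number of genes considered (target plus all others), B the number of those annotated with the GO term, n the size of the target list, and b the number of target genes carrying the term. The counts are made up, and this is not presented as GOrilla's exact procedure.

```python
from math import comb

def enrichment(N, B, n, b):
    """Fold enrichment: fraction of target genes with the term,
    divided by the fraction of all genes with the term."""
    return (b / n) / (B / N)

def hypergeom_tail(N, B, n, b):
    """P(X >= b) when n genes are drawn without replacement from N genes,
    of which B carry the GO term."""
    upper = min(n, B)
    numerator = sum(comb(B, k) * comb(N - B, n - k) for k in range(b, upper + 1))
    return numerator / comb(N, n)

# Made-up counts: 1000 genes in total, 50 annotated with the term,
# 100 genes in the target list, 20 of them annotated.
N, B, n, b = 1000, 50, 100, 20
fold = enrichment(N, B, n, b)
pval = hypergeom_tail(N, B, n, b)
print(f"enrichment = {fold:.2f}, P = {pval:.3g}")
```

Only 5 annotated genes would be expected in the target list by chance here (100 x 50/1000), so 20 observed corresponds to a four-fold enrichment.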
Follow the "Visualize output in REViGO" link to see a different representation of the GOrilla enrichment results; are the enriched terms very different, semantically?
Go to http://www.reactome.org/PathwayBrowser/#/ and click the "Tour:" button in the upper right corner to watch a short video. Then click the "Analysis:" button; upload your target gene list from before to explore connected pathways.
Does Reactome generate the same biological hypotheses as GOrilla? Using your powers for biological insight, can you rationalize the differences?
