Bioconductor, EdgeR, and Gene Expression

Summary - Install Bioconductor, import data, run EdgeR using two different modes

These exercises will follow the protocols described in Anders, S. et al. Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nature protocols 8, 1765-1786 (2013) to analyze the GSE Ensembl data described in Thursday's handout.

We will skip over the mapping of fragments to the genome, and focus on normalization, quality control, and identification of differentially expressed genes with edgeR. This starts at step 14 of the protocol on p. 1777. source("https://bioconductor.org/biocLite.R")

To run these excercises, you will need to have Bioconductor installed on your laptop computer:

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite()   # installs a lot of stuff
biocLite("BiocUpgrade")  # ignore error messages - you may be up-to-date
biocLite(c("edgeR","DESeq"))
library(edgeR)

To do these exercises, we will start with files similar to those from last week:

  1. Transfer the file: interactive.hpc.virginia.edu:/nv/md_rdlib2/biol4230/data/rna-seq/GSE_HTS.txt to your laptop.

  2. Read in the file to a data.frame using:
    GSE_HTS<-read.table(file="GSE_HTS.txt",sep="\t", header=T,row.names=1)
    GSE_groups=c(rep('GM128',3),rep("H1",4),rep("MCF7",3))
    

  3. Summarize the GSE_HTS data.frame using str and summary. How many columns are there? How many rows are there? Do you see the same differences in maximum abundance I presented in class?

  4. Make a logical vector that specifies the rows (genes) that have more than 9 counts in the two samples with 3 replicates, and more than 12 counts in the sample with four replicates.
    keep_cnts<-rowSums(GSE_HTS[,1:3]) > 9 & rowSums(GSE_HTS[,4:7]) > 12 & rowSums(GSE_HTS[,8:10]) > 9
    
    How many samples are saved? (Compare length(keep_cnts) to length(keep_cnts[keep_cnts]). Why are the numbers different?

  5. EdgeR works best when genes with less than 1 read per million in n of the samples, where n is the size of the smallest group of replicates.
    GSE_cpms <- cpm(GSE_HTS)
    keep_cnts <- keep_cnts  & rowSums(GSE_cpms > 1) >= 3
    GSE_HTS1 <- GSE_HTS[keep_cnts,]
    
    Do a summary on GSE_HTS and GSE_HTS1. How has the first quartile changed? Has the maximum changed? The minimum? How many rows are in GSE_HTS? In GSE_HTS1?

    Plot a boxplot of log10(GSE_HTS1) (boxplot(log10(GSE_HTS[keep_cnts,])). Why are there so many errors? Do the medians, quartiles, and maxima match what you saw in the summary(). What is the dynamic range between the median and the maximum numbers of read counts? The difference between the first and third quartiles?

  6. To use the edgeR functions, we must put our data into an edgeR DGEList data structure.
    GSE_dge1<-DGEList(counts=GSE_HTS1,lib.size=colSums(GSE_HTS1),group=GSE_groups)
    
    Take a look at the contents of GSE_dge1 using str(GSE_dge1) (you could also try summary, but it fails. What do you see in the data structure in addition to the data?

    You can look at the individual parts of the GSE_dge1 list using [[]] indexing, e.g.:

    summary(GSE_dge1[[1]])
    summary(GSE_dge1[[2]])
    
    [[]] allows you to look at an individual part of a list.

  7. So that we can compare the datasets, we first need to normalize the data from the different samples:
    GSE_dge1<- calcNormFactors(GSE_dge1)
    str(GSE_dge1)
    
    Look at the GSE_dge1 structure with str(). What happened?

  8. The edgeR plotMDS() function plots samples on a two-dimensional scatterplot so that distances on the plot approximate the typical log2 fold changes between the samples (like a PCA, but with only two dimensions). Make a plotMDS plot of GSE_dge1.
    plotMDS(GSE_dge1,col=c("green","red","blue")[factor(GSE_groups)])
    
    Are the replicates similar? Do the different cell lines differ?

    To look for statistically significant differences in expression, we must first have estimates of the amount of non-biological variation for genes expressed at different levels. This can be calculated using the estimateCommonDisp() and esimateTagwiseDisp() functions. Run these functions, and look to see how the GSE_dge1 structure has changed. Then plot the relationship between expression and variance using plotMeanVar(). What does the solid black line with slope=1 show in the plotMeanVar plot.

    GSE_dge1<-estimateCommonDisp(GSE_dge1)
    str(GSE_dge1)
    GSE_dge1<-estimateTagwiseDisp(GSE_dge1)
    plotMeanVar(GSE_dge1,show.tagwise.vars=T,NBline=T)
    
    The NB line shows the variance predicted by the "Negative Binomial" distribution. Does the variance in the samples match the NB model? How does the predicted NB variance differ from the solid black line?

  9. Now that we have normalized the counts, and looked at the properties of the variance, we can compare gene expression under two different conditions, and look for the most significant differences.
    GSE_dge1_GH<-exactTest(GSE_dge1,pair=c("GM128","H1"))   # do the statistical tests
    GSE_tt_GH<-topTags(GSE_dge1_GH,n=nrow(GSE_dge1))          # sort by the most significant differences
    head(GSE_tt_GH$table,20)                                  # show the top 10 most significant
    
    How many of the 20 most significant changes are larger in H1 than GM128? How abundant (CPM) are the most significant changes?

    Confirm that the most significant changes actually have substantial changes in counts:

    GSE_HTS1[row.names(head(GSE_tt_GH$table,20)),c(1:3,4:7)]
    

  10. Extract the names of the differentially expressed genes with FDR < 0.05? How many are there In that set, how many do you think are false positives? How many genes have FDR < 0.001? Plot the fold-change vs abundance for FDR < 0.05 and FDR < 0.001.
    GSE_rn_05<-row.names(GSE_tt_GH[GSE_tt_GH$table$FDR<0.05,])
    GSE_rn_001<-row.names(GSE_tt_GH[GSE_tt_GH$table$FDR<0.001,])
    plotSmear(GSE_dge1,de.tags=GSE_rn_05)
    plotSmear(GSE_dge1,de.tags=GSE_rn_001)
    

  11. Finally, do some volcano plots to show the relationship between FDR and abundance.
    plot(GSE_tt_GH$table[,1], -log10(GSE_tt_GH$table[,4]),xlab="logFC",ylab="-log10(FDR)")
    points(GSE_tt_GH$table[GSE_rn_05,1],-log10(GSE_tt_GH$table[GSE_rn_05,4]),pch=20,col='red')
    points(GSE_tt_GH$table[GSE_rn_001,1],-log10(GSE_tt_GH$table[GSE_rn_001,4]),pch=20,col='green')
    

The following code plots (copied from research.stowers-institute.org/efg/R/Math/VennDiagram.htm) a 2-way Venn diagram from two lists of gene names (e.g. a list from row.names(GSE_tt_GH$table$FDR < 0.01) and row.names(GSE_tt_MH$table$FDR < 0.01)).
require(limma)
Venn2 <- function(set1, set2, names)
{
  stopifnot( length(names) == 2)
 
  # Form universe as union of all three sets
  universe <- sort( unique( c(set1, set2) ) )
 
  Counts <- matrix(0, nrow=length(universe), ncol=2)
  colnames(Counts) <- names
 
  for (i in 1:length(universe))
  {
V    Counts[i,1] <- universe[i] %in% set1
    Counts[i,2] <- universe[i] %in% set2
  }
 
  vennDiagram( vennCounts(Counts) )
}
This function is called by:
Venn2(row_names_cond1, row_names_cond2, names=c("Cond1","Cond2"))

In the homework below, use this function to compare the statitically significant gene list for GM128 vs H1 to the list from GM128 vs MCF7.


Homework

Due Monday, April 11, at 5 pm.

  1. In a file called hwk8/answers.txt, answer the questions in parts 4, 7, and 8 above.

  2. Do the same differential expression analysis, but with GM128 vs MCF7. List the 20 most significant differentially expressed genes in that set, with logFC and FDR.

  3. Use the Venn2 function above to compare the statistically significant changes in GM128 vs H1 to GM128 vs MCF7. Draw the diagrams using two different FDR thresholds.

    Are there significant differences that go up in one comparison but down in the other?


Course home page