Summary - Install Bioconductor, import data, run EdgeR using two different modes
These exercises will follow the protocols described in Anders, S. et al. Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nature protocols 8, 1765-1786 (2013) to analyze the GSE Ensembl data described in Thursday's handout.
We will skip over the mapping of fragments to the genome, and focus on normalization, quality control, and identification of differentially expressed genes with edgeR. This starts at step 14 of the protocol on p. 1777. source("https://bioconductor.org/biocLite.R")
To run these excercises, you will need to have Bioconductor installed on your laptop computer:
## try http:// if https:// URLs are not supported source("https://bioconductor.org/biocLite.R") biocLite() # installs a lot of stuff biocLite("BiocUpgrade") # ignore error messages - you may be up-to-date biocLite(c("edgeR","DESeq")) library(edgeR)
To do these exercises, we will start with files similar to those from last week:
GSE_HTS<-read.table("https://fasta.bioch.virginia.edu/biol4230/labs/data/GSE_HTS.txt",sep="\t", header=T,row.names=1) GSE_groups=c(rep('GM128',3),rep("H1",4),rep("MCF7",3))
keep_cnts<-rowSums(GSE_HTS[,1:3]) > 9 & rowSums(GSE_HTS[,4:7]) > 12 & rowSums(GSE_HTS[,8:10]) > 9How many samples are saved? (Compare length(keep_cnts) to length(keep_cnts[keep_cnts]). Why are the numbers different?
GSE_cpms <- cpm(GSE_HTS) keep_cnts <- keep_cnts & rowSums(GSE_cpms > 1) >= 3 GSE_HTS1 <- GSE_HTS[keep_cnts,]Do a summary on GSE_HTS and GSE_HTS1. How has the first quartile changed? Has the maximum changed? The minimum? How many rows are in GSE_HTS? In GSE_HTS1?
Plot a boxplot of log10(GSE_HTS1) adding a small number for zeros (boxplot(log10(GSE_HTS[keep_cnts,]+0.1)). How many samples have a gene with zero counts? Does this agree with summary(GSE_HTS[keep_cnts,])? Do the medians, quartiles, and maxima match what you saw in the summary(). What is the dynamic range between the median and the maximum numbers of read counts? The difference between the first and third quartiles?
GSE_dge1<-DGEList(counts=GSE_HTS1,lib.size=colSums(GSE_HTS1),group=GSE_groups)Take a look at the contents of GSE_dge1 using str(GSE_dge1) (you could also try summary, but it fails). What do you see in the data structure in addition to the data?
You can look at the individual parts of the GSE_dge1 list using [[]] indexing, e.g.:
summary(GSE_dge1[[1]]) summary(GSE_dge1[[2]])[[]] allows you to look at an individual part of a list.
Or, you can find the names of the different parts of GSE_dge1 with attributes(GSE_dge1). Once you know the names, you can get the contents using $name, e.g.
summary(GSE_dge1$counts)
Do a boxplot(log10(GSE_dge1$counts+0.1)).
GSE_dge1<- calcNormFactors(GSE_dge1) str(GSE_dge1)Look at the GSE_dge1 structure with str(). What happened?
plotMDS(GSE_dge1,col=c("green","red","blue")[factor(GSE_groups)])Are the replicates similar? Do the different cell lines differ?
To look for statistically significant differences in expression, we must first have estimates of the amount of non-biological variation for genes expressed at different levels. This can be calculated using the estimateCommonDisp() and esimateTagwiseDisp() functions. Run these functions, and look to see how the GSE_dge1 structure has changed. Then plot the relationship between expression and variance using plotMeanVar(). What does the solid black line with slope=1 show in the plotMeanVar plot.
GSE_dge1<-estimateCommonDisp(GSE_dge1) str(GSE_dge1) GSE_dge1<-estimateTagwiseDisp(GSE_dge1) plotMeanVar(GSE_dge1,show.tagwise.vars=T,NBline=T)The NB line shows the variance predicted by the "Negative Binomial" distribution. Does the variance in the samples match the NB model? How does the predicted NB variance differ from the solid black line?
Take another look at the contents of DGE_gse1 with attributes() and str(). Do a: boxplot(log10(GSE_dge1$pseudo.counts+0.1)). How does the earlier boxplot and this boxplot differ?
GSE_dge1_GH<-exactTest(GSE_dge1,pair=c("GM128","H1")) # do the statistical tests GSE_tt_GH<-topTags(GSE_dge1_GH,n=nrow(GSE_dge1)) # sort by the most significant differences head(GSE_tt_GH$table,20) # show the top 10 most significantHow many of the 20 most significant changes are larger in H1 than GM128? How abundant (CPM) are the most significant changes?
Confirm that the most significant changes actually have substantial changes in counts:
GSE_HTS1[row.names(head(GSE_tt_GH$table,20)),c(1:3,4:7)]
GSE_rn_05<-row.names(GSE_tt_GH[GSE_tt_GH$table$FDR<0.05,]) GSE_rn_001<-row.names(GSE_tt_GH[GSE_tt_GH$table$FDR<0.001,]) plotSmear(GSE_dge1,de.tags=GSE_rn_05) plotSmear(GSE_dge1,de.tags=GSE_rn_001)
plot(GSE_tt_GH$table[,1], -log10(GSE_tt_GH$table[,4]),xlab="logFC",ylab="-log10(FDR)") points(GSE_tt_GH$table[GSE_rn_05,1],-log10(GSE_tt_GH$table[GSE_rn_05,4]),pch=20,col='red') points(GSE_tt_GH$table[GSE_rn_001,1],-log10(GSE_tt_GH$table[GSE_rn_001,4]),pch=20,col='green')What is the 'x'-axis in these plots?
require(limma) source("https://fasta.bioch.virginia.edu/biol4230/labs/venn_diagram.R")This function is called by:
Venn2(row_names_cond1, row_names_cond2, names=c("Cond1","Cond2"))For example: Venn2(GSE_rn_05, GSE_rn_001, names=c("FDR<0.05", "FDR<0.001")) What should this Venn diagram look like?
In the homework below, use this function to compare the statitically significant gene list for GM128 vs H1 to the list from GM128 vs MCF7.
Due Monday, April 9, at 5 pm.
Are there significant differences that go up in one comparison but down in the other?