Biol4230 - The 'R' programming environment

Summary - Run 'R', do some basic data manipulations, import a dataset and plot some results.

To run these exercises, you will need to have the 'R' statistics package installed on your laptop computer:

http://www.rstudio.com/products/rstudio/download/

In this lab, we will be exploring how to use R. We will work on generating and accessing elements/components from R objects including vectors, matrices, lists, factors, data frames and functions (both built in and user defined). We will also explore R's basic graphics utilities including plot, hist, and boxplot.

Vectors

Create a numerical vector named num containing all the integers from 11 to 20, using the sequence-generating operator :. Look at the vector.
Use this vector to generate some logical vectors (lg1...lg?) by specifying conditions with comparison operators: lg1 <- num > 15, lg2 <- num >= 16, lg3 <- num < 12, etc. Look at the logical vectors (lg1...lg?). Use the logical vectors to extract entries from num, e.g. num[lg1].
Generate a character vector named char using the concatenate function c(...) with the strings 'R', 'Python', 'Bioconductor', and 'RNA-seq'. Use this vector to create 2 logical vectors, lg7 and lg8, using the comparisons lg7 <- char == 'Python' and lg8 <- char != 'R'. Again, look at the vectors and use them to extract the contents of char, e.g. char[lg7].
Create a mixed vector named mix1 that contains values with a decimal point and integers using the c(...) function. View the types of all the vectors you have produced using the mode() function, or the str() function.
Create a mixed vector named mix2 that contains values with a decimal point, integers and characters with the c(...) function. What type of vector is produced? Again, check by typing mix2 and hitting "enter" on your keyboard as well as using the mode() function.
Extract subsets of mix1 and mix2 using negative indexes together with the : operator and the c(...) function.
Perform the following mathematical operations on num: num/num, num*num, num**2, num + num, 2*num and num - num. Try doing some arithmetic on the mix1 vector.
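
As a rough sketch of the kind of commands these steps call for (the values placed in mix1 and mix2 are arbitrary choices for illustration, not part of the exercise):

```r
# Integers 11..20 with the sequence operator
num <- 11:20

# Comparison operators return logical vectors of the same length as num
lg1 <- num > 15
lg2 <- num >= 16

# A logical vector used as an index keeps the positions that are TRUE
num[lg1]

# Character vector and logical tests on it
char <- c("R", "Python", "Bioconductor", "RNA-seq")
lg7 <- char == "Python"
lg8 <- char != "R"
char[lg8]

# c() coerces its arguments to a single common type
mix1 <- c(1.5, 2, 3L)        # illustrative values; result is numeric
mix2 <- c(1.5, 2L, "three")  # illustrative values; result is character
mode(mix1)
mode(mix2)

# Negative indexes drop elements instead of selecting them
mix1[-1]
num[-(1:3)]
num[-c(1, 10)]
```

Note the parentheses in -(1:3): without them, -1:3 would be read as the sequence from -1 to 3.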

Matrices

Create a 5 column matrix named mat from num using the matrix() function and filling in the values by row first.
```
mat <- matrix(num, ncol=5, byrow=TRUE)
```
Look at the matrix, at str(mat), and at dim(mat).
Extract the element in the second row and third column of mat.
Extract the full first row.
Extract the full fourth column of mat.
Extract all rows and the 4th and 5th columns of mat using the : operator and c() command.
Create a logical vector lgm by checking to see which elements in the first row of mat are <= 14. Use lgm to extract the columns of mat whose first-row value is <= 14.
Perform the following mathematical operations on mat: mat/mat, mat*mat, mat**2, mat + mat, 2*mat and mat - mat.
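
The indexing steps above might look roughly like the following. This is a minimal sketch that assumes num holds the integers 11 to 20 from the Vectors section, so mat has 2 rows and 5 columns:

```r
num <- 11:20
mat <- matrix(num, ncol = 5, byrow = TRUE)  # 2 rows x 5 columns

mat[2, 3]        # single element: row 2, column 3
mat[1, ]         # whole first row (an empty column index means "all")
mat[, 4]         # whole fourth column
mat[, 4:5]       # all rows, columns 4 and 5
mat[, c(4, 5)]   # same selection written with c()

# Logical vector over the columns, built from the first row
lgm <- mat[1, ] <= 14
mat[, lgm]       # keeps the columns where the first-row value is <= 14

# Arithmetic on matrices is element-wise, not matrix algebra
mat * mat
```

The row index always goes before the comma and the column index after it, so a logical vector built from a row (one entry per column) belongs in the column position.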

Lists and data.frames

Generate a list named ExpList with three components: ExpLevel (3 numeric elements), Exp (3 logical elements with at least one TRUE) and GeneName (3 character elements).
```
ExpList <- list(ExpLevel=c(1,2,3), Exp=c(FALSE,TRUE,TRUE), GeneName=c("p53","cMyc","GSTM1"))
```
View ExpList
Extract the GeneName component using the $ operator and single brackets, [], after ExpList. Do you notice any differences in the outputs?
Extract the third element of the GeneName component. Extract the ExpLevel and GeneName components in one view using single brackets after ExpList, [].
Generate a character vector of length 3 named ids, e.g., "Gene1", "Gene2", "Gene3".
Type help(as.data.frame). Read the help page.
Use the function as.data.frame on the list ExpList to generate a data frame named ExpData with row names ids (setting stringsAsFactors=F). View ExpData.
Extract the first row and then the third column (two separate operations) of ExpData using indexes. Use the $ operator to extract the Exp column.
Extract the rows that are TRUE in the Exp column.
Check the attributes of ExpData using the dim() and mode() functions.
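
One possible way to work through the list and data frame steps is sketched below. It reuses the ExpList definition given above; passing row.names directly to as.data.frame() is one option among several (assigning rownames(ExpData) afterwards would also work):

```r
ExpList <- list(ExpLevel = c(1, 2, 3),
                Exp      = c(FALSE, TRUE, TRUE),
                GeneName = c("p53", "cMyc", "GSTM1"))

ExpList$GeneName                    # the component itself: a character vector
ExpList["GeneName"]                 # single brackets: a list of length 1
ExpList$GeneName[3]                 # third element of the component
ExpList[c("ExpLevel", "GeneName")]  # two components, still a list

ids <- c("Gene1", "Gene2", "Gene3")
ExpData <- as.data.frame(ExpList, row.names = ids, stringsAsFactors = FALSE)

ExpData[1, ]             # first row
ExpData[, 3]             # third column
ExpData$Exp              # Exp column by name
ExpData[ExpData$Exp, ]   # rows where Exp is TRUE
```

As with matrices, the logical vector ExpData$Exp has one entry per row, so it goes before the comma.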

Plotting data

Make a small test dataset of sequential x-values and y-values where y = x + random noise:
```
x <- seq(0, 10, by=0.1)
y <- x + rnorm(length(x), mean=0, sd=0.1)
```

Plot x vs y, and draw a line through points:
```
plot(x,y,xlim=c(0,10), ylim=c(-2,12),pch=18, col='red')
lines(x,x,col='blue')
```

Add another set of data to it with similar properties and plot it with different colors:
```
z <- x + rnorm(length(x), mean=0, sd=0.5)
points(x,z,pch=18, col='green')
```
Generate 1000 normally distributed datapoints and plot their histogram:
```
rn <- rnorm(1000, mean=2.0, sd=0.5)
hist(rn)
```
Try a boxplot of the same data. What are the values of each of the horizontal lines in the boxplot? (e.g., the center, top, and bottom of the box? the "whiskers" between the dashed vertical lines and the circles?)
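
To check a reading of the plot against actual numbers, the values that boxplot() draws can be inspected with boxplot.stats(), which computes them without drawing anything. This is a small sketch; the seed is an arbitrary choice so the run is repeatable, and bs is just an illustrative variable name:

```r
set.seed(1)  # arbitrary seed, only so that the sketch is repeatable
rn <- rnorm(1000, mean = 2.0, sd = 0.5)

bs <- boxplot.stats(rn)  # the numbers behind boxplot(rn)
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # points lying beyond the whiskers, drawn individually

fivenum(rn)  # min, lower hinge, median, upper hinge, max, for comparison
```

Comparing bs$stats with fivenum(rn) shows which lines of the plot depend on the extreme values and which do not.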

Reading in data

Transfer the file: interactive.hpc.virginia.edu:/apps/teaching/bioinfo4230/data/rna-seq/GSE_FPKM.tab to your laptop.

Read in the file to a data.frame using:
```
rna <- read.table(url("http://fasta.bioch.virginia.edu/biol4230/labs/data/GSE_FPKM.tab"), sep="\t", header=TRUE, row.names=NULL)
```
(row.names=NULL is required to fix the "duplicate row names" problem)

Summarize the data file using str and/or summary. How many columns are there? How many have expression data? How many rows are there?
Make a logical vector that specifies the rows (genes) that have more than 10 counts (FPKM) in MCF.7_Rep1, Rep2, and Rep3.
```
MCF7.gt10 <- rna$MCF.7_Rep1 > 10 & rna$MCF.7_Rep2 > 10 & rna$MCF.7_Rep3 > 10
```
What are the distributions of counts for all the expression data? Use summary(rna[MCF7.gt10,]); note that the logical vector selects genes (rows), so it must come before the comma in [,].
How many genes are selected with these criteria? (dim(rna[MCF7.gt10,]))
Make two new vectors, MCF7.ave and MCF7.var, which hold the average and the variance of the three replicates for the rows with >10 counts in each replicate. To do this, you MUST use the apply() function across the rows:
```
MCF7.ave <- apply(rna[MCF7.gt10,2:4],1,mean)
```
Here, the 1 tells apply() to apply the mean() function across each row. You must do the same thing for var().
Plot the log2() of the variance (y-axis) against the log10() of the average (x-axis).
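
The row-wise use of apply() can be tried out on a small made-up table before running it on the real data. In the sketch below, the data frame toy and all of its values are invented for illustration; only the three MCF.7 column names are taken from the lab, and toy, reps, gt10, toy.ave and toy.var are hypothetical names:

```r
# Toy stand-in for the rna data frame; every value here is invented
toy <- data.frame(gene       = c("g1", "g2", "g3", "g4"),
                  MCF.7_Rep1 = c(12, 3, 40, 100),
                  MCF.7_Rep2 = c(15, 20, 38, 90),
                  MCF.7_Rep3 = c(11, 25, 45, 110))

reps <- c("MCF.7_Rep1", "MCF.7_Rep2", "MCF.7_Rep3")

# Rows where every replicate exceeds 10
gt10 <- toy$MCF.7_Rep1 > 10 & toy$MCF.7_Rep2 > 10 & toy$MCF.7_Rep3 > 10

# MARGIN = 1 runs the function once per row; MARGIN = 2 would be per column
toy.ave <- apply(toy[gt10, reps], 1, mean)
toy.var <- apply(toy[gt10, reps], 1, var)

# Log-transformed values, ready to hand to plot(lx, ly)
lx <- log10(toy.ave)
ly <- log2(toy.var)
```

Selecting the replicate columns by name rather than by position avoids depending on the column order of the input file.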

Homework

Put the answers and 'R' scripts in a new directory: ~/biol4230/hwk7/.

In a file called answers, copy and paste examples of your lab-work from questions Matrix:3; data.frames:6, 8. Answer the question in Plotting: 4.
Answer the questions in "Reading in data" (3, 4) and copy and paste a part of the MCF7.ave and MCF7.var vectors in answers.
Provide a file with the set of 'R' commands (and comments) necessary to answer questions 3 and 4. The file should also include the commands necessary to read the data, and select the subset of genes used for each step.
Identify the 5 most highly expressed genes (on average) for each of the three sets of experimental replicates in GSE_FPKM (MCF.7, GM12892, and H1.hESC).
Present the last plot (log(var) vs log(ave)) for the GSE_FPKM.tab data for the four replicate H1.hESC experiments, limiting it to genes that have at least 1 FPKM in each of the replicate experiments. Upload the plot to the collab assignment page.

Due Wednesday, April 4 at 5 PM.
