Summary - Run 'R', do some basic data manipulations, import a dataset and plot some results.
To run these exercises, you will need to have the 'R' statistics package installed on your laptop computer:
http://www.rstudio.com/products/rstudio/download/
Plotting functions: plot, hist, and boxplot.
Vectors
Create a vector num using the sequence generating operator `:`. Look at the vector.
Create logical vectors (lg1...lg?) by specifying conditions on num using comparison operators: lg1 <- num > 15, lg2 <- num >= 16, lg3 <- num < 12, etc.
Look at the logical vectors (lg1...lg?). Use the logical vectors to extract entries from num[].
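A minimal sketch of logical indexing, for comparison with your own results. The range 1:20 is an assumption made for illustration; the exercise does not fix the length of num.

```r
# Assumption: num holds 1..20; the exercise does not specify the range.
num <- 1:20

# A comparison returns a logical vector with one TRUE/FALSE per element of num.
lg1 <- num > 15
lg3 <- num < 12
print(lg1)

# Indexing with a logical vector keeps only the positions that are TRUE.
print(num[lg1])
print(num[lg3])

# Logical vectors can be combined with & (and), | (or), and ! (not).
print(num[lg1 | lg3])
print(num[!lg1])
```

The same pattern works for any comparison operator (>=, <=, ==, !=).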
Create a character vector char using the concatenate function c(...) with the strings 'R', 'Python', 'Bioconductor', and 'RNA-seq'. Use this vector to create 2 logical vectors, lg7 and lg8, using the comparison operators == 'Python' and != 'R'. Again, look at the vectors and use them to extract the contents of char[].
Create a vector mix1 that contains values with a decimal point and integers using the c(...) function. View the types of all the vectors you have produced using the mode() function, or the str() function.
Create a vector mix2 that contains values with a decimal point, integers and characters with the c(...) function. What type of vector is produced? Again, check by typing mix2 and hitting "enter" on your keyboard as well as using the mode() function.
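The sketch below shows how c(...) coerces mixed inputs to a single type. The values are hypothetical, chosen only for illustration; your own mix1 and mix2 may differ.

```r
# Hypothetical values; any mix of decimals, integers, and strings behaves the same way.
mix1 <- c(1.5, 2L, 3.25, 4L)    # decimals and integers
mix2 <- c(1.5, 2L, "three")     # decimals, integers, and a character string

# A vector holds one type only, so c() converts everything to the most general type.
print(mode(mix1))     # "numeric"
print(typeof(mix1))   # "double": the integers were promoted
print(mode(mix2))     # "character": the numbers became strings
print(mix2)
str(mix2)
```

Once the numbers in mix2 have become strings, arithmetic on them fails until they are converted back, for example with as.numeric().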
Remove elements from mix1 and mix2 using negative indexes together with the `:` operator and the c(...) function.
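A short sketch of negative indexing on a hypothetical vector v (not part of the exercise), showing both the `:` and the c(...) forms.

```r
# Hypothetical vector used only to illustrate negative indexes.
v <- c(10.5, 20, 30, 40.25, 50, 60)

# A negative index drops that position instead of selecting it.
print(v[-1])

# Drop a range: the parentheses matter, because -2:4 means -2, -1, 0, ..., 4.
print(v[-(2:4)])

# Drop an arbitrary set of positions with c(...).
print(v[-c(1, 6)])
```

Negative indexes return a copy; v itself is unchanged unless you assign the result back.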
Do some arithmetic with num: num/num, num*num, num**2, num + num, 2*num and num - num. Try doing some arithmetic on the mix1 vector.
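Vector arithmetic in R works element by element. A minimal sketch, assuming a short num (1:5 is chosen here only to keep the output small):

```r
# Assumption: a short vector, so the results are easy to check by eye.
num <- 1:5

print(num * num)   # element-wise product, not a dot product
print(num ** 2)    # ** is accepted as a synonym for ^
print(num / num)   # every element divided by itself
print(2 * num)     # the scalar 2 is recycled across all elements
print(num - num)
```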
Create a matrix mat from num using the matrix() function and filling in the values by row first.
mat <- matrix(num,ncol=5,byrow=T)
Look at the matrix, look at str(mat), and dim(mat).
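To see what byrow changes, the sketch below builds the matrix both ways. It assumes num is 1:20, which the exercise does not state.

```r
# Assumption: num is 1:20, giving a 4 x 5 matrix when ncol = 5.
num <- 1:20

by_row <- matrix(num, ncol = 5, byrow = TRUE)  # fill across each row first
by_col <- matrix(num, ncol = 5)                # default: fill down each column first

print(by_row)
print(by_col)
print(dim(by_row))  # number of rows, then number of columns
```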
mat
.
Extract the full first row.
Extract the full fourth column of mat.
Extract all rows and the 4th and 5th columns of mat using the `:` operator and the c() command.
Create a logical vector lgm by checking to see which elements in the first row of mat are <= 14. Use lgm to extract the columns of mat whose first-row entries are <= 14.
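A sketch of row, column, and logical indexing on a matrix. It assumes num is 1:20 and uses a hypothetical threshold of 3 instead of 14, so that only some columns are selected.

```r
# Assumptions: num is 1:20, and the threshold 3 is used purely for illustration.
num <- 1:20
mat <- matrix(num, ncol = 5, byrow = TRUE)

print(mat[1, ])        # full first row: leave the column index empty
print(mat[, 4])        # full fourth column: leave the row index empty
print(mat[, 4:5])      # all rows, columns 4 and 5
print(mat[, c(4, 5)])  # the same selection written with c()

# Test the first row, then use the result to pick columns.
lgm <- mat[1, ] <= 3
print(lgm)
print(mat[, lgm])
```

The logical vector goes after the comma because it selects columns; before the comma it would select rows.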
Do some arithmetic with mat: mat/mat, mat*mat, mat**2, mat + mat, 2*mat and mat - mat.
Create a list ExpList with three components: ExpLevel (3 numeric elements), Exp (3 logical elements with at least one TRUE) and GeneName (3 character elements).
ExpList <- list(ExpLevel=c(1,2,3), Exp=c(F,T,T), GeneName=c("p53","cMyc", "GSTM1"))
View ExpList.
Extract components using the $ operator and single brackets, [], after ExpList. Do you notice any differences in the outputs?
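One way to compare the two forms is to check the class of what each returns. The sketch below also includes double brackets, [[ ]], which are not mentioned in the exercise and are shown here only for contrast.

```r
ExpList <- list(ExpLevel = c(1, 2, 3),
                Exp      = c(FALSE, TRUE, TRUE),
                GeneName = c("p53", "cMyc", "GSTM1"))

by_dollar <- ExpList$GeneName       # the component itself
by_single <- ExpList["GeneName"]    # a list that contains the component
by_double <- ExpList[["GeneName"]]  # the component itself, like $

print(class(by_dollar))
print(class(by_single))
print(identical(by_dollar, by_double))
```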
ExpList
, []
.
Create a character vector ids, e.g. "Gene1", "Gene2", "Gene3".
Type help(as.data.frame). Read the help page.
Use as.data.frame on the list ExpList to generate a data frame named ExpData with row names ids (setting stringsAsFactors=F). View ExpData.
Extract elements from ExpData using indexes. Use the $ operator to extract the Exp column.
Extract the rows that are TRUE in the Exp column.
Examine ExpData using the dim() and mode() functions.
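The data frame steps above might look like the following sketch. It reuses the ExpList values given earlier and the example ids; treat it as one possible approach rather than the expected answer.

```r
ExpList <- list(ExpLevel = c(1, 2, 3),
                Exp      = c(FALSE, TRUE, TRUE),
                GeneName = c("p53", "cMyc", "GSTM1"))
ids <- c("Gene1", "Gene2", "Gene3")

# Each list component becomes a column; ids label the rows.
ExpData <- as.data.frame(ExpList, row.names = ids, stringsAsFactors = FALSE)
print(ExpData)

print(ExpData[2, "GeneName"])  # one cell, by row and column index
print(ExpData$Exp)             # one column, as a vector
print(ExpData[ExpData$Exp, ])  # rows where Exp is TRUE

print(dim(ExpData))   # rows, then columns
print(mode(ExpData))  # a data frame is stored as a list of columns
```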
x <- seq(0, 10, by=0.1)
y <- x + rnorm(length(x), mean=0, sd=0.1)
plot(x,y,xlim=c(0,10), ylim=c(-2,12),pch=18, col='red')
lines(x,x,col='blue')
z <- x + rnorm(length(x), mean=0, sd=0.5)
points(x,z,pch=18, col='green')
rn <- rnorm(1000, mean=2.0, sd=0.5)
hist(rn)
Try a boxplot of the same data. What are the values of each of the horizontal lines in the boxplot? (e.g. the center, top, and bottom of the box? the "whiskers" between the dashed vertical lines and the circles?)
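One way to read off the boxplot's horizontal lines numerically is boxplot.stats(), which returns the same five values that boxplot() draws. In this sketch the fixed seed is an assumption added only for reproducibility.

```r
# Assumption: a fixed seed, so that repeated runs give the same numbers.
set.seed(1)
rn <- rnorm(1000, mean = 2.0, sd = 0.5)

bs <- boxplot.stats(rn)

# stats: lower whisker, lower hinge (box bottom), median, upper hinge (box top), upper whisker
print(bs$stats)

# The hinges and median come from fivenum(); the whiskers stop at the most
# extreme data point within 1.5 box lengths of the box.
print(fivenum(rn))

# Points beyond the whiskers are drawn as circles.
print(length(bs$out))
```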
Copy interactive.hpc.virginia.edu:/apps/teaching/bioinfo4230/data/rna-seq/GSE_FPKM.tab to your laptop.
rna <- read.table(url("http://fasta.bioch.virginia.edu/biol4230/labs/data/GSE_FPKM.tab"), sep="\t", header=T, row.names=NULL)
(row.names=NULL is required to fix the "duplicate row names" problem)
MCF7.gt10 <- rna$MCF.7_Rep1 > 10 & rna$MCF.7_Rep2 > 10 & rna$MCF.7_Rep3 > 10
What are the distributions of counts for all the expression data? (Try summary(rna[MCF7.gt10,]); note that the logical vector selects genes (rows), so it must come before the comma in [,].)
MCF7.ave <- apply(GSE_FPKM[MCF7.gt10,2:4],1,mean)
Here, the 1 tells apply to apply the mean() function across each row (if you loaded the data with read.table() above, use rna in place of GSE_FPKM). You must do the same thing for var().
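The row filter and the row-wise apply() can be tried on a small made-up table first. The data frame toy and its column names below are hypothetical stand-ins, not the real FPKM data.

```r
# Hypothetical toy table: one name column followed by three replicate columns.
toy <- data.frame(gene   = c("g1", "g2", "g3"),
                  A_Rep1 = c(12,  3, 50),
                  A_Rep2 = c(15, 20, 40),
                  A_Rep3 = c(11, 30, 60))

# Keep genes (rows) whose value exceeds 10 in every replicate.
keep <- toy$A_Rep1 > 10 & toy$A_Rep2 > 10 & toy$A_Rep3 > 10

# MARGIN = 1 applies the function to each row; MARGIN = 2 would use each column.
toy.ave <- apply(toy[keep, 2:4], 1, mean)
toy.var <- apply(toy[keep, 2:4], 1, var)
print(toy.ave)
print(toy.var)

# rowMeans() gives the same averages.
print(rowMeans(toy[keep, 2:4]))
```

Only the numeric columns (2:4) are passed to apply(); including the name column would turn every value into a string.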
Put the answers and 'R' scripts in a new directory: ~/biol4230/hwk7/.
Identify the 5 most highly expressed genes (on average) for each of the three sets of experimental replicates in GSE_FPKM (MCF.7, GM12892, and H1.hESC).
Due Wednesday, April 4 at 5 PM.