We will be using a handful of Bioconductor packages. These are installed using the function biocLite which you can source from the web:
source("http://www.bioconductor.org/biocLite.R")
biocLite()
or you can run the bioc_install in the rafalib package.
library(rafalib)
bioc_install()
Download and install the Bioconductor package SpikeInSubset
library(BiocInstaller)
biocLite("SpikeInSubset")
and then load the library and the mas133 data:
library(SpikeInSubset)
data(mas133)
Now make the following plot of the first two samples and compute the correlation:
e <- exprs(mas133)
plot(e[,1], e[,2], main = paste0("corr=", signif(cor(e[,1], e[,2]), 3)), cex = 0.5)
k <- 3000
b <- 1000
polygon(c(-b,k,k,-b), c(-b,-b,k,k), col = "red", density=0, border = "red")
What proportion of the points are inside the box?
sample1 <- e[, 1] < k
sample2 <- e[, 2] < k
sum(sample1 & sample2) / nrow(e)
## [1] 0.9475336
Now make the sample plot with log:
plot(log2(e[,1]), log2(e[,2]))
k <- log2(3000)
b <- log2(0.5)
polygon(c(b,k,k,b), c(b,b,k,k), col = "red", density = 0, border = "red")
What is an advantage of taking the log?
Make an MA-plot:
e <- log2(exprs(mas133))
plot((e[,1]+e[,2])/2, e[,2]-e[,1], cex = 0.5)
The two samples we are plotting are replicates (they are random samples from the same batch of RNA). The correlation of the data was 0.997 in the original scale and 0.96 in the log-scale. High correlations are sometimes confused with evidence of replication. However, replication implies we get very small differences between the observations, which is better measured with distance or differences.
What is the standard deviation of the log ratios for this comparison?
y <- e[,2] - e[,1]
sd(y)
## [1] 0.7767887
How many fold changes above 2 do we see?
table(abs(y) > 2)
##
## FALSE TRUE
## 21429 871