Exercises

To explain background we will use Affymetrix microarray data. Several approaches to modeling data from these arrays have been proposed and here we explore data from the spike-in experiment which provides many insights.

We will need the following libraries:

library(rafalib)
library(affy)
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unlist, unsplit
## 
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
library(SpikeIn)
library(SpikeInSubset)
library(hgu133atagcdf)
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'

The SpikeIn package includes raw probe-level data for two spike-in experiments, SpikeIn133 and SpikeIn95. The SpikeInSubset includes a six sample subset of this experiment along with processed data with different the MAS5 and RMA algorithms:

data(package="SpikeInSubset")

Note that the spike-in experimental design is described in the phenoData slot. If you type:

data(SpikeIn133)
head(pData( SpikeIn133) )[1]
##                                          203508_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1     0.000
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1     0.125
## 12_13_02_U133A_Mer_Latin_Square_Expt3_R1     0.250
## 12_13_02_U133A_Mer_Latin_Square_Expt4_R1     0.500
## 12_13_02_U133A_Mer_Latin_Square_Expt5_R1     1.000
## 12_13_02_U133A_Mer_Latin_Square_Expt6_R1     2.000
dim(pData(SpikeIn133))
## [1] 42 42

We can see the the amounts that “203508_at” was spiked in at by looking the column associated with that gene:

pData(SpikeIn133) [ ,"203508_at"]
##  [1]   0.000   0.125   0.250   0.500   1.000   2.000   4.000   8.000
##  [9]  16.000  32.000  64.000 128.000 256.000 512.000   0.000   0.125
## [17]   0.250   0.500   1.000   2.000   4.000   8.000  16.000  32.000
## [25]  64.000 128.000 256.000 512.000   0.000   0.125   0.250   0.500
## [33]   1.000   2.000   4.000   8.000  16.000  32.000  64.000 128.000
## [41] 256.000 512.000

From inspecting pData(SpikeIn133) we can see that concentration series of 14 concentrations repeats itself three times. It is important to remember that in the raw probe-level data, there are several features for each of these genes.

Now we are ready to look at some data. Here is an example showing data for two arrays with the spiked-in genes highlighted. This data has already been processed from the probe-level to gene-level data using an approach developed by the manufacturer.

library(SpikeInSubset)
data(mas133)
e=exprs(mas133)##get expression
A=(log2(e[,4])+log2(e[,1]))/2
M=log2(e[,4]/e[,1])

##find the genes that were spiked in to have 
##fold changes of 2
siNames=colnames(pData(mas133))
siNames=siNames[pData(mas133)[4,]/pData(mas133)[1,]==2]
spikeinIndex=match(siNames,rownames(e))

mypar(1,1)
splot(A,M,ylim=c(-4,4),cex=0.5)
abline(h=c(-1,1),col=1,lwd=2,lty=2)
points(A[spikeinIndex],M[spikeinIndex],bg=2,pch=21)

Note that distinguishing the spiked-in genes from the highly variable (noisy) data seems impossible. So why so much noise? It turns out that modeling can help us explain this.

We can look at probe level data to see this. Here we show data from just one gene: “203508_at”. We have 11 probes trying to measure the amount of expression for gene, but we see quite different values.

data(SpikeIn133)
pd=pData(SpikeIn133)[1:14,] ##pick the first 14, rest are reps
pns=probeNames(SpikeIn133)
pms=pm(SpikeIn133)[,1:14] ##pick the first 14, rest are reps
ind=which(pns==colnames(pd)[1]) ##probes in gene 1
concentration=pd[,1]
concentration[concentration==0]= 1/16
mypar(1,1)
matplot(log2(concentration),t(log2(pms[ind,])),xlab="log (base 2) concentration",ylab="log (base 2) instensity")

Question 3.7.1

From the analysis above, we see that for some of the samples we know that the intended concentration was 0 which implies the signal we observe is background. For the gene “203508_at”, the intended concentrations can be seen in the first column of pData. We see that in three arrays this gene was not spiked-in:

Note! you need to run the following block of code, because now we want to include the replicate experiments for the 14 experimental concentrations.

pd=pData(SpikeIn133) ## use all the replicates in phenoData function to Spike133 data
pms=pm(SpikeIn133) ## used all replicates in probeMatch function to Spike133 data

j = which(colnames(pd)=="203508_at")
concentration=pd[,j]
which(concentration == 0)
## [1]  1 15 29

Note that with 11 probes we have observed intensities (and then replicates of these) for when this gene is only background.

What is the smallest observed background level of these intensities (in original scale not log, and including the replicates)?

j = which(colnames(pd)=="203508_at")
ind=which(pns==colnames(pd)[j]) ## probeNames from the SpikeIn133
##probes in gene 1
concentration=pd[,1]
i = which(concentration==0)
min( pms[ind,i] )
## [1] 26

What is the largest observed background level of these intensities (in original scale not log, and including the replicates)?

max( pms[ind,i] )
## [1] 468.3

Question 3.7.2

Again, we will use all the replicates of the concentrations. We have three replicate experiments in which the spiked-in concentration of this gene was 0:

pd=pData(SpikeIn133) ## use all the replicates
pms=pm(SpikeIn133) ## use all the replicates
j = which(colnames(pd)=="203508_at")
concentration=pd[,j]
i = which(concentration==0)

The MM (mismatch) probes are designed to capture these values:

mms = mm(SpikeIn133) ## use all the replicates

Make a plot of the PMs versus MMs for the probes for gene 203508_at and for the experiments in which we know the spiked-in concentration of this gene was 0. Note that here we are observing background in the PMs and the MMs are intended as an estimate. From the plot you will see that a log transformation is more appropriate.

What is the correlation between the logs of the MMs and PMs for gene 203508_at for the experiments which spiked-in concentration of 0. (Just take the correlation over all 3 x 11 = 33 values. You can use as.vector to coerce a matrix to a vector.)

plot(pms[ind,i],mms[ind,i])

plot(log(pms[ind,i]),log(mms[ind,i]))

mms2=as.vector(log(mms[ind, which(pData(SpikeIn133)[1]==0)]))
pms2=as.vector(log(pms[ind, which(pData(SpikeIn133)[1]==0)]))

corr = cor(pms2,mms2)
corr
## [1] 0.9305712

Note that while correlation is high, the difference between these two is substantial.

hist(log2(mms2)-log2(pms2))

Question 3.7.3

We are now going to compare the approach of subtracting the MMs to the RMA approach described in the videos. We start by applying the background correction to the entire set:

bg1 = bg.correct.mas(SpikeIn133)
bg2 = bg.correct.rma(SpikeIn133)
pd= pData(SpikeIn133)
pns=probeNames(SpikeIn133)
pms1=pm(bg1) 
pms2=pm(bg2)

ind=which(pns==colnames(pd)[1]) ##probes in gene 1
concentration=pd[,1]
concentration[concentration==0]= 1/16

mypar(1,2)
matplot(log2(concentration),t(log2(pms1[ind,])),xlab="log (base 2) concentration",ylab="log (base 2) instensity",ylim=c(0,13))
matplot(log2(concentration),t(log2(pms2[ind,])),xlab="log (base 2) concentration",ylab="log (base 2) instensity",ylim=c(0,13))

We can see that the first approach (subtracting MM) seems to perform slightly better at removing the lack of sensitivity (low slope) towards the end. However, let’s study the variance. If we take a close look at pData(SpikeIn133) we see that there are there triplicate samples. Here are two examples.

pData(SpikeIn133)[c(1,15,29),]
##                                          203508_at 204563_at 204513_s_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1         0         0           0
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2         0         0           0
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3         0         0           0
##                                          204205_at 204959_at 207655_s_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1     0.125     0.125       0.125
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2     0.125     0.125       0.125
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3     0.125     0.125       0.125
##                                          204836_at 205291_at 209795_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1      0.25      0.25      0.25
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2      0.25      0.25      0.25
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3      0.25      0.25      0.25
##                                          207777_s_at 204912_at 205569_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1         0.5       0.5       0.5
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2         0.5       0.5       0.5
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3         0.5       0.5       0.5
##                                          207160_at 205692_s_at 212827_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1         1           1         1
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2         1           1         1
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3         1           1         1
##                                          209606_at 205267_at 204417_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1         2         2         2
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2         2         2         2
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3         2         2         2
##                                          205398_s_at 209734_at 209354_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1           4         4         4
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2           4         4         4
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3           4         4         4
##                                          206060_s_at 205790_at 200665_s_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1           8         8           8
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2           8         8           8
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3           8         8           8
##                                          207641_at 207540_s_at 204430_s_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1        16          16          16
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2        16          16          16
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3        16          16          16
##                                          203471_s_at 204951_at 207968_s_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1          32        32          32
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2          32        32          32
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3          32        32          32
##                                          AFFX-r2-TagA_at AFFX-r2-TagB_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1              64              64
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2              64              64
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3              64              64
##                                          AFFX-r2-TagC_at AFFX-r2-TagD_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1              64             128
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2              64             128
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3              64             128
##                                          AFFX-r2-TagE_at AFFX-r2-TagF_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1             128             128
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2             128             128
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3             128             128
##                                          AFFX-r2-TagG_at AFFX-r2-TagH_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1             256             256
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2             256             256
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3             256             256
##                                          AFFX-DapX-3_at AFFX-LysX-3_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1            256            512
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2            256            512
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3            256            512
##                                          AFFX-PheX-3_at AFFX-ThrX-3_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1            512            512
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2            512            512
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3            512            512
pData(SpikeIn133)[c(2,16,30),]
##                                          203508_at 204563_at 204513_s_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1     0.125     0.125       0.125
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2     0.125     0.125       0.125
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3     0.125     0.125       0.125
##                                          204205_at 204959_at 207655_s_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1      0.25      0.25        0.25
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2      0.25      0.25        0.25
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3      0.25      0.25        0.25
##                                          204836_at 205291_at 209795_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1       0.5       0.5       0.5
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2       0.5       0.5       0.5
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3       0.5       0.5       0.5
##                                          207777_s_at 204912_at 205569_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1           1         1         1
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2           1         1         1
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3           1         1         1
##                                          207160_at 205692_s_at 212827_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1         2           2         2
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2         2           2         2
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3         2           2         2
##                                          209606_at 205267_at 204417_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1         4         4         4
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2         4         4         4
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3         4         4         4
##                                          205398_s_at 209734_at 209354_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1           8         8         8
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2           8         8         8
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3           8         8         8
##                                          206060_s_at 205790_at 200665_s_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1          16        16          16
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2          16        16          16
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3          16        16          16
##                                          207641_at 207540_s_at 204430_s_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1        32          32          32
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2        32          32          32
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3        32          32          32
##                                          203471_s_at 204951_at 207968_s_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1          64        64          64
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2          64        64          64
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3          64        64          64
##                                          AFFX-r2-TagA_at AFFX-r2-TagB_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1             128             128
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2             128             128
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3             128             128
##                                          AFFX-r2-TagC_at AFFX-r2-TagD_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1             128             256
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2             128             256
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3             128             256
##                                          AFFX-r2-TagE_at AFFX-r2-TagF_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1             256             256
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2             256             256
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3             256             256
##                                          AFFX-r2-TagG_at AFFX-r2-TagH_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1             512             512
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2             512             512
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3             512             512
##                                          AFFX-DapX-3_at AFFX-LysX-3_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1            512              0
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2            512              0
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3            512              0
##                                          AFFX-PheX-3_at AFFX-ThrX-3_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1              0              0
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2              0              0
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3              0              0

Use the rowSds function to Compute the SD of the log intensities for each probe across the triplicate arrays c(1,15,29). Then compute the average log intensity A across these three replicates. Plot SD versus A for both bg1 (method 1) and bg2 (method 2) above.

library(genefilter)
## 
## Attaching package: 'genefilter'
## 
## The following object is masked from 'package:base':
## 
##     anyNA
ind = c(1,15,29)
pm1 = log2( pm(bg1)[,ind])
pm2 = log2( pm(bg2)[,ind])

SD1 = rowSds(pm1)
A1 = rowMeans(pm1)
SD2 = rowSds(pm2)
A2 = rowMeans(pm2)
mypar(2,1)
splot(A1,SD1,ylim=c(0,3),cex=.25)
splot(A2,SD2,ylim=c(0,3),cex=.25)

Which would best describe a comparison of the SD based on these two plots:

Method 1 shows larger variability especially for the lower values of A.