To explain background we will use Affymetrix microarray data. Several approaches to modeling data from these arrays have been proposed and here we explore data from the spike-in experiment which provides many insights.
We will need the following libraries:
library(rafalib)
library(affy)
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
##
## The following object is masked from 'package:stats':
##
## xtabs
##
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, as.vector, cbind,
## colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
## intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unlist, unsplit
##
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
library(SpikeIn)
library(SpikeInSubset)
library(hgu133atagcdf)
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
The SpikeIn package includes raw probe-level data for two spike-in experiments, SpikeIn133 and SpikeIn95. The SpikeInSubset includes a six sample subset of this experiment along with processed data with different the MAS5 and RMA algorithms:
data(package="SpikeInSubset")
Note that the spike-in experimental design is described in the phenoData slot. If you type:
data(SpikeIn133)
head(pData( SpikeIn133) )[1]
## 203508_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1 0.000
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1 0.125
## 12_13_02_U133A_Mer_Latin_Square_Expt3_R1 0.250
## 12_13_02_U133A_Mer_Latin_Square_Expt4_R1 0.500
## 12_13_02_U133A_Mer_Latin_Square_Expt5_R1 1.000
## 12_13_02_U133A_Mer_Latin_Square_Expt6_R1 2.000
dim(pData(SpikeIn133))
## [1] 42 42
We can see the the amounts that “203508_at” was spiked in at by looking the column associated with that gene:
pData(SpikeIn133) [ ,"203508_at"]
## [1] 0.000 0.125 0.250 0.500 1.000 2.000 4.000 8.000
## [9] 16.000 32.000 64.000 128.000 256.000 512.000 0.000 0.125
## [17] 0.250 0.500 1.000 2.000 4.000 8.000 16.000 32.000
## [25] 64.000 128.000 256.000 512.000 0.000 0.125 0.250 0.500
## [33] 1.000 2.000 4.000 8.000 16.000 32.000 64.000 128.000
## [41] 256.000 512.000
From inspecting pData(SpikeIn133) we can see that concentration series of 14 concentrations repeats itself three times. It is important to remember that in the raw probe-level data, there are several features for each of these genes.
Now we are ready to look at some data. Here is an example showing data for two arrays with the spiked-in genes highlighted. This data has already been processed from the probe-level to gene-level data using an approach developed by the manufacturer.
library(SpikeInSubset)
data(mas133)
e=exprs(mas133)##get expression
A=(log2(e[,4])+log2(e[,1]))/2
M=log2(e[,4]/e[,1])
##find the genes that were spiked in to have
##fold changes of 2
siNames=colnames(pData(mas133))
siNames=siNames[pData(mas133)[4,]/pData(mas133)[1,]==2]
spikeinIndex=match(siNames,rownames(e))
mypar(1,1)
splot(A,M,ylim=c(-4,4),cex=0.5)
abline(h=c(-1,1),col=1,lwd=2,lty=2)
points(A[spikeinIndex],M[spikeinIndex],bg=2,pch=21)
Note that distinguishing the spiked-in genes from the highly variable (noisy) data seems impossible. So why so much noise? It turns out that modeling can help us explain this.
We can look at probe level data to see this. Here we show data from just one gene: “203508_at”. We have 11 probes trying to measure the amount of expression for gene, but we see quite different values.
data(SpikeIn133)
pd=pData(SpikeIn133)[1:14,] ##pick the first 14, rest are reps
pns=probeNames(SpikeIn133)
pms=pm(SpikeIn133)[,1:14] ##pick the first 14, rest are reps
ind=which(pns==colnames(pd)[1]) ##probes in gene 1
concentration=pd[,1]
concentration[concentration==0]= 1/16
mypar(1,1)
matplot(log2(concentration),t(log2(pms[ind,])),xlab="log (base 2) concentration",ylab="log (base 2) instensity")
From the analysis above, we see that for some of the samples we know that the intended concentration was 0 which implies the signal we observe is background. For the gene “203508_at”, the intended concentrations can be seen in the first column of pData. We see that in three arrays this gene was not spiked-in:
Note! you need to run the following block of code, because now we want to include the replicate experiments for the 14 experimental concentrations.
pd=pData(SpikeIn133) ## use all the replicates in phenoData function to Spike133 data
pms=pm(SpikeIn133) ## used all replicates in probeMatch function to Spike133 data
j = which(colnames(pd)=="203508_at")
concentration=pd[,j]
which(concentration == 0)
## [1] 1 15 29
Note that with 11 probes we have observed intensities (and then replicates of these) for when this gene is only background.
What is the smallest observed background level of these intensities (in original scale not log, and including the replicates)?
j = which(colnames(pd)=="203508_at")
ind=which(pns==colnames(pd)[j]) ## probeNames from the SpikeIn133
##probes in gene 1
concentration=pd[,1]
i = which(concentration==0)
min( pms[ind,i] )
## [1] 26
What is the largest observed background level of these intensities (in original scale not log, and including the replicates)?
max( pms[ind,i] )
## [1] 468.3
Again, we will use all the replicates of the concentrations. We have three replicate experiments in which the spiked-in concentration of this gene was 0:
pd=pData(SpikeIn133) ## use all the replicates
pms=pm(SpikeIn133) ## use all the replicates
j = which(colnames(pd)=="203508_at")
concentration=pd[,j]
i = which(concentration==0)
The MM (mismatch) probes are designed to capture these values:
mms = mm(SpikeIn133) ## use all the replicates
Make a plot of the PMs versus MMs for the probes for gene 203508_at and for the experiments in which we know the spiked-in concentration of this gene was 0. Note that here we are observing background in the PMs and the MMs are intended as an estimate. From the plot you will see that a log transformation is more appropriate.
What is the correlation between the logs of the MMs and PMs for gene 203508_at for the experiments which spiked-in concentration of 0. (Just take the correlation over all 3 x 11 = 33 values. You can use as.vector to coerce a matrix to a vector.)
plot(pms[ind,i],mms[ind,i])
plot(log(pms[ind,i]),log(mms[ind,i]))
mms2=as.vector(log(mms[ind, which(pData(SpikeIn133)[1]==0)]))
pms2=as.vector(log(pms[ind, which(pData(SpikeIn133)[1]==0)]))
corr = cor(pms2,mms2)
corr
## [1] 0.9305712
Note that while correlation is high, the difference between these two is substantial.
hist(log2(mms2)-log2(pms2))
We are now going to compare the approach of subtracting the MMs to the RMA approach described in the videos. We start by applying the background correction to the entire set:
bg1 = bg.correct.mas(SpikeIn133)
bg2 = bg.correct.rma(SpikeIn133)
pd= pData(SpikeIn133)
pns=probeNames(SpikeIn133)
pms1=pm(bg1)
pms2=pm(bg2)
ind=which(pns==colnames(pd)[1]) ##probes in gene 1
concentration=pd[,1]
concentration[concentration==0]= 1/16
mypar(1,2)
matplot(log2(concentration),t(log2(pms1[ind,])),xlab="log (base 2) concentration",ylab="log (base 2) instensity",ylim=c(0,13))
matplot(log2(concentration),t(log2(pms2[ind,])),xlab="log (base 2) concentration",ylab="log (base 2) instensity",ylim=c(0,13))
We can see that the first approach (subtracting MM) seems to perform slightly better at removing the lack of sensitivity (low slope) towards the end. However, let’s study the variance. If we take a close look at pData(SpikeIn133) we see that there are there triplicate samples. Here are two examples.
pData(SpikeIn133)[c(1,15,29),]
## 203508_at 204563_at 204513_s_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1 0 0 0
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2 0 0 0
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3 0 0 0
## 204205_at 204959_at 207655_s_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1 0.125 0.125 0.125
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2 0.125 0.125 0.125
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3 0.125 0.125 0.125
## 204836_at 205291_at 209795_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1 0.25 0.25 0.25
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2 0.25 0.25 0.25
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3 0.25 0.25 0.25
## 207777_s_at 204912_at 205569_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1 0.5 0.5 0.5
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2 0.5 0.5 0.5
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3 0.5 0.5 0.5
## 207160_at 205692_s_at 212827_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1 1 1 1
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2 1 1 1
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3 1 1 1
## 209606_at 205267_at 204417_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1 2 2 2
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2 2 2 2
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3 2 2 2
## 205398_s_at 209734_at 209354_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1 4 4 4
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2 4 4 4
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3 4 4 4
## 206060_s_at 205790_at 200665_s_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1 8 8 8
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2 8 8 8
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3 8 8 8
## 207641_at 207540_s_at 204430_s_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1 16 16 16
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2 16 16 16
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3 16 16 16
## 203471_s_at 204951_at 207968_s_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1 32 32 32
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2 32 32 32
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3 32 32 32
## AFFX-r2-TagA_at AFFX-r2-TagB_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1 64 64
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2 64 64
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3 64 64
## AFFX-r2-TagC_at AFFX-r2-TagD_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1 64 128
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2 64 128
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3 64 128
## AFFX-r2-TagE_at AFFX-r2-TagF_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1 128 128
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2 128 128
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3 128 128
## AFFX-r2-TagG_at AFFX-r2-TagH_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1 256 256
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2 256 256
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3 256 256
## AFFX-DapX-3_at AFFX-LysX-3_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1 256 512
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2 256 512
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3 256 512
## AFFX-PheX-3_at AFFX-ThrX-3_at
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R1 512 512
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R2 512 512
## 12_13_02_U133A_Mer_Latin_Square_Expt1_R3 512 512
pData(SpikeIn133)[c(2,16,30),]
## 203508_at 204563_at 204513_s_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1 0.125 0.125 0.125
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2 0.125 0.125 0.125
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3 0.125 0.125 0.125
## 204205_at 204959_at 207655_s_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1 0.25 0.25 0.25
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2 0.25 0.25 0.25
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3 0.25 0.25 0.25
## 204836_at 205291_at 209795_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1 0.5 0.5 0.5
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2 0.5 0.5 0.5
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3 0.5 0.5 0.5
## 207777_s_at 204912_at 205569_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1 1 1 1
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2 1 1 1
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3 1 1 1
## 207160_at 205692_s_at 212827_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1 2 2 2
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2 2 2 2
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3 2 2 2
## 209606_at 205267_at 204417_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1 4 4 4
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2 4 4 4
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3 4 4 4
## 205398_s_at 209734_at 209354_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1 8 8 8
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2 8 8 8
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3 8 8 8
## 206060_s_at 205790_at 200665_s_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1 16 16 16
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2 16 16 16
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3 16 16 16
## 207641_at 207540_s_at 204430_s_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1 32 32 32
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2 32 32 32
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3 32 32 32
## 203471_s_at 204951_at 207968_s_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1 64 64 64
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2 64 64 64
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3 64 64 64
## AFFX-r2-TagA_at AFFX-r2-TagB_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1 128 128
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2 128 128
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3 128 128
## AFFX-r2-TagC_at AFFX-r2-TagD_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1 128 256
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2 128 256
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3 128 256
## AFFX-r2-TagE_at AFFX-r2-TagF_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1 256 256
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2 256 256
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3 256 256
## AFFX-r2-TagG_at AFFX-r2-TagH_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1 512 512
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2 512 512
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3 512 512
## AFFX-DapX-3_at AFFX-LysX-3_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1 512 0
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2 512 0
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3 512 0
## AFFX-PheX-3_at AFFX-ThrX-3_at
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R1 0 0
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R2 0 0
## 12_13_02_U133A_Mer_Latin_Square_Expt2_R3 0 0
Use the rowSds function to Compute the SD of the log intensities for each probe across the triplicate arrays c(1,15,29). Then compute the average log intensity A across these three replicates. Plot SD versus A for both bg1 (method 1) and bg2 (method 2) above.
library(genefilter)
##
## Attaching package: 'genefilter'
##
## The following object is masked from 'package:base':
##
## anyNA
ind = c(1,15,29)
pm1 = log2( pm(bg1)[,ind])
pm2 = log2( pm(bg2)[,ind])
SD1 = rowSds(pm1)
A1 = rowMeans(pm1)
SD2 = rowSds(pm2)
A2 = rowMeans(pm2)
mypar(2,1)
splot(A1,SD1,ylim=c(0,3),cex=.25)
splot(A2,SD2,ylim=c(0,3),cex=.25)
Which would best describe a comparison of the SD based on these two plots:
Method 1 shows larger variability especially for the lower values of
A.