if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(version = "3.22")
BiocManager::install("affy")
BiocManager::install("SpikeIn")
BiocManager::install("SpikeInSubset")Affymetrix Array Statistical Model
The intensity y of a single probe is given by:
y = B + αS
Where B is the background noise composed of many different terms including optical effects and non-specific binding effects (this is important as this means it will be non-systematic and so normally distributed according to the CLT).
Alpha is a gain factor and S is the specific binding. S will also be a random variable to account for probe effects and measurement error, and these errors are assumed to be multiplicative.
log(S) = θ + φ + ε
The variability of B and α between arrays means that even if S is the same it will not appear to be the same. This can be demonstrated by the use of spiked in amounts of a gene.
There is also attenuation bias - low concentrations are more strongly affected by background. This can be seen from the Spike In datasets available in Bioconductor and accessed using the affy library.
First you need to install Bioconductor and the appropriate libraries.
The full spiked in data contains 3 repeats of 14 different nominal concentrations. The spiked in subset contains 3 repeats of 2 different nominal concentrations chosen from the 14 (Irizarry et al. 2003). That means that there are 42 different data points collected across the 14 different concentrations. These are recorded for 42 different probes.
library("affy")
library("hgu133atagcdf")
library("SpikeIn")
#Load the full spikein133 dataset this is for the HGU133 reference datase
data("SpikeIn133")
#Check the size of the spiked in dataset.
dim(pData(SpikeIn133))[1] 42 42
#Use the pData function to look at the treatments for a particular probe/gene
(pData(SpikeIn133))[1] 203508_at
12_13_02_U133A_Mer_Latin_Square_Expt1_R1 0.000
12_13_02_U133A_Mer_Latin_Square_Expt2_R1 0.125
12_13_02_U133A_Mer_Latin_Square_Expt3_R1 0.250
12_13_02_U133A_Mer_Latin_Square_Expt4_R1 0.500
12_13_02_U133A_Mer_Latin_Square_Expt5_R1 1.000
12_13_02_U133A_Mer_Latin_Square_Expt6_R1 2.000
12_13_02_U133A_Mer_Latin_Square_Expt7_R1 4.000
12_13_02_U133A_Mer_Latin_Square_Expt8_R1 8.000
12_13_02_U133A_Mer_Latin_Square_Expt9_R1 16.000
12_13_02_U133A_Mer_Latin_Square_Expt10_R1 32.000
12_13_02_U133A_Mer_Latin_Square_Expt11_R1 64.000
12_13_02_U133A_Mer_Latin_Square_Expt12_R1 128.000
12_13_02_U133A_Mer_Latin_Square_Expt13_R1 256.000
12_13_02_U133A_Mer_Latin_Square_Expt14_R1 512.000
12_13_02_U133A_Mer_Latin_Square_Expt1_R2 0.000
12_13_02_U133A_Mer_Latin_Square_Expt2_R2 0.125
12_13_02_U133A_Mer_Latin_Square_Expt3_R2 0.250
12_13_02_U133A_Mer_Latin_Square_Expt4_R2 0.500
12_13_02_U133A_Mer_Latin_Square_Expt5_R2 1.000
12_13_02_U133A_Mer_Latin_Square_Expt6_R2 2.000
12_13_02_U133A_Mer_Latin_Square_Expt7_R2 4.000
12_13_02_U133A_Mer_Latin_Square_Expt8_R2 8.000
12_13_02_U133A_Mer_Latin_Square_Expt9_R2 16.000
12_13_02_U133A_Mer_Latin_Square_Expt10_R2 32.000
12_13_02_U133A_Mer_Latin_Square_Expt11_R2 64.000
12_13_02_U133A_Mer_Latin_Square_Expt12_R2 128.000
12_13_02_U133A_Mer_Latin_Square_Expt13_R2 256.000
12_13_02_U133A_Mer_Latin_Square_Expt14_R2 512.000
12_13_02_U133A_Mer_Latin_Square_Expt1_R3 0.000
12_13_02_U133A_Mer_Latin_Square_Expt2_R3 0.125
12_13_02_U133A_Mer_Latin_Square_Expt3_R3 0.250
12_13_02_U133A_Mer_Latin_Square_Expt4_R3 0.500
12_13_02_U133A_Mer_Latin_Square_Expt5_R3 1.000
12_13_02_U133A_Mer_Latin_Square_Expt6_R3 2.000
12_13_02_U133A_Mer_Latin_Square_Expt7_R3 4.000
12_13_02_U133A_Mer_Latin_Square_Expt8_R3 8.000
12_13_02_U133A_Mer_Latin_Square_Expt9_R3 16.000
12_13_02_U133A_Mer_Latin_Square_Expt10_R3 32.000
12_13_02_U133A_Mer_Latin_Square_Expt11_R3 64.000
12_13_02_U133A_Mer_Latin_Square_Expt12_R3 128.000
12_13_02_U133A_Mer_Latin_Square_Expt13_R3 256.000
12_13_02_U133A_Mer_Latin_Square_Expt14_R3 512.000
#List the concentrations used for the first probe label 203508_at
pData(SpikeIn133) [ ,"203508_at"] [1] 0.000 0.125 0.250 0.500 1.000 2.000 4.000 8.000 16.000
[10] 32.000 64.000 128.000 256.000 512.000 0.000 0.125 0.250 0.500
[19] 1.000 2.000 4.000 8.000 16.000 32.000 64.000 128.000 256.000
[28] 512.000 0.000 0.125 0.250 0.500 1.000 2.000 4.000 8.000
[37] 16.000 32.000 64.000 128.000 256.000 512.000
#Extract the probe names from the phenotype data from the spikein133 list object
Index <- which(probeNames(SpikeIn133)%in%colnames(pData(SpikeIn133)))
#Extract the positive match intensities from the list of spiked in probes
pms <- pm(SpikeIn133)[Index,]
#Extract the gene names from the probeNames
genenames <- probeNames(SpikeIn133)[Index]
#Extract the nominal concentrations from the dataset
nominal <- t(sapply(probeNames(SpikeIn133)[Index],function(i) pData(SpikeIn133)[,i]))
#Calculate the log2 values of the nominal concentrations
x <- as.vector(log2(nominal))
#Calculate the log2 values for the resulting intensities
y <- as.vector(log2(pms))
avg <- tapply(y,x,mean,na.rm=TRUE)This extracts the data subset we want to plot from the SpikeInSubset and converts it to log2 values. This is then plotted with:
plot(jitter(x),y,las=1,pch=".",
ylab="Log (Base 2) PM Intensity",
xlab="Nominal Log (Base 2) Concentration")
lines(as.numeric(names(avg)),avg,lwd=3,col="green")This graph clearly does not follow a linear relationship between concentration and Intensity such as you have with the Beer-Lambert law in solution absorbances. This makes dealing with mixed concentrations over a wide-range statistically challenging, just from a signal intensity perspective, before taking into account the challenges of probe and biological variation.
The additive-multiplicative model can be tested by plotting a large series of probe intensities for a given nominal concentration (chosen as 1pmol, with a log2 value of 0), against a normally distributed error. The resulting QQ-plot should lie on the diagonal if the assumption of normally distributed errors is correct.
This is plotted with:
qqnorm(y[x==0],pch=".")
qqline(y[x==0],col="red")There is a significant deviation from normality across the complete set of data. Looking at the preceding graphs this is because of a very large right-skew at a nominal concentration of 1pmol.
The qqplot for 2pmol is much closer to normality.
qqnorm(y[x==1],pch=".")
qqline(y[x==1],col="red")At high concentrations there are a significant number of lower than expected intensities and the skew moves left. Looking at the pattern of the increase of this skew with concentration this suggests poor design of the probes. Yet another reason why Michael Eisen said stop using Microarrays for gene expression analysis.
However we have all of the data and we can try and extract some useful information from it so long as we realise that it is poorly quantitative and that we should only use it for differences between groups not absolute values.