Visualising the .CEL Files.

As with any statistical data, the first thing you need to do is exploratory data analysis. Look at each of the .CEL files to make sure that it is as expected and has no obvious defects.

I downloaded all of the files from the E-GEOD-4127 dataset to a sub-folder of the current location.

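# estrogen is only needed later, for its example of a bad .CEL file
BiocManager::install(c("affy", "hgu133acdf", "estrogen"))
library("Biobase")
library("affy")
library("hgu133acdf")
# move into the folder holding the downloaded .CEL files
setwd("E-GEOD-4127")
# with no arguments, ReadAffy() reads every .CEL file in the working directory
batch <- ReadAffy()
batch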
AffyBatch object
size of arrays=712x712 features (30 kb)
cdf=HG-U133A (22283 affyids)
number of samples=29
number of genes=22283
annotation=hgu133a
notes=
# draw the intensity images of the first two arrays
image(batch[, 1:2])

If you go through all of the images there are no obvious issues. There is some variation in overall intensity between the files, which probably reflects different amounts of sample loaded or possibly sample degradation. To deal with this we need to normalise the data, so that all of the arrays share the same baseline intensity.
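To make the idea of a shared baseline concrete, here is a minimal sketch in base R. It uses simulated intensities rather than the E-GEOD-4127 arrays, and it uses simple median centring purely as an illustration. It is not the normalisation applied by affy, and the names `raw`, `scale_factors` and `log_norm` are made up for this example.

```r
# Simulated raw intensities: 1000 probes x 4 arrays (hypothetical data
# standing in for the probe intensities of a real batch).
set.seed(1)
base <- matrix(rlnorm(4000, meanlog = 7, sdlog = 1), nrow = 1000, ncol = 4)

# Give each array a different overall brightness, as if sample loading differed.
scale_factors <- c(1.0, 1.5, 0.7, 1.2)
raw <- sweep(base, 2, scale_factors, `*`)
colnames(raw) <- paste0("array", 1:4)

# Work on the log2 scale, as is usual for microarray intensities.
log_raw <- log2(raw)

# The per-array medians differ before any adjustment.
medians_before <- apply(log_raw, 2, median)

# Median centring: shift each array so that its median matches a common target.
target <- median(medians_before)
log_norm <- sweep(log_raw, 2, medians_before - target, `-`)
medians_after <- apply(log_norm, 2, median)

# Side-by-side boxplots show the baselines lining up after centring.
op <- par(mfrow = c(1, 2))
boxplot(log_raw, main = "Before", ylab = "log2 intensity")
boxplot(log_norm, main = "After median centring", ylab = "log2 intensity")
par(op)
```

On the real data, calling boxplot(batch) on the AffyBatch gives the equivalent "before" picture of the raw intensities.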

It is also worth looking at an example of an unusable .CEL file to see when one should be rejected. This one comes from the estrogen package and can be found in the extdata subfolder of the package folder in the R library.

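library("affy")
# locate bad.cel inside the installed estrogen package
badfile <- system.file("extdata", "bad.cel", package = "estrogen")
badc <- ReadAffy(filenames = badfile)
image(badc)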

In this case something has been dropped on the array and the measurements are unreliable. This array should be discarded, even though some of its data values are valid.
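If you do not have the estrogen package to hand, the effect can be imitated with simulated data. The sketch below builds a hypothetical 100 x 100 grid of log2 intensities, adds a bright disc to mimic contamination, and draws both grids with base R's image(). The grid size, blob position and the names `chip` and `bad_chip` are all assumptions made for illustration; nothing here is read from a real .CEL file.

```r
# Hypothetical 100 x 100 grid of log2 intensities standing in for one array.
set.seed(42)
n <- 100
chip <- matrix(rnorm(n * n, mean = 8, sd = 1), nrow = n)

# Simulate a contaminating blob: a bright disc centred at (60, 40), radius 15.
rows <- row(chip)
cols <- col(chip)
blob <- (rows - 60)^2 + (cols - 40)^2 <= 15^2
bad_chip <- chip
bad_chip[blob] <- bad_chip[blob] + 4

# Base R image() on plain matrices: the clean grid looks like uniform noise,
# while the contaminated grid shows an obvious localised patch.
op <- par(mfrow = c(1, 2))
image(chip, col = grey.colors(64), main = "Clean (simulated)", axes = FALSE)
image(bad_chip, col = grey.colors(64), main = "With artefact (simulated)", axes = FALSE)
par(op)

# Fraction of the grid covered by the blob.
affected <- mean(blob)
```

Only a small fraction of the simulated grid is affected, yet the patch is spatially concentrated, which is why the image is a quicker check than summary statistics alone.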