You are presented a lot of information about this experiment, and at the end the description of the samples (Columns descriptions) and the beginning of the actual Data Table. Each row in the data table are the measured values for one probe (spot) on the microarray. The cryptic name of this probe is in the first column (ID_REF). The gene (and protein produced from that gene) each probe detects is in the second column (IDENTIFIER).
The GEOData is a specialized class, for GEO derived datasets only. For our further analyses we need to convert it to something that fits the general functions of Bioconductor. This is the ExpressionSet class, and there is a special function to do this conversion
library(Biobase)
library(GEOquery)
library(limma)
library(gplots)
#Downloading the data set
gds <- getGEO('GDS1542', destdir=".")
class(gds)
## [1] "GDS"
## attr(,"package")
## [1] "GEOquery"
eset <- GDS2eSet(gds)
The ExpressionSet contains basically the same information as the GDS, but structured differently. Let’s currently just extract the actual values as a standard matrix:
expdata <- exprs(eset)
dim(expdata)
## [1] 22283 8
head(expdata)
## GSM50777 GSM50778 GSM50779 GSM50780 GSM50781 GSM50782 GSM50783
## 1007_s_at 124.1 113.4 107.3 127.6 128.0 104.2 93.4
## 1053_at 29.0 74.1 32.3 36.7 35.7 29.0 33.1
## 117_at 26.4 4.3 13.1 17.7 23.9 18.3 23.2
## 121_at 358.9 282.5 314.5 238.5 367.8 279.1 307.3
## 1255_g_at 30.7 14.8 21.0 12.1 21.4 16.1 31.7
## 1294_at 56.2 50.0 59.9 48.7 68.7 63.0 59.7
## GSM50784
## 1007_s_at 109.5
## 1053_at 50.7
## 117_at 19.1
## 121_at 262.5
## 1255_g_at 11.7
## 1294_at 59.2
We lost the gene (IDENTIFIER) column from the table.
sum(is.na(expdata))# There is one data point missing
## [1] 1
w <- which(apply(is.na(expdata), 1, sum) > 0 )
temp <- expdata[-w, ]
Now check how the data looks using a box plot:
boxplot(as.data.frame(expdata))
Almost all values are close to zero and there seems to be an extended tail of outlying high values. Actually, microarray data is not at all normal distributed. But if we make the 2-logarithm of data it is much better:
logdata <- log2(expdata)
boxplot(as.data.frame(logdata))
This is much better. Furthermore, the mean of all the samples are approximately the same, showing that on a global scale, every microarray worked equally well technically.
In a meaningful experiment with many variables, the purpose is to detect the (relatively small) subset of variables that change significantly between the conditions studied. Naively, we could now just make a lot of t-tests, one for each probe and see which are significantly changing. But this will give us a lot of false positives just by chance, even if there is no true change of any probe between the samples.
Thus, we want to eliminate as many variables as possible that we do not consider biologically meaningful before doing statistical testing. Furthermore in our kind of experiment with microarrays, data tend to be very noisy when the signal is very low. Let’s have a look at that by plotting the standard deviation against the mean for every probe:
probemeans <- apply(logdata, 1, mean)
probesd <- apply(logdata, 1, sd)
plot(probemeans, probesd)
It is clear that standard deviation is higher at lower means. Some probes have very high standard deviations - these may be truly changing, something we will investigate further on.
Let’s first eliminate the 25% of the probes that have weakest signal:
q25 <- quantile(probemeans, 0.25,na.rm = T)
whichtosave <- which(probemeans > q25)
q25logdata <- logdata[whichtosave,]
Since we are only interested in probes that change, and thus have high variability, we can also remove those with very low variability. A way to do that is to filter on the inter-quartile range (using the IQR function). Keep only those with an IQR above 1.5:
mydata <- q25logdata[apply(q25logdata, 1, IQR) > 1.5, ]
At this point it would be interesting to do a Principal Component Analysis (PCA). This basically transforms the data to find new orthogonal variables that explain most of the variation in the dataset.
This variation can be analysed either between probes (rows) or samples (columns). We will use the function prcomp() for the PCA. It does the analysis between rows. For our purpose the samples are most interesting to compare, so we need to interchange columns and rows in mydata.
tdata <- aperm(mydata)
Now we use function prcomp() on the transposed data:
pca <- prcomp(tdata, scale=T)
The first component explains the largest part of the variance, but not all. In the best of worlds, this accounts for the difference between our experimental conditions, otherwise we have some unknown batch effect that dominate the experiment.
We can plot the samples in relation to the first two components:
plot(pca$x, type="n")
text(pca$x, rownames(pca$x), cex=0.5)
More interesting is maybe to see which experimental condition they belong to. For that purpose we extract this from the ExpressionSet (remember that?, use show(eset) again…)
conditions <- phenoData(eset)$agent
plot(pca$x, type="n")
text(pca$x, labels=conditions, cex=0.5)
Luckily the first component divides the samples by condition. However, in one sample something else seems to be going on, sending it away along the second component. That can be worth remembering when judging the final results.
Another informative plot is to do a dendrogram of correlation between the samples. First we make a correlation matrix between the samples, than a hierarchical clustering is performed:
pearsonCorr <- as.dist(1 - cor(mydata))
hC <- hclust(pearsonCorr)
plot(hC, labels = sampleNames(eset))
The heights of the branches indicate how distant the samples are. Recall which sample was in each condition by putting condition as labels:
plot(hC, labels = conditions)
The two groups and the sample “in between” are even more evident here than in the PCA.
Another useful visualization of large datasets is the heatmap. R clusters the data both on rows columns with this command:
heatmap(mydata, col=greenred(100))
Red corresponds to high, green to low values. The dendrogram on top is basically the same as we produced earlier, in a slightly different order. Towards the bottom of the heatmap are the probes clustered that discriminate the two conditions clearly.
Let’s now find the probes change significantly between the conditions. There are several tools to do that, more or less advanced. The most simplistic would be to do a t-test for every probe. This is however not a good idea. We have a lot of probes, and the few samples will give the estimate of variance low precision in many cases and give us many false negatives and positives. However, it turns out that the variance of probes with approximately the same expression level is rather similar, and hence one can let probes “borrow” variance from each other to get better variance estimates. Such a method is employed in the limma package (LInear Models for Microarrays). limma needs to see the whole dataset, including the high variance probes, to do correct variance estimations. Thus we will go back and use our ExpressionSet containing all data. In addition to the actual data, limma needs a model matrix, basically information on which conditions each sample represents. R has a standard function for defining model matrices, model.matrix. It uses the ~ operator to define dependencies. Each input variable should be a factor, so let’s first make a factor out of the conditions (agent in our ExpressionSet):
condfactor <- factor(eset$agent)
Construct a model matrix, and assign the names to the columns:
design <- model.matrix(~0+condfactor)
For the columns of the design matrix you could use any names. I choose “ctrl” for the control samples, and “tnf” for the chemically treated.
colnames(design) <- c("ctrl", "tnf")
The first 4 samples have ‘1’ in the control level, and the following 4 have the ‘1’ in the other level. If we had had a multifactor experiment (ANOVA style data), we would have included more levels and assigned them in appropriate combinations to the samples. The next command estimates the variances:
fit <- lmFit(eset, design)
Now we need to define the conditions we want to compare - trivial in this case since there are only two conditions:
contrastmatrix <- makeContrasts(tnf - ctrl,levels=design)
The following commands calculate the p-values for the differences between the conditions defined by the contrast matrix:
fit <- contrasts.fit(fit, contrastmatrix)
ebayes <- eBayes(fit)
A lot of stuff here, one of the most interesting is the p.value. Let’s make a histogram of the p values:
hist(ebayes$p.value)
You see that the number of probes are enriched close to p = 0.00. This surplus is due to the genes that are significantly changing between the conditions. If the data had a skewed distribution we might see an accumulation at the p = 1.00 end, indicating a problem with our data.
If the data were completely random, the p values would be equally distributed from 0 to 1, thus if we from random data picked the probes that had p < 0.05 as “significant”, we would pick exactly 1/20 of all probes, all being false positives. Unfortunately, there is no way to discriminate the surplus true probes from the false positives. This is problem with any study where a lot of variables are tested, for instance in large sociology studies. However, there are ways to regulate the p-value cut off in order to have control over the false positives. A common way is the Benjamini-Hochberg adjustment. This adjustment allows you to set a limit to how large fraction of false positive variables (false discovery rate, or FDR) you accept in the results.
This adjustment, at 5% FDR, is built into the decideTests function:
results <- decideTests(ebayes)
decideTests produces 0, 1, or -1 for each probes, telling if that probe did not pass the test (0), or was significantly increased (1), or decreased (-1)
A neat function to display the result of just two conditions is the Venn diagram:
vennDiagram(results)
Extract the original data for the most changing probes:
resData <- exprs(eset)[results != 0,]
Add gene symbols as row names:
geneSymbol <- as.array(fData(eset)[,"Gene symbol"])
gs <- geneSymbol[c(which(results != 0))]
rownames(resData) <- gs
Add p-values in an extra column:
pvalues <- ebayes$p.value[results != 0,]
resData <- cbind(resData, pvalues)
And add-p values corrected for multiple testing (q-values):
adj.pvalues <- p.adjust(ebayes$p.value, method="BH")
adj.pvalues <- adj.pvalues[results != 0]
resData <- cbind(resData, adj.pvalues)
Now we have generated both quality control graphs and a table of probes that change significantly between the samples. It’s time for the medical scientists to take over and let them make the biological conclusions.