GEO stands for Gene Expression Omnibus. It’s a public database for microarray datasets and it’s hosted at NCBI. Its address is https://www.ncbi.nlm.nih.gov/gds. We will go through some example.
Let’s say that we want to analyze a dataset related to the effect of smoking in humans. In GEO Datasets (as well as in NCBI), we can use keywords to search for specific entries. Now, I’ll use the keywords “Smoking” in title, and “Human” as an organism. The search results are here. We choose the third datasets for this example, i.e. the GDS3309.
We will proceed with the following steps:
grep
command. Cleaning means that you should remove all lines starting with an ^
, or a !
or a #
.First, let’s clean the dataset
grep v [\!\^\#] GDS3309.soft > GDS3309.clean
We need to inspect the file (you can use ls S GDS3309.clean) to see what are the NULL characters and in general if the file “looks good”
raw.data < read.table("GDS3309.clean", sep="\t", header=TRUE)
dim(raw.data)
## [1] 22277 17
The first two columns contain the probe ID and the gene name for each row. Thus, we need to create a file that does not contain these two columns.
data < raw.data[,c(1,2)]
## Let's have a look at the head of the file
head(data)
## GSM227868 GSM227870 GSM227871 GSM227874 GSM227876 GSM227877 GSM227878
## 1 5236.4800 6183.20000 5171.48000 4553.4000 6094.6600 5976.880 7713.8800
## 2 237.5170 225.65300 225.01700 134.2670 228.1800 235.723 262.2080
## 3 100.4390 159.24800 136.26700 45.6665 166.1320 132.373 47.6219
## 4 1123.9900 1258.00000 923.85900 619.6690 1369.2800 1215.570 1102.9000
## 5 34.7732 9.23039 8.90546 29.3123 60.9262 55.368 61.0796
## 6 348.8550 399.54000 429.15500 342.0000 1161.0500 486.014 754.3390
## GSM227880 GSM227869 GSM227872 GSM227873 GSM227875 GSM227879 GSM227881
## 1 6781.240 4948.9900 5029.5700 5892.1700 6534.86000 8393.7200 6367.6200
## 2 254.174 196.7490 188.7460 202.1670 174.62200 178.9760 220.6430
## 3 127.556 160.1350 125.7430 163.9650 51.95170 140.1410 325.3390
## 4 1377.010 1195.4000 1176.8900 959.1010 926.23500 984.3440 1850.0100
## 5 83.120 37.0302 30.8798 31.6348 4.08673 64.7151 15.3595
## 6 760.100 403.2180 478.7310 475.1940 393.89700 527.1790 469.2370
## GSM227882
## 1 8266.4800
## 2 215.6060
## 3 140.8650
## 4 1200.3600
## 5 29.2318
## 6 499.7910
First, we will check if the data are normalized. It’s easy to check this via a boxplot
. If the data are normalized then, they should appear more or less at the same level in a boxplot.
Since the data
are a data.frame
, it’s very simple to construct a boxplot for the different columns of the data.frame:
boxplot(data)
A problem that appears in this boxplot is that there some very large values and really many small values. Thus, we cannot really have a clear picture of the data. This is common in microarrays, where the majority of the genes are expressed at low level, whereas a small portion of the genes are expressed at high level. Therefore, we usually transform the data using the log_2 transformation.
data2 < log2(data) ## again remember that most of the operations in R are performed in an elementwise manner.
##now let's make a boxplot
boxplot(data2, main=expression(paste("Boxplot of the ", log2, " data")))
It’s obvious that data are normalized since the boxplot for each column is about at the same level. Normalization is necessary when we want to perform comparisons between different samples. Thus, we are ready now to proceed with the differential expression analysis, that is to detect the genes that behave differently in different classes of the data.
At step 1, we will need to see how are data are partitioned. Thus, we visit the NCBI web page where the dataset is stored. This is, here
This is a very simple experimental design with only one factor, smoking, with two levels: “control” and “smokers”. Control samples are 8, whereas there are 7 smokers.
labels < c(rep("control", 8), rep("smokers", 7))
Let’s now perform hypothesis testing for only one gene, the first one.
The idea behind the ttest
is that we summarize the information of two vectors of values, each belonging to one class (e.g., “control” and “smokers”) in a single number. This numbers is larger when the values of the two classes differ a lot in comparison to the variance within the classes. Thus we define a new quantity called tstatistic as ( t = ), where ( _1 ) and ( _2 ) are the means of each class and VAR is a function of the variances of the two classes. Moreover, we know how the tstatistic follows a certain distribution when the two classes come from the SAME population, i.e., the two vectors of values are sampled from just a single population. This allows us to do something very important: We know what values of the tstatistic to expect when the two vectors do not differ because they come from different populations, but they may just differ due to random fluctuations within a population. If the value of the tstatistic is too large (absolute value) to be considered a ‘part of this distribution’, then we reject the hypothesis that the two vectors have been obtained from the same distribution.
The function t.test
performs the ttest. As arguments it accepts two vectors, one for each class. The order of the two vectors is not important. The help of this function returns the following information:
t.test {stats}  R Documentation 
Performs one and two sample ttests on vectors of data.
t.test(x, ...) ## Default S3 method: t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...) ## S3 method for class 'formula' t.test(formula, data, subset, na.action, ...)
x

a (nonempty) numeric vector of data values. 
y

an optional (nonempty) numeric vector of data values. 
alternative

a character string specifying the alternative hypothesis, must be one of 
mu

a number indicating the true value of the mean (or difference in means if you are performing a two sample test). 
paired

a logical indicating whether you want a paired ttest. 
var.equal

a logical variable indicating whether to treat the two variances as being equal. If 
conf.level

confidence level of the interval. 
formula

a formula of the form 
data

an optional matrix or data frame (or similar: see 
subset

an optional vector specifying a subset of observations to be used. 
na.action

a function which indicates what should happen when the data contain 
…

further arguments to be passed to or from methods. 
The formula interface is only applicable for the 2sample tests. The t.test
accepts two vectors one from the first class and another for the second class. The rest of the arguments control details about the test. For example, if you know that the variances of the two classes are equal, you can set that var.equal=TRUE
. In general, however, this is not known, thus we set that var.equal=F
.
Now, let’s perform the t.test
in the first gene.
result < t.test( data2[1, labels == 'control'], data2[1, labels == 'smokers'], var.equal=F )
result
##
## Welch Two Sample ttest
##
## data: data2[1, labels == "control"] and data2[1, labels == "smokers"]
## t = 0.77695, df = 11.362, pvalue = 0.4531
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.4260015 0.2030735
## sample estimates:
## mean of x mean of y
## 12.52459 12.63605
The result of the ttest returns a lot of information:
In fact, there are two values that are interesting for us. The first, is the pvalue and the second is the difference of the means, i.e.
## To obtain the names of the objects we can use the function names
names(result)
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "alternative" "method" "data.name"
result$p.value
## [1] 0.4530608
diff(result$estimate)
## mean of y
## 0.111464
The pvalue will tell us if the difference of the means is large enough so that we reject the null hypothesis with the sample that we have. If the sample size is large, then we can reject the null hypothesis even if the difference between the means is small. See the following example:
v1 < rnorm(100, 0, 1)
v2 < rnorm(100, 0.2, 1)
t.test(v1, v2)
##
## Welch Two Sample ttest
##
## data: v1 and v2
## t = 0.76654, df = 198, pvalue = 0.4443
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3872247 0.1704509
## sample estimates:
## mean of x mean of y
## 0.01062515 0.11901205
v3 < rnorm(10000, 0, 1)
v4 < rnorm(10000, 0.2, 1)
t.test(v3, v4)
##
## Welch Two Sample ttest
##
## data: v3 and v4
## t = 13.655, df = 19998, pvalue < 2.2e16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2198125 0.1646283
## sample estimates:
## mean of x mean of y
## 0.01512736 0.20734777
Thus, even though the two samples are coming from two distributions, with mean 0 and 0.2, respectively, in the first case we cannot reject the null, whereas in the second case we can reject the null. This is because the sample size is large in the second case.
Therefore, it is not only enough to report the pvalue but also the difference between the means. The difference between the means (or other values, such as ratios etc may be biologically important).
Next we have to obtain the pvalues for all genes. Thus, we need to apply the function t.test
for all the rows of the expression matrix. However, there is a problem. The t.test
is applied for two vectors. On the other hand, the apply
function of the R, is applied on whole rows. Thus, we create a function that will appropriately use the t.test
.
my.ttest < function(v, labels)
{
levels < unique(labels)
v1 < v[ labels == levels[1]]
v2 < v[ labels == levels[2]]
pval < t.test(v1, v2, var.equal = F)$p.value
pval
}
Now, we can easily apply this function for the whole matrix
allpvalues < apply(data2, 1, my.ttest, labels)
## let's see which p values are smaller than a threshold
sig.inds < which(allpvalues < 0.001)
sig.inds
## [1] 675 1962 1963 1964 1965 2630 2896 5276 6441 7044 9185
## [12] 14024 14458
There are only a few rows (genes) with pvalue smaller than the specified threshold. Now, we can visualize these results. A nice way is to use the heatmaps. So, an intuitive result is that if we would use only these gene to represent the expression values of the two classes, then the two sets of samples would be separated. We will use the function heatmap.2
from the package gplots
library(gplots)
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
data.sig < as.matrix(data2[sig.inds,])
rownames(data.sig) < raw.data[sig.inds,2]
## Note that heatmap.2 accepts only matrices
heatmap.2(as.matrix(data.sig))