GEO stands for Gene Expression Omnibus. It is a public database for microarray and other functional-genomics datasets, hosted at NCBI. Its address is https://www.ncbi.nlm.nih.gov/gds. We will go through an example.
Let’s say that we want to analyze a dataset related to the effect of smoking in humans. In GEO DataSets (as in other NCBI databases), we can use keywords to search for specific entries. Here, I’ll use the keyword “smoking” in the title and “human” as the organism. Searching with these terms returns a list of datasets; we choose the third one for this example, GDS3309.
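As a side note, the Bioconductor package GEOquery can fetch the same dataset programmatically; the sketch below assumes GEOquery is installed (the rest of this walkthrough uses the manually downloaded SOFT file instead).

## optional: fetch GDS3309 directly from within R (requires the GEOquery package)
library(GEOquery)
gds <- getGEO("GDS3309")   # downloads and parses the SOFT file
head(Table(gds))           # the expression table, similar to the cleaned file below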
We will proceed with the following steps: clean the dataset, load it into R, check whether the data are normalized, and then perform the differential expression analysis.
First, let’s clean the dataset with the grep command. Cleaning means removing all lines that start with a ^, a !, or a #.
grep -v '^[!#^]' GDS3309.soft > GDS3309.clean
We need to inspect the file (you can use less -S GDS3309.clean) to check whether there are NULL characters and, in general, whether the file “looks good”.
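For instance, we can also peek at the first few lines of the cleaned file from within R:

## a quick look at the first lines of the cleaned file, from within R
head(readLines("GDS3309.clean"), n = 3)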
raw.data <- read.table("GDS3309.clean", sep="\t", header=TRUE)
dim(raw.data)
## [1] 22277 17
The first two columns contain the probe ID and the gene name for each row. Thus, we create a data frame that does not contain these two columns.
data <- raw.data[,-c(1,2)]
## Let's have a look at the head of the file
head(data)
## GSM227868 GSM227870 GSM227871 GSM227874 GSM227876 GSM227877 GSM227878
## 1 5236.4800 6183.20000 5171.48000 4553.4000 6094.6600 5976.880 7713.8800
## 2 237.5170 225.65300 225.01700 134.2670 228.1800 235.723 262.2080
## 3 100.4390 159.24800 136.26700 45.6665 166.1320 132.373 47.6219
## 4 1123.9900 1258.00000 923.85900 619.6690 1369.2800 1215.570 1102.9000
## 5 34.7732 9.23039 8.90546 29.3123 60.9262 55.368 61.0796
## 6 348.8550 399.54000 429.15500 342.0000 1161.0500 486.014 754.3390
## GSM227880 GSM227869 GSM227872 GSM227873 GSM227875 GSM227879 GSM227881
## 1 6781.240 4948.9900 5029.5700 5892.1700 6534.86000 8393.7200 6367.6200
## 2 254.174 196.7490 188.7460 202.1670 174.62200 178.9760 220.6430
## 3 127.556 160.1350 125.7430 163.9650 51.95170 140.1410 325.3390
## 4 1377.010 1195.4000 1176.8900 959.1010 926.23500 984.3440 1850.0100
## 5 83.120 37.0302 30.8798 31.6348 4.08673 64.7151 15.3595
## 6 760.100 403.2180 478.7310 475.1940 393.89700 527.1790 469.2370
## GSM227882
## 1 8266.4800
## 2 215.6060
## 3 140.8650
## 4 1200.3600
## 5 29.2318
## 6 499.7910
First, we will check whether the data are normalized. It’s easy to check this with a boxplot: if the data are normalized, the boxes should appear at more or less the same level.
Since the data are a data.frame, it’s very simple to construct a boxplot for the different columns of the data.frame:
boxplot(data)
A problem with this boxplot is that there are some very large values and very many small values, so we cannot really get a clear picture of the data. This is common in microarrays, where the majority of the genes are expressed at a low level, whereas a small portion of the genes are expressed at a high level. Therefore, we usually transform the data using the \( \log_2 \) transformation.
data2 <- log2(data) ## again remember that most of the operations in R are performed in an element-wise manner.
##now let's make a boxplot
boxplot(data2, main=expression(paste("Boxplot of the ", log2, " data")))
It’s obvious that the data are normalized, since the boxplots of all columns are at about the same level. Normalization is necessary when we want to perform comparisons between different samples. Thus, we are now ready to proceed with the differential expression analysis, that is, to detect the genes that behave differently in the different classes of the data.
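As a complementary numeric check of the normalization, the per-sample medians should be nearly identical:

## per-sample medians; similar values support the normalization claim
round(apply(data2, 2, median), 2)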
As a first step, we need to see how our data are partitioned. Thus, we visit the NCBI web page where the dataset is stored (the GDS3309 entry page).
The Experimental Design of the GDS3309 Dataset
This is a very simple experimental design with only one factor, smoking, with two levels: “control” and “smokers”. There are 8 control samples and 7 smokers.
labels <- c(rep("control", 8), rep("smokers", 7))
Let’s now perform hypothesis testing for only one gene, the first one.
The idea behind the t-test is that we summarize the information of two vectors of values, each belonging to one class (e.g., “control” and “smokers”), in a single number. This number is larger when the means of the two classes differ a lot relative to the variance within the classes. Thus we define a new quantity called the t-statistic as \( t = \frac{\bar{x}_1 - \bar{x}_2}{\mathrm{VAR}} \), where \( \bar{x}_1 \) and \( \bar{x}_2 \) are the means of each class and VAR is a function of the variances of the two classes. Moreover, we know that the t-statistic follows a certain distribution when the two classes come from the SAME population, i.e., when the two vectors of values are sampled from just a single population. This allows us to do something very important: we know what values of the t-statistic to expect when the two vectors differ only due to random fluctuations within a single population, not because they come from different populations. If the absolute value of the t-statistic is too large to be considered ‘part of this distribution’, then we reject the hypothesis that the two vectors have been obtained from the same distribution.
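As a quick sanity check of this definition, we can compute the t-statistic by hand on two simulated vectors (v1 and v2 below are made up for illustration) using the Welch denominator \( \sqrt{s_1^2/n_1 + s_2^2/n_2} \), which is what R’s t.test uses by default, and compare it with the value that t.test reports:

set.seed(1)
v1 <- rnorm(8)            # hypothetical class-1 values
v2 <- rnorm(7, mean = 1)  # hypothetical class-2 values
## t-statistic computed by hand (Welch denominator)
t.manual <- (mean(v1) - mean(v2)) / sqrt(var(v1)/length(v1) + var(v2)/length(v2))
t.manual
t.test(v1, v2)$statistic  # should match t.manual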
The function t.test performs the t-test. As arguments it accepts two vectors, one for each class. The order of the two vectors is not important. The help of this function returns the following information:
t.test {stats}: R Documentation
Performs one and two sample t-tests on vectors of data.
t.test(x, ...)
## Default S3 method:
t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)
## S3 method for class 'formula':
t.test(formula, data, subset, na.action, ...)
Arguments:
x: a (non-empty) numeric vector of data values.
y: an optional (non-empty) numeric vector of data values.
alternative: a character string specifying the alternative hypothesis; must be one of "two.sided" (default), "greater" or "less".
mu: a number indicating the true value of the mean (or difference in means if you are performing a two sample test).
paired: a logical indicating whether you want a paired t-test.
var.equal: a logical variable indicating whether to treat the two variances as being equal. If TRUE, the pooled variance is used to estimate the variance; otherwise the Welch approximation to the degrees of freedom is used.
conf.level: confidence level of the interval.
formula: a formula of the form lhs ~ rhs, where lhs is a numeric variable giving the data values and rhs a factor with two levels giving the corresponding groups.
data: an optional matrix or data frame (or similar: see model.frame) containing the variables in the formula.
subset: an optional vector specifying a subset of observations to be used.
na.action: a function which indicates what should happen when the data contain NAs; defaults to getOption("na.action").
...: further arguments to be passed to or from methods.
The formula interface is only applicable for the two-sample tests. t.test accepts two vectors, one for the first class and another for the second. The rest of the arguments control details of the test. For example, if you know that the variances of the two classes are equal, you can set var.equal=TRUE. In general, however, this is not known, so we set var.equal=FALSE.
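As an illustration of the formula interface (a sketch assuming the data2 and labels objects defined above), the call for the first gene could also be written as:

## the same test via the formula interface
d <- data.frame(expr = unlist(data2[1, ]), group = factor(labels))
t.test(expr ~ group, data = d)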
Now, let’s perform the t-test on the first gene.
result <- t.test( data2[1, labels == 'control'], data2[1, labels == 'smokers'], var.equal=F )
result
##
## Welch Two Sample t-test
##
## data: data2[1, labels == "control"] and data2[1, labels == "smokers"]
## t = -0.77695, df = 11.362, p-value = 0.4531
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4260015 0.2030735
## sample estimates:
## mean of x mean of y
## 12.52459 12.63605
The result of the t-test contains a lot of information. In fact, two values are of main interest to us: the p-value and the difference of the means.
## To obtain the names of the objects we can use the function names
names(result)
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "alternative" "method" "data.name"
result$p.value
## [1] 0.4530608
diff(-result$estimate)
## mean of y
## -0.111464
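The somewhat cryptic diff(-result$estimate) computes the mean of x minus the mean of y; the same difference can be written more explicitly:

## difference of the means, control minus smokers
unname(result$estimate[1] - result$estimate[2])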
The p-value tells us whether the difference of the means is large enough, given the sample we have, to reject the null hypothesis. Note that if the sample size is large, we can reject the null hypothesis even when the difference between the means is small. See the following example:
v1 <- rnorm(100, 0, 1)
v2 <- rnorm(100, 0.2, 1)
t.test(v1, v2)
##
## Welch Two Sample t-test
##
## data: v1 and v2
## t = -0.76654, df = 198, p-value = 0.4443
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3872247 0.1704509
## sample estimates:
## mean of x mean of y
## 0.01062515 0.11901205
v3 <- rnorm(10000, 0, 1)
v4 <- rnorm(10000, 0.2, 1)
t.test(v3, v4)
##
## Welch Two Sample t-test
##
## data: v3 and v4
## t = -13.655, df = 19998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2198125 -0.1646283
## sample estimates:
## mean of x mean of y
## 0.01512736 0.20734777
Thus, even though the two samples come from two different distributions, with means 0 and 0.2 respectively, in the first case we cannot reject the null, whereas in the second case we can. This is because the sample size is much larger in the second case.
Therefore, it is not enough to report only the p-value; we should also report the difference between the means. The difference between the means (or related quantities, such as ratios) may be biologically important.
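Since the data are on the \( \log_2 \) scale, the difference of the per-gene class means is a log2 fold change. A minimal sketch of computing it for all genes at once (the name log2fc is ours):

## per-gene log2 fold change, smokers minus control
log2fc <- rowMeans(data2[, labels == "smokers"]) - rowMeans(data2[, labels == "control"])
head(round(log2fc, 3))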
Next we have to obtain the p-values for all genes, so we need to apply the t.test function to every row of the expression matrix. However, there is a complication: t.test takes two vectors, whereas R’s apply function passes each whole row as a single vector. Thus, we write a small wrapper function that splits a row into the two classes and calls t.test.
my.ttest <- function(v, labels)
{
    ## the two class names (e.g., "control" and "smokers")
    levels <- unique(labels)
    ## split the row vector into the two classes
    v1 <- v[labels == levels[1]]
    v2 <- v[labels == levels[2]]
    ## return only the p-value of the (Welch) t-test
    pval <- t.test(v1, v2, var.equal = FALSE)$p.value
    pval
}
Now, we can easily apply this function for the whole matrix
allpvalues <- apply(data2, 1, my.ttest, labels)
## let's see which p values are smaller than a threshold
sig.inds <- which(allpvalues < 0.001)
sig.inds
## [1] 675 1962 1963 1964 1965 2630 2896 5276 6441 7044 9185
## [12] 14024 14458
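To see which probes and genes these row indices correspond to, recall that the first two columns of raw.data contain the probe ID and the gene name:

## probe IDs and gene names of the significant rows
raw.data[sig.inds, 1:2]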
There are only a few rows (genes) with a p-value smaller than the specified threshold. Now we can visualize these results; a nice way is to use a heatmap. The intuition is that if we used only these genes to represent the expression values of the two classes, the two sets of samples should separate. We will use the function heatmap.2 from the package gplots.
library(gplots)
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
data.sig <- as.matrix(data2[sig.inds,])
rownames(data.sig) <- raw.data[sig.inds,2]
## Note that heatmap.2 accepts only matrices
heatmap.2(as.matrix(data.sig))
Note that, unlike the base heatmap function, heatmap.2 does not scale the values by default (scale="none"). To make the per-gene patterns comparable across samples, we can ask for row scaling explicitly:
heatmap.2(as.matrix(data.sig), scale="row")
Let’s now look at the measurements for a single gene (e.g., the first significant one).
value <- data.sig[1,]
names <- labels
names.ind <- (names == 'smokers') + 0
mydata1 <- value[names == 'control']
mydata2 <- value[names == 'smokers']
boxplot(list(control=mydata1, smokers=mydata2))
plot(names.ind, value)
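To anticipate the linear-model view that follows, we can overlay the fitted regression line on this 0/1 scatterplot:

## add the regression line through the two groups of points
abline(lm(value ~ names.ind), col = "red")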
The structure of a basic linear model is: \( outcome_i = (model) + error_i \). In our case, for the expression value \( A_i \) of a gene in sample \( i \), this becomes: \( A_i = (b_0 + b_1 G_i) + \epsilon_i \).
In other words, we want to explain the expression level of a gene as a function of the class it belongs to. Here, the variable \( G_i \) is either 0 or 1, denoting the class. Consider first the data of class ‘0’. If we replace \( G_i \) with 0, then \( A_0 = (b_0 + b_1 \cdot 0) + \epsilon_0 = b_0 + \epsilon_0 \). This tells us that the expression data for class ‘0’ should be around the value \( b_0 \); all the variation within the class is captured by the \( \epsilon_0 \) term.
Now, for the data of class ‘1’, \( A_1 = (b_0 + b_1 \cdot 1) + \epsilon_1 = b_0 + b_1 + \epsilon_1 \). Therefore, the two classes differ by \( b_1 \), which is the effect that class ‘1’ adds to the data.
Since both \( \epsilon \) terms have a mean of 0, the mean value of class ‘1’ equals \( mean_1 = b_0 + b_1 \); therefore \( b_1 = mean_1 - b_0 = mean_1 - mean_0 \). Now, if we consider the regression line that passes through the two mean values, its slope equals \( b_1 \), since by definition the slope is \( \frac{mean_1 - mean_0}{1 - 0} \). Note again that the two classes differ only by \( b_1 \).
Eventually, we want to figure out whether \( b_1 \) is significantly different from 0. If it is not, the two classes do not differ statistically, or at least we cannot claim with the data at hand that they differ. To test exactly this hypothesis we can use R’s lm function.
lm.model = lm(value~as.factor(names))
summary(lm.model)
##
## Call:
## lm(formula = value ~ as.factor(names))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00480 -0.14384 -0.06412 0.28407 0.73724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.2439 0.1877 33.264 5.77e-14 ***
## as.factor(names)smokers 1.4129 0.2748 5.142 0.000189 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5309 on 13 degrees of freedom
## Multiple R-squared: 0.6704, Adjusted R-squared: 0.645
## F-statistic: 26.44 on 1 and 13 DF, p-value: 0.0001893
as.factor(names)
## [1] control control control control control control control control
## [9] smokers smokers smokers smokers smokers smokers smokers
## Levels: control smokers
If we calculate the means for the two classes, we have:
## The mean of the '0' sample is the intercept
mean1 <- mean(mydata1)
mean2 <- mean(mydata2)
mean(mydata2) - mean(mydata1)
## [1] 1.412925
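These should match the fitted coefficients of the linear model: the intercept is the control mean and the slope is the difference of the means.

## intercept = mean of controls, slope = difference of the means
coef(lm.model)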
Also, if we perform a t-test (with var.equal=TRUE, matching the equal-variance assumption of the linear model):
myttest <- t.test(mydata1, mydata2, var.equal = T)
myttest
##
## Two Sample t-test
##
## data: mydata1 and mydata2
## t = -5.1421, df = 13, p-value = 0.0001893
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.0065440 -0.8193069
## sample estimates:
## mean of x mean of y
## 6.243862 7.656787
We see that the t-test is equivalent to testing whether a linear model has a zero or non-zero slope. Using the linear model becomes advantageous when we want more than one independent variable (e.g., smoking/non-smoking and male/female).
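As a hypothetical sketch of such an extension (the sex vector below is invented for illustration; it is not part of GDS3309), a second factor enters the model simply as an additional term:

## 'sex' is a made-up factor, for illustration only
set.seed(42)
sex <- factor(sample(c("female", "male"), length(value), replace = TRUE))
## linear model with two independent variables: smoking status and sex
summary(lm(value ~ as.factor(names) + sex))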