GEO stands for Gene Expression Omnibus. It is a public database for microarray and other functional-genomics datasets, hosted at NCBI. Its address is https://www.ncbi.nlm.nih.gov/gds. We will go through an example.
Let’s say that we want to analyze a dataset related to the effect of smoking in humans. In GEO DataSets (as in other NCBI databases), we can use keywords to search for specific entries. Here, I’ll use the keyword “smoking” in the title and “human” as the organism. Searching with these terms returns a list of datasets; we choose the third one for this example, GDS3309.
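As a side note, the Bioconductor package GEOquery can fetch the same dataset programmatically; the sketch below assumes GEOquery is installed (the rest of this walkthrough uses the manually downloaded SOFT file instead).

## optional: fetch GDS3309 directly from within R (requires the GEOquery package)
library(GEOquery)
gds <- getGEO("GDS3309")   # downloads and parses the SOFT file
head(Table(gds))           # the expression table, similar to the cleaned file below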
We will proceed with the following steps: clean the dataset, load it into R, check whether the data are normalized, and then perform the differential expression analysis.
First, let’s clean the dataset with the grep command. Cleaning means removing all lines that start with a ^, a !, or a #.
grep -v '^[!#^]' GDS3309.soft > GDS3309.clean
We need to inspect the file (you can use less -S GDS3309.clean) to check whether there are NULL characters and, in general, whether the file “looks good”.
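For instance, we can also peek at the first few lines of the cleaned file from within R:

## a quick look at the first lines of the cleaned file, from within R
head(readLines("GDS3309.clean"), n = 3)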
raw.data <- read.table("GDS3309.clean", sep="\t", header=TRUE)
dim(raw.data)
## [1] 22277 17
The first two columns contain the probe ID and the gene name for each row. Thus, we create a data frame that does not contain these two columns.
data <- raw.data[,-c(1,2)]
## Let's have a look at the head of the file
head(data)
## GSM227868 GSM227870 GSM227871 GSM227874 GSM227876 GSM227877 GSM227878
## 1 5236.4800 6183.20000 5171.48000 4553.4000 6094.6600 5976.880 7713.8800
## 2 237.5170 225.65300 225.01700 134.2670 228.1800 235.723 262.2080
## 3 100.4390 159.24800 136.26700 45.6665 166.1320 132.373 47.6219
## 4 1123.9900 1258.00000 923.85900 619.6690 1369.2800 1215.570 1102.9000
## 5 34.7732 9.23039 8.90546 29.3123 60.9262 55.368 61.0796
## 6 348.8550 399.54000 429.15500 342.0000 1161.0500 486.014 754.3390
## GSM227880 GSM227869 GSM227872 GSM227873 GSM227875 GSM227879 GSM227881
## 1 6781.240 4948.9900 5029.5700 5892.1700 6534.86000 8393.7200 6367.6200
## 2 254.174 196.7490 188.7460 202.1670 174.62200 178.9760 220.6430
## 3 127.556 160.1350 125.7430 163.9650 51.95170 140.1410 325.3390
## 4 1377.010 1195.4000 1176.8900 959.1010 926.23500 984.3440 1850.0100
## 5 83.120 37.0302 30.8798 31.6348 4.08673 64.7151 15.3595
## 6 760.100 403.2180 478.7310 475.1940 393.89700 527.1790 469.2370
## GSM227882
## 1 8266.4800
## 2 215.6060
## 3 140.8650
## 4 1200.3600
## 5 29.2318
## 6 499.7910
First, we will check whether the data are normalized. It’s easy to check this with a boxplot: if the data are normalized, the boxes should appear at more or less the same level.
Since the data are a data.frame, it’s very simple to construct a boxplot for the different columns of the data.frame:
boxplot(data)
A problem with this boxplot is that there are some very large values and very many small values, so we cannot really get a clear picture of the data. This is common in microarrays, where the majority of the genes are expressed at a low level, whereas a small portion of the genes are expressed at a high level. Therefore, we usually transform the data using the \( \log_2 \) transformation.
data2 <- log2(data) ## again remember that most of the operations in R are performed in an element-wise manner.
##now let's make a boxplot
boxplot(data2, main=expression(paste("Boxplot of the ", log2, " data")))
It’s obvious that the data are normalized, since the boxplots of all columns are at about the same level. Normalization is necessary when we want to perform comparisons between different samples. Thus, we are now ready to proceed with the differential expression analysis, that is, to detect the genes that behave differently in the different classes of the data.
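As a complementary numeric check of the normalization, the per-sample medians should be nearly identical:

## per-sample medians; similar values support the normalization claim
round(apply(data2, 2, median), 2)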
As a first step, we need to see how our data are partitioned. Thus, we visit the NCBI web page where the dataset is stored (the GDS3309 entry page).
The Experimental Design of the GDS3309 Dataset
This is a very simple experimental design with only one factor, smoking, with two levels: “control” and “smokers”. There are 8 control samples and 7 smokers.
labels <- c(rep("control", 8), rep("smokers", 7))
Let’s now perform hypothesis testing for only one gene, the first one.
The idea behind the t-test is that we summarize the information of two vectors of values, each belonging to one class (e.g., “control” and “smokers”), in a single number. This number is larger when the means of the two classes differ a lot relative to the variance within the classes. Thus we define a new quantity called the t-statistic as \( t = \frac{\bar{x}_1 - \bar{x}_2}{\mathrm{VAR}} \), where \( \bar{x}_1 \) and \( \bar{x}_2 \) are the means of each class and VAR is a function of the variances of the two classes. Moreover, we know that the t-statistic follows a certain distribution when the two classes come from the SAME population, i.e., when the two vectors of values are sampled from just a single population. This allows us to do something very important: we know what values of the t-statistic to expect when the two vectors differ only due to random fluctuations within a single population, not because they come from different populations. If the absolute value of the t-statistic is too large to be considered ‘part of this distribution’, then we reject the hypothesis that the two vectors have been obtained from the same distribution.
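As a quick sanity check of this definition, we can compute the t-statistic by hand on two simulated vectors (v1 and v2 below are made up for illustration) using the Welch denominator \( \sqrt{s_1^2/n_1 + s_2^2/n_2} \), which is what R’s t.test uses by default, and compare it with the value that t.test reports:

set.seed(1)
v1 <- rnorm(8)            # hypothetical class-1 values
v2 <- rnorm(7, mean = 1)  # hypothetical class-2 values
## t-statistic computed by hand (Welch denominator)
t.manual <- (mean(v1) - mean(v2)) / sqrt(var(v1)/length(v1) + var(v2)/length(v2))
t.manual
t.test(v1, v2)$statistic  # should match t.manual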
The function t.test performs the t-test. As arguments it accepts two vectors, one for each class. The order of the two vectors is not important. The help of this function returns the following information:
t.test {stats}: R Documentation
Performs one and two sample t-tests on vectors of data.
t.test(x, ...)
## Default S3 method:
t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)
## S3 method for class 'formula':
t.test(formula, data, subset, na.action, ...)
Arguments:
x: a (non-empty) numeric vector of data values.
y: an optional (non-empty) numeric vector of data values.
alternative: a character string specifying the alternative hypothesis; must be one of "two.sided" (default), "greater" or "less".
mu: a number indicating the true value of the mean (or difference in means if you are performing a two sample test).
paired: a logical indicating whether you want a paired t-test.
var.equal: a logical variable indicating whether to treat the two variances as being equal. If TRUE, the pooled variance is used to estimate the variance; otherwise the Welch approximation to the degrees of freedom is used.
conf.level: confidence level of the interval.
formula: a formula of the form lhs ~ rhs, where lhs is a numeric variable giving the data values and rhs a factor with two levels giving the corresponding groups.
data: an optional matrix or data frame (or similar: see model.frame) containing the variables in the formula.
subset: an optional vector specifying a subset of observations to be used.
na.action: a function which indicates what should happen when the data contain NAs; defaults to getOption("na.action").
...: further arguments to be passed to or from methods.
The formula interface is only applicable for the two-sample tests. t.test accepts two vectors, one for the first class and another for the second. The rest of the arguments control details of the test. For example, if you know that the variances of the two classes are equal, you can set var.equal=TRUE. In general, however, this is not known, so we set var.equal=FALSE.
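As an illustration of the formula interface (a sketch assuming the data2 and labels objects defined above), the call for the first gene could also be written as:

## the same test via the formula interface
d <- data.frame(expr = unlist(data2[1, ]), group = factor(labels))
t.test(expr ~ group, data = d)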
Now, let’s perform the t-test on the first gene.
result <- t.test( data2[1, labels == 'control'], data2[1, labels == 'smokers'], var.equal=F )
result
##
## Welch Two Sample t-test
##
## data: data2[1, labels == "control"] and data2[1, labels == "smokers"]
## t = -0.77695, df = 11.362, p-value = 0.4531
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4260015 0.2030735
## sample estimates:
## mean of x mean of y
## 12.52459 12.63605
The result of the t-test contains a lot of information. In fact, two values are of main interest to us: the p-value and the difference of the means.
## To obtain the names of the objects we can use the function names
names(result)
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "alternative" "method" "data.name"
result$p.value
## [1] 0.4530608
diff(-result$estimate)
## mean of y
## -0.111464
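The somewhat cryptic diff(-result$estimate) computes the mean of x minus the mean of y; the same difference can be written more explicitly:

## difference of the means, control minus smokers
unname(result$estimate[1] - result$estimate[2])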
The p-value tells us whether the difference of the means is large enough, given the sample we have, to reject the null hypothesis. Note that if the sample size is large, we can reject the null hypothesis even when the difference between the means is small. See the following example:
v1 <- rnorm(100, 0, 1)
v2 <- rnorm(100, 0.2, 1)
t.test(v1, v2)
##
## Welch Two Sample t-test
##
## data: v1 and v2
## t = -0.76654, df = 198, p-value = 0.4443
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3872247 0.1704509
## sample estimates:
## mean of x mean of y
## 0.01062515 0.11901205
v3 <- rnorm(10000, 0, 1)
v4 <- rnorm(10000, 0.2, 1)
t.test(v3, v4)
##
## Welch Two Sample t-test
##
## data: v3 and v4
## t = -13.655, df = 19998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2198125 -0.1646283
## sample estimates:
## mean of x mean of y
## 0.01512736 0.20734777
Thus, even though the two samples come from two different distributions, with means 0 and 0.2 respectively, in the first case we cannot reject the null, whereas in the second case we can. This is because the sample size is much larger in the second case.
Therefore, it is not enough to report only the p-value; we should also report the difference between the means. The difference between the means (or related quantities, such as ratios) may be biologically important.
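Since the data are on the \( \log_2 \) scale, the difference of the per-gene class means is a log2 fold change. A minimal sketch of computing it for all genes at once (the name log2fc is ours):

## per-gene log2 fold change, smokers minus control
log2fc <- rowMeans(data2[, labels == "smokers"]) - rowMeans(data2[, labels == "control"])
head(round(log2fc, 3))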
Next we have to obtain the p-values for all genes, so we need to apply the t.test function to every row of the expression matrix. However, there is a complication: t.test takes two vectors, whereas R’s apply function passes each whole row as a single vector. Thus, we write a small wrapper function that splits a row into the two classes and calls t.test.
my.ttest <- function(v, labels)
{
    ## the two class names (e.g., "control" and "smokers")
    levels <- unique(labels)
    ## split the row vector into the two classes
    v1 <- v[labels == levels[1]]
    v2 <- v[labels == levels[2]]
    ## return only the p-value of the (Welch) t-test
    pval <- t.test(v1, v2, var.equal = FALSE)$p.value
    pval
}
Now, we can easily apply this function for the whole matrix
allpvalues <- apply(data2, 1, my.ttest, labels)
## let's see which p values are smaller than a threshold
sig.inds <- which(allpvalues < 0.001)
sig.inds
## [1] 675 1962 1963 1964 1965 2630 2896 5276 6441 7044 9185
## [12] 14024 14458
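To see which probes and genes these row indices correspond to, recall that the first two columns of raw.data contain the probe ID and the gene name:

## probe IDs and gene names of the significant rows
raw.data[sig.inds, 1:2]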
There are only a few rows (genes) with a p-value smaller than the specified threshold. Now we can visualize these results; a nice way is to use a heatmap. The intuition is that if we used only these genes to represent the expression values of the two classes, the two sets of samples should separate. We will use the function heatmap.2 from the package gplots.
library(gplots)
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
data.sig <- as.matrix(data2[sig.inds,])
rownames(data.sig) <- raw.data[sig.inds,2]
## Note that heatmap.2 accepts only matrices
heatmap.2(as.matrix(data.sig))
Note that, unlike the base heatmap function, heatmap.2 does not scale the values by default (scale="none"). To make the per-gene patterns comparable across samples, we can ask for row scaling explicitly:
heatmap.2(as.matrix(data.sig), scale="row")
Let’s now look at the measurements for a single gene (e.g., the first significant one).
value <- data.sig[1,]
names <- labels
names.ind <- (names == 'smokers') + 0
mydata1 <- value[names == 'control']
mydata2 <- value[names == 'smokers']
boxplot(list(control=mydata1, smokers=mydata2))
plot(names.ind, value)
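To anticipate the linear-model view that follows, we can overlay the fitted regression line on this 0/1 scatterplot:

## add the regression line through the two groups of points
abline(lm(value ~ names.ind), col = "red")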
The structure of a basic linear model is: \( outcome_i = (model) + error_i \). In our case, for the expression value \( A_i \) of a gene in sample \( i \), this becomes: \( A_i = (b_0 + b_1 G_i) + \epsilon_i \).
In other words, we want to explain the expression level of a gene as a function of the class it belongs to. Here, the variable \( G_i \) is either 0 or 1, denoting the class. Consider first the data of class ‘0’. If we replace \( G_i \) with 0, then \( A_0 = (b_0 + b_1 \cdot 0) + \epsilon_0 = b_0 + \epsilon_0 \). This tells us that the expression data for class ‘0’ should be around the value \( b_0 \); all the variation within the class is captured by the \( \epsilon_0 \) term.
Now, for the data of class ‘1’, \( A_1 = (b_0 + b_1 \cdot 1) + \epsilon_1 = b_0 + b_1 + \epsilon_1 \). Therefore, the two classes differ by \( b_1 \), which is the effect that class ‘1’ adds to the data.
Since both \( \epsilon \) terms have a mean of 0, the mean value of class ‘1’ equals \( mean_1 = b_0 + b_1 \); therefore \( b_1 = mean_1 - b_0 = mean_1 - mean_0 \). Now, if we consider the regression line that passes through the two mean values, its slope equals \( b_1 \), since by definition the slope is \( \frac{mean_1 - mean_0}{1 - 0} \). Note again that the two classes differ only by \( b_1 \).
Eventually, we want to figure out whether \( b_1 \) is significantly different from 0. If it is not, the two classes do not differ statistically, or at least we cannot claim with the data at hand that they differ. To test exactly this hypothesis we can use R’s lm function.
lm.model = lm(value~as.factor(names))
summary(lm.model)
##
## Call:
## lm(formula = value ~ as.factor(names))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00480 -0.14384 -0.06412 0.28407 0.73724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.2439 0.1877 33.264 5.77e-14 ***
## as.factor(names)smokers 1.4129 0.2748 5.142 0.000189 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5309 on 13 degrees of freedom
## Multiple R-squared: 0.6704, Adjusted R-squared: 0.645
## F-statistic: 26.44 on 1 and 13 DF, p-value: 0.0001893
as.factor(names)
## [1] control control control control control control control control
## [9] smokers smokers smokers smokers smokers smokers smokers
## Levels: control smokers
If we calculate the means for the two classes, we have:
## The mean of the '0' sample is the intercept
mean1 <- mean(mydata1)
mean2 <- mean(mydata2)
mean(mydata2) - mean(mydata1)
## [1] 1.412925
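These should match the fitted coefficients of the linear model: the intercept is the control mean and the slope is the difference of the means.

## intercept = mean of controls, slope = difference of the means
coef(lm.model)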
Also, if we perform a t-test (with var.equal=TRUE, matching the equal-variance assumption of the linear model):
myttest <- t.test(mydata1, mydata2, var.equal = T)
myttest
##
## Two Sample t-test
##
## data: mydata1 and mydata2
## t = -5.1421, df = 13, p-value = 0.0001893
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.0065440 -0.8193069
## sample estimates:
## mean of x mean of y
## 6.243862 7.656787
We see that the t-test is equivalent to testing whether a linear model has a zero or non-zero slope. Using the linear model becomes advantageous when we want more than one independent variable (e.g., smoking/non-smoking and male/female).
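As a hypothetical sketch of such an extension (the sex vector below is invented for illustration; it is not part of GDS3309), a second factor enters the model simply as an additional term:

## 'sex' is a made-up factor, for illustration only
set.seed(42)
sex <- factor(sample(c("female", "male"), length(value), replace = TRUE))
## linear model with two independent variables: smoking status and sex
summary(lm(value ~ as.factor(names) + sex))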