GEO Expression Analysis

Without using Bioconductor

GEO stands for Gene Expression Omnibus. It’s a public database for microarray datasets and it’s hosted at NCBI. Its address is We will go through some example.

Let’s say that we want to analyze a dataset related to the effect of smoking in humans. In GEO Datasets (as well as in NCBI), we can use keywords to search for specific entries. Now, I’ll use the keywords “Smoking” in title, and “Human” as an organism. The search results are here. We choose the third datasets for this example, i.e. the GDS3309.

We will proceed with the following steps:

  1. Download the dataset. For this use the DataSet SOFT file.
  2. Unzip the dataset
  3. Clean the dataset, using the appropriate combination of grep command. Cleaning means that you should remove all lines starting with an ^, or a ! or a #.
  4. The final dataset should contain as many columns as the samples plus 2 additional. The first and the second column of the datasets are the probe IDs and the Gene IDs, respectively. Note Genes and Probes are not the same. However, most of the times you can treat probe IDs as ‘genes’. Read this forum.
  5. Visit again the website from which you obtained the dataset. Go to the bottom of the page, where it’s written “Experiment design and value distribution”. Click on that (even though it seems that it’s not clickable). Now, you can obtain more details for the dataset. The important information is the classes of each sample (e.g. healthy, sick etc). This information is important if you want to setup your experiment correctly.
  6. Now we are ready to proceed with the analysis of the dataset in R.

First, let’s clean the dataset

grep -v [\!\^\#] GDS3309.soft > GDS3309.clean

We need to inspect the file (you can use ls -S GDS3309.clean) to see what are the NULL characters and in general if the file “looks good” <- read.table("GDS3309.clean", sep="\t", header=TRUE)
## [1] 22277    17

The first two columns contain the probe ID and the gene name for each row. Thus, we need to create a file that does not contain these two columns.

data <-[,-c(1,2)]

## Let's have a look at the head of the file
##   GSM227868  GSM227870  GSM227871 GSM227874 GSM227876 GSM227877 GSM227878
## 1 5236.4800 6183.20000 5171.48000 4553.4000 6094.6600  5976.880 7713.8800
## 2  237.5170  225.65300  225.01700  134.2670  228.1800   235.723  262.2080
## 3  100.4390  159.24800  136.26700   45.6665  166.1320   132.373   47.6219
## 4 1123.9900 1258.00000  923.85900  619.6690 1369.2800  1215.570 1102.9000
## 5   34.7732    9.23039    8.90546   29.3123   60.9262    55.368   61.0796
## 6  348.8550  399.54000  429.15500  342.0000 1161.0500   486.014  754.3390
##   GSM227880 GSM227869 GSM227872 GSM227873  GSM227875 GSM227879 GSM227881
## 1  6781.240 4948.9900 5029.5700 5892.1700 6534.86000 8393.7200 6367.6200
## 2   254.174  196.7490  188.7460  202.1670  174.62200  178.9760  220.6430
## 3   127.556  160.1350  125.7430  163.9650   51.95170  140.1410  325.3390
## 4  1377.010 1195.4000 1176.8900  959.1010  926.23500  984.3440 1850.0100
## 5    83.120   37.0302   30.8798   31.6348    4.08673   64.7151   15.3595
## 6   760.100  403.2180  478.7310  475.1940  393.89700  527.1790  469.2370
##   GSM227882
## 1 8266.4800
## 2  215.6060
## 3  140.8650
## 4 1200.3600
## 5   29.2318
## 6  499.7910

First, we will check if the data are normalized. It’s easy to check this via a boxplot. If the data are normalized then, they should appear more or less at the same level in a boxplot.

Since the data are a data.frame, it’s very simple to construct a boxplot for the different columns of the data.frame:


A problem that appears in this boxplot is that there some very large values and really many small values. Thus, we cannot really have a clear picture of the data. This is common in microarrays, where the majority of the genes are expressed at low level, whereas a small portion of the genes are expressed at high level. Therefore, we usually transform the data using the log_2 transformation.

data2 <- log2(data) ## again remember that most of the operations in R are performed in an element-wise manner. 

##now let's make a boxplot
boxplot(data2, main=expression(paste("Boxplot of the ", log2, " data")))

It’s obvious that data are normalized since the boxplot for each column is about at the same level. Normalization is necessary when we want to perform comparisons between different samples. Thus, we are ready now to proceed with the differential expression analysis, that is to detect the genes that behave differently in different classes of the data.

Differential Expression analysis

At step 1, we will need to see how are data are partitioned. Thus, we visit the NCBI web page where the dataset is stored. This is, here

The Experimental Design of the GDS3309 Dataset

The Experimental Design of the GDS3309 Dataset

This is a very simple experimental design with only one factor, smoking, with two levels: “control” and “smokers”. Control samples are 8, whereas there are 7 smokers.

labels <- c(rep("control", 8), rep("smokers", 7))

Let’s now perform hypothesis testing for only one gene, the first one.

The idea behind the t-statistic

The idea behind the t-test is that we summarize the information of two vectors of values, each belonging to one class (e.g., “control” and “smokers”) in a single number. This numbers is larger when the values of the two classes differ a lot in comparison to the variance within the classes. Thus we define a new quantity called t-statistic as ( t = ), where ( _1 ) and ( _2 ) are the means of each class and VAR is a function of the variances of the two classes. Moreover, we know how the t-statistic follows a certain distribution when the two classes come from the SAME population, i.e., the two vectors of values are sampled from just a single population. This allows us to do something very important: We know what values of the t-statistic to expect when the two vectors do not differ because they come from different populations, but they may just differ due to random fluctuations within a population. If the value of the t-statistic is too large (absolute value) to be considered a ‘part of this distribution’, then we reject the hypothesis that the two vectors have been obtained from the same distribution.

The function t.test performs the t-test. As arguments it accepts two vectors, one for each class. The order of the two vectors is not important. The help of this function returns the following information:

t.test {stats} R Documentation

Student’s t-Test


Performs one and two sample t-tests on vectors of data.


t.test(x, ...)

## Default S3 method:
t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)

## S3 method for class 'formula'
t.test(formula, data, subset, na.action, ...)



a (non-empty) numeric vector of data values.


an optional (non-empty) numeric vector of data values.


a character string specifying the alternative hypothesis, must be one of “two.sided” (default), “greater” or “less”. You can specify just the initial letter.


a number indicating the true value of the mean (or difference in means if you are performing a two sample test).


a logical indicating whether you want a paired t-test.


a logical variable indicating whether to treat the two variances as being equal. If TRUE then the pooled variance is used to estimate the variance otherwise the Welch (or Satterthwaite) approximation to the degrees of freedom is used.


confidence level of the interval.


a formula of the form lhs ~ rhs where lhs is a numeric variable giving the data values and rhs a factor with two levels giving the corresponding groups.


an optional matrix or data frame (or similar: see model.frame) containing the variables in the formula formula. By default the variables are taken from environment(formula).


an optional vector specifying a subset of observations to be used.


a function which indicates what should happen when the data contain NAs. Defaults to getOption(“na.action”).

further arguments to be passed to or from methods.


The formula interface is only applicable for the 2-sample tests. The t.test accepts two vectors one from the first class and another for the second class. The rest of the arguments control details about the test. For example, if you know that the variances of the two classes are equal, you can set that var.equal=TRUE. In general, however, this is not known, thus we set that var.equal=F.

Now, let’s perform the t.test in the first gene.

result <- t.test( data2[1, labels == 'control'], data2[1, labels == 'smokers'], var.equal=F )
##  Welch Two Sample t-test
## data:  data2[1, labels == "control"] and data2[1, labels == "smokers"]
## t = -0.77695, df = 11.362, p-value = 0.4531
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4260015  0.2030735
## sample estimates:
## mean of x mean of y 
##  12.52459  12.63605

The result of the t-test returns a lot of information:

  • The value of the t-statistic: -0.77695: This means that the ‘control’ has a bit lower average than the ‘smokers’
  • The df, or degrees of freedom: this is a parameter for the t distribution. It is related to the sample sizes.
  • The p-value: i.e., the probability to observe such a value of the t-statistic or even more extreme when the null hypothesis is valid (i.e., when the two sets of values – the vectors – come from the same population).
  • The alternative hypothesis: Here it is True difference in means is different than 0. Well, 0 means that they come from the same population. Different than 0 mean that the two groups of values come from different population, however, we don’t know which population produces higher and which one produces lower values.
  • 95 percent confidence interval. The meaning of this interval is the following: From the analysis we have done, we find some difference of between the two means. Here this difference is: 12.52459-12.63605 = -0.11146. However, we have assessed this difference by using a sample from the population. This means that perhaps the real difference between the means of the two poopulations might be different than this value that we have calculated. However, if we would calculate the interval with the same way, then 95% of the time would contain the true value 95% of the times.

In fact, there are two values that are interesting for us. The first, is the p-value and the second is the difference of the means, i.e.

## To obtain the names of the objects we can use the function names
## [1] "statistic"   "parameter"   "p.value"     ""    "estimate"   
## [6] "null.value"  "alternative" "method"      ""
## [1] 0.4530608
## mean of y 
## -0.111464

The p-value will tell us if the difference of the means is large enough so that we reject the null hypothesis with the sample that we have. If the sample size is large, then we can reject the null hypothesis even if the difference between the means is small. See the following example:

v1 <- rnorm(100, 0, 1)
v2 <- rnorm(100, 0.2, 1)
t.test(v1, v2)
##  Welch Two Sample t-test
## data:  v1 and v2
## t = -0.76654, df = 198, p-value = 0.4443
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3872247  0.1704509
## sample estimates:
##  mean of x  mean of y 
## 0.01062515 0.11901205
v3 <- rnorm(10000, 0, 1)
v4 <- rnorm(10000, 0.2, 1)
t.test(v3, v4)
##  Welch Two Sample t-test
## data:  v3 and v4
## t = -13.655, df = 19998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2198125 -0.1646283
## sample estimates:
##  mean of x  mean of y 
## 0.01512736 0.20734777

Thus, even though the two samples are coming from two distributions, with mean 0 and 0.2, respectively, in the first case we cannot reject the null, whereas in the second case we can reject the null. This is because the sample size is large in the second case.

Therefore, it is not only enough to report the p-value but also the difference between the means. The difference between the means (or other values, such as ratios etc may be biologically important).

Next we have to obtain the p-values for all genes. Thus, we need to apply the function t.test for all the rows of the expression matrix. However, there is a problem. The t.test is applied for two vectors. On the other hand, the apply function of the R, is applied on whole rows. Thus, we create a function that will appropriately use the t.test.

my.ttest <- function(v, labels)
  levels <- unique(labels)
  v1 <- v[ labels == levels[1]]
  v2 <- v[ labels == levels[2]]
  pval <- t.test(v1, v2, var.equal = F)$p.value

Now, we can easily apply this function for the whole matrix

allpvalues <- apply(data2, 1, my.ttest, labels)

## let's see which p values are smaller than a threshold
sig.inds <- which(allpvalues < 0.001)
##  [1]   675  1962  1963  1964  1965  2630  2896  5276  6441  7044  9185
## [12] 14024 14458

There are only a few rows (genes) with p-value smaller than the specified threshold. Now, we can visualize these results. A nice way is to use the heatmaps. So, an intuitive result is that if we would use only these gene to represent the expression values of the two classes, then the two sets of samples would be separated. We will use the function heatmap.2 from the package gplots

## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##     lowess
data.sig <- as.matrix(data2[sig.inds,])
rownames(data.sig) <-[sig.inds,2]
## Note that heatmap.2 accepts only matrices