TBIO599

Microarray data, limma, and GEO

Anton Wellstein, Garrett Graham
11/29/16

Microarray Technology

Installing R/RStudio

  • Using R, you can easily manage large data sets and combine data sources
  • RStudio is an IDE (Integrated Development Environment) for R that can be used to manage code, files, and objects in your R environment
  • Using R through RStudio provides an interface through which you can normalize data, generate plots, and perform downstream analysis

Installing R/RStudio

In the next set of slides we will be working in RStudio, so if you need help installing R/RStudio, let me know.

Your RStudio interface

After installing and starting RStudio, your interface should look something like this: RStudio

Installing Bioconductor

Bioconductor is a repository of software packages designed to deal with biological data, written in the R or C languages and meant to be run in R. The next step will to be to use the Bioconductor installer to install the framework for additional packages you will use in your analysis.

source("http://bioconductor.org/biocLite.R")
biocLite()

Installing Bioconductor packages

Use the Bioconductor installer to install packages that we will need for microarray analysis. The Bioconductor installer can be invoked as shown below, with the package name.

biocLite("limma")
biocLite("GEOquery")
biocLite("oligo")
biocLite("hgu133plus2.db")
biocLite("pd.hg.u133.plus.2")
biocLite("annotate")
biocLite("topGO")
biocLite("genefilter")
install.packages("gplots")
install.packages("ggplot2")
install.packages("wesanderson")
install.packages("xtable")

Loading Bioconductor packages

To use functions in a package, packages need to be loaded by name with the library() function.

library(limma)
library(GEOquery)
library(oligo)
library(gplots)
library(annotate)
library(topGO)
library(genefilter)
library(hgu133plus2.db)
library(ggplot2)
library(wesanderson)
library(pd.hg.u133.plus.2)
library(xtable)

Downloading GEO data

This GEO dataset is located at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7513 and has accession number GSE7531. These files can be downloaded and expanded manually or using the R commands listed.

getGEOSuppFiles(GEO = "GSE7513", makeDirectory = TRUE)

This command will create a new directory named “GSE7513”, containing a compressed archive of the raw data from GSE7513.

untar("./GSE7513/GSE7513_RAW.tar", exdir = "./GSE7513")

This command will uncompress the archive (.tar file) into the parent directory.

Processing Affy microarray data

Processing a group of Affymetrix CEL files. If not starting from a GEO archive, this is where you would begin with your own CEL files.

celfilelist<-list.files(path = "./GSE7513/", pattern = ".CEL", full.names = TRUE)

This command will list all files in GSE7513 that have “.CEL” in the filename.

cf<-read.celfiles(filenames = celfilelist)
print(class(cf))
[1] "ExpressionFeatureSet"
attr(,"package")
[1] "oligoClasses"

This command will read the list of CEL files (object “celfilelist”) and store them in the cf ExpressionFeatureSet object

Processing Affy microarray data

What is an ExpressionFeatureSet?

print(cf)
ExpressionFeatureSet (storageMode: lockedEnvironment)
assayData: 1354896 features, 29 samples 
  element names: exprs 
protocolData
  rowNames: GSM194580.CEL.gz GSM194581.CEL.gz ... GSM194616.CEL.gz
    (29 total)
  varLabels: exprs dates
  varMetadata: labelDescription channel
phenoData
  rowNames: GSM194580.CEL.gz GSM194581.CEL.gz ... GSM194616.CEL.gz
    (29 total)
  varLabels: index
  varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation: pd.hg.u133.plus.2 

This object stores all of your samples and contains information about the appropriate annotation for this type of chip (“pd.hg.u133.plus.2”)

Processing Affy microarray data

Histogram of intensity before normalization

hist(cf)

plot of chunk bioc11

Boxplot of intensity before normalization

boxplot(cf)

plot of chunk bioc12

Processing Affy microarray data

Normalizing Affy array data using RMA

cf.norm<-rma(object = cf)
Background correcting
Normalizing
Calculating Expression

It is important to normalize your arrays together if you intend to compare values across arrays. A number of normalization methods were developed using mismatch probes, GC correction, or quantile normalization methods. RMA has been used consistently and performs well on Affymetrix array design.

  • See also: filtering on variance (genefilter package)
  • Other normalization methods have been developed for Illumina, Agilent, Nanostring, etc.

Processing Affy microarray data

Histogram of intensity after normalization

hist(cf.norm)

plot of chunk bioc14

Boxplot of intensity after normalization

boxplot(cf.norm)

plot of chunk bioc15

Processing Illumina microarray data

Illumina data is provided in a different format out of the chip reader - they are usually single summary files of all probes across the set of chips used.

il<-read.ilmn(files = "./Illumina/2014-319 sample probe profile_c.csv", ctrlfiles = "./Illumina/control probes_c.csv", sep = ",")
Reading file ./Illumina/2014-319 sample probe profile_c.csv ... ...
Reading file ./Illumina/control probes_c.csv ... ...

The container object for Illumina data is different, but can also be normalized using limma functions.

print(class(il))
[1] "EListRaw"
attr(,"package")
[1] "limma"

Processing Illumina microarray data

The neqc() function in limma uses background probes for normalization and performs quantile correction on an Illumina data object.

il.norm<-neqc(il)

Processing Illumina microarray data

Un-normalized

boxplot(log2(il$E), ylim=c(2, 17), col = colorRampPalette(wes_palette("Cavalcanti"))(16))

plot of chunk ilmn4

After quantile normalization

boxplot(il.norm$E, ylim=c(2, 17), col = colorRampPalette(wes_palette("Cavalcanti"))(16))

plot of chunk ilmn5

Analyzing microarray data

Filtering out low expression and low variance probes can improve later statistics. Using the genefilter() function, we can remove all probes that have less than log2(100) expression in more than 20% of samples and have an inter-quartile range less than 0.25.

filter<-genefilter(cf.norm, filterfun(pOverA(p = 0.2, A = log2(100)), function(x) (IQR(x) > 0.25)))
cf.norm<-cf.norm[filter,]

Number of probes removed

table(filter)
filter
FALSE  TRUE 
41681 12994 

Analyzing microarray data

To make comparisons between groups, you will need to create a design matrix that describes your experiment.

s<-factor(c(rep('nn', 9), rep('other', 10), rep('nn', 5), rep('other', 5)))
mod<-model.matrix(~0+s)

Model matrix

print(head(mod, n = 25))
   snn sother
1    1      0
2    1      0
3    1      0
4    1      0
5    1      0
6    1      0
7    1      0
8    1      0
9    1      0
10   0      1
11   0      1
12   0      1
13   0      1
14   0      1
15   0      1
16   0      1
17   0      1
18   0      1
19   0      1
20   1      0
21   1      0
22   1      0
23   1      0
24   1      0
25   0      1

Analyzing microarray data

Using your model matrix, you can then fit a model to your data.

fit1<-lmFit(cf.norm, mod)

Specify the comparisons you want to make within your model groups.

contr<-makeContrasts(snn-sother, levels=mod)
fit2<-contrasts.fit(fit = fit1, contrasts = contr)
fit3<-eBayes(fit2)

Analyzing microarray data

Using Limma functions, you can output differential expression tables from the contrasts specified earlier.

topTable(fit3)
                logFC  AveExpr         t      P.Value    adj.P.Val
205624_at    4.854132 8.606239 11.852840 4.523845e-13 5.878284e-09
207134_x_at  4.157247 8.189943 11.113062 2.316839e-12 1.242165e-08
205683_x_at  4.210277 8.166218 11.018618 2.867859e-12 1.242165e-08
207741_x_at  3.604946 7.668160 10.732445 5.511768e-12 1.790498e-08
216474_x_at  4.149657 8.223559 10.487394 9.723602e-12 2.486243e-08
210084_x_at  4.021627 7.454531 10.416323 1.148027e-11 2.486243e-08
202286_s_at -4.845465 9.374239 -9.837543 4.548391e-11 8.443114e-08
201596_x_at -4.671954 9.156176 -9.740114 5.759272e-11 9.354498e-08
208650_s_at -4.142221 8.165348 -9.382051 1.385942e-10 2.000992e-07
217023_x_at  3.268101 7.497625  9.145264 2.500162e-10 3.248710e-07
                   B
205624_at   19.44162
207134_x_at 17.93582
205683_x_at 17.73808
207741_x_at 17.13120
216474_x_at 16.60221
210084_x_at 16.44717
202286_s_at 15.15719
201596_x_at 14.93522
208650_s_at 14.10748
217023_x_at 13.54975

Analyzing microarray data

You can also define the p value adjustment, number of genes to output, and the statistic to sort by (as well as other parameters)

print(xtable(topTable(fit3, number = 8, adjust.method = "BH", sort.by = "p"), digits = c(0, 2, 2, 2, -2, -2, 2)), type="html")
logFC AveExpr t P.Value adj.P.Val B
205624_at 4.85 8.61 11.85 4.52E-13 5.88E-09 19.44
207134_x_at 4.16 8.19 11.11 2.32E-12 1.24E-08 17.94
205683_x_at 4.21 8.17 11.02 2.87E-12 1.24E-08 17.74
207741_x_at 3.60 7.67 10.73 5.51E-12 1.79E-08 17.13
216474_x_at 4.15 8.22 10.49 9.72E-12 2.49E-08 16.60
210084_x_at 4.02 7.45 10.42 1.15E-11 2.49E-08 16.45
202286_s_at -4.85 9.37 -9.84 4.55E-11 8.44E-08 15.16
201596_x_at -4.67 9.16 -9.74 5.76E-11 9.35E-08 14.94

Annotating microarray data

You can also add gene symbols to your output by probe name.

tab<-topTable(fit3, number = 3e6, adjust.method = "BH", sort.by = "p")
pr<-row.names(tab)
sy<-getSYMBOL(pr, data = "hgu133plus2.db")
df<-data.frame(Symbol = sy, tab)
print(xtable(head(df, n = 8), digits = c(0, 0, 2, 2, 2, -2, -2, 2)), type='html')
Symbol logFC AveExpr t P.Value adj.P.Val B
205624_at CPA3 4.85 8.61 11.85 4.52E-13 5.88E-09 19.44
207134_x_at TPSB2 4.16 8.19 11.11 2.32E-12 1.24E-08 17.94
205683_x_at TPSAB1 4.21 8.17 11.02 2.87E-12 1.24E-08 17.74
207741_x_at TPSAB1 3.60 7.67 10.73 5.51E-12 1.79E-08 17.13
216474_x_at 4.15 8.22 10.49 9.72E-12 2.49E-08 16.60
210084_x_at TPSAB1 4.02 7.45 10.42 1.15E-11 2.49E-08 16.45
202286_s_at TACSTD2 -4.85 9.37 -9.84 4.55E-11 8.44E-08 15.16
201596_x_at KRT18 -4.67 9.16 -9.74 5.76E-11 9.35E-08 14.94

Exporting microarray data

You can also write the complete table to a text file for manipulation in other programs.

write.table(df, file = "./array_data_norm.txt", sep="\t", quote = FALSE)

This saves a file named “array_data_norm.txt” in the current working directory.

Visualizing microarray data

R has a number of heatmap functions. Here we will select the top 30 most significant probes and plot them in a heatmap.

df.subset<-df[1:30,]
ex<-exprs(cf.norm)[row.names(df.subset),]
sy.subset<-df.subset$Symbol
heatmap.2(ex, trace = "none", scale="row", margins = c(10, 7), col = colorRampPalette(c("red", "black", "green")), labRow = sy.subset)

plot of chunk plot1

Visualizing microarray data

You can also plot any one of the arrays against another.

dr<-data.frame(S1=exprs(cf.norm)[,1], S2=exprs(cf.norm)[,4])
m<-ggplot(dr, aes(x=S1, y=S2))
m<-m+geom_point()
m<-m+theme_classic(base_size = 12)
m

plot of chunk plot_ex2

Visualizing microarray data

You can plot all FC values vs p values in a volcano plot.

dat<-data.frame(df, n_log10_adj_pval = -c(log10(df$adj.P.Val)))
a<-ggplot(dat, aes(x = logFC, y = n_log10_adj_pval))
a<-a+ylab("-log10(adjusted P value)\n")
a<-a+xlab("logFC")
a<-a+theme_classic(base_size = 12)
a<-a+theme(legend.position="none")
a<-a+scale_color_manual(values = wes_palette("Zissou")[c(1, 3, 5)])
a<-a+geom_point()
plot(a)

plot of chunk plot3

Visualizing microarray data

You can plot all FC values vs p values in a volcano plot.

dat<-data.frame(df, n_log10_adj_pval = -c(log10(df$adj.P.Val)), col=ifelse(abs(df$logFC)>2 & df$adj.P.Val<0.01, 'A', ifelse(abs(df$logFC)>2, 'B', 'C')))
dat$Symbol<-c(as.character(dat$Symbol[1]),  rep(NA, 6), as.character(dat$Symbol[8]), rep(NA, 36), as.character(dat$Symbol[45]), rep(NA, length(dat$Symbol)-45))
a<-ggplot(dat, aes(x = logFC, y = n_log10_adj_pval, col=col, label=Symbol))
a<-a+ylab("-log10(adjusted P value)\n")
a<-a+xlab("logFC")
a<-a+theme_classic(base_size = 12)
a<-a+theme(legend.position="none")
a<-a+scale_color_manual(values = wes_palette("Zissou")[c(5, 3, 1)])
a<-a+geom_point()
a<-a+geom_vline(xintercept=c(2, -2))
a<-a+geom_hline(yintercept = -log10(0.01))
a<-a+geom_label()
plot(a)

plot of chunk plot3f

Visualizing microarray data

You can also select individual components to plot by probe or by gene symbol associated with a probe.

rn<-row.names(df)[which(sy=="CPA3")]
ex<-exprs(cf.norm)[rn,]
if(!is.vector(ex)){
  varfilter<-apply(ex, 1, var)
  ex<-ex[which.max(varfilter),]
  }
pl<-data.frame(Intensity = ex, Phenotype = s)
a<-ggplot(pl, aes(y = Intensity, x = Phenotype, fill = Phenotype))
a<-a+theme_classic(base_size = 16)
a<-a+scale_y_log10(breaks=c(1, 2, 4, 6, 8, 10, 12))
a<-a+ggtitle("CPA3 Expression\n")
a<-a+ylab("Log2 Intensity")
a<-a+xlab('')
a<-a+scale_fill_manual(values = wes_palette("Zissou")[c(1,4)], guide=FALSE)
a<-a+scale_x_discrete(labels=c('CD44+\nCD24-', 'Other'))
a<-a+geom_boxplot()
plot(a)

plot of chunk plot5

Visualizing microarray data

You can also select individual components to plot by probe or by gene symbol associated with a probe.

rn<-row.names(df)[which(sy=="TPSB2")]
ex<-exprs(cf.norm)[rn,]
if(!is.vector(ex)){
  varfilter<-apply(ex, 1, var)
  ex<-ex[which.max(varfilter),]
  }
pl<-data.frame(Intensity = ex, Phenotype = s)
a<-ggplot(pl, aes(y = Intensity, x = Phenotype, fill = Phenotype))
a<-a+theme_classic(base_size = 16)
a<-a+scale_y_log10(breaks=c(1, 2, 4, 6, 8, 10, 12))
a<-a+ggtitle("TPSB2 Expression\n")
a<-a+ylab("Log2 Intensity")
a<-a+xlab('')
a<-a+scale_fill_manual(values = wes_palette("Zissou")[c(1,4)], guide=FALSE)
a<-a+scale_x_discrete(labels=c('CD44+\nCD24-', 'Other'))
a<-a+geom_boxplot()
plot(a)

plot of chunk plot7

Analyzing microarray data

Gene set enrichment using the topGO package

selectDiff<-function(x){
  return(x <= 0.01)
}
gVec<-df$adj.P.Val
names(gVec)<-row.names(df)
tG<-new("topGOdata", ontology="BP", allGenes = gVec, annot = annFUN.db, geneSel = selectDiff, affyLib = "hgu133plus2", nodeSize = 10)
print(tG)

------------------------- topGOdata object -------------------------

 Description:
   -   

 Ontology:
   -  BP 

 12994 available genes (all genes from the array):
   - symbol:  205624_at 207134_x_at 205683_x_at 207741_x_at 216474_x_at  ...
   - score :  5.878284e-09 1.24216513e-08 1.24216513e-08 1.79049782e-08 2.48624332e-08  ...
   - 1029  significant genes. 

 11014 feasible genes (genes that can be used in the analysis):
   - symbol:  205624_at 207134_x_at 205683_x_at 207741_x_at 216474_x_at  ...
   - score :  5.878284e-09 1.24216513e-08 1.24216513e-08 1.79049782e-08 2.48624332e-08  ...
   - 834  significant genes. 

 GO graph (nodes with at least  10  genes):
   - a graph with directed edges
   - number of nodes = 5592 
   - number of edges = 12726 

------------------------- topGOdata object -------------------------

Analyzing microarray data

Gene set enrichment using the topGO package

elim <- runTest(tG, algorithm = "elim", statistic = "ks")
weight01 <- runTest(tG, algorithm = "weight01", statistic = "ks")
tabGO<-GenTable(object = tG, Elim = elim, Weight = weight01, orderBy = "Weight", ranksOf = "Elim", topNodes = 20)

Analyzing microarray data

Gene set enrichment using the topGO package

print(xtable(tabGO[1:4,], digits=c(0, 0, 0, 0, 0, 0, 0, -2, -2)), type = "html")
GO.ID Term Annotated Significant Expected Rank in Elim Elim Weight
1 GO:0050871 positive regulation of B cell activation 118 57 9 3 8.4e-28 < 1e-30
2 GO:0050853 B cell receptor signaling pathway 103 60 8 1 < 1e-30 < 1e-30
3 GO:0006910 phagocytosis, recognition 56 43 4 2 9.8e-30 9.8e-30
4 GO:0006958 complement activation, classical pathway 68 43 5 4 1.8e-25 1.8e-25