TBIO599
Anton Wellstein, Garrett Graham
11/29/16
In the next set of slides we will be working in RStudio, so if you need help installing R/RStudio, let me know.
After installing and starting RStudio, your interface should look something like this:
Bioconductor is a repository of software packages designed to deal with biological data, written in the R or C languages and meant to be run in R. The next step will to be to use the Bioconductor installer to install the framework for additional packages you will use in your analysis.
source("http://bioconductor.org/biocLite.R")
biocLite()
Use the Bioconductor installer to install packages that we will need for microarray analysis. The Bioconductor installer can be invoked as shown below, with the package name.
biocLite("limma")
biocLite("GEOquery")
biocLite("oligo")
biocLite("hgu133plus2.db")
biocLite("pd.hg.u133.plus.2")
biocLite("annotate")
biocLite("topGO")
biocLite("genefilter")
install.packages("gplots")
install.packages("ggplot2")
install.packages("wesanderson")
install.packages("xtable")
To use functions in a package, packages need to be loaded by name with the library() function.
library(limma)
library(GEOquery)
library(oligo)
library(gplots)
library(annotate)
library(topGO)
library(genefilter)
library(hgu133plus2.db)
library(ggplot2)
library(wesanderson)
library(pd.hg.u133.plus.2)
library(xtable)
This GEO dataset is located at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7513 and has accession number GSE7531. These files can be downloaded and expanded manually or using the R commands listed.
getGEOSuppFiles(GEO = "GSE7513", makeDirectory = TRUE)
This command will create a new directory named “GSE7513”, containing a compressed archive of the raw data from GSE7513.
untar("./GSE7513/GSE7513_RAW.tar", exdir = "./GSE7513")
This command will uncompress the archive (.tar file) into the parent directory.
Processing a group of Affymetrix CEL files. If not starting from a GEO archive, this is where you would begin with your own CEL files.
celfilelist<-list.files(path = "./GSE7513/", pattern = ".CEL", full.names = TRUE)
This command will list all files in GSE7513 that have “.CEL” in the filename.
cf<-read.celfiles(filenames = celfilelist)
print(class(cf))
[1] "ExpressionFeatureSet"
attr(,"package")
[1] "oligoClasses"
This command will read the list of CEL files (object “celfilelist”) and store them in the cf ExpressionFeatureSet object
What is an ExpressionFeatureSet?
print(cf)
ExpressionFeatureSet (storageMode: lockedEnvironment)
assayData: 1354896 features, 29 samples
element names: exprs
protocolData
rowNames: GSM194580.CEL.gz GSM194581.CEL.gz ... GSM194616.CEL.gz
(29 total)
varLabels: exprs dates
varMetadata: labelDescription channel
phenoData
rowNames: GSM194580.CEL.gz GSM194581.CEL.gz ... GSM194616.CEL.gz
(29 total)
varLabels: index
varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation: pd.hg.u133.plus.2
This object stores all of your samples and contains information about the appropriate annotation for this type of chip (“pd.hg.u133.plus.2”)
Histogram of intensity before normalization
hist(cf)
Boxplot of intensity before normalization
boxplot(cf)
Normalizing Affy array data using RMA
cf.norm<-rma(object = cf)
Background correcting
Normalizing
Calculating Expression
It is important to normalize your arrays together if you intend to compare values across arrays. A number of normalization methods were developed using mismatch probes, GC correction, or quantile normalization methods. RMA has been used consistently and performs well on Affymetrix array design.
Histogram of intensity after normalization
hist(cf.norm)
Boxplot of intensity after normalization
boxplot(cf.norm)
Illumina data is provided in a different format out of the chip reader - they are usually single summary files of all probes across the set of chips used.
il<-read.ilmn(files = "./Illumina/2014-319 sample probe profile_c.csv", ctrlfiles = "./Illumina/control probes_c.csv", sep = ",")
Reading file ./Illumina/2014-319 sample probe profile_c.csv ... ...
Reading file ./Illumina/control probes_c.csv ... ...
The container object for Illumina data is different, but can also be normalized using limma functions.
print(class(il))
[1] "EListRaw"
attr(,"package")
[1] "limma"
The neqc() function in limma uses background probes for normalization and performs quantile correction on an Illumina data object.
il.norm<-neqc(il)
Un-normalized
boxplot(log2(il$E), ylim=c(2, 17), col = colorRampPalette(wes_palette("Cavalcanti"))(16))
After quantile normalization
boxplot(il.norm$E, ylim=c(2, 17), col = colorRampPalette(wes_palette("Cavalcanti"))(16))
Filtering out low expression and low variance probes can improve later statistics. Using the genefilter() function, we can remove all probes that have less than log2(100) expression in more than 20% of samples and have an inter-quartile range less than 0.25.
filter<-genefilter(cf.norm, filterfun(pOverA(p = 0.2, A = log2(100)), function(x) (IQR(x) > 0.25)))
cf.norm<-cf.norm[filter,]
Number of probes removed
table(filter)
filter
FALSE TRUE
41681 12994
To make comparisons between groups, you will need to create a design matrix that describes your experiment.
s<-factor(c(rep('nn', 9), rep('other', 10), rep('nn', 5), rep('other', 5)))
mod<-model.matrix(~0+s)
Model matrix
print(head(mod, n = 25))
snn sother
1 1 0
2 1 0
3 1 0
4 1 0
5 1 0
6 1 0
7 1 0
8 1 0
9 1 0
10 0 1
11 0 1
12 0 1
13 0 1
14 0 1
15 0 1
16 0 1
17 0 1
18 0 1
19 0 1
20 1 0
21 1 0
22 1 0
23 1 0
24 1 0
25 0 1
Using your model matrix, you can then fit a model to your data.
fit1<-lmFit(cf.norm, mod)
Specify the comparisons you want to make within your model groups.
contr<-makeContrasts(snn-sother, levels=mod)
fit2<-contrasts.fit(fit = fit1, contrasts = contr)
fit3<-eBayes(fit2)
Using Limma functions, you can output differential expression tables from the contrasts specified earlier.
topTable(fit3)
logFC AveExpr t P.Value adj.P.Val
205624_at 4.854132 8.606239 11.852840 4.523845e-13 5.878284e-09
207134_x_at 4.157247 8.189943 11.113062 2.316839e-12 1.242165e-08
205683_x_at 4.210277 8.166218 11.018618 2.867859e-12 1.242165e-08
207741_x_at 3.604946 7.668160 10.732445 5.511768e-12 1.790498e-08
216474_x_at 4.149657 8.223559 10.487394 9.723602e-12 2.486243e-08
210084_x_at 4.021627 7.454531 10.416323 1.148027e-11 2.486243e-08
202286_s_at -4.845465 9.374239 -9.837543 4.548391e-11 8.443114e-08
201596_x_at -4.671954 9.156176 -9.740114 5.759272e-11 9.354498e-08
208650_s_at -4.142221 8.165348 -9.382051 1.385942e-10 2.000992e-07
217023_x_at 3.268101 7.497625 9.145264 2.500162e-10 3.248710e-07
B
205624_at 19.44162
207134_x_at 17.93582
205683_x_at 17.73808
207741_x_at 17.13120
216474_x_at 16.60221
210084_x_at 16.44717
202286_s_at 15.15719
201596_x_at 14.93522
208650_s_at 14.10748
217023_x_at 13.54975
You can also define the p value adjustment, number of genes to output, and the statistic to sort by (as well as other parameters)
print(xtable(topTable(fit3, number = 8, adjust.method = "BH", sort.by = "p"), digits = c(0, 2, 2, 2, -2, -2, 2)), type="html")
| logFC | AveExpr | t | P.Value | adj.P.Val | B | |
|---|---|---|---|---|---|---|
| 205624_at | 4.85 | 8.61 | 11.85 | 4.52E-13 | 5.88E-09 | 19.44 |
| 207134_x_at | 4.16 | 8.19 | 11.11 | 2.32E-12 | 1.24E-08 | 17.94 |
| 205683_x_at | 4.21 | 8.17 | 11.02 | 2.87E-12 | 1.24E-08 | 17.74 |
| 207741_x_at | 3.60 | 7.67 | 10.73 | 5.51E-12 | 1.79E-08 | 17.13 |
| 216474_x_at | 4.15 | 8.22 | 10.49 | 9.72E-12 | 2.49E-08 | 16.60 |
| 210084_x_at | 4.02 | 7.45 | 10.42 | 1.15E-11 | 2.49E-08 | 16.45 |
| 202286_s_at | -4.85 | 9.37 | -9.84 | 4.55E-11 | 8.44E-08 | 15.16 |
| 201596_x_at | -4.67 | 9.16 | -9.74 | 5.76E-11 | 9.35E-08 | 14.94 |
You can also add gene symbols to your output by probe name.
tab<-topTable(fit3, number = 3e6, adjust.method = "BH", sort.by = "p")
pr<-row.names(tab)
sy<-getSYMBOL(pr, data = "hgu133plus2.db")
df<-data.frame(Symbol = sy, tab)
print(xtable(head(df, n = 8), digits = c(0, 0, 2, 2, 2, -2, -2, 2)), type='html')
| Symbol | logFC | AveExpr | t | P.Value | adj.P.Val | B | |
|---|---|---|---|---|---|---|---|
| 205624_at | CPA3 | 4.85 | 8.61 | 11.85 | 4.52E-13 | 5.88E-09 | 19.44 |
| 207134_x_at | TPSB2 | 4.16 | 8.19 | 11.11 | 2.32E-12 | 1.24E-08 | 17.94 |
| 205683_x_at | TPSAB1 | 4.21 | 8.17 | 11.02 | 2.87E-12 | 1.24E-08 | 17.74 |
| 207741_x_at | TPSAB1 | 3.60 | 7.67 | 10.73 | 5.51E-12 | 1.79E-08 | 17.13 |
| 216474_x_at | 4.15 | 8.22 | 10.49 | 9.72E-12 | 2.49E-08 | 16.60 | |
| 210084_x_at | TPSAB1 | 4.02 | 7.45 | 10.42 | 1.15E-11 | 2.49E-08 | 16.45 |
| 202286_s_at | TACSTD2 | -4.85 | 9.37 | -9.84 | 4.55E-11 | 8.44E-08 | 15.16 |
| 201596_x_at | KRT18 | -4.67 | 9.16 | -9.74 | 5.76E-11 | 9.35E-08 | 14.94 |
You can also write the complete table to a text file for manipulation in other programs.
write.table(df, file = "./array_data_norm.txt", sep="\t", quote = FALSE)
This saves a file named “array_data_norm.txt” in the current working directory.
R has a number of heatmap functions. Here we will select the top 30 most significant probes and plot them in a heatmap.
df.subset<-df[1:30,]
ex<-exprs(cf.norm)[row.names(df.subset),]
sy.subset<-df.subset$Symbol
heatmap.2(ex, trace = "none", scale="row", margins = c(10, 7), col = colorRampPalette(c("red", "black", "green")), labRow = sy.subset)
You can also plot any one of the arrays against another.
dr<-data.frame(S1=exprs(cf.norm)[,1], S2=exprs(cf.norm)[,4])
m<-ggplot(dr, aes(x=S1, y=S2))
m<-m+geom_point()
m<-m+theme_classic(base_size = 12)
m
You can plot all FC values vs p values in a volcano plot.
dat<-data.frame(df, n_log10_adj_pval = -c(log10(df$adj.P.Val)))
a<-ggplot(dat, aes(x = logFC, y = n_log10_adj_pval))
a<-a+ylab("-log10(adjusted P value)\n")
a<-a+xlab("logFC")
a<-a+theme_classic(base_size = 12)
a<-a+theme(legend.position="none")
a<-a+scale_color_manual(values = wes_palette("Zissou")[c(1, 3, 5)])
a<-a+geom_point()
plot(a)
You can plot all FC values vs p values in a volcano plot.
dat<-data.frame(df, n_log10_adj_pval = -c(log10(df$adj.P.Val)), col=ifelse(abs(df$logFC)>2 & df$adj.P.Val<0.01, 'A', ifelse(abs(df$logFC)>2, 'B', 'C')))
dat$Symbol<-c(as.character(dat$Symbol[1]), rep(NA, 6), as.character(dat$Symbol[8]), rep(NA, 36), as.character(dat$Symbol[45]), rep(NA, length(dat$Symbol)-45))
a<-ggplot(dat, aes(x = logFC, y = n_log10_adj_pval, col=col, label=Symbol))
a<-a+ylab("-log10(adjusted P value)\n")
a<-a+xlab("logFC")
a<-a+theme_classic(base_size = 12)
a<-a+theme(legend.position="none")
a<-a+scale_color_manual(values = wes_palette("Zissou")[c(5, 3, 1)])
a<-a+geom_point()
a<-a+geom_vline(xintercept=c(2, -2))
a<-a+geom_hline(yintercept = -log10(0.01))
a<-a+geom_label()
plot(a)
You can also select individual components to plot by probe or by gene symbol associated with a probe.
rn<-row.names(df)[which(sy=="CPA3")]
ex<-exprs(cf.norm)[rn,]
if(!is.vector(ex)){
varfilter<-apply(ex, 1, var)
ex<-ex[which.max(varfilter),]
}
pl<-data.frame(Intensity = ex, Phenotype = s)
a<-ggplot(pl, aes(y = Intensity, x = Phenotype, fill = Phenotype))
a<-a+theme_classic(base_size = 16)
a<-a+scale_y_log10(breaks=c(1, 2, 4, 6, 8, 10, 12))
a<-a+ggtitle("CPA3 Expression\n")
a<-a+ylab("Log2 Intensity")
a<-a+xlab('')
a<-a+scale_fill_manual(values = wes_palette("Zissou")[c(1,4)], guide=FALSE)
a<-a+scale_x_discrete(labels=c('CD44+\nCD24-', 'Other'))
a<-a+geom_boxplot()
plot(a)
You can also select individual components to plot by probe or by gene symbol associated with a probe.
rn<-row.names(df)[which(sy=="TPSB2")]
ex<-exprs(cf.norm)[rn,]
if(!is.vector(ex)){
varfilter<-apply(ex, 1, var)
ex<-ex[which.max(varfilter),]
}
pl<-data.frame(Intensity = ex, Phenotype = s)
a<-ggplot(pl, aes(y = Intensity, x = Phenotype, fill = Phenotype))
a<-a+theme_classic(base_size = 16)
a<-a+scale_y_log10(breaks=c(1, 2, 4, 6, 8, 10, 12))
a<-a+ggtitle("TPSB2 Expression\n")
a<-a+ylab("Log2 Intensity")
a<-a+xlab('')
a<-a+scale_fill_manual(values = wes_palette("Zissou")[c(1,4)], guide=FALSE)
a<-a+scale_x_discrete(labels=c('CD44+\nCD24-', 'Other'))
a<-a+geom_boxplot()
plot(a)
Gene set enrichment using the topGO package
selectDiff<-function(x){
return(x <= 0.01)
}
gVec<-df$adj.P.Val
names(gVec)<-row.names(df)
tG<-new("topGOdata", ontology="BP", allGenes = gVec, annot = annFUN.db, geneSel = selectDiff, affyLib = "hgu133plus2", nodeSize = 10)
print(tG)
------------------------- topGOdata object -------------------------
Description:
-
Ontology:
- BP
12994 available genes (all genes from the array):
- symbol: 205624_at 207134_x_at 205683_x_at 207741_x_at 216474_x_at ...
- score : 5.878284e-09 1.24216513e-08 1.24216513e-08 1.79049782e-08 2.48624332e-08 ...
- 1029 significant genes.
11014 feasible genes (genes that can be used in the analysis):
- symbol: 205624_at 207134_x_at 205683_x_at 207741_x_at 216474_x_at ...
- score : 5.878284e-09 1.24216513e-08 1.24216513e-08 1.79049782e-08 2.48624332e-08 ...
- 834 significant genes.
GO graph (nodes with at least 10 genes):
- a graph with directed edges
- number of nodes = 5592
- number of edges = 12726
------------------------- topGOdata object -------------------------
Gene set enrichment using the topGO package
elim <- runTest(tG, algorithm = "elim", statistic = "ks")
weight01 <- runTest(tG, algorithm = "weight01", statistic = "ks")
tabGO<-GenTable(object = tG, Elim = elim, Weight = weight01, orderBy = "Weight", ranksOf = "Elim", topNodes = 20)
Gene set enrichment using the topGO package
print(xtable(tabGO[1:4,], digits=c(0, 0, 0, 0, 0, 0, 0, -2, -2)), type = "html")
| GO.ID | Term | Annotated | Significant | Expected | Rank in Elim | Elim | Weight | |
|---|---|---|---|---|---|---|---|---|
| 1 | GO:0050871 | positive regulation of B cell activation | 118 | 57 | 9 | 3 | 8.4e-28 | < 1e-30 |
| 2 | GO:0050853 | B cell receptor signaling pathway | 103 | 60 | 8 | 1 | < 1e-30 | < 1e-30 |
| 3 | GO:0006910 | phagocytosis, recognition | 56 | 43 | 4 | 2 | 9.8e-30 | 9.8e-30 |
| 4 | GO:0006958 | complement activation, classical pathway | 68 | 43 | 5 | 4 | 1.8e-25 | 1.8e-25 |