Author: Nitesh Turaga, May 6th 2014
Getting the working directory, in my case
getwd()
[1] "/Users/niteshturaga/Documents/GeneExpressionDataAnalysis/affyData"
Setting the working directory, in my case:
setwd("~/Documents/GeneExpressionDataAnalysis/affyData/")
Check the date
date()
[1] "Thu May 8 23:09:56 2014"
Clear the workspace, note the two “embedded” functions
rm(list = ls())
We are going to use the following packages from bioconductor:
require(affy)
Loading required package: affy
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following object is masked from 'package:stats':
xtabs
The following objects are masked from 'package:base':
anyDuplicated, append, as.data.frame, as.vector, cbind,
colnames, duplicated, eval, evalq, Filter, Find, get,
intersect, is.unsorted, lapply, Map, mapply, match, mget,
order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unlist
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
require(limma)
Loading required package: limma
Attaching package: 'limma'
The following object is masked from 'package:BiocGenerics':
plotMA
require(hgu95a.db)
Loading required package: hgu95a.db
Loading required package: AnnotationDbi
Loading required package: org.Hs.eg.db
Loading required package: DBI
require(annotate)
Loading required package: annotate
Load data previously stored
# Gene expression data
load("./objs/affyData.rda")
# Load linear model results
load("./objs/linearModel.rda")
Lets create the Functional Gene Set (FGS) list. We will get the data from the annotation medata package hgu95a.db . Use the package name as a function for listing the annotation content
hgu95a()
Quality control information for hgu95a:
This package has the following mappings:
hgu95aACCNUM has 12626 mapped keys (of 12626 keys)
hgu95aALIAS2PROBE has 33149 mapped keys (of 99696 keys)
hgu95aCHR has 11559 mapped keys (of 12626 keys)
hgu95aCHRLENGTHS has 93 mapped keys (of 93 keys)
hgu95aCHRLOC has 11493 mapped keys (of 12626 keys)
hgu95aCHRLOCEND has 11493 mapped keys (of 12626 keys)
hgu95aENSEMBL has 11455 mapped keys (of 12626 keys)
hgu95aENSEMBL2PROBE has 9699 mapped keys (of 27388 keys)
hgu95aENTREZID has 11562 mapped keys (of 12626 keys)
hgu95aENZYME has 2121 mapped keys (of 12626 keys)
hgu95aENZYME2PROBE has 786 mapped keys (of 975 keys)
hgu95aGENENAME has 11562 mapped keys (of 12626 keys)
hgu95aGO has 11241 mapped keys (of 12626 keys)
hgu95aGO2ALLPROBES has 16108 mapped keys (of 17608 keys)
hgu95aGO2PROBE has 12081 mapped keys (of 13737 keys)
hgu95aMAP has 11535 mapped keys (of 12626 keys)
hgu95aOMIM has 10474 mapped keys (of 12626 keys)
hgu95aPATH has 5383 mapped keys (of 12626 keys)
hgu95aPATH2PROBE has 228 mapped keys (of 229 keys)
hgu95aPFAM has 11322 mapped keys (of 12626 keys)
hgu95aPMID has 11548 mapped keys (of 12626 keys)
hgu95aPMID2PROBE has 340884 mapped keys (of 394407 keys)
hgu95aPROSITE has 11322 mapped keys (of 12626 keys)
hgu95aREFSEQ has 11522 mapped keys (of 12626 keys)
hgu95aSYMBOL has 11562 mapped keys (of 12626 keys)
hgu95aUNIGENE has 11547 mapped keys (of 12626 keys)
hgu95aUNIPROT has 11324 mapped keys (of 12626 keys)
Additional Information about this package:
DB schema: HUMANCHIP_DB
DB schema version: 2.1
Organism: Homo sapiens
Date for NCBI data: 2013-Sep12
Date for GO data: 20130907
Date for KEGG data: 2011-Mar15
Date for Golden Path data: 2010-Mar22
Date for Ensembl data: 2013-Sep3
Each environment in a metadata package maps identifiers to annotation information. The annotation information can be gene/identifiers centered, or annotation centered.
Example: “ID -> SYMBOL” is gene centered, while “PATHID -> genes” is annotation centered. We need the second type of mapping: one FGS to all the genes of the FGS
This extracts mapping between kegg and affy identifiers for this platform
kegg <- as.list(hgu95aPATH2PROBE)
This for GENE ONTOLOGY
go <- as.list(hgu95aGO2ALLPROBES)
The code above created a list where each element has name (the kegg identifiers) and contains a vector of identifiers, see below,
class(kegg)
[1] "list"
length(kegg)
[1] 229
str(head(kegg))
List of 6
$ 04610: chr [1:73] "37175_at" "33549_at" "39309_at" "39310_at" ...
$ 00232: chr [1:11] "38187_at" "38912_at" "1080_s_at" "1493_r_at" ...
$ 00983: chr [1:45] "38187_at" "38912_at" "1117_at" "1493_r_at" ...
$ 01100: chr [1:957] "38187_at" "38912_at" "36332_at" "33445_at" ...
$ 00380: chr [1:45] "36332_at" "37186_s_at" "39678_at" "34790_at" ...
$ 00970: chr [1:24] "36185_at" "40408_at" "32615_at" "35300_at" ...
To run a Wilcoxon rank-sum test we can use the geneSetTest() function in limma we need to identify the genes in the FGS, and give also a ranking statistics. To this end we will use the t-statistics obtained using the topTable() function
The geneSetTest() tests whether genes in the set are more differentially expressed than genes not in the set.
By contrast, a self-contained gene set test such as roast() tests whether genes in the test set are differentially expressed, in an absolute sense, without regard to any other genes on the array.
The test on the fisrt FGS: default parameters
geneSetTest(tG2$ID %in% kegg[[1]], tG2$t)
[1] 0.9378
The test on the fisrt FGS: shift toward down-reguated genes
geneSetTest(tG2$ID %in% kegg[[1]], tG2$t, alternative = "down")
[1] 0.9999
The test on the fisrt FGS: shift toward up-reguated genes
geneSetTest(tG2$ID %in% kegg[[1]], tG2$t, alternative = "up")
[1] 0.0001044
barcodeplot(tG2$t, tG2$ID %in% kegg[[1]])
The test on all KEGG FGS (after removing NA)
kegg2 <- kegg[!is.na(kegg)]
gse <- vector()
for (i in 1:length(kegg2)) {
gse[i] <- geneSetTest(tG2$ID %in% kegg2[[i]], tG2$t)
names(gse)[i] <- names(kegg2)[i]
}
### This is an alternative
tmp <- sapply(kegg2, function(gs, stat, ids) {
geneSetTest(ids %in% gs, stat)
}, stat = tG2$t, ids = tG2$ID)
str(gse)
Named num [1:228] 0.938 0.27 0.362 0.876 0.568 ...
- attr(*, "names")= chr [1:228] "04610" "00232" "00983" "01100" ...
head(gse)
04610 00232 00983 01100 00380 00970
0.93782 0.26954 0.36212 0.87648 0.56830 0.08388
gse <- vector()
for (i in 1:length(kegg)) {
# check if all values are NOT true
if (all(!tG2$ID %in% kegg[[i]])) {
gse[i] <- NA
} else {
gse[i] <- geneSetTest(tG2$ID %in% kegg[[i]], tG2$t)
}
names(gse)[i] <- names(kegg)[i]
}
Check how the gse vector looks, the length and the structure
length(gse)
[1] 229
str(gse)
Named num [1:229] 0.938 0.27 0.362 0.876 0.568 ...
- attr(*, "names")= chr [1:229] "04610" "00232" "00983" "01100" ...
This is the way we can visualize this: the first gene set
i = 10
barcodeplot(tG2$t, tG2$ID %in% kegg[[i]])
This is the way we can visualize this: the smallest pvalue
i = which.min(gse)
barcodeplot(tG2$t, tG2$ID %in% kegg[[i]])
This is the way we can visualize this: the largest pvalue
i = which.max(gse)
barcodeplot(tG2$t, tG2$ID %in% kegg[[i]])
We can also use the geneSetTest() function differently, you can learn about this function by typing
`?`(geneSetTest)
# Or simply print the arguments of the function
args(geneSetTest)
We can compare our FGS to randomly generated lists of genes of the same size of our FGS of interest and compute the p-value as the proportion of times our FGS returns a smaller p-value compared to the randomly generated lists.
geneSetTest(tG2$ID %in% kegg[[1]], tG2$t, alternative = "up", type = "t", ranks.only = FALSE,
nsim = 1000)
[1] 0.001998
geneSetTest(tG2$ID %in% kegg[[94]], tG2$t, alternative = "down", type = "t",
ranks.only = FALSE, nsim = 1000)
[1] 0.4895
limma enables you to run also self-contained tests as opposed to competitive tests. To do this you can use the roast() function
Extract the expression values
mat <- exprs(dat.rma)
Select gene expression for the genes in the FGS
sel <- rownames(mat) %in% kegg[[1]]
For this analysis the design matrix must already account for the contrasts
newDmat <- data.frame(Intercept = 1, CellType = dMat2[, 1] + dMat2[, 2] + (-1 *
(dMat2[, 3] + dMat2[, 4])))
Run the test on one FGS: the contrast is the column index from the design matrix
roast(y = mat, index = sel, design = newDmat, contrast = 2)
Active.Prop P.Value
Down 0.0274 0.929
Up 0.1233 0.072
Mixed 0.1507 0.334
The test on the first five FGS for KEGG
gse2 <- list()
for (i in 1:5) {
sel <- rownames(mat) %in% kegg[[i]]
gse2[[i]] <- roast(mat, index = sel, design = newDmat, contrast = 2)
names(gse2)[i] <- names(kegg)[i]
}
The test on the first five FGS for GO
gse.go <- list()
for (i in 1:5) {
sel <- rownames(mat) %in% go[[i]]
gse.go[[i]] <- roast(mat, index = sel, design = newDmat, contrast = 2)
names(gse.go)[i] <- names(go)[i]
}
Load the KEGG library
library(KEGG.db)
KEGG.db contains mappings based on older data because the original
resource was removed from the the public domain before the most
recent update was produced. This package should now be
considered deprecated and future versions of Bioconductor may
not have it available. Users who want more current data are
encouraged to look at the KEGGREST or reactome.db packages
library(GO.db)
installedPckgs <- installed.packages()[, "Package"]
# Here we define the list of desired libraries
pckgListBIOC <- c("KEGG.db")
# We source the biocLite.R script from the Bioconductor website
source("http://bioconductor.org/biocLite.R")
Bioconductor version 2.13 (BiocInstaller 1.12.1), ?biocLite for help
A newer version of Bioconductor is available after installing a new
version of R, ?BiocUpgrade for help
# Load the packages, or install them from Bioconductor if needed
for (pckg in pckgListBIOC) {
if (!pckg %in% installedPckgs)
biocLite(pckg)
require(pckg, character.only = TRUE)
}
require("KEGG.db")
Check the content of the KEGG metadata package
KEGG()
Quality control information for KEGG:
This package has the following mappings:
KEGGENZYMEID2GO has 3999 mapped keys (of 3999 keys)
KEGGEXTID2PATHID has 75100 mapped keys (of 75100 keys)
KEGGGO2ENZYMEID has 4129 mapped keys (of 4129 keys)
KEGGPATHID2EXTID has 3152 mapped keys (of 3152 keys)
KEGGPATHID2NAME has 390 mapped keys (of 390 keys)
KEGGPATHNAME2ID has 390 mapped keys (of 390 keys)
Additional Information about this package:
DB schema: KEGG_DB
DB schema version: 2.1
Date for KEGG data: 2011-Mar15
GO()
Quality control information for GO:
This package has the following mappings:
GOBPANCESTOR has 25193 mapped keys (of 25193 keys)
GOBPCHILDREN has 14497 mapped keys (of 25193 keys)
GOBPOFFSPRING has 14497 mapped keys (of 25193 keys)
GOBPPARENTS has 25193 mapped keys (of 25193 keys)
GOCCANCESTOR has 3232 mapped keys (of 3232 keys)
GOCCCHILDREN has 1070 mapped keys (of 3232 keys)
GOCCOFFSPRING has 1070 mapped keys (of 3232 keys)
GOCCPARENTS has 3232 mapped keys (of 3232 keys)
GOMFANCESTOR has 9602 mapped keys (of 9602 keys)
GOMFCHILDREN has 1930 mapped keys (of 9602 keys)
GOMFOFFSPRING has 1930 mapped keys (of 9602 keys)
GOMFPARENTS has 9602 mapped keys (of 9602 keys)
GOOBSOLETE has 1838 mapped keys (of 1838 keys)
GOTERM has 38028 mapped keys (of 38028 keys)
Additional Information about this package:
DB schema: GO_DB
DB schema version: 2.1
Date for GO data: 20130907
It appears that KEGGPATHID2NAME contains what we want, so you can use mget() to obtain the patways names
pathNames <- mget(names(gse), KEGGPATHID2NAME, ifnotfound = NA)
goNames <- mget(names(gse.go), GOTERM, ifnotfound = NA)
goNames <- lapply(goNames, Term)
Print pathNames and goNames
print(pathNames)
print(goNames)
Retrieve the smallest p-values
pathNames[names(pathNames) == names(gse[which.min(gse)])]
$`04662`
[1] "B cell receptor signaling pathway"
If they are in the same order
pathNames[which.min(gse)]
$`04662`
[1] "B cell receptor signaling pathway"
Retrieve the smallest p-values
pathNames2 <- pathNames[length(pathNames):1]
Always correct also if not in the same order
pathNames2[names(pathNames2) == names(gse[which.min(gse)])]
$`04662`
[1] "B cell receptor signaling pathway"
Wrong if they are in the same order
pathNames2[which.min(gse)]
$`00561`
[1] "Glycerolipid metabolism"
Retrieve fgs with p-value less than 0.001
pathNames[gse < 1e-05]
$`04510`
[1] "Focal adhesion"
$`04670`
[1] "Leukocyte transendothelial migration"
$`05100`
[1] "Bacterial invasion of epithelial cells"
$`04662`
[1] "B cell receptor signaling pathway"
$`05213`
[1] "Endometrial cancer"
$`05223`
[1] "Non-small cell lung cancer"
$`05216`
[1] "Thyroid cancer"
$<NA>
NULL
index <- which(gse < 0.01)
myFGS <- kegg[index]
sum(myFGS[[1]] %in% myFGS[[2]])
[1] 33
all <- unique(unlist(myFGS))
table(fgs1 = all %in% myFGS[[1]], fgs2 = all %in% myFGS[[2]])
fgs2
fgs1 FALSE TRUE
FALSE 1467 131
TRUE 136 33
vennDiagram(data.frame(fgs1 = all %in% myFGS[[1]], fgs2 = all %in% myFGS[[2]],
fgs3 = all %in% myFGS[[3]]))
library(multtest)
# Correct p-values
adjp <- mt.rawp2adjp(gse)
adjp <- adjp$adjp[order(adjp$index), ]
# Bonferroni only
adjp <- mt.rawp2adjp(gse, proc = "Bonferroni")
adjp <- adjp$adjp[order(adjp$index), ]
# Benjamini-Hochberg only
adjp <- mt.rawp2adjp(gse, proc = "BH")
adjp <- adjp$adjp[order(adjp$index), ]
Make an histogram of the p-values
hist(adjp[, 1], nclass = 20)
hist(adjp[, 2], nclass = 20, add = TRUE, col = "blue")
sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-apple-darwin13.1.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] multtest_2.18.0 BiocInstaller_1.12.1 GO.db_2.10.1
[4] KEGG.db_2.10.1 annotate_1.40.1 hgu95a.db_2.10.1
[7] org.Hs.eg.db_2.10.1 RSQLite_0.11.4 DBI_0.2-7
[10] AnnotationDbi_1.24.0 limma_3.18.13 affy_1.40.0
[13] Biobase_2.22.0 BiocGenerics_0.8.0 knitr_1.5
loaded via a namespace (and not attached):
[1] affyio_1.30.0 evaluate_0.5.5 formatR_0.10
[4] IRanges_1.20.7 MASS_7.3-33 preprocessCore_1.24.0
[7] splines_3.0.3 stats4_3.0.3 stringr_0.6.2
[10] survival_2.37-7 tools_3.0.3 XML_3.98-1.1
[13] xtable_1.7-3 zlibbioc_1.8.0