Rebecca Johnston 29/01/2014
library("ggplot2") # for graphing
library(plyr) # for data aggreagation
library("reshape2") # for reshaping data from wide to tall format
Load gene expression dataset:
prDat <- read.table("GSE4051_data.tsv")
str(prDat, max.level = 0)
## 'data.frame': 29949 obs. of 39 variables:
Load design matrix:
prDes <- readRDS("GSE4051_design.rds")
str(prDes)
## 'data.frame': 39 obs. of 4 variables:
## $ sidChar : chr "Sample_20" "Sample_21" "Sample_22" "Sample_23" ...
## $ sidNum : num 20 21 22 23 16 17 6 24 25 26 ...
## $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 2 2 2 ...
## $ gType : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 2 2 2 1 1 1 ...
Extract data for one gene, put in a data.frame with experimental info:
set.seed(987)
(theGene <- sample(1:nrow(prDat), 1))
## [1] 14294
(pDat <- data.frame(prDes, gExp = unlist(prDat[theGene, ])))
## sidChar sidNum devStage gType gExp
## 12 Sample_20 20 E16 wt 9.881
## 13 Sample_21 21 E16 wt 10.590
## 14 Sample_22 22 E16 wt 10.280
## 15 Sample_23 23 E16 wt 10.220
## 9 Sample_16 16 E16 NrlKO 8.751
## 10 Sample_17 17 E16 NrlKO 9.902
## 11 Sample_6 6 E16 NrlKO 10.340
## 28 Sample_24 24 P2 wt 9.998
## 29 Sample_25 25 P2 wt 9.866
## 30 Sample_26 26 P2 wt 9.433
## 31 Sample_27 27 P2 wt 9.454
## 24 Sample_14 14 P2 NrlKO 9.842
## 25 Sample_3 3 P2 NrlKO 9.870
## 26 Sample_5 5 P2 NrlKO 9.874
## 27 Sample_8 8 P2 NrlKO 9.156
## 36 Sample_28 28 P6 wt 9.169
## 37 Sample_29 29 P6 wt 10.220
## 38 Sample_30 30 P6 wt 10.070
## 39 Sample_31 31 P6 wt 9.923
## 32 Sample_1 1 P6 NrlKO 9.703
## 33 Sample_10 10 P6 NrlKO 9.938
## 34 Sample_4 4 P6 NrlKO 9.870
## 35 Sample_7 7 P6 NrlKO 8.922
## 20 Sample_32 32 P10 wt 9.855
## 21 Sample_33 33 P10 wt 9.166
## 22 Sample_34 34 P10 wt 9.095
## 23 Sample_35 35 P10 wt 9.774
## 16 Sample_13 13 P10 NrlKO 9.583
## 17 Sample_15 15 P10 NrlKO 9.209
## 18 Sample_18 18 P10 NrlKO 9.725
## 19 Sample_19 19 P10 NrlKO 9.104
## 5 Sample_36 36 4_weeks wt 10.000
## 6 Sample_37 37 4_weeks wt 9.213
## 7 Sample_38 38 4_weeks wt 9.554
## 8 Sample_39 39 4_weeks wt 9.403
## 1 Sample_11 11 4_weeks NrlKO 9.488
## 2 Sample_12 12 4_weeks NrlKO 8.941
## 3 Sample_2 2 4_weeks NrlKO 9.324
## 4 Sample_9 9 4_weeks NrlKO 9.967
str(pDat)
## 'data.frame': 39 obs. of 5 variables:
## $ sidChar : chr "Sample_20" "Sample_21" "Sample_22" "Sample_23" ...
## $ sidNum : num 20 21 22 23 16 17 6 24 25 26 ...
## $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 2 2 2 ...
## $ gType : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 2 2 2 1 1 1 ...
## $ gExp : num 9.88 10.59 10.28 10.22 8.75 ...
head(pDat)
## sidChar sidNum devStage gType gExp
## 12 Sample_20 20 E16 wt 9.881
## 13 Sample_21 21 E16 wt 10.590
## 14 Sample_22 22 E16 wt 10.280
## 15 Sample_23 23 E16 wt 10.220
## 9 Sample_16 16 E16 NrlKO 8.751
## 10 Sample_17 17 E16 NrlKO 9.902
Check data summary using aggregate function:
aggregate(gExp ~ gType, pDat, FUN = mean)
## gType gExp
## 1 wt 9.758
## 2 NrlKO 9.553
Make a strip plot:
ggplot(pDat, aes(gExp, gType)) + geom_point()
Perform a two-sample t test comparing wild type to the Nrl knockouts:
t.test(gExp ~ gType, pDat)
##
## Welch Two Sample t-test
##
## data: gExp by gType
## t = 1.479, df = 36.78, p-value = 0.1475
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.07586 0.48605
## sample estimates:
## mean in group wt mean in group NrlKO
## 9.758 9.553
Save t test results by naming it:
ttRes <- t.test(gExp ~ gType, pDat)
str(ttRes)
## List of 9
## $ statistic : Named num 1.48
## ..- attr(*, "names")= chr "t"
## $ parameter : Named num 36.8
## ..- attr(*, "names")= chr "df"
## $ p.value : num 0.148
## $ conf.int : atomic [1:2] -0.0759 0.4861
## ..- attr(*, "conf.level")= num 0.95
## $ estimate : Named num [1:2] 9.76 9.55
## ..- attr(*, "names")= chr [1:2] "mean in group wt" "mean in group NrlKO"
## $ null.value : Named num 0
## ..- attr(*, "names")= chr "difference in means"
## $ alternative: chr "two.sided"
## $ method : chr "Welch Two Sample t-test"
## $ data.name : chr "gExp by gType"
## - attr(*, "class")= chr "htest"
Most useful components are test statistic and p-value:
ttRes$statistic
## t
## 1.479
ttRes$p.value
## [1] 0.1475
Draw a different gene at random or pick one for biological interest and look up the Affy probe ID. Use the t test, with and without the common variance assumption, the Wilcoxon, and/or the Kolmogorov-Smirnov test to assess differential expression. Can you pull test statistics and/or p-values from the different approaches into an common object, like a readable table? Are you getting the same message from the various approaches?
Choose probeID at random:
set.seed(675)
thatGene <- sample(1:nrow(prDat), 1)
(pDat <- data.frame(prDes, gExp = unlist(prDat[thatGene, ])))
## sidChar sidNum devStage gType gExp
## 12 Sample_20 20 E16 wt 10.830
## 13 Sample_21 21 E16 wt 11.120
## 14 Sample_22 22 E16 wt 10.980
## 15 Sample_23 23 E16 wt 11.030
## 9 Sample_16 16 E16 NrlKO 9.653
## 10 Sample_17 17 E16 NrlKO 10.470
## 11 Sample_6 6 E16 NrlKO 10.900
## 28 Sample_24 24 P2 wt 10.900
## 29 Sample_25 25 P2 wt 11.260
## 30 Sample_26 26 P2 wt 10.180
## 31 Sample_27 27 P2 wt 10.760
## 24 Sample_14 14 P2 NrlKO 10.260
## 25 Sample_3 3 P2 NrlKO 10.210
## 26 Sample_5 5 P2 NrlKO 10.430
## 27 Sample_8 8 P2 NrlKO 9.330
## 36 Sample_28 28 P6 wt 10.400
## 37 Sample_29 29 P6 wt 10.840
## 38 Sample_30 30 P6 wt 10.970
## 39 Sample_31 31 P6 wt 11.010
## 32 Sample_1 1 P6 NrlKO 10.450
## 33 Sample_10 10 P6 NrlKO 10.520
## 34 Sample_4 4 P6 NrlKO 10.760
## 35 Sample_7 7 P6 NrlKO 9.597
## 20 Sample_32 32 P10 wt 10.240
## 21 Sample_33 33 P10 wt 9.400
## 22 Sample_34 34 P10 wt 9.387
## 23 Sample_35 35 P10 wt 10.770
## 16 Sample_13 13 P10 NrlKO 9.808
## 17 Sample_15 15 P10 NrlKO 9.585
## 18 Sample_18 18 P10 NrlKO 10.180
## 19 Sample_19 19 P10 NrlKO 9.453
## 5 Sample_36 36 4_weeks wt 10.770
## 6 Sample_37 37 4_weeks wt 10.150
## 7 Sample_38 38 4_weeks wt 10.500
## 8 Sample_39 39 4_weeks wt 10.220
## 1 Sample_11 11 4_weeks NrlKO 10.240
## 2 Sample_12 12 4_weeks NrlKO 10.100
## 3 Sample_2 2 4_weeks NrlKO 10.230
## 4 Sample_9 9 4_weeks NrlKO 10.700
aggregate(gExp ~ gType, pDat, FUN = mean)
## gType gExp
## 1 wt 10.59
## 2 NrlKO 10.15
ggplot(pDat, aes(gExp, gType)) + geom_point()
T test (t.test) by default, common variance assumption = FALSE:
ttRes <- (t.test(gExp ~ gType, pDat, var.equal = FALSE))
str(ttRes)
## List of 9
## $ statistic : Named num 2.75
## ..- attr(*, "names")= chr "t"
## $ parameter : Named num 36.8
## ..- attr(*, "names")= chr "df"
## $ p.value : num 0.00919
## $ conf.int : atomic [1:2] 0.114 0.755
## ..- attr(*, "conf.level")= num 0.95
## $ estimate : Named num [1:2] 10.6 10.2
## ..- attr(*, "names")= chr [1:2] "mean in group wt" "mean in group NrlKO"
## $ null.value : Named num 0
## ..- attr(*, "names")= chr "difference in means"
## $ alternative: chr "two.sided"
## $ method : chr "Welch Two Sample t-test"
## $ data.name : chr "gExp by gType"
## - attr(*, "class")= chr "htest"
ttResT <- (t.test(gExp ~ gType, pDat, var.equal = TRUE))
str(ttResT)
## List of 9
## $ statistic : Named num 2.74
## ..- attr(*, "names")= chr "t"
## $ parameter : Named num 37
## ..- attr(*, "names")= chr "df"
## $ p.value : num 0.00938
## $ conf.int : atomic [1:2] 0.113 0.756
## ..- attr(*, "conf.level")= num 0.95
## $ estimate : Named num [1:2] 10.6 10.2
## ..- attr(*, "names")= chr [1:2] "mean in group wt" "mean in group NrlKO"
## $ null.value : Named num 0
## ..- attr(*, "names")= chr "difference in means"
## $ alternative: chr "two.sided"
## $ method : chr " Two Sample t-test"
## $ data.name : chr "gExp by gType"
## - attr(*, "class")= chr "htest"
Remember: Wilcoxon and Kolmogorov-Smirnov tests are non-parametric tests.
Wilcoxon test (wilcox.test) compares two samples by ranking them:
(wtRes <- wilcox.test(gExp ~ gType, data = pDat))
## Warning: cannot compute exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: gExp by gType
## W = 285, p-value = 0.00791
## alternative hypothesis: true location shift is not equal to 0
str(wtRes)
## List of 7
## $ statistic : Named num 285
## ..- attr(*, "names")= chr "W"
## $ parameter : NULL
## $ p.value : num 0.00791
## $ null.value : Named num 0
## ..- attr(*, "names")= chr "location shift"
## $ alternative: chr "two.sided"
## $ method : chr "Wilcoxon rank sum test with continuity correction"
## $ data.name : chr "gExp by gType"
## - attr(*, "class")= chr "htest"
Note there are identical values for gExp, therefore when ranking, ties exist. Hence the warning message… Not sure whether it is possible to suppress warnings within the function?
Kolmogorov-Smirnov (ks.test) quantifies the distance between empirical distribution functions of two samples. Must specify x vs. y here, i.e. gene expression values for wt vs. KO:
(kstRes <- ks.test(pDat$gExp[pDat$gType == "wt"], pDat$gExp[pDat$gType == "NrlKO"]))
## Warning: cannot compute exact p-value with ties
##
## Two-sample Kolmogorov-Smirnov test
##
## data: pDat$gExp[pDat$gType == "wt"] and pDat$gExp[pDat$gType == "NrlKO"]
## D = 0.4974, p-value = 0.01612
## alternative hypothesis: two-sided
str(kstRes)
## List of 5
## $ statistic : Named num 0.497
## ..- attr(*, "names")= chr "D"
## $ p.value : num 0.0161
## $ alternative: chr "two-sided"
## $ method : chr "Two-sample Kolmogorov-Smirnov test"
## $ data.name : chr "pDat$gExp[pDat$gType == \"wt\"] and pDat$gExp[pDat$gType == \"NrlKO\"]"
## - attr(*, "class")= chr "htest"
Combine results from different tests:
combineRes <- function(a, b, c, d) {
method <- c(a$method, b$method, c$method, d$method)
testStat <- c(a$statistic, b$statistic, c$statistic, d$statistic)
pValue <- c(a$p.value, b$p.value, c$p.value, d$p.value)
data.frame(method, testStat, pValue)
}
(results <- combineRes(ttRes, ttResT, wtRes, kstRes))
## method testStat pValue
## 1 Welch Two Sample t-test 2.7501 0.009185
## 2 Two Sample t-test 2.7407 0.009383
## 3 Wilcoxon rank sum test with continuity correction 285.0000 0.007910
## 4 Two-sample Kolmogorov-Smirnov test 0.4974 0.016123
No matter which statistical test we choose, we obtain different numbers, but reach same conclusion: the means of wt and KO for expression of this gene are not equal.
N.B. wilcox.test and ks.test could not compute exact p-values. However, I'm worried that pValue from the ks.test was slightly higher than the other tests. Whether this is to be expected or I have not specified the test correctly I'm not sure?
Let's grab the data from 6 genes: 3 are interesting ('hits'), 3 are not. Reshape the data to be tall and skinny, which is generally a good policy and allows us to learn more about data aggregation.
keepGenes <- c("1431708_a_at", "1424336_at", "1454696_at", "1416119_at", "1432141_x_at",
"1429226_at")
miniDat <- subset(prDat, rownames(prDat) %in% keepGenes)
miniDat <- data.frame(gExp = as.vector(t(as.matrix(miniDat))), gene = factor(rep(rownames(miniDat),
each = ncol(miniDat)), levels = keepGenes))
Sample ID in same order within design matrix and gene expression data. Therefore can put data frames together:
miniDat <- suppressWarnings(data.frame(prDes, miniDat))
str(miniDat)
## 'data.frame': 234 obs. of 6 variables:
## $ sidChar : chr "Sample_20" "Sample_21" "Sample_22" "Sample_23" ...
## $ sidNum : num 20 21 22 23 16 17 6 24 25 26 ...
## $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 2 2 2 ...
## $ gType : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 2 2 2 1 1 1 ...
## $ gExp : num 10.6 11 10.8 10.9 9.2 ...
## $ gene : Factor w/ 6 levels "1431708_a_at",..: 4 4 4 4 4 4 4 4 4 4 ...
Plot strip plot:
ggplot(miniDat, aes(gExp, gType, colour = gType)) + geom_point(alpha = 0.3) +
facet_wrap(~gene, scale = "free_x") # x scale independent per probe
Use data aggregation techniques to conduct some two group comparisons for each of these 6 genes. Compare gene expression by genotype across all 6 genes:
t.test(gExp ~ gType, miniDat)
##
## Welch Two Sample t-test
##
## data: gExp by gType
## t = 2.952, df = 225.3, p-value = 0.00349
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2551 1.2791
## sample estimates:
## mean in group wt mean in group NrlKO
## 8.935 8.168
Conceptually, we want to make a sub-data.frame for each gene and provide in the place of someDat in a t test call like above. Sometimes that is a useful first step, when building up a data aggregation task.
Subset miniDat to one gene, the first value in keepGenes:
someDat <- droplevels(subset(miniDat, gene == keepGenes[1]))
t.test(gExp ~ gType, someDat)
##
## Welch Two Sample t-test
##
## data: gExp by gType
## t = 9.838, df = 36.89, p-value = 7.381e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.570 2.384
## sample estimates:
## mean in group wt mean in group NrlKO
## 9.554 7.578
But how do we scale this up to all 6 genes? Welcome pylr…
Since our input, miniDat, is a data.frame, we will use functions that start with d. What do we want to get back, if anything? If we are happy to watch the t test results fly by on the screen, we can use d_ply() with .print =TRUE. But we can also store results into a list using dlply():
ttRes <- dlply(miniDat, ~gene, function(x) t.test(gExp ~ gType, x))
names(ttRes)
## [1] "1431708_a_at" "1424336_at" "1454696_at" "1416119_at"
## [5] "1432141_x_at" "1429226_at"
ttRes[["1454696_at"]]
##
## Welch Two Sample t-test
##
## data: gExp by gType
## t = 8.077, df = 33.49, p-value = 2.278e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.402 2.346
## sample estimates:
## mean in group wt mean in group NrlKO
## 12.85 10.97
If we knew in advance that we only wanted, say, the test statistic and the p-value, here's how we go after that directly via ddply():
(ttRes <- ddply(miniDat, ~gene, function(z) {
zz <- t.test(gExp ~ gType, z)
# round results to 4 digits
round(c(tStat = zz$statistic, pVal = zz$p.value), 4)
}))
## gene tStat.t pVal
## 1 1431708_a_at 9.8380 0.0000
## 2 1424336_at 9.0607 0.0000
## 3 1454696_at 8.0771 0.0000
## 4 1416119_at -0.1840 0.8551
## 5 1432141_x_at -0.1324 0.8954
## 6 1429226_at 0.0983 0.9223
In our last example, can you edit the inner function to use the Wilcoxon or KS test? Or maybe do the t test, the Wilcoxon, and the KS test and return all 3 p-values?
Make function separate:
multiTest <- function(x) {
tt <- t.test(gExp ~ gType, x)
wt <- wilcox.test(gExp ~ gType, x)
ks <- ks.test(x$gExp[x$gType == "wt"], x$gExp[x$gType == "NrlKO"])
round(c(tStat = tt$p.val, wStat = wt$p.val, kStat = ks$p.val), 6)
}
Call function within ddply:
(multitRes <- suppressWarnings(ddply(miniDat, ~gene, multiTest)))
## gene tStat wStat kStat
## 1 1431708_a_at 0.0000 0.0000 0.0000
## 2 1424336_at 0.0000 0.0000 0.0000
## 3 1454696_at 0.0000 0.0000 0.0000
## 4 1416119_at 0.8551 0.9668 0.7128
## 5 1432141_x_at 0.8954 0.9440 0.4107
## 6 1429226_at 0.9223 0.7465 0.9947
Scale up to 100 genes:
set.seed(123)
lotsGenes <- sample(1:nrow(prDat), 100)
head(lotsGenes)
## [1] 8613 23609 12248 26443 28163 1365
lotsDat <- prDat[lotsGenes, ]
str(lotsDat, max.level = 0)
## 'data.frame': 100 obs. of 39 variables:
lotsDat <- data.frame(gExp = as.vector(t(as.matrix(lotsDat))), gene = factor(rep(rownames(lotsDat),
each = ncol(lotsDat)), levels = rownames(lotsDat)))
str(lotsDat)
## 'data.frame': 3900 obs. of 2 variables:
## $ gExp: num 7.32 7.71 7.06 7.2 7.16 ...
## $ gene: Factor w/ 100 levels "1427773_a_at",..: 1 1 1 1 1 1 1 1 1 1 ...
head(lotsDat)
## gExp gene
## 1 7.323 1427773_a_at
## 2 7.711 1427773_a_at
## 3 7.061 1427773_a_at
## 4 7.204 1427773_a_at
## 5 7.160 1427773_a_at
## 6 7.975 1427773_a_at
lotsDat <- suppressWarnings(data.frame(prDes, lotsDat))
multiRes <- suppressWarnings(ddply(lotsDat, ~gene, multiTest))
head(multiRes)
## gene tStat wStat kStat
## 1 1427773_a_at 0.581648 0.35060 0.3278
## 2 1451078_at 0.059321 0.07438 0.0960
## 3 1433507_a_at 0.450034 0.46493 0.3550
## 4 1455179_at 0.161915 0.21380 0.4338
## 5 1457691_at 0.930725 0.74659 0.7005
## 6 1417272_at 0.008463 0.01190 0.1171
tail(multiRes)
## gene tStat wStat kStat
## 95 1428944_at 0.56520 0.60707 0.48081
## 96 1423646_at 0.05899 0.03890 0.10826
## 97 1450732_a_at 0.82216 0.86612 0.89555
## 98 1419234_at 0.62483 0.75724 0.95769
## 99 1460690_at 0.45426 0.58774 0.43375
## 100 1437301_a_at 0.02642 0.01628 0.02179
summary(lm(multiRes$tStat ~ multiRes$wStat))
##
## Call:
## lm(formula = multiRes$tStat ~ multiRes$wStat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3172 -0.0738 -0.0290 0.0730 0.4265
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0355 0.0241 1.47 0.14
## multiRes$wStat 0.9174 0.0431 21.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.137 on 98 degrees of freedom
## Multiple R-squared: 0.822, Adjusted R-squared: 0.82
## F-statistic: 452 on 1 and 98 DF, p-value: <2e-16
Scatterplot various two-group tests against each other:
t.test vs. wilcox.test
Perform lm and obtain r-squared value:
fit <- summary(lm(multiRes$tStat ~ multiRes$wStat))
fit$r.squared
## [1] 0.8219
Plot together with r-squared value:
ggplot(multiRes, aes(tStat, wStat)) + geom_point() + stat_smooth(method = "lm",
se = FALSE, colour = "black") + geom_abline(intercept = 0, aes(colour = "red")) +
ggtitle(paste("P values from t.test vs. wilcox.test\nfor 100 random genes from prDat dataset\nr-squared =",
round(fit$r.squared, digits = 4)))
t.test vs. ks.test
fit <- summary(lm(multiRes$tStat ~ multiRes$kStat))
fit$r.squared
## [1] 0.6263
ggplot(multiRes, aes(tStat, kStat)) + geom_point() + stat_smooth(method = "lm",
se = FALSE, colour = "black") + geom_abline(intercept = 0, aes(colour = "red")) +
ggtitle(paste("P values from t.test vs. ks.test\nfor 100 random genes from prDat dataset\nr-squared =",
round(fit$r.squared, digits = 4)))
wilcox.test vs. ks.test
fit <- summary(lm(multiRes$wStat ~ multiRes$kStat))
fit$r.squared
## [1] 0.7261
ggplot(multiRes, aes(wStat, kStat)) + geom_point() + stat_smooth(method = "lm",
se = FALSE, colour = "black") + geom_abline(intercept = 0, aes(colour = "red")) +
ggtitle(paste("P values from wilcox.test vs. ks.test\nfor 100 random genes from prDat dataset\nr-squared =",
round(fit$r.squared, digits = 4)))
Therefore, t.test and wilcox.test have the most concordance in p values, whereas the t.test and ks.test have the least concordance in p values.
Convert your numeric matrix of p-values into a matrix of TRUE/FALSE or zeros and ones by hard-threshholding, e.g. at the conventional 0.05 level. Use apply() or a function from plyr to make some interesting row or column summaries. How many genes are significant according to the different tests? For each gene, how many of the tests return a significant p-value? How many genes are “hits” by all 3 methods, by exactly 2, by exactly 1 and by none?
Start with multiRes dataset:
head(multiRes)
## gene tStat wStat kStat
## 1 1427773_a_at 0.581648 0.35060 0.3278
## 2 1451078_at 0.059321 0.07438 0.0960
## 3 1433507_a_at 0.450034 0.46493 0.3550
## 4 1455179_at 0.161915 0.21380 0.4338
## 5 1457691_at 0.930725 0.74659 0.7005
## 6 1417272_at 0.008463 0.01190 0.1171
Convert data frame into matrix by removing gene names:
multiResMat <- multiRes[, c("tStat", "wStat", "kStat")]
multiResMat <- multiResMat < 0.05
head(multiResMat)
## tStat wStat kStat
## [1,] FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE
## [6,] TRUE TRUE FALSE
Use apply to make summaries for each test. Summary by test (column):
apply(multiResMat, 2, sum)
## tStat wStat kStat
## 15 13 8
Summary by gene (row):
apply(multiResMat, 1, sum)
## [1] 0 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 1 0 0 0
## [36] 0 0 0 0 1 2 0 0 0 0 0 0 0 3 0 0 3 2 0 0 0 0 0 0 0 0 0 0 0 3 0 0 2 2 0
## [71] 0 0 0 0 0 0 0 0 0 0 1 0 0 3 0 0 0 3 0 0 0 0 0 0 0 1 0 0 0 3
Now make summaries using plyr instead. First use reshape2 melt function to convert data into tall format:
multiResM <- melt(multiRes, id.var = "gene", variable.name = "method", value.name = "pVal")
str(multiResM)
## 'data.frame': 300 obs. of 3 variables:
## $ gene : Factor w/ 100 levels "1427773_a_at",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ method: Factor w/ 3 levels "tStat","wStat",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ pVal : num 0.5816 0.0593 0.45 0.1619 0.9307 ...
head(multiResM)
## gene method pVal
## 1 1427773_a_at tStat 0.581648
## 2 1451078_at tStat 0.059321
## 3 1433507_a_at tStat 0.450034
## 4 1455179_at tStat 0.161915
## 5 1457691_at tStat 0.930725
## 6 1417272_at tStat 0.008463
Apply threshold as a function within ddply:
ddply(multiResM, ~method, summarize, pValAboveSig = sum(pVal < 0.05))
## method pValAboveSig
## 1 tStat 15
## 2 wStat 13
## 3 kStat 8
test <- ddply(multiResM, ~gene, summarize, pValAboveSig = sum(pVal < 0.05))
head(test)
## gene pValAboveSig
## 1 1427773_a_at 0
## 2 1451078_at 0
## 3 1433507_a_at 0
## 4 1455179_at 0
## 5 1457691_at 0
## 6 1417272_at 2
Can we summarize the above even further? Number of genes that had 0, 1, 2 and 3 p values < 0.05?
ddply(test, ~pValAboveSig, summarize, geneNumber = table(pValAboveSig))
## pValAboveSig geneNumber
## 1 0 83
## 2 1 5
## 3 2 5
## 4 3 7
How can we get this summary straight from multiResM? Without making test?