STAT540 Seminar 04: Two-group comparisons and data aggregation

Rebecca Johnston 29/01/2014

Load required libraries and read data
Two sample tests: one gene
Two sample tests: a handful of genes
The plyr package
Ideas for take-home work

Load required libraries and read data

library("ggplot2")  # for graphing
library(plyr)  # for data aggreagation
library("reshape2")  # for reshaping data from wide to tall format

Load gene expression dataset:

prDat <- read.table("GSE4051_data.tsv")
str(prDat, max.level = 0)

## 'data.frame':    29949 obs. of  39 variables:

Load design matrix:

prDes <- readRDS("GSE4051_design.rds")
str(prDes)

## 'data.frame':    39 obs. of  4 variables:
##  $ sidChar : chr  "Sample_20" "Sample_21" "Sample_22" "Sample_23" ...
##  $ sidNum  : num  20 21 22 23 16 17 6 24 25 26 ...
##  $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ gType   : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 2 2 2 1 1 1 ...

Two sample tests: one gene

Extract data for one gene, put in a data.frame with experimental info:

set.seed(987)
(theGene <- sample(1:nrow(prDat), 1))

## [1] 14294

(pDat <- data.frame(prDes, gExp = unlist(prDat[theGene, ])))

##      sidChar sidNum devStage gType   gExp
## 12 Sample_20     20      E16    wt  9.881
## 13 Sample_21     21      E16    wt 10.590
## 14 Sample_22     22      E16    wt 10.280
## 15 Sample_23     23      E16    wt 10.220
## 9  Sample_16     16      E16 NrlKO  8.751
## 10 Sample_17     17      E16 NrlKO  9.902
## 11  Sample_6      6      E16 NrlKO 10.340
## 28 Sample_24     24       P2    wt  9.998
## 29 Sample_25     25       P2    wt  9.866
## 30 Sample_26     26       P2    wt  9.433
## 31 Sample_27     27       P2    wt  9.454
## 24 Sample_14     14       P2 NrlKO  9.842
## 25  Sample_3      3       P2 NrlKO  9.870
## 26  Sample_5      5       P2 NrlKO  9.874
## 27  Sample_8      8       P2 NrlKO  9.156
## 36 Sample_28     28       P6    wt  9.169
## 37 Sample_29     29       P6    wt 10.220
## 38 Sample_30     30       P6    wt 10.070
## 39 Sample_31     31       P6    wt  9.923
## 32  Sample_1      1       P6 NrlKO  9.703
## 33 Sample_10     10       P6 NrlKO  9.938
## 34  Sample_4      4       P6 NrlKO  9.870
## 35  Sample_7      7       P6 NrlKO  8.922
## 20 Sample_32     32      P10    wt  9.855
## 21 Sample_33     33      P10    wt  9.166
## 22 Sample_34     34      P10    wt  9.095
## 23 Sample_35     35      P10    wt  9.774
## 16 Sample_13     13      P10 NrlKO  9.583
## 17 Sample_15     15      P10 NrlKO  9.209
## 18 Sample_18     18      P10 NrlKO  9.725
## 19 Sample_19     19      P10 NrlKO  9.104
## 5  Sample_36     36  4_weeks    wt 10.000
## 6  Sample_37     37  4_weeks    wt  9.213
## 7  Sample_38     38  4_weeks    wt  9.554
## 8  Sample_39     39  4_weeks    wt  9.403
## 1  Sample_11     11  4_weeks NrlKO  9.488
## 2  Sample_12     12  4_weeks NrlKO  8.941
## 3   Sample_2      2  4_weeks NrlKO  9.324
## 4   Sample_9      9  4_weeks NrlKO  9.967

str(pDat)

## 'data.frame':    39 obs. of  5 variables:
##  $ sidChar : chr  "Sample_20" "Sample_21" "Sample_22" "Sample_23" ...
##  $ sidNum  : num  20 21 22 23 16 17 6 24 25 26 ...
##  $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ gType   : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 2 2 2 1 1 1 ...
##  $ gExp    : num  9.88 10.59 10.28 10.22 8.75 ...

head(pDat)

##      sidChar sidNum devStage gType   gExp
## 12 Sample_20     20      E16    wt  9.881
## 13 Sample_21     21      E16    wt 10.590
## 14 Sample_22     22      E16    wt 10.280
## 15 Sample_23     23      E16    wt 10.220
## 9  Sample_16     16      E16 NrlKO  8.751
## 10 Sample_17     17      E16 NrlKO  9.902

Check data summary using aggregate function:

aggregate(gExp ~ gType, pDat, FUN = mean)

##   gType  gExp
## 1    wt 9.758
## 2 NrlKO 9.553

Make a strip plot:

ggplot(pDat, aes(gExp, gType)) + geom_point()

plot of chunk unnamed-chunk-6

Perform a two-sample t test comparing wild type to the Nrl knockouts:

t.test(gExp ~ gType, pDat)

## 
##  Welch Two Sample t-test
## 
## data:  gExp by gType
## t = 1.479, df = 36.78, p-value = 0.1475
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.07586  0.48605
## sample estimates:
##    mean in group wt mean in group NrlKO 
##               9.758               9.553

Save t test results by naming it:

ttRes <- t.test(gExp ~ gType, pDat)
str(ttRes)

## List of 9
##  $ statistic  : Named num 1.48
##   ..- attr(*, "names")= chr "t"
##  $ parameter  : Named num 36.8
##   ..- attr(*, "names")= chr "df"
##  $ p.value    : num 0.148
##  $ conf.int   : atomic [1:2] -0.0759 0.4861
##   ..- attr(*, "conf.level")= num 0.95
##  $ estimate   : Named num [1:2] 9.76 9.55
##   ..- attr(*, "names")= chr [1:2] "mean in group wt" "mean in group NrlKO"
##  $ null.value : Named num 0
##   ..- attr(*, "names")= chr "difference in means"
##  $ alternative: chr "two.sided"
##  $ method     : chr "Welch Two Sample t-test"
##  $ data.name  : chr "gExp by gType"
##  - attr(*, "class")= chr "htest"

Most useful components are test statistic and p-value:

ttRes$statistic

##     t 
## 1.479

ttRes$p.value

## [1] 0.1475

Draw a different gene at random or pick one for biological interest and look up the Affy probe ID. Use the t test, with and without the common variance assumption, the Wilcoxon, and/or the Kolmogorov-Smirnov test to assess differential expression. Can you pull test statistics and/or p-values from the different approaches into an common object, like a readable table? Are you getting the same message from the various approaches?

Choose probeID at random:

set.seed(675)
thatGene <- sample(1:nrow(prDat), 1)
(pDat <- data.frame(prDes, gExp = unlist(prDat[thatGene, ])))

##      sidChar sidNum devStage gType   gExp
## 12 Sample_20     20      E16    wt 10.830
## 13 Sample_21     21      E16    wt 11.120
## 14 Sample_22     22      E16    wt 10.980
## 15 Sample_23     23      E16    wt 11.030
## 9  Sample_16     16      E16 NrlKO  9.653
## 10 Sample_17     17      E16 NrlKO 10.470
## 11  Sample_6      6      E16 NrlKO 10.900
## 28 Sample_24     24       P2    wt 10.900
## 29 Sample_25     25       P2    wt 11.260
## 30 Sample_26     26       P2    wt 10.180
## 31 Sample_27     27       P2    wt 10.760
## 24 Sample_14     14       P2 NrlKO 10.260
## 25  Sample_3      3       P2 NrlKO 10.210
## 26  Sample_5      5       P2 NrlKO 10.430
## 27  Sample_8      8       P2 NrlKO  9.330
## 36 Sample_28     28       P6    wt 10.400
## 37 Sample_29     29       P6    wt 10.840
## 38 Sample_30     30       P6    wt 10.970
## 39 Sample_31     31       P6    wt 11.010
## 32  Sample_1      1       P6 NrlKO 10.450
## 33 Sample_10     10       P6 NrlKO 10.520
## 34  Sample_4      4       P6 NrlKO 10.760
## 35  Sample_7      7       P6 NrlKO  9.597
## 20 Sample_32     32      P10    wt 10.240
## 21 Sample_33     33      P10    wt  9.400
## 22 Sample_34     34      P10    wt  9.387
## 23 Sample_35     35      P10    wt 10.770
## 16 Sample_13     13      P10 NrlKO  9.808
## 17 Sample_15     15      P10 NrlKO  9.585
## 18 Sample_18     18      P10 NrlKO 10.180
## 19 Sample_19     19      P10 NrlKO  9.453
## 5  Sample_36     36  4_weeks    wt 10.770
## 6  Sample_37     37  4_weeks    wt 10.150
## 7  Sample_38     38  4_weeks    wt 10.500
## 8  Sample_39     39  4_weeks    wt 10.220
## 1  Sample_11     11  4_weeks NrlKO 10.240
## 2  Sample_12     12  4_weeks NrlKO 10.100
## 3   Sample_2      2  4_weeks NrlKO 10.230
## 4   Sample_9      9  4_weeks NrlKO 10.700

aggregate(gExp ~ gType, pDat, FUN = mean)

##   gType  gExp
## 1    wt 10.59
## 2 NrlKO 10.15

ggplot(pDat, aes(gExp, gType)) + geom_point()

plot of chunk unnamed-chunk-10

T test (t.test) by default, common variance assumption = FALSE:

ttRes <- (t.test(gExp ~ gType, pDat, var.equal = FALSE))
str(ttRes)

## List of 9
##  $ statistic  : Named num 2.75
##   ..- attr(*, "names")= chr "t"
##  $ parameter  : Named num 36.8
##   ..- attr(*, "names")= chr "df"
##  $ p.value    : num 0.00919
##  $ conf.int   : atomic [1:2] 0.114 0.755
##   ..- attr(*, "conf.level")= num 0.95
##  $ estimate   : Named num [1:2] 10.6 10.2
##   ..- attr(*, "names")= chr [1:2] "mean in group wt" "mean in group NrlKO"
##  $ null.value : Named num 0
##   ..- attr(*, "names")= chr "difference in means"
##  $ alternative: chr "two.sided"
##  $ method     : chr "Welch Two Sample t-test"
##  $ data.name  : chr "gExp by gType"
##  - attr(*, "class")= chr "htest"

ttResT <- (t.test(gExp ~ gType, pDat, var.equal = TRUE))
str(ttResT)

## List of 9
##  $ statistic  : Named num 2.74
##   ..- attr(*, "names")= chr "t"
##  $ parameter  : Named num 37
##   ..- attr(*, "names")= chr "df"
##  $ p.value    : num 0.00938
##  $ conf.int   : atomic [1:2] 0.113 0.756
##   ..- attr(*, "conf.level")= num 0.95
##  $ estimate   : Named num [1:2] 10.6 10.2
##   ..- attr(*, "names")= chr [1:2] "mean in group wt" "mean in group NrlKO"
##  $ null.value : Named num 0
##   ..- attr(*, "names")= chr "difference in means"
##  $ alternative: chr "two.sided"
##  $ method     : chr " Two Sample t-test"
##  $ data.name  : chr "gExp by gType"
##  - attr(*, "class")= chr "htest"

Remember: Wilcoxon and Kolmogorov-Smirnov tests are non-parametric tests.
Wilcoxon test (wilcox.test) compares two samples by ranking them:

(wtRes <- wilcox.test(gExp ~ gType, data = pDat))

## Warning: cannot compute exact p-value with ties

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  gExp by gType
## W = 285, p-value = 0.00791
## alternative hypothesis: true location shift is not equal to 0

str(wtRes)

## List of 7
##  $ statistic  : Named num 285
##   ..- attr(*, "names")= chr "W"
##  $ parameter  : NULL
##  $ p.value    : num 0.00791
##  $ null.value : Named num 0
##   ..- attr(*, "names")= chr "location shift"
##  $ alternative: chr "two.sided"
##  $ method     : chr "Wilcoxon rank sum test with continuity correction"
##  $ data.name  : chr "gExp by gType"
##  - attr(*, "class")= chr "htest"

Note there are identical values for gExp, therefore when ranking, ties exist. Hence the warning message… Not sure whether it is possible to suppress warnings within the function?

Kolmogorov-Smirnov (ks.test) quantifies the distance between empirical distribution functions of two samples. Must specify x vs. y here, i.e. gene expression values for wt vs. KO:

(kstRes <- ks.test(pDat$gExp[pDat$gType == "wt"], pDat$gExp[pDat$gType == "NrlKO"]))

## Warning: cannot compute exact p-value with ties

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  pDat$gExp[pDat$gType == "wt"] and pDat$gExp[pDat$gType == "NrlKO"]
## D = 0.4974, p-value = 0.01612
## alternative hypothesis: two-sided

str(kstRes)

## List of 5
##  $ statistic  : Named num 0.497
##   ..- attr(*, "names")= chr "D"
##  $ p.value    : num 0.0161
##  $ alternative: chr "two-sided"
##  $ method     : chr "Two-sample Kolmogorov-Smirnov test"
##  $ data.name  : chr "pDat$gExp[pDat$gType == \"wt\"] and pDat$gExp[pDat$gType == \"NrlKO\"]"
##  - attr(*, "class")= chr "htest"

Combine results from different tests:

combineRes <- function(a, b, c, d) {
    method <- c(a$method, b$method, c$method, d$method)
    testStat <- c(a$statistic, b$statistic, c$statistic, d$statistic)
    pValue <- c(a$p.value, b$p.value, c$p.value, d$p.value)
    data.frame(method, testStat, pValue)
}

(results <- combineRes(ttRes, ttResT, wtRes, kstRes))

##                                              method testStat   pValue
## 1                           Welch Two Sample t-test   2.7501 0.009185
## 2                                 Two Sample t-test   2.7407 0.009383
## 3 Wilcoxon rank sum test with continuity correction 285.0000 0.007910
## 4                Two-sample Kolmogorov-Smirnov test   0.4974 0.016123

No matter which statistical test we choose, we obtain different numbers, but reach same conclusion: the means of wt and KO for expression of this gene are not equal.

N.B. wilcox.test and ks.test could not compute exact p-values. However, I'm worried that pValue from the ks.test was slightly higher than the other tests. Whether this is to be expected or I have not specified the test correctly I'm not sure?

Two sample tests: a handful of genes

Let's grab the data from 6 genes: 3 are interesting ('hits'), 3 are not. Reshape the data to be tall and skinny, which is generally a good policy and allows us to learn more about data aggregation.

keepGenes <- c("1431708_a_at", "1424336_at", "1454696_at", "1416119_at", "1432141_x_at", 
    "1429226_at")
miniDat <- subset(prDat, rownames(prDat) %in% keepGenes)
miniDat <- data.frame(gExp = as.vector(t(as.matrix(miniDat))), gene = factor(rep(rownames(miniDat), 
    each = ncol(miniDat)), levels = keepGenes))

Sample ID in same order within design matrix and gene expression data. Therefore can put data frames together:

miniDat <- suppressWarnings(data.frame(prDes, miniDat))
str(miniDat)

## 'data.frame':    234 obs. of  6 variables:
##  $ sidChar : chr  "Sample_20" "Sample_21" "Sample_22" "Sample_23" ...
##  $ sidNum  : num  20 21 22 23 16 17 6 24 25 26 ...
##  $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ gType   : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 2 2 2 1 1 1 ...
##  $ gExp    : num  10.6 11 10.8 10.9 9.2 ...
##  $ gene    : Factor w/ 6 levels "1431708_a_at",..: 4 4 4 4 4 4 4 4 4 4 ...

Plot strip plot:

ggplot(miniDat, aes(gExp, gType, colour = gType)) + geom_point(alpha = 0.3) + 
    facet_wrap(~gene, scale = "free_x")  # x scale independent per probe

plot of chunk unnamed-chunk-17

Use data aggregation techniques to conduct some two group comparisons for each of these 6 genes. Compare gene expression by genotype across all 6 genes:

t.test(gExp ~ gType, miniDat)

## 
##  Welch Two Sample t-test
## 
## data:  gExp by gType
## t = 2.952, df = 225.3, p-value = 0.00349
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.2551 1.2791
## sample estimates:
##    mean in group wt mean in group NrlKO 
##               8.935               8.168

Conceptually, we want to make a sub-data.frame for each gene and provide in the place of someDat in a t test call like above. Sometimes that is a useful first step, when building up a data aggregation task.

Subset miniDat to one gene, the first value in keepGenes:

someDat <- droplevels(subset(miniDat, gene == keepGenes[1]))
t.test(gExp ~ gType, someDat)

## 
##  Welch Two Sample t-test
## 
## data:  gExp by gType
## t = 9.838, df = 36.89, p-value = 7.381e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.570 2.384
## sample estimates:
##    mean in group wt mean in group NrlKO 
##               9.554               7.578

But how do we scale this up to all 6 genes? Welcome pylr…

The plyr package

Since our input, miniDat, is a data.frame, we will use functions that start with d. What do we want to get back, if anything? If we are happy to watch the t test results fly by on the screen, we can use d_ply() with .print =TRUE. But we can also store results into a list using dlply():

ttRes <- dlply(miniDat, ~gene, function(x) t.test(gExp ~ gType, x))
names(ttRes)

## [1] "1431708_a_at" "1424336_at"   "1454696_at"   "1416119_at"  
## [5] "1432141_x_at" "1429226_at"

ttRes[["1454696_at"]]

## 
##  Welch Two Sample t-test
## 
## data:  gExp by gType
## t = 8.077, df = 33.49, p-value = 2.278e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.402 2.346
## sample estimates:
##    mean in group wt mean in group NrlKO 
##               12.85               10.97

If we knew in advance that we only wanted, say, the test statistic and the p-value, here's how we go after that directly via ddply():

(ttRes <- ddply(miniDat, ~gene, function(z) {
    zz <- t.test(gExp ~ gType, z)
    # round results to 4 digits
    round(c(tStat = zz$statistic, pVal = zz$p.value), 4)
}))

##           gene tStat.t   pVal
## 1 1431708_a_at  9.8380 0.0000
## 2   1424336_at  9.0607 0.0000
## 3   1454696_at  8.0771 0.0000
## 4   1416119_at -0.1840 0.8551
## 5 1432141_x_at -0.1324 0.8954
## 6   1429226_at  0.0983 0.9223

Ideas for take-home work

In our last example, can you edit the inner function to use the Wilcoxon or KS test? Or maybe do the t test, the Wilcoxon, and the KS test and return all 3 p-values?

Make function separate:

multiTest <- function(x) {
    tt <- t.test(gExp ~ gType, x)
    wt <- wilcox.test(gExp ~ gType, x)
    ks <- ks.test(x$gExp[x$gType == "wt"], x$gExp[x$gType == "NrlKO"])
    round(c(tStat = tt$p.val, wStat = wt$p.val, kStat = ks$p.val), 6)
}

Call function within ddply:

(multitRes <- suppressWarnings(ddply(miniDat, ~gene, multiTest)))

##           gene  tStat  wStat  kStat
## 1 1431708_a_at 0.0000 0.0000 0.0000
## 2   1424336_at 0.0000 0.0000 0.0000
## 3   1454696_at 0.0000 0.0000 0.0000
## 4   1416119_at 0.8551 0.9668 0.7128
## 5 1432141_x_at 0.8954 0.9440 0.4107
## 6   1429226_at 0.9223 0.7465 0.9947

Scale up to 100 genes:

set.seed(123)
lotsGenes <- sample(1:nrow(prDat), 100)
head(lotsGenes)

## [1]  8613 23609 12248 26443 28163  1365

lotsDat <- prDat[lotsGenes, ]
str(lotsDat, max.level = 0)

## 'data.frame':    100 obs. of  39 variables:

lotsDat <- data.frame(gExp = as.vector(t(as.matrix(lotsDat))), gene = factor(rep(rownames(lotsDat), 
    each = ncol(lotsDat)), levels = rownames(lotsDat)))
str(lotsDat)

## 'data.frame':    3900 obs. of  2 variables:
##  $ gExp: num  7.32 7.71 7.06 7.2 7.16 ...
##  $ gene: Factor w/ 100 levels "1427773_a_at",..: 1 1 1 1 1 1 1 1 1 1 ...

head(lotsDat)

##    gExp         gene
## 1 7.323 1427773_a_at
## 2 7.711 1427773_a_at
## 3 7.061 1427773_a_at
## 4 7.204 1427773_a_at
## 5 7.160 1427773_a_at
## 6 7.975 1427773_a_at

lotsDat <- suppressWarnings(data.frame(prDes, lotsDat))
multiRes <- suppressWarnings(ddply(lotsDat, ~gene, multiTest))
head(multiRes)

##           gene    tStat   wStat  kStat
## 1 1427773_a_at 0.581648 0.35060 0.3278
## 2   1451078_at 0.059321 0.07438 0.0960
## 3 1433507_a_at 0.450034 0.46493 0.3550
## 4   1455179_at 0.161915 0.21380 0.4338
## 5   1457691_at 0.930725 0.74659 0.7005
## 6   1417272_at 0.008463 0.01190 0.1171

tail(multiRes)

##             gene   tStat   wStat   kStat
## 95    1428944_at 0.56520 0.60707 0.48081
## 96    1423646_at 0.05899 0.03890 0.10826
## 97  1450732_a_at 0.82216 0.86612 0.89555
## 98    1419234_at 0.62483 0.75724 0.95769
## 99    1460690_at 0.45426 0.58774 0.43375
## 100 1437301_a_at 0.02642 0.01628 0.02179

summary(lm(multiRes$tStat ~ multiRes$wStat))

## 
## Call:
## lm(formula = multiRes$tStat ~ multiRes$wStat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3172 -0.0738 -0.0290  0.0730  0.4265 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.0355     0.0241    1.47     0.14    
## multiRes$wStat   0.9174     0.0431   21.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.137 on 98 degrees of freedom
## Multiple R-squared:  0.822,  Adjusted R-squared:  0.82 
## F-statistic:  452 on 1 and 98 DF,  p-value: <2e-16

Scatterplot various two-group tests against each other:

t.test vs. wilcox.test
Perform lm and obtain r-squared value:

fit <- summary(lm(multiRes$tStat ~ multiRes$wStat))
fit$r.squared

## [1] 0.8219

Plot together with r-squared value:

ggplot(multiRes, aes(tStat, wStat)) + geom_point() + stat_smooth(method = "lm", 
    se = FALSE, colour = "black") + geom_abline(intercept = 0, aes(colour = "red")) + 
    ggtitle(paste("P values from t.test vs. wilcox.test\nfor 100 random genes from prDat dataset\nr-squared =", 
        round(fit$r.squared, digits = 4)))

plot of chunk unnamed-chunk-26

t.test vs. ks.test

fit <- summary(lm(multiRes$tStat ~ multiRes$kStat))
fit$r.squared

## [1] 0.6263

ggplot(multiRes, aes(tStat, kStat)) + geom_point() + stat_smooth(method = "lm", 
    se = FALSE, colour = "black") + geom_abline(intercept = 0, aes(colour = "red")) + 
    ggtitle(paste("P values from t.test vs. ks.test\nfor 100 random genes from prDat dataset\nr-squared =", 
        round(fit$r.squared, digits = 4)))

plot of chunk unnamed-chunk-28

wilcox.test vs. ks.test

fit <- summary(lm(multiRes$wStat ~ multiRes$kStat))
fit$r.squared

## [1] 0.7261

ggplot(multiRes, aes(wStat, kStat)) + geom_point() + stat_smooth(method = "lm", 
    se = FALSE, colour = "black") + geom_abline(intercept = 0, aes(colour = "red")) + 
    ggtitle(paste("P values from wilcox.test vs. ks.test\nfor 100 random genes from prDat dataset\nr-squared =", 
        round(fit$r.squared, digits = 4)))

plot of chunk unnamed-chunk-30

Therefore, t.test and wilcox.test have the most concordance in p values, whereas the t.test and ks.test have the least concordance in p values.

Convert your numeric matrix of p-values into a matrix of TRUE/FALSE or zeros and ones by hard-threshholding, e.g. at the conventional 0.05 level. Use apply() or a function from plyr to make some interesting row or column summaries. How many genes are significant according to the different tests? For each gene, how many of the tests return a significant p-value? How many genes are “hits” by all 3 methods, by exactly 2, by exactly 1 and by none?

Start with multiRes dataset:

head(multiRes)

##           gene    tStat   wStat  kStat
## 1 1427773_a_at 0.581648 0.35060 0.3278
## 2   1451078_at 0.059321 0.07438 0.0960
## 3 1433507_a_at 0.450034 0.46493 0.3550
## 4   1455179_at 0.161915 0.21380 0.4338
## 5   1457691_at 0.930725 0.74659 0.7005
## 6   1417272_at 0.008463 0.01190 0.1171

Convert data frame into matrix by removing gene names:

multiResMat <- multiRes[, c("tStat", "wStat", "kStat")]
multiResMat <- multiResMat < 0.05
head(multiResMat)

##      tStat wStat kStat
## [1,] FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE
## [6,]  TRUE  TRUE FALSE

Use apply to make summaries for each test. Summary by test (column):

apply(multiResMat, 2, sum)

## tStat wStat kStat 
##    15    13     8

Summary by gene (row):

apply(multiResMat, 1, sum)

##   [1] 0 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 1 0 0 0
##  [36] 0 0 0 0 1 2 0 0 0 0 0 0 0 3 0 0 3 2 0 0 0 0 0 0 0 0 0 0 0 3 0 0 2 2 0
##  [71] 0 0 0 0 0 0 0 0 0 0 1 0 0 3 0 0 0 3 0 0 0 0 0 0 0 1 0 0 0 3

Now make summaries using plyr instead. First use reshape2 melt function to convert data into tall format:

multiResM <- melt(multiRes, id.var = "gene", variable.name = "method", value.name = "pVal")
str(multiResM)

## 'data.frame':    300 obs. of  3 variables:
##  $ gene  : Factor w/ 100 levels "1427773_a_at",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ method: Factor w/ 3 levels "tStat","wStat",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ pVal  : num  0.5816 0.0593 0.45 0.1619 0.9307 ...

head(multiResM)

##           gene method     pVal
## 1 1427773_a_at  tStat 0.581648
## 2   1451078_at  tStat 0.059321
## 3 1433507_a_at  tStat 0.450034
## 4   1455179_at  tStat 0.161915
## 5   1457691_at  tStat 0.930725
## 6   1417272_at  tStat 0.008463

Apply threshold as a function within ddply:

ddply(multiResM, ~method, summarize, pValAboveSig = sum(pVal < 0.05))

##   method pValAboveSig
## 1  tStat           15
## 2  wStat           13
## 3  kStat            8

test <- ddply(multiResM, ~gene, summarize, pValAboveSig = sum(pVal < 0.05))
head(test)

##           gene pValAboveSig
## 1 1427773_a_at            0
## 2   1451078_at            0
## 3 1433507_a_at            0
## 4   1455179_at            0
## 5   1457691_at            0
## 6   1417272_at            2

Can we summarize the above even further? Number of genes that had 0, 1, 2 and 3 p values < 0.05?

ddply(test, ~pValAboveSig, summarize, geneNumber = table(pValAboveSig))

##   pValAboveSig geneNumber
## 1            0         83
## 2            1          5
## 3            2          5
## 4            3          7

How can we get this summary straight from multiResM? Without making test?