Gene expression table of 6 tissues represented by 1000 genes each. Please, download the data from the Google drive link. “Gene_expression_table.txt”
library(tidyverse)
exp <- read.delim("Gene_expression_table.txt", header = T, row.names = 1)
head(exp)
## Sample_A_rep1 Sample_A_rep2 Sample_A_rep3 Sample_B_rep1 Sample_B_rep2
## Gene_1 22.9429000 23.674800 25.902100 22.5873000 23.4013000
## Gene_2 0.0966727 0.000000 0.282938 0.1273130 0.4440570
## Gene_3 52.2926000 48.391900 53.052200 60.5590000 72.4147000
## Gene_4 0.3252890 0.313256 0.512266 0.0556837 0.0719282
## Gene_5 0.5296930 0.629770 2.150640 1.1608300 1.1623000
## Gene_6 0.0000000 0.000000 0.000000 0.0000000 0.0458154
## Sample_B_rep3
## Gene_1 22.3985000
## Gene_2 0.0835085
## Gene_3 53.1314000
## Gene_4 0.2759260
## Gene_5 1.4716900
## Gene_6 0.0000000
Exploratory plots
reshape2::melt(exp) %>% ggplot(aes(variable, log(value+0.01) )) + geom_boxplot()
## No id variables; using all as measure variables
Calculate gene expression variance
across samples (row
wise) using apply()
.
exp.vars <- apply(exp, 1, var, na.rm=TRUE)#row wise variance
#sample.vars <- apply(exp, 2, mean, na.rm=TRUE)#col wise mean
#Calculate gene mean expression
exp.means <- apply(exp, 1, mean, na.rm=T)
Examine the relationship between mean gene expression and its variance across samples.
plot(exp.means, exp.vars, pch=20)
Coefficient of variation (CV) is a better estimation of variance against mean value of gene expression when selecting gene features. Let’s see how to examine CV.
exp.sd <- apply(exp, 1, sd, na.rm=T)
gene_cv <- exp.sd/exp.means
plot(exp.means, gene_cv, pch=20, col="blue")
Testing gene expression values between samples of two conditions.
The Null hypthesis is that the mean expression of genes are same for both sample set.
pValues <- apply(exp, 1, function(x) t.test(x[1:3], x[4:6])$p.value)
gene_mean <- apply(exp, 1, mean, na.rm=T)
#Plot the p-values agains mean gene expression
plot(gene_mean, pValues, pch=20, col="red")
#Transform p-values
plot(gene_mean, -1*log10(pValues), pch=20, col="red")
abline(h = -1*log10(0.05), col="blue")
#Number of tests rejected the Null hypothesis
sum(pValues < 0.05, na.rm = T)
## [1] 190
“A gene with a significance cut-off of p < 0.05, means there is a 5% chance it is a false positive. For example, if we test 20,000 genes for differential expression, at p < 0.05 we would expect to find 1,000 genes by chance. If we found 3000 genes to be differentially expressed total, roughly one third of our genes are false positives! We would not want to sift through our “significant” genes to identify which ones are true positives.
There are a few common approaches for multiple test correction:
– Bonferroni: The adjusted p-value is calculated by: p-value * m (m = total number of tests). This is a very conservative approach with a high probability of false negatives, so is generally not recommended.
– FDR/Benjamini-Hochberg: Benjamini and Hochberg (1995) defined the concept of False Discovery Rate (FDR) and created an algorithm to control the expected FDR below a specified level given a list of independent p-values. More info about BH.” source
p.adj <- p.adjust(pValues)
Reducing the number of potential tests will improve multiple hypothesis testing correction.
sum(gene_mean > 1.0 & gene_cv > 0.25, na.rm = T)
## [1] 354
exp.sub <- exp[gene_mean > 1.0 & gene_cv > 0.25,]
pValues.sub <- apply(exp.sub, 1, function(x) t.test(x[1:3], x[4:6])$p.value)
p.adj.sub <- p.adjust(pValues.sub)
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.9 purrr_0.3.4
## [5] readr_2.1.2 tidyr_1.2.0 tibble_3.1.8 ggplot2_3.4.0
## [9] tidyverse_1.3.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.9 lubridate_1.8.0 assertthat_0.2.1
## [4] digest_0.6.29 utf8_1.2.2 R6_2.5.1
## [7] cellranger_1.1.0 plyr_1.8.7 backports_1.4.1
## [10] reprex_2.0.2 evaluate_0.16 highr_0.9
## [13] httr_1.4.4 pillar_1.8.1 rlang_1.0.6
## [16] googlesheets4_1.0.1 readxl_1.4.1 rstudioapi_0.13
## [19] jquerylib_0.1.4 rmarkdown_2.15 labeling_0.4.2
## [22] googledrive_2.0.0 munsell_0.5.0 broom_1.0.0
## [25] compiler_4.1.0 modelr_0.1.9 xfun_0.32
## [28] pkgconfig_2.0.3 htmltools_0.5.3 tidyselect_1.1.2
## [31] fansi_1.0.3 crayon_1.5.1 tzdb_0.3.0
## [34] dbplyr_2.2.1 withr_2.5.0 grid_4.1.0
## [37] jsonlite_1.8.0 gtable_0.3.0 lifecycle_1.0.3
## [40] DBI_1.1.3 magrittr_2.0.3 scales_1.2.0
## [43] cli_3.4.1 stringi_1.7.8 cachem_1.0.6
## [46] farver_2.1.1 reshape2_1.4.4 fs_1.5.2
## [49] xml2_1.3.3 bslib_0.4.0 ellipsis_0.3.2
## [52] generics_0.1.3 vctrs_0.5.1 tools_4.1.0
## [55] glue_1.6.2 hms_1.1.2 fastmap_1.1.0
## [58] yaml_2.3.5 colorspace_2.0-3 gargle_1.2.0
## [61] rvest_1.0.2 knitr_1.39 haven_2.5.0
## [64] sass_0.4.2