Exploring the gene expression data

Gene expression table of 6 tissues represented by 1000 genes each. Please, download the data from the Google drive link. “Gene_expression_table.txt”

library(tidyverse)
exp <- read.delim("Gene_expression_table.txt", header = T, row.names = 1)
head(exp)
##        Sample_A_rep1 Sample_A_rep2 Sample_A_rep3 Sample_B_rep1 Sample_B_rep2
## Gene_1    22.9429000     23.674800     25.902100    22.5873000    23.4013000
## Gene_2     0.0966727      0.000000      0.282938     0.1273130     0.4440570
## Gene_3    52.2926000     48.391900     53.052200    60.5590000    72.4147000
## Gene_4     0.3252890      0.313256      0.512266     0.0556837     0.0719282
## Gene_5     0.5296930      0.629770      2.150640     1.1608300     1.1623000
## Gene_6     0.0000000      0.000000      0.000000     0.0000000     0.0458154
##        Sample_B_rep3
## Gene_1    22.3985000
## Gene_2     0.0835085
## Gene_3    53.1314000
## Gene_4     0.2759260
## Gene_5     1.4716900
## Gene_6     0.0000000

Exploratory plots

reshape2::melt(exp) %>% ggplot(aes(variable, log(value+0.01) )) + geom_boxplot()
## No id variables; using all as measure variables

Calculate gene expression variance across samples (row wise) using apply().

exp.vars <- apply(exp, 1, var, na.rm=TRUE)#row wise variance

#sample.vars <- apply(exp, 2, mean, na.rm=TRUE)#col wise mean

#Calculate gene mean expression
exp.means <- apply(exp, 1, mean, na.rm=T)

Examine the relationship between mean gene expression and its variance across samples.

plot(exp.means, exp.vars, pch=20)

Coefficient of variation (CV) is a better estimation of variance against mean value of gene expression when selecting gene features. Let’s see how to examine CV.

exp.sd <- apply(exp, 1, sd, na.rm=T)

gene_cv <- exp.sd/exp.means

plot(exp.means, gene_cv, pch=20, col="blue")

Testing gene expression values between samples of two conditions.

The Null hypthesis is that the mean expression of genes are same for both sample set.

pValues <- apply(exp, 1, function(x) t.test(x[1:3], x[4:6])$p.value)

gene_mean <- apply(exp, 1, mean, na.rm=T)

#Plot the p-values agains mean gene expression
plot(gene_mean, pValues, pch=20, col="red")

#Transform p-values
plot(gene_mean, -1*log10(pValues), pch=20, col="red")
abline(h = -1*log10(0.05), col="blue")

#Number of tests rejected the Null hypothesis
sum(pValues < 0.05, na.rm = T)
## [1] 190
Mutliple hypothesis testing correction

“A gene with a significance cut-off of p < 0.05, means there is a 5% chance it is a false positive. For example, if we test 20,000 genes for differential expression, at p < 0.05 we would expect to find 1,000 genes by chance. If we found 3000 genes to be differentially expressed total, roughly one third of our genes are false positives! We would not want to sift through our “significant” genes to identify which ones are true positives.

There are a few common approaches for multiple test correction:

– Bonferroni: The adjusted p-value is calculated by: p-value * m (m = total number of tests). This is a very conservative approach with a high probability of false negatives, so is generally not recommended.

– FDR/Benjamini-Hochberg: Benjamini and Hochberg (1995) defined the concept of False Discovery Rate (FDR) and created an algorithm to control the expected FDR below a specified level given a list of independent p-values. More info about BH.source

p.adj <- p.adjust(pValues)
The importance of proper gene filtering

Reducing the number of potential tests will improve multiple hypothesis testing correction.

sum(gene_mean > 1.0 & gene_cv > 0.25, na.rm = T)
## [1] 354
exp.sub <- exp[gene_mean > 1.0 & gene_cv > 0.25,] 

pValues.sub <- apply(exp.sub, 1, function(x) t.test(x[1:3], x[4:6])$p.value)

p.adj.sub <- p.adjust(pValues.sub)
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.9     purrr_0.3.4    
## [5] readr_2.1.2     tidyr_1.2.0     tibble_3.1.8    ggplot2_3.4.0  
## [9] tidyverse_1.3.2
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.9          lubridate_1.8.0     assertthat_0.2.1   
##  [4] digest_0.6.29       utf8_1.2.2          R6_2.5.1           
##  [7] cellranger_1.1.0    plyr_1.8.7          backports_1.4.1    
## [10] reprex_2.0.2        evaluate_0.16       highr_0.9          
## [13] httr_1.4.4          pillar_1.8.1        rlang_1.0.6        
## [16] googlesheets4_1.0.1 readxl_1.4.1        rstudioapi_0.13    
## [19] jquerylib_0.1.4     rmarkdown_2.15      labeling_0.4.2     
## [22] googledrive_2.0.0   munsell_0.5.0       broom_1.0.0        
## [25] compiler_4.1.0      modelr_0.1.9        xfun_0.32          
## [28] pkgconfig_2.0.3     htmltools_0.5.3     tidyselect_1.1.2   
## [31] fansi_1.0.3         crayon_1.5.1        tzdb_0.3.0         
## [34] dbplyr_2.2.1        withr_2.5.0         grid_4.1.0         
## [37] jsonlite_1.8.0      gtable_0.3.0        lifecycle_1.0.3    
## [40] DBI_1.1.3           magrittr_2.0.3      scales_1.2.0       
## [43] cli_3.4.1           stringi_1.7.8       cachem_1.0.6       
## [46] farver_2.1.1        reshape2_1.4.4      fs_1.5.2           
## [49] xml2_1.3.3          bslib_0.4.0         ellipsis_0.3.2     
## [52] generics_0.1.3      vctrs_0.5.1         tools_4.1.0        
## [55] glue_1.6.2          hms_1.1.2           fastmap_1.1.0      
## [58] yaml_2.3.5          colorspace_2.0-3    gargle_1.2.0       
## [61] rvest_1.0.2         knitr_1.39          haven_2.5.0        
## [64] sass_0.4.2