Example R code solutions for the Week 7 Science/Health Science/Data Science Computer Lab, which uses data from Series GSE184932 on the NCBI Gene Expression Omnibus website are presented below.
This computer lab is designed to run alongside the content in the Introduction to Bioinformatics in R supplement. It might be helpful to have this material open as you look through these solutions.
No answer required.
p.adjust
functionExample R code is provided below.
base_p_values <- c(0.0003, 0.0085, 0.001, 0.0001, 0.045, 0.62, 0.009, 0.18, 0.92, 0.02)
There are 7 \(p\)-values below 0.05, so we would find 7 genes to be significantly differentially expressed.
p.adjust(base_p_values, method = "bonferroni")
## [1] 0.003 0.085 0.010 0.001 0.450 1.000 0.090 1.000 1.000 0.200
Using the Bonferroni correction, we find that only 3 of the \(p\)-values are now below 0.05.
No answer required.
Example R code is provided below.
# Check all p-values
p.adjust(base_p_values, method = "holm")
## [1] 0.0027 0.0595 0.0080 0.0010 0.1800 1.0000 0.0595 0.5400 1.0000 0.1000
p.adjust(base_p_values, method = "hochberg")
## [1] 0.0027 0.0540 0.0080 0.0010 0.1800 0.9200 0.0540 0.5400 0.9200 0.1000
p.adjust(base_p_values, method = "hommel")
## [1] 0.0027 0.0510 0.0080 0.0010 0.1800 0.9200 0.0540 0.5400 0.9200 0.1000
# Compute number of significant p-values
c(sum(p.adjust(base_p_values, method = "holm") < .05),
sum(p.adjust(base_p_values, method = "hochberg") < .05),
sum(p.adjust(base_p_values, method = "hommel") < .05)
)
## [1] 3 3 3
All three methods result in 3 \(p\)-values below 0.05. The Hochberg and Hommel methods produce near-identical results.
Example R code is provided below.
p.adjust(base_p_values, method = "fdr")
## [1] 0.001500000 0.018000000 0.003333333 0.001000000 0.064285714 0.688888889
## [7] 0.018000000 0.225000000 0.920000000 0.033333333
We see that there are 6 \(p\)-values below 0.05. The FDR method has been less stringent than the FWER methods, but has still removed one gene from contention.
No answer required
Example R code is provided below.
sars_data <- read.csv(file = "sars_data.csv", header = T)
head(sars_data)
## X...gene_id gene_name gene_biotype class pvalue.Mock_vs_VIC_2h
## 1 ENSG00000227232 WASH7P unprocessed_pseudogene human 0.5676453
## 2 ENSG00000278267 MIR6859-1 miRNA human 0.5187192
## 3 ENSG00000237613 FAM138A lncRNA human 0.5946020
## 4 ENSG00000238009 AL627309.1 lncRNA human 0.5253038
## 5 ENSG00000268903 AL627309.6 processed_pseudogene human 0.9834197
## 6 ENSG00000269981 AL627309.7 processed_pseudogene human 0.6993969
## pvalue.Mock_vs_VIC_8h pvalue.Mock_vs_VIC_24h
## 1 0.8024090 0.9301996
## 2 0.7168021 0.3527285
## 3 0.5627110 0.8575999
## 4 0.8059737 0.9089223
## 5 0.3954552 0.4628377
## 6 0.6707460 0.9460855
summary(sars_data)
## X...gene_id gene_name gene_biotype class
## Length:30491 Length:30491 Length:30491 Length:30491
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## pvalue.Mock_vs_VIC_2h pvalue.Mock_vs_VIC_8h pvalue.Mock_vs_VIC_24h
## Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.01217 1st Qu.:0.04519 1st Qu.:0.1410
## Median :0.33624 Median :0.40727 Median :0.4893
## Mean :0.38358 Mean :0.42098 Mean :0.4710
## 3rd Qu.:0.72346 3rd Qu.:0.75839 3rd Qu.:0.7836
## Max. :0.99996 Max. :0.99994 Max. :1.0000
# First, we select only the values in the
# `pvalue.Mock_vs_VIC_2h` column that are less than 0.01
sars_data_2h_base <- sars_data[sars_data$pvalue.Mock_vs_VIC_2h < 0.01,]
# Then we assess the number of rows remaining - each row represents a gene
nrow(sars_data_2h_base)
## [1] 7427
So based on the base \(p\)-values, there are 7427 differentially expressed genes at the 2 hour post-infection time point.
sars_data$pvalue_VIC_2h_fdr <- p.adjust(sars_data$pvalue.Mock_vs_VIC_2h, method = "fdr")
Example R code is provided below.
sars_data_low_pvalue_genes_2h <- sars_data[sars_data$pvalue_VIC_2h_fdr < 0.01,]
dim(sars_data_low_pvalue_genes_2h)
## [1] 6179 8
We can see that following FDR correction we now have 6179 significant genes, a reduction of 1248 genes.
Example R code for the 8 hour time point is provided below:
sars_data_8h_base <- sars_data[sars_data$pvalue.Mock_vs_VIC_8h < 0.01,]
dim(sars_data_8h_base)
## [1] 5904 8
sars_data$pvalue_VIC_8h_fdr <- p.adjust(sars_data$pvalue.Mock_vs_VIC_8h, method = "fdr")
sars_data_low_pvalue_genes_8h <- sars_data[sars_data$pvalue_VIC_8h_fdr < 0.01,]
dim(sars_data_low_pvalue_genes_8h)
## [1] 4652 9
Here we obtain 4652 FDR corrected significant genes, reduced from 5904 identified using the base \(p\)-values.
Example R code for the 24 hour time point is provided below:
sars_data_24h_base <- sars_data[sars_data$pvalue.Mock_vs_VIC_24h < 0.01,]
dim(sars_data_24h_base)
## [1] 3312 9
sars_data$pvalue_VIC_24h_fdr <- p.adjust(sars_data$pvalue.Mock_vs_VIC_24h, method = "fdr")
sars_data_low_pvalue_genes_24h <- sars_data[sars_data$pvalue_VIC_24h_fdr < 0.01,]
dim(sars_data_low_pvalue_genes_24h)
## [1] 1907 10
Here we obtain 1907 FDR corrected significant genes, reduced from 3312 identified using the base \(p\)-values.
For this example, it might be important to compare gene expression differences between different time points in order to determine appropriate treatments (e.g. which drug(s) will be most efficacious when treating patients), or in a broader context, when compared with other data, to help identify between different coronavirus strains. Can you think of any other reasons?
These notes have been prepared by Rupert Kuveke. The copyright for the material in these notes resides with the author named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.