Dataset description

The data is from a paper Generation of a microglial developmental index in mice and in humans reveals a sex difference in maturation and immune reactivity Published in Glia in 2017 and subsequent erratum in 2018 (Hanamsagar et al. 2018). The data was downloaded from GEO GSE99622, platform for RNA sequencing by Illumina HiSeq 2500. The total sample size is 60 with 3 replicates in Female embryonic 18 days (E18) and 4 replicates in male E18 days, 4 replicates for post-natal 5 day (P5) for both male and female, 10 replicates for P14 for both male and female and 7 P60 female replicates and 5 P60 male replicates, both treated with saline. Here two sets of study are conducted. For the developmental phase: E18, P4, P14 and P60_sal For LPS and Saline treated cells: P60_sal and P60_LPS

Theoritical background on Normalization

Normalization is a process of making data comparable within or across samples by minimizing the technical variations and retaining the biological variation (Bullard et al. 2010; Risso et al. 2014; Robinson & Oshlack 2010; Zhao, Yingdong et al. 2021; Zwiener, Frisch & Binder 2014). Normalization of the data is a crucial step which impacts the results of data analysis (Bullard et al. 2010; Evans, Hardin & Stoebel 2017; Risso et al. 2011; Robinson & Oshlack 2010; Zwiener, Frisch & Binder 2014). Several benchmarking studies on normalization have been published however, the consensus which is the most accurate is lagging (Bullard et al. 2010; Dillies et al. 2013; Li, P et al. 2015; Lin et al. 2016; Maza et al. 2013; Zwiener, Frisch & Binder 2014; Zyprych-Walczak et al. 2015). The normalization techniques can be assessed by comparing the results with real-time qPCR for benchmarking the true expression values (Li, P et al. 2015). Assessing the suitability of a selected normalization method can be challenging, but one approach is to employ multiple methods and evaluate the consistency of the outcomes.

Intersample normalization is comparison of a gene present in different samples. This type of comparison should correct sequencing depth variation, composition bias, batch effects and other unknown technical variations between samples. Biasses present in intersample comparison

  1. Sequencing depth
  2. Composition bias
  3. Batch effects
  4. Other unknown artifacts

Intersample normalization is essential for differential gene expression analysis. Sequencing depth, also known as, library size is total read counts mapped for a given sample. There is no optimum level of sequencing depth as it depends on complexity of the organism and aim of the experiment (Conesa et al. 2016). A deep sequencing which might improve the quantification and identification of transcripts leading to novel isoform detection; however, it may also detect transcriptional noise and off-target transcripts (Conesa et al. 2016). Sequencing different samples within a single experimental design may have different sequencing depths because total number of representative RNA molecules in cDNA library can vary (Dillies et al. 2013). Composition bias refers to the types and amounts of different RNAs in a sample. Composition bias can occur when some RNAs are more abundant in one group compared to another. This impact becomes more pronounced when a subset of genes within the RNA population is either exclusive to or characterized by high expression levels within a particular experimental condition, a reason might be due to contamination as well. When such a scenario occurs, a significant proportion of the sequencing resources is allocated to capturing reads from these specific genes, which decreases the reads for other genes causing under sampling, if not explicitly accounted for, this can cause composition bias (Robinson & Oshlack 2010).

  1. No normalisation (negative control)
  2. This negative control has only CPM normalization and log transformation, CPM (Counts per million mapped reads) is a basic gene expression unit, used to correct the sequencing depth variation.

  3. TMM normalisation
  4. TMM (trimmed mean of M-values) is used for correcting the composition bias and is incorporated into the edgeR package as default to normalize data for differential gene expression. In the TMM algorithm, first a reference sample is chosen. A reference sample is one which is near to the average mean of all the samples. Then, genes are selected to create a scaling factor, for this, the genes which are not outliers are selected (outliers can be known by observing log fold differences). Log fold values are trimmed from top 30% and bottom 30% and mean of logs are trimmed from top 5% and bottom 5%. The genes which are overlapping from both trimmed log fold values and mean of logs are used to calculated scaling factor. Finally, the original read counts are divided by this scaling factor (Robinson & Oshlack 2010). How TMM is choosing a reference sample and creating scaling factors is the major difference with other techniques and this makes it as robust technique to handle outliers.
  5. TMMwsp normalisation
  6. TMMwsp (TMM with singleton pairing) is a variant of TMM to handle the zero inflated data in edgeR package. In TMM, if a gene has zero count in a pair of libraries while comparing for a reference sample, then it is ignored, but in TMMwsp it is not ignored.
  7. UQ normalisation
  8. Quantile normalization was first proposed for microarray data (Bolstad et al. 2003). The upper quartile (UQ) normalization which is determined by applying the 0.75 quantile to the gene counts from all runs, is applied in RNA sequencing data. Scaling factors are calculated from the 75% quantile of the counts of each library (Bullard et al. 2010).

  9. RLE normalisation
  10. RLE (relative log expression) in DESeq2 Bioconductor package is similar to the TMM, in which, first a reference sample is created by calculating row-wise geometric mean which can handle outlier data. Then a scaling factor is calculated by using the median of the ratio for each sample in log base. The median is another way to handle extreme values. Then, median is converted to natural numbers from log base to get final the scaling factor. Count values are normalized using these scaling factor (Anders & Huber 2010b)

TMM, TMMwsp, UQ and RLE are widely used techniques for differential expression. They have an important statistical assumption that the majority of genes should not be differentially expressed (Evans, Hardin & Stoebel 2017; Zhao, Yaxing, Wong & Goh 2020)

Read data and meta data

count_data <- read.csv('/Users/nishapaudel/Documents/Richa_bulk_rna_seq/countss', sep =',', header=T, row.names = 1)
sampleinfo <- read.csv("/Users/nishapaudel/Documents/Richa_bulk_rna_seq/design.txt", sep = ',', header = TRUE, row.names = 1, stringsAsFactors = TRUE)
dim(count_data)
## [1] 21986    60

Check the sequencing depth

par(cex.axis = 0.8)
sums <- colSums(count_data)
sums <- sums[order(sums)]
barplot(sums,space = 1 ,las = 2, xlab="", col = "pink")



abline(v=1500000, col = 'red')
# Add a title to the plot
title("Barplot of library sizes")

Load the libraries

library(edgeR)
library(limma)
y <- DGEList(count_data)

adding group info in object

group <- paste(sampleinfo$Treatment, sampleinfo$Sex, sampleinfo$Age, sep = ".") #
group <- factor(group)
#head(group)
# Add the group information into the DGEList
y$samples$group <- group

Filter lowly expressed genes

keep <- filterByExpr(y, group = group)
y <- y[keep, , keep.lib.sizes=FALSE]
dim(y)
## [1] 13324    60

##No normalisation (Negative control) Only cpm log transformed data, cpm normalises the sequencing depth but not compositional bias. Here scale factor is set 1 for all.

MDS plot: The distance between each pair of samples in the MDS plot is calculated as the leading fold change, defined as the root-mean-square of the largest 500 log2-fold changes between that pair of samples. Replicate samples from the same group cluster together in the plot, while samples from different groups form separate clusters. This indicates that the differences between groups are larger than those within groups, i.e., differential expression is greater than the variance and can be detected.

no_normalise <- calcNormFactors(y, method = "none")
logcount <- cpm(no_normalise,log=TRUE) 
boxplot(logcount, xlab="", ylab="not normalized",las=2, col = "grey",cex.axis=0.69 ,main="No normalisation")
abline(h=median(logcount),col="black")

plotMDS(no_normalise, col= as.numeric(group), main = "MDS Plot of No normalization")

##TMM

TMM <- calcNormFactors(y, method = "TMM")
logcounts <- cpm(TMM,log=TRUE) # after getting size factors using cpm will give TMM calculated values
boxplot(logcounts, xlab="", ylab="TMM normalised counts",las=2, col = "mistyrose1",cex.axis=0.69 ,main="TMM normalised")
## Let's add a blue horizontal line that corresponds to the median logCPM
abline(h=median(logcounts),col="black")

plotMDS(TMM, col= as.numeric(group))

##TMMwsp

TMMwsp <- calcNormFactors(y, method = "TMMwsp")
logcountswsp <- cpm(TMMwsp,log=TRUE) 
boxplot(logcountswsp, xlab="", ylab="TMMwsp normalized",las=2, col = "aquamarine",cex.axis=0.69 ,main="TMMwsp normalised")
abline(h=median(logcountswsp),col="black")

plotMDS(TMMwsp, col= as.numeric(group))

##Upper Quartile (UQ)

UQ <- calcNormFactors(y, method = "upperquartile")
logcountsuq <- cpm(TMMwsp,log=TRUE) 
boxplot(logcountsuq, xlab="", ylab="upperquartile normalized counts",las=2, col = "coral",cex.axis=0.69 ,main="Upperquartile normalised counts")
abline(h=median(logcountsuq),col="black")

plotMDS(UQ, col= as.numeric(group))

##RLE(Relative log expression)

RLE <- calcNormFactors(y, method = "RLE")
logcountsRLE <- cpm(RLE,log=TRUE) 
boxplot(logcountsRLE, xlab="", ylab="RLE normalized counts",las=2, col = "darkolivegreen1",cex.axis=0.69 ,main="RLE normalised counts")
abline(h=median(logcountsRLE),col="black")

plotMDS(RLE, col= as.numeric(group))

design <- model.matrix(~0+group)
colnames(design)<-levels(group)
#design

##Create contrast (immune activated female vs male)

SalvsLPS <- makeContrasts(LPS.Female.P60-LPS.Male.P60  , levels = design)

##DEGs from all normalisation techniques

For DEGs, https://f1000research.com/articles/9-1444, manual is followed. It is based on quasilikelihood pipeline, NB GLM model with F tests -because of strict error control rate than traditional edgeR and DESeq2

The estimateDisp function maximizes the negative binomial likelihood to give the estimation of common, trended, gene-wise dispersion across all the genes, then empirical Bayes method was  used to estimate posterior dispersion, robust = TRUE means, the estimate should be robust against the outliers. glmQLFFit function performs empirical Bayes quasi likelihood F-Tests. Multiple test correction was performed by applying the Benjamini-Hochberg method on the P values to control False positive discovery rate (FDR). 

DEGs from No normalization

no_normalise <-estimateDisp(no_normalise,design, robust = TRUE)
no_normalise <-glmQLFit(no_normalise,design, robust =TRUE)
no_normalise <-glmQLFTest(no_normalise, contrast = SalvsLPS)
summary(decideTests(no_normalise))
##        1*LPS.Female.P60 -1*LPS.Male.P60
## Down                                 21
## NotSig                            13293
## Up                                   10
table(p.adjust(no_normalise$table$PValue, method = "BH") < 0.05)
## 
## FALSE  TRUE 
## 13293    31

DEGs from TMM

TMM <-estimateDisp(TMM,design, robust = TRUE)
TMM <-glmQLFit(TMM,design, robust =TRUE)
TMM <-glmQLFTest(TMM, contrast = SalvsLPS)
summary(decideTests(TMM))
##        1*LPS.Female.P60 -1*LPS.Male.P60
## Down                                 58
## NotSig                            13254
## Up                                   12
table(p.adjust(TMM$table$PValue, method = "BH") < 0.05)
## 
## FALSE  TRUE 
## 13254    70

DEGs from TMMwsp

TMMwsp <-estimateDisp(TMMwsp,design, robust = TRUE)
TMMwsp <-glmQLFit(TMMwsp,design, robust =TRUE)
TMMwsp <-glmQLFTest(TMMwsp, contrast = SalvsLPS)
summary(decideTests(TMMwsp))
##        1*LPS.Female.P60 -1*LPS.Male.P60
## Down                                 57
## NotSig                            13254
## Up                                   13
table(p.adjust(TMMwsp$table$PValue, method = "BH") < 0.05)
## 
## FALSE  TRUE 
## 13254    70

DEGs from UQ

UQ <-estimateDisp(UQ,design, robust = TRUE)
UQ <-glmQLFit(UQ,design, robust =TRUE)
UQ <-glmQLFTest(UQ, contrast = SalvsLPS)
summary(decideTests(UQ))
##        1*LPS.Female.P60 -1*LPS.Male.P60
## Down                                 45
## NotSig                            13268
## Up                                   11
table(p.adjust(UQ$table$PValue, method = "BH") < 0.05)
## 
## FALSE  TRUE 
## 13268    56

DEGs from RLE

RLE <-estimateDisp(RLE,design, robust = TRUE)
RLE <-glmQLFit(RLE,design, robust =TRUE)
RLE <-glmQLFTest(RLE, contrast = SalvsLPS)
summary(decideTests(RLE))
##        1*LPS.Female.P60 -1*LPS.Male.P60
## Down                                 68
## NotSig                            13245
## Up                                   11
table(p.adjust(RLE$table$PValue, method = "BH") < 0.05)
## 
## FALSE  TRUE 
## 13245    79
get_de <- function(x, pvalue){
  my_i <- p.adjust(x$PValue, method="BH") < pvalue
  row.names(x)[my_i]
}
 
de_no_norm <- get_de(no_normalise$table, 0.05)
de_tmm <- get_de(TMM$table, 0.05)
de_tmmwsp <- get_de(TMMwsp$table, 0.05)
de_rle <- get_de(RLE$table, 0.05)
de_uq <- get_de(UQ$table, 0.05)
#de_uq

VennDiagram of DEG list

library(VennDiagram)
# Combine the lists into a named list for the Venn diagram
venn_data <- list(no_norm = de_no_norm, TMM = de_tmm, TMMwsp = de_tmmwsp, UQ = de_uq,RLE = de_rle)

# Create the Venn diagram with custom colors
vennColors <- c("grey", "mistyrose1", "aquamarine", "coral", "darkolivegreen1")

# Set up the Venn diagram configuration
venn_config <- venn.diagram(
  x = venn_data,
  category.names = names(venn_data),
  fill = vennColors,
  filename = NULL, # Set filename to NULL to display the plot directly in R session
  imagetype = "png", # You can change this to other formats like "pdf", "jpeg", etc.
  col = "transparent", # Set background color to transparent for better display in R session
  cex = 1.4,
  cex.label = 2,
  lwd = 2 
)

# Display the Venn diagram
grid.newpage()
grid.draw(venn_config)

get the unique genes

# Get the unique elements in TMM
unique_rle <- setdiff(venn_data$RLE, unlist(venn_data[-which(names(venn_data) == "RLE")]))

# Print the unique elements
print(unique_rle)
##  [1] "3110043O21Rik" "Ankrd57"       "Ccl3"          "Cpsf7"        
##  [5] "Csf2rb"        "Dnajc21"       "Pfkfb3"        "Runx1"        
##  [9] "Srsf11"        "Tlr2"          "Tmem87b"
# Get the unique elements in TMMwsp
#unique_no <- setdiff(venn_data$no_norm, unlist(venn_data[-which(names(venn_data) == "no_norm")]))
#print(unique_no)
# Get the unique elements in TMMwsp
#unique_UQ <- setdiff(venn_data$UQ, unlist(venn_data[-which(names(venn_data) == "UQ")]))
#print(unique_UQ)
#Get the unique elements in TMMwsp
unique_RLE <- setdiff(venn_data$RLE, unlist(venn_data[-which(names(venn_data) == "RLE")]))
print(unique_RLE)
##  [1] "3110043O21Rik" "Ankrd57"       "Ccl3"          "Cpsf7"        
##  [5] "Csf2rb"        "Dnajc21"       "Pfkfb3"        "Runx1"        
##  [9] "Srsf11"        "Tlr2"          "Tmem87b"

Save all the files for future reference

#write.csv(no_normalise$table, file ="./others/no_normalisation_treated.csv" , quote = FALSE, row.names = T)

#write.csv(TMM$table, file ="./others/TMM_treated.csv" , quote = FALSE, row.names = T)

#write.csv(TMMwsp$table, file ="./others/TMMwsp_treated.csv" , quote = FALSE, row.names = T)

#write.csv(UQ$table, file ="./others/UQ_treated.csv" , quote = FALSE, row.names = T)

#write.csv(RLE$table, file ="./others/RLE_treated.csv" , quote = FALSE, row.names = T)

#write.csv(unique_RLE, file ="./others/RLE_unique_elements.csv" , quote = FALSE)

Conclusion

All the samples had similar median values observed in the boxplots in all four normalisation approaches except for the boxplot negative control where slight fluctuation was observed.

Since, many differentially expressed genes are overlapping, any one of the four techniques can be used for this dataset. The best way to determine which technique is the most precise and accurate showing the true biological signal would be conducting the RT -qPCR for the unique gene sets we have from the Venn diagram. From the statistical point of view, TMMwsp has a more inclusive algorithm than TMM, it does not remove genes while comparing the two libraries even if one of them has zero count. TMMwsp can be used as method of preference to minimize gene loss due to zeros. RLE has the highest number of DEGs and RLE and UQ has many overlapping genes. RLE is my preference for this dataset as I am looking for more number of DEGs. However, for validation of this in silico conclusion, in vitro or in vivo experiments must be conducted.

Session info

sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS 14.0
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Australia/Melbourne
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] VennDiagram_1.7.3   futile.logger_1.4.3 edgeR_3.42.4       
## [4] limma_3.56.2       
## 
## loaded via a namespace (and not attached):
##  [1] cli_3.6.1            knitr_1.43           rlang_1.1.1         
##  [4] xfun_0.40            highr_0.10           jsonlite_1.8.7      
##  [7] statmod_1.5.0        htmltools_0.5.6      formatR_1.14        
## [10] sass_0.4.7           locfit_1.5-9.8       rmarkdown_2.24      
## [13] evaluate_0.21        jquerylib_0.1.4      fastmap_1.1.1       
## [16] yaml_2.3.7           compiler_4.3.0       Rcpp_1.0.11         
## [19] lambda.r_1.2.4       rstudioapi_0.15.0    lattice_0.21-8      
## [22] digest_0.6.33        R6_2.5.1             futile.options_1.0.1
## [25] splines_4.3.0        bslib_0.5.1          tools_4.3.0         
## [28] cachem_1.0.8