This code is produced to perform Differential Expression Analysis on RNA Seq Data. And then to perform a pathway enrichment analysis. Lets dive into the analysis with no further due.

Loading the necessary libraries, first. If any of this packages is not already installed on the local machine, the code to install the libraries is given below:

Tip: Names of libraries are case sensitive.

if (!require(“BiocManager”, quietly = TRUE)) install.packages(“BiocManager”)

BiocManager::install(“Package name”)

# Load necessary libraries
library(GEOquery)

## Loading required package: Biobase

## Loading required package: BiocGenerics

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
##     tapply, union, unique, unsplit, which.max, which.min

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## Setting options('download.file.method.GEOquery'='auto')

## Setting options('GEOquery.inmemory.gpl'=FALSE)

library(DESeq2)

## Loading required package: S4Vectors

## Loading required package: stats4

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:utils':
## 
##     findMatches

## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname

## Loading required package: IRanges

## 
## Attaching package: 'IRanges'

## The following object is masked from 'package:grDevices':
## 
##     windows

## Loading required package: GenomicRanges

## Loading required package: GenomeInfoDb

## Loading required package: SummarizedExperiment

## Loading required package: MatrixGenerics

## Loading required package: matrixStats

## 
## Attaching package: 'matrixStats'

## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians

## 
## Attaching package: 'MatrixGenerics'

## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars

## The following object is masked from 'package:Biobase':
## 
##     rowMedians

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::%within%() masks IRanges::%within%()
## ✖ dplyr::collapse()     masks IRanges::collapse()
## ✖ dplyr::combine()      masks Biobase::combine(), BiocGenerics::combine()
## ✖ dplyr::count()        masks matrixStats::count()
## ✖ dplyr::desc()         masks IRanges::desc()
## ✖ tidyr::expand()       masks S4Vectors::expand()
## ✖ dplyr::filter()       masks stats::filter()
## ✖ dplyr::first()        masks S4Vectors::first()
## ✖ dplyr::lag()          masks stats::lag()
## ✖ ggplot2::Position()   masks BiocGenerics::Position(), base::Position()
## ✖ purrr::reduce()       masks GenomicRanges::reduce(), IRanges::reduce()
## ✖ dplyr::rename()       masks S4Vectors::rename()
## ✖ lubridate::second()   masks S4Vectors::second()
## ✖ lubridate::second<-() masks S4Vectors::second<-()
## ✖ dplyr::slice()        masks IRanges::slice()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(vsn)

Load and Modifying the data.

Loading/Reading the count matrix aka expression data-set and pheno-typic data. Some minor modification to the count matrix as well as pheno-typic data is always required according to the desired aim and research question. Its better to make a folder and perform everything in it.

# Setting the word directory
setwd("E:/R Practice")

# Read expression set and phenotypic/meta data
count_data <- read.table("E:/R Practice/Bulk_RNAseq_raw_counts.txt",
                         header = T)
head(count_data)

##                 I0712 I0713 I0717 I0721 I0726 I0731 I0732 I0734 I0735 I0736
## ENSG00000223972     0     0     0     0     0     0     0     0     2     1
## ENSG00000227232    55    93   198   131    75    64    90   101   164    70
## ENSG00000278267     6     6    28     5     4     9     6     8    13     2
## ENSG00000243485     0     0     0     3     1     0     0     0     0     0
## ENSG00000237613     0     0     0     0     0     0     0     0     0     0
## ENSG00000268020     0     0     1     0     0     0     0     0     0     0
##                 I0737 I0739 I0740 I0744 I0745 I0747 I0748 I0749 I0751 I0752
## ENSG00000223972     0     0     0     0     0     0     1     0     0     1
## ENSG00000227232   113    67   227    56    89   121    73   119   155    95
## ENSG00000278267     8    16    35     6     9     5     3     4    18    10
## ENSG00000243485     0     0     0     1     0     0     1     0     0     0
## ENSG00000237613     0     0     0     0     0     0     0     0     0     0
## ENSG00000268020     0     0     0     0     0     0     0     0     0     0
##                 I0755 I0756 I0758 I0760 I0761 I0763 I0765 I0766 I0767 I0769
## ENSG00000223972     0     0     0     0     0     0     0     0     0     0
## ENSG00000227232   122   116    74   113    62   105   101    79    93    45
## ENSG00000278267     9     2     6     7     8     2     8     9     6     8
## ENSG00000243485     0     1     0     1     0     0     1     1     0     1
## ENSG00000237613     0     0     0     0     0     0     0     0     0     0
## ENSG00000268020     0     0     0     0     0     0     0     0     0     0
##                 I0773 I0774 I0777 I0778 I0782 I0785 I0786 I0787 I0789 I0790
## ENSG00000223972     0     0     0     1     0     0     0     0     0     0
## ENSG00000227232    83   104    65    85   150    92    98   128    97    74
## ENSG00000278267     5     7     5     6    14    13     4    10    11     7
## ENSG00000243485     0     0     0     2     0     0     0     0     0     0
## ENSG00000237613     0     0     0     0     0     0     0     0     0     0
## ENSG00000268020     0     0     0     0     0     0     0     0     0     0
##                 I0796 I0800 I0801 I0802 I0803 I0805 I0806 I0808 I0810 I0812
## ENSG00000223972     0     0     0     1     0     0     3     0     0     0
## ENSG00000227232    87    49    82    85    53    47    89    97   221    63
## ENSG00000278267     7     8    15     7     6     9     3     5    17     3
## ENSG00000243485     1     0     0     0     1     0     1     3     0     0
## ENSG00000237613     0     0     0     0     0     0     0     0     0     0
## ENSG00000268020     0     0     0     0     0     0     0     0     0     0
##                 I0815 I0816 I0819 I0820 I0821 I0822 I0825 I0828 I0829 I0832
## ENSG00000223972     1     0     1     0     0     0     2     0     0     0
## ENSG00000227232   174   104    55   146    84    68    91    97    52    66
## ENSG00000278267    23     9     8    16     7     5     5     6     3     5
## ENSG00000243485     2     2     0     0     0     0     1     0     0     0
## ENSG00000237613     0     0     0     0     0     0     0     0     0     0
## ENSG00000268020     0     0     0     0     0     0     0     0     0     0
##                 I0834 I0838 I0839 I0840 I0841 I0842 I0843 I0844 I0845 I0846
## ENSG00000223972     0     0     0     0     2     0     0     0     1     0
## ENSG00000227232    92   106    97   185   101    73    94    57    81   193
## ENSG00000278267    11    11     9    22     2     8     7     5     3    18
## ENSG00000243485     0     0     0     0     1     1     1     0     1     1
## ENSG00000237613     0     0     0     0     0     0     0     0     0     0
## ENSG00000268020     0     0     0     0     0     0     0     0     0     0
##                 I0847 I0848 I0850 I0851 I0858 I0861 I0862 I0863 I0865 I0866
## ENSG00000223972     0     0     1     0     0     0     0     0     0     1
## ENSG00000227232    87    48   157    98    62    42    62    49    97    73
## ENSG00000278267     6     4    20    12     3     5    11     3     7     4
## ENSG00000243485     1     0     0     0     0     0     0     0     0     0
## ENSG00000237613     0     0     0     0     0     0     0     0     0     0
## ENSG00000268020     0     0     0     0     0     0     0     0     0     0
##                 I0867 I0868 I0871 I0872 I0874 I0875 I0876 I0877 I0878 I0883
## ENSG00000223972     0     0     0     0     0     0     0     0     0     0
## ENSG00000227232   100    86    76    91    60    55    63    79    79    84
## ENSG00000278267     1     7     5     8     3     4     5     5     4     4
## ENSG00000243485     0     0     0     1     0     0     0     1     1     0
## ENSG00000237613     0     0     0     0     0     0     0     0     0     0
## ENSG00000268020     0     0     0     0     0     0     0     0     0     0
##                 I0884 I0885 I0886 I0887
## ENSG00000223972     0     1     2     1
## ENSG00000227232    50    92    60    65
## ENSG00000278267     4     4     7    10
## ENSG00000243485     1     1     0     0
## ENSG00000237613     0     0     0     0
## ENSG00000268020     0     0     0     0

metadata <- read.table("E:/R Practice/Bulk_RNAseq_metadata.txt",
                      header = T)
head(metadata)

##    sample_id  cell_type cytokine_condition stimulation_time donor_id  sex age
## 1      I0712  CD4_Naive               IFNB              16h      257 Male  38
## 2      I0713  CD4_Naive               Th17              16h      254 Male  58
## 6      I0717 CD4_Memory            Resting               5d      265 Male  59
## 10     I0721  CD4_Naive                Th2               5d      257 Male  38
## 15     I0726 CD4_Memory               Th17               5d      264 Male  27
## 20     I0731  CD4_Naive                Th0              16h      257 Male  38
##    sequencing_batch cell_culture_batch rna_integrity_number
## 1                 1                  3                  9.5
## 2                 1                  3                  9.3
## 6                 1                  4                  9.7
## 10                1                  3                 10.0
## 15                1                  4                  9.9
## 20                1                  3                  9.4

# Limiting metadata to certain important columns and excluding un-necessary data in phenotypic data
metadata <- data.frame(row.names = colnames(count_data), metadata[,1:3])

# Data Modifications as it maybe required for every dataset
metadata <- metadata[,-1]
colnames(count_data) <- rownames(metadata) # To ensure the identity of sample names

# Display dataset information
head(count_data)

##                 I0712 I0713 I0717 I0721 I0726 I0731 I0732 I0734 I0735 I0736
## ENSG00000223972     0     0     0     0     0     0     0     0     2     1
## ENSG00000227232    55    93   198   131    75    64    90   101   164    70
## ENSG00000278267     6     6    28     5     4     9     6     8    13     2
## ENSG00000243485     0     0     0     3     1     0     0     0     0     0
## ENSG00000237613     0     0     0     0     0     0     0     0     0     0
## ENSG00000268020     0     0     1     0     0     0     0     0     0     0
##                 I0737 I0739 I0740 I0744 I0745 I0747 I0748 I0749 I0751 I0752
## ENSG00000223972     0     0     0     0     0     0     1     0     0     1
## ENSG00000227232   113    67   227    56    89   121    73   119   155    95
## ENSG00000278267     8    16    35     6     9     5     3     4    18    10
## ENSG00000243485     0     0     0     1     0     0     1     0     0     0
## ENSG00000237613     0     0     0     0     0     0     0     0     0     0
## ENSG00000268020     0     0     0     0     0     0     0     0     0     0
##                 I0755 I0756 I0758 I0760 I0761 I0763 I0765 I0766 I0767 I0769
## ENSG00000223972     0     0     0     0     0     0     0     0     0     0
## ENSG00000227232   122   116    74   113    62   105   101    79    93    45
## ENSG00000278267     9     2     6     7     8     2     8     9     6     8
## ENSG00000243485     0     1     0     1     0     0     1     1     0     1
## ENSG00000237613     0     0     0     0     0     0     0     0     0     0
## ENSG00000268020     0     0     0     0     0     0     0     0     0     0
##                 I0773 I0774 I0777 I0778 I0782 I0785 I0786 I0787 I0789 I0790
## ENSG00000223972     0     0     0     1     0     0     0     0     0     0
## ENSG00000227232    83   104    65    85   150    92    98   128    97    74
## ENSG00000278267     5     7     5     6    14    13     4    10    11     7
## ENSG00000243485     0     0     0     2     0     0     0     0     0     0
## ENSG00000237613     0     0     0     0     0     0     0     0     0     0
## ENSG00000268020     0     0     0     0     0     0     0     0     0     0
##                 I0796 I0800 I0801 I0802 I0803 I0805 I0806 I0808 I0810 I0812
## ENSG00000223972     0     0     0     1     0     0     3     0     0     0
## ENSG00000227232    87    49    82    85    53    47    89    97   221    63
## ENSG00000278267     7     8    15     7     6     9     3     5    17     3
## ENSG00000243485     1     0     0     0     1     0     1     3     0     0
## ENSG00000237613     0     0     0     0     0     0     0     0     0     0
## ENSG00000268020     0     0     0     0     0     0     0     0     0     0
##                 I0815 I0816 I0819 I0820 I0821 I0822 I0825 I0828 I0829 I0832
## ENSG00000223972     1     0     1     0     0     0     2     0     0     0
## ENSG00000227232   174   104    55   146    84    68    91    97    52    66
## ENSG00000278267    23     9     8    16     7     5     5     6     3     5
## ENSG00000243485     2     2     0     0     0     0     1     0     0     0
## ENSG00000237613     0     0     0     0     0     0     0     0     0     0
## ENSG00000268020     0     0     0     0     0     0     0     0     0     0
##                 I0834 I0838 I0839 I0840 I0841 I0842 I0843 I0844 I0845 I0846
## ENSG00000223972     0     0     0     0     2     0     0     0     1     0
## ENSG00000227232    92   106    97   185   101    73    94    57    81   193
## ENSG00000278267    11    11     9    22     2     8     7     5     3    18
## ENSG00000243485     0     0     0     0     1     1     1     0     1     1
## ENSG00000237613     0     0     0     0     0     0     0     0     0     0
## ENSG00000268020     0     0     0     0     0     0     0     0     0     0
##                 I0847 I0848 I0850 I0851 I0858 I0861 I0862 I0863 I0865 I0866
## ENSG00000223972     0     0     1     0     0     0     0     0     0     1
## ENSG00000227232    87    48   157    98    62    42    62    49    97    73
## ENSG00000278267     6     4    20    12     3     5    11     3     7     4
## ENSG00000243485     1     0     0     0     0     0     0     0     0     0
## ENSG00000237613     0     0     0     0     0     0     0     0     0     0
## ENSG00000268020     0     0     0     0     0     0     0     0     0     0
##                 I0867 I0868 I0871 I0872 I0874 I0875 I0876 I0877 I0878 I0883
## ENSG00000223972     0     0     0     0     0     0     0     0     0     0
## ENSG00000227232   100    86    76    91    60    55    63    79    79    84
## ENSG00000278267     1     7     5     8     3     4     5     5     4     4
## ENSG00000243485     0     0     0     1     0     0     0     1     1     0
## ENSG00000237613     0     0     0     0     0     0     0     0     0     0
## ENSG00000268020     0     0     0     0     0     0     0     0     0     0
##                 I0884 I0885 I0886 I0887
## ENSG00000223972     0     1     2     1
## ENSG00000227232    50    92    60    65
## ENSG00000278267     4     4     7    10
## ENSG00000243485     1     1     0     0
## ENSG00000237613     0     0     0     0
## ENSG00000268020     0     0     0     0

# Ensure column names of countData match row names of colData
all(colnames(count_data) %in% rownames(metadata))  # should return TRUE

## [1] TRUE

all(colnames(count_data) == rownames(metadata))    # should return TRUE

## [1] TRUE

# Check for missing values in count data as its important for down stream analysis
sum(is.na(count_data))

## [1] 0

Checking and plotting library sizes in count data.

# Check library sizes
library_sizes <- colSums(count_data)
barplot(library_sizes, main = "Librazy Sizes", ylab = "Counts per Sample",
        las = 3, col="lightblue")

Deseq Dataset creation and Filtration

Creating a DESeq Dataset Matrix to use it for differential Analysis. Remember to use ?Function_name if you wanna know more about a function in the code. In this case, we’ll use ?DESeqDataSetFromMatrix. Then, we aim to filter the low read count genes which means to remove genes that have less than 10 reads across all samples.

?DESeqDataSetFromMatrix

## starting httpd help server ... done

# Create DESeq2 dataset
dds <- DESeqDataSetFromMatrix(
  countData = count_data,
  colData = metadata,
  design = ~ cytokine_condition
)

## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors

# Pre-filtering: keep only rows with at least 10 reads total
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
dds

## class: DESeqDataSet 
## dim: 38789 94 
## metadata(1): version
## assays(1): counts
## rownames(38789): ENSG00000223972 ENSG00000227232 ... ENSG00000275405
##   ENSG00000277475
## rowData names(0):
## colnames(94): I0712 I0713 ... I0886 I0887
## colData names(2): cell_type cytokine_condition

While DESeq2 operates on raw counts, visualization and downstream analyses often depend on data following a normal distribution. It is possible to apply a log-transformation on the raw data to achieve that. Unfortunately, the logarithm of count data tends to exhibit higher variance when the mean expression value is low. Let’s take a look at our data.

# Everything is loaded, so now we investigate the data
# Apply a pseudocount of 1 and apply log2
normtfd <- normTransform(dds)
meanSdPlot(assay(normtfd)) # A plot of mean to standard deviation comparison

Before we can compare gene counts and do DE analysis, we must account for changes in sequencing depth and RNA composition between samples. We do not need to account for differences in gene length because we will only be comparing counts of the same gene across multiple samples, ensuring that gene length remains constant. DESeq2 employs the median of ratios approach, in which counts are divided by sample-specific size factors calculated from the median ratio of gene counts relative to geometric mean per gene. These size factors can be determined using the following code or automatically by running DESeq, as shown later.

# Calculate size factors to normalize library sizes
dds <- estimateSizeFactors(dds)
size_factors <- sizeFactors(dds) #Store size factors to  avariable

# Visualize size factors
plot(sizeFactors(dds), 
     colSums(counts(dds, normalized=F)), 
     xlab = 'Size factor',
     ylab = 'Total number of reads', 
     pch = 19)

# Check normalized counts
normalized_counts <- counts(dds, normalized=TRUE)
head(normalized_counts)

##                      I0712    I0713     I0717     I0721      I0726     I0731
## ENSG00000223972  0.0000000  0.00000   0.00000  0.000000  0.0000000  0.000000
## ENSG00000227232 54.4310843 88.18028 136.83808 94.249457 54.5914823 61.390002
## ENSG00000278267  5.9379365  5.68905  19.35084  3.597308  2.9115457  8.632969
## ENSG00000243485  0.0000000  0.00000   0.00000  2.158385  0.7278864  0.000000
## ENSG00000238009  0.9896561  0.00000   9.67542  0.000000  0.0000000  0.000000
## ENSG00000233750  0.0000000  1.89635  17.27754  0.000000  1.4557729  0.000000
##                     I0732     I0734      I0735      I0736      I0737     I0739
## ENSG00000223972  0.000000  0.000000   1.377596  0.8190376  0.0000000  0.000000
## ENSG00000227232 75.963133 97.230697 112.962838 57.3326354 86.0390415 64.458823
## ENSG00000278267  5.064209  7.701441   8.954371  1.6380753  6.0912596 15.393152
## ENSG00000243485  0.000000  0.000000   0.000000  0.0000000  0.0000000  0.000000
## ENSG00000238009  0.000000  0.000000   0.000000  0.8190376  0.0000000  0.000000
## ENSG00000233750  1.688070  0.000000   4.821585  1.6380753  0.7614074  0.962072
##                      I0740      I0744     I0745      I0747      I0748
## ENSG00000223972   0.000000  0.0000000  0.000000   0.000000  0.9468179
## ENSG00000227232 171.098254 51.2947648 73.256854 112.401949 69.1177085
## ENSG00000278267  26.380788  5.4958677  7.407996   4.644709  2.8404538
## ENSG00000243485   0.000000  0.9159779  0.000000   0.000000  0.9468179
## ENSG00000238009   2.261210  0.0000000  0.000000   0.000000  0.0000000
## ENSG00000233750   8.291105  0.0000000  0.000000   0.000000  2.8404538
##                      I0749      I0751      I0752     I0755       I0756
## ENSG00000223972   0.000000   0.000000  0.8168497  0.000000   0.0000000
## ENSG00000227232 109.639940 115.298006 77.6007216 90.334130 104.5313234
## ENSG00000278267   3.685376  13.389446  8.1684970  6.663993   1.8022642
## ENSG00000243485   0.000000   0.000000  0.0000000  0.000000   0.9011321
## ENSG00000238009   0.000000   1.487716  0.8168497  0.000000   0.0000000
## ENSG00000233750   1.842688   1.487716  0.0000000  1.480887   2.7033963
##                     I0758      I0760      I0761     I0763      I0765     I0766
## ENSG00000223972  0.000000  0.0000000  0.0000000  0.000000  0.0000000  0.000000
## ENSG00000227232 70.054657 88.6324840 57.3586273 90.489716 85.2285068 92.517714
## ENSG00000278267  5.680107  5.4905079  7.4011132  1.723614  6.7507728 10.539993
## ENSG00000243485  0.000000  0.7843583  0.0000000  0.000000  0.8438466  1.171110
## ENSG00000238009  0.000000  0.0000000  0.0000000  0.000000  1.6876932  1.171110
## ENSG00000233750  0.000000  3.9217913  0.9251392  0.000000  0.8438466  5.855551
##                     I0767     I0769      I0773     I0774     I0777      I0778
## ENSG00000223972  0.000000  0.000000  0.0000000  0.000000  0.000000  0.9136158
## ENSG00000227232 77.398128 49.237640 75.8693917 98.980175 67.126185 77.6573454
## ENSG00000278267  4.993428  8.753358  4.5704453  6.662127  5.163553  5.4816950
## ENSG00000243485  0.000000  1.094170  0.0000000  0.000000  0.000000  1.8272317
## ENSG00000238009  0.000000  0.000000  0.0000000  0.000000  0.000000  0.0000000
## ENSG00000233750  0.000000  0.000000  0.9140891  3.806930  2.065421  0.9136158
##                       I0782    I0785      I0786      I0787    I0789     I0790
## ENSG00000223972   0.0000000  0.00000  0.0000000   0.000000  0.00000  0.000000
## ENSG00000227232 117.9567550 87.11800 85.1959985 103.118961 86.91173 66.560826
## ENSG00000278267  11.0092971 12.31015  3.4773877   8.056169  9.85597  6.296294
## ENSG00000243485   0.0000000  0.00000  0.0000000   0.000000  0.00000  0.000000
## ENSG00000238009   0.7863784  0.00000  0.0000000   0.000000  0.00000  0.000000
## ENSG00000233750   1.5727567  0.00000  0.8693469   2.416851  0.00000  1.798941
##                      I0796      I0800      I0801      I0802     I0803    I0805
## ENSG00000223972  0.0000000  0.0000000  0.0000000  0.9037617  0.000000  0.00000
## ENSG00000227232 76.0476878 48.4999220 81.2612297 76.8197437 82.825844 63.21318
## ENSG00000278267  6.1187795  7.9183546 14.8648591  6.3263318  9.376511 12.10465
## ENSG00000243485  0.8741114  0.0000000  0.0000000  0.0000000  1.562752  0.00000
## ENSG00000238009  1.7482227  0.9897943  0.0000000  0.9037617  0.000000  0.00000
## ENSG00000233750  0.8741114  0.0000000  0.9909906  3.6150468  3.125504  0.00000
##                      I0806     I0808      I0810     I0812       I0815
## ENSG00000223972   3.578479  0.000000   0.000000  0.000000   0.8522398
## ENSG00000227232 106.161552 97.759342 195.351298 86.502906 148.2897241
## ENSG00000278267   3.578479  5.039141  15.027023  4.119186  19.6015153
## ENSG00000243485   1.192826  3.023485   0.000000  0.000000   1.7044796
## ENSG00000238009   1.192826  0.000000   7.071540  1.373062   1.7044796
## ENSG00000233750   1.192826  0.000000   7.955483  0.000000   1.7044796
##                      I0816     I0819      I0820     I0821     I0822      I0825
## ENSG00000223972  0.0000000  1.303393   0.000000  0.000000  0.000000  1.8411903
## ENSG00000227232 96.6269255 71.686593 154.905515 91.062837 78.750410 83.7741595
## ENSG00000278267  8.3619455 10.427141  16.975947  7.588570  5.790471  4.6029758
## ENSG00000243485  1.8582101  0.000000   0.000000  0.000000  0.000000  0.9205952
## ENSG00000238009  0.9291051  1.303393   0.000000  1.084081  0.000000  0.0000000
## ENSG00000233750  0.9291051  6.516963   1.060997  0.000000  0.000000  0.0000000
##                      I0828     I0829     I0832     I0834      I0838     I0839
## ENSG00000223972   0.000000  0.000000  0.000000  0.000000   0.000000  0.000000
## ENSG00000227232 103.637670 60.565494 76.169096 83.173235 109.405429 99.261831
## ENSG00000278267   6.410578  3.494163  5.770386  9.944626  11.353394  9.209861
## ENSG00000243485   0.000000  0.000000  0.000000  0.000000   0.000000  0.000000
## ENSG00000238009   1.068430  0.000000  0.000000  0.000000   0.000000  0.000000
## ENSG00000233750   1.068430  0.000000  1.154077  5.424341   1.032127  3.069954
##                      I0840      I0841      I0842      I0843     I0844     I0845
## ENSG00000223972   0.000000  1.9490594  0.0000000   0.000000  0.000000  1.095146
## ENSG00000227232 171.258941 98.4274998 67.7108978 135.160037 68.932184 88.706857
## ENSG00000278267  20.365928  1.9490594  7.4203724  10.065109  6.046683  3.285439
## ENSG00000243485   0.000000  0.9745297  0.9275465   1.437873  0.000000  1.095146
## ENSG00000238009   3.702896  0.0000000  0.0000000   0.000000  1.209337  0.000000
## ENSG00000233750   2.777172  0.9745297  0.0000000   1.437873  1.209337  4.380586
##                       I0846      I0847     I0848       I0850      I0851   I0858
## ENSG00000223972   0.0000000   0.000000  0.000000   0.8189666   0.000000  0.0000
## ENSG00000227232 190.1424843 117.165696 61.535125 128.5777605 106.792961 59.4394
## ENSG00000278267  17.7334959   8.080393  5.127927  16.3793326  13.076689  2.8761
## ENSG00000243485   0.9851942   1.346732  0.000000   0.0000000   0.000000  0.0000
## ENSG00000238009   0.0000000   0.000000  1.281982   2.4568999   1.089724  0.9587
## ENSG00000233750   5.9111653   2.693464  0.000000   1.6379333   0.000000  0.0000
##                     I0861    I0862     I0863     I0865      I0866      I0867
## ENSG00000223972  0.000000  0.00000  0.000000  0.000000  0.8952875   0.000000
## ENSG00000227232 55.040982 75.69676 62.757543 87.474467 65.3559902 111.743694
## ENSG00000278267  6.552498 13.43007  3.842299  6.312590  3.5811501   1.117437
## ENSG00000243485  0.000000  0.00000  0.000000  0.000000  0.0000000   0.000000
## ENSG00000238009  1.310500  0.00000  0.000000  0.000000  0.0000000   0.000000
## ENSG00000233750  0.000000  0.00000  0.000000  1.803597  0.8952875   1.117437
##                     I0868     I0871      I0872      I0874     I0875     I0876
## ENSG00000223972  0.000000  0.000000  0.0000000  0.0000000  0.000000  0.000000
## ENSG00000227232 94.763084 80.248630 89.5464473 51.5950849 66.526704 83.350353
## ENSG00000278267  7.713274  5.279515  7.8722151  2.5797542  4.838306  6.615107
## ENSG00000243485  0.000000  0.000000  0.9840269  0.0000000  0.000000  0.000000
## ENSG00000238009  0.000000  0.000000  1.9680538  0.0000000  1.209576  0.000000
## ENSG00000233750  2.203793  1.055903  0.9840269  0.8599181  1.209576  5.292086
##                     I0877     I0878      I0883     I0884      I0885     I0886
## ENSG00000223972  0.000000  0.000000   0.000000  0.000000   1.372646  2.145315
## ENSG00000227232 89.614018 86.038926 123.449409 61.511675 126.283414 64.359453
## ENSG00000278267  5.671773  4.356401   5.878543  4.920934   5.490583  7.508603
## ENSG00000243485  1.134355  1.089100   0.000000  1.230233   1.372646  0.000000
## ENSG00000238009  2.268709  0.000000   1.469636  0.000000   1.372646  0.000000
## ENSG00000233750  0.000000  0.000000   4.408907  2.460467   0.000000  0.000000
##                     I0887
## ENSG00000223972  1.010032
## ENSG00000227232 65.652057
## ENSG00000278267 10.100316
## ENSG00000243485  0.000000
## ENSG00000238009  2.020063
## ENSG00000233750  0.000000

Now comes the transformation of the data. We usually use rlog to to remove the dependence of the variance of the means, but if data is huge the R console tells us to use VST (variance stabilizing transformations) instead of rlog. Please explore the two function in depth by running ?rlog & ?vst.

# Perform VST transformation (optional, only if data is very huge)
vsd <- vst(dds, blind = FALSE)
meanSdPlot(assay(vsd))

# Perform rlog transformation if data is not huge
#rld <- rlog(dds, blind = FALSE)

Exploratory Data Analysis (EDA)

Computing the PCA: A principal component analysis is a useful tool for examining similarities in data and identifying strong confounding factors. PCA requires normally distributed data. We may use the rlog or vst functions to change our data by library size and then apply the log2 transformation. One technique to perform the PCA is to use the function prcomp.

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(ggplot2)

# Perform PCA using prcomp
pca <- prcomp(t(assay(vsd)), scale = TRUE)

#Plotting PCA: intgroup can be set as anytihng in prepared metadata
plotPCA(vsd, intgroup = 'cytokine_condition',ntop=38789)

## using ntop=38789 top features by variance

# Extract the first four principal components
pca_data <- data.frame(pca$x[, 1:4])
pca_data$condition <- colData(vsd)$condition

# Plot pairs plot for the first four principal components
ggpairs(pca_data, aes(color = vsd$cytokine_condition, alpha = 0.6),
        columns = 1:4,
        upper = list(continuous = wrap("points", size = 1.5)),
        lower = list(continuous = wrap("smooth", method = "lm")),
        diag = list(continuous = wrap("densityDiag"))) +
  theme_minimal()

Running DESeq and performing Differential Expression

# Run DESeq2 analysis
dds <- DESeq(dds)

## using pre-existing size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

## -- replacing outliers and refitting for 176 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)

## estimating dispersions

## fitting model and testing

# Extract results
res <- results(dds) #Syore results in a variable by using "results" function from DESeq2 package.
res <- res[!is.na(res$padj), ]  # Remove NA values

# View top differentially expressed genes
head(res[order(res$padj), ])

## log2 fold change (MLE): cytokine condition Th2 vs IFNB 
## Wald test p-value: cytokine condition Th2 vs IFNB 
## DataFrame with 6 rows and 6 columns
##                   baseMean log2FoldChange     lfcSE      stat      pvalue
##                  <numeric>      <numeric> <numeric> <numeric>   <numeric>
## ENSG00000112299    12.0579      14.855644  1.429180  10.39452 2.62605e-25
## ENSG00000223387    12.3688      -3.744040  0.439440  -8.52002 1.59526e-17
## ENSG00000100644 14305.0072      -0.857575  0.100341  -8.54663 1.26731e-17
## ENSG00000141068  3902.2634      -0.931053  0.109618  -8.49361 2.00324e-17
## ENSG00000135148  3637.8152      -1.114091  0.138524  -8.04258 8.79668e-16
## ENSG00000141664  3337.4575      -1.074970  0.134190  -8.01082 1.13949e-15
##                        padj
##                   <numeric>
## ENSG00000112299 7.02627e-21
## ENSG00000223387 1.33996e-13
## ENSG00000100644 1.33996e-13
## ENSG00000141068 1.33996e-13
## ENSG00000135148 4.70728e-12
## ENSG00000141664 5.08138e-12

# Summary of results
summary(res)

## 
## out of 26756 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 608, 2.3%
## LFC < 0 (down)     : 759, 2.8%
## outliers [1]       : 0, 0%
## low counts [2]     : 0, 0%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

# MA plot
plotMA(res, ylim = c(-2, 2)) #Not necessary but sometimes useful to visualize

# Extract differentially expressed genes
sum(res$padj <= 0.05 & 
      abs(res$log2FoldChange) > 2, na.rm = TRUE) # To see how many genes fall under this thresholds

## [1] 197

DEG.idx <- which(res$padj <= 0.05 & 
                   abs(res$log2FoldChange) > 2) #Thresholds can be changed also according to the need
res <- res[DEG.idx,]

# Converting Dese2 results as a data frame
sig_genes <- as.data.frame(res)
# Writing the Dese2 results in a file in local directory
write.csv(sig_genes, "E:/R Practice/DE Genes.csv")

Visualizations: Finally we generate useful visualization to see DE genes such as Volcano Plot and Heat map etc.

#Generating Volcano Plot Visualization
library(EnhancedVolcano) #Load the library first

## Loading required package: ggrepel

EnhancedVolcano(res, lab = rownames(res), 
                x = 'log2FoldChange', y = 'padj', 
                subtitle = 'Differentially Expressed Genes', labSize = 2, 
                pCutoff = 0.01,
                FCcutoff = 1,
                drawConnectors = TRUE)

## Warning: ggrepel: 126 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

# Heatmap of the most variable genes using rLog
library(pheatmap)#Load the library first
select <- order(rowMeans(counts(dds, normalized = TRUE)),
                decreasing = TRUE)[1:20] 
pheatmap(assay(vsd)[select, ], cluster_rows = TRUE, cluster_cols = TRUE,
         show_rownames = TRUE, border_color = "black",
         annotation_col = as.data.frame(colData(vsd)))

# Use arguments (show_colnames = FALSE,show_rownames = FALSE,) to avoid mess in names
pheatmap(assay(vsd)[select, ], cluster_rows = TRUE, cluster_cols = TRUE,
         show_rownames = TRUE, border_color = "black", show_colnames = FALSE,
         annotation_col = as.data.frame(colData(vsd)))

Pathway Analysis, Gene Set Enrichment Analysis

# Load the necessary libraries first
library(clusterProfiler)

##

## clusterProfiler v4.12.0  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
## 
## If you use clusterProfiler in published research, please cite:
## T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141

## 
## Attaching package: 'clusterProfiler'

## The following object is masked from 'package:purrr':
## 
##     simplify

## The following object is masked from 'package:IRanges':
## 
##     slice

## The following object is masked from 'package:S4Vectors':
## 
##     rename

## The following object is masked from 'package:stats':
## 
##     filter

library(org.Hs.eg.db)

## Loading required package: AnnotationDbi

## 
## Attaching package: 'AnnotationDbi'

## The following object is masked from 'package:clusterProfiler':
## 
##     select

## The following object is masked from 'package:dplyr':
## 
##     select

##

# Store DE Genes from Differential Expression Analysis
genes <- res@rownames 

# Convert gene symbols to Entrez IDs
anno_Ids <- bitr(genes, 
                 fromType="ENSEMBL",
                 toType=c("ENTREZID", "SYMBOL", "GENENAME"),
                 OrgDb=org.Hs.eg.db)

## 'select()' returned 1:many mapping between keys and columns

## Warning in bitr(genes, fromType = "ENSEMBL", toType = c("ENTREZID", "SYMBOL", :
## 21.32% of input gene IDs are fail to map...

Gene Ontology Analysis and Visualizations

# Perform GO enrichment analysis
go_results <- enrichGO(gene = anno_Ids$ENSEMBL,
                       OrgDb = org.Hs.eg.db,
                       keyType = "ENSEMBL", #Key types that to be used
                       ont = "BP",
                       pAdjustMethod = "fdr")

# View the results generated by EnrichGo
View(go_results@result)

#Dot Plot visualization
dotplot(go_results, showCategory=20) +
  ggtitle("GO Enrichment Analysis")

#Network GO Plot Visualization
goplot(go_results, showCategory=10)+
  ggtitle("GO Network Analysis")

## Warning: ggrepel: 5 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

KEGG Pathway analysis and Visualizations

# Perform KEGG enrichment analysis
kegg_results <- enrichKEGG(gene = anno_Ids$ENTREZID, 
                           organism = 'hsa',
                           keyType = "ncbi-geneid",
                           pAdjustMethod = "BH")

## Reading KEGG annotation online: "https://rest.kegg.jp/link/hsa/pathway"...

## Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/hsa"...

## Reading KEGG annotation online: "https://rest.kegg.jp/conv/ncbi-geneid/hsa"...

# View the results generated by enrichKEGG
View(kegg_results@result)

#Dot Plot visualization
dotplot(kegg_results, showCategory=20) +
  ggtitle("KEGG Pathway Analysis")

As the final section of this text, we call the function sessionInfo, which returns the version numbers of R and all packages used in this session. It is a good idea to keep a note of this since it will help you figure out what happened if a R script stops working or produces different results because the functions have changed in a newer version of one of your packages. By inserting it at the end of a script, your reports will become more replicable.

gc()

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  9424238 503.4   29391372 1569.7  36739214 1962.1
## Vcells 71617466 546.4  159416859 1216.3 159416859 1216.3

length(getLoadedDLLs())

## [1] 87

sessionInfo()

## R version 4.4.0 (2024-04-24 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] org.Hs.eg.db_3.19.1         AnnotationDbi_1.66.0       
##  [3] clusterProfiler_4.12.0      pheatmap_1.0.12            
##  [5] EnhancedVolcano_1.22.0      ggrepel_0.9.5              
##  [7] GGally_2.2.1                vsn_3.72.0                 
##  [9] lubridate_1.9.3             forcats_1.0.0              
## [11] stringr_1.5.1               dplyr_1.1.4                
## [13] purrr_1.0.2                 readr_2.1.5                
## [15] tidyr_1.3.1                 tibble_3.2.1               
## [17] ggplot2_3.5.1               tidyverse_2.0.0            
## [19] DESeq2_1.44.0               SummarizedExperiment_1.34.0
## [21] MatrixGenerics_1.16.0       matrixStats_1.3.0          
## [23] GenomicRanges_1.56.0        GenomeInfoDb_1.40.1        
## [25] IRanges_2.38.0              S4Vectors_0.42.0           
## [27] GEOquery_2.72.0             Biobase_2.64.0             
## [29] BiocGenerics_0.50.0        
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3      rstudioapi_0.16.0       jsonlite_1.8.8         
##   [4] magrittr_2.0.3          farver_2.1.2            rmarkdown_2.27         
##   [7] fs_1.6.4                zlibbioc_1.50.0         vctrs_0.6.5            
##  [10] memoise_2.0.1           ggtree_3.12.0           htmltools_0.5.8.1      
##  [13] S4Arrays_1.4.1          gridGraphics_0.5-1      SparseArray_1.4.8      
##  [16] sass_0.4.9              bslib_0.7.0             plyr_1.8.9             
##  [19] cachem_1.1.0            igraph_2.0.3            lifecycle_1.0.4        
##  [22] pkgconfig_2.0.3         gson_0.1.0              Matrix_1.7-0           
##  [25] R6_2.5.1                fastmap_1.2.0           GenomeInfoDbData_1.2.12
##  [28] aplot_0.2.3             digest_0.6.35           enrichplot_1.24.0      
##  [31] colorspace_2.1-0        patchwork_1.2.0         RSQLite_2.3.7          
##  [34] labeling_0.4.3          fansi_1.0.6             timechange_0.3.0       
##  [37] polyclip_1.10-6         httr_1.4.7              abind_1.4-5            
##  [40] mgcv_1.9-1              compiler_4.4.0          bit64_4.0.5            
##  [43] withr_3.0.0             BiocParallel_1.38.0     viridis_0.6.5          
##  [46] DBI_1.2.3               ggstats_0.6.0           hexbin_1.28.3          
##  [49] highr_0.11              ggforce_0.4.2           MASS_7.3-60.2          
##  [52] DelayedArray_0.30.1     HDO.db_0.99.1           tools_4.4.0            
##  [55] scatterpie_0.2.3        ape_5.8                 glue_1.7.0             
##  [58] nlme_3.1-164            GOSemSim_2.30.0         shadowtext_0.1.3       
##  [61] grid_4.4.0              reshape2_1.4.4          fgsea_1.30.0           
##  [64] generics_0.1.3          gtable_0.3.5            tzdb_0.4.0             
##  [67] preprocessCore_1.66.0   data.table_1.15.4       hms_1.1.3              
##  [70] tidygraph_1.3.1         xml2_1.3.6              utf8_1.2.4             
##  [73] XVector_0.44.0          pillar_1.9.0            yulab.utils_0.1.4      
##  [76] limma_3.60.2            splines_4.4.0           tweenr_2.0.3           
##  [79] treeio_1.28.0           lattice_0.22-6          bit_4.0.5              
##  [82] tidyselect_1.2.1        GO.db_3.19.1            locfit_1.5-9.9         
##  [85] Biostrings_2.72.1       knitr_1.47              gridExtra_2.3          
##  [88] xfun_0.45               graphlayouts_1.1.1      statmod_1.5.0          
##  [91] stringi_1.8.4           UCSC.utils_1.0.0        lazyeval_0.2.2         
##  [94] ggfun_0.1.5             yaml_2.3.8              evaluate_0.24.0        
##  [97] codetools_0.2-20        ggraph_2.2.1            qvalue_2.36.0          
## [100] BiocManager_1.30.23     ggplotify_0.1.2         cli_3.6.2              
## [103] affyio_1.74.0           munsell_0.5.1           jquerylib_0.1.4        
## [106] Rcpp_1.0.12             png_0.1-8               parallel_4.4.0         
## [109] blob_1.2.4              DOSE_3.30.1             tidytree_0.4.6         
## [112] viridisLite_0.4.2       scales_1.3.0            affy_1.82.0            
## [115] crayon_1.5.3            rlang_1.1.4             cowplot_1.1.3          
## [118] fastmatch_1.1-4         KEGGREST_1.44.1

RNA-Seq Differential Expression Analysis, Pathway and Enrichment Analysis

`6/25/2024`