Part 1: Getting the dataset ready. Cleaning the data and mapping to HGNC

Downloading the data of GSE178521. Providing some metadata of the platform and dataset.

require(GEOquery)
## Loading required package: GEOquery
## Loading required package: Biobase
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)
gse <- getGEO("GSE178521", GSEMatrix=FALSE)
## Reading file....
## Parsing....
## Found 7 entities...
## GPL24676 (1 of 8 entities)
## GSM5393322 (2 of 8 entities)
## GSM5393323 (3 of 8 entities)
## GSM5393324 (4 of 8 entities)
## GSM5393325 (5 of 8 entities)
## GSM5393326 (6 of 8 entities)
## GSM5393327 (7 of 8 entities)
#Information about Platform
current_gpl <- names(GPLList(gse))[1]
current_gpl_info <- Meta(getGEO(current_gpl))

Metadata of the series

contact_address contact_city contact_country contact_email contact_institute contact_name
上海市杨浦区控江路新华医院 shanghai China 上海交大医学院附属新华医院 bingqian,,zhou


Information about the platform - GPL24676

Platform title : Illumina NovaSeq 6000 (Homo sapiens)

Submission data : Mar 02 2018

Last update data : Nov 05 2018

Organims : Homo sapiens (taxid: 9606)

Number of GEO datasets that use this techology : 4140

Number of GEO samples that use this technology : 95270

Get the expression data and displaying information about it.

sfiles = getGEOSuppFiles("GSE178521")
fnames = rownames(sfiles)
PRMT5_exp = read.delim(fnames[1],header=TRUE,
                       check.names = FALSE)

#Information
read.table(fnames[1],header=TRUE,
           check.names = FALSE)[1,]
knitr::kable(PRMT5_exp[1:15,1:6], format = "html")
gene_id GFPkd1 GFPkd2 GFPkd3 PRMT5kd1 PRMT5kd2
ENSG00000282222 0 0 0 0 0
ENSG00000282221 0 0 0 0 0
ENSG00000212040 0 0 0 0 0
ENSG00000110514 6020 6299 5418 5120 6625
ENSG00000086015 4124 4315 3674 3316 4242
ENSG00000211769 0 0 0 0 0
ENSG00000211768 0 0 0 0 0
ENSG00000250337 365 442 459 382 359
ENSG00000255071 0 0 0 0 0
ENSG00000211767 0 0 0 0 0
ENSG00000211766 0 0 0 0 0
ENSG00000211765 0 0 0 0 0
ENSG00000211764 0 0 0 0 0
ENSG00000201510 0 0 0 0 0
ENSG00000169740 1220 1086 1086 1030 1333
#How many genes do we have measurement for?
dim(PRMT5_exp)
## [1] 58825     7
colnames(PRMT5_exp)
## [1] "gene_id"  "GFPkd1"   "GFPkd2"   "GFPkd3"   "PRMT5kd1" "PRMT5kd2" "PRMT5kd3"

Brief overview of the dataset up to 15 genes

gene_id GFPkd1 GFPkd2 GFPkd3 PRMT5kd1 PRMT5kd2 PRMT5kd3
ENSG00000282222 0 0 0 0 0 0
ENSG00000282221 0 0 0 0 0 0
ENSG00000212040 0 0 0 0 0 0
ENSG00000110514 6020 6299 5418 5120 6625 5485
ENSG00000086015 4124 4315 3674 3316 4242 3387
ENSG00000211769 0 0 0 0 0 0
ENSG00000211768 0 0 0 0 0 0
ENSG00000250337 365 442 459 382 359 482
ENSG00000255071 0 0 0 0 0 0
ENSG00000211767 0 0 0 0 0 0
ENSG00000211766 0 0 0 0 0 0
ENSG00000211765 0 0 0 0 0 0
ENSG00000211764 0 0 0 0 0 0
ENSG00000201510 0 0 0 0 0 0
ENSG00000169740 1220 1086 1086 1030 1333 968


Since our original dataset does not provide the gene symbol, we will need to map the ensemble ID to HUGO gene symbol using the library EnsDb.Hsapiens.v79

#Convert ensemble ID to gene symbols in R
#Making sure all dependencies are installed
BiocManager::install("EnsDb.Hsapiens.v79")
## 'getOption("repos")' replaces Bioconductor standard repositories, see
## '?repositories' for details
## 
## replacement repositories:
##     CRAN: https://packagemanager.rstudio.com/all/__linux__/focal/latest
## Bioconductor version 3.14 (BiocManager 1.30.16), R 4.1.2 (2021-11-01)
## Warning: package(s) not installed when version(s) same as current; use `force = TRUE` to
##   re-install: 'EnsDb.Hsapiens.v79'
## Installation paths not writeable, unable to update packages
##   path: /usr/local/lib/R/library
##   packages:
##     class, foreign, MASS, Matrix, nlme, nnet, rpart, spatial
## Old packages: 'clipr', 'jsonlite'
library("EnsDb.Hsapiens.v79")
## Loading required package: ensembldb
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: GenomicFeatures
## Loading required package: AnnotationDbi
## Loading required package: AnnotationFilter
## 
## Attaching package: 'ensembldb'
## The following object is masked from 'package:stats':
## 
##     filter
require("ensembldb")
BiocManager::install("ensembldb")
## 'getOption("repos")' replaces Bioconductor standard repositories, see
## '?repositories' for details
## 
## replacement repositories:
##     CRAN: https://packagemanager.rstudio.com/all/__linux__/focal/latest
## Bioconductor version 3.14 (BiocManager 1.30.16), R 4.1.2 (2021-11-01)
## Warning: package(s) not installed when version(s) same as current; use `force = TRUE` to
##   re-install: 'ensembldb'
## Installation paths not writeable, unable to update packages
##   path: /usr/local/lib/R/library
##   packages:
##     class, foreign, MASS, Matrix, nlme, nnet, rpart, spatial
## Old packages: 'clipr', 'jsonlite'
geneID_mapping <- ensembldb::select(EnsDb.Hsapiens.v79, 
                                           keys=PRMT5_exp$gene_id,
                                           keytype="GENEID", 
                                           columns = c("SYMBOL","GENEID"))
symbol <- c() #create empty vector
for(i in PRMT5_exp$gene_id){
  if(!(i %in% geneID_mapping$GENEID)){symbol <- c(symbol, NA)}
  else{symbol <- c(symbol, geneID_mapping$SYMBOL[which(geneID_mapping$GENEID == i)])}
  
}
PRMT5_exp$HUGO_symbol <- symbol #add the gene symbols to the dataset as a column

Next, we will get the summarized gene count.

summarized_gene_counts <- sort(table(PRMT5_exp$HUGO_symbol),
                               decreasing = TRUE)
knitr::kable(summarized_gene_counts[which(summarized_gene_counts>1)[1:10]], 
             format="html")
Var1 Freq
Y_RNA 754
Metazoa_SRP 251
U3 55
uc_338 36
snoU13 34
SNORA70 30
U6 28
SNORA31 27
5S_rRNA 25
SNORA40 22

Result:

Var1 Freq
Y_RNA 754
Metazoa_SRP 251
U3 55
uc_338 36
snoU13 34
SNORA70 30
U6 28
SNORA31 27
5S_rRNA 25
SNORA40 22


We will now define the groups: Control (GFP) vs Experimental (PRMT5). Tere are 3 samples for each group

#Define the groups
samples <- data.frame(lapply(colnames(PRMT5_exp)[2:7], 
                             FUN=function(x){unlist(strsplit(x, split = "kd"))}))

colnames(samples) <- colnames(PRMT5_exp)[2:7]
rownames(samples) <- c("Replicates", "Experimental_Group")
samples <- data.frame(t(samples))

Next, we want to translate out the counts into count per million (cpm) and get rid of low counts.

#Translate out counts into count per million using edgeR
cpms = edgeR::cpm(PRMT5_exp[,2:7])
rownames(cpms) <- PRMT5_exp[,1]

#Get rid of low counts
keep = rowSums(cpms >1) >=3 # n=3, 3 samples per group

PRMT5_exp_filtered = PRMT5_exp[keep,]
dim(PRMT5_exp_filtered) # first col is the number of remaining genes
## [1] 15159     8

We also want to check the percentage of the data in which the ensemble ID could not be mapped to HGNC.

#Find percentage of unmapped genes
unmapped_percentage <-sum(is.na(PRMT5_exp_filtered$HUGO_symbol))/
  length(PRMT5_exp_filtered$HUGO_symbol)*100

#Remove any rows with NA in their HUGO_symbol
PRMT5_exp_filtered <- na.omit(PRMT5_exp_filtered)

#Just double checking if there are any NA left!
any(is.na(PRMT5_exp_filtered$HUGO_symbol))
## [1] FALSE

Since only 0.8641731% of our data is unmapped, we can safely filter out the data points (rows) that contain NA in their gene symbol.

We then check if removing genes with low count solve the duplication issue

summarized_gene_counts_filtered <- sort(table(PRMT5_exp_filtered$HUGO_symbol),
                                        decreasing = TRUE)
knitr::kable(summarized_gene_counts_filtered[
  which(summarized_gene_counts>1)[1:10]], 
  format="html")
Var1 Freq
C2orf15 2
CBS 2
CLN3 2
DUXAP8 2
FAM47E-STBD1 2
HERC2P2 2
HIST2H2AA4 2
ICOSLG 2
IDS 2
KIAA0391 2

Result:

Var1 Freq
C2orf15 2
CBS 2
CLN3 2
DUXAP8 2
FAM47E-STBD1 2
HERC2P2 2
HIST2H2AA4 2
ICOSLG 2
IDS 2
KIAA0391 2


It seems that filtering out the low count does indeed solve some duplication issue. We have cleaned up our data and will now move on to normalize it.

Part 2: Normalization of data

We proceed to normalize the data using TMM and compare the box plot and density plot between the original count and normalized count.

Normalization using TMM since our data distribution is closest to normal.

library(edgeR)
## Loading required package: limma
## 
## Attaching package: 'limma'
## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA
#Creating an edgeR container for RNASep count data
filtered_data_matrix <- as.matrix(PRMT5_exp_filtered[,2:7])
rownames(filtered_data_matrix) <- PRMT5_exp_filtered$HUGO_symbol
#d is an edgeR object
d = DGEList(counts = filtered_data_matrix, group = samples$Experimental_Group)

#Calculuate the normalization factors
d = calcNormFactors(d)
normalized_counts <- cpm(d)

Visualize with boxplot

#Boxplot
par(mfrow=c(1,2))
#Distribution of our original data - Boxplot
data2plot <- log2(edgeR::cpm(PRMT5_exp_filtered[,2:7]))
boxplot(data2plot, xlab = "Samples", ylab = "log2 CPM", 
        las = 2, cex = 0.5, cex.lab = 0.5,
        cex.axis = 0.5, main = "Original Count")
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 1 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 2 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 3 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 4 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 5 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 6 is not drawn
#draw the median on each box plot
abline(h = median(apply(data2plot, 2, median)), 
       col = "red", lwd = 0.6, lty = "dashed")

################################################################################

#Boxplot for normalized counts
data2plot_normalized <- log2(normalized_counts)
boxplot(data2plot_normalized, xlab = "Samples", ylab = "log2 CPM", 
        las = 2, cex = 0.5, cex.lab = 0.5,
        cex.axis = 0.5, main = "Normalized")
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 1 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 2 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 3 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 4 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 5 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 6 is not drawn
abline(h = median(apply(data2plot_normalized, 2, median)), 
       col = "red", lwd = 0.6, lty = "dashed")

Result: Boxplot

Visualize with density plot

par(mfrow=c(1,2))
#Density Plot for original count data
counts_density <- apply(log2(edgeR::cpm(PRMT5_exp_filtered[,2:7])), 2, density)

#Calculate the limits across all the samples
xlim <- 0; ylim <- 0
for (i in 1:length(counts_density)) {
  xlim <- range(c(xlim, counts_density[[i]]$x)); 
  ylim <- range(c(ylim, counts_density[[i]]$y))
}
cols <- rainbow(length(counts_density))
ltys <- rep(1, length(counts_density))
#plot the first density plot to initialize the plot
plot(counts_density[[1]], xlim=xlim, ylim=ylim, type="n", 
     ylab="Smoothing density of log2-CPM", main="Original Count", cex.lab = 0.8)
#plot each line
for (i in 1:length(counts_density)) lines(counts_density[[i]], col=cols[i])
#create legend
legend("topright", colnames(data2plot),  
       col=cols, lty=ltys, cex=0.5, 
       border ="blue",  text.col = "green4", 
       merge = TRUE, bg = "gray90")

##############################################################################

#Density Plot for normalized count data
norm_counts_density <- apply(log2(normalized_counts), 2, density)
xlim <- 0; ylim <- 0
for (i in 1:length(norm_counts_density)) {
  xlim <- range(c(xlim, norm_counts_density[[i]]$x)); 
  ylim <- range(c(ylim, norm_counts_density[[i]]$y))
}
cols <- rainbow(length(norm_counts_density))
ltys <- rep(1, length(norm_counts_density))
plot(norm_counts_density[[1]], xlim=xlim, ylim=ylim, type="n", 
     ylab="Smoothing density of log2-CPM", main="Normalized count", cex.lab = 0.8)
for (i in 1:length(norm_counts_density)) lines(norm_counts_density[[i]], col=cols[i])
legend("topright", colnames(data2plot_normalized),  
       col=cols, lty=ltys, cex=0.5, 
       border ="blue",  text.col = "green4", 
       merge = TRUE, bg = "gray90")

Result: Density plot

From the looks of it, the original count and normalized count graphs did not differ by much, which is a good sign as it means there is not much technical variation.

Finally, we will check the MDS plot, which measure the distance between samples.

#MDS plot: distance between samples
plotMDS(d, labels=rownames(samples),
        col = c("darkgreen","blue")[factor(samples$Experimental_Group)])

MDS

Conclusion from MDS plot: The strength of our result should be decent since there is clear separation between control and experimental groups. There is a slight clustering for the experimental group (PRMT5 replicates).


Part 3: Interpretation

I am interested in this dataset because it elucidates the mechanism and involvement of a key protein PRMT5 in the immune response for lung cancer patients. This research can provide me insights into whether PRMT5 inhibition, in combination with other treament, can be a strategy for lung cancer treatment.

The experiment involves 6 biological replicates (3 for each group) cultivated from lung adenocarcinoma cell lines. The control group is the GFP-expressed shRNA cells while the experimental group is the PRMT5-expressed shRNA cells.

As stated above in the overview of the data analysis, some of the expression values could not be mapped to current HUGO symbols and are thus safely removed since these unmapped values only account for 0.86% of the filtered data. I also get rid of genes with low counts. This helped with duplication issue.

My dataset is monitoring the expression of approximately 15000 genes, so I can conclude that it has good coverage.