Part 1: Getting the dataset ready. Cleaning the data and mapping to HGNC
Downloading the data of GSE178521. Providing some metadata of the platform and dataset.
require(GEOquery)
## Loading required package: GEOquery
## Loading required package: Biobase
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)
gse <- getGEO("GSE178521", GSEMatrix=FALSE)
## Reading file....
## Parsing....
## Found 7 entities...
## GPL24676 (1 of 8 entities)
## GSM5393322 (2 of 8 entities)
## GSM5393323 (3 of 8 entities)
## GSM5393324 (4 of 8 entities)
## GSM5393325 (5 of 8 entities)
## GSM5393326 (6 of 8 entities)
## GSM5393327 (7 of 8 entities)
#Information about Platform
current_gpl <- names(GPLList(gse))[1]
current_gpl_info <- Meta(getGEO(current_gpl))
Metadata of the series
| contact_address | contact_city | contact_country | contact_email | contact_institute | contact_name |
|---|---|---|---|---|---|
| 上海市杨浦区控江路新华医院 | shanghai | China | bettyerzhou@sjtu.edu.cn | 上海交大医学院附属新华医院 | bingqian,,zhou |
Information about the platform - GPL24676
Platform title : Illumina NovaSeq 6000 (Homo sapiens)
Submission data : Mar 02 2018
Last update data : Nov 05 2018
Organims : Homo sapiens (taxid: 9606)
Number of GEO datasets that use this techology : 4140
Number of GEO samples that use this technology : 95270
Get the expression data and displaying information about it.
sfiles = getGEOSuppFiles("GSE178521")
fnames = rownames(sfiles)
PRMT5_exp = read.delim(fnames[1],header=TRUE,
check.names = FALSE)
#Information
read.table(fnames[1],header=TRUE,
check.names = FALSE)[1,]
knitr::kable(PRMT5_exp[1:15,1:6], format = "html")
| gene_id | GFPkd1 | GFPkd2 | GFPkd3 | PRMT5kd1 | PRMT5kd2 |
|---|---|---|---|---|---|
| ENSG00000282222 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000282221 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000212040 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000110514 | 6020 | 6299 | 5418 | 5120 | 6625 |
| ENSG00000086015 | 4124 | 4315 | 3674 | 3316 | 4242 |
| ENSG00000211769 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000211768 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000250337 | 365 | 442 | 459 | 382 | 359 |
| ENSG00000255071 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000211767 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000211766 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000211765 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000211764 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000201510 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000169740 | 1220 | 1086 | 1086 | 1030 | 1333 |
#How many genes do we have measurement for?
dim(PRMT5_exp)
## [1] 58825 7
colnames(PRMT5_exp)
## [1] "gene_id" "GFPkd1" "GFPkd2" "GFPkd3" "PRMT5kd1" "PRMT5kd2" "PRMT5kd3"
Brief overview of the dataset up to 15 genes
| gene_id | GFPkd1 | GFPkd2 | GFPkd3 | PRMT5kd1 | PRMT5kd2 | PRMT5kd3 |
|---|---|---|---|---|---|---|
| ENSG00000282222 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000282221 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000212040 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000110514 | 6020 | 6299 | 5418 | 5120 | 6625 | 5485 |
| ENSG00000086015 | 4124 | 4315 | 3674 | 3316 | 4242 | 3387 |
| ENSG00000211769 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000211768 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000250337 | 365 | 442 | 459 | 382 | 359 | 482 |
| ENSG00000255071 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000211767 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000211766 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000211765 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000211764 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000201510 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000169740 | 1220 | 1086 | 1086 | 1030 | 1333 | 968 |
Since our original dataset does not provide the gene symbol, we will need to map the ensemble ID to HUGO gene symbol using the library EnsDb.Hsapiens.v79
#Convert ensemble ID to gene symbols in R
#Making sure all dependencies are installed
BiocManager::install("EnsDb.Hsapiens.v79")
## 'getOption("repos")' replaces Bioconductor standard repositories, see
## '?repositories' for details
##
## replacement repositories:
## CRAN: https://packagemanager.rstudio.com/all/__linux__/focal/latest
## Bioconductor version 3.14 (BiocManager 1.30.16), R 4.1.2 (2021-11-01)
## Warning: package(s) not installed when version(s) same as current; use `force = TRUE` to
## re-install: 'EnsDb.Hsapiens.v79'
## Installation paths not writeable, unable to update packages
## path: /usr/local/lib/R/library
## packages:
## class, foreign, MASS, Matrix, nlme, nnet, rpart, spatial
## Old packages: 'clipr', 'jsonlite'
library("EnsDb.Hsapiens.v79")
## Loading required package: ensembldb
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: GenomicFeatures
## Loading required package: AnnotationDbi
## Loading required package: AnnotationFilter
##
## Attaching package: 'ensembldb'
## The following object is masked from 'package:stats':
##
## filter
require("ensembldb")
BiocManager::install("ensembldb")
## 'getOption("repos")' replaces Bioconductor standard repositories, see
## '?repositories' for details
##
## replacement repositories:
## CRAN: https://packagemanager.rstudio.com/all/__linux__/focal/latest
## Bioconductor version 3.14 (BiocManager 1.30.16), R 4.1.2 (2021-11-01)
## Warning: package(s) not installed when version(s) same as current; use `force = TRUE` to
## re-install: 'ensembldb'
## Installation paths not writeable, unable to update packages
## path: /usr/local/lib/R/library
## packages:
## class, foreign, MASS, Matrix, nlme, nnet, rpart, spatial
## Old packages: 'clipr', 'jsonlite'
geneID_mapping <- ensembldb::select(EnsDb.Hsapiens.v79,
keys=PRMT5_exp$gene_id,
keytype="GENEID",
columns = c("SYMBOL","GENEID"))
symbol <- c() #create empty vector
for(i in PRMT5_exp$gene_id){
if(!(i %in% geneID_mapping$GENEID)){symbol <- c(symbol, NA)}
else{symbol <- c(symbol, geneID_mapping$SYMBOL[which(geneID_mapping$GENEID == i)])}
}
PRMT5_exp$HUGO_symbol <- symbol #add the gene symbols to the dataset as a column
Next, we will get the summarized gene count.
summarized_gene_counts <- sort(table(PRMT5_exp$HUGO_symbol),
decreasing = TRUE)
knitr::kable(summarized_gene_counts[which(summarized_gene_counts>1)[1:10]],
format="html")
| Var1 | Freq |
|---|---|
| Y_RNA | 754 |
| Metazoa_SRP | 251 |
| U3 | 55 |
| uc_338 | 36 |
| snoU13 | 34 |
| SNORA70 | 30 |
| U6 | 28 |
| SNORA31 | 27 |
| 5S_rRNA | 25 |
| SNORA40 | 22 |
Result:
| Var1 | Freq |
|---|---|
| Y_RNA | 754 |
| Metazoa_SRP | 251 |
| U3 | 55 |
| uc_338 | 36 |
| snoU13 | 34 |
| SNORA70 | 30 |
| U6 | 28 |
| SNORA31 | 27 |
| 5S_rRNA | 25 |
| SNORA40 | 22 |
We will now define the groups: Control (GFP) vs Experimental (PRMT5). Tere are 3 samples for each group
#Define the groups
samples <- data.frame(lapply(colnames(PRMT5_exp)[2:7],
FUN=function(x){unlist(strsplit(x, split = "kd"))}))
colnames(samples) <- colnames(PRMT5_exp)[2:7]
rownames(samples) <- c("Replicates", "Experimental_Group")
samples <- data.frame(t(samples))
Next, we want to translate out the counts into count per million (cpm) and get rid of low counts.
#Translate out counts into count per million using edgeR
cpms = edgeR::cpm(PRMT5_exp[,2:7])
rownames(cpms) <- PRMT5_exp[,1]
#Get rid of low counts
keep = rowSums(cpms >1) >=3 # n=3, 3 samples per group
PRMT5_exp_filtered = PRMT5_exp[keep,]
dim(PRMT5_exp_filtered) # first col is the number of remaining genes
## [1] 15159 8
We also want to check the percentage of the data in which the ensemble ID could not be mapped to HGNC.
#Find percentage of unmapped genes
unmapped_percentage <-sum(is.na(PRMT5_exp_filtered$HUGO_symbol))/
length(PRMT5_exp_filtered$HUGO_symbol)*100
#Remove any rows with NA in their HUGO_symbol
PRMT5_exp_filtered <- na.omit(PRMT5_exp_filtered)
#Just double checking if there are any NA left!
any(is.na(PRMT5_exp_filtered$HUGO_symbol))
## [1] FALSE
Since only 0.8641731% of our data is unmapped, we can safely filter out the data points (rows) that contain NA in their gene symbol.
We then check if removing genes with low count solve the duplication issue
summarized_gene_counts_filtered <- sort(table(PRMT5_exp_filtered$HUGO_symbol),
decreasing = TRUE)
knitr::kable(summarized_gene_counts_filtered[
which(summarized_gene_counts>1)[1:10]],
format="html")
| Var1 | Freq |
|---|---|
| C2orf15 | 2 |
| CBS | 2 |
| CLN3 | 2 |
| DUXAP8 | 2 |
| FAM47E-STBD1 | 2 |
| HERC2P2 | 2 |
| HIST2H2AA4 | 2 |
| ICOSLG | 2 |
| IDS | 2 |
| KIAA0391 | 2 |
Result:
| Var1 | Freq |
|---|---|
| C2orf15 | 2 |
| CBS | 2 |
| CLN3 | 2 |
| DUXAP8 | 2 |
| FAM47E-STBD1 | 2 |
| HERC2P2 | 2 |
| HIST2H2AA4 | 2 |
| ICOSLG | 2 |
| IDS | 2 |
| KIAA0391 | 2 |
It seems that filtering out the low count does indeed solve some duplication issue. We have cleaned up our data and will now move on to normalize it.
Part 2: Normalization of data
We proceed to normalize the data using TMM and compare the box plot and density plot between the original count and normalized count.
Normalization using TMM since our data distribution is closest to normal.
library(edgeR)
## Loading required package: limma
##
## Attaching package: 'limma'
## The following object is masked from 'package:BiocGenerics':
##
## plotMA
#Creating an edgeR container for RNASep count data
filtered_data_matrix <- as.matrix(PRMT5_exp_filtered[,2:7])
rownames(filtered_data_matrix) <- PRMT5_exp_filtered$HUGO_symbol
#d is an edgeR object
d = DGEList(counts = filtered_data_matrix, group = samples$Experimental_Group)
#Calculuate the normalization factors
d = calcNormFactors(d)
normalized_counts <- cpm(d)
Visualize with boxplot
#Boxplot
par(mfrow=c(1,2))
#Distribution of our original data - Boxplot
data2plot <- log2(edgeR::cpm(PRMT5_exp_filtered[,2:7]))
boxplot(data2plot, xlab = "Samples", ylab = "log2 CPM",
las = 2, cex = 0.5, cex.lab = 0.5,
cex.axis = 0.5, main = "Original Count")
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 1 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 2 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 3 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 4 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 5 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 6 is not drawn
#draw the median on each box plot
abline(h = median(apply(data2plot, 2, median)),
col = "red", lwd = 0.6, lty = "dashed")
################################################################################
#Boxplot for normalized counts
data2plot_normalized <- log2(normalized_counts)
boxplot(data2plot_normalized, xlab = "Samples", ylab = "log2 CPM",
las = 2, cex = 0.5, cex.lab = 0.5,
cex.axis = 0.5, main = "Normalized")
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 1 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 2 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 3 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 4 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 5 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 6 is not drawn
abline(h = median(apply(data2plot_normalized, 2, median)),
col = "red", lwd = 0.6, lty = "dashed")
Result:
Visualize with density plot
par(mfrow=c(1,2))
#Density Plot for original count data
counts_density <- apply(log2(edgeR::cpm(PRMT5_exp_filtered[,2:7])), 2, density)
#Calculate the limits across all the samples
xlim <- 0; ylim <- 0
for (i in 1:length(counts_density)) {
xlim <- range(c(xlim, counts_density[[i]]$x));
ylim <- range(c(ylim, counts_density[[i]]$y))
}
cols <- rainbow(length(counts_density))
ltys <- rep(1, length(counts_density))
#plot the first density plot to initialize the plot
plot(counts_density[[1]], xlim=xlim, ylim=ylim, type="n",
ylab="Smoothing density of log2-CPM", main="Original Count", cex.lab = 0.8)
#plot each line
for (i in 1:length(counts_density)) lines(counts_density[[i]], col=cols[i])
#create legend
legend("topright", colnames(data2plot),
col=cols, lty=ltys, cex=0.5,
border ="blue", text.col = "green4",
merge = TRUE, bg = "gray90")
##############################################################################
#Density Plot for normalized count data
norm_counts_density <- apply(log2(normalized_counts), 2, density)
xlim <- 0; ylim <- 0
for (i in 1:length(norm_counts_density)) {
xlim <- range(c(xlim, norm_counts_density[[i]]$x));
ylim <- range(c(ylim, norm_counts_density[[i]]$y))
}
cols <- rainbow(length(norm_counts_density))
ltys <- rep(1, length(norm_counts_density))
plot(norm_counts_density[[1]], xlim=xlim, ylim=ylim, type="n",
ylab="Smoothing density of log2-CPM", main="Normalized count", cex.lab = 0.8)
for (i in 1:length(norm_counts_density)) lines(norm_counts_density[[i]], col=cols[i])
legend("topright", colnames(data2plot_normalized),
col=cols, lty=ltys, cex=0.5,
border ="blue", text.col = "green4",
merge = TRUE, bg = "gray90")
Result:
From the looks of it, the original count and normalized count graphs did not differ by much, which is a good sign as it means there is not much technical variation.
Finally, we will check the MDS plot, which measure the distance between samples.
#MDS plot: distance between samples
plotMDS(d, labels=rownames(samples),
col = c("darkgreen","blue")[factor(samples$Experimental_Group)])
MDS
Conclusion from MDS plot: The strength of our result should be decent since there is clear separation between control and experimental groups. There is a slight clustering for the experimental group (PRMT5 replicates).
Part 3: Interpretation
I am interested in this dataset because it elucidates the mechanism and involvement of a key protein PRMT5 in the immune response for lung cancer patients. This research can provide me insights into whether PRMT5 inhibition, in combination with other treament, can be a strategy for lung cancer treatment.
The experiment involves 6 biological replicates (3 for each group) cultivated from lung adenocarcinoma cell lines. The control group is the GFP-expressed shRNA cells while the experimental group is the PRMT5-expressed shRNA cells.
As stated above in the overview of the data analysis, some of the expression values could not be mapped to current HUGO symbols and are thus safely removed since these unmapped values only account for 0.86% of the filtered data. I also get rid of genes with low counts. This helped with duplication issue.
My dataset is monitoring the expression of approximately 15000 genes, so I can conclude that it has good coverage.