BCB420 Assignment 1 Notebook

Part 1: Getting the dataset ready. Cleaning the data and mapping to HGNC

Downloading the data of GSE178521. Providing some metadata of the platform and dataset.

require(GEOquery)

## Loading required package: GEOquery

## Loading required package: Biobase

## Loading required package: BiocGenerics

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## Setting options('download.file.method.GEOquery'='auto')

## Setting options('GEOquery.inmemory.gpl'=FALSE)

gse <- getGEO("GSE178521", GSEMatrix=FALSE)

## Reading file....

## Parsing....

## Found 7 entities...

## GPL24676 (1 of 8 entities)

## GSM5393322 (2 of 8 entities)

## GSM5393323 (3 of 8 entities)

## GSM5393324 (4 of 8 entities)

## GSM5393325 (5 of 8 entities)

## GSM5393326 (6 of 8 entities)

## GSM5393327 (7 of 8 entities)

#Information about Platform
current_gpl <- names(GPLList(gse))[1]
current_gpl_info <- Meta(getGEO(current_gpl))

Metadata of the series

contact_address	contact_city	contact_country	contact_email	contact_institute	contact_name
上海市杨浦区控江路新华医院	shanghai	China	bettyerzhou@sjtu.edu.cn	上海交大医学院附属新华医院	bingqian,,zhou

Information about the platform - GPL24676

Platform title : Illumina NovaSeq 6000 (Homo sapiens)

Submission data : Mar 02 2018

Last update data : Nov 05 2018

Organims : Homo sapiens (taxid: 9606)

Number of GEO datasets that use this techology : 4140

Number of GEO samples that use this technology : 95270

Get the expression data and displaying information about it.

sfiles = getGEOSuppFiles("GSE178521")
fnames = rownames(sfiles)
PRMT5_exp = read.delim(fnames[1],header=TRUE,
                       check.names = FALSE)

#Information
read.table(fnames[1],header=TRUE,
           check.names = FALSE)[1,]

knitr::kable(PRMT5_exp[1:15,1:6], format = "html")

gene_id	GFPkd1	GFPkd2	GFPkd3	PRMT5kd1	PRMT5kd2
ENSG00000282222	0	0	0	0	0
ENSG00000282221	0	0	0	0	0
ENSG00000212040	0	0	0	0	0
ENSG00000110514	6020	6299	5418	5120	6625
ENSG00000086015	4124	4315	3674	3316	4242
ENSG00000211769	0	0	0	0	0
ENSG00000211768	0	0	0	0	0
ENSG00000250337	365	442	459	382	359
ENSG00000255071	0	0	0	0	0
ENSG00000211767	0	0	0	0	0
ENSG00000211766	0	0	0	0	0
ENSG00000211765	0	0	0	0	0
ENSG00000211764	0	0	0	0	0
ENSG00000201510	0	0	0	0	0
ENSG00000169740	1220	1086	1086	1030	1333

#How many genes do we have measurement for?
dim(PRMT5_exp)

## [1] 58825     7

colnames(PRMT5_exp)

## [1] "gene_id"  "GFPkd1"   "GFPkd2"   "GFPkd3"   "PRMT5kd1" "PRMT5kd2" "PRMT5kd3"

Brief overview of the dataset up to 15 genes

gene_id	GFPkd1	GFPkd2	GFPkd3	PRMT5kd1	PRMT5kd2	PRMT5kd3
ENSG00000282222	0	0	0	0	0	0
ENSG00000282221	0	0	0	0	0	0
ENSG00000212040	0	0	0	0	0	0
ENSG00000110514	6020	6299	5418	5120	6625	5485
ENSG00000086015	4124	4315	3674	3316	4242	3387
ENSG00000211769	0	0	0	0	0	0
ENSG00000211768	0	0	0	0	0	0
ENSG00000250337	365	442	459	382	359	482
ENSG00000255071	0	0	0	0	0	0
ENSG00000211767	0	0	0	0	0	0
ENSG00000211766	0	0	0	0	0	0
ENSG00000211765	0	0	0	0	0	0
ENSG00000211764	0	0	0	0	0	0
ENSG00000201510	0	0	0	0	0	0
ENSG00000169740	1220	1086	1086	1030	1333	968

Since our original dataset does not provide the gene symbol, we will need to map the ensemble ID to HUGO gene symbol using the library EnsDb.Hsapiens.v79

#Convert ensemble ID to gene symbols in R
#Making sure all dependencies are installed
BiocManager::install("EnsDb.Hsapiens.v79")

## 'getOption("repos")' replaces Bioconductor standard repositories, see
## '?repositories' for details
## 
## replacement repositories:
##     CRAN: https://packagemanager.rstudio.com/all/__linux__/focal/latest

## Bioconductor version 3.14 (BiocManager 1.30.16), R 4.1.2 (2021-11-01)

## Warning: package(s) not installed when version(s) same as current; use `force = TRUE` to
##   re-install: 'EnsDb.Hsapiens.v79'

## Installation paths not writeable, unable to update packages
##   path: /usr/local/lib/R/library
##   packages:
##     class, foreign, MASS, Matrix, nlme, nnet, rpart, spatial

## Old packages: 'clipr', 'jsonlite'

library("EnsDb.Hsapiens.v79")

## Loading required package: ensembldb

## Loading required package: GenomicRanges

## Loading required package: stats4

## Loading required package: S4Vectors

## 
## Attaching package: 'S4Vectors'

## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname

## Loading required package: IRanges

## Loading required package: GenomeInfoDb

## Loading required package: GenomicFeatures

## Loading required package: AnnotationDbi

## Loading required package: AnnotationFilter

## 
## Attaching package: 'ensembldb'

## The following object is masked from 'package:stats':
## 
##     filter

require("ensembldb")
BiocManager::install("ensembldb")

## 'getOption("repos")' replaces Bioconductor standard repositories, see
## '?repositories' for details
## 
## replacement repositories:
##     CRAN: https://packagemanager.rstudio.com/all/__linux__/focal/latest

## Bioconductor version 3.14 (BiocManager 1.30.16), R 4.1.2 (2021-11-01)

## Warning: package(s) not installed when version(s) same as current; use `force = TRUE` to
##   re-install: 'ensembldb'

## Installation paths not writeable, unable to update packages
##   path: /usr/local/lib/R/library
##   packages:
##     class, foreign, MASS, Matrix, nlme, nnet, rpart, spatial

## Old packages: 'clipr', 'jsonlite'

geneID_mapping <- ensembldb::select(EnsDb.Hsapiens.v79, 
                                           keys=PRMT5_exp$gene_id,
                                           keytype="GENEID", 
                                           columns = c("SYMBOL","GENEID"))
symbol <- c() #create empty vector
for(i in PRMT5_exp$gene_id){
  if(!(i %in% geneID_mapping$GENEID)){symbol <- c(symbol, NA)}
  else{symbol <- c(symbol, geneID_mapping$SYMBOL[which(geneID_mapping$GENEID == i)])}
  
}
PRMT5_exp$HUGO_symbol <- symbol #add the gene symbols to the dataset as a column

Next, we will get the summarized gene count.

summarized_gene_counts <- sort(table(PRMT5_exp$HUGO_symbol),
                               decreasing = TRUE)
knitr::kable(summarized_gene_counts[which(summarized_gene_counts>1)[1:10]], 
             format="html")

Var1	Freq
Y_RNA	754
Metazoa_SRP	251
U3	55
uc_338	36
snoU13	34
SNORA70	30
U6	28
SNORA31	27
5S_rRNA	25
SNORA40	22

Result:

Var1	Freq
Y_RNA	754
Metazoa_SRP	251
U3	55
uc_338	36
snoU13	34
SNORA70	30
U6	28
SNORA31	27
5S_rRNA	25
SNORA40	22

We will now define the groups: Control (GFP) vs Experimental (PRMT5). Tere are 3 samples for each group

#Define the groups
samples <- data.frame(lapply(colnames(PRMT5_exp)[2:7], 
                             FUN=function(x){unlist(strsplit(x, split = "kd"))}))

colnames(samples) <- colnames(PRMT5_exp)[2:7]
rownames(samples) <- c("Replicates", "Experimental_Group")
samples <- data.frame(t(samples))

Next, we want to translate out the counts into count per million (cpm) and get rid of low counts.

#Translate out counts into count per million using edgeR
cpms = edgeR::cpm(PRMT5_exp[,2:7])
rownames(cpms) <- PRMT5_exp[,1]

#Get rid of low counts
keep = rowSums(cpms >1) >=3 # n=3, 3 samples per group

PRMT5_exp_filtered = PRMT5_exp[keep,]
dim(PRMT5_exp_filtered) # first col is the number of remaining genes

## [1] 15159     8

We also want to check the percentage of the data in which the ensemble ID could not be mapped to HGNC.

#Find percentage of unmapped genes
unmapped_percentage <-sum(is.na(PRMT5_exp_filtered$HUGO_symbol))/
  length(PRMT5_exp_filtered$HUGO_symbol)*100

#Remove any rows with NA in their HUGO_symbol
PRMT5_exp_filtered <- na.omit(PRMT5_exp_filtered)

#Just double checking if there are any NA left!
any(is.na(PRMT5_exp_filtered$HUGO_symbol))

## [1] FALSE

Since only 0.8641731% of our data is unmapped, we can safely filter out the data points (rows) that contain NA in their gene symbol.

We then check if removing genes with low count solve the duplication issue

summarized_gene_counts_filtered <- sort(table(PRMT5_exp_filtered$HUGO_symbol),
                                        decreasing = TRUE)
knitr::kable(summarized_gene_counts_filtered[
  which(summarized_gene_counts>1)[1:10]], 
  format="html")

Var1	Freq
C2orf15	2
CBS	2
CLN3	2
DUXAP8	2
FAM47E-STBD1	2
HERC2P2	2
HIST2H2AA4	2
ICOSLG	2
IDS	2
KIAA0391	2

Result:

Var1	Freq
C2orf15	2
CBS	2
CLN3	2
DUXAP8	2
FAM47E-STBD1	2
HERC2P2	2
HIST2H2AA4	2
ICOSLG	2
IDS	2
KIAA0391	2

It seems that filtering out the low count does indeed solve some duplication issue. We have cleaned up our data and will now move on to normalize it.

Part 2: Normalization of data

We proceed to normalize the data using TMM and compare the box plot and density plot between the original count and normalized count.

Normalization using TMM since our data distribution is closest to normal.

library(edgeR)

## Loading required package: limma

## 
## Attaching package: 'limma'

## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA

#Creating an edgeR container for RNASep count data
filtered_data_matrix <- as.matrix(PRMT5_exp_filtered[,2:7])
rownames(filtered_data_matrix) <- PRMT5_exp_filtered$HUGO_symbol
#d is an edgeR object
d = DGEList(counts = filtered_data_matrix, group = samples$Experimental_Group)

#Calculuate the normalization factors
d = calcNormFactors(d)
normalized_counts <- cpm(d)

Visualize with boxplot

#Boxplot
par(mfrow=c(1,2))
#Distribution of our original data - Boxplot
data2plot <- log2(edgeR::cpm(PRMT5_exp_filtered[,2:7]))
boxplot(data2plot, xlab = "Samples", ylab = "log2 CPM", 
        las = 2, cex = 0.5, cex.lab = 0.5,
        cex.axis = 0.5, main = "Original Count")

## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 1 is not drawn

## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 2 is not drawn

## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 3 is not drawn

## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 4 is not drawn

## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 5 is not drawn

## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 6 is not drawn

#draw the median on each box plot
abline(h = median(apply(data2plot, 2, median)), 
       col = "red", lwd = 0.6, lty = "dashed")

################################################################################

#Boxplot for normalized counts
data2plot_normalized <- log2(normalized_counts)
boxplot(data2plot_normalized, xlab = "Samples", ylab = "log2 CPM", 
        las = 2, cex = 0.5, cex.lab = 0.5,
        cex.axis = 0.5, main = "Normalized")

## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 1 is not drawn

## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 2 is not drawn

## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 3 is not drawn

## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 4 is not drawn

## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 5 is not drawn

## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 6 is not drawn

abline(h = median(apply(data2plot_normalized, 2, median)), 
       col = "red", lwd = 0.6, lty = "dashed")

Result: Boxplot

Visualize with density plot

par(mfrow=c(1,2))
#Density Plot for original count data
counts_density <- apply(log2(edgeR::cpm(PRMT5_exp_filtered[,2:7])), 2, density)

#Calculate the limits across all the samples
xlim <- 0; ylim <- 0
for (i in 1:length(counts_density)) {
  xlim <- range(c(xlim, counts_density[[i]]$x)); 
  ylim <- range(c(ylim, counts_density[[i]]$y))
}
cols <- rainbow(length(counts_density))
ltys <- rep(1, length(counts_density))
#plot the first density plot to initialize the plot
plot(counts_density[[1]], xlim=xlim, ylim=ylim, type="n", 
     ylab="Smoothing density of log2-CPM", main="Original Count", cex.lab = 0.8)
#plot each line
for (i in 1:length(counts_density)) lines(counts_density[[i]], col=cols[i])
#create legend
legend("topright", colnames(data2plot),  
       col=cols, lty=ltys, cex=0.5, 
       border ="blue",  text.col = "green4", 
       merge = TRUE, bg = "gray90")

##############################################################################

#Density Plot for normalized count data
norm_counts_density <- apply(log2(normalized_counts), 2, density)
xlim <- 0; ylim <- 0
for (i in 1:length(norm_counts_density)) {
  xlim <- range(c(xlim, norm_counts_density[[i]]$x)); 
  ylim <- range(c(ylim, norm_counts_density[[i]]$y))
}
cols <- rainbow(length(norm_counts_density))
ltys <- rep(1, length(norm_counts_density))
plot(norm_counts_density[[1]], xlim=xlim, ylim=ylim, type="n", 
     ylab="Smoothing density of log2-CPM", main="Normalized count", cex.lab = 0.8)
for (i in 1:length(norm_counts_density)) lines(norm_counts_density[[i]], col=cols[i])
legend("topright", colnames(data2plot_normalized),  
       col=cols, lty=ltys, cex=0.5, 
       border ="blue",  text.col = "green4", 
       merge = TRUE, bg = "gray90")

Result: Density plot

From the looks of it, the original count and normalized count graphs did not differ by much, which is a good sign as it means there is not much technical variation.

Finally, we will check the MDS plot, which measure the distance between samples.

#MDS plot: distance between samples
plotMDS(d, labels=rownames(samples),
        col = c("darkgreen","blue")[factor(samples$Experimental_Group)])

MDS

Conclusion from MDS plot: The strength of our result should be decent since there is clear separation between control and experimental groups. There is a slight clustering for the experimental group (PRMT5 replicates).

Part 3: Interpretation

I am interested in this dataset because it elucidates the mechanism and involvement of a key protein PRMT5 in the immune response for lung cancer patients. This research can provide me insights into whether PRMT5 inhibition, in combination with other treament, can be a strategy for lung cancer treatment.

The experiment involves 6 biological replicates (3 for each group) cultivated from lung adenocarcinoma cell lines. The control group is the GFP-expressed shRNA cells while the experimental group is the PRMT5-expressed shRNA cells.

As stated above in the overview of the data analysis, some of the expression values could not be mapped to current HUGO symbols and are thus safely removed since these unmapped values only account for 0.86% of the filtered data. I also get rid of genes with low counts. This helped with duplication issue.

My dataset is monitoring the expression of approximately 15000 genes, so I can conclude that it has good coverage.