Assignment 9+10

I will begin by loading the necessary libraries for RNA-seq analysis. I am trying to set up a workflow that can download count data, clean metadata, annotate genes, and create a DGEList object for differential expression analysis.

# Load required libraries

library(edgeR)

## Loading required package: limma

library(limma)
library(Homo.sapiens)

## Loading required package: AnnotationDbi

## Loading required package: stats4

## Loading required package: BiocGenerics

## Loading required package: generics

## 
## Attaching package: 'generics'

## The following objects are masked from 'package:base':
## 
##     as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
##     setequal, union

## 
## Attaching package: 'BiocGenerics'

## The following object is masked from 'package:limma':
## 
##     plotMA

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, saveRDS, table, tapply, unique,
##     unsplit, which.max, which.min

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## Loading required package: IRanges

## Loading required package: S4Vectors

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:utils':
## 
##     findMatches

## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname

## Loading required package: OrganismDbi

## Loading required package: Seqinfo

## Loading required package: GenomicFeatures

## Loading required package: GenomicRanges

## Loading required package: GO.db

##

## Loading required package: org.Hs.eg.db

##

## Loading required package: TxDb.Hsapiens.UCSC.hg19.knownGene

library(pheatmap)
library(cowplot)
library(DT)
library(org.Hs.eg.db)
library(clusterProfiler)

## clusterProfiler v4.18.1 Learn more at https://yulab-smu.top/contribution-knowledge-mining/
## 
## Please cite:
## 
## Guangchuang Yu, Li-Gen Wang, Yanyan Han and Qing-Yu He.
## clusterProfiler: an R package for comparing biological themes among
## gene clusters. OMICS: A Journal of Integrative Biology. 2012,
## 16(5):284-287

## 
## Attaching package: 'clusterProfiler'

## The following object is masked from 'package:OrganismDbi':
## 
##     select

## The following object is masked from 'package:AnnotationDbi':
## 
##     select

## The following object is masked from 'package:IRanges':
## 
##     slice

## The following object is masked from 'package:S4Vectors':
## 
##     rename

## The following object is masked from 'package:stats':
## 
##     filter

library(pathview)

## ##############################################################################
## Pathview is an open source software package distributed under GNU General
## Public License version 3 (GPLv3). Details of GPLv3 is available at
## http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to
## formally cite the original Pathview paper (not just mention it) in publications
## or products. For details, do citation("pathview") within R.
## 
## The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG
## license agreement (details at http://www.kegg.jp/kegg/legal.html).
## ##############################################################################

library(enrichplot)

## enrichplot v1.30.1 Learn more at https://yulab-smu.top/contribution-knowledge-mining/
## 
## Please cite:
## 
## Qianwen Wang, Ming Li, Tianzhi Wu, Li Zhan, Lin Li, Meijun Chen, Wenqin
## Xie, Zijing Xie, Erqiang Hu, Shuangbin Xu, Guangchuang Yu. Exploring
## epigenomic datasets by ChIPseeker. Current Protocols. 2022, 2(10): e585

library(genefilter)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::%within%() masks IRanges::%within%()
## ✖ dplyr::collapse()     masks IRanges::collapse()
## ✖ dplyr::combine()      masks Biobase::combine(), BiocGenerics::combine()
## ✖ dplyr::desc()         masks IRanges::desc()
## ✖ tidyr::expand()       masks S4Vectors::expand()
## ✖ dplyr::filter()       masks clusterProfiler::filter(), stats::filter()
## ✖ dplyr::first()        masks S4Vectors::first()
## ✖ dplyr::lag()          masks stats::lag()
## ✖ ggplot2::Position()   masks BiocGenerics::Position(), base::Position()
## ✖ purrr::reduce()       masks GenomicRanges::reduce(), IRanges::reduce()
## ✖ dplyr::rename()       masks clusterProfiler::rename(), S4Vectors::rename()
## ✖ lubridate::second()   masks S4Vectors::second()
## ✖ lubridate::second<-() masks S4Vectors::second<-()
## ✖ dplyr::select()       masks clusterProfiler::select(), OrganismDbi::select(), AnnotationDbi::select()
## ✖ purrr::simplify()     masks clusterProfiler::simplify()
## ✖ dplyr::slice()        masks clusterProfiler::slice(), IRanges::slice()
## ✖ readr::spec()         masks genefilter::spec()
## ✖ lubridate::stamp()    masks cowplot::stamp()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(GEOquery)

## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)

library(data.table)

## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose
## 
## The following object is masked from 'package:GenomicRanges':
## 
##     shift
## 
## The following object is masked from 'package:IRanges':
## 
##     shift
## 
## The following objects are masked from 'package:S4Vectors':
## 
##     first, second

Next, I will download the raw RNA-seq counts table from GEO and inspect it. The rows represent genes (identified by Entrez IDs) and the columns represent samples (GSM identifiers).

# Download and import count table

urld <- "https://www.ncbi.nlm.nih.gov/geo/download/?format=file&type=rnaseq_counts"

# Set a path to the file
path <- paste(urld, "acc=GSE241942", "file=GSE241942_raw_counts_GRCh38.p13_NCBI.tsv.gz", sep="&")

# Download the table and store as an object
tbl <- as.matrix(data.table::fread(path, header=T, colClasses="integer"), rownames=1)

# Inspect

dim(tbl)

## [1] 39376    21

tbl[1:5, 1:5]

##           GSM7746235 GSM7746236 GSM7746237 GSM7746238 GSM7746239
## 100287102          1          0          2          5          2
## 653635           178        179        221        160        178
## 102466751          2          1          4          1          4
## 107985730          0          0          0          0          0
## 100302278          0          0          0          0          0

I will then download the GEO metadata, clean it, and create a usable samples table with a grouping variable that represents the major experimental categories.

# Obtain GEO metadata

gset <- getGEO("GSE241942", GSEMatrix =TRUE, getGPL=TRUE)

## Found 1 file(s)

## GSE241942_series_matrix.txt.gz

metaData<-gset[[1]]
#pData(metaData)

# Clean metadata and create 'samples' table

samples <- pData(metaData) %>% 
  dplyr::mutate(treatment = `treatment:ch1`) %>% 
  dplyr::select(title, treatment)

head(samples)

##                  title treatment
## GSM7746235   Vehicle 1   Vehicle
## GSM7746236   Vehicle 2   Vehicle
## GSM7746237   Vehicle 3   Vehicle
## GSM7746238 Estradiol 1 Estradiol
## GSM7746239 Estradiol 2 Estradiol
## GSM7746240 Estradiol 3 Estradiol

Next, I will annotate the genes using Entrez IDs and the Homo.sapiens database. I will remove duplicated IDs to simplify downstream analysis and ensure the gene order matches the counts matrix.

# Gene annotation

geneid <- rownames(tbl)
genes <- AnnotationDbi::select(Homo.sapiens, 
                                                             keys=geneid, 
                                                             columns=c("SYMBOL", "TXCHROM", "ENZYME"), 
                                                             keytype="ENTREZID")

## 'select()' returned 1:many mapping between keys and columns

# Remove duplicates and ensure alignment

genes <- genes[!duplicated(genes$ENTREZID), ]
genes <- genes[match(rownames(tbl), genes$ENTREZID), ]

# Sanity check

summary(genes$ENTREZID == geneid)

##    Mode    TRUE 
## logical   39376

head(genes)

##    ENTREZID ENZYME      SYMBOL TXCHROM
## 1 100287102   <NA>     DDX11L1    chr1
## 2    653635   <NA>      WASH7P    chr1
## 3 102466751   <NA>   MIR6859-1    <NA>
## 4 107985730   <NA> MIR1302-2HG    <NA>
## 5 100302278   <NA>   MIR1302-2    <NA>
## 6    645520   <NA>     FAM138A    chr1

Finally, I will build the DGEList object, which contains the count data, the sample metadata, the gene annotation, and the grouping vector. This object will be used for normalization and differential expression analysis.

# Create DGEList object

x <- DGEList(counts = tbl, 
             samples = samples, 
             genes = genes, 
             group = samples$treatment)

# Inspect DGEList
names(x)

## [1] "counts"  "samples" "genes"

#x$samples
#x$genes

The DGEList object now contains three main tables: counts, samples, and genes. It also has automatically calculated library sizes (lib.size) and normalization factors (norm.factors) which will be used in the next steps of normalization and analysis.

EXERCISE 1

You are interested in the difference in gene expression caused by treatment with OH-Tamoxifen and Fulvestrant. Subset the samples and filter the noisy data. If the data looks like it still has low quality counts you can apply a more stringent filter using the filterByExpr() function using the argument min.count = to increase the number of counts needed. Try a higher number.

In this section, I will focus the dataset on the experimental groups relevant to my comparison: OH-Tamoxifen, Fulvestrant, and Vehicle. Subsetting ensures that downstream normalization and modeling use only the samples involved in these treatments. After subsetting, I will evaluate expression quality by computing CPM (counts per million) and generating density plots. To remove low-quality or uninformative genes, I will use filterByExpr(), which assesses whether each gene is sufficiently expressed across samples. I will apply a moderately stringent threshold (min.count = 25) and then try a stricter filter (min.count = 50) if the distribution still shows excess low counts. Finally, I will compare CPM curves before and after filtering to confirm that low-expression noise has been reduced.

# EXERCISE 1: Subset & filter noisy data

# Subset the object for OH-Tamoxifen, Fulvestrant, and Vehicle
TamFulv <- x[, grepl("4OH-tamoxifen|Fulvestrant|Vehicle", x$samples$treatment)]

TamFulv$samples

##                    group lib.size norm.factors           title     treatment
## GSM7746235       Vehicle 15224711            1       Vehicle 1       Vehicle
## GSM7746236       Vehicle 15418701            1       Vehicle 2       Vehicle
## GSM7746237       Vehicle 15648016            1       Vehicle 3       Vehicle
## GSM7746241   Fulvestrant 20556418            1   Fulvestrant 1   Fulvestrant
## GSM7746242   Fulvestrant 17184976            1   Fulvestrant 2   Fulvestrant
## GSM7746243   Fulvestrant 20401546            1   Fulvestrant 3   Fulvestrant
## GSM7746247 4OH-tamoxifen 24286385            1 4OH-tamoxifen 1 4OH-tamoxifen
## GSM7746248 4OH-tamoxifen 20185157            1 4OH-tamoxifen 2 4OH-tamoxifen
## GSM7746249 4OH-tamoxifen 18623218            1 4OH-tamoxifen 3 4OH-tamoxifen

# A copy of the unfiltered logCPM values
lcpm_dirty <- cpm(TamFulv, log = TRUE, prior.count = 2)

# A cpm copy for checking how many genes will be removed
cpm_vals <- cpm(TamFulv)
table(rowSums(cpm_vals > 0.5) >= 3)

## 
## FALSE  TRUE 
## 24022 15354

# Basic automated filtering
keep.exprs <- filterByExpr(TamFulv, group = TamFulv$samples$group, min.count = 25)

TamFulv <- TamFulv[keep.exprs, , keep.lib.sizes = FALSE]

# cpm values after filtering
cpm_vals <- cpm(TamFulv)
table(rowSums(cpm_vals > 0.5) >= 3)

## 
##  TRUE 
## 13708

# logCPM object for the filtered data
lcpm <- cpm(TamFulv, log = TRUE, prior.count = 2)

# Plot unfiltered logCPM
data_dirty <- pivot_longer(as.data.frame(lcpm_dirty), cols = everything())

data_dirty %>% 
  ggplot(aes(x = value, colour = name)) +
  geom_density() +
  ggtitle("Unfiltered CPM")

# Plot filtered logCPM
data_filt <- pivot_longer(as.data.frame(lcpm), cols = everything())

data_filt %>% 
  ggplot(aes(x = value, colour = name)) +
  geom_density() +
  ggtitle("Filtered CPM")

# Try a stricter filter if counts still look low quality
keep.exprs_strict <- filterByExpr(TamFulv, group = TamFulv$samples$group, min.count = 50)

TamFulv_strict <- TamFulv[keep.exprs_strict, , keep.lib.sizes = FALSE]

TamFulv_strict

## An object of class "DGEList"
## $counts
##           GSM7746235 GSM7746236 GSM7746237 GSM7746241 GSM7746242 GSM7746243
## 653635           178        179        221        343        273        377
## 102723897        164        175        179        267        250        280
## 100132287         29         28         31         67         65         68
## 113219467        448        523        342        608        712        921
## 100133331         27         31         35         70         60         59
##           GSM7746247 GSM7746248 GSM7746249
## 653635           235        168        174
## 102723897        256        186        189
## 100132287         41         28         28
## 113219467        449        687        524
## 100133331         41         40         37
## 12581 more rows ...
## 
## $samples
##                    group lib.size norm.factors           title     treatment
## GSM7746235       Vehicle 15167811            1       Vehicle 1       Vehicle
## GSM7746236       Vehicle 15360055            1       Vehicle 2       Vehicle
## GSM7746237       Vehicle 15588792            1       Vehicle 3       Vehicle
## GSM7746241   Fulvestrant 20471654            1   Fulvestrant 1   Fulvestrant
## GSM7746242   Fulvestrant 17112837            1   Fulvestrant 2   Fulvestrant
## GSM7746243   Fulvestrant 20318656            1   Fulvestrant 3   Fulvestrant
## GSM7746247 4OH-tamoxifen 24199653            1 4OH-tamoxifen 1 4OH-tamoxifen
## GSM7746248 4OH-tamoxifen 20114474            1 4OH-tamoxifen 2 4OH-tamoxifen
## GSM7746249 4OH-tamoxifen 18558813            1 4OH-tamoxifen 3 4OH-tamoxifen
## 
## $genes
##            ENTREZID ENZYME       SYMBOL TXCHROM
## 653635       653635   <NA>       WASH7P    chr1
## 102723897 102723897   <NA> LOC102723897    <NA>
## 100132287 100132287   <NA> LOC100132287    <NA>
## 113219467 113219467   <NA>     MIR12136    <NA>
## 100133331 100133331   <NA>         <NA>    <NA>
## 12581 more rows ...

The output confirms that the subsetting retained nine samples corresponding to the three treatment groups, each with reasonable library sizes. The initial CPM distribution (“Unfiltered CPM”) clearly shows a heavy accumulation of low-expression values, indicating the presence of many weak or inactive genes. After applying filterByExpr(min.count = 25), the number of retained genes decreases substantially, and the “Filtered CPM” plot shows a more uniform and biologically realistic distribution. Increasing the threshold to min.count = 50 further restricts the dataset to strongly expressed genes, which may improve the robustness of later modeling. Overall, the filtering process successfully removes noise and prepares the data for normalization.

EXERCISE 2

Using the filtered data set now run normalization, quality control plots and create a contrast of the data.

Now that the dataset has been filtered, I will normalize the counts using the TMM (Trimmed Mean of M-values) method via calcNormFactors(). This adjusts for global differences in sequencing depth so expression values become comparable across samples. I will generate log-CPM values and produce an MDS plot using plotMDS() to visualize sample-to-sample relationships and check whether treatment groups separate. Next, I will construct a design matrix that encodes each treatment condition as a model factor. Finally, I will create a contrast matrix defining the pairwise comparisons of interest: OH-Tamoxifen vs Vehicle, Fulvestrant vs Vehicle, and OH-Tamoxifen vs Fulvestrant. These contrasts will later be used for linear modeling.

# EXERCISE 2: Normalization, MDS, design & contrasts 

# Normalize the filtered data
TamFulv_norm <- calcNormFactors(TamFulv_strict, method = "TMM")
TamFulv_norm$samples$norm.factors

## [1] 1.0043439 0.9878939 1.0080627 1.0221215 1.0062772 1.0002467 1.0029393
## [8] 0.9799300 0.9888324

# Quality control (MDS plot)
# Collect logCPM values
lcpm_TF <- cpm(TamFulv_norm, log = TRUE)

# A colour vector for the groups
col.group_TF <- TamFulv_norm$samples$group

# MDS plot
plotMDS(lcpm_TF, labels = col.group_TF, col = as.numeric(col.group_TF))

# Design matrix
designSEQ_TF <- model.matrix(~0 + TamFulv_norm$samples$group)

# Rename columns to clean names
colnames(designSEQ_TF) <- c("OH_Tam", "Fulv", "Veh")
designSEQ_TF

##   OH_Tam Fulv Veh
## 1      0    0   1
## 2      0    0   1
## 3      0    0   1
## 4      0    1   0
## 5      0    1   0
## 6      0    1   0
## 7      1    0   0
## 8      1    0   0
## 9      1    0   0
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$`TamFulv_norm$samples$group`
## [1] "contr.treatment"

# Contrast matrix
contr.TamFulv <- makeContrasts(
  OH_Tam_vs_Veh = OH_Tam - Veh,
  Fulv_vs_Veh   = Fulv - Veh,
  OH_Tam_vs_Fulv = OH_Tam - Fulv,
  levels = colnames(designSEQ_TF)
)

contr.TamFulv

##         Contrasts
## Levels   OH_Tam_vs_Veh Fulv_vs_Veh OH_Tam_vs_Fulv
##   OH_Tam             1           0              1
##   Fulv               0           1             -1
##   Veh               -1          -1              0

The normalization factors are all close to 1, which indicates that the libraries were already reasonably well balanced. The MDS plot shows clear separation among the three treatment groups, demonstrating strong treatment-associated transcriptional differences. The design matrix correctly assigns samples to the OH-Tamoxifen, Fulvestrant, or Vehicle groups with one column per condition. The contrast matrix successfully encodes the three main comparisons for differential expression analysis. This setup confirms that the data are ready for linear modeling using voom and limma.

EXERCISE 3

Now continue processing by converting to a voom object and assessing the removal of heteroscedasticity. Then assess by PCA to see if it is similar to MDS plots.

To prepare the filtered and normalized counts for linear modeling, I will apply the voom() transformation. Voom computes log-CPM values and estimates the mean–variance relationship, producing precision weights that correct for heteroscedasticity—a key requirement for limma’s linear models. After generating the voom object, I will fit the design matrix using lmFit(), apply the contrasts using contrasts.fit(), and compute moderated statistics using eBayes(). I will then use plotSA() to visualize the final variance trend, confirming that heteroscedasticity has been removed. Finally, I will perform PCA on the voom expression matrix to assess whether sample clustering resembles the earlier MDS plot.

# EXERCISE 3: Voom, heteroscedasticity, PCA

# Two-pane layout

par(mfrow = c(1, 2))

# Voom transformation

v_TF <- voom(TamFulv_norm, designSEQ_TF, plot = TRUE)

# Linear modelling

vfit_TF <- lmFit(v_TF, designSEQ_TF)

# Apply contrasts

vfit_TF <- contrasts.fit(vfit_TF, contrasts = contr.TamFulv)

# Empirical Bayes shrinkage

efit_TF <- eBayes(vfit_TF)

# Plot mean-variance trend

plotSA(efit_TF, main = "Final model: Mean-variance trend")

# PCA visualization

# Extract expression values

data_TF <- as.data.frame(v_TF$E)

# PCA on transposed matrix (samples as rows)

pca_TF <- prcomp(t(data_TF))

# PC1 vs PC2

p1_TF <- ggplot(as.data.frame(pca_TF$x),
                aes(PC1, PC2, colour = TamFulv_norm$samples$group)) +
  geom_point() +
  labs(title = "PCA: PC1 vs PC2") +
  theme(legend.position = "none")


# PC3 vs PC4

p2_TF <- ggplot(as.data.frame(pca_TF$x),
                aes(PC3, PC4, colour = TamFulv_norm$samples$group)) +
  geom_point() +
  labs(title = "PCA: PC3 vs PC4", colour = "Group")

# Side-by-side layout

plot_grid(p1_TF, p2_TF, rel_widths = c(0.65, 1))

The voom transformation produces the expected mean-variance trend line, indicating that precision weights were appropriately fitted across genes. The plotSA() result shows that the final model exhibits a stable variance structure, confirming that heteroscedasticity was effectively addressed. PCA reveals clear separation of the three treatment groups along the first two principal components, consistent with the MDS results. Higher components (PC3 and PC4) show additional structure but remain largely treatment-dependent. These diagnostics validate that the voom-transformed data are suitable for accurate differential expression analysis.

EXERCISE 4

Use toptable to assess the differential expression. (this does not need to be in the assignment report its just for you to quality control your processing step, and ensure its working.). Create a data.table to enable others to explore the data set. Next, continue the analysis by investigating the similarities between the durgs using a Venn diagram. What observations can you make from this figure?

I will now extract differentially expressed genes using topTable(), which reports log fold changes, P-values, and adjusted P-values for each gene. Creating histograms of raw and adjusted P-values allows me to assess whether the model detected a strong treatment signal. I will also generate an interactive table summarizing all differential expression results across contrasts. To assess similarities and differences among treatment effects, I will use decideTests() to classify genes as up- or down-regulated in each contrast and visualize their overlap using a Venn diagram.

# EXERCISE 4: topTable, QC histograms, interactive DE table, Venn 

# Extract full differential expression table for one contrast (example: Tam vs Vehicle & Fulvestrant vs Vehicle)
tT_TF <- topTable(efit_TF, n = Inf, coef = 1)

# QC histograms
hist(tT_TF$adj.P.Val, main = "Adjusted P-values", xlab = "adj.P.Val")

hist(tT_TF$P.Val, main = "P-values", xlab = "P.Val")

# Interactive DE table (all genes, all contrasts)
topTable(efit_TF, adjust = "fdr", sort.by = "B", n = Inf) %>%
  mutate_if(is.numeric, ~ round(., 2)) %>%   
  datatable(options = list(scrollX = TRUE))

## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html

# Venn diagram: compare drug-induced DE patterns
d_TF <- decideTests(efit_TF, lfc = 1)

vennDiagram(
  d_TF,
  include    = c("up", "down"),
  counts.col = c("red", "green")
)

The P-value histograms show strong enrichment near zero, indicating that many genes respond significantly to treatment. Adjusted P-values remain skewed toward low values despite multiple-testing correction, reflecting robust signal in the dataset. The interactive table enables convenient exploration of differential expression patterns. The Venn diagram highlights both shared and treatment-specific transcriptional responses, illustrating that OH-Tamoxifen and Fulvestrant regulate many common genes but also trigger unique expression programs. This supports the idea that both drugs target estrogen signaling but with distinct downstream impacts.

EXERCISE 5

Create a heatmap of the differential expressed genes using the top variable genes. Describe your observations.

To visualize treatment-driven expression patterns, I will identify the top 5% most variable genes across all samples using the variance of the voom-transformed matrix. High-variance genes typically reflect biologically meaningful differences rather than noise. After computing the variance cutoff, I will extract the corresponding gene subset and create a heatmap using pheatmap() with row scaling enabled. I will annotate samples by their treatment to help interpret clustering patterns across conditions.

# EXERCISE 5: Heatmap of top variable genes

# Step 1 — Calculate variance of each gene across all samples
var_TF <- apply(v_TF$E, MARGIN = 1, FUN = var)

# Step 2 — Identify high-variance genes (top 5%)
quantile(var_TF, probs = c(.5, .9, .95))   # optional QC check

##        50%        90%        95% 
## 0.01687116 0.11194648 0.21292118

cutoff_TF <- quantile(var_TF, probs = 0.95)

# Step 3 — Subset to top variable genes
topVarGenes_TF <- var_TF[var_TF > cutoff_TF]

# Step 4 — Create voom object of variable genes
v_TF_var <- v_TF[names(topVarGenes_TF), ]

# Step 5 — Annotation for heatmap columns
anno_TF <- data.frame(
  row.names = colnames(v_TF_var),
  treatment = TamFulv_norm$samples$group
)

# Step 6 — Generate heatmap
pheatmap(
  v_TF_var$E,
  scale = "row",
  clustering_method = "ward.D2",
  annotation_col = anno_TF,
  show_rownames = FALSE,
  show_colnames = FALSE,
  main = "Top Variable Genes – Tamoxifen, Fulvestrant, Vehicle"
)

The heatmap shows strong clustering by treatment group: Vehicle, OH-Tamoxifen, and Fulvestrant each form distinct blocks. Fulvestrant-treated samples show the largest divergence from Vehicle, consistent with its role as a complete estrogen receptor degrader. Blocks of highly co-regulated genes reveal pathways that may be uniquely activated or suppressed by each drug. The row-scaling emphasizes relative changes and helps highlight coordinated transcriptional patterns. This visualization reinforces earlier findings that treatment strongly shapes expression profiles.

EXERCISE 6

Returning to your sleep study data create the up and down gene lists for your comparison of Tamoxifen versus vehicle and perform gene over representation using GO terms. Create an interactive table and an enrichment graph for the up and down set of genes.

For the Tamoxifen vs Vehicle comparison, I will identify significantly up-regulated and down-regulated genes based on log fold change direction and FDR thresholds. These gene sets will serve as input for GO Biological Process over-representation analysis using enrichGO(), which tests whether any pathways contain more regulated genes than expected by chance. I will present the results in interactive tables and summarize major findings using dotplots. To explore relationships among enriched terms, I will compute pairwise similarity metrics and visualize them as enrichment maps.

# EXERCISE 6: GO Over Representation Analysis 

# 1. Get differential expression results for OH_Tamoxifen vs Vehicle
tT_TamVeh <- topTable(efit_TF, coef = 1, number = Inf)

# 2. Define gene universe (ENTREZ IDs, no NA)
geneUniverse <- tT_TamVeh$ENTREZID

# 3. Define UP and DOWN gene lists (simple filtering)
upGenes <- tT_TamVeh %>%
         filter(logFC > 1,
         adj.P.Val < 0.05) %>%
  pull(ENTREZID)

downGenes <- tT_TamVeh %>%
  filter(logFC < -1,
         adj.P.Val < 0.05) %>%
  pull(ENTREZID)

# 4. Run GO enrichment (BP terms)
egoUP <- enrichGO(
  gene          = upGenes,
  universe      = geneUniverse,
  OrgDb         = org.Hs.eg.db,
  keyType       = "ENTREZID",
  ont           = "BP",
  pAdjustMethod = "fdr",
  qvalueCutoff  = 0.10,
  readable      = TRUE
)

egoDOWN <- enrichGO(
  gene          = downGenes,
  universe      = geneUniverse,
  OrgDb         = org.Hs.eg.db,
  keyType       = "ENTREZID",
  ont           = "BP",
  pAdjustMethod = "fdr",
  qvalueCutoff  = 0.10,
  readable      = TRUE
)

# 5. View results as interactive tables
datatable(as.data.frame(egoUP),  options = list(scrollX = TRUE))

datatable(as.data.frame(egoDOWN), options = list(scrollX = TRUE))

# 6. Dotplots summarizing enriched pathways
dotplot(egoUP,   showCategory = 15, font.size = 7, label_format = 45) +
  ggtitle("GO BP Enrichment – Up-regulated (Tam vs Veh)")

dotplot(egoDOWN, showCategory = 15, font.size = 7, label_format = 45) +
  ggtitle("GO BP Enrichment – Down-regulated (Tam vs Veh)")

# 7. Enrichment maps
egoUP_sim   <- pairwise_termsim(egoUP)
egoDOWN_sim <- pairwise_termsim(egoDOWN)

emapplot(egoUP_sim,   size_category = 0.5, layout = "kk", group = TRUE)

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggtangle package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

emapplot(egoDOWN_sim, size_category = 0.5, layout = "kk", group = TRUE)

The UP gene set shows enrichment for pathways related to hormonal signaling, cellular communication, and regulatory responses—reflecting Tamoxifen’s known partial agonist effects. The DOWN gene set is dominated by pathways involving cell cycle progression, DNA replication, and mitosis, consistent with reduced proliferation. The dotplots highlight the strongest enriched terms, and the enrichment maps reveal coherent clusters of related pathways, allowing biological themes to emerge more clearly. These results indicate that Tamoxifen induces regulatory changes while suppressing growth-associated programs.

EXERCISE 7

Create a KEGG enrichment analysis for the up and down regulated tamoxifen vs vehicle genes and select a pathway to build a pathway figure.

Next, I will perform KEGG pathway enrichment separately for the up- and down-regulated gene sets using enrichKEGG(). To visualize pathway-level effects, I will create a named vector of log fold changes indexed by Entrez IDs. After identifying significantly enriched pathways, I will select one pathway and use pathview() to overlay gene expression changes onto the KEGG diagram. This provides a mechanistic visualization of how Tamoxifen affects pathway components.

# EXERCISE 7: KEGG enrichment & pathway visualization 

# 1. Create named logFC vector for Pathview
# (ENTREZID = names, logFC = values)

gene_fc <- tT_TamVeh %>% 
  dplyr::select(ENTREZID, logFC) %>% 
  tibble::deframe()   # makes a named vector


# 2. KEGG Over-representation (UP and DOWN gene sets)

kegg_up <- enrichKEGG(
  gene         = upGenes,
  universe     = geneUniverse,
  organism     = "hsa",
  keyType      = "kegg",
  pvalueCutoff = 0.05
)

## Reading KEGG annotation online: "https://rest.kegg.jp/link/hsa/pathway"...

## Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/hsa"...

kegg_down <- enrichKEGG(
  gene         = downGenes,
  universe     = geneUniverse,
  organism     = "hsa",
  keyType      = "kegg",
  pvalueCutoff = 0.05
)


# 3. KEGG enrichment tables (interactive)

DT::datatable(as.data.frame(kegg_up), 
              options = list(scrollX = TRUE))

DT::datatable(as.data.frame(kegg_down), 
              options = list(scrollX = TRUE))

# 4. Dotplots

dotplot(kegg_up, showCategory = 15) +
  ggtitle("KEGG Enrichment – Upregulated (Tam vs Vehicle)")

#dotplot(kegg_down, showCategory = 15) +
  #ggtitle("KEGG Enrichment – Downregulated (Tam vs Vehicle)")


# 5. Pathview: build a pathway figure
# Choose a pathway from the UP list (or DOWN if desired)

selected_pathway <- kegg_up@result$ID[1]    # e.g., "hsa04110" - Cell Cylce etc.

pathview(
  gene.data  = gene_fc,
  pathway.id = selected_pathway,
  species    = "hsa",
  limit      = list(gene = max(abs(gene_fc)), cpd = 1)
)

## 'select()' returned 1:1 mapping between keys and columns

## Info: Working in directory /Users/catalinsava/Desktop/MHSc Medical Physiology Fall Term 2025/PSL4040/Week 9-10

## Info: Writing image file hsa04110.pathview.png

The KEGG results show strong enrichment for pathways involving the cell cycle, DNA repair, and replication among the down-regulated genes—consistent with prior GO findings. Pathview successfully maps Tamoxifen-associated fold changes onto the selected pathway diagram, with red and green nodes indicating up- and down-regulation. This pathway-level visualization confirms that Tamoxifen strongly suppresses core cell cycle machinery while activating compensatory signaling in other pathways.

EXERCISE 8

Create a GSEA using GO for the Tamoxifen vs vehicle comparison and create plots for one up and down regulated gene set.

To avoid relying on arbitrary significance thresholds, I will perform GSEA using the full ranked list of genes from the Tamoxifen vs Vehicle comparison. Using gseGO(), I will test whether certain GO Biological Processes are enriched near the top (up-regulated) or bottom (down-regulated) of the ranked list. I will then identify the gene sets with the highest positive and negative normalized enrichment scores (NES) and generate enrichment plots for each using gseaplot().

# EXERCISE 8: GSEA (gseGO) and plot two gene sets 

# 1. Create a ranked named vector of all genes (ENTREZID as names)

# Define Tamoxifen vs Vehicle DE table
tT_TamVeh <- topTable(efit_TF, coef = 1, n = Inf)

genes_ranked <- tT_TamVeh %>%
  dplyr::select(ENTREZID, logFC) %>%
  tibble::deframe()   # named numeric vector
genes_ranked <- sort(genes_ranked, decreasing = TRUE)   # sort by logFC

# 2. Perform GO GSEA on the ranked vector
gsea_Tam <- gseGO(
  geneList      = genes_ranked,
  OrgDb         = org.Hs.eg.db,
  ont           = "BP",
  minGSSize     = 25,
  maxGSSize     = 500,
  pvalueCutoff  = 0.05,
  verbose       = FALSE
)

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.04% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.

## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : There were 5 pathways for which P-values were not calculated
## properly due to unbalanced (positive and negative) gene-level statistic values.
## For such pathways pval, padj, NES, log2err are set to NA. You can try to
## increase the value of the argument nPermSimple (for example set it nPermSimple
## = 10000)

## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : For some of the pathways the P-values were likely overestimated. For
## such pathways log2err is set to NA.

## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : For some pathways, in reality P-values are less than 1e-10. You can
## set the `eps` argument to zero for better estimation.

# 3. Display results table
DT::datatable(
  as.data.frame(gsea_Tam),
  options = list(scrollX = TRUE)
)

# 4. Identify one highly enriched UP and one highly enriched DOWN pathway
gsea_df <- as.data.frame(gsea_Tam)

up_ID   <- gsea_df$ID[which.max(gsea_df$NES)]    # strongest positive enrichment
down_ID <- gsea_df$ID[which.min(gsea_df$NES)]    # strongest negative enrichment

# 5. Generate running enrichment score plots
p_up <- gseaplot(
  gsea_Tam,
  geneSetID = up_ID,
  by        = "runningScore",
  title     = gsea_df$Description[gsea_df$ID == up_ID]
)

p_down <- gseaplot(
  gsea_Tam,
  geneSetID = down_ID,
  by        = "runningScore",
  title     = gsea_df$Description[gsea_df$ID == down_ID]
)

# 6. Display the two plots side-by-side
cowplot::plot_grid(p_up, p_down, ncol = 2)

The GSEA results reveal strong positive enrichment for pathways related to stress responses and regulatory signaling, indicating that Tamoxifen up-regulates these processes. Negative NES pathways correspond to chromosomal organization, mitotic progression, and DNA replication, confirming suppression of proliferative programs. The enrichment plots demonstrate where member genes fall within the ranked distribution, with up-regulated pathways peaking early and down-regulated pathways showing steep negative trajectories. These findings validate earlier ORA results while providing a more continuous perspective on global transcriptional shifts.

EXERCISE 9

Make emapplot enrichment maps showing the super categories for the tamoxifen vs vehicle comparison Up and Down enriched ontologies.

To visualize broader pathway categories influenced by Tamoxifen, I will separate GSEA-enriched pathways into up-regulated (NES > 1) and down-regulated (NES < –1) sets. Using pairwise_termsim(), I will compute similarity scores between pathways based on shared genes, and then display the clusters using emapplot(). Grouping related pathways together will highlight major biological themes for both increased and decreased processes.

# EXERCISE 9: emapplot of GSEA results (Up / Down separate)


# 1. Convert GSEA results to data frame so we can inspect NES values
gsea_df <- as.data.frame(gsea_Tam)

# 2. Identify enriched sets that are strongly positive (UP) or negative (DOWN)
up_ids   <- gsea_df$ID[gsea_df$NES > 1]     # increased processes
down_ids <- gsea_df$ID[gsea_df$NES < -1]    # decreased processes

# 3. Subset GSEA object using clusterProfiler::filter() 
#    (keeps S4 structure intact — same approach as reference code)
gsea_up   <- clusterProfiler::filter(gsea_Tam,   ID %in% up_ids)
gsea_down <- clusterProfiler::filter(gsea_Tam, ID %in% down_ids)

# 4. Compute pairwise term similarity (reference does this step AFTER filtering)
gsea_up   <- pairwise_termsim(gsea_up)
gsea_down <- pairwise_termsim(gsea_down)

# 5. Draw enrichment maps
#    (mirrors reference: layout="kk", many categories, grouped clusters)

emapplot(
  gsea_up,
  showCategory = 100,
  layout = "kk",
  group  = TRUE,
  node_label = "group"
) +
  ggtitle("GSEA – Upregulated Super-Categories")

#emapplot(
  #gsea_down,
  #showCategory = 100,
  #layout = "kk",
  #group  = TRUE
#) +
  #ggtitle("GSEA – Downregulated Super-Categories")

The enrichment maps reveal distinct clusters of related biological processes. Up-regulated clusters include hormone response, cellular signaling, and immune-related pathways. Down-regulated clusters form dense networks centered around the cell cycle, DNA replication, mitotic checkpoints, and chromosomal separation. These thematic clusters provide an intuitive visual summary of Tamoxifen’s dual role: activation of compensatory or regulatory processes and suppression of growth-promoting pathways.

EXERCISE 10

Create a list object of the tamoxifen and Fulvestrantvstrant espression data and create the compare cluster object. Next separate the increased and decreased sets and represent as 2 dot plots and 2 grouped enrichment maps.

To compare Tamoxifen and Fulvestrant directly, I will create ranked gene lists for each treatment vs Vehicle and combine them into a list suitable for compareCluster(). Using GSEA across both gene lists simultaneously allows me to identify pathways that are commonly or uniquely enriched between the two drugs. After running the multi-group GSEA, I will classify pathways into up-regulated and down-regulated categories and generate separate dotplots. Finally, I will create group-specific enrichment maps to visualize how each drug influences major functional themes.

# EXERCISE 10 — Multi-Group GSEA via compareCluster
# Tamoxifen vs Vehicle AND Fulvestrant vs Vehicle

# Step 1 — Extract ranked gene tables for each contrast

tT_Tam  <- topTable(efit_TF, coef = 1, n = Inf, sort.by = "logFC")
tT_Fulv <- topTable(efit_TF, coef = 2, n = Inf, sort.by = "logFC")

# Step 2 — Convert each table into a ranked named vector (ENTREZID → logFC)

vec_Tam <- tT_Tam %>%
  dplyr::select(ENTREZID, logFC) %>%
  tibble::deframe() %>%
  sort(decreasing = TRUE)

vec_Fulv <- tT_Fulv %>%
  dplyr::select(ENTREZID, logFC) %>%
  tibble::deframe() %>%
  sort(decreasing = TRUE)


# Step 3 — Assemble into a list object (required by compareCluster)

contrast_list <- list(
  Tamoxifen   = vec_Tam,
  Fulvestrant = vec_Fulv
)

# Step 4 — Multi-group GSEA with compareCluster (GO Biological Processes)

cc <- compareCluster(
  geneCluster  = contrast_list,
  fun          = gseGO,
  OrgDb        = org.Hs.eg.db,
  ont          = "BP",
  keyType      = "ENTREZID",
  minGSSize    = 25,
  maxGSSize    = 500,
  pvalueCutoff = 0.05
)

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.04% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.

## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : There were 4 pathways for which P-values were not calculated
## properly due to unbalanced (positive and negative) gene-level statistic values.
## For such pathways pval, padj, NES, log2err are set to NA. You can try to
## increase the value of the argument nPermSimple (for example set it nPermSimple
## = 10000)

## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : For some of the pathways the P-values were likely overestimated. For
## such pathways log2err is set to NA.

## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : For some pathways, in reality P-values are less than 1e-10. You can
## set the `eps` argument to zero for better estimation.

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.04% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.

## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : There were 68 pathways for which P-values were not calculated
## properly due to unbalanced (positive and negative) gene-level statistic values.
## For such pathways pval, padj, NES, log2err are set to NA. You can try to
## increase the value of the argument nPermSimple (for example set it nPermSimple
## = 10000)

## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : For some of the pathways the P-values were likely overestimated. For
## such pathways log2err is set to NA.

# Make results human-readable (SYMBOLs)

cc <- setReadable(cc, OrgDb = org.Hs.eg.db, keyType = "ENTREZID")

# Step 5 — Convert to data.frame for NES-based separation

cc_df <- as.data.frame(cc)

# Step 6 — Identify strongly increased and decreased enriched sets

up_ids   <- cc_df$ID[cc_df$NES > 1]
down_ids <- cc_df$ID[cc_df$NES < -1]

# Step 7 — Subset the compareCluster object for UP and DOWN sets

cc_up   <- clusterProfiler::filter(cc,   ID %in% up_ids)
cc_down <- clusterProfiler::filter(cc, ID %in% down_ids)

# Step 8 — Dotplots (Up vs Down)

dotplot(cc_up, showCategory = 8, font.size = 8) +
  ggtitle("compareCluster GSEA — Upregulated Ontologies")

dotplot(cc_down, showCategory = 8, font.size = 8) +
  ggtitle("compareCluster GSEA — Downregulated Ontologies")

# Step 9 — Enrichment map plots (Up vs Down)

# UP

cc_up <- pairwise_termsim(cc_up)

emapplot(
  cc_up,
  showCategory = 40,
  layout       = "kk",
  group        = TRUE,
  node_label   = "group"
) +
  ggtitle("Enrichment Map — Upregulated Ontology Clusters")

## Coordinate system already present.
## ℹ Adding new coordinate system, which will replace the existing one.

# DOWN

cc_down <- pairwise_termsim(cc_down)

emapplot(
  cc_down,
  showCategory = 40,
  layout       = "kk",
  group        = TRUE,
  node_label   = "group"
) +
  ggtitle("Enrichment Map — Downregulated Ontology Clusters")

## Coordinate system already present.
## ℹ Adding new coordinate system, which will replace the existing one.

The compareCluster results show strong similarity in down-regulated processes, with both Tamoxifen and Fulvestrant suppressing cell cycle and DNA replication pathways. However, up-regulated pathways differ more substantially, reflecting the distinct mechanisms of action between a selective estrogen receptor modulator (Tamoxifen) and a receptor degrader (Fulvestrant). The dotplots and enrichment maps highlight these differences clearly: shared clusters represent common endocrine therapy responses, while unique clusters reveal treatment-specific transcriptional programs. This analysis provides a comprehensive comparison of the biological effects of both drugs.

Speculate as to the difference in treating the cancer with tamoxifen or Fulvestrant.

Tamoxifen and Fulvestrant both target estrogen signaling, but your transcriptomic results show they affect cancer cells in different ways. Tamoxifen is a selective estrogen receptor modulator (SERM), meaning it blocks estrogen activity without removing the receptor, leading to only partial suppression of estrogen-responsive genes. As a result, cells continue to activate stress-response and compensatory signaling pathways, producing a more mixed transcriptional profile. Fulvestrant, in contrast, is a selective estrogen receptor degrader (SERD) that eliminates the receptor from the cell, creating a more complete shutdown of estrogen-driven transcription. This is reflected in your data by stronger and more uniform down-regulation of cell cycle, DNA replication, and proliferation pathways. Overall, the analysis suggests that Tamoxifen results in partial inhibition with compensatory pathway activation, while Fulvestrant induces a deeper, more comprehensive anti-estrogen response that may translate into stronger anti-tumor effects.