This Rmd file focuses only on Gene Ontology using Gene Set Enrichment Analysis (GSEA) and Over-representation analysis (ORA) of MDA-MB-231 and HCC1806. My next step of analysis will be examining Pathway Analysis methods (KEGG, Reactome, MSigDB Hallmark, etc). In this analysis, GSEA was used as the primary inference tool to identify coordinated pathway-level differences between MDA-MB-231 and HCC1806, while ORA performed using clusterProfiler, DAVID, and Enrichr served as complementary validation approaches.
First I wanted to calculate the top 100 up-regulated genes in both MDA-MB-231 and HCC1806 cell lines. This value is calculated by the fold change between the two cell lines (MDA-MB-231 – HCC1806).
Gene expression analyses were performed using publicly available DepMap expression data (DepMap Public 25Q3), comprising log-transformed transcript abundance values [log(TPM + 1)] derived from unstranded RNA-sequencing of human protein-coding genes. Expression profiles corresponding to ACH-000624 (HCC1806) and ACH-000768 (MDA-MB-231) breast cancer cell lines were extracted for downstream analysis. Because only a single expression profile was available for each cell line, statistical differential expression testing was not performed. To enable pathway-level comparison between the two cell lines, genes were ranked using a fold-change metric defined as the difference in log-transformed expression values between MDA-MB-231 and HCC1806 (MDA-MB-231 – HCC1806). This ranking metric is consistent with recommended pre-ranking strategies for Gene Set Enrichment Analysis (GSEA) in settings where only log-scale expression values (and no replicate-based statistics) are available. Because these results are based on single-sample profiles per cell line, enrichment findings should be interpreted as hypothesis-generating rather than formal differential expression statistics.
## # A tibble: 100 × 2
## Gene logFC
## <chr> <dbl>
## 1 VIM 9.26
## 2 AXL 7.94
## 3 ANKRD1 6.69
## 4 LOXL2 6.64
## 5 RIC8A 6.52
## 6 CEMIP 6.43
## 7 S100A4 6.43
## 8 CCN1 6.33
## 9 EMP3 6.30
## 10 CST1 6.29
## # ℹ 90 more rows
## # A tibble: 100 × 2
## Gene logFC
## <chr> <dbl>
## 1 KRT5 -9.86
## 2 KRT13 -9.57
## 3 FXYD3 -8.80
## 4 KRT14 -8.64
## 5 KRT4 -7.82
## 6 OLR1 -7.53
## 7 UPK1B -6.81
## 8 AKR1C2 -6.74
## 9 CSTA -6.54
## 10 SERPINB5 -6.30
## # ℹ 90 more rows
I validated this ranking using DepMap visualization software. Here I mapped a couple of the top upregulated genes in both HCC1806 and MDA-MB-231 and each gene corresponded to fold change values generated by the DepMap scatterplot.
Graph generated using DepMap Visualization software.
I used clusterProfiler (documentation in references) to build GSEA for MDA-MB-231 vs HCC1806. First I built a ranked gene list sorted in descending order based on the logFC calculated in Section 1. GSEA was performed using gseGO() as a rank-based functional class scoring method, operating on the entire transcriptome ranked by log-transformed expression differences between MDA-MB-231 and HCC1806. Enrichment scores and normalized enrichment scores (NES) were computed to assess whether GO BP gene sets were preferentially enriched toward the top or bottom of the ranked list, with positive NES values for pathways enriched in MDA-MB-231 and negative NES values for pathways enriched in HCC1806.
Over-Representation Analysis differs from Functional Class Scoring tools like GSEA. This sections covers the use of ORA using three different methods: 1. enrichGO function from clusterProfiler 2. DAVID GO analysis from user inputted gene list (top 100 genes section 1) 3. Enrichr GO analysis from user inputted gene list (top 100 genes section 1)
clusterProfiler provides enrichGO, performing hypergeometric testing with Benjamini–Hochberg correction.
DAVID represents a classical, annotation-centric implementation of ORA and relies primarily on hypergeometric or Fisher’s exact tests to evaluate whether GO terms or pathways are overrepresented among an input gene list relative to a background that is typically defined as the genome or a platform-specific gene universe. DAVID emphasizes annotation clustering, grouping related GO terms to reduce redundancy and highlight broader biological themes. Multiple-testing correction is performed using the Benjamini–Hochberg procedure. While DAVID is relatively conservative and useful for identifying high-level biological patterns, it relies on older annotation snapshots and is sensitive to gene list size.
Enrichr performs over-representation analysis using Fisher’s exact test. Although multiple ranking metrics are available in Enrichr, in this analysis only Benjamini–Hochberg–adjusted p-values from Gene Ontology Biological Process were used to facilitate direct comparison with ORA results generated using clusterProfiler and DAVID.
When you run the gene set enrichment analysis from the Run GSEA Page, GSEA ranks the genes in the expression dataset and then analyzes that ranked list of genes. You use the Metric for ranking genes parameter to select the metric used to score and rank the genes; the Gene list sorting mode parameter to determine whether to sort the genes using the real (default) or absolute value of the metric score; and the Gene list ordering mode parameter to determine whether to sort the genes in descending (default) or ascending order. This section describes each of the ranking metrics in the drop-down list of the Metric for ranking genes parameter. If your favorite metric is not listed here, you can rank the genes in your dataset using that metric and then use the GSEAPreranked Page to analyze your ranked list of genes. If your dataset contains only one sample, GSEA cannot rank the genes; however, you can rank the genes and then use the GSEAPreranked Page to analyze your ranked list of genes. Three settings in the Algorithms tab of the Preferences window (selected from the File menu in the application menu bar) affect the calculations shown here: ● Use median instead of mean for metrics. For categorical phenotypes, by default, GSEA calculates differential expression based on the mean expression value for each phenotype. To use the median expression value for each phenotype, set the Median for class metrics parameter on the Run GSEA page to True. To always use the median expression value for each phenotype, click on the checkbox to enable the Use median instead of mean for class metrics option in the Preferences window. The new default value takes effect when you restart GSEA. By default, this option is not selected. Note: Using median rather than mean may cause ties in the ranking. ● Fix metrics for low variance. When calculating ranking metrics, the denominator may be zero (0). A denominator of zero (0) causes an error in the analysis unless this option is selected. By default, this option is selected. ● Use biased variances. When calculating ranking metrics, GSEA uses an unbiased variance to calculate standard deviation. Select this option to have GSEA use a biased variance instead. By default, this option is not selected. For categorical phenotypes, GSEA determines a gene’s mean expression value for each phenotype and then uses one of the following metrics to calculate the gene’s differential expression with respect to the two phenotypes. To use median rather than mean expression values, set the Median for class metrics parameter to True, as described above. Diff_of_Classes uses the difference of class means to calculate fold change for log scale data:
where μ is the mean. The larger the fold change, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”
“Log-transformed TPM (Transcripts Per Million) values at gene level derived from unstranded RNA-seq for protein-coding genes in human”
Used enrichGO() for ORA GO and gseGO() for GSEA GO
Inputted 100 top genes generated from fold change to calculate ORA using DAVID algorithim.
Inputted 100 top genes generated from fold change to calculate ORA using Enrichr platform.