Type 1 diabetes (T1D) is an autoimmune disease characterized by the targeted destruction of insulin-producing pancreatic β-cells. Previous genetic studies have identified TYK2 as a T1D candidate gene. TYK2 encodes a Janus kinase involved in type I interferon (IFN-I) signaling, which mediates antigen presentation and thus enhances immune recognition. Interestingly, loss-of-function mutations in TYK2 have been linked to increased T1D risk in some populations, whereas a partial loss-of-function variant appears protective against autoimmune disease in others. The goal of this study was to elucidate TYK2’s roles in beta-cell development and IFN-alpha-mediated immune response.
The authors employed an array of bioinformatics techniques, predominantly for transcriptomic analysis. They first used ultra-deep bulk RNA-seq to profile global gene expression across stages of stem cell-derived islet (SC-islet) differentiation and to compare wild-type and TYK2 knockout (KO) lines. Differential expression and reactome pathway enrichment analyses identified transcriptomic differences underlying the KO cells’ halted differentiation. Single-cell RNA-seq and gene-set enrichment further clarified the mechanism of beta cell depletion in developing KO SC-islets, linking it to KRAS upregulation and consequent premature cell cycle progression. Additional transcriptomic profiling of wild-type and KO SC-islets after IFN-alpha treatment elucidated the impact of TYK2 on IFN-I signaling. Together, these bioinformatic techniques revealed a dual role of TYK2 in promoting beta-cell differentiation and MHC-I-mediated immune recognition, supporting its potential as a therapeutic target for T1D [1].
We were provided with paired-end RNA sequencing reads from a study conducted by Chandra et al [1]. Specifically, we received 6 RNAseq samples from endocrine precursor cells, 3 WT and 3 TYK2 KO. We were also provided with the GRCh38 genome assembly and gencode.v45 annotation. Quality control of the reads was performed using FastQC v0.12.1 in default mode [2]. We indexed the genome using STAR v2.7.11b in genomeGenerate mode with the provided GTF annotation. Reads were aligned to the reference in a splice-aware fashion using STAR v2.7.11b with outputs generated in BAM format [3]. We then used MultiQC v1.25 in default mode to aggregate all FastQC output zips and STAR alignment logs into a QC report [4]. Gene counts for each sample were quantified using VERSE v0.1.5 run in default mode given the gtf annotation and STAR alignments [5]. Gene counts tables from separate samples were concatenated into a total counts matrix.
We used DEseq2 v1.44.0 to run a basic differential expression analysis on the gene counts [6]. This entailed creating a ‘coldata’ dataframe mapping samples to their corresponding conditions - “E” for experimental or “C” for control. The gene counts matrix was filtered using a method described below. We then created our DESeq2 object, specifying ‘condition’ as the design parameter. DESeq2 was run with the contrast argument set to c(“condition”,“E”,“C”). We parsed the gtf annotation file to create a mapping of gene ids to human genome symbols, and we added these symbols to the table of significant genes outputted by DESeq2. We filtered significant genes by an adjusted p-value threshold of 0.01, and then performed enrichment analysis on this gene set via ENRICHR [7]. Finally, we performed an additional gene set enrichment analysis using GSEA v1.32.4 in default mode against the C2 canonical pathways MSIGDB dataset [9].
RNAseq was performed at a very high depth, with read counts ranging from 84-119 million per sample. Each sample shows over 95% alignment to the reference genome, with over 91% of reads in each sample uniquely aligned to one location in the genome. Furthermore, all samples have sufficiently high quality scores throughout the length of reads, and the average GC content displays a roughly normal distribution centered at 50%. Additionally, all 12 samples had less than 1% of reads made up of overrepresented sequences. The only flagged metrics are Per Base Sequence Content and Sequence Duplication Levels. All samples exhibit unequal base proportions at the start of reads, likely reflecting adapter or random hexamer priming bias, which is pretty standard in RNA-seq experiments and difficult to avoid. Furthermore, approximately 75% of reads appear duplicated, which could stem from the ‘Ultra-deep’ nature of their RNAseq methodology and the fact that only the transcriptome was sequenced. Since other metrics such as GC content, sequence quality, and alignment rates are all promising, it is likely fine to move forward with the data as is.
library(DESeq2)
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, saveRDS, setdiff,
## table, tapply, union, unique, unsplit, which.max, which.min
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
##
## rowMedians
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::%within%() masks IRanges::%within%()
## ✖ dplyr::collapse() masks IRanges::collapse()
## ✖ dplyr::combine() masks Biobase::combine(), BiocGenerics::combine()
## ✖ dplyr::count() masks matrixStats::count()
## ✖ dplyr::desc() masks IRanges::desc()
## ✖ tidyr::expand() masks S4Vectors::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::first() masks S4Vectors::first()
## ✖ dplyr::lag() masks stats::lag()
## ✖ ggplot2::Position() masks BiocGenerics::Position(), base::Position()
## ✖ purrr::reduce() masks GenomicRanges::reduce(), IRanges::reduce()
## ✖ dplyr::rename() masks S4Vectors::rename()
## ✖ lubridate::second() masks S4Vectors::second()
## ✖ lubridate::second<-() masks S4Vectors::second<-()
## ✖ dplyr::slice() masks IRanges::slice()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
cts <- as.matrix(read.csv("/projectnb/bf528/students/hbeakley/project-2-hrbeakley/results/counts_matrix.tsv", sep = "\t", row.names='gene'))
coldata <- read.csv('/projectnb/bf528/students/hbeakley/project-2-hrbeakley/coldata.csv')
I pre-filtered the counts matrix following the DESeq2 vignette recommendation, keeping only genes that have at least 10 read counts in a minimum number of samples. I set the minimal number of samples to the smallest experimental group size (3 in this case), as per the vignette’s suggestion [6]. This cut down the number of genes in the counts matrix from 63,241 to 20,579.
smallestGroupSize <- 3
#filter counts matrix to only include rows that have counts >= 10 in at least 3 samples
keep <- cts[rowSums(cts >= 10) >= smallestGroupSize,]
#display the number of genes present before and after filtering
before <- nrow(cts)
after <- nrow(keep)
cat("Genes before filtering:", before, "\n")
## Genes before filtering: 63241
cat("Genes after filtering:", after, "\n")
## Genes after filtering: 20579
cat("Removed", before - after, "genes (",
round((before - after) / before * 100, 1), "% )\n")
## Removed 42662 genes ( 67.5 % )
Here are histograms showing the distribution of total read counts per gene before and after filtering.
par(mfrow = c(1, 2))
hist(log10(rowSums(cts)+1), breaks=100, main = "Before Filtering", xlab = "Read Counts Per Gene", ylab = "Frequency", col = "lightblue")
hist(log10(rowSums(keep)+1), breaks=100, main = "After Filtering", xlab = "Read Counts Per Gene", ylab = "Frequency", col = "lightgreen")
par(mfrow = c(1, 1))
# Constructing the DESeq object
dds <- DESeqDataSetFromMatrix(countData = keep,
colData = coldata,
design = ~ condition)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
# Running DEseq
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res <- results(dds, contrast=c("condition","E","C"))
# DESeq2 results for the top ten significant genes ranked by padj
resOrdered <- res[order(res$padj),]
deseq2_res <- as_tibble(resOrdered, rownames='gene')
ids <- read_tsv("/projectnb/bf528/students/hbeakley/project-2-hrbeakley/results/gene_id_mappings.tsv", col_names = c('gene', 'symbol'))
## Rows: 63241 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): gene, symbol
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
deseq2_res <- deseq2_res %>% left_join(ids) %>% relocate(symbol)
## Joining with `by = join_by(gene)`
head(deseq2_res, n = 10)
## # A tibble: 10 × 8
## symbol gene baseMean log2FoldChange lfcSE stat pvalue padj
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 RPS4Y1 ENSG… 10677. -8.77 0.142 -61.7 0 0
## 2 ENSG000002895… ENSG… 732. 3.70 0.125 29.7 1.66e-193 1.58e-189
## 3 PNPO ENSG… 605. -4.16 0.141 -29.5 2.13e-191 1.35e-187
## 4 PCDHGA10 ENSG… 497. 4.48 0.200 22.4 6.36e-111 3.02e-107
## 5 YPEL3-DT ENSG… 525. -2.42 0.115 -21.0 1.03e- 97 3.92e- 94
## 6 LINC02506 ENSG… 1301. -1.71 0.0986 -17.4 1.17e- 67 3.70e- 64
## 7 SLC2A14 ENSG… 634. 4.69 0.279 16.8 2.51e- 63 6.81e- 60
## 8 ENSG000002863… ENSG… 907. -1.71 0.115 -14.9 2.41e- 50 5.71e- 47
## 9 ENSG000002894… ENSG… 384. -2.13 0.145 -14.7 3.28e- 49 6.93e- 46
## 10 ENSG000002829… ENSG… 121. 3.76 0.264 14.2 6.86e- 46 1.30e- 42
I chose 0.01 as a padj threshold, leaving me with 735 significant genes (332 upregulated and 403 downregulated).
sig <- deseq2_res %>% filter(padj < 0.01)
head(sig)
## # A tibble: 6 × 8
## symbol gene baseMean log2FoldChange lfcSE stat pvalue padj
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 RPS4Y1 ENSG… 10677. -8.77 0.142 -61.7 0 0
## 2 ENSG00000289575 ENSG… 732. 3.70 0.125 29.7 1.66e-193 1.58e-189
## 3 PNPO ENSG… 605. -4.16 0.141 -29.5 2.13e-191 1.35e-187
## 4 PCDHGA10 ENSG… 497. 4.48 0.200 22.4 6.36e-111 3.02e-107
## 5 YPEL3-DT ENSG… 525. -2.42 0.115 -21.0 1.03e- 97 3.92e- 94
## 6 LINC02506 ENSG… 1301. -1.71 0.0986 -17.4 1.17e- 67 3.70e- 64
pos <- sig %>% filter(log2FoldChange > 0)
sig_genes <- sig %>% select(symbol)
cat("Significant genes (padj < 0.01):", nrow(sig_genes), "\n")
## Significant genes (padj < 0.01): 735
cat("# Upregulated:", nrow(pos), "\n")
## # Upregulated: 332
cat("# Downregulated:", nrow(sig_genes) - nrow(pos), "\n")
## # Downregulated: 403
write.table(sig_genes, file = "sig_genes.txt", col.names = F, row.names = F, quote= F)
#enrichment from Reactome Pathways 2024
reactome <- read.delim("/projectnb/bf528/students/hbeakley/project-2-hrbeakley/Reactome_Pathways_2024_table.txt", header = TRUE, sep = "\t")
head(reactome[, c("Term", "Adjusted.P.value", "Overlap", "Combined.Score")], 7)
## Term
## 1 Regulation of Beta-Cell Development
## 2 TP53 Regulates Transcription of Cell Death Genes
## 3 TP53 Regulates Transcription of Death Receptors and Ligands
## 4 Regulation of Gene Expression in Endocrine-Committed (NEUROG3+) Progenitor Cells
## 5 Neuronal System
## 6 Extracellular Matrix Organization
## 7 Regulation of Gene Expression in Beta Cells
## Adjusted.P.value Overlap Combined.Score
## 1 2.019213e-05 12/42 1.875117e+02
## 2 2.019213e-05 12/44 1.700068e+02
## 3 2.019213e-05 7/12 6.163288e+02
## 4 2.019213e-05 5/5 1.592382e+06
## 5 4.720253e-05 38/411 4.268770e+01
## 6 4.549061e-04 29/300 3.745881e+01
## 7 1.136761e-03 7/21 1.578832e+02
#enrichment from KEGG 2021 Human
kegg <- read.delim("/projectnb/bf528/students/hbeakley/project-2-hrbeakley/KEGG_2021_Human_table.txt", header = TRUE, sep = "\t")
head(kegg[, c("Term", "Adjusted.P.value", "Overlap", "Combined.Score")], 7)
## Term Adjusted.P.value Overlap Combined.Score
## 1 Maturity onset diabetes of the young 5.806366e-05 9/26 215.99625
## 2 Morphine addiction 1.493751e-04 15/91 72.33060
## 3 ECM-receptor interaction 3.012224e-04 14/88 62.91286
## 4 GABAergic synapse 3.012224e-04 14/89 61.38891
## 5 Dilated cardiomyopathy 5.833508e-04 14/96 52.01448
## 6 p53 signaling pathway 5.833508e-04 12/73 58.77369
## 7 Cholinergic synapse 5.833508e-04 15/113 44.82922
BiocManager::install("fgsea")
## 'getOption("repos")' replaces Bioconductor standard repositories, see
## 'help("repositories", package = "BiocManager")' for details.
## Replacement repositories:
## CRAN: https://cloud.r-project.org
## Bioconductor version 3.20 (BiocManager 1.30.26), R 4.4.3 (2025-02-28)
## Warning: package(s) not installed when version(s) same as or greater than current; use
## `force = TRUE` to re-install: 'fgsea'
## Installation paths not writeable, unable to update packages
## path: /share/pkg.8/r/4.4.3/install/lib64/R/library
## packages:
## acepack, actuar, ade4TkGUI, adegraphics, adehabitatLT, adephylo, admisc,
## agridat, AICcmodavg, akima, AlgDesign, alphashape3d, animation, anndata,
## anytime, apcluster, aplot, argparse, arrow, arulesCBA, arulesViz,
## AsioHeaders, attachment, basefun, batchtools, bayesplot, bayestestR,
## BDgraph, betareg, bigassertr, bigD, bigsnpr, bigstatsr, bigutilsr,
## BiocManager, bipartite, bit, blockmodeling, bnlearn, bookdown, boot,
## BradleyTerry2, brglm, brms, broom, broom.helpers, C50, Cairo, cards, CDM,
## CFtime, checkmate, chromote, CircStats, cli, clock, cluster, ClusterR, clv,
## collapse, collections, cols4all, commonmark, CompQuadForm, coneproj,
## copula, CovTools, cowplot, cpp11, credentials, crosstalk, crul, cubature,
## Cubist, curl, data.table, data.tree, datawizard, dbplyr, dbscan, dcurver,
## dendextend, DEoptimR, Deriv, DescTools, devEMF, devtools, dials, diffobj,
## diptest, DirichletReg, distributions3, DistributionUtils, distro, doBy,
## dockerfiler, docopt, DoE.base, doFuture, doRNG, DoseFinding, dotwhisker,
## downloader, DPQ, dreamerr, DT, dtplyr, duckdb, eaf, effects, elliptic,
## emdbook, emmeans, entropy, Epi, ergm, ergm.count, ergm.ego, ergm.multi,
## etm, evaluate, extrafont, extrafontdb, fastcluster, fda, fields,
## fitdistrplus, fixest, flextable, float, FME, foghorn, forcats, forecast,
## forecTheta, foreign, fresh, fs, future, future.apply, future.batchtools,
## gam, gamlss, gamlss.data, gamm4, gapminder, gargle, gclus, gdtools,
## geepack, GeneralizedHyperbolic, generics, gert, GGally, ggdist, ggdmc,
## ggeffects, ggforce, ggfortify, ggfun, ggimage, ggmcmc, ggnewscale, ggplot2,
## ggplotify, ggpp, ggraph, ggridges, ggsci, ggstats, gh, gld, GLMMadaptive,
## glmnet, globals, gmm, googledrive, googlePolylines, googlesheets4,
## GPArotation, graphon, graphsim, gsDesign, gss, gt, gtsummary, hardhat,
## harmony, haven, here, Hmisc, hms, httptest2, httpuv, httr2, hunspell,
## huxtable, hypergeo, ICS, igraph, iNEXT, insight, invgamma, irace, ISOcodes,
## ISwR, jpeg, jsonlite, keras, kinship2, kknn, knitr, ks, labdsv, labelled,
## lamW, latentnet, later, lattice, latticeExtra, lavaan, leafem, leaflet,
## leidenbase, lgr, libgeos, libstable4u, limSolve, literanger, lme4, LMest,
## Lmoments, locfit, logger, logistf, longmemo, lpSolveAPI, magick, magrittr,
## mapproj, maps, marginaleffects, markdown, matchingR, MatchIt, mathjaxr,
## MCMCprecision, mcmcse, memisc, merDeriv, meta, mets, mgcv, mgcViz, mice,
## miceadds, mime, miniUI, mirt, mixtools, mlr, mmrm, mockery, mockr,
## modeldata, modeltools, MPV, msigdbr, multcomp, MultiLCIRT, multilevel,
## multinet, MuMIn, nanoarrow, nanonext, ncdf4, ndjson, NetworkDistance, nlme,
## nloptr, odbc, officer, openair, openssl, optimx, osmdata, packrat, pak,
## parallelDist, parallelly, parameters, paran, parsnip, partitions, partykit,
## patchwork, paws.common, paws.storage, pbapply, pbdZMQ, pbkrtest, pcadapt,
## penalized, peperr, performance, permute, pheatmap, pillar, pixmap,
## pkgbuild, pkgdown, pkgload, PKI, plm, plot3D, plotly, poppr, pracma, pROC,
## prodlim, profmem, progressr, PROJ, promises, proxyC, ps, psych,
## psychotools, psychotree, psychTools, Publish, purrr, qgam, qpdf, qs,
## quanteda, quantmod, quarto, questionr, QuickJSR, R.cache, R.oo, r2rtf,
## R2WinBUGS, ragg, raster, ratelimitr, Rcpp, RcppArmadillo, RcppParallel,
## RcppTOML, RcppXPtrUtils, RcppZiggurat, Rcsdp, RCurl, Rdpack, readstata13,
## readxl, recipes, REddyProc, reformulas, rentrez, renv, reshape, reticulate,
## Rfast, Rfast2, rgl, rhub, RInside, rio, RItools, RJSONIO, rlang, rle,
## rmarkdown, Rmpfr, RMySQL, RNetCDF, rnndescent, robustbase, roptim,
## ROptSpace, roxygen2, rpart.plot, RPEGLMEN, rpf, RPostgreSQL, rprojroot,
## RPushbullet, rrcov, rredlist, rsample, rsconnect, RSQLite, rstan, rstanarm,
## rstantools, rstpm2, rsvg, Rttf2pt1, RUnit, runjags, runonce, rversions,
## rvest, rvg, s2, sampling, sass, scales, scatterpie, sctransform,
## sensitivity, seriation, SeuratObject, sf, sfsmisc, shadowtext, shapes,
## shiny, shinydashboard, shinydashboardPlus, SimComp, SimDesign, sirt, skimr,
## softImpute, SomaDataIO, spacesXYZ, sparsesvd, sparsevctrs, spatstat,
## spatstat.data, spatstat.explore, spatstat.geom, spatstat.linnet,
## spatstat.model, spatstat.random, spatstat.univar, spatstat.utils, spelling,
## spls, ssgraph, statmod, statnet.common, storr, stringfish, stringi,
## stringmagic, stringr, styler, SuppDists, survey, svglite, svUnit,
## systemfonts, table1, TAM, tcltk2, tclust, tensor, tensorflow, tergm, terra,
## textmineR, textshaping, tfruns, TH.data, threejs, tibble, tidytext,
## timeDate, timereg, tinyplot, tinytex, tkrplot, tmap, tmaptools, TMB,
## tmvtnorm, tree, tripack, tsna, TSP, tufte, tzdb, umx, units, unmarked,
## urltools, usethis, utf8, V8, variables, vcr, vegan, VGAMextra, visNetwork,
## vroom, waiter, waldo, webfakes, webmockr, webshot2, websocket, weights,
## workflows, writexl, WriteXLS, x13binary, xfun, xgboost, xlsxjars, XML,
## xml2, yulab.utils, zeallot, zip, zoo
library(fgsea)
deseq2_res_ordered <- deseq2_res %>%
drop_na() %>%
distinct(symbol, .keep_all = TRUE) %>%
arrange(desc(log2FoldChange))
rnk_list <- setNames(deseq2_res_ordered$log2FoldChange, deseq2_res_ordered$symbol)
pathways <- fgsea::gmtPathways("/projectnb/bf528/students/hbeakley/project-2-hrbeakley/c2.cp.v2025.1.Hs.symbols.gmt")
fgseaRes <- fgsea(pathways, rnk_list, minSize=15, maxSize=500)
library(dplyr)
library(ggplot2)
library(forcats)
top_positive_nes <- fgseaRes %>%
filter(padj < .1 & NES > 0) %>%
slice_max(NES, n=10)
top_negative_nes <- fgseaRes %>%
filter(padj < .1 & NES < 0) %>%
slice_min(NES, n=10)
bind_rows(top_positive_nes, top_negative_nes) %>%
mutate(pathway = forcats::fct_reorder(pathway, NES)) %>%
ggplot(aes(pathway, NES, fill = NES > 0)) +
geom_col() + coord_flip() + theme_minimal()
Most notably, several enriched Reactome 2024 pathways were related to the regulation of beta-cell development and gene expression.This is consistent with the paper’s findings, since our RNA-seq data came from WT and TYK2 KO S5 EP cell lines, and TYK2 loss was found to compromise EP formation. The second and third most significantly enriched pathways involved TP53 and its regulation of cell death genes. This finding is intriguing because the paper attributes the developmental disturbances in KO cell lines to increased KRAS, which causes cells to prematurely progress through the cell cycle. TP53 upregulation in KO cells may be a fail-safe mechanism to combat these proliferative signals [10]. Finally, the most significantly enriched KEGG 2021 Human pathway, “Maturity-Onset Diabetes of the Young,” serves as a sanity check, reinforcing TYK2’s relevance as a candidate gene for type 1 diabetes.
This plot depicts the top significantly enriched pathways from the GSEA analysis using the C2 canonical pathways dataset. It displays the normalized enrichment scores of the ten most upregulated and ten most downregulated pathways. Several pathways upregulated in TYK2 KO cells involve p53 signaling, extracellular matrix organization, and adhesion pathways, consistent with a stress response phenotype. In contrast, multiple downregulated pathways correspond to beta cell development and differentiation. Ultimately, these results support the hypothesis that TYK2 KO S5 cells exhibit compromised differentiation due to accelerated proliferation, which in turn causes a compensatory stress response.
Both the ENRICHR and GSEA analyses highlight beta-cell differentiation and p53 signaling as major pathways disrupted by TYK2 loss. Both techniques also identify enrichment of neuroendocrine programs. In particular, the GSEA results show downregulation of neuronal differentiation and signaling pathways, consistent with compromised beta cell maturation and neuroendocrine functionality in the TYK2 KO S5 cells.
Methodologically, GSEA offers directionality by computing normalized enrichment scores, and it directly derives enrichment from ranked differential expression results, which reduces the subjectivity of setting a significance threshold for defining the gene set. The ENRICHR analysis, by comparison, requires this subjective determination of a significant gene set and then reports undirected enrichment magnitudes. However, directional insights could be obtained by performing separate ENRICHR runs for genes with positive and negative fold changes. Despite these technical differences, the core biological processes detected by both methods are highly consistent, which helps validate the robustness of the results.
#create normalized counts matrix using rlog or vst
vsd <- vst(dds, blind=FALSE)
pcaData <- plotPCA(vsd, intgroup = "condition", returnData = TRUE)
## using ntop=500 top features by variance
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color = condition)) +
geom_point(size = 3) +
xlab(paste0("PC1: ", percentVar[1], "% variance")) +
ylab(paste0("PC2: ", percentVar[2], "% variance")) +
coord_fixed()
library("RColorBrewer")
library("pheatmap")
sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(vsd$condition, vsd$type, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
clustering_distance_rows=sampleDists,
clustering_distance_cols=sampleDists,
col=colors)
Principle component analysis (PCA) is commonly used as diagnostic in RNA-seq experiments to assess whether the condition of interest explains the majority of variation among samples. In our PCA, the samples generally cluster by condition, indicating that TYK2 KO is largely responsible for the differential expression observed. PCA plots are also useful for identifying outliers, and we have one experimental replicate that somewhat strays from its cluster. However, given the small number of replicates, this deviation is not particularly concerning.
The sample-to-sample distance heatmap represents the pairwise distances between samples, illustrating how similar or different they are from each other. Ideally, control samples should be more similar to each other, and experimental groups should be more similar to each other, with clear separation between groups. In our results, the control replicates show strong mutual similarity and are consistently different from all experimental replicates. Among the experimental group, two replicates are highly similar to each other, while one is more distant and slightly closer to the controls. Although this one deviant replicate is not ideal, the small number of replicates makes it difficult to determine outliers. Overall, the PCA and distance plots display generally robust clustering by condition, providing strong support that the differential expression results are truly due to TYK2 loss.
library(ggrepel)
sig_y <- 4 #significance cutoff on y-axis
volc <- as.data.frame(deseq2_res) %>% #start with DESeq2 results as a data.frame
dplyr::mutate(
gene = if ("symbol" %in% names(.)) symbol else rownames(.), #pick gene symbol column if present, else rownames
p = dplyr::if_else(is.na(pvalue) | pvalue == 0, .Machine$double.xmin, pvalue),
y = pmin(-log10(p), 16), # transform to -log10(p), then cap at 15 to match paper
dir = dplyr::case_when( ## direction group for coloring
y >= sig_y & log2FoldChange > 0 ~ "Up", # significant and positive LFC → Up
y >= sig_y & log2FoldChange < 0 ~ "Down", ## significant and negative LFC → Down
TRUE ~ "Not significant" # everything else → Not Significant
)
)
# list of genes to annotate (matching 3c from paper)
genes_to_label <- c("KRAS","LAMB2","SPP1","DUSP6","SPRY2","PLAT","AKT3","SLC2A2","BAX",
"COL2A1","APOE","LAMA4","KDR","PGF","PTPRU","COL9A2","NTRK2","PAK3",
"ELMO","INSM1","NEUROG3","GAB2","NOS3","PAX6","NEUROD1","THBS1",
"NKX2-2","GABRB3","ONECUT1")
lab <- dplyr::filter(volc, gene %in% genes_to_label, is.finite(log2FoldChange), is.finite(y))
#plot ranges + padding
xr <- range(volc$log2FoldChange, na.rm = TRUE) # x-range of all points (min/max LFC)
xpad <- diff(xr) * 0.12 # small horizontal padding (12% of span) for fan-out space
xlim <- xr
yr <- range(volc$y, na.rm = TRUE)
ypad <- diff(yr) * 0.25 # 25% extra room on top for labels
ylim <- c(0, max(yr) + ypad)
#plot
ggplot(volc, aes(log2FoldChange, y, color = dir)) + # base plot: x = LFC, y = -log10(p), color by direction
geom_point(alpha = .6, size = 1.2) +
#label left and right sides separately so labels can fan out
geom_text_repel(data = subset(lab, log2FoldChange > 0), aes(label = gene), # labels for right-hand points (LFC > 0)
nudge_x = xpad*.7, nudge_y = .35, # slight diagonal fan: push right + a little up
box.padding = .4, point.padding = .25, # spacing around label box & between label and point
segment.color = "grey", min.segment.length = 0,
max.overlaps = Inf) +
geom_text_repel(data = subset(lab, log2FoldChange < 0), aes(label = gene),
nudge_x = -xpad*.7, nudge_y = .35,
box.padding = .4, point.padding = .25,
segment.color = "grey", min.segment.length = 0,
max.overlaps = Inf) +
scale_color_manual(values = c(Up="red", Down="blue", NS="grey")) + # map groups to colors (Up=red, Down=blue)
scale_x_continuous(limits = xlim, expand = expansion(mult = c(.02,.02))) +
scale_y_continuous(expand = expansion(mult = c(0, .1))) +
coord_cartesian(ylim = c(0, max(volc$y, na.rm = TRUE))) +
labs(x="Expression difference (log2 fold change)",
y="Significance (-log10 p-value)", color=NULL,
title="Volcano plot") +
theme_minimal() +
theme(panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
plot.margin = margin(20,40,20,40)) +
coord_fixed(1)
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
library(dplyr)
library(ggplot2)
library(forcats)
library(purrr)
pathways_of_interest <- c(
"REACTOME_EXTRACELLULAR_MATRIX_ORGANIZATION",
"REACTOME_INTEGRIN_CELL_SURFACE_INTERACTIONS",
"REACTOME_ECM_PROTEOGLYCANS",
"REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT",
"REACTOME_NEURONAL_SYSTEM",
"REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS"
)
alpha <- 0.01
#Identify DE genes
de_genes <- deseq2_res %>%
filter(!is.na(padj), padj < alpha) %>%
transmute(symbol = if ("symbol" %in% names(.)) symbol else rownames(.)) %>%
distinct() %>% pull(symbol)
#Keep only selected pathways
sel_paths <- pathways[names(pathways) %in% pathways_of_interest]
#Compute % of DE genes in each pathway
pct_df <- tibble(
pathway = names(sel_paths),
set_size = map_int(sel_paths, length),
overlap = map_int(sel_paths, ~ sum(.x %in% de_genes))
) %>%
mutate(pct = 100 * overlap / set_size)
#Join FGSEA results
plot_df <- pct_df %>%
left_join(fgseaRes %>% select(pathway, NES, padj), by = "pathway") %>%
filter(!is.na(NES), !is.na(padj)) %>%
mutate(
direction = if_else(NES > 0, "Up", "Down"),
pathway_lab = gsub("_", " ", sub("^REACTOME_", "", pathway)),
pathway_lab = fct_reorder(pathway_lab, pct)
)
#Plot horizontal bars colored by padj
ggplot(plot_df, aes(x = pathway_lab, y = pct, fill = padj)) +
geom_col(width = 0.6) +
coord_flip() +
facet_wrap(~ direction, ncol = 1, scales = "free_y") +
scale_fill_gradient(
low = "mediumvioletred", high = "red",
trans = "reverse",
name = "Adjusted p value"
) +
labs(
title = "Reactome enrichment",
x = NULL,
y = "Percentage of DE genes in category / all genes in category (%)"
) +
theme_minimal(base_size = 12) +
theme(
strip.text = element_text(face = "bold"),
panel.grid.major.y = element_blank(),
plot.title = element_text(face = "bold", hjust = 0.5)
)
They reported 319 upregulated and 412 downregulated genes using a padj threshold of 0.01, whereas I found 332 upregulated and 401 downregulated genes at the same threshold. Given all of the variables at play, these numbers are quite similar and indicate that my pipeline captured comparable transcriptional trends.
I replicated Figures 3C and 3F as closely as possible, using the same genes and pathways highlighted in their figures. As a result, my volcano plot pretty closely mirrors theirs, and the directions of enrichment for nearly all labeled genes are consistent. Specific genes such as KRAS and LAMB2 appear in nearly identical positions on both plots, suggesting that my differential expression results align well with those of Chandra et al.
The enrichment results were also highly similar. Both analyses identified many of the same functional programs as significantly altered, and I was able to extract six of the seven pathways displayed in their Figure 3F from my GSEA results. When plotted as a horizontal bar graph, the relative bar lengths (representing percent of all genes in that category that are significantly differentially expressed) are highly comparable, as were the padj values. Overall, these parallels indicate that both analyses identified the same biological processes as up or down regulated and at very similar magnitudes. The only notable difference is the absence of the “Signaling by Receptor Tyrosine Kinases” pathway in my GSEA results.
The small discrepencies between our results may have arisen due to a plethora of technical differences. First, the authors likely filtered reads and genes differently than I did. Second, they used slightly different tools, such as Rsubread v2.4.3 for gene counting and ClusterProfile v3.18.1 for enrichment analysis. We also used different versions of some of the same tools, (i.e. STAR aligner v2.5.4b versus v2.7.11b and DESeq2 v1.30.1 versus v1.44.0). Lastly, and perhaps most significantly, they did not specify which version of the reference genome or annotation they used for their analysis. Nevertheless, given the highly similar counts of up and downregulated genes, the closely aligned volcano plots, and the nearly matching enrichment profiles, my results capture the same key underlying biological signals as Chandra et al. for this portion of the RNA-seq experiment.
The sessioninfo function is one way we can attempt to be transparent about our environment management in R.
This will print out a summary of the current packages installed in the environment.
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.4.3 (2025-02-28)
## os AlmaLinux 8.10 (Cerulean Leopard)
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/New_York
## date 2025-10-24
## pandoc 3.2 @ /usr/local/ood/rstudio-server-2024.12.0+467/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
## quarto 1.5.57 @ /usr/local/ood/rstudio-server-2024.12.0+467/bin/quarto/bin/quarto
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## abind 1.4-8 2024-09-12 [1] CRAN (R 4.4.0)
## Biobase * 2.66.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.0)
## BiocGenerics * 0.52.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.0)
## BiocManager 1.30.26 2025-06-05 [1] CRAN (R 4.4.0)
## BiocParallel 1.40.2 2025-10-06 [1] Bioconductor
## bit 4.6.0 2025-03-06 [1] CRAN (R 4.4.0)
## bit64 4.6.0-1 2025-01-16 [1] CRAN (R 4.4.0)
## bslib 0.9.0 2025-01-30 [2] CRAN (R 4.4.3)
## cachem 1.1.0 2024-05-16 [1] CRAN (R 4.4.0)
## cli 3.6.5 2025-04-23 [1] CRAN (R 4.4.0)
## codetools 0.2-20 2024-03-31 [2] CRAN (R 4.4.3)
## colorspace 2.1-2 2025-09-22 [1] CRAN (R 4.4.0)
## cowplot 1.1.3 2024-01-22 [2] CRAN (R 4.4.3)
## crayon 1.5.3 2024-06-20 [1] CRAN (R 4.4.0)
## data.table 1.17.8 2025-07-10 [1] CRAN (R 4.4.0)
## DelayedArray 0.32.0 2024-10-29 [2] Bioconductor 3.20 (R 4.4.3)
## DESeq2 * 1.46.0 2024-10-29 [2] Bioconductor 3.20 (R 4.4.3)
## digest 0.6.37 2024-08-19 [1] CRAN (R 4.4.0)
## dplyr * 1.1.4 2023-11-17 [2] CRAN (R 4.4.3)
## evaluate 1.0.5 2025-08-27 [1] CRAN (R 4.4.0)
## farver 2.1.2 2024-05-13 [1] CRAN (R 4.4.0)
## fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)
## fastmatch 1.1-6 2024-12-23 [1] CRAN (R 4.4.0)
## fgsea * 1.32.4 2025-03-20 [1] Bioconductor 3.20 (R 4.4.0)
## forcats * 1.0.0 2023-01-29 [2] CRAN (R 4.4.3)
## generics 0.1.4 2025-05-09 [1] CRAN (R 4.4.0)
## GenomeInfoDb * 1.42.3 2025-01-27 [2] Bioconductor 3.20 (R 4.4.3)
## GenomeInfoDbData 1.2.13 2025-10-06 [1] Bioconductor
## GenomicRanges * 1.58.0 2024-10-29 [2] Bioconductor 3.20 (R 4.4.3)
## ggplot2 * 3.5.1 2024-04-23 [2] CRAN (R 4.4.3)
## ggrepel * 0.9.6 2024-09-07 [2] CRAN (R 4.4.3)
## glue 1.8.0 2024-09-30 [1] CRAN (R 4.4.0)
## gtable 0.3.6 2024-10-25 [2] CRAN (R 4.4.3)
## hms 1.1.3 2023-03-21 [2] CRAN (R 4.4.3)
## htmltools 0.5.8.1 2024-04-04 [2] CRAN (R 4.4.3)
## httr 1.4.7 2023-08-15 [2] CRAN (R 4.4.3)
## IRanges * 2.40.1 2024-12-05 [2] Bioconductor 3.20 (R 4.4.3)
## jquerylib 0.1.4 2021-04-26 [2] CRAN (R 4.4.3)
## jsonlite 2.0.0 2025-03-27 [1] CRAN (R 4.4.0)
## knitr 1.49 2024-11-08 [2] CRAN (R 4.4.3)
## labeling 0.4.3 2023-08-29 [2] CRAN (R 4.4.3)
## lattice 0.22-7 2025-04-02 [1] CRAN (R 4.4.0)
## lifecycle 1.0.4 2023-11-07 [2] CRAN (R 4.4.3)
## locfit 1.5-9.11 2025-02-03 [2] CRAN (R 4.4.3)
## lubridate * 1.9.4 2024-12-08 [2] CRAN (R 4.4.3)
## magrittr 2.0.4 2025-09-12 [1] CRAN (R 4.4.0)
## Matrix 1.7-4 2025-08-28 [2] CRAN (R 4.4.3)
## MatrixGenerics * 1.18.1 2025-01-09 [2] Bioconductor 3.20 (R 4.4.3)
## matrixStats * 1.5.0 2025-01-07 [1] CRAN (R 4.4.0)
## munsell 0.5.1 2024-04-01 [2] CRAN (R 4.4.3)
## pheatmap * 1.0.12 2019-01-04 [2] CRAN (R 4.4.3)
## pillar 1.10.1 2025-01-07 [2] CRAN (R 4.4.3)
## pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.4.3)
## purrr * 1.0.4 2025-02-05 [2] CRAN (R 4.4.3)
## R6 2.6.1 2025-02-15 [1] CRAN (R 4.4.0)
## RColorBrewer * 1.1-3 2022-04-03 [2] CRAN (R 4.4.3)
## Rcpp 1.1.0 2025-07-02 [1] CRAN (R 4.4.0)
## readr * 2.1.5 2024-01-10 [2] CRAN (R 4.4.3)
## rlang 1.1.6 2025-04-11 [1] CRAN (R 4.4.0)
## rmarkdown 2.29 2024-11-04 [2] CRAN (R 4.4.3)
## rstudioapi 0.17.1 2024-10-22 [1] CRAN (R 4.4.0)
## S4Arrays 1.6.0 2024-10-29 [2] Bioconductor 3.20 (R 4.4.3)
## S4Vectors * 0.44.0 2024-10-29 [2] Bioconductor 3.20 (R 4.4.3)
## sass 0.4.9 2024-03-15 [2] CRAN (R 4.4.3)
## scales 1.3.0 2023-11-28 [2] CRAN (R 4.4.3)
## sessioninfo 1.2.3 2025-02-05 [2] CRAN (R 4.4.3)
## SparseArray 1.6.2 2025-02-20 [2] Bioconductor 3.20 (R 4.4.3)
## stringi 1.8.7 2025-03-27 [1] CRAN (R 4.4.0)
## stringr * 1.5.1 2023-11-14 [2] CRAN (R 4.4.3)
## SummarizedExperiment * 1.36.0 2024-10-29 [2] Bioconductor 3.20 (R 4.4.3)
## tibble * 3.2.1 2023-03-20 [2] CRAN (R 4.4.3)
## tidyr * 1.3.1 2024-01-24 [2] CRAN (R 4.4.3)
## tidyselect 1.2.1 2024-03-11 [2] CRAN (R 4.4.3)
## tidyverse * 2.0.0 2023-02-22 [2] CRAN (R 4.4.3)
## timechange 0.3.0 2024-01-18 [2] CRAN (R 4.4.3)
## tzdb 0.4.0 2023-05-12 [2] CRAN (R 4.4.3)
## UCSC.utils 1.2.0 2024-10-29 [2] Bioconductor 3.20 (R 4.4.3)
## utf8 1.2.6 2025-06-08 [1] CRAN (R 4.4.0)
## vctrs 0.6.5 2023-12-01 [2] CRAN (R 4.4.3)
## vroom 1.6.5 2023-12-05 [2] CRAN (R 4.4.3)
## withr 3.0.2 2024-10-28 [1] CRAN (R 4.4.0)
## xfun 0.53 2025-08-19 [1] CRAN (R 4.4.0)
## XVector 0.46.0 2024-10-29 [2] Bioconductor 3.20 (R 4.4.3)
## yaml 2.3.10 2024-07-26 [1] CRAN (R 4.4.0)
## zlibbioc 1.52.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.0)
##
## [1] /usr4/bf527/hbeakley/R/x86_64-pc-linux-gnu-library/4.4
## [2] /share/pkg.8/r/4.4.3/install/lib64/R/library
## * ── Packages attached to the search path.
##
## ──────────────────────────────────────────────────────────────────────────────