Colorectal cancer (CRC) is one of the most commonly diagnosed cancers worldwide, and gene expression profiling has been instrumental in understanding the molecular changes that drive tumorigenesis. In this project, I analyze a publicly available microarray dataset to identify genes that are differentially expressed between tumor and normal colon tissue, with the goal of highlighting candidate genes implicated in colorectal cancer biology.
Dataset: GSE44076 (Gene Expression Omnibus)
Platform: Affymetrix Human Genome U219 Array (GPL13667)
Samples: 246 total — 98 Tumor, 98 Normal (adjacent), 50 Healthy mucosa
Comparison performed: Tumor vs. Normal
A quick but important methodological note before diving in. GSE44076 is a microarray dataset, where expression is measured as log2-transformed, continuous fluorescence intensities — not raw integer read counts like RNA-seq.
| Aspect | DESeq2 | limma |
|---|---|---|
| Data type | RNA-seq raw counts | Microarray intensities |
| Input values | Integers (0, 45, 230…) | Continuous log2 values (3.6, 7.2…) |
| Statistical model | Negative Binomial | Linear model + Empirical Bayes |
| Best suited for | RNA-seq count data | Microarray / normalized data |
Since DESeq2’s negative binomial model assumes true sequencing counts, applying it to microarray intensities would require artificially converting them into pseudo-counts — a statistically inappropriate workaround. limma, on the other hand, was purpose-built for microarray data and uses linear modeling with empirical Bayes variance shrinkage, making it the correct tool for this dataset.
library(GEOquery)
library(limma)
library(hgu219.db)
library(AnnotationDbi)
library(dplyr)
library(ggplot2)
library(ggrepel)The dataset is downloaded directly from NCBI’s Gene Expression
Omnibus using GEOquery.
gse <- getGEO("GSE44076", GSEMatrix = TRUE)
gse <- gse[[1]]
expr_matrix <- exprs(gse) # log2 expression intensities
metadata <- pData(gse) # sample annotation/phenotype data
dim(expr_matrix)## [1] 49386 246
The expression matrix contains 49386 probes measured across 246 samples.
condition <- factor(metadata$`sample type:ch1`)
condition <- relevel(condition, ref = "Normal")
design <- model.matrix(~ condition)
head(design)## (Intercept) conditionMucosa conditionTumor
## 1 1 1 0
## 2 1 1 0
## 3 1 1 0
## 4 1 1 0
## 5 1 1 0
## 6 1 1 0
lmFit() fits a linear model to every probe, and
eBayes() applies empirical Bayes moderation — borrowing
information across all probes to stabilize variance estimates, which is
especially valuable given the noise inherent in array data.
results <- topTable(fit2, coef = "conditionTumor", number = Inf, adjust.method = "BH")
head(results)Genes are classified as differentially expressed if they meet both:
results$probe <- rownames(results)
results$significance <- "Not Significant"
results$significance[results$adj.P.Val < 0.05 & results$logFC > 1] <- "Upregulated"
results$significance[results$adj.P.Val < 0.05 & results$logFC < -1] <- "Downregulated"
table(results$significance)##
## Downregulated Not Significant Upregulated
## 2353 44940 2093
Probe IDs aren’t biologically interpretable on their own, so I map
them to gene symbols and full gene names using the
hgu219.db annotation package (matching the array platform,
GPL13667).
results$gene_symbol <- mapIds(hgu219.db,
keys = results$probe,
column = "SYMBOL",
keytype = "PROBEID",
multiVals = "first")
results$gene_name <- mapIds(hgu219.db,
keys = results$probe,
column = "GENENAME",
keytype = "PROBEID",
multiVals = "first")
results %>%
filter(significance != "Not Significant") %>%
arrange(adj.P.Val) %>%
select(gene_symbol, gene_name, logFC, adj.P.Val, significance) %>%
head(10)The volcano plot below summarizes all 49386 probes tested, plotting effect size (log2 fold change) against statistical significance (-log10 adjusted p-value).
top_genes <- results %>%
filter(significance != "Not Significant", !is.na(gene_symbol)) %>%
group_by(significance) %>%
slice_min(adj.P.Val, n = 5) %>%
ungroup()
volcano_plot <- ggplot(results, aes(x = logFC, y = -log10(adj.P.Val), color = significance)) +
geom_point(alpha = 0.6, size = 1.2) +
scale_color_manual(values = c(
"Upregulated" = "#E74C3C",
"Downregulated" = "#2980B9",
"Not Significant" = "grey70"
)) +
geom_vline(xintercept = c(-1, 1), linetype = "dashed", color = "black", linewidth = 0.5) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "black", linewidth = 0.5) +
geom_label_repel(
data = top_genes,
aes(label = gene_symbol),
size = 3.5,
fontface = "bold",
max.overlaps = 20,
box.padding = 0.5
) +
labs(
title = "Volcano Plot — Tumor vs Normal (GSE44076)",
subtitle = "Colorectal Cancer Microarray Analysis | limma",
x = "log2 Fold Change",
y = "-log10 (Adjusted P-value)",
color = "Expression"
) +
theme_classic(base_size = 13) +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "grey40"),
legend.position = "top"
)
print(volcano_plot)| Category | Count |
|---|---|
| Upregulated in Tumor | 2093 |
| Downregulated in Tumor | 2353 |
| Not Significant | 44940 |
| Total probes tested | 49386 |
results %>%
filter(significance == "Upregulated") %>%
arrange(adj.P.Val) %>%
select(gene_symbol, gene_name, logFC, adj.P.Val) %>%
head(5) %>%
knitr::kable()| gene_symbol | gene_name | logFC | adj.P.Val | |
|---|---|---|---|---|
| 11728232_a_at | CLDN1 | claudin 1 | 4.920208 | 0 |
| 11735833_a_at | CEMIP | cell migration inducing hyaluronidase 1 | 4.699381 | 0 |
| 11719434_a_at | ETV4 | ETS variant transcription factor 4 | 3.297047 | 0 |
| 11728234_a_at | CLDN1 | claudin 1 | 4.224330 | 0 |
| 11739128_a_at | CDH3 | cadherin 3 | 3.592385 | 0 |
results %>%
filter(significance == "Downregulated") %>%
arrange(adj.P.Val) %>%
select(gene_symbol, gene_name, logFC, adj.P.Val) %>%
head(5) %>%
knitr::kable()| gene_symbol | gene_name | logFC | adj.P.Val | |
|---|---|---|---|---|
| 11732838_at | GUCA2B | guanylate cyclase activator 2B | -6.547316 | 0 |
| 11715637_a_at | UGP2 | UDP-glucose pyrophosphorylase 2 | -1.611339 | 0 |
| 11733581_a_at | CA7 | carbonic anhydrase 7 | -4.651000 | 0 |
| 11759464_at | OTOP2 | otopetrin 2 | -5.157509 | 0 |
| 11737294_a_at | TMIGD1 | transmembrane and immunoglobulin domain containing 1 | -6.928715 | 0 |
Several of the top differentially expressed genes have well-documented roles in colorectal cancer, lending biological credibility to the statistical findings:
These patterns align with established colorectal cancer biology — increased cell invasion/migration capacity alongside loss of normal epithelial differentiation — providing a useful sanity check on the analysis pipeline.
This analysis identified 2093 upregulated and
2353 downregulated genes (FDR < 0.05, |log2FC| >
1) between colorectal tumor and normal tissue, using a statistically
appropriate linear modeling approach (limma) suited to
microarray data. Several top hits are established CRC-associated genes,
supporting the validity of the workflow.
Potential extensions:
For reproducibility:
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_India.utf8 LC_CTYPE=English_India.utf8
## [3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_India.utf8
##
## time zone: Asia/Calcutta
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggrepel_0.9.8 ggplot2_4.0.3 dplyr_1.2.1
## [4] hgu219.db_3.2.3 org.Hs.eg.db_3.22.0 AnnotationDbi_1.72.0
## [7] IRanges_2.44.0 S4Vectors_0.48.1 limma_3.66.0
## [10] GEOquery_2.78.0 Biobase_2.70.0 BiocGenerics_0.56.0
## [13] generics_0.1.4
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 farver_2.1.2
## [3] blob_1.3.0 R.utils_2.13.0
## [5] Biostrings_2.78.0 S7_0.2.2
## [7] fastmap_1.2.0 XML_3.99-0.23
## [9] digest_0.6.39 lifecycle_1.0.5
## [11] statmod_1.5.2 KEGGREST_1.50.0
## [13] RSQLite_3.53.1 magrittr_2.0.5
## [15] compiler_4.5.1 rlang_1.2.0
## [17] sass_0.4.10 tools_4.5.1
## [19] yaml_2.3.12 data.table_1.18.4
## [21] knitr_1.51 labeling_0.4.3
## [23] S4Arrays_1.10.1 bit_4.6.0
## [25] curl_7.1.0 DelayedArray_0.36.1
## [27] xml2_1.5.2 RColorBrewer_1.1-3
## [29] abind_1.4-8 withr_3.0.2
## [31] purrr_1.2.2 R.oo_1.27.1
## [33] grid_4.5.1 scales_1.4.0
## [35] SummarizedExperiment_1.40.0 cli_3.6.6
## [37] rmarkdown_2.31 crayon_1.5.3
## [39] rstudioapi_0.19.0 httr_1.4.8
## [41] tzdb_0.5.0 DBI_1.3.0
## [43] cachem_1.1.0 XVector_0.50.0
## [45] matrixStats_1.5.0 vctrs_0.7.3
## [47] Matrix_1.7-3 jsonlite_2.0.0
## [49] hms_1.1.4 bit64_4.8.2
## [51] tidyr_1.3.2 jquerylib_0.1.4
## [53] glue_1.8.1 codetools_0.2-20
## [55] gtable_0.3.6 GenomicRanges_1.62.1
## [57] tibble_3.3.1 pillar_1.11.1
## [59] rappdirs_0.3.4 htmltools_0.5.9
## [61] Seqinfo_1.0.0 R6_2.6.1
## [63] httr2_1.2.2 evaluate_1.0.5
## [65] lattice_0.22-7 readr_2.2.0
## [67] rentrez_1.2.4 png_0.1-9
## [69] R.methodsS3_1.8.2 memoise_2.0.1
## [71] bslib_0.11.0 Rcpp_1.1.1-1.1
## [73] SparseArray_1.10.10 xfun_0.58
## [75] MatrixGenerics_1.22.0 pkgconfig_2.0.3