1. Introduction

Colorectal cancer (CRC) is one of the most commonly diagnosed cancers worldwide, and gene expression profiling has been instrumental in understanding the molecular changes that drive tumorigenesis. In this project, I analyze a publicly available microarray dataset to identify genes that are differentially expressed between tumor and normal colon tissue, with the goal of highlighting candidate genes implicated in colorectal cancer biology.

Dataset: GSE44076 (Gene Expression Omnibus)

Platform: Affymetrix Human Genome U219 Array (GPL13667)

Samples: 246 total — 98 Tumor, 98 Normal (adjacent), 50 Healthy mucosa

Comparison performed: Tumor vs. Normal


2. Why limma (and not DESeq2)?

A quick but important methodological note before diving in. GSE44076 is a microarray dataset, where expression is measured as log2-transformed, continuous fluorescence intensities — not raw integer read counts like RNA-seq.

DESeq2 vs. limma — choosing the right tool for the data type
Aspect DESeq2 limma
Data type RNA-seq raw counts Microarray intensities
Input values Integers (0, 45, 230…) Continuous log2 values (3.6, 7.2…)
Statistical model Negative Binomial Linear model + Empirical Bayes
Best suited for RNA-seq count data Microarray / normalized data

Since DESeq2’s negative binomial model assumes true sequencing counts, applying it to microarray intensities would require artificially converting them into pseudo-counts — a statistically inappropriate workaround. limma, on the other hand, was purpose-built for microarray data and uses linear modeling with empirical Bayes variance shrinkage, making it the correct tool for this dataset.


3. Load Required Libraries

library(GEOquery)
library(limma)
library(hgu219.db)
library(AnnotationDbi)
library(dplyr)
library(ggplot2)
library(ggrepel)

4. Data Acquisition

The dataset is downloaded directly from NCBI’s Gene Expression Omnibus using GEOquery.

gse <- getGEO("GSE44076", GSEMatrix = TRUE)
gse <- gse[[1]]

expr_matrix <- exprs(gse)   # log2 expression intensities
metadata <- pData(gse)      # sample annotation/phenotype data

dim(expr_matrix)
## [1] 49386   246

The expression matrix contains 49386 probes measured across 246 samples.

4.1 Inspecting Sample Metadata

table(metadata$`sample type:ch1`)
## 
## Mucosa Normal  Tumor 
##     50     98     98

For this analysis, I focus on the Tumor vs. Normal comparison (excluding the smaller “Mucosa” healthy control group, which represents a separate baseline population).


5. Differential Expression Analysis with limma

5.1 Defining the Experimental Design

condition <- factor(metadata$`sample type:ch1`)
condition <- relevel(condition, ref = "Normal")

design <- model.matrix(~ condition)
head(design)
##   (Intercept) conditionMucosa conditionTumor
## 1           1               1              0
## 2           1               1              0
## 3           1               1              0
## 4           1               1              0
## 5           1               1              0
## 6           1               1              0

5.2 Fitting the Linear Model

lmFit() fits a linear model to every probe, and eBayes() applies empirical Bayes moderation — borrowing information across all probes to stabilize variance estimates, which is especially valuable given the noise inherent in array data.

fit <- lmFit(expr_matrix, design)
fit2 <- eBayes(fit)

5.3 Extracting Results

results <- topTable(fit2, coef = "conditionTumor", number = Inf, adjust.method = "BH")
head(results)

5.4 Classifying Genes by Significance

Genes are classified as differentially expressed if they meet both:

  • Statistical significance: adjusted p-value (FDR) < 0.05
  • Biological relevance: |log2 fold change| > 1 (i.e., at least a 2-fold change)
results$probe <- rownames(results)

results$significance <- "Not Significant"
results$significance[results$adj.P.Val < 0.05 & results$logFC > 1]  <- "Upregulated"
results$significance[results$adj.P.Val < 0.05 & results$logFC < -1] <- "Downregulated"

table(results$significance)
## 
##   Downregulated Not Significant     Upregulated 
##            2353           44940            2093

6. Gene Annotation

Probe IDs aren’t biologically interpretable on their own, so I map them to gene symbols and full gene names using the hgu219.db annotation package (matching the array platform, GPL13667).

results$gene_symbol <- mapIds(hgu219.db,
                                keys = results$probe,
                                column = "SYMBOL",
                                keytype = "PROBEID",
                                multiVals = "first")

results$gene_name <- mapIds(hgu219.db,
                              keys = results$probe,
                              column = "GENENAME",
                              keytype = "PROBEID",
                              multiVals = "first")

results %>%
  filter(significance != "Not Significant") %>%
  arrange(adj.P.Val) %>%
  select(gene_symbol, gene_name, logFC, adj.P.Val, significance) %>%
  head(10)

7. Visualization: Volcano Plot

The volcano plot below summarizes all 49386 probes tested, plotting effect size (log2 fold change) against statistical significance (-log10 adjusted p-value).

top_genes <- results %>%
  filter(significance != "Not Significant", !is.na(gene_symbol)) %>%
  group_by(significance) %>%
  slice_min(adj.P.Val, n = 5) %>%
  ungroup()

volcano_plot <- ggplot(results, aes(x = logFC, y = -log10(adj.P.Val), color = significance)) +
  geom_point(alpha = 0.6, size = 1.2) +
  scale_color_manual(values = c(
    "Upregulated"     = "#E74C3C",
    "Downregulated"   = "#2980B9",
    "Not Significant" = "grey70"
  )) +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed", color = "black", linewidth = 0.5) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "black", linewidth = 0.5) +
  geom_label_repel(
    data = top_genes,
    aes(label = gene_symbol),
    size = 3.5,
    fontface = "bold",
    max.overlaps = 20,
    box.padding = 0.5
  ) +
  labs(
    title    = "Volcano Plot — Tumor vs Normal (GSE44076)",
    subtitle = "Colorectal Cancer Microarray Analysis | limma",
    x        = "log2 Fold Change",
    y        = "-log10 (Adjusted P-value)",
    color    = "Expression"
  ) +
  theme_classic(base_size = 13) +
  theme(
    plot.title    = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5, color = "grey40"),
    legend.position = "top"
  )

print(volcano_plot)


8. Results Summary

Category Count
Upregulated in Tumor 2093
Downregulated in Tumor 2353
Not Significant 44940
Total probes tested 49386

8.1 Top Upregulated Genes in Tumor

results %>%
  filter(significance == "Upregulated") %>%
  arrange(adj.P.Val) %>%
  select(gene_symbol, gene_name, logFC, adj.P.Val) %>%
  head(5) %>%
  knitr::kable()
gene_symbol gene_name logFC adj.P.Val
11728232_a_at CLDN1 claudin 1 4.920208 0
11735833_a_at CEMIP cell migration inducing hyaluronidase 1 4.699381 0
11719434_a_at ETV4 ETS variant transcription factor 4 3.297047 0
11728234_a_at CLDN1 claudin 1 4.224330 0
11739128_a_at CDH3 cadherin 3 3.592385 0

8.2 Top Downregulated Genes in Tumor

results %>%
  filter(significance == "Downregulated") %>%
  arrange(adj.P.Val) %>%
  select(gene_symbol, gene_name, logFC, adj.P.Val) %>%
  head(5) %>%
  knitr::kable()
gene_symbol gene_name logFC adj.P.Val
11732838_at GUCA2B guanylate cyclase activator 2B -6.547316 0
11715637_a_at UGP2 UDP-glucose pyrophosphorylase 2 -1.611339 0
11733581_a_at CA7 carbonic anhydrase 7 -4.651000 0
11759464_at OTOP2 otopetrin 2 -5.157509 0
11737294_a_at TMIGD1 transmembrane and immunoglobulin domain containing 1 -6.928715 0

9. Biological Interpretation

Several of the top differentially expressed genes have well-documented roles in colorectal cancer, lending biological credibility to the statistical findings:

  • CLDN1 (Claudin 1) — a tight-junction protein frequently overexpressed in colorectal cancer; implicated in altered cell adhesion and tumor invasion.
  • CEMIP — a hyaluronidase involved in extracellular matrix remodeling; associated with tumor invasion and metastasis in CRC.
  • ETV4 — an ETS-family transcription factor with oncogenic activity, known to activate invasion- and migration-related gene programs in several cancers including CRC.
  • CDH3 (P-cadherin) — a cell adhesion molecule often upregulated in epithelial cancers and linked to poor prognosis.
  • GUCA2B — a regulator of normal colonic epithelial ion transport; its downregulation is consistent with loss of differentiated colonocyte function during malignant transformation.

These patterns align with established colorectal cancer biology — increased cell invasion/migration capacity alongside loss of normal epithelial differentiation — providing a useful sanity check on the analysis pipeline.


10. Conclusion

This analysis identified 2093 upregulated and 2353 downregulated genes (FDR < 0.05, |log2FC| > 1) between colorectal tumor and normal tissue, using a statistically appropriate linear modeling approach (limma) suited to microarray data. Several top hits are established CRC-associated genes, supporting the validity of the workflow.

Potential extensions:

  • Pathway enrichment analysis (e.g., GO, KEGG) on the DEG list
  • Comparison against the “Mucosa” (fully healthy) control group
  • Survival analysis correlating top DEGs with clinical outcomes, if available

11. Session Info

For reproducibility:

sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_India.utf8  LC_CTYPE=English_India.utf8   
## [3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C                  
## [5] LC_TIME=English_India.utf8    
## 
## time zone: Asia/Calcutta
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] ggrepel_0.9.8        ggplot2_4.0.3        dplyr_1.2.1         
##  [4] hgu219.db_3.2.3      org.Hs.eg.db_3.22.0  AnnotationDbi_1.72.0
##  [7] IRanges_2.44.0       S4Vectors_0.48.1     limma_3.66.0        
## [10] GEOquery_2.78.0      Biobase_2.70.0       BiocGenerics_0.56.0 
## [13] generics_0.1.4      
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1            farver_2.1.2               
##  [3] blob_1.3.0                  R.utils_2.13.0             
##  [5] Biostrings_2.78.0           S7_0.2.2                   
##  [7] fastmap_1.2.0               XML_3.99-0.23              
##  [9] digest_0.6.39               lifecycle_1.0.5            
## [11] statmod_1.5.2               KEGGREST_1.50.0            
## [13] RSQLite_3.53.1              magrittr_2.0.5             
## [15] compiler_4.5.1              rlang_1.2.0                
## [17] sass_0.4.10                 tools_4.5.1                
## [19] yaml_2.3.12                 data.table_1.18.4          
## [21] knitr_1.51                  labeling_0.4.3             
## [23] S4Arrays_1.10.1             bit_4.6.0                  
## [25] curl_7.1.0                  DelayedArray_0.36.1        
## [27] xml2_1.5.2                  RColorBrewer_1.1-3         
## [29] abind_1.4-8                 withr_3.0.2                
## [31] purrr_1.2.2                 R.oo_1.27.1                
## [33] grid_4.5.1                  scales_1.4.0               
## [35] SummarizedExperiment_1.40.0 cli_3.6.6                  
## [37] rmarkdown_2.31              crayon_1.5.3               
## [39] rstudioapi_0.19.0           httr_1.4.8                 
## [41] tzdb_0.5.0                  DBI_1.3.0                  
## [43] cachem_1.1.0                XVector_0.50.0             
## [45] matrixStats_1.5.0           vctrs_0.7.3                
## [47] Matrix_1.7-3                jsonlite_2.0.0             
## [49] hms_1.1.4                   bit64_4.8.2                
## [51] tidyr_1.3.2                 jquerylib_0.1.4            
## [53] glue_1.8.1                  codetools_0.2-20           
## [55] gtable_0.3.6                GenomicRanges_1.62.1       
## [57] tibble_3.3.1                pillar_1.11.1              
## [59] rappdirs_0.3.4              htmltools_0.5.9            
## [61] Seqinfo_1.0.0               R6_2.6.1                   
## [63] httr2_1.2.2                 evaluate_1.0.5             
## [65] lattice_0.22-7              readr_2.2.0                
## [67] rentrez_1.2.4               png_0.1-9                  
## [69] R.methodsS3_1.8.2           memoise_2.0.1              
## [71] bslib_0.11.0                Rcpp_1.1.1-1.1             
## [73] SparseArray_1.10.10         xfun_0.58                  
## [75] MatrixGenerics_1.22.0       pkgconfig_2.0.3