1. Introduction

Colorectal cancer (CRC) is one of the most commonly diagnosed cancers worldwide, and gene expression profiling has been instrumental in understanding the molecular changes that drive tumorigenesis. In this project, I analyze a publicly available microarray dataset to identify genes that are differentially expressed between tumor and normal colon tissue, with the goal of highlighting candidate genes implicated in colorectal cancer biology.

Dataset: GSE44076 (Gene Expression Omnibus)

Platform: Affymetrix Human Genome U219 Array (GPL13667)

Samples: 246 total — 98 Tumor, 98 Normal (adjacent), 50 Healthy mucosa

Comparison performed: Tumor vs. Normal

2. Why limma (and not DESeq2)?

A quick but important methodological note before diving in. GSE44076 is a microarray dataset, where expression is measured as log2-transformed, continuous fluorescence intensities — not raw integer read counts like RNA-seq.

DESeq2 vs. limma — choosing the right tool for the data type
Aspect	DESeq2	limma
Data type	RNA-seq raw counts	Microarray intensities
Input values	Integers (0, 45, 230…)	Continuous log2 values (3.6, 7.2…)
Statistical model	Negative Binomial	Linear model + Empirical Bayes
Best suited for	RNA-seq count data	Microarray / normalized data

Since DESeq2’s negative binomial model assumes true sequencing counts, applying it to microarray intensities would require artificially converting them into pseudo-counts — a statistically inappropriate workaround. limma, on the other hand, was purpose-built for microarray data and uses linear modeling with empirical Bayes variance shrinkage, making it the correct tool for this dataset.

3. Load Required Libraries

library(GEOquery)
library(limma)
library(hgu219.db)
library(AnnotationDbi)
library(dplyr)
library(ggplot2)
library(ggrepel)

4. Data Acquisition

The dataset is downloaded directly from NCBI’s Gene Expression Omnibus using GEOquery.

gse <- getGEO("GSE44076", GSEMatrix = TRUE)
gse <- gse[[1]]

expr_matrix <- exprs(gse)   # log2 expression intensities
metadata <- pData(gse)      # sample annotation/phenotype data

dim(expr_matrix)

## [1] 49386   246

The expression matrix contains 49386 probes measured across 246 samples.

4.1 Inspecting Sample Metadata

table(metadata$`sample type:ch1`)

## 
## Mucosa Normal  Tumor 
##     50     98     98

For this analysis, I focus on the Tumor vs. Normal comparison (excluding the smaller “Mucosa” healthy control group, which represents a separate baseline population).

5. Differential Expression Analysis with limma

5.1 Defining the Experimental Design

condition <- factor(metadata$`sample type:ch1`)
condition <- relevel(condition, ref = "Normal")

design <- model.matrix(~ condition)
head(design)

##   (Intercept) conditionMucosa conditionTumor
## 1           1               1              0
## 2           1               1              0
## 3           1               1              0
## 4           1               1              0
## 5           1               1              0
## 6           1               1              0

5.2 Fitting the Linear Model

lmFit() fits a linear model to every probe, and eBayes() applies empirical Bayes moderation — borrowing information across all probes to stabilize variance estimates, which is especially valuable given the noise inherent in array data.

fit <- lmFit(expr_matrix, design)
fit2 <- eBayes(fit)

5.3 Extracting Results

results <- topTable(fit2, coef = "conditionTumor", number = Inf, adjust.method = "BH")
head(results)

5.4 Classifying Genes by Significance

Genes are classified as differentially expressed if they meet both:

Statistical significance: adjusted p-value (FDR) < 0.05
Biological relevance: |log2 fold change| > 1 (i.e., at least a 2-fold change)

results$probe <- rownames(results)

results$significance <- "Not Significant"
results$significance[results$adj.P.Val < 0.05 & results$logFC > 1]  <- "Upregulated"
results$significance[results$adj.P.Val < 0.05 & results$logFC < -1] <- "Downregulated"

table(results$significance)

## 
##   Downregulated Not Significant     Upregulated 
##            2353           44940            2093

6. Gene Annotation

Probe IDs aren’t biologically interpretable on their own, so I map them to gene symbols and full gene names using the hgu219.db annotation package (matching the array platform, GPL13667).

results$gene_symbol <- mapIds(hgu219.db,
                                keys = results$probe,
                                column = "SYMBOL",
                                keytype = "PROBEID",
                                multiVals = "first")

results$gene_name <- mapIds(hgu219.db,
                              keys = results$probe,
                              column = "GENENAME",
                              keytype = "PROBEID",
                              multiVals = "first")

results %>%
  filter(significance != "Not Significant") %>%
  arrange(adj.P.Val) %>%
  select(gene_symbol, gene_name, logFC, adj.P.Val, significance) %>%
  head(10)

7. Visualization: Volcano Plot

The volcano plot below summarizes all 49386 probes tested, plotting effect size (log2 fold change) against statistical significance (-log10 adjusted p-value).

top_genes <- results %>%
  filter(significance != "Not Significant", !is.na(gene_symbol)) %>%
  group_by(significance) %>%
  slice_min(adj.P.Val, n = 5) %>%
  ungroup()

volcano_plot <- ggplot(results, aes(x = logFC, y = -log10(adj.P.Val), color = significance)) +
  geom_point(alpha = 0.6, size = 1.2) +
  scale_color_manual(values = c(
    "Upregulated"     = "#E74C3C",
    "Downregulated"   = "#2980B9",
    "Not Significant" = "grey70"
  )) +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed", color = "black", linewidth = 0.5) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "black", linewidth = 0.5) +
  geom_label_repel(
    data = top_genes,
    aes(label = gene_symbol),
    size = 3.5,
    fontface = "bold",
    max.overlaps = 20,
    box.padding = 0.5
  ) +
  labs(
    title    = "Volcano Plot — Tumor vs Normal (GSE44076)",
    subtitle = "Colorectal Cancer Microarray Analysis | limma",
    x        = "log2 Fold Change",
    y        = "-log10 (Adjusted P-value)",
    color    = "Expression"
  ) +
  theme_classic(base_size = 13) +
  theme(
    plot.title    = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5, color = "grey40"),
    legend.position = "top"
  )

print(volcano_plot)

8. Results Summary

Category	Count
Upregulated in Tumor	2093
Downregulated in Tumor	2353
Not Significant	44940
Total probes tested	49386

8.1 Top Upregulated Genes in Tumor

results %>%
  filter(significance == "Upregulated") %>%
  arrange(adj.P.Val) %>%
  select(gene_symbol, gene_name, logFC, adj.P.Val) %>%
  head(5) %>%
  knitr::kable()

	gene_symbol	gene_name	logFC
11728232_a_at	CLDN1	claudin 1	4.920208
11735833_a_at	CEMIP	cell migration inducing hyaluronidase 1	4.699381
11719434_a_at	ETV4	ETS variant transcription factor 4	3.297047
11728234_a_at	CLDN1	claudin 1	4.224330
11739128_a_at	CDH3	cadherin 3	3.592385

8.2 Top Downregulated Genes in Tumor

results %>%
  filter(significance == "Downregulated") %>%
  arrange(adj.P.Val) %>%
  select(gene_symbol, gene_name, logFC, adj.P.Val) %>%
  head(5) %>%
  knitr::kable()

	gene_symbol	gene_name	logFC
11732838_at	GUCA2B	guanylate cyclase activator 2B	-6.547316
11715637_a_at	UGP2	UDP-glucose pyrophosphorylase 2	-1.611339
11733581_a_at	CA7	carbonic anhydrase 7	-4.651000
11759464_at	OTOP2	otopetrin 2	-5.157509
11737294_a_at	TMIGD1	transmembrane and immunoglobulin domain containing 1	-6.928715

9. Biological Interpretation

Several of the top differentially expressed genes have well-documented roles in colorectal cancer, lending biological credibility to the statistical findings:

CLDN1 (Claudin 1) — a tight-junction protein frequently overexpressed in colorectal cancer; implicated in altered cell adhesion and tumor invasion.
CEMIP — a hyaluronidase involved in extracellular matrix remodeling; associated with tumor invasion and metastasis in CRC.
ETV4 — an ETS-family transcription factor with oncogenic activity, known to activate invasion- and migration-related gene programs in several cancers including CRC.
CDH3 (P-cadherin) — a cell adhesion molecule often upregulated in epithelial cancers and linked to poor prognosis.
GUCA2B — a regulator of normal colonic epithelial ion transport; its downregulation is consistent with loss of differentiated colonocyte function during malignant transformation.

These patterns align with established colorectal cancer biology — increased cell invasion/migration capacity alongside loss of normal epithelial differentiation — providing a useful sanity check on the analysis pipeline.

10. Conclusion

This analysis identified 2093 upregulated and 2353 downregulated genes (FDR < 0.05, |log2FC| > 1) between colorectal tumor and normal tissue, using a statistically appropriate linear modeling approach (limma) suited to microarray data. Several top hits are established CRC-associated genes, supporting the validity of the workflow.

Potential extensions:

Pathway enrichment analysis (e.g., GO, KEGG) on the DEG list
Comparison against the “Mucosa” (fully healthy) control group
Survival analysis correlating top DEGs with clinical outcomes, if available

11. Session Info

For reproducibility:

sessionInfo()

## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_India.utf8  LC_CTYPE=English_India.utf8   
## [3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C                  
## [5] LC_TIME=English_India.utf8    
## 
## time zone: Asia/Calcutta
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] ggrepel_0.9.8        ggplot2_4.0.3        dplyr_1.2.1         
##  [4] hgu219.db_3.2.3      org.Hs.eg.db_3.22.0  AnnotationDbi_1.72.0
##  [7] IRanges_2.44.0       S4Vectors_0.48.1     limma_3.66.0        
## [10] GEOquery_2.78.0      Biobase_2.70.0       BiocGenerics_0.56.0 
## [13] generics_0.1.4      
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1            farver_2.1.2               
##  [3] blob_1.3.0                  R.utils_2.13.0             
##  [5] Biostrings_2.78.0           S7_0.2.2                   
##  [7] fastmap_1.2.0               XML_3.99-0.23              
##  [9] digest_0.6.39               lifecycle_1.0.5            
## [11] statmod_1.5.2               KEGGREST_1.50.0            
## [13] RSQLite_3.53.1              magrittr_2.0.5             
## [15] compiler_4.5.1              rlang_1.2.0                
## [17] sass_0.4.10                 tools_4.5.1                
## [19] yaml_2.3.12                 data.table_1.18.4          
## [21] knitr_1.51                  labeling_0.4.3             
## [23] S4Arrays_1.10.1             bit_4.6.0                  
## [25] curl_7.1.0                  DelayedArray_0.36.1        
## [27] xml2_1.5.2                  RColorBrewer_1.1-3         
## [29] abind_1.4-8                 withr_3.0.2                
## [31] purrr_1.2.2                 R.oo_1.27.1                
## [33] grid_4.5.1                  scales_1.4.0               
## [35] SummarizedExperiment_1.40.0 cli_3.6.6                  
## [37] rmarkdown_2.31              crayon_1.5.3               
## [39] rstudioapi_0.19.0           httr_1.4.8                 
## [41] tzdb_0.5.0                  DBI_1.3.0                  
## [43] cachem_1.1.0                XVector_0.50.0             
## [45] matrixStats_1.5.0           vctrs_0.7.3                
## [47] Matrix_1.7-3                jsonlite_2.0.0             
## [49] hms_1.1.4                   bit64_4.8.2                
## [51] tidyr_1.3.2                 jquerylib_0.1.4            
## [53] glue_1.8.1                  codetools_0.2-20           
## [55] gtable_0.3.6                GenomicRanges_1.62.1       
## [57] tibble_3.3.1                pillar_1.11.1              
## [59] rappdirs_0.3.4              htmltools_0.5.9            
## [61] Seqinfo_1.0.0               R6_2.6.1                   
## [63] httr2_1.2.2                 evaluate_1.0.5             
## [65] lattice_0.22-7              readr_2.2.0                
## [67] rentrez_1.2.4               png_0.1-9                  
## [69] R.methodsS3_1.8.2           memoise_2.0.1              
## [71] bslib_0.11.0                Rcpp_1.1.1-1.1             
## [73] SparseArray_1.10.10         xfun_0.58                  
## [75] MatrixGenerics_1.22.0       pkgconfig_2.0.3

Differential Gene Expression Analysis in Colorectal Cancer

Microarray Analysis of Tumor vs. Normal Colon Tissue (GSE44076)

Pooja Sharma

June 24, 2026