The howoldru package provides a function, howoldru(), to estimate biological age based on DNA methylation data from Illumina 450k or EPIC arrays. This vignette demonstrates how to use the package with a real dataset downloaded from the Gene Expression Omnibus (GEO).
(Optional: Add a sentence or two here about the specific biological basis or origin of your ‘howoldru’ clock if you have that information, e.g., “The underlying model uses N CpG sites selected based on … and was trained on … data.”)
We will use dataset GSE110554 from GEO, which contains Illumina HumanMethylationEPIC array data from flow sorted peripheral blood mononuclear cells (PBMCs) of healthy donors of different ages. There are 49 samples with an even mixture of sex and age.
We use the GEOquery package to download the data.To run this example
yourself: Make sure you have the GEOquery package installed
GEOquery from Bioconductor.
Important Note: Downloading data from GEO can take time and requires
an internet connection. There are 50 samples of microarray data
downloading which takes ~ 1 min. if you want. a smaller
dataset to test, use GSE110554 (if so, you’ll have to edit
the pData to characteristics_ch1.3 to extract age)
# Ensure necessary packages for the vignette are loaded
library(howoldru)
library(GEOquery)
library(Biobase) # Needed for ExpressionSet objects
library(dplyr) # Needed for data manipulation
library(ggplot2) # Needed for plotting# This chunk downloads the data. Set eval=TRUE to run it.
gse_list <- GEOquery::getGEO("GSE110554", GSEMatrix = TRUE, getGPL = FALSE)
# also GSE110554 is an option
gse <- gse_list[[1]]
print(gse)
#> ExpressionSet (storageMode: lockedEnvironment)
#> assayData: 866091 features, 49 samples
#> element names: exprs
#> protocolData: none
#> phenoData
#> sampleNames: GSM2998021 GSM2998022 ... GSM2998149 (49 total)
#> varLabels: title geo_accession ... weight_kg:ch1 (85 total)
#> varMetadata: labelDescription
#> featureData: none
#> experimentData: use 'experimentData(object)'
#> pubMedIds: 29843789
#> 35140201
#> 36127421
#> Annotation: GPL21145The howoldru() function requires a DNA methylation beta value matrix as input, with CpG site identifiers as row names and sample identifiers as column names. We can extract this directly from the ExpressionSet object downloaded from GEO.
# Extract beta values (CpGs rows, Samples columns)
beta_matrix <- Biobase::exprs(gse)
# Extract phenotype data
pheno_data <- Biobase::pData(gse)
reported_ages_df <- pheno_data %>%
dplyr::transmute( # transmute keeps only the new column and grouping vars if any
SampleID = geo_accession, # Use geo_accession as sample ID
ReportedAge = suppressWarnings(as.numeric(sub("age: ", "", characteristics_ch1.15)))
) %>%
dplyr::filter(!is.na(ReportedAge)) # Keep only samples with valid numeric age
head(reported_ages_df)
#> SampleID ReportedAge
#> GSM2998021 GSM2998021 39
#> GSM2998022 GSM2998022 49
#> GSM2998023 GSM2998023 20
#> GSM2998024 GSM2998024 49
#> GSM2998025 GSM2998025 33
#> GSM2998027 GSM2998027 21
# Ensure beta matrix only contains samples with valid age for comparison
beta_matrix <- beta_matrix[, reported_ages_df$SampleID]
# check if it works.
head(reported_ages_df)
#> SampleID ReportedAge
#> GSM2998021 GSM2998021 39
#> GSM2998022 GSM2998022 49
#> GSM2998023 GSM2998023 20
#> GSM2998024 GSM2998024 49
#> GSM2998025 GSM2998025 33
#> GSM2998027 GSM2998027 21# Run the clock function
predicted_ages <- howoldru(beta_matrix)
#> [1] "Number of represented howoldru CpGs: 221 out of 221"
# Show the first few results
head(predicted_ages)
#> GSM2998021 GSM2998022 GSM2998023 GSM2998024 GSM2998025 GSM2998027
#> 46.00387 52.15344 25.35901 49.48506 37.25747 27.20428# Combine results into a data frame for plotting
results_df <- data.frame(
SampleID = names(predicted_ages),
PredictedAge = predicted_ages
)
results_df <- dplyr::inner_join(reported_ages_df, results_df, by = "SampleID")
# Display first few results
# Calculate Pearson correlation
correlation <- cor(results_df$ReportedAge, results_df$PredictedAge)
# Calculate Mean Absolute Error (MAE)
mae <- mean(abs(results_df$ReportedAge - results_df$PredictedAge))
ggplot(results_df, aes(x = ReportedAge, y = PredictedAge)) +
geom_point(alpha = 0.7) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
# Set axis limits from 0 to 100
coord_cartesian(xlim = c(0, 100), ylim = c(0, 100)) +
labs(
title = "howoldru Predicted vs. Reported Age",
# Include Correlation and MAE in the subtitle
subtitle = paste("GSE112618 | Pearson Cor =", round(correlation, 3), "| MAE =", round(mae, 2)),
x = "Reported Chronological Age",
y = "Predicted Age (howoldru)"
) +
theme_minimal() # Use a minimal theme