Bioinformatics Analysis Report

1 Information on patient data

# Load necessary library
library(knitr)

# Create the cleaner data manually
my_table <- data.frame(
  Code = c("G170 024 312 642z (R)", "G170 024 312 932F (R)", "G170 024 313 120Y (L)", 
           "G170 024 312 768 7 (R)", "G170 024 312 xxx x (x)", "G170 225 503 994 I ()", 
           "G170 024 312 502 6 (R)", "G1702255041724", "G1702255040780", 
           "G170225504064C", "G170225504265V", "G170225504223E", "G170225504224C"),
  
  My_code = c("C1.1", "D1.1", "C4.2", "D2.1", "C5.1", "D4.1", 
              "D3.1", "C6.1", "D5.1", "C7.1", "C8.1", "D6.1", "D7.1"),
  
  Age = c(80, 67, 73, 62, 82, 84, 74, 74, 75, 72, 83, 77, 88),
  
  Disease_Control = c("Control", "Diabetes", "Control", "Diabetes", "Control", "Diabetes",
                      "Diabetes", "Control", "Diabetes", "Control", "Control", "Diabetes", "Diabetes"),
  
  DIN = c(8.6, 8.9, 8.7, 8.9, 8.5, 8.5, 8.8, 8.9, 9.1, 8.8, 9.3, 9.1, 8.9)
)

# Create the Markdown table
kable(my_table, caption = "Sample Information Table (no gDNA concentration, no Storage_gDNA)")

Sample Information Table (no gDNA concentration, no Storage_gDNA)
Code	My_code	Age	Disease_Control	DIN
G170 024 312 642z (R)	C1.1	80	Control	8.6
G170 024 312 932F (R)	D1.1	67	Diabetes	8.9
G170 024 313 120Y (L)	C4.2	73	Control	8.7
G170 024 312 768 7 (R)	D2.1	62	Diabetes	8.9
G170 024 312 xxx x (x)	C5.1	82	Control	8.5
G170 225 503 994 I ()	D4.1	84	Diabetes	8.5
G170 024 312 502 6 (R)	D3.1	74	Diabetes	8.8
G1702255041724	C6.1	74	Control	8.9
G1702255040780	D5.1	75	Diabetes	9.1
G170225504064C	C7.1	72	Control	8.8
G170225504265V	C8.1	83	Control	9.3
G170225504223E	D6.1	77	Diabetes	9.1
G170225504224C	D7.1	88	Diabetes	8.9

2 Loading and installing the packages

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")

BiocManager::install(c("minfi", "limma", "IlluminaHumanMethylationEPICv2manifest",
    "IlluminaHumanMethylationEPICv2anno.ilm10b5.hg38", "DMRcate", "RColorBrewer",
    "missMethyl", "bumphunter"))
# Install Illumina EPICv2 annotation
BiocManager::install("jokergoo/IlluminaHumanMethylationEPICv2anno.20a1.hg38")

# Load libraries
library(minfi)
library(limma)
library(IlluminaHumanMethylationEPICv2manifest)
library(IlluminaHumanMethylationEPICv2anno.20a1.hg38)
library(DMRcate)
library(RColorBrewer)
library(missMethyl)
library(bumphunter)
library(ggplot2)

3 Reading in the data

Load the files in the correct directory and set the correct directory The sample sheet needs to be formatted correctly. I used chatGPT to do this.

# Replace with your actual path to the iDAT files
idatDir <- "/Users/joelbrunomendes/Documents/Methylation_data/run0"
# setwd(idatDir)

# Read in the sample sheet (CSV file that describes your samples) The sample
# sheet should have at minimum: Sample_Name, Basename, Group columns Replace
# with your actual sample sheet file name
sampleSheet <- read.csv("/Users/joelbrunomendes/Documents/Methylation_data/run0/run0_cleaned.csv",
    header = TRUE)

Read raw IDAT files from your methylation array run and returns an object of class RGChannelSet, which contains the raw red and green signal intensities for each probe on the array. Then calculate the detection p-value for each probe in each sample, based on the raw signal in the RGset object. Assess whether the signal for a probe is significantly above background noise. e.g., < 0.01 means the probe’s signal is reliably detected in that sample. A high p-value (e.g., > 0.01) suggests the measurement is unreliable due to low intensity, technical issues, or poor hybridization. You can use the oppurtunity to produce a basic QC report from the data at this stage.

# Read the IDAT files Create a target object from the sample sheet
targets <- data.frame(sampleSheet)
RGset <- read.metharray.exp(base = idatDir)

detP <- detectionP(RGset)
# Plot QC plot
qcReport(RGset, sampNames = sampleSheet$Sample_Name, pdf = "qcReport.pdf")

## quartz_off_screen 
##                 2

4 Inspect the data

Some of the probes in the EPICv2 array will have failed. We can scan for these using the p value.

# Check failed positions per sample (e.g., p-value > 0.01)
failed <- detP > 0.01
failedPerSample <- colSums(failed)
failedPerSample

## 209212100024_R01C01 209212100024_R02C01 209212100024_R03C01 209212100024_R04C01 
##                 964                1486                1097                1382 
## 209212100024_R05C01 209212100024_R06C01 209212100024_R07C01 209387440134_R01C01 
##                1441                6321                1041                2516 
## 209387440134_R02C01 209387440134_R03C01 209387440134_R04C01 209387440134_R05C01 
##                2168                1376                1403                1266 
## 209387440134_R06C01 209387440134_R07C01 209387440134_R08C01 
##                1296                1684                1563

We can plot the failed probes per sample by bar plot

# Visualize the proportion of failed probes per sample
barplot(failedPerSample, las = 2, cex.names = 0.8, ylab = "Number of failed probes",
    main = "Failed probes per sample")

One sample in particular has a high amount of failed probes per sample but in general there is a decent distribution that we can filter out from the analysis. One thing to improve in this plot is the names of the samples.

5 Preprocessing the data

Peform the normalisation using the most complete processing function. Other options exist but we are using Funmorn which is pretty complete for data that is relatively good quality. It is made specifically for the illumina EPIC arrays. preprocessFunnorm() performs a two-step normalization to reduce technical variation while preserving true biological signal. It adjusts for differences in non-specific signal across arrays. It uses principal components from control probes to model and remove unwanted technical variation (e.g., batch effects, dye bias, etc.).

# 3. Perform preprocessing/normalization
GRset.funnorm <- preprocessFunnorm(RGset)

we then want to find the intersection of probe IDs (CpG names like cg00000029) that are present in both: • Your normalized data (GRset.funnorm) • Your detection p-values matrix (detP)

Ensuring that only probes with both data and QC info are used.

# 4. Filter probes with high detection p-values Make sure to use the common
# probes between detP and your normalized data
common_probes <- intersect(rownames(getBeta(GRset.funnorm)), rownames(detP))
detP_filtered <- detP[common_probes, ]
failed <- detP_filtered > 0.01
keep <- rowSums(failed) < (0.05 * ncol(failed))

Then create a new filtered version of the Green/Red signals

# Create a new filtered GRset
GRset.filtered <- GRset.funnorm[common_probes[keep], ]

# 5. Filter probes overlapping SNPs (if your package supports this) If
# dropLociWithSnps is available:
if (exists("dropLociWithSnps")) {
    GRset.filtered <- dropLociWithSnps(GRset.filtered)
}

6 Computing Differentially Methylated regions

We begin by extracting M values from our pre-processed data.

# 6. Extract M-values from the filtered data
M_filtered <- getM(GRset.filtered)

In order to compute differential methylation analysis we need to create a design matrix. The design matrix is fundamental, it’s part of any model-based analysis (in this case performed by limma).

The design matrix will encode which samples belong to which group (diabetes vs control). In the code below we are telling the statistical model how to partition the samples based on the groups that label them. We assign that label (which exists in the csv file under the column Sample_group) to the variable phenoData.

When we run DMA we want to be asking “how does methylation differ between groups A and B”. A design matrix will ensure that the comparison is framed properly (mathematically), instructing the model how to estimate the group means and compare them statistically.

In more complicated cases we need to add covariates into the design (age, batch, cell identity). Tools for identifying DMRs reply on CpG (per probe) statistics that come from a linear model. After computation the probes in close proximity that show consistent differences can be grouped into regions (DMRs). Without a matrix you can’t correctly estimate the group-level diffrerences so there are no DMRs to compute.

# 7. Set up your design matrix
phenoData <- factor(targets$Sample_Group)
design <- model.matrix(~0 + phenoData)
colnames(design) <- levels(phenoData)

After this we need to create a contrast matrix. whilst creating the design matrix specifies how samples are assigned to each experimental group, you often want to test specific hypotheses—for example, “How does the diabetic group differ from the control group?” or “How does group A differ from group B and group C simultaneously?” This is where the contrast matrix comes in.

In limma, the contrast matrix defines the particular comparisons (contrasts) you want the model to estimate. Each contrast is essentially a linear combination of the group parameters you specified in your design matrix.

In our example,

diabetic_vs_control is the name of the contrast. diabetic - control tells the linear model to estimate the difference in methylation signals between the diabetic group and the control group. levels = design informs the function that these group names (“diabetic” and “control”) are the columns of the existing design matrix.

# 8. Create your contrast matrix
cont.matrix <- makeContrasts(diabetic_vs_control = diabetic - control, levels = design)

Now we can run the differential methylation analysis.

As mentioned before the DMRs are going to be computed using limma. The input for these set of function is going to be the filtered M_values that we processed before.

fit <- lmFit(M_filtered, design)

Fits a linear model to each probe (row) in M_filtered using the design matrix design. For each CpG site, lmFit models the methylation levels across samples.The design matrix specifies how samples are grouped (or which covariates are included). This step provides probe-level estimates of group means (or effect sizes) and residual variation.

fit2 <- contrasts.fit(fit, cont.matrix)

Re-fits the linear model coefficients based on the specific contrasts you have defined in cont.matrix. The initial lmFit estimates parameters for all groups in the design. However, the biological question is often “Which probes differ between group X and group Y?”. contrasts.fit transforms the fitted model to focus on the particular comparison (e.g., diabetic vs. control) indicated by your contrast matrix.

fit2 <- eBayes(fit2)

Applies an empirical Bayes shrinkage to the standard errors of the estimated coefficients. Methylation arrays measure tens of thousands (or more) of CpGs. This large-scale testing can lead to highly variable estimates of variance if you treat each probe in isolation. Empirical Bayes moderates (shrinks) the variance estimates based on all probes, making them more stable and less prone to false positives. This step improves the statistical power by borrowing strength across the many probes.

DMPs <- topTable(fit2, num=Inf, coef=1, sort.by=“P”)

What it does: Retrieves the table of differentially methylated positions, sorted by statistical significance. After fitting the model and specifying the contrast, topTable ranks probes (CpGs) by p-value, log fold change, or other criteria. num=Inf means you’re getting all results rather than a truncated top list. coef=1 tells topTable which contrast’s results you want (in this example, the diabetic vs. control contrast). By default, it includes columns for p-values, adjusted p-values, log fold changes, average expression (methylation), etc.

# 9. Run differential analysis with filtered data
fit <- lmFit(M_filtered, design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
DMPs <- topTable(fit2, num = Inf, coef = 1, sort.by = "P")

Then annotate.

Here we input the filtered M values, we specify that the data are indeed M values and declare the array type that we are using (EPICv2). Secondly we specify that the analysis we want to be doing is differential and we provide our design matrix, we say that contrast is TRUE and we also provide the constrast matrix that we also made earlier.

To run model we use the function cpg.annotate(), this will fit a linear model for each CpG probe based on the specified design and contrast. By setting coef = diabetic_vs_control, we are telling the model to focus on the specific contrast (just on how the diabetic group differs from the control).

The annotation function will return an object that contains the log fold changes, p values, etc. for each CpG site and merges in the relevant annotation details from EPICv2 array (gene context, location, etc.)

This can take a while.

# 10. Annotate for DMRcate using the filtered data
annotated <- cpg.annotate("array", M_filtered, what = "M", arraytype = "EPICv2",
    analysis.type = "differential", design = design, contrasts = TRUE, cont.matrix = cont.matrix,
    coef = "diabetic_vs_control")

Now we can run DMRcate. This will compute the differentially methylated regions (DMRs). We can play a bit with the parameters to either increse the amount of DMRs by being more lenient with the p value or we can chose to be very strict to only look at very significant DMRs. For the purposes of a hypothesis generation analysis like this one, we can afford to be slightly more lenient but not too much. We are going to go with a p value of 0.01 and then chose to filter DMRs for biological significance later (effect size).

# 11. Run DMRcate with more stringent parameters
DMRs <- dmrcate(annotated, lambda = 1000, C = 2, pcutoff = 0.01)

Then extract the DMR results as ranges

# 12. Extract the DMR results as ranges
DMR.results <- extractRanges(DMRs)

# 13. Check the number of DMRs
length(DMR.results)

## [1] 80

So we have 80 regions contained differential methylation across diabetic and control patients. Great.

In order to plot these we need to conver our DMR.results into a dataframe.

# 14. Convert DMR.results into a data frame
DMR_df <- as.data.frame(DMR.results)

Now we can plot.

A volcano plot to check DMR effect size across all the DMRs

# 15. check DMR effect size by volvano plot
ggplot(DMR_df, aes(x = maxdiff, y = -log10(min_smoothed_fdr))) + geom_point(alpha = 0.7,
    color = "blue") + geom_hline(yintercept = -log10(0.05), linetype = "dashed",
    color = "red") + labs(title = "Volcano Plot: FDR vs. Effect Size", x = "Effect Size (Max Difference)",
    y = "-log10(FDR)") + theme_minimal()

In order to get access to DMRs of higher signifcance we need to filter. In this case I am going to filter by effect size.

# 16. Filter for just the most signficant DMRs (based on effect size)
significant_dmrs <- DMR_df[DMR_df$min_smoothed_fdr < 0.05 & abs(DMR_df$maxdiff) >
    0.1, ]
significant_dmrs_df <- as.data.frame(significant_dmrs)
cat("Number of significant DMRs:", nrow(significant_dmrs), "\n")

## Number of significant DMRs: 26

Re-plot

ggplot(significant_dmrs, aes(x = maxdiff, y = -log10(min_smoothed_fdr))) + geom_point(alpha = 0.7,
    color = "blue") + geom_hline(yintercept = -log10(0.05), linetype = "dashed",
    color = "red") + labs(title = "Volcano Plot: FDR vs. Effect Size", x = "Effect Size (Max Difference)",
    y = "-log10(FDR)") + theme_minimal()

We can combine the data and plot more visually

# Add a significance flag
DMR_df$significance <- ifelse(DMR_df$min_smoothed_fdr < 0.05 & abs(DMR_df$maxdiff) >
    0.1, "Significant", "Not significant")

# Plot using color = significance
ggplot(DMR_df, aes(x = maxdiff, y = -log10(min_smoothed_fdr), color = significance)) +
    geom_point(alpha = 0.7) + scale_color_manual(values = c(Significant = "blue",
    `Not significant` = "orange")) + geom_hline(yintercept = -log10(0.05), linetype = "dashed",
    color = "red") + labs(title = "Volcano Plot: DMR Significance (FDR < 0.05 & Effect Size > 0.1)",
    x = "Effect Size (Max Difference)", y = "-log10(FDR)") + theme_minimal()

7 Computing GO and KEGG enrichment

Example Code for GO and KEGG Enrichment Analysis

The previous function that we used was cpg.annotate() which returns statistics about methylation differences containing annotations merged in. In order to do GO or KEGG analysis we will need individual probe features including knowledge of chromosomes and genes.

Clean up the suffixes from the probe names

First I am going to investigate whether the duplicated probes that we get are due to technical repeats or not

# Extract cleaned IDs from your original DMPs or annotation
clean_rownames <- function(x) sub("_.*", "", x)

# Add cleaned ID to DMPs
DMPs$clean_id <- clean_rownames(rownames(DMPs))

# Check which CpGs are duplicated
dups <- DMPs$clean_id[duplicated(DMPs$clean_id)]
dups_unique <- unique(dups)

# Inspect some duplicates
subset(DMPs, clean_id %in% dups_unique[1:5])

annot <- getAnnotation(GRset.filtered)
annot$clean_id <- clean_rownames(rownames(annot))

# See how many duplicates exist in annotation
table(duplicated(annot$clean_id))

## 
##  FALSE   TRUE 
## 900717   6050

# Inspect duplicated annotations
dup_annot <- annot[annot$clean_id %in% dups_unique[1:10], ]
View(dup_annot)

You’ve got multiple rows with the same clean_id (e.g., cg02853616, cg06757900, cg25830307), but different full names due to suffixes (e.g., _TC21, _BC22, _TC22, etc.).

Considering cg02853616 as a concrete example: • Same CpG ID: cg02853616 • Very similar logFC, expression, t-statistics • Identical adjusted p-values (which is expected if using same model and contrast) • Only suffix (_TC21, _TC22) is different

This pattern suggests these entries are technical variants of the same CpG — likely due to multiple probe mappings or preprocessing (e.g., batch, sample lane, region). They are not independent biological measurements and we don’t want to process them as separate probes for enrichment analysis so that we don’t artificially inflate the weight of those CpGs. So we can collapse these to a single row based on either the lowest p value or average log FC. Because the most conservative is the min p value. That’s what I am going to do.

library(dplyr)

DMPs_collapsed <- DMPs %>%
    group_by(clean_id) %>%
    slice_min(P.Value, n = 1, with_ties = FALSE) %>%
    ungroup()

# Set rownames to clean_id
rownames(DMPs_collapsed) <- DMPs_collapsed$clean_id
DMPs_collapsed$clean_id <- NULL

Compare the number of rows before and after the collapsing

nrow(DMPs)  # Before collapsing

## [1] 906767

nrow(DMPs_collapsed)  # After collapsing

## [1] 900717

Check to see if there are any duplicates that we have missed. FALSE return means we have got them all.

any(duplicated(rownames(DMPs_collapsed)))  # Should be FALSE

## [1] FALSE

We can now use gometh to do the gene ontology analysis

#{r} #sigCpGsNames <- rownames(DMPs_collapsed[DMPs_collapsed$P.Value < 0.01, ]) #allCpGs <- rownames(M_filtered) # make sure M_filtered has clean rownames too! # GOresults <- gometh( # sig.cpg = sigCpGsNames, # all.cpg = allCpGs, # collection = "GO", # array.type = "EPIC_V2" #)

The result was that there were no probes in sigCpGsNames which is a problem. Let’s troubleshoot

Check DMP distribution (sanity check)

table(DMPs_collapsed$adj.P.Val < 0.05)  # Should show TRUE/FALSE counts

## 
##  FALSE 
## 900717

summary(DMPs_collapsed$adj.P.Val)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9968  0.9968  0.9968  0.9972  0.9968  1.0000

This data confirms that actually all adjusted p-values are > 0.99 and there are no DMPs remotely close to significance at adj.P.Val < 0.05. Hence, sigCpGsNames is empty — and gometh() has no CpGs to test. There are some things that we can do. One way is to forsake FDR-adjusted p value and use the raw p-value. This is a significant loss in statistical power but will provide CpGs for an exploratory analysis. Can’t be used for confirmation.

Another thing that we can do is a region specific analysis. Becaues we got significant DMRs we can investigate just the CpGs that contribute to them. In this instance we are doing Gene ontology analysis on differentially methylated regions rather than on signficantly methylated CpGs.

sigCpGsFromDMRs <- unique(names(annotated@ranges))

#```{r} #GOresults_dmrs <- gometh( # sig.cpg = sigCpGsFromDMRs, # all.cpg = rownames(M_filtered), # array.type = “EPIC_V2” # collection = “GO”, #)

#topGO(GOresults_dmrs) #```

This approach still yielded no annotated genes to the significant CpGs.

head(sigCpGsFromDMRs)

## [1] "chr1:191550" "chr1:605420" "chr1:777872" "chr1:777913" "chr1:779860"
## [6] "chr1:817115"

sigCpGsFromDMRs <- annotated@ranges@elementMetadata@listData$ID
sigCpGsFromDMRs <- unique(sigCpGsFromDMRs)
head(sigCpGsFromDMRs)

## NULL

sigCpGsFromDMRs <- rownames(M_filtered)[annotated@ranges@elementMetadata@listData$is.sig]

sigCpGsFromDMRs <- sub("_.*", "", sigCpGsFromDMRs)

head(sigCpGsFromDMRs)

## character(0)

length(sigCpGsFromDMRs)

## [1] 0

Error: There are no genes annotated to the significant CpGs

This means that the enrichment analysis is running on an empty or invalid CpG list and we’re losing the chance to biologically interpret your DMRs. Even we do have real DMRs, they’re not being leveraged for GO/KEGG.

8 DMR-level Enrichment via Gene Mapping

Load the EPICv2 annotation and match DMRs to genes.

library(IlluminaHumanMethylationEPICv2anno.20a1.hg38)
data(IlluminaHumanMethylationEPICv2anno.20a1.hg38)
anno_v2 <- IlluminaHumanMethylationEPICv2anno.20a1.hg38

# Convert EPICv2 annotation to GRanges
cpg_anno_gr <- getLocations(anno_v2)

Overlap the DMRs with the annotation

library(GenomicRanges)

overlaps <- findOverlaps(DMR.results, cpg_anno_gr)

annotated_dmrs <- cbind(as.data.frame(DMR.results[queryHits(overlaps)]), as.data.frame(cpg_anno_gr[subjectHits(overlaps)]))

Extract Gene symbols by first inspecting the columns, the column containing the gene symbols is “overlapping.genes”.

colnames(annotated_dmrs)

##  [1] "seqnames"          "start"             "end"              
##  [4] "width"             "strand"            "no.cpgs"          
##  [7] "min_smoothed_fdr"  "Stouffer"          "HMFDR"            
## [10] "Fisher"            "maxdiff"           "meandiff"         
## [13] "overlapping.genes" "seqnames"          "start"            
## [16] "end"               "width"             "strand"

Extract the gene symbols. From this initial inspection it looks like there ar 51 genes genes that lie within or adjacent to your differentially methylated regions (DMRs). The methylation change (hyper or hypo) observed in those DMRs could influence the regulation (expression/silencing) of these genes. These 51 genes are now candidate targets — potentially involved in epigenetic regulation tied to diabetic retinopathy.

genes_from_dmrs <- unique(unlist(strsplit(annotated_dmrs$overlapping.genes, ";")))
genes_from_dmrs <- genes_from_dmrs[genes_from_dmrs != ""]

head(genes_from_dmrs)

## [1] "PHC3"               "KDM2B"              "SLC39A14"          
## [4] "GABBR1"             NA                   "AC011899.9, PTPRN2"

length(genes_from_dmrs)

## [1] 51

Convert to Entez IDs for enrichment

library(clusterProfiler)
library(org.Hs.eg.db)

entrez_genes <- bitr(genes_from_dmrs, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = org.Hs.eg.db)

GO/KEGG enrichment from gene list

ego <- enrichGO(gene = entrez_genes$ENTREZID, OrgDb = org.Hs.eg.db, keyType = "ENTREZID",
    ont = "BP", pvalueCutoff = 0.05, qvalueCutoff = 0.2)
dotplot(ego)

In the plot above, each dot represents a GO biological process statistically enriched in my list of genes. The Y axis represents the biological term (expressed as a GO term) and the X axis represents the ratio of overlapping genes to total input genes. The size of the dot expresses the number of genes in that GO term and the colour is used to express statistical significance. All dots are above a significance threshold of p. adjust 0.05.

The top enriched processes are firstly the regulation of neuron apoptosis and secondly post-synaptic density/specialisation/organization. Regulation of neuron apoptotic process accounted for roughly 16% of the genes involved and presents an interesting avenue for DR research given than the disease is liked to neurodeneration and the apoptosis of retinal neurons. Postsynaptic density/specialisation and organisation include several terms that relate to excitatory neurons. A differential methylation phenomenon could impact neuronal signalling, synaptic function and neurovascular communication which is integral to the gliovascular unit.

Interest within in these studies relate to the retina being an extension of the CNS, coupling with a complex neurovascular structure. Diabetic Retinopathy is linked to neurogeneration, synaptic dysfunction and apoptosis of retinal neurons. These GO terms suggest that the DMRs may be targeting genes involved in neuronal survival and synapse function, consistent with early, neurodegenerative changes in diabetic eyes.

The next step is to map the gene symbols of the individual targets and then visualise them with a cnet plot

head(ego)

cnetplot(ego, showCategory = 5)

The cnetplot demonstrates a tight network of genes involved in postsynaptic organization and assembly processes. A coordinated epigenetic impact on synaptic structure can have a likely involvement in synaptic function critical in neurodegeneration and retinal health. These genes may play a role in regulating neuronal stability under metabolic stress like diabetic retinopathy. We can see what these genes are individually or we can inspect all the genes in the cnetplot.

bitr(c("2895", "2185", "9113"), fromType = "ENTREZID", toType = "SYMBOL", OrgDb = org.Hs.eg.db)

We can compare the direction of methylation in the DMRs to hypothesis functional consequences.

head(DMR_df[, c("overlapping.genes", "meandiff", "maxdiff")])

DMR_df$direction <- ifelse(DMR_df$meandiff > 0, "Hypermethylated", "Hypomethylated")
table(DMR_df$direction)

## 
## Hypermethylated  Hypomethylated 
##              22              58

library(ggplot2)

ggplot(DMR_df, aes(x = direction, fill = direction)) + geom_bar() + scale_fill_manual(values = c(Hypermethylated = "tomato",
    Hypomethylated = "skyblue")) + theme_minimal() + labs(title = "Distribution of DMR Methylation Direction",
    x = "Methylation Direction", y = "Number of DMRs", fill = "Direction")

After annotation, we need to split the genes based on their DMR methylation direction and visualise the overlap.

hyper_genes <- unique(unlist(strsplit(DMR_df$overlapping.genes[DMR_df$direction ==
    "Hypermethylated"], ";")))
hypo_genes <- unique(unlist(strsplit(DMR_df$overlapping.genes[DMR_df$direction ==
    "Hypomethylated"], ";")))

# Remove NA and empty string values
hyper_genes <- hyper_genes[!is.na(hyper_genes) & hyper_genes != ""]
hypo_genes <- hypo_genes[!is.na(hypo_genes) & hypo_genes != ""]

library(ggVennDiagram)

gene_sets <- list(Hyper = hyper_genes, Hypo = hypo_genes)
gene_sets <- lapply(gene_sets, function(x) x[!is.na(x) & x != ""])

ggVennDiagram(gene_sets, label_alpha = 0) + scale_fill_gradient(low = "skyblue",
    high = "tomato") + theme_minimal() + labs(title = "Genes in Hyper vs Hypo-methylated DMRs")

Between these two plots we have discovered how many DMRs are hypo/hyper-methylated across Diabetic and non diabetic samples. We have also discovered how many genes in those DMRs are affected (and in what way).

We didn’t see any overlap between hypo and hyper in the last plot, which is good. We can confirm this again by using intersect().

intersect(hyper_genes, hypo_genes)

## character(0)

Now we can re-run the GO analysis using hyper and hypo genes separately. This lets us understand what function are being upregulated and which function are being silenced by methylation and if there are diff biological processes enriched in either hypo or hyper methylation.

We need to start by splitting annotation and extract the CpG IDs that match hypo/hyper genes.

# Split annotated object by direction Assuming these are SYMBOLs
hyper_genes <- unique(annotated_dmrs$overlapping.genes[annotated_dmrs$meandiff >
    0])
hypo_genes <- unique(annotated_dmrs$overlapping.genes[annotated_dmrs$meandiff < 0])

# Clean up gene lists (e.g., from 'A1BG;TUSC3' format)
hyper_genes <- unique(unlist(strsplit(hyper_genes, ";")))
hypo_genes <- unique(unlist(strsplit(hypo_genes, ";")))

# Remove blanks
hyper_genes <- hyper_genes[hyper_genes != ""]
hypo_genes <- hypo_genes[hypo_genes != ""]

library(clusterProfiler)
library(org.Hs.eg.db)

# Convert to ENTREZ
hypo_entrez <- bitr(hypo_genes, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = org.Hs.eg.db)
hyper_entrez <- bitr(hyper_genes, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = org.Hs.eg.db)

# GO enrichment
go_hypo <- enrichGO(gene = hypo_entrez$ENTREZID, OrgDb = org.Hs.eg.db, ont = "BP",
    pAdjustMethod = "BH")
go_hyper <- enrichGO(gene = hyper_entrez$ENTREZID, OrgDb = org.Hs.eg.db, ont = "BP",
    pAdjustMethod = "BH")

# Visualize
dotplot(go_hypo, title = "GO: Hypomethylated DMRs")

dotplot(go_hyper, title = "GO: Hypermethylated DMRs")

Key Observations for Hypomethylation plot:

Top GO terms include:
regulation of postsynaptic density assembly
postsynaptic specialization assembly
regulation of excitatory synapse assembly
negative regulation of synaptic transmission
These all strongly cluster around synaptic structure and function, particularly postsynaptic architecture and signal regulation.
High gene ratios (0.125–0.15): This means ~12–15% of your gene list hits these categories — not bad for such specific functions.
Low adjusted p-values (FDR < 0.005): Suggesting statistically robust overrepresentation.

Hypomethylation is often associated with increased gene expression. These DMRs point to genes involved in neuronal synapse organization, especially postsynaptic structure/function. This could imply enhanced or dysregulated synaptic remodeling in the retina in diabetic patients — possibly linked to neuronal stress, neuroplasticity, or excitotoxicity.

Key Observations:

Very few enriched categories — only two GO terms:
cyclic nucleotide metabolic process
cyclic purine nucleotide metabolic process
Smaller gene count (n = 2) and moderate gene ratio (~22.5%) because the gene list is smaller overall.
Adjusted p-value ~0.02 — statistically significant but not as strong as the hypo plot.

Biological Interpretation:

Hypermethylation typically suppresses gene expression.
The enriched genes may be involved in cAMP/cGMP signaling, which regulates:
Retinal signaling cascades
Ion channels
Cellular responses to glucose stress
Suppression of this pathway could signal blunted neurochemical signaling, energy imbalance, or inhibition of adaptive cellular responses in the diabetic retina.

Feature	Hypomethylated DMRs	Hypermethylated DMRs
# Enriched GO Terms	Many (~10+)	2
Biological Focus	Postsynaptic structure, excitatory synapse	cAMP/cGMP signaling/metabolic regulation
Directional Implication	Possibly upregulation of neuronal signaling	Possibly suppression of cyclic nucleotide signaling
Disease Relevance	Suggests synaptic remodeling / plasticity	Suggests loss of metabolic signaling adaptation

We can inspect the gene list manually also:

# Create a set of known diabetic retinopathy-associated genes (simplified for
# now)
retinopathy_genes <- c("GDNF", "CDH23", "PDE4B", "PTK2B")  # Add more as needed

# Flag multi-gene entries
is_multigene <- function(x) grepl(",", x)

# Pad lists to equal length
max_len <- max(length(hypo_genes), length(hyper_genes))
hypo_genes_padded <- c(as.character(hypo_genes), rep(NA, max_len - length(hypo_genes)))
hyper_genes_padded <- c(as.character(hyper_genes), rep(NA, max_len - length(hyper_genes)))

# Build table with flags
dmr_gene_table <- data.frame(`Hypomethylated DMR Genes` = hypo_genes_padded, `Hypo → Multi-gene` = ifelse(is_multigene(hypo_genes_padded),
    "✓", ""), `Hypo → Retina Gene` = ifelse(hypo_genes_padded %in% retinopathy_genes,
    "✓", ""), `Hypermethylated DMR Genes` = hyper_genes_padded, `Hyper → Multi-gene` = ifelse(is_multigene(hyper_genes_padded),
    "✓", ""), `Hyper → Retina Gene` = ifelse(hyper_genes_padded %in% retinopathy_genes,
    "✓", ""), check.names = FALSE)

# Display entire table in Markdown
knitr::kable(dmr_gene_table, caption = "DMR Genes by Methylation Direction, Annotated for Multi-Gene Entries and Retina-Related Genes")

DMR Genes by Methylation Direction, Annotated for Multi-Gene Entries and Retina-Related Genes
Hypomethylated DMR Genes	Hypo → Multi-gene	Hypo → Retina Gene	Hypermethylated DMR Genes	Hyper → Multi-gene	Hyper → Retina Gene
PHC3			KDM2B
SLC39A14			NA
GABBR1			TRAF7
NA			PDE4B		✓
AC011899.9, PTPRN2	✓		ACTR1A
TRIM10			AC116609.3
CADPS2			RP11-368L12.1
RP11-416O18.1			CLC
RP3-323P13.2			GUCY2C
PKP1			SPIRE2
RP11-63A23.2			HLA-L, HCG17, HCG18	✓
PDIA5, SEC22A	✓		PRKAR1B
EPS8L2, AP006621.9	✓		SOX9-AS1
ACSF2			AC007879.5
F13A1			RP3-467K16.4
RP1-179N16.6			ZNRF3
AC009784.3			NA
CTC-431G16.2			NA
GRID2			NA
GDNF		✓	NA
CDH23		✓	NA
CBFA2T3			NA
PTK2B		✓	NA
AC019330.1			NA
C9orf84			NA
SRRM1			NA
SLC1A6			NA
ARHGAP30			NA
SP100, AC010149.4	✓		NA
LATS1			NA
AL121932.1, LINC00533	✓		NA
COL26A1			NA
AIG1			NA
CTD-2008A1.2			NA
EIF2AK4			NA
TBKBP1			NA

Plotting entire DMRs

if (!requireNamespace("Gviz", quietly = TRUE)) BiocManager::install("Gviz")
if (!requireNamespace("TxDb.Hsapiens.UCSC.hg38.knownGene", quietly = TRUE)) BiocManager::install("TxDb.Hsapiens.UCSC.hg38.knownGene")
if (!requireNamespace("org.Hs.eg.db", quietly = TRUE)) BiocManager::install("org.Hs.eg.db")

library(Gviz)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(org.Hs.eg.db)

# Assuming DMR.results is a GRanges object
dmr_to_plot <- DMR.results[1]

# Add a small buffer around the DMR
plot_range <- resize(dmr_to_plot, width = width(dmr_to_plot) + 4000, fix = "center")

# Genome axis track
axisTrack <- GenomeAxisTrack()

# Gene annotation track
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
genesTrack <- GeneRegionTrack(txdb, chromosome = as.character(seqnames(plot_range)),
    start = start(plot_range), end = end(plot_range), genome = "hg38", name = "Genes")

# DMR region track
dmrTrack <- AnnotationTrack(plot_range, genome = "hg38", name = "DMR", fill = "tomato",
    col = "black")

plotTracks(list(axisTrack, genesTrack, dmrTrack), from = start(plot_range), to = end(plot_range),
    chromosome = as.character(seqnames(plot_range)), transcriptAnnotation = "symbol",
    background.title = "gray90")

In the next analysis I want to see if I can use the DMPs to determine the biological age of the donour eyes. Methylation clocks (like methyAge() or methscore()) are trained models. These clocks are pre-built, often using Elastic Net regression on large public datasets (e.g., from thousands of individuals). They rely on specific CpG sites — often around 300–1000 — that have been shown to correlate with chronological age, or other traits like mortality risk, immune age, etc. Your data just needs to cover those CpG sites with reliable beta values. My own CpGs do not need to be significantly differentially methylated because I’m not building a new model — you’re applying an existing predictive model to your samples.

There are a few models that can be used I will trial two.

# Load libraries
library(ggplot2)

# Read in your cleaned file
pheno <- read.csv("/Users/joelbrunomendes/Documents/Methylation_data/run0/match_4_methage_REBUILT.csv",
    stringsAsFactors = FALSE)

# Make sure Age is numeric
pheno$Age <- as.numeric(pheno$Age)

# Drop missing Age or Sample_Group entries
pheno <- na.omit(pheno[, c("Age", "Sample_Group")])

# Calculate mean age per group
library(dplyr)
mean_age <- pheno %>%
    group_by(Sample_Group) %>%
    summarise(Mean_Age = mean(Age))

# Plot using ggplot2
ggplot(mean_age, aes(x = Sample_Group, y = Mean_Age, fill = Sample_Group)) + geom_bar(stat = "identity",
    width = 0.6) + scale_fill_manual(values = c(control = "skyblue", diabetic = "tomato")) +
    labs(title = "Mean Age by Group (Control vs Diabetic)", x = "Group", y = "Mean Age (years)") +
    theme_minimal(base_size = 14) + theme(legend.position = "none")

9 Horvath Methylation Clock

We start by installing the required package - use bioconductor to do so.

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")

BiocManager::install("methylclock")

library(methylclock)

Then estimate methylation age. In order to do so we will need to have access to the beta scores from before. There is another function called beta() that sometimes gets confused amd you’ll need to overwrite the Betas. I’m going to do it in this case.

Another issue that can happen is that the names of the CpG sites are not compatible with the naming system used by the Horvath clock. If this is the case we will need to clean those up.

# get Betas again
beta <- getBeta(GRset.funnorm)  # or whatever your normalized object is called
# Can verify the cass and dimension if you want, just un-# below class(beta) #
# should now return 'matrix' dim(beta) # should show [num_cpgs x num_samples]
rownames(beta) <- sub("_.*", "", rownames(beta))
head(rownames(beta))

## [1] "cg00381604" "cg21870274" "cg00002271" "cg08058514" "cg00004581"
## [6] "cg09261435"

library(methylclock)
# Assuming 'beta' is your matrix of beta values with rows as CpG sites and
# columns as samples and 'pheno' is your phenotype data frame with a column
# 'Age' for chronological age

# Estimate Horvath's DNAmAge
horvath_age <- DNAmAge(x = beta, method = "horvath", normalization = "none")

# Step 1: Load everything as character
pheno <- read.csv("/Users/joelbrunomendes/Documents/Methylation_data/run0/match_4_methage_REBUILT.csv",
    colClasses = "character", stringsAsFactors = FALSE)

# Step 2: See what we’ve got
unique(pheno$SampleID)  # confirm if both corrupted ID sets are there

##  [1] "2.09E+11_R01C01"    "2.09E+11_R02C01"    "2.09E+11_R03C01"   
##  [4] "2.09E+11_R04C01"    "2.09E+11_R05C01"    "2.09E+11_R06C01"   
##  [7] "2.09E+11_R07C01"    "2.09387E+11_R01C01" "2.09387E+11_R02C01"
## [10] "2.09387E+11_R03C01" "2.09387E+11_R04C01" "2.09387E+11_R05C01"
## [13] "2.09387E+11_R06C01" "2.09387E+11_R07C01" "2.09387E+11_R08C01"

# Step 3: Apply both fixes by row index (safe approach assuming order is
# preserved)
pheno$SampleID[1:7] <- gsub("2\\.09[Ee]\\+11", "209212100024", pheno$SampleID[1:7])
pheno$SampleID[8:15] <- gsub("2\\.09[Ee]\\+11", "209387440134", pheno$SampleID[8:15])

# Step 4: Check matches
matched_ids <- intersect(pheno$SampleID, horvath_age$id)
length(matched_ids)  # ✅ should now be 15

## [1] 7

# Fix the first corrupted Sentrix ID (from 209212100024)
pheno$SampleID <- gsub("2\\.09[Ee]\\+11", "209212100024", pheno$SampleID)

# Fix the second corrupted Sentrix ID (from 209387440134)
pheno$SampleID <- gsub("2\\.09387[Ee]\\+11", "209387440134", pheno$SampleID)

# Confirm fix
matched_ids <- intersect(pheno$SampleID, horvath_age$id)
length(matched_ids)  # ✅ should now return 15

## [1] 15

print(matched_ids)

##  [1] "209212100024_R01C01" "209212100024_R02C01" "209212100024_R03C01"
##  [4] "209212100024_R04C01" "209212100024_R05C01" "209212100024_R06C01"
##  [7] "209212100024_R07C01" "209387440134_R01C01" "209387440134_R02C01"
## [10] "209387440134_R03C01" "209387440134_R04C01" "209387440134_R05C01"
## [13] "209387440134_R06C01" "209387440134_R07C01" "209387440134_R08C01"

pheno <- merge(pheno, horvath_age, by.x = "SampleID", by.y = "id")
pheno$AccelHorvath <- as.numeric(pheno$Horvath) - as.numeric(pheno$Age)

boxplot(AccelHorvath ~ Sample_Group, data = pheno, main = "Horvath Epigenetic Age Acceleration",
    ylab = "Horvath - Chronological Age", xlab = "Group", col = c("skyblue", "tomato"))

About the Horvath clock.

The Horvath clock (Horvath, 2013) is a pan-tissue epigenetic clock — meaning it’s based on DNA methylation (DNAm) levels at specific CpG sites. It can estimate biological age across a wide range of human tissues. It was trained using elastic net regression on thousands of samples to identify a robust, tissue-agnostic signature of ageing.

This Multi-tissue clock is predictor of chronological aged trained to match true age. It’s based on a selection of 353 CpG sites (known as the “Horvath 353”).

The types of tissues that were included in this original method were 51 tissue and cell types compiled out of 8000 samples (healthy and diseased). The epigenetic clock was established using Illumina arrays, 27k and 450k. Tissues included: blood (whole, leukocytes), skin, liver, muscle, brain, kidney, saliva, adipose, lung, pancreas, vascular–tissue and other but eye or retina tissues were not used.

This clock is designed to generalize across these tissues — making it a “universal” or “pan-tissue” epigenetic biomarker of age.

It’s widely considered to be robust across tissues and platforms. Having a high correlation with chronological age (R ~0.96). Minimal tissue-specific calibration needed and being applicable to disease models including cancer and metabolic disease. Because the Horvath clock was built across tissues, it doesn’t require tissue-specific retraining to detect age acceleration in diseases like diabetes.

Even if there is a skew in the training dataset the blood and accessible tissues, the underlying CpGs selected (the “Horvath 353”) are conserved markers of epigenetic ageing across multiple cell types.

My results reflect retinal biological age acceleration, not systemic ageing per se. The valuable here exhists because it suggests that diabetes may accelerate molecular ageing specifically in the retina. Retinal neurovascular degeneration, chronic oxidative stress and inflammation in diabetic eyes and the emerging view of “the retina as a window into CNS ageing”. Local ageing processes may impact the retina via vascular compromise, oxidative stress and immune dysregulation.

10 Levine clock model

Same set up as before so we are going to re purpose some of the code

levine_age <- DNAmAge(x = beta, method = "levine", normalization = "none")

# Add PhenoAge to your pheno table
pheno$Levine <- levine_age$Levine

# Calculate acceleration (optional)
pheno$AccelLevine <- as.numeric(pheno$Levine) - as.numeric(pheno$Age)

And plot

boxplot(AccelLevine ~ Sample_Group, data = pheno, main = "Levine Epigenetic Age Acceleration",
    ylab = "Levine - Chronological Age", xlab = "Group", col = c("skyblue", "tomato"))

Scatter plot

library(ggplot2)

ggplot(pheno, aes(x = Horvath, y = Levine, color = Sample_Group)) + geom_point(size = 3,
    alpha = 0.8) + geom_smooth(method = "lm", se = FALSE, aes(group = 1), color = "black",
    linetype = "dashed") + scale_color_manual(values = c(control = "skyblue", diabetic = "tomato")) +
    labs(title = "Horvath vs Levine (PhenoAge) by Group", x = "Horvath DNA Methylation Age",
        y = "Levine (PhenoAge)", color = "Group") + theme_minimal(base_size = 14)