methylclock 0.7.6
This manual describes how to estimate chronological and gestational DNA methylation (DNAm) age as well as biological age using different methylation clocks. The package includes the following estimators:
The biological DNAm clocks implemented in our package are:
The main aim of this package is to facilitate the interconnection with R and Bioconductor’s infrastructure and, hence, avoiding submitting data to online calculators. Additionally, methylclock also provides an unified way of computing DNAm age to help downstream analyses.
The package depends on some R packages that can be previously installed into your computer by:
library(BiocManager)
install(c("tidyverse", "impute", "Rcpp"))
Then methylclock package is installed into your computer by executing:
library(devtools)
install_github("isglobal-brge/methylclock")
The package is loaded into R as usual:
library(methylclock)
These libraries are required to reproduce this document:
library(Biobase)
library(tibble)
library(impute)
library(ggplot2)
library(ggpmisc)
library(GEOquery)
The main function to estimate chronological and biological mDNA age is called DNAmAge while the gestational DNAm age is estimated using DNAmGA function. Both functions have similar input arguments. Next subsections detail some of the important issues to be consider before computind DNAm clocks.
The methylation data is given in the argument x. They can be either beta or M values. The argument toBetas should be set to TRUE when M values are provided. The x object can be:
A matrix with CpGs in rows and individuals in columns having the name of the CpGs in the rownames.
A data frame or a tibble with CpGs in rows and individuals in columns having the name of the CpGs in the first column (e.g. cg00000292, cg00002426, cg00003994, …) as required in the Horvath’s DNA Methylation Age Calculator website (https://dnamage.genetics.ucla.edu/home).
A GenomicRatioSet object, the default method to encapsulate methylation data in minfi Bioconductor package.
An ExpressionSet object as obtained, for instance, when downloading methylation data from GEO (https://www.ncbi.nlm.nih.gov/geo/).
In principle, data can be normalized by using any of the existing standard methods such as QN, ASMN, PBC, SWAN, SQN, BMIQ (see a revision of those methods in Wang et al. (2015)). DNAmAge function includes the BMIQ method proposed by Teschendorff et al. (2012) using Horvath’s robust implementation that basically consists of an optimal R code implementation and optimization procedures. This normalization is recommended by Horvath since it improves the predictions for his clock. This normalization procedure is very time-consuming. In order to overcome these difficulties, we have parallelize this process using BiocParallel library. This step is not mandatory, so that, you can use your normalized data and set the argument normalize equal to FALSE (default).
All the implemented methods require complete cases. DNAmAge function has an imputation method based on KNN implemented in the function knn.impute from impute Bioconductor package. This is performed when missing data is present in the CpGs used in any of the computed clocks. There is also another option based on a fast imputation method that imputes missing values by the median of required CpGs as recommended in Bohlin et al. (2016). This is recommended when analyzing 450K arrays since knn.impute for large datasets may be very time consuming. Fast imputation can be performed by setting fastImp=TRUE which is not the default value.
By default the package computes the different clocks when there are more than 80% of the required CpGs of each method. Nothing is required when having missing CpGs since the main functions will return NA for those estimators when this criteria is not meet. Let us use a test dataset (TestDataset) which is available within the package to illustrate the type of information we are obtaining:
cpgs.missing <- checkClocks(TestDataset)
There are some clocks that cannot be computed since your data do not contain the required CpGs.
These are the total number of missing CpGs for each clock :
clock Cpgs_in_clock missing_CpGs percentage
1 Horvath 353 2 0.6
2 Hannum 71 64 90.1
3 Levine 513 3 0.6
4 SkinHorvath 391 283 72.4
5 PedBE 94 91 96.8
6 Wu 111 2 1.8
7 TL 140 137 97.9
cpgs.missing.GA <- checkClocksGA(TestDataset)
There are some clocks that cannot be computed since your data do not contain the required CpGs.
These are the total number of missing CpGs for each clock :
clock Cpgs_in_clock missing_CpGs percentage
1 Knight 148 0 0.0
2 Bohlin 87 87 100.0
3 Mayne 62 0 0.0
4 Lee 1125 1072 95.3
The objects cpgs.missing and cpgs.missing.GA are lists having the missing CpGs of each clock
names(cpgs.missing)
[1] "Horvath" "Hannum" "Levine" "skinHorvath" "PedBE" "Wu" "TL"
We can see which are those CpGs for a given clock (for example Hannum) just typing:
cpgs.missing$Hannum
[1] "cg20822990" "cg22512670" "cg25410668" "cg04400972" "cg16054275" "cg10501210"
[7] "ch.2.30415474F" "cg22158769" "cg02085953" "cg06639320" "cg22454769" "cg24079702"
[13] "cg23606718" "cg22016779" "cg03607117" "cg07553761" "cg00481951" "cg25478614"
[19] "cg25428494" "cg02650266" "cg08234504" "cg23500537" "cg20052760" "cg16867657"
[25] "cg06685111" "cg00486113" "cg13001142" "cg20426994" "cg14361627" "cg08097417"
[31] "cg07955995" "cg22285878" "cg03473532" "cg08540945" "cg07927379" "cg16419235"
[37] "cg07583137" "cg22796704" "cg19935065" "cg23091758" "cg23744638" "cg04940570"
[43] "cg11067179" "cg22213242" "cg06419846" "cg02046143" "cg00748589" "cg18473521"
[49] "cg01528542" "ch.13.39564907R" "cg03032497" "cg04875128" "cg09651136" "cg03399905"
[55] "cg04416734" "cg07082267" "cg14692377" "cg06874016" "cg21139312" "cg02867102"
[61] "cg19283806" "cg14556683" "cg07547549" "cg08415592"
In Section 4.1 we describe how to change this 80% threshold.
The EEAA method requires to estimate cell counts. We use the package meffil (Min et al. (2018)) that provides some functions to estimate cell counts using predefined datasets. This is performed by setting cell.count=TRUE (default value). The reference panel is passed through the argument cell.count.reference. So far, the following options are available:
miffil package. It includes CD14, Bcell, CD4T, CD8T, NK, Gran.Next we illustrate how to estimate the chronological DNAm age using several datasets which aim to cover different data input formats.
csv with CpGs in rows)Let us start by reproducing the results proposed in Horvath (2013). It uses the format available in the file ’MethylationDataExample55.csv" from his tutorial (available here). These data are available at methylclock package. Although these data can be loaded into R by using standard functions such as read.csv we hihgly recommend to use functions from tidiverse, in particular read_csv from readr package. The main reason is that currently researchers are analyzing Illumina 450K or EPIC arrays that contains a huge number of CpGs that can take a long time to be loaded when using basic importing R function. These functions import csv data as tibble which is one of the possible formats of DNAmAge function
library(tidyverse)
path <- system.file("extdata", package = "methylclock")
MethylationData <- read_csv(file.path(path, "MethylationDataExample55.csv"))
MethylationData
# A tibble: 27,578 x 17
ProbeID GSM946048 GSM946049 GSM946052 GSM946054 GSM946055 GSM946056 GSM946059 GSM946062 GSM946064 GSM946065 GSM946066
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg00000292 0.706 0.730 0.705 0.751 0.715 0.634 0.682 0.635 0.728 0.777 0.601
2 cg00002426 0.272 0.274 0.311 0.279 0.178 0.269 0.330 0.501 0.197 0.282 0.203
3 cg00003994 0.0370 0.0147 0.0171 0.0290 0.0163 0.0243 0.0127 0.0206 0.0151 0.0105 0.0290
4 cg00005847 0.133 0.120 0.121 0.107 0.110 0.129 0.102 0.124 0.104 0.108 0.122
5 cg00006414 0.0309 0.0192 0.0217 0.0132 0.0181 0.0243 0.0199 0.0143 0.0184 0.0173 0.0179
6 cg00007981 0.0700 0.0715 0.0655 0.0719 0.0914 0.0508 0.0294 0.0564 0.0458 0.0377 0.0413
7 cg00008493 0.993 0.993 0.993 0.994 0.991 0.994 0.993 0.996 0.992 0.994 0.994
8 cg00008713 0.0215 0.0202 0.0187 0.0169 0.0162 0.0143 0.0172 0.0189 0.0194 0.0188 0.0153
9 cg00009407 0.0105 0.00518 0.00410 0.00671 0.00758 0.00518 0.00543 0.00624 0.00642 0.00680 0.00712
10 cg00010193 0.634 0.635 0.621 0.639 0.599 0.591 0.594 0.583 0.610 0.631 0.618
# ... with 27,568 more rows, and 5 more variables: GSM946067 <dbl>, GSM946073 <dbl>, GSM946074 <dbl>, GSM946075 <dbl>,
# GSM946076 <dbl>
IMPORTANT NOTE: Be sure that the first column contains the CpG names. Sometimes, your imported data look like this one (it can happen, for instance, if the csv file was created in R without indicating row.names=FALSE)
> mydata
# A tibble: 473,999 x 6
X1 Row.names BIB_15586_1X BIB_33043_1X EDP_5245_1X KAN_584_1X
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 cg000000~ 0.635 0.575 0.614 0.631
2 2 cg000001~ 0.954 0.948 0.933 0.950
3 3 cg000001~ 0.889 0.899 0.901 0.892
4 4 cg000001~ 0.115 0.124 0.107 0.123
5 5 cg000002~ 0.850 0.753 0.806 0.815
6 6 cg000002~ 0.676 0.771 0.729 0.665
7 7 cg000002~ 0.871 0.850 0.852 0.863
8 8 cg000003~ 0.238 0.174 0.316 0.206
If so, the first column must be removed before being used as the input object in DNAmAge funcion. It can be done using dplyr function
> mydata2 <- select(mydata, -1)
# A tibble: 473,999 x 5
Row.names BIB_15586_1X BIB_33043_1X EDP_5245_1X KAN_584_1X
<chr> <dbl> <dbl> <dbl> <dbl>
1 cg000000~ 0.635 0.575 0.614 0.631
2 cg000001~ 0.954 0.948 0.933 0.950
3 cg000001~ 0.889 0.899 0.901 0.892
4 cg000001~ 0.115 0.124 0.107 0.123
5 cg000002~ 0.850 0.753 0.806 0.815
6 cg000002~ 0.676 0.771 0.729 0.665
7 cg000002~ 0.871 0.850 0.852 0.863
8 cg000003~ 0.238 0.174 0.316 0.206
In any case, if you use the object mydata that contains the CpGs in the second column, you will see this error message:
> DNAmAge(mydata)
Error in DNAmAge(mydata) : First column should contain CpG names
Once data is in the proper format, DNAmAge can be estimated by simply:
age.example55 <- DNAmAge(MethylationData)
Warning in predAge(cpgs.imp, coefHannum, intercept = FALSE, min.perc): The number of missing CpGs for Hannum clock exceeds 80 %.
---> This DNAm clock will be NA.
Warning in predAge(cpgs.imp, coefSkin, intercept = TRUE, min.perc): The number of missing CpGs for Skin clock exceeds 80 %.
---> This DNAm clock will be NA.
Warning in predAge(cpgs.imp, coefPedBE, intercept = TRUE, min.perc): The number of missing CpGs for PedBE clock exceeds 80 %.
---> This DNAm clock will be NA.
Warning in predAge(cpgs.imp, coefTL, intercept = TRUE, min.perc): The number of missing CpGs for TL clock exceeds 80 %.
---> This DNAm clock will be NA.
age.example55
# A tibble: 16 x 9
id Horvath Hannum Levine BNN skinHorvath PedBE Wu TL
<chr> <dbl> <lgl> <dbl> <dbl> <lgl> <lgl> <dbl> <lgl>
1 GSM946048 51.8 NA -30.3 56.4 NA NA 1.08 NA
2 GSM946049 39.8 NA -29.6 42.1 NA NA 0.808 NA
3 GSM946052 26.4 NA -33.3 25.6 NA NA 0.772 NA
4 GSM946054 34.0 NA -36.0 28.0 NA NA 0.941 NA
5 GSM946055 10.1 NA -52.8 13.4 NA NA 0.456 NA
6 GSM946056 20.4 NA -42.2 16.7 NA NA 0.621 NA
7 GSM946059 6.00 NA -44.8 7.54 NA NA 0.258 NA
8 GSM946062 34.6 NA -23.2 34.6 NA NA 0.624 NA
9 GSM946064 7.91 NA -49.8 12.0 NA NA 0.237 NA
10 GSM946065 4.72 NA -48.2 6.43 NA NA 0.396 NA
11 GSM946066 29.6 NA -39.9 28.5 NA NA 0.413 NA
12 GSM946067 1.38 NA -48.3 3.48 NA NA 0.122 NA
13 GSM946073 56.0 NA -26.7 47.3 NA NA 0.714 NA
14 GSM946074 24.0 NA -39.7 23.3 NA NA 0.676 NA
15 GSM946075 9.38 NA -45.4 11.9 NA NA 0.251 NA
16 GSM946076 38.8 NA -27.5 41.4 NA NA 0.599 NA
As mention in Section 3.4 some clocks returns NA when there are more than 80% of the required CpGs are missing as we can see when typing
missCpGs <- checkClocks(MethylationData)
There are some clocks that cannot be computed since your data do not contain the required CpGs.
These are the total number of missing CpGs for each clock :
clock Cpgs_in_clock missing_CpGs percentage
1 Horvath 353 0 0.0
2 Hannum 71 64 90.1
3 Levine 513 0 0.0
4 SkinHorvath 391 282 72.1
5 PedBE 94 91 96.8
6 Wu 111 0 0.0
7 TL 140 137 97.9
Here we can observe that 72.1% of the required CpGs for SkinHorvath clock are missing. We could estimate DNAm age using this clock just changing the argument min.perc in DNAmAge. For example, we can indicate that the minimum amount of required CpGs for computing a given clock should be 25%.
age.example55.2 <- DNAmAge(MethylationData, min.perc = 0.25)
Warning in predAge(cpgs.imp, coefHannum, intercept = FALSE, min.perc): The number of missing CpGs for Hannum clock exceeds 25 %.
---> This DNAm clock will be NA.
Warning in predAge(cpgs.imp, coefPedBE, intercept = TRUE, min.perc): The number of missing CpGs for PedBE clock exceeds 25 %.
---> This DNAm clock will be NA.
Warning in predAge(cpgs.imp, coefTL, intercept = TRUE, min.perc): The number of missing CpGs for TL clock exceeds 25 %.
---> This DNAm clock will be NA.
age.example55.2
# A tibble: 16 x 9
id Horvath Hannum Levine BNN skinHorvath PedBE Wu TL
<chr> <dbl> <lgl> <dbl> <dbl> <dbl> <lgl> <dbl> <lgl>
1 GSM946048 51.8 NA -30.3 56.4 7.15 NA 1.08 NA
2 GSM946049 39.8 NA -29.6 42.1 7.09 NA 0.808 NA
3 GSM946052 26.4 NA -33.3 25.6 5.93 NA 0.772 NA
4 GSM946054 34.0 NA -36.0 28.0 6.34 NA 0.941 NA
5 GSM946055 10.1 NA -52.8 13.4 5.76 NA 0.456 NA
6 GSM946056 20.4 NA -42.2 16.7 5.79 NA 0.621 NA
7 GSM946059 6.00 NA -44.8 7.54 5.64 NA 0.258 NA
8 GSM946062 34.6 NA -23.2 34.6 5.55 NA 0.624 NA
9 GSM946064 7.91 NA -49.8 12.0 5.06 NA 0.237 NA
10 GSM946065 4.72 NA -48.2 6.43 5.48 NA 0.396 NA
11 GSM946066 29.6 NA -39.9 28.5 6.19 NA 0.413 NA
12 GSM946067 1.38 NA -48.3 3.48 4.91 NA 0.122 NA
13 GSM946073 56.0 NA -26.7 47.3 7.07 NA 0.714 NA
14 GSM946074 24.0 NA -39.7 23.3 6.23 NA 0.676 NA
15 GSM946075 9.38 NA -45.4 11.9 5.57 NA 0.251 NA
16 GSM946076 38.8 NA -27.5 41.4 6.69 NA 0.599 NA
In that case, we see as SkinHorvath clock is estimated (though it can be observed that the estimation is not very accurate - this is why we considered at least having 80% of the required CpGs).
By default all available clocks (Hovarth, Hannum, Levine, BNN, skinHorvath, …) are estimated. One may select a set of clocks by using the argument clocks as follows:
age.example55.sel <- DNAmAge(MethylationData,
clocks=c("Horvath", "BNN"))
age.example55.sel
# A tibble: 16 x 3
id Horvath BNN
<chr> <dbl> <dbl>
1 GSM946048 51.8 56.4
2 GSM946049 39.8 42.1
3 GSM946052 26.4 25.6
4 GSM946054 34.0 28.0
5 GSM946055 10.1 13.4
6 GSM946056 20.4 16.7
7 GSM946059 6.00 7.54
8 GSM946062 34.6 34.6
9 GSM946064 7.91 12.0
10 GSM946065 4.72 6.43
11 GSM946066 29.6 28.5
12 GSM946067 1.38 3.48
13 GSM946073 56.0 47.3
14 GSM946074 24.0 23.3
15 GSM946075 9.38 11.9
16 GSM946076 38.8 41.4
However, in epidemiological studies one is interested in assessing whether age acceleration is associated with a given trait or condition. Three different measures can be computed:
All this estimates can be obtained for each clock when providing chronological age through age argument. This information is normally provided in a different file including different covariates (metadata or sample annotation data). In this example data are available at ‘SampleAnnotationExample55.csv’ file that is also available at methylclock package:
covariates <- read_csv(file.path(path,
"SampleAnnotationExample55.csv"))
covariates
# A tibble: 16 x 14
OriginalOrder id title geo_accession TissueDetailed Tissue diseaseStatus Age PostMortemInter~ CauseofDeath
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 3 GSM946048 Autism~ GSM946048 Fresh frozen b~ occip~ 1 60 26.5 cancer
2 4 GSM946049 Contro~ GSM946049 Fresh frozen b~ occip~ 0 39 NA cardiac
3 7 GSM946052 Autism~ GSM946052 Fresh frozen b~ occip~ 1 28 43 cancer
4 9 GSM946054 Autism~ GSM946054 Fresh frozen b~ occip~ 1 39 14 cardiac
5 10 GSM946055 Autism~ GSM946055 Fresh frozen b~ occip~ 1 8 22.2 cancer
6 11 GSM946056 Autism~ GSM946056 Fresh frozen b~ occip~ 1 22 25 hypoxia
7 14 GSM946059 Contro~ GSM946059 Fresh frozen b~ occip~ 0 4 17 cardiac
8 17 GSM946062 Contro~ GSM946062 Fresh frozen b~ occip~ 0 28 13 other
9 19 GSM946064 Autism~ GSM946064 Fresh frozen b~ occip~ 1 5 25.5 hypoxia
10 20 GSM946065 Autism~ GSM946065 Fresh frozen b~ occip~ 1 2 4 hypoxia
11 21 GSM946066 Autism~ GSM946066 Fresh frozen b~ occip~ 1 30 16 cardiac
12 22 GSM946067 Contro~ GSM946067 Fresh frozen b~ occip~ 0 1 19 unknown
13 28 GSM946073 Contro~ GSM946073 Fresh frozen b~ occip~ 0 60 24.2 unknown
14 29 GSM946074 Contro~ GSM946074 Fresh frozen b~ occip~ 0 22 21.5 unknown
15 30 GSM946075 Contro~ GSM946075 Fresh frozen b~ occip~ 0 8 5 cardiac
16 31 GSM946076 Contro~ GSM946076 Fresh frozen b~ occip~ 0 30 15 hypoxia
# ... with 4 more variables: individual <dbl>, Female <dbl>, Caucasian <lgl>, FemaleOriginal <lgl>
In this case, chronological age is available at Age column:
age <- covariates$Age
head(age)
[1] 60 39 28 39 8 22
The different methylation clocks along with their age accelerated estimates can be simply computed by:
age.example55 <- DNAmAge(MethylationData, age=age,
cell.count=TRUE)
Warning in predAge(cpgs.imp, coefHannum, intercept = FALSE, min.perc): The number of missing CpGs for Hannum clock exceeds 80 %.
---> This DNAm clock will be NA.
Warning in predAge(cpgs.imp, coefSkin, intercept = TRUE, min.perc): The number of missing CpGs for Skin clock exceeds 80 %.
---> This DNAm clock will be NA.
Warning in predAge(cpgs.imp, coefPedBE, intercept = TRUE, min.perc): The number of missing CpGs for PedBE clock exceeds 80 %.
---> This DNAm clock will be NA.
Warning in predAge(cpgs.imp, coefTL, intercept = TRUE, min.perc): The number of missing CpGs for TL clock exceeds 80 %.
---> This DNAm clock will be NA.
age.example55
# A tibble: 16 x 22
id Horvath ageAcc.Horvath ageAcc2.Horvath ageAcc3.Horvath Hannum Levine ageAcc.Levine ageAcc2.Levine ageAcc3.Levine
<chr> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl>
1 GSM946048 51.8 -8.22 -4.45 -4.91 NA -30.3 -90.3 -5.94 -6.19
2 GSM946049 39.8 0.754 2.00 1.59 NA -29.6 -68.6 3.10 5.38
3 GSM946052 26.4 -1.59 -1.67 -1.86 NA -33.3 -61.3 3.72 2.20
4 GSM946054 34.0 -5.00 -3.76 -0.463 NA -36.0 -75.0 -3.29 -0.339
5 GSM946055 10.1 2.06 -0.428 2.82 NA -52.8 -60.8 -7.77 0.964
6 GSM946056 20.4 -1.61 -2.42 -2.88 NA -42.2 -64.2 -2.76 -3.70
7 GSM946059 6.00 2.00 -0.971 -0.827 NA -44.8 -48.8 1.81 2.88
8 GSM946062 34.6 6.65 6.57 5.32 NA -23.2 -51.2 13.9 7.97
9 GSM946064 7.91 2.91 0.0589 -2.61 NA -49.8 -54.8 -3.61 -6.59
10 GSM946065 4.72 2.72 -0.489 1.46 NA -48.2 -50.2 -0.845 2.53
11 GSM946066 29.6 -0.427 -0.268 -1.37 NA -39.9 -69.9 -3.64 -2.62
12 GSM946067 1.38 0.375 -2.95 -2.19 NA -48.3 -49.3 -0.513 -3.85
13 GSM946073 56.0 -4.01 -0.242 1.62 NA -26.7 -86.7 -2.32 -1.01
14 GSM946074 24.0 2.03 1.23 -0.669 NA -39.7 -61.7 -0.227 0.0211
15 GSM946075 9.38 1.38 -1.11 -0.885 NA -45.4 -53.4 -0.362 -2.37
16 GSM946076 38.8 8.76 8.92 5.85 NA -27.5 -57.5 8.78 4.73
# ... with 12 more variables: BNN <dbl>, ageAcc.BNN <dbl>, ageAcc2.BNN <dbl>, ageAcc3.BNN <dbl>, skinHorvath <lgl>,
# PedBE <lgl>, Wu <dbl>, ageAcc.Wu <dbl>, ageAcc2.Wu <dbl>, ageAcc3.Wu <dbl>, TL <lgl>, age <dbl>
By default, the argument cell.count is set equal to TRUE and, hence, can be omitted. This implies that ageAcc3 will be computed for all clocks. In some occassions this can be very time consuming. In such cases one can simply estimate DNAmAge, accAge and accAge2 by setting cell.count=FALSE. NOTE: see section 3.5 to see the reference panels available to estimate cell counts.
Then, we can investigate, for instance, whether the accelerated age is associated with Autism. In that example we will use a non-parametric test (NOTE: use t-test or linear regression for large sample sizes)
autism <- covariates$diseaseStatus
kruskal.test(age.example55$ageAcc.Horvath ~ autism)
Kruskal-Wallis rank sum test
data: age.example55$ageAcc.Horvath by autism
Kruskal-Wallis chi-squared = 1.3346, df = 1, p-value = 0.248
kruskal.test(age.example55$ageAcc2.Horvath ~ autism)
Kruskal-Wallis rank sum test
data: age.example55$ageAcc2.Horvath by autism
Kruskal-Wallis chi-squared = 3.1875, df = 1, p-value = 0.0742
kruskal.test(age.example55$ageAcc3.Horvath ~ autism)
Kruskal-Wallis rank sum test
data: age.example55$ageAcc3.Horvath by autism
Kruskal-Wallis chi-squared = 2.8235, df = 1, p-value = 0.09289
ExpressionSet dataOne may be interested in assessing association between chronologial age and DNA methylation age or evaluating how well chronological age is predicted by DNAmAge. In order to illustrate this analysis we downloaded data from GEO corresponding to a set of healthy individuals (GEO accession number GSE58045). Data can be retrieved into R by using GEOquery package as an ExpressionSet object that can be the input of our main function.
dd <- GEOquery::getGEO("GSE58045")
gse58045 <- dd[[1]]
gse58045
ExpressionSet (storageMode: lockedEnvironment)
assayData: 27578 features, 172 samples
element names: exprs
protocolData: none
phenoData
sampleNames: GSM1399890 GSM1399891 ... GSM1400061 (172 total)
varLabels: title geo_accession ... twin:ch1 (43 total)
varMetadata: labelDescription
featureData
featureNames: cg00000292 cg00002426 ... cg27665659 (27578 total)
fvarLabels: ID Name ... ORF (38 total)
fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
pubMedIds: 22532803
Annotation: GPL8490
The chronological age is obtained by using pData function from Biobase package that is able to deal with ExpressionSet objects:
pheno <- pData(gse58045)
age <- as.numeric(pheno$`age:ch1`)
And the different DNA methylation age estimates are obtained by using DNAmAge function (NOTE: as there are missing values, the program automatically runs impute.knn function to get complete cases):
age.gse58045 <- DNAmAge(gse58045, age=age)
Imputing missing data of the entire matrix ....
Data imputed. Starting DNAm clock estimation ...
Warning in predAge(cpgs.imp, coefHannum, intercept = FALSE, min.perc): The number of missing CpGs for Hannum clock exceeds 80 %.
---> This DNAm clock will be NA.
Warning in predAge(cpgs.imp, coefSkin, intercept = TRUE, min.perc): The number of missing CpGs for Skin clock exceeds 80 %.
---> This DNAm clock will be NA.
Warning in predAge(cpgs.imp, coefPedBE, intercept = TRUE, min.perc): The number of missing CpGs for PedBE clock exceeds 80 %.
---> This DNAm clock will be NA.
Warning in predAge(cpgs.imp, coefTL, intercept = TRUE, min.perc): The number of missing CpGs for TL clock exceeds 80 %.
---> This DNAm clock will be NA.
age.gse58045
# A tibble: 172 x 22
id Horvath ageAcc.Horvath ageAcc2.Horvath ageAcc3.Horvath Hannum Levine ageAcc.Levine ageAcc2.Levine ageAcc3.Levine
<chr> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl>
1 GSM1399890 65.6 1.07 4.58 5.46 NA 50.7 -13.8 2.91 4.28
2 GSM1399891 66.3 0.197 4.06 5.06 NA 51.3 -14.8 2.45 5.12
3 GSM1399892 53.9 -5.31 -2.98 -2.42 NA 40.5 -18.7 -3.52 -2.54
4 GSM1399893 40.6 -5.23 -5.89 -6.14 NA 31.3 -14.5 -3.21 -4.67
5 GSM1399894 50.1 0.982 1.06 1.28 NA 41.1 -8.00 4.27 3.37
6 GSM1399895 63.7 -0.895 2.64 2.92 NA 48.1 -16.6 0.223 -1.25
7 GSM1399896 44.7 -0.875 -1.59 -1.76 NA 29.2 -16.4 -5.17 -4.57
8 GSM1399897 59.7 -8.55 -4.20 -3.48 NA 41.0 -27.2 -9.37 -7.06
9 GSM1399898 48.4 -5.84 -4.63 -2.50 NA 43.8 -10.4 3.31 7.18
10 GSM1399899 59.3 -3.93 -0.719 -0.609 NA 46.1 -17.1 -0.693 0.339
# ... with 162 more rows, and 12 more variables: BNN <dbl>, ageAcc.BNN <dbl>, ageAcc2.BNN <dbl>, ageAcc3.BNN <dbl>,
# skinHorvath <lgl>, PedBE <lgl>, Wu <dbl>, ageAcc.Wu <dbl>, ageAcc2.Wu <dbl>, ageAcc3.Wu <dbl>, TL <lgl>, age <dbl>
Figure shows the correlation between DNAmAge obtained from Horvath’s method and the chronological age, while Figure depicts the correlation of a new method based on fitting a Bayesian Neural Network to predict DNAmAge based on Horvath’s CpGs.
plotDNAmAge(age.gse58045$Horvath, age)
plotDNAmAge(age.gse58045$BNN, age, tit="Bayesian Neural Network")
Let us illustrate how to use DNAmAge information in association studies (e.g case/control, smokers/non-smokers, responders/non-responders, …). GEO number GSE19711 contains transcriptomic and epigenomic data of a study in lung cancer. Data can be retrieved into R by
dd <- GEOquery::getGEO("GSE19711")
gse19711 <- dd[[1]]
The object gse19711is an ExpressionSet that can contains CpGs and phenotypic (e.g clinical) information
gse19711
ExpressionSet (storageMode: lockedEnvironment)
assayData: 27578 features, 540 samples
element names: exprs
protocolData: none
phenoData
sampleNames: GSM491937 GSM491938 ... GSM492476 (540 total)
varLabels: title geo_accession ... stage:ch1 (58 total)
varMetadata: labelDescription
featureData
featureNames: cg00000292 cg00002426 ... cg27665659 (27578 total)
fvarLabels: ID Name ... ORF (38 total)
fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
pubMedIds: 20219944
Annotation: GPL8490
Let us imagine we are interested in comparing the accelerated age between cases and controls. Age and case/control status information can be obtained by:
pheno <- pData(gse19711)
age <- as.numeric(pheno$`ageatrecruitment:ch1`)
disease <- pheno$`sample type:ch1`
table(disease)
disease
bi-sulphite converted genomic whole blood DNA from Case bi-sulphite converted genomic whole blood DNA from Control
266 274
disease[grep("Control", disease)] <- "Control"
disease[grep("Case", disease)] <- "Case"
disease <- factor(disease, levels=c("Control", "Case"))
table(disease)
disease
Control Case
274 266
The DNAmAge estimates of different methods is computed by
age.gse19711 <- DNAmAge(gse19711, age=age)
Imputing missing data of the entire matrix ....
Data imputed. Starting DNAm clock estimation ...
Warning in predAge(cpgs.imp, coefHannum, intercept = FALSE, min.perc): The number of missing CpGs for Hannum clock exceeds 80 %.
---> This DNAm clock will be NA.
Warning in predAge(cpgs.imp, coefSkin, intercept = TRUE, min.perc): The number of missing CpGs for Skin clock exceeds 80 %.
---> This DNAm clock will be NA.
Warning in predAge(cpgs.imp, coefPedBE, intercept = TRUE, min.perc): The number of missing CpGs for PedBE clock exceeds 80 %.
---> This DNAm clock will be NA.
Warning in predAge(cpgs.imp, coefTL, intercept = TRUE, min.perc): The number of missing CpGs for TL clock exceeds 80 %.
---> This DNAm clock will be NA.
We can observe there are missing data. The funcion automatically impute those using impute.knn function from impute package since complete cases are required to compute the different methylation clocks. The estimates are:
age.gse19711
# A tibble: 540 x 22
id Horvath ageAcc.Horvath ageAcc2.Horvath ageAcc3.Horvath Hannum Levine ageAcc.Levine ageAcc2.Levine ageAcc3.Levine
<chr> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl>
1 GSM491937 62.9 -5.14 -0.351 -1.10 NA 61.1 -6.90 7.63 3.65
2 GSM491938 68.8 -12.2 -2.85 -2.13 NA 57.0 -24.0 -5.48 -1.54
3 GSM491939 60.0 3.96 4.54 4.37 NA 43.0 -13.0 -2.13 -2.03
4 GSM491940 57.9 -4.13 -1.45 -1.38 NA 40.9 -21.1 -8.40 -6.35
5 GSM491941 59.0 -13.0 -6.79 -6.98 NA 57.0 -15.0 0.756 -1.77
6 GSM491942 57.0 -4.00 -1.66 -1.09 NA 44.7 -16.3 -3.92 -1.35
7 GSM491943 61.9 -3.08 0.657 0.183 NA 47.9 -17.1 -3.47 -4.21
8 GSM491944 59.1 -11.9 -6.07 -5.53 NA 50.0 -21.0 -5.60 -3.06
9 GSM491945 60.7 -16.3 -8.33 -9.33 NA 47.7 -29.3 -12.0 -15.7
10 GSM491946 51.1 -7.93 -6.30 -6.33 NA 52.5 -6.54 5.26 1.30
# ... with 530 more rows, and 12 more variables: BNN <dbl>, ageAcc.BNN <dbl>, ageAcc2.BNN <dbl>, ageAcc3.BNN <dbl>,
# skinHorvath <lgl>, PedBE <lgl>, Wu <dbl>, ageAcc.Wu <dbl>, ageAcc2.Wu <dbl>, ageAcc3.Wu <dbl>, TL <lgl>, age <dbl>
The association between disease status and DNAmAge estimated using Horvath’s method can be computed by
mod.horvath1 <- glm(disease ~ ageAcc.Horvath ,
data=age.gse19711,
family="binomial")
summary(mod.horvath1)
Call:
glm(formula = disease ~ ageAcc.Horvath, family = "binomial",
data = age.gse19711)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.358 -1.160 -1.030 1.184 1.771
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.10995 0.09771 -1.125 0.2605
ageAcc.Horvath -0.02023 0.01154 -1.753 0.0795 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 748.48 on 539 degrees of freedom
Residual deviance: 745.25 on 538 degrees of freedom
AIC: 749.25
Number of Fisher Scoring iterations: 4
mod.skinHorvath <- glm(disease ~ ageAcc2.Horvath ,
data=age.gse19711,
family="binomial")
summary(mod.skinHorvath)
Call:
glm(formula = disease ~ ageAcc2.Horvath, family = "binomial",
data = age.gse19711)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.279 -1.163 -1.082 1.189 1.589
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.02970 0.08617 -0.345 0.730
ageAcc2.Horvath -0.01315 0.01209 -1.087 0.277
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 748.48 on 539 degrees of freedom
Residual deviance: 747.27 on 538 degrees of freedom
AIC: 751.27
Number of Fisher Scoring iterations: 3
mod.horvath3 <- glm(disease ~ ageAcc3.Horvath ,
data=age.gse19711,
family="binomial")
summary(mod.horvath3)
Call:
glm(formula = disease ~ ageAcc3.Horvath, family = "binomial",
data = age.gse19711)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.338 -1.163 -1.046 1.185 1.771
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.02993 0.08626 -0.347 0.729
ageAcc3.Horvath -0.01927 0.01283 -1.502 0.133
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 748.48 on 539 degrees of freedom
Residual deviance: 746.13 on 538 degrees of freedom
AIC: 750.13
Number of Fisher Scoring iterations: 4
We do not observe statistical significant association between age acceleration estimated using Horvath method and the risk of developing lung cancer. It is worth to notice that Horvath’s clock was created to predict chronological age and the impact of age acceleration of this clock on disease may be limited. On the other hand, Levine’s clock aimed to distinguish risk between same-aged individuals. Let us evaluate whether this age acceleration usin Levine’s clock is associated with lung cancer
mod.levine1 <- glm(disease ~ ageAcc.Levine , data=age.gse19711,
family="binomial")
summary(mod.levine1)
Call:
glm(formula = disease ~ ageAcc.Levine, family = "binomial", data = age.gse19711)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.592 -1.149 -0.939 1.174 1.733
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.40956 0.17894 2.289 0.02209 *
ageAcc.Levine 0.03178 0.01133 2.806 0.00502 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 748.48 on 539 degrees of freedom
Residual deviance: 740.17 on 538 degrees of freedom
AIC: 744.17
Number of Fisher Scoring iterations: 4
mod.levine2 <- glm(disease ~ ageAcc2.Levine , data=age.gse19711,
family="binomial")
summary(mod.levine2)
Call:
glm(formula = disease ~ ageAcc2.Levine, family = "binomial",
data = age.gse19711)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7053 -1.1328 -0.8614 1.1529 1.8015
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.02925 0.08718 -0.336 0.737225
ageAcc2.Levine 0.04430 0.01234 3.589 0.000332 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 748.48 on 539 degrees of freedom
Residual deviance: 734.49 on 538 degrees of freedom
AIC: 738.49
Number of Fisher Scoring iterations: 4
mod.levine3 <- glm(disease ~ ageAcc3.Levine , data=age.gse19711,
family="binomial")
summary(mod.levine3)
Call:
glm(formula = disease ~ ageAcc3.Levine, family = "binomial",
data = age.gse19711)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.354 -1.161 -1.057 1.187 1.408
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.02962 0.08622 -0.344 0.731
ageAcc3.Levine 0.01679 0.01244 1.350 0.177
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 748.48 on 539 degrees of freedom
Residual deviance: 746.62 on 538 degrees of freedom
AIC: 750.62
Number of Fisher Scoring iterations: 3
Here we observe as the risk of developing lung cancer increases 3.23 percent per each unit in the age accelerated variable (ageAcc). Similar conclusion is obtained when using ageAcc2 and ageAcc3 variables.
In some occasions cell composition should be used to assess association. This information is calculated in DNAmAge function and it can be incorporated in the model by:
cell <- attr(age.gse19711, "cell_proportion")
mod.cell <- glm(disease ~ ageAcc.Levine + cell, data=age.gse19711,
family="binomial")
summary(mod.cell)
Call:
glm(formula = disease ~ ageAcc.Levine + cell, family = "binomial",
data = age.gse19711)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9605 -1.0832 -0.6241 1.0742 2.3395
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.768206 4.380382 -2.230 0.025748 *
ageAcc.Levine 0.003959 0.012208 0.324 0.745746
cellCD4T -3.339693 3.833531 -0.871 0.383656
cellMono 10.165096 4.594096 2.213 0.026922 *
cellNeu 16.319534 4.584745 3.560 0.000372 ***
cellNK -0.882134 4.296498 -0.205 0.837326
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 748.48 on 539 degrees of freedom
Residual deviance: 686.56 on 534 degrees of freedom
AIC: 698.56
Number of Fisher Scoring iterations: 4
Here we observe as the positive association disapears after adjusting for cell counts.
dd <- GEOquery::getGEO("GSE109446")
gse109446 <- dd[[1]]
controls <- pData(gse109446)$`diagnosis:ch1`=="control"
gse <- gse109446[,controls]
age <- as.numeric(pData(gse)$`age:ch1`)
age.gse <- DNAmAge(gse, age=age)
plotCorClocks(age.gse)
Let us start by reproducing the example provided in Knight et al. (2016) as a test data set (file ‘TestDataset.csv’). It consists on 3 individuals whose methylation data are available as supplementary data of their paper. The data is also available at methylclock package as a data frame.
TestDataset[1:5,]
CpGName Sample1 Sample2 Sample3
1 cg00000292 0.72546496 0.72350947 0.69023377
2 cg00002426 0.85091763 0.80077888 0.80385777
3 cg00003994 0.05125853 0.05943935 0.05559333
4 cg00005847 0.08775420 0.11722333 0.10845113
5 cg00006414 0.03982478 0.06146891 0.03491992
The Gestational Age (in months) is simply computed by
ga.test <- DNAmGA(TestDataset)
Warning in DNAmGA(TestDataset): CpGs in all Gestational Age clocks are not present in your data. Try 'checkClocksGA' function
to find the missing CpGs of each method.
Warning in predAge(cpgs.imp, coefBohlin, intercept = TRUE, min.perc): The number of missing CpGs for Bohlin clock exceeds 80 %.
---> This DNAm clock will be NA.
Warning in DNAmGA(TestDataset): The number of missing CpGs for Lee clocks exceeds 80%.
---> This DNAm clock will be NA.
ga.test
# A tibble: 3 x 5
id Knight Bohlin Mayne Lee
<chr> <dbl> <dbl> <dbl> <lgl>
1 Sample1 38.2 NA 35.8 NA
2 Sample2 38.8 NA 36.5 NA
3 Sample3 40.0 NA 36.6 NA
like in DNAmAge we can use the parameter min.perc to set the minimum missing percentage.
The results are the same as those described in the additional file 7 of Knight et al. (2016) (link here)
Let us continue by illustrating how to compute GA of real examples. The PROGRESS cohort data is available in the additional file 8 of Knight et al. (2016). It is available at methylclock as a tibble:
progress_data
# A tibble: 148 x 151
CpGmarker `784` `1052` `1048` `1017` `956` `1038` `989` `946` `941` `1024` `1047` `1035` `988` `939` `936` `748`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg00022866 0.289 0.372 0.347 0.351 0.313 0.300 0.298 0.294 0.322 0.313 0.357 0.332 0.309 0.282 0.320 0.354
2 cg00466249 0.658 0.724 0.700 0.717 0.695 0.665 0.710 0.686 0.692 0.704 0.676 0.713 0.747 0.669 0.726 0.723
3 cg00546897 0.682 0.711 0.684 0.717 0.627 0.605 0.684 0.716 0.684 0.666 0.625 0.643 0.647 0.606 0.635 0.718
4 cg00575744 0.312 0.381 0.300 0.331 0.294 0.348 0.284 0.305 0.319 0.325 0.280 0.271 0.303 0.287 0.370 0.357
5 cg00689340 0.566 0.576 0.556 0.571 0.521 0.569 0.599 0.575 0.532 0.564 0.641 0.593 0.554 0.570 0.558 0.531
6 cg01056568 0.558 0.620 0.529 0.600 0.577 0.574 0.590 0.576 0.548 0.555 0.584 0.552 0.560 0.545 0.573 0.560
7 cg01184449 0.712 0.718 0.667 0.744 0.668 0.676 0.710 0.744 0.685 0.717 0.721 0.705 0.685 0.735 0.607 0.671
8 cg01348086 0.195 0.186 0.180 0.194 0.212 0.208 0.183 0.129 0.161 0.144 0.119 0.198 0.167 0.176 0.183 0.120
9 cg02100629 0.329 0.330 0.340 0.344 0.268 0.280 0.288 0.314 0.283 0.346 0.285 0.288 0.305 0.306 0.348 0.316
10 cg02813863 0.819 0.858 0.832 0.874 0.861 0.830 0.894 0.873 0.895 0.863 0.895 0.888 0.864 0.826 0.825 0.839
# ... with 138 more rows, and 134 more variables: 1031 <dbl>, 903 <dbl>, 864 <dbl>, 874 <dbl>, 898 <dbl>, 1013 <dbl>,
# 971 <dbl>, 966 <dbl>, 866 <dbl>, 924 <dbl>, 931 <dbl>, 1007 <dbl>, 954 <dbl>, 958 <dbl>, 1037 <dbl>, 965 <dbl>,
# 1008 <dbl>, 1005 <dbl>, 962 <dbl>, 979 <dbl>, 881 <dbl>, 876 <dbl>, 764 <dbl>, 743 <dbl>, 987 <dbl>, 930 <dbl>,
# 1023 <dbl>, 928 <dbl>, 910 <dbl>, 897 <dbl>, 1036 <dbl>, 904 <dbl>, 769 <dbl>, 907 <dbl>, 821 <dbl>, 990 <dbl>,
# 747 <dbl>, 753 <dbl>, 843 <dbl>, 761 <dbl>, 819 <dbl>, 820 <dbl>, 802 <dbl>, 805 <dbl>, 870 <dbl>, 817 <dbl>,
# 1040 <dbl>, 815 <dbl>, 952 <dbl>, 974 <dbl>, 951 <dbl>, 929 <dbl>, 980 <dbl>, 911 <dbl>, 927 <dbl>, 914 <dbl>,
# 841 <dbl>, 912 <dbl>, 969 <dbl>, 754 <dbl>, 1053 <dbl>, 884 <dbl>, 878 <dbl>, 909 <dbl>, 810 <dbl>, 863 <dbl>, ...
This file also contains different variables that are available in this tibble.
progress_vars
# A tibble: 150 x 3
id birthweight EGA
<chr> <dbl> <dbl>
1 784 2.62 38
2 1052 2.59 38.3
3 1048 3.20 38
4 1017 3.28 38.6
5 956 2.79 37.1
6 1038 2.89 38.1
7 989 2.47 38
8 946 2.42 37.7
9 941 2.96 36.7
10 1024 2.61 38.6
# ... with 140 more rows
The Clinical Variables including clinical assesment of gestational age (EGA) are available at this tibble.
The Gestational Age (in months) is simply computed by
ga.progress <- DNAmGA(progress_data)
Warning in DNAmGA(progress_data): CpGs in all Gestational Age clocks are not present in your data. Try 'checkClocksGA' function
to find the missing CpGs of each method.
Warning in predAge(cpgs.imp, coefBohlin, intercept = TRUE, min.perc): The number of missing CpGs for Bohlin clock exceeds 80 %.
---> This DNAm clock will be NA.
Warning in predAge(cpgs.imp, coefMayneGA, intercept = TRUE, min.perc): The number of missing CpGs for Mayne clock exceeds 80 %.
---> This DNAm clock will be NA.
Warning in DNAmGA(progress_data): The number of missing CpGs for Lee clocks exceeds 80%.
---> This DNAm clock will be NA.
ga.progress
# A tibble: 150 x 5
id Knight Bohlin Mayne Lee
<chr> <dbl> <dbl> <lgl> <lgl>
1 784 38.8 NA NA NA
2 1052 37.2 NA NA NA
3 1048 40.3 NA NA NA
4 1017 39.2 NA NA NA
5 956 38.9 NA NA NA
6 1038 39.2 NA NA NA
7 989 37.2 NA NA NA
8 946 35.4 NA NA NA
9 941 33.5 NA NA NA
10 1024 37.4 NA NA NA
# ... with 140 more rows
We can compare these results with the clinical GA available in the variable EGA
plotDNAmAge(ga.progress$Knight, progress_vars$EGA,
tit="GA Knight's method",
clock="GA")
Figure 3b (only for PROGRESS dataset) in Knight et al. (2016) representing the correlation between GA acceleration and birthweight can be reproduced by
library(ggplot2)
progress_vars$acc <- ga.progress$Knight - progress_vars$EGA
p <- ggplot(data=progress_vars, aes(x = acc, y = birthweight)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE, color="black") +
xlab("GA acceleration") +
ylab("Birthweight (kgs.)")
p
Finally, we can also estimate the “accelerated gestational age” using two of the the three different estimates previously described (accAge, accAge2) by provinding information of gestational age through age argument. Notice that in that case accAge3 cannot be estimates since we do not have all the CpGs required by the default reference panel to estimate cell counts for gestational age which is “andrews and bakulski cord blood.”
accga.progress <- DNAmGA(progress_data,
age = progress_vars$EGA,
cell.count=FALSE)
Warning in DNAmGA(progress_data, age = progress_vars$EGA, cell.count = FALSE): CpGs in all Gestational Age clocks are not present in your data. Try 'checkClocksGA' function
to find the missing CpGs of each method.
Warning in predAge(cpgs.imp, coefBohlin, intercept = TRUE, min.perc): The number of missing CpGs for Bohlin clock exceeds 80 %.
---> This DNAm clock will be NA.
Warning in predAge(cpgs.imp, coefMayneGA, intercept = TRUE, min.perc): The number of missing CpGs for Mayne clock exceeds 80 %.
---> This DNAm clock will be NA.
Warning in DNAmGA(progress_data, age = progress_vars$EGA, cell.count = FALSE): The number of missing CpGs for Lee clocks exceeds 80%.
---> This DNAm clock will be NA.
accga.progress
# A tibble: 150 x 8
id Knight ageAcc.Knight ageAcc2.Knight Bohlin Mayne Lee age
<chr> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <dbl>
1 784 38.8 0.792 1.27 NA NA NA 38
2 1052 37.2 -1.05 -0.488 NA NA NA 38.3
3 1048 40.3 2.29 2.77 NA NA NA 38
4 1017 39.2 0.643 1.28 NA NA NA 38.6
5 956 38.9 1.75 1.99 NA NA NA 37.1
6 1038 39.2 1.09 1.61 NA NA NA 38.1
7 989 37.2 -0.774 -0.292 NA NA NA 38
8 946 35.4 -2.36 -1.96 NA NA NA 37.7
9 941 33.5 -3.18 -3.06 NA NA NA 36.7
10 1024 37.4 -1.12 -0.486 NA NA NA 38.6
# ... with 140 more rows
One can also check which clocks can be estimated given the CpGs available in the methylation data by
checkClocksGA(progress_data)
There are some clocks that cannot be computed since your data do not contain the required CpGs.
These are the total number of missing CpGs for each clock :
clock Cpgs_in_clock missing_CpGs percentage
1 Knight 148 0 0.0
2 Bohlin 87 87 100.0
3 Mayne 62 0 0.0
4 Lee 1125 1072 95.3
We can compute the correlation among biological clocks using the function plotCorClocks that requires the package ggplot2 and ggpubr to be installed in your computer.
We can obtain, for instance, the correlation among the clocks estimated for the healthy individuals study previosuly analyze (GEO accession number GSE58045) by simply executing:
plotCorClocks(age.gse58045)
sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 LC_MONETARY=Spanish_Spain.1252
[4] LC_NUMERIC=C LC_TIME=Spanish_Spain.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_0.5.1 stringr_1.4.0
[3] dplyr_1.0.7 purrr_0.3.4
[5] readr_2.0.1 tidyr_1.1.3
[7] tidyverse_1.3.1 GEOquery_2.60.0
[9] ggpmisc_0.4.2-1 ggpp_0.4.2
[11] impute_1.66.0 tibble_3.1.4
[13] BiocStyle_2.20.2 wateRmelon_1.36.0
[15] illuminaio_0.34.0 IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.0
[17] ROC_1.68.1 lumi_2.44.0
[19] methylumi_2.38.0 minfi_1.38.0
[21] bumphunter_1.34.0 locfit_1.5-9.4
[23] iterators_1.0.13 foreach_1.5.1
[25] Biostrings_2.60.2 XVector_0.32.0
[27] SummarizedExperiment_1.22.0 MatrixGenerics_1.4.3
[29] FDb.InfiniumMethylation.hg19_2.2.0 org.Hs.eg.db_3.13.0
[31] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.44.2
[33] AnnotationDbi_1.54.1 GenomicRanges_1.44.0
[35] GenomeInfoDb_1.28.2 IRanges_2.26.0
[37] S4Vectors_0.30.0 ggplot2_3.3.5
[39] reshape2_1.4.4 scales_1.1.1
[41] matrixStats_0.60.1 limma_3.48.3
[43] Biobase_2.52.0 BiocGenerics_0.38.0
[45] cgageR_0.1.0 methylclock_0.7.6
[47] quadprog_1.5-8 devtools_2.4.2
[49] usethis_2.0.1
loaded via a namespace (and not attached):
[1] rappdirs_0.3.3 SparseM_1.81 rtracklayer_1.52.1 bit64_4.0.5
[5] knitr_1.33 DelayedArray_0.18.0 data.table_1.14.0 KEGGREST_1.32.0
[9] RCurl_1.98-1.4 generics_0.1.0 preprocessCore_1.54.0 callr_3.7.0
[13] RSQLite_2.2.8 bit_4.0.4 tzdb_0.1.2 xml2_1.3.2
[17] lubridate_1.7.10 assertthat_0.2.1 xfun_0.25 hms_1.1.0
[21] jquerylib_0.1.4 evaluate_0.14 fansi_0.5.0 restfulr_0.0.13
[25] scrime_1.3.5 progress_1.2.2 dbplyr_2.1.1 readxl_1.3.1
[29] DBI_1.1.1 reshape_0.8.8 ellipsis_0.3.2 ggpubr_0.4.0
[33] backports_1.2.1 bookdown_0.23 annotate_1.70.0 biomaRt_2.48.3
[37] sparseMatrixStats_1.4.2 vctrs_0.3.8 remotes_2.4.0 quantreg_5.86
[41] abind_1.4-5 cachem_1.0.6 withr_2.4.2 vroom_1.5.4
[45] GenomicAlignments_1.28.0 xts_0.12.1 prettyunits_1.1.1 mclust_5.4.7
[49] crayon_1.4.1 genefilter_1.74.0 labeling_0.4.2 pkgconfig_2.0.3
[53] nlme_3.1-152 pkgload_1.2.1 rlang_0.4.11 lifecycle_1.0.0
[57] nleqslv_3.3.2 MatrixModels_0.5-0 filelock_1.0.2 affyio_1.62.0
[61] BiocFileCache_2.0.0 modelr_0.1.8 cellranger_1.1.0 rprojroot_2.0.2
[65] rngtools_1.5 base64_2.0 Matrix_1.3-4 carData_3.0-4
[69] Rhdf5lib_1.14.2 zoo_1.8-9 reprex_2.0.1 processx_3.5.2
[73] png_0.1-7 rjson_0.2.20 bitops_1.0-7 KernSmooth_2.23-20
[77] rhdf5filters_1.4.0 blob_1.2.2 DelayedMatrixStats_1.14.3 doRNG_1.8.2
[81] nor1mix_1.3-0 rstatix_0.7.0 ggsignif_0.6.2 memoise_2.0.0
[85] magrittr_2.0.1 plyr_1.8.6 zlibbioc_1.38.0 compiler_4.1.1
[89] BiocIO_1.2.0 RColorBrewer_1.1-2 Rsamtools_2.8.0 cli_3.0.1
[93] affy_1.70.0 ps_1.6.0 PerformanceAnalytics_2.0.4 MASS_7.3-54
[97] mgcv_1.8-36 tidyselect_1.1.1 stringi_1.7.4 highr_0.9
[101] yaml_2.2.1 askpass_1.1 grid_4.1.1 sass_0.4.0
[105] polynom_1.4-0 tools_4.1.1 rio_0.5.27 rstudioapi_0.13
[109] foreign_0.8-81 farver_2.1.0 digest_0.6.27 BiocManager_1.30.16
[113] Rcpp_1.0.7 car_3.0-11 siggenes_1.66.0 broom_0.7.9
[117] httr_1.4.2 colorspace_2.0-2 rvest_1.0.1 XML_3.99-0.7
[121] fs_1.5.0 splines_4.1.1 conquer_1.0.2 multtest_2.48.0
[125] sessioninfo_1.1.1 xtable_1.8-4 jsonlite_1.7.2 testthat_3.0.4
[129] R6_2.5.1 pillar_1.6.2 htmltools_0.5.2 glue_1.4.2
[133] fastmap_1.1.0 BiocParallel_1.26.2 beanplot_1.2 codetools_0.2-18
[137] pkgbuild_1.2.0 utf8_1.2.2 lattice_0.20-44 bslib_0.2.5.1
[141] curl_4.3.2 zip_2.2.0 openxlsx_4.2.4 openssl_1.4.4
[145] survival_3.2-11 rmarkdown_2.10 desc_1.3.0 munsell_0.5.0
[149] rhdf5_2.36.0 GenomeInfoDbData_1.2.6 HDF5Array_1.20.0 haven_2.4.3
[153] gtable_0.3.0