Estimation of the cell-type proportions using reference-free method (RefFreeEWAS)

Paulina Jedynak, PhD

Team of Environmental Epidemiology applied to Development and Respiratory Health, Inserm, CNRS, University Grenoble Alpes, Institute for Advanced Biosciences (IAB), U1209 Joint Research Center, Grenoble, France
paulina.jedynak@univ-grenoble-alpes.fr, paulina.jedynak@gmail.com

November 23, 2021

Abstract

This is a reproducible example showing how to apply the reference-free method for estimation of the cell-type proportions in a sample using RefFreeEWAS package (Houseman et al. 2016). A more detailed description of each step and theoretical background is available in the supplementary file reffree_cell_mix_tutorial.pdf to be found at https://bit.ly/3nJaqpS

# Install and load needed packages
list_packages <- c("dplyr", "RefFreeEWAS", "FactoMineR", "factoextra", "brgedata", "minfi")
new_packages <- list_packages[!(list_packages %in% installed.packages()[, "Package"])]

if (length(new_packages)) {
  install.packages(new_packages)
}

if (!isTRUE("medepir" %in% installed.packages()[, "Package"])) {
  remotes::install_github("bcm-uga/medepir")
}

library(dplyr)
library(RefFreeEWAS)
library(FactoMineR)
library(factoextra)
library(brgedata)
library(minfi)
library(medepir)

# Load an open source methylation dataset 
data(brge_methy) # brge_methy is a GenomicRatioSet with methylation data corresponding to the Illumina Human Methylation 450K. It contains 392,277 probes and 20 samples. The object contains 9 phenotypic variables (age, sex and cell types proportions)

# Retrieve methylation beta values
methylation_matrix <- minfi::getBeta(brge_methy) # 392,277 x 20

# To save computational time, create dataset subsets
set.seed(12345)
short <- methylation_matrix[sample(1:dim(methylation_matrix)[1], 500), ] # 500 x 20
final <- methylation_matrix[sample(1:dim(methylation_matrix)[1], 1000), ] # 1,000 x 20

# Retrieve covariates data
convariates_dataset <- colData(brge_methy) %>% 
  as.data.frame()

PART I: Estimate K (the number of cell types in your sample) for a subset of CpG probes

Note: The right choice of K is crucial for a successful estimation of cell proportions. Using a subset of probes enables more precise estimation of K as well as saves computational time. There are different methods of probe selection, e.g. selection based on the probes’ variability (e.g. 10,000 most variable probes can be used). Another interesting method is removing the probes whose methylation is affected by factors that are not influencing cell proportions (Decamps et al. 2020).

# Option 1: Selecting 10,000 most variable probes
variance <- apply(methylation_matrix, 1, var, na.rm = TRUE)
most_variant_10k <- methylation_matrix[names(variance[order(variance, decreasing = TRUE)][1:10000]), ]

# Option 2: Filtering out the probes whose methylation is associated with factors that are not influencing cell proportions (like: batch effect). For details see Decamps et al. 2020 and package medepir

# NOTE: see e.g., the discussion here https://github.com/bcm-uga/medepir/issues/3#issuecomment-953653677 on how to choose these factors.
# Since "affect methylation but not the cell mix" may mean different things, two scenarios could be considered: 1) variables affecting methylation but not associated in 'real life' with the cell mix (e.g., I would not consider variables that happen after the constitution of the cell mix, e.g., delivery mode that happens after placental cell proportions constitution); 2) all variables associated with methylation (ignoring the 'real life' chronology explained above). 
# In the Decamps et al. 2020 paper, the entire metadata file is given to the CF_detection function, and the confounding factors are: maternal age and child sex (the authors also suggest to add information on smoking status and technical factors, if available).
# Defining which confounders you want to include in your analysis is really up to you, you are the data-analyst and the expert of your specific scientific question. There is always a risk to remove probes also associated with the biological processes you are interested in. This is why it is very important to bring your own expertise to identify the relevant biological priors you need to use in your analysis framework.
convariates_dataset <- convariates_dataset %>%
  dplyr::select(age, sex) # Age and sex are the only phenotypic variables available for this exemplary dataset, so I choose them. See comment above on how to choose your own predictors.

# Select the probes
selected_probes <- medepir::CF_detection(D = short, exp_grp = convariates_dataset) # 487 * 20, 13 CpGs were excluded as linearly associated with methylation

# D is the input methylation matrix (in this case it contains 500 CpGs due to the computational time constrains, this would be replaced by 'methylation_matrix' in real life); exp_grp is the matrix of phenotypic data for all the patients, restricted to variables affecting methylation.

Selected probes are then used to estimate the K. Again we have (at least) two methods to choose from.

If we decided to use 10k most variable probes, we can continue with the algorithm proposed by Houseman et al.

# Run an estimation for a subset of CpG probes, with different values of K
# Y is the input methylation matrix (in this case it contains 500 CpGs due to the computational time constrains, this would be replaced by 'most_variant_10k' in real life).
# Klist represents values of K to be tested (for the purpose of the tutorial I test Ks from 1 to 10)
omega_array <- RefFreeEWAS::RefFreeCellMixArray(Y = short, Klist = 1:10)

## Fitting K = 1 
## Fitting K = 2 
## Fitting K = 3 
## Fitting K = 4 
## Fitting K = 5 
## Fitting K = 6 
## Fitting K = 7 
## Fitting K = 8 
## Fitting K = 9 
## Fitting K = 10

RefFreeEWAS package offers a deviance measure that may help to estimate the right value of K.

# Run bootstrap analysis to estimate deviance for each K
# rfArray is a list of RefFreeCellMix objects, Y is the input methylation matrix (in this case it
# contains 500 selected CpGs due to the computational time constrains, this would be replaced by 'most_variant_10k' in real life) and R is a number of bootstrapped vectors to return
set.seed(13446)
testBootDevs <- RefFreeEWAS::RefFreeCellMixArrayDevianceBoots(rfArray = omega_array, Y = short, R = 10)

## Bootstrap 10

# To find the best K compare the deviances and choose the minimal one
mean_dev <- apply(testBootDevs[-1, ], 2, mean, trim = 0.25)
which.min(mean_dev)

## 1 
## 1

An optimal K is 1.

# Assign a value of K for further steps
K_optimal_deviance <- which.min(mean_dev) # The obtained optimal K equal to 1 is not very interesting as it means that there is no variability due to the cellular heterogeneity. This comes from the fact that only 500 probes were used

# I change the optimal K to some random number, to make the example more interesting
K_optimal_deviance <- 4

If we decided to use probes remaining after exclusion of those whose methylation is affected by factors that are not influencing cell proportions, we can continue with the algorithm proposed by Decamps et al. They propose a visual inspection of the scree plot (showing the eigenvalues of the selected probes methylation matrix in descending order) as a powerful technique to choose the optimal K. For choosing K, Cattell’s rule is applied, which states that components corresponding to eigenvalues to the left of the straight line should be retained ((Cattell 1966)). When the actual number of different cell types is equal to K, we expect that (K-1) eigenvalues would correspond to the mixture of cell types and that other eigenvalues would correspond to noise (or other confounders unaccounted for).

# Run PCA on selected probes methylation matrix and plot results
# X is the input methylation matrix of selected probes
pca <- FactoMineR::PCA(X = t(selected_probes))

factoextra::fviz_screeplot(X = pca, geom = "line")

Using Cattlle’s rule we choose the optimal K to be 5.

# Assign a value of K for further steps
K_optimal_scree <- 5

PART II: Estimate cell mix proportion using chosen K

At this step we use the subset of probes (selected by one method or another) and the corresponding optimal K to estimate the final cell mix proportion. There are several algorithms for doing it, in this tutorial I present only the one proposed by Houseman et al. (from the RefFreeEWAS package), without explicit initialization of the M matrix (recommended). There is an ongoing debate what should be the initialization matrix, as it can significantly influence the results of the cell-type proportion estimation (Decamps et al. 2020). One solution would be to obtain cell-type proportions using both methods and compare the deviance of Y matrices reconstructed using each solution with the original input matrix Y, and choosing the cell-type proportion matrix returning lower deviance from original matrix Y.

NOTE: For the RefFreeCellMix algorithm, the cell proportions may not sum up to one - this is normal due to the RefFreeEWAS convergence algorithm that applies some constrains to the estimates.

# Houseman
# Estimate cell mix proportion without explicit initialization basing on 10k most variable probes or the entire methylation matrix
# Y represents the input methylation matrix (in this case it contains 500 selected CpGs due to the computational time constrains, but it would be replaced by either 'most_variant_10k' or 'methylation_matrix' in real life) and K_optimal is the value of K that was chosen in PART I

# Computationally heavy!
cell_mix <- RefFreeEWAS::RefFreeCellMix(Y = short, K = K_optimal_deviance, verbose = FALSE)

# Print omega (cell mix proportions for each sample). Columns indicate the cell types (from 1 to 4) and rows - individuals
head(cell_mix$Omega)

##                   1           2         3         4
## x0017  4.758029e-01 0.001361177 0.3464616 0.1638546
## x0043  5.362122e-01 0.179223015 0.2725779 0.0000000
## x0036  1.153017e-01 0.455537594 0.0000000 0.4215052
## x0041  8.282581e-02 0.025760921 0.8819150 0.0000000
## x0032 -4.558147e-18 0.037645947 0.1160053 0.8463488
## x0026  7.316968e-02 0.575982169 0.0000000 0.3471519

# Decamps
# Estimate cell mix proportion without explicit initialization basing on selected probes remaining after removal of probes whose methylation is not associated with cell mix
# Y represents the input methylation matrix (containing the subset of probes 'selected_probes') and K_optimal is the value of K that was chosen in PART I
cell_mix <- RefFreeEWAS::RefFreeCellMix(Y = selected_probes, K = K_optimal_scree, verbose = FALSE)

# Print omega (cell mix proportions for each sample). Columns indicate the cell types (from 1 to 5) and rows - individuals
head(cell_mix$Omega)

##               1             2            3            4          5
## x0017 0.6646564  7.741295e-02 2.055569e-01 1.734723e-18 0.03958448
## x0043 0.5992880  2.590630e-01 1.053921e-01 5.214625e-03 0.01903510
## x0036 0.3894882  3.765547e-01 7.816398e-02 1.113201e-01 0.03996036
## x0041 0.1989238  6.995143e-02 7.225250e-01 5.192959e-20 0.00000000
## x0032 0.5179490 -1.058905e-18 3.167709e-01 5.179147e-02 0.11348861
## x0026 0.0000000 -5.956148e-20 5.002566e-20 9.514261e-01 0.04743778

APPENDIX

Example of use of the Yfinal argument.

Some functions from the RefFreeEWAS package allow for an optional argument Yfinal. By default, it is set to NULL which makes it equal to Y. We can also set it for a different value.

In the RefFreeEWAS package help we read:
“Note that the decomposition will be based on Y, but Yfinal (=Y by default) will be used to determine the final value of Mu based on the last iterated value of Omega.”

Setting Yfinal different from NULL will cause the following: the matrix decomposition will be based on the information provided by the CpGs from the Y argument of the function, therefore Omega and Mu will be estimated basing on Y. However, the final displayed solution for Mu will converge to the methylation matrix provided in Yfinal argument.

Yfinal = NULL = Y → Y = M(_Y) x Ω(_Y)^T both M and Ω are estimated basing on Y.

Yfinal = Y1 → Y1 = M(_Y1) x Ω(_Y)^T both M and Ω are estimated basing on Y; Returned M will be adjusted to satisfy the new equation Y1 = M(_Y1) x Ω(_Y)^T is going to converge to.

Let’s compare results obtained by the RefFreeCellMix without and with Yfinal argument:

Yfinal = NULL = Y

# Run cell-type proportion estimation without Yfinal argument
# Y is the input methylation matrix, in this case it contains 500 selected CpGs, K_optimal is the value of K that was chosen in PART II and Yfinal is NULL so by default it equals Y
cell_mix_Y <- RefFreeEWAS::RefFreeCellMix(Y = short, K = K_optimal_deviance, Yfinal = NULL, verbose = FALSE)

# Print omega and mu for estimation without Yfinal argument
head(cell_mix_Y$Omega)

##                   1           2         3         4
## x0017  4.758029e-01 0.001361177 0.3464616 0.1638546
## x0043  5.362122e-01 0.179223015 0.2725779 0.0000000
## x0036  1.153017e-01 0.455537594 0.0000000 0.4215052
## x0041  8.282581e-02 0.025760921 0.8819150 0.0000000
## x0032 -4.558147e-18 0.037645947 0.1160053 0.8463488
## x0026  7.316968e-02 0.575982169 0.0000000 0.3471519

dim(cell_mix_Y$Omega)

## [1] 20  4

head(cell_mix_Y$Mu)

##                     1          2          3          4
## cg09076821 0.71705620 0.30617663 0.45047350 0.42551121
## cg20254225 0.11922072 0.09284526 0.08704003 0.10808034
## cg09652933 0.02175610 0.01835747 0.01722238 0.02047879
## cg02863842 0.06374079 0.07319621 0.03989381 0.05549677
## cg04248234 0.12760647 0.09193503 0.11460157 0.16719032
## cg19019537 0.05223937 0.03204958 0.01725836 0.02643954

dim(cell_mix_Y$Mu)

## [1] 500   4

Now let’s add Yfinal argument:

Yfinal = final

# Run cell-type proportion estimation with Yfinal argument
# Y is the input methylation matrix, in this case it contains 500 selected CpGs, K_optimal is the value of K that was chosen in PART II and Yfinal is the methylation matrix that the final solution of Mu will converge to
cell_mix_Yfinal <- RefFreeEWAS::RefFreeCellMix(Y = short, K = K_optimal_deviance, Yfinal = final, verbose = FALSE)

# Print omega and mu for estimation with Yfinal argument
head(cell_mix_Yfinal$Omega)

##                   1           2         3         4
## x0017  4.758029e-01 0.001361177 0.3464616 0.1638546
## x0043  5.362122e-01 0.179223015 0.2725779 0.0000000
## x0036  1.153017e-01 0.455537594 0.0000000 0.4215052
## x0041  8.282581e-02 0.025760921 0.8819150 0.0000000
## x0032 -4.558147e-18 0.037645947 0.1160053 0.8463488
## x0026  7.316968e-02 0.575982169 0.0000000 0.3471519

dim(cell_mix_Yfinal$Omega)

## [1] 20  4

head(cell_mix_Yfinal$Mu)

##                     1          2          3          4
## cg09120960 0.91262241 0.90779162 0.93595870 0.91650746
## cg23584332 0.05999688 0.03860012 0.02307283 0.03862555
## cg04501176 0.69133161 0.55534338 0.74962001 0.72716788
## cg18696027 0.16300307 0.10004924 0.05372412 0.13418646
## cg04065210 0.19663826 0.20801974 0.17361322 0.25864172
## cg19673155 0.38289359 0.20642147 0.14987697 0.29912160

dim(cell_mix_Yfinal$Mu)

## [1] 1000    4

For the two solutions, Omega stays the same and Mu is changing after adding Yfinal argument, in order to conform to the Yfinal = final.

sessionInfo()

## R version 4.0.5 (2021-03-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Europe.1252  LC_CTYPE=English_Europe.1252   
## [3] LC_MONETARY=English_Europe.1252 LC_NUMERIC=C                   
## [5] LC_TIME=English_Europe.1252    
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] medepir_0.1.0               minfi_1.36.0               
##  [3] bumphunter_1.32.0           locfit_1.5-9.4             
##  [5] iterators_1.0.13            foreach_1.5.1              
##  [7] Biostrings_2.58.0           XVector_0.30.0             
##  [9] SummarizedExperiment_1.20.0 MatrixGenerics_1.2.1       
## [11] matrixStats_0.59.0          GenomicRanges_1.42.0       
## [13] GenomeInfoDb_1.26.7         IRanges_2.24.1             
## [15] S4Vectors_0.28.1            brgedata_1.12.0            
## [17] Biobase_2.50.0              BiocGenerics_0.36.1        
## [19] factoextra_1.0.7            ggplot2_3.3.5              
## [21] FactoMineR_2.4              RefFreeEWAS_2.2            
## [23] quadprog_1.5-8              dplyr_1.0.7                
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1              backports_1.2.1          
##   [3] BiocFileCache_1.14.0      plyr_1.8.6               
##   [5] splines_4.0.5             BiocParallel_1.24.1      
##   [7] digest_0.6.28             htmltools_0.5.1.1        
##   [9] fansi_0.5.0               magrittr_2.0.1           
##  [11] memoise_2.0.0             cluster_2.1.2            
##  [13] doParallel_1.0.16         bigassertr_0.1.5         
##  [15] bigstatsr_1.5.1           openxlsx_4.2.4           
##  [17] limma_3.46.0              readr_1.4.0              
##  [19] annotate_1.68.0           askpass_1.1              
##  [21] siggenes_1.64.0           prettyunits_1.1.1        
##  [23] colorspace_2.0-2          blob_1.2.1               
##  [25] rappdirs_0.3.3            ggrepel_0.9.1            
##  [27] haven_2.4.1               xfun_0.24                
##  [29] crayon_1.4.2              RCurl_1.98-1.3           
##  [31] jsonlite_1.7.2            genefilter_1.72.1        
##  [33] GEOquery_2.58.0           survival_3.2-13          
##  [35] flock_0.7                 glue_1.5.0               
##  [37] gtable_0.3.0              zlibbioc_1.36.0          
##  [39] DelayedArray_0.16.3       car_3.0-10               
##  [41] Rhdf5lib_1.12.1           HDF5Array_1.18.1         
##  [43] abind_1.4-5               scales_1.1.1             
##  [45] DBI_1.1.1                 rngtools_1.5             
##  [47] bigparallelr_0.3.2        rstatix_0.7.0            
##  [49] Rcpp_1.0.7                xtable_1.8-4             
##  [51] progress_1.2.2            foreign_0.8-81           
##  [53] flashClust_1.01-2         bit_4.0.4                
##  [55] mclust_5.4.7              preprocessCore_1.52.1    
##  [57] DT_0.18                   htmlwidgets_1.5.3        
##  [59] httr_1.4.2                RColorBrewer_1.1-2       
##  [61] ellipsis_0.3.2            farver_2.1.0             
##  [63] pkgconfig_2.0.3           reshape_0.8.8            
##  [65] XML_3.99-0.6              sass_0.4.0               
##  [67] dbplyr_2.1.1              utf8_1.2.2               
##  [69] labeling_0.4.2            tidyselect_1.1.1         
##  [71] rlang_0.4.12              AnnotationDbi_1.52.0     
##  [73] cellranger_1.1.0          munsell_0.5.0            
##  [75] tools_4.0.5               cachem_1.0.5             
##  [77] generics_0.1.0            RSQLite_2.2.7            
##  [79] broom_0.7.7               evaluate_0.14            
##  [81] stringr_1.4.0             fastmap_1.1.0            
##  [83] yaml_2.2.1                knitr_1.33               
##  [85] bit64_4.0.5               zip_2.2.0                
##  [87] beanplot_1.2              scrime_1.3.5             
##  [89] purrr_0.3.4               nlme_3.1-152             
##  [91] doRNG_1.8.2               sparseMatrixStats_1.2.1  
##  [93] nor1mix_1.3-0             leaps_3.1                
##  [95] xml2_1.3.2                biomaRt_2.46.3           
##  [97] compiler_4.0.5            curl_4.3.1               
##  [99] ggsignif_0.6.2            tibble_3.1.6             
## [101] bslib_0.2.5.1             stringi_1.6.2            
## [103] highr_0.9                 GenomicFeatures_1.42.3   
## [105] forcats_0.5.1             lattice_0.20-44          
## [107] Matrix_1.3-4              multtest_2.46.0          
## [109] vctrs_0.3.8               pillar_1.6.4             
## [111] lifecycle_1.0.1           rhdf5filters_1.2.1       
## [113] jquerylib_0.1.4           cowplot_1.1.1            
## [115] data.table_1.14.0         bitops_1.0-7             
## [117] rtracklayer_1.50.0        R6_2.5.1                 
## [119] rio_0.5.27                parallelly_1.29.0        
## [121] codetools_0.2-18          MASS_7.3-54              
## [123] assertthat_0.2.1          rhdf5_2.34.0             
## [125] openssl_1.4.4             withr_2.4.2              
## [127] GenomicAlignments_1.26.0  Rsamtools_2.6.0          
## [129] GenomeInfoDbData_1.2.4    hms_1.1.0                
## [131] grid_4.0.5                tidyr_1.1.3              
## [133] base64_2.0                rmarkdown_2.9            
## [135] DelayedMatrixStats_1.12.3 carData_3.0-4            
## [137] illuminaio_0.32.0         ggpubr_0.4.0             
## [139] scatterplot3d_0.3-41

References:

Cattell, R. B. 1966. “The Scree Test For The Number Of Factors.” Multivariate Behav Res 1 (2): 245–76. https://doi.org/10.1207/s15327906mbr0102_10.

Decamps, Clémentine, Florian Privé, Raphael Bacher, Daniel Jost, Arthur Waguet, Sophie Achard, Elise Amblard, et al. 2020. “Guidelines for Cell-Type Heterogeneity Quantification Based on a Comparative Analysis of Reference-Free DNA Methylation Deconvolution Software.” BMC Bioinformatics 21 (1): 16. https://doi.org/10.1186/s12859-019-3307-2.

Houseman, E. Andres, Molly L. Kile, David C. Christiani, Tan A. Ince, Karl T. Kelsey, and Carmen J. Marsit. 2016. “Reference-Free Deconvolution of DNA Methylation Data and Mediation by Cell Composition Effects.” BMC Bioinformatics 17 (1). https://doi.org/10.1186/s12859-016-1140-4.