to select one from many numeric predictors for a regression model, to ensure that the additional predictor has the maximum effect size.
R terminology might be different from that of mathematics and statistics. Please refer to Appendix Section 8 for explanation and reference of the terms and abbreviations used in this vignette.
Sys.info()[c('sysname', 'release', 'machine')]# sysname release machine # "Darwin" "25.0.0" "arm64"R.version# _ # platform aarch64-apple-darwin20 # arch aarch64 # os darwin20 # system aarch64, darwin20 # status # major 4 # minor 5.1 # year 2025 # month 06 # day 13 # svn rev 88306 # language R # version.string R version 4.5.1 (2025-06-13)# nickname Great Square Root
Experimental (and maybe unstable) features are released extremely frequently to Github. Active developers should use the Github version; suggestions and bug reports are welcome! Stable releases to CRAN are typically updated every 2 to 3 months, or when the authors have an upcoming manuscript in the peer-reviewing process.
As of package spatstat.explore(Baddeley, Rubak, and Turner 2015, v3.5.2) and pROC(Robin et al. 2011, v1.19.0.1), we have a function name clash between spatstat.explore::plot.roc() and pROC::plot.roc() when loading both packages groupedHyperframe (which Imports package spatstat.explore) and maxEff (which Imports package caret which Imports package pROC). This function name clash is potentially hazardous as the S3 classes spatstat.explore::roc and pROC::roc are totally different.
1.3 Acknowledgement
This work is supported by National Institutes of Health, U.S. Department of Health and Human Services grants
Partition into a training (80%) and test (20%) set.
set.seed(32); id =sample.int(n =nrow(Ki67q), size =nrow(Ki67q)*.8) |>sort.int()s0 = Ki67q[id, , drop =FALSE] # training sets1 = Ki67q[-id, , drop =FALSE] # test set
dim(s0)# [1] 497 12
dim(s1)# [1] 125 12
4 Starting Model
Let’s consider a starting model of endpoint PFS with predictor Tstage on the training set s0. As of package spatstat.geom v3.5.0.8, the S3 method dispatch spatstat.geom::as.data.frame.hyperframe() warns on hypercolumns that aren’t able to be converted to a data.frame. Therefore, user should use suppressWarnings() or set #| warning: false in R markdown code–chunk.
m0 =coxph(PFS ~ Tstage, data = s0)
Starting coxph model m0
summary(m0)# Call:# coxph(formula = PFS ~ Tstage, data = s0)# # n= 497, number of events= 91 # # coef exp(coef) se(coef) z Pr(>|z|) # Tstage 0.6725 1.9591 0.1061 6.34 2.29e-10 ***# ---# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1# # exp(coef) exp(-coef) lower .95 upper .95# Tstage 1.959 0.5104 1.591 2.412# # Concordance= 0.654 (se = 0.028 )# Likelihood ratio test= 32.23 on 1 df, p=1e-08# Wald test = 40.2 on 1 df, p=2e-10# Score (logrank) test = 43.73 on 1 df, p=4e-11
considers each “slice” of the numeric-hypercolumn logKi67.quantile as an additional numeric predictor. Users are encourage to read the package groupedHyperframevignette, Appendix Section On anylist for more details;
updates the starting model m0 with each one of the additional numeric predictors, respectively;
returns an 'add_numeric' object, which inherits from class 'add_'. This is a listof objects with internal class 'add_numeric_', see Appendix Section 7.3.
Second, the S3 method dispatch sort_by.add_() sorts the input by the absolute values of the regression coefficients, i.e., effect size effsize, of the additional numeric predictors.
Lastly, the S3 method dispatch utils:::head.default() chooses the top n element from the input.
The S3 method dispatch print.add_numeric() displays the calls to the selected numeric predictors.
The S3 method dispatch predict.add_numeric() applies the starting model structure in m0, as well as the additional numeric predictors selected by a0, on the test set s1. The author categorizes this functionality under the S3 generic stats::predict(), instead of stats::update(), to emphasize that this “update” is on a newdata, even that the workhorse function is the S3 method dispatch stats:::update.default().
a1 = a0 |>predict(newdata = s1)
Predicted models a1
a1# logKi67.quantile["0.06"] :# Call:# coxph(formula = PFS ~ Tstage + x., data = newd)# # coef exp(coef) se(coef) z p# Tstage 0.5900 1.8041 0.1959 3.012 0.0026# x. 0.6071 1.8352 0.8687 0.699 0.4846# # Likelihood ratio test=9.23 on 2 df, p=0.009926# n= 125, number of events= 27 # # logKi67.quantile["0.05"] :# Call:# coxph(formula = PFS ~ Tstage + x., data = newd)# # coef exp(coef) se(coef) z p# Tstage 0.5886 1.8015 0.1961 3.002 0.00269# x. 0.6276 1.8732 0.8746 0.718 0.47299# # Likelihood ratio test=9.25 on 2 df, p=0.009794# n= 125, number of events= 27
Developers are encouraged to learn more about advanced usage of the class 'add_numeric_' in Appendix Section 7.3.
dichotomizes each “slice” of the numeric-hypercolumn logKi67.quantile into a logical variable, see Appendix Section 7.1;
removes the duplicatedlogical variables;
updates the starting model m0 with each one of the additional logical predictors, respectively;
returns an 'add_dummy' object, which inherits from class 'add_'. This is a listof objects with internal class add_dummy_, see Appendix Section 7.4.
Second, the S3 method dispatch subset.add_dummy() subsets the input by the balance of the partition of the additional logical predictor.
Third, the S3 method dispatch sort_by.add_() sorts the input by the absolute value of regression coefficients, i.e., effect size effsize, of the additional logical predictor.
Lastly, the S3 method dispatch utils:::head.default() chooses the top n element from the input.
The S3 method dispatch print.add_dummy() displays the calls to the selected logical predictors.
The S3 method generic predict.add_dummy() applies the starting model structure in m0, as well as the additional logical predictors selected by b0, on the test set s1.
b1 = b0 |>predict(newdata = s1)
Predicted models b1
b1# logKi67.quantile["0.95"]>=6.76825904668107 :# Call:# coxph(formula = PFS ~ Tstage + x., data = newd)# # coef exp(coef) se(coef) z p# Tstage 0.6287 1.8753 0.1922 3.271 0.00107# x.TRUE -0.2630 0.7687 0.4693 -0.560 0.57515# # Likelihood ratio test=9.05 on 2 df, p=0.01086# n= 125, number of events= 27 # # logKi67.quantile["0.94"]>=6.72093005757954 :# Call:# coxph(formula = PFS ~ Tstage + x., data = newd)# # coef exp(coef) se(coef) z p# Tstage 0.61820 1.85558 0.19162 3.226 0.00125# x.TRUE -0.09157 0.91249 0.46920 -0.195 0.84526# # Likelihood ratio test=8.78 on 2 df, p=0.01238# n= 125, number of events= 27
Developers are encouraged to learn more about advanced usage of the class add_dummy_ in Appendix Section 7.4.
First, function add_dummy_partition() creates a dichotomizing rule for each additional numeric predictor in the following steps.
Generates twenty (20L) partitions of the training set s0 using functions caret::createDataPartition() or statusPartition() (Appendix Section 7.2) at .8 vs. .2=(1-.8) ratio.
For the i-th partition (i=1,\cdots,20) of the training set s0,
creates a dichotomizing rule of the numeric predictor in the i-th training-subset of s0 using function node1() (Appendix Section 7.1);
applies such dichotomizing rule to the numeric predictor in the i-th test-subset of s0;
updates the starting model m0 with the i-th test-subset of s0, as well as the logical predictor from the previous step;
records the estimated regression coefficients, i.e., effect size effsize, of the logical predictor.
Selects the dichotomizing rule based on the partition with the median effect size of the logical predictor in their own, specific test-subset of s0.
Returns an 'add_dummy' object.
Second, the S3 method dispatch subset.add_dummy() subsets the input by the balance of the partition of the additional logical predictor in their own, specific test-subset of s0.
Third, the S3 method dispatch sort_by.add_() sorts the input by the absolute value of regression coefficients, i.e., effect size effsize, of the additional logical predictor in their own, specific test-subset of s0.
Lastly, the S3 method dispatch utils:::head.default() chooses the top n element from the input.
Similar to Section 6.1, the S3 method dispatch print.add_dummy() displays the calls to the selected logical predictors.
Similar to Section 6.1, the S3 method generic predict.add_dummy() applies the starting model structure in m0, as well as the additional logical predictors selected by c0, on the test set s1.
c1 = c0 |>predict(newdata = s1)
Predicted models c1
c1# logKi67.quantile["0.99"]>=7.16313918263581 :# Call:# coxph(formula = PFS ~ Tstage + x., data = newd)# # coef exp(coef) se(coef) z p# Tstage 0.6258 1.8697 0.1941 3.224 0.00126# x.TRUE -0.5360 0.5851 0.4467 -1.200 0.23017# # Likelihood ratio test=10.07 on 2 df, p=0.006517# n= 125, number of events= 27 # # logKi67.quantile["0.88"]>=6.7273333203265 :# Call:# coxph(formula = PFS ~ Tstage + x., data = newd)# # coef exp(coef) se(coef) z p# Tstage 0.6044 1.8302 0.1928 3.134 0.00172# x.TRUE 0.2514 1.2858 0.4052 0.620 0.53500# # Likelihood ratio test=9.14 on 2 df, p=0.01037# n= 125, number of events= 27
Developers are encouraged to learn more about advanced usage of the class add_dummy_ in Appendix Section 7.4.
7 Appendix
We cover some minor and/or experimental features in the appendix, in order not to interrupt the main narrative of this vignette.
7.1node1()
Function node1() dichotomizes a single numericvector based on the first node of a recursive partitioning and regression tree rpart.object from package rpart(T. Therneau and Atkinson 2025). In this section, we illustrate
the use of function node1(), and
the S3 method dispatches predict.node1() and labels.node1()
using data examples from package rpart. To keep the user’s search path simple and clean, we intentionally call the objects and functions in package rpart and survival(T. M. Therneau 2024) with explicit namespace, instead of using library() or require().
data(stagec, package ='rpart')stagec0 = stagec[1:140,] # training setstagec1 = stagec[-(1:140),] # test set
Function rpart::rpart() creates a recursive partitioning model of the numeric variable age with the endpoint of progression-free survival in the training set stagec0, with
parameter cp = .Machine$double.eps, to ensure at least one node/split of the partitioning tree.
parameter maxdepth = 2L, to reduce the computation cost as we need only the first node.
Function node1() creates a dichotomizing rule\mathcal{D} of the numeric variable age based on the first node in the tree rp0. Function node1() returns an object of class 'node1', which is also a closure that consists of a function and the environment in which it was created.
D0 = rp0 |>node1()
Dichotomizing rule D0 should be used as an R function.
The S3 method dispatch base::print.function() displays the dichomizing rule D0 and its enclosure environment,
D0# function (newx = age) # {# ret <- (newx >= 53.5)# ret0 <- na.omit(ret)# if ((length(ret0) > 1L) && (all(ret0) || !any(ret0))) # warning("Dichotomized values are all-0 or all-1")# return(ret)# }# <environment: 0x15a5f8958># attr(,"class")# [1] "node1" "function"
The name of the numeric variable, e.g., age, is stored as the formals argument of the parameter newx. This is an advanced R programming trick. The evaluation of the dichotomizing rule discovers the variable age in
the global environment .GlobalEnv
the enclosure environment of the dichotomizing rule, if its parent.environment is a namespace or the .GlobalEnv
Advanced trick: .GlobalEnv
age = stagec1$ageD0()# [1] TRUE TRUE TRUE TRUE FALSE TRUErm(age)
Advanced trick: environment(D0), child of package namespace
Advanced fact: be careful with as.environment(), child of emptyenv() !
ev = stagec1 |>as.environment()parent.env(ev)# <environment: R_EmptyEnv>D0. = D0environment(D0.) = evtryCatch(expr = {D0.()}, error = identity)# <simpleError in { ret <- (newx >= 53.5) ret0 <- na.omit(ret) if ((length(ret0) > 1L) && (all(ret0) || !any(ret0))) warning("Dichotomized values are all-0 or all-1") return(ret)}: could not find function "{">rm(ev, D0.)
The S3 method dispatch labels.node1() returns a human-friendly character text to describe the dichotomizing rule.
D0 |>labels()# [1] "age>=53.5"
The S3 method dispatch predict.node1() dichotomizes the numeric variable age in the test set stagec1, using the dichotomizing rule D0 determined by the training set stagec0.
The (tentatively named) S3 method dispatch get_cutoff.node1() returns the numeric cutoff value of the dichotomizing rule D0.
Subject to change: get numeric cutoff value
D0 |>get_cutoff()# [1] 53.5
7.2statusPartition()
Function statusPartition()
splits a left-censored survival::Surv object by its survival status, i.e., observed vs. left-censored;
partitions the observed and left-censored subjects, respectively, into test/training sets
Function statusPartition() intends to avoid the situation that a Cox proportional hazards model survival::coxph() in one or more of the partitioned data set being degenerate due to the fact that all subjects in that partition being censored.
set.seed(12); id = capacitor_failure |>statusPartition(times =1L, p = .5)capacitor_failure[id[[1L]], 2L] |>table() # balanced by survival status# # 0 1 # 16 16
Function statusPartition() is an extension of the very popular function caret::createDataPartition(), which stratifies a Surv object by the quantiles of its survival time (as of package caret v7.0.1).
caret::createDataPartition(): not balanced by survival status
The internal class 'add_dummy_'inherits from class 'node1' (see Appendix Section 7.1), with additional attributes
attr(., 'p1'), a numeric scalar between 0 and 1, the TRUE probability of the additional logical predictor in the training set
attr(., 'effsize'), a numeric scalar, the regression coefficients, i.e., effect size effsize, of the additional logical predictor
attr(., 'model'), the regression model with additional logical predictor
The S3 method dispatch base::print.default() displays each 'add_dummy_' object.
Training models b0 in training set s0: 1st element
b0[[1L]]# function (newx = logKi67.quantile["0.95"]) # {# ret <- (newx >= 6.76825904668107)# ret0 <- na.omit(ret)# if ((length(ret0) > 1L) && (all(ret0) || !any(ret0))) # warning("Dichotomized values are all-0 or all-1")# return(ret)# }# <environment: 0x13af42270># attr(,"class")# [1] "add_dummy_" "node1" "function" # attr(,"p1")# [1] 0.7746479# attr(,"effsize")# [1] 1.233025# attr(,"model")# Call:# coxph(formula = PFS ~ Tstage + x., data = data_)# # coef exp(coef) se(coef) z p# Tstage 0.6291 1.8760 0.1079 5.830 5.54e-09# x.TRUE 1.2330 3.4316 0.3948 3.123 0.00179# # Likelihood ratio test=45.99 on 2 df, p=1.033e-10# n= 497, number of events= 91
Training models b0 in training set s0: 2nd element
b0[[2L]]# function (newx = logKi67.quantile["0.94"]) # {# ret <- (newx >= 6.72093005757954)# ret0 <- na.omit(ret)# if ((length(ret0) > 1L) && (all(ret0) || !any(ret0))) # warning("Dichotomized values are all-0 or all-1")# return(ret)# }# <environment: 0x13ae832e0># attr(,"class")# [1] "add_dummy_" "node1" "function" # attr(,"p1")# [1] 0.7806841# attr(,"effsize")# [1] 1.150391# attr(,"model")# Call:# coxph(formula = PFS ~ Tstage + x., data = data_)# # coef exp(coef) se(coef) z p# Tstage 0.6133 1.8465 0.1079 5.683 1.32e-08# x.TRUE 1.1504 3.1594 0.3963 2.903 0.0037# # Likelihood ratio test=43.77 on 2 df, p=3.134e-10# n= 497, number of events= 91
Training models c0 in test-subset of training set s0: 1st element
c0[[1L]]# function (newx = logKi67.quantile["0.99"]) # {# ret <- (newx >= 7.16313918263581)# ret0 <- na.omit(ret)# if ((length(ret0) > 1L) && (all(ret0) || !any(ret0))) # warning("Dichotomized values are all-0 or all-1")# return(ret)# }# <environment: 0x15d04c778># attr(,"class")# [1] "add_dummy_" "node1" "function" # attr(,"p1")# [1] 0.7821782# attr(,"effsize")# [1] 0.5009811# attr(,"model")# Call:# coxph(formula = PFS ~ Tstage + x., data = data_)# # coef exp(coef) se(coef) z p# Tstage 0.8548 2.3509 0.2123 4.026 5.68e-05# x.TRUE 0.5010 1.6503 0.6312 0.794 0.427# # Likelihood ratio test=13.92 on 2 df, p=0.0009486# n= 101, number of events= 19
Training models c0 in test-subset of training set s0: 2nd element
c0[[2L]]# function (newx = logKi67.quantile["0.88"]) # {# ret <- (newx >= 6.7273333203265)# ret0 <- na.omit(ret)# if ((length(ret0) > 1L) && (all(ret0) || !any(ret0))) # warning("Dichotomized values are all-0 or all-1")# return(ret)# }# <environment: 0x15d1641d8># attr(,"class")# [1] "add_dummy_" "node1" "function" # attr(,"p1")# [1] 0.5445545# attr(,"effsize")# [1] 0.3891977# attr(,"model")# Call:# coxph(formula = PFS ~ Tstage + x., data = data_)# # coef exp(coef) se(coef) z p# Tstage 1.1616 3.1949 0.2370 4.902 9.51e-07# x.TRUE 0.3892 1.4758 0.4792 0.812 0.417# # Likelihood ratio test=20.86 on 2 df, p=2.953e-05# n= 101, number of events= 19
The S3 method dispatch predict.node1() evaluates a dichotomizing rule in a hyperframe. Note that user must call the S3 method dispatch predict.node1() explicitly, otherwise the S3 generic stats::predict() would dispatch to predict.add_dummy_().
Allaire, JJ, Yihui Xie, Christophe Dervieux, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, et al. 2024. rmarkdown: Dynamic Documents for r. https://github.com/rstudio/rmarkdown.
Baddeley, Adrian, and Rolf Turner. 2005. “spatstat: An R Package for Analyzing Spatial Point Patterns.”Journal of Statistical Software 12 (6): 1–42. https://doi.org/10.18637/jss.v012.i06.
Kuhn, Max. 2008. “Building Predictive Models in R Using the caret Package.”Journal of Statistical Software 28 (5): 1–26. https://doi.org/10.18637/jss.v028.i05.
Posit team. 2025. RStudio: Integrated Development Environment for r. Boston, MA: Posit Software, PBC. http://www.posit.co/.
Robin, Xavier, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez, and Markus Müller. 2011. “pROC: An Open-Source Package for R and S+ to Analyze and Compare ROC Curves.”BMC Bioinformatics 12: 77. https://doi.org/10.1186/1471-2105-12-77.