Additional Predictor with Maximum Effect Size

Author

Tingting Zhan

Published

October 12, 2025

1 Introduction

This vignette provides examples of using R packages

groupedHyperframe, Zhan and Chervoneva (2025a), CRAN, Github, RPubs
maxEff, Zhan and Chervoneva (2025b), CRAN, Github, RPubs

to select one from many numeric predictors for a regression model, to ensure that the additional predictor has the maximum effect size.

1.1 Prerequisite

Packages groupedHyperframe and maxEff require R version 4.5.0 (released 2025-04-11) or higher (macOS, Windows).

Environment on author’s computer

Sys.info()[c('sysname', 'release', 'machine')]
#  sysname  release  machine 
# "Darwin" "25.1.0"  "arm64"
R.version
#                _                           
# platform       aarch64-apple-darwin20      
# arch           aarch64                     
# os             darwin20                    
# system         aarch64, darwin20           
# status                                     
# major          4                           
# minor          5.1                         
# year           2025                        
# month          06                          
# day            13                          
# svn rev        88306                       
# language       R                           
# version.string R version 4.5.1 (2025-06-13)
# nickname       Great Square Root

Experimental (and maybe unstable) features are released extremely frequently to Github. Active developers should use the Github version; suggestions and bug reports are welcome! Stable releases to CRAN are typically updated every 2 to 3 months, or when the authors have an upcoming manuscript in the peer-reviewing process.

remotes::install_github('tingtingzhan/groupedHyperframe')
remotes::install_github('tingtingzhan/maxEff')

Developers, do NOT use the CRAN version!

utils::install.packages('groupedHyperframe') # Developers, do NOT use!!
utils::install.packages('maxEff') # Developers, do NOT use!!

1.2 Getting Started

Examples in this vignette require that the search path has

library(maxEff)
library(groupedHyperframe)
library(survival)

1.3 Manuscript in Progress

1.3.1 [Paper 1]

The authors are working on a manuscript featureing the pipeline

1.3.1.1 [Paper 1; section 1]

Ki67s = groupedHyperframe::Ki67 |>
  grouped_ppp(formula = logKi67 ~ . | patientID/tissueID) |>
  Emark_(r = seq.int(from = 0, to = 100)) |>
  summary_fv() |>
  aggregate(by = ~ patientID)
#

A hyper data frame Ki67s with aggregated Emark

Ki67s |>
  head()
# Hyperframe:
#   patientID Tstage  PFS adj_rad adj_chemo histology  Her2   HR  node  race age
# 1   PT00037      2 100+   FALSE     FALSE         3  TRUE TRUE  TRUE White  66
# 2   PT00039      1   22   FALSE     FALSE         3 FALSE TRUE FALSE Black  42
# 3   PT00040      1  99+   FALSE        NA         3 FALSE TRUE FALSE White  60
# 4   PT00042      1  99+   FALSE      TRUE         3 FALSE TRUE  TRUE White  53
# 5   PT00054      1  112    TRUE      TRUE         3 FALSE TRUE  TRUE White  52
# 6   PT00059      4   12    TRUE     FALSE         2  TRUE TRUE  TRUE Black  51
#   logKi67.E.y logKi67.E.cumtrapz logKi67.E.cumvtrapz      ppp. logKi67.E
# 1   (numeric)          (numeric)           (numeric) (ppplist)  (fvlist)
# 2   (numeric)          (numeric)           (numeric) (ppplist)  (fvlist)
# 3   (numeric)          (numeric)           (numeric) (ppplist)  (fvlist)
# 4   (numeric)          (numeric)           (numeric) (ppplist)  (fvlist)
# 5   (numeric)          (numeric)           (numeric) (ppplist)  (fvlist)
# 6   (numeric)          (numeric)           (numeric) (ppplist)  (fvlist)

This pipeline contains four steps. Readers are encouraged to read more about each function in package groupedHyperframe vignette.

Sections in package **`groupedHyperframe`** vignette
Function	Purpose	Section
`grouped_ppp()`	to create a grouped hyper data frame with one-and-only-one point-pattern hypercolumn	3.1
`Emark_()`	to perform batch processes of conditional mean E(r) between the points and the marks, on the one-and-only-one point-pattern hypercolumn	3.2
`summary_fv()`	to summarize the function value tables	3.3
`aggregate()`	to aggregate the summary statistics per patient	3.4

1.3.1.2 [Paper 1; section 2]

Say some English.

set.seed(32); id = sample.int(n = nrow(Ki67s), size = nrow(Ki67s)*.8) |> sort.int()
s0 = Ki67s[id, , drop = FALSE] # training set
s1 = Ki67s[-id, , drop = FALSE] # test set

m0 = coxph(PFS ~ Tstage, data = s0)

1.3.1.3 [Paper 1; section 3]

The pipeline of choosing the additional numeric predictor with maximum effect size, details in Section 5.

a0 = m0 |>
  add_numeric(x = ~ logKi67.E.y) |>
  sort_by(y = abs(effsize), decreasing = TRUE) |>
  head(n = 2L)

The pipeline of choosing the additional logical predictor with maximum effect size, details in Section 6.1.

b0 = m0 |>
  add_dummy(x = ~ logKi67.E.y) |>
  subset(subset = p1 > .05 & p1 < .95) |> 
  sort_by(y = abs(effsize), decreasing = TRUE) |>
  head(n = 2L)

The pipeline of choosing the additional logical predictor with maximum effect size, details in Section 6.2.

set.seed(83); c0 = m0 |> 
  add_dummy_partition(~ logKi67.E.y, times = 20L, p = .8) |>
  subset(subset = p1 > .15 & p1 < .85) |>
  sort_by(y = abs(effsize), decreasing = TRUE) |>
  head(n = 2L)

1.4 Acknowledgement

This work is supported by National Institutes of Health, U.S. Department of Health and Human Services grants

R01CA222847 (I. Chervoneva, T. Zhan, and H. Rui)
R01CA253977 (H. Rui and I. Chervoneva).

2 Examples

2.1 Non-Spatial Example

groupedHyperframe vignette, section Publications.

hyper.gam vignette, section Quantile Index, subsection Compute Aggregated Quantiles.

Ki67q = groupedHyperframe::Ki67 |>
  within.data.frame(expr = {
    x = y = NULL # remove x- and y-coords for non-spatial application
  }) |>
  as.groupedHyperframe(group = ~ patientID/tissueID) |>
  quantile(probs = seq.int(from = .01, to = .99, by = .01)) |>
  aggregate(by = ~ patientID)

A hyperframe Ki67q with aggregated quantiles

Ki67q |>
  head()
# Hyperframe:
#   Tstage  PFS adj_rad adj_chemo histology  Her2   HR  node  race age patientID
# 1      2 100+   FALSE     FALSE         3  TRUE TRUE  TRUE White  66   PT00037
# 2      1   22   FALSE     FALSE         3 FALSE TRUE FALSE Black  42   PT00039
# 3      1  99+   FALSE        NA         3 FALSE TRUE FALSE White  60   PT00040
# 4      1  99+   FALSE      TRUE         3 FALSE TRUE  TRUE White  53   PT00042
# 5      1  112    TRUE      TRUE         3 FALSE TRUE  TRUE White  52   PT00054
# 6      4   12    TRUE     FALSE         2  TRUE TRUE  TRUE Black  51   PT00059
#   logKi67.quantile
# 1        (numeric)
# 2        (numeric)
# 3        (numeric)
# 4        (numeric)
# 5        (numeric)
# 6        (numeric)

3 Data Partition

Partition into a training (80%) and test (20%) set.

set.seed(32); id = sample.int(n = nrow(Ki67q), size = nrow(Ki67q)*.8) |> sort.int()
s0 = Ki67q[id, , drop = FALSE] # training set
s1 = Ki67q[-id, , drop = FALSE] # test set

dim(s0)
# [1] 497  12

dim(s1)
# [1] 125  12

4 Starting Model

Let’s consider a starting model of endpoint PFS with predictor Tstage on the training set s0. As of package spatstat.geom v3.6.0.2, the S3 method dispatch spatstat.geom::as.data.frame.hyperframe() warns on hypercolumns that aren’t able to be converted to a data.frame. Therefore, user should use suppressWarnings() or set #| warning: false in R markdown code–chunk.

m0 = coxph(PFS ~ Tstage, data = s0)

Starting coxph model m0

summary(m0)
# Call:
# coxph(formula = PFS ~ Tstage, data = s0)
# 
#   n= 497, number of events= 91 
# 
#          coef exp(coef) se(coef)    z Pr(>|z|)    
# Tstage 0.6725    1.9591   0.1061 6.34 2.29e-10 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
#        exp(coef) exp(-coef) lower .95 upper .95
# Tstage     1.959     0.5104     1.591     2.412
# 
# Concordance= 0.654  (se = 0.028 )
# Likelihood ratio test= 32.23  on 1 df,   p=1e-08
# Wald test            = 40.2  on 1 df,   p=2e-10
# Score (logrank) test = 43.73  on 1 df,   p=4e-11

5 Adding `numeric` Predictor

a0 = m0 |>
  add_numeric(x = ~ logKi67.quantile) |>
  sort_by(y = abs(effsize), decreasing = TRUE) |>
  head(n = 2L)

The pipeline above consists of three steps.

First, function add_numeric()

considers each “slice” of the numeric-hypercolumn logKi67.quantile as an additional numeric predictor. Users are encourage to read the package groupedHyperframe vignette, Appendix Section anylist for more details;
updates the starting model m0 with each one of the additional numeric predictors, respectively;
returns an 'add_numeric' object, which inherits from the class 'add_'. This is a listof objects with internal class 'add_numeric_' (Appendix Section 7.4).

Second, the S3 method dispatch sort_by.add_() sorts the input by the absolute values of the regression coefficients, i.e., effect size effsize, of the additional numeric predictors.

Lastly, the S3 method dispatch utils:::head.default() chooses the top n element from the input.

The S3 method dispatch print.add_numeric() displays the calls to the selected numeric predictors.

a0
# logKi67.quantile["6%"]
# logKi67.quantile["5%"]

The S3 method dispatch predict.add_numeric() applies the starting model structure in m0, as well as the additional numeric predictors selected by a0, on the test set s1. The author categorizes this functionality under the S3 generic stats::predict(), instead of stats::update(), to emphasize that this “update” is on a newdata, even that the workhorse function is the S3 method dispatch stats:::update.default().

a1 = a0 |> 
  predict(newdata = s1)

Predicted models a1

a1
# logKi67.quantile["6%"] :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
# 
#          coef exp(coef) se(coef)     z      p
# Tstage 0.5900    1.8041   0.1959 3.012 0.0026
# x.     0.6071    1.8352   0.8687 0.699 0.4846
# 
# Likelihood ratio test=9.23  on 2 df, p=0.009926
# n= 125, number of events= 27 
# 
# logKi67.quantile["5%"] :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
# 
#          coef exp(coef) se(coef)     z       p
# Tstage 0.5886    1.8015   0.1961 3.002 0.00269
# x.     0.6276    1.8732   0.8746 0.718 0.47299
# 
# Likelihood ratio test=9.25  on 2 df, p=0.009794
# n= 125, number of events= 27

6 Adding `logical` Predictor

6.1 Naive Practice

b0 = m0 |>
  add_dummy(x = ~ logKi67.quantile) |>
  subset(subset = p1 > .05 & p1 < .95) |> 
  sort_by(y = abs(effsize), decreasing = TRUE) |>
  head(n = 2L)

The pipeline above consists of four steps.

First, function add_dummy()

dichotomizes (Appendix Section 7.2) each “slice” of the numeric-hypercolumn logKi67.quantile into a logical variable;
removes the duplicated logical variables;
updates the starting model m0 with each one of the additional logical predictors, respectively;
returns an 'add_dummy' object, which inherits from the class 'add_'. This is a listof objects with internal class 'add_dummy_' (Appendix Section 7.5).

Second, the S3 method dispatch subset.add_dummy() subsets the input by the balance of the partition of the additional logical predictor.

Third, the S3 method dispatch sort_by.add_() sorts the input by the absolute value of regression coefficients, i.e., effect size effsize, of the additional logical predictor.

Lastly, the S3 method dispatch utils:::head.default() chooses the top n element from the input.

The S3 method dispatch print.add_dummy() displays the calls to the selected logical predictors.

b0
# logKi67.quantile["95%"]>=6.76825904668107
# logKi67.quantile["94%"]>=6.72093005757954

The S3 method generic predict.add_dummy() applies the starting model structure in m0, as well as the additional logical predictors selected by b0, on the test set s1.

b1 = b0 |> 
  predict(newdata = s1)

Predicted models b1

b1
# logKi67.quantile["95%"]>=6.76825904668107 :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
# 
#           coef exp(coef) se(coef)      z       p
# Tstage  0.6287    1.8753   0.1922  3.271 0.00107
# x.TRUE -0.2630    0.7687   0.4693 -0.560 0.57515
# 
# Likelihood ratio test=9.05  on 2 df, p=0.01086
# n= 125, number of events= 27 
# 
# logKi67.quantile["94%"]>=6.72093005757954 :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
# 
#            coef exp(coef) se(coef)      z       p
# Tstage  0.61820   1.85558  0.19162  3.226 0.00125
# x.TRUE -0.09157   0.91249  0.46920 -0.195 0.84526
# 
# Likelihood ratio test=8.78  on 2 df, p=0.01238
# n= 125, number of events= 27

6.2 via Repeated Partitions

set.seed(83); c0 = m0 |> 
  add_dummy_partition(~ logKi67.quantile, times = 20L, p = .8) |>
  subset(subset = p1 > .15 & p1 < .85) |>
  sort_by(y = abs(effsize), decreasing = TRUE) |>
  head(n = 2L)

The pipeline above consists of four steps.

First, function add_dummy_partition() creates a dichotomizing rule for each additional numeric predictor in the following steps.

Generates twenty (20L) partitions of the training set s0 using functions caret::createDataPartition() or statusPartition() (Appendix Section 7.3) at .8 vs. .2=(1-.8) ratio.
For the i-th partition (i=1,\cdots,20) of the training set s0,
- creates a dichotomizing rule of the numeric predictor in the i-th training-subset of s0 using function node1() (Appendix Section 7.2);
- applies such dichotomizing rule to the numeric predictor in the i-th test-subset of s0;
- updates the starting model m0 with the i-th test-subset of s0, as well as the logical predictor from the previous step;
- records the estimated regression coefficients, i.e., effect size effsize, of the logical predictor.
Selects the dichotomizing rule based on the partition with the median effect size of the logical predictor in their own, specific test-subset of s0.
Returns an 'add_dummy' object.

Second, the S3 method dispatch subset.add_dummy() subsets the input by the balance of the partition of the additional logical predictor in their own, specific test-subset of s0.

Third, the S3 method dispatch sort_by.add_() sorts the input by the absolute value of regression coefficients, i.e., effect size effsize, of the additional logical predictor in their own, specific test-subset of s0.

Lastly, the S3 method dispatch utils:::head.default() chooses the top n element from the input.

Similar to Section 6.1, the S3 method dispatch print.add_dummy() displays the calls to the selected logical predictors.

c0
# logKi67.quantile["99%"]>=7.16313918263581
# logKi67.quantile["88%"]>=6.7273333203265

Similar to Section 6.1, the S3 method generic predict.add_dummy() applies the starting model structure in m0, as well as the additional logical predictors selected by c0, on the test set s1.

c1 = c0 |> 
  predict(newdata = s1)

Predicted models c1

c1
# logKi67.quantile["99%"]>=7.16313918263581 :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
# 
#           coef exp(coef) se(coef)      z       p
# Tstage  0.6258    1.8697   0.1941  3.224 0.00126
# x.TRUE -0.5360    0.5851   0.4467 -1.200 0.23017
# 
# Likelihood ratio test=10.07  on 2 df, p=0.006517
# n= 125, number of events= 27 
# 
# logKi67.quantile["88%"]>=6.7273333203265 :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
# 
#          coef exp(coef) se(coef)     z       p
# Tstage 0.6044    1.8302   0.1928 3.134 0.00172
# x.TRUE 0.2514    1.2858   0.4052 0.620 0.53500
# 
# Likelihood ratio test=9.14  on 2 df, p=0.01037
# n= 125, number of events= 27

7 Appendix

Technical details, as well as the minor and/or experimental features, are covered in the Appendix, in order not to interrupt the main narrative of this vignette.

All R code-chunks are folded in the Appendix, for ease of navigation. They are categorized as

Data, to create an R object for further operations.
Review, to demonstrate functions shipped with vanilla R version 4.5.1 (2025-06-13), or from other packages.
Example, to demonstrate functions from package maxEff.
Advanced, discussions for R experts.
Figure, to create a ggplot.
Workaround, to provide alternative solutions to a problem.

7.1 About

R terminology might be different from that of mathematics and statistics. Please refer to Appendix Section 9 for explanation and reference of the terms and abbreviations used in this vignette.

7.1.1 Environment

This vignette is created under R version 4.5.1 (2025-06-13) using packages knitr (Xie 2025, v1.50), quarto (Allaire and Dervieux 2024, v1.5.1 with Quarto v1.8.25) and rmarkdown (Allaire et al. 2025, v2.30).

An Integrated Development Environment (IDE), e.g., RStudio (Posit Team 2025) or Positron, is not required, but highly recommended.

7.1.2 Dependency

Package maxEff Imports packages

🗝 caret (Kuhn 2008, v7.0.1), for data partition
🗝 groupedHyperframe (dev v0.3.0), key dependency, for S3 method dispatch t.vectorlist()
parallel shipped with R version 4.5.1 (2025-06-13), for parallel computing
🗝 rpart (T. Therneau and Atkinson 2025, v4.1.24), key dependency, for recursive partitioning
🗝 spatstat.geom (Baddeley, Rubak, and Turner 2015, dev v3.6.0.2), key dependency, for hyperframe data structure

Package maxEff Suggests packages

🗝 survival (T. M. Therneau 2024, v3.8.3), key suggest (of package groupedHyperframe)

Package groupedHyperframe dependencies are outlined in its separate vignette (CRAN, RPubs).

As of package spatstat.explore (Baddeley, Rubak, and Turner 2015, v3.5.3.2) and pROC (Robin et al. 2011, v1.19.0.1), we have a function name clash between spatstat.explore::plot.roc() and pROC::plot.roc() when loading both packages groupedHyperframe (which Imports package spatstat.explore) and maxEff (which Imports package caret which Imports package pROC). This function name clash is potentially hazardous as the S3 classes spatstat.explore::roc and pROC::roc are totally different.

7.2 `node1()`

Function node1() dichotomizes a single numeric or factor vector based on the first node of a recursive partitioning and regression tree rpart.object from package rpart (T. Therneau and Atkinson 2025). In this section, we illustrate

the use of function node1(), and
the S3 method dispatches predict.node1() and labels.node1()

using data examples from package rpart. To keep the user’s search path simple and clean, we intentionally call the objects and functions in package rpart and survival (T. M. Therneau 2024) with explicit namespace, instead of using library() or require().

Data set stagec from package rpart contains 146 subjects with Stage C prostate cancer. We use the first 140 subjects as the training set, and the last 6 subjects as the test set.

Data: training set stagec0 and test set stagec1

stagec0 = rpart::stagec[1:140,] # training set
stagec1 = rpart::stagec[-(1:140),] # test set

7.2.1 `numeric` Predictor

Function rpart::rpart() creates a recursive partitioning model of the numeric variable age with the endpoint of progression-free survival in the training set stagec0, with

parameter cp = .Machine$double.eps, to ensure at least one node/split of the partitioning tree.
parameter maxdepth = 2L, to reduce the computation cost as we need only the first node.

Review: rpart::rpart(), numeric predictor

rp0a = rpart::rpart(
  formula = survival::Surv(pgtime, pgstat) ~ age, 
  data = stagec0, 
  cp = .Machine$double.eps,
  maxdepth = 2L
)
rp0a
# n= 140 
# 
# node), split, n, deviance, yval
#       * denotes terminal node
# 
# 1) root 140 183.84220 1.0000000  
#   2) age>=53.5 132 166.93880 0.9451090  
#     4) age>=67.5 30  35.69402 0.7546356 *
#     5) age< 67.5 102 130.54580 1.0034740 *
#   3) age< 53.5 8  14.33619 1.8678730 *

Function node1() creates a dichotomizing rule \mathcal{D} of the numeric variable age based on the first node in the tree rp0a. Function node1() returns an object of class 'node1', which is also a closure that consists of a function and the environment in which it was created.

Example: node1(), numeric predictor

D0a = rp0a |>
  node1()

Dichotomizing rule D0a should be used as an R function.

Example: application of dichotomizing rule D0a

set.seed(35); rnorm(n = 6L, mean = 53.5) |> 
  D0a()
# [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE

The S3 method dispatch base::print.function() displays the dichomizing rule D0a and its enclosure environment,

Review: base::print.function() on D0a

D0a
# function (newx = age) 
# {
#     ret <- (newx >= 53.5)
#     ret0 <- na.omit(ret)
#     if ((length(ret0) > 1L) && (all(ret0) || !any(ret0))) 
#         warning("Dichotomized values are all-0 or all-1")
#     return(ret)
# }
# <environment: 0x42c971570>
# attr(,"class")
# [1] "node1_numeric" "node1"         "function"

The enclosure environment of the dichomizing rule D0a is intentionally cleaned up.

Advanced: enclosure environment of D0a

D0a |>
  environment() |>
  ls(envir = _)
# [1] "dc"  "trm"
D0a |>
  environment() |>
  ls(envir = _, all.names = TRUE)
# [1] ".fn" "dc"  "trm"

The name of the numeric variable, e.g., age, is stored as the formals argument of the parameter newx. This is an advanced R programming trick. The evaluation of the dichotomizing rule discovers the variable age in

the global environment .GlobalEnv
the enclosure environment of the dichotomizing rule, if its parent.environment is a namespace or the .GlobalEnv

Advanced: evaluate in .GlobalEnv

age = stagec1$age
D0a()
# [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
rm(age)

Advanced: evaluate in environment(D0a), child of package namespace

ev = environment(D0a)
parent.env(ev)
# <environment: namespace:maxEff>
stopifnot(isNamespace(parent.env(ev)))
ls(envir = ev)
# [1] "dc"  "trm"
assign(x = 'age', value = stagec1$age, envir = ev)
D0a()
# [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
rm(age, envir = ev)
ls(envir = environment(D0a))
# [1] "dc"  "trm"
rm(ev)

Advanced: evaluate in new.env(parent = .GlobalEnv)

ev = new.env(parent = .GlobalEnv)
parent.env(ev)
# <environment: R_GlobalEnv>
assign(x = 'age', value = stagec1$age, envir = ev)
ls(envir = ev)
# [1] "age"
D0a. = D0a
environment(D0a.) = ev
D0a.()
# [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
tryCatch(expr = {
  as.function(D0a, envir = ev)()
}, error = identity)
# <simpleError in as.function(D0a, envir = ev)(): object 'age' not found>
tryCatch(expr = {
  eval(D0a(), envir = ev)
}, error = identity)
# <simpleError in D0a(): object 'age' not found>
tryCatch(expr = {
  eval(D0a(), enclos = ev)
}, error = identity)  
# <simpleError in D0a(): object 'age' not found>
rm(ev, D0a.)

Advanced: evaluate in list2env(., parent = .GlobalEnv)

ev = stagec1 |> 
  list2env(parent = .GlobalEnv) 
parent.env(ev)
# <environment: R_GlobalEnv>
D0a. = D0a
environment(D0a.) = ev
D0a.()
# [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
rm(ev, D0a.)

Advanced: do not evaluate in as.environment(), child of emptyenv() !

ev = stagec1 |>
  as.environment()
parent.env(ev)
# <environment: R_EmptyEnv>
D0a. = D0a
environment(D0a.) = ev
tryCatch(expr = {
  D0a.()
}, error = identity)
# <simpleError in {    ret <- (newx >= 53.5)    ret0 <- na.omit(ret)    if ((length(ret0) > 1L) && (all(ret0) || !any(ret0)))         warning("Dichotomized values are all-0 or all-1")    return(ret)}: could not find function "{">
rm(ev, D0a.)

The S3 method dispatch labels.node1() returns a human-friendly character text to describe the dichotomizing rule \mathcal{D}.

Example: labels.node1() on D0a

D0a |> 
  labels()
# [1] "age>=53.5"

The S3 method dispatch predict.node1() dichotomizes the numeric variable age in the test set stagec1, using the dichotomizing rule D0a determined by the training set stagec0.

Example: predict.node1() on D0a

stagec1$age
# [1] 62 63 73 68 51 56
D0a |> 
  predict(newdata = stagec1)
# [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

The (tentatively named) S3 method dispatch get_cutoff.node1() returns the numeric cutoff value of the dichotomizing rule \mathcal{D}.

Example: get_cutoff.node1() on D0a

D0a |> 
  get_cutoff()
# [1] 53.5

7.2.2 `factor` Predictor

Function node1() creates a dichotomizing rule D0b for the factor variable ploidy based on the first node in the recursive partitioning tree rp0b, in a similar procedure as outlined in Appendix Section 7.2.1.

Review: rpart::rpart(), factor predictor

rp0b = rpart::rpart(
  formula = survival::Surv(pgtime, pgstat) ~ ploidy, 
  data = stagec0, 
  cp = .Machine$double.eps,
  maxdepth = 2L
)
rp0b
# n= 140 
# 
# node), split, n, deviance, yval
#       * denotes terminal node
# 
# 1) root 140 183.84220 1.0000000  
#   2) ploidy=diploid 63  63.24389 0.5037078 *
#   3) ploidy=tetraploid,aneuploid 77 107.17600 1.4543450  
#     6) ploidy=tetraploid 67  86.81640 1.3362750 *
#     7) ploidy=aneuploid 10  18.54132 2.1347490 *

Example: node1(), factor predictor

D0b = rp0b |>
  node1()

Review: base::print.function() on D0b

D0b
# function (newx = ploidy) 
# {
#     ret <- unclass(newx) %in% c(1L)
#     ret0 <- na.omit(ret)
#     if ((length(ret0) > 1L) && (all(ret0) || !any(ret0))) 
#         warning("Dichotomized values are all-0 or all-1")
#     return(ret)
# }
# <environment: 0x3ddc98938>
# attr(,"class")
# [1] "node1_factor" "node1"        "function"

Example: labels.node1() on D0b

D0b |> 
  labels()
# [1] "ploidy in levels c(1L)"

The S3 method dispatch predict.node1() dichotomizes the factor variable ploidy in the test set stagec1, using the dichotomizing rule D0b determined by the training set stagec0. User must make sure that the factor variable being dichotomized has the same levels in the training and the test set.

Example: predict.node1() on D0b

stopifnot(identical(levels(stagec0$ploidy), levels(stagec1$ploidy)))
stagec1$ploidy
# [1] diploid    diploid    tetraploid aneuploid  diploid    diploid   
# Levels: diploid tetraploid aneuploid
D0b |> 
  predict(newdata = stagec1)
# [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE

7.3 `statusPartition()`

Function statusPartition()

splits a left-censored survival::Surv object by its survival status, i.e., observed vs. left-censored;
partitions the observed and left-censored subjects, respectively, into test/training sets.

We follow the terminology of the (very popular) function caret::createDataPartition(), that

We split the data into multiple groups (determined by predictor and/or response variable(s)), and partition each group into test/training sets.

Consider a toy example of

Data: left-censored Surv object capacitor_failure

capacitor_failure = survival::capacitor |> 
  with(expr = survival::Surv(time, status))
capacitor_failure
#  [1]  439   904  1092  1105   572   690   904  1090   315   315   439   628 
# [13]  258   258   347   588   959  1065  1065  1087   216   315   455   473 
# [25]  241   315   332   380   241   241   435   455  1105+ 1105+ 1105+ 1105+
# [37] 1090+ 1090+ 1090+ 1090+  628+  628+  628+  628+  588+  588+  588+  588+
# [49] 1087+ 1087+ 1087+ 1087+  473+  473+  473+  473+  380+  380+  380+  380+
# [61]  455+  455+  455+  455+

Function statusPartition() intends to avoid the situation that a Cox proportional hazards model survival::coxph() in one or more of the partitioned data set being degenerate due to the fact that all subjects in that partition being censored.

Example: statusPartition()

set.seed(12); id = capacitor_failure |>
  statusPartition(times = 1L, p = .5)
capacitor_failure[id[[1L]], 2L] |> 
  table() # balanced by survival status
# 
#  0  1 
# 16 16

Function statusPartition() is an extension of the very popular function caret::createDataPartition(), which stratifies a Surv object by the quantiles of its survival time (as of package caret v7.0.1).

Review: caret::createDataPartition(), not balanced by survival status

set.seed(12); id0 = capacitor_failure |>
  caret::createDataPartition(times = 1L, p = .5)
capacitor_failure[id0[[1L]], 2L] |> 
  table()
# 
#  0  1 
# 19 14

7.4 `'add_numeric_'`

The internal class 'add_numeric_' inherits from the class 'call', with additional attributes

attr(., 'effsize'), a numeric scalar, regression coefficients, i.e., effect size effsize, of the additional numeric predictor
attr(., 'model'), the regression model with additional numeric predictor

The S3 method dispatch base::print.default() displays each 'add_numeric_' object.

Example: training models a0, 1st element

a0[[1L]]
# logKi67.quantile["6%"]
# attr(,"effsize")
# [1] 0.2869008
# attr(,"model")
# Call:
# coxph(formula = PFS ~ Tstage + x., data = data_)
# 
#          coef exp(coef) se(coef)     z        p
# Tstage 0.6694    1.9530   0.1064 6.293 3.11e-10
# x.     0.2869    1.3323   0.3550 0.808    0.419
# 
# Likelihood ratio test=32.85  on 2 df, p=7.369e-08
# n= 497, number of events= 91 
# attr(,"class")
# [1] "add_numeric_" "call"

Example: training models a0, 2nd element

a0[[2L]]
# logKi67.quantile["5%"]
# attr(,"effsize")
# [1] 0.2863243
# attr(,"model")
# Call:
# coxph(formula = PFS ~ Tstage + x., data = data_)
# 
#          coef exp(coef) se(coef)     z       p
# Tstage 0.6690    1.9522   0.1064 6.289 3.2e-10
# x.     0.2863    1.3315   0.3563 0.804   0.422
# 
# Likelihood ratio test=32.84  on 2 df, p=7.394e-08
# n= 497, number of events= 91 
# attr(,"class")
# [1] "add_numeric_" "call"

The S3 method dispatch spatstat.geom::with.hyperframe() obtains the selected numeric predictors by passing the call to parameter ee.

Example: 1st selected numeric predictor

s0 |>
  with(ee = a0[[1L]]) |> # ?spatstat.geom::with.hyperframe
  summary.default()
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   5.351   5.873   6.005   6.028   6.165   7.930
s1 |>
  with(ee = a0[[1L]]) |> # ?spatstat.geom::with.hyperframe
  summary.default()
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   5.471   5.861   6.000   6.016   6.125   6.651

Example: 2nd selected numeric predictor

s0 |>
  with(ee = a0[[2L]]) |> # ?spatstat.geom::with.hyperframe
  summary.default()
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   5.334   5.851   5.987   6.008   6.145   7.917
s1 |>
  with(ee = a0[[2L]]) |> # ?spatstat.geom::with.hyperframe
  summary.default()
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   5.449   5.850   5.983   5.998   6.109   6.603

The S3 method dispatch predict.add_numeric_() is the workhorse of the S3 method dispatch predict.add_numeric().

Example: predict.add_numeric_(); predicted models a1, 1st element

a11 = a0[[1L]] |> 
  predict(newdata = s1)
stopifnot(identical(a1[[1L]], a11))

Example: predict.add_numeric_(); predicted models a1, 2nd element

a12 = a0[[2L]] |> 
  predict(newdata = s1)
stopifnot(identical(a1[[2L]], a12))

7.5 `'add_dummy_'`

The internal class 'add_dummy_' inherits from the class 'node1' (Appendix Section 7.2), with additional attributes

attr(., 'p1'), a numeric scalar between 0 and 1, the TRUE probability of the additional logical predictor in the training set
attr(., 'effsize'), a numeric scalar, the regression coefficients, i.e., effect size effsize, of the additional logical predictor
attr(., 'model'), the regression model with additional logical predictor

The S3 method dispatch base::print.default() displays each 'add_dummy_' object.

Example: training models b0 in training set s0: 1st element

b0[[1L]]
# function (newx = logKi67.quantile["95%"]) 
# {
#     ret <- (newx >= 6.76825904668107)
#     ret0 <- na.omit(ret)
#     if ((length(ret0) > 1L) && (all(ret0) || !any(ret0))) 
#         warning("Dichotomized values are all-0 or all-1")
#     return(ret)
# }
# <environment: 0x424be7b80>
# attr(,"class")
# [1] "add_dummy_"    "node1_numeric" "node1"         "function"     
# attr(,"p1")
# [1] 0.7746479
# attr(,"effsize")
# [1] 1.233025
# attr(,"model")
# Call:
# coxph(formula = PFS ~ Tstage + x., data = data_)
# 
#          coef exp(coef) se(coef)     z        p
# Tstage 0.6291    1.8760   0.1079 5.830 5.54e-09
# x.TRUE 1.2330    3.4316   0.3948 3.123  0.00179
# 
# Likelihood ratio test=45.99  on 2 df, p=1.033e-10
# n= 497, number of events= 91

Example: training models b0 in training set s0: 2nd element

b0[[2L]]
# function (newx = logKi67.quantile["94%"]) 
# {
#     ret <- (newx >= 6.72093005757954)
#     ret0 <- na.omit(ret)
#     if ((length(ret0) > 1L) && (all(ret0) || !any(ret0))) 
#         warning("Dichotomized values are all-0 or all-1")
#     return(ret)
# }
# <environment: 0x3d64e9f60>
# attr(,"class")
# [1] "add_dummy_"    "node1_numeric" "node1"         "function"     
# attr(,"p1")
# [1] 0.7806841
# attr(,"effsize")
# [1] 1.150391
# attr(,"model")
# Call:
# coxph(formula = PFS ~ Tstage + x., data = data_)
# 
#          coef exp(coef) se(coef)     z        p
# Tstage 0.6133    1.8465   0.1079 5.683 1.32e-08
# x.TRUE 1.1504    3.1594   0.3963 2.903   0.0037
# 
# Likelihood ratio test=43.77  on 2 df, p=3.134e-10
# n= 497, number of events= 91

Example: training models c0 in test-subset of training set s0: 1st element

c0[[1L]]
# function (newx = logKi67.quantile["99%"]) 
# {
#     ret <- (newx >= 7.16313918263581)
#     ret0 <- na.omit(ret)
#     if ((length(ret0) > 1L) && (all(ret0) || !any(ret0))) 
#         warning("Dichotomized values are all-0 or all-1")
#     return(ret)
# }
# <environment: 0x42f694020>
# attr(,"class")
# [1] "add_dummy_"    "node1_numeric" "node1"         "function"     
# attr(,"p1")
# [1] 0.7821782
# attr(,"effsize")
# [1] 0.5009811
# attr(,"model")
# Call:
# coxph(formula = PFS ~ Tstage + x., data = data_)
# 
#          coef exp(coef) se(coef)     z        p
# Tstage 0.8548    2.3509   0.2123 4.026 5.68e-05
# x.TRUE 0.5010    1.6503   0.6312 0.794    0.427
# 
# Likelihood ratio test=13.92  on 2 df, p=0.0009486
# n= 101, number of events= 19

Example: training models c0 in test-subset of training set s0: 2nd element

c0[[2L]]
# function (newx = logKi67.quantile["88%"]) 
# {
#     ret <- (newx >= 6.7273333203265)
#     ret0 <- na.omit(ret)
#     if ((length(ret0) > 1L) && (all(ret0) || !any(ret0))) 
#         warning("Dichotomized values are all-0 or all-1")
#     return(ret)
# }
# <environment: 0x3da009c10>
# attr(,"class")
# [1] "add_dummy_"    "node1_numeric" "node1"         "function"     
# attr(,"p1")
# [1] 0.5445545
# attr(,"effsize")
# [1] 0.3891977
# attr(,"model")
# Call:
# coxph(formula = PFS ~ Tstage + x., data = data_)
# 
#          coef exp(coef) se(coef)     z        p
# Tstage 1.1616    3.1949   0.2370 4.902 9.51e-07
# x.TRUE 0.3892    1.4758   0.4792 0.812    0.417
# 
# Likelihood ratio test=20.86  on 2 df, p=2.953e-05
# n= 101, number of events= 19

The S3 method dispatch predict.node1() evaluates a dichotomizing rule in a hyperframe. Note that user must call the S3 method dispatch predict.node1() explicitly, otherwise the S3 generic stats::predict() would dispatch to predict.add_dummy_().

Example: predict.node1(); 1st selected logical predictor

b0[[1L]] |> 
  predict.node1(newdata = s0) |>
  table() |> 
  addmargins()  
# 
# FALSE  TRUE   Sum 
#   112   385   497
b0[[1L]] |> 
  predict.node1(newdata = s1) |>
  table() |> 
  addmargins()
# 
# FALSE  TRUE   Sum 
#    21   104   125

Example: predict.node1(); 2nd selected logical predictor

b0[[2L]] |> 
  predict.node1(newdata = s0) |>
  table() |> 
  addmargins() 
# 
# FALSE  TRUE   Sum 
#   109   388   497
b0[[2L]] |> 
  predict.node1(newdata = s1) |>
  table() |> 
  addmargins()  
# 
# FALSE  TRUE   Sum 
#    21   104   125

Example: predict.node1(); 1st selected logical predictor via repeated partitions

c0[[1L]] |>
  predict.node1(newdata = s0) |>
  table() |> 
  addmargins()
# 
# FALSE  TRUE   Sum 
#   113   384   497
c0[[1L]] |>
  predict.node1(newdata = s1) |>
  table() |> 
  addmargins()
# 
# FALSE  TRUE   Sum 
#    20   105   125

Example: predict.node1(); 2nd selected logical predictor via repeated partitions

c0[[2L]] |>
  predict.node1(newdata = s0) |>
  table() |> 
  addmargins()
# 
# FALSE  TRUE   Sum 
#   224   273   497
c0[[2L]] |>
  predict.node1(newdata = s1) |>
  table() |> 
  addmargins()
# 
# FALSE  TRUE   Sum 
#    54    71   125

The S3 method dispatch predict.add_dummy_() is the workhorse of the S3 method dispatch predict.add_dummy().

Example: predict.add_dummy_(); predicted models b1: 1st element

b11 = b0[[1L]] |> 
  predict(newdata = s1)
stopifnot(identical(b1[[1L]], b11))

Example: predict.add_dummy_(); predicted models b1: 2nd element

b12 = b0[[2L]] |> 
  predict(newdata = s1)
stopifnot(identical(b1[[2L]], b12))

Example: predict.add_dummy_(); predicted models c1: 1st element

c11 = c0[[1L]] |> 
  predict(newdata = s1)
stopifnot(identical(c1[[1L]], c11))

Example: predict.add_dummy_(); predicted models c1: 2nd element

c12 = c0[[2L]] |> 
  predict(newdata = s1)
stopifnot(identical(c1[[2L]], c12))

8 What We Don’t Do

8.1 Use of “Optim”

In earlier publications, a junior author referred to the methodology outlined in Section 6.2 using the term ‘optim’. This is a wrong practice of nomenclature, indicating a wrong understanding of R terminology.

In the world of R, an algorithm may use the term ‘optim’ if-and-only-if its core/workhorse function is either one of

Brent–Dekker algorithm in one-dimensional optimization, via function stats::optimize() and/or its alias stats::optimise().
Nelder-Mead, quasi-Newton, conjugate-gradient algorithms in higher-dimensional optimization, via function stats::optim().

Otherwise, it is strongly advised not call an algorithm ‘optim’.

Therefore, we now name this package maxEff, meaning ‘maximum effect size’.

9 Terms & Abbreviations

Table 1 presents a comprehensive glossary of R terms and abbreviations used in this vignette.

R terminology and nomenclature could be drastically different from that of mathematics and statistics. Users are strongly advised to read closely from the links in Table 1, which point to webpages on search.r-project.org, cran.r-project.org, and en.wikipedia.org.

Table 1: Terms & Abbreviations in R

Term / Abbreviation	Description
`coxph`	Cox proportional hazards model
`createDataPartition`	Test vs. training data set partition, from package `caret` (Kuhn 2008)
`groupedHyperframe`	Grouped hyper data frame, from package `groupedHyperframe` (Zhan and Chervoneva 2025a)
`hypercolumns`, `hyperframe`	(Hyper columns of) hyper data frame, from package `spatstat.geom` (Baddeley and Turner 2005)
`PFS`	Progression/recurrence free survival, https://en.wikipedia.org/wiki/Progression-free_survival
`rpart`, `rpart.object`, `node`	Recursive partitioning and regression trees
`Surv`	Survival, i.e., time-to-event, object, from package `survival` (T. M. Therneau 2024)

10 References

Allaire, JJ, and Christophe Dervieux. 2024. quarto: R Interface to ‘Quarto‘ Markdown Publishing System. https://CRAN.R-project.org/package=quarto.

Allaire, JJ, Yihui Xie, Christophe Dervieux, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, et al. 2025. rmarkdown: Dynamic Documents for R. https://github.com/rstudio/rmarkdown.

Baddeley, Adrian, Ege Rubak, and Rolf Turner. 2015. Spatial Point Patterns: Methodology and Applications with R. London: Chapman; Hall/CRC Press. https://www.routledge.com/Spatial-Point-Patterns-Methodology-and-Applications-with-R/Baddeley-Rubak-Turner/p/book/9781482210200/.

Baddeley, Adrian, and Rolf Turner. 2005. “spatstat: An R Package for Analyzing Spatial Point Patterns.” Journal of Statistical Software 12 (6): 1–42. https://doi.org/10.18637/jss.v012.i06.

Kuhn, Max. 2008. “Building Predictive Models in R Using the ‘Caret‘ Package.” Journal of Statistical Software 28 (5): 1–26. https://doi.org/10.18637/jss.v028.i05.

Posit Team. 2025. RStudio: Integrated Development Environment for R. Boston, MA: Posit Software, PBC. https://posit.co/.

Robin, Xavier, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez, and Markus Müller. 2011. “pROC: An Open-Source Package for R and S+ to Analyze and Compare ROC Curves.” BMC Bioinformatics 12: 77. https://doi.org/10.1186/1471-2105-12-77.

Therneau, Terry M. 2024. A Package for Survival Analysis in R. https://CRAN.R-project.org/package=survival.

Therneau, Terry, and Beth Atkinson. 2025. rpart: Recursive Partitioning and Regression Trees. https://CRAN.R-project.org/package=rpart.

Xie, Yihui. 2025. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.

Zhan, Tingting, and Inna Chervoneva. 2025a. groupedHyperframe: Grouped Hyper Data Frame: An Extension of Hyper Data Frame. https://CRAN.R-project.org/package=groupedHyperframe.

———. 2025b. maxEff: Additional Predictor with Maximum Effect Size. https://CRAN.R-project.org/package=maxEff.

1 Introduction

1.1 Prerequisite

1.2 Getting Started

1.3 Manuscript in Progress

1.3.1 [Paper 1]

1.3.1.1 [Paper 1; section 1]

1.3.1.2 [Paper 1; section 2]

1.3.1.3 [Paper 1; section 3]

1.4 Acknowledgement

2 Examples

2.1 Non-Spatial Example

3 Data Partition

4 Starting Model

5 Adding numeric Predictor

6 Adding logical Predictor

6.1 Naive Practice

6.2 via Repeated Partitions

7 Appendix

7.1 About

7.1.1 Environment

7.1.2 Dependency

7.2 node1()

7.2.1 numeric Predictor

7.2.2 factor Predictor

7.3 statusPartition()

7.4 'add_numeric_'

7.5 'add_dummy_'

8 What We Don’t Do

8.1 Use of “Optim”

9 Terms & Abbreviations

10 References

5 Adding `numeric` Predictor

6 Adding `logical` Predictor

7.2 `node1()`

7.2.1 `numeric` Predictor

7.2.2 `factor` Predictor

7.3 `statusPartition()`

7.4 `'add_numeric_'`

7.5 `'add_dummy_'`