Dr. Chukwuebuka Ogwo
2/5/2026
Introduction
Model performance can be highly dependent on how input predictors are encoded.
Sensitivity to different encodings varies across model types: tree-based methods are notably insensitive, whereas linear regression is not.
Predictor encoding, called feature engineering, is done as a data preprocessing step prior to model fitting.
Feature engineering generally includes additions, deletions, and transformations of training set data.
Typically, there are many different encodings from which to choose.
Optimal feature engineering depends on the (1) model type and (2) true relationship between the predictors and outcome.
The most effective approaches are often informed by scientific understanding of the problem rather than by purely algorithmic processing.
The fit() and resample() functions in
the MachineShop R package support model specification with recipe
objects supplied by the recipes package (Kuhn and Wickham,
2020).
Recipes define the predictor and outcome variables to be modeled, the training data, and the preprocessing steps to be applied to them; their general syntax is shown below.
Recipe Syntax
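A minimal, self-contained illustration of this general syntax is given below; the data frame df and the variables y, x1, and x2 are placeholders rather than study data.
library(recipes)
### General recipe syntax with placeholder data and steps
df <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = runif(10))
rec <- recipe(y ~ x1 + x2, data = df) %>%               # outcome ~ predictors, with training data
  step_center(all_numeric(), -all_outcomes()) %>%       # steps are applied in order of appearance
  step_scale(all_numeric(), -all_outcomes())
summary(rec)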
Recipe Elements
- recipe(): defines the ingredients with which to create a recipe for preprocessing data.
- formula: model formula of outcome and predictor variables. Dots are allowed; in-line functions are not.
- data: data frame containing the variables.
- %>%: forward pipe operator for adding preprocessing steps to the recipe.
- step_*(): preprocessing functions to be applied to the data in their order of appearance.
- ...: variables to which to apply the preprocessing. They may be specified using one or more of their data frame names or with the selector functions below.
  - starts_with(), ends_with(), contains(), matches(), num_range(), all_of(), any_of(), everything()
  - all_predictors(), all_outcomes(), has_role()
  - all_numeric(), all_nominal(), has_type()

| Syntax | Types* | Description |
|---|---|---|
| recipe(formula, data) | - | Create a recipe for preprocessing data |
| check_missing() | A | Check for missing values |
| step_factor2string() | F | Convert factors to strings |
| step_intercept() | - | Add intercept column |
| step_naomit() | A | Remove cases with missing values |
| step_novel() | F | Simple value assignments for novel factor levels |
| step_num2factor() | N | Convert numbers to factors |
| step_ordinalscore() | F | Convert ordinal factors to numeric scores |
| step_string2factor() | C | Convert strings to factors |
| step_unorder() | F | Convert ordered to unordered factors |
| summary() | A | Summarize a recipe |

*Types of variables to which a function applies: A = all, C = character, F = factor, N = numeric.
In the following example code, recipe Caries_recipe
is created with the Caries data from the IFS Study.
A step is added to check for missing values in the outcome
variable AdjDFS_I and, if present, return an error message
when the data are processed.
Missing values in the outcome will cause problems with model fitting.
An alternative to excluding such observations is to impute their values, if keeping them in the analysis is desired.
Recall that some of the variables are stored as numeric in the original dataset.
The role_case() step function in the recipe is from the MachineShop package and is included to designate AdjDFS_I as a stratification variable in calls to resample() for resampled estimation of prediction performance.
Some variables whose values represent categorical levels are
converted in the recipe to nominal and ordinal factors with the
step_num2factor() function.
The conversion requires that the values be consecutive integers
starting at 1 or be transformed to such with an appropriate function
supplied to the transform argument.
Finally, a step to remove cases with missing predictor values is added to create recipe Caries_recipe_naomit.
library(MachineShop)
library(recipes)
## Load data and libraries
setwd("/Users/damia/OneDrive - Harvard University/HARVARD HSDM/CREATING NEW COURSE")
#setwd("/Users/cho379/OneDrive-Harvard University/HARVARD HSDM/CREATING NEW COURSE")
Caries <- read.csv("ML_Week 1/Caries_data.csv")
## Dataset without preprocessing
train_indices <- sample(nrow(Caries), nrow(Caries) * 2 / 3)
trainset <- Caries[train_indices, ]
testset <- Caries[-train_indices, ]
## Global resample control for comparability of results
control <- CVControl(folds = 10, seed = 123)
Caries_recipe <- recipe(AdjDFS_I ~ MotherEduc + Income + Female1 + Total_mgF + Homeppm + Brushingfreq +
                          Waterbase + Milk + Juice100 + SSB, data = trainset) %>%
  role_case(stratum = AdjDFS_I) %>%
  check_missing(AdjDFS_I) %>%
  step_num2factor(MotherEduc, transform = function(x) x + 1,
                  levels = c("1", "2", "3", "4", "5", "6"), ordered = TRUE) %>%
  step_num2factor(Income, levels = c("Lowest", "Low", "Medium", "High", "Highest"), ordered = TRUE) %>%
  step_num2factor(Female1, transform = function(x) x + 1, levels = c("M", "F"))
## Remove cases with missing predictor values
Caries_recipe_naomit <- Caries_recipe %>%
  step_naomit(all_predictors())
summary(prep(Caries_recipe))
juice(prep(Caries_recipe))
Dummy Variables
| Income | N | Income_Low | Income_Medium | Income_High | Income_Highest |
|---|---|---|---|---|---|
| Lowest | 4 | 0 | 0 | 0 | 0 |
| Low | 18 | 1 | 0 | 0 | 0 |
| Medium | 18 | 0 | 1 | 0 | 0 |
| High | 16 | 0 | 0 | 1 | 0 |
| Highest | 52 | 0 | 0 | 0 | 1 |
Dummy Variable Recipe Functions
| Syntax | Types | Description |
|---|---|---|
| step_bin2factor() | N | Create factors from dummy variables |
| step_dummy() | F | Dummy variable creation |
| step_regex() | F | Create dummy variables with regular expressions |
| dummy_names() | - | Naming tools |
A recipe step is added to create dummy variables for all nominal (and ordinal) factors in the training data.
At this point, a complete case analysis is being performed by extending the Caries_recipe_naomit recipe, as sketched in the code below.
The step_dummy() function excludes the first dummy
variable in each set by default.
Dummy variables are 0/1 indicators for unordered factors and polynomial contrasts for ordered factors.
Specification of all_nominal() and
-all_outcomes() ensures that only factor variables among
the predictors are processed.
Dummy variables are not needed for the outcome variable.
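A minimal sketch of this step, extending Caries_recipe_naomit and naming the result Car_rec to match the fitting code that follows:
### Dummy variables for nominal and ordinal factor predictors
Car_rec <- Caries_recipe_naomit %>%
  step_dummy(all_nominal(), -all_outcomes())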
fit() and resample() functions
An unprocessed recipe can be passed to the fit() and resample() functions.
The fit() function processes the recipe on the full
dataset and resample() on each resampled dataset of the
resampling algorithm.
Observed and predicted outcomes on new (unprocessed) data can be
obtained with the predict() and response()
functions and compared with performance().
### Model fitting and resampling
modelFit <- fit(Car_rec, model = GLMModel)
res <- resample(Car_rec, model = GLMModel, control = control)
summary(res)
### Prediction and performance on new data
obs <- response(modelFit, newdata = testset)
pred <- predict(modelFit, newdata = testset)
performance(obs, pred)
There may be advantages to removing some predictors prior to modeling.
Fewer predictors require less computational time and lead to more interpretable models.
Removal of weakly or non-informative variables can improve model stability and performance.
Two common motivations for removal, discussed below, are degenerate (zero or near-zero variance) predictor distributions and multicollinearity among predictors.
Preprocessing steps to remove variables are referred to as filtering.
There must be variability in the values of a predictor in order for it to have an association with the outcome.
Predictors that have a single value, and hence zero variance, are said to have degenerate distributions.
Degenerate predictors should be removed since they have no predictive ability and can cause computational errors in some models.
Similarly, the removal of predictors with near-zero variance may also be advantageous.
This is not always the case though; e.g., in genetics studies of rare diseases, rare genetic variants may be of primary interest as predictors.
One filtering approach for near-zero variance predictors is as follows.
Near-Zero Variance Filter Algorithm
Remove predictors for which (1) the percentage of unique values is below a specified threshold and (2) the ratio of the frequency of the most common value to that of the second most common value exceeds a specified threshold.
The algorithm can be applied to numeric or categorical predictors.
Zero variance is an extreme case of near-zero variance.
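These criteria can be sketched directly in R; the following is an illustration only (the helper name near_zero_var is hypothetical), with cutoffs chosen to mirror the step_nzv() defaults described below.
### Illustrative near-zero variance check (not the recipes implementation)
near_zero_var <- function(x, unique_cut = 10, freq_cut = 95 / 5) {
  sapply(x, function(v) {
    tab <- sort(table(v), decreasing = TRUE)
    freq_ratio <- if (length(tab) > 1) tab[1] / tab[2] else Inf  # most / second most common value
    pct_unique <- 100 * length(unique(v)) / length(v)            # percent of unique values
    pct_unique < unique_cut && freq_ratio > freq_cut
  })
}
### Example: flag near-zero variance variables in the training set
# names(trainset)[near_zero_var(trainset)]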
Multicollinearity is a condition in which there are high degrees of correlation among multiple predictors.
High correlation suggests overlap in information provided by the predictors which can lead to unnecessarily complex models.
Moreover, for some models, multicollinearity can lead to unstable parameter estimates, numerical errors, and degraded prediction performance.
One remedy is to remove the minimum number of predictors necessary to ensure that all pairwise correlations are below a specified threshold.
High Correlation Filter Algorithm
1. Calculate the pairwise correlation matrix for the predictors.
2. Identify the two predictors, say x and x′, with the largest absolute correlation.
3. Compute the average correlation between each of x and x′ and the other predictors.
4. Remove the one with the largest average correlation.
5. Repeat the previous steps until no absolute correlations are above the threshold.
As typically implemented, this algorithm requires numeric predictors for the calculation of pairwise correlations.
Different types of correlation could be computed, including the traditional Pearson correlation as well as Spearman and Kendall rank correlations.
Removing multicollinearity may significantly improve model performance.
However, overlapping information due to non-linear associations between predictors will not necessarily be addressed.
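The algorithm can also be sketched directly in R for numeric predictors; this is an illustration only (the helper name high_corr_filter is hypothetical), and step_corr() below provides the recipes implementation.
### Illustrative high correlation filter (not the recipes implementation)
high_corr_filter <- function(x, threshold = 0.90, method = "pearson") {
  repeat {
    cors <- abs(cor(x, method = method))
    diag(cors) <- 0
    if (max(cors) <= threshold) break                                # stop when no pair exceeds the threshold
    pair <- which(cors == max(cors), arr.ind = TRUE)[1, ]            # most highly correlated pair
    worst <- pair[which.max(rowMeans(cors[pair, , drop = FALSE]))]   # larger average absolute correlation
    x <- x[, -worst, drop = FALSE]                                   # remove that predictor
  }
  x
}
### Example with a few of the numeric predictors
# filtered <- high_corr_filter(trainset[, c("Total_mgF", "Homeppm", "SSB")], threshold = 0.90)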
Clustering predictors and extracting the medoids with the
MachineShop step_kmedoids() function is another example of
filtering.
Filter Recipe Functions
| Category | Syntax | Types | Description |
|---|---|---|---|
| Degenerate Distribution | step_nzv() | FN | Near-zero variance filter |
| | step_zv() | FN | Zero variance filter |
| Multicollinearity | step_corr() | N | High correlation filter |
| | step_lincomb() | N | Linear combination filter |
| Attributes | step_rm() | A | General variable filter |
A near-zero variance filter on all predictors is added in the first recipe.
The step_nzv() function has default thresholds of 10% for the percent of unique values and 95/5 for the ratio of the most common to the second most common value.
The second recipe adds a pairwise correlation filter on all numeric predictors and sets the absolute correlation threshold at 0.90.
Dummy variables are created after filtering in each case.
### Near-zero variance filter
Car_rec1 <- Caries_recipe_naomit %>%
step_nzv(all_predictors()) %>%
step_dummy(all_nominal(), -all_outcomes())
### High correlation filter
Car_rec2 <- Caries_recipe_naomit %>%
step_corr(all_numeric(), -all_outcomes(), threshold = 0.9) %>%
step_dummy(all_nominal(), -all_outcomes())
In dental public health research, some values of the predictors may be missing.
Missing values may be structural or indeterminable at the time of model building.
Distinctions are made between types of missing data.
Missing completely at random (MCAR): the missingness of data is unrelated to any study variable; data are rarely MCAR.
Missing at random (MAR): missingness can be fully accounted for by non-missing variables.
Missing not at random (MNAR): missing data that are neither MAR nor MCAR; also known as nonignorable nonresponse.
Do not confuse missing data with censored data.
Approaches for dealing with missing data include removal of incomplete cases and imputation of missing values; several imputation methods are described below.
Imputation has been studied extensively in the context of statistical inference.
However, imputation for predictive modeling is a separate issue in which prediction accuracy, rather than valid statistical inference, is the goal.
Mean imputation simply replaces missing values in a numeric variable with the mean of its observed values.
Nominal predictors can be handled by substituting their modes.
Addition of a categorical level for missing values is another approach, but it is not advisable due to the potential for bias (Jones, 1996).
Advantages: simple and fast to compute.
Disadvantages: assumes MCAR; attenuates associations among variables.
Mean and mode imputation are not generally recommended for multivariate analyses.
This imputation method replaces missing values in a numeric or nominal variable with predictions from a bagged tree fit with the remaining variables as predictors.
Prediction from bagged trees is an aggregation over decision trees individually fit to resampled training datasets.
Bagged trees reduce prediction variance and accommodate mixtures of numeric and nominal predictors.
Advantages: accommodates mixtures of numeric and nominal predictors; reduced prediction variance relative to a single tree.
Disadvantages: assumes MAR; computationally slow; more tuning parameters.
An effective machine learning tool, although it has significant computational and tuning costs.
This method replaces each missing value in a numeric or nominal variable with the mean or mode, respectively, of the K closest neighbors.
Closeness is relative to the remaining variables and can be computed on mixtures of numeric and nominal variables with Gower's distance (Gower, 1971).
Advantages: accommodates mixtures of numeric and nominal variables via Gower's distance.
Disadvantages: assumes MAR; results depend on the choice of K.
Overall, an attractive choice for machine learning.
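As an illustration of the distance calculation, the sketch below computes Gower's distance between two observations with one numeric and one nominal variable; the helper gower_dist, the observation values, and the range used for scaling are hypothetical.
### Gower's distance: range-scaled differences for numeric variables and
### simple mismatch (0/1) for nominal variables, averaged across variables
gower_dist <- function(a, b, ranges) {
  d <- mapply(function(x, y, r) {
    if (is.numeric(x)) abs(x - y) / r  # numeric contribution
    else as.numeric(x != y)            # nominal contribution
  }, a, b, ranges)
  mean(d)
}
### Hypothetical example
obs1 <- list(Total_mgF = 0.8, Income = "Low")
obs2 <- list(Total_mgF = 0.3, Income = "High")
gower_dist(obs1, obs2, ranges = list(Total_mgF = 2.0, Income = NA))  # (0.25 + 1) / 2 = 0.625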
Imputation Recipe Functions
| Syntax | Types | Description |
|---|---|---|
| step_bagimpute() | FN | Imputation via bagged trees |
| step_knnimpute() | FN | Imputation via K-nearest neighbors |
| step_lowerimpute() | N | Impute numeric data below the threshold of measurement |
| step_meanimpute() | N | Impute numeric data using the mean |
| step_modeimpute() | F | Impute nominal data using the most common value |
Note that the outcome variable is not excluded from the imputations.
The recipe is designed so that the outcome is required to have no missing values.
By default, all predictors are utilized in the bagged tree and K-nearest neighbors imputations.
K-nearest neighbors is being conducted with K = 5.
Dummy variables are created at the end of each recipe so that imputation is performed on the original categorical levels.
### Imputation using mean or mode
Car_rec3 <- Caries_recipe %>%
step_meanimpute(all_numeric()) %>%
step_modeimpute(all_nominal()) %>%
step_dummy(all_nominal(), -all_outcomes())
res_meanimpute <- resample(Car_rec3, model = GLMModel, control = control)
### Imputation via bagged trees
Car_rec4 <- Caries_recipe %>%
step_bagimpute(all_predictors()) %>%
step_dummy(all_nominal(), -all_outcomes())
res_bagimpute <- resample(Car_rec4, model = GLMModel, control = control)
### Imputation via K-nearest neighbors
Car_rec5 <- Caries_recipe %>%
step_knnimpute(all_predictors(), neighbors = 5) %>%  # K = 5 nearest neighbors
step_dummy(all_nominal(), -all_outcomes())
res_knnimpute <- resample(Car_rec5, model = GLMModel, control = control)
### Imputation with predictor-specific methods
Car_rec6 <- Caries_recipe %>%
step_modeimpute(all_nominal()) %>%
step_meanimpute(Total_mgF) %>%
step_knnimpute(MotherEduc, Income, Female1, Homeppm, Brushingfreq, Waterbase, Milk, Juice100, SSB) %>%
step_dummy(all_nominal(), -all_outcomes())
res_customimpute <- resample(Car_rec6, model = GLMModel, control = control)
### Compare method performances
res <- c("mean" = res_meanimpute, "bag" = res_bagimpute, "knn" = res_knnimpute, "custom" = res_customimpute)
summary(res)
Data Transformations for Individual Predictors I
Transformations of individual predictors may be desired or needed for planned modeling approaches.
A few types of transformations are summarized below.
Centering and Scaling: improves numerical stability of some calculations and performance of some models.
step_center() is used to subtract the mean from numeric
predictors.
step_scale() is used to standardize the spread
(variance) of numeric predictors by dividing each variable by its
standard deviation.
step_normalize() performs both centering and scaling (z-score standardization) in a single step.
Skewness: may need to be reduced to deemphasize the effects of observations in the tails of their distributions.
Basis Expansions: may allow for more flexible modeling of predictor effects.
Univariate Transformations: can be applied based on subject-matter knowledge or modeling strategies.
| Category | Syntax | Types | Description |
|---|---|---|---|
| Centering and Scaling | step_center() | N | Centering numeric data |
| | step_scale() | N | Scaling numeric data |
| | step_normalize() | N | Centering and scaling numeric data |
| | step_range() | N | Scaling numeric data to a specific range |
| Skewness | step_BoxCox() | N | Box-Cox transformation (non-negative data) |
| | step_YeoJohnson() | N | Yeo-Johnson transformation |
| Basis Expansions | step_bs() / step_ns() | N | Spline basis functions |
| | step_poly() | N | Orthogonal polynomial basis functions |
| Univariate Transformations | step_discretize() | N | Discretize numeric variables |
| | step_hyperbolic() | N | Hyperbolic transformation |
| | step_invlogit() | N | Inverse logit transformation |
| | step_log() | N | Log transformation |
| | step_logit() | N | Logit transformation |
| | step_other() | F | Collapse some categorical levels |
| | step_relu() | N | Apply rectified linear transformation |
| | step_sqrt() | N | Square root transformation |
| | step_window() | N | Moving window functions |
Recipes are given to illustrate transformations of individual predictors.
Yeo-Johnson transformations are applied to make the numeric variable distributions more symmetric.
Finally, in the last two recipes, centering and scaling of the numeric variables are applied before and after ordinal scoring of MotherEduc, respectively.
# Add one of the imputation methods to the base recipe
Car_rec_impute <- Caries_recipe %>% step_knnimpute(all_predictors())
### Yeo-Johnson transformation
Car_rec7 <- Car_rec_impute %>%
step_YeoJohnson(all_numeric(), -all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes())
### Centering and scaling (before and after ordinal scoring)
Car_rec8 <- Car_rec_impute %>%
step_center(all_numeric(), -all_outcomes()) %>%
step_scale(all_numeric(), -all_outcomes()) %>%
step_ordinalscore(MotherEduc) %>%
step_dummy(all_nominal(), -all_outcomes())
Car_rec9 <- Car_rec_impute %>%
step_ordinalscore(MotherEduc) %>%
step_center(all_numeric(), -all_outcomes()) %>%
step_scale(all_numeric(), -all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes())
Transformations on groups of predictors can help to resolve outliers, reduce the dimensionality of data, or account for interactions.
Data entry errors are often the cause of outliers and should be checked for as a first step.
Removal of outlying subjects is not recommended unless they are confirmed to come from a population for which prediction is not of interest.
Some models are less affected by outliers.
Otherwise, transformations such as spatial sign are available to draw outliers closer to the rest of the observations.
To reduce dimensionality, methods such as principal components analysis (PCA) can be employed to generate a smaller set of predictors that capture much of the original information.
There may be interest in modeling interaction terms to explicitly account for multiplicative effects of predictors.
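For example, an interaction between two of the numeric predictors could be added with the step_interact() function; the pairing of Total_mgF and Homeppm below is an arbitrary illustration built on the Car_rec_impute recipe defined earlier.
### Interaction between two numeric predictors (illustrative pairing)
Car_rec_int <- Car_rec_impute %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_interact(terms = ~ Total_mgF:Homeppm)  # creates the term Total_mgF_x_Homeppm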
Recipe functions for transformations of multiple predictors are listed below:
Multivariate Transformation Recipe Functions
| Category | Syntax | Types | Description |
|---|---|---|---|
| Outliers | step_spatialsign() | N | Spatial sign preprocessing |
| Dimension Reduction | step_ica() | N | ICA signal extraction |
| | step_isomap() | N | Isomap embedding |
| | step_kpca() | N | Kernel PCA signal extraction |
| | step_pca() | N | PCA signal extraction |
| Interaction | step_interact() | N | Create interaction variables |
| | step_ratio() | N | Ratio variable creation |
In the syntax below, PCA is applied to reduce the number of predictors to a smaller set.
Dummy variables are created beforehand so that categorical variables are included in the PCA.
Centering and scaling are not performed by the
step_pca() function by default and thus are explicitly
included as recipe steps.
We set the threshold at 0.75, which means that the recipe will retain enough components to capture 75% of the variance in the variables.
Alternatively, there is a num argument in the PCA
step function that can be used to specify the exact number of components
to retain (default: 5).
Clustering and averaging of predictors within clusters with the
MachineShop step_kmeans() function is another example of
dimension reduction.
### Data reduction with principal components
Car_rec10 <- Car_rec_impute %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_center(all_numeric(), -all_outcomes()) %>%
step_scale(all_numeric(), -all_outcomes()) %>%
step_pca(all_numeric(), -all_outcomes(), threshold = 0.75)
res2 <- resample(Car_rec10, model = GLMModel, control = control)
summary(res2)
Binning or categorization is the process of turning a continuous variable into a categorical one.
While binning may aid in interpretability, there are considerable downsides in the context of predictive modeling.
Avoid manual binning of variables!
Data preprocessing can have a large impact on the final performance of a predictive model.
Preprocessing steps should be included in resampling algorithms to obtain proper estimates of prediction performance.
The optimal types and combination (recipe) of preprocessing steps depend on the scientific problem to be addressed and the models to be fit.
Different recipes can be implemented and compared with respect to their impacts on prediction performance.
This can be accomplished with recipe step functions provided by the recipes and MachineShop R packages.
Some custom step functions may also be implemented to extend the packages and develop novel preprocessing approaches.
Gower JC (1971). A general coefficient of similarity and some of its properties. Biometrics 27: 857-871.
Jones MP (1996). Indicator and stratification methods for missing explanatory variables in multiple linear regression. Journal of the American Statistical Association 91: 222-230.
Kuhn M and Johnson K (2013). Applied Predictive Modeling, Chapter 3. New York: Springer.
Kuhn M and Wickham H (2020). recipes: Preprocessing Tools to Create Design Matrices. https://CRAN.R-project.org/package=recipes
van Buuren S and Groothuis-Oudshoorn K (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software 45(3): 1-67. http://www.jstatsoft.org/v45/i03/