Week 2: Data Preprocessing and Model Tuning

Dr. Chukwuebuka Ogwo

2/5/2026

Data Preprocessing

Introduction

Recipes I

Recipe Syntax

recipe(formula, data) %>%
step_1(...) %>%
step_2(...) %>%
...
step_N(...)
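
As a hedged illustration of this syntax, the sketch below builds a two-step recipe on R's built-in mtcars data. The dataset and the choice of predictors are assumptions made only so the example runs on its own; they are not part of the IFS study.

```r
library(recipes)

# mtcars ships with R and stands in for the course data here
rec <- recipe(mpg ~ disp + hp + wt, data = mtcars) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes())

# prep() estimates the means and standard deviations from the training data
rec_prepped <- prep(rec, training = mtcars)

# juice() returns the preprocessed training set
processed <- juice(rec_prepped)
head(processed)
```

Each step is only declared when the recipe is written; nothing is computed until prep() is called.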

Recipe Elements

Recipes II

General Recipe Functions

Syntax                  Types*  Description
recipe(formula, data)   -       Create a recipe for preprocessing data
check_missing()         A       Check for missing values
step_factor2string()    F       Convert factors to strings
step_intercept()        -       Add intercept column
step_naomit()           A       Remove cases with missing values
step_novel()            F       Simple value assignments for novel factor levels
step_num2factor()       N       Convert numbers to factors
step_ordinalscore()     F       Convert ordinal factors to numeric scores
step_string2factor()    C       Convert strings to factors
step_unorder()          F       Convert ordered to unordered factors
summary()               A       Summarize a recipe

* Variable types the function applies to: A = all, C = character, F = factor, N = numeric.

IFS Study Example I

## Load libraries and data
library(MachineShop)
library(recipes)
setwd("/Users/damia/OneDrive - Harvard University/HARVARD HSDM/CREATING NEW COURSE")
Caries <- read.csv("ML_Week 1/Caries_data.csv")

## Split the data: two-thirds for training, one-third for testing
set.seed(123)  # fix the random split so the results are reproducible
train_indices <- sample(nrow(Caries), floor(nrow(Caries) * 2 / 3))
trainset <- Caries[train_indices, ]
testset <- Caries[-train_indices, ]

## Global resample control for comparability of results
control <- CVControl(folds = 10, seed = 123)

Caries_recipe <- recipe(AdjDFS_I ~ MotherEduc + Income + Female1 + Total_mgF + Homeppm +
                          Brushingfreq + Waterbase + Milk + Juice100 + SSB,
                        data = trainset) %>%
  role_case(stratum = AdjDFS_I) %>%
  check_missing(AdjDFS_I) %>%
  #step_naomit(MotherEduc, Income) %>%
  step_num2factor(MotherEduc, transform = function(x) x + 1,
                  levels = c("1", "2", "3", "4", "5", "6"), ordered = TRUE) %>%
  step_num2factor(Income, levels = c("Lowest", "Low", "Medium", "High", "Highest"),
                  ordered = TRUE) %>%
  step_num2factor(Female1, transform = function(x) x + 1, levels = c("M", "F"))

## Estimate the recipe on the training set, then inspect the result
Caries_prep <- prep(Caries_recipe)
summary(Caries_prep)
juice(Caries_prep)

2 Adding Predictors

Dummy Variables

Income   N   Normal  Fixed Defect  Reversible Defect
Lowest   4   1       0             0
Low      18  0       1             0
Medium   18  0       0             1
High     16  0       1             0
Highest  52  1       0             0
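
As a rough sketch of how indicator (dummy) variables are built from a factor, the base R code below expands a five-level income factor with model.matrix(). The toy vector is made up for illustration, and the use of treatment contrasts on an unordered factor is an assumption; for an ordered factor R uses polynomial contrasts by default, so the columns would look different.

```r
# Hypothetical income values, one per level (not taken from the IFS data)
income_levels <- c("Lowest", "Low", "Medium", "High", "Highest")
income <- factor(income_levels, levels = income_levels)

# Treatment contrasts: the first level ("Lowest") is the reference and gets
# no column; every other level gets one 0/1 indicator column
dummies <- model.matrix(~ income)[, -1]
colnames(dummies) <- levels(income)[-1]
rownames(dummies) <- as.character(income)
dummies
```

A factor with five levels therefore needs only four indicator columns; the reference level is the row of all zeros.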

Dummy Variable Recipe Functions

Syntax             Types  Description
step_bin2factor()  N      Create factors from dummy variables
step_dummy()       F      Dummy variables creation
step_regex()       F      Create dummy variables with regular expressions
dummy_names()      -      Naming tools

IFS Study Example I

## Exclude cases with missing observations
Caries_recipe_naomit <- Caries_recipe %>% step_naomit(all_predictors())
Caries_recipe_naomit

### Dummy variables creation
(Car_rec <- Caries_recipe_naomit %>%
    step_dummy(all_nominal(), -all_outcomes()))
summary(prep(Car_rec))

Processing Recipe under fit() and resample() functions

### Model fitting and resampling
modelFit <- fit(Car_rec , model = GLMModel)
res <- resample(Car_rec , model = GLMModel, control = control)
summary(res)

### Prediction and performance on new data
obs <- response(modelFit, newdata = testset)
pred <- predict(modelFit, newdata = testset)
performance(obs, pred)

3 Removing Predictors

Removing Predictors

Degenerate Distributions I

Near-Zero Variance Filter Algorithm

Multicollinearity I

High Correlation Filter Algorithm

1. Calculate the pairwise correlation matrix for the predictors.
2. Identify the two predictors, say x and x′, with the largest absolute correlation.
3. Compute the average correlation between each of x and x′ and the other predictors.
4. Remove the one with the largest average correlation.
5. Repeat the previous steps until no absolute correlations are above the threshold.
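
The steps of the high correlation filter can be sketched in base R as follows. This is a minimal illustration, not the code used by step_corr(): the function name high_corr_filter() and the toy data are hypothetical, averages are taken over absolute correlations (an assumption), and when only two predictors remain the first of the pair is dropped.

```r
# Sketch of the high correlation filter for a data frame of numeric predictors
high_corr_filter <- function(x, threshold = 0.9) {
  x <- as.data.frame(x)
  while (ncol(x) > 1) {
    # Step 1: pairwise absolute correlations, ignoring the diagonal
    cors <- abs(cor(x))
    diag(cors) <- 0
    # Step 5: stop once no absolute correlation exceeds the threshold
    if (max(cors) <= threshold) break
    # Step 2: the pair with the largest absolute correlation
    pair <- unname(which(cors == max(cors), arr.ind = TRUE)[1, ])
    # Step 3: average absolute correlation of each with the other predictors
    others <- setdiff(seq_len(ncol(x)), pair)
    avg <- if (length(others) > 0) {
      sapply(pair, function(j) mean(cors[j, others]))
    } else {
      c(0, 0)  # no other predictors left: drop the first of the pair
    }
    # Step 4: remove the member of the pair with the larger average
    x <- x[, -pair[which.max(avg)], drop = FALSE]
  }
  x
}

# Toy data: b is almost a copy of a; c and d are unrelated noise
set.seed(1)
a <- rnorm(100)
toy <- data.frame(a = a, b = a + rnorm(100, sd = 0.05),
                  c = rnorm(100), d = rnorm(100))
kept <- high_corr_filter(toy, threshold = 0.9)
names(kept)
```

Only one of a and b survives the filter, while c and d are kept because none of their correlations exceed the threshold.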

IFS Study Example I

Filter Recipe Functions

Category                 Syntax          Types  Description
Degenerate Distribution  step_nzv()      FN     Near-zero variance filter
                         step_zv()       FN     Zero variance filter
Multicollinearity        step_corr()     N      High correlation filter
                         step_lincomb()  N      Linear combination filter
Attributes               step_rm()       A      General variable filter

### Near-zero variance filter
Car_rec1 <- Caries_recipe_naomit %>%
  step_nzv(all_predictors()) %>%
  step_dummy(all_nominal(), -all_outcomes())
### High correlation filter
Car_rec2 <- Caries_recipe_naomit %>%
  step_corr(all_numeric(), -all_outcomes(), threshold = 0.9) %>%
  step_dummy(all_nominal(), -all_outcomes())
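
To make the near-zero variance filter more concrete, here is a base R sketch of the kind of rule described by Kuhn and Johnson (2013): flag a predictor when its most common value is far more frequent than the second most common and it has few distinct values. The function is_nzv(), the cutoffs (a frequency ratio of 95/5 and 10 percent unique values), and the toy data are assumptions for illustration, not the internals of step_nzv().

```r
# Flag a predictor as near-zero variance when one value dominates
# (large frequency ratio) and there are few distinct values
is_nzv <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(TRUE)  # a single value: zero variance
  freq_ratio <- counts[[1]] / counts[[2]]          # most vs. second most common
  pct_unique <- 100 * length(counts) / length(x)   # distinct values, in percent
  freq_ratio > freq_cut && pct_unique <= unique_cut
}

# Toy predictors (hypothetical)
toy <- data.frame(
  rare_flag = c(rep(0, 98), 1, 1),  # 98:2 split, frequency ratio 49
  balanced  = rep(c(0, 1), 50),     # 50:50 split, frequency ratio 1
  constant  = rep(7, 100)           # one value only
)
flags <- sapply(toy, is_nzv)
flags
```

Here rare_flag and constant are flagged for removal, and balanced is kept.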

4 Dealing with Missing Values

Dealing with Missing Values I

Mean Imputation

Advantages

Disadvantages

Bagged Tree Imputation

Advantages

Disadvantages

K-Nearest Neighbors Imputation

This method replaces each missing value in a numeric or nominal variable with the mean or mode, respectively, of the K closest neighbors.

Closeness is measured on the remaining variables and can be computed on mixtures of numeric and nominal variables with Gower's distance (Gower, 1971).
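
The idea can be sketched in base R as below. This is a simplified illustration under stated assumptions, not the algorithm inside the recipes step: the helper names gower_to_row() and knn_impute() and the toy data are hypothetical, numeric differences are scaled by the observed range, and variables missing in either row are skipped when averaging.

```r
# Gower distance between row i and every row of df (numeric and factor columns)
gower_to_row <- function(df, i) {
  num <- den <- numeric(nrow(df))
  for (v in names(df)) {
    col <- df[[v]]
    if (is.numeric(col)) {
      rng <- diff(range(col, na.rm = TRUE))
      d <- abs(col - col[i]) / max(rng, .Machine$double.eps)  # range-scaled
    } else {
      d <- as.numeric(col != col[i])                           # 0 if equal, 1 if not
    }
    ok <- !is.na(d)            # skip variables missing in either row
    num[ok] <- num[ok] + d[ok]
    den[ok] <- den[ok] + 1
  }
  num / den
}

# Replace missing values of `target` with the mean (numeric) or mode (nominal)
# of its k nearest complete neighbors
knn_impute <- function(df, target, k = 5) {
  out <- df[[target]]
  others <- df[, setdiff(names(df), target), drop = FALSE]
  donors <- which(!is.na(df[[target]]))
  for (i in which(is.na(df[[target]]))) {
    d <- gower_to_row(others, i)
    nearest <- donors[order(d[donors])][seq_len(min(k, length(donors)))]
    vals <- df[[target]][nearest]
    out[i] <- if (is.numeric(vals)) mean(vals) else names(which.max(table(vals)))
  }
  out
}

# Toy data (hypothetical): the third income value is missing
toy <- data.frame(
  age    = c(20, 21, 22, 60, 61, 62),
  sex    = factor(c("F", "F", "F", "M", "M", "M")),
  income = c(10, 12, NA, 50, 52, 54)
)
imputed_income <- knn_impute(toy, "income", k = 2)
imputed_income
```

With k = 2, the missing income is filled with the average of the two young female rows, which are closest on age and sex.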

Advantages

Disadvantages

An attractive choice for machine learning

Imputation Recipe Functions

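Syntax              Types  Description
step_bagimpute()    FN     Imputation via bagged trees
step_knnimpute()    FN     Imputation via K-nearest neighbors
step_lowerimpute()  N      Impute numeric data below the threshold of measurement
step_meanimpute()   N      Impute numeric data using the mean
step_modeimpute()   F      Impute nominal data using the most common value

In recipes 0.1.16 and later these steps are named step_impute_bag(), step_impute_knn(), step_impute_lower(), step_impute_mean(), and step_impute_mode(); the names above are the older, deprecated forms.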

Imputation Example with IFS Study

### Imputation using mean or mode
Car_rec3 <- Caries_recipe %>%
  step_meanimpute(all_numeric()) %>%
  step_modeimpute(all_nominal()) %>%
  step_dummy(all_nominal(), -all_outcomes())

res_meanimpute <- resample(Car_rec3, model = GLMModel, control = control)

### Imputation via bagged trees
Car_rec4 <- Caries_recipe %>%
  step_bagimpute(all_predictors()) %>%
  step_dummy(all_nominal(), -all_outcomes())

res_bagimpute <- resample(Car_rec4, model = GLMModel, control = control)

### Imputation via K-nearest neighbors
Car_rec5 <- Caries_recipe %>%
  step_knnimpute(all_predictors(), neighbors = 5) %>%
  step_dummy(all_nominal(), -all_outcomes())

res_knnimpute <- resample(Car_rec5, model = GLMModel, control = control)

Imputation with predictor-specific methods

### Imputation with predictor-specific methods
Car_rec6 <- Caries_recipe %>%
  step_modeimpute(all_nominal()) %>%
  step_meanimpute(Total_mgF) %>%
  step_knnimpute(MotherEduc, Income, Female1, Homeppm, Brushingfreq,
                 Waterbase, Milk, Juice100, SSB) %>%
  step_dummy(all_nominal(), -all_outcomes())

res_customimpute <- resample(Car_rec6, model = GLMModel, control = control)

### Compare method performances
res <- c("mean" = res_meanimpute, "bag" = res_bagimpute, "knn" = res_knnimpute, "custom" = res_customimpute)
summary(res)

5 Data Transformations

Data Transformations for Individual Predictors I

Transformations of individual predictors may be desired or needed for planned modeling approaches.

A few types of transformations are summarized below.

Univariate Transformation Recipe Functions

Category                    Syntax                Types  Description
Centering and Scaling       step_center()         N      Centering numeric data
                            step_scale()          N      Scaling numeric data
                            step_normalize()      N      Centering and scaling numeric data
                            step_range()          N      Scaling numeric data to a specific range
Skewness                    step_BoxCox()         N      Box-Cox transformation (strictly positive data)
                            step_YeoJohnson()     N      Yeo-Johnson transformation
Basis Expansions            step_bs(), step_ns()  N      Spline basis functions
                            step_poly()           N      Orthogonal polynomial basis functions
Univariate Transformations  step_discretize()     N      Discretize numeric variables
                            step_hyperbolic()     N      Hyperbolic transformation
                            step_invlogit()       N      Inverse logit transformation
                            step_log()            N      Log transformation
                            step_logit()          N      Logit transformation
                            step_other()          F      Collapse some categorical levels
                            step_relu()           N      Apply rectified linear transformation
                            step_sqrt()           N      Square root transformation
                            step_window()         N      Moving window functions
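
As a small worked example of the centering and scaling rows, the base R sketch below standardizes a predictor using constants estimated on the training set only and then applies the same constants to the test set. The simulated data and variable names are hypothetical.

```r
set.seed(42)
train <- data.frame(x = rexp(200, rate = 0.5))  # toy training predictor
test  <- data.frame(x = rexp(50, rate = 0.5))   # toy test predictor

# Estimate the centering and scaling constants on the training set only
mu    <- mean(train$x)
sigma <- sd(train$x)

# Apply the same constants to both sets
train$x_std <- (train$x - mu) / sigma
test$x_std  <- (test$x - mu) / sigma

c(train_mean = mean(train$x_std), train_sd = sd(train$x_std))
c(test_mean = mean(test$x_std), test_sd = sd(test$x_std))
```

The training column ends up with mean 0 and standard deviation 1 by construction. The test column will generally be close to, but not exactly, those values, because its constants come from the training data.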

IFS Study Example (Individual Predictors) I

# Add one of the imputation methods to the base recipe
Car_rec_impute <- Caries_recipe %>% step_knnimpute(all_predictors())

### Yeo-Johnson transformation
Car_rec7 <- Car_rec_impute %>%
  step_YeoJohnson(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes())

### Centering and scaling (before and after ordinal scoring)
Car_rec8 <- Car_rec_impute %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_ordinalscore(MotherEduc) %>%
  step_dummy(all_nominal(), -all_outcomes())

Car_rec9 <- Car_rec_impute %>%
  step_ordinalscore(MotherEduc) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes())

Data Transformations for Multiple Predictors I

Data Transformations for Multiple Predictors II

Below are the recipe functions for multiple-predictor transformations:

Multivariate Transformation Recipe Functions

Category             Syntax              Types  Description
Outliers             step_spatialsign()  N      Spatial sign preprocessing
Dimension Reduction  step_ica()          N      ICA signal extraction
                     step_isomap()       N      Isomap embedding
                     step_kpca()         N      Kernel PCA signal extraction
                     step_pca()          N      PCA signal extraction
Interaction          step_interact()     N      Create interaction variables
                     step_ratio()        N      Ratio variable creation

IFS Study Example (Multiple Predictors) I

### Data reduction with principal components
Car_rec10 <- Car_rec_impute %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_pca(all_numeric(), -all_outcomes(), threshold = 0.75)

res2 <- resample(Car_rec10 , model = GLMModel, control = control)
summary(res2)
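
To show what a variance threshold of 0.75 means, the base R sketch below runs a principal component analysis with prcomp() and keeps the smallest number of components whose cumulative share of variance reaches 75%. The built-in USArrests data is an assumption used only to make the example self-contained; it is not the IFS data.

```r
# Built-in USArrests data stands in for the course predictors
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components whose cumulative variance reaches 75%
n_comp <- which(cum_var >= 0.75)[1]

# Component scores that would replace the original predictors
scores <- pca$x[, seq_len(n_comp), drop = FALSE]
round(cum_var, 3)
n_comp
```

Centering and scaling before the decomposition matters because principal components are sensitive to the units of the predictors.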

6 Binning Predictors

7 Summary

8 References

Gower JC (1971) A general coefficient of similarity and some of its properties, Biometrics 27: 857-871.

Jones MP (1996) Indicator and stratification methods for missing explanatory variables in multiple linear regression, Journal of the American Statistical Association 91: 222-230.

Kuhn M and Johnson K (2013) Applied Predictive Modeling, Chapter 3, New York: Springer.

Kuhn M and Wickham H (2020) recipes: Preprocessing Tools to Create Design Matrices. https://CRAN.R-project.org/package=recipes

van Buuren S and Groothuis-Oudshoorn K (2011) mice: Multivariate Imputation by Chained Equations in R, Journal of Statistical Software 45(3): 1-67. http://www.jstatsoft.org/v45/i03/