
required_pkgs <- c("tidyverse", "caret", "mlbench", "e1071", "corrplot", "RANN")

to_install <- required_pkgs[!vapply(required_pkgs, requireNamespace, FUN.VALUE = logical(1), quietly = TRUE)]
if (length(to_install) > 0) install.packages(to_install)

invisible(lapply(required_pkgs, library, character.only = TRUE))

1 Problem 3.1 (Glass data): explore distributions, outliers, correlations, and transformations

1.1 Load data

data(Glass)
glimpse(Glass)

## Rows: 214
## Columns: 10
## $ RI   <dbl> 1.52101, 1.51761, 1.51618, 1.51766, 1.51742, 1.51596, 1.51743, 1.…
## $ Na   <dbl> 13.64, 13.89, 13.53, 13.21, 13.27, 12.79, 13.30, 13.15, 14.04, 13…
## $ Mg   <dbl> 4.49, 3.60, 3.55, 3.69, 3.62, 3.61, 3.60, 3.61, 3.58, 3.60, 3.46,…
## $ Al   <dbl> 1.10, 1.36, 1.54, 1.29, 1.24, 1.62, 1.14, 1.05, 1.37, 1.36, 1.56,…
## $ Si   <dbl> 71.78, 72.73, 72.99, 72.61, 73.08, 72.97, 73.09, 73.24, 72.08, 72…
## $ K    <dbl> 0.06, 0.48, 0.39, 0.57, 0.55, 0.64, 0.58, 0.57, 0.56, 0.57, 0.67,…
## $ Ca   <dbl> 8.75, 7.83, 7.78, 8.22, 8.07, 8.07, 8.17, 8.24, 8.30, 8.40, 8.09,…
## $ Ba   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Fe   <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.26, 0.00, 0.00, 0.00, 0.11, 0.24,…
## $ Type <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…

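The outcome is Type, a categorical variable. The remaining nine columns are numeric predictors measured on different scales: RI is the refractive index, and the other eight are element (oxide) percentages.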
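As a quick check of the "different scales" point, the sketch below tabulates the outcome classes and the minimum and maximum of each predictor. The names `predictors` and `scale_ranges` are introduced here for illustration only.

```r
library(mlbench)
data(Glass)

# Class counts for the outcome
table(Glass$Type)

# Min and max of each predictor, to compare measurement scales
predictors <- Glass[, names(Glass) != "Type"]
scale_ranges <- t(vapply(predictors, range, numeric(2)))
colnames(scale_ranges) <- c("min", "max")
scale_ranges
```

Predictors whose ranges differ by orders of magnitude are the ones most affected by centering and scaling later on.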

1.2 Pairwise plots and correlations

# Pairwise scatterplot matrix (can be visually busy; keeping it for completeness)
pairs(Glass[, -10], pch = 19, cex = 0.4)

# Correlation among predictors
cor_mat <- cor(Glass[, -10])
cor_mat

##               RI          Na           Mg          Al          Si            K
## RI  1.0000000000 -0.19188538 -0.122274039 -0.40732603 -0.54205220 -0.289832711
## Na -0.1918853790  1.00000000 -0.273731961  0.15679367 -0.06980881 -0.266086504
## Mg -0.1222740393 -0.27373196  1.000000000 -0.48179851 -0.16592672  0.005395667
## Al -0.4073260341  0.15679367 -0.481798509  1.00000000 -0.00552372  0.325958446
## Si -0.5420521997 -0.06980881 -0.165926723 -0.00552372  1.00000000 -0.193330854
## K  -0.2898327111 -0.26608650  0.005395667  0.32595845 -0.19333085  1.000000000
## Ca  0.8104026963 -0.27544249 -0.443750026 -0.25959201 -0.20873215 -0.317836155
## Ba -0.0003860189  0.32660288 -0.492262118  0.47940390 -0.10215131 -0.042618059
## Fe  0.1430096093 -0.24134641  0.083059529 -0.07440215 -0.09420073 -0.007719049
##            Ca            Ba           Fe
## RI  0.8104027 -0.0003860189  0.143009609
## Na -0.2754425  0.3266028795 -0.241346411
## Mg -0.4437500 -0.4922621178  0.083059529
## Al -0.2595920  0.4794039017 -0.074402151
## Si -0.2087322 -0.1021513105 -0.094200731
## K  -0.3178362 -0.0426180594 -0.007719049
## Ca  1.0000000 -0.1128409671  0.124968219
## Ba -0.1128410  1.0000000000 -0.058691755
## Fe  0.1249682 -0.0586917554  1.000000000

# Visual correlation map (clustered)
corrplot(cor_mat, order = "hclust", tl.cex = 0.7)

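Quick note: RI and Ca are strongly correlated (r about 0.81), and several other pairs are moderately correlated (RI and Si at -0.54; Mg with Al and Ba at about -0.48 and -0.49; Al and Ba at 0.48). None of this automatically implies we must remove variables; it is mainly a diagnostic step before modeling.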
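The full matrix is hard to scan by eye, so a minimal sketch for pulling out the larger pairwise correlations follows. The 0.45 cutoff and the name `high_pairs` are assumptions chosen for illustration, not thresholds from the exercise.

```r
library(mlbench)
data(Glass)

# Predictor-only correlation matrix (Type is the outcome)
cor_mat <- cor(Glass[, names(Glass) != "Type"])

# Look only at the upper triangle so each pair is listed once
idx <- which(upper.tri(cor_mat) & abs(cor_mat) >= 0.45, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)

# Strongest relationships first
high_pairs[order(-abs(high_pairs$r)), ]
```

If variable removal were needed for a specific model, `caret::findCorrelation()` applied to the same matrix would be one option for suggesting which columns to drop.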

1.3 Outliers and skewness

# Boxplots for a quick outlier scan across predictors
boxplot(Glass[, -10], las = 2, main = "Glass predictors: boxplots (outliers + spread)")

# Skewness of each predictor (raw scale)
sk_raw <- apply(Glass[, -10], 2, e1071::skewness)
sk_raw

##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107

# Look at histograms for a few skewed predictors
par(mfrow = c(1, 3))
hist(Glass$K,  main = "K",  xlab = "K")
hist(Glass$Ba, main = "Ba", xlab = "Ba")
hist(Glass$Mg, main = "Mg", xlab = "Mg")

par(mfrow = c(1, 1))

From the plots and the skewness values, several predictors are strongly right-skewed (K at 6.46, Ba at 3.37, Ca at 2.02), Mg is left-skewed (-1.14), and some predictors have extreme values (e.g., K and Ba). These can affect models that assume roughly symmetric distributions or that are sensitive to large values.
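To put numbers on the outlier scan, one possible sketch counts the points that fall outside the boxplot whiskers for each predictor. `boxplot.stats()` uses the same 1.5 × IQR rule that `boxplot()` draws by default; the name `n_out` is illustrative.

```r
library(mlbench)
data(Glass)

predictors <- Glass[, names(Glass) != "Type"]

# Count values beyond the 1.5 * IQR whiskers for each predictor
n_out <- vapply(
  predictors,
  function(x) length(boxplot.stats(x)$out),
  integer(1)
)

sort(n_out, decreasing = TRUE)
```

These counts flag points by a rule of thumb only; a flagged value is not necessarily an error in the data.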

1.4 Box–Cox transformations (reduce skewness)

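Box–Cox requires strictly positive values, so we add a small constant (1e-6) to every column that contains a non-positive value before estimating the transformations. The estimated λ for those columns depends on the size of this offset, so their results should be read with caution.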

glass_bc <- Glass

# Add a very small constant to columns with non-positive values
# (BoxCoxTrans() does not estimate lambda for such columns)
for (nm in names(glass_bc)[names(glass_bc) != "Type"]) {
  if (any(glass_bc[[nm]] <= 0, na.rm = TRUE)) {
    glass_bc[[nm]] <- glass_bc[[nm]] + 1e-6
  }
}

# Helper: compute skewness after applying an estimated Box-Cox transform
skew_after_boxcox <- function(x) {
  bct <- BoxCoxTrans(x)
  x_bc <- predict(bct, x)
  e1071::skewness(x_bc)
}

sk_bc <- apply(glass_bc[, -10], 2, skew_after_boxcox)
sk_bc

##          RI          Na          Mg          Al          Si           K 
##  1.56566039  0.03384644 -1.43270870  0.09105899 -0.65090568 -0.78216211 
##          Ca          Ba          Fe 
## -0.19395573  1.67566612  0.74424403

# Compare skewness before vs after
sk_comp <- tibble(
  predictor = names(sk_raw),
  skew_raw = as.numeric(sk_raw),
  skew_boxcox = as.numeric(sk_bc)
) |> arrange(desc(abs(skew_raw)))

sk_comp

## # A tibble: 9 × 3
##   predictor skew_raw skew_boxcox
##   <chr>        <dbl>       <dbl>
## 1 K            6.46      -0.782 
## 2 Ba           3.37       1.68  
## 3 Ca           2.02      -0.194 
## 4 Fe           1.73       0.744 
## 5 RI           1.60       1.57  
## 6 Mg          -1.14      -1.43  
## 7 Al           0.895      0.0911
## 8 Si          -0.720     -0.651 
## 9 Na           0.448      0.0338
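To see which predictors the 1e-6 offset actually touches, the sketch below counts non-positive values per column and asks `BoxCoxTrans()` for a λ on the raw, unshifted data. It assumes, per the caret documentation, that λ is left as `NA` when a column contains non-positive values; `n_nonpos` and `lambda_raw` are illustrative names.

```r
library(caret)
library(mlbench)
data(Glass)

predictors <- Glass[, names(Glass) != "Type"]

# Number of non-positive values per predictor (these columns received the offset)
n_nonpos <- vapply(predictors, function(x) sum(x <= 0), integer(1))

# Lambda estimated on the raw column; NA means the estimation was skipped
lambda_raw <- vapply(
  predictors,
  function(x) as.numeric(BoxCoxTrans(x)$lambda),
  numeric(1)
)

data.frame(n_nonpos, lambda_raw)
```

Columns with many zeros are the ones whose post-transformation skewness depends most on the chosen offset.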

Conclusion (3.1): Several predictors show strong skewness and/or outliers, most notably K (6.46), Ba (3.37), and Ca (2.02). Box–Cox pulls the skewness closer to 0 for K, Ca, Al, Na, and Fe and roughly halves it for Ba, but it leaves RI and Si essentially unchanged and makes Mg slightly more left-skewed (-1.14 to -1.43). Reducing skewness can make modeling (and distance-based methods in particular) behave more stably. I would typically combine transformation with centering/scaling in a preprocessing recipe (depending on the model family).
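A minimal sketch of such a combined preprocessing step with `caret::preProcess()` is given below. Note the swap: it uses the Yeo-Johnson transformation rather than Box–Cox, since Yeo-Johnson accepts zeros and so needs no offset. That choice, and the names `pp_glass` and `glass_pp`, are assumptions made for illustration rather than part of the exercise.

```r
library(caret)
library(mlbench)
data(Glass)

predictors <- Glass[, names(Glass) != "Type"]

# Estimate the transformation, then the centering and scaling parameters
pp_glass <- preProcess(predictors, method = c("YeoJohnson", "center", "scale"))

# Apply the fitted preprocessing to the predictors
glass_pp <- predict(pp_glass, predictors)

# After centering and scaling, column means should be approximately 0
round(colMeans(glass_pp), 3)
```

In a real modeling workflow the `preProcess` object would be fit on the training split only and then applied to the test split with `predict()`.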

2 Problem 3.2 (Soybean data): near-zero variance predictors and missing data

2.1 Load data

data(Soybean)
glimpse(Soybean)

## Rows: 683
## Columns: 36
## $ Class           <fct> diaporthe-stem-canker, diaporthe-stem-canker, diaporth…
## $ date            <fct> 6, 4, 3, 3, 6, 5, 5, 4, 6, 4, 6, 4, 3, 6, 6, 5, 6, 4, …
## $ plant.stand     <ord> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ precip          <ord> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ temp            <ord> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, …
## $ hail            <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, …
## $ crop.hist       <fct> 1, 2, 1, 1, 2, 3, 2, 1, 3, 2, 1, 1, 1, 3, 1, 3, 0, 2, …
## $ area.dam        <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 2, 3, 3, 3, 2, 2, …
## $ sever           <fct> 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ seed.tmt        <fct> 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, …
## $ germ            <ord> 0, 1, 2, 1, 2, 1, 0, 2, 1, 2, 0, 1, 0, 0, 1, 2, 0, 1, …
## $ plant.growth    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ leaves          <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ leaf.halo       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.marg       <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ leaf.size       <ord> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ leaf.shread     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.malf       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.mild       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stem            <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ lodging         <fct> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, …
## $ stem.cankers    <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ canker.lesion   <fct> 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ fruiting.bodies <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ext.decay       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ mycelium        <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ int.discolor    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ sclerotia       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ fruit.pods      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ fruit.spots     <fct> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ seed            <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ mold.growth     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ seed.discolor   <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ seed.size       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ shriveling      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ roots           <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

The outcome is Class. All 35 predictors are categorical (factors or ordered factors), and the dataset contains missing values.

2.2 Identify and remove near-zero variance predictors

nzv <- nearZeroVar(Soybean)
length(nzv)

## [1] 3

colnames(Soybean)[nzv]

## [1] "leaf.mild" "mycelium"  "sclerotia"

Soybean_nzv <- Soybean[, -nzv]
dim(Soybean)

## [1] 683  36

dim(Soybean_nzv)

## [1] 683  33

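Removing near-zero variance predictors helps because a predictor dominated by a single value contributes little information but can add noise and complexity. By default, nearZeroVar() flags a predictor when the frequency ratio of its most common to second most common value exceeds freqCut (95/5) and its percentage of unique values falls below uniqueCut (10).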
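To see why those three predictors were flagged, `nearZeroVar()` can return its underlying metrics instead of column indices. The sketch below uses the `saveMetrics = TRUE` argument; `nzv_metrics` is an illustrative name.

```r
library(caret)
library(mlbench)
data(Soybean)

# One row per column, with freqRatio, percentUnique, zeroVar, and nzv
nzv_metrics <- nearZeroVar(Soybean, saveMetrics = TRUE)

# Show only the columns flagged as near-zero variance
nzv_metrics[nzv_metrics$nzv, ]
```

Sorting `nzv_metrics` by `freqRatio` also shows which predictors came close to the cutoff without crossing it.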

2.3 Missing data: how much and where?

na_counts <- sapply(Soybean_nzv, function(x) sum(is.na(x)))
na_counts |> sort(decreasing = TRUE) |> head(15)

##            hail           sever        seed.tmt         lodging            germ 
##             121             121             121             121             112 
## fruiting.bodies     fruit.spots   seed.discolor      shriveling     leaf.shread 
##             106             106             106             106             100 
##            seed     mold.growth       seed.size       leaf.halo       leaf.marg 
##              92              92              92              84              84

# How many rows have at least one missing predictor?
# (use all predictors and exclude the outcome column)
row_has_na <- !complete.cases(dplyr::select(Soybean_nzv, -Class))
table(row_has_na)

## row_has_na
## FALSE  TRUE 
##   562   121

# Are missing values concentrated in specific classes?
tab_class_na <- table(Class = Soybean_nzv$Class, HasMissing = row_has_na)
tab_class_na

##                              HasMissing
## Class                         FALSE TRUE
##   2-4-d-injury                    0   16
##   alternarialeaf-spot            91    0
##   anthracnose                    44    0
##   bacterial-blight               20    0
##   bacterial-pustule              20    0
##   brown-spot                     92    0
##   brown-stem-rot                 44    0
##   charcoal-rot                   20    0
##   cyst-nematode                   0   14
##   diaporthe-pod-&-stem-blight     0   15
##   diaporthe-stem-canker          20    0
##   downy-mildew                   20    0
##   frog-eye-leaf-spot             91    0
##   herbicide-injury                0    8
##   phyllosticta-leaf-spot         20    0
##   phytophthora-rot               20   68
##   powdery-mildew                 20    0
##   purple-seed-stain              20    0
##   rhizoctonia-root-rot           20    0

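Conclusion (3.2): After removing near-zero variance predictors, 121 of the 683 rows (about 18%) still have at least one missing value, and the missingness is concentrated in five classes. Four of them (2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury) have no complete rows at all, and phytophthora-rot has 68 incomplete rows out of 88. Dropping incomplete rows would therefore remove four classes entirely, which argues for imputation or for a model that tolerates missing values.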
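The class-level pattern can also be expressed as a percentage of incomplete rows per class, as in the following sketch. It runs on the full `Soybean` data so that it stands alone; `pct_missing_by_class` is an illustrative name.

```r
library(mlbench)
data(Soybean)

# TRUE for rows with at least one missing predictor (outcome excluded)
row_has_na <- !complete.cases(Soybean[, names(Soybean) != "Class"])

# Share of incomplete rows within each class, in percent
pct_missing_by_class <- tapply(row_has_na, Soybean$Class, mean) * 100

# Keep only the classes that have any incomplete rows
affected <- pct_missing_by_class[pct_missing_by_class > 0]
round(sort(affected, decreasing = TRUE), 1)
```

A class at 100% has no complete rows, so listwise deletion would drop it from the data altogether.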

2.4 Example: kNN imputation via `preProcess()`

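For many models, we need a complete dataset. One common approach is kNN imputation. For caret::preProcess() with knnImpute, predictors must be numeric, so we first convert the factors to dummy variables. Two caveats apply. With fullRank = TRUE this is reference-level encoding rather than strict one-hot encoding, and ordered factors are expanded into polynomial contrasts (the .L and .Q columns in the output). Also, imputed values are averages over neighbors, so an imputed dummy column is no longer restricted to two distinct values.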
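The idea behind kNN imputation can be sketched by hand on a toy example. This is a simplified illustration in base R, not caret's implementation: caret standardizes the predictors first and averages over k = 5 neighbors by default, while the toy data, `k = 2`, and all variable names here are made up.

```r
# Toy numeric data: row 3 is missing its value for x2
toy <- data.frame(
  x1 = c(1.0, 1.1, 1.2, 5.0, 5.2),
  x2 = c(10, 11, NA, 50, 52)
)

k <- 2
target <- 3

# Rows that can donate an x2 value
donors <- which(!is.na(toy$x2))

# Distance from the target row to each donor, using the observed column only
d <- abs(toy$x1[donors] - toy$x1[target])

# Average x2 over the k nearest donors
nearest <- donors[order(d)[seq_len(k)]]
imputed <- mean(toy$x2[nearest])
imputed
```

The two nearest donors are rows 2 and 1, so the imputed value is the mean of 11 and 10, i.e. 10.5. The same averaging is why imputed dummy columns can take values between their two original levels.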

# Dummy-encode predictors (excluding outcome); fullRank = TRUE drops one reference level per factor
dummies <- dummyVars(Class ~ ., data = Soybean_nzv, fullRank = TRUE)
X <- predict(dummies, newdata = Soybean_nzv) |> as.data.frame()

# Preprocess: center/scale + kNN imputation (knnImpute centers and scales even if not requested)
pp <- preProcess(X, method = c("center", "scale", "knnImpute"))
X_imp <- predict(pp, X)

# Confirm no missing values remain
sum(is.na(X_imp))

## [1] 0

At this point, you can bind the outcome back to the processed predictors and proceed to model training:

soy_processed <- bind_cols(Class = Soybean_nzv$Class, as_tibble(X_imp))
glimpse(soy_processed)

## Rows: 683
## Columns: 61
## $ Class             <fct> diaporthe-stem-canker, diaporthe-stem-canker, diapor…
## $ date.1            <dbl> -0.3512511, -0.3512511, -0.3512511, -0.3512511, -0.3…
## $ date.2            <dbl> -0.3970683, -0.3970683, -0.3970683, -0.3970683, -0.3…
## $ date.3            <dbl> -0.4570701, -0.4570701, 2.1846402, 2.1846402, -0.457…
## $ date.4            <dbl> -0.4872381, 2.0493754, -0.4872381, -0.4872381, -0.48…
## $ date.5            <dbl> -0.5283368, -0.5283368, -0.5283368, -0.5283368, -0.5…
## $ date.6            <dbl> 2.5628369, -0.3896205, -0.3896205, -0.3896205, 2.562…
## $ plant.stand.L     <dbl> -0.9090678, -0.9090678, -0.9090678, -0.9090678, -0.9…
## $ precip.L          <dbl> 0.5874845, 0.5874845, 0.5874845, 0.5874845, 0.587484…
## $ precip.Q          <dbl> 0.4580454, 0.4580454, 0.4580454, 0.4580454, 0.458045…
## $ temp.L            <dbl> -0.2900854, -0.2900854, -0.2900854, -0.2900854, -0.2…
## $ temp.Q            <dbl> -0.8630451, -0.8630451, -0.8630451, -0.8630451, -0.8…
## $ hail.1            <dbl> -0.5398468, -0.5398468, -0.5398468, -0.5398468, -0.5…
## $ crop.hist.1       <dbl> 1.7429466, -0.5728809, 1.7429466, 1.7429466, -0.5728…
## $ crop.hist.2       <dbl> -0.6986461, 1.4291939, -0.6986461, -0.6986461, 1.429…
## $ crop.hist.3       <dbl> -0.6962726, -0.6962726, -0.6962726, -0.6962726, -0.6…
## $ area.dam.1        <dbl> 1.4147319, -0.7058113, -0.7058113, -0.7058113, -0.70…
## $ area.dam.2        <dbl> -0.5192521, -0.5192521, -0.5192521, -0.5192521, -0.5…
## $ area.dam.3        <dbl> -0.6141855, -0.6141855, -0.6141855, -0.6141855, -0.6…
## $ sever.1           <dbl> 0.8625633, -1.1572724, -1.1572724, -1.1572724, 0.862…
## $ sever.2           <dbl> -0.2947639, 3.3865094, 3.3865094, 3.3865094, -0.2947…
## $ seed.tmt.1        <dbl> -0.8073285, 1.2364491, 1.2364491, -0.8073285, -0.807…
## $ seed.tmt.2        <dbl> -0.2574791, -0.2574791, -0.2574791, -0.2574791, -0.2…
## $ germ.L            <dbl> -1.32623673, -0.06199437, 1.20224798, -0.06199437, 1…
## $ germ.Q            <dbl> 0.7706686, -1.2953021, 0.7706686, -1.2953021, 0.7706…
## $ plant.growth.1    <dbl> 1.395852, 1.395852, 1.395852, 1.395852, 1.395852, 1.…
## $ leaves.1          <dbl> 0.3561975, 0.3561975, 0.3561975, 0.3561975, 0.356197…
## $ leaf.halo.1       <dbl> -0.2526587, -0.2526587, -0.2526587, -0.2526587, -0.2…
## $ leaf.halo.2       <dbl> -1.152613, -1.152613, -1.152613, -1.152613, -1.15261…
## $ leaf.marg.1       <dbl> -0.1904508, -0.1904508, -0.1904508, -0.1904508, -0.1…
## $ leaf.marg.2       <dbl> 1.306733, 1.306733, 1.306733, 1.306733, 1.306733, 1.…
## $ leaf.size.L       <dbl> 1.170838, 1.170838, 1.170838, 1.170838, 1.170838, 1.…
## $ leaf.size.Q       <dbl> 1.095536, 1.095536, 1.095536, 1.095536, 1.095536, 1.…
## $ leaf.shread.1     <dbl> -0.443607, -0.443607, -0.443607, -0.443607, -0.44360…
## $ leaf.malf.1       <dbl> -0.2847663, -0.2847663, -0.2847663, -0.2847663, -0.2…
## $ stem.1            <dbl> 0.8925511, 0.8925511, 0.8925511, 0.8925511, 0.892551…
## $ lodging.1         <dbl> 3.5155259, -0.2839463, -0.2839463, -0.2839463, -0.28…
## $ stem.cankers.1    <dbl> -0.253489, -0.253489, -0.253489, -0.253489, -0.25348…
## $ stem.cankers.2    <dbl> -0.2429437, -0.2429437, -0.2429437, -0.2429437, -0.2…
## $ stem.cankers.3    <dbl> 1.5405448, 1.5405448, 1.5405448, 1.5405448, 1.540544…
## $ canker.lesion.1   <dbl> 2.6001128, 2.6001128, -0.3840024, -0.3840024, 2.6001…
## $ canker.lesion.2   <dbl> -0.6145069, -0.6145069, -0.6145069, -0.6145069, -0.6…
## $ canker.lesion.3   <dbl> -0.3345074, -0.3345074, -0.3345074, -0.3345074, -0.3…
## $ fruiting.bodies.1 <dbl> 2.1307732, 2.1307732, 2.1307732, 2.1307732, 2.130773…
## $ ext.decay.1       <dbl> 1.9421433, 1.9421433, 1.9421433, 1.9421433, 1.942143…
## $ ext.decay.2       <dbl> -0.1433099, -0.1433099, -0.1433099, -0.1433099, -0.1…
## $ int.discolor.1    <dbl> -0.2703661, -0.2703661, -0.2703661, -0.2703661, -0.2…
## $ int.discolor.2    <dbl> -0.1787467, -0.1787467, -0.1787467, -0.1787467, -0.1…
## $ fruit.pods.1      <dbl> -0.5260444, -0.5260444, -0.5260444, -0.5260444, -0.5…
## $ fruit.pods.2      <dbl> -0.1545693, -0.1545693, -0.1545693, -0.1545693, -0.1…
## $ fruit.pods.3      <dbl> -0.2949049, -0.2949049, -0.2949049, -0.2949049, -0.2…
## $ fruit.spots.1     <dbl> -0.386191, -0.386191, -0.386191, -0.386191, -0.38619…
## $ fruit.spots.2     <dbl> -0.3307951, -0.3307951, -0.3307951, -0.3307951, -0.3…
## $ fruit.spots.4     <dbl> 2.18214, 2.18214, 2.18214, 2.18214, 2.18214, 2.18214…
## $ seed.1            <dbl> -0.4911088, -0.4911088, -0.4911088, -0.4911088, -0.4…
## $ mold.growth.1     <dbl> -0.3572761, -0.3572761, -0.3572761, -0.3572761, -0.3…
## $ seed.discolor.1   <dbl> -0.3529024, -0.3529024, -0.3529024, -0.3529024, -0.3…
## $ seed.size.1       <dbl> -0.332738, -0.332738, -0.332738, -0.332738, -0.33273…
## $ shriveling.1      <dbl> -0.2652899, -0.2652899, -0.2652899, -0.2652899, -0.2…
## $ roots.1           <dbl> -0.3895002, -0.3895002, -0.3895002, -0.3895002, -0.3…
## $ roots.2           <dbl> -0.1533355, -0.1533355, -0.1533355, -0.1533355, -0.1…

3 Session Info

sessionInfo()

## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] RANN_2.6.2      corrplot_0.95   e1071_1.7-17    mlbench_2.1-7  
##  [5] caret_7.0-1     lattice_0.22-7  lubridate_1.9.5 forcats_1.0.1  
##  [9] stringr_1.6.0   dplyr_1.2.0     purrr_1.2.1     readr_2.2.0    
## [13] tidyr_1.3.2     tibble_3.3.1    ggplot2_4.0.2   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     timeDate_4052.112    farver_2.1.2        
##  [4] S7_0.2.1             fastmap_1.2.0        pROC_1.19.0.1       
##  [7] digest_0.6.39        rpart_4.1.24         timechange_0.4.0    
## [10] lifecycle_1.0.5      survival_3.8-3       magrittr_2.0.4      
## [13] compiler_4.5.2       rlang_1.1.7          sass_0.4.10         
## [16] tools_4.5.2          utf8_1.2.6           yaml_2.3.12         
## [19] data.table_1.18.2.1  knitr_1.51           plyr_1.8.9          
## [22] RColorBrewer_1.1-3   withr_3.0.2          nnet_7.3-20         
## [25] grid_4.5.2           stats4_4.5.2         future_1.69.0       
## [28] globals_0.19.0       scales_1.4.0         iterators_1.0.14    
## [31] MASS_7.3-65          cli_3.6.5            rmarkdown_2.30      
## [34] generics_0.1.4       rstudioapi_0.18.0    future.apply_1.20.2 
## [37] reshape2_1.4.5       tzdb_0.5.0           cachem_1.1.0        
## [40] proxy_0.4-29         splines_4.5.2        parallel_4.5.2      
## [43] vctrs_0.7.1          hardhat_1.4.2        Matrix_1.7-4        
## [46] jsonlite_2.0.0       hms_1.1.4            listenv_0.10.0      
## [49] foreach_1.5.2        gower_1.0.2          jquerylib_0.1.4     
## [52] recipes_1.3.1        glue_1.8.0           parallelly_1.46.1   
## [55] codetools_0.2-20     stringi_1.8.7        gtable_0.3.6        
## [58] pillar_1.11.1        htmltools_0.5.9      ipred_0.9-15        
## [61] lava_1.8.2           R6_2.6.1             evaluate_1.0.5      
## [64] bslib_0.10.0         class_7.3-23         Rcpp_1.1.1          
## [67] nlme_3.1-168         prodlim_2025.04.28   xfun_0.56           
## [70] ModelMetrics_1.2.2.2 pkgconfig_2.0.3

DATA 624 — Applied Predictive Modeling (Kuhn & Johnson) — Problems 3.1–3.2

Sachi Kapoor

March 01, 2026
