Exercise 3.1

Data input

data(Glass)

Data checks

I performed some basic checks on the data before answering the exercise questions.

Data types

The data appears to be tidy and data types appear to be appropriate.

glimpse(Glass)
## Rows: 214
## Columns: 10
## $ RI   <dbl> 1.52101, 1.51761, 1.51618, 1.51766, 1.51742, 1.51596, 1.51743, 1.…
## $ Na   <dbl> 13.64, 13.89, 13.53, 13.21, 13.27, 12.79, 13.30, 13.15, 14.04, 13…
## $ Mg   <dbl> 4.49, 3.60, 3.55, 3.69, 3.62, 3.61, 3.60, 3.61, 3.58, 3.60, 3.46,…
## $ Al   <dbl> 1.10, 1.36, 1.54, 1.29, 1.24, 1.62, 1.14, 1.05, 1.37, 1.36, 1.56,…
## $ Si   <dbl> 71.78, 72.73, 72.99, 72.61, 73.08, 72.97, 73.09, 73.24, 72.08, 72…
## $ K    <dbl> 0.06, 0.48, 0.39, 0.57, 0.55, 0.64, 0.58, 0.57, 0.56, 0.57, 0.67,…
## $ Ca   <dbl> 8.75, 7.83, 7.78, 8.22, 8.07, 8.07, 8.17, 8.24, 8.30, 8.40, 8.09,…
## $ Ba   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Fe   <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.26, 0.00, 0.00, 0.00, 0.11, 0.24,…
## $ Type <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…

Based on the dataset documentation, the target variable is Type, and the numeric variables are features.

Duplicate data

I found and removed one duplicate row.

sprintf("Duplicate rows in dataset: %d", nrow(Glass) - nrow(distinct(Glass)))
## [1] "Duplicate rows in dataset: 1"
Glass <- Glass %>%
  distinct()

Missing data

No variables had missing data. This is consistent with information in the dataset documentation.

get_na_vars(Glass)
## [1] "There are no variables with missing values"

Zero-variance variables

No variables had near-zero variance.

# Indexes of columns with near-zero variance
zvar_idx <- caret::nearZeroVar(Glass)
sprintf('There are %d near-zero variance columns', length(zvar_idx))
## [1] "There are 0 near-zero variance columns"

a. Explore predictor variables

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

I created a pairs plot to visualize the distribution of predictor variables (scatterplots [lower triangle] and density plots [diagonal]) and relationships between predictors (correlations [upper triangle]).

Glass %>%
  select_if(is.numeric) %>%
  # Color transparency and small data points to reduce overplotting in scatterplots
  GGally::ggpairs(mapping = aes(alpha = 0.2),
                  lower = list(
                    continuous = wrap('points', size = 0.2, color = '#005AB5')),
                  diag = list(
                    # Create bar histograms on diagonal
                    continuous = wrap('barDiag', bins = 20, fill = '#DC3220')),
                  progress = FALSE)

Distributions

The histograms on the diagonal show that the predictor variables have a unimodal distribution. The distribution of Si appears to be the closest to normal (ie, bell-shaped). Several variables (K, Ba, Fe) are right skewed. Two of these variables (Ba and Fe) are skewed due to sparsity as they have more zero values than nonzero values.

Relationships between predictors

The Pearson correlation coefficients in the upper triangle show that several pairs of variables are significantly correlated.

For positive correlations, RI and Ca are most strongly correlated (r = 0.81), which indicates collinearity. This is consistent with the scatterplot of these two variables (lower triangle), which shows a linear relationship.

No other variable pair has a correlation coefficient or scatterplot that suggests multicollinearity.

# Convert correlation matrix to dataframe in long format
correlations_df <- Glass %>%
  select_if(is.numeric) %>%
  cor(method = 'pearson') %>%
  get_upper_triangular_df() %>%
  rownames_to_column('var1') %>%
  pivot_longer(cols = -var1, names_to = 'var2', values_to = 'r') %>%
  drop_na()
# Strongest positive correlations
correlations_df %>%
  arrange(desc(r)) %>%
  slice_head(n = 5) %>%
  kbl() %>%
  kable_styling()
var1 var2 r
RI Ca 0.8111829
Al Ba 0.4806424
Na Ba 0.3290798
Al K 0.3236828
Na Al 0.1677350

For negative correlations, the strongest is a moderate correlation between RI and Si (r = -0.54); however, this does does not suggest collinearity.

# Strongest negative correlations
correlations_df %>%
  arrange(r) %>%
  slice_head(n = 5) %>%
  kbl() %>%
  kable_styling()
var1 var2 r
RI Si -0.5389998
Mg Ba -0.4918178
Mg Al -0.4795752
Mg Ca -0.4461971
RI Al -0.4009733

b. Outliers

Do there appear to be any outliers in the data? Are any predictors skewed?

The boxplots below show that all predictor variables except Mg have outliers (as assessed by values that exceed the first and third quartiles by more than than 1.5 times the interquartile range).

# Pivot numeric variables to long format for plotting
Glass_long <- Glass %>%
  select_if(is.numeric) %>%  
  pivot_longer(cols = everything(), names_to = 'variable', values_to = 'value')
ggplot(Glass_long, aes(x = value, fill = variable)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = 'free_x') +
  theme(
    legend.position = 'none'
  )

The skewness values below show that the distribution of several variables is right skewed. In particular, K and Ba are the most right skewed (skewness >> 1). Mg is left-skewed (skewness < -1).

skewness_vec <- Glass %>%
  select_if(is.numeric) %>%  
  apply(MARGIN = 2, e1071::skewness)

skewness_vec
##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6248785  0.4591599 -1.1300042  0.9252262 -0.7341909  6.4535281  2.0219586 
##         Ba         Fe 
##  3.3589166  1.7226422

c. Transformations

Are there any relevant transformations of one or more predictors that might improve the classification model?

Box-Cox transformations

For skewed variables (part (b)), transformations may improve the data distribution. To determine the best transformation for each skewed predictor variable, I calculated the optimal lambda values for Box-Cox transformations.

I first identified variables that are not positive definite (Mg, K, Ba, and Fe), which is a requirement for Box-Cox transformation. To satisfy the requirement, I added 1 to them.

Glass %>%
  # Variables with skewness < -1 or > +1
  dplyr::select(RI, Mg, K, Ca, Ba, Fe) %>%
  apply(MARGIN = 2, function(x) any(x <= 0))
##    RI    Mg     K    Ca    Ba    Fe 
## FALSE  TRUE  TRUE FALSE  TRUE  TRUE
Glass <- Glass %>%
  mutate(
    Mg1 = Mg + 1,
    K1 = K + 1,
    Ba1 = Ba + 1,
    Fe1 = Fe + 1
  )

Then I calculated the optimal lambda values for Box-Cox transformations.

For RI, \(\lambda_{best} = -2\), which is an inverse squared transformation.

boxcox_result_RI <- MASS::boxcox(lm(Glass$RI ~ 1), plotit = FALSE)
(best_lambda_RI <- boxcox_result_RI$x[which.max(boxcox_result_RI$y)])
## [1] -2
# Transform and reshape as long format for plotting
compare_long_df <- BoxCoxTransform(Glass$RI, best_lambda_RI) %>%
  pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())

The histograms below show that the transformation did not change the distribution much.

plot_title <- latex2exp::TeX(paste0('Effect of Box-Cox transformation (',
                                    '$\\lambda$ = ', best_lambda_RI, ') on RI'))

create_sbs_histograms(compare_long_df, plot_title)

For Mg, \(\lambda_{best} = 2\), which is a squared transformation.

boxcox_result_Mg <- MASS::boxcox(lm(Glass$Mg1 ~ 1), plotit = FALSE)
(best_lambda_Mg <- boxcox_result_Mg$x[which.max(boxcox_result_Mg$y)])
## [1] 2
# Transform and reshape as long format for plotting
compare_long_df <- BoxCoxTransform(Glass$Mg1, best_lambda_Mg) %>%
  pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())

The histograms below show that the transformation did not change the distribution much.

plot_title <- latex2exp::TeX(paste0('Effect of Box-Cox transformation (',
                                    '$\\lambda$ = ', best_lambda_Mg, ') on Mg'))

create_sbs_histograms(compare_long_df, plot_title)

For K, \(\lambda_{best} = -1\), which is an inverse transformation.

boxcox_result_K <- MASS::boxcox(lm(Glass$K1 ~ 1), plotit = FALSE)
(best_lambda_K <- boxcox_result_K$x[which.max(boxcox_result_K$y)])
## [1] -1
# Transform and reshape as long format for plotting
compare_long_df <- BoxCoxTransform(Glass$K, best_lambda_K) %>%
  pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())

The histograms below show that the transformation changed the distribution, but it is not normalized.

plot_title <- latex2exp::TeX(paste0('Effect of Box-Cox transformation (',
                                    '$\\lambda$ = ', best_lambda_K, ') on K'))

create_sbs_histograms(compare_long_df, plot_title)
## Warning: Removed 30 rows containing non-finite outside the scale range
## (`stat_bin()`).

For Ca, \(\lambda_{best} = -1.1\), which is close to an inverse transformation.

boxcox_result_Ca <- MASS::boxcox(lm(Glass$Ca ~ 1), plotit = FALSE)
(best_lambda_Ca <- boxcox_result_Ca$x[which.max(boxcox_result_Ca$y)])
## [1] -1.1
# Transform and reshape as long format for plotting
compare_long_df <- BoxCoxTransform(Glass$Ca, best_lambda_Ca) %>%
  pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())

The histograms below show that the transformation improved the distribution somewhat.

plot_title <- latex2exp::TeX(paste0('Effect of Box-Cox transformation (',
                                    '$\\lambda$ = ', best_lambda_Ca, ') on Ca'))

create_sbs_histograms(compare_long_df, plot_title)

For Ba, \(\lambda_{best} = -2\), which is an inverse squared transformation.

boxcox_result_Ba <- MASS::boxcox(lm(Glass$Ba1 ~ 1), plotit = FALSE)
(best_lambda_Ba <- boxcox_result_Ba$x[which.max(boxcox_result_Ba$y)])
## [1] -2
# Transform and reshape as long format for plotting
compare_long_df <- BoxCoxTransform(Glass$Ba1, best_lambda_Ba) %>%
  pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())

The histograms below show that the transformation did not change the distribution much. This may be due to the sparsity of this variable.

plot_title <- latex2exp::TeX(paste0('Effect of Box-Cox transformation (',
                                    '$\\lambda$ = ', best_lambda_Ba, ') on Ba'))

create_sbs_histograms(compare_long_df, plot_title)

For Fe, \(\lambda_{best} = -2\), which is an inverse squared transformation.

boxcox_result_Fe <- MASS::boxcox(lm(Glass$Fe1 ~ 1), plotit = FALSE)
(best_lambda_Fe <- boxcox_result_Fe$x[which.max(boxcox_result_Fe$y)])
## [1] -2
# Transform and reshape as long format for plotting
compare_long_df <- BoxCoxTransform(Glass$Fe1, best_lambda_Fe) %>%
  pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())

The histograms below show that the transformation did not change the distribution much. This may be due to the sparsity of this variable.

plot_title <- latex2exp::TeX(paste0('Effect of Box-Cox transformation (',
                                    '$\\lambda$ = ', best_lambda_Fe, ') on Fe'))

create_sbs_histograms(compare_long_df, plot_title)

Overall, the plots above suggest that Box-Cox transformations are not very effective to normalize the predictor variables. The only variable with some effect was Ca. Given its collinearity with RI (part (a)), the Box-Cox transformed Ca may be more more suitable for a predictive model.

Spatial sign transformations

Next, I decided to try spatial sign transformations on all skewed predictor variables that had outliers (part (b)).

Glass_ss <- Glass %>%
  select_if(is.numeric) %>%
  scale() %>%
  caret::spatialSign() %>%
  as.data.frame()

colnames(Glass_ss) <- paste(colnames(Glass_ss), '_ss', sep = '')

The histograms below show that the spatial sign transformation improved the distribution of RI.

transform_df <- data.frame(untransformed = Glass$RI, transformed = Glass_ss$RI_ss)

compare_long_df <- transform_df %>%
  pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())
plot_title <- 'Effect of spatial sign transformation on RI'
create_sbs_histograms(compare_long_df, plot_title)

The histograms below show that the spatial sign transformation improved the distribution of K.

transform_df <- data.frame(untransformed = Glass$K, transformed = Glass_ss$K_ss)

compare_long_df <- transform_df %>%
  pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())
plot_title <- 'Effect of spatial sign transformation on K'
create_sbs_histograms(compare_long_df, plot_title)

The histograms below show that the spatial sign transformation worsened the distribution of Ca.

transform_df <- data.frame(untransformed = Glass$Ca, transformed = Glass_ss$Ca_ss)

compare_long_df <- transform_df %>%
  pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())
plot_title <- 'Effect of spatial sign transformation on Ca'
create_sbs_histograms(compare_long_df, plot_title)

The histograms below show that the spatial sign transformation improved the distribution of Ba.

transform_df <- data.frame(untransformed = Glass$Ba, transformed = Glass_ss$Ba_ss)

compare_long_df <- transform_df %>%
  pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())
plot_title <- 'Effect of spatial sign transformation on Ba'
create_sbs_histograms(compare_long_df, plot_title)

The histograms below show that the spatial sign transformation worsened the distribution of Fe, making it bimodal.

transform_df <- data.frame(untransformed = Glass$Fe, transformed = Glass_ss$Fe_ss)

compare_long_df <- transform_df %>%
  pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())
plot_title <- 'Effect of spatial sign transformation on Fe'
create_sbs_histograms(compare_long_df, plot_title)

Conclusions

Overall, spatial sign transformations were more effective than Box-Cox transformations and improved the distribution (ie, more normal) of three skewed predictor variables (RI, K, and Ba). For Ca, spatial sign transformation worsened the distribution; however, Box-Cox transformation with \(\lambda = -1.1\) (ie, inverse) showed some improvement, so I would use the latter. The remaining predictor variables in the Glass dataset would likely be used as is for model development.

Depending on the choice of classifier, scaling and centering may also be useful transformations.

Exercise 3.2

Data input

data(Soybean)

Data checks

I performed some basic checks on the data before answering the exercise questions.

Data types

The data appears to be tidy and data types appear to be appropriate.

glimpse(Soybean)
## Rows: 683
## Columns: 36
## $ Class           <fct> diaporthe-stem-canker, diaporthe-stem-canker, diaporth…
## $ date            <fct> 6, 4, 3, 3, 6, 5, 5, 4, 6, 4, 6, 4, 3, 6, 6, 5, 6, 4, …
## $ plant.stand     <ord> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ precip          <ord> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ temp            <ord> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, …
## $ hail            <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, …
## $ crop.hist       <fct> 1, 2, 1, 1, 2, 3, 2, 1, 3, 2, 1, 1, 1, 3, 1, 3, 0, 2, …
## $ area.dam        <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 2, 3, 3, 3, 2, 2, …
## $ sever           <fct> 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ seed.tmt        <fct> 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, …
## $ germ            <ord> 0, 1, 2, 1, 2, 1, 0, 2, 1, 2, 0, 1, 0, 0, 1, 2, 0, 1, …
## $ plant.growth    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ leaves          <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ leaf.halo       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.marg       <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ leaf.size       <ord> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ leaf.shread     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.malf       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.mild       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stem            <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ lodging         <fct> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, …
## $ stem.cankers    <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ canker.lesion   <fct> 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ fruiting.bodies <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ext.decay       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ mycelium        <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ int.discolor    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ sclerotia       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ fruit.pods      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ fruit.spots     <fct> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ seed            <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ mold.growth     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ seed.discolor   <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ seed.size       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ shriveling      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ roots           <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

Based on the dataset documentation, the target variable is Class. The other 35 variables are features.

Duplicate data

I found and removed 52 duplicate rows.

sprintf("Duplicate rows in dataset: %d", nrow(Soybean) - nrow(distinct(Soybean)))
## [1] "Duplicate rows in dataset: 52"
Soybean <- Soybean %>%
  distinct()

Missing data

There are 34 variables with missing data, which is consistent with the dataset documentation.

All variables except leaves have missing data. The proportion of missing data ranged from 0.2% area.dam and date) to 15.7% (hail, sever, seed.tmt, and lodging). Based on these fractions, the affected variables can be imputed rather than omitted from the classifier.

Additional information about variables with missingness is in part (b) of this exercise.

missing_data_df <- get_na_vars(Soybean) %>%
  bind_rows() %>%
  pivot_longer(names_to = 'variable', values_to = 'null_rows', cols = everything()) %>%
  mutate(
    missing_frac = round(null_rows / nrow(Soybean), 3)
  ) %>%
  arrange(desc(null_rows))

missing_data_df %>%
  kbl() %>%
  kable_styling() %>%
  scroll_box('300px')
variable null_rows missing_frac
hail 99 0.157
sever 99 0.157
seed.tmt 99 0.157
lodging 99 0.157
germ 91 0.144
leaf.mild 87 0.138
fruiting.bodies 85 0.135
fruit.spots 85 0.135
seed.discolor 85 0.135
shriveling 85 0.135
leaf.shread 79 0.125
seed 75 0.119
mold.growth 75 0.119
seed.size 75 0.119
fruit.pods 67 0.106
leaf.halo 63 0.100
leaf.marg 63 0.100
leaf.size 63 0.100
leaf.malf 63 0.100
precip 34 0.054
stem.cankers 34 0.054
canker.lesion 34 0.054
ext.decay 34 0.054
mycelium 34 0.054
int.discolor 34 0.054
sclerotia 34 0.054
plant.stand 32 0.051
roots 30 0.048
temp 26 0.041
crop.hist 16 0.025
plant.growth 16 0.025
stem 16 0.025
date 1 0.002
area.dam 1 0.002
sprintf('The fraction of missing data in predictor variables ranges from %.3f to %.3f',
        min(missing_data_df$missing_frac), max(missing_data_df$missing_frac))
## [1] "The fraction of missing data in predictor variables ranges from 0.002 to 0.157"
# Variables that do not have missing data
soybean_vars <- colnames(Soybean)
missing_data_vars <- missing_data_df$variable

complete_vars <- setdiff(soybean_vars, missing_data_vars)
sprintf('Soybean variables with no missing data: %s', paste(complete_vars, collapse = ', '))
## [1] "Soybean variables with no missing data: Class, leaves"

Zero-variance variables

There were 3 variables (lead.mild, mycelium, and sclerotia) with near-zero variance. These variables are not likely to be informative for a machine learning classifier.

# Indexes of columns with near-zero variance
zvar_idx <- caret::nearZeroVar(Soybean)
sprintf('There are %d near-zero variance columns', length(zvar_idx))
## [1] "There are 3 near-zero variance columns"
names(Soybean)[zvar_idx]
## [1] "leaf.mild" "mycelium"  "sclerotia"

a. Frequency distribution of categorical predictors

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

The bar plots below show that predictor variables have 2 to 7 categories (levels); however, they are often not distributed evenly. Except for area.dam, most variables have a dominant category (ie, value with higher frequency).

The most extreme case, where all observations fall into a single category, is known as a degenerate distribution. Strictly speaking, none of the distributions meet this criterion; however, several are close. The variables with the most pronounced near-degenerate distribution are mycelium, sclerotia, and leaf.mild, which were also identified using the nearZeroVar() function from the caret package.

# Due to differences between factor and ordinal data types in the Soybean dataset, 
# I coerced categorical variables to be numeric in order to reshape and create facet plots
Soybean_numeric <- type.convert(Soybean, as.is = FALSE)
# Reshape data to long format for facet plot
Soybean_numeric_long <- Soybean_numeric %>%
  # Omit target variable
  dplyr::select(-c(Class)) %>%
  pivot_longer(cols = everything(), names_to = 'variable', values_to = 'category')
# Create faceted barplots
ggplot(Soybean_numeric_long, aes(x = category, fill = variable)) +
    geom_bar(na.rm = TRUE) +
    facet_wrap(~ variable, scales = 'free') +
    labs(
      x = 'Category', y = 'Count',
      title = 'Frequency distribution of Soybean variables'
  ) +
  theme_minimal() +
  theme(
    legend.position = 'none'
  )

b. Patterns in missing data

Are there particular predictors that are more likely to be missing?

Yes, some predictors are more likely to have missing data. In the Data checks section, I showed that the fraction of missing data varies from 0.2% to 15.7%. Here, I list the predictors with >5% missingness.

Of note, this subset includes the three near-degenerate predictors identified in part (a) (mycelium, sclerotia, and leaf.mild).

high_missingness_vars <- missing_data_df %>%
  filter(missing_frac > 0.05)

high_missingness_vars %>%
  kbl() %>%
  kable_styling() %>%
  scroll_box('300px')
variable null_rows missing_frac
hail 99 0.157
sever 99 0.157
seed.tmt 99 0.157
lodging 99 0.157
germ 91 0.144
leaf.mild 87 0.138
fruiting.bodies 85 0.135
fruit.spots 85 0.135
seed.discolor 85 0.135
shriveling 85 0.135
leaf.shread 79 0.125
seed 75 0.119
mold.growth 75 0.119
seed.size 75 0.119
fruit.pods 67 0.106
leaf.halo 63 0.100
leaf.marg 63 0.100
leaf.size 63 0.100
leaf.malf 63 0.100
precip 34 0.054
stem.cankers 34 0.054
canker.lesion 34 0.054
ext.decay 34 0.054
mycelium 34 0.054
int.discolor 34 0.054
sclerotia 34 0.054
plant.stand 32 0.051


Similarly, a visual representation of missing data in the Soybean dataset shows that there are systematic patterns of missing data and that missingness does not occur at random (ie, the gray areas corresponding to NA are not scattered).

vis_dat(Soybean)

Is the pattern of missing data related to the classes?

Yes, the pattern of missing data is related to the soybean disease classes. The first clue to this is that not all classes have missing data. Among the 5 classes with missing data, phytophthora-rot has the most missing observations, followed by 2-4-d-injury.

# Reshape soybean data to long format
# Note I reshaped the dataset in part (a), but this version includes Class
Soybean_numeric_long2 <- Soybean_numeric %>%
  pivot_longer(cols = -Class, names_to = 'variable', values_to = 'value')
# Count number of observations with missing data by category
Soybean_missing_by_class <- Soybean_numeric_long2 %>%
  group_by(Class, value) %>%
  count(name = 'total_obs') %>%
  ungroup() %>%
  filter(is.na(value)) %>%
  rename('missing_obs' = 'total_obs') %>%
  dplyr::select(Class, missing_obs) %>%
  arrange(desc(missing_obs))

Soybean_missing_by_class %>%
  kbl() %>%
  kable_styling()
Class missing_obs
phytophthora-rot 897
2-4-d-injury 450
cyst-nematode 240
diaporthe-pod-&-stem-blight 166
herbicide-injury 160
sprintf('%d of %d classes in Soybean have missing data', 
        nrow(Soybean_missing_by_class), length(table(Soybean$Class)))
## [1] "5 of 19 classes in Soybean have missing data"

c. Handling missing data

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Since some predictors in the Soybean dataset are more likely than others to have missing data, missing completely at random (MCAR) is not likely. The systematic pattern of missingness as well as differences in missingness by Class suggest that missing at random (MAR) is more likely. This suggests that imputing missing data would be preferred to eliminating variables with missing data. However, I would remove the near-degenerate variables since they are not likely to be informative in a classification model.

Potential imputation methods include random forest models or multiple imputation by chained equations (MICE). These methods are preferred to simple imputation methods, such as using a summary statistic (eg, median) or k-nearest neighbors, because they generate multiple datasets with imputed values, which are more statistically robust (similar to the advantages of using k-fold cross-validation).


References

Kuhn M & Johnson K. (2013) Applied predictive modeling. DOI: 10.1007/978-1-4614-6849-3

Tuzen MF. (2025). Handling missing data in R: A comprehensive guide. https://www.r-bloggers.com/2025/08/handling-missing-data-in-r-a-comprehensive-guide/

Session Details

sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.2
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.4.0 latex2exp_0.9.8  visdat_0.6.0     caret_7.0-1     
##  [5] lattice_0.22-7   MASS_7.3-65      e1071_1.7-17     GGally_2.4.0    
##  [9] mlbench_2.1-7    lubridate_1.9.4  forcats_1.0.1    stringr_1.6.0   
## [13] dplyr_1.1.4      purrr_1.2.1      readr_2.1.6      tidyr_1.3.2     
## [17] tibble_3.3.1     ggplot2_4.0.1    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     viridisLite_0.4.2    timeDate_4052.112   
##  [4] farver_2.1.2         S7_0.2.1             fastmap_1.2.0       
##  [7] pROC_1.19.0.1        digest_0.6.39        rpart_4.1.24        
## [10] timechange_0.3.0     lifecycle_1.0.5      survival_3.8-6      
## [13] magrittr_2.0.4       compiler_4.5.2       rlang_1.1.7         
## [16] sass_0.4.10          tools_4.5.2          yaml_2.3.12         
## [19] data.table_1.18.2.1  knitr_1.51           labeling_0.4.3      
## [22] xml2_1.5.2           plyr_1.8.9           RColorBrewer_1.1-3  
## [25] withr_3.0.2          stats4_4.5.2         nnet_7.3-20         
## [28] grid_4.5.2           future_1.69.0        globals_0.18.0      
## [31] scales_1.4.0         iterators_1.0.14     cli_3.6.5           
## [34] rmarkdown_2.30       generics_0.1.4       otel_0.2.0          
## [37] rstudioapi_0.18.0    future.apply_1.20.1  reshape2_1.4.5      
## [40] tzdb_0.5.0           cachem_1.1.0         proxy_0.4-29        
## [43] splines_4.5.2        parallel_4.5.2       vctrs_0.7.1         
## [46] hardhat_1.4.2        Matrix_1.7-4         jsonlite_2.0.0      
## [49] hms_1.1.4            listenv_0.10.0       systemfonts_1.3.1   
## [52] foreach_1.5.2        gower_1.0.2          jquerylib_0.1.4     
## [55] recipes_1.3.1        glue_1.8.0           parallelly_1.46.1   
## [58] ggstats_0.12.0       codetools_0.2-20     stringi_1.8.7       
## [61] gtable_0.3.6         pillar_1.11.1        htmltools_0.5.9     
## [64] ipred_0.9-15         lava_1.8.2           R6_2.6.1            
## [67] textshaping_1.0.4    evaluate_1.0.5       bslib_0.10.0        
## [70] class_7.3-23         Rcpp_1.1.1           svglite_2.2.2       
## [73] nlme_3.1-168         prodlim_2025.04.28   xfun_0.56           
## [76] ModelMetrics_1.2.2.2 pkgconfig_2.0.3