data(Glass)
I performed some basic checks on the data before answering the exercise questions.
The data appears to be tidy and data types appear to be appropriate.
glimpse(Glass)
## Rows: 214
## Columns: 10
## $ RI <dbl> 1.52101, 1.51761, 1.51618, 1.51766, 1.51742, 1.51596, 1.51743, 1.…
## $ Na <dbl> 13.64, 13.89, 13.53, 13.21, 13.27, 12.79, 13.30, 13.15, 14.04, 13…
## $ Mg <dbl> 4.49, 3.60, 3.55, 3.69, 3.62, 3.61, 3.60, 3.61, 3.58, 3.60, 3.46,…
## $ Al <dbl> 1.10, 1.36, 1.54, 1.29, 1.24, 1.62, 1.14, 1.05, 1.37, 1.36, 1.56,…
## $ Si <dbl> 71.78, 72.73, 72.99, 72.61, 73.08, 72.97, 73.09, 73.24, 72.08, 72…
## $ K <dbl> 0.06, 0.48, 0.39, 0.57, 0.55, 0.64, 0.58, 0.57, 0.56, 0.57, 0.67,…
## $ Ca <dbl> 8.75, 7.83, 7.78, 8.22, 8.07, 8.07, 8.17, 8.24, 8.30, 8.40, 8.09,…
## $ Ba <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Fe <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.26, 0.00, 0.00, 0.00, 0.11, 0.24,…
## $ Type <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
Based on the dataset
documentation, the target variable is Type, and the
numeric variables are features.
I found and removed one duplicate row.
sprintf("Duplicate rows in dataset: %d", nrow(Glass) - nrow(distinct(Glass)))
## [1] "Duplicate rows in dataset: 1"
Glass <- Glass %>%
distinct()
No variables had missing data. This is consistent with information in the dataset documentation.
get_na_vars(Glass)
## [1] "There are no variables with missing values"
No variables had near-zero variance.
# Indexes of columns with near-zero variance
zvar_idx <- caret::nearZeroVar(Glass)
sprintf('There are %d near-zero variance columns', length(zvar_idx))
## [1] "There are 0 near-zero variance columns"
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
I created a pairs plot to visualize the distribution of predictor variables (scatterplots [lower triangle] and density plots [diagonal]) and relationships between predictors (correlations [upper triangle]).
Glass %>%
select_if(is.numeric) %>%
# Color transparency and small data points to reduce overplotting in scatterplots
GGally::ggpairs(mapping = aes(alpha = 0.2),
lower = list(
continuous = wrap('points', size = 0.2, color = '#005AB5')),
diag = list(
# Create bar histograms on diagonal
continuous = wrap('barDiag', bins = 20, fill = '#DC3220')),
progress = FALSE)
The histograms on the diagonal show that the predictor variables have
a unimodal distribution. The distribution of Si appears to
be the closest to normal (ie, bell-shaped). Several variables
(K, Ba, Fe) are right skewed. Two
of these variables (Ba and Fe) are skewed due
to sparsity as they have more zero values than nonzero values.
The Pearson correlation coefficients in the upper triangle show that several pairs of variables are significantly correlated.
For positive correlations, RI and Ca are
most strongly correlated (r = 0.81), which indicates collinearity. This
is consistent with the scatterplot of these two variables (lower
triangle), which shows a linear relationship.
No other variable pair has a correlation coefficient or scatterplot that suggests multicollinearity.
# Convert correlation matrix to dataframe in long format
correlations_df <- Glass %>%
select_if(is.numeric) %>%
cor(method = 'pearson') %>%
get_upper_triangular_df() %>%
rownames_to_column('var1') %>%
pivot_longer(cols = -var1, names_to = 'var2', values_to = 'r') %>%
drop_na()
# Strongest positive correlations
correlations_df %>%
arrange(desc(r)) %>%
slice_head(n = 5) %>%
kbl() %>%
kable_styling()
| var1 | var2 | r |
|---|---|---|
| RI | Ca | 0.8111829 |
| Al | Ba | 0.4806424 |
| Na | Ba | 0.3290798 |
| Al | K | 0.3236828 |
| Na | Al | 0.1677350 |
For negative correlations, the strongest is a moderate correlation
between RI and Si (r = -0.54); however, this
does does not suggest collinearity.
# Strongest negative correlations
correlations_df %>%
arrange(r) %>%
slice_head(n = 5) %>%
kbl() %>%
kable_styling()
| var1 | var2 | r |
|---|---|---|
| RI | Si | -0.5389998 |
| Mg | Ba | -0.4918178 |
| Mg | Al | -0.4795752 |
| Mg | Ca | -0.4461971 |
| RI | Al | -0.4009733 |
Do there appear to be any outliers in the data? Are any predictors skewed?
The boxplots below show that all predictor variables except
Mg have outliers (as assessed by values that exceed the
first and third quartiles by more than than 1.5 times the interquartile
range).
# Pivot numeric variables to long format for plotting
Glass_long <- Glass %>%
select_if(is.numeric) %>%
pivot_longer(cols = everything(), names_to = 'variable', values_to = 'value')
ggplot(Glass_long, aes(x = value, fill = variable)) +
geom_boxplot() +
facet_wrap(~ variable, scales = 'free_x') +
theme(
legend.position = 'none'
)
The skewness values below show that the distribution of several
variables is right skewed. In particular, K and
Ba are the most right skewed (skewness >> 1).
Mg is left-skewed (skewness < -1).
skewness_vec <- Glass %>%
select_if(is.numeric) %>%
apply(MARGIN = 2, e1071::skewness)
skewness_vec
## RI Na Mg Al Si K Ca
## 1.6248785 0.4591599 -1.1300042 0.9252262 -0.7341909 6.4535281 2.0219586
## Ba Fe
## 3.3589166 1.7226422
Are there any relevant transformations of one or more predictors that might improve the classification model?
For skewed variables (part (b)), transformations may improve the data distribution. To determine the best transformation for each skewed predictor variable, I calculated the optimal lambda values for Box-Cox transformations.
I first identified variables that are not positive definite
(Mg, K, Ba, and Fe),
which is a requirement for Box-Cox transformation. To satisfy the
requirement, I added 1 to them.
Glass %>%
# Variables with skewness < -1 or > +1
dplyr::select(RI, Mg, K, Ca, Ba, Fe) %>%
apply(MARGIN = 2, function(x) any(x <= 0))
## RI Mg K Ca Ba Fe
## FALSE TRUE TRUE FALSE TRUE TRUE
Glass <- Glass %>%
mutate(
Mg1 = Mg + 1,
K1 = K + 1,
Ba1 = Ba + 1,
Fe1 = Fe + 1
)
Then I calculated the optimal lambda values for Box-Cox transformations.
For RI, \(\lambda_{best} =
-2\), which is an inverse squared transformation.
boxcox_result_RI <- MASS::boxcox(lm(Glass$RI ~ 1), plotit = FALSE)
(best_lambda_RI <- boxcox_result_RI$x[which.max(boxcox_result_RI$y)])
## [1] -2
# Transform and reshape as long format for plotting
compare_long_df <- BoxCoxTransform(Glass$RI, best_lambda_RI) %>%
pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())
The histograms below show that the transformation did not change the distribution much.
plot_title <- latex2exp::TeX(paste0('Effect of Box-Cox transformation (',
'$\\lambda$ = ', best_lambda_RI, ') on RI'))
create_sbs_histograms(compare_long_df, plot_title)
For Mg, \(\lambda_{best} =
2\), which is a squared transformation.
boxcox_result_Mg <- MASS::boxcox(lm(Glass$Mg1 ~ 1), plotit = FALSE)
(best_lambda_Mg <- boxcox_result_Mg$x[which.max(boxcox_result_Mg$y)])
## [1] 2
# Transform and reshape as long format for plotting
compare_long_df <- BoxCoxTransform(Glass$Mg1, best_lambda_Mg) %>%
pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())
The histograms below show that the transformation did not change the distribution much.
plot_title <- latex2exp::TeX(paste0('Effect of Box-Cox transformation (',
'$\\lambda$ = ', best_lambda_Mg, ') on Mg'))
create_sbs_histograms(compare_long_df, plot_title)
For K, \(\lambda_{best} =
-1\), which is an inverse transformation.
boxcox_result_K <- MASS::boxcox(lm(Glass$K1 ~ 1), plotit = FALSE)
(best_lambda_K <- boxcox_result_K$x[which.max(boxcox_result_K$y)])
## [1] -1
# Transform and reshape as long format for plotting
compare_long_df <- BoxCoxTransform(Glass$K, best_lambda_K) %>%
pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())
The histograms below show that the transformation changed the distribution, but it is not normalized.
plot_title <- latex2exp::TeX(paste0('Effect of Box-Cox transformation (',
'$\\lambda$ = ', best_lambda_K, ') on K'))
create_sbs_histograms(compare_long_df, plot_title)
## Warning: Removed 30 rows containing non-finite outside the scale range
## (`stat_bin()`).
For Ca, \(\lambda_{best} =
-1.1\), which is close to an inverse transformation.
boxcox_result_Ca <- MASS::boxcox(lm(Glass$Ca ~ 1), plotit = FALSE)
(best_lambda_Ca <- boxcox_result_Ca$x[which.max(boxcox_result_Ca$y)])
## [1] -1.1
# Transform and reshape as long format for plotting
compare_long_df <- BoxCoxTransform(Glass$Ca, best_lambda_Ca) %>%
pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())
The histograms below show that the transformation improved the distribution somewhat.
plot_title <- latex2exp::TeX(paste0('Effect of Box-Cox transformation (',
'$\\lambda$ = ', best_lambda_Ca, ') on Ca'))
create_sbs_histograms(compare_long_df, plot_title)
For Ba, \(\lambda_{best} =
-2\), which is an inverse squared transformation.
boxcox_result_Ba <- MASS::boxcox(lm(Glass$Ba1 ~ 1), plotit = FALSE)
(best_lambda_Ba <- boxcox_result_Ba$x[which.max(boxcox_result_Ba$y)])
## [1] -2
# Transform and reshape as long format for plotting
compare_long_df <- BoxCoxTransform(Glass$Ba1, best_lambda_Ba) %>%
pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())
The histograms below show that the transformation did not change the distribution much. This may be due to the sparsity of this variable.
plot_title <- latex2exp::TeX(paste0('Effect of Box-Cox transformation (',
'$\\lambda$ = ', best_lambda_Ba, ') on Ba'))
create_sbs_histograms(compare_long_df, plot_title)
For Fe, \(\lambda_{best} =
-2\), which is an inverse squared transformation.
boxcox_result_Fe <- MASS::boxcox(lm(Glass$Fe1 ~ 1), plotit = FALSE)
(best_lambda_Fe <- boxcox_result_Fe$x[which.max(boxcox_result_Fe$y)])
## [1] -2
# Transform and reshape as long format for plotting
compare_long_df <- BoxCoxTransform(Glass$Fe1, best_lambda_Fe) %>%
pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())
The histograms below show that the transformation did not change the distribution much. This may be due to the sparsity of this variable.
plot_title <- latex2exp::TeX(paste0('Effect of Box-Cox transformation (',
'$\\lambda$ = ', best_lambda_Fe, ') on Fe'))
create_sbs_histograms(compare_long_df, plot_title)
Overall, the plots above suggest that Box-Cox transformations are not
very effective to normalize the predictor variables. The only variable
with some effect was Ca. Given its collinearity with
RI (part (a)), the Box-Cox transformed Ca may
be more more suitable for a predictive model.
Next, I decided to try spatial sign transformations on all skewed predictor variables that had outliers (part (b)).
Glass_ss <- Glass %>%
select_if(is.numeric) %>%
scale() %>%
caret::spatialSign() %>%
as.data.frame()
colnames(Glass_ss) <- paste(colnames(Glass_ss), '_ss', sep = '')
The histograms below show that the spatial sign transformation
improved the distribution of RI.
transform_df <- data.frame(untransformed = Glass$RI, transformed = Glass_ss$RI_ss)
compare_long_df <- transform_df %>%
pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())
plot_title <- 'Effect of spatial sign transformation on RI'
create_sbs_histograms(compare_long_df, plot_title)
The histograms below show that the spatial sign transformation
improved the distribution of K.
transform_df <- data.frame(untransformed = Glass$K, transformed = Glass_ss$K_ss)
compare_long_df <- transform_df %>%
pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())
plot_title <- 'Effect of spatial sign transformation on K'
create_sbs_histograms(compare_long_df, plot_title)
The histograms below show that the spatial sign transformation
worsened the distribution of Ca.
transform_df <- data.frame(untransformed = Glass$Ca, transformed = Glass_ss$Ca_ss)
compare_long_df <- transform_df %>%
pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())
plot_title <- 'Effect of spatial sign transformation on Ca'
create_sbs_histograms(compare_long_df, plot_title)
The histograms below show that the spatial sign transformation
improved the distribution of Ba.
transform_df <- data.frame(untransformed = Glass$Ba, transformed = Glass_ss$Ba_ss)
compare_long_df <- transform_df %>%
pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())
plot_title <- 'Effect of spatial sign transformation on Ba'
create_sbs_histograms(compare_long_df, plot_title)
The histograms below show that the spatial sign transformation
worsened the distribution of Fe, making it bimodal.
transform_df <- data.frame(untransformed = Glass$Fe, transformed = Glass_ss$Fe_ss)
compare_long_df <- transform_df %>%
pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())
plot_title <- 'Effect of spatial sign transformation on Fe'
create_sbs_histograms(compare_long_df, plot_title)
Overall, spatial sign transformations were more effective than
Box-Cox transformations and improved the distribution (ie, more normal)
of three skewed predictor variables (RI, K,
and Ba). For Ca, spatial sign transformation
worsened the distribution; however, Box-Cox transformation with \(\lambda = -1.1\) (ie, inverse) showed some
improvement, so I would use the latter. The remaining predictor
variables in the Glass dataset would likely be used as is
for model development.
Depending on the choice of classifier, scaling and centering may also be useful transformations.
data(Soybean)
I performed some basic checks on the data before answering the exercise questions.
The data appears to be tidy and data types appear to be appropriate.
glimpse(Soybean)
## Rows: 683
## Columns: 36
## $ Class <fct> diaporthe-stem-canker, diaporthe-stem-canker, diaporth…
## $ date <fct> 6, 4, 3, 3, 6, 5, 5, 4, 6, 4, 6, 4, 3, 6, 6, 5, 6, 4, …
## $ plant.stand <ord> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ precip <ord> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ temp <ord> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, …
## $ hail <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, …
## $ crop.hist <fct> 1, 2, 1, 1, 2, 3, 2, 1, 3, 2, 1, 1, 1, 3, 1, 3, 0, 2, …
## $ area.dam <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 2, 3, 3, 3, 2, 2, …
## $ sever <fct> 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ seed.tmt <fct> 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, …
## $ germ <ord> 0, 1, 2, 1, 2, 1, 0, 2, 1, 2, 0, 1, 0, 0, 1, 2, 0, 1, …
## $ plant.growth <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ leaves <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ leaf.halo <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.marg <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ leaf.size <ord> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ leaf.shread <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.malf <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.mild <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stem <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ lodging <fct> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, …
## $ stem.cankers <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ canker.lesion <fct> 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ fruiting.bodies <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ext.decay <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ mycelium <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ int.discolor <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ sclerotia <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ fruit.pods <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ fruit.spots <fct> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ seed <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ mold.growth <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ seed.discolor <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ seed.size <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ shriveling <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ roots <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
Based on the dataset
documentation, the target variable is Class. The other
35 variables are features.
I found and removed 52 duplicate rows.
sprintf("Duplicate rows in dataset: %d", nrow(Soybean) - nrow(distinct(Soybean)))
## [1] "Duplicate rows in dataset: 52"
Soybean <- Soybean %>%
distinct()
There are 34 variables with missing data, which is consistent with the dataset documentation.
All variables except leaves have missing data. The
proportion of missing data ranged from 0.2% area.dam and
date) to 15.7% (hail, sever,
seed.tmt, and lodging). Based on these
fractions, the affected variables can be imputed rather than omitted
from the classifier.
Additional information about variables with missingness is in part (b) of this exercise.
missing_data_df <- get_na_vars(Soybean) %>%
bind_rows() %>%
pivot_longer(names_to = 'variable', values_to = 'null_rows', cols = everything()) %>%
mutate(
missing_frac = round(null_rows / nrow(Soybean), 3)
) %>%
arrange(desc(null_rows))
missing_data_df %>%
kbl() %>%
kable_styling() %>%
scroll_box('300px')
| variable | null_rows | missing_frac |
|---|---|---|
| hail | 99 | 0.157 |
| sever | 99 | 0.157 |
| seed.tmt | 99 | 0.157 |
| lodging | 99 | 0.157 |
| germ | 91 | 0.144 |
| leaf.mild | 87 | 0.138 |
| fruiting.bodies | 85 | 0.135 |
| fruit.spots | 85 | 0.135 |
| seed.discolor | 85 | 0.135 |
| shriveling | 85 | 0.135 |
| leaf.shread | 79 | 0.125 |
| seed | 75 | 0.119 |
| mold.growth | 75 | 0.119 |
| seed.size | 75 | 0.119 |
| fruit.pods | 67 | 0.106 |
| leaf.halo | 63 | 0.100 |
| leaf.marg | 63 | 0.100 |
| leaf.size | 63 | 0.100 |
| leaf.malf | 63 | 0.100 |
| precip | 34 | 0.054 |
| stem.cankers | 34 | 0.054 |
| canker.lesion | 34 | 0.054 |
| ext.decay | 34 | 0.054 |
| mycelium | 34 | 0.054 |
| int.discolor | 34 | 0.054 |
| sclerotia | 34 | 0.054 |
| plant.stand | 32 | 0.051 |
| roots | 30 | 0.048 |
| temp | 26 | 0.041 |
| crop.hist | 16 | 0.025 |
| plant.growth | 16 | 0.025 |
| stem | 16 | 0.025 |
| date | 1 | 0.002 |
| area.dam | 1 | 0.002 |
sprintf('The fraction of missing data in predictor variables ranges from %.3f to %.3f',
min(missing_data_df$missing_frac), max(missing_data_df$missing_frac))
## [1] "The fraction of missing data in predictor variables ranges from 0.002 to 0.157"
# Variables that do not have missing data
soybean_vars <- colnames(Soybean)
missing_data_vars <- missing_data_df$variable
complete_vars <- setdiff(soybean_vars, missing_data_vars)
sprintf('Soybean variables with no missing data: %s', paste(complete_vars, collapse = ', '))
## [1] "Soybean variables with no missing data: Class, leaves"
There were 3 variables (lead.mild,
mycelium, and sclerotia) with near-zero
variance. These variables are not likely to be informative for a machine
learning classifier.
# Indexes of columns with near-zero variance
zvar_idx <- caret::nearZeroVar(Soybean)
sprintf('There are %d near-zero variance columns', length(zvar_idx))
## [1] "There are 3 near-zero variance columns"
names(Soybean)[zvar_idx]
## [1] "leaf.mild" "mycelium" "sclerotia"
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
The bar plots below show that predictor variables have 2 to 7
categories (levels); however, they are often not distributed evenly.
Except for area.dam, most variables have a dominant
category (ie, value with higher frequency).
The most extreme case, where all observations fall into a single
category, is known as a degenerate distribution. Strictly speaking, none
of the distributions meet this criterion; however, several are close.
The variables with the most pronounced near-degenerate distribution are
mycelium, sclerotia, and
leaf.mild, which were also identified using the
nearZeroVar() function from the caret
package.
# Due to differences between factor and ordinal data types in the Soybean dataset,
# I coerced categorical variables to be numeric in order to reshape and create facet plots
Soybean_numeric <- type.convert(Soybean, as.is = FALSE)
# Reshape data to long format for facet plot
Soybean_numeric_long <- Soybean_numeric %>%
# Omit target variable
dplyr::select(-c(Class)) %>%
pivot_longer(cols = everything(), names_to = 'variable', values_to = 'category')
# Create faceted barplots
ggplot(Soybean_numeric_long, aes(x = category, fill = variable)) +
geom_bar(na.rm = TRUE) +
facet_wrap(~ variable, scales = 'free') +
labs(
x = 'Category', y = 'Count',
title = 'Frequency distribution of Soybean variables'
) +
theme_minimal() +
theme(
legend.position = 'none'
)
Are there particular predictors that are more likely to be missing?
Yes, some predictors are more likely to have missing data. In the Data checks section, I showed that the fraction of missing data varies from 0.2% to 15.7%. Here, I list the predictors with >5% missingness.
Of note, this subset includes the three near-degenerate predictors
identified in part (a) (mycelium, sclerotia,
and leaf.mild).
high_missingness_vars <- missing_data_df %>%
filter(missing_frac > 0.05)
high_missingness_vars %>%
kbl() %>%
kable_styling() %>%
scroll_box('300px')
| variable | null_rows | missing_frac |
|---|---|---|
| hail | 99 | 0.157 |
| sever | 99 | 0.157 |
| seed.tmt | 99 | 0.157 |
| lodging | 99 | 0.157 |
| germ | 91 | 0.144 |
| leaf.mild | 87 | 0.138 |
| fruiting.bodies | 85 | 0.135 |
| fruit.spots | 85 | 0.135 |
| seed.discolor | 85 | 0.135 |
| shriveling | 85 | 0.135 |
| leaf.shread | 79 | 0.125 |
| seed | 75 | 0.119 |
| mold.growth | 75 | 0.119 |
| seed.size | 75 | 0.119 |
| fruit.pods | 67 | 0.106 |
| leaf.halo | 63 | 0.100 |
| leaf.marg | 63 | 0.100 |
| leaf.size | 63 | 0.100 |
| leaf.malf | 63 | 0.100 |
| precip | 34 | 0.054 |
| stem.cankers | 34 | 0.054 |
| canker.lesion | 34 | 0.054 |
| ext.decay | 34 | 0.054 |
| mycelium | 34 | 0.054 |
| int.discolor | 34 | 0.054 |
| sclerotia | 34 | 0.054 |
| plant.stand | 32 | 0.051 |
Similarly, a visual representation of missing data in the Soybean dataset shows that there are systematic patterns of missing data and that missingness does not occur at random (ie, the gray areas corresponding to NA are not scattered).
vis_dat(Soybean)
Is the pattern of missing data related to the classes?
Yes, the pattern of missing data is related to the soybean disease
classes. The first clue to this is that not all classes have missing
data. Among the 5 classes with missing data,
phytophthora-rot has the most missing observations,
followed by 2-4-d-injury.
# Reshape soybean data to long format
# Note I reshaped the dataset in part (a), but this version includes Class
Soybean_numeric_long2 <- Soybean_numeric %>%
pivot_longer(cols = -Class, names_to = 'variable', values_to = 'value')
# Count number of observations with missing data by category
Soybean_missing_by_class <- Soybean_numeric_long2 %>%
group_by(Class, value) %>%
count(name = 'total_obs') %>%
ungroup() %>%
filter(is.na(value)) %>%
rename('missing_obs' = 'total_obs') %>%
dplyr::select(Class, missing_obs) %>%
arrange(desc(missing_obs))
Soybean_missing_by_class %>%
kbl() %>%
kable_styling()
| Class | missing_obs |
|---|---|
| phytophthora-rot | 897 |
| 2-4-d-injury | 450 |
| cyst-nematode | 240 |
| diaporthe-pod-&-stem-blight | 166 |
| herbicide-injury | 160 |
sprintf('%d of %d classes in Soybean have missing data',
nrow(Soybean_missing_by_class), length(table(Soybean$Class)))
## [1] "5 of 19 classes in Soybean have missing data"
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
Since some predictors in the Soybean dataset are more likely than
others to have missing data, missing completely at random (MCAR) is not
likely. The systematic pattern of missingness as well as differences in
missingness by Class suggest that missing at random (MAR)
is more likely. This suggests that imputing missing data would be
preferred to eliminating variables with missing data. However, I would
remove the near-degenerate variables since they are not likely to be
informative in a classification model.
Potential imputation methods include random forest models or multiple imputation by chained equations (MICE). These methods are preferred to simple imputation methods, such as using a summary statistic (eg, median) or k-nearest neighbors, because they generate multiple datasets with imputed values, which are more statistically robust (similar to the advantages of using k-fold cross-validation).
Kuhn M & Johnson K. (2013) Applied predictive modeling. DOI: 10.1007/978-1-4614-6849-3
Tuzen MF. (2025). Handling missing data in R: A comprehensive guide. https://www.r-bloggers.com/2025/08/handling-missing-data-in-r-a-comprehensive-guide/
sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.2
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kableExtra_1.4.0 latex2exp_0.9.8 visdat_0.6.0 caret_7.0-1
## [5] lattice_0.22-7 MASS_7.3-65 e1071_1.7-17 GGally_2.4.0
## [9] mlbench_2.1-7 lubridate_1.9.4 forcats_1.0.1 stringr_1.6.0
## [13] dplyr_1.1.4 purrr_1.2.1 readr_2.1.6 tidyr_1.3.2
## [17] tibble_3.3.1 ggplot2_4.0.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 viridisLite_0.4.2 timeDate_4052.112
## [4] farver_2.1.2 S7_0.2.1 fastmap_1.2.0
## [7] pROC_1.19.0.1 digest_0.6.39 rpart_4.1.24
## [10] timechange_0.3.0 lifecycle_1.0.5 survival_3.8-6
## [13] magrittr_2.0.4 compiler_4.5.2 rlang_1.1.7
## [16] sass_0.4.10 tools_4.5.2 yaml_2.3.12
## [19] data.table_1.18.2.1 knitr_1.51 labeling_0.4.3
## [22] xml2_1.5.2 plyr_1.8.9 RColorBrewer_1.1-3
## [25] withr_3.0.2 stats4_4.5.2 nnet_7.3-20
## [28] grid_4.5.2 future_1.69.0 globals_0.18.0
## [31] scales_1.4.0 iterators_1.0.14 cli_3.6.5
## [34] rmarkdown_2.30 generics_0.1.4 otel_0.2.0
## [37] rstudioapi_0.18.0 future.apply_1.20.1 reshape2_1.4.5
## [40] tzdb_0.5.0 cachem_1.1.0 proxy_0.4-29
## [43] splines_4.5.2 parallel_4.5.2 vctrs_0.7.1
## [46] hardhat_1.4.2 Matrix_1.7-4 jsonlite_2.0.0
## [49] hms_1.1.4 listenv_0.10.0 systemfonts_1.3.1
## [52] foreach_1.5.2 gower_1.0.2 jquerylib_0.1.4
## [55] recipes_1.3.1 glue_1.8.0 parallelly_1.46.1
## [58] ggstats_0.12.0 codetools_0.2-20 stringi_1.8.7
## [61] gtable_0.3.6 pillar_1.11.1 htmltools_0.5.9
## [64] ipred_0.9-15 lava_1.8.2 R6_2.6.1
## [67] textshaping_1.0.4 evaluate_1.0.5 bslib_0.10.0
## [70] class_7.3-23 Rcpp_1.1.1 svglite_2.2.2
## [73] nlme_3.1-168 prodlim_2025.04.28 xfun_0.56
## [76] ModelMetrics_1.2.2.2 pkgconfig_2.0.3