Exercise 3.1

Data input

data(Glass)

Data checks

I performed some basic checks on the data before answering the exercise questions.

Data types

The data appears to be tidy and data types appear to be appropriate.

glimpse(Glass)

## Rows: 214
## Columns: 10
## $ RI   <dbl> 1.52101, 1.51761, 1.51618, 1.51766, 1.51742, 1.51596, 1.51743, 1.…
## $ Na   <dbl> 13.64, 13.89, 13.53, 13.21, 13.27, 12.79, 13.30, 13.15, 14.04, 13…
## $ Mg   <dbl> 4.49, 3.60, 3.55, 3.69, 3.62, 3.61, 3.60, 3.61, 3.58, 3.60, 3.46,…
## $ Al   <dbl> 1.10, 1.36, 1.54, 1.29, 1.24, 1.62, 1.14, 1.05, 1.37, 1.36, 1.56,…
## $ Si   <dbl> 71.78, 72.73, 72.99, 72.61, 73.08, 72.97, 73.09, 73.24, 72.08, 72…
## $ K    <dbl> 0.06, 0.48, 0.39, 0.57, 0.55, 0.64, 0.58, 0.57, 0.56, 0.57, 0.67,…
## $ Ca   <dbl> 8.75, 7.83, 7.78, 8.22, 8.07, 8.07, 8.17, 8.24, 8.30, 8.40, 8.09,…
## $ Ba   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Fe   <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.26, 0.00, 0.00, 0.00, 0.11, 0.24,…
## $ Type <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…

Based on the dataset documentation, the target variable is Type, and the numeric variables are features.

Duplicate data

I found and removed one duplicate row.

sprintf("Duplicate rows in dataset: %d", nrow(Glass) - nrow(distinct(Glass)))

## [1] "Duplicate rows in dataset: 1"

Glass <- Glass %>%
  distinct()

Missing data

No variables had missing data. This is consistent with information in the dataset documentation.

get_na_vars(Glass)

## [1] "There are no variables with missing values"

Zero-variance variables

No variables had near-zero variance.

# Indexes of columns with near-zero variance
zvar_idx <- caret::nearZeroVar(Glass)
sprintf('There are %d near-zero variance columns', length(zvar_idx))

## [1] "There are 0 near-zero variance columns"

a. Explore predictor variables

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

I created a pairs plot to visualize the distribution of predictor variables (scatterplots [lower triangle] and density plots [diagonal]) and relationships between predictors (correlations [upper triangle]).

Glass %>%
  select_if(is.numeric) %>%
  # Color transparency and small data points to reduce overplotting in scatterplots
  GGally::ggpairs(mapping = aes(alpha = 0.2),
                  lower = list(
                    continuous = wrap('points', size = 0.2, color = '#005AB5')),
                  diag = list(
                    # Create bar histograms on diagonal
                    continuous = wrap('barDiag', bins = 20, fill = '#DC3220')),
                  progress = FALSE)

Distributions

The histograms on the diagonal show that the predictor variables have a unimodal distribution. The distribution of Si appears to be the closest to normal (ie, bell-shaped). Several variables (K, Ba, Fe) are right skewed. Two of these variables (Ba and Fe) are skewed due to sparsity as they have more zero values than nonzero values.

Relationships between predictors

The Pearson correlation coefficients in the upper triangle show that several pairs of variables are significantly correlated.

For positive correlations, RI and Ca are most strongly correlated (r = 0.81), which indicates collinearity. This is consistent with the scatterplot of these two variables (lower triangle), which shows a linear relationship.

No other variable pair has a correlation coefficient or scatterplot that suggests multicollinearity.

# Convert correlation matrix to dataframe in long format
correlations_df <- Glass %>%
  select_if(is.numeric) %>%
  cor(method = 'pearson') %>%
  get_upper_triangular_df() %>%
  rownames_to_column('var1') %>%
  pivot_longer(cols = -var1, names_to = 'var2', values_to = 'r') %>%
  drop_na()

# Strongest positive correlations
correlations_df %>%
  arrange(desc(r)) %>%
  slice_head(n = 5) %>%
  kbl() %>%
  kable_styling()

var1	var2	r
RI	Ca	0.8111829
Al	Ba	0.4806424
Na	Ba	0.3290798
Al	K	0.3236828
Na	Al	0.1677350

For negative correlations, the strongest is a moderate correlation between RI and Si (r = -0.54); however, this does does not suggest collinearity.

# Strongest negative correlations
correlations_df %>%
  arrange(r) %>%
  slice_head(n = 5) %>%
  kbl() %>%
  kable_styling()

var1	var2	r
RI	Si	-0.5389998
Mg	Ba	-0.4918178
Mg	Al	-0.4795752
Mg	Ca	-0.4461971
RI	Al	-0.4009733

b. Outliers

Do there appear to be any outliers in the data? Are any predictors skewed?

The boxplots below show that all predictor variables except Mg have outliers (as assessed by values that exceed the first and third quartiles by more than than 1.5 times the interquartile range).

# Pivot numeric variables to long format for plotting
Glass_long <- Glass %>%
  select_if(is.numeric) %>%  
  pivot_longer(cols = everything(), names_to = 'variable', values_to = 'value')

ggplot(Glass_long, aes(x = value, fill = variable)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = 'free_x') +
  theme(
    legend.position = 'none'
  )

The skewness values below show that the distribution of several variables is right skewed. In particular, K and Ba are the most right skewed (skewness >> 1). Mg is left-skewed (skewness < -1).

skewness_vec <- Glass %>%
  select_if(is.numeric) %>%  
  apply(MARGIN = 2, e1071::skewness)

skewness_vec

##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6248785  0.4591599 -1.1300042  0.9252262 -0.7341909  6.4535281  2.0219586 
##         Ba         Fe 
##  3.3589166  1.7226422

c. Transformations

Are there any relevant transformations of one or more predictors that might improve the classification model?

Box-Cox transformations

For skewed variables (part (b)), transformations may improve the data distribution. To determine the best transformation for each skewed predictor variable, I calculated the optimal lambda values for Box-Cox transformations.

I first identified variables that are not positive definite (Mg, K, Ba, and Fe), which is a requirement for Box-Cox transformation. To satisfy the requirement, I added 1 to them.

Glass %>%
  # Variables with skewness < -1 or > +1
  dplyr::select(RI, Mg, K, Ca, Ba, Fe) %>%
  apply(MARGIN = 2, function(x) any(x <= 0))

##    RI    Mg     K    Ca    Ba    Fe 
## FALSE  TRUE  TRUE FALSE  TRUE  TRUE

Glass <- Glass %>%
  mutate(
    Mg1 = Mg + 1,
    K1 = K + 1,
    Ba1 = Ba + 1,
    Fe1 = Fe + 1
  )

Then I calculated the optimal lambda values for Box-Cox transformations.

For RI, \(\lambda_{best} = -2\), which is an inverse squared transformation.

boxcox_result_RI <- MASS::boxcox(lm(Glass$RI ~ 1), plotit = FALSE)
(best_lambda_RI <- boxcox_result_RI$x[which.max(boxcox_result_RI$y)])

## [1] -2

# Transform and reshape as long format for plotting
compare_long_df <- BoxCoxTransform(Glass$RI, best_lambda_RI) %>%
  pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())

The histograms below show that the transformation did not change the distribution much.

plot_title <- latex2exp::TeX(paste0('Effect of Box-Cox transformation (',
                                    '$\\lambda$ = ', best_lambda_RI, ') on RI'))

create_sbs_histograms(compare_long_df, plot_title)

For Mg, \(\lambda_{best} = 2\), which is a squared transformation.

boxcox_result_Mg <- MASS::boxcox(lm(Glass$Mg1 ~ 1), plotit = FALSE)
(best_lambda_Mg <- boxcox_result_Mg$x[which.max(boxcox_result_Mg$y)])

## [1] 2

# Transform and reshape as long format for plotting
compare_long_df <- BoxCoxTransform(Glass$Mg1, best_lambda_Mg) %>%
  pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())

The histograms below show that the transformation did not change the distribution much.

plot_title <- latex2exp::TeX(paste0('Effect of Box-Cox transformation (',
                                    '$\\lambda$ = ', best_lambda_Mg, ') on Mg'))

create_sbs_histograms(compare_long_df, plot_title)

For K, \(\lambda_{best} = -1\), which is an inverse transformation.

boxcox_result_K <- MASS::boxcox(lm(Glass$K1 ~ 1), plotit = FALSE)
(best_lambda_K <- boxcox_result_K$x[which.max(boxcox_result_K$y)])

## [1] -1

# Transform and reshape as long format for plotting
compare_long_df <- BoxCoxTransform(Glass$K, best_lambda_K) %>%
  pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())

The histograms below show that the transformation changed the distribution, but it is not normalized.

plot_title <- latex2exp::TeX(paste0('Effect of Box-Cox transformation (',
                                    '$\\lambda$ = ', best_lambda_K, ') on K'))

create_sbs_histograms(compare_long_df, plot_title)

## Warning: Removed 30 rows containing non-finite outside the scale range
## (`stat_bin()`).

For Ca, \(\lambda_{best} = -1.1\), which is close to an inverse transformation.

boxcox_result_Ca <- MASS::boxcox(lm(Glass$Ca ~ 1), plotit = FALSE)
(best_lambda_Ca <- boxcox_result_Ca$x[which.max(boxcox_result_Ca$y)])

## [1] -1.1

# Transform and reshape as long format for plotting
compare_long_df <- BoxCoxTransform(Glass$Ca, best_lambda_Ca) %>%
  pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())

The histograms below show that the transformation improved the distribution somewhat.

plot_title <- latex2exp::TeX(paste0('Effect of Box-Cox transformation (',
                                    '$\\lambda$ = ', best_lambda_Ca, ') on Ca'))

create_sbs_histograms(compare_long_df, plot_title)

For Ba, \(\lambda_{best} = -2\), which is an inverse squared transformation.

boxcox_result_Ba <- MASS::boxcox(lm(Glass$Ba1 ~ 1), plotit = FALSE)
(best_lambda_Ba <- boxcox_result_Ba$x[which.max(boxcox_result_Ba$y)])

## [1] -2

# Transform and reshape as long format for plotting
compare_long_df <- BoxCoxTransform(Glass$Ba1, best_lambda_Ba) %>%
  pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())

The histograms below show that the transformation did not change the distribution much. This may be due to the sparsity of this variable.

plot_title <- latex2exp::TeX(paste0('Effect of Box-Cox transformation (',
                                    '$\\lambda$ = ', best_lambda_Ba, ') on Ba'))

create_sbs_histograms(compare_long_df, plot_title)

For Fe, \(\lambda_{best} = -2\), which is an inverse squared transformation.

boxcox_result_Fe <- MASS::boxcox(lm(Glass$Fe1 ~ 1), plotit = FALSE)
(best_lambda_Fe <- boxcox_result_Fe$x[which.max(boxcox_result_Fe$y)])

## [1] -2

# Transform and reshape as long format for plotting
compare_long_df <- BoxCoxTransform(Glass$Fe1, best_lambda_Fe) %>%
  pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())

The histograms below show that the transformation did not change the distribution much. This may be due to the sparsity of this variable.

plot_title <- latex2exp::TeX(paste0('Effect of Box-Cox transformation (',
                                    '$\\lambda$ = ', best_lambda_Fe, ') on Fe'))

create_sbs_histograms(compare_long_df, plot_title)

Overall, the plots above suggest that Box-Cox transformations are not very effective to normalize the predictor variables. The only variable with some effect was Ca. Given its collinearity with RI (part (a)), the Box-Cox transformed Ca may be more more suitable for a predictive model.

Spatial sign transformations

Next, I decided to try spatial sign transformations on all skewed predictor variables that had outliers (part (b)).

Glass_ss <- Glass %>%
  select_if(is.numeric) %>%
  scale() %>%
  caret::spatialSign() %>%
  as.data.frame()

colnames(Glass_ss) <- paste(colnames(Glass_ss), '_ss', sep = '')

The histograms below show that the spatial sign transformation improved the distribution of RI.

transform_df <- data.frame(untransformed = Glass$RI, transformed = Glass_ss$RI_ss)

compare_long_df <- transform_df %>%
  pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())

plot_title <- 'Effect of spatial sign transformation on RI'
create_sbs_histograms(compare_long_df, plot_title)

The histograms below show that the spatial sign transformation improved the distribution of K.

transform_df <- data.frame(untransformed = Glass$K, transformed = Glass_ss$K_ss)

compare_long_df <- transform_df %>%
  pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())

plot_title <- 'Effect of spatial sign transformation on K'
create_sbs_histograms(compare_long_df, plot_title)

The histograms below show that the spatial sign transformation worsened the distribution of Ca.

transform_df <- data.frame(untransformed = Glass$Ca, transformed = Glass_ss$Ca_ss)

compare_long_df <- transform_df %>%
  pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())

plot_title <- 'Effect of spatial sign transformation on Ca'
create_sbs_histograms(compare_long_df, plot_title)

The histograms below show that the spatial sign transformation improved the distribution of Ba.

transform_df <- data.frame(untransformed = Glass$Ba, transformed = Glass_ss$Ba_ss)

compare_long_df <- transform_df %>%
  pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())

plot_title <- 'Effect of spatial sign transformation on Ba'
create_sbs_histograms(compare_long_df, plot_title)

The histograms below show that the spatial sign transformation worsened the distribution of Fe, making it bimodal.

transform_df <- data.frame(untransformed = Glass$Fe, transformed = Glass_ss$Fe_ss)

compare_long_df <- transform_df %>%
  pivot_longer(names_to = 'transform_status', values_to = 'value', cols = everything())

plot_title <- 'Effect of spatial sign transformation on Fe'
create_sbs_histograms(compare_long_df, plot_title)

Conclusions

Overall, spatial sign transformations were more effective than Box-Cox transformations and improved the distribution (ie, more normal) of three skewed predictor variables (RI, K, and Ba). For Ca, spatial sign transformation worsened the distribution; however, Box-Cox transformation with \(\lambda = -1.1\) (ie, inverse) showed some improvement, so I would use the latter. The remaining predictor variables in the Glass dataset would likely be used as is for model development.

Depending on the choice of classifier, scaling and centering may also be useful transformations.

Exercise 3.2

Data input

data(Soybean)

Data checks

I performed some basic checks on the data before answering the exercise questions.

Data types

The data appears to be tidy and data types appear to be appropriate.

glimpse(Soybean)

## Rows: 683
## Columns: 36
## $ Class           <fct> diaporthe-stem-canker, diaporthe-stem-canker, diaporth…
## $ date            <fct> 6, 4, 3, 3, 6, 5, 5, 4, 6, 4, 6, 4, 3, 6, 6, 5, 6, 4, …
## $ plant.stand     <ord> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ precip          <ord> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ temp            <ord> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, …
## $ hail            <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, …
## $ crop.hist       <fct> 1, 2, 1, 1, 2, 3, 2, 1, 3, 2, 1, 1, 1, 3, 1, 3, 0, 2, …
## $ area.dam        <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 2, 3, 3, 3, 2, 2, …
## $ sever           <fct> 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ seed.tmt        <fct> 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, …
## $ germ            <ord> 0, 1, 2, 1, 2, 1, 0, 2, 1, 2, 0, 1, 0, 0, 1, 2, 0, 1, …
## $ plant.growth    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ leaves          <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ leaf.halo       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.marg       <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ leaf.size       <ord> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ leaf.shread     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.malf       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.mild       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stem            <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ lodging         <fct> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, …
## $ stem.cankers    <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ canker.lesion   <fct> 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ fruiting.bodies <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ext.decay       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ mycelium        <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ int.discolor    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ sclerotia       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ fruit.pods      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ fruit.spots     <fct> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ seed            <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ mold.growth     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ seed.discolor   <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ seed.size       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ shriveling      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ roots           <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

Based on the dataset documentation, the target variable is Class. The other 35 variables are features.

Duplicate data

I found and removed 52 duplicate rows.

sprintf("Duplicate rows in dataset: %d", nrow(Soybean) - nrow(distinct(Soybean)))

## [1] "Duplicate rows in dataset: 52"

Soybean <- Soybean %>%
  distinct()

Missing data

There are 34 variables with missing data, which is consistent with the dataset documentation.

All variables except leaves have missing data. The proportion of missing data ranged from 0.2% area.dam and date) to 15.7% (hail, sever, seed.tmt, and lodging). Based on these fractions, the affected variables can be imputed rather than omitted from the classifier.

Additional information about variables with missingness is in part (b) of this exercise.

missing_data_df <- get_na_vars(Soybean) %>%
  bind_rows() %>%
  pivot_longer(names_to = 'variable', values_to = 'null_rows', cols = everything()) %>%
  mutate(
    missing_frac = round(null_rows / nrow(Soybean), 3)
  ) %>%
  arrange(desc(null_rows))

missing_data_df %>%
  kbl() %>%
  kable_styling() %>%
  scroll_box('300px')

variable	null_rows	missing_frac
hail	99	0.157
sever	99	0.157
seed.tmt	99	0.157
lodging	99	0.157
germ	91	0.144
leaf.mild	87	0.138
fruiting.bodies	85	0.135
fruit.spots	85	0.135
seed.discolor	85	0.135
shriveling	85	0.135
leaf.shread	79	0.125
seed	75	0.119
mold.growth	75	0.119
seed.size	75	0.119
fruit.pods	67	0.106
leaf.halo	63	0.100
leaf.marg	63	0.100
leaf.size	63	0.100
leaf.malf	63	0.100
precip	34	0.054
stem.cankers	34	0.054
canker.lesion	34	0.054
ext.decay	34	0.054
mycelium	34	0.054
int.discolor	34	0.054
sclerotia	34	0.054
plant.stand	32	0.051
roots	30	0.048
temp	26	0.041
crop.hist	16	0.025
plant.growth	16	0.025
stem	16	0.025
date	1	0.002
area.dam	1	0.002

sprintf('The fraction of missing data in predictor variables ranges from %.3f to %.3f',
        min(missing_data_df$missing_frac), max(missing_data_df$missing_frac))

## [1] "The fraction of missing data in predictor variables ranges from 0.002 to 0.157"

# Variables that do not have missing data
soybean_vars <- colnames(Soybean)
missing_data_vars <- missing_data_df$variable

complete_vars <- setdiff(soybean_vars, missing_data_vars)
sprintf('Soybean variables with no missing data: %s', paste(complete_vars, collapse = ', '))

## [1] "Soybean variables with no missing data: Class, leaves"

Zero-variance variables

There were 3 variables (lead.mild, mycelium, and sclerotia) with near-zero variance. These variables are not likely to be informative for a machine learning classifier.

# Indexes of columns with near-zero variance
zvar_idx <- caret::nearZeroVar(Soybean)
sprintf('There are %d near-zero variance columns', length(zvar_idx))

## [1] "There are 3 near-zero variance columns"

names(Soybean)[zvar_idx]

## [1] "leaf.mild" "mycelium"  "sclerotia"

a. Frequency distribution of categorical predictors

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

The bar plots below show that predictor variables have 2 to 7 categories (levels); however, they are often not distributed evenly. Except for area.dam, most variables have a dominant category (ie, value with higher frequency).

The most extreme case, where all observations fall into a single category, is known as a degenerate distribution. Strictly speaking, none of the distributions meet this criterion; however, several are close. The variables with the most pronounced near-degenerate distribution are mycelium, sclerotia, and leaf.mild, which were also identified using the nearZeroVar() function from the caret package.

# Due to differences between factor and ordinal data types in the Soybean dataset, 
# I coerced categorical variables to be numeric in order to reshape and create facet plots
Soybean_numeric <- type.convert(Soybean, as.is = FALSE)

# Reshape data to long format for facet plot
Soybean_numeric_long <- Soybean_numeric %>%
  # Omit target variable
  dplyr::select(-c(Class)) %>%
  pivot_longer(cols = everything(), names_to = 'variable', values_to = 'category')

# Create faceted barplots
ggplot(Soybean_numeric_long, aes(x = category, fill = variable)) +
    geom_bar(na.rm = TRUE) +
    facet_wrap(~ variable, scales = 'free') +
    labs(
      x = 'Category', y = 'Count',
      title = 'Frequency distribution of Soybean variables'
  ) +
  theme_minimal() +
  theme(
    legend.position = 'none'
  )

b. Patterns in missing data

Are there particular predictors that are more likely to be missing?

Yes, some predictors are more likely to have missing data. In the Data checks section, I showed that the fraction of missing data varies from 0.2% to 15.7%. Here, I list the predictors with >5% missingness.

Of note, this subset includes the three near-degenerate predictors identified in part (a) (mycelium, sclerotia, and leaf.mild).

high_missingness_vars <- missing_data_df %>%
  filter(missing_frac > 0.05)

high_missingness_vars %>%
  kbl() %>%
  kable_styling() %>%
  scroll_box('300px')

variable	null_rows	missing_frac
hail	99	0.157
sever	99	0.157
seed.tmt	99	0.157
lodging	99	0.157
germ	91	0.144
leaf.mild	87	0.138
fruiting.bodies	85	0.135
fruit.spots	85	0.135
seed.discolor	85	0.135
shriveling	85	0.135
leaf.shread	79	0.125
seed	75	0.119
mold.growth	75	0.119
seed.size	75	0.119
fruit.pods	67	0.106
leaf.halo	63	0.100
leaf.marg	63	0.100
leaf.size	63	0.100
leaf.malf	63	0.100
precip	34	0.054
stem.cankers	34	0.054
canker.lesion	34	0.054
ext.decay	34	0.054
mycelium	34	0.054
int.discolor	34	0.054
sclerotia	34	0.054
plant.stand	32	0.051

Similarly, a visual representation of missing data in the Soybean dataset shows that there are systematic patterns of missing data and that missingness does not occur at random (ie, the gray areas corresponding to NA are not scattered).

vis_dat(Soybean)

Is the pattern of missing data related to the classes?

Yes, the pattern of missing data is related to the soybean disease classes. The first clue to this is that not all classes have missing data. Among the 5 classes with missing data, phytophthora-rot has the most missing observations, followed by 2-4-d-injury.

# Reshape soybean data to long format
# Note I reshaped the dataset in part (a), but this version includes Class
Soybean_numeric_long2 <- Soybean_numeric %>%
  pivot_longer(cols = -Class, names_to = 'variable', values_to = 'value')

# Count number of observations with missing data by category
Soybean_missing_by_class <- Soybean_numeric_long2 %>%
  group_by(Class, value) %>%
  count(name = 'total_obs') %>%
  ungroup() %>%
  filter(is.na(value)) %>%
  rename('missing_obs' = 'total_obs') %>%
  dplyr::select(Class, missing_obs) %>%
  arrange(desc(missing_obs))

Soybean_missing_by_class %>%
  kbl() %>%
  kable_styling()

Class	missing_obs
phytophthora-rot	897
2-4-d-injury	450
cyst-nematode	240
diaporthe-pod-&-stem-blight	166
herbicide-injury	160

sprintf('%d of %d classes in Soybean have missing data', 
        nrow(Soybean_missing_by_class), length(table(Soybean$Class)))

## [1] "5 of 19 classes in Soybean have missing data"

c. Handling missing data

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Since some predictors in the Soybean dataset are more likely than others to have missing data, missing completely at random (MCAR) is not likely. The systematic pattern of missingness as well as differences in missingness by Class suggest that missing at random (MAR) is more likely. This suggests that imputing missing data would be preferred to eliminating variables with missing data. However, I would remove the near-degenerate variables since they are not likely to be informative in a classification model.

Potential imputation methods include random forest models or multiple imputation by chained equations (MICE). These methods are preferred to simple imputation methods, such as using a summary statistic (eg, median) or k-nearest neighbors, because they generate multiple datasets with imputed values, which are more statistically robust (similar to the advantages of using k-fold cross-validation).

References

Kuhn M & Johnson K. (2013) Applied predictive modeling. DOI: 10.1007/978-1-4614-6849-3

Tuzen MF. (2025). Handling missing data in R: A comprehensive guide. https://www.r-bloggers.com/2025/08/handling-missing-data-in-r-a-comprehensive-guide/

Session Details

sessionInfo()

## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.2
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.4.0 latex2exp_0.9.8  visdat_0.6.0     caret_7.0-1     
##  [5] lattice_0.22-7   MASS_7.3-65      e1071_1.7-17     GGally_2.4.0    
##  [9] mlbench_2.1-7    lubridate_1.9.4  forcats_1.0.1    stringr_1.6.0   
## [13] dplyr_1.1.4      purrr_1.2.1      readr_2.1.6      tidyr_1.3.2     
## [17] tibble_3.3.1     ggplot2_4.0.1    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     viridisLite_0.4.2    timeDate_4052.112   
##  [4] farver_2.1.2         S7_0.2.1             fastmap_1.2.0       
##  [7] pROC_1.19.0.1        digest_0.6.39        rpart_4.1.24        
## [10] timechange_0.3.0     lifecycle_1.0.5      survival_3.8-6      
## [13] magrittr_2.0.4       compiler_4.5.2       rlang_1.1.7         
## [16] sass_0.4.10          tools_4.5.2          yaml_2.3.12         
## [19] data.table_1.18.2.1  knitr_1.51           labeling_0.4.3      
## [22] xml2_1.5.2           plyr_1.8.9           RColorBrewer_1.1-3  
## [25] withr_3.0.2          stats4_4.5.2         nnet_7.3-20         
## [28] grid_4.5.2           future_1.69.0        globals_0.18.0      
## [31] scales_1.4.0         iterators_1.0.14     cli_3.6.5           
## [34] rmarkdown_2.30       generics_0.1.4       otel_0.2.0          
## [37] rstudioapi_0.18.0    future.apply_1.20.1  reshape2_1.4.5      
## [40] tzdb_0.5.0           cachem_1.1.0         proxy_0.4-29        
## [43] splines_4.5.2        parallel_4.5.2       vctrs_0.7.1         
## [46] hardhat_1.4.2        Matrix_1.7-4         jsonlite_2.0.0      
## [49] hms_1.1.4            listenv_0.10.0       systemfonts_1.3.1   
## [52] foreach_1.5.2        gower_1.0.2          jquerylib_0.1.4     
## [55] recipes_1.3.1        glue_1.8.0           parallelly_1.46.1   
## [58] ggstats_0.12.0       codetools_0.2-20     stringi_1.8.7       
## [61] gtable_0.3.6         pillar_1.11.1        htmltools_0.5.9     
## [64] ipred_0.9-15         lava_1.8.2           R6_2.6.1            
## [67] textshaping_1.0.4    evaluate_1.0.5       bslib_0.10.0        
## [70] class_7.3-23         Rcpp_1.1.1           svglite_2.2.2       
## [73] nlme_3.1-168         prodlim_2025.04.28   xfun_0.56           
## [76] ModelMetrics_1.2.2.2 pkgconfig_2.0.3

DATA624 Homework 4

Alexander Simon

2026-03-01

Exercise 3.1

Data input

Data checks

Data types

Duplicate data

Missing data

Zero-variance variables

a. Explore predictor variables

Distributions

Relationships between predictors

b. Outliers

c. Transformations

Box-Cox transformations

Spatial sign transformations

Conclusions

Exercise 3.2

Data input

Data checks

Data types

Duplicate data

Missing data

Zero-variance variables

a. Frequency distribution of categorical predictors

b. Patterns in missing data

c. Handling missing data

References

Session Details