Exercise 3.1 (a) Visualizing Distributions and Relationships

To understand the individual distributions of the nine predictors and the relationships between them, histograms and a correlation matrix are utilized.

library(mlbench)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data(Glass)

# Visualizing Distributions (Histograms)
Glass %>%
  pivot_longer(cols = RI:Fe, names_to = "Predictor", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  facet_wrap(~ Predictor, scales = "free") +
  labs(title = "Distributions of Glass Predictors")

# Visualizing Relationships (Correlation Matrix)
cor_matrix <- cor(Glass[, 1:9])
corrplot(cor_matrix, method = "color", type = "upper",
         addCoef.col = "black", number.cex = 0.7)

(b) Outliers and Skewness

Based on the visualizations and statistical summaries, the following observations are made regarding the data quality:

Skewness: Several predictors are strongly skewed. K (skewness 6.46), Ba (3.37), Ca (2.02), Fe (1.73), and RI (1.60) are right-skewed. For Ba and Fe this is because many samples have values at or near zero, while for K it is driven by a few very large values. Mg is left-skewed (-1.14), and Na, Al, and Si show milder skew.

Outliers: The histograms reveal distinct outliers in nearly all predictors, particularly in K, Ba, and RI. For instance, K has a few samples with values far above the bulk of the distribution.
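Boxplots make outliers easier to spot than histograms. The following is a minimal sketch, reusing the long-format reshaping from part (a); the names glass_long, p_box, and outlier_counts are introduced here for illustration only, and the 1.5 * IQR whisker rule is just one common convention for flagging outliers.

```r
library(mlbench)
library(tidyverse)
data(Glass)

# Reshape to long format so each predictor gets its own facet
glass_long <- Glass %>%
  pivot_longer(cols = RI:Fe, names_to = "Predictor", values_to = "Value")

# One boxplot per predictor, each on its own y scale
p_box <- ggplot(glass_long, aes(x = "", y = Value)) +
  geom_boxplot(fill = "steelblue", outlier.colour = "firebrick") +
  facet_wrap(~ Predictor, scales = "free") +
  labs(title = "Boxplots of Glass Predictors", x = NULL)
p_box

# Count points beyond the boxplot whiskers (1.5 * IQR rule) per predictor
outlier_counts <- sapply(Glass[, 1:9], function(x) length(boxplot.stats(x)$out))
outlier_counts
```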

# Check skewness numerically
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.3
apply(Glass[, 1:9], 2, skewness)
##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107

(c) Relevant Transformations

To improve a classification model, specific transformations should be applied to address the issues identified in part (b):

Box-Cox Transformation: This is effective for resolving skewness in strictly positive predictors such as RI, Ca, and Al. Because it requires strictly positive values, predictors that contain zeros (Mg, K, Ba, and Fe) need a prior shift or a transformation that accepts zeros, such as Yeo-Johnson.

Centering and Scaling: Since predictors like Si (values ~70) and Fe (values ~0.1) are on vastly different scales, centering and scaling are necessary for many models (e.g., SVM or KNN) to ensure all features contribute equally.

Spatial Sign Transformation: Because significant outliers are present, the spatial sign transformation can be applied after centering and scaling. It projects each sample onto the unit sphere, which limits the influence of extreme values.

Principal Component Analysis (PCA): The correlation plot shows a strong relationship between RI and Ca. PCA can be used to reduce this redundancy and address collinearity.
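The redundancy argument can be checked numerically. The sketch below, which uses base R's prcomp() rather than caret, reports the RI/Ca correlation and the cumulative share of variance captured by the principal components of the standardized predictors; glass_pca and var_explained are illustrative names.

```r
library(mlbench)
data(Glass)

# Pairwise correlation between the two predictors singled out in the text
cor(Glass$RI, Glass$Ca)

# PCA on centered and scaled predictors
glass_pca <- prcomp(Glass[, 1:9], center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component, then cumulative
var_explained <- glass_pca$sdev^2 / sum(glass_pca$sdev^2)
round(cumsum(var_explained), 3)
```

If a handful of components account for most of the variance, the remaining ones can be dropped with little loss of information.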

# Example of applying pre-processing using caret
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
## Warning: package 'lattice' was built under R version 4.3.3
glass_prep <- preProcess(Glass[, 1:9], 
                         method = c("BoxCox", "center", "scale", "spatialSign"))
glass_transformed <- predict(glass_prep, Glass[, 1:9])

# glass_transformed
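Since Box-Cox is undefined for predictors that contain zeros, it may help to see which columns are affected and to try a transformation that accepts zeros. This is a minimal sketch, assuming caret's "YeoJohnson" method as the substitute (not the method used above); it also checks that the spatial sign step leaves every row with unit length. The names predictors, glass_prep_yj, glass_yj, and row_norms are illustrative.

```r
library(mlbench)
library(caret)
data(Glass)

predictors <- Glass[, 1:9]

# Count exact zeros per predictor; Box-Cox cannot be applied to these columns
zero_counts <- colSums(predictors == 0)
zero_counts

# Yeo-Johnson accepts zero values, so it can be estimated for every column
glass_prep_yj <- preProcess(predictors,
                            method = c("YeoJohnson", "center", "scale", "spatialSign"))
glass_yj <- predict(glass_prep_yj, predictors)

# After the spatial sign step each row should have Euclidean length 1
row_norms <- sqrt(rowSums(glass_yj^2))
range(row_norms)
```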

3. Exercise 3.2: Soybean Missing Value Analysis

Investigating missing data patterns.

library(mlbench)
library(tidyverse)
library(naniar) # Helpful for missing data visualization
## Warning: package 'naniar' was built under R version 4.3.3
data(Soybean)

# Identify the number of missing values per class
Soybean |>
  group_by(Class) |>
  summarise(
    total_obs = n(),
    missing_values = sum(is.na(across(everything()))),
    pct_missing = (missing_values / (total_obs * ncol(Soybean))) * 100
  ) |>
  arrange(desc(pct_missing))
## # A tibble: 19 × 4
##    Class                       total_obs missing_values pct_missing
##    <fct>                           <int>          <int>       <dbl>
##  1 2-4-d-injury                       16            450        78.1
##  2 cyst-nematode                      14            336        66.7
##  3 herbicide-injury                    8            160        55.6
##  4 phytophthora-rot                   88           1214        38.3
##  5 diaporthe-pod-&-stem-blight        15            177        32.8
##  6 alternarialeaf-spot                91              0         0  
##  7 anthracnose                        44              0         0  
##  8 bacterial-blight                   20              0         0  
##  9 bacterial-pustule                  20              0         0  
## 10 brown-spot                         92              0         0  
## 11 brown-stem-rot                     44              0         0  
## 12 charcoal-rot                       20              0         0  
## 13 diaporthe-stem-canker              20              0         0  
## 14 downy-mildew                       20              0         0  
## 15 frog-eye-leaf-spot                 91              0         0  
## 16 phyllosticta-leaf-spot             20              0         0  
## 17 powdery-mildew                     20              0         0  
## 18 purple-seed-stain                  20              0         0  
## 19 rhizoctonia-root-rot               20              0         0
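The naniar package loaded above can show the same pattern graphically. A minimal sketch, assuming naniar's gg_miss_fct() and miss_var_summary() helpers; p_miss and miss_summary are illustrative names, and the data are reloaded so the example runs on its own.

```r
library(mlbench)
library(naniar)
data(Soybean)

# Heatmap of the percentage of missing values in each predictor, split by class
p_miss <- gg_miss_fct(x = Soybean, fct = Class)
p_miss

# Tabular summary: number and percentage of missing values per variable
miss_summary <- miss_var_summary(Soybean)
miss_summary
```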

Summary of Findings for Soybean Data

Class-Specific Missingness: All missing values fall in five of the 19 classes: 2-4-d-injury, cyst-nematode, herbicide-injury, phytophthora-rot, and diaporthe-pod-&-stem-blight. The other 14 classes have none.

Implication: Deleting incomplete rows would eliminate most or all examples of these specific diseases from the dataset, making the model unable to recognize them.

Resolution Strategy: For this dataset, using imputation (like knnImpute) or treating “missing” as its own categorical level is more effective than removing data.
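Treating "missing" as its own level can be sketched in base R as follows. The helper add_missing_level() and the object soy_explicit are hypothetical names introduced for this example, and the ordering of ordered factors is deliberately dropped to keep the sketch simple.

```r
library(mlbench)
data(Soybean)

# Hypothetical helper: recode NA as an explicit "missing" level.
# The result is an unordered factor, even if the input was ordered.
add_missing_level <- function(x) {
  values <- as.character(x)
  values[is.na(values)] <- "missing"
  factor(values, levels = c(levels(x), "missing"))
}

# Apply the helper to every predictor, leaving Class untouched
soy_explicit <- Soybean
predictor_names <- setdiff(names(soy_explicit), "Class")
soy_explicit[predictor_names] <- lapply(soy_explicit[predictor_names], add_missing_level)

# No NA values should remain; hail now has a "missing" category
sum(is.na(soy_explicit))
table(soy_explicit$hail)
```

This keeps every observation and lets a model use the missingness pattern itself as information, which suits this dataset because missingness is tied to the class.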

(a) Investigate Frequency Distributions for Categorical Predictors

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

A categorical predictor is a variable with categories (e.g., Gender, Yes/No, Type A/B/C).

A distribution is degenerate when it causes modeling problems. Common issues:

Zero variance predictor: Only one category appears, so the variable provides no predictive value.

Near-zero variance predictor: One category dominates heavily, and the model may struggle to learn from the rare categories.

Very sparse categories: Some categories appear only a few times, which can cause unstable estimates.
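The two diagnostics used by caret's nearZeroVar() can be reproduced by hand for a single predictor. A minimal sketch for mycelium, assuming the usual definitions (count of the most common value divided by the count of the second most common, and the number of distinct values as a percentage of rows); counts, freq_ratio, and pct_unique are illustrative names.

```r
library(mlbench)
data(Soybean)

# Frequency table of the observed (non-missing) values, most common first
counts <- sort(table(Soybean$mycelium), decreasing = TRUE)
counts

# Frequency ratio: most common count over second most common count
freq_ratio <- as.numeric(counts[1]) / as.numeric(counts[2])

# Percent unique: distinct observed values relative to the number of rows
pct_unique <- 100 * length(counts) / nrow(Soybean)

c(freqRatio = freq_ratio, percentUnique = pct_unique)
```

With caret's default cutoffs, a predictor is flagged when the frequency ratio exceeds 95/5 = 19 and the percentage of unique values is below 10.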

Look at Frequency Tables

table(Soybean$date)
## 
##   0   1   2   3   4   5   6 
##  26  75  93 118 131 149  90
# for all predictors:
#lapply(Soybean, table)

Check for Zero / Near-Zero Variance (Using caret)

library(caret)

nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv
##                  freqRatio percentUnique zeroVar   nzv
## Class             1.010989     2.7818448   FALSE FALSE
## date              1.137405     1.0248902   FALSE FALSE
## plant.stand       1.208191     0.2928258   FALSE FALSE
## precip            4.098214     0.4392387   FALSE FALSE
## temp              1.879397     0.4392387   FALSE FALSE
## hail              3.425197     0.2928258   FALSE FALSE
## crop.hist         1.004587     0.5856515   FALSE FALSE
## area.dam          1.213904     0.5856515   FALSE FALSE
## sever             1.651282     0.4392387   FALSE FALSE
## seed.tmt          1.373874     0.4392387   FALSE FALSE
## germ              1.103627     0.4392387   FALSE FALSE
## plant.growth      1.951327     0.2928258   FALSE FALSE
## leaves            7.870130     0.2928258   FALSE FALSE
## leaf.halo         1.547511     0.4392387   FALSE FALSE
## leaf.marg         1.615385     0.4392387   FALSE FALSE
## leaf.size         1.479638     0.4392387   FALSE FALSE
## leaf.shread       5.072917     0.2928258   FALSE FALSE
## leaf.malf        12.311111     0.2928258   FALSE FALSE
## leaf.mild        26.750000     0.4392387   FALSE  TRUE
## stem              1.253378     0.2928258   FALSE FALSE
## lodging          12.380952     0.2928258   FALSE FALSE
## stem.cankers      1.984293     0.5856515   FALSE FALSE
## canker.lesion     1.807910     0.5856515   FALSE FALSE
## fruiting.bodies   4.548077     0.2928258   FALSE FALSE
## ext.decay         3.681481     0.4392387   FALSE FALSE
## mycelium        106.500000     0.2928258   FALSE  TRUE
## int.discolor     13.204545     0.4392387   FALSE FALSE
## sclerotia        31.250000     0.2928258   FALSE  TRUE
## fruit.pods        3.130769     0.5856515   FALSE FALSE
## fruit.spots       3.450000     0.5856515   FALSE FALSE
## seed              4.139130     0.2928258   FALSE FALSE
## mold.growth       7.820896     0.2928258   FALSE FALSE
## seed.discolor     8.015625     0.2928258   FALSE FALSE
## seed.size         9.016949     0.2928258   FALSE FALSE
## shriveling       14.184211     0.2928258   FALSE FALSE
## roots             6.406977     0.4392387   FALSE FALSE

Conclusion

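Frequency distributions were examined using frequency tables and near-zero variance diagnostics. No predictor has zero variance. Three predictors are flagged as near-zero variance: leaf.mild (frequency ratio 26.75), mycelium (106.5), and sclerotia (31.25), each dominated by a single category. Several others, such as shriveling (14.2), int.discolor (13.2), lodging (12.4), and leaf.malf (12.3), are imbalanced but fall below caret's default frequency-ratio cutoff of 95/5 = 19. The three flagged predictors are therefore degenerate in the near-zero variance sense, although no predictor is completely uninformative.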
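The predictors marked TRUE in the nzv column can be listed directly and their frequency tables inspected. A minimal sketch, rerunning nearZeroVar() with its default cutoffs on a fresh copy of the data; flagged is an illustrative name.

```r
library(mlbench)
library(caret)
data(Soybean)

nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)

# Names of the predictors flagged as near-zero variance
flagged <- rownames(nzv)[nzv$nzv]
flagged

# Frequency tables for the flagged predictors, including missing values
lapply(Soybean[flagged], table, useNA = "ifany")
```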

(b) Investigate Missing Data (18%)

Overall Percentage of Missing Cells

mean(is.na(Soybean)) * 100
## [1] 9.504636
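mean(is.na(Soybean)) gives the share of missing cells. The share of observations with at least one missing value is a different quantity, and the two should not be confused. A minimal sketch of that check using base R's complete.cases(); cell_pct, row_pct, and incomplete_by_class are illustrative names.

```r
library(mlbench)
data(Soybean)

# Percentage of individual cells that are missing
cell_pct <- mean(is.na(Soybean)) * 100

# Percentage of observations (rows) with at least one missing value
row_pct <- mean(!complete.cases(Soybean)) * 100

c(cells = cell_pct, rows = row_pct)

# Number of incomplete rows within each class
incomplete_by_class <- tapply(!complete.cases(Soybean), Soybean$Class, sum)
incomplete_by_class[incomplete_by_class > 0]
```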

Missing Percentage Per Predictor

colMeans(is.na(Soybean)) * 100
##           Class            date     plant.stand          precip            temp 
##       0.0000000       0.1464129       5.2708638       5.5636896       4.3923865 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##      17.7159590       2.3426061       0.1464129      17.7159590      17.7159590 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##      16.3982430       2.3426061       0.0000000      12.2986823      12.2986823 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
##      12.2986823      14.6412884      12.2986823      15.8125915       2.3426061 
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##      17.7159590       5.5636896       5.5636896      15.5197657       5.5636896 
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
##       5.5636896       5.5636896       5.5636896      12.2986823      15.5197657 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##      13.4699854      13.4699854      15.5197657      13.4699854      15.5197657 
##           roots 
##       4.5387994
# Sorting them
#sort(colMeans(is.na(Soybean)) * 100, decreasing = TRUE)

This shows which predictors have the most missing values.

hail, sever, seed.tmt, and lodging are each missing in about 17.7% of observations, while Class and leaves have no missing values.

Is Missingness Related to Class?

Create a missing-value indicator for hail, one of the predictors with the highest missing rate.

Soybean$hail_missing <- ifelse(is.na(Soybean$hail), 1, 0)

table(Soybean$hail_missing, Soybean$Class)
##    
##     2-4-d-injury alternarialeaf-spot anthracnose bacterial-blight
##   0            0                  91          44               20
##   1           16                   0           0                0
##    
##     bacterial-pustule brown-spot brown-stem-rot charcoal-rot cyst-nematode
##   0                20         92             44           20             0
##   1                 0          0              0            0            14
##    
##     diaporthe-pod-&-stem-blight diaporthe-stem-canker downy-mildew
##   0                           0                    20           20
##   1                          15                     0            0
##    
##     frog-eye-leaf-spot herbicide-injury phyllosticta-leaf-spot phytophthora-rot
##   0                 91                0                     20               20
##   1                  0                8                      0               68
##    
##     powdery-mildew purple-seed-stain rhizoctonia-root-rot
##   0             20                20                   20
##   1              0                 0                    0

Test the association statistically

chisq.test(table(Soybean$hail_missing, Soybean$Class))
## Warning in chisq.test(table(Soybean$hail_missing, Soybean$Class)): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  table(Soybean$hail_missing, Soybean$Class)
## X-squared = 576.98, df = 18, p-value < 2.2e-16
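The warning arises because many cells in the table have small expected counts, which makes the chi-squared approximation unreliable. One option is a Monte Carlo p-value, which does not rely on that approximation. A minimal sketch using the simulate.p.value argument of chisq.test(); the seed, B = 2000, and the object names are arbitrary choices for illustration.

```r
library(mlbench)
data(Soybean)

# Build the indicator locally so the Soybean data frame is left unchanged
hail_is_missing <- is.na(Soybean$hail)
tab <- table(hail_is_missing, Soybean$Class)

# Smallest expected count; small values explain the warning
min(suppressWarnings(chisq.test(tab))$expected)

# Monte Carlo p-value based on simulated tables with the same margins
set.seed(123)
sim_test <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
sim_test$p.value
```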

Conclusion

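Missingness Overview: About 9.5% of all cells are missing. The roughly 18% figure in the exercise matches the missing rate of the most affected predictors (hail, sever, seed.tmt, and lodging, each 17.7%), and missingness is highly uneven across predictors.

Variation Across Predictors: Per-predictor missing rates range from 0% (Class, leaves) to 17.7%, so some predictors are far more affected than others, which could impact model stability.

Informative Missingness: The table() and chi-squared results indicate that missing values are heavily concentrated in specific disease classes. The hail predictor is missing for every observation of 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury, for 68 of the 88 phytophthora-rot observations, and for none of the other classes.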

Modelling Impact: Since missingness is associated with the response class, the data is not “Missing Completely at Random” (MCAR). Deleting these observations would systematically remove nearly all examples of certain diseases, making imputation the necessary next step.

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Elimination Strategy (Filtering)

# Load the necessary library
library(caret)

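# Drop the hail_missing helper column created in part (b) so it is not
# treated as a predictor
soy_data <- Soybean[, names(Soybean) != "hail_missing"]

# Identify and remove near-zero variance (NZV) predictors
# These carry little information and can destabilize some models
nzv_metrics <- nearZeroVar(soy_data, saveMetrics = TRUE)
soy_filtered <- soy_data[, !nzv_metrics$nzv]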

# Preparing for Imputation
# Since most predictors are factors, they must be converted to dummy variables 
# to allow for numerical imputation methods like KNN
soy_dummy_model <- dummyVars(Class ~ ., data = soy_filtered)
soy_numeric <- predict(soy_dummy_model, newdata = soy_filtered)
## Warning in model.frame.default(Terms, newdata, na.action = na.action, xlev =
## object$lvls): variable 'Class' is not a factor
# Apply KNN Imputation
# This estimates missing values based on the 5 most similar observations
# knnImpute also centers and scales the data automatically
soy_preproc <- preProcess(soy_numeric, method = "knnImpute")
soy_final <- predict(soy_preproc, soy_numeric)

# Verify no missing values remain
sum(is.na(soy_final))
## [1] 0

The data is now suitable for predictive modeling, because the issues identified during exploration have been addressed.

Near-Zero Variance Filtering: Predictors with little to no variation were removed to prevent numerical errors and reduce model complexity.

Dummy Variable Encoding: The categorical predictors were converted into a numerical format, which is a requirement for distance-based imputation methods like KNN.

KNN Imputation: Instead of discarding observations, missing values were estimated using the most similar neighboring samples. This was critical because the missingness was associated with specific disease classes, and deletion would have biased the model.

Standardization: The data was automatically centered and scaled during the imputation process, ensuring that all predictors are on a comparable scale for the classification algorithm.
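The centering and scaling side effect of knnImpute can be seen on a small example. The sketch below uses a synthetic two-column data frame (toy, with made-up means and standard deviations) rather than the Soybean data, so the numbers are purely illustrative. It assumes caret's preProcess() with method = "knnImpute" and k = 5, as in the code above.

```r
library(caret)

# Synthetic data on very different scales, with two values removed
set.seed(1)
toy <- data.frame(
  x1 = rnorm(50, mean = 10, sd = 2),
  x2 = rnorm(50, mean = 100, sd = 20)
)
toy$x2[c(3, 17)] <- NA

# knnImpute estimates the missing values from the 5 nearest complete rows
toy_prep <- preProcess(toy, method = "knnImpute", k = 5)
toy_imputed <- predict(toy_prep, toy)

# No missing values remain, and the output is on a standardized scale
sum(is.na(toy_imputed))
colMeans(toy_imputed)
apply(toy_imputed, 2, sd)
```

Because the result is on the standardized scale, any later interpretation of predictor values has to account for that change of units.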