Histograms and a correlation matrix are used to examine the individual distributions of the nine predictors and the relationships between them.
library(mlbench)
library(tidyverse)
library(corrplot)
data(Glass)
# Visualizing Distributions (Histograms)
Glass %>%
  pivot_longer(cols = RI:Fe, names_to = "Predictor", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  facet_wrap(~ Predictor, scales = "free") +
  labs(title = "Distributions of Glass Predictors")
# Visualizing Relationships (Correlation Matrix)
cor_matrix <- cor(Glass[, 1:9])
corrplot(cor_matrix, method = "color", type = "upper",
         addCoef.col = "black", number.cex = 0.7)
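The strongest relationships in the correlation plot can also be listed numerically. The following is a minimal sketch, assuming the same Glass data; the names `cor_pairs`, `var1`, and `var2` are illustrative and not part of the original analysis.

```r
library(mlbench)
data(Glass)

cor_matrix <- cor(Glass[, 1:9])

# Keep each pair once by blanking the lower triangle and the diagonal
cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- NA

# Reshape the matrix into one row per predictor pair
cor_pairs <- as.data.frame(as.table(cor_matrix))
names(cor_pairs) <- c("var1", "var2", "correlation")
cor_pairs <- cor_pairs[!is.na(cor_pairs$correlation), ]

# Order by absolute correlation, strongest first
cor_pairs <- cor_pairs[order(-abs(cor_pairs$correlation)), ]
head(cor_pairs, 5)
```

Sorting by absolute value keeps strong negative relationships in view alongside strong positive ones.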
Based on the visualizations and statistical summaries, the following observations are made regarding the data quality:
Skewness: Several predictors are strongly skewed. Based on the skewness statistics computed below, K (Potassium, 6.46), Ba (Barium, 3.37), and Fe (Iron, 1.73) are heavily right-skewed because many samples have values at or near zero. Ca (Calcium, 2.02) and RI (Refractive Index, 1.60) are also clearly right-skewed, while Mg (Magnesium, -1.14) is left-skewed.
Outliers: The histograms reveal extreme values in nearly all predictors, particularly in K, Ba, and RI. For instance, K has a few samples far above the bulk of the data, which lies close to zero.
# Check skewness numerically
library(e1071)
apply(Glass[, 1:9], 2, skewness)
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
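The outlier observation can be quantified in a similar way. This sketch counts, for each predictor, the points that fall beyond the boxplot whiskers (the usual 1.5 × IQR rule); the helper `count_outliers` is a hypothetical name introduced here for illustration.

```r
library(mlbench)
data(Glass)

# Number of points beyond the boxplot whiskers (1.5 * IQR rule)
count_outliers <- function(x) length(boxplot.stats(x)$out)

outlier_counts <- sapply(Glass[, 1:9], count_outliers)
sort(outlier_counts, decreasing = TRUE)
```

For predictors where most values are exactly zero, the IQR can collapse to zero, so every nonzero value is counted as an outlier. Those counts are best read as a sign of a zero-inflated distribution rather than of measurement problems.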
To improve a classification model, specific transformations should be applied to address the issues identified in part (b):
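Box-Cox Transformation: This is effective for reducing skewness in predictors like RI, Ca, and Al. Note that it requires strictly positive values, so predictors that contain zeros (Mg, K, Ba, and Fe) would need a prior shift or a different transformation such as Yeo-Johnson. When "BoxCox" is requested, caret's preProcess leaves predictors with non-positive values untransformed.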
Centering and Scaling: Since predictors like Si (values ~70) and Fe (values ~0.1) are on vastly different scales, centering and scaling are necessary for many models (e.g., SVM or KNN) to ensure all features contribute equally.
Spatial Sign Transformation: Because significant outliers are present, the spatial sign transformation can be used to project predictor values onto a sphere, effectively minimizing the impact of extreme values.
Principal Component Analysis (PCA): The correlation plot shows a strong relationship between RI and Ca. PCA can be used to reduce this redundancy and address collinearity.
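Because Box-Cox needs strictly positive inputs, it helps to check which predictors contain zeros before choosing a transformation. The sketch below does that and then applies the Yeo-Johnson transformation, which accepts zeros, as one possible alternative. It is an illustration under the assumption that Yeo-Johnson is acceptable for this problem, not the approach used in the original analysis; `has_nonpositive`, `yj_prep`, and `glass_yj` are illustrative names.

```r
library(mlbench)
library(caret)
data(Glass)

predictors <- Glass[, 1:9]

# Flag predictors with any value <= 0 (Box-Cox cannot handle these)
has_nonpositive <- sapply(predictors, function(x) any(x <= 0))
names(has_nonpositive)[has_nonpositive]

# Yeo-Johnson accepts zero and negative values
yj_prep <- preProcess(predictors, method = c("YeoJohnson", "center", "scale"))
glass_yj <- predict(yj_prep, predictors)
```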
# Example of applying pre-processing using caret
library(caret)
glass_prep <- preProcess(Glass[, 1:9],
                         method = c("BoxCox", "center", "scale", "spatialSign"))
glass_transformed <- predict(glass_prep, Glass[, 1:9])
# glass_transformed
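One way to confirm what the spatial sign step does is to check the length of each transformed observation. After the transformation, every row should lie on the unit sphere, so its Euclidean norm should equal 1. This sketch repeats the pipeline above so that it runs on its own; `row_norms` is an illustrative name.

```r
library(mlbench)
library(caret)
data(Glass)

glass_prep <- preProcess(Glass[, 1:9],
                         method = c("BoxCox", "center", "scale", "spatialSign"))
glass_transformed <- predict(glass_prep, Glass[, 1:9])

# Euclidean length of each transformed observation
row_norms <- sqrt(rowSums(as.matrix(glass_transformed)^2))
summary(row_norms)
```

Because every sample is rescaled to the same length, a sample with one extreme value can no longer sit far from the rest of the data.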
Investigating missing data patterns.
library(mlbench)
library(tidyverse)
library(naniar) # Helpful for missing data visualization
data(Soybean)
# Identify the number of missing values per class
Soybean |>
  group_by(Class) |>
  summarise(
    total_obs = n(),
    missing_values = sum(is.na(across(everything()))),
    pct_missing = (missing_values / (total_obs * ncol(Soybean))) * 100
  ) |>
  arrange(desc(pct_missing))
## # A tibble: 19 × 4
## Class total_obs missing_values pct_missing
## <fct> <int> <int> <dbl>
## 1 2-4-d-injury 16 450 78.1
## 2 cyst-nematode 14 336 66.7
## 3 herbicide-injury 8 160 55.6
## 4 phytophthora-rot 88 1214 38.3
## 5 diaporthe-pod-&-stem-blight 15 177 32.8
## 6 alternarialeaf-spot 91 0 0
## 7 anthracnose 44 0 0
## 8 bacterial-blight 20 0 0
## 9 bacterial-pustule 20 0 0
## 10 brown-spot 92 0 0
## 11 brown-stem-rot 44 0 0
## 12 charcoal-rot 20 0 0
## 13 diaporthe-stem-canker 20 0 0
## 14 downy-mildew 20 0 0
## 15 frog-eye-leaf-spot 91 0 0
## 16 phyllosticta-leaf-spot 20 0 0
## 17 powdery-mildew 20 0 0
## 18 purple-seed-stain 20 0 0
## 19 rhizoctonia-root-rot 20 0 0
Summary of Findings for Soybean Data
Class-Specific Missingness: All missing values fall in five disease classes: 2-4-d-injury, cyst-nematode, herbicide-injury, phytophthora-rot, and diaporthe-pod-&-stem-blight. The other 14 classes have none.
Implication: Deleting incomplete rows would remove most or all examples of these diseases from the dataset, leaving the model unable to recognize them.
Resolution Strategy: For this dataset, imputation (such as knnImpute) or treating “missing” as its own categorical level is preferable to removing data.
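The second option, treating “missing” as its own level, can be sketched in base R with addNA(). This is a minimal illustration; the helper `add_missing_level` and the object `soy_explicit` are hypothetical names, and the level label "missing" is an arbitrary choice.

```r
library(mlbench)
data(Soybean)

# Turn NA into an explicit factor level called "missing"
add_missing_level <- function(x) {
  x <- addNA(x, ifany = TRUE)               # adds an NA level only if NAs exist
  levels(x)[is.na(levels(x))] <- "missing"  # give that level a readable name
  x
}

soy_explicit <- Soybean
soy_explicit[-1] <- lapply(Soybean[-1], add_missing_level)  # skip Class

sum(is.na(soy_explicit))
table(soy_explicit$hail)
```

Some Soybean predictors are ordered factors, and this approach places "missing" as their highest level. That is harmless for tree-based models but may be misleading for methods that use the ordering.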
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
A categorical predictor is a variable with categories (e.g., Gender, Yes/No, Type A/B/C).
A distribution is degenerate when it causes modeling problems. Common issues:
Zero variance predictor: Only one category appears, so the variable provides no predictive value.
Near-zero variance predictor: One category dominates heavily, so the model may struggle to learn from the rare categories.
Very sparse categories: Some categories appear only a few times, which can cause unstable estimates.
Look at Frequency Tables
table(Soybean$date)
##
## 0 1 2 3 4 5 6
## 26 75 93 118 131 149 90
# for all predictors:
#lapply(Soybean, table)
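Rather than reading 35 frequency tables one by one, the tables can be condensed into a short summary per predictor. The sketch below reports the number of levels, the share of the most common level, and the count of the rarest level; `level_summary` and its column names are illustrative.

```r
library(mlbench)
data(Soybean)

# One row per predictor; Class (column 1) is excluded
level_summary <- t(sapply(Soybean[-1], function(x) {
  counts <- table(x)  # NA values are dropped by default
  c(n_levels     = length(counts),
    top_share    = max(counts) / sum(counts),
    rarest_count = min(counts))
}))

# Most imbalanced predictors first
level_summary <- level_summary[order(-level_summary[, "top_share"]), ]
head(round(level_summary, 3))
```

A top_share close to 1 points to a near-zero variance predictor, and a very small rarest_count points to a sparse category.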
Check for Zero / Near-Zero Variance (Using caret)
library(caret)
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv
## freqRatio percentUnique zeroVar nzv
## Class 1.010989 2.7818448 FALSE FALSE
## date 1.137405 1.0248902 FALSE FALSE
## plant.stand 1.208191 0.2928258 FALSE FALSE
## precip 4.098214 0.4392387 FALSE FALSE
## temp 1.879397 0.4392387 FALSE FALSE
## hail 3.425197 0.2928258 FALSE FALSE
## crop.hist 1.004587 0.5856515 FALSE FALSE
## area.dam 1.213904 0.5856515 FALSE FALSE
## sever 1.651282 0.4392387 FALSE FALSE
## seed.tmt 1.373874 0.4392387 FALSE FALSE
## germ 1.103627 0.4392387 FALSE FALSE
## plant.growth 1.951327 0.2928258 FALSE FALSE
## leaves 7.870130 0.2928258 FALSE FALSE
## leaf.halo 1.547511 0.4392387 FALSE FALSE
## leaf.marg 1.615385 0.4392387 FALSE FALSE
## leaf.size 1.479638 0.4392387 FALSE FALSE
## leaf.shread 5.072917 0.2928258 FALSE FALSE
## leaf.malf 12.311111 0.2928258 FALSE FALSE
## leaf.mild 26.750000 0.4392387 FALSE TRUE
## stem 1.253378 0.2928258 FALSE FALSE
## lodging 12.380952 0.2928258 FALSE FALSE
## stem.cankers 1.984293 0.5856515 FALSE FALSE
## canker.lesion 1.807910 0.5856515 FALSE FALSE
## fruiting.bodies 4.548077 0.2928258 FALSE FALSE
## ext.decay 3.681481 0.4392387 FALSE FALSE
## mycelium 106.500000 0.2928258 FALSE TRUE
## int.discolor 13.204545 0.4392387 FALSE FALSE
## sclerotia 31.250000 0.2928258 FALSE TRUE
## fruit.pods 3.130769 0.5856515 FALSE FALSE
## fruit.spots 3.450000 0.5856515 FALSE FALSE
## seed 4.139130 0.2928258 FALSE FALSE
## mold.growth 7.820896 0.2928258 FALSE FALSE
## seed.discolor 8.015625 0.2928258 FALSE FALSE
## seed.size 9.016949 0.2928258 FALSE FALSE
## shriveling 14.184211 0.2928258 FALSE FALSE
## roots 6.406977 0.4392387 FALSE FALSE
Conclusion
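Frequency distributions were examined using frequency tables and near-zero variance diagnostics. No predictor has zero variance. However, nearZeroVar flags three predictors as near-zero variance: leaf.mild (freqRatio 26.75), mycelium (106.50), and sclerotia (31.25). Several others, such as shriveling (14.18), int.discolor (13.20), lodging (12.38), and leaf.malf (12.31), are noticeably imbalanced but fall below the default cutoff. Therefore, no categorical predictor is completely degenerate, but these three are candidates for removal.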
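The flagged predictors can be pulled out of the metrics table directly instead of being read off the printout. This sketch also re-derives the flag from caret's default cutoffs (freqCut = 95/5 and uniqueCut = 10); `flagged` and `manual_flag` are illustrative names.

```r
library(mlbench)
library(caret)
data(Soybean)

nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)

# Rows where the nzv flag is TRUE
flagged <- nzv[nzv$nzv, c("freqRatio", "percentUnique")]
flagged

# Default rule: frequency ratio above 95/5 and at most 10% unique values
manual_flag <- nzv$freqRatio > 95 / 5 & nzv$percentUnique <= 10
all(manual_flag == nzv$nzv)
```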
Confirm Overall Missing Percentage
mean(is.na(Soybean)) * 100
## [1] 9.504636
Missing Percentage Per Predictor
colMeans(is.na(Soybean)) * 100
## Class date plant.stand precip temp
## 0.0000000 0.1464129 5.2708638 5.5636896 4.3923865
## hail crop.hist area.dam sever seed.tmt
## 17.7159590 2.3426061 0.1464129 17.7159590 17.7159590
## germ plant.growth leaves leaf.halo leaf.marg
## 16.3982430 2.3426061 0.0000000 12.2986823 12.2986823
## leaf.size leaf.shread leaf.malf leaf.mild stem
## 12.2986823 14.6412884 12.2986823 15.8125915 2.3426061
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 17.7159590 5.5636896 5.5636896 15.5197657 5.5636896
## mycelium int.discolor sclerotia fruit.pods fruit.spots
## 5.5636896 5.5636896 5.5636896 12.2986823 15.5197657
## seed mold.growth seed.discolor seed.size shriveling
## 13.4699854 13.4699854 15.5197657 13.4699854 15.5197657
## roots
## 4.5387994
# Sorting them
#sort(colMeans(is.na(Soybean)) * 100, decreasing = TRUE)
This shows which predictors have the most missing values.
hail, sever, seed.tmt, and lodging have the highest rate (about 17.7%), while Class and leaves have none.
Is Missingness Related to Class?
Create missing indicators for highly missing variables.
Soybean$hail_missing <- ifelse(is.na(Soybean$hail), 1, 0)
table(Soybean$hail_missing, Soybean$Class)
##
## 2-4-d-injury alternarialeaf-spot anthracnose bacterial-blight
## 0 0 91 44 20
## 1 16 0 0 0
##
## bacterial-pustule brown-spot brown-stem-rot charcoal-rot cyst-nematode
## 0 20 92 44 20 0
## 1 0 0 0 0 14
##
## diaporthe-pod-&-stem-blight diaporthe-stem-canker downy-mildew
## 0 0 20 20
## 1 15 0 0
##
## frog-eye-leaf-spot herbicide-injury phyllosticta-leaf-spot phytophthora-rot
## 0 91 0 20 20
## 1 0 8 0 68
##
## powdery-mildew purple-seed-stain rhizoctonia-root-rot
## 0 20 20 20
## 1 0 0 0
Test the Association Statistically
chisq.test(table(Soybean$hail_missing, Soybean$Class))
## Warning in chisq.test(table(Soybean$hail_missing, Soybean$Class)): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: table(Soybean$hail_missing, Soybean$Class)
## X-squared = 576.98, df = 18, p-value < 2.2e-16
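The warning arises because many cells of the table have small expected counts, which makes the chi-squared approximation unreliable. One hedged alternative is to let chisq.test() estimate the p-value by Monte Carlo simulation. In this sketch the seed and the number of replicates B are arbitrary choices, and a standalone vector `hail_missing` is used so that the block runs on its own.

```r
library(mlbench)
data(Soybean)

hail_missing <- is.na(Soybean$hail)
tab <- table(hail_missing, Soybean$Class)

set.seed(1)  # arbitrary seed, for reproducibility only
sim_test <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
sim_test$p.value
```

With B replicates the smallest p-value the simulation can report is 1 / (B + 1), so the result should be read as "at most" that value rather than as an exact figure.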
Conclusion
Missingness Overview: Overall missingness is about 9.5%, but it is spread unevenly across predictors, ranging from 0% to about 17.7%.
Variation Across Predictors: The per-predictor percentages show that some predictors (hail, sever, seed.tmt, and lodging) are far more prone to missing values than others, which could impact model stability.
Informative Missingness: The table() and Chi-squared results indicate that missing hail values are confined to five disease classes (2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury, and phytophthora-rot). The test's warning reflects small expected counts in some cells, but the pattern in the table is unambiguous.
Modelling Impact: Since missingness is associated with the response class, the data are not “Missing Completely at Random” (MCAR). Deleting these observations would systematically remove most or all examples of certain diseases, so imputation is the more appropriate next step.
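The effect of deleting incomplete rows can be checked directly by counting how many rows of each class would survive. This is a minimal sketch; `deletion_impact` and its column names are illustrative.

```r
library(mlbench)
data(Soybean)

# TRUE for rows with no missing values in any column
is_complete <- complete.cases(Soybean)

# Rows per class before and after listwise deletion
class_sizes <- table(Soybean$Class)
survivors <- tapply(is_complete, Soybean$Class, sum)

deletion_impact <- data.frame(
  total = as.integer(class_sizes),
  kept  = as.integer(survivors),
  row.names = names(class_sizes)
)

# Show only the classes that would lose rows
deletion_impact[deletion_impact$kept < deletion_impact$total, ]
```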
Elimination Strategy (Filtering)
# Load the necessary library
library(caret)
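# Drop the helper indicator created earlier so it is not treated as a predictor
soy_predictors <- Soybean[, names(Soybean) != "hail_missing"]
# Identify and remove near-zero variance (NZV) predictors
# These carry little information and can destabilize some models
nzv_metrics <- nearZeroVar(soy_predictors, saveMetrics = TRUE)
soy_filtered <- soy_predictors[, !nzv_metrics$nzv]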
# Preparing for Imputation
# Since most predictors are factors, they must be converted to dummy variables
# to allow for numerical imputation methods like KNN
soy_dummy_model <- dummyVars(Class ~ ., data = soy_filtered)
soy_numeric <- predict(soy_dummy_model, newdata = soy_filtered)
## Warning in model.frame.default(Terms, newdata, na.action = na.action, xlev =
## object$lvls): variable 'Class' is not a factor
# Apply KNN Imputation
# This estimates missing values based on the 5 most similar observations
# knnImpute also centers and scales the data automatically
soy_preproc <- preProcess(soy_numeric, method = "knnImpute")
soy_final <- predict(soy_preproc, soy_numeric)
# Verify no missing values remain
sum(is.na(soy_final))
## [1] 0
The data is now suitable for predictive modeling because the issues identified during exploration have been addressed:
Near-Zero Variance Filtering: Predictors with little to no variation were removed to prevent numerical errors and reduce model complexity.
Dummy Variable Encoding: The categorical predictors were converted into a numerical format, which is a requirement for distance-based imputation methods like KNN.
KNN Imputation: Instead of discarding observations, missing values were estimated using the most similar neighboring samples. This was critical because the missingness was associated with specific disease classes, and deletion would have biased the model.
Standardization: The data was automatically centered and scaled during the imputation process, ensuring that all predictors are on a comparable scale for the classification algorithm.