03/01/2026
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.4.3
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Histograms (Distribution of Each Predictor)
Glass %>%
pivot_longer(-Type, names_to = "Predictor", values_to = "Value") %>%
ggplot(aes(x = Value)) +
geom_histogram(bins = 20, fill = "steelblue", color = "white") +
facet_wrap(~ Predictor, scales = "free") +
theme_minimal()
Boxplots by Glass Type
Glass %>%
pivot_longer(-Type, names_to = "Predictor", values_to = "Value") %>%
ggplot(aes(x = Type, y = Value, fill = Type)) +
geom_boxplot() +
facet_wrap(~ Predictor, scales = "free") +
theme_minimal() +
theme(legend.position = "none")
Correlation Matrix (Relationships Between Predictors)
cor_mat <- cor(Glass[, -10])
corrplot::corrplot(cor_mat, method = "color", tl.cex = 0.7)
The histograms show that Na and Si are roughly symmetric with moderate spread, while RI and Ca are approximately bell-shaped but carry a longer right tail. K, Ba, and Fe are strongly right-skewed: most Ba and Fe values are exactly zero, and K has a few very large values.
Boxplots of the predictor variables according to glass type show that several predictor variables can differentiate the classes. For example, Ba can differentiate Type 7 glasses from the other types. Similarly, Mg has a high degree of separation among the classes, especially Types 5, 6, and 7, in which the values are close to zero. Several predictor variables show the presence of outliers, such as Ba, Fe, K, and Ca.
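The class separation described above can also be checked numerically. The following is a minimal sketch, assuming the `mlbench` package is installed; the choice of Mg and Ba and the use of the median as the summary are illustrative, not part of the original analysis.

```r
library(mlbench)
data(Glass)

# Median Mg and Ba within each glass type; a type whose median is far from
# the others is one that the predictor can help separate.
type_medians <- aggregate(cbind(Mg, Ba) ~ Type, data = Glass, FUN = median)
type_medians
```

Medians are used rather than means so that the outliers noted in the boxplots do not dominate the comparison.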
The correlation matrix shows that most predictor pairs are only moderately related. A strong positive correlation exists between RI and Ca (about 0.81), while RI and Si show a moderate negative correlation (about -0.54).
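To read the strongest pairs off the matrix without relying on the plot, one option is to list every pair whose absolute correlation exceeds a cutoff. This is a sketch only; the 0.5 cutoff and the name `strong_pairs` are assumptions chosen for illustration.

```r
library(mlbench)
data(Glass)

cor_mat <- cor(Glass[, -10])

# Keep each pair once (upper triangle) and only if |r| exceeds the cutoff.
cutoff <- 0.5  # illustrative threshold, not taken from the analysis
idx <- which(upper.tri(cor_mat) & abs(cor_mat) > cutoff, arr.ind = TRUE)

strong_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)

# Sort from strongest to weakest relationship.
strong_pairs[order(-abs(strong_pairs$r)), ]
```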
Boxplots for Outliers
Glass %>%
pivot_longer(-Type, names_to = "Predictor", values_to = "Value") %>%
ggplot(aes(y = Value)) +
geom_boxplot(fill = "tomato") +
facet_wrap(~ Predictor, scales = "free") +
theme_minimal()
Skewness Check
library(e1071)
## Warning: package 'e1071' was built under R version 4.4.3
apply(Glass[, -10], 2, skewness)
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
The skewness values confirm what the histograms and boxplots suggested: several predictors are clearly skewed. K (6.46) and Ba (3.37) are the most extreme, indicating heavy right tails and likely strong outliers. Ca (2.02), Fe (1.73), and RI (1.60) are also strongly right-skewed.
Mg (-1.14) is strongly left-skewed, while Na (0.45) is roughly symmetric.
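For reference, the statistic being interpreted here can be written out in base R. The sketch below assumes that the default of `e1071::skewness()` is its `type = 3` estimator, b1 = m3 / s^3; the function name `sample_skewness` and the toy vectors are hypothetical and are not the Glass data.

```r
# Sample skewness, assumed to mirror e1071::skewness(x, type = 3):
# b1 = m3 / s^3 = g1 * ((n - 1) / n)^(3/2)
# Unlike e1071's default, this sketch silently drops NA values.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  d <- x - mean(x)                                # deviations from the mean
  g1 <- (sum(d^3) / n) / (sum(d^2) / n)^(3 / 2)   # moment-based skewness
  g1 * ((n - 1) / n)^(3 / 2)                      # rescale to use sample sd
}

symmetric  <- c(1, 2, 3, 4, 5)   # toy data: no skew expected
right_tail <- c(0, 0, 0, 0, 10)  # toy data: one large value, positive skew

sample_skewness(symmetric)
sample_skewness(right_tail)
```

A value near zero indicates a roughly symmetric distribution, a positive value a longer right tail, and a negative value a longer left tail.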
library(caret)
library(mlbench)
data(Glass)
# Apply Box-Cox to K
bc_K <- BoxCoxTrans(Glass$K)
bc_K
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1225 0.5550 0.4971 0.6100 6.2100
##
## Lambda could not be estimated; no transformation is applied
bc_K$lambda
## [1] NA
Log Transformation (Box-Cox Alternative)
par(mfrow = c(1,2))
hist(Glass$K, main = "Original K")
hist(log(Glass$K + 1), main = "Log(K + 1)")
The predictor K is highly right-skewed and contains zero values. Because the Box-Cox transformation requires strictly positive data, it could not be applied. As a result, a log(K + 1) transformation was used instead.
The log transformation reduces the right skew and stabilizes variance, making the distribution more symmetric.
Other predictors such as Ba and Fe, which also show strong skewness and many zero values, may benefit from similar log transformations.
Additionally, since the predictors are measured on different scales, centering and scaling them would likely improve the performance of scale-sensitive classification models such as kNN (which is distance-based) and SVM.
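The preprocessing suggested above can be sketched in a few lines of base R. This assumes `mlbench` is installed; applying `log1p()` to K, Ba, and Fe follows the text, while the object names `predictors` and `scaled` are illustrative.

```r
library(mlbench)
data(Glass)

predictors <- Glass[, -10]  # drop the outcome column, Type

# log1p(x) computes log(1 + x), so zeros map to zero instead of -Inf.
zero_heavy <- c("K", "Ba", "Fe")
predictors[zero_heavy] <- lapply(predictors[zero_heavy], log1p)

# Center each column to mean 0 and scale it to standard deviation 1.
scaled <- as.data.frame(scale(predictors))

round(colMeans(scaled), 10)  # column means after centering
apply(scaled, 2, sd)         # column standard deviations after scaling
```

In a modeling workflow the centering and scaling constants would normally be estimated on the training set only and then applied to the test set, so that no information leaks from the held-out data.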
library(mlbench)
library(caret)
library(dplyr)
data(Soybean)
# Frequency tables for each predictor (exclude the outcome 'Class')
freq_list <- lapply(Soybean[, -1], table, useNA = "ifany")
# Example: view one predictor’s distribution
freq_list[["leaf.mild"]]
##
## 0 1 2 <NA>
## 535 20 20 108
# Identify near-zero variance predictors (degenerate distributions)
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv == TRUE, ]
On inspection of the categorical predictors, some variables exhibit highly unbalanced level distributions. Using nearZeroVar(), we identify near-zero variance predictors (e.g., leaf.mild, mycelium, sclerotia). These predictors have very large frequency ratios, meaning one level dominates most observations. Such predictors are considered degenerate and may add little predictive information while increasing noise.
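To make the "frequency ratio" concrete, the sketch below recomputes it by hand for leaf.mild, rebuilding the variable from the counts printed in the frequency table above (535, 20, 20, and 108 missing). It assumes the default `nearZeroVar()` cutoffs of `freqCut = 95/5` and `uniqueCut = 10` and assumes that missing values are excluded from the counts; the variable names are illustrative.

```r
# Rebuild leaf.mild from the counts shown in the frequency table.
leaf_mild <- factor(rep(c("0", "1", "2", NA), times = c(535, 20, 20, 108)))

# table() drops NA by default, so only observed levels are counted.
counts <- sort(table(leaf_mild), decreasing = TRUE)

# Frequency ratio: most common level over the second most common level.
freq_ratio <- unname(counts[1] / counts[2])

# Percent of distinct observed values relative to the number of rows.
pct_unique <- 100 * length(counts) / length(leaf_mild)

# Assumed default cutoffs: freqCut = 95/5 and uniqueCut = 10.
flagged <- freq_ratio > 95 / 5 && pct_unique < 10

freq_ratio
pct_unique
flagged
```

Both conditions must hold for a predictor to be flagged: a dominant level alone is not enough if the variable still takes many distinct values.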
Soybean_missing_level <- Soybean
# Add "Missing" as a level for predictors ONLY
for (j in 2:ncol(Soybean_missing_level)) {
x <- Soybean_missing_level[[j]]
# Convert to character, recode NA as "Missing", and rebuild as a factor
# (note: this drops the ordering of any ordered factor)
x <- as.character(x)
x[is.na(x)] <- "Missing"
Soybean_missing_level[[j]] <- factor(x)
}
# Confirm no missing values remain in predictors
colSums(is.na(Soybean_missing_level[, -1]))
## date plant.stand precip temp hail
## 0 0 0 0 0
## crop.hist area.dam sever seed.tmt germ
## 0 0 0 0 0
## plant.growth leaves leaf.halo leaf.marg leaf.size
## 0 0 0 0 0
## leaf.shread leaf.malf leaf.mild stem lodging
## 0 0 0 0 0
## stem.cankers canker.lesion fruiting.bodies ext.decay mycelium
## 0 0 0 0 0
## int.discolor sclerotia fruit.pods fruit.spots seed
## 0 0 0 0 0
## mold.growth seed.discolor seed.size shriveling roots
## 0 0 0 0 0
Because a substantial portion of the data is missing and missingness appears related to the disease classes, deleting rows (listwise deletion) could remove important patterns and introduce bias. Therefore, a practical approach is to treat missing values as informative by adding a separate category called “Missing” to each predictor. This preserves potential signal in the missingness mechanism. Additionally, predictors identified as near-zero variance may be removed to reduce noise and improve model stability.
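The claim that missingness is tied to the disease classes can be checked directly. The following is a minimal sketch, assuming `mlbench` is installed; it computes, for each class, the share of rows with at least one missing predictor. The names `has_na` and `na_by_class` are illustrative.

```r
library(mlbench)
data(Soybean)

# TRUE for rows where at least one predictor (all columns except Class) is NA.
has_na <- !complete.cases(Soybean[, -1])

# Proportion of incomplete rows within each disease class.
na_by_class <- tapply(has_na, Soybean$Class, mean)

# Show only the classes that contain incomplete rows, largest share first.
sort(na_by_class[na_by_class > 0], decreasing = TRUE)
```

If the incomplete rows cluster in a handful of classes, listwise deletion would remove most or all of those classes, which is the bias concern raised above.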