library(mlbench)
## Warning: package 'mlbench' was built under R version 4.4.3
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.3
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

3.1 (a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Histograms (Distribution of Each Predictor)

Glass %>%
  pivot_longer(-Type, names_to = "Predictor", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 20, fill = "steelblue", color = "white") +
  facet_wrap(~ Predictor, scales = "free") +
  theme_minimal()

Boxplots by Glass Type

Glass %>%
  pivot_longer(-Type, names_to = "Predictor", values_to = "Value") %>%
  ggplot(aes(x = Type, y = Value, fill = Type)) +
  geom_boxplot() +
  facet_wrap(~ Predictor, scales = "free") +
  theme_minimal() +
  theme(legend.position = "none")

Correlation Matrix (Relationships Between Predictors)

cor_mat <- cor(Glass[, -10])

corrplot::corrplot(cor_mat, method = "color", tl.cex = 0.7)

The histograms show that Na and Si are roughly symmetric with a moderate spread, while RI, Al, and Ca have a visible right tail. Mg is left-skewed, with a cluster of values near zero. K, Ba, and Fe are strongly right-skewed: most Ba and Fe values are exactly zero, K is concentrated at low values, and each of the three has a few very high observations.

The boxplots by glass type show that several predictors help differentiate the classes. For example, Ba separates Type 7 glass from the other types. Mg also separates the classes well, with values close to zero for Types 5, 6, and 7. Several predictors show outliers, notably Ba, Fe, K, and Ca.

The correlation matrix shows mostly moderate relationships among the predictors. The strongest is a positive correlation between RI and Ca (about 0.81). RI and Si have a moderate negative correlation (about -0.54).
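To read the strongest relationships as numbers rather than from the color scale alone, the correlated pairs can be listed directly. The following is a minimal base R sketch; the helper name `strong_pairs` and the 0.5 cutoff are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: list predictor pairs whose absolute correlation
# meets a cutoff, strongest first. Assumes all columns are numeric.
strong_pairs <- function(df, cutoff = 0.5) {
  cm <- cor(df)
  # Keep only the upper triangle so each pair appears once
  idx <- which(abs(cm) >= cutoff & upper.tri(cm), arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    row.names = NULL
  )
  out[order(-abs(out$r)), ]
}

# Usage on the Glass predictors (runs only if mlbench is installed)
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("Glass", package = "mlbench")
  print(strong_pairs(Glass[, -10], cutoff = 0.5))
}
```

The cutoff is a judgment call; lowering it lists more pairs, raising it keeps only the strongest ones.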


3.1 (b) Do there appear to be any outliers in the data? Are any predictors skewed?

Boxplots for Outliers

Glass %>%
  pivot_longer(-Type, names_to = "Predictor", values_to = "Value") %>%
  ggplot(aes(y = Value)) +
  geom_boxplot(fill = "tomato") +
  facet_wrap(~ Predictor, scales = "free") +
  theme_minimal()

Skewness Check

library(e1071)
## Warning: package 'e1071' was built under R version 4.4.3
apply(Glass[, -10], 2, skewness)
##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107

The skewness values confirm what the histograms and boxplots suggested. K (6.46) and Ba (3.37) are the most extreme, indicating heavy right tails and likely strong outliers in these two predictors. Ca (2.02), Fe (1.73), and RI (1.60) are also clearly right-skewed.

Mg has a skewness of -1.14, indicating a pronounced left skew. Al (0.89) and Si (-0.72) are moderately skewed, and Na (0.45) is roughly symmetric.
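For reference, the statistic behind these numbers is the standardized third central moment. Below is a minimal base R sketch of the plain moment estimator; the function name `sample_skewness` is hypothetical, and `e1071::skewness()` uses a slightly different estimator by default (controlled by its `type` argument), so its values can differ a little from this version.

```r
# Plain moment estimator of skewness: m3 / m2^(3/2),
# where m2 and m3 are the second and third central moments.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

symmetric  <- c(1, 2, 3, 4, 5)    # balanced around the mean
right_tail <- c(0, 0, 0, 0, 10)   # one large value, like Ba or K

sample_skewness(symmetric)    # 0
sample_skewness(right_tail)   # 1.5
```

Positive values indicate a longer right tail and negative values a longer left tail.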

3.1 (c) Are there any relevant transformations of one or more predictors that might improve the classification model?

library(caret)
library(mlbench)

data(Glass)

# Apply Box-Cox to K
bc_K <- BoxCoxTrans(Glass$K)
bc_K
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1225  0.5550  0.4971  0.6100  6.2100 
## 
## Lambda could not be estimated; no transformation is applied
bc_K$lambda
## [1] NA

Log(K + 1) transformation (alternative to Box-Cox)

par(mfrow = c(1,2))
hist(Glass$K, main = "Original K")
hist(log(Glass$K + 1), main = "Log(K + 1)")

The predictor K is highly right-skewed and contains zero values. Because the Box-Cox transformation requires strictly positive data, it could not be applied. As a result, a log(K + 1) transformation was used instead.

The log transformation reduces the right skew and stabilizes variance, making the distribution more symmetric.

Ba and Fe are also strongly right-skewed with many zero values and may benefit from the same log(x + 1) transformation, although no monotone transformation can spread out the large share of observations that sit exactly at zero.

Additionally, since the predictors are measured on different scales, centering and scaling the predictors would likely improve the performance of distance-based classification models such as kNN and SVM.
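The two steps described above (log(x + 1) on the skewed predictors, then centering and scaling everything) can be sketched in base R as follows. The helper name `log1p_then_scale` and the choice of K, Ba, and Fe as the columns to transform are assumptions made for illustration.

```r
# Hypothetical helper: apply log(x + 1) to selected columns,
# then center and scale every column to mean 0 and sd 1.
log1p_then_scale <- function(df, skewed) {
  stopifnot(all(skewed %in% names(df)))
  df[skewed] <- lapply(df[skewed], log1p)   # log1p(x) is log(1 + x)
  as.data.frame(scale(df))                  # center = TRUE, scale = TRUE
}

# Usage on the Glass predictors (runs only if mlbench is installed)
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("Glass", package = "mlbench")
  glass_t <- log1p_then_scale(Glass[, -10], skewed = c("K", "Ba", "Fe"))
  print(round(sapply(glass_t, sd), 3))      # each predictor now has sd 1
}
```

In a modeling workflow, the centering and scaling constants would normally be estimated on the training set only and then applied to new data, which is what `caret::preProcess()` with `method = c("center", "scale")` is designed for.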

3.2 (a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate?

library(mlbench)
library(caret)
library(dplyr)

data(Soybean)

# Frequency tables for each predictor (exclude the outcome 'Class')
freq_list <- lapply(Soybean[, -1], table, useNA = "ifany")

# Example: view one predictor’s distribution
freq_list[["leaf.mild"]]
## 
##    0    1    2 <NA> 
##  535   20   20  108
# Identify near-zero variance predictors (degenerate distributions)
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ]

Inspection of the categorical predictors shows that some have highly unbalanced level distributions. nearZeroVar() flags near-zero variance predictors (e.g., leaf.mild, mycelium, sclerotia). These predictors have very large frequency ratios, meaning a single level dominates: for leaf.mild, level 0 occurs 535 times while the next most common level occurs only 20 times. Such predictors are considered degenerate and may add little predictive information while increasing noise.
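To make the frequency-ratio criterion concrete, the sketch below recomputes it in base R for the leaf.mild counts shown above. The function name `nzv_metrics` is hypothetical; the cutoffs mirror the documented defaults of `nearZeroVar()` (`freqCut = 95/5`, `uniqueCut = 10`), but the handling of missing values here is an assumption and may not match caret exactly.

```r
# Hypothetical re-implementation of the two near-zero variance metrics:
#   freq_ratio     = count of most common level / count of second most common
#   percent_unique = 100 * number of distinct values / number of observations
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)   # table() drops NA by default
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  percent_unique <- 100 * length(counts) / length(x)
  list(
    freq_ratio     = freq_ratio,
    percent_unique = percent_unique,
    nzv            = freq_ratio > freq_cut && percent_unique <= unique_cut
  )
}

# Rebuild leaf.mild from the frequency table above: 535, 20, 20, and 108 NA
leaf_mild <- rep(c("0", "1", "2", NA), times = c(535, 20, 20, 108))
m <- nzv_metrics(leaf_mild)
m$freq_ratio   # 26.75, above the cutoff of 19
m$nzv          # TRUE
```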

3.2 (c) Develop a strategy for handling missing data

Soybean_missing_level <- Soybean

# Add "Missing" as a level for predictors ONLY
for (j in 2:ncol(Soybean_missing_level)) {
  x <- Soybean_missing_level[[j]]
  
  # Recode NA as an explicit "Missing" level; rebuilding with factor()
  # returns a plain factor, so any ordering of an ordered factor is dropped
  x <- as.character(x)
  x[is.na(x)] <- "Missing"
  Soybean_missing_level[[j]] <- factor(x)
}

# Confirm no missing values remain in predictors
colSums(is.na(Soybean_missing_level[, -1]))
##            date     plant.stand          precip            temp            hail 
##               0               0               0               0               0 
##       crop.hist        area.dam           sever        seed.tmt            germ 
##               0               0               0               0               0 
##    plant.growth          leaves       leaf.halo       leaf.marg       leaf.size 
##               0               0               0               0               0 
##     leaf.shread       leaf.malf       leaf.mild            stem         lodging 
##               0               0               0               0               0 
##    stem.cankers   canker.lesion fruiting.bodies       ext.decay        mycelium 
##               0               0               0               0               0 
##    int.discolor       sclerotia      fruit.pods     fruit.spots            seed 
##               0               0               0               0               0 
##     mold.growth   seed.discolor       seed.size      shriveling           roots 
##               0               0               0               0               0

Because a substantial portion of the data is missing and the missingness appears to be related to the disease classes, deleting incomplete rows (listwise deletion) could remove important patterns and introduce bias. A practical alternative is to treat missing values as informative by adding a separate "Missing" category to each predictor, which preserves any signal carried by the missingness itself. In addition, predictors identified as near-zero variance may be removed to reduce noise and improve model stability.
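The premise that missingness is tied to the disease classes can be checked by computing, for each class, the share of rows with at least one missing predictor. This is a minimal base R sketch; the helper name `missing_by_class` is chosen for illustration and is not part of the original analysis.

```r
# Hypothetical helper: proportion of rows per class that have
# at least one missing predictor value.
missing_by_class <- function(df, class_col = "Class") {
  predictors <- df[, setdiff(names(df), class_col), drop = FALSE]
  incomplete <- !complete.cases(predictors)
  tapply(incomplete, df[[class_col]], mean)
}

# Usage on Soybean (runs only if mlbench is installed)
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("Soybean", package = "mlbench")
  print(round(sort(missing_by_class(Soybean), decreasing = TRUE), 2))
}
```

If the proportions differ sharply between classes, dropping incomplete rows would remove some classes far more than others, which supports keeping the rows and encoding missingness explicitly.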