Tanzil_HW

library(mlbench)

## Warning: package 'mlbench' was built under R version 4.4.3

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.4.3

## Warning: package 'ggplot2' was built under R version 4.4.3

## Warning: package 'dplyr' was built under R version 4.4.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(caret)

## Warning: package 'caret' was built under R version 4.4.3

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(GGally)

## Warning: package 'GGally' was built under R version 4.4.3

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

3.1 (a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Histograms (Distribution of Each Predictor)

Glass %>%
  pivot_longer(-Type, names_to = "Predictor", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 20, fill = "steelblue", color = "white") +
  facet_wrap(~ Predictor, scales = "free") +
  theme_minimal()

Boxplots by Glass Type

Glass %>%
  pivot_longer(-Type, names_to = "Predictor", values_to = "Value") %>%
  ggplot(aes(x = Type, y = Value, fill = Type)) +
  geom_boxplot() +
  facet_wrap(~ Predictor, scales = "free") +
  theme_minimal() +
  theme(legend.position = "none")

Correlation Matrix (Relationships Between Predictors)

cor_mat <- cor(Glass[, -10])

corrplot::corrplot(cor_mat, method = "color", tl.cex = 0.7)

The histograms show that Na, Al, and Si are the closest to symmetric, with moderate spread, while RI and Ca have noticeable right tails. Ba and Fe are strongly right-skewed, with most values at zero and a few much larger ones. K is also strongly right-skewed: most values lie below about 0.6, but a few reach as high as 6.2. Mg is left-skewed, with a group of values at or near zero.

Boxplots of the predictors by glass type show that several predictors help separate the classes. For example, Ba distinguishes Type 7 glass from the other types. Mg also separates the classes well: its values are close to zero for Types 5, 6, and 7. Several predictors, including Ba, Fe, K, and Ca, show outliers.

The correlation matrix shows that most pairs of predictors are only weakly to moderately correlated. The main exception is a strong positive correlation between RI and Ca, close to 0.81. There is also a moderate negative correlation between RI and Si, close to -0.54.
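The correlation plot can be hard to read for exact values, so the strongest pairs can also be listed numerically. The following is a minimal sketch in base R; the object names `idx` and `cor_pairs` are illustrative and not part of the original analysis.

```r
library(mlbench)

data(Glass)

# Correlation matrix of the nine numeric predictors (Type is dropped)
cor_mat <- cor(Glass[, names(Glass) != "Type"])

# Keep each pair once by taking only the upper triangle
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)

# Sort by absolute correlation, strongest first
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
head(cor_pairs, 5)
```

Sorting by absolute value puts strong negative and strong positive relationships in the same ranking, which makes it easier to judge whether any pair is redundant enough to consider dropping one predictor.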

3.1 (b) Do there appear to be any outliers in the data? Are any predictors skewed?

Boxplots for Outliers

Glass %>%
  pivot_longer(-Type, names_to = "Predictor", values_to = "Value") %>%
  ggplot(aes(y = Value)) +
  geom_boxplot(fill = "tomato") +
  facet_wrap(~ Predictor, scales = "free") +
  theme_minimal()

Skewness Check

library(e1071)

## Warning: package 'e1071' was built under R version 4.4.3

apply(Glass[, -10], 2, skewness)

##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107

The skewness values confirm what the histograms and boxplots suggested: several predictors are skewed. K (6.46) and Ba (3.37) are the most extreme, indicating strong right skew and probably a few extreme outliers. Ca (2.02), Fe (1.73), and RI (1.60) are also clearly right-skewed.

Mg (-1.14) is left-skewed. Al (0.89) and Si (-0.72) are moderately skewed, and Na (0.45) is roughly symmetric.
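The boxplots flag outliers visually. As a rough numeric companion, the points beyond the whiskers can be counted per predictor. This sketch assumes the usual 1.5 x IQR rule, which is the default in base R's `boxplot.stats()`; `outlier_counts` is an illustrative name.

```r
library(mlbench)

data(Glass)
predictors <- Glass[, names(Glass) != "Type"]

# Count points outside the boxplot whiskers (1.5 x IQR rule) per predictor
outlier_counts <- sapply(predictors, function(x) length(boxplot.stats(x)$out))
sort(outlier_counts, decreasing = TRUE)
```

For a predictor whose lower and upper quartiles are both zero, the IQR is zero and this rule flags every non-zero value, so a large count there is better read as a sign of a zero-heavy distribution than as a set of genuine outliers.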

3.1 (c) Are there any relevant transformations of one or more predictors that might improve the classification model?

library(caret)
library(mlbench)

data(Glass)

# Apply Box-Cox to K
bc_K <- BoxCoxTrans(Glass$K)
bc_K

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1225  0.5550  0.4971  0.6100  6.2100 
## 
## Lambda could not be estimated; no transformation is applied

bc_K$lambda

## [1] NA

Log(K + 1) Transformation

par(mfrow = c(1,2))
hist(Glass$K, main = "Original K")
hist(log(Glass$K + 1), main = "Log(K + 1)")

The predictor K is highly right-skewed and contains zero values. Because the Box-Cox transformation requires strictly positive data, BoxCoxTrans() could not estimate lambda and applied no transformation. A log(K + 1) transformation was used instead.
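To see which predictors contain zeros, and would therefore run into the same Box-Cox limitation, the exact zeros can be counted per column. This is a small sketch; `zero_counts` and `zero_pct` are illustrative names.

```r
library(mlbench)

data(Glass)
predictors <- Glass[, names(Glass) != "Type"]

# Number and percentage of exact zeros in each predictor
zero_counts <- colSums(predictors == 0)
zero_pct <- round(100 * zero_counts / nrow(predictors), 1)

data.frame(zeros = zero_counts, percent = zero_pct)
```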

The log transformation compresses the long right tail, which reduces the skew and makes the distribution more symmetric.

Other predictors such as Ba and Fe, which also show strong skewness and many zero values, may benefit from similar log transformations.
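Whether the transformation actually helps can be checked by comparing skewness before and after. The sketch below assumes the same `skewness()` function from e1071 used earlier and applies `log1p()`, which computes log(x + 1); `skew_cols` and `skew_compare` are illustrative names.

```r
library(mlbench)
library(e1071)

data(Glass)

# Predictors with strong right skew and zero values
skew_cols <- c("K", "Ba", "Fe")

# Skewness on the original scale and after log(x + 1)
skew_compare <- data.frame(
  original = sapply(Glass[skew_cols], skewness),
  log1p    = sapply(Glass[skew_cols], function(x) skewness(log1p(x)))
)
round(skew_compare, 2)
```

If the transformed skewness is still large for a predictor that is mostly zeros, the log transformation alone may not be enough, because the pile of zeros stays in place regardless of how the tail is compressed.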

Additionally, since the predictors are measured on different scales, centering and scaling the predictors would likely improve the performance of distance-based classification models such as kNN and SVM.
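Centering and scaling can be sketched with caret's `preProcess()`. Here it is fitted on the full data purely for illustration; in a real modeling workflow the transformation would normally be estimated on the training set only and then applied to the test set. The names `pp` and `glass_scaled` are illustrative.

```r
library(mlbench)
library(caret)

data(Glass)
predictors <- Glass[, names(Glass) != "Type"]

# Estimate the centering and scaling parameters, then apply them
pp <- preProcess(predictors, method = c("center", "scale"))
glass_scaled <- predict(pp, predictors)

# After the transformation each column should have mean 0 and sd 1
round(colMeans(glass_scaled), 3)
round(apply(glass_scaled, 2, sd), 3)
```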

3.2 (a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate?

library(mlbench)
library(caret)
library(dplyr)

data(Soybean)

# Frequency tables for each predictor (exclude the outcome 'Class')
freq_list <- lapply(Soybean[, -1], table, useNA = "ifany")

# Example: view one predictor’s distribution
freq_list[["leaf.mild"]]

## 
##    0    1    2 <NA> 
##  535   20   20  108

# Identify near-zero variance predictors (degenerate distributions)
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv == TRUE, ]

Several categorical predictors have highly unbalanced level distributions. nearZeroVar() flags the near-zero variance predictors (e.g., leaf.mild, mycelium, sclerotia), in which one level dominates. For leaf.mild, level 0 occurs 535 times, while levels 1 and 2 occur only 20 times each. Such predictors are considered degenerate: they carry little predictive information and may add noise to a model.
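The two metrics behind that flag can be reproduced by hand for leaf.mild. With caret's default settings (`freqCut = 95/5` and `uniqueCut = 10`), a predictor is flagged when the frequency ratio exceeds 19 and the percentage of distinct values is below 10. The sketch below is a base R approximation of those two quantities, not caret's own code; `counts`, `freq_ratio`, and `pct_unique` are illustrative names.

```r
library(mlbench)

data(Soybean)

# Level counts for leaf.mild, ignoring missing values
counts <- sort(table(Soybean$leaf.mild), decreasing = TRUE)

# Frequency ratio: most common level count / second most common level count
freq_ratio <- unname(counts[1] / counts[2])

# Percentage of distinct non-missing values relative to the number of rows
pct_unique <- 100 * length(counts) / nrow(Soybean)

c(freq_ratio = freq_ratio, pct_unique = pct_unique)
```

Using the counts shown in the frequency table above, the ratio is 535 / 20 = 26.75, which is above the cutoff of 19.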

3.2 (b) Roughly 18% of the data are missing. Are particular predictors more likely to be missing? Is missingness related to the classes?

# Count missing values per variable
na_counts <- colSums(is.na(Soybean))
na_counts

##           Class            date     plant.stand          precip            temp 
##               0               1              36              38              30 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##             121              16               1             121             121 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##             112              16               0              84              84 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
##              84             100              84             108              16 
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##             121              38              38             106              38 
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
##              38              38              38              84             106 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##              92              92             106              92             106 
##           roots 
##              31

# Percent missing per variable
missing_pct <- na_counts / nrow(Soybean) * 100
round(missing_pct, 2)

##           Class            date     plant.stand          precip            temp 
##            0.00            0.15            5.27            5.56            4.39 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##           17.72            2.34            0.15           17.72           17.72 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##           16.40            2.34            0.00           12.30           12.30 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
##           12.30           14.64           12.30           15.81            2.34 
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##           17.72            5.56            5.56           15.52            5.56 
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
##            5.56            5.56            5.56           12.30           15.52 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##           13.47           13.47           15.52           13.47           15.52 
##           roots 
##            4.54

# Sort by most missing
sort(missing_pct, decreasing = TRUE)

##            hail           sever        seed.tmt         lodging            germ 
##      17.7159590      17.7159590      17.7159590      17.7159590      16.3982430 
##       leaf.mild fruiting.bodies     fruit.spots   seed.discolor      shriveling 
##      15.8125915      15.5197657      15.5197657      15.5197657      15.5197657 
##     leaf.shread            seed     mold.growth       seed.size       leaf.halo 
##      14.6412884      13.4699854      13.4699854      13.4699854      12.2986823 
##       leaf.marg       leaf.size       leaf.malf      fruit.pods          precip 
##      12.2986823      12.2986823      12.2986823      12.2986823       5.5636896 
##    stem.cankers   canker.lesion       ext.decay        mycelium    int.discolor 
##       5.5636896       5.5636896       5.5636896       5.5636896       5.5636896 
##       sclerotia     plant.stand           roots            temp       crop.hist 
##       5.5636896       5.2708638       4.5387994       4.3923865       2.3426061 
##    plant.growth            stem            date        area.dam           Class 
##       2.3426061       2.3426061       0.1464129       0.1464129       0.0000000 
##          leaves 
##       0.0000000

# Missingness by class (percent missing within each class)
Soybean %>%
  group_by(Class) %>%
  summarise(across(everything(), ~ mean(is.na(.)) * 100))

Missingness is not uniform across predictors. The highest rates, about 17.7% (121 of 683 rows), occur for hail, sever, seed.tmt, and lodging, in line with the roughly 18% cited in the question. At the other extreme, leaves has no missing values, and date and area.dam are each missing only one value.

When missingness is examined by Class, some diseases show very high (even 100%) missingness for certain predictors. This suggests the data are not missing completely at random and that missingness is related to the underlying disease class.
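The wide per-class table is hard to scan, so a more compact summary can help. The sketch below computes the share of missing predictor cells, the share of rows with at least one missing value, and the number of incomplete rows per class. All object names are illustrative.

```r
library(mlbench)

data(Soybean)
predictors <- Soybean[, names(Soybean) != "Class"]

# Share of all predictor cells that are missing
cell_pct <- 100 * mean(is.na(predictors))

# Share of rows with at least one missing predictor
row_has_na <- !complete.cases(predictors)
row_pct <- 100 * mean(row_has_na)

round(c(cell_pct = cell_pct, row_pct = row_pct), 2)

# Number of incomplete rows per class, keeping only classes that have any
incomplete_by_class <- table(Soybean$Class[row_has_na])
incomplete_by_class[incomplete_by_class > 0]
```

If the incomplete rows turn out to be concentrated in a handful of classes, that supports treating missingness as related to the class rather than as random noise.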

3.2 (c) Develop a strategy for handling missing data

Soybean_missing_level <- Soybean

# Add "Missing" as an explicit level for the predictors only (column 1 is Class)
for (j in 2:ncol(Soybean_missing_level)) {
  x <- Soybean_missing_level[[j]]

  # Convert to character, recode NA as "Missing", and rebuild the factor.
  # Note: factor() returns an unordered factor, so ordered predictors lose their ordering.
  x <- as.character(x)
  x[is.na(x)] <- "Missing"
  Soybean_missing_level[[j]] <- factor(x)
}

# Confirm no missing values remain in predictors
colSums(is.na(Soybean_missing_level[, -1]))

##            date     plant.stand          precip            temp            hail 
##               0               0               0               0               0 
##       crop.hist        area.dam           sever        seed.tmt            germ 
##               0               0               0               0               0 
##    plant.growth          leaves       leaf.halo       leaf.marg       leaf.size 
##               0               0               0               0               0 
##     leaf.shread       leaf.malf       leaf.mild            stem         lodging 
##               0               0               0               0               0 
##    stem.cankers   canker.lesion fruiting.bodies       ext.decay        mycelium 
##               0               0               0               0               0 
##    int.discolor       sclerotia      fruit.pods     fruit.spots            seed 
##               0               0               0               0               0 
##     mold.growth   seed.discolor       seed.size      shriveling           roots 
##               0               0               0               0               0

Because missingness is substantial for many predictors and appears related to the disease classes, deleting incomplete rows (listwise deletion) could remove important patterns and introduce bias. A practical alternative is to treat missing values as informative by adding a separate category called “Missing” to each predictor, which preserves any signal in the missingness mechanism. Predictors identified as near-zero variance may also be removed to reduce noise and improve model stability.
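One variation on the recoding step keeps the original level order and places "Missing" last, which can make tables and plots easier to read. This is a sketch, not the approach used above: `add_missing_level()` is a hypothetical helper, and the result is still a plain unordered factor.

```r
library(mlbench)

data(Soybean)

# Hypothetical helper: append "Missing" as the last level while keeping
# the original level order (the result is a plain, unordered factor)
add_missing_level <- function(x) {
  lv <- levels(x)
  x_chr <- as.character(x)
  x_chr[is.na(x_chr)] <- "Missing"
  factor(x_chr, levels = c(lv, "Missing"))
}

# Apply to every predictor; column 1 (Class) is left untouched
soy_recoded <- Soybean
soy_recoded[-1] <- lapply(Soybean[-1], add_missing_level)

sum(is.na(soy_recoded))
levels(soy_recoded$leaf.mild)
```

Predictors that had no missing values end up with an unused "Missing" level under this approach; `droplevels()` can remove it if that matters for the model.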

Tanzil_HW_4

Md. Tanzil Ehsan

`03/01/2026`
