Data Pre-Processing

library(ggplot2)
library(mlbench)
library(e1071)
library(tidyverse)
library(corrplot)
library(caret)

#Question 3.1 ##Exercise a

data(Glass)

#separate predictors from RI in dataset
predictors <- Glass[, 1:9]

#Response Indicator
response <- Glass$Type

glass_long <- Glass %>%
  select(-Type) %>% 
  pivot_longer(cols = everything(), 
               names_to = "Predictor", 
               values_to = "Value")

# Plot all predictors as histograms in one faceted figure
ggplot(glass_long, aes(x = Value)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  facet_wrap(~ Predictor, scales = "free") +
  labs(title = "Distribution of Glass Predictors",
       x = NULL, y = "Count") +
  theme_minimal()

Roughly symmetric / approximately normal:

Al — fairly bell-shaped, centered around 1–1.5 Na — reasonably symmetric, centered around 13–14 Si — roughly symmetric, centered around 72–73 RI — roughly symmetric but with a slight right tail

Right-skewed (long tail to the right):

Ba — heavily skewed, the vast majority of values are at or near 0 with very few large values Fe — same pattern as Ba, most values clustered at 0 K — most values near 0, with a few observations stretching out to ~6 Ca — moderate right skew, bulk between 7–10 with a tail to ~15

Bimodal:

Mg — this is the most interesting one; there are two distinct peaks (one near 0 and one around 3.5), suggesting two different subgroups in the data

cor_matrix <- cor(Glass[, -10])

corrplot(cor_matrix, 
         method = "color",      
         type = "upper",         
         addCoef.col = "black",  
         tl.col = "black",    
         tl.srt = 45, 
         title = "Correlation Matrix of Glass Predictors",
         mar = c(0,0,1,0))

The correlation matrix reveals several notable relationships between predictors. The strongest positive correlation exists between RI and Ca (0.810), where increasing calcium content raises the refractive index — a relationship that makes chemical sense. Moderate positive correlations are also observed between Al and Ba (0.479), and Al and K (0.326). On the negative side, RI and Si (-0.542) show a strong inverse relationship, as silicon reduces refractive index. Similarly, Mg is negatively correlated with both Al (-0.482) and Ba (-0.492), suggesting these elements tend not to co-occur in the same glass samples. Conversely, RI and Ba show virtually no linear relationship. The color-coded matrix further highlights that many relationships are moderate rather than extreme, with the majority of predictor pairs showing weak correlations close to zero.

##Exercise b

ggplot(glass_long, aes(x = Predictor, y = Value)) +
  geom_boxplot(fill = "steelblue", alpha = 0.5) +
  facet_wrap(~ Predictor, scales = "free") +
  labs(title = "Boxplots of Glass Predictors",
       x = NULL, y = "Value") +
  theme_minimal()

skew_values <- apply(predictors, 2, skewness)
print(round(skew_values, 2))

##    RI    Na    Mg    Al    Si     K    Ca    Ba    Fe 
##  1.60  0.45 -1.14  0.89 -0.72  6.46  2.02  3.37  1.73

Outliers: The boxplots reveal outliers across several predictors. Ca shows a cluster of high-value outliers extending to roughly 15, while K has one extreme outlier near 6, which is well beyond the rest of the data. Ba and Fe both show multiple outliers in their upper tails, considering the majority of their values being clustered near zero. RI and Al show mild outliers at the extremes, while Na and Si are relatively stable.

Skewness: The skewness values confirm what the histograms and boxplots suggested. K (6.46) is the most severely skewed predictor, followed by Ba (3.37), Ca (2.02), Fe (1.73), and RI (1.60), all being highly right-skewed (above 1). Al (0.89) is moderately right-skewed. Mg (-1.14) and Si (-0.72) are negatively skewed, with Mg being so due to its bimodal distribution. Na (0.45) is the only predictor that is approximately symmetric.

##Exercise c

# Apply Box-Cox transformation to predictors
preprocessParams <- preProcess(predictors, method = c("BoxCox"))
print(preprocessParams)

## Created from 214 samples and 5 variables
## 
## Pre-processing:
##   - Box-Cox transformation (5)
##   - ignored (0)
## 
## Lambda estimates for Box-Cox transformation:
## -2, -0.1, 0.5, 2, -1.1

# Apply the transformation
transformed <- predict(preprocessParams, predictors)

# Compare skewness before and after
skew_before <- apply(predictors, 2, skewness)
skew_after <- apply(transformed, 2, skewness)

skew_comparison <- data.frame(
  Before = round(skew_before, 2),
  After = round(skew_after, 2)
)
print(skew_comparison)

##    Before After
## RI   1.60  1.57
## Na   0.45  0.03
## Mg  -1.14 -1.14
## Al   0.89  0.09
## Si  -0.72 -0.65
## K    6.46  6.46
## Ca   2.02 -0.19
## Ba   3.37  3.37
## Fe   1.73  1.73

transformed_long <- transformed %>%
  pivot_longer(cols = everything(),
               names_to = "Predictor",
               values_to = "Value")

ggplot(transformed_long, aes(x = Value)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  facet_wrap(~ Predictor, scales = "free") +
  labs(title = "Distribution of Glass Predictors After Box-Cox Transformation",
       x = NULL, y = "Count") +
  theme_minimal()

The Box-Cox transformation was applied to all nine predictors. The transformation meaningfully improved symmetry for Al (0.89 → 0.09), Na (0.45 → 0.03), and Ca (2.02 → -0.19). However, Ba, Fe, and K showed no improvement, as their distributions are dominated by zero or near-zero values and a single extreme outlier respectively — cases where Box-Cox is ineffective. Mg remains bimodal and resistant to transformation. For these predictors, alternative strategies such as spatial sign transformation or treating zero-heavy predictors as near-zero variance candidates for removal may be more appropriate.

#Question 3.2 ##Exercise a

data(Soybean)

# Check for degenerate predictors
nzv <- nearZeroVar(Soybean[, -1], saveMetrics = TRUE)
print(nzv[nzv$nzv == TRUE, ])

##           freqRatio percentUnique zeroVar  nzv
## leaf.mild     26.75     0.4392387   FALSE TRUE
## mycelium     106.50     0.2928258   FALSE TRUE
## sclerotia     31.25     0.2928258   FALSE TRUE

# Plot only the degenerate predictors
degenerate <- rownames(nzv[nzv$nzv == TRUE, ])

soybean_long <- Soybean %>%
  select(-Class) %>%
  mutate(across(everything(), as.character)) %>%
  pivot_longer(cols = everything(),
               names_to = "Predictor",
               values_to = "Value")

soybean_long %>%
  filter(Predictor %in% degenerate) %>%
  ggplot(aes(x = Value)) +
  geom_bar(fill = "steelblue") +
  facet_wrap(~ Predictor, scales = "free") +
  labs(title = "Degenerate Predictors in Soybean Data",
       x = NULL, y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Three predictors were identified as having degenerate distributions: leaf.mild, mycelium, and sclerotia. As shown in the bar plots, each is heavily dominated by a single category (0), with very few observations in the remaining categories. This is confirmed numerically by the frequency ratios — mycelium has the most extreme ratio of 106.50, meaning the most common value occurs over 106 times more frequently than the second most common value. sclerotia (31.25) and leaf.mild (26.75) follow the same pattern. All three have very low percent unique values (below 0.5%), indicating minimal variation across the 683 observations. While none are strictly zero variance (zeroVar = FALSE), all three are flagged as near-zero variance (nzv = TRUE), meaning they are unlikely to contribute meaningful information to a classification model and are candidates for removal during pre-processing.

##Exercise b

# Calculate missing values per predictor
missing_by_predictor <- Soybean %>%
  select(-Class) %>%
  summarise(across(everything(), ~ mean(is.na(.)) * 100)) %>%
  pivot_longer(cols = everything(),
               names_to = "Predictor",
               values_to = "Missing_Pct") %>%
  filter(Missing_Pct > 0) %>%
  arrange(desc(Missing_Pct))

print(missing_by_predictor)

## # A tibble: 34 × 2
##    Predictor       Missing_Pct
##    <chr>                 <dbl>
##  1 hail                   17.7
##  2 sever                  17.7
##  3 seed.tmt               17.7
##  4 lodging                17.7
##  5 germ                   16.4
##  6 leaf.mild              15.8
##  7 fruiting.bodies        15.5
##  8 fruit.spots            15.5
##  9 seed.discolor          15.5
## 10 shriveling             15.5
## # ℹ 24 more rows

# Plot missing data by predictor
ggplot(missing_by_predictor, aes(x = reorder(Predictor, Missing_Pct), y = Missing_Pct)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Percentage of Missing Values by Predictor",
       x = NULL, y = "Missing (%)") +
  theme_minimal()

missing_by_class <- Soybean %>%
  mutate(across(-Class, as.character)) %>%
  group_by(Class) %>%
  summarise(Missing_Pct = mean(is.na(unlist(cur_data()))) * 100) %>%
  arrange(desc(Missing_Pct))

## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `Missing_Pct = mean(is.na(unlist(cur_data()))) * 100`.
## ℹ In group 1: `Class = 2-4-d-injury`.
## Caused by warning:
## ! `cur_data()` was deprecated in dplyr 1.1.0.
## ℹ Please use `pick()` instead.

print(missing_by_class)

## # A tibble: 19 × 2
##    Class                       Missing_Pct
##    <fct>                             <dbl>
##  1 2-4-d-injury                       80.4
##  2 cyst-nematode                      68.6
##  3 herbicide-injury                   57.1
##  4 phytophthora-rot                   39.4
##  5 diaporthe-pod-&-stem-blight        33.7
##  6 alternarialeaf-spot                 0  
##  7 anthracnose                         0  
##  8 bacterial-blight                    0  
##  9 bacterial-pustule                   0  
## 10 brown-spot                          0  
## 11 brown-stem-rot                      0  
## 12 charcoal-rot                        0  
## 13 diaporthe-stem-canker               0  
## 14 downy-mildew                        0  
## 15 frog-eye-leaf-spot                  0  
## 16 phyllosticta-leaf-spot              0  
## 17 powdery-mildew                      0  
## 18 purple-seed-stain                   0  
## 19 rhizoctonia-root-rot                0

The bar plot reveals that missingness is not evenly distributed across predictors. A large cluster of predictors share similar missing rates of around 15–18%, suggesting that missing values occur together in the same observations rather than randomly. The class-level analysis confirms this pattern strongly. Missing data is concentrated in just five disease classes: 2-4-d-injury (80.4%), cyst-nematode (68.6%), herbicide-injury (57.1%), phytophthora-rot (39.4%), and diaporthe-pod-&-stem-blight (33.7%). The remaining 14 classes have no missing data whatsoever. This indicates that the missing data is not missing at random (MNAR) — the missingness is systematically related to the class label. Certain disease types were simply not assessed for particular predictors, likely due to the nature of those diseases making certain measurements irrelevant or unobservable.

##Exercise c

# Check how many observations belong to these classes
table(Soybean$Class) %>% 
  as.data.frame() %>% 
  filter(Var1 %in% c("2-4-d-injury", "cyst-nematode", 
                     "herbicide-injury", "phytophthora-rot",
                     "diaporthe-pod-&-stem-blight")) %>%
  arrange(desc(Freq))

All 35 predictors contain some degree of missing data, ranging from 17.7% (hail, sever, seed.tmt, lodging) down to near zero for date and area.dam, with only leaves having no missing values at all. Given that missingness is systematically tied to specific classes rather than occurring randomly, a two-pronged strategy is recommended. For the four small classes — 2-4-d-injury (16 observations), cyst-nematode (14), herbicide-injury (8), and diaporthe-pod-&-stem-blight (15) — removal of these observations is acceptable considering their very small sample sizes and extremely high missingness rates of 57–80%. Imputing the majority of values for these classes would introduce substantial artificial data, likely doing more harm than good to model performance. For phytophthora-rot, which is too large to discard with 88 observations, KNN imputation is the most appropriate strategy. Since missingness in this class is systematic rather than random, KNN can leverage the similarity between observations within the same class to produce reasonable imputed values. Predictors with very low missingness such as date and area.dam require no special treatment.

Data Pre-Processing

Kerwin Charles