library(mlbench)
library(DataExplorer)
library(corrplot)
library(ggplot2)
library(tidyr)
library(dplyr)
library(GGally)
library(e1071)
library(caret)
library(knitr) 
library(tibble)

Exercises 3.1:

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Answer:

Before we proceed with the answer, let’s check if there are any missing data.

glass_df <- Glass
sum(is.na(glass_df))  # Total number of missing values
## [1] 0
colSums(is.na(glass_df))  # Missing values per column
##   RI   Na   Mg   Al   Si    K   Ca   Ba   Fe Type 
##    0    0    0    0    0    0    0    0    0    0

There are no missing values.

Distributions and relationships between predictors: we first convert the dataset to long format so each predictor can be plotted as a faceted histogram.

glass_long <- pivot_longer(glass_df, cols = -Type, names_to = "Feature", values_to = "Value")

ggplot(glass_long, aes(x = Value)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black") +
  facet_wrap(~ Feature, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Predictor Variables", x = "Value", y = "Count")

Histogram: most predictors, like Al, Na, and Si, follow a roughly normal or slightly skewed distribution, while Ba, Fe, and K are highly skewed with many zero values, suggesting potential outliers or categorical-like behavior.
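
Since several histograms pile up at zero, it is worth quantifying that directly; a quick sketch over the nine predictors:

round(sapply(glass_df[, 1:9], function(x) mean(x == 0)), 3)  # share of exact zeros per predictor

Next, let’s check the relationships between predictors using ggpairs().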

ggpairs(glass_df, columns = 1:9, aes(color = as.factor(Type)), 
       upper = list(continuous = wrap("cor", size = 3))) +
  theme_minimal() +
  labs(title = "Scatterplot Matrix of Glass Data Predictors")

Scatterplot Matrix (ggpairs): Ca and RI show a strong positive correlation, while Na and Mg show a moderate negative correlation.

cor_matrix <- cor(glass_df[,1:9])
corrplot(cor_matrix, method = "color", type = "lower", 
         tl.col = "black", tl.srt = 45, addCoef.col = "black", number.cex = 0.8,
         col = colorRampPalette(c("blue", "white", "red"))(200))

Correlation Heatmap (corrplot): The heat map confirms Ca and RI as the strongest correlated pair (r = 0.81), while most other predictors have weak correlations.
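
To pull the strongest pairs out programmatically rather than reading them off the plot, a small sketch over the cor_matrix computed above (the 0.5 cutoff is an arbitrary choice):

cor_long <- as.data.frame(as.table(cor_matrix))  # flatten the matrix to long form
names(cor_long) <- c("Var1", "Var2", "r")
cor_long %>%
  filter(as.integer(Var1) < as.integer(Var2), abs(r) > 0.5) %>%  # keep each pair once
  arrange(desc(abs(r)))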

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

Boxplot: let’s use boxplots, as they help identify outliers by visualizing data spread and extreme values.

ggplot(glass_long, aes(y = Feature, x = Value)) +
  geom_boxplot(width = 0.3, fill = "steelblue", outlier.color = "red", outlier.shape = 16) +
  theme_minimal() +
  labs(title = "Boxplot of Predictors - Detecting Outliers", x = "Value", y = "Feature") +
  facet_wrap(~ Feature, scales = "free", ncol = 3) +  
  theme(axis.text.y = element_text(size = 12)) 

summary(glass_long$Value[glass_long$Feature == "Ba"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.175   0.000   3.150
var(glass_long$Value[glass_long$Feature == "Ba"])
## [1] 0.247227

Ba’s boxplot appears compressed because the majority of its values are zero; the low variance confirms the limited spread. Let’s add jittering to make overlapping points more visible.

ggplot(glass_long, aes(y = Feature, x = Value)) +
  geom_boxplot(width = 0.3, fill = "steelblue", outlier.color = "red", outlier.shape = 16) +
  geom_jitter(aes(color = Feature), width = 0.2, alpha = 0.5) + 
  theme_minimal() +
  labs(title = "Boxplot of Predictors - Detecting Outliers", x = "Value", y = "Feature") +
  facet_wrap(~ Feature, scales = "free", ncol = 3) +
  theme(axis.text.y = element_text(size = 12))

Using geom_jitter() makes the overlapping observations, especially the clusters of zeros in Ba and Fe, much easier to see.

Skewness measures the asymmetry of a distribution: values near zero indicate symmetry, while large absolute values indicate strong asymmetry. Let’s compute the skewness of each predictor:

glass_df %>%
  select(where(is.numeric)) %>%
  sapply(skewness) %>%
  round(4)
##      RI      Na      Mg      Al      Si       K      Ca      Ba      Fe 
##  1.6027  0.4478 -1.1365  0.8946 -0.7202  6.4601  2.0184  3.3687  1.7298

The boxplot confirms that Ba, Fe, K, Ca, and Al have significant outliers, aligning with their high skewness values. This indicates strong asymmetry, suggesting that these variables may require transformation before modeling.
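
As a rough rule of thumb (an assumption, not from the text): |skewness| below 0.5 is approximately symmetric, 0.5 to 1 is moderate, and above 1 is high. Binning the computed values accordingly:

skew_vals <- sapply(glass_df[, 1:9], skewness)
data.frame(skewness = round(skew_vals, 4),
           level = cut(abs(skew_vals), breaks = c(0, 0.5, 1, Inf),
                       labels = c("approx. symmetric", "moderate", "high")))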

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

Log transformation vs. Box-Cox transformation: since Ba, Fe, K, and Ca are the most heavily right-skewed predictors, let’s apply both the log and Box-Cox transformations to them and compare the resulting skewness. (Al, which is only moderately skewed, is handled separately below.)

glass_transformed_log <- glass_df
glass_transformed_boxcox <- glass_df

log_transform_vars <- c("Ba", "Fe", "K", "Ca")
glass_transformed_log[log_transform_vars] <- log1p(glass_transformed_log[log_transform_vars]) # Log Transformation
skewness_log <- sapply(glass_transformed_log[log_transform_vars], skewness) # Compute skewness 

boxcox_vars <- c("Ba", "Fe", "K", "Ca")
preprocess_params <- preProcess(glass_transformed_boxcox[boxcox_vars], method = "BoxCox") 
glass_transformed_boxcox[boxcox_vars] <- predict(preprocess_params, glass_transformed_boxcox[boxcox_vars]) # Box-Cox 
skewness_boxcox <- sapply(glass_transformed_boxcox[boxcox_vars], skewness) # Compute skewness


skewness_original <- sapply(glass_df[boxcox_vars], skewness) # Original Skewness

# Create a Data Frame for Comparison
skewness_comparison <- data.frame(
  Original_Skewness = round(skewness_original, 4),
  After_Log = round(skewness_log, 4),
  After_BoxCox = round(skewness_boxcox, 4)
) %>%
  rownames_to_column(var = "Feature")  # Convert row names to a column

knitr::kable(skewness_comparison, caption = "Comparing Skewness After Log vs. Box-Cox Transformations") #Print table
Comparing Skewness After Log vs. Box-Cox Transformations

Feature   Original_Skewness   After_Log   After_BoxCox
Ba                   3.3687      2.6886         3.3687
Fe                   1.7298      1.5642         1.7298
K                    6.4601      1.9504         6.4601
Ca                   2.0184      1.1521        -0.1940

From the comparison table above we can see that the Box-Cox transformation only changed Ca. This is because Box-Cox requires strictly positive values; Ba, Fe, and K contain zeros, so preProcess() leaves them untransformed and their skewness is unchanged. For Ca, Box-Cox also outperforms the log (skewness of -0.1940 vs. 1.1521). So we will use the log transformation for Ba, Fe, and K, and Box-Cox for Ca. We will also apply square-root and square transformations to the moderately skewed predictors and check the new skewness values.
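
A quick check of the minimums makes the zero problem explicit:

sapply(glass_df[boxcox_vars], min)  # Ba, Fe, and K bottom out at 0; Ca is strictly positive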

glass_transformed <- glass_df  

# Log transformation for highly right-skewed variables (Ba, Fe, K)
log_transform_vars <- c("Ba", "Fe", "K")
glass_transformed[log_transform_vars] <- log1p(glass_transformed[log_transform_vars])  # log1p(x) = log(x + 1)

# Square root transformation for moderately right-skewed variables (Al)
glass_transformed$Al <- sqrt(glass_transformed$Al)

# Square transformation for left-skewed variables (Mg, Si)
glass_transformed$Mg <- glass_transformed$Mg^2
glass_transformed$Si <- glass_transformed$Si^2

# Apply Box-Cox transformation for Ca only (the one variable where it helped)
preprocess_params <- preProcess(glass_transformed["Ca"], method = "BoxCox")
glass_transformed["Ca"] <- predict(preprocess_params, glass_transformed["Ca"])

# Check new skewness after the transformations (exclude the Type factor)
new_skewness <- sapply(glass_transformed[, 1:9], skewness)
round(new_skewness, 4)
##      RI      Na      Mg      Al      Si       K      Ca      Ba      Fe 
##  1.6027  0.4478 -0.8131  0.0911 -0.6509  1.9504 -0.1940  2.6886  1.5642
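
The transformations noticeably reduce skewness for Al, K, Fe, and Ca, although Ba remains fairly skewed. As an aside, caret’s preProcess() also supports the Yeo-Johnson transformation, which, unlike Box-Cox, handles zeros directly and could replace the manual log and square-root choices above; a sketch under that assumption:

pp_yj <- preProcess(glass_df[, 1:9], method = "YeoJohnson")  # Yeo-Johnson accepts zero values
glass_yj <- predict(pp_yj, glass_df[, 1:9])
round(sapply(glass_yj, skewness), 4)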

Exercises 3.2:

library(mlbench)
library(caret)
library(naniar)

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

data(Soybean)
glimpse(Soybean)
## Rows: 683
## Columns: 36
## $ Class           <fct> diaporthe-stem-canker, diaporthe-stem-canker, diaporth…
## $ date            <fct> 6, 4, 3, 3, 6, 5, 5, 4, 6, 4, 6, 4, 3, 6, 6, 5, 6, 4, …
## $ plant.stand     <ord> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ precip          <ord> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ temp            <ord> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, …
## $ hail            <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, …
## $ crop.hist       <fct> 1, 2, 1, 1, 2, 3, 2, 1, 3, 2, 1, 1, 1, 3, 1, 3, 0, 2, …
## $ area.dam        <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 2, 3, 3, 3, 2, 2, …
## $ sever           <fct> 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ seed.tmt        <fct> 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, …
## $ germ            <ord> 0, 1, 2, 1, 2, 1, 0, 2, 1, 2, 0, 1, 0, 0, 1, 2, 0, 1, …
## $ plant.growth    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ leaves          <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ leaf.halo       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.marg       <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ leaf.size       <ord> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ leaf.shread     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.malf       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.mild       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stem            <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ lodging         <fct> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, …
## $ stem.cankers    <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ canker.lesion   <fct> 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ fruiting.bodies <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ext.decay       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ mycelium        <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ int.discolor    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ sclerotia       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ fruit.pods      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ fruit.spots     <fct> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ seed            <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ mold.growth     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ seed.discolor   <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ seed.size       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ shriveling      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ roots           <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Let’s identify predictors that might be problematic by checking: Zero variance predictors (only 1 unique value) and near-zero variance predictors (highly imbalanced variables).

nzv_info <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv_info[nzv_info$nzv == TRUE, ]
##           freqRatio percentUnique zeroVar  nzv
## leaf.mild     26.75     0.4392387   FALSE TRUE
## mycelium     106.50     0.2928258   FALSE TRUE
## sclerotia     31.25     0.2928258   FALSE TRUE

The nearZeroVar() function identified leaf.mild, mycelium, and sclerotia as near-zero variance predictors. These features are highly imbalanced with very few unique values, making them poor predictors for classification. Removing them is recommended to simplify the dataset and improve model performance.
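
Acting on that recommendation is a one-liner with the index form of nearZeroVar() (soybean_filtered is a new name introduced here):

nzv_cols <- nearZeroVar(Soybean)          # indices of the flagged columns
soybean_filtered <- Soybean[, -nzv_cols]  # drop leaf.mild, mycelium, sclerotia
dim(soybean_filtered)                     # 683 rows, 33 columns remain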

(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Let’s identify missing data patterns first.
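
As a quick sanity check on the quoted figure, the share of incomplete samples (rows with at least one NA) should land near 18%:

mean(!complete.cases(Soybean))  # proportion of rows with missing values

Now, the per-predictor counts: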

missing_counts <- colSums(is.na(Soybean)) # Count missing values per column
missing_counts[missing_counts > 0] # columns with missing values only
##            date     plant.stand          precip            temp            hail 
##               1              36              38              30             121 
##       crop.hist        area.dam           sever        seed.tmt            germ 
##              16               1             121             121             112 
##    plant.growth       leaf.halo       leaf.marg       leaf.size     leaf.shread 
##              16              84              84              84             100 
##       leaf.malf       leaf.mild            stem         lodging    stem.cankers 
##              84             108              16             121              38 
##   canker.lesion fruiting.bodies       ext.decay        mycelium    int.discolor 
##              38             106              38              38              38 
##       sclerotia      fruit.pods     fruit.spots            seed     mold.growth 
##              38              84             106              92              92 
##   seed.discolor       seed.size      shriveling           roots 
##             106              92             106              31
gg_miss_var(Soybean) +
  labs(title = "Missing Values per Predictor in Soybean Data")

The missing data analysis shows that the highest numbers of missing values occur in variables like hail, sever, seed.tmt, germ, and lodging, each with over 100 missing entries. Several other predictors, including leaf.halo, fruit.pods, and mold.growth, also have a moderate number of missing values, ranging from roughly 30 to 100. On the other hand, variables such as date and area.dam are nearly complete, and Class has no missing values at all.

missing_class_summary <- Soybean %>%
  group_by(Class) %>%
  mutate(class_Total = n()) %>%
  ungroup() %>%
  filter(!complete.cases(.)) %>%
  group_by(Class) %>%
  summarize(Missing = n(), Proportion = Missing / first(class_Total)) %>% # Proportion of missing data per class
  ungroup() %>%
  distinct(Class, Proportion)

# Bar plot of missing data proportion by class
ggplot(missing_class_summary, aes(x = reorder(Class, Proportion), y = Proportion)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Proportion of Missing Data by Class",
       x = "Class",
       y = "Proportion of Missing Data")

# Heatmap of missing values per class
missing_by_class_long <- Soybean %>%
  group_by(Class) %>%
  summarise(across(everything(), ~ sum(is.na(.)), .names = "missing_{.col}")) %>%
  pivot_longer(cols = -Class, names_to = "Feature", values_to = "Missing_Count")

ggplot(missing_by_class_long, aes(x = Feature, y = Class, fill = Missing_Count)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "white", high = "red") +
  theme_minimal() +
  labs(title = "Heatmap of Missing Data by Class", x = "Feature", y = "Class") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

The missing data distribution changes once the highly missing classes are removed:

Soybean %>%
  filter(!Class %in% c("phytophthora-rot", "diaporthe-pod-&-stem-blight", "cyst-nematode",
                       "2-4-d-injury", "herbicide-injury")) %>%
  mutate(across(everything(), is.na)) %>%  # flag each cell as missing or not
  pivot_longer(everything(), names_to = "variables", values_to = "missing") %>%
  count(variables, missing) %>%
  ggplot(aes(y = variables, x = n, fill = missing)) +
  geom_col(position = "fill") +
  labs(title = "Proportion of Missing Values with Missing Classes Removed",
       x = "Proportion") +
  scale_fill_manual(values = c("steelblue", "red"))

The heatmap reveals that missing data is concentrated within specific disease classes rather than occurring randomly. In particular, Diaporthe-pod-&-stem-blight, Cyst-nematode, 2-4-D-injury, and Herbicide-injury have missing values in all observations, while Phytophthora-rot has around 77% missing data. This suggests that these classes may have been underrepresented in data collection or lacked certain recorded measurements.

After removing these highly missing classes, the dataset no longer contains any missing values. This indicates that the missing data issue was primarily driven by these specific classes rather than by inconsistencies across features.
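
A quick verification of that claim (soybean_reduced is a new name introduced here):

soybean_reduced <- Soybean %>%
  filter(!Class %in% c("phytophthora-rot", "diaporthe-pod-&-stem-blight",
                       "cyst-nematode", "2-4-d-injury", "herbicide-injury")) %>%
  droplevels()  # drop the now-empty factor levels
sum(is.na(soybean_reduced))  # expected to be 0 if the claim above holds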

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The best strategy for handling missing data in the Soybean dataset is to remove the five disease classes with excessive missing values: diaporthe-pod-&-stem-blight, cyst-nematode, 2-4-d-injury, herbicide-injury, and phytophthora-rot, as their high missing rates make them unreliable. Once these classes are removed, the dataset no longer contains missing values, which also eliminates the need for imputation. This approach ensures a clean dataset for modeling without introducing the biases that imputation could create.
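
If keeping all 19 classes were a requirement, a simple alternative would be mode imputation for the categorical predictors. A hedged sketch (impute_mode is a helper defined here, not a library function):

impute_mode <- function(x) {
  x[is.na(x)] <- names(which.max(table(x)))  # replace NAs with the most frequent level
  x
}
soybean_imputed <- Soybean %>%
  mutate(across(-Class, impute_mode))
sum(is.na(soybean_imputed))  # should be 0 after imputation

More principled options (e.g., kNN or model-based imputation) exist, but mode imputation illustrates the idea without extra dependencies.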