HW_4

Joyce Aldrich

2025-09-28

Homework Instruction: Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your RPubs link along with a .pdf of your run code.

Exercise 3.1 The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The data can be accessed via:

## install.packages("mlbench")
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
library(ggplot2)

# Histograms of the nine predictor variables
par(mfrow = c(3, 3))   # 3x3 grid of plots
for (col in names(Glass)[1:9]) {
  hist(Glass[[col]], main = col, col = "skyblue", xlab = col)
}

# Scatterplot matrix
pairs(Glass[,1:9], col=Glass$Type)

# Correlation matrix heatmap
cor_matrix <- cor(Glass[,1:9])
par(mfrow = c(1, 1))
image(1:9, 1:9, cor_matrix, col = heat.colors(256), axes = FALSE,
      main = "Correlation Matrix of Predictor Variables",
      xlab = "", ylab = "")
axis(1, at = 1:9, labels = names(Glass)[1:9], las = 2, cex.axis = 0.8)
axis(2, at = 1:9, labels = names(Glass)[1:9], las = 1, cex.axis = 0.8)
# Name the labels argument so expand.grid() supplies the (x, y) coordinates
text(expand.grid(1:9, 1:9), labels = sprintf("%.2f", cor_matrix), cex = 0.8, col = "black")

# Boxplots of the predictor variables
boxplot(Glass[,1:9],
        main = "Boxplots of Predictor Variables",
        col = "lightblue",
        border = "darkblue")

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

The boxplots show outliers in several predictors, most notably Ca, Na, and K; Mg has no outliers.

Regarding skewness, Ba and Fe are heavily right-skewed, and K is also noticeably skewed.
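These visual impressions can be checked numerically. A quick sketch, assuming the e1071 package is available for its skewness() function:

library(e1071)

# Count boxplot outliers per predictor (points beyond 1.5 * IQR from the hinges)
sapply(Glass[, 1:9], function(x) length(boxplot.stats(x)$out))

# Sample skewness per predictor; large positive values confirm the
# right skew visible for Ba, Fe, and K
sort(sapply(Glass[, 1:9], skewness), decreasing = TRUE)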

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

For the skewed variables, a Box-Cox or log transformation can address the issue, and the spatial sign transformation can be used to reduce the influence of outliers.
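A concrete sketch of these transformations using the caret package. Yeo-Johnson is substituted for Box-Cox here because Ba, Fe, and K contain zeros, which Box-Cox cannot handle:

library(caret)

# Estimate the transformations on the nine predictors: Yeo-Johnson to
# reduce skewness, centering and scaling, then the spatial sign
# transformation to dampen the remaining outliers
pp <- preProcess(Glass[, 1:9],
                 method = c("YeoJohnson", "center", "scale", "spatialSign"))
Glass_trans <- predict(pp, Glass[, 1:9])
summary(Glass_trans)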

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

library(mlbench)
data(Soybean)
## See ?Soybean for details
(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
# Load the Soybean dataset
library(mlbench)
data(Soybean)

# Bar plot of the frequency distribution of each variable
for (col_name in names(Soybean)) {
  p <- ggplot(Soybean, aes(x = .data[[col_name]])) +
    geom_bar(fill = "skyblue") +
    coord_flip() +
    labs(title = col_name, x = col_name, y = "Count") +
    theme_minimal()
  print(p)   # print each plot explicitly rather than returning a list
}
Based on the above bar charts, the mycelium, sclerotia, and leaf.mild variables have degenerate distributions: nearly all observations fall into a single level, so these predictors have near-zero variance and carry almost no information.
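This can be confirmed numerically with caret's nearZeroVar() function, which flags predictors whose most frequent value dominates and which have very few unique values. A sketch:

library(caret)

# saveMetrics = TRUE returns the frequency ratio and percent-unique
# diagnostics along with the near-zero-variance flag
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ]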

(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
library(dplyr)
library(tidyr)

Soybean %>%
  summarise(across(everything(), ~mean(is.na(.)))) %>%   # proportion missing per column
  pivot_longer(everything(), names_to = "variables", values_to = "prop_missing") %>%
  ggplot(aes(x = prop_missing, y = reorder(variables, prop_missing))) +
  geom_col(fill = "red") +
  labs(title = "Proportion of Missing Values",
       x = "Proportion Missing",
       y = "Variables") +
  theme_minimal()

Based on the above plot, the predictors with the most missing values (close to 18% of observations each) are sever, seed.tmt, lodging, and hail, followed by germ, leaf.mild, fruit.spots, fruiting.bodies, seed, mold.growth, and shriveling. The pattern of missing data is also related to the classes: predictors describing fruit spots, fruit pods, and seed characteristics tend to be missing for particular classes, with phytophthora-rot in particular showing a high proportion of missing data for these predictors.
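This class relationship can be checked directly. A quick sketch that averages the number of missing cells per observation within each class (avg_na_per_row is just an illustrative column name):

# Average number of NA cells per observation, by class
Soybean %>%
  mutate(n_missing = rowSums(is.na(across(-Class)))) %>%
  group_by(Class) %>%
  summarise(avg_na_per_row = mean(n_missing), .groups = "drop") %>%
  arrange(desc(avg_na_per_row))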

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Strategy for handling missing data: first, eliminate predictors with degenerate distributions or excessive missing values, i.e., those with near-zero variance or, as a rule of thumb, more than 30-40% missing data; based on the results in (a), this means dropping mycelium, sclerotia, and leaf.mild. For the remaining predictors, apply a model-based imputation method suited to categorical data, since the predictors are factors and their missingness is related to the class variable. A sketch of this strategy follows.
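One possible implementation, a sketch assuming the mice package (one of several options for model-based imputation of categorical data; by default mice fits polytomous regression models to unordered factors). Imputing 30+ categorical predictors this way can take a few minutes:

library(mice)

# Drop the degenerate predictors identified in (a)
drop_vars <- c("mycelium", "sclerotia", "leaf.mild")
soy_reduced <- Soybean[, !(names(Soybean) %in% drop_vars)]

# Model-based imputation: m = 5 imputed data sets, fixed seed for
# reproducibility; printFlag = FALSE suppresses the iteration log
imp <- mice(soy_reduced, m = 5, seed = 123, printFlag = FALSE)
soy_complete <- mice::complete(imp)   # mice:: avoids masking by tidyr::complete

# Verify that no missing values remain
sum(is.na(soy_complete))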