Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your RPubs link along with a .pdf of your run code.
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
# Reshape to long format
glass_long <- melt(Glass, id.vars = "Type")
# histogram of predictors
ggplot(glass_long, aes(x = value)) +
geom_histogram(bins = 30) +
facet_wrap(~variable, scales = "free", ncol = 3) +
theme_minimal() +
labs(x = "Value", y = "Count", title = "Distributions of Glass Predictors")
These histograms show that RI, Na, Al, Ca, and Si have unimodal, roughly symmetric distributions. The other predictors are highly skewed. Mg is bimodal, with one peak near 0 and another around 3.5. Ba and Fe are right skewed, as most of their values are zero.
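To back up the visual impression with numbers, the skewness of each predictor can be computed directly; this is a small sketch that assumes the e1071 package is installed (its skewness() function is not loaded by the packages above).
# Numeric skewness of each predictor (large positive values indicate strong right skew)
library(e1071)
apply(Glass[, 1:9], 2, skewness)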
# correlation matrix of predictors
corr_mat <- cor(Glass[, 1:9])
# reshape for ggplot
melted_corr <- melt(corr_mat)
ggplot(melted_corr, aes(Var1, Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", mid = "white", high = "red",
midpoint = 0, limit = c(-1,1)) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
labs(title = "Correlation Plot of Glass Predictors",
x = "", y = "")
This correlation plot shows a strong positive correlation between Ca and RI, which may be why their distribution plots look so similar. Some pairs, such as Ba and Al, and Ba and Na, show moderate positive correlations. Many of the predictors are negatively correlated with one another, for example Na and Mg, Si and RI, Mg and Al, and Mg and Ba. Fe stands out because it has only weak correlations with the other predictors.
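If we wanted to act on these correlations, caret's findCorrelation() can flag predictors involved in high pairwise correlations; this is only a sketch, and the 0.75 cutoff is an arbitrary choice for illustration.
# Flag predictors involved in pairwise correlations above the cutoff
library(caret)
findCorrelation(corr_mat, cutoff = 0.75, names = TRUE)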
There are many outliers in the data. Ba, Fe, and K are heavily right skewed, with most of their values near zero and outliers in the right tail. Even some of the roughly bell-shaped distributions have obvious outliers. These outliers and skewed variables could bias a model if left unaccounted for.
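One way to confirm these outliers is with boxplots on each predictor's own scale; this sketch reuses glass_long from above.
# Boxplots of each predictor (points beyond the whiskers are candidate outliers)
ggplot(glass_long, aes(y = value)) +
geom_boxplot() +
facet_wrap(~variable, scales = "free", ncol = 3) +
theme_minimal() +
labs(y = "Value", title = "Boxplots of Glass Predictors")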
For the predictors with a large number of zero values (Ba, K, and Fe), a plain log transformation is not appropriate. Instead, we can use a shifted log transformation, log(predictor + 1). For the remaining predictors, which are moderately skewed or roughly symmetric, a Box-Cox transformation can be applied. Standardization (centering and scaling) should also be considered so that the predictors are placed on a common scale.
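A minimal sketch of that preprocessing using caret's preProcess(); note that preProcess only applies Box-Cox to predictors that are strictly positive, so the zero-heavy columns are shifted-logged first.
# Shifted log for the zero-heavy predictors, then Box-Cox, centering, and scaling
library(caret)
glass_trans <- Glass
glass_trans[, c("K", "Ba", "Fe")] <- log(glass_trans[, c("K", "Ba", "Fe")] + 1)
pp <- preProcess(glass_trans[, 1:9], method = c("BoxCox", "center", "scale"))
glass_trans[, 1:9] <- predict(pp, glass_trans[, 1:9])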
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:
library(mlbench)
data(Soybean)
## See ?Soybean for details
# Pivoting to long format
soy_long <- Soybean %>%
select(-Class) %>%
mutate(across(everything(), as.character)) %>%
pivot_longer(everything(), names_to = "Predictor", values_to = "Level") %>%
mutate(Level = ifelse(is.na(Level), "(Missing)", Level))
ggplot(soy_long, aes(x = Level, fill = Level)) +
geom_bar() +
facet_wrap(~ Predictor, scales = "free_x", ncol = 5) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")
Based on the bar plots of the categorical predictors, we can see that many predictors are degenerate: hail, lodging, mold.growth, mycelium, sclerotia, shriveling, and fruiting.bodies each have one category that dominates, so they provide little information. These low-variance predictors are less useful for classification.
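caret's nearZeroVar() gives a programmatic version of this check; the sketch below uses its default frequency-ratio and unique-value thresholds, so the flagged set may differ slightly from the visual assessment.
# List predictors with near-zero variance under caret's default thresholds
library(caret)
nearZeroVar(Soybean %>% select(-Class), names = TRUE)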
I would first drop degenerate predictors such as hail, lodging, mold.growth, and shriveling, since they are dominated by a single level and also contain missing values, so they provide little information. For predictors with missing data that still show useful variation, such as temp, precip, and leaf.mild, imputation would be performed; I would impute the missing values with the mode. This preserves the relationships between predictors and the outcome while ensuring the dataset is complete for modeling.
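A sketch of that plan: drop an example subset of the degenerate predictors, then replace missing values in the remaining factors with their most frequent level. The helper impute_mode and the object soy_clean are illustrative names, and the dropped columns are just one reasonable choice.
# Most frequent non-NA level of a factor
impute_mode <- function(x) {
  mode_level <- names(which.max(table(x)))
  x[is.na(x)] <- mode_level
  x
}
soy_clean <- Soybean %>%
select(-hail, -lodging, -mycelium, -sclerotia) %>%
mutate(across(-Class, impute_mode))
colSums(is.na(soy_clean)) # predictors should now have no missing values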