Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your RPubs link along with a .pdf of your run code.
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
# Reshape to long format
glass_long <- melt(Glass, id.vars = "Type")
# histogram of predictors
ggplot(glass_long, aes(x = value)) +
geom_histogram(bins = 30) +
facet_wrap(~variable, scales = "free", ncol = 3) +
theme_minimal() +
labs(x = "Value", y = "Count", title = "Distributions of Glass Predictors")
These histograms show that RI, Na, Al, Ca, and Si have unimodal, roughly symmetric distributions. The other predictors are highly skewed. Mg is bimodal, with one peak near 0 and another around 3.5. Ba and Fe are right skewed, as most of their values are zero.
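To back up the visual impression with numbers, the skewness of each predictor can be computed directly; this is a small sketch that assumes the e1071 package is installed (its skewness() function is not loaded by the packages above).
# Numeric skewness of each predictor (large positive values indicate strong right skew)
library(e1071)
apply(Glass[, 1:9], 2, skewness)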
# correlation matrix of predictors
corr_mat <- cor(Glass[, 1:9])
# reshape for ggplot
melted_corr <- melt(corr_mat)
ggplot(melted_corr, aes(Var1, Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", mid = "white", high = "red",
midpoint = 0, limit = c(-1,1)) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
labs(title = "Correlation Plot of Glass Predictors",
x = "", y = "")
This correlation plot shows a strong positive correlation between Ca and RI, which may be why their distribution plots look so similar. Some pairs, such as Ba and Al, and Ba and Na, show moderate positive correlations. Many of the predictors are negatively correlated with one another, for example Na and Mg, Si and RI, Mg and Al, and Mg and Ba. Fe stands out because it has only weak correlations with the other predictors.
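If we wanted to act on these correlations, caret's findCorrelation() can flag predictors involved in high pairwise correlations; this is only a sketch, and the 0.75 cutoff is an arbitrary choice for illustration.
# Flag predictors involved in pairwise correlations above the cutoff
library(caret)
findCorrelation(corr_mat, cutoff = 0.75, names = TRUE)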
There are many outliers in the data. Ba, Fe, and K are heavily right skewed, with most of their values near zero and outliers in the right tail. Even some of the roughly bell-shaped distributions have obvious outliers. These outliers and skewed variables could bias a model if left unaccounted for.
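One way to confirm these outliers is with boxplots on each predictor's own scale; this sketch reuses glass_long from above.
# Boxplots of each predictor (points beyond the whiskers are candidate outliers)
ggplot(glass_long, aes(y = value)) +
geom_boxplot() +
facet_wrap(~variable, scales = "free", ncol = 3) +
theme_minimal() +
labs(y = "Value", title = "Boxplots of Glass Predictors")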
For the predictors with a large number of zero values (Ba, K, and Fe), a plain log transformation is not appropriate. Instead, we can use a shifted log transformation, log(predictor + 1). For the remaining predictors, which are moderately skewed or roughly symmetric, a Box-Cox transformation can be applied. Standardization (centering and scaling) should also be considered so that the predictors are placed on a common scale.
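A minimal sketch of that preprocessing using caret's preProcess(); note that preProcess only applies Box-Cox to predictors that are strictly positive, so the zero-heavy columns are shifted-logged first.
# Shifted log for the zero-heavy predictors, then Box-Cox, centering, and scaling
library(caret)
glass_trans <- Glass
glass_trans[, c("K", "Ba", "Fe")] <- log(glass_trans[, c("K", "Ba", "Fe")] + 1)
pp <- preProcess(glass_trans[, 1:9], method = c("BoxCox", "center", "scale"))
glass_trans[, 1:9] <- predict(pp, glass_trans[, 1:9])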
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:
library(mlbench)
data(Soybean)
## See ?Soybean for details
# Pivoting to long format
soy_long <- Soybean %>%
select(-Class) %>%
mutate(across(everything(), as.character)) %>%
pivot_longer(everything(), names_to = "Predictor", values_to = "Level") %>%
mutate(Level = ifelse(is.na(Level), "(Missing)", Level))
ggplot(soy_long, aes(x = Level, fill = Level)) +
geom_bar() +
facet_wrap(~ Predictor, scales = "free_x", ncol = 5) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")
Based on the bar plots of the categorical predictors, we can see that many predictors are degenerate: hail, lodging, mold.growth, mycelium, sclerotia, shriveling, and fruiting.bodies each have one category that dominates, so they provide little information. These low-variance predictors are less useful for classification.
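caret's nearZeroVar() gives a programmatic version of this check; the sketch below uses its default frequency-ratio and unique-value thresholds, so the flagged set may differ slightly from the visual assessment.
# List predictors with near-zero variance under caret's default thresholds
library(caret)
nearZeroVar(Soybean %>% select(-Class), names = TRUE)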
I would first drop degenerate predictors such as hail, lodging, mold.growth, and shriveling, since they are dominated by a single level and also contain missing values, so they provide little information. For predictors with missing data that still show useful variation, such as temp, precip, and leaf.mild, imputation would be performed; I would impute the missing values with the mode. This preserves the relationships between predictors and the outcome while ensuring the dataset is complete for modeling.
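A sketch of that plan: drop an example subset of the degenerate predictors, then replace missing values in the remaining factors with their most frequent level. The helper impute_mode and the object soy_clean are illustrative names, and the dropped columns are just one reasonable choice.
# Most frequent non-NA level of a factor
impute_mode <- function(x) {
  mode_level <- names(which.max(table(x)))
  x[is.na(x)] <- mode_level
  x
}
soy_clean <- Soybean %>%
select(-hail, -lodging, -mycelium, -sclerotia) %>%
mutate(across(-Class, impute_mode))
colSums(is.na(soy_clean)) # predictors should now have no missing values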