Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your Rpubs link along with your .pdf for your run code.
The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
# Load required libraries
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.4.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.2
## Warning: package 'ggplot2' was built under R version 4.4.2
## Warning: package 'tidyr' was built under R version 4.4.2
## Warning: package 'readr' was built under R version 4.4.2
## Warning: package 'dplyr' was built under R version 4.4.2
## Warning: package 'stringr' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
data(Glass)
#Histograms of Numerical Predictors
Glass %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
geom_histogram(bins = 15) +
facet_wrap(~key, scales = 'free') +
ggtitle("Histograms of Numerical Predictors")
#Boxplots of Numerical Predictors
Glass %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
geom_boxplot() +
facet_wrap(~key, scales = 'free') +
ggtitle("Boxplots of Numerical Predictors")
#Correlation Heatmap
Glass %>%
keep(is.numeric) %>%
cor() %>%
corrplot()
#Bar Chart of Target Variable
Glass %>%
ggplot() +
geom_bar(aes(x = Type)) +
ggtitle("Distribution of Types of Glass")
The Glass dataset exhibits varied distributional characteristics: Aluminum (Al), Barium (Ba), Calcium (Ca), Iron (Fe), Potassium (K), and Refractive Index (RI) display right skewness, with Ba and Fe concentrated near zero, while Magnesium (Mg) and Silicon (Si) show left skewness, Mg being bimodal, and Sodium (Na) approximates a normal distribution with a slight right tail; Glass Type is predominantly focused on Types 1, 2, and 7. Notable correlations include a strong positive relationship between RI and Ca, negative correlations between RI and Si, Al and Mg, Ca and Mg, Ba and Mg, and a positive correlation between Ba and Al.
# Boxplots to detect outliers
Glass %>%
select(-Type) %>% # Exclude the target variable
gather(key = "Predictor", value = "Value") %>%
ggplot(aes(x = Predictor, y = Value)) +
geom_boxplot(fill = "lightblue") +
theme_minimal() +
labs(title = "Boxplots for Outlier Detection", x = "Predictor", y = "Value")
# Skewness check
library(e1071)
## Warning: package 'e1071' was built under R version 4.4.2
skew_values <- Glass %>%
select(-Type) %>%
summarise_all(funs(skewness))
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
##
## # Simple named list: list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
##
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
print(skew_values)
## RI Na Mg Al Si K Ca Ba
## 1 1.602715 0.4478343 -1.136452 0.8946104 -0.7202392 6.460089 2.018446 3.36868
## Fe
## 1 1.729811
The boxplot visualization of the Glass dataset’s numerical predictors (RI, Na, Mg, Al, Si, K, Ca, Ba, Fe) reveals distinct distribution patterns: Ba, Fe, and K exhibit significant outliers and skewness, necessitating log transformations or robust scaling for improved model performance. Conversely, RI, Mg, Na, and Si display more consistent distributions, suggesting they could serve as stable inputs.
# Log-transform skewed predictors (adding a small constant to avoid log(0))
Glass_transformed <- Glass %>%
mutate(
Ba = log1p(Ba), # Log transform Ba
K = log1p(K), # Log transform K
Fe = log1p(Fe) # Log transform Fe
)
# Visualize distributions after transformation
Glass_transformed %>%
select(Ba, K, Fe) %>%
gather(key = "Predictor", value = "Value") %>%
ggplot(aes(x = Value)) +
geom_histogram(bins = 20, fill = "lightgreen", color = "black") +
facet_wrap(~Predictor, scales = "free", ncol = 3) +
theme_minimal() +
labs(title = "Distributions After Log Transformation", x = "Value", y = "Frequency")
After a log transformation, Barium (Ba) and Iron (Fe) exhibit a concentration of values near zero, indicating minimal presence, with a small spread of higher values suggesting limited variability; conversely, Potassium (K) displays a broader distribution, peaking around 0.5 and extending to 2.0, demonstrating more retained variability. This transformation successfully reduces skewness, making the data more model-ready, though Ba and Fe’s limited variance suggests potentially lower predictive power compared to K, which retains more variability.
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
# Load the Soybean dataset and others
library(tidyverse)
library(mlbench)
data(Soybean)
# Explore categorical predictors
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
columns <- colnames(Soybean)
lapply(columns, function(col) {
ggplot(Soybean, aes_string(col, fill = col)) +
geom_bar(color = "black", alpha = 0.8) + # Add borders and transparency for better visuals
coord_flip() +
scale_fill_viridis_d(option = "C") + # Use Viridis color palette for consistency
theme_minimal() +
ggtitle(paste("Frequency Distribution:", col)) +
labs(x = col, y = "Count") +
theme(
plot.title = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 10),
axis.text.x = element_text(size = 10)
)
})
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
##
## [[7]]
##
## [[8]]
##
## [[9]]
##
## [[10]]
##
## [[11]]
##
## [[12]]
##
## [[13]]
##
## [[14]]
##
## [[15]]
##
## [[16]]
##
## [[17]]
##
## [[18]]
##
## [[19]]
##
## [[20]]
##
## [[21]]
##
## [[22]]
##
## [[23]]
##
## [[24]]
##
## [[25]]
##
## [[26]]
##
## [[27]]
##
## [[28]]
##
## [[29]]
##
## [[30]]
##
## [[31]]
##
## [[32]]
##
## [[33]]
##
## [[34]]
##
## [[35]]
##
## [[36]]
Essentially, degenerate distributions represent variables that consistently have a single value; in this dataset, mycelium and sclerotia appear to exhibit this characteristic. Similarly, leaf.mild and leaf.malf show a strong tendency towards a single value, particularly when missing data is accounted for, indicating nearly one-sided distributions.
# Missing values across variables
Soybean %>%
summarise_all(~sum(is.na(.))) %>%
pivot_longer(everything(), names_to = "variables", values_to = "missing_count") %>%
mutate(missing = ifelse(missing_count > 0, "Missing", "Not Missing")) %>%
ggplot(aes(x = reorder(variables, -missing_count), y = missing_count, fill = missing)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = c("red", "grey")) +
labs(
title = "Missing Data Proportions by Variable",
x = "Variables",
y = "Count of Missing Values",
fill = "Missing Status"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 10),
axis.text.x = element_text(size = 10)
)
The missing data appears to be non-random, concentrated within specific case groups. After removing these five distinct case groups, the dataset is complete, suggesting the missingness is tied to these particular classifications
# Missing proportions by class
Soybean %>%
group_by(Class) %>%
mutate(class_Total = n()) %>%
ungroup() %>%
filter(!complete.cases(.)) %>%
group_by(Class) %>%
summarise(
Missing = n(),
class_Total = first(class_Total),
Proportion = Missing / class_Total
) %>%
ggplot(aes(x = reorder(Class, -Proportion), y = Proportion, fill = Class)) +
geom_col(color = "black") +
scale_fill_viridis_d(option = "B") +
labs(
title = "Proportion of Missing Data by Class",
x = "Class",
y = "Proportion of Missing Data"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
legend.position = "none"
)
Two potential strategies exist for addressing the missing data: First, the five case groups containing the missing values could be entirely removed to achieve a complete dataset. Alternatively, the dataset could be divided, isolating these five groups. Within these isolated groups, K-Nearest Neighbors (KNN) imputation could be utilized to fill the missing values. If certain variables within these groups are consistently devoid of data, they could be eliminated from the subsetted data to maintain data integrity.
Note that the echo = FALSE
parameter was added to the
code chunk to prevent printing of the R code that generated the
plot.