In this report, I will be answering questions 1 and 2 in the 3rd chapter from the Kuhn and Johnson book Applied Predictive Modeling.
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.5.2
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
## Warning: package 'GGally' was built under R version 4.5.2
library(caret)
## Warning: package 'caret' was built under R version 4.5.2
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via: > library(mlbench) > data(Glass) > str(Glass)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(Glass)
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe Type
## Min. :0.00000 1:70
## 1st Qu.:0.00000 2:76
## Median :0.00000 3:17
## Mean :0.05701 5:13
## 3rd Qu.:0.10000 6: 9
## Max. :0.51000 7:29
Glass |>
pivot_longer(-Type, names_to = "Predictor", values_to = "Value") |>
ggplot(aes(x = Value)) +
geom_histogram(bins = 20, fill = "darkgreen", color = "white") +
facet_wrap(~ Predictor, scales = "free") +
labs(title = "Distribution of Glass Predictors")
corr_mat <- Glass |>
select(-Type) |>
cor()
corr_df <- as.data.frame(as.table(corr_mat))
names(corr_df) <- c("Var1", "Var2", "Corr")
ggplot(corr_df, aes(Var1, Var2, fill = Corr)) +
geom_tile() +
labs(title = "Correlation Heatmap of Predictors", x = "", y = "") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The predictor distributions vary substantially across variables. Some predictors (Si, Al, and Na for example) appear relatively concentrated and roughly unimodal, while others (Ba, K, and Fe for example) show strong clustering near zero with occasional larger values, indicating strong right-skewness. The correlation heatmap shows several moderate relationships among predictors. For example, RI appears positively associated with Ca, and Al shows association with K and Ba. Some predictor pairs have weaker relationships, indicating that not all predictors are strongly related.
I will use a boxplot to easier identify outliers.
Glass |>
pivot_longer(-Type, names_to = "Predictor", values_to = "Value") |>
ggplot(aes(x = Predictor, y = Value)) +
geom_boxplot() +
coord_flip() +
labs(title = "Boxplots of Predictors",
x = "", y = "Value")
The boxplots show the presence of potential outliers in various predictors. In particular, Ca shows multiple high-value outliers, suggesting occasional large calcium concentrations. Na also exhibits some high-end outliers. The predictors Ba and Fe display strong right-skewness, with most observations concentrated near zero and a few larger values appearing as outliers.
Additionally, K shows some isolated larger values compared to its compact central distribution. However, we can also see variables such as Si and RI that seem more tightly clustered with fewer extreme deviations.
There are various predictors, particularly Ba, Fe, and K, exhibit strong right-skewness and have extreme values, suggesting that transformations may improve model performance. A log transformation would help reduce skewness while appropriately handling the many zero values.
Also, since the predictors have different numerical scales (for example, Si around 70–75 and Ba often near 0), standardizing the predictors by centering and scaling would likely improve the performance of many classification methods. So, applying transformations to reduce skewness and scaling would help create a more balanced feature space for classification.
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via: > library(mlbench) > data(Soybean) > ## See ?Soybean for details
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
freq_list <- lapply(Soybean[ , -36], table)
freq_list[1:5]
## $Class
##
## 2-4-d-injury alternarialeaf-spot
## 16 91
## anthracnose bacterial-blight
## 44 20
## bacterial-pustule brown-spot
## 20 92
## brown-stem-rot charcoal-rot
## 44 20
## cyst-nematode diaporthe-pod-&-stem-blight
## 14 15
## diaporthe-stem-canker downy-mildew
## 20 20
## frog-eye-leaf-spot herbicide-injury
## 91 8
## phyllosticta-leaf-spot phytophthora-rot
## 20 88
## powdery-mildew purple-seed-stain
## 20 20
## rhizoctonia-root-rot
## 20
##
## $date
##
## 0 1 2 3 4 5 6
## 26 75 93 118 131 149 90
##
## $plant.stand
##
## 0 1
## 354 293
##
## $precip
##
## 0 1 2
## 74 112 459
##
## $temp
##
## 0 1 2
## 80 374 199
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv
## freqRatio percentUnique zeroVar nzv
## Class 1.010989 2.7818448 FALSE FALSE
## date 1.137405 1.0248902 FALSE FALSE
## plant.stand 1.208191 0.2928258 FALSE FALSE
## precip 4.098214 0.4392387 FALSE FALSE
## temp 1.879397 0.4392387 FALSE FALSE
## hail 3.425197 0.2928258 FALSE FALSE
## crop.hist 1.004587 0.5856515 FALSE FALSE
## area.dam 1.213904 0.5856515 FALSE FALSE
## sever 1.651282 0.4392387 FALSE FALSE
## seed.tmt 1.373874 0.4392387 FALSE FALSE
## germ 1.103627 0.4392387 FALSE FALSE
## plant.growth 1.951327 0.2928258 FALSE FALSE
## leaves 7.870130 0.2928258 FALSE FALSE
## leaf.halo 1.547511 0.4392387 FALSE FALSE
## leaf.marg 1.615385 0.4392387 FALSE FALSE
## leaf.size 1.479638 0.4392387 FALSE FALSE
## leaf.shread 5.072917 0.2928258 FALSE FALSE
## leaf.malf 12.311111 0.2928258 FALSE FALSE
## leaf.mild 26.750000 0.4392387 FALSE TRUE
## stem 1.253378 0.2928258 FALSE FALSE
## lodging 12.380952 0.2928258 FALSE FALSE
## stem.cankers 1.984293 0.5856515 FALSE FALSE
## canker.lesion 1.807910 0.5856515 FALSE FALSE
## fruiting.bodies 4.548077 0.2928258 FALSE FALSE
## ext.decay 3.681481 0.4392387 FALSE FALSE
## mycelium 106.500000 0.2928258 FALSE TRUE
## int.discolor 13.204545 0.4392387 FALSE FALSE
## sclerotia 31.250000 0.2928258 FALSE TRUE
## fruit.pods 3.130769 0.5856515 FALSE FALSE
## fruit.spots 3.450000 0.5856515 FALSE FALSE
## seed 4.139130 0.2928258 FALSE FALSE
## mold.growth 7.820896 0.2928258 FALSE FALSE
## seed.discolor 8.015625 0.2928258 FALSE FALSE
## seed.size 9.016949 0.2928258 FALSE FALSE
## shriveling 14.184211 0.2928258 FALSE FALSE
## roots 6.406977 0.4392387 FALSE FALSE
The class distribution is pretty imbalanced. Some diseases like brown-spot(92), alternarialeaf-spot(91), frog-eye-leaf-spot(91), and phytophthora-rot(88) seem to be much more frequent, while others such as herbicide-injury(8), cyst-nematode(14), diaporthe-pod-&-stem-blight(15), and 2-4-d-injury(16) are relatively rare. This difference in frequency may make classification more challenging for the smaller classes.
Looking at the predictors, most do not appear degenerate. For example, plant.stand(354 vs 293) is fairly balanced, and variables like date, temp, and precip are distributed across multiple levels. While some categories occur more often than others, none of these predictors are dominated by a single level. In the end, the predictors seem usable, but the response variable shows noticeable imbalance. Predictors with highly imbalanced categories can also be problematic for classification models because they provide very little information for distinguishing between classes.
missing_pct <- colMeans(is.na(Soybean)) * 100
sort(missing_pct, decreasing = TRUE)
## hail sever seed.tmt lodging germ
## 17.7159590 17.7159590 17.7159590 17.7159590 16.3982430
## leaf.mild fruiting.bodies fruit.spots seed.discolor shriveling
## 15.8125915 15.5197657 15.5197657 15.5197657 15.5197657
## leaf.shread seed mold.growth seed.size leaf.halo
## 14.6412884 13.4699854 13.4699854 13.4699854 12.2986823
## leaf.marg leaf.size leaf.malf fruit.pods precip
## 12.2986823 12.2986823 12.2986823 12.2986823 5.5636896
## stem.cankers canker.lesion ext.decay mycelium int.discolor
## 5.5636896 5.5636896 5.5636896 5.5636896 5.5636896
## sclerotia plant.stand roots temp crop.hist
## 5.5636896 5.2708638 4.5387994 4.3923865 2.3426061
## plant.growth stem date area.dam Class
## 2.3426061 2.3426061 0.1464129 0.1464129 0.0000000
## leaves
## 0.0000000
missing_df <- tibble(
Predictor = names(missing_pct),
MissingPercent = missing_pct
)
ggplot(missing_df, aes(x = reorder(Predictor, MissingPercent),
y = MissingPercent)) +
geom_col() +
coord_flip() +
labs(title = "Percent Missing by Predictor")
Checking if the pattern of missing data is related to the class
Soybean$MissingCount <- rowSums(is.na(Soybean))
ggplot(Soybean, aes(x = Class, y = MissingCount)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Missing Values by Class")
Missing values are not evenly distributed across all predictors. Some predictors have relatively high proportions of missing data, including hail, sever, seed.tmt, and lodging each missing about 17.7% of their observations. Then, we also have predictors such as germ, leaf.mild, fruiting.bodies, fruit.spots, seed.discolor, and shriveling also have substantial missing rates (roughly 15–16%). On the contrary, some variables contain very little missing data. For example, date and area.dam—and Class and leaves have none.
The boxplot of missing values by class also suggests that the amount of missing data differs across disease categories. Some classes, such as 2-4-d-injury, cyst-nematode, herbicide-injury, and phytophthora-rot, seem to have more missing values, while many other classes have little or none. This pattern indicates that the missing data may be related to the disease type and not completely random.
Since a big part of the dataset has missing values and the pattern of missing data varies across predictors and disease classes, simply removing rows with missing values would discard a large amount of information. So, it would be better to keep the observations and handle the missing values directly.
One strategy that makes sense is to first remove predictors with very high percentages of missing values if it seems like they won’t provide meaningful information. For the remaining predictors, which are categorical, missing values can either be replaced with the most common category or just treated as a separate category just for missing values. If that is the case, treating missing values as their own category may be useful because the analysis above suggests that the presence of missing data varies across disease classes, so it has relevant information.