library(mlbench)
library(DataExplorer)
library(corrplot)
library(ggplot2)
library(tidyr)
library(dplyr)
library(GGally)
library(e1071)
library(caret)
library(knitr)
library(tibble)
3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Answer:
Before we proceed with the answer, let’s check if there are any missing data.
glass_df <- Glass
sum(is.na(glass_df)) # Total number of missing values
## [1] 0
colSums(is.na(glass_df)) # Missing values per column
## RI Na Mg Al Si K Ca Ba Fe Type
## 0 0 0 0 0 0 0 0 0 0
There are no missing values.
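Since DataExplorer is already loaded, plot_intro() gives a convenient one-line overview of column types and completeness as an extra sanity check:
# Quick structural overview: discrete vs. continuous columns and missing-value share
plot_intro(glass_df)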
Distributions and relationships between predictors: we will convert the dataset to long format so that histograms of all predictors can be faceted in a single plot.
glass_long <- pivot_longer(glass_df, cols = -Type, names_to = "Feature", values_to = "Value")
ggplot(glass_long, aes(x = Value)) +
geom_histogram(bins = 30, fill = "steelblue", color = "black") +
facet_wrap(~ Feature, scales = "free") +
theme_minimal() +
labs(title = "Distribution of Predictor Variables", x = "Value", y = "Count")
Histogram: most predictors, such as Al, Na, and Si, follow a roughly normal or slightly skewed distribution, while Ba, Fe, and K are highly skewed with many zero values, suggesting potential outliers or categorical-like behavior. Let’s check the relationships between predictors using ggpairs().
ggpairs(glass_df, columns = 1:9, aes(color = as.factor(Type)),
upper = list(continuous = wrap("cor", size = 3))) +
theme_minimal() +
labs(title = "Scatterplot Matrix of Glass Data Predictors")
Scatterplot Matrix (ggpairs): Ca and RI show a strong positive correlation, while Na and Mg show a moderate negative correlation.
cor_matrix <- cor(glass_df[,1:9])
corrplot(cor_matrix, method = "color", type = "lower",
tl.col = "black", tl.srt = 45, addCoef.col = "black", number.cex = 0.8,
col = colorRampPalette(c("blue", "white", "red"))(200))
Correlation Heatmap (corrplot): The heat map confirms Ca and RI as the strongest correlated pair (r = 0.81), while most other predictors have weak correlations.
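To pull the strongly correlated pairs out of the matrix programmatically rather than reading them off the plot, here is a short base-R sketch (the 0.5 cutoff is an arbitrary choice):
# List predictor pairs whose absolute correlation exceeds 0.5
high_cor <- which(abs(cor_matrix) > 0.5 & upper.tri(cor_matrix), arr.ind = TRUE)
data.frame(Var1 = rownames(cor_matrix)[high_cor[, 1]],
           Var2 = colnames(cor_matrix)[high_cor[, 2]],
           r = round(cor_matrix[high_cor], 2))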
(b) Do there appear to be any outliers in the data? Are any predictors skewed?
Boxplot: let’s use boxplots, as they help identify outliers by visualizing data spread and extreme values.
ggplot(glass_long, aes(y = Feature, x = Value)) +
geom_boxplot(width = 0.3, fill = "steelblue", outlier.color = "red", outlier.shape = 16) +
theme_minimal() +
labs(title = "Boxplot of Predictors - Detecting Outliers", x = "Value", y = "Feature") +
facet_wrap(~ Feature, scales = "free", ncol = 3) +
theme(axis.text.y = element_text(size = 12))
summary(glass_long$Value[glass_long$Feature == "Ba"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.175 0.000 3.150
var(glass_long$Value[glass_long$Feature == "Ba"])
## [1] 0.247227
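The summary shows that the median and even the third quartile of Ba are zero, so at least three quarters of its values are exactly zero; a direct check of the zero share:
# Proportion of Ba observations that are exactly zero
mean(glass_df$Ba == 0)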
Ba’s boxplot appears compressed because the majority of its values are zero, and its low variance reflects the same narrow spread. Let’s add jittering to make overlapping points more visible.
ggplot(glass_long, aes(y = Feature, x = Value)) +
geom_boxplot(width = 0.3, fill = "steelblue", outlier.color = "red", outlier.shape = 16) +
geom_jitter(aes(color = Feature), width = 0.2, alpha = 0.5) +
theme_minimal() +
labs(title = "Boxplot of Predictors - Detecting Outliers", x = "Value", y = "Feature") +
facet_wrap(~ Feature, scales = "free", ncol = 3) +
theme(axis.text.y = element_text(size = 12))
Using geom_jitter() improves the visibility of overlapping observations and makes the concentration of zero values in Ba and Fe easy to see.
Skewness: measures the asymmetry of a distribution. Values near zero indicate approximate symmetry, positive values a longer right tail, and negative values a longer left tail. Let’s compute it for each numeric predictor:
glass_df %>%
  select_if(is.numeric) %>%
  sapply(skewness) %>%
  round(4)
## RI Na Mg Al Si K Ca Ba Fe
## 1.6027 0.4478 -1.1365 0.8946 -0.7202 6.4601 2.0184 3.3687 1.7298
The boxplot confirms that Ba, Fe, K, Ca, and Al have significant outliers, aligning with their high skewness values. This indicates strong asymmetry, suggesting that these variables may require transformation before modeling.
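To quantify that visual impression, a small helper function (count_outliers, written here for illustration) counts the points flagged by the standard 1.5 × IQR boxplot rule for each predictor:
# Count observations falling outside 1.5 * IQR beyond the quartiles
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}
sapply(glass_df[, 1:9], count_outliers)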
(c) Are there any relevant transformations of one or more predictors that might improve the classification model?
Log transformation vs. Box-Cox transformation: since Ba, Fe, K, and Ca are the most heavily skewed predictors, let’s apply both transformations to them and compare the resulting skewness (Al’s milder skew is handled separately below).
glass_transformed_log <- glass_df
glass_transformed_boxcox <- glass_df
log_transform_vars <- c("Ba", "Fe", "K", "Ca")
glass_transformed_log[log_transform_vars] <- log1p(glass_transformed_log[log_transform_vars]) # Log Transformation
skewness_log <- sapply(glass_transformed_log[log_transform_vars], skewness) # Compute skewness
boxcox_vars <- c("Ba", "Fe", "K", "Ca")
preprocess_params <- preProcess(glass_transformed_boxcox[boxcox_vars], method = "BoxCox")
glass_transformed_boxcox[boxcox_vars] <- predict(preprocess_params, glass_transformed_boxcox[boxcox_vars]) # Box-Cox
skewness_boxcox <- sapply(glass_transformed_boxcox[boxcox_vars], skewness) # Compute skewness
skewness_original <- sapply(glass_df[boxcox_vars], skewness) # Original Skewness
# Create a Data Frame for Comparison
skewness_comparison <- data.frame(
Original_Skewness = round(skewness_original, 4),
After_Log = round(skewness_log, 4),
After_BoxCox = round(skewness_boxcox, 4)
) %>%
rownames_to_column(var = "Feature") # Convert row names to a column
knitr::kable(skewness_comparison, caption = "Comparing Skewness After Log vs. Box-Cox Transformations") #Print table
| Feature | Original_Skewness | After_Log | After_BoxCox |
|---|---|---|---|
| Ba | 3.3687 | 2.6886 | 3.3687 |
| Fe | 1.7298 | 1.5642 | 1.7298 |
| K | 6.4601 | 1.9504 | 6.4601 |
| Ca | 2.0184 | 1.1521 | -0.1940 |
From the comparison table above we can see that the Box-Cox transformation changed only Ca: caret’s preProcess() can estimate a Box-Cox lambda only for strictly positive variables, and Ba, Fe, and K contain zeros, so they were left untransformed. We will therefore use the log transformation for Ba, Fe, and K, a square-root transformation for the moderately right-skewed Al, and a square transformation for the left-skewed Mg and Si, then recompute the skewness values.
glass_transformed <- glass_df
# Log transformation for highly right-skewed variables (Ba, Fe, K)
log_transform_vars <- c("Ba", "Fe", "K")
glass_transformed[log_transform_vars] <- log1p(glass_transformed[log_transform_vars]) # log1p(x) = log(x + 1)
# Square root transformation for moderately right-skewed variables (Al)
glass_transformed$Al <- sqrt(glass_transformed$Al)
# Square transformation for left-skewed variables (Mg, Si)
glass_transformed$Mg <- glass_transformed$Mg^2
glass_transformed$Si <- glass_transformed$Si^2
# Apply the Box-Cox transformation to Ca only (Box-Cox requires strictly
# positive values, and Ba, Fe, and K contain zeros)
preprocess_params <- preProcess(glass_transformed["Ca"], method = "BoxCox")
glass_transformed["Ca"] <- predict(preprocess_params, glass_transformed["Ca"])
# Check new skewness after the transformations (numeric predictors only,
# since skewness() is undefined for the Type factor)
new_skewness <- sapply(glass_transformed[, 1:9], skewness)
round(new_skewness, 4)
## RI Na Mg Al Si K Ca Ba Fe
## 1.6027 0.4478 -0.8131 0.0911 -0.6509 1.9504 -0.1940 2.6886 1.5642
The transformations bring Al (0.09), Ca (-0.19), Si (-0.65), and Mg (-0.81) much closer to symmetry and reduce K’s skewness from 6.46 to 1.95; Ba and Fe remain right-skewed because of their many zero values, and RI and Na were left untransformed.
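An alternative worth sketching: preProcess() also supports the Yeo-Johnson transformation, which, unlike Box-Cox, is defined for zero and negative values, so it can be applied to Ba, Fe, and K directly:
# Yeo-Johnson accepts zeros, unlike Box-Cox, so all nine predictors can be transformed
yj_params <- preProcess(glass_df[, 1:9], method = "YeoJohnson")
glass_yj <- predict(yj_params, glass_df[, 1:9])
round(sapply(glass_yj, skewness), 4)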
3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
data(Soybean)
glimpse(Soybean)
## Rows: 683
## Columns: 36
## $ Class <fct> diaporthe-stem-canker, diaporthe-stem-canker, diaporth…
## $ date <fct> 6, 4, 3, 3, 6, 5, 5, 4, 6, 4, 6, 4, 3, 6, 6, 5, 6, 4, …
## $ plant.stand <ord> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ precip <ord> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ temp <ord> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, …
## $ hail <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, …
## $ crop.hist <fct> 1, 2, 1, 1, 2, 3, 2, 1, 3, 2, 1, 1, 1, 3, 1, 3, 0, 2, …
## $ area.dam <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 2, 3, 3, 3, 2, 2, …
## $ sever <fct> 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ seed.tmt <fct> 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, …
## $ germ <ord> 0, 1, 2, 1, 2, 1, 0, 2, 1, 2, 0, 1, 0, 0, 1, 2, 0, 1, …
## $ plant.growth <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ leaves <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ leaf.halo <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.marg <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ leaf.size <ord> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ leaf.shread <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.malf <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.mild <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stem <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ lodging <fct> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, …
## $ stem.cankers <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ canker.lesion <fct> 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ fruiting.bodies <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ext.decay <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ mycelium <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ int.discolor <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ sclerotia <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ fruit.pods <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ fruit.spots <fct> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ seed <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ mold.growth <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ seed.discolor <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ seed.size <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ shriveling <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ roots <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
library(naniar)
(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Let’s identify predictors that might be problematic by checking for zero-variance predictors (only one unique value) and near-zero-variance predictors (highly imbalanced distributions).
nzv_info <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv_info[nzv_info$nzv == TRUE, ]
## freqRatio percentUnique zeroVar nzv
## leaf.mild 26.75 0.4392387 FALSE TRUE
## mycelium 106.50 0.2928258 FALSE TRUE
## sclerotia 31.25 0.2928258 FALSE TRUE
The nearZeroVar() function identified leaf.mild, mycelium, and sclerotia as near-zero variance predictors. These features are highly imbalanced with very few unique values, making them poor predictors for classification. Removing them is recommended to simplify the dataset and improve model performance.
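A sketch of the removal step, using the column indices that nearZeroVar() returns by default:
# Drop the three near-zero variance predictors identified above
nzv_idx <- nearZeroVar(Soybean)
soybean_filtered <- Soybean[, -nzv_idx]
dim(soybean_filtered)  # 683 rows and 33 of the original 36 columns should remain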
(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
Let’s identify missing data patterns first.
missing_counts <- colSums(is.na(Soybean)) # Count missing values per column
missing_counts[missing_counts > 0] # columns with missing values only
## date plant.stand precip temp hail
## 1 36 38 30 121
## crop.hist area.dam sever seed.tmt germ
## 16 1 121 121 112
## plant.growth leaf.halo leaf.marg leaf.size leaf.shread
## 16 84 84 84 100
## leaf.malf leaf.mild stem lodging stem.cankers
## 84 108 16 121 38
## canker.lesion fruiting.bodies ext.decay mycelium int.discolor
## 38 106 38 38 38
## sclerotia fruit.pods fruit.spots seed mold.growth
## 38 84 106 92 92
## seed.discolor seed.size shriveling roots
## 106 92 106 31
gg_miss_var(Soybean) +
labs(title = "Missing Values per Predictor in Soybean Data")
The missing data analysis shows that the highest numbers of missing values occur in variables such as hail, sever, seed.tmt, germ, and lodging, each with over 100 missing entries. Several other predictors, including leaf.halo, fruit.pods, and mold.growth, have a moderate number of missing values, roughly 30 to 100. By contrast, variables such as date, area.dam, and Class have little to no missing data.
missing_class_summary <- Soybean %>%
group_by(Class) %>%
mutate(class_Total = n()) %>%
ungroup() %>%
filter(!complete.cases(.)) %>%
group_by(Class) %>%
summarize(Missing = n(), Proportion = Missing / first(class_Total)) %>% # Proportion of missing data per class
ungroup()%>%
distinct(Class, Proportion)
# Bar plot of missing data proportion by class
ggplot(missing_class_summary, aes(x = reorder(Class, Proportion), y = Proportion)) +
geom_col(fill = "steelblue") +
coord_flip() +
theme_minimal() +
labs(title = "Proportion of Missing Data by Class",
x = "Class",
y = "Proportion of Missing Data")
# Heatmap of missing values per class
missing_by_class_long <- Soybean %>%
group_by(Class) %>%
summarise(across(everything(), ~ sum(is.na(.)), .names = "missing_{.col}")) %>%
pivot_longer(cols = -Class, names_to = "Feature", values_to = "Missing_Count")
ggplot(missing_by_class_long, aes(x = Feature, y = Class, fill = Missing_Count)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "white", high = "red") +
theme_minimal() +
labs(title = "Heatmap of Missing Data by Class", x = "Feature", y = "Class") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
The missing data distribution changes once the highly missing classes are removed:
Soybean %>%
filter(!Class %in% c("phytophthora-rot", "diaporthe-pod-&-stem-blight", "cyst-nematode",
"2-4-d-injury", "herbicide-injury")) %>%
mutate(across(everything(), is.na)) %>%
pivot_longer(everything(), names_to = "variables", values_to="missing") %>%
count(variables, missing) %>%
ggplot(aes(y = variables, x=n, fill = missing))+
geom_col(position = "fill") +
labs(title = "Proportion of Missing Values with Missing Classes Removed",
x = "Proportion") +
scale_fill_manual(values=c("steelblue","red"))
The heatmap reveals that missing data is concentrated within specific disease classes rather than occurring randomly. In particular, Diaporthe-pod-&-stem-blight, Cyst-nematode, 2-4-D-injury, and Herbicide-injury have missing values in all observations, while Phytophthora-rot has around 77% missing data. This suggests that these classes may have been underrepresented in data collection or lacked certain recorded measurements.
After removing these highly missing classes, the dataset no longer contains any missing values. This indicates that the missing data issue was primarily driven by these specific classes rather than by inconsistencies across features.
(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
The best strategy for handling missing data in the Soybean dataset is to remove the five disease classes with excessive missing values: diaporthe-pod-&-stem-blight, cyst-nematode, 2-4-d-injury, herbicide-injury, and phytophthora-rot, as their high missing rates make them unreliable. Once these classes are removed, the dataset no longer contains missing values, which also eliminates the need for imputation. The trade-off is that the resulting model can no longer predict those five classes, but for the remaining classes this approach yields a clean dataset without introducing the potential biases of imputation.
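A minimal sketch of this strategy, reusing the class labels as they appear in the mlbench factor levels:
# Remove the five classes that drive the missingness, drop their now-empty
# factor levels, and verify that no missing values remain
soybean_clean <- Soybean %>%
  filter(!Class %in% c("phytophthora-rot", "diaporthe-pod-&-stem-blight",
                       "cyst-nematode", "2-4-d-injury", "herbicide-injury")) %>%
  droplevels()
sum(is.na(soybean_clean))  # expected to be 0, per the analysis above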