library(gridExtra)
library(ggplot2)
library(cowplot)
options(scipen=10000)
library(mlbench)
library(tidyverse)
library(corrplot)
library(AppliedPredictiveModeling)
library(caret)
library(DataExplorer)
library(kableExtra)
library(mice)
The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Data Overview
data(Glass)
data(Soybean)
head(Glass) |> kable() |> kable_styling() |> kable_classic(full_width = T)
| RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
|---|---|---|---|---|---|---|---|---|---|
| 1.52101 | 13.64 | 4.49 | 1.10 | 71.78 | 0.06 | 8.75 | 0 | 0.00 | 1 |
| 1.51761 | 13.89 | 3.60 | 1.36 | 72.73 | 0.48 | 7.83 | 0 | 0.00 | 1 |
| 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0 | 0.00 | 1 |
| 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0 | 0.00 | 1 |
| 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0 | 0.00 | 1 |
| 1.51596 | 12.79 | 3.61 | 1.62 | 72.97 | 0.64 | 8.07 | 0 | 0.26 | 1 |
GlassCont <- Glass |> dplyr::select(-Type)
# plotGraphs(GlassCont,FALSE)
Glass |>
keep(is.numeric) |>
gather() |>
ggplot(aes(value)) +
geom_histogram(bins = 15) +
facet_wrap(~key, scales = 'free', ncol = 2) +
ggtitle("Histograms of Numerical Predictors")
Glass |>
keep(is.numeric) |>
gather() |>
ggplot(aes(value)) +
geom_boxplot(outlier.colour="red", outlier.shape=8,outlier.size=4) +
facet_wrap(~key, scales = 'free', ncol = 2) +
ggtitle("Boxplots of Numerical Predictors")
Glass |>
ggplot() +
geom_bar(aes(x = Type)) +
ggtitle("Distribution of Types of Glass")
## Correlation Plot
GlassCont <- Glass |> dplyr::select(-Type)
correlations <- cor(GlassCont)
corrplot(correlations, order = "hclust")
highCorr <- findCorrelation(correlations, cutoff = .75)
GlassUnco <- Glass[,-highCorr]
The correlation plot shows that in general, variables are not highligh correlated with eachother. The expecion is Ca, which is Highly correlated with RI.
correlations |> kable() |> kable_styling() |> kable_classic(full_width = T)
| RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | |
|---|---|---|---|---|---|---|---|---|---|
| RI | 1.0000000 | -0.1918854 | -0.1222740 | -0.4073260 | -0.5420522 | -0.2898327 | 0.8104027 | -0.0003860 | 0.1430096 |
| Na | -0.1918854 | 1.0000000 | -0.2737320 | 0.1567937 | -0.0698088 | -0.2660865 | -0.2754425 | 0.3266029 | -0.2413464 |
| Mg | -0.1222740 | -0.2737320 | 1.0000000 | -0.4817985 | -0.1659267 | 0.0053957 | -0.4437500 | -0.4922621 | 0.0830595 |
| Al | -0.4073260 | 0.1567937 | -0.4817985 | 1.0000000 | -0.0055237 | 0.3259584 | -0.2595920 | 0.4794039 | -0.0744022 |
| Si | -0.5420522 | -0.0698088 | -0.1659267 | -0.0055237 | 1.0000000 | -0.1933309 | -0.2087322 | -0.1021513 | -0.0942007 |
| K | -0.2898327 | -0.2660865 | 0.0053957 | 0.3259584 | -0.1933309 | 1.0000000 | -0.3178362 | -0.0426181 | -0.0077190 |
| Ca | 0.8104027 | -0.2754425 | -0.4437500 | -0.2595920 | -0.2087322 | -0.3178362 | 1.0000000 | -0.1128410 | 0.1249682 |
| Ba | -0.0003860 | 0.3266029 | -0.4922621 | 0.4794039 | -0.1021513 | -0.0426181 | -0.1128410 | 1.0000000 | -0.0586918 |
| Fe | 0.1430096 | -0.2413464 | 0.0830595 | -0.0744022 | -0.0942007 | -0.0077190 | 0.1249682 | -0.0586918 | 1.0000000 |
Do there appear to be any outliers in the data? Are any predictors skewed?
The Ri variable is right-skewed, showing outliers at both ends, though the right tail is more pronounced with more extreme values. Similarly, the AI variable is right-skewed, with a more prominent right tail and outliers on both sides, but the right side contains more significant outliers.
The NA variable is also right-skewed, with four outliers on the right and two on the left. The left-side outliers are more extreme, significantly influencing the shape of the distribution and creating an imbalance in how the distribution is perceived.
In contrast, the Mg variable is left-skewed without any outliers, resulting in a smoother distribution. The SI variable shows left-skewness with outliers at both ends, but the left tail is more prominent, drawing attention to this side of the distribution.
Another version of the SI variable is right-skewed, with notable outliers in the right tail, where one specific observation is particularly extreme and stands out.
The CA variable is right-skewed with outliers on both sides, although the right tail is more prominent, a pattern observed consistently across multiple observations.
Lastly, the Ba variable has a highly sparse distribution, with most values at zero. The few non-zero values are considered outliers, making it difficult to assess the distribution meaningfully as it is dominated by zero values.
Are there any relevant transformations of one or more predictors that might improve the classification model?
Let’s transform the data with boxcox transformation to see how the model could be improved.
lambdas <- Glass |>
keep(is.numeric) |>
mutate_all(funs(BoxCoxTrans(.)$lambda)) |> head(1)
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
##
## # Simple named list: list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
##
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
lambdas
## RI Na Mg Al Si K Ca Ba Fe
## 1 -2 -0.1 NA 0.5 2 NA -1.1 NA NA
dataset <- GlassCont |> dplyr::select(-c(Mg,K,Ba,Fe))
for (i in 1:ncol(dataset)) {
transformed <- BoxCoxTrans( dataset[,i])
transformed_data <- predict(transformed,dataset[,i] )
plot_h <- ggplot(dataset) + geom_histogram(aes(x= transformed_data))+xlab(paste(names(dataset)[i], " with Lambda =",transformed$lambda))
box_p <- ggplot(dataset) + geom_boxplot(aes(x= transformed_data),outlier.colour="red", outlier.shape=8,outlier.size=4)+xlab(paste(names(dataset)[i], " with Lambda =",transformed$lambda))
grid.arrange(plot_h, box_p ,ncol=2)
}
The table summarizes the variables and their respective Box-Cox transformations. However, some variables, such as Fe, K, Mg, and Ba, contain zero values, which is problematic when applying the Box-Cox transformation, as it results in undefined values (NaN). To address this issue, we will add a constant to the dataset equals to the absolute value of the minimum value in de data plus 2, allowing us to apply the Box-Cox transformation and observe its effects.
dataset <- Glass |> dplyr::select(c(Mg,K,Ba,Fe))
for (i in 1:ncol(dataset)) {
shift_value <- abs(min( dataset[,i])) + 2
newVal <- dataset[,i] + shift_value
transformed <- BoxCoxTrans(newVal)
transformed_data <- predict(transformed,newVal )
plot_h <- ggplot(dataset) + geom_histogram(aes(x= transformed_data))+xlab(paste(names(dataset)[i], " with constant 1 and Lambda =",transformed$lambda))
box_p <- ggplot(dataset) + geom_boxplot(aes(x= transformed_data),outlier.colour="red", outlier.shape=8,outlier.size=4)+xlab(paste(names(dataset)[i], " with constant 1 and Lambda =",transformed$lambda))
grid.arrange(plot_h, box_p ,ncol=2)
}
Outliers can limit the effectiveness of transformations like Box-Cox by dominating the process, preventing significant changes in the distribution. For both the data increased by a constant and the original data, the minimal impact of the transformation may be due to the strong influence of the outliers, which skew the results and reduce the transformation’s ability to correct the data. Handling the outliers separately could improve the transformation’s effectiveness and help normalize the data.
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmen- tal conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
Data Overview
head(Soybean) |> kable() |> kable_styling() |> kable_classic(full_width = F) |> scroll_box(width = "100%")
| Class | date | plant.stand | precip | temp | hail | crop.hist | area.dam | sever | seed.tmt | germ | plant.growth | leaves | leaf.halo | leaf.marg | leaf.size | leaf.shread | leaf.malf | leaf.mild | stem | lodging | stem.cankers | canker.lesion | fruiting.bodies | ext.decay | mycelium | int.discolor | sclerotia | fruit.pods | fruit.spots | seed | mold.growth | seed.discolor | seed.size | shriveling | roots |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| diaporthe-stem-canker | 6 | 0 | 2 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 1 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| diaporthe-stem-canker | 4 | 0 | 2 | 1 | 0 | 2 | 0 | 2 | 1 | 1 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| diaporthe-stem-canker | 3 | 0 | 2 | 1 | 0 | 1 | 0 | 2 | 1 | 2 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| diaporthe-stem-canker | 3 | 0 | 2 | 1 | 0 | 1 | 0 | 2 | 0 | 1 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| diaporthe-stem-canker | 6 | 0 | 2 | 1 | 0 | 2 | 0 | 1 | 0 | 2 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| diaporthe-stem-canker | 5 | 0 | 2 | 1 | 0 | 3 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
par(mfrow=c(2, 2))
for (i in 1:ncol(Soybean)) {
var <- names(Soybean)[i]
print(ggplot(Soybean,aes_string(var))+geom_bar() + coord_flip()+ ggtitle(var))
}
Based on the charts above, leaf.mild, mycelium, and sclerotia appear to have all their values centered around a single point. We can confirm this by checking for variables with near-zero variance using the nearZeroVar function.
near_zero <- nearZeroVar(Soybean)
soybean_zero <- Soybean[,near_zero]
for (i in 1:ncol(soybean_zero)) {
var <- names(soybean_zero)[i]
print(ggplot(soybean_zero,aes_string(var))+geom_bar() + coord_flip()+ ggtitle(var))
}
As observed, most of the observations have a value of 0, with a few exceptions that account for no more than two records (ignoring NA values).
Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
missing_percentages <- colSums(is.na(Soybean)) / nrow(Soybean) * 100
missing_percentages |> kable(col.names = c("Variables","Percentage")) |> kable_styling() |> kable_classic(full_width = F)
| Variables | Percentage |
|---|---|
| Class | 0.0000000 |
| date | 0.1464129 |
| plant.stand | 5.2708638 |
| precip | 5.5636896 |
| temp | 4.3923865 |
| hail | 17.7159590 |
| crop.hist | 2.3426061 |
| area.dam | 0.1464129 |
| sever | 17.7159590 |
| seed.tmt | 17.7159590 |
| germ | 16.3982430 |
| plant.growth | 2.3426061 |
| leaves | 0.0000000 |
| leaf.halo | 12.2986823 |
| leaf.marg | 12.2986823 |
| leaf.size | 12.2986823 |
| leaf.shread | 14.6412884 |
| leaf.malf | 12.2986823 |
| leaf.mild | 15.8125915 |
| stem | 2.3426061 |
| lodging | 17.7159590 |
| stem.cankers | 5.5636896 |
| canker.lesion | 5.5636896 |
| fruiting.bodies | 15.5197657 |
| ext.decay | 5.5636896 |
| mycelium | 5.5636896 |
| int.discolor | 5.5636896 |
| sclerotia | 5.5636896 |
| fruit.pods | 12.2986823 |
| fruit.spots | 15.5197657 |
| seed | 13.4699854 |
| mold.growth | 13.4699854 |
| seed.discolor | 15.5197657 |
| seed.size | 13.4699854 |
| shriveling | 15.5197657 |
| roots | 4.5387994 |
plot_intro(Soybean, title = 'Missing Information on Soybean Dataset',
ggtheme = theme_minimal())
# Plot missing volume in Column
plot_missing(Soybean[,1:(ncol(Soybean)/2)],title = 'Information about Missing Value in Soybean dataset Part 1',ggtheme = theme_minimal())
plot_missing(Soybean[,(ncol(Soybean)/2):ncol(Soybean)],title = 'Information about Missing Value in Soybean dataset Part 2',ggtheme = theme_minimal())
sb_miss_count <- Soybean |> mutate(nul=rowSums(is.na(Soybean)))|>
group_by(Class)|> summarize(miss=sum(nul)) |> filter(miss!=0)
sb_miss_count |> kable() |> kable_styling() |> kable_classic(full_width = T)
| Class | miss |
|---|---|
| 2-4-d-injury | 450 |
| cyst-nematode | 336 |
| diaporthe-pod-&-stem-blight | 177 |
| herbicide-injury | 160 |
| phytophthora-rot | 1214 |
There is a significant imbalance in the number of missing values across these categories. For instance, phytophthora-rot has 1214 missing values, which is much higher compared to herbicide-injury (160) or diaporthe-pod-&-stem-blight (177). This could suggest that certain conditions or circumstances lead to a higher likelihood of missing values for specific categories.
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
For low missingness (<5%), including variables like Class, plant.stand, and precip, we apply simple imputation using the mean or median for numerical data and the mode for categorical data. These methods are sufficient given the minimal missing data and the low likelihood of introducing bias.
For moderate to high missingness (5%-17%), such as date, stem.cankers, germ, hail, lodging, and leaf.mild, we use advanced imputation techniques like KNN or MICE. These methods are more robust, preserving relationships between variables and minimizing bias, even when a substantial portion of data is missing. This ensures that critical variables can still be included in the analysis without distorting the underlying patterns in the data.
In this case, I will use MICE, which allows for flexible imputation based on the type of variable (e.g., numerical or categorical) and the relationships between variables. By default, MICE uses Predictive Mean Matching (PMM) for continuous variables, ensuring that imputed values are realistic by matching them to observed data. This method is particularly well-suited for moderate levels of missingness, as it maintains the integrity of the data while avoiding extreme or unrealistic imputed values.
imputed_data <- mice(Soybean, m = 5, method = 'pmm', maxit = 5, seed = 123, printFlag = FALSE)
soybean_complete_data <- complete(imputed_data, 1)
plot_intro(soybean_complete_data, title = 'Missing Information on Soybean Dataset',
ggtheme = theme_minimal())
for (i in 1:ncol(soybean_complete_data)) {
var <- names(soybean_complete_data)[i]
print(ggplot(soybean_complete_data,aes_string(var))+geom_bar() + coord_flip()+ ggtitle(var))
}