library(mlbench)
data(Glass)
library(ggplot2)
library(tidyr)
library(dplyr)
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
# Histograms of glass predictors
Glass %>%
  pivot_longer(-Type, names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "white") +
  facet_wrap(~ Variable, scales = "free") +
  labs(title = "Histograms of Glass Predictors") +
  theme(
    plot.title = element_text(hjust = 0.5)
  )
# Type frequencies
ggplot(data = Glass, aes(Type)) +
  geom_bar() +
  labs(title = "Frequencies by type") +
  theme(
    plot.title = element_text(hjust = 0.5)
  )
# Correlation plot
library(corrplot)
glass_cor <- cor(Glass[ , -10]) # Type is the 10th column
corrplot(glass_cor,
         method = "color",
         type = "lower",
         tl.col = "black",
         tl.srt = 45)
According to the histograms:
The bar plot reveals class imbalance.
According to the correlation plot:
There is a strong positive correlation between Ca and RI.
There is a significant positive correlation between the following:
There is a significant negative correlation between the following:
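The correlated pairs can also be listed directly from the correlation matrix rather than read off the plot. The following is a minimal sketch: the helper name `strong_pairs` and the 0.5 cutoff are illustrative assumptions, and the toy data frame stands in for the Glass predictors.

```r
# List predictor pairs whose absolute correlation reaches a cutoff.
# `strong_pairs` and the default cutoff of 0.5 are illustrative choices.
strong_pairs <- function(df, cutoff = 0.5) {
  cm <- cor(df)
  # keep each pair once by looking only at the lower triangle
  idx <- which(abs(cm) >= cutoff & lower.tri(cm), arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), ]  # strongest pairs first
}

# Toy data: b tracks a, c moves against a, d is unrelated noise.
set.seed(1)
a <- rnorm(200)
toy <- data.frame(a = a,
                  b = a + rnorm(200, sd = 0.3),
                  c = -a + rnorm(200, sd = 0.3),
                  d = rnorm(200))
pairs_found <- strong_pairs(toy)
print(pairs_found)
```

For the Glass data the equivalent call would be `strong_pairs(Glass[, -10])`; the sign of `r` separates the positive pairs from the negative ones.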
Do there appear to be any outliers in the data? Are any predictors skewed?
Glass %>%
  pivot_longer(-Type, names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_boxplot() +
  facet_wrap(~ Variable, scales = "free") +
  labs(title = "Boxplots of Glass Predictors") +
  theme(
    plot.title = element_text(hjust = 0.5)
  )
The boxplots confirm that Ba, Fe, and K have strong outliers: a few extreme values extend far beyond the whiskers. Ca has mild outliers at the high end. Na and RI also show mild outliers. Si has extreme values at both ends but looks fairly symmetric.
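The visual impression can be backed up by counting the points that fall beyond the whiskers. This is a rough sketch using base R's `boxplot.stats()`, which applies the 1.5 × IQR rule of base `boxplot()`. `geom_boxplot()` uses the same multiplier but computes the quartiles with `quantile()`, so counts may differ slightly. The helper name `count_outliers` and the toy data are assumptions for illustration.

```r
# Count the points a boxplot would draw beyond its whiskers, per column.
# `count_outliers` is an illustrative helper, not part of any package.
count_outliers <- function(df) {
  sapply(df, function(x) length(boxplot.stats(x)$out))
}

# Toy data: `x` has one extreme value, `y` has none.
toy <- data.frame(x = c(1, 2, 3, 4, 5, 100),
                  y = c(1, 2, 3, 4, 5, 6))
n_out <- count_outliers(toy)
print(n_out)
```

Applied to the exercise data, this would be `count_outliers(Glass[, -10])`.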
library(moments)
sapply(Glass[ , -10], skewness)
## RI Na Mg Al Si K Ca
## 1.6140150 0.4509917 -1.1444648 0.9009179 -0.7253173 6.5056358 2.0326774
## Ba Fe
## 3.3924309 1.7420068
From the skewness() output, Na, Al, and Si are the closest to symmetric, with skewness between roughly -0.7 and 0.9. RI, K, Ca, Ba, and Fe are all right-skewed to varying degrees (positive skewness), with K the most strongly right-skewed at 6.51. Mg is left-skewed (-1.14).
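For reference, the statistic reported above is the moment-based sample skewness, the third central moment divided by the second central moment raised to the power 3/2, with both moments using divisor n. The sketch below reproduces that formula in base R; `skew_manual` is a hypothetical helper name, and unlike the default of `moments::skewness()` it drops NAs.

```r
# Moment-based sample skewness: m3 / m2^(3/2), divisor n for both moments.
skew_manual <- function(x) {
  x <- x[!is.na(x)]          # simplification: drop missing values
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

sym   <- c(1, 2, 3, 4, 5)    # symmetric around 3
right <- c(0, 0, 0, 4)       # one large value in the right tail

skew_sym   <- skew_manual(sym)
skew_right <- skew_manual(right)
print(c(symmetric = skew_sym, right_tailed = skew_right))
```

A value near 0 indicates a roughly symmetric distribution, a positive value a longer right tail, and a negative value a longer left tail.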
Are there any relevant transformations of one or more predictors that might improve the classification model?
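Since RI, K, Ca, Ba, and Fe are all right-skewed, a log or Box-Cox transformation could reduce the skewness and make the distributions more symmetric. Both require strictly positive values, and K, Ba, and Fe contain zeros, so those predictors would need an offset first (for example log(x + 1)) or a Yeo-Johnson transformation instead.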
For Na, Al, and Si, a transformation is not strictly necessary since their distributions are already close to symmetric. However, Na and Al are slightly right-skewed and Si is slightly left-skewed, so a log transformation (for Na and Al) or a Box-Cox transformation may still be beneficial.
Since the predictors are on different scales, it would be good to standardize them by applying z-score standardization.
Power transformations such as log or Box-Cox do not resolve bimodality, so Mg is left untransformed.
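The two steps discussed above, reducing right skew and then standardizing, can be sketched on a toy predictor. The simulated vector below is an assumption standing in for a zero-heavy, right-skewed column; `log1p()` is used as one possible skew-reducing transformation because it is defined at zero.

```r
# Toy right-skewed predictor that contains zeros.
set.seed(42)
x <- c(rep(0, 20), rexp(80, rate = 2))

# Moment-based skewness, used only to compare before and after.
skew <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^(3 / 2)
}

x_log <- log1p(x)                  # log(1 + x): defined at zero, unlike log(x)
x_std <- as.numeric(scale(x_log))  # center to mean 0, scale to sd 1

print(c(skew_before = skew(x), skew_after = skew(x_log)))
print(c(mean = mean(x_std), sd = sd(x_std)))
```

If the caret package is available, `preProcess()` with `method = c("YeoJohnson", "center", "scale")` followed by `predict()` is one way to estimate and apply such transformations to all predictors at once.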
data(Soybean)
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
All predictors have at least two levels. Many predictors have a substantial number of NAs; the only predictor with none is leaves.
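The same two facts, number of levels and number of NAs per predictor, can be tabulated compactly instead of being read from the long summary() output. This is a small sketch on a toy data frame whose column names are borrowed from Soybean; the values are made up for illustration.

```r
# Toy stand-in for Soybean: one class column and two factor predictors.
toy <- data.frame(
  Class  = factor(c("a", "a", "b", "b", "b")),
  leaves = factor(c("0", "1", "1", "1", "1")),
  hail   = factor(c("0", NA, "0", "1", NA))
)

predictors <- toy[, names(toy) != "Class", drop = FALSE]

# One row per predictor: how many levels it has and how many values are missing.
overview <- data.frame(
  n_levels  = sapply(predictors, nlevels),
  n_missing = colSums(is.na(predictors))
)
print(overview)
```

Replacing `toy` with `Soybean` gives the overview for the real data.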
# bar plot for each predictor
vars <- names(Soybean)[names(Soybean) != "Class"]
for (v in vars) {
  # aes_string() is deprecated; .data[[v]] selects the column named in v
  p <- ggplot(Soybean, aes(x = .data[[v]])) +
    geom_bar(fill = "lightpink") +
    labs(title = paste("Distribution of", v), x = v) +
    theme(
      plot.title = element_text(hjust = 0.5)
    )
  print(p)
}
The majority of the predictors have 0 as their most common value. Most of the distributions are imbalanced; the exceptions are germ, area.dam, crop.hist, seed.tmt, and plant.stand, which are more balanced.
mycelium, sclerotia, shriveling, and leaf.mild are not strictly degenerate, since each has more than one level, but almost all of their observations fall in level 0, so they are near-degenerate. These predictors vary little and are likely to add little to the model.
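A near-degenerate predictor can be flagged with two statistics: the ratio of the most common to the second most common value, and the percentage of distinct values. The sketch below is an assumption-laden illustration: `nzv_stats` is a hypothetical helper, the cutoffs 95/5 and 10 mirror the defaults of `caret::nearZeroVar()`, and NAs are dropped before counting. The level counts for mycelium and germ are taken from the summary() output above.

```r
# Frequency ratio and percent-unique for one predictor.
# Cutoffs mirror caret::nearZeroVar() defaults; dropping NAs is a simplification.
nzv_stats <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  x <- x[!is.na(x)]
  counts <- sort(as.vector(table(x)), decreasing = TRUE)
  counts <- counts[counts > 0]  # ignore unused factor levels
  freq_ratio <- if (length(counts) > 1) counts[1] / counts[2] else Inf
  pct_unique <- 100 * length(counts) / length(x)
  list(freq_ratio = freq_ratio,
       pct_unique = pct_unique,
       near_degenerate = freq_ratio > freq_cut && pct_unique < unique_cut)
}

# Rebuild two predictors from their level counts in the summary above.
mycelium <- factor(c(rep("0", 639), rep("1", 6), rep(NA, 38)))
germ     <- factor(c(rep("0", 165), rep("1", 213), rep("2", 193), rep(NA, 112)))

m <- nzv_stats(mycelium)
g <- nzv_stats(germ)
print(unlist(m))
print(unlist(g))
```

With these cutoffs mycelium is flagged and germ is not, which matches the visual reading of the bar plots.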
Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
# Missingness by predictor
Soybean %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing") %>%
  ggplot(aes(x = reorder(Variable, -Missing), y = Missing)) +
  geom_col(fill = "limegreen") +
  coord_flip() +
  labs(title = "Missing Values by Predictor",
       x = "Predictor", y = "Number of Missing Values") +
  theme(
    plot.title = element_text(hjust = 0.5)
  )
# Missingness by predictor + class
Soybean %>%
  group_by(Class) %>%
  summarise(across(everything(), ~ mean(is.na(.))), .groups = "drop") %>%
  pivot_longer(-Class, names_to = "Variable", values_to = "PropMissing") %>%
  ggplot(aes(x = Variable, y = Class, fill = PropMissing)) +
  geom_tile() +
  scale_fill_gradient(low = "darkgreen", high = "white") +
  labs(title = "Proportion Missing by Predictor and Class",
       x = "Predictor", y = "Class", fill = "Proportion Missing") +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
The column chart shows that hail, lodging, seed.tmt, sever, and germ are the predictors with the most missing values.
The plot of missingness by class and predictor is especially helpful: it shows that missing values occur in only five classes (2-4-d-injury, phytophthora-rot, herbicide-injury, diaporthe-pod-&-stem-blight, and cyst-nematode). The values are therefore unlikely to be missing at random; the missingness is tied to the class.
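The link between class and missingness can also be checked numerically by computing, for each class, the share of rows with at least one missing predictor. The following is a minimal base R sketch on made-up toy data; the column names are borrowed from Soybean for readability only.

```r
# Toy stand-in: class "b" is the only class with missing predictor values.
toy <- data.frame(
  Class = factor(c("a", "a", "a", "b", "b", "c")),
  hail  = c("0", "1", "0", NA, NA, "1"),
  germ  = c("1", "2", "0", NA, "1", "0")
)

# TRUE for rows where at least one predictor is missing.
has_na <- !complete.cases(toy[, names(toy) != "Class"])

# Share of incomplete rows within each class.
share_by_class <- sapply(split(has_na, toy$Class), mean)
print(share_by_class)

# Classes that contain any incomplete rows.
affected <- names(share_by_class)[share_by_class > 0]
print(affected)
```

Running the same steps on `Soybean` would list the classes that contain incomplete rows.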
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
# remove the nearly-degenerate predictors
soy_new <- Soybean %>%
select(-c(mycelium, sclerotia, leaf.mild, shriveling))
# Impute by replacing all missing values with the global most common category for that column.
cols <- setdiff(names(soy_new), "Class")
soy_imputed <- soy_new %>%
  # work on character vectors so values can be assigned freely
  mutate(across(all_of(cols), as.character)) %>%
  mutate(across(all_of(cols), ~ {
    x <- .
    x0 <- x[!is.na(x)]   # observed values only
    ux <- unique(x0)
    # replace NAs with the most frequent observed value
    x[is.na(x)] <- ux[which.max(tabulate(match(x0, ux)))]
    x
  })) %>%
  mutate(across(everything(), as.factor))
colSums(is.na(soy_imputed)) # should be all zeros
## Class date plant.stand precip temp
## 0 0 0 0 0
## hail crop.hist area.dam sever seed.tmt
## 0 0 0 0 0
## germ plant.growth leaves leaf.halo leaf.marg
## 0 0 0 0 0
## leaf.size leaf.shread leaf.malf stem lodging
## 0 0 0 0 0
## stem.cankers canker.lesion fruiting.bodies ext.decay int.discolor
## 0 0 0 0 0
## fruit.pods fruit.spots seed mold.growth seed.discolor
## 0 0 0 0 0
## seed.size roots
## 0 0
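The mode-imputation step can also be written as a small reusable function that keeps each column as a factor. This is a sketch, not the code used above: `impute_mode` is a hypothetical helper, and on ties it picks the first level in level order, which may differ from the tie-breaking in the pipeline above.

```r
# Replace NAs in a factor with its most frequent observed level.
impute_mode <- function(x) {
  counts <- table(x)                              # table() ignores NA by default
  mode_level <- names(counts)[which.max(counts)]  # first level wins on ties
  x[is.na(x)] <- mode_level
  x
}

toy <- factor(c("0", "1", "0", NA, "0", NA))
imputed <- impute_mode(toy)
print(imputed)
```

It could be applied to every predictor with `soy_new[cols] <- lapply(soy_new[cols], impute_mode)`, which avoids the round trip through character vectors.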
After we impute, let’s take a look at the predictors’ distributions before and after the imputation to make sure it didn’t distort the variables.
# bar plot for each predictor post-imputation
vars2 <- names(soy_imputed)[names(soy_imputed) != "Class"]
for (v in vars2) {
  # aes_string() is deprecated; .data[[v]] selects the column named in v
  p <- ggplot(soy_imputed, aes(x = .data[[v]])) +
    geom_bar(fill = "salmon") +
    labs(title = paste("Distribution of", v, "\nPost-Imputation"), x = v) +
    theme(
      plot.title = element_text(hjust = 0.5)
    )
  print(p)
}
We can see that there are no more NAs in the predictors. After imputation there are still no strictly degenerate predictors, but mold.growth, seed.discolor, seed.size, and leaf.malf could be considered near-degenerate.
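Because mode imputation adds every former NA to the most common level, it is worth quantifying how far the level proportions shift. The sketch below compares proportions before and after on a toy factor; `compare_props` is a hypothetical helper and the data are made up.

```r
# Level proportions before and after imputation, side by side.
compare_props <- function(before, after) {
  lv <- levels(before)
  props <- function(x) {
    counts <- as.numeric(table(factor(x, levels = lv)))  # NAs are not counted
    setNames(counts / sum(counts), lv)
  }
  rbind(before = props(before), after = props(after))
}

before <- factor(c("0", "1", "0", NA, "0", NA))
after  <- factor(c("0", "1", "0", "0", "0", "0"), levels = levels(before))

props <- compare_props(before, after)
print(round(props, 2))
```

For a real predictor the call would be, for example, `compare_props(soy_new$germ, soy_imputed$germ)`; a large increase in the modal level's share indicates that imputation has pushed the variable toward degeneracy.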