library(tidyverse)
library(mlbench)
library(corrplot)
library(MASS)
In this document we work through Exercises 3.1 and 3.2 from Applied Predictive Modeling by Kuhn and Johnson.
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
Below we use histograms to visualize how each predictor is distributed, boxplots to explore outliers, and a correlation plot to examine the relationships among the variables.
data(Glass)
# Drop the class label so that only the nine predictors remain
Glass <- Glass |>
  dplyr::select(-Type)
# Histograms of each predictor to examine the shape of its distribution
par(mfrow = c(2, 5))
for (i in names(Glass))
  hist(Glass[[i]], main = i, xlab = "Value")
# Boxplots of each predictor to highlight outliers
par(mfrow = c(2, 5))
for (i in names(Glass))
  boxplot(Glass[[i]], main = i, xlab = "Value")
# Correlation matrix of the predictors, visualized with corrplot
Glass |>
  cor() |>
  corrplot()
Outliers are very apparent in almost every distribution besides Mg and Na. The extreme frequency of outliers in the other variables is a consequence of their skewness: Fe, Ba, and K are all strongly right-skewed, with Ca slightly less so in comparison, while Mg is bimodal and fairly left-skewed. It is also helpful to note that many of these skews come from a capping effect: these variables cannot go below 0, so 0 becomes a mode.
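To quantify this, we can compute the sample skewness of each predictor. The short sketch below assumes the e1071 package is installed; large positive values correspond to the right-skewed variables noted above.
# Sample skewness of each predictor; large positive values indicate strong right skew
sapply(Glass, e1071::skewness)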
A Box-Cox transformation would help normalize the strong skew in K, Ca, Ba, and Fe. Mg, however, could not be normalized this way: no single monotone transformation will reshape a bimodal distribution into an approximately normal one while preserving the data's integrity.
# Shift and scale Ba away from zero, then profile the Box-Cox likelihood
b <- boxcox(lm((Glass$Ba + 1) * 100 ~ 1))
lambda <- -1.9  # chosen value of lambda for the transformation
# Apply the Box-Cox transform with the chosen lambda and inspect the result
hist((((Glass$Ba + 1) * 100) ^ lambda - 1) / lambda)
Yet upon attempting to Box-Cox transform Ba we run into a hurdle: the transform cannot handle values of 0, so the data must first be shifted (and here also scaled) away from 0. Even with these additive and multiplicative adjustments, the variable remains too skewed to transform properly. It should therefore be noted that Box-Cox transformations, even when lambda can be optimized automatically, are not a guaranteed cure for normalizing a distribution.
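For comparison, the automatic optimization does work on a strictly positive predictor such as Ca. The sketch below assumes the caret package is installed and uses its BoxCoxTrans() helper; Ba, with its many zeros, cannot be handled the same way.
# caret::BoxCoxTrans estimates lambda automatically, but only for strictly positive data
ca_bc <- caret::BoxCoxTrans(Glass$Ca)
hist(predict(ca_bc, Glass$Ca), main = "Ca after Box-Cox", xlab = "Value")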
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
data(Soybean)
# Drop the class label and summarize the 35 categorical predictors
Soybean <- Soybean |>
  dplyr::select(-Class)
summary(Soybean)
## date plant.stand precip temp hail crop.hist
## 5 :149 0 :354 0 : 74 0 : 80 0 :435 0 : 65
## 4 :131 1 :293 1 :112 1 :374 1 :127 1 :165
## 3 :118 NA's: 36 2 :459 2 :199 NA's:121 2 :219
## 2 : 93 NA's: 38 NA's: 30 3 :218
## 6 : 90 NA's: 16
## (Other):101
## NA's : 1
## area.dam sever seed.tmt germ plant.growth leaves leaf.halo
## 0 :123 0 :195 0 :305 0 :165 0 :441 0: 77 0 :221
## 1 :227 1 :322 1 :222 1 :213 1 :226 1:606 1 : 36
## 2 :145 2 : 45 2 : 35 2 :193 NA's: 16 2 :342
## 3 :187 NA's:121 NA's:121 NA's:112 NA's: 84
## NA's: 1
##
##
## leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem lodging
## 0 :357 0 : 51 0 :487 0 :554 0 :535 0 :296 0 :520
## 1 : 21 1 :327 1 : 96 1 : 45 1 : 20 1 :371 1 : 42
## 2 :221 2 :221 NA's:100 NA's: 84 2 : 20 NA's: 16 NA's:121
## NA's: 84 NA's: 84 NA's:108
##
##
##
## stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor
## 0 :379 0 :320 0 :473 0 :497 0 :639 0 :581
## 1 : 39 1 : 83 1 :104 1 :135 1 : 6 1 : 44
## 2 : 36 2 :177 NA's:106 2 : 13 NA's: 38 2 : 20
## 3 :191 3 : 65 NA's: 38 NA's: 38
## NA's: 38 NA's: 38
##
##
## sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor
## 0 :625 0 :407 0 :345 0 :476 0 :524 0 :513
## 1 : 20 1 :130 1 : 75 1 :115 1 : 67 1 : 64
## NA's: 38 2 : 14 2 : 57 NA's: 92 NA's: 92 NA's:106
## 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## seed.size shriveling roots
## 0 :532 0 :539 0 :551
## 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
A categorical predictor is considered degenerate (near-zero variance) when two conditions both hold: the fraction of unique values relative to the sample size is below roughly 10%, and the frequency of the second most prevalent value is only about 5% of that of the most prevalent value (a ratio of roughly 20 to 1). Every predictor here satisfies the first condition, since each has only a handful of levels. Both conditions hold, however, only for mycelium, sclerotia, shriveling, and leaf.mild. We should consider removing these predictors, as they may hurt the predictive power of a model that is sensitive to near-zero-variance predictors.
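These two cutoffs can be checked directly; the sketch below assumes the caret package is installed and uses its nearZeroVar() function, whose default freqCut of 95/5 and uniqueCut of 10 correspond to the criteria above.
# Frequency ratio, percent unique values, and near-zero-variance flag per predictor
caret::nearZeroVar(Soybean, saveMetrics = TRUE)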
The groups of predictors most likely to be missing are those describing the fruit, seed, and leaf attributes. There is a clear pattern of missingness related to the classes. Looking at the documentation for the data, we see that for many of these soybean attributes there is no level available to indicate the absence of the feature. For example, the leaf attributes almost all share the same count of 84 NAs, and the same is true of the count of 38 that recurs among the stem and canker attributes. It is very likely that most of these NAs encode the absence of the feature in question, which is important information in itself.
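A quick tally of missing values per predictor makes these repeated block totals easy to see:
# Count missing values in each predictor; repeated totals (e.g., 84 for the leaf
# attributes, 38 for the stem and canker attributes) suggest variables that go missing together
sort(colSums(is.na(Soybean)), decreasing = TRUE)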
For the missing values we have deemed meaningful, i.e., those that indicate the absence of an attribute, we should recode the NAs as a new factor level corresponding to absence. Some NAs will disappear anyway when the degenerate predictors (mycelium, sclerotia, shriveling, and leaf.mild) are removed. Finally, any remaining NAs that fall into neither of these categories can be imputed through KNN imputation.
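A minimal sketch of the first two steps is below, assuming forcats 1.0.0 or later for fct_na_value_to_level(); in practice only the columns whose NAs truly encode absence would be recoded this way, with the rest left for imputation.
# Drop the degenerate predictors, then recode NAs as an explicit "absent" level.
# Note: this recodes every remaining NA; a real workflow would restrict the recoding
# to the absence-type columns and impute the rest (e.g., with KNN).
Soybean_clean <- Soybean |>
  dplyr::select(-mycelium, -sclerotia, -shriveling, -leaf.mild) |>
  mutate(across(everything(), ~ forcats::fct_na_value_to_level(.x, level = "absent")))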