All exercises can be found in “Applied Predictive Modelling” written by Max Kuhn and Kjell Johnson.
The following packages are required to rerun this .rmd
file:
seed_num <- 200
library(tidyverse)
library(AppliedPredictiveModeling)
library(corrplot)
library(mlbench)
library(e1071)
library(caret)
library(fpp3)
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labelled as one of seven class categories. There are nine predictors, including the refractive index and percentage of the eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Using visualizations, explore the predictor variables to understand these distributions as well as the relationships between predictors.
First, the cell below separates the data in Glass into
its predictors (X) and target (y):
X <- select(Glass, -Type)
y <- Glass$Type
The cell below plots histograms of all the predictor variables in order to view their distributions:
plt_data <- X %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "value")
ggplot(plt_data, aes(x = value)) +
geom_histogram(fill = 'steel blue', bins=30) +
facet_wrap(~variable, scales = "free") +
labs(title = "Histograms of All Predictors", x = "Value", y = "Frequency") +
theme_minimal()
Next, the above plot is recreated, but this time includes separate distributions for each category of the target variable:
plt_data <- Glass %>%
pivot_longer(cols = -Type, names_to = "variable", values_to = "value")
ggplot(plt_data, aes(x = value, fill = Type)) +
geom_histogram(position = "identity", alpha = 0.5, bins = 30) +
facet_wrap(~variable, scales = "free") +
labs(title = "Histograms of All Predictors (Separated by Class)",
x = "Value",
y = "Frequency") +
theme_minimal()
To evaluate the relationships between the predictors themselves, the cell below creates a correlation plot:
cor_matrix <- cor(X)
corrplot(cor_matrix, method = "color", type = "upper",
tl.col = "black", tl.srt = 45, addCoef.col = "black")
Do there appear to be any outliers in the data? Are any predictors skewed?
The plots produced in part (a) reveal a lot about the structure of
the Glass dataset and how the predictor variables relate to
themselves and the target variable, Type. The histogram of
the predictors reveal that fields Al, Ca, Mg, are the ones that most
closely follow a normal distribution while the rest all have some level
of skew. This skewness is not equal however amongst the remaining
predictors, with fields Fe and Ba being the worst offenders. We can
check statistically which of the predictors can be considered normal
using a Shapiro-Wilk test:
col_tests <- function(X){
# This function takes in a dataframe X as an input and returns the following
# for each of its columns:
# - Determines whether or not the data is normal
# - Determines the skewness of the data
# initialize empty dataframe for values
col_tests <- data.frame(
Variable = character(),
SW_P_Value = numeric(),
Normal = character(),
Skewness = numeric(),
High_Skew = character()
)
# loop through dataframe columns
for (column in names(X)){
# perform shapiro wilk test and calculate skewness
vals <- as.numeric(na.omit(X[[column]]))
p_val <- shapiro.test(vals)$p.value
skew <- skewness(vals)
# determine if data is normal
normal <- FALSE
if (p_val > 0.05){
normal <- TRUE
}
# determine if data is highly skewed
high_skew <- FALSE
if (skew > 2 | skew < -2){
high_skew <- TRUE
}
# append data to col_tests dataframe
row_to_add <- data.frame(
Variable = column,
P_Value = p_val,
Normal = normal,
Skewness = skew,
High_Skew = high_skew
)
col_tests <- rbind(col_tests, row_to_add)
}
return(col_tests)
}
col_tests(X)
## Variable P_Value Normal Skewness High_Skew
## 1 RI 1.076671e-12 FALSE 1.6027151 FALSE
## 2 Na 3.465543e-07 FALSE 0.4478343 FALSE
## 3 Mg 2.390921e-19 FALSE -1.1364523 FALSE
## 4 Al 2.083156e-07 FALSE 0.8946104 FALSE
## 5 Si 2.175032e-09 FALSE -0.7202392 FALSE
## 6 K 2.172188e-25 FALSE 6.4600889 TRUE
## 7 Ca 4.286584e-16 FALSE 2.0184463 TRUE
## 8 Ba 5.383302e-26 FALSE 3.3686800 TRUE
## 9 Fe 1.156668e-20 FALSE 1.7298107 FALSE
Using the Shapiro-Wilk test, it appears as though none of the variables can be considered normal. Despite this, only three predictors were categorized as having high skew.
The histograms of each predictor also reveal that a number of them likely contain outliers. In particular, the Ba, Fe, and K fields all appear to have values that are far outside the area of the main distribution. We can check for this quantitatively by determining if any of the fields have values that are outside the third or first quartile by one and half times their inter-quartile range (IQF):
outlier_test <- function(X){
# This function takes in a dataframe X as an input and returns the number of
# outliers in each column.
# initialize empty dataframe for values
outlier_test <- data.frame(
Variable = character(),
Outliers = numeric(),
Outlier_Percentage = numeric()
)
# loop through dataframe columns
for (column in names(X)){
# determine the 1st and 3rd quartiles and the IQR
quar_1 <- quantile(X[[column]])[2]
quar_3 <- quantile(X[[column]])[4]
iqr_range <- quar_3 - quar_1
# check for outliers
outliers <- 0
for (val in X[[column]]){
min_limit <- quar_1 - (1.5 * iqr_range)
max_limit <- quar_3 + (1.5 * iqr_range)
if( val < min_limit | val > max_limit){
outliers <- outliers + 1
}
}
# determine the fraction of values in the column that are outliers
outlier_frac <- round((outliers / nrow(X)) * 100, 2)
# append data to outlier_test dataframe
row_to_add <- data.frame(
Variable = column,
Outliers = outliers,
Outlier_Percentage = outlier_frac
)
outlier_test <- rbind(outlier_test, row_to_add)
}
return(outlier_test)
}
outlier_test(X)
## Variable Outliers Outlier_Percentage
## 1 RI 17 7.94
## 2 Na 7 3.27
## 3 Mg 0 0.00
## 4 Al 18 8.41
## 5 Si 12 5.61
## 6 K 7 3.27
## 7 Ca 26 12.15
## 8 Ba 38 17.76
## 9 Fe 12 5.61
The results above show that only one of the predictors does not contain outliers, with the percentage of outliers in a column reaching as high as 17.75%.
Looking at the histograms of each predictor separated by class provides insight into which of the predictors might actually be useful when making predictions. For example, the Al, Na, and Si fields exhibit a range of distributions that appear shifted from one another by class. This means that the information they contain could be useful in determining the category that new observations are a part of.
Lastly, the correlation plot produced above reveals the the predictors do not seem to suffer from a high degree of multicollinearity. Typically, predictor pairs with correlations greater than 0.7 are considered too high, and the only pair of variables that breaks that threshold is RI and Ca.
Are there any relevant transformations of one or more predictors that might improve the classification model?
Part (b) emphasized two possible issues present with the predictor variables: the presence of outliers and high skewness.
Box-Cox transformations can be used to help quell the possible negative effects of skewness by stabilizing the variance and keeping it closer to constant across the entire range of values in a field. Box-Cox transformations can only be performed if all values are positive, which means this transformation needs to come before any centering and scaling that would transform the mean to 0.
# k ca ba
BC_K <- BoxCoxTrans(X$K + 0.0001) # need to add a small bit to avoid 0 values
BC_Ca <- BoxCoxTrans(X$Ca)
BC_Ba <- BoxCoxTrans(X$Ba)
X$K <- predict(BC_K, X$K)
X$Ca <- predict(BC_Ca, X$Ca)
X$Ba <- predict(BC_Ba, X$Ba)
The output below shows that the Box-Cox transformations resulted in two of the three problematic predictors no longer having high skew.
col_tests(X)
## Variable P_Value Normal Skewness High_Skew
## 1 RI 1.076671e-12 FALSE 1.60271508 FALSE
## 2 Na 3.465543e-07 FALSE 0.44783426 FALSE
## 3 Mg 2.390921e-19 FALSE -1.13645228 FALSE
## 4 Al 2.083156e-07 FALSE 0.89461042 FALSE
## 5 Si 2.175032e-09 FALSE -0.72023921 FALSE
## 6 K 4.893872e-15 FALSE -0.06559656 FALSE
## 7 Ca 1.133295e-11 FALSE -0.19395573 FALSE
## 8 Ba 5.383302e-26 FALSE 3.36867997 TRUE
## 9 Fe 1.156668e-20 FALSE 1.72981071 FALSE
To help inhibit the influence of outliers, a spatial sign
transformation can be applied. Spatial sign transformations transform
each data point to a unit vector, which scales their magnitudes down to
1 effectively neutralizing their extreme values. Because all but 1
predictor contained outliers, we can apply this transformation to the
entire dataset. However, the data also needs to be centered and scaled
(transformed so that the mean \(\mu\)
and standard deviation \(\sigma\) of
each column are 0 and 1, respectively) before doing so. To perform the
scaling and spatial sign transformation all in one pipeline process, we
can invoke the preProcess function (from the
caret library):
pipeline <- preProcess(X, method = c("center", "scale", "spatialSign"))
X <- predict(pipeline, X)
The output below reveals that this transformation significantly reduced the number of outliers in the dataset:
outlier_test(X)
## Variable Outliers Outlier_Percentage
## 1 RI 1 0.47
## 2 Na 0 0.00
## 3 Mg 0 0.00
## 4 Al 0 0.00
## 5 Si 0 0.00
## 6 K 0 0.00
## 7 Ca 0 0.00
## 8 Ba 28 13.08
## 9 Fe 0 0.00
The combination of these transformations should result in a dataset with a higher degree of predictive power.
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g. temperature, precipitation) and plant conditions (e.g. left spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:
data(Soybean)
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
The cell below uses the col_tests function created
earlier in order to test the predictor columns of the
Soybean dataset:
X <- select(Soybean, -Class)
y <- Soybean$Class
col_tests(X) %>% arrange(by_group = desc(Skewness))
## Variable P_Value Normal Skewness High_Skew
## 1 mycelium 3.793219e-48 FALSE 10.19921824 TRUE
## 2 sclerotia 1.631058e-46 FALSE 5.39870500 TRUE
## 3 leaf.mild 1.703066e-42 FALSE 3.95290557 TRUE
## 4 shriveling 1.181246e-42 FALSE 3.49157641 TRUE
## 5 int.discolor 5.735058e-43 FALSE 3.33861193 TRUE
## 6 lodging 8.570681e-42 FALSE 3.22582942 TRUE
## 7 leaf.malf 7.907694e-43 FALSE 3.21564565 TRUE
## 8 seed.size 1.678775e-41 FALSE 2.66303034 TRUE
## 9 seed.discolor 1.177280e-40 FALSE 2.47154019 TRUE
## 10 roots 1.149566e-40 FALSE 2.45781443 TRUE
## 11 mold.growth 6.053615e-41 FALSE 2.43281985 TRUE
## 12 fruit.pods 1.001294e-34 FALSE 1.83817833 FALSE
## 13 leaf.shread 7.396918e-39 FALSE 1.80367508 FALSE
## 14 ext.decay 8.899983e-38 FALSE 1.69543723 FALSE
## 15 fruiting.bodies 3.411697e-38 FALSE 1.65939253 FALSE
## 16 seed 4.003077e-38 FALSE 1.53904600 FALSE
## 17 hail 1.807481e-36 FALSE 1.30690508 FALSE
## 18 fruit.spots 5.695136e-31 FALSE 0.94650965 FALSE
## 19 seed.tmt 1.398438e-29 FALSE 0.73966698 FALSE
## 20 plant.growth 1.310802e-36 FALSE 0.67949699 FALSE
## 21 stem.cankers 1.676351e-33 FALSE 0.60983130 FALSE
## 22 canker.lesion 9.881052e-29 FALSE 0.51457211 FALSE
## 23 leaf.marg 1.407673e-33 FALSE 0.46484621 FALSE
## 24 plant.stand 6.291546e-35 FALSE 0.18896734 FALSE
## 25 sever 4.309345e-28 FALSE 0.17391297 FALSE
## 26 area.dam 2.684910e-24 FALSE 0.01799005 FALSE
## 27 germ 6.842615e-26 FALSE -0.08680952 FALSE
## 28 temp 6.455716e-29 FALSE -0.15829545 FALSE
## 29 stem 1.999130e-35 FALSE -0.22581409 FALSE
## 30 leaf.size 1.340863e-28 FALSE -0.24946067 FALSE
## 31 date 4.350968e-17 FALSE -0.30397011 FALSE
## 32 crop.hist 2.525308e-24 FALSE -0.39757148 FALSE
## 33 leaf.halo 6.950384e-33 FALSE -0.41080342 FALSE
## 34 precip 1.015203e-35 FALSE -1.41630633 FALSE
## 35 leaves 2.331813e-43 FALSE -2.44354029 TRUE
We see that 11 of the 35 rows exhibit a high degree of skewness, and none of the predictor distributions are considered normal according to a Shapiro-Wilk test.
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
The following outputs the percentage of values in each field that are missing:
missing_data_cols <- data.frame(
Column = names(X),
MissingValues =
round(sort(colSums(is.na(X)), decreasing = TRUE) / nrow(X), 4) * 100
)
rownames(missing_data_cols) <- NULL
ggplot(missing_data_cols,
aes(x = reorder(Column, MissingValues), y = MissingValues)) +
geom_bar(stat = "identity", fill = "steel blue") +
labs(
title = "% of Values Missing Per Column",
x = "Predictor Name",
y = "% Missing Values") +
coord_flip() +
theme_minimal()
Based on the above plot, there are clearly some predictors with a much higher degree of missing data compared to others. It also looks as though predictors that might have come from the same source (i.e. weather based predictors, leaf measurements, mycology based predictors) all have constant levels of missing information.
Next, the cell below evaluates missing values by class:
missing_data_class <- Soybean %>%
group_by(Class) %>%
summarise(MissingValues = sum(is.na(across(everything()))),
TotalValues = n())
missing_data_class$TotalValues <- missing_data_class$TotalValues * ncol(X)
missing_data_class$FracMissing <- missing_data_class$MissingValues / missing_data_class$TotalValues
ggplot(missing_data_class,
aes(x = reorder(Class, FracMissing), y = FracMissing)) +
geom_bar(stat = "identity", fill='steel blue') +
labs(title = "Missing Values by Class", x = "Class", y = "Number of Missing Values") +
coord_flip()
The plot above is much more enlightening as to the cause of the missing data. It appears as though all missing data comes from observations of five different class, each seeming to represent some kind of injury or disease (i.e. cyst, rot, blight, etc.). As such, it might have been impossible to take measurements of these predictors while the soybean plants are affected with any of these conditions.
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
Since all of the predictors have some level of missing data, it would likely be incorrect to remove any of them. As such, a data imputation strategy is necessary. Because each of the predictor fields are categorical, a mode imputation strategy will likely be effective. More specifically, because the missing values seem to be class driven, we can implement mode imputation by class in which missing values in each class are imputed with the most frequent value within that class:
# develop function to calc mode that can be used by mutate later on
Mode <- function(x) {
ux <- unique(na.omit(x))
ux[which.max(tabulate(match(x, ux)))]
}
# mode imputation by class
imputed <- Soybean %>%
group_by(Class) %>%
mutate(across(everything(), ~ ifelse(is.na(.), Mode(.), .)))
The cell above shows how this imputation strategy can be implemented in R.