Exercises from Chapter 3 of textbook Applied Predictive Modeling by Kuhn & Johnson

Exercise 3.1

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

The histograms below show the different scales we are dealing with across the predictor variables. All are drawn with a binwidth of 0.5. Variables Ba, Fe, K, and RI have numerically very small ranges of values, so they don’t look as varied as the other variables. At this scale, Al, Ca, Mg, Na, and Si show something resembling normal distributions, though a few are skewed and some have a large number of zero values.

Glass %>%
  subset(select = -Type) %>%
  #reshape data
  gather() %>% 
  ggplot(aes(value)) +
  geom_histogram(binwidth = 0.5) +
  facet_wrap(~ key, scales = "free") +
  labs(title = "Checking Distribution of Glass Predcitor Variables") +
  my_plot_theme

Looking at the correlation plot below, note that any correlations not significant at a level of p = 0.05 are omitted. We see a few darker squares, the most prominent being the positive (gray) correlation between RI and Ca, which is statistically significant. The next largest positive correlation is between Ba and Al, also significant. In the negative direction, the three darkest (teal) correlations, all also significant, are Si and RI, Al and Mg, and Ba and Mg.

#drop target variable (and non-numeric)
Glass_num <- subset(Glass, select = -Type)

#create correlation matrix
glass_cor <- cor(Glass_num)

#get p-values
testRes <- cor.mtest(Glass_num, conf.level = 0.95)

corrplot(glass_cor, p.mat = testRes$p, method = 'color', diag = FALSE, type = 'lower',
         sig.level = 0.05, pch.cex = 0.9, insig='blank',
         addCoef.col = "black", 
         pch.col = 'grey20', order = 'AOE',
         number.cex = 1.5, tl.cex = 1.5, cl.cex = 1.5,
         col=colorRampPalette(c("#0b5d69", "white", "#4c4c4c"))(100))

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

A visual inspection of the boxplots below shows there appear to be many outliers, using the convention that an outlier is a point more than 1.5 times the IQR beyond the quartiles. We have skew in Ba, Fe, and K, likely due to the zero values, which we might be able to fix with a log transformation.

Glass %>%
  #boxplots only for numeric variables, so drop Type
  subset(select = -Type) %>%
  #reshape
  gather() %>% 
  ggplot(aes(value)) +
  geom_boxplot() +
  facet_wrap(~ key, scales = "free") +
  labs(title = "Checking Distribution of Glass Predcitor Variables") +
  my_plot_theme
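
To back up the visual impression with a number, here is a quick check of sample skewness for each predictor. This is a minimal sketch that assumes the e1071 package is available; it isn't used elsewhere in this write-up.

#numeric skewness for each predictor (assumes e1071 is installed)
library(e1071)
sapply(Glass_num, skewness)

Large positive values would confirm the right skew we see for Ba, Fe, and K in the boxplots.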

Using the handy diagnose_outlier() function from the dlookr package, we see each variable with its count of outliers, the associated ratio (as a percentage of observations), and the mean of just the outlying values. The final two columns are very valuable, showing how the inclusion of these outliers affects the overall mean of the variable.

In our case, we see that Ba has the largest proportion of outliers, and they pull the mean a fair amount considering the 0–3.2 scale we see on the boxplot above. In contrast, while RI has 17 outliers, the proportion is small enough, and the outliers small enough in magnitude, that the mean is essentially unchanged with or without them.

outlier <- diagnose_outlier(Glass) %>%
   arrange(desc(outliers_cnt)) %>%
   mutate_if(is.numeric, round , digits=3)

knitr::kable(outlier)
variables outliers_cnt outliers_ratio outliers_mean with_mean without_mean
Ba 38 17.757 0.986 0.175 0.000
Ca 26 12.150 11.173 8.957 8.651
Al 18 8.411 2.088 1.445 1.386
RI 17 7.944 1.524 1.518 1.518
Si 12 5.607 71.824 72.651 72.700
Fe 12 5.607 0.324 0.057 0.041
Na 7 3.271 12.661 13.408 13.433
K 7 3.271 3.061 0.497 0.410
Mg 0 0.000 NaN 2.685 2.685

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

Due to the skewness driven by the many zero values identified above, a few variables might benefit from log transformations. Let’s check quickly. As predicted from looking at the box plots, the variables Ba, Fe, and K appear to benefit from a log transformation. Most of the other variables have some flaring at the tails on the QQ-plots but are otherwise reasonable. Mg stands out, likely due to the large number of zero values; I’d be curious to ask a content expert whether these are truly zero measurements or are meant to be NA values. A log transformation might be appropriate depending on that answer. The skewness in Al is not driven by zero values, and Al appears to benefit from either a sqrt or a log transformation; estimating a Box-Cox lambda could help make that decision.

Glass %>% plot_normality()
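
To follow up on the lambda idea, here is a minimal sketch of estimating a Box-Cox lambda for Al with caret's BoxCoxTrans() (Al is strictly positive, so Box-Cox applies) and falling back to log(1 + x) for the zero-heavy variables. The choice of these particular helpers is my own assumption, not something prescribed by the exercise.

#estimate a Box-Cox lambda for Al (all values > 0), assuming caret is installed
library(caret)
al_bc <- BoxCoxTrans(Glass$Al)
al_bc                              #prints the estimated lambda
al_trans <- predict(al_bc, Glass$Al)

#Box-Cox isn't defined at zero, so for Ba, Fe, and K use log(1 + x) instead
ba_log <- log1p(Glass$Ba)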

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

Again using the skim function, we see our dataset contains 36 variables, all of factor type. There are quite a few missing values, though each variable has at least 82% of its data available.

data(Soybean)
skim_without_charts(Soybean)
Data summary
Name Soybean
Number of rows 683
Number of columns 36
_______________________
Column type frequency:
factor 36
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Class 0 1.00 FALSE 19 bro: 92, alt: 91, fro: 91, phy: 88
date 1 1.00 FALSE 7 5: 149, 4: 131, 3: 118, 2: 93
plant.stand 36 0.95 TRUE 2 0: 354, 1: 293
precip 38 0.94 TRUE 3 2: 459, 1: 112, 0: 74
temp 30 0.96 TRUE 3 1: 374, 2: 199, 0: 80
hail 121 0.82 FALSE 2 0: 435, 1: 127
crop.hist 16 0.98 FALSE 4 2: 219, 3: 218, 1: 165, 0: 65
area.dam 1 1.00 FALSE 4 1: 227, 3: 187, 2: 145, 0: 123
sever 121 0.82 FALSE 3 1: 322, 0: 195, 2: 45
seed.tmt 121 0.82 FALSE 3 0: 305, 1: 222, 2: 35
germ 112 0.84 TRUE 3 1: 213, 2: 193, 0: 165
plant.growth 16 0.98 FALSE 2 0: 441, 1: 226
leaves 0 1.00 FALSE 2 1: 606, 0: 77
leaf.halo 84 0.88 FALSE 3 2: 342, 0: 221, 1: 36
leaf.marg 84 0.88 FALSE 3 0: 357, 2: 221, 1: 21
leaf.size 84 0.88 TRUE 3 1: 327, 2: 221, 0: 51
leaf.shread 100 0.85 FALSE 2 0: 487, 1: 96
leaf.malf 84 0.88 FALSE 2 0: 554, 1: 45
leaf.mild 108 0.84 FALSE 3 0: 535, 1: 20, 2: 20
stem 16 0.98 FALSE 2 1: 371, 0: 296
lodging 121 0.82 FALSE 2 0: 520, 1: 42
stem.cankers 38 0.94 FALSE 4 0: 379, 3: 191, 1: 39, 2: 36
canker.lesion 38 0.94 FALSE 4 0: 320, 2: 177, 1: 83, 3: 65
fruiting.bodies 106 0.84 FALSE 2 0: 473, 1: 104
ext.decay 38 0.94 FALSE 3 0: 497, 1: 135, 2: 13
mycelium 38 0.94 FALSE 2 0: 639, 1: 6
int.discolor 38 0.94 FALSE 3 0: 581, 1: 44, 2: 20
sclerotia 38 0.94 FALSE 2 0: 625, 1: 20
fruit.pods 84 0.88 FALSE 4 0: 407, 1: 130, 3: 48, 2: 14
fruit.spots 106 0.84 FALSE 4 0: 345, 4: 100, 1: 75, 2: 57
seed 92 0.87 FALSE 2 0: 476, 1: 115
mold.growth 92 0.87 FALSE 2 0: 524, 1: 67
seed.discolor 106 0.84 FALSE 2 0: 513, 1: 64
seed.size 92 0.87 FALSE 2 0: 532, 1: 59
shriveling 106 0.84 FALSE 2 0: 539, 1: 38
roots 31 0.95 FALSE 3 0: 551, 1: 86, 2: 15

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in ways discussed earlier in this chapter?

Looking at this large grid of bar charts, we get a visual sense of the number of levels for each factor variable and the proportion of NAs.

A degenerate distribution is one where the variable takes only a single value, or where a handful of unique values occur very infrequently; essentially we are identifying zero or near-zero variance. In our dataset the variables that look problematic are: int.discolor, leaf.malf, leaf.mild, leaves, lodging, mycelium, mold.growth, roots, sclerotia, seed.discolor, seed.size, and shriveling. This is a lot, but many of these have only 2 levels, with one level dominating the dataset. A numeric cross-check is sketched after the plot below.

Soybean %>%
  subset(select = -Class) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_histogram(stat = "count") +
  facet_wrap(~ key, scales = "free", ncol = 3) +
  labs(title = "Checking Distribution of Soybean Predcitor Variables") +
  my_plot_theme
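
As a numeric cross-check on the visual assessment, here is a minimal sketch using caret's nearZeroVar() with its default frequency-ratio and uniqueness cutoffs; the thresholds are caret's defaults, not values taken from the text, so the flagged set may differ slightly from my visual list.

#flag zero- and near-zero-variance predictors (assumes caret is installed)
library(caret)
nzv <- nearZeroVar(subset(Soybean, select = -Class), saveMetrics = TRUE)
nzv[nzv$nzv, ]   #show only the flagged variables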

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

In this case, my strategy would be to drop the variables with the most degenerate distributions: leaf.mild, mycelium, and sclerotia. After that I would choose to use k-NN methods to impute the missing values. A mean or mode doesn’t make sense to me for these variables, since so many have only 2 levels. Choosing k-NN means we can rely on the many complete observations and let the algorithm fill in the most likely values after learning which sorts of observations are usually grouped together. (I attempted to find a package to do this for categorical data to give it a test run, but couldn’t find one, nor could I find an example in our textbook. I hope to learn how to do this soon!)
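
One package that does appear to handle k-NN imputation for categorical data is VIM, whose kNN() function uses a Gower-type distance that works on factors. A minimal sketch, assuming VIM is installed, since this wasn't covered in the textbook:

#drop the degenerate predictors, then impute remaining NAs with k-NN (assumes VIM is installed)
library(VIM)
soy_reduced <- subset(Soybean, select = -c(leaf.mild, mycelium, sclerotia))
soy_imputed <- kNN(soy_reduced, k = 5, imp_var = FALSE)

Setting imp_var = FALSE keeps the output to the original columns rather than adding the indicator columns kNN() creates by default.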