Exercises from Chapter 3 of textbook Applied Predictive Modeling by Kuhn & Johnson

Exercise 3.1

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

The histograms below show the different scales we are dealing with across the predictor variables. All are drawn with a binwidth of 0.5. Variables Ba, Fe, K, and RI have numerically very small ranges of values, so they don’t look as varied as the other variables. At this scale, Al, Ca, Mg, Na, and Si show something resembling normal distributions, though a few are skewed and some have a large number of zero values.

Glass %>%
  subset(select = -Type) %>%
  #reshape data
  gather() %>% 
  ggplot(aes(value)) +
  geom_histogram(binwidth = 0.5) +
  facet_wrap(~ key, scales = "free") +
  labs(title = "Checking Distribution of Glass Predcitor Variables") +
  my_plot_theme

Looking at the correlation plot below, note that any correlations not significant at a level of p = 0.05 are omitted. We see a few darker squares, the most prominent being the positive (gray) correlation between RI and Ca, which is statistically significant. The next largest positive correlation is between Ba and Al, also significant. In the negative direction, the three darkest (teal) correlations, all also significant, are Si and RI, Al and Mg, and Ba and Mg.

#drop target variable (and non-numeric)
Glass_num <- subset(Glass, select = -Type)

#create correlation matrix
glass_cor <- cor(Glass_num)

#get p-values
testRes <- cor.mtest(Glass_num, conf.level = 0.95)

corrplot(glass_cor, p.mat = testRes$p, method = 'color', diag = FALSE, type = 'lower',
         sig.level = 0.05, pch.cex = 0.9, insig='blank',
         addCoef.col = "black", 
         pch.col = 'grey20', order = 'AOE',
         number.cex = 1.5, tl.cex = 1.5, cl.cex = 1.5,
         col=colorRampPalette(c("#0b5d69", "white", "#4c4c4c"))(100))

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

A visual inspection of the boxplots below shows there appear to be many outliers, using the convention that an outlier is a point more than 1.5 times the IQR beyond the quartiles. We have skew in Ba, Fe, and K, likely due to the zero values, which we might be able to fix with a log transformation.

Glass %>%
  #boxplots only for numeric variables, so drop Type
  subset(select = -Type) %>%
  #reshape
  gather() %>% 
  ggplot(aes(value)) +
  geom_boxplot() +
  facet_wrap(~ key, scales = "free") +
  labs(title = "Checking Distribution of Glass Predcitor Variables") +
  my_plot_theme
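
To back up the visual impression with a number, here is a quick check of sample skewness for each predictor. This is a minimal sketch that assumes the e1071 package is available; it isn't used elsewhere in this write-up.

#numeric skewness for each predictor (assumes e1071 is installed)
library(e1071)
sapply(Glass_num, skewness)

Large positive values would confirm the right skew we see for Ba, Fe, and K in the boxplots.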

Using the handy diagnose_outlier() function from the dlookr package, we see each variable with its count of outliers, the associated ratio (as a percentage of observations), and the mean of just the outlying values. The final two columns are very valuable, showing how the inclusion of these outliers affects the overall mean of the variable.

In our case, we see that Ba has the largest proportion of outliers, and they pull the mean a fair amount considering the 0–3.2 scale we see on the boxplot above. In contrast, while RI has 17 outliers, the proportion is small enough, and the outliers small enough in magnitude, that the mean is essentially unchanged with or without them.

outlier <- diagnose_outlier(Glass) %>%
   arrange(desc(outliers_cnt)) %>%
   mutate_if(is.numeric, round , digits=3)

knitr::kable(outlier)
variables outliers_cnt outliers_ratio outliers_mean with_mean without_mean
Ba 38 17.757 0.986 0.175 0.000
Ca 26 12.150 11.173 8.957 8.651
Al 18 8.411 2.088 1.445 1.386
RI 17 7.944 1.524 1.518 1.518
Si 12 5.607 71.824 72.651 72.700
Fe 12 5.607 0.324 0.057 0.041
Na 7 3.271 12.661 13.408 13.433
K 7 3.271 3.061 0.497 0.410
Mg 0 0.000 NaN 2.685 2.685

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

Due to the skewness driven by the many zero values identified above, a few variables might benefit from log transformations. Let’s check quickly. As predicted from looking at the box plots, the variables Ba, Fe, and K appear to benefit from a log transformation. Most of the other variables have some flaring at the tails on the QQ-plots but are otherwise reasonable. Mg stands out, likely due to the large number of zero values; I’d be curious to ask a content expert whether these are truly zero measurements or are meant to be NA values. A log transformation might be appropriate depending on that answer. The skewness in Al is not driven by zero values, and Al appears to benefit from either a sqrt or a log transformation; estimating a Box-Cox lambda could help make that decision.

Glass %>% plot_normality()
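
To follow up on the lambda idea, here is a minimal sketch of estimating a Box-Cox lambda for Al with caret's BoxCoxTrans() (Al is strictly positive, so Box-Cox applies) and falling back to log(1 + x) for the zero-heavy variables. The choice of these particular helpers is my own assumption, not something prescribed by the exercise.

#estimate a Box-Cox lambda for Al (all values > 0), assuming caret is installed
library(caret)
al_bc <- BoxCoxTrans(Glass$Al)
al_bc                              #prints the estimated lambda
al_trans <- predict(al_bc, Glass$Al)

#Box-Cox isn't defined at zero, so for Ba, Fe, and K use log(1 + x) instead
ba_log <- log1p(Glass$Ba)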

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

Again using the skim function, we see our dataset contains 36 variables, all of factor type. There are quite a few missing values, though each variable has at least 82% of its data available.

data(Soybean)
skim_without_charts(Soybean)
Data summary
Name Soybean
Number of rows 683
Number of columns 36
_______________________
Column type frequency:
factor 36
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Class 0 1.00 FALSE 19 bro: 92, alt: 91, fro: 91, phy: 88
date 1 1.00 FALSE 7 5: 149, 4: 131, 3: 118, 2: 93
plant.stand 36 0.95 TRUE 2 0: 354, 1: 293
precip 38 0.94 TRUE 3 2: 459, 1: 112, 0: 74
temp 30 0.96 TRUE 3 1: 374, 2: 199, 0: 80
hail 121 0.82 FALSE 2 0: 435, 1: 127
crop.hist 16 0.98 FALSE 4 2: 219, 3: 218, 1: 165, 0: 65
area.dam 1 1.00 FALSE 4 1: 227, 3: 187, 2: 145, 0: 123
sever 121 0.82 FALSE 3 1: 322, 0: 195, 2: 45
seed.tmt 121 0.82 FALSE 3 0: 305, 1: 222, 2: 35
germ 112 0.84 TRUE 3 1: 213, 2: 193, 0: 165
plant.growth 16 0.98 FALSE 2 0: 441, 1: 226
leaves 0 1.00 FALSE 2 1: 606, 0: 77
leaf.halo 84 0.88 FALSE 3 2: 342, 0: 221, 1: 36
leaf.marg 84 0.88 FALSE 3 0: 357, 2: 221, 1: 21
leaf.size 84 0.88 TRUE 3 1: 327, 2: 221, 0: 51
leaf.shread 100 0.85 FALSE 2 0: 487, 1: 96
leaf.malf 84 0.88 FALSE 2 0: 554, 1: 45
leaf.mild 108 0.84 FALSE 3 0: 535, 1: 20, 2: 20
stem 16 0.98 FALSE 2 1: 371, 0: 296
lodging 121 0.82 FALSE 2 0: 520, 1: 42
stem.cankers 38 0.94 FALSE 4 0: 379, 3: 191, 1: 39, 2: 36
canker.lesion 38 0.94 FALSE 4 0: 320, 2: 177, 1: 83, 3: 65
fruiting.bodies 106 0.84 FALSE 2 0: 473, 1: 104
ext.decay 38 0.94 FALSE 3 0: 497, 1: 135, 2: 13
mycelium 38 0.94 FALSE 2 0: 639, 1: 6
int.discolor 38 0.94 FALSE 3 0: 581, 1: 44, 2: 20
sclerotia 38 0.94 FALSE 2 0: 625, 1: 20
fruit.pods 84 0.88 FALSE 4 0: 407, 1: 130, 3: 48, 2: 14
fruit.spots 106 0.84 FALSE 4 0: 345, 4: 100, 1: 75, 2: 57
seed 92 0.87 FALSE 2 0: 476, 1: 115
mold.growth 92 0.87 FALSE 2 0: 524, 1: 67
seed.discolor 106 0.84 FALSE 2 0: 513, 1: 64
seed.size 92 0.87 FALSE 2 0: 532, 1: 59
shriveling 106 0.84 FALSE 2 0: 539, 1: 38
roots 31 0.95 FALSE 3 0: 551, 1: 86, 2: 15

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in ways discussed earlier in this chapter?

Looking at this large grid of bar charts, we get a visual sense of the number of levels for each factor variable and the proportion of NAs.

A degenerate distribution is one where the variable takes only a single value, or where a handful of unique values occur very infrequently; essentially we are identifying zero or near-zero variance. In our dataset the variables that look problematic are: int.discolor, leaf.malf, leaf.mild, leaves, lodging, mycelium, mold.growth, roots, sclerotia, seed.discolor, seed.size, and shriveling. This is a lot, but many of these have only 2 levels, with one level dominating the dataset. A numeric cross-check is sketched after the plot below.

Soybean %>%
  subset(select = -Class) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_histogram(stat = "count") +
  facet_wrap(~ key, scales = "free", ncol = 3) +
  labs(title = "Checking Distribution of Soybean Predcitor Variables") +
  my_plot_theme
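
As a numeric cross-check on the visual assessment, here is a minimal sketch using caret's nearZeroVar() with its default frequency-ratio and uniqueness cutoffs; the thresholds are caret's defaults, not values taken from the text, so the flagged set may differ slightly from my visual list.

#flag zero- and near-zero-variance predictors (assumes caret is installed)
library(caret)
nzv <- nearZeroVar(subset(Soybean, select = -Class), saveMetrics = TRUE)
nzv[nzv$nzv, ]   #show only the flagged variables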

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

In this case, my strategy would be to drop the variables with the most degenerate distributions: leaf.mild, mycelium, and sclerotia. After that I would choose to use k-NN methods to impute the missing values. A mean or mode doesn’t make sense to me for these variables, since so many have only 2 levels. Choosing k-NN means we can rely on the many complete observations and let the algorithm fill in the most likely values after learning which sorts of observations are usually grouped together. (I attempted to find a package to do this for categorical data to give it a test run, but couldn’t find one, nor could I find an example in our textbook. I hope to learn how to do this soon!)
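
One package that does appear to handle k-NN imputation for categorical data is VIM, whose kNN() function uses a Gower-type distance that works on factors. A minimal sketch, assuming VIM is installed, since this wasn't covered in the textbook:

#drop the degenerate predictors, then impute remaining NAs with k-NN (assumes VIM is installed)
library(VIM)
soy_reduced <- subset(Soybean, select = -c(leaf.mild, mycelium, sclerotia))
soy_imputed <- kNN(soy_reduced, k = 5, imp_var = FALSE)

Setting imp_var = FALSE keeps the output to the original columns rather than adding the indicator columns kNN() creates by default.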