Exercises from Chapter 3 of the textbook Applied Predictive Modeling by Kuhn & Johnson
The histograms below show the different scales we are dealing with across the predictor variables. All use a binwidth of 0.5. Variables Ba, Fe, K, and RI have numerically very small ranges, so they don't look as varied as the other variables. At this scale, Al, Ca, Mg, Na, and Si appear roughly normally distributed, though a few show some skew and several have many zero values.
Glass %>%
  #drop the target variable, keeping only numeric predictors
  subset(select = -Type) %>%
  #reshape to long format for faceting
  gather() %>%
  ggplot(aes(value)) +
  geom_histogram(binwidth = 0.5) +
  facet_wrap(~ key, scales = "free") +
  labs(title = "Checking Distribution of Glass Predictor Variables") +
  my_plot_theme
Looking at the correlation plot below, note that any correlations not significant at a level of p = 0.05 are omitted. We see a few darker squares, the most prominent being the positive (gray) correlation between RI and Ca, which is statistically significant. The next largest positive correlation is between Ba and Al, also significant. In the negative direction, the three darkest (teal) correlations, all also significant, are Si and RI, Al and Mg, and Ba and Mg.
#drop target variable (the only non-numeric column)
Glass_num <- subset(Glass, select = -Type)
#create correlation matrix
glass_cor <- cor(Glass_num)
#get p-values for each pairwise correlation
testRes <- cor.mtest(Glass_num, conf.level = 0.95)
corrplot(glass_cor, p.mat = testRes$p, method = 'color', diag = FALSE, type = 'lower',
         sig.level = 0.05, pch.cex = 0.9, insig = 'blank',
         addCoef.col = "black",
         pch.col = 'grey20', order = 'AOE',
         number.cex = 1.5, tl.cex = 1.5, cl.cex = 1.5,
         col = colorRampPalette(c("#0b5d69", "white", "#4c4c4c"))(100))
A visual inspection of the boxplots below shows there appear to be many outliers, using the convention of flagging points more than 1.5 times the IQR beyond the quartiles. We also see skew in Ba, Fe, and K, likely due to the many zero values, which we might be able to address with a log transformation.
Glass %>%
  #boxplots only for the numeric predictors
  subset(select = -Type) %>%
  #reshape to long format for faceting
  gather() %>%
  ggplot(aes(value)) +
  geom_boxplot() +
  facet_wrap(~ key, scales = "free") +
  labs(title = "Checking Distribution of Glass Predictor Variables") +
  my_plot_theme
Using the handy diagnose_outlier() function from the dlookr package, we see each variable with its count of outliers, the associated ratio, and the mean of just the outlying values. The final two columns are very valuable, showing how the inclusion of these outliers affects the overall mean of the variable.
In our case, Ba has the largest proportion of outliers, and they affect the mean a fair amount considering the 0-3.2 scale seen in the boxplot above. In contrast, while RI has 17 outliers, the proportion is small enough, and the outliers small enough in magnitude, that the mean is essentially unchanged with or without them.
outlier <- diagnose_outlier(Glass) %>%
  arrange(desc(outliers_cnt)) %>%
  mutate_if(is.numeric, round, digits = 3)
knitr::kable(outlier)
| variables | outliers_cnt | outliers_ratio (%) | outliers_mean | with_mean | without_mean |
|---|---|---|---|---|---|
| Ba | 38 | 17.757 | 0.986 | 0.175 | 0.000 |
| Ca | 26 | 12.150 | 11.173 | 8.957 | 8.651 |
| Al | 18 | 8.411 | 2.088 | 1.445 | 1.386 |
| RI | 17 | 7.944 | 1.524 | 1.518 | 1.518 |
| Si | 12 | 5.607 | 71.824 | 72.651 | 72.700 |
| Fe | 12 | 5.607 | 0.324 | 0.057 | 0.041 |
| Na | 7 | 3.271 | 12.661 | 13.408 | 13.433 |
| K | 7 | 3.271 | 3.061 | 0.497 | 0.410 |
| Mg | 0 | 0.000 | NaN | 2.685 | 2.685 |
Due to the skewness centered around zero that was identified above, a few variables might benefit from log transformations. Let's check quickly. As predicted from the box plots, Ba, Fe, and K appear to benefit from a log transformation. Most of the other variables show some flaring at the tails of the QQ-plots but are otherwise reasonable. Mg stands out, likely due to the large number of zero values - I'd be curious to ask a content expert whether these are truly zero measurements or are meant to be NA values. A log transformation might be appropriate depending on that answer. The skewness in Al that isn't created by zero values appears to benefit from either a sqrt or a log transformation; estimating the Box-Cox lambda could help make that decision.
Glass %>% plot_normality()
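As a quick follow-up (a sketch of my own, not part of the original exercise solution), one way to estimate that lambda is caret's BoxCoxTrans(), which the textbook also relies on. It only applies to strictly positive variables, which holds for Al here; the e1071 skewness comparison is just to sanity-check the effect.
library(caret)
library(e1071)
#estimate the Box-Cox lambda for Al (all values are > 0, so Box-Cox applies)
al_bc <- BoxCoxTrans(Glass$Al)
al_bc                               #printed object reports the estimated lambda
#compare skewness before and after applying the estimated transformation
skewness(Glass$Al)
skewness(predict(al_bc, Glass$Al))
A lambda near 0 would point toward the log transformation, while a lambda near 0.5 would point toward the square root.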
Again using the skim function, we see our dataset contains 36 variables, all of factor type. There are quite a few missing values, though each variable has at least 82% of its data available.
#Soybean data comes from the mlbench package
data(Soybean)
skim_without_charts(Soybean)
| Name | Soybean |
|---|---|
| Number of rows | 683 |
| Number of columns | 36 |
| Column type frequency: factor | 36 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Class | 0 | 1.00 | FALSE | 19 | bro: 92, alt: 91, fro: 91, phy: 88 |
| date | 1 | 1.00 | FALSE | 7 | 5: 149, 4: 131, 3: 118, 2: 93 |
| plant.stand | 36 | 0.95 | TRUE | 2 | 0: 354, 1: 293 |
| precip | 38 | 0.94 | TRUE | 3 | 2: 459, 1: 112, 0: 74 |
| temp | 30 | 0.96 | TRUE | 3 | 1: 374, 2: 199, 0: 80 |
| hail | 121 | 0.82 | FALSE | 2 | 0: 435, 1: 127 |
| crop.hist | 16 | 0.98 | FALSE | 4 | 2: 219, 3: 218, 1: 165, 0: 65 |
| area.dam | 1 | 1.00 | FALSE | 4 | 1: 227, 3: 187, 2: 145, 0: 123 |
| sever | 121 | 0.82 | FALSE | 3 | 1: 322, 0: 195, 2: 45 |
| seed.tmt | 121 | 0.82 | FALSE | 3 | 0: 305, 1: 222, 2: 35 |
| germ | 112 | 0.84 | TRUE | 3 | 1: 213, 2: 193, 0: 165 |
| plant.growth | 16 | 0.98 | FALSE | 2 | 0: 441, 1: 226 |
| leaves | 0 | 1.00 | FALSE | 2 | 1: 606, 0: 77 |
| leaf.halo | 84 | 0.88 | FALSE | 3 | 2: 342, 0: 221, 1: 36 |
| leaf.marg | 84 | 0.88 | FALSE | 3 | 0: 357, 2: 221, 1: 21 |
| leaf.size | 84 | 0.88 | TRUE | 3 | 1: 327, 2: 221, 0: 51 |
| leaf.shread | 100 | 0.85 | FALSE | 2 | 0: 487, 1: 96 |
| leaf.malf | 84 | 0.88 | FALSE | 2 | 0: 554, 1: 45 |
| leaf.mild | 108 | 0.84 | FALSE | 3 | 0: 535, 1: 20, 2: 20 |
| stem | 16 | 0.98 | FALSE | 2 | 1: 371, 0: 296 |
| lodging | 121 | 0.82 | FALSE | 2 | 0: 520, 1: 42 |
| stem.cankers | 38 | 0.94 | FALSE | 4 | 0: 379, 3: 191, 1: 39, 2: 36 |
| canker.lesion | 38 | 0.94 | FALSE | 4 | 0: 320, 2: 177, 1: 83, 3: 65 |
| fruiting.bodies | 106 | 0.84 | FALSE | 2 | 0: 473, 1: 104 |
| ext.decay | 38 | 0.94 | FALSE | 3 | 0: 497, 1: 135, 2: 13 |
| mycelium | 38 | 0.94 | FALSE | 2 | 0: 639, 1: 6 |
| int.discolor | 38 | 0.94 | FALSE | 3 | 0: 581, 1: 44, 2: 20 |
| sclerotia | 38 | 0.94 | FALSE | 2 | 0: 625, 1: 20 |
| fruit.pods | 84 | 0.88 | FALSE | 4 | 0: 407, 1: 130, 3: 48, 2: 14 |
| fruit.spots | 106 | 0.84 | FALSE | 4 | 0: 345, 4: 100, 1: 75, 2: 57 |
| seed | 92 | 0.87 | FALSE | 2 | 0: 476, 1: 115 |
| mold.growth | 92 | 0.87 | FALSE | 2 | 0: 524, 1: 67 |
| seed.discolor | 106 | 0.84 | FALSE | 2 | 0: 513, 1: 64 |
| seed.size | 92 | 0.87 | FALSE | 2 | 0: 532, 1: 59 |
| shriveling | 106 | 0.84 | FALSE | 2 | 0: 539, 1: 38 |
| roots | 31 | 0.95 | FALSE | 3 | 0: 551, 1: 86, 2: 15 |
Looking at this large grid of bar charts, we get a visual sense of the number of levels for each factor variable and the proportion of NAs.
A degenerate distribution occurs when a variable takes only a single value, or when nearly all observations fall in one level and the remaining levels occur very rarely; essentially we are identifying zero or near-zero variance. In our dataset the variables that look problematic are: int.discolor, leaf.malf, leaf.mild, leaves, lodging, mycelium, mold.growth, roots, sclerotia, seed.discolor, seed.size, and shriveling. This is a long list, but many of these have only two levels, one of which dominates the dataset (a programmatic check is sketched after the plot code below).
Soybean %>%
  #drop the target variable
  subset(select = -Class) %>%
  #reshape to long format for faceting
  gather() %>%
  ggplot(aes(value)) +
  geom_bar() +
  facet_wrap(~ key, scales = "free", ncol = 3) +
  labs(title = "Checking Distribution of Soybean Predictor Variables") +
  my_plot_theme
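To complement the visual check, here is a sketch (my own addition, not part of the original write-up) of a programmatic near-zero-variance screen using caret's nearZeroVar(), which flags predictors whose most common value dominates and which have few unique values.
library(caret)
#compute frequency-ratio and unique-value metrics for every predictor
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
#predictors flagged as near-zero variance under the default thresholds
rownames(nzv)[nzv$nzv]
With the default freqCut and uniqueCut thresholds this check should line up with the degenerate variables dropped in the next step.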
In this case, my strategy would be to drop the variables with the most degenerate distributions (leaf.mild, mycelium, and sclerotia), then use k-NN methods to impute the missing values. A mean or mode doesn't make sense to me for these variables since so many have only two factor levels. Choosing k-NN means we can rely on the many complete observations and let the algorithm fill in the most likely value after learning which observations are usually grouped together. (I attempted to find a package to do this for categorical data to give it a test run, but couldn't find one, nor could I find an example in our textbook. I hope to learn how to do this soon!)
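One candidate I would want to test (an assumption on my part, not something taken from the textbook) is the kNN() function in the VIM package, which uses Gower distance and so can impute factor variables directly. A minimal sketch of that strategy:
library(VIM)
#drop the degenerate predictors identified above
soy_reduced <- subset(Soybean, select = -c(leaf.mild, mycelium, sclerotia))
#impute remaining NAs with k-NN; Gower distance handles the factor columns
soy_imputed <- kNN(soy_reduced, k = 5, imp_var = FALSE)
#confirm no missing values remain
sum(is.na(soy_imputed))
Setting imp_var = FALSE keeps the output to just the imputed columns rather than also appending TRUE/FALSE indicator columns for each variable.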