Variable Inspection
library(mlbench)
library(tidyverse)
library(gridExtra)
data(Glass)
glass_pred <- Glass[1:9]
plot_it <- function(col){
ggplot(Glass) + geom_histogram(aes_string(x = col), color = 'black', fill = 'gray')
}
plots <- lapply(colnames(glass_pred), plot_it)
grid.arrange(grobs = plots, ncol = 3, nrow = 3)
The notable features of the histograms are the heavy skew in several predictors and the visible clusters of exact zeros in variables such as Mg, K, Ba, and Fe.
library(GGally)
ggpairs(glass_pred, upper ='blank')
There are noticeable outliers and unusual patterns in the pairwise plots, most prevalent when at least one of the variables has a high proportion of zeros. In the Mg-Ba plot, for instance, there are three distinct patterns: a line where Mg is 0, a line where Ba is 0, and an inverted-U shape where both are non-zero.
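To make that Mg-Ba pattern easier to see, here is a quick sketch (not part of the original plots) of just that panel:
# zoom in on the Mg-Ba relationship discussed above
ggplot(Glass, aes(x = Mg, y = Ba)) + geom_point(alpha = 0.5)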
PCA
library(caret)
# preProcess with no method specified defaults to centering and scaling
trans <- preProcess(x = glass_pred)
glass_cs <- predict(trans, glass_pred)
pca_obj <- prcomp(glass_cs)
pca_obj$rotation
## PC1 PC2 PC3 PC4 PC5
## RI -0.5451766 0.28568318 -0.0869108293 -0.14738099 0.073542700
## Na 0.2581256 0.27035007 0.3849196197 -0.49124204 -0.153683304
## Mg -0.1108810 -0.59355826 -0.0084179590 -0.37878577 -0.123509124
## Al 0.4287086 0.29521154 -0.3292371183 0.13750592 -0.014108879
## Si 0.2288364 -0.15509891 0.4587088382 0.65253771 -0.008500117
## K 0.2193440 -0.15397013 -0.6625741197 0.03853544 0.307039842
## Ca -0.4923061 0.34537980 0.0009847321 0.27644322 0.188187742
## Ba 0.2503751 0.48470218 -0.0740547309 -0.13317545 -0.251334261
## Fe -0.1858415 -0.06203879 -0.2844505524 0.23049202 -0.873264047
## PC6 PC7 PC8 PC9
## RI -0.11528772 -0.08186724 -0.75221590 -0.02573194
## Na 0.55811757 -0.14858006 -0.12769315 0.31193718
## Mg -0.30818598 0.20604537 -0.07689061 0.57727335
## Al 0.01885731 0.69923557 -0.27444105 0.19222686
## Si -0.08609797 -0.21606658 -0.37992298 0.29807321
## K 0.24363237 -0.50412141 -0.10981168 0.26050863
## Ca 0.14866937 0.09913463 0.39870468 0.57932321
## Ba -0.65721884 -0.35178255 0.14493235 0.19822820
## Fe 0.24304431 -0.07372136 -0.01627141 0.01466944
rot <- as.data.frame(round(pca_obj$rotation, 3)) %>%
rownames_to_column(var = 'Var') %>%
gather(key = 'PC', value = 'Val', -Var)
ggplot(rot) + geom_tile(aes(x = PC, y = Var, fill = Val))
screeplot(pca_obj)
# proportion of variance explained by each component
pca_obj$sdev^2/sum(pca_obj$sdev^2)
## [1] 0.2790181918 0.2277857983 0.1560937771 0.1286513829 0.1015558052
## [6] 0.0586261325 0.0409953826 0.0070947720 0.0001787575
Only PC5 is dominated by a single variable (Fe); the other components mix several predictors, which points to some collinearity. The scree plot doesn’t have an obvious elbow; the largest dropoffs are between PC2 and PC3 and between PC5 and PC6.
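The cumulative proportion of variance makes the same point; a quick sketch using the object computed above:
# cumulative proportion of variance explained, to go with the scree plot
cumsum(pca_obj$sdev^2) / sum(pca_obj$sdev^2)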
Correlation
library(corrplot)
cor_mat <- cor(glass_pred)
corrplot(cor_mat, order = 'hclust')
Ca and RI, the variables that dominate PC1, are the most highly correlated pair. There is a pocket of collinearity involving RI, Ca, Mg, and K, but apart from the Ca-RI pair these correlations are mild, none exceeding .6. I would only consider PCA to deal with the Ca-RI relationship, and only after seeing a high VIF in the modeling step.
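As a sketch of the kind of VIF check I have in mind (an auxiliary regression among the predictors, not the eventual classification model; it uses the car package, which is loaded later for powerTransform):
# regress RI on the remaining predictors; VIFs much above 5-10, especially for
# Ca, would support using PCA or dropping one of the pair
car::vif(lm(RI ~ ., data = glass_pred))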
library(psych)
library(kableExtra)
# Summary-statistics table (mean, sd, median, min, max, skew, kurtosis, se)
# for the numeric columns of a data frame.
profile <- function(df){
  ins_desc <- df %>%
    keep(is.numeric) %>%
    describe() %>%
    dplyr::select(-c(vars, n, range, mad, trimmed))
  rns <- rownames(ins_desc)
  round_2 <- function(col) round(col, 2)
  ins_desc2 <- ins_desc %>%
    mutate_all(round_2)
  rownames(ins_desc2) <- rns
  ins_desc2 %>%
    kable() %>%
    kable_styling()
}
profile(glass_pred)
| | mean | sd | median | min | max | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|
| RI | 1.52 | 0.00 | 1.52 | 1.51 | 1.53 | 1.60 | 4.72 | 0.00 |
| Na | 13.41 | 0.82 | 13.30 | 10.73 | 17.38 | 0.45 | 2.90 | 0.06 |
| Mg | 2.68 | 1.44 | 3.48 | 0.00 | 4.49 | -1.14 | -0.45 | 0.10 |
| Al | 1.44 | 0.50 | 1.36 | 0.29 | 3.50 | 0.89 | 1.94 | 0.03 |
| Si | 72.65 | 0.77 | 72.79 | 69.81 | 75.41 | -0.72 | 2.82 | 0.05 |
| K | 0.50 | 0.65 | 0.56 | 0.00 | 6.21 | 6.46 | 52.87 | 0.04 |
| Ca | 8.96 | 1.42 | 8.60 | 5.43 | 16.19 | 2.02 | 6.41 | 0.10 |
| Ba | 0.18 | 0.50 | 0.00 | 0.00 | 3.15 | 3.37 | 12.08 | 0.03 |
| Fe | 0.06 | 0.10 | 0.00 | 0.00 | 0.51 | 1.73 | 2.52 | 0.01 |
Skew and kurtosis are a problem in many of these variables, as we saw in the histograms. The Box-Cox transformation also requires strictly positive values, so the variables with a high proportion of zeros have to be shifted before it can be applied, and even then the estimated lambdas can be extreme or numerically unstable. One should proceed with caution when applying Box-Cox to distributions like these.
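To see which predictors are affected, a quick sketch of the proportion of exact zeros per variable:
# proportion of exact zeros in each predictor
sapply(glass_pred, function(x) mean(x == 0))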
We first try the multivariate version of the Box-Cox, which estimates a set of power transformations that makes all of the variables jointly as close to multivariate normal as possible.
library(car)
library(caret)
# shift everything by 1 so every predictor is strictly positive, as Box-Cox requires
new_glass <- glass_pred + 1
bc_summary <- summary(bc <- powerTransform(as.matrix(new_glass)))
bc_summary$result
## Est Power Rounded Pwr Wald Lwr bnd Wald Upr Bnd
## RI -25.0853114 -25.09 -25.1140249 -25.05659793
## Na 1.3755562 1.00 0.6042300 2.14688241
## Mg 1.7699080 2.00 1.5354127 2.00440341
## Al 0.9773267 1.00 0.6451343 1.30951920
## Si 10.9452696 10.95 5.9995589 15.89098032
## K -0.1441078 0.00 -0.3431581 0.05494246
## Ca 0.6774333 0.50 0.4470012 0.90786544
## Ba -6.8620464 -6.86 -8.0320821 -5.69201071
## Fe -14.9245600 -14.92 -17.8279259 -12.02119414
As expected, some of these transformations are extreme. If we were insistent on using the multivariate version, we’d cap the lambdas at an absolute value of 2. The univariate version is preferable here because we don’t want the extreme distributions driving the lambda estimates of the better-behaved predictors.
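For completeness, if we did go the multivariate route, the capping could look like this sketch (not run as part of this analysis; it assumes the bc object fit above and car::bcPower):
# cap the multivariate lambda estimates at +/- 2 and apply them with bcPower
lambda_capped <- pmin(pmax(coef(bc), -2), 2)
glass_mv_bc <- as.data.frame(mapply(car::bcPower, new_glass, lambda_capped))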
trans <- preProcess(x = new_glass, method = c( 'BoxCox'))
trans
## Created from 214 samples and 9 variables
##
## Pre-processing:
## - Box-Cox transformation (9)
## - ignored (0)
##
## Lambda estimates for Box-Cox transformation:
## -2, -0.2, 2, 0, 2, -1, -1.4, -2, -2
This procedure apparently caps the lambdas at an absolute value of 2. With that ceiling in place, the transformations are still quite similar to the multivariate version.
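For reference, the per-variable lambdas can be matched back to their names; this sketch assumes the usual internal structure of a caret preProcess object, where the Box-Cox transforms live in trans$bc:
# per-variable Box-Cox lambda estimates from the caret preProcess object
sapply(trans$bc, function(x) x$lambda)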
Now we’ll have a look at the distributions after the Box-Cox transformation.
glass_trans <- predict(trans, new_glass)
plot_it2 <- function(col){
ggplot(glass_trans) + geom_histogram(aes_string(x = col), color = 'black', fill = 'gray')
}
plots <- lapply(colnames(glass_trans), plot_it2)
grid.arrange(grobs = plots, ncol = 3, nrow = 3)
The distribution of Al looks nearly normal, as do Na and Si apart from a couple of outliers. The zero-heavy variables can’t really be helped.
profile(glass_trans)
| | mean | sd | median | min | max | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|
| RI | 0.42 | 0.00 | 0.42 | 0.42 | 0.42 | 1.58 | 4.62 | 0.00 |
| Na | 2.67 | 0.06 | 2.66 | 2.46 | 2.91 | 0.06 | 2.50 | 0.00 |
| Mg | 7.32 | 4.16 | 9.54 | 0.00 | 14.57 | -0.93 | -0.79 | 0.28 |
| Al | 0.87 | 0.20 | 0.86 | 0.25 | 1.50 | 0.00 | 1.00 | 0.01 |
| Si | 2712.03 | 56.83 | 2721.98 | 2506.53 | 2918.74 | -0.65 | 2.78 | 3.88 |
| K | 0.28 | 0.17 | 0.36 | 0.00 | 0.86 | -0.09 | 0.13 | 0.01 |
| Ca | 0.68 | 0.01 | 0.68 | 0.66 | 0.70 | -0.25 | 4.33 | 0.00 |
| Ba | 0.06 | 0.13 | 0.00 | 0.00 | 0.47 | 2.14 | 2.92 | 0.01 |
| Fe | 0.04 | 0.07 | 0.00 | 0.00 | 0.28 | 1.33 | 0.45 | 0.00 |
Al now has zero skew, and the skew of about half of the variables has moved closer to zero. The only variables that still have high skew either had many zeros (Mg, Ba, Fe) or high kurtosis (RI). In the modeling step, I would compare no Box-Cox, Box-Cox applied to all variables, and Box-Cox applied only to the five variables it helped the most. In practice, I would also round the lambdas to the nearest 0.5.
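A sketch of those last two ideas, rounding the lambdas and transforming only a subset; it reuses the trans$bc structure assumed above, and the subset named here is a hypothetical placeholder rather than the actual five variables chosen:
# round the caret lambdas to the nearest 0.5 and apply Box-Cox to a subset only
lambdas <- sapply(trans$bc, function(x) x$lambda)
lambdas_rounded <- round(lambdas * 2) / 2
subset_vars <- c('Na', 'Al', 'Si', 'K', 'Ca')  # hypothetical choice of variables
glass_selective <- new_glass
glass_selective[subset_vars] <- mapply(car::bcPower, new_glass[subset_vars],
                                       lambdas_rounded[subset_vars])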
data("Soybean")
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread
## 0 :441 0: 77 0 :221 0 :357 0 : 51 0 :487
## 1 :226 1:606 1 : 36 1 : 21 1 :327 1 : 96
## NA's: 16 2 :342 2 :221 2 :221 NA's:100
## NA's: 84 NA's: 84 NA's: 84
##
##
##
## leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 0 :554 0 :535 0 :296 0 :520 0 :379 0 :320
## 1 : 45 1 : 20 1 :371 1 : 42 1 : 39 1 : 83
## NA's: 84 2 : 20 NA's: 16 NA's:121 2 : 36 2 :177
## NA's:108 3 :191 3 : 65
## NA's: 38 NA's: 38
##
##
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 0 :473 0 :497 0 :639 0 :581 0 :625 0 :407
## 1 :104 1 :135 1 : 6 1 : 44 1 : 20 1 :130
## NA's:106 2 : 13 NA's: 38 2 : 20 NA's: 38 2 : 14
## NA's: 38 NA's: 38 3 : 48
## NA's: 84
##
##
## fruit.spots seed mold.growth seed.discolor seed.size shriveling
## 0 :345 0 :476 0 :524 0 :513 0 :532 0 :539
## 1 : 75 1 :115 1 : 67 1 : 64 1 : 59 1 : 38
## 2 : 57 NA's: 92 NA's: 92 NA's:106 NA's: 92 NA's:106
## 4 :100
## NA's:106
##
##
## roots
## 0 :551
## 1 : 86
## 2 : 15
## NA's: 31
##
##
##
With the NAs counted as their own level, none of these categoricals meet the first degenerate-variable heuristic from the Kuhn and Johnson book, a ratio greater than 20 between the frequencies of the most and second most prevalent values. Some, however, are dominated by a single value; below we flag any variable whose most frequent value (again treating NA as ‘unknown’) accounts for at least 90% of the samples.
df <- Soybean %>% gather('var', 'value') %>%
mutate(value = as.character(value),
value = ifelse(is.na(value), 'unknown', value),
rows = nrow(Soybean)
) %>%
group_by(var, rows, value) %>%
summarise(count = n()) %>%
ungroup() %>%
mutate(proportion = count/rows) %>%
filter(proportion >= .9)
kable(df) %>% kable_styling()
| var | rows | value | count | proportion |
|---|---|---|---|---|
| mycelium | 683 | 0 | 639 | 0.9355783 |
| sclerotia | 683 | 0 | 625 | 0.9150805 |
Based on that criterion, we would likely drop the variables mycelium and sclerotia.
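As a side check on the first heuristic (a sketch, not part of the table above), the frequency ratios can be computed directly, counting the NAs as their own level:
# ratio of the most prevalent value to the second most prevalent value per variable
freq_ratio <- sapply(dplyr::select(Soybean, -Class), function(x) {
  tab <- sort(table(addNA(x)), decreasing = TRUE)
  as.numeric(tab[1] / tab[2])
})
sort(freq_ratio, decreasing = TRUE)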
library(naniar)
gg_miss_case(as.data.frame(Soybean))
The missing values are clearly not random: there are roughly 100 observations in which close to half of the variables are missing.
To key in on particular patterns, we limit the upset plot to the five variables with the most missing values and the five largest intersections:
gg_miss_upset(Soybean, nsets = 5, nintersects = 5)
I don’t have the domain knowledge to explain the group of 121 rows with missing values. However, given the size of the intersections between missing variables, the missingness appears to be non-random. All of the missing values for hail, sever, seed.tmt, and lodging occur on the same rows.
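That last claim can be checked directly with a small sketch:
# do hail, sever, seed.tmt, and lodging share exactly the same missing rows?
with(Soybean, all(is.na(hail) == is.na(sever),
                  is.na(hail) == is.na(seed.tmt),
                  is.na(hail) == is.na(lodging)))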
My first-line strategy for missing values in categorical data is to replace the NAs with ‘unknown’. This extends the feature space and can improve predictive power when the missingness is non-random. For example, if the missing data were caused by temperatures too low to collect data, the ‘unknown’ level would clearly add explanatory power.
I would also drop the variables identified in Part A as degenerate.
The code below produces the model matrix using this strategy.
replace_it <- function(x) ifelse(is.na(x), '_unknown', x)
soybean_2 <- dplyr::select(Soybean, - c(mycelium, sclerotia, Class)) %>%
mutate_all(as.character) %>%
mutate_all(replace_it)
model_mat <- dummyVars(~ ., data = soybean_2)
new_soy <- predict(model_mat, soybean_2)
If the values are missing at random, they could instead be imputed with a package like mice, which imputes categoricals with logistic regression (polytomous regression for factors with more than two levels). One rough way to judge whether the data are missing at random is to check whether the ‘unknown’ indicator variables created above come out non-significant in the model.
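As a simpler proxy than refitting the model (a sketch, not part of the original analysis), one could test whether the missingness of a variable is associated with the response:
# is missingness in hail related to Class? a strong association argues
# against the values being missing completely at random
miss_hail <- is.na(Soybean$hail)
chisq.test(table(miss_hail, Soybean$Class))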
The code below shows how to create the imputed dataset with mice. In practice, you’d increase the maxit and m arguments, assuming execution time is not an issue.
library(mice)
Soybean_3 <- dplyr::select(Soybean, -c(mycelium, sclerotia, Class))
# impute the categorical predictors with polytomous regression
pred <- mice(data = Soybean_3, m = 3, method = "polyreg", maxit = 2)
# fill in the NAs using the first set of imputations
imputed_soy <- complete(pred)