library(tidyverse)
library(mlbench)
library(GGally)
library(forecast)
library(mice)
data(Glass)
# Pairwise plots of the nine numeric predictors (column 10 is Type)
ggpairs(Glass[,-10])
A faceted visualization of the numeric predictors reveals some patterns in the data. Several of the variables (Na, Al, and Si) have relatively symmetric distributions, while the rest are generally right-skewed. Regarding correlation, one pair of variables (Ca and RI) shares a strong positive relationship (r = 0.81). Several additional pairs share moderate negative relationships (-0.6 < r < -0.4).
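As a spot check, the correlations referenced above can be pulled directly from the correlation matrix; a minimal sketch:
# Correlation matrix of the nine numeric predictors
glass_cors <- cor(Glass[,-10])
glass_cors["RI", "Ca"]  # the strong positive pair noted above (~0.81)
# Flag the moderately negative pairs; each appears twice due to symmetry
which(glass_cors > -0.6 & glass_cors < -0.4, arr.ind = TRUE)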
Glass %>%
  select(-Type) %>%
  gather() %>%
  ggplot(aes(x = "", y = value)) +
  facet_wrap(~ key, scales = "free") +
  geom_boxplot() +
  coord_flip() +
  labs(x = NULL, y = NULL)
Univariate boxplots display statistical outliers in the numeric predictors. All of the variables except Mg have numerous outliers. Building on the visualization from 3.1.a, several of the variables are strongly right-skewed--most notably K and Ba--while Al, Na, and Si are the most symmetric. Mg shows some leftward skewness.
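To put a number on "numerous," the boxplot outliers can be counted with the same 1.5 x IQR rule that geom_boxplot uses; a minimal sketch:
# Count values beyond 1.5 * IQR from the quartiles, per predictor
sapply(Glass[,-10], function(x) {
  q <- quantile(x, c(0.25, 0.75))
  sum(x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x))
})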
# Estimate a Box-Cox lambda for each numeric predictor
(lambdas <- sapply(Glass[,-10], BoxCox.lambda))
##          RI          Na          Mg          Al          Si           K          Ca          Ba          Fe
## -0.99992425 -0.74253321  0.99998953  0.36163616 -0.99992425  0.06348530 -0.38503405  0.08844884  0.13148336
Box-Cox transformations could help address skewness in the predictors. The suggested lambdas imply general power transformations, while the near-zero lambdas for K and Ba suggest roughly logarithmic transformations. These two variables are the most clearly skewed, though Fe is another candidate. Ultimately, further investigation would be needed to assess the impact of these transformations, depending upon the model.
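As a sketch of what that investigation might look like, the estimated lambdas can be applied column-by-column with forecast::BoxCox (note that K, Ba, and Fe contain zeros, so a pure log transform would require an offset):
# Apply each predictor's estimated lambda
glass_bc <- as.data.frame(mapply(BoxCox, Glass[,-10], lambdas))
summary(glass_bc$Ba)  # re-check skewness after transformation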
data(Soybean)
#?Soybean
#str(Soybean)
Soybean %>%
  select(-Class) %>%
  gather() %>%
  ggplot(aes(x = value)) +
  facet_wrap(~ key, scales = "free") +
  geom_bar() +
  labs(x = NULL, y = NULL)
Soybean %>%
  select(-Class) %>%
  caret::nearZeroVar(saveMetrics = TRUE) %>%
  arrange(percentUnique, desc(freqRatio)) %>%
  head(5)
##             freqRatio percentUnique zeroVar   nzv
## mycelium    106.50000     0.2928258   FALSE  TRUE
## sclerotia    31.25000     0.2928258   FALSE  TRUE
## shriveling   14.18421     0.2928258   FALSE FALSE
## lodging      12.38095     0.2928258   FALSE FALSE
## leaf.malf    12.31111     0.2928258   FALSE FALSE
Plots of the frequency distributions reveal the categorical nature of the predictor variables and the relatively small number of unique values across them. Considering degeneracy (a predictor taking a single value or only a handful of unique values, and thus having near-zero variance), mycelium and sclerotia stand out. The former has a frequency ratio--the ratio of the most common value's frequency to the second most common's--of roughly 106, while the latter's is about 31. Other variables with high frequency ratios and low percentages of unique values include shriveling (14), lodging (12), and leaf.malf (12), though these fall below caret's default frequency-ratio cutoff of 95/5 = 19 and so are not flagged as near-zero variance.
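The frequency ratio is easy to reproduce by hand, which makes the definition concrete; a sketch for mycelium:
# Frequency ratio: count of the most common level over the second most common
freqs <- sort(table(Soybean$mycelium), decreasing = TRUE)
freqs[1] / freqs[2]  # ~106.5, matching nearZeroVar above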
Having some understanding of the nature of the missing data is key to developing a strategy for handling them. The relatively small size of the Soybean data set means that discarding information, whether predictors or incomplete cases, limits what is available to inform prediction. Regardless, if eliminating predictors, I would start with the two that show near-zero variance: mycelium and sclerotia.
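Before settling on a strategy, it helps to see where the missingness actually sits; a quick tally (a sketch):
# NAs per predictor, highest first
head(sort(colSums(is.na(Soybean)), decreasing = TRUE))
# NAs per class: the missing values concentrate in a handful of classes
Soybean %>%
  mutate(n_na = rowSums(is.na(.))) %>%
  group_by(Class) %>%
  summarise(n_na = sum(n_na)) %>%
  arrange(desc(n_na)) %>%
  head(5)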
As for imputation, I always hesitate to impute meaning where it may not be appropriate. Here, the apparent relationship between class and missing values suggests that the data are, at best, missing at random (MAR). That assumption itself is not a strong one, and there may be unobserved variables related to missingness, meaning the data are missing not at random (MNAR). I am inclined to assume MNAR in this case given, again, the small data set--there may be a lot of additional information out there. However, assuming MAR, I would look to predictive mean matching (PMM), which seems to be a common method per my cursory research; the UCLA IDRE passage below introduces it.
"Predictive Mean Matching (PMM) is a semi-parametric imputation approach. It is similar to the regression method except that for each missing value, it fills in a value randomly from among the a observed donor values from an observation whose regression-predicted values are closest to the regression-predicted value for the missing value from the simulated regression model (Heitjan and Little 1991; Schenker and Taylor 1996). The PMM method ensures that imputed values are plausible; it might be more appropriate than the regression method (which assumes a joint multivariate normal distribution) if the normality assumption is violated (Horton and Lipsitz 2001, p. 246)."
set.seed(624)
# Impute via predictive mean matching, then extract the first completed data set
soybean_pmm <- complete(mice(Soybean, method = "pmm"))
sum(is.na(soybean_pmm))
## [1] 0
Above are brief code snippets using the mice library. PMM is one of several methods supported by the eponymous mice() function, which by default produces five multiply imputed data sets, each built over five iterations; complete() extracts the first of them. After PMM, the completed data set has no missing values.
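One quick sanity check is to compare level frequencies before and after imputation for a heavily missing predictor; a sketch using hail, which carries a large share of the missing values:
# Level counts before (table() excludes NAs) and after imputation
rbind(original = table(Soybean$hail),
      imputed  = table(soybean_pmm$hail))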
Source: Institute for Digital Research & Education Statistical Consulting, University of California, Los Angeles. (2020). How do I perform Multiple Imputation using Predictive Mean Matching in R? | R FAQ. Retrieved September 22, 2020, from https://stats.idre.ucla.edu/r/faq/how-do-i-perform-multiple-imputation-using-predictive-mean-matching-in-r/