Week 5 - Data Preprocessing / Overfitting - Homework

C. Rosemond 09.27.20

library(tidyverse)
library(mlbench)
library(GGally)
library(forecast)
library(mice)


3.1

a. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

data(Glass)
ggpairs(Glass[,-10])

A faceted visualization of the numeric predictor variables reveals some patterns in the data. Several of the variables (Na, Al, and Si) have relatively symmetric distributions, while the rest are typically right-skewed. Regarding correlation, one pair of variables (Ca and RI) shares a strong positive relationship (r = 0.81). Several additional pairs share weak-to-moderate negative relationships (-0.6 < r < -0.4).
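
For reference, the coefficients behind these observations can be pulled straight from the correlation matrix (a quick check, not part of the assignment output):

# Correlations of RI with the other predictors; Ca stands out at 0.81
round(cor(Glass[,-10]), 2)["RI", ]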


b. Do there appear to be any outliers in the data? Are any predictors skewed?

Glass %>% 
  select(-Type) %>%
  gather() %>%
  ggplot(aes(x = "", y = value)) +
    facet_wrap(~ key, scales = "free") +
    geom_boxplot() +
    coord_flip() +
    labs(x = NULL, y = NULL)

Univariate boxplots display statistical outliers in the numeric predictors. All of the variables except Mg have numerous outliers. Building on the visualization from 3.1.a, several of the variables are strongly right-skewed--most notably K and Ba--while Al, Na, and Si are the most symmetric. Mg shows some leftward skewness.
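
As a numeric check on the boxplots, the 1.5 * IQR rule that geom_boxplot uses can be counted directly with base R (a minimal sketch):

# Number of points beyond the 1.5 * IQR whiskers for each predictor
sapply(Glass[,-10], function(x) length(boxplot.stats(x)$out))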


c. Are there any relevant transformations of one or more predictors that might improve the classification model?

(lambdas <- sapply(Glass[,-10], function(x) {BoxCox.lambda(x)}))
##          RI          Na          Mg          Al          Si           K          Ca          Ba          Fe 
## -0.99992425 -0.74253321  0.99998953  0.36163616 -0.99992425  0.06348530 -0.38503405  0.08844884  0.13148336

Box-Cox transformations could help address skewness in the predictors. The estimated lambdas suggest power transformations for most of the variables and, for K and Ba (lambdas near zero), approximately logarithmic transformations. These two variables are the most clearly skewed, though Fe is another candidate. Ultimately, further investigation would be needed to assess the impacts of these transformations, depending upon the model.
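
As a sketch of how those lambdas might be applied--the transformed frame glass_bc is illustrative, not required by the exercise--forecast's BoxCox can be mapped over the predictors:

# Apply each estimated lambda to its predictor; with the nonzero lambdas
# estimated above, the zeros in K, Ba, and Fe still map to finite values
glass_bc <- as.data.frame(mapply(BoxCox, Glass[,-10], lambdas))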



3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

data(Soybean)
#?Soybean
#str(Soybean)


a. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Soybean %>% 
  select(-Class) %>%
  gather() %>%
  ggplot(aes(x = value)) +
    facet_wrap(~ key, scales = "free") +
    geom_bar() +
    labs(x = NULL, y = NULL)

Soybean %>%
  select(-Class) %>%
  caret::nearZeroVar(saveMetrics = TRUE) %>%
  arrange(percentUnique, desc(freqRatio)) %>%
  head(5)
##            freqRatio percentUnique zeroVar   nzv
## mycelium   106.50000     0.2928258   FALSE  TRUE
## sclerotia   31.25000     0.2928258   FALSE  TRUE
## shriveling  14.18421     0.2928258   FALSE FALSE
## lodging     12.38095     0.2928258   FALSE FALSE
## leaf.malf   12.31111     0.2928258   FALSE FALSE

Plots of the frequency distributions reveal the categorical nature of the predictor variables and the relatively small numbers of unique values across them. Considering degeneracy--a predictor taking a single value or only a few unique values, and thus having near-zero variance--mycelium and sclerotia stand out. The former has a frequency ratio--the count of the most frequent value divided by the count of the second most frequent--of 106, while the latter has a frequency ratio of 31. Other variables combine high frequency ratios with low percentages of unique values, including shriveling (14), lodging (12), and leaf.malf (12), but only mycelium and sclerotia are flagged as near-zero variance.
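
To make the frequency ratio concrete, it can be reproduced by hand for mycelium (this should match the 106.5 reported above; table() excludes NAs):

# Count of the most common level divided by the count of the second most common
tab <- sort(table(Soybean$mycelium), decreasing = TRUE)
unname(tab[1] / tab[2])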


c. Develop a strategy for handling missing data, either by eliminating the predictors or imputation.

Having some understanding of the nature of the missing data is key to developing a strategy for handling them. Given the relatively small size of the Soybean data set, dropping incomplete cases would shrink the sample, and eliminating predictors would discard information that could inform prediction. Regardless, if eliminating predictors, I would start with the two that show near-zero variance: mycelium and sclerotia.
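
The tabulation below backs the class/missingness relationship discussed next (a minimal sketch; the pct_incomplete name is mine):

# Share of incomplete rows within each class; in these data, the missing
# values concentrate in a handful of classes rather than spreading evenly
Soybean %>%
  mutate(incomplete = !complete.cases(.)) %>%
  group_by(Class) %>%
  summarise(pct_incomplete = mean(incomplete)) %>%
  arrange(desc(pct_incomplete))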

As for imputation, I always hesitate to impute meaning where it may not be appropriate. Here, the apparent relationship between class and missing values suggests that the data are, at best, missing at random (MAR). That assumption is itself uncertain: there may be unobserved variables related to the missingness, in which case the data would be missing not at random (MNAR). I am inclined to assume MNAR in this case given, again, the small data set--there may be considerable relevant information not captured here. Assuming MAR, however, I would look to predictive mean matching (PMM), which appears to be a common method per my cursory research; the UCLA IDRE passage below introduces it.

"Predictive Mean Matching (PMM) is a semi-parametric imputation approach. It is similar to the regression method except that for each missing value, it fills in a value randomly from among the a observed donor values from an observation whose regression-predicted values are closest to the regression-predicted value for the missing value from the simulated regression model (Heitjan and Little 1991; Schenker and Taylor 1996). The PMM method ensures that imputed values are plausible; it might be more appropriate than the regression method (which assumes a joint multivariate normal distribution) if the normality assumption is violated (Horton and Lipsitz 2001, p. 246)."


set.seed(624)
soybean_pmm <- complete(mice(Soybean, method = "pmm"))
sum(is.na(soybean_pmm))
## [1] 0

Above is a brief code snippet using the mice library. PMM is one of several methods supported by the eponymous mice function, which by default produces five multiply imputed data sets (m = 5) over five iterations (maxit = 5); complete() returns the first of them. After PMM, the completed data set has no missing values.
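
If the analysis should carry imputation uncertainty forward, the mids object can be kept and the other completed sets retrieved (a sketch; printFlag = FALSE merely silences the iteration log):

imp <- mice(Soybean, method = "pmm", m = 5, seed = 624, printFlag = FALSE)
soybean_pmm2 <- complete(imp, action = 2)  # second of the five completed data sets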


Source: Institute of Digital Research & Education Statistical Consulting, University of California, Los Angeles. (2020). How do I perform Multiple Imputation using Predictive Mean Matching in R? | R FAQ. Accessed 09/22/20 from https://stats.idre.ucla.edu/r/faq/how-do-i-perform-multiple-imputation-using-predictive-mean-matching-in-r/