Week 5 - Data Preprocessing / Overfitting - Homework

C. Rosemond 09.27.20

library(tidyverse)
library(mlbench)
library(GGally)
library(forecast)
library(mice)


3.1

a. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

data(Glass)
ggpairs(Glass[,-10])

A faceted visualization of the numeric predictor variables reveals some patterns in the data. Several of the variables (Na, Al, and Si) have relatively symmetric distributions, while the rest are typically right-skewed. Regarding correlation, one pair of variables (Ca and RI) shares a strong positive relationship (r = 0.81). Several additional pairs share weak-to-moderate negative relationships (-0.6 < r < -0.4).
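
For reference, the coefficients behind these observations can be pulled straight from the correlation matrix (a quick check, not part of the assignment output):

# Correlations of RI with the other predictors; Ca stands out at 0.81
round(cor(Glass[,-10]), 2)["RI", ]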


b. Do there appear to be any outliers in the data? Are any predictors skewed?

Glass %>% 
  select(-Type) %>%
  gather() %>%
  ggplot(aes(x = "", y = value)) +
    facet_wrap(~ key, scales = "free") +
    geom_boxplot() +
    coord_flip() +
    labs(x = NULL, y = NULL)

Univariate boxplots display statistical outliers in the numeric predictors. All of the variables except Mg have numerous outliers. Building on the visualization from 3.1.a, several of the variables are strongly right-skewed--most notably K and Ba--while Al, Na, and Si are the most symmetric. Mg shows some leftward skewness.
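
As a numeric check on the boxplots, the 1.5 * IQR rule that geom_boxplot uses can be counted directly with base R (a minimal sketch):

# Number of points beyond the 1.5 * IQR whiskers for each predictor
sapply(Glass[,-10], function(x) length(boxplot.stats(x)$out))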


c. Are there any relevant transformations of one or more predictors that might improve the classification model?

(lambdas <- sapply(Glass[,-10], function(x) {BoxCox.lambda(x)}))
##          RI          Na          Mg          Al          Si           K          Ca          Ba          Fe 
## -0.99992425 -0.74253321  0.99998953  0.36163616 -0.99992425  0.06348530 -0.38503405  0.08844884  0.13148336

Box-Cox transformations could help address skewness in the predictors. The estimated lambdas suggest power transformations for most of the variables and, for K and Ba (lambdas near zero), approximately logarithmic transformations. These two variables are the most clearly skewed, though Fe is another candidate. Ultimately, further investigation would be needed to assess the impacts of these transformations, depending upon the model.
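
As a sketch of how those lambdas might be applied--the transformed frame glass_bc is illustrative, not required by the exercise--forecast's BoxCox can be mapped over the predictors:

# Apply each estimated lambda to its predictor; with the nonzero lambdas
# estimated above, the zeros in K, Ba, and Fe still map to finite values
glass_bc <- as.data.frame(mapply(BoxCox, Glass[,-10], lambdas))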



3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

data(Soybean)
#?Soybean
#str(Soybean)


a. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Soybean %>% 
  select(-Class) %>%
  gather() %>%
  ggplot(aes(x = value)) +
    facet_wrap(~ key, scales = "free") +
    geom_bar() +
    labs(x = NULL, y = NULL)

Soybean %>%
  select(-Class) %>%
  caret::nearZeroVar(saveMetrics = TRUE) %>%
  arrange(percentUnique, desc(freqRatio)) %>%
  head(5)
##            freqRatio percentUnique zeroVar   nzv
## mycelium   106.50000     0.2928258   FALSE  TRUE
## sclerotia   31.25000     0.2928258   FALSE  TRUE
## shriveling  14.18421     0.2928258   FALSE FALSE
## lodging     12.38095     0.2928258   FALSE FALSE
## leaf.malf   12.31111     0.2928258   FALSE FALSE

Plots of the frequency distributions reveal the categorical nature of the predictor variables and the relatively small numbers of unique values across them. Considering degeneracy--a predictor taking a single value or only a few unique values, and thus having near-zero variance--mycelium and sclerotia stand out. The former has a frequency ratio--the count of the most frequent value divided by the count of the second most frequent--of 106, while the latter has a frequency ratio of 31. Other variables combine high frequency ratios with low percentages of unique values, including shriveling (14), lodging (12), and leaf.malf (12), but only mycelium and sclerotia are flagged as near-zero variance.
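
To make the frequency ratio concrete, it can be reproduced by hand for mycelium (this should match the 106.5 reported above; table() excludes NAs):

# Count of the most common level divided by the count of the second most common
tab <- sort(table(Soybean$mycelium), decreasing = TRUE)
unname(tab[1] / tab[2])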


c. Develop a strategy for handling missing data, either by eliminating the predictors or imputation.

Having some understanding of the nature of the missing data is key to developing a strategy for handling them. Given the relatively small size of the Soybean data set, dropping incomplete cases would shrink the sample, and eliminating predictors would discard information that could inform prediction. Regardless, if eliminating predictors, I would start with the two that show near-zero variance: mycelium and sclerotia.
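
The tabulation below backs the class/missingness relationship discussed next (a minimal sketch; the pct_incomplete name is mine):

# Share of incomplete rows within each class; in these data, the missing
# values concentrate in a handful of classes rather than spreading evenly
Soybean %>%
  mutate(incomplete = !complete.cases(.)) %>%
  group_by(Class) %>%
  summarise(pct_incomplete = mean(incomplete)) %>%
  arrange(desc(pct_incomplete))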

As for imputation, I always hesitate to impute meaning where it may not be appropriate. Here, the apparent relationship between class and missing values suggests that the data are, at best, missing at random (MAR). That assumption is itself uncertain: there may be unobserved variables related to the missingness, in which case the data would be missing not at random (MNAR). I am inclined to assume MNAR in this case given, again, the small data set--there may be considerable relevant information not captured here. Assuming MAR, however, I would look to predictive mean matching (PMM), which appears to be a common method per my cursory research; the UCLA IDRE passage below introduces it.

"Predictive Mean Matching (PMM) is a semi-parametric imputation approach. It is similar to the regression method except that for each missing value, it fills in a value randomly from among the a observed donor values from an observation whose regression-predicted values are closest to the regression-predicted value for the missing value from the simulated regression model (Heitjan and Little 1991; Schenker and Taylor 1996). The PMM method ensures that imputed values are plausible; it might be more appropriate than the regression method (which assumes a joint multivariate normal distribution) if the normality assumption is violated (Horton and Lipsitz 2001, p. 246)."


set.seed(624)
soybean_pmm <- complete(mice(Soybean, method = "pmm"))
sum(is.na(soybean_pmm))
## [1] 0

Above is a brief code snippet using the mice library. PMM is one of several methods supported by the eponymous mice function, which by default produces five multiply imputed data sets (m = 5) over five iterations (maxit = 5); complete() returns the first of them. After PMM, the completed data set has no missing values.
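
If the analysis should carry imputation uncertainty forward, the mids object can be kept and the other completed sets retrieved (a sketch; printFlag = FALSE merely silences the iteration log):

imp <- mice(Soybean, method = "pmm", m = 5, seed = 624, printFlag = FALSE)
soybean_pmm2 <- complete(imp, action = 2)  # second of the five completed data sets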


Source: Institute of Digital Research & Education Statistical Consulting, University of California, Los Angeles. (2020). How do I perform Multiple Imputation using Predictive Mean Matching in R? | R FAQ. Accessed 09/22/20 from https://stats.idre.ucla.edu/r/faq/how-do-i-perform-multiple-imputation-using-predictive-mean-matching-in-r/