library(mlbench) #Glass and Soybean data sets
library(tidyr) #data reshaping
library(dplyr) #data manipulation
library(ggplot2) #visualization
library(inspectdf) #numeric variable distributions
library(naniar) #missing values
library(corrplot) #correlation

Background

The purpose of this assignment was to explore the Data Pre-processing exercises from Applied Predictive Modeling.


3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consists of 214 glass samples labeled as one of seven class categories (only six of which appear in the data). There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The data is accessed via:

data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
  2. Do there appear to be any outliers in the data? Are any predictors skewed?
  3. Are there any relevant transformations of one or more predictors that might improve the classification model?

First, we explore whether we’re dealing with any missing data:

#missing values
vis_miss(Glass)

We see that no data are missing and proceed to use the inspect_num() function from the inspectdf library to view histograms of our numeric variables in a clear, concise manner:

#Explore histograms for numeric variables
inspectdf::inspect_num(Glass) %>% 
  show_plot()

From above, we interpret each predictor variable’s distribution as follows (a numeric skewness check is sketched after the list):

  • Al: slightly right skewed.
  • Ba: unimodal and heavily right skewed, with outliers and a heavy concentration at or near 0.
  • Ca: right skewed with outliers.
  • Fe: unimodal and heavily right skewed, with outliers and a heavy concentration at or near 0.
  • K: non-normal with outliers and a heavy concentration at or near 0.
  • Mg: left skewed, non-normal, and bi-modal.
  • Na: relatively normal with potential outliers.
  • RI: right skewed with outliers.
  • Si: slightly left skewed.
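As a rough numeric check on these visual reads, per-predictor skewness can be tabulated. A minimal sketch, assuming the e1071 package is available for its skewness() function:

#quantify skewness per numeric predictor; assumes the e1071 package is installed
library(e1071)

Glass %>%
  select_if(is.numeric) %>%
  summarise_all(skewness) %>%
  pivot_longer(everything(), names_to = "predictor", values_to = "skewness") %>%
  arrange(desc(abs(skewness)))

Large positive values would confirm the right-skew noted for Ba, Fe, K, and Ca, while values near zero would support the "relatively normal" reads.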

From here, we explore our categorical Type variable to familiarize ourselves with the frequency of each level:

#Explore our non-numeric factor variable
ggplot(Glass, aes(Type)) +
  geom_bar()

For the Type factor variable, we observe that the greatest frequencies occur at the early and late levels (1, 2, and 7). From this, we might expect related samples to most often fall into these common types.

As a final familiarization visualization, we check out the corresponding correlation matrix to observe how correlated our variables are with one another:

#correlation matrix via corrplot
numeric_values <- Glass %>% select_if(is.numeric)
train_cor <- cor(numeric_values)
corrplot.mixed(train_cor, tl.col = 'black', tl.pos = 'lt')

From above, we see that collinearity and multicollinearity are a concern (a programmatic check against a correlation cutoff is sketched after the list):

  • Ca and RI (0.81) have a very strong positive correlation.
  • Ba and Al (0.48), Al and K (0.33), Ba and Na (0.33) have positive correlations.
  • Si and RI (-0.54) have a strong negative correlation.
  • Ba and Mg (-0.49), Mg and Al (-0.48), Ca and Mg (-0.44), and Al and RI (-0.41) have negative correlations.
  • The majority of the remaining pairs have slight, less noteworthy correlations.
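To confirm which predictors would be flagged at a chosen cutoff (the 0.75 threshold used in the discussion below), caret’s findCorrelation() could be applied to the correlation matrix already computed. A minimal sketch, assuming the caret package is available:

#flag predictors whose pairwise correlations exceed a cutoff; assumes caret is installed
library(caret)

findCorrelation(train_cor, cutoff = 0.75, names = TRUE)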

The visualizations above provide insight regarding our variables, their distributions, and where transformation may improve classification. We revisit each variable and consider what transformation (if any) is applicable; a code sketch of these steps follows the list:

  • feature removal: RI. If we were to elect a correlation threshold of 0.75, the relationship between Ca and RI would be flagged. To reduce complexity without losing relevant information, one variable could be removed. Since RI is also correlated with two other variables versus Ca’s one, I elected to remove RI.
  • center and scale (with a twist): Mg. We could subtract the mean while taking the absolute value to normalize our bi-modal distribution.
  • center and scale: Al, Ca, Na, RI (if retained), and Si. All of these variables have relatively normal distributions, for which we could either center and scale or remove outliers to address the slight non-normality.
  • log transformation: Ba, Fe, and K. To deal with the right skewness / heavy concentration near 0, a log transformation (or maybe even Box Cox) may be applied.
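Tying these ideas together, here is a minimal sketch of the pre-processing, assuming the caret package is available. A Yeo-Johnson transformation stands in for the plain log / Box-Cox step because Ba, Fe, and K contain zeros, which a log or Box-Cox transform cannot handle directly:

#a sketch of the proposed pre-processing; assumes the caret package is installed
#Yeo-Johnson stands in for log / Box-Cox because Ba, Fe, and K contain zeros
library(caret)

glass_predictors <- Glass %>%
  select(-Type, -RI)   #drop the outcome and the flagged RI predictor

pp <- preProcess(glass_predictors, method = c("YeoJohnson", "center", "scale"))
glass_transformed <- predict(pp, glass_predictors)
summary(glass_transformed)

The Mg "center with absolute value" idea is not built into preProcess and would need to be applied manually before or after this step.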

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

data(Soybean)
?Soybean

str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
  2. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
  3. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

We use the inspect_cat() function from the inspectdf library to view the frequency distributions of our categorical predictors in a clear, concise manner:

#distributions for categorical predictors
inspectdf::inspect_cat(Soybean) %>%
    show_plot()

In the output above, each color (including white) represents a different level of a predictor, gray represents missing values, and the width of each bar represents level frequency. There are missing values, and our predictors take on numerous different levels.

In addition to these high-level observations, we note that Class and date may be excluded from this discussion and that mycelium has a degenerate distribution: aside from the missing values, nearly 100% of its values are “0”. leaf.malf, leaf.mild, and sclerotia deserve honorable mention since their distributions also place heavy probability on the “0” level.
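The degenerate distributions can also be flagged programmatically with a near-zero-variance filter. A minimal sketch, assuming the caret package is available:

#flag degenerate (near-zero-variance) predictors; assumes caret is installed
library(caret)

nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ]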

We explore missing data with a separate vis_miss() visualization:

#missing values
vis_miss(Soybean)

We find that 9.5% of the data are missing and that:

  • hail, sever, seed.tmt, lodging are missing the most data at 17.72%,
  • germ is missing the second most data at 16.4%, and
  • leaf.mild is missing the third most data at 15.81% (these per-variable figures can be tabulated directly, as sketched below).
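A quick tabular confirmation of the per-variable figures, using miss_var_summary() from the already-loaded naniar package:

#per-variable missingness counts and percentages
miss_var_summary(Soybean) %>%
  head(10)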

Additionally, there appears to be a pattern to the missingness that is related to Class. For certain classes there is no missing data, for some there is a little, and for others there is a relatively significant amount of missing data that aligns perfectly across other predictors (i.e., hail, sever, …).
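This class-related pattern can be checked directly by counting missing cells per sample and summarizing by Class. A rough sketch using only the dplyr verbs already loaded:

#count missing cells per sample, then summarize by Class
missing_by_class <- Soybean %>%
  mutate(n_missing = rowSums(is.na(select(Soybean, -Class)))) %>%
  group_by(Class) %>%
  summarise(samples = n(), total_missing = sum(n_missing)) %>%
  arrange(desc(total_missing))

missing_by_class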

While there is a high number of predictors (35) relative to the number of observations (683), there is nuance to each individual predictor (i.e., distribution, number of levels, etc.), and none of our predictors carry a high enough proportion of missing data (i.e., >= 60%) to warrant feature removal on that basis alone.

Because the data are, I believe, missing not at random (MNAR), I would favor a strategy of imputation over removal. As a concrete strategy, I would implement multiple imputation (for categorical variables) as outlined in the similarly named sections of How to Handle Missing Data or Handling Missing Values in R.
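A minimal sketch of that strategy, assuming the mice package is available (mice defaults to logistic / polytomous regression for factor variables, so no per-variable method specification is strictly required):

#multiple imputation for the categorical predictors; assumes the mice package is installed
library(mice)

imp <- mice(Soybean, m = 5, seed = 123, printFlag = FALSE)  #5 imputed data sets
soybean_complete <- mice::complete(imp, 1)                  #one completed set for modeling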