library(mlbench)
library(tidyr)
library(dplyr)
library(ggplot2)
library(inspectdf) #numeric variable distributions
library(naniar) #missing values
library(corrplot) #correlation
The purpose of this assignment was to explore the Data Pre-processing exercises from Applied Predictive Modeling.
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consists of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data is accessed via:
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
First, we explore whether we’re dealing with any missing data:
#missing values
vis_miss(Glass)
We see that we're not missing data and proceed to use the inspect_num() function from the inspectdf library to view histograms for our numeric variables in a clear, concise manner:
#Explore histograms for numeric variables
inspectdf::inspect_num(Glass) %>%
  show_plot()
From the histograms above, we can interpret each predictor variable's distribution.
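One way to put rough numbers on these shapes is the sample skewness of each predictor; the short sketch below assumes the e1071 package (not loaded above) is available:
#Approximate skewness of each numeric predictor (e1071 is an assumed dependency)
library(e1071)
sapply(Glass[, 1:9], skewness)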
From here, we explore our categorical Type variable to familiarize ourselves with the frequency of each level:
#Explore our non-numeric factor variable
ggplot(Glass, aes(Type)) +
geom_bar()
For the Type factor variable we observe that the greatest frequencies occur at the early and late levels (1, 2, and 7). From this we might infer that related samples tend to be of the same type.
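A quick frequency count (a simple complement to the bar chart above) makes these counts explicit:
#Frequency of each Type level
table(Glass$Type)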
As a final familiarization visualization, we examine the correlation matrix to observe how strongly our variables are correlated with one another:
#corrplot / correlation matrix
numeric_values <- Glass %>% select_if(is.numeric)
train_cor <- cor(numeric_values)
corrplot.mixed(train_cor, tl.col = 'black', tl.pos = 'lt')
From above, we see that collinearity and multicollinearity are a concern.
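One possible way to flag the most problematic predictors is caret's findCorrelation(); the sketch below assumes the caret package is installed and uses an illustrative cutoff of 0.75:
#Flag predictors whose pairwise correlations exceed an illustrative 0.75 cutoff
library(caret)
findCorrelation(train_cor, cutoff = 0.75, names = TRUE)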
The visualizations above provide insight regarding our variables, their distributions, and where transformation may improve classification performance. We revisit each variable and consider which transformation (if any) is applicable.
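As one possible implementation (a sketch, not a prescription), caret's preProcess() can apply a Yeo-Johnson transformation, which, unlike Box-Cox, tolerates the zero values in Mg, K, Ba, and Fe, together with centering and scaling; caret and the particular methods chosen are assumptions here:
#Sketch: Yeo-Johnson transform plus centering/scaling of the numeric predictors
library(caret)
pp <- preProcess(Glass[, -10], method = c("YeoJohnson", "center", "scale"))
glass_trans <- predict(pp, Glass[, -10])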
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g. temperature, precipitation) and plant conditions (e.g. leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
data(Soybean)
?Soybean
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
We use the inspect_cat() function from the inspectdf library to view frequency distributions for our categorical predictors in a clear, concise manner:
#distributions for categorical predictors
inspectdf::inspect_cat(Soybean) %>%
  show_plot()
We can interpret the above output as follows: each color (including white) represents a different level, gray represents missing values, and the size of each bar represents level frequency. There are missing values, and our predictors have numerous different levels.
In addition to these high-level observations we note that Class and date may be excluded from discussion, and that mycelium has a degenerate distribution: aside from the missing values, ~100% of its values are "0". leaf.malf, leaf.mild, and sclerotia deserve honorable mention since their distributions also place a heavy probability on the "0" level.
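A possible programmatic check for such degenerate predictors is caret's nearZeroVar(); this is a sketch that assumes the caret package is installed:
#Flag (near) zero-variance predictors such as mycelium
library(caret)
nearZeroVar(Soybean, names = TRUE)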
We explore missing data with a separate vis_miss() visualization:
#missing values
vis_miss(Soybean)
We find that 9.5% of the data are missing: hail, sever, seed.tmt, and lodging are missing the most data at 17.72% each; germ is missing the second most data at 16.4%; and leaf.mild is missing the third most data at 15.81%. Additionally, there appears to be a pattern to the missingness that is related to Class. For certain classes there was no missing data, for some there was a little missing data, and for others there was a relatively significant amount of missing data that aligned perfectly with other predictors (i.e. hail, sever, ...).
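One possible way to quantify this pattern is to summarise missingness by class with naniar (already loaded above); the grouping, filtering, and ordering below are just one illustrative view:
#Summarise missingness for each predictor within each Class level
Soybean %>%
  group_by(Class) %>%
  miss_var_summary() %>%
  filter(n_miss > 0) %>%
  arrange(desc(pct_miss))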
While there is a high number of predictors (35) relative to the number of observations (683), there is nuance to each individual predictor (i.e. distribution, number of levels, etc.), and none of our predictors carry a high enough proportion of missing data (i.e. >= 60%) to warrant feature removal based on missingness alone.
Because the data is, I believe, missing not at random (MNAR), I would favor a strategy of imputation over removal. As a specific strategy going forward, I would implement multiple imputation (for categorical variables) as outlined in the similarly named sections of How to Handle Missing Data or Handling Missing Values in R.
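As a rough sketch of what that could look like (the mice package is an assumed dependency, not loaded above), mice defaults to polytomous-regression imputation for unordered categorical predictors:
#Sketch: multiple imputation of the categorical predictors using mice defaults
library(mice)
imp <- mice(Soybean, m = 5, seed = 123, printFlag = FALSE)
soybean_imputed <- mice::complete(imp, 1) #first of the five completed data sets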