The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven glass types (only six of which actually occur in these data). There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
if (!require("mlbench")) install.packages("mlbench")
if (!require("car")) install.packages("car")
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
The Histogram bins the data by value and plots the frequency of each bin against the ordered values; the height of each bar shows how many observations fall in that bin. The Scatterplot Matrix is a grid with a scatterplot for each pair of predictors; the labels along the diagonal indicate which variable occupies each row and column.
X <- Glass[,1:9]
par(mfrow = c(3, 3))
for (i in 1:ncol(X)) {
hist(X[ ,i], xlab = names(X[i]), main = paste(names(X[i]), "Histogram"), col="steelblue")
}
pairs(X, main="Scatterplot Matrix") # equivalent to plot(X)
From the histograms of the predictor variables it appears that RI, Na, Al, and Si have relatively normal (symmetric) distributions, while the remaining predictors appear non-normal (asymmetric). Most of the variable pairs in the scatterplot matrix do not appear to show strong correlations. The exceptions are the RI, Ca pair, which shows a clearly positive relationship, and possibly the RI, Si pair, which shows a slight negative relationship.
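These visual impressions can be checked numerically with a correlation matrix; a minimal sketch (RI and Ca should produce the largest positive coefficient, roughly 0.8):
# Pairwise Pearson correlations between the nine predictors,
# rounded for readability.
round(cor(X), 2)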
Do there appear to be any outliers in the data? Are any predictors skewed?
The Boxplot displays the data in quartiles. Inside the box lie the data falling between the 25th and 75th percentiles, also called the 1st and 3rd quartiles. The line inside the box represents the median, also known as the 50th percentile or 2nd quartile. Whiskers extend from the box, and any values beyond the whiskers are considered outliers. Symmetric distributions have a box with equally sized whiskers; skewed distributions have a box with one long and one short whisker. The Density Plot is much like a histogram in that it is a visual representation of the frequency of the data, but it uses a continuous curve rather than discrete bins, which makes it slightly better for visualizing skew.
X <- Glass[,1:9]
par(mfrow = c(3, 3))
for (i in 1:ncol(X)) {
boxplot(X[ ,i], ylab = names(X[i]), horizontal=T,
main = paste(names(X[i]), "Boxplot"), col="steelblue")
}
for (i in 1:ncol(X)) {
d <- density(X[,i], na.rm = TRUE)
plot(d, main = paste(names(X[i]), "Density"))
polygon(d, col="steelblue")
}
The Boxplots show outliers in every variable except Mg, with the most extreme outliers appearing in K and Ba. The Boxplots also reveal skewing in several variables, although skew is hard to judge for the variables with extreme outliers. Using both the Boxplots and the Density Plots, it can be seen that Mg is left-skewed, while K, Ba, and Fe are right-skewed. Ca also appears slightly right-skewed.
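The visual assessment of skew can be quantified with a sample skewness statistic. A minimal sketch using the skewness() function from the e1071 package (an assumption; any skewness implementation would do), where positive values indicate right skew and negative values left skew:
if (!require("e1071")) install.packages("e1071")
library(e1071)
# K, Ba, and Fe should produce large positive values; Mg a negative one.
apply(X, 2, skewness)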
Are there any relevant transformations of one or more predictors that might improve the classification model?
The Box-Cox transformation is used to normalize data and can improve a downstream model. Given the vector \(\mathbf{x}\) of positive values, the Box and Cox procedure estimates the power \(\lambda\) such that the transformed data \(\mathbf{y}\) are as close to normal as possible, where \(y=(x^\lambda-1)/\lambda\) for \(\lambda\neq0\) and \(y=\log x\) for \(\lambda=0\); informally, \(\mathbf{y}\approx\mathbf{x}^\lambda\) up to shift and scale. The powerTransform() function in the car package calculates the Box-Cox transformation using the maximum likelihood approach and returns the estimated values along with convenient rounded values that lie within 1.96 standard errors of the maximum likelihood estimate.
library(car)
summary(powerTransform(Glass[,1:9], family="yjPower"))$result[,1:2]
## Est Power Rounded Pwr
## RI -25.5304405 -25.53
## Na 1.3571336 1.00
## Mg 1.7422811 1.74
## Al 0.9935500 1.00
## Si 10.9318996 10.93
## K -0.1354867 0.00
## Ca 0.6753708 0.50
## Ba -6.7971854 -6.80
## Fe -14.8719453 -14.87
The powerTransform() function with a family="bcPower" parameter works only for strictly positive numbers, \(\mathbb{R}^+\). Although all the variables are non-negative, some include zeros. Therefore the family="yjPower" (Yeo-Johnson) parameter, which also accommodates zero and negative values, is used instead. (The alternative would be to shift every variable slightly to the right, say by \(1\mathrm{e}{-10}=1\cdot10^{-10}\), to conform with the strictly positive requirement.) The Box-Cox rounded \(\lambda\)'s calculated by the powerTransform() function show that transformations might improve the classification model. The cleanest suggested transformations are a log transformation for K (rounded \(\lambda=0\)) and a square-root transformation for Ca (rounded \(\lambda=0.5\)); Na and Al (rounded \(\lambda=1\)) need no transformation, while the extreme estimated powers for RI, Si, Ba, and Fe do not correspond to any simple, interpretable transformation.
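A minimal sketch applying the two clean rounded powers (treating \(\lambda=0\) as log1p(), which matches the Yeo-Johnson transformation for non-negative data, and \(\lambda=0.5\) as a square root, equivalent up to shift and scale):
# Apply the suggested transformations and inspect the results.
K_t  <- log1p(Glass$K)   # lambda = 0: log(x + 1) handles the zeros in K
Ca_t <- sqrt(Glass$Ca)   # lambda = 0.5: square root
par(mfrow = c(1, 2))
hist(K_t, xlab = "K", main = "Transformed K", col = "steelblue")
hist(Ca_t, xlab = "Ca", main = "Transformed Ca", col = "steelblue")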
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:
if (!require("mlbench")) install.packages("mlbench")
if (!require("caret")) install.packages("caret")
if (!require("VIM")) install.packages("VIM")
if (!require("dplyr")) install.packages("dplyr")
if (!require("mice")) install.packages("mice")
library(mlbench)
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Degenerate distributions are those where the predictor variable has a single unique value (zero variance) or a handful of unique values that occur with very low frequencies (near-zero variance). The smoothScatter() function produces a smoothed color density representation of a scatterplot, which is useful for eyeballing the value frequencies of categorical data. The nearZeroVar() function from the caret package examines the uniqueness of the data and returns a table indicating whether each variable has zero or near-zero variance.
X <- Soybean[,2:36]
par(mfrow = c(3, 6))
for (i in 1:ncol(X)) {
smoothScatter(X[ ,i], ylab = names(X[i]))
}
library(caret)
nearZeroVar(X, names = TRUE, saveMetrics=T)
## freqRatio percentUnique zeroVar nzv
## date 1.137405 1.0248902 FALSE FALSE
## plant.stand 1.208191 0.2928258 FALSE FALSE
## precip 4.098214 0.4392387 FALSE FALSE
## temp 1.879397 0.4392387 FALSE FALSE
## hail 3.425197 0.2928258 FALSE FALSE
## crop.hist 1.004587 0.5856515 FALSE FALSE
## area.dam 1.213904 0.5856515 FALSE FALSE
## sever 1.651282 0.4392387 FALSE FALSE
## seed.tmt 1.373874 0.4392387 FALSE FALSE
## germ 1.103627 0.4392387 FALSE FALSE
## plant.growth 1.951327 0.2928258 FALSE FALSE
## leaves 7.870130 0.2928258 FALSE FALSE
## leaf.halo 1.547511 0.4392387 FALSE FALSE
## leaf.marg 1.615385 0.4392387 FALSE FALSE
## leaf.size 1.479638 0.4392387 FALSE FALSE
## leaf.shread 5.072917 0.2928258 FALSE FALSE
## leaf.malf 12.311111 0.2928258 FALSE FALSE
## leaf.mild 26.750000 0.4392387 FALSE TRUE
## stem 1.253378 0.2928258 FALSE FALSE
## lodging 12.380952 0.2928258 FALSE FALSE
## stem.cankers 1.984293 0.5856515 FALSE FALSE
## canker.lesion 1.807910 0.5856515 FALSE FALSE
## fruiting.bodies 4.548077 0.2928258 FALSE FALSE
## ext.decay 3.681481 0.4392387 FALSE FALSE
## mycelium 106.500000 0.2928258 FALSE TRUE
## int.discolor 13.204545 0.4392387 FALSE FALSE
## sclerotia 31.250000 0.2928258 FALSE TRUE
## fruit.pods 3.130769 0.5856515 FALSE FALSE
## fruit.spots 3.450000 0.5856515 FALSE FALSE
## seed 4.139130 0.2928258 FALSE FALSE
## mold.growth 7.820896 0.2928258 FALSE FALSE
## seed.discolor 8.015625 0.2928258 FALSE FALSE
## seed.size 9.016949 0.2928258 FALSE FALSE
## shriveling 14.184211 0.2928258 FALSE FALSE
## roots 6.406977 0.4392387 FALSE FALSE
The Smoothed Density Scatterplots show that all variables have few unique values. A few could possibly be degenerate due to the low frequencies of some values. The two most glaring examples are mycelium and sclerotia, whose Smoothed Density Scatterplots show one solid color across the whole chart. The variables leaf.mild and int.discolor appear to show near-zero variance as well, but running the nearZeroVar() function on the data shows that only mycelium, sclerotia, and leaf.mild have near-zero variance ("nzv"). No variable has zero variance ("zeroVar").
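If those degenerate predictors were to be removed before modeling, nearZeroVar() with its default arguments returns the offending column indices directly; a minimal sketch:
# Drop the near-zero variance columns (leaf.mild, mycelium, sclerotia).
degenerate <- nearZeroVar(X)
X_filtered <- X[, -degenerate]
dim(X_filtered)  # 683 rows and 32 of the original 35 predictors remain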
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
The aggr() function in the VIM package plots and calculates the amount of missing values in each variable. The dplyr package is useful for wrangling data into aggregate summaries and is used here to find whether the pattern of missing data is related to the classes.
library(VIM)
aggr(Soybean, prop = c(T, T), bars=T, numbers=T, sortVars=T)
##
## Variables sorted by number of missings:
## Variable Count
## hail 0.177159590
## sever 0.177159590
## seed.tmt 0.177159590
## lodging 0.177159590
## germ 0.163982430
## leaf.mild 0.158125915
## fruiting.bodies 0.155197657
## fruit.spots 0.155197657
## seed.discolor 0.155197657
## shriveling 0.155197657
## leaf.shread 0.146412884
## seed 0.134699854
## mold.growth 0.134699854
## seed.size 0.134699854
## leaf.halo 0.122986823
## leaf.marg 0.122986823
## leaf.size 0.122986823
## leaf.malf 0.122986823
## fruit.pods 0.122986823
## precip 0.055636896
## stem.cankers 0.055636896
## canker.lesion 0.055636896
## ext.decay 0.055636896
## mycelium 0.055636896
## int.discolor 0.055636896
## sclerotia 0.055636896
## plant.stand 0.052708638
## roots 0.045387994
## temp 0.043923865
## crop.hist 0.023426061
## plant.growth 0.023426061
## stem 0.023426061
## date 0.001464129
## area.dam 0.001464129
## Class 0.000000000
## leaves 0.000000000
library(dplyr)
Soybean %>%
mutate(Total = n()) %>%
filter(!complete.cases(.)) %>%
group_by(Class) %>%
mutate(Missing = n(), Proportion=Missing/Total) %>%
select(Class, Missing, Proportion) %>%
unique()
## Source: local data frame [5 x 3]
## Groups: Class [5]
##
## Class Missing Proportion
## <fctr> <int> <dbl>
## 1 phytophthora-rot 68 0.09956076
## 2 diaporthe-pod-&-stem-blight 15 0.02196193
## 3 cyst-nematode 14 0.02049780
## 4 2-4-d-injury 16 0.02342606
## 5 herbicide-injury 8 0.01171303
The visualizations produced by the aggr() function in the VIM package show a bar chart with the proportion of missing data per variable as well as a grid with the proportion of missing data for variable combinations. The bar chart shows that several predictor variables have over 15% of their values missing. The top row of the grid shows that 82% of the cases are complete, in accordance with the problem description (roughly 18% missing); each remaining row highlights the missing values for the group of variables detailed on the x-axis. The non-graphical output of the function gives the exact proportion of missing values per variable. Checking whether the pattern of missing data is related to the classes amounts to checking whether some classes hold most of the incomplete cases, which is accomplished by filtering, grouping, and mutating the data with dplyr. The majority of the missing values are in the phytophthora-rot class, which accounts for nearly 10% of all cases being incomplete. Only four of the eighteen other classes contain any incomplete cases, each accounting for just 1–2% of the cases. The pattern of missing data is therefore related to the classes, driven mostly by phytophthora-rot.
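A compact cross-check of the same pattern is a table of complete versus incomplete cases per class; a minimal sketch in base R:
# Rows are class labels; FALSE = incomplete case, TRUE = complete case.
# Only five of the nineteen classes should show any FALSE counts.
table(Soybean$Class, complete.cases(Soybean))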
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
The mice() function in the mice package conducts Multivariate Imputation by Chained Equations (MICE) on multivariate datasets with missing values. The function offers over 20 imputation methods that can be applied to the data. The one used with these data is the predictive mean matching method, currently the one most often recommended in online forums. After the imputations are made, a complete dataset is created using the complete() function. The aggr() function from the VIM package used in the previous example (which plots and calculates the amount of missing values in each variable) is then rerun for comparison.
library(mice)
MICE <- mice(Soybean, method="pmm", printFlag=F, seed=624)
aggr(complete(MICE), prop = c(T, T), bars=T, numbers=T, sortVars=T)
##
## Variables sorted by number of missings:
## Variable Count
## Class 0
## date 0
## plant.stand 0
## precip 0
## temp 0
## hail 0
## crop.hist 0
## area.dam 0
## sever 0
## seed.tmt 0
## germ 0
## plant.growth 0
## leaves 0
## leaf.halo 0
## leaf.marg 0
## leaf.size 0
## leaf.shread 0
## leaf.malf 0
## leaf.mild 0
## stem 0
## lodging 0
## stem.cankers 0
## canker.lesion 0
## fruiting.bodies 0
## ext.decay 0
## mycelium 0
## int.discolor 0
## sclerotia 0
## fruit.pods 0
## fruit.spots 0
## seed 0
## mold.growth 0
## seed.discolor 0
## seed.size 0
## shriveling 0
## roots 0
Multivariate Imputation by Chained Equations (MICE) assumes values are missing at random. It is implemented by imputing missing data for all variables with a simple method, removing the imputations for one variable, imputing the removed data using regression on the remaining variables, repeating the remove-and-regress imputation for every other imputed variable, and then continuing the remove-and-regress loop over the whole dataset \(n\) times. The simple imputation method used here is Predictive Mean Matching (PMM), which "imputes missing values by means of the nearest-neighbor donor with distance based on the expected values of the missing variables conditional on the observed covariates." Given that this is a categorical dataset, the regression method used is logistic regression. After applying this imputation using the mice() function from the mice package, there are no missing data left in the dataset. The aggr() function from the VIM package used in the previous example now shows no missing data in any variable.
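The elimination alternative mentioned in the prompt could also be sketched, either by dropping the predictors with the heaviest missingness (the 15% threshold here is an assumption, not a rule) or by dropping incomplete cases outright:
# Option 1: drop predictors with more than 15% missing values.
keep <- colMeans(is.na(Soybean)) <= 0.15
Soybean_reduced <- Soybean[, keep]
# Option 2: drop incomplete cases entirely (loses roughly 18% of the rows).
Soybean_complete <- Soybean[complete.cases(Soybean), ]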
https://rpubs.com/josezuniga/253955
https://rpubs.com/josezuniga/269297
http://appliedpredictivemodeling.com/
https://www.rdocumentation.org/packages
http://archive.ics.uci.edu/ml/index.html
https://robjhyndman.com/uwafiles/fpp-notes.pdf
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/
https://archive.ics.uci.edu/ml/datasets/Glass+Identification
https://cran.r-project.org/web/packages/forecast/forecast.pdf
http://appliedpredictivemodeling.com/blog/2014/11/12/solutions-on-github
http://www.stefvanbuuren.nl/publications/2014%20Semicontinuous%20-%20Stat%20Neerl.pdf