Exercise 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

 

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

library(dplyr)
library(tidyr)
library(ggplot2)

# Make a copy of the Glass dataset and remove the categorical variable "Type".
glassCopy <- subset(Glass, select = -Type)

# Plot the distribution of each predictor variable.
# Fixed colours go in geom_histogram(), not aes(), so they are used literally.
glassCopy %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = 'free') +
  geom_histogram(bins = 16, color = 'red', fill = 'brown') +
  theme_light() +
  ggtitle('Distribution of Predictor Variables')

library(corrplot)

# Plot the correlation matrix of the predictor variables.
corrplot(cor(glassCopy))

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

Looking at the "Distribution of Predictor Variables" plot above, we can see that some of the variables are close to normally distributed (Al, Ca, Na, RI, and Si), whilst the remaining variables are skewed (Ba, Fe, K, and Mg). Ba, Fe, and K are skewed to the right. K has outliers near 3 and 6, and there are also many outlying values in Al, Ba, Ca, Mg, Fe, and RI.
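The visual impression of skewness can be cross-checked numerically. The sketch below is a minimal base-R version, assuming the moment-based definition of sample skewness. `sample_skewness` and `skew_table` are hypothetical helper names, not functions from any package, and the demo data frame is made-up toy data rather than the Glass values.

```r
# Moment-based sample skewness: third central moment divided by the
# 1.5 power of the second central moment.
# Values near 0 suggest symmetry; large positive values suggest right skew.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / (mean(centered^2)^1.5)
}

# Apply the helper to every numeric column of a data frame,
# sorted from most right-skewed to most left-skewed.
skew_table <- function(df) {
  numeric_cols <- df[sapply(df, is.numeric)]
  sort(sapply(numeric_cols, sample_skewness), decreasing = TRUE)
}

# Toy data (illustrative only): one symmetric and one right-skewed column.
toy <- data.frame(
  symmetric = c(1, 2, 3, 4, 5),
  right_skewed = c(0, 0, 0, 0, 10)
)
print(skew_table(toy))
```

With the exercise data, the call would be `skew_table(glassCopy)`.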

The correlation matrix tells us that most of the variables are not strongly related. The exceptions are the pairs Si and RI, Ca and RI, and Ba and Mg.
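To read the strongest relationships off as numbers rather than from the plot, the correlation matrix can be filtered by absolute value. This is a minimal sketch: `strong_pairs` is a hypothetical helper name, the 0.4 cutoff is an arbitrary assumption, and the demo uses made-up toy data rather than the Glass values.

```r
# List predictor pairs whose absolute Pearson correlation exceeds a cutoff.
strong_pairs <- function(df, cutoff = 0.4) {
  cors <- cor(df, use = "pairwise.complete.obs")
  # Keep each pair once by looking only at the upper triangle.
  idx <- which(upper.tri(cors) & abs(cors) > cutoff, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cors)[idx[, "row"]],
    var2 = colnames(cors)[idx[, "col"]],
    r = cors[idx],
    stringsAsFactors = FALSE
  )
  # Strongest relationships first, regardless of sign.
  out[order(-abs(out$r)), ]
}

# Toy data (illustrative only): a and b move together, c does not.
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 6, 8, 11),
  c = c(5, 1, 4, 2, 3)
)
pairs_found <- strong_pairs(toy)
print(pairs_found)
```

With the exercise data, the call would be `strong_pairs(glassCopy)`.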

 

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

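Yes - transforming the skewed variables (Ba, Fe, K, and Mg) might improve the classification model. Because these predictors contain zero values, a Box-Cox or log transformation cannot be applied to them directly; a Yeo-Johnson transformation, or a log transformation after adding a small offset, would be needed instead.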
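The effect of zeros on a log-type transformation can be seen in a few lines. The sketch below implements the Yeo-Johnson transform for a fixed lambda in base R; `yeo_johnson` is a hypothetical helper name, the input values are made up for illustration, and lambda is supplied by hand rather than estimated from the data.

```r
# Yeo-Johnson transform for a fixed lambda (lambda is not estimated here).
# Assumes x has no missing values.
yeo_johnson <- function(x, lambda) {
  out <- numeric(length(x))
  pos <- x >= 0
  # Non-negative values: a Box-Cox transform of (x + 1).
  if (lambda != 0) {
    out[pos] <- ((x[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log1p(x[pos])
  }
  # Negative values: a Box-Cox transform of (1 - x) with power 2 - lambda.
  if (lambda != 2) {
    out[!pos] <- -((1 - x[!pos])^(2 - lambda) - 1) / (2 - lambda)
  } else {
    out[!pos] <- -log1p(-x[!pos])
  }
  out
}

# Toy values (illustrative only), including the zeros that break log().
x <- c(0, 0, 0.11, 0.26, 3.15)
print(log(x))             # -Inf for the zero entries
print(yeo_johnson(x, 0))  # finite everywhere; equals log(x + 1) here
```

In practice lambda would be estimated per predictor, for example with caret's `preProcess(glassCopy, method = "YeoJohnson")`.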

   

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

library(mlbench)
data(Soybean)
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

 

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

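library(caret)
library(knitr)
library(kableExtra)

# Report the near-zero-variance metrics for every column.
nearZeroVar(Soybean, saveMetrics = TRUE) %>%
  kable(caption = 'Variables Near Zero Variance Status Report') %>%
  kable_styling()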
Variables Near Zero Variance Status Report
Variable           freqRatio  percentUnique  zeroVar  nzv
Class               1.010989      2.7818448  FALSE    FALSE
date                1.137405      1.0248902  FALSE    FALSE
plant.stand         1.208191      0.2928258  FALSE    FALSE
precip              4.098214      0.4392387  FALSE    FALSE
temp                1.879397      0.4392387  FALSE    FALSE
hail                3.425197      0.2928258  FALSE    FALSE
crop.hist           1.004587      0.5856515  FALSE    FALSE
area.dam            1.213904      0.5856515  FALSE    FALSE
sever               1.651282      0.4392387  FALSE    FALSE
seed.tmt            1.373874      0.4392387  FALSE    FALSE
germ                1.103627      0.4392387  FALSE    FALSE
plant.growth        1.951327      0.2928258  FALSE    FALSE
leaves              7.870130      0.2928258  FALSE    FALSE
leaf.halo           1.547511      0.4392387  FALSE    FALSE
leaf.marg           1.615385      0.4392387  FALSE    FALSE
leaf.size           1.479638      0.4392387  FALSE    FALSE
leaf.shread         5.072917      0.2928258  FALSE    FALSE
leaf.malf          12.311111      0.2928258  FALSE    FALSE
leaf.mild          26.750000      0.4392387  FALSE    TRUE
stem                1.253378      0.2928258  FALSE    FALSE
lodging            12.380952      0.2928258  FALSE    FALSE
stem.cankers        1.984293      0.5856515  FALSE    FALSE
canker.lesion       1.807910      0.5856515  FALSE    FALSE
fruiting.bodies     4.548077      0.2928258  FALSE    FALSE
ext.decay           3.681481      0.4392387  FALSE    FALSE
mycelium          106.500000      0.2928258  FALSE    TRUE
int.discolor       13.204546      0.4392387  FALSE    FALSE
sclerotia          31.250000      0.2928258  FALSE    TRUE
fruit.pods          3.130769      0.5856515  FALSE    FALSE
fruit.spots         3.450000      0.5856515  FALSE    FALSE
seed                4.139130      0.2928258  FALSE    FALSE
mold.growth         7.820895      0.2928258  FALSE    FALSE
seed.discolor       8.015625      0.2928258  FALSE    FALSE
seed.size           9.016949      0.2928258  FALSE    FALSE
shriveling         14.184211      0.2928258  FALSE    FALSE
roots               6.406977      0.4392387  FALSE    FALSE
# Search for degenerate distributions in the Soybean dataset.
degenerateDistributions <- nearZeroVar(Soybean)
colnames(Soybean)[degenerateDistributions]
## [1] "leaf.mild" "mycelium"  "sclerotia"

As shown in the "Variables Near Zero Variance Status Report" table and the nearZeroVar() search results above, there are 3 variables in the Soybean dataset with degenerate distributions: leaf.mild, mycelium, and sclerotia.
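The two metrics behind the nzv flag can be reproduced by hand for a single predictor. The sketch below follows the documented definitions (ratio of the most common to the second most common value, and distinct values as a percentage of all rows) with caret's default cutoffs of 95/5 and 10. `nzv_metrics` is a hypothetical helper name and this is not caret's own code. The demo vector is reconstructed from the figures reported in this exercise for mycelium: a frequency ratio of 106.5 with 38 missing values out of 683 implies counts of 639 and 6.

```r
# Near-zero-variance metrics for one vector, following the documented
# definitions; cutoffs default to caret's freqCut = 95/5 and uniqueCut = 10.
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  observed <- x[!is.na(x)]
  counts <- sort(table(as.character(observed)), decreasing = TRUE)
  # Most common value count divided by second most common value count.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else 0
  # Distinct observed values as a percentage of all rows (NA rows included).
  percent_unique <- 100 * length(counts) / length(x)
  zero_var <- length(counts) <= 1
  data.frame(
    freqRatio = freq_ratio,
    percentUnique = percent_unique,
    zeroVar = zero_var,
    nzv = zero_var || (freq_ratio > freq_cut && percent_unique <= unique_cut)
  )
}

# Vector with the same level counts as mycelium (639 zeros, 6 ones, 38 NA).
mycelium_like <- c(rep("0", 639), rep("1", 6), rep(NA, 38))
metrics <- nzv_metrics(mycelium_like)
print(metrics)
```

The result matches the mycelium row of the report: a frequency ratio far above the cutoff of 19 combined with very few distinct values.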

 

(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

# Print out a table of missing values by column (sorted in descending order).
missingValuesOrdered <- order(-colSums(is.na(Soybean)))

kable(colSums(is.na(Soybean))[missingValuesOrdered], caption = 'Missing Values By Column') %>%
    kable_styling(bootstrap_options = c('striped', 'hover', 'condensed', 'responsive')) %>% 
    scroll_box(width = '100%', height = '600px')
Missing Values By Column
Variable Missing
hail 121
sever 121
seed.tmt 121
lodging 121
germ 112
leaf.mild 108
fruiting.bodies 106
fruit.spots 106
seed.discolor 106
shriveling 106
leaf.shread 100
seed 92
mold.growth 92
seed.size 92
leaf.halo 84
leaf.marg 84
leaf.size 84
leaf.malf 84
fruit.pods 84
precip 38
stem.cankers 38
canker.lesion 38
ext.decay 38
mycelium 38
int.discolor 38
sclerotia 38
plant.stand 36
roots 31
temp 30
crop.hist 16
plant.growth 16
stem 16
date 1
area.dam 1
Class 0
leaves 0

 

# Print a table containing a count of missing values by class.
classesMissingValues <- Soybean %>%
  mutate(nul = rowSums(is.na(Soybean))) %>%
  group_by(Class) %>%
  summarize(missing = sum(nul)) %>%
  filter(missing != 0)

kable(classesMissingValues, caption = 'Missing Values By Class') %>%
      kable_styling(bootstrap_options = c('striped', 'hover', 'condensed', 'responsive')) %>% 
      scroll_box(width = '100%')
Missing Values By Class
Class missing
2-4-d-injury 450
cyst-nematode 336
diaporthe-pod-&-stem-blight 177
herbicide-injury 160
phytophthora-rot 1214
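
Reading the two tables together: hail, sever, seed.tmt, and lodging are the predictors most often missing (121 values each, about 17.7% of the 683 rows), while Class and leaves are never missing. All 2,337 missing values fall in just five of the 19 classes, with phytophthora-rot alone accounting for 1,214, so the missingness is strongly related to class. A per-class share of incomplete rows makes this easier to compare across classes of different sizes. The sketch below is one way to compute it in base R; `incomplete_by_class` is a hypothetical helper name and the data frame is made-up toy data, not the Soybean values.

```r
# Fraction of rows with at least one missing predictor, per class.
incomplete_by_class <- function(df, class_col = "Class") {
  predictors <- df[setdiff(names(df), class_col)]
  has_na <- !complete.cases(predictors)
  # The mean of a logical vector is the share of TRUE values.
  tapply(has_na, df[[class_col]], mean)
}

# Toy data (illustrative only): every row of class "a" has a missing value,
# no row of class "b" does.
toy <- data.frame(
  Class = c("a", "a", "b", "b"),
  x = c(1, NA, 3, 4),
  y = c(NA, NA, 1, 2)
)
shares <- incomplete_by_class(toy)
print(shares)
```

With the exercise data, the call would be `incomplete_by_class(Soybean)`.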

 

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

For this question, I impute missing values with the mice() function from the mice (Multivariate Imputation by Chained Equations) package, using classification and regression trees (method = 'cart'). The before and after tables of missing-value counts below show that the imputation removes all missing values from the dataset.

library(mice)

#' mice_imputation - MICE imputation.
#'
#' Given a dataset, runs the MICE algorithm on the dataset
#' to impute both numerical and categorical missing values.
#'
#' @param dataframe A data frame on which to run the MICE algorithm.
#'
#' @return The passed dataset with missing values replaced by imputed values.
#'
mice_imputation <- function(dataframe) {
  # m = 1 produces a single imputed dataset; 'cart' imputes with
  # classification and regression trees.
  imputation <- mice(dataframe, m = 1, method = 'cart', printFlag = FALSE)
  # Return the completed dataset visibly.
  mice::complete(imputation)
}
# Check for missing values prior to imputing the data.
sapply(Soybean, function(x) sum(is.na(x))) %>%
  sort(decreasing = TRUE) %>%
  kable(caption = 'Missing Values Count Before Imputation') %>%
  kable_styling()
Missing Values Count Before Imputation
Variable Missing
hail 121
sever 121
seed.tmt 121
lodging 121
germ 112
leaf.mild 108
fruiting.bodies 106
fruit.spots 106
seed.discolor 106
shriveling 106
leaf.shread 100
seed 92
mold.growth 92
seed.size 92
leaf.halo 84
leaf.marg 84
leaf.size 84
leaf.malf 84
fruit.pods 84
precip 38
stem.cankers 38
canker.lesion 38
ext.decay 38
mycelium 38
int.discolor 38
sclerotia 38
plant.stand 36
roots 31
temp 30
crop.hist 16
plant.growth 16
stem 16
date 1
area.dam 1
Class 0
leaves 0
# Check for missing values once again after running the MICE imputation on the data.
sapply(mice_imputation(Soybean), function(x) sum(is.na(x))) %>%
  sort(decreasing = TRUE) %>%
  kable(caption = 'Missing Values Count After Imputation') %>%
  kable_styling()
Missing Values Count After Imputation
Variable Missing
Class 0
date 0
plant.stand 0
precip 0
temp 0
hail 0
crop.hist 0
area.dam 0
sever 0
seed.tmt 0
germ 0
plant.growth 0
leaves 0
leaf.halo 0
leaf.marg 0
leaf.size 0
leaf.shread 0
leaf.malf 0
leaf.mild 0
stem 0
lodging 0
stem.cankers 0
canker.lesion 0
fruiting.bodies 0
ext.decay 0
mycelium 0
int.discolor 0
sclerotia 0
fruit.pods 0
fruit.spots 0
seed 0
mold.growth 0
seed.discolor 0
seed.size 0
shriveling 0
roots 0
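
A zero count of missing values shows that the imputation ran, but not that the imputed values are plausible. One simple follow-up check is to compare the share of each level before and after imputation for a given predictor. The sketch below does this in base R; `compare_level_shares` is a hypothetical helper name and the two factors are made-up toy data, not Soybean columns.

```r
# Compare the share of each factor level before and after imputation.
compare_level_shares <- function(before, after) {
  stopifnot(is.factor(before), is.factor(after),
            identical(levels(before), levels(after)))
  rbind(
    before = prop.table(table(before)),  # table() ignores NA by default
    after  = prop.table(table(after))
  )
}

# Toy data (illustrative only): two missing values filled in as "0" and "1".
before <- factor(c("0", "0", "1", NA, NA), levels = c("0", "1"))
after  <- factor(c("0", "0", "1", "0", "1"), levels = c("0", "1"))
shares <- compare_level_shares(before, after)
print(round(shares, 3))
```

With the exercise data, a call could look like `compare_level_shares(Soybean$hail, mice_imputation(Soybean)$hail)`. Large shifts in the shares would suggest that the imputation is changing the predictor's distribution.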