library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.3.5     v purrr   0.3.2
## v tibble  3.1.1     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(DataExplorer)
library(moments)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift

Exercise 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

a. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

First, as the missing-value plot below shows, none of the predictor variables has any missing values.

glass_df <- Glass #We'll make a copy to make edits without changing the original dataframe
index <- sapply(glass_df, is.factor) #find any columns that are factors
glass_df[index] <- lapply(glass_df[index], function(x) as.numeric(as.character(x))) #convert factors to numerics
plot_missing(glass_df)
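
A quick numeric check confirms the same thing:

sum(is.na(glass_df)) #total NA count across all columns; 0, matching the plot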

The histograms and density plots below show that three predictors, "Al", "Na" and "Si", are approximately normally distributed.

plot_histogram(glass_df)

plot_density(glass_df)

Additionally, the correlation plot below shows that several variables correlate noticeably with the glass type: "Ba", "Al" and "Na" have positive correlations, while "Mg" has a strong negative one. There are also some very strong correlations between the predictors themselves, most notably between "RI" and "Ca" at 0.81.

plot_correlation(glass_df)
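
Individual coefficients can also be checked directly; for example, the RI-Ca value quoted above:

cor(glass_df$RI, glass_df$Ca) #approximately 0.81, as shown in the correlation plot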

b. Do there appear to be any outliers in the data? Are any predictors skewed?

The boxplots below show outliers in nearly every predictor, and the skewness statistics identify "K" (6.51), "Ba" (3.39) and "Ca" (2.03) as the most heavily skewed variables. ("Type" appears at the end only because of the earlier factor-to-numeric conversion; as the class label, its skewness is not meaningful.)

boxplot(glass_df$RI)

skewness(glass_df$RI)
## [1] 1.614015
boxplot(glass_df$Na)

skewness(glass_df$Na)
## [1] 0.4509917
boxplot(glass_df$Mg)

skewness(glass_df$Mg)
## [1] -1.144465
boxplot(glass_df$Al)

skewness(glass_df$Al)
## [1] 0.9009179
boxplot(glass_df$Si)

skewness(glass_df$Si)
## [1] -0.7253173
boxplot(glass_df$K)

skewness(glass_df$K)
## [1] 6.505636
boxplot(glass_df$Ca)

skewness(glass_df$Ca)
## [1] 2.032677
boxplot(glass_df$Ba)

skewness(glass_df$Ba)
## [1] 3.392431
boxplot(glass_df$Fe)

skewness(glass_df$Fe)
## [1] 1.742007
boxplot(glass_df$Type)

skewness(glass_df$Type)
## [1] 1.107085
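
The ten separate calls above can also be collapsed into a single pass over the columns, which makes the ranking easier to read:

sort(sapply(glass_df, skewness), decreasing = TRUE) #skewness of every column, largest first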

c. Are there any relevant transformations of one or more predictors that might improve the classification model?

The predictors with the highest skewness are "K", "Ba" and "Ca". I will apply a Box-Cox transformation to "Ca"; the likelihood profile below peaks near λ = -1, which is why the inverse transformation Ca^(-1) is plotted afterwards. Box-Cox cannot be applied to "K" or "Ba" because both contain zeros and the transformation requires strictly positive data, so I will use a log transformation on those two instead and compare the results to the histograms above.

boxcox(lm(glass_df$Ca ~ 1))

hist(glass_df$Ca^(-1))

hist(log(glass_df$K))

hist(log(glass_df$Ba))

As we can see above, the transformations brought these variables closer to a normal distribution.
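
One caveat: log(0) is -Inf, so the zero observations in "K" and "Ba" are silently dropped by hist() above. A transformation that handles zeros directly is Yeo-Johnson; a minimal sketch using caret's preProcess() (not part of the original analysis):

yj <- preProcess(glass_df[, c("K", "Ba")], method = "YeoJohnson") #estimate a lambda per column
transformed <- predict(yj, glass_df[, c("K", "Ba")])
hist(transformed$K)
hist(transformed$Ba)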

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

data(Soybean)
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## See ?Soybean for details

a. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

The "nearZeroVar" function identifies variables with near-zero variance, i.e., variables that take a single value for the vast majority of samples and whose distributions are therefore degenerate.

bean_df <- Soybean #We'll make a copy to make edits without changing the original dataframe

nearZeroVar(bean_df)
## [1] 19 26 28
colnames(bean_df)[nearZeroVar(bean_df)]
## [1] "leaf.mild" "mycelium"  "sclerotia"
bean_df %>% 
  group_by(leaf.mild) %>%
  summarise(freq = n()) %>%
  mutate(rel.freq = freq / sum(freq)) %>%
  arrange(-freq)
## # A tibble: 4 x 3
##   leaf.mild  freq rel.freq
##   <fct>     <int>    <dbl>
## 1 0           535   0.783 
## 2 NA          108   0.158 
## 3 1            20   0.0293
## 4 2            20   0.0293
bean_df %>% 
  group_by(mycelium) %>%
  summarise(freq = n()) %>%
  mutate(rel.freq = freq / sum(freq)) %>%
  arrange(-freq)
## # A tibble: 3 x 3
##   mycelium  freq rel.freq
##   <fct>    <int>    <dbl>
## 1 0          639  0.936  
## 2 NA          38  0.0556 
## 3 1            6  0.00878
bean_df %>% 
  group_by(sclerotia) %>%
  summarise(freq = n()) %>%
  mutate(rel.freq = freq / sum(freq)) %>%
  arrange(-freq)
## # A tibble: 3 x 3
##   sclerotia  freq rel.freq
##   <fct>     <int>    <dbl>
## 1 0            625   0.915 
## 2 NA            38   0.0556
## 3 1             20   0.0293

We can conclude that the following variables have distributions that are degenerate: “leaf.mild”, “mycelium” and “sclerotia”.

b. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

We can observe below that the predictors most likely to have missing data include "hail", "sever", "seed.tmt" and "lodging", among others. And yes, there is a pattern in the table below: the classes associated with the missing data are "phytophthora-rot", "diaporthe-pod-&-stem-blight", "cyst-nematode", "2-4-d-injury" and "herbicide-injury". Perhaps the data-gathering process was compromised for these classes.

plot_missing(bean_df)

#Count the NAs per class for a subset of the most-affected predictors
ref <- dplyr::select(bean_df, Class, hail, sever, seed.tmt, lodging, germ, leaf.mild, fruiting.bodies)
DT <- data.table(ref)
DT[, lapply(.SD, function(x) sum(is.na(x))), by = list(Class)]
##                           Class hail sever seed.tmt lodging germ leaf.mild fruiting.bodies
##  1:       diaporthe-stem-canker    0     0        0       0    0         0               0
##  2:                charcoal-rot    0     0        0       0    0         0               0
##  3:        rhizoctonia-root-rot    0     0        0       0    0         0               0
##  4:            phytophthora-rot   68    68       68      68   68        55              68
##  5:              brown-stem-rot    0     0        0       0    0         0               0
##  6:              powdery-mildew    0     0        0       0    0         0               0
##  7:                downy-mildew    0     0        0       0    0         0               0
##  8:                  brown-spot    0     0        0       0    0         0               0
##  9:            bacterial-blight    0     0        0       0    0         0               0
## 10:           bacterial-pustule    0     0        0       0    0         0               0
## 11:           purple-seed-stain    0     0        0       0    0         0               0
## 12:                 anthracnose    0     0        0       0    0         0               0
## 13:      phyllosticta-leaf-spot    0     0        0       0    0         0               0
## 14:         alternarialeaf-spot    0     0        0       0    0         0               0
## 15:          frog-eye-leaf-spot    0     0        0       0    0         0               0
## 16: diaporthe-pod-&-stem-blight   15    15       15      15    6        15               0
## 17:               cyst-nematode   14    14       14      14   14        14              14
## 18:                2-4-d-injury   16    16       16      16   16        16              16
## 19:            herbicide-injury    8     8        8       8    8         8               8

c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The class with by far the most missing values is "phytophthora-rot", so if we were to eliminate that class altogether, the percentage of missing values for each affected predictor would be roughly cut in half, as the table and plot below show:

miss_class <- bean_df %>%
  group_by(Class) %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>% #NA count per predictor, per class
  mutate(tot_na = dplyr::select(., date:roots) %>% rowSums()) #row-wise total across predictors
miss_class %>% dplyr::select('Class','tot_na') %>% arrange(-tot_na)
## # A tibble: 19 x 2
##    Class                       tot_na
##    <fct>                        <dbl>
##  1 phytophthora-rot              1214
##  2 2-4-d-injury                   450
##  3 cyst-nematode                  336
##  4 diaporthe-pod-&-stem-blight    177
##  5 herbicide-injury               160
##  6 alternarialeaf-spot              0
##  7 anthracnose                      0
##  8 bacterial-blight                 0
##  9 bacterial-pustule                0
## 10 brown-spot                       0
## 11 brown-stem-rot                   0
## 12 charcoal-rot                     0
## 13 diaporthe-stem-canker            0
## 14 downy-mildew                     0
## 15 frog-eye-leaf-spot               0
## 16 phyllosticta-leaf-spot           0
## 17 powdery-mildew                   0
## 18 purple-seed-stain                0
## 19 rhizoctonia-root-rot             0
bean_df2 <- Soybean %>%
  filter(Class != "phytophthora-rot") %>%
  droplevels() #drop the now-empty "phytophthora-rot" factor level

plot_missing(bean_df2)

As a rule of thumb, the first option is to try to recover the missing data, and since the missingness here is concentrated in a small number of classes, that may well be feasible. If recovery is not an option, imputation with the mice package (multivariate imputation by chained equations) could be worth trying, although with such a small data set imputation is unlikely to improve the predictions much. We would also need a much clearer understanding of each predictor we attempted to impute.
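
If imputation were pursued, a minimal sketch with mice might look like the following (the default per-column methods, the number of imputations, and the seed are illustrative choices, not part of the original analysis):

library(mice)
imp <- mice(bean_df2, m = 5, seed = 123, printFlag = FALSE) #mice picks a method per column type, e.g. polyreg for unordered factors
bean_complete <- complete(imp, 1) #extract the first of the five completed data sets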