Chapter 3: Data Pre-processing

Exercise 3.1

3.1 The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: \(Na, Mg, Al, Si, K, Ca, Ba,\) and \(Fe\).

  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
library(tidyr)
library(dplyr)
library(knitr)
library(utils)
library(ggplot2)
library(mlbench)
library(corrplot)
library(purrr)
library(mice)
library(reshape2)
library(VIM)
library(gridExtra)
data(Glass)
glimpse(Glass)
## Observations: 214
## Variables: 10
## $ RI   <dbl> 1.52101, 1.51761, 1.51618, 1.51766, 1.51742, 1.51596, 1.5...
## $ Na   <dbl> 13.64, 13.89, 13.53, 13.21, 13.27, 12.79, 13.30, 13.15, 1...
## $ Mg   <dbl> 4.49, 3.60, 3.55, 3.69, 3.62, 3.61, 3.60, 3.61, 3.58, 3.6...
## $ Al   <dbl> 1.10, 1.36, 1.54, 1.29, 1.24, 1.62, 1.14, 1.05, 1.37, 1.3...
## $ Si   <dbl> 71.78, 72.73, 72.99, 72.61, 73.08, 72.97, 73.09, 73.24, 7...
## $ K    <dbl> 0.06, 0.48, 0.39, 0.57, 0.55, 0.64, 0.58, 0.57, 0.56, 0.5...
## $ Ca   <dbl> 8.75, 7.83, 7.78, 8.22, 8.07, 8.07, 8.17, 8.24, 8.30, 8.4...
## $ Ba   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Fe   <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.26, 0.00, 0.00, 0.00, 0.1...
## $ Type <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
# For a Correlation Matrix, we need to remove the categorical variable
Glass2 = select(Glass, -Type)
m = cor(Glass2)
corrplot(m)
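
To complement the plot, the strongest pairwise relationships can also be listed numerically. The following is a minimal sketch in base R; the name `cor_pairs` is introduced here for illustration and is not part of the original solution.

```r
library(mlbench)
data(Glass)

# Correlation matrix of the nine numeric predictors (Type is categorical)
m <- cor(Glass[, names(Glass) != "Type"])

# Keep each unordered pair once by taking the upper triangle
idx <- which(upper.tri(m), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(m)[idx[, "row"]],
  var2 = colnames(m)[idx[, "col"]],
  r    = m[idx]
)

# Five pairs with the largest absolute correlation
head(cor_pairs[order(-abs(cor_pairs$r)), ], 5)
```

Sorting by absolute value keeps strong negative correlations in view alongside strong positive ones.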

  2. Do there appear to be any outliers in the data? Are any predictors skewed?
# Show distributions of all variables

Glass2 %>%
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram(bins=16)

The histograms above show a mix of skewed and roughly symmetric distributions. The distributions of \(Al, Ca, Na, RI\) and \(Si\) are roughly normal, while the others (\(Ba, Fe, K\) and \(Mg\)) are clearly skewed.
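
The visual impression can be checked numerically. The sketch below computes a moment-based sample skewness for each predictor and counts the points flagged by the usual boxplot rule (more than 1.5 times the box length away from the box). The helper `skewness()` is written here for illustration and is not taken from the original solution; packages such as e1071 provide equivalents.

```r
library(mlbench)
data(Glass)

# Moment-based skewness: third central moment over the 1.5 power of the
# second central moment (both computed with denominator n)
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

predictors <- Glass[, names(Glass) != "Type"]

# Skewness per predictor, most right-skewed first
skew <- sapply(predictors, skewness)
round(sort(skew, decreasing = TRUE), 2)

# Number of observations that a boxplot would draw as outliers
n_out <- sapply(predictors, function(x) length(boxplot.stats(x)$out))
n_out
```

Values near zero suggest a roughly symmetric distribution; large positive or negative values point to right or left skew.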

  3. Are there any relevant transformations of one or more predictors that might improve the classification model?

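To improve classification, skewed predictors can be made more symmetric with a power transformation such as Box-Cox. In this data set the candidates are \(Ba, Fe, K\) and \(Mg\). Box-Cox is defined only for strictly positive values, and these four predictors contain zeros, so they need either a constant offset first or a transformation that accepts zeros, such as Yeo-Johnson.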
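
As a rough sketch of how such a transformation could be applied, the code below first counts the zeros in each predictor and then fits a Yeo-Johnson transformation, which accepts zero values, using `preProcess()` from the caret package. Using caret is an assumption made for this example; the package is not loaded elsewhere in this solution.

```r
library(mlbench)
library(caret)   # assumption: caret is installed; it is not used in the original solution
data(Glass)

predictors <- Glass[, names(Glass) != "Type"]

# Zeros rule out a plain Box-Cox transformation for the affected columns
zero_counts <- colSums(predictors == 0)
zero_counts

# Estimate a Yeo-Johnson transformation per predictor and apply it
pp <- preProcess(predictors, method = "YeoJohnson")
transformed <- predict(pp, predictors)

summary(transformed)
```

Re-plotting the histograms on `transformed` shows how much each distribution actually changes. A predictor that is zero for most samples keeps its spike at zero under any monotone transformation.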

Exercise 3.2

3.2 The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
data(Soybean)
Soybean = Soybean %>% keep(is.factor)

for (col in 2:ncol(Soybean)) {
    t = table(Soybean[[col]], dnn = colnames(Soybean[col]))
    print(colnames(Soybean[col]))
    print(t)
}
## [1] "date"
## date
##   0   1   2   3   4   5   6 
##  26  75  93 118 131 149  90 
## [1] "plant.stand"
## plant.stand
##   0   1 
## 354 293 
## [1] "precip"
## precip
##   0   1   2 
##  74 112 459 
## [1] "temp"
## temp
##   0   1   2 
##  80 374 199 
## [1] "hail"
## hail
##   0   1 
## 435 127 
## [1] "crop.hist"
## crop.hist
##   0   1   2   3 
##  65 165 219 218 
## [1] "area.dam"
## area.dam
##   0   1   2   3 
## 123 227 145 187 
## [1] "sever"
## sever
##   0   1   2 
## 195 322  45 
## [1] "seed.tmt"
## seed.tmt
##   0   1   2 
## 305 222  35 
## [1] "germ"
## germ
##   0   1   2 
## 165 213 193 
## [1] "plant.growth"
## plant.growth
##   0   1 
## 441 226 
## [1] "leaves"
## leaves
##   0   1 
##  77 606 
## [1] "leaf.halo"
## leaf.halo
##   0   1   2 
## 221  36 342 
## [1] "leaf.marg"
## leaf.marg
##   0   1   2 
## 357  21 221 
## [1] "leaf.size"
## leaf.size
##   0   1   2 
##  51 327 221 
## [1] "leaf.shread"
## leaf.shread
##   0   1 
## 487  96 
## [1] "leaf.malf"
## leaf.malf
##   0   1 
## 554  45 
## [1] "leaf.mild"
## leaf.mild
##   0   1   2 
## 535  20  20 
## [1] "stem"
## stem
##   0   1 
## 296 371 
## [1] "lodging"
## lodging
##   0   1 
## 520  42 
## [1] "stem.cankers"
## stem.cankers
##   0   1   2   3 
## 379  39  36 191 
## [1] "canker.lesion"
## canker.lesion
##   0   1   2   3 
## 320  83 177  65 
## [1] "fruiting.bodies"
## fruiting.bodies
##   0   1 
## 473 104 
## [1] "ext.decay"
## ext.decay
##   0   1   2 
## 497 135  13 
## [1] "mycelium"
## mycelium
##   0   1 
## 639   6 
## [1] "int.discolor"
## int.discolor
##   0   1   2 
## 581  44  20 
## [1] "sclerotia"
## sclerotia
##   0   1 
## 625  20 
## [1] "fruit.pods"
## fruit.pods
##   0   1   2   3 
## 407 130  14  48 
## [1] "fruit.spots"
## fruit.spots
##   0   1   2   4 
## 345  75  57 100 
## [1] "seed"
## seed
##   0   1 
## 476 115 
## [1] "mold.growth"
## mold.growth
##   0   1 
## 524  67 
## [1] "seed.discolor"
## seed.discolor
##   0   1 
## 513  64 
## [1] "seed.size"
## seed.size
##   0   1 
## 532  59 
## [1] "shriveling"
## shriveling
##   0   1 
## 539  38 
## [1] "roots"
## roots
##   0   1   2 
## 551  86  15

The tables above show that in the following variables a single level accounts for the large majority of observations, so they are close to degenerate:

- shriveling
- seed.discolor
- mycelium
- sclerotia
- leaf.mild
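
One way to make "somewhat degenerate" concrete is the frequency ratio: the count of the most common level divided by the count of the second most common. The sketch below uses the counts printed in the tables above. The cutoff of 20 is a rule of thumb and an assumption here, not a value taken from the original solution.

```r
# Level counts copied from the frequency tables above
counts <- list(
  shriveling    = c(539, 38),
  seed.discolor = c(513, 64),
  mycelium      = c(639, 6),
  sclerotia     = c(625, 20),
  leaf.mild     = c(535, 20, 20)
)

# Ratio of the most frequent level to the second most frequent level
freq_ratio <- sapply(counts, function(n) {
  s <- sort(n, decreasing = TRUE)
  s[1] / s[2]
})
freq_ratio

# Assumed rule-of-thumb cutoff: flag ratios above 20
names(freq_ratio)[freq_ratio > 20]
```

With this cutoff only mycelium, sclerotia and leaf.mild are flagged. shriveling and seed.discolor are unbalanced but fall below it.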

  2. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
# mice::md.pattern(Soybean)
# summary(VIM::aggr(Soybean))
aggr_plot <- aggr(Soybean, col=c('navyblue','red'), numbers=TRUE,
                  sortVars=TRUE, labels=names(Soybean), cex.axis=.7, gap=3,
                  ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##         Variable       Count
##             hail 0.177159590
##            sever 0.177159590
##         seed.tmt 0.177159590
##          lodging 0.177159590
##             germ 0.163982430
##        leaf.mild 0.158125915
##  fruiting.bodies 0.155197657
##      fruit.spots 0.155197657
##    seed.discolor 0.155197657
##       shriveling 0.155197657
##      leaf.shread 0.146412884
##             seed 0.134699854
##      mold.growth 0.134699854
##        seed.size 0.134699854
##        leaf.halo 0.122986823
##        leaf.marg 0.122986823
##        leaf.size 0.122986823
##        leaf.malf 0.122986823
##       fruit.pods 0.122986823
##           precip 0.055636896
##     stem.cankers 0.055636896
##    canker.lesion 0.055636896
##        ext.decay 0.055636896
##         mycelium 0.055636896
##     int.discolor 0.055636896
##        sclerotia 0.055636896
##      plant.stand 0.052708638
##            roots 0.045387994
##             temp 0.043923865
##        crop.hist 0.023426061
##     plant.growth 0.023426061
##             stem 0.023426061
##             date 0.001464129
##         area.dam 0.001464129
##            Class 0.000000000
##           leaves 0.000000000
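
The output above ranks predictors by missingness but does not address whether missing values are tied to the classes. A minimal sketch of one way to check this in base R follows; the names `incomplete` and `by_class` are introduced here for illustration.

```r
library(mlbench)
data(Soybean)

# TRUE for rows that contain at least one missing value
incomplete <- !complete.cases(Soybean)

# Share of incomplete rows and share of missing cells
mean(incomplete)
mean(is.na(Soybean))

# Complete vs. incomplete rows within each class
by_class <- table(Class = Soybean$Class, incomplete = incomplete)
by_class

# Proportion of incomplete rows per class, largest first
sort(prop.table(by_class, margin = 1)[, "TRUE"], decreasing = TRUE)
```

If the incomplete rows are concentrated in a few classes, missingness itself is informative about the outcome and should be handled with that in mind.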

  3. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The mice package (multivariate imputation by chained equations) can be used to impute the missing values.

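# Impute with mice: m = 5 imputed data sets, 5 iterations each, fixed seed
imputed.1 <- mice(Soybean, m = 5, maxit = 5, seed = 500, printFlag = FALSE)
# Extract the second of the five completed data sets
# (mice:: is explicit because tidyr also exports complete())
Soybean2 <- mice::complete(imputed.1, 2)
# Count missing (TRUE) and observed (FALSE) cells before and after imputation
table(is.na(Soybean))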
## 
## FALSE  TRUE 
## 22251  2337
table(is.na(Soybean2))
## 
## FALSE 
## 24588
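
For comparison, a much simpler baseline is to fill each predictor with its most frequent level. The sketch below illustrates that swapped-in technique on a toy factor. The helper `impute_mode()` and the toy data are hypothetical and are not part of the original solution or of the mice package.

```r
# Hypothetical helper: replace NA in a factor with its most frequent level
impute_mode <- function(x) {
  counts <- table(x)                         # table() drops NA by default
  mode_level <- names(counts)[which.max(counts)]
  x[is.na(x)] <- mode_level                  # ties resolve to the first level
  x
}

# Toy factor with two missing values
toy <- factor(c("0", "1", NA, "0", NA, "1", "0"))
filled <- impute_mode(toy)

table(toy, useNA = "ifany")
table(filled, useNA = "ifany")
```

Mode imputation ignores relationships between predictors and makes the dominant level even more dominant, which is one reason a model-based approach such as mice may be preferable here.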