Applied Predictive Modeling (Kuhn & Johnson)

Exercise 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

  2. Do there appear to be any outliers in the data? Are any predictors skewed?

  3. Are there any relevant transformations of one or more predictors that might improve the classification model?

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Skewness of each predictor, in column order (RI, Na, Mg, Al, Si, K, Ca, Ba, Fe):

## [1] 1.602715

## [1] 0.4478343

## [1] -1.136452

## [1] 0.8946104

## [1] -0.7202392

## [1] 6.460089

## [1] 2.018446

## [1] 3.36868

## [1] 1.729811

b) Most of the eight element predictors are markedly skewed, to the left or to the right, and some are bimodal. Na, Al, and Si are the closest to a normal distribution. The refractive index (RI) is moderately right-skewed. Several predictors show outliers, most notably Na, Al, K, and Ca.
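The skewness and outlier checks can be sketched in base R. This is a minimal illustration, not necessarily the code behind the figures above: `skew` and `iqr_outliers` are hypothetical helper names, the skewness definition (third central moment over the cubed sample standard deviation) is one of several in common use, and the 1.5 * IQR rule is an assumed outlier criterion.

```r
# Sample skewness: third central moment divided by the cubed standard deviation.
# `skew` is a hypothetical helper name, not a function from any package.
skew <- function(x) {
  x <- x[!is.na(x)]
  m3 <- mean((x - mean(x))^3)
  m3 / sd(x)^3
}

# Flag points more than 1.5 * IQR outside the quartiles (the usual boxplot rule).
iqr_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  x[x < q[1] - fence | x > q[2] + fence]
}

# First ten K values as printed in the str() output above.
k_head <- c(0.06, 0.48, 0.39, 0.57, 0.55, 0.64, 0.58, 0.57, 0.56, 0.57)

skew(k_head)          # negative for these ten values: 0.06 stretches the left tail
iqr_outliers(k_head)  # values the boxplot rule would flag
```

With the full data in a data frame, `sapply(df[, 1:9], skew)` would return one value per predictor (`df` being whatever name the data frame has). Ten rows say little about the full distribution, so the sign here differs from the full-data skewness of K.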
c) Normalizing (centering and scaling to z-scores). The first six rows are shown before and after the transformation:
##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1
##           RI         Na        Mg         Al          Si           K
## 1  0.8708258  0.2842867 1.2517037 -0.6908222 -1.12444556 -0.67013422
## 2 -0.2487502  0.5904328 0.6346799 -0.1700615  0.10207972 -0.02615193
## 3 -0.7196308  0.1495824 0.6000157  0.1904651  0.43776033 -0.16414813
## 4 -0.2322859 -0.2422846 0.6970756 -0.3102663 -0.05284979  0.11184428
## 5 -0.3113148 -0.1688095 0.6485456 -0.4104126  0.55395746  0.08117845
## 6 -0.7920739 -0.7566101 0.6416128  0.3506992  0.41193874  0.21917466
##           Ca         Ba         Fe
## 1 -0.1454254 -0.3520514 -0.5850791
## 2 -0.7918771 -0.3520514 -0.5850791
## 3 -0.8270103 -0.3520514 -0.5850791
## 4 -0.5178378 -0.3520514 -0.5850791
## 5 -0.6232375 -0.3520514 -0.5850791
## 6 -0.6232375 -0.3520514  2.0832652
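Centering and scaling can be sketched with base R's `scale()`. The data frame below holds only the six Na and Mg rows printed above, so its z-scores differ from the ones shown, which use the means and standard deviations of all 214 samples. The name `glass_head` is made up for this illustration.

```r
# First six rows of two predictors, copied from the head() output above.
glass_head <- data.frame(
  Na = c(13.64, 13.89, 13.53, 13.21, 13.27, 12.79),
  Mg = c(4.49, 3.60, 3.55, 3.69, 3.62, 3.61)
)

# scale() subtracts each column mean and divides by each column standard deviation.
z <- as.data.frame(scale(glass_head, center = TRUE, scale = TRUE))

# After the transformation every column has mean 0 and standard deviation 1.
round(colMeans(z), 10)
apply(z, 2, sd)
```

Z-scores put the predictors on a common scale but do not change the shape of their distributions, so skewness has to be handled separately.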
A Box-Cox transformation was attempted first, but lambda could not be estimated for two of the variables. Box-Cox requires strictly positive values, and these predictors contain zeros, as the minimum in the summary below shows. A cube root transformation was applied to the most right-skewed predictors instead; it works for right- or left-skewed data and for zero values.
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1225  0.5550  0.4971  0.6100  6.2100 
## 
## Lambda could not be estimated; no transformation is applied
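Skewness before and after the cube root transformation for K, Ca, and Ba (each pair is before, then after):
## [1] 6.460089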
## [1] -0.5836242
## [1] 2.018446
## [1] 1.38769
## [1] 3.36868
## [1] 2.044037
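The contrast between the two transformations can be sketched in base R. The functions `box_cox` and `cube_root` are hypothetical helpers written for this illustration, and the vector `x` is made-up data that only mimics a predictor with zeros and one large value.

```r
# Box-Cox family: (x^lambda - 1) / lambda, or log(x) when lambda is 0.
# It is only defined for strictly positive x, which is why zeros are a problem.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# The cube root is defined for zero and negative values as well.
# sign() and abs() are used because (-8)^(1/3) is NaN in R.
cube_root <- function(x) sign(x) * abs(x)^(1/3)

# Hypothetical right-skewed sample containing zeros.
x <- c(0, 0, 0, 0.06, 0.39, 0.48, 0.55, 0.57, 0.64, 6.21)

cube_root(x)                # works, and pulls the largest value toward the rest
can_box_cox <- all(x > 0)   # FALSE: a zero rules out Box-Cox without an offset
```

Adding a small constant before Box-Cox is another common workaround, at the cost of an arbitrary choice of offset.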

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

  2. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

  3. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $date
## 
## 
## Var1    Freq
## -----  -----
## 0         26
## 1         75
## 2         93
## 3        118
## 4        131
## 5        149
## 6         90
## 
## $plant.stand
## 
## 
## Var1    Freq
## -----  -----
## 0        354
## 1        293
## 
## $precip
## 
## 
## Var1    Freq
## -----  -----
## 0         74
## 1        112
## 2        459
## 
## $temp
## 
## 
## Var1    Freq
## -----  -----
## 0         80
## 1        374
## 2        199
## 
## $hail
## 
## 
## Var1    Freq
## -----  -----
## 0        435
## 1        127
## 
## $crop.hist
## 
## 
## Var1    Freq
## -----  -----
## 0         65
## 1        165
## 2        219
## 3        218
## 
## $area.dam
## 
## 
## Var1    Freq
## -----  -----
## 0        123
## 1        227
## 2        145
## 3        187
## 
## $sever
## 
## 
## Var1    Freq
## -----  -----
## 0        195
## 1        322
## 2         45
## 
## $seed.tmt
## 
## 
## Var1    Freq
## -----  -----
## 0        305
## 1        222
## 2         35
## 
## $germ
## 
## 
## Var1    Freq
## -----  -----
## 0        165
## 1        213
## 2        193
## 
## $plant.growth
## 
## 
## Var1    Freq
## -----  -----
## 0        441
## 1        226
## 
## $leaves
## 
## 
## Var1    Freq
## -----  -----
## 0         77
## 1        606
## 
## $leaf.halo
## 
## 
## Var1    Freq
## -----  -----
## 0        221
## 1         36
## 2        342
## 
## $leaf.marg
## 
## 
## Var1    Freq
## -----  -----
## 0        357
## 1         21
## 2        221
## 
## $leaf.size
## 
## 
## Var1    Freq
## -----  -----
## 0         51
## 1        327
## 2        221
## 
## $leaf.shread
## 
## 
## Var1    Freq
## -----  -----
## 0        487
## 1         96
## 
## $leaf.malf
## 
## 
## Var1    Freq
## -----  -----
## 0        554
## 1         45
## 
## $leaf.mild
## 
## 
## Var1    Freq
## -----  -----
## 0        535
## 1         20
## 2         20
## 
## $stem
## 
## 
## Var1    Freq
## -----  -----
## 0        296
## 1        371
## 
## $lodging
## 
## 
## Var1    Freq
## -----  -----
## 0        520
## 1         42
## 
## $stem.cankers
## 
## 
## Var1    Freq
## -----  -----
## 0        379
## 1         39
## 2         36
## 3        191
## 
## $canker.lesion
## 
## 
## Var1    Freq
## -----  -----
## 0        320
## 1         83
## 2        177
## 3         65
## 
## $fruiting.bodies
## 
## 
## Var1    Freq
## -----  -----
## 0        473
## 1        104
## 
## $ext.decay
## 
## 
## Var1    Freq
## -----  -----
## 0        497
## 1        135
## 2         13
## 
## $mycelium
## 
## 
## Var1    Freq
## -----  -----
## 0        639
## 1          6
## 
## $int.discolor
## 
## 
## Var1    Freq
## -----  -----
## 0        581
## 1         44
## 2         20
## 
## $sclerotia
## 
## 
## Var1    Freq
## -----  -----
## 0        625
## 1         20
## 
## $fruit.pods
## 
## 
## Var1    Freq
## -----  -----
## 0        407
## 1        130
## 2         14
## 3         48
## 
## $fruit.spots
## 
## 
## Var1    Freq
## -----  -----
## 0        345
## 1         75
## 2         57
## 4        100
## 
## $seed
## 
## 
## Var1    Freq
## -----  -----
## 0        476
## 1        115
## 
## $mold.growth
## 
## 
## Var1    Freq
## -----  -----
## 0        524
## 1         67
## 
## $seed.discolor
## 
## 
## Var1    Freq
## -----  -----
## 0        513
## 1         64
## 
## $seed.size
## 
## 
## Var1    Freq
## -----  -----
## 0        532
## 1         59
## 
## $shriveling
## 
## 
## Var1    Freq
## -----  -----
## 0        539
## 1         38
## 
## $roots
## 
## 
## Var1    Freq
## -----  -----
## 0        551
## 1         86
## 2         15
a) No predictor is fully degenerate (a single value), but mycelium comes closest: 639 cases at level 0 against 6 at level 1, a ratio of about 107 to 1. Sclerotia (625 vs. 20) and shriveling (539 vs. 38) follow.
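One way to rank the predictors is the ratio of the most frequent level to the second most frequent level, which the chapter suggests treating as large at around 20. The sketch below computes it from counts copied out of the frequency tables above; `freq_ratio` is a hypothetical helper name, and leaf.mild is added for comparison.

```r
# Level counts copied from the frequency tables above.
counts <- list(
  mycelium   = c(`0` = 639, `1` = 6),
  sclerotia  = c(`0` = 625, `1` = 20),
  leaf.mild  = c(`0` = 535, `1` = 20, `2` = 20),
  shriveling = c(`0` = 539, `1` = 38)
)

# Ratio of the most common level to the second most common level.
freq_ratio <- function(n) {
  s <- sort(n, decreasing = TRUE)
  unname(s[1] / s[2])
}

ratios <- sapply(counts, freq_ratio)
round(sort(ratios, decreasing = TRUE), 2)
# mycelium 106.50, sclerotia 31.25, leaf.mild 26.75, shriveling 14.18
```

By that rule of thumb, mycelium, sclerotia, and leaf.mild exceed a ratio of 20, while shriveling does not.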
Proportion of cases with at least one missing value (121 of 683):
## [1] 0.1771596
Missing values per column:

Column            Missing
----------------  -------
Class                   0
date                    1
plant.stand            36
precip                 38
temp                   30
hail                  121
crop.hist              16
area.dam                1
sever                 121
seed.tmt              121
germ                  112
plant.growth           16
leaves                  0
leaf.halo              84
leaf.marg              84
leaf.size              84
leaf.shread           100
leaf.malf              84
leaf.mild             108
stem                   16
lodging               121
stem.cankers           38
canker.lesion          38
fruiting.bodies       106
ext.decay              38
mycelium               38
int.discolor           38
sclerotia              38
fruit.pods             84
fruit.spots           106
seed                   92
mold.growth            92
seed.discolor         106
seed.size              92
shriveling            106
roots                  31

Cases with at least one missing value, by class:

Class                          n
----------------------------  ---
2-4-d-injury                   16
cyst-nematode                  14
diaporthe-pod-&-stem-blight    15
herbicide-injury                8
phytophthora-rot               68

All cases, by class:

Class                          n
----------------------------  ---
2-4-d-injury                   16
alternarialeaf-spot            91
anthracnose                    44
bacterial-blight               20
bacterial-pustule              20
brown-spot                     92
brown-stem-rot                 44
charcoal-rot                   20
cyst-nematode                  14
diaporthe-pod-&-stem-blight    15
diaporthe-stem-canker          20
downy-mildew                   20
frog-eye-leaf-spot             91
herbicide-injury                8
phyllosticta-leaf-spot         20
phytophthora-rot               88
powdery-mildew                 20
purple-seed-stain              20
rhizoctonia-root-rot           20

b) Several predictors are far more likely to be missing. Using a threshold of more than 100 missing values, they are: hail, sever, seed.tmt, germ, leaf.mild, lodging, fruiting.bodies, fruit.spots, seed.discolor, and shriveling.
The missing values are concentrated in five classes: 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury, and phytophthora-rot. Every case in the first four classes is incomplete (16 of 16, 14 of 14, 15 of 15, and 8 of 8), as are 68 of the 88 phytophthora-rot cases. No other class has a missing value.
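Counts like these can be obtained with a few base R calls. The sketch below uses a made-up miniature data frame (`soy`, with invented rows) rather than the real soybean data, so only the pattern of the calls carries over, not the numbers.

```r
# Hypothetical miniature of the soybean data: a class label and two predictors.
soy <- data.frame(
  Class = c("phytophthora-rot", "phytophthora-rot", "brown-spot",
            "herbicide-injury", "brown-spot"),
  hail  = factor(c(NA, "0", "1", NA, "0")),
  germ  = factor(c(NA, "1", "2", NA, NA))
)

# Missing values per column.
na_per_predictor <- colSums(is.na(soy))

# Rows with at least one missing value, then a count of those rows per class.
incomplete <- !complete.cases(soy)
na_per_class <- table(soy$Class[incomplete])

na_per_predictor
na_per_class
mean(incomplete)   # share of cases that are incomplete
```

Comparing `na_per_class` with `table(soy$Class)` shows whether a class is entirely or only partly affected.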
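Cases with at least one missing value:
## [1] 121
Complete cases:
## [1] 562
## Warning in median(as.numeric(x)): NAs introduced by coercion
Cases with missing values kept for imputation:
## [1] 33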
The 33 remaining cases with missing values (the first column is the row number in the original data):
Row Class date plant.stand precip temp hail crop.hist area.dam sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
41 phytophthora-rot 2 1 2 1 NA 1 1 NA NA NA 1 1 0 2 2 0 0 0 1 NA 2 2 NA 0 0 0 0 NA NA NA NA NA NA NA 1
48 phytophthora-rot 2 1 1 2 NA 2 1 NA NA NA 1 1 0 2 2 0 0 0 1 NA 2 2 NA 0 0 0 0 NA NA NA NA NA NA NA 1
58 phytophthora-rot 2 1 2 2 NA 1 1 NA NA NA 1 1 NA NA NA NA NA NA 1 NA 2 2 NA 0 0 0 0 NA NA NA NA NA NA NA 1
59 phytophthora-rot 3 1 1 2 NA 2 1 NA NA NA 1 1 0 2 2 0 0 0 1 NA 2 2 NA 0 0 0 0 NA NA NA NA NA NA NA 1
61 phytophthora-rot 2 1 2 2 NA 3 1 NA NA NA 1 1 0 2 2 0 0 0 1 NA 2 2 NA 0 0 0 0 NA NA NA NA NA NA NA 1
62 phytophthora-rot 3 1 1 1 NA 0 1 NA NA NA 1 1 0 2 2 0 0 0 1 NA 3 2 NA 0 0 0 0 NA NA NA NA NA NA NA 1
64 phytophthora-rot 3 1 1 1 NA 1 1 NA NA NA 1 1 0 2 2 0 0 0 1 NA 3 2 NA 0 0 0 0 NA NA NA NA NA NA NA 1
292 diaporthe-pod-&-stem-blight 6 0 2 2 NA 2 3 NA NA 1 0 0 NA NA NA NA NA NA 1 NA 0 0 1 0 0 0 0 1 2 1 1 1 1 1 NA
293 diaporthe-pod-&-stem-blight 5 0 2 2 NA 3 3 NA NA 0 0 0 NA NA NA NA NA NA 1 NA 0 0 1 0 0 0 0 1 2 1 1 1 1 1 NA
295 diaporthe-pod-&-stem-blight 5 NA 2 2 NA 2 3 NA NA NA 0 0 NA NA NA NA NA NA 1 NA 0 0 1 0 0 0 0 1 2 1 1 1 1 1 NA
296 diaporthe-pod-&-stem-blight 5 0 2 2 NA 2 3 NA NA 0 0 0 NA NA NA NA NA NA 1 NA 0 0 1 0 0 0 0 1 2 1 1 1 1 1 NA
342 phytophthora-rot 3 1 1 1 NA 2 1 NA NA NA 1 1 0 2 2 0 0 0 1 NA 3 2 NA 0 0 0 0 NA NA NA NA NA NA NA 1
343 phytophthora-rot 3 1 1 1 NA 3 1 NA NA NA 1 1 0 2 2 0 0 0 1 NA 2 2 NA 0 0 0 0 NA NA NA NA NA NA NA 1
347 phytophthora-rot 4 1 1 2 NA 3 1 NA NA NA 1 1 NA NA NA NA NA NA 1 NA 3 2 NA 0 0 0 0 NA NA NA NA NA NA NA 1
348 phytophthora-rot 1 1 2 2 NA 2 1 NA NA NA 1 1 NA NA NA NA NA NA 1 NA 1 2 NA 2 0 0 0 NA NA NA NA NA NA NA 1
355 phytophthora-rot 2 1 1 2 NA 2 1 NA NA NA 1 1 NA NA NA NA NA NA 1 NA 1 2 NA 2 0 0 0 NA NA NA NA NA NA NA 1
357 phytophthora-rot 3 1 1 1 NA 1 1 NA NA NA 1 1 NA NA NA NA NA NA 1 NA 3 2 NA 0 0 0 0 NA NA NA NA NA NA NA 1
358 phytophthora-rot 1 1 2 2 NA 1 1 NA NA NA 1 1 NA NA NA NA NA NA 1 NA 0 2 NA 2 0 0 0 NA NA NA NA NA NA NA 1
360 phytophthora-rot 3 1 1 1 NA 3 1 NA NA NA 1 1 NA NA NA NA NA NA 1 NA 3 2 NA 0 0 0 0 NA NA NA NA NA NA NA 1
361 phytophthora-rot 4 1 1 1 NA 2 1 NA NA NA 1 1 NA NA NA NA NA NA 1 NA 3 2 NA 0 0 0 0 NA NA NA NA NA NA NA 1
362 phytophthora-rot 1 1 2 2 NA 3 1 NA NA NA 1 1 NA NA NA NA NA NA 1 NA 0 2 NA 2 0 0 0 NA NA NA NA NA NA NA 1
363 phytophthora-rot 2 1 2 2 NA 2 1 NA NA NA 1 1 NA NA NA NA NA NA 1 NA 2 2 NA 0 0 0 0 NA NA NA NA NA NA NA 1
373 phytophthora-rot 2 1 2 2 NA 3 1 NA NA NA 1 1 NA NA NA NA NA NA 1 NA 2 2 NA 0 0 0 0 NA NA NA NA NA NA NA 1
375 phytophthora-rot 4 1 1 1 NA 3 1 NA NA NA 1 1 NA NA NA NA NA NA 1 NA 3 2 NA 0 0 0 0 NA NA NA NA NA NA NA 1
376 phytophthora-rot 1 1 2 1 NA 2 1 NA NA NA 1 1 NA NA NA NA NA NA 1 NA 2 2 NA 2 0 0 0 NA NA NA NA NA NA NA 1
378 phytophthora-rot 3 1 1 1 NA 1 1 NA NA NA 1 1 NA NA NA NA NA NA 1 NA 3 2 NA 0 0 0 0 NA NA NA NA NA NA NA 1
379 phytophthora-rot 4 1 1 1 NA 1 1 NA NA NA 1 1 NA NA NA NA NA NA 1 NA 3 2 NA 0 0 0 0 NA NA NA NA NA NA NA 1
383 phytophthora-rot 2 1 1 1 NA 3 1 NA NA NA 1 1 NA NA NA NA NA NA 1 NA 3 2 NA 0 0 0 0 NA NA NA NA NA NA NA 1
384 phytophthora-rot 3 1 1 1 NA 2 1 NA NA NA 1 1 NA NA NA NA NA NA 1 NA 3 2 NA 0 0 0 0 NA NA NA NA NA NA NA 1
648 diaporthe-pod-&-stem-blight 6 NA 2 2 NA 2 3 NA NA NA 0 0 NA NA NA NA NA NA 1 NA 0 0 1 0 0 0 0 1 2 1 1 1 1 1 NA
649 diaporthe-pod-&-stem-blight 6 NA 2 2 NA 1 3 NA NA NA 0 0 NA NA NA NA NA NA 1 NA 0 0 1 0 0 0 0 1 2 1 1 1 1 1 NA
650 diaporthe-pod-&-stem-blight 5 NA 2 2 NA 1 3 NA NA NA 0 0 NA NA NA NA NA NA 1 NA 0 0 1 0 0 0 0 1 2 1 1 1 1 1 NA
651 diaporthe-pod-&-stem-blight 6 NA 2 2 NA 3 3 NA NA NA 0 0 NA NA NA NA NA NA 1 NA 0 0 1 0 0 0 0 1 2 1 1 1 1 1 NA
c) 121 of the 683 cases have at least one missing value; the other 562 are complete. The strategy is first to drop the records with more than 20 missing values (more than half of the columns in a single record), and then to drop the records for which the median is NA, meaning a replacement value cannot be derived from the column values. This reduces the 121 cases with missing values to 33 that can be imputed, using techniques chosen according to the characteristics of each predictor.
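The text does not name a specific imputation technique, so the sketch below uses mode imputation (filling each gap with the column's most frequent level) purely as an illustration for categorical predictors. The helpers `keep_rows`, `mode_level`, and `impute_mode` and the data frame `toy` are all hypothetical.

```r
# Keep rows with at most `max_na` missing values (20 in the strategy above).
keep_rows <- function(df, max_na = 20) {
  df[rowSums(is.na(df)) <= max_na, , drop = FALSE]
}

# Most frequent non-missing level of a factor.
mode_level <- function(f) names(which.max(table(f)))

# Replace NA in every factor column with that column's most frequent level.
impute_mode <- function(df) {
  for (col in names(df)) {
    if (is.factor(df[[col]]) && anyNA(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- mode_level(df[[col]])
    }
  }
  df
}

# Hypothetical example with two categorical predictors.
toy <- data.frame(
  hail = factor(c("0", "0", NA, "1")),
  germ = factor(c("1", NA, "1", "2"), ordered = TRUE)
)
imputed <- impute_mode(keep_rows(toy))
```

Because the missing values cluster in a few classes, computing the mode within each class rather than over the whole column may be a better fit, though that would need checking against the data.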