624_4

3.1.

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:

library(mlbench)  # provides the Glass data set
data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

library(psych)  # provides describe()
describe(Glass)

##       vars   n  mean   sd median trimmed  mad   min   max range  skew kurtosis
## RI       1 214  1.52 0.00   1.52    1.52 0.00  1.51  1.53  0.02  1.60     4.72
## Na       2 214 13.41 0.82  13.30   13.38 0.64 10.73 17.38  6.65  0.45     2.90
## Mg       3 214  2.68 1.44   3.48    2.87 0.30  0.00  4.49  4.49 -1.14    -0.45
## Al       4 214  1.44 0.50   1.36    1.41 0.31  0.29  3.50  3.21  0.89     1.94
## Si       5 214 72.65 0.77  72.79   72.71 0.57 69.81 75.41  5.60 -0.72     2.82
## K        6 214  0.50 0.65   0.56    0.43 0.17  0.00  6.21  6.21  6.46    52.87
## Ca       7 214  8.96 1.42   8.60    8.74 0.66  5.43 16.19 10.76  2.02     6.41
## Ba       8 214  0.18 0.50   0.00    0.03 0.00  0.00  3.15  3.15  3.37    12.08
## Fe       9 214  0.06 0.10   0.00    0.04 0.00  0.00  0.51  0.51  1.73     2.52
## Type*   10 214  2.54 1.71   2.00    2.31 1.48  1.00  6.00  5.00  1.04    -0.29
##         se
## RI    0.00
## Na    0.06
## Mg    0.10
## Al    0.03
## Si    0.05
## K     0.04
## Ca    0.10
## Ba    0.03
## Fe    0.01
## Type* 0.12

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

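# Density plot of each predictor (melt() keeps the factor Type as the id variable)
library(ggplot2)
library(reshape2)
ggplot(melt(Glass), aes(value)) +
    facet_wrap(~ variable, scales = "free") +
    geom_density()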

# Plot the correlations between the nine predictors (Type, column 10, is excluded)
library(corrplot)
cor.glass <- cor(Glass[-10], use = "pairwise.complete.obs")
corrplot(cor.glass, method = "color", addCoef.col = "black", na.label = "NA", number.cex = 0.5, tl.cex = 0.5)

There are no missing values in the data frame. Mg, K, Ba, and Fe all contain zeros.
Ca and RI are strongly correlated (0.81). Some other pairs show moderate correlation: Ba/Al (0.48), Si/RI (-0.54), and Ba/Mg (-0.49).
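
The correlated pairs can also be listed directly instead of being read off the plot. The following is a minimal sketch, assuming the mlbench package is installed; the 0.45 cutoff and the names `cor_pairs` and `strong` are illustrative choices, not part of the original analysis.

```r
# Load a fresh copy of Glass into its own environment (assumes mlbench is installed)
env <- new.env()
data(Glass, package = "mlbench", envir = env)

# Correlation matrix of the nine numeric predictors (column 10 is Type)
cor_mat <- cor(env$Glass[, 1:9])

# Keep each pair once by blanking the diagonal and the lower triangle
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- NA

# Long format: one row per predictor pair
cor_pairs <- as.data.frame(as.table(cor_mat), stringsAsFactors = FALSE)
names(cor_pairs) <- c("var1", "var2", "r")
cor_pairs <- cor_pairs[!is.na(cor_pairs$r), ]

# Pairs whose absolute correlation exceeds the illustrative cutoff
strong <- cor_pairs[abs(cor_pairs$r) > 0.45, ]
strong <- strong[order(-abs(strong$r)), ]
strong
```

The first row should be the RI/Ca pair reported above; lowering the cutoff brings in the weaker pairs.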

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

K might have outliers, as its right tail is very long and thin. RI, Ca, Ba, and Fe also appear to be right-skewed, while Mg is left-skewed.

library(e1071)  # provides skewness()
skewValues <- apply(Glass[-10], 2, skewness)
skewValues

##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107
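
To put a rough number on the outlier question, one option is to count the points that fall beyond the boxplot whiskers for each predictor. This is only a sketch under the usual 1.5 * IQR heuristic, assuming mlbench is installed; the name `outlier_counts` is illustrative.

```r
# Load a fresh copy of Glass into its own environment (assumes mlbench is installed)
env <- new.env()
data(Glass, package = "mlbench", envir = env)
predictors <- env$Glass[, 1:9]

# boxplot.stats()$out returns the values lying beyond the whiskers
outlier_counts <- sapply(predictors, function(x) length(boxplot.stats(x)$out))

# Predictors with the most flagged points first
sort(outlier_counts, decreasing = TRUE)
```

The rule is a heuristic rather than a test: for a predictor that is mostly zeros, the quartiles can coincide, and then every non-zero value is flagged. The counts are best read alongside the density plots.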

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

library(caret)  # provides preProcess() and spatialSign()
glass_transformed <- preProcess(Glass[-10], method = c('BoxCox', 'center', 'scale'))
glass_transformed

## Created from 214 samples and 9 variables
## 
## Pre-processing:
##   - Box-Cox transformation (5)
##   - centered (9)
##   - ignored (0)
##   - scaled (9)
## 
## Lambda estimates for Box-Cox transformation:
## -2, -0.1, 0.5, 2, -1.1

glass_transformed_df <- predict(glass_transformed,Glass[-10])
ggplot(melt(glass_transformed_df),aes(value)) +                  
    facet_wrap(~ variable, scales = "free") +   
    geom_density()

ss_df <- spatialSign(glass_transformed_df)
ggplot(melt(ss_df),aes(value)) +                  
    facet_wrap(~ Var2, scales = "free") +   
    geom_density()

library(dplyr)  # provides %>%
# Add log(x + 1) versions of the skewed predictors (the +1 keeps zeros finite)
Glass <- Glass %>%
  transform(K_log = log(K + 1),
            Ba_log = log(Ba + 1),
            Ca_log = log(Ca + 1),
            Fe_log = log(Fe + 1))
  
ggplot(melt(Glass),aes(value)) +                  
    facet_wrap(~ variable, scales = "free") +   
    geom_density()

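Centering, scaling, Box-Cox, and the spatial sign transformation did not seem sufficient to improve the distributions of the most problematic variables. Box-Cox was applied to only five of the nine predictors because it requires strictly positive values, and Mg, K, Ba, and Fe contain zeros.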
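
The count of five Box-Cox transformations in the preProcess output can be checked by looking for predictors with non-positive values. A minimal sketch, assuming mlbench is installed; `has_nonpositive` and `skipped` are illustrative names.

```r
# Load a fresh copy of Glass so the added *_log columns do not interfere
env <- new.env()
data(Glass, package = "mlbench", envir = env)
predictors <- env$Glass[, 1:9]

# Box-Cox is defined only for strictly positive values
has_nonpositive <- sapply(predictors, function(x) any(x <= 0))

# Predictors for which Box-Cox cannot be estimated
skipped <- names(has_nonpositive)[has_nonpositive]
skipped

# Number of predictors left for Box-Cox
sum(!has_nonpositive)
```

A possible alternative for the skipped predictors is the Yeo-Johnson transformation, which tolerates zeros and is available in caret as `method = "YeoJohnson"`. Whether it helps with predictors that are mostly zeros, such as Ba and Fe, would need to be checked on the plots.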

3.2.

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:

data(Soybean)
str(Soybean)

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

describe(Soybean)

##                  vars   n mean   sd median trimmed  mad min max range  skew
## Class*              1 683 9.30 5.51      8    9.18 7.41   1  19    18  0.11
## date*               2 682 4.55 1.69      5    4.62 1.48   1   7     6 -0.30
## plant.stand*        3 647 1.45 0.50      1    1.44 0.00   1   2     1  0.19
## precip*             4 645 2.60 0.69      3    2.74 0.00   1   3     2 -1.42
## temp*               5 653 2.18 0.63      2    2.23 0.00   1   3     2 -0.16
## hail*               6 562 1.23 0.42      1    1.16 0.00   1   2     1  1.31
## crop.hist*          7 667 2.88 0.98      3    2.98 1.48   1   4     3 -0.40
## area.dam*           8 682 2.58 1.07      2    2.60 1.48   1   4     3  0.02
## sever*              9 562 1.73 0.60      2    1.69 0.00   1   3     2  0.17
## seed.tmt*          10 562 1.52 0.61      1    1.45 0.00   1   3     2  0.74
## germ*              11 571 2.05 0.79      2    2.06 1.48   1   3     2 -0.09
## plant.growth*      12 667 1.34 0.47      1    1.30 0.00   1   2     1  0.68
## leaves*            13 683 1.89 0.32      2    1.98 0.00   1   2     1 -2.44
## leaf.halo*         14 599 2.20 0.95      3    2.25 0.00   1   3     2 -0.41
## leaf.marg*         15 599 1.77 0.96      1    1.72 0.00   1   3     2  0.46
## leaf.size*         16 599 2.28 0.61      2    2.34 0.00   1   3     2 -0.25
## leaf.shread*       17 583 1.16 0.37      1    1.08 0.00   1   2     1  1.80
## leaf.malf*         18 599 1.08 0.26      1    1.00 0.00   1   2     1  3.22
## leaf.mild*         19 575 1.10 0.40      1    1.00 0.00   1   3     2  3.95
## stem*              20 667 1.56 0.50      2    1.57 0.00   1   2     1 -0.23
## lodging*           21 562 1.07 0.26      1    1.00 0.00   1   2     1  3.23
## stem.cankers*      22 645 2.06 1.35      1    1.95 0.00   1   4     3  0.61
## canker.lesion*     23 645 1.98 1.08      2    1.85 1.48   1   4     3  0.51
## fruiting.bodies*   24 577 1.18 0.38      1    1.10 0.00   1   2     1  1.66
## ext.decay*         25 645 1.25 0.48      1    1.16 0.00   1   3     2  1.70
## mycelium*          26 645 1.01 0.10      1    1.00 0.00   1   2     1 10.20
## int.discolor*      27 645 1.13 0.42      1    1.00 0.00   1   3     2  3.34
## sclerotia*         28 645 1.03 0.17      1    1.00 0.00   1   2     1  5.40
## fruit.pods*        29 599 1.50 0.88      1    1.28 0.00   1   4     3  1.84
## fruit.spots*       30 577 1.85 1.17      1    1.69 0.00   1   4     3  0.95
## seed*              31 591 1.19 0.40      1    1.12 0.00   1   2     1  1.54
## mold.growth*       32 591 1.11 0.32      1    1.02 0.00   1   2     1  2.43
## seed.discolor*     33 577 1.11 0.31      1    1.02 0.00   1   2     1  2.47
## seed.size*         34 591 1.10 0.30      1    1.00 0.00   1   2     1  2.66
## shriveling*        35 577 1.07 0.25      1    1.00 0.00   1   2     1  3.49
## roots*             36 652 1.18 0.44      1    1.07 0.00   1   3     2  2.46
##                  kurtosis   se
## Class*              -1.38 0.21
## date*               -0.90 0.06
## plant.stand*        -1.97 0.02
## precip*              0.55 0.03
## temp*               -0.58 0.02
## hail*               -0.29 0.02
## crop.hist*          -0.92 0.04
## area.dam*           -1.29 0.04
## sever*              -0.56 0.03
## seed.tmt*           -0.44 0.03
## germ*               -1.40 0.03
## plant.growth*       -1.54 0.02
## leaves*              3.98 0.01
## leaf.halo*          -1.76 0.04
## leaf.marg*          -1.75 0.04
## leaf.size*          -0.63 0.02
## leaf.shread*         1.26 0.02
## leaf.malf*           8.35 0.01
## leaf.mild*          14.68 0.02
## stem*               -1.95 0.02
## lodging*             8.42 0.01
## stem.cankers*       -1.51 0.05
## canker.lesion*      -1.24 0.04
## fruiting.bodies*     0.75 0.02
## ext.decay*           1.98 0.02
## mycelium*          102.18 0.00
## int.discolor*       10.57 0.02
## sclerotia*          27.19 0.01
## fruit.pods*          2.41 0.04
## fruit.spots*        -0.76 0.05
## seed*                0.37 0.02
## mold.growth*         3.93 0.01
## seed.discolor*       4.12 0.01
## seed.size*           5.10 0.01
## shriveling*         10.21 0.01
## roots*               5.49 0.02

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

All 35 predictors are categorical, with levels coded as integers that carry no meaning on their own. Some, such as precipitation and temperature, appear to be continuous measurements binned into categories. Calling ?Soybean gives the meaning of each code:

[,1] Class the 19 classes
[,2] date apr(0),may(1),june(2),july(3),aug(4),sept(5),oct(6).
[,3] plant.stand normal(0),lt-normal(1).
[,4] precip lt-norm(0),norm(1),gt-norm(2).
[,5] temp lt-norm(0),norm(1),gt-norm(2).
[,6] hail yes(0),no(1).
[,7] crop.hist dif-lst-yr(0),s-l-y(1),s-l-2-y(2), s-l-7-y(3).
[,8] area.dam scatter(0),low-area(1),upper-ar(2),whole-field(3).
[,9] sever minor(0),pot-severe(1),severe(2).
[,10] seed.tmt none(0),fungicide(1),other(2).
[,11] germ 90-100%(0),80-89%(1),lt-80%(2).
[,12] plant.growth norm(0),abnorm(1).
[,13] leaves norm(0),abnorm(1).
[,14] leaf.halo absent(0),yellow-halos(1),no-yellow-halos(2).
[,15] leaf.marg w-s-marg(0),no-w-s-marg(1),dna(2).
[,16] leaf.size lt-1/8(0),gt-1/8(1),dna(2).
[,17] leaf.shread absent(0),present(1).
[,18] leaf.malf absent(0),present(1).
[,19] leaf.mild absent(0),upper-surf(1),lower-surf(2).
[,20] stem norm(0),abnorm(1).
[,21] lodging yes(0),no(1).
[,22] stem.cankers absent(0),below-soil(1),above-s(2),ab-sec-nde(3).
[,23] canker.lesion dna(0),brown(1),dk-brown-blk(2),tan(3).
[,24] fruiting.bodies absent(0),present(1).
[,25] ext.decay absent(0),firm-and-dry(1),watery(2).
[,26] mycelium absent(0),present(1).
[,27] int.discolor none(0),brown(1),black(2).
[,28] sclerotia absent(0),present(1).
[,29] fruit.pods norm(0),diseased(1),few-present(2),dna(3).
[,30] fruit.spots absent(0),col(1),br-w/blk-speck(2),distort(3),dna(4).
[,31] seed norm(0),abnorm(1).
[,32] mold.growth absent(0),present(1).
[,33] seed.discolor absent(0),present(1).
[,34] seed.size norm(0),lt-norm(1).
[,35] shriveling absent(0),present(1).
[,36] roots norm(0),rotted(1),galls-cysts(2).

Additionally, there are near-zero variance predictors, dominated by a single level, that carry little information and can be disregarded:

names(Soybean[,nearZeroVar(Soybean)])

## [1] "leaf.mild" "mycelium"  "sclerotia"
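
To see why these three predictors are flagged, the two quantities behind the near-zero variance rule can be computed by hand. The sketch below assumes mlbench is installed and uses thresholds that mirror caret's documented defaults (freqCut = 95/5, uniqueCut = 10); it is an approximation of the rule, not caret's own code, and `freq_ratio`, `pct_unique`, and `flagged` are illustrative names.

```r
# Load a fresh copy of Soybean into its own environment (assumes mlbench is installed)
env <- new.env()
data(Soybean, package = "mlbench", envir = env)
predictors <- env$Soybean[, -1]  # drop the Class outcome

# Ratio of the most common level's count to the second most common level's count
freq_ratio <- sapply(predictors, function(x) {
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NAs by default
  counts[[1]] / counts[[2]]
})

# Percentage of distinct non-missing values relative to the number of rows
pct_unique <- sapply(predictors, function(x) {
  100 * length(unique(na.omit(x))) / length(x)
})

# Flag predictors that are both dominated by one level and low in distinct values
flagged <- names(predictors)[freq_ratio > 95 / 5 & pct_unique < 10]
flagged

# Frequency ratios of the flagged predictors
round(freq_ratio[flagged], 2)
```

Every predictor here has at most seven levels, so the percent-unique condition is always met and the frequency ratio decides the outcome.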

par(mar=c(2,1,2,1),mfrow=c(5,7))
for(i in colnames(Soybean[-1])){
  tab <- table(Soybean[i], useNA='always')
  names(tab)[is.na(names(tab))] <- "NA"
  barplot(tab, main=i)
}

Almost all variables have some missing data; leaves is the only predictor with none. In some cases the number of NAs equals or exceeds the count of an individual factor level (e.g., sever, fruit.spots).

(b) Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

pred_na <- Soybean[-1] %>% 
  apply(2, is.na) %>% 
  apply(2, sum, na.rm=T)
as.data.frame(pred_na[order(-pred_na)])

##                 pred_na[order(-pred_na)]
## hail                                 121
## sever                                121
## seed.tmt                             121
## lodging                              121
## germ                                 112
## leaf.mild                            108
## fruiting.bodies                      106
## fruit.spots                          106
## seed.discolor                        106
## shriveling                           106
## leaf.shread                          100
## seed                                  92
## mold.growth                           92
## seed.size                             92
## leaf.halo                             84
## leaf.marg                             84
## leaf.size                             84
## leaf.malf                             84
## fruit.pods                            84
## precip                                38
## stem.cankers                          38
## canker.lesion                         38
## ext.decay                             38
## mycelium                              38
## int.discolor                          38
## sclerotia                             38
## plant.stand                           36
## roots                                 31
## temp                                  30
## crop.hist                             16
## plant.growth                          16
## stem                                  16
## date                                   1
## area.dam                               1
## leaves                                 0

Soybean$na <- apply(Soybean, 1, function(x) sum(is.na(x)))
Soybean %>%
  select(Class, na) %>%
  group_by(Class) %>%
  summarise(na = sum(na)) %>%
  arrange(desc(na))

## # A tibble: 19 x 2
##    Class                          na
##    <fct>                       <int>
##  1 phytophthora-rot             1214
##  2 2-4-d-injury                  450
##  3 cyst-nematode                 336
##  4 diaporthe-pod-&-stem-blight   177
##  5 herbicide-injury              160
##  6 alternarialeaf-spot             0
##  7 anthracnose                     0
##  8 bacterial-blight                0
##  9 bacterial-pustule               0
## 10 brown-spot                      0
## 11 brown-stem-rot                  0
## 12 charcoal-rot                    0
## 13 diaporthe-stem-canker           0
## 14 downy-mildew                    0
## 15 frog-eye-leaf-spot              0
## 16 phyllosticta-leaf-spot          0
## 17 powdery-mildew                  0
## 18 purple-seed-stain               0
## 19 rhizoctonia-root-rot            0

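Groups of predictors share exactly the same number of missing values (table above). For example, hail, sever, seed.tmt, and lodging are each missing 121 times, which suggests they are missing together for the same samples. All missing values are confined to five classes: phytophthora-rot, 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury. The other 14 classes have none.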
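
The totals above depend on class size, so a complementary view is the share of incomplete rows within each class, together with the predictors grouped by their missing-value count. A minimal base R sketch, assuming mlbench is installed; `share_incomplete` and `na_counts` are illustrative names.

```r
# Load a fresh copy of Soybean (without the added na column)
env <- new.env()
data(Soybean, package = "mlbench", envir = env)
soy <- env$Soybean

# Flag rows with at least one missing predictor
incomplete <- !complete.cases(soy[, -1])

# Share of incomplete rows within each class
share_incomplete <- tapply(incomplete, soy$Class, mean)
round(sort(share_incomplete, decreasing = TRUE), 2)

# Predictors grouped by how many values they are missing
na_counts <- colSums(is.na(soy[, -1]))
split(names(na_counts), na_counts)
```

A share of 1 means every sample of that class has at least one missing predictor. Predictors that land in the same group are candidates for being missing together, which could be confirmed by comparing their row-wise NA patterns.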

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

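Based on the above, we could eliminate the degenerate predictors that have near-zero variance (leaf.mild, mycelium, sclerotia). We could also remove the five classes that account for all of the missing values, although a model trained on the remaining data could no longer predict those classes. Because the missing values are numerous and concentrated in a few classes, imputation might not be the best solution.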
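
The elimination strategy can be sketched as follows, assuming mlbench is installed. The predictor and class names are copied from the outputs above; `soy_reduced` is an illustrative name, and this is one possible reading of the strategy rather than a recommended pipeline.

```r
# Load a fresh copy of Soybean (without the added na column)
env <- new.env()
data(Soybean, package = "mlbench", envir = env)
soy <- env$Soybean

# Near-zero variance predictors reported earlier
nzv_cols <- c("leaf.mild", "mycelium", "sclerotia")

# Classes that account for all of the missing values
na_classes <- c("phytophthora-rot", "2-4-d-injury", "cyst-nematode",
                "diaporthe-pod-&-stem-blight", "herbicide-injury")

# Drop those classes (rows) and the degenerate predictors (columns)
keep_rows <- !(soy$Class %in% na_classes)
keep_cols <- !(names(soy) %in% nzv_cols)
soy_reduced <- soy[keep_rows, keep_cols]

# Remove the now-empty class levels from the outcome factor
soy_reduced$Class <- droplevels(soy_reduced$Class)

# Check what remains: dimensions, class count, and any leftover NAs
dim(soy_reduced)
nlevels(soy_reduced$Class)
anyNA(soy_reduced)
```

The trade-off is that the reduced data cover only 14 of the 19 diseases. If all classes must be predicted, keeping the rows and treating missingness as its own factor level would be an alternative worth comparing.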