3.1.

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:

library(mlbench)  # provides the Glass data set
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
library(psych)  # describe(); the output format below matches psych
describe(Glass)
##       vars   n  mean   sd median trimmed  mad   min   max range  skew kurtosis
## RI       1 214  1.52 0.00   1.52    1.52 0.00  1.51  1.53  0.02  1.60     4.72
## Na       2 214 13.41 0.82  13.30   13.38 0.64 10.73 17.38  6.65  0.45     2.90
## Mg       3 214  2.68 1.44   3.48    2.87 0.30  0.00  4.49  4.49 -1.14    -0.45
## Al       4 214  1.44 0.50   1.36    1.41 0.31  0.29  3.50  3.21  0.89     1.94
## Si       5 214 72.65 0.77  72.79   72.71 0.57 69.81 75.41  5.60 -0.72     2.82
## K        6 214  0.50 0.65   0.56    0.43 0.17  0.00  6.21  6.21  6.46    52.87
## Ca       7 214  8.96 1.42   8.60    8.74 0.66  5.43 16.19 10.76  2.02     6.41
## Ba       8 214  0.18 0.50   0.00    0.03 0.00  0.00  3.15  3.15  3.37    12.08
## Fe       9 214  0.06 0.10   0.00    0.04 0.00  0.00  0.51  0.51  1.73     2.52
## Type*   10 214  2.54 1.71   2.00    2.31 1.48  1.00  6.00  5.00  1.04    -0.29
##         se
## RI    0.00
## Na    0.06
## Mg    0.10
## Al    0.03
## Si    0.05
## K     0.04
## Ca    0.10
## Ba    0.03
## Fe    0.01
## Type* 0.12

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

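library(ggplot2)
library(reshape2)  # melt()

# Density of each predictor, one panel per variable
# (melt() keeps the factor Type as the id column)
ggplot(melt(Glass), aes(value)) +
    facet_wrap(~ variable, scales = "free") +
    geom_density()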

library(corrplot)

# Correlation plot of the nine predictors (column 10 is Type); pairwise deletion ignores NAs
cor.glass <- cor(Glass[-10], use = "pairwise.complete.obs")
corrplot(cor.glass, method = "color", addCoef.col = "black", na.label = "NA", number.cex = 0.5, tl.cex = 0.5)

There are no missing values in the data frame. Mg, K, Ba, and Fe each contain zero values.
RI and Ca are strongly correlated (0.81). Several other pairs are moderately correlated: Ba/Al (0.48), Si/RI (-0.54), and Ba/Mg (-0.49).
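
The pairs quoted above can also be pulled out of the correlation matrix programmatically rather than read off the plot. The following is a minimal sketch, assuming the Glass data come from the mlbench package; the 0.45 cutoff and the names `cor_glass`, `pairs_df`, and `strong` are illustrative choices, not part of the original analysis.

```r
library(mlbench)  # assumed source of the Glass data
data(Glass)

# Correlation matrix of the nine numeric predictors (Type is column 10)
cor_glass <- cor(Glass[-10])

# Keep each pair once by blanking the diagonal and the lower triangle
cor_glass[lower.tri(cor_glass, diag = TRUE)] <- NA
pairs_df <- na.omit(as.data.frame(as.table(cor_glass)))
names(pairs_df) <- c("var1", "var2", "r")

# Pairs whose absolute correlation exceeds the illustrative cutoff
strong <- pairs_df[abs(pairs_df$r) > 0.45, ]
strong <- strong[order(-abs(strong$r)), ]
print(strong)
```

Sorting by absolute value puts the RI/Ca pair first, which makes it easy to decide whether one predictor of a highly correlated pair should be dropped before modeling.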

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

K likely has outliers: its right tail is very long and thin, and its maximum (6.21) lies far above its median (0.56). RI, Ca, Ba, and Fe also appear to be right-skewed, while Mg is left-skewed.

library(e1071)  # skewness(); assumed to be the e1071 implementation
skewValues <- apply(Glass[-10], 2, skewness)
skewValues
##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107
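
To make the skewness figures and the outlier judgment more concrete, here is a small base-R sketch. It assumes the moment-based estimator (third central moment divided by the cubed sample standard deviation), which should correspond to the default used by `skewness()` in e1071, and it uses Tukey's 1.5 x IQR rule as one possible outlier screen. The helper names and the toy vector are hypothetical and are not taken from the Glass data.

```r
# Moment-based sample skewness: third central moment over the cubed sample SD
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

# Number of observations outside the 1.5 * IQR fences (Tukey's boxplot rule)
count_iqr_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

# Toy vector with one extreme value (illustrative only)
toy <- c(0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 6.0)
sample_skewness(toy)     # positive, i.e. right-skewed
count_iqr_outliers(toy)  # counts the single extreme value
```

Applied to the data, `sapply(Glass[-10], count_iqr_outliers)` would give a per-predictor count to set beside the skewness values.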

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

library(caret)  # preProcess(), spatialSign(), nearZeroVar()
glass_transformed <- preProcess(Glass[-10], method = c('BoxCox', 'center', 'scale'))
glass_transformed
## Created from 214 samples and 9 variables
## 
## Pre-processing:
##   - Box-Cox transformation (5)
##   - centered (9)
##   - ignored (0)
##   - scaled (9)
## 
## Lambda estimates for Box-Cox transformation:
## -2, -0.1, 0.5, 2, -1.1
glass_transformed_df <- predict(glass_transformed,Glass[-10])
ggplot(melt(glass_transformed_df),aes(value)) +                  
    facet_wrap(~ variable, scales = "free") +   
    geom_density()   

ss_df <- spatialSign(glass_transformed_df)
ggplot(melt(ss_df),aes(value)) +                  
    facet_wrap(~ Var2, scales = "free") +   
    geom_density()  

# log(x + 1) keeps the zeros in K, Ba, and Fe finite
Glass <- transform(Glass,
                   K_log  = log1p(K),
                   Ba_log = log1p(Ba),
                   Ca_log = log1p(Ca),
                   Fe_log = log1p(Fe))
  
ggplot(melt(Glass),aes(value)) +                  
    facet_wrap(~ variable, scales = "free") +   
    geom_density() 

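Centering, scaling, Box-Cox, and the spatial sign transformation were not sufficient to fix the distributions of the most problematic predictors. Note that Box-Cox was estimated for only five of the nine predictors: it requires strictly positive values, and Mg, K, Ba, and Fe contain zeros.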
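
One way to see why some predictors were left untouched by Box-Cox is to count zeros per predictor, since the transformation is only defined for strictly positive values. The sketch below also tries the Yeo-Johnson transformation, which is a different technique from the one used above and is shown only as a possible alternative that accepts zeros. It assumes the mlbench and caret packages; the object names are illustrative.

```r
library(mlbench)  # assumed source of the Glass data
library(caret)    # preProcess()
data(Glass)

predictors <- Glass[-10]

# Box-Cox needs strictly positive inputs, so count the zeros in each predictor
zero_counts <- sapply(predictors, function(x) sum(x == 0))
print(zero_counts)

# Yeo-Johnson is defined for zero and negative values as well
yj <- preProcess(predictors, method = c("YeoJohnson", "center", "scale"))
glass_yj <- predict(yj, predictors)
summary(glass_yj)
```

Predictors such as Ba and Fe, where most samples are exactly zero, may stay strongly skewed under any monotone transformation, so a presence/absence indicator could be worth considering for them.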

3.2.

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:

data(Soybean)
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
describe(Soybean)
##                  vars   n mean   sd median trimmed  mad min max range  skew
## Class*              1 683 9.30 5.51      8    9.18 7.41   1  19    18  0.11
## date*               2 682 4.55 1.69      5    4.62 1.48   1   7     6 -0.30
## plant.stand*        3 647 1.45 0.50      1    1.44 0.00   1   2     1  0.19
## precip*             4 645 2.60 0.69      3    2.74 0.00   1   3     2 -1.42
## temp*               5 653 2.18 0.63      2    2.23 0.00   1   3     2 -0.16
## hail*               6 562 1.23 0.42      1    1.16 0.00   1   2     1  1.31
## crop.hist*          7 667 2.88 0.98      3    2.98 1.48   1   4     3 -0.40
## area.dam*           8 682 2.58 1.07      2    2.60 1.48   1   4     3  0.02
## sever*              9 562 1.73 0.60      2    1.69 0.00   1   3     2  0.17
## seed.tmt*          10 562 1.52 0.61      1    1.45 0.00   1   3     2  0.74
## germ*              11 571 2.05 0.79      2    2.06 1.48   1   3     2 -0.09
## plant.growth*      12 667 1.34 0.47      1    1.30 0.00   1   2     1  0.68
## leaves*            13 683 1.89 0.32      2    1.98 0.00   1   2     1 -2.44
## leaf.halo*         14 599 2.20 0.95      3    2.25 0.00   1   3     2 -0.41
## leaf.marg*         15 599 1.77 0.96      1    1.72 0.00   1   3     2  0.46
## leaf.size*         16 599 2.28 0.61      2    2.34 0.00   1   3     2 -0.25
## leaf.shread*       17 583 1.16 0.37      1    1.08 0.00   1   2     1  1.80
## leaf.malf*         18 599 1.08 0.26      1    1.00 0.00   1   2     1  3.22
## leaf.mild*         19 575 1.10 0.40      1    1.00 0.00   1   3     2  3.95
## stem*              20 667 1.56 0.50      2    1.57 0.00   1   2     1 -0.23
## lodging*           21 562 1.07 0.26      1    1.00 0.00   1   2     1  3.23
## stem.cankers*      22 645 2.06 1.35      1    1.95 0.00   1   4     3  0.61
## canker.lesion*     23 645 1.98 1.08      2    1.85 1.48   1   4     3  0.51
## fruiting.bodies*   24 577 1.18 0.38      1    1.10 0.00   1   2     1  1.66
## ext.decay*         25 645 1.25 0.48      1    1.16 0.00   1   3     2  1.70
## mycelium*          26 645 1.01 0.10      1    1.00 0.00   1   2     1 10.20
## int.discolor*      27 645 1.13 0.42      1    1.00 0.00   1   3     2  3.34
## sclerotia*         28 645 1.03 0.17      1    1.00 0.00   1   2     1  5.40
## fruit.pods*        29 599 1.50 0.88      1    1.28 0.00   1   4     3  1.84
## fruit.spots*       30 577 1.85 1.17      1    1.69 0.00   1   4     3  0.95
## seed*              31 591 1.19 0.40      1    1.12 0.00   1   2     1  1.54
## mold.growth*       32 591 1.11 0.32      1    1.02 0.00   1   2     1  2.43
## seed.discolor*     33 577 1.11 0.31      1    1.02 0.00   1   2     1  2.47
## seed.size*         34 591 1.10 0.30      1    1.00 0.00   1   2     1  2.66
## shriveling*        35 577 1.07 0.25      1    1.00 0.00   1   2     1  3.49
## roots*             36 652 1.18 0.44      1    1.07 0.00   1   3     2  2.46
##                  kurtosis   se
## Class*              -1.38 0.21
## date*               -0.90 0.06
## plant.stand*        -1.97 0.02
## precip*              0.55 0.03
## temp*               -0.58 0.02
## hail*               -0.29 0.02
## crop.hist*          -0.92 0.04
## area.dam*           -1.29 0.04
## sever*              -0.56 0.03
## seed.tmt*           -0.44 0.03
## germ*               -1.40 0.03
## plant.growth*       -1.54 0.02
## leaves*              3.98 0.01
## leaf.halo*          -1.76 0.04
## leaf.marg*          -1.75 0.04
## leaf.size*          -0.63 0.02
## leaf.shread*         1.26 0.02
## leaf.malf*           8.35 0.01
## leaf.mild*          14.68 0.02
## stem*               -1.95 0.02
## lodging*             8.42 0.01
## stem.cankers*       -1.51 0.05
## canker.lesion*      -1.24 0.04
## fruiting.bodies*     0.75 0.02
## ext.decay*           1.98 0.02
## mycelium*          102.18 0.00
## int.discolor*       10.57 0.02
## sclerotia*          27.19 0.01
## fruit.pods*          2.41 0.04
## fruit.spots*        -0.76 0.05
## seed*                0.37 0.02
## mold.growth*         3.93 0.01
## seed.discolor*       4.12 0.01
## seed.size*           5.10 0.01
## shriveling*         10.21 0.01
## roots*               5.49 0.02

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

All 35 predictors are categorical, but their levels are coded as numbers, so the codes alone do not reveal their meaning. Some predictors, such as precipitation and temperature, appear to be continuous measurements binned into categories. The meaning of each code is documented in ?Soybean:

[,1] Class the 19 classes
[,2] date apr(0),may(1),june(2),july(3),aug(4),sept(5),oct(6).
[,3] plant.stand normal(0),lt-normal(1).
[,4] precip lt-norm(0),norm(1),gt-norm(2).
[,5] temp lt-norm(0),norm(1),gt-norm(2).
[,6] hail yes(0),no(1).
[,7] crop.hist dif-lst-yr(0),s-l-y(1),s-l-2-y(2), s-l-7-y(3).
[,8] area.dam scatter(0),low-area(1),upper-ar(2),whole-field(3).
[,9] sever minor(0),pot-severe(1),severe(2).
[,10] seed.tmt none(0),fungicide(1),other(2).
[,11] germ 90-100%(0),80-89%(1),lt-80%(2).
[,12] plant.growth norm(0),abnorm(1).
[,13] leaves norm(0),abnorm(1).
[,14] leaf.halo absent(0),yellow-halos(1),no-yellow-halos(2).
[,15] leaf.marg w-s-marg(0),no-w-s-marg(1),dna(2).
[,16] leaf.size lt-1/8(0),gt-1/8(1),dna(2).
[,17] leaf.shread absent(0),present(1).
[,18] leaf.malf absent(0),present(1).
[,19] leaf.mild absent(0),upper-surf(1),lower-surf(2).
[,20] stem norm(0),abnorm(1).
[,21] lodging yes(0),no(1).
[,22] stem.cankers absent(0),below-soil(1),above-s(2),ab-sec-nde(3).
[,23] canker.lesion dna(0),brown(1),dk-brown-blk(2),tan(3).
[,24] fruiting.bodies absent(0),present(1).
[,25] ext.decay absent(0),firm-and-dry(1),watery(2).
[,26] mycelium absent(0),present(1).
[,27] int.discolor none(0),brown(1),black(2).
[,28] sclerotia absent(0),present(1).
[,29] fruit.pods norm(0),diseased(1),few-present(2),dna(3).
[,30] fruit.spots absent(0),col(1),br-w/blk-speck(2),distort(3),dna(4).
[,31] seed norm(0),abnorm(1).
[,32] mold.growth absent(0),present(1).
[,33] seed.discolor absent(0),present(1).
[,34] seed.size norm(0),lt-norm(1).
[,35] shriveling absent(0),present(1).
[,36] roots norm(0),rotted(1),galls-cysts(2).

Additionally, there are near-zero-variance predictors that carry little information and can be disregarded:

names(Soybean[,nearZeroVar(Soybean)])
## [1] "leaf.mild" "mycelium"  "sclerotia"
# One bar plot per predictor, with NA shown as its own bar
par(mar = c(2, 1, 2, 1), mfrow = c(5, 7))
for (i in colnames(Soybean[-1])) {
  tab <- table(Soybean[[i]], useNA = "always")
  names(tab)[is.na(names(tab))] <- "NA"
  barplot(tab, main = i)
}
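
The near-zero-variance check can be unpacked into the two quantities it is based on: the ratio of the most common level's count to the second most common level's count, and the percentage of distinct values. The sketch below recomputes both in base R. The cutoffs 95/5 and 10 are assumed to match the defaults of `nearZeroVar()`, and the helper names `freq_ratio` and `pct_unique` are hypothetical.

```r
library(mlbench)  # assumed source of the Soybean data
data(Soybean)

# Count of the most common level divided by the count of the second most common
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NAs by default
  if (length(counts) < 2) return(Inf)
  as.numeric(counts[1]) / as.numeric(counts[2])
}

# Distinct observed values as a percentage of the number of rows
pct_unique <- function(x) 100 * length(unique(x[!is.na(x)])) / length(x)

metrics <- data.frame(
  freq_ratio = sapply(Soybean[-1], freq_ratio),
  pct_unique = sapply(Soybean[-1], pct_unique)
)

# Flag predictors with cutoffs assumed to match the nearZeroVar() defaults
flagged <- metrics[metrics$freq_ratio > 95 / 5 & metrics$pct_unique < 10, ]
print(flagged)
```

Looking at the ratios directly shows how far each flagged predictor is from the cutoff, which helps when deciding whether to drop it or keep it for a tree-based model.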

Almost all predictors have some missing values, and for some predictors the number of NAs equals or exceeds the count of at least one observed level (e.g., sever, fruit.spots).
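
The extent of the missingness can be quantified instead of read from the bar plots. The following base-R sketch, assuming the Soybean data from mlbench, reports the share of missing values per predictor and tabulates incomplete rows by class. The object names are illustrative.

```r
library(mlbench)  # assumed source of the Soybean data
data(Soybean)

# Share of missing values per predictor, largest first
na_share <- sort(colMeans(is.na(Soybean[-1])), decreasing = TRUE)
print(round(head(na_share, 10), 3))

# Rows with at least one missing predictor, tabulated by class
incomplete <- !complete.cases(Soybean[-1])
by_class <- table(Soybean$Class, incomplete)
print(by_class[by_class[, "TRUE"] > 0, , drop = FALSE])
```

If the incomplete rows turn out to be concentrated in a few classes, then missingness is itself informative about the outcome, which matters for the choice between deletion and imputation.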

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Based on the above, we could eliminate the degenerate predictors with near-zero variance. We could also consider removing the classes that account for a large share of the missing values. Because so many values are missing, imputation might not be the best solution.
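
As a rough sketch of how such a strategy might look in code, the block below drops the three near-zero-variance predictors reported earlier and then recodes NA as an explicit factor level. The recoding is not one of the strategies proposed above; it is shown as a possible alternative that avoids both deleting rows and imputing values. It assumes the Soybean data from mlbench, and the object names are illustrative.

```r
library(mlbench)  # assumed source of the Soybean data
data(Soybean)

predictors <- Soybean[-1]

# Drop the three near-zero-variance predictors reported by nearZeroVar()
nzv_names <- c("leaf.mild", "mycelium", "sclerotia")
predictors <- predictors[, !(names(predictors) %in% nzv_names)]

# Recode NA as its own factor level instead of imputing a value
predictors[] <- lapply(predictors, addNA, ifany = TRUE)

soy_clean <- cbind(Class = Soybean$Class, predictors)
sum(is.na(soy_clean))  # NA entries remaining after the recoding
dim(soy_clean)
```

Treating "missing" as a level keeps all 683 rows and every class, at the cost of an extra category per affected predictor. Whether that trade-off is acceptable depends on the model being fitted.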