Exercise 3.1: The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
ggplot(gather(Glass[,1:9]), aes(value)) + 
  geom_histogram(bins = 9) +
  facet_wrap(~key, scales = 'free_x')

x <- melt(Glass[,c(2:9)])
## No id variables; using all as measure variables
plt <- ggplot(data = x, aes(x = variable, y = value))
plt + geom_boxplot() + theme_minimal() + labs(x = "Element", y = "Percent (%)")

boxplot(Glass[,1], main = 'Boxplot of Refractive Index', ylab='Refrective Index (RI)')

corr <- cor(Glass[,1:9])
corrplot(corr, method = 'circle', order = 'AOE')

The Refractive Index and Calcium content seem to have an especially strong positive correlation.


  1. Do there appear to be any outliers in the data? Are any predictors skewed?

A look at the predictors shows us that some are more evenly distributed within the sample than others.

Aluminum (Al), Calcium (Ca), Sodium (Na), and Silicon (Si) seem to be fairly normally distributed within the 214 observations here, though Al, Ca, and Na show slight right-skewedness; the Refractive Index (RI) also follows a fairly normal distribution with a right tail.

Barium (Ba), Iron (Fe), and Potassium (K) are strongly right-skewed, with a majority of the measurements falling on the low end of the presented values due to a large portion containing none of those elements. Since the percentage can’t be below 0, the distribution bunches up there.

Finally, Magnesium (Mg) presents a fairly unique distribution where most of the readings fall on the right end of the spectrum, but there is a healthy amount at readings of zero (0), presenting a multimodal distribution with peaks at 0 and approximately 3.5.

None of the predictors appear to have outliers, unless some of the zero values are due to a lack of data as opposed to actual lack of the element altogether. One way to test this is to create a column with the total percentages from these 8 elements and see if the zeroes in some columns correspond with numbers noticeably lower than 100% for the total:

Glass['Total'] = Glass[2]+ Glass[3]+ Glass[4]+ Glass[5]+ Glass[6]+ Glass[7]+ Glass[8]+ Glass[9]
boxplot(Glass[,11], main = 'Boxplot of Total Calculated Percentages', ylab='Percentage (%)')

It appears that the lowest total falls at around 99%, meaning 1% is unaccounted for in the 8 provided values. It is certainly possible that this is data missing from our 8 elements, though also likely that it is simply data accounted for by elements not listed in our table.


  1. Are there any relevant transformations of one or more predictors that might improve the classification model?

We can take the logarithm of the skewed predictors like Ba, Fe, and K. This could improve the accuracy of our classification model.

Exercise 3.2: The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

data(Soybean)
  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
ggplot(gather(Soybean[,2:36], na.rm = TRUE), aes(value)) + 
  geom_histogram(stat='count') +
  facet_wrap(~key, scales = 'free_x')
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Warning: Ignoring unknown parameters: binwidth, bins, pad

In order to determine potential degenerate distributions, we can use the nearZeroVar function from the caret package. It provides an evaluation of whether the variance of a predictor is near zero, suggesting that it may be degenerate.

nearZeroVar(Soybean[,2:36], names = TRUE)
## [1] "leaf.mild" "mycelium"  "sclerotia"

We can see from thsi function that the three variables \(leaf.mild\), \(mycelium\), and \(sclerotia\) may be degenerate distributions.


  1. Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
sapply(Soybean, function(x) sum(is.na(x)))
##           Class            date     plant.stand          precip            temp 
##               0               1              36              38              30 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##             121              16               1             121             121 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##             112              16               0              84              84 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
##              84             100              84             108              16 
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##             121              38              38             106              38 
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
##              38              38              38              84             106 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##              92              92             106              92             106 
##           roots 
##              31

From the table above we can actually see that there is missing data in every column except for Class and leaves, with date, and area.dam having 1 missing datapoint. We can see that the highest number of missing data is from hail, sever, lodging, and seed.tmt each of which is missing 121 datapoints in 683 observations, or 17.7%.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
missing <- Soybean %>%
  select(everything()) %>%  
  group_by(Class) %>%
  summarise_all(funs(sum(is.na(.)))) %>%
  mutate(Total = select(.,date:roots) %>% rowSums())
## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
missing[,c('Class','Total')] %>% arrange(-Total)
## # A tibble: 19 x 2
##    Class                       Total
##    <fct>                       <dbl>
##  1 phytophthora-rot             1214
##  2 2-4-d-injury                  450
##  3 cyst-nematode                 336
##  4 diaporthe-pod-&-stem-blight   177
##  5 herbicide-injury              160
##  6 alternarialeaf-spot             0
##  7 anthracnose                     0
##  8 bacterial-blight                0
##  9 bacterial-pustule               0
## 10 brown-spot                      0
## 11 brown-stem-rot                  0
## 12 charcoal-rot                    0
## 13 diaporthe-stem-canker           0
## 14 downy-mildew                    0
## 15 frog-eye-leaf-spot              0
## 16 phyllosticta-leaf-spot          0
## 17 powdery-mildew                  0
## 18 purple-seed-stain               0
## 19 rhizoctonia-root-rot            0

From the table above we can see that the \(phytophthora-rot\) class has far an away the most missing data (though obviously multiple missing fields for the same observation exacerbates these numbers), followed by \(2-d-4-injury\) and \(cyst-nematode\). \(diaporthe-pod-&-stem-blight\) and \(herbicide-injury\) round out the list of classes with missing data, meaning that 14 of the classes don’t have any missing data.


  1. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

One option to deal with missing data is to remove all observations with any missing data:

Soybean_clean <- Soybean[complete.cases(Soybean), ]
dim(Soybean_clean)
## [1] 562  36

This is a simple and quick solution, but it removes 18% of our observations, meaning any predictive model we develop will have nearly a fifth less data to work with. Let’s look at a more complicated (but les destructive) method of dealing with missing data: imputation.

A very popular package to deal with missing data imputation is Mice:

md.pattern(Soybean, rotate.names = TRUE)

##     Class leaves date area.dam crop.hist plant.growth stem temp roots
## 562     1      1    1        1         1            1    1    1     1
## 13      1      1    1        1         1            1    1    1     1
## 55      1      1    1        1         1            1    1    1     1
## 8       1      1    1        1         1            1    1    1     1
## 9       1      1    1        1         1            1    1    1     0
## 6       1      1    1        1         1            1    1    1     0
## 14      1      1    1        1         1            1    1    0     1
## 15      1      1    1        1         0            0    0    0     0
## 1       1      1    0        0         0            0    0    0     0
##         0      0    1        1        16           16   16   30    31
##     plant.stand precip stem.cankers canker.lesion ext.decay mycelium
## 562           1      1            1             1         1        1
## 13            1      1            1             1         1        1
## 55            1      1            1             1         1        1
## 8             1      0            0             0         0        0
## 9             1      1            1             1         1        1
## 6             0      1            1             1         1        1
## 14            0      0            0             0         0        0
## 15            0      0            0             0         0        0
## 1             0      0            0             0         0        0
##              36     38           38            38        38       38
##     int.discolor sclerotia leaf.halo leaf.marg leaf.size leaf.malf fruit.pods
## 562            1         1         1         1         1         1          1
## 13             1         1         1         1         1         1          0
## 55             1         1         0         0         0         0          0
## 8              0         0         1         1         1         1          1
## 9              1         1         0         0         0         0          1
## 6              1         1         0         0         0         0          1
## 14             0         0         0         0         0         0          1
## 15             0         0         1         1         1         1          0
## 1              0         0         1         1         1         1          0
##               38        38        84        84        84        84         84
##     seed mold.growth seed.size leaf.shread fruiting.bodies fruit.spots
## 562    1           1         1           1               1           1
## 13     0           0         0           1               0           0
## 55     0           0         0           0               0           0
## 8      0           0         0           1               0           0
## 9      1           1         1           0               1           1
## 6      1           1         1           0               1           1
## 14     1           1         1           0               0           0
## 15     0           0         0           0               0           0
## 1      0           0         0           0               0           0
##       92          92        92         100             106         106
##     seed.discolor shriveling leaf.mild germ hail sever seed.tmt lodging     
## 562             1          1         1    1    1     1        1       1    0
## 13              0          0         1    0    0     0        0       0   13
## 55              0          0         0    0    0     0        0       0   19
## 8               0          0         0    0    0     0        0       0   20
## 9               1          1         0    1    0     0        0       0   11
## 6               1          1         0    0    0     0        0       0   13
## 14              0          0         0    0    0     0        0       0   24
## 15              0          0         0    0    0     0        0       0   28
## 1               0          0         0    0    0     0        0       0   30
##               106        106       108  112  121   121      121     121 2337

We will impute using the Predictive Mean Matching (pmm) method, which picks a value from similar observations to assign to the missing datapoint.

imputed_Soy <- mice(Soybean, method = 'pmm', seed = 44, printFlag = FALSE)
## Warning: Number of logged events: 1675
summary(imputed_Soy)
## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##           Class            date     plant.stand          precip            temp 
##              ""           "pmm"           "pmm"           "pmm"           "pmm" 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##           "pmm"           "pmm"           "pmm"           "pmm"           "pmm" 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##           "pmm"           "pmm"              ""           "pmm"           "pmm" 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
##           "pmm"           "pmm"           "pmm"           "pmm"           "pmm" 
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##           "pmm"           "pmm"           "pmm"           "pmm"           "pmm" 
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
##           "pmm"           "pmm"           "pmm"           "pmm"           "pmm" 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##           "pmm"           "pmm"           "pmm"           "pmm"           "pmm" 
##           roots 
##           "pmm" 
## PredictorMatrix:
##             Class date plant.stand precip temp hail crop.hist area.dam sever
## Class           0    1           1      1    1    1         1        1     1
## date            1    0           1      1    1    1         1        1     1
## plant.stand     1    1           0      1    1    1         1        1     1
## precip          1    1           1      0    1    1         1        1     1
## temp            1    1           1      1    0    1         1        1     1
## hail            1    1           1      1    1    0         1        1     1
##             seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## Class              1    1            1      1         1         1         1
## date               1    1            1      1         1         1         1
## plant.stand        1    1            1      1         1         1         1
## precip             1    1            1      1         1         1         1
## temp               1    1            1      1         1         1         1
## hail               1    1            1      1         1         1         1
##             leaf.shread leaf.malf leaf.mild stem lodging stem.cankers
## Class                 1         1         1    1       1            1
## date                  1         1         1    1       1            1
## plant.stand           1         1         1    1       1            1
## precip                1         1         1    1       1            1
## temp                  1         1         1    1       1            1
## hail                  1         1         1    1       1            1
##             canker.lesion fruiting.bodies ext.decay mycelium int.discolor
## Class                   1               1         1        1            1
## date                    1               1         1        1            1
## plant.stand             1               1         1        1            1
## precip                  1               1         1        1            1
## temp                    1               1         1        1            1
## hail                    1               1         1        1            1
##             sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor
## Class               1          1           1    1           1             1
## date                1          1           1    1           1             1
## plant.stand         1          1           1    1           1             1
## precip              1          1           1    1           1             1
## temp                1          1           1    1           1             1
## hail                1          1           1    1           1             1
##             seed.size shriveling roots
## Class               1          1     1
## date                1          1     1
## plant.stand         1          1     1
## precip              1          1     1
## temp                1          1     1
## hail                1          1     1
## Number of logged events:  1675 
##   it im         dep meth
## 1  1  1        date  pmm
## 2  1  1 plant.stand  pmm
## 3  1  1 plant.stand  pmm
## 4  1  1      precip  pmm
## 5  1  1      precip  pmm
## 6  1  1        temp  pmm
##                                                                                                                                                                                                                                                        out
## 1                                                                                                                                                                                                                                            int.discolor2
## 2                                                                                                                                                                               Classcyst-nematode, Classdowny-mildew, leaf.mild2, sclerotia1, fruit.pods1
## 3 mice detected that your data are (nearly) multi-collinear.\nIt applied a ridge penalty to continue calculations, but the results can be unstable.\nDoes your dataset contain duplicates, linear transformation, or factors with unique respondent names?
## 4                                                                                                                                         Classbrown-spot, Classcharcoal-rot, Classcyst-nematode, Classherbicide-injury, stem1, int.discolor2, fruit.pods1
## 5 mice detected that your data are (nearly) multi-collinear.\nIt applied a ridge penalty to continue calculations, but the results can be unstable.\nDoes your dataset contain duplicates, linear transformation, or factors with unique respondent names?
## 6                                                                                                                                              Classbrown-spot, Classcyst-nematode, Classdiaporthe-pod-&-stem-blight, Classfrog-eye-leaf-spot, fruit.pods2
final_soy <- complete(imputed_Soy,4)

Now that we have imputed the data let’s confirm nothing is missing:

missing2 <- final_soy %>%
  select(everything()) %>%  # replace to your needs
  group_by(Class) %>%
  summarise_all(funs(sum(is.na(.)))) %>%
  mutate(Total = select(.,date:roots) %>% rowSums())
missing2[,c('Class','Total')] %>% arrange(-Total)
## # A tibble: 19 x 2
##    Class                       Total
##    <fct>                       <dbl>
##  1 2-4-d-injury                    0
##  2 alternarialeaf-spot             0
##  3 anthracnose                     0
##  4 bacterial-blight                0
##  5 bacterial-pustule               0
##  6 brown-spot                      0
##  7 brown-stem-rot                  0
##  8 charcoal-rot                    0
##  9 cyst-nematode                   0
## 10 diaporthe-pod-&-stem-blight     0
## 11 diaporthe-stem-canker           0
## 12 downy-mildew                    0
## 13 frog-eye-leaf-spot              0
## 14 herbicide-injury                0
## 15 phyllosticta-leaf-spot          0
## 16 phytophthora-rot                0
## 17 powdery-mildew                  0
## 18 purple-seed-stain               0
## 19 rhizoctonia-root-rot            0