3.1.

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The Glass data set was loaded, and some basic information, such as the dimensions, structure, and missing-value counts, was returned. The data are complete; there are no missing values.
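
The packages loaded in the original session are not shown, so the list below is an assumption inferred from the functions called throughout this document (for example, introduce() and plot_missing() from DataExplorer, getOutliers() from extremevalues, and knnImputation() from DMwR).

# assumed package list, inferred from the functions used below
library(mlbench)       # Glass and Soybean data sets
library(dplyr)         # select(), group_by(), summarize(), left_join()
library(tidyr)         # pivot_longer()
library(ggplot2)       # density plots and boxplots
library(DataExplorer)  # introduce(), plot_qq(), plot_scatterplot(), plot_bar(), plot_missing()
library(corrplot)      # corrplot()
library(e1071)         # skewness() (assumed source)
library(caret)         # preProcess(), nearZeroVar()
library(extremevalues) # getOutliers()
library(DMwR)          # knnImputation()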

# load data
data(Glass)

# get dimensions
dim(Glass)
## [1] 214  10
# check out the structure
str(Glass) 
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
# check for missing values
introduce(Glass)
##   rows columns discrete_columns continuous_columns all_missing_columns
## 1  214      10                1                  9                   0
##   total_missing_values complete_rows total_observations memory_usage
## 1                    0           214               2140        19984
# dataframe of predictors only
predictors <- Glass %>% 
  dplyr::select(1:9)

(a)

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Density plots, boxplots, and QQ-plots were used to visualize the distributions of the predictors.

# density of the predictors
Glass %>%
  pivot_longer(1:9) %>%
  ggplot(aes(x=value)) +
    geom_density(fill="blue", alpha=0.4)+ 
    facet_wrap(~name, scales="free") +
  theme_minimal()

# boxplots 
Glass %>%
  pivot_longer(1:9) %>%
  ggplot(aes(x=value)) +
    geom_boxplot(fill="blue", alpha=0.4)+ 
    facet_wrap(~name, scales="free") +
  theme_minimal()

# QQ 
plot_qq(predictors)

# distributions of predictors vs target
plot_scatterplot(Glass, by = "Type", sampled_rows = 1000L)

Correlation between all predictors can be visualized with a correlation matrix. High correlations appear in dark blue and dark red, indicating positive and negative correlation, respectively. The correlation matrix, however, only measures linear association and is not useful for relationships that are not linear, so a scatterplot matrix is also shown.

# correlation matrix of predictors
corrplot(cor(predictors), order = "hclust")

# scatterplot matrix of predictors
pairs(predictors)
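
As a follow-up to the correlation matrix (a sketch, not part of the original analysis), caret's findCorrelation() turns the pairwise correlations into a concrete list of candidate predictors to drop; the 0.75 cutoff is an arbitrary illustrative choice.

# sketch: flag predictors whose pairwise correlation exceeds an arbitrary cutoff
high_corr <- findCorrelation(cor(predictors), cutoff = 0.75, names = TRUE)
high_corr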

(b)

Do there appear to be any outliers in the data? Are any predictors skewed?

The visualizations above give a good sense of the data. In particular, several predictors show skewed distributions: K, Ba, Ca, and Fe. Even the more symmetric predictors show some degree of skew, and this observation should inform decisions about transformations before modeling. Skewness was calculated for each variable to confirm what the visuals suggest.

Several variables also appear to have extreme values, Ba, Na, and K in particular. The getOutliers() function provides a way of indexing potential outliers in a variable. Depending on the classification method used (i.e., its sensitivity to extreme values), the outliers can be removed, kept, or otherwise handled.

# quantify the skewness of each predictor
apply(predictors, 2, skewness) %>% sort(decreasing = T)
##          K         Ba         Ca         Fe         RI         Al         Na 
##  6.4600889  3.3686800  2.0184463  1.7298107  1.6027151  0.8946104  0.4478343 
##         Si         Mg 
## -0.7202392 -1.1364523
# index outliers
getOutliers(predictors$K)$iRight
## [1] 164 172 173 186 187 202 208
getOutliers(predictors$Ba)$iRight
##  [1]  62 107 129 164 186 187 190 191 192 193 194 195 196 197 198 199 200 201 203
## [20] 204 205 206 207 208 209 210 211 212 213 214
getOutliers(predictors$Na)$iRight
## [1] 185 190
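
If the extremevalues package is unavailable, a simple IQR rule gives a rough, package-free alternative for flagging high values; this is only a sketch using the conventional 1.5 * IQR boxplot fence, not the method used above.

# sketch: index values above the upper boxplot fence (Q3 + 1.5 * IQR)
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  which(x > q[2] + 1.5 * (q[2] - q[1]))
}
iqr_outliers(predictors$K)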

(c)

Are there any relevant transformations of one or more predictors that might improve the classification model?

As determined previously, most of the variables contain at least some skew, so Box-Cox transformations would be beneficial. Because the Box-Cox transformation requires strictly positive values, it was applied only to RI, Ca, Na, Al, and Si; the predictors that contain zeros (Mg, K, Ba, Fe) were left untransformed, which is why their skewness is unchanged below. The overall effect was a reduction in skewness. The elemental predictors were originally on a common scale (percent abundance), but the Box-Cox transformed values are not, so centering and scaling were also applied.
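
As a quick check (a sketch), the predictors eligible for a Box-Cox transformation can be identified by testing for strictly positive values, and caret's BoxCoxTrans() reports the lambda estimated for any single predictor, Ca for example.

# sketch: Box-Cox applies only to strictly positive predictors
sapply(predictors, function(x) all(x > 0))
# sketch: estimated lambda for one eligible predictor
BoxCoxTrans(predictors$Ca)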

# transformation to scale and center
trans1 <- preProcess(predictors, method=c("BoxCox","center","scale"))
transform1 <- predict(trans1, predictors)
head(transform1)
##           RI         Na        Mg          Al          Si           K
## 1  0.8756898  0.3133883 1.2517037 -0.65520274 -1.12729016 -0.67013422
## 2 -0.2471367  0.6129977 0.6346799 -0.08726137  0.09719851 -0.02615193
## 3 -0.7216425  0.1798164 0.6000157  0.27454124  0.43512776 -0.16414813
## 4 -0.2305698 -0.2150217 0.6970756 -0.23439154 -0.05836211  0.11184428
## 5 -0.3101056 -0.1402661 0.6485456 -0.34194384  0.55238422  0.08117845
## 6 -0.7947626 -0.7480171 0.6416128  0.42852325  0.40909039  0.21917466
##            Ca         Ba         Fe
## 1 -0.01194308 -0.3520514 -0.5850791
## 2 -0.87918677 -0.3520514 -0.5850791
## 3 -0.93250264 -0.3520514 -0.5850791
## 4 -0.48665658 -0.3520514 -0.5850791
## 5 -0.63291627 -0.3520514 -0.5850791
## 6 -0.63291627 -0.3520514  2.0832652
transform1 %>%
  pivot_longer(1:9) %>%
  ggplot(aes(x=value)) +
    geom_density(fill="blue", alpha=0.4)+ 
    facet_wrap(~name, scales="free") +
  theme_minimal()

apply(transform1, 2, skewness) %>% sort(decreasing = T)
##           K          Ba          Fe          RI          Al          Na 
##  6.46008890  3.36867997  1.72981071  1.56566039  0.09105899  0.03384644 
##          Ca          Si          Mg 
## -0.19395573 -0.65090568 -1.13645228

3.2.

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

First, some basic information regarding the dataset…

# load data
data("Soybean")

str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
# overview of missing data
introduce(Soybean)
##   rows columns discrete_columns continuous_columns all_missing_columns
## 1  683      36               36                  0                   0
##   total_missing_values complete_rows total_observations memory_usage
## 1                 2337           562              24588       128600
# dataframe of just the predictors
predictors <- Soybean %>% dplyr::select(-Class)

(a)

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Three variables can be categorized as near-zero-variance predictors: leaf.mild, mycelium, and sclerotia. These variables offer no value to the classification model and will be removed from the data set in part (c).

# plot distributions
plot_bar(predictors)

# find and remove degenerates
degenerates <- nearZeroVar(predictors)
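
To see why those three predictors are flagged (a sketch), nearZeroVar() can be rerun with saveMetrics = TRUE, which reports the frequency ratio and percent of unique values behind each decision.

# sketch: inspect the metrics for the flagged near-zero-variance predictors
nzv_metrics <- nearZeroVar(predictors, saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, ]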

(b)

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

The missing data are related to only 5 of the 19 classes: 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury, and phytophthora-rot. Missing data occur in every observation for four of these classes and in 68 of the 88 observations of phytophthora-rot. This could be related to difficulties collecting data for these classes, but whatever the reason, the missingness is informative: the pattern of missing data is related to the outcome. The missing values also come in discrete bins (i.e., several predictors are missing exactly 121 times, others 106 times, and so on), which further supports the claim that they are not missing at random.

# plot of missing values by attribute
plot_missing(Soybean)

# get counts of missing values by predictor
counts <- apply(is.na(predictors), 2, sum) %>%  sort(decreasing=T)
counts
##            hail           sever        seed.tmt         lodging            germ 
##             121             121             121             121             112 
##       leaf.mild fruiting.bodies     fruit.spots   seed.discolor      shriveling 
##             108             106             106             106             106 
##     leaf.shread            seed     mold.growth       seed.size       leaf.halo 
##             100              92              92              92              84 
##       leaf.marg       leaf.size       leaf.malf      fruit.pods          precip 
##              84              84              84              84              38 
##    stem.cankers   canker.lesion       ext.decay        mycelium    int.discolor 
##              38              38              38              38              38 
##       sclerotia     plant.stand           roots            temp       crop.hist 
##              38              36              31              30              16 
##    plant.growth            stem            date        area.dam          leaves 
##              16              16               1               1               0
# table of Soybean Class counts
counts <- table(Soybean$Class) %>% 
  data.frame()

# classes with missing observations
Soybean[!complete.cases(Soybean),] %>% 
  group_by(Var1=Class) %>% 
  summarize(missing_data=n()) %>% 
  left_join(counts, by="Var1")
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 3
##   Var1                        missing_data  Freq
##   <fct>                              <int> <int>
## 1 2-4-d-injury                          16    16
## 2 cyst-nematode                         14    14
## 3 diaporthe-pod-&-stem-blight           15    15
## 4 herbicide-injury                       8     8
## 5 phytophthora-rot                      68    88
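
To make the link to class more concrete (a sketch), the missing counts can be tallied per predictor within a single affected class, for example 2-4-d-injury, which the table above shows has missing data in all 16 of its rows.

# sketch: per-predictor missing counts within one affected class
colSums(is.na(subset(Soybean, Class == "2-4-d-injury")))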

(c)

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

First, the degenerate variables determined in part (a) can be removed since they offer no predictive value. That drops the number of predictors down to 32.

There are a number of ways to deal with the remaining missing values:

One option would be to keep only the complete cases (sketched below, after the list of options). This is an expensive option: as shown previously, roughly 18% of the observations (121 of the 683 rows are incomplete) would be eliminated, and with only 683 observations to begin with, the predictive performance of the model could be compromised. It would also effectively reduce the number of distinct classes in the target from 19 to 15. Whether those classes can be sacrificed is a decision for the modeler.

Another option is to create a new indicator variable recording whether an observation has any missing values (also sketched below). As noted previously, the missingness appears to be informative, and the presence of a missing value points to one of those 5 classes.

Finally, imputation can be used to estimate the missing values. This approach adds a layer of uncertainty to the model and would probably not be my first choice, but the values are filled in this way below anyway. After imputation the number of missing values drops to zero and no observations are lost.
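
The first two options can be expressed in a few lines; this is a sketch only, the object names soy_complete and soy_flagged are illustrative, and neither object is used in the remainder of the analysis.

# sketch of option 1: keep complete cases only
soy_complete <- na.omit(Soybean)
nrow(soy_complete)                        # 562 complete rows remain
nlevels(droplevels(soy_complete$Class))   # 15 of the 19 classes remain

# sketch of option 2: flag rows that contain any missing value
soy_flagged <- Soybean %>%
  mutate(has_missing = factor(!complete.cases(Soybean), labels = c("no", "yes")))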

# first eliminate the degenerate predictors flagged in part (a)
# note: `degenerates` indexes the predictors-only data frame (which excludes
# Class), so drop the flagged columns by name rather than by position
Soybean <- Soybean[, !(names(Soybean) %in% names(predictors)[degenerates])]

Soybean_imputed <- knnImputation(Soybean, k=3)
introduce(Soybean_imputed)
##   rows columns discrete_columns continuous_columns all_missing_columns
## 1  683      33               33                  0                   0
##   total_missing_values complete_rows total_observations memory_usage
## 1                    0           683              22539       118112