Textbook: Max Kuhn and Kjell Johnson. Applied Predictive Modeling. Springer, New York, 2013.

# Required R packages
library(mlbench)
library(tidyverse)
library(GGally)
library(caret)
library(VIM)

Exercise 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The data can be accessed via:

data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
  a. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
par(mfrow = c(3,3))
for (i in 1:9){
  rcompanion::plotNormalDensity(
    Glass[,i], main = sprintf("Density of %s", names(Glass)[i]), 
    xlab = sprintf("skewness = %1.2f", psych::describe(Glass)[i,11]), 
    col2 = "steelblue2", col3 = "royalblue4") 
}

Each plot above shows the density of one predictor with a superimposed normal curve that has the same mean and standard deviation, which allows a quick comparison against a normal distribution. None of the variables is truly normally distributed. Na, Al, and Si are nearly normal, with small deviations in the tails. The refractive index, Mg, and K show evidence of bimodality, while Ca, Ba, and Fe, as well as K, are positively skewed.
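The skewness value printed under each plot can be reproduced by hand. The sketch below uses one common definition, the third standardized moment; the helper name `sample_skewness` and the toy vectors are hypothetical, and packages such as psych may apply slightly different small-sample corrections, so the numbers need not match exactly.

```r
# Sample skewness as the third standardized moment (an assumed, common
# definition; packages differ in their small-sample corrections).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  m2 <- sum((x - mean(x))^2) / n   # second central moment
  m3 <- sum((x - mean(x))^3) / n   # third central moment
  m3 / m2^1.5
}

# Hypothetical toy vectors, not the Glass data
symmetric_x <- c(-2, -1, 0, 1, 2)
right_skewed_x <- c(0, 0, 0, 0, 1, 2, 10)

skew_symmetric <- sample_skewness(symmetric_x)   # symmetric data: skewness is 0
skew_right <- sample_skewness(right_skewed_x)    # long right tail: skewness is positive
```

For the Glass predictors, the same helper could be applied column by column, for example with `sapply(Glass[1:9], sample_skewness)`.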

ggpairs(Glass[1:9], title = "Correlogram with the variables", progress = FALSE, 
        lower = list(continuous = wrap("smooth", alpha = 0.3, size = 0.1))) 

From the correlogram, the refractive index and Ca show a strong positive correlation. Some other pairs are moderately correlated, but no other relationship seems noteworthy.
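A numeric check can complement the visual reading of the correlogram. The sketch below lists every pair of variables whose absolute correlation exceeds a cutoff; the toy data frame and the cutoff of 0.75 are assumptions chosen for illustration, not values taken from the exercise.

```r
# Hypothetical toy data: x2 is built to track x1 closely, x3 is independent noise.
set.seed(42)
x1 <- rnorm(200)
toy <- data.frame(x1 = x1, x2 = x1 + rnorm(200, sd = 0.3), x3 = rnorm(200))

cor_mat <- cor(toy)

# Keep each pair once (upper triangle) when |r| exceeds an assumed cutoff.
cutoff <- 0.75
idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
high_pairs
```

With the Glass data, `cor(Glass[1:9])` would take the place of `cor(toy)`.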

  b. Do there appear to be any outliers in the data? Are any predictors skewed?
par(mfrow = c(5,2))
for (i in 1:9){
  boxplot(
    Glass[i], main = sprintf("%s", names(Glass)[i]), col = "steelblue2", horizontal = TRUE, 
    xlab = sprintf("skewness = %1.2f      # of outliers = %d", psych::describe(Glass)[i,11], 
                   length(boxplot(Glass[i], plot = FALSE)$out)))
}

The boxplots reveal a number of outliers in nearly every variable, and these outliers are a likely cause of the skewness discussed previously. Interestingly, Mg does not appear to have any outliers, yet its distribution is slightly negatively skewed. Ba and Ca appear to have more outliers than the other predictors, which may influence modeling.
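The outlier counts in the boxplot labels follow the usual whisker rule: a point is flagged when it lies more than 1.5 times the hinge spread beyond the hinges. A minimal sketch with base R's `boxplot.stats()` follows; the helper name and the toy vector are hypothetical.

```r
# Count points flagged by the boxplot whisker rule (coef = 1.5 by default).
count_boxplot_outliers <- function(x, coef = 1.5) {
  x <- x[!is.na(x)]
  length(boxplot.stats(x, coef = coef)$out)
}

# Hypothetical toy vector: 50 sits far above the rest of the values.
toy_values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)
n_outliers <- count_boxplot_outliers(toy_values)
n_outliers
```

Flagged points are not necessarily errors; for heavily skewed predictors such as Ba, many of them simply belong to the long tail.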

  c. Are there any relevant transformations of one or more predictors that might improve the classification model?

Because quite a few variables are skewed, the Box-Cox transformation is one important method that may improve the model. There are no missing values, so imputation is not necessary. Lastly, because some predictors are correlated, data reduction is considered to see whether a smaller set of derived predictors can capture the majority of the information in the original variables. As a result, a series of transformations is applied to multiple variables, namely the Box-Cox transformation and PCA.
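For reference, the Box-Cox family maps a strictly positive value x to (x^lambda - 1) / lambda, with log(x) as the limiting case at lambda = 0. The sketch below applies that formula for a lambda supplied by hand; the helper name `box_cox` and the toy vector are hypothetical, and in practice lambda is estimated from the data rather than chosen manually.

```r
# Box-Cox transform for a given lambda; defined only for strictly positive x.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Hypothetical right-skewed toy vector
x <- c(1, 2, 4, 8, 16)

bc_log  <- box_cox(x, 0)     # lambda = 0 reduces to log(x)
bc_id   <- box_cox(x, 1)     # lambda = 1 is x - 1: a shift only, shape unchanged
bc_sqrt <- box_cox(x, 0.5)   # lambda = 0.5 behaves like a square-root transform
```

The positivity requirement is why predictors containing zeros, such as Ba and Fe, cannot be Box-Cox transformed without first adding an offset.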

glass.t = preProcess(Glass, method = c("BoxCox", "pca"))
glass.t
## Created from 214 samples and 10 variables
## 
## Pre-processing:
##   - Box-Cox transformation (5)
##   - centered (9)
##   - ignored (1)
##   - principal component signal extraction (9)
##   - scaled (9)
## 
## Lambda estimates for Box-Cox transformation:
## -2, -0.1, 0.5, 2, -1.1
## PCA needed 7 components to capture 95 percent of the variance

The resulting pre-processing object can be applied to all the variables. The refractive index, Na, Al, Si, and Ca were Box-Cox transformed; all nine predictors were then centered, scaled, and passed through PCA. After the transformation, the component densities are closer to normal and not heavily skewed, and the principal components are uncorrelated with one another.

transformed = predict(glass.t, Glass)

par(mfrow = c(3,3))
for (i in 2:8){
  rcompanion::plotNormalDensity(
    transformed[,i], main = sprintf("Density of %s", names(transformed)[i]), 
    xlab = sprintf("skewness = %1.2f", psych::describe(transformed)[i,11]), 
    col2 = "steelblue2", col3 = "royalblue4") 
}

ggpairs(transformed[2:8], title = "Correlogram with the PCA variables", progress = FALSE, 
        lower = list(continuous = wrap("smooth", alpha = 0.3, size = 0.1))) 
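The two PCA properties relied on above, uncorrelated components and a cumulative-variance cutoff, can be checked directly with base R's `prcomp()`. This is a minimal sketch on hypothetical toy data; the 95% threshold mirrors the value reported in the `preProcess` output, and everything else is an assumption for illustration.

```r
# Hypothetical toy data: a and b share a common signal z, c is independent.
set.seed(123)
z <- rnorm(300)
toy <- data.frame(a = z + rnorm(300, sd = 0.2),
                  b = z + rnorm(300, sd = 0.2),
                  c = rnorm(300))

# Center and scale before extracting components.
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of variance per component and the running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components whose cumulative variance reaches 95%.
n_keep <- which(cum_var >= 0.95)[1]

# Component scores are uncorrelated: off-diagonal entries are numerically zero.
score_cor <- cor(pca$x)
round(score_cor, 10)
```

Because the components are uncorrelated by construction, the flat scatterplots in the second correlogram are expected rather than a finding about the glass data.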

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

data(Soybean)
  a. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

There are 19 classes, only the first 15 of which have been used in prior work. The folklore seems to be that the last four classes are unjustified by the data since they have so few examples. There are 35 categorical attributes, some nominal and some ordered. The value “dna” means “does not apply.” The values for attributes are encoded numerically, with the first value encoded as “0,” the second as “1,” and so forth.

The data frame has 683 observations on 36 variables: 35 categorical attributes, all numerically coded, and one nominal variable denoting the class.

summarytools::dfSummary(Soybean, plain.ascii = TRUE, style = "grid", graph.col = FALSE, footnote = NA)
## Data Frame Summary  
## Soybean  
## Dimensions: 683 x 36  
## Duplicates: 52  
## 
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | No | Variable          | Stats / Values                  | Freqs (% of Valid) | Valid    | Missing  |
## +====+===================+=================================+====================+==========+==========+
## | 1  | Class             | 1. 2-4-d-injury                 |  16 ( 2.3%)        | 683      | 0        |
## |    | [factor]          | 2. alternarialeaf-spot          |  91 (13.3%)        | (100%)   | (0%)     |
## |    |                   | 3. anthracnose                  |  44 ( 6.4%)        |          |          |
## |    |                   | 4. bacterial-blight             |  20 ( 2.9%)        |          |          |
## |    |                   | 5. bacterial-pustule            |  20 ( 2.9%)        |          |          |
## |    |                   | 6. brown-spot                   |  92 (13.5%)        |          |          |
## |    |                   | 7. brown-stem-rot               |  44 ( 6.4%)        |          |          |
## |    |                   | 8. charcoal-rot                 |  20 ( 2.9%)        |          |          |
## |    |                   | 9. cyst-nematode                |  14 ( 2.0%)        |          |          |
## |    |                   | 10. diaporthe-pod-&-stem-blig   |  15 ( 2.2%)        |          |          |
## |    |                   | [ 9 others ]                    | 307 (45.0%)        |          |          |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 2  | date              | 1. 0                            |  26 ( 3.8%)        | 682      | 1        |
## |    | [factor]          | 2. 1                            |  75 (11.0%)        | (99.85%) | (0.15%)  |
## |    |                   | 3. 2                            |  93 (13.6%)        |          |          |
## |    |                   | 4. 3                            | 118 (17.3%)        |          |          |
## |    |                   | 5. 4                            | 131 (19.2%)        |          |          |
## |    |                   | 6. 5                            | 149 (21.9%)        |          |          |
## |    |                   | 7. 6                            |  90 (13.2%)        |          |          |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 3  | plant.stand       | 1. 0                            | 354 (54.7%)        | 647      | 36       |
## |    | [ordered, factor] | 2. 1                            | 293 (45.3%)        | (94.73%) | (5.27%)  |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 4  | precip            | 1. 0                            |  74 (11.5%)        | 645      | 38       |
## |    | [ordered, factor] | 2. 1                            | 112 (17.4%)        | (94.44%) | (5.56%)  |
## |    |                   | 3. 2                            | 459 (71.2%)        |          |          |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 5  | temp              | 1. 0                            |  80 (12.2%)        | 653      | 30       |
## |    | [ordered, factor] | 2. 1                            | 374 (57.3%)        | (95.61%) | (4.39%)  |
## |    |                   | 3. 2                            | 199 (30.5%)        |          |          |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 6  | hail              | 1. 0                            | 435 (77.4%)        | 562      | 121      |
## |    | [factor]          | 2. 1                            | 127 (22.6%)        | (82.28%) | (17.72%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 7  | crop.hist         | 1. 0                            |  65 ( 9.8%)        | 667      | 16       |
## |    | [factor]          | 2. 1                            | 165 (24.7%)        | (97.66%) | (2.34%)  |
## |    |                   | 3. 2                            | 219 (32.8%)        |          |          |
## |    |                   | 4. 3                            | 218 (32.7%)        |          |          |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 8  | area.dam          | 1. 0                            | 123 (18.0%)        | 682      | 1        |
## |    | [factor]          | 2. 1                            | 227 (33.3%)        | (99.85%) | (0.15%)  |
## |    |                   | 3. 2                            | 145 (21.3%)        |          |          |
## |    |                   | 4. 3                            | 187 (27.4%)        |          |          |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 9  | sever             | 1. 0                            | 195 (34.7%)        | 562      | 121      |
## |    | [factor]          | 2. 1                            | 322 (57.3%)        | (82.28%) | (17.72%) |
## |    |                   | 3. 2                            |  45 ( 8.0%)        |          |          |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 10 | seed.tmt          | 1. 0                            | 305 (54.3%)        | 562      | 121      |
## |    | [factor]          | 2. 1                            | 222 (39.5%)        | (82.28%) | (17.72%) |
## |    |                   | 3. 2                            |  35 ( 6.2%)        |          |          |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 11 | germ              | 1. 0                            | 165 (28.9%)        | 571      | 112      |
## |    | [ordered, factor] | 2. 1                            | 213 (37.3%)        | (83.6%)  | (16.4%)  |
## |    |                   | 3. 2                            | 193 (33.8%)        |          |          |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 12 | plant.growth      | 1. 0                            | 441 (66.1%)        | 667      | 16       |
## |    | [factor]          | 2. 1                            | 226 (33.9%)        | (97.66%) | (2.34%)  |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 13 | leaves            | 1. 0                            |  77 (11.3%)        | 683      | 0        |
## |    | [factor]          | 2. 1                            | 606 (88.7%)        | (100%)   | (0%)     |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 14 | leaf.halo         | 1. 0                            | 221 (36.9%)        | 599      | 84       |
## |    | [factor]          | 2. 1                            |  36 ( 6.0%)        | (87.7%)  | (12.3%)  |
## |    |                   | 3. 2                            | 342 (57.1%)        |          |          |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 15 | leaf.marg         | 1. 0                            | 357 (59.6%)        | 599      | 84       |
## |    | [factor]          | 2. 1                            |  21 ( 3.5%)        | (87.7%)  | (12.3%)  |
## |    |                   | 3. 2                            | 221 (36.9%)        |          |          |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 16 | leaf.size         | 1. 0                            |  51 ( 8.5%)        | 599      | 84       |
## |    | [ordered, factor] | 2. 1                            | 327 (54.6%)        | (87.7%)  | (12.3%)  |
## |    |                   | 3. 2                            | 221 (36.9%)        |          |          |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 17 | leaf.shread       | 1. 0                            | 487 (83.5%)        | 583      | 100      |
## |    | [factor]          | 2. 1                            |  96 (16.5%)        | (85.36%) | (14.64%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 18 | leaf.malf         | 1. 0                            | 554 (92.5%)        | 599      | 84       |
## |    | [factor]          | 2. 1                            |  45 ( 7.5%)        | (87.7%)  | (12.3%)  |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 19 | leaf.mild         | 1. 0                            | 535 (93.0%)        | 575      | 108      |
## |    | [factor]          | 2. 1                            |  20 ( 3.5%)        | (84.19%) | (15.81%) |
## |    |                   | 3. 2                            |  20 ( 3.5%)        |          |          |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 20 | stem              | 1. 0                            | 296 (44.4%)        | 667      | 16       |
## |    | [factor]          | 2. 1                            | 371 (55.6%)        | (97.66%) | (2.34%)  |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 21 | lodging           | 1. 0                            | 520 (92.5%)        | 562      | 121      |
## |    | [factor]          | 2. 1                            |  42 ( 7.5%)        | (82.28%) | (17.72%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 22 | stem.cankers      | 1. 0                            | 379 (58.8%)        | 645      | 38       |
## |    | [factor]          | 2. 1                            |  39 ( 6.0%)        | (94.44%) | (5.56%)  |
## |    |                   | 3. 2                            |  36 ( 5.6%)        |          |          |
## |    |                   | 4. 3                            | 191 (29.6%)        |          |          |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 23 | canker.lesion     | 1. 0                            | 320 (49.6%)        | 645      | 38       |
## |    | [factor]          | 2. 1                            |  83 (12.9%)        | (94.44%) | (5.56%)  |
## |    |                   | 3. 2                            | 177 (27.4%)        |          |          |
## |    |                   | 4. 3                            |  65 (10.1%)        |          |          |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 24 | fruiting.bodies   | 1. 0                            | 473 (82.0%)        | 577      | 106      |
## |    | [factor]          | 2. 1                            | 104 (18.0%)        | (84.48%) | (15.52%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 25 | ext.decay         | 1. 0                            | 497 (77.0%)        | 645      | 38       |
## |    | [factor]          | 2. 1                            | 135 (20.9%)        | (94.44%) | (5.56%)  |
## |    |                   | 3. 2                            |  13 ( 2.0%)        |          |          |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 26 | mycelium          | 1. 0                            | 639 (99.1%)        | 645      | 38       |
## |    | [factor]          | 2. 1                            |   6 ( 0.9%)        | (94.44%) | (5.56%)  |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 27 | int.discolor      | 1. 0                            | 581 (90.1%)        | 645      | 38       |
## |    | [factor]          | 2. 1                            |  44 ( 6.8%)        | (94.44%) | (5.56%)  |
## |    |                   | 3. 2                            |  20 ( 3.1%)        |          |          |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 28 | sclerotia         | 1. 0                            | 625 (96.9%)        | 645      | 38       |
## |    | [factor]          | 2. 1                            |  20 ( 3.1%)        | (94.44%) | (5.56%)  |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 29 | fruit.pods        | 1. 0                            | 407 (68.0%)        | 599      | 84       |
## |    | [factor]          | 2. 1                            | 130 (21.7%)        | (87.7%)  | (12.3%)  |
## |    |                   | 3. 2                            |  14 ( 2.3%)        |          |          |
## |    |                   | 4. 3                            |  48 ( 8.0%)        |          |          |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 30 | fruit.spots       | 1. 0                            | 345 (59.8%)        | 577      | 106      |
## |    | [factor]          | 2. 1                            |  75 (13.0%)        | (84.48%) | (15.52%) |
## |    |                   | 3. 2                            |  57 ( 9.9%)        |          |          |
## |    |                   | 4. 4                            | 100 (17.3%)        |          |          |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 31 | seed              | 1. 0                            | 476 (80.5%)        | 591      | 92       |
## |    | [factor]          | 2. 1                            | 115 (19.5%)        | (86.53%) | (13.47%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 32 | mold.growth       | 1. 0                            | 524 (88.7%)        | 591      | 92       |
## |    | [factor]          | 2. 1                            |  67 (11.3%)        | (86.53%) | (13.47%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 33 | seed.discolor     | 1. 0                            | 513 (88.9%)        | 577      | 106      |
## |    | [factor]          | 2. 1                            |  64 (11.1%)        | (84.48%) | (15.52%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 34 | seed.size         | 1. 0                            | 532 (90.0%)        | 591      | 92       |
## |    | [factor]          | 2. 1                            |  59 (10.0%)        | (86.53%) | (13.47%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 35 | shriveling        | 1. 0                            | 539 (93.4%)        | 577      | 106      |
## |    | [factor]          | 2. 1                            |  38 ( 6.6%)        | (84.48%) | (15.52%) |
## +----+-------------------+---------------------------------+--------------------+----------+----------+
## | 36 | roots             | 1. 0                            | 551 (84.5%)        | 652      | 31       |
## |    | [factor]          | 2. 1                            |  86 (13.2%)        | (95.46%) | (4.54%)  |
## |    |                   | 3. 2                            |  15 ( 2.3%)        |          |          |
## +----+-------------------+---------------------------------+--------------------+----------+----------+

A random variable X is degenerate if, for some constant c, P(X = c) = 1. Near-zero variance predictors come close to this: they take a single value for the vast majority of the samples. The rule of thumb for detecting near-zero variance predictors is:

  1. The fraction of unique values over the sample size is low (say 10%).
  2. The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20).

If both of these criteria are true and the model in question is susceptible to this type of predictor, it may be advantageous to remove the variable from the model. Every predictor here has only a handful of levels across 683 samples, so criterion #1 is met throughout. From the above table, a few variables are questionable because a single level holds close to 90% or more of the valid samples: leaf.malf, leaf.mild, lodging, mycelium, sclerotia, int.discolor, mold.growth, seed.discolor, seed.size, and shriveling. Let’s further determine which of these can be removed as predictors by checking criterion #2.

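# Work on de-duplicated rows
df = distinct(Soybean)
variables = c("leaf.malf", "lodging", "mycelium", "sclerotia",  "mold.growth", "seed.discolor", "seed.size", 
              "shriveling", "leaf.mild", "int.discolor")
# Stack the level counts of each candidate predictor; df[[i]] extracts the column
# as a vector, so the resulting columns are always named Var1 and Freq
counts = data.frame()
for (i in variables) {
  counts = rbind(counts, as.data.frame(table(df[[i]])))
}
# Rows 1-16: eight two-level predictors, ratio of level 0 to level 1
ratio = c()
for (i in seq(1, 16, by = 2)) {
  ratio[i] = counts$Freq[i]/counts$Freq[i+1]
}
# Rows 17-22: two three-level predictors, ratio of level 0 to each other level
for (i in c(17,20)) {
  ratio[i] = counts$Freq[i]/counts$Freq[i+1]
  ratio[i+1] = counts$Freq[i]/counts$Freq[i+2]
}
ratio[22] = NA  # pad to 22 entries so ratio lines up with the rows of counts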
decision = c()
for (i in 1:22) {
  if (is.na(ratio[i])){
    decision[i] = ""
  } else if (ratio[i] > 20) {
    decision[i] = "Remove"
  } else {
    decision[i] = "Keep"
  }
}
variables = c("leaf.malf","", "lodging","","mycelium","", "sclerotia","", "mold.growth", "","seed.discolor",
              "", "seed.size","", "shriveling","", "leaf.mild","","", "int.discolor","","")
options(knitr.kable.NA = '')
cbind(variables, rename(counts, factors = Var1, freq = Freq), ratio, decision) %>% 
  knitr::kable(digits = 2L, caption = "Near-zero Variance Predictors")
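Near-zero Variance Predictors

|variables     |factors | freq| ratio|decision |
|:-------------|:-------|----:|-----:|:--------|
|leaf.malf     |0       |  523| 11.62|Keep     |
|              |1       |   45|      |         |
|lodging       |0       |  490| 11.67|Keep     |
|              |1       |   42|      |         |
|mycelium      |0       |  591| 98.50|Remove   |
|              |1       |    6|      |         |
|sclerotia     |0       |  577| 28.85|Remove   |
|              |1       |   20|      |         |
|mold.growth   |0       |  490|  7.42|Keep     |
|              |1       |   66|      |         |
|seed.discolor |0       |  483|  7.67|Keep     |
|              |1       |   63|      |         |
|seed.size     |0       |  502|  9.30|Keep     |
|              |1       |   54|      |         |
|shriveling    |0       |  509| 13.76|Keep     |
|              |1       |   37|      |         |
|leaf.mild     |0       |  504| 25.20|Remove   |
|              |1       |   20| 25.20|Remove   |
|              |2       |   20|      |         |
|int.discolor  |0       |  535| 12.74|Keep     |
|              |1       |   42| 26.75|Remove   |
|              |2       |   20|      |         |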

The investigation above indicates that mycelium, sclerotia, and leaf.mild are strongly imbalanced, so it is advantageous to remove these variables from the model. Note that int.discolor produced both a "Keep" and a "Remove": the ratio of its most frequent level to the next most frequent level is below 20, while the ratio to its least frequent level is above 20. Because the rule of thumb compares the two most prevalent values, the variable is kept unless there is another indication that it is affecting the model.
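The rule of thumb can also be wrapped in a small helper so that every predictor is screened the same way. The sketch below is a hand-rolled version with hypothetical function names and made-up toy vectors; the cutoffs of 20 for the frequency ratio and 10% for the share of unique values follow the rule of thumb quoted in this exercise. The caret package offers `nearZeroVar()` for the same purpose, though its default frequency cutoff (95/5) is slightly different.

```r
# Frequency ratio (most common vs. second most common value) and
# percentage of unique values for one predictor.
nzv_metrics <- function(x) {
  x <- x[!is.na(x)]
  counts <- sort(table(as.character(x)), decreasing = TRUE)
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# Flag a predictor only when both criteria are met (assumed cutoffs).
flag_nzv <- function(x, freq_cut = 20, unique_cut = 10) {
  m <- nzv_metrics(x)
  (m[["freq_ratio"]] > freq_cut) && (m[["pct_unique"]] < unique_cut)
}

# Toy vectors with made-up counts, not the Soybean data
imbalanced <- factor(c(rep("0", 990), rep("1", 10)))
balanced   <- factor(c(rep("0", 600), rep("1", 400)))

flag_imbalanced <- flag_nzv(imbalanced)   # ratio 99, 0.2% unique
flag_balanced   <- flag_nzv(balanced)     # ratio 1.5, 0.2% unique
```

Applied to the predictors, `sapply(Soybean[-1], flag_nzv)` would return one logical flag per column.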

  b. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
Soybean[which(!complete.cases(Soybean)),] %>% 
  group_by(Class) %>%  summarise(Count = n()) %>%
  mutate(Proportion = (Count/nrow(Soybean))) %>%
  arrange(desc(Count)) %>% 
  knitr::kable(digits = 3L, caption = "Proportion of Incomplete Cases by Class")
Proportion of Incomplete Cases by Class

|Class                       | Count| Proportion|
|:---------------------------|-----:|----------:|
|phytophthora-rot            |    68|      0.100|
|2-4-d-injury                |    16|      0.023|
|diaporthe-pod-&-stem-blight |    15|      0.022|
|cyst-nematode               |    14|      0.020|
|herbicide-injury            |     8|      0.012|

Incomplete cases occur in only five classes, so the missingness is clearly related to the class. The phytophthora-rot class alone contributes 68 of the 121 incomplete cases, which is nearly 10% of all samples. Next, we look at the proportion of missing data within each variable.
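The table above divides by the total number of samples. A complementary view divides by the size of each class, which shows whether a class is missing data in some or all of its rows. The sketch below contrasts the two denominators on a hypothetical toy data frame; all names and values are made up for illustration.

```r
# Hypothetical toy data: class B is always incomplete, class C never is.
toy <- data.frame(
  Class = c("A", "A", "A", "A", "B", "B", "C", "C", "C", "C"),
  x1 = c(1, NA, 3, 4, NA, NA, 1, 2, 3, 4),
  x2 = c("u", "v", NA, "u", "v", NA, "u", "v", "u", "v"),
  stringsAsFactors = FALSE
)

incomplete <- !complete.cases(toy)

# Share of each class's own rows that are incomplete.
within_class <- tapply(incomplete, toy$Class, mean)

# Share of all rows that are incomplete and belong to each class.
share_of_all <- tapply(incomplete, toy$Class, sum) / nrow(toy)

rbind(within_class, share_of_all)
```

For the Soybean data, the within-class version would be `tapply(!complete.cases(Soybean), Soybean$Class, mean)`.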

na.counts = as.data.frame(((sapply(Soybean, function(x) sum(is.na(x))))/nrow(Soybean))*100)
names(na.counts) = "counts"
na.counts = cbind(variables = rownames(na.counts), data.frame(na.counts, row.names = NULL))

na.counts %>% arrange(counts) %>% mutate(name = factor(variables, levels = variables)) %>%
  ggplot(aes(x = name, y = counts)) + geom_segment( aes(xend = name, yend = 0)) +
  geom_point(size = 4, color = "steelblue2") + coord_flip() + theme_bw() +
  labs(title = "Proportion of Missing Data", x = "Variables", y = "% of Missing data") +
  scale_y_continuous(labels = scales::percent_format(scale = 1))

aggr(Soybean, col = c('steelblue2','royalblue4'), numbers = FALSE, sortVars = TRUE, 
     oma = c(6,4,3,2), labels = names(Soybean), cex.axis = 0.8, gap = 3, axes = TRUE, bars = FALSE, 
     combined = TRUE, Prop = TRUE, ylab = c("Combination of Missing Data"))

## 
##  Variables sorted by number of missings: 
##         Variable Count
##             hail   121
##            sever   121
##         seed.tmt   121
##          lodging   121
##             germ   112
##        leaf.mild   108
##  fruiting.bodies   106
##      fruit.spots   106
##    seed.discolor   106
##       shriveling   106
##      leaf.shread   100
##             seed    92
##      mold.growth    92
##        seed.size    92
##        leaf.halo    84
##        leaf.marg    84
##        leaf.size    84
##        leaf.malf    84
##       fruit.pods    84
##           precip    38
##     stem.cankers    38
##    canker.lesion    38
##        ext.decay    38
##         mycelium    38
##     int.discolor    38
##        sclerotia    38
##      plant.stand    36
##            roots    31
##             temp    30
##        crop.hist    16
##     plant.growth    16
##             stem    16
##             date     1
##         area.dam     1
##            Class     0
##           leaves     0

The graphs above show how much data is missing from the Soybean data. The first plot highlights that lodging, hail, sever, and seed.tmt are each missing in nearly 18% of the samples. The second plot shows the patterns of missing data across variables. It shows that 82% of the samples are complete and that the Class and leaves variables have no missing values. There are quite a few missingness patterns, but their overall proportion is not extreme. For example, from the graph, the first set of variables, from hail to fruit.pods, accounts for 8% of the missing data when the other variables are complete; note that this describes a combination of variables, not missingness within a single variable. For some imputation methods, such as certain types of multiple imputation, having fewer missingness patterns is helpful because fewer models need to be fit.
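The missingness patterns drawn by the second plot can also be tabulated directly. The sketch below encodes each row as a string of 0s and 1s (1 meaning missing) and counts how often each string occurs; the toy data frame is hypothetical and the approach is a plain base R stand-in, not what `aggr()` does internally.

```r
# Hypothetical toy data with three distinct missingness patterns.
toy <- data.frame(
  a = c(1, NA, 3, NA, 5, 6),
  b = c(NA, NA, 1, NA, 2, 3),
  c = c(1, 2, 3, 4, 5, 6)
)

# Logical matrix: TRUE where a value is missing.
miss <- is.na(toy)

# One pattern string per row, e.g. "110" means a and b missing, c observed.
pattern <- apply(miss, 1, function(r) paste(as.integer(r), collapse = ""))

# Frequency and share of each pattern, most common first.
pattern_counts <- sort(table(pattern), decreasing = TRUE)
pattern_share <- pattern_counts / nrow(toy)
pattern_counts
```

The pattern consisting only of zeros corresponds to the complete cases.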

  c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

From Part A, mycelium, sclerotia, and leaf.mild are strongly imbalanced, and it was deemed advantageous to remove these variables from the model. If the data set were large enough, rows with missing values could simply be deleted. However, because the missing proportions are not too extreme for most of the variables, k-nearest neighbor imputation is used instead. The distance computation for defining the nearest neighbors is based on the Gower distance (Gower 1971), which can handle variables of type binary, categorical, ordered, continuous, and semi-continuous. As a result, the data set is now complete.

Soybean.complete = Soybean %>% select(-c(mycelium, sclerotia, leaf.mild)) %>% kNN()
aggr(Soybean.complete, col = c('steelblue2','royalblue4'), numbers = FALSE, sortVars = FALSE, 
     oma = c(8,4,3,2), labels = names(Soybean.complete), cex.axis = 0.8, gap = 3, axes = TRUE, 
     bars = FALSE, combined = TRUE, Prop = TRUE, ylab = c("Combination of Missing Data"))
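To make the Gower distance mentioned above more concrete, the sketch below computes it for two rows of mixed-type data. It is a simplified illustration with hypothetical names: numeric variables contribute the absolute difference divided by the variable's range, all other variables contribute 0 when equal and 1 otherwise, and variables missing in either row are skipped. Ordered factors are treated like unordered categories here, which is an assumption, and VIM's own implementation may differ in its details.

```r
# Simplified Gower distance between two one-row data frames.
# 'ranges' holds the range of each numeric column and NA for the others.
gower_distance <- function(row_i, row_j, ranges) {
  per_var <- mapply(function(a, b, rng) {
    if (is.na(a) || is.na(b)) return(NA_real_)      # skip missing comparisons
    if (is.numeric(a)) {
      abs(a - b) / rng                              # scaled absolute difference
    } else {
      as.numeric(as.character(a) != as.character(b))  # simple mismatch
    }
  }, row_i, row_j, ranges)
  mean(per_var, na.rm = TRUE)
}

# Hypothetical toy data with one numeric and one categorical variable
toy <- data.frame(
  height = c(10, 20, 30),
  colour = c("red", "red", "blue"),
  stringsAsFactors = FALSE
)
ranges <- sapply(toy, function(col) {
  if (is.numeric(col)) diff(range(col, na.rm = TRUE)) else NA_real_
})

d12 <- gower_distance(toy[1, ], toy[2, ], ranges)   # (10/20 + 0) / 2 = 0.25
d13 <- gower_distance(toy[1, ], toy[3, ], ranges)   # (20/20 + 1) / 2 = 1
```

A k-nearest neighbor imputer then fills each missing value from the k rows with the smallest distance, for example by taking the most frequent category among those neighbors.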