knitr::opts_chunk$set(fig.width = 11, fig.height = 11)
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
require(fpp2)
## Loading required package: fpp2
## Loading required package: forecast
## Loading required package: fma
## Loading required package: expsmooth
require(corrplot)
## Loading required package: corrplot
## corrplot 0.84 loaded

3.2 The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

require(mlbench)
## Loading required package: mlbench
data(Soybean)

a. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

So here are the frequency distributions for all 35 predictors.

table(Soybean$date, dnn = "Date")
## Date
##   0   1   2   3   4   5   6 
##  26  75  93 118 131 149  90
table(Soybean$plant.stand, dnn = "Plant Stand")
## Plant Stand
##   0   1 
## 354 293
table(Soybean$precip, dnn = "Precip")
## Precip
##   0   1   2 
##  74 112 459
table(Soybean$temp, dnn = "Temp")
## Temp
##   0   1   2 
##  80 374 199
table(Soybean$hail, dnn = "Hail")
## Hail
##   0   1 
## 435 127
table(Soybean$crop.hist, dnn = "Crop Hist")
## Crop Hist
##   0   1   2   3 
##  65 165 219 218
table(Soybean$area.dam, dnn = "Area Dam")
## Area Dam
##   0   1   2   3 
## 123 227 145 187
table(Soybean$sever, dnn = "Sever")
## Sever
##   0   1   2 
## 195 322  45
table(Soybean$seed.tmt, dnn = "Seed Tmt")
## Seed Tmt
##   0   1   2 
## 305 222  35
table(Soybean$germ, dnn = "Germ")
## Germ
##   0   1   2 
## 165 213 193
table(Soybean$plant.growth, dnn = "Plant Growth")
## Plant Growth
##   0   1 
## 441 226
table(Soybean$leaves, dnn = "Leaves")
## Leaves
##   0   1 
##  77 606
table(Soybean$leaf.halo, dnn = "Leaf Halo")
## Leaf Halo
##   0   1   2 
## 221  36 342
table(Soybean$leaf.marg, dnn = "Leaf Marg")
## Leaf Marg
##   0   1   2 
## 357  21 221
table(Soybean$leaf.size, dnn = "Leaf Size")
## Leaf Size
##   0   1   2 
##  51 327 221
table(Soybean$leaf.shread, dnn = "Leaf Shread")
## Leaf Shread
##   0   1 
## 487  96
table(Soybean$leaf.malf, dnn = "Leaf Malf")
## Leaf Malf
##   0   1 
## 554  45
table(Soybean$leaf.mild, dnn = "Leaf Mild")
## Leaf Mild
##   0   1   2 
## 535  20  20
table(Soybean$stem, dnn = "Stem")
## Stem
##   0   1 
## 296 371
table(Soybean$lodging, dnn = "Lodging")
## Lodging
##   0   1 
## 520  42
table(Soybean$stem.cankers, dnn = "Stem Cankers")
## Stem Cankers
##   0   1   2   3 
## 379  39  36 191
table(Soybean$canker.lesion, dnn = "Canker Lesion")
## Canker Lesion
##   0   1   2   3 
## 320  83 177  65
table(Soybean$fruiting.bodies, dnn = "Fruiting Bodies")
## Fruiting Bodies
##   0   1 
## 473 104
table(Soybean$ext.decay, dnn = "Ext Decay")
## Ext Decay
##   0   1   2 
## 497 135  13
table(Soybean$mycelium, dnn = "Mycelium")
## Mycelium
##   0   1 
## 639   6
table(Soybean$int.discolor, dnn = "Int Discolor")
## Int Discolor
##   0   1   2 
## 581  44  20
table(Soybean$sclerotia, dnn = "Sclerotia")
## Sclerotia
##   0   1 
## 625  20
table(Soybean$fruit.pods, dnn = "Fruit Pods")
## Fruit Pods
##   0   1   2   3 
## 407 130  14  48
table(Soybean$fruit.spots, dnn = "Fruit Spots")
## Fruit Spots
##   0   1   2   4 
## 345  75  57 100
table(Soybean$seed, dnn = "Seed")
## Seed
##   0   1 
## 476 115
table(Soybean$mold.growth, dnn = "Mold Growth")
## Mold Growth
##   0   1 
## 524  67
table(Soybean$seed.discolor, dnn = "Seed Discolor")
## Seed Discolor
##   0   1 
## 513  64
table(Soybean$seed.size, dnn = "Seed Size")
## Seed Size
##   0   1 
## 532  59
table(Soybean$shriveling, dnn = "Shriveling")
## Shriveling
##   0   1 
## 539  38
table(Soybean$roots, dnn = "Roots")
## Roots
##   0   1   2 
## 551  86  15
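
For reference, the same 35 tables can be produced in one pass, skipping column 1 (the Class outcome); table() drops NAs by default, so the counts match the per-column calls above.

lapply(Soybean[, -1], table)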

At first blush, a few of them seem degenerate: leaf.malf is split 554 to 45, and leaf.mild is split 535 to 20 and 20. However, we won't know for sure until we apply the degeneracy criteria from the chapter, which the helper functions below check.

# Return the column indices of degenerate predictors. Column 1 (Class,
# the outcome) is skipped; note that table() drops NAs by default.
degenerate <- function(df){
  ans <- numeric(0)
  
  for(i in 2:ncol(df)){
    # Level counts, sorted from most to least frequent.
    t <- sort(table(df[, i]), decreasing = TRUE)
    
    if(t.ratio(t) && uv(t)){
      ans <- c(ans, i)
    }
  }
  
  ans
}

# TRUE when the most frequent level outnumbers the runner-up by ~20 to 1.
t.ratio <- function(t){
  t[1] / t[2] > 19
}

# TRUE when the most frequent level accounts for ~90% or more of the samples.
uv <- function(t){
  t <- t / sum(t)
  t[1] > 0.89
}

l <- degenerate(Soybean)

head(Soybean[, l])
##   leaf.mild mycelium sclerotia
## 1         0        0         0
## 2         0        0         0
## 3         0        0         0
## 4         0        0         0
## 5         0        0         0
## 6         0        0         0

It appears, by our calculations, that at least three predictors are degenerate: leaf.mild, mycelium, and sclerotia. That is to say, for each of them the most frequent level accounts for roughly 90% or more of the samples, and the ratio of the most frequent level to the second most frequent is about 20 or greater. In practice this might not matter, because we could simply choose a model that is not harmed by near-zero-variance predictors.
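
As a cross-check, caret (already loaded above) provides nearZeroVar(), whose defaults (freqCut = 95/5, uniqueCut = 10) encode essentially the same two heuristics; it should flag the same three predictors.

nearZeroVar(Soybean, names = TRUE)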

c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

For the rows in which the class is NA, it would probably be best to delete them outright. It is possible to predict those labels, but since predicting the class is exactly what the model will be built to do in the first place, that seems redundant.

For the missing data that follow a pattern, such as those outlined above, it might be possible to predict the missing values from the group. Since we have grouped some of them into the data frames eight.four, one.two.one, and three.eight, we could do as the book suggests and run a k-nearest-neighbors algorithm on each one to impute the missing values. One potential issue is that the training/test set would have to consist of only the complete entries, which may be too small a sample to learn anything meaningful from.

Still, this is better than the alternative, which would be to replace each NA with the most common level of the given predictor. If these were numeric values we could take the mean or median, but they are unfortunately all factor variables, so we would be working with straight replacements. If we are going to replace NAs with the values that make the most sense given the other entries, it makes more sense to let a kNN algorithm do it than to fiddle with it manually; a sketch of that approach follows.
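
A minimal sketch, assuming the VIM package is installed: its kNN() imputes factor columns directly (via Gower distance), unlike caret's knnImpute, which only handles numeric data.

require(VIM)

# Drop any rows whose Class is missing, then impute the remaining NAs
# from the k = 5 nearest neighbors.
soy <- Soybean[!is.na(Soybean$Class), ]
soy.imputed <- kNN(soy, k = 5, imp_var = FALSE)  # imp_var = FALSE drops the indicator columns

sum(is.na(soy.imputed))  # 0 if every NA was imputed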