3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

library(mlbench)
data(Soybean)
## See ?Soybean for details

A degenerate distribution is one in which a variable takes a single value, so it has zero variance. In practice, a predictor where one value accounts for almost every record is nearly degenerate and is just as uninformative. The min/max check below is run on complete cases only, because min and max return NA when missing values are present. No predictor has a minimum equal to its maximum, so no column is strictly degenerate. However, the frequency distribution plots show that several predictors are heavily unbalanced and could be candidates for removal: leaves, shriveling, leaf.malf, mold.growth, and lodging. mycelium should definitely be removed: 639 of its 645 non-missing values are 0, which makes it effectively degenerate.

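library(tidyverse)  # drop_na(), rename(), filter(), keep(), gather(), ggplot()
library(skimr)      # skim(), used in part (b)

# Keep complete cases only so that min/max do not return NA
S<-drop_na(Soybean)

# apply() coerces the factors to character, so min/max compare level labels as strings
mi<-as.data.frame(apply(S,2,min))
ma<-as.data.frame(apply(S,2,max))
mi<-tibble::rownames_to_column(mi, "predictor")
ma<-tibble::rownames_to_column(ma, "predictor")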

m<-as.data.frame(merge(mi,ma,by="predictor"))
m<-m %>% 
  rename(
     "min"="apply(S, 2, min)",
     "max"="apply(S, 2, max)" 
    )
m
##          predictor                 min                  max
## 1         area.dam                   0                    3
## 2    canker.lesion                   0                    3
## 3            Class alternarialeaf-spot rhizoctonia-root-rot
## 4        crop.hist                   0                    3
## 5             date                   0                    6
## 6        ext.decay                   0                    1
## 7       fruit.pods                   0                    3
## 8      fruit.spots                   0                    4
## 9  fruiting.bodies                   0                    1
## 10            germ                   0                    2
## 11            hail                   0                    1
## 12    int.discolor                   0                    2
## 13       leaf.halo                   0                    2
## 14       leaf.malf                   0                    1
## 15       leaf.marg                   0                    2
## 16       leaf.mild                   0                    2
## 17     leaf.shread                   0                    1
## 18       leaf.size                   0                    2
## 19          leaves                   0                    1
## 20         lodging                   0                    1
## 21     mold.growth                   0                    1
## 22        mycelium                   0                    1
## 23    plant.growth                   0                    1
## 24     plant.stand                   0                    1
## 25          precip                   0                    2
## 26           roots                   0                    2
## 27       sclerotia                   0                    1
## 28            seed                   0                    1
## 29   seed.discolor                   0                    1
## 30       seed.size                   0                    1
## 31        seed.tmt                   0                    2
## 32           sever                   0                    2
## 33      shriveling                   0                    1
## 34            stem                   0                    1
## 35    stem.cankers                   0                    3
## 36            temp                   0                    2
m%>%filter(min==max)
## [1] predictor min       max      
## <0 rows> (or 0-length row.names)
asNumeric <- function(x) as.numeric(as.character(x))
factorsNumeric <- function(d) modifyList(d, lapply(d[, sapply(d, is.factor)],   
                                                   asNumeric))

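# Drop Class first: its labels are not numeric and would be converted to NA
So<-factorsNumeric(Soybean[, names(Soybean) != "Class"])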
So %>%keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()
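
The visual judgment can be supplemented with a numeric check. The following is a minimal sketch, assuming the common rule of thumb that a predictor is nearly degenerate when the ratio of its most frequent value to its second most frequent value is large. The cutoff of 20 and the helper name `freq_ratio` are assumptions for illustration and are not part of the analysis above. The level counts are copied from the `skim(Soybean)` summary in part (b), non-missing values only.

```r
# Level counts copied from the skim(Soybean) summary (missing values excluded)
counts <- list(
  mycelium    = c("0" = 639, "1" = 6),
  sclerotia   = c("0" = 625, "1" = 20),
  leaf.mild   = c("0" = 535, "1" = 20, "2" = 20),
  leaves      = c("1" = 606, "0" = 77),
  mold.growth = c("0" = 524, "1" = 67)
)

# Hypothetical helper: most frequent count divided by the second most frequent
freq_ratio <- function(tab) {
  s <- sort(tab, decreasing = TRUE)
  unname(s[1] / s[2])
}

ratios  <- sapply(counts, freq_ratio)
cutoff  <- 20  # assumed threshold, not taken from the analysis above
flagged <- names(ratios)[ratios > cutoff]

print(round(ratios, 1))
print(flagged)
```

Which predictors are flagged depends entirely on the chosen cutoff, so the result should be read alongside the plots rather than as a replacement for them. On the full data frame the same helper could be applied to `table(x)` for each predictor column.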

(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

The skim function from the skimr library shows that some predictors are much more likely to be missing than others. The least complete are hail, sever, seed.tmt, and lodging, each missing 121 records for a completion rate of 82.3%. The missingness is also tied to the classes: only 5 of the 19 classes have any records with missing fields. In 4 of them (2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury) every record has at least one missing field, and in the fifth (phytophthora-rot) 68 of 88 records, roughly three quarters, do.

skim(Soybean)

| Data summary | |
|---|---|
| Name | Soybean |
| Number of rows | 683 |
| Number of columns | 36 |
| Column type frequency: factor | 36 |
| Group variables | None |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Class | 0 | 1.00 | FALSE | 19 | bro: 92, alt: 91, fro: 91, phy: 88 |
| date | 1 | 1.00 | FALSE | 7 | 5: 149, 4: 131, 3: 118, 2: 93 |
| plant.stand | 36 | 0.95 | TRUE | 2 | 0: 354, 1: 293 |
| precip | 38 | 0.94 | TRUE | 3 | 2: 459, 1: 112, 0: 74 |
| temp | 30 | 0.96 | TRUE | 3 | 1: 374, 2: 199, 0: 80 |
| hail | 121 | 0.82 | FALSE | 2 | 0: 435, 1: 127 |
| crop.hist | 16 | 0.98 | FALSE | 4 | 2: 219, 3: 218, 1: 165, 0: 65 |
| area.dam | 1 | 1.00 | FALSE | 4 | 1: 227, 3: 187, 2: 145, 0: 123 |
| sever | 121 | 0.82 | FALSE | 3 | 1: 322, 0: 195, 2: 45 |
| seed.tmt | 121 | 0.82 | FALSE | 3 | 0: 305, 1: 222, 2: 35 |
| germ | 112 | 0.84 | TRUE | 3 | 1: 213, 2: 193, 0: 165 |
| plant.growth | 16 | 0.98 | FALSE | 2 | 0: 441, 1: 226 |
| leaves | 0 | 1.00 | FALSE | 2 | 1: 606, 0: 77 |
| leaf.halo | 84 | 0.88 | FALSE | 3 | 2: 342, 0: 221, 1: 36 |
| leaf.marg | 84 | 0.88 | FALSE | 3 | 0: 357, 2: 221, 1: 21 |
| leaf.size | 84 | 0.88 | TRUE | 3 | 1: 327, 2: 221, 0: 51 |
| leaf.shread | 100 | 0.85 | FALSE | 2 | 0: 487, 1: 96 |
| leaf.malf | 84 | 0.88 | FALSE | 2 | 0: 554, 1: 45 |
| leaf.mild | 108 | 0.84 | FALSE | 3 | 0: 535, 1: 20, 2: 20 |
| stem | 16 | 0.98 | FALSE | 2 | 1: 371, 0: 296 |
| lodging | 121 | 0.82 | FALSE | 2 | 0: 520, 1: 42 |
| stem.cankers | 38 | 0.94 | FALSE | 4 | 0: 379, 3: 191, 1: 39, 2: 36 |
| canker.lesion | 38 | 0.94 | FALSE | 4 | 0: 320, 2: 177, 1: 83, 3: 65 |
| fruiting.bodies | 106 | 0.84 | FALSE | 2 | 0: 473, 1: 104 |
| ext.decay | 38 | 0.94 | FALSE | 3 | 0: 497, 1: 135, 2: 13 |
| mycelium | 38 | 0.94 | FALSE | 2 | 0: 639, 1: 6 |
| int.discolor | 38 | 0.94 | FALSE | 3 | 0: 581, 1: 44, 2: 20 |
| sclerotia | 38 | 0.94 | FALSE | 2 | 0: 625, 1: 20 |
| fruit.pods | 84 | 0.88 | FALSE | 4 | 0: 407, 1: 130, 3: 48, 2: 14 |
| fruit.spots | 106 | 0.84 | FALSE | 4 | 0: 345, 4: 100, 1: 75, 2: 57 |
| seed | 92 | 0.87 | FALSE | 2 | 0: 476, 1: 115 |
| mold.growth | 92 | 0.87 | FALSE | 2 | 0: 524, 1: 67 |
| seed.discolor | 106 | 0.84 | FALSE | 2 | 0: 513, 1: 64 |
| seed.size | 92 | 0.87 | FALSE | 2 | 0: 532, 1: 59 |
| shriveling | 106 | 0.84 | FALSE | 2 | 0: 539, 1: 38 |
| roots | 31 | 0.95 | FALSE | 3 | 0: 551, 1: 86, 2: 15 |

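# Rows with at least one missing value
sNA <- Soybean[rowSums(is.na(Soybean)) > 0,]
library(sqldf)

# Per class: incomplete rows (null_count), to compare with all rows (full_count)
sm<-sqldf("select count(Class) null_count, Class from sNA group by Class")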
soy<-sqldf("select count(Class) full_count, Class from Soybean group by Class")
s<-sqldf("select a.class, null_count,full_count from sm a join soy b on a.class = b.class")
s
##                         Class null_count full_count
## 1                2-4-d-injury         16         16
## 2               cyst-nematode         14         14
## 3 diaporthe-pod-&-stem-blight         15         15
## 4            herbicide-injury          8          8
## 5            phytophthora-rot         68         88
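
The same per-class comparison can be made without SQL. The following is a minimal sketch in base R, assuming a hypothetical helper named `missing_by_class`; the toy data frame is illustrative only and is not the soybean data.

```r
# Hypothetical helper: per class, count rows with at least one NA
missing_by_class <- function(df, class_col) {
  cls        <- factor(df[[class_col]])
  incomplete <- !complete.cases(df)
  out <- data.frame(
    Class      = levels(cls),
    null_count = as.vector(table(cls[incomplete])),  # incomplete rows per class
    full_count = as.vector(table(cls))               # all rows per class
  )
  out$share <- out$null_count / out$full_count
  out[out$null_count > 0, ]                          # keep affected classes only
}

# Toy data (illustrative only)
toy <- data.frame(
  Class = factor(c("a", "a", "b", "b", "c", "c")),
  x     = c(1, NA, 2, 3, NA, 4),
  y     = c(NA, NA, 1, 1, 0, 0)
)
res <- missing_by_class(toy, "Class")
print(res)

# With the data from this exercise: missing_by_class(Soybean, "Class")
```

The `share` column makes it easy to separate classes where every record is incomplete from classes where only part of the records are.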

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

For predictors that are entirely NA for a whole class (e.g., hail is missing for every record of four classes), I would either create a dummy variable that indicates whether the predictor was filled in, or remove the predictor entirely. Relying on the indicator may be an issue, because the missingness is likely an artifact of how the data were collected and the pattern may not hold over time. For predictors that have some data within a class, I would impute a class-specific typical value for that predictor; because the predictors are categorical, the most frequent level within the class is the natural choice.
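
The strategy above can be sketched as follows. This is a minimal illustration in base R, assuming that the class-wise "average" of a categorical predictor means its most frequent level. The helper names `mode_level` and `impute_by_class` are hypothetical, and the toy data frame is illustrative only, not the soybean data.

```r
# Most frequent non-missing level of a factor (ties: the first level wins)
mode_level <- function(x) {
  tab <- table(x)                        # table() drops NA by default
  if (sum(tab) == 0) return(NA_character_)
  names(tab)[which.max(tab)]
}

# Hypothetical helper: add a "<col>_missing" flag, then fill each NA with the
# mode of the same class; classes with no observed values stay NA
impute_by_class <- function(df, class_col, col) {
  df[[paste0(col, "_missing")]] <- as.integer(is.na(df[[col]]))
  for (cl in unique(as.character(df[[class_col]]))) {
    in_class <- df[[class_col]] == cl
    fill     <- mode_level(df[[col]][in_class])
    todo     <- in_class & is.na(df[[col]])
    if (!is.na(fill)) df[[col]][todo] <- fill
  }
  df
}

# Toy data (illustrative only): class "b" never has hail recorded
toy <- data.frame(
  Class = factor(c("a", "a", "a", "b", "b")),
  hail  = factor(c("0", "0", NA, NA, NA), levels = c("0", "1"))
)
out <- impute_by_class(toy, "Class", "hail")
print(out)
```

In the toy data, class "b" has no observed values, so its entries stay NA and only the indicator column carries information. That mirrors the classes in the soybean data where a predictor is never recorded, for which imputing within the class is not possible.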