3.1. The UC Irvine Machine Learning Repository1 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
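Beyond mlbench, the chunks below rely on several other packages. They were presumably loaded in an earlier setup chunk; something along these lines would cover everything used here:
library(tidyverse)   # dplyr, tidyr, ggplot2, purrr, forcats
library(caret)       # BoxCoxTrans, nearZeroVar, findCorrelation
library(e1071)       # skewness (assuming it comes from here rather than the moments package)
library(knitr)       # kable
library(kableExtra)  # kable_styling
library(GGally)      # ggscatmat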
kable(summary(Glass[,1:5])) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

| RI | Na | Mg | Al | Si |
|---|---|---|---|---|
| Min. :1.511 | Min. :10.73 | Min. :0.000 | Min. :0.290 | Min. :69.81 |
| 1st Qu.:1.517 | 1st Qu.:12.91 | 1st Qu.:2.115 | 1st Qu.:1.190 | 1st Qu.:72.28 |
| Median :1.518 | Median :13.30 | Median :3.480 | Median :1.360 | Median :72.79 |
| Mean :1.518 | Mean :13.41 | Mean :2.685 | Mean :1.445 | Mean :72.65 |
| 3rd Qu.:1.519 | 3rd Qu.:13.82 | 3rd Qu.:3.600 | 3rd Qu.:1.630 | 3rd Qu.:73.09 |
| Max. :1.534 | Max. :17.38 | Max. :4.490 | Max. :3.500 | Max. :75.41 |
kable(summary(Glass[,6:10])) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

| K | Ca | Ba | Fe | Type |
|---|---|---|---|---|
| Min. :0.0000 | Min. : 5.430 | Min. :0.000 | Min. :0.00000 | 1:70 |
| 1st Qu.:0.1225 | 1st Qu.: 8.240 | 1st Qu.:0.000 | 1st Qu.:0.00000 | 2:76 |
| Median :0.5550 | Median : 8.600 | Median :0.000 | Median :0.00000 | 3:17 |
| Mean :0.4971 | Mean : 8.957 | Mean :0.175 | Mean :0.05701 | 5:13 |
| 3rd Qu.:0.6100 | 3rd Qu.: 9.172 | 3rd Qu.:0.000 | 3rd Qu.:0.10000 | 6: 9 |
| Max. :6.2100 | Max. :16.190 | Max. :3.150 | Max. :0.51000 | 7:29 |
There are nine numerical predictors (RI, Na, Mg, Al, Si, K, Ca, Ba, and Fe) and one categorical target with six levels (Type; the original description lists seven classes, but type 4 has no samples in this data set). The data set is heavily skewed toward categories 1 and 2, with 146 of the 214 observations falling into those two categories alone. Ba and Fe have first-quartile and median values of 0, indicating that their distributions contain a lot of zeros.
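A quick tabulation of the target confirms the imbalance (the counts match the Type column of the summary above):
table(Glass$Type)
##  1  2  3  5  6  7
## 70 76 17 13  9 29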
Glass[,1:9] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(fill = '#4575b4') +
theme(panel.background = element_blank(), legend.position="top")

Histograms of the predictor distributions show that all of them, except perhaps Si, are moderately to highly skewed. The histogram below has been colored by the target categories and shows that the distributions differ across target values (especially for Al and Na), indicating that those variables may be good predictors. Almost all of the zero values in Mg belong to categories 5, 6, and 7, and almost all of the non-zero values in Ba correspond to category 7, so treating zeroes as missing data would probably reduce the predictive accuracy of our model.
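To put numbers on those zeroes, the share of exact zeros per predictor can be checked directly; a quick sketch:
# proportion of exact zeros in each predictor
round(colMeans(Glass[, 1:9] == 0), 2)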
Glass %>%
gather(-Type, key = "var", value = "val") %>%
ggplot(aes(x = val, fill=Type)) +
geom_histogram(bins=10, alpha=1) +
facet_wrap(~ var, scales = "free") +
scale_fill_manual("target",
values = c('#d73027','#fc8d59','#fee090','#e0f3f8','#91bfdb','#4575b4')) +
xlab("") +
ylab("") +
theme(panel.background = element_blank(), legend.position="top")

Those differences in the distributions of the predictors when separated by the target variable are even more evident in the box plots below. All of the predictors show considerable variation in their distributions across target values. The plot for Si shows the least difference between categories, indicating that it may have less predictive value. We can also see many outliers, indicated by the red dots above and below the plot 'whiskers'. Some are quite extreme, like the two at the top of the Ba plot and the one at the top of the K plot.
Glass %>%
gather(-Type,key = "var", value = "val") %>%
ggplot(aes(x=factor(Type), y=val)) +
geom_boxplot(width=.5, fill="#58BFFF", outlier.colour="red", outlier.size = 1) +
stat_summary(aes(colour="mean"), fun.y=mean, geom="point",
size=2, show.legend=TRUE) +
stat_summary(aes(colour="median"), fun.y=median, geom="point",
size=2, show.legend=TRUE) +
facet_wrap(~ var, scales = "free", ncol=3) +
labs(colour="Statistics", x="", y="") +
scale_colour_manual(values=c("#9900FF", "#3300FF")) +
theme(panel.background=element_blank())

Only RI and Ca seem to be highly correlated with each other, with a correlation of about 0.81.
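That figure can be verified directly:
round(cor(Glass$RI, Glass$Ca), 2)  # about 0.81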
The correlation plot below has been colored by the target variable and shows some clear patterns in the distributions by Type confirming what we noted above.
Glass %>%
ggscatmat(color="Type", alpha=0.6) +
scale_color_manual(values=c('#d73027','#fc8d59','#fee090','#e0f3f8','#91bfdb','#4575b4')) +
theme(panel.background=element_blank(), legend.position="top",
axis.text.x = element_text(angle=-40, vjust=1, hjust=0))
# The call producing the output below is not shown in the original; it was
# presumably something like this (a cutoff at or below 0.81 is needed, since
# the default of 0.90 would not flag any pair here):
findCorrelation(cor(Glass[,1:9]), cutoff = 0.75)
## [1] 7

The findCorrelation function suggests removing column 7, which corresponds to Ca.
There are no missing values in the data set.
missing <- data.frame(t(apply(is.na(Glass), 2, sum)))
# kable(missing, col.names = "NA's")
row.names(missing) <- c("NA's")
kable(missing, caption='Missing Values in the Glass Dataset') %>%
kable_styling(bootstrap_options = c("condensed"))

| | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| NA's | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
As noted above, most if not all of the predictors have both outliers and skewed distributions, and a few predictors contain a large percentage of zeroes.
skewness_statistic <- apply(Glass[,1:9], 2, skewness)
ratio_max_to_min <- apply(Glass[,1:9], 2, max)/apply(Glass[,1:9], 2, min)
kable(data.frame(skewness_statistic, ratio_max_to_min)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| | skewness_statistic | ratio_max_to_min |
|---|---|---|
| RI | 1.6027151 | 1.015075 |
| Na | 0.4478343 | 1.619758 |
| Mg | -1.1364523 | Inf |
| Al | 0.8946104 | 12.068965 |
| Si | -0.7202392 | 1.080218 |
| K | 6.4600889 | Inf |
| Ca | 2.0184463 | 2.981584 |
| Ba | 3.3686800 | Inf |
| Fe | 1.7298107 | Inf |
Because there are so many zero values in the data, it's hard to gauge the severity of the skew from the max-to-min ratio, since division by zero yields Inf. But the skewness statistics are clearly large, and the skew is plainly visible in the histograms.
The skewed distributions might be improved by log, square root, or inverse transformations, or by a Box-Cox transformation.
Unfortunately, because of the zero values, Mg, K, Ba, and Fe cannot be transformed using Box-Cox. I'm still not sure how to handle that, although I found this post by Rob Hyndman on the subject: Transforming data with zeros.
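Note that caret's BoxCoxTrans cannot estimate a lambda when a variable contains non-positive values; in that case predict() simply returns the data unchanged, which is why Mg, K, Ba, and Fe pass through untouched in the table below. For example:
BoxCoxTrans(Glass$Ba)  # reports that lambda could not be estimated; no transformation is applied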
# https://stackoverflow.com/questions/46485024/how-to-use-boxcoxtrans-function-in-r
GlassBC <- apply(Glass[,1:9], 2, BoxCoxTrans)
trans_Glass <- purrr::map2(GlassBC, Glass[,1:9], function(x, y) predict(x, y))
trans_Glass_df = as.data.frame(do.call(cbind, trans_Glass))
kable(head(trans_Glass_df)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| RI | Na | Mg | Al | Si | K | Ca | Ba | Fe |
|---|---|---|---|---|---|---|---|---|
| 0.2838746 | 2.613007 | 4.49 | 0.0976177 | 2575.684 | 0.06 | 0.8254539 | 0 | 0.00 |
| 0.2829051 | 2.631169 | 3.60 | 0.3323808 | 2644.326 | 0.48 | 0.8145827 | 0 | 0.00 |
| 0.2824954 | 2.604909 | 3.55 | 0.4819347 | 2663.270 | 0.39 | 0.8139144 | 0 | 0.00 |
| 0.2829194 | 2.580974 | 3.69 | 0.2715633 | 2635.606 | 0.57 | 0.8195032 | 0 | 0.00 |
| 0.2828507 | 2.585506 | 3.62 | 0.2271057 | 2669.843 | 0.55 | 0.8176698 | 0 | 0.00 |
| 0.2824323 | 2.548664 | 3.61 | 0.5455844 | 2661.810 | 0.64 | 0.8176698 | 0 | 0.26 |
trans_Glass_df %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(fill = '#4575b4') +
theme(panel.background = element_blank(), legend.position="top")

I'm not seeing much improvement in the histograms after the Box-Cox transformation, so let's try something else. Another post suggests three ways of handling the zeros:
- Add a constant value (c) to each value of the variable, then take the log transformation
- Impute zero values with the mean
- Take the square root instead of the log for the transformation2
As noted before, the zero values carry predictive information and do not seem to be data collection errors, so imputing them with the mean would not be a good choice. Let's try the square root transformation first.
# Square Root transformation
Glass_SqRt <- data.frame(apply(Glass[,1:9], 2, sqrt))
Glass_SqRt %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(fill = '#4575b4') +
theme(panel.background = element_blank(), legend.position="top")

The square root transformation seems to have done a little better at taming our distributions; however, some are still highly skewed.
Let’s try one more of the suggestions by using \(log(x+1)\) to transform our data.
# log(x+1) transformation
Glass_log <- data.frame(apply(Glass[,1:9]+1, 2, log))
Glass_log %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(fill = '#4575b4') +
theme(panel.background = element_blank(), legend.position="top")

Maybe a little better, but still not great!
kable(nearZeroVar(Glass, names = TRUE, saveMetrics=TRUE)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| | freqRatio | percentUnique | zeroVar | nzv |
|---|---|---|---|---|
| RI | 1.000000 | 83.177570 | FALSE | FALSE |
| Na | 1.000000 | 66.355140 | FALSE | FALSE |
| Mg | 5.250000 | 43.925234 | FALSE | FALSE |
| Al | 1.333333 | 55.140187 | FALSE | FALSE |
| Si | 1.000000 | 62.149533 | FALSE | FALSE |
| K | 2.500000 | 30.373832 | FALSE | FALSE |
| Ca | 1.000000 | 66.822430 | FALSE | FALSE |
| Ba | 88.000000 | 15.887851 | FALSE | FALSE |
| Fe | 20.571429 | 14.953271 | FALSE | FALSE |
| Type | 1.085714 | 2.803738 | FALSE | FALSE |
The caret function nearZeroVar does not indicate that the variables with high frequencies of zeroes should be removed, so another option is the alternative Box-Cox transformation that Hyndman suggested in his blog post.3
The data below have been transformed using the log1p function Hyndman referenced (log1p(x) computes log(1 + x), so this is the same transformation as above, evaluated more accurately for values near zero).
Alt_BoxCox_Glass <- log1p(Glass[,1:9])
kable(head(Alt_BoxCox_Glass)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| RI | Na | Mg | Al | Si | K | Ca | Ba | Fe |
|---|---|---|---|---|---|---|---|---|
| 0.9246596 | 2.683758 | 1.702928 | 0.7419373 | 4.287441 | 0.0582689 | 2.277267 | 0 | 0.0000000 |
| 0.9233100 | 2.700690 | 1.526056 | 0.8586616 | 4.300410 | 0.3920421 | 2.178155 | 0 | 0.0000000 |
| 0.9227419 | 2.676216 | 1.515127 | 0.9321641 | 4.303930 | 0.3293037 | 2.172476 | 0 | 0.0000000 |
| 0.9233299 | 2.653946 | 1.545433 | 0.8285518 | 4.298781 | 0.4510756 | 2.221375 | 0 | 0.0000000 |
| 0.9232346 | 2.658159 | 1.530395 | 0.8064759 | 4.305146 | 0.4382549 | 2.204972 | 0 | 0.0000000 |
| 0.9226544 | 2.623944 | 1.528228 | 0.9631743 | 4.303660 | 0.4946962 | 2.204972 | 0 | 0.2311117 |
Alt_BoxCox_Glass %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(fill = '#4575b4') +
theme(panel.background = element_blank(), legend.position="top")

I think that in this case it would be necessary to know whether these are real zeroes or 'censored' data, where the zeroes may actually be trace amounts below the detection limit. For this situation Kuhn and Johnson state:
when a sample has a value below the limit of detection, the actual limit can be used in place of the real value. For this situation, it is also common to use a random number between zero and the limit of detection.4
Imputing a non-zero value might improve our transformation options by allowing a regular Box-Cox transformation on all the predictors, but I am not convinced it would produce much more normal-looking distributions than the alternative Box-Cox transformation.
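A minimal sketch of that idea, assuming a purely hypothetical detection limit of 0.01 (weight percent) for the affected elements:
set.seed(42)
detection_limit <- 0.01  # hypothetical; a real value would come from the assay documentation
impute_censored <- function(x, limit = detection_limit) {
  zeros <- x == 0
  x[zeros] <- runif(sum(zeros), min = 0, max = limit)  # random draw below the limit
  x
}
Glass_pos <- Glass
Glass_pos[, c("Mg", "K", "Ba", "Fe")] <-
  lapply(Glass_pos[, c("Mg", "K", "Ba", "Fe")], impute_censored)
# all predictors are now strictly positive, so BoxCoxTrans can estimate a lambda for each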
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
# library(mlbench)
data(Soybean)
## See ?Soybean for details
[,1] Class the 19 classes
[,2] date apr(0),may(1),june(2),july(3),aug(4),sept(5),oct(6).
[,3] plant.stand normal(0),lt-normal(1).
[,4] precip lt-norm(0),norm(1),gt-norm(2).
[,5] temp lt-norm(0),norm(1),gt-norm(2).
[,6] hail yes(0),no(1).
[,7] crop.hist dif-lst-yr(0),s-l-y(1),s-l-2-y(2), s-l-7-y(3).
[,8] area.dam scatter(0),low-area(1),upper-ar(2),whole-field(3).
[,9] sever minor(0),pot-severe(1),severe(2).
[,10] seed.tmt none(0),fungicide(1),other(2).
[,11] germ 90-100%(0),80-89%(1),lt-80%(2).
[,12] plant.growth norm(0),abnorm(1).
[,13] leaves norm(0),abnorm(1).
[,14] leaf.halo absent(0),yellow-halos(1),no-yellow-halos(2).
[,15] leaf.marg w-s-marg(0),no-w-s-marg(1),dna(2).
[,16] leaf.size lt-1/8(0),gt-1/8(1),dna(2).
[,17] leaf.shread absent(0),present(1).
[,18] leaf.malf absent(0),present(1).
[,19] leaf.mild absent(0),upper-surf(1),lower-surf(2).
[,20] stem norm(0),abnorm(1).
[,21] lodging yes(0),no(1).
[,22] stem.cankers absent(0),below-soil(1),above-s(2),ab-sec-nde(3).
[,23] canker.lesion dna(0),brown(1),dk-brown-blk(2),tan(3).
[,24] fruiting.bodies absent(0),present(1).
[,25] ext.decay absent(0),firm-and-dry(1),watery(2).
[,26] mycelium absent(0),present(1).
[,27] int.discolor none(0),brown(1),black(2).
[,28] sclerotia absent(0),present(1).
[,29] fruit.pods norm(0),diseased(1),few-present(2),dna(3).
[,30] fruit.spots absent(0),col(1),br-w/blk-speck(2),distort(3),dna(4).
[,31] seed norm(0),abnorm(1).
[,32] mold.growth absent(0),present(1).
[,33] seed.discolor absent(0),present(1).
[,34] seed.size norm(0),lt-norm(1).
[,35] shriveling absent(0),present(1).
[,36] roots norm(0),rotted(1),galls-cysts(2).
# reorder columns by number of levels in factors
Soybeans <- Soybean[, c(1,2,7,8,22,23,29,30,4,5,9:11,14:16,19,25,27,36,
3,6,12,13,17,18,20,21,24,26,28,31:35)]
kable(summary(data.frame(Soybeans[,1:2]))) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
kable(summary(Soybeans[,3:8])) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
kable(summary(Soybeans[,9:14])) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
kable(summary(Soybeans[,15:20])) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
kable(summary(Soybeans[,21:28])) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
kable(summary(Soybeans[,29:36])) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

In Applied Predictive Modeling, Kuhn & Johnson describe 'degenerate distributions' as those with zero or near-zero variance.5 The output of the nearZeroVar function shows that three variables in the Soybean data set, leaf.mild, mycelium, and sclerotia, are 'degenerate' in this way.
near_zero_soy <- nearZeroVar(Soybean, names = TRUE, saveMetrics=TRUE) %>%
rownames_to_column('variable') %>%
filter(zeroVar == TRUE | nzv == TRUE)
kable(near_zero_soy) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| variable | freqRatio | percentUnique | zeroVar | nzv |
|---|---|---|---|---|
| leaf.mild | 26.75 | 0.4392387 | FALSE | TRUE |
| mycelium | 106.50 | 0.2928258 | FALSE | TRUE |
| sclerotia | 31.25 | 0.2928258 | FALSE | TRUE |
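The freqRatio column is the frequency of the most common level divided by that of the second most common; for mycelium this is easy to verify:
table(Soybean$mycelium)  # the ratio of the two counts reproduces the 106.5 freqRatio above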
We can see this near-zero variance in the plots below, as well as some interesting associations between NA values and certain target classes, which we will explore further in the next section.
Soybean %>%
gather(-Class, key = "var", value = "val") %>%
ggplot(aes(x = val, fill=Class)) +
geom_bar(alpha=1) +
facet_wrap(~ var, scales = "free") +
scale_fill_manual("target",
values = c('#f1b6da','#c51b7d','#b2df8a','#33a02c','#fb9a99',
'#e31a1c','#fdbf6f','#ff7f00','#cab2d6','#6a3d9a',
'#ffeda0','#dfc27d','#80cdc1','#01665e','#a6cee3',
'#1f78b4','#bf812d','#8c510a','#bababa')) +
xlab("") +
ylab("") +
theme(panel.background = element_blank(), legend.position="top")

Because the missingness in this data set is itself predictive, unless domain knowledge suggested some reason not to, I would choose to create an additional factor level for each of those variables to indicate missingness.
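As a rough check on where the missingness concentrates, the total NA count per class can be tabulated; a quick base-R sketch:
# total NA count across all predictors, by class
na_by_class <- sapply(split(Soybean[, -1], Soybean$Class),
                      function(d) sum(is.na(d)))
sort(na_by_class, decreasing = TRUE)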
# https://gist.github.com/riinuots/e517c36b1feb480df981721a00e0e24a
Soybean_Imputed <- Soybean %>%
mutate_if(is.factor, fct_explicit_na)
Soybean_Imputed %>%
gather(-Class, key = "var", value = "val") %>%
ggplot(aes(x = val, fill=Class)) +
geom_bar(alpha=1) +
facet_wrap(~ var, scales = "free") +
scale_fill_manual("target",
values = c('#f1b6da','#c51b7d','#b2df8a','#33a02c','#fb9a99',
'#e31a1c','#fdbf6f','#ff7f00','#cab2d6','#6a3d9a',
'#ffeda0','#dfc27d','#80cdc1','#01665e','#a6cee3',
'#1f78b4','#bf812d','#8c510a','#bababa')) +
xlab("") +
ylab("") +
theme(panel.background = element_blank(), legend.position="top")