The Glass dataset consists of 214 observations and 10 variables.
Glass Identification Database
Description
A data frame with 214 observation containing examples of the chemical analysis of 7 different types of glass. The problem is to forecast the type of class on basis of the chemical analysis. The study of classification of types of glass was motivated by criminological investigation. At the scene of the crime, the glass left can be used as evidence (if it is correctly identified!).
Usage
data(Glass)
Format
A data frame with 214 observations on 10 variables:
[,1] RI refractive index
[,2] Na Sodium
[,3] Mg Magnesium
[,4] Al Aluminum
[,5] Si Silicon
[,6] K Potassium
[,7] Ca Calcium
[,8] Ba Barium
[,9] Fe Iron
[,10] Type Type of glass (class attribute)
Source
Creator: B. German, Central Research Establishment, Home Office Forensic Science Service, Aldermaston, Reading, Berkshire RG7 4PN
Donor: Vina Spiehler, Ph.D., DABFT, Diagnostic Products Corporation
These data have been taken from the UCI Repository Of Machine Learning Databases at
ftp://ftp.ics.uci.edu/pub/machine-learning-databases
http://www.ics.uci.edu/~mlearn/MLRepository.html
and were converted to R format by Friedrich Leisch.
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Answer
First, separate the nine predictor variables (RI, Na, Mg, Al, Si, K, Ca, Ba, and Fe) from the dependent variable, Type:
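A minimal sketch of that split is shown here. It uses a small stand-in data frame, the hypothetical `glass_head`, built from the first five rows of the `str()` printout above (values as rounded there); with the real data, `Glass` would take its place.

```r
# Stand-in for the first rows of Glass (values as rounded in the str() printout).
glass_head <- data.frame(
  RI = c(1.52, 1.52, 1.52, 1.52, 1.52),
  Na = c(13.6, 13.9, 13.5, 13.2, 13.3),
  Mg = c(4.49, 3.60, 3.55, 3.69, 3.62),
  Al = c(1.10, 1.36, 1.54, 1.29, 1.24),
  Si = c(71.8, 72.7, 73.0, 72.6, 73.1),
  K  = c(0.06, 0.48, 0.39, 0.57, 0.55),
  Ca = c(8.75, 7.83, 7.78, 8.22, 8.07),
  Ba = c(0, 0, 0, 0, 0),
  Fe = c(0, 0, 0, 0, 0),
  Type = factor(c("1", "1", "1", "1", "1"))
)

# Keep every column except the outcome as the predictor set.
predictors <- glass_head[, setdiff(names(glass_head), "Type")]
outcome <- glass_head$Type

names(predictors)
```

Selecting columns by name rather than by position keeps the split correct even if the column order changes.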
The table below shows basic statistics for each variable: mean, standard deviation, median, min, max, skew, and, in the pct_missing column, the percentage of missing values.
| STATS | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | pct_missing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RI | 1 | 214 | 1.5183654 | 0.0030369 | 1.51768 | 1.5180119 | 0.0018755 | 1.51115 | 1.53393 | 0.02278 | 1.6027151 | 4.7167266 | 0.0002076 | 0 |
| Na | 2 | 214 | 13.4078505 | 0.8166036 | 13.30000 | 13.3768023 | 0.6449310 | 10.73000 | 17.38000 | 6.65000 | 0.4478343 | 2.8979666 | 0.0558219 | 0 |
| Mg | 3 | 214 | 2.6845327 | 1.4424078 | 3.48000 | 2.8655233 | 0.3039330 | 0.00000 | 4.49000 | 4.49000 | -1.1364523 | -0.4526762 | 0.0986010 | 0 |
| Al | 4 | 214 | 1.4449065 | 0.4992696 | 1.36000 | 1.4122093 | 0.3113460 | 0.29000 | 3.50000 | 3.21000 | 0.8946104 | 1.9383534 | 0.0341294 | 0 |
| Si | 5 | 214 | 72.6509346 | 0.7745458 | 72.79000 | 72.7073256 | 0.5708010 | 69.81000 | 75.41000 | 5.60000 | -0.7202392 | 2.8163627 | 0.0529469 | 0 |
| K | 6 | 214 | 0.4970561 | 0.6521918 | 0.55500 | 0.4318023 | 0.1704990 | 0.00000 | 6.21000 | 6.21000 | 6.4600889 | 52.8665268 | 0.0445829 | 0 |
| Ca | 7 | 214 | 8.9569626 | 1.4231535 | 8.60000 | 8.7421512 | 0.6597570 | 5.43000 | 16.19000 | 10.76000 | 2.0184463 | 6.4104000 | 0.0972848 | 0 |
| Ba | 8 | 214 | 0.1750467 | 0.4972193 | 0.00000 | 0.0337791 | 0.0000000 | 0.00000 | 3.15000 | 3.15000 | 3.3686800 | 12.0801412 | 0.0339892 | 0 |
| Fe | 9 | 214 | 0.0570093 | 0.0974387 | 0.00000 | 0.0358140 | 0.0000000 | 0.00000 | 0.51000 | 0.51000 | 1.7298107 | 2.5203615 | 0.0066608 | 0 |
The colSums function confirms that there are no missing values in the dataset.
## RI Na Mg Al Si K Ca Ba Fe
## 0 0 0 0 0 0 0 0 0
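As a small illustration of how such a count can be produced, the sketch below applies `colSums()` to `is.na()` on a hypothetical toy data frame that does contain missing values, so the counts are not all zero.

```r
# Hypothetical toy data frame with two missing entries, for illustration only.
toy <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, "x", "y"),
  c = c(TRUE, FALSE, TRUE)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column.
missing_per_column <- colSums(is.na(toy))
missing_per_column
```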
The correlation plot shows that the strongest correlation among the predictors is between RI (refractive index) and Ca (calcium).
library(ggplot2)      # provides theme_bw
library(ggcorrplot)

# Pairwise correlations between the predictors, rounded to one decimal place
corr <- round(cor(predictors), 1)
ggcorrplot(corr,
           type="lower",
           lab=TRUE,
           lab_size=3,
           method="circle",
           colors=c("tomato2", "white", "springgreen3"),
           title="Correlation of Predictor variables in the Glass Data Set",
           ggtheme=theme_bw)

In Figure 2 below, the lower-left triangle shows a scatter plot for each pair of predictors, with a regression line through each plot. The diagonal shows the histogram of each predictor; K, Ba, and Fe are visibly skewed to the right. The upper triangle shows the correlation between each pair of predictors. Again, RI (refractive index) and Ca (calcium) stand out as the most strongly correlated pair.
Do there appear to be any outliers in the data? Are any predictors skewed?
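Yes. Figures 2 (the correlation plots) and 3 (the boxplots) show that Ba, Fe, and K have several outliers. These three variables are also strongly right-skewed (skew of 3.37, 1.73, and 6.46 in the summary table). The table also shows right skew for Ca (2.02) and RI (1.60), and left skew for Mg (-1.14).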
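The visual impression can be backed by numbers. The sketch below is a base R illustration, not the code used for the figures: `skewness()` and `count_outliers()` are hypothetical helpers, and the two input vectors are made up. The skewness is the simple moment-based estimate, so it may differ slightly from the skew column in the summary table depending on the estimator used there.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3 / 2)
}

# Count points lying more than 1.5 * IQR beyond the quartiles,
# the rule that boxplot whiskers are based on.
count_outliers <- function(x) {
  x <- x[!is.na(x)]
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

# Hypothetical vectors: mostly zeros with a few large values, and a symmetric one.
right_skewed <- c(0, 0, 0, 0, 0, 0, 0, 0.1, 0.2, 3)
symmetric <- c(-2, -1, 0, 1, 2)

skewness(right_skewed)
count_outliers(right_skewed)
skewness(symmetric)
```

Applied to a data frame of numeric predictors, `sapply(predictors, skewness)` and `sapply(predictors, count_outliers)` would give one value per column.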
## No id variables; using all as measure variables
# datasub_1 holds the predictors in long format, with columns variable and value
suppressWarnings(ggplot(datasub_1, aes(x = "value", y = value)) +
  geom_boxplot(fill = 'lightblue') + facet_wrap(~variable, scales = 'free'))

Are there any relevant transformations of one or more predictors that might improve the classification model?
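Several transformations could be applied:
Centering and scaling. The most straightforward and common data transformation is to center and scale the predictor variables. To center a predictor, its average value is subtracted from all of its values, so the centered predictor has a zero mean. To scale it, each value is then divided by the predictor's standard deviation, so the scaled predictor has a standard deviation of one.
Box-Cox transformation. This can reduce skewness but requires strictly positive values. Mg, K, Ba, and Fe all have a minimum of 0 in the summary table, so they would need an offset or a different transformation.
Log and square root transformations. These can also reduce right skew. The log likewise requires positive values.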
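The sketch below illustrates these transformations in base R on made-up vectors. The `box_cox()` helper is hypothetical and applies the Box-Cox family for a fixed, hand-picked lambda; in practice lambda is usually estimated from the data, for example with `caret::preProcess()` and `method = "BoxCox"`.

```r
# Hypothetical strictly positive, right-skewed predictor.
x <- c(0.5, 0.8, 1.0, 1.3, 2.0, 4.5, 9.0)

# Center and scale: subtract the mean, then divide by the standard deviation.
x_scaled <- as.numeric(scale(x))

# Box-Cox family for a fixed lambda; lambda = 0 is defined as the natural log.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

x_log  <- box_cox(x, 0)     # log transformation
x_sqrt <- box_cox(x, 0.5)   # square-root member of the family

# log1p() computes log(1 + x), which tolerates exact zeros.
with_zeros <- c(0, 0, 0.1, 0.5, 3)
x_log1p <- log1p(with_zeros)

c(mean = mean(x_scaled), sd = sd(x_scaled))
```

The `log1p()` line shows one possible offset for predictors that contain exact zeros. Whether an offset is appropriate for this model is a judgment call that the sketch does not settle.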
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Soybean Database
Description
There are 19 classes, only the first 15 of which have been used in prior work. The folklore seems to be that the last four classes are unjustified by the data since they have so few examples. There are 35 categorical attributes, some nominal and some ordered. The value “dna” means does not apply. The values for attributes are encoded numerically, with the first value encoded as “0,” the second as “1,” and so forth.
Usage
data(Soybean)
Format
A data frame with 683 observations on 36 variables. There are 35 categorical
attributes, all numerical and a nominal denoting the class.
[,1] Class the 19 classes
[,2] date apr(0),may(1),june(2),july(3),aug(4),sept(5),oct(6).
[,3] plant.stand normal(0),lt-normal(1).
[,4] precip lt-norm(0),norm(1),gt-norm(2).
[,5] temp lt-norm(0),norm(1),gt-norm(2).
[,6] hail yes(0),no(1).
[,7] crop.hist dif-lst-yr(0),s-l-y(1),s-l-2-y(2), s-l-7-y(3).
[,8] area.dam scatter(0),low-area(1),upper-ar(2),whole-field(3).
[,9] sever minor(0),pot-severe(1),severe(2).
[,10] seed.tmt none(0),fungicide(1),other(2).
[,11] germ 90-100%(0),80-89%(1),lt-80%(2).
[,12] plant.growth norm(0),abnorm(1).
[,13] leaves norm(0),abnorm(1).
[,14] leaf.halo absent(0),yellow-halos(1),no-yellow-halos(2).
[,15] leaf.marg w-s-marg(0),no-w-s-marg(1),dna(2).
[,16] leaf.size lt-1/8(0),gt-1/8(1),dna(2).
[,17] leaf.shread absent(0),present(1).
[,18] leaf.malf absent(0),present(1).
[,19] leaf.mild absent(0),upper-surf(1),lower-surf(2).
[,20] stem norm(0),abnorm(1).
[,21] lodging yes(0),no(1).
[,22] stem.cankers absent(0),below-soil(1),above-s(2),ab-sec-nde(3).
[,23] canker.lesion dna(0),brown(1),dk-brown-blk(2),tan(3).
[,24] fruiting.bodies absent(0),present(1).
[,25] ext.decay absent(0),firm-and-dry(1),watery(2).
[,26] mycelium absent(0),present(1).
[,27] int.discolor none(0),brown(1),black(2).
[,28] sclerotia absent(0),present(1).
[,29] fruit.pods norm(0),diseased(1),few-present(2),dna(3).
[,30] fruit.spots absent(0),col(1),br-w/blk-speck(2),distort(3),dna(4).
[,31] seed norm(0),abnorm(1).
[,32] mold.growth absent(0),present(1).
[,33] seed.discolor absent(0),present(1).
[,34] seed.size norm(0),lt-norm(1).
[,35] shriveling absent(0),present(1).
[,36] roots norm(0),rotted(1),galls-cysts(2).
Source
Source: R.S. Michalski and R.L. Chilausky “Learning by Being Told and Learning from Examples: An Experimental Comparison of the Two Methods of Knowledge Acquisition in the Context of Developing an Expert System for Soybean Disease Diagnosis”, International Journal of Policy Analysis and Information Systems, Vol. 4, No. 2, 1980.
Donor: Ming Tan & Jeff Schlimmer (Jeff.Schlimmer%cs.cmu.edu)
These data have been taken from the UCI Repository Of Machine Learning Databases at
ftp://ftp.ics.uci.edu/pub/machine-learning-databases
http://www.ics.uci.edu/~mlearn/MLRepository.html
and were converted to R format by Evgenia Dimitriadou.
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
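library(psych)   # provides describe()

# Summary statistics for every column; factors are summarized by their integer codes
metastats <- data.frame(describe(Soybean))
metastats <- tibble::rownames_to_column(metastats, "STATS")
# Share of missing values per column (n counts the non-missing rows)
metastats["pct_missing"] <- 1 - round(metastats["n"] / nrow(Soybean), 3)
metastats
## STATS vars n mean sd median trimmed mad min max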
## 1 Class* 1 683 9.295754 5.51115341 8 9.179159 7.4130 1 19
## 2 date* 2 682 4.554252 1.69411726 5 4.615385 1.4826 1 7
## 3 plant.stand* 3 647 1.452859 0.49815792 1 1.441233 0.0000 1 2
## 4 precip* 4 645 2.596899 0.68614709 3 2.744681 0.0000 1 3
## 5 temp* 5 653 2.182236 0.62821435 2 2.227533 0.0000 1 3
## 6 hail* 6 562 1.225979 0.41859776 1 1.157778 0.0000 1 2
## 7 crop.hist* 7 667 2.884558 0.97576561 3 2.977570 1.4826 1 4
## 8 area.dam* 8 682 2.580645 1.07437412 2 2.600733 1.4826 1 4
## 9 sever* 9 562 1.733096 0.59702831 2 1.691111 0.0000 1 3
## 10 seed.tmt* 10 562 1.519573 0.61224099 1 1.446667 0.0000 1 3
## 11 germ* 11 571 2.049037 0.79098758 2 2.061269 1.4826 1 3
## 12 plant.growth* 12 667 1.338831 0.47366739 1 1.299065 0.0000 1 2
## 13 leaves* 13 683 1.887262 0.31650395 2 1.983547 0.0000 1 2
## 14 leaf.halo* 14 599 2.202003 0.94899841 3 2.251559 0.0000 1 3
## 15 leaf.marg* 15 599 1.772955 0.95651425 1 1.717256 0.0000 1 3
## 16 leaf.size* 16 599 2.283806 0.61169336 2 2.336798 0.0000 1 3
## 17 leaf.shread* 17 583 1.164666 0.37119689 1 1.081370 0.0000 1 2
## 18 leaf.malf* 18 599 1.075125 0.26381357 1 1.000000 0.0000 1 2
## 19 leaf.mild* 19 575 1.104348 0.40411457 1 1.000000 0.0000 1 3
## 20 stem* 20 667 1.556222 0.49720190 2 1.570093 0.0000 1 2
## 21 lodging* 21 562 1.074733 0.26319445 1 1.000000 0.0000 1 2
## 22 stem.cankers* 22 645 2.060465 1.35169658 1 1.951644 0.0000 1 4
## 23 canker.lesion* 23 645 1.979845 1.08400138 2 1.851064 1.4826 1 4
## 24 fruiting.bodies* 24 577 1.180243 0.38472295 1 1.101512 0.0000 1 2
## 25 ext.decay* 25 645 1.249612 0.47746159 1 1.162476 0.0000 1 3
## 26 mycelium* 26 645 1.009302 0.09607342 1 1.000000 0.0000 1 2
## 27 int.discolor* 27 645 1.130233 0.41899848 1 1.000000 0.0000 1 3
## 28 sclerotia* 28 645 1.031008 0.17347313 1 1.000000 0.0000 1 2
## 29 fruit.pods* 29 599 1.504174 0.88251272 1 1.282744 0.0000 1 4
## 30 fruit.spots* 30 577 1.847487 1.17006859 1 1.686825 0.0000 1 4
## 31 seed* 31 591 1.194585 0.39621658 1 1.118393 0.0000 1 2
## 32 mold.growth* 32 591 1.113367 0.31730966 1 1.016913 0.0000 1 2
## 33 seed.discolor* 33 577 1.110919 0.31430372 1 1.015119 0.0000 1 2
## 34 seed.size* 34 591 1.099831 0.30002820 1 1.000000 0.0000 1 2
## 35 shriveling* 35 577 1.065858 0.24824873 1 1.000000 0.0000 1 2
## 36 roots* 36 652 1.177914 0.43882605 1 1.068966 0.0000 1 3
## range skew kurtosis se pct_missing
## 1 18 0.11302119 -1.3791026 0.210878424 0.000
## 2 6 -0.30397011 -0.9045074 0.064871103 0.001
## 3 1 0.18896734 -1.9673249 0.019584609 0.053
## 4 2 -1.41630633 0.5502093 0.027017015 0.056
## 5 2 -0.15829545 -0.5843151 0.024583927 0.044
## 6 1 1.30690508 -0.2925101 0.017657481 0.177
## 7 3 -0.39757148 -0.9187916 0.037781795 0.023
## 8 3 0.01799005 -1.2864923 0.041139911 0.001
## 9 2 0.17391297 -0.5647524 0.025184119 0.177
## 10 2 0.73966698 -0.4396667 0.025825828 0.177
## 11 2 -0.08680952 -1.3998550 0.033101800 0.164
## 12 1 0.67949699 -1.5405868 0.018340474 0.023
## 13 1 -2.44354029 3.9767180 0.012110687 0.000
## 14 2 -0.41080342 -1.7648507 0.038775024 0.123
## 15 2 0.46484621 -1.7465620 0.039082113 0.123
## 16 2 -0.24946067 -0.6293671 0.024993113 0.123
## 17 1 1.80367508 1.2554060 0.015373404 0.146
## 18 1 3.21564565 8.3543324 0.010779130 0.123
## 19 2 3.95290557 14.6848261 0.016852743 0.158
## 20 1 -0.22581409 -1.9519277 0.019251734 0.023
## 21 1 3.22582942 8.4209689 0.011102188 0.177
## 22 3 0.60983130 -1.5090610 0.053223001 0.056
## 23 3 0.51457211 -1.2379837 0.042682512 0.056
## 24 1 1.65939253 0.7549009 0.016016226 0.155
## 25 2 1.69543723 1.9750241 0.018800032 0.056
## 26 1 10.19921824 102.1824824 0.003782887 0.056
## 27 2 3.33861193 10.5712527 0.016498049 0.056
## 28 1 5.39870500 27.1881751 0.006830498 0.056
## 29 3 1.83817833 2.4130176 0.036058492 0.123
## 30 3 0.94650965 -0.7574031 0.048710593 0.155
## 31 1 1.53904600 0.3692961 0.016298172 0.135
## 32 1 2.43281985 3.9252628 0.013052375 0.135
## 33 1 2.47154019 4.1156528 0.013084635 0.155
## 34 1 2.66303034 5.1003693 0.012341511 0.135
## 35 1 3.49157641 10.2088077 0.010334730 0.155
## 36 2 2.45781443 5.4857676 0.017185754 0.045
According to the authors:
"Some models can be crippled by predictors with degenerate distributions. In these cases, there can be a significant improvement in model performance and/or stability without the problematic variables. Consider a predictor variable that has a single unique value; we refer to this type of data as a zero variance predictor.
- The fraction of unique values over the sample size is low (say 10 %).
- The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20)."
The nearZeroVar function from the caret package diagnoses predictors that have one unique value (i.e., zero variance predictors) or that have both of the following characteristics: very few unique values relative to the number of samples, and a large ratio of the frequency of the most common value to the frequency of the second most common value.
Here, Soybean2 is created by removing the predictors flagged by nearZeroVar. The set difference of the column names identifies the three near-zero-variance columns: “leaf.mild”, “mycelium”, and “sclerotia”. A summary of these columns shows that 0 is the predominant value in each.
Soybean_cols <- colnames(Soybean)
Soybean_cols2 <- colnames(Soybean2)
setdiff(Soybean_cols, Soybean_cols2)
## [1] "leaf.mild" "mycelium" "sclerotia"
## leaf.mild
## 0 :535
## 1 : 20
## 2 : 20
## NA's:108
## mycelium
## 0 :639
## 1 : 6
## NA's: 38
## sclerotia
## 0 :625
## 1 : 20
## NA's: 38
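To make the two-part rule of thumb concrete, here is a base R sketch. The `near_zero_var()` helper is hypothetical and is not caret's implementation; its default cutoffs of 19 and 10 are assumptions chosen to match the "around 20" and "10 %" figures in the quote. The `mycelium` vector is rebuilt from the counts in the summary above.

```r
# Hypothetical helper applying the two-part rule of thumb.
# The default cutoffs are assumptions, not values taken from the analysis.
near_zero_var <- function(x, freq_cut = 19, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)    # table() drops NA by default
  if (length(counts) <= 1) return(TRUE)          # a single value: zero variance
  freq_ratio <- as.numeric(counts[1] / counts[2])
  pct_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && pct_unique < unique_cut
}

# Rebuild mycelium from the summary counts: 639 zeros, 6 ones, 38 NA.
mycelium <- factor(c(rep("0", 639), rep("1", 6), rep(NA, 38)))
near_zero_var(mycelium)

# A balanced two-level factor, for contrast.
balanced <- factor(rep(c("0", "1"), each = 50))
near_zero_var(balanced)
```

For `mycelium` the frequency ratio is 639 / 6 = 106.5, far above the cutoff, so it is flagged. The balanced factor has a ratio of 1 and is not.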
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
From the text:
“In our experience, missing values are more often related to predictor variables than the sample. Because of this, amount of missing data may be concentrated in a subset of predictors rather than occurring randomly across all the predictors. In some cases, the percentage of missing data is substantial enough to remove this predictor from subsequent modeling activities.”
The summary statistics in metastats are copied into the metrics tibble below and sorted by the percentage of missing values. “hail”, “sever”, “seed.tmt”, and “lodging” have the highest percentage of missing data, approximately 18% each. Each one is a factor variable.
There is no obvious explanation for why these values are missing, and nothing here clearly shows that the missingness is related to the outcome.
library(dplyr)

metrics <- as_tibble(metastats)
metrics %>%
  dplyr::select(STATS, pct_missing) %>%
  arrange(desc(pct_missing))
## # A tibble: 36 x 2
## STATS pct_missing
## <chr> <dbl>
## 1 hail* 0.177
## 2 sever* 0.177
## 3 seed.tmt* 0.177
## 4 lodging* 0.177
## 5 germ* 0.164
## 6 leaf.mild* 0.158
## 7 fruiting.bodies* 0.155
## 8 fruit.spots* 0.155
## 9 seed.discolor* 0.155
## 10 shriveling* 0.155
## # ... with 26 more rows
## [1] diaporthe-stem-canker charcoal-rot
## [3] rhizoctonia-root-rot phytophthora-rot
## [5] brown-stem-rot powdery-mildew
## [7] downy-mildew brown-spot
## [9] bacterial-blight bacterial-pustule
## [11] purple-seed-stain anthracnose
## [13] phyllosticta-leaf-spot alternarialeaf-spot
## [15] frog-eye-leaf-spot diaporthe-pod-&-stem-blight
## [17] cyst-nematode 2-4-d-injury
## [19] herbicide-injury
## 19 Levels: 2-4-d-injury alternarialeaf-spot anthracnose ... rhizoctonia-root-rot
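One way to examine whether missingness is related to the classes is to compute, per class, the share of rows with at least one missing predictor. The sketch below shows the idea on a hypothetical toy data frame; the column names echo the Soybean data, but the values are made up and say nothing about the real pattern.

```r
# Hypothetical toy data: a class label and two predictors with some NAs.
toy <- data.frame(
  Class = factor(c("a", "a", "a", "b", "b", "b")),
  hail  = c("0", "1", "0", NA, NA, NA),
  sever = c("1", NA, "2", NA, NA, "1")
)

# Flag rows that have at least one missing predictor value.
predictor_cols <- setdiff(names(toy), "Class")
row_has_na <- !complete.cases(toy[, predictor_cols])

# Share of incomplete rows within each class.
incomplete_by_class <- tapply(row_has_na, toy$Class, mean)
incomplete_by_class
```

With the real data, `Soybean` would replace `toy`. Classes with a share near 1 would indicate that missing values are concentrated in particular classes.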
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
Imputation may not be the best option here because the affected predictors are categorical, with only a few possible values each. The simplest strategy may be to remove those predictors entirely.
Perhaps the best overall strategy is to build the model both with and without these predictors and keep whichever performs better.
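Both options can be sketched in base R on a hypothetical toy data frame. The 15% cutoff and the `impute_mode()` helper are assumptions for illustration only, and most-frequent-level imputation is just one simple choice among several.

```r
# Hypothetical toy data with categorical predictors and missing values.
toy <- data.frame(
  p1 = factor(c("0", "0", "1", NA, "0", "0", "0", "1", "0", "0")),
  p2 = factor(c("1", NA, NA, NA, "0", "1", NA, "1", "0", "1"))
)

# Option 1: drop predictors whose share of missing values exceeds a cutoff.
# The 15% cutoff is an arbitrary assumption.
pct_missing <- colMeans(is.na(toy))
reduced <- toy[, pct_missing <= 0.15, drop = FALSE]

# Option 2: impute each factor with its most frequent observed level.
impute_mode <- function(x) {
  mode_level <- names(which.max(table(x)))
  x[is.na(x)] <- mode_level
  x
}
imputed <- as.data.frame(lapply(toy, impute_mode))

names(reduced)
colSums(is.na(imputed))
```

Fitting the same model on `reduced` and on `imputed`, then comparing resampled performance, would be one way to carry out the with-and-without comparison suggested above.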