Assignment #4
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
'data.frame': 214 obs. of 10 variables:
$ RI : num 1.52 1.52 1.52 1.52 1.52 ...
$ Na : num 13.6 13.9 13.5 13.2 13.3 ...
$ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
$ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
$ Si : num 71.8 72.7 73 72.6 73.1 ...
$ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
$ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
$ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
$ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
$ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
plot_histogram(Glass)
ggpairs(Glass[,-10])
The visualizations show that several variables are skewed and that Ca and RI are strongly correlated. Other pairs of variables show moderate correlation.
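To put numbers on statements like "strongly correlated", the correlation matrix can be scanned for pairs above a cutoff. The following is a minimal sketch on simulated data, not the Glass data; the helper name `high_cor_pairs`, the toy columns, and the 0.5 cutoff are assumptions made for illustration.

```r
# Sketch: list predictor pairs whose absolute Pearson correlation exceeds a cutoff.
# `toy` is simulated data, NOT the Glass data; all names here are illustrative.
set.seed(1)
n <- 200
a <- rnorm(n)
toy <- data.frame(
  a = a,
  b = 0.9 * a + rnorm(n, sd = 0.3),  # built to correlate strongly with a
  c = rnorm(n)                       # independent noise
)

high_cor_pairs <- function(df, cutoff = 0.5) {
  cm <- cor(df, use = "pairwise.complete.obs")
  # Keep only the upper triangle so each pair is reported once
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

pairs_found <- high_cor_pairs(toy, cutoff = 0.5)
print(pairs_found)
```

On the real data the same helper could be called as `high_cor_pairs(Glass[, -10])`, which drops the Type column first.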
ggplot(melt(Glass), aes(x = variable, y = value)) +
facet_wrap(~ variable, scales = "free", ncol = 3) +
geom_boxplot()
skewValues <- apply(Glass[,-10], 2, skewness)
skewValues
RI Na Mg Al Si K
1.6140150 0.4509917 -1.1444648 0.9009179 -0.7253173 6.5056358
Ca Ba Fe
2.0326774 3.3924309 1.7420068
The predictors are skewed and contain outliers: the boxplots show the outliers, and skewValues gives the magnitude of the skew for each predictor. K (6.51) and Ba (3.39) are heavily right-skewed, while Mg (-1.14) and Si (-0.73) are left-skewed.
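For reference, the skewness statistic can be sketched in base R. The helper below is a hypothetical moment-based version (third central moment divided by the cubed population standard deviation); the `skewness` function called above presumably comes from a package and may apply a slightly different small-sample correction, so values can differ a little.

```r
# Moment-based sample skewness: mean((x - mean)^3) / (mean((x - mean)^2))^(3/2).
# Hypothetical helper for illustration; not the package function used above.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

symmetric  <- c(-2, -1, 0, 1, 2)  # no skew
right_tail <- c(0, 0, 0, 0, 10)   # one large value: positive skew
left_tail  <- -right_tail         # mirrored: negative skew

skew_symmetric <- sample_skewness(symmetric)
skew_right     <- sample_skewness(right_tail)
skew_left      <- sample_skewness(left_tail)
print(c(skew_symmetric, skew_right, skew_left))
```

A positive value indicates a long right tail and a negative value a long left tail, which is how the signs in skewValues are read.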
Several predictors are heavily right-skewed and would benefit from a transformation, while Mg and Si are left-skewed. At least two predictors (Ca and RI, noted above) have a correlation above 0.5. Let’s apply a Box-Cox transformation, center and scale the data, and then perform PCA.
transformD <- preProcess(Glass[,-10], method=c("BoxCox", "center", "scale", "pca"))
transformD
Created from 214 samples and 9 variables
Pre-processing:
- Box-Cox transformation (5)
- centered (9)
- ignored (0)
- principal component signal extraction (9)
- scaled (9)
Lambda estimates for Box-Cox transformation:
-2, -0.1, 0.5, 2, -1.1
PCA needed 7 components to capture 95 percent of the variance
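The Box-Cox transformation behind the lambda estimates above is (x^lambda - 1) / lambda for lambda != 0 and log(x) for lambda = 0. It is defined only for strictly positive values, which is presumably why only five of the nine predictors were transformed: predictors that contain zeros, such as Ba and Fe in the listing above, cannot be handled. The sketch below uses a hypothetical helper `box_cox` with a fixed lambda on simulated data; estimating lambda from the data, as preProcess does, is not shown.

```r
# Sketch of the Box-Cox transformation for a given lambda (illustrative helper).
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))  # Box-Cox requires strictly positive data
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Moment-based skewness, used only to compare before and after
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(7)
x <- exp(rnorm(500))  # simulated right-skewed (log-normal) data, not Glass

skew_before <- skew(x)
skew_after  <- skew(box_cox(x, lambda = 0))  # lambda = 0 is the log transform
print(c(before = skew_before, after = skew_after))
```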
trans_Glass <- predict(transformD, Glass)
ggpairs(trans_Glass[,-1])
Re-Check Skew Values
skewValues <- apply(trans_Glass[,-1], 2, skewness)
skewValues
PC1 PC2 PC3 PC4 PC5
0.07082340 1.30105063 2.46147761 -0.04549204 -0.24548695
PC6 PC7
-0.15912510 1.75360432
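The combination of Box-Cox transformation, centering, scaling, and PCA removed the correlation between the variables (principal components are uncorrelated by construction) and reduced the most extreme skewness: the largest value dropped from 6.51 (K) to 2.46 (PC3). PC2, PC3, and PC7 remain noticeably right-skewed, however.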
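Two properties used here, that principal component scores are uncorrelated and that components are kept until a cumulative variance threshold is reached, can be checked with base R's prcomp. This is a sketch on simulated data; the toy columns and the 0.95 threshold (chosen to mirror the "95 percent" in the output above) are assumptions.

```r
# Sketch: PCA scores are uncorrelated, and components can be kept by cumulative variance.
set.seed(42)
n <- 300
z <- rnorm(n)
toy <- data.frame(
  x1 = z + rnorm(n, sd = 0.5),  # x1 and x2 share the signal z, so they correlate
  x2 = z + rnorm(n, sd = 0.5),
  x3 = rnorm(n)
)

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Largest absolute correlation between different component scores (numerically ~0)
score_cor   <- cor(pca$x)
max_offdiag <- max(abs(score_cor[upper.tri(score_cor)]))

# Number of components needed to reach 95% of the total variance
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_keep  <- which(cum_var >= 0.95)[1]

print(max_offdiag)
print(n_keep)
```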
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
library(mlbench)
data(Soybean)
str(Soybean)
'data.frame': 683 obs. of 36 variables:
$ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
$ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
$ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
$ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
$ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
$ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
$ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
$ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
$ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
$ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
$ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
$ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
$ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
$ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
$ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
$ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
$ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
$ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
$ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
summary(Soybean)
Class date plant.stand precip
brown-spot : 92 5 :149 0 :354 0 : 74
alternarialeaf-spot: 91 4 :131 1 :293 1 :112
frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459
phytophthora-rot : 88 2 : 93 NA's: 38
anthracnose : 44 6 : 90
brown-stem-rot : 44 (Other):101
(Other) :233 NA's : 1
temp hail crop.hist area.dam sever seed.tmt
0 : 80 0 :435 0 : 65 0 :123 0 :195 0 :305
1 :374 1 :127 1 :165 1 :227 1 :322 1 :222
2 :199 NA's:121 2 :219 2 :145 2 : 45 2 : 35
NA's: 30 3 :218 3 :187 NA's:121 NA's:121
NA's: 16 NA's: 1
germ plant.growth leaves leaf.halo leaf.marg leaf.size
0 :165 0 :441 0: 77 0 :221 0 :357 0 : 51
1 :213 1 :226 1:606 1 : 36 1 : 21 1 :327
2 :193 NA's: 16 2 :342 2 :221 2 :221
NA's:112 NA's: 84 NA's: 84 NA's: 84
leaf.shread leaf.malf leaf.mild stem lodging stem.cankers
0 :487 0 :554 0 :535 0 :296 0 :520 0 :379
1 : 96 1 : 45 1 : 20 1 :371 1 : 42 1 : 39
NA's:100 NA's: 84 2 : 20 NA's: 16 NA's:121 2 : 36
NA's:108 3 :191
NA's: 38
canker.lesion fruiting.bodies ext.decay mycelium int.discolor
0 :320 0 :473 0 :497 0 :639 0 :581
1 : 83 1 :104 1 :135 1 : 6 1 : 44
2 :177 NA's:106 2 : 13 NA's: 38 2 : 20
3 : 65 NA's: 38 NA's: 38
NA's: 38
sclerotia fruit.pods fruit.spots seed mold.growth
0 :625 0 :407 0 :345 0 :476 0 :524
1 : 20 1 :130 1 : 75 1 :115 1 : 67
NA's: 38 2 : 14 2 : 57 NA's: 92 NA's: 92
3 : 48 4 :100
NA's: 84 NA's:106
seed.discolor seed.size shriveling roots
0 :513 0 :532 0 :539 0 :551
1 : 64 1 : 59 1 : 38 1 : 86
NA's:106 NA's: 92 NA's:106 2 : 15
NA's: 31
zero <- nearZeroVar(Soybean)
colnames(Soybean[zero])
[1] "leaf.mild" "mycelium" "sclerotia"
Using the nearZeroVar function, we find three variables with near-zero variance. Predictors with almost no variance carry little information and can destabilize some models, so they should be removed.
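The idea behind a near-zero-variance check can be sketched in base R: a predictor is flagged when its most common value is far more frequent than the second most common one and it has few distinct values. The helper `near_zero_stats` below is hypothetical, and the cutoffs (a 95/5 frequency ratio and 10 percent unique values) are assumed to match caret's defaults; details such as NA handling may differ from nearZeroVar. The counts are taken from the summary(Soybean) output above.

```r
# Sketch of a near-zero-variance rule (illustrative helper, not caret's code).
near_zero_stats <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  x <- x[!is.na(x)]
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value's count to the second most common one
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Distinct values as a percentage of the non-missing sample size
  pct_unique <- 100 * length(counts) / length(x)
  list(
    freq_ratio = freq_ratio,
    pct_unique = pct_unique,
    nzv = freq_ratio > freq_cut && pct_unique < unique_cut
  )
}

# Rebuild three predictors from their non-missing counts in the summary above
mycelium  <- rep(c("0", "1"), times = c(639, 6))
leaf_mild <- rep(c("0", "1", "2"), times = c(535, 20, 20))
hail      <- rep(c("0", "1"), times = c(435, 127))

res_mycelium  <- near_zero_stats(mycelium)
res_leaf_mild <- near_zero_stats(leaf_mild)
res_hail      <- near_zero_stats(hail)
print(c(res_mycelium$freq_ratio, res_leaf_mild$freq_ratio, res_hail$freq_ratio))
```

Under these assumptions mycelium (639 vs. 6) and leaf.mild (535 vs. 20) are flagged, while hail (435 vs. 127) is not, which agrees with the nearZeroVar result above.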
library(summarytools)
apply(Soybean, 2, freq)
Frequencies
Freq % Valid % Valid Cum. % Total % Total Cum.
--------------------------------- ------ --------- -------------- --------- --------------
2-4-d-injury 16 2.34 2.34 2.34 2.34
alternarialeaf-spot 91 13.32 15.67 13.32 15.67
anthracnose 44 6.44 22.11 6.44 22.11
bacterial-blight 20 2.93 25.04 2.93 25.04
bacterial-pustule 20 2.93 27.96 2.93 27.96
brown-spot 92 13.47 41.43 13.47 41.43
brown-stem-rot 44 6.44 47.88 6.44 47.88
charcoal-rot 20 2.93 50.81 2.93 50.81
cyst-nematode 14 2.05 52.86 2.05 52.86
diaporthe-pod-&-stem-blight 15 2.20 55.05 2.20 55.05
diaporthe-stem-canker 20 2.93 57.98 2.93 57.98
downy-mildew 20 2.93 60.91 2.93 60.91
frog-eye-leaf-spot 91 13.32 74.23 13.32 74.23
herbicide-injury 8 1.17 75.40 1.17 75.40
phyllosticta-leaf-spot 20 2.93 78.33 2.93 78.33
phytophthora-rot 88 12.88 91.22 12.88 91.22
powdery-mildew 20 2.93 94.14 2.93 94.14
purple-seed-stain 20 2.93 97.07 2.93 97.07
rhizoctonia-root-rot 20 2.93 100.00 2.93 100.00
<NA> 0 0.00 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 26 3.81 3.81 3.81 3.81
1 75 11.00 14.81 10.98 14.79
2 93 13.64 28.45 13.62 28.40
3 118 17.30 45.75 17.28 45.68
4 131 19.21 64.96 19.18 64.86
5 149 21.85 86.80 21.82 86.68
6 90 13.20 100.00 13.18 99.85
<NA> 1 0.15 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 354 54.71 54.71 51.83 51.83
1 293 45.29 100.00 42.90 94.73
<NA> 36 5.27 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 74 11.47 11.47 10.83 10.83
1 112 17.36 28.84 16.40 27.23
2 459 71.16 100.00 67.20 94.44
<NA> 38 5.56 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 80 12.25 12.25 11.71 11.71
1 374 57.27 69.53 54.76 66.47
2 199 30.47 100.00 29.14 95.61
<NA> 30 4.39 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 435 77.40 77.40 63.69 63.69
1 127 22.60 100.00 18.59 82.28
<NA> 121 17.72 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 65 9.75 9.75 9.52 9.52
1 165 24.74 34.48 24.16 33.67
2 219 32.83 67.32 32.06 65.74
3 218 32.68 100.00 31.92 97.66
<NA> 16 2.34 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 123 18.04 18.04 18.01 18.01
1 227 33.28 51.32 33.24 51.24
2 145 21.26 72.58 21.23 72.47
3 187 27.42 100.00 27.38 99.85
<NA> 1 0.15 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 195 34.70 34.70 28.55 28.55
1 322 57.30 91.99 47.14 75.70
2 45 8.01 100.00 6.59 82.28
<NA> 121 17.72 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 305 54.27 54.27 44.66 44.66
1 222 39.50 93.77 32.50 77.16
2 35 6.23 100.00 5.12 82.28
<NA> 121 17.72 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 165 28.90 28.90 24.16 24.16
1 213 37.30 66.20 31.19 55.34
2 193 33.80 100.00 28.26 83.60
<NA> 112 16.40 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 441 66.12 66.12 64.57 64.57
1 226 33.88 100.00 33.09 97.66
<NA> 16 2.34 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 77 11.27 11.27 11.27 11.27
1 606 88.73 100.00 88.73 100.00
<NA> 0 0.00 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 221 36.89 36.89 32.36 32.36
1 36 6.01 42.90 5.27 37.63
2 342 57.10 100.00 50.07 87.70
<NA> 84 12.30 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 357 59.60 59.60 52.27 52.27
1 21 3.51 63.11 3.07 55.34
2 221 36.89 100.00 32.36 87.70
<NA> 84 12.30 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 51 8.51 8.51 7.47 7.47
1 327 54.59 63.11 47.88 55.34
2 221 36.89 100.00 32.36 87.70
<NA> 84 12.30 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 487 83.53 83.53 71.30 71.30
1 96 16.47 100.00 14.06 85.36
<NA> 100 14.64 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 554 92.49 92.49 81.11 81.11
1 45 7.51 100.00 6.59 87.70
<NA> 84 12.30 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 535 93.04 93.04 78.33 78.33
1 20 3.48 96.52 2.93 81.26
2 20 3.48 100.00 2.93 84.19
<NA> 108 15.81 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 296 44.38 44.38 43.34 43.34
1 371 55.62 100.00 54.32 97.66
<NA> 16 2.34 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 520 92.53 92.53 76.13 76.13
1 42 7.47 100.00 6.15 82.28
<NA> 121 17.72 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 379 58.76 58.76 55.49 55.49
1 39 6.05 64.81 5.71 61.20
2 36 5.58 70.39 5.27 66.47
3 191 29.61 100.00 27.96 94.44
<NA> 38 5.56 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 320 49.61 49.61 46.85 46.85
1 83 12.87 62.48 12.15 59.00
2 177 27.44 89.92 25.92 84.92
3 65 10.08 100.00 9.52 94.44
<NA> 38 5.56 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 473 81.98 81.98 69.25 69.25
1 104 18.02 100.00 15.23 84.48
<NA> 106 15.52 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 497 77.05 77.05 72.77 72.77
1 135 20.93 97.98 19.77 92.53
2 13 2.02 100.00 1.90 94.44
<NA> 38 5.56 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 639 99.07 99.07 93.56 93.56
1 6 0.93 100.00 0.88 94.44
<NA> 38 5.56 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 581 90.08 90.08 85.07 85.07
1 44 6.82 96.90 6.44 91.51
2 20 3.10 100.00 2.93 94.44
<NA> 38 5.56 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 625 96.90 96.90 91.51 91.51
1 20 3.10 100.00 2.93 94.44
<NA> 38 5.56 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 407 67.95 67.95 59.59 59.59
1 130 21.70 89.65 19.03 78.62
2 14 2.34 91.99 2.05 80.67
3 48 8.01 100.00 7.03 87.70
<NA> 84 12.30 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 345 59.79 59.79 50.51 50.51
1 75 13.00 72.79 10.98 61.49
2 57 9.88 82.67 8.35 69.84
4 100 17.33 100.00 14.64 84.48
<NA> 106 15.52 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 476 80.54 80.54 69.69 69.69
1 115 19.46 100.00 16.84 86.53
<NA> 92 13.47 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 524 88.66 88.66 76.72 76.72
1 67 11.34 100.00 9.81 86.53
<NA> 92 13.47 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 513 88.91 88.91 75.11 75.11
1 64 11.09 100.00 9.37 84.48
<NA> 106 15.52 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 532 90.02 90.02 77.89 77.89
1 59 9.98 100.00 8.64 86.53
<NA> 92 13.47 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 539 93.41 93.41 78.92 78.92
1 38 6.59 100.00 5.56 84.48
<NA> 106 15.52 100.00
Total 683 100.00 100.00 100.00 100.00
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
0 551 84.51 84.51 80.67 80.67
1 86 13.19 97.70 12.59 93.27
2 15 2.30 100.00 2.20 95.46
<NA> 31 4.54 100.00
Total 683 100.00 100.00 100.00 100.00
gg_miss_var(Soybean)
# A tibble: 5 x 2
Class Count
<fct> <int>
1 2-4-d-injury 16
2 cyst-nematode 14
3 diaporthe-pod-&-stem-blight 15
4 herbicide-injury 8
5 phytophthora-rot 68
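Five classes contain samples with missing data. Phytophthora-rot has the largest number of incomplete samples (68 of 88). For 2-4-d-injury (16), cyst-nematode (14), diaporthe-pod-&-stem-blight (15), and herbicide-injury (8), the count equals the class total in the frequency table above, so every sample in those classes has at least one missing value.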
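The code that produced the per-class counts is not shown here; it was presumably similar to the complete.cases check applied to the imputed data later on. A minimal base R sketch of the same idea, on a made-up data frame with hypothetical column names, counts incomplete rows per class and compares them with the class totals:

```r
# Sketch: count rows with at least one NA per class (toy data, not Soybean).
toy <- data.frame(
  Class = factor(c("a", "a", "b", "b", "b", "c")),
  p1 = c(1, NA, 2, 3, NA, 4),
  p2 = c("x", "y", NA, "z", NA, "w"),
  stringsAsFactors = FALSE
)

incomplete <- !complete.cases(toy)  # TRUE where any column is NA

by_class <- data.frame(
  total      = as.vector(table(toy$Class)),
  incomplete = as.vector(table(toy$Class[incomplete])),  # factor keeps all levels
  row.names  = levels(toy$Class)
)
by_class$share <- by_class$incomplete / by_class$total
print(by_class)
```

A share of 1 for a class means that every sample in that class is incomplete, which matters when deciding whether missingness is related to the outcome.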
The variables with near-zero variance should be eliminated from the data.
Soybean_new <- Soybean%>%select(-"leaf.mild", -"mycelium", -"sclerotia")
We can use imputation for the missing values.
The method being applied will be predictive mean matching from the mice package.
For additional information:
https://statisticsglobe.com/predictive-mean-matching-imputation-method/
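To give a feel for the method, here is a heavily simplified sketch of predictive mean matching for one numeric variable with a single complete predictor. It is not the mice implementation: the helper `pmm_impute` is hypothetical, the donor pool size k = 5 is an illustrative choice, and the extra random draw of regression coefficients that a full implementation performs is left out. The Soybean predictors are factors, so this numeric sketch only illustrates the matching idea.

```r
# Simplified predictive mean matching (illustrative, numeric case only).
pmm_impute <- function(y, x, k = 5) {
  obs <- !is.na(y)
  # 1. Fit a regression on the observed cases
  fit <- lm(y ~ x, data = data.frame(y = y[obs], x = x[obs]))
  # 2. Predict for every case, observed or missing
  pred_all <- predict(fit, newdata = data.frame(x = x))
  pred_obs <- pred_all[obs]
  y_obs <- y[obs]
  out <- y
  for (i in which(!obs)) {
    # 3. Find the k observed cases with the closest predicted values (donors)
    donors <- order(abs(pred_obs - pred_all[i]))[seq_len(min(k, length(y_obs)))]
    # 4. Copy the observed value of one randomly chosen donor
    pick <- donors[sample.int(length(donors), 1)]
    out[i] <- y_obs[pick]
  }
  out
}

set.seed(123)
x <- runif(50, 0, 10)
y <- 2 * x + rnorm(50)
y_missing <- y
y_missing[c(3, 17, 40)] <- NA  # knock out three values

y_imputed <- pmm_impute(y_missing, x)
print(y_imputed[c(3, 17, 40)])
```

Because every imputed value is copied from an observed case, the imputations always stay within the range of values that actually occur in the data.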
imp_soy <- mice(Soybean_new, m = 5, method = "pmm")
iter imp variable
1 1 date plant.stand precip* temp* hail* crop.hist* area.dam sever* seed.tmt* germ* plant.growth leaf.halo leaf.marg leaf.size leaf.shread* leaf.malf stem* lodging* stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor* fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
1 2 date plant.stand* precip temp hail* crop.hist area.dam sever* seed.tmt* germ* plant.growth leaf.halo leaf.marg leaf.size leaf.shread* leaf.malf stem lodging* stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor* fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
1 3 date plant.stand* precip temp hail* crop.hist area.dam sever* seed.tmt* germ* plant.growth leaf.halo leaf.marg leaf.size* leaf.shread* leaf.malf* stem* lodging* stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor* fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
1 4 date plant.stand precip* temp hail* crop.hist area.dam sever* seed.tmt* germ* plant.growth leaf.halo leaf.marg leaf.size leaf.shread* leaf.malf stem* lodging* stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
1 5 date plant.stand* precip* temp hail* crop.hist area.dam sever* seed.tmt* germ* plant.growth leaf.halo leaf.marg leaf.size leaf.shread* leaf.malf stem* lodging* stem.cankers* canker.lesion fruiting.bodies* ext.decay int.discolor fruit.pods fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
2 1 date* plant.stand* precip* temp* hail* crop.hist* area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.marg* leaf.size* leaf.shread* leaf.malf* stem* lodging* stem.cankers canker.lesion* fruiting.bodies* ext.decay* int.discolor* fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
2 2 date* plant.stand* precip* temp* hail* crop.hist* area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.marg* leaf.size* leaf.shread* leaf.malf* stem* lodging* stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
2 3 date* plant.stand* precip* temp* hail* crop.hist* area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.marg* leaf.size* leaf.shread* leaf.malf* stem* lodging* stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
2 4 date* plant.stand* precip* temp* hail* crop.hist* area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.marg* leaf.size* leaf.shread* leaf.malf* stem* lodging* stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
2 5 date* plant.stand* precip* temp* hail* crop.hist* area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.marg* leaf.size* leaf.shread* leaf.malf* stem* lodging* stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
3 1 date* plant.stand* precip temp* hail* crop.hist* area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo leaf.marg* leaf.size* leaf.shread leaf.malf* stem* lodging* stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor* fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
3 2 date* plant.stand* precip* temp* hail* crop.hist* area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo leaf.marg leaf.size* leaf.shread* leaf.malf* stem* lodging* stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor* fruit.pods fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
3 3 date* plant.stand* precip* temp* hail* crop.hist* area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.marg* leaf.size* leaf.shread* leaf.malf* stem* lodging* stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor* fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
3 4 date* plant.stand* precip* temp* hail* crop.hist* area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo leaf.marg* leaf.size* leaf.shread* leaf.malf* stem* lodging* stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor* fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots
3 5 date* plant.stand* precip* temp* hail* crop.hist* area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.marg* leaf.size leaf.shread* leaf.malf* stem* lodging* stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
4 1 date* plant.stand* precip* temp* hail* crop.hist* area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.marg* leaf.size* leaf.shread* leaf.malf* stem* lodging* stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
4 2 date* plant.stand* precip* temp* hail* crop.hist* area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo leaf.marg* leaf.size* leaf.shread* leaf.malf* stem* lodging* stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
4 3 date* plant.stand* precip* temp* hail* crop.hist* area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.marg* leaf.size* leaf.shread* leaf.malf* stem* lodging* stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
4 4 date* plant.stand* precip* temp* hail* crop.hist* area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.marg leaf.size* leaf.shread* leaf.malf* stem* lodging* stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor* fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
4 5 date* plant.stand* precip* temp* hail* crop.hist* area.dam sever* seed.tmt* germ* plant.growth* leaf.halo leaf.marg* leaf.size* leaf.shread* leaf.malf* stem* lodging* stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor* fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
5 1 date* plant.stand* precip* temp* hail* crop.hist* area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.marg* leaf.size* leaf.shread* leaf.malf* stem* lodging* stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
5 2 date* plant.stand* precip* temp* hail* crop.hist* area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.marg* leaf.size* leaf.shread* leaf.malf* stem* lodging* stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor* fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
5 3 date* plant.stand precip* temp* hail* crop.hist* area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.marg* leaf.size* leaf.shread* leaf.malf* stem* lodging* stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor* fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
5 4 date* plant.stand* precip* temp* hail* crop.hist* area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.marg* leaf.size* leaf.shread* leaf.malf* stem* lodging* stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor* fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
5 5 date plant.stand* precip* temp* hail* crop.hist* area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.marg* leaf.size* leaf.shread* leaf.malf* stem* lodging* stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor* fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling roots*
multi_imp <- complete(imp_soy)
Check that there are no more missing values:
inc2 <- multi_imp[which(!complete.cases(multi_imp)),]
inc2 %>%
group_by(Class) %>%
dplyr::summarize(Count=n())
# A tibble: 0 x 2
# ... with 2 variables: Class <fct>, Count <int>