The Soybean dataset is a data frame with 683 observations on 36 variables: 35 categorical predictors and a nominal Class variable that denotes the target. Class has 19 levels, 4 of which have very little data. Some of the predictors are nominal and some are ordered. In the source data, the value “dna” means “does not apply”. Attribute values are encoded numerically, with the first value encoded as “0”, the second as “1”, and so forth.

| Variable | Level counts | NA's |
|---|---|---|
| Class | brown-spot: 92, alternarialeaf-spot: 91, frog-eye-leaf-spot: 91, phytophthora-rot: 88, anthracnose: 44, brown-stem-rot: 44, (Other): 233 | 0 |
| date | 5: 149, 4: 131, 3: 118, 2: 93, 6: 90, (Other): 101 | 1 |
| plant.stand | 0: 354, 1: 293 | 36 |
| precip | 0: 74, 1: 112, 2: 459 | 38 |
| temp | 0: 80, 1: 374, 2: 199 | 30 |
| hail | 0: 435, 1: 127 | 121 |
| crop.hist | 0: 65, 1: 165, 2: 219, 3: 218 | 16 |
| area.dam | 0: 123, 1: 227, 2: 145, 3: 187 | 1 |
| sever | 0: 195, 1: 322, 2: 45 | 121 |
| seed.tmt | 0: 305, 1: 222, 2: 35 | 121 |
| germ | 0: 165, 1: 213, 2: 193 | 112 |
| plant.growth | 0: 441, 1: 226 | 16 |
| leaves | 0: 77, 1: 606 | 0 |
| leaf.halo | 0: 221, 1: 36, 2: 342 | 84 |
| leaf.marg | 0: 357, 1: 21, 2: 221 | 84 |
| leaf.size | 0: 51, 1: 327, 2: 221 | 84 |
| leaf.shread | 0: 487, 1: 96 | 100 |
| leaf.malf | 0: 554, 1: 45 | 84 |
| leaf.mild | 0: 535, 1: 20, 2: 20 | 108 |
| stem | 0: 296, 1: 371 | 16 |
| lodging | 0: 520, 1: 42 | 121 |
| stem.cankers | 0: 379, 1: 39, 2: 36, 3: 191 | 38 |
| canker.lesion | 0: 320, 1: 83, 2: 177, 3: 65 | 38 |
| fruiting.bodies | 0: 473, 1: 104 | 106 |
| ext.decay | 0: 497, 1: 135, 2: 13 | 38 |
| mycelium | 0: 639, 1: 6 | 38 |
| int.discolor | 0: 581, 1: 44, 2: 20 | 38 |
| sclerotia | 0: 625, 1: 20 | 38 |
| fruit.pods | 0: 407, 1: 130, 2: 14, 3: 48 | 84 |
| fruit.spots | 0: 345, 1: 75, 2: 57, 4: 100 | 106 |
| seed | 0: 476, 1: 115 | 92 |
| mold.growth | 0: 524, 1: 67 | 92 |
| seed.discolor | 0: 513, 1: 64 | 106 |
| seed.size | 0: 532, 1: 59 | 92 |
| shriveling | 0: 539, 1: 38 | 106 |
| roots | 0: 551, 1: 86, 2: 15 | 31 |

From the above, we can see that every predictor except leaves has missing values.
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
The level counts for seed.size, excluding missing values:
##
## 0 1
## 532 59
Degenerate predictors are those that are single-valued (zero variance), or those where the most frequent value occurs at least 20 times as often as the next most frequent value. For this dataset, we use the nearZeroVar function from the caret package to identify them.
## [1] 19 26 28
From the above, we can see that there are 3 degenerate predictors with near-zero variance: leaf.mild, mycelium and sclerotia (columns 19, 26 and 28 of the data frame).
## [1] "leaf.mild" "mycelium" "sclerotia"
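
The frequency-ratio rule described above can be checked by hand. The following is a minimal base R sketch, not the caret implementation: it takes the level counts reported in the summary output for the three flagged predictors plus seed.size, and `freq_ratio` is a hypothetical helper written for this illustration. Note that nearZeroVar's documented defaults are `freqCut = 95/5` and `uniqueCut = 10`, so it also requires the percentage of distinct values to be small, a condition this sketch does not test.

```r
# Level counts copied from the summary output above (NA's excluded).
counts <- list(
  leaf.mild = c("0" = 535, "1" = 20, "2" = 20),
  mycelium  = c("0" = 639, "1" = 6),
  sclerotia = c("0" = 625, "1" = 20),
  seed.size = c("0" = 532, "1" = 59)
)

# Hypothetical helper: ratio of the most common level to the runner-up.
freq_ratio <- function(n) {
  s <- sort(n, decreasing = TRUE)
  if (length(s) < 2) return(Inf)  # a single-valued predictor
  unname(s[1] / s[2])
}

ratios <- sapply(counts, freq_ratio)
print(round(ratios, 2))

# Apply the rule of thumb quoted in the text (ratio of at least 20).
flagged <- names(ratios)[ratios >= 20]
print(flagged)
```

With these counts, seed.size has a ratio of about 9 and is not flagged, while the other three exceed 20, which agrees with the nearZeroVar result shown above.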
### (b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
| Predictor | Missing values |
|---|---|
| hail | 121 |
| sever | 121 |
| seed.tmt | 121 |
| lodging | 121 |
| germ | 112 |
| leaf.mild | 108 |
| fruiting.bodies | 106 |
| fruit.spots | 106 |
| seed.discolor | 106 |
| shriveling | 106 |
| leaf.shread | 100 |
| seed | 92 |
| mold.growth | 92 |
| seed.size | 92 |
| leaf.halo | 84 |
| leaf.marg | 84 |
| leaf.size | 84 |
| leaf.malf | 84 |
| fruit.pods | 84 |
| precip | 38 |
| stem.cankers | 38 |
| canker.lesion | 38 |
| ext.decay | 38 |
| mycelium | 38 |
| int.discolor | 38 |
| sclerotia | 38 |
| plant.stand | 36 |
| roots | 31 |
| temp | 30 |
| crop.hist | 16 |
| plant.growth | 16 |
| stem | 16 |
| date | 1 |
| area.dam | 1 |
| leaves | 0 |

| Predictor | Proportion missing |
|---|---|
| hail | 0.1771596 |
| sever | 0.1771596 |
| seed.tmt | 0.1771596 |
| lodging | 0.1771596 |
| germ | 0.1639824 |
| leaf.mild | 0.1581259 |
| fruiting.bodies | 0.1551977 |
| fruit.spots | 0.1551977 |
| seed.discolor | 0.1551977 |
| shriveling | 0.1551977 |
| leaf.shread | 0.1464129 |
| seed | 0.1346999 |
| mold.growth | 0.1346999 |
| seed.size | 0.1346999 |
| leaf.halo | 0.1229868 |
| leaf.marg | 0.1229868 |
| leaf.size | 0.1229868 |
| leaf.malf | 0.1229868 |
| fruit.pods | 0.1229868 |
| precip | 0.0556369 |
| stem.cankers | 0.0556369 |
| canker.lesion | 0.0556369 |
| ext.decay | 0.0556369 |
| mycelium | 0.0556369 |
| int.discolor | 0.0556369 |
| sclerotia | 0.0556369 |
| plant.stand | 0.0527086 |
| roots | 0.0453880 |
| temp | 0.0439239 |
| crop.hist | 0.0234261 |
| plant.growth | 0.0234261 |
| stem | 0.0234261 |
| date | 0.0014641 |
| area.dam | 0.0014641 |
| leaves | 0.0000000 |
The above table shows that 4 of the predictors (hail, sever, seed.tmt and lodging) are each missing in 121 of the 683 records, or about 18%.
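
Counts and proportions like those in the two tables above can be produced with a few lines of base R. This is a small sketch on a made-up data frame (`toy` is not the Soybean data, and its values are invented for illustration); the same two expressions would apply to any data frame.

```r
# Toy data frame, invented for illustration only.
toy <- data.frame(
  hail   = factor(c("0", NA, "1", NA, "0")),
  sever  = factor(c("1", NA, "2", NA, "1")),
  date   = factor(c("5", "4", "3", NA, "6")),
  leaves = factor(c("1", "1", "0", "1", "1"))
)

# Number of missing values per column, largest first.
na_count <- sort(colSums(is.na(toy)), decreasing = TRUE)

# The same counts as a proportion of the number of rows.
na_prop <- na_count / nrow(toy)

print(na_count)
print(round(na_prop, 7))

# Arithmetic check for the largest value in the table: 121 of 683 rows.
print(round(121 / 683, 7))
```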

| Rows | leaves | date | area.dam | crop.hist | plant.growth | stem | temp | roots | plant.stand | precip | stem.cankers | canker.lesion | ext.decay | mycelium | int.discolor | sclerotia | leaf.halo | leaf.marg | leaf.size | leaf.malf | fruit.pods | seed | mold.growth | seed.size | leaf.shread | fruiting.bodies | fruit.spots | seed.discolor | shriveling | leaf.mild | germ | hail | sever | seed.tmt | lodging | Predictors missing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 562 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 13 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 13 |
| 55 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 19 |
| 8 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20 |
| 9 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 11 |
| 6 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 13 |
| 14 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 24 |
| 15 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 28 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30 |
| Total | 0 | 1 | 1 | 16 | 16 | 16 | 30 | 31 | 36 | 38 | 38 | 38 | 38 | 38 | 38 | 38 | 84 | 84 | 84 | 84 | 84 | 92 | 92 | 92 | 100 | 106 | 106 | 106 | 106 | 108 | 112 | 121 | 121 | 121 | 121 | 2337 |
The above shows that 562 records have no missing values. In each row, the first column gives the number of records that share a pattern, 1 marks an observed predictor, 0 marks a missing one, and the last column counts the missing predictors. The remaining rows can be studied further for patterns in the missing data. For example, seed, mold.growth and seed.size are always missing together. This may be of interest to a biologist, who can use domain knowledge to explain why these 3 predictors are missing in the same records.
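
One way to look for predictors that are always missing together is to compare their missingness indicators directly. The sketch below uses base R on an invented data frame (`toy` and its values are assumptions made for this example); it builds one 0/1 signature per column and groups columns whose signatures are identical.

```r
# Toy data frame, invented for illustration only.
toy <- data.frame(
  seed        = c("0", NA, "1", NA, "0", "1"),
  mold.growth = c("0", NA, "0", NA, "1", "0"),
  seed.size   = c("1", NA, "0", NA, "0", "0"),
  roots       = c("0", "0", NA, NA, "1", "0"),
  leaves      = c("1", "1", "1", "0", "1", "1")
)

miss <- is.na(toy)  # logical matrix, TRUE = missing

# One signature string per predictor, e.g. "010100".
sig <- apply(miss, 2, function(col) paste(as.integer(col), collapse = ""))

# Group predictor names by signature.
groups <- split(names(sig), sig)

# Keep groups of two or more predictors that have at least one missing value.
co_missing <- Filter(function(g) length(g) > 1 && any(miss[, g[1]]), groups)
print(co_missing)
```

In this toy example the only group returned is seed, mold.growth and seed.size, mirroring the pattern noted in the text.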
##
## Variables sorted by number of missings:
## Variable Count
## hail 0.177159590
## sever 0.177159590
## seed.tmt 0.177159590
## lodging 0.177159590
## germ 0.163982430
## leaf.mild 0.158125915
## fruiting.bodies 0.155197657
## fruit.spots 0.155197657
## seed.discolor 0.155197657
## shriveling 0.155197657
## leaf.shread 0.146412884
## seed 0.134699854
## mold.growth 0.134699854
## seed.size 0.134699854
## leaf.halo 0.122986823
## leaf.marg 0.122986823
## leaf.size 0.122986823
## leaf.malf 0.122986823
## fruit.pods 0.122986823
## precip 0.055636896
## stem.cankers 0.055636896
## canker.lesion 0.055636896
## ext.decay 0.055636896
## mycelium 0.055636896
## int.discolor 0.055636896
## sclerotia 0.055636896
## plant.stand 0.052708638
## roots 0.045387994
## temp 0.043923865
## crop.hist 0.023426061
## plant.growth 0.023426061
## stem 0.023426061
## date 0.001464129
## area.dam 0.001464129
## leaves 0.000000000
The above plot reiterates that about 82% of the records in this soybean dataset (562 of 683) are complete. It shows the combinations of predictors that are typically missing together or not at all. No predictor is ever missing on its own: it is always groups of variables that are missing together. This may have something to do with how the data were collected and be of interest to a domain expert.
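
To see whether missingness is related to the classes, one option is to tabulate incomplete records by class. The following base R sketch shows the idea on an invented data frame (`toy`, its class labels and its values are assumptions for illustration); it makes no claim about the result for the Soybean data.

```r
# Toy data frame, invented for illustration only.
toy <- data.frame(
  Class = factor(c("a", "a", "b", "b", "b", "c")),
  hail  = c("0", "1", NA, NA, "0", "1"),
  germ  = c("1", "0", NA, "2", "1", "0")
)

# TRUE for records with at least one missing predictor (Class excluded).
incomplete <- !complete.cases(toy[, -1])

# Counts of complete and incomplete records per class.
by_class <- table(Class = toy$Class, incomplete = incomplete)
print(by_class)

# Share of incomplete records within each class.
prop_incomplete <- sapply(split(incomplete, toy$Class), mean)
print(round(prop_incomplete, 3))
```

If the incomplete records concentrate in a few classes, simply dropping those records would remove most of the data for those classes.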
### (c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
The approach I would follow is to eliminate the 3 degenerate predictors and then impute the missing values of the remaining predictors.
The mice (Multivariate Imputation by Chained Equations) package provides several methods for imputing missing values, such as predictive mean matching and CART.

| Variable | Missing values after imputation |
|---|---|
| Class | 0 |
| date | 0 |
| plant.stand | 0 |
| precip | 0 |
| temp | 0 |
| hail | 0 |
| crop.hist | 0 |
| area.dam | 0 |
| sever | 0 |
| seed.tmt | 0 |
| germ | 0 |
| plant.growth | 0 |
| leaves | 0 |
| leaf.halo | 0 |
| leaf.marg | 0 |
| leaf.size | 0 |
| leaf.shread | 0 |
| leaf.malf | 0 |
| stem | 0 |
| lodging | 0 |
| stem.cankers | 0 |
| canker.lesion | 0 |
| fruiting.bodies | 0 |
| ext.decay | 0 |
| int.discolor | 0 |
| fruit.pods | 0 |
| fruit.spots | 0 |
| seed | 0 |
| mold.growth | 0 |
| seed.discolor | 0 |
| seed.size | 0 |
| shriveling | 0 |
| roots | 0 |
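
The two-step strategy (drop the degenerate predictors, then impute the rest) can be sketched in base R. This is not the mice workflow used above: as a deliberately simpler stand-in, it fills each missing value with the column's most frequent level. The data frame `toy` and the helper `impute_mode` are invented for this illustration.

```r
# Toy data frame, invented for illustration only.
toy <- data.frame(
  Class    = factor(c("a", "a", "b", "b", "b", "c")),
  mycelium = factor(c("0", "0", "0", NA, "0", "0")),  # plays the degenerate predictor
  hail     = factor(c("0", "1", NA, NA, "0", "0")),
  germ     = factor(c("1", "0", NA, "2", "1", "1"))
)

# Step 1: drop the predictors flagged as degenerate.
degenerate <- c("mycelium")
kept <- toy[, setdiff(names(toy), degenerate), drop = FALSE]

# Step 2 (stand-in for mice): replace NA with the most frequent level.
impute_mode <- function(x) {
  mode_level <- names(which.max(table(x)))  # table() drops NA by default
  x[is.na(x)] <- mode_level
  x
}
imputed <- kept
imputed[] <- lapply(kept, impute_mode)

# Every remaining column should now report zero missing values.
print(colSums(is.na(imputed)))
```

Mode imputation ignores relationships between predictors, which is the main reason to prefer model-based methods such as those in mice for real use.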
To compare the different imputation methods, one would need to check the impact of each method on the variance of the predictors, as well as its impact on the target variable.
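
For categorical predictors, one simple proxy for the change in variance is the shift in level proportions before and after imputation. The sketch below is an assumption about how such a check could look, not a method taken from the analysis above; the vector `before` is invented and the imputation step is plain mode imputation used as a stand-in.

```r
# Invented predictor with two missing values.
before <- factor(c("0", "1", "1", NA, "2", "1", NA, "0"),
                 levels = c("0", "1", "2"))

# Stand-in imputation: fill NA with the most frequent level ("1").
after <- before
after[is.na(after)] <- "1"

# Level proportions among observed values, before and after.
p_before <- prop.table(table(before))
p_after  <- prop.table(table(after))

# Change in each level's share, and the largest absolute change.
shift <- setNames(as.numeric(p_after) - as.numeric(p_before), levels(before))
max_shift <- max(abs(shift))

print(round(shift, 4))
print(max_shift)
```

Running the same comparison for each candidate method gives a rough, per-predictor view of how much each one distorts the observed distribution.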