3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The Soybean dataset is a data frame with 683 observations on 36 variables: 35 categorical predictors and a nominal variable, Class, denoting the disease class. Class takes 19 values, 4 of which have very little data. Some of the 35 predictors are nominal and some are ordered. The value “dna” means "does not apply". The values for attributes are encoded numerically, with the first value encoded as “0,” the second as “1,” and so forth.

##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
## 

From the above, we can see that every predictor except leaves has missing values.

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
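
The two listings above can be reproduced along the following lines. This is a sketch that assumes the data are loaded from the `mlbench` package, where the data frame is named `Soybean`; the text itself only names the UC Irvine repository, so the package is an assumption.

```r
# Assumption: the Soybean data frame is taken from the mlbench package.
library(mlbench)
data(Soybean)

# Dimensions: 683 rows, 36 columns (Class plus 35 predictors)
dim(Soybean)

# Per-column level counts, including NA counts
summary(Soybean)

# Column types: plain and ordered factors
str(Soybean)

# Number of observations per class, largest first
sort(table(Soybean$Class), decreasing = TRUE)
```

The last call is not shown in the output above; it is one way to see which classes have very little data.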

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

## 
##   0   1 
## 532  59

Degenerate predictors are those that are single-valued (i.e. variance = 0), or those where the most frequent value is far more common than the next-most frequent value, roughly 20 times as frequent. For this dataset, we identify them with the nearZeroVar function from the caret package, which returns the column indices of the flagged variables.

## [1] 19 26 28

From the above, we can see that there are 3 degenerate predictors with near-zero variance (columns 19, 26 and 28 of the data frame): leaf.mild, mycelium and sclerotia.

## [1] "leaf.mild" "mycelium"  "sclerotia"
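
The flagged columns can be cross-checked by hand. The sketch below computes the frequency ratio (most common level over the second most common) in base R. `freq_ratio` and `rebuild` are hypothetical helpers, and the four vectors are rebuilt from the counts printed in the summary output rather than loaded from the data. For reference, caret's `nearZeroVar` flags a column when this ratio exceeds `freqCut` (default 95/5 = 19) and the percentage of unique values is at most `uniqueCut` (default 10).

```r
# Hypothetical helper: ratio of the most frequent level to the runner-up.
# table() drops NA values by default, so missing entries are ignored.
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  as.numeric(counts[1]) / as.numeric(counts[2])
}

# Hypothetical helper: rebuild a factor from a named vector of level counts
rebuild <- function(counts) factor(rep(names(counts), times = counts))

# Level counts copied from the summary output (NA entries left out)
cols <- list(
  leaf.mild = rebuild(c("0" = 535, "1" = 20, "2" = 20)),
  mycelium  = rebuild(c("0" = 639, "1" = 6)),
  sclerotia = rebuild(c("0" = 625, "1" = 20)),
  seed.size = rebuild(c("0" = 532, "1" = 59))
)

ratios <- sapply(cols, freq_ratio)
ratios

# Columns whose ratio exceeds the 95/5 cutoff
names(ratios)[ratios > 95 / 5]
```

With the full data frame loaded, `sapply(Soybean[-1], freq_ratio)` would apply the same check to every predictor.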

(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

| Predictor | Missing count | Missing proportion |
|---|---|---|
| hail | 121 | 0.1771596 |
| sever | 121 | 0.1771596 |
| seed.tmt | 121 | 0.1771596 |
| lodging | 121 | 0.1771596 |
| germ | 112 | 0.1639824 |
| leaf.mild | 108 | 0.1581259 |
| fruiting.bodies | 106 | 0.1551977 |
| fruit.spots | 106 | 0.1551977 |
| seed.discolor | 106 | 0.1551977 |
| shriveling | 106 | 0.1551977 |
| leaf.shread | 100 | 0.1464129 |
| seed | 92 | 0.1346999 |
| mold.growth | 92 | 0.1346999 |
| seed.size | 92 | 0.1346999 |
| leaf.halo | 84 | 0.1229868 |
| leaf.marg | 84 | 0.1229868 |
| leaf.size | 84 | 0.1229868 |
| leaf.malf | 84 | 0.1229868 |
| fruit.pods | 84 | 0.1229868 |
| precip | 38 | 0.0556369 |
| stem.cankers | 38 | 0.0556369 |
| canker.lesion | 38 | 0.0556369 |
| ext.decay | 38 | 0.0556369 |
| mycelium | 38 | 0.0556369 |
| int.discolor | 38 | 0.0556369 |
| sclerotia | 38 | 0.0556369 |
| plant.stand | 36 | 0.0527086 |
| roots | 31 | 0.0453880 |
| temp | 30 | 0.0439239 |
| crop.hist | 16 | 0.0234261 |
| plant.growth | 16 | 0.0234261 |
| stem | 16 | 0.0234261 |
| date | 1 | 0.0014641 |
| area.dam | 1 | 0.0014641 |
| leaves | 0 | 0.0000000 |

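The counts above show that 4 predictors (hail, sever, seed.tmt and lodging) are each missing in 121 of the 683 rows, i.e. about 18%. At the other end, date and area.dam are each missing only once and leaves is never missing.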
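
Per-predictor counts and proportions like these can be computed in base R. The sketch below assumes the data frame comes from the `mlbench` package; the object names `predictors`, `na_count` and `na_prop` are illustrative.

```r
library(mlbench)  # assumption: Soybean comes from mlbench
data(Soybean)

# Work on the 35 predictors only
predictors <- Soybean[, names(Soybean) != "Class"]

# Missing values per predictor, largest first
na_count <- sort(colSums(is.na(predictors)), decreasing = TRUE)

# The same figures as a share of all rows
na_prop <- round(na_count / nrow(predictors), 7)

head(data.frame(na_count, na_prop))
```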

leaves date area.dam crop.hist plant.growth stem temp roots plant.stand precip stem.cankers canker.lesion ext.decay mycelium int.discolor sclerotia leaf.halo leaf.marg leaf.size leaf.malf fruit.pods seed mold.growth seed.size leaf.shread fruiting.bodies fruit.spots seed.discolor shriveling leaf.mild germ hail sever seed.tmt lodging
562 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
13 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 13
55 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 19
8 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 20
9 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 0 1 1 1 1 0 1 0 0 0 0 11
6 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 0 1 1 1 1 0 0 0 0 0 0 13
14 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 24
15 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 28
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 30
0 1 1 16 16 16 30 31 36 38 38 38 38 38 38 38 84 84 84 84 84 92 92 92 100 106 106 106 106 108 112 121 121 121 121 2337

The above shows that 562 records have no missing values. The remaining 121 rows fall into 8 distinct patterns, which can be studied further to analyze any structure in the missing data. For example, it seems that the following 3 predictors have missing values in conjunction: seed, mold.growth and seed.size - this may be of interest to a biologist who can use domain knowledge to better understand why these 3 predictors always have missing values in conjunction with each other.
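
The co-occurrence claim can be checked directly. The following is a base-R sketch, again assuming the `mlbench` copy of the data; it tests whether the three seed-related predictors are missing in exactly the same rows and then counts the distinct missingness patterns.

```r
library(mlbench)  # assumption: Soybean comes from mlbench
data(Soybean)

# Logical matrix: TRUE where a predictor value is missing
miss <- is.na(Soybean[, names(Soybean) != "Class"])

# Are the three seed-related predictors missing in exactly the same rows?
same_rows <- all(miss[, "seed"] == miss[, "mold.growth"]) &&
  all(miss[, "seed"] == miss[, "seed.size"])
same_rows

# Collapse each row's missingness into a 0/1 string and count the patterns
pattern <- apply(miss, 1, function(r) paste(as.integer(r), collapse = ""))
pattern_counts <- sort(table(pattern), decreasing = TRUE)
unname(pattern_counts)
```

`md.pattern()` from the mice package reports the same information in matrix form, one row per pattern.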

## 
##  Variables sorted by number of missings: 
##         Variable       Count
##             hail 0.177159590
##            sever 0.177159590
##         seed.tmt 0.177159590
##          lodging 0.177159590
##             germ 0.163982430
##        leaf.mild 0.158125915
##  fruiting.bodies 0.155197657
##      fruit.spots 0.155197657
##    seed.discolor 0.155197657
##       shriveling 0.155197657
##      leaf.shread 0.146412884
##             seed 0.134699854
##      mold.growth 0.134699854
##        seed.size 0.134699854
##        leaf.halo 0.122986823
##        leaf.marg 0.122986823
##        leaf.size 0.122986823
##        leaf.malf 0.122986823
##       fruit.pods 0.122986823
##           precip 0.055636896
##     stem.cankers 0.055636896
##    canker.lesion 0.055636896
##        ext.decay 0.055636896
##         mycelium 0.055636896
##     int.discolor 0.055636896
##        sclerotia 0.055636896
##      plant.stand 0.052708638
##            roots 0.045387994
##             temp 0.043923865
##        crop.hist 0.023426061
##     plant.growth 0.023426061
##             stem 0.023426061
##             date 0.001464129
##         area.dam 0.001464129
##           leaves 0.000000000

The above plot re-iterates that about 82% of the rows in this soybean dataset (562 of 683) are complete. It shows the combinations of predictors that are typically missing together or not at all. No predictor is ever missing on its own - it is always groups of variables that are missing together. This may have something to do with how the data were collected and could be of interest to a domain expert.
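
The question also asks whether missingness is related to the classes. One way to examine this, sketched here under the assumption that the data come from `mlbench`, is to count incomplete rows per class; `by_class` is an illustrative name.

```r
library(mlbench)  # assumption: Soybean comes from mlbench
data(Soybean)

# Rows with at least one missing value
incomplete <- !complete.cases(Soybean)

# Incomplete and total row counts per class (both ordered by factor level)
by_class <- data.frame(
  incomplete = as.vector(tapply(incomplete, Soybean$Class, sum)),
  total      = as.vector(table(Soybean$Class)),
  row.names  = levels(Soybean$Class)
)
by_class$share <- round(by_class$incomplete / by_class$total, 2)

# Keep only the classes that contain incomplete rows
by_class[by_class$incomplete > 0, ]
```

If the incomplete rows concentrate in a handful of classes, missingness is informative about the outcome, which matters when choosing between dropping rows and imputing.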

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The approach I would follow is to eliminate the 3 degenerate predictors and then impute the missing values for the remaining predictors.

The mice package provides different methods for imputing missing values, such as predictive mean matching and CART.
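
A minimal sketch of this strategy follows. It assumes the `mlbench` copy of the data and picks CART as the imputation method; the text does not say which method or settings were actually used, so `m`, `maxit` and `seed` are illustrative choices.

```r
library(mlbench)  # assumption: Soybean comes from mlbench
library(mice)
data(Soybean)

# Drop the three degenerate predictors identified in part (a)
degenerate <- c("leaf.mild", "mycelium", "sclerotia")
reduced <- Soybean[, !(names(Soybean) %in% degenerate)]

# One imputed data set via CART; m, maxit and seed are illustrative
imp <- mice(reduced, m = 1, method = "cart", maxit = 5,
            seed = 123, printFlag = FALSE)
imputed <- mice::complete(imp, 1)

# Remaining missing values per column
colSums(is.na(imputed))
```

Generating several imputed data sets (`m > 1`) instead of one would make it possible to see how much the imputed values vary between runs.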

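| Variable | Missing count |
|---|---|
| Class | 0 |
| date | 0 |
| plant.stand | 0 |
| precip | 0 |
| temp | 0 |
| hail | 0 |
| crop.hist | 0 |
| area.dam | 0 |
| sever | 0 |
| seed.tmt | 0 |
| germ | 0 |
| plant.growth | 0 |
| leaves | 0 |
| leaf.halo | 0 |
| leaf.marg | 0 |
| leaf.size | 0 |
| leaf.shread | 0 |
| leaf.malf | 0 |
| stem | 0 |
| lodging | 0 |
| stem.cankers | 0 |
| canker.lesion | 0 |
| fruiting.bodies | 0 |
| ext.decay | 0 |
| int.discolor | 0 |
| fruit.pods | 0 |
| fruit.spots | 0 |
| seed | 0 |
| mold.growth | 0 |
| seed.discolor | 0 |
| seed.size | 0 |
| shriveling | 0 |
| roots | 0 |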

In order to compare the different imputation methods, one would need to check the impact of each imputation on the distribution (variance) of the predictor, as well as its impact on the relationship with the target variable.
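
For a categorical predictor, one simple check is how much the level proportions move after imputation. The sketch below uses a hypothetical helper, `level_shift`, and rebuilds hail from the counts in the summary output. The modal-level fill is only a stand-in to make the example self-contained; it is not one of the mice methods named earlier.

```r
# Hypothetical helper: change in level proportions after imputation
level_shift <- function(before, after) {
  p_before <- prop.table(table(before))  # observed values only, NA dropped
  p_after  <- prop.table(table(after))
  p_after - p_before
}

# hail rebuilt from the summary counts: 435 zeros, 127 ones, 121 missing
hail <- factor(c(rep("0", 435), rep("1", 127), rep(NA, 121)),
               levels = c("0", "1"))

# Stand-in imputation: fill every missing entry with the modal level
hail_mode <- hail
hail_mode[is.na(hail_mode)] <- names(which.max(table(hail)))

# Positive values mean the level became more common after imputation
shift <- level_shift(hail, hail_mode)
round(shift, 4)
```

Applying the same helper to each column of two differently imputed data sets gives a quick side-by-side view of which method distorts the observed distributions less.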