Data 624 HW4

Assignment #4

Gabe Abreu (https://example.com/norajones), Spacely Sprockets (https://example.com/spacelysprokets)
2021-10-03

HW 4

3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)
'data.frame':   214 obs. of  10 variables:
 $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
 $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
 $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
 $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
 $ Si  : num  71.8 72.7 73 72.6 73.1 ...
 $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
 $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
 $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
 $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Plotting the distributions of the variables
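
The rendered page does not show the code for this step, so the following is only a sketch of one way to plot the distributions, not the original chunk. It assumes the ggplot2 and reshape2 packages; the object names glass_long and p_hist are made up for illustration.

```r
library(mlbench)   # provides the Glass data
library(ggplot2)
library(reshape2)

data(Glass)

# Reshape the nine numeric predictors to long form (column 10 is Type)
glass_long <- melt(Glass[, -10], measure.vars = names(Glass)[-10])

# One histogram per predictor, each panel with its own axis scales
p_hist <- ggplot(glass_long, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable, scales = "free", ncol = 3)
print(p_hist)
```

Free scales matter here because the predictors live on very different ranges (Si is around 70, Fe is mostly 0).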

Correlation plot
library(corrplot)
# Pairwise correlations among the nine predictors (column 10 is Type)
correlations <- cor(Glass[,-10])
corrplot(correlations, order = "hclust")

library(GGally)
ggpairs(Glass[,-10])

The visualizations show that several predictors are skewed and that Ca and RI are strongly correlated. Several other predictor pairs are moderately correlated.
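
To read the correlated pairs off as numbers rather than from the plot, one option is to list every predictor pair whose absolute correlation exceeds a cutoff. This is a sketch only: the 0.5 cutoff and the names cors and pairs_tbl are assumptions chosen for illustration.

```r
library(mlbench)
data(Glass)

cors <- cor(Glass[, -10])
cors[lower.tri(cors, diag = TRUE)] <- NA      # keep each pair only once

# Flatten the matrix into one row per predictor pair
pairs_tbl <- as.data.frame(as.table(cors))
names(pairs_tbl) <- c("var1", "var2", "r")

# Keep pairs above the (assumed) 0.5 cutoff, strongest first
pairs_tbl <- pairs_tbl[!is.na(pairs_tbl$r) & abs(pairs_tbl$r) > 0.5, ]
pairs_tbl[order(-abs(pairs_tbl$r)), ]
```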

(b) Do there appear to be any outliers in the data? Are any predictors skewed?
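library(ggplot2)
library(reshape2)
# melt() keeps the factor Type as the id and stacks the nine predictors
ggplot(melt(Glass), aes(x = variable, y = value)) +
    facet_wrap(~ variable, scales = "free", ncol = 3) +
    geom_boxplot()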

library(e1071)
# Sample skewness of each predictor (column 10 is Type)
skewValues <- apply(Glass[,-10], 2, skewness)
skewValues
        RI         Na         Mg         Al         Si          K 
 1.6140150  0.4509917 -1.1444648  0.9009179 -0.7253173  6.5056358 
        Ca         Ba         Fe 
 2.0326774  3.3924309  1.7420068 

Yes to both. The boxplots show outliers, and the values in skewValues give the direction and magnitude of the skew. K (6.51) and Ba (3.39) are heavily right-skewed, and Ca (2.03), Fe (1.74), and RI (1.61) are also clearly right-skewed. Mg (-1.14) and Si (-0.73) are left-skewed.
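
For reference, both quantities can be computed by hand in base R. The sketch below uses the m3 / s^3 form of sample skewness (the default "type 3" estimator in e1071::skewness) and the usual 1.5 * IQR boxplot rule; the function names and the toy vector are hypothetical.

```r
# Sample skewness as m3 / s^3, with s the usual sample standard deviation
skew_by_hand <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  m3 <- sum((x - mean(x))^3) / n
  m3 / sd(x)^3
}

# Count points beyond 1.5 * IQR from the quartiles (the boxplot whisker rule)
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

toy <- c(1, 2, 3, 4, 100)   # hypothetical right-skewed sample
skew_by_hand(toy)
count_outliers(toy)
```

On the assignment data, sapply(Glass[, -10], count_outliers) would give a per-predictor outlier count to set beside the skewness values.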

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

Several predictors are heavily right-skewed and would benefit from a transformation, and Mg and Si are left-skewed. Two predictors have a correlation above 0.5. Let's apply a Box-Cox transformation, center and scale the data, and then perform PCA.
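
As a reminder of what the Box-Cox step computes, here is a minimal base R sketch of the transform for a fixed lambda. It is not caret's implementation (caret also estimates lambda from the data); the function name box_cox and the lambda values used are illustrative assumptions.

```r
# Box-Cox transform for a strictly positive vector x and a given lambda
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

al <- c(1.10, 1.36, 1.54, 1.29, 1.24)   # first Al values from str(Glass)
box_cox(al, lambda = 0.5)               # lambda = 0.5 is illustrative only
box_cox(al, lambda = 0)                 # the lambda = 0 case is the natural log
```

The positivity check is the important design point: the transform is undefined for zero or negative inputs.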

library(caret)
# Estimate Box-Cox, centering, scaling, and PCA from the nine predictors
transformD <- preProcess(Glass[,-10], method=c("BoxCox", "center", "scale", "pca"))
transformD
Created from 214 samples and 9 variables

Pre-processing:
  - Box-Cox transformation (5)
  - centered (9)
  - ignored (0)
  - principal component signal extraction (9)
  - scaled (9)

Lambda estimates for Box-Cox transformation:
-2, -0.1, 0.5, 2, -1.1
PCA needed 7 components to capture 95 percent of the variance
trans_Glass <- predict(transformD, Glass)
ggpairs(trans_Glass[,-1])

Re-Check Skew Values

skewValues <- apply(trans_Glass[,-1], 2, skewness)
skewValues
        PC1         PC2         PC3         PC4         PC5 
 0.07082340  1.30105063  2.46147761 -0.04549204 -0.24548695 
        PC6         PC7 
-0.15912510  1.75360432 

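PCA components are uncorrelated by construction, so the Box-Cox, centering, scaling, and PCA pipeline eliminated the correlation between the variables. It also reduced the skewness: the largest value dropped from 6.51 (K) to 2.46 (PC3), although PC2, PC3, and PC7 remain right-skewed. Box-Cox was applied to only five of the nine predictors because it requires strictly positive values, and some predictors (for example, Ba and Fe) contain zeros.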
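
The decorrelation can be checked numerically. The sketch below does so on hypothetical toy data with base R's prcomp rather than on the Glass output; the variable names and the seed are arbitrary.

```r
set.seed(1)

# Two deliberately correlated toy predictors
x1 <- rnorm(200)
x2 <- 0.8 * x1 + rnorm(200, sd = 0.5)
toy <- data.frame(x1, x2)
cor(toy$x1, toy$x2)                       # clearly non-zero before PCA

# Center, scale, and rotate onto principal components
pcs <- prcomp(toy, center = TRUE, scale. = TRUE)$x

# Largest off-diagonal correlation among the components
pc_cor <- cor(pcs)
max_offdiag <- max(abs(pc_cor[upper.tri(pc_cor)]))
max_offdiag                               # zero up to floating-point error
```

The same check on the assignment data would be cor(trans_Glass[, -1]), where column 1 is Type.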

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

data(Soybean)
str(Soybean)
'data.frame':   683 obs. of  36 variables:
 $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
 $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
 $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
 $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
 $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
 $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
 $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
 $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
 $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
 $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
 $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
 $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
 $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
 $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
 $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
 $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
 $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
 $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
 $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
summary(Soybean)
                 Class          date     plant.stand  precip   
 brown-spot         : 92   5      :149   0   :354    0   : 74  
 alternarialeaf-spot: 91   4      :131   1   :293    1   :112  
 frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459  
 phytophthora-rot   : 88   2      : 93               NA's: 38  
 anthracnose        : 44   6      : 90                         
 brown-stem-rot     : 44   (Other):101                         
 (Other)            :233   NA's   :  1                         
   temp       hail     crop.hist  area.dam    sever     seed.tmt  
 0   : 80   0   :435   0   : 65   0   :123   0   :195   0   :305  
 1   :374   1   :127   1   :165   1   :227   1   :322   1   :222  
 2   :199   NA's:121   2   :219   2   :145   2   : 45   2   : 35  
 NA's: 30              3   :218   3   :187   NA's:121   NA's:121  
                       NA's: 16   NA's:  1                        
                                                                  
                                                                  
   germ     plant.growth leaves  leaf.halo  leaf.marg  leaf.size 
 0   :165   0   :441     0: 77   0   :221   0   :357   0   : 51  
 1   :213   1   :226     1:606   1   : 36   1   : 21   1   :327  
 2   :193   NA's: 16             2   :342   2   :221   2   :221  
 NA's:112                        NA's: 84   NA's: 84   NA's: 84  
                                                                 
                                                                 
                                                                 
 leaf.shread leaf.malf  leaf.mild    stem     lodging    stem.cankers
 0   :487    0   :554   0   :535   0   :296   0   :520   0   :379    
 1   : 96    1   : 45   1   : 20   1   :371   1   : 42   1   : 39    
 NA's:100    NA's: 84   2   : 20   NA's: 16   NA's:121   2   : 36    
                        NA's:108                         3   :191    
                                                         NA's: 38    
                                                                     
                                                                     
 canker.lesion fruiting.bodies ext.decay  mycelium   int.discolor
 0   :320      0   :473        0   :497   0   :639   0   :581    
 1   : 83      1   :104        1   :135   1   :  6   1   : 44    
 2   :177      NA's:106        2   : 13   NA's: 38   2   : 20    
 3   : 65                      NA's: 38              NA's: 38    
 NA's: 38                                                        
                                                                 
                                                                 
 sclerotia  fruit.pods fruit.spots   seed     mold.growth
 0   :625   0   :407   0   :345    0   :476   0   :524   
 1   : 20   1   :130   1   : 75    1   :115   1   : 67   
 NA's: 38   2   : 14   2   : 57    NA's: 92   NA's: 92   
            3   : 48   4   :100                          
            NA's: 84   NA's:106                          
                                                         
                                                         
 seed.discolor seed.size  shriveling  roots    
 0   :513      0   :532   0   :539   0   :551  
 1   : 64      1   : 59   1   : 38   1   : 86  
 NA's:106      NA's: 92   NA's:106   2   : 15  
                                     NA's: 31  
                                               
                                               
                                               
zero <- nearZeroVar(Soybean)
colnames(Soybean[zero])
[1] "leaf.mild" "mycelium"  "sclerotia"

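The nearZeroVar function flags three predictors as having near-zero variance: leaf.mild, mycelium, and sclerotia. In each of them a single level dominates; for example, mycelium is 0 in 639 of its 645 non-missing samples. Such degenerate predictors carry little information and can destabilize some models, so they are candidates for removal.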
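
The two statistics behind nearZeroVar can be reproduced by hand. This sketch rebuilds mycelium from the counts printed by summary(Soybean) and applies caret's default cutoffs (freqCut = 95/5, uniqueCut = 10); the variable names are chosen for illustration.

```r
# Rebuild mycelium from the summary(Soybean) counts: 639 zeros, 6 ones, 38 NA
mycelium <- factor(c(rep("0", 639), rep("1", 6), rep(NA, 38)))

counts <- sort(table(mycelium), decreasing = TRUE)   # table() drops NA by default

# Frequency ratio: most common level over the second most common
freq_ratio <- as.numeric(counts[1] / counts[2])

# Percent unique: distinct non-missing values as a share of all samples
pct_unique <- 100 * length(counts) / length(mycelium)

freq_ratio
pct_unique

# Flagged when the ratio is above 95/5 and percent unique is below 10
freq_ratio > 95 / 5 && pct_unique < 10
```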

library(summarytools)
apply(Soybean, 2, freq)
Frequencies  

                                    Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
--------------------------------- ------ --------- -------------- --------- --------------
                     2-4-d-injury     16      2.34           2.34      2.34           2.34
              alternarialeaf-spot     91     13.32          15.67     13.32          15.67
                      anthracnose     44      6.44          22.11      6.44          22.11
                 bacterial-blight     20      2.93          25.04      2.93          25.04
                bacterial-pustule     20      2.93          27.96      2.93          27.96
                       brown-spot     92     13.47          41.43     13.47          41.43
                   brown-stem-rot     44      6.44          47.88      6.44          47.88
                     charcoal-rot     20      2.93          50.81      2.93          50.81
                    cyst-nematode     14      2.05          52.86      2.05          52.86
      diaporthe-pod-&-stem-blight     15      2.20          55.05      2.20          55.05
            diaporthe-stem-canker     20      2.93          57.98      2.93          57.98
                     downy-mildew     20      2.93          60.91      2.93          60.91
               frog-eye-leaf-spot     91     13.32          74.23     13.32          74.23
                 herbicide-injury      8      1.17          75.40      1.17          75.40
           phyllosticta-leaf-spot     20      2.93          78.33      2.93          78.33
                 phytophthora-rot     88     12.88          91.22     12.88          91.22
                   powdery-mildew     20      2.93          94.14      2.93          94.14
                purple-seed-stain     20      2.93          97.07      2.93          97.07
             rhizoctonia-root-rot     20      2.93         100.00      2.93         100.00
                             <NA>      0                               0.00         100.00
                            Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0     26      3.81           3.81      3.81           3.81
          1     75     11.00          14.81     10.98          14.79
          2     93     13.64          28.45     13.62          28.40
          3    118     17.30          45.75     17.28          45.68
          4    131     19.21          64.96     19.18          64.86
          5    149     21.85          86.80     21.82          86.68
          6     90     13.20         100.00     13.18          99.85
       <NA>      1                               0.15         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    354     54.71          54.71     51.83          51.83
          1    293     45.29         100.00     42.90          94.73
       <NA>     36                               5.27         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0     74     11.47          11.47     10.83          10.83
          1    112     17.36          28.84     16.40          27.23
          2    459     71.16         100.00     67.20          94.44
       <NA>     38                               5.56         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0     80     12.25          12.25     11.71          11.71
          1    374     57.27          69.53     54.76          66.47
          2    199     30.47         100.00     29.14          95.61
       <NA>     30                               4.39         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    435     77.40          77.40     63.69          63.69
          1    127     22.60         100.00     18.59          82.28
       <NA>    121                              17.72         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0     65      9.75           9.75      9.52           9.52
          1    165     24.74          34.48     24.16          33.67
          2    219     32.83          67.32     32.06          65.74
          3    218     32.68         100.00     31.92          97.66
       <NA>     16                               2.34         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    123     18.04          18.04     18.01          18.01
          1    227     33.28          51.32     33.24          51.24
          2    145     21.26          72.58     21.23          72.47
          3    187     27.42         100.00     27.38          99.85
       <NA>      1                               0.15         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    195     34.70          34.70     28.55          28.55
          1    322     57.30          91.99     47.14          75.70
          2     45      8.01         100.00      6.59          82.28
       <NA>    121                              17.72         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    305     54.27          54.27     44.66          44.66
          1    222     39.50          93.77     32.50          77.16
          2     35      6.23         100.00      5.12          82.28
       <NA>    121                              17.72         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    165     28.90          28.90     24.16          24.16
          1    213     37.30          66.20     31.19          55.34
          2    193     33.80         100.00     28.26          83.60
       <NA>    112                              16.40         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    441     66.12          66.12     64.57          64.57
          1    226     33.88         100.00     33.09          97.66
       <NA>     16                               2.34         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0     77     11.27          11.27     11.27          11.27
          1    606     88.73         100.00     88.73         100.00
       <NA>      0                               0.00         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    221     36.89          36.89     32.36          32.36
          1     36      6.01          42.90      5.27          37.63
          2    342     57.10         100.00     50.07          87.70
       <NA>     84                              12.30         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    357     59.60          59.60     52.27          52.27
          1     21      3.51          63.11      3.07          55.34
          2    221     36.89         100.00     32.36          87.70
       <NA>     84                              12.30         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0     51      8.51           8.51      7.47           7.47
          1    327     54.59          63.11     47.88          55.34
          2    221     36.89         100.00     32.36          87.70
       <NA>     84                              12.30         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    487     83.53          83.53     71.30          71.30
          1     96     16.47         100.00     14.06          85.36
       <NA>    100                              14.64         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    554     92.49          92.49     81.11          81.11
          1     45      7.51         100.00      6.59          87.70
       <NA>     84                              12.30         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    535     93.04          93.04     78.33          78.33
          1     20      3.48          96.52      2.93          81.26
          2     20      3.48         100.00      2.93          84.19
       <NA>    108                              15.81         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    296     44.38          44.38     43.34          43.34
          1    371     55.62         100.00     54.32          97.66
       <NA>     16                               2.34         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    520     92.53          92.53     76.13          76.13
          1     42      7.47         100.00      6.15          82.28
       <NA>    121                              17.72         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    379     58.76          58.76     55.49          55.49
          1     39      6.05          64.81      5.71          61.20
          2     36      5.58          70.39      5.27          66.47
          3    191     29.61         100.00     27.96          94.44
       <NA>     38                               5.56         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    320     49.61          49.61     46.85          46.85
          1     83     12.87          62.48     12.15          59.00
          2    177     27.44          89.92     25.92          84.92
          3     65     10.08         100.00      9.52          94.44
       <NA>     38                               5.56         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    473     81.98          81.98     69.25          69.25
          1    104     18.02         100.00     15.23          84.48
       <NA>    106                              15.52         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    497     77.05          77.05     72.77          72.77
          1    135     20.93          97.98     19.77          92.53
          2     13      2.02         100.00      1.90          94.44
       <NA>     38                               5.56         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    639     99.07          99.07     93.56          93.56
          1      6      0.93         100.00      0.88          94.44
       <NA>     38                               5.56         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    581     90.08          90.08     85.07          85.07
          1     44      6.82          96.90      6.44          91.51
          2     20      3.10         100.00      2.93          94.44
       <NA>     38                               5.56         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    625     96.90          96.90     91.51          91.51
          1     20      3.10         100.00      2.93          94.44
       <NA>     38                               5.56         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    407     67.95          67.95     59.59          59.59
          1    130     21.70          89.65     19.03          78.62
          2     14      2.34          91.99      2.05          80.67
          3     48      8.01         100.00      7.03          87.70
       <NA>     84                              12.30         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    345     59.79          59.79     50.51          50.51
          1     75     13.00          72.79     10.98          61.49
          2     57      9.88          82.67      8.35          69.84
          4    100     17.33         100.00     14.64          84.48
       <NA>    106                              15.52         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    476     80.54          80.54     69.69          69.69
          1    115     19.46         100.00     16.84          86.53
       <NA>     92                              13.47         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    524     88.66          88.66     76.72          76.72
          1     67     11.34         100.00      9.81          86.53
       <NA>     92                              13.47         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    513     88.91          88.91     75.11          75.11
          1     64     11.09         100.00      9.37          84.48
       <NA>    106                              15.52         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    532     90.02          90.02     77.89          77.89
          1     59      9.98         100.00      8.64          86.53
       <NA>     92                              13.47         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    539     93.41          93.41     78.92          78.92
          1     38      6.59         100.00      5.56          84.48
       <NA>    106                              15.52         100.00
      Total    683    100.00         100.00    100.00         100.00

Type: Character  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
          0    551     84.51          84.51     80.67          80.67
          1     86     13.19          97.70     12.59          93.27
          2     15      2.30         100.00      2.20          95.46
       <NA>     31                               4.54         100.00
      Total    683    100.00         100.00    100.00         100.00
(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
library(naniar)
gg_miss_var(Soybean)

library(dplyr)
# Keep rows with at least one missing value, then count them per class
inc <- Soybean[!complete.cases(Soybean), ]
inc %>%
  group_by(Class) %>%
  dplyr::summarize(Count = n())
# A tibble: 5 x 2
  Class                       Count
  <fct>                       <int>
1 2-4-d-injury                   16
2 cyst-nematode                  14
3 diaporthe-pod-&-stem-blight    15
4 herbicide-injury                8
5 phytophthora-rot               68

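All 121 incomplete rows fall in just five classes, so the pattern of missing data is strongly related to class. Phytophthora-rot has the most incomplete rows (68 of its 88 samples), while 2-4-d-injury (16), cyst-nematode (14), diaporthe-pod-&-stem-blight (15), and herbicide-injury (8) are incomplete for every one of their samples.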
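
Both halves of the question can also be answered numerically in base R. The following is a sketch, with na_per_predictor and incomplete_rate as illustrative names: the first ranks predictors by number of missing values, the second gives the share of each class's rows that are incomplete.

```r
library(mlbench)
data(Soybean)

# Missing values per column, most affected first
na_per_predictor <- sort(colSums(is.na(Soybean)), decreasing = TRUE)
head(na_per_predictor)

# Share of rows in each class that have at least one missing value
incomplete_rate <- tapply(!complete.cases(Soybean), Soybean$Class, mean)
round(100 * incomplete_rate[incomplete_rate > 0], 1)
```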

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The variables with near-zero variance should be removed from the data.

Soybean_new <- Soybean %>% dplyr::select(-leaf.mild, -mycelium, -sclerotia)

We can use imputation for the missing values.

The method being applied will be predictive mean matching from the mice package.

For additional information:

https://statisticsglobe.com/predictive-mean-matching-imputation-method/
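
The idea behind predictive mean matching can be sketched in a few lines of base R. This is a simplified illustration on hypothetical numeric toy data, not what mice does internally: mice additionally draws the regression parameters at random for each imputation, which this sketch omits. The donor count k = 5 matches the default donor count of mice's pmm method.

```r
set.seed(42)

# Hypothetical toy data: y depends on x, and ten y values are missing
x <- runif(50, 0, 10)
y <- 2 * x + rnorm(50)
y[sample(50, 10)] <- NA

obs <- !is.na(y)
fit <- lm(y ~ x, subset = obs)                      # model fitted on observed rows
pred <- predict(fit, newdata = data.frame(x = x))   # predictions for every row

k <- 5          # number of candidate donors per missing value
y_imp <- y
for (i in which(!obs)) {
  # Observed rows whose predicted value is closest to row i's prediction
  donors <- order(abs(pred[obs] - pred[i]))[1:k]
  # Borrow the observed y of one randomly chosen donor
  y_imp[i] <- y[obs][donors[sample.int(k, 1)]]
}

anyNA(y_imp)
```

Because every imputed value is copied from an observed case, the imputations always stay within the range of values that actually occur in the data.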

library(mice)
imp_soy <- mice(Soybean_new, m = 5, method = "pmm")

 iter imp variable
  1   1  date  plant.stand  precip*  temp*  hail*  crop.hist*  area.dam  sever*  seed.tmt*  germ*  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread*  leaf.malf  stem*  lodging*  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor*  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
  1   2  date  plant.stand*  precip  temp  hail*  crop.hist  area.dam  sever*  seed.tmt*  germ*  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread*  leaf.malf  stem  lodging*  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor*  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
  1   3  date  plant.stand*  precip  temp  hail*  crop.hist  area.dam  sever*  seed.tmt*  germ*  plant.growth  leaf.halo  leaf.marg  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging*  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor*  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
  1   4  date  plant.stand  precip*  temp  hail*  crop.hist  area.dam  sever*  seed.tmt*  germ*  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread*  leaf.malf  stem*  lodging*  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
  1   5  date  plant.stand*  precip*  temp  hail*  crop.hist  area.dam  sever*  seed.tmt*  germ*  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread*  leaf.malf  stem*  lodging*  stem.cankers*  canker.lesion  fruiting.bodies*  ext.decay  int.discolor  fruit.pods  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
  2   1  date*  plant.stand*  precip*  temp*  hail*  crop.hist*  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.marg*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging*  stem.cankers  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor*  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
  2   2  date*  plant.stand*  precip*  temp*  hail*  crop.hist*  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.marg*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging*  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
  2   3  date*  plant.stand*  precip*  temp*  hail*  crop.hist*  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.marg*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging*  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
  2   4  date*  plant.stand*  precip*  temp*  hail*  crop.hist*  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.marg*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging*  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
  2   5  date*  plant.stand*  precip*  temp*  hail*  crop.hist*  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.marg*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging*  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
  3   1  date*  plant.stand*  precip  temp*  hail*  crop.hist*  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo  leaf.marg*  leaf.size*  leaf.shread  leaf.malf*  stem*  lodging*  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor*  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
  3   2  date*  plant.stand*  precip*  temp*  hail*  crop.hist*  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo  leaf.marg  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging*  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor*  fruit.pods  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
  3   3  date*  plant.stand*  precip*  temp*  hail*  crop.hist*  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.marg*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging*  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor*  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
  3   4  date*  plant.stand*  precip*  temp*  hail*  crop.hist*  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo  leaf.marg*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging*  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor*  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots
  3   5  date*  plant.stand*  precip*  temp*  hail*  crop.hist*  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.marg*  leaf.size  leaf.shread*  leaf.malf*  stem*  lodging*  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
  4   1  date*  plant.stand*  precip*  temp*  hail*  crop.hist*  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.marg*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging*  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
  4   2  date*  plant.stand*  precip*  temp*  hail*  crop.hist*  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo  leaf.marg*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging*  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
  4   3  date*  plant.stand*  precip*  temp*  hail*  crop.hist*  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.marg*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging*  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
  4   4  date*  plant.stand*  precip*  temp*  hail*  crop.hist*  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.marg  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging*  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor*  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
  4   5  date*  plant.stand*  precip*  temp*  hail*  crop.hist*  area.dam  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo  leaf.marg*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging*  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor*  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
  5   1  date*  plant.stand*  precip*  temp*  hail*  crop.hist*  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.marg*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging*  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
  5   2  date*  plant.stand*  precip*  temp*  hail*  crop.hist*  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.marg*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging*  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor*  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
  5   3  date*  plant.stand  precip*  temp*  hail*  crop.hist*  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.marg*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging*  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor*  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
  5   4  date*  plant.stand*  precip*  temp*  hail*  crop.hist*  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.marg*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging*  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor*  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
  5   5  date  plant.stand*  precip*  temp*  hail*  crop.hist*  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.marg*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging*  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor*  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling  roots*
Show code
# Extract the first of the five imputed data sets (the default, action = 1)
multi_imp <- complete(imp_soy)
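
To use all five imputations instead of a single completed data set, complete() accepts an action argument. The sketch below reruns the imputation so that it is self-contained; seed = 624 and printFlag = FALSE are assumptions added here for reproducibility and quieter output, and the names first_imp and all_imp are illustrative.

```r
library(mlbench)
library(mice)
data(Soybean)

# Same three predictors removed as in the text
drop_cols <- c("leaf.mild", "mycelium", "sclerotia")
Soybean_new <- Soybean[, !(names(Soybean) %in% drop_cols)]

# seed is an arbitrary choice; printFlag = FALSE hides the iteration log
imp_soy <- mice(Soybean_new, m = 5, method = "pmm",
                seed = 624, printFlag = FALSE)

# One completed data set (the first imputation)
first_imp <- mice::complete(imp_soy, action = 1)

# All five imputations stacked, with .imp identifying each one
all_imp <- mice::complete(imp_soy, action = "long")
table(all_imp$.imp)

# Columns of the first completed data set that still contain NAs, if any
na_left <- colSums(is.na(first_imp))
na_left[na_left > 0]
```

Keeping the stacked version makes it possible to check how much the imputed values vary between imputations before committing to one of them.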

Check that there are no more missing values.

Show code
inc2 <- multi_imp[!complete.cases(multi_imp), ]
inc2 %>%
  group_by(Class) %>%
  dplyr::summarize(Count = n())
# A tibble: 0 x 2
# ... with 2 variables: Class <fct>, Count <int>