DATA 624: Home Work 04

Libraries
Exercise 3.1
Exercise 3.2

Libraries

library(tidyverse)
library(corrplot)
library(kableExtra)
library(reshape2)
library(caret)
library(Amelia)

Exercise 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe

The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a)Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

First of all, let’s do some data basic exploration

head(Glass)

##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1

# Check for null values
any(is.null(Glass))

## [1] FALSE

No null found in the data frame

# Check for missing values NA's
any(is.na(Glass))

## [1] FALSE

No missing values found in the data frame

In order to see predictor variables distribution we need to pivot the data.

#pivot the data
long_glass <- Glass %>%
  pivot_longer(-Type, names_to = "Predictor", values_to = "Value", values_drop_na = TRUE) %>%
  mutate(Predictor = as.factor(Predictor))
head(long_glass)

## # A tibble: 6 x 3
##   Type  Predictor Value
##   <fct> <fct>     <dbl>
## 1 1     RI         1.52
## 2 1     Na        13.6 
## 3 1     Mg         4.49
## 4 1     Al         1.1 
## 5 1     Si        71.8 
## 6 1     K          0.06

We are using histogram to see the distribution of predictor variables.

long_glass %>%
  ggplot(aes(Value, color = Predictor, fill = Predictor)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ Predictor, ncol = 3, scales = "free") +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  theme_light() +
  theme(legend.position = "none") +
  ggtitle("Distribution of Predictor Variables")

We can see high concentrations of silica (Si), sodium (Na) and lime (Ca) in the histogram which are the main raw matreials of glass.

Source: https://www.cmog.org/article/chemistry-glass

Let’s look at the predictor variables relationship using corrplot

corrplot(cor(Glass[,1:9]), method="color",  type="upper", order="hclust", addCoef.col = "black", tl.col="black", tl.srt=45, diag=FALSE)

We can see from above that most of the predictors are negatively correlated. Here, data represent chemical concentration on a percentage scale. As a result we can see when one element increases other element decreases.

In addtion, we observed that most of the correlations are not very strong except correlation between calcium and the refractive index which are strongly positively correlated.

(b)Do there appear to be any outliers in the data? Are any predictors skewed?

Let’s look at the basic statistics of the predictor variables.

Predictor	Min	1st Qu.	Median	Mean	3rd Qu.	Max
Al	0.29000	1.190000	1.36000	1.4449065	1.630000	3.50000
Ba	0.00000	0.000000	0.00000	0.1750467	0.000000	3.15000
Ca	5.43000	8.240000	8.60000	8.9569626	9.172500	16.19000
Fe	0.00000	0.000000	0.00000	0.0570093	0.100000	0.51000
K	0.00000	0.122500	0.55500	0.4970561	0.610000	6.21000
Mg	0.00000	2.115000	3.48000	2.6845327	3.600000	4.49000
Na	10.73000	12.907500	13.30000	13.4078505	13.825000	17.38000
RI	1.51115	1.516522	1.51768	1.5183654	1.519157	1.53393
Si	69.81000	72.280000	72.79000	72.6509346	73.087500	75.41000

I want to see how the predictors are distributed by the type of glass. I will use a scatter plot to do this but will be excluding scilica because of the difference in scale.

long_glass %>%
  ggplot(aes(x = Type, y = Value, color = Predictor)) +
  geom_jitter() +
  ylim(0, 20) + 
  scale_color_brewer(palette = "Set1") +
  theme_light()

## Warning: Removed 415 rows containing missing values (geom_point).

It looks like glass type 1, 2 and 3 are very similar in chemical composition. There are a couple of observations that appear to be outliers. For example there are a couple of potasium (K) observations in the type 5 glass that are unusually high. There is a barium (Ba) observation in type 2 glass that apears to be an outlier along with some calcium (Ca) observations in type 2 glass.

Magnesium is bimodal and left skewed. Iron, potasium and barium are right skewed. The other predictors are somewhat normal.

(c)Are there any relevant transformations of one or more predictors that might improve the classification model?

Yes, transformations like a log or a Box Cox could help improve the classification model.
Removing skew is removing outliers that improves a model’s performance.
Also, centering and scaling can be important for all variables with any model.
Checking if there are any missing values in any columns that can cause a delay or miscalculate or need to addressed by imputation/removal/or other means. We can confirm there are no missing value in this dataset.

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

library(mlbench)
data(Soybean)
## See ?Soybean for details

(a)Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

First of all, let’s do some data basic exploration

str(Soybean)

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

summary(Soybean)

##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ    
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165  
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213  
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193  
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112  
##             NA's: 16   NA's:  1                                   
##                                                                   
##                                                                   
##  plant.growth leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread
##  0   :441     0: 77   0   :221   0   :357   0   : 51   0   :487   
##  1   :226     1:606   1   : 36   1   : 21   1   :327   1   : 96   
##  NA's: 16             2   :342   2   :221   2   :221   NA's:100   
##                       NA's: 84   NA's: 84   NA's: 84              
##                                                                   
##                                                                   
##                                                                   
##  leaf.malf  leaf.mild    stem     lodging    stem.cankers canker.lesion
##  0   :554   0   :535   0   :296   0   :520   0   :379     0   :320     
##  1   : 45   1   : 20   1   :371   1   : 42   1   : 39     1   : 83     
##  NA's: 84   2   : 20   NA's: 16   NA's:121   2   : 36     2   :177     
##             NA's:108                         3   :191     3   : 65     
##                                              NA's: 38     NA's: 38     
##                                                                        
##                                                                        
##  fruiting.bodies ext.decay  mycelium   int.discolor sclerotia  fruit.pods
##  0   :473        0   :497   0   :639   0   :581     0   :625   0   :407  
##  1   :104        1   :135   1   :  6   1   : 44     1   : 20   1   :130  
##  NA's:106        2   : 13   NA's: 38   2   : 20     NA's: 38   2   : 14  
##                  NA's: 38              NA's: 38                3   : 48  
##                                                                NA's: 84  
##                                                                          
##                                                                          
##  fruit.spots   seed     mold.growth seed.discolor seed.size  shriveling
##  0   :345    0   :476   0   :524    0   :513      0   :532   0   :539  
##  1   : 75    1   :115   1   : 67    1   : 64      1   : 59   1   : 38  
##  2   : 57    NA's: 92   NA's: 92    NA's:106      NA's: 92   NA's:106  
##  4   :100                                                              
##  NA's:106                                                              
##                                                                        
##                                                                        
##   roots    
##  0   :551  
##  1   : 86  
##  2   : 15  
##  NA's: 31  
##            
##            
##

head(Soybean)

##                   Class date plant.stand precip temp hail crop.hist
## 1 diaporthe-stem-canker    6           0      2    1    0         1
## 2 diaporthe-stem-canker    4           0      2    1    0         2
## 3 diaporthe-stem-canker    3           0      2    1    0         1
## 4 diaporthe-stem-canker    3           0      2    1    0         1
## 5 diaporthe-stem-canker    6           0      2    1    0         2
## 6 diaporthe-stem-canker    5           0      2    1    0         3
##   area.dam sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg
## 1        1     1        0    0            1      1         0         2
## 2        0     2        1    1            1      1         0         2
## 3        0     2        1    2            1      1         0         2
## 4        0     2        0    1            1      1         0         2
## 5        0     1        0    2            1      1         0         2
## 6        0     1        0    1            1      1         0         2
##   leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers
## 1         2           0         0         0    1       1            3
## 2         2           0         0         0    1       0            3
## 3         2           0         0         0    1       0            3
## 4         2           0         0         0    1       0            3
## 5         2           0         0         0    1       0            3
## 6         2           0         0         0    1       0            3
##   canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia
## 1             1               1         1        0            0         0
## 2             1               1         1        0            0         0
## 3             0               1         1        0            0         0
## 4             0               1         1        0            0         0
## 5             1               1         1        0            0         0
## 6             0               1         1        0            0         0
##   fruit.pods fruit.spots seed mold.growth seed.discolor seed.size
## 1          0           4    0           0             0         0
## 2          0           4    0           0             0         0
## 3          0           4    0           0             0         0
## 4          0           4    0           0             0         0
## 5          0           4    0           0             0         0
## 6          0           4    0           0             0         0
##   shriveling roots
## 1          0     0
## 2          0     0
## 3          0     0
## 4          0     0
## 5          0     0
## 6          0     0

# Check for null values
any(is.null(Soybean))

## [1] FALSE

No null found in the data frame

# Check for missing values NA's
any(is.na(Soybean))

## [1] TRUE

There are missing values found in the data frame

Soybean[, 2:length(Soybean)] <- lapply(Soybean[, 2:length(Soybean)], function(x) as.numeric(as.character(x)))

ggplot(data=melt(Soybean), mapping=aes(x = value)) +
geom_bar() +
facet_wrap(~variable, scales = 'free_x')

## Using Class as id variables

## Warning: Removed 2337 rows containing non-finite values (stat_count).

I am assuming the degenerate distibuted variables discussed earlier in the chapter refers to section 3.5 on removing predictors which states:

• The fraction of unique values over the sample size is low (say 10 %).
• The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20).

There’s a lot of missing variables. The authors recommended removing variables with near zero variance. The caret package has a function for that. Here’s the output from that function:

nearZeroVar(Soybean, saveMetrics = T) %>%
  kable() %>%
  kable_styling()

	freqRatio	percentUnique	zeroVar	nzv
Class	1.010989	2.7818448	FALSE	FALSE
date	1.137405	1.0248902	FALSE	FALSE
plant.stand	1.208191	0.2928258	FALSE	FALSE
precip	4.098214	0.4392387	FALSE	FALSE
temp	1.879397	0.4392387	FALSE	FALSE
hail	3.425197	0.2928258	FALSE	FALSE
crop.hist	1.004587	0.5856515	FALSE	FALSE
area.dam	1.213904	0.5856515	FALSE	FALSE
sever	1.651282	0.4392387	FALSE	FALSE
seed.tmt	1.373874	0.4392387	FALSE	FALSE
germ	1.103627	0.4392387	FALSE	FALSE
plant.growth	1.951327	0.2928258	FALSE	FALSE
leaves	7.870130	0.2928258	FALSE	FALSE
leaf.halo	1.547511	0.4392387	FALSE	FALSE
leaf.marg	1.615385	0.4392387	FALSE	FALSE
leaf.size	1.479638	0.4392387	FALSE	FALSE
leaf.shread	5.072917	0.2928258	FALSE	FALSE
leaf.malf	12.311111	0.2928258	FALSE	FALSE
leaf.mild	26.750000	0.4392387	FALSE	TRUE
stem	1.253378	0.2928258	FALSE	FALSE
lodging	12.380952	0.2928258	FALSE	FALSE
stem.cankers	1.984293	0.5856515	FALSE	FALSE
canker.lesion	1.807910	0.5856515	FALSE	FALSE
fruiting.bodies	4.548077	0.2928258	FALSE	FALSE
ext.decay	3.681481	0.4392387	FALSE	FALSE
mycelium	106.500000	0.2928258	FALSE	TRUE
int.discolor	13.204546	0.4392387	FALSE	FALSE
sclerotia	31.250000	0.2928258	FALSE	TRUE
fruit.pods	3.130769	0.5856515	FALSE	FALSE
fruit.spots	3.450000	0.5856515	FALSE	FALSE
seed	4.139130	0.2928258	FALSE	FALSE
mold.growth	7.820895	0.2928258	FALSE	FALSE
seed.discolor	8.015625	0.2928258	FALSE	FALSE
seed.size	9.016949	0.2928258	FALSE	FALSE
shriveling	14.184211	0.2928258	FALSE	FALSE
roots	6.406977	0.4392387	FALSE	FALSE

There are three variables (leaf.mild, mycelium, sclerotia) that have a near zero variance, and should probably be removed.

(b)Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Soybean %>%
  arrange(Class) %>%
  missmap(main = "Missing vs Observed")

There are blocks of observations that are missing. Since the data are arranged by the classes this suggests that the patterns of missing data are related to the classes which can also be varified by below

Soybean$na_count <- apply(Soybean, 1, function(x) sum(is.na(x)))
(Soybean %>%
  select(Class, na_count) %>%
  group_by(Class) %>%
  summarise(na_count = sum(na_count)) %>%
  arrange(desc(na_count)))

## # A tibble: 19 x 2
##    Class                       na_count
##    <fct>                          <int>
##  1 phytophthora-rot                1214
##  2 2-4-d-injury                     450
##  3 cyst-nematode                    336
##  4 diaporthe-pod-&-stem-blight      177
##  5 herbicide-injury                 160
##  6 alternarialeaf-spot                0
##  7 anthracnose                        0
##  8 bacterial-blight                   0
##  9 bacterial-pustule                  0
## 10 brown-spot                         0
## 11 brown-stem-rot                     0
## 12 charcoal-rot                       0
## 13 diaporthe-stem-canker              0
## 14 downy-mildew                       0
## 15 frog-eye-leaf-spot                 0
## 16 phyllosticta-leaf-spot             0
## 17 powdery-mildew                     0
## 18 purple-seed-stain                  0
## 19 rhizoctonia-root-rot               0

(c)Develop a strategy for handling missing data, either by eliminating predictors or imputation.

I will be eliminating the three near zero variance predictiors. For all other predictors I will be imputing values. I don’t have any domain expertise that would inform the imputations, so I will be using decision trees to (hopefully) produce good imputations. It has been my experience that decision trees preform really well. The dlookr package integrates well with the tidyverse.

library(dlookr)

## Warning: package 'dlookr' was built under R version 3.6.3

## Loading required package: mice

## Warning: package 'mice' was built under R version 3.6.3

## 
## Attaching package: 'mice'

## The following objects are masked from 'package:base':
## 
##     cbind, rbind

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## This version of Shiny is designed to work with 'htmlwidgets' >= 1.5.
##     Please upgrade via install.packages('htmlwidgets').

## 
## Attaching package: 'dlookr'

## The following object is masked from 'package:base':
## 
##     transform

Soybean_complete <- Soybean %>%
  # Impute missing values using rpart
  mutate(
    date = imputate_na(Soybean, date, Class, method = "rpart", no_attrs = TRUE),
    plant.stand = imputate_na(Soybean, plant.stand, Class, method = "rpart", no_attrs = TRUE),
    precip = imputate_na(Soybean, precip, Class, method = "rpart", no_attrs = TRUE),
    temp = imputate_na(Soybean, temp, Class, method = "rpart", no_attrs = TRUE),
    hail = imputate_na(Soybean, hail, Class, method = "rpart", no_attrs = TRUE),
    crop.hist = imputate_na(Soybean, crop.hist, Class, method = "rpart", no_attrs = TRUE),
    area.dam = imputate_na(Soybean, area.dam, Class, method = "rpart", no_attrs = TRUE),
    sever = imputate_na(Soybean, sever, Class, method = "rpart", no_attrs = TRUE),
    seed.tmt = imputate_na(Soybean, seed.tmt, Class, method = "rpart", no_attrs = TRUE),
    germ = imputate_na(Soybean, germ, Class, method = "rpart", no_attrs = TRUE),
    plant.growth = imputate_na(Soybean, plant.growth, Class, method = "rpart", no_attrs = TRUE),
    leaf.halo = imputate_na(Soybean, leaf.halo, Class, method = "rpart", no_attrs = TRUE),
    leaf.marg = imputate_na(Soybean, leaf.marg, Class, method = "rpart", no_attrs = TRUE),
    leaf.size = imputate_na(Soybean, leaf.size, Class, method = "rpart", no_attrs = TRUE),
    leaf.shread = imputate_na(Soybean, leaf.shread, Class, method = "rpart", no_attrs = TRUE),
    leaf.malf = imputate_na(Soybean, leaf.malf, Class, method = "rpart", no_attrs = TRUE),
    stem = imputate_na(Soybean, stem, Class, method = "rpart", no_attrs = TRUE),
    lodging = imputate_na(Soybean, lodging, Class, method = "rpart", no_attrs = TRUE),
    stem.cankers = imputate_na(Soybean, stem.cankers, Class, method = "rpart", no_attrs = TRUE),
    canker.lesion = imputate_na(Soybean, canker.lesion, Class, method = "rpart", no_attrs = TRUE),
    fruiting.bodies = imputate_na(Soybean, fruiting.bodies, Class, method = "rpart", no_attrs = TRUE),
    ext.decay = imputate_na(Soybean, ext.decay, Class, method = "rpart", no_attrs = TRUE),
    int.discolor = imputate_na(Soybean, int.discolor, Class, method = "rpart", no_attrs = TRUE),
    fruit.pods = imputate_na(Soybean, fruit.pods, Class, method = "rpart", no_attrs = TRUE),
    seed = imputate_na(Soybean, seed, Class, method = "rpart", no_attrs = TRUE),
    mold.growth = imputate_na(Soybean, mold.growth, Class, method = "rpart", no_attrs = TRUE),
    seed.discolor = imputate_na(Soybean, seed.discolor, Class, method = "rpart", no_attrs = TRUE),
    seed.size = imputate_na(Soybean, seed.size, Class, method = "rpart", no_attrs = TRUE),
    shriveling = imputate_na(Soybean, shriveling, Class, method = "rpart", no_attrs = TRUE),
    fruit.spots = imputate_na(Soybean, fruit.spots, Class, method = "rpart", no_attrs = TRUE),
    roots = imputate_na(Soybean, roots, Class, method = "rpart", no_attrs = TRUE)) %>%
  # Drop the near zero variance predictors
  select(-leaf.mild, -mycelium, -sclerotia)

## Warning: Unquoting language objects with `!!!` is deprecated as of rlang 0.4.0.
## Please use `!!` instead.
## 
##   # Bad:
##   dplyr::select(data, !!!enquo(x))
## 
##   # Good:
##   dplyr::select(data, !!enquo(x))    # Unquote single quosure
##   dplyr::select(data, !!!enquos(x))  # Splice list of quosures
## 
## This warning is displayed once per session.

We can prove that it worked by presenting the following

Soybean_complete %>%
  arrange(Class) %>%
  missmap(main = "Missing vs Observed")