3.1

The UC Irvine Machine Learning Repository contains a dataset related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
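library(tidyr)    # pivot_longer() and the %>% pipe
library(ggplot2)  # plotting

# Reshape to long format: one row per (sample, predictor) pair
Glass_long <- pivot_longer(Glass, cols = c(RI, Na, Mg, Al, Si, K, Ca, Ba, Fe), 
                           names_to = "Predictor", values_to = "Value")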

ggplot(Glass_long, aes(x = Value)) +
  geom_histogram(fill = "lightblue", color = "black", bins = 20) +
  facet_wrap(~ Predictor, scales = "free", ncol = 3) +
  theme_minimal() +
  labs(title = "Histograms of Glass Predictors")

ggplot(Glass_long, aes(y = Value)) +
  geom_boxplot(fill = "lightblue") +
  facet_wrap(~ Predictor, scales = "free", ncol = 3) +
  theme_minimal() +
  labs(title = "Boxplots of Glass Predictors")
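
The histograms and boxplots above describe each predictor on its own; the relationships between predictors can also be summarized with pairwise correlations. The following is a minimal sketch, not part of the original analysis: `high_cor_pairs()` is a hypothetical helper, the 0.5 cutoff is an arbitrary assumption, and the toy data frame only stands in for the numeric predictors (for example `Glass[, 1:9]`).

```r
# Illustrative helper (hypothetical, not from the original analysis):
# list predictor pairs whose absolute Pearson correlation exceeds a cutoff.
high_cor_pairs <- function(df, cutoff = 0.5) {
  cm <- cor(df, use = "pairwise.complete.obs")
  # Keep each pair once by looking only at the upper triangle
  idx <- which(upper.tri(cm) & abs(cm) > cutoff, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), , drop = FALSE]
}

# Toy data: b is built from a, so that pair should be reported;
# c is independent noise.
set.seed(1)
toy <- data.frame(a = rnorm(200))
toy$b <- 2 * toy$a + rnorm(200, sd = 0.1)
toy$c <- rnorm(200)

toy_pairs <- high_cor_pairs(toy, cutoff = 0.5)
print(toy_pairs)
```

Applied to the glass predictors, a table like this would point to the pairs worth inspecting in a scatterplot.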

  1. Do there appear to be any outliers in the data? Are any predictors skewed?
  1. Are there any relevant transformations of one or more predictors that might improve the classification model?

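Transformation type:

I applied a Yeo-Johnson transformation, followed by centering and scaling, to the nine predictors in Glass. Unlike Box-Cox, Yeo-Johnson accepts zero values, which occur in predictors such as Ba and Fe. Based on the skewness values printed below and the plots of the transformed data:

- Approximately symmetric: Al, Ca, K, and Na
- Moderately left-skewed: Mg and Si
- Still right-skewed: Ba, Fe, and RI
- Outliers still visible: Al, Ba, Ca, Fe, Na, RI, and Si

The Yeo-Johnson transformation likely helps classification models that are sensitive to skewed predictors, because it makes most of the distributions more symmetric. Ba, Fe, and RI remain skewed, however, and the benefit has not been verified by fitting a model.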
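
To make the transformation concrete, here is a minimal sketch of the Yeo-Johnson formula for a fixed `lambda`. It is an illustration only: `yeo_johnson()` is a hypothetical helper, and in the analysis itself `preProcess()` estimates a separate `lambda` for each predictor rather than taking it as an argument.

```r
# Yeo-Johnson transformation for a given lambda (illustrative sketch).
#   y >= 0: ((y + 1)^lambda - 1) / lambda,              or log(y + 1)  if lambda = 0
#   y <  0: -((1 - y)^(2 - lambda) - 1) / (2 - lambda), or -log(1 - y) if lambda = 2
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log1p(y[pos])
  }
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -(((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda))
  } else {
    out[!pos] <- -log1p(-y[!pos])
  }
  out
}

x <- c(-2, 0, 0, 0.26, 3.15)
yj_identity <- yeo_johnson(x, lambda = 1)  # lambda = 1 leaves the values unchanged
yj_log      <- yeo_johnson(x, lambda = 0)  # log(y + 1) for the non-negative values
```

Because the transformation is monotone, tied values stay tied: a predictor with a large block of zeros still has a spike after transformation, which is one plausible reason a predictor such as Ba or Fe can remain skewed.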

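library(caret)

# Box-Cox requires strictly positive values, so shift predictors that contain
# zeros by a small constant before estimating lambda.
# Note: Glass_transformed is kept for comparison only; the skewness values and
# plots below use the Yeo-Johnson result in transformed_data.
Glass_transformed <- Glass
for (pred in c("Ba", "Fe", "K")) {
  if (any(Glass[[pred]] <= 0)) {
    Glass_transformed[[pred]] <- Glass[[pred]] + 0.1
  }
  bc_trans <- BoxCoxTrans(Glass_transformed[[pred]])
  Glass_transformed[[pred]] <- predict(bc_trans, Glass_transformed[[pred]])
}

# Yeo-Johnson handles zeros directly; also center and scale the nine predictors.
preproc <- preProcess(Glass[, 1:9], method = c("YeoJohnson", "center", "scale"))
transformed_data <- predict(preproc, Glass[, 1:9])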

post_skewness <- apply(transformed_data, 2, e1071::skewness)
print(post_skewness)
##            RI            Na            Mg            Al            Si 
##  1.6027150827 -0.0088476749 -0.8770969306  0.0002128329 -0.7202392108 
##             K            Ca            Ba            Fe 
## -0.0708227694 -0.2063893005  3.3686799688  1.7298107096
transformed_long <- as.data.frame(transformed_data) %>%
  pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Value")

ggplot(transformed_long, aes(x = Value)) +
  geom_histogram(fill = "lightgreen", color = "black", bins = 20) +
  facet_wrap(~ Predictor, scales = "free", ncol = 3) +
  theme_minimal() +
  labs(title = "Histograms of Transformed Glass Predictors")

ggplot(transformed_long, aes(y = Value)) +
  geom_boxplot(fill = "lightgreen") +
  facet_wrap(~ Predictor, scales = "free", ncol = 3) +
  theme_minimal() +
  labs(title = "Boxplots of Transformed Glass Predictors")
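
The skewness and outlier judgements in this exercise can be cross-checked numerically. The sketch below is an assumption-laden illustration: both helpers are hypothetical, the skewness is the plain moment ratio (which may differ slightly from the default estimator in `e1071::skewness`), and the outlier rule is the usual 1.5 × IQR boxplot convention.

```r
# Moment-based sample skewness: m3 / m2^(3/2) (illustrative sketch).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Count points lying more than 1.5 * IQR beyond the quartiles.
count_iqr_outliers <- function(x) {
  x <- x[!is.na(x)]
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

# Toy vectors: one symmetric, one mostly zero with a single large value
symmetric <- c(1, 2, 3, 4, 5)
spiky     <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 10)

skew_symmetric <- sample_skewness(symmetric)
skew_spiky     <- sample_skewness(spiky)
outliers_spiky <- count_iqr_outliers(spiky)
```

Running both helpers over each column, for instance with `sapply()`, gives one skewness value and one outlier count per predictor to set against the plots.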

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

library(mlbench)
data(Soybean)
## See ?Soybean for details
  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

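Based on the nearZeroVar() output below, three predictors are flagged as near-zero variance: leaf.mild, mycelium, and sclerotia. In each of them, one level dominates: the ratio of the most common level to the second most common (freqRatio) is about 26.8, 106.5, and 31.3 respectively, and only two or three distinct values occur across the 683 samples.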

str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
predictors <- Soybean[, -1]  

categorical_predictors <- predictors[, sapply(predictors, is.factor)]

nzv <- nearZeroVar(categorical_predictors, saveMetrics = TRUE)
print(nzv)
##                  freqRatio percentUnique zeroVar   nzv
## date              1.137405     1.0248902   FALSE FALSE
## plant.stand       1.208191     0.2928258   FALSE FALSE
## precip            4.098214     0.4392387   FALSE FALSE
## temp              1.879397     0.4392387   FALSE FALSE
## hail              3.425197     0.2928258   FALSE FALSE
## crop.hist         1.004587     0.5856515   FALSE FALSE
## area.dam          1.213904     0.5856515   FALSE FALSE
## sever             1.651282     0.4392387   FALSE FALSE
## seed.tmt          1.373874     0.4392387   FALSE FALSE
## germ              1.103627     0.4392387   FALSE FALSE
## plant.growth      1.951327     0.2928258   FALSE FALSE
## leaves            7.870130     0.2928258   FALSE FALSE
## leaf.halo         1.547511     0.4392387   FALSE FALSE
## leaf.marg         1.615385     0.4392387   FALSE FALSE
## leaf.size         1.479638     0.4392387   FALSE FALSE
## leaf.shread       5.072917     0.2928258   FALSE FALSE
## leaf.malf        12.311111     0.2928258   FALSE FALSE
## leaf.mild        26.750000     0.4392387   FALSE  TRUE
## stem              1.253378     0.2928258   FALSE FALSE
## lodging          12.380952     0.2928258   FALSE FALSE
## stem.cankers      1.984293     0.5856515   FALSE FALSE
## canker.lesion     1.807910     0.5856515   FALSE FALSE
## fruiting.bodies   4.548077     0.2928258   FALSE FALSE
## ext.decay         3.681481     0.4392387   FALSE FALSE
## mycelium        106.500000     0.2928258   FALSE  TRUE
## int.discolor     13.204545     0.4392387   FALSE FALSE
## sclerotia        31.250000     0.2928258   FALSE  TRUE
## fruit.pods        3.130769     0.5856515   FALSE FALSE
## fruit.spots       3.450000     0.5856515   FALSE FALSE
## seed              4.139130     0.2928258   FALSE FALSE
## mold.growth       7.820896     0.2928258   FALSE FALSE
## seed.discolor     8.015625     0.2928258   FALSE FALSE
## seed.size         9.016949     0.2928258   FALSE FALSE
## shriveling       14.184211     0.2928258   FALSE FALSE
## roots             6.406977     0.4392387   FALSE FALSE
flagged_predictors <- rownames(nzv[nzv$nzv == TRUE, ])
print(flagged_predictors)
## [1] "leaf.mild" "mycelium"  "sclerotia"
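
The two metrics behind the flag can be reproduced by hand. The sketch below is a hypothetical re-implementation for a single predictor, assuming the default cutoffs of `nearZeroVar()` (`freqCut = 95/5` and `uniqueCut = 10`); it is meant to show the logic, not to replace the caret function.

```r
# Near-zero-variance metrics for one predictor (illustrative sketch).
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  # Count observed values only; as.character() drops unused factor levels
  counts <- sort(table(as.character(x[!is.na(x)])), decreasing = TRUE)
  zero_var <- length(counts) <= 1
  # Ratio of the most common value to the second most common value
  freq_ratio <- if (zero_var) 0 else counts[[1]] / counts[[2]]
  # Distinct values as a percentage of all samples
  percent_unique <- 100 * length(counts) / length(x)
  data.frame(
    freqRatio     = freq_ratio,
    percentUnique = percent_unique,
    zeroVar       = zero_var,
    nzv           = zero_var || (freq_ratio > freq_cut && percent_unique <= unique_cut)
  )
}

# Toy predictors: one dominated by a single level, one roughly balanced
rare     <- factor(c(rep("0", 97), rep("1", 3)))
balanced <- factor(c(rep("0", 60), rep("1", 40)))

rare_metrics     <- nzv_metrics(rare)
balanced_metrics <- nzv_metrics(balanced)
```

A predictor is flagged only when both conditions hold, which is why a predictor such as shriveling (freqRatio about 14.2) stays below the cutoff of 19 and is not flagged.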
  1. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

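Yes. Missingness is concentrated in a subset of predictors. hail, sever, seed.tmt, and lodging are each missing in about 17.7% of samples, followed by germ (16.4%), leaf.mild (15.8%), and fruiting.bodies, fruit.spots, seed.discolor, and shriveling (15.5% each). By contrast, leaves has no missing values, and date and area.dam are missing in only about 0.1% of samples.

Yes, the pattern is strongly related to the disease class. All of the missing values come from five classes: 2-4-d-injury, cyst-nematode, herbicide-injury, phytophthora-rot, and diaporthe-pod-&-stem-blight. The other 14 classes have no missing values.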

missing_pct <- colSums(is.na(predictors)) / nrow(predictors) * 100
print(missing_pct[order(missing_pct, decreasing = TRUE)])
##            hail           sever        seed.tmt         lodging            germ 
##      17.7159590      17.7159590      17.7159590      17.7159590      16.3982430 
##       leaf.mild fruiting.bodies     fruit.spots   seed.discolor      shriveling 
##      15.8125915      15.5197657      15.5197657      15.5197657      15.5197657 
##     leaf.shread            seed     mold.growth       seed.size       leaf.halo 
##      14.6412884      13.4699854      13.4699854      13.4699854      12.2986823 
##       leaf.marg       leaf.size       leaf.malf      fruit.pods          precip 
##      12.2986823      12.2986823      12.2986823      12.2986823       5.5636896 
##    stem.cankers   canker.lesion       ext.decay        mycelium    int.discolor 
##       5.5636896       5.5636896       5.5636896       5.5636896       5.5636896 
##       sclerotia     plant.stand           roots            temp       crop.hist 
##       5.5636896       5.2708638       4.5387994       4.3923865       2.3426061 
##    plant.growth            stem            date        area.dam          leaves 
##       2.3426061       2.3426061       0.1464129       0.1464129       0.0000000
missing_by_class <- tapply(rowSums(is.na(predictors)), Soybean$Class, mean) / ncol(predictors) * 100
print(missing_by_class[order(missing_by_class, decreasing = TRUE)])
##                2-4-d-injury               cyst-nematode 
##                    80.35714                    68.57143 
##            herbicide-injury            phytophthora-rot 
##                    57.14286                    39.41558 
## diaporthe-pod-&-stem-blight         alternarialeaf-spot 
##                    33.71429                     0.00000 
##                 anthracnose            bacterial-blight 
##                     0.00000                     0.00000 
##           bacterial-pustule                  brown-spot 
##                     0.00000                     0.00000 
##              brown-stem-rot                charcoal-rot 
##                     0.00000                     0.00000 
##       diaporthe-stem-canker                downy-mildew 
##                     0.00000                     0.00000 
##          frog-eye-leaf-spot      phyllosticta-leaf-spot 
##                     0.00000                     0.00000 
##              powdery-mildew           purple-seed-stain 
##                     0.00000                     0.00000 
##        rhizoctonia-root-rot 
##                     0.00000
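
The percentages above average over all cells in a class. A complementary view counts how many samples in each class have at least one missing predictor. The sketch below uses a small made-up data frame (`toy`, with hypothetical columns `x1` and `x2`) in place of `Soybean`, so the numbers are illustrative only.

```r
# Toy data standing in for Soybean: class "b" carries all of the missing values
toy <- data.frame(
  Class = factor(c("a", "a", "a", "b", "b", "b")),
  x1    = c(1, 2, 3, NA, NA, 6),
  x2    = c("u", "v", "u", NA, "v", NA)
)

# TRUE for samples with at least one missing predictor
incomplete <- !complete.cases(toy[, c("x1", "x2")])

# Cross-tabulate class against incompleteness
rows_by_class <- table(Class = toy$Class, Incomplete = incomplete)

# Share of incomplete samples within each class
share_incomplete <- tapply(incomplete, toy$Class, mean)

print(rows_by_class)
print(share_incomplete)
```

If a class turns out to be entirely incomplete, the missingness itself may carry information about that class, which is worth keeping in mind before imputing.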
  1. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

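The strategy combines predictor removal with imputation. First, the near-zero-variance predictors and the predictors with more than 15% missing values are dropped. The remaining missing values are then imputed with MICE (m = 5 imputations) using predictive mean matching. This simplifies the dataset and leaves no missing values in the retained predictors, and it is intended to introduce less bias than deleting incomplete samples. Note that complete(imp) returns only the first of the five completed datasets.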

response <- Soybean[, 1]   

nzv_predictors <- rownames(nzv[nzv$nzv == TRUE, ])  
clean_predictors <- predictors[, !names(predictors) %in% nzv_predictors]

high_missing <- names(missing_pct[missing_pct > 15]) 

clean_predictors <- clean_predictors[, !names(clean_predictors) %in% high_missing]

library(mice)

# printFlag = FALSE suppresses the per-iteration log (5 iterations x 5 imputations)
imp <- mice(clean_predictors, m = 5, method = "pmm", seed = 123, printFlag = FALSE)
## Warning: Number of logged events: 515
imputed_predictors <- complete(imp)

imputed_soybean <- cbind(response, imputed_predictors)

str(imputed_soybean)
## 'data.frame':    683 obs. of  24 variables:
##  $ response     : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date         : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand  : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip       : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp         : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ crop.hist    : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam     : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves       : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo    : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg    : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size    : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion: Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ ext.decay    : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods   : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots        : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
print(colSums(is.na(imputed_soybean)) / nrow(imputed_soybean) * 100)
##      response          date   plant.stand        precip          temp 
##             0             0             0             0             0 
##     crop.hist      area.dam  plant.growth        leaves     leaf.halo 
##             0             0             0             0             0 
##     leaf.marg     leaf.size   leaf.shread     leaf.malf          stem 
##             0             0             0             0             0 
##  stem.cankers canker.lesion     ext.decay  int.discolor    fruit.pods 
##             0             0             0             0             0 
##          seed   mold.growth     seed.size         roots 
##             0             0             0             0
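
For comparison, a much simpler baseline is to fill each missing value with the most frequent level of its predictor. This is a swapped-in technique, not what the analysis above does: `impute_mode()` is a hypothetical helper, and unlike MICE it ignores the relationships between predictors and the uncertainty of the imputed values.

```r
# Mode imputation for a single factor (illustrative baseline, not MICE).
impute_mode <- function(x) {
  # Nothing to do if there are no missing values or no observed values
  if (!anyNA(x) || all(is.na(x))) return(x)
  counts <- table(x)  # NA values are excluded by default
  mode_level <- names(counts)[which.max(counts)]
  x[is.na(x)] <- mode_level
  x
}

# Toy data with missing values in two factors
toy <- data.frame(
  f = factor(c("0", "1", "1", NA, "1")),
  g = factor(c("a", NA, "a", "b", NA), levels = c("a", "b"))
)

toy_imputed <- as.data.frame(lapply(toy, impute_mode))
print(toy_imputed)
```

Comparing a model trained on mode-imputed data with one trained on the MICE result would give a rough sense of whether the more elaborate imputation is worth its cost.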