Import Libraries

library(mlbench)
library(caret)
library(tidyverse)
library(corrplot)
library(e1071)
library(mice)
library(questionr)

3.1

3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via library(mlbench); data(Glass); str(Glass), as run below.

data("Glass")
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
  a. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
factors <- Glass[, 1:9]

# Histograms of the predictor distributions
factors %>%
  pivot_longer(everything()) %>%
  ggplot(aes(value)) + 
  geom_histogram(bins = 16) + 
  facet_wrap(~ name, scales = 'free') +
  ggtitle("Distributions of Predictor Variables")

Glass %>%
  ggplot() +
  geom_bar(aes(x = Type)) +
  ggtitle("Categories of Glass Distribution")

# Matrix of Correlations
Cor_Fact <- cor(factors)
corrplot.mixed(Cor_Fact, lower = 'number', upper = 'square', order ='AOE')

The distributions of the predictor variables largely exhibit right-skewed patterns, particularly for Ba, Ca, Fe, and K, with Ba and Fe concentrated primarily around zero. In contrast, Mg and Si show left-skewed distributions. Na's distribution is close to normal, though it has a slight rightward skew. The bar chart of glass types shows that types 1, 2, and 7 occur most frequently. Among the predictor correlations, the strongest positive relationship is between RI and Ca, at 0.81. Ba and Al also show a positive correlation, though a weaker one, at 0.48. On the negative side, RI and Si have the strongest negative correlation at -0.54, while Al and Mg, Ca and Mg, and Ba and Mg all show moderate negative correlations, ranging from roughly -0.45 to -0.5.

  b. Do there appear to be any outliers in the data? Are any predictors skewed?
# Skewness of each predictor (e1071::skewness)
factors %>%
  summarise_all(~ skewness(.))

By examining the skewness values for each variable, we can see that all variables except Si, Na, and Al are likely to contain outliers. Skewness measures the asymmetry of a distribution: the further a value is from 0, the more asymmetric the distribution, with positive values indicating a right skew and negative values a left skew. Skewness is a useful screen for outliers because heavily skewed distributions have long tails, which is where outliers tend to sit. A common rule of thumb treats an absolute skewness above 1 as highly skewed, and values between 0.5 and 1 as moderately skewed.
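As a quick follow-up, the short sketch below (assuming the threshold of 1 just described) lists the predictors whose absolute skewness exceeds it.

# Predictors with absolute skewness above 1, most skewed first
factors %>%
  summarise_all(~ skewness(.)) %>%
  pivot_longer(everything(), names_to = "predictor", values_to = "skew") %>%
  filter(abs(skew) > 1) %>%
  arrange(desc(abs(skew)))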

  c. Are there any relevant transformations of one or more predictors that might improve the classification model?
# Estimated Box-Cox lambda for each predictor (NA when not estimable)
factors %>%
  summarise_all(~ BoxCoxTrans(.)$lambda)

# Confirm there are no missing values among the predictors
freq.na(factors)
##    missing %
## RI       0 0
## Na       0 0
## Mg       0 0
## Al       0 0
## Si       0 0
## K        0 0
## Ca       0 0
## Ba       0 0
## Fe       0 0

I opted to apply a Box-Cox transformation to make the predictor distributions more symmetric. It's worth noting that Na and Al have skewness values between -1 and 1, indicating their distributions are already fairly close to normal. RI, Si, and Ca receive power transformations: negative lambda values (inverse-type transformations) for Ca and RI, and a positive lambda for Si. Mg, K, Ba, and Fe return NA lambdas because the Box-Cox transformation requires strictly positive data, and each of these predictors contains zeros.
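Because Box-Cox cannot handle zeros, one alternative worth sketching (not part of the original analysis) is the Yeo-Johnson transformation, which caret's preProcess() also supports and which is defined at zero and for negative values.

# Estimate Yeo-Johnson transformations, which tolerate zeros,
# and apply them to the predictors
yj <- preProcess(factors, method = "YeoJohnson")
factors_yj <- predict(yj, factors)

# Skewness after the transformation, for comparison with part (b)
factors_yj %>%
  summarise_all(~ skewness(.))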

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

library(mlbench)
data(Soybean)
## See ?Soybean for details

data("Soybean")
  a. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
Soybean_tidy <- Soybean %>%
  select(-Class) %>%
  mutate(across(where(is.factor), as.numeric)) %>%
  pivot_longer(cols = -date, names_to = "name", values_to = "value")

# Predictor Distributions: geom_bar() counts the discrete factor levels
# (geom_histogram(stat = "count") triggered warnings); na.rm drops the
# missing values before counting
ggplot(Soybean_tidy, aes(x = value)) +
  geom_bar(na.rm = TRUE) +
  facet_wrap(vars(name))

A predictor that takes essentially the same value for nearly every sample has a degenerate distribution. The variables mycelium and sclerotia follow this pattern, as almost all of their observations fall in a single level. The variables leaf.mild and leaf.malf are also heavily one-sided and, after excluding missing data, could likewise turn out to be degenerate.
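To double-check this impression, a brief sketch using caret's nearZeroVar(), which flags near-zero-variance (i.e., degenerate) predictors directly:

# Flag near-zero-variance predictors; saveMetrics = TRUE keeps the
# frequency-ratio and percent-unique diagnostics for every column
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ]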

  b. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
# Number of Missing Values for Each Predictor
Soybean %>%
  summarise_all(~ sum(is.na(.))) %>%
  pivot_longer(everything(), names_to = "predictor", values_to = "missing_count") %>%
  ggplot(aes(y = reorder(predictor, missing_count), x = missing_count)) +
  geom_col(fill = "lightgreen") +
  labs(title = "Missing Values per Predictor",
       x = "Number of Missing Values",
       y = "Predictor") +
  theme_minimal()

# NA Frequency By Predictor
freq.na(Soybean)
##                 missing  %
## hail                121 18
## sever               121 18
## seed.tmt            121 18
## lodging             121 18
## germ                112 16
## leaf.mild           108 16
## fruiting.bodies     106 16
## fruit.spots         106 16
## seed.discolor       106 16
## shriveling          106 16
## leaf.shread         100 15
## seed                 92 13
## mold.growth          92 13
## seed.size            92 13
## leaf.halo            84 12
## leaf.marg            84 12
## leaf.size            84 12
## leaf.malf            84 12
## fruit.pods           84 12
## precip               38  6
## stem.cankers         38  6
## canker.lesion        38  6
## ext.decay            38  6
## mycelium             38  6
## int.discolor         38  6
## sclerotia            38  6
## plant.stand          36  5
## roots                31  5
## temp                 30  4
## crop.hist            16  2
## plant.growth         16  2
## stem                 16  2
## date                  1  0
## area.dam              1  0
## Class                 0  0
## leaves                0  0
# NA counts per predictor within each class
Total_Na <- Soybean %>%
  group_by(Class) %>%
  summarize_at(vars(-group_cols()), ~ sum(is.na(.)))

# Total NA count per class
row_sum <- rowSums(Total_Na[, -1])
result <- cbind(Total_Na[, 1, drop = FALSE], RowSums = row_sum)
result

As indicated in the table above, the predictors hail, sever, seed.tmt, and lodging have the most missing values, each with 18% (121 of 683) missing. Additionally, the classes with missing values are phytophthora-rot, herbicide-injury, diaporthe-pod-&-stem-blight, cyst-nematode, and 2-4-d-injury. Among these, phytophthora-rot has by far the most missing values, totaling 1,214.

  c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Missing Data Strategy

One approach to managing missing data is Multivariate Imputation by Chained Equations, commonly known as MICE. This technique imputes the missing values, and the function md.pattern() provides a visual summary of the missing-data pattern across predictors. Among the imputation methods mice supports are Predictive Mean Matching (pmm), Classification and Regression Trees (cart), and lasso linear regression (lasso.norm). We used five imputations (m = 5) to account for the uncertainty in the missing data, which yields more reliable parameter estimates.
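A minimal sketch of that workflow follows; the choice of method = "cart" (CART handles the categorical Soybean predictors) and the seed are illustrative assumptions, not the only reasonable settings.

# Visual summary of the missing-data pattern
md.pattern(Soybean, rotate.names = TRUE)

# Create five multiply imputed datasets (m = 5)
imp <- mice(Soybean, m = 5, method = "cart", seed = 123, printFlag = FALSE)

# Inspect one completed dataset: no missing values should remain
Soybean_complete <- mice::complete(imp, 1)
freq.na(Soybean_complete)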

How much missing data?

Review the percentage of missing values for each predictor from the previous analysis, and identify the predictors with a high share of missing values; those will need special attention.

Should we eliminate?

Threshold for elimination: Set a threshold (e.g. 30% missing) for deciding whether a predictor should be dropped from the analysis. A predictor exceeding the threshold carries too little information to be worth imputing and is a candidate for removal; a short sketch follows at the end of this subsection.

Domain knowledge: Use domain knowledge to decide if the predictors with missing values are important for the analysis. If some are not, they can be dropped regardless of the % missing.
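A hedged sketch of the threshold rule (for Soybean the worst predictor is only about 18% missing, so nothing would actually be dropped here):

# Keep only predictors whose share of missing values is at most 30%
threshold <- 0.30
Soybean_reduced <- Soybean %>%
  select(where(~ mean(is.na(.)) <= threshold))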

Imputing missing data:

Multivariate Imputation via Chained Equations (MICE): Use MICE to impute missing values for predictors that don’t exceed the threshold. This will help retain information and reduce bias.

MICE methods:

Predictive Mean Matching (PMM): Use this to preserve the distribution of the data and impute missing values based on similar observed values.

Classification and Regression Trees (CART): Use CART to impute missing values based on other predictors, especially useful for categorical variables.

Lasso Linear Regression (lasso.norm): Use this for continuous predictors to handle multicollinearity and improve predictions.

Multiple Imputation: Do multiple imputations (e.g. 5) to account for the uncertainty of missing data and get more reliable parameter estimates. This will also allow you to analyze the variability due to imputation.
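Reusing the imp object from the sketch above, mice's with() and pool() fit a model on each completed dataset and combine the estimates, so the pooled standard errors reflect between-imputation variability; the logistic model here is purely illustrative.

# Fit the same model on each of the five completed datasets and pool
fit <- with(imp, glm(stem ~ temp + precip, family = binomial))
pool(fit)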

Check the impact of imputation:

After imputation, run a sensitivity analysis to see how the imputed values affect the overall results. Compare models built on the complete cases of the original dataset with models built on the imputed datasets.

Document and report:

Clearly document the steps taken to handle missing data, including the reason for any predictors dropped or imputed. Report the % missing before and after imputation in any results or visualizations.

Further consideration:

If the pattern of missingness is related to the classes (as part (b) suggested, with certain classes accounting for most of the missing values), consider analyses that account for this relationship, since class-dependent missingness may indicate an underlying issue that needs to be addressed. One way to quantify the relationship is sketched below.
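A sketch of that check (pick() requires dplyr >= 1.1, which is loaded above):

# Percentage of missing cells within each class
Soybean %>%
  group_by(Class) %>%
  summarise(pct_missing = round(100 * mean(is.na(pick(everything()))), 1)) %>%
  arrange(desc(pct_missing))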

Conclusion

By following this process we can keep the analysis robust and meaningful and retain as much information as possible from the soybean dataset. MICE will help reduce the bias introduced by missing values and improve any models we build from the data.