library(mlbench)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.94 loaded
library(e1071)
library(mice)
## Attaching package: 'mice'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(questionr)
3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via library(mlbench) followed by data(Glass); the structure returned by str(Glass) is shown below.
data("Glass")
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
factors <- Glass[, 1:9]
# Histograms displaying Predictors
factors %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 16) +
  facet_wrap(~key, scales = 'free') +
  ggtitle("Distributions of Predictive Variables")
# Bar chart of glass type frequencies
Glass %>%
  ggplot() +
  geom_bar(aes(x = Type)) +
  ggtitle("Distribution of Glass Types")
# Matrix of Correlations
Cor_Fact <- cor(factors)
corrplot.mixed(Cor_Fact, lower = 'number', upper = 'square', order ='AOE')
The distributions of the predictor variables largely exhibit right-skewed patterns, particularly for Ba, Ca, Fe, and K, with Ba and Fe concentrated primarily around zero. In contrast, Mg and Si show left-skewed distributions. Na’s distribution is close to normal, though it has a slight rightward skew. The glass type distributions show that types 1, 2, and 7 occur most frequently. When examining correlations between predictor variables, the strongest positive relationship is observed between RI and Ca, with a correlation of 0.81. Ba and Al also show a positive correlation, though weaker, at 0.48. On the negative side, RI and Si have the strongest negative correlation at -0.54, while Al and Mg, Ca and Mg, and Ba and Mg all show strong negative correlations, ranging from -0.45 to -0.5.
factors %>%
  summarise_all(~ skewness(.))
By examining the skewness values for each variable, we can see that all variables except Si, Na, and Al are likely to contain outliers. Skewness measures the asymmetry of a distribution: values further from 0 indicate greater asymmetry, with positive values indicating a right-skewed distribution and negative values a left-skewed one. We rely on skewness to flag potential outliers because a more strongly skewed distribution is more likely to contain them. An absolute skewness value greater than 1 is a common rule of thumb for calling a distribution highly skewed.
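As a quick sketch of that rule of thumb (using the same e1071::skewness() values computed above), we can list the predictors whose absolute skewness exceeds 1:
# Flag predictors with |skewness| > 1 as candidates for closer outlier inspection
factors %>%
  summarise(across(everything(), skewness)) %>%
  pivot_longer(everything(), names_to = "predictor", values_to = "skew") %>%
  dplyr::filter(abs(skew) > 1) %>%
  arrange(desc(abs(skew)))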
# Box-Cox lambda estimates for each predictor (NA where the transformation cannot be estimated)
factors %>%
  summarise_all(~ BoxCoxTrans(.)$lambda)
freq.na(factors)
## missing %
## RI 0 0
## Na 0 0
## Mg 0 0
## Al 0 0
## Si 0 0
## K 0 0
## Ca 0 0
## Ba 0 0
## Fe 0 0
I opted to apply a Box-Cox transformation to reduce skewness and make the predictor distributions more symmetric. It is worth pointing out that Na and Al have skewness values between -1 and 1, so their distributions are already fairly symmetric. The estimated lambdas show that RI, Si, and Ca receive power transformations, with negative lambda values for RI and Ca and a positive value for Si. However, Mg, K, Ba, and Fe return NA lambdas, most likely because they contain zero values and Box-Cox requires strictly positive data, so the transformation is not applicable to those predictors.
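As a minimal sketch of applying the transformation in practice (the specific preProcess() call, and the choice to also center and scale, are my additions rather than part of the exercise), caret can estimate and apply the Box-Cox lambdas in one step; predictors whose lambda cannot be estimated are simply left untransformed:
# Estimate Box-Cox lambdas (plus centering and scaling) and apply them to the predictors
pp <- preProcess(factors, method = c("BoxCox", "center", "scale"))
factors_trans <- predict(pp, factors)
# Re-check skewness after the transformation
factors_trans %>%
  summarise(across(everything(), skewness))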
3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via library(mlbench) followed by data(Soybean); see ?Soybean for details.
data("Soybean")
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
Soybean_tidy <- Soybean %>%
  select(-Class) %>%
  mutate(across(where(is.factor), as.numeric)) %>%
  pivot_longer(cols = -c(date), names_to = "name", values_to = "value")
# Predictor Distributions
ggplot(Soybean_tidy, aes(x = value)) +
  geom_bar() +
  facet_wrap(vars(name))
## Warning: Removed 2336 rows containing non-finite outside the scale range
## (`stat_count()`).
A variable that repeatedly takes the same value has a degenerate distribution. The variables mycelium and sclerotia follow this pattern, as each takes essentially a single value. The variables leaf.mild and leaf.malf are also heavily one-sided and, after excluding the missing data, could likewise turn out to be degenerate.
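To make that judgment less subjective, caret's nearZeroVar() can flag near-zero-variance predictors directly; this check (with caret's default cutoffs) is an addition of mine, not part of the original exercise:
# Flag predictors with near-zero variance (i.e., degenerate distributions)
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ]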
# Percentage of Missing Values for Each Predictor
Soybean %>%
  summarise_all(~ round(100 * mean(is.na(.)), 1)) %>%
  pivot_longer(everything(), names_to = "predictor", values_to = "missing_pct") %>%
  ggplot(aes(y = reorder(predictor, missing_pct), x = missing_pct)) +
  geom_col(fill = "lightgreen") +
  labs(title = "Percentage of Missing Values for Each Predictor",
       x = "Percentage of Missing Values",
       y = "Predictor") +
  theme_minimal()
# NA Frequency By Predictor
freq.na(Soybean)
## missing %
## hail 121 18
## sever 121 18
## seed.tmt 121 18
## lodging 121 18
## germ 112 16
## leaf.mild 108 16
## fruiting.bodies 106 16
## fruit.spots 106 16
## seed.discolor 106 16
## shriveling 106 16
## leaf.shread 100 15
## seed 92 13
## mold.growth 92 13
## seed.size 92 13
## leaf.halo 84 12
## leaf.marg 84 12
## leaf.size 84 12
## leaf.malf 84 12
## fruit.pods 84 12
## precip 38 6
## stem.cankers 38 6
## canker.lesion 38 6
## ext.decay 38 6
## mycelium 38 6
## int.discolor 38 6
## sclerotia 38 6
## plant.stand 36 5
## roots 31 5
## temp 30 4
## crop.hist 16 2
## plant.growth 16 2
## stem 16 2
## date 1 0
## area.dam 1 0
## Class 0 0
## leaves 0 0
Total_Na <- Soybean %>%
  group_by(Class) %>%
  summarize_at(vars(-group_cols()), ~ sum(is.na(.)))
# NA Frequency by Class
row_sum <- rowSums(Total_Na[, -1])
result <- cbind(Total_Na[, 1, drop = FALSE], RowSums = row_sum)
result
As indicated in the table above, the predictors hail, sever, seed.tmt, and lodging each have 18% missing values. The classes that contain missing values are phytophthora-rot, herbicide-injury, diaporthe-pod-&-stem-blight, cyst-nematode, and 2-4-d-injury; among these, phytophthora-rot has by far the most, with 1,214 missing entries.
One approach to handling missing data is Multivariate Imputation by Chained Equations, commonly known as MICE. This technique imputes the missing values, and the function md.pattern() provides a visual summary of the missing-data pattern across predictors. Among the imputation methods mice offers are Predictive Mean Matching (PMM), Classification and Regression Trees (CART), and lasso linear regression (lasso.norm). We ran five imputations to account for the uncertainty in the missing data, which leads to more reliable parameter estimates.
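A minimal sketch of that workflow is shown below; the fixed seed and the decision to let mice() pick its default method for each variable type (pmm for numeric, logreg/polyreg for factors) are illustrative choices, not a definitive specification:
# Visualize the missing-data pattern across predictors
md.pattern(Soybean, rotate.names = TRUE)
# Five imputations; mice() chooses a default method per variable type unless overridden
imp <- mice(Soybean, m = 5, seed = 123, printFlag = FALSE)
# Extract the first completed data set for inspection
Soybean_complete <- complete(imp, 1)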
How much missing data?
Review the % missing for each predictor from the previous analysis. Identify the ones with a high % missing, as those will need special attention.
Should we eliminate?
Threshold for elimination: Set a threshold (e.g. 30% missing) to decide whether a predictor should be dropped from the analysis. If a predictor exceeds this threshold, consider dropping it from the dataset, as it won't provide enough information (see the sketch after this list).
Domain knowledge: Use domain knowledge to decide if the predictors with missing values are important for the analysis. If some are not, they can be dropped regardless of the % missing.
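A small sketch of the threshold check (the 30% cutoff and the object names are illustrative; from the table above, no Soybean predictor actually exceeds it):
# Drop predictors whose share of missing values exceeds the chosen threshold
na_share <- colMeans(is.na(Soybean))
drop_cols <- names(na_share)[na_share > 0.30]
drop_cols  # with a 30% cutoff, no Soybean predictor is dropped
Soybean_reduced <- Soybean %>% select(-all_of(drop_cols))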
Imputing missing data:
Multivariate Imputation via Chained Equations (MICE): Use MICE to impute missing values for predictors that don’t exceed the threshold. This will help retain information and reduce bias.
MICE methods (a sketch of specifying them follows this list):
Predictive Mean Matching (PMM): Use this to preserve the distribution of the data and impute missing values based on similar observed values.
Classification and Regression Trees (CART): Use CART to impute missing values based on other predictors, especially useful for categorical variables.
Lasso Linear Regression (lasso.norm): Use this for continuous predictors to handle multicollinearity and improve predictions.
Multiple Imputation: Do multiple imputations (e.g. 5) to account for the uncertainty of missing data and get more reliable parameter estimates. This will also allow you to analyze the variability due to imputation.
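As a sketch of how per-predictor methods can be specified (which predictor gets which method here is purely illustrative, and lasso.norm is only shown as a comment because it applies to numeric predictors, which Soybean lacks):
# Start from mice's default method vector and override it for selected predictors
meth <- make.method(Soybean)
meth["hail"] <- "cart"   # impute this categorical predictor with CART instead of logreg
# meth["some_numeric_var"] <- "lasso.norm"   # hypothetical: for a continuous predictor
imp_custom <- mice(Soybean, m = 5, method = meth, seed = 123, printFlag = FALSE)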
Check the impact of imputation:
After imputation, do a sensitivity analysis to see how the imputed values affect the overall results. Compare models built on the original dataset with those built on the imputed dataset.
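One simple check, sketched here under the assumption that the completed data set Soybean_complete from the earlier mice() sketch is available, is to compare the level frequencies of a heavily missing predictor (hail, at 18% missing) before and after imputation:
# Level frequencies of hail before imputation (NAs shown) and after imputation
prop.table(table(Soybean$hail, useNA = "ifany"))
prop.table(table(Soybean_complete$hail))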
Document and report:
Clearly document the steps taken to handle missing data, including the reason for any predictors dropped or imputed. Report the % missing before and after imputation in any results or visualizations.
Further consideration:
If the pattern of missingness looks related to the classes (i.e. certain classes have more missing values), consider doing analyses that account for this relationship as it may indicate an underlying issue that needs to be addressed.
Conclusion
By following this process, we can keep the analysis robust and meaningful while retaining as much information as possible from the soybean dataset. MICE helps reduce the bias introduced by missing values and improves any models we build from the data.