The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.5.2
data(Glass)
(Glass)
a & b: Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors. Do there appear to be any outliers in the data? Are any predictors skewed?
cor_matrix <- cor(Glass[, 1:9])
ggcorrplot(cor_matrix,
method = "square",
type = "lower",
lab = TRUE,
lab_size = 3,
colors = c("blue", "white", "red"),
title = "Predictors Correlation Matrix",
legend.title = "Correlation")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggcorrplot package.
## Please report the issue at <https://github.com/kassambara/ggcorrplot/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Glass %>%
select(-Type) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = Value, y = Variable)) +
geom_boxplot(fill = "pink", outlier.color = "purple", outlier.shape = 16) +
facet_wrap(~Variable, scales = "free", ncol = 3) +
labs(title = "Boxplots of Predictors",
x = "Value",
y = NULL) +
theme_minimal() +
theme(strip.text = element_text(face = "bold"),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
Glass %>%
pivot_longer(cols = -Type, names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = Value)) +
geom_histogram(aes(y = after_stat(density)),
bins = 30, fill = "purple", color = "black", alpha = 0.7) +
facet_wrap(~Variable, scales = "free", ncol = 3) +
labs(title = "Histograms of Predictors",
x = "Value",
y = "Density") +
theme_minimal()
From the histogram, we can see that there is some highly skewed predictors to the right such as K, Ba and Fe. While Mg is left skewed. The other predictors: Al, Ca, Na, RI, and Si have relatively normal symmetry. In our box plot, all predictors outside of Mg have a noticable amount of outliers. In our Correlation Matrix, one noticeably high correlation is with Ca and RI.
c: Are there any relevant transformations of one or more predictors that might improve the classification model?
There are many outliers that are right skewed and would need to be transformed. However, we must also take into account the amount of zeros. Because of this, I chose to apply a cube root transformation since it accounts for the zeros.
glass_cbrt <- Glass %>%
mutate(
K = K^(1/3),
Ba = Ba^(1/3),
Fe = Fe^(1/3)
) %>%
mutate(across(-Type, scale))
glass_cbrt %>%
pivot_longer(cols = -Type, names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = Value)) +
geom_histogram(fill = "coral", alpha = 0.6) +
facet_wrap(~ Variable, scales = "free", ncol = 3) +
labs(title = "Cube Root Transformed Predictors Histogram",
x = "Transformed Value",
y = "Density") +
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmen- tal conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
library(mlbench)
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
a: Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
#get predictos
predictor_cols <- names(Soybean)[-1]
#func examining frequency distributions
freq_dist_exam <- function(data, predictor) {
freq_table <- table(data[[predictor]], useNA = "ifany")
prop_table <- prop.table(freq_table) * 100
data.frame(
Predictor = predictor,
Level = names(freq_table),
Count = as.vector(freq_table),
Percentage = round(as.vector(prop_table), 2)
)
}
#apply to predictors
all_frequencies <- do.call(rbind,
lapply(predictor_cols, function(p) freq_dist_exam(Soybean, p)))
#identify degenerate frequenct
degenerate_checker <- all_frequencies %>%
group_by(Predictor) %>%
summarise(
n_levels = n(),
max_pct = max(Percentage),
min_pct = min(Percentage),
single_dominant = max_pct > 95
) %>%
arrange(desc(max_pct))
print(degenerate_checker)
## # A tibble: 35 × 5
## Predictor n_levels max_pct min_pct single_dominant
## <chr> <int> <dbl> <dbl> <lgl>
## 1 mycelium 3 93.6 0.88 FALSE
## 2 sclerotia 3 91.5 2.93 FALSE
## 3 leaves 2 88.7 11.3 FALSE
## 4 int.discolor 4 85.1 2.93 FALSE
## 5 leaf.malf 3 81.1 6.59 FALSE
## 6 roots 4 80.7 2.2 FALSE
## 7 shriveling 3 78.9 5.56 FALSE
## 8 leaf.mild 4 78.3 2.93 FALSE
## 9 seed.size 3 77.9 8.64 FALSE
## 10 mold.growth 3 76.7 9.81 FALSE
## # ℹ 25 more rows
While all of the predictors aren’t single dominant (having a max percentage greater than 95% for an occurrence), I would say that mycellium and sclerotia could be considered degenerate distributed as they have a very high percentage of 0s. However, many other predictors do also fall into having a highly skewed imbalanced distribution towards zero.
From the table, we can see that predictors that are more likely to be missing are hail, sever, seed.tmt and lodging. But many other predictors have similar missing data percentages.
data_missing_pred <- Soybean %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "Predictor", values_to = "Missing_Count") %>%
mutate(Missing_Percent = round(Missing_Count / nrow(Soybean) * 100, 2)) %>%
arrange(desc(Missing_Percent))
print(data_missing_pred)
## # A tibble: 36 × 3
## Predictor Missing_Count Missing_Percent
## <chr> <int> <dbl>
## 1 hail 121 17.7
## 2 sever 121 17.7
## 3 seed.tmt 121 17.7
## 4 lodging 121 17.7
## 5 germ 112 16.4
## 6 leaf.mild 108 15.8
## 7 fruiting.bodies 106 15.5
## 8 fruit.spots 106 15.5
## 9 seed.discolor 106 15.5
## 10 shriveling 106 15.5
## # ℹ 26 more rows
From the below visualization, we can see that the pattern of missing data is related to the class. In particular, there are five classes that account for much of the missing data: 2-4-d-injury, charcoal-rot, diaporthe-pod-&-stem-blight, herbicide-injury and phytophthora-rot.
missing_by_class <- Soybean %>%
group_by(Class) %>%
summarise(across(everything(), ~mean(is.na(.)) * 100)) %>%
pivot_longer(-Class, names_to = "Predictor", values_to = "Missing_Pct")
#I used the top predictors with most missing percentage
missing_by_class %>%
filter(Predictor %in% c("hail", "sever", "lodging", "seed.tmt")) %>%
ggplot(aes(x = Class, y = Missing_Pct, fill = Predictor)) +
geom_col(position = "dodge") +
labs(title = "Missingness by Class for Selected Predictors",
x = "Class", y = "Missing Percentage") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
c. Develop a strategy for handling missing data, either by eliminating
predictors or imputation.
Because the dataset uses mostly categorical predictors, has multiple class classifications, and a low percentage of missing data, I feel that eliminating the predictors would not be the answer. Instead, we can try imputation to handle this task. One way I feel would be appropriate is the use of K-nearest neightors imputation through the VIM package, that can estimate missing values from similar complete cases.
Since the KNN estimated vals for the missing cases, there are no missing values in the cleaned soybean dataset.
#find total missing values in df
sum(is.na(cleaned_soy))
## [1] 0
#look at missing vals per column
colSums(is.na(cleaned_soy))
## Class date plant.stand precip
## 0 0 0 0
## temp hail crop.hist area.dam
## 0 0 0 0
## sever seed.tmt germ plant.growth
## 0 0 0 0
## leaves leaf.halo leaf.marg leaf.size
## 0 0 0 0
## leaf.shread leaf.malf leaf.mild stem
## 0 0 0 0
## lodging stem.cankers canker.lesion fruiting.bodies
## 0 0 0 0
## ext.decay mycelium int.discolor sclerotia
## 0 0 0 0
## fruit.pods fruit.spots seed mold.growth
## 0 0 0 0
## seed.discolor seed.size shriveling roots
## 0 0 0 0
## Class_imp date_imp plant.stand_imp precip_imp
## 0 0 0 0
## temp_imp hail_imp crop.hist_imp area.dam_imp
## 0 0 0 0
## sever_imp seed.tmt_imp germ_imp plant.growth_imp
## 0 0 0 0
## leaves_imp leaf.halo_imp leaf.marg_imp leaf.size_imp
## 0 0 0 0
## leaf.shread_imp leaf.malf_imp leaf.mild_imp stem_imp
## 0 0 0 0
## lodging_imp stem.cankers_imp canker.lesion_imp fruiting.bodies_imp
## 0 0 0 0
## ext.decay_imp mycelium_imp int.discolor_imp sclerotia_imp
## 0 0 0 0
## fruit.pods_imp fruit.spots_imp seed_imp mold.growth_imp
## 0 0 0 0
## seed.discolor_imp seed.size_imp shriveling_imp roots_imp
## 0 0 0 0