library(tidyverse)
library(GGally)
library(corrplot)
library(reshape2)
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Glass %>%
pivot_longer(cols = -Type, names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = Value)) +
geom_density(fill = "skyblue", alpha = 0.6) +
facet_wrap(~ Variable, scales = "free", ncol = 3) +
labs(title = "Density of Predictors",
x = "Value", y = "Density") +
theme_minimal()
cor_matrix <- cor(Glass[, 1:9])
corrplot(cor_matrix, method = "color", type = "upper",
col = colorRampPalette(c("blue", "white", "red"))(200),
tl.cex = 0.8,
title = "Correlation Matrix of Predictors")
Glass %>%
select(-Type) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
ggplot(aes(y = Value)) +
geom_boxplot(fill = "skyblue", outlier.color = "red", outlier.shape = 1) +
facet_wrap(~Variable, scales = "free", ncol = 3) +
labs(title = "Boxplots of Glass Dataset Predictors",
y = "Value",
x = NULL) +
theme_minimal() +
theme(strip.text = element_text(face = "bold"))
RI, NA, AI, Si, Ca have fairly normal distribution with some
level of skewness. Whereas, MG, K, Ba, Fe are highly skewed. The
predictors are uncorrelated with the exception of RI and Ca are strongly
correlated. All of the predictors have outliers with the exception of
Mg
Yes since the a lot of the predictors have right skew and outliers we will need to transform the data. Box-Cox is an option however since there are a lot of zeros it shouldn’t be used. However, for this scenario a log(1 + x) and standardization can be used.
Glass_transformed <- Glass %>%
mutate(
K = log1p(K),
Ba = log1p(Ba),
Fe = log1p(Fe)
) %>%
mutate(across(-Type, scale))
Glass_transformed %>%
pivot_longer(cols = -Type, names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = Value)) +
geom_density(fill = "skyblue", alpha = 0.6) +
facet_wrap(~ Variable, scales = "free", ncol = 3) +
labs(title = "Density of Transformed Predictors",
x = "Transformed Value", y = "Density") +
theme_minimal()
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmen- tal conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
library(mlbench)
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
predictors <- Soybean[, -ncol(Soybean)]
freq_list <- lapply(predictors, function(x) {
table(x)
})
freq_list
## $Class
## x
## 2-4-d-injury alternarialeaf-spot
## 16 91
## anthracnose bacterial-blight
## 44 20
## bacterial-pustule brown-spot
## 20 92
## brown-stem-rot charcoal-rot
## 44 20
## cyst-nematode diaporthe-pod-&-stem-blight
## 14 15
## diaporthe-stem-canker downy-mildew
## 20 20
## frog-eye-leaf-spot herbicide-injury
## 91 8
## phyllosticta-leaf-spot phytophthora-rot
## 20 88
## powdery-mildew purple-seed-stain
## 20 20
## rhizoctonia-root-rot
## 20
##
## $date
## x
## 0 1 2 3 4 5 6
## 26 75 93 118 131 149 90
##
## $plant.stand
## x
## 0 1
## 354 293
##
## $precip
## x
## 0 1 2
## 74 112 459
##
## $temp
## x
## 0 1 2
## 80 374 199
##
## $hail
## x
## 0 1
## 435 127
##
## $crop.hist
## x
## 0 1 2 3
## 65 165 219 218
##
## $area.dam
## x
## 0 1 2 3
## 123 227 145 187
##
## $sever
## x
## 0 1 2
## 195 322 45
##
## $seed.tmt
## x
## 0 1 2
## 305 222 35
##
## $germ
## x
## 0 1 2
## 165 213 193
##
## $plant.growth
## x
## 0 1
## 441 226
##
## $leaves
## x
## 0 1
## 77 606
##
## $leaf.halo
## x
## 0 1 2
## 221 36 342
##
## $leaf.marg
## x
## 0 1 2
## 357 21 221
##
## $leaf.size
## x
## 0 1 2
## 51 327 221
##
## $leaf.shread
## x
## 0 1
## 487 96
##
## $leaf.malf
## x
## 0 1
## 554 45
##
## $leaf.mild
## x
## 0 1 2
## 535 20 20
##
## $stem
## x
## 0 1
## 296 371
##
## $lodging
## x
## 0 1
## 520 42
##
## $stem.cankers
## x
## 0 1 2 3
## 379 39 36 191
##
## $canker.lesion
## x
## 0 1 2 3
## 320 83 177 65
##
## $fruiting.bodies
## x
## 0 1
## 473 104
##
## $ext.decay
## x
## 0 1 2
## 497 135 13
##
## $mycelium
## x
## 0 1
## 639 6
##
## $int.discolor
## x
## 0 1 2
## 581 44 20
##
## $sclerotia
## x
## 0 1
## 625 20
##
## $fruit.pods
## x
## 0 1 2 3
## 407 130 14 48
##
## $fruit.spots
## x
## 0 1 2 4
## 345 75 57 100
##
## $seed
## x
## 0 1
## 476 115
##
## $mold.growth
## x
## 0 1
## 524 67
##
## $seed.discolor
## x
## 0 1
## 513 64
##
## $seed.size
## x
## 0 1
## 532 59
##
## $shriveling
## x
## 0 1
## 539 38
In this dataset, mycelium and sclerotia seem to have degenerate distributions. However most of the predictors seem to be heavily imbalanced.
missing_counts <- sapply(Soybean, function(x) sum(is.na(x)))
missing_df <- data.frame(Variable = names(missing_counts), Missing = missing_counts)
missing_df <- missing_df[order(-missing_df$Missing), ]
missing_df
## Variable Missing
## hail hail 121
## sever sever 121
## seed.tmt seed.tmt 121
## lodging lodging 121
## germ germ 112
## leaf.mild leaf.mild 108
## fruiting.bodies fruiting.bodies 106
## fruit.spots fruit.spots 106
## seed.discolor seed.discolor 106
## shriveling shriveling 106
## leaf.shread leaf.shread 100
## seed seed 92
## mold.growth mold.growth 92
## seed.size seed.size 92
## leaf.halo leaf.halo 84
## leaf.marg leaf.marg 84
## leaf.size leaf.size 84
## leaf.malf leaf.malf 84
## fruit.pods fruit.pods 84
## precip precip 38
## stem.cankers stem.cankers 38
## canker.lesion canker.lesion 38
## ext.decay ext.decay 38
## mycelium mycelium 38
## int.discolor int.discolor 38
## sclerotia sclerotia 38
## plant.stand plant.stand 36
## roots roots 31
## temp temp 30
## crop.hist crop.hist 16
## plant.growth plant.growth 16
## stem stem 16
## date date 1
## area.dam area.dam 1
## Class Class 0
## leaves leaves 0
missing_df$MissingPercent <- 100 * missing_df$Missing / nrow(Soybean)
ggplot(missing_df, aes(x = reorder(Variable, -Missing), y = MissingPercent)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
ylab("Missing Data (%)") +
xlab("Variable") +
ggtitle("Percentage of Missing Data by Variable")
Soybean$Class <- Soybean$Class
long_data <- melt(Soybean, id.vars = "Class")
long_data$missing <- is.na(long_data$value)
missing_by_class <- long_data %>%
group_by(Class, variable) %>%
summarize(missing_pct = mean(missing) * 100) %>%
ungroup()
## `summarise()` has grouped output by 'Class'. You can override using the
## `.groups` argument.
ggplot(missing_by_class, aes(x = variable, y = Class, fill = missing_pct)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "white", high = "red", name = "% Missing") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle("Missing Data Pattern by Class and Variable")
The missing data seems to belong to 5 different classes:
The data set is primarily categorical predictors and a multi-class classification task. Either a kNN or mice or a random forest algorithm can be used to impute missing data. Other strategies include mode imputation or removing that classes completely but they aren’t the best methods if the missing data has significance. Ultimately, I am not sure as to which approach is the best.