The UC Irvine Machine Learning Repository6 contains a data set related to my_glass identification. The data consist of 214 my_glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
library(tidyverse)
library(fpp3)
library(gridExtra)
library(GGally)
library(fabletools)
library(caret)
library(ggplot2)
library("e1071")
library("reshape2")
data(Glass)
my_glass<-Glass
# List of predictor names
predictors <- c("RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe")
# Loop over each predictor and plot its distribution
plot_list <- lapply(predictors, function(var) {
median_value <- median(Glass[[var]], na.rm = TRUE)
ggplot(my_glass, aes_string(var)) +
geom_density( fill = "lightgreen", color = "black") +geom_vline(xintercept = median_value, color = "blue", linetype = "dashed") +
ggtitle(paste("Dis", var))
})
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Display all histograms in a grid
grid.arrange(grobs = plot_list, ncol = 3)
ggpairs(my_glass[, predictors])
plot_list2 <- lapply(predictors, function(var) {
ggplot(my_glass, aes_string(x = '""', y = var)) + # use aes_string() for dynamic variable names
geom_boxplot() +
ggtitle(paste("Boxplot of", var)) # concatenate string with 'paste()'
})
grid.arrange(grobs = plot_list2, ncol = 3)
predictors_gl<- my_glass[sapply(my_glass, is.numeric)]
skewValues <- apply( predictors_gl ,2, skewness)
head(skewValues)
## RI Na Mg Al Si K
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889
There are multiple varibles with many values of zeros as wll at outliers far out to on K. RI: 1.6027 (right-skewed) Na: 0.4478 (slightly right-skewed) Mg: -1.1365 (left-skewed) Al: 0.8946 (slightly right-skewed) Si: -0.7202 (left-skewed) K: 6.4601 (highly right-skewed
nearZeroVar(my_glass)
## integer(0)
Because we have multiple predictors with many observations of zero the Box Cox tranformation will not work here, instead we can try a Spatial Sign tranformation to mitigate the efects of outliers.
splom(~ my_glass[,-10], pch = 16, col = rgb(.2, .9, .4, .5), cex = .6,main= " Pre transformation")
cs_glass<-preProcess(my_glass[,-10], ,methos = c("center","scale"))
cspreddata<- predict(cs_glass, newdata = my_glass[,-10])
ss_glass <- spatialSign(cspreddata)
splom(~ ss_glass, pch = 16, col = rgb(.9, .4, .11, .5), cex = .6, main="Center scale Spatial Sign ")
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
lapply(Soybean, function(x) if (is.factor(x)) table(x))
## $Class
## x
## 2-4-d-injury alternarialeaf-spot
## 16 91
## anthracnose bacterial-blight
## 44 20
## bacterial-pustule brown-spot
## 20 92
## brown-stem-rot charcoal-rot
## 44 20
## cyst-nematode diaporthe-pod-&-stem-blight
## 14 15
## diaporthe-stem-canker downy-mildew
## 20 20
## frog-eye-leaf-spot herbicide-injury
## 91 8
## phyllosticta-leaf-spot phytophthora-rot
## 20 88
## powdery-mildew purple-seed-stain
## 20 20
## rhizoctonia-root-rot
## 20
##
## $date
## x
## 0 1 2 3 4 5 6
## 26 75 93 118 131 149 90
##
## $plant.stand
## x
## 0 1
## 354 293
##
## $precip
## x
## 0 1 2
## 74 112 459
##
## $temp
## x
## 0 1 2
## 80 374 199
##
## $hail
## x
## 0 1
## 435 127
##
## $crop.hist
## x
## 0 1 2 3
## 65 165 219 218
##
## $area.dam
## x
## 0 1 2 3
## 123 227 145 187
##
## $sever
## x
## 0 1 2
## 195 322 45
##
## $seed.tmt
## x
## 0 1 2
## 305 222 35
##
## $germ
## x
## 0 1 2
## 165 213 193
##
## $plant.growth
## x
## 0 1
## 441 226
##
## $leaves
## x
## 0 1
## 77 606
##
## $leaf.halo
## x
## 0 1 2
## 221 36 342
##
## $leaf.marg
## x
## 0 1 2
## 357 21 221
##
## $leaf.size
## x
## 0 1 2
## 51 327 221
##
## $leaf.shread
## x
## 0 1
## 487 96
##
## $leaf.malf
## x
## 0 1
## 554 45
##
## $leaf.mild
## x
## 0 1 2
## 535 20 20
##
## $stem
## x
## 0 1
## 296 371
##
## $lodging
## x
## 0 1
## 520 42
##
## $stem.cankers
## x
## 0 1 2 3
## 379 39 36 191
##
## $canker.lesion
## x
## 0 1 2 3
## 320 83 177 65
##
## $fruiting.bodies
## x
## 0 1
## 473 104
##
## $ext.decay
## x
## 0 1 2
## 497 135 13
##
## $mycelium
## x
## 0 1
## 639 6
##
## $int.discolor
## x
## 0 1 2
## 581 44 20
##
## $sclerotia
## x
## 0 1
## 625 20
##
## $fruit.pods
## x
## 0 1 2 3
## 407 130 14 48
##
## $fruit.spots
## x
## 0 1 2 4
## 345 75 57 100
##
## $seed
## x
## 0 1
## 476 115
##
## $mold.growth
## x
## 0 1
## 524 67
##
## $seed.discolor
## x
## 0 1
## 513 64
##
## $seed.size
## x
## 0 1
## 532 59
##
## $shriveling
## x
## 0 1
## 539 38
##
## $roots
## x
## 0 1 2
## 551 86 15
Some variables have very skewed distributions, indicating potential degeneration:
leaves: Most cases are in category 1 (606 ), mycelium: Category 0 dominates (639 ) sclerotia: Most cases (625) are in category 0.
Now lets see about missing data
missing_data <- sapply(Soybean, function(x) sum(is.na(x)))
missing_data <- data.frame(Predictor = names(missing_data), MissingCount = missing_data)
missing_data <- missing_data[missing_data$MissingCount > 0, ]
print(missing_data)
## Predictor MissingCount
## date date 1
## plant.stand plant.stand 36
## precip precip 38
## temp temp 30
## hail hail 121
## crop.hist crop.hist 16
## area.dam area.dam 1
## sever sever 121
## seed.tmt seed.tmt 121
## germ germ 112
## plant.growth plant.growth 16
## leaf.halo leaf.halo 84
## leaf.marg leaf.marg 84
## leaf.size leaf.size 84
## leaf.shread leaf.shread 100
## leaf.malf leaf.malf 84
## leaf.mild leaf.mild 108
## stem stem 16
## lodging lodging 121
## stem.cankers stem.cankers 38
## canker.lesion canker.lesion 38
## fruiting.bodies fruiting.bodies 106
## ext.decay ext.decay 38
## mycelium mycelium 38
## int.discolor int.discolor 38
## sclerotia sclerotia 38
## fruit.pods fruit.pods 84
## fruit.spots fruit.spots 106
## seed seed 92
## mold.growth mold.growth 92
## seed.discolor seed.discolor 106
## seed.size seed.size 92
## shriveling shriveling 106
## roots roots 31
missing_data$missing_percent <- (missing_data$MissingCount / nrow(Soybean)) * 100
print(missing_data)
## Predictor MissingCount missing_percent
## date date 1 0.1464129
## plant.stand plant.stand 36 5.2708638
## precip precip 38 5.5636896
## temp temp 30 4.3923865
## hail hail 121 17.7159590
## crop.hist crop.hist 16 2.3426061
## area.dam area.dam 1 0.1464129
## sever sever 121 17.7159590
## seed.tmt seed.tmt 121 17.7159590
## germ germ 112 16.3982430
## plant.growth plant.growth 16 2.3426061
## leaf.halo leaf.halo 84 12.2986823
## leaf.marg leaf.marg 84 12.2986823
## leaf.size leaf.size 84 12.2986823
## leaf.shread leaf.shread 100 14.6412884
## leaf.malf leaf.malf 84 12.2986823
## leaf.mild leaf.mild 108 15.8125915
## stem stem 16 2.3426061
## lodging lodging 121 17.7159590
## stem.cankers stem.cankers 38 5.5636896
## canker.lesion canker.lesion 38 5.5636896
## fruiting.bodies fruiting.bodies 106 15.5197657
## ext.decay ext.decay 38 5.5636896
## mycelium mycelium 38 5.5636896
## int.discolor int.discolor 38 5.5636896
## sclerotia sclerotia 38 5.5636896
## fruit.pods fruit.pods 84 12.2986823
## fruit.spots fruit.spots 106 15.5197657
## seed seed 92 13.4699854
## mold.growth mold.growth 92 13.4699854
## seed.discolor seed.discolor 106 15.5197657
## seed.size seed.size 92 13.4699854
## shriveling shriveling 106 15.5197657
## roots roots 31 4.5387994
Lets fidnout if its related to a class by grouping by class
library(dplyr)
class_missing <- Soybean %>%
group_by(Class) %>%
summarise(across(everything(), ~sum(is.na(.)), .names = "Missing_{col}"))
print(class_missing)
## # A tibble: 19 × 36
## Class Missing_date Missing_plant.stand Missing_precip Missing_temp
## <fct> <int> <int> <int> <int>
## 1 2-4-d-injury 1 16 16 16
## 2 alternarialeaf-… 0 0 0 0
## 3 anthracnose 0 0 0 0
## 4 bacterial-blight 0 0 0 0
## 5 bacterial-pustu… 0 0 0 0
## 6 brown-spot 0 0 0 0
## 7 brown-stem-rot 0 0 0 0
## 8 charcoal-rot 0 0 0 0
## 9 cyst-nematode 0 14 14 14
## 10 diaporthe-pod-&… 0 6 0 0
## 11 diaporthe-stem-… 0 0 0 0
## 12 downy-mildew 0 0 0 0
## 13 frog-eye-leaf-s… 0 0 0 0
## 14 herbicide-injury 0 0 8 0
## 15 phyllosticta-le… 0 0 0 0
## 16 phytophthora-rot 0 0 0 0
## 17 powdery-mildew 0 0 0 0
## 18 purple-seed-sta… 0 0 0 0
## 19 rhizoctonia-roo… 0 0 0 0
## # ℹ 31 more variables: Missing_hail <int>, Missing_crop.hist <int>,
## # Missing_area.dam <int>, Missing_sever <int>, Missing_seed.tmt <int>,
## # Missing_germ <int>, Missing_plant.growth <int>, Missing_leaves <int>,
## # Missing_leaf.halo <int>, Missing_leaf.marg <int>, Missing_leaf.size <int>,
## # Missing_leaf.shread <int>, Missing_leaf.malf <int>,
## # Missing_leaf.mild <int>, Missing_stem <int>, Missing_lodging <int>,
## # Missing_stem.cankers <int>, Missing_canker.lesion <int>, …
Lets make a map of missing predictors per class
missing_heatmap <- melt(class_missing, id.vars = "Class")
ggplot(missing_heatmap, aes(x = variable, y = Class)) +
geom_tile(aes(fill = value), color = "white") +
scale_fill_gradient(low = "white", high = "maroon") +
theme_minimal() +
labs(title = "Missing Data Pattern by Class", x = "Predictors", y = "Class")+ theme(axis.text.x = element_text(angle = 45, hjust = 1))
From the output above, we observe that there are several predictors with a significant amount of missing data, while others are fully complete. The classes “2-4-d-injury,” “cyst-nematode,” and “herbicide-injury” exhibit nearly complete missingness for many predictors, whereas “phytophthora-rot” shows a high rate of missing data across multiple predictors.
We could consider removing the classes that have a high rate of missing values, as imputation might be challenging due to complete missingness in certain classes. In this case, I would prefer to use a model that is not sensitive to imbalance or missing predictors, such as Naive Bayes. This approach can help mitigate the impact of the missing data while still allowing for effective classification.. ```