The UC Irvine Machine Learning Repository contains a dataset related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
library(tidyr)    # pivot_longer() and the %>% pipe
library(ggplot2)  # plotting

# Reshape to long format so each predictor gets its own facet
Glass_long <- pivot_longer(Glass, cols = c(RI, Na, Mg, Al, Si, K, Ca, Ba, Fe),
                           names_to = "Predictor", values_to = "Value")

# Histograms show the shape and skewness of each predictor
ggplot(Glass_long, aes(x = Value)) +
  geom_histogram(fill = "lightblue", color = "black", bins = 20) +
  facet_wrap(~ Predictor, scales = "free", ncol = 3) +
  theme_minimal() +
  labs(title = "Histograms of Glass Predictors")

# Boxplots highlight potential outliers
ggplot(Glass_long, aes(y = Value)) +
  geom_boxplot(fill = "lightblue") +
  facet_wrap(~ Predictor, scales = "free", ncol = 3) +
  theme_minimal() +
  labs(title = "Boxplots of Glass Predictors")
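Transformation type:
I applied the Yeo-Johnson transformation to the Glass predictors. Based on the skewness values and plots below, after the transformation:
- Approximately symmetric: Na, Al, K, and Ca (skewness between -0.21 and 0.00)
- Still skewed: Mg (-0.88) and Si (-0.72) to the left; RI (1.60), Fe (1.73), and Ba (3.37) to the right
- Outliers: Al, Ba, Ca, Fe, Na, RI, and Si
The Yeo-Johnson transformation has likely improved the data for modeling by making several predictors closer to symmetric, although RI, Ba, and Fe remain strongly right-skewed.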
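To make the transformation concrete, here is a minimal base-R sketch of the Yeo-Johnson formula for a fixed lambda. The helper name `yeo_johnson` and the toy vector are assumptions made for illustration; this is not caret's implementation, which also estimates lambda from the data.

```r
# Hypothetical helper: Yeo-Johnson transform for a fixed lambda.
# Unlike Box-Cox, it is defined for zero and negative values.
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log1p(y[pos])            # limit as lambda -> 0
  }
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda)
  } else {
    out[!pos] <- -log1p(-y[!pos])        # limit as lambda -> 2
  }
  out
}

y <- c(-2, -0.5, 0, 0.5, 2)              # toy values, including zero
identity_case <- yeo_johnson(y, lambda = 1)  # lambda = 1 leaves values unchanged
log_case <- yeo_johnson(y, lambda = 0)       # compresses the positive values
print(identity_case)
print(log_case)
```

Because the formula accepts zeros, predictors such as Ba and Fe can be passed to it without adding an offset first.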
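library(caret)  # BoxCoxTrans(), preProcess()

# Box-Cox requires strictly positive values, so shift predictors that contain zeros
Glass_transformed <- Glass
for (pred in c("Ba", "Fe", "K")) {
  if (any(Glass[[pred]] <= 0)) {
    Glass_transformed[[pred]] <- Glass[[pred]] + 0.1
  }
  bc_trans <- BoxCoxTrans(Glass_transformed[[pred]])
  Glass_transformed[[pred]] <- predict(bc_trans, Glass_transformed[[pred]])
}

# Yeo-Johnson handles zeros directly; also center and scale all nine predictors
preproc <- preProcess(Glass[, 1:9], method = c("YeoJohnson", "center", "scale"))
transformed_data <- predict(preproc, Glass[, 1:9])

# Skewness after transformation (values near 0 indicate symmetry)
post_skewness <- apply(transformed_data, 2, e1071::skewness)
print(post_skewness)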
## RI Na Mg Al Si
## 1.6027150827 -0.0088476749 -0.8770969306 0.0002128329 -0.7202392108
## K Ca Ba Fe
## -0.0708227694 -0.2063893005 3.3686799688 1.7298107096
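The skewness values above are easier to interpret next to the untransformed ones. The following is a sketch of such a comparison, assuming the mlbench, caret, and e1071 packages are installed; the names `skew_before`, `skew_after`, and `skew_cmp` are introduced here for illustration only.

```r
library(mlbench)
library(caret)
library(e1071)

data(Glass)
glass_predictors <- Glass[, 1:9]

# Skewness of the raw predictors
skew_before <- sapply(glass_predictors, e1071::skewness)

# Skewness after Yeo-Johnson (centering and scaling do not change skewness)
yj <- preProcess(glass_predictors, method = "YeoJohnson")
skew_after <- sapply(predict(yj, glass_predictors), e1071::skewness)

# One row per predictor, before and after
skew_cmp <- data.frame(before = skew_before, after = skew_after)
print(round(skew_cmp, 3))
```

Predictors whose two columns are identical were left untransformed, which helps explain why some of them stay skewed.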
# Reshape the transformed predictors to long format for faceted plots
transformed_long <- as.data.frame(transformed_data) %>%
  pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Value")

ggplot(transformed_long, aes(x = Value)) +
  geom_histogram(fill = "lightgreen", color = "black", bins = 20) +
  facet_wrap(~ Predictor, scales = "free", ncol = 3) +
  theme_minimal() +
  labs(title = "Histograms of Transformed Glass Predictors")

ggplot(transformed_long, aes(y = Value)) +
  geom_boxplot(fill = "lightgreen") +
  facet_wrap(~ Predictor, scales = "free", ncol = 3) +
  theme_minimal() +
  labs(title = "Boxplots of Transformed Glass Predictors")
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
library(mlbench)
data(Soybean)
## See ?Soybean for details
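Based on the near-zero-variance metrics below, the predictors leaf.mild, mycelium, and sclerotia are flagged as near-zero variance: each has a frequency ratio above caret's default cutoff of 19 (95/5) and very few unique values, so nearly all samples share the same level.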
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
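library(caret)  # nearZeroVar()
# Drop the outcome (Class) and keep the factor predictors
predictors <- Soybean[, -1]
categorical_predictors <- predictors[, sapply(predictors, is.factor)]
# Frequency ratio and percent of unique values for each predictor
nzv <- nearZeroVar(categorical_predictors, saveMetrics = TRUE)
print(nzv)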
## freqRatio percentUnique zeroVar nzv
## date 1.137405 1.0248902 FALSE FALSE
## plant.stand 1.208191 0.2928258 FALSE FALSE
## precip 4.098214 0.4392387 FALSE FALSE
## temp 1.879397 0.4392387 FALSE FALSE
## hail 3.425197 0.2928258 FALSE FALSE
## crop.hist 1.004587 0.5856515 FALSE FALSE
## area.dam 1.213904 0.5856515 FALSE FALSE
## sever 1.651282 0.4392387 FALSE FALSE
## seed.tmt 1.373874 0.4392387 FALSE FALSE
## germ 1.103627 0.4392387 FALSE FALSE
## plant.growth 1.951327 0.2928258 FALSE FALSE
## leaves 7.870130 0.2928258 FALSE FALSE
## leaf.halo 1.547511 0.4392387 FALSE FALSE
## leaf.marg 1.615385 0.4392387 FALSE FALSE
## leaf.size 1.479638 0.4392387 FALSE FALSE
## leaf.shread 5.072917 0.2928258 FALSE FALSE
## leaf.malf 12.311111 0.2928258 FALSE FALSE
## leaf.mild 26.750000 0.4392387 FALSE TRUE
## stem 1.253378 0.2928258 FALSE FALSE
## lodging 12.380952 0.2928258 FALSE FALSE
## stem.cankers 1.984293 0.5856515 FALSE FALSE
## canker.lesion 1.807910 0.5856515 FALSE FALSE
## fruiting.bodies 4.548077 0.2928258 FALSE FALSE
## ext.decay 3.681481 0.4392387 FALSE FALSE
## mycelium 106.500000 0.2928258 FALSE TRUE
## int.discolor 13.204545 0.4392387 FALSE FALSE
## sclerotia 31.250000 0.2928258 FALSE TRUE
## fruit.pods 3.130769 0.5856515 FALSE FALSE
## fruit.spots 3.450000 0.5856515 FALSE FALSE
## seed 4.139130 0.2928258 FALSE FALSE
## mold.growth 7.820896 0.2928258 FALSE FALSE
## seed.discolor 8.015625 0.2928258 FALSE FALSE
## seed.size 9.016949 0.2928258 FALSE FALSE
## shriveling 14.184211 0.2928258 FALSE FALSE
## roots 6.406977 0.4392387 FALSE FALSE
flagged_predictors <- rownames(nzv[nzv$nzv == TRUE, ])
print(flagged_predictors)
## [1] "leaf.mild" "mycelium" "sclerotia"
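The freqRatio and percentUnique columns can be reproduced by hand, which makes the flagging rule easier to see. This is a base-R sketch on a toy factor; the helpers `freq_ratio` and `percent_unique` are hypothetical and are not caret's code. The cutoffs used (95/5 and 10) are the defaults of `nearZeroVar()`.

```r
# Hypothetical helpers (not caret's implementation)
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NA by default
  as.numeric(counts[1] / counts[2])            # most common / second most common
}
percent_unique <- function(x) {
  100 * length(unique(x[!is.na(x)])) / length(x)
}

# Toy predictor: 95 zeros, 4 ones, and a single two
x <- factor(c(rep("0", 95), rep("1", 4), "2"))
fr <- freq_ratio(x)      # 95 / 4
pu <- percent_unique(x)  # 3 distinct values in 100 samples

# Flag when both conditions hold
flagged <- fr > 95 / 5 && pu < 10
print(c(freqRatio = fr, percentUnique = pu))
print(flagged)
```

A predictor is flagged only when the dominant level is overwhelming and the number of distinct values is small relative to the sample size.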
Yes, some predictors are more likely to be missing than others. The ten with the highest missing rates (roughly 15% to 18%) are hail, sever, seed.tmt, lodging, germ, leaf.mild, fruiting.bodies, fruit.spots, seed.discolor, and shriveling.
Yes, the pattern of missing data is strongly related to the classes. All missing values occur in five classes: 2-4-d-injury, cyst-nematode, herbicide-injury, phytophthora-rot, and diaporthe-pod-&-stem-blight. The other 14 classes have no missing values.
missing_pct <- colSums(is.na(predictors)) / nrow(predictors) * 100
print(missing_pct[order(missing_pct, decreasing = TRUE)])
## hail sever seed.tmt lodging germ
## 17.7159590 17.7159590 17.7159590 17.7159590 16.3982430
## leaf.mild fruiting.bodies fruit.spots seed.discolor shriveling
## 15.8125915 15.5197657 15.5197657 15.5197657 15.5197657
## leaf.shread seed mold.growth seed.size leaf.halo
## 14.6412884 13.4699854 13.4699854 13.4699854 12.2986823
## leaf.marg leaf.size leaf.malf fruit.pods precip
## 12.2986823 12.2986823 12.2986823 12.2986823 5.5636896
## stem.cankers canker.lesion ext.decay mycelium int.discolor
## 5.5636896 5.5636896 5.5636896 5.5636896 5.5636896
## sclerotia plant.stand roots temp crop.hist
## 5.5636896 5.2708638 4.5387994 4.3923865 2.3426061
## plant.growth stem date area.dam leaves
## 2.3426061 2.3426061 0.1464129 0.1464129 0.0000000
missing_by_class <- tapply(rowSums(is.na(predictors)), Soybean$Class, mean) / ncol(predictors) * 100
print(missing_by_class[order(missing_by_class, decreasing = TRUE)])
## 2-4-d-injury cyst-nematode
## 80.35714 68.57143
## herbicide-injury phytophthora-rot
## 57.14286 39.41558
## diaporthe-pod-&-stem-blight alternarialeaf-spot
## 33.71429 0.00000
## anthracnose bacterial-blight
## 0.00000 0.00000
## bacterial-pustule brown-spot
## 0.00000 0.00000
## brown-stem-rot charcoal-rot
## 0.00000 0.00000
## diaporthe-stem-canker downy-mildew
## 0.00000 0.00000
## frog-eye-leaf-spot phyllosticta-leaf-spot
## 0.00000 0.00000
## powdery-mildew purple-seed-stain
## 0.00000 0.00000
## rhizoctonia-root-rot
## 0.00000
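The percentages above average over predictors within each class. A complementary view is to count how many samples in each class have at least one missing predictor. The sketch below assumes the mlbench package is installed; `incomplete` and `rows_by_class` are names introduced for illustration.

```r
library(mlbench)
data(Soybean)

# TRUE for samples with at least one missing predictor (outcome excluded)
incomplete <- !complete.cases(Soybean[, -1])

# Number of incomplete samples per class
rows_by_class <- tapply(incomplete, Soybean$Class, sum)

# Show only the classes that contain incomplete samples
print(rows_by_class[rows_by_class > 0])

# Class sizes for comparison
print(table(Soybean$Class)[names(rows_by_class[rows_by_class > 0])])
```

Comparing the two printed vectors shows whether every sample in an affected class is incomplete or only some of them, which matters when deciding between dropping rows and imputing.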
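To handle the missing data, I combine predictor removal with imputation. First, the near-zero-variance predictors and the predictors with more than 15% missing values are removed. Then the remaining missing values are imputed with MICE using predictive mean matching. This simplifies the dataset and gives a complete set of predictors for modeling, which should reduce the bias that dropping incomplete samples would introduce.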
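library(mice)  # mice(), complete()

response <- Soybean[, 1]

# Remove near-zero-variance predictors
nzv_predictors <- rownames(nzv[nzv$nzv == TRUE, ])
clean_predictors <- predictors[, !names(predictors) %in% nzv_predictors]

# Remove predictors with more than 15% missing values
high_missing <- names(missing_pct[missing_pct > 15])
clean_predictors <- clean_predictors[, !names(clean_predictors) %in% high_missing]

# Impute the remaining missing values with predictive mean matching
imp <- mice(clean_predictors, m = 5, method = "pmm", seed = 123)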
##
## iter imp variable
## 1 1 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 1 2 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 1 3 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 1 4 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 1 5 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 2 1 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 2 2 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 2 3 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 2 4 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 2 5 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 3 1 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 3 2 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 3 3 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 3 4 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 3 5 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 4 1 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 4 2 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 4 3 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 4 4 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 4 5 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 5 1 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 5 2 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 5 3 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 5 4 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## 5 5 date plant.stand precip temp crop.hist area.dam plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth seed.size roots
## Warning: Number of logged events: 515
imputed_predictors <- complete(imp)
imputed_soybean <- cbind(response, imputed_predictors)
str(imputed_soybean)
## 'data.frame': 683 obs. of 24 variables:
## $ response : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion: Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
print(colSums(is.na(imputed_soybean)) / nrow(imputed_soybean) * 100)
## response date plant.stand precip temp
## 0 0 0 0 0
## crop.hist area.dam plant.growth leaves leaf.halo
## 0 0 0 0 0
## leaf.marg leaf.size leaf.shread leaf.malf stem
## 0 0 0 0 0
## stem.cankers canker.lesion ext.decay int.discolor fruit.pods
## 0 0 0 0 0
## seed mold.growth seed.size roots
## 0 0 0 0
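Note that `complete(imp)` returns only the first of the m = 5 imputed datasets by default. The sketch below shows how the other imputations and the logged events could be inspected. It uses the small `nhanes` dataset that ships with mice rather than the soybean data, and the object names (`imp_demo`, `first_imputation`, `all_imputations`) are assumptions made for this example.

```r
library(mice)

# Small demo dataset bundled with mice; printFlag = FALSE hides the iteration log
imp_demo <- mice(nhanes, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Default behaviour: the first imputed dataset only
first_imputation <- complete(imp_demo, 1)

# All five imputed datasets stacked in long format (.imp identifies the imputation)
all_imputations <- complete(imp_demo, action = "long")

# Logged events record predictors or steps that mice adjusted; NULL if there were none
print(imp_demo$loggedEvents)
print(table(all_imputations$.imp))
```

Checking the logged events is a reasonable follow-up when mice reports a warning about them, as it does for the soybean predictors.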