This section explores the distributions of predictor variables in the Glass dataset and examines relationships between them. Understanding these distributions helps in identifying skewness, outliers, and potential correlations that can impact classification performance.
# Load required libraries
library(mlbench)
library(ggplot2)
library(corrplot)
## corrplot 0.95 loaded
library(car)
## Loading required package: carData
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(e1071)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# Load Glass dataset
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(Glass)
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe Type
## Min. :0.00000 1:70
## 1st Qu.:0.00000 2:76
## Median :0.00000 3:17
## Mean :0.05701 5:13
## 3rd Qu.:0.10000 6: 9
## Max. :0.51000 7:29
par(mfrow = c(3, 3))
# Generate histograms for each predictor variable
for (col in names(Glass)[1:9]) {
hist(Glass[[col]], main = paste("Histogram of", col), col = "lightblue", border = "black")
}
par(mfrow = c(1, 1))
# Boxplots for each predictor
boxplot(Glass[, 1:9], main = "Boxplots of Predictors", las = 2, col = "lightblue")
ggpairs(Glass, columns = 1:9, ggplot2::aes(color = Type))
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
# Correlation heatmap
cor_matrix <- cor(Glass[, 1:9])
corrplot(cor_matrix, method = "circle", type = "lower", diag = FALSE)
Conclusion: This analysis highlights the distributions of predictor variables and relationships between them. Certain predictors show potential skewness and strong correlations, which may influence classification models.
This section identifies outliers and skewness in the dataset, which can affect model performance.
identify_outliers <- function(x) {
Q1 <- quantile(x, 0.25)
Q3 <- quantile(x, 0.75)
IQR <- Q3 - Q1
outliers <- sum(x < (Q1 - 1.5 * IQR) | x > (Q3 + 1.5 * IQR))
return(outliers)
}
# Apply the function to each predictor
outlier_counts <- sapply(Glass[, 1:9], identify_outliers)
print(outlier_counts)
## RI Na Mg Al Si K Ca Ba Fe
## 17 7 0 18 12 7 26 38 12
# Compute skewness for each predictor
skew_values <- sapply(Glass[, 1:9], skewness)
print(skew_values)
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
# Identify highly skewed predictors
skewed_predictors <- names(skew_values[abs(skew_values) > 1])
print(skewed_predictors)
## [1] "RI" "Mg" "K" "Ca" "Ba" "Fe"
Conclusion: Several predictors show high skewness, which may require transformation before classification to ensure a more normal distribution.
To address skewness, we apply log and Box-Cox transformations to improve model performance.
Glass_transformed <- Glass
# Apply log transformation to skewed predictors
Glass_transformed[, skewed_predictors] <- log(Glass[, skewed_predictors] + 1)
# Visualize
par(mfrow = c(3, 3))
for (col in skewed_predictors) {
hist(Glass_transformed[[col]], main = paste("Log-Transformed", col), col = "lightblue", border = "black")
}
par(mfrow = c(1, 1))
Glass_transformed <- Glass
for (col in skewed_predictors) {
x <- Glass[[col]]
shift_constant <- ifelse(min(x) <= 0, abs(min(x)) + 1, 0)
x_shifted <- x + shift_constant
# Apply Box-Cox transformation
boxcox_transform <- boxcox(lm(x_shifted ~ 1), lambda = seq(-2, 2, by = 0.1))
# Find optimal lambda
best_lambda <- boxcox_transform$x[which.max(boxcox_transform$y)]
# Apply transformation based on best lambda
if (best_lambda == 0) {
Glass_transformed[[col]] <- log(x_shifted)
} else {
Glass_transformed[[col]] <- (x_shifted^best_lambda - 1) / best_lambda
}
}
# Visualizing distributions after Box-Cox transformation
par(mfrow = c(3, 3))
for (col in skewed_predictors) {
hist(Glass_transformed[[col]], main = paste("Box-Cox Transformed", col), col = "lightblue", border = "black")
}
par(mfrow = c(1, 1))
Conclusion: Applying transformations significantly reduces skewness, making the data more suitable for classification models.
# Load Soybean dataset
data(Soybean)
# Display structure and summary statistics
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
cat_cols <- names(Soybean)[sapply(Soybean, is.factor)]
cat_freqs <- lapply(Soybean[cat_cols], table)
# Display frequency distributions
cat_freqs
## $Class
##
## 2-4-d-injury alternarialeaf-spot
## 16 91
## anthracnose bacterial-blight
## 44 20
## bacterial-pustule brown-spot
## 20 92
## brown-stem-rot charcoal-rot
## 44 20
## cyst-nematode diaporthe-pod-&-stem-blight
## 14 15
## diaporthe-stem-canker downy-mildew
## 20 20
## frog-eye-leaf-spot herbicide-injury
## 91 8
## phyllosticta-leaf-spot phytophthora-rot
## 20 88
## powdery-mildew purple-seed-stain
## 20 20
## rhizoctonia-root-rot
## 20
##
## $date
##
## 0 1 2 3 4 5 6
## 26 75 93 118 131 149 90
##
## $plant.stand
##
## 0 1
## 354 293
##
## $precip
##
## 0 1 2
## 74 112 459
##
## $temp
##
## 0 1 2
## 80 374 199
##
## $hail
##
## 0 1
## 435 127
##
## $crop.hist
##
## 0 1 2 3
## 65 165 219 218
##
## $area.dam
##
## 0 1 2 3
## 123 227 145 187
##
## $sever
##
## 0 1 2
## 195 322 45
##
## $seed.tmt
##
## 0 1 2
## 305 222 35
##
## $germ
##
## 0 1 2
## 165 213 193
##
## $plant.growth
##
## 0 1
## 441 226
##
## $leaves
##
## 0 1
## 77 606
##
## $leaf.halo
##
## 0 1 2
## 221 36 342
##
## $leaf.marg
##
## 0 1 2
## 357 21 221
##
## $leaf.size
##
## 0 1 2
## 51 327 221
##
## $leaf.shread
##
## 0 1
## 487 96
##
## $leaf.malf
##
## 0 1
## 554 45
##
## $leaf.mild
##
## 0 1 2
## 535 20 20
##
## $stem
##
## 0 1
## 296 371
##
## $lodging
##
## 0 1
## 520 42
##
## $stem.cankers
##
## 0 1 2 3
## 379 39 36 191
##
## $canker.lesion
##
## 0 1 2 3
## 320 83 177 65
##
## $fruiting.bodies
##
## 0 1
## 473 104
##
## $ext.decay
##
## 0 1 2
## 497 135 13
##
## $mycelium
##
## 0 1
## 639 6
##
## $int.discolor
##
## 0 1 2
## 581 44 20
##
## $sclerotia
##
## 0 1
## 625 20
##
## $fruit.pods
##
## 0 1 2 3
## 407 130 14 48
##
## $fruit.spots
##
## 0 1 2 4
## 345 75 57 100
##
## $seed
##
## 0 1
## 476 115
##
## $mold.growth
##
## 0 1
## 524 67
##
## $seed.discolor
##
## 0 1
## 513 64
##
## $seed.size
##
## 0 1
## 532 59
##
## $shriveling
##
## 0 1
## 539 38
##
## $roots
##
## 0 1 2
## 551 86 15
missing_counts <- colSums(is.na(Soybean))
missing_percents <- missing_counts / nrow(Soybean) * 100
missing_data <- data.frame(Predictor = names(missing_counts), Missing_Percent = missing_percents)
missing_data <- missing_data[order(-missing_data$Missing_Percent), ]
print(missing_data)
## Predictor Missing_Percent
## hail hail 17.7159590
## sever sever 17.7159590
## seed.tmt seed.tmt 17.7159590
## lodging lodging 17.7159590
## germ germ 16.3982430
## leaf.mild leaf.mild 15.8125915
## fruiting.bodies fruiting.bodies 15.5197657
## fruit.spots fruit.spots 15.5197657
## seed.discolor seed.discolor 15.5197657
## shriveling shriveling 15.5197657
## leaf.shread leaf.shread 14.6412884
## seed seed 13.4699854
## mold.growth mold.growth 13.4699854
## seed.size seed.size 13.4699854
## leaf.halo leaf.halo 12.2986823
## leaf.marg leaf.marg 12.2986823
## leaf.size leaf.size 12.2986823
## leaf.malf leaf.malf 12.2986823
## fruit.pods fruit.pods 12.2986823
## precip precip 5.5636896
## stem.cankers stem.cankers 5.5636896
## canker.lesion canker.lesion 5.5636896
## ext.decay ext.decay 5.5636896
## mycelium mycelium 5.5636896
## int.discolor int.discolor 5.5636896
## sclerotia sclerotia 5.5636896
## plant.stand plant.stand 5.2708638
## roots roots 4.5387994
## temp temp 4.3923865
## crop.hist crop.hist 2.3426061
## plant.growth plant.growth 2.3426061
## stem stem 2.3426061
## date date 0.1464129
## area.dam area.dam 0.1464129
## Class Class 0.0000000
## leaves leaves 0.0000000
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
Soybean_imputed <- kNN(Soybean, k = 5)
sum(is.na(Soybean_imputed))
## [1] 0
# Load Soybean dataset
data(Soybean)
# Display structure and summary statistics
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
cat_cols <- names(Soybean)[sapply(Soybean, is.factor)]
# Count unique values for each categorical variable
cat_freqs <- lapply(Soybean[cat_cols], table)
# Display frequency distributions
cat_freqs
## $Class
##
## 2-4-d-injury alternarialeaf-spot
## 16 91
## anthracnose bacterial-blight
## 44 20
## bacterial-pustule brown-spot
## 20 92
## brown-stem-rot charcoal-rot
## 44 20
## cyst-nematode diaporthe-pod-&-stem-blight
## 14 15
## diaporthe-stem-canker downy-mildew
## 20 20
## frog-eye-leaf-spot herbicide-injury
## 91 8
## phyllosticta-leaf-spot phytophthora-rot
## 20 88
## powdery-mildew purple-seed-stain
## 20 20
## rhizoctonia-root-rot
## 20
##
## $date
##
## 0 1 2 3 4 5 6
## 26 75 93 118 131 149 90
##
## $plant.stand
##
## 0 1
## 354 293
##
## $precip
##
## 0 1 2
## 74 112 459
##
## $temp
##
## 0 1 2
## 80 374 199
##
## $hail
##
## 0 1
## 435 127
##
## $crop.hist
##
## 0 1 2 3
## 65 165 219 218
##
## $area.dam
##
## 0 1 2 3
## 123 227 145 187
##
## $sever
##
## 0 1 2
## 195 322 45
##
## $seed.tmt
##
## 0 1 2
## 305 222 35
##
## $germ
##
## 0 1 2
## 165 213 193
##
## $plant.growth
##
## 0 1
## 441 226
##
## $leaves
##
## 0 1
## 77 606
##
## $leaf.halo
##
## 0 1 2
## 221 36 342
##
## $leaf.marg
##
## 0 1 2
## 357 21 221
##
## $leaf.size
##
## 0 1 2
## 51 327 221
##
## $leaf.shread
##
## 0 1
## 487 96
##
## $leaf.malf
##
## 0 1
## 554 45
##
## $leaf.mild
##
## 0 1 2
## 535 20 20
##
## $stem
##
## 0 1
## 296 371
##
## $lodging
##
## 0 1
## 520 42
##
## $stem.cankers
##
## 0 1 2 3
## 379 39 36 191
##
## $canker.lesion
##
## 0 1 2 3
## 320 83 177 65
##
## $fruiting.bodies
##
## 0 1
## 473 104
##
## $ext.decay
##
## 0 1 2
## 497 135 13
##
## $mycelium
##
## 0 1
## 639 6
##
## $int.discolor
##
## 0 1 2
## 581 44 20
##
## $sclerotia
##
## 0 1
## 625 20
##
## $fruit.pods
##
## 0 1 2 3
## 407 130 14 48
##
## $fruit.spots
##
## 0 1 2 4
## 345 75 57 100
##
## $seed
##
## 0 1
## 476 115
##
## $mold.growth
##
## 0 1
## 524 67
##
## $seed.discolor
##
## 0 1
## 513 64
##
## $seed.size
##
## 0 1
## 532 59
##
## $shriveling
##
## 0 1
## 539 38
##
## $roots
##
## 0 1 2
## 551 86 15
Understanding the frequency distributions of categorical predictors is crucial in assessing the variability of the data and identifying degenerate distributions. A degenerate distribution occurs when a predictor has little to no variability, meaning it has only one unique value or is highly imbalanced. Such predictors do not provide useful information for classification and may need to be removed.
To determine if any categorical predictor is degenerate, we count the number of unique values in each categorical column. If a predictor has only one unique value or is heavily skewed towards a single category, it might not contribute meaningfully to the model and should be reconsidered for inclusion.
missing_counts <- colSums(is.na(Soybean))
missing_percents <- missing_counts / nrow(Soybean) * 100
missing_data <- data.frame(Predictor = names(missing_counts), Missing_Percent = missing_percents)
missing_data <- missing_data[order(-missing_data$Missing_Percent), ]
print(missing_data)
## Predictor Missing_Percent
## hail hail 17.7159590
## sever sever 17.7159590
## seed.tmt seed.tmt 17.7159590
## lodging lodging 17.7159590
## germ germ 16.3982430
## leaf.mild leaf.mild 15.8125915
## fruiting.bodies fruiting.bodies 15.5197657
## fruit.spots fruit.spots 15.5197657
## seed.discolor seed.discolor 15.5197657
## shriveling shriveling 15.5197657
## leaf.shread leaf.shread 14.6412884
## seed seed 13.4699854
## mold.growth mold.growth 13.4699854
## seed.size seed.size 13.4699854
## leaf.halo leaf.halo 12.2986823
## leaf.marg leaf.marg 12.2986823
## leaf.size leaf.size 12.2986823
## leaf.malf leaf.malf 12.2986823
## fruit.pods fruit.pods 12.2986823
## precip precip 5.5636896
## stem.cankers stem.cankers 5.5636896
## canker.lesion canker.lesion 5.5636896
## ext.decay ext.decay 5.5636896
## mycelium mycelium 5.5636896
## int.discolor int.discolor 5.5636896
## sclerotia sclerotia 5.5636896
## plant.stand plant.stand 5.2708638
## roots roots 4.5387994
## temp temp 4.3923865
## crop.hist crop.hist 2.3426061
## plant.growth plant.growth 2.3426061
## stem stem 2.3426061
## date date 0.1464129
## area.dam area.dam 0.1464129
## Class Class 0.0000000
## leaves leaves 0.0000000
library(VIM)
Soybean_imputed <- kNN(Soybean, k = 5)
sum(is.na(Soybean_imputed))
## [1] 0
Conclusion: Handling missing data through imputation or removing highly missing predictors ensures better data integrity and model performance. Imputation helps preserve valuable information that would otherwise be lost if entire rows or columns were removed. This is particularly important when working with small or imbalanced datasets, where every data point contributes to the overall learning process. Additionally, appropriate handling of missing data prevents biases in model training and improves the robustness of predictive analytics. By using imputation techniques such as mode imputation or kNN-based imputation, we can maintain consistency in dataset structure while minimizing the risk of information loss, ultimately leading to more accurate and reliable models.