# Load the Glass data
data(Glass)
Glass <- Glass
# Basic structure of the dataset
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
# Plotting pairwise relationships between predictors
# And excluding the Type variable as it's the target variable
ggpairs(Glass[, -10]) + theme_bw()
# Univariate plots for individual predictors
Glass_long <- reshape2::melt(Glass[, -10])
ggplot(Glass_long, aes(value)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
facet_wrap(~variable, scales = "free") +
theme_bw() +
labs(title = "Distribution of Predictor Variables")
The visualizations of the predictor variables show the distributions and
relationships between the nine predictors. From the univariate plots,
several variables appear to follow relatively normal distributions,
though some variables like , , , and display significant skewness, with
most values concentrated towards the lower end.
The pair plot (scatter plot matrix) provides an overview of the relationships between these predictors. For example: - shows a slight negative correlation with , , , , and . - The variable exhibits a positive correlation with and , which might indicate a link between these elements in certain types of glass.
# Boxplots to identify outliers
ggplot(Glass_long, aes(x = variable, y = value)) +
geom_boxplot(fill = "lightgreen") +
theme_bw() +
labs(title = "Boxplots of Predictor Variables")
# Checking skewness for each predictor
library(e1071)
skewness_values <- apply(Glass[, -10], 2, skewness)
skewness_values
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
: Based on the boxplots, there are notable outliers in variables such as , , , , and . The distribution of these variables shows some extreme values that might represent specific glass types or experimental errors.
: Variables like , , , and are particularly skewed. For example: - is negatively skewed with a large number of data points clustered at the high end (around 3.5–4.5) and very few observations in the lower range. - , , and are positively skewed, with most values concentrated near zero and some extreme outliers at the higher end.
The presence of skewness suggests that transformations might be needed to improve the predictive modeling.
# Loading necessary library
library(caret)
# Pre-process the Glass data using the 'spatialSign' method
preProc <- preProcess(Glass[, -10], method = "spatialSign")
# Applying the transformation to the dataset
Glass_transformed_ss <- predict(preProc, Glass[, -10])
# Viewing the transformed data
head(Glass_transformed_ss)
## RI Na Mg Al Si K
## 1 0.3862138 0.12608198 0.5551342 -0.3063816 -0.49869486 -0.29720647
## 2 -0.1782866 0.42318047 0.4548937 -0.1218880 0.07316353 -0.01874385
## 3 -0.4739625 0.09851782 0.3951817 0.1254439 0.28831726 -0.10811108
## 4 -0.1932662 -0.20158533 0.5799799 -0.2581474 -0.04397201 0.09305653
## 5 -0.2268925 -0.12303180 0.4726731 -0.2991170 0.40373539 0.05916450
## 6 -0.3036441 -0.29004891 0.2459643 0.1344417 0.15791804 0.08402131
## Ca Ba Fe
## 1 -0.06449657 -0.1561358 -0.2594843
## 2 -0.56756147 -0.2523255 -0.4193433
## 3 -0.54468469 -0.2318677 -0.3853442
## 4 -0.43085072 -0.2929133 -0.4867968
## 5 -0.45422808 -0.2565822 -0.4264175
## 6 -0.23892010 -0.1349600 0.7986264
# Plotting distributions after spatialSign transformation
Glass_long_ss <- reshape2::melt(Glass_transformed_ss)
ggplot(Glass_long_ss, aes(value)) +
geom_histogram(bins = 30, fill = "lightblue", color = "black") +
facet_wrap(~variable, scales = "free") +
theme_bw() +
labs(title = "Distributions After SpatialSign Transformation")
The transformation was applied to all predictors because it normalizes
multivariate data, mitigating the effect of outliers and skewed
distributions. Given that several predictors, such as , , , and , show
significant skewness, this transformation helps ensure that all
variables are on a comparable scale, which can improve the performance
of classification models by making the data more robust and
consistent.
# Load the Soybean dataset
data(Soybean)
# Summary of the dataset structure
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
# Investigate frequency distributions for categorical variables
# Using lapply to check the frequency of levels for each categorical predictor
freq_distributions <- lapply(Soybean, table)
freq_distributions
## $Class
##
## 2-4-d-injury alternarialeaf-spot
## 16 91
## anthracnose bacterial-blight
## 44 20
## bacterial-pustule brown-spot
## 20 92
## brown-stem-rot charcoal-rot
## 44 20
## cyst-nematode diaporthe-pod-&-stem-blight
## 14 15
## diaporthe-stem-canker downy-mildew
## 20 20
## frog-eye-leaf-spot herbicide-injury
## 91 8
## phyllosticta-leaf-spot phytophthora-rot
## 20 88
## powdery-mildew purple-seed-stain
## 20 20
## rhizoctonia-root-rot
## 20
##
## $date
##
## 0 1 2 3 4 5 6
## 26 75 93 118 131 149 90
##
## $plant.stand
##
## 0 1
## 354 293
##
## $precip
##
## 0 1 2
## 74 112 459
##
## $temp
##
## 0 1 2
## 80 374 199
##
## $hail
##
## 0 1
## 435 127
##
## $crop.hist
##
## 0 1 2 3
## 65 165 219 218
##
## $area.dam
##
## 0 1 2 3
## 123 227 145 187
##
## $sever
##
## 0 1 2
## 195 322 45
##
## $seed.tmt
##
## 0 1 2
## 305 222 35
##
## $germ
##
## 0 1 2
## 165 213 193
##
## $plant.growth
##
## 0 1
## 441 226
##
## $leaves
##
## 0 1
## 77 606
##
## $leaf.halo
##
## 0 1 2
## 221 36 342
##
## $leaf.marg
##
## 0 1 2
## 357 21 221
##
## $leaf.size
##
## 0 1 2
## 51 327 221
##
## $leaf.shread
##
## 0 1
## 487 96
##
## $leaf.malf
##
## 0 1
## 554 45
##
## $leaf.mild
##
## 0 1 2
## 535 20 20
##
## $stem
##
## 0 1
## 296 371
##
## $lodging
##
## 0 1
## 520 42
##
## $stem.cankers
##
## 0 1 2 3
## 379 39 36 191
##
## $canker.lesion
##
## 0 1 2 3
## 320 83 177 65
##
## $fruiting.bodies
##
## 0 1
## 473 104
##
## $ext.decay
##
## 0 1 2
## 497 135 13
##
## $mycelium
##
## 0 1
## 639 6
##
## $int.discolor
##
## 0 1 2
## 581 44 20
##
## $sclerotia
##
## 0 1
## 625 20
##
## $fruit.pods
##
## 0 1 2 3
## 407 130 14 48
##
## $fruit.spots
##
## 0 1 2 4
## 345 75 57 100
##
## $seed
##
## 0 1
## 476 115
##
## $mold.growth
##
## 0 1
## 524 67
##
## $seed.discolor
##
## 0 1
## 513 64
##
## $seed.size
##
## 0 1
## 532 59
##
## $shriveling
##
## 0 1
## 539 38
##
## $roots
##
## 0 1 2
## 551 86 15
# Check for degenerate distributions (those with most values in a single category)
degenerate_vars <- sapply(Soybean, function(x) max(table(x)) / length(x))
degenerate_vars
## Class date plant.stand precip temp
## 0.1346999 0.2181552 0.5183016 0.6720351 0.5475842
## hail crop.hist area.dam sever seed.tmt
## 0.6368960 0.3206442 0.3323572 0.4714495 0.4465593
## germ plant.growth leaves leaf.halo leaf.marg
## 0.3118594 0.6456808 0.8872621 0.5007321 0.5226940
## leaf.size leaf.shread leaf.malf leaf.mild stem
## 0.4787701 0.7130307 0.8111274 0.7833089 0.5431918
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0.7613470 0.5549048 0.4685212 0.6925329 0.7276720
## mycelium int.discolor sclerotia fruit.pods fruit.spots
## 0.9355783 0.8506589 0.9150805 0.5959004 0.5051245
## seed mold.growth seed.discolor seed.size shriveling
## 0.6969253 0.7672035 0.7510981 0.7789165 0.7891654
## roots
## 0.8067350
cat(” \textbf{Breakdown of Variables and Frequency Distributions:}
\begin{itemize} \item \$shriveling: \begin{itemize} \item 0: 539 occurrences \item 1: 38 occurrences \item Proportion of dominant category (0): \(\\frac{539}{683} \\approx 0.79\) \item Not degenerate, but highly skewed towards 0. \end{itemize}
\item \$roots: \begin{itemize} \item 0: 551 occurrences \item 1: 86 occurrences \item 2: 15 occurrences \item Proportion of dominant category (0): \(\\frac{551}{683} \\approx 0.81\) \item Not degenerate, but skewed. \end{itemize}
\item \$seed: \begin{itemize} \item 0: 476 occurrences \item 1: 115 occurrences \item Proportion of dominant category (0): \(\\frac{476}{683} \\approx 0.70\) \item Not degenerate, but skewed. \end{itemize}
\item \$mold.growth: \begin{itemize} \item 0: 524 occurrences \item 1: 67 occurrences \item Proportion of dominant category (0): \(\\frac{524}{683} \\approx 0.77\) \item Not degenerate, but skewed. \end{itemize}
\item \$mycelium: \begin{itemize} \item 0: 639 occurrences \item 1: 6 occurrences \item Proportion of dominant category (0): \(\\frac{639}{683} \\approx 0.94\) \item \textbf{Degenerate variable} (since the proportion exceeds 90\%). \end{itemize}
\item \$int.discolor: \begin{itemize} \item 0: 581 occurrences \item 1: 44 occurrences \item 2: 20 occurrences \item Proportion of dominant category (0): \(\\frac{581}{683} \\approx 0.85\) \item Not degenerate, but highly skewed. \end{itemize}
\item \$sclerotia: \begin{itemize} \item 0: 625 occurrences \item 1: 20 occurrences \item Proportion of dominant category (0): \(\\frac{625}{683} \\approx 0.91\) \item \textbf{Degenerate variable} (exceeds 90\%). \end{itemize} \end{itemize}
\textbf{Summary of Degenerate Variables:} \begin{itemize} \item \\(mycelium: 93.6\\% of values are in category 0. \\item \\\)sclerotia: 91.5\% of values are in category 0. \item These two variables are degenerate, as more than 90\% of their values fall into a single category. \end{itemize}
Other variables like \\(shriveling, \\\)roots, and \$mold.growth show skewness but are not considered degenerate by the 90\% threshold. “)
# Investigating missing data
missing_data_summary <- colSums(is.na(Soybean))
missing_data_summary
## Class date plant.stand precip temp
## 0 1 36 38 30
## hail crop.hist area.dam sever seed.tmt
## 121 16 1 121 121
## germ plant.growth leaves leaf.halo leaf.marg
## 112 16 0 84 84
## leaf.size leaf.shread leaf.malf leaf.mild stem
## 84 100 84 108 16
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 121 38 38 106 38
## mycelium int.discolor sclerotia fruit.pods fruit.spots
## 38 38 38 84 106
## seed mold.growth seed.discolor seed.size shriveling
## 92 92 106 92 106
## roots
## 31
# Visualizing missing data pattern
library(VIM)
aggr_plot <- aggr(Soybean, col = c('navyblue', 'red'), numbers = TRUE, sortVars = TRUE, labels = names(Soybean), cex.axis = .7, gap = 3, ylab = c("Missing data", "Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## hail 0.177159590
## sever 0.177159590
## seed.tmt 0.177159590
## lodging 0.177159590
## germ 0.163982430
## leaf.mild 0.158125915
## fruiting.bodies 0.155197657
## fruit.spots 0.155197657
## seed.discolor 0.155197657
## shriveling 0.155197657
## leaf.shread 0.146412884
## seed 0.134699854
## mold.growth 0.134699854
## seed.size 0.134699854
## leaf.halo 0.122986823
## leaf.marg 0.122986823
## leaf.size 0.122986823
## leaf.malf 0.122986823
## fruit.pods 0.122986823
## precip 0.055636896
## stem.cankers 0.055636896
## canker.lesion 0.055636896
## ext.decay 0.055636896
## mycelium 0.055636896
## int.discolor 0.055636896
## sclerotia 0.055636896
## plant.stand 0.052708638
## roots 0.045387994
## temp 0.043923865
## crop.hist 0.023426061
## plant.growth 0.023426061
## stem 0.023426061
## date 0.001464129
## area.dam 0.001464129
## Class 0.000000000
## leaves 0.000000000
# Checking if missing data is related to class
missing_by_class <- Soybean %>%
group_by(Class) %>%
summarise_all(~sum(is.na(.)))
missing_by_class
## # A tibble: 19 × 36
## Class date plant.stand precip temp hail crop.hist area.dam sever seed.tmt
## <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 2-4-d… 1 16 16 16 16 16 1 16 16
## 2 alter… 0 0 0 0 0 0 0 0 0
## 3 anthr… 0 0 0 0 0 0 0 0 0
## 4 bacte… 0 0 0 0 0 0 0 0 0
## 5 bacte… 0 0 0 0 0 0 0 0 0
## 6 brown… 0 0 0 0 0 0 0 0 0
## 7 brown… 0 0 0 0 0 0 0 0 0
## 8 charc… 0 0 0 0 0 0 0 0 0
## 9 cyst-… 0 14 14 14 14 0 0 14 14
## 10 diapo… 0 6 0 0 15 0 0 15 15
## 11 diapo… 0 0 0 0 0 0 0 0 0
## 12 downy… 0 0 0 0 0 0 0 0 0
## 13 frog-… 0 0 0 0 0 0 0 0 0
## 14 herbi… 0 0 8 0 8 0 0 8 8
## 15 phyll… 0 0 0 0 0 0 0 0 0
## 16 phyto… 0 0 0 0 68 0 0 68 68
## 17 powde… 0 0 0 0 0 0 0 0 0
## 18 purpl… 0 0 0 0 0 0 0 0 0
## 19 rhizo… 0 0 0 0 0 0 0 0 0
## # ℹ 26 more variables: germ <int>, plant.growth <int>, leaves <int>,
## # leaf.halo <int>, leaf.marg <int>, leaf.size <int>, leaf.shread <int>,
## # leaf.malf <int>, leaf.mild <int>, stem <int>, lodging <int>,
## # stem.cankers <int>, canker.lesion <int>, fruiting.bodies <int>,
## # ext.decay <int>, mycelium <int>, int.discolor <int>, sclerotia <int>,
## # fruit.pods <int>, fruit.spots <int>, seed <int>, mold.growth <int>,
## # seed.discolor <int>, seed.size <int>, shriveling <int>, roots <int>
\ The variables with the highest proportions of missing values are , , , and , each with approximately 17.7% missing data. \ Variables such as , , and also have over 15% of their data missing. \ The distribution of missing data across various variables shows some patterns, as visualized in the plots, where some variables exhibit clusters of missingness in the dataset.
\ The heatmap of the missing data pattern indicates that certain combinations of variables have missing data together, which might point to systematic missingness related to specific conditions.
# Loading necessary libraries
library(caret)
library(dplyr)
# Separating numeric and categorical variables
numericVars <- Soybean[, sapply(Soybean, is.numeric)]
categoricalVars <- Soybean[, sapply(Soybean, is.factor)]
# Imputation: numeric variables using median impute
preProcNumeric <- preProcess(numericVars, method = "medianImpute")
numeric_imputed <- predict(preProcNumeric, numericVars)
# Mode imputation for categorical variables
mode_impute <- function(x) {
x[is.na(x)] <- names(which.max(table(x)))
return(x)
}
categorical_imputed <- categoricalVars %>% mutate_all(mode_impute)
# Combining the imputed datasets
Soybean_imputed <- cbind(numeric_imputed, categorical_imputed)
# Checking the dataset after imputation
sum(is.na(Soybean_imputed))
## [1] 0