library(mlbench)
library(ggplot2)
library(corrplot)
corrplot 0.95 loaded
library(tidyr)

# Exercise 3.1
## a
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
data(Glass)
cor_matrix <- cor(Glass[, -10])
corrplot(cor_matrix, method = "color", addCoef.col = "black", tl.col = "black")

Glass %>%
  pivot_longer(cols = -Type, names_to = "Predictor", values_to = "Value") %>%
  ggplot(aes(x = Type, y = Value, fill = Type)) +
  geom_boxplot() +
  facet_wrap(~Predictor, scales = "free_y") +
  theme_minimal() +
  labs(title = "Predictor Distributions by Glass Type")

Do there appear to be any outliers in the data? Are any predictors skewed?
Most of the predictors look noticeably skewed. Ba (barium), Fe (iron), and K (potassium) are strongly right-skewed. Most samples contain almost no Ba, but a few contain extreme amounts. Potassium has two samples with very high levels, which stand out as outliers.
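
A numeric check can back up the visual impression. The following is a minimal sketch in base R, assuming a moment-based skewness estimate and the usual 1.5 x IQR boxplot rule for flagging candidate outliers. The helper name `sample_skewness` is hypothetical and introduced only for illustration; it is not part of the original analysis.

```r
library(mlbench)
data(Glass)

# Hypothetical helper: moment-based sample skewness
# (third central moment divided by the 1.5 power of the second central moment).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / (mean(centered^2)^1.5)
}

# All columns except the class label
predictors <- Glass[, names(Glass) != "Type"]

# Skewness per predictor, largest first; positive values indicate right skew
skew_values <- sort(sapply(predictors, sample_skewness), decreasing = TRUE)
print(round(skew_values, 2))

# Number of points outside the boxplot whiskers (1.5 * IQR rule) per predictor
outlier_counts <- sapply(predictors, function(x) length(boxplot.stats(x)$out))
print(outlier_counts)
```

Predictors with large positive skewness and many points beyond the whiskers are the ones to look at first when choosing transformations.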
Are there any relevant transformations of one or more predictors that might improve the classification model?
The Glass dataset contains several predictors with high skewness and differing scales. I think applying transformations is close to mandatory before fitting any linear or distance-based model. To reduce the influence of the extreme outliers, I believe the spatial sign transformation would be helpful. To deal with the skewness, we can use a Box-Cox transformation.
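
One way to apply these transformations together is through caret's `preProcess`. This is a sketch under assumptions rather than the analysis actually run here: centering and scaling are added because the spatial sign is sensitive to predictor scale, and Box-Cox is only defined for strictly positive values, so predictors that contain zeros (Ba and Fe, for example) are not expected to be changed by that step. A Yeo-Johnson transformation (`"YeoJohnson"` in caret) would be an alternative for those.

```r
library(mlbench)
library(caret)
data(Glass)

# All columns except the class label
predictors <- Glass[, names(Glass) != "Type"]

# Estimate the transformations from the data.
# Assumption: Box-Cox, then center/scale, then spatial sign are requested together;
# caret decides the order in which they are applied.
pre_proc <- preProcess(predictors,
                       method = c("BoxCox", "center", "scale", "spatialSign"))

# Apply the estimated transformations
transformed <- predict(pre_proc, predictors)

# The spatial sign projects each sample onto the unit sphere,
# so the Euclidean norm of every row should be 1
row_norms <- sqrt(rowSums(as.matrix(transformed)^2))
summary(row_norms)
```

Because the spatial sign rescales every sample to the same distance from the center, extreme samples can no longer dominate distance-based models, at the cost of treating the predictors as a group rather than individually.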
library(mlbench)
library(caret)
Loading required package: lattice
data(Soybean)
str(Soybean)
'data.frame': 683 obs. of 36 variables:
$ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
$ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
$ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
$ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
$ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
$ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
$ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
$ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
$ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
$ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
$ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
$ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
$ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
$ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
$ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
$ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
$ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
$ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
$ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
near_zero_var_stats <- nearZeroVar(Soybean, saveMetrics = TRUE)
# Filter for the "degenerate" ones
degenerate_cols <- near_zero_var_stats[near_zero_var_stats$nzv == TRUE, ]
print(degenerate_cols)
          freqRatio percentUnique zeroVar  nzv
leaf.mild     26.75     0.4392387   FALSE TRUE
mycelium     106.50     0.2928258   FALSE TRUE
sclerotia     31.25     0.2928258   FALSE TRUE
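
To make the two metrics in this table concrete, the sketch below recomputes them by hand for a single predictor. It assumes the usual definitions: `freqRatio` is the count of the most common value divided by the count of the second most common, and `percentUnique` is the number of distinct non-missing values as a percentage of all rows. The helper name `nzv_metrics` is hypothetical. With its default settings, `nearZeroVar` flags a predictor when `freqRatio` exceeds 95/5 and `percentUnique` is below 10.

```r
library(mlbench)
data(Soybean)

# Hypothetical helper reproducing the two near-zero-variance metrics by hand
nzv_metrics <- function(x) {
  # Frequency of each observed value; table() ignores NA by default
  counts <- sort(as.vector(table(x)), decreasing = TRUE)
  counts <- counts[counts > 0]
  c(freqRatio     = counts[1] / counts[2],
    percentUnique = 100 * length(counts) / length(x))
}

mycelium_metrics <- nzv_metrics(Soybean$mycelium)
print(mycelium_metrics)
```

For `mycelium` this should agree with the corresponding row of the table printed above.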
To handle the missing data, remove the predictors that are both degenerate (near-zero variance) and have a high percentage of missing values, then use k-nearest neighbors (KNN) imputation to fill in the remaining missing values.
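
The removal step of this strategy could look roughly like the sketch below. The cut-off `missing_cutoff` is a hypothetical value chosen only for illustration; what counts as a "high" percentage of missing values is a judgment call that the text above does not fix.

```r
library(mlbench)
library(caret)
data(Soybean)

# All columns except the class label
predictors <- Soybean[, names(Soybean) != "Class"]

# Percentage of missing values per predictor
missing_pct <- colMeans(is.na(predictors)) * 100
print(round(sort(missing_pct, decreasing = TRUE), 1))

# Near-zero-variance predictors as flagged by caret (returned as column indices)
nzv_names <- names(predictors)[nearZeroVar(predictors)]

# Hypothetical cut-off (an assumption, not from the original analysis)
missing_cutoff <- 5

# Drop predictors that are both near-zero variance and above the cut-off
drop_cols <- nzv_names[missing_pct[nzv_names] > missing_cutoff]
soybean_reduced <- Soybean[, !(names(Soybean) %in% drop_cols)]

# Rows that still contain at least one missing value and would need imputation
rows_to_impute <- sum(!complete.cases(soybean_reduced))
print(rows_to_impute)
```

The imputation step is left out on purpose: caret's `preProcess(method = "knnImpute")` only processes numeric columns, so the factor predictors would first have to be encoded numerically (for instance as dummy variables), and that encoding is a design choice of its own.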