Homework 4

library(mlbench)
library(ggplot2)
library(corrplot)
corrplot 0.95 loaded
library(tidyr)

Exercise 3.1

a

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

data(Glass)

# Correlation matrix of the nine numeric predictors (column 10 is the Type factor)
cor_matrix <- cor(Glass[, -10])
corrplot(cor_matrix, method = "color", addCoef.col = "black", tl.col = "black")

# Boxplots of every predictor, split by glass type (%>% is re-exported by tidyr)
Glass %>%
  pivot_longer(cols = -Type, names_to = "Predictor", values_to = "Value") %>%
  ggplot(aes(x = Type, y = Value, fill = Type)) +
  geom_boxplot() +
  facet_wrap(~Predictor, scales = "free_y") +
  theme_minimal() +
  labs(title = "Predictor Distributions by Glass Type")
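
As a complement to the boxplots by type, the marginal distribution of each predictor can be viewed with histograms. The following is a minimal sketch that reuses the same long-format reshaping; the bin count of 30 and the object names `glass_long` and `glass_hist` are illustrative choices, not part of the original analysis.

```r
library(mlbench)
library(ggplot2)
library(tidyr)

data(Glass)

# Reshape to long format: one row per (sample, predictor) pair
glass_long <- pivot_longer(Glass, cols = -Type,
                           names_to = "Predictor", values_to = "Value")

# One histogram per predictor, each panel with its own axes
glass_hist <- ggplot(glass_long, aes(x = Value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~Predictor, scales = "free") +
  theme_minimal() +
  labs(title = "Predictor Distributions")

print(glass_hist)
```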

b

Do there appear to be any outliers in the data? Are any predictors skewed?

Yes. Most of the predictors look noticeably skewed. Ba (barium), Fe (iron), and K (potassium) are strongly right-skewed. Most samples contain almost no Ba, but a few contain extreme amounts, and those samples stand out as outliers. Potassium has two samples with very high levels.
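
The visual impression can be checked numerically. The sketch below is one possible way to do it, not the original analysis: it computes a moment-based sample skewness for each predictor and counts the points that the standard boxplot rule (beyond 1.5 times the IQR from the quartiles) would flag. The helper names `skew` and `count_outliers` are hypothetical.

```r
library(mlbench)
data(Glass)

predictors <- Glass[, names(Glass) != "Type"]

# Moment-based sample skewness: third central moment over the
# second central moment raised to the power 1.5
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Skewness per predictor, most right-skewed first
skew_values <- sort(sapply(predictors, skew), decreasing = TRUE)
print(round(skew_values, 2))

# Number of points outside the boxplot whiskers for each predictor
count_outliers <- function(x) length(boxplot.stats(x)$out)
outlier_counts <- sapply(predictors, count_outliers)
print(outlier_counts)
```

Positive skewness values indicate a long right tail, so the predictors at the top of the sorted vector are the main candidates for transformation.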

c

Are there any relevant transformations of one or more predictors that might improve the classification model?

The Glass dataset contains several predictors with high skewness and differing scales. I think applying transformations is close to mandatory before fitting any linear or distance-based model. To handle the extreme outliers, I believe the spatial sign transformation would be helpful. To deal with the skewness, we can use a Box-Cox transformation.
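
A minimal sketch of how these transformations could be chained with `caret::preProcess` follows. One assumption is worth flagging: Box-Cox requires strictly positive values, and several Glass predictors contain zeros, so this sketch swaps in the Yeo-Johnson transformation, which accepts zeros. Centering and scaling are included because the spatial sign is normally applied to standardized predictors. The object names `pp` and `glass_transformed` are illustrative.

```r
library(mlbench)
library(caret)
data(Glass)

predictors <- Glass[, names(Glass) != "Type"]

# Assumption: Yeo-Johnson stands in for Box-Cox because some predictors contain zeros.
# caret applies the power transformation first, then centering and scaling,
# and the spatial sign last, regardless of the order given here.
pp <- preProcess(predictors,
                 method = c("YeoJohnson", "center", "scale", "spatialSign"))

glass_transformed <- predict(pp, predictors)
summary(glass_transformed)
```

After the spatial sign step, every sample is projected onto the unit sphere, so no single extreme value can dominate a distance calculation.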

Exercise 3.2

a

library(mlbench)
library(caret)
Loading required package: lattice
data(Soybean)
str(Soybean)
'data.frame':   683 obs. of  36 variables:
 $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
 $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
 $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
 $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
 $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
 $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
 $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
 $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
 $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
 $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
 $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
 $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
 $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
 $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
 $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
 $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
 $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
 $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
 $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
# Compute near-zero-variance metrics for every column
near_zero_var_stats <- nearZeroVar(Soybean, saveMetrics = TRUE)

# Keep only the columns flagged as near-zero variance ("degenerate")
degenerate_cols <- near_zero_var_stats[near_zero_var_stats$nzv, ]
print(degenerate_cols)
          freqRatio percentUnique zeroVar  nzv
leaf.mild     26.75     0.4392387   FALSE TRUE
mycelium     106.50     0.2928258   FALSE TRUE
sclerotia     31.25     0.2928258   FALSE TRUE
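
To make the `freqRatio` column easier to interpret, the sketch below recomputes it by hand, assuming the usual definition: the count of the most common level divided by the count of the second most common level, with missing values ignored. The helper name `freq_ratio` is hypothetical. If that assumption matches what `nearZeroVar` does, the values should agree with the table above.

```r
library(mlbench)
data(Soybean)

# Ratio of the most frequent level to the second most frequent level
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NA by default
  if (length(counts) < 2) return(NA_real_)
  as.numeric(counts[1]) / as.numeric(counts[2])
}

flagged <- c("leaf.mild", "mycelium", "sclerotia")
ratios <- sapply(Soybean[, flagged], freq_ratio)
print(ratios)

# Full frequency table for one flagged predictor, including missing values
print(table(Soybean$mycelium, useNA = "ifany"))
```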

c

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

First, remove predictors that are both degenerate (near-zero variance) and have a high percentage of missing values. Then use k-nearest neighbors (KNN) imputation to fill in the remaining missing data.
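
The first half of this strategy can be sketched as follows. The 5% cutoff for a "high" share of missing values is a hypothetical threshold chosen only for illustration, and the object names are not from the original analysis.

```r
library(mlbench)
library(caret)
data(Soybean)

predictors <- Soybean[, names(Soybean) != "Class"]

# Percentage of missing values in each predictor, largest first
pct_missing <- sort(colMeans(is.na(predictors)) * 100, decreasing = TRUE)
print(round(head(pct_missing, 10), 1))

# Predictors flagged as near-zero variance by caret
nzv_names <- names(predictors)[nearZeroVar(predictors)]

# Hypothetical cutoff: treat more than 5% missing as "high"
missing_cutoff <- 5
high_missing <- names(pct_missing)[pct_missing > missing_cutoff]

# Drop only the predictors that meet both criteria
drop_candidates <- intersect(nzv_names, high_missing)
print(drop_candidates)

soybean_reduced <- Soybean[, !(names(Soybean) %in% drop_candidates)]
```

The imputation step would follow on `soybean_reduced`. Note that `preProcess(method = "knnImpute")` in caret operates on numeric columns only, so the factor predictors here would first need a numeric encoding or a KNN implementation that supports categorical data.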