Homework 4

library(mlbench)
library(ggplot2)
library(corrplot)

corrplot 0.95 loaded

library(tidyr)

Exercise 3.1

a

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

data(Glass)

cor_matrix <- cor(Glass[, -10])
corrplot(cor_matrix, method = "color", addCoef.col = "black", tl.col = "black")

Glass %>%
  pivot_longer(cols = -Type, names_to = "Predictor", values_to = "Value") %>%
  ggplot(aes(x = Type, y = Value, fill = Type)) +
  geom_boxplot() +
  facet_wrap(~Predictor, scales = "free_y") +
  theme_minimal() +
  labs(title = "Predictor Distributions by Glass Type")

b

Do there appear to be any outliers in the data? Are any predictors skewed?

Yes. Most of the predictors look noticeably skewed. Ba (barium), Fe (iron), and K (potassium) are strongly right-skewed. Most samples contain almost no Ba, but a few contain extreme amounts, and those points stand out as outliers. K has two samples with very high levels.
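The visual impression can be checked with numbers. The sketch below is an illustration rather than part of the original analysis: `skewness` is a small helper defined here (not a library function), and the 1.5 * IQR rule used by `boxplot.stats` is only one possible definition of an outlier.

```r
library(mlbench)

data(Glass)
predictors <- Glass[, names(Glass) != "Type"]

# Sample skewness: third central moment divided by the cubed standard deviation.
# (Helper defined for this example only.)
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

# Positive values indicate right skew, negative values left skew.
skew <- sort(sapply(predictors, skewness), decreasing = TRUE)
print(round(skew, 2))

# Number of points beyond the boxplot whiskers (1.5 * IQR rule) per predictor.
n_outliers <- sapply(predictors, function(x) length(boxplot.stats(x)$out))
print(n_outliers)
```

Predictors that are mostly zero, such as Ba, have a very small IQR, so nearly every non-zero value is counted as an outlier under this rule. That is worth keeping in mind when reading the counts.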

c

Are there any relevant transformations of one or more predictors that might improve the classification model?

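The Glass dataset contains several predictors with high skewness and varying scales, so I think transformations are basically mandatory before applying any linear or distance-based model. To deal with the skewness, we can use a Box-Cox transformation. Box-Cox requires strictly positive values, so predictors that contain zeros (such as Ba and Fe) would need a small offset or a Yeo-Johnson transformation instead. To limit the influence of the extreme outliers, I believe a spatial sign transformation, applied after centering and scaling, would be helpful.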
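A minimal sketch of such a preprocessing pipeline with caret's `preProcess` follows. It swaps in `"YeoJohnson"` for Box-Cox because several Glass predictors contain zeros; that substitution is an assumption of this example and not something the exercise prescribes.

```r
library(mlbench)
library(caret)

data(Glass)
predictors <- Glass[, names(Glass) != "Type"]

# Yeo-Johnson for skewness (handles zeros), then center and scale,
# then spatial sign to pull extreme samples onto the unit sphere.
pp <- preProcess(predictors,
                 method = c("YeoJohnson", "center", "scale", "spatialSign"))
glass_trans <- predict(pp, predictors)

# After the spatial sign step every sample should have Euclidean norm 1.
row_norms <- sqrt(rowSums(glass_trans^2))
print(summary(row_norms))
```

The transformed predictors can then be recombined with `Glass$Type` before fitting a classifier. Whether these transformations actually improve the model still has to be checked with resampling.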

3.2

a

library(mlbench)
library(caret)

Loading required package: lattice

data(Soybean)
str(Soybean)

'data.frame':   683 obs. of  36 variables:
 $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
 $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
 $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
 $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
 $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
 $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
 $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
 $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
 $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
 $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
 $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
 $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
 $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
 $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
 $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
 $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
 $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
 $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
 $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

near_zero_var_stats <- nearZeroVar(Soybean, saveMetrics = TRUE)

# Filter for the "degenerate" ones
degenerate_cols <- near_zero_var_stats[near_zero_var_stats$nzv == TRUE, ]
print(degenerate_cols)

          freqRatio percentUnique zeroVar  nzv
leaf.mild     26.75     0.4392387   FALSE TRUE
mycelium     106.50     0.2928258   FALSE TRUE
sclerotia     31.25     0.2928258   FALSE TRUE

b

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

na_counts <- apply(Soybean, 1, function(x) sum(is.na(x)))

# Aggregate mean NA count by Class
agg <- aggregate(na_counts ~ Soybean$Class, FUN = mean)

# Order from largest to smallest
agg[order(-agg$na_counts), ]

                 Soybean$Class na_counts
1                 2-4-d-injury  28.12500
9                cyst-nematode  24.00000
14            herbicide-injury  20.00000
16            phytophthora-rot  13.79545
10 diaporthe-pod-&-stem-blight  11.80000
2          alternarialeaf-spot   0.00000
3                  anthracnose   0.00000
4             bacterial-blight   0.00000
5            bacterial-pustule   0.00000
6                   brown-spot   0.00000
7               brown-stem-rot   0.00000
8                 charcoal-rot   0.00000
11       diaporthe-stem-canker   0.00000
12                downy-mildew   0.00000
13          frog-eye-leaf-spot   0.00000
15      phyllosticta-leaf-spot   0.00000
17              powdery-mildew   0.00000
18           purple-seed-stain   0.00000
19        rhizoctonia-root-rot   0.00000

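Although roughly 18% of the data are missing, the missingness is not randomly distributed. It follows a specific, problematic pattern: the missing values are concentrated in a few disease classes. 2-4-d-injury, cyst-nematode, herbicide-injury, phytophthora-rot, and diaporthe-pod-&-stem-blight all have large amounts of missing data (on average about 12 to 28 missing values per sample), while the other 14 classes have none.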
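The first half of the question, which predictors are most often missing, can be answered with a per-column summary. This is a small sketch using base R only; the object names are chosen for this example.

```r
library(mlbench)

data(Soybean)

# Percentage of missing values per column, largest first.
pct_missing <- sort(colMeans(is.na(Soybean)) * 100, decreasing = TRUE)
print(round(head(pct_missing, 10), 1))

# Share of samples with at least one missing value, by class.
incomplete_by_class <- tapply(!complete.cases(Soybean), Soybean$Class, mean)
print(round(sort(incomplete_by_class, decreasing = TRUE), 2))
```

Reading the two summaries together shows whether the predictors with the most missing values are the ones that go unrecorded in the five affected classes.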

c

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Remove predictors that are both degenerate (near-zero variance) and have a high percentage of missing values, then use K-nearest neighbors (KNN) imputation to fill in the remaining missing data.
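One way this strategy could look in code is sketched below. Several parts are assumptions of this example: all three near-zero-variance predictors are dropped without a missing-percentage threshold, and, because the Soybean predictors are factors, the imputation is a hand-rolled KNN (`impute_knn_mode`, a hypothetical helper defined here) that uses a simple matching distance and fills each gap with the most common level among the `k` nearest donors.

```r
library(mlbench)
library(caret)

data(Soybean)

# Step 1: drop the near-zero-variance predictors flagged by caret.
nzv_names <- names(Soybean)[nearZeroVar(Soybean)]
soy <- Soybean[, setdiff(names(Soybean), nzv_names)]

# Step 2: KNN imputation for categorical predictors (illustrative helper).
# Distance = share of mismatching levels over predictors observed in both rows.
pred_names <- setdiff(names(soy), "Class")
X <- sapply(soy[pred_names], as.character)  # character matrix, NAs preserved

impute_knn_mode <- function(X, k = 5) {
  X_imp <- X
  observed <- !is.na(X)
  for (i in which(rowSums(!observed) > 0)) {
    # Cells observed both in row i and in the comparison row.
    shared <- sweep(observed, 2, observed[i, ], `&`)
    target <- matrix(X[i, ], nrow(X), ncol(X), byrow = TRUE)
    mismatch <- shared & (X != target)
    d <- rowSums(mismatch, na.rm = TRUE) / rowSums(shared)
    d[i] <- NA  # a row is never its own neighbour
    for (j in which(!observed[i, ])) {
      # Donors must have column j observed and a defined distance to row i.
      donors <- which(observed[, j] & !is.na(d))
      nn <- donors[order(d[donors])][seq_len(min(k, length(donors)))]
      if (length(nn) > 0) {
        X_imp[i, j] <- names(which.max(table(X[nn, j])))
      }
    }
  }
  X_imp
}

X_imp <- impute_knn_mode(X, k = 5)

# Rebuild factors with their original levels and ordering.
soy_imp <- soy
for (p in pred_names) {
  soy_imp[[p]] <- factor(X_imp[, p], levels = levels(soy[[p]]),
                         ordered = is.ordered(soy[[p]]))
}
print(c(before = sum(is.na(soy)), after = sum(is.na(soy_imp))))
```

Because the missing values are concentrated in five classes, the donors for those samples come largely from other classes. The imputed values should therefore be treated with caution, and a tree-based model that handles missing values directly may be a reasonable alternative.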