DATA 624 Homework 4

library(tidyverse)
library(GGally)
library(corrplot)
library(reshape2)

Question 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
  2. Do there appear to be any outliers in the data? Are any predictors skewed?
Glass %>%
  pivot_longer(cols = -Type, names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_density(fill = "skyblue", alpha = 0.6) +
  facet_wrap(~ Variable, scales = "free", ncol = 3) +
  labs(title = "Density of Predictors",
       x = "Value", y = "Density") +
  theme_minimal()

cor_matrix <- cor(Glass[, 1:9])

corrplot(cor_matrix, method = "color", type = "upper", 
         col = colorRampPalette(c("blue", "white", "red"))(200),
         tl.cex = 0.8,
         title = "Correlation Matrix of Predictors",
         mar = c(0, 0, 2, 0))  # extra top margin so the title is not clipped

Glass %>%
  select(-Type) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(y = Value)) +
  geom_boxplot(fill = "skyblue", outlier.color = "red", outlier.shape = 1) +
  facet_wrap(~Variable, scales = "free", ncol = 3) +
  labs(title = "Boxplots of Glass Dataset Predictors",
       y = "Value",
       x = NULL) +
  theme_minimal() +
  theme(strip.text = element_text(face = "bold"))

RI, Na, Al, Si, and Ca have fairly symmetric, roughly normal distributions with some skewness, whereas Mg, K, Ba, and Fe are highly skewed. The predictors are largely uncorrelated, with the exception of RI and Ca, which are strongly correlated. All of the predictors have outliers except Mg.
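
To quantify the visual impression of skewness, the sample skewness of each predictor can be computed; a minimal sketch using e1071::skewness (the e1071 package is an assumption here, not used elsewhere in this homework):

library(e1071)

# Sample skewness per predictor; values well above 1 (or below -1) indicate strong skew
Glass %>%
  summarise(across(-Type, skewness)) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Skewness") %>%
  arrange(desc(abs(Skewness)))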

  1. Are there any relevant transformations of one or more predictors that might improve the classification model?

Yes. Since many of the predictors are right-skewed and contain outliers, the data should be transformed. Box-Cox is an option, but because several predictors contain many zeros it cannot be applied directly. Instead, a log(1 + x) transformation of the most skewed predictors followed by standardization can be used.

Glass_transformed <- Glass %>%
  mutate(
    K  = log1p(K),
    Ba = log1p(Ba),
    Fe = log1p(Fe)
  ) %>%
  mutate(across(-Type, ~ as.numeric(scale(.x))))  # scale() returns a matrix; coerce to a plain numeric vector
Glass_transformed %>%
  pivot_longer(cols = -Type, names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_density(fill = "skyblue", alpha = 0.6) +
  facet_wrap(~ Variable, scales = "free", ncol = 3) +
  labs(title = "Density of Transformed Predictors",
       x = "Transformed Value", y = "Density") +
  theme_minimal()
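
As an alternative to the manual log1p step, the Yeo-Johnson transformation behaves like Box-Cox but tolerates zeros; a hedged sketch using caret::preProcess (assuming caret is installed), shown only for comparison with the approach above:

library(caret)

# Estimate Yeo-Johnson, centering, and scaling parameters from the nine predictors
pp <- preProcess(Glass[, -10], method = c("YeoJohnson", "center", "scale"))
Glass_yj <- predict(pp, Glass[, -10])
summary(Glass_yj)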

Question 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

library(mlbench)
data(Soybean)
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
predictors <- Soybean[, -ncol(Soybean)]

freq_list <- lapply(predictors, function(x) {
  table(x)
})

freq_list
## $Class
## x
##                2-4-d-injury         alternarialeaf-spot 
##                          16                          91 
##                 anthracnose            bacterial-blight 
##                          44                          20 
##           bacterial-pustule                  brown-spot 
##                          20                          92 
##              brown-stem-rot                charcoal-rot 
##                          44                          20 
##               cyst-nematode diaporthe-pod-&-stem-blight 
##                          14                          15 
##       diaporthe-stem-canker                downy-mildew 
##                          20                          20 
##          frog-eye-leaf-spot            herbicide-injury 
##                          91                           8 
##      phyllosticta-leaf-spot            phytophthora-rot 
##                          20                          88 
##              powdery-mildew           purple-seed-stain 
##                          20                          20 
##        rhizoctonia-root-rot 
##                          20 
## 
## $date
## x
##   0   1   2   3   4   5   6 
##  26  75  93 118 131 149  90 
## 
## $plant.stand
## x
##   0   1 
## 354 293 
## 
## $precip
## x
##   0   1   2 
##  74 112 459 
## 
## $temp
## x
##   0   1   2 
##  80 374 199 
## 
## $hail
## x
##   0   1 
## 435 127 
## 
## $crop.hist
## x
##   0   1   2   3 
##  65 165 219 218 
## 
## $area.dam
## x
##   0   1   2   3 
## 123 227 145 187 
## 
## $sever
## x
##   0   1   2 
## 195 322  45 
## 
## $seed.tmt
## x
##   0   1   2 
## 305 222  35 
## 
## $germ
## x
##   0   1   2 
## 165 213 193 
## 
## $plant.growth
## x
##   0   1 
## 441 226 
## 
## $leaves
## x
##   0   1 
##  77 606 
## 
## $leaf.halo
## x
##   0   1   2 
## 221  36 342 
## 
## $leaf.marg
## x
##   0   1   2 
## 357  21 221 
## 
## $leaf.size
## x
##   0   1   2 
##  51 327 221 
## 
## $leaf.shread
## x
##   0   1 
## 487  96 
## 
## $leaf.malf
## x
##   0   1 
## 554  45 
## 
## $leaf.mild
## x
##   0   1   2 
## 535  20  20 
## 
## $stem
## x
##   0   1 
## 296 371 
## 
## $lodging
## x
##   0   1 
## 520  42 
## 
## $stem.cankers
## x
##   0   1   2   3 
## 379  39  36 191 
## 
## $canker.lesion
## x
##   0   1   2   3 
## 320  83 177  65 
## 
## $fruiting.bodies
## x
##   0   1 
## 473 104 
## 
## $ext.decay
## x
##   0   1   2 
## 497 135  13 
## 
## $mycelium
## x
##   0   1 
## 639   6 
## 
## $int.discolor
## x
##   0   1   2 
## 581  44  20 
## 
## $sclerotia
## x
##   0   1 
## 625  20 
## 
## $fruit.pods
## x
##   0   1   2   3 
## 407 130  14  48 
## 
## $fruit.spots
## x
##   0   1   2   4 
## 345  75  57 100 
## 
## $seed
## x
##   0   1 
## 476 115 
## 
## $mold.growth
## x
##   0   1 
## 524  67 
## 
## $seed.discolor
## x
##   0   1 
## 513  64 
## 
## $seed.size
## x
##   0   1 
## 532  59 
## 
## $shriveling
## x
##   0   1 
## 539  38

In this dataset, mycelium (639 vs. 6) and sclerotia (625 vs. 20) have essentially degenerate distributions, with nearly every observation falling in a single level; leaf.mild is almost as extreme. Most of the other predictors are heavily imbalanced, even if not strictly degenerate.
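
A more formal check for degenerate predictors is caret's near-zero-variance filter; a minimal sketch (assuming the caret package is available):

library(caret)

# Flag predictors whose most common level dominates and that have very few unique values
nzv <- nearZeroVar(Soybean[, -1], saveMetrics = TRUE)
nzv[nzv$nzv, ]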

  1. Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
missing_counts <- sapply(Soybean, function(x) sum(is.na(x)))
missing_df <- data.frame(Variable = names(missing_counts), Missing = missing_counts)
missing_df <- missing_df[order(-missing_df$Missing), ]
missing_df
##                        Variable Missing
## hail                       hail     121
## sever                     sever     121
## seed.tmt               seed.tmt     121
## lodging                 lodging     121
## germ                       germ     112
## leaf.mild             leaf.mild     108
## fruiting.bodies fruiting.bodies     106
## fruit.spots         fruit.spots     106
## seed.discolor     seed.discolor     106
## shriveling           shriveling     106
## leaf.shread         leaf.shread     100
## seed                       seed      92
## mold.growth         mold.growth      92
## seed.size             seed.size      92
## leaf.halo             leaf.halo      84
## leaf.marg             leaf.marg      84
## leaf.size             leaf.size      84
## leaf.malf             leaf.malf      84
## fruit.pods           fruit.pods      84
## precip                   precip      38
## stem.cankers       stem.cankers      38
## canker.lesion     canker.lesion      38
## ext.decay             ext.decay      38
## mycelium               mycelium      38
## int.discolor       int.discolor      38
## sclerotia             sclerotia      38
## plant.stand         plant.stand      36
## roots                     roots      31
## temp                       temp      30
## crop.hist             crop.hist      16
## plant.growth       plant.growth      16
## stem                       stem      16
## date                       date       1
## area.dam               area.dam       1
## Class                     Class       0
## leaves                   leaves       0
missing_df$MissingPercent <- 100 * missing_df$Missing / nrow(Soybean)

ggplot(missing_df, aes(x = reorder(Variable, -Missing), y = MissingPercent)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  ylab("Missing Data (%)") +
  xlab("Variable") +
  ggtitle("Percentage of Missing Data by Variable")

long_data <- melt(Soybean, id.vars = "Class")
long_data$missing <- is.na(long_data$value)

missing_by_class <- long_data %>%
  group_by(Class, variable) %>%
  summarize(missing_pct = mean(missing) * 100) %>%
  ungroup()
## `summarise()` has grouped output by 'Class'. You can override using the
## `.groups` argument.
ggplot(missing_by_class, aes(x = variable, y = Class, fill = missing_pct)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "white", high = "red", name = "% Missing") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Missing Data Pattern by Class and Variable")

The heatmap shows that the missing values are confined to five classes: 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury, and phytophthora-rot. All other classes are complete, so the pattern of missingness is strongly related to the class labels.
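
To confirm this without relying on the heatmap, the incomplete rows can be tallied by class; a small sketch:

# Count rows with at least one missing predictor, by class
Soybean %>%
  mutate(incomplete = !complete.cases(Soybean[, -1])) %>%
  group_by(Class) %>%
  summarise(n = n(), incomplete_rows = sum(incomplete)) %>%
  filter(incomplete_rows > 0)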

  1. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The data set consists almost entirely of categorical predictors and the goal is multi-class classification. kNN, MICE, or a random-forest-based method such as missForest can impute the missing values while respecting the factor structure. Other strategies include mode imputation or dropping the affected classes entirely, but those are poor choices here because the missingness is concentrated in specific classes and may itself be informative. A reasonable strategy is to remove the degenerate predictors first and impute the rest, though I am not certain which imputation method would perform best.
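
A sketch of one possible workflow, using the mice package for multiple imputation of the categorical predictors after dropping the two degenerate ones (an illustrative assumption, not a definitive recommendation):

library(mice)

# Drop the near-zero-variance predictors, then impute the remaining factors;
# mice defaults to logistic/polytomous regression for factor columns
soy_reduced <- Soybean %>% select(-mycelium, -sclerotia)
imp <- mice(soy_reduced, m = 5, seed = 624, printFlag = FALSE)
Soybean_complete <- mice::complete(imp)
sum(is.na(Soybean_complete))  # check how many missing values remain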