Required Libraries

library(tidyverse)
library(ggplot2)
library(GGally)
library(mlbench)
library(psych)
library(caret)
library(missMethods)

Problem 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via data(Glass).

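Loading the data and previewing the first rows (the same data()/kable pattern used for the Soybean data below):

data(Glass)

knitr::kable(head(Glass))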
RI Na Mg Al Si K Ca Ba Fe Type
1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
  (a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Correlation/Distribution Matrix

Correlations

Comparing the pairwise correlations between the predictors, we see values ranging from very weak to moderate. The list below shows the moderate correlations that could be of concern (a short sketch for computing them follows the list).

Moderate

  • Al/Mg = -0.482
  • Ca/Mg = -0.444
  • Ba/Mg = -0.492
  • Al/Ba = 0.479
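
As a minimal sketch, these correlations can be computed directly with cor() on the nine predictor columns:

# Pairwise correlations among the nine predictors, rounded for readability
round(cor(Glass[, 1:9]), 3)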

Distributions

Some predictor variables show substantial skewness to the left or right, while a few are approximately normally distributed.

Left-Skewed

  • Mg

Right-Skewed

  • K
  • Ca
  • Ba
  • Fe
  • RI

Normal

  • Na
  • Al
  • Si

# Pairwise correlation/distribution matrix for the nine predictors
Glass |>
  ggpairs(columns = 1:9)

  (b) Do there appear to be any outliers in the data? Are any predictors skewed?

Boxplot

Initially, the boxplot was difficult to interpret because the scale of Si dwarfed the other predictors. I standardized each predictor to its z-scores using the scale() function.

# Keep only the nine predictor columns
pred <- 
  Glass |>
  select(1:9)

# Boxplots on the raw scale: Si dominates the y-axis
pred |>
  gather() |>
  ggplot(aes(x = key, y = value)) +
  geom_boxplot() +
  labs(x = "", title = "Predictors")

We can see outliers in most predictors, with z-scores beyond 3 in absolute value. Some predictors have especially large z-scores that are concerning, such as:

  • K: Z-Score > 8
  • Ba: Z-Score > 5
  • Ca: Z-Score > 5
  • RI: Z-Score > 5

# Standardize each predictor to z-scores so the boxplots share a common scale
scaled_pred <- 
  as_tibble(scale(pred))

scaled_pred |>
  gather() |>
  ggplot(aes(x=key, y=value)) +
  geom_boxplot() +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 5)) +
  labs(x = "",
       y = "z-score",
       title = "Predictors (Scaled)")
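
As a complement to the boxplot, a short summary over the scaled_pred tibble above quantifies how many |z| > 3 observations each predictor has and its largest absolute z-score:

# Count |z| > 3 observations and the maximum |z| per predictor
scaled_pred |>
  gather() |>
  group_by(key) |>
  summarise(n_outliers = sum(abs(value) > 3),
            max_abs_z = round(max(abs(value)), 1)) |>
  arrange(desc(max_abs_z))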

Skewness

As mentioned earlier, a few variables have skewed distributions. Let’s look at them plotted, now that they have been standardized (scaling changes the units, not the skewness).

We can see that Ba and K have extremely right-skewed distributions, while Mg and Si are left-skewed.

# Summary statistics, including skew, from psych::describe()
skew_glass <-
  describe(scaled_pred) |>
  select(mean, sd, median, skew)

skew_glass |>
  ggplot(aes(x=row.names(skew_glass), y=skew)) +
  geom_bar(stat = 'identity') +
  labs(x = "",
       y = "Skew",
       title = "Predictors (Scaled)")
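
The numeric skew values behind the bar chart can be printed directly as well, reusing the kable pattern from above:

knitr::kable(round(skew_glass, 2))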

  (c) Are there any relevant transformations of one or more predictors that might improve the classification model?

At minimum, K and Ba should be transformed to reduce their extreme skewness. We can try a Box-Cox, center/scale, and spatialSign transformation on all of the predictors, since each has some skewness and outliers to deal with. (Note that Box-Cox can only be estimated for strictly positive data, so caret skips any predictor containing zeros.) The resulting distributions look considerably closer to normal than before.

# Estimate Box-Cox, centering/scaling, and spatial sign transformations
transform_glass <- 
  preProcess(pred, method = c("BoxCox", "center", "scale", "spatialSign"))

# Apply the estimated transformations to the predictors
pre_processed <- predict(transform_glass, newdata = pred)

knitr::kable(head(pre_processed))
RI Na Mg Al Si K Ca Ba Fe
0.3897787 0.1394924 0.5571464 -0.2916376 -0.5017687 -0.2982838 -0.0053160 -0.1567018 -0.2604248
-0.1706773 0.4233480 0.4383221 -0.0602644 0.0671272 -0.0180610 -0.6071833 -0.2431334 -0.4040668
-0.4527459 0.1128136 0.3764394 0.1722424 0.2729916 -0.1029837 -0.5850359 -0.2208709 -0.3670684
-0.1977660 -0.1844299 0.5979005 -0.2010439 -0.0500587 0.0959318 -0.4174184 -0.3019639 -0.5018381
-0.2291155 -0.1036329 0.4791653 -0.2526386 0.4081183 0.0599771 -0.4676179 -0.2601063 -0.4322743
-0.3033006 -0.2854614 0.2448550 0.1635348 0.1561188 0.0836424 -0.2415362 -0.1343513 0.7950244
# Reshape to long form for faceted histograms
glass_processed <-
  pre_processed |>
  gather()

glass_processed |>
  ggplot(aes(x = value)) +
  geom_histogram() +
  facet_wrap(~key, scales = 'free')
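
As a side check, caret’s BoxCoxTrans() estimates the Box-Cox lambda for a single predictor; a minimal sketch for Na, which is strictly positive and therefore eligible:

# Estimated Box-Cox lambda for Na alone
BoxCoxTrans(pred$Na)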

Problem 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

data(Soybean)

knitr::kable(head(Soybean))
Class date plant.stand precip temp hail crop.hist area.dam sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
diaporthe-stem-canker 6 0 2 1 0 1 1 1 0 0 1 1 0 2 2 0 0 0 1 1 3 1 1 1 0 0 0 0 4 0 0 0 0 0 0
diaporthe-stem-canker 4 0 2 1 0 2 0 2 1 1 1 1 0 2 2 0 0 0 1 0 3 1 1 1 0 0 0 0 4 0 0 0 0 0 0
diaporthe-stem-canker 3 0 2 1 0 1 0 2 1 2 1 1 0 2 2 0 0 0 1 0 3 0 1 1 0 0 0 0 4 0 0 0 0 0 0
diaporthe-stem-canker 3 0 2 1 0 1 0 2 0 1 1 1 0 2 2 0 0 0 1 0 3 0 1 1 0 0 0 0 4 0 0 0 0 0 0
diaporthe-stem-canker 6 0 2 1 0 2 0 1 0 2 1 1 0 2 2 0 0 0 1 0 3 1 1 1 0 0 0 0 4 0 0 0 0 0 0
diaporthe-stem-canker 5 0 2 1 0 3 0 1 0 1 1 1 0 2 2 0 0 0 1 0 3 0 1 1 0 0 0 0 4 0 0 0 0 0 0
  (a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

We can see that there are a few degenerate distributions: some predictors are dominated by a single level (mycelium and sclerotia, for example, are almost entirely zeros). A programmatic check follows the plot below.

# Drop the outcome so only the predictors remain
soy_predictors <-
  Soybean |>
  select(-Class)

soy_predictors |>
  gather() |>
  ggplot(aes(x = value)) +
  geom_bar() +
  facet_wrap(~key) +
  labs(title = "Soybean Frequency Distributions")
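
As a programmatic check, caret’s nearZeroVar() flags near-zero-variance (degenerate) predictors; a minimal sketch:

# Flag predictors whose distributions are near zero-variance
degenerate <- nearZeroVar(soy_predictors, saveMetrics = TRUE)
degenerate[degenerate$nzv, ]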

  (b) Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Here we see that hail, lodging, sever, and seed.tmt are among the predictors with the most missing values.

# Count missing and non-missing values per predictor
missing_data <-
  soy_predictors |>
  gather(key = "predictor") |>
  group_by(predictor) |>
  summarise(is_null = sum(is.na(value)),
            not_null = sum(!is.na(value))) |>
  mutate(pct_missing = round(is_null / (not_null + is_null), 3)) |>
  arrange(predictor)

knitr::kable(head(missing_data |> arrange(desc(pct_missing), predictor)))
predictor is_null not_null pct_missing
hail 121 562 0.177
lodging 121 562 0.177
seed.tmt 121 562 0.177
sever 121 562 0.177
germ 112 571 0.164
leaf.mild 108 575 0.158
missing_data |>
  ggplot(aes(x=is_null, y=reorder(predictor, -is_null))) +
  geom_bar(stat='identity', fill="#1170aa") +
  theme_bw() +
  labs(x = "count", title = "Missing Values of Predictors")

Let’s take a closer look at which classes these missing values belong to. The phytophthora-rot class leads the 19 classes in missing values, and only 5 classes have any missing values at all.

# Per-class counts of missing values in each predictor
missing_classes <-
  Soybean %>%
  filter(rowSums(is.na(.)) > 0) %>%
  group_by(Class) %>%
  summarise_all(~ sum(is.na(.)))

knitr::kable(missing_classes)
Class date plant.stand precip temp hail crop.hist area.dam sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
2-4-d-injury 1 16 16 16 16 16 1 16 16 16 16 0 0 0 0 16 0 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16
cyst-nematode 0 14 14 14 14 0 0 14 14 14 0 0 14 14 14 14 14 14 0 14 14 14 14 14 14 14 14 0 14 0 0 14 0 14 0
diaporthe-pod-&-stem-blight 0 6 0 0 15 0 0 15 15 6 0 0 15 15 15 15 15 15 0 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15
herbicide-injury 0 0 8 0 8 0 0 8 8 8 0 0 0 0 0 0 0 8 0 8 8 8 8 8 8 8 8 0 8 8 8 8 8 8 0
phytophthora-rot 0 0 0 0 68 0 0 68 68 68 0 0 55 55 55 55 55 55 0 68 0 0 68 0 0 0 0 68 68 68 68 68 68 68 0
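
To tie the missingness to the classes more directly, a short sketch computes the fraction of incomplete rows within each class (only the five classes tabulated above should appear):

# Fraction of rows with at least one missing value, by class
tibble(Class = Soybean$Class,
       incomplete = !complete.cases(Soybean)) |>
  group_by(Class) |>
  summarise(pct_incomplete = round(mean(incomplete), 3)) |>
  filter(pct_incomplete > 0) |>
  arrange(desc(pct_incomplete))
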
  (c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Given that the missing data are categorical with only a handful of ordered levels, median imputation (after coercing the factors to numeric codes) is one of the simplest options. We can see that the plots below no longer show a bar for NA.

# Coerce the factor predictors to their numeric level codes
numeric_predictors <-
  sapply(soy_predictors, as.numeric)

# Impute each predictor's missing values with its median (missMethods)
soy_median <- 
  as_tibble(impute_median(numeric_predictors))

soy_median |>
  gather() |>
  ggplot(aes(x = value)) +
  geom_bar() +
  facet_wrap(~key) +
  labs(title = "Soybean Frequency Distributions (Imputed)")