Required Libraries

library(tidyverse)
library(ggplot2)
library(GGally)
library(mlbench)
library(psych)
library(caret)
library(missMethods)

Problem 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via data(Glass).

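Loading the data and previewing the first rows (the same data()/kable pattern used for the Soybean data below):

data(Glass)

knitr::kable(head(Glass))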
RI Na Mg Al Si K Ca Ba Fe Type
1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
  (a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Correlation/Distribution Matrix

Correlations

Comparing the pairwise correlations between the predictors, we see values ranging from very weak to moderate. The list below shows the moderate correlations that could be of concern (a short sketch for computing them follows the list).

Moderate

  • Al/Mg = -0.482
  • Ca/Mg = -0.444
  • Ba/Mg = -0.492
  • Al/Ba = 0.479
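
As a minimal sketch, these correlations can be computed directly with cor() on the nine predictor columns:

# Pairwise correlations among the nine predictors, rounded for readability
round(cor(Glass[, 1:9]), 3)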

Distributions

Some predictor variables show substantial skewness to the left or right, while a few are approximately normally distributed.

Left-Skewed

  • Mg

Right-Skewed

  • K
  • Ca
  • Ba
  • Fe
  • RI

Normal

  • Na
  • Al
  • Si

# Pairwise correlation/distribution matrix for the nine predictors
Glass |>
  ggpairs(columns = 1:9)

  (b) Do there appear to be any outliers in the data? Are any predictors skewed?

Boxplot

Initially, the boxplot was difficult to interpret because the scale of Si dwarfed the other predictors. I standardized each predictor to its z-scores using the scale() function.

# Keep only the nine predictor columns
pred <- 
  Glass |>
  select(1:9)

# Boxplots on the raw scale: Si dominates the y-axis
pred |>
  gather() |>
  ggplot(aes(x = key, y = value)) +
  geom_boxplot() +
  labs(x = "", title = "Predictors")

We can see outliers in most predictors, with z-scores beyond 3 in absolute value. Some predictors have especially large z-scores that are concerning, such as:

  • K: Z-Score > 8
  • Ba: Z-Score > 5
  • Ca: Z-Score > 5
  • RI: Z-Score > 5

# Standardize each predictor to z-scores so the boxplots share a common scale
scaled_pred <- 
  as_tibble(scale(pred))

scaled_pred |>
  gather() |>
  ggplot(aes(x=key, y=value)) +
  geom_boxplot() +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 5)) +
  labs(x = "",
       y = "z-score",
       title = "Predictors (Scaled)")
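
As a complement to the boxplot, a short summary over the scaled_pred tibble above quantifies how many |z| > 3 observations each predictor has and its largest absolute z-score:

# Count |z| > 3 observations and the maximum |z| per predictor
scaled_pred |>
  gather() |>
  group_by(key) |>
  summarise(n_outliers = sum(abs(value) > 3),
            max_abs_z = round(max(abs(value)), 1)) |>
  arrange(desc(max_abs_z))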

Skewness

As mentioned earlier, a few variables have skewed distributions. Let’s look at them plotted, now that they have been standardized (scaling changes the units, not the skewness).

We can see that Ba and K have extremely right-skewed distributions, while Mg and Si are left-skewed.

# Summary statistics, including skew, from psych::describe()
skew_glass <-
  describe(scaled_pred) |>
  select(mean, sd, median, skew)

skew_glass |>
  ggplot(aes(x=row.names(skew_glass), y=skew)) +
  geom_bar(stat = 'identity') +
  labs(x = "",
       y = "Skew",
       title = "Predictors (Scaled)")
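
The numeric skew values behind the bar chart can be printed directly as well, reusing the kable pattern from above:

knitr::kable(round(skew_glass, 2))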

  (c) Are there any relevant transformations of one or more predictors that might improve the classification model?

At minimum, K and Ba should be transformed to reduce their extreme skewness. We can try a Box-Cox, center/scale, and spatialSign transformation on all of the predictors, since each has some skewness and outliers to deal with. (Note that Box-Cox can only be estimated for strictly positive data, so caret skips any predictor containing zeros.) The resulting distributions look considerably closer to normal than before.

# Estimate Box-Cox, centering/scaling, and spatial sign transformations
transform_glass <- 
  preProcess(pred, method = c("BoxCox", "center", "scale", "spatialSign"))

# Apply the estimated transformations to the predictors
pre_processed <- predict(transform_glass, newdata = pred)

knitr::kable(head(pre_processed))
RI Na Mg Al Si K Ca Ba Fe
0.3897787 0.1394924 0.5571464 -0.2916376 -0.5017687 -0.2982838 -0.0053160 -0.1567018 -0.2604248
-0.1706773 0.4233480 0.4383221 -0.0602644 0.0671272 -0.0180610 -0.6071833 -0.2431334 -0.4040668
-0.4527459 0.1128136 0.3764394 0.1722424 0.2729916 -0.1029837 -0.5850359 -0.2208709 -0.3670684
-0.1977660 -0.1844299 0.5979005 -0.2010439 -0.0500587 0.0959318 -0.4174184 -0.3019639 -0.5018381
-0.2291155 -0.1036329 0.4791653 -0.2526386 0.4081183 0.0599771 -0.4676179 -0.2601063 -0.4322743
-0.3033006 -0.2854614 0.2448550 0.1635348 0.1561188 0.0836424 -0.2415362 -0.1343513 0.7950244
# Reshape to long form for faceted histograms
glass_processed <-
  pre_processed |>
  gather()

glass_processed |>
  ggplot(aes(x = value)) +
  geom_histogram() +
  facet_wrap(~key, scales = 'free')
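
As a side check, caret’s BoxCoxTrans() estimates the Box-Cox lambda for a single predictor; a minimal sketch for Na, which is strictly positive and therefore eligible:

# Estimated Box-Cox lambda for Na alone
BoxCoxTrans(pred$Na)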

Problem 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

data(Soybean)

knitr::kable(head(Soybean))
Class date plant.stand precip temp hail crop.hist area.dam sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
diaporthe-stem-canker 6 0 2 1 0 1 1 1 0 0 1 1 0 2 2 0 0 0 1 1 3 1 1 1 0 0 0 0 4 0 0 0 0 0 0
diaporthe-stem-canker 4 0 2 1 0 2 0 2 1 1 1 1 0 2 2 0 0 0 1 0 3 1 1 1 0 0 0 0 4 0 0 0 0 0 0
diaporthe-stem-canker 3 0 2 1 0 1 0 2 1 2 1 1 0 2 2 0 0 0 1 0 3 0 1 1 0 0 0 0 4 0 0 0 0 0 0
diaporthe-stem-canker 3 0 2 1 0 1 0 2 0 1 1 1 0 2 2 0 0 0 1 0 3 0 1 1 0 0 0 0 4 0 0 0 0 0 0
diaporthe-stem-canker 6 0 2 1 0 2 0 1 0 2 1 1 0 2 2 0 0 0 1 0 3 1 1 1 0 0 0 0 4 0 0 0 0 0 0
diaporthe-stem-canker 5 0 2 1 0 3 0 1 0 1 1 1 0 2 2 0 0 0 1 0 3 0 1 1 0 0 0 0 4 0 0 0 0 0 0
  (a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

We can see that there are a few degenerate distributions: some predictors are dominated by a single level (mycelium and sclerotia, for example, are almost entirely zeros). A programmatic check follows the plot below.

# Drop the outcome so only the predictors remain
soy_predictors <-
  Soybean |>
  select(-Class)

soy_predictors |>
  gather() |>
  ggplot(aes(x = value)) +
  geom_bar() +
  facet_wrap(~key) +
  labs(title = "Soybean Frequency Distributions")
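
As a programmatic check, caret’s nearZeroVar() flags near-zero-variance (degenerate) predictors; a minimal sketch:

# Flag predictors whose distributions are near zero-variance
degenerate <- nearZeroVar(soy_predictors, saveMetrics = TRUE)
degenerate[degenerate$nzv, ]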

  (b) Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Here we see that hail, lodging, sever, and seed.tmt are among the predictors with the most missing values.

# Count missing and non-missing values per predictor
missing_data <-
  soy_predictors |>
  gather(key = "predictor") |>
  group_by(predictor) |>
  summarise(is_null = sum(is.na(value)),
            not_null = sum(!is.na(value))) |>
  mutate(pct_missing = round(is_null / (not_null + is_null), 3)) |>
  arrange(predictor)

knitr::kable(head(missing_data |> arrange(desc(pct_missing), predictor)))
predictor is_null not_null pct_missing
hail 121 562 0.177
lodging 121 562 0.177
seed.tmt 121 562 0.177
sever 121 562 0.177
germ 112 571 0.164
leaf.mild 108 575 0.158
missing_data |>
  ggplot(aes(x=is_null, y=reorder(predictor, -is_null))) +
  geom_bar(stat='identity', fill="#1170aa") +
  theme_bw() +
  labs(x = "count", title = "Missing Values of Predictors")

Let’s take a closer look at which classes these missing values belong to. The phytophthora-rot class leads the 19 classes in missing values, and only 5 classes have any missing values at all.

# Per-class counts of missing values in each predictor
missing_classes <-
  Soybean %>%
  filter(rowSums(is.na(.)) > 0) %>%
  group_by(Class) %>%
  summarise_all(~ sum(is.na(.)))

knitr::kable(missing_classes)
Class date plant.stand precip temp hail crop.hist area.dam sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
2-4-d-injury 1 16 16 16 16 16 1 16 16 16 16 0 0 0 0 16 0 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16
cyst-nematode 0 14 14 14 14 0 0 14 14 14 0 0 14 14 14 14 14 14 0 14 14 14 14 14 14 14 14 0 14 0 0 14 0 14 0
diaporthe-pod-&-stem-blight 0 6 0 0 15 0 0 15 15 6 0 0 15 15 15 15 15 15 0 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15
herbicide-injury 0 0 8 0 8 0 0 8 8 8 0 0 0 0 0 0 0 8 0 8 8 8 8 8 8 8 8 0 8 8 8 8 8 8 0
phytophthora-rot 0 0 0 0 68 0 0 68 68 68 0 0 55 55 55 55 55 55 0 68 0 0 68 0 0 0 0 68 68 68 68 68 68 68 0
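
To tie the missingness to the classes more directly, a short sketch computes the fraction of incomplete rows within each class (only the five classes tabulated above should appear):

# Fraction of rows with at least one missing value, by class
tibble(Class = Soybean$Class,
       incomplete = !complete.cases(Soybean)) |>
  group_by(Class) |>
  summarise(pct_incomplete = round(mean(incomplete), 3)) |>
  filter(pct_incomplete > 0) |>
  arrange(desc(pct_incomplete))
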
  (c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Given that the missing data are categorical with only a handful of ordered levels, median imputation (after coercing the factors to numeric codes) is one of the simplest options. We can see that the plots below no longer show a bar for NA.

# Coerce the factor predictors to their numeric level codes
numeric_predictors <-
  sapply(soy_predictors, as.numeric)

# Impute each predictor's missing values with its median (missMethods)
soy_median <- 
  as_tibble(impute_median(numeric_predictors))

soy_median |>
  gather() |>
  ggplot(aes(x = value)) +
  geom_bar() +
  facet_wrap(~key) +
  labs(title = "Soybean Frequency Distributions (Imputed)")