library(tidyverse)
library(ggplot2)
library(GGally)
library(mlbench)
library(psych)
library(caret)
library(missMethods)
The UC Irvine Machine Learning Repository6 contains a data set
related to glass identification. The data consist of 214 glass samples
labeled as one of seven class categories. There are nine predictors,
including the refractive index and percentages of eight elements: Na,
Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via
data(Glass)
RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
---|---|---|---|---|---|---|---|---|---|
1.52101 | 13.64 | 4.49 | 1.10 | 71.78 | 0.06 | 8.75 | 0 | 0.00 | 1 |
1.51761 | 13.89 | 3.60 | 1.36 | 72.73 | 0.48 | 7.83 | 0 | 0.00 | 1 |
1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0 | 0.00 | 1 |
1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0 | 0.00 | 1 |
1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0 | 0.00 | 1 |
1.51596 | 12.79 | 3.61 | 1.62 | 72.97 | 0.64 | 8.07 | 0 | 0.26 | 1 |
Comparing the correlations between the predictors, we have correlations that range from very weak to moderate strength. The list below shows the moderate levels that coould be of concern.
Moderate
We can see some predictor variables have some major skewness to the right/left, while a few of them are relatively normally distributed.
Left-Skewed
Right-Skewed
Normal
Glass |>
ggpairs(columns = c(1:9))
Initially, the boxplot was difficult to interpret as the scale for Si made the other predictors difficult to interpret. I decided to standardize the scales with it’s z-scores using the scale() function.
pred <-
Glass |>
select(1:9)
pred |>
gather() |>
ggplot(aes(x=key, y=value)) +
geom_boxplot() +
labs(x = "", title = "Predictors")
We can see that there are outliers within each predictor as they contain z-scores greater than 3. However, some predictors have quite large z-scores that are concerning such as:
pred <-
Glass |>
select(1:9)
scaled_pred <-
as_tibble(scale(pred))
scaled_pred |>
gather() |>
ggplot(aes(x=key, y=value)) +
geom_boxplot() +
scale_y_continuous(breaks = scales::pretty_breaks(n = 5)) +
labs(x = "",
y = "z-score",
title = "Predictors (Scaled)")
As mentioned earlier, we do see a few variables have a skewed distribution. Let’s look at it plotted, now that they have been normalized.
We can see that Ba and K are extremely right-skewed distributions while Mg and Si is left-skewed.
skew_glass <-
describe(scaled_pred) |>
select(mean, sd, median, skew)
skew_glass |>
ggplot(aes(x=row.names(skew_glass), y=skew)) +
geom_bar(stat = 'identity') +
labs(x = "",
y = "Skew",
title = "Predictors (Scaled)")
At minimal K and Ba should be transformed to be less than it’s extreme skewness. We can try to use a Box-Cox, scale and spatialSign transformation to help with all the predictors, since they each have some skewness and outliers to be dealt with. The results show a more normally distributed graph compared to before.
transform_glass <-
preProcess(pred, method = c("BoxCox", "center", "scale", "spatialSign"))
pre_processed<- predict(transform_glass, newdata = pred)
knitr::kable(head(pre_processed))
RI | Na | Mg | Al | Si | K | Ca | Ba | Fe |
---|---|---|---|---|---|---|---|---|
0.3897787 | 0.1394924 | 0.5571464 | -0.2916376 | -0.5017687 | -0.2982838 | -0.0053160 | -0.1567018 | -0.2604248 |
-0.1706773 | 0.4233480 | 0.4383221 | -0.0602644 | 0.0671272 | -0.0180610 | -0.6071833 | -0.2431334 | -0.4040668 |
-0.4527459 | 0.1128136 | 0.3764394 | 0.1722424 | 0.2729916 | -0.1029837 | -0.5850359 | -0.2208709 | -0.3670684 |
-0.1977660 | -0.1844299 | 0.5979005 | -0.2010439 | -0.0500587 | 0.0959318 | -0.4174184 | -0.3019639 | -0.5018381 |
-0.2291155 | -0.1036329 | 0.4791653 | -0.2526386 | 0.4081183 | 0.0599771 | -0.4676179 | -0.2601063 | -0.4322743 |
-0.3033006 | -0.2854614 | 0.2448550 | 0.1635348 | 0.1561188 | 0.0836424 | -0.2415362 | -0.1343513 | 0.7950244 |
glass_processed <-
pre_processed |>
gather()
glass_processed |>
ggplot(aes(x = value)) +
geom_histogram() +
facet_wrap(~key, scales = 'free')
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
data(Soybean)
knitr::kable(head(Soybean))
Class | date | plant.stand | precip | temp | hail | crop.hist | area.dam | sever | seed.tmt | germ | plant.growth | leaves | leaf.halo | leaf.marg | leaf.size | leaf.shread | leaf.malf | leaf.mild | stem | lodging | stem.cankers | canker.lesion | fruiting.bodies | ext.decay | mycelium | int.discolor | sclerotia | fruit.pods | fruit.spots | seed | mold.growth | seed.discolor | seed.size | shriveling | roots |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
diaporthe-stem-canker | 6 | 0 | 2 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 1 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
diaporthe-stem-canker | 4 | 0 | 2 | 1 | 0 | 2 | 0 | 2 | 1 | 1 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
diaporthe-stem-canker | 3 | 0 | 2 | 1 | 0 | 1 | 0 | 2 | 1 | 2 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
diaporthe-stem-canker | 3 | 0 | 2 | 1 | 0 | 1 | 0 | 2 | 0 | 1 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
diaporthe-stem-canker | 6 | 0 | 2 | 1 | 0 | 2 | 0 | 1 | 0 | 2 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
diaporthe-stem-canker | 5 | 0 | 2 | 1 | 0 | 3 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
We can see that there a few degenerate distributions given their frequency distributions.
soy_predictors <-
Soybean |>
select(-Class)
soy_predictors |>
gather() |>
ggplot(aes(x = value)) +
geom_bar() +
facet_wrap(~key) +
labs(title = "Soybean Frequency Distributions")
Here we see hail, lodging, sever, and seed.tmt as some of the higher predictors missing values.
missing_data <-
soy_predictors |>
gather() |>
group_by(key, value) |>
summarise(n = n()) |>
spread(value, n) |>
mutate(not_null = sum(c_across(1:7), na.rm = T)) |>
rename(is_null = `<NA>`,
predictor=key) |>
select(is_null, not_null) |>
mutate(pct_missing = round(is_null / (not_null + is_null), 3)) |>
arrange(predictor)
knitr::kable(head(missing_data |> arrange(desc(pct_missing), predictor)))
predictor | is_null | not_null | pct_missing |
---|---|---|---|
hail | 121 | 562 | 0.177 |
lodging | 121 | 562 | 0.177 |
seed.tmt | 121 | 562 | 0.177 |
sever | 121 | 562 | 0.177 |
germ | 112 | 571 | 0.164 |
leaf.mild | 108 | 575 | 0.158 |
missing_data |>
ggplot(aes(x=is_null, y=reorder(predictor, -is_null))) +
geom_bar(stat='identity', fill="#1170aa") +
theme_bw() +
labs(x="count", title = "Missing Values of Predcitors")
Let’s take a closer look as to the classes these missing values may be attributed to. We see that the phytophthora-rot class leads the 19 classes with missing values, with a total of 5 classes missing values in general.
missing_classes <-
Soybean %>%
filter(rowSums(is.na(.)) > 0) |>
group_by(Class) %>%
summarise_all(~sum(is.na(.)))
knitr::kable(missing_classes)
Class | date | plant.stand | precip | temp | hail | crop.hist | area.dam | sever | seed.tmt | germ | plant.growth | leaves | leaf.halo | leaf.marg | leaf.size | leaf.shread | leaf.malf | leaf.mild | stem | lodging | stem.cankers | canker.lesion | fruiting.bodies | ext.decay | mycelium | int.discolor | sclerotia | fruit.pods | fruit.spots | seed | mold.growth | seed.discolor | seed.size | shriveling | roots |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2-4-d-injury | 1 | 16 | 16 | 16 | 16 | 16 | 1 | 16 | 16 | 16 | 16 | 0 | 0 | 0 | 0 | 16 | 0 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |
cyst-nematode | 0 | 14 | 14 | 14 | 14 | 0 | 0 | 14 | 14 | 14 | 0 | 0 | 14 | 14 | 14 | 14 | 14 | 14 | 0 | 14 | 14 | 14 | 14 | 14 | 14 | 14 | 14 | 0 | 14 | 0 | 0 | 14 | 0 | 14 | 0 |
diaporthe-pod-&-stem-blight | 0 | 6 | 0 | 0 | 15 | 0 | 0 | 15 | 15 | 6 | 0 | 0 | 15 | 15 | 15 | 15 | 15 | 15 | 0 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15 |
herbicide-injury | 0 | 0 | 8 | 0 | 8 | 0 | 0 | 8 | 8 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 0 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 0 | 8 | 8 | 8 | 8 | 8 | 8 | 0 |
phytophthora-rot | 0 | 0 | 0 | 0 | 68 | 0 | 0 | 68 | 68 | 68 | 0 | 0 | 55 | 55 | 55 | 55 | 55 | 55 | 0 | 68 | 0 | 0 | 68 | 0 | 0 | 0 | 0 | 68 | 68 | 68 | 68 | 68 | 68 | 68 | 0 |
Given that the missing data is categorical with only 6 options, a median may be the easiest option to perform. We can see our graphs now do not show a bar for N/A
numeric_predictors <-
sapply(soy_predictors, as.numeric)
soy_median <-
as_tibble(impute_median(numeric_predictors))
soy_median |>
gather() |>
ggplot(aes(x = value)) +
geom_bar() +
facet_wrap(~key) +
labs(title = "Soybean Frequency Distributions (Imputed")