3.1. The UC Irvine Machine Learning Repository1 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
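Beyond mlbench, the chunks below rely on several other packages. They were presumably loaded in an earlier setup chunk; something along these lines would cover everything used here:
library(tidyverse)   # dplyr, tidyr, ggplot2, purrr, forcats
library(caret)       # BoxCoxTrans, nearZeroVar, findCorrelation
library(e1071)       # skewness (assuming it comes from here rather than the moments package)
library(knitr)       # kable
library(kableExtra)  # kable_styling
library(GGally)      # ggscatmat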
kable(summary(Glass[,1:5])) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

| RI | Na | Mg | Al | Si |
|---|---|---|---|---|
| Min. :1.511 | Min. :10.73 | Min. :0.000 | Min. :0.290 | Min. :69.81 |
| 1st Qu.:1.517 | 1st Qu.:12.91 | 1st Qu.:2.115 | 1st Qu.:1.190 | 1st Qu.:72.28 |
| Median :1.518 | Median :13.30 | Median :3.480 | Median :1.360 | Median :72.79 |
| Mean :1.518 | Mean :13.41 | Mean :2.685 | Mean :1.445 | Mean :72.65 |
| 3rd Qu.:1.519 | 3rd Qu.:13.82 | 3rd Qu.:3.600 | 3rd Qu.:1.630 | 3rd Qu.:73.09 |
| Max. :1.534 | Max. :17.38 | Max. :4.490 | Max. :3.500 | Max. :75.41 |
kable(summary(Glass[,6:10])) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

| K | Ca | Ba | Fe | Type |
|---|---|---|---|---|
| Min. :0.0000 | Min. : 5.430 | Min. :0.000 | Min. :0.00000 | 1:70 |
| 1st Qu.:0.1225 | 1st Qu.: 8.240 | 1st Qu.:0.000 | 1st Qu.:0.00000 | 2:76 |
| Median :0.5550 | Median : 8.600 | Median :0.000 | Median :0.00000 | 3:17 |
| Mean :0.4971 | Mean : 8.957 | Mean :0.175 | Mean :0.05701 | 5:13 |
| 3rd Qu.:0.6100 | 3rd Qu.: 9.172 | 3rd Qu.:0.000 | 3rd Qu.:0.10000 | 6: 9 |
| Max. :6.2100 | Max. :16.190 | Max. :3.150 | Max. :0.51000 | 7:29 |
There are nine numerical predictors (RI, Na, Mg, Al, Si, K, Ca, Ba, and Fe) and one categorical target with six levels (Type; the original description lists seven classes, but type 4 has no samples in this data set). The data set is heavily skewed toward categories 1 and 2, with 146 of the 214 observations falling into those two categories alone. Ba and Fe have first-quartile and median values of 0, indicating that their distributions contain a lot of zeros.
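A quick tabulation of the target confirms the imbalance (the counts match the Type column of the summary above):
table(Glass$Type)
##  1  2  3  5  6  7
## 70 76 17 13  9 29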
Glass[,1:9] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(fill = '#4575b4') +
theme(panel.background = element_blank(), legend.position="top")

Histograms of the predictor distributions show that all of them, except perhaps Si, are moderately to highly skewed. The histogram below has been colored by the target categories and shows that the distributions differ across target values (especially for Al and Na), indicating that those variables may be good predictors. Almost all of the zero values in Mg belong to categories 5, 6, and 7, and almost all of the non-zero values in Ba correspond to category 7, so treating zeroes as missing data would probably reduce the predictive accuracy of our model.
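To put numbers on those zeroes, the share of exact zeros per predictor can be checked directly; a quick sketch:
# proportion of exact zeros in each predictor
round(colMeans(Glass[, 1:9] == 0), 2)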
Glass %>%
gather(-Type, key = "var", value = "val") %>%
ggplot(aes(x = val, fill=Type)) +
geom_histogram(bins=10, alpha=1) +
facet_wrap(~ var, scales = "free") +
scale_fill_manual("target",
values = c('#d73027','#fc8d59','#fee090','#e0f3f8','#91bfdb','#4575b4')) +
xlab("") +
ylab("") +
theme(panel.background = element_blank(), legend.position="top")

Those differences in the distributions of the predictors when separated by the target variable are even more evident in the box plots below. All of the predictors show considerable variation in their distributions across target values. The plot for Si shows the least difference between categories, indicating that it may have less predictive value. We can also see many outliers, indicated by the red dots above and below the plot 'whiskers'. Some are quite extreme, like the two at the top of the Ba plot and the one at the top of the K plot.
Glass %>%
gather(-Type,key = "var", value = "val") %>%
ggplot(aes(x=factor(Type), y=val)) +
geom_boxplot(width=.5, fill="#58BFFF", outlier.colour="red", outlier.size = 1) +
stat_summary(aes(colour="mean"), fun.y=mean, geom="point",
size=2, show.legend=TRUE) +
stat_summary(aes(colour="median"), fun.y=median, geom="point",
size=2, show.legend=TRUE) +
facet_wrap(~ var, scales = "free", ncol=3) +
labs(colour="Statistics", x="", y="") +
scale_colour_manual(values=c("#9900FF", "#3300FF")) +
theme(panel.background=element_blank())

Only RI and Ca seem to be highly correlated with each other, with a correlation of about 0.81.
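That figure can be verified directly:
round(cor(Glass$RI, Glass$Ca), 2)  # about 0.81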
The correlation plot below has been colored by the target variable and shows some clear patterns in the distributions by Type confirming what we noted above.
Glass %>%
ggscatmat(color="Type", alpha=0.6) +
scale_color_manual(values=c('#d73027','#fc8d59','#fee090','#e0f3f8','#91bfdb','#4575b4')) +
theme(panel.background=element_blank(), legend.position="top",
axis.text.x = element_text(angle=-40, vjust=1, hjust=0))
# The call producing the output below is not shown in the original; it was
# presumably something like this (a cutoff at or below 0.81 is needed, since
# the default of 0.90 would not flag any pair here):
findCorrelation(cor(Glass[,1:9]), cutoff = 0.75)
## [1] 7

The findCorrelation function suggests removing column 7, which corresponds to Ca.
There are no missing values in the data set.
missing <- data.frame(t(apply(is.na(Glass), 2, sum)))
# kable(missing, col.names = "NA's")
row.names(missing) <- c("NA's")
kable(missing, caption='Missing Values in the Glass Dataset') %>%
kable_styling(bootstrap_options = c("condensed"))

| | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| NA's | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
As noted above, most if not all of the predictors have both outliers and skewed distributions, and a few predictors contain a large percentage of zeroes.
skewness_statistic <- apply(Glass[,1:9], 2, skewness)
ratio_max_to_min <- apply(Glass[,1:9], 2, max)/apply(Glass[,1:9], 2, min)
kable(data.frame(skewness_statistic, ratio_max_to_min)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| | skewness_statistic | ratio_max_to_min |
|---|---|---|
| RI | 1.6027151 | 1.015075 |
| Na | 0.4478343 | 1.619758 |
| Mg | -1.1364523 | Inf |
| Al | 0.8946104 | 12.068965 |
| Si | -0.7202392 | 1.080218 |
| K | 6.4600889 | Inf |
| Ca | 2.0184463 | 2.981584 |
| Ba | 3.3686800 | Inf |
| Fe | 1.7298107 | Inf |
Because there are so many zero values in the data, it's hard to gauge the severity of the skew from the max-to-min ratio, since division by zero yields Inf. But the skewness statistics are clearly large, and the skew is plainly visible in the histograms.
The skewed distributions might be improved by log, square root, or inverse transformations, or by a Box-Cox transformation.
Unfortunately, because of the zero values, Mg, K, Ba, and Fe cannot be transformed using Box-Cox. I'm still not sure how to handle that, although I found this post by Rob Hyndman on the subject: Transforming data with zeros.
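Note that caret's BoxCoxTrans cannot estimate a lambda when a variable contains non-positive values; in that case predict() simply returns the data unchanged, which is why Mg, K, Ba, and Fe pass through untouched in the table below. For example:
BoxCoxTrans(Glass$Ba)  # reports that lambda could not be estimated; no transformation is applied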
# https://stackoverflow.com/questions/46485024/how-to-use-boxcoxtrans-function-in-r
GlassBC <- apply(Glass[,1:9], 2, BoxCoxTrans)
trans_Glass <- purrr::map2(GlassBC, Glass[,1:9], function(x, y) predict(x, y))
trans_Glass_df = as.data.frame(do.call(cbind, trans_Glass))
kable(head(trans_Glass_df)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| RI | Na | Mg | Al | Si | K | Ca | Ba | Fe |
|---|---|---|---|---|---|---|---|---|
| 0.2838746 | 2.613007 | 4.49 | 0.0976177 | 2575.684 | 0.06 | 0.8254539 | 0 | 0.00 |
| 0.2829051 | 2.631169 | 3.60 | 0.3323808 | 2644.326 | 0.48 | 0.8145827 | 0 | 0.00 |
| 0.2824954 | 2.604909 | 3.55 | 0.4819347 | 2663.270 | 0.39 | 0.8139144 | 0 | 0.00 |
| 0.2829194 | 2.580974 | 3.69 | 0.2715633 | 2635.606 | 0.57 | 0.8195032 | 0 | 0.00 |
| 0.2828507 | 2.585506 | 3.62 | 0.2271057 | 2669.843 | 0.55 | 0.8176698 | 0 | 0.00 |
| 0.2824323 | 2.548664 | 3.61 | 0.5455844 | 2661.810 | 0.64 | 0.8176698 | 0 | 0.26 |
trans_Glass_df %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(fill = '#4575b4') +
theme(panel.background = element_blank(), legend.position="top")

I'm not seeing much improvement in the histograms after the Box-Cox transformation, so let's try something else. Another post suggests three ways of handling the zeros:
- Add a constant value (c) to each value of the variable, then take the log transformation
- Impute zero values with the mean
- Take the square root instead of the log for the transformation2
As noted before, the zero values carry predictive information and do not seem to be data collection errors, so imputing them with the mean would not be a good choice. Let's try the square root transformation first.
# Square Root transformation
Glass_SqRt <- data.frame(apply(Glass[,1:9], 2, sqrt))
Glass_SqRt %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(fill = '#4575b4') +
theme(panel.background = element_blank(), legend.position="top")

The square root transformation seems to have done a little better at taming our distributions; however, some are still highly skewed.
Let’s try one more of the suggestions by using \(log(x+1)\) to transform our data.
# log(x+1) transformation
Glass_log <- data.frame(apply(Glass[,1:9]+1, 2, log))
Glass_log %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(fill = '#4575b4') +
theme(panel.background = element_blank(), legend.position="top")

Maybe a little better, but still not great!
kable(nearZeroVar(Glass, names = TRUE, saveMetrics=TRUE)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| | freqRatio | percentUnique | zeroVar | nzv |
|---|---|---|---|---|
| RI | 1.000000 | 83.177570 | FALSE | FALSE |
| Na | 1.000000 | 66.355140 | FALSE | FALSE |
| Mg | 5.250000 | 43.925234 | FALSE | FALSE |
| Al | 1.333333 | 55.140187 | FALSE | FALSE |
| Si | 1.000000 | 62.149533 | FALSE | FALSE |
| K | 2.500000 | 30.373832 | FALSE | FALSE |
| Ca | 1.000000 | 66.822430 | FALSE | FALSE |
| Ba | 88.000000 | 15.887851 | FALSE | FALSE |
| Fe | 20.571429 | 14.953271 | FALSE | FALSE |
| Type | 1.085714 | 2.803738 | FALSE | FALSE |
The caret function nearZeroVar does not indicate that the variables with high frequencies of zeroes should be removed, so another option is the alternative Box-Cox transformation that Hyndman suggested in his blog post.3
The data below have been transformed using the log1p function Hyndman referenced (log1p(x) computes log(1 + x), so this is the same transformation as above, evaluated more accurately for values near zero).
Alt_BoxCox_Glass <- log1p(Glass[,1:9])
kable(head(Alt_BoxCox_Glass)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| RI | Na | Mg | Al | Si | K | Ca | Ba | Fe |
|---|---|---|---|---|---|---|---|---|
| 0.9246596 | 2.683758 | 1.702928 | 0.7419373 | 4.287441 | 0.0582689 | 2.277267 | 0 | 0.0000000 |
| 0.9233100 | 2.700690 | 1.526056 | 0.8586616 | 4.300410 | 0.3920421 | 2.178155 | 0 | 0.0000000 |
| 0.9227419 | 2.676216 | 1.515127 | 0.9321641 | 4.303930 | 0.3293037 | 2.172476 | 0 | 0.0000000 |
| 0.9233299 | 2.653946 | 1.545433 | 0.8285518 | 4.298781 | 0.4510756 | 2.221375 | 0 | 0.0000000 |
| 0.9232346 | 2.658159 | 1.530395 | 0.8064759 | 4.305146 | 0.4382549 | 2.204972 | 0 | 0.0000000 |
| 0.9226544 | 2.623944 | 1.528228 | 0.9631743 | 4.303660 | 0.4946962 | 2.204972 | 0 | 0.2311117 |
Alt_BoxCox_Glass %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(fill = '#4575b4') +
theme(panel.background = element_blank(), legend.position="top")

I think that in this case it would be necessary to know whether these are real zeroes or 'censored' data, where the zeroes may actually be trace amounts below the detection limit. For this situation Kuhn and Johnson state:
when a sample has a value below the limit of detection, the actual limit can be used in place of the real value. For this situation, it is also common to use a random number between zero and the limit of detection.4
Imputing a non-zero value might improve our transformation options by allowing a regular Box-Cox transformation on all the predictors, but I am not convinced it would produce much more normal-looking distributions than the alternative Box-Cox transformation.
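A minimal sketch of that idea, assuming a purely hypothetical detection limit of 0.01 (weight percent) for the affected elements:
set.seed(42)
detection_limit <- 0.01  # hypothetical; a real value would come from the assay documentation
impute_censored <- function(x, limit = detection_limit) {
  zeros <- x == 0
  x[zeros] <- runif(sum(zeros), min = 0, max = limit)  # random draw below the limit
  x
}
Glass_pos <- Glass
Glass_pos[, c("Mg", "K", "Ba", "Fe")] <-
  lapply(Glass_pos[, c("Mg", "K", "Ba", "Fe")], impute_censored)
# all predictors are now strictly positive, so BoxCoxTrans can estimate a lambda for each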
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
# library(mlbench)
data(Soybean)
## See ?Soybean for details
[,1] Class the 19 classes
[,2] date apr(0),may(1),june(2),july(3),aug(4),sept(5),oct(6).
[,3] plant.stand normal(0),lt-normal(1).
[,4] precip lt-norm(0),norm(1),gt-norm(2).
[,5] temp lt-norm(0),norm(1),gt-norm(2).
[,6] hail yes(0),no(1).
[,7] crop.hist dif-lst-yr(0),s-l-y(1),s-l-2-y(2), s-l-7-y(3).
[,8] area.dam scatter(0),low-area(1),upper-ar(2),whole-field(3).
[,9] sever minor(0),pot-severe(1),severe(2).
[,10] seed.tmt none(0),fungicide(1),other(2).
[,11] germ 90-100%(0),80-89%(1),lt-80%(2).
[,12] plant.growth norm(0),abnorm(1).
[,13] leaves norm(0),abnorm(1).
[,14] leaf.halo absent(0),yellow-halos(1),no-yellow-halos(2).
[,15] leaf.marg w-s-marg(0),no-w-s-marg(1),dna(2).
[,16] leaf.size lt-1/8(0),gt-1/8(1),dna(2).
[,17] leaf.shread absent(0),present(1).
[,18] leaf.malf absent(0),present(1).
[,19] leaf.mild absent(0),upper-surf(1),lower-surf(2).
[,20] stem norm(0),abnorm(1).
[,21] lodging yes(0),no(1).
[,22] stem.cankers absent(0),below-soil(1),above-s(2),ab-sec-nde(3).
[,23] canker.lesion dna(0),brown(1),dk-brown-blk(2),tan(3).
[,24] fruiting.bodies absent(0),present(1).
[,25] ext.decay absent(0),firm-and-dry(1),watery(2).
[,26] mycelium absent(0),present(1).
[,27] int.discolor none(0),brown(1),black(2).
[,28] sclerotia absent(0),present(1).
[,29] fruit.pods norm(0),diseased(1),few-present(2),dna(3).
[,30] fruit.spots absent(0),col(1),br-w/blk-speck(2),distort(3),dna(4).
[,31] seed norm(0),abnorm(1).
[,32] mold.growth absent(0),present(1).
[,33] seed.discolor absent(0),present(1).
[,34] seed.size norm(0),lt-norm(1).
[,35] shriveling absent(0),present(1).
[,36] roots norm(0),rotted(1),galls-cysts(2).
# reorder columns by number of levels in factors
Soybeans <- Soybean[, c(1,2,7,8,22,23,29,30,4,5,9:11,14:16,19,25,27,36,
3,6,12,13,17,18,20,21,24,26,28,31:35)]
kable(summary(data.frame(Soybeans[,1:2]))) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
kable(summary(Soybeans[,3:8])) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
kable(summary(Soybeans[,9:14])) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
kable(summary(Soybeans[,15:20])) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
kable(summary(Soybeans[,21:28])) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
kable(summary(Soybeans[,29:36])) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

In Applied Predictive Modeling, Kuhn & Johnson describe 'degenerate distributions' as those with zero or near-zero variance.5 The output of the nearZeroVar function shows that three variables in the Soybean data set, leaf.mild, mycelium, and sclerotia, are 'degenerate' in this way.
near_zero_soy <- nearZeroVar(Soybean, names = TRUE, saveMetrics=TRUE) %>%
rownames_to_column('variable') %>%
filter(zeroVar == TRUE | nzv == TRUE)
kable(near_zero_soy) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| variable | freqRatio | percentUnique | zeroVar | nzv |
|---|---|---|---|---|
| leaf.mild | 26.75 | 0.4392387 | FALSE | TRUE |
| mycelium | 106.50 | 0.2928258 | FALSE | TRUE |
| sclerotia | 31.25 | 0.2928258 | FALSE | TRUE |
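The freqRatio column is the frequency of the most common level divided by that of the second most common; for mycelium this is easy to verify:
table(Soybean$mycelium)  # the ratio of the two counts reproduces the 106.5 freqRatio above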
We can see this near-zero variance in the plots below, as well as some interesting associations between NA values and certain target classes, which we will explore further in the next section.
Soybean %>%
gather(-Class, key = "var", value = "val") %>%
ggplot(aes(x = val, fill=Class)) +
geom_bar(alpha=1) +
facet_wrap(~ var, scales = "free") +
scale_fill_manual("target",
values = c('#f1b6da','#c51b7d','#b2df8a','#33a02c','#fb9a99',
'#e31a1c','#fdbf6f','#ff7f00','#cab2d6','#6a3d9a',
'#ffeda0','#dfc27d','#80cdc1','#01665e','#a6cee3',
'#1f78b4','#bf812d','#8c510a','#bababa')) +
xlab("") +
ylab("") +
theme(panel.background = element_blank(), legend.position="top")

Because the missingness in this data set is itself predictive, unless domain knowledge suggested some reason not to, I would choose to create an additional factor level for each of those variables to indicate missingness.
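As a rough check on where the missingness concentrates, the total NA count per class can be tabulated; a quick base-R sketch:
# total NA count across all predictors, by class
na_by_class <- sapply(split(Soybean[, -1], Soybean$Class),
                      function(d) sum(is.na(d)))
sort(na_by_class, decreasing = TRUE)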
# https://gist.github.com/riinuots/e517c36b1feb480df981721a00e0e24a
Soybean_Imputed <- Soybean %>%
mutate_if(is.factor, fct_explicit_na)
Soybean_Imputed %>%
gather(-Class, key = "var", value = "val") %>%
ggplot(aes(x = val, fill=Class)) +
geom_bar(alpha=1) +
facet_wrap(~ var, scales = "free") +
scale_fill_manual("target",
values = c('#f1b6da','#c51b7d','#b2df8a','#33a02c','#fb9a99',
'#e31a1c','#fdbf6f','#ff7f00','#cab2d6','#6a3d9a',
'#ffeda0','#dfc27d','#80cdc1','#01665e','#a6cee3',
'#1f78b4','#bf812d','#8c510a','#bababa')) +
xlab("") +
ylab("") +
theme(panel.background = element_blank(), legend.position="top")