Exercise 3.1

3.1. The UC Irvine Machine Learning Repository1 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
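The numeric summary below can be reproduced with base R (the call is implied by the output but was not shown):

summary(Glass)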

       RI              Na             Mg             Al             Si       
 Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290   Min.   :69.81
 1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190   1st Qu.:72.28
 Median :1.518   Median :13.30   Median :3.480   Median :1.360   Median :72.79
 Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445   Mean   :72.65
 3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630   3rd Qu.:73.09
 Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500   Max.   :75.41
       K                Ca              Ba              Fe          Type  
 Min.   :0.0000   Min.   : 5.430   Min.   :0.000   Min.   :0.00000   1:70
 1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000   1st Qu.:0.00000   2:76
 Median :0.5550   Median : 8.600   Median :0.000   Median :0.00000   3:17
 Mean   :0.4971   Mean   : 8.957   Mean   :0.175   Mean   :0.05701   5:13
 3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000   3rd Qu.:0.10000   6: 9
 Max.   :6.2100   Max.   :16.190   Max.   :3.150   Max.   :0.51000   7:29

There are 9 numerical predictors (RI, Na, Mg, Al, Si, K, Ca, Ba, and Fe) and one categorical target with 6 levels (Type). The data set is highly skewed toward categories 1 and 2, with 146 out of 214 observations falling into those two categories alone. Ba and Fe have their first quartile and median values at 0, indicating that their distributions contain a lot of zeros.

Histograms of the predictor distributions show that all of them, except perhaps Si, are moderately to highly skewed. The histograms below have been colored by the target categories and show that the distributions differ across target values (especially for Al and Na), indicating that those variables may be good predictors. Almost all of the zero values in Mg belong to categories 5, 6, and 7, and almost all of the non-zero values in Ba correspond to category 7, so treating zeroes as missing data would probably reduce the predictive accuracy of our model.
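A sketch of how those histograms could be produced with ggplot2 and tidyr; the glass_long reshaping and the bin count are my choices, not from the original output:

library(ggplot2)
library(tidyr)

# Reshape to long format so each predictor gets its own facet
glass_long <- pivot_longer(Glass, cols = -Type,
                           names_to = "predictor", values_to = "value")

# Histograms of each predictor, filled by glass Type
ggplot(glass_long, aes(x = value, fill = Type)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ predictor, scales = "free")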

Those differences in the predictor distributions when separated by the target variable are even more evident in the box plots below. All of the predictors show a lot of variation across the target values. The plot for Si shows the least difference between categories, indicating that it may have less predictive value. We can also see many outliers, indicated by the red dots above and below the plot ‘whiskers’. There are some really extreme outliers, like the two at the top of the Ba plot, one at the top of the K plot, etc…
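The box plots could be drawn along these lines, reusing glass_long from the histogram sketch (the outlier color is chosen to match the description above):

# Box plots of each predictor by Type, with outliers drawn in red
ggplot(glass_long, aes(x = Type, y = value)) +
  geom_boxplot(outlier.colour = "red") +
  facet_wrap(~ predictor, scales = "free_y")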

Only RI and Ca seem to be very highly correlated with each other, with a correlation coefficient of about 0.81.

The correlation plot below has been colored by the target variable and shows some clear patterns in the distributions by Type, confirming what we noted above.
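The result below is consistent with a caret::findCorrelation call along these lines; the 0.75 cutoff is my assumption, chosen to sit below the ~0.81 RI/Ca correlation:

library(caret)

# Pairwise correlations of the nine numeric predictors
glass_cor <- cor(Glass[, 1:9])

# Suggest columns to drop; cutoff = 0.75 is an assumption
findCorrelation(glass_cor, cutoff = 0.75)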

## [1] 7

The findCorrelation function suggests removing column 7, which corresponds to Ca.

There are no missing values in the data set.
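One quick way to confirm this; the table below was presumably built from similar counts:

# Count NA values per column
colSums(is.na(Glass))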

Missing Values in the Glass Dataset

      RI  Na  Mg  Al  Si   K  Ca  Ba  Fe  Type
NA’s   0   0   0   0   0   0   0   0   0     0

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

As noted above, most, if not all, of the predictors have both outliers and skewed distributions, and a few predictors also contain a large percentage of zeroes.

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?
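The statistics below can be reproduced roughly as follows; using e1071::skewness and these column names are my choices (note that max/min evaluates to Inf in R whenever the minimum is 0):

library(e1071)

preds <- Glass[, 1:9]
data.frame(
  skewness_statistic = sapply(preds, skewness),
  ratio_max_to_min   = sapply(preds, function(x) max(x) / min(x))
)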

   skewness_statistic ratio_max_to_min
RI          1.6027151         1.015075
Na          0.4478343         1.619758
Mg         -1.1364523              Inf
Al          0.8946104        12.068965
Si         -0.7202392         1.080218
K           6.4600889              Inf
Ca          2.0184463         2.981584
Ba          3.3686800              Inf
Fe          1.7298107              Inf

Because there are so many zero values in the data, it’s hard to gauge the severity of the skew using the max-to-min ratio, since division by zero is undefined. But we have some large skewness values, and the skew is clearly visible in the histograms.

The skewed distributions might be improved by log, square root, or inverse transformations or a Box-Cox transformation.

Unfortunately, because of the zero values, Mg, K, Ba, and Fe cannot be transformed using Box-Cox. I’m still not sure how to handle that, although I found this post by Rob Hyndman on the subject: Transforming data with zeros.
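One way to apply Box-Cox where it is defined is caret’s preProcess, which estimates lambdas only for strictly positive predictors and passes the zero-containing columns through unchanged (a sketch; the object names are mine):

library(caret)

# Box-Cox lambdas are estimated only for all-positive predictors;
# Mg, K, Ba, and Fe (which contain zeros) pass through unchanged
bc <- preProcess(Glass[, 1:9], method = "BoxCox")
glass_bc <- predict(bc, Glass[, 1:9])
head(glass_bc)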

        RI       Na   Mg        Al       Si    K        Ca Ba   Fe
 0.2838746 2.613007 4.49 0.0976177 2575.684 0.06 0.8254539  0 0.00
 0.2829051 2.631169 3.60 0.3323808 2644.326 0.48 0.8145827  0 0.00
 0.2824954 2.604909 3.55 0.4819347 2663.270 0.39 0.8139144  0 0.00
 0.2829194 2.580974 3.69 0.2715633 2635.606 0.57 0.8195032  0 0.00
 0.2828507 2.585506 3.62 0.2271057 2669.843 0.55 0.8176698  0 0.00
 0.2824323 2.548664 3.61 0.5455844 2661.810 0.64 0.8176698  0 0.26

I’m not seeing a lot of improvement in the histograms after the Box-Cox transformation, so let’s try something else… Another post suggests three ways of handling the zeros:

  1. Add a constant value (c) to each value of the variable, then take a log transformation
  2. Impute the zero values with the mean
  3. Take the square root instead of the log for the transformation2

As noted before, the zero values have predictive value and do not seem to be data collection errors, so imputing them with the mean would not be a good choice. Let’s try the square root transformation first.
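A minimal version of that (the object name is mine):

# Square root handles zeros without producing undefined values
glass_sqrt <- sqrt(Glass[, 1:9])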

The square root transformation seems to have done a little bit better at taming our distributions, however they are still highly skewed in some cases.

Let’s try one more of the suggestions by using \(\log(x+1)\) to transform our data.

Maybe a little better, but still not great!
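The metrics below come from caret’s nearZeroVar with saveMetrics = TRUE, presumably invoked like this:

library(caret)
nearZeroVar(Glass, saveMetrics = TRUE)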

     freqRatio percentUnique zeroVar   nzv
RI    1.000000     83.177570   FALSE FALSE
Na    1.000000     66.355140   FALSE FALSE
Mg    5.250000     43.925234   FALSE FALSE
Al    1.333333     55.140187   FALSE FALSE
Si    1.000000     62.149533   FALSE FALSE
K     2.500000     30.373832   FALSE FALSE
Ca    1.000000     66.822430   FALSE FALSE
Ba   88.000000     15.887851   FALSE FALSE
Fe   20.571429     14.953271   FALSE FALSE
Type  1.085714      2.803738   FALSE FALSE

The caret function nearZeroVar does not indicate that the variables with high frequencies of zeroes should be removed, so another solution might be to try the alternative Box-Cox transformation that Hyndman suggested in his blog post.3

The data below has been transformed using the log1p function Hyndman referenced.
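This amounts to the following sketch, whose output matches the values below (log1p computes log(1 + x)):

# Apply log(1 + x) to every predictor and show the first rows
head(log1p(Glass[, 1:9]))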

        RI       Na       Mg        Al       Si         K       Ca Ba        Fe
 0.9246596 2.683758 1.702928 0.7419373 4.287441 0.0582689 2.277267  0 0.0000000
 0.9233100 2.700690 1.526056 0.8586616 4.300410 0.3920421 2.178155  0 0.0000000
 0.9227419 2.676216 1.515127 0.9321641 4.303930 0.3293037 2.172476  0 0.0000000
 0.9233299 2.653946 1.545433 0.8285518 4.298781 0.4510756 2.221375  0 0.0000000
 0.9232346 2.658159 1.530395 0.8064759 4.305146 0.4382549 2.204972  0 0.0000000
 0.9226544 2.623944 1.528228 0.9631743 4.303660 0.4946962 2.204972  0 0.2311117

I think that in this case it might be necessary to know whether these are real zeroes or ‘censored’ data, where the zeroes may actually be trace amounts below the detection limit. For censored data, Kuhn and Johnson state:

when a sample has a value below the limit of detection, the actual limit can be used in place of the real value. For this situation, it is also common to use a random number between zero and the limit of detection.4

Imputing a non-zero value might improve our transformation options by allowing us to use a regular Box-Cox transformation on all the predictors, but I am not convinced it would produce distributions much closer to normal than the alternative Box-Cox transformation did.
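For completeness, here is a hypothetical sketch of the random-imputation idea from the quote, applied to Fe with an invented detection limit of 0.01 (neither the limit nor the choice of Fe comes from the data source):

set.seed(42)        # for reproducibility
det_limit <- 0.01   # assumed detection limit, purely for illustration

glass_imp <- Glass
zeros <- glass_imp$Fe == 0
# Replace censored zeros with random draws between 0 and the detection limit
glass_imp$Fe[zeros] <- runif(sum(zeros), min = 0, max = det_limit)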

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

# library(mlbench)
data(Soybean)
## See ?Soybean for details

Data Dictionary from Soybean {mlbench} Documentation

[,1]    Class   the 19 classes
[,2]    date    apr(0),may(1),june(2),july(3),aug(4),sept(5),oct(6).
[,3]    plant.stand normal(0),lt-normal(1).
[,4]    precip  lt-norm(0),norm(1),gt-norm(2).
[,5]    temp    lt-norm(0),norm(1),gt-norm(2).
[,6]    hail    yes(0),no(1).
[,7]    crop.hist   dif-lst-yr(0),s-l-y(1),s-l-2-y(2), s-l-7-y(3).
[,8]    area.dam    scatter(0),low-area(1),upper-ar(2),whole-field(3).
[,9]    sever   minor(0),pot-severe(1),severe(2).
[,10]   seed.tmt    none(0),fungicide(1),other(2).
[,11]   germ    90-100%(0),80-89%(1),lt-80%(2).
[,12]   plant.growth    norm(0),abnorm(1).
[,13]   leaves  norm(0),abnorm(1).
[,14]   leaf.halo   absent(0),yellow-halos(1),no-yellow-halos(2).
[,15]   leaf.marg   w-s-marg(0),no-w-s-marg(1),dna(2).
[,16]   leaf.size   lt-1/8(0),gt-1/8(1),dna(2).
[,17]   leaf.shread absent(0),present(1).
[,18]   leaf.malf   absent(0),present(1).
[,19]   leaf.mild   absent(0),upper-surf(1),lower-surf(2).
[,20]   stem    norm(0),abnorm(1).
[,21]   lodging yes(0),no(1).
[,22]   stem.cankers    absent(0),below-soil(1),above-s(2),ab-sec-nde(3).
[,23]   canker.lesion   dna(0),brown(1),dk-brown-blk(2),tan(3).
[,24]   fruiting.bodies absent(0),present(1).
[,25]   ext.decay   absent(0),firm-and-dry(1),watery(2).
[,26]   mycelium    absent(0),present(1).
[,27]   int.discolor    none(0),brown(1),black(2).
[,28]   sclerotia   absent(0),present(1).
[,29]   fruit.pods  norm(0),diseased(1),few-present(2),dna(3).
[,30]   fruit.spots absent(0),col(1),br-w/blk-speck(2),distort(3),dna(4).
[,31]   seed    norm(0),abnorm(1).
[,32]   mold.growth absent(0),present(1).
[,33]   seed.discolor   absent(0),present(1).
[,34]   seed.size   norm(0),lt-norm(1).
[,35]   shriveling  absent(0),present(1).
[,36]   roots   norm(0),rotted(1),galls-cysts(2).

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

In Applied Predictive Modeling, Kuhn & Johnson describe ‘degenerate distributions’ as those with zero or near-zero variance.5 The output of the nearZeroVar function shows that there are three variables in the Soybean data set, leaf.mild, mycelium, and sclerotia, that are ‘degenerate’ in this way.
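These can be flagged like so (assuming saveMetrics = TRUE and then filtering to the flagged rows; the nzv object name is mine):

library(caret)
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ]  # keep only the near-zero-variance variables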

variable   freqRatio percentUnique zeroVar  nzv
leaf.mild      26.75     0.4392387   FALSE TRUE
mycelium      106.50     0.2928258   FALSE TRUE
sclerotia      31.25     0.2928258   FALSE TRUE

We can see this near-zero variance in the plots below, as well as some interesting associations between NA values and certain target classes, which we will explore further in the next section.

Footnotes