Preprocessing Issue (Overfitting)

3.1 GLASS DATA

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

library(mlbench) # provides the Glass data set
data(Glass)
glass.df = Glass

glass.df.x = glass.df[, names(glass.df)[names(glass.df)!='Type']]
names(glass.df.x)
## [1] "RI" "Na" "Mg" "Al" "Si" "K"  "Ca" "Ba" "Fe"
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describe(Glass)
##       vars   n  mean   sd median trimmed  mad   min   max range  skew kurtosis
## RI       1 214  1.52 0.00   1.52    1.52 0.00  1.51  1.53  0.02  1.60     4.72
## Na       2 214 13.41 0.82  13.30   13.38 0.64 10.73 17.38  6.65  0.45     2.90
## Mg       3 214  2.68 1.44   3.48    2.87 0.30  0.00  4.49  4.49 -1.14    -0.45
## Al       4 214  1.44 0.50   1.36    1.41 0.31  0.29  3.50  3.21  0.89     1.94
## Si       5 214 72.65 0.77  72.79   72.71 0.57 69.81 75.41  5.60 -0.72     2.82
## K        6 214  0.50 0.65   0.56    0.43 0.17  0.00  6.21  6.21  6.46    52.87
## Ca       7 214  8.96 1.42   8.60    8.74 0.66  5.43 16.19 10.76  2.02     6.41
## Ba       8 214  0.18 0.50   0.00    0.03 0.00  0.00  3.15  3.15  3.37    12.08
## Fe       9 214  0.06 0.10   0.00    0.04 0.00  0.00  0.51  0.51  1.73     2.52
## Type*   10 214  2.54 1.71   2.00    2.31 1.48  1.00  6.00  5.00  1.04    -0.29
##         se
## RI    0.00
## Na    0.06
## Mg    0.10
## Al    0.03
## Si    0.05
## K     0.04
## Ca    0.10
## Ba    0.03
## Fe    0.01
## Type* 0.12

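This data set records the glass type together with the refractive index and the sodium, magnesium, aluminum, silicon, potassium, calcium, barium, and iron content of each sample. Seven glass types are defined, although the Type column summarized above is coded only from 1 to 6. There are 214 observations in total.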

The skewness of each variable is printed below; the last value belongs to the Type column. Later we will use visualizations to get an intuitive view of the skewness.

describe(Glass)$skew
##  [1]  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889
##  [7]  2.0184463  3.3686800  1.7298107  1.0377535
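
For reference, skewness values like these can be reproduced by hand. The sketch below assumes the "type 3" definition that is the default in e1071::skewness: the third central moment divided by the cube of the sample standard deviation. The helper name sample_skew and the toy vectors are illustrative only and are not part of the original analysis.

```r
# Sample skewness: third central moment over the cubed sample sd.
# Assumed to match the "type 3" definition in e1071::skewness.
sample_skew <- function(x) {
  x <- x[!is.na(x)]
  m3 <- mean((x - mean(x))^3)   # third central moment
  m3 / sd(x)^3
}

right_tail <- c(1, 2, 3, 4, 10)   # one large value stretches the right tail
symmetric  <- c(1, 2, 3, 4, 5)

sample_skew(right_tail)   # positive: right skewed
sample_skew(symmetric)    # zero: symmetric
```

A positive value indicates a longer right tail and a negative value a longer left tail, which is how the signs in the output above should be read.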

3.1A

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

library(corrplot)
corrplot(cor(glass.df.x), method = "color", type = "upper")

corrplot(cor(glass.df.x), method = "number", type = "upper")  # upper triangle only

library("PerformanceAnalytics")
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
chart.Correlation(Glass[,1:9], histogram=TRUE, pch=19)

There is a strong positive correlation between RI and Ca, with a correlation coefficient of 0.81. The correlation chart also shows the skewness of each predictor visually: most are either left or right skewed, with K, Ba, Ca, and Fe the most skewed.
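
Reading strong correlations off a plot can be backed up with a small table. The sketch below lists every pair of columns whose absolute correlation exceeds a cutoff. The function name strong_pairs, the 0.75 cutoff, and the toy data are assumptions made for illustration.

```r
# List column pairs whose absolute correlation exceeds a cutoff.
strong_pairs <- function(df, cutoff = 0.75) {
  cm <- cor(df)
  # Use only the upper triangle so each pair is reported once.
  idx <- which(upper.tri(cm) & abs(cm) > cutoff, arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

# Toy data: b is almost a multiple of a, c alternates and is nearly unrelated.
toy <- data.frame(a = 1:20)
toy$b <- 2 * toy$a + rep(c(0.3, -0.3), 10)
toy$c <- rep(c(1, -1), 10)

strong_pairs(toy)
```

Applied to the predictors here, a call such as strong_pairs(glass.df.x) should include the RI and Ca pair noted above.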

3.1B

Do there appear to be any outliers in the data? Are any predictors skewed?

# wide to long
glass.df.stacked<-stack(glass.df)
## Warning in stack.data.frame(glass.df): non-vector columns will be ignored
head(glass.df,3)
##        RI    Na   Mg   Al    Si    K   Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0  0    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0  0    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0  0    1
# head(glass.df.stacked,5)
# tail(glass.df.stacked,4)

ggplot(data = glass.df.stacked, 
       aes(y = values, fill=as.factor(ind))) +
    geom_boxplot(outlier.color = 'red', outlier.shape = 'square') +
    facet_wrap(~ind, scales = 'free') +
    ggtitle("Outlier detection using boxplot")

Mg is the only variable whose values all fall within the boxplot whiskers, so it shows no outliers. Every other variable has outliers, as marked in red on the boxplots. Most of the outliers are on the upper side of the median.
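
The visual check can be complemented with a count of the points a boxplot would flag. The sketch below uses base R's boxplot.stats(), which marks values more than 1.5 times the interquartile range beyond the hinges. The helper name count_outliers and the toy columns are illustrative assumptions. ggplot2 builds its boxes from quantiles rather than hinges, so its flagged points may differ slightly.

```r
# Count the values boxplot.stats() would report as outliers
# (more than 1.5 * IQR beyond the hinges, the default coef).
count_outliers <- function(x) length(boxplot.stats(x)$out)

toy <- data.frame(
  tight  = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  spiked = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)   # one extreme value
)

sapply(toy, count_outliers)
```

The same call on the predictors, sapply(glass.df.x, count_outliers), would give a per-variable count to set beside the plot.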

3.1C

Scale Transformation

Are there any relevant transformations of one or more predictors that might improve the classification model?

ggplot(data = stack(glass.df), aes(x=ind, y=values, fill=ind)) +
  geom_boxplot() +
  ggtitle('Non-Transformed, Original, data')
## Warning in stack.data.frame(glass.df): non-vector columns will be ignored

glass.df.trans = apply(glass.df.x, 2, function(x) scale(x)) %>% data.frame()

ggplot(data = stack(glass.df.trans), aes(x=ind, y=values, fill=ind)) +
  geom_boxplot() +
  ggtitle('Transformed data, Same Scale')

Si is on a much larger scale than the other variables, which compresses the rest toward the bottom of the y-axis. Even setting Si aside, the remaining variables differ widely in magnitude. To reduce the influence of scale on prediction, each predictor is centered and scaled. The second boxplot shows the transformed variables on a common scale. The transformed data can be used in later analysis with more confidence than the raw values.
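
As a quick check of what the scale transformation does, the sketch below applies scale() to a toy data frame with one large-valued and one small-valued column. The column names and values are made up for illustration.

```r
toy <- data.frame(
  big   = c(70, 72, 73, 75),      # values on a large scale
  small = c(0.1, 0.0, 0.3, 0.2)   # values on a small scale
)

# scale() centers each column at 0 and divides by its standard deviation.
toy_scaled <- data.frame(scale(toy))

round(colMeans(toy_scaled), 10)   # both columns centered at 0
sapply(toy_scaled, sd)            # both columns have sd 1
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates a shared axis.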

BoxCoxTrans function

The heavily skewed variables Mg (left skewed) and K, Ba, and Fe (right skewed) deserve special attention, so a Box-Cox transformation is applied as a further step. Box-Cox requires strictly positive values and these four variables contain zeros, so a small constant, 1e-6 (0.000001), is added first.

library(e1071) # for skewness function
## 
## Attaching package: 'e1071'
## The following objects are masked from 'package:PerformanceAnalytics':
## 
##     kurtosis, skewness
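# Add a small offset so the zero-valued predictors become strictly positive.
# Note: this modifies Glass; glass.df.x was copied earlier and is not changed.
Glass$K <- Glass$K + 1e-6
Glass$Ba <- Glass$Ba + 1e-6
Glass$Fe <- Glass$Fe + 1e-6
Glass$Mg <- Glass$Mg + 1e-6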
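
The need for strictly positive values follows from the Box-Cox formula itself. The sketch below is the textbook transform for a fixed lambda, not caret's internal code; caret's BoxCoxTrans additionally estimates lambda from the data. The function name box_cox and the sample vector are illustrative assumptions.

```r
# Box-Cox transform for a fixed lambda:
#   (x^lambda - 1) / lambda   if lambda != 0
#   log(x)                    if lambda == 0
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))   # only defined for strictly positive values
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

x <- c(0.5, 1, 2, 4)
box_cox(x, 0)     # same as log(x)
box_cox(x, 1)     # x - 1: a shift only, shape unchanged
box_cox(x, 0.5)   # compresses large values, similar to a square root
```

With lambda = 0 a zero would map to log(0), which is -Inf, so a zero-valued predictor has to be shifted before the transform can be applied.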

After adding the offset, we apply the Box-Cox transform to each predictor and compare the skewness before and after the transformation.

library(e1071)
library(lattice)
library(caret)

# Estimate a Box-Cox transformation for one predictor, apply it,
# and return the skewness of the transformed values.
boxcoxplots <- function(x){
  boxcox <- BoxCoxTrans(x)
  boxcox_pred <- predict(boxcox, x)
  # skewness after the Box-Cox transform
  skewness(boxcox_pred)
}

apply(glass.df.x, 2, skewness) # before
##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107
# after
apply(glass.df.x, 2, boxcoxplots)
##          RI          Na          Mg          Al          Si           K 
##  1.56566039  0.03384644 -1.13645228  0.09105899 -0.65090568  6.46008890 
##          Ca          Ba          Fe 
## -0.19395573  3.36867997  1.72981071

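Comparing skewness before and after, the Box-Cox transformation clearly reduced skewness for Na, Al, and Ca, changed RI and Si only slightly, and left Mg, K, Ba, and Fe unchanged. The last four are unchanged because the comparison runs on glass.df.x, which still contains zeros: the 1e-6 offset was added to Glass after glass.df.x had been copied. As far as we can tell from the caret documentation, BoxCoxTrans applies no transformation when a variable has non-positive values.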
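
In the "after" output, Mg, K, Ba, and Fe show exactly the same skewness as before. One likely reason is that R copies data frames by value, so an offset added to Glass does not reach glass.df.x if that copy was made earlier. The sketch below shows the behavior in miniature; the names raw and predictors are hypothetical stand-ins for Glass and glass.df.x.

```r
# Hypothetical miniature of the workflow above.
raw <- data.frame(K = c(0, 0.5, 1))   # stands in for Glass
predictors <- raw                      # stands in for glass.df.x (copied first)

raw$K <- raw$K + 1e-6                  # offset added to the original only

min_before <- min(predictors$K)        # still 0: the earlier copy is unaffected
min_before

# Add the offset to the object that is actually transformed:
predictors$K <- predictors$K + 1e-6
all(predictors$K > 0)                  # now strictly positive
```

Under this assumption, adding the offset to glass.df.x (or re-deriving it from Glass after the offset) would let the Box-Cox step act on all nine predictors. The resulting skewness values would then differ from those printed above.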