3.1. The UC Irvine Machine Learning Repository
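
The code below assumes the following packages are loaded: mlbench provides the Glass and Soybean data sets, tidyr and ggplot2/GGally handle the reshaping and plotting, e1071 provides skewness, and caret provides BoxCoxTrans and nearZeroVar.

library(mlbench)  # Glass and Soybean data sets
library(tidyr)    # pivot_longer
library(ggplot2)  # ggplot
library(GGally)   # ggpairs
library(e1071)    # skewness
library(caret)    # BoxCoxTrans, nearZeroVar, preProcess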

data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Glass_long <- pivot_longer(Glass, -Type, names_to="predictor", values_to="value")

ggpairs(Glass)

The ggpairs correlation values show that most pairwise correlations between predictors are weak, with the majority at or below 0.44 in absolute value. The one exception is the relationship between the refractive index (RI) and the Calcium (Ca) content, which is the only strongly correlated pair.
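
To put a number on that relationship, we can pull the pairwise correlations directly from the numeric predictors:

cor_matrix <- cor(Glass[, 1:9])  # correlations among the numeric predictors
cor_matrix["RI", "Ca"]           # the one strongly correlated pair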

ggplot(Glass_long, aes(x = value)) +
  geom_density() +
  facet_wrap(~predictor, scales="free")

The density plots show how each predictor's values are distributed. Looking at these we can see that while Al, Na, and Si all have relatively normal-looking, unimodal distributions, Mg, K, and RI appear bimodal, and the remaining predictors look heavily skewed.

b) Do there appear to be any outliers in the data? Are any predictors skewed?

ggplot(Glass_long, aes(x=Type, y=value)) +
  geom_boxplot() +
  facet_wrap(~predictor, scales="free")

To see the outliers more clearly we can look at boxplots. Split by glass type, every predictor except Magnesium shows a large number of outliers.

ggplot(Glass_long, aes(x=value)) +
  geom_boxplot() +
  facet_wrap(~predictor, scales="free")

The boxplots also confirm our reading of the density plots. The outliers for Si, Na, Al, and Ca mostly lie on both sides of the interquartile range, while Fe, Ba, and K have outliers only to the right of their interquartile ranges, giving their distributions long right tails (i.e., they are right-skewed).
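
To back this up numerically, a small sketch that counts the points falling outside the usual 1.5 x IQR whiskers for each predictor (count_outliers is a helper written here, not a library function):

# Count points outside the boxplot whisker rule (1.5 * IQR) per predictor
count_outliers <- function(x) {
  qs  <- quantile(x, c(0.25, 0.75))
  iqr <- qs[2] - qs[1]
  sum(x < qs[1] - 1.5 * iqr | x > qs[2] + 1.5 * iqr)
}
sapply(Glass[, 1:9], count_outliers)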

We can also verify this with the skewness function from the e1071 package:

num_vars <- Glass[, 1:9]
skew_before <- apply(num_vars, 2, skewness)
skew_before
##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107

Ordering the predictors from least to most skewed: Na, Si, and Al have skewness values between -1 and 1; RI, Mg, Fe, and Ca lie roughly within +/-2; and Ba and K are by far the most skewed. This is in line with the plots we saw before.
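
Sorting the absolute skewness values makes that ordering explicit:

sort(abs(skew_before))  # least to most skewed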

c) Are there any relevant transformations of one or more predictors that might improve the classification model?

We could use the Box-Cox transformation, but since Box-Cox requires strictly positive data we would first have to add a small constant to the predictors that contain zero values (Mg, K, Ba, Fe) so as not to break the transformation. With this transformation we should be able to address the skewness and improve the classification model.

# Add a small offset so Box-Cox (which requires strictly positive data)
# can handle the predictors that contain zeros
Glass[, c("Mg","K","Ba","Fe")] <- Glass[, c("Mg","K","Ba","Fe")] + 1e-6

# Estimate a Box-Cox transformation for x, apply it, and return the new skewness
boxcox_skewness <- function(x) {
  BCT <- BoxCoxTrans(x)
  skewness(predict(BCT, x))
}

# Column 10 is the Type factor, so drop it before applying
skew_after <- apply(Glass[, -10], 2, boxcox_skewness)
skew_after
##          RI          Na          Mg          Al          Si           K 
##  1.56566039  0.03384644 -1.43270870  0.09105899 -0.65090568 -0.78216211 
##          Ca          Ba          Fe 
## -0.19395573  1.67566612  0.74424403

Comparing the skewness values before and after the transformation:

skew_table <- data.frame(
  Predictor = names(skew_before),
  Skew_Before = round(skew_before, 2),
  Skew_After  = round(skew_after, 2)
)
skew_table
##    Predictor Skew_Before Skew_After
## RI        RI        1.60       1.57
## Na        Na        0.45       0.03
## Mg        Mg       -1.14      -1.43
## Al        Al        0.89       0.09
## Si        Si       -0.72      -0.65
## K          K        6.46      -0.78
## Ca        Ca        2.02      -0.19
## Ba        Ba        3.37       1.68
## Fe        Fe        1.73       0.74

The transformation handled the most skewed predictors well, with K going from 6.46 to -0.78 and Ba going from 3.37 to 1.68, so the transformed predictors should be a much better input set for a classification model. Note, however, that RI barely improves and Mg actually gets slightly worse, so Box-Cox is not a cure-all here.
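
As an alternative to adding an offset by hand, caret's preProcess also supports the Yeo-Johnson transformation, a Box-Cox variant that is defined at zero, so the zero-containing predictors can be transformed directly. A sketch of that route:

# Yeo-Johnson handles zeros natively, no offset needed
yj <- preProcess(Glass[, 1:9], method = "YeoJohnson")
Glass_yj <- predict(yj, Glass[, 1:9])
apply(Glass_yj, 2, skewness)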

3.2. The Soybean Data

data(Soybean)
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Models are often crippled by predictors with degenerate distributions, for example predictors with a single unique value, or values so unbalanced that they have near-zero variance. It is usually advantageous to find and remove such predictors. The caret package has the function nearZeroVar, which returns the column numbers of predictors that meet these criteria.

NZV_variables <- nearZeroVar(Soybean)
colnames(Soybean)[NZV_variables]
## [1] "leaf.mild" "mycelium"  "sclerotia"

We see that the three variables that meet this criterion are leaf.mild, mycelium, and sclerotia. Removing them should benefit the model.
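
Dropping them is a one-liner:

Soybean_filtered <- Soybean[, -NZV_variables]  # remove near-zero-variance predictors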

c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The two main options are eliminating predictors or imputing missing values. A predictor that is missing for entire classes provides no information for those classes; in this data the missing values are concentrated in a handful of classes, including 2-4-d-injury, cyst-nematode, and herbicide-injury. For the remaining predictors we could impute with the most frequent category, or with k-nearest neighbors if correlations exist among the predictors. Since missingness itself appears to be related to class, at least for the classes with the most missing values, we could also encode it as a new indicator variable rather than discarding it; a sketch of the counting and imputation steps follows.
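
A minimal sketch of both pieces of the strategy, assuming most-frequent-category imputation (impute_mode is a helper defined here, not a library function):

# Missing values per predictor, most affected first
sort(colSums(is.na(Soybean)), decreasing = TRUE)

# Proportion of incomplete rows within each class
incomplete <- !complete.cases(Soybean)
sort(tapply(incomplete, Soybean$Class, mean), decreasing = TRUE)

# Most-frequent-category imputation for the factor predictors
impute_mode <- function(x) {
  x[is.na(x)] <- names(which.max(table(x)))  # table() drops NAs by default
  x
}
Soybean_imputed <- Soybean
Soybean_imputed[, -1] <- lapply(Soybean_imputed[, -1], impute_mode)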