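# Packages used throughout this analysis
library(mlbench)   # Glass and Soybean data sets
library(tidyr)     # pivot_longer
library(ggplot2)   # density plots and boxplots
library(GGally)    # ggpairs
library(e1071)     # skewness
library(caret)     # BoxCoxTrans, nearZeroVar
data(Glass)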
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Glass_long <- pivot_longer(Glass, -Type, names_to="predictor", values_to="value")
ggpairs(Glass)
The correlation values from ggpairs show that most pairwise relationships between the predictors are weak, with most correlations at or below 0.44 in magnitude. The one exception is the relationship between the refractive index (RI) of the glass and its calcium (Ca) content, which is the only strongly correlated pair.
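The same correlations can be read off numerically rather than from the plot. The following is a minimal sketch, assuming the Glass data comes from the mlbench package; `cor_pairs` is a name introduced here for illustration only.

```r
library(mlbench)
data(Glass)

# Pearson correlations between the nine numeric predictors (Type excluded)
cors <- cor(Glass[, 1:9])

# Keep each pair once (upper triangle), then rank by absolute correlation
idx <- which(upper.tri(cors), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r    = cors[idx],
  stringsAsFactors = FALSE
)
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
head(cor_pairs, 5)
```

Sorting by absolute value keeps strong negative correlations from being overlooked.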
ggplot(Glass_long, aes(x = value)) +
  geom_density() +
  facet_wrap(~predictor, scales = "free")
The density plots show how the values of each predictor are distributed. Al, Na and Si have relatively normal-looking distributions, Mg, K and RI appear bimodal, and the remaining predictors look strongly skewed.
ggplot(Glass_long, aes(x = Type, y = value)) +
  geom_boxplot() +
  facet_wrap(~predictor, scales = "free")
To see the outliers more clearly we can look at boxplots. Here we can
see that, with the exception of magnesium, every predictor has a lot of
outliers.
ggplot(Glass_long, aes(x = value)) +
  geom_boxplot() +
  facet_wrap(~predictor, scales = "free")
The boxplots also seem to confirm our assessment of the density plots.
While the outliers for Si, Na, Al and Ca mostly lie on both sides of
the interquartile range, Fe, Ba and K have outliers only to the right
of their interquartile range. The bulk of their values sits on the left
with a long tail to the right, which makes them right (positively)
skewed.
We can also verify this with the skewness function from the e1071 package.
num_vars <- Glass[, 1:9]
skew_before <- apply(num_vars, 2, skewness)
skew_before
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
Ordered from least to most skewed, Na, Si and Al have skewness values between -1 and 1; Mg, RI, Fe and Ca have magnitudes between roughly 1 and 2; and Ba (3.37) and K (6.46) are the most skewed. This is in line with the plots above.
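To make the numbers above easier to interpret, the skewness statistic can be written out by hand. This is a minimal sketch, assuming the default `type = 3` definition used by e1071's skewness; `manual_skewness` is a hypothetical helper, not part of any package.

```r
# Sample skewness from central moments:
#   g1 = m3 / m2^(3/2), scaled by ((n - 1) / n)^(3/2)
manual_skewness <- function(x) {
  x   <- x[!is.na(x)]
  n   <- length(x)
  dev <- x - mean(x)
  m2  <- sum(dev^2) / n   # second central moment
  m3  <- sum(dev^3) / n   # third central moment
  (m3 / m2^(3 / 2)) * ((n - 1) / n)^(3 / 2)
}

# A long right tail gives a positive value, a long left tail a negative one
right_tailed <- c(0, 0, 0, 0, 1, 10)
manual_skewness(right_tailed)
manual_skewness(-right_tailed)
```

Because the deviations are cubed, a few large values on one side dominate the sign. This is why predictors such as K and Ba, which are mostly near zero with a few large values, have large positive skewness.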
We could use the Box-Cox transformation, but it requires strictly positive values, so we first have to add a small constant to the predictors that contain zeros (Mg, K, Ba and Fe) so as not to break the transformation. Reducing the skewness this way should help the classification model.
Glass[, c("Mg","K","Ba","Fe")] <- Glass[, c("Mg","K","Ba","Fe")] + 1e-6
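# Estimate a Box-Cox transformation for one predictor, apply it,
# and return the skewness of the transformed values
boxcox_skewness <- function(x) {
  BCT <- BoxCoxTrans(x)
  skewness(predict(BCT, x))
}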
skew_after <- apply(Glass[, -10], 2, boxcox_skewness)
skew_after
## RI Na Mg Al Si K
## 1.56566039 0.03384644 -1.43270870 0.09105899 -0.65090568 -0.78216211
## Ca Ba Fe
## -0.19395573 1.67566612 0.74424403
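To illustrate why the zero values needed an offset, the Box-Cox family can be written out by hand. This is a minimal sketch and not caret's implementation: `box_cox` is a hypothetical helper, and `lambda` is supplied directly here, whereas BoxCoxTrans estimates it from the data.

```r
# Box-Cox transform for a given lambda:
#   (x^lambda - 1) / lambda   when lambda != 0
#   log(x)                    when lambda == 0
box_cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

box_cox(0, 0)      # log(0) is -Inf, so exact zeros break the transform
box_cox(1e-6, 0)   # finite once a small constant has been added
box_cox(4, 0.5)    # (sqrt(4) - 1) / 0.5 = 2
```

The offset of 1e-6 is itself a judgment call. A very small constant maps the former zeros to extreme values on the log scale, which may partly explain why some predictors remain skewed after transformation.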
Comparing the skewness before and after the transformation:
skew_table <- data.frame(
  Predictor   = names(skew_before),
  Skew_Before = round(skew_before, 2),
  Skew_After  = round(skew_after, 2)
)
skew_table
## Predictor Skew_Before Skew_After
## RI RI 1.60 1.57
## Na Na 0.45 0.03
## Mg Mg -1.14 -1.43
## Al Al 0.89 0.09
## Si Si -0.72 -0.65
## K K 6.46 -0.78
## Ca Ca 2.02 -0.19
## Ba Ba 3.37 1.68
## Fe Fe 1.73 0.74
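This transformation seems to have worked well on the most skewed predictors, with K going from 6.46 to -0.78, Ca from 2.02 to -0.19 and Ba from 3.37 to 1.68. It was less effective elsewhere: RI and Si barely changed, and Mg became slightly more skewed (-1.14 to -1.43). Overall, the transformed predictors should be a better set for a classification model.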
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
Models are often crippled by predictors with degenerate distributions, for example predictors with a single unique value, or with values so similar that they have near-zero variance. It is usually advantageous to find and remove these predictors. The caret package provides the function nearZeroVar, which returns the column indices of predictors that meet this criterion.
NZV_variables <- nearZeroVar(Soybean)
colnames(Soybean)[NZV_variables]
## [1] "leaf.mild" "mycelium" "sclerotia"
The three variables that meet this criterion are leaf.mild, mycelium and sclerotia. Removing them should benefit the model.
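The two quantities behind a near-zero-variance check can be sketched in base R. This is an illustration under stated assumptions, not caret's code: `nzv_metrics` is a hypothetical helper, and the cutoffs of 95/5 for the frequency ratio and 10 for the percentage of unique values are assumed to mirror nearZeroVar's documented defaults.

```r
# Flag a predictor when its most common value dominates the second most
# common one AND it has few distinct values relative to the sample size
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  # as.character() drops unused factor levels; table() drops NAs
  counts <- sort(table(as.character(x)), decreasing = TRUE)
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pct_unique <- 100 * length(counts) / length(x)
  list(
    freq_ratio    = freq_ratio,
    pct_unique    = pct_unique,
    near_zero_var = freq_ratio > freq_cut && pct_unique < unique_cut
  )
}

# Toy examples: one dominant level versus a balanced predictor
dominated <- c(rep("0", 98), rep("1", 2))
balanced  <- rep(c("0", "1"), 50)
nzv_metrics(dominated)$near_zero_var
nzv_metrics(balanced)$near_zero_var
```

Requiring both conditions avoids flagging a predictor that is merely unbalanced but still has many distinct values.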
The two main options for handling the missing data are removing predictors or imputing the missing values. If a predictor has missing values across all classes, it may not provide any particularly useful information. This could be the case for 2-4-d-injury, cyst-nematode and herbicide-injury. For the rest we could try imputation, using either the most frequent category or k-nearest neighbors if the predictors are correlated. Here missingness seems to be related to class, at least for the top 5 classes, so we could also add the average missingness we calculated as a new factor.
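The link between missingness and class can be checked directly. The following is a minimal sketch, assuming the Soybean data comes from the mlbench package; `row_has_na` and `miss_by_class` are names introduced here for illustration and are not necessarily how the average missingness mentioned above was computed.

```r
library(mlbench)
data(Soybean)

# TRUE for every row that has at least one missing predictor value
row_has_na <- !complete.cases(Soybean)

# Share of incomplete rows within each class
miss_by_class <- tapply(row_has_na, Soybean$Class, mean)

# Classes with any missing data, most affected first
sort(miss_by_class[miss_by_class > 0], decreasing = TRUE)
```

If only a handful of classes account for the incomplete rows, dropping those rows would remove those classes almost entirely. That is an argument for imputing or encoding missingness rather than deleting.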