Exercises 3.1 and 3.2 from the Kuhn and Johnson book “Applied Predictive Modeling”.
#clear the workspace
rm(list = ls())
#load required packages
library(mlbench)
library(ggplot2)
library(GGally)
library(dplyr)
library(corrplot)
library(tidyr)
library(psych)
library(knitr)
library(DMwR)

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
data(Glass)
predictors <- Glass[,1:9]
predictors %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()+
ggtitle("Glass Predictor Variables - Histograms")predictors %>%
gather() %>%
ggplot(aes(value)) +
geom_density() +
facet_wrap(~key, scales = 'free')+
ggtitle("Glass Predictor Variables - Histograms")pairs(predictors, main="Glass Predictor Variables - Pairs Plot")r <-cor(predictors)
corrplot.mixed(r,
lower.col = "black",
number.cex = .7,
title="Glass Predictor Variables - Correlation Plot",
mar=c(0,0,1,0))

From the above plots, we can see that some of the variables are reasonably well centered (Al, Na), some are skewed (Mg), and a few seem to have a high proportion of zero or near-zero values (Fe, Ba).
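As a quick check on that last point (a small sketch, not part of the original analysis), the proportion of exact zeros in each column can be computed directly:

#proportion of exact-zero measurements per predictor
round(colMeans(predictors == 0), 3)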
Most of the predictors are negatively correlated, which makes sense: they measure chemical concentrations on a percentage basis, so as one element increases we would expect a decrease in the others.
Most of the correlations are not very strong. The exception is calcium (Ca) and the refractive index (RI), which are strongly positively correlated.
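To put a number on this, here is a hedged sketch using caret's findCorrelation (assuming the caret package is available; it is not loaded above) to flag highly correlated predictor pairs:

library(caret) #assumed available; not loaded above
#flag predictors with absolute pairwise correlation above 0.75
findCorrelation(r, cutoff = 0.75, names = TRUE)
#the Ca / RI correlation itself
r["Ca", "RI"]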
predictors %>%
gather() %>%
ggplot(aes(x=key,y=value,color=key)) +
geom_boxplot()+
ggtitle("Glass Predictor Variables - BoxPlot")pred.norm <- predictors / apply(predictors, 2, sd)
pred.norm %>%
gather() %>%
ggplot(aes(x=key,y=value,color=key)) +
geom_boxplot()+
scale_y_continuous()+
ggtitle("Normalized Glass Predictor Variables - BoxPlot")p <- describe(predictors)
ggplot(p,aes(x = row.names(p),y=skew))+
geom_bar(stat='identity') +
ggtitle("Glass Predictors - Skew")In terms of the outliers, I first performed a box-plot to try to get a visual sense. I can see right away that the variables need to be re-scales. A simple/common recaling method is to divide by the min value however in this case, I have several vars with zero-mins and as such, we’ll scale by the standard deviation.
Magnesium is bimodal and left skewed. Iron, potassium and barium are right skewed. The other predictors are roughly normal.
Something like a Box-Cox transformation might improve the classification model's performance.
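As a sketch of what that could look like (assuming the caret package, which is not loaded above), caret's preProcess can estimate Box-Cox transformations. Box-Cox requires strictly positive values, so the zero-containing predictors are left untransformed by that step; "YeoJohnson" would be an alternative method that tolerates zeros.

library(caret) #assumed available; not loaded above
#estimate Box-Cox (plus centering/scaling); columns containing zeros are
#skipped by the Box-Cox step
bc <- preProcess(predictors, method = c("BoxCox", "center", "scale"))
pred.trans <- predict(bc, predictors)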
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
data(Soybean)
#number of unique values per col
incl.nas <- sapply(sapply(Soybean,unique),length)
no.nas <- sapply(sapply(Soybean[complete.cases(Soybean),],unique),length)
r <- t(rbind(incl.nas,no.nas))
row.names(r) <- colnames(Soybean)
kable(r)

| variable | incl.nas | no.nas |
|---|---|---|
| Class | 19 | 15 |
| date | 8 | 7 |
| plant.stand | 3 | 2 |
| precip | 4 | 3 |
| temp | 4 | 3 |
| hail | 3 | 2 |
| crop.hist | 5 | 4 |
| area.dam | 5 | 4 |
| sever | 4 | 3 |
| seed.tmt | 4 | 3 |
| germ | 4 | 3 |
| plant.growth | 3 | 2 |
| leaves | 2 | 2 |
| leaf.halo | 4 | 3 |
| leaf.marg | 4 | 3 |
| leaf.size | 4 | 3 |
| leaf.shread | 3 | 2 |
| leaf.malf | 3 | 2 |
| leaf.mild | 4 | 3 |
| stem | 3 | 2 |
| lodging | 3 | 2 |
| stem.cankers | 5 | 4 |
| canker.lesion | 5 | 4 |
| fruiting.bodies | 3 | 2 |
| ext.decay | 4 | 2 |
| mycelium | 3 | 2 |
| int.discolor | 4 | 3 |
| sclerotia | 3 | 2 |
| fruit.pods | 5 | 3 |
| fruit.spots | 5 | 4 |
| seed | 3 | 2 |
| mold.growth | 3 | 2 |
| seed.discolor | 3 | 2 |
| seed.size | 3 | 2 |
| shriveling | 3 | 2 |
| roots | 4 | 3 |
The table above shows the unique-value count for each variable. Based on this table alone, it does not appear that any of the variables have degenerate distributions.
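Unique-value counts say nothing about how unbalanced the frequencies are, so as an additional hedged check (again assuming the caret package, which is not loaded above), caret's nearZeroVar can flag near-zero-variance predictors:

library(caret) #assumed available; not loaded above
#flag predictors dominated by a single value (near-zero variance)
nearZeroVar(Soybean[, -1], names = TRUE)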
For this kind of problem, a "one size fits all" solution is rarely optimal, so I'd like to handle different variables differently.
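As a starting point (a small sketch, not part of the original write-up), a per-variable count of missing values helps decide which variables deserve special handling:

#number of NAs per column, sorted from most to least missing
sort(colSums(is.na(Soybean)), decreasing = TRUE)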
There are several variables where I feel imputation makes no sense. For these variables, we'll assume an NA means the condition did not occur and impute zeros.
Soybean$hail[is.na(Soybean$hail)] <- 0
Soybean$sever[is.na(Soybean$sever)] <- 0

For the remaining data we'll use KNN (k = 10) to impute. Note that I'm using the mode rather than an average, as all of these variables appear to be discrete.
df <- data.frame(Soybean)
Soybean.impute <- knnImputation(df, k = 10, scale = T, meth = "mode",
distData = NULL)
nrow(Soybean.impute[!complete.cases(Soybean.impute),])

## [1] 0
I can see that the number of incomplete cases is now 0.