Textbook: Max Kuhn and Kjell Johnson. Applied Predictive Modeling. Springer, New York, 2013.
# Required R packages
library(mlbench)
library(tidyverse)
library(GGally)
library(caret)
library(VIM)
library(rcompanion)
library(corrplot)
library(e1071)
library(naniar)
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Let us look at the descriptive statistics first.
summary(Glass)
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe Type
## Min. :0.00000 1:70
## 1st Qu.:0.00000 2:76
## Median :0.00000 3:17
## Mean :0.05701 5:13
## 3rd Qu.:0.10000 6: 9
## Max. :0.51000 7:29
Distributions of predictor variables
par(mfrow = c(3,3))
for (i in 1:9){
plotNormalDensity(
Glass[,i], main = sprintf("Variable %s", names(Glass)[i]))
}
The plots show the actual distributions and a superimposed normal curve for each predictor, making it easy to compare the two. None of the predictors follow a normal distribution. Na, Al, and Si are close to normal, but they exhibit fat tails. Mg and K are bimodal. Ca, Ba, Fe, and K are positively skewed.
Relationships between predictors
# Correlation
Glass2 = select(Glass, -Type)
Glass %>% select(-Type) %>%
cor() %>%
corrplot(., method='color', type="upper", order="hclust",
addCoef.col = "black", tl.col="black", diag=FALSE)
The relationship between RI and Ca shows a highly positive correlation. There are a few variables with moderate correlation values.
(b) Do there appear to be any outliers in the data? Are any predictors skewed?
#Boxplots for outlier
Glass2 %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_boxplot(outlier.size=1, outlier.colour='blue', alpha=0.3)+
theme_minimal()
There are a number of outliers for each predictor variable, except for Mg. Ba and Ca appears to have the most outliers. Let us calculate skewness to identify extreme outliers.
#Skewness
Glass2[-10] %>%
apply(2, skewness)
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
The rule of thumb seems to be: If the skewness is between -0.5 and 0.5, the data are fairly symmetrical. If the skewness is between -1 and – 0.5 or between 0.5 and 1, the data are moderately skewed. If the skewness is less than -1 or greater than 1, the data are highly skewed. 6 out of 9 variables exhibit a high skew. K, Ba and Ca lead the way with extreme skews.
(c) Are there any relevant transformations of one or more predictors that might improve the classification model?
Box-Cox and PCA transformations may be used for variables that are skewed in order to improve the classification model.
glass_transformed = preProcess(Glass2, method = c('BoxCox', 'center', 'scale', 'pca'))
glass_transformed
## Created from 214 samples and 9 variables
##
## Pre-processing:
## - Box-Cox transformation (5)
## - centered (9)
## - ignored (0)
## - principal component signal extraction (9)
## - scaled (9)
##
## Lambda estimates for Box-Cox transformation:
## -2, -0.1, 0.5, 2, -1.1
## PCA needed 7 components to capture 95 percent of the variance
#Collating the transformed data
prediction = predict(glass_transformed, Glass)
par(mfrow = c(3,3))
for (i in 2:8){
plotNormalDensity(
prediction[,i], main = sprintf("Variable %s", names(prediction)[i]))
}
# Correlation
par(xpd=TRUE)
prediction %>% select(-Type) %>%
cor() %>%
corrplot(., method='color', type="upper", order="hclust",mar = c(2, 2, 1, 1),
addCoef.col = "black", tl.col="black", tl.cex=0.8, tl.pos = "td", diag=FALSE)
Refractive index, Na, Al, Si, and Ca were box-cox transformed, and all variables were centered and scaled before applying PCA. After applying the transformations, the density is closer to normal and not heavily skewed, and there is no correlation among the variables.
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
data(Soybean)
(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
#Summary
summary(Soybean[,2:36])
## date plant.stand precip temp hail crop.hist
## 5 :149 0 :354 0 : 74 0 : 80 0 :435 0 : 65
## 4 :131 1 :293 1 :112 1 :374 1 :127 1 :165
## 3 :118 NA's: 36 2 :459 2 :199 NA's:121 2 :219
## 2 : 93 NA's: 38 NA's: 30 3 :218
## 6 : 90 NA's: 16
## (Other):101
## NA's : 1
## area.dam sever seed.tmt germ plant.growth leaves leaf.halo
## 0 :123 0 :195 0 :305 0 :165 0 :441 0: 77 0 :221
## 1 :227 1 :322 1 :222 1 :213 1 :226 1:606 1 : 36
## 2 :145 2 : 45 2 : 35 2 :193 NA's: 16 2 :342
## 3 :187 NA's:121 NA's:121 NA's:112 NA's: 84
## NA's: 1
##
##
## leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem lodging
## 0 :357 0 : 51 0 :487 0 :554 0 :535 0 :296 0 :520
## 1 : 21 1 :327 1 : 96 1 : 45 1 : 20 1 :371 1 : 42
## 2 :221 2 :221 NA's:100 NA's: 84 2 : 20 NA's: 16 NA's:121
## NA's: 84 NA's: 84 NA's:108
##
##
##
## stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor
## 0 :379 0 :320 0 :473 0 :497 0 :639 0 :581
## 1 : 39 1 : 83 1 :104 1 :135 1 : 6 1 : 44
## 2 : 36 2 :177 NA's:106 2 : 13 NA's: 38 2 : 20
## 3 :191 3 : 65 NA's: 38 NA's: 38
## NA's: 38 NA's: 38
##
##
## sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor
## 0 :625 0 :407 0 :345 0 :476 0 :524 0 :513
## 1 : 20 1 :130 1 : 75 1 :115 1 : 67 1 : 64
## NA's: 38 2 : 14 2 : 57 NA's: 92 NA's: 92 NA's:106
## 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## seed.size shriveling roots
## 0 :532 0 :539 0 :551
## 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
A degenerate distribution (sometimes called a constant distribution) is a distribution of a degenerate random variable — a constant with probability of 1. In other words, a random variable X has a single possible value. Let us plot the categorical value to identify if there is any degenarete.
Soybean %>%
gather() %>%
ggplot(aes(value))+facet_wrap(~key, scales = "free")+
geom_histogram(stat="count")+
theme_minimal()
From above, it looks like “leaf.malf”, “lodging”, “mycelium”, “sclerotia”, “mold.growth”, “seed.discolor”, “seed.size”, “shriveling”, “leaf.mild”, “int.discolor” might be degenerate.
(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
#No of missing values
# Calculate the missing values
colSums(is.na(Soybean))
## Class date plant.stand precip temp
## 0 1 36 38 30
## hail crop.hist area.dam sever seed.tmt
## 121 16 1 121 121
## germ plant.growth leaves leaf.halo leaf.marg
## 112 16 0 84 84
## leaf.size leaf.shread leaf.malf leaf.mild stem
## 84 100 84 108 16
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 121 38 38 106 38
## mycelium int.discolor sclerotia fruit.pods fruit.spots
## 38 38 38 84 106
## seed mold.growth seed.discolor seed.size shriveling
## 92 92 106 92 106
## roots
## 31
Visualize using mice package
missing <- aggr(Soybean, col=c('navyblue','yellow'), numbers=TRUE,
sortVars=TRUE, labels=names(Soybean), cex.axis=.7, gap=3,
ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## hail 0.177159590
## sever 0.177159590
## seed.tmt 0.177159590
## lodging 0.177159590
## germ 0.163982430
## leaf.mild 0.158125915
## fruiting.bodies 0.155197657
## fruit.spots 0.155197657
## seed.discolor 0.155197657
## shriveling 0.155197657
## leaf.shread 0.146412884
## seed 0.134699854
## mold.growth 0.134699854
## seed.size 0.134699854
## leaf.halo 0.122986823
## leaf.marg 0.122986823
## leaf.size 0.122986823
## leaf.malf 0.122986823
## fruit.pods 0.122986823
## precip 0.055636896
## stem.cankers 0.055636896
## canker.lesion 0.055636896
## ext.decay 0.055636896
## mycelium 0.055636896
## int.discolor 0.055636896
## sclerotia 0.055636896
## plant.stand 0.052708638
## roots 0.045387994
## temp 0.043923865
## crop.hist 0.023426061
## plant.growth 0.023426061
## stem 0.023426061
## date 0.001464129
## area.dam 0.001464129
## Class 0.000000000
## leaves 0.000000000
82% of the dataset is complete. Hail, sever, seed.tmt and lodging seem to be missing together and account for 17.7% of missing values. Overall, there seems to be a pattern with several variables exhibitng missing values toegther.
Pattern related to classes?
gg_miss_upset(Soybean)
The plot shows that germ, hail, server, seed.tmt and lodging have missing values together. There clearly is a pattern here.
(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
First, we will remove the near-zero variance predictors using the nearZerovar from caret. The outour is the index of the columns to remove
nearZeroVar(Soybean,freqCut = 95/5, uniqueCut = 10)
## [1] 19 26 28
The columns to remove are leaf.mild, mycelium and sclerota
paste(colnames(Soybean)[19])
## [1] "leaf.mild"
paste(colnames(Soybean)[26])
## [1] "mycelium"
paste(colnames(Soybean)[28])
## [1] "sclerotia"
The rest of the data are MNAR (Missing Not at Random). The 18% rows containing missing rows may be omitted or imputed after investigating the nature of the analysis. We can substitute the missing values with mean or median and then test the accuracy of the model to select the best strategy which lead to a better model. Since the data are MNAR, I will select the kNN method to impute the missing values
Imputation with kNN
Soybean_imputed <- Soybean %>%
select(-c(mycelium, sclerotia, leaf.mild))%>%
kNN()
aggr(Soybean_imputed, main='Missing data')
The dataset is now complete.