The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
library(mlbench)
library(tidyverse)
library(GGally)
library(corrplot)
library(e1071)
library(caret)
library(car)
library(VIM)
library(mice)
# Load the Glass data from mlbench and inspect its structure
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
ggplot(gather(Glass[,-10]), aes(value)) +
  geom_histogram(bins = 15, fill = 'darkgreen') +
  facet_wrap(~key, scales = 'free_x')
The distributions all appear skewed to some degree. There seems to be a strong positive correlation between RI (refractive index) and Ca, with a coefficient of 0.81, while RI is negatively correlated with Al and with Si.
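The pairwise correlations discussed above can be computed with base R's cor(). The following is only a minimal sketch on a made-up data frame (the columns a, b, c, d and the 0.75 cutoff are illustrative assumptions, not values from the Glass data); swapping in Glass[, -10] would give the matrix for the real predictors.

```r
# Toy predictors: b rises with a, c falls as a rises, d is only weakly related.
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2.1, 3.9, 6.2, 8.1, 9.8),
  c = c(9, 7, 6, 3, 1),
  d = c(3, 1, 4, 1, 5)
)

# Pearson correlation matrix of all predictors
cor_mat <- cor(toy)

# Keep each pair once (upper triangle) and list those with |r| above the cutoff
cutoff <- 0.75
strong <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_mat)[strong[, "row"]],
  var2 = colnames(cor_mat)[strong[, "col"]],
  r    = cor_mat[strong]
)
print(strong_pairs)
```

Listing only the upper triangle avoids reporting every pair twice and skips the diagonal of ones.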
Do there appear to be any outliers in the data?
ggplot(stack(Glass[,-10]), aes(x = ind, y = values)) +
  geom_boxplot(outlier.colour="darkgreen", outlier.shape=4, outlier.size=2) +
  labs(x = "Predictors", y = "Values") +
  theme(text = element_text(size=10), axis.text.x = element_text(angle = 90, hjust = 1))
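Outliers are present, but the check below flags them only in the refractive index (RI): the first line of output lists the flagged row indices for RI, and the remaining eight predictors return none.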
## [1] 48 51 57 104 105 106 107 108 111 112 113 132 171 185 186 188 190
## integer(0)
## integer(0)
## integer(0)
## integer(0)
## integer(0)
## integer(0)
## integer(0)
## integer(0)
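The code that produced the indices above is not shown, so the following is only a sketch of one common way to flag outlier rows per predictor: the 1.5 × IQR rule, which is close to what boxplots draw. The function name iqr_outlier_rows and the toy data are assumptions for illustration, and the rule may flag different rows than the output above.

```r
# Return the row indices of values lying more than k * IQR outside the quartiles.
iqr_outlier_rows <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  which(x < q[1] - fence | x > q[2] + fence)
}

# Toy data: u contains one extreme value, v contains none.
toy <- data.frame(
  u = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100),
  v = 1:10
)

# One vector of flagged row indices per predictor
flagged <- lapply(toy, iqr_outlier_rows)
print(flagged)
```

Applied to the real data, the same call would be lapply(Glass[, -10], iqr_outlier_rows).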
Are any predictors skewed?
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
These results confirm my assumption that the predictor distributions are skewed. RI, K, Ca, Ba, and Fe are all right-skewed, with K (6.46) and Ba (3.37) the most extreme. Mg is left-skewed. The others (Na, Al, and Si) somewhat resemble a bell shape but are still slightly skewed.
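For reference, sample skewness can be computed by hand as the third central moment divided by the cubed standard deviation. This is a minimal sketch under that one definition; the function name sample_skewness is hypothetical, and e1071::skewness offers several variants, so its values may differ slightly from this formula.

```r
# Sample skewness: mean cubed deviation divided by the cubed standard deviation.
# Positive values indicate a long right tail, negative values a long left tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m3 <- mean((x - mean(x))^3)
  m3 / sd(x)^3
}

symmetric    <- c(1, 2, 3, 4, 5)      # evenly spread around the mean
right_skewed <- c(0, 0, 0, 0, 10)     # one large value pulls the tail right
left_skewed  <- c(10, 10, 10, 10, 0)  # mirror image of the case above

skews <- sapply(
  list(symmetric = symmetric, right = right_skewed, left = left_skewed),
  sample_skewness
)
print(skews)
```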
Are there any relevant transformations of one or more predictors that might improve the classification model?
glass_transformed <- preProcess(Glass[,-10], method = c("BoxCox", "center", "scale"))
new_data <- predict(glass_transformed, Glass[,-10])
ggplot(stack(new_data), aes(x = ind, y = values)) +
  geom_boxplot(outlier.colour="darkgreen", outlier.shape=4, outlier.size=2) +
  labs(x = "Predictors", y = "Values") +
  theme(text = element_text(size=10), axis.text.x = element_text(angle = 90, hjust = 1))
After the Box-Cox transformation, centering, and scaling, the boxplots show each predictor centered around 0.
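To make the three preprocessing steps concrete, here is a minimal base R sketch of a Box-Cox transform followed by centering and scaling. The function name box_cox and the hand-picked lambda are assumptions for illustration; caret's preProcess estimates lambda from the data, and Box-Cox is only defined for strictly positive values.

```r
# Box-Cox transform: (x^lambda - 1) / lambda, with log(x) as the lambda = 0 case.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))  # the transform is undefined for zero or negative values
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Toy right-skewed values; lambda = 0 (a log transform) is chosen by hand here.
x <- c(1, 2, 4, 8, 16, 32)
transformed <- box_cox(x, lambda = 0)

# Center to mean 0 and scale to standard deviation 1
standardized <- as.numeric(scale(transformed))
print(round(standardized, 3))
```

With lambda = 1 the transform reduces to x - 1, so the shape of the distribution is left unchanged.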
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
Investigate the frequency distributions for the categorical predictors.
par(mfrow = c(3, 6))
for (i in 1:ncol(Soybean)) {
  barplot(table(Soybean[ ,i]), col = 'orange', ylab = names(Soybean[i]))
}
Are any of the distributions degenerate in the ways discussed earlier in this chapter?
" some models can be crippled by predictors with degenerate distributions. In these cases, there can be a significant improvement in model performance and/or stability without the problematic variables. Consider a predictor variable that has a single unique value; we refer to this type of data as a zero variance predictor" – Applied Predictive Modeling.
Based on this definition, caret's nearZeroVar function (called with saveMetrics = TRUE) helps answer this question.
## freqRatio percentUnique zeroVar nzv
## Class 1.010989 2.7818448 FALSE FALSE
## date 1.137405 1.0248902 FALSE FALSE
## plant.stand 1.208191 0.2928258 FALSE FALSE
## precip 4.098214 0.4392387 FALSE FALSE
## temp 1.879397 0.4392387 FALSE FALSE
## hail 3.425197 0.2928258 FALSE FALSE
## crop.hist 1.004587 0.5856515 FALSE FALSE
## area.dam 1.213904 0.5856515 FALSE FALSE
## sever 1.651282 0.4392387 FALSE FALSE
## seed.tmt 1.373874 0.4392387 FALSE FALSE
## germ 1.103627 0.4392387 FALSE FALSE
## plant.growth 1.951327 0.2928258 FALSE FALSE
## leaves 7.870130 0.2928258 FALSE FALSE
## leaf.halo 1.547511 0.4392387 FALSE FALSE
## leaf.marg 1.615385 0.4392387 FALSE FALSE
## leaf.size 1.479638 0.4392387 FALSE FALSE
## leaf.shread 5.072917 0.2928258 FALSE FALSE
## leaf.malf 12.311111 0.2928258 FALSE FALSE
## leaf.mild 26.750000 0.4392387 FALSE TRUE
## stem 1.253378 0.2928258 FALSE FALSE
## lodging 12.380952 0.2928258 FALSE FALSE
## stem.cankers 1.984293 0.5856515 FALSE FALSE
## canker.lesion 1.807910 0.5856515 FALSE FALSE
## fruiting.bodies 4.548077 0.2928258 FALSE FALSE
## ext.decay 3.681481 0.4392387 FALSE FALSE
## mycelium 106.500000 0.2928258 FALSE TRUE
## int.discolor 13.204545 0.4392387 FALSE FALSE
## sclerotia 31.250000 0.2928258 FALSE TRUE
## fruit.pods 3.130769 0.5856515 FALSE FALSE
## fruit.spots 3.450000 0.5856515 FALSE FALSE
## seed 4.139130 0.2928258 FALSE FALSE
## mold.growth 7.820896 0.2928258 FALSE FALSE
## seed.discolor 8.015625 0.2928258 FALSE FALSE
## seed.size 9.016949 0.2928258 FALSE FALSE
## shriveling 14.184211 0.2928258 FALSE FALSE
## roots 6.406977 0.4392387 FALSE FALSE
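Yes: leaf.mild, mycelium, and sclerotia are flagged as near-zero-variance predictors (nzv = TRUE). None of the predictors has strictly zero variance (zeroVar is FALSE throughout).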
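The two metrics in the table above can be sketched in base R to show what the flag is based on. This is an illustrative approximation rather than caret's code: the function name nzv_metrics is hypothetical, and the cutoffs follow what I understand to be nearZeroVar's defaults (freqCut = 95/5, uniqueCut = 10).

```r
# freqRatio: count of the most common value divided by the second most common.
# percentUnique: number of distinct non-missing values as a percentage of rows.
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x, useNA = "no"), decreasing = TRUE)
  counts <- counts[counts > 0]  # ignore unused factor levels
  freq_ratio <- if (length(counts) >= 2) counts[[1]] / counts[[2]] else 0
  percent_unique <- 100 * length(counts) / length(x)
  data.frame(
    freqRatio     = freq_ratio,
    percentUnique = percent_unique,
    zeroVar       = length(counts) <= 1,
    nzv           = length(counts) <= 1 ||
                    (freq_ratio > freq_cut && percent_unique < unique_cut)
  )
}

# Toy predictors: one dominated by a single value, one evenly balanced.
dominated <- c(rep("a", 95), rep("b", 4), NA)
balanced  <- rep(c("a", "b"), 50)

metrics <- rbind(dominated = nzv_metrics(dominated),
                 balanced  = nzv_metrics(balanced))
print(metrics)
```

A predictor is flagged only when both conditions hold, which is why lodging and leaf.malf (freqRatio around 12) are not flagged while leaf.mild (26.75) is.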
Roughly 18% of the data are missing.
Are there particular predictors that are more likely to be missing?
aggr(Soybean, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE,
     labels=names(Soybean), cex.axis=.7, oma = c(7,4,2,2), gap=3,
     ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## hail 0.177159590
## sever 0.177159590
## seed.tmt 0.177159590
## lodging 0.177159590
## germ 0.163982430
## leaf.mild 0.158125915
## fruiting.bodies 0.155197657
## fruit.spots 0.155197657
## seed.discolor 0.155197657
## shriveling 0.155197657
## leaf.shread 0.146412884
## seed 0.134699854
## mold.growth 0.134699854
## seed.size 0.134699854
## leaf.halo 0.122986823
## leaf.marg 0.122986823
## leaf.size 0.122986823
## leaf.malf 0.122986823
## fruit.pods 0.122986823
## precip 0.055636896
## stem.cankers 0.055636896
## canker.lesion 0.055636896
## ext.decay 0.055636896
## mycelium 0.055636896
## int.discolor 0.055636896
## sclerotia 0.055636896
## plant.stand 0.052708638
## roots 0.045387994
## temp 0.043923865
## crop.hist 0.023426061
## plant.growth 0.023426061
## stem 0.023426061
## date 0.001464129
## area.dam 0.001464129
## Class 0.000000000
## leaves 0.000000000
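Yes. hail, sever, seed.tmt, and lodging have the highest share of missing values (about 17.7% each), followed by germ (16.4%) and leaf.mild (15.8%). Class and leaves have no missing values.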
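The per-variable missing proportions listed above can also be obtained without VIM. The following is a small base R sketch on a made-up data frame (the toy values are assumptions; only the column names echo the Soybean data); replacing toy with Soybean would give the real proportions.

```r
# Toy data with missing values concentrated in two columns
toy <- data.frame(
  Class  = factor(c("a", "a", "b", "b")),
  hail   = c(NA, NA, 0, 1),
  leaves = c(1, 0, 1, 1),
  temp   = c(0, NA, 1, 2)
)

# Proportion of missing values per column, largest first
missing_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)

# Proportion of missing cells in the whole table
overall_share <- mean(is.na(toy))

print(missing_share)
print(overall_share)
```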
Is the pattern of missing data related to the classes?
miss_df <- Soybean %>% group_by(Class) %>%
  summarise_all(~sum(is.na(.))) %>%
  transmute(Class, na_count = rowSums(.[-1]))
miss_df
## # A tibble: 19 x 2
## Class na_count
## <fct> <dbl>
## 1 2-4-d-injury 450
## 2 alternarialeaf-spot 0
## 3 anthracnose 0
## 4 bacterial-blight 0
## 5 bacterial-pustule 0
## 6 brown-spot 0
## 7 brown-stem-rot 0
## 8 charcoal-rot 0
## 9 cyst-nematode 336
## 10 diaporthe-pod-&-stem-blight 177
## 11 diaporthe-stem-canker 0
## 12 downy-mildew 0
## 13 frog-eye-leaf-spot 0
## 14 herbicide-injury 160
## 15 phyllosticta-leaf-spot 0
## 16 phytophthora-rot 1214
## 17 powdery-mildew 0
## 18 purple-seed-stain 0
## 19 rhizoctonia-root-rot 0
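I would say yes. All of the missing values occur in just five classes (2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury, and phytophthora-rot), while the other 14 classes have none. This also seems plausible because the predictors describe characteristics of the classes.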
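The same per-class count can be sketched in base R, which may help when dplyr is not available. The toy data frame below is an assumption for illustration; with the real data, toy would be replaced by Soybean.

```r
# Toy data: all missing values fall in class "a"
toy <- data.frame(
  Class  = factor(c("a", "a", "b", "b")),
  hail   = c(NA, NA, 0, 1),
  leaves = c(1, 0, 1, 1),
  temp   = c(0, NA, 1, 2)
)

# Count missing predictor values in each row, excluding the Class column
na_per_row <- rowSums(is.na(toy[, names(toy) != "Class"]))

# Sum those counts within each class
na_by_class <- tapply(na_per_row, toy$Class, sum)
print(na_by_class)
```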
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
Imputation
## Warning: Number of logged events: 3389
## Class date plant.stand precip temp hail
## brown-spot : 92 0: 26 0:388 0:102 0: 82 0:465
## alternarialeaf-spot: 91 1: 75 1:295 1:112 1:378 1:218
## frog-eye-leaf-spot : 91 2: 93 2:469 2:223
## phytophthora-rot : 88 3:119
## anthracnose : 44 4:131
## brown-stem-rot : 44 5:149
## (Other) :233 6: 90
## crop.hist area.dam sever seed.tmt germ plant.growth leaves leaf.halo
## 0: 65 0:123 0:263 0:426 0:206 0:443 0: 77 0:250
## 1:165 1:227 1:359 1:222 1:261 1:240 1:606 1: 36
## 2:220 2:145 2: 61 2: 35 2:216 2:397
## 3:233 3:188
##
##
##
## leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem lodging
## 0:409 0:106 0:564 0:624 0:643 0:296 0:587
## 1: 38 1:327 1:119 1: 59 1: 20 1:387 1: 96
## 2:236 2:250 2: 20
##
##
##
##
## stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor
## 0:391 0:358 0:508 0:498 0:670 0:619
## 1: 43 1: 83 1:175 1:169 1: 13 1: 44
## 2: 43 2:177 2: 16 2: 20
## 3:206 3: 65
##
##
##
## sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor seed.size
## 0:625 0:417 0:423 0:490 0:616 0:545 0:597
## 1: 58 1:173 1: 75 1:193 1: 67 1:138 1: 86
## 2: 33 2: 59
## 3: 60 4:126
##
##
##
## shriveling roots
## 0:609 0:552
## 1: 74 1: 86
## 2: 45
##
##
##
##
All missing values were imputed using predictive mean matching, so the summary above no longer contains any NA's and we have a complete dataset.
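The imputation code itself is not shown above. As a simpler point of comparison, and not the predictive mean matching approach used here, the sketch below fills each missing categorical value with the most frequent level of its column. The function name impute_mode and the toy data are assumptions for illustration.

```r
# Replace missing values in a factor with its most frequent level.
# Ties are broken in favor of the first level in table order.
impute_mode <- function(x) {
  counts <- table(x)
  mode_value <- names(counts)[which.max(counts)]
  x[is.na(x)] <- mode_value
  x
}

# Toy categorical predictors with missing entries
toy <- data.frame(
  hail = factor(c("0", "0", "1", NA, "0")),
  temp = factor(c("1", NA, "2", "1", NA))
)

completed <- as.data.frame(lapply(toy, impute_mode))
print(summary(completed))
```

Mode imputation ignores relationships between predictors and inflates the most common level, which is one reason a model-based method may be preferred for data like these.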