3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: \(Na, Mg, Al, Si, K, Ca, Ba,\) and \(Fe\).
library(tidyr)
library(dplyr)
library(knitr)
library(utils)
library(ggplot2)
library(mlbench)
library(corrplot)
library(purrr)
library(mice)
library(reshape2)
library(VIM)
library(gridExtra)
data(Glass)
glimpse(Glass)
## Observations: 214
## Variables: 10
## $ RI <dbl> 1.52101, 1.51761, 1.51618, 1.51766, 1.51742, 1.51596, 1.5...
## $ Na <dbl> 13.64, 13.89, 13.53, 13.21, 13.27, 12.79, 13.30, 13.15, 1...
## $ Mg <dbl> 4.49, 3.60, 3.55, 3.69, 3.62, 3.61, 3.60, 3.61, 3.58, 3.6...
## $ Al <dbl> 1.10, 1.36, 1.54, 1.29, 1.24, 1.62, 1.14, 1.05, 1.37, 1.3...
## $ Si <dbl> 71.78, 72.73, 72.99, 72.61, 73.08, 72.97, 73.09, 73.24, 7...
## $ K <dbl> 0.06, 0.48, 0.39, 0.57, 0.55, 0.64, 0.58, 0.57, 0.56, 0.5...
## $ Ca <dbl> 8.75, 7.83, 7.78, 8.22, 8.07, 8.07, 8.17, 8.24, 8.30, 8.4...
## $ Ba <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Fe <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.26, 0.00, 0.00, 0.00, 0.1...
## $ Type <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
# For a Correlation Matrix, we need to remove the categorical variable
Glass2 = select(Glass, -Type)
m = cor(Glass2)
corrplot(m)
# Show distributions of all variables
Glass2 %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(bins = 16)
The histograms above show a mix of skewed and roughly normal distributions. The distributions of \(Al, Ca, Na, RI\) and \(Si\) are roughly normal, while the others are clearly skewed.
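The visual judgment above can be backed by a number. The following is a minimal sketch of a sample skewness statistic in base R; `sample_skewness` is a hypothetical helper, not part of the original analysis, and the two vectors are toy data rather than the Glass predictors.

```r
# Sample skewness: third central moment divided by the cubed standard
# deviation (population form). Hypothetical helper for illustration.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Toy vectors standing in for a symmetric and a zero-heavy predictor.
symmetric  <- c(-2, -1, 0, 1, 2)
zero_heavy <- c(0, 0, 0, 0, 0, 0, 0.1, 0.2, 1.5, 3.0)

skews <- c(symmetric  = sample_skewness(symmetric),
           zero_heavy = sample_skewness(zero_heavy))
print(round(skews, 2))
```

Assuming `Glass2` from the code above is available, `sapply(Glass2, sample_skewness)` would give one value per predictor; values far from zero point to the skewed variables.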
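To improve classification, skewed variables may be transformed, for example with a Box-Cox transformation. In this data set the candidates are \(Ba, Fe, K\) and \(Mg\). Because Box-Cox requires strictly positive values and some of these predictors contain zeros (see \(Ba\) and \(Fe\) in the output above), a small positive offset or a transformation that accepts zeros, such as Yeo-Johnson, is needed.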
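As a rough illustration of what a Box-Cox transformation does, the sketch below estimates the power parameter by maximizing the normal profile log-likelihood in base R. The helper names `box_cox` and `box_cox_loglik`, the search interval, and the simulated log-normal data are all assumptions made for this example; it is not the procedure used in the original analysis, and it applies only to strictly positive data.

```r
# Box-Cox transform for strictly positive x; lambda = 0 is the log.
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Profile log-likelihood of lambda under a normal model for the
# transformed values (additive constants dropped).
box_cox_loglik <- function(lambda, x) {
  y <- box_cox(x, lambda)
  n <- length(x)
  -n / 2 * log(mean((y - mean(y))^2)) + (lambda - 1) * sum(log(x))
}

set.seed(1)
x <- exp(rnorm(200))  # simulated right-skewed, strictly positive data

# Search lambda on an assumed interval of [-2, 2].
fit <- optimize(box_cox_loglik, interval = c(-2, 2), x = x, maximum = TRUE)
lambda_hat <- fit$maximum
x_bc <- box_cox(x, lambda_hat)

print(round(lambda_hat, 2))
```

For a predictor that contains zeros, one option is to add a small constant before transforming; the size of that constant is a modeling choice and can change the estimated lambda.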
3.2 The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
data(Soybean)
Soybean = Soybean %>% keep(is.factor)
# Print a frequency table for each predictor (column 1 is the Class outcome)
for (col in 2:ncol(Soybean)) {
  t = table(Soybean[[col]], dnn = colnames(Soybean[col]))
  print(colnames(Soybean[col]))
  print(t)
}
## [1] "date"
## date
## 0 1 2 3 4 5 6
## 26 75 93 118 131 149 90
## [1] "plant.stand"
## plant.stand
## 0 1
## 354 293
## [1] "precip"
## precip
## 0 1 2
## 74 112 459
## [1] "temp"
## temp
## 0 1 2
## 80 374 199
## [1] "hail"
## hail
## 0 1
## 435 127
## [1] "crop.hist"
## crop.hist
## 0 1 2 3
## 65 165 219 218
## [1] "area.dam"
## area.dam
## 0 1 2 3
## 123 227 145 187
## [1] "sever"
## sever
## 0 1 2
## 195 322 45
## [1] "seed.tmt"
## seed.tmt
## 0 1 2
## 305 222 35
## [1] "germ"
## germ
## 0 1 2
## 165 213 193
## [1] "plant.growth"
## plant.growth
## 0 1
## 441 226
## [1] "leaves"
## leaves
## 0 1
## 77 606
## [1] "leaf.halo"
## leaf.halo
## 0 1 2
## 221 36 342
## [1] "leaf.marg"
## leaf.marg
## 0 1 2
## 357 21 221
## [1] "leaf.size"
## leaf.size
## 0 1 2
## 51 327 221
## [1] "leaf.shread"
## leaf.shread
## 0 1
## 487 96
## [1] "leaf.malf"
## leaf.malf
## 0 1
## 554 45
## [1] "leaf.mild"
## leaf.mild
## 0 1 2
## 535 20 20
## [1] "stem"
## stem
## 0 1
## 296 371
## [1] "lodging"
## lodging
## 0 1
## 520 42
## [1] "stem.cankers"
## stem.cankers
## 0 1 2 3
## 379 39 36 191
## [1] "canker.lesion"
## canker.lesion
## 0 1 2 3
## 320 83 177 65
## [1] "fruiting.bodies"
## fruiting.bodies
## 0 1
## 473 104
## [1] "ext.decay"
## ext.decay
## 0 1 2
## 497 135 13
## [1] "mycelium"
## mycelium
## 0 1
## 639 6
## [1] "int.discolor"
## int.discolor
## 0 1 2
## 581 44 20
## [1] "sclerotia"
## sclerotia
## 0 1
## 625 20
## [1] "fruit.pods"
## fruit.pods
## 0 1 2 3
## 407 130 14 48
## [1] "fruit.spots"
## fruit.spots
## 0 1 2 4
## 345 75 57 100
## [1] "seed"
## seed
## 0 1
## 476 115
## [1] "mold.growth"
## mold.growth
## 0 1
## 524 67
## [1] "seed.discolor"
## seed.discolor
## 0 1
## 513 64
## [1] "seed.size"
## seed.size
## 0 1
## 532 59
## [1] "shriveling"
## shriveling
## 0 1
## 539 38
## [1] "roots"
## roots
## 0 1 2
## 551 86 15
The tables above show that the following variables are close to degenerate: the large majority of their non-missing observations fall in a single level.
- shriveling
- seed.discolor
- mycelium
- sclerotia
- leaf.mild
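One way to make "close to degenerate" concrete is to compare the most common level with the rest. The sketch below is a base R illustration using level counts copied from the tables above; `level_balance` is a hypothetical helper, and the idea is similar in spirit to the frequency ratio used by near-zero-variance filters. It assumes each variable has at least two observed levels.

```r
# Share of observations in the most common level, and the ratio of the
# most common to the second most common level. Hypothetical helper.
level_balance <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NA by default
  c(top_share  = counts[[1]] / sum(counts),
    freq_ratio = counts[[1]] / counts[[2]])
}

# Level counts copied from the frequency tables above.
mycelium  <- factor(rep(c(0, 1), times = c(639, 6)))
leaf_mild <- factor(rep(c(0, 1, 2), times = c(535, 20, 20)))
germ      <- factor(rep(c(0, 1, 2), times = c(165, 213, 193)))

balance <- rbind(mycelium  = level_balance(mycelium),
                 leaf.mild = level_balance(leaf_mild),
                 germ      = level_balance(germ))
print(round(balance, 2))
```

A balanced variable such as germ has a ratio close to 1, while mycelium and leaf.mild have ratios well above it. Any cutoff for calling a variable degenerate is a judgment call.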
# mice::md.pattern(Soybean)
# summary(VIM::aggr(Soybean))
aggr_plot <- aggr(Soybean, col = c('navyblue', 'red'), numbers = TRUE,
                  sortVars = TRUE, labels = names(Soybean), cex.axis = .7, gap = 3,
                  ylab = c("Histogram of missing data", "Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## hail 0.177159590
## sever 0.177159590
## seed.tmt 0.177159590
## lodging 0.177159590
## germ 0.163982430
## leaf.mild 0.158125915
## fruiting.bodies 0.155197657
## fruit.spots 0.155197657
## seed.discolor 0.155197657
## shriveling 0.155197657
## leaf.shread 0.146412884
## seed 0.134699854
## mold.growth 0.134699854
## seed.size 0.134699854
## leaf.halo 0.122986823
## leaf.marg 0.122986823
## leaf.size 0.122986823
## leaf.malf 0.122986823
## fruit.pods 0.122986823
## precip 0.055636896
## stem.cankers 0.055636896
## canker.lesion 0.055636896
## ext.decay 0.055636896
## mycelium 0.055636896
## int.discolor 0.055636896
## sclerotia 0.055636896
## plant.stand 0.052708638
## roots 0.045387994
## temp 0.043923865
## crop.hist 0.023426061
## plant.growth 0.023426061
## stem 0.023426061
## date 0.001464129
## area.dam 0.001464129
## Class 0.000000000
## leaves 0.000000000
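The values in the Count column are proportions of rows with a missing value, not raw counts. For hail, 0.177 corresponds to 121 of 683 rows, which matches the 435 + 127 = 562 non-missing values in its frequency table. As a minimal sketch, the same kind of summary can be computed in base R; the small `toy` data frame below is made up for illustration.

```r
# Made-up data frame with missing values in two of three columns.
toy <- data.frame(
  hail   = factor(c(0, NA, 1, NA, 0, 0)),
  temp   = factor(c(1, 2, NA, 1, 0, 2)),
  leaves = factor(c(1, 1, 0, 1, 1, 1))
)

# is.na() on a data frame gives a logical matrix; the column means are
# the proportion of missing values per variable.
missing_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(round(missing_share, 3))
```

Assuming the `Soybean` data frame from above, `sort(colMeans(is.na(Soybean)), decreasing = TRUE)` should give the same ordering of variables.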
The mice package can be used to impute the missing values. Here multiple imputation produces five completed data sets, and the second one is used.
imputed.1 <- mice(Soybean, m = 5, maxit = 5, seed = 500, printFlag = FALSE)
Soybean2 = complete(imputed.1, 2)
# Count missing (TRUE) and observed (FALSE) cells in the original and imputed data
table(is.na(Soybean))
##
## FALSE TRUE
## 22251 2337
table(is.na(Soybean2))
##
## FALSE
## 24588