3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
library(fpp2)
library(lattice)
library(tidyverse)
library(psych)
library(e1071)
library(car)
library(caret)
library(naniar)
library(VIM)
library(mice)
library(miceFast)
data(Glass)
describe(Glass)
## vars n mean sd median trimmed mad min max range skew kurtosis
## RI 1 214 1.52 0.00 1.52 1.52 0.00 1.51 1.53 0.02 1.60 4.72
## Na 2 214 13.41 0.82 13.30 13.38 0.64 10.73 17.38 6.65 0.45 2.90
## Mg 3 214 2.68 1.44 3.48 2.87 0.30 0.00 4.49 4.49 -1.14 -0.45
## Al 4 214 1.44 0.50 1.36 1.41 0.31 0.29 3.50 3.21 0.89 1.94
## Si 5 214 72.65 0.77 72.79 72.71 0.57 69.81 75.41 5.60 -0.72 2.82
## K 6 214 0.50 0.65 0.56 0.43 0.17 0.00 6.21 6.21 6.46 52.87
## Ca 7 214 8.96 1.42 8.60 8.74 0.66 5.43 16.19 10.76 2.02 6.41
## Ba 8 214 0.18 0.50 0.00 0.03 0.00 0.00 3.15 3.15 3.37 12.08
## Fe 9 214 0.06 0.10 0.00 0.04 0.00 0.00 0.51 0.51 1.73 2.52
## Type* 10 214 2.54 1.71 2.00 2.31 1.48 1.00 6.00 5.00 1.04 -0.29
## se
## RI 0.00
## Na 0.06
## Mg 0.10
## Al 0.03
## Si 0.05
## K 0.04
## Ca 0.10
## Ba 0.03
## Fe 0.01
## Type* 0.12
- Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
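# Scatterplot matrix of all nine predictors
pairs(~RI+Na+Mg+Al+Si+K+Ca+Ba+Fe, data=Glass,
  main="Glass ID")
# K against Ca, with point size showing the glass type
ggplot(Glass) +
  geom_point(mapping = aes(x = K, y = Ca, size = Type))
## Warning: Using size for a discrete variable is not advised.
# Si against Fe
ggplot(Glass, aes(x = Si, y = Fe)) +
  geom_point()
# Fe by glass type (the x-axis is labeled "Type", so Type is the grouping variable)
ggplot(Glass, aes(x=Type, y=Fe)) +
  geom_boxplot(fill="slateblue", alpha=0.2) +
  xlab("Type")
# Histogram and skewness for each predictor
ggplot(Glass, aes(x=RI)) +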
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
skewness(Glass$RI)
## [1] 1.602715
ggplot(Glass, aes(x=Na)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
skewness(Glass$Na)
## [1] 0.4478343
ggplot(Glass, aes(x=Mg)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
skewness(Glass$Mg)
## [1] -1.136452
ggplot(Glass, aes(x=Al)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
skewness(Glass$Al)
## [1] 0.8946104
ggplot(Glass, aes(x=Si)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
skewness(Glass$Si)
## [1] -0.7202392
ggplot(Glass, aes(x=K)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
skewness(Glass$K)
## [1] 6.460089
ggplot(Glass, aes(x=Ca)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
skewness(Glass$Ca)
## [1] 2.018446
ggplot(Glass, aes(x=Ba)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
skewness(Glass$Ba)
## [1] 3.36868
ggplot(Glass, aes(x=Fe)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
skewness(Glass$Fe)
## [1] 1.729811
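The scatterplot matrix shows the relationships between predictors only visually. As a complement, the sketch below ranks predictor pairs by absolute Pearson correlation. The helper name `top_correlations` is hypothetical and not part of any package used in this report; it is a minimal base-R sketch, not the only way to do this.

```r
# Hypothetical helper: rank pairs of numeric columns by |Pearson correlation|.
top_correlations <- function(df, n = 5) {
  num <- df[vapply(df, is.numeric, logical(1))]   # keep numeric columns only
  cm <- cor(num, use = "pairwise.complete.obs")
  idx <- which(upper.tri(cm), arr.ind = TRUE)     # each pair once, no diagonal
  pairs <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r = cm[idx],
    stringsAsFactors = FALSE
  )
  pairs <- pairs[order(-abs(pairs$r)), ]          # strongest relationships first
  rownames(pairs) <- NULL
  head(pairs, n)
}

# Usage on the Glass data, if mlbench is installed
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(top_correlations(Glass, n = 5))
}
```

Pairs near the top of this list are candidates for closer inspection in the scatterplot matrix.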
- Do there appear to be any outliers in the data? Are any predictors skewed?
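Yes. The histograms and the summary statistics suggest outliers in several predictors. K has a maximum of 6.21 against a median of 0.56, Ca reaches 16.19 against a median of 8.60, and Ba reaches 3.15 although its median is 0. Several predictors are also skewed: K (6.46), Ba (3.37), Ca (2.02), Fe (1.73), and RI (1.60) are strongly right-skewed, Mg is left-skewed (-1.14), and Al (0.89), Si (-0.72), and Na (0.45) are only mildly to moderately skewed.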
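The per-predictor checks above can be collected into one table. The sketch below computes a moment-based skewness and counts points outside the 1.5 × IQR boxplot fences for every numeric column. The helper names are hypothetical, the skewness formula is intended to mirror the default of `e1071::skewness` (an assumption worth verifying against the values printed earlier), and the 1.5 × IQR rule is only one common convention for flagging outliers.

```r
# Moment-based skewness: third central moment divided by sd^3.
# Assumption: this mirrors the default type of e1071::skewness.
skew <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

# Count observations outside the 1.5 * IQR boxplot fences.
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

# Hypothetical helper: one row per numeric predictor.
summarise_predictors <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  data.frame(
    predictor = names(num),
    skewness = vapply(num, skew, numeric(1)),
    n_outliers = vapply(num, count_outliers, integer(1)),
    row.names = NULL
  )
}

# Usage on the Glass data, if mlbench is installed
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(summarise_predictors(Glass))
}
```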
- Are there any relevant transformations of one or more predictors that might improve the classification model?
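Yes. A power transformation can reduce the skewness of the predictors. The powerTransform() function in the car package estimates the transformation parameter for each predictor by maximum likelihood, and its summary reports each estimate together with a convenient rounded value when one lies within 1.96 standard deviations of the maximum likelihood estimate. The Box-Cox family requires strictly positive values, and Mg, K, Ba, and Fe all have a minimum of 0, so the call below uses the Yeo-Johnson family (family="yjPower") instead. In the output, K (rounded power 0) and Ca (0.5) have conventional rounded powers, Na and Al round to 1 (no transformation), and the extreme estimates for RI, Si, Ba, and Fe suggest that a simple power transformation is unlikely to help much for those predictors.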
summary(powerTransform(Glass[,1:9], family="yjPower"))$result[,1:2]
## Est Power Rounded Pwr
## RI -25.0853114 -25.09
## Na 1.3755562 1.00
## Mg 1.7699080 2.00
## Al 0.9773267 1.00
## Si 10.9452696 10.95
## K -0.1441078 0.00
## Ca 0.6774333 0.50
## Ba -6.8620464 -6.86
## Fe -14.9245600 -14.92
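To make the estimated powers concrete, the sketch below writes out the Yeo-Johnson transformation for a given power. The function name `yeo_johnson` is hypothetical; this is the textbook piecewise formula in base R, not the implementation used inside car, and the toy vector is made up for illustration.

```r
# Yeo-Johnson transformation for a single power `lambda`.
# Unlike Box-Cox, it is defined for zero and negative values.
yeo_johnson <- function(y, lambda) {
  stopifnot(is.numeric(y), !anyNA(y), length(lambda) == 1)
  out <- numeric(length(y))
  pos <- y >= 0
  # Non-negative values: Box-Cox applied to y + 1
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log(y[pos] + 1)
  }
  # Negative values: mirrored Box-Cox with power 2 - lambda
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -(((-y[!pos] + 1)^(2 - lambda) - 1) / (2 - lambda))
  } else {
    out[!pos] <- -log(-y[!pos] + 1)
  }
  out
}

# Toy right-skewed values containing a zero (made-up data)
k <- c(0, 0.1, 0.5, 0.6, 6.2)
print(yeo_johnson(k, lambda = 0))   # lambda = 0 reduces to log(k + 1)
```

With a rounded power of 0, as reported for K, the transformation reduces to log(K + 1), which pulls in the long right tail while leaving zeros at zero.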
3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
describe(Soybean)
## vars n mean sd median trimmed mad min max range skew
## Class* 1 683 9.30 5.51 8 9.18 7.41 1 19 18 0.11
## date* 2 682 4.55 1.69 5 4.62 1.48 1 7 6 -0.30
## plant.stand* 3 647 1.45 0.50 1 1.44 0.00 1 2 1 0.19
## precip* 4 645 2.60 0.69 3 2.74 0.00 1 3 2 -1.42
## temp* 5 653 2.18 0.63 2 2.23 0.00 1 3 2 -0.16
## hail* 6 562 1.23 0.42 1 1.16 0.00 1 2 1 1.31
## crop.hist* 7 667 2.88 0.98 3 2.98 1.48 1 4 3 -0.40
## area.dam* 8 682 2.58 1.07 2 2.60 1.48 1 4 3 0.02
## sever* 9 562 1.73 0.60 2 1.69 0.00 1 3 2 0.17
## seed.tmt* 10 562 1.52 0.61 1 1.45 0.00 1 3 2 0.74
## germ* 11 571 2.05 0.79 2 2.06 1.48 1 3 2 -0.09
## plant.growth* 12 667 1.34 0.47 1 1.30 0.00 1 2 1 0.68
## leaves* 13 683 1.89 0.32 2 1.98 0.00 1 2 1 -2.44
## leaf.halo* 14 599 2.20 0.95 3 2.25 0.00 1 3 2 -0.41
## leaf.marg* 15 599 1.77 0.96 1 1.72 0.00 1 3 2 0.46
## leaf.size* 16 599 2.28 0.61 2 2.34 0.00 1 3 2 -0.25
## leaf.shread* 17 583 1.16 0.37 1 1.08 0.00 1 2 1 1.80
## leaf.malf* 18 599 1.08 0.26 1 1.00 0.00 1 2 1 3.22
## leaf.mild* 19 575 1.10 0.40 1 1.00 0.00 1 3 2 3.95
## stem* 20 667 1.56 0.50 2 1.57 0.00 1 2 1 -0.23
## lodging* 21 562 1.07 0.26 1 1.00 0.00 1 2 1 3.23
## stem.cankers* 22 645 2.06 1.35 1 1.95 0.00 1 4 3 0.61
## canker.lesion* 23 645 1.98 1.08 2 1.85 1.48 1 4 3 0.51
## fruiting.bodies* 24 577 1.18 0.38 1 1.10 0.00 1 2 1 1.66
## ext.decay* 25 645 1.25 0.48 1 1.16 0.00 1 3 2 1.70
## mycelium* 26 645 1.01 0.10 1 1.00 0.00 1 2 1 10.20
## int.discolor* 27 645 1.13 0.42 1 1.00 0.00 1 3 2 3.34
## sclerotia* 28 645 1.03 0.17 1 1.00 0.00 1 2 1 5.40
## fruit.pods* 29 599 1.50 0.88 1 1.28 0.00 1 4 3 1.84
## fruit.spots* 30 577 1.85 1.17 1 1.69 0.00 1 4 3 0.95
## seed* 31 591 1.19 0.40 1 1.12 0.00 1 2 1 1.54
## mold.growth* 32 591 1.11 0.32 1 1.02 0.00 1 2 1 2.43
## seed.discolor* 33 577 1.11 0.31 1 1.02 0.00 1 2 1 2.47
## seed.size* 34 591 1.10 0.30 1 1.00 0.00 1 2 1 2.66
## shriveling* 35 577 1.07 0.25 1 1.00 0.00 1 2 1 3.49
## roots* 36 652 1.18 0.44 1 1.07 0.00 1 3 2 2.46
## kurtosis se
## Class* -1.38 0.21
## date* -0.90 0.06
## plant.stand* -1.97 0.02
## precip* 0.55 0.03
## temp* -0.58 0.02
## hail* -0.29 0.02
## crop.hist* -0.92 0.04
## area.dam* -1.29 0.04
## sever* -0.56 0.03
## seed.tmt* -0.44 0.03
## germ* -1.40 0.03
## plant.growth* -1.54 0.02
## leaves* 3.98 0.01
## leaf.halo* -1.76 0.04
## leaf.marg* -1.75 0.04
## leaf.size* -0.63 0.02
## leaf.shread* 1.26 0.02
## leaf.malf* 8.35 0.01
## leaf.mild* 14.68 0.02
## stem* -1.95 0.02
## lodging* 8.42 0.01
## stem.cankers* -1.51 0.05
## canker.lesion* -1.24 0.04
## fruiting.bodies* 0.75 0.02
## ext.decay* 1.98 0.02
## mycelium* 102.18 0.00
## int.discolor* 10.57 0.02
## sclerotia* 27.19 0.01
## fruit.pods* 2.41 0.04
## fruit.spots* -0.76 0.05
## seed* 0.37 0.02
## mold.growth* 3.93 0.01
## seed.discolor* 4.12 0.01
## seed.size* 5.10 0.01
## shriveling* 10.21 0.01
## roots* 5.49 0.02
- Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
soy1 <- Soybean[,2:36]
nearZeroVar(soy1, names = TRUE)
## [1] "leaf.mild" "mycelium" "sclerotia"
The nearZeroVar() function from the caret package flags degenerate predictors. A predictor has a degenerate (near-zero variance) distribution when the fraction of unique values relative to the sample size is very low and the ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large. Three predictors meet both criteria: leaf.mild, mycelium, and sclerotia.
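The two criteria can be computed by hand to see why a predictor is flagged. The sketch below is a simplified base-R version with a hypothetical helper name, `nzv_metrics`. Its default cutoffs (a frequency ratio above 95/5 and fewer than 10% unique values) follow the defaults of caret's nearZeroVar(), but details such as the handling of missing values and single-valued predictors are assumptions and may differ from caret.

```r
# Hypothetical helper: frequency ratio and percent unique for each column.
nzv_metrics <- function(df, freq_cut = 95 / 5, unique_cut = 10) {
  one_column <- function(x) {
    counts <- sort(table(x), decreasing = TRUE)   # table() leaves out NAs
    counts <- counts[counts > 0]                  # ignore unused factor levels
    # Ratio of the most common value to the second most common value
    freq_ratio <- if (length(counts) >= 2) counts[[1]] / counts[[2]] else Inf
    percent_unique <- 100 * length(counts) / length(x)
    c(freq_ratio = freq_ratio, percent_unique = percent_unique)
  }
  m <- t(vapply(df, one_column, numeric(2)))
  data.frame(
    predictor = rownames(m),
    freq_ratio = m[, "freq_ratio"],
    percent_unique = m[, "percent_unique"],
    flagged = m[, "freq_ratio"] > freq_cut & m[, "percent_unique"] < unique_cut,
    row.names = NULL
  )
}

# Usage on the Soybean predictors, if mlbench is installed
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  res <- nzv_metrics(Soybean[, 2:36])
  print(res[order(-res$freq_ratio), ][1:5, ])
}
```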
- Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
gg_miss_var(Soybean)
res <- summary(aggr(Soybean, sortVar=TRUE))$combinations
##
## Variables sorted by number of missings:
## Variable Count
## hail 0.177159590
## sever 0.177159590
## seed.tmt 0.177159590
## lodging 0.177159590
## germ 0.163982430
## leaf.mild 0.158125915
## fruiting.bodies 0.155197657
## fruit.spots 0.155197657
## seed.discolor 0.155197657
## shriveling 0.155197657
## leaf.shread 0.146412884
## seed 0.134699854
## mold.growth 0.134699854
## seed.size 0.134699854
## leaf.halo 0.122986823
## leaf.marg 0.122986823
## leaf.size 0.122986823
## leaf.malf 0.122986823
## fruit.pods 0.122986823
## precip 0.055636896
## stem.cankers 0.055636896
## canker.lesion 0.055636896
## ext.decay 0.055636896
## mycelium 0.055636896
## int.discolor 0.055636896
## sclerotia 0.055636896
## plant.stand 0.052708638
## roots 0.045387994
## temp 0.043923865
## crop.hist 0.023426061
## plant.growth 0.023426061
## stem 0.023426061
## date 0.001464129
## area.dam 0.001464129
## Class 0.000000000
## leaves 0.000000000
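vis_miss(Soybean, sort_miss=TRUE)
Soybean %>%
  group_by(Class) %>%
  miss_var_summary() %>%
  ggplot(aes(Class, variable, fill=pct_miss)) +
  geom_tile() +
  scale_fill_gradient(low="blue", high="yellow") +
  theme(axis.text.x=element_text(angle=90,hjust=1))
Some predictors are far more likely to be missing than others. In the sorted table above, hail, sever, seed.tmt, and lodging are each missing in about 17.7% of the rows, followed by germ (16.4%) and leaf.mild (15.8%), while date and area.dam are missing in about 0.15% of the rows and Class and leaves are complete. The heat map shows the percentage of missing values for each predictor within each class, colored by pct_miss: yellow tiles mark class and predictor combinations with a high share of missing values. Missingness that is concentrated in particular classes is informative rather than random, so ignoring it can introduce bias in our models.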
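The relationship between missingness and class can also be summarized numerically. The sketch below reports, for each class, the percentage of rows with at least one missing predictor. The helper name `missing_by_class` is hypothetical, and it assumes the outcome column is called `Class`, as in the Soybean data.

```r
# Hypothetical helper: percent of rows per class with any missing predictor.
missing_by_class <- function(df, class_col = "Class") {
  predictors <- df[setdiff(names(df), class_col)]
  incomplete <- !complete.cases(predictors)       # TRUE if a row has any NA
  pct <- vapply(split(incomplete, df[[class_col]]), mean, numeric(1))
  sort(100 * pct, decreasing = TRUE)
}

# Usage on the Soybean data, if mlbench is installed
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  print(round(missing_by_class(Soybean), 1))
}
```

Classes at or near 100% would lose nearly all of their rows under listwise deletion, which is worth knowing before choosing a strategy.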
- Develop a strategy for handling missing data, either by eliminating predictors or imputation.
There are multiple ways to handle missing data: deleting the affected observations, deleting the affected predictors, imputing with the mean, median, or mode, or imputing with a prediction from a model.
Imputation with the mean, median, or mode is an easy way to fill in the missing values; however, it reduces the variance in the data set and shrinks standard errors. This analysis therefore uses prediction-based imputation with random forests through the mice package.
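For comparison, the simple mode imputation mentioned above can be sketched in a few lines of base R. The helper name `impute_mode` and the toy factor are made up for illustration; this is a baseline, not the random forest imputation applied to the Soybean data.

```r
# Hypothetical helper: replace NAs in a factor with its most frequent level.
impute_mode <- function(x) {
  stopifnot(is.factor(x))
  counts <- table(x)                              # NAs are not counted
  mode_level <- names(counts)[which.max(counts)]  # first level on ties
  x[is.na(x)] <- mode_level
  x
}

# Toy factor with two missing values (made-up data)
x <- factor(c("0", "1", "1", NA, NA))
x_imputed <- impute_mode(x)

print(prop.table(table(x)))          # shares among the observed values
print(prop.table(table(x_imputed)))  # shares after mode imputation
```

Every missing value is assigned to the most frequent level, so that level's share grows and the spread of the variable shrinks, which is the variance reduction described above.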
imputed <- mice(Soybean, method="rf", m=2)
##
## iter imp variable
## 1 1 date plant.stand precip temp hail crop.hist area.dam sever seed.tmt germ plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1 2 date plant.stand precip temp hail crop.hist area.dam sever seed.tmt germ plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 2 1 date plant.stand precip temp hail crop.hist area.dam sever seed.tmt germ plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 2 2 date plant.stand precip temp hail crop.hist area.dam sever seed.tmt germ plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 3 1 date plant.stand precip temp hail crop.hist area.dam sever seed.tmt germ plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 3 2 date plant.stand precip temp hail crop.hist area.dam sever seed.tmt germ plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 4 1 date plant.stand precip temp hail crop.hist area.dam sever seed.tmt germ plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 4 2 date plant.stand precip temp hail crop.hist area.dam sever seed.tmt germ plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 5 1 date plant.stand precip temp hail crop.hist area.dam sever seed.tmt germ plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 5 2 date plant.stand precip temp hail crop.hist area.dam sever seed.tmt germ plant.growth leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## Warning: Number of logged events: 340
imputed <- complete(imputed)
imputed <- as.data.frame(imputed)
gg_miss_var(imputed)