library(mlbench)
library(tidyr)
library(dplyr)
library(ggplot2)
library(corrplot)
library(moments)
library(reshape2)
library(mice)
Data Preprocessing/Overfitting
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Distribution
To understand the distributions of the predictor variables in this data set, I use histograms to show the frequencies of each one. The distributions differ: some predictors are roughly normally distributed, while others are not. Silica (Si), Soda (Na) and Lime (Ca) are the predictors present at the highest concentrations.
lglass <- Glass %>%
  pivot_longer(-Type, names_to = "Predictor", values_to = "Value", values_drop_na = TRUE) %>%
  mutate(Predictor = as.factor(Predictor))
lglass %>%
  ggplot(aes(Value, color = Predictor, fill = Predictor)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ Predictor, ncol = 3, scales = "free")
Relationship Between Predictors
Most of the stronger correlations are negative: RI and Si (-0.54), Mg and Ba (-0.49), and Mg and Al (-0.48). The exceptions are two positively correlated pairs: RI and Ca (0.81) and Al and Ba (0.48). The correlation table below indicates that most predictors are related to at least one other predictor.
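The correlations quoted above can be reproduced with a short sketch like the one below. It assumes the `Glass` data from `mlbench` with `Type` in column 10, as used elsewhere in this report; the name `glass_cor` is only illustrative.

```r
library(mlbench)
data(Glass)

# Pairwise Pearson correlations between the nine numeric predictors
# (column 10 is Type, the class label, so it is left out)
glass_cor <- cor(Glass[, -10])

# Rounded matrix for reading off individual pairs, e.g. RI vs Ca
round(glass_cor, 2)
```

Reading the rounded matrix row by row is usually enough to spot the few pairs with an absolute correlation near or above 0.5.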
Do there appear to be any outliers in the data? Are any predictors skewed?
To determine the skewness I use the moments library. The K, Ba, and Ca variables are all highly right skewed, and RI and Fe are right skewed to a lesser degree. Mg and Si are left skewed.
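# Skewness of each predictor (column 10 is Type, the class label)
skewValues <- apply(Glass[ , -10], 2, skewness)
# Ratio of the maximum to the minimum; 0.1 is added to every value before
# taking the minimum so that predictors with a minimum of 0 do not divide by zero
hiloRatios <- apply(Glass[ , -10], 2, function(x) max(x) / min(x + 0.1))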
cbind(Skew = skewValues, Hilo = hiloRatios)
##          Skew       Hilo
## RI  1.6140150  0.9520715
## Na  0.4509917  1.6048015
## Mg -1.1444648 44.9000000
## Al  0.9009179  8.9743590
## Si -0.7253173  1.0786726
## K   6.5056358 62.1000000
## Ca  2.0326774  2.9276673
## Ba  3.3924309 31.5000000
## Fe  1.7420068  5.1000000
From the visualization below, there appear to be outliers, such as K in type 5 glass, and Ba and Ca in type 2 glass.
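As a rough numeric companion to the plot, the sketch below counts the points that a standard boxplot would flag for each predictor. It pools all glass types together, which is a simplification of the per-type view described above, and `outlier_counts` is just an illustrative name.

```r
library(mlbench)
data(Glass)

# For each predictor, count the values that fall beyond the boxplot whiskers
# (more than 1.5 times the box length away from the box)
outlier_counts <- sapply(Glass[, -10], function(x) length(boxplot.stats(x)$out))

# Predictors with the most flagged points first
sort(outlier_counts, decreasing = TRUE)
```

A high count for a strongly skewed predictor does not necessarily mean bad data; it may simply reflect the long tail.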
Are there any relevant transformations of one or more predictors that might improve the classification model?
Reducing the skew would lessen the influence of outliers, which can improve a model’s performance. Transformations such as Box-Cox would help the classification model with the skewed variables. It would also be important to normalize the variables by centering and scaling them.
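A minimal sketch of this preprocessing is shown below. It assumes `MASS::boxcox` is used to pick the Box-Cox lambda from a grid of -2 to 2, which is one of several ways to do it. Box-Cox requires strictly positive values, so the predictors that contain zeros (Mg, K, Ba and Fe) are only centered and scaled here. The helper `box_cox` and the other object names are hypothetical.

```r
library(mlbench)
library(MASS)
data(Glass)

predictors <- Glass[, -10]

# Box-Cox is only defined for strictly positive values
positive <- sapply(predictors, function(x) all(x > 0))

# Pick the lambda with the highest profile log-likelihood, then transform
box_cox <- function(x) {
  bc <- boxcox(x ~ 1, lambda = seq(-2, 2, by = 0.1), plotit = FALSE)
  lambda <- bc$x[which.max(bc$y)]
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

transformed <- predictors
transformed[positive] <- lapply(predictors[positive], box_cox)

# Center and scale every predictor to mean 0 and standard deviation 1
transformed <- as.data.frame(scale(transformed))
```

For the predictors with zeros, an alternative such as the Yeo-Johnson transformation could be considered, since it accepts zero and negative values.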
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
##                  Class          date     plant.stand   precip      temp
##  brown-spot         : 92   5      :149   0   :354     0   : 74   0   : 80
##  alternarialeaf-spot: 91   4      :131   1   :293     1   :112   1   :374
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36     2   :459   2   :199
##  phytophthora-rot   : 88   2      : 93                NA's: 38   NA's: 30
##  anthracnose        : 44   6      : 90
##  brown-stem-rot     : 44   (Other):101
##  (Other)            :233   NA's   :  1
##
##  hail       crop.hist  area.dam   sever      seed.tmt   germ       plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112
##             NA's: 16   NA's:  1
##
##  leaves       leaf.halo    leaf.marg    leaf.size    leaf.shread  leaf.malf    leaf.mild
##  0: 77        0   :221     0   :357     0   : 51     0   :487     0   :554     0   :535
##  1:606        1   : 36     1   : 21     1   :327     1   : 96     1   : 45     1   : 20
##               2   :342     2   :221     2   :221     NA's:100     NA's: 84     2   : 20
##               NA's: 84     NA's: 84     NA's: 84                               NA's:108
##
##  stem             lodging          stem.cankers     canker.lesion    fruiting.bodies  ext.decay
##  0   :296         0   :520         0   :379         0   :320         0   :473         0   :497
##  1   :371         1   : 42         1   : 39         1   : 83         1   :104         1   :135
##  NA's: 16         NA's:121         2   : 36         2   :177         NA's:106         2   : 13
##                                    3   :191         3   : 65                          NA's: 38
##                                    NA's: 38         NA's: 38
##
##  mycelium      int.discolor  sclerotia     fruit.pods    fruit.spots   seed
##  0   :639      0   :581      0   :625      0   :407      0   :345      0   :476
##  1   :  6      1   : 44      1   : 20      1   :130      1   : 75      1   :115
##  NA's: 38      2   : 20      NA's: 38      2   : 14      2   : 57      NA's: 92
##                NA's: 38                    3   : 48      4   :100
##                                            NA's: 84      NA's:106
##
##  mold.growth    seed.discolor  seed.size      shriveling     roots
##  0   :524       0   :513       0   :532       0   :539       0   :551
##  1   : 67       1   : 64       1   : 59       1   : 38       1   : 86
##  NA's: 92       NA's:106       NA's: 92       NA's:106       2   : 15
##                                                              NA's: 31
Several predictors have very little variance, and their distributions are highly skewed. For example, approximately 94% of the observations for the mycelium variable fall into the same category. Missing values are scattered throughout the dataset, affecting almost all variables to differing degrees, and make up approximately 18% of the data for several variables such as hail, sever, seed.tmt, and lodging. The predictors with low frequencies of non-zero values are: lodging, mycelium, sclerotia, mold.growth, shriveling, leaf.malf, leaf.mild, seed.discolor and seed.size. These may cause issues when using models like linear regression.
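One simple way to check for this kind of degeneracy is to compute, for each predictor, the share of observations taken up by its most common category. The sketch below does that with base R on the `Soybean` data from `mlbench`; counting NA as its own category is my assumption, and `top_share` is an illustrative name.

```r
library(mlbench)
data(Soybean)

# Share of rows that fall into the single most frequent category of each
# predictor (NA is counted as its own category); column 1 is Class
top_share <- sapply(Soybean[, -1], function(x) {
  max(table(x, useNA = "ifany")) / length(x)
})

# Predictors dominated by one category appear first
round(sort(top_share, decreasing = TRUE), 2)
```

Predictors near the top of this list are the candidates for removal before fitting models that are sensitive to near-constant inputs.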
## date plant.stand precip temp hail crop.hist area.dam sever seed.tmt germ
## 1 6 0 2 1 0 1 1 1 0 0
## 2 4 0 2 1 0 2 0 2 1 1
## 3 3 0 2 1 0 1 0 2 1 2
## 4 3 0 2 1 0 1 0 2 0 1
## 5 6 0 2 1 0 2 0 1 0 2
## 6 5 0 2 1 0 3 0 1 0 1
## plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf
## 1 1 1 0 2 2 0 0
## 2 1 1 0 2 2 0 0
## 3 1 1 0 2 2 0 0
## 4 1 1 0 2 2 0 0
## 5 1 1 0 2 2 0 0
## 6 1 1 0 2 2 0 0
## leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 1 0 1 1 3 1 1 1
## 2 0 1 0 3 1 1 1
## 3 0 1 0 3 0 1 1
## 4 0 1 0 3 0 1 1
## 5 0 1 0 3 1 1 1
## 6 0 1 0 3 0 1 1
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed mold.growth
## 1 0 0 0 0 4 0 0
## 2 0 0 0 0 4 0 0
## 3 0 0 0 0 4 0 0
## 4 0 0 0 0 4 0 0
## 5 0 0 0 0 4 0 0
## 6 0 0 0 0 4 0 0
## seed.discolor seed.size shriveling roots
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
sb_freq[, 2:length(sb_freq)] <- lapply(sb_freq[, 2:length(sb_freq)], function(x) as.numeric(as.character(x)))
ggplot(data=melt(sb_freq), mapping=aes(x = value)) +
  geom_bar() +
  facet_wrap(~variable, scales = 'free_x')
Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
Apart from leaves, the leaf predictors have at least 84 NA’s each, as does fruit.pods (84). Seed, mold.growth, seed.discolor, seed.size, and shriveling appear to be grouped, with either 92 or 106 NA’s each. Both canker predictors (stem.cankers and canker.lesion) have 38 NA’s, as do ext.decay, mycelium, int.discolor, sclerotia, and precip. Sever, seed.tmt and germ also have large numbers of NA’s and are possibly grouped together: sever and seed.tmt have 121 each, and germ has 112.
## Class date plant.stand precip temp
## 0 1 36 38 30
## hail crop.hist area.dam sever seed.tmt
## 121 16 1 121 121
## germ plant.growth leaves leaf.halo leaf.marg
## 112 16 0 84 84
## leaf.size leaf.shread leaf.malf leaf.mild stem
## 84 100 84 108 16
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 121 38 38 106 38
## mycelium int.discolor sclerotia fruit.pods fruit.spots
## 38 38 38 84 106
## seed mold.growth seed.discolor seed.size shriveling
## 92 92 106 92 106
## roots na_count
## 31 0
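Counts like the ones above can be obtained, and expressed as percentages, with a sketch along these lines. It starts from a fresh copy of `Soybean`, so the extra `na_count` column is not present, and the object names are illustrative.

```r
library(mlbench)
data(Soybean)

# Number of missing values in each column
na_per_col <- colSums(is.na(Soybean))

# The same counts as a percentage of the 683 rows, largest first
na_pct <- round(100 * na_per_col / nrow(Soybean), 1)
head(sort(na_pct, decreasing = TRUE), 10)
```

The percentages make it easier to see that the worst-affected predictors are each missing in roughly 18% of the rows.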
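The classes that have the most NA’s are phytophthora-rot, 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight and herbicide-injury. The other 14 classes have no missing values at all, so the pattern of missing data is strongly related to the classes.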
(Soybean %>%
  select(Class, na_count) %>%
  group_by(Class) %>%
  summarise(na_count = sum(na_count)) %>%
  arrange(desc(na_count)))
## # A tibble: 19 x 2
## Class na_count
## <fct> <int>
## 1 phytophthora-rot 1214
## 2 2-4-d-injury 450
## 3 cyst-nematode 336
## 4 diaporthe-pod-&-stem-blight 177
## 5 herbicide-injury 160
## 6 alternarialeaf-spot 0
## 7 anthracnose 0
## 8 bacterial-blight 0
## 9 bacterial-pustule 0
## 10 brown-spot 0
## 11 brown-stem-rot 0
## 12 charcoal-rot 0
## 13 diaporthe-stem-canker 0
## 14 downy-mildew 0
## 15 frog-eye-leaf-spot 0
## 16 phyllosticta-leaf-spot 0
## 17 powdery-mildew 0
## 18 purple-seed-stain 0
## 19 rhizoctonia-root-rot 0
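The `na_count` column used above is not defined in the code shown. The sketch below assumes it is the number of missing predictor values in each row, and also looks at the share of rows per class that have at least one missing value. It uses base R only, and all names other than `Soybean` are illustrative.

```r
library(mlbench)
data(Soybean)

# Assumption: na_count is the number of missing predictor values in each row
na_count <- rowSums(is.na(Soybean[, -1]))

# Total missing values per class, largest first
na_by_class <- sort(tapply(na_count, Soybean$Class, sum), decreasing = TRUE)
head(na_by_class, 5)

# Share of rows in each class with at least one missing value
incomplete_share <- tapply(na_count > 0, Soybean$Class, mean)
round(incomplete_share[incomplete_share > 0], 2)
```

If the share is close to 1 for a class, dropping incomplete rows would remove most of that class from the data.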
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
Leaf.marg, leaf.halo and leaf.size are strongly correlated, so removing some of them would make the model more efficient. Fruit.spots has many NA’s, and removing it may also be a good strategy. For imputation, it would be best to use k-nearest neighbors to model the NA’s. Inspecting the distribution of the target Class variable and the distributions of the predictors will help to determine how much bias dropping the missing data would introduce.
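To make the k-nearest neighbors idea concrete, here is a small base R sketch. It is not a tuned implementation: the distance (mean absolute difference over the predictors observed in both rows), the choice of k = 5, and treating the category codes as numbers are all my assumptions, and `knn_impute` is a hypothetical helper.

```r
library(mlbench)
data(Soybean)

# Code each category as a number so that distances can be computed
x <- sapply(Soybean[, -1], function(col) as.numeric(as.character(col)))

knn_impute <- function(x, k = 5) {
  out <- x
  for (i in which(!complete.cases(x))) {
    miss <- is.na(x[i, ])
    # Distance from row i to every row, using only the predictors observed in row i
    d <- apply(x, 1, function(r) mean(abs(r[!miss] - x[i, !miss]), na.rm = TRUE))
    d[i] <- Inf
    for (j in which(miss)) {
      # Candidate neighbours must have predictor j observed
      cand <- which(!is.na(x[, j]) & is.finite(d))
      nn <- cand[order(d[cand])][seq_len(min(k, length(cand)))]
      # Impute the most common value among the nearest neighbours
      out[i, j] <- as.numeric(names(which.max(table(x[nn, j]))))
    }
  }
  out
}

imputed <- knn_impute(x, k = 5)
```

Rows that were already complete are left unchanged, so comparing the class-wise distributions of `x` and `imputed` gives a quick sense of what the imputation did.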
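# Drop na_count (37) and the low-variance predictors identified above:
# shriveling (35), seed.size (34), seed.discolor (33), sclerotia (28),
# mycelium (26), lodging (21), leaf.mild (19) and leaf.malf (18)
sb <- Soybean[, -c(37, 35, 34, 33, 28, 26, 21, 19, 18)]
# Convert the category codes of every predictor (all columns except Class) to numbers
sb[, 2:length(sb)] <- lapply(sb[, 2:length(sb)], function(x) as.numeric(as.character(x)))
str(sb[,-1])
## 'data.frame': 683 obs. of 27 variables: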
## $ date : num 6 4 3 3 6 5 5 4 6 4 ...
## $ plant.stand : num 0 0 0 0 0 0 0 0 0 0 ...
## $ precip : num 2 2 2 2 2 2 2 2 2 2 ...
## $ temp : num 1 1 1 1 1 1 1 1 1 1 ...
## $ hail : num 0 0 0 0 0 0 0 1 0 0 ...
## $ crop.hist : num 1 2 1 1 2 3 2 1 3 2 ...
## $ area.dam : num 1 0 0 0 0 0 0 0 0 0 ...
## $ sever : num 1 2 2 2 1 1 1 1 1 2 ...
## $ seed.tmt : num 0 1 1 0 0 0 1 0 1 0 ...
## $ germ : num 0 1 2 1 2 1 0 2 1 2 ...
## $ plant.growth : num 1 1 1 1 1 1 1 1 1 1 ...
## $ leaves : num 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.halo : num 0 0 0 0 0 0 0 0 0 0 ...
## $ leaf.marg : num 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.size : num 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.shread : num 0 0 0 0 0 0 0 0 0 0 ...
## $ stem : num 1 1 1 1 1 1 1 1 1 1 ...
## $ stem.cankers : num 3 3 3 3 3 3 3 3 3 3 ...
## $ canker.lesion : num 1 1 0 0 1 0 1 1 1 1 ...
## $ fruiting.bodies: num 1 1 1 1 1 1 1 1 1 1 ...
## $ ext.decay : num 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ fruit.pods : num 0 0 0 0 0 0 0 0 0 0 ...
## $ fruit.spots : num 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ mold.growth : num 0 0 0 0 0 0 0 0 0 0 ...
## $ roots : num 0 0 0 0 0 0 0 0 0 0 ...
## date plant.stand precip temp
## Min. :0.000 Min. :0.0000 Min. :0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:1.000
## Median :4.000 Median :0.0000 Median :2.000 Median :1.000
## Mean :3.554 Mean :0.4529 Mean :1.597 Mean :1.182
## 3rd Qu.:5.000 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :6.000 Max. :1.0000 Max. :2.000 Max. :2.000
## NA's :1 NA's :36 NA's :38 NA's :30
## hail crop.hist area.dam sever
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.000 Median :2.000 Median :1.000 Median :1.0000
## Mean :0.226 Mean :1.885 Mean :1.581 Mean :0.7331
## 3rd Qu.:0.000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :1.000 Max. :3.000 Max. :3.000 Max. :2.0000
## NA's :121 NA's :16 NA's :1 NA's :121
## seed.tmt germ plant.growth leaves
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:1.0000
## Median :0.0000 Median :1.000 Median :0.0000 Median :1.0000
## Mean :0.5196 Mean :1.049 Mean :0.3388 Mean :0.8873
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :2.0000 Max. :2.000 Max. :1.0000 Max. :1.0000
## NA's :121 NA's :112 NA's :16
## leaf.halo leaf.marg leaf.size leaf.shread
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.0000
## Median :2.000 Median :0.000 Median :1.000 Median :0.0000
## Mean :1.202 Mean :0.773 Mean :1.284 Mean :0.1647
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :2.000 Max. :2.000 Max. :2.000 Max. :1.0000
## NA's :84 NA's :84 NA's :84 NA's :100
## stem stem.cankers canker.lesion fruiting.bodies
## Min. :0.0000 Min. :0.00 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :0.00 Median :1.0000 Median :0.0000
## Mean :0.5562 Mean :1.06 Mean :0.9798 Mean :0.1802
## 3rd Qu.:1.0000 3rd Qu.:3.00 3rd Qu.:2.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :3.00 Max. :3.0000 Max. :1.0000
## NA's :16 NA's :38 NA's :38 NA's :106
## ext.decay int.discolor fruit.pods fruit.spots
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.000
## Mean :0.2496 Mean :0.1302 Mean :0.5042 Mean :1.021
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:2.000
## Max. :2.0000 Max. :2.0000 Max. :3.0000 Max. :4.000
## NA's :38 NA's :38 NA's :84 NA's :106
## seed mold.growth roots
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1946 Mean :0.1134 Mean :0.1779
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :2.0000
## NA's :92 NA's :92 NA's :31
correlated_sb <- cor(sb[,-1], use="pairwise.complete.obs")
corrplot(correlated_sb, method = "circle", order = "hclust")
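A correlation plot is easier to act on when the strongest pairs are also listed explicitly. The sketch below recomputes the correlations in a self-contained way, using all 35 predictors rather than the reduced `sb` data frame, and ranks the pairs by absolute correlation. The object names are illustrative.

```r
library(mlbench)
data(Soybean)

# Numeric recoding of every predictor (column 1 is Class)
num <- as.data.frame(lapply(Soybean[, -1], function(x) as.numeric(as.character(x))))
cors <- cor(num, use = "pairwise.complete.obs")

# Keep each pair once (upper triangle) and rank by absolute correlation
idx <- which(upper.tri(cors), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r = cors[idx]
)
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
head(cor_pairs, 10)
```

Pairs near the top of this list are the natural candidates when deciding which of several related predictors to drop.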