Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling.
Loading Glass data
library(mlbench)
data("Glass")
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
head(Glass)
##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1
Explore the predictor variables using visualizations
library(corrplot)
glass_correlation <- cor(Glass[, 1:9])
glass_correlation
##               RI          Na           Mg          Al          Si
## RI  1.0000000000 -0.19188538 -0.122274039 -0.40732603 -0.54205220
## Na -0.1918853790  1.00000000 -0.273731961  0.15679367 -0.06980881
## Mg -0.1222740393 -0.27373196  1.000000000 -0.48179851 -0.16592672
## Al -0.4073260341  0.15679367 -0.481798509  1.00000000 -0.00552372
## Si -0.5420521997 -0.06980881 -0.165926723 -0.00552372  1.00000000
## K  -0.2898327111 -0.26608650  0.005395667  0.32595845 -0.19333085
## Ca  0.8104026963 -0.27544249 -0.443750026 -0.25959201 -0.20873215
## Ba -0.0003860189  0.32660288 -0.492262118  0.47940390 -0.10215131
## Fe  0.1430096093 -0.24134641  0.083059529 -0.07440215 -0.09420073
##               K         Ca            Ba           Fe
## RI -0.289832711  0.8104027 -0.0003860189  0.143009609
## Na -0.266086504 -0.2754425  0.3266028795 -0.241346411
## Mg  0.005395667 -0.4437500 -0.4922621178  0.083059529
## Al  0.325958446 -0.2595920  0.4794039017 -0.074402151
## Si -0.193330854 -0.2087322 -0.1021513105 -0.094200731
## K   1.000000000 -0.3178362 -0.0426180594 -0.007719049
## Ca -0.317836155  1.0000000 -0.1128409671  0.124968219
## Ba -0.042618059 -0.1128410  1.0000000000 -0.058691755
## Fe -0.007719049  0.1249682 -0.0586917554  1.000000000
corrplot(glass_correlation, order = "hclust")
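The correlation matrix above is easier to scan when reduced to its strongest pairs. The following is a minimal base R sketch of one way to do that; the names `cors` and `cor_pairs` are introduced here for illustration only and are not part of the original analysis.

```r
library(mlbench)
data(Glass)

# Correlations among the nine numeric predictors
cors <- cor(Glass[, 1:9])

# Keep only the lower triangle so each pair is listed once
cors[upper.tri(cors, diag = TRUE)] <- NA
cor_pairs <- na.omit(as.data.frame(as.table(cors)))
names(cor_pairs) <- c("var1", "var2", "r")

# Sort by absolute correlation, strongest first
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
head(cor_pairs, 5)
```

Sorting on the absolute value keeps strong negative relationships (such as RI and Si) near the top alongside the positive ones.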
# Visualizing / checking for predictor distribution.
library(purrr)
library(tidyr)
library(ggplot2)
Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(bins = 30)
Do there appear to be any outliers in the data? Are any predictors skewed?
Ans: Yes to both. Most predictors are skewed to some degree, and the plots show outliers in most of them. Al, Ca, Na and Si look closest to a normal distribution in the histograms, while Ba, Fe and K are heavily right-skewed.
library(e1071)
# Checking for skewness
Glass[-10] %>% apply(2, skewness)
##         RI         Na         Mg         Al         Si          K
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889
##         Ca         Ba         Fe
##  2.0184463  3.3686800  1.7298107
# Checking for outliers: one boxplot per predictor in a 3 x 3 grid
par(mfrow = c(3, 3))
for (predictor in names(Glass)[1:9]) {
  boxplot(Glass[[predictor]], main = predictor)
}
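The boxplots flag outliers visually; counting them gives a number to compare across predictors. This is a small sketch using `boxplot.stats()`, which applies the same 1.5 x IQR whisker rule that `boxplot()` uses by default. The name `outlier_counts` is my own and is only illustrative.

```r
library(mlbench)
data(Glass)

# Number of points beyond the boxplot whiskers for each predictor
outlier_counts <- sapply(Glass[, 1:9], function(x) length(boxplot.stats(x)$out))
outlier_counts
```

A predictor with a count of zero has no points outside its whiskers, so it should not be described as having outliers under this rule.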
Are there any relevant transformations of one or more predictors that might improve the classification model?
Ans: Yes. A Box-Cox transformation would be a good option for the skewed predictors. Because Box-Cox requires strictly positive values, predictors that contain zeros (such as Ba and Fe) would need a different treatment, for example the Yeo-Johnson transformation. To reduce the influence of outliers, the spatial sign transformation could be applied after centering and scaling the predictors.
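As a rough illustration of the transformations mentioned in the answer, the sketch below chains Box-Cox, centering, scaling and the spatial sign through `caret::preProcess()`. It is one possible setup, not a tuned pipeline, and the object names `pp` and `glass_transformed` are assumptions made for this example.

```r
library(mlbench)
library(caret)

data(Glass)
predictors <- Glass[, 1:9]

# Box-Cox, then center and scale, then spatial sign.
# To my understanding, caret leaves predictors with zero or negative
# values untransformed in the Box-Cox step (Box-Cox needs positive data).
pp <- preProcess(predictors,
                 method = c("BoxCox", "center", "scale", "spatialSign"))
glass_transformed <- predict(pp, predictors)

# After the spatial sign step every row should have unit length
summary(sqrt(rowSums(glass_transformed^2)))
```

Swapping `"BoxCox"` for `"YeoJohnson"` in `method` is one way to also transform the predictors that contain zeros.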
Loading Soybean data
library(mlbench)
data("Soybean")
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
head(Soybean)
##                   Class date plant.stand precip temp hail crop.hist
## 1 diaporthe-stem-canker    6           0      2    1    0         1
## 2 diaporthe-stem-canker    4           0      2    1    0         2
## 3 diaporthe-stem-canker    3           0      2    1    0         1
## 4 diaporthe-stem-canker    3           0      2    1    0         1
## 5 diaporthe-stem-canker    6           0      2    1    0         2
## 6 diaporthe-stem-canker    5           0      2    1    0         3
##   area.dam sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg
## 1        1     1        0    0            1      1         0         2
## 2        0     2        1    1            1      1         0         2
## 3        0     2        1    2            1      1         0         2
## 4        0     2        0    1            1      1         0         2
## 5        0     1        0    2            1      1         0         2
## 6        0     1        0    1            1      1         0         2
##   leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers
## 1         2           0         0         0    1       1            3
## 2         2           0         0         0    1       0            3
## 3         2           0         0         0    1       0            3
## 4         2           0         0         0    1       0            3
## 5         2           0         0         0    1       0            3
## 6         2           0         0         0    1       0            3
##   canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia
## 1             1               1         1        0            0         0
## 2             1               1         1        0            0         0
## 3             0               1         1        0            0         0
## 4             0               1         1        0            0         0
## 5             1               1         1        0            0         0
## 6             0               1         1        0            0         0
##   fruit.pods fruit.spots seed mold.growth seed.discolor seed.size
## 1          0           4    0           0             0         0
## 2          0           4    0           0             0         0
## 3          0           4    0           0             0         0
## 4          0           4    0           0             0         0
## 5          0           4    0           0             0         0
## 6          0           4    0           0             0         0
##   shriveling roots
## 1          0     0
## 2          0     0
## 3          0     0
## 4          0     0
## 5          0     0
## 6          0     0
For the categorical predictors, are any of the distributions degenerate in the ways discussed earlier in the chapter?
Ans: Yes. The nearZeroVar function from the caret library flags three predictors as having near-zero variance, meaning almost all observations fall into a single level: "leaf.mild", "mycelium" and "sclerotia".
library(caret)
pred_degenerate <- nearZeroVar(Soybean)
names(Soybean)[pred_degenerate]
## [1] "leaf.mild" "mycelium" "sclerotia"
par(mfrow = c(1, 3))
barplot((table(Soybean$leaf.mild) * 100) / nrow(Soybean))
barplot((table(Soybean$mycelium) * 100) / nrow(Soybean))
barplot((table(Soybean$sclerotia) * 100) / nrow(Soybean))
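To see why these three predictors are flagged, `nearZeroVar()` can return the underlying metrics instead of column indices. The sketch below uses the `saveMetrics = TRUE` argument with caret's default cutoffs; `nzv_metrics` is a name chosen for this example.

```r
library(mlbench)
library(caret)
data(Soybean)

# freqRatio: count of the most common value divided by the second most common
# percentUnique: number of distinct values as a percentage of the rows
nzv_metrics <- nearZeroVar(Soybean, saveMetrics = TRUE)

# Show only the predictors flagged as near-zero variance
nzv_metrics[nzv_metrics$nzv, ]
```

A predictor is flagged when its frequency ratio is high and its percentage of unique values is low, which matches the lopsided bar charts above.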
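Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of the missing data related to the classes?
Ans: The counts below show clear patterns. Groups of predictors share the same number of missing values: 121 (hail, sever, seed.tmt, lodging), 106 (fruiting.bodies, fruit.spots, seed.discolor, shriveling), 84 (leaf.halo, leaf.marg, leaf.size, leaf.malf, fruit.pods) and 38 (precip, stem.cankers, canker.lesion, ext.decay, mycelium, int.discolor, sclerotia). This suggests the values in each group tend to be missing together in the same rows. The per-predictor counts alone do not show whether the pattern depends on the class; that requires tabulating missing values by Class.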
# Check the number of missing values for each predictor.
sapply(Soybean, function(x) sum(is.na(x)))
##           Class            date     plant.stand          precip
##               0               1              36              38
##            temp            hail       crop.hist        area.dam
##              30             121              16               1
##           sever        seed.tmt            germ    plant.growth
##             121             121             112              16
##          leaves       leaf.halo       leaf.marg       leaf.size
##               0              84              84              84
##     leaf.shread       leaf.malf       leaf.mild            stem
##             100              84             108              16
##         lodging    stem.cankers   canker.lesion fruiting.bodies
##             121              38              38             106
##       ext.decay        mycelium    int.discolor       sclerotia
##              38              38              38              38
##      fruit.pods     fruit.spots            seed     mold.growth
##              84             106              92              92
##   seed.discolor       seed.size      shriveling           roots
##             106              92             106              31
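The per-predictor counts do not say whether missingness depends on the class. A simple way to check is to count incomplete rows within each class, as in this base R sketch; `has_missing` and `missing_by_class` are names I introduce here for illustration.

```r
library(mlbench)
data(Soybean)

# Flag rows with at least one missing value
has_missing <- !complete.cases(Soybean)

# Rows and incomplete rows per class (both follow the factor level order)
missing_by_class <- data.frame(
  n_rows       = as.vector(table(Soybean$Class)),
  n_incomplete = as.vector(tapply(has_missing, Soybean$Class, sum)),
  row.names    = levels(Soybean$Class)
)
missing_by_class$pct_incomplete <-
  round(100 * missing_by_class$n_incomplete / missing_by_class$n_rows, 1)

# Show only the classes that contain incomplete rows
missing_by_class[missing_by_class$n_incomplete > 0, ]
```

If the incomplete rows are concentrated in a handful of classes, dropping them would remove those classes disproportionately, which matters for the strategy chosen next.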
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
Ans: There are several options. Because the predictors are categorical, one option is simply to remove the rows that contain missing values. Another is to impute them, for example with a KNN-based method. Predictors that share the same missing-value pattern should be handled with the same treatment.
# Handling missing values
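The sketch below illustrates two of the simpler options in base R: dropping incomplete rows, and filling each missing value with the most frequent level of its predictor. Note that mode imputation is used here as a plain stand-in, not the KNN approach mentioned above, which would need an additional package. The helper `impute_mode` and the objects `soy_complete` and `soy_imputed` are hypothetical names for this example.

```r
library(mlbench)
data(Soybean)

# Option 1: keep only rows with no missing values
soy_complete <- Soybean[complete.cases(Soybean), ]

# Option 2: replace missing values with the most frequent level
# (a simple baseline, not KNN imputation)
impute_mode <- function(x) {
  if (anyNA(x)) {
    mode_level <- names(which.max(table(x)))  # table() ignores NA by default
    x[is.na(x)] <- mode_level
  }
  x
}
soy_imputed <- Soybean
soy_imputed[] <- lapply(Soybean, impute_mode)

# Compare how many rows each option retains
c(original = nrow(Soybean),
  complete = nrow(soy_complete),
  imputed  = nrow(soy_imputed))
```

Dropping rows is the cleanest option but discards data, while mode imputation keeps every row at the cost of reinforcing the dominant level of each predictor.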