Problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your RPubs link along with a .pdf of your run code.
3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
# install the package first if it is not already installed
#install.packages("mlbench")
library(mlbench)
data(Glass) # load data set
str(Glass) # view data set structure
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
To visualize distributions, I can plot a density graph of all predictors:
# load package
library(tidyr); library(ggplot2)
# density plot
Glass %>%
purrr::keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_density()
To understand the relationships between the predictors, I can draw pairwise scatterplots and compute correlations.
# install the package first if it is not already installed
#install.packages("GGally")
Glass %>%
purrr::keep(is.numeric) %>%
GGally::ggpairs()
(b) Do there appear to be any outliers in the data? Are any predictors skewed?
Yes, most of the variables are skewed (some right-skewed, some left-skewed) and have outliers.
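The visual impression of skewness can be backed by a number. The following is a minimal sketch, assuming a moment-based definition of sample skewness; the helper name `sample_skewness` is hypothetical and is not part of any package used above.

```r
library(mlbench)
data(Glass)

# Hypothetical helper: moment-based sample skewness (third central moment
# divided by the 1.5 power of the second central moment).
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Column 10 is the outcome (Type), so it is excluded.
skew_values <- sapply(Glass[, -10], sample_skewness)

# Positive values indicate right skew, negative values indicate left skew.
print(round(sort(skew_values), 2))
```

Values far from zero point to the predictors that would benefit most from a transformation.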
(c) Are there any relevant transformations of one or more predictors that might improve the classification model?
Yes. To reduce skewness, I can apply a Box-Cox transformation (which is defined only for strictly positive values); to reduce the influence of outliers, I can apply a spatial sign transformation.
# install the package first if it is not already installed
#install.packages("caret")
# load package
library(caret)
# run Box-Cox and Spatial Sign
preProc <- preProcess(Glass[,-10], method = c("BoxCox", "spatialSign"))
# apply transformation to the data set
GlassT <- predict(preProc, Glass[,-10])
# density plot
GlassT %>%
purrr::keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_density()
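Because Box-Cox is defined only for strictly positive values, and the str() output above shows zeros in Ba and Fe, it may be worth checking which predictors qualify. This is a minimal sketch in base R; the variable names `has_nonpositive` and `not_boxcox_ready` are illustrative only.

```r
library(mlbench)
data(Glass)

# TRUE for predictors with at least one zero or negative value.
has_nonpositive <- sapply(Glass[, -10], function(x) any(x <= 0))
print(has_nonpositive)

# Names of the predictors a Box-Cox transformation cannot handle as-is.
not_boxcox_ready <- names(has_nonpositive)[has_nonpositive]
print(not_boxcox_ready)
```

For predictors that contain zeros, one possible alternative is the Yeo-Johnson transformation, available in caret's preProcess as method = "YeoJohnson".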
3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
library(mlbench)
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
The variable “Class” is the outcome; the others are predictors. I can inspect degenerate distributions numerically, visually, and with caret's nearZeroVar function.
Hmisc::describe(Soybean)
## Soybean
##
## 36 Variables 683 Observations
## --------------------------------------------------------------------------------
## Class
## n missing distinct
## 683 0 19
##
## lowest : 2-4-d-injury alternarialeaf-spot anthracnose bacterial-blight bacterial-pustule
## highest: phyllosticta-leaf-spot phytophthora-rot powdery-mildew purple-seed-stain rhizoctonia-root-rot
## --------------------------------------------------------------------------------
## date
## n missing distinct
## 682 1 7
##
## Value 0 1 2 3 4 5 6
## Frequency 26 75 93 118 131 149 90
## Proportion 0.038 0.110 0.136 0.173 0.192 0.218 0.132
## --------------------------------------------------------------------------------
## plant.stand
## n missing distinct
## 647 36 2
##
## Value 0 1
## Frequency 354 293
## Proportion 0.547 0.453
## --------------------------------------------------------------------------------
## precip
## n missing distinct
## 645 38 3
##
## Value 0 1 2
## Frequency 74 112 459
## Proportion 0.115 0.174 0.712
## --------------------------------------------------------------------------------
## temp
## n missing distinct
## 653 30 3
##
## Value 0 1 2
## Frequency 80 374 199
## Proportion 0.123 0.573 0.305
## --------------------------------------------------------------------------------
## hail
## n missing distinct
## 562 121 2
##
## Value 0 1
## Frequency 435 127
## Proportion 0.774 0.226
## --------------------------------------------------------------------------------
## crop.hist
## n missing distinct
## 667 16 4
##
## Value 0 1 2 3
## Frequency 65 165 219 218
## Proportion 0.097 0.247 0.328 0.327
## --------------------------------------------------------------------------------
## area.dam
## n missing distinct
## 682 1 4
##
## Value 0 1 2 3
## Frequency 123 227 145 187
## Proportion 0.180 0.333 0.213 0.274
## --------------------------------------------------------------------------------
## sever
## n missing distinct
## 562 121 3
##
## Value 0 1 2
## Frequency 195 322 45
## Proportion 0.347 0.573 0.080
## --------------------------------------------------------------------------------
## seed.tmt
## n missing distinct
## 562 121 3
##
## Value 0 1 2
## Frequency 305 222 35
## Proportion 0.543 0.395 0.062
## --------------------------------------------------------------------------------
## germ
## n missing distinct
## 571 112 3
##
## Value 0 1 2
## Frequency 165 213 193
## Proportion 0.289 0.373 0.338
## --------------------------------------------------------------------------------
## plant.growth
## n missing distinct
## 667 16 2
##
## Value 0 1
## Frequency 441 226
## Proportion 0.661 0.339
## --------------------------------------------------------------------------------
## leaves
## n missing distinct
## 683 0 2
##
## Value 0 1
## Frequency 77 606
## Proportion 0.113 0.887
## --------------------------------------------------------------------------------
## leaf.halo
## n missing distinct
## 599 84 3
##
## Value 0 1 2
## Frequency 221 36 342
## Proportion 0.369 0.060 0.571
## --------------------------------------------------------------------------------
## leaf.marg
## n missing distinct
## 599 84 3
##
## Value 0 1 2
## Frequency 357 21 221
## Proportion 0.596 0.035 0.369
## --------------------------------------------------------------------------------
## leaf.size
## n missing distinct
## 599 84 3
##
## Value 0 1 2
## Frequency 51 327 221
## Proportion 0.085 0.546 0.369
## --------------------------------------------------------------------------------
## leaf.shread
## n missing distinct
## 583 100 2
##
## Value 0 1
## Frequency 487 96
## Proportion 0.835 0.165
## --------------------------------------------------------------------------------
## leaf.malf
## n missing distinct
## 599 84 2
##
## Value 0 1
## Frequency 554 45
## Proportion 0.925 0.075
## --------------------------------------------------------------------------------
## leaf.mild
## n missing distinct
## 575 108 3
##
## Value 0 1 2
## Frequency 535 20 20
## Proportion 0.930 0.035 0.035
## --------------------------------------------------------------------------------
## stem
## n missing distinct
## 667 16 2
##
## Value 0 1
## Frequency 296 371
## Proportion 0.444 0.556
## --------------------------------------------------------------------------------
## lodging
## n missing distinct
## 562 121 2
##
## Value 0 1
## Frequency 520 42
## Proportion 0.925 0.075
## --------------------------------------------------------------------------------
## stem.cankers
## n missing distinct
## 645 38 4
##
## Value 0 1 2 3
## Frequency 379 39 36 191
## Proportion 0.588 0.060 0.056 0.296
## --------------------------------------------------------------------------------
## canker.lesion
## n missing distinct
## 645 38 4
##
## Value 0 1 2 3
## Frequency 320 83 177 65
## Proportion 0.496 0.129 0.274 0.101
## --------------------------------------------------------------------------------
## fruiting.bodies
## n missing distinct
## 577 106 2
##
## Value 0 1
## Frequency 473 104
## Proportion 0.82 0.18
## --------------------------------------------------------------------------------
## ext.decay
## n missing distinct
## 645 38 3
##
## Value 0 1 2
## Frequency 497 135 13
## Proportion 0.771 0.209 0.020
## --------------------------------------------------------------------------------
## mycelium
## n missing distinct
## 645 38 2
##
## Value 0 1
## Frequency 639 6
## Proportion 0.991 0.009
## --------------------------------------------------------------------------------
## int.discolor
## n missing distinct
## 645 38 3
##
## Value 0 1 2
## Frequency 581 44 20
## Proportion 0.901 0.068 0.031
## --------------------------------------------------------------------------------
## sclerotia
## n missing distinct
## 645 38 2
##
## Value 0 1
## Frequency 625 20
## Proportion 0.969 0.031
## --------------------------------------------------------------------------------
## fruit.pods
## n missing distinct
## 599 84 4
##
## Value 0 1 2 3
## Frequency 407 130 14 48
## Proportion 0.679 0.217 0.023 0.080
## --------------------------------------------------------------------------------
## fruit.spots
## n missing distinct
## 577 106 4
##
## Value 0 1 2 4
## Frequency 345 75 57 100
## Proportion 0.598 0.130 0.099 0.173
## --------------------------------------------------------------------------------
## seed
## n missing distinct
## 591 92 2
##
## Value 0 1
## Frequency 476 115
## Proportion 0.805 0.195
## --------------------------------------------------------------------------------
## mold.growth
## n missing distinct
## 591 92 2
##
## Value 0 1
## Frequency 524 67
## Proportion 0.887 0.113
## --------------------------------------------------------------------------------
## seed.discolor
## n missing distinct
## 577 106 2
##
## Value 0 1
## Frequency 513 64
## Proportion 0.889 0.111
## --------------------------------------------------------------------------------
## seed.size
## n missing distinct
## 591 92 2
##
## Value 0 1
## Frequency 532 59
## Proportion 0.9 0.1
## --------------------------------------------------------------------------------
## shriveling
## n missing distinct
## 577 106 2
##
## Value 0 1
## Frequency 539 38
## Proportion 0.934 0.066
## --------------------------------------------------------------------------------
## roots
## n missing distinct
## 652 31 3
##
## Value 0 1 2
## Frequency 551 86 15
## Proportion 0.845 0.132 0.023
## --------------------------------------------------------------------------------
# visual frequencies
library(gridExtra)
library(purrr)
marrangeGrob(
map(
names(Soybean),
~ ggplot(Soybean, aes(.data[[.x]])) + labs(x = .x) + # aes_string() is deprecated
geom_bar()
),
ncol = 6,
nrow = 6,
top = "Soybean Distributions"
)
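nearZeroVar computes two metrics for each predictor: the percentage of unique values (percentUnique) and the ratio of the frequency of the most common value to that of the second most common value (freqRatio). When the first metric is low (under 10%) and the second is large (above 95/5 = 19 by default), the function flags the predictor as having near-zero variance.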
nzv <- nearZeroVar(Soybean[,-1], saveMetrics= TRUE)
library(dplyr)
filter(nzv, zeroVar == TRUE | nzv == TRUE)
## freqRatio percentUnique zeroVar nzv
## leaf.mild 26.75 0.4392387 FALSE TRUE
## mycelium 106.50 0.2928258 FALSE TRUE
## sclerotia 31.25 0.2928258 FALSE TRUE
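The two metrics can be reproduced by hand from the frequency tables above. The sketch below uses the mycelium counts reported by Hmisc::describe (639 zeros, 6 ones, out of 683 rows); the variable names are illustrative only.

```r
# Counts for mycelium, copied from the describe() output above.
counts <- c(`0` = 639, `1` = 6)
n_rows <- 683  # total observations, including rows where mycelium is missing

# freqRatio: most common value's count over the second most common's.
sorted_counts <- sort(counts, decreasing = TRUE)
freq_ratio <- unname(sorted_counts[1] / sorted_counts[2])

# percentUnique: number of distinct values as a percentage of all rows.
percent_unique <- 100 * length(counts) / n_rows

print(c(freqRatio = freq_ratio, percentUnique = percent_unique))
```

Both numbers agree with the mycelium row in the nearZeroVar output (106.50 and 0.2928258).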
(b) Roughly 18% of the data are missing. Are there particular predictors
that are more likely to be missing? Is the pattern of missing data
related to the classes?
# install the package first if it is not already installed
#install.packages("naniar")
# load package
# number, proportions and percentages of missingness
library(naniar)
cbind(c("Number of Missing Values",
"Number of Complete Values",
"Proportion of Missing Values",
"Proportion of Complete Values",
"Percentage of Missing Values",
"Percentage of Complete Values"),
rbind(n_miss(Soybean),
n_complete(Soybean),
round(prop_miss(Soybean),4),
round(prop_complete(Soybean),4),
round(pct_miss(Soybean),2),
round(pct_complete(Soybean),2)))
## [,1] [,2]
## [1,] "Number of Missing Values" "2337"
## [2,] "Number of Complete Values" "22251"
## [3,] "Proportion of Missing Values" "0.095"
## [4,] "Proportion of Complete Values" "0.905"
## [5,] "Percentage of Missing Values" "9.5"
## [6,] "Percentage of Complete Values" "90.5"
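At the cell level, 9.5% of the values are missing, about half of the 18% quoted in the question. The 18% figure is consistent with counting observations instead of cells: hail, sever, seed.tmt and lodging are each missing in 121 of the 683 rows (17.7%). Some variables have far more missing values than others: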
# frequency of missingness by variable
library(naniar)
gg_miss_var(Soybean, show_pct = TRUE)
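The share of missing cells and the share of rows with at least one missing value are different quantities, so it can help to compute both. This is a minimal base R sketch; the names `cell_pct` and `row_pct` are illustrative only.

```r
library(mlbench)
data(Soybean)

# Percentage of individual cells that are NA.
cell_pct <- 100 * mean(is.na(Soybean))

# Percentage of rows that contain at least one NA.
row_pct <- 100 * mean(!complete.cases(Soybean))

print(round(c(cells = cell_pct, rows = row_pct), 2))
```

Comparing the two numbers shows which definition the "roughly 18%" in the question refers to.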
For brevity, we will subset just five predictors and plot their missingness patterns against the outcome:
# install the package first if it is not already installed
#install.packages("finalfit")
# load package
library(finalfit)
outcome <- "Class"
predictors1 <- colnames(Soybean[,2:6])
Soybean %>%
missing_pairs(outcome, predictors1)
Missing data are shown in grey.
levels(Soybean$Class)
## [1] "2-4-d-injury" "alternarialeaf-spot"
## [3] "anthracnose" "bacterial-blight"
## [5] "bacterial-pustule" "brown-spot"
## [7] "brown-stem-rot" "charcoal-rot"
## [9] "cyst-nematode" "diaporthe-pod-&-stem-blight"
## [11] "diaporthe-stem-canker" "downy-mildew"
## [13] "frog-eye-leaf-spot" "herbicide-injury"
## [15] "phyllosticta-leaf-spot" "phytophthora-rot"
## [17] "powdery-mildew" "purple-seed-stain"
## [19] "rhizoctonia-root-rot"
To see which class and predictor combinations have the most missing values:
## number of missing data in each variable
# order by number of missing data
library(naniar)
Soybean %>%
group_by(Class) %>%
miss_var_summary() %>%
arrange(desc(n_miss)) %>%
print(n=50)
## # A tibble: 665 × 4
## # Groups: Class [19]
## Class variable n_miss pct_miss
## <fct> <chr> <int> <num>
## 1 phytophthora-rot hail 68 77.3
## 2 phytophthora-rot sever 68 77.3
## 3 phytophthora-rot seed.tmt 68 77.3
## 4 phytophthora-rot germ 68 77.3
## 5 phytophthora-rot lodging 68 77.3
## 6 phytophthora-rot fruiting.bodies 68 77.3
## 7 phytophthora-rot fruit.pods 68 77.3
## 8 phytophthora-rot fruit.spots 68 77.3
## 9 phytophthora-rot seed 68 77.3
## 10 phytophthora-rot mold.growth 68 77.3
## 11 phytophthora-rot seed.discolor 68 77.3
## 12 phytophthora-rot seed.size 68 77.3
## 13 phytophthora-rot shriveling 68 77.3
## 14 phytophthora-rot leaf.halo 55 62.5
## 15 phytophthora-rot leaf.marg 55 62.5
## 16 phytophthora-rot leaf.size 55 62.5
## 17 phytophthora-rot leaf.shread 55 62.5
## 18 phytophthora-rot leaf.malf 55 62.5
## 19 phytophthora-rot leaf.mild 55 62.5
## 20 2-4-d-injury plant.stand 16 100
## 21 2-4-d-injury precip 16 100
## 22 2-4-d-injury temp 16 100
## 23 2-4-d-injury hail 16 100
## 24 2-4-d-injury crop.hist 16 100
## 25 2-4-d-injury sever 16 100
## 26 2-4-d-injury seed.tmt 16 100
## 27 2-4-d-injury germ 16 100
## 28 2-4-d-injury plant.growth 16 100
## 29 2-4-d-injury leaf.shread 16 100
## 30 2-4-d-injury leaf.mild 16 100
## 31 2-4-d-injury stem 16 100
## 32 2-4-d-injury lodging 16 100
## 33 2-4-d-injury stem.cankers 16 100
## 34 2-4-d-injury canker.lesion 16 100
## 35 2-4-d-injury fruiting.bodies 16 100
## 36 2-4-d-injury ext.decay 16 100
## 37 2-4-d-injury mycelium 16 100
## 38 2-4-d-injury int.discolor 16 100
## 39 2-4-d-injury sclerotia 16 100
## 40 2-4-d-injury fruit.pods 16 100
## 41 2-4-d-injury fruit.spots 16 100
## 42 2-4-d-injury seed 16 100
## 43 2-4-d-injury mold.growth 16 100
## 44 2-4-d-injury seed.discolor 16 100
## 45 2-4-d-injury seed.size 16 100
## 46 2-4-d-injury shriveling 16 100
## 47 2-4-d-injury roots 16 100
## 48 diaporthe-pod-&-stem-blight hail 15 100
## 49 diaporthe-pod-&-stem-blight sever 15 100
## 50 diaporthe-pod-&-stem-blight seed.tmt 15 100
## # ℹ 615 more rows
By count, the most missing values occur in the class “phytophthora-rot”.
Ordered by the percentage of missing data instead, the ranking (and the classes at the top) changes:
## percentage of missing data in each variable
# order by percentage of missing data
Soybean %>%
group_by(Class) %>%
miss_var_summary() %>%
arrange(desc(pct_miss)) %>%
print(n=50)
## # A tibble: 665 × 4
## # Groups: Class [19]
## Class variable n_miss pct_miss
## <fct> <chr> <int> <num>
## 1 diaporthe-pod-&-stem-blight hail 15 100
## 2 diaporthe-pod-&-stem-blight sever 15 100
## 3 diaporthe-pod-&-stem-blight seed.tmt 15 100
## 4 diaporthe-pod-&-stem-blight leaf.halo 15 100
## 5 diaporthe-pod-&-stem-blight leaf.marg 15 100
## 6 diaporthe-pod-&-stem-blight leaf.size 15 100
## 7 diaporthe-pod-&-stem-blight leaf.shread 15 100
## 8 diaporthe-pod-&-stem-blight leaf.malf 15 100
## 9 diaporthe-pod-&-stem-blight leaf.mild 15 100
## 10 diaporthe-pod-&-stem-blight lodging 15 100
## 11 diaporthe-pod-&-stem-blight roots 15 100
## 12 cyst-nematode plant.stand 14 100
## 13 cyst-nematode precip 14 100
## 14 cyst-nematode temp 14 100
## 15 cyst-nematode hail 14 100
## 16 cyst-nematode sever 14 100
## 17 cyst-nematode seed.tmt 14 100
## 18 cyst-nematode germ 14 100
## 19 cyst-nematode leaf.halo 14 100
## 20 cyst-nematode leaf.marg 14 100
## 21 cyst-nematode leaf.size 14 100
## 22 cyst-nematode leaf.shread 14 100
## 23 cyst-nematode leaf.malf 14 100
## 24 cyst-nematode leaf.mild 14 100
## 25 cyst-nematode lodging 14 100
## 26 cyst-nematode stem.cankers 14 100
## 27 cyst-nematode canker.lesion 14 100
## 28 cyst-nematode fruiting.bodies 14 100
## 29 cyst-nematode ext.decay 14 100
## 30 cyst-nematode mycelium 14 100
## 31 cyst-nematode int.discolor 14 100
## 32 cyst-nematode sclerotia 14 100
## 33 cyst-nematode fruit.spots 14 100
## 34 cyst-nematode seed.discolor 14 100
## 35 cyst-nematode shriveling 14 100
## 36 2-4-d-injury plant.stand 16 100
## 37 2-4-d-injury precip 16 100
## 38 2-4-d-injury temp 16 100
## 39 2-4-d-injury hail 16 100
## 40 2-4-d-injury crop.hist 16 100
## 41 2-4-d-injury sever 16 100
## 42 2-4-d-injury seed.tmt 16 100
## 43 2-4-d-injury germ 16 100
## 44 2-4-d-injury plant.growth 16 100
## 45 2-4-d-injury leaf.shread 16 100
## 46 2-4-d-injury leaf.mild 16 100
## 47 2-4-d-injury stem 16 100
## 48 2-4-d-injury lodging 16 100
## 49 2-4-d-injury stem.cankers 16 100
## 50 2-4-d-injury canker.lesion 16 100
## # ℹ 615 more rows
Finally, we can model missingness with a decision tree (here, a regression tree fitted to the proportion of missing values in each row):
# install the package first if it is not already installed
#install.packages("rpart.plot")
# load package
# model missingness
library(rpart)
library(rpart.plot)
Soybean %>%
add_prop_miss() %>%
rpart(prop_miss_all ~ ., data = .) %>%
prp(type = 4, extra = 101, prefix = "Prop. Miss = ", roundint = FALSE)
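As a simpler cross-check of whether missingness is related to the classes, the share of incomplete rows can be tabulated per class. This is a minimal base R sketch; the names `incomplete` and `incomplete_by_class` are illustrative only.

```r
library(mlbench)
data(Soybean)

# TRUE for rows with at least one missing predictor.
incomplete <- !complete.cases(Soybean)

# Proportion of incomplete rows within each class.
incomplete_by_class <- tapply(incomplete, Soybean$Class, mean)

print(round(sort(incomplete_by_class, decreasing = TRUE), 3))
```

Classes with a proportion of 1 have no complete rows at all, while classes with a proportion of 0 are fully observed.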
(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
There is some evidence that the values are not missing completely at random: missingness depends on the class. Missing values on many predictors tend to occur in levels 1 (“2-4-d-injury”), 9 (“cyst-nematode”), 10 (“diaporthe-pod-&-stem-blight”), 14 (“herbicide-injury”) and 16 (“phytophthora-rot”). Rather than eliminating predictors, a better idea is to impute the missing values. We will use a random forest algorithm (missForest) for the imputation:
# install the package first if it is not already installed
#install.packages("missForest")
# load package
# imputation
library(missForest)
library(dplyr)
Soybean.imp <- missForest(select(Soybean, -Class),
ntree = 1000, verbose = TRUE)
## missForest iteration 1 in progress...done!
## estimated error(s): 0.1121311
## difference(s): 0.02183644
## time: 17.75 seconds
##
## missForest iteration 2 in progress...done!
## estimated error(s): 0.1108298
## difference(s): 0.005145367
## time: 17.92 seconds
##
## missForest iteration 3 in progress...done!
## estimated error(s): 0.1106335
## difference(s): 0.002384438
## time: 16.26 seconds
##
## missForest iteration 4 in progress...done!
## estimated error(s): 0.1111546
## difference(s): 0.001087639
## time: 16.24 seconds
##
## missForest iteration 5 in progress...done!
## estimated error(s): 0.110835
## difference(s): 0.0007529805
## time: 18.22 seconds
##
## missForest iteration 6 in progress...done!
## estimated error(s): 0.1104027
## difference(s): 0.0007529805
## time: 18.2 seconds
SoybeanImpute <- data.frame(Soybean.imp$ximp)
# name the outcome column explicitly; otherwise cbind() labels it "Soybean$Class"
SoybeanNew <- cbind(Class = Soybean$Class, SoybeanImpute)