Applied Predictive Modeling (2013) is one of the most amazing books about machine learning. It was written by Max Kuhn and Kjell Johnson; the first created the caret package on R. For R users, Applied Predictive Modeling (2013) is a mandatory reading.
In this post, I will share some great things that I have learned while practicing the exercises from Chapter 3 of this book, specially about Exploratory Data Analysis (EDA) and Data Preprocessing. The title of Chapter 3 is Data Pre-processing. Let’s go to the exercises!
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
data(Glass) # load data set
str(Glass) # view data set structure
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Question 1: using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
To visualize distributions, we can plot a density graph of all predictors:
# load packages
library(tidyr); library(ggplot2)
# density plot for all numeric variables
Glass %>%
purrr::keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_density()
To understand relationship between all predictors, we can make scatterplots and run correlations. All of this can be done with the function “ggpairs”:
Glass %>%
purrr::keep(is.numeric) %>%
GGally::ggpairs()
Question 2: Do there appear to be any outliers in the data? Are any predictors skewed?
Yes, almost all variables are skewed (some right-skewed, some left-skewed) and/or have outliers.
Question 3: Are there any relevant transformations of one or more predictors that might improve the classification model?
Yes, to deal with skewness we can run a “Box-Cox” transformation (in case of positive values); to deal with outliers, we can run a “Spatial Sign” transformation. Let’s do it and, after that, see how the variables look like:
library(caret) # load package
# run Box-Cox and Spatial Sign transformation
preProc <- preProcess(Glass[,-10], method = c("BoxCox", "spatialSign"))
# apply transformation to the data set
GlassT <- predict(preProc, Glass[,-10])
# density plot
GlassT %>%
purrr::keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_density()
Now, the variables are much more well-behaved, aren’t them?
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:
library(mlbench)
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
Question 4: Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Variable “Class” is the outcome. The others are predictors. We can inspect degenerated distributions numerically, visually and through function "Near Zero Variance. Let’s do the first two ways first:
Hmisc::describe(Soybean)
## Soybean
##
## 36 Variables 683 Observations
## --------------------------------------------------------------------------------
## Class
## n missing distinct
## 683 0 19
##
## lowest : 2-4-d-injury alternarialeaf-spot anthracnose bacterial-blight bacterial-pustule
## highest: phyllosticta-leaf-spot phytophthora-rot powdery-mildew purple-seed-stain rhizoctonia-root-rot
## --------------------------------------------------------------------------------
## date
## n missing distinct
## 682 1 7
##
## lowest : 0 1 2 3 4, highest: 2 3 4 5 6
##
## Value 0 1 2 3 4 5 6
## Frequency 26 75 93 118 131 149 90
## Proportion 0.038 0.110 0.136 0.173 0.192 0.218 0.132
## --------------------------------------------------------------------------------
## plant.stand
## n missing distinct
## 647 36 2
##
## Value 0 1
## Frequency 354 293
## Proportion 0.547 0.453
## --------------------------------------------------------------------------------
## precip
## n missing distinct
## 645 38 3
##
## Value 0 1 2
## Frequency 74 112 459
## Proportion 0.115 0.174 0.712
## --------------------------------------------------------------------------------
## temp
## n missing distinct
## 653 30 3
##
## Value 0 1 2
## Frequency 80 374 199
## Proportion 0.123 0.573 0.305
## --------------------------------------------------------------------------------
## hail
## n missing distinct
## 562 121 2
##
## Value 0 1
## Frequency 435 127
## Proportion 0.774 0.226
## --------------------------------------------------------------------------------
## crop.hist
## n missing distinct
## 667 16 4
##
## Value 0 1 2 3
## Frequency 65 165 219 218
## Proportion 0.097 0.247 0.328 0.327
## --------------------------------------------------------------------------------
## area.dam
## n missing distinct
## 682 1 4
##
## Value 0 1 2 3
## Frequency 123 227 145 187
## Proportion 0.180 0.333 0.213 0.274
## --------------------------------------------------------------------------------
## sever
## n missing distinct
## 562 121 3
##
## Value 0 1 2
## Frequency 195 322 45
## Proportion 0.347 0.573 0.080
## --------------------------------------------------------------------------------
## seed.tmt
## n missing distinct
## 562 121 3
##
## Value 0 1 2
## Frequency 305 222 35
## Proportion 0.543 0.395 0.062
## --------------------------------------------------------------------------------
## germ
## n missing distinct
## 571 112 3
##
## Value 0 1 2
## Frequency 165 213 193
## Proportion 0.289 0.373 0.338
## --------------------------------------------------------------------------------
## plant.growth
## n missing distinct
## 667 16 2
##
## Value 0 1
## Frequency 441 226
## Proportion 0.661 0.339
## --------------------------------------------------------------------------------
## leaves
## n missing distinct
## 683 0 2
##
## Value 0 1
## Frequency 77 606
## Proportion 0.113 0.887
## --------------------------------------------------------------------------------
## leaf.halo
## n missing distinct
## 599 84 3
##
## Value 0 1 2
## Frequency 221 36 342
## Proportion 0.369 0.060 0.571
## --------------------------------------------------------------------------------
## leaf.marg
## n missing distinct
## 599 84 3
##
## Value 0 1 2
## Frequency 357 21 221
## Proportion 0.596 0.035 0.369
## --------------------------------------------------------------------------------
## leaf.size
## n missing distinct
## 599 84 3
##
## Value 0 1 2
## Frequency 51 327 221
## Proportion 0.085 0.546 0.369
## --------------------------------------------------------------------------------
## leaf.shread
## n missing distinct
## 583 100 2
##
## Value 0 1
## Frequency 487 96
## Proportion 0.835 0.165
## --------------------------------------------------------------------------------
## leaf.malf
## n missing distinct
## 599 84 2
##
## Value 0 1
## Frequency 554 45
## Proportion 0.925 0.075
## --------------------------------------------------------------------------------
## leaf.mild
## n missing distinct
## 575 108 3
##
## Value 0 1 2
## Frequency 535 20 20
## Proportion 0.930 0.035 0.035
## --------------------------------------------------------------------------------
## stem
## n missing distinct
## 667 16 2
##
## Value 0 1
## Frequency 296 371
## Proportion 0.444 0.556
## --------------------------------------------------------------------------------
## lodging
## n missing distinct
## 562 121 2
##
## Value 0 1
## Frequency 520 42
## Proportion 0.925 0.075
## --------------------------------------------------------------------------------
## stem.cankers
## n missing distinct
## 645 38 4
##
## Value 0 1 2 3
## Frequency 379 39 36 191
## Proportion 0.588 0.060 0.056 0.296
## --------------------------------------------------------------------------------
## canker.lesion
## n missing distinct
## 645 38 4
##
## Value 0 1 2 3
## Frequency 320 83 177 65
## Proportion 0.496 0.129 0.274 0.101
## --------------------------------------------------------------------------------
## fruiting.bodies
## n missing distinct
## 577 106 2
##
## Value 0 1
## Frequency 473 104
## Proportion 0.82 0.18
## --------------------------------------------------------------------------------
## ext.decay
## n missing distinct
## 645 38 3
##
## Value 0 1 2
## Frequency 497 135 13
## Proportion 0.771 0.209 0.020
## --------------------------------------------------------------------------------
## mycelium
## n missing distinct
## 645 38 2
##
## Value 0 1
## Frequency 639 6
## Proportion 0.991 0.009
## --------------------------------------------------------------------------------
## int.discolor
## n missing distinct
## 645 38 3
##
## Value 0 1 2
## Frequency 581 44 20
## Proportion 0.901 0.068 0.031
## --------------------------------------------------------------------------------
## sclerotia
## n missing distinct
## 645 38 2
##
## Value 0 1
## Frequency 625 20
## Proportion 0.969 0.031
## --------------------------------------------------------------------------------
## fruit.pods
## n missing distinct
## 599 84 4
##
## Value 0 1 2 3
## Frequency 407 130 14 48
## Proportion 0.679 0.217 0.023 0.080
## --------------------------------------------------------------------------------
## fruit.spots
## n missing distinct
## 577 106 4
##
## Value 0 1 2 4
## Frequency 345 75 57 100
## Proportion 0.598 0.130 0.099 0.173
## --------------------------------------------------------------------------------
## seed
## n missing distinct
## 591 92 2
##
## Value 0 1
## Frequency 476 115
## Proportion 0.805 0.195
## --------------------------------------------------------------------------------
## mold.growth
## n missing distinct
## 591 92 2
##
## Value 0 1
## Frequency 524 67
## Proportion 0.887 0.113
## --------------------------------------------------------------------------------
## seed.discolor
## n missing distinct
## 577 106 2
##
## Value 0 1
## Frequency 513 64
## Proportion 0.889 0.111
## --------------------------------------------------------------------------------
## seed.size
## n missing distinct
## 591 92 2
##
## Value 0 1
## Frequency 532 59
## Proportion 0.9 0.1
## --------------------------------------------------------------------------------
## shriveling
## n missing distinct
## 577 106 2
##
## Value 0 1
## Frequency 539 38
## Proportion 0.934 0.066
## --------------------------------------------------------------------------------
## roots
## n missing distinct
## 652 31 3
##
## Value 0 1 2
## Frequency 551 86 15
## Proportion 0.845 0.132 0.023
## --------------------------------------------------------------------------------
# visual frequencies
library(gridExtra)
library(purrr)
marrangeGrob(
map(
names(Soybean),
~ ggplot(Soybean, aes_string(.x)) +
geom_bar()
),
ncol = 6,
nrow = 6,
top = "Soybean Distribution"
)
Now, let’s run function “Near Zero Variance”. What does this function do? It calculates:
When the first metric is low (under 10%) and the second is large (above 20), the algorithm suggests that we are dealing with variables with near zero variance. Let’s do it:
nzv <- nearZeroVar(Soybean[,-1], saveMetrics= TRUE)
library(dplyr)
filter(nzv, zeroVar == TRUE | nzv == TRUE)
## freqRatio percentUnique zeroVar nzv
## leaf.mild 26.75 0.4392387 FALSE TRUE
## mycelium 106.50 0.2928258 FALSE TRUE
## sclerotia 31.25 0.2928258 FALSE TRUE
Question 5: Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
Missing values and how to deal with them is an important and huge topic. First, is it true that roughly 18% of the data are missing? It seems not. Let’s see it:
# number, proportions and percentages of missingness on the entire data set
library(naniar)
cbind(c("Number of Mising Values",
"Number of Complete Values",
"Proportion of Missing Values",
"Proportion of Complete Values",
"Percentage of Missing Values",
"Percentage of complete Values"),
rbind(n_miss(Soybean),
n_complete(Soybean),
round(prop_miss(Soybean),4),
round(prop_complete(Soybean),4),
round(pct_miss(Soybean),2),
round(pct_complete(Soybean),2)))
## [,1] [,2]
## [1,] "Number of Mising Values" "2337"
## [2,] "Number of Complete Values" "22251"
## [3,] "Proportion of Missing Values" "0.095"
## [4,] "Proportion of Complete Values" "0.905"
## [5,] "Percentage of Missing Values" "9.5"
## [6,] "Percentage of complete Values" "90.5"
The correct is that 9.5% of the data are missing (half of 18%). But it is true that some variables has a lot more of missingness:
# frequency of missingness by variable
library(naniar)
gg_miss_var(Soybean, show_pct = TRUE)
Is the pattern of missingness related to the outcome (variable “Class”)? For synthesis purpose, We will subset just 5 predictors and run the analysis:
library(finalfit)
outcome <- "Class"
predictors1 <- colnames(Soybean[,2:6])
Soybean %>%
missing_pairs(outcome, predictors1)
Missing data are presented in grey color. On the last row, first column, we see that there are much more missing data on variable “hail” when the factor of variable Class is 16. What is the name of factor 16?
levels(Soybean$Class)
## [1] "2-4-d-injury" "alternarialeaf-spot"
## [3] "anthracnose" "bacterial-blight"
## [5] "bacterial-pustule" "brown-spot"
## [7] "brown-stem-rot" "charcoal-rot"
## [9] "cyst-nematode" "diaporthe-pod-&-stem-blight"
## [11] "diaporthe-stem-canker" "downy-mildew"
## [13] "frog-eye-leaf-spot" "herbicide-injury"
## [15] "phyllosticta-leaf-spot" "phytophthora-rot"
## [17] "powdery-mildew" "purple-seed-stain"
## [19] "rhizoctonia-root-rot"
It is “phytophthora-rot”. To be sure, let’s view the most frequent missing data on predictors, by variable “Class” (the outcome):
## number of missing data in each variable across the levels within Class
# order by number of missing data
library(naniar)
Soybean %>%
group_by(Class) %>%
miss_var_summary() %>%
arrange(desc(n_miss)) %>%
print(n=50)
## # A tibble: 665 x 4
## # Groups: Class [19]
## Class variable n_miss pct_miss
## <fct> <chr> <int> <dbl>
## 1 phytophthora-rot hail 68 77.3
## 2 phytophthora-rot sever 68 77.3
## 3 phytophthora-rot seed.tmt 68 77.3
## 4 phytophthora-rot germ 68 77.3
## 5 phytophthora-rot lodging 68 77.3
## 6 phytophthora-rot fruiting.bodies 68 77.3
## 7 phytophthora-rot fruit.pods 68 77.3
## 8 phytophthora-rot fruit.spots 68 77.3
## 9 phytophthora-rot seed 68 77.3
## 10 phytophthora-rot mold.growth 68 77.3
## 11 phytophthora-rot seed.discolor 68 77.3
## 12 phytophthora-rot seed.size 68 77.3
## 13 phytophthora-rot shriveling 68 77.3
## 14 phytophthora-rot leaf.halo 55 62.5
## 15 phytophthora-rot leaf.marg 55 62.5
## 16 phytophthora-rot leaf.size 55 62.5
## 17 phytophthora-rot leaf.shread 55 62.5
## 18 phytophthora-rot leaf.malf 55 62.5
## 19 phytophthora-rot leaf.mild 55 62.5
## 20 2-4-d-injury plant.stand 16 100
## 21 2-4-d-injury precip 16 100
## 22 2-4-d-injury temp 16 100
## 23 2-4-d-injury hail 16 100
## 24 2-4-d-injury crop.hist 16 100
## 25 2-4-d-injury sever 16 100
## 26 2-4-d-injury seed.tmt 16 100
## 27 2-4-d-injury germ 16 100
## 28 2-4-d-injury plant.growth 16 100
## 29 2-4-d-injury leaf.shread 16 100
## 30 2-4-d-injury leaf.mild 16 100
## 31 2-4-d-injury stem 16 100
## 32 2-4-d-injury lodging 16 100
## 33 2-4-d-injury stem.cankers 16 100
## 34 2-4-d-injury canker.lesion 16 100
## 35 2-4-d-injury fruiting.bodies 16 100
## 36 2-4-d-injury ext.decay 16 100
## 37 2-4-d-injury mycelium 16 100
## 38 2-4-d-injury int.discolor 16 100
## 39 2-4-d-injury sclerotia 16 100
## 40 2-4-d-injury fruit.pods 16 100
## 41 2-4-d-injury fruit.spots 16 100
## 42 2-4-d-injury seed 16 100
## 43 2-4-d-injury mold.growth 16 100
## 44 2-4-d-injury seed.discolor 16 100
## 45 2-4-d-injury seed.size 16 100
## 46 2-4-d-injury shriveling 16 100
## 47 2-4-d-injury roots 16 100
## 48 diaporthe-pod-&-stem-blight hail 15 100
## 49 diaporthe-pod-&-stem-blight sever 15 100
## 50 diaporthe-pod-&-stem-blight seed.tmt 15 100
## # ... with 615 more rows
That’s is correct. The most frequent missing data on other variables occurs on level “phytophthora-rot”. But, if we order, not by the number of missing data, but by the percentage of NAs, the order (and the levels on variable “Class”) will be different:
## percentage of missing data in each variable across the levels within Class
# order by percentage of missing data
Soybean %>%
group_by(Class) %>%
miss_var_summary() %>%
arrange(desc(pct_miss)) %>%
print(n=50)
## # A tibble: 665 x 4
## # Groups: Class [19]
## Class variable n_miss pct_miss
## <fct> <chr> <int> <dbl>
## 1 diaporthe-pod-&-stem-blight hail 15 100
## 2 diaporthe-pod-&-stem-blight sever 15 100
## 3 diaporthe-pod-&-stem-blight seed.tmt 15 100
## 4 diaporthe-pod-&-stem-blight leaf.halo 15 100
## 5 diaporthe-pod-&-stem-blight leaf.marg 15 100
## 6 diaporthe-pod-&-stem-blight leaf.size 15 100
## 7 diaporthe-pod-&-stem-blight leaf.shread 15 100
## 8 diaporthe-pod-&-stem-blight leaf.malf 15 100
## 9 diaporthe-pod-&-stem-blight leaf.mild 15 100
## 10 diaporthe-pod-&-stem-blight lodging 15 100
## 11 diaporthe-pod-&-stem-blight roots 15 100
## 12 cyst-nematode plant.stand 14 100
## 13 cyst-nematode precip 14 100
## 14 cyst-nematode temp 14 100
## 15 cyst-nematode hail 14 100
## 16 cyst-nematode sever 14 100
## 17 cyst-nematode seed.tmt 14 100
## 18 cyst-nematode germ 14 100
## 19 cyst-nematode leaf.halo 14 100
## 20 cyst-nematode leaf.marg 14 100
## 21 cyst-nematode leaf.size 14 100
## 22 cyst-nematode leaf.shread 14 100
## 23 cyst-nematode leaf.malf 14 100
## 24 cyst-nematode leaf.mild 14 100
## 25 cyst-nematode lodging 14 100
## 26 cyst-nematode stem.cankers 14 100
## 27 cyst-nematode canker.lesion 14 100
## 28 cyst-nematode fruiting.bodies 14 100
## 29 cyst-nematode ext.decay 14 100
## 30 cyst-nematode mycelium 14 100
## 31 cyst-nematode int.discolor 14 100
## 32 cyst-nematode sclerotia 14 100
## 33 cyst-nematode fruit.spots 14 100
## 34 cyst-nematode seed.discolor 14 100
## 35 cyst-nematode shriveling 14 100
## 36 2-4-d-injury plant.stand 16 100
## 37 2-4-d-injury precip 16 100
## 38 2-4-d-injury temp 16 100
## 39 2-4-d-injury hail 16 100
## 40 2-4-d-injury crop.hist 16 100
## 41 2-4-d-injury sever 16 100
## 42 2-4-d-injury seed.tmt 16 100
## 43 2-4-d-injury germ 16 100
## 44 2-4-d-injury plant.growth 16 100
## 45 2-4-d-injury leaf.shread 16 100
## 46 2-4-d-injury leaf.mild 16 100
## 47 2-4-d-injury stem 16 100
## 48 2-4-d-injury lodging 16 100
## 49 2-4-d-injury stem.cankers 16 100
## 50 2-4-d-injury canker.lesion 16 100
## # ... with 615 more rows
Finally, we can model missingness, using a classification algorithm:
# model missingness
library(rpart)
library(rpart.plot)
Soybean %>%
add_prop_miss() %>%
rpart(prop_miss_all ~ ., data = .) %>%
prp(type = 4, extra = 101, prefix = "Prop. Miss = ", roundint = FALSE)
Question 6: Develop a strategy for handling missing data, either by eliminating predictors or imputation.
As it became evident, it is temerarious to exclude missing values. There is too much evidence that they are not missing at random (MAR). Missing values on many predictors tend to occur on levels 1 (“2-4-d-injury”), 9 (“cyst-nematode”), 10 (“diaporthe-pod-&-stem-blight”), 14 (“herbicide-injury”) and 16 (“phytophthora-rot”). A better ideia is to imputate them. We will use Random Forest algorithm to run imputation:
# imputation
library(missForest)
library(dplyr)
Soybean.imp <- missForest(select(Soybean, -"Class"),
ntree = 1000, verbose = T)
## missForest iteration 1 in progress...done!
## estimated error(s): 0.1122898
## difference(s): 0.02162727
## time: 25.39 seconds
##
## missForest iteration 2 in progress...done!
## estimated error(s): 0.1108225
## difference(s): 0.00556369
## time: 21.04 seconds
##
## missForest iteration 3 in progress...done!
## estimated error(s): 0.1110342
## difference(s): 0.003011922
## time: 21.77 seconds
##
## missForest iteration 4 in progress...done!
## estimated error(s): 0.1118427
## difference(s): 0.00167329
## time: 20.67 seconds
##
## missForest iteration 5 in progress...done!
## estimated error(s): 0.1112738
## difference(s): 0.0008366451
## time: 24.66 seconds
##
## missForest iteration 6 in progress...done!
## estimated error(s): 0.1119939
## difference(s): 0.0005438193
## time: 26.37 seconds
##
## missForest iteration 7 in progress...done!
## estimated error(s): 0.1117846
## difference(s): 0.001045806
## time: 22.86 seconds
SoybeanImpute <- data.frame(Soybean.imp$ximp)
SoybeanNew <- as.data.frame(cbind(Soybean$Class, SoybeanImpute))
Chapter 5 introduces Quantitative Structure-Activity Relationship (QSAR) modeling where the characteristics of a chemical compound are used to predict other chemical properties. The caret package contains a QSAR data set from Mente and Lombardo (2005). Here, the ability of a chemical to permeate the blood-brain barrier was experimentally determined for 208 compounds. 134 descriptors were measured for each compound. Start R and use these commands to load the data:
library(caret)
data(BloodBrain)
dim(bbbDescr)
## [1] 208 134
The numeric outcome is contained in the vector logBBB while the predictors are in the data frame bbbDescr.
Question 7: Do any of the individual predictors have degenerate distributions?
In this case, there are too much predictors (134), so we are not going to inspect distributions visually. Insted, we are going to calculate skewness, and print just the variables which skewness are higher (let’s say, above 2 or under - 2).
library(moments)
skewVar <- as.data.frame(skewness(bbbDescr, na.rm = FALSE))
colnames(skewVar)[1] <- "Skewness"
skewVar %>%
arrange(desc(Skewness)) %>%
filter(Skewness > 2 | Skewness < -2) %>%
print()
## Skewness
## negative 14.317990
## alert 10.050359
## a_acid 5.439858
## vsa_acid 5.439858
## tcpa 5.369967
## frac.anion7. 4.324616
## wpsa2 4.071604
## inthb 3.995614
## peoe_vsa.2.1 2.972059
## vsa_base 2.886451
## ppsa2 2.839157
## dpsa2 2.725506
## peoe_vsa.3.1 2.681897
## vsa_don 2.460145
## smr_vsa4 2.311704
## rule.of.5violations 2.220560
## peoe_vsa.3 2.175823
## adistd 2.162144
## peoe_vsa.2 2.107852
## adistm 2.057717
## slogp_vsa6 2.052164
## psa_npsa 2.021547
## wnsa2 -3.142954
As we did before, let’s present predictors that have zero or near zero variance:
# near zero variance
nzv <- nearZeroVar(bbbDescr, saveMetrics= TRUE)
filter(nzv, nzv==TRUE | zeroVar==TRUE)
## freqRatio percentUnique zeroVar nzv
## negative 207.00000 0.9615385 FALSE TRUE
## peoe_vsa.2.1 25.57143 5.7692308 FALSE TRUE
## peoe_vsa.3.1 21.00000 7.2115385 FALSE TRUE
## a_acid 33.50000 1.4423077 FALSE TRUE
## vsa_acid 33.50000 1.4423077 FALSE TRUE
## frac.anion7. 47.75000 5.7692308 FALSE TRUE
## alert 103.00000 0.9615385 FALSE TRUE
Question 8: Generally speaking, are there strong relationships between the predictor data? If so, how could correlations in the predictor set be reduced? Does this have a dramatic effect on the number of predictors available for modeling?
Yes, there are; this is very common between chemical compounds. To prove that, we will build a matrix correlation and, using the function “findCorrelation”, we are going to present predictors that are highly correlated (let’s say, above 0.75 or under -0.75):
### strong relations between predictors
corPred <- cor(bbbDescr)
highCorr <- findCorrelation(corPred, .75)
colnames(bbbDescr)[highCorr]
## [1] "vsa_don" "slogp_vsa2" "slogp_vsa7"
## [4] "smr_vsa0" "smr_vsa5" "mw"
## [7] "nocount" "hbdnr" "ub"
## [10] "nonpolar_area" "tcnp" "ovality"
## [13] "surface_area" "volume" "ppsa1"
## [16] "ppsa2" "ppsa3" "pnsa2"
## [19] "pnsa3" "fpsa2" "fnsa2"
## [22] "fnsa3" "wpsa1" "wpsa2"
## [25] "wpsa3" "wnsa1" "wnsa2"
## [28] "wnsa3" "dpsa1" "dpsa2"
## [31] "dpsa3" "sadh1" "sadh3"
## [34] "chdh1" "chdh3" "scdh1"
## [37] "scdh2" "scdh3" "saaa1"
## [40] "saaa3" "scaa1" "scaa2"
## [43] "scaa3" "ctdh" "ctaa"
## [46] "mchg" "vsa_hyd" "tpsa"
## [49] "a_acid" "a_base" "vsa_acc"
## [52] "slogp_vsa3" "weight" "logp.o.w."
## [55] "tpsa.1" "a_acc" "adistm"
## [58] "polar_area" "psa_npsa" "homo"
## [61] "sum_absolute_charge" "fpsa1" "fpsa3"
## [64] "sadh2" "chaa1" "chdh2"
We can use two strategies to deal with highly correlated predictors: 1) exclude some of them; 2) run principal component analysis. About first strategy, again, function “findCorrelation” is very useful. How does it work? The algorithm:
Let’s remove predictors based on “findCorrelation” and see predictors’ reduction:
# strategy 1: remove predictors highly correlated
dim(bbbDescr)
## [1] 208 134
bbbDescrFilter <- bbbDescr[,-highCorr]
dim(bbbDescrFilter)
## [1] 208 68
Finally, we can run a principal component analysis:
# principal component analysis
preProc <- preProcess(bbbDescr, method="pca")
# apply principal component analysis to the data set
bbbDescrPC <- predict(preProc, bbbDescr)
dim(bbbDescrPC)
## [1] 208 31
In conclusion, in the first strategy, we reduce predictors from 134 to 68 (66 predictors were removed). Using principal component analysis, predictors were reduced from 134 to 31 (103 predictors of difference).