Introduction

Applied Predictive Modeling (2013) is one of the most amazing books about machine learning. It was written by Max Kuhn and Kjell Johnson; the first created the caret package on R. For R users, Applied Predictive Modeling (2013) is a mandatory reading.

In this post, I will share some great things that I have learned while practicing the exercises from Chapter 3 of this book, specially about Exploratory Data Analysis (EDA) and Data Preprocessing. The title of Chapter 3 is Data Pre-processing. Let’s go to the exercises!

Exercise 1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)
data(Glass) # load data set
str(Glass) # view data set structure
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Question 1: using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

To visualize distributions, we can plot a density graph of all predictors:

# load packages
library(tidyr); library(ggplot2)
# density plot for all numeric variables
Glass %>%
  purrr::keep(is.numeric) %>%   
  gather() %>%                  
  ggplot(aes(value)) + 
  facet_wrap(~ key, scales = "free") +
  geom_density()

To understand relationship between all predictors, we can make scatterplots and run correlations. All of this can be done with the function “ggpairs”:

Glass %>%
  purrr::keep(is.numeric) %>%
  GGally::ggpairs()

Question 2: Do there appear to be any outliers in the data? Are any predictors skewed?

Yes, almost all variables are skewed (some right-skewed, some left-skewed) and/or have outliers.

Question 3: Are there any relevant transformations of one or more predictors that might improve the classification model?

Yes, to deal with skewness we can run a “Box-Cox” transformation (in case of positive values); to deal with outliers, we can run a “Spatial Sign” transformation. Let’s do it and, after that, see how the variables look like:

library(caret) # load package
# run Box-Cox and Spatial Sign transformation
preProc <- preProcess(Glass[,-10], method = c("BoxCox", "spatialSign"))
# apply transformation to the data set
GlassT <- predict(preProc, Glass[,-10])
# density plot
GlassT %>%
  purrr::keep(is.numeric) %>%   
  gather() %>%                  
  ggplot(aes(value)) + 
  facet_wrap(~ key, scales = "free") +
  geom_density()

Now, the variables are much more well-behaved, aren’t them?

Exercise 2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

library(mlbench)
data(Soybean)
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

Question 4: Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Variable “Class” is the outcome. The others are predictors. We can inspect degenerated distributions numerically, visually and through function "Near Zero Variance. Let’s do the first two ways first:

Hmisc::describe(Soybean)
## Soybean 
## 
##  36  Variables      683  Observations
## --------------------------------------------------------------------------------
## Class 
##        n  missing distinct 
##      683        0       19 
## 
## lowest : 2-4-d-injury           alternarialeaf-spot    anthracnose            bacterial-blight       bacterial-pustule     
## highest: phyllosticta-leaf-spot phytophthora-rot       powdery-mildew         purple-seed-stain      rhizoctonia-root-rot  
## --------------------------------------------------------------------------------
## date 
##        n  missing distinct 
##      682        1        7 
## 
## lowest : 0 1 2 3 4, highest: 2 3 4 5 6
##                                                     
## Value          0     1     2     3     4     5     6
## Frequency     26    75    93   118   131   149    90
## Proportion 0.038 0.110 0.136 0.173 0.192 0.218 0.132
## --------------------------------------------------------------------------------
## plant.stand 
##        n  missing distinct 
##      647       36        2 
##                       
## Value          0     1
## Frequency    354   293
## Proportion 0.547 0.453
## --------------------------------------------------------------------------------
## precip 
##        n  missing distinct 
##      645       38        3 
##                             
## Value          0     1     2
## Frequency     74   112   459
## Proportion 0.115 0.174 0.712
## --------------------------------------------------------------------------------
## temp 
##        n  missing distinct 
##      653       30        3 
##                             
## Value          0     1     2
## Frequency     80   374   199
## Proportion 0.123 0.573 0.305
## --------------------------------------------------------------------------------
## hail 
##        n  missing distinct 
##      562      121        2 
##                       
## Value          0     1
## Frequency    435   127
## Proportion 0.774 0.226
## --------------------------------------------------------------------------------
## crop.hist 
##        n  missing distinct 
##      667       16        4 
##                                   
## Value          0     1     2     3
## Frequency     65   165   219   218
## Proportion 0.097 0.247 0.328 0.327
## --------------------------------------------------------------------------------
## area.dam 
##        n  missing distinct 
##      682        1        4 
##                                   
## Value          0     1     2     3
## Frequency    123   227   145   187
## Proportion 0.180 0.333 0.213 0.274
## --------------------------------------------------------------------------------
## sever 
##        n  missing distinct 
##      562      121        3 
##                             
## Value          0     1     2
## Frequency    195   322    45
## Proportion 0.347 0.573 0.080
## --------------------------------------------------------------------------------
## seed.tmt 
##        n  missing distinct 
##      562      121        3 
##                             
## Value          0     1     2
## Frequency    305   222    35
## Proportion 0.543 0.395 0.062
## --------------------------------------------------------------------------------
## germ 
##        n  missing distinct 
##      571      112        3 
##                             
## Value          0     1     2
## Frequency    165   213   193
## Proportion 0.289 0.373 0.338
## --------------------------------------------------------------------------------
## plant.growth 
##        n  missing distinct 
##      667       16        2 
##                       
## Value          0     1
## Frequency    441   226
## Proportion 0.661 0.339
## --------------------------------------------------------------------------------
## leaves 
##        n  missing distinct 
##      683        0        2 
##                       
## Value          0     1
## Frequency     77   606
## Proportion 0.113 0.887
## --------------------------------------------------------------------------------
## leaf.halo 
##        n  missing distinct 
##      599       84        3 
##                             
## Value          0     1     2
## Frequency    221    36   342
## Proportion 0.369 0.060 0.571
## --------------------------------------------------------------------------------
## leaf.marg 
##        n  missing distinct 
##      599       84        3 
##                             
## Value          0     1     2
## Frequency    357    21   221
## Proportion 0.596 0.035 0.369
## --------------------------------------------------------------------------------
## leaf.size 
##        n  missing distinct 
##      599       84        3 
##                             
## Value          0     1     2
## Frequency     51   327   221
## Proportion 0.085 0.546 0.369
## --------------------------------------------------------------------------------
## leaf.shread 
##        n  missing distinct 
##      583      100        2 
##                       
## Value          0     1
## Frequency    487    96
## Proportion 0.835 0.165
## --------------------------------------------------------------------------------
## leaf.malf 
##        n  missing distinct 
##      599       84        2 
##                       
## Value          0     1
## Frequency    554    45
## Proportion 0.925 0.075
## --------------------------------------------------------------------------------
## leaf.mild 
##        n  missing distinct 
##      575      108        3 
##                             
## Value          0     1     2
## Frequency    535    20    20
## Proportion 0.930 0.035 0.035
## --------------------------------------------------------------------------------
## stem 
##        n  missing distinct 
##      667       16        2 
##                       
## Value          0     1
## Frequency    296   371
## Proportion 0.444 0.556
## --------------------------------------------------------------------------------
## lodging 
##        n  missing distinct 
##      562      121        2 
##                       
## Value          0     1
## Frequency    520    42
## Proportion 0.925 0.075
## --------------------------------------------------------------------------------
## stem.cankers 
##        n  missing distinct 
##      645       38        4 
##                                   
## Value          0     1     2     3
## Frequency    379    39    36   191
## Proportion 0.588 0.060 0.056 0.296
## --------------------------------------------------------------------------------
## canker.lesion 
##        n  missing distinct 
##      645       38        4 
##                                   
## Value          0     1     2     3
## Frequency    320    83   177    65
## Proportion 0.496 0.129 0.274 0.101
## --------------------------------------------------------------------------------
## fruiting.bodies 
##        n  missing distinct 
##      577      106        2 
##                     
## Value         0    1
## Frequency   473  104
## Proportion 0.82 0.18
## --------------------------------------------------------------------------------
## ext.decay 
##        n  missing distinct 
##      645       38        3 
##                             
## Value          0     1     2
## Frequency    497   135    13
## Proportion 0.771 0.209 0.020
## --------------------------------------------------------------------------------
## mycelium 
##        n  missing distinct 
##      645       38        2 
##                       
## Value          0     1
## Frequency    639     6
## Proportion 0.991 0.009
## --------------------------------------------------------------------------------
## int.discolor 
##        n  missing distinct 
##      645       38        3 
##                             
## Value          0     1     2
## Frequency    581    44    20
## Proportion 0.901 0.068 0.031
## --------------------------------------------------------------------------------
## sclerotia 
##        n  missing distinct 
##      645       38        2 
##                       
## Value          0     1
## Frequency    625    20
## Proportion 0.969 0.031
## --------------------------------------------------------------------------------
## fruit.pods 
##        n  missing distinct 
##      599       84        4 
##                                   
## Value          0     1     2     3
## Frequency    407   130    14    48
## Proportion 0.679 0.217 0.023 0.080
## --------------------------------------------------------------------------------
## fruit.spots 
##        n  missing distinct 
##      577      106        4 
##                                   
## Value          0     1     2     4
## Frequency    345    75    57   100
## Proportion 0.598 0.130 0.099 0.173
## --------------------------------------------------------------------------------
## seed 
##        n  missing distinct 
##      591       92        2 
##                       
## Value          0     1
## Frequency    476   115
## Proportion 0.805 0.195
## --------------------------------------------------------------------------------
## mold.growth 
##        n  missing distinct 
##      591       92        2 
##                       
## Value          0     1
## Frequency    524    67
## Proportion 0.887 0.113
## --------------------------------------------------------------------------------
## seed.discolor 
##        n  missing distinct 
##      577      106        2 
##                       
## Value          0     1
## Frequency    513    64
## Proportion 0.889 0.111
## --------------------------------------------------------------------------------
## seed.size 
##        n  missing distinct 
##      591       92        2 
##                   
## Value        0   1
## Frequency  532  59
## Proportion 0.9 0.1
## --------------------------------------------------------------------------------
## shriveling 
##        n  missing distinct 
##      577      106        2 
##                       
## Value          0     1
## Frequency    539    38
## Proportion 0.934 0.066
## --------------------------------------------------------------------------------
## roots 
##        n  missing distinct 
##      652       31        3 
##                             
## Value          0     1     2
## Frequency    551    86    15
## Proportion 0.845 0.132 0.023
## --------------------------------------------------------------------------------
# visual frequencies
library(gridExtra)
library(purrr)
marrangeGrob(
  map(
    names(Soybean), 
    ~ ggplot(Soybean, aes_string(.x)) + 
      geom_bar()
  ),
  ncol = 6,
  nrow = 6,
  top = "Soybean Distribution"
)

Now, let’s run function “Near Zero Variance”. What does this function do? It calculates:

  1. The fraction of unique values over the sample size (percentUnique);
  2. The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value (freqRatio)

When the first metric is low (under 10%) and the second is large (above 20), the algorithm suggests that we are dealing with variables with near zero variance. Let’s do it:

nzv <- nearZeroVar(Soybean[,-1], saveMetrics= TRUE)
library(dplyr)
filter(nzv, zeroVar == TRUE | nzv == TRUE)
##           freqRatio percentUnique zeroVar  nzv
## leaf.mild     26.75     0.4392387   FALSE TRUE
## mycelium     106.50     0.2928258   FALSE TRUE
## sclerotia     31.25     0.2928258   FALSE TRUE

Question 5: Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Missing values and how to deal with them is an important and huge topic. First, is it true that roughly 18% of the data are missing? It seems not. Let’s see it:

# number, proportions and percentages of missingness on the entire data set
library(naniar)
cbind(c("Number of Mising Values", 
        "Number of Complete Values",
        "Proportion of Missing Values",
        "Proportion of Complete Values",
        "Percentage of Missing Values",
        "Percentage of complete Values"),
      rbind(n_miss(Soybean),
            n_complete(Soybean),
            round(prop_miss(Soybean),4),
            round(prop_complete(Soybean),4),
            round(pct_miss(Soybean),2),
            round(pct_complete(Soybean),2)))
##      [,1]                            [,2]   
## [1,] "Number of Mising Values"       "2337" 
## [2,] "Number of Complete Values"     "22251"
## [3,] "Proportion of Missing Values"  "0.095"
## [4,] "Proportion of Complete Values" "0.905"
## [5,] "Percentage of Missing Values"  "9.5"  
## [6,] "Percentage of complete Values" "90.5"

The correct is that 9.5% of the data are missing (half of 18%). But it is true that some variables has a lot more of missingness:

# frequency of missingness by variable
library(naniar)
gg_miss_var(Soybean, show_pct = TRUE)

Is the pattern of missingness related to the outcome (variable “Class”)? For synthesis purpose, We will subset just 5 predictors and run the analysis:

library(finalfit)
outcome <- "Class"
predictors1 <- colnames(Soybean[,2:6])
Soybean %>% 
  missing_pairs(outcome, predictors1)

Missing data are presented in grey color. On the last row, first column, we see that there are much more missing data on variable “hail” when the factor of variable Class is 16. What is the name of factor 16?

levels(Soybean$Class)
##  [1] "2-4-d-injury"                "alternarialeaf-spot"        
##  [3] "anthracnose"                 "bacterial-blight"           
##  [5] "bacterial-pustule"           "brown-spot"                 
##  [7] "brown-stem-rot"              "charcoal-rot"               
##  [9] "cyst-nematode"               "diaporthe-pod-&-stem-blight"
## [11] "diaporthe-stem-canker"       "downy-mildew"               
## [13] "frog-eye-leaf-spot"          "herbicide-injury"           
## [15] "phyllosticta-leaf-spot"      "phytophthora-rot"           
## [17] "powdery-mildew"              "purple-seed-stain"          
## [19] "rhizoctonia-root-rot"

It is “phytophthora-rot”. To be sure, let’s view the most frequent missing data on predictors, by variable “Class” (the outcome):

## number of missing data in each variable across the levels within Class
# order by number of missing data
library(naniar)
Soybean %>%
  group_by(Class) %>%
  miss_var_summary() %>%
  arrange(desc(n_miss)) %>%
  print(n=50)
## # A tibble: 665 x 4
## # Groups:   Class [19]
##    Class                       variable        n_miss pct_miss
##    <fct>                       <chr>            <int>    <dbl>
##  1 phytophthora-rot            hail                68     77.3
##  2 phytophthora-rot            sever               68     77.3
##  3 phytophthora-rot            seed.tmt            68     77.3
##  4 phytophthora-rot            germ                68     77.3
##  5 phytophthora-rot            lodging             68     77.3
##  6 phytophthora-rot            fruiting.bodies     68     77.3
##  7 phytophthora-rot            fruit.pods          68     77.3
##  8 phytophthora-rot            fruit.spots         68     77.3
##  9 phytophthora-rot            seed                68     77.3
## 10 phytophthora-rot            mold.growth         68     77.3
## 11 phytophthora-rot            seed.discolor       68     77.3
## 12 phytophthora-rot            seed.size           68     77.3
## 13 phytophthora-rot            shriveling          68     77.3
## 14 phytophthora-rot            leaf.halo           55     62.5
## 15 phytophthora-rot            leaf.marg           55     62.5
## 16 phytophthora-rot            leaf.size           55     62.5
## 17 phytophthora-rot            leaf.shread         55     62.5
## 18 phytophthora-rot            leaf.malf           55     62.5
## 19 phytophthora-rot            leaf.mild           55     62.5
## 20 2-4-d-injury                plant.stand         16    100  
## 21 2-4-d-injury                precip              16    100  
## 22 2-4-d-injury                temp                16    100  
## 23 2-4-d-injury                hail                16    100  
## 24 2-4-d-injury                crop.hist           16    100  
## 25 2-4-d-injury                sever               16    100  
## 26 2-4-d-injury                seed.tmt            16    100  
## 27 2-4-d-injury                germ                16    100  
## 28 2-4-d-injury                plant.growth        16    100  
## 29 2-4-d-injury                leaf.shread         16    100  
## 30 2-4-d-injury                leaf.mild           16    100  
## 31 2-4-d-injury                stem                16    100  
## 32 2-4-d-injury                lodging             16    100  
## 33 2-4-d-injury                stem.cankers        16    100  
## 34 2-4-d-injury                canker.lesion       16    100  
## 35 2-4-d-injury                fruiting.bodies     16    100  
## 36 2-4-d-injury                ext.decay           16    100  
## 37 2-4-d-injury                mycelium            16    100  
## 38 2-4-d-injury                int.discolor        16    100  
## 39 2-4-d-injury                sclerotia           16    100  
## 40 2-4-d-injury                fruit.pods          16    100  
## 41 2-4-d-injury                fruit.spots         16    100  
## 42 2-4-d-injury                seed                16    100  
## 43 2-4-d-injury                mold.growth         16    100  
## 44 2-4-d-injury                seed.discolor       16    100  
## 45 2-4-d-injury                seed.size           16    100  
## 46 2-4-d-injury                shriveling          16    100  
## 47 2-4-d-injury                roots               16    100  
## 48 diaporthe-pod-&-stem-blight hail                15    100  
## 49 diaporthe-pod-&-stem-blight sever               15    100  
## 50 diaporthe-pod-&-stem-blight seed.tmt            15    100  
## # ... with 615 more rows

That’s is correct. The most frequent missing data on other variables occurs on level “phytophthora-rot”. But, if we order, not by the number of missing data, but by the percentage of NAs, the order (and the levels on variable “Class”) will be different:

## percentage of missing data in each variable across the levels within Class
# order by percentage of missing data
Soybean %>%
  group_by(Class) %>%
  miss_var_summary() %>%
  arrange(desc(pct_miss)) %>%
  print(n=50)
## # A tibble: 665 x 4
## # Groups:   Class [19]
##    Class                       variable        n_miss pct_miss
##    <fct>                       <chr>            <int>    <dbl>
##  1 diaporthe-pod-&-stem-blight hail                15      100
##  2 diaporthe-pod-&-stem-blight sever               15      100
##  3 diaporthe-pod-&-stem-blight seed.tmt            15      100
##  4 diaporthe-pod-&-stem-blight leaf.halo           15      100
##  5 diaporthe-pod-&-stem-blight leaf.marg           15      100
##  6 diaporthe-pod-&-stem-blight leaf.size           15      100
##  7 diaporthe-pod-&-stem-blight leaf.shread         15      100
##  8 diaporthe-pod-&-stem-blight leaf.malf           15      100
##  9 diaporthe-pod-&-stem-blight leaf.mild           15      100
## 10 diaporthe-pod-&-stem-blight lodging             15      100
## 11 diaporthe-pod-&-stem-blight roots               15      100
## 12 cyst-nematode               plant.stand         14      100
## 13 cyst-nematode               precip              14      100
## 14 cyst-nematode               temp                14      100
## 15 cyst-nematode               hail                14      100
## 16 cyst-nematode               sever               14      100
## 17 cyst-nematode               seed.tmt            14      100
## 18 cyst-nematode               germ                14      100
## 19 cyst-nematode               leaf.halo           14      100
## 20 cyst-nematode               leaf.marg           14      100
## 21 cyst-nematode               leaf.size           14      100
## 22 cyst-nematode               leaf.shread         14      100
## 23 cyst-nematode               leaf.malf           14      100
## 24 cyst-nematode               leaf.mild           14      100
## 25 cyst-nematode               lodging             14      100
## 26 cyst-nematode               stem.cankers        14      100
## 27 cyst-nematode               canker.lesion       14      100
## 28 cyst-nematode               fruiting.bodies     14      100
## 29 cyst-nematode               ext.decay           14      100
## 30 cyst-nematode               mycelium            14      100
## 31 cyst-nematode               int.discolor        14      100
## 32 cyst-nematode               sclerotia           14      100
## 33 cyst-nematode               fruit.spots         14      100
## 34 cyst-nematode               seed.discolor       14      100
## 35 cyst-nematode               shriveling          14      100
## 36 2-4-d-injury                plant.stand         16      100
## 37 2-4-d-injury                precip              16      100
## 38 2-4-d-injury                temp                16      100
## 39 2-4-d-injury                hail                16      100
## 40 2-4-d-injury                crop.hist           16      100
## 41 2-4-d-injury                sever               16      100
## 42 2-4-d-injury                seed.tmt            16      100
## 43 2-4-d-injury                germ                16      100
## 44 2-4-d-injury                plant.growth        16      100
## 45 2-4-d-injury                leaf.shread         16      100
## 46 2-4-d-injury                leaf.mild           16      100
## 47 2-4-d-injury                stem                16      100
## 48 2-4-d-injury                lodging             16      100
## 49 2-4-d-injury                stem.cankers        16      100
## 50 2-4-d-injury                canker.lesion       16      100
## # ... with 615 more rows

Finally, we can model missingness, using a classification algorithm:

# model missingness
library(rpart)
library(rpart.plot)
Soybean %>%
  add_prop_miss() %>%
  rpart(prop_miss_all ~ ., data = .) %>%
  prp(type = 4, extra = 101, prefix = "Prop. Miss = ", roundint = FALSE)

Question 6: Develop a strategy for handling missing data, either by eliminating predictors or imputation.

As it became evident, it is temerarious to exclude missing values. There is too much evidence that they are not missing at random (MAR). Missing values on many predictors tend to occur on levels 1 (“2-4-d-injury”), 9 (“cyst-nematode”), 10 (“diaporthe-pod-&-stem-blight”), 14 (“herbicide-injury”) and 16 (“phytophthora-rot”). A better ideia is to imputate them. We will use Random Forest algorithm to run imputation:

# imputation
library(missForest)
library(dplyr)
Soybean.imp <- missForest(select(Soybean, -"Class"), 
                          ntree = 1000, verbose = T)
##   missForest iteration 1 in progress...done!
##     estimated error(s): 0.1122898 
##     difference(s): 0.02162727 
##     time: 25.39 seconds
## 
##   missForest iteration 2 in progress...done!
##     estimated error(s): 0.1108225 
##     difference(s): 0.00556369 
##     time: 21.04 seconds
## 
##   missForest iteration 3 in progress...done!
##     estimated error(s): 0.1110342 
##     difference(s): 0.003011922 
##     time: 21.77 seconds
## 
##   missForest iteration 4 in progress...done!
##     estimated error(s): 0.1118427 
##     difference(s): 0.00167329 
##     time: 20.67 seconds
## 
##   missForest iteration 5 in progress...done!
##     estimated error(s): 0.1112738 
##     difference(s): 0.0008366451 
##     time: 24.66 seconds
## 
##   missForest iteration 6 in progress...done!
##     estimated error(s): 0.1119939 
##     difference(s): 0.0005438193 
##     time: 26.37 seconds
## 
##   missForest iteration 7 in progress...done!
##     estimated error(s): 0.1117846 
##     difference(s): 0.001045806 
##     time: 22.86 seconds
SoybeanImpute <- data.frame(Soybean.imp$ximp)
SoybeanNew <- as.data.frame(cbind(Soybean$Class, SoybeanImpute))

Exercise 3

Chapter 5 introduces Quantitative Structure-Activity Relationship (QSAR) modeling where the characteristics of a chemical compound are used to predict other chemical properties. The caret package contains a QSAR data set from Mente and Lombardo (2005). Here, the ability of a chemical to permeate the blood-brain barrier was experimentally determined for 208 compounds. 134 descriptors were measured for each compound. Start R and use these commands to load the data:

library(caret)
data(BloodBrain)
dim(bbbDescr)
## [1] 208 134

The numeric outcome is contained in the vector logBBB while the predictors are in the data frame bbbDescr.

Question 7: Do any of the individual predictors have degenerate distributions?

In this case, there are too much predictors (134), so we are not going to inspect distributions visually. Insted, we are going to calculate skewness, and print just the variables which skewness are higher (let’s say, above 2 or under - 2).

library(moments)
skewVar <- as.data.frame(skewness(bbbDescr, na.rm = FALSE))
colnames(skewVar)[1] <- "Skewness"
skewVar %>% 
  arrange(desc(Skewness)) %>%
  filter(Skewness > 2 | Skewness < -2) %>%
  print()
##                      Skewness
## negative            14.317990
## alert               10.050359
## a_acid               5.439858
## vsa_acid             5.439858
## tcpa                 5.369967
## frac.anion7.         4.324616
## wpsa2                4.071604
## inthb                3.995614
## peoe_vsa.2.1         2.972059
## vsa_base             2.886451
## ppsa2                2.839157
## dpsa2                2.725506
## peoe_vsa.3.1         2.681897
## vsa_don              2.460145
## smr_vsa4             2.311704
## rule.of.5violations  2.220560
## peoe_vsa.3           2.175823
## adistd               2.162144
## peoe_vsa.2           2.107852
## adistm               2.057717
## slogp_vsa6           2.052164
## psa_npsa             2.021547
## wnsa2               -3.142954

As we did before, let’s present predictors that have zero or near zero variance:

# near zero variance
nzv <- nearZeroVar(bbbDescr, saveMetrics= TRUE)
filter(nzv, nzv==TRUE | zeroVar==TRUE)
##              freqRatio percentUnique zeroVar  nzv
## negative     207.00000     0.9615385   FALSE TRUE
## peoe_vsa.2.1  25.57143     5.7692308   FALSE TRUE
## peoe_vsa.3.1  21.00000     7.2115385   FALSE TRUE
## a_acid        33.50000     1.4423077   FALSE TRUE
## vsa_acid      33.50000     1.4423077   FALSE TRUE
## frac.anion7.  47.75000     5.7692308   FALSE TRUE
## alert        103.00000     0.9615385   FALSE TRUE

Question 8: Generally speaking, are there strong relationships between the predictor data? If so, how could correlations in the predictor set be reduced? Does this have a dramatic effect on the number of predictors available for modeling?

Yes, there are; this is very common between chemical compounds. To prove that, we will build a matrix correlation and, using the function “findCorrelation”, we are going to present predictors that are highly correlated (let’s say, above 0.75 or under -0.75):

### strong relations between predictors
corPred <- cor(bbbDescr)
highCorr <- findCorrelation(corPred, .75)
colnames(bbbDescr)[highCorr]
##  [1] "vsa_don"             "slogp_vsa2"          "slogp_vsa7"         
##  [4] "smr_vsa0"            "smr_vsa5"            "mw"                 
##  [7] "nocount"             "hbdnr"               "ub"                 
## [10] "nonpolar_area"       "tcnp"                "ovality"            
## [13] "surface_area"        "volume"              "ppsa1"              
## [16] "ppsa2"               "ppsa3"               "pnsa2"              
## [19] "pnsa3"               "fpsa2"               "fnsa2"              
## [22] "fnsa3"               "wpsa1"               "wpsa2"              
## [25] "wpsa3"               "wnsa1"               "wnsa2"              
## [28] "wnsa3"               "dpsa1"               "dpsa2"              
## [31] "dpsa3"               "sadh1"               "sadh3"              
## [34] "chdh1"               "chdh3"               "scdh1"              
## [37] "scdh2"               "scdh3"               "saaa1"              
## [40] "saaa3"               "scaa1"               "scaa2"              
## [43] "scaa3"               "ctdh"                "ctaa"               
## [46] "mchg"                "vsa_hyd"             "tpsa"               
## [49] "a_acid"              "a_base"              "vsa_acc"            
## [52] "slogp_vsa3"          "weight"              "logp.o.w."          
## [55] "tpsa.1"              "a_acc"               "adistm"             
## [58] "polar_area"          "psa_npsa"            "homo"               
## [61] "sum_absolute_charge" "fpsa1"               "fpsa3"              
## [64] "sadh2"               "chaa1"               "chdh2"

We can use two strategies to deal with highly correlated predictors: 1) exclude some of them; 2) run principal component analysis. About first strategy, again, function “findCorrelation” is very useful. How does it work? The algorithm:

  1. Calculate the correlation matrix of the predictors.
  2. Determine the two predictors associated with the largest absolute pairwise correlation (call them predictors A and B).
  3. Determine the average correlation between A and the other variables. Do the same for predictor B.
  4. If A has a larger average correlation, remove it; otherwise, remove predictor B.
  5. Repeat Steps 2–4 until no absolute correlations are above the threshold.

Let’s remove predictors based on “findCorrelation” and see predictors’ reduction:

# strategy 1: remove predictors highly correlated
dim(bbbDescr)
## [1] 208 134
bbbDescrFilter <- bbbDescr[,-highCorr]
dim(bbbDescrFilter)
## [1] 208  68

Finally, we can run a principal component analysis:

# principal component analysis
preProc <- preProcess(bbbDescr, method="pca")
# apply principal component analysis to the data set
bbbDescrPC <- predict(preProc, bbbDescr)
dim(bbbDescrPC)
## [1] 208  31

In conclusion, in the first strategy, we reduce predictors from 134 to 68 (66 predictors were removed). Using principal component analysis, predictors were reduced from 134 to 31 (103 predictors of difference).