Applied Predictive Modeling

Load required packages

# Install pacman first if needed: install.packages("pacman")
packages <- c("tidyverse", "forecast", "kableExtra", "broom", "ggplot2", "caret", "e1071", "knitr", "GGally", "VIM", "mlbench", "car", "corrplot", "mice", "seasonal", "fma", "latex2exp", "gridExtra")
pacman::p_load(char = packages)

3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

Load the dataset Glass

library(mlbench)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

As shown above, the Glass data frame has 214 observations and 10 variables.

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Answer

Let's plot histograms to understand the distributions of the predictor variables. We use histograms to show the frequency of each predictor's values, for the following reasons:

  • A histogram bins the data by frequency of values and plots the frequency bins against the ordered values.

  • The height of each bin represents the frequency of values falling in that bin.

long_glass <- Glass %>%
  pivot_longer(-Type, names_to = "Predictor", values_to = "Value", values_drop_na = TRUE) %>%
  mutate(Predictor = as.factor(Predictor))

long_glass %>%
  ggplot(aes(Value, color = Predictor, fill = Predictor)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ Predictor, ncol = 3, scales = "free") +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  theme_light() +
  theme(legend.position = "none") +
  ggtitle("Distribution of Predictor Variables")

Glass is primarily made of silica (Si), sodium (Na), lime/calcium (Ca), and aluminium (Al), and these major-component predictors have relatively normal (symmetric) distributions. The remaining predictor variables appear to have non-normal (asymmetric) distributions.
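
This composition claim can be sanity-checked from the column means (a quick sketch; per the str() output above, Si alone averages roughly 72% by weight):

# Mean of each predictor (percent by weight, except RI, which is an index)
colMeans(Glass[-10])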

Now we will examine how the predictors are related to each other, using a correlation plot.

# ColorBrewer's 5-class Spectral color palette
col <- colorRampPalette(c("#d7191c", "#fdae61", "#ffffbf", "#abdda4", "#2b83ba"))

Glass %>%
  select(-Type) %>%
  cor() %>%
  round(., 2) %>%
  corrplot(., method="color", col=col(200), type="upper", order="hclust", addCoef.col = "black", tl.col="black", tl.srt=45, diag=FALSE )

  • As we see above, most of the predictors are negatively correlated, which makes sense: they measure chemical concentrations on a percentage basis, so as one element's share increases we would expect a decrease in the others.

  • Most of the correlations are not very strong. The exception is calcium (Ca) and the refractive index (RI), which are strongly positively correlated, as the quick check below confirms.
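
A one-line check of that pairwise correlation (the coefficient also appears in the plot above):

cor(Glass$Ca, Glass$RI)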

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

Answer

To start with, let's analyze how the predictors are distributed by glass type, using a jittered scatter plot:

long_glass %>%
  ggplot(aes(x = Type, y = Value, color = Predictor)) +
  geom_jitter() +
  ylim(0, 20) +   # zooming in; note this drops Si values (~70%+) from the plot
  scale_color_brewer(palette = "Set1") +
  theme_light()

From the above plot, it looks like glass types 1, 2, and 3 are very similar in chemical composition. A few observations appear to be outliers: for example, a couple of potassium (K) observations in type 5 glass are unusually high, and a barium (Ba) observation in type 2 glass appears to be an outlier, along with some calcium (Ca) observations in type 2 glass.
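
These candidate outliers can be inspected directly; the K > 2 cutoff below is an arbitrary illustrative threshold, not a value from the text:

# Pull out the unusually high potassium readings flagged in the plot
Glass %>% filter(K > 2) %>% select(Type, K)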

Now let’s also analyse the outliers using box plot approach:

  • The boxplot displays the data in quartiles. The box spans the data falling between the 25th and 75th percentiles, also called the 1st and 3rd quartiles. The line inside the box represents the median, also known as the 50th percentile or 2nd quartile.

  • Extending from the box are whiskers. Any values beyond the whiskers are considered outliers. Symmetric distributions have a box with equally sized whiskers; skewed distributions have boxes with one long and one short whisker.

par(mfrow=c(3,3))
for(var in names(Glass)[-10]){
  boxplot(Glass[var], main=paste('Boxplot of', var), horizontal = T, col="steelblue")
}

  • The boxplots show outliers in every variable except Mg. The most extreme outliers appear in the K and Ba variables.

  • From the box plots above, magnesium is bimodal and left-skewed. Iron, potassium, and barium are right-skewed. The other predictors are roughly normal.

The skewness values can be calculated to confirm this:

Glass[-10] %>% apply(2, skewness) %>% sort(decreasing=T)
##          K         Ba         Ca         Fe         RI         Al         Na 
##  6.4600889  3.3686800  2.0184463  1.7298107  1.6027151  0.8946104  0.4478343 
##         Si         Mg 
## -0.7202392 -1.1364523

The calculated skewness values confirm our conclusions.

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

Answer

A transformation method like the Box-Cox transformation might improve the classification model's performance.
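
As a quick illustration of what this step does, caret's BoxCoxTrans estimates a power parameter lambda for a single predictor and maps x to (x^lambda - 1)/lambda (or log(x) when lambda = 0); the method requires strictly positive values. Ca is used here purely for illustration:

bc_ca <- BoxCoxTrans(Glass$Ca)  # estimate lambda for one predictor
bc_ca                           # print the estimated lambda
head(predict(bc_ca, Glass$Ca))  # first few transformed values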

trans <- preProcess(Glass[-10], method=c('BoxCox', 'center', 'scale'))
transformed <- predict(trans, Glass[-10])

par(mfrow=c(3,3))
for(var in names(transformed)){  # Type was already dropped, so no [-10] needed
  boxplot(transformed[var], main=paste('Boxplot of', var), horizontal = T, col="steelblue")
}

transformed %>% apply(2, skewness) %>% sort(decreasing=T)
##           K          Ba          Fe          RI          Al          Na 
##  6.46008890  3.36867997  1.72981071  1.56566039  0.09105899  0.03384644 
##          Ca          Si          Mg 
## -0.19395573 -0.65090568 -1.13645228

  • The centering and scaling did the job of bringing the mean to 0 and the standard deviation to 1.

  • The Box-Cox transformation clearly improved the skewness of Ca, Al, and Na. It did not touch K, Ba, Fe, or Mg at all: those predictors contain zeros, and Box-Cox requires strictly positive values, so preProcess leaves them untransformed (their skewness values above are unchanged). RI was transformed, but its skewness barely moved.

  • It may be beneficial to remove one of the highly correlated predictors to improve the stability of some linear models; sketches for both of these points follow below.
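
First, a quick check that the untransformed predictors are exactly the ones containing zeros (a sketch; output omitted):

# Count zero entries per predictor; Box-Cox needs strictly positive values
colSums(Glass[-10] == 0)

Second, a minimal sketch for flagging highly correlated predictors with caret's findCorrelation; the 0.75 cutoff is an assumption, not a value from the text:

# Indices of predictors whose pairwise correlations exceed the cutoff
high_cor <- findCorrelation(cor(Glass[-10]), cutoff = 0.75)
names(Glass[-10])[high_cor]  # candidate predictor(s) to drop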

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

library(mlbench)
data(Soybean)
#View(Soybean)

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Answer

To answer the first part of the question, we will use the summary function to analyze the frequency distributions of the predictors:

summary(Soybean)
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
## 

Following with the second part of the question: a variable with a degenerate distribution is one with zero variance, or one with only a handful of unique values occurring at very low frequencies (near-zero variance). Such a variable satisfies both of the following conditions:

  • The fraction of unique values over the sample size is low (say, below 10%).

  • The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say, around 20).

The nearZeroVar function can be used to find the degenerate variables:

paste('The degenerate variables are:', paste(names(Soybean[,nearZeroVar(Soybean)]), collapse = ', '))
## [1] "The degenerate variables are: leaf.mild, mycelium, sclerotia"

Let’s explore the following variables: leaf.mild, mycelium, sclerotia

summary(Soybean[19])
##  leaf.mild 
##  0   :535  
##  1   : 20  
##  2   : 20  
##  NA's:108

For the leaf.mild variable, the fraction of unique values over the sample size is 3/683 ≈ 0.4% < 10%, and the ratio of the frequency of the most prevalent value to that of the second most prevalent value is 535/20 = 26.75 > 20, which confirms that leaf.mild is a degenerate variable.

summary(Soybean[26])
##  mycelium  
##  0   :639  
##  1   :  6  
##  NA's: 38

For the mycelium variable, the fraction of unique values over the sample size is 2/683 ≈ 0.3% < 10%, and the ratio of the frequency of the most prevalent value to that of the second most prevalent value is 639/6 = 106.5 > 20, which confirms that mycelium is a degenerate variable.

summary(Soybean[28])
##  sclerotia 
##  0   :625  
##  1   : 20  
##  NA's: 38

For the sclerotia variable, the fraction of unique values over the sample size is 2/683 ≈ 0.3% < 10%, and the ratio of the frequency of the most prevalent value to that of the second most prevalent value is 625/20 = 31.25 > 20, which confirms that sclerotia is a degenerate variable.
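
The same two criteria can also be read off for every predictor by calling nearZeroVar with saveMetrics = TRUE, which reports freqRatio and percentUnique per column (a sketch; output omitted):

# freqRatio and percentUnique correspond to the two conditions above
nzv_metrics <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv_metrics[c("leaf.mild", "mycelium", "sclerotia"), ]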

(b) Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Answer

The count of missing values in each variable is shown below:

nas <- colSums(is.na(Soybean))
nas <- sort(nas, decreasing = TRUE)
nas
##            hail           sever        seed.tmt         lodging            germ 
##             121             121             121             121             112 
##       leaf.mild fruiting.bodies     fruit.spots   seed.discolor      shriveling 
##             108             106             106             106             106 
##     leaf.shread            seed     mold.growth       seed.size       leaf.halo 
##             100              92              92              92              84 
##       leaf.marg       leaf.size       leaf.malf      fruit.pods          precip 
##              84              84              84              84              38 
##    stem.cankers   canker.lesion       ext.decay        mycelium    int.discolor 
##              38              38              38              38              38 
##       sclerotia     plant.stand           roots            temp       crop.hist 
##              38              36              31              30              16 
##    plant.growth            stem            date        area.dam           Class 
##              16              16               1               1               0 
##          leaves 
##               0

hail, sever, seed.tmt, and lodging are the four predictors with the most missing values (121 each).
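
As a sanity check, the share of incomplete rows lines up with the "roughly 18%" figure quoted in the prompt:

# Proportion of rows with at least one missing value (121/683 ≈ 0.18)
mean(!complete.cases(Soybean))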

Let’s plot our missing values in predictors:

# A function that plots missingness
# requires `reshape2`

library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(ggplot2)

ggplot_missing <- function(x){
  
  x %>% 
    is.na %>%
    melt %>%
    ggplot(data = .,
           aes(x = Var2,
               y = Var1)) +
    geom_raster(aes(fill = value)) +
    scale_fill_grey(name = "",
                    labels = c("Present","Missing")) +
    theme_minimal() + 
    theme(axis.text.x  = element_text(angle=45, vjust=0.5)) + 
    labs(x = "Variables in Dataset",
         y = "Rows or observations")
}
ggplot_missing(Soybean[-1])

The above plot confirms our counts of missing values in the predictors.

If the pattern of missing data is related to the classes, the missing values should be concentrated in particular classes; we check this below.

Let’s use aggr() to analyze pattern of missing data:

  • The aggr function in the VIM package plots and calculates the amount of missing values in each variable.

library(VIM)
aggr(Soybean[-1], prop = c(T, T), bars=T, numbers=T, sortVars=T)

## 
##  Variables sorted by number of missings: 
##         Variable       Count
##             hail 0.177159590
##            sever 0.177159590
##         seed.tmt 0.177159590
##          lodging 0.177159590
##             germ 0.163982430
##        leaf.mild 0.158125915
##  fruiting.bodies 0.155197657
##      fruit.spots 0.155197657
##    seed.discolor 0.155197657
##       shriveling 0.155197657
##      leaf.shread 0.146412884
##             seed 0.134699854
##      mold.growth 0.134699854
##        seed.size 0.134699854
##        leaf.halo 0.122986823
##        leaf.marg 0.122986823
##        leaf.size 0.122986823
##        leaf.malf 0.122986823
##       fruit.pods 0.122986823
##           precip 0.055636896
##     stem.cankers 0.055636896
##    canker.lesion 0.055636896
##        ext.decay 0.055636896
##         mycelium 0.055636896
##     int.discolor 0.055636896
##        sclerotia 0.055636896
##      plant.stand 0.052708638
##            roots 0.045387994
##             temp 0.043923865
##        crop.hist 0.023426061
##     plant.growth 0.023426061
##             stem 0.023426061
##             date 0.001464129
##         area.dam 0.001464129
##           leaves 0.000000000

  • The non-graphical output of the function shows the exact proportion of missing values per variable.

  • The visualizations produced by the aggr function in the VIM package show a bar chart with the proportion of missing data per variable, as well as a grid with the proportion of missing data for variable combinations. The bar chart shows that several predictor variables have over 15% of their values missing.

  • The remainder of the grid shows missing data for variable combinations with each row highlighting the missing values for the group of variables detailed in the x-axis.

dplyr is useful for wrangling data into aggregate summaries; here it is used to relate the pattern of missing data to the classes.

library(dplyr)

Soybean %>%
  dplyr::mutate(Total = n()) %>% 
  dplyr::filter(!complete.cases(.)) %>%
  dplyr::group_by(Class) %>%
  dplyr::mutate(Missing = n(), Proportion=Missing/Total) %>%
  dplyr::select(Class, Missing, Proportion) %>%
  unique()
## # A tibble: 5 x 3
## # Groups:   Class [5]
##   Class                       Missing Proportion
##   <fct>                         <int>      <dbl>
## 1 phytophthora-rot                 68     0.0996
## 2 diaporthe-pod-&-stem-blight      15     0.0220
## 3 cyst-nematode                    14     0.0205
## 4 2-4-d-injury                     16     0.0234
## 5 herbicide-injury                  8     0.0117

  • In the above process, we checked whether a pattern of missing data related to the classes exists by seeing if some classes hold most of the incomplete cases. This is accomplished by filtering, grouping, and mutating the data with dplyr.

  • The majority of the missing values are in the phytophthora-rot class, which alone accounts for nearly 10% of all rows being incomplete (68 of 683). Only four of the eighteen other classes have any incomplete cases at all.

  • The pattern of missing data is therefore related to the classes, driven mostly by phytophthora-rot, since the other four classes each account for only 1-2% of the incomplete cases.

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Answer

Multiple imputation and KNN imputation are both widely used approaches; we demonstrate each below.

Method 1

The mice() function in the mice package conducts Multivariate Imputation by Chained Equations (MICE) on multivariate datasets with missing values. The function offers over 20 imputation methods that can be applied to the data.

The one used with these data is the predictive mean matching (pmm) method, which is currently the most popular in online forums. After the imputations are made, a complete dataset is created using the complete() function.

The aggr function from the VIM package used in the previous example (which plots and calculates the amount of missing values in each variable) is then rerun for comparison.

library(mice)
MICE <- mice(Soybean, method="pmm", printFlag=F, seed=624)
## Warning: Number of logged events: 1666
aggr(complete(MICE), prop = c(T, T), bars=T, numbers=T, sortVars=T)

## 
##  Variables sorted by number of missings: 
##         Variable Count
##            Class     0
##             date     0
##      plant.stand     0
##           precip     0
##             temp     0
##             hail     0
##        crop.hist     0
##         area.dam     0
##            sever     0
##         seed.tmt     0
##             germ     0
##     plant.growth     0
##           leaves     0
##        leaf.halo     0
##        leaf.marg     0
##        leaf.size     0
##      leaf.shread     0
##        leaf.malf     0
##        leaf.mild     0
##             stem     0
##          lodging     0
##     stem.cankers     0
##    canker.lesion     0
##  fruiting.bodies     0
##        ext.decay     0
##         mycelium     0
##     int.discolor     0
##        sclerotia     0
##       fruit.pods     0
##      fruit.spots     0
##             seed     0
##      mold.growth     0
##    seed.discolor     0
##        seed.size     0
##       shriveling     0
##            roots     0

Method 2

In this method, k neighbors are chosen based on some distance measure, and their average is used as the imputation estimate. The method requires selecting the number of nearest neighbors k and a distance metric. KNN can impute both discrete attributes (using the most frequent value among the k nearest neighbors) and continuous attributes (using the mean among the k nearest neighbors).

knnImputation is a function that fills in all NA values using the k nearest neighbours of each case with missing values. By default it obtains a weighted (by distance to the case) average of the neighbours' values to fill in the unknowns. (Note: if DMwR is unavailable on CRAN, its successor package DMwR2 provides the same knnImputation() function.)

library(DMwR)
knnOutput <- knnImputation(Soybean)
library(rmarkdown)
paged_table(knnOutput)
ggplot_missing(knnOutput)

One obvious drawback of the KNN algorithm is that it becomes time-consuming on large datasets, because it searches the entire dataset for similar instances. Furthermore, the accuracy of KNN can be severely degraded with high-dimensional data, where there is little difference between the nearest and farthest neighbors.