Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling.

Exercise 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a)

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

# Required libraries
library(ggplot2)
library(tidyverse)

# Reshape the predictors to long format and plot one histogram per variable
Glass %>% select(-c(Type)) %>% 
  gather() %>% 
  ggplot(aes(x = value)) + geom_histogram(bins = 30) + facet_wrap(~key, scales = 'free')

The histograms above show near-normal distributions for Al, Na, and Si. The histograms for Ca and RI aren’t quite normal, as both have a bit of right skew. Ba, Fe, and K are clearly right-skewed, and Mg appears bimodal.

library(corrplot)

# Pairwise correlations among the nine predictors
corr <- Glass %>% select(-c(Type)) %>% cor()

corrplot(corr, method="number")

The correlation plot above shows a high positive correlation between Ca and RI, indicating the two predictors carry nearly the same information. The only other pair with a correlation above 0.5 or below -0.5 is Si and RI, at -0.54, meaning those two have an inverse relationship.
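As a programmatic check, caret's findCorrelation can recommend which member of a highly correlated pair to drop; the 0.75 cutoff below is my own choice for illustration.

library(caret)
# Recommend predictors to drop based on pairwise correlations above the cutoff
highCorr <- findCorrelation(corr, cutoff = 0.75, names = TRUE)
highCorr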

(b)

Do there appear to be any outliers in the data? Are any predictors skewed?

Glass %>% select(-c(Type)) %>% 
  gather() %>% 
  ggplot(aes(x = value)) + 
  stat_boxplot(geom = "errorbar", width = 0.5) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free')

Yes. Based on the histograms from section (a) and the boxplots above, outliers appear as the black dots beyond the whiskers (1.5 IQR from the hinge). By that definition, all the predictors except Mg have outliers. As noted in section (a), most of the predictors are also skewed: Ba, Fe, and K clearly have right skew, while Ca and RI have a milder right skew.
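To put rough numbers on that visual read, the 1.5 × IQR rule can be applied directly (a sketch; quantile() defaults differ slightly from ggplot's hinge calculation, so the counts are approximate):

# Count values beyond 1.5 * IQR from the quartiles for each predictor
Glass %>% select(-Type) %>%
  summarise(across(everything(),
                   ~ sum(.x < quantile(.x, 0.25) - 1.5 * IQR(.x) |
                         .x > quantile(.x, 0.75) + 1.5 * IQR(.x))))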

library(e1071)
# Column-wise skewness for the nine predictors
skewValues <- apply(Glass[1:9], 2, skewness)
(skewValues)
##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107

Based on the skewness function, all the predictors are right-skewed except for Mg and Si. The predictors with skewness values furthest from zero are K, Ba, and Ca, broadly consistent with the visual assessment from the histograms.

(c)

Are there any relevant transformations of one or more predictors that might improve the classification model?

The first transformation would be the removal of either Ca or RI, due to their high positive correlation. For the predictors with clear right skew, the Box-Cox transformation can be applied. Box-Cox identifies an appropriate power transformation (e.g., log, square, square root, or inverse) to resolve the skewness and bring the data closer to a normal distribution for use in a predictive model. For the outliers, the spatial sign transformation can be applied, which projects the predictor values onto a multidimensional sphere, in essence making all the samples the same distance from the center of the sphere.
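Before applying it to the full data, it helps to see what Box-Cox estimates on a single predictor. A minimal sketch with caret's BoxCoxTrans, using Ca because Box-Cox requires strictly positive values (Ba, Fe, and K contain zeros and would need an offset first):

library(caret)
# Estimate the Box-Cox lambda for one skewed, strictly positive predictor
ca_bc <- BoxCoxTrans(Glass$Ca)
ca_bc                            # prints the estimated lambda
head(predict(ca_bc, Glass$Ca))   # the transformed values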

library(caret)
# Box Cox, center, scale transformations

trans <- preProcess(Glass, method = c("BoxCox", "center", "scale")) 

transformed <- predict(trans, Glass) 

transformed %>% select(-c(Type)) %>% 
  gather() %>% 
  ggplot(aes(x = value)) + geom_histogram(bins=30) + facet_wrap(~key, scales = 'free')

Applying the Box-Cox transformation above with centering and scaling, I’ll be honest: the shapes of the distributions did not improve much, and neither did the skew. A likely reason is that Box-Cox requires strictly positive values, so the most skewed predictors (Ba, Fe, and K, which contain zeros) are skipped entirely.
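Printing the preProcess object shows how many predictors each step actually touched, which confirms the skipped columns:

# The printed summary reports, e.g., how many predictors were Box-Cox
# transformed versus merely centered and scaled
trans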

library(caret)
# Spatial sign, center, scale transformations

trans <- preProcess(Glass, method = c("spatialSign", "center", "scale")) 

transformed <- predict(trans, Glass) 

transformed %>% select(-c(Type)) %>% 
  gather() %>% 
  ggplot(aes(x = value)) + geom_histogram(bins=30) + facet_wrap(~key, scales = 'free')

Applying the spatial sign transformation above with centering and scaling, the distributions for Ba, Fe, K, and Mg are still not normal, but they are less extreme than in the initial histograms.

library(caret)
# Box-Cox, center, scale, and PCA transformations

trans <- preProcess(Glass, method = c("BoxCox", "center", "scale", "pca")) 

transformed <- predict(trans, Glass) 

transformed %>% select(-c(Type)) %>% 
  gather() %>% 
  ggplot(aes(x = value)) + geom_histogram(bins=30) + facet_wrap(~key, scales = 'free')

And finally, following the book’s example, I applied Box-Cox, centering, scaling, and PCA. Based on the histograms of the resulting principal components, the visual assessment looks good with regard to normal distributions.

At this point, I’d probably go with the PCA pipeline, despite my opening paragraph describing Box-Cox and spatial sign.
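For completeness, the preProcess summary also reports how many principal components were retained (by default, enough to capture 95% of the variance):

# Inspect the PCA step: number of components kept and the new PC columns
trans
head(transformed, 3)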

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

data(Soybean)
summary(Soybean)
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
## 
## See ?Soybean for details

(a)

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

First, I generate a barplot of the frequency distribution for each predictor variable, to provide an initial visual assessment.

# Bar chart of level frequencies for each predictor (NAs dropped)
Soybean %>% select(-c(Class)) %>% 
  gather() %>% 
  drop_na() %>%
  ggplot(aes(x = value)) + geom_bar() + facet_wrap(~key, ncol=3, scales = 'free')

The barplots above suggest that several predictors have severely disproportionate frequency distributions. The nearZeroVar function below identifies the predictors with near-zero variance, and thus degenerate frequency distributions.

library(caret)
cols <- nearZeroVar(Soybean, saveMetrics = TRUE, names = TRUE)

(cols)
##                  freqRatio percentUnique zeroVar   nzv
## Class             1.010989     2.7818448   FALSE FALSE
## date              1.137405     1.0248902   FALSE FALSE
## plant.stand       1.208191     0.2928258   FALSE FALSE
## precip            4.098214     0.4392387   FALSE FALSE
## temp              1.879397     0.4392387   FALSE FALSE
## hail              3.425197     0.2928258   FALSE FALSE
## crop.hist         1.004587     0.5856515   FALSE FALSE
## area.dam          1.213904     0.5856515   FALSE FALSE
## sever             1.651282     0.4392387   FALSE FALSE
## seed.tmt          1.373874     0.4392387   FALSE FALSE
## germ              1.103627     0.4392387   FALSE FALSE
## plant.growth      1.951327     0.2928258   FALSE FALSE
## leaves            7.870130     0.2928258   FALSE FALSE
## leaf.halo         1.547511     0.4392387   FALSE FALSE
## leaf.marg         1.615385     0.4392387   FALSE FALSE
## leaf.size         1.479638     0.4392387   FALSE FALSE
## leaf.shread       5.072917     0.2928258   FALSE FALSE
## leaf.malf        12.311111     0.2928258   FALSE FALSE
## leaf.mild        26.750000     0.4392387   FALSE  TRUE
## stem              1.253378     0.2928258   FALSE FALSE
## lodging          12.380952     0.2928258   FALSE FALSE
## stem.cankers      1.984293     0.5856515   FALSE FALSE
## canker.lesion     1.807910     0.5856515   FALSE FALSE
## fruiting.bodies   4.548077     0.2928258   FALSE FALSE
## ext.decay         3.681481     0.4392387   FALSE FALSE
## mycelium        106.500000     0.2928258   FALSE  TRUE
## int.discolor     13.204545     0.4392387   FALSE FALSE
## sclerotia        31.250000     0.2928258   FALSE  TRUE
## fruit.pods        3.130769     0.5856515   FALSE FALSE
## fruit.spots       3.450000     0.5856515   FALSE FALSE
## seed              4.139130     0.2928258   FALSE FALSE
## mold.growth       7.820896     0.2928258   FALSE FALSE
## seed.discolor     8.015625     0.2928258   FALSE FALSE
## seed.size         9.016949     0.2928258   FALSE FALSE
## shriveling       14.184211     0.2928258   FALSE FALSE
## roots             6.406977     0.4392387   FALSE FALSE

The output from nearZeroVar identifies leaf.mild, mycelium, and sclerotia as having near-zero variance, and thus degenerate distributions. None of the predictors have zero variance. Given their near-zero variance, these three variables are good candidates for removal from the predictive model.
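Acting on that, nearZeroVar's default return value (column positions) makes the removal a one-liner:

# Without saveMetrics, nearZeroVar() returns the positions of the
# near-zero-variance columns, which can be dropped directly
nzv_cols <- nearZeroVar(Soybean)
Soybean_filtered <- Soybean[, -nzv_cols]
dim(Soybean_filtered)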

(b)

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

# Count the number of observations with a missing value by predictor variable
colSums_Missing_Count <- data.frame(colSums(is.na(Soybean)))

# Name the column for the NA count
colnames(colSums_Missing_Count) <- "NA.Count"

# Convert the index column into a named column to keep the variable names
colSums_Missing_Count <- cbind(Variable = rownames(colSums_Missing_Count), colSums_Missing_Count)
rownames(colSums_Missing_Count) <- 1:nrow(colSums_Missing_Count)

# Sort by the missing count in descending order
colSums_Missing_Count <- colSums_Missing_Count[order(-colSums_Missing_Count$NA.Count),]

# Output the results
(colSums_Missing_Count)
##           Variable NA.Count
## 6             hail      121
## 9            sever      121
## 10        seed.tmt      121
## 21         lodging      121
## 11            germ      112
## 19       leaf.mild      108
## 24 fruiting.bodies      106
## 30     fruit.spots      106
## 33   seed.discolor      106
## 35      shriveling      106
## 17     leaf.shread      100
## 31            seed       92
## 32     mold.growth       92
## 34       seed.size       92
## 14       leaf.halo       84
## 15       leaf.marg       84
## 16       leaf.size       84
## 18       leaf.malf       84
## 29      fruit.pods       84
## 4           precip       38
## 22    stem.cankers       38
## 23   canker.lesion       38
## 25       ext.decay       38
## 26        mycelium       38
## 27    int.discolor       38
## 28       sclerotia       38
## 3      plant.stand       36
## 36           roots       31
## 5             temp       30
## 7        crop.hist       16
## 12    plant.growth       16
## 20            stem       16
## 2             date        1
## 8         area.dam        1
## 1            Class        0
## 13          leaves        0

library(naniar)
vis_miss(Soybean)

The 5 predictors with the highest missing-value counts are hail, sever, seed.tmt, lodging, and germ. Overall, 11 predictors have 100 or more missing values.

The vis_miss function from the naniar package shows that the missing values tend to co-occur in the same observations. The next step is to identify whether those observations align with specific class values.

# Rows with at least one missing value; grouping reveals how many classes they span
Soybean %>%
  filter(!complete.cases(.)) %>%
  group_by(Class)
## # A tibble: 121 × 36
## # Groups:   Class [5]
##    Class  date  plant.stand precip temp  hail  crop.hist area.dam sever seed.tmt
##    <fct>  <fct> <ord>       <ord>  <ord> <fct> <fct>     <fct>    <fct> <fct>   
##  1 phyto… 1     1           2      1     <NA>  3         1        <NA>  <NA>    
##  2 phyto… 2     1           2      2     <NA>  2         1        <NA>  <NA>    
##  3 phyto… 2     1           2      2     <NA>  2         1        <NA>  <NA>    
##  4 phyto… 3     1           2      1     <NA>  2         1        <NA>  <NA>    
##  5 phyto… 2     1           1      1     <NA>  0         1        <NA>  <NA>    
##  6 phyto… 2     1           2      1     <NA>  1         1        <NA>  <NA>    
##  7 phyto… 1     1           2      1     <NA>  1         1        <NA>  <NA>    
##  8 phyto… 2     1           2      2     <NA>  3         1        <NA>  <NA>    
##  9 phyto… 2     1           1      2     <NA>  2         1        <NA>  <NA>    
## 10 phyto… 1     1           2      1     <NA>  0         1        <NA>  <NA>    
## # … with 111 more rows, and 26 more variables: germ <ord>, plant.growth <fct>,
## #   leaves <fct>, leaf.halo <fct>, leaf.marg <fct>, leaf.size <ord>,
## #   leaf.shread <fct>, leaf.malf <fct>, leaf.mild <fct>, stem <fct>,
## #   lodging <fct>, stem.cankers <fct>, canker.lesion <fct>,
## #   fruiting.bodies <fct>, ext.decay <fct>, mycelium <fct>, int.discolor <fct>,
## #   sclerotia <fct>, fruit.pods <fct>, fruit.spots <fct>, seed <fct>,
## #   mold.growth <fct>, seed.discolor <fct>, seed.size <fct>, …

The above output indicates that 121 rows are missing at least one value.

Soybean %>%
  filter(!complete.cases(.)) %>%
  group_by(Class) %>%
  summarize(Count = n())
## # A tibble: 5 × 2
##   Class                       Count
##   <fct>                       <int>
## 1 2-4-d-injury                   16
## 2 cyst-nematode                  14
## 3 diaporthe-pod-&-stem-blight    15
## 4 herbicide-injury                8
## 5 phytophthora-rot               68

The above output shows that those 121 rows span only 5 classes, with phytophthora-rot contributing the most rows with missing data (68).

Soybean %>% 
  mutate(MissingValues = rowSums(is.na(Soybean))) %>%
  mutate(MissingPct = MissingValues / 35) %>%
  group_by(Class) %>%
  summarise(MissingPctAvg = mean(MissingPct)) %>%
  filter(MissingPctAvg > 0)
## # A tibble: 5 × 2
##   Class                       MissingPctAvg
##   <fct>                               <dbl>
## 1 2-4-d-injury                        0.804
## 2 cyst-nematode                       0.686
## 3 diaporthe-pod-&-stem-blight         0.337
## 4 herbicide-injury                    0.571
## 5 phytophthora-rot                    0.394

The above output gives the average fraction of missing values for those 5 classes: 2-4-d-injury rows are missing roughly 80% of their values, and cyst-nematode and herbicide-injury rows are each missing more than half.

Soybean %>% 
  count(Class, sort=TRUE)
##                          Class  n
## 1                   brown-spot 92
## 2          alternarialeaf-spot 91
## 3           frog-eye-leaf-spot 91
## 4             phytophthora-rot 88
## 5                  anthracnose 44
## 6               brown-stem-rot 44
## 7             bacterial-blight 20
## 8            bacterial-pustule 20
## 9                 charcoal-rot 20
## 10       diaporthe-stem-canker 20
## 11                downy-mildew 20
## 12      phyllosticta-leaf-spot 20
## 13              powdery-mildew 20
## 14           purple-seed-stain 20
## 15        rhizoctonia-root-rot 20
## 16                2-4-d-injury 16
## 17 diaporthe-pod-&-stem-blight 15
## 18               cyst-nematode 14
## 19            herbicide-injury  8

The above output gives the number of rows per class. Comparing it with the missing-data counts shows that every row for 2-4-d-injury (16 of 16), cyst-nematode (14 of 14), diaporthe-pod-&-stem-blight (15 of 15), and herbicide-injury (8 of 8) contains missing data, and for phytophthora-rot only 20 of the 88 rows are complete.
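A quick cross-check of that claim, counting complete rows per class:

# Compare total rows to complete rows for each class with any missing data
Soybean %>%
  mutate(complete = complete.cases(.)) %>%
  group_by(Class) %>%
  summarise(Total = n(), Complete = sum(complete)) %>%
  filter(Complete < Total)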

(c)

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

# Express each missing count as a fraction of the 683 rows
colSums_Missing_Count %>%
  mutate(NA.Count.Pct = NA.Count / nrow(Soybean))
##           Variable NA.Count NA.Count.Pct
## 6             hail      121  0.177159590
## 9            sever      121  0.177159590
## 10        seed.tmt      121  0.177159590
## 21         lodging      121  0.177159590
## 11            germ      112  0.163982430
## 19       leaf.mild      108  0.158125915
## 24 fruiting.bodies      106  0.155197657
## 30     fruit.spots      106  0.155197657
## 33   seed.discolor      106  0.155197657
## 35      shriveling      106  0.155197657
## 17     leaf.shread      100  0.146412884
## 31            seed       92  0.134699854
## 32     mold.growth       92  0.134699854
## 34       seed.size       92  0.134699854
## 14       leaf.halo       84  0.122986823
## 15       leaf.marg       84  0.122986823
## 16       leaf.size       84  0.122986823
## 18       leaf.malf       84  0.122986823
## 29      fruit.pods       84  0.122986823
## 4           precip       38  0.055636896
## 22    stem.cankers       38  0.055636896
## 23   canker.lesion       38  0.055636896
## 25       ext.decay       38  0.055636896
## 26        mycelium       38  0.055636896
## 27    int.discolor       38  0.055636896
## 28       sclerotia       38  0.055636896
## 3      plant.stand       36  0.052708638
## 36           roots       31  0.045387994
## 5             temp       30  0.043923865
## 7        crop.hist       16  0.023426061
## 12    plant.growth       16  0.023426061
## 20            stem       16  0.023426061
## 2             date        1  0.001464129
## 8         area.dam        1  0.001464129
## 1            Class        0  0.000000000
## 13          leaves        0  0.000000000

The above output gives the percentage of missing values by column; no column is missing more than 18% of its values.

First, before defining the strategy, I want to address the findings from section (b). Given that the missing values are concentrated in only 5 of the 19 classes, I believe additional research into those missing values is warranted. Potentially an “informative missingness” exists that could be uncovered with more subject-matter knowledge or details of how the data were collected.
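If that informative missingness turns out to be real, one option for the categorical predictors is to encode missingness as its own factor level rather than impute it away. A sketch using forcats (the helper name varies by version: fct_na_value_to_level in forcats 1.0+, fct_explicit_na in older releases):

library(forcats)
# Recode NA as an explicit "missing" level in every factor predictor so a
# model can treat the missingness itself as a signal
soy_na_level <- Soybean %>%
  mutate(across(-Class, ~ fct_na_value_to_level(.x, level = "missing")))
summary(soy_na_level$hail)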

For handling the missing data, I believe a two-step approach makes sense. First, eliminate the predictors with near-zero variance: leaf.mild, mycelium, and sclerotia. As the textbook indicates, these three carry little predictive value. Second, for the remaining predictors, I don’t believe imputation via PCA is a good approach, since the missing data are spread across almost all the predictors; PCA-based imputation typically works best when only a small number of predictors have missing values. To impute the remaining predictors, I’d suggest a K-nearest-neighbors (KNN) model, with 5 nearest neighbors as a reasonable starting point (see the sketch below).
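As a concrete sketch of that two-step plan (substituting the VIM package, whose kNN function handles mixed-type data via Gower distance; this is my choice of tool, not the textbook’s):

library(VIM)
# Step 1: drop the three degenerate predictors; Step 2: impute the rest with
# a 5-nearest-neighbor model. imp_var = FALSE suppresses the extra indicator
# columns kNN() appends by default.
soy_reduced <- Soybean %>% select(-c(leaf.mild, mycelium, sclerotia))
soy_imputed <- kNN(soy_reduced, k = 5, imp_var = FALSE)
sum(is.na(soy_imputed))   # should be 0 after imputation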

Another consideration is a tree-based model, which Chapter 3 of the textbook alludes to several times. If the missing values do contain “informative missingness”, a tree-based approach may predict the correct class better than a regression-based model.
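As a quick illustration of why trees are attractive here: rpart can fit on the raw data because its surrogate splits route observations with missing values through the tree (an untuned baseline, not a final model):

library(rpart)
# Surrogate splits let rpart handle the NAs without any imputation
tree_fit <- rpart(Class ~ ., data = Soybean)
head(tree_fit$variable.importance)   # which predictors drive the splits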

library(mice)

# from: https://datascienceplus.com/imputing-missing-data-with-r-mice-package/
# Note: 'pmm' (predictive mean matching) is aimed at numeric data; left to
# its defaults, mice would choose logreg/polyreg for these factor columns,
# and maxit = 50 across 35 predictors makes this run slowly.
tempData <- mice(Soybean, m = 5, maxit = 50, meth = 'pmm', seed = 500)
summary(tempData)
completedData <- complete(tempData, 1)

I did naively attempt the imputation with the mice package, but the processing took a long time, and the imputed values didn’t prove much to me beyond the fact that the function works as advertised.