Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling.

Ex 3.1

The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

library(fpp3)
library(mlbench)
library(corrplot)
library(moments)
library(tidyverse)
library(VIM)
library(naniar)

data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(Glass)
##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29
  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Glass |> pivot_longer(cols = c(1:9), names_to = 'predictors', values_to = 'value') |> ggplot()+ geom_histogram(aes(value))+
  facet_wrap(~predictors, scales='free')+
  labs(title='Distributions of All Predictors')

corrplot(cor(Glass[,1:9]), method='number', type='lower')

Al, Ba, CA, Fe, K and RI look right skewed. However, BA, Fe and K have many data at zero. Mg is left skewed with many data at zero. Na and Si are quite normal. RI has strong correlation with Al, Ca and Si. RI also has some correlation with K. Other strong correlation include Mg and Al/Ba/Ca, Ba and Al/Na/Mg.

  1. Do there appear to be any outliers in the data? Are any predictors skewed?
Glass |> pivot_longer(cols = c(1:9), names_to = 'predictors', values_to = 'value') |> ggplot()+ geom_boxplot(aes(value), outlier.colour = 'green')+
  facet_wrap(~predictors, scales='free')+
  labs(title='Boxplot of All Predictors')

apply(Glass[-10], 2, skewness)
##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6140150  0.4509917 -1.1444648  0.9009179 -0.7253173  6.5056358  2.0326774 
##         Ba         Fe 
##  3.3924309  1.7420068

All green dots above are outliners and only Mg has no outliner. According to the skewness check, Mg and Si are left skewed and the rest are right skewed. Na is the one closet to normal distribution.

  1. Are there any relevant transformations of one or more predictors that might improve the classification model?
log_transform <- function(x) {
  if_else(x != 0, log(x + 1e-11), log(1))
}

Glass_log<-Glass|>
  select(RI, Al, K, Ca, Ba, Fe)|> mutate_all(log_transform)

Glass_log |> pivot_longer(cols = c(1:6), names_to = 'predictors_log', values_to = 'value_log') |> ggplot()+ geom_histogram(aes(value_log))+
  facet_wrap(~predictors_log, scales='free')+
  labs(title='Distributions of All Predictors')

apply(Glass_log, 2, skewness)
##         RI         Al          K         Ca         Ba         Fe 
##  1.6015381 -0.8405381 -1.1189706  1.0588744 -3.4060277 -1.2504705

There are transformations including logarithmic, square root, box-cox and reciprocal to handle the skewness. Since most of the predictors are right skewed, a logarithmic transformation is used. However, it’s recommended to remove those zeros first before transformation.

Ex 3.2

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

data("Soybean")
summary(Soybean)
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
## 
  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

A degenerate distribution is a probability distribution in a space with support only on a manifold of lower dimension, and according to others a distribution with support only at a single point. By the latter definition, it is a deterministic distribution and takes only a single value. This distribution satisfies the definition of “random variable” even though it does not appear random in the everyday sense of the word; hence it is considered degenerate.

Soybean|> gather() |> ggplot(aes(value)) + 
  geom_bar()+
  facet_wrap(~key, scales='free')+
  labs(title='Frequency Distributions For All Categorical Predictors')

plant.stand, hail, plant.growth, leaves, leaf.shread, leaf.malf, stem, lodging, fruiting.bodies, mycelium, sclerotia, seed, mold.growth, seed.discolor, seed.size and shriveling are degenerate.

  1. Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
missing <- aggr(Soybean, col=c('red','orange'), numbers=TRUE,
                  sortVars=TRUE, labels=names(Soybean), cex.axis=.5, gap=1,
                  ylab=c('Histogram of missing data','Pattern'))
## 
##  Variables sorted by number of missings: 
##         Variable       Count
##             hail 0.177159590
##            sever 0.177159590
##         seed.tmt 0.177159590
##          lodging 0.177159590
##             germ 0.163982430
##        leaf.mild 0.158125915
##  fruiting.bodies 0.155197657
##      fruit.spots 0.155197657
##    seed.discolor 0.155197657
##       shriveling 0.155197657
##      leaf.shread 0.146412884
##             seed 0.134699854
##      mold.growth 0.134699854
##        seed.size 0.134699854
##        leaf.halo 0.122986823
##        leaf.marg 0.122986823
##        leaf.size 0.122986823
##        leaf.malf 0.122986823
##       fruit.pods 0.122986823
##           precip 0.055636896
##     stem.cankers 0.055636896
##    canker.lesion 0.055636896
##        ext.decay 0.055636896
##         mycelium 0.055636896
##     int.discolor 0.055636896
##        sclerotia 0.055636896
##      plant.stand 0.052708638
##            roots 0.045387994
##             temp 0.043923865
##        crop.hist 0.023426061
##     plant.growth 0.023426061
##             stem 0.023426061
##             date 0.001464129
##         area.dam 0.001464129
##            Class 0.000000000
##           leaves 0.000000000
gg_miss_upset(Soybean, nsets=10)

Hail, sever, seed.tmt and lodging has the most missing values. There are pattern of missing values between these 4 predictors.

  1. Develop a strategy for handling missing data, either by eliminating /’predictors or imputation.
library(caret)
## 载入需要的程辑包:lattice
## 
## 载入程辑包:'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
## The following objects are masked from 'package:fabletools':
## 
##     MAE, RMSE
# Identification of near zero variance predictors
nearZeroVar(Soybean,freqCut = 100/5, uniqueCut = 15)
## [1] 19 26 28
# Substitute the missing value by KNN
Soybean_new <- Soybean[,-c(19,26,28)] |>
  kNN()

aggr(Soybean_new, col=c('orange'), numbers=TRUE, sortVars=TRUE, labels=names(Soybean_new), cex.axis=.5, gap=1,ylab=c('Histogram of missing data','Pattern'))

## 
##  Variables sorted by number of missings: 
##             Variable Count
##                Class     0
##                 date     0
##          plant.stand     0
##               precip     0
##                 temp     0
##                 hail     0
##            crop.hist     0
##             area.dam     0
##                sever     0
##             seed.tmt     0
##                 germ     0
##         plant.growth     0
##               leaves     0
##            leaf.halo     0
##            leaf.marg     0
##            leaf.size     0
##          leaf.shread     0
##            leaf.malf     0
##                 stem     0
##              lodging     0
##         stem.cankers     0
##        canker.lesion     0
##      fruiting.bodies     0
##            ext.decay     0
##         int.discolor     0
##           fruit.pods     0
##          fruit.spots     0
##                 seed     0
##          mold.growth     0
##        seed.discolor     0
##            seed.size     0
##           shriveling     0
##                roots     0
##            Class_imp     0
##             date_imp     0
##      plant.stand_imp     0
##           precip_imp     0
##             temp_imp     0
##             hail_imp     0
##        crop.hist_imp     0
##         area.dam_imp     0
##            sever_imp     0
##         seed.tmt_imp     0
##             germ_imp     0
##     plant.growth_imp     0
##           leaves_imp     0
##        leaf.halo_imp     0
##        leaf.marg_imp     0
##        leaf.size_imp     0
##      leaf.shread_imp     0
##        leaf.malf_imp     0
##             stem_imp     0
##          lodging_imp     0
##     stem.cankers_imp     0
##    canker.lesion_imp     0
##  fruiting.bodies_imp     0
##        ext.decay_imp     0
##     int.discolor_imp     0
##       fruit.pods_imp     0
##      fruit.spots_imp     0
##             seed_imp     0
##      mold.growth_imp     0
##    seed.discolor_imp     0
##        seed.size_imp     0
##       shriveling_imp     0
##            roots_imp     0

Using KNN to substitute the missing data and the result can be found above.