library(ggplot2)

library(mlbench)
library(e1071)
library(caret)

Homework-4 Applied Predictive Modeling, Chapter 3 Exercises

3.1 Glass data from the UC Irvine Machine Learning Repository

data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
hist(Glass$RI)

hist(Glass$Na)

hist(Glass$Mg)

hist(Glass$Al)

hist(Glass$Si)

hist(Glass$K)

hist(Glass$Ca)

hist(Glass$Ba)

hist(Glass$Fe)

Based on the histograms above, Mg, K, Ba, and Fe are clearly not normally distributed: K, Ba, and Fe are strongly right-skewed, with most values at or near zero, while Mg is left-skewed. The remaining 5 predictors (RI, Na, Al, Si, Ca) are closer to normal, with milder skewness.
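The visual impression can be checked numerically. The sketch below uses the skewness function from the e1071 package (already loaded above) on the nine predictor columns; the variable name skew_values is only illustrative. Values far from zero indicate stronger skew, and the sign gives its direction.

```r
library(mlbench)
library(e1071)

data(Glass)

# Sample skewness of each of the nine predictors (column 10 is the class, Type)
skew_values <- sapply(Glass[, 1:9], skewness)

# Sorted from most left-skewed to most right-skewed
round(sort(skew_values), 2)
```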

The table and bar plot below show the number of observations for each type of glass.

table(Glass$Type)
## 
##  1  2  3  5  6  7 
## 70 76 17 13  9 29
barplot(table(Glass$Type))

The cor function shows whether any pairs of predictors are strongly correlated (collinear).

gp <- Glass[,1:9]
cor(gp, use = "all.obs")
##               RI          Na           Mg          Al          Si
## RI  1.0000000000 -0.19188538 -0.122274039 -0.40732603 -0.54205220
## Na -0.1918853790  1.00000000 -0.273731961  0.15679367 -0.06980881
## Mg -0.1222740393 -0.27373196  1.000000000 -0.48179851 -0.16592672
## Al -0.4073260341  0.15679367 -0.481798509  1.00000000 -0.00552372
## Si -0.5420521997 -0.06980881 -0.165926723 -0.00552372  1.00000000
## K  -0.2898327111 -0.26608650  0.005395667  0.32595845 -0.19333085
## Ca  0.8104026963 -0.27544249 -0.443750026 -0.25959201 -0.20873215
## Ba -0.0003860189  0.32660288 -0.492262118  0.47940390 -0.10215131
## Fe  0.1430096093 -0.24134641  0.083059529 -0.07440215 -0.09420073
##               K         Ca            Ba           Fe
## RI -0.289832711  0.8104027 -0.0003860189  0.143009609
## Na -0.266086504 -0.2754425  0.3266028795 -0.241346411
## Mg  0.005395667 -0.4437500 -0.4922621178  0.083059529
## Al  0.325958446 -0.2595920  0.4794039017 -0.074402151
## Si -0.193330854 -0.2087322 -0.1021513105 -0.094200731
## K   1.000000000 -0.3178362 -0.0426180594 -0.007719049
## Ca -0.317836155  1.0000000 -0.1128409671  0.124968219
## Ba -0.042618059 -0.1128410  1.0000000000 -0.058691755
## Fe -0.007719049  0.1249682 -0.0586917554  1.000000000

Indeed, RI and Ca are highly correlated (r ≈ 0.81), as the scatter plot below also shows.

plot(Glass$RI, Glass$Ca)
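Rather than scanning the correlation matrix by eye, caret's findCorrelation function can flag predictors to consider removing. This is a minimal sketch; the 0.75 cutoff is an assumption chosen for illustration, not a value taken from the exercise.

```r
library(mlbench)
library(caret)

data(Glass)

# Pairwise correlations among the nine predictors
cor_matrix <- cor(Glass[, 1:9])

# Column indices of predictors recommended for removal so that no
# remaining pair has an absolute correlation above the cutoff
high_cor_idx <- findCorrelation(cor_matrix, cutoff = 0.75)

colnames(cor_matrix)[high_cor_idx]
```

Since RI and Ca are the only pair above this cutoff, the function should flag one of the two.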

  2. Do there appear to be any outliers in the data? Are any predictors skewed?

    Yes, several predictors are skewed and have outliers. For example, the values for the Fe element are strongly right-skewed with outliers, as illustrated by the summary and the box plot below.
summary(Glass$Fe)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05701 0.10000 0.51000
boxplot(Glass$Fe)
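The same box-plot rule can be applied to every predictor at once. The sketch below counts, for each predictor, the points that fall beyond the box-plot whiskers (more than 1.5 times the interquartile range from the box), using base R's boxplot.stats; outlier_counts is an illustrative name.

```r
library(mlbench)

data(Glass)

# Number of observations beyond the box-plot whiskers, per predictor
outlier_counts <- sapply(Glass[, 1:9], function(x) length(boxplot.stats(x)$out))

outlier_counts
```

Points flagged this way are only candidates for closer inspection; for heavily skewed predictors such as Fe, many of them simply reflect the long tail.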

  3. Are there any relevant transformations of one or more predictors that might improve the classification model?

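    Yes, the right-skewed predictors K, Ba, and Fe could benefit from a log-type transformation that brings them closer to normal (Mg is left-skewed, so a log would not help there). A plain log, which is equivalent to Box-Cox with \(\lambda = 0\), is undefined for the zeros these predictors contain, so the zeros must be handled first, for example with log1p(), or with a Yeo-Johnson transformation, which accepts zeros. For example, here’s the plot for the log-transformed Fe element; note that hist() silently drops the -Inf values produced by log(0), so only the nonzero Fe observations appear.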
hist(log(Glass$Fe))
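One way to transform predictors that contain zeros is the Yeo-Johnson option of caret's preProcess function. The following is a sketch under that assumption, comparing skewness before and after; the object names are illustrative, and for some predictors preProcess may leave the values unchanged if it cannot estimate a transformation parameter.

```r
library(mlbench)
library(caret)
library(e1071)

data(Glass)
predictors <- Glass[, 1:9]

# How many Fe values are exactly zero (these become -Inf under a plain log)
sum(predictors$Fe == 0)

# Estimate Yeo-Johnson transformations and apply them
yj <- preProcess(predictors, method = "YeoJohnson")
predictors_yj <- predict(yj, predictors)

# Skewness before and after the transformation
round(cbind(before = sapply(predictors, skewness),
            after  = sapply(predictors_yj, skewness)), 2)
```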


3.2 Soybean data from the UC Irvine Machine Learning Repository

data(Soybean)
  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

    caret's nearZeroVar function flags 3 predictors with degenerate (near-zero-variance) distributions:

    nzv <- nearZeroVar(Soybean)
    names(Soybean)[nzv]
    ## [1] "leaf.mild" "mycelium"  "sclerotia"

    The bar plot below shows the percentage of all observations at each level of the leaf.mild predictor; nearly all of the non-missing observations fall in level 0.

    barplot((table(Soybean$leaf.mild) * 100) / nrow(Soybean))
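To see why these three predictors are flagged, nearZeroVar can return its underlying metrics instead of column indices. This sketch uses the function's saveMetrics argument with the default thresholds (a frequency ratio above 95/5 together with fewer than 10% unique values).

```r
library(mlbench)
library(caret)

data(Soybean)

# freqRatio: count of the most common value divided by the count of the second most common
# percentUnique: number of unique values as a percentage of the number of rows
nzv_metrics <- nearZeroVar(Soybean, saveMetrics = TRUE)

# Keep only the rows for predictors flagged as near-zero variance
nzv_metrics[nzv_metrics$nzv, ]
```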

  2. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

    Based on the summary below, 4 predictors (hail, sever, seed.tmt, and lodging) have the most missing values, with 121 each:

    summary(Soybean)
    ##                  Class          date     plant.stand  precip      temp    
    ##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
    ##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
    ##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
    ##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
    ##  anthracnose        : 44   6      : 90                                    
    ##  brown-stem-rot     : 44   (Other):101                                    
    ##  (Other)            :233   NA's   :  1                                    
    ##    hail     crop.hist  area.dam    sever     seed.tmt     germ    
    ##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165  
    ##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213  
    ##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193  
    ##             3   :218   3   :187   NA's:121   NA's:121   NA's:112  
    ##             NA's: 16   NA's:  1                                   
    ##                                                                   
    ##                                                                   
    ##  plant.growth leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread
    ##  0   :441     0: 77   0   :221   0   :357   0   : 51   0   :487   
    ##  1   :226     1:606   1   : 36   1   : 21   1   :327   1   : 96   
    ##  NA's: 16             2   :342   2   :221   2   :221   NA's:100   
    ##                       NA's: 84   NA's: 84   NA's: 84              
    ##                                                                   
    ##                                                                   
    ##                                                                   
    ##  leaf.malf  leaf.mild    stem     lodging    stem.cankers canker.lesion
    ##  0   :554   0   :535   0   :296   0   :520   0   :379     0   :320     
    ##  1   : 45   1   : 20   1   :371   1   : 42   1   : 39     1   : 83     
    ##  NA's: 84   2   : 20   NA's: 16   NA's:121   2   : 36     2   :177     
    ##             NA's:108                         3   :191     3   : 65     
    ##                                              NA's: 38     NA's: 38     
    ##                                                                        
    ##                                                                        
    ##  fruiting.bodies ext.decay  mycelium   int.discolor sclerotia  fruit.pods
    ##  0   :473        0   :497   0   :639   0   :581     0   :625   0   :407  
    ##  1   :104        1   :135   1   :  6   1   : 44     1   : 20   1   :130  
    ##  NA's:106        2   : 13   NA's: 38   2   : 20     NA's: 38   2   : 14  
    ##                  NA's: 38              NA's: 38                3   : 48  
    ##                                                                NA's: 84  
    ##                                                                          
    ##                                                                          
    ##  fruit.spots   seed     mold.growth seed.discolor seed.size  shriveling
    ##  0   :345    0   :476   0   :524    0   :513      0   :532   0   :539  
    ##  1   : 75    1   :115   1   : 67    1   : 64      1   : 59   1   : 38  
    ##  2   : 57    NA's: 92   NA's: 92    NA's:106      NA's: 92   NA's:106  
    ##  4   :100                                                              
    ##  NA's:106                                                              
    ##                                                                        
    ##                                                                        
    ##   roots    
    ##  0   :551  
    ##  1   : 86  
    ##  2   : 15  
    ##  NA's: 31  
    ##            
    ##            
    ## 
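The long summary above can be condensed into a single ranked list of missing-value counts. This is a minimal sketch; na_counts is an illustrative name.

```r
library(mlbench)

data(Soybean)

# Number of missing values per column, largest first
na_counts <- sort(colSums(is.na(Soybean)), decreasing = TRUE)
head(na_counts, 10)

# Percentage of rows that contain at least one missing value
round(mean(!complete.cases(Soybean)) * 100, 1)
```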

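    The missing data are clearly related to class. For lodging, all 121 missing values come from just five classes (2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury, and phytophthora-rot), as the plot and table below show. The pattern appears to be nearly identical for the other predictors with the most missing data.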

    barplot(table(Soybean[is.na(Soybean$lodging), 'Class']))

    table(Soybean[is.na(Soybean$lodging), 'Class'])
    ## 
    ##                2-4-d-injury         alternarialeaf-spot 
    ##                          16                           0 
    ##                 anthracnose            bacterial-blight 
    ##                           0                           0 
    ##           bacterial-pustule                  brown-spot 
    ##                           0                           0 
    ##              brown-stem-rot                charcoal-rot 
    ##                           0                           0 
    ##               cyst-nematode diaporthe-pod-&-stem-blight 
    ##                          14                          15 
    ##       diaporthe-stem-canker                downy-mildew 
    ##                           0                           0 
    ##          frog-eye-leaf-spot            herbicide-injury 
    ##                           0                           8 
    ##      phyllosticta-leaf-spot            phytophthora-rot 
    ##                           0                          68 
    ##              powdery-mildew           purple-seed-stain 
    ##                           0                           0 
    ##        rhizoctonia-root-rot 
    ##                           0
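The same question can be asked across all predictors at once, rather than for lodging alone. The sketch below computes, for each class, the proportion of rows with at least one missing value; prop_incomplete is an illustrative name.

```r
library(mlbench)

data(Soybean)

# TRUE for rows that have a missing value in any column
incomplete <- !complete.cases(Soybean)

# Proportion of incomplete rows within each class
prop_incomplete <- tapply(incomplete, Soybean$Class, mean)

round(sort(prop_incomplete, decreasing = TRUE), 2)
```

Classes with a proportion near 1 are those whose observations are almost always incomplete, which is the pattern to keep in mind when choosing an imputation strategy.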
  3. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

    Given that roughly 18% of the data are missing, it may be better to impute the missing values than to discard them. A common pattern in the data is that, within a given class, a predictor takes on only a single value, so imputing the missing entries for that class with that exact value should be safe. Such predictors may, however, be eliminated from the model anyway if they have zero or near-zero variance. Here’s an illustration of this point for the phytophthora-rot class:

    summary(Soybean[Soybean$Class == 'phytophthora-rot',])
    ##                  Class    date   plant.stand precip temp     hail   
    ##  phytophthora-rot   :88   0: 7   0: 0        0: 0   0: 9   0   :14  
    ##  2-4-d-injury       : 0   1:23   1:88        1:30   1:51   1   : 6  
    ##  alternarialeaf-spot: 0   2:25               2:58   2:28   NA's:68  
    ##  anthracnose        : 0   3:27                                      
    ##  bacterial-blight   : 0   4: 6                                      
    ##  bacterial-pustule  : 0   5: 0                                      
    ##  (Other)            : 0   6: 0                                      
    ##  crop.hist area.dam  sever    seed.tmt    germ    plant.growth leaves
    ##  0: 6      0: 0     0   : 0   0   :10   0   : 7   0: 0         0: 0  
    ##  1:20      1:87     1   : 7   1   :10   1   : 7   1:88         1:88  
    ##  2:32      2: 0     2   :13   2   : 0   2   : 6                      
    ##  3:30      3: 1     NA's:68   NA's:68   NA's:68                      
    ##                                                                      
    ##                                                                      
    ##                                                                      
    ##  leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem  
    ##  0   :33   0   : 0   0   : 0   0   :33     0   :33   0   :33   0: 0  
    ##  1   : 0   1   : 0   1   : 0   1   : 0     1   : 0   1   : 0   1:88  
    ##  2   : 0   2   :33   2   :33   NA's:55     NA's:55   2   : 0         
    ##  NA's:55   NA's:55   NA's:55                         NA's:55         
    ##                                                                      
    ##                                                                      
    ##                                                                      
    ##  lodging   stem.cankers canker.lesion fruiting.bodies ext.decay mycelium
    ##  0   :18   0: 6         0: 0          0   :20         0:69      0:88    
    ##  1   : 2   1:19         1: 0          1   : 0         1: 6      1: 0    
    ##  NA's:68   2:30         2:88          NA's:68         2:13              
    ##            3:33         3: 0                                            
    ##                                                                         
    ##                                                                         
    ##                                                                         
    ##  int.discolor sclerotia fruit.pods fruit.spots   seed    mold.growth
    ##  0:88         0:88      0   : 0    0   : 0     0   :20   0   :20    
    ##  1: 0         1: 0      1   : 0    1   : 0     1   : 0   1   : 0    
    ##  2: 0                   2   : 0    2   : 0     NA's:68   NA's:68    
    ##                         3   :20    4   :20                          
    ##                         NA's:68    NA's:68                          
    ##                                                                     
    ##                                                                     
    ##  seed.discolor seed.size shriveling roots 
    ##  0   :20       0   :20   0   :20    0:20  
    ##  1   : 0       1   : 0   1   : 0    1:68  
    ##  NA's:68       NA's:68   NA's:68    2: 0  
    ##                                           
    ##                                           
    ##                                           
    ## 


    For predictors that take several values within a class, a more sophisticated method of imputation should be used. One option is a K-nearest neighbor model, which fills in a missing value from the most similar observations; if a predictor with missing values is highly correlated with another predictor, that relationship can also be used to predict the missing values.
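As a simple baseline for the class-specific strategy described above (not the K-nearest neighbor approach), the sketch below fills each missing value with the most frequent level of that predictor within the same class, falling back to the overall most frequent level when a class has no observed values. The helpers mode_level and impute_by_class are hypothetical names written for this illustration.

```r
library(mlbench)

data(Soybean)

# Hypothetical helper: most frequent non-missing level of a factor,
# or NA if every value is missing
mode_level <- function(x) {
  counts <- table(x)
  if (sum(counts) == 0) return(NA_character_)
  names(counts)[which.max(counts)]
}

# Hypothetical helper: impute each predictor with its within-class mode
impute_by_class <- function(df, class_col = "Class") {
  for (p in setdiff(names(df), class_col)) {
    overall_mode <- mode_level(df[[p]])
    for (cl in levels(df[[class_col]])) {
      in_class <- df[[class_col]] == cl
      rows <- which(in_class & is.na(df[[p]]))
      if (length(rows) == 0) next
      fill <- mode_level(df[[p]][in_class])
      # Fall back to the overall mode when the class has no observed values
      if (is.na(fill)) fill <- overall_mode
      df[[p]][rows] <- fill
    }
  }
  df
}

soy_imputed <- impute_by_class(Soybean)

# Remaining missing values after imputation
sum(is.na(soy_imputed))
```

Because the fill values are derived from the class label, this baseline is only suitable for exploring the data; in a real modeling workflow the class of a new sample is unknown, so an imputation that relies on predictors alone would be needed.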