3.1.

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

library(mlbench)
library(tidyverse)
library(GGally)
library(corrplot)
library(e1071)
library(caret)
library(car)
library(VIM)
library(mice)
data(Glass)
str(Glass) 
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a)

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

ggpairs(Glass[, -10], lower=list(continuous='smooth'), progress = F)

ggplot(gather(Glass[,-10]), aes(value)) + 
    geom_histogram(bins = 15, fill = 'darkgreen') + 
    facet_wrap(~key, scales = 'free_x')

The distributions all seem skewed. There seems to be a strong positive correlation between the predictors RI and Ca, while RI is negatively correlated with both Al and Si.

pred_cors <- cor(Glass[,-10])
corrplot(pred_cors, method="circle")

Refractive index (RI) and Ca are highly correlated, with a correlation coefficient of 0.81.
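Reading pairs off a correlation plot gets harder as the number of predictors grows, so it can help to list the strongly correlated pairs directly. The following is a minimal base-R sketch on made-up data; the helper name high_cor_pairs and the 0.75 cutoff are assumptions chosen for illustration and are not part of the analysis above.

```r
# Hypothetical helper: list predictor pairs whose absolute correlation exceeds a cutoff.
high_cor_pairs <- function(df, cutoff = 0.75) {
  cors <- cor(df)
  # Look only at the upper triangle so each pair is reported once
  idx <- which(upper.tri(cors) & abs(cors) > cutoff, arr.ind = TRUE)
  data.frame(
    var1 = rownames(cors)[idx[, "row"]],
    var2 = colnames(cors)[idx[, "col"]],
    r    = cors[idx],
    stringsAsFactors = FALSE
  )
}

# Toy data (not the Glass data): b is built from a, c is independent noise
set.seed(1)
a <- rnorm(100)
toy <- data.frame(a = a, b = a + rnorm(100, sd = 0.3), c = rnorm(100))
pairs_found <- high_cor_pairs(toy)
print(pairs_found)
```

The same call could be applied to the numeric predictors, for example high_cor_pairs(Glass[, -10]).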

(b)

Do there appear to be any outliers in the data?

ggplot(stack(Glass[,-10]), aes(x = ind, y = values)) + 
  geom_boxplot(outlier.colour="darkgreen", outlier.shape=4, outlier.size=2) +  
   labs(x = "Predictors", y = "Values") +
  theme(text = element_text(size=10), axis.text.x = element_text(angle = 90, hjust = 1)) 

There are outliers present. By the boxplot rule (points more than 1.5 times the interquartile range beyond the quartiles), 17 samples are flagged for the refractive index, as listed below. Each of the other predictors has to be judged against its own distribution; comparing a column with the RI outlier values only returns integer(0) and says nothing about that column, so the last call computes the flagged rows separately for every predictor.

outliers <- boxplot(Glass$RI, plot = FALSE)$out
which(Glass$RI %in% outliers)
##  [1]  48  51  57 104 105 106 107 108 111 112 113 132 171 185 186 188 190
# Row indices of boxplot outliers for every predictor, each measured against its own quartiles
lapply(Glass[, -10], function(x) which(x %in% boxplot.stats(x)$out))

Are any predictors skewed?

apply(Glass[,-10], 2, skewness)
##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107

These results confirm my impression that the predictors are skewed. RI, K, Ca, Ba and Fe are strongly right skewed (skewness above 1.5), with K the most extreme at 6.46. Mg is left skewed (-1.14). Na, Al and Si are closer to a bell shape but still mildly skewed.
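For reference, the statistic reported by e1071::skewness with its default type = 3 is g1 * ((n - 1) / n)^(3/2), where g1 = m3 / m2^(3/2) and m2, m3 are the second and third central sample moments. A minimal base-R sketch of that calculation on toy vectors follows; the function name skew_type3 is only illustrative.

```r
# Sample skewness, following the definition e1071 documents for type = 3.
skew_type3 <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  d <- x - mean(x)
  m2 <- sum(d^2) / n          # second central moment
  m3 <- sum(d^3) / n          # third central moment
  g1 <- m3 / m2^(3 / 2)
  g1 * ((n - 1) / n)^(3 / 2)
}

# Toy vectors (not the Glass data)
sym   <- c(1, 2, 3, 4, 5)     # symmetric
right <- c(1, 1, 1, 2, 10)    # long right tail
left  <- -right               # mirror image, long left tail

print(c(sym = skew_type3(sym), right = skew_type3(right), left = skew_type3(left)))
```

A positive value indicates a long right tail, a negative value a long left tail, and a value near zero a roughly symmetric distribution.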

(c)

Are there any relevant transformations of one or more predictors that might improve the classification model?

glass_transformed <- preProcess(Glass[,-10], method = c("BoxCox", "center", "scale")) 
new_data <- predict(glass_transformed, Glass[,-10])
ggplot(stack(new_data), aes(x = ind, y = values)) + 
  geom_boxplot(outlier.colour="darkgreen", outlier.shape=4, outlier.size=2) +  
  labs(x = "Predictors", y = "Values") +
  theme(text = element_text(size=10), axis.text.x = element_text(angle = 90, hjust = 1)) 

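After centering and scaling, every predictor has mean 0 and standard deviation 1, which is why the boxplots line up around 0. The Box-Cox step targets the skewness found in part (b), but it is defined only for strictly positive values, so predictors that contain zeros, such as Ba and Fe, are left untransformed by it.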
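The Box-Cox family itself is simple: (x^lambda - 1) / lambda for lambda not equal to 0, and log(x) for lambda equal to 0. The sketch below applies it by hand to toy data. caret estimates lambda from the data, whereas here lambda is supplied directly, and the function name box_cox is an assumption for illustration.

```r
# Box-Cox transformation with a user-supplied lambda (toy sketch).
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))                 # defined only for strictly positive values
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Toy right-skewed data: powers of two
x <- 2^(-3:3)

raw_gap <- mean(x) - median(x)          # large gap: mean pulled up by the right tail
log_x   <- box_cox(x, 0)                # lambda = 0 is the log transform
log_gap <- mean(log_x) - median(log_x)  # gap vanishes: log values are symmetric

print(c(raw_gap = raw_gap, log_gap = log_gap))
```

For predictors that contain zeros, a transformation that accepts non-positive values (for example method = "YeoJohnson" in preProcess) is one option to consider.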

3.2.

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

data(Soybean) 
summary(Soybean)
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
## 

(a)

Investigate the frequency distributions for the categorical predictors.

par(mfrow = c(3, 6))
for (i in 1:ncol(Soybean)) {
  barplot(table(Soybean[ ,i]), col = 'orange', ylab = names(Soybean[i]))
}

Are any of the distributions degenerate in the ways discussed earlier in this chapter?

" some models can be crippled by predictors with degenerate distributions. In these cases, there can be a significant improvement in model performance and/or stability without the problematic variables. Consider a predictor variable that has a single unique value; we refer to this type of data as a zero variance predictor" – Applied Predictive Modeling.

Based on this definition, the nearZeroVar function from caret helps answer the question.

nearZeroVar(Soybean, saveMetrics= TRUE)
##                  freqRatio percentUnique zeroVar   nzv
## Class             1.010989     2.7818448   FALSE FALSE
## date              1.137405     1.0248902   FALSE FALSE
## plant.stand       1.208191     0.2928258   FALSE FALSE
## precip            4.098214     0.4392387   FALSE FALSE
## temp              1.879397     0.4392387   FALSE FALSE
## hail              3.425197     0.2928258   FALSE FALSE
## crop.hist         1.004587     0.5856515   FALSE FALSE
## area.dam          1.213904     0.5856515   FALSE FALSE
## sever             1.651282     0.4392387   FALSE FALSE
## seed.tmt          1.373874     0.4392387   FALSE FALSE
## germ              1.103627     0.4392387   FALSE FALSE
## plant.growth      1.951327     0.2928258   FALSE FALSE
## leaves            7.870130     0.2928258   FALSE FALSE
## leaf.halo         1.547511     0.4392387   FALSE FALSE
## leaf.marg         1.615385     0.4392387   FALSE FALSE
## leaf.size         1.479638     0.4392387   FALSE FALSE
## leaf.shread       5.072917     0.2928258   FALSE FALSE
## leaf.malf        12.311111     0.2928258   FALSE FALSE
## leaf.mild        26.750000     0.4392387   FALSE  TRUE
## stem              1.253378     0.2928258   FALSE FALSE
## lodging          12.380952     0.2928258   FALSE FALSE
## stem.cankers      1.984293     0.5856515   FALSE FALSE
## canker.lesion     1.807910     0.5856515   FALSE FALSE
## fruiting.bodies   4.548077     0.2928258   FALSE FALSE
## ext.decay         3.681481     0.4392387   FALSE FALSE
## mycelium        106.500000     0.2928258   FALSE  TRUE
## int.discolor     13.204545     0.4392387   FALSE FALSE
## sclerotia        31.250000     0.2928258   FALSE  TRUE
## fruit.pods        3.130769     0.5856515   FALSE FALSE
## fruit.spots       3.450000     0.5856515   FALSE FALSE
## seed              4.139130     0.2928258   FALSE FALSE
## mold.growth       7.820896     0.2928258   FALSE FALSE
## seed.discolor     8.015625     0.2928258   FALSE FALSE
## seed.size         9.016949     0.2928258   FALSE FALSE
## shriveling       14.184211     0.2928258   FALSE FALSE
## roots             6.406977     0.4392387   FALSE FALSE

Yes. No predictor has zero variance, but three are flagged as near-zero variance: leaf.mild, mycelium and sclerotia.
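The two metrics in the table can be reproduced by hand. The sketch below rebuilds the mycelium column from the counts in the summary (639 zeros, 6 ones, 38 missing) and applies the rule as I understand caret's defaults (freqCut = 95/5, uniqueCut = 10). The function name nzv_metrics is hypothetical.

```r
# Hand-rolled version of the near-zero-variance metrics (illustrative only).
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)   # table() drops NA by default
  counts <- counts[counts > 0]
  # Ratio of the most common value to the second most common value
  freq_ratio <- if (length(counts) > 1) {
    as.numeric(counts[1]) / as.numeric(counts[2])
  } else {
    Inf
  }
  # Distinct non-missing values as a percentage of all rows
  percent_unique <- 100 * length(counts) / length(x)
  list(
    freq_ratio     = freq_ratio,
    percent_unique = percent_unique,
    nzv            = freq_ratio > freq_cut && percent_unique <= unique_cut
  )
}

# mycelium rebuilt from the summary counts: 639 zeros, 6 ones, 38 NA
mycelium <- c(rep("0", 639), rep("1", 6), rep(NA, 38))
m <- nzv_metrics(mycelium)
str(m)
```

This gives a frequency ratio of 106.5 and a unique percentage of about 0.29, matching the mycelium row of the nearZeroVar output.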

(b)

Roughly 18% of the data are missing.

Are there particular predictors that are more likely to be missing?

aggr(Soybean, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(Soybean), cex.axis=.7, oma = c(7,4,2,2), gap=3, ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##         Variable       Count
##             hail 0.177159590
##            sever 0.177159590
##         seed.tmt 0.177159590
##          lodging 0.177159590
##             germ 0.163982430
##        leaf.mild 0.158125915
##  fruiting.bodies 0.155197657
##      fruit.spots 0.155197657
##    seed.discolor 0.155197657
##       shriveling 0.155197657
##      leaf.shread 0.146412884
##             seed 0.134699854
##      mold.growth 0.134699854
##        seed.size 0.134699854
##        leaf.halo 0.122986823
##        leaf.marg 0.122986823
##        leaf.size 0.122986823
##        leaf.malf 0.122986823
##       fruit.pods 0.122986823
##           precip 0.055636896
##     stem.cankers 0.055636896
##    canker.lesion 0.055636896
##        ext.decay 0.055636896
##         mycelium 0.055636896
##     int.discolor 0.055636896
##        sclerotia 0.055636896
##      plant.stand 0.052708638
##            roots 0.045387994
##             temp 0.043923865
##        crop.hist 0.023426061
##     plant.growth 0.023426061
##             stem 0.023426061
##             date 0.001464129
##         area.dam 0.001464129
##            Class 0.000000000
##           leaves 0.000000000

Alternatively, with the Amelia package:

Amelia::missmap(Soybean)
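The share of missing values per variable can also be computed without any plotting package. A minimal base-R sketch on a toy data frame (not the Soybean data) is shown here; the column names are made up.

```r
# Toy data frame with missing values in two columns
toy <- data.frame(
  Class = c("a", "a", "b", "b", "c", "c"),
  x1    = c(NA, NA, 3, NA, 5, 6),
  x2    = c(NA, NA, 1, 1, 1, 1),
  x3    = 1:6
)

# is.na() returns a logical matrix; column means are the missing proportions
miss_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(miss_share)
```

Applied to the real data, sort(colMeans(is.na(Soybean)), decreasing = TRUE) should give the same kind of ranking as the table printed by aggr.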

Is the pattern of missing data related to the classes?

miss_df <- Soybean %>% group_by(Class) %>% 
  summarise_all(~sum(is.na(.))) %>%
  transmute(Class, na_count = rowSums(.[-1]))

miss_df
## # A tibble: 19 x 2
##    Class                       na_count
##    <fct>                          <dbl>
##  1 2-4-d-injury                     450
##  2 alternarialeaf-spot                0
##  3 anthracnose                        0
##  4 bacterial-blight                   0
##  5 bacterial-pustule                  0
##  6 brown-spot                         0
##  7 brown-stem-rot                     0
##  8 charcoal-rot                       0
##  9 cyst-nematode                    336
## 10 diaporthe-pod-&-stem-blight      177
## 11 diaporthe-stem-canker              0
## 12 downy-mildew                       0
## 13 frog-eye-leaf-spot                 0
## 14 herbicide-injury                 160
## 15 phyllosticta-leaf-spot             0
## 16 phytophthora-rot                1214
## 17 powdery-mildew                     0
## 18 purple-seed-stain                  0
## 19 rhizoctonia-root-rot               0

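Yes. All of the missing values fall in five classes (2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury and phytophthora-rot), while the other 14 classes have none. Missingness is therefore strongly tied to the class, which suggests that the pattern of missing values is itself informative about these classes.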
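Raw NA counts depend on class size (phytophthora-rot, for example, has 88 rows), so the share of incomplete rows per class can be easier to compare. A minimal base-R sketch on toy data follows; the data frame and its column names are invented for illustration.

```r
# Toy data: all missing values sit in class "a"
toy <- data.frame(
  Class = factor(c("a", "a", "a", "b", "b", "c", "c")),
  x1    = c(NA, 2, NA, 1, 2, 3, 4),
  x2    = c(NA, NA, 1, 1, 1, 1, 1)
)

# TRUE for rows with at least one missing predictor
incomplete <- !complete.cases(toy[, -1])

# Share of incomplete rows within each class
share_incomplete <- tapply(incomplete, toy$Class, mean)

# Total number of missing cells within each class
na_by_class <- rowsum(rowSums(is.na(toy[, -1])), toy$Class)

print(share_incomplete)
print(na_by_class)
```

A share of 1 for a class means every row of that class has at least one missing predictor.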

(c)

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Imputation

temp_data <- mice(Soybean, m=1 , maxit=50, meth='pmm', seed=500, printFlag = F)
## Warning: Number of logged events: 3389
imputed_data <- complete(temp_data, 1)
summary(imputed_data)
##                  Class     date    plant.stand precip  temp    hail   
##  brown-spot         : 92   0: 26   0:388       0:102   0: 82   0:465  
##  alternarialeaf-spot: 91   1: 75   1:295       1:112   1:378   1:218  
##  frog-eye-leaf-spot : 91   2: 93               2:469   2:223          
##  phytophthora-rot   : 88   3:119                                      
##  anthracnose        : 44   4:131                                      
##  brown-stem-rot     : 44   5:149                                      
##  (Other)            :233   6: 90                                      
##  crop.hist area.dam sever   seed.tmt germ    plant.growth leaves  leaf.halo
##  0: 65     0:123    0:263   0:426    0:206   0:443        0: 77   0:250    
##  1:165     1:227    1:359   1:222    1:261   1:240        1:606   1: 36    
##  2:220     2:145    2: 61   2: 35    2:216                        2:397    
##  3:233     3:188                                                           
##                                                                            
##                                                                            
##                                                                            
##  leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem    lodging
##  0:409     0:106     0:564       0:624     0:643     0:296   0:587  
##  1: 38     1:327     1:119       1: 59     1: 20     1:387   1: 96  
##  2:236     2:250                           2: 20                    
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##  stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor
##  0:391        0:358         0:508           0:498     0:670    0:619       
##  1: 43        1: 83         1:175           1:169     1: 13    1: 44       
##  2: 43        2:177                         2: 16              2: 20       
##  3:206        3: 65                                                        
##                                                                            
##                                                                            
##                                                                            
##  sclerotia fruit.pods fruit.spots seed    mold.growth seed.discolor seed.size
##  0:625     0:417      0:423       0:490   0:616       0:545         0:597    
##  1: 58     1:173      1: 75       1:193   1: 67       1:138         1: 86    
##            2: 33      2: 59                                                  
##            3: 60      4:126                                                  
##                                                                              
##                                                                              
##                                                                              
##  shriveling roots  
##  0:609      0:552  
##  1: 74      1: 86  
##             2: 45  
##                    
##                    
##                    
## 
Amelia::missmap(imputed_data)

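All missing values were imputed with predictive mean matching (pmm), which fills each gap with an observed value borrowed from a case whose predicted value is similar, so we now have a complete dataset.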
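To make the idea behind predictive mean matching concrete, here is a deliberately simplified sketch for a single numeric variable on toy data. It is not what mice does internally: as far as I know, mice also draws the regression parameters at random and handles categorical variables, both of which are omitted here. The function name pmm_impute and the donor count k = 3 are assumptions.

```r
# Simplified predictive mean matching for one numeric variable (toy sketch).
pmm_impute <- function(y, x, k = 3) {
  d   <- data.frame(x = x, y = y)
  obs <- !is.na(d$y)
  fit  <- lm(y ~ x, data = d[obs, ])      # model fitted on observed cases only
  pred <- predict(fit, newdata = d)       # predicted mean for every row
  for (i in which(!obs)) {
    # Observed cases whose predictions are closest to the prediction for row i
    donors <- order(abs(pred[obs] - pred[i]))[seq_len(min(k, sum(obs)))]
    cand   <- d$y[obs][donors]
    # Borrow the observed value of one randomly chosen donor
    d$y[i] <- cand[sample.int(length(cand), 1)]
  }
  d$y
}

set.seed(500)
x <- 1:20
y <- 2 * x + rnorm(20)
y_miss <- y
y_miss[c(4, 11, 17)] <- NA                # knock out three values

y_imp <- pmm_impute(y_miss, x)
print(y_imp[c(4, 11, 17)])
```

Because every imputed value is copied from an observed case, the imputations always stay within the range of values that actually occur in the data.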