Exercise 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench) 
library(dplyr)
library(psych)
library(corrplot) 
library(e1071)
library(car)
library(caret)
library(tidyr)
data(Glass) 
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

The nine predictors are all numeric, and the “Type” column is a factor with six levels: 1, 2, 3, 5, 6, and 7. There are no missing values in the table.
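A quick check confirms the absence of missing values:

anyNA(Glass)
## [1] FALSE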

summary(Glass)
##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29

“Si” has the highest percentage by weight among the elements (69.81% to 75.41%), while “Fe” has the lowest (0% to 0.51%).

Types “1” and “2” have 70 and 76 samples respectively, about 68% of the total sample size. Type “6” has only 9 samples, the smallest of the six glass types.
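The class shares can be computed directly from the counts above:

# Proportion of samples per glass type
round(prop.table(table(Glass$Type)), 2)
##    1    2    3    5    6    7 
## 0.33 0.36 0.08 0.06 0.04 0.14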

Exercise 3.1.a

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

The relationship of predictors can be shown by several pairwise plots.

pairs.panels(Glass[, -10], show.points = FALSE, gap = 0)

“Na” and “Si” seem close to normally distributed.

“RI” and “Al” are slightly right-skewed.

“Fe”, “K”, “Ba”, and “Ca” are strongly right-skewed.

“Mg” does not appear to be normally distributed.

The correlation plot and the pairwise plots let us detect predictors that correlate strongly with each other.

corrplot(cor(Glass[,-10]))

“RI” - “Ca” (0.81)

“RI” - “Si” (-0.54)

Potentially this can cause a collinearity problem during the model-building process.
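One way to screen for this ahead of modeling is caret’s findCorrelation(), which suggests columns to drop above a chosen correlation cutoff; the 0.75 threshold below is an arbitrary choice for illustration, not a recommendation:

# Flag predictors involved in high pairwise correlations
findCorrelation(cor(Glass[, -10]), cutoff = 0.75, names = TRUE)

With these data it should flag one of the RI/Ca pair, since theirs is the only correlation above the cutoff.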

cor(Glass[,-10], as.numeric(Glass[,10]))
##            [,1]
## RI -0.168739357
## Na  0.506424080
## Mg -0.728159518
## Al  0.591197598
## Si  0.149690687
## K  -0.025834560
## Ca -0.008997841
## Ba  0.577676375
## Fe -0.183206747

The correlation between each element and the (numerically coded) glass type indicates that “Na”, “Mg”, “Al”, and “Ba” correlate most strongly with glass type (absolute correlation above 0.5), potentially making them good predictors of glass type. Since Type is a nominal factor, these correlations should be read only as a rough screening heuristic.

Exercise 3.1.b

Do there appear to be any outliers in the data? Are any predictors skewed?

The box plots below display each predictor’s distribution; any values outside the whiskers are conventionally considered outliers.

# Horizontal box plots for each predictor
predictors <- Glass[, -10]
par(mfrow = c(3, 3))
for (i in 1:ncol(predictors)) {
  boxplot(predictors[, i], ylab = names(predictors)[i], horizontal = TRUE)
}

apply(Glass[,-10],2,skewness)
##         RI         Na         Mg         Al         Si          K 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889 
##         Ca         Ba         Fe 
##  2.0184463  3.3686800  1.7298107

The box plots show that all predictors except Mg have outliers. Identifying outliers and deciding whether they are influential points is an important part of any modeling process.
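As a non-graphical companion to the box plots, base R’s boxplot.stats() returns the points beyond the whiskers, so we can count the flagged outliers per predictor (a quick sketch using the same 1.5 × IQR rule the plots use):

# Number of points beyond the whiskers for each predictor
sapply(Glass[, -10], function(x) length(boxplot.stats(x)$out))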

The computed skewness values confirm the findings discussed above: the most highly skewed predictors are K (6.46), Ba (3.37), Ca (2.02), and Fe (1.73).

Exercise 3.1.c

Are there any relevant transformations of one or more predictors that might improve the classification model?

The Yeo-Johnson transformation was selected as the normalizing transformation. It can be thought of as an extension of the Box-Cox transformation: it handles both positive and negative values, whereas Box-Cox handles only positive values. Both can be used to transform data so as to improve normality.
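For reference, the Yeo-Johnson transformation of a value y with parameter λ is:

$$
\psi(y,\lambda)=\begin{cases}
\left[(y+1)^{\lambda}-1\right]/\lambda & \lambda\neq 0,\; y\ge 0\\
\log(y+1) & \lambda=0,\; y\ge 0\\
-\left[(1-y)^{2-\lambda}-1\right]/(2-\lambda) & \lambda\neq 2,\; y<0\\
-\log(1-y) & \lambda=2,\; y<0
\end{cases}
$$

For non-negative y this is exactly the Box-Cox transformation applied to y + 1, which is why it extends Box-Cox to unrestricted data; preProcess() estimates a separate λ for each predictor.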

# Yeo-Johnson transformation
glass_trans <- preProcess(Glass[, -10], method = c("YeoJohnson"))
pred <- predict(glass_trans, Glass[, -10])

# Check skewness after the Yeo-Johnson transformation
apply(pred, 2, skewness)
##            RI            Na            Mg            Al            Si 
##  1.6027150827 -0.0088476749 -0.8770969306  0.0002128304 -0.7202392108 
##             K            Ca            Ba            Fe 
## -0.0708227694 -0.2063893005  3.3686799688  1.7298107096

The Yeo-Johnson method normalized some, but not all, of the variables. Na, Al, K, and Ca were transformed to approximately normal distributions, while the skewness of RI, Si, Ba, and Fe is essentially unchanged.
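Ba and Fe remain strongly skewed because both are dominated by zeros, which no monotone power transformation can spread out. A hedged follow-up, in the spirit of this chapter, is to add the spatial sign transformation after centering and scaling, which projects the samples onto a unit sphere and lessens the influence of the remaining outliers:

# Center, scale, Yeo-Johnson, then spatial sign (preProcess applies the
# operations in its own fixed order regardless of how method is written)
glass_ss <- preProcess(Glass[, -10],
                       method = c("center", "scale", "YeoJohnson", "spatialSign"))
pred_ss <- predict(glass_ss, Glass[, -10])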

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

Exercise 3.2.a

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

library(mlbench) 
data(Soybean) 
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

As we can see, all 36 variables are factors (some of them ordered).

# Bar plots of each variable's frequency distribution
X <- Soybean[, 1:36]
par(mfrow = c(3, 6))
for (i in 1:ncol(X)) {
  barplot(table(X[, i]), xlab = names(X)[i])
}

A degenerate distribution is a probability distribution with support only on a space of lower dimension; in this context, it means a predictor whose values are almost entirely a single level. As the book notes: “Some models can be crippled by predictors with degenerate distributions. In these cases, there can be a significant improvement in model performance and/or stability without the problematic variables… such an uninformative variable may have little effect on the calculations.”

The plots indicate that the variables “sclerotia”, “leaf.mild”, and “mycelium” are close to zero-variance predictors (predictors whose values are almost entirely a single level).

Using nearZeroVar() we can confirm which variables are near-zero-variance predictors.

Per the caret documentation, nearZeroVar diagnoses predictors that have one unique value (i.e., zero-variance predictors) or predictors that have both of the following characteristics: very few unique values relative to the number of samples, and a large ratio of the frequency of the most common value to the frequency of the second most common value.

https://www.rdocumentation.org/packages/caret/versions/6.0-84/topics/nearZeroVar

# Near zero variance predictors
library(caret) 

nearZeroVar(Soybean)
## [1] 19 26 28
nearZeroVar(Soybean, names = TRUE)
## [1] "leaf.mild" "mycelium"  "sclerotia"

nearZeroVar() confirms that “sclerotia”, “leaf.mild”, and “mycelium” are near-zero-variance predictors. These variables can be removed before the model-building process.
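Dropping them is then a one-liner:

# Remove the near-zero-variance predictors before modeling
Soybean_filtered <- Soybean[, -nearZeroVar(Soybean)]
dim(Soybean_filtered)
## [1] 683  33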

Exercise 3.2.b

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

The percentage of missing values in each variable is calculated below:

# Percentage of missing values per variable
colMeans(is.na(Soybean)) * 100
##           Class            date     plant.stand          precip 
##       0.0000000       0.1464129       5.2708638       5.5636896 
##            temp            hail       crop.hist        area.dam 
##       4.3923865      17.7159590       2.3426061       0.1464129 
##           sever        seed.tmt            germ    plant.growth 
##      17.7159590      17.7159590      16.3982430       2.3426061 
##          leaves       leaf.halo       leaf.marg       leaf.size 
##       0.0000000      12.2986823      12.2986823      12.2986823 
##     leaf.shread       leaf.malf       leaf.mild            stem 
##      14.6412884      12.2986823      15.8125915       2.3426061 
##         lodging    stem.cankers   canker.lesion fruiting.bodies 
##      17.7159590       5.5636896       5.5636896      15.5197657 
##       ext.decay        mycelium    int.discolor       sclerotia 
##       5.5636896       5.5636896       5.5636896       5.5636896 
##      fruit.pods     fruit.spots            seed     mold.growth 
##      12.2986823      15.5197657      13.4699854      13.4699854 
##   seed.discolor       seed.size      shriveling           roots 
##      15.5197657      13.4699854      15.5197657       4.5387994

The variables with the largest percentage of missing values (17.7% each) are “hail”, “sever”, “seed.tmt”, and “lodging”; these predictors are the most likely to be missing.

Apart from checking NAs in each predictor, we also check whether particular classes account for the missing data.

# Count incomplete rows per class
Soybean %>%
  filter(!complete.cases(.)) %>%
  group_by(Class) %>%
  summarise(na = n()) %>%
  arrange(desc(na))
## # A tibble: 5 x 2
##   Class                          na
##   <fct>                       <int>
## 1 phytophthora-rot               68
## 2 2-4-d-injury                   16
## 3 diaporthe-pod-&-stem-blight    15
## 4 cyst-nematode                  14
## 5 herbicide-injury                8

“phytophthora-rot” has by far the most incomplete rows. “2-4-d-injury”, “diaporthe-pod-&-stem-blight”, and “cyst-nematode” also have missing values, but considerably fewer, and the remaining 14 classes have none. We can conclude that the pattern of missing data is related to the classes.
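Relating these counts to each class’s size makes the pattern even clearer; in fact, the counts for “2-4-d-injury”, “diaporthe-pod-&-stem-blight”, “cyst-nematode”, and “herbicide-injury” appear to match those classes’ total sample sizes, i.e. every one of their rows is incomplete:

# Share of incomplete rows within each class, largest first
incomplete <- !complete.cases(Soybean)
sort(tapply(incomplete, Soybean$Class, mean), decreasing = TRUE)[1:5]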

Exercise 3.2.c

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

If the missingness is not informative we could potentially remove predictors, but quite a lot of the data is missing. Let’s examine the distribution of missing values further with aggr() from the VIM package. This function plots the amount of missing/imputed values in each variable and the amount of missing/imputed values in certain combinations of variables.

library(VIM)
aggr(Soybean, col = c("grey", "pink"), sortVars = TRUE, numbers = TRUE, cex.axis = 0.5)

## 
##  Variables sorted by number of missings: 
##         Variable       Count
##             hail 0.177159590
##            sever 0.177159590
##         seed.tmt 0.177159590
##          lodging 0.177159590
##             germ 0.163982430
##        leaf.mild 0.158125915
##  fruiting.bodies 0.155197657
##      fruit.spots 0.155197657
##    seed.discolor 0.155197657
##       shriveling 0.155197657
##      leaf.shread 0.146412884
##             seed 0.134699854
##      mold.growth 0.134699854
##        seed.size 0.134699854
##        leaf.halo 0.122986823
##        leaf.marg 0.122986823
##        leaf.size 0.122986823
##        leaf.malf 0.122986823
##       fruit.pods 0.122986823
##           precip 0.055636896
##     stem.cankers 0.055636896
##    canker.lesion 0.055636896
##        ext.decay 0.055636896
##         mycelium 0.055636896
##     int.discolor 0.055636896
##        sclerotia 0.055636896
##      plant.stand 0.052708638
##            roots 0.045387994
##             temp 0.043923865
##        crop.hist 0.023426061
##     plant.growth 0.023426061
##             stem 0.023426061
##             date 0.001464129
##         area.dam 0.001464129
##            Class 0.000000000
##           leaves 0.000000000

There is a lot of missing data for some predictors. It may be possible to remove the predictor with the largest share of missing values (for example, “hail”) if it is not informative, which we can check with a chi-square test (see the sketch below). Even after such a removal, however, we would still be left with many missing values. Several techniques help deal with missing values; we apply one of them, a kNN-based method. The assumption behind using kNN for imputation is that a missing value can be approximated by the values of the points closest to it, based on the other variables.
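A hedged sketch of that chi-square check (simulate.p.value avoids the warning chisq.test() emits when expected cell counts are small):

# Association between observed "hail" values and Class
# (table() drops the NA rows by default)
chisq.test(table(Soybean$hail, Soybean$Class), simulate.p.value = TRUE)

# Is the *missingness* of "hail" itself related to Class?
chisq.test(table(is.na(Soybean$hail), Soybean$Class), simulate.p.value = TRUE)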

kNN() performs imputation of missing data in a data frame using the k-Nearest Neighbour algorithm.

Soybean_imp <- kNN(Soybean, variable = "hail", k = 5)
summary(Soybean_imp)
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##  hail    crop.hist  area.dam    sever     seed.tmt     germ    
##  0:551   0   : 65   0   :123   0   :195   0   :305   0   :165  
##  1:132   1   :165   1   :227   1   :322   1   :222   1   :213  
##          2   :219   2   :145   2   : 45   2   : 35   2   :193  
##          3   :218   3   :187   NA's:121   NA's:121   NA's:112  
##          NA's: 16   NA's:  1                                   
##                                                                
##                                                                
##  plant.growth leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread
##  0   :441     0: 77   0   :221   0   :357   0   : 51   0   :487   
##  1   :226     1:606   1   : 36   1   : 21   1   :327   1   : 96   
##  NA's: 16             2   :342   2   :221   2   :221   NA's:100   
##                       NA's: 84   NA's: 84   NA's: 84              
##                                                                   
##                                                                   
##                                                                   
##  leaf.malf  leaf.mild    stem     lodging    stem.cankers canker.lesion
##  0   :554   0   :535   0   :296   0   :520   0   :379     0   :320     
##  1   : 45   1   : 20   1   :371   1   : 42   1   : 39     1   : 83     
##  NA's: 84   2   : 20   NA's: 16   NA's:121   2   : 36     2   :177     
##             NA's:108                         3   :191     3   : 65     
##                                              NA's: 38     NA's: 38     
##                                                                        
##                                                                        
##  fruiting.bodies ext.decay  mycelium   int.discolor sclerotia  fruit.pods
##  0   :473        0   :497   0   :639   0   :581     0   :625   0   :407  
##  1   :104        1   :135   1   :  6   1   : 44     1   : 20   1   :130  
##  NA's:106        2   : 13   NA's: 38   2   : 20     NA's: 38   2   : 14  
##                  NA's: 38              NA's: 38                3   : 48  
##                                                                NA's: 84  
##                                                                          
##                                                                          
##  fruit.spots   seed     mold.growth seed.discolor seed.size  shriveling
##  0   :345    0   :476   0   :524    0   :513      0   :532   0   :539  
##  1   : 75    1   :115   1   : 67    1   : 64      1   : 59   1   : 38  
##  2   : 57    NA's: 92   NA's: 92    NA's:106      NA's: 92   NA's:106  
##  4   :100                                                              
##  NA's:106                                                              
##                                                                        
##                                                                        
##   roots      hail_imp      
##  0   :551   Mode :logical  
##  1   : 86   FALSE:562      
##  2   : 15   TRUE :121      
##  NA's: 31                  
##                            
##                            
## 

As we can see, there are no missing values in the “hail” variable now. The same approach can be applied to the other variables.
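A sketch of extending this to every variable at once; kNN() by default appends logical *_imp indicator columns (like hail_imp above), which imp_var = FALSE suppresses:

# Impute all remaining variables with kNN (no indicator columns)
Soybean_full <- kNN(Soybean, k = 5, imp_var = FALSE)
anyNA(Soybean_full)
## [1] FALSE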