Exercise 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The data can be accessed via:

library(mlbench)
library(tidyr)
library(dplyr)
library(ggplot2)
library(corrplot)
library(e1071)
library(caret)
library(naniar)
library(mice)
library(VIM)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a)

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

# Converting into long format and creating Predictor as factor
melted <- Glass %>% pivot_longer(-Type, names_to = "Predictor", values_to = "Value", values_drop_na=TRUE) %>% 
  mutate(Predictor = as.factor(Predictor)) %>% arrange(Predictor)

# Explore the distribution of each factor in Predictor group
melted %>% 
  ggplot(., aes(Value, fill=Predictor))+geom_histogram(bins=20)+ facet_wrap(~Predictor,scales='free') + labs(title="Distribution of Predictors")+theme_minimal()

# Density plots (geom_density() has no bins argument, so it is omitted)
melted %>% 
  ggplot(., aes(Value, fill=Predictor))+geom_density()+ facet_wrap(~Predictor,scales='free') + labs(title="Distribution of Predictors")+theme_minimal()

# Correlation matrix
Glass %>% select(-Type) %>% cor() %>% corrplot(., method='color', type="upper", order="hclust", 
                                               addCoef.col = "black", tl.col="black", tl.srt=45, diag=FALSE)

According to the plots above, Si, Al, Na and RI are closer to a normal distribution than the other predictors. Si, Na and Ca make up the largest percentages of the glass. Most pairs of predictors are not strongly correlated. There is moderate correlation between RI and Si, Mg and Al, Mg and Ba, and Mg and Ca, and a strong positive correlation between Ca and RI.
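The correlation plot can be backed up with a ranked list of predictor pairs. The sketch below is one possible way to do that in base R; the names cor_mat and cor_pairs are illustrative and not part of the original analysis.

```r
library(mlbench)
data(Glass)

# Correlation matrix of the nine numeric predictors (Type is dropped)
cor_mat <- cor(Glass[, names(Glass) != "Type"])

# Keep each pair once by walking the upper triangle only
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)

# Sort by absolute correlation, strongest first
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
head(cor_pairs, 5)
```

Sorting on the absolute value keeps strong negative relationships near the top, which a color-only plot can make easy to overlook.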

(b)

Do there appear to be any outliers in the data? Are any predictors skewed?

# Plotting boxplot for all elements
melted %>% 
  ggplot(aes(Predictor, Value, fill=Predictor)) + geom_boxplot() +
  facet_wrap(~Predictor, scales='free') + coord_flip() +
  labs(title="Boxplot for Multiple Elements in Glass") + theme_classic() +
  theme(axis.text.x = element_text(angle=30, hjust=1))

# Summary statistics
summary(Glass[-10])
##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe         
##  Min.   :0.00000  
##  1st Qu.:0.00000  
##  Median :0.00000  
##  Mean   :0.05701  
##  3rd Qu.:0.10000  
##  Max.   :0.51000
# Skewness
Glass[-10] %>% apply(2, skewness)
##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107

According to the boxplots, every predictor except Mg shows some outliers, although most of the values are not extreme. The summary statistics show that the mean and median are close for most predictors; the clearest exception is Mg, with a mean of 2.69 and a median of 3.48, which points to a left skew. The skewness function quantifies this.

The skewness output shows that K is the most skewed (6.46), followed by Ba (3.37) and Ca (2.02). Fe (1.73), RI (1.60) and Mg (-1.14) are also noticeably skewed, while Na, Al and Si are the closest to symmetric.
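To put a number on what the boxplots show, the points beyond the whiskers can be counted per predictor. This is a minimal sketch using the standard 1.5 x IQR rule via boxplot.stats; the names predictors and outlier_counts are assumptions made for illustration.

```r
library(mlbench)
data(Glass)

# Numeric predictors only
predictors <- Glass[, names(Glass) != "Type"]

# boxplot.stats()$out returns the points beyond 1.5 * IQR from the hinges,
# i.e. the same points drawn individually in a default boxplot
outlier_counts <- sapply(predictors, function(x) length(boxplot.stats(x)$out))
sort(outlier_counts, decreasing = TRUE)
```

A predictor whose interquartile range is zero will have every non-zero value flagged, so a large count there reflects a spike at zero rather than measurement problems.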

(c)

Are there any relevant transformations of one or more predictors that might improve the classification model?

Let’s apply a Box-Cox transformation using the preProcess function from the caret package.

# Transformation, scaling and centering the data 
trans <- preProcess(Glass, method=c("BoxCox", "center","scale"))
trans2 <- predict(trans, Glass)

# Plot the transformed data
melted_bx <- trans2 %>% pivot_longer(-Type, names_to = "Predictor", values_to = "Value", values_drop_na=TRUE) %>% 
  mutate(Predictor = as.factor(Predictor)) %>% arrange(Predictor)

melted_bx %>% 
  ggplot(aes(Predictor, Value, fill=Predictor)) + geom_boxplot() +
  facet_wrap(~Predictor, scales='free') + coord_flip() +
  labs(title="Boxplot for Multiple Elements in Glass") + theme_classic() +
  theme(axis.text.x = element_text(angle=30, hjust=1))

# Skewness before the transformation
Glass[-10] %>% apply(2, skewness)
##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107
# Skewness after the transformation
trans2[-10] %>% apply(2, skewness)
##          RI          Na          Mg          Al          Si           K 
##  1.56566039  0.03384644 -1.13645228  0.09105899 -0.65090568  6.46008890 
##          Ca          Ba          Fe 
## -0.19395573  3.36867997  1.72981071

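Centering and scaling bring each predictor's mean to 0 and standard deviation to 1, but they do not change skewness. The Box-Cox step clearly reduced the skewness of Na, Al and Ca, made little difference for RI and Si, and left Mg, K, Ba and Fe unchanged. Box-Cox requires strictly positive values and those four predictors contain zeros, so preProcess skips them. I am not sure about the details of the dataset, but if the outliers are data-entry errors they could be replaced with the median or imputed with a method such as knn. A log transformation is another option, although it too needs an offset for predictors that contain zeros.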
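The unchanged skewness values can be checked against the number of zeros in each predictor. The sketch below also tries the Yeo-Johnson transformation, which caret's preProcess offers as method "YeoJohnson" and which accepts zero values. This is a swapped-in alternative rather than part of the original analysis, and the names zero_counts, yj and glass_yj are illustrative.

```r
library(mlbench)
library(caret)
library(e1071)
data(Glass)

predictors <- Glass[, names(Glass) != "Type"]

# Predictors that contain zeros cannot be Box-Cox transformed
zero_counts <- colSums(predictors == 0)
zero_counts

# Yeo-Johnson does not require strictly positive values
yj <- preProcess(predictors, method = c("YeoJohnson", "center", "scale"))
glass_yj <- predict(yj, predictors)

# Skewness after the alternative transformation
round(sapply(glass_yj, skewness), 2)
```

Whether the result is better should be judged by comparing this skewness vector with the Box-Cox one above, and ultimately by the performance of the classification model.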

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

data(Soybean)
head(Soybean)
##                   Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker    6           0      2    1    0         1        1
## 2 diaporthe-stem-canker    4           0      2    1    0         2        0
## 3 diaporthe-stem-canker    3           0      2    1    0         1        0
## 4 diaporthe-stem-canker    3           0      2    1    0         1        0
## 5 diaporthe-stem-canker    6           0      2    1    0         2        0
## 6 diaporthe-stem-canker    5           0      2    1    0         3        0
##   sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1     1        0    0            1      1         0         2         2
## 2     2        1    1            1      1         0         2         2
## 3     2        1    2            1      1         0         2         2
## 4     2        0    1            1      1         0         2         2
## 5     1        0    2            1      1         0         2         2
## 6     1        0    1            1      1         0         2         2
##   leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1           0         0         0    1       1            3             1
## 2           0         0         0    1       0            3             1
## 3           0         0         0    1       0            3             0
## 4           0         0         0    1       0            3             0
## 5           0         0         0    1       0            3             1
## 6           0         0         0    1       0            3             0
##   fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1               1         1        0            0         0          0
## 2               1         1        0            0         0          0
## 3               1         1        0            0         0          0
## 4               1         1        0            0         0          0
## 5               1         1        0            0         0          0
## 6               1         1        0            0         0          0
##   fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1           4    0           0             0         0          0     0
## 2           4    0           0             0         0          0     0
## 3           4    0           0             0         0          0     0
## 4           4    0           0             0         0          0     0
## 5           4    0           0             0         0          0     0
## 6           4    0           0             0         0          0     0

(a)

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

# Summary
summary(Soybean[,2:36])
##       date     plant.stand  precip      temp       hail     crop.hist 
##  5      :149   0   :354    0   : 74   0   : 80   0   :435   0   : 65  
##  4      :131   1   :293    1   :112   1   :374   1   :127   1   :165  
##  3      :118   NA's: 36    2   :459   2   :199   NA's:121   2   :219  
##  2      : 93               NA's: 38   NA's: 30              3   :218  
##  6      : 90                                                NA's: 16  
##  (Other):101                                                          
##  NA's   :  1                                                          
##  area.dam    sever     seed.tmt     germ     plant.growth leaves  leaf.halo 
##  0   :123   0   :195   0   :305   0   :165   0   :441     0: 77   0   :221  
##  1   :227   1   :322   1   :222   1   :213   1   :226     1:606   1   : 36  
##  2   :145   2   : 45   2   : 35   2   :193   NA's: 16             2   :342  
##  3   :187   NA's:121   NA's:121   NA's:112                        NA's: 84  
##  NA's:  1                                                                   
##                                                                             
##                                                                             
##  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild    stem     lodging   
##  0   :357   0   : 51   0   :487    0   :554   0   :535   0   :296   0   :520  
##  1   : 21   1   :327   1   : 96    1   : 45   1   : 20   1   :371   1   : 42  
##  2   :221   2   :221   NA's:100    NA's: 84   2   : 20   NA's: 16   NA's:121  
##  NA's: 84   NA's: 84                          NA's:108                        
##                                                                               
##                                                                               
##                                                                               
##  stem.cankers canker.lesion fruiting.bodies ext.decay  mycelium   int.discolor
##  0   :379     0   :320      0   :473        0   :497   0   :639   0   :581    
##  1   : 39     1   : 83      1   :104        1   :135   1   :  6   1   : 44    
##  2   : 36     2   :177      NA's:106        2   : 13   NA's: 38   2   : 20    
##  3   :191     3   : 65                      NA's: 38              NA's: 38    
##  NA's: 38     NA's: 38                                                        
##                                                                               
##                                                                               
##  sclerotia  fruit.pods fruit.spots   seed     mold.growth seed.discolor
##  0   :625   0   :407   0   :345    0   :476   0   :524    0   :513     
##  1   : 20   1   :130   1   : 75    1   :115   1   : 67    1   : 64     
##  NA's: 38   2   : 14   2   : 57    NA's: 92   NA's: 92    NA's:106     
##             3   : 48   4   :100                                        
##             NA's: 84   NA's:106                                        
##                                                                        
##                                                                        
##  seed.size  shriveling  roots    
##  0   :532   0   :539   0   :551  
##  1   : 59   1   : 38   1   : 86  
##  NA's: 92   NA's:106   2   : 15  
##                        NA's: 31  
##                                  
##                                  
## 
# Plotting categorical variables
# Convert every column to character first so the factors can be stacked, then count with geom_bar()
Soybean %>% mutate(across(everything(), as.character)) %>% 
  pivot_longer(everything(), names_to = "key", values_to = "value") %>% 
  ggplot(aes(value))+facet_wrap(~key, scales = "free")+geom_bar()

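The plot above shows the frequency distribution of each categorical variable. A few predictors are close to degenerate: in the summary, mycelium (639 vs. 6), sclerotia (625 vs. 20) and leaf.mild (535 vs. 20 and 20) are dominated by a single level. The plot also shows that missing values exist in almost all of the variables. The next section counts and visualizes them.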
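The degenerate predictors can also be flagged programmatically. This is a minimal sketch using caret's nearZeroVar with its default cutoffs; the names nzv_metrics and flagged are illustrative.

```r
library(mlbench)
library(caret)
data(Soybean)

# Frequency ratio and percent of unique values for every predictor
# (the first column, Class, is the outcome and is left out)
nzv_metrics <- nearZeroVar(Soybean[, -1], saveMetrics = TRUE)

# Predictors flagged as near-zero variance under the default cutoffs
flagged <- rownames(nzv_metrics)[nzv_metrics$nzv]
nzv_metrics[flagged, c("freqRatio", "percentUnique")]
```

The freqRatio column is the count of the most common level divided by the count of the second most common, so it can be read directly against the summary table above.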

(b)

Roughly 18% of the data are missing. Are there particular predictors that are most likely to be missing? Is the pattern of missing data related to the classes?

# Calculate the missing values
colSums(is.na(Soybean))
##           Class            date     plant.stand          precip            temp 
##               0               1              36              38              30 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##             121              16               1             121             121 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##             112              16               0              84              84 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
##              84             100              84             108              16 
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##             121              38              38             106              38 
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
##              38              38              38              84             106 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##              92              92             106              92             106 
##           roots 
##              31
# Visualize
vis_miss(Soybean) + labs(title="Summarized visualization of missing values")

gg_miss_upset(Soybean) 

The summary and the first plot show that there are many missing values in the dataset. The second plot shows whether the missing values follow a pattern: germ, hail, sever, seed.tmt and lodging tend to be missing together in the same rows, so the values are not missing completely at random.
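To see whether the missingness is related to the classes, the incomplete rows can be counted per class. The sketch below uses only base R; the names incomplete and missing_by_class are assumptions made for illustration.

```r
library(mlbench)
data(Soybean)

# TRUE for every row that has at least one missing value
incomplete <- !complete.cases(Soybean)

# Number of incomplete rows within each disease class
missing_by_class <- tapply(incomplete, Soybean$Class, sum)
missing_by_class[missing_by_class > 0]

# Share of rows with at least one missing value
mean(incomplete)
```

If only a handful of classes appear in the filtered table, the missingness is informative about the outcome, which matters when choosing between dropping rows and imputing.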

(c)

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

There are various strategies for dealing with missing values, and the right one depends on the dataset and an understanding of the domain. In some areas, especially medical science, it can be safer to remove the incomplete data than to alter it. In social science, missing values are often replaced with the mean or median, and the accuracy of the resulting models is compared to pick the strategy that leads to the better model. Other tools such as knn and mice can also be very effective.

Method 1 - mice

mice_method <- mice(Soybean, method="pmm", printFlag=FALSE, seed=200)
## Warning: Number of logged events: 1669
aggr(complete(mice_method), prop=c(TRUE,TRUE), bars=TRUE, numbers=TRUE, sortVars=TRUE)

## 
##  Variables sorted by number of missings: 
##         Variable Count
##            Class     0
##             date     0
##      plant.stand     0
##           precip     0
##             temp     0
##             hail     0
##        crop.hist     0
##         area.dam     0
##            sever     0
##         seed.tmt     0
##             germ     0
##     plant.growth     0
##           leaves     0
##        leaf.halo     0
##        leaf.marg     0
##        leaf.size     0
##      leaf.shread     0
##        leaf.malf     0
##        leaf.mild     0
##             stem     0
##          lodging     0
##     stem.cankers     0
##    canker.lesion     0
##  fruiting.bodies     0
##        ext.decay     0
##         mycelium     0
##     int.discolor     0
##        sclerotia     0
##       fruit.pods     0
##      fruit.spots     0
##             seed     0
##      mold.growth     0
##    seed.discolor     0
##        seed.size     0
##       shriveling     0
##            roots     0

This method assumes the values are missing at random, but we saw earlier that several variables are missing together rather than at random, so it may not be very effective here. As an alternative, I’ll impute the missing values again with the knn method.

Method 2 - knn imputation

# knn method on the full data set (Class and date are included)
knn_method <- kNN(Soybean, k=5)
colSums(is.na(knn_method))
##               Class                date         plant.stand              precip 
##                   0                   0                   0                   0 
##                temp                hail           crop.hist            area.dam 
##                   0                   0                   0                   0 
##               sever            seed.tmt                germ        plant.growth 
##                   0                   0                   0                   0 
##              leaves           leaf.halo           leaf.marg           leaf.size 
##                   0                   0                   0                   0 
##         leaf.shread           leaf.malf           leaf.mild                stem 
##                   0                   0                   0                   0 
##             lodging        stem.cankers       canker.lesion     fruiting.bodies 
##                   0                   0                   0                   0 
##           ext.decay            mycelium        int.discolor           sclerotia 
##                   0                   0                   0                   0 
##          fruit.pods         fruit.spots                seed         mold.growth 
##                   0                   0                   0                   0 
##       seed.discolor           seed.size          shriveling               roots 
##                   0                   0                   0                   0 
##           Class_imp            date_imp     plant.stand_imp          precip_imp 
##                   0                   0                   0                   0 
##            temp_imp            hail_imp       crop.hist_imp        area.dam_imp 
##                   0                   0                   0                   0 
##           sever_imp        seed.tmt_imp            germ_imp    plant.growth_imp 
##                   0                   0                   0                   0 
##          leaves_imp       leaf.halo_imp       leaf.marg_imp       leaf.size_imp 
##                   0                   0                   0                   0 
##     leaf.shread_imp       leaf.malf_imp       leaf.mild_imp            stem_imp 
##                   0                   0                   0                   0 
##         lodging_imp    stem.cankers_imp   canker.lesion_imp fruiting.bodies_imp 
##                   0                   0                   0                   0 
##       ext.decay_imp        mycelium_imp    int.discolor_imp       sclerotia_imp 
##                   0                   0                   0                   0 
##      fruit.pods_imp     fruit.spots_imp            seed_imp     mold.growth_imp 
##                   0                   0                   0                   0 
##   seed.discolor_imp       seed.size_imp      shriveling_imp           roots_imp 
##                   0                   0                   0                   0
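The *_imp columns in the output above are logical indicators that VIM's kNN appends to mark which values were imputed. If only the imputed data are needed, a sketch like the following drops them with the imp_var argument; the name knn_clean is illustrative.

```r
library(mlbench)
library(VIM)
data(Soybean)

# imp_var = FALSE suppresses the extra *_imp indicator columns
knn_clean <- kNN(Soybean, k = 5, imp_var = FALSE)

dim(knn_clean)    # same shape as the original data
anyNA(knn_clean)  # FALSE once every value has been imputed
```

Keeping the indicator columns can still be useful when the fact that a value was missing is itself predictive of the class.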

The knn method replaced every missing value, so none remain. The extra *_imp columns are indicators added by kNN to flag which values were imputed. I cannot say whether this method was effective until I build a model, split the data and test its accuracy. To sum up, there are various strategies for dealing with missing values, and the choice depends on the dataset and the domain. Many data scientists simply replace missing values with the median.
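For categorical predictors like the ones in Soybean, the analogue of median replacement is filling in the most frequent level. The sketch below is a simple baseline to compare against mice and knn, not a recommended method; impute_mode is a hypothetical helper and toy is made-up data.

```r
# Hypothetical helper: replace NA in a factor with its most frequent level
impute_mode <- function(x) {
  counts <- table(x)                              # table() ignores NA by default
  mode_level <- names(counts)[which.max(counts)]  # first level with the top count
  x[is.na(x)] <- mode_level
  x
}

# Made-up example: "1" is the most frequent level
toy <- factor(c("0", "1", "1", NA, "1", NA, "0"))
imputed <- impute_mode(toy)
imputed
```

Applying the helper to every column, for example with lapply over the data frame, would give a fully imputed baseline whose model accuracy can be compared with the other two methods.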