Exercise 1

  1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and the percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
  2. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
library(dplyr)
library(PerformanceAnalytics)
library(corrplot)

# Scatterplot matrix with histograms and pairwise correlations
data <- Glass %>% select(-Type)
chart.Correlation(data, bg = c("blue", "red", "yellow"), pch = 21)

# Correlation matrix of the nine numeric predictors
glass_cor <- cor(Glass[1:9])
corrplot(glass_cor, method = "number")

The chart shows a strong correlation between RI and Ca, which means the two predictors carry largely the same information for the model. It would be sensible to remove one of them to avoid redundancy. The other variables show weak to moderate correlations.
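As a quick cross-check, caret's findCorrelation can suggest which member of a highly correlated pair to drop; a minimal sketch, assuming the common 0.75 cutoff:

library(caret)

# Names of columns whose pairwise correlation exceeds the cutoff; with RI
# and Ca strongly correlated, one of the two is flagged for removal
findCorrelation(glass_cor, cutoff = 0.75, names = TRUE)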

# Density plot of each numeric predictor (a 3 x 3 grid fits all nine)
par(mfrow = c(3, 3))
col_names <- names(Glass)
for (i in 1:9) {
  d <- density(Glass[, i])
  plot(d, type = "n", main = col_names[i])
  polygon(d, col = "blue", border = "gray")
}

The plots show that some of the variables have distributions close to normal: Na, Al, and Si can be considered approximately normal, as can Ba apart from its right skew.
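To quantify the skew rather than judge it by eye, a short sketch using the skewness function from the e1071 package (values near zero suggest symmetry; large positive values indicate right skew):

library(e1071)

# Sample skewness of each of the nine numeric predictors
apply(Glass[, 1:9], 2, skewness)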

  3. Do there appear to be any outliers in the data? Are any predictors skewed?
library(reshape2)
library(ggplot2)
library(gridExtra)

# Melt to long format, keeping Type as the id variable
df <- melt(Glass, id.vars = "Type")
# Boxplots over three value ranges so predictors on different scales
# remain readable; outliers are drawn as red stars
range_boxplot <- function(limits, breaks) {
  ggplot(data = df, aes(x = variable, y = value)) +
    geom_boxplot(aes(fill = variable), outlier.colour = "red",
                 outlier.shape = 8, outlier.size = 4) +
    scale_y_continuous(name = "Predictors", breaks = breaks, limits = limits) +
    coord_flip()
}

plot1 <- range_boxplot(limits = c(0, 4), breaks = seq(0, 4, 0.5))
plot2 <- range_boxplot(limits = c(5, 15), breaks = seq(5, 15, 2))
plot3 <- range_boxplot(limits = c(69, 76), breaks = seq(69, 76, 1))

grid.arrange(plot1, plot2, plot3, nrow = 2)
## Warning: Removed 645 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1501 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1712 rows containing non-finite values (stat_boxplot).

There are outliers in all of the predictor variables, and several predictors are skewed: Mg is left-skewed, while K, Ba, and Fe are right-skewed. Ca also appears slightly right-skewed.
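A rough tally under the standard 1.5 * IQR boxplot rule supports this; a minimal base-R sketch:

# Count the points falling beyond 1.5 * IQR from the quartiles
sapply(Glass[, 1:9], function(x) {
  qs <- quantile(x, c(0.25, 0.75))
  sum(x < qs[1] - 1.5 * IQR(x) | x > qs[2] + 1.5 * IQR(x))
})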

  4. Are there any relevant transformations of one or more predictors that might improve the classification model?

We could apply a Box-Cox transformation to the dataset to reduce skewness, for example via the forecast package. The Box-Cox method uses maximum likelihood estimation to determine the transformation parameter \(\lambda\), which could help improve the classification model. As mentioned in the textbook, we could also use caret's preProcess class, which can transform, center, scale, or impute values, as well as apply the spatial sign transformation and feature extraction; the function calculates the quantities required for each transformation.
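A minimal sketch of the caret route, assuming we Box-Cox transform, center, and scale the nine predictors (Box-Cox needs strictly positive data, so zero-heavy predictors such as K, Ba, and Fe should be left untransformed by the estimator):

library(caret)

# Estimate the transformations on the predictors, then apply them
pp <- preProcess(Glass[, 1:9], method = c("BoxCox", "center", "scale"))
pp   # prints how many predictors were Box-Cox transformed
glass_trans <- predict(pp, Glass[, 1:9])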

Exercise 2

  1. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

library(mlbench)
data(Soybean)

#?Soybean
  2. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
summary(Soybean)
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
## 
# Build a frequency table for each categorical predictor
for (i in 2:ncol(Soybean)) {
  temp <- as.data.frame(table(Soybean[[i]]))
  temp$col <- colnames(Soybean[i])
  assign(paste0("freq_", colnames(Soybean[i])), temp)
  rm(temp)
}

# Stack the per-variable frequency tables into one data frame
all <- do.call("rbind", mget(ls(pattern = "^freq_.*")))

head(all, 4)
##                 Var1 Freq      col
## freq_area.dam.1    0  123 area.dam
## freq_area.dam.2    1  227 area.dam
## freq_area.dam.3    2  145 area.dam
## freq_area.dam.4    3  187 area.dam
freq_date
##   Var1 Freq  col
## 1    0   26 date
## 2    1   75 date
## 3    2   93 date
## 4    3  118 date
## 5    4  131 date
## 6    5  149 date
## 7    6   90 date
freq_temp
##   Var1 Freq  col
## 1    0   80 temp
## 2    1  374 temp
## 3    2  199 temp
freq_hail
##   Var1 Freq  col
## 1    0  435 hail
## 2    1  127 hail
freq_leaves
##   Var1 Freq    col
## 1    0   77 leaves
## 2    1  606 leaves
# Bar plot of each categorical predictor
par(mfrow = c(3, 3))
for (i in 2:ncol(Soybean)) {
  plot(Soybean[i], main = colnames(Soybean[i]))
}

library(caret)

# Check for degenerate (near-zero-variance) predictors; given the
# frequency ratios in the summary above, this flags leaf.mild,
# mycelium, and sclerotia
nzv <- nearZeroVar(Soybean)
par(mfrow = c(2, 2))
for (i in nzv) {
  plot(Soybean[i], main = colnames(Soybean[i]))
}

  3. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
colSums(is.na(Soybean))
##           Class            date     plant.stand          precip            temp 
##               0               1              36              38              30 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##             121              16               1             121             121 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##             112              16               0              84              84 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
##              84             100              84             108              16 
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##             121              38              38             106              38 
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
##              38              38              38              84             106 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##              92              92             106              92             106 
##           roots 
##              31
library(VIM)

# Total number of missing values per class
sb <- Soybean %>%
  mutate(nul = rowSums(is.na(Soybean))) %>%
  group_by(Class) %>%
  summarize(miss = sum(nul)) %>%
  filter(miss != 0)

# Aggregation plot of missingness per variable (note: labels must come
# from the Soybean data, not the Glass data used in Exercise 1)
aggr_plot <- aggr(Soybean, col = c("black", "red"), numbers = TRUE,
                  sortVars = TRUE, labels = names(Soybean), cex.axis = 0.7,
                  gap = 3, ylab = c("Missing data", "Pattern"))

## 
##  Variables sorted by number of missings: 
##         Variable       Count
##             hail 0.177159590
##            sever 0.177159590
##         seed.tmt 0.177159590
##          lodging 0.177159590
##             germ 0.163982430
##        leaf.mild 0.158125915
##  fruiting.bodies 0.155197657
##      fruit.spots 0.155197657
##    seed.discolor 0.155197657
##       shriveling 0.155197657
##      leaf.shread 0.146412884
##             seed 0.134699854
##      mold.growth 0.134699854
##        seed.size 0.134699854
##        leaf.halo 0.122986823
##        leaf.marg 0.122986823
##        leaf.size 0.122986823
##        leaf.malf 0.122986823
##       fruit.pods 0.122986823
##           precip 0.055636896
##     stem.cankers 0.055636896
##    canker.lesion 0.055636896
##        ext.decay 0.055636896
##         mycelium 0.055636896
##     int.discolor 0.055636896
##        sclerotia 0.055636896
##      plant.stand 0.052708638
##            roots 0.045387994
##             temp 0.043923865
##        crop.hist 0.023426061
##     plant.growth 0.023426061
##             stem 0.023426061
##             date 0.001464129
##         area.dam 0.001464129
##            Class 0.000000000
##           leaves 0.000000000
sb
## # A tibble: 5 x 2
##   Class                        miss
##   <fct>                       <dbl>
## 1 2-4-d-injury                  450
## 2 cyst-nematode                 336
## 3 diaporthe-pod-&-stem-blight   177
## 4 herbicide-injury              160
## 5 phytophthora-rot             1214

Missing values were summarized by class. Of the 19 Class categories, only five contain missing values, with phytophthora-rot showing by far the most. So there does appear to be a pattern of missing data related to the classes, as the sketch below makes explicit.
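To make the relationship concrete, a small base-R sketch computing the average share of the 35 predictors that are missing within each class:

# Mean fraction of the 35 predictors that are NA, by class
miss_by_class <- tapply(rowSums(is.na(Soybean[, -1])) / 35, Soybean$Class, mean)
round(sort(miss_by_class, decreasing = TRUE)[1:5], 2)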

  4. Develop a strategy for handling missing data, either by eliminating predictors or imputation.
library(DMwR)

# Impute the missing values with k-nearest neighbours (predictors only)
sb <- knnImputation(Soybean[, -1])
colSums(is.na(sb))
##            date     plant.stand          precip            temp            hail 
##               0               0               0               0               0 
##       crop.hist        area.dam           sever        seed.tmt            germ 
##               0               0               0               0               0 
##    plant.growth          leaves       leaf.halo       leaf.marg       leaf.size 
##               0               0               0               0               0 
##     leaf.shread       leaf.malf       leaf.mild            stem         lodging 
##               0               0               0               0               0 
##    stem.cankers   canker.lesion fruiting.bodies       ext.decay        mycelium 
##               0               0               0               0               0 
##    int.discolor       sclerotia      fruit.pods     fruit.spots            seed 
##               0               0               0               0               0 
##     mold.growth   seed.discolor       seed.size      shriveling           roots 
##               0               0               0               0               0

Because of the missing values, we could not compute reliable correlations between the variables; had two predictors been strongly correlated, we would have removed the one with the higher percentage of missing values. As a rule of thumb, predictors missing more than about 5% of their values are candidates for removal, since such incomplete predictors may feed unreliable information to the model. Here we instead used k-nearest neighbours to impute the missing values in our dataset.
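Had we chosen elimination instead of imputation, a minimal sketch of that 5% rule of thumb (the threshold is illustrative, not canonical):

# Drop predictors where more than 5% of the values are missing
na_share <- colMeans(is.na(Soybean[, -1]))
names(na_share[na_share > 0.05])                    # predictors that would be dropped
soy_reduced <- Soybean[, c(TRUE, na_share <= 0.05)] # keep Class plus low-NA predictors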