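# Packages assumed by the code in this write-up
library(mlbench)               # Glass, Soybean
library(dplyr)                 # %>%, select, mutate, summarize
library(PerformanceAnalytics)  # chart.Correlation
library(corrplot)              # corrplot
library(reshape2)              # melt
library(ggplot2)               # box plots
library(gridExtra)             # grid.arrange
library(caret)                 # nearZeroVar
library(DataExplorer)          # plot_missing
library(VIM)                   # aggr
library(DMwR)                  # knnImputation
data(Glass)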
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

3.1 The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
glass <- Glass %>% select(-Type)
chart.Correlation(glass, bg=c("blue","red","yellow"), pch=21)

x <- cor(Glass[1:9])
corrplot(x,  method="number")

The predictors RI and Ca are strongly correlated, so they carry largely the same information. Removing one of them avoids redundancy in the model. All other pairs show only weak to moderate correlation, so the remaining predictors can all be kept.
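
The pairs behind this reading can also be listed numerically. The sketch below is only an illustration: the 0.75 cutoff and the names `cor_pairs` and `cutoff` are assumptions made here, not values taken from the exercise.

```r
library(mlbench)  # provides the Glass data

data(Glass)
cor_mat <- cor(Glass[, 1:9])

# Look only at the upper triangle so that each pair appears once
upper <- which(upper.tri(cor_mat), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_mat)[upper[, "row"]],
  var2 = colnames(cor_mat)[upper[, "col"]],
  r    = cor_mat[upper]
)

cutoff <- 0.75  # illustrative threshold (an assumption)
high_pairs <- cor_pairs[abs(cor_pairs$r) > cutoff, ]
high_pairs[order(-abs(high_pairs$r)), ]
```

Lowering `cutoff` shows which pairs fall into the moderate range.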

par(mfrow=c(2, 2))
colnames <- dimnames(Glass)[[2]]
for (i in 1:9) {
    d <- density(Glass[,i])
    plot(d, type="n", main=colnames[i])
    polygon(d, col="red", border="gray")
}

The density plots show that some of the variables are close to normally distributed. Na, Al, Si, and Ba (apart from its right skew) can be considered approximately normal.
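
A numeric summary can complement the visual impression from the density plots. The sketch below computes a moment-based sample skewness for each predictor. The helper `skewness()` is defined here for illustration and is not part of the original analysis. Values near zero are consistent with a roughly symmetric distribution.

```r
library(mlbench)  # provides the Glass data

data(Glass)

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew <- sapply(Glass[, 1:9], skewness)
round(sort(skew, decreasing = TRUE), 2)
```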

  1. Do there appear to be any outliers in the data? Are any predictors skewed?
df.m <- melt(Glass)
## Using Type as id variables
p1 <- ggplot(data = df.m, aes(x = variable, y = value)) +
  geom_boxplot(aes(fill = variable), outlier.colour = "red",
               outlier.shape = 8, outlier.size = 4) +
  scale_y_continuous(name = "Predictors Glass",
                     breaks = seq(0, 2, 0.5),
                     limits = c(0, 2)) +
  coord_flip()

p2 <- ggplot(data = df.m, aes(x = variable, y = value)) +
  geom_boxplot(aes(fill = variable), outlier.colour = "red",
               outlier.shape = 8, outlier.size = 4) +
  scale_y_continuous(name = "Predictors Glass",
                     breaks = seq(5, 10, 2),
                     limits = c(5, 10)) +
  coord_flip()

p3 <- ggplot(data = df.m, aes(x = variable, y = value)) +
  geom_boxplot(aes(fill = variable), outlier.colour = "red",
               outlier.shape = 8, outlier.size = 4) +
  scale_y_continuous(name = "Predictors Glass",
                     breaks = seq(70, 76, 1),
                     limits = c(70, 76)) +
  coord_flip()

grid.arrange(p1, p2, p3, nrow = 2)
## Warning: Removed 833 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1737 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1714 rows containing non-finite values (stat_boxplot).

There are outliers in all of the predictor variables; they are highlighted in red in the plots. The box plots also show that some of the predictors are skewed: Fe is skewed right, K appears skewed left within the plotted range, Al has a slight positive skew, and Si has a slight left skew.
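
Because the axis limits in the plots drop observations that fall outside each range (hence the "Removed ... rows" warnings), it can help to count outliers on the full data as well. The sketch below uses base R's `boxplot.stats()`, which flags points more than 1.5 times the interquartile range beyond the hinges, a rule similar to the box plot default. The name `outlier_counts` is an illustrative choice.

```r
library(mlbench)  # provides the Glass data

data(Glass)

# Count the points each predictor's box plot would mark as outliers
outlier_counts <- sapply(Glass[, 1:9],
                         function(x) length(boxplot.stats(x)$out))
sort(outlier_counts, decreasing = TRUE)
```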

  1. Are there any relevant transformations of one or more predictors that might improve the classification model?

Transforming predictors is often worthwhile when building models, but the transformation has to suit each variable. Here we can first remove one of the highly correlated predictors (RI and Ca; we can remove Ca), then remove any predictors with a high percentage of missing values and any zero-variance predictors. We also checked the predictors for normality: as explored in the first part, only Na, Al, Si, and Ba were approximately normally distributed. The remaining predictors (RI or Ca, Mg, K, Fe) can be transformed using Box-Cox transformations.
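
As a rough sketch of the Box-Cox step, the code below estimates a power parameter for each predictor with `MASS::boxcox()` and applies the transformation by hand. Box-Cox requires strictly positive values, so predictors that contain zeros are left untouched here; handling them (for example with an offset or a Yeo-Johnson transformation) is outside this sketch. The lambda grid and the helper names `estimate_lambda()` and `box_cox()` are assumptions made for illustration.

```r
library(mlbench)  # provides the Glass data

data(Glass)
predictors <- Glass[, 1:9]

# Box-Cox is only defined for strictly positive data
strictly_positive <- sapply(predictors, function(x) all(x > 0))

# Pick the lambda on a grid that maximizes the profile log-likelihood
estimate_lambda <- function(x) {
  bc <- MASS::boxcox(x ~ 1, lambda = seq(-2, 2, by = 0.1), plotit = FALSE)
  bc$x[which.max(bc$y)]
}

# Box-Cox transform; lambda = 0 corresponds to the log transform
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

lambdas <- sapply(predictors[strictly_positive], estimate_lambda)

transformed <- predictors
for (nm in names(lambdas)) {
  transformed[[nm]] <- box_cox(predictors[[nm]], lambdas[[nm]])
}
round(lambdas, 1)
```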

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

data(Soybean)
summary(Soybean)
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ    
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165  
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213  
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193  
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112  
##             NA's: 16   NA's:  1                                   
##                                                                   
##                                                                   
##  plant.growth leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread
##  0   :441     0: 77   0   :221   0   :357   0   : 51   0   :487   
##  1   :226     1:606   1   : 36   1   : 21   1   :327   1   : 96   
##  NA's: 16             2   :342   2   :221   2   :221   NA's:100   
##                       NA's: 84   NA's: 84   NA's: 84              
##                                                                   
##                                                                   
##                                                                   
##  leaf.malf  leaf.mild    stem     lodging    stem.cankers canker.lesion
##  0   :554   0   :535   0   :296   0   :520   0   :379     0   :320     
##  1   : 45   1   : 20   1   :371   1   : 42   1   : 39     1   : 83     
##  NA's: 84   2   : 20   NA's: 16   NA's:121   2   : 36     2   :177     
##             NA's:108                         3   :191     3   : 65     
##                                              NA's: 38     NA's: 38     
##                                                                        
##                                                                        
##  fruiting.bodies ext.decay  mycelium   int.discolor sclerotia  fruit.pods
##  0   :473        0   :497   0   :639   0   :581     0   :625   0   :407  
##  1   :104        1   :135   1   :  6   1   : 44     1   : 20   1   :130  
##  NA's:106        2   : 13   NA's: 38   2   : 20     NA's: 38   2   : 14  
##                  NA's: 38              NA's: 38                3   : 48  
##                                                                NA's: 84  
##                                                                          
##                                                                          
##  fruit.spots   seed     mold.growth seed.discolor seed.size  shriveling
##  0   :345    0   :476   0   :524    0   :513      0   :532   0   :539  
##  1   : 75    1   :115   1   : 67    1   : 64      1   : 59   1   : 38  
##  2   : 57    NA's: 92   NA's: 92    NA's:106      NA's: 92   NA's:106  
##  4   :100                                                              
##  NA's:106                                                              
##                                                                        
##                                                                        
##   roots    
##  0   :551  
##  1   : 86  
##  2   : 15  
##  NA's: 31  
##            
##            
## 
  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

A degenerate distribution is one in which the variable is not really random because it takes only a single value. More loosely, a predictor is near-degenerate when one level occurs so much more often than the others that the remaining levels have a negligible effect. Such predictors are said to have zero variance or near-zero variance. We will use a function from the caret package to detect them. When building predictive models, zero-variance predictors are usually removed because they can bias the results; here we only detect them. First we generate a data frame of frequency distributions for all columns and check a few predictors at random.

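# Build one frequency table per predictor (column 1 is Class) and
# store it in a variable named freq_<predictor>
for (i in 2:ncol(Soybean)) {
  temp <- as.data.frame(table(Soybean[[i]]))
  temp$col <- colnames(Soybean)[i]
  assign(paste0("freq_", colnames(Soybean)[i]), temp)
  rm(temp)
}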

freq_all <- do.call("rbind",mget(ls(pattern = "^freq_.*")))

head(freq_all,4)
##                 Var1 Freq      col
## freq_area.dam.1    0  123 area.dam
## freq_area.dam.2    1  227 area.dam
## freq_area.dam.3    2  145 area.dam
## freq_area.dam.4    3  187 area.dam
freq_date
##   Var1 Freq  col
## 1    0   26 date
## 2    1   75 date
## 3    2   93 date
## 4    3  118 date
## 5    4  131 date
## 6    5  149 date
## 7    6   90 date
freq_temp
##   Var1 Freq  col
## 1    0   80 temp
## 2    1  374 temp
## 3    2  199 temp
freq_stem
##   Var1 Freq  col
## 1    0  296 stem
## 2    1  371 stem
freq_leaves
##   Var1 Freq    col
## 1    0   77 leaves
## 2    1  606 leaves
par(mfrow = c(3,3))
for(i in 2:ncol(Soybean)) {
  plot(Soybean[i], main = colnames(Soybean[i]))
}

To detect zero-variance or near-zero-variance predictors, we use the function nearZeroVar() and then look at the frequencies of the flagged variables.

zeroVarCols <- nearZeroVar(Soybean)

# Columns 19, 26 and 28 are degenerate
freq_leaf.mild
##   Var1 Freq       col
## 1    0  535 leaf.mild
## 2    1   20 leaf.mild
## 3    2   20 leaf.mild
freq_mycelium
##   Var1 Freq      col
## 1    0  639 mycelium
## 2    1    6 mycelium
freq_sclerotia
##   Var1 Freq       col
## 1    0  625 sclerotia
## 2    1   20 sclerotia
par(mfrow = c(2,2))
for(i in zeroVarCols) {
  plot(Soybean[i], main = colnames(Soybean[i]))
}
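
To make the criterion behind nearZeroVar() more concrete, the sketch below recomputes the two quantities it relies on in base R: the ratio of the most common level to the second most common level, and the percentage of distinct values. The cutoffs of 95/5 and 10 are caret's documented defaults; the names `metrics` and `flagged` are illustrative, and the code is an approximation of the idea rather than caret's implementation.

```r
library(mlbench)  # provides the Soybean data

data(Soybean)
predictors <- Soybean[, -1]  # drop the Class outcome

# One row per predictor: frequency ratio and percent of unique values
metrics <- t(sapply(predictors, function(x) {
  observed <- x[!is.na(x)]
  counts <- sort(table(observed), decreasing = TRUE)
  c(freq_ratio     = counts[[1]] / counts[[2]],
    percent_unique = 100 * length(unique(observed)) / length(x))
}))

# Flag predictors dominated by a single level
flagged <- rownames(metrics)[metrics[, "freq_ratio"] > 95 / 5 &
                             metrics[, "percent_unique"] < 10]
flagged
round(metrics[flagged, , drop = FALSE], 2)
```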

  1. Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
colSums(is.na(Soybean))
##           Class            date     plant.stand          precip 
##               0               1              36              38 
##            temp            hail       crop.hist        area.dam 
##              30             121              16               1 
##           sever        seed.tmt            germ    plant.growth 
##             121             121             112              16 
##          leaves       leaf.halo       leaf.marg       leaf.size 
##               0              84              84              84 
##     leaf.shread       leaf.malf       leaf.mild            stem 
##             100              84             108              16 
##         lodging    stem.cankers   canker.lesion fruiting.bodies 
##             121              38              38             106 
##       ext.decay        mycelium    int.discolor       sclerotia 
##              38              38              38              38 
##      fruit.pods     fruit.spots            seed     mold.growth 
##              84             106              92              92 
##   seed.discolor       seed.size      shriveling           roots 
##             106              92             106              31
plot_missing(Soybean)

sb_class <- Soybean %>%
  mutate(nul = rowSums(is.na(Soybean))) %>%
  group_by(Class) %>%
  summarize(miss = sum(nul)) %>%
  filter(miss != 0)

aggr_plot <- aggr(Soybean, col = c('navyblue', 'red'), numbers = TRUE, sortVars = TRUE,
                  labels = names(Soybean), cex.axis = .7, gap = 3,
                  ylab = c("Histogram of missing data", "Pattern"))

## 
##  Variables sorted by number of missings: 
##         Variable       Count
##             hail 0.177159590
##            sever 0.177159590
##         seed.tmt 0.177159590
##          lodging 0.177159590
##             germ 0.163982430
##        leaf.mild 0.158125915
##  fruiting.bodies 0.155197657
##      fruit.spots 0.155197657
##    seed.discolor 0.155197657
##       shriveling 0.155197657
##      leaf.shread 0.146412884
##             seed 0.134699854
##      mold.growth 0.134699854
##        seed.size 0.134699854
##        leaf.halo 0.122986823
##        leaf.marg 0.122986823
##        leaf.size 0.122986823
##        leaf.malf 0.122986823
##       fruit.pods 0.122986823
##           precip 0.055636896
##     stem.cankers 0.055636896
##    canker.lesion 0.055636896
##        ext.decay 0.055636896
##         mycelium 0.055636896
##     int.discolor 0.055636896
##        sclerotia 0.055636896
##      plant.stand 0.052708638
##            roots 0.045387994
##             temp 0.043923865
##        crop.hist 0.023426061
##     plant.growth 0.023426061
##             stem 0.023426061
##             date 0.001464129
##         area.dam 0.001464129
##            Class 0.000000000
##           leaves 0.000000000
sb_class
## # A tibble: 5 x 2
##   Class                        miss
##   <fct>                       <dbl>
## 1 2-4-d-injury                  450
## 2 cyst-nematode                 336
## 3 diaporthe-pod-&-stem-blight   177
## 4 herbicide-injury              160
## 5 phytophthora-rot             1214

The missing-data plot shows the percentage of missing values for each variable. The missing values were also summarized by class: of the 19 classes, only 5 contain missing values, and phytophthora-rot has the most. This indicates that the pattern of missing data is related to the class.
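
Raw counts of missing cells depend on how many samples each class has, so a per-class share of incomplete rows can be a useful complement. The sketch below is one way to compute it in base R; the names `by_class`, `n_incomplete`, and `pct_incomplete` are illustrative.

```r
library(mlbench)  # provides the Soybean data

data(Soybean)

# TRUE for every sample that has at least one missing predictor
rows_with_na <- !complete.cases(Soybean)

by_class <- data.frame(
  n_rows       = as.vector(table(Soybean$Class)),
  n_incomplete = as.vector(tapply(rows_with_na, Soybean$Class, sum)),
  row.names    = levels(Soybean$Class)
)
by_class$pct_incomplete <- round(100 * by_class$n_incomplete / by_class$n_rows, 1)

# Show only the classes that contain incomplete rows
by_class[by_class$n_incomplete > 0, ]
```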

  1. Develop a strategy for handling missing data, either by eliminating predictors or imputation.
sb_imp <- knnImputation(Soybean[,-1])
colSums(is.na(sb_imp))
##            date     plant.stand          precip            temp 
##               0               0               0               0 
##            hail       crop.hist        area.dam           sever 
##               0               0               0               0 
##        seed.tmt            germ    plant.growth          leaves 
##               0               0               0               0 
##       leaf.halo       leaf.marg       leaf.size     leaf.shread 
##               0               0               0               0 
##       leaf.malf       leaf.mild            stem         lodging 
##               0               0               0               0 
##    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##               0               0               0               0 
##        mycelium    int.discolor       sclerotia      fruit.pods 
##               0               0               0               0 
##     fruit.spots            seed     mold.growth   seed.discolor 
##               0               0               0               0 
##       seed.size      shriveling           roots 
##               0               0               0

The best strategy is to start by checking the correlation between pairs of predictors. Because of the high percentage of missing values, we were not able to obtain reliable correlations. Had two predictors been strongly correlated, we would have removed the one with the higher percentage of missing values. As a general rule of thumb, predictors with more than 5% missing values are candidates for removal, since a predictor with many missing values may not provide reliable information to the model. Here we used k-nearest neighbours to impute the missing values in our dataset.
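
The elimination rule described above can be sketched in base R. Note that this example swaps in simple mode imputation for the remaining gaps instead of the k-nearest-neighbour imputation used in this write-up, purely to stay free of extra packages. The 5% threshold comes from the text; the helper `impute_mode()` and the other names are assumptions made for illustration.

```r
library(mlbench)  # provides the Soybean data

data(Soybean)
predictors <- Soybean[, -1]  # drop the Class outcome

# Fraction of missing values per predictor
missing_frac <- colMeans(is.na(predictors))

# Keep only predictors at or below the 5% rule of thumb
threshold <- 0.05
kept <- predictors[, missing_frac <= threshold, drop = FALSE]

# Replace missing values with the most frequent observed level
impute_mode <- function(x) {
  mode_level <- names(which.max(table(x)))
  x[is.na(x)] <- mode_level
  x
}

imputed <- kept
imputed[] <- lapply(kept, impute_mode)
colSums(is.na(imputed))
```

Mode imputation ignores relationships between predictors, which is one reason a neighbour-based method may be preferable when the missingness is tied to the class.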
