Data 624 Homework 4 Data Preprocessing/Overfitting

Textbook: Max Kuhn and Kjell Johnson. Applied Predictive Modeling. Springer, New York, 2013.

# Required R packages
library(mlbench)
library(tidyverse)
library(GGally)
library(caret)
library(VIM)
library(rcompanion)
library(corrplot)
library(e1071)
library(naniar)

Exercise 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The data can be accessed via:

data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Let us look at the descriptive statistics first.

summary(Glass)

##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29

Distributions of predictor variables

par(mfrow = c(3,3))
for (i in 1:9){
  plotNormalDensity(
    Glass[,i], main = sprintf("Variable %s", names(Glass)[i]))
}

The plots show the actual distributions and a superimposed normal curve for each predictor, making it easy to compare the two. None of the predictors follow a normal distribution. Na, Al, and Si are close to normal, but they exhibit fat tails. Mg and K are bimodal. Ca, Ba, Fe, and K are positively skewed.

Relationships between predictors

# Correlation 
Glass2 = select(Glass, -Type)
Glass %>% select(-Type) %>% 
          cor() %>% 
          corrplot(., method='color', type="upper", order="hclust", 
                      addCoef.col = "black", tl.col="black",  diag=FALSE)

The relationship between RI and Ca shows a highly positive correlation. There are a few variables with moderate correlation values.

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

#Boxplots for outlier
Glass2 %>%
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_boxplot(outlier.size=1, outlier.colour='blue', alpha=0.3)+
    theme_minimal()

There are a number of outliers for each predictor variable, except for Mg. Ba and Ca appears to have the most outliers. Let us calculate skewness to identify extreme outliers.

#Skewness
Glass2[-10] %>% 
  apply(2, skewness)

##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107

The rule of thumb seems to be: If the skewness is between -0.5 and 0.5, the data are fairly symmetrical. If the skewness is between -1 and – 0.5 or between 0.5 and 1, the data are moderately skewed. If the skewness is less than -1 or greater than 1, the data are highly skewed. 6 out of 9 variables exhibit a high skew. K, Ba and Ca lead the way with extreme skews.

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

Box-Cox and PCA transformations may be used for variables that are skewed in order to improve the classification model.

glass_transformed = preProcess(Glass2, method = c('BoxCox', 'center', 'scale', 'pca'))
glass_transformed

## Created from 214 samples and 9 variables
## 
## Pre-processing:
##   - Box-Cox transformation (5)
##   - centered (9)
##   - ignored (0)
##   - principal component signal extraction (9)
##   - scaled (9)
## 
## Lambda estimates for Box-Cox transformation:
## -2, -0.1, 0.5, 2, -1.1
## PCA needed 7 components to capture 95 percent of the variance

#Collating the transformed data
prediction = predict(glass_transformed, Glass)

par(mfrow = c(3,3))
for (i in 2:8){
  plotNormalDensity(
    prediction[,i], main = sprintf("Variable %s", names(prediction)[i]))
}

# Correlation 
par(xpd=TRUE)
prediction %>% select(-Type) %>% 
          cor() %>% 
          corrplot(., method='color', type="upper", order="hclust",mar = c(2, 2, 1, 1), 
                      addCoef.col = "black", tl.col="black", tl.cex=0.8, tl.pos = "td",                            diag=FALSE)

Refractive index, Na, Al, Si, and Ca were box-cox transformed, and all variables were centered and scaled before applying PCA. After applying the transformations, the density is closer to normal and not heavily skewed, and there is no correlation among the variables.

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

data(Soybean)

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

#Summary
summary(Soybean[,2:36])

##       date     plant.stand  precip      temp       hail     crop.hist 
##  5      :149   0   :354    0   : 74   0   : 80   0   :435   0   : 65  
##  4      :131   1   :293    1   :112   1   :374   1   :127   1   :165  
##  3      :118   NA's: 36    2   :459   2   :199   NA's:121   2   :219  
##  2      : 93               NA's: 38   NA's: 30              3   :218  
##  6      : 90                                                NA's: 16  
##  (Other):101                                                          
##  NA's   :  1                                                          
##  area.dam    sever     seed.tmt     germ     plant.growth leaves  leaf.halo 
##  0   :123   0   :195   0   :305   0   :165   0   :441     0: 77   0   :221  
##  1   :227   1   :322   1   :222   1   :213   1   :226     1:606   1   : 36  
##  2   :145   2   : 45   2   : 35   2   :193   NA's: 16             2   :342  
##  3   :187   NA's:121   NA's:121   NA's:112                        NA's: 84  
##  NA's:  1                                                                   
##                                                                             
##                                                                             
##  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild    stem     lodging   
##  0   :357   0   : 51   0   :487    0   :554   0   :535   0   :296   0   :520  
##  1   : 21   1   :327   1   : 96    1   : 45   1   : 20   1   :371   1   : 42  
##  2   :221   2   :221   NA's:100    NA's: 84   2   : 20   NA's: 16   NA's:121  
##  NA's: 84   NA's: 84                          NA's:108                        
##                                                                               
##                                                                               
##                                                                               
##  stem.cankers canker.lesion fruiting.bodies ext.decay  mycelium   int.discolor
##  0   :379     0   :320      0   :473        0   :497   0   :639   0   :581    
##  1   : 39     1   : 83      1   :104        1   :135   1   :  6   1   : 44    
##  2   : 36     2   :177      NA's:106        2   : 13   NA's: 38   2   : 20    
##  3   :191     3   : 65                      NA's: 38              NA's: 38    
##  NA's: 38     NA's: 38                                                        
##                                                                               
##                                                                               
##  sclerotia  fruit.pods fruit.spots   seed     mold.growth seed.discolor
##  0   :625   0   :407   0   :345    0   :476   0   :524    0   :513     
##  1   : 20   1   :130   1   : 75    1   :115   1   : 67    1   : 64     
##  NA's: 38   2   : 14   2   : 57    NA's: 92   NA's: 92    NA's:106     
##             3   : 48   4   :100                                        
##             NA's: 84   NA's:106                                        
##                                                                        
##                                                                        
##  seed.size  shriveling  roots    
##  0   :532   0   :539   0   :551  
##  1   : 59   1   : 38   1   : 86  
##  NA's: 92   NA's:106   2   : 15  
##                        NA's: 31  
##                                  
##                                  
##

A degenerate distribution (sometimes called a constant distribution) is a distribution of a degenerate random variable — a constant with probability of 1. In other words, a random variable X has a single possible value. Let us plot the categorical value to identify if there is any degenarete.

Soybean %>% 
  gather() %>% 
  ggplot(aes(value))+facet_wrap(~key, scales = "free")+
  geom_histogram(stat="count")+
  theme_minimal()

From above, it looks like “leaf.malf”, “lodging”, “mycelium”, “sclerotia”, “mold.growth”, “seed.discolor”, “seed.size”, “shriveling”, “leaf.mild”, “int.discolor” might be degenerate.

(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

#No of missing values
# Calculate the missing values
colSums(is.na(Soybean))

##           Class            date     plant.stand          precip            temp 
##               0               1              36              38              30 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##             121              16               1             121             121 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##             112              16               0              84              84 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
##              84             100              84             108              16 
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##             121              38              38             106              38 
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
##              38              38              38              84             106 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##              92              92             106              92             106 
##           roots 
##              31

Visualize using mice package

missing <- aggr(Soybean, col=c('navyblue','yellow'), numbers=TRUE,
                  sortVars=TRUE, labels=names(Soybean), cex.axis=.7, gap=3,
                  ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##         Variable       Count
##             hail 0.177159590
##            sever 0.177159590
##         seed.tmt 0.177159590
##          lodging 0.177159590
##             germ 0.163982430
##        leaf.mild 0.158125915
##  fruiting.bodies 0.155197657
##      fruit.spots 0.155197657
##    seed.discolor 0.155197657
##       shriveling 0.155197657
##      leaf.shread 0.146412884
##             seed 0.134699854
##      mold.growth 0.134699854
##        seed.size 0.134699854
##        leaf.halo 0.122986823
##        leaf.marg 0.122986823
##        leaf.size 0.122986823
##        leaf.malf 0.122986823
##       fruit.pods 0.122986823
##           precip 0.055636896
##     stem.cankers 0.055636896
##    canker.lesion 0.055636896
##        ext.decay 0.055636896
##         mycelium 0.055636896
##     int.discolor 0.055636896
##        sclerotia 0.055636896
##      plant.stand 0.052708638
##            roots 0.045387994
##             temp 0.043923865
##        crop.hist 0.023426061
##     plant.growth 0.023426061
##             stem 0.023426061
##             date 0.001464129
##         area.dam 0.001464129
##            Class 0.000000000
##           leaves 0.000000000

82% of the dataset is complete. Hail, sever, seed.tmt and lodging seem to be missing together and account for 17.7% of missing values. Overall, there seems to be a pattern with several variables exhibitng missing values toegther.

Pattern related to classes?

gg_miss_upset(Soybean)

The plot shows that germ, hail, server, seed.tmt and lodging have missing values together. There clearly is a pattern here.

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

First, we will remove the near-zero variance predictors using the nearZerovar from caret. The outour is the index of the columns to remove

nearZeroVar(Soybean,freqCut = 95/5, uniqueCut = 10)

## [1] 19 26 28

The columns to remove are leaf.mild, mycelium and sclerota

paste(colnames(Soybean)[19])

## [1] "leaf.mild"

paste(colnames(Soybean)[26])

## [1] "mycelium"

paste(colnames(Soybean)[28])

## [1] "sclerotia"

The rest of the data are MNAR (Missing Not at Random). The 18% rows containing missing rows may be omitted or imputed after investigating the nature of the analysis. We can substitute the missing values with mean or median and then test the accuracy of the model to select the best strategy which lead to a better model. Since the data are MNAR, I will select the kNN method to impute the missing values

Imputation with kNN

Soybean_imputed <- Soybean %>% 
  select(-c(mycelium, sclerotia, leaf.mild))%>%
  kNN()

aggr(Soybean_imputed, main='Missing data')

The dataset is now complete.

Data 624 Homework 4 Data Preprocessing/Overfitting

Farhana Zahir

3/07/2021

Exercise 3.1

Exercise 3.2