library(tidyverse)
library(GGally)
library(fpp2)
library(rio)
library(gridExtra)
library(caret)
library(scales)
library(naniar)
library(missForest)

1 HW4: Data Preprocessing and Overfitting

Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your Rpubs link along with your .rmd code.

1.1 Ex. 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

(a.) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

(b.) Do there appear to be any outliers in the data? Are any predictors skewed?

(c.) Are there any relevant transformations of one or more predictors that might improve the classification model?

1.1.1 Part a

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Answer:

library(mlbench)
data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

summary(Glass)

##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29

glass <- Glass[,1:9]
par(mfrow=c(3,3))
for(i in 1:ncol(glass)){
  hist(glass[,i],
       main = names(glass[i]),
       xlab = names(glass[i]))
}

glass_corr <- Glass %>% subset(select=-c(Type)) %>% cor()
ggpairs(Glass[1:9],
        lower = list(continuous = wrap("smooth", alpha = 0.3, size = 0.2)),
        title = "Correlation Matrix of Glass[1:9]") +
  theme(axis.text.y = element_text(size = 6))

1.1.2 Part b

Do there appear to be any outliers in the data? Are any predictors skewed?

Answer:

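From the box plots and density plots below, we can see that:

there are outliers in all predictors except Mg.
all predictors are skewed to some degree, most visibly K, Ba, and Fe, whose values cluster at the low end with a long right tail.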

par(mfrow=c(3,3))
for(i in 1:ncol(glass)){
  boxplot(glass[,i], horizontal = TRUE,
          main = names(glass[i]),
          xlab = names(glass[i]))
}

par(mfrow=c(3,3))
for(i in 1:ncol(glass)){
  plot(density(glass[,i]), 
       main = names(glass[i]),
       xlab = names(glass[i]))
}
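To back the visual impression with numbers, the sketch below computes a moment-based sample skewness and counts the points that fall beyond the boxplot whiskers. The helper names skewness, n_outliers, and describe_predictors, as well as the toy data frame, are illustrative assumptions and not part of the original analysis.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Number of points beyond the boxplot whiskers
# (1.5 times the hinge spread, which is roughly the IQR).
n_outliers <- function(x) length(boxplot.stats(x)$out)

# One row per numeric column: skewness and outlier count.
describe_predictors <- function(df) {
  data.frame(
    predictor  = names(df),
    skewness   = vapply(df, skewness, numeric(1)),
    n_outliers = vapply(df, n_outliers, integer(1)),
    row.names  = NULL
  )
}

# Hypothetical toy data: one symmetric column, one with a long right tail.
toy <- data.frame(symmetric  = c(1, 2, 3, 4, 5),
                  right_tail = c(1, 1, 1, 2, 20))
toy_summary <- describe_predictors(toy)
print(toy_summary)
```

Applied to this exercise, describe_predictors(glass) would return one row per predictor; a positive skewness indicates a right tail and a negative one a left tail.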

1.1.3 Part c

Are there any relevant transformations of one or more predictors that might improve the classification model?

Answer:

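Apply a Box-Cox transformation to reduce the skewness of the predictors. Box-Cox requires strictly positive values, and Mg, K, Ba, and Fe contain zeros, so those predictors need an offset or an alternative such as the Yeo-Johnson transformation.
The correlation coefficient between RI and Ca is 0.81, which exceeds the 0.75 threshold, and these two predictors also have the highest average correlations with the remaining predictors. Thus, either (1) centering and scaling the data and then applying PCA, or (2) removing either RI or Ca from the data can help improve the classification model.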
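The two options for the correlated pair can be sketched in base R as follows. This is a minimal illustration on hypothetical toy data, not the analysis actually run for this answer; the helper high_cor_pairs and the 0.75 default cutoff are assumptions that mirror the threshold mentioned above.

```r
# Pairs of columns whose absolute Pearson correlation exceeds `cutoff`.
high_cor_pairs <- function(df, cutoff = 0.75) {
  cm  <- cor(df)
  # Look only at the upper triangle so each pair is reported once.
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, "row"]],
             var2 = colnames(cm)[idx[, "col"]],
             r    = cm[idx],
             row.names = NULL)
}

# Hypothetical toy data: b is almost a multiple of a, c is only weakly related.
toy <- data.frame(a = 1:8,
                  b = 2 * (1:8) + c(0.1, -0.1, 0.2, -0.2, 0.1, -0.1, 0.2, -0.2),
                  c = c(3, 1, 4, 1, 5, 9, 2, 6))

pairs_found <- high_cor_pairs(toy)
print(pairs_found)

# Option 1: center, scale, and rotate onto principal components.
pca <- prcomp(toy, center = TRUE, scale. = TRUE)
print(summary(pca)$importance)

# Option 2: drop one member of each highly correlated pair.
reduced <- toy[, setdiff(names(toy), pairs_found$var2), drop = FALSE]
print(names(reduced))
```

For this exercise the call would be high_cor_pairs(glass), which should surface the RI and Ca pair reported above.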

1.2 Ex. 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

(a.) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

(b.) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

(c.) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

1.2.1 Part a

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Answer:

Three predictors meet both degeneracy criteria:

the fraction of unique values over the sample size is low (<10%), and
the ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (at least 20).

They are leaf.mild, mycelium, and sclerotia.

library(mlbench)
data(Soybean)
str(Soybean)

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

summary(Soybean, maxsum = 10)

##                  Class       date     plant.stand  precip      temp    
##  brown-spot         : 92   0   : 26   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   1   : 75   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   2   : 93   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   3   :118               NA's: 38   NA's: 30  
##  anthracnose        : 44   4   :131                                    
##  brown-stem-rot     : 44   5   :149                                    
##  bacterial-blight   : 20   6   : 90                                    
##  bacterial-pustule  : 20   NA's:  1                                    
##  charcoal-rot       : 20                                               
##  (Other)            :173                                               
##    hail     crop.hist  area.dam    sever     seed.tmt     germ    
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165  
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213  
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193  
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112  
##             NA's: 16   NA's:  1                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##  plant.growth leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread
##  0   :441     0: 77   0   :221   0   :357   0   : 51   0   :487   
##  1   :226     1:606   1   : 36   1   : 21   1   :327   1   : 96   
##  NA's: 16             2   :342   2   :221   2   :221   NA's:100   
##                       NA's: 84   NA's: 84   NA's: 84              
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##  leaf.malf  leaf.mild    stem     lodging    stem.cankers canker.lesion
##  0   :554   0   :535   0   :296   0   :520   0   :379     0   :320     
##  1   : 45   1   : 20   1   :371   1   : 42   1   : 39     1   : 83     
##  NA's: 84   2   : 20   NA's: 16   NA's:121   2   : 36     2   :177     
##             NA's:108                         3   :191     3   : 65     
##                                              NA's: 38     NA's: 38     
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##  fruiting.bodies ext.decay  mycelium   int.discolor sclerotia  fruit.pods
##  0   :473        0   :497   0   :639   0   :581     0   :625   0   :407  
##  1   :104        1   :135   1   :  6   1   : 44     1   : 20   1   :130  
##  NA's:106        2   : 13   NA's: 38   2   : 20     NA's: 38   2   : 14  
##                  NA's: 38              NA's: 38                3   : 48  
##                                                                NA's: 84  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##  fruit.spots   seed     mold.growth seed.discolor seed.size  shriveling
##  0   :345    0   :476   0   :524    0   :513      0   :532   0   :539  
##  1   : 75    1   :115   1   : 67    1   : 64      1   : 59   1   : 38  
##  2   : 57    NA's: 92   NA's: 92    NA's:106      NA's: 92   NA's:106  
##  4   :100                                                              
##  NA's:106                                                              
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##   roots    
##  0   :551  
##  1   : 86  
##  2   : 15  
##  NA's: 31  
##            
##            
##            
##            
##            
##

df <- data.frame(Column = character(),
                 Lower_than_10Pct = logical(),
                 Ratio_1st_to_2nd = numeric())

for(i in seq_along(Soybean)){
  # Level frequencies (NAs excluded), most prevalent first
  freq <- table(Soybean[[i]]) %>% sort(decreasing = TRUE)

  # Criterion 1: fraction of unique values over the sample size is below 10%
  lower_than_10Pct <- length(freq) / nrow(Soybean) < 0.10
  # Criterion 2: ratio of the most prevalent to the second most prevalent value
  ratio_1st_to_2nd <- as.numeric(freq[1]) / as.numeric(freq[2])

  df <- df %>% add_row(Column = names(Soybean)[i],
                       Lower_than_10Pct = lower_than_10Pct,
                       Ratio_1st_to_2nd = ratio_1st_to_2nd)
}

# Keep only the columns that meet both criteria
df %>%
  filter(Lower_than_10Pct, Ratio_1st_to_2nd >= 20)
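As a cross-check, and assuming the caret and mlbench packages loaded earlier are installed, caret's nearZeroVar function evaluates the same two criteria. The cutoffs below (freqCut = 20, uniqueCut = 10) are chosen here to mirror the thresholds used above; they are not caret's defaults, and the object names nzv_metrics and nzv_names are illustrative.

```r
library(mlbench)
library(caret)
data(Soybean)

# saveMetrics = TRUE returns one row per column with the frequency ratio,
# the percentage of unique values, and the zeroVar / nzv flags.
nzv_metrics <- nearZeroVar(Soybean, freqCut = 20, uniqueCut = 10,
                           saveMetrics = TRUE)

# Names of the columns flagged as near-zero variance.
nzv_names <- rownames(nzv_metrics)[nzv_metrics$nzv]
print(nzv_metrics[nzv_metrics$nzv, ])
```

A column is flagged only when its frequency ratio exceeds freqCut and its percentage of unique values does not exceed uniqueCut, so both conditions must hold at once.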

1.2.2 Part b

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Answer:

About half of the predictors contain more than 70 missing values. In particular, four predictors have 121 missing values each: sever, seed.tmt, lodging, and hail.
The analysis below shows that the missing data are strongly related to the classes. Four classes have multiple predictors that are 100% NA: 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury.

gg_miss_var(Soybean) + 
  theme(axis.text.y = element_text(size = 8))

Soybean %>% 
  gather(key = 'Column', value = 'Value', -Class) %>%
  mutate(Value = if_else(!is.na(Value), 'Not NA','NA')) %>%
  group_by(Class, Column, Value) %>%
  summarise(Count = n()) %>%
  left_join(Soybean %>% group_by(Class) %>% summarise(Class_Count = n()), by = 'Class') %>%
  mutate(Pct_Count = Count/Class_Count) %>%
  select(-Count, -Class_Count) %>%
  filter(Value == 'NA', Pct_Count >= 0.8) %>%
  mutate(Pct_Count = percent(Pct_Count)) %>%
  spread(key = 'Column', value = 'Pct_Count')

Soybean %>% 
  filter(Class %in% c('2-4-d-injury','cyst-nematode','diaporthe-pod-&-stem-blight','herbicide-injury')) %>%
  gg_miss_var(facet = Class) + 
  theme(axis.text.y = element_text(size = 5))
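The same two questions (which predictors are missing most often, and whether missingness depends on the class) can also be answered numerically with base R. The sketch below is illustrative only: the helpers na_share_by_column and na_share_by_class and the toy data frame are assumptions introduced here, not part of the original analysis.

```r
# Share of missing values per column, largest first.
na_share_by_column <- function(df) {
  sort(colMeans(is.na(df)), decreasing = TRUE)
}

# Share of rows with at least one missing predictor, per class.
na_share_by_class <- function(df, class_col) {
  predictors <- df[, setdiff(names(df), class_col), drop = FALSE]
  incomplete <- !complete.cases(predictors)
  tapply(incomplete, df[[class_col]], mean)
}

# Hypothetical toy data: class "b" is always incomplete,
# class "a" is incomplete in half of its rows.
toy <- data.frame(Class = factor(c("a", "a", "b", "b")),
                  x     = c(1, NA, NA, NA),
                  y     = c("u", "v", NA, "w"))

by_column <- na_share_by_column(toy)
by_class  <- na_share_by_class(toy, "Class")
print(by_column)
print(by_class)
```

For this exercise the calls would be na_share_by_column(Soybean) and na_share_by_class(Soybean, "Class"); classes with a share of 1 have no complete rows at all.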

1.2.3 Part c

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Answer:

Combine predictor elimination with imputation:

First, remove mycelium, sclerotia, shriveling, lodging, and leaf.malf from the Soybean data; these five predictors have the lowest variance (computed below on the numerically coded levels).
Second, use the missForest function to impute the remaining missing values.

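Soybean %>%
  select(-Class) %>%
  # apply() coerces the factors to a character matrix, so convert the labels to numbers explicitly
  apply(2, function(x) var(as.numeric(x), na.rm = TRUE)) %>%
  sort()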

##        mycelium       sclerotia      shriveling         lodging 
##     0.009230103     0.030092927     0.061627431     0.069271319 
##       leaf.malf       seed.size   seed.discolor          leaves 
##     0.069597601     0.090016920     0.098786828     0.100174751 
##     mold.growth     leaf.shread fruiting.bodies            seed 
##     0.100685423     0.137787130     0.148011747     0.156987582 
##       leaf.mild            hail    int.discolor           roots 
##     0.163308590     0.175224085     0.175559728     0.192568300 
##    plant.growth       ext.decay            stem     plant.stand 
##     0.224360793     0.227969570     0.247209728     0.248161316 
##           sever       leaf.size        seed.tmt            temp 
##     0.356442804     0.374168765     0.374839033     0.394653276 
##          precip            germ      fruit.pods       leaf.halo 
##     0.470797824     0.625661351     0.778828706     0.900597987 
##       leaf.marg       crop.hist        area.dam   canker.lesion 
##     0.914919515     0.952118535     1.154279759     1.175058982 
##    stem.cankers     fruit.spots            date 
##     1.827083634     2.259983391     2.870033287

Soybean_mf <- Soybean %>% 
  select(-c('mycelium', 'sclerotia', 'shriveling', 'lodging','leaf.malf')) %>%
  missForest() %>%
  .$ximp

##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!

gg_miss_var(Soybean_mf) + 
  theme(axis.text.y = element_text(size = 10))

Soybean_mf
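For comparison, a much simpler baseline than missForest is to replace each missing value with the most frequent level of its column. The sketch below is a swapped-in technique (mode imputation), not what was used above; the helper names impute_mode and impute_mode_df and the toy data are assumptions for illustration, and the helper deliberately leaves non-factor columns untouched.

```r
# Replace NAs in a factor with its most frequent level.
impute_mode <- function(x) {
  if (!is.factor(x) || !anyNA(x)) return(x)
  counts <- table(x)                       # table() drops NA by default
  x[is.na(x)] <- names(counts)[which.max(counts)]
  x
}

# Apply the column-wise imputation while keeping the data frame structure.
impute_mode_df <- function(df) {
  df[] <- lapply(df, impute_mode)
  df
}

# Hypothetical toy data with one unordered and one ordered factor.
toy <- data.frame(
  f = factor(c("0", "0", "1", NA)),
  g = factor(c("x", NA, "y", "y"), levels = c("x", "y"), ordered = TRUE)
)
toy_imputed <- impute_mode_df(toy)
print(toy_imputed)
```

Mode imputation ignores the relationships between predictors, which missForest uses when it fits a random forest for each variable, so it is best treated as a reference point rather than a replacement.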

Data 624 HW4: Data Preprocessing
