DATA624: Homework 4

library(tidyverse)
library(GGally)
library(corrplot)
library(ggpubr)
library(naniar)
library(mice)

Task

Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your Rpubs link along with your .rmd code.

3.1

The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)

data(Glass)

str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Numeric Data

  • Al - right skewed
  • Ba - right skewed, outlier
  • Ca - right skewed, outlier
  • Fe - right skewed, outlier
  • K - right skewed, outlier, bimodal
  • Mg - left skewed, bimodal
  • Na - Close to near normal
  • RI - right skewed
  • Si - left skewed
Glass %>% 
  gather(-c(Type), key = variable, value = value) %>%
  ggplot(., aes(x = value)) +
  geom_histogram(aes(x=value, y = ..density..), bins = 30, fill="#69b3a2", color="#e9ecef") +
  geom_density(aes(x=value), color='red', lwd = 1.25) +
  facet_wrap(~variable, scales ="free", ncol = 3) +
  ggtitle("Distribution of Numeric Data")

Correlation Matrix

Most of the variables have an inverse relationship.

  • Strong Negative Relationships
    • Ca / Mg
    • RI / Si
    • RI / Al
    • Mg / Al
    • Mg / Ba
  • Strong Positive Relationships
    • Ca / RI
    • K / Al
    • Al / Ba
    • Na / Ba
num.data <- Glass[,-c(10)]
corrplot(cor(num.data), method = 'shade', order = 'AOE',col= colorRampPalette(c("red","tan", "blue"))(10) , type = 'lower', diag = FALSE)

Categorical Data

Glass %>%
  ggplot(aes(x = Type)) + geom_histogram(stat="count", fill="#69b3a2", color="#e9ecef") +
  ggtitle("Distribution of Categorical Variable - Type")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

There does appear to be ouliers in some of the variables in the dataset. Ba, Ca, Fe, K, Mg, and Na appear to have observations that are outliers to the rest of the variable. There are also predictors that have a skewed data distribution. Ca, Ba, Na, and RI are right skewed. Mg and Si are left skewed.

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

Transformations like Box-Cox, log, square root, and inverse can improve the data distributions of heavily skewed data. Variables like Ba and Fe may benefit from a log transformation as they are heavy right skews. Lesser skews can maybe benefit from a square root transformation. Mg can maybe be improved via a square root transformation. The transformation below shows that it improved slightly but still suffers from the bimodal nature of the variable. Another technique that could be used is Centering and Scaling to normalize the inputs. Certain models are sensitive to input scale and require center scaling to successfully train the model. Removing variables that may have multicollinarity can prove beneficial to the model. RI appears to be highly correlated with multiple predictor variables that could indicate that multicollinarity exists. It may be best to remove RI.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

library(mlbench)
data(Soybean)
## See ?Soybean for details

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Looking at the data, mycelium appears to have a degenerate distribution. mycelium only contains 6 observations with the value 1 and 639 with the value 0, providing very little information about how mycelium is changes the target variable when it equals 0. For this reason it is best to remove the variable.

count = 0 
for(i in 1:length(names(Soybean))){
  if(is.factor(Soybean[,i])){
    count = count + 1
  }
  
  
}

paste0("Of the 36 variables in Soybean, ", count, " are categorical predictors.")
## [1] "Of the 36 variables in Soybean, 36 are categorical predictors."
summary(Soybean)
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
## 

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Dropping rows with missing data is not always the best solution. In many cases imputing is effective, as deleting data could cause the loss of important information that is needed to generalize to an unknown population. To handle the missing values I will us the MICE imputation algorithm to generate data.

imputed_Data <- mice(Soybean, m=2, maxit = 10, method = 'pmm', seed = 15)
## Warning: Number of logged events: 1344
completed.data <- complete(imputed_Data, 2)
summary(completed.data)
##                  Class     date    plant.stand precip  temp    hail   
##  brown-spot         : 92   0: 26   0:363       0:105   0: 94   0:464  
##  alternarialeaf-spot: 91   1: 75   1:320       1:112   1:379   1:219  
##  frog-eye-leaf-spot : 91   2: 93               2:466   2:210          
##  phytophthora-rot   : 88   3:118                                      
##  anthracnose        : 44   4:131                                      
##  brown-stem-rot     : 44   5:150                                      
##  (Other)            :233   6: 90                                      
##  crop.hist area.dam sever   seed.tmt germ    plant.growth leaves  leaf.halo
##  0: 71     0:124    0:222   0:389    0:174   0:449        0: 77   0:250    
##  1:166     1:227    1:385   1:236    1:213   1:234        1:606   1: 36    
##  2:222     2:145    2: 76   2: 58    2:296                        2:397    
##  3:224     3:187                                                           
##                                                                            
##                                                                            
##                                                                            
##  leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem    lodging
##  0:440     0:135     0:566       0:561     0:643     0:305   0:641  
##  1: 22     1:327     1:117       1:122     1: 20     1:378   1: 42  
##  2:221     2:221                           2: 20                    
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##  stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor
##  0:403        0:340         0:578           0:509     0:677    0:619       
##  1: 39        1: 83         1:105           1:139     1:  6    1: 44       
##  2: 36        2:177                         2: 35              2: 20       
##  3:205        3: 83                                                        
##                                                                            
##                                                                            
##                                                                            
##  sclerotia fruit.pods fruit.spots seed    mold.growth seed.discolor seed.size
##  0:625     0:407      0:396       0:491   0:524       0:610         0:615    
##  1: 58     1:130      1: 75       1:192   1:159       1: 73         1: 68    
##            2: 38      2:102                                                  
##            3:108      4:110                                                  
##                                                                              
##                                                                              
##                                                                              
##  shriveling roots  
##  0:643      0:582  
##  1: 40      1: 86  
##             2: 15  
##                    
##                    
##                    
## 
missing.values <- completed.data %>%
  gather(key = "key", value = "val") %>%
  mutate(isna = is.na(val)) %>%
  group_by(key) %>%
  mutate(total = n()) %>%
  group_by(key, total, isna) %>%
  summarise(num.isna = n()) %>%
  mutate(pct = num.isna / total * 100)
## Warning: attributes are not identical across measure variables;
## they will be dropped
## `summarise()` has grouped output by 'key', 'total'. You can override using the
## `.groups` argument.
levels <-
    (missing.values  %>% arrange(desc(pct)))$key

percentage.plot <- missing.values %>%
      ggplot() +
        geom_bar(aes(x = reorder(key, desc(pct)), 
                     y = pct, fill=isna), 
                 stat = 'identity', alpha=0.8) +
      scale_x_discrete(limits = levels) +
      scale_fill_manual(name = "", 
                        values = c('steelblue', 'tomato3'), labels = c("Present", "Missing")) +
      coord_flip() +
      labs(title = "Percentage of missing values", x =
             'Variable', y = "% of missing values")

percentage.plot