DATA624: Homework 4

library(tidyverse)
library(GGally)
library(corrplot)
library(ggpubr)
library(naniar)
library(mice)

Task

Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your Rpubs link along with your .rmd code.

3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)

data(Glass)

str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Numeric Data

Al - right skewed
Ba - right skewed, outlier
Ca - right skewed, outlier
Fe - right skewed, outlier
K - right skewed, outlier, bimodal
Mg - left skewed, bimodal
Na - close to normal
RI - right skewed
Si - left skewed

# Reshape to long format so each predictor gets its own facet
Glass %>%
  pivot_longer(-Type, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "#69b3a2", color = "#e9ecef") +
  geom_density(color = "red", lwd = 1.25) +
  facet_wrap(~variable, scales = "free", ncol = 3) +
  ggtitle("Distribution of Numeric Data")
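The skew labels above are read off the histograms by eye. As a rough numeric cross-check, sample skewness can be computed for each predictor. This is a minimal sketch: the `skewness` helper is written here for illustration (a moment-based estimator, not a function used in the original analysis), and positive values indicate right skew while negative values indicate left skew.

```r
library(mlbench)
data(Glass)

# Moment-based sample skewness (helper defined for this sketch):
# third central moment divided by the second central moment to the power 1.5.
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# All columns except the Type outcome are numeric predictors.
predictors <- Glass[, names(Glass) != "Type"]

# One skewness value per predictor, most right-skewed first.
skew <- sort(sapply(predictors, skewness), decreasing = TRUE)
round(skew, 2)
```

Values near zero suggest an approximately symmetric distribution, though skewness alone will not reveal bimodality, so the histograms are still needed.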

Correlation Matrix

Most pairs of predictors are negatively correlated.

Strong Negative Relationships
- Ca / Mg
- RI / Si
- RI / Al
- Mg / Al
- Mg / Ba

Strong Positive Relationships
- Ca / RI
- K / Al
- Al / Ba
- Na / Ba

num.data <- Glass[, -10]  # drop the Type factor, keeping the nine numeric predictors
corrplot(cor(num.data), method = 'shade', order = 'AOE', type = 'lower', diag = FALSE,
         col = colorRampPalette(c("red", "tan", "blue"))(10))
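The shaded plot shows direction and rough strength, but ranking the pairs is easier from a table. The sketch below lists each pair of predictors once, sorted by absolute correlation. The names `cor.mat` and `cor.pairs` are illustrative and not part of the original analysis.

```r
library(mlbench)
data(Glass)

cor.mat <- cor(Glass[, names(Glass) != "Type"])

# Keep each pair once by taking the upper triangle only.
idx <- which(upper.tri(cor.mat), arr.ind = TRUE)
cor.pairs <- data.frame(
  var1 = rownames(cor.mat)[idx[, "row"]],
  var2 = colnames(cor.mat)[idx[, "col"]],
  r    = cor.mat[idx]
)

# Sort by absolute correlation, strongest first.
cor.pairs <- cor.pairs[order(-abs(cor.pairs$r)), ]
head(cor.pairs, 10)
```

Comparing the top rows of this table with the lists above is a quick way to check which relationships are strong and which are only moderate.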

Categorical Data

Glass %>%
  ggplot(aes(x = Type)) + geom_bar(fill="#69b3a2", color="#e9ecef") +
  ggtitle("Distribution of Categorical Variable - Type")

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

There do appear to be outliers in several predictors: Ba, Ca, Fe, K, Mg, and Na each have observations that sit far from the bulk of their distributions. Several predictors are also skewed. Al, Ba, Ca, Fe, K, and RI are right skewed, Mg and Si are left skewed, and Na is close to symmetric with only a slight right skew.
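One way to put a number on the outlier claim is the boxplot rule, which flags points more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies it to each predictor. The `count_outliers` helper is defined here for illustration, and the 1.5 multiplier is the conventional default, not a choice made in the original analysis.

```r
library(mlbench)
data(Glass)

# Count observations beyond 1.5 * IQR from the quartiles (the boxplot rule).
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

outliers <- sapply(Glass[, names(Glass) != "Type"], count_outliers)
sort(outliers, decreasing = TRUE)
```

For a predictor that is mostly zero, the interquartile range can collapse to zero, in which case every nonzero value is flagged. Counts for such predictors say more about the mass at zero than about extreme values.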

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

Transformations such as Box-Cox, log, square root, and inverse can reduce heavy skew. Ba and Fe are strongly right skewed and may benefit from a log transformation, but both contain zeros, so log(x + 1) or a Yeo-Johnson transformation is needed in place of a plain log or Box-Cox. Milder skews may benefit from a square root transformation. Mg may be improved by a square root transformation. The transformation below shows that it improved slightly but still suffers from the bimodal nature of the variable.

Another technique is centering and scaling the predictors. Certain models are sensitive to input scale and need centered and scaled inputs to train successfully.

Removing predictors involved in multicollinearity can also benefit the model. RI appears to be highly correlated with several other predictors, which suggests that multicollinearity exists. It may be best to remove RI.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
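The log and centering/scaling ideas discussed in this answer can be sketched in base R. This is an illustration under stated assumptions, not the code behind the plot above: `log1p()` stands in for a plain log because Ba and Fe contain zeros, the `skewness` helper is defined here only to compare before and after, and `transformed` and `scaled` are illustrative names.

```r
library(mlbench)
data(Glass)

# Moment-based sample skewness (helper defined for this sketch).
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

predictors <- Glass[, names(Glass) != "Type"]

# Ba and Fe contain exact zeros, so log(x) would give -Inf;
# log1p(x) computes log(1 + x) and is defined at zero.
transformed <- predictors
transformed$Ba <- log1p(predictors$Ba)
transformed$Fe <- log1p(predictors$Fe)

# Skewness before and after, for the two transformed columns.
rbind(
  before = sapply(predictors[c("Ba", "Fe")], skewness),
  after  = sapply(transformed[c("Ba", "Fe")], skewness)
)

# Center and scale every predictor to mean 0 and standard deviation 1.
scaled <- as.data.frame(scale(transformed))
```

Whether a transformation actually helps should be judged from the before and after comparison and, ultimately, from model performance.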

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

library(mlbench)
data(Soybean)
## See ?Soybean for details

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Looking at the data, mycelium appears to have a degenerate distribution. It takes the value 1 in only 6 observations and the value 0 in 639, so it provides very little information about how the outcome changes when mycelium equals 1. For this reason it is best to remove the variable. sclerotia (625 vs. 20) and leaf.mild (535 vs. 20 and 20) are also heavily imbalanced and are worth checking for the same reason.
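Reading degeneracy off a long summary table is error prone, so a simple screen can help. The sketch below computes, for each predictor, the ratio of the most common level to the second most common and lists those above 20, the rough threshold suggested in the chapter for near-zero-variance predictors. The `freq_ratio` helper and the exact cutoff are assumptions of this sketch, not part of the original analysis.

```r
library(mlbench)
data(Soybean)

# Ratio of the most common level to the second most common (NAs excluded).
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)
  as.numeric(counts[1] / counts[2])
}

# Screen every predictor, leaving out the Class outcome.
ratios <- sapply(Soybean[, names(Soybean) != "Class"], freq_ratio)

# Predictors whose dominant level outnumbers the runner-up by more than 20 to 1.
sort(ratios[ratios > 20], decreasing = TRUE)
```

A high ratio is only a flag. Whether to drop the predictor also depends on whether the rare level is concentrated in a single class, where it could still be informative.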

# Count the factor columns without an explicit loop
count <- sum(sapply(Soybean, is.factor))

paste0("Of the ", ncol(Soybean), " variables in Soybean, ", count, " are factors (35 predictors plus the Class outcome).")

## [1] "Of the 36 variables in Soybean, 36 are factors (35 predictors plus the Class outcome)."

summary(Soybean)

##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
##

(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

The variables with the most missing data are hail, lodging, seed.tmt, and sever, each missing 121 of 683 values (about 17.7%). The missing data appear to follow a pattern: many variables have about the same percentage missing, which suggests that the missing values fall in the same rows. There is also a pattern when the data are split by Class. The classes 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury are missing 100% of values for most of the variables, and phytophthora-rot is missing about 75% of values for most variables.

missing.values <- Soybean %>%
  gather(key = "key", value = "val") %>%
  mutate(isna = is.na(val)) %>%
  group_by(key) %>%
  mutate(total = n()) %>%
  group_by(key, total, isna) %>%
  summarise(num.isna = n()) %>%
  mutate(pct = num.isna / total * 100)

## Warning: attributes are not identical across measure variables;
## they will be dropped

## `summarise()` has grouped output by 'key', 'total'. You can override using the
## `.groups` argument.

# Order variables by percentage missing. Variables with no missing values
# (Class and leaves) do not appear here, so they are dropped from the plot,
# which is what triggers the "Removed 2 rows" warning.
var.order <-
    (missing.values  %>% filter(isna == TRUE) %>% arrange(desc(pct)))$key

percentage.plot <- missing.values %>%
      ggplot() +
        geom_bar(aes(x = reorder(key, desc(pct)), 
                     y = pct, fill=isna), 
                 stat = 'identity', alpha=0.8) +
      scale_x_discrete(limits = var.order) +
      scale_fill_manual(name = "", 
                        values = c('steelblue', 'tomato3'), labels = c("Present", "Missing")) +
      coord_flip() +
      labs(title = "Percentage of missing values", x =
             'Variable', y = "% of missing values")

percentage.plot

## Warning: Removed 2 rows containing missing values (position_stack).
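The same per-variable percentages can be cross-checked without reshaping the data. This is a minimal base R sketch, and `pct.missing` is an illustrative name.

```r
library(mlbench)
data(Soybean)

# Percentage of missing values per variable, largest first.
pct.missing <- sort(colMeans(is.na(Soybean)) * 100, decreasing = TRUE)
round(head(pct.missing, 10), 1)
```

Because `is.na()` on a data frame returns a logical matrix, `colMeans()` gives the share of missing entries in each column directly, and it avoids the attribute warning that `gather()` raises on factor columns.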

gg_miss_fct(x = Soybean, fct = Class)
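To check numerically whether missingness depends on the class, the share of incomplete rows can be tabulated per class. This is a rough sketch that treats a row as incomplete if any predictor is missing. The names `incomplete` and `by.class` are illustrative.

```r
library(mlbench)
data(Soybean)

# TRUE for rows with at least one missing value.
incomplete <- !complete.cases(Soybean)

# Share of incomplete rows within each class.
by.class <- tapply(incomplete, Soybean$Class, mean)

# Show only the classes that have any incomplete rows, as percentages.
round(by.class[by.class > 0] * 100, 1)

# Share of incomplete rows overall.
mean(incomplete)
```

If the incomplete rows are confined to a handful of classes, missingness itself carries information about the outcome, which matters when choosing between dropping rows and imputing.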

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Dropping rows with missing data is not always the best solution. In many cases imputation is more effective, because deleting data can discard information needed to generalize to an unseen population. To handle the missing values I will use the MICE imputation algorithm to generate data.

imputed_Data <- mice(Soybean, m=2, maxit = 10, method = 'pmm', seed = 15)

## Warning: Number of logged events: 1344

completed.data <- complete(imputed_Data, 2)

summary(completed.data)

##                  Class     date    plant.stand precip  temp    hail   
##  brown-spot         : 92   0: 26   0:363       0:105   0: 94   0:464  
##  alternarialeaf-spot: 91   1: 75   1:320       1:112   1:379   1:219  
##  frog-eye-leaf-spot : 91   2: 93               2:466   2:210          
##  phytophthora-rot   : 88   3:118                                      
##  anthracnose        : 44   4:131                                      
##  brown-stem-rot     : 44   5:150                                      
##  (Other)            :233   6: 90                                      
##  crop.hist area.dam sever   seed.tmt germ    plant.growth leaves  leaf.halo
##  0: 71     0:124    0:222   0:389    0:174   0:449        0: 77   0:250    
##  1:166     1:227    1:385   1:236    1:213   1:234        1:606   1: 36    
##  2:222     2:145    2: 76   2: 58    2:296                        2:397    
##  3:224     3:187                                                           
##                                                                            
##                                                                            
##                                                                            
##  leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem    lodging
##  0:440     0:135     0:566       0:561     0:643     0:305   0:641  
##  1: 22     1:327     1:117       1:122     1: 20     1:378   1: 42  
##  2:221     2:221                           2: 20                    
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##  stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor
##  0:403        0:340         0:578           0:509     0:677    0:619       
##  1: 39        1: 83         1:105           1:139     1:  6    1: 44       
##  2: 36        2:177                         2: 35              2: 20       
##  3:205        3: 83                                                        
##                                                                            
##                                                                            
##                                                                            
##  sclerotia fruit.pods fruit.spots seed    mold.growth seed.discolor seed.size
##  0:625     0:407      0:396       0:491   0:524       0:610         0:615    
##  1: 58     1:130      1: 75       1:192   1:159       1: 73         1: 68    
##            2: 38      2:102                                                  
##            3:108      4:110                                                  
##                                                                              
##                                                                              
##                                                                              
##  shriveling roots  
##  0:643      0:582  
##  1: 40      1: 86  
##             2: 15  
##                    
##                    
##                    
##
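The summary shows no NA rows after imputation. A shorter check, together with a look at what produced the "logged events" warning, might look like the sketch below. It reruns the same `mice()` call as above. `imp` and `completed` are illustrative names, and `printFlag = FALSE` is added only to silence the iteration log.

```r
library(mlbench)
library(mice)
data(Soybean)

# Same settings as the call above, with the iteration log turned off.
imp <- mice(Soybean, m = 2, maxit = 10, method = "pmm", seed = 15,
            printFlag = FALSE)

# mice records issues it handled internally during imputation;
# these are the events counted in the warning.
head(imp$loggedEvents)

# Extract the second completed data set and confirm nothing is missing.
completed <- complete(imp, 2)
anyNA(completed)
```

Reviewing `loggedEvents` before trusting the result is worthwhile, since a large number of events can indicate that some predictors were dropped from the imputation models.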

missing.values <- completed.data %>%
  gather(key = "key", value = "val") %>%
  mutate(isna = is.na(val)) %>%
  group_by(key) %>%
  mutate(total = n()) %>%
  group_by(key, total, isna) %>%
  summarise(num.isna = n()) %>%
  mutate(pct = num.isna / total * 100)

## Warning: attributes are not identical across measure variables;
## they will be dropped

## `summarise()` has grouped output by 'key', 'total'. You can override using the
## `.groups` argument.

levels <-
    (missing.values  %>% arrange(desc(pct)))$key

percentage.plot <- missing.values %>%
      ggplot() +
        geom_bar(aes(x = reorder(key, desc(pct)), 
                     y = pct, fill=isna), 
                 stat = 'identity', alpha=0.8) +
      scale_x_discrete(limits = levels) +
      scale_fill_manual(name = "", 
                        values = c('steelblue', 'tomato3'), labels = c("Present", "Missing")) +
      coord_flip() +
      labs(title = "Percentage of missing values", x =
             'Variable', y = "% of missing values")

percentage.plot