library(tidyverse)
library(corrplot)
library(missForest)
library(ggthemes)
library(psych)
library(naniar)
library(DMwR)

1 Question - 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

(a.) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

(b.) Do there appear to be any outliers in the data? Are any predictors skewed?

(c.) Are there any relevant transformations of one or more predictors that might improve the classification model?

library(mlbench)
data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

summary(Glass)

##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29

1.1 (a)

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Answer:

data <- Glass %>% select(-Type)
data %>%
  gather(key = 'Predictor', value = 'Value') %>%
  ggplot(aes(x=Value)) +
  geom_histogram(bins=30) +
  facet_wrap(~Predictor,scales = "free") +
  theme_hc()+
  ggtitle('Histogram: Glass')

pairs.panels(data, scale=TRUE)
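
As a numeric complement to the pairs plot, the pairwise Pearson correlations can be ranked directly. The sketch below uses base R only; the object names (cor_mat, cor_pairs) are illustrative, and the corrplot call runs only if that package is installed.

```r
library(mlbench)
data(Glass)

# Pearson correlation matrix of the nine predictors (Type excluded)
cor_mat <- cor(Glass[, setdiff(names(Glass), "Type")])

# Keep each pair once (upper triangle) and rank by absolute correlation
pair_idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_mat)[pair_idx[, "row"]],
  var2 = colnames(cor_mat)[pair_idx[, "col"]],
  r    = cor_mat[pair_idx],
  stringsAsFactors = FALSE
)
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
head(cor_pairs, 5)

# Optional visual summary of the same matrix
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_mat, method = "number", type = "upper")
}
```

Ranking by absolute value keeps strong negative relationships in view alongside the positive ones.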

1.2 (b)

Do there appear to be any outliers in the data? Are any predictors skewed?

Answer:

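All predictors except Mg show outliers in the boxplots below.
From the histograms in (a), all predictors are skewed to some degree. The skew is most pronounced for K, Ba, and Fe, which have long right tails (Ba and Fe are mostly zeros), while Mg is left-skewed with a cluster of zeros.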

data %>%
  gather(key = 'Predictor', value = 'Value') %>%
  ggplot(aes(x=Value, y = Predictor)) +
  geom_boxplot()+
  facet_wrap(~Predictor, scales = 'free')+
  theme_hc()
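
The visual impressions above can be backed with numbers. The following is a minimal sketch, assuming the usual 1.5 x IQR boxplot rule for outliers and a simple moment-based skewness; the names outlier_counts and skewness are illustrative, and boxplot.stats uses hinges, so counts may differ marginally from what geom_boxplot draws.

```r
library(mlbench)
data(Glass)

predictors <- Glass[, setdiff(names(Glass), "Type")]

# Number of points beyond 1.5 * IQR from the hinges, per predictor
outlier_counts <- sapply(predictors, function(x) length(boxplot.stats(x)$out))

# Moment-based skewness: third central moment over (second central moment)^1.5
skewness <- sapply(predictors, function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
})

round(rbind(outliers = outlier_counts, skewness = skewness), 2)
```

A positive skewness value indicates a right tail and a negative value a left tail; values near zero suggest a roughly symmetric predictor.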

1.3 (c)

Are there any relevant transformations of one or more predictors that might improve the classification model?

Answer:

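Targeting skewness, apply a Box-Cox transformation to make the distributions more symmetric. Box-Cox requires strictly positive values, so the predictors that contain zeros (Mg, K, Ba, and Fe) need an offset or an alternative such as Yeo-Johnson.
Targeting collinearity, since RI and Ca have the highest pairwise correlation (0.81), reduce the predictors by either:

- performing PCA after preprocessing (Box-Cox, centering, scaling, etc.);
- removing either RI or Ca, whichever has the higher mean correlation with the other predictors.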
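
Both options can be sketched with the caret package, which this report does not otherwise load, so its use here is an assumption. Yeo-Johnson is swapped in for Box-Cox because several predictors contain zeros, and the 0.8 cutoff is an illustrative choice that isolates the RI/Ca pair.

```r
library(mlbench)
library(caret)
data(Glass)

predictors <- Glass[, setdiff(names(Glass), "Type")]

# Option 1: Yeo-Johnson, centering, scaling, then PCA in one preprocessing step
pp <- preProcess(predictors, method = c("YeoJohnson", "center", "scale", "pca"))
glass_pca <- predict(pp, predictors)

# Option 2: from each pair with |r| above the cutoff, flag the predictor
# with the larger mean absolute correlation for removal
drop_idx <- findCorrelation(cor(predictors), cutoff = 0.8)
dropped <- names(predictors)[drop_idx]
glass_reduced <- predictors[, setdiff(names(predictors), dropped), drop = FALSE]

dropped
dim(glass_pca)
```

By default preProcess keeps enough principal components to explain 95% of the variance; the thresh argument changes that.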

2 Question - 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

(a.) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

(b.) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

(c.) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

library(mlbench)
data(Soybean)
str(Soybean)

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

summary(Soybean, maxsum=20)

##                          Class      date     plant.stand  precip      temp    
##  2-4-d-injury               :16   0   : 26   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot        :91   1   : 75   1   :293    1   :112   1   :374  
##  anthracnose                :44   2   : 93   NA's: 36    2   :459   2   :199  
##  bacterial-blight           :20   3   :118               NA's: 38   NA's: 30  
##  bacterial-pustule          :20   4   :131                                    
##  brown-spot                 :92   5   :149                                    
##  brown-stem-rot             :44   6   : 90                                    
##  charcoal-rot               :20   NA's:  1                                    
##  cyst-nematode              :14                                               
##  diaporthe-pod-&-stem-blight:15                                               
##  diaporthe-stem-canker      :20                                               
##  downy-mildew               :20                                               
##  frog-eye-leaf-spot         :91                                               
##  herbicide-injury           : 8                                               
##  phyllosticta-leaf-spot     :20                                               
##  phytophthora-rot           :88                                               
##  powdery-mildew             :20                                               
##  purple-seed-stain          :20                                               
##  rhizoctonia-root-rot       :20                                               
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
##                                                            
##                                                            
##                                                            
##                                                            
##                                                            
##                                                            
##                                                            
##                                                            
##                                                            
##                                                            
##                                                            
##                                                            
##

2.1 (a)

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Answer:

According to this chapter, some models can be crippled by predictors with degenerate distributions, such as near-zero variance predictors. A rule of thumb for detecting near-zero variance predictors is:

- The fraction of unique values over the sample size is low (say 10%).

- The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20).

If both of these criteria are true and the model in question is susceptible to this type of predictor, it may be advantageous to remove the variable from the model.

In this dataset, three predictors meet both criteria: leaf.mild, mycelium, and sclerotia.

Soybean %>% 
  select(1:18) %>%
  gather(key = 'Predictor', value = 'Value', -Class) %>%
  ggplot(aes(x=Value))+
  geom_bar()+
  facet_wrap(~Predictor, scales = 'free')+
  ggtitle('Histogram:Soybean - 1')+
  theme_hc()

Soybean %>% 
  select(1,19:36) %>%
  gather(key = 'Predictor', value = 'Value', -Class) %>%
  ggplot(aes(x=Value))+
  geom_bar()+
  facet_wrap(~Predictor, scales = 'free')+
  ggtitle('Histogram:Soybean - 2')+
  theme_hc()

# Number of rows per predictor (missing values included)
Row_Cnt <- Soybean %>%
  gather(key = 'Predictor', value = 'Value', -Class, na.rm = FALSE) %>%
  group_by(Predictor) %>%
  tally(name = 'Row_Cnt')

# Predictors whose number of unique (non-missing) values is less than 10% of the sample size
Soybean %>%
  gather(key = 'Predictor', value = 'Value', -Class) %>%
  filter(!is.na(Value)) %>%
  group_by(Predictor) %>%
  summarise(Uniq_Val_Cnt = n_distinct(Value)) %>%
  left_join(Row_Cnt, by = 'Predictor') %>%
  mutate(Uniq_Val_Frac = Uniq_Val_Cnt/Row_Cnt) %>%
  filter(Uniq_Val_Frac < 0.1) %>%
  select(Predictor)

# Predictors where the most prevalent value is at least 20 times as frequent as the second most prevalent value
Soybean %>%
  gather(key = 'Predictor', value = 'Value', -Class, na.rm = TRUE) %>%
  group_by(Predictor, Value) %>%
  tally(name = 'Cnt') %>%
  arrange(Predictor, desc(Cnt)) %>%
  mutate(id = row_number()) %>%
  filter(id %in% c(1,2)) %>%
  select(-Value) %>%
  spread(key = 'id', value = 'Cnt') %>%
  mutate(Ratio_1to2 = `1`/`2`) %>%
  filter(Ratio_1to2 >= 20) %>%
  select(-`1`,-`2`)

## Warning: attributes are not identical across measure variables;
## they will be dropped
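
As a cross-check on the two rule-of-thumb criteria, caret provides nearZeroVar, which reports both metrics per column. This is a sketch under the assumption that caret is installed (the report itself does not load it); freqCut = 20 and uniqueCut = 10 are set to mirror the thresholds quoted above rather than the function defaults.

```r
library(mlbench)
library(caret)
data(Soybean)

# saveMetrics = TRUE returns freqRatio and percentUnique for every predictor
nzv_metrics <- nearZeroVar(Soybean[, names(Soybean) != "Class"],
                           freqCut = 20, uniqueCut = 10, saveMetrics = TRUE)

# Predictors flagged as near-zero variance under both criteria
nzv_metrics[nzv_metrics$nzv, c("freqRatio", "percentUnique")]
```

Here freqRatio is the most-to-second-most frequency ratio and percentUnique is the share of unique values as a percentage of the sample size.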

2.2 (b)

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Answer:

Nearly all predictors have missing values (only leaves is complete), and about half of them (19 of 35) have more than 75 missing values each. The predictors with the most missing values are hail, sever, seed.tmt, and lodging, with 121 each.
The missing data are strongly related to the classes. Only 5 classes have missing values: phytophthora-rot, 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury.

gg_miss_var(Soybean)

Soybean %>%
  gather(key = 'Predictor', value = 'Value', - Class) %>%
  group_by(Class) %>%
  summarise(NA_Cnt = sum(is.na(Value))) %>%
  ggplot(aes(x=reorder(Class, NA_Cnt), y=NA_Cnt))+
  geom_bar(stat='identity')+
  coord_flip()+
  theme_hc()+
  ggtitle('Soybean: Missing Value Count by Class')+
  ylab('NA Count')+
  xlab('Class')
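
The same two questions can be answered in tabular form. The sketch below uses base R only; na_share and class_summary are illustrative names, and a row counts as incomplete if it has at least one missing value.

```r
library(mlbench)
data(Soybean)

# Share of missing values per predictor, largest first
na_share <- sort(colMeans(is.na(Soybean[, names(Soybean) != "Class"])),
                 decreasing = TRUE)
round(head(na_share, 5), 3)

# Rows with at least one missing value, counted per class
incomplete <- !complete.cases(Soybean)
class_summary <- data.frame(
  rows       = as.vector(table(Soybean$Class)),
  incomplete = as.vector(tapply(incomplete, Soybean$Class, sum)),
  row.names  = levels(Soybean$Class)
)
class_summary[class_summary$incomplete > 0, ]
```

Comparing the incomplete column with the rows column shows whether a class is missing values in all of its samples or only in some.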

2.3 (c)

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Answer:

1. Remove the predictors with near-zero variance: leaf.mild, mycelium, and sclerotia.
2. Impute the remaining missing values with KNN, or alternatively with missForest.

Soybean %>%
  select(-leaf.mild, -mycelium, -sclerotia) %>%
  DMwR::knnImputation(k=5) %>%
  gg_miss_var()

Soybean %>%
  select(-leaf.mild, -mycelium, -sclerotia) %>%
  missForest() %>%
  .$ximp %>%
  gg_miss_var()

##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!
##   missForest iteration 6 in progress...done!

DATA 624 - HOMEWORK 4
