DATA624 - Data Preprocessing

3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

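library(mlbench)    # Glass and Soybean data sets
library(tidyverse)  # dplyr, tidyr, ggplot2
library(caret)      # nearZeroVar()
library(mice)       # imputation

data(Glass)

predictor_vars <- Glass %>%
  select(-Type) %>%
  gather(key = 'predictor_variable', value = 'value')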

# Plot a density-scaled histogram with an overlaid density curve for each predictor.
# after_stat(density) replaces the deprecated ..density.. notation.
ggplot(predictor_vars) +
  geom_histogram(aes(x = value, y = after_stat(density)), bins = 30, fill = 'blue') +
  labs(title = 'Distributions of Predictor Variables') +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_density(aes(x = value), color = 'red') +
  facet_wrap(~ predictor_variable, scales = 'free', ncol = 3)

# Bar chart of class frequencies for the response variable, Type.
ggplot(Glass, aes(x = Type)) +
  geom_bar(fill = 'blue') +
  labs(title = 'Distribution of Glass Types') +
  theme(plot.title = element_text(hjust = 0.5))
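
The histograms above cover the individual distributions; the relationships between predictors can be examined with pairwise correlations. The following is a minimal sketch, assuming Pearson correlation is an acceptable summary. The names `cor_mat` and `cor_pairs` are introduced here for illustration only.

```r
library(mlbench)

data(Glass)

# Keep the nine numeric predictors and drop the class label.
predictors <- Glass[, names(Glass) != "Type"]

# Pearson correlation matrix of the predictors.
cor_mat <- cor(predictors)

# List each unordered pair once (upper triangle), strongest relationships first.
upper <- which(upper.tri(cor_mat), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_mat)[upper[, "row"]],
  var2 = colnames(cor_mat)[upper[, "col"]],
  correlation = cor_mat[upper]
)
cor_pairs <- cor_pairs[order(-abs(cor_pairs$correlation)), ]

head(cor_pairs, 5)
```

Sorting by absolute value keeps strong negative relationships near the top alongside strong positive ones.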

Do there appear to be any outliers in the data? Are any predictors skewed?

Ba, Fe, and K are heavily right-skewed. Their values are bounded below at 0, and a small number of large values on the right side of each distribution act as outliers.
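
The visual impression of skewness can be checked numerically. The sketch below computes a simple moment-based sample skewness for each predictor; the helper `sample_skewness` is a hypothetical function written for this illustration, not part of any package used in the report.

```r
library(mlbench)

data(Glass)

# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 3/2.
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

predictors <- Glass[, names(Glass) != "Type"]

# Positive values indicate a long right tail, negative values a long left tail.
skew <- sort(sapply(predictors, sample_skewness), decreasing = TRUE)
round(skew, 2)
```

Values far from zero point to the predictors that are the strongest candidates for a transformation.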

Are there any relevant transformations of one or more predictors that might improve the classification model?

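The distributions of Ba, Fe, K, Mg, RI, and Si are not symmetric. Centering, scaling, and applying a Box-Cox transformation would benefit a model using these variables. Because Box-Cox requires strictly positive values, predictors that contain zeros (such as Ba and Fe) would need an alternative such as the Yeo-Johnson transformation.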
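
One way to apply these transformations is caret's `preProcess`. This is a sketch under the assumption that all nine predictors are processed together; the object names `pp` and `glass_transformed` are illustrative.

```r
library(mlbench)
library(caret)

data(Glass)

predictors <- Glass[, names(Glass) != "Type"]

# Estimate the transformations from the data.
# Box-Cox is only defined for strictly positive values, so predictors
# containing zeros (such as Ba or Fe) are centered and scaled but not
# Box-Cox transformed.
pp <- preProcess(predictors, method = c("BoxCox", "center", "scale"))

# Apply the estimated transformations to the predictors.
glass_transformed <- predict(pp, predictors)

summary(glass_transformed)
```

In a modeling workflow the `preProcess` object would be estimated on the training set only and then applied to any held-out data with `predict`.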

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

caret's nearZeroVar function flags the leaf.mild, mycelium, and sclerotia columns as degenerate (near-zero-variance) predictors.

data(Soybean)

summary(Soybean)

##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
##

nearZeroVar(Soybean)

## [1] 19 26 28

colnames(Soybean[,c(19,26,28)])

## [1] "leaf.mild" "mycelium"  "sclerotia"
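
To see why these three columns are flagged, `nearZeroVar` can return the metrics it uses instead of column indices. The sketch below relies on the `saveMetrics = TRUE` argument and the function's default cutoffs (a frequency ratio above 95/5 and fewer than 10% unique values); the names `nzv_metrics` and `flagged` are illustrative.

```r
library(mlbench)
library(caret)

data(Soybean)

# freqRatio: count of the most common value divided by the count of the second most common.
# percentUnique: number of distinct values as a percentage of the number of samples.
nzv_metrics <- nearZeroVar(Soybean, saveMetrics = TRUE)

# Keep only the predictors flagged as near-zero variance, most extreme first.
flagged <- nzv_metrics[nzv_metrics$nzv, ]
flagged[order(-flagged$freqRatio), c("freqRatio", "percentUnique")]
```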

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

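The plot below of missing counts by feature shows a stepwise pattern: groups of predictors share the same number of missing values. For example, hail, sever, seed.tmt, and lodging each have 121 missing values.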

# Count missing values per column and store each count next to its feature name.
na_counts <- sort(colSums(is.na(Soybean)), decreasing = TRUE)

soybean_missing_counts <- data.frame(
  Feature = names(na_counts),
  NA_Count = as.vector(na_counts)
)

ggplot(soybean_missing_counts, aes(x = NA_Count, y = reorder(Feature, NA_Count))) + 
  geom_bar(stat = 'identity', fill = 'blue') +
  labs(title = 'Soybean Missing Counts') +
  theme(plot.title = element_text(hjust = 0.5))
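
The question of whether missingness is related to the classes can be explored by counting incomplete rows per class. This is a minimal sketch using base R; `missing_by_class` is a name introduced for illustration.

```r
library(mlbench)

data(Soybean)

# TRUE for every row that has at least one missing predictor value.
has_na <- !complete.cases(Soybean)

# Per-class sample counts and counts of incomplete rows (both ordered by factor level).
missing_by_class <- data.frame(
  samples = as.vector(table(Soybean$Class)),
  rows_with_na = as.vector(tapply(has_na, Soybean$Class, sum)),
  row.names = levels(Soybean$Class)
)
missing_by_class$pct_rows_with_na <-
  round(100 * missing_by_class$rows_with_na / missing_by_class$samples, 1)

# Show only the classes that contain incomplete rows.
missing_by_class[missing_by_class$rows_with_na > 0, ]
```

If the incomplete rows are concentrated in a few classes, dropping them would remove most of the information about those classes, which is an argument for imputation over row deletion.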

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

I used the mice package, which invokes predictive mean matching, to impute the missing data. The summary below shows the distribution of each predictor after imputation.
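
The call that created the `imputed` object is not shown in this report. A minimal sketch of what it might look like follows, assuming predictive mean matching (`method = "pmm"`) for every incomplete column. The values for `m`, `maxit`, and `seed` are illustrative assumptions, not the settings actually used.

```r
library(mlbench)
library(mice)

data(Soybean)

# Assumed settings: 5 imputed data sets, 5 iterations, and an arbitrary seed
# for reproducibility. printFlag = FALSE suppresses the iteration log.
imputed <- mice(
  Soybean,
  method = "pmm",
  m = 5,
  maxit = 5,
  seed = 624,
  printFlag = FALSE
)
```

Each of the `m` completed data sets can be extracted from the resulting object by its index.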

# Extract the first completed data set; mice:: avoids a clash with tidyr::complete.
impute_df <- mice::complete(imputed, 1)

summary(impute_df)

##                  Class     date    plant.stand precip  temp    hail   
##  brown-spot         : 92   0: 26   0:390       0: 96   0: 83   0:486  
##  alternarialeaf-spot: 91   1: 75   1:293       1:114   1:382   1:197  
##  frog-eye-leaf-spot : 91   2: 93               2:473   2:218          
##  phytophthora-rot   : 88   3:118                                      
##  anthracnose        : 44   4:131                                      
##  brown-stem-rot     : 44   5:149                                      
##  (Other)            :233   6: 91                                      
##  crop.hist area.dam sever   seed.tmt germ    plant.growth leaves  leaf.halo
##  0: 67     0:123    0:316   0:335    0:165   0:441        0: 77   0:276    
##  1:169     1:227    1:322   1:263    1:213   1:242        1:606   1: 36    
##  2:225     2:145    2: 45   2: 85    2:305                        2:371    
##  3:222     3:188                                                           
##                                                                            
##                                                                            
##                                                                            
##  leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem    lodging
##  0:372     0:121     0:575       0:575     0:643     0:312   0:549  
##  1: 21     1:327     1:108       1:108     1: 20     1:371   1:134  
##  2:290     2:235                           2: 20                    
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##  stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor
##  0:417        0:332         0:503           0:535     0:651    0:615       
##  1: 39        1: 87         1:180           1:135     1: 32    1: 44       
##  2: 36        2:177                         2: 13              2: 24       
##  3:191        3: 87                                                        
##                                                                            
##                                                                            
##                                                                            
##  sclerotia fruit.pods fruit.spots seed    mold.growth seed.discolor seed.size
##  0:625     0:423      0:373       0:476   0:544       0:589         0:624    
##  1: 58     1:130      1: 81       1:207   1:139       1: 94         1: 59    
##            2: 14      2: 72                                                  
##            3:116      4:157                                                  
##                                                                              
##                                                                              
##                                                                              
##  shriveling roots  
##  0:627      0:582  
##  1: 56      1: 86  
##             2: 15  
##                    
##                    
##                    
##

Diego Correa

10/03/2021
