library(mlbench)
library(tidyr)
library(dplyr)
library(ggplot2)
library(corrplot)
library(moments)
library(reshape2)
library(mice)

Data Preprocessing/Overfitting

3.1

data(Glass)

a.

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Distribution

To understand the distributions of the predictor variables in this data set, I use histograms to show the frequencies of each. The shapes differ: some predictors are roughly normally distributed while others are clearly not. Silica (Si), soda (Na) and lime (Ca) are the predictors present at the highest concentrations.

lglass <- Glass %>% pivot_longer(-Type, names_to = "Predictor", values_to = "Value", values_drop_na = TRUE) %>% mutate(Predictor = as.factor(Predictor))

lglass %>% ggplot(aes(Value, color = Predictor, fill = Predictor)) + geom_histogram(bins = 20) + facet_wrap(~ Predictor, ncol = 3, scales = "free") 

Relationship Between Predictors

Most of the stronger correlations are negative: RI and Si (-0.54), Mg and Ba (-0.49), and Mg and Al (-0.48). The two notable positive correlations are RI and Ca (0.81), the strongest relationship in the data, and Al and Ba (0.48). The correlation plot below shows the pairwise relationships between all predictors.

corrplot(cor(Glass[,1:9]), method='circle')
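
The plot gives a visual overview; to read off the strongest pairs directly, the correlation matrix can also be flattened into a ranked list. The following is a minimal sketch in base R. The helper name `top_cor_pairs` and the toy data frame are assumptions made for illustration and are not part of the original analysis.

```r
# Rank predictor pairs by absolute correlation (base R only).
# `top_cor_pairs` is a hypothetical helper, not part of any package.
top_cor_pairs <- function(df, n = 5) {
  cm <- cor(df)
  # Upper triangle only: each pair appears once and the diagonal is skipped
  idx <- which(upper.tri(cm), arr.ind = TRUE)
  pairs <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  pairs <- pairs[order(-abs(pairs$r)), ]
  rownames(pairs) <- NULL
  head(pairs, n)
}

# Illustrative values only (these are not the Glass measurements)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2),  # rises with x
  z = c(6, 5, 3, 4, 2, 1)                # falls as x rises
)
top <- top_cor_pairs(toy, n = 3)
print(top)

# With the data from this section: top_cor_pairs(Glass[, 1:9])
```

Sorting by absolute value keeps strong negative pairs next to strong positive ones, which matches how the correlations are discussed above.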

b.

Do there appear to be any outliers in the data? Are any predictors skewed?

To measure skewness I use the moments library. K, Ba, and Ca are all highly right skewed (skewness above 2), and Fe and RI are also right skewed. Mg and Si are left skewed.

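# Sample skewness of each predictor (column 10 is Type, so it is dropped)
skewValues <- apply(Glass[ , -10], 2, skewness)
# Max/min ratio; 0.1 is added to the minimum because Mg, K, Ba and Fe contain zeros, so a ratio can fall below 1 (RI)
hiloRatios <- apply(Glass[ , -10], 2, function(x) max(x) / min(x + 0.1))
cbind(Skew = skewValues, Hilo = hiloRatios)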
##          Skew       Hilo
## RI  1.6140150  0.9520715
## Na  0.4509917  1.6048015
## Mg -1.1444648 44.9000000
## Al  0.9009179  8.9743590
## Si -0.7253173  1.0786726
## K   6.5056358 62.1000000
## Ca  2.0326774  2.9276673
## Ba  3.3924309 31.5000000
## Fe  1.7420068  5.1000000
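
For readers who want to see what the skewness statistic measures, here is a minimal base R sketch. It assumes the moment-based definition (third central moment divided by the second central moment raised to the power 1.5, both computed with a divisor of n), which is my understanding of what `moments::skewness` computes. The function name `sample_skewness` and the example vectors are hypothetical.

```r
# Moment-based sample skewness, assuming the divide-by-n definition:
#   skew = m3 / m2^(3/2), with m_k = mean((x - mean(x))^k)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  m2 <- mean(d^2)
  m3 <- mean(d^3)
  m3 / m2^1.5
}

# Illustrative vectors only (not taken from Glass)
sym   <- c(1, 2, 3, 4, 5)          # symmetric, so skewness is 0
right <- c(0, 0, 0, 0, 1, 10)      # long right tail, so skewness is positive
left  <- -right                    # mirror image, so skewness is negative

skews <- c(sym = sample_skewness(sym),
           right = sample_skewness(right),
           left = sample_skewness(left))
print(skews)
```

A positive value means the long tail is on the right, as with K and Ba above, and a negative value means the long tail is on the left, as with Mg.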

From the visualization below, there appear to be outliers, such as K in type 5 glass, Ba in type 2 glass and Ca in type 2 glass.

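# ylim(0, 20) zooms in on the low-concentration predictors; Si values lie above this range and are dropped from the plot
lglass %>% ggplot(aes(x = Type, y = Value, color = Predictor)) + geom_jitter() + ylim(0, 20)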

c.

Are there any relevant transformations of one or more predictors that might improve the classification model?

Removing skew also reduces the influence of outliers, which can improve a model’s performance. A Box-Cox transformation would help the skewed predictors. Because Box-Cox requires strictly positive values, predictors that contain zeros (Mg, K, Ba and Fe) would first need a small offset, or a Yeo-Johnson transformation could be used instead. It would also be important to normalize the predictors by centering and scaling them.
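
The effect of such a transformation can be sketched in base R. This is an illustration under stated assumptions, not the analysis itself: `box_cox` and `skew` are hypothetical helpers, the input vector is made up, and lambda is fixed at 0 (a log transform) rather than estimated from the data as it normally would be.

```r
# Box-Cox transform for a fixed lambda (assumption: lambda is supplied, not estimated).
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))  # Box-Cox is only defined for strictly positive values
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Moment-based skewness, used only to compare before and after
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Illustrative right-skewed values (not taken from Glass)
x <- c(0.5, 0.6, 0.7, 0.8, 1, 1.2, 2, 6)

x_bc  <- box_cox(x, lambda = 0)    # lambda = 0 is the log transform
x_std <- as.numeric(scale(x_bc))   # center to mean 0 and scale to sd 1

skew_before <- skew(x)
skew_after  <- skew(x_bc)
print(c(before = skew_before, after = skew_after))
```

If the caret package is available, `preProcess()` with `method = c("YeoJohnson", "center", "scale")` should perform the estimation, transformation, centering and scaling in one step, though that call is not part of the original code.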

3.2

data(Soybean)

a.

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

summary(Soybean)
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
## 

Several predictors have very little variance, with most observations falling in a single category. For example, about 94% of the observations for the mycelium variable fall into the same category. Missing values are scattered throughout the dataset, affecting almost all variables to differing degrees. Missing values make up approx. 18% of the observations for several variables, such as hail, sever, seed.tmt, and lodging. The predictors with low frequencies of non-zero values are: lodging, mycelium, sclerotia, mold.growth, shriveling, leaf.mild, seed.discolor, seed.size and leaf.malf. These may cause issues when using models like linear regression.
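
One way to make "degenerate" concrete is the ratio of the most frequent category to the second most frequent one. The sketch below applies it to counts copied from the summary output above. The helper name `freq_ratio` is hypothetical, and the cutoff of 20 is a rule of thumb assumed here for illustration.

```r
# Category counts copied from the summary(Soybean) output (NA's excluded)
counts <- list(
  mycelium  = c("0" = 639, "1" = 6),
  sclerotia = c("0" = 625, "1" = 20),
  leaf.mild = c("0" = 535, "1" = 20, "2" = 20),
  lodging   = c("0" = 520, "1" = 42),
  leaves    = c("0" = 77,  "1" = 606)
)

# Ratio of the most common category to the second most common one
freq_ratio <- function(n) {
  s <- sort(n, decreasing = TRUE)
  unname(s[1] / s[2])
}

ratios <- sapply(counts, freq_ratio)
print(round(ratios, 2))

# Assumed rule of thumb: a ratio above about 20 suggests a near-zero-variance predictor
flagged <- names(ratios)[ratios > 20]
print(flagged)

# On the data itself the counts would come from, e.g., table(Soybean$mycelium)
```

With these counts mycelium, sclerotia and leaf.mild exceed the cutoff, while lodging and leaves do not.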

sb_freq <- Soybean
head(Soybean[,2:length(sb_freq)])
##   date plant.stand precip temp hail crop.hist area.dam sever seed.tmt germ
## 1    6           0      2    1    0         1        1     1        0    0
## 2    4           0      2    1    0         2        0     2        1    1
## 3    3           0      2    1    0         1        0     2        1    2
## 4    3           0      2    1    0         1        0     2        0    1
## 5    6           0      2    1    0         2        0     1        0    2
## 6    5           0      2    1    0         3        0     1        0    1
##   plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf
## 1            1      1         0         2         2           0         0
## 2            1      1         0         2         2           0         0
## 3            1      1         0         2         2           0         0
## 4            1      1         0         2         2           0         0
## 5            1      1         0         2         2           0         0
## 6            1      1         0         2         2           0         0
##   leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 1         0    1       1            3             1               1         1
## 2         0    1       0            3             1               1         1
## 3         0    1       0            3             0               1         1
## 4         0    1       0            3             0               1         1
## 5         0    1       0            3             1               1         1
## 6         0    1       0            3             0               1         1
##   mycelium int.discolor sclerotia fruit.pods fruit.spots seed mold.growth
## 1        0            0         0          0           4    0           0
## 2        0            0         0          0           4    0           0
## 3        0            0         0          0           4    0           0
## 4        0            0         0          0           4    0           0
## 5        0            0         0          0           4    0           0
## 6        0            0         0          0           4    0           0
##   seed.discolor seed.size shriveling roots
## 1             0         0          0     0
## 2             0         0          0     0
## 3             0         0          0     0
## 4             0         0          0     0
## 5             0         0          0     0
## 6             0         0          0     0
sb_freq[, 2:length(sb_freq)] <- lapply(sb_freq[, 2:length(sb_freq)], function(x) as.numeric(as.character(x)))

ggplot(data=melt(sb_freq), mapping=aes(x = value)) +
geom_bar() +
facet_wrap(~variable, scales = 'free_x')

b.

Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Missing values cluster in groups of related predictors. The leaf predictors leaf.halo, leaf.marg, leaf.size and leaf.malf each have 84 NA’s, as does fruit.pods; leaf.shread has 100 and leaf.mild has 108. Seed, mold.growth and seed.size have 92, while seed.discolor, shriveling, fruit.spots and fruiting.bodies have 106. Stem.cankers and canker.lesion both have 38, as do ext.decay, mycelium, int.discolor, sclerotia and precip. Hail, sever, seed.tmt and lodging have the most at 121 each, followed by germ at 112.

Soybean$na_count <- apply(Soybean, 1, function(x) sum(is.na(x)))
colSums(is.na(Soybean))
##           Class            date     plant.stand          precip            temp 
##               0               1              36              38              30 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##             121              16               1             121             121 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##             112              16               0              84              84 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
##              84             100              84             108              16 
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##             121              38              38             106              38 
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
##              38              38              38              84             106 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##              92              92             106              92             106 
##           roots        na_count 
##              31               0
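
The raw counts are easier to compare as percentages of the 683 observations. A small sketch, using a subset of the counts printed above (the variable names `na_counts` and `pct_missing` are mine):

```r
n_rows <- 683  # number of observations in Soybean, as reported by str() in this document

# A subset of the NA counts from the colSums(is.na(Soybean)) output
na_counts <- c(hail = 121, sever = 121, seed.tmt = 121, lodging = 121,
               germ = 112, leaf.mild = 108, fruit.spots = 106,
               leaf.shread = 100, seed = 92, leaf.halo = 84,
               precip = 38, date = 1)

# Percentage of observations missing for each predictor
pct_missing <- round(100 * na_counts / n_rows, 1)
print(sort(pct_missing, decreasing = TRUE))

# On the data frame itself: round(100 * colMeans(is.na(Soybean)), 1)
```

The four predictors with 121 NA’s are each missing in about 17.7% of the observations, which is where the "roughly 18%" figure comes from.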

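The missing values are confined to five classes: phytophthora-rot, 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight and herbicide-injury. The other 14 classes have no missing values at all, so the pattern of missing data is strongly related to the classes.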

(Soybean %>% select(Class, na_count) %>% group_by(Class) %>% summarise(na_count = sum(na_count)) %>% arrange(desc(na_count)))
## # A tibble: 19 x 2
##    Class                       na_count
##    <fct>                          <int>
##  1 phytophthora-rot                1214
##  2 2-4-d-injury                     450
##  3 cyst-nematode                    336
##  4 diaporthe-pod-&-stem-blight      177
##  5 herbicide-injury                 160
##  6 alternarialeaf-spot                0
##  7 anthracnose                        0
##  8 bacterial-blight                   0
##  9 bacterial-pustule                  0
## 10 brown-spot                         0
## 11 brown-stem-rot                     0
## 12 charcoal-rot                       0
## 13 diaporthe-stem-canker              0
## 14 downy-mildew                       0
## 15 frog-eye-leaf-spot                 0
## 16 phyllosticta-leaf-spot             0
## 17 powdery-mildew                     0
## 18 purple-seed-stain                  0
## 19 rhizoctonia-root-rot               0
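
A summed NA count grows with class size, so it can help to also look at the share of incomplete rows within each class. A minimal base R sketch on a made-up data frame (the `toy` data and its column names are assumptions for illustration):

```r
# Illustrative data only: three classes, two predictors with some NAs
toy <- data.frame(
  Class = factor(c("a", "a", "a", "b", "b", "c")),
  p1    = c(1, NA, 0, 1, 1, NA),
  p2    = c(0, NA, 1, 0, 1, 1)
)

# Number of missing predictor values in each row (Class itself is excluded)
na_per_row <- rowSums(is.na(toy[, -1]))

# Total NAs per class, comparable to the summarise() table above
total_na <- tapply(na_per_row, toy$Class, sum)

# Share of rows in each class that have at least one NA
share_incomplete <- tapply(na_per_row > 0, toy$Class, mean)

print(total_na)
print(round(share_incomplete, 2))
```

On the Soybean data the same two calls would use `Soybean$Class` and the predictor columns, and would show whether the five affected classes are missing values in every row or only in some.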

c. 

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Leaf.marg, leaf.halo and leaf.size are strongly correlated, so removing some of them would make the predictor set more efficient. Fruit.spots has many NA’s, and removing it may also be a good strategy. For imputation, it would be best to use k-nearest neighbors to model the NA’s. Inspecting the distribution of the target Class variable and the predictor distributions before and after will help to determine how much bias dropping the missing data has introduced.

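# Drop na_count (37) and the sparse predictors: shriveling, seed.size, seed.discolor, sclerotia, mycelium, lodging, leaf.mild, leaf.malf
sb <- Soybean[, -c(37, 35, 34, 33, 28, 26, 21, 19, 18)]
# Convert the factor codes to numbers so that correlations can be computed
sb[, 2:length(sb)] <- lapply(sb[, 2:length(sb)], function(x) as.numeric(as.character(x)))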
str(sb[,-1])
## 'data.frame':    683 obs. of  27 variables:
##  $ date           : num  6 4 3 3 6 5 5 4 6 4 ...
##  $ plant.stand    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ precip         : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ temp           : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ hail           : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ crop.hist      : num  1 2 1 1 2 3 2 1 3 2 ...
##  $ area.dam       : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ sever          : num  1 2 2 2 1 1 1 1 1 2 ...
##  $ seed.tmt       : num  0 1 1 0 0 0 1 0 1 0 ...
##  $ germ           : num  0 1 2 1 2 1 0 2 1 2 ...
##  $ plant.growth   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ leaves         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.halo      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ leaf.marg      : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.size      : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.shread    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ stem           : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ stem.cankers   : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ canker.lesion  : num  1 1 0 0 1 0 1 1 1 1 ...
##  $ fruiting.bodies: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ ext.decay      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ fruit.pods     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ fruit.spots    : num  4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ mold.growth    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ roots          : num  0 0 0 0 0 0 0 0 0 0 ...
summary(sb[,-1])
##       date        plant.stand         precip           temp      
##  Min.   :0.000   Min.   :0.0000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :4.000   Median :0.0000   Median :2.000   Median :1.000  
##  Mean   :3.554   Mean   :0.4529   Mean   :1.597   Mean   :1.182  
##  3rd Qu.:5.000   3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :6.000   Max.   :1.0000   Max.   :2.000   Max.   :2.000  
##  NA's   :1       NA's   :36       NA's   :38      NA's   :30     
##       hail         crop.hist        area.dam         sever       
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.000   Median :2.000   Median :1.000   Median :1.0000  
##  Mean   :0.226   Mean   :1.885   Mean   :1.581   Mean   :0.7331  
##  3rd Qu.:0.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :1.000   Max.   :3.000   Max.   :3.000   Max.   :2.0000  
##  NA's   :121     NA's   :16      NA's   :1       NA's   :121     
##     seed.tmt           germ        plant.growth        leaves      
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:1.0000  
##  Median :0.0000   Median :1.000   Median :0.0000   Median :1.0000  
##  Mean   :0.5196   Mean   :1.049   Mean   :0.3388   Mean   :0.8873  
##  3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :2.0000   Max.   :2.000   Max.   :1.0000   Max.   :1.0000  
##  NA's   :121      NA's   :112     NA's   :16                       
##    leaf.halo       leaf.marg       leaf.size      leaf.shread    
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :2.000   Median :0.000   Median :1.000   Median :0.0000  
##  Mean   :1.202   Mean   :0.773   Mean   :1.284   Mean   :0.1647  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000  
##  Max.   :2.000   Max.   :2.000   Max.   :2.000   Max.   :1.0000  
##  NA's   :84      NA's   :84      NA's   :84      NA's   :100     
##       stem         stem.cankers  canker.lesion    fruiting.bodies 
##  Min.   :0.0000   Min.   :0.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :0.00   Median :1.0000   Median :0.0000  
##  Mean   :0.5562   Mean   :1.06   Mean   :0.9798   Mean   :0.1802  
##  3rd Qu.:1.0000   3rd Qu.:3.00   3rd Qu.:2.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :3.00   Max.   :3.0000   Max.   :1.0000  
##  NA's   :16       NA's   :38     NA's   :38       NA's   :106     
##    ext.decay       int.discolor      fruit.pods      fruit.spots   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.000  
##  Mean   :0.2496   Mean   :0.1302   Mean   :0.5042   Mean   :1.021  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:2.000  
##  Max.   :2.0000   Max.   :2.0000   Max.   :3.0000   Max.   :4.000  
##  NA's   :38       NA's   :38       NA's   :84       NA's   :106    
##       seed         mold.growth         roots       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.1946   Mean   :0.1134   Mean   :0.1779  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :2.0000  
##  NA's   :92       NA's   :92       NA's   :31
correlated_sb <- cor(sb[,-1], use="pairwise.complete.obs")
corrplot(correlated_sb, method = "circle", order = "hclust")
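
The k-nearest-neighbors imputation suggested in the strategy above can be sketched in base R for categorical predictors. This is an illustration under stated assumptions rather than the method used in this analysis: the helper names `stat_mode` and `knn_impute_cat` are hypothetical, distance is taken as the share of mismatching categories over the columns a row has observed, neighbors are drawn only from complete rows, and the toy data are made up.

```r
# Most frequent value of a vector (first in table order on ties)
stat_mode <- function(v) {
  tab <- table(v)
  names(tab)[which.max(tab)]
}

# kNN imputation for categorical columns (hypothetical helper, base R only).
# Each incomplete row borrows the most common value among its k nearest complete rows.
knn_impute_cat <- function(df, k = 3) {
  m <- as.matrix(data.frame(lapply(df, as.character), stringsAsFactors = FALSE))
  donors <- m[complete.cases(m), , drop = FALSE]
  stopifnot(nrow(donors) >= k)
  for (i in which(!complete.cases(m))) {
    obs <- !is.na(m[i, ])
    if (!any(obs)) next  # no observed values to match on; leave the row as is
    target <- matrix(m[i, obs], nrow(donors), sum(obs), byrow = TRUE)
    # Distance: proportion of observed columns where the categories differ
    d <- rowMeans(donors[, obs, drop = FALSE] != target)
    nn <- order(d)[seq_len(k)]
    for (j in which(!obs)) m[i, j] <- stat_mode(donors[nn, j])
  }
  as.data.frame(m, stringsAsFactors = TRUE)
}

# Illustrative data only: row 7 is missing `a` but matches rows 4 to 6 on `b` and `c`
toy <- data.frame(
  a = factor(c("0", "0", "0", "1", "1", "1", NA)),
  b = factor(c("0", "0", "0", "1", "1", "1", "1")),
  c = factor(c("2", "2", "2", "0", "0", "0", "0"))
)
imputed <- knn_impute_cat(toy, k = 3)
print(imputed)
```

Because the missing values here are concentrated in five classes, neighbors drawn from complete rows will mostly come from other classes, so imputed values should be checked against the class distributions before they are trusted.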