3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification.

3.1a Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

There are 10 variables altogether: 9 numeric predictors and one factor (the target, a six-level variable).

##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

The heatmap shows some multicollinearity. However, the RI/Ca and Mg/Type pairs are the only ones with a correlation above 0.6 or below -0.6. Since Type is the target, RI/Ca is the only strongly correlated pair of predictors.

##   col1 col2 correlation
## 1   RI   Ca   0.8104027
## 2   Mg Type  -0.7281595
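
A table like the one above can be produced by filtering the upper triangle of a correlation matrix. The following is a minimal base R sketch, not the code used for this report; `high_cor_pairs` and the toy data frame are hypothetical, and the toy values are made up rather than taken from the Glass data.

```r
# Hypothetical helper: list variable pairs whose absolute correlation exceeds a cutoff.
high_cor_pairs <- function(df, cutoff = 0.6) {
  cm <- cor(df)
  # Keep each pair once by looking only above the diagonal
  idx <- which(upper.tri(cm) & abs(cm) > cutoff, arr.ind = TRUE)
  data.frame(
    col1 = rownames(cm)[idx[, 1]],
    col2 = colnames(cm)[idx[, 2]],
    correlation = cm[idx],
    stringsAsFactors = FALSE
  )
}

# Made-up numeric data: b tracks a closely, c does not
a <- 1:10
c_alt <- rep(c(1, -1), 5)
toy <- data.frame(a = a, b = a + c_alt, c = c_alt)

pairs_found <- high_cor_pairs(toy)
print(pairs_found)
```

Applied to the real data, the data frame passed in would hold the numeric columns (with Type converted to numeric if it is to appear in the table, as it does above).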

3.1b Do there appear to be any outliers in the data? Are any predictors skewed?

Boxplots show possible outliers for K, Fe, Na and possibly others. Without knowing more about the data, we cannot tell whether they are genuine anomalies or data errors. Distributions vary: many are only somewhat skewed, while others (K, Ba, Fe, Mg) are highly skewed or bimodal.
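
The visual impressions of skew and outliers can be backed by two simple numbers per predictor. This is a minimal base R sketch under stated assumptions: `skewness` and `iqr_outliers` are hypothetical helpers, the 1.5 x IQR fence is close to (but not identical with) the whisker rule used by boxplots, and the sample values are made up.

```r
# Sample skewness (third standardized moment); positive values mean a long right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Flag values lying more than 1.5 * IQR beyond the first or third quartile.
iqr_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Made-up right-skewed values with one extreme point (not the Glass data)
v <- c(0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 0.6, 6.2)
skew_v <- skewness(v)
flagged <- which(iqr_outliers(v))
print(skew_v)
print(flagged)
```

Running both helpers over every numeric column (for example with `sapply`) would give a compact table to set beside the boxplots and histograms.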

3.1c Are there any relevant transformations of one or more predictors that might improve the classification model?

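We can run a multinomial regression to get a baseline. The AIC is 399. (The optimizer stopped at its 100-iteration limit rather than converging, as it does for the later fits, so the AIC values are approximate.)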

## # weights:  66 (50 variable)
## initial  value 383.436526 
## iter  10 value 257.359885
## iter  20 value 181.634208
## iter  30 value 161.554088
## iter  40 value 157.912577
## iter  50 value 154.889493
## iter  60 value 153.706333
## iter  70 value 153.334999
## iter  80 value 152.219340
## iter  90 value 149.994098
## iter 100 value 149.743843
## final  value 149.743843 
## stopped after 100 iterations
## Call:
## multinom(formula = Type ~ ., data = dfGlass)
## 
## Coefficients:
##   (Intercept)        RI         Na          Mg         Al         Si
## 2   114.01139 210.99092 -3.5715880 -6.14888398 -0.0777839 -4.4509190
## 3    46.69565 -61.97027  1.6471464 -0.01788714  2.5121161  0.2207149
## 4    19.54782  14.22700 -0.4893655 -3.69586811 10.1611011 -0.5204113
## 5   -14.59763 -21.52840 10.7663636 -7.48120815 34.9748591 -0.9212133
## 6   -33.83528  22.99089  2.4341715 -5.00880431  6.2849258 -0.1495441
##               K         Ca          Ba           Fe
## 2   -3.70543961 -4.6895169   -5.757871    2.2610525
## 3   -0.67459086  0.6082768   -2.208131    1.5301451
## 4    0.62817476 -0.4292740   -3.450644   -0.6424633
## 5 -197.82120395 -4.7069924 -149.906448 -407.9088594
## 6   -0.06454676 -2.2076868   -2.475847  -15.9357312
## 
## Residual Deviance: 299.4877 
## AIC: 399.4877
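
The AIC in this output can be reproduced by hand, which makes clear what is being compared between models: it is the residual deviance plus twice the number of free parameters, and the residual deviance is twice the final value of the fitting criterion. The sketch below only redoes that arithmetic with numbers copied from the output above.

```r
# Values copied from the baseline output above
final_value       <- 149.743843   # final negative log-likelihood
residual_deviance <- 299.4877
n_params          <- 50           # "(50 variable)" in the weights line

deviance_check <- 2 * final_value
aic <- residual_deviance + 2 * n_params

print(deviance_check)
print(aic)
```

Because the penalty term is fixed at 100 for any model with the same 50 parameters, differences in AIC between such models come entirely from differences in deviance.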

We can try to normalize the skewed distributions using Box-Cox transformations. By trial and error, we find that transforming three predictors (Al, Ba and Ca) reduces the AIC to 377. It is worth noting that when all four of the most skewed variables are transformed, the AIC increases to 401. Thus, Box-Cox transformations are no guarantee of improved fit. Shown below are the histograms for Al before and after the Box-Cox transformation.

## # weights:  66 (50 variable)
## initial  value 383.436526 
## iter  10 value 249.999082
## iter  20 value 172.203767
## iter  30 value 160.380752
## iter  40 value 155.735322
## iter  50 value 154.075494
## iter  60 value 152.313863
## iter  70 value 150.444607
## iter  80 value 146.505688
## iter  90 value 140.284776
## iter 100 value 138.367546
## final  value 138.367546 
## stopped after 100 iterations
## Call:
## multinom(formula = Type ~ ., data = Z4)
## 
## Coefficients:
##   (Intercept)         RI          Na         Mg         Al         Si
## 2  158.726993  273.79219 -2.05786771 -4.4134752  0.5718709 -2.8728396
## 3   30.926243 -103.79605  1.86098673  0.3082207  2.8679255  0.2212213
## 4    5.685205  -17.14786 -0.07425439 -4.6530156 17.5820560  1.0885226
## 5  -34.689627  -60.03965 25.96755965 -6.1535313 44.9179307  4.1985981
## 6 -173.657540 -201.47761  7.84847638 -2.2070841 12.5256247  4.8018289
##              K         Ca          Ba          Fe
## 2   -1.5781266 -389.47074   -8.296112    4.066924
## 3   -0.2016619   97.96148   -5.488342    1.925353
## 4    0.8232070  -58.36621    4.235923   -4.286303
## 5 -154.8908058  -33.14772 -245.894133 -336.987520
## 6    4.3438740   42.49031   27.396842  -16.006223
## 
## Residual Deviance: 276.7351 
## AIC: 376.7351
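
For reference, the Box-Cox transformation maps a strictly positive value x to (x^lambda - 1) / lambda, or to log(x) when lambda is 0, with lambda chosen to make the result as close to normal as possible. The sketch below is a base R illustration of that idea, not the code used for this report; the function names and the sample values are hypothetical. Box-Cox is undefined for zeros, and the summary shows a minimum of 0 for Mg, K, Ba and Fe, so those columns would need an offset or a different transform; how that was handled here is not stated.

```r
# Box-Cox transform for strictly positive x
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Profile log-likelihood of lambda, assuming the transformed values are normal
box_cox_loglik <- function(lambda, x) {
  y <- box_cox(x, lambda)
  n <- length(x)
  -n / 2 * log(mean((y - mean(y))^2)) + (lambda - 1) * sum(log(x))
}

# Pick the lambda that maximizes the profile log-likelihood on a search interval
estimate_lambda <- function(x, interval = c(-2, 2)) {
  optimize(box_cox_loglik, interval = interval, x = x, maximum = TRUE)$maximum
}

# Made-up right-skewed values, evenly spaced on the log scale
x <- exp(seq(-2, 2, length.out = 41))
lambda_hat <- estimate_lambda(x)
print(lambda_hat)
```

For values spaced evenly on the log scale, the estimate lands near 0, which corresponds to a plain log transform.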

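Removing possible outliers reduces the AIC to 374. (We should take care in removing these data, which might tell us something interesting about what happens under certain rare conditions.) Because this model is fit to fewer observations (the initial value in the output drops from 383.4 to 370.9), its AIC is not strictly comparable with the two above.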

## # weights:  66 (50 variable)
## initial  value 370.894210 
## iter  10 value 235.079001
## iter  20 value 167.439433
## iter  30 value 156.218471
## iter  40 value 152.232660
## iter  50 value 149.769480
## iter  60 value 148.873298
## iter  70 value 147.224736
## iter  80 value 144.687797
## iter  90 value 140.520123
## iter 100 value 136.799846
## final  value 136.799846 
## stopped after 100 iterations
## Call:
## multinom(formula = Type ~ ., data = dfGlass2)
## 
## Coefficients:
##   (Intercept)         RI        Na         Mg         Al         Si
## 2   135.87543  242.46029 -1.597713 -4.0578840  0.8773679 -2.4884756
## 3    26.60645 -121.54256  2.365625  0.4728967  2.7999264  0.5796289
## 4    60.53990   77.12660 -1.569426 -5.4510103 16.8372062 -0.9244504
## 5   -23.17418  -40.46829 23.312072 -2.9026957 35.1681966  2.8110950
## 6  -144.69916 -150.31477  6.686148 -1.8455719  8.0297401  3.0089003
##               K         Ca          Ba          Fe
## 2   -1.31623954 -347.63576  -12.082960    3.994494
## 3   -0.05915327   93.50323   -7.982407    3.162688
## 4    1.90186259  -72.26119  -14.314696  -24.136806
## 5 -119.86160340  -13.74760 -210.516845 -301.304544
## 6    6.72483448   82.67953   20.082394  -12.697464
## 
## Residual Deviance: 273.5997 
## AIC: 373.5997
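
The number of rows behind each fit can be recovered from the "initial value" line. Assuming fitting starts from equal probabilities for the six classes, the initial negative log-likelihood is n x log(6). The sketch below only redoes that arithmetic with numbers copied from the outputs above.

```r
n_classes <- 6
initial_full    <- 383.436526   # baseline and Box-Cox fits
initial_trimmed <- 370.894210   # fit after removing possible outliers

# Under the equal-probability assumption, initial value = n * log(number of classes)
n_full    <- initial_full / log(n_classes)
n_trimmed <- initial_trimmed / log(n_classes)

print(round(c(n_full, n_trimmed)))
```

The first figure matches the 214 observations reported for the data frame, which supports the assumption; the second implies that the trimmed data set has 207 rows.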

3.2 The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans.

3.2a Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

There are many predictors that are severely unbalanced. For example, in the case of mycelium, 0's outweigh 1's by a factor of more than 100. Other variables have balance issues as well, though not as severe: sclerotia is unbalanced, and there are possible degeneracy issues for mold.growth, seed.discolor, seed.size and shriveling as well.

##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
## 
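
One way to put a number on "unbalanced" is the frequency ratio: the count of the most common level divided by the count of the second most common. A common rule of thumb treats a ratio around 20 or more, combined with few distinct values, as a sign of a degenerate predictor. The sketch below is a base R illustration with a hypothetical helper `freq_ratio`; the level counts are copied from the summary above, with missing values left out.

```r
# Hypothetical helper: most common level count divided by second most common.
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)   # table() ignores NA by default
  if (length(counts) < 2) return(Inf)
  as.numeric(counts[1] / counts[2])
}

# Level counts copied from the summary above
predictors <- list(
  mycelium  = factor(rep(c("0", "1"), times = c(639, 6))),
  sclerotia = factor(rep(c("0", "1"), times = c(625, 20))),
  leaves    = factor(rep(c("0", "1"), times = c(77, 606)))
)

ratios <- sapply(predictors, freq_ratio)
print(round(ratios, 2))
```

Here mycelium and sclerotia are well above 20, while leaves is not.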

3.2b Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

There are 2337 missing values in the dataset. The data are missing in groups. For example, columns related to the leaf (e.g., leaf.halo and leaf.marg) are either all missing or all present for a given record. Many records are missing most of the groups.

## [1] 2337
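
The count above, and the observation that columns go missing together, can be checked with a few base R calls. This is a minimal sketch on a made-up miniature data frame; the column names mirror the soybean data but the values do not come from it.

```r
# Made-up miniature with block-wise missingness
toy <- data.frame(
  Class     = factor(c("a", "a", "b", "b", "c", "c")),
  leaf.halo = factor(c(0, 1, NA, NA, 2, 0)),
  leaf.marg = factor(c(1, 0, NA, NA, 2, 2)),
  seed      = factor(c(0, NA, NA, 1, 0, 1))
)

# Total number of missing cells, and the count per column
total_missing <- sum(is.na(toy))
missing_per_column <- colSums(is.na(toy))

# Do two columns go missing on exactly the same rows?
same_pattern <- identical(is.na(toy$leaf.halo), is.na(toy$leaf.marg))

print(total_missing)
print(missing_per_column)
print(same_pattern)
```

Comparing the missingness indicators of each pair of columns in this way is one route to identifying the groups described above.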

The presence of missing values in a record is strongly associated with its class, as the following multinomial regression shows.

## # weights:  57 (36 variable)
## initial  value 2011.051823 
## iter  10 value 1639.442368
## iter  20 value 1546.377778
## iter  30 value 1544.005030
## iter  40 value 1543.968246
## final  value 1543.967558 
## converged
## Call:
## multinom(formula = Class ~ missing, data = dfSoybean1)
## 
## Coefficients:
##                             (Intercept)    missing
## alternarialeaf-spot           25.298117 -47.562488
## anthracnose                   24.571466 -53.758070
## bacterial-blight              23.782982 -48.492613
## bacterial-pustule             23.782982 -48.493815
## brown-spot                    25.309049 -47.421545
## brown-stem-rot                24.571471 -53.744500
## charcoal-rot                  23.782980 -48.490159
## cyst-nematode                  2.004402  -2.137985
## diaporthe-pod-&-stem-blight    6.711386  -6.775926
## diaporthe-stem-canker         23.782982 -48.490155
## downy-mildew                  23.782985 -48.493578
## frog-eye-leaf-spot            25.298115 -47.546196
## herbicide-injury               9.642078 -10.335236
## phyllosticta-leaf-spot        23.782983 -48.494128
## phytophthora-rot              23.782893 -22.335979
## powdery-mildew                23.782981 -48.493892
## purple-seed-stain             23.782982 -48.494176
## rhizoctonia-root-rot          23.782986 -48.490298
## 
## Residual Deviance: 3087.935 
## AIC: 3159.935
##                              (Intercept)      missing
## alternarialeaf-spot         3.893784e-01 0.000000e+00
## anthracnose                 4.031456e-01 0.000000e+00
## bacterial-blight            4.184129e-01 0.000000e+00
## bacterial-pustule           4.184129e-01 0.000000e+00
## brown-spot                  3.891735e-01 0.000000e+00
## brown-stem-rot              4.031456e-01 0.000000e+00
## charcoal-rot                4.184129e-01 0.000000e+00
## cyst-nematode               0.000000e+00 0.000000e+00
## diaporthe-pod-&-stem-blight 5.230224e-06 4.249024e-06
## diaporthe-stem-canker       4.184129e-01 0.000000e+00
## downy-mildew                4.184128e-01 0.000000e+00
## frog-eye-leaf-spot          3.893785e-01 0.000000e+00
## herbicide-injury            9.670793e-01 9.647142e-01
## phyllosticta-leaf-spot      4.184129e-01 0.000000e+00
## phytophthora-rot            4.184131e-01 4.473017e-01
## powdery-mildew              4.184129e-01 0.000000e+00
## purple-seed-stain           4.184129e-01 0.000000e+00
## rhizoctonia-root-rot        4.184128e-01 0.000000e+00
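
A plain cross-tabulation is a simpler way to see the same relationship, and it avoids the very large coefficients that tend to appear when a flag is constant within a class. The sketch below swaps in that technique in place of the multinomial fit; the data are made up, and the name `missing` for the 0/1 flag is taken from the model formula above, though how the original flag was built is an assumption.

```r
# Made-up miniature; values are not from the soybean data
toy <- data.frame(
  Class     = factor(c("a", "a", "b", "b", "c", "c")),
  leaf.halo = factor(c(0, 1, NA, NA, 2, 0)),
  leaf.marg = factor(c(1, 0, NA, NA, 2, 2)),
  seed      = factor(c(0, NA, NA, 1, 0, 1))
)

# Flag records with at least one missing predictor (assumed definition of the flag)
predictors <- setdiff(names(toy), "Class")
toy$missing <- as.integer(!complete.cases(toy[, predictors]))

# Counts and share of incomplete records per class
class_by_missing <- table(toy$Class, toy$missing)
prop_incomplete <- tapply(toy$missing, toy$Class, mean)

print(class_by_missing)
print(prop_incomplete)
```

Classes whose share is exactly 0 or 1 are the ones for which missingness is fully determined by the class.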

3.2c Develop a strategy for handling missing data, either by eliminating predictors or imputation.

We cannot eliminate records with any missing values because there are too many, and because the values are missing not at random (MNAR). Some of the columns are missing roughly 18% of their values, but eliminating them risks losing valuable information. In some cases, degenerate columns overlap with columns that have many missing values (e.g., mold.growth); these might be removed because there is more than one reason to do so.

Otherwise we are looking at imputation. Prior to imputation, dummy variables should be created to flag the missing group(s) to which each record belongs. Then we should test individual columns to see whether they are MNAR; if not, we can use KNN or linear regression to estimate missing values. When the data are MNAR, there is no reason to think the missing values are predictable from the nonmissing ones; in this case it might be best to use a simple summary such as the median or mean (or, for categorical predictors like these, the mode).
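
The flag-then-impute step can be sketched in base R as follows. This is an illustration under stated assumptions, not a tested pipeline for the soybean data: `mode_value` and `impute_with_flag` are hypothetical helpers, the mode is used because the predictors are categorical, and the toy values are made up.

```r
# Most frequent non-missing level of a factor (first one wins on ties)
mode_value <- function(x) names(which.max(table(x)))

# For each column: add a 0/1 "<col>.missing" flag, then fill NAs with the mode
impute_with_flag <- function(df, cols) {
  for (col in cols) {
    miss <- is.na(df[[col]])
    df[[paste0(col, ".missing")]] <- as.integer(miss)
    df[[col]][miss] <- mode_value(df[[col]])
  }
  df
}

# Made-up miniature; values are not from the soybean data
toy <- data.frame(
  leaf.halo = factor(c("0", "2", NA, "2", NA)),
  seed      = factor(c("1", NA, "0", "0", "0"))
)

imputed <- impute_with_flag(toy, names(toy))
print(imputed)
```

Keeping the flag columns means a downstream model can still use the fact that a value was missing, which matters when missingness is related to the class.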