3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification.
3.1a Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
There are 10 variables altogether: 9 numeric predictors and one factor (the target, Type, a six-level variable).
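A minimal sketch of how the data might be loaded and summarized (the Glass data ship with the mlbench package; the data frame name dfGlass matches the model calls later in this section):

```r
library(mlbench)

data(Glass)          # 214 rows: 9 numeric predictors plus the 6-level factor Type
dfGlass <- Glass

summary(dfGlass)     # five-number summaries for the predictors, counts for Type
str(dfGlass)
```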
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe Type
## Min. :0.00000 1:70
## 1st Qu.:0.00000 2:76
## Median :0.00000 3:17
## Mean :0.05701 5:13
## 3rd Qu.:0.10000 6: 9
## Max. :0.51000 7:29
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
The correlation heatmap shows some multicollinearity; however, RI/Ca and Mg/Type are the only pairs with a correlation above 0.6 or below -0.6.
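One way the high-correlation pairs might be extracted; a sketch, not the exact code behind the heatmap, assuming Type is coerced to its numeric level codes so it can enter the correlation matrix:

```r
# Coerce Type to numeric codes so it can appear in the correlation matrix
corMat <- cor(data.frame(lapply(dfGlass, as.numeric)))

# Keep only the upper-triangle pairs with |correlation| > 0.6
corMat[lower.tri(corMat, diag = TRUE)] <- NA
highCor <- which(abs(corMat) > 0.6, arr.ind = TRUE)
data.frame(col1 = rownames(corMat)[highCor[, 1]],
           col2 = colnames(corMat)[highCor[, 2]],
           correlation = corMat[highCor])
```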
## col1 col2 correlation
## 1 RI Ca 0.8104027
## 2 Mg Type -0.7281595
3.1b Do there appear to be any outliers in the data? Are any predictors skewed?
Boxplots show possible outliers for K, Fe, Na and several other predictors. Without knowing more about the data we cannot tell whether these are genuine anomalies or data errors. The distributions vary: many are only somewhat skewed, while others (K, Ba, Fe, Mg) are highly skewed or bimodal.
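A sketch of the boxplots and a numeric skewness check, assuming the e1071 package for the skewness statistic:

```r
library(e1071)

# One boxplot per numeric predictor to flag potential outliers
par(mfrow = c(3, 3))
for (v in names(dfGlass)[1:9]) boxplot(dfGlass[[v]], main = v)

# Sample skewness of each predictor; large absolute values indicate heavy skew
sapply(dfGlass[, 1:9], skewness)
```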
3.1c Are there any relevant transformations of one or more predictors that might improve the classification model?
We can run a multinomial regression to get a baseline. The AIC is 399.
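The baseline fit, assuming nnet::multinom as in the Call shown below:

```r
library(nnet)

# Baseline multinomial logistic regression on the untransformed predictors
fitBase <- multinom(Type ~ ., data = dfGlass)
fitBase            # prints coefficients, residual deviance and AIC
```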
## # weights: 66 (50 variable)
## initial value 383.436526
## iter 10 value 257.359885
## iter 20 value 181.634208
## iter 30 value 161.554088
## iter 40 value 157.912577
## iter 50 value 154.889493
## iter 60 value 153.706333
## iter 70 value 153.334999
## iter 80 value 152.219340
## iter 90 value 149.994098
## iter 100 value 149.743843
## final value 149.743843
## stopped after 100 iterations
## Call:
## multinom(formula = Type ~ ., data = dfGlass)
##
## Coefficients:
## (Intercept) RI Na Mg Al Si
## 2 114.01139 210.99092 -3.5715880 -6.14888398 -0.0777839 -4.4509190
## 3 46.69565 -61.97027 1.6471464 -0.01788714 2.5121161 0.2207149
## 4 19.54782 14.22700 -0.4893655 -3.69586811 10.1611011 -0.5204113
## 5 -14.59763 -21.52840 10.7663636 -7.48120815 34.9748591 -0.9212133
## 6 -33.83528 22.99089 2.4341715 -5.00880431 6.2849258 -0.1495441
## K Ca Ba Fe
## 2 -3.70543961 -4.6895169 -5.757871 2.2610525
## 3 -0.67459086 0.6082768 -2.208131 1.5301451
## 4 0.62817476 -0.4292740 -3.450644 -0.6424633
## 5 -197.82120395 -4.7069924 -149.906448 -407.9088594
## 6 -0.06454676 -2.2076868 -2.475847 -15.9357312
##
## Residual Deviance: 299.4877
## AIC: 399.4877
We can try to normalize the skewed distributions using Box-Cox transformations. By trial and error, we find that transforming three predictors (Al, Ba and Ca) reduces the AIC to 377. Notably, when all four of the most heavily skewed variables are transformed instead, the AIC rises to 401, so Box-Cox transformations are no guarantee of improved fit. Shown below are histograms of Al before and after the Box-Cox transformation.
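A sketch of the transformation step, assuming caret::BoxCoxTrans and a small offset added to Ba so the transform is defined at its zero values; the exact procedure used to build Z4 is not shown:

```r
library(caret)

Z4    <- dfGlass
Z4$Ba <- Z4$Ba + 0.001   # offset so Box-Cox is defined where Ba == 0 (assumption)

# Box-Cox transform the three predictors found (by trial and error) to help
for (v in c("Al", "Ba", "Ca")) {
  bc      <- BoxCoxTrans(Z4[[v]])
  Z4[[v]] <- predict(bc, Z4[[v]])
}

# Histograms of Al before and after the transformation
par(mfrow = c(1, 2))
hist(dfGlass$Al, main = "Al (original)", xlab = "Al")
hist(Z4$Al,      main = "Al (Box-Cox)",  xlab = "Al")

fitBC <- multinom(Type ~ ., data = Z4)
fitBC
```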
## # weights: 66 (50 variable)
## initial value 383.436526
## iter 10 value 249.999082
## iter 20 value 172.203767
## iter 30 value 160.380752
## iter 40 value 155.735322
## iter 50 value 154.075494
## iter 60 value 152.313863
## iter 70 value 150.444607
## iter 80 value 146.505688
## iter 90 value 140.284776
## iter 100 value 138.367546
## final value 138.367546
## stopped after 100 iterations
## Call:
## multinom(formula = Type ~ ., data = Z4)
##
## Coefficients:
## (Intercept) RI Na Mg Al Si
## 2 158.726993 273.79219 -2.05786771 -4.4134752 0.5718709 -2.8728396
## 3 30.926243 -103.79605 1.86098673 0.3082207 2.8679255 0.2212213
## 4 5.685205 -17.14786 -0.07425439 -4.6530156 17.5820560 1.0885226
## 5 -34.689627 -60.03965 25.96755965 -6.1535313 44.9179307 4.1985981
## 6 -173.657540 -201.47761 7.84847638 -2.2070841 12.5256247 4.8018289
## K Ca Ba Fe
## 2 -1.5781266 -389.47074 -8.296112 4.066924
## 3 -0.2016619 97.96148 -5.488342 1.925353
## 4 0.8232070 -58.36621 4.235923 -4.286303
## 5 -154.8908058 -33.14772 -245.894133 -336.987520
## 6 4.3438740 42.49031 27.396842 -16.006223
##
## Residual Deviance: 276.7351
## AIC: 376.7351
Removing possible outliers reduces the AIC further, to 374. (We should take care in removing these observations: they might tell us something interesting about what happens when the data meet certain rare conditions.)
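One illustrative outlier rule (1.5 times the IQR on two of the heavily skewed predictors); the exact criterion used to build dfGlass2 is not shown:

```r
# Flag values outside 1.5 * IQR of the quartiles (illustrative rule)
iqrOutlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x)
}

dfGlass2 <- Z4[!(iqrOutlier(Z4$K) | iqrOutlier(Z4$Fe)), ]

fitOut <- multinom(Type ~ ., data = dfGlass2)
fitOut
```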
## # weights: 66 (50 variable)
## initial value 370.894210
## iter 10 value 235.079001
## iter 20 value 167.439433
## iter 30 value 156.218471
## iter 40 value 152.232660
## iter 50 value 149.769480
## iter 60 value 148.873298
## iter 70 value 147.224736
## iter 80 value 144.687797
## iter 90 value 140.520123
## iter 100 value 136.799846
## final value 136.799846
## stopped after 100 iterations
## Call:
## multinom(formula = Type ~ ., data = dfGlass2)
##
## Coefficients:
## (Intercept) RI Na Mg Al Si
## 2 135.87543 242.46029 -1.597713 -4.0578840 0.8773679 -2.4884756
## 3 26.60645 -121.54256 2.365625 0.4728967 2.7999264 0.5796289
## 4 60.53990 77.12660 -1.569426 -5.4510103 16.8372062 -0.9244504
## 5 -23.17418 -40.46829 23.312072 -2.9026957 35.1681966 2.8110950
## 6 -144.69916 -150.31477 6.686148 -1.8455719 8.0297401 3.0089003
## K Ca Ba Fe
## 2 -1.31623954 -347.63576 -12.082960 3.994494
## 3 -0.05915327 93.50323 -7.982407 3.162688
## 4 1.90186259 -72.26119 -14.314696 -24.136806
## 5 -119.86160340 -13.74760 -210.516845 -301.304544
## 6 6.72483448 82.67953 20.082394 -12.697464
##
## Residual Deviance: 273.5997
## AIC: 373.5997
3.2 The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans.
3.2a Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Many predictors are severely unbalanced. In the case of mycelium, for example, 0s outweigh 1s by a factor of more than 100. Other variables have balance issues as well, though not as severe: sclerotia is unbalanced, and mold.growth, seed.discolor, seed.size and shriveling also show possible degenerate distributions.
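A sketch of how the degenerate predictors might be flagged, assuming the Soybean data from mlbench and caret::nearZeroVar; the name dfSoybean is illustrative:

```r
library(mlbench)
library(caret)

data(Soybean)            # 683 rows, 35 categorical predictors plus Class
dfSoybean <- Soybean     # illustrative name for the working copy

# Flag degenerate (near-zero-variance) predictors such as mycelium and sclerotia
nearZeroVar(dfSoybean, saveMetrics = TRUE)

# Frequency tables for every predictor (output shown below)
summary(dfSoybean)
```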
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
3.2b Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
There are 2337 missing values in the data set, and they occur in groups: for example, the leaf-related columns (e.g., leaf.halo and leaf.marg) are either all missing or all present for a given record. Many records are missing most of these groups at once.
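A sketch of the missing-value counts and of a quick check that the leaf columns are missing together:

```r
sum(is.na(dfSoybean))                              # total missing values

# Missing values per predictor
sort(colSums(is.na(dfSoybean)), decreasing = TRUE)

# The leaf columns are missing (or present) together
table(halo = is.na(dfSoybean$leaf.halo),
      marg = is.na(dfSoybean$leaf.marg))
```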
## [1] 2337
Records with missing values are strongly associated with particular classes, as the following multinomial regression shows.
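A sketch of that regression, assuming the missing column is a per-record indicator of whether any value is missing (its exact definition is not shown):

```r
library(nnet)

# Per-record flag for whether any value is missing (an assumed definition
# of the 'missing' column used in the call below)
dfSoybean1         <- dfSoybean
dfSoybean1$missing <- as.integer(rowSums(is.na(dfSoybean)) > 0)

fitMiss <- multinom(Class ~ missing, data = dfSoybean1)
fitMiss
```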
## # weights: 57 (36 variable)
## initial value 2011.051823
## iter 10 value 1639.442368
## iter 20 value 1546.377778
## iter 30 value 1544.005030
## iter 40 value 1543.968246
## final value 1543.967558
## converged
## Call:
## multinom(formula = Class ~ missing, data = dfSoybean1)
##
## Coefficients:
## (Intercept) missing
## alternarialeaf-spot 25.298117 -47.562488
## anthracnose 24.571466 -53.758070
## bacterial-blight 23.782982 -48.492613
## bacterial-pustule 23.782982 -48.493815
## brown-spot 25.309049 -47.421545
## brown-stem-rot 24.571471 -53.744500
## charcoal-rot 23.782980 -48.490159
## cyst-nematode 2.004402 -2.137985
## diaporthe-pod-&-stem-blight 6.711386 -6.775926
## diaporthe-stem-canker 23.782982 -48.490155
## downy-mildew 23.782985 -48.493578
## frog-eye-leaf-spot 25.298115 -47.546196
## herbicide-injury 9.642078 -10.335236
## phyllosticta-leaf-spot 23.782983 -48.494128
## phytophthora-rot 23.782893 -22.335979
## powdery-mildew 23.782981 -48.493892
## purple-seed-stain 23.782982 -48.494176
## rhizoctonia-root-rot 23.782986 -48.490298
##
## Residual Deviance: 3087.935
## AIC: 3159.935
## (Intercept) missing
## alternarialeaf-spot 3.893784e-01 0.000000e+00
## anthracnose 4.031456e-01 0.000000e+00
## bacterial-blight 4.184129e-01 0.000000e+00
## bacterial-pustule 4.184129e-01 0.000000e+00
## brown-spot 3.891735e-01 0.000000e+00
## brown-stem-rot 4.031456e-01 0.000000e+00
## charcoal-rot 4.184129e-01 0.000000e+00
## cyst-nematode 0.000000e+00 0.000000e+00
## diaporthe-pod-&-stem-blight 5.230224e-06 4.249024e-06
## diaporthe-stem-canker 4.184129e-01 0.000000e+00
## downy-mildew 4.184128e-01 0.000000e+00
## frog-eye-leaf-spot 3.893785e-01 0.000000e+00
## herbicide-injury 9.670793e-01 9.647142e-01
## phyllosticta-leaf-spot 4.184129e-01 0.000000e+00
## phytophthora-rot 4.184131e-01 4.473017e-01
## powdery-mildew 4.184129e-01 0.000000e+00
## purple-seed-stain 4.184129e-01 0.000000e+00
## rhizoctonia-root-rot 4.184128e-01 0.000000e+00
3.2c Develop a strategy for handling missing data, either by eliminating predictors or imputation.
We cannot simply drop every record with a missing value: there are too many of them, and the values are not missing at random (MNAR), so dropping them would bias the sample. Some columns are missing around 18% of their values, but eliminating whole predictors risks losing valuable information. In some cases degenerate columns overlap with high-missingness columns (e.g., mold.growth); these are the best candidates for removal, since there is more than one reason to drop them.
Otherwise we are looking at imputation. Prior to imputation, dummy variables should be created to flag the missing group(s) to which each record belongs. We should then test individual columns to see whether they are MNAR; if not, KNN or linear regression can be used to estimate the missing values. When the data are MNAR there is no reason to expect the missing values to be predictable from the non-missing ones, and in that case a median or mean fill may be the best option. A sketch of this approach follows.
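A sketch of that strategy, assuming caret for the KNN imputation; the flag names and the coercion of the ordinal factors to integer codes are illustrative choices:

```r
library(caret)

# 1. Flag the missingness group(s) each record belongs to before imputing;
#    one flag per group, e.g. the leaf.* block (illustrative)
dfImp           <- dfSoybean
dfImp$miss.leaf <- as.integer(is.na(dfSoybean$leaf.halo))

# 2. The Soybean factors have numeric-looking levels, so coerce them to
#    integer codes for a KNN imputation sketch (levels could be restored
#    by rounding afterwards)
codes <- data.frame(lapply(dfSoybean[, -1],
                           function(x) as.integer(as.character(x))))

pp      <- preProcess(codes, method = "knnImpute")   # centers, scales, imputes
imputed <- predict(pp, codes)

# 3. For columns judged to be MNAR, a simple median fill may be safer, e.g.:
# codes$hail[is.na(codes$hail)] <- median(codes$hail, na.rm = TRUE)
```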