##### Chapter 3 KJ 1, 2
3.1. Description of Glass dataset
A data frame with 214 observations containing examples of the chemical analysis of 7 different types of glass. The problem is to predict the type of glass on the basis of the chemical analysis. The study of glass classification was motivated by criminological investigation: glass left at the scene of a crime can be used as evidence (if it is correctly identified!).
There are a total of 214 glass samples, with no missing values for any of the predictor variables. Based on their histograms and skewness, the predictors RI, Na, Al, Si and Ca are either approximately normal or could be brought closer to normal with a transformation (a power transformation such as Box-Cox; merely dividing by a constant changes the scale but not the shape). The remaining predictors Mg, K, Ba and Fe show a concentration of observations at exactly 0.
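As a small illustration of how skewness responds to transformations, the following Python sketch computes a moment-based skewness for a hypothetical right-skewed sample (the values are made up, not taken from the Glass data, and the function name `skewness` is an assumption). It compares the raw sample, a rescaled copy and a log-transformed copy. Summary tools may use a slightly different estimator, so the numbers are not expected to match the table in this section.

```python
import math

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed sample (not taken from the Glass data).
sample = [1, 1, 2, 2, 3, 4, 8, 16]

raw = skewness(sample)
rescaled = skewness([v / 4 for v in sample])        # division by a constant
logged = skewness([math.log(v) for v in sample])    # a shape-changing transform

print(f"raw={raw:.3f} rescaled={rescaled:.3f} log={logged:.3f}")
```

Rescaling leaves the skewness unchanged, while the log transform pulls in the long right tail and lowers it.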
Without additional information, a concentration of zero values does not indicate invalid measurements. Discarding these data or imputing replacement values would therefore likely reduce the predictive accuracy of any model built on them.
A better way to handle the predictors with a concentration of zeros is to use a zero-inflated (two-part) distribution for continuous data: a binary component for whether the value is zero and a continuous component for the non-zero values.
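One way to picture the two-part idea is to summarise a predictor by the share of exact zeros and by the mean of the remaining non-zero values. The Python sketch below does this for a hypothetical vector of concentrations; the data and the name `two_part_summary` are assumptions for illustration, and this is only a descriptive summary, not a fitted zero-inflated model.

```python
def two_part_summary(values):
    """Split a zero-heavy variable into a zero/non-zero part and a continuous part."""
    nonzero = [v for v in values if v != 0]
    p_zero = (len(values) - len(nonzero)) / len(values)
    mean_nonzero = sum(nonzero) / len(nonzero) if nonzero else None
    return p_zero, mean_nonzero

# Hypothetical concentrations with a spike at zero (not the real Ba or Fe values).
concentrations = [0, 0, 0, 0.5, 1.5]

p_zero, mean_nonzero = two_part_summary(concentrations)
print(p_zero, mean_nonzero)
```

The overall mean can be recovered as `(1 - p_zero) * mean_nonzero`, which shows that no observations are discarded by this representation.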
The two predictors with the greatest correlation are RI and Ca (r = 0.81). This suggests that in a multivariable regression model one of these explanatory variables could be removed, because it is strongly collinear with the other, with little to no loss of predictive ability.
variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RI | 1 | 214 | 1.518 | 0.003037 | 1.518 | 1.518 | 0.001875 | 1.511 | 1.534 | 0.02278 | 1.603 | 4.717 | 0.0002076 |
Na | 2 | 214 | 13.41 | 0.8166 | 13.3 | 13.38 | 0.6449 | 10.73 | 17.38 | 6.65 | 0.4478 | 2.898 | 0.05582 |
Mg | 3 | 214 | 2.685 | 1.442 | 3.48 | 2.866 | 0.3039 | 0 | 4.49 | 4.49 | -1.136 | -0.4527 | 0.0986 |
Al | 4 | 214 | 1.445 | 0.4993 | 1.36 | 1.412 | 0.3113 | 0.29 | 3.5 | 3.21 | 0.8946 | 1.938 | 0.03413 |
Si | 5 | 214 | 72.65 | 0.7745 | 72.79 | 72.71 | 0.5708 | 69.81 | 75.41 | 5.6 | -0.7202 | 2.816 | 0.05295 |
K | 6 | 214 | 0.4971 | 0.6522 | 0.555 | 0.4318 | 0.1705 | 0 | 6.21 | 6.21 | 6.46 | 52.87 | 0.04458 |
Ca | 7 | 214 | 8.957 | 1.423 | 8.6 | 8.742 | 0.6598 | 5.43 | 16.19 | 10.76 | 2.018 | 6.41 | 0.09728 |
Ba | 8 | 214 | 0.175 | 0.4972 | 0 | 0.03378 | 0 | 0 | 3.15 | 3.15 | 3.369 | 12.08 | 0.03399 |
Fe | 9 | 214 | 0.05701 | 0.09744 | 0 | 0.03581 | 0 | 0 | 0.51 | 0.51 | 1.73 | 2.52 | 0.006661 |
Type* | 10 | 214 | 2.542 | 1.708 | 2 | 2.308 | 1.483 | 1 | 6 | 5 | 1.038 | -0.2871 | 0.1167 |
Correlation Matrix
predictor | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe |
---|---|---|---|---|---|---|---|---|---|
RI | 1 | -0.1919 | -0.1223 | -0.4073 | -0.5421 | -0.2898 | 0.8104 | -0.000386 | 0.143 |
Na | -0.1919 | 1 | -0.2737 | 0.1568 | -0.06981 | -0.2661 | -0.2754 | 0.3266 | -0.2413 |
Mg | -0.1223 | -0.2737 | 1 | -0.4818 | -0.1659 | 0.005396 | -0.4438 | -0.4923 | 0.08306 |
Al | -0.4073 | 0.1568 | -0.4818 | 1 | -0.005524 | 0.326 | -0.2596 | 0.4794 | -0.0744 |
Si | -0.5421 | -0.06981 | -0.1659 | -0.005524 | 1 | -0.1933 | -0.2087 | -0.1022 | -0.0942 |
K | -0.2898 | -0.2661 | 0.005396 | 0.326 | -0.1933 | 1 | -0.3178 | -0.04262 | -0.007719 |
Ca | 0.8104 | -0.2754 | -0.4438 | -0.2596 | -0.2087 | -0.3178 | 1 | -0.1128 | 0.125 |
Ba | -0.000386 | 0.3266 | -0.4923 | 0.4794 | -0.1022 | -0.04262 | -0.1128 | 1 | -0.05869 |
Fe | 0.143 | -0.2413 | 0.08306 | -0.0744 | -0.0942 | -0.007719 | 0.125 | -0.05869 | 1 |
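The strongest pairwise relationship can be read off the correlation matrix above mechanically. The Python sketch below enters the upper triangle of the printed values and scans for the largest absolute correlation. The names `upper`, `corr` and `strongest_pair` are illustrative assumptions, and Python is used here only as a convenient notation, since the tables themselves appear to come from R.

```python
names = ["RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe"]

# Upper triangle of the correlation matrix printed above, row by row.
upper = [
    [-0.1919, -0.1223, -0.4073, -0.5421, -0.2898, 0.8104, -0.000386, 0.143],  # RI
    [-0.2737, 0.1568, -0.06981, -0.2661, -0.2754, 0.3266, -0.2413],           # Na
    [-0.4818, -0.1659, 0.005396, -0.4438, -0.4923, 0.08306],                  # Mg
    [-0.005524, 0.326, -0.2596, 0.4794, -0.0744],                             # Al
    [-0.1933, -0.2087, -0.1022, -0.0942],                                     # Si
    [-0.3178, -0.04262, -0.007719],                                           # K
    [-0.1128, 0.125],                                                         # Ca
    [-0.05869],                                                               # Ba
]

# Map each unordered pair of predictors to its correlation.
corr = {}
for i, row in enumerate(upper):
    for offset, r in enumerate(row):
        corr[(names[i], names[i + 1 + offset])] = r

def strongest_pair(corr):
    """Return the (pair, r) entry with the largest absolute correlation."""
    return max(corr.items(), key=lambda kv: abs(kv[1]))

pair, r = strongest_pair(corr)
print(pair, r)  # ('RI', 'Ca') 0.8104
```

The next largest value in absolute terms is RI with Si at -0.5421, so RI and Ca stand well apart from the other pairs.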
3.2. Description of Soybean dataset
There are 19 classes, only the first 15 of which have been used in prior work. The folklore seems to be that the last four classes are unjustified by the data since they have so few examples. There are 35 categorical attributes, some nominal and some ordered. The value “dna” means “does not apply.” The values for attributes are encoded numerically, with the first value encoded as “0,” the second as “1,” and so forth.
Based on the histograms of the categorical data, the following predictor variables are candidates for degenerate variables that can be eliminated, because their observations are concentrated in a single value with only sparse occurrences of the other values:
int.discolor, leaf.malf, leaf.mild, leaf.shread, leaves, lodging, mold.growth, mycelium, roots, sclerotia, seed, seed.discolor, seed.size and shriveling.
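The visual judgment from the histograms can be complemented by a numeric screen. The Python sketch below flags a variable as near-degenerate when the most common value is far more frequent than the second most common and the number of distinct values is small relative to the sample size. The thresholds (`freq_cut=19`, that is a 95/5 ratio, and `unique_cut=10` percent) are assumptions that mirror a common rule of thumb, the function name is hypothetical, and the example data are made up rather than taken from the Soybean data.

```python
from collections import Counter

def near_zero_variance(values, freq_cut=19.0, unique_cut=10.0):
    """Flag a variable whose observations sit almost entirely on one value."""
    observed = [v for v in values if v is not None]  # ignore missing entries
    counts = Counter(observed).most_common()
    if len(counts) < 2:
        return True  # zero or one distinct value carries no information
    freq_ratio = counts[0][1] / counts[1][1]          # most common vs. runner-up
    pct_unique = 100.0 * len(counts) / len(observed)  # distinct values per 100 rows
    return freq_ratio > freq_cut and pct_unique < unique_cut

# Hypothetical encoded categorical variables.
mostly_one_level = [0] * 97 + [1] * 3
balanced = [0] * 55 + [1] * 45

print(near_zero_variance(mostly_one_level))  # True
print(near_zero_variance(balanced))          # False
```

A screen like this gives a reproducible basis for the list above, though the cutoffs remain a judgment call.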
There are 683 total observations. The attributes with the most missing data are listed below with their counts of non-missing values, starting from the fewest non-missing values (the most missing data). A possible cause is that the missing values represent non-observations. This would make sense in the case of hail, mold, seed discoloration, shriveling, fruit attributes and so on: if something does not exist, it cannot be observed.
hail 562, sever 562, seed.tmt 562, lodging 562, germ 571, leaf.mild 575, fruiting.bodies 577, fruit.spots 577, seed.discolor 577, shriveling 577, leaf.shread 583, seed 591, mold.growth 591, seed.size 591, leaf.halo 599, leaf.marg 599, leaf.size 599, leaf.malf 599, fruit.pods 599
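The figures listed above are counts of non-missing values, so the number of missing values per attribute follows by subtraction from the total row count. The Python sketch below assumes a total of 683 rows, which is the n reported for Class in the summary table for this dataset; the dictionary name `non_missing` is illustrative.

```python
TOTAL_ROWS = 683  # n reported for Class in the summary table (assumed complete)

# Non-missing counts as listed above.
non_missing = {
    "hail": 562, "sever": 562, "seed.tmt": 562, "lodging": 562,
    "germ": 571, "leaf.mild": 575, "fruiting.bodies": 577,
    "fruit.spots": 577, "seed.discolor": 577, "shriveling": 577,
    "leaf.shread": 583, "seed": 591, "mold.growth": 591,
    "seed.size": 591, "leaf.halo": 599, "leaf.marg": 599,
    "leaf.size": 599, "leaf.malf": 599, "fruit.pods": 599,
}

# Missing count per attribute, printed from most to least missing.
missing = {name: TOTAL_ROWS - n for name, n in non_missing.items()}
for name, count in sorted(missing.items(), key=lambda kv: -kv[1]):
    print(f"{name:16s} {count:4d} missing ({100 * count / TOTAL_ROWS:.1f}%)")
```

Under that assumption, hail, sever, seed.tmt and lodging each have 121 missing values (about 17.7% of rows), and the attributes with 599 non-missing values have 84 missing.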
All of the predictor variables in this data set contain enough observations to make them useful in a predictive model. A strategy for data cleanup suitable for a regression model would be as follows:
Remove the degenerate variables identified in part (a). Remaining variables whose missing data can be considered a “non-observation,” such as hail, can be coded to zero. Variables with data missing for unknown reasons should be imputed from the other variables in the observation, for example using the predict function. Outlying records should be explained and perhaps removed.
Once all non-degenerate variables have been assigned, a regression model can be developed and collinear variables can be systematically eliminated until the simplest explanatory model is left.
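The first steps of the cleanup strategy above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: records are plain dictionaries with `None` for missing values, the function name `clean_records` and the toy records are hypothetical, and the imputation itself is left out, with the affected cells only collected for a later model-based step.

```python
def clean_records(records, degenerate, non_observation):
    """Drop degenerate variables, zero-fill non-observations, flag the rest."""
    cleaned, to_impute = [], []
    for row_idx, record in enumerate(records):
        out = {}
        for name, value in record.items():
            if name in degenerate:
                continue  # step 1: remove degenerate variables
            if value is None and name in non_observation:
                value = 0  # step 2: treat a non-observation as zero
            elif value is None:
                to_impute.append((row_idx, name))  # step 3: impute later
            out[name] = value
        cleaned.append(out)
    return cleaned, to_impute

# Toy records with hypothetical values (not real Soybean rows).
records = [
    {"hail": None, "mycelium": 0, "germ": 1},
    {"hail": 1, "mycelium": 0, "germ": None},
]

cleaned, to_impute = clean_records(
    records, degenerate={"mycelium"}, non_observation={"hail"}
)
print(cleaned)
print(to_impute)
```

Which variables belong in `non_observation` is a subject-matter decision, so the sets are passed in explicitly rather than inferred from the data.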
variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Class* | 1 | 683 | 9.296 | 5.511 | 8 | 9.179 | 7.413 | 1 | 19 | 18 | 0.113 | -1.379 | 0.2109 |
date* | 2 | 682 | 4.554 | 1.694 | 5 | 4.615 | 1.483 | 1 | 7 | 6 | -0.304 | -0.9045 | 0.06487 |
plant.stand* | 3 | 647 | 1.453 | 0.4982 | 1 | 1.441 | 0 | 1 | 2 | 1 | 0.189 | -1.967 | 0.01958 |
precip* | 4 | 645 | 2.597 | 0.6861 | 3 | 2.745 | 0 | 1 | 3 | 2 | -1.416 | 0.5502 | 0.02702 |
temp* | 5 | 653 | 2.182 | 0.6282 | 2 | 2.228 | 0 | 1 | 3 | 2 | -0.1583 | -0.5843 | 0.02458 |
hail* | 6 | 562 | 1.226 | 0.4186 | 1 | 1.158 | 0 | 1 | 2 | 1 | 1.307 | -0.2925 | 0.01766 |
crop.hist* | 7 | 667 | 2.885 | 0.9758 | 3 | 2.978 | 1.483 | 1 | 4 | 3 | -0.3976 | -0.9188 | 0.03778 |
area.dam* | 8 | 682 | 2.581 | 1.074 | 2 | 2.601 | 1.483 | 1 | 4 | 3 | 0.01799 | -1.286 | 0.04114 |
sever* | 9 | 562 | 1.733 | 0.597 | 2 | 1.691 | 0 | 1 | 3 | 2 | 0.1739 | -0.5648 | 0.02518 |
seed.tmt* | 10 | 562 | 1.52 | 0.6122 | 1 | 1.447 | 0 | 1 | 3 | 2 | 0.7397 | -0.4397 | 0.02583 |
germ* | 11 | 571 | 2.049 | 0.791 | 2 | 2.061 | 1.483 | 1 | 3 | 2 | -0.08681 | -1.4 | 0.0331 |
plant.growth* | 12 | 667 | 1.339 | 0.4737 | 1 | 1.299 | 0 | 1 | 2 | 1 | 0.6795 | -1.541 | 0.01834 |
leaves* | 13 | 683 | 1.887 | 0.3165 | 2 | 1.984 | 0 | 1 | 2 | 1 | -2.444 | 3.977 | 0.01211 |
leaf.halo* | 14 | 599 | 2.202 | 0.949 | 3 | 2.252 | 0 | 1 | 3 | 2 | -0.4108 | -1.765 | 0.03878 |
leaf.marg* | 15 | 599 | 1.773 | 0.9565 | 1 | 1.717 | 0 | 1 | 3 | 2 | 0.4648 | -1.747 | 0.03908 |
leaf.size* | 16 | 599 | 2.284 | 0.6117 | 2 | 2.337 | 0 | 1 | 3 | 2 | -0.2495 | -0.6294 | 0.02499 |
leaf.shread* | 17 | 583 | 1.165 | 0.3712 | 1 | 1.081 | 0 | 1 | 2 | 1 | 1.804 | 1.255 | 0.01537 |
leaf.malf* | 18 | 599 | 1.075 | 0.2638 | 1 | 1 | 0 | 1 | 2 | 1 | 3.216 | 8.354 | 0.01078 |
leaf.mild* | 19 | 575 | 1.104 | 0.4041 | 1 | 1 | 0 | 1 | 3 | 2 | 3.953 | 14.68 | 0.01685 |
stem* | 20 | 667 | 1.556 | 0.4972 | 2 | 1.57 | 0 | 1 | 2 | 1 | -0.2258 | -1.952 | 0.01925 |
lodging* | 21 | 562 | 1.075 | 0.2632 | 1 | 1 | 0 | 1 | 2 | 1 | 3.226 | 8.421 | 0.0111 |
stem.cankers* | 22 | 645 | 2.06 | 1.352 | 1 | 1.952 | 0 | 1 | 4 | 3 | 0.6098 | -1.509 | 0.05322 |
canker.lesion* | 23 | 645 | 1.98 | 1.084 | 2 | 1.851 | 1.483 | 1 | 4 | 3 | 0.5146 | -1.238 | 0.04268 |
fruiting.bodies* | 24 | 577 | 1.18 | 0.3847 | 1 | 1.102 | 0 | 1 | 2 | 1 | 1.659 | 0.7549 | 0.01602 |
ext.decay* | 25 | 645 | 1.25 | 0.4775 | 1 | 1.162 | 0 | 1 | 3 | 2 | 1.695 | 1.975 | 0.0188 |
mycelium* | 26 | 645 | 1.009 | 0.09607 | 1 | 1 | 0 | 1 | 2 | 1 | 10.2 | 102.2 | 0.003783 |
int.discolor* | 27 | 645 | 1.13 | 0.419 | 1 | 1 | 0 | 1 | 3 | 2 | 3.339 | 10.57 | 0.0165 |
sclerotia* | 28 | 645 | 1.031 | 0.1735 | 1 | 1 | 0 | 1 | 2 | 1 | 5.399 | 27.19 | 0.00683 |
fruit.pods* | 29 | 599 | 1.504 | 0.8825 | 1 | 1.283 | 0 | 1 | 4 | 3 | 1.838 | 2.413 | 0.03606 |
fruit.spots* | 30 | 577 | 1.847 | 1.17 | 1 | 1.687 | 0 | 1 | 4 | 3 | 0.9465 | -0.7574 | 0.04871 |
seed* | 31 | 591 | 1.195 | 0.3962 | 1 | 1.118 | 0 | 1 | 2 | 1 | 1.539 | 0.3693 | 0.0163 |
mold.growth* | 32 | 591 | 1.113 | 0.3173 | 1 | 1.017 | 0 | 1 | 2 | 1 | 2.433 | 3.925 | 0.01305 |
seed.discolor* | 33 | 577 | 1.111 | 0.3143 | 1 | 1.015 | 0 | 1 | 2 | 1 | 2.472 | 4.116 | 0.01308 |
seed.size* | 34 | 591 | 1.1 | 0.3 | 1 | 1 | 0 | 1 | 2 | 1 | 2.663 | 5.1 | 0.01234 |
shriveling* | 35 | 577 | 1.066 | 0.2482 | 1 | 1 | 0 | 1 | 2 | 1 | 3.492 | 10.21 | 0.01033 |
roots* | 36 | 652 | 1.178 | 0.4388 | 1 | 1.069 | 0 | 1 | 3 | 2 | 2.458 | 5.486 | 0.01719 |