Question No. 3.1

The Glass dataset consists of 214 observations and 10 variables.

Glass Identification Database
Description
A data frame with 214 observations containing examples of the chemical analysis of 7 different types of glass. The problem is to forecast the type of glass on the basis of the chemical analysis. The study of classification of types of glass was motivated by criminological investigation. At the scene of the crime, the glass left can be used as evidence (if it is correctly identified!).

Usage
data(Glass)
Format
A data frame with 214 observations on 10 variables:

[,1]   RI    refractive index
[,2]   Na    Sodium
[,3]   Mg    Magnesium
[,4]   Al    Aluminum
[,5]   Si    Silicon
[,6]   K     Potassium
[,7]   Ca    Calcium
[,8]   Ba    Barium
[,9]   Fe    Iron
[,10]  Type  Type of glass (class attribute)

Source
Creator: B. German, Central Research Establishment, Home Office Forensic Science Service, Aldermaston, Reading, Berkshire RG7 4PN

Donor: Vina Spiehler, Ph.D., DABFT, Diagnostic Products Corporation

These data have been taken from the UCI Repository Of Machine Learning Databases at

ftp://ftp.ics.uci.edu/pub/machine-learning-databases

http://www.ics.uci.edu/~mlearn/MLRepository.html

and were converted to R format by Friedrich Leisch.

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

3.1(a)  Question

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Answer

First, separate the nine predictor variables (RI, Na, Mg, Al, Si, K, Ca, Ba, and Fe) from the dependent variable, Type:
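The code for this step is not shown in the rendered output. A minimal sketch, assuming the data are loaded from the mlbench package (as the help text above suggests) and using the hypothetical names `glass_predictors` and `glass_type`, might look like this:

```r
library(mlbench)   # provides the Glass data set

data(Glass)

# Predictors: every column except the class label
glass_predictors <- Glass[, names(Glass) != "Type"]

# Outcome: the glass type, kept as a factor
glass_type <- Glass$Type

dim(glass_predictors)
```

Selecting columns by name rather than by position keeps the split correct even if the column order changes.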

The table below shows basic statistics for each predictor: mean, standard deviation, median, minimum, maximum, and skew, along with the proportion of missing values (pct_missing).

STATS  vars    n        mean         sd    median     trimmed        mad       min       max     range        skew    kurtosis         se  pct_missing
RI        1  214   1.5183654  0.0030369   1.51768   1.5180119  0.0018755   1.51115   1.53393   0.02278   1.6027151   4.7167266  0.0002076            0
Na        2  214  13.4078505  0.8166036  13.30000  13.3768023  0.6449310  10.73000  17.38000   6.65000   0.4478343   2.8979666  0.0558219            0
Mg        3  214   2.6845327  1.4424078   3.48000   2.8655233  0.3039330   0.00000   4.49000   4.49000  -1.1364523  -0.4526762  0.0986010            0
Al        4  214   1.4449065  0.4992696   1.36000   1.4122093  0.3113460   0.29000   3.50000   3.21000   0.8946104   1.9383534  0.0341294            0
Si        5  214  72.6509346  0.7745458  72.79000  72.7073256  0.5708010  69.81000  75.41000   5.60000  -0.7202392   2.8163627  0.0529469            0
K         6  214   0.4970561  0.6521918   0.55500   0.4318023  0.1704990   0.00000   6.21000   6.21000   6.4600889  52.8665268  0.0445829            0
Ca        7  214   8.9569626  1.4231535   8.60000   8.7421512  0.6597570   5.43000  16.19000  10.76000   2.0184463   6.4104000  0.0972848            0
Ba        8  214   0.1750467  0.4972193   0.00000   0.0337791  0.0000000   0.00000   3.15000   3.15000   3.3686800  12.0801412  0.0339892            0
Fe        9  214   0.0570093  0.0974387   0.00000   0.0358140  0.0000000   0.00000   0.51000   0.51000   1.7298107   2.5203615  0.0066608            0

The colSums function confirms that there are no missing values in the dataset.

## RI Na Mg Al Si  K Ca Ba Fe 
##  0  0  0  0  0  0  0  0  0
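The call behind this output is not shown. One way to produce it, sketched here with the hypothetical name `glass_predictors` for the nine predictor columns, is:

```r
library(mlbench)

data(Glass)
glass_predictors <- Glass[, names(Glass) != "Type"]

# is.na() flags each missing cell; colSums() counts the flags per column
missing_per_column <- colSums(is.na(glass_predictors))
missing_per_column
```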

The correlation plot shows that the strongest correlation among the predictors is between RI (the refractive index) and Ca (calcium).
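The plotting code is not included here. As a rough numerical check of the same claim, the sketch below computes the Pearson correlation matrix with base R and picks out the pair with the largest absolute correlation. The variable names are illustrative, not the author's.

```r
library(mlbench)

data(Glass)
glass_predictors <- Glass[, names(Glass) != "Type"]

# Pearson correlation matrix of the nine numeric predictors
cor_matrix <- cor(glass_predictors)

# Blank out the diagonal and the duplicate lower triangle,
# then locate the pair with the largest absolute correlation
upper <- abs(cor_matrix)
upper[lower.tri(upper, diag = TRUE)] <- NA
strongest <- which(upper == max(upper, na.rm = TRUE), arr.ind = TRUE)

strongest_pair <- c(rownames(cor_matrix)[strongest[1, "row"]],
                    colnames(cor_matrix)[strongest[1, "col"]])
strongest_pair
round(cor_matrix[strongest_pair[1], strongest_pair[2]], 2)
```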

In Figure 2 below, the lower-left triangle shows a scatter plot for each pair of predictors, with a regression line through each plot. The diagonal shows a histogram of each predictor. We can see that K, Ba, and Fe are skewed to the right. The upper triangle shows the correlation between each pair of predictors. Again, RI (refractive index) and Ca (calcium) stand out as the most strongly correlated pair.

3.1(b)  Question

Do there appear to be any outliers in the data? Are any predictors skewed?

Answer

Yes. Figures 2 (the correlation plots) and 3 (the boxplots) show that Ba, Fe, and K have several outliers. These three variables are also right-skewed, with skew values of 3.37, 1.73, and 6.46 respectively in the summary table above. Ca (2.02) and RI (1.60) are noticeably right-skewed as well.
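To put numbers next to the visual impression, the following sketch computes a sample skewness and counts the points that fall beyond the boxplot whiskers for each predictor. The helper names `sample_skewness` and `count_boxplot_outliers` are hypothetical, and the 1.5 * IQR whisker rule is the `boxplot.stats()` default rather than anything the report specifies.

```r
library(mlbench)

data(Glass)
glass_predictors <- Glass[, names(Glass) != "Type"]

# Sample skewness: third central moment divided by the cubed standard deviation
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / sd(x)^3
}

# Number of points beyond the boxplot whiskers (1.5 * IQR rule)
count_boxplot_outliers <- function(x) length(boxplot.stats(x)$out)

skew_and_outliers <- data.frame(
  skewness = sapply(glass_predictors, sample_skewness),
  outliers = sapply(glass_predictors, count_boxplot_outliers)
)

# Most skewed predictors first
skew_and_outliers[order(-abs(skew_and_outliers$skewness)), ]
```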

3.1(c)  Question

Are there any relevant transformations of one or more predictors that might improve the classification model?

Answer

Several transformations could be applied:

  1. Centering and Scaling

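The most straightforward and common data transformation is to center and scale the predictor variables. To center a predictor variable, the average predictor value is subtracted from all the values, so the centered predictor has a zero mean. To scale it, each value is then divided by the predictor's standard deviation, so the scaled predictor has a standard deviation of one.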

  2. Box-Cox transformation

  3. Log and square root transformation
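The transformations listed above can be sketched with `preProcess()` from the caret package. This is an illustration under assumptions, not the code used for this report. Note that Box-Cox and log transformations are only defined for strictly positive values, and the summary table shows a minimum of 0 for Mg, K, Ba, and Fe, so those predictors cannot be changed by Box-Cox as they stand.

```r
library(mlbench)
library(caret)   # preProcess() estimates and applies the transformations

data(Glass)
glass_predictors <- Glass[, names(Glass) != "Type"]

# Estimate Box-Cox parameters where possible, then center and scale.
# Predictors containing zeros are not eligible for Box-Cox,
# but they are still centered and scaled.
transform_spec <- preProcess(glass_predictors,
                             method = c("BoxCox", "center", "scale"))

glass_transformed <- predict(transform_spec, glass_predictors)

# After centering and scaling, each column has mean ~0 and sd ~1
round(colMeans(glass_transformed), 3)
round(apply(glass_transformed, 2, sd), 3)
```

Estimating the transformation once with `preProcess()` and applying it with `predict()` means the same parameters can later be reused on new samples.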

Question No. 3.2

3.2(a)

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Soybean Database
Description
There are 19 classes, only the first 15 of which have been used in prior work. The folklore seems to be that the last four classes are unjustified by the data since they have so few examples. There are 35 categorical attributes, some nominal and some ordered. The value “dna” means does not apply. The values for attributes are encoded numerically, with the first value encoded as “0,” the second as “1,” and so forth.

Usage
data(Soybean)
Format
A data frame with 683 observations on 36 variables. There are 35 categorical
attributes, all numerical and a nominal denoting the class.

[,1]   Class            the 19 classes
[,2]   date             apr(0),may(1),june(2),july(3),aug(4),sept(5),oct(6).
[,3]   plant.stand      normal(0),lt-normal(1).
[,4]   precip           lt-norm(0),norm(1),gt-norm(2).
[,5]   temp             lt-norm(0),norm(1),gt-norm(2).
[,6]   hail             yes(0),no(1).
[,7]   crop.hist        dif-lst-yr(0),s-l-y(1),s-l-2-y(2), s-l-7-y(3).
[,8]   area.dam         scatter(0),low-area(1),upper-ar(2),whole-field(3).
[,9]   sever            minor(0),pot-severe(1),severe(2).
[,10]  seed.tmt         none(0),fungicide(1),other(2).
[,11]  germ             90-100%(0),80-89%(1),lt-80%(2).
[,12]  plant.growth     norm(0),abnorm(1).
[,13]  leaves           norm(0),abnorm(1).
[,14]  leaf.halo        absent(0),yellow-halos(1),no-yellow-halos(2).
[,15]  leaf.marg        w-s-marg(0),no-w-s-marg(1),dna(2).
[,16]  leaf.size        lt-1/8(0),gt-1/8(1),dna(2).
[,17]  leaf.shread      absent(0),present(1).
[,18]  leaf.malf        absent(0),present(1).
[,19]  leaf.mild        absent(0),upper-surf(1),lower-surf(2).
[,20]  stem             norm(0),abnorm(1).
[,21]  lodging          yes(0),no(1).
[,22]  stem.cankers     absent(0),below-soil(1),above-s(2),ab-sec-nde(3).
[,23]  canker.lesion    dna(0),brown(1),dk-brown-blk(2),tan(3).
[,24]  fruiting.bodies  absent(0),present(1).
[,25]  ext.decay        absent(0),firm-and-dry(1),watery(2).
[,26]  mycelium         absent(0),present(1).
[,27]  int.discolor     none(0),brown(1),black(2).
[,28]  sclerotia        absent(0),present(1).
[,29]  fruit.pods       norm(0),diseased(1),few-present(2),dna(3).
[,30]  fruit.spots      absent(0),col(1),br-w/blk-speck(2),distort(3),dna(4).
[,31]  seed             norm(0),abnorm(1).
[,32]  mold.growth      absent(0),present(1).
[,33]  seed.discolor    absent(0),present(1).
[,34]  seed.size        norm(0),lt-norm(1).
[,35]  shriveling       absent(0),present(1).
[,36]  roots            norm(0),rotted(1),galls-cysts(2).

Source
Source: R.S. Michalski and R.L. Chilausky “Learning by Being Told and Learning from Examples: An Experimental Comparison of the Two Methods of Knowledge Acquisition in the Context of Developing an Expert System for Soybean Disease Diagnosis”, International Journal of Policy Analysis and Information Systems, Vol. 4, No. 2, 1980.

Donor: Ming Tan & Jeff Schlimmer (Jeff.Schlimmer%cs.cmu.edu)

These data have been taken from the UCI Repository Of Machine Learning Databases at

ftp://ftp.ics.uci.edu/pub/machine-learning-databases

http://www.ics.uci.edu/~mlearn/MLRepository.html

and were converted to R format by Evgenia Dimitriadou.

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##               STATS vars   n     mean         sd median  trimmed    mad min max
## 1            Class*    1 683 9.295754 5.51115341      8 9.179159 7.4130   1  19
## 2             date*    2 682 4.554252 1.69411726      5 4.615385 1.4826   1   7
## 3      plant.stand*    3 647 1.452859 0.49815792      1 1.441233 0.0000   1   2
## 4           precip*    4 645 2.596899 0.68614709      3 2.744681 0.0000   1   3
## 5             temp*    5 653 2.182236 0.62821435      2 2.227533 0.0000   1   3
## 6             hail*    6 562 1.225979 0.41859776      1 1.157778 0.0000   1   2
## 7        crop.hist*    7 667 2.884558 0.97576561      3 2.977570 1.4826   1   4
## 8         area.dam*    8 682 2.580645 1.07437412      2 2.600733 1.4826   1   4
## 9            sever*    9 562 1.733096 0.59702831      2 1.691111 0.0000   1   3
## 10        seed.tmt*   10 562 1.519573 0.61224099      1 1.446667 0.0000   1   3
## 11            germ*   11 571 2.049037 0.79098758      2 2.061269 1.4826   1   3
## 12    plant.growth*   12 667 1.338831 0.47366739      1 1.299065 0.0000   1   2
## 13          leaves*   13 683 1.887262 0.31650395      2 1.983547 0.0000   1   2
## 14       leaf.halo*   14 599 2.202003 0.94899841      3 2.251559 0.0000   1   3
## 15       leaf.marg*   15 599 1.772955 0.95651425      1 1.717256 0.0000   1   3
## 16       leaf.size*   16 599 2.283806 0.61169336      2 2.336798 0.0000   1   3
## 17     leaf.shread*   17 583 1.164666 0.37119689      1 1.081370 0.0000   1   2
## 18       leaf.malf*   18 599 1.075125 0.26381357      1 1.000000 0.0000   1   2
## 19       leaf.mild*   19 575 1.104348 0.40411457      1 1.000000 0.0000   1   3
## 20            stem*   20 667 1.556222 0.49720190      2 1.570093 0.0000   1   2
## 21         lodging*   21 562 1.074733 0.26319445      1 1.000000 0.0000   1   2
## 22    stem.cankers*   22 645 2.060465 1.35169658      1 1.951644 0.0000   1   4
## 23   canker.lesion*   23 645 1.979845 1.08400138      2 1.851064 1.4826   1   4
## 24 fruiting.bodies*   24 577 1.180243 0.38472295      1 1.101512 0.0000   1   2
## 25       ext.decay*   25 645 1.249612 0.47746159      1 1.162476 0.0000   1   3
## 26        mycelium*   26 645 1.009302 0.09607342      1 1.000000 0.0000   1   2
## 27    int.discolor*   27 645 1.130233 0.41899848      1 1.000000 0.0000   1   3
## 28       sclerotia*   28 645 1.031008 0.17347313      1 1.000000 0.0000   1   2
## 29      fruit.pods*   29 599 1.504174 0.88251272      1 1.282744 0.0000   1   4
## 30     fruit.spots*   30 577 1.847487 1.17006859      1 1.686825 0.0000   1   4
## 31            seed*   31 591 1.194585 0.39621658      1 1.118393 0.0000   1   2
## 32     mold.growth*   32 591 1.113367 0.31730966      1 1.016913 0.0000   1   2
## 33   seed.discolor*   33 577 1.110919 0.31430372      1 1.015119 0.0000   1   2
## 34       seed.size*   34 591 1.099831 0.30002820      1 1.000000 0.0000   1   2
## 35      shriveling*   35 577 1.065858 0.24824873      1 1.000000 0.0000   1   2
## 36           roots*   36 652 1.177914 0.43882605      1 1.068966 0.0000   1   3
##    range        skew    kurtosis          se pct_missing
## 1     18  0.11302119  -1.3791026 0.210878424       0.000
## 2      6 -0.30397011  -0.9045074 0.064871103       0.001
## 3      1  0.18896734  -1.9673249 0.019584609       0.053
## 4      2 -1.41630633   0.5502093 0.027017015       0.056
## 5      2 -0.15829545  -0.5843151 0.024583927       0.044
## 6      1  1.30690508  -0.2925101 0.017657481       0.177
## 7      3 -0.39757148  -0.9187916 0.037781795       0.023
## 8      3  0.01799005  -1.2864923 0.041139911       0.001
## 9      2  0.17391297  -0.5647524 0.025184119       0.177
## 10     2  0.73966698  -0.4396667 0.025825828       0.177
## 11     2 -0.08680952  -1.3998550 0.033101800       0.164
## 12     1  0.67949699  -1.5405868 0.018340474       0.023
## 13     1 -2.44354029   3.9767180 0.012110687       0.000
## 14     2 -0.41080342  -1.7648507 0.038775024       0.123
## 15     2  0.46484621  -1.7465620 0.039082113       0.123
## 16     2 -0.24946067  -0.6293671 0.024993113       0.123
## 17     1  1.80367508   1.2554060 0.015373404       0.146
## 18     1  3.21564565   8.3543324 0.010779130       0.123
## 19     2  3.95290557  14.6848261 0.016852743       0.158
## 20     1 -0.22581409  -1.9519277 0.019251734       0.023
## 21     1  3.22582942   8.4209689 0.011102188       0.177
## 22     3  0.60983130  -1.5090610 0.053223001       0.056
## 23     3  0.51457211  -1.2379837 0.042682512       0.056
## 24     1  1.65939253   0.7549009 0.016016226       0.155
## 25     2  1.69543723   1.9750241 0.018800032       0.056
## 26     1 10.19921824 102.1824824 0.003782887       0.056
## 27     2  3.33861193  10.5712527 0.016498049       0.056
## 28     1  5.39870500  27.1881751 0.006830498       0.056
## 29     3  1.83817833   2.4130176 0.036058492       0.123
## 30     3  0.94650965  -0.7574031 0.048710593       0.155
## 31     1  1.53904600   0.3692961 0.016298172       0.135
## 32     1  2.43281985   3.9252628 0.013052375       0.135
## 33     1  2.47154019   4.1156528 0.013084635       0.155
## 34     1  2.66303034   5.1003693 0.012341511       0.135
## 35     1  3.49157641  10.2088077 0.010334730       0.155
## 36     2  2.45781443   5.4857676 0.017185754       0.045

According to the authors:

"Some models can be crippled by predictors with degenerate distributions. In these cases, there can be a significant improvement in model performance and/or stability without the problematic variables. Consider a predictor variable that has a single unique value; we refer to this type of data as a zero variance predictor."

The authors flag a predictor as having near-zero variance when both of the following hold:

- "The fraction of unique values over the sample size is low (say 10%)."

- "The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20)."

The function nearZeroVar diagnoses predictors that have one unique value (i.e., zero variance predictors) or predictors that have both of the following characteristics: very few unique values relative to the number of samples, and a large ratio of the frequency of the most common value to the frequency of the second most common value.

Here, Soybean2 is created by removing the predictors flagged by nearZeroVar. Taking the set difference of the column names shows the three columns with near-zero variance: “leaf.mild”, “mycelium”, and “sclerotia”. A summary of these columns shows that zero is the predominant value in each, so these three distributions are degenerate in the sense described above.

## [1] "leaf.mild" "mycelium"  "sclerotia"
##  leaf.mild 
##  0   :535  
##  1   : 20  
##  2   : 20  
##  NA's:108
##  mycelium  
##  0   :639  
##  1   :  6  
##  NA's: 38
##  sclerotia 
##  0   :625  
##  1   : 20  
##  NA's: 38
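The chunk that produced this output is not shown. A sketch of the same check, assuming caret's default cutoffs (`freqCut = 95/5`, `uniqueCut = 10`) and reusing the name Soybean2 from the text, could be:

```r
library(mlbench)
library(caret)

data(Soybean)

# Column names flagged with caret's default cutoffs
nzv_columns <- nearZeroVar(Soybean, names = TRUE)
nzv_columns

# Drop the flagged predictors; Soybean2 mirrors the name used in the text
Soybean2 <- Soybean[, !(names(Soybean) %in% nzv_columns)]

# Frequency table (including NAs) for each flagged predictor
lapply(Soybean[nzv_columns], table, useNA = "ifany")
```

Passing `saveMetrics = TRUE` to `nearZeroVar()` instead returns the frequency ratio and percentage of unique values for every column, which helps when a predictor sits close to a cutoff.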

3.2(b)

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

From the text:

“In our experience, missing values are more often related to predictor variables than the sample. Because of this, amount of missing data may be concentrated in a subset of predictors rather than occurring randomly across all the predictors. In some cases, the percentage of missing data is substantial enough to remove this predictor from subsequent modeling activities.”

The metastats output for the dataset was saved in the metrics variable, and the predictors with the highest proportion of missing values are listed below. “hail”, “sever”, “seed.tmt”, and “lodging” have the most missing data, approximately 18% each. Each one is a factor variable.

The summaries shown here do not reveal a clear reason why the data are missing, and they do not by themselves establish whether the missing data are related to the outcome class.

## # A tibble: 36 x 2
##    STATS            pct_missing
##    <chr>                  <dbl>
##  1 hail*                  0.177
##  2 sever*                 0.177
##  3 seed.tmt*              0.177
##  4 lodging*               0.177
##  5 germ*                  0.164
##  6 leaf.mild*             0.158
##  7 fruiting.bodies*       0.155
##  8 fruit.spots*           0.155
##  9 seed.discolor*         0.155
## 10 shriveling*            0.155
## # ... with 26 more rows
##  [1] diaporthe-stem-canker       charcoal-rot               
##  [3] rhizoctonia-root-rot        phytophthora-rot           
##  [5] brown-stem-rot              powdery-mildew             
##  [7] downy-mildew                brown-spot                 
##  [9] bacterial-blight            bacterial-pustule          
## [11] purple-seed-stain           anthracnose                
## [13] phyllosticta-leaf-spot      alternarialeaf-spot        
## [15] frog-eye-leaf-spot          diaporthe-pod-&-stem-blight
## [17] cyst-nematode               2-4-d-injury               
## [19] herbicide-injury           
## 19 Levels: 2-4-d-injury alternarialeaf-spot anthracnose ... rhizoctonia-root-rot
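One way to examine both parts of the question directly is to compute the missing share per predictor and then cross-tabulate incomplete rows against the class. The sketch below uses base R only; the names `missing_by_predictor`, `has_missing`, and `missing_by_class` are illustrative.

```r
library(mlbench)

data(Soybean)

# Proportion of missing values per column, largest first
missing_by_predictor <- sort(colMeans(is.na(Soybean)), decreasing = TRUE)
head(round(missing_by_predictor, 3), 10)

# Rows with at least one missing value, tabulated against the class
has_missing <- !complete.cases(Soybean)
missing_by_class <- table(Soybean$Class, has_missing)
missing_by_class
```

If the incomplete rows pile up in a handful of classes, the missingness is informative about the outcome, which matters when choosing between dropping predictors and imputing.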

3.2(c)

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Imputation may not be the best option here because these predictors are categorical variables with a finite number of levels. The simplest strategy may be to remove the predictors with the most missing data entirely.

Perhaps the overall best strategy would be to build the model with and without these predictors and select the model with the best results.
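The two candidate predictor sets for such a comparison could be prepared as follows. This is only a sketch: the 15% cutoff is an arbitrary value chosen for illustration, mode imputation is one simple option among several, and `impute_mode` is a hypothetical helper.

```r
library(mlbench)

data(Soybean)
predictors <- Soybean[, names(Soybean) != "Class"]

# Option A: drop predictors whose missing share exceeds a cutoff.
# The 15% cutoff is an arbitrary illustration, not a recommended value.
missing_share <- colMeans(is.na(predictors))
reduced <- predictors[, missing_share <= 0.15]

# Option B: keep every predictor and fill gaps with the most frequent level
impute_mode <- function(x) {
  most_frequent <- names(which.max(table(x)))
  x[is.na(x)] <- most_frequent
  x
}
imputed <- as.data.frame(lapply(predictors, impute_mode))

ncol(reduced)
sum(is.na(imputed))
```

Fitting the same model to `reduced` and to `imputed` and comparing resampled performance would show whether the predictors with many missing values are worth keeping.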

