Data 624 HW 4 (Week 5) Data Preprocessing

Data Processing/Overfitting

Alexander Ng

Due 10/3/2021

Overview

Exercises from Kuhn & Johnson, Applied Predictive Modeling, Chapter 3 Data Preprocessing. All R code is displayed at the end for clarity.

Exercise 3.1 Statement

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The data can be accessed via:

>library(mlbench)
>data(Glass)
>str(Glass)
  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

  2. Do there appear to be any outliers in the data? Are any predictors skewed?

  3. Are there any relevant transformations of one or more predictors that might improve the classification model?

Exercise 3.1 Solution

Data Visualization

The correlogram of ggpairs shows the distributions, correlations and scatterplots of the variables in relation to each other.

The principal ingredients in glass are silicon (mean 72.7%) and sodium (13.4%) and calcium (8.96%). Other elements appear in trace quantities - under 3% on average.

The most significant relationships are between RI (refractive index) and other elements:

Certain elements are not present in many samples creating bimodal distributions. For example,

The chemical relationship between these trace elements and the refractive index and type of glass need to be explored further with a chemist.

Outliers and Skewness

Skewness

Using the box plots below and the density plots in the previous section, we observe that:

  • RI has a slight right skew.
  • Na is symmetric
  • Mg is bimodal with a cluster at zero. The non-zero observations appear left skewed.
  • Ba is bimodal with a cluster at zero. The non-zero observations are symmetrical in the subpopulations by Type. However, we should not draw too many conclusions because Ba is absent in 82% of samples.
  • Fe is bimodal with a cluster at zero. The non-zero observations appear right skewed in aggregate, but symmetric within their Type subgroups.
  • Al is present in all samples. It has a slight right skew. The subpopulations are relatively symmetric. Simpson’s paradox is applicable here.
  • Si is present in all samples and symmetric. Within subpopulations, Si is both left and right skewed.
  • Ca is present in all samples but left and right skewed at the subpopulation level.

Outliers

Because Glass lacks a unique row identifier, we begin by inserting an Id column to assist in outlier identification.

We will store the outliers by their Id, Field, Cause.
Obviously the same row can appear multiple times.

Let’s begin by plotting boxplots of the refraction index and each chemical component of the glass samples faceted by Type.

The outliers are identified using customized grouped boxplots. For each response variable (i.e. the elements and refractive index) we generate a grouped boxplot. Observations are grouped by the classification variable Type.

Examining the subgroup distribution by Type instead of for the pooled distribution for skewness is necessary to avoid Simpson’s Paradox. I alter the boxplots in ways to choose outliers:

  • Extend the whisker in the boxplot to 4 times the Inter quartile range (IQR) instead of using the default 1.5. This reduces the number of potential outliers.
  • Display all individual datapoints in gray via jitter. This allows me to distinguish points which fall on the cutoff boundary of the whisker from truly isolated points.
  • Mark outliers in red.

These alterations make the box plot more usable but I still omit or include some outliers based on visual inspection.

My final list contains 11 observations that I regard as potential outliers. To illustrate the selection within one boxplot, I observe for Mg (Magnesium) that Type 1 has an outlier of high value around 5. The Type 7 population is bimodal. Most observations have no Mg concentration. But some Type 7 samples have Mg and are marked in red. However, I consider none to be outliers because they appear clustered.

In the table below, I show row number, Type and outlier status with respect to each element or Refractive Index.

  • A table value of 1 means the observation is a high value outlier with respect to the column attribute.
  • A table value of -1 indicates a low value outlier.
  • A table value of 0 means it is not an outlier.

We observe that some observations have correlated outliers:

  • Observation 185 is Type 6. It has low Refractive Index, High Silicon and Low Sodium.
  • Observation 186 is Type 6. It has low Refractive Index, Low Calcium.
  • Observation 107 is Type 2. It has high Refractive Index, High Calcium.

List of Outliers

Row Type Mg Ba RI Ca Fe K Si Na
1 1 1 0 0 0 0 0 0 0
107 2 0 1 1 0 0 0 0 0
108 2 0 0 1 1 0 0 0 0
163 3 0 0 0 0 1 0 0 0
172 5 0 0 0 0 0 1 0 0
173 5 0 0 0 0 0 1 0 0
175 5 0 0 0 0 1 0 0 0
181 6 0 0 -1 0 0 0 0 0
185 6 0 0 -1 0 0 0 1 -1
186 7 0 0 -1 -1 0 0 0 0
208 7 0 1 0 0 0 0 0 0

Classification Improvements

To evaluate the benefit of variable transformation to improve prediction accuracy, I test their impact on an actual classification model. The authors imply that one can figure this out without experimental evidence. I don’t agree. I use a random forest algorithm to classify the observations. I compare the predictive accuracy in-sample and the variance important plot before and after the variable transformation.

I will show that a (square root) have no impact on random forest accuracy or variance importance.

In the output below, I show a random forest to predict Type using all other variables (exclude Row number). I set the parameters to be \(N=1000\) trees and \(mtry=3\) nodes. Accuracy results were robust to changes in these parameters.

We observe that the OOB accuracy is 19.6% in the original model. We see that Magnesium, Refractive Index and Aluminum, Calcium are useful variables for accurate prediction. We see that Silicon and Iron are least useful in prediction.

## 
## Call:
##  randomForest(formula = Type ~ RI + Mg + Ba + Ca + Fe + K + Si +      Na + Al, data = Glass2, mtry = 3, importance = TRUE, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 19.63%
## Confusion matrix:
##    1  2 3 5 6  7 class.error
## 1 62  6 2 0 0  0   0.1142857
## 2 10 62 1 1 1  1   0.1842105
## 3  6  4 7 0 0  0   0.5882353
## 5  0  3 0 9 0  1   0.3076923
## 6  0  2 0 0 7  0   0.2222222
## 7  1  3 0 0 0 25   0.1379310

In the below model, we construct a new data set to transform the variables \(K, Ba, Mg, Al, Fe, RI\) with square root transformations. We observe that the OOB accuracy is 19.2% in the original model. We see that Magnesium, Refractive Index and Aluminum, Calcium are still the most useful variables for accurate prediction. We see that Silicon and Iron are still least useful in prediction. I judge the predictive accuracy improvement to be insignificant.

## 
## Call:
##  randomForest(formula = Type ~ RI + Mg + Ba + Ca + Fe + K + Si +      Na + Al, data = Glass3, mtry = 3, importance = TRUE, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 19.16%
## Confusion matrix:
##    1  2 3 5 6  7 class.error
## 1 63  6 1 0 0  0   0.1000000
## 2  9 63 1 1 1  1   0.1710526
## 3  7  4 6 0 0  0   0.6470588
## 5  0  3 0 9 0  1   0.3076923
## 6  0  2 0 0 7  0   0.2222222
## 7  1  3 0 0 0 25   0.1379310

I conclude that variable transformations \(\color{red}{\text{don't}}\) help the classification accuracy when the model is robust to such transformations. The random forest is somewhat robust to variable transformations. This is not be true for regression models like OLS or multinomial models where the distribution of errors is more important.

Exercise 3.2 Statement

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

> library(mlbench)
> data(Soybean)
> ## See ?Soybean for details
  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

  2. Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

  3. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Exercise 3.2 Solution

A. Frequency Distributions of Categorical Predictors

We analyze the dataframe using the skimr package to answer basic questions about the frequency distribution of the predictors. The predictor name is listed under skim_variable. In the largest table of the skim output below the top_counts columns, we see the frequency distribution. The frequency distribution info is about the top 4 values of the predictor. For example, the column plant.stand takes on 2 values (0 occurs 354 times and 1 occurs 292 times).

Data summary
Name Soybean
Number of rows 683
Number of columns 36
_______________________
Column type frequency:
factor 36
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Class 0 1.00 FALSE 19 bro: 92, alt: 91, fro: 91, phy: 88
date 1 1.00 FALSE 7 5: 149, 4: 131, 3: 118, 2: 93
plant.stand 36 0.95 TRUE 2 0: 354, 1: 293
precip 38 0.94 TRUE 3 2: 459, 1: 112, 0: 74
temp 30 0.96 TRUE 3 1: 374, 2: 199, 0: 80
hail 121 0.82 FALSE 2 0: 435, 1: 127
crop.hist 16 0.98 FALSE 4 2: 219, 3: 218, 1: 165, 0: 65
area.dam 1 1.00 FALSE 4 1: 227, 3: 187, 2: 145, 0: 123
sever 121 0.82 FALSE 3 1: 322, 0: 195, 2: 45
seed.tmt 121 0.82 FALSE 3 0: 305, 1: 222, 2: 35
germ 112 0.84 TRUE 3 1: 213, 2: 193, 0: 165
plant.growth 16 0.98 FALSE 2 0: 441, 1: 226
leaves 0 1.00 FALSE 2 1: 606, 0: 77
leaf.halo 84 0.88 FALSE 3 2: 342, 0: 221, 1: 36
leaf.marg 84 0.88 FALSE 3 0: 357, 2: 221, 1: 21
leaf.size 84 0.88 TRUE 3 1: 327, 2: 221, 0: 51
leaf.shread 100 0.85 FALSE 2 0: 487, 1: 96
leaf.malf 84 0.88 FALSE 2 0: 554, 1: 45
leaf.mild 108 0.84 FALSE 3 0: 535, 1: 20, 2: 20
stem 16 0.98 FALSE 2 1: 371, 0: 296
lodging 121 0.82 FALSE 2 0: 520, 1: 42
stem.cankers 38 0.94 FALSE 4 0: 379, 3: 191, 1: 39, 2: 36
canker.lesion 38 0.94 FALSE 4 0: 320, 2: 177, 1: 83, 3: 65
fruiting.bodies 106 0.84 FALSE 2 0: 473, 1: 104
ext.decay 38 0.94 FALSE 3 0: 497, 1: 135, 2: 13
mycelium 38 0.94 FALSE 2 0: 639, 1: 6
int.discolor 38 0.94 FALSE 3 0: 581, 1: 44, 2: 20
sclerotia 38 0.94 FALSE 2 0: 625, 1: 20
fruit.pods 84 0.88 FALSE 4 0: 407, 1: 130, 3: 48, 2: 14
fruit.spots 106 0.84 FALSE 4 0: 345, 4: 100, 1: 75, 2: 57
seed 92 0.87 FALSE 2 0: 476, 1: 115
mold.growth 92 0.87 FALSE 2 0: 524, 1: 67
seed.discolor 106 0.84 FALSE 2 0: 513, 1: 64
seed.size 92 0.87 FALSE 2 0: 532, 1: 59
shriveling 106 0.84 FALSE 2 0: 539, 1: 38
roots 31 0.95 FALSE 3 0: 551, 1: 86, 2: 15

We explore degeneracy of the predictors. Degeneracy of a predictor \(X\) is defined by Kuhn and Johnson as either:

\[ \lambda(X) = \frac{N_{1}}{N_{2}} \geq \lambda = 20\]

Degenerate predictors

predictor n_missing N1 N2 lambdaX
mycelium 38 639 6 106.5
sclerotia 38 625 20 31.2
leaf.mild 108 535 20 26.8
shriveling 106 539 38 14.2
int.discolor 38 581 44 13.2
lodging 121 520 42 12.4
leaf.malf 84 554 45 12.3
seed.size 92 532 59 9.0
seed.discolor 106 513 64 8.0
leaves 0 606 77 7.9
mold.growth 92 524 67 7.8
roots 31 551 86 6.4
leaf.shread 100 487 96 5.1
fruiting.bodies 106 473 104 4.5
seed 92 476 115 4.1
precip 38 459 112 4.1
ext.decay 38 497 135 3.7
fruit.spots 106 345 100 3.5
hail 121 435 127 3.4
fruit.pods 84 407 130 3.1
stem.cankers 38 379 191 2.0
plant.growth 16 441 226 2.0
temp 30 374 199 1.9
canker.lesion 38 320 177 1.8
sever 121 322 195 1.7
leaf.marg 84 357 221 1.6
leaf.halo 84 342 221 1.5
leaf.size 84 327 221 1.5
seed.tmt 121 305 222 1.4
stem 16 371 296 1.3
area.dam 1 227 187 1.2
plant.stand 36 354 293 1.2
date 1 149 131 1.1
germ 112 213 193 1.1
Class 0 92 91 1.0
crop.hist 16 219 218 1.0

Based on the sorted table above, we see that the top 3 predictors mycelium, sclerotia, leaf.mild violate the degeneracy rule. Also, there are no constant predictors. Thus, we have found all degenerate predictors.

B. Missing Predictor Patterns

Our calculations in R shown the percentage of observations (rows) which at least one column is missing values is 17.715959%. By contrast, the percentage of cells with missing values in the data frame is 9.5046364%.

Missingness is not Uniform among Predictors

As the skim output shows, the predictors with missing values vary by degree. We plot this data below shown the count of missing values for each column of the Soybean dataset. The top columns for missing data are sever, seed.tmt, lodging, hail with 124 missing. These columns have 82.3% completeness. In contrast, date, area.dam are missing 1 values and 99.9% complete.

Missingness distribution to values of Class are non-random.

I use the finalfit package to visually display the pattern of missing data between Class and each of the 35 predictors. To do this, I run a for loop to output a graph that compares each predictor’s missing data impact on each value of Class. \(\color{red}{\text{The plots are displayed in the subsection after this.}}\)

Based on visual inspection of the plot gallery, I conclude the following 32 predictors have a non-random distribution of missing values among the classes of the Class variable. These predictors are not Missing Completely at Random (MCAR) data. It is not possible to distinguish MAR and MNAR from the available information however.

The table below shows the predictors with missing values and the number of classes impacted by the missing values.

Predictor Distribution of Missing Values

predictor support
crop.hist 1
plant.growth 1
stem 1
fruit.pods 2
roots 2
temp 2
canker.lesion 3
ext.decay 3
int.discolor 3
leaf.halo 3
leaf.marg 3
leaf.size 3
leaft.malf 3
mold.growth 3
mycelium 3
plant.stand 3
precip 3
sclerotia 3
seed 3
seed.size 3
stem.cankers 3
fruit.bodies 4
fruit.spots 4
leaf.shread 4
seed.discolor 4
shriveling 4
germ 5
hail 5
leaf.mild 5
lodging 5
seed.tmt 5
sever 5

For example, the predictor crop.hist has 16 missing values occuring only in 1 class 2-4-d-injury. If the missing values are uniformly distributed across observations, the chance that only 1 class is affected is nearly zero.

Assuming the outcome is from a uniform probability distribution, we can explicitly calculate the mathematical probability of 16 missing values only occurring in the 2-4-d-injury class which has 16 members. Let the number of observations be \(N=683\). Let \(C\) be the class of 2-4-d-injury. Let \(M\) represent the number of missing values associated with the predictor crop.hist.

\[ P[\text{missing only in 2-4-d-injury}] =\left( 1 - \frac{|C|}{N} \right)^{M} = \left( 1 - \frac{16}{683} \right)^{16} \approx 8.22 \times 10^{-27}\] So we can strongly reject MCAR. A statistical test for the predictors which cover more than 1 class would similarly reject MCAR.

C. A strategy for handling missing data

My preferred strategy is to use robust models that can handle significant amounts of missing data. This includes ensemble methods like the random forest.

However, I would still remove the 3 degenerate variables described earlier.

I use the randomForest parameter na.action equal to na.roughfix which imputes with the median value.

The resulting model output shows that OOB error rate is 4.98% with \(mtry=5\) and \(ntree=500\) trees. Without excessive model tuning work, it seems that a 95% in-sample prediction rate is a good starting point. That seems pretty good.

## 
## Call:
##  randomForest(formula = Class ~ ., data = Soybean2, mtry = 5,      importance = TRUE, ntree = 500, na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 4.98%
## Confusion matrix:
##                             2-4-d-injury alternarialeaf-spot anthracnose
## 2-4-d-injury                          16                   0           0
## alternarialeaf-spot                    0                  89           0
## anthracnose                            0                   0          44
## bacterial-blight                       0                   0           0
## bacterial-pustule                      0                   0           0
## brown-spot                             0                   1           0
## brown-stem-rot                         0                   0           0
## charcoal-rot                           0                   0           0
## cyst-nematode                          0                   0           0
## diaporthe-pod-&-stem-blight            0                   0           0
## diaporthe-stem-canker                  0                   0           0
## downy-mildew                           0                   0           0
## frog-eye-leaf-spot                     0                  22           0
## herbicide-injury                       0                   0           0
## phyllosticta-leaf-spot                 0                   0           0
## phytophthora-rot                       0                   0           0
## powdery-mildew                         0                   0           0
## purple-seed-stain                      0                   0           0
## rhizoctonia-root-rot                   0                   0           0
##                             bacterial-blight bacterial-pustule brown-spot
## 2-4-d-injury                               0                 0          0
## alternarialeaf-spot                        0                 0          1
## anthracnose                                0                 0          0
## bacterial-blight                          20                 0          0
## bacterial-pustule                          1                19          0
## brown-spot                                 0                 0         88
## brown-stem-rot                             0                 0          0
## charcoal-rot                               0                 0          0
## cyst-nematode                              0                 0          0
## diaporthe-pod-&-stem-blight                0                 0          0
## diaporthe-stem-canker                      0                 0          0
## downy-mildew                               0                 0          0
## frog-eye-leaf-spot                         0                 0          2
## herbicide-injury                           0                 0          0
## phyllosticta-leaf-spot                     0                 0          3
## phytophthora-rot                           0                 0          0
## powdery-mildew                             0                 0          0
## purple-seed-stain                          0                 0          0
## rhizoctonia-root-rot                       0                 0          0
##                             brown-stem-rot charcoal-rot cyst-nematode
## 2-4-d-injury                             0            0             0
## alternarialeaf-spot                      0            0             0
## anthracnose                              0            0             0
## bacterial-blight                         0            0             0
## bacterial-pustule                        0            0             0
## brown-spot                               0            0             0
## brown-stem-rot                          44            0             0
## charcoal-rot                             0           20             0
## cyst-nematode                            0            0            14
## diaporthe-pod-&-stem-blight              0            0             0
## diaporthe-stem-canker                    0            0             0
## downy-mildew                             0            0             0
## frog-eye-leaf-spot                       0            0             0
## herbicide-injury                         0            0             0
## phyllosticta-leaf-spot                   0            0             0
## phytophthora-rot                         0            0             0
## powdery-mildew                           0            0             0
## purple-seed-stain                        0            0             0
## rhizoctonia-root-rot                     0            0             0
##                             diaporthe-pod-&-stem-blight diaporthe-stem-canker
## 2-4-d-injury                                          0                     0
## alternarialeaf-spot                                   0                     0
## anthracnose                                           0                     0
## bacterial-blight                                      0                     0
## bacterial-pustule                                     0                     0
## brown-spot                                            0                     0
## brown-stem-rot                                        0                     0
## charcoal-rot                                          0                     0
## cyst-nematode                                         0                     0
## diaporthe-pod-&-stem-blight                          15                     0
## diaporthe-stem-canker                                 0                    20
## downy-mildew                                          0                     0
## frog-eye-leaf-spot                                    0                     0
## herbicide-injury                                      0                     0
## phyllosticta-leaf-spot                                0                     0
## phytophthora-rot                                      0                     0
## powdery-mildew                                        0                     0
## purple-seed-stain                                     0                     0
## rhizoctonia-root-rot                                  0                     0
##                             downy-mildew frog-eye-leaf-spot herbicide-injury
## 2-4-d-injury                           0                  0                0
## alternarialeaf-spot                    0                  1                0
## anthracnose                            0                  0                0
## bacterial-blight                       0                  0                0
## bacterial-pustule                      0                  0                0
## brown-spot                             0                  1                0
## brown-stem-rot                         0                  0                0
## charcoal-rot                           0                  0                0
## cyst-nematode                          0                  0                0
## diaporthe-pod-&-stem-blight            0                  0                0
## diaporthe-stem-canker                  0                  0                0
## downy-mildew                          20                  0                0
## frog-eye-leaf-spot                     0                 67                0
## herbicide-injury                       0                  0                8
## phyllosticta-leaf-spot                 0                  0                0
## phytophthora-rot                       0                  0                0
## powdery-mildew                         0                  0                0
## purple-seed-stain                      0                  0                0
## rhizoctonia-root-rot                   0                  0                0
##                             phyllosticta-leaf-spot phytophthora-rot
## 2-4-d-injury                                     0                0
## alternarialeaf-spot                              0                0
## anthracnose                                      0                0
## bacterial-blight                                 0                0
## bacterial-pustule                                0                0
## brown-spot                                       2                0
## brown-stem-rot                                   0                0
## charcoal-rot                                     0                0
## cyst-nematode                                    0                0
## diaporthe-pod-&-stem-blight                      0                0
## diaporthe-stem-canker                            0                0
## downy-mildew                                     0                0
## frog-eye-leaf-spot                               0                0
## herbicide-injury                                 0                0
## phyllosticta-leaf-spot                          17                0
## phytophthora-rot                                 0               88
## powdery-mildew                                   0                0
## purple-seed-stain                                0                0
## rhizoctonia-root-rot                             0                0
##                             powdery-mildew purple-seed-stain
## 2-4-d-injury                             0                 0
## alternarialeaf-spot                      0                 0
## anthracnose                              0                 0
## bacterial-blight                         0                 0
## bacterial-pustule                        0                 0
## brown-spot                               0                 0
## brown-stem-rot                           0                 0
## charcoal-rot                             0                 0
## cyst-nematode                            0                 0
## diaporthe-pod-&-stem-blight              0                 0
## diaporthe-stem-canker                    0                 0
## downy-mildew                             0                 0
## frog-eye-leaf-spot                       0                 0
## herbicide-injury                         0                 0
## phyllosticta-leaf-spot                   0                 0
## phytophthora-rot                         0                 0
## powdery-mildew                          20                 0
## purple-seed-stain                        0                20
## rhizoctonia-root-rot                     0                 0
##                             rhizoctonia-root-rot class.error
## 2-4-d-injury                                   0  0.00000000
## alternarialeaf-spot                            0  0.02197802
## anthracnose                                    0  0.00000000
## bacterial-blight                               0  0.00000000
## bacterial-pustule                              0  0.05000000
## brown-spot                                     0  0.04347826
## brown-stem-rot                                 0  0.00000000
## charcoal-rot                                   0  0.00000000
## cyst-nematode                                  0  0.00000000
## diaporthe-pod-&-stem-blight                    0  0.00000000
## diaporthe-stem-canker                          0  0.00000000
## downy-mildew                                   0  0.00000000
## frog-eye-leaf-spot                             0  0.26373626
## herbicide-injury                               0  0.00000000
## phyllosticta-leaf-spot                         0  0.15000000
## phytophthora-rot                               0  0.00000000
## powdery-mildew                                 0  0.00000000
## purple-seed-stain                              0  0.00000000
## rhizoctonia-root-rot                          20  0.00000000

Code

We summarize all the R code used in this project in this appendix for ease of reading.

knitr::opts_chunk$set(echo = FALSE)
 .codeExample{
     background-color: lightgray
 }
library(knitr)
library(tidyverse)
library(kableExtra)
library(cowplot)
library(skimr)
library(finalfit)
library(mlbench)
data(Glass)
#str(Glass)
library(GGally)

ggpairs( Glass[1:9])

# Add a Row number for each observation
Glass2 <- Glass %>% mutate( Id = row_number())


# Refractive index plot - outliers as R.  IQR = 4 (fat whiskers), overlays jittered dots

pRI <- Glass2 %>% ggplot() + 
  geom_boxplot(aes(Type, RI, fill = as.factor(Type)), outlier.color = "red", coef = 4) + 
  scale_fill_brewer( type="seq" , palette =  "Spectral") + 
  ggtitle("Refractive Index") +
  geom_jitter(aes(Type, RI), alpha = 0.4 , position= position_jitter(width=0.2))

# Stores outliers in a data frame

outliers  <- Glass2 %>% filter(Type==2, RI > 1.530 ) %>% 
  mutate( Row = Id,  Field = "RI", Cause = "High") %>%
  select( Row, Type, Field, Cause) 

outliers <- Glass2 %>% filter(Type==6, RI < 1.515 ) %>% 
  mutate( Row = Id,  Field = "RI", Cause = "Low") %>%
  select( Row, Type, Field, Cause) %>% bind_rows(outliers)

outliers <- Glass2 %>% filter(Type==7, RI < 1.515 ) %>% 
  mutate( Row = Id,  Field = "RI", Cause = "Low") %>%
  select( Row, Type, Field, Cause) %>% bind_rows(outliers)

#outliers

pNa <- Glass2 %>% ggplot() + 
  geom_boxplot(aes(Type, Na, fill = as.factor(Type)), outlier.color = "red", coef = 4) + 
  scale_fill_brewer( type="seq" , palette =  "Spectral") + 
  ggtitle("Na  - Sodium") +
  geom_jitter(aes(Type, Na), alpha = 0.4 , position= position_jitter(width=0.2))

outliers <-   Glass2 %>% filter(Type==6, Na > 17 ) %>% 
  mutate( Row = Id,  Field = "Na", Cause = "Low") %>%
  select( Row, Type, Field, Cause) %>% bind_rows(outliers)

#outliers

pMg <- Glass2 %>% ggplot() + 
  geom_boxplot(aes(Type, Mg, fill = as.factor(Type)), outlier.color = "red", coef = 4) + 
  scale_fill_brewer( type="seq" , palette =  "Spectral") + 
  ggtitle("Mg  - Magnesium ") +
  geom_jitter(aes(Type, Mg), alpha = 0.4 , position= position_jitter(width=0.2))
outliers <- 
  Glass2 %>% filter(Type==1, Mg > 4.4 ) %>% 
  mutate( Row = Id,  Field = "Mg", Cause = "High") %>%
  select( Row, Type, Field, Cause) %>% bind_rows(outliers)


pAl <- Glass2 %>% ggplot() + 
  geom_boxplot(aes(Type, Al, fill = as.factor(Type)), outlier.color = "red", coef = 4) + 
  scale_fill_brewer( type="seq" , palette =  "Spectral") + 
  ggtitle("Al - Aluminum ") +
  geom_jitter(aes(Type, Al), alpha = 0.4 , position= position_jitter(width=0.2))

pSi <- Glass2 %>% ggplot() + 
  geom_boxplot(aes(Type, Si, fill = as.factor(Type)), outlier.color = "red", coef = 4) + 
  scale_fill_brewer( type="seq" , palette =  "Spectral") + 
  ggtitle("Si - Silicon") +
  geom_jitter(aes(Type, Si), alpha = 0.4 , position= position_jitter(width=0.2))

outliers <- 
  Glass2 %>% filter(Type==6, Si > 75 ) %>% 
  mutate( Row = Id,  Field = "Si", Cause = "High") %>%
  select( Row, Type, Field, Cause) %>% bind_rows(outliers)

pK <- Glass2 %>% ggplot() + 
  geom_boxplot(aes(Type, K, fill = as.factor(Type)), outlier.color = "red", coef = 4) + 
  scale_fill_brewer( type="seq" , palette =  "Spectral") + 
  ggtitle("K - Potassium") +
  geom_jitter(aes(Type, K), alpha = 0.4 , position= position_jitter(width=0.2))



outliers <- 
  Glass2 %>% filter(Type==5, K > 6 ) %>% 
  mutate( Row = Id,  Field = "K", Cause = "High") %>%
  select( Row, Type, Field, Cause) %>% bind_rows(outliers)

#outliers

pCa <- Glass2 %>% ggplot() + 
  geom_boxplot(aes(Type, Ca, fill = as.factor(Type)), outlier.color = "red", coef = 4) + 
  scale_fill_brewer( type="seq" , palette =  "Spectral") + 
  ggtitle("Ca - Calcium") +
  geom_jitter(aes(Type, Ca), alpha = 0.4 , position= position_jitter(width=0.2))

outliers <- 
  Glass2 %>% filter(Type==2, Ca > 15  ) %>% 
  mutate( Row = Id,  Field = "Ca", Cause = "High") %>%
  select( Row, Type, Field, Cause) %>% bind_rows(outliers)


outliers <- 
  Glass2 %>% filter(Type==7, Ca < 5.5  ) %>% 
  mutate( Row = Id,  Field = "Ca", Cause = "Low") %>%
  select( Row, Type, Field, Cause) %>% bind_rows(outliers)

#outliers


pBa <- Glass2 %>% ggplot() + 
  geom_boxplot(aes(Type, Ba, fill = as.factor(Type)), outlier.color = "red", coef = 4) + 
  scale_fill_brewer( type="seq" , palette =  "Spectral") + 
  ggtitle("Ba - Barium") +
  geom_jitter(aes(Type, Ba), alpha = 0.4 , position= position_jitter(width=0.2))

outliers <- 
  Glass2 %>% filter( ( Type==2 &  Ba > 3) | ( Type==7 & Ba > 2.5) ) %>% 
  mutate( Row = Id,  Field = "Ba", Cause = "High") %>%
  select( Row, Type, Field, Cause) %>% bind_rows(outliers)



pFe <- Glass2 %>% ggplot() + 
  geom_boxplot(aes(Type, Fe, fill = as.factor(Type)), outlier.color = "red", coef = 4) + 
  scale_fill_brewer( type="seq" , palette =  "Spectral") + 
  ggtitle("Fe - Iron") +
  geom_jitter(aes(Type, Fe), alpha = 0.4 , position= position_jitter(width=0.2))

outliers <- 
  Glass2 %>% filter( ( Type==5 &  Fe > 0.4) | ( Type==3 & Fe > 0.3) ) %>% 
  mutate( Row = Id,  Field = "Fe", Cause = "High") %>%
  select( Row, Type, Field, Cause) %>% bind_rows(outliers)

#outliers

tlp = theme(legend.position = "none") 
c2  = coord_fixed(ratio=.5)

cowplot::plot_grid(pRI + tlp , pMg + tlp + c2, pBa + tlp  , 
                   pCa + tlp , pNa + tlp , pK  + tlp + c2, 
                   pSi + tlp , pAl + tlp , pFe + tlp , 
                   ncol = 3 ) 

outliers %>% mutate( IsHigh = ifelse(Cause=="High", 1, -1)) %>% 
  arrange(Row) %>% 
  select(Row, Type, Field, IsHigh ) %>% 
  pivot_wider(id_cols= c(Row, Type), names_from = "Field", values_from = "IsHigh") %>%
  replace_na( list(Mg=0, Ba = 0, RI = 0, Ca = 0, Fe = 0, K = 0, Si = 0, Na = 0)) %>%
  kable(row.names= FALSE , caption='List of Outliers') %>% 
  kable_styling(bootstrap_options = c("hover", "striped"))

bpRI = Glass2 %>% ggplot(aes(x=RI)) + geom_boxplot() + geom_jitter(aes(y=RI), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpMg = Glass2 %>% ggplot(aes(x=Mg)) + geom_boxplot() + geom_jitter(aes(y=Mg), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpBa = Glass2 %>% ggplot(aes(x=Ba)) + geom_boxplot() + geom_jitter(aes(y=Ba), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpCa = Glass2 %>% ggplot(aes(x=Ca)) + geom_boxplot() + geom_jitter(aes(y=Ca), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpFe = Glass2 %>% ggplot(aes(x=Fe)) + geom_boxplot() + geom_jitter(aes(y=Fe), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpK  = Glass2 %>% ggplot(aes(x=K)) + geom_boxplot() + geom_jitter(aes(y=K), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpSi = Glass2 %>% ggplot(aes(x=Si)) + geom_boxplot() + geom_jitter(aes(y=Si), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpNa = Glass2 %>% ggplot(aes(x=Na)) + geom_boxplot() + geom_jitter(aes(y=Na), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpAl = Glass2 %>% ggplot(aes(x=Al)) + geom_boxplot() + geom_jitter(aes(y=Al), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))


tlp = theme(legend.position = "none") 
c2  = coord_fixed(ratio=.5)

cowplot::plot_grid(bpRI + tlp , bpMg + tlp + c2, bpBa + tlp  , 
                   bpCa + tlp , bpNa + tlp , bpK  + tlp + c2, 
                   bpSi + tlp , bpAl + tlp , bpFe + tlp , 
                   ncol = 3 ) 

set.seed(102)

library(randomForest)

rfm = randomForest::randomForest( Type ~ RI + Mg + Ba + Ca + Fe + K + Si + Na + Al , 
                                  data = Glass2 , 
                                  mtry = 3,  
                                  importance = TRUE , 
                                  ntree = 1000
                                  )

rfm

varImpPlot(rfm)
# Debug
# Remove eval=false later

Glass2 %>% mutate( K  = sqrt(K) ,
                   Ba = sqrt(Ba) ,
                   Mg = sqrt(Mg) ,
                   Al = sqrt(Al) ,
                   Fe = sqrt(Fe) ,
                   RI = sqrt(RI)
                   ) -> Glass3



bpRI = Glass3 %>% ggplot(aes(x=RI)) + geom_boxplot() + geom_jitter(aes(y=RI), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpMg = Glass3 %>% ggplot(aes(x=Mg)) + geom_boxplot() + geom_jitter(aes(y=Mg), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpBa = Glass3 %>% ggplot(aes(x=Ba)) + geom_boxplot() + geom_jitter(aes(y=Ba), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpCa = Glass3 %>% ggplot(aes(x=Ca)) + geom_boxplot() + geom_jitter(aes(y=Ca), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpFe = Glass3 %>% ggplot(aes(x=Fe)) + geom_boxplot() + geom_jitter(aes(y=Fe), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpK  = Glass3 %>% ggplot(aes(x=K)) + geom_boxplot() + geom_jitter(aes(y=K), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpSi = Glass3 %>% ggplot(aes(x=Si)) + geom_boxplot() + geom_jitter(aes(y=Si), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpNa = Glass3 %>% ggplot(aes(x=Na)) + geom_boxplot() + geom_jitter(aes(y=Na), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpAl = Glass3 %>% ggplot(aes(x=Al)) + geom_boxplot() + geom_jitter(aes(y=Al), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))


tlp = theme(legend.position = "none") 
c2  = coord_fixed(ratio=.5)

cowplot::plot_grid(bpRI + tlp , bpMg + tlp + c2, bpBa + tlp  , 
                   bpCa + tlp , bpNa + tlp , bpK  + tlp + c2, 
                   bpSi + tlp , bpAl + tlp , bpFe + tlp , 
                   ncol = 3 ) 


# Debug
# Remove eval=false later

rfm = randomForest::randomForest( Type ~ RI + Mg + Ba + Ca + Fe + K + Si + Na + Al , 
                                  data = Glass3 , 
                                  mtry = 3,  
                                  importance = TRUE , 
                                  ntree = 1000
                                  )

rfm

varImpPlot(rfm)
data(Soybean)
( soyskim = skimr::skim(Soybean) )

# Define the degeneracy condition

# skim extracts the top 1, 2 and 3 values and their frequencies as count for each column
aa = soyskim$factor.top_counts

# We parse the relevant frequencies based on the comma separated list of colon separated key-values
bb = str_match_all(aa, "^\\w+:\\s+(\\d+),\\s+\\w+:\\s+(\\d+)" )

N1 = sapply(bb, function(x) strtoi(x[[2]]))

N2 = sapply(bb, function(x) strtoi(x[[3]]))

lambdaX = N1 / N2

degeneracy = data.frame( predictor = soyskim$skim_variable, 
                         n_missing = soyskim$n_missing ,
                         N1 =  N1 ,
                         N2 =  N2 ,
                         lambdaX = lambdaX
                         )

degeneracy %>% arrange(desc(lambdaX)) %>% kable(caption="Degenerate predictors", digits = 1) %>% kable_styling(bootstrap_options = c("hover", "striped"))

# Observations with at least one column NA or ""
# Trick: rowSums returns the missing columns per row.
cells_with_na = rowSums(is.na(Soybean) | Soybean == "") 
fraction_with_na = sum( cells_with_na > 0 ) / nrow(Soybean)
fraction_cells_na = sum( cells_with_na ) / ( nrow(Soybean) * ncol(Soybean))

soyskim %>% 
  arrange(desc(n_missing)) %>% ggplot(aes(y=reorder(skim_variable, n_missing), x=n_missing) ) + geom_col() +
  labs(title="Variables Sorted by number of missing values", subtitle = "from Soybean") +
  ylab("Variable name") + xlab("Number Missing")


NonMCAR_Predictors = data.frame( 
 predictor = c( 'plant.stand', 'precip' , 'temp', 'hail', 'crop.hist',
                'sever', 'seed.tmt' , 'germ' , 'plant.growth' ,  'leaf.halo' ,
                'leaf.marg' , 'leaf.size', 'leaf.shread' , 'leaft.malf' , 'leaf.mild' ,
                'stem' , 'lodging' ,  'stem.cankers', 'canker.lesion' , 'fruit.bodies' ,
                'ext.decay' , 'mycelium' , 'int.discolor', 'sclerotia', 'fruit.pods' , 
                'fruit.spots' , 'seed' , 'mold.growth', 'seed.discolor' , 'seed.size' ,
                'shriveling' , 'roots' 
   ),
 support = c( 3 , 3,  2 , 5 , 1, 
              5 , 5 , 5,  1,  3 ,
              3 , 3 , 4 , 3 , 5 ,
              1 , 5 , 3 , 3 , 4 ,
              3 , 3 , 3 , 3 , 2 ,
              4 , 3 , 3 , 4 , 3,
              4 , 2
              )
)

NonMCAR_Predictors %>% 
    arrange(support, predictor) %>%
    kable(caption = "Predictor Distribution of Missing Values") %>%
    kable_styling(bootstrap_options = c("hover", "striped"))


# PRINT THE GALLERY OF CHARTS

explanatory_variables = soyskim$skim_variable[2:36]

for( v in explanatory_variables){
    print( Soybean %>% missing_pairs("Class", v) + 
       theme(axis.text.x = element_text(angle=90)) )
  
}



library(randomForest)

# Define a new dataset Soybean2 with removed variables.

Soybean %>% select(-mycelium, -sclerotia, -leaf.mild ) -> Soybean2
rfm = randomForest::randomForest( Class ~ . , 
                                  data = Soybean2, 
                                  mtry = 5,  
                                  importance = TRUE , 
                                  na.action = na.roughfix ,
                                  ntree = 500
                                  )

rfm

varImpPlot(rfm)