Exercises from Kuhn & Johnson, Applied Predictive Modeling, Chapter 3 (Data Pre-processing). All R code is collected in an appendix at the end for clarity.
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:
> library(mlbench)
> data(Glass)
> str(Glass)
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Do there appear to be any outliers in the data? Are any predictors skewed?
Are there any relevant transformations of one or more predictors that might improve the classification model?
The ggpairs correlogram shows the distribution of each variable together with the pairwise correlations and scatterplots.
The principal ingredients of glass are silicon (mean 72.7%), sodium (13.4%), and calcium (8.96%). The other elements appear in trace quantities, under 3% on average.
The strongest relationships are between RI (refractive index) and the other elements:

- Ca (Calcium) has an 80% correlation
- Si (Silicon) has a -54% correlation
- Al (Aluminum) has a -40% correlation
Certain elements are absent from many samples, creating bimodal distributions. For example:

- Ba (Barium) is absent in 82% of samples.
- K (Potassium) is absent in 14% of samples.

The chemical relationship between these trace elements, the refractive index, and the type of glass would need to be explored further with a chemist.
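As a quick numeric check of these summaries (a minimal sketch; the full ggpairs exploration appears in the code appendix), the correlations with RI and the fraction of zero concentrations can be computed directly:

```r
library(mlbench)
data(Glass)

# Correlation of the refractive index with the elements named above
round(cor(Glass$RI, Glass[, c("Ca", "Si", "Al")]), 2)

# Share of samples in which each trace element is completely absent
# (e.g. Ba is absent in roughly 82% of samples, K in roughly 14%)
round(colMeans(Glass[, c("Ba", "K", "Fe", "Mg")] == 0), 2)
```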
Using the box plots below and the density plots in the previous section, we observe that:
- RI has a slight right skew.
- Na is symmetric.
- Mg is bimodal with a cluster at zero. The non-zero observations appear left skewed.
- Ba is bimodal with a cluster at zero. The non-zero observations are symmetric within the subpopulations by Type. However, we should not draw too many conclusions because Ba is absent in 82% of samples.
- Fe is bimodal with a cluster at zero. The non-zero observations appear right skewed in aggregate, but symmetric within their Type subgroups.
- Al is present in all samples. It has a slight right skew. The subpopulations are relatively symmetric. Simpson's paradox is applicable here.
- Si is present in all samples and symmetric. Within subpopulations, Si is both left and right skewed.
- Ca is present in all samples but left and right skewed at the subpopulation level.

Because Glass lacks a unique row identifier, we begin by inserting an Id column to assist in outlier identification.
We store the outliers by their Id, Field, and Cause; the same row can of course appear multiple times.
Let’s begin by plotting boxplots of the refraction index and each chemical component of the glass samples faceted by Type.
The outliers are identified using customized grouped boxplots. For each response variable (i.e. the elements and refractive index) we generate a grouped boxplot. Observations are grouped by the classification variable Type.
Examining the subgroup distributions by Type, rather than the pooled distribution, is necessary to avoid Simpson's Paradox when judging skewness. I alter the boxplots in a few ways to help choose outliers: the whiskers are extended to 4 times the IQR (coef = 4), the points flagged as outliers are drawn in red, and all observations are overlaid as jittered points.
These alterations make the box plots more usable, but I still omit or include some candidate outliers based on visual inspection.
My final list contains 11 observations that I regard as potential outliers. To illustrate the selection within one boxplot: for Mg (Magnesium), Type 1 has a high-value outlier around 5. The Type 7 population is bimodal; most observations have no Mg, but some Type 7 samples do contain Mg and are marked in red. I consider none of the Type 7 points to be outliers because they appear clustered.
In the table below, I show the row number, Type, and outlier status with respect to each element and the refractive index (1 = unusually high, -1 = unusually low, 0 = not flagged).
We observe that some observations have correlated outliers:
List of Outliers
| Row | Type | Mg | Ba | RI | Ca | Fe | K | Si | Na |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 107 | 2 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 108 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 163 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 172 | 5 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 173 | 5 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 175 | 5 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 181 | 6 | 0 | 0 | -1 | 0 | 0 | 0 | 0 | 0 |
| 185 | 6 | 0 | 0 | -1 | 0 | 0 | 0 | 1 | 1 |
| 186 | 7 | 0 | 0 | -1 | -1 | 0 | 0 | 0 | 0 |
| 208 | 7 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
To evaluate whether variable transformations improve prediction accuracy, I test their impact on an actual classification model. The authors imply that one can figure this out without experimental evidence; I don't agree. I use a random forest algorithm to classify the observations and compare the in-sample predictive accuracy and the variable importance plot before and after the transformations.
I will show that the (square root) transformations have no material impact on random forest accuracy or variable importance.
In the output below, I show a random forest predicting Type from all other variables (excluding the Id column). I set the parameters to \(N=1000\) trees and \(mtry=3\) variables tried at each split. Accuracy results were robust to changes in these parameters.
We observe an OOB error rate of 19.6% for the original (untransformed) model. Magnesium, the refractive index, aluminum, and calcium are the most useful variables for accurate prediction, while silicon and iron are the least useful.
##
## Call:
## randomForest(formula = Type ~ RI + Mg + Ba + Ca + Fe + K + Si + Na + Al, data = Glass2, mtry = 3, importance = TRUE, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 19.63%
## Confusion matrix:
## 1 2 3 5 6 7 class.error
## 1 62 6 2 0 0 0 0.1142857
## 2 10 62 1 1 1 1 0.1842105
## 3 6 4 7 0 0 0 0.5882353
## 5 0 3 0 9 0 1 0.3076923
## 6 0 2 0 0 7 0 0.2222222
## 7 1 3 0 0 0 25 0.1379310
In the model below, we construct a new data set in which the variables \(K, Ba, Mg, Al, Fe, RI\) receive square-root transformations. We observe an OOB error rate of 19.2% for the transformed model. Magnesium, the refractive index, aluminum, and calcium are still the most useful variables for accurate prediction, and silicon and iron are still the least useful. I judge the improvement in predictive accuracy to be insignificant.
##
## Call:
## randomForest(formula = Type ~ RI + Mg + Ba + Ca + Fe + K + Si + Na + Al, data = Glass3, mtry = 3, importance = TRUE, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 19.16%
## Confusion matrix:
## 1 2 3 5 6 7 class.error
## 1 63 6 1 0 0 0 0.1000000
## 2 9 63 1 1 1 1 0.1710526
## 3 7 4 6 0 0 0 0.6470588
## 5 0 3 0 9 0 1 0.3076923
## 6 0 2 0 0 7 0 0.2222222
## 7 1 3 0 0 0 25 0.1379310
I conclude that variable transformations \(\color{red}{\text{don't}}\) help classification accuracy when the model is robust to such transformations, and the random forest is fairly robust to monotone transformations of the predictors. This need not be true for regression models such as OLS or multinomial models, where the distribution of the errors matters more.
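For reference, the kinds of transformations the book has in mind can also be estimated automatically. The sketch below uses caret::preProcess (assuming the caret package is installed; it is not used elsewhere in this write-up) with a Yeo-Johnson transformation, since Box-Cox cannot handle the zero-heavy trace elements:

```r
library(caret)
library(mlbench)
data(Glass)

# Estimate a Yeo-Johnson transformation (plus centering/scaling) on the
# nine numeric predictors; column 10 is the Type outcome.
pp <- preProcess(Glass[, 1:9], method = c("YeoJohnson", "center", "scale"))
Glass_trans <- predict(pp, Glass[, 1:9])
summary(Glass_trans)
```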
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
> library(mlbench)
> data(Soybean)
> ## See ?Soybean for details
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
We analyze the data frame with the skimr package to answer basic questions about the frequency distributions of the predictors. The predictor name is listed under skim_variable. In the largest table of the skim output below, the top_counts column gives the frequency distribution of up to the top four values of each predictor. For example, the column plant.stand takes on 2 values (0 occurs 354 times and 1 occurs 293 times).
| Name | Soybean |
|---|---|
| Number of rows | 683 |
| Number of columns | 36 |
| Column type frequency: | |
| factor | 36 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Class | 0 | 1.00 | FALSE | 19 | bro: 92, alt: 91, fro: 91, phy: 88 |
| date | 1 | 1.00 | FALSE | 7 | 5: 149, 4: 131, 3: 118, 2: 93 |
| plant.stand | 36 | 0.95 | TRUE | 2 | 0: 354, 1: 293 |
| precip | 38 | 0.94 | TRUE | 3 | 2: 459, 1: 112, 0: 74 |
| temp | 30 | 0.96 | TRUE | 3 | 1: 374, 2: 199, 0: 80 |
| hail | 121 | 0.82 | FALSE | 2 | 0: 435, 1: 127 |
| crop.hist | 16 | 0.98 | FALSE | 4 | 2: 219, 3: 218, 1: 165, 0: 65 |
| area.dam | 1 | 1.00 | FALSE | 4 | 1: 227, 3: 187, 2: 145, 0: 123 |
| sever | 121 | 0.82 | FALSE | 3 | 1: 322, 0: 195, 2: 45 |
| seed.tmt | 121 | 0.82 | FALSE | 3 | 0: 305, 1: 222, 2: 35 |
| germ | 112 | 0.84 | TRUE | 3 | 1: 213, 2: 193, 0: 165 |
| plant.growth | 16 | 0.98 | FALSE | 2 | 0: 441, 1: 226 |
| leaves | 0 | 1.00 | FALSE | 2 | 1: 606, 0: 77 |
| leaf.halo | 84 | 0.88 | FALSE | 3 | 2: 342, 0: 221, 1: 36 |
| leaf.marg | 84 | 0.88 | FALSE | 3 | 0: 357, 2: 221, 1: 21 |
| leaf.size | 84 | 0.88 | TRUE | 3 | 1: 327, 2: 221, 0: 51 |
| leaf.shread | 100 | 0.85 | FALSE | 2 | 0: 487, 1: 96 |
| leaf.malf | 84 | 0.88 | FALSE | 2 | 0: 554, 1: 45 |
| leaf.mild | 108 | 0.84 | FALSE | 3 | 0: 535, 1: 20, 2: 20 |
| stem | 16 | 0.98 | FALSE | 2 | 1: 371, 0: 296 |
| lodging | 121 | 0.82 | FALSE | 2 | 0: 520, 1: 42 |
| stem.cankers | 38 | 0.94 | FALSE | 4 | 0: 379, 3: 191, 1: 39, 2: 36 |
| canker.lesion | 38 | 0.94 | FALSE | 4 | 0: 320, 2: 177, 1: 83, 3: 65 |
| fruiting.bodies | 106 | 0.84 | FALSE | 2 | 0: 473, 1: 104 |
| ext.decay | 38 | 0.94 | FALSE | 3 | 0: 497, 1: 135, 2: 13 |
| mycelium | 38 | 0.94 | FALSE | 2 | 0: 639, 1: 6 |
| int.discolor | 38 | 0.94 | FALSE | 3 | 0: 581, 1: 44, 2: 20 |
| sclerotia | 38 | 0.94 | FALSE | 2 | 0: 625, 1: 20 |
| fruit.pods | 84 | 0.88 | FALSE | 4 | 0: 407, 1: 130, 3: 48, 2: 14 |
| fruit.spots | 106 | 0.84 | FALSE | 4 | 0: 345, 4: 100, 1: 75, 2: 57 |
| seed | 92 | 0.87 | FALSE | 2 | 0: 476, 1: 115 |
| mold.growth | 92 | 0.87 | FALSE | 2 | 0: 524, 1: 67 |
| seed.discolor | 106 | 0.84 | FALSE | 2 | 0: 513, 1: 64 |
| seed.size | 92 | 0.87 | FALSE | 2 | 0: 532, 1: 59 |
| shriveling | 106 | 0.84 | FALSE | 2 | 0: 539, 1: 38 |
| roots | 31 | 0.95 | FALSE | 3 | 0: 551, 1: 86, 2: 15 |
We explore the degeneracy of the predictors. Following Kuhn and Johnson, a predictor \(X\) is degenerate (near-zero variance) when it is constant, or when the frequency \(N_{1}\) of its most common value is much larger than the frequency \(N_{2}\) of its second most common value:
\[ \lambda(X) = \frac{N_{1}}{N_{2}} \geq 20 \]
Degenerate predictors
| predictor | n_missing | N1 | N2 | lambdaX |
|---|---|---|---|---|
| mycelium | 38 | 639 | 6 | 106.5 |
| sclerotia | 38 | 625 | 20 | 31.2 |
| leaf.mild | 108 | 535 | 20 | 26.8 |
| shriveling | 106 | 539 | 38 | 14.2 |
| int.discolor | 38 | 581 | 44 | 13.2 |
| lodging | 121 | 520 | 42 | 12.4 |
| leaf.malf | 84 | 554 | 45 | 12.3 |
| seed.size | 92 | 532 | 59 | 9.0 |
| seed.discolor | 106 | 513 | 64 | 8.0 |
| leaves | 0 | 606 | 77 | 7.9 |
| mold.growth | 92 | 524 | 67 | 7.8 |
| roots | 31 | 551 | 86 | 6.4 |
| leaf.shread | 100 | 487 | 96 | 5.1 |
| fruiting.bodies | 106 | 473 | 104 | 4.5 |
| seed | 92 | 476 | 115 | 4.1 |
| precip | 38 | 459 | 112 | 4.1 |
| ext.decay | 38 | 497 | 135 | 3.7 |
| fruit.spots | 106 | 345 | 100 | 3.5 |
| hail | 121 | 435 | 127 | 3.4 |
| fruit.pods | 84 | 407 | 130 | 3.1 |
| stem.cankers | 38 | 379 | 191 | 2.0 |
| plant.growth | 16 | 441 | 226 | 2.0 |
| temp | 30 | 374 | 199 | 1.9 |
| canker.lesion | 38 | 320 | 177 | 1.8 |
| sever | 121 | 322 | 195 | 1.7 |
| leaf.marg | 84 | 357 | 221 | 1.6 |
| leaf.halo | 84 | 342 | 221 | 1.5 |
| leaf.size | 84 | 327 | 221 | 1.5 |
| seed.tmt | 121 | 305 | 222 | 1.4 |
| stem | 16 | 371 | 296 | 1.3 |
| area.dam | 1 | 227 | 187 | 1.2 |
| plant.stand | 36 | 354 | 293 | 1.2 |
| date | 1 | 149 | 131 | 1.1 |
| germ | 112 | 213 | 193 | 1.1 |
| Class | 0 | 92 | 91 | 1.0 |
| crop.hist | 16 | 219 | 218 | 1.0 |
Based on the sorted table above, the top three predictors (mycelium, sclerotia, and leaf.mild) exceed the \(\lambda = 20\) cutoff and are therefore degenerate. There are no constant predictors, so these are all of the degenerate predictors.
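As a cross-check (a sketch assuming the caret package is available), caret::nearZeroVar implements the same frequency-ratio rule; with freqCut set to 20 it should flag the same three predictors:

```r
library(caret)
library(mlbench)
data(Soybean)

# freqCut is the N1/N2 ratio cutoff used above; saveMetrics = TRUE returns the
# ratio and the unique-value percentage for every column.
nzv <- nearZeroVar(Soybean, freqCut = 20, uniqueCut = 10, saveMetrics = TRUE)
nzv[nzv$nzv, ]
```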
Our calculations in R show that the percentage of observations (rows) with at least one missing value is 17.7%. By contrast, the percentage of cells with missing values in the data frame is 9.5%.
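These two percentages can be reproduced compactly (a sketch; the code appendix uses an equivalent rowSums construction):

```r
library(mlbench)
data(Soybean)

mean(rowSums(is.na(Soybean)) > 0)  # fraction of rows with at least one NA, ~0.177
mean(is.na(Soybean))               # fraction of missing cells overall, ~0.095
```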
As the skim output shows, the predictors vary in their degree of missingness. The plot below shows the count of missing values for each column of the Soybean data. The columns with the most missing data are sever, seed.tmt, lodging, and hail, each with 121 missing values (82.3% complete). In contrast, date and area.dam are each missing only 1 value and are 99.9% complete.
The pattern of missing values across Class is non-random. I use the finalfit package to visually display the pattern of missing data between Class and each of the 35 predictors. To do this, I run a for loop that outputs a graph comparing each predictor's missingness against each value of Class. \(\color{red}{\text{The plots are displayed in the subsection after this.}}\)
Based on visual inspection of the plot gallery, I conclude that the following 32 predictors have a non-random distribution of missing values across the classes of the Class variable; they are not Missing Completely at Random (MCAR). It is not possible, however, to distinguish MAR from MNAR with the available information.
The table below shows the predictors with missing values and the number of classes impacted by the missing values; a programmatic cross-check is sketched after the table.
Predictor Distribution of Missing Values
| predictor | support |
|---|---|
| crop.hist | 1 |
| plant.growth | 1 |
| stem | 1 |
| fruit.pods | 2 |
| roots | 2 |
| temp | 2 |
| canker.lesion | 3 |
| ext.decay | 3 |
| int.discolor | 3 |
| leaf.halo | 3 |
| leaf.marg | 3 |
| leaf.size | 3 |
| leaf.malf | 3 |
| mold.growth | 3 |
| mycelium | 3 |
| plant.stand | 3 |
| precip | 3 |
| sclerotia | 3 |
| seed | 3 |
| seed.size | 3 |
| stem.cankers | 3 |
| fruiting.bodies | 4 |
| fruit.spots | 4 |
| leaf.shread | 4 |
| seed.discolor | 4 |
| shriveling | 4 |
| germ | 5 |
| hail | 5 |
| leaf.mild | 5 |
| lodging | 5 |
| seed.tmt | 5 |
| sever | 5 |
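The support counts above were assembled from visual inspection of the plot gallery. A programmatic count of how many classes contain at least one missing value for each predictor can be sketched as follows (counts may differ slightly from the hand-built table):

```r
library(mlbench)
data(Soybean)

# For each predictor, count the number of classes with at least one missing value
support <- sapply(Soybean[, -1], function(x) {
  sum(tapply(is.na(x), Soybean$Class, any))
})
sort(support)
```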
For example, the predictor crop.hist has 16 missing values, occurring only in the single class 2-4-d-injury. If the missing values were uniformly distributed across observations, the chance that only one class would be affected is essentially zero.
Assuming missingness is assigned uniformly at random across observations, we can explicitly calculate the probability that all of the missing values occur in the 2-4-d-injury class, which has 16 members. Let the number of observations be \(N=683\), let \(C\) be the 2-4-d-injury class, and let \(M=16\) be the number of missing values for the predictor crop.hist.
\[ P[\text{all missing values fall in 2-4-d-injury}] = \left( \frac{|C|}{N} \right)^{M} = \left( \frac{16}{683} \right)^{16} \approx 8.22 \times 10^{-27}\] So we can strongly reject MCAR. A statistical test for the predictors whose missing values span more than one class would similarly reject MCAR.
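Both the cross-tabulation and the probability can be checked in a couple of lines:

```r
library(mlbench)
data(Soybean)

# All 16 missing crop.hist values fall in the 2-4-d-injury class
table(Soybean$Class, is.na(Soybean$crop.hist))["2-4-d-injury", ]

# Probability of that concentration if missingness were spread uniformly at random
(16 / 683)^16
```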
This gallery contains the plots comparing missing data between Class and each of the 35 predictor variables.
In the following plots, which compare the count of missing values for each predictor \(X\) against the distribution of \(Class\), we see that missing values are unevenly distributed across the classes.
My preferred strategy is to use robust models that can handle significant amounts of missing data. This includes ensemble methods like the random forest.
However, I would still remove the 3 degenerate variables described earlier.
I set the randomForest parameter na.action to na.roughfix, which imputes missing numeric values with the column median and missing factor values with the most frequent level.
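To illustrate what na.roughfix does outside the model call (a small sketch; the actual fit in the appendix passes it via na.action):

```r
library(randomForest)
library(mlbench)
data(Soybean)

# Applied to the predictors directly: numeric columns are filled with the median,
# factor columns with the most frequent level.
Soybean_imp <- na.roughfix(Soybean[, -1])
sum(is.na(Soybean_imp))  # 0 after imputation
```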
The resulting model output shows an OOB error rate of 4.98% with \(mtry=5\) and \(ntree=500\) trees. Without extensive model tuning, an out-of-bag accuracy of roughly 95% is a good starting point.
##
## Call:
## randomForest(formula = Class ~ ., data = Soybean2, mtry = 5, importance = TRUE, ntree = 500, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 4.98%
## Confusion matrix:
## 2-4-d-injury alternarialeaf-spot anthracnose
## 2-4-d-injury 16 0 0
## alternarialeaf-spot 0 89 0
## anthracnose 0 0 44
## bacterial-blight 0 0 0
## bacterial-pustule 0 0 0
## brown-spot 0 1 0
## brown-stem-rot 0 0 0
## charcoal-rot 0 0 0
## cyst-nematode 0 0 0
## diaporthe-pod-&-stem-blight 0 0 0
## diaporthe-stem-canker 0 0 0
## downy-mildew 0 0 0
## frog-eye-leaf-spot 0 22 0
## herbicide-injury 0 0 0
## phyllosticta-leaf-spot 0 0 0
## phytophthora-rot 0 0 0
## powdery-mildew 0 0 0
## purple-seed-stain 0 0 0
## rhizoctonia-root-rot 0 0 0
## bacterial-blight bacterial-pustule brown-spot
## 2-4-d-injury 0 0 0
## alternarialeaf-spot 0 0 1
## anthracnose 0 0 0
## bacterial-blight 20 0 0
## bacterial-pustule 1 19 0
## brown-spot 0 0 88
## brown-stem-rot 0 0 0
## charcoal-rot 0 0 0
## cyst-nematode 0 0 0
## diaporthe-pod-&-stem-blight 0 0 0
## diaporthe-stem-canker 0 0 0
## downy-mildew 0 0 0
## frog-eye-leaf-spot 0 0 2
## herbicide-injury 0 0 0
## phyllosticta-leaf-spot 0 0 3
## phytophthora-rot 0 0 0
## powdery-mildew 0 0 0
## purple-seed-stain 0 0 0
## rhizoctonia-root-rot 0 0 0
## brown-stem-rot charcoal-rot cyst-nematode
## 2-4-d-injury 0 0 0
## alternarialeaf-spot 0 0 0
## anthracnose 0 0 0
## bacterial-blight 0 0 0
## bacterial-pustule 0 0 0
## brown-spot 0 0 0
## brown-stem-rot 44 0 0
## charcoal-rot 0 20 0
## cyst-nematode 0 0 14
## diaporthe-pod-&-stem-blight 0 0 0
## diaporthe-stem-canker 0 0 0
## downy-mildew 0 0 0
## frog-eye-leaf-spot 0 0 0
## herbicide-injury 0 0 0
## phyllosticta-leaf-spot 0 0 0
## phytophthora-rot 0 0 0
## powdery-mildew 0 0 0
## purple-seed-stain 0 0 0
## rhizoctonia-root-rot 0 0 0
## diaporthe-pod-&-stem-blight diaporthe-stem-canker
## 2-4-d-injury 0 0
## alternarialeaf-spot 0 0
## anthracnose 0 0
## bacterial-blight 0 0
## bacterial-pustule 0 0
## brown-spot 0 0
## brown-stem-rot 0 0
## charcoal-rot 0 0
## cyst-nematode 0 0
## diaporthe-pod-&-stem-blight 15 0
## diaporthe-stem-canker 0 20
## downy-mildew 0 0
## frog-eye-leaf-spot 0 0
## herbicide-injury 0 0
## phyllosticta-leaf-spot 0 0
## phytophthora-rot 0 0
## powdery-mildew 0 0
## purple-seed-stain 0 0
## rhizoctonia-root-rot 0 0
## downy-mildew frog-eye-leaf-spot herbicide-injury
## 2-4-d-injury 0 0 0
## alternarialeaf-spot 0 1 0
## anthracnose 0 0 0
## bacterial-blight 0 0 0
## bacterial-pustule 0 0 0
## brown-spot 0 1 0
## brown-stem-rot 0 0 0
## charcoal-rot 0 0 0
## cyst-nematode 0 0 0
## diaporthe-pod-&-stem-blight 0 0 0
## diaporthe-stem-canker 0 0 0
## downy-mildew 20 0 0
## frog-eye-leaf-spot 0 67 0
## herbicide-injury 0 0 8
## phyllosticta-leaf-spot 0 0 0
## phytophthora-rot 0 0 0
## powdery-mildew 0 0 0
## purple-seed-stain 0 0 0
## rhizoctonia-root-rot 0 0 0
## phyllosticta-leaf-spot phytophthora-rot
## 2-4-d-injury 0 0
## alternarialeaf-spot 0 0
## anthracnose 0 0
## bacterial-blight 0 0
## bacterial-pustule 0 0
## brown-spot 2 0
## brown-stem-rot 0 0
## charcoal-rot 0 0
## cyst-nematode 0 0
## diaporthe-pod-&-stem-blight 0 0
## diaporthe-stem-canker 0 0
## downy-mildew 0 0
## frog-eye-leaf-spot 0 0
## herbicide-injury 0 0
## phyllosticta-leaf-spot 17 0
## phytophthora-rot 0 88
## powdery-mildew 0 0
## purple-seed-stain 0 0
## rhizoctonia-root-rot 0 0
## powdery-mildew purple-seed-stain
## 2-4-d-injury 0 0
## alternarialeaf-spot 0 0
## anthracnose 0 0
## bacterial-blight 0 0
## bacterial-pustule 0 0
## brown-spot 0 0
## brown-stem-rot 0 0
## charcoal-rot 0 0
## cyst-nematode 0 0
## diaporthe-pod-&-stem-blight 0 0
## diaporthe-stem-canker 0 0
## downy-mildew 0 0
## frog-eye-leaf-spot 0 0
## herbicide-injury 0 0
## phyllosticta-leaf-spot 0 0
## phytophthora-rot 0 0
## powdery-mildew 20 0
## purple-seed-stain 0 20
## rhizoctonia-root-rot 0 0
## rhizoctonia-root-rot class.error
## 2-4-d-injury 0 0.00000000
## alternarialeaf-spot 0 0.02197802
## anthracnose 0 0.00000000
## bacterial-blight 0 0.00000000
## bacterial-pustule 0 0.05000000
## brown-spot 0 0.04347826
## brown-stem-rot 0 0.00000000
## charcoal-rot 0 0.00000000
## cyst-nematode 0 0.00000000
## diaporthe-pod-&-stem-blight 0 0.00000000
## diaporthe-stem-canker 0 0.00000000
## downy-mildew 0 0.00000000
## frog-eye-leaf-spot 0 0.26373626
## herbicide-injury 0 0.00000000
## phyllosticta-leaf-spot 0 0.15000000
## phytophthora-rot 0 0.00000000
## powdery-mildew 0 0.00000000
## purple-seed-stain 0 0.00000000
## rhizoctonia-root-rot 20 0.00000000
All R code used in this analysis is collected in this appendix for ease of reading.
knitr::opts_chunk$set(echo = FALSE)
.codeExample{
background-color: lightgray
}
library(knitr)
library(tidyverse)
library(kableExtra)
library(cowplot)
library(skimr)
library(finalfit)
library(mlbench)
data(Glass)
#str(Glass)
library(GGally)
ggpairs( Glass[1:9])
# Add a Row number for each observation
Glass2 <- Glass %>% mutate( Id = row_number())
# Refractive index plot - outliers shown in red, whiskers at 4 * IQR (coef = 4), jittered points overlaid
pRI <- Glass2 %>% ggplot() +
geom_boxplot(aes(Type, RI, fill = as.factor(Type)), outlier.color = "red", coef = 4) +
scale_fill_brewer( type="seq" , palette = "Spectral") +
ggtitle("Refractive Index") +
geom_jitter(aes(Type, RI), alpha = 0.4 , position= position_jitter(width=0.2))
# Stores outliers in a data frame
outliers <- Glass2 %>% filter(Type==2, RI > 1.530 ) %>%
mutate( Row = Id, Field = "RI", Cause = "High") %>%
select( Row, Type, Field, Cause)
outliers <- Glass2 %>% filter(Type==6, RI < 1.515 ) %>%
mutate( Row = Id, Field = "RI", Cause = "Low") %>%
select( Row, Type, Field, Cause) %>% bind_rows(outliers)
outliers <- Glass2 %>% filter(Type==7, RI < 1.515 ) %>%
mutate( Row = Id, Field = "RI", Cause = "Low") %>%
select( Row, Type, Field, Cause) %>% bind_rows(outliers)
#outliers
pNa <- Glass2 %>% ggplot() +
geom_boxplot(aes(Type, Na, fill = as.factor(Type)), outlier.color = "red", coef = 4) +
scale_fill_brewer( type="seq" , palette = "Spectral") +
ggtitle("Na - Sodium") +
geom_jitter(aes(Type, Na), alpha = 0.4 , position= position_jitter(width=0.2))
outliers <- Glass2 %>% filter(Type==6, Na > 17 ) %>%
mutate( Row = Id, Field = "Na", Cause = "Low") %>%
select( Row, Type, Field, Cause) %>% bind_rows(outliers)
#outliers
pMg <- Glass2 %>% ggplot() +
geom_boxplot(aes(Type, Mg, fill = as.factor(Type)), outlier.color = "red", coef = 4) +
scale_fill_brewer( type="seq" , palette = "Spectral") +
ggtitle("Mg - Magnesium ") +
geom_jitter(aes(Type, Mg), alpha = 0.4 , position= position_jitter(width=0.2))
outliers <-
Glass2 %>% filter(Type==1, Mg > 4.4 ) %>%
mutate( Row = Id, Field = "Mg", Cause = "High") %>%
select( Row, Type, Field, Cause) %>% bind_rows(outliers)
pAl <- Glass2 %>% ggplot() +
geom_boxplot(aes(Type, Al, fill = as.factor(Type)), outlier.color = "red", coef = 4) +
scale_fill_brewer( type="seq" , palette = "Spectral") +
ggtitle("Al - Aluminum ") +
geom_jitter(aes(Type, Al), alpha = 0.4 , position= position_jitter(width=0.2))
pSi <- Glass2 %>% ggplot() +
geom_boxplot(aes(Type, Si, fill = as.factor(Type)), outlier.color = "red", coef = 4) +
scale_fill_brewer( type="seq" , palette = "Spectral") +
ggtitle("Si - Silicon") +
geom_jitter(aes(Type, Si), alpha = 0.4 , position= position_jitter(width=0.2))
outliers <-
Glass2 %>% filter(Type==6, Si > 75 ) %>%
mutate( Row = Id, Field = "Si", Cause = "High") %>%
select( Row, Type, Field, Cause) %>% bind_rows(outliers)
pK <- Glass2 %>% ggplot() +
geom_boxplot(aes(Type, K, fill = as.factor(Type)), outlier.color = "red", coef = 4) +
scale_fill_brewer( type="seq" , palette = "Spectral") +
ggtitle("K - Potassium") +
geom_jitter(aes(Type, K), alpha = 0.4 , position= position_jitter(width=0.2))
outliers <-
Glass2 %>% filter(Type==5, K > 6 ) %>%
mutate( Row = Id, Field = "K", Cause = "High") %>%
select( Row, Type, Field, Cause) %>% bind_rows(outliers)
#outliers
pCa <- Glass2 %>% ggplot() +
geom_boxplot(aes(Type, Ca, fill = as.factor(Type)), outlier.color = "red", coef = 4) +
scale_fill_brewer( type="seq" , palette = "Spectral") +
ggtitle("Ca - Calcium") +
geom_jitter(aes(Type, Ca), alpha = 0.4 , position= position_jitter(width=0.2))
outliers <-
Glass2 %>% filter(Type==2, Ca > 15 ) %>%
mutate( Row = Id, Field = "Ca", Cause = "High") %>%
select( Row, Type, Field, Cause) %>% bind_rows(outliers)
outliers <-
Glass2 %>% filter(Type==7, Ca < 5.5 ) %>%
mutate( Row = Id, Field = "Ca", Cause = "Low") %>%
select( Row, Type, Field, Cause) %>% bind_rows(outliers)
#outliers
pBa <- Glass2 %>% ggplot() +
geom_boxplot(aes(Type, Ba, fill = as.factor(Type)), outlier.color = "red", coef = 4) +
scale_fill_brewer( type="seq" , palette = "Spectral") +
ggtitle("Ba - Barium") +
geom_jitter(aes(Type, Ba), alpha = 0.4 , position= position_jitter(width=0.2))
outliers <-
Glass2 %>% filter( ( Type==2 & Ba > 3) | ( Type==7 & Ba > 2.5) ) %>%
mutate( Row = Id, Field = "Ba", Cause = "High") %>%
select( Row, Type, Field, Cause) %>% bind_rows(outliers)
pFe <- Glass2 %>% ggplot() +
geom_boxplot(aes(Type, Fe, fill = as.factor(Type)), outlier.color = "red", coef = 4) +
scale_fill_brewer( type="seq" , palette = "Spectral") +
ggtitle("Fe - Iron") +
geom_jitter(aes(Type, Fe), alpha = 0.4 , position= position_jitter(width=0.2))
outliers <-
Glass2 %>% filter( ( Type==5 & Fe > 0.4) | ( Type==3 & Fe > 0.3) ) %>%
mutate( Row = Id, Field = "Fe", Cause = "High") %>%
select( Row, Type, Field, Cause) %>% bind_rows(outliers)
#outliers
tlp = theme(legend.position = "none")
c2 = coord_fixed(ratio=.5)
cowplot::plot_grid(pRI + tlp , pMg + tlp + c2, pBa + tlp ,
pCa + tlp , pNa + tlp , pK + tlp + c2,
pSi + tlp , pAl + tlp , pFe + tlp ,
ncol = 3 )
outliers %>% mutate( IsHigh = ifelse(Cause=="High", 1, -1)) %>%
arrange(Row) %>%
select(Row, Type, Field, IsHigh ) %>%
pivot_wider(id_cols= c(Row, Type), names_from = "Field", values_from = "IsHigh") %>%
replace_na( list(Mg=0, Ba = 0, RI = 0, Ca = 0, Fe = 0, K = 0, Si = 0, Na = 0)) %>%
kable(row.names= FALSE , caption='List of Outliers') %>%
kable_styling(bootstrap_options = c("hover", "striped"))
bpRI = Glass2 %>% ggplot(aes(x=RI)) + geom_boxplot() + geom_jitter(aes(y=RI), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpMg = Glass2 %>% ggplot(aes(x=Mg)) + geom_boxplot() + geom_jitter(aes(y=Mg), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpBa = Glass2 %>% ggplot(aes(x=Ba)) + geom_boxplot() + geom_jitter(aes(y=Ba), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpCa = Glass2 %>% ggplot(aes(x=Ca)) + geom_boxplot() + geom_jitter(aes(y=Ca), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpFe = Glass2 %>% ggplot(aes(x=Fe)) + geom_boxplot() + geom_jitter(aes(y=Fe), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpK = Glass2 %>% ggplot(aes(x=K)) + geom_boxplot() + geom_jitter(aes(y=K), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpSi = Glass2 %>% ggplot(aes(x=Si)) + geom_boxplot() + geom_jitter(aes(y=Si), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpNa = Glass2 %>% ggplot(aes(x=Na)) + geom_boxplot() + geom_jitter(aes(y=Na), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpAl = Glass2 %>% ggplot(aes(x=Al)) + geom_boxplot() + geom_jitter(aes(y=Al), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
tlp = theme(legend.position = "none")
c2 = coord_fixed(ratio=.5)
cowplot::plot_grid(bpRI + tlp , bpMg + tlp + c2, bpBa + tlp ,
bpCa + tlp , bpNa + tlp , bpK + tlp + c2,
bpSi + tlp , bpAl + tlp , bpFe + tlp ,
ncol = 3 )
set.seed(102)
library(randomForest)
rfm = randomForest::randomForest( Type ~ RI + Mg + Ba + Ca + Fe + K + Si + Na + Al ,
data = Glass2 ,
mtry = 3,
importance = TRUE ,
ntree = 1000
)
rfm
varImpPlot(rfm)
Glass2 %>% mutate( K = sqrt(K) ,
Ba = sqrt(Ba) ,
Mg = sqrt(Mg) ,
Al = sqrt(Al) ,
Fe = sqrt(Fe) ,
RI = sqrt(RI)
) -> Glass3
bpRI = Glass3 %>% ggplot(aes(x=RI)) + geom_boxplot() + geom_jitter(aes(y=RI), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpMg = Glass3 %>% ggplot(aes(x=Mg)) + geom_boxplot() + geom_jitter(aes(y=Mg), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpBa = Glass3 %>% ggplot(aes(x=Ba)) + geom_boxplot() + geom_jitter(aes(y=Ba), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpCa = Glass3 %>% ggplot(aes(x=Ca)) + geom_boxplot() + geom_jitter(aes(y=Ca), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpFe = Glass3 %>% ggplot(aes(x=Fe)) + geom_boxplot() + geom_jitter(aes(y=Fe), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpK = Glass3 %>% ggplot(aes(x=K)) + geom_boxplot() + geom_jitter(aes(y=K), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpSi = Glass3 %>% ggplot(aes(x=Si)) + geom_boxplot() + geom_jitter(aes(y=Si), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpNa = Glass3 %>% ggplot(aes(x=Na)) + geom_boxplot() + geom_jitter(aes(y=Na), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpAl = Glass3 %>% ggplot(aes(x=Al)) + geom_boxplot() + geom_jitter(aes(y=Al), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
tlp = theme(legend.position = "none")
c2 = coord_fixed(ratio=.5)
cowplot::plot_grid(bpRI + tlp , bpMg + tlp + c2, bpBa + tlp ,
bpCa + tlp , bpNa + tlp , bpK + tlp + c2,
bpSi + tlp , bpAl + tlp , bpFe + tlp ,
ncol = 3 )
rfm = randomForest::randomForest( Type ~ RI + Mg + Ba + Ca + Fe + K + Si + Na + Al ,
data = Glass3 ,
mtry = 3,
importance = TRUE ,
ntree = 1000
)
rfm
varImpPlot(rfm)
data(Soybean)
( soyskim = skimr::skim(Soybean) )
# Define the degeneracy condition
# skim extracts the top 1, 2 and 3 values and their frequencies as count for each column
aa = soyskim$factor.top_counts
# We parse the relevant frequencies based on the comma separated list of colon separated key-values
bb = str_match_all(aa, "^\\w+:\\s+(\\d+),\\s+\\w+:\\s+(\\d+)" )
N1 = sapply(bb, function(x) strtoi(x[[2]]))
N2 = sapply(bb, function(x) strtoi(x[[3]]))
lambdaX = N1 / N2
degeneracy = data.frame( predictor = soyskim$skim_variable,
n_missing = soyskim$n_missing ,
N1 = N1 ,
N2 = N2 ,
lambdaX = lambdaX
)
degeneracy %>% arrange(desc(lambdaX)) %>% kable(caption="Degenerate predictors", digits = 1) %>% kable_styling(bootstrap_options = c("hover", "striped"))
# Observations with at least one column NA or ""
# Trick: rowSums returns the missing columns per row.
cells_with_na = rowSums(is.na(Soybean) | Soybean == "")
fraction_with_na = sum( cells_with_na > 0 ) / nrow(Soybean)
fraction_cells_na = sum( cells_with_na ) / ( nrow(Soybean) * ncol(Soybean))
soyskim %>%
arrange(desc(n_missing)) %>% ggplot(aes(y=reorder(skim_variable, n_missing), x=n_missing) ) + geom_col() +
labs(title="Variables Sorted by number of missing values", subtitle = "from Soybean") +
ylab("Variable name") + xlab("Number Missing")
NonMCAR_Predictors = data.frame(
predictor = c( 'plant.stand', 'precip' , 'temp', 'hail', 'crop.hist',
'sever', 'seed.tmt' , 'germ' , 'plant.growth' , 'leaf.halo' ,
'leaf.marg' , 'leaf.size', 'leaf.shread' , 'leaf.malf' , 'leaf.mild' ,
'stem' , 'lodging' , 'stem.cankers', 'canker.lesion' , 'fruiting.bodies' ,
'ext.decay' , 'mycelium' , 'int.discolor', 'sclerotia', 'fruit.pods' ,
'fruit.spots' , 'seed' , 'mold.growth', 'seed.discolor' , 'seed.size' ,
'shriveling' , 'roots'
),
support = c( 3 , 3, 2 , 5 , 1,
5 , 5 , 5, 1, 3 ,
3 , 3 , 4 , 3 , 5 ,
1 , 5 , 3 , 3 , 4 ,
3 , 3 , 3 , 3 , 2 ,
4 , 3 , 3 , 4 , 3,
4 , 2
)
)
NonMCAR_Predictors %>%
arrange(support, predictor) %>%
kable(caption = "Predictor Distribution of Missing Values") %>%
kable_styling(bootstrap_options = c("hover", "striped"))
# PRINT THE GALLERY OF CHARTS
explanatory_variables = soyskim$skim_variable[2:36]
for( v in explanatory_variables){
print( Soybean %>% missing_pairs("Class", v) +
theme(axis.text.x = element_text(angle=90)) )
}
library(randomForest)
# Define a new dataset Soybean2 with removed variables.
Soybean %>% select(-mycelium, -sclerotia, -leaf.mild ) -> Soybean2
rfm = randomForest::randomForest( Class ~ . ,
data = Soybean2,
mtry = 5,
importance = TRUE ,
na.action = na.roughfix ,
ntree = 500
)
rfm
varImpPlot(rfm)