Exercises from Kuhn & Johnson, Applied Predictive Modeling, Chapter 3 (Data Pre-processing). All R code is collected in an appendix at the end for clarity.
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:
> library(mlbench)
> data(Glass)
> str(Glass)
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Do there appear to be any outliers in the data? Are any predictors skewed?
Are there any relevant transformations of one or more predictors that might improve the classification model?
The ggpairs correlogram shows the distribution of each variable together with the pairwise correlations and scatterplots.
The principal ingredients of glass are silicon (mean 72.7%), sodium (13.4%), and calcium (8.96%). The other elements appear in trace quantities, under 3% on average.
The strongest relationships are between RI (refractive index) and the other elements:

- Ca (Calcium) has an 80% correlation
- Si (Silicon) has a -54% correlation
- Al (Aluminum) has a -40% correlation
Certain elements are absent from many samples, creating bimodal distributions. For example:

- Ba (Barium) is absent in 82% of samples.
- K (Potassium) is absent in 14% of samples.

The chemical relationship between these trace elements, the refractive index, and the type of glass would need to be explored further with a chemist.
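As a quick numeric check of these summaries (a minimal sketch; the full ggpairs exploration appears in the code appendix), the correlations with RI and the fraction of zero concentrations can be computed directly:

```r
library(mlbench)
data(Glass)

# Correlation of the refractive index with the elements named above
round(cor(Glass$RI, Glass[, c("Ca", "Si", "Al")]), 2)

# Share of samples in which each trace element is completely absent
# (e.g. Ba is absent in roughly 82% of samples, K in roughly 14%)
round(colMeans(Glass[, c("Ba", "K", "Fe", "Mg")] == 0), 2)
```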
Using the box plots below and the density plots in the previous section, we observe that:
- RI has a slight right skew.
- Na is symmetric.
- Mg is bimodal with a cluster at zero. The non-zero observations appear left skewed.
- Ba is bimodal with a cluster at zero. The non-zero observations are symmetric within the subpopulations by Type. However, we should not draw too many conclusions because Ba is absent in 82% of samples.
- Fe is bimodal with a cluster at zero. The non-zero observations appear right skewed in aggregate, but symmetric within their Type subgroups.
- Al is present in all samples. It has a slight right skew. The subpopulations are relatively symmetric. Simpson's paradox is applicable here.
- Si is present in all samples and symmetric. Within subpopulations, Si is both left and right skewed.
- Ca is present in all samples but left and right skewed at the subpopulation level.

Because Glass lacks a unique row identifier, we begin by inserting an Id column to assist in outlier identification.
We store the outliers by their Id, Field, and Cause; the same row can of course appear multiple times.
Let’s begin by plotting boxplots of the refraction index and each chemical component of the glass samples faceted by Type.
The outliers are identified using customized grouped boxplots. For each response variable (i.e. the elements and refractive index) we generate a grouped boxplot. Observations are grouped by the classification variable Type.
Examining the subgroup distributions by Type, rather than the pooled distribution, is necessary to avoid Simpson's Paradox when judging skewness. I alter the boxplots in a few ways to help choose outliers: the whiskers are extended to 4 times the IQR (coef = 4), the points flagged as outliers are drawn in red, and all observations are overlaid as jittered points.
These alterations make the box plots more usable, but I still omit or include some candidate outliers based on visual inspection.
My final list contains 11 observations that I regard as potential outliers. To illustrate the selection within one boxplot: for Mg (Magnesium), Type 1 has a high-value outlier around 5. The Type 7 population is bimodal; most observations have no Mg, but some Type 7 samples do contain Mg and are marked in red. I consider none of the Type 7 points to be outliers because they appear clustered.
In the table below, I show the row number, Type, and outlier status with respect to each element and the refractive index (1 = unusually high, -1 = unusually low, 0 = not flagged).
We observe that some observations have correlated outliers:
List of Outliers
| Row | Type | Mg | Ba | RI | Ca | Fe | K | Si | Na |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 107 | 2 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 108 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 163 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 172 | 5 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 173 | 5 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 175 | 5 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 181 | 6 | 0 | 0 | -1 | 0 | 0 | 0 | 0 | 0 |
| 185 | 6 | 0 | 0 | -1 | 0 | 0 | 0 | 1 | 1 |
| 186 | 7 | 0 | 0 | -1 | -1 | 0 | 0 | 0 | 0 |
| 208 | 7 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
To evaluate whether variable transformations improve prediction accuracy, I test their impact on an actual classification model. The authors imply that one can figure this out without experimental evidence; I don't agree. I use a random forest algorithm to classify the observations and compare the in-sample predictive accuracy and the variable importance plot before and after the transformations.
I will show that the (square root) transformations have no material impact on random forest accuracy or variable importance.
In the output below, I show a random forest predicting Type from all other variables (excluding the Id column). I set the parameters to \(N=1000\) trees and \(mtry=3\) variables tried at each split. Accuracy results were robust to changes in these parameters.
We observe an OOB error rate of 19.6% for the original (untransformed) model. Magnesium, the refractive index, aluminum, and calcium are the most useful variables for accurate prediction, while silicon and iron are the least useful.
##
## Call:
## randomForest(formula = Type ~ RI + Mg + Ba + Ca + Fe + K + Si + Na + Al, data = Glass2, mtry = 3, importance = TRUE, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 19.63%
## Confusion matrix:
## 1 2 3 5 6 7 class.error
## 1 62 6 2 0 0 0 0.1142857
## 2 10 62 1 1 1 1 0.1842105
## 3 6 4 7 0 0 0 0.5882353
## 5 0 3 0 9 0 1 0.3076923
## 6 0 2 0 0 7 0 0.2222222
## 7 1 3 0 0 0 25 0.1379310
In the model below, we construct a new data set in which the variables \(K, Ba, Mg, Al, Fe, RI\) receive square-root transformations. We observe an OOB error rate of 19.2% for the transformed model. Magnesium, the refractive index, aluminum, and calcium are still the most useful variables for accurate prediction, and silicon and iron are still the least useful. I judge the improvement in predictive accuracy to be insignificant.
##
## Call:
## randomForest(formula = Type ~ RI + Mg + Ba + Ca + Fe + K + Si + Na + Al, data = Glass3, mtry = 3, importance = TRUE, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 19.16%
## Confusion matrix:
## 1 2 3 5 6 7 class.error
## 1 63 6 1 0 0 0 0.1000000
## 2 9 63 1 1 1 1 0.1710526
## 3 7 4 6 0 0 0 0.6470588
## 5 0 3 0 9 0 1 0.3076923
## 6 0 2 0 0 7 0 0.2222222
## 7 1 3 0 0 0 25 0.1379310
I conclude that variable transformations \(\color{red}{\text{don't}}\) help classification accuracy when the model is robust to such transformations, and the random forest is fairly robust to monotone transformations of the predictors. This need not be true for regression models such as OLS or multinomial models, where the distribution of the errors matters more.
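For reference, the kinds of transformations the book has in mind can also be estimated automatically. The sketch below uses caret::preProcess (assuming the caret package is installed; it is not used elsewhere in this write-up) with a Yeo-Johnson transformation, since Box-Cox cannot handle the zero-heavy trace elements:

```r
library(caret)
library(mlbench)
data(Glass)

# Estimate a Yeo-Johnson transformation (plus centering/scaling) on the
# nine numeric predictors; column 10 is the Type outcome.
pp <- preProcess(Glass[, 1:9], method = c("YeoJohnson", "center", "scale"))
Glass_trans <- predict(pp, Glass[, 1:9])
summary(Glass_trans)
```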
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
> library(mlbench)
> data(Soybean)
> ## See ?Soybean for details
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
We analyze the data frame with the skimr package to answer basic questions about the frequency distributions of the predictors. The predictor name is listed under skim_variable. In the largest table of the skim output below, the top_counts column gives the frequency distribution of up to the top four values of each predictor. For example, the column plant.stand takes on 2 values (0 occurs 354 times and 1 occurs 293 times).
| Name | Soybean |
|---|---|
| Number of rows | 683 |
| Number of columns | 36 |
| Column type frequency: | |
| factor | 36 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Class | 0 | 1.00 | FALSE | 19 | bro: 92, alt: 91, fro: 91, phy: 88 |
| date | 1 | 1.00 | FALSE | 7 | 5: 149, 4: 131, 3: 118, 2: 93 |
| plant.stand | 36 | 0.95 | TRUE | 2 | 0: 354, 1: 293 |
| precip | 38 | 0.94 | TRUE | 3 | 2: 459, 1: 112, 0: 74 |
| temp | 30 | 0.96 | TRUE | 3 | 1: 374, 2: 199, 0: 80 |
| hail | 121 | 0.82 | FALSE | 2 | 0: 435, 1: 127 |
| crop.hist | 16 | 0.98 | FALSE | 4 | 2: 219, 3: 218, 1: 165, 0: 65 |
| area.dam | 1 | 1.00 | FALSE | 4 | 1: 227, 3: 187, 2: 145, 0: 123 |
| sever | 121 | 0.82 | FALSE | 3 | 1: 322, 0: 195, 2: 45 |
| seed.tmt | 121 | 0.82 | FALSE | 3 | 0: 305, 1: 222, 2: 35 |
| germ | 112 | 0.84 | TRUE | 3 | 1: 213, 2: 193, 0: 165 |
| plant.growth | 16 | 0.98 | FALSE | 2 | 0: 441, 1: 226 |
| leaves | 0 | 1.00 | FALSE | 2 | 1: 606, 0: 77 |
| leaf.halo | 84 | 0.88 | FALSE | 3 | 2: 342, 0: 221, 1: 36 |
| leaf.marg | 84 | 0.88 | FALSE | 3 | 0: 357, 2: 221, 1: 21 |
| leaf.size | 84 | 0.88 | TRUE | 3 | 1: 327, 2: 221, 0: 51 |
| leaf.shread | 100 | 0.85 | FALSE | 2 | 0: 487, 1: 96 |
| leaf.malf | 84 | 0.88 | FALSE | 2 | 0: 554, 1: 45 |
| leaf.mild | 108 | 0.84 | FALSE | 3 | 0: 535, 1: 20, 2: 20 |
| stem | 16 | 0.98 | FALSE | 2 | 1: 371, 0: 296 |
| lodging | 121 | 0.82 | FALSE | 2 | 0: 520, 1: 42 |
| stem.cankers | 38 | 0.94 | FALSE | 4 | 0: 379, 3: 191, 1: 39, 2: 36 |
| canker.lesion | 38 | 0.94 | FALSE | 4 | 0: 320, 2: 177, 1: 83, 3: 65 |
| fruiting.bodies | 106 | 0.84 | FALSE | 2 | 0: 473, 1: 104 |
| ext.decay | 38 | 0.94 | FALSE | 3 | 0: 497, 1: 135, 2: 13 |
| mycelium | 38 | 0.94 | FALSE | 2 | 0: 639, 1: 6 |
| int.discolor | 38 | 0.94 | FALSE | 3 | 0: 581, 1: 44, 2: 20 |
| sclerotia | 38 | 0.94 | FALSE | 2 | 0: 625, 1: 20 |
| fruit.pods | 84 | 0.88 | FALSE | 4 | 0: 407, 1: 130, 3: 48, 2: 14 |
| fruit.spots | 106 | 0.84 | FALSE | 4 | 0: 345, 4: 100, 1: 75, 2: 57 |
| seed | 92 | 0.87 | FALSE | 2 | 0: 476, 1: 115 |
| mold.growth | 92 | 0.87 | FALSE | 2 | 0: 524, 1: 67 |
| seed.discolor | 106 | 0.84 | FALSE | 2 | 0: 513, 1: 64 |
| seed.size | 92 | 0.87 | FALSE | 2 | 0: 532, 1: 59 |
| shriveling | 106 | 0.84 | FALSE | 2 | 0: 539, 1: 38 |
| roots | 31 | 0.95 | FALSE | 3 | 0: 551, 1: 86, 2: 15 |
We explore the degeneracy of the predictors. Following Kuhn and Johnson, a predictor \(X\) is degenerate (near-zero variance) when it is constant, or when the frequency \(N_{1}\) of its most common value is much larger than the frequency \(N_{2}\) of its second most common value:
\[ \lambda(X) = \frac{N_{1}}{N_{2}} \geq 20 \]
Degenerate predictors
| predictor | n_missing | N1 | N2 | lambdaX |
|---|---|---|---|---|
| mycelium | 38 | 639 | 6 | 106.5 |
| sclerotia | 38 | 625 | 20 | 31.2 |
| leaf.mild | 108 | 535 | 20 | 26.8 |
| shriveling | 106 | 539 | 38 | 14.2 |
| int.discolor | 38 | 581 | 44 | 13.2 |
| lodging | 121 | 520 | 42 | 12.4 |
| leaf.malf | 84 | 554 | 45 | 12.3 |
| seed.size | 92 | 532 | 59 | 9.0 |
| seed.discolor | 106 | 513 | 64 | 8.0 |
| leaves | 0 | 606 | 77 | 7.9 |
| mold.growth | 92 | 524 | 67 | 7.8 |
| roots | 31 | 551 | 86 | 6.4 |
| leaf.shread | 100 | 487 | 96 | 5.1 |
| fruiting.bodies | 106 | 473 | 104 | 4.5 |
| seed | 92 | 476 | 115 | 4.1 |
| precip | 38 | 459 | 112 | 4.1 |
| ext.decay | 38 | 497 | 135 | 3.7 |
| fruit.spots | 106 | 345 | 100 | 3.5 |
| hail | 121 | 435 | 127 | 3.4 |
| fruit.pods | 84 | 407 | 130 | 3.1 |
| stem.cankers | 38 | 379 | 191 | 2.0 |
| plant.growth | 16 | 441 | 226 | 2.0 |
| temp | 30 | 374 | 199 | 1.9 |
| canker.lesion | 38 | 320 | 177 | 1.8 |
| sever | 121 | 322 | 195 | 1.7 |
| leaf.marg | 84 | 357 | 221 | 1.6 |
| leaf.halo | 84 | 342 | 221 | 1.5 |
| leaf.size | 84 | 327 | 221 | 1.5 |
| seed.tmt | 121 | 305 | 222 | 1.4 |
| stem | 16 | 371 | 296 | 1.3 |
| area.dam | 1 | 227 | 187 | 1.2 |
| plant.stand | 36 | 354 | 293 | 1.2 |
| date | 1 | 149 | 131 | 1.1 |
| germ | 112 | 213 | 193 | 1.1 |
| Class | 0 | 92 | 91 | 1.0 |
| crop.hist | 16 | 219 | 218 | 1.0 |
Based on the sorted table above, the top three predictors (mycelium, sclerotia, and leaf.mild) exceed the \(\lambda = 20\) cutoff and are therefore degenerate. There are no constant predictors, so these are all of the degenerate predictors.
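As a cross-check (a sketch assuming the caret package is available), caret::nearZeroVar implements the same frequency-ratio rule; with freqCut set to 20 it should flag the same three predictors:

```r
library(caret)
library(mlbench)
data(Soybean)

# freqCut is the N1/N2 ratio cutoff used above; saveMetrics = TRUE returns the
# ratio and the unique-value percentage for every column.
nzv <- nearZeroVar(Soybean, freqCut = 20, uniqueCut = 10, saveMetrics = TRUE)
nzv[nzv$nzv, ]
```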
Our calculations in R show that the percentage of observations (rows) with at least one missing value is 17.7%. By contrast, the percentage of cells with missing values in the data frame is 9.5%.
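These two percentages can be reproduced compactly (a sketch; the code appendix uses an equivalent rowSums construction):

```r
library(mlbench)
data(Soybean)

mean(rowSums(is.na(Soybean)) > 0)  # fraction of rows with at least one NA, ~0.177
mean(is.na(Soybean))               # fraction of missing cells overall, ~0.095
```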
As the skim output shows, the predictors vary in their degree of missingness. The plot below shows the count of missing values for each column of the Soybean data. The columns with the most missing data are sever, seed.tmt, lodging, and hail, each with 121 missing values (82.3% complete). In contrast, date and area.dam are each missing only 1 value and are 99.9% complete.
The pattern of missing values across Class is non-random. I use the finalfit package to visually display the pattern of missing data between Class and each of the 35 predictors. To do this, I run a for loop that outputs a graph comparing each predictor's missingness against each value of Class. \(\color{red}{\text{The plots are displayed in the subsection after this.}}\)
Based on visual inspection of the plot gallery, I conclude that the following 32 predictors have a non-random distribution of missing values across the classes of the Class variable; they are not Missing Completely at Random (MCAR). It is not possible, however, to distinguish MAR from MNAR with the available information.
The table below shows the predictors with missing values and the number of classes impacted by the missing values; a programmatic cross-check is sketched after the table.
Predictor Distribution of Missing Values
| predictor | support |
|---|---|
| crop.hist | 1 |
| plant.growth | 1 |
| stem | 1 |
| fruit.pods | 2 |
| roots | 2 |
| temp | 2 |
| canker.lesion | 3 |
| ext.decay | 3 |
| int.discolor | 3 |
| leaf.halo | 3 |
| leaf.marg | 3 |
| leaf.size | 3 |
| leaf.malf | 3 |
| mold.growth | 3 |
| mycelium | 3 |
| plant.stand | 3 |
| precip | 3 |
| sclerotia | 3 |
| seed | 3 |
| seed.size | 3 |
| stem.cankers | 3 |
| fruiting.bodies | 4 |
| fruit.spots | 4 |
| leaf.shread | 4 |
| seed.discolor | 4 |
| shriveling | 4 |
| germ | 5 |
| hail | 5 |
| leaf.mild | 5 |
| lodging | 5 |
| seed.tmt | 5 |
| sever | 5 |
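The support counts above were assembled from visual inspection of the plot gallery. A programmatic count of how many classes contain at least one missing value for each predictor can be sketched as follows (counts may differ slightly from the hand-built table):

```r
library(mlbench)
data(Soybean)

# For each predictor, count the number of classes with at least one missing value
support <- sapply(Soybean[, -1], function(x) {
  sum(tapply(is.na(x), Soybean$Class, any))
})
sort(support)
```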
For example, the predictor crop.hist has 16 missing values, occurring only in the single class 2-4-d-injury. If the missing values were uniformly distributed across observations, the chance that only one class would be affected is essentially zero.
Assuming missingness is assigned uniformly at random across observations, we can explicitly calculate the probability that all of the missing values occur in the 2-4-d-injury class, which has 16 members. Let the number of observations be \(N=683\), let \(C\) be the 2-4-d-injury class, and let \(M=16\) be the number of missing values for the predictor crop.hist.
\[ P[\text{all missing values fall in 2-4-d-injury}] = \left( \frac{|C|}{N} \right)^{M} = \left( \frac{16}{683} \right)^{16} \approx 8.22 \times 10^{-27}\] So we can strongly reject MCAR. A statistical test for the predictors whose missing values span more than one class would similarly reject MCAR.
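Both the cross-tabulation and the probability can be checked in a couple of lines:

```r
library(mlbench)
data(Soybean)

# All 16 missing crop.hist values fall in the 2-4-d-injury class
table(Soybean$Class, is.na(Soybean$crop.hist))["2-4-d-injury", ]

# Probability of that concentration if missingness were spread uniformly at random
(16 / 683)^16
```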
This gallery contains the plots comparing missing data between Class and each of the 35 predictor variables.
In the following plots, which compare the count of missing values for each predictor \(X\) against the distribution of \(Class\), we see that missing values are unevenly distributed across the classes.
My preferred strategy is to use robust models that can handle significant amounts of missing data. This includes ensemble methods like the random forest.
However, I would still remove the 3 degenerate variables described earlier.
I set the randomForest parameter na.action to na.roughfix, which imputes missing numeric values with the column median and missing factor values with the most frequent level.
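To illustrate what na.roughfix does outside the model call (a small sketch; the actual fit in the appendix passes it via na.action):

```r
library(randomForest)
library(mlbench)
data(Soybean)

# Applied to the predictors directly: numeric columns are filled with the median,
# factor columns with the most frequent level.
Soybean_imp <- na.roughfix(Soybean[, -1])
sum(is.na(Soybean_imp))  # 0 after imputation
```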
The resulting model output shows an OOB error rate of 4.98% with \(mtry=5\) and \(ntree=500\) trees. Without extensive model tuning, an out-of-bag accuracy of roughly 95% is a good starting point.
##
## Call:
## randomForest(formula = Class ~ ., data = Soybean2, mtry = 5, importance = TRUE, ntree = 500, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 4.98%
## Confusion matrix:
## 2-4-d-injury alternarialeaf-spot anthracnose
## 2-4-d-injury 16 0 0
## alternarialeaf-spot 0 89 0
## anthracnose 0 0 44
## bacterial-blight 0 0 0
## bacterial-pustule 0 0 0
## brown-spot 0 1 0
## brown-stem-rot 0 0 0
## charcoal-rot 0 0 0
## cyst-nematode 0 0 0
## diaporthe-pod-&-stem-blight 0 0 0
## diaporthe-stem-canker 0 0 0
## downy-mildew 0 0 0
## frog-eye-leaf-spot 0 22 0
## herbicide-injury 0 0 0
## phyllosticta-leaf-spot 0 0 0
## phytophthora-rot 0 0 0
## powdery-mildew 0 0 0
## purple-seed-stain 0 0 0
## rhizoctonia-root-rot 0 0 0
## bacterial-blight bacterial-pustule brown-spot
## 2-4-d-injury 0 0 0
## alternarialeaf-spot 0 0 1
## anthracnose 0 0 0
## bacterial-blight 20 0 0
## bacterial-pustule 1 19 0
## brown-spot 0 0 88
## brown-stem-rot 0 0 0
## charcoal-rot 0 0 0
## cyst-nematode 0 0 0
## diaporthe-pod-&-stem-blight 0 0 0
## diaporthe-stem-canker 0 0 0
## downy-mildew 0 0 0
## frog-eye-leaf-spot 0 0 2
## herbicide-injury 0 0 0
## phyllosticta-leaf-spot 0 0 3
## phytophthora-rot 0 0 0
## powdery-mildew 0 0 0
## purple-seed-stain 0 0 0
## rhizoctonia-root-rot 0 0 0
## brown-stem-rot charcoal-rot cyst-nematode
## 2-4-d-injury 0 0 0
## alternarialeaf-spot 0 0 0
## anthracnose 0 0 0
## bacterial-blight 0 0 0
## bacterial-pustule 0 0 0
## brown-spot 0 0 0
## brown-stem-rot 44 0 0
## charcoal-rot 0 20 0
## cyst-nematode 0 0 14
## diaporthe-pod-&-stem-blight 0 0 0
## diaporthe-stem-canker 0 0 0
## downy-mildew 0 0 0
## frog-eye-leaf-spot 0 0 0
## herbicide-injury 0 0 0
## phyllosticta-leaf-spot 0 0 0
## phytophthora-rot 0 0 0
## powdery-mildew 0 0 0
## purple-seed-stain 0 0 0
## rhizoctonia-root-rot 0 0 0
## diaporthe-pod-&-stem-blight diaporthe-stem-canker
## 2-4-d-injury 0 0
## alternarialeaf-spot 0 0
## anthracnose 0 0
## bacterial-blight 0 0
## bacterial-pustule 0 0
## brown-spot 0 0
## brown-stem-rot 0 0
## charcoal-rot 0 0
## cyst-nematode 0 0
## diaporthe-pod-&-stem-blight 15 0
## diaporthe-stem-canker 0 20
## downy-mildew 0 0
## frog-eye-leaf-spot 0 0
## herbicide-injury 0 0
## phyllosticta-leaf-spot 0 0
## phytophthora-rot 0 0
## powdery-mildew 0 0
## purple-seed-stain 0 0
## rhizoctonia-root-rot 0 0
## downy-mildew frog-eye-leaf-spot herbicide-injury
## 2-4-d-injury 0 0 0
## alternarialeaf-spot 0 1 0
## anthracnose 0 0 0
## bacterial-blight 0 0 0
## bacterial-pustule 0 0 0
## brown-spot 0 1 0
## brown-stem-rot 0 0 0
## charcoal-rot 0 0 0
## cyst-nematode 0 0 0
## diaporthe-pod-&-stem-blight 0 0 0
## diaporthe-stem-canker 0 0 0
## downy-mildew 20 0 0
## frog-eye-leaf-spot 0 67 0
## herbicide-injury 0 0 8
## phyllosticta-leaf-spot 0 0 0
## phytophthora-rot 0 0 0
## powdery-mildew 0 0 0
## purple-seed-stain 0 0 0
## rhizoctonia-root-rot 0 0 0
## phyllosticta-leaf-spot phytophthora-rot
## 2-4-d-injury 0 0
## alternarialeaf-spot 0 0
## anthracnose 0 0
## bacterial-blight 0 0
## bacterial-pustule 0 0
## brown-spot 2 0
## brown-stem-rot 0 0
## charcoal-rot 0 0
## cyst-nematode 0 0
## diaporthe-pod-&-stem-blight 0 0
## diaporthe-stem-canker 0 0
## downy-mildew 0 0
## frog-eye-leaf-spot 0 0
## herbicide-injury 0 0
## phyllosticta-leaf-spot 17 0
## phytophthora-rot 0 88
## powdery-mildew 0 0
## purple-seed-stain 0 0
## rhizoctonia-root-rot 0 0
## powdery-mildew purple-seed-stain
## 2-4-d-injury 0 0
## alternarialeaf-spot 0 0
## anthracnose 0 0
## bacterial-blight 0 0
## bacterial-pustule 0 0
## brown-spot 0 0
## brown-stem-rot 0 0
## charcoal-rot 0 0
## cyst-nematode 0 0
## diaporthe-pod-&-stem-blight 0 0
## diaporthe-stem-canker 0 0
## downy-mildew 0 0
## frog-eye-leaf-spot 0 0
## herbicide-injury 0 0
## phyllosticta-leaf-spot 0 0
## phytophthora-rot 0 0
## powdery-mildew 20 0
## purple-seed-stain 0 20
## rhizoctonia-root-rot 0 0
## rhizoctonia-root-rot class.error
## 2-4-d-injury 0 0.00000000
## alternarialeaf-spot 0 0.02197802
## anthracnose 0 0.00000000
## bacterial-blight 0 0.00000000
## bacterial-pustule 0 0.05000000
## brown-spot 0 0.04347826
## brown-stem-rot 0 0.00000000
## charcoal-rot 0 0.00000000
## cyst-nematode 0 0.00000000
## diaporthe-pod-&-stem-blight 0 0.00000000
## diaporthe-stem-canker 0 0.00000000
## downy-mildew 0 0.00000000
## frog-eye-leaf-spot 0 0.26373626
## herbicide-injury 0 0.00000000
## phyllosticta-leaf-spot 0 0.15000000
## phytophthora-rot 0 0.00000000
## powdery-mildew 0 0.00000000
## purple-seed-stain 0 0.00000000
## rhizoctonia-root-rot 20 0.00000000
All R code used in this analysis is collected in this appendix for ease of reading.
knitr::opts_chunk$set(echo = FALSE)
.codeExample{
background-color: lightgray
}
library(knitr)
library(tidyverse)
library(kableExtra)
library(cowplot)
library(skimr)
library(finalfit)
library(mlbench)
data(Glass)
#str(Glass)
library(GGally)
ggpairs( Glass[1:9])
# Add a Row number for each observation
Glass2 <- Glass %>% mutate( Id = row_number())
# Refractive index plot - outliers shown in red, whiskers at 4 * IQR (coef = 4), jittered points overlaid
pRI <- Glass2 %>% ggplot() +
geom_boxplot(aes(Type, RI, fill = as.factor(Type)), outlier.color = "red", coef = 4) +
scale_fill_brewer( type="seq" , palette = "Spectral") +
ggtitle("Refractive Index") +
geom_jitter(aes(Type, RI), alpha = 0.4 , position= position_jitter(width=0.2))
# Stores outliers in a data frame
outliers <- Glass2 %>% filter(Type==2, RI > 1.530 ) %>%
mutate( Row = Id, Field = "RI", Cause = "High") %>%
select( Row, Type, Field, Cause)
outliers <- Glass2 %>% filter(Type==6, RI < 1.515 ) %>%
mutate( Row = Id, Field = "RI", Cause = "Low") %>%
select( Row, Type, Field, Cause) %>% bind_rows(outliers)
outliers <- Glass2 %>% filter(Type==7, RI < 1.515 ) %>%
mutate( Row = Id, Field = "RI", Cause = "Low") %>%
select( Row, Type, Field, Cause) %>% bind_rows(outliers)
#outliers
pNa <- Glass2 %>% ggplot() +
geom_boxplot(aes(Type, Na, fill = as.factor(Type)), outlier.color = "red", coef = 4) +
scale_fill_brewer( type="seq" , palette = "Spectral") +
ggtitle("Na - Sodium") +
geom_jitter(aes(Type, Na), alpha = 0.4 , position= position_jitter(width=0.2))
outliers <- Glass2 %>% filter(Type==6, Na > 17 ) %>%
mutate( Row = Id, Field = "Na", Cause = "Low") %>%
select( Row, Type, Field, Cause) %>% bind_rows(outliers)
#outliers
pMg <- Glass2 %>% ggplot() +
geom_boxplot(aes(Type, Mg, fill = as.factor(Type)), outlier.color = "red", coef = 4) +
scale_fill_brewer( type="seq" , palette = "Spectral") +
ggtitle("Mg - Magnesium ") +
geom_jitter(aes(Type, Mg), alpha = 0.4 , position= position_jitter(width=0.2))
outliers <-
Glass2 %>% filter(Type==1, Mg > 4.4 ) %>%
mutate( Row = Id, Field = "Mg", Cause = "High") %>%
select( Row, Type, Field, Cause) %>% bind_rows(outliers)
pAl <- Glass2 %>% ggplot() +
geom_boxplot(aes(Type, Al, fill = as.factor(Type)), outlier.color = "red", coef = 4) +
scale_fill_brewer( type="seq" , palette = "Spectral") +
ggtitle("Al - Aluminum ") +
geom_jitter(aes(Type, Al), alpha = 0.4 , position= position_jitter(width=0.2))
pSi <- Glass2 %>% ggplot() +
geom_boxplot(aes(Type, Si, fill = as.factor(Type)), outlier.color = "red", coef = 4) +
scale_fill_brewer( type="seq" , palette = "Spectral") +
ggtitle("Si - Silicon") +
geom_jitter(aes(Type, Si), alpha = 0.4 , position= position_jitter(width=0.2))
outliers <-
Glass2 %>% filter(Type==6, Si > 75 ) %>%
mutate( Row = Id, Field = "Si", Cause = "High") %>%
select( Row, Type, Field, Cause) %>% bind_rows(outliers)
pK <- Glass2 %>% ggplot() +
geom_boxplot(aes(Type, K, fill = as.factor(Type)), outlier.color = "red", coef = 4) +
scale_fill_brewer( type="seq" , palette = "Spectral") +
ggtitle("K - Potassium") +
geom_jitter(aes(Type, K), alpha = 0.4 , position= position_jitter(width=0.2))
outliers <-
Glass2 %>% filter(Type==5, K > 6 ) %>%
mutate( Row = Id, Field = "K", Cause = "High") %>%
select( Row, Type, Field, Cause) %>% bind_rows(outliers)
#outliers
pCa <- Glass2 %>% ggplot() +
geom_boxplot(aes(Type, Ca, fill = as.factor(Type)), outlier.color = "red", coef = 4) +
scale_fill_brewer( type="seq" , palette = "Spectral") +
ggtitle("Ca - Calcium") +
geom_jitter(aes(Type, Ca), alpha = 0.4 , position= position_jitter(width=0.2))
outliers <-
Glass2 %>% filter(Type==2, Ca > 15 ) %>%
mutate( Row = Id, Field = "Ca", Cause = "High") %>%
select( Row, Type, Field, Cause) %>% bind_rows(outliers)
outliers <-
Glass2 %>% filter(Type==7, Ca < 5.5 ) %>%
mutate( Row = Id, Field = "Ca", Cause = "Low") %>%
select( Row, Type, Field, Cause) %>% bind_rows(outliers)
#outliers
pBa <- Glass2 %>% ggplot() +
geom_boxplot(aes(Type, Ba, fill = as.factor(Type)), outlier.color = "red", coef = 4) +
scale_fill_brewer( type="seq" , palette = "Spectral") +
ggtitle("Ba - Barium") +
geom_jitter(aes(Type, Ba), alpha = 0.4 , position= position_jitter(width=0.2))
outliers <-
Glass2 %>% filter( ( Type==2 & Ba > 3) | ( Type==7 & Ba > 2.5) ) %>%
mutate( Row = Id, Field = "Ba", Cause = "High") %>%
select( Row, Type, Field, Cause) %>% bind_rows(outliers)
pFe <- Glass2 %>% ggplot() +
geom_boxplot(aes(Type, Fe, fill = as.factor(Type)), outlier.color = "red", coef = 4) +
scale_fill_brewer( type="seq" , palette = "Spectral") +
ggtitle("Fe - Iron") +
geom_jitter(aes(Type, Fe), alpha = 0.4 , position= position_jitter(width=0.2))
outliers <-
Glass2 %>% filter( ( Type==5 & Fe > 0.4) | ( Type==3 & Fe > 0.3) ) %>%
mutate( Row = Id, Field = "Fe", Cause = "High") %>%
select( Row, Type, Field, Cause) %>% bind_rows(outliers)
#outliers
tlp = theme(legend.position = "none")
c2 = coord_fixed(ratio=.5)
cowplot::plot_grid(pRI + tlp , pMg + tlp + c2, pBa + tlp ,
pCa + tlp , pNa + tlp , pK + tlp + c2,
pSi + tlp , pAl + tlp , pFe + tlp ,
ncol = 3 )
outliers %>% mutate( IsHigh = ifelse(Cause=="High", 1, -1)) %>%
arrange(Row) %>%
select(Row, Type, Field, IsHigh ) %>%
pivot_wider(id_cols= c(Row, Type), names_from = "Field", values_from = "IsHigh") %>%
replace_na( list(Mg=0, Ba = 0, RI = 0, Ca = 0, Fe = 0, K = 0, Si = 0, Na = 0)) %>%
kable(row.names= FALSE , caption='List of Outliers') %>%
kable_styling(bootstrap_options = c("hover", "striped"))
bpRI = Glass2 %>% ggplot(aes(x=RI)) + geom_boxplot() + geom_jitter(aes(y=RI), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpMg = Glass2 %>% ggplot(aes(x=Mg)) + geom_boxplot() + geom_jitter(aes(y=Mg), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpBa = Glass2 %>% ggplot(aes(x=Ba)) + geom_boxplot() + geom_jitter(aes(y=Ba), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpCa = Glass2 %>% ggplot(aes(x=Ca)) + geom_boxplot() + geom_jitter(aes(y=Ca), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpFe = Glass2 %>% ggplot(aes(x=Fe)) + geom_boxplot() + geom_jitter(aes(y=Fe), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpK = Glass2 %>% ggplot(aes(x=K)) + geom_boxplot() + geom_jitter(aes(y=K), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpSi = Glass2 %>% ggplot(aes(x=Si)) + geom_boxplot() + geom_jitter(aes(y=Si), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpNa = Glass2 %>% ggplot(aes(x=Na)) + geom_boxplot() + geom_jitter(aes(y=Na), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpAl = Glass2 %>% ggplot(aes(x=Al)) + geom_boxplot() + geom_jitter(aes(y=Al), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
tlp = theme(legend.position = "none")
c2 = coord_fixed(ratio=.5)
cowplot::plot_grid(bpRI + tlp , bpMg + tlp + c2, bpBa + tlp ,
bpCa + tlp , bpNa + tlp , bpK + tlp + c2,
bpSi + tlp , bpAl + tlp , bpFe + tlp ,
ncol = 3 )
set.seed(102)
library(randomForest)
rfm = randomForest::randomForest( Type ~ RI + Mg + Ba + Ca + Fe + K + Si + Na + Al ,
data = Glass2 ,
mtry = 3,
importance = TRUE ,
ntree = 1000
)
rfm
varImpPlot(rfm)
Glass2 %>% mutate( K = sqrt(K) ,
Ba = sqrt(Ba) ,
Mg = sqrt(Mg) ,
Al = sqrt(Al) ,
Fe = sqrt(Fe) ,
RI = sqrt(RI)
) -> Glass3
bpRI = Glass3 %>% ggplot(aes(x=RI)) + geom_boxplot() + geom_jitter(aes(y=RI), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpMg = Glass3 %>% ggplot(aes(x=Mg)) + geom_boxplot() + geom_jitter(aes(y=Mg), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpBa = Glass3 %>% ggplot(aes(x=Ba)) + geom_boxplot() + geom_jitter(aes(y=Ba), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpCa = Glass3 %>% ggplot(aes(x=Ca)) + geom_boxplot() + geom_jitter(aes(y=Ca), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpFe = Glass3 %>% ggplot(aes(x=Fe)) + geom_boxplot() + geom_jitter(aes(y=Fe), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpK = Glass3 %>% ggplot(aes(x=K)) + geom_boxplot() + geom_jitter(aes(y=K), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpSi = Glass3 %>% ggplot(aes(x=Si)) + geom_boxplot() + geom_jitter(aes(y=Si), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpNa = Glass3 %>% ggplot(aes(x=Na)) + geom_boxplot() + geom_jitter(aes(y=Na), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
bpAl = Glass3 %>% ggplot(aes(x=Al)) + geom_boxplot() + geom_jitter(aes(y=Al), alpha = 0.4 , color = "green" ,position= position_jitter(height=0.2))
tlp = theme(legend.position = "none")
c2 = coord_fixed(ratio=.5)
cowplot::plot_grid(bpRI + tlp , bpMg + tlp + c2, bpBa + tlp ,
bpCa + tlp , bpNa + tlp , bpK + tlp + c2,
bpSi + tlp , bpAl + tlp , bpFe + tlp ,
ncol = 3 )
rfm = randomForest::randomForest( Type ~ RI + Mg + Ba + Ca + Fe + K + Si + Na + Al ,
data = Glass3 ,
mtry = 3,
importance = TRUE ,
ntree = 1000
)
rfm
varImpPlot(rfm)
data(Soybean)
( soyskim = skimr::skim(Soybean) )
# Define the degeneracy condition
# skim extracts the top 1, 2 and 3 values and their frequencies as count for each column
aa = soyskim$factor.top_counts
# We parse the relevant frequencies based on the comma separated list of colon separated key-values
bb = str_match_all(aa, "^\\w+:\\s+(\\d+),\\s+\\w+:\\s+(\\d+)" )
N1 = sapply(bb, function(x) strtoi(x[[2]]))
N2 = sapply(bb, function(x) strtoi(x[[3]]))
lambdaX = N1 / N2
degeneracy = data.frame( predictor = soyskim$skim_variable,
n_missing = soyskim$n_missing ,
N1 = N1 ,
N2 = N2 ,
lambdaX = lambdaX
)
degeneracy %>% arrange(desc(lambdaX)) %>% kable(caption="Degenerate predictors", digits = 1) %>% kable_styling(bootstrap_options = c("hover", "striped"))
# Observations with at least one column NA or ""
# Trick: rowSums returns the missing columns per row.
cells_with_na = rowSums(is.na(Soybean) | Soybean == "")
fraction_with_na = sum( cells_with_na > 0 ) / nrow(Soybean)
fraction_cells_na = sum( cells_with_na ) / ( nrow(Soybean) * ncol(Soybean))
soyskim %>%
arrange(desc(n_missing)) %>% ggplot(aes(y=reorder(skim_variable, n_missing), x=n_missing) ) + geom_col() +
labs(title="Variables Sorted by number of missing values", subtitle = "from Soybean") +
ylab("Variable name") + xlab("Number Missing")
NonMCAR_Predictors = data.frame(
predictor = c( 'plant.stand', 'precip' , 'temp', 'hail', 'crop.hist',
'sever', 'seed.tmt' , 'germ' , 'plant.growth' , 'leaf.halo' ,
'leaf.marg' , 'leaf.size', 'leaf.shread' , 'leaf.malf' , 'leaf.mild' ,
'stem' , 'lodging' , 'stem.cankers', 'canker.lesion' , 'fruiting.bodies' ,
'ext.decay' , 'mycelium' , 'int.discolor', 'sclerotia', 'fruit.pods' ,
'fruit.spots' , 'seed' , 'mold.growth', 'seed.discolor' , 'seed.size' ,
'shriveling' , 'roots'
),
support = c( 3 , 3, 2 , 5 , 1,
5 , 5 , 5, 1, 3 ,
3 , 3 , 4 , 3 , 5 ,
1 , 5 , 3 , 3 , 4 ,
3 , 3 , 3 , 3 , 2 ,
4 , 3 , 3 , 4 , 3,
4 , 2
)
)
NonMCAR_Predictors %>%
arrange(support, predictor) %>%
kable(caption = "Predictor Distribution of Missing Values") %>%
kable_styling(bootstrap_options = c("hover", "striped"))
# PRINT THE GALLERY OF CHARTS
explanatory_variables = soyskim$skim_variable[2:36]
for( v in explanatory_variables){
print( Soybean %>% missing_pairs("Class", v) +
theme(axis.text.x = element_text(angle=90)) )
}
library(randomForest)
# Define a new dataset Soybean2 with removed variables.
Soybean %>% select(-mycelium, -sclerotia, -leaf.mild ) -> Soybean2
rfm = randomForest::randomForest( Class ~ . ,
data = Soybean2,
mtry = 5,
importance = TRUE ,
na.action = na.roughfix ,
ntree = 500
)
rfm
varImpPlot(rfm)