Our first steps are to import the data and use the str and skim functions to take a high-level view. The str output confirms 214 observations and 10 variables (9 numeric and 1 factor) and shows sample values for each variable.
The skim function tells us that the data is complete (no missing values) and provides top counts for the factor variable, Type, basic statistics for each numeric variable, and a small histogram of each numeric variable's distribution.
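A minimal sketch of this first step follows. The text does not say where the data comes from, so the sketch assumes the Glass data frame shipped with the mlbench package and skim() from skimr. Both calls are guarded so the snippet is skipped quietly when a package is not installed.

```r
# Assumption: Glass comes from mlbench and skim() from skimr; the original
# text does not name the packages it loads.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  str(Glass)                    # dimensions, column types, sample values
  if (requireNamespace("skimr", quietly = TRUE)) {
    print(skimr::skim(Glass))   # missingness, top counts, summary statistics
  }
}
```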
## tibble [214 x 10] (S3: tbl_df/tbl/data.frame)
## $ RI : num [1:214] 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num [1:214] 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num [1:214] 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num [1:214] 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num [1:214] 71.8 72.7 73 72.6 73.1 ...
## $ K : num [1:214] 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num [1:214] 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num [1:214] 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num [1:214] 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
| Data summary | |
|---|---|
| Name | Glass |
| Number of rows | 214 |
| Number of columns | 10 |
| Column type frequency: | |
| factor | 1 |
| numeric | 9 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Type | 0 | 1 | FALSE | 6 | 2: 76, 1: 70, 7: 29, 3: 17 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| RI | 0 | 1 | 1.52 | 0.00 | 1.51 | 1.52 | 1.52 | 1.52 | 1.53 | ▁▇▂▁▁ |
| Na | 0 | 1 | 13.41 | 0.82 | 10.73 | 12.91 | 13.30 | 13.83 | 17.38 | ▁▇▆▁▁ |
| Mg | 0 | 1 | 2.68 | 1.44 | 0.00 | 2.11 | 3.48 | 3.60 | 4.49 | ▃▁▁▇▅ |
| Al | 0 | 1 | 1.44 | 0.50 | 0.29 | 1.19 | 1.36 | 1.63 | 3.50 | ▂▇▃▁▁ |
| Si | 0 | 1 | 72.65 | 0.77 | 69.81 | 72.28 | 72.79 | 73.09 | 75.41 | ▁▂▇▂▁ |
| K | 0 | 1 | 0.50 | 0.65 | 0.00 | 0.12 | 0.56 | 0.61 | 6.21 | ▇▁▁▁▁ |
| Ca | 0 | 1 | 8.96 | 1.42 | 5.43 | 8.24 | 8.60 | 9.17 | 16.19 | ▁▇▁▁▁ |
| Ba | 0 | 1 | 0.18 | 0.50 | 0.00 | 0.00 | 0.00 | 0.00 | 3.15 | ▇▁▁▁▁ |
| Fe | 0 | 1 | 0.06 | 0.10 | 0.00 | 0.00 | 0.00 | 0.10 | 0.51 | ▇▁▁▁▁ |
The ggpairs plot is the Swiss Army knife of plots: it provides pairwise scatter plots of the predictor variables, density plots, correlation statistics, and box plots of the predictors relative to the Type variable.
Some high-level observations include:
Next we use the ggcorr function to get a better look at the correlations between predictors. This plot provides a clearer (or at least larger) representation of the correlation information in the ggpairs plot above. It confirms the strong correlation between RI and Ca (the bright red box in the lower part of the plot) and also highlights several correlations at the +/- 0.5 level: RI/Si, Mg/Ba, Mg/Al, and Al/Ba.
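library(tidyverse)  # %>%, pivot_longer, ggplot2
library(GGally)     # ggcorr
# Reshape to long format (one row per observation and predictor) for the faceted plots below
glass <- Glass %>%
  pivot_longer(!Type, names_to = "predictor", values_to = "val")
ggcorr(Glass, label = TRUE)  # labeled correlation heat map of the predictors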
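To read the same correlations as numbers rather than colors, the pairs above a cutoff can be listed with base R. This is only a sketch: strong_pairs is a hypothetical helper (it is not part of GGally), the 0.5 cutoff simply mirrors the level mentioned in the text, and the usage at the end assumes Glass comes from mlbench.

```r
# Hypothetical helper: list predictor pairs whose absolute Pearson
# correlation is at least `cutoff`, strongest first.
strong_pairs <- function(df, cutoff = 0.5) {
  num <- df[, sapply(df, is.numeric), drop = FALSE]   # drop factors such as Type
  r <- cor(num)
  idx <- which(upper.tri(r) & abs(r) >= cutoff, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(r)[idx[, "row"]],
    var2 = colnames(r)[idx[, "col"]],
    r    = r[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), ]
}

# Usage, assuming the Glass data from mlbench is available
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(strong_pairs(Glass, cutoff = 0.5))
}
```

Using only the upper triangle of the correlation matrix keeps each pair from being reported twice.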
The density plots of the predictors show some near-normal distributions (Al, Na, Si) but also reveal some skewed and bimodal distributions. These graphs lead us to review the skew of the predictors; the skewness section below uses the skewness function to report the skewness of each variable. The histograms raise the possibility of missing (zero-filled) data for several of the predictors (Fe, Mg, and Ba). They also indicate potential outliers or data entry errors for several variables (K, Fe, and Ba).
# theme_fivethirtyeight() comes from the ggthemes package
# Density curve for each predictor, one facet per predictor
p <- glass %>%
  ggplot(aes(x = val, color = predictor, fill = predictor)) +
  geom_density() +
  labs(title = "Density Curves", subtitle = "Glass Dataset Predictors") +
  theme_fivethirtyeight() +
  theme(
    legend.position = "none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  facet_wrap(~predictor, scales = "free")
# Histogram for each predictor, same layout
pp <- glass %>%
  ggplot(aes(x = val, color = predictor, fill = predictor)) +
  geom_histogram() +
  labs(title = "Histograms", subtitle = "Glass Dataset Predictors") +
  theme_fivethirtyeight() +
  theme(
    legend.position = "none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  facet_wrap(~predictor, scales = "free")
p
pp
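The zero-filled pattern suggested by the histograms can also be checked numerically by computing the share of exact zeros in each numeric column. The sketch below uses a hypothetical helper, zero_share, and assumes Glass comes from mlbench. A high share of zeros does not by itself prove the values are missing, since a zero can also mean the element is truly absent from the sample.

```r
# Hypothetical helper: fraction of observations equal to zero,
# computed for every numeric column of a data frame.
zero_share <- function(df) {
  num <- df[, sapply(df, is.numeric), drop = FALSE]
  round(colMeans(num == 0), 3)
}

# Usage, assuming the Glass data from mlbench is available
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(sort(zero_share(Glass), decreasing = TRUE))
}
```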
Before looking more closely at skew and transformations, we'll look at the box plots below to see whether outliers exist and to better understand the relationship between the predictors and the Type variable.
The first set of box plots reaffirms what we saw in the histograms: numerous variables (Fe, K, Si, Na, Al) have what appear to be outliers. The box plot for Mg shows no outliers, which at first seems to conflict with the bimodal Mg density plot. The two are consistent, however: a box plot summarizes quartiles and cannot show modality, and because Mg has a wide interquartile range (2.11 to 3.60), its lower fence (2.11 - 1.5 x 1.49, about -0.13) sits below zero, so the cluster of zero values is not flagged.
The second set of box plots provides a clear look at the relationships between the predictor variables and the Type variable. Several of these relationships (Fe, Mg, Ba) look as if they would be beneficial to a classification model because they draw clear distinctions between some of the categories.
# Box plot of each predictor on its own scale
p <- glass %>%
  ggplot(aes(x = val, color = predictor, fill = predictor)) +
  geom_boxplot() +
  labs(title = "Box Plots", subtitle = "Glass Dataset Predictors") +
  theme_fivethirtyeight() +
  theme(
    legend.position = "none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  facet_wrap(~predictor, scales = "free")
# Box plot of each predictor split by glass Type
pp <- glass %>%
  ggplot(aes(x = Type, y = val, color = predictor, fill = predictor)) +
  geom_boxplot() +
  labs(title = "Box Plots", subtitle = "By Predictor and Type") +
  theme_fivethirtyeight() +
  theme(
    legend.position = "none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  facet_wrap(~predictor, scales = "free")
p
pp
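One way to see why the Mg box plot flags no outliers despite its bimodal density is to compute the whisker fences directly. The sketch below applies the 1.5 x IQR rule, which geom_boxplot() uses for its whiskers, to the Mg quartiles reported in the skim output. Those values are rounded to two decimals, so the fences are approximate.

```r
# Mg summary values copied from the skim output (rounded to 2 decimals)
mg <- c(p0 = 0.00, p25 = 2.11, p75 = 3.60, p100 = 4.49)

iqr <- mg[["p75"]] - mg[["p25"]]          # interquartile range
lower_fence <- mg[["p25"]] - 1.5 * iqr    # points below this are drawn as outliers
upper_fence <- mg[["p75"]] + 1.5 * iqr    # points above this are drawn as outliers

# TRUE when even the minimum and maximum fall inside the fences
no_outliers <- mg[["p0"]] >= lower_fence && mg[["p100"]] <= upper_fence

print(c(lower = lower_fence, upper = upper_fence))
print(no_outliers)
```

Because the lower fence is slightly below zero, the zero-valued Mg observations sit inside the whisker range even though they form a separate mode in the density plot.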
The rule of thumb for skewness is:
- If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
- If the skewness is between -1 and -0.5 or between 0.5 and 1, the data are moderately skewed.
- If the skewness is less than -1 or greater than 1, the data are highly skewed.
We see from the table below that we have all varieties of skewness in our data. The K and Ba variables are the most skewed.
| Predictor | Skewness |
|---|---|
| RI | 1.6027151 |
| Na | 0.4478343 |
| Mg | -1.1364523 |
| Al | 0.8946104 |
| Si | -0.7202392 |
| K | 6.4600889 |
| Ca | 2.0184463 |
| Ba | 3.3686800 |
| Fe | 1.7298107 |
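The rule of thumb can be applied mechanically to the values in the table. The sketch below copies the skewness values from the table and bins their absolute values with cut(); the labels and the variable names skew_class and result are illustrative choices, not part of the original analysis.

```r
# Skewness values copied from the table above
skew <- c(RI = 1.6027151, Na = 0.4478343, Mg = -1.1364523, Al = 0.8946104,
          Si = -0.7202392, K = 6.4600889, Ca = 2.0184463, Ba = 3.3686800,
          Fe = 1.7298107)

# Bin |skewness| into [0, 0.5], (0.5, 1], (1, Inf)
skew_class <- cut(
  abs(skew),
  breaks = c(0, 0.5, 1, Inf),
  labels = c("fairly symmetrical", "moderately skewed", "highly skewed"),
  include.lowest = TRUE
)

result <- data.frame(
  predictor = names(skew),
  skewness  = skew,
  class     = skew_class,
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(result)
```

With these values, only Na falls in the fairly symmetrical band, Al and Si are moderately skewed, and the remaining six predictors are highly skewed.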
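After the Box-Cox transform, skewness improves for most predictors: K drops from 6.46 to -0.78, Ca from 2.02 to -0.19, and Ba from 3.37 to 1.68. RI is essentially unchanged, and Mg becomes slightly more skewed (-1.14 to -1.43).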
# Box-Cox requires strictly positive values, so shift the predictors that contain zeros
Glass$Mg <- Glass$Mg + 1e-6
Glass$K  <- Glass$K + 1e-6
Glass$Ba <- Glass$Ba + 1e-6
Glass$Fe <- Glass$Fe + 1e-6
# Estimate the Box-Cox lambda for one predictor, transform it, and return the new skewness
boxcox_skewness <- function(x) {
  BCT <- caret::BoxCoxTrans(x)
  x_bc <- predict(BCT, x)
  skewness(x_bc)
}
s2 <- apply(Glass[, -10], 2, boxcox_skewness)  # column 10 is Type
s2 %>%
  kable() %>%
  kable_styling()
| Predictor | Skewness after Box-Cox |
|---|---|
| RI | 1.5656604 |
| Na | 0.0338464 |
| Mg | -1.4327087 |
| Al | 0.0910590 |
| Si | -0.6509057 |
| K | -0.7821621 |
| Ca | -0.1939557 |
| Ba | 1.6756661 |
| Fe | 0.7442440 |
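To make the before-and-after comparison explicit, the two sets of skewness values can be placed side by side and compared by absolute value. The sketch below copies both sets from the tables in this section; the data frame name comparison and the improved column are illustrative.

```r
# Skewness before and after the Box-Cox transform, copied from the tables
before <- c(RI = 1.6027151, Na = 0.4478343, Mg = -1.1364523, Al = 0.8946104,
            Si = -0.7202392, K = 6.4600889, Ca = 2.0184463, Ba = 3.3686800,
            Fe = 1.7298107)
after  <- c(RI = 1.5656604, Na = 0.0338464, Mg = -1.4327087, Al = 0.0910590,
            Si = -0.6509057, K = -0.7821621, Ca = -0.1939557, Ba = 1.6756661,
            Fe = 0.7442440)

comparison <- data.frame(
  predictor = names(before),
  before    = before,
  after     = after,
  improved  = abs(after) < abs(before),   # closer to zero means less skew
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(comparison)
```

By this measure every predictor except Mg moves closer to symmetry, although RI and Ba remain above the highly skewed threshold of 1.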
Our first steps are again to import the data and use the str and skim functions to take a high-level view. The str output confirms 683 observations and 36 variables, all of them factors (some ordered). Sample values are also provided.
The skim function tells us that the data is not complete: most variables have missing values, and the complete rate falls as low as 0.82 (hail, sever, seed.tmt, and lodging, each with 121 missing values). It also provides top counts for each factor variable.
## tibble [683 x 36] (S3: tbl_df/tbl/data.frame)
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
| Data summary | |
|---|---|
| Name | soybean |
| Number of rows | 683 |
| Number of columns | 36 |
| Column type frequency: | |
| factor | 36 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Class | 0 | 1.00 | FALSE | 19 | bro: 92, alt: 91, fro: 91, phy: 88 |
| date | 1 | 1.00 | FALSE | 7 | 5: 149, 4: 131, 3: 118, 2: 93 |
| plant.stand | 36 | 0.95 | TRUE | 2 | 0: 354, 1: 293 |
| precip | 38 | 0.94 | TRUE | 3 | 2: 459, 1: 112, 0: 74 |
| temp | 30 | 0.96 | TRUE | 3 | 1: 374, 2: 199, 0: 80 |
| hail | 121 | 0.82 | FALSE | 2 | 0: 435, 1: 127 |
| crop.hist | 16 | 0.98 | FALSE | 4 | 2: 219, 3: 218, 1: 165, 0: 65 |
| area.dam | 1 | 1.00 | FALSE | 4 | 1: 227, 3: 187, 2: 145, 0: 123 |
| sever | 121 | 0.82 | FALSE | 3 | 1: 322, 0: 195, 2: 45 |
| seed.tmt | 121 | 0.82 | FALSE | 3 | 0: 305, 1: 222, 2: 35 |
| germ | 112 | 0.84 | TRUE | 3 | 1: 213, 2: 193, 0: 165 |
| plant.growth | 16 | 0.98 | FALSE | 2 | 0: 441, 1: 226 |
| leaves | 0 | 1.00 | FALSE | 2 | 1: 606, 0: 77 |
| leaf.halo | 84 | 0.88 | FALSE | 3 | 2: 342, 0: 221, 1: 36 |
| leaf.marg | 84 | 0.88 | FALSE | 3 | 0: 357, 2: 221, 1: 21 |
| leaf.size | 84 | 0.88 | TRUE | 3 | 1: 327, 2: 221, 0: 51 |
| leaf.shread | 100 | 0.85 | FALSE | 2 | 0: 487, 1: 96 |
| leaf.malf | 84 | 0.88 | FALSE | 2 | 0: 554, 1: 45 |
| leaf.mild | 108 | 0.84 | FALSE | 3 | 0: 535, 1: 20, 2: 20 |
| stem | 16 | 0.98 | FALSE | 2 | 1: 371, 0: 296 |
| lodging | 121 | 0.82 | FALSE | 2 | 0: 520, 1: 42 |
| stem.cankers | 38 | 0.94 | FALSE | 4 | 0: 379, 3: 191, 1: 39, 2: 36 |
| canker.lesion | 38 | 0.94 | FALSE | 4 | 0: 320, 2: 177, 1: 83, 3: 65 |
| fruiting.bodies | 106 | 0.84 | FALSE | 2 | 0: 473, 1: 104 |
| ext.decay | 38 | 0.94 | FALSE | 3 | 0: 497, 1: 135, 2: 13 |
| mycelium | 38 | 0.94 | FALSE | 2 | 0: 639, 1: 6 |
| int.discolor | 38 | 0.94 | FALSE | 3 | 0: 581, 1: 44, 2: 20 |
| sclerotia | 38 | 0.94 | FALSE | 2 | 0: 625, 1: 20 |
| fruit.pods | 84 | 0.88 | FALSE | 4 | 0: 407, 1: 130, 3: 48, 2: 14 |
| fruit.spots | 106 | 0.84 | FALSE | 4 | 0: 345, 4: 100, 1: 75, 2: 57 |
| seed | 92 | 0.87 | FALSE | 2 | 0: 476, 1: 115 |
| mold.growth | 92 | 0.87 | FALSE | 2 | 0: 524, 1: 67 |
| seed.discolor | 106 | 0.84 | FALSE | 2 | 0: 513, 1: 64 |
| seed.size | 92 | 0.87 | FALSE | 2 | 0: 532, 1: 59 |
| shriveling | 106 | 0.84 | FALSE | 2 | 0: 539, 1: 38 |
| roots | 31 | 0.95 | FALSE | 3 | 0: 551, 1: 86, 2: 15 |
Applying the nearZeroVar function to the soybean dataset reveals that three variables (leaf.mild, mycelium, and sclerotia) have near-zero variance and are candidates for exclusion from the dataset.
## [1] "leaf.mild" "mycelium" "sclerotia"
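The result can be checked by hand against the top counts in the skim table. With its default settings, caret's nearZeroVar flags a predictor when the ratio of its most common value's count to its second most common value's count exceeds freqCut (95/5 = 19) and the percentage of distinct values is below uniqueCut (10). The sketch below reproduces that arithmetic in base R using counts copied from the table. It is an approximation of the check, not caret's own code, and leaves and lodging are included only for comparison.

```r
n_rows <- 683   # number of observations in the soybean data

# Level counts copied from the top_counts column of the skim table
counts <- list(
  leaf.mild = c(535, 20, 20),
  mycelium  = c(639, 6),
  sclerotia = c(625, 20),
  leaves    = c(606, 77),   # comparison: unbalanced but not flagged
  lodging   = c(520, 42)    # comparison: unbalanced but not flagged
)

# Ratio of the most common level to the second most common level
freq_ratio <- sapply(counts, function(n) {
  s <- sort(n, decreasing = TRUE)
  s[1] / s[2]
})

# Percentage of distinct values relative to the number of rows
pct_unique <- sapply(counts, length) / n_rows * 100

flagged <- names(counts)[freq_ratio > 95 / 5 & pct_unique < 10]
print(round(freq_ratio, 2))
print(flagged)
```

The three flagged variables have frequency ratios of roughly 27, 107, and 31, while leaves and lodging stay below the cutoff of 19.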
Given the number of observations (683) and the complete rate of the worst variables (82%), I would be inclined to drop the rows with NA values for this dataset. Next I would check that dropping the NAs did not create an imbalance in the data sample. If it did, I would make the appropriate adjustments and proceed with my analysis. I believe this approach avoids the need to impute values for categorical data, which could introduce bias. If this strategy did not produce a viable model or viable results, I would consider imputation with Random Forest, a method introduced in Chapter 8 of Introduction to Statistical Learning.
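The imbalance check described above could be sketched as follows. The helper class_balance_after_drop is hypothetical, and the usage assumes the Soybean data frame from mlbench with its outcome in a column named Class, as in the str output. It compares the class counts before and after removing incomplete rows so that any class that shrinks sharply or disappears stands out.

```r
# Hypothetical helper: class counts before and after dropping rows with NAs
class_balance_after_drop <- function(df, target = "Class") {
  complete <- df[complete.cases(df), , drop = FALSE]
  before <- table(df[[target]])
  # Keep every original level so classes that vanish show up as 0
  after <- table(factor(complete[[target]], levels = names(before)))
  data.frame(
    class    = names(before),
    n_before = as.vector(before),
    n_after  = as.vector(after),
    retained = round(as.vector(after) / as.vector(before), 2),
    stringsAsFactors = FALSE
  )
}

# Usage, assuming the Soybean data from mlbench is available
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  print(class_balance_after_drop(Soybean, target = "Class"))
}
```

If the retained share differs widely across classes, the missing values are concentrated in particular classes, which would argue for imputation rather than deletion.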