1. The UC Irvine Machine Learning Repository contains data sets related to glass identification. The data consists of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

    (a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
    (b) Do there appear to be any outliers in the data? Are any predictors skewed?
    (c) Are there any relevant transformations of one or more predictors that might improve the classification model?

Data Import and Inspection


Our first steps are to import the data and use the str and skim functions to take a high-level view. str confirms 214 observations and 10 variables: 9 numeric and 1 factor. Sample observations are also provided.

The skim function tells us that the data is complete (no missing values), provides top counts for the factor variable Type, and gives basic statistics on each numeric variable along with some insight into its distribution.

library(mlbench)    # Glass data
library(tidyverse)  # as_tibble, ggplot2, tidyr
library(skimr)      # skim summaries
data(Glass)
Glass <- as_tibble(Glass)
str(Glass)
## tibble [214 x 10] (S3: tbl_df/tbl/data.frame)
##  $ RI  : num [1:214] 1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num [1:214] 13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num [1:214] 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num [1:214] 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num [1:214] 71.8 72.7 73 72.6 73.1 ...
##  $ K   : num [1:214] 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num [1:214] 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num [1:214] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num [1:214] 0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
skim(Glass)
Data summary
Name Glass
Number of rows 214
Number of columns 10
_______________________
Column type frequency:
factor 1
numeric 9
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Type 0 1 FALSE 6 2: 76, 1: 70, 7: 29, 3: 17

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
RI 0 1 1.52 0.00 1.51 1.52 1.52 1.52 1.53 ▁▇▂▁▁
Na 0 1 13.41 0.82 10.73 12.91 13.30 13.83 17.38 ▁▇▆▁▁
Mg 0 1 2.68 1.44 0.00 2.11 3.48 3.60 4.49 ▃▁▁▇▅
Al 0 1 1.44 0.50 0.29 1.19 1.36 1.63 3.50 ▂▇▃▁▁
Si 0 1 72.65 0.77 69.81 72.28 72.79 73.09 75.41 ▁▂▇▂▁
K 0 1 0.50 0.65 0.00 0.12 0.56 0.61 6.21 ▇▁▁▁▁
Ca 0 1 8.96 1.42 5.43 8.24 8.60 9.17 16.19 ▁▇▁▁▁
Ba 0 1 0.18 0.50 0.00 0.00 0.00 0.00 3.15 ▇▁▁▁▁
Fe 0 1 0.06 0.10 0.00 0.00 0.00 0.10 0.51 ▇▁▁▁▁

Review GGPairs Plots


The ggpairs plot is a Swiss Army knife of plots: it provides pair-wise scatter plots of the predictor variables, density plots, and correlation statistics, as well as box plots of the predictors relative to the Type variable.

High-level observations from this plot are taken up in the sections that follow.

library(GGally)    # ggpairs, ggcorr
library(ggthemes)  # theme_fivethirtyeight
p <- ggpairs(Glass) + theme_fivethirtyeight()
p

Review GGCorr Plot


Next we use the ggcorr function to get a better look at the correlations between predictors. This plot provides a clearer (or at least larger) representation of the information included in the ggpairs plot above. Indeed, it confirms the strong correlation between RI and Ca (the bright red box in the lower left) and also highlights several correlations at roughly the +/- 0.5 level: RI/Si, Mg/Ba, Mg/Al, Al/Ba.

# Long form of the predictors, reused for the faceted plots below
glass <- Glass %>% 
  pivot_longer(!Type, names_to = 'predictor', values_to = "val")

ggcorr(Glass, label = TRUE)

Review Density Plots and Histograms to Understand Distributions


The density plots of the predictors show some near-normal distributions (Al, Na, Si), but also reveal some skewed and bimodal distributions. These graphs lead us to review the skew of the predictors; the skewness section below uses the skewness function to report the skewness of each variable. The histograms raise the possibility of missing (zero-filled) data for several of the predictors (Fe, Mg, and Ba). They also indicate potential outliers or data entry errors for several variables (K, Fe, and Ba).

p <- glass %>% 
    ggplot( aes(x=val, color=predictor, fill=predictor)) +
    geom_density() +
    labs(title = "Density Curves", subtitle = 'Glass Dataset Predictors') +
    theme_fivethirtyeight() +
    theme(
      legend.position="none",
      panel.spacing = unit(0.1, "lines"),
      strip.text.x = element_text(size = 8)
    ) + facet_wrap(~predictor, scales = "free")

p

p <- glass %>% 
    ggplot( aes(x=val, color=predictor, fill=predictor)) +
    geom_histogram() +
    labs(title = "Histograms", subtitle = 'Glass Dataset Predictors') +
    theme_fivethirtyeight() +
    theme(
      legend.position="none",
      panel.spacing = unit(0.1, "lines"),
      strip.text.x = element_text(size = 8)
    ) + facet_wrap(~predictor, scales = "free")

p

Box Plots

Review Boxplots for Outliers and Relationships with Type


Before looking more closely at skew and transformations, we'll look at the box plots below to see whether outliers exist and to better understand the relationship between the predictors and the Type variable.

The first set of box plots reaffirms what we saw in the histograms: numerous variables (Fe, K, Si, Na, Al) have what appear to be outliers. The box plot for Mg shows no outliers, which at first seems to conflict with the bimodal Mg density plot; in fact there is no contradiction, since a box plot only flags points beyond 1.5 times the IQR, and a bimodal distribution widens the IQR without producing extreme points.

The second set of box plots provides a clear look at the relationships between the predictor variables and the Type variable. Several of these relationships (Fe, Mg, Ba) look as if they would be beneficial to a classification model, as they draw clear distinctions between some of the categories.
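
The box plot chunks themselves did not make it into the rendered output. Below is a minimal sketch of how both sets could be produced, reusing the long-form glass tibble built earlier; the faceting and theme choices mirror the density and histogram chunks and are assumptions, not the original code.

# First set: one box plot per predictor, to surface outliers
p <- glass %>% 
    ggplot(aes(x = predictor, y = val, fill = predictor)) +
    geom_boxplot() +
    labs(title = "Boxplots", subtitle = 'Glass Dataset Predictors') +
    theme_fivethirtyeight() +
    theme(legend.position = "none") +
    facet_wrap(~predictor, scales = "free")
p

# Second set: each predictor broken out by glass Type
p <- glass %>% 
    ggplot(aes(x = Type, y = val, fill = Type)) +
    geom_boxplot() +
    labs(title = "Boxplots by Type", subtitle = 'Glass Dataset Predictors') +
    theme_fivethirtyeight() +
    theme(legend.position = "none") +
    facet_wrap(~predictor, scales = "free")
p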

Skewness

Review Skewness Data


The rule of thumb for skewness is:

  • If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
  • If the skewness is between -1 and -0.5 or between 0.5 and 1, the data are moderately skewed.
  • If the skewness is less than -1 or greater than 1, the data are highly skewed.

We see from the table below that we have all varieties of skewness in our data. The K and Ba variables are the most skewed.
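
The chunk that computes these values is not shown; a minimal reconstruction, assuming the skewness function comes from the e1071 package:

library(e1071)
# Column 10 is the Type factor; compute skewness for the 9 numeric predictors
apply(Glass[, -10], 2, skewness)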

predictor   skewness
RI 1.6027151
Na 0.4478343
Mg -1.1364523
Al 0.8946104
Si -0.7202392
K 6.4600889
Ca 2.0184463
Ba 3.3686800
Fe 1.7298107

Box-Cox Transformation

Review Skewness Data After Box-Cox Transformation


After the Box-Cox transformation we see that the skewness of most variables improved:

  • K went from 6.46 to -0.78
  • Ba went from 3.37 to 1.68
  • Mg's skew did not improve; it went from -1.13 to -1.43
  • All other variables benefited from the Box-Cox transform
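
The transformation chunk is also not shown. Box-Cox requires strictly positive values, and several predictors (Mg, K, Ba, Fe) contain zeros, so the original treatment of those zeros is unknown; one sketch is to shift the predictors by a small constant before estimating the lambdas with caret's preProcess:

library(caret)
library(e1071)
# Assumption: zeros are handled with a small offset; the offset choice
# affects the estimated lambdas and the resulting skewness values
preds <- as.data.frame(Glass[, -10]) + 1e-6
bc <- preProcess(preds, method = "BoxCox")
apply(predict(bc, preds), 2, skewness)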
predictor   before Box-Cox   after Box-Cox
RI               1.6027151       1.5656604
Na               0.4478343       0.0338464
Mg              -1.1364523      -1.4327087
Al               0.8946104       0.0910590
Si              -0.7202392      -0.6509057
K                6.4600889      -0.7821621
Ca               2.0184463      -0.1939557
Ba               3.3686800       1.6756661
Fe               1.7298107       0.7442440
2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:
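
library(mlbench)
data(Soybean)
soybean <- as_tibble(Soybean)  # lowercase tibble name inferred from the skim summary below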

    (a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
    (b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
    (c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Data Import and Inspection


Our first steps are to import the data and use the str and skim functions to take a high-level view. str confirms 683 observations and 36 variables: the Class outcome plus 35 predictors, all stored as factors. Sample observations are also provided.

Unlike the Glass data, the skim output shows this dataset is far from complete: most predictors have missing values, with complete rates as low as 82%. It also provides top counts for each factor.
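
str(soybean)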

## tibble [683 x 36] (S3: tbl_df/tbl/data.frame)
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
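skim(soybean)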
Data summary
Name soybean
Number of rows 683
Number of columns 36
_______________________
Column type frequency:
factor 36
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Class 0 1.00 FALSE 19 bro: 92, alt: 91, fro: 91, phy: 88
date 1 1.00 FALSE 7 5: 149, 4: 131, 3: 118, 2: 93
plant.stand 36 0.95 TRUE 2 0: 354, 1: 293
precip 38 0.94 TRUE 3 2: 459, 1: 112, 0: 74
temp 30 0.96 TRUE 3 1: 374, 2: 199, 0: 80
hail 121 0.82 FALSE 2 0: 435, 1: 127
crop.hist 16 0.98 FALSE 4 2: 219, 3: 218, 1: 165, 0: 65
area.dam 1 1.00 FALSE 4 1: 227, 3: 187, 2: 145, 0: 123
sever 121 0.82 FALSE 3 1: 322, 0: 195, 2: 45
seed.tmt 121 0.82 FALSE 3 0: 305, 1: 222, 2: 35
germ 112 0.84 TRUE 3 1: 213, 2: 193, 0: 165
plant.growth 16 0.98 FALSE 2 0: 441, 1: 226
leaves 0 1.00 FALSE 2 1: 606, 0: 77
leaf.halo 84 0.88 FALSE 3 2: 342, 0: 221, 1: 36
leaf.marg 84 0.88 FALSE 3 0: 357, 2: 221, 1: 21
leaf.size 84 0.88 TRUE 3 1: 327, 2: 221, 0: 51
leaf.shread 100 0.85 FALSE 2 0: 487, 1: 96
leaf.malf 84 0.88 FALSE 2 0: 554, 1: 45
leaf.mild 108 0.84 FALSE 3 0: 535, 1: 20, 2: 20
stem 16 0.98 FALSE 2 1: 371, 0: 296
lodging 121 0.82 FALSE 2 0: 520, 1: 42
stem.cankers 38 0.94 FALSE 4 0: 379, 3: 191, 1: 39, 2: 36
canker.lesion 38 0.94 FALSE 4 0: 320, 2: 177, 1: 83, 3: 65
fruiting.bodies 106 0.84 FALSE 2 0: 473, 1: 104
ext.decay 38 0.94 FALSE 3 0: 497, 1: 135, 2: 13
mycelium 38 0.94 FALSE 2 0: 639, 1: 6
int.discolor 38 0.94 FALSE 3 0: 581, 1: 44, 2: 20
sclerotia 38 0.94 FALSE 2 0: 625, 1: 20
fruit.pods 84 0.88 FALSE 4 0: 407, 1: 130, 3: 48, 2: 14
fruit.spots 106 0.84 FALSE 4 0: 345, 4: 100, 1: 75, 2: 57
seed 92 0.87 FALSE 2 0: 476, 1: 115
mold.growth 92 0.87 FALSE 2 0: 524, 1: 67
seed.discolor 106 0.84 FALSE 2 0: 513, 1: 64
seed.size 92 0.87 FALSE 2 0: 532, 1: 59
shriveling 106 0.84 FALSE 2 0: 539, 1: 38
roots 31 0.95 FALSE 3 0: 551, 1: 86, 2: 15

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?


Near Zero Variables


Applying the nearZeroVar function to the soybean dataset reveals that three variables (leaf.mild, mycelium, and sclerotia) have near-zero variance and are candidates for exclusion from the dataset.
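
The call itself is not shown; a minimal reconstruction with caret's nearZeroVar, which returns the offending column indices by default:

library(caret)
names(soybean)[nearZeroVar(soybean)]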

## [1] "leaf.mild" "mycelium"  "sclerotia"

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.


Missing Data Strategy


Given the number of observations (683) and the complete rate of the worst variables (82%), I would be inclined to drop the NA values for this dataset. Next I would ensure that dropping the NAs did not create a class imbalance in the sample; if it did, I would make the appropriate adjustments and proceed with the analysis. This approach avoids the need to impute values for categorical data, which could introduce bias. If this strategy did not produce a viable model, I would consider imputation with random forests, a strategy introduced in Chapter 8 of An Introduction to Statistical Learning.
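
A minimal sketch of this strategy, using drop_na from tidyr and comparing class proportions before and after complete-case filtering (object names are illustrative):

library(tidyverse)
soybean_complete <- soybean %>% drop_na()
nrow(soybean_complete)   # observations remaining after dropping NAs

# Check whether complete-case filtering distorts the class mix
before <- prop.table(table(soybean$Class))
after  <- prop.table(table(soybean_complete$Class))
round(cbind(before, after), 3)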