Exercises from Chapter 3

3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, Fe.

a

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

library(mlbench)  # provides the Glass and Soybean data sets
data(Glass)

skim(Glass)
Data summary
Name                Glass
Number of rows      214
Number of columns   10
_______________________
Column type frequency:
  factor            1
  numeric           9
________________________
Group variables     None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
Type                   0              1  FALSE           6  2: 76, 1: 70, 7: 29, 3: 17

Variable type: numeric

skim_variable  n_missing  complete_rate   mean    sd     p0    p25    p50    p75   p100  hist
RI                     0              1   1.52  0.00   1.51   1.52   1.52   1.52   1.53  ▁▇▂▁▁
Na                     0              1  13.41  0.82  10.73  12.91  13.30  13.83  17.38  ▁▇▆▁▁
Mg                     0              1   2.68  1.44   0.00   2.11   3.48   3.60   4.49  ▃▁▁▇▅
Al                     0              1   1.44  0.50   0.29   1.19   1.36   1.63   3.50  ▂▇▃▁▁
Si                     0              1  72.65  0.77  69.81  72.28  72.79  73.09  75.41  ▁▂▇▂▁
K                      0              1   0.50  0.65   0.00   0.12   0.56   0.61   6.21  ▇▁▁▁▁
Ca                     0              1   8.96  1.42   5.43   8.24   8.60   9.17  16.19  ▁▇▁▁▁
Ba                     0              1   0.18  0.50   0.00   0.00   0.00   0.00   3.15  ▇▁▁▁▁
Fe                     0              1   0.06  0.10   0.00   0.00   0.00   0.10   0.51  ▇▁▁▁▁

summary(Glass)
##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29
DataExplorer::plot_histogram(Glass, theme_config = defaulttheme)

plot_bar(Glass, theme_config = defaulttheme)

plot_correlation(Glass, type = "all")

b

Do there appear to be any outliers in the data? Are any predictors skewed?

There appear to be quite a few outliers in K, which has values far from the bulk of its distribution (a maximum of 6.21 against a 75th percentile of 0.61). Some of the other distributions may look at first glance as if they have outliers, but they contain large numbers of 0s (Ba and Fe in particular), which suggests a two-part distribution: samples where the element is absent and samples where it was measured. A few of the predictors are slightly skewed. Ignoring the 0s, Al, RI, and Na all show slight right skewness.
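The visual impressions above can be backed with numbers. The following is a minimal sketch, assuming the mlbench package is installed. The `skewness()` helper is written here for illustration (it is not taken from a package), and the outlier count uses the 1.5 * IQR boxplot rule, which is only one of several possible definitions.

```r
# Assumption: the mlbench package is installed; Glass ships with it.
data(Glass, package = "mlbench")

predictors <- Glass[, names(Glass) != "Type"]

# Illustrative helper: moment-based skewness
# (third central moment divided by the 1.5 power of the second).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Share of exact zeros in each predictor
zero_share <- sapply(predictors, function(x) mean(x == 0))

# Number of points flagged by the 1.5 * IQR boxplot rule
n_outliers <- sapply(predictors, function(x) length(boxplot.stats(x)$out))

diagnostics <- data.frame(
  skewness   = round(sapply(predictors, skewness), 2),
  zero_share = round(zero_share, 2),
  n_outliers = n_outliers
)

# Most strongly skewed predictors first
diagnostics[order(-abs(diagnostics$skewness)), ]
```

Reading the `zero_share` column next to `n_outliers` helps separate predictors whose flagged points are genuine extremes from those where the boxplot rule is reacting to a large mass of zeros.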

c

Are there any relevant transformations of one or more predictors that might improve the classification model?

Transformations that may help include dummy-encoding the Type variable and applying a Box-Cox transformation to some of the numeric predictors to bring their distributions closer to normal. Because Box-Cox requires strictly positive values, it applies only to the predictors without 0s. Creating a feature that separates 0s from measured values may also be useful.
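One way these two ideas could be combined is sketched below, assuming the caret and mlbench packages are installed. The `_is_zero` column suffix is a naming choice made for this example, and whether either step improves a classifier would still need to be checked with resampling.

```r
# Assumption: caret and mlbench are installed.
library(caret)
data(Glass, package = "mlbench")

predictors <- Glass[, names(Glass) != "Type"]

# Split predictors into those that contain zeros and those that do not
has_zero      <- sapply(predictors, function(x) any(x == 0))
positive_vars <- names(predictors)[!has_zero]
zero_vars     <- names(predictors)[has_zero]

# Estimate and apply Box-Cox only on the strictly positive predictors
bc <- preProcess(predictors[positive_vars], method = "BoxCox")
transformed <- predictors
transformed[positive_vars] <- predict(bc, predictors[positive_vars])

# Add a 0/1 indicator for "element not detected" (hypothetical column names)
for (v in zero_vars) {
  transformed[[paste0(v, "_is_zero")]] <- as.integer(predictors[[v]] == 0)
}

str(transformed)
```

Printing `bc` reports which predictors received an estimated lambda, which is a quick check on how many were actually transformed.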

3.2

a

Investigate the frequency distributions for categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in the chapter?

Yes. Based on the figure below, many of the categorical predictors have missing values, and several are likely not informative enough to be useful in a model because they have near-zero variance. The near-zero-variance predictors flagged below are leaf.mild, mycelium, and sclerotia, each with a percentUnique value below 0.5% and a frequency ratio above 26.

data("Soybean")
caret::nearZeroVar(Soybean, saveMetrics = TRUE) %>%
  filter(nzv) %>% kableExtra::kable()
           freqRatio  percentUnique  zeroVar  nzv
leaf.mild      26.75      0.4392387  FALSE    TRUE
mycelium      106.50      0.2928258  FALSE    TRUE
sclerotia      31.25      0.2928258  FALSE    TRUE
plot_bar(Soybean, theme_config = defaulttheme)
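
To make the two metrics in the nearZeroVar output concrete, they can be recomputed by hand. This is a sketch in base R, assuming mlbench is installed. `nzv_metrics()` is an illustrative helper, not part of caret, and it assumes each predictor has at least two observed levels.

```r
# Assumption: the mlbench package is installed.
data(Soybean, package = "mlbench")

# Illustrative helper reproducing the two near-zero-variance metrics:
#   freqRatio     = count of most common value / count of second most common
#   percentUnique = 100 * number of distinct observed values / number of rows
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NAs by default
  counts <- counts[counts > 0]                 # ignore unused factor levels
  c(
    freqRatio     = as.numeric(counts[1]) / as.numeric(counts[2]),
    percentUnique = 100 * length(counts) / length(x)
  )
}

flagged <- c("leaf.mild", "mycelium", "sclerotia")
manual  <- t(sapply(Soybean[flagged], nzv_metrics))
round(manual, 4)
```

A predictor is flagged when one level dominates (high freqRatio) and there are very few distinct values relative to the sample size (low percentUnique).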

b

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

The figure below visualizes where values are missing across observations and the share of missing values in each predictor. It shows that missing values for one predictor tend to coincide with missing values for other predictors, which may be a product of how the data were collected for specific observation events.

vis_miss(Soybean)
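
The plot does not directly answer whether missingness depends on the class, so a tabular check may help. The following is a minimal base R sketch, assuming mlbench is installed. It ranks predictors by their share of missing values and then reports, per class, the share of observations with at least one missing value.

```r
# Assumption: the mlbench package is installed.
data(Soybean, package = "mlbench")

# Share of missing values per column, largest first
miss_by_predictor <- sort(colMeans(is.na(Soybean)), decreasing = TRUE)
head(round(miss_by_predictor, 3), 10)

# TRUE for rows that have at least one missing value
row_has_na <- !complete.cases(Soybean)

# Per class: number of rows and share of rows that are incomplete
by_class <- split(row_has_na, Soybean$Class)
miss_by_class <- data.frame(
  n                = sapply(by_class, length),
  share_incomplete = round(sapply(by_class, mean), 2)
)
miss_by_class[order(-miss_by_class$share_incomplete), ]
```

If the incomplete rows are concentrated in a handful of classes, the missingness is informative about the outcome, which matters when choosing between dropping rows and imputing.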

c

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Observations where the majority of the features are missing would likely be removed rather than imputed across the board, using a threshold rule such as “if 50% of the features are missing for a given observation, remove the observation”. Where only one or a few features are missing for an observation, we may try different imputation methods and compare which gives the best results across bootstrapped train/test sets. Candidate methods include median, mean, kNN, linear regression, and random forest imputation.
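
The two-step strategy above can be sketched in base R, assuming mlbench is installed. Since the Soybean predictors are stored as factors, this example swaps in most-frequent-level (mode) imputation as the simplest stand-in for the methods listed. `impute_mode()` is a hypothetical helper written for illustration, and the 0.5 cut-off is the illustrative threshold from the text.

```r
# Assumption: the mlbench package is installed.
data(Soybean, package = "mlbench")

predictor_names <- setdiff(names(Soybean), "Class")

# Step 1: drop observations missing more than half of their predictors
row_miss <- rowMeans(is.na(Soybean[, predictor_names]))
kept <- Soybean[row_miss <= 0.5, ]

# Step 2: fill remaining gaps with each predictor's most frequent level
# (hypothetical helper; works for ordered and unordered factors)
impute_mode <- function(x) {
  if (anyNA(x)) {
    x[is.na(x)] <- names(which.max(table(x)))
  }
  x
}

imputed <- kept
imputed[predictor_names] <- lapply(kept[predictor_names], impute_mode)

c(
  rows_before   = nrow(Soybean),
  rows_after    = nrow(imputed),
  missing_after = sum(is.na(imputed))
)
```

In a real evaluation the modes would be estimated on each training resample only and then applied to the matching test set, so that the imputation does not leak information from the held-out rows.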