The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
```r
library(mlbench)  # provides the Glass data set
library(skimr)

data(Glass)
skim(Glass)
```
|                        |       |
|------------------------|-------|
| Name                   | Glass |
| Number of rows         | 214   |
| Number of columns      | 10    |
| Column type frequency: |       |
| factor                 | 1     |
| numeric                | 9     |
| Group variables        | None  |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Type | 0 | 1 | FALSE | 6 | 2: 76, 1: 70, 7: 29, 3: 17 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| RI | 0 | 1 | 1.52 | 0.00 | 1.51 | 1.52 | 1.52 | 1.52 | 1.53 | ▁▇▂▁▁ |
| Na | 0 | 1 | 13.41 | 0.82 | 10.73 | 12.91 | 13.30 | 13.83 | 17.38 | ▁▇▆▁▁ |
| Mg | 0 | 1 | 2.68 | 1.44 | 0.00 | 2.11 | 3.48 | 3.60 | 4.49 | ▃▁▁▇▅ |
| Al | 0 | 1 | 1.44 | 0.50 | 0.29 | 1.19 | 1.36 | 1.63 | 3.50 | ▂▇▃▁▁ |
| Si | 0 | 1 | 72.65 | 0.77 | 69.81 | 72.28 | 72.79 | 73.09 | 75.41 | ▁▂▇▂▁ |
| K | 0 | 1 | 0.50 | 0.65 | 0.00 | 0.12 | 0.56 | 0.61 | 6.21 | ▇▁▁▁▁ |
| Ca | 0 | 1 | 8.96 | 1.42 | 5.43 | 8.24 | 8.60 | 9.17 | 16.19 | ▁▇▁▁▁ |
| Ba | 0 | 1 | 0.18 | 0.50 | 0.00 | 0.00 | 0.00 | 0.00 | 3.15 | ▇▁▁▁▁ |
| Fe | 0 | 1 | 0.06 | 0.10 | 0.00 | 0.00 | 0.00 | 0.10 | 0.51 | ▇▁▁▁▁ |
```r
summary(Glass)
```

```
##        RI              Na             Mg             Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500
##        Si              K                Ca              Ba      
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29  
```
```r
library(DataExplorer)

# defaulttheme is assumed to be a theme configuration defined elsewhere
plot_histogram(Glass, theme_config = defaulttheme)
plot_bar(Glass, theme_config = defaulttheme)
plot_correlation(Glass, type = "all")
```
Do there appear to be any outliers in the data? Are any predictors skewed?
There appear to be quite a few outliers in the K distribution, which has values that deviate far from the bulk of the data. Several other distributions may seem to contain outliers at first glance, but they instead contain large numbers of 0's, implying a bimodal structure: observations with and without the element. A few of the predictors show slight skewness; ignoring the 0's, Al, RI, and Na all have slight right skewness.
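As a quick numeric check of these skewness claims, the `skewness()` function from the e1071 package (an assumed dependency, not loaded elsewhere in this document) can be applied to each predictor:

```r
library(e1071)

# Sample skewness of each numeric predictor; positive values indicate right skew
sapply(Glass[, 1:9], skewness)
```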
Are there any relevant transformations of one or more predictors that might improve the classification model?
Some transformations that may be applied include dummy-encoding the Type outcome (required by some model implementations), adjusting the distributions of the skewed numeric variables via a Box-Cox transformation to bring them closer to normal, and creating features that isolate 0's from measured values. A sketch of the latter two steps follows.
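A minimal sketch of those steps, assuming the caret package. Because several predictors (Mg, K, Ba, Fe) contain exact zeros and Box-Cox requires strictly positive values, the Yeo-Johnson variant is substituted here; `Ba_zero` is a hypothetical example of a zero-isolating indicator.

```r
library(caret)

# Yeo-Johnson in place of Box-Cox, since several predictors contain exact zeros
pp <- preProcess(Glass[, 1:9], method = c("YeoJohnson", "center", "scale"))
glass_trans <- predict(pp, Glass[, 1:9])

# Hypothetical indicator separating exact zeros from measured values
glass_trans$Ba_zero <- as.integer(Glass$Ba == 0)
glass_trans$Type <- Glass$Type
```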
Investigate the frequency distributions for categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in the chapter?
Yes. Based on the figure below, many of the categorical predictors are missing information, and several are likely not descriptive enough to be useful in a model (low variance). The near-zero-variance predictors, shown in the table below, are leaf.mild, mycelium, and sclerotia, each with a percentUnique value below 0.5%.
data("Soybean")
caret::nearZeroVar(Soybean, saveMetrics = T) %>%
filter(nzv == T) %>% kableExtra::kable()
| | freqRatio | percentUnique | zeroVar | nzv |
|---|---|---|---|---|
| leaf.mild | 26.75 | 0.4392387 | FALSE | TRUE |
| mycelium | 106.50 | 0.2928258 | FALSE | TRUE |
| sclerotia | 31.25 | 0.2928258 | FALSE | TRUE |
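As an illustrative follow-up (not required by the exercise), these degenerate predictors could simply be dropped before modeling:

```r
# Drop the three near-zero-variance predictors identified above
nzv_cols <- c("leaf.mild", "mycelium", "sclerotia")
soy_filtered <- Soybean[, setdiff(names(Soybean), nzv_cols)]
```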
```r
plot_bar(Soybean, theme_config = defaulttheme)
```
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
The figure below visualizes the overall amount of missing data and the percentage missing for each predictor. Many of the missing values are coincident with missing values in other features, which suggests the missingness is a product of how the data were collected for specific observation events.
```r
library(visdat)

vis_miss(Soybean)
```
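To address the class-relatedness question directly, a simple base-R tabulation (a sketch, not part of the original output) gives the average number of missing cells per observation within each class:

```r
# Mean number of missing values per observation, grouped by class
miss_by_class <- tapply(rowSums(is.na(Soybean)), Soybean$Class, mean)
sort(miss_by_class, decreasing = TRUE)
```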
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
Observations where the majority of the features are missing would likely be stripped from evaluation rather than attempting imputation across the entire spread, using a threshold rule such as: if more than 50% of features are missing for a given observation, remove the observation. For cases where only one or a few features are missing, we could attempt different imputation methods and compare which provides the best results across bootstrapped test/train sets. Candidate methods include median, mean, k-NN, linear regression, or random forest imputation. A sketch of this strategy follows.
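A rough sketch of that two-part strategy, assuming a 50% threshold and choosing k-NN (arbitrarily, as one of the candidates above) via the VIM package:

```r
library(VIM)

# Step 1: drop observations missing more than half of their predictors
missing_frac <- rowMeans(is.na(Soybean[, -1]))  # column 1 is the Class outcome
soy_reduced  <- Soybean[missing_frac <= 0.5, ]

# Step 2: impute the remaining gaps with k-NN (one of several candidate methods)
soy_imputed <- kNN(soy_reduced, imp_var = FALSE)
```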