Exercise from Chapter 3

3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

a

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

data(Glass, package = "mlbench")

skimr::skim(Glass)

Data summary
Name	Glass
Number of rows	214
Number of columns	10
_______________________
Column type frequency:
factor	1
numeric	9
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
Type	0	1	FALSE	6	2: 76, 1: 70, 7: 29, 3: 17

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
RI	1	1.52	0.00	1.51	1.52	1.52	1.52	1.53	▁▇▂▁▁
Na	1	13.41	0.82	10.73	12.91	13.30	13.83	17.38	▁▇▆▁▁
Mg	1	2.68	1.44	0.00	2.11	3.48	3.60	4.49	▃▁▁▇▅
Al	1	1.44	0.50	0.29	1.19	1.36	1.63	3.50	▂▇▃▁▁
Si	1	72.65	0.77	69.81	72.28	72.79	73.09	75.41	▁▂▇▂▁
K	1	0.50	0.65	0.00	0.12	0.56	0.61	6.21	▇▁▁▁▁
Ca	1	8.96	1.42	5.43	8.24	8.60	9.17	16.19	▁▇▁▁▁
Ba	1	0.18	0.50	0.00	0.00	0.00	0.00	3.15	▇▁▁▁▁
Fe	1	0.06	0.10	0.00	0.00	0.00	0.10	0.51	▇▁▁▁▁

summary(Glass)

##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29

DataExplorer::plot_histogram(Glass, theme_config = defaulttheme)

plot_bar(Glass, theme_config = defaulttheme)

plot_correlation(Glass, type = "all")

b

Do there appear to be any outliers in the data? Are any predictors skewed?

K appears to have several outliers, with values far from the bulk of its distribution (its maximum is 6.21 against a 75th percentile of 0.61). Some of the other predictors may look at first glance as if they have outliers, but they contain large numbers of zeros, which suggests a bimodal distribution: one mode at zero and one for the measured values. A few of the predictors are slightly skewed; ignoring the zeros, Al, RI, and Na all show slight right skewness.
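The visual impressions above can be checked numerically. The following is a minimal sketch that computes a simple moment-based skewness and the share of exact zeros for each numeric predictor. The helper names (`skewness`, `zero_share`, `describe_numeric`) are illustrative and were not part of the original analysis, and the skewness estimator is the plain third standardized moment, which may differ slightly from what a given package reports.

```r
# Sample skewness as the third standardized moment (simple moment estimator).
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Share of exact zeros in a numeric vector.
zero_share <- function(x) mean(x == 0, na.rm = TRUE)

# Skewness and zero share for every numeric column of a data frame.
describe_numeric <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  data.frame(
    skewness   = vapply(num, skewness, numeric(1)),
    zero_share = vapply(num, zero_share, numeric(1))
  )
}

# Apply to Glass when the mlbench package is installed.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(round(describe_numeric(Glass), 2))
}
```

Positive skewness values indicate a longer right tail, and a large zero share flags predictors where the zeros should probably be considered separately from the measured values.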

c

Are there any relevant transformations of one or more predictors that might improve the classification model?

Possible transformations include dummy-encoding the Type variable and applying a Box-Cox transformation to some of the numeric variables to bring their distributions closer to normal. Because Box-Cox requires strictly positive values, predictors that contain zeros would need a different treatment, such as a Yeo-Johnson transformation. Creating a feature that separates zeros from measured values may also be useful.
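As a rough sketch of these ideas, the code below adds a 0/1 "present" indicator for each predictor that contains zeros and then estimates Box-Cox transformations with `caret::preProcess`. The helper `add_presence_flags` and the `_present` suffix are hypothetical names chosen for illustration. To my understanding, caret only estimates Box-Cox for strictly positive columns and leaves the others unchanged, so the printed summary is worth checking to see which predictors were actually transformed.

```r
# Add a 0/1 indicator column per listed predictor marking measured (> 0) values.
# The "_present" suffix is an arbitrary naming choice.
add_presence_flags <- function(df, cols) {
  for (col in cols) {
    df[[paste0(col, "_present")]] <- as.integer(df[[col]] > 0)
  }
  df
}

if (requireNamespace("mlbench", quietly = TRUE) &&
    requireNamespace("caret", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  predictors <- Glass[, setdiff(names(Glass), "Type")]

  # Flag every predictor that contains at least one zero.
  has_zero <- names(predictors)[
    vapply(predictors, function(x) any(x == 0), logical(1))
  ]
  flagged <- add_presence_flags(predictors, has_zero)

  # Estimate and apply Box-Cox transformations to the original predictors.
  pp <- caret::preProcess(predictors, method = "BoxCox")
  transformed <- predict(pp, predictors)
  print(pp)
}
```

Swapping `method = "BoxCox"` for `method = "YeoJohnson"` would be one way to include the zero-containing predictors in the transformation step.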

3.2

a

Investigate the frequency distributions for categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in the chapter?

Yes. Based on the figure below, many of the categorical predictors have missing values, and several are likely not informative enough to be useful in a model because they have near-zero variance. The near-zero-variance predictors, listed below, are leaf.mild, mycelium, and sclerotia, each with a percentage of unique values below 0.5%.

library(dplyr)  # provides %>% and filter()
data(Soybean, package = "mlbench")
caret::nearZeroVar(Soybean, saveMetrics = TRUE) %>% 
  filter(nzv) %>% kableExtra::kable()

	freqRatio	percentUnique	zeroVar	nzv
leaf.mild	26.75	0.4392387	FALSE	TRUE
mycelium	106.50	0.2928258	FALSE	TRUE
sclerotia	31.25	0.2928258	FALSE	TRUE

plot_bar(Soybean, theme_config = defaulttheme)

b

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

The figure below visualizes the amount of missing data and the percentage missing for each variable. It shows that missing values for one feature often coincide with missing values for other features, which may be a product of how the data were collected for specific observation events.

vis_miss(Soybean)
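The plot shows missingness per variable but does not directly answer whether it is related to the classes. One way to check, sketched here under the assumption that the outcome column in `Soybean` is named `Class`, is to tabulate the share of incomplete rows within each class. The helper names are illustrative only.

```r
# Fraction of missing values in each column.
missing_by_column <- function(df) colMeans(is.na(df))

# Fraction of rows with at least one missing value, per level of `class_col`.
incomplete_by_class <- function(df, class_col) {
  incomplete <- !complete.cases(df)
  tapply(incomplete, df[[class_col]], mean)
}

if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  # Predictors ordered from most to least missing.
  print(round(sort(missing_by_column(Soybean), decreasing = TRUE), 3))
  # Share of incomplete rows within each class.
  print(round(incomplete_by_class(Soybean, "Class"), 2))
}
```

Classes with a share near 1 would indicate that the missing values are concentrated in particular classes rather than spread evenly across the data.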

c

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Observations where the majority of features are missing would likely be removed from the analysis rather than imputed across the board, using a threshold rule such as “if 50% of the features are missing for a given observation, remove the observation”. Where only one or a few features are missing for an observation, we may try different imputation methods and see which gives the best results across bootstrapped train/test sets. Candidate methods include median, mean, KNN, linear regression, and random forest imputation.
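A minimal sketch of this strategy in base R follows. It drops rows above a missingness threshold and then fills the remaining gaps with the median (numeric columns) or the most frequent level (all other columns). Both helper names and the default threshold of 0.5 are assumptions made for illustration; most-frequent-level imputation stands in here as the simplest option for categorical predictors, and KNN or random forest imputation would replace the second step.

```r
# Drop rows whose share of missing values exceeds `max_missing`.
drop_sparse_rows <- function(df, max_missing = 0.5) {
  df[rowMeans(is.na(df)) <= max_missing, , drop = FALSE]
}

# Fill NAs with the median (numeric) or the most frequent value (otherwise).
# Columns that are entirely NA are left untouched.
impute_simple <- function(df) {
  for (col in names(df)) {
    x <- df[[col]]
    if (!anyNA(x) || all(is.na(x))) next
    if (is.numeric(x)) {
      x[is.na(x)] <- median(x, na.rm = TRUE)
    } else {
      x[is.na(x)] <- names(which.max(table(x)))
    }
    df[[col]] <- x
  }
  df
}

if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  kept <- drop_sparse_rows(Soybean, max_missing = 0.5)
  filled <- impute_simple(kept)
  cat("rows kept:", nrow(kept), "of", nrow(Soybean), "\n")
  cat("remaining NAs:", sum(is.na(filled)), "\n")
}
```

To compare imputation methods fairly, the fill values would need to be estimated on each training resample only and then applied to the matching test set.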

Assignment 4

Joshua Registe

3/7/2021
