1. The UC Irvine Machine Learning Repository contains data sets related to glass identification. The data consists of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

    (a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
    (b) Do there appear to be any outliers in the data? Are any predictors skewed?
    (c) Are there any relevant transformations of one or more predictors that might improve the classification model?

Data Import and Inspection


Our first steps are to import the data and use the str and skim functions to take a high-level view. str confirms 214 observations and 10 variables: 9 numeric and 1 factor. Sample observations are also provided.

The skim function tells us that the data is complete (no missing values), provides top counts for the factor variable Type, and gives basic statistics on each numeric variable along with some insight into its distribution.

library(mlbench)    # Glass data
library(tidyverse)  # as_tibble, ggplot2, tidyr
library(skimr)      # skim summaries
data(Glass)
Glass <- as_tibble(Glass)
str(Glass)
## tibble [214 x 10] (S3: tbl_df/tbl/data.frame)
##  $ RI  : num [1:214] 1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num [1:214] 13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num [1:214] 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num [1:214] 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num [1:214] 71.8 72.7 73 72.6 73.1 ...
##  $ K   : num [1:214] 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num [1:214] 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num [1:214] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num [1:214] 0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
skim(Glass)
Data summary
Name Glass
Number of rows 214
Number of columns 10
_______________________
Column type frequency:
factor 1
numeric 9
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Type 0 1 FALSE 6 2: 76, 1: 70, 7: 29, 3: 17

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
RI 0 1 1.52 0.00 1.51 1.52 1.52 1.52 1.53 ▁▇▂▁▁
Na 0 1 13.41 0.82 10.73 12.91 13.30 13.83 17.38 ▁▇▆▁▁
Mg 0 1 2.68 1.44 0.00 2.11 3.48 3.60 4.49 ▃▁▁▇▅
Al 0 1 1.44 0.50 0.29 1.19 1.36 1.63 3.50 ▂▇▃▁▁
Si 0 1 72.65 0.77 69.81 72.28 72.79 73.09 75.41 ▁▂▇▂▁
K 0 1 0.50 0.65 0.00 0.12 0.56 0.61 6.21 ▇▁▁▁▁
Ca 0 1 8.96 1.42 5.43 8.24 8.60 9.17 16.19 ▁▇▁▁▁
Ba 0 1 0.18 0.50 0.00 0.00 0.00 0.00 3.15 ▇▁▁▁▁
Fe 0 1 0.06 0.10 0.00 0.00 0.00 0.10 0.51 ▇▁▁▁▁

Review GGPairs Plots


The ggpairs plot is a Swiss Army knife of plots: it provides pair-wise scatter plots of the predictor variables, density plots, and correlation statistics, as well as box plots of the predictors relative to the Type variable.

High-level observations from this plot are taken up in the sections that follow.

library(GGally)    # ggpairs, ggcorr
library(ggthemes)  # theme_fivethirtyeight
p <- ggpairs(Glass) + theme_fivethirtyeight()
p

Review GGCorr Plot


Next we use the ggcorr function to get a better look at the correlations between predictors. This plot provides a clearer (or at least larger) representation of the information included in the ggpairs plot above. Indeed, it confirms the strong correlation between RI and Ca (the bright red box in the lower left) and also highlights several correlations at roughly the +/- 0.5 level: RI/Si, Mg/Ba, Mg/Al, Al/Ba.

# Long form of the predictors, reused for the faceted plots below
glass <- Glass %>% 
  pivot_longer(!Type, names_to = 'predictor', values_to = "val")

ggcorr(Glass, label = TRUE)

Review Density Plots and Histograms to Understand Distributions


The density plots of the predictors show some near-normal distributions (Al, Na, Si), but also reveal some skewed and bimodal distributions. These graphs lead us to review the skew of the predictors; the skewness section below uses the skewness function to report the skewness of each variable. The histograms raise the possibility of missing (zero-filled) data for several of the predictors (Fe, Mg, and Ba). They also indicate potential outliers or data entry errors for several variables (K, Fe, and Ba).

p <- glass %>% 
    ggplot( aes(x=val, color=predictor, fill=predictor)) +
    geom_density() +
    labs(title = "Density Curves", subtitle = 'Glass Dataset Predictors') +
    theme_fivethirtyeight() +
    theme(
      legend.position="none",
      panel.spacing = unit(0.1, "lines"),
      strip.text.x = element_text(size = 8)
    ) + facet_wrap(~predictor, scales = "free")

p

p <- glass %>% 
    ggplot( aes(x=val, color=predictor, fill=predictor)) +
    geom_histogram() +
    labs(title = "Histograms", subtitle = 'Glass Dataset Predictors') +
    theme_fivethirtyeight() +
    theme(
      legend.position="none",
      panel.spacing = unit(0.1, "lines"),
      strip.text.x = element_text(size = 8)
    ) + facet_wrap(~predictor, scales = "free")

p

Box Plots

Review Boxplots for Outliers and Relationships with Type


Before looking more closely at skew and transformations, we'll look at the box plots below to see whether outliers exist and to better understand the relationship between the predictors and the Type variable.

The first set of box plots reaffirms what we saw in the histograms: numerous variables (Fe, K, Si, Na, Al) have what appear to be outliers. The box plot for Mg shows no outliers, which at first seems to conflict with the bimodal Mg density plot; in fact there is no contradiction, since a box plot only flags points beyond 1.5 times the IQR, and a bimodal distribution widens the IQR without producing extreme points.

The second set of box plots provides a clear look at the relationships between the predictor variables and the Type variable. Several of these relationships (Fe, Mg, Ba) look as if they would be beneficial to a classification model, as they draw clear distinctions between some of the categories.
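
The box plot chunks themselves did not make it into the rendered output. Below is a minimal sketch of how both sets could be produced, reusing the long-form glass tibble built earlier; the faceting and theme choices mirror the density and histogram chunks and are assumptions, not the original code.

# First set: one box plot per predictor, to surface outliers
p <- glass %>% 
    ggplot(aes(x = predictor, y = val, fill = predictor)) +
    geom_boxplot() +
    labs(title = "Boxplots", subtitle = 'Glass Dataset Predictors') +
    theme_fivethirtyeight() +
    theme(legend.position = "none") +
    facet_wrap(~predictor, scales = "free")
p

# Second set: each predictor broken out by glass Type
p <- glass %>% 
    ggplot(aes(x = Type, y = val, fill = Type)) +
    geom_boxplot() +
    labs(title = "Boxplots by Type", subtitle = 'Glass Dataset Predictors') +
    theme_fivethirtyeight() +
    theme(legend.position = "none") +
    facet_wrap(~predictor, scales = "free")
p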

Skewness

Review Skewness Data


The rule of thumb for skewness is:

  • If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
  • If the skewness is between -1 and -0.5 or between 0.5 and 1, the data are moderately skewed.
  • If the skewness is less than -1 or greater than 1, the data are highly skewed.

We see from the table below that we have all varieties of skewness in our data. The K and Ba variables are the most skewed.
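
The chunk that computes these values is not shown; a minimal reconstruction, assuming the skewness function comes from the e1071 package:

library(e1071)
# Column 10 is the Type factor; compute skewness for the 9 numeric predictors
apply(Glass[, -10], 2, skewness)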

predictor   skewness
RI 1.6027151
Na 0.4478343
Mg -1.1364523
Al 0.8946104
Si -0.7202392
K 6.4600889
Ca 2.0184463
Ba 3.3686800
Fe 1.7298107

Box-Cox Transformation

Review Skewness Data After Box-Cox Transformation


After the Box-Cox transformation we see that the skewness of most variables improved:

  • K went from 6.46 to -0.78
  • Ba went from 3.37 to 1.68
  • Mg's skew did not improve; it went from -1.13 to -1.43
  • All other variables benefited from the Box-Cox transform
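
The transformation chunk is also not shown. Box-Cox requires strictly positive values, and several predictors (Mg, K, Ba, Fe) contain zeros, so the original treatment of those zeros is unknown; one sketch is to shift the predictors by a small constant before estimating the lambdas with caret's preProcess:

library(caret)
library(e1071)
# Assumption: zeros are handled with a small offset; the offset choice
# affects the estimated lambdas and the resulting skewness values
preds <- as.data.frame(Glass[, -10]) + 1e-6
bc <- preProcess(preds, method = "BoxCox")
apply(predict(bc, preds), 2, skewness)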
predictor   before Box-Cox   after Box-Cox
RI               1.6027151       1.5656604
Na               0.4478343       0.0338464
Mg              -1.1364523      -1.4327087
Al               0.8946104       0.0910590
Si              -0.7202392      -0.6509057
K                6.4600889      -0.7821621
Ca               2.0184463      -0.1939557
Ba               3.3686800       1.6756661
Fe               1.7298107       0.7442440
2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:
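
library(mlbench)
data(Soybean)
soybean <- as_tibble(Soybean)  # lowercase tibble name inferred from the skim summary below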

    (a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
    (b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
    (c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Data Import and Inspection


Our first steps are to import the data and use the str and skim functions to take a high-level view. str confirms 683 observations and 36 variables: the Class outcome plus 35 predictors, all stored as factors. Sample observations are also provided.

Unlike the Glass data, the skim output shows this dataset is far from complete: most predictors have missing values, with complete rates as low as 82%. It also provides top counts for each factor.
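
str(soybean)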

## tibble [683 x 36] (S3: tbl_df/tbl/data.frame)
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
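skim(soybean)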
Data summary
Name soybean
Number of rows 683
Number of columns 36
_______________________
Column type frequency:
factor 36
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Class 0 1.00 FALSE 19 bro: 92, alt: 91, fro: 91, phy: 88
date 1 1.00 FALSE 7 5: 149, 4: 131, 3: 118, 2: 93
plant.stand 36 0.95 TRUE 2 0: 354, 1: 293
precip 38 0.94 TRUE 3 2: 459, 1: 112, 0: 74
temp 30 0.96 TRUE 3 1: 374, 2: 199, 0: 80
hail 121 0.82 FALSE 2 0: 435, 1: 127
crop.hist 16 0.98 FALSE 4 2: 219, 3: 218, 1: 165, 0: 65
area.dam 1 1.00 FALSE 4 1: 227, 3: 187, 2: 145, 0: 123
sever 121 0.82 FALSE 3 1: 322, 0: 195, 2: 45
seed.tmt 121 0.82 FALSE 3 0: 305, 1: 222, 2: 35
germ 112 0.84 TRUE 3 1: 213, 2: 193, 0: 165
plant.growth 16 0.98 FALSE 2 0: 441, 1: 226
leaves 0 1.00 FALSE 2 1: 606, 0: 77
leaf.halo 84 0.88 FALSE 3 2: 342, 0: 221, 1: 36
leaf.marg 84 0.88 FALSE 3 0: 357, 2: 221, 1: 21
leaf.size 84 0.88 TRUE 3 1: 327, 2: 221, 0: 51
leaf.shread 100 0.85 FALSE 2 0: 487, 1: 96
leaf.malf 84 0.88 FALSE 2 0: 554, 1: 45
leaf.mild 108 0.84 FALSE 3 0: 535, 1: 20, 2: 20
stem 16 0.98 FALSE 2 1: 371, 0: 296
lodging 121 0.82 FALSE 2 0: 520, 1: 42
stem.cankers 38 0.94 FALSE 4 0: 379, 3: 191, 1: 39, 2: 36
canker.lesion 38 0.94 FALSE 4 0: 320, 2: 177, 1: 83, 3: 65
fruiting.bodies 106 0.84 FALSE 2 0: 473, 1: 104
ext.decay 38 0.94 FALSE 3 0: 497, 1: 135, 2: 13
mycelium 38 0.94 FALSE 2 0: 639, 1: 6
int.discolor 38 0.94 FALSE 3 0: 581, 1: 44, 2: 20
sclerotia 38 0.94 FALSE 2 0: 625, 1: 20
fruit.pods 84 0.88 FALSE 4 0: 407, 1: 130, 3: 48, 2: 14
fruit.spots 106 0.84 FALSE 4 0: 345, 4: 100, 1: 75, 2: 57
seed 92 0.87 FALSE 2 0: 476, 1: 115
mold.growth 92 0.87 FALSE 2 0: 524, 1: 67
seed.discolor 106 0.84 FALSE 2 0: 513, 1: 64
seed.size 92 0.87 FALSE 2 0: 532, 1: 59
shriveling 106 0.84 FALSE 2 0: 539, 1: 38
roots 31 0.95 FALSE 3 0: 551, 1: 86, 2: 15

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?


Near Zero Variables


Applying the nearZeroVar function to the soybean dataset reveals that three variables (leaf.mild, mycelium, and sclerotia) have near-zero variance and are candidates for exclusion from the dataset.
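
The call itself is not shown; a minimal reconstruction with caret's nearZeroVar, which returns the offending column indices by default:

library(caret)
names(soybean)[nearZeroVar(soybean)]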

## [1] "leaf.mild" "mycelium"  "sclerotia"

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.


Missing Data Strategy


Given the number of observations (683) and the complete rate of the worst variables (82%), I would be inclined to drop the NA values for this dataset. Next I would ensure that dropping the NAs did not create a class imbalance in the sample; if it did, I would make the appropriate adjustments and proceed with the analysis. This approach avoids the need to impute values for categorical data, which could introduce bias. If this strategy did not produce a viable model, I would consider imputation with random forests, a strategy introduced in Chapter 8 of An Introduction to Statistical Learning.
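
A minimal sketch of this strategy, using drop_na from tidyr and comparing class proportions before and after complete-case filtering (object names are illustrative):

library(tidyverse)
soybean_complete <- soybean %>% drop_na()
nrow(soybean_complete)   # observations remaining after dropping NAs

# Check whether complete-case filtering distorts the class mix
before <- prop.table(table(soybean$Class))
after  <- prop.table(table(soybean_complete$Class))
round(cbind(before, after), 3)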