#library(fpp3)
#library(seasonal)
library(magrittr)
library(tidyverse)
library(corrplot)
library(patchwork)
library(forecast)  # for boxcox
library(caret) # for spatial sign transformation
library(mlbench)
library(dlookr)
library(naniar) #missingness
library(flextable)

1 Kuhn and Johnson: Exercise 3.1

From Applied Predictive Modeling (Kuhn and Johnson, 2016), Exercise 3.1: The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

# read in glass data

data(Glass)

# save in working file

glass<-Glass

# review data

glass%>%head(3)%>%
  flextable()%>%
  set_caption('Glass Dataset: Subset of Observations')

3.1a. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

# plot distributions using dlookr

glass%>%plot_normality()    

Pairwise comparisons to evaluate relationships between predictors.

The following two figures display pairwise correlations among the covariates with and without zero-value observations, respectively. A zero (or negative) value for a chemical concentration generally indicates that the measurement lies below the instrumental detection limit. It is interesting to note that, absent such observations, the pairwise correlations are higher across a range of covariates (second figure below). In either case, Ba and Ca show a high positive correlation and Ba and Si a high negative correlation.

An argument can be made for dropping either Ba or Ca for modeling purposes, given that they account for similar variance. Because Ba also correlates with Si, it may be prudent to retain Ca. A quick programmatic check follows.
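
As a rough cross-check, caret's findCorrelation() suggests which member of each highly correlated pair to drop. This is a minimal sketch; the 0.75 cutoff is an arbitrary illustration rather than a recommendation.

# flag predictors that exceed a 0.75 pairwise correlation cutoff
# (cutoff chosen for illustration only)

glass %>%
  select(-Type) %>%
  cor() %>%
  findCorrelation(cutoff = 0.75, names = TRUE)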

#create correlation matrix that includes all numerical values 

glass_cor <-glass%>%
  select(-Type)%>%
  cor()

#set colors
  
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))

# plot pairwise correlations with zero values included

corrplot(glass_cor, method = "shade", 
         shade.col = NA, 
         diag=FALSE, 
         type='upper', 
         tl.col = "black", 
         tl.srt = 45, 
         addCoef.col = "black", 
         cl.pos = "n", 
         order = "hclust", 
         col = col(200), 
         title ='Glass: Pairwise Correlations of Predictors - Zero Values Included' , 
         mar=c(0,0,1,0))

#plot pairwise correlations removing the zeros

less_cor<-glass%>%
  select(-Type)%>%
  filter_if(is.numeric, all_vars((.) != 0))%>%
  cor()


corrplot(less_cor, method = "shade", 
         shade.col = NA, 
         diag=FALSE, 
         type='upper', 
         tl.col = "black", 
         tl.srt = 45, 
         addCoef.col = "black", 
         cl.pos = "n", 
         order = "hclust", 
         col = col(200), 
         title ='Glass: Pairwise Correlations of Predictors - Zero Values Removed' , 
         mar=c(0,0,1,0))

3.1b. Do there appear to be any outliers in the data? Are any predictors skewed?

Yes. The covariates with skewed distributions are RI, Mg, K, Ca, Ba, and Fe. Of these, Ba and Ca also have the highest proportions of outliers (17% and 12%, respectively). The following table and plots summarize these measures.

# Identify predictors that are highly skewed

skew<-glass%>%find_skewness(index=FALSE, thres=TRUE)  

cat(paste(skew, collapse = ", "), "display skewed distributions.\n")
## RI, Mg, K, Ca, Ba, Fe display skewed distributions.
# identify outliers -- below Q1 - 1.5*IQR or above Q3 + 1.5*IQR

diagnose_outlier(glass)%>%arrange(desc(outliers_cnt)) %>% 
  mutate_if(is.numeric, round , digits=3)%>% 
  flextable()%>%
  set_caption("Glass: Outlier Statistics")
# assess change to distributions based on outlier removal

glass %>% 
    select(find_outliers(glass, index = FALSE)) %>% 
    plot_outlier()

3.1c. Are there any relevant transformations of one or more predictors that might improve the classification model?

Given that few of the covariates have near-normal distributions, we can use a Box-Cox estimate of lambda to select an appropriate transformation for each. On this basis, the covariates are listed below with their ‘best-in-class’ transformations (a short sketch mapping the estimated lambdas to these labels follows the list).

  • RI - inverse
  • Na - inverse or inverse sqrt
  • Mg - no transformation
  • Al - sqrt or log
  • Si - inverse
  • K - log
  • Ca - inverse sqrt
  • Ba - log
  • Fe - log
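
The labels above follow the usual reading of the lambda estimate (-1 = inverse, -0.5 = inverse sqrt, 0 = log, 0.5 = sqrt, 1 = none, 2 = square). The helper below is a hypothetical illustration of that mapping, not part of the original analysis; it simply snaps each estimated lambda to the nearest conventional value.

# hypothetical helper: snap an estimated lambda to the nearest conventional transformation

lambda_label <- function(lambda) {
  grid   <- c(-1, -0.5, 0, 0.5, 1, 2)
  labels <- c("inverse", "inverse sqrt", "log", "sqrt", "none", "square")
  labels[which.min(abs(grid - lambda))]
}

glass %>%
  select(-Type) %>%
  map_dbl(BoxCox.lambda) %>%
  map_chr(lambda_label)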

A spatialSign transformation may also be appropriate for covariates that have a substantial number of outliers (particularly if those are influential observations). In the Glass dataset, the covariates Ca and Ba are good candidates for the spatialSign transform. The plots below compare untransformed, spatialSign-transformed, and lambda-transformed distributions for these covariates. The choice of one transformation over another is best evaluated with other model diagnostics and measures of fit.

#using boxcox from forecast pkg (we could use MASS as alternative)

g<-glass%>%select(-Type)
type<-glass%>%select(Type)

g%>%map(BoxCox.lambda)%>%
  as.data.frame()%>%
  flextable()%>%
  set_caption('Box-Cox Lambdas for Glass Predictors')
# Compare transformations on Ca and Ba distributions 

# center and scale the predictors (column 10 = Type), then project onto the unit sphere
spatial <- preProcess(glass[, -10], method = c('center', 'scale'))
spatial_2 <- predict(spatial, glass[, -10])
spatial_3 <- spatialSign(spatial_2)
spatial_3 <- as.data.frame(spatial_3)

ca1<-glass%>%ggplot(aes(x=Ca))+
  geom_histogram(fill='steelblue')+
  theme_classic()+
  labs(title='Untransformed')

ca2<-spatial_3%>%ggplot(aes(x=Ca))+
  geom_histogram(fill='steelblue')+
  theme_classic()+
  labs(title='Spatial Transform')

ca3<-glass%>%ggplot(aes(x=1/sqrt(Ca)))+
  geom_histogram(fill='steelblue')+
  theme_classic()+
  labs(title='1/sqrt Transform')

ba1<-glass%>%ggplot(aes(x=Ba))+
  geom_histogram(fill='steelblue')+
  theme_classic()+
  labs(title='Untransformed')

ba2<-spatial_3%>%ggplot(aes(x=Ba))+
  geom_histogram(fill='steelblue')+
  theme_classic()+
  labs(title='Spatial Transform')

ba3<-glass%>%ggplot(aes(x=log(Ba)))+
  geom_histogram(fill='steelblue')+
  theme_classic()+
  labs(title='Log Transform')

# plot comparisons

ca1|ca2|ca3

ba1|ba2|ba3

The following scatter plot matrices show pairwise comparisons between the untransformed and spatialSign-transformed covariates. The first matrix displays the untransformed observations.

#based on http://rismyhammer.com/ml/OutliersSpatialSign.html

#Plot spatial sign transformation using caret. Col 10 = 'Type'

trellis.par.set(theme = col.whitebg(), warn = FALSE)

featurePlot(glass[,-10], glass[,10], "pairs",  auto.key = list(columns = 10))

featurePlot(spatialSign(scale(glass[,-10])), glass[,10], "pairs", auto.key = list(columns = 10))

2 Kuhn and Johnson: Exercise 3.2

3.2. Soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

#load soybean data

data(Soybean)

soybean<-Soybean

# print selection of dataset

soybean%>%head(3)%>%
  flextable()%>%
  set_caption("Soybean Dataset: Subset of Observations")

3.2a. Investigate the frequency distributions for the categorical predictors.

The following table and plots provide a summary of the frequency distributions. They provide a quick means to screen for low/zero variance covariates (discussed in the next prompt) as well as class imbalance.

# calculate frequency distributions of categorical predictor variables

soybean%>%
  select(!Class)%>%
  diagnose_category()%>%
  flextable()%>%
  set_caption("Soybean: Frequency Statistics for Predictor Variables")
# plot frequency distributions for categorical predictors

soybean %>%
  select(1)%>%
  plot_bar_category(typographic = FALSE, each=FALSE)

soybean %>%
  select(2:12)%>%
  plot_bar_category(typographic = FALSE, each=FALSE)

soybean %>%
  select(13:24)%>%
  plot_bar_category(typographic = FALSE, each=FALSE)

soybean %>%
  select(25:36)%>%
  plot_bar_category(typographic = FALSE, each=FALSE)

Are any of the distributions degenerate in the ways discussed earlier in this chapter?

A degenerate distribution is one in which a predictor takes only a single value (zero variance). The Soybean dataset does not include covariates with strictly degenerate distributions. However, several covariates have near-zero variance (nearly degenerate), i.e., almost all observations fall in one level of the covariate. These include ‘mycelium’ and ‘sclerotia’.

soybean%>%
  diagnose_category()%>%
  filter(ratio>90)%>%
  arrange(desc(ratio))%>%
    flextable()%>%
    set_caption("Soybean: Predictor Levels Accounting for More Than 90% of Observations")
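
As a cross-check, caret's nearZeroVar() applies the frequency-ratio and percent-unique criteria described in the chapter. This is a minimal sketch using the function's default thresholds; it should flag the same near-degenerate covariates.

# flag near-zero-variance predictors using caret's default
# frequency-ratio (95/5) and percent-unique (10) thresholds

nearZeroVar(soybean %>% select(-Class), names = TRUE)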

3.2b. Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

The propensity for missingness among predictors likely owes more to how the data were collected than to anything we can infer from the available data. Variables that are difficult or costly to collect are more apt to have high proportions of missingness.

In the Soybean dataset there is a pattern of missing data that relates to the Class variable. For example, the following class levels have very high levels of missingness: 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury, and phytophthora-rot.

# calculate missingness statistics across variables

soybean%>%
    diagnose()%>%
    dplyr::select(-unique_count, -unique_rate)%>%
    filter(missing_count>0)%>%
    arrange(desc(missing_count))%>%
    flextable()%>%
    set_caption("Missing Data Summary: Soybean")
#plot missingness in relation to Class variable -- from Naniar package

gg_miss_fct(soybean, fct = Class)+labs(title='Proportion of Missing Data in Relation to Class Variable')
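
A tabular complement to the plot makes the class-specific pattern explicit. This is a minimal sketch using naniar's miss_var_summary() on data grouped by Class; the ten-row cutoff is arbitrary.

# per-class missingness for each predictor; the most affected
# class/variable combinations appear at the top

soybean %>%
  group_by(Class) %>%
  miss_var_summary() %>%
  ungroup() %>%
  arrange(desc(pct_miss)) %>%
  slice_head(n = 10) %>%
  flextable() %>%
  set_caption("Soybean: Highest Per-Class Missingness")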

3.2c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Given that the proportion of missing data within any single covariate is relatively low (under roughly 18%), I would be inclined to impute values using multivariate imputation by chained equations (MICE), a method that often works well for categorical variables. I would also evaluate options to reduce the scope of imputation beforehand: dropping covariates with degenerate or near-zero-variance distributions (e.g., mycelium, sclerotia), and dropping covariates that are highly correlated with other predictors, as judged by pairwise correlations and/or variance inflation factors.
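
A minimal sketch of that imputation step is shown below, assuming the mice package is available (it is not loaded above); the number of imputations and the seed are illustrative choices, and mice's default methods are left to handle the categorical predictors.

# multivariate imputation by chained equations on the predictors only;
# m = 5 imputations and the seed are illustrative, not tuned values

library(mice)

imputed <- soybean %>%
  select(-Class) %>%
  mice(m = 5, seed = 123, printFlag = FALSE)

# take the first completed dataset and reattach the outcome
soybean_complete <- mice::complete(imputed, 1) %>%
  mutate(Class = soybean$Class)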

There are other options for identifying and dropping covariates with low predictive value (e.g., PCA, information gain). I might turn to one or more of these if the options above do not yield satisfactory results.