Question 3.1
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consists of 214 glass samples labeled as one of several class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
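The data are included in the mlbench package, so loading them and printing their structure looks like this:
library(mlbench)
data(Glass)
str(Glass)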
'data.frame': 214 obs. of 10 variables:
$ RI : num 1.52 1.52 1.52 1.52 1.52 ...
$ Na : num 13.6 13.9 13.5 13.2 13.3 ...
$ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
$ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
$ Si : num 71.8 72.7 73 72.6 73.1 ...
$ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
$ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
$ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
$ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
$ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
- Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors
First we will take a look at the distribution of the predictors:
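One way to do this (a sketch using tidyr/ggplot2; the original figure may have been built differently) is to facet a histogram for each predictor:
library(tidyverse)

Glass %>%
  select(-Type) %>%
  pivot_longer(everything(), names_to = "predictor", values_to = "value") %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ predictor, scales = "free")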

Glass is primarily made of silica (Si), soda (Na), and lime (Ca), so seeing these predictors at higher concentrations is not surprising.
Now we will examine how the predictors are related to each other. We will do that with a correlation plot.
library(corrplot)

# ColorBrewer's 5-class Spectral color palette
col <- colorRampPalette(c("#d7191c", "#fdae61", "#ffffbf", "#abdda4", "#2b83ba"))

Glass %>%
  select(-Type) %>%
  cor() %>%
  round(2) %>%
  corrplot(method = "color", col = col(200), type = "upper", order = "hclust",
           addCoef.col = "black", tl.col = "black", tl.srt = 45, diag = FALSE)

Most of the predictors are negatively correlated, which makes sense: they measure chemical concentrations on a percentage basis, so as one element increases we would expect the others to decrease.
Most of the correlations are not very strong. The exception is calcium (Ca) and the refractive index (RI), which are strongly positively correlated. I am going to take some liberties and summarize the data in tabular form, because this “visualization” speaks to me:
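For the record, these figures can be reproduced with something as simple as the following sketch:
Glass %>%
  select(-Type) %>%
  summary()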
Predictor |      Min |   1st Qu. |    Median |       Mean |   3rd Qu. |      Max
Al        |  0.29000 |  1.190000 |   1.36000 |  1.4449065 |  1.630000 |  3.50000
Ba        |  0.00000 |  0.000000 |   0.00000 |  0.1750467 |  0.000000 |  3.15000
Ca        |  5.43000 |  8.240000 |   8.60000 |  8.9569626 |  9.172500 | 16.19000
Fe        |  0.00000 |  0.000000 |   0.00000 |  0.0570093 |  0.100000 |  0.51000
K         |  0.00000 |  0.122500 |   0.55500 |  0.4970561 |  0.610000 |  6.21000
Mg        |  0.00000 |  2.115000 |   3.48000 |  2.6845327 |  3.600000 |  4.49000
Na        | 10.73000 | 12.907500 |  13.30000 | 13.4078505 | 13.825000 | 17.38000
RI        |  1.51115 |  1.516522 |   1.51768 |  1.5183654 |  1.519157 |  1.53393
Si        | 69.81000 | 72.280000 |  72.79000 | 72.6509346 | 73.087500 | 75.41000
- Do there appear to be any outliers in the data? Are any predictors skewed?
I want to see how the predictors are distributed by the type of glass. I will use a scatter plot to do this, but I will exclude silica (Si) because of the difference in scale.
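A sketch of one way to draw this (jittered points of each predictor by glass type; the original plot may have been produced differently):
Glass %>%
  select(-Si) %>%
  pivot_longer(-Type, names_to = "predictor", values_to = "value") %>%
  ggplot(aes(Type, value)) +
  geom_jitter(width = 0.2, alpha = 0.5) +
  facet_wrap(~ predictor, scales = "free_y")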

It looks like glass types 1, 2, and 3 are very similar in chemical composition. A few observations appear to be outliers: for example, a couple of potassium (K) values in the type 5 glass are unusually high, and a barium (Ba) observation in type 2 glass appears to be an outlier, along with some calcium (Ca) observations in type 2 glass.
Magnesium is bimodal and left-skewed. Iron, potassium, and barium are right-skewed. The other predictors are roughly normal.
- Are there any relevant transformations of one or more predictors that might improve the classification model?
Something like a Box-Cox transformation might improve the classification model’s performance.
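As a sketch of what that might look like with the caret package (note that Box-Cox only applies to strictly positive predictors, so zero-heavy variables such as Ba and Fe would be skipped unless a Yeo-Johnson transformation is used instead):
library(caret)

# Estimate Box-Cox (plus centering/scaling) parameters on the predictors only;
# column 10 is the Type outcome
pp <- preProcess(Glass[, -10], method = c("BoxCox", "center", "scale"))
Glass_trans <- predict(pp, Glass[, -10])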
Question 3.2
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:
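library(mlbench)
data(Soybean)
# See ?Soybean for details on the predictors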
- Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
I am assuming the degenerate distributions discussed earlier in the chapter refer to Section 3.5 on removing predictors. Here are some frequency tables:
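These were generated roughly as follows (a sketch; freq_table is a hypothetical helper, not part of any package):
library(tidyverse)

# Hypothetical helper: counts and shares of each level, with NAs included
freq_table <- function(x) {
  tibble(level = x) %>%
    count(level, name = "n") %>%
    mutate(share = n / sum(n)) %>%
    arrange(desc(n))
}

Soybean %>%
  select(-Class) %>%
  map(freq_table)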
date | n | share
5 | 149 | 0.2181552
4 | 131 | 0.1918009
3 | 118 | 0.1727672
2 | 93 | 0.1361640
6 | 90 | 0.1317716
1 | 75 | 0.1098097
0 | 26 | 0.0380673
NA | 1 | 0.0014641

plant.stand | n | share
0 | 354 | 0.5183016
1 | 293 | 0.4289898
NA | 36 | 0.0527086

precip | n | share
2 | 459 | 0.6720351
1 | 112 | 0.1639824
0 | 74 | 0.1083455
NA | 38 | 0.0556369

temp | n | share
1 | 374 | 0.5475842
2 | 199 | 0.2913616
0 | 80 | 0.1171303
NA | 30 | 0.0439239

hail | n | share
0 | 435 | 0.6368960
1 | 127 | 0.1859444
NA | 121 | 0.1771596

crop.hist | n | share
2 | 219 | 0.3206442
3 | 218 | 0.3191801
1 | 165 | 0.2415813
0 | 65 | 0.0951684
NA | 16 | 0.0234261

area.dam | n | share
1 | 227 | 0.3323572
3 | 187 | 0.2737921
2 | 145 | 0.2122987
0 | 123 | 0.1800878
NA | 1 | 0.0014641

sever | n | share
1 | 322 | 0.4714495
0 | 195 | 0.2855051
NA | 121 | 0.1771596
2 | 45 | 0.0658858

seed.tmt | n | share
0 | 305 | 0.4465593
1 | 222 | 0.3250366
NA | 121 | 0.1771596
2 | 35 | 0.0512445

germ | n | share
1 | 213 | 0.3118594
2 | 193 | 0.2825769
0 | 165 | 0.2415813
NA | 112 | 0.1639824

plant.growth | n | share
0 | 441 | 0.6456808
1 | 226 | 0.3308931
NA | 16 | 0.0234261

leaves | n | share
1 | 606 | 0.8872621
0 | 77 | 0.1127379

leaf.halo | n | share
2 | 342 | 0.5007321
0 | 221 | 0.3235725
NA | 84 | 0.1229868
1 | 36 | 0.0527086

leaf.marg | n | share
0 | 357 | 0.5226940
2 | 221 | 0.3235725
NA | 84 | 0.1229868
1 | 21 | 0.0307467

leaf.size | n | share
1 | 327 | 0.4787701
2 | 221 | 0.3235725
NA | 84 | 0.1229868
0 | 51 | 0.0746706

leaf.shread | n | share
0 | 487 | 0.7130307
NA | 100 | 0.1464129
1 | 96 | 0.1405564

leaf.malf | n | share
0 | 554 | 0.8111274
NA | 84 | 0.1229868
1 | 45 | 0.0658858

leaf.mild | n | share
0 | 535 | 0.7833089
NA | 108 | 0.1581259
1 | 20 | 0.0292826
2 | 20 | 0.0292826

stem | n | share
1 | 371 | 0.5431918
0 | 296 | 0.4333821
NA | 16 | 0.0234261

lodging | n | share
0 | 520 | 0.7613470
NA | 121 | 0.1771596
1 | 42 | 0.0614934

stem.cankers | n | share
0 | 379 | 0.5549048
3 | 191 | 0.2796486
1 | 39 | 0.0571010
NA | 38 | 0.0556369
2 | 36 | 0.0527086

canker.lesion | n | share
0 | 320 | 0.4685212
2 | 177 | 0.2591508
1 | 83 | 0.1215227
3 | 65 | 0.0951684
NA | 38 | 0.0556369

fruiting.bodies | n | share
0 | 473 | 0.6925329
NA | 106 | 0.1551977
1 | 104 | 0.1522694

ext.decay | n | share
0 | 497 | 0.7276720
1 | 135 | 0.1976574
NA | 38 | 0.0556369
2 | 13 | 0.0190337

mycelium | n | share
0 | 639 | 0.9355783
NA | 38 | 0.0556369
1 | 6 | 0.0087848

int.discolor | n | share
0 | 581 | 0.8506589
1 | 44 | 0.0644217
NA | 38 | 0.0556369
2 | 20 | 0.0292826

sclerotia | n | share
0 | 625 | 0.9150805
NA | 38 | 0.0556369
1 | 20 | 0.0292826

fruit.pods | n | share
0 | 407 | 0.5959004
1 | 130 | 0.1903367
NA | 84 | 0.1229868
3 | 48 | 0.0702782
2 | 14 | 0.0204978

fruit.spots | n | share
0 | 345 | 0.5051245
NA | 106 | 0.1551977
4 | 100 | 0.1464129
1 | 75 | 0.1098097
2 | 57 | 0.0834553

seed | n | share
0 | 476 | 0.6969253
1 | 115 | 0.1683748
NA | 92 | 0.1346999

mold.growth | n | share
0 | 524 | 0.7672035
NA | 92 | 0.1346999
1 | 67 | 0.0980966

seed.discolor | n | share
0 | 513 | 0.7510981
NA | 106 | 0.1551977
1 | 64 | 0.0937042

seed.size | n | share
0 | 532 | 0.7789165
NA | 92 | 0.1346999
1 | 59 | 0.0863836

shriveling | n | share
0 | 539 | 0.7891654
NA | 106 | 0.1551977
1 | 38 | 0.0556369

roots | n | share
0 | 551 | 0.8067350
1 | 86 | 0.1259151
NA | 31 | 0.0453880
2 | 15 | 0.0219619
There are a lot of missing values. The authors recommend removing variables with near-zero variance, and the caret package has a function for that. Here is the output from that function:
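The function is nearZeroVar(); called with saveMetrics = TRUE it produces metrics of the kind shown below:
library(caret)

nearZeroVar(Soybean, saveMetrics = TRUE)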
Predictor | freqRatio | percentUnique | zeroVar | nzv
Class | 1.010989 | 2.7818448 | FALSE | FALSE
date | 1.137405 | 1.0248902 | FALSE | FALSE
plant.stand | 1.208191 | 0.2928258 | FALSE | FALSE
precip | 4.098214 | 0.4392387 | FALSE | FALSE
temp | 1.879397 | 0.4392387 | FALSE | FALSE
hail | 3.425197 | 0.2928258 | FALSE | FALSE
crop.hist | 1.004587 | 0.5856515 | FALSE | FALSE
area.dam | 1.213904 | 0.5856515 | FALSE | FALSE
sever | 1.651282 | 0.4392387 | FALSE | FALSE
seed.tmt | 1.373874 | 0.4392387 | FALSE | FALSE
germ | 1.103627 | 0.4392387 | FALSE | FALSE
plant.growth | 1.951327 | 0.2928258 | FALSE | FALSE
leaves | 7.870130 | 0.2928258 | FALSE | FALSE
leaf.halo | 1.547511 | 0.4392387 | FALSE | FALSE
leaf.marg | 1.615385 | 0.4392387 | FALSE | FALSE
leaf.size | 1.479638 | 0.4392387 | FALSE | FALSE
leaf.shread | 5.072917 | 0.2928258 | FALSE | FALSE
leaf.malf | 12.311111 | 0.2928258 | FALSE | FALSE
leaf.mild | 26.750000 | 0.4392387 | FALSE | TRUE
stem | 1.253378 | 0.2928258 | FALSE | FALSE
lodging | 12.380952 | 0.2928258 | FALSE | FALSE
stem.cankers | 1.984293 | 0.5856515 | FALSE | FALSE
canker.lesion | 1.807910 | 0.5856515 | FALSE | FALSE
fruiting.bodies | 4.548077 | 0.2928258 | FALSE | FALSE
ext.decay | 3.681481 | 0.4392387 | FALSE | FALSE
mycelium | 106.500000 | 0.2928258 | FALSE | TRUE
int.discolor | 13.204546 | 0.4392387 | FALSE | FALSE
sclerotia | 31.250000 | 0.2928258 | FALSE | TRUE
fruit.pods | 3.130769 | 0.5856515 | FALSE | FALSE
fruit.spots | 3.450000 | 0.5856515 | FALSE | FALSE
seed | 4.139130 | 0.2928258 | FALSE | FALSE
mold.growth | 7.820895 | 0.2928258 | FALSE | FALSE
seed.discolor | 8.015625 | 0.2928258 | FALSE | FALSE
seed.size | 9.016949 | 0.2928258 | FALSE | FALSE
shriveling | 14.184211 | 0.2928258 | FALSE | FALSE
roots | 6.406977 | 0.4392387 | FALSE | FALSE
There are three variables (leaf.mild, mycelium, sclerotia) that have near-zero variance and should probably be removed.
- Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
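One way to see this (a sketch; the original figure may have been produced differently) is a missingness map with rows kept in their original order, so that class blocks stay together:
Soybean %>%
  mutate(row = row_number()) %>%
  pivot_longer(-c(row, Class), names_to = "predictor", values_to = "value",
               values_transform = list(value = as.character)) %>%
  ggplot(aes(predictor, row, fill = is.na(value))) +
  geom_raster() +
  labs(fill = "missing") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))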

There are blocks of observations with missing values. Since the data are arranged by class, this suggests that the pattern of missing data is related to the classes.
- Develop a strategy for handling missing data, either by eliminating predictors or imputation.
I will eliminate the three near-zero-variance predictors. For all other predictors I will impute values. I don’t have any domain expertise that would inform the imputations, so I will use decision trees to (hopefully) produce good imputations; in my experience, decision trees perform really well. The dlookr package integrates well with the tidyverse.
library(dlookr)
Soybean_complete <- Soybean %>%
# Impute missing values using rpart
mutate(
date = imputate_na(Soybean, date, Class, method = "rpart", no_attrs = TRUE),
plant.stand = imputate_na(Soybean, plant.stand, Class, method = "rpart", no_attrs = TRUE),
precip = imputate_na(Soybean, precip, Class, method = "rpart", no_attrs = TRUE),
temp = imputate_na(Soybean, temp, Class, method = "rpart", no_attrs = TRUE),
hail = imputate_na(Soybean, hail, Class, method = "rpart", no_attrs = TRUE),
crop.hist = imputate_na(Soybean, crop.hist, Class, method = "rpart", no_attrs = TRUE),
area.dam = imputate_na(Soybean, area.dam, Class, method = "rpart", no_attrs = TRUE),
sever = imputate_na(Soybean, sever, Class, method = "rpart", no_attrs = TRUE),
seed.tmt = imputate_na(Soybean, seed.tmt, Class, method = "rpart", no_attrs = TRUE),
germ = imputate_na(Soybean, germ, Class, method = "rpart", no_attrs = TRUE),
plant.growth = imputate_na(Soybean, plant.growth, Class, method = "rpart", no_attrs = TRUE),
leaf.halo = imputate_na(Soybean, leaf.halo, Class, method = "rpart", no_attrs = TRUE),
leaf.marg = imputate_na(Soybean, leaf.marg, Class, method = "rpart", no_attrs = TRUE),
leaf.size = imputate_na(Soybean, leaf.size, Class, method = "rpart", no_attrs = TRUE),
leaf.shread = imputate_na(Soybean, leaf.shread, Class, method = "rpart", no_attrs = TRUE),
leaf.malf = imputate_na(Soybean, leaf.malf, Class, method = "rpart", no_attrs = TRUE),
stem = imputate_na(Soybean, stem, Class, method = "rpart", no_attrs = TRUE),
lodging = imputate_na(Soybean, lodging, Class, method = "rpart", no_attrs = TRUE),
stem.cankers = imputate_na(Soybean, stem.cankers, Class, method = "rpart", no_attrs = TRUE),
canker.lesion = imputate_na(Soybean, canker.lesion, Class, method = "rpart", no_attrs = TRUE),
fruiting.bodies = imputate_na(Soybean, fruiting.bodies, Class, method = "rpart", no_attrs = TRUE),
ext.decay = imputate_na(Soybean, ext.decay, Class, method = "rpart", no_attrs = TRUE),
int.discolor = imputate_na(Soybean, int.discolor, Class, method = "rpart", no_attrs = TRUE),
fruit.pods = imputate_na(Soybean, fruit.pods, Class, method = "rpart", no_attrs = TRUE),
seed = imputate_na(Soybean, seed, Class, method = "rpart", no_attrs = TRUE),
mold.growth = imputate_na(Soybean, mold.growth, Class, method = "rpart", no_attrs = TRUE),
seed.discolor = imputate_na(Soybean, seed.discolor, Class, method = "rpart", no_attrs = TRUE),
seed.size = imputate_na(Soybean, seed.size, Class, method = "rpart", no_attrs = TRUE),
shriveling = imputate_na(Soybean, shriveling, Class, method = "rpart", no_attrs = TRUE),
fruit.spots = imputate_na(Soybean, fruit.spots, Class, method = "rpart", no_attrs = TRUE),
roots = imputate_na(Soybean, roots, Class, method = "rpart", no_attrs = TRUE)) %>%
# Drop the near zero variance predictors
select(-leaf.mild, -mycelium, -sclerotia)
To verify that the imputation worked, I present the following:
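A quick check along these lines (not necessarily the original output) would be:
# Count remaining NAs per column; each should be zero if the imputation
# covered every predictor with missing values
colSums(is.na(Soybean_complete))

# Or a single overall count
sum(is.na(Soybean_complete))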
