DATA 624 Homework 4

Question 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consists of 214 glass samples labeled as one of several class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

'data.frame':   214 obs. of  10 variables:
 $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
 $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
 $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
 $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
 $ Si  : num  71.8 72.7 73 72.6 73.1 ...
 $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
 $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
 $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
 $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
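The structure printed above matches the Glass data frame in the mlbench package. A minimal sketch for loading it, assuming mlbench is installed:

```r
# Assumes the mlbench package is installed (install.packages("mlbench"))
library(mlbench)

data(Glass)   # loads the Glass data frame into the workspace
str(Glass)    # prints the structure of the data frame
```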
  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

First we will take a look at the distribution of the predictors:

Glass is primarily made of silica (Si), soda (Na) and lime (Ca). Seeing these predictors at higher concentrations is not surprising.
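Histograms of this kind can be drawn with base graphics. A minimal sketch, assuming the Glass data frame from the mlbench package; the grid layout and the number of breaks are illustrative choices:

```r
library(mlbench)
data(Glass)

predictors <- Glass[, names(Glass) != "Type"]  # drop the class label

# 3 x 3 grid of histograms, one panel per predictor
old_par <- par(mfrow = c(3, 3), mar = c(4, 4, 2, 1))
for (name in names(predictors)) {
  hist(predictors[[name]], breaks = 30, main = name, xlab = name)
}
par(old_par)  # restore the previous graphics settings
```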

Now we will examine how the predictors are related to each other. We will do that with a correlation plot.

Most of the predictors are negatively correlated, which makes sense. They are measuring chemical concentrations on a percentage basis. As one element increases we would expect a decrease in the others.
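The correlations behind such a plot can be computed with base R. A minimal sketch, assuming the Glass data frame from the mlbench package and Pearson correlations:

```r
library(mlbench)
data(Glass)

predictors <- Glass[, names(Glass) != "Type"]  # drop the class label
cor_matrix <- cor(predictors)                  # pairwise Pearson correlations
print(round(cor_matrix, 2))

# Share of distinct predictor pairs with a negative correlation
pair_cors <- cor_matrix[upper.tri(cor_matrix)]
print(mean(pair_cors < 0))

# Correlation between the refractive index and Ca
print(cor_matrix["RI", "Ca"])
```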

Most of the correlations are not very strong. The exception is the strong positive correlation between calcium oxide (Ca) and the refractive index (RI). I am going to take some liberties and summarize the data in tabular form, because this “visualization” speaks to me:

Predictor       Min    1st Qu.    Median        Mean    3rd Qu.       Max
Al          0.29000   1.190000   1.36000   1.4449065   1.630000   3.50000
Ba          0.00000   0.000000   0.00000   0.1750467   0.000000   3.15000
Ca          5.43000   8.240000   8.60000   8.9569626   9.172500  16.19000
Fe          0.00000   0.000000   0.00000   0.0570093   0.100000   0.51000
K           0.00000   0.122500   0.55500   0.4970561   0.610000   6.21000
Mg          0.00000   2.115000   3.48000   2.6845327   3.600000   4.49000
Na         10.73000  12.907500  13.30000  13.4078505  13.825000  17.38000
RI          1.51115   1.516522   1.51768   1.5183654   1.519157   1.53393
Si         69.81000  72.280000  72.79000  72.6509346  73.087500  75.41000
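A table like this can be built from summary(). A minimal sketch, assuming the Glass data frame from the mlbench package:

```r
library(mlbench)
data(Glass)

predictors <- Glass[, names(Glass) != "Type"]  # drop the class label

# summary() gives Min, 1st Qu., Median, Mean, 3rd Qu. and Max for a numeric vector;
# sapply() stacks these as columns, so transpose to get one row per predictor
summary_table <- t(sapply(predictors, summary))
summary_table <- summary_table[order(rownames(summary_table)), ]  # alphabetical rows
print(summary_table)
```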
  2. Do there appear to be any outliers in the data? Are any predictors skewed?

I want to see how the predictors are distributed by the type of glass. I will use a scatter plot to do this, but will exclude silica because of the difference in scale.

It looks like glass types 1, 2 and 3 are very similar in chemical composition. There are a couple of observations that appear to be outliers. For example, there are a couple of potassium (K) observations in the type 5 glass that are unusually high. There is a barium (Ba) observation in type 2 glass that appears to be an outlier, along with some calcium (Ca) observations in type 2 glass.

Magnesium is bimodal and left skewed. Iron, potassium and barium are right skewed. The other predictors are approximately normal.
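The direction of skew can also be checked numerically. A minimal sketch, assuming the Glass data frame from the mlbench package and the moment-based definition of sample skewness (negative values indicate left skew, positive values right skew); the skewness helper is written out here for illustration:

```r
library(mlbench)
data(Glass)

# Sample skewness: third central moment divided by the 3/2 power of the second
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

predictors <- Glass[, names(Glass) != "Type"]  # drop the class label
skew_values <- sort(sapply(predictors, skewness))
print(round(skew_values, 2))
```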

  3. Are there any relevant transformations of one or more predictors that might improve the classification model?

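Something like a Box-Cox transformation might improve the classification model’s performance by reducing the skew in the predictors. Box-Cox requires strictly positive values, so the predictors that contain zeros (Mg, K, Ba and Fe) would need an offset or an alternative such as the Yeo-Johnson transformation.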
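To make the idea concrete, the Box-Cox transform maps x to (x^lambda - 1) / lambda, or to log(x) when lambda is 0, with lambda chosen by maximum likelihood. The sketch below is a hand-rolled illustration on simulated right-skewed data, not the Glass predictors; the helper names box_cox, box_cox_loglik and skewness are hypothetical, and the search interval of -2 to 2 is an assumption.

```r
# Box-Cox transform of a strictly positive vector for a given lambda
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Profile log-likelihood of lambda under a normal model for the transformed values
box_cox_loglik <- function(lambda, x) {
  y <- box_cox(x, lambda)
  n <- length(x)
  -n / 2 * log(mean((y - mean(y))^2)) + (lambda - 1) * sum(log(x))
}

# Moment-based sample skewness, used only to compare before and after
skewness <- function(v) mean((v - mean(v))^3) / mean((v - mean(v))^2)^1.5

set.seed(624)
x <- rlnorm(500, meanlog = 0, sdlog = 0.8)  # simulated right-skewed data

# Pick the lambda that maximizes the profile log-likelihood
lambda_hat <- optimize(box_cox_loglik, interval = c(-2, 2), x = x,
                       maximum = TRUE)$maximum
x_transformed <- box_cox(x, lambda_hat)

print(lambda_hat)
print(c(before = skewness(x), after = skewness(x_transformed)))
```

In practice caret's preProcess() offers "BoxCox" and "YeoJohnson" methods that estimate lambda per predictor, which avoids writing this by hand.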

Question 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

I am assuming that the degenerate distributions discussed earlier in the chapter refer to Section 3.5 on removing predictors. Here are some frequency tables:

date n share
5 149 0.2181552
4 131 0.1918009
3 118 0.1727672
2 93 0.1361640
6 90 0.1317716
1 75 0.1098097
0 26 0.0380673
NA 1 0.0014641
plant.stand n share
0 354 0.5183016
1 293 0.4289898
NA 36 0.0527086
precip n share
2 459 0.6720351
1 112 0.1639824
0 74 0.1083455
NA 38 0.0556369
temp n share
1 374 0.5475842
2 199 0.2913616
0 80 0.1171303
NA 30 0.0439239
hail n share
0 435 0.6368960
1 127 0.1859444
NA 121 0.1771596
crop.hist n share
2 219 0.3206442
3 218 0.3191801
1 165 0.2415813
0 65 0.0951684
NA 16 0.0234261
area.dam n share
1 227 0.3323572
3 187 0.2737921
2 145 0.2122987
0 123 0.1800878
NA 1 0.0014641
sever n share
1 322 0.4714495
0 195 0.2855051
NA 121 0.1771596
2 45 0.0658858
seed.tmt n share
0 305 0.4465593
1 222 0.3250366
NA 121 0.1771596
2 35 0.0512445
germ n share
1 213 0.3118594
2 193 0.2825769
0 165 0.2415813
NA 112 0.1639824
plant.growth n share
0 441 0.6456808
1 226 0.3308931
NA 16 0.0234261
leaves n share
1 606 0.8872621
0 77 0.1127379
leaf.halo n share
2 342 0.5007321
0 221 0.3235725
NA 84 0.1229868
1 36 0.0527086
leaf.marg n share
0 357 0.5226940
2 221 0.3235725
NA 84 0.1229868
1 21 0.0307467
leaf.size n share
1 327 0.4787701
2 221 0.3235725
NA 84 0.1229868
0 51 0.0746706
leaf.shread n share
0 487 0.7130307
NA 100 0.1464129
1 96 0.1405564
leaf.malf n share
0 554 0.8111274
NA 84 0.1229868
1 45 0.0658858
leaf.mild n share
0 535 0.7833089
NA 108 0.1581259
1 20 0.0292826
2 20 0.0292826
stem n share
1 371 0.5431918
0 296 0.4333821
NA 16 0.0234261
lodging n share
0 520 0.7613470
NA 121 0.1771596
1 42 0.0614934
stem.cankers n share
0 379 0.5549048
3 191 0.2796486
1 39 0.0571010
NA 38 0.0556369
2 36 0.0527086
canker.lesion n share
0 320 0.4685212
2 177 0.2591508
1 83 0.1215227
3 65 0.0951684
NA 38 0.0556369
fruiting.bodies n share
0 473 0.6925329
NA 106 0.1551977
1 104 0.1522694
ext.decay n share
0 497 0.7276720
1 135 0.1976574
NA 38 0.0556369
2 13 0.0190337
mycelium n share
0 639 0.9355783
NA 38 0.0556369
1 6 0.0087848
int.discolor n share
0 581 0.8506589
1 44 0.0644217
NA 38 0.0556369
2 20 0.0292826
sclerotia n share
0 625 0.9150805
NA 38 0.0556369
1 20 0.0292826
fruit.pods n share
0 407 0.5959004
1 130 0.1903367
NA 84 0.1229868
3 48 0.0702782
2 14 0.0204978
fruit.spots n share
0 345 0.5051245
NA 106 0.1551977
4 100 0.1464129
1 75 0.1098097
2 57 0.0834553
seed n share
0 476 0.6969253
1 115 0.1683748
NA 92 0.1346999
mold.growth n share
0 524 0.7672035
NA 92 0.1346999
1 67 0.0980966
seed.discolor n share
0 513 0.7510981
NA 106 0.1551977
1 64 0.0937042
seed.size n share
0 532 0.7789165
NA 92 0.1346999
1 59 0.0863836
shriveling n share
0 539 0.7891654
NA 106 0.1551977
1 38 0.0556369
roots n share
0 551 0.8067350
1 86 0.1259151
NA 31 0.0453880
2 15 0.0219619
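The n and share columns above are counts and proportions with NA kept as its own category. A minimal sketch of that calculation on a toy vector rebuilt from the leaf.mild counts listed above (535 zeros, 20 ones, 20 twos, 108 missing); the helper frequency_table is hypothetical:

```r
# Toy vector rebuilt from the leaf.mild counts in the tables above
leaf_mild <- factor(c(rep("0", 535), rep("1", 20), rep("2", 20), rep(NA, 108)))

frequency_table <- function(x) {
  counts <- table(x, useNA = "ifany")   # keep NA as its own category
  result <- data.frame(
    value = names(counts),
    n = as.vector(counts),
    share = as.vector(counts) / length(x)
  )
  result[order(-result$n), ]            # most frequent value first
}

print(frequency_table(leaf_mild))
```

With the real data, the same helper could be applied to every predictor, for example with lapply(Soybean[-1], frequency_table), assuming Soybean has been loaded from the mlbench package.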

There are a lot of missing values. The authors recommend removing variables with near zero variance. The caret package has a function for that, nearZeroVar(). Here’s the output from that function:

freqRatio percentUnique zeroVar nzv
Class 1.010989 2.7818448 FALSE FALSE
date 1.137405 1.0248902 FALSE FALSE
plant.stand 1.208191 0.2928258 FALSE FALSE
precip 4.098214 0.4392387 FALSE FALSE
temp 1.879397 0.4392387 FALSE FALSE
hail 3.425197 0.2928258 FALSE FALSE
crop.hist 1.004587 0.5856515 FALSE FALSE
area.dam 1.213904 0.5856515 FALSE FALSE
sever 1.651282 0.4392387 FALSE FALSE
seed.tmt 1.373874 0.4392387 FALSE FALSE
germ 1.103627 0.4392387 FALSE FALSE
plant.growth 1.951327 0.2928258 FALSE FALSE
leaves 7.870130 0.2928258 FALSE FALSE
leaf.halo 1.547511 0.4392387 FALSE FALSE
leaf.marg 1.615385 0.4392387 FALSE FALSE
leaf.size 1.479638 0.4392387 FALSE FALSE
leaf.shread 5.072917 0.2928258 FALSE FALSE
leaf.malf 12.311111 0.2928258 FALSE FALSE
leaf.mild 26.750000 0.4392387 FALSE TRUE
stem 1.253378 0.2928258 FALSE FALSE
lodging 12.380952 0.2928258 FALSE FALSE
stem.cankers 1.984293 0.5856515 FALSE FALSE
canker.lesion 1.807910 0.5856515 FALSE FALSE
fruiting.bodies 4.548077 0.2928258 FALSE FALSE
ext.decay 3.681481 0.4392387 FALSE FALSE
mycelium 106.500000 0.2928258 FALSE TRUE
int.discolor 13.204546 0.4392387 FALSE FALSE
sclerotia 31.250000 0.2928258 FALSE TRUE
fruit.pods 3.130769 0.5856515 FALSE FALSE
fruit.spots 3.450000 0.5856515 FALSE FALSE
seed 4.139130 0.2928258 FALSE FALSE
mold.growth 7.820895 0.2928258 FALSE FALSE
seed.discolor 8.015625 0.2928258 FALSE FALSE
seed.size 9.016949 0.2928258 FALSE FALSE
shriveling 14.184211 0.2928258 FALSE FALSE
roots 6.406977 0.4392387 FALSE FALSE

There are three variables (leaf.mild, mycelium, sclerotia) that have a near zero variance, and should probably be removed.
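To show where the freqRatio and percentUnique columns come from, here is a hand-rolled sketch of the rule as I understand it: freqRatio is the count of the most common value over the count of the second most common, percentUnique is the number of distinct non-missing values as a percentage of all rows, and a predictor is flagged when freqRatio exceeds 95/5 while percentUnique is at most 10. The helper near_zero_var_metrics is hypothetical and the cutoffs are assumed to match caret's defaults; the toy vectors are rebuilt from the mycelium and leaf.malf counts above.

```r
near_zero_var_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)   # table() drops NA by default
  counts <- counts[counts > 0]                  # ignore unused factor levels
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else 0
  percent_unique <- 100 * length(counts) / length(x)
  zero_var <- length(counts) <= 1
  data.frame(
    freqRatio = freq_ratio,
    percentUnique = percent_unique,
    zeroVar = zero_var,
    nzv = (freq_ratio > freq_cut && percent_unique <= unique_cut) || zero_var
  )
}

# Toy vectors rebuilt from the frequency tables above
mycelium  <- c(rep(0, 639), rep(1, 6), rep(NA, 38))
leaf_malf <- c(rep(0, 554), rep(1, 45), rep(NA, 84))

print(near_zero_var_metrics(mycelium))
print(near_zero_var_metrics(leaf_malf))
```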

  2. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

There are blocks of observations that are missing. Since the data are arranged by the classes this suggests that the patterns of missing data are related to the classes.
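The link between missingness and class can also be checked directly. A minimal sketch, assuming the Soybean data frame from the mlbench package, that computes the share of incomplete rows per class and the share of missing values per predictor:

```r
library(mlbench)
data(Soybean)

# TRUE for rows with at least one missing predictor
incomplete_row <- !complete.cases(Soybean)

# Share of incomplete rows within each class
incomplete_by_class <- tapply(incomplete_row, Soybean$Class, mean)
print(round(sort(incomplete_by_class, decreasing = TRUE), 2))

# Share of missing values for each predictor, largest first
missing_by_predictor <- sort(colMeans(is.na(Soybean)), decreasing = TRUE)
print(round(head(missing_by_predictor), 3))
```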

  3. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

I will be eliminating the three near zero variance predictors. For all other predictors I will be imputing values. I don’t have any domain expertise that would inform the imputations, so I will be using decision trees to (hopefully) produce good imputations. It has been my experience that decision trees perform really well. The dlookr package integrates well with the tidyverse.

library(dplyr)   # provides %>%, mutate() and select()
library(dlookr)  # provides imputate_na()

Soybean_complete <- Soybean %>%
  # Impute missing values using rpart
  mutate(
    date = imputate_na(Soybean, date, Class, method = "rpart", no_attrs = TRUE),
    plant.stand = imputate_na(Soybean, plant.stand, Class, method = "rpart", no_attrs = TRUE),
    precip = imputate_na(Soybean, precip, Class, method = "rpart", no_attrs = TRUE),
    temp = imputate_na(Soybean, temp, Class, method = "rpart", no_attrs = TRUE),
    hail = imputate_na(Soybean, hail, Class, method = "rpart", no_attrs = TRUE),
    crop.hist = imputate_na(Soybean, crop.hist, Class, method = "rpart", no_attrs = TRUE),
    area.dam = imputate_na(Soybean, area.dam, Class, method = "rpart", no_attrs = TRUE),
    sever = imputate_na(Soybean, sever, Class, method = "rpart", no_attrs = TRUE),
    seed.tmt = imputate_na(Soybean, seed.tmt, Class, method = "rpart", no_attrs = TRUE),
    germ = imputate_na(Soybean, germ, Class, method = "rpart", no_attrs = TRUE),
    plant.growth = imputate_na(Soybean, plant.growth, Class, method = "rpart", no_attrs = TRUE),
    leaf.halo = imputate_na(Soybean, leaf.halo, Class, method = "rpart", no_attrs = TRUE),
    leaf.marg = imputate_na(Soybean, leaf.marg, Class, method = "rpart", no_attrs = TRUE),
    leaf.size = imputate_na(Soybean, leaf.size, Class, method = "rpart", no_attrs = TRUE),
    leaf.shread = imputate_na(Soybean, leaf.shread, Class, method = "rpart", no_attrs = TRUE),
    leaf.malf = imputate_na(Soybean, leaf.malf, Class, method = "rpart", no_attrs = TRUE),
    stem = imputate_na(Soybean, stem, Class, method = "rpart", no_attrs = TRUE),
    lodging = imputate_na(Soybean, lodging, Class, method = "rpart", no_attrs = TRUE),
    stem.cankers = imputate_na(Soybean, stem.cankers, Class, method = "rpart", no_attrs = TRUE),
    canker.lesion = imputate_na(Soybean, canker.lesion, Class, method = "rpart", no_attrs = TRUE),
    fruiting.bodies = imputate_na(Soybean, fruiting.bodies, Class, method = "rpart", no_attrs = TRUE),
    ext.decay = imputate_na(Soybean, ext.decay, Class, method = "rpart", no_attrs = TRUE),
    int.discolor = imputate_na(Soybean, int.discolor, Class, method = "rpart", no_attrs = TRUE),
    fruit.pods = imputate_na(Soybean, fruit.pods, Class, method = "rpart", no_attrs = TRUE),
    seed = imputate_na(Soybean, seed, Class, method = "rpart", no_attrs = TRUE),
    mold.growth = imputate_na(Soybean, mold.growth, Class, method = "rpart", no_attrs = TRUE),
    seed.discolor = imputate_na(Soybean, seed.discolor, Class, method = "rpart", no_attrs = TRUE),
    seed.size = imputate_na(Soybean, seed.size, Class, method = "rpart", no_attrs = TRUE),
    shriveling = imputate_na(Soybean, shriveling, Class, method = "rpart", no_attrs = TRUE),
    fruit.spots = imputate_na(Soybean, fruit.spots, Class, method = "rpart", no_attrs = TRUE),
    roots = imputate_na(Soybean, roots, Class, method = "rpart", no_attrs = TRUE)) %>%
  # Drop the near zero variance predictors
  select(-leaf.mild, -mycelium, -sclerotia) 
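As an aside on the idea behind tree-based imputation: fit a classification tree on the rows where a predictor is observed, then predict it for the rows where it is missing. The sketch below uses rpart directly on a toy data frame. The helper impute_with_tree and the toy data are hypothetical, and this is not necessarily how dlookr implements method = "rpart" internally.

```r
library(rpart)

# Toy data: x is fully determined by Class, with two values knocked out
toy <- data.frame(
  Class = factor(rep(c("a", "b"), each = 50)),
  x = factor(c(rep("0", 50), rep("1", 50)))
)
toy$x[c(3, 60)] <- NA

impute_with_tree <- function(data, target, predictors) {
  missing <- is.na(data[[target]])
  # Fit a classification tree on the rows where the target is observed
  fit <- rpart(reformulate(predictors, response = target),
               data = data[!missing, , drop = FALSE], method = "class")
  # Predict the target for the rows where it is missing
  data[[target]][missing] <- predict(fit, newdata = data[missing, , drop = FALSE],
                                     type = "class")
  data
}

toy_complete <- impute_with_tree(toy, "x", "Class")
print(toy_complete[c(3, 60), ])
```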

To prove that the imputation into Soybean_complete worked, I present the following:

2020-03-01