DATA 624 Homework 4

Question 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consists of 214 glass samples labeled as one of several class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)

'data.frame':   214 obs. of  10 variables:
 $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
 $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
 $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
 $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
 $ Si  : num  71.8 72.7 73 72.6 73.1 ...
 $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
 $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
 $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
 $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

First we will take a look at the distribution of the predictors:

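library(tidyverse)  # provides %>%, pivot_longer(), and ggplot2

# Reshape to one row per (sample, predictor) pair so the predictors can be faceted
long_glass <- Glass %>%
  pivot_longer(-Type, names_to = "Predictor", values_to = "Value", values_drop_na = TRUE) %>%
  mutate(Predictor = as.factor(Predictor))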

long_glass %>%
  ggplot(aes(Value, color = Predictor, fill = Predictor)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ Predictor, ncol = 3, scales = "free") +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  theme_light() +
  theme(legend.position = "none") +
  ggtitle("Distribution of Predictor Variables")

Glass is primarily made of silica (Si), soda (Na) and lime (Ca). Seeing these predictors at higher concentrations is not surprising.

Now we will examine how the predictors are related to each other. We will do that with a correlation plot.

library(corrplot)

# ColorBrewer's 5-class spectral color palette
col <- colorRampPalette(c("#d7191c", "#fdae61", "#ffffbf", "#abdda4", "#2b83ba"))

Glass %>%
  select(-Type) %>%
  cor() %>%
  round(., 2) %>%
  corrplot(., method="color", col=col(200), type="upper", order="hclust", addCoef.col = "black", tl.col="black", tl.srt=45, diag=FALSE )

Most of the predictors are negatively correlated, which makes sense. They are measuring chemical concentrations on a percentage basis. As one element increases we would expect a decrease in the others.

Most of the correlations are not very strong. The exception is calcium oxide (Ca) and the refractive index (RI), which are strongly positively correlated. I am going to take some liberties and summarize the data in tabular form, because this “visualization” speaks to me:

Predictor	Min	1st Qu.	Median	Mean	3rd Qu.	Max
Al	0.29000	1.190000	1.36000	1.4449065	1.630000	3.50000
Ba	0.00000	0.000000	0.00000	0.1750467	0.000000	3.15000
Ca	5.43000	8.240000	8.60000	8.9569626	9.172500	16.19000
Fe	0.00000	0.000000	0.00000	0.0570093	0.100000	0.51000
K	0.00000	0.122500	0.55500	0.4970561	0.610000	6.21000
Mg	0.00000	2.115000	3.48000	2.6845327	3.600000	4.49000
Na	10.73000	12.907500	13.30000	13.4078505	13.825000	17.38000
RI	1.51115	1.516522	1.51768	1.5183654	1.519157	1.53393
Si	69.81000	72.280000	72.79000	72.6509346	73.087500	75.41000
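
The code that produced the table above is not shown. As a minimal sketch, assuming base R only, a table with the same columns could be built as follows. The helper name `summarize_predictors` is hypothetical, and the Glass call only runs if the mlbench package is installed.

```r
# Build a one-row-per-predictor summary table from the numeric columns
# of a data frame. `summarize_predictors` is a hypothetical helper name.
summarize_predictors <- function(df) {
  numeric_cols <- df[vapply(df, is.numeric, logical(1))]
  stats <- t(vapply(numeric_cols, function(x) {
    q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE, names = FALSE)
    c(Min = q[1], `1st Qu.` = q[2], Median = q[3],
      Mean = mean(x, na.rm = TRUE), `3rd Qu.` = q[4], Max = q[5])
  }, numeric(6)))
  # Order rows alphabetically by predictor name, as in the table above
  stats <- stats[order(rownames(stats)), , drop = FALSE]
  data.frame(Predictor = rownames(stats), stats,
             row.names = NULL, check.names = FALSE)
}

# Small made-up data frame to show the shape of the result
toy <- data.frame(a = c(1, 2, 3, 4),
                  b = c(0, 0, 0, 8),
                  g = factor(c("x", "x", "y", "y")))
toy_summary <- summarize_predictors(toy)
print(toy_summary)

# Apply to the Glass data when mlbench is available
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(summarize_predictors(Glass))
}
```

The factor column is skipped automatically, so `Type` does not need to be removed first.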

Do there appear to be any outliers in the data? Are any predictors skewed?

I want to see how the predictors are distributed by type of glass. I will use a jittered scatter plot to do this, but will exclude silica (Si) because of the difference in scale.

long_glass %>%
  ggplot(aes(x = Type, y = Value, color = Predictor)) +
  geom_jitter() +
  ylim(0, 20) +  # Si values (roughly 70 to 75) fall outside this range and are dropped
  scale_color_brewer(palette = "Set1") +
  theme_light()

It looks like glass types 1, 2 and 3 are very similar in chemical composition. There are a few observations that appear to be outliers. For example, there are a couple of potassium (K) observations in the type 5 glass that are unusually high. There is a barium (Ba) observation in type 2 glass that appears to be an outlier, along with some calcium (Ca) observations in type 2 glass.

Magnesium is bimodal and left skewed. Iron, potassium and barium are right skewed. The other predictors are somewhat normal.
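
The skewness judgments above are made by eye from the histograms. As a rough numerical check, the moment-based sample skewness can be computed per predictor. This is a minimal sketch in base R; the `skewness` helper is defined here for illustration and is not a function used elsewhere in this report.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
# Positive values indicate a right skew, negative values a left skew.
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Made-up vectors to show the sign convention
symmetric  <- skewness(c(1, 2, 3, 4, 5))
right_tail <- skewness(c(0, 0, 0, 0, 10))
left_tail  <- skewness(c(0, 10, 10, 10, 10))
print(c(symmetric = symmetric, right_tail = right_tail, left_tail = left_tail))

# Apply to the Glass predictors when mlbench is available
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  glass_skew <- sapply(Glass[names(Glass) != "Type"], skewness)
  print(round(sort(glass_skew, decreasing = TRUE), 2))
}
```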

Are there any relevant transformations of one or more predictors that might improve the classification model?

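Something like a Box-Cox transformation might improve the classification model’s performance by reducing skewness. Box-Cox requires strictly positive values, and the summary table shows that Ba, Fe, K and Mg all have a minimum of zero, so those predictors would need an offset or a different transformation (Yeo-Johnson, for example).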
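
To make the Box-Cox idea concrete, here is a minimal sketch in base R that estimates the transformation parameter lambda by maximizing the profile log-likelihood. The helper names (`box_cox`, `box_cox_loglik`, `estimate_lambda`) and the search interval of -2 to 2 are assumptions made for this illustration, and the demonstration uses simulated log-normal data rather than the Glass predictors.

```r
# Box-Cox transform for strictly positive x; lambda = 0 is the log transform
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Profile log-likelihood of lambda, assuming the transformed data are normal
box_cox_loglik <- function(lambda, x) {
  y <- box_cox(x, lambda)
  n <- length(x)
  -n / 2 * log(mean((y - mean(y))^2)) + (lambda - 1) * sum(log(x))
}

# Pick the lambda that maximizes the profile log-likelihood
estimate_lambda <- function(x, interval = c(-2, 2)) {
  stopifnot(all(x > 0))  # Box-Cox is undefined for zero or negative values
  optimize(box_cox_loglik, interval = interval, x = x, maximum = TRUE)$maximum
}

# Same moment-based skewness measure, used to compare before and after
skew <- function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5

set.seed(624)
skewed <- rlnorm(500)  # simulated right-skewed, strictly positive data
lambda_hat <- estimate_lambda(skewed)
transformed <- box_cox(skewed, lambda_hat)
print(c(lambda = lambda_hat, skew_before = skew(skewed), skew_after = skew(transformed)))

# Al is strictly positive in the Glass data (minimum 0.29), so it qualifies
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(estimate_lambda(Glass$Al))
}
```

In practice the caret package can do this for a whole data frame through `preProcess()` with `method = "BoxCox"` or `method = "YeoJohnson"`; the hand-rolled version above is only meant to show what is being estimated.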

Question 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

library(mlbench)
data(Soybean)
## See ?Soybean for details

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

I am assuming the degenerate distributions discussed earlier in the chapter refer to section 3.5 on removing predictors. Here are some frequency tables:

library(knitr)       # kable()
library(kableExtra)  # kable_styling()

# Print a frequency table (count and share, including NA) for each predictor
for (predictor in names(select(Soybean, -Class))) {
  Soybean %>%
    group_by(across(all_of(predictor))) %>%
    tally() %>%
    arrange(desc(n)) %>%
    mutate(share = n / sum(n)) %>%
    kable() %>%
    kable_styling() %>%
    print()
}

date	n	share
5	149	0.2181552
4	131	0.1918009
3	118	0.1727672
2	93	0.1361640
6	90	0.1317716
1	75	0.1098097
0	26	0.0380673
NA	1	0.0014641

plant.stand	n	share
0	354	0.5183016
1	293	0.4289898
NA	36	0.0527086

precip	n	share
2	459	0.6720351
1	112	0.1639824
0	74	0.1083455
NA	38	0.0556369

temp	n	share
1	374	0.5475842
2	199	0.2913616
0	80	0.1171303
NA	30	0.0439239

hail	n	share
0	435	0.6368960
1	127	0.1859444
NA	121	0.1771596

crop.hist	n	share
2	219	0.3206442
3	218	0.3191801
1	165	0.2415813
0	65	0.0951684
NA	16	0.0234261

area.dam	n	share
1	227	0.3323572
3	187	0.2737921
2	145	0.2122987
0	123	0.1800878
NA	1	0.0014641

sever	n	share
1	322	0.4714495
0	195	0.2855051
NA	121	0.1771596
2	45	0.0658858

seed.tmt	n	share
0	305	0.4465593
1	222	0.3250366
NA	121	0.1771596
2	35	0.0512445

germ	n	share
1	213	0.3118594
2	193	0.2825769
0	165	0.2415813
NA	112	0.1639824

plant.growth	n	share
0	441	0.6456808
1	226	0.3308931
NA	16	0.0234261

leaves	n	share
1	606	0.8872621
0	77	0.1127379

leaf.halo	n	share
2	342	0.5007321
0	221	0.3235725
NA	84	0.1229868
1	36	0.0527086

leaf.marg	n	share
0	357	0.5226940
2	221	0.3235725
NA	84	0.1229868
1	21	0.0307467

leaf.size	n	share
1	327	0.4787701
2	221	0.3235725
NA	84	0.1229868
0	51	0.0746706

leaf.shread	n	share
0	487	0.7130307
NA	100	0.1464129
1	96	0.1405564

leaf.malf	n	share
0	554	0.8111274
NA	84	0.1229868
1	45	0.0658858

leaf.mild	n	share
0	535	0.7833089
NA	108	0.1581259
1	20	0.0292826
2	20	0.0292826

stem	n	share
1	371	0.5431918
0	296	0.4333821
NA	16	0.0234261

lodging	n	share
0	520	0.7613470
NA	121	0.1771596
1	42	0.0614934

stem.cankers	n	share
0	379	0.5549048
3	191	0.2796486
1	39	0.0571010
NA	38	0.0556369
2	36	0.0527086

canker.lesion	n	share
0	320	0.4685212
2	177	0.2591508
1	83	0.1215227
3	65	0.0951684
NA	38	0.0556369

fruiting.bodies	n	share
0	473	0.6925329
NA	106	0.1551977
1	104	0.1522694

ext.decay	n	share
0	497	0.7276720
1	135	0.1976574
NA	38	0.0556369
2	13	0.0190337

mycelium	n	share
0	639	0.9355783
NA	38	0.0556369
1	6	0.0087848

int.discolor	n	share
0	581	0.8506589
1	44	0.0644217
NA	38	0.0556369
2	20	0.0292826

sclerotia	n	share
0	625	0.9150805
NA	38	0.0556369
1	20	0.0292826

fruit.pods	n	share
0	407	0.5959004
1	130	0.1903367
NA	84	0.1229868
3	48	0.0702782
2	14	0.0204978

fruit.spots	n	share
0	345	0.5051245
NA	106	0.1551977
4	100	0.1464129
1	75	0.1098097
2	57	0.0834553

seed	n	share
0	476	0.6969253
1	115	0.1683748
NA	92	0.1346999

mold.growth	n	share
0	524	0.7672035
NA	92	0.1346999
1	67	0.0980966

seed.discolor	n	share
0	513	0.7510981
NA	106	0.1551977
1	64	0.0937042

seed.size	n	share
0	532	0.7789165
NA	92	0.1346999
1	59	0.0863836

shriveling	n	share
0	539	0.7891654
NA	106	0.1551977
1	38	0.0556369

roots	n	share
0	551	0.8067350
1	86	0.1259151
NA	31	0.0453880
2	15	0.0219619

There are a lot of missing values. The authors recommend removing predictors with near-zero variance. The caret package has a function for that; here is its output:

library(caret)

nearZeroVar(Soybean, saveMetrics = TRUE) %>%
  kable() %>%
  kable_styling()

	freqRatio	percentUnique	zeroVar	nzv
Class	1.010989	2.7818448	FALSE	FALSE
date	1.137405	1.0248902	FALSE	FALSE
plant.stand	1.208191	0.2928258	FALSE	FALSE
precip	4.098214	0.4392387	FALSE	FALSE
temp	1.879397	0.4392387	FALSE	FALSE
hail	3.425197	0.2928258	FALSE	FALSE
crop.hist	1.004587	0.5856515	FALSE	FALSE
area.dam	1.213904	0.5856515	FALSE	FALSE
sever	1.651282	0.4392387	FALSE	FALSE
seed.tmt	1.373874	0.4392387	FALSE	FALSE
germ	1.103627	0.4392387	FALSE	FALSE
plant.growth	1.951327	0.2928258	FALSE	FALSE
leaves	7.870130	0.2928258	FALSE	FALSE
leaf.halo	1.547511	0.4392387	FALSE	FALSE
leaf.marg	1.615385	0.4392387	FALSE	FALSE
leaf.size	1.479638	0.4392387	FALSE	FALSE
leaf.shread	5.072917	0.2928258	FALSE	FALSE
leaf.malf	12.311111	0.2928258	FALSE	FALSE
leaf.mild	26.750000	0.4392387	FALSE	TRUE
stem	1.253378	0.2928258	FALSE	FALSE
lodging	12.380952	0.2928258	FALSE	FALSE
stem.cankers	1.984293	0.5856515	FALSE	FALSE
canker.lesion	1.807910	0.5856515	FALSE	FALSE
fruiting.bodies	4.548077	0.2928258	FALSE	FALSE
ext.decay	3.681481	0.4392387	FALSE	FALSE
mycelium	106.500000	0.2928258	FALSE	TRUE
int.discolor	13.204546	0.4392387	FALSE	FALSE
sclerotia	31.250000	0.2928258	FALSE	TRUE
fruit.pods	3.130769	0.5856515	FALSE	FALSE
fruit.spots	3.450000	0.5856515	FALSE	FALSE
seed	4.139130	0.2928258	FALSE	FALSE
mold.growth	7.820895	0.2928258	FALSE	FALSE
seed.discolor	8.015625	0.2928258	FALSE	FALSE
seed.size	9.016949	0.2928258	FALSE	FALSE
shriveling	14.184211	0.2928258	FALSE	FALSE
roots	6.406977	0.4392387	FALSE	FALSE

There are three variables (leaf.mild, mycelium, sclerotia) that have near-zero variance and should probably be removed.
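
To show where the flagged values come from, the sketch below recomputes the two metrics by hand for leaf.mild and plant.stand, using the counts from the frequency tables above. It follows the definitions as I understand them from the caret documentation (frequency ratio of the two most common non-missing values, and unique non-missing values as a percentage of all rows, with default cutoffs of 95/5 and 10). The helper `nzv_metrics` is hypothetical and is not caret's implementation.

```r
# Hand-rolled version of the near-zero-variance metrics for one predictor.
# `nzv_metrics` is a hypothetical helper, not part of caret.
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- table(x)  # table() drops NA by default
  counts <- sort(counts[counts > 0], decreasing = TRUE)
  zero_var <- length(counts) <= 1
  # Ratio of the most common value to the second most common value
  freq_ratio <- if (zero_var) 0 else counts[[1]] / counts[[2]]
  # Distinct non-missing values as a percentage of all rows
  percent_unique <- 100 * length(counts) / length(x)
  data.frame(
    freqRatio = freq_ratio,
    percentUnique = percent_unique,
    zeroVar = zero_var,
    nzv = (freq_ratio > freq_cut && percent_unique <= unique_cut) || zero_var
  )
}

# Rebuild two predictors from the frequency tables shown earlier
leaf_mild   <- rep(c("0", "1", "2", NA), times = c(535, 20, 20, 108))
plant_stand <- rep(c("0", "1", NA), times = c(354, 293, 36))

leaf_mild_metrics   <- nzv_metrics(leaf_mild)
plant_stand_metrics <- nzv_metrics(plant_stand)
print(rbind(leaf.mild = leaf_mild_metrics, plant.stand = plant_stand_metrics))
```

For leaf.mild the ratio is 535 / 20 = 26.75, which exceeds the cutoff of 19, matching the freqRatio column in the caret output.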

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

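library(Amelia)  # provides missmap()

# Sort by class so that class-related missingness shows up as horizontal bands
Soybean %>%
  arrange(Class) %>%
  missmap(main = "Missing vs Observed")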

There are blocks of observations with missing values. Since the rows are arranged by class, this suggests that the pattern of missing data is related to the classes.
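
The map gives a visual impression; the same two questions can be answered with numbers. The following is a minimal sketch in base R, with hypothetical helper names, that computes the share of missing values per predictor and the share of incomplete rows per class. It is demonstrated on a small made-up data frame and applied to Soybean only if mlbench is installed.

```r
# Share of missing values in each column, largest first
missing_by_predictor <- function(df) {
  sort(colMeans(is.na(df)), decreasing = TRUE)
}

# Share of rows in each class with at least one missing predictor
missing_by_class <- function(df, class_col) {
  predictors <- df[setdiff(names(df), class_col)]
  incomplete <- !complete.cases(predictors)
  vapply(split(incomplete, df[[class_col]]), mean, numeric(1))
}

# Made-up example: all missing values sit in class "b"
toy <- data.frame(
  Class = factor(c("a", "a", "b", "b", "c", "c")),
  x1 = c(1, 2, NA, NA, 5, 6),
  x2 = c("u", "v", NA, "w", "x", "y")
)
toy_by_predictor <- missing_by_predictor(toy)
toy_by_class <- missing_by_class(toy, "Class")
print(toy_by_predictor)
print(toy_by_class)

# Apply to the Soybean data when mlbench is available
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  print(round(missing_by_predictor(Soybean), 3))
  print(round(missing_by_class(Soybean, "Class"), 3))
}
```

A class whose share is at or near 1 while most others are at 0 would support the conclusion that missingness is tied to the class.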

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

I will be eliminating the three near-zero-variance predictors. For all other predictors I will be imputing values. I don’t have any domain expertise that would inform the imputations, so I will be using decision trees to (hopefully) produce good imputations. It has been my experience that decision trees perform really well. The dlookr package integrates well with the tidyverse.

library(dlookr)

Soybean_complete <- Soybean %>%
  # Impute missing values using rpart
  mutate(
    date = imputate_na(Soybean, date, Class, method = "rpart", no_attrs = TRUE),
    plant.stand = imputate_na(Soybean, plant.stand, Class, method = "rpart", no_attrs = TRUE),
    precip = imputate_na(Soybean, precip, Class, method = "rpart", no_attrs = TRUE),
    temp = imputate_na(Soybean, temp, Class, method = "rpart", no_attrs = TRUE),
    hail = imputate_na(Soybean, hail, Class, method = "rpart", no_attrs = TRUE),
    crop.hist = imputate_na(Soybean, crop.hist, Class, method = "rpart", no_attrs = TRUE),
    area.dam = imputate_na(Soybean, area.dam, Class, method = "rpart", no_attrs = TRUE),
    sever = imputate_na(Soybean, sever, Class, method = "rpart", no_attrs = TRUE),
    seed.tmt = imputate_na(Soybean, seed.tmt, Class, method = "rpart", no_attrs = TRUE),
    germ = imputate_na(Soybean, germ, Class, method = "rpart", no_attrs = TRUE),
    plant.growth = imputate_na(Soybean, plant.growth, Class, method = "rpart", no_attrs = TRUE),
    leaf.halo = imputate_na(Soybean, leaf.halo, Class, method = "rpart", no_attrs = TRUE),
    leaf.marg = imputate_na(Soybean, leaf.marg, Class, method = "rpart", no_attrs = TRUE),
    leaf.size = imputate_na(Soybean, leaf.size, Class, method = "rpart", no_attrs = TRUE),
    leaf.shread = imputate_na(Soybean, leaf.shread, Class, method = "rpart", no_attrs = TRUE),
    leaf.malf = imputate_na(Soybean, leaf.malf, Class, method = "rpart", no_attrs = TRUE),
    stem = imputate_na(Soybean, stem, Class, method = "rpart", no_attrs = TRUE),
    lodging = imputate_na(Soybean, lodging, Class, method = "rpart", no_attrs = TRUE),
    stem.cankers = imputate_na(Soybean, stem.cankers, Class, method = "rpart", no_attrs = TRUE),
    canker.lesion = imputate_na(Soybean, canker.lesion, Class, method = "rpart", no_attrs = TRUE),
    fruiting.bodies = imputate_na(Soybean, fruiting.bodies, Class, method = "rpart", no_attrs = TRUE),
    ext.decay = imputate_na(Soybean, ext.decay, Class, method = "rpart", no_attrs = TRUE),
    int.discolor = imputate_na(Soybean, int.discolor, Class, method = "rpart", no_attrs = TRUE),
    fruit.pods = imputate_na(Soybean, fruit.pods, Class, method = "rpart", no_attrs = TRUE),
    seed = imputate_na(Soybean, seed, Class, method = "rpart", no_attrs = TRUE),
    mold.growth = imputate_na(Soybean, mold.growth, Class, method = "rpart", no_attrs = TRUE),
    seed.discolor = imputate_na(Soybean, seed.discolor, Class, method = "rpart", no_attrs = TRUE),
    seed.size = imputate_na(Soybean, seed.size, Class, method = "rpart", no_attrs = TRUE),
    shriveling = imputate_na(Soybean, shriveling, Class, method = "rpart", no_attrs = TRUE),
    fruit.spots = imputate_na(Soybean, fruit.spots, Class, method = "rpart", no_attrs = TRUE),
    roots = imputate_na(Soybean, roots, Class, method = "rpart", no_attrs = TRUE)) %>%
  # Drop the near zero variance predictors
  select(-leaf.mild, -mycelium, -sclerotia)

To confirm that it worked, here is the missingness map after imputation:

Soybean_complete %>%
  arrange(Class) %>%
  missmap(main = "Missing vs Observed")
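
For readers unfamiliar with tree-based imputation, the general idea can be sketched directly with the rpart package: fit a classification tree on the rows where a predictor is observed, then predict it for the rows where it is missing. This is an illustration on made-up data, not dlookr's internal implementation; the helper `impute_with_tree` and the `spot` column are hypothetical.

```r
library(rpart)

# Fill missing values of a factor column using a classification tree
# fitted on the rows where that column is observed.
impute_with_tree <- function(df, target, predictors) {
  observed <- !is.na(df[[target]])
  fit <- rpart(reformulate(predictors, response = target),
               data = df[observed, , drop = FALSE], method = "class")
  filled <- df[[target]]
  filled[!observed] <- predict(fit, newdata = df[!observed, , drop = FALSE],
                               type = "class")
  filled
}

# Made-up data: `spot` is "0" for class A and "1" for class B
toy <- data.frame(
  Class = factor(rep(c("A", "B"), each = 30)),
  spot  = factor(rep(c("0", "1"), each = 30))
)
toy$spot[c(1, 2, 31, 32)] <- NA  # knock out two values per class

toy$spot_imputed <- impute_with_tree(toy, "spot", "Class")
print(table(toy$Class, toy$spot_imputed, useNA = "ifany"))
```

Because the tree only sees the observed rows, the quality of the imputation depends on those rows being representative of the missing ones, which is worth keeping in mind when missingness is concentrated in a few classes.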