library(GGally)
library(mlbench)
library(tidyverse)
DATA 624: Assignment 04
Setup
<- function(x) {
skewness <- length(x)
n <- mean(x)
m <- sd(x)
s sum((x - m)^3) / ((n - 1) * s^3)
}
Assignment
Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your Rpubs link along with your .pdf
for your run code.
Problem 3.1
The UC Irvine Machine Learning Repository 6 contains a dataset related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
The problem is to forecast the type of class on basis of the chemical analysis. The study of classification of types of glass was motivated by criminological investigation. At the scene of the crime, the glass left can be used as evidence (if it is correctly identified!).
data(Glass)
glimpse(Glass)
Rows: 214
Columns: 10
$ RI <dbl> 1.52101, 1.51761, 1.51618, 1.51766, 1.51742, 1.51596, 1.51743, 1.…
$ Na <dbl> 13.64, 13.89, 13.53, 13.21, 13.27, 12.79, 13.30, 13.15, 14.04, 13…
$ Mg <dbl> 4.49, 3.60, 3.55, 3.69, 3.62, 3.61, 3.60, 3.61, 3.58, 3.60, 3.46,…
$ Al <dbl> 1.10, 1.36, 1.54, 1.29, 1.24, 1.62, 1.14, 1.05, 1.37, 1.36, 1.56,…
$ Si <dbl> 71.78, 72.73, 72.99, 72.61, 73.08, 72.97, 73.09, 73.24, 72.08, 72…
$ K <dbl> 0.06, 0.48, 0.39, 0.57, 0.55, 0.64, 0.58, 0.57, 0.56, 0.57, 0.67,…
$ Ca <dbl> 8.75, 7.83, 7.78, 8.22, 8.07, 8.07, 8.17, 8.24, 8.30, 8.40, 8.09,…
$ Ba <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Fe <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.26, 0.00, 0.00, 0.00, 0.11, 0.24,…
$ Type <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
summary(Glass)
RI Na Mg Al
Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
Median :1.518 Median :13.30 Median :3.480 Median :1.360
Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
Si K Ca Ba
Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
Fe Type
Min. :0.00000 1:70
1st Qu.:0.00000 2:76
Median :0.00000 3:17
Mean :0.05701 5:13
3rd Qu.:0.10000 6: 9
Max. :0.51000 7:29
Element Unit?
It looks like the unit of each element predictor is a percentage of the composition of the glass. I’m not sure about RI (refractive index). Let’s try confirm this suspicion. Each row’s elements should be sum to 100.
|>
Glass mutate(Total = Na + Mg + Al + Si + K + Ca + Ba + Fe) |>
summarize(
Min = min(Total),
Max = max(Total)
)
Min Max
1 99.02 100.1
Alright, that’s enough confirmation for me that the predictors are all percentages of the total composition of each observation. I think it’s safe to assume that the refractive index is a measure of how much light is bent when it passes through the glass. What are its units? According to Wikipedia, “…the refractive index (or refraction index) of an optical medium is the ratio of the apparent speed of light in the air or vacuum to the speed in the medium.”
3.1.1
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Visualizations
First, let’s get a feeling for the distribution of the predictor variables. We shall create histograms and boxplots for each of the predictors.
|>
Glass pivot_longer(cols = Na:Fe, names_to = "Predictor", values_to = "Value") |>
group_by(Type, Predictor) |>
summarize(Value = mean(Value)) |>
# get this guy to sort. I remember this solution from 607. I still think its weird.
mutate(
Predictor = factor(Predictor, levels = unique(Predictor[order(Value)]))
|>
) ggplot(mapping = aes(x = Predictor, y = Value)) +
geom_col() +
facet_wrap(~Type, scales = "free") +
labs(
title = "Composition of Glass by Type",
x = "Element",
y = "Percentage"
)
`summarise()` has grouped output by 'Type'. You can override using the
`.groups` argument.
It’s looking very exponential. Let’s see it without Si which dominates
|>
Glass pivot_longer(cols = c(Na, Mg, Al, K, Ca, Ba, Fe), names_to = "Predictor", values_to = "Value") |>
group_by(Type, Predictor) |>
summarize(Value = mean(Value)) |>
mutate(
Predictor = factor(Predictor, levels = unique(Predictor[order(Value)]))
|>
) ggplot(mapping = aes(x = Predictor, y = Value)) +
geom_col() +
facet_wrap(~Type, scales = "free") +
labs(
title = "Composition of Glass (minus Si) by Type",
x = "Element",
y = "Total Percentage"
)
`summarise()` has grouped output by 'Type'. You can override using the
`.groups` argument.
|>
Glass pivot_longer(cols = -Type, names_to = "Predictor", values_to = "Value") |>
ggplot(mapping = aes(x = Value)) +
geom_histogram(bins = 30) +
facet_wrap(~Predictor, scales = "free") +
labs(
title = "Distribution of Predictors in Glass Dataset",
x = "Value",
y = "Count"
)
|>
Glass pivot_longer(cols = -Type, names_to = "Predictor", values_to = "Value") |>
ggplot(mapping = aes(x = Value)) +
geom_boxplot() +
facet_wrap(~Predictor, scales = "free") +
labs(
title = "Boxplots of Predictors in Glass Dataset",
x = "Value",
y = "Count"
)
|>
Glass select(RI:Fe) |>
ggpairs(
title = "Pairwise Scatterplots of Predictors in Glass Dataset"
)
3.1.2
Do there appear to be any outliers in the data? Are any predictors skewed?
Answer
Yes, as per the \(\pm 1.5\times IQR\) rule, there are many outliers in the data. The boxplots show that there are outliers in the predictors: Al, Ba, Ca, Fe, K, Na, RI and Si; the only predictor missing from this list is Mg:
- Al: outliers above and below the whiskers.
- Ba: outliers above the whiskers.
- Ca: outliers above and below the whiskers.
- Fe: outliers above the whiskers.
- K: outliers above the whiskers.
- Mg: no outliers.
- Na: outliers above and below the whiskers.
- Si: outliers above and below the whiskers.
- RI: most outliers are above the whiskers.
The histograms and boxplots show that the predictors are to varying degrees not normally distributed. The predictors Al, Ba, Ca, Fe, K, Na, and RI are right-skewed, while the predictors Mg and Si are left-skewed. We will calculate their skewness to confirm this. Ba, Fe, K and Mg are a little strange. They all have significant numbers of observations with 0%. I presume they are elements that induce special qualities in glass. We’ll follow up on that if we have time.
skewness(Glass$Al)
[1] 0.8988105
skewness(Glass$Ba)
[1] 3.384495
skewness(Glass$Ca)
[1] 2.027923
skewness(Glass$Fe)
[1] 1.737932
skewness(Glass$K)
[1] 6.490418
skewness(Glass$Mg)
[1] -1.141788
skewness(Glass$Na)
[1] 0.4499368
skewness(Glass$Si)
[1] -0.7236206
skewness(Glass$RI)
[1] 1.61024
3.1.3
Are there any relevant transformations of one or more predictors that might improve the classification model?
I would be inclined to apply a log transformation to all of the predictors for two reasons:
- All but two of the the predictors are right-skewed. A log transformation would help to normalize the distributions of these predictors.
- The predictors are exponentially distributed. A log transformation would help to linearize the relationships between the predictors and the outcome variable.
I might consider an exponential transformation for the predictors Mg and Si, but I am hesitant, because Si is already so dominant. It accounts for ~70% of the total composition of each type of glass. I would be concerned that an exponential transformation would make it even more dominant.
I am curious about trimming as well. As noted before, Ba, Fe, K and Mg are all have significant numbers of observations with 0%. They will be resistant to a straightforward log transformation.
Let’s see what a uniform \(log(x+1)\) transformation does to the distribution of the predictors.
|>
Glass mutate(across(Na:Fe, log1p)) |>
pivot_longer(cols = Na:Fe, names_to = "Predictor", values_to = "Value") |>
group_by(Type, Predictor) |>
summarize(Value = mean(Value)) |>
# get this guy to sort. I remember this solution from 607. I still think its weird.
mutate(
Predictor = factor(Predictor, levels = unique(Predictor[order(Value)]))
|>
) ggplot(mapping = aes(x = Predictor, y = Value)) +
geom_col() +
facet_wrap(~Type, scales = "free") +
labs(
title = "Composition of Glass by Type",
x = "Element",
y = "log(%+1)"
)
`summarise()` has grouped output by 'Type'. You can override using the
`.groups` argument.
|>
Glass select(RI:Fe) |>
mutate(across(everything(), log1p)) |>
pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Value") |>
ggplot(mapping = aes(x = Value)) +
geom_histogram(bins = 30) +
facet_wrap(~Predictor, scales = "free") +
labs(
title = "Distribution of Log-Transformed Predictors in Glass Dataset",
x = "Value",
y = "log(%+1)"
)
Let’s see what happens if we get rid of the 0% observations. If for no other reason than to help us to see the distributions of the predictors more clearly.
|>
Glass select(RI:Fe) |>
mutate(across(everything(), log1p)) |>
pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Value") |>
filter(Value > 0) |>
ggplot(mapping = aes(x = Value)) +
geom_histogram(bins = 30) +
facet_wrap(~Predictor, scales = "free") +
labs(
title = "Distribution of Log-Transformed Predictors in Glass Dataset",
subtitle = "0% observations removed",
x = "Value",
y = "log(%+1)"
)
Ha, I don’t know if that made things better or worse! We can now see Ba’s and Fe’s distribution, neither of which normal at all.
Problem 3.2
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g.,left spots, mold growth). The outcome labels consist of 19 distinct classes.
Load the data and have a peek:
data(Soybean)
glimpse(Soybean)
Rows: 683
Columns: 36
$ Class <fct> diaporthe-stem-canker, diaporthe-stem-canker, diaporth…
$ date <fct> 6, 4, 3, 3, 6, 5, 5, 4, 6, 4, 6, 4, 3, 6, 6, 5, 6, 4, …
$ plant.stand <ord> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ precip <ord> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, …
$ temp <ord> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, …
$ hail <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, …
$ crop.hist <fct> 1, 2, 1, 1, 2, 3, 2, 1, 3, 2, 1, 1, 1, 3, 1, 3, 0, 2, …
$ area.dam <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 2, 3, 3, 3, 2, 2, …
$ sever <fct> 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, …
$ seed.tmt <fct> 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, …
$ germ <ord> 0, 1, 2, 1, 2, 1, 0, 2, 1, 2, 0, 1, 0, 0, 1, 2, 0, 1, …
$ plant.growth <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ leaves <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ leaf.halo <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ leaf.marg <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ leaf.size <ord> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ leaf.shread <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ leaf.malf <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ leaf.mild <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ stem <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ lodging <fct> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, …
$ stem.cankers <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, …
$ canker.lesion <fct> 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, …
$ fruiting.bodies <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
$ ext.decay <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
$ mycelium <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ int.discolor <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, …
$ sclerotia <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
$ fruit.pods <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ fruit.spots <fct> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
$ seed <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ mold.growth <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ seed.discolor <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ seed.size <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ shriveling <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ roots <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
summary(Soybean)
Class date plant.stand precip temp
brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
anthracnose : 44 6 : 90
brown-stem-rot : 44 (Other):101
(Other) :233 NA's : 1
hail crop.hist area.dam sever seed.tmt germ plant.growth
0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
3 :218 3 :187 NA's:121 NA's:121 NA's:112
NA's: 16 NA's: 1
leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
NA's: 84 NA's: 84 NA's: 84 NA's:108
stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
3 :191 3 : 65 NA's: 38
NA's: 38 NA's: 38
mycelium int.discolor sclerotia fruit.pods fruit.spots seed
0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
NA's: 38 3 : 48 4 :100
NA's: 84 NA's:106
mold.growth seed.discolor seed.size shriveling roots
0 :524 0 :513 0 :532 0 :539 0 :551
1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
NA's: 31
levels(Soybean$Class)
[1] "2-4-d-injury" "alternarialeaf-spot"
[3] "anthracnose" "bacterial-blight"
[5] "bacterial-pustule" "brown-spot"
[7] "brown-stem-rot" "charcoal-rot"
[9] "cyst-nematode" "diaporthe-pod-&-stem-blight"
[11] "diaporthe-stem-canker" "downy-mildew"
[13] "frog-eye-leaf-spot" "herbicide-injury"
[15] "phyllosticta-leaf-spot" "phytophthora-rot"
[17] "powdery-mildew" "purple-seed-stain"
[19] "rhizoctonia-root-rot"
There are 19 classes, only the first 15 of which have been used in prior work. The folklore seems to be that the last four classes are unjustified by the data since they have so few examples. There are 35 categorical attributes, some nominal and some ordered. The value “dna” means does not apply. The values for attributes are encoded numerically, with the first value encoded as “0,” the second as “1,” and so forth.
# The unjustified
levels(Soybean$Class)[16:19]
[1] "phytophthora-rot" "powdery-mildew" "purple-seed-stain"
[4] "rhizoctonia-root-rot"
Category Table
I have changed the factor levels to reflect our modification. See the callout in section 3.2.1 for more information.
Predictor | Categories |
---|---|
date | NA(0), apr(1), may(2), june(3), july(4), aug(5), sept(6), oct(7) |
plant.stand | NA(0), normal(1), lt-normal(2) |
precip | NA(0), lt-norm(1), norm(2), gt-norm(3) |
temp | NA(0), lt-norm(1), norm(2), gt-norm(3) |
hail | NA(0), yes(1), no(2) |
crop.hist | NA(0), dif-lst-yr(1), s-l-y(2), s-l-3-y(3), s-l-7-y(4) |
area.dam | NA(0), scatter(1), low-area(2), upper-ar(3), whole-field(4) |
sever | NA(0), minor(1), pot-severe(2), severe(3) |
seed.tmt | NA(0), none(1), fungicide(2), other(3) |
germ | NA(0), 91-211%(1), 81-89%(2), lt-81%(3) |
plant.growth | NA(0), norm(1), abnorm(2) |
leaves | NA(0), norm(1), abnorm(2) |
leaf.halo | NA(0), absent(1), yellow-halos(2), no-yellow-halos(3) |
leaf.marg | NA(0), w-s-marg(1), no-w-s-marg(2), dna(3) |
leaf.size | NA(0), lt-2/8(1), gt-2/8(2), dna(3) |
leaf.shread | NA(0), absent(1), present(2) |
leaf.malf | NA(0), absent(1), present(2) |
leaf.mild | NA(0), absent(1), upper-surf(2), lower-surf(3) |
stem | NA(0), norm(1), abnorm(2) |
lodging | NA(0), yes(1), no(2) |
stem.cankers | NA(0), absent(1), below-soil(2), above-s(3), ab-sec-nde(4) |
canker.lesion | NA(0), dna(1), brown(2), dk-brown-blk(3), tan(4) |
fruiting.bodies | NA(0), absent(1), present(2) |
ext.decay | NA(0), absent(1), firm-and-dry(2), watery(3) |
mycelium | NA(0), absent(1), present(2) |
int.discolor | NA(0), none(1), brown(2), black(3) |
sclerotia | NA(0), absent(1), present(2) |
fruit.pods | NA(0), norm(1), diseased(2), few-present(3), dna(4) |
fruit.spots | NA(0), absent(1), col(2), br-w/blk-speck(3), distort(4), dna(5) |
seed | NA(0), norm(1), abnorm(2) |
mold.growth | NA(0), absent(1), present(2) |
seed.discolor | NA(0), absent(1), present(2) |
seed.size | NA(0), norm(1), lt-norm(2) |
shriveling | NA(0), absent(1), present(2) |
roots | NA(0), norm(1), rotted(2), galls-cysts(3) |
3.2.1
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Answer
as.numeric
adding 1 to each factor level?
The as.numeric
function is adding 1 to each factor level because the first level of a factor is 1, not 0. This is a little confusing, but the first level of a factor is always 1. And, when you use as.numeric
on a factor in R, it converts the factor levels to their underlying integer codes, not the actual values of the factor. This is because factors are stored internally as integers with corresponding levels. And the values of the factor levels are stored as a character vector. So, converting the value is not a safe alternative.
Ah, I see. The as.numeric
function is adding 1 to each factor level because the first level of a factor is 1, not 0. Well, we will take advantage of this. We will convert the NA
values to 0 so that we can get a grasp of how many there are per predictor.
|>
Soybean select(-Class) |>
mutate(across(everything(), ~if_else(is.na(.), 0, as.numeric(.)))) |>
pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Value") |>
group_by(Predictor, Value) |>
summarize(Count = n()) |>
ggplot(mapping = aes(x = Value, y = Count)) +
geom_col() +
facet_wrap(~Predictor, scales = "free") +
labs(
title = "Distribution of Categorical Predictors in Soybean Dataset",
x = "Value",
y = "Count"
)
`summarise()` has grouped output by 'Predictor'. You can override using the
`.groups` argument.
Degenerate?
To be honest, I was not totally clear on what a degenerate distribution is. I searched the chapter and found one reference to “degenerate”. I researched it a little and found that for categorical predictors, a degenerate distribution is one in which one of the categories has a very small number of observations or one category dominates the others. That we have. Looking at our distribution plot up above we can see that the following all have degenerate distributions:
int.discolor
: most belong 0 which is our encoding ofNA
.leaf.malf
: most belong to 1, which is actually 0, which is “absent”.leaf.mild
: there are a fair number ofNA
s, but 0 (“absent”) dominates.leaves
: 1 (“abnormal”) dominates.mycellium
: 0 (“absent”) dominates.sclerotia
: 0 (“absent”) dominates.
There are some others with significant differences. It’s really a matter of what threshold (percentage) one adopts as the cutoff as being “degenerate”.
3.2.2
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
Answer
Those above the average threshold (18%) are as follows:
|>
Soybean select(-Class) |>
mutate(across(everything(), ~ if_else(is.na(.), 0, as.numeric(.)))) |>
pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Value") |>
group_by(Predictor) |>
summarize(
NAFraction = sum(Value == 0) / n()
|>
) arrange(desc(NAFraction))
# A tibble: 35 × 2
Predictor NAFraction
<chr> <dbl>
1 hail 0.177
2 lodging 0.177
3 seed.tmt 0.177
4 sever 0.177
5 germ 0.164
6 leaf.mild 0.158
7 fruit.spots 0.155
8 fruiting.bodies 0.155
9 seed.discolor 0.155
10 shriveling 0.155
# ℹ 25 more rows
Let’s see if there is any correlation to the classes. We will do this by creating a data.frame
of the counts of each class for each predictor.
|>
Soybean mutate(across(-Class, ~ if_else(is.na(.), 0, as.numeric(.)))) |>
pivot_longer(cols = -Class, names_to = "Predictor", values_to = "Value") |>
group_by(Class) |>
summarize(
CountNA = sum(Value == 0),
CountAll = n(),
NAFraction = CountNA / CountAll
|>
) filter(NAFraction > 0) |>
arrange(desc(NAFraction))
# A tibble: 5 × 4
Class CountNA CountAll NAFraction
<fct> <int> <int> <dbl>
1 2-4-d-injury 450 560 0.804
2 cyst-nematode 336 490 0.686
3 herbicide-injury 160 280 0.571
4 phytophthora-rot 1214 3080 0.394
5 diaporthe-pod-&-stem-blight 177 525 0.337
3.2.3
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
Answer
Let’s first look for correlation between the predictors. We will use the GGally
package to create a correlation plot.
|>
Soybean select(-Class) |>
mutate(across(everything(), ~ if_else(is.na(.), 0, as.numeric(.)))) |>
ggpairs(
title = "Pairwise Scatterplots of Predictors in Soybean Dataset"
)
Hahaha, that was ridiculous. 35 predictors by 35 predictors yielded ~612 plots that looked comical and completely illegible on an 8” wide page. We really don’t need every every pairwise plot. The missing classes are isolated to 5 different classes. We might be able to impute the missing data based on the non NA
composition of the other classes.
Wait, let’s not give up on correlation. Let try a correlation matrix instead. Uh-oh, it’s a matrix. I have never worked with one in R before. I searched for and found a way to convert it to a data.frame
. Let’s see if we can find any high correlations.
<- Soybean |>
cor_matrix select(-Class) |>
mutate(across(everything(), ~ if_else(is.na(.), 0, as.numeric(.)))) |>
cor()
class(cor_matrix)
[1] "matrix" "array"
as.data.frame(as.table(cor_matrix)) |>
filter(Var1 != Var2 & abs(Freq) > 0.80) |>
arrange(desc(abs(Freq))) |>
print()
Var1 Var2 Freq
1 mold.growth seed 0.8686516
2 seed mold.growth 0.8686516
3 leaf.size leaf.marg 0.8573120
4 leaf.marg leaf.size 0.8573120
5 seed.size mold.growth 0.8380320
6 mold.growth seed.size 0.8380320
7 seed.size seed 0.8323945
8 seed seed.size 0.8323945
9 shriveling mold.growth 0.8317743
10 mold.growth shriveling 0.8317743
11 sclerotia int.discolor 0.8285093
12 int.discolor sclerotia 0.8285093
13 shriveling fruiting.bodies 0.8117533
14 fruiting.bodies shriveling 0.8117533
15 shriveling seed.discolor 0.8102875
16 seed.discolor shriveling 0.8102875
I somewhat arbitrarily chose 0.80 as the threshold for high correlation. There are 8 pairs of predictors that are highly correlated. My strategy would be to remove one of the predictors from each unique pair to reduce collinearity.