Looking at the distrobutions of the predictors in the Glass dataset, we immediately see some interesting scenarios. The Aluminum content appears to be the most normally distributed with Sodium, Calcium, and the Refractive Index being largely normal with a right skew. The Silicon and Magnesium measures also have fairly normal distributions but with a left skew instead. The Magnesium measure also has a large number of 0 values which would heavily impact any analysis. The Iron, Barium, and to a lesser extent Potassium measures are heavily skewed by a large number of zero values which hide any trends in the non-zero values.
data(Glass)
pivot_glass <- Glass |>
pivot_longer(RI:Fe, names_to = "predictor", values_to = "measurement")
pivot_glass |>
ggplot(aes(x = measurement)) +
geom_histogram(bins = 50) +
facet_wrap(vars(predictor), scales = "free")
Looking at correlation between the predictors without regards to the actual glass type there are only a few pairs that show meaningful correlation (in part due to the large number of 0 values in certain fields mentioned above). By far the most significant correlation is between the Refractive Index and Calcium with a correlation coefficient of 0.81. Of particular note should be the Refractive Index, Magnesium, and Aluminum which appear to have the most of the correlations of significance.
remove_type <- Glass |> select(!Type)
ggpairs(remove_type, diag = list(continuos = "blankDiag"))
Using a box and whisker plot we can see that, except for magnesium, there are some significant outliers in every measure. Of particular note is Barium with almost no none-zero values in the interquartile range. The Barium, Iron, and Potassium sets are clearly heavily skewed by the large number of 0 values, but Potassium has relatively few outliers even when compared to the more normally distrubeted measures.
pivot_glass |>
ggplot(aes(x = measurement)) +
geom_boxplot() +
facet_wrap(vars(predictor), scales = "free")
Skewness was already touched on in section 3.1.1 but to continue the analyses we can use the e1071 package to calculate the skewness of each column. As noted before, Sodium, Aluminum, and Silicon are the most normally distributed with Sodium being more normal than Aluminum, running counter to our initial assesment. Additionally, the Refractive Index is actually closer to an exponential distrobution similar to Iron and Calcium than we originally assessed. It should be noted that all measures except Magnesium and Silicon display a right skew, with neither Magnesium nor Silicon exibiting a particularly strong left skew. The skewness of Potassium and Barium are by far the most extreme.
remove_type |> apply(2, skewness)
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
It would be a good idea to center and scale the data in order to improve the usability of some of the non-normal columns. Additionally, several of the columns would benefit from additional transformations, with Box-Cox indicating the Refractive Index, Aluminum, Silicon, and Calcium columns would benefit from transformations of -2, 0.5, 2, and -1 respectively.
transforms <- preProcess(remove_type, method = c("BoxCox", "center", "scale"))
transforms
## Created from 214 samples and 9 variables
##
## Pre-processing:
## - Box-Cox transformation (5)
## - centered (9)
## - ignored (0)
## - scaled (9)
##
## Lambda estimates for Box-Cox transformation:
## -2, -0.1, 0.5, 2, -1.1
The two requirements for a predictor to be degenerate according to
the text are:
We will use the values provided as our limits for determining
degeneracy. There are 683 observations in the dataset, meaning a
predictor has a low number of unique values if it has 68 or fewer unique
values. All of the predictors have fewer than 68 unique values, so all
meet criteria 1. To determine which columns meet criteria 2 we can make
use of the nearZeroVar() function.
data("Soybean")
colnames(Soybean[nearZeroVar(Soybean)])
## [1] "leaf.mild" "mycelium" "sclerotia"
Doing so indicates that the leaf.mild, mycelium, and sclerotia columns are degenerate according to the criteria laid out in our text. Using the tabyl function and performing some simple comparisons we can see that there are a few more candidates for degeneracy. The columns shriveling, leaf.malf, lodging, and int.discolor are not quite at the 20:1 ratio but are close enough that they could also be good candidates for removal if other indicators lean towards removal.
apply(Soybean, 2, tabyl, show_na = FALSE)
## $Class
## newX[, i] n percent
## 2-4-d-injury 16 0.02342606
## alternarialeaf-spot 91 0.13323572
## anthracnose 44 0.06442167
## bacterial-blight 20 0.02928258
## bacterial-pustule 20 0.02928258
## brown-spot 92 0.13469985
## brown-stem-rot 44 0.06442167
## charcoal-rot 20 0.02928258
## cyst-nematode 14 0.02049780
## diaporthe-pod-&-stem-blight 15 0.02196193
## diaporthe-stem-canker 20 0.02928258
## downy-mildew 20 0.02928258
## frog-eye-leaf-spot 91 0.13323572
## herbicide-injury 8 0.01171303
## phyllosticta-leaf-spot 20 0.02928258
## phytophthora-rot 88 0.12884334
## powdery-mildew 20 0.02928258
## purple-seed-stain 20 0.02928258
## rhizoctonia-root-rot 20 0.02928258
##
## $date
## newX[, i] n percent
## 0 26 0.03812317
## 1 75 0.10997067
## 2 93 0.13636364
## 3 118 0.17302053
## 4 131 0.19208211
## 5 149 0.21847507
## 6 90 0.13196481
##
## $plant.stand
## newX[, i] n percent
## 0 354 0.5471406
## 1 293 0.4528594
##
## $precip
## newX[, i] n percent
## 0 74 0.1147287
## 1 112 0.1736434
## 2 459 0.7116279
##
## $temp
## newX[, i] n percent
## 0 80 0.1225115
## 1 374 0.5727412
## 2 199 0.3047473
##
## $hail
## newX[, i] n percent
## 0 435 0.7740214
## 1 127 0.2259786
##
## $crop.hist
## newX[, i] n percent
## 0 65 0.09745127
## 1 165 0.24737631
## 2 219 0.32833583
## 3 218 0.32683658
##
## $area.dam
## newX[, i] n percent
## 0 123 0.1803519
## 1 227 0.3328446
## 2 145 0.2126100
## 3 187 0.2741935
##
## $sever
## newX[, i] n percent
## 0 195 0.34697509
## 1 322 0.57295374
## 2 45 0.08007117
##
## $seed.tmt
## newX[, i] n percent
## 0 305 0.54270463
## 1 222 0.39501779
## 2 35 0.06227758
##
## $germ
## newX[, i] n percent
## 0 165 0.2889667
## 1 213 0.3730298
## 2 193 0.3380035
##
## $plant.growth
## newX[, i] n percent
## 0 441 0.6611694
## 1 226 0.3388306
##
## $leaves
## newX[, i] n percent
## 0 77 0.1127379
## 1 606 0.8872621
##
## $leaf.halo
## newX[, i] n percent
## 0 221 0.36894825
## 1 36 0.06010017
## 2 342 0.57095159
##
## $leaf.marg
## newX[, i] n percent
## 0 357 0.59599332
## 1 21 0.03505843
## 2 221 0.36894825
##
## $leaf.size
## newX[, i] n percent
## 0 51 0.0851419
## 1 327 0.5459098
## 2 221 0.3689482
##
## $leaf.shread
## newX[, i] n percent
## 0 487 0.8353345
## 1 96 0.1646655
##
## $leaf.malf
## newX[, i] n percent
## 0 554 0.92487479
## 1 45 0.07512521
##
## $leaf.mild
## newX[, i] n percent
## 0 535 0.93043478
## 1 20 0.03478261
## 2 20 0.03478261
##
## $stem
## newX[, i] n percent
## 0 296 0.4437781
## 1 371 0.5562219
##
## $lodging
## newX[, i] n percent
## 0 520 0.9252669
## 1 42 0.0747331
##
## $stem.cankers
## newX[, i] n percent
## 0 379 0.58759690
## 1 39 0.06046512
## 2 36 0.05581395
## 3 191 0.29612403
##
## $canker.lesion
## newX[, i] n percent
## 0 320 0.4961240
## 1 83 0.1286822
## 2 177 0.2744186
## 3 65 0.1007752
##
## $fruiting.bodies
## newX[, i] n percent
## 0 473 0.8197574
## 1 104 0.1802426
##
## $ext.decay
## newX[, i] n percent
## 0 497 0.77054264
## 1 135 0.20930233
## 2 13 0.02015504
##
## $mycelium
## newX[, i] n percent
## 0 639 0.990697674
## 1 6 0.009302326
##
## $int.discolor
## newX[, i] n percent
## 0 581 0.90077519
## 1 44 0.06821705
## 2 20 0.03100775
##
## $sclerotia
## newX[, i] n percent
## 0 625 0.96899225
## 1 20 0.03100775
##
## $fruit.pods
## newX[, i] n percent
## 0 407 0.67946578
## 1 130 0.21702838
## 2 14 0.02337229
## 3 48 0.08013356
##
## $fruit.spots
## newX[, i] n percent
## 0 345 0.59792028
## 1 75 0.12998267
## 2 57 0.09878683
## 4 100 0.17331023
##
## $seed
## newX[, i] n percent
## 0 476 0.8054146
## 1 115 0.1945854
##
## $mold.growth
## newX[, i] n percent
## 0 524 0.8866328
## 1 67 0.1133672
##
## $seed.discolor
## newX[, i] n percent
## 0 513 0.8890815
## 1 64 0.1109185
##
## $seed.size
## newX[, i] n percent
## 0 532 0.9001692
## 1 59 0.0998308
##
## $shriveling
## newX[, i] n percent
## 0 539 0.93414211
## 1 38 0.06585789
##
## $roots
## newX[, i] n percent
## 0 551 0.84509202
## 1 86 0.13190184
## 2 15 0.02300613
The below plot provides a good starting point when analyzing the missing values, indicating which are the most frequently missing.
empty_counts <- Soybean |>
select(!Class) |>
summarise(across(date:roots, ~ sum(is.na(.)))) |>
pivot_longer(cols = date:roots, names_to = "Column", values_to = "na_values") |>
arrange(desc(na_values))
empty_counts |>
ggplot(aes(x = reorder(Column, na_values), y = na_values)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(x = "Column Name", y = "Count of Missing Values")
Interestingly there are only five classes with missing data, and among those Classes phytophthora-rot has over half of the total missing values.
empty_by_class <- Soybean |>
group_by(Class) |>
summarise(across(date:roots, ~ sum(is.na(.))))
empty_by_class |>
rowwise(Class) |>
summarise(total_missing = sum(across(date:roots))) |>
filter(total_missing != 0)
## `summarise()` has grouped output by 'Class'. You can override using the
## `.groups` argument.
## # A tibble: 5 × 2
## # Groups: Class [5]
## Class total_missing
## <fct> <int>
## 1 2-4-d-injury 450
## 2 cyst-nematode 336
## 3 diaporthe-pod-&-stem-blight 177
## 4 herbicide-injury 160
## 5 phytophthora-rot 1214
Given that leaf.mild, mycelium, and sclerotia all had indicators of degeneracy and appeared in our count of columns with missing values they are all good candidates for elimination. Additionally, sever, seed.tmt, lodging, hail, germ, shriveling, seed.discolor, fruiting.bodies, and fruit.spots all have more than 15% of their entries missing which makes them good candidates for elimination as well. Our working set of data now consists of 24 observations which is a reasonable subset of the recorded data. Using Multiple Correspondence Analysis we can impute the missing values in the remaining columns. Our data is now ready for modeling.
imputation_set <- Soybean |>
select(!c(leaf.mild, mycelium, sclerotia, sever, seed.tmt, lodging, hail, germ, shriveling, seed.discolor, fruiting.bodies, fruit.spots))
modeling_data <- imputeMCA(imputation_set)
head(modeling_data$completeObs, 5)
## Class date plant.stand precip temp crop.hist area.dam
## 1 diaporthe-stem-canker 6 0 2 1 1 1
## 2 diaporthe-stem-canker 4 0 2 1 2 0
## 3 diaporthe-stem-canker 3 0 2 1 1 0
## 4 diaporthe-stem-canker 3 0 2 1 1 0
## 5 diaporthe-stem-canker 6 0 2 1 2 0
## plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem
## 1 1 1 0 2 2 0 0 1
## 2 1 1 0 2 2 0 0 1
## 3 1 1 0 2 2 0 0 1
## 4 1 1 0 2 2 0 0 1
## 5 1 1 0 2 2 0 0 1
## stem.cankers canker.lesion ext.decay int.discolor fruit.pods seed mold.growth
## 1 3 1 1 0 0 0 0
## 2 3 1 1 0 0 0 0
## 3 3 0 1 0 0 0 0
## 4 3 0 1 0 0 0 0
## 5 3 1 1 0 0 0 0
## seed.size roots
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0