3.1 The UC Irvine machine learning repository contains a data set related to glass identification. The data consis of 214 glass samples labeled as one of seven class categories. There are nine predictors includign the refractive index and percentages of eign elements: Na, Mg, Al, Si, K, Ba, and Fe. The data can be accessed via library(mlbench) data(Glass) str(Glass)
Creating graphs
#to create all the histograms
Glass |> keep (is.numeric) |>
gather() |>
ggplot(aes(value)) +
geom_histogram() +
facet_wrap(~key, scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
#to visualize the distribution of glass types (the variable we are attempting to forecast)
ggplot(data = Glass, aes(x = Type)) +
geom_bar()
#dot plots based on type
glass_long <- Glass |>
pivot_longer(cols = RI:Fe, names_to = "predictor", values_to = "pred_value")
#Type is a categorical variable, so these plots only illustrate correlations between specific
#types of glass and element amounts
ggplot(glass_long, aes(x = pred_value, y = Type)) +
geom_point() +
facet_wrap(~ predictor, scales = "free_x")
#let's see if reflective index is helpful
glass_long_2 <- Glass |> mutate(Type = as.numeric(Type)) |>
pivot_longer(cols = -RI, names_to = "predictor", values_to = "pred_value")
ggplot(glass_long_2, aes(x = pred_value, y = RI)) +
geom_point() +
facet_wrap(~ predictor, scales = "free_x")
#the strongest correlation seems to be between CA and RI
#plotting some other predictors against each other
ggplot(Glass, aes(x = Na, y = Ca, color = Type)) +
geom_point()
ggplot(Glass, aes(x = Al, y = K, color = Type)) +
geom_point()
Fe (Iron) is skewed right, as are potassium and barium. Magnesium has many values close to zero, but is otherwise skewed left. The distribution for Al is the most normal, and the distributions for Ca, Si, Na, and reflective index are close-ish (though Ri is a little lumpy, slightly skewed right.) Several of the distributions are bimodal-ish, like magnesium and potassium.
Iron might have an outlier, as it has kind of a wide distribution. Potassium looks like it has an outlier higher than 6, with most values falling in the 0-1 range.
More on outliers:
Glass |> pivot_longer(cols = RI:Fe) |>
ggplot(aes(x = name, y = value)) + geom_boxplot() + facet_wrap(~name, scales = "free")
#Based on box plots, it looks like there are a few outliers for almost every variable except magnesium.
Scaling and transformations: The predictors all have very different values and scales (for example, values for calcium go up to 16 whereas values for iron are all under 1). Therefore, to make data easier to interpret, it may make sense to place them all on a similar scale.
The skewed distributions, such as iron, potassium, and barium, can benefit from a transformation (like log – box cox is not appropriate given the number of zero values).
Transformations to resist outliers – depending on the predictive model chosen and whether it is sensitive to outliers, it may make sense to use the spatial sign procedure.
Question 3.2
The soybean data can also be found at the UC Irvine ML repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g, temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes. a) investigate frequency distributions for the categorized predictors. Are any of the distributions degenerate in any way discussed earlier in this chapter? b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing. Is the pattern of missing data related to the classes? c) develop a strategy for handling missing data, either by eliminating predictors or imputation.
data(Soybean)
soy_long <- Soybean %>%
mutate(across(-Class, as.character)) %>%
pivot_longer(cols = -Class, names_to = "variable", values_to = "value")
soy_long |>
filter(!is.na(value)) |>
ggplot(aes(x = value)) +
geom_bar() +
facet_wrap(~ variable, scales = "free_x")
#zero variance and near-zero variance
nzv_results <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv_results
## freqRatio percentUnique zeroVar nzv
## Class 1.010989 2.7818448 FALSE FALSE
## date 1.137405 1.0248902 FALSE FALSE
## plant.stand 1.208191 0.2928258 FALSE FALSE
## precip 4.098214 0.4392387 FALSE FALSE
## temp 1.879397 0.4392387 FALSE FALSE
## hail 3.425197 0.2928258 FALSE FALSE
## crop.hist 1.004587 0.5856515 FALSE FALSE
## area.dam 1.213904 0.5856515 FALSE FALSE
## sever 1.651282 0.4392387 FALSE FALSE
## seed.tmt 1.373874 0.4392387 FALSE FALSE
## germ 1.103627 0.4392387 FALSE FALSE
## plant.growth 1.951327 0.2928258 FALSE FALSE
## leaves 7.870130 0.2928258 FALSE FALSE
## leaf.halo 1.547511 0.4392387 FALSE FALSE
## leaf.marg 1.615385 0.4392387 FALSE FALSE
## leaf.size 1.479638 0.4392387 FALSE FALSE
## leaf.shread 5.072917 0.2928258 FALSE FALSE
## leaf.malf 12.311111 0.2928258 FALSE FALSE
## leaf.mild 26.750000 0.4392387 FALSE TRUE
## stem 1.253378 0.2928258 FALSE FALSE
## lodging 12.380952 0.2928258 FALSE FALSE
## stem.cankers 1.984293 0.5856515 FALSE FALSE
## canker.lesion 1.807910 0.5856515 FALSE FALSE
## fruiting.bodies 4.548077 0.2928258 FALSE FALSE
## ext.decay 3.681481 0.4392387 FALSE FALSE
## mycelium 106.500000 0.2928258 FALSE TRUE
## int.discolor 13.204545 0.4392387 FALSE FALSE
## sclerotia 31.250000 0.2928258 FALSE TRUE
## fruit.pods 3.130769 0.5856515 FALSE FALSE
## fruit.spots 3.450000 0.5856515 FALSE FALSE
## seed 4.139130 0.2928258 FALSE FALSE
## mold.growth 7.820896 0.2928258 FALSE FALSE
## seed.discolor 8.015625 0.2928258 FALSE FALSE
## seed.size 9.016949 0.2928258 FALSE FALSE
## shriveling 14.184211 0.2928258 FALSE FALSE
## roots 6.406977 0.4392387 FALSE FALSE
#same function, but without NAs, in case NAs are skewing things
soy_no_na <- na.omit(Soybean)
nzv_results_na <- nearZeroVar(soy_no_na, saveMetrics = TRUE)
nzv_results_na
## freqRatio percentUnique zeroVar nzv
## Class 1.010989 2.6690391 FALSE FALSE
## date 1.186441 1.2455516 FALSE FALSE
## plant.stand 1.613953 0.3558719 FALSE FALSE
## precip 4.809524 0.5338078 FALSE FALSE
## temp 2.141026 0.5338078 FALSE FALSE
## hail 3.425197 0.3558719 FALSE FALSE
## crop.hist 1.005495 0.7117438 FALSE FALSE
## area.dam 1.136054 0.7117438 FALSE FALSE
## sever 1.651282 0.5338078 FALSE FALSE
## seed.tmt 1.373874 0.5338078 FALSE FALSE
## germ 1.104712 0.5338078 FALSE FALSE
## plant.growth 3.132353 0.3558719 FALSE FALSE
## leaves 8.064516 0.3558719 FALSE FALSE
## leaf.halo 1.797872 0.5338078 FALSE FALSE
## leaf.marg 1.898936 0.5338078 FALSE FALSE
## leaf.size 1.718085 0.5338078 FALSE FALSE
## leaf.shread 4.854167 0.3558719 FALSE FALSE
## leaf.malf 25.761905 0.3558719 FALSE TRUE
## leaf.mild 26.100000 0.5338078 FALSE TRUE
## stem 1.007143 0.3558719 FALSE FALSE
## lodging 12.380952 0.3558719 FALSE FALSE
## stem.cankers 2.265823 0.7117438 FALSE FALSE
## canker.lesion 2.798165 0.7117438 FALSE FALSE
## fruiting.bodies 5.314607 0.3558719 FALSE FALSE
## ext.decay 3.162963 0.3558719 FALSE FALSE
## mycelium 92.666667 0.3558719 FALSE TRUE
## int.discolor 11.318182 0.5338078 FALSE FALSE
## sclerotia 27.100000 0.3558719 FALSE TRUE
## fruit.pods 3.539130 0.5338078 FALSE FALSE
## fruit.spots 3.450000 0.7117438 FALSE FALSE
## seed 5.314607 0.3558719 FALSE FALSE
## mold.growth 9.807692 0.3558719 FALSE FALSE
## seed.discolor 10.469388 0.3558719 FALSE FALSE
## seed.size 17.733333 0.3558719 FALSE FALSE
## shriveling 23.434783 0.3558719 FALSE TRUE
## roots 55.100000 0.5338078 FALSE TRUE
#looking at all the rows with NAs
soy_na_only_2 <- Soybean |>
filter(!complete.cases(across()))
## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `!complete.cases(across())`.
## Caused by warning:
## ! Using `across()` without supplying `.cols` was deprecated in dplyr 1.1.0.
## ℹ Please supply `.cols` instead.
Looking at the histogram above, some variables are very heavily skewed toward one value, like fruiting bodies, leaves, leaf shread, leaf malformation, int discolor, lodging, mold growth, mycelium, scierotia, seed discolor, seed size, shriveling (hail, seed, plant growth to a lesser extent).
After running the NZV function on the data, I can see that none of the variables have zero variance, but some have near-zero, including leaf malformation, leaf mild, mycelium, sclerotia, shriveling, and roots.
#LIttle's MCAR test
library(naniar)
## Warning: package 'naniar' was built under R version 4.5.2
##
## Attaching package: 'naniar'
## The following object is masked from 'package:tsibble':
##
## pedestrian
mcar_test(Soybean)
## Warning in norm::prelim.norm(data): NAs introduced by coercion to integer range
## # A tibble: 1 × 4
## statistic df p.value missing.patterns
## <dbl> <dbl> <dbl> <int>
## 1 3712. 130 0 9
#The p value is zero, which means the data is not missing completely at random
#LIttle's Mcar test shows 9 patterns to the missing data
print(miss_var_summary(Soybean),n=36)
## # A tibble: 36 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 hail 121 17.7
## 2 sever 121 17.7
## 3 seed.tmt 121 17.7
## 4 lodging 121 17.7
## 5 germ 112 16.4
## 6 leaf.mild 108 15.8
## 7 fruiting.bodies 106 15.5
## 8 fruit.spots 106 15.5
## 9 seed.discolor 106 15.5
## 10 shriveling 106 15.5
## 11 leaf.shread 100 14.6
## 12 seed 92 13.5
## 13 mold.growth 92 13.5
## 14 seed.size 92 13.5
## 15 leaf.halo 84 12.3
## 16 leaf.marg 84 12.3
## 17 leaf.size 84 12.3
## 18 leaf.malf 84 12.3
## 19 fruit.pods 84 12.3
## 20 precip 38 5.56
## 21 stem.cankers 38 5.56
## 22 canker.lesion 38 5.56
## 23 ext.decay 38 5.56
## 24 mycelium 38 5.56
## 25 int.discolor 38 5.56
## 26 sclerotia 38 5.56
## 27 plant.stand 36 5.27
## 28 roots 31 4.54
## 29 temp 30 4.39
## 30 crop.hist 16 2.34
## 31 plant.growth 16 2.34
## 32 stem 16 2.34
## 33 date 1 0.146
## 34 area.dam 1 0.146
## 35 Class 0 0
## 36 leaves 0 0
#many of these numbers are similar, suggesting that certain missing data variables are related
#the first 19 values in this table are highly likely to be missing (~100 out of 700 values).
#There are a total of 121 rows with missing values
rows_with_na <- sum(!complete.cases(Soybean))
#list of all classes
Soybean |>
count(Class)
## Class n
## 1 2-4-d-injury 16
## 2 alternarialeaf-spot 91
## 3 anthracnose 44
## 4 bacterial-blight 20
## 5 bacterial-pustule 20
## 6 brown-spot 92
## 7 brown-stem-rot 44
## 8 charcoal-rot 20
## 9 cyst-nematode 14
## 10 diaporthe-pod-&-stem-blight 15
## 11 diaporthe-stem-canker 20
## 12 downy-mildew 20
## 13 frog-eye-leaf-spot 91
## 14 herbicide-injury 8
## 15 phyllosticta-leaf-spot 20
## 16 phytophthora-rot 88
## 17 powdery-mildew 20
## 18 purple-seed-stain 20
## 19 rhizoctonia-root-rot 20
#because all the NA values seem to be relegated to the same 121 rows
#where at least 4 variables are all missing
#let's see if there are any patterns there
Soybean |>
filter(is.na(hail)) |>
count(Class) |>
arrange(desc(n))
## Class n
## 1 phytophthora-rot 68
## 2 2-4-d-injury 16
## 3 diaporthe-pod-&-stem-blight 15
## 4 cyst-nematode 14
## 5 herbicide-injury 8
#these NA values impact only 5 classes, with phytophthora-rot composing 68/121
#here they are side by side
Soybean |>
count(Class, name = "total") |>
left_join(
Soybean |> filter(is.na(hail)) |> count(Class, name = "na_count"),
by = "Class"
) |>
mutate(na_count = coalesce(na_count, 0)) |>
arrange(desc(na_count))
## Class total na_count
## 1 phytophthora-rot 88 68
## 2 2-4-d-injury 16 16
## 3 diaporthe-pod-&-stem-blight 15 15
## 4 cyst-nematode 14 14
## 5 herbicide-injury 8 8
## 6 alternarialeaf-spot 91 0
## 7 anthracnose 44 0
## 8 bacterial-blight 20 0
## 9 bacterial-pustule 20 0
## 10 brown-spot 92 0
## 11 brown-stem-rot 44 0
## 12 charcoal-rot 20 0
## 13 diaporthe-stem-canker 20 0
## 14 downy-mildew 20 0
## 15 frog-eye-leaf-spot 91 0
## 16 phyllosticta-leaf-spot 20 0
## 17 powdery-mildew 20 0
## 18 purple-seed-stain 20 0
## 19 rhizoctonia-root-rot 20 0
#4 classes are 100% affected by the missing values
#upSet plot that visualizes this
gg_miss_upset(Soybean)
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the UpSetR package.
## Please report the issue to the authors.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the UpSetR package.
## Please report the issue to the authors.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the UpSetR package.
## Please report the issue to the authors.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Eliminating rows with missing values would mean eliminating a total of 121 rows (1/5 of the data set), and this would disproportionately affect five classes. Four classes would be dropped entirely if we eliminated the rows with missing data.
Of the top 4 missing variables, only one of them, lodging, has a variance near zero.
KNN wouldn’t work because for many missing values there is no nearest neighbor (the class has no values for that variable). Using the median also doesn’t make sense because many of these are categorical values. It may be reasonable to assume that these missing values are related to the class, since they are consistent across classes, and treat NA as its own categorical value. In this case, imputation would be inappropriate. The nature of the data may provide a clue: soybean plant growth and disease status correlate with factors like weather and climate, which can also affect data collection in the field.
Treating NA values as their own category wasn’t really an option in the question, so, as an alternative, I would explore dropping some of the commonly missing variables and imputing others. This data set has so many variables, I would consider dropping the 4-6 top missing variables (including germ, hail, sever, seed treatment, and lodging) and imputing other missing values using KNN. I would drop the variables with missing values that show as having near-zero variance, including leaf mild (108 missing values), mycelium (38 missing values), leaf malformation (84 missing values), sclerotia (38 missing values), shriveling (106 missing values). The rest, I would attempt to impute, if possible.