3.1 The UC Irvine machine learning repository contains a data set related to glass identification. The data consis of 214 glass samples labeled as one of seven class categories. There are nine predictors includign the refractive index and percentages of eign elements: Na, Mg, Al, Si, K, Ba, and Fe. The data can be accessed via library(mlbench) data(Glass) str(Glass)

  1. Using visualizations, explore the predictor variables to understand their distributions as well as relationships between predictors b)Do there appear to be any outliers in the data? Are any predictors skewed?
  2. Are there any relevant transformations of one or more predictors that might improve the classification model?

Creating graphs

#to create all the histograms
Glass |> keep (is.numeric) |>
  gather() |>
  ggplot(aes(value)) + 
  geom_histogram() + 
  facet_wrap(~key, scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

#to visualize the distribution of glass types (the variable we are attempting to forecast)
ggplot(data = Glass, aes(x = Type)) +
    geom_bar()

#dot plots based on type

glass_long <- Glass |>
    pivot_longer(cols = RI:Fe, names_to = "predictor", values_to = "pred_value")

#Type is a categorical variable, so these plots only illustrate correlations between specific 
#types of glass and element amounts
ggplot(glass_long, aes(x = pred_value, y = Type)) + 
    geom_point() +
    facet_wrap(~ predictor, scales = "free_x")

#let's see if reflective index is helpful
glass_long_2 <- Glass |> mutate(Type = as.numeric(Type)) |>
    pivot_longer(cols = -RI, names_to = "predictor", values_to = "pred_value")

ggplot(glass_long_2, aes(x = pred_value, y = RI)) + 
    geom_point() +
    facet_wrap(~ predictor, scales = "free_x")

#the strongest correlation seems to be between CA and RI

#plotting some other predictors against each other
ggplot(Glass, aes(x = Na, y = Ca, color = Type)) + 
    geom_point() 

ggplot(Glass, aes(x = Al, y = K, color = Type)) + 
    geom_point() 

Fe (Iron) is skewed right, as are potassium and barium. Magnesium has many values close to zero, but is otherwise skewed left. The distribution for Al is the most normal, and the distributions for Ca, Si, Na, and reflective index are close-ish (though Ri is a little lumpy, slightly skewed right.) Several of the distributions are bimodal-ish, like magnesium and potassium.

Iron might have an outlier, as it has kind of a wide distribution. Potassium looks like it has an outlier higher than 6, with most values falling in the 0-1 range.

More on outliers:

Glass |> pivot_longer(cols = RI:Fe) |> 
  ggplot(aes(x = name, y = value)) + geom_boxplot() + facet_wrap(~name, scales = "free")

#Based on box plots, it looks like there are a few outliers for almost every variable except magnesium. 

Scaling and transformations: The predictors all have very different values and scales (for example, values for calcium go up to 16 whereas values for iron are all under 1). Therefore, to make data easier to interpret, it may make sense to place them all on a similar scale.

The skewed distributions, such as iron, potassium, and barium, can benefit from a transformation (like log – box cox is not appropriate given the number of zero values).

Transformations to resist outliers – depending on the predictive model chosen and whether it is sensitive to outliers, it may make sense to use the spatial sign procedure.

Question 3.2

The soybean data can also be found at the UC Irvine ML repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g, temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes. a) investigate frequency distributions for the categorized predictors. Are any of the distributions degenerate in any way discussed earlier in this chapter? b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing. Is the pattern of missing data related to the classes? c) develop a strategy for handling missing data, either by eliminating predictors or imputation.

data(Soybean)

soy_long <- Soybean %>%
    mutate(across(-Class, as.character)) %>% 
    pivot_longer(cols = -Class, names_to = "variable", values_to = "value")

soy_long |> 
  filter(!is.na(value)) |> 
  ggplot(aes(x = value)) +
  geom_bar() +
  facet_wrap(~ variable, scales = "free_x") 

#zero variance and near-zero variance
nzv_results <- nearZeroVar(Soybean, saveMetrics = TRUE)

nzv_results
##                  freqRatio percentUnique zeroVar   nzv
## Class             1.010989     2.7818448   FALSE FALSE
## date              1.137405     1.0248902   FALSE FALSE
## plant.stand       1.208191     0.2928258   FALSE FALSE
## precip            4.098214     0.4392387   FALSE FALSE
## temp              1.879397     0.4392387   FALSE FALSE
## hail              3.425197     0.2928258   FALSE FALSE
## crop.hist         1.004587     0.5856515   FALSE FALSE
## area.dam          1.213904     0.5856515   FALSE FALSE
## sever             1.651282     0.4392387   FALSE FALSE
## seed.tmt          1.373874     0.4392387   FALSE FALSE
## germ              1.103627     0.4392387   FALSE FALSE
## plant.growth      1.951327     0.2928258   FALSE FALSE
## leaves            7.870130     0.2928258   FALSE FALSE
## leaf.halo         1.547511     0.4392387   FALSE FALSE
## leaf.marg         1.615385     0.4392387   FALSE FALSE
## leaf.size         1.479638     0.4392387   FALSE FALSE
## leaf.shread       5.072917     0.2928258   FALSE FALSE
## leaf.malf        12.311111     0.2928258   FALSE FALSE
## leaf.mild        26.750000     0.4392387   FALSE  TRUE
## stem              1.253378     0.2928258   FALSE FALSE
## lodging          12.380952     0.2928258   FALSE FALSE
## stem.cankers      1.984293     0.5856515   FALSE FALSE
## canker.lesion     1.807910     0.5856515   FALSE FALSE
## fruiting.bodies   4.548077     0.2928258   FALSE FALSE
## ext.decay         3.681481     0.4392387   FALSE FALSE
## mycelium        106.500000     0.2928258   FALSE  TRUE
## int.discolor     13.204545     0.4392387   FALSE FALSE
## sclerotia        31.250000     0.2928258   FALSE  TRUE
## fruit.pods        3.130769     0.5856515   FALSE FALSE
## fruit.spots       3.450000     0.5856515   FALSE FALSE
## seed              4.139130     0.2928258   FALSE FALSE
## mold.growth       7.820896     0.2928258   FALSE FALSE
## seed.discolor     8.015625     0.2928258   FALSE FALSE
## seed.size         9.016949     0.2928258   FALSE FALSE
## shriveling       14.184211     0.2928258   FALSE FALSE
## roots             6.406977     0.4392387   FALSE FALSE
#same function, but without NAs, in case NAs are skewing things
soy_no_na <- na.omit(Soybean)

nzv_results_na <- nearZeroVar(soy_no_na, saveMetrics = TRUE)

nzv_results_na
##                 freqRatio percentUnique zeroVar   nzv
## Class            1.010989     2.6690391   FALSE FALSE
## date             1.186441     1.2455516   FALSE FALSE
## plant.stand      1.613953     0.3558719   FALSE FALSE
## precip           4.809524     0.5338078   FALSE FALSE
## temp             2.141026     0.5338078   FALSE FALSE
## hail             3.425197     0.3558719   FALSE FALSE
## crop.hist        1.005495     0.7117438   FALSE FALSE
## area.dam         1.136054     0.7117438   FALSE FALSE
## sever            1.651282     0.5338078   FALSE FALSE
## seed.tmt         1.373874     0.5338078   FALSE FALSE
## germ             1.104712     0.5338078   FALSE FALSE
## plant.growth     3.132353     0.3558719   FALSE FALSE
## leaves           8.064516     0.3558719   FALSE FALSE
## leaf.halo        1.797872     0.5338078   FALSE FALSE
## leaf.marg        1.898936     0.5338078   FALSE FALSE
## leaf.size        1.718085     0.5338078   FALSE FALSE
## leaf.shread      4.854167     0.3558719   FALSE FALSE
## leaf.malf       25.761905     0.3558719   FALSE  TRUE
## leaf.mild       26.100000     0.5338078   FALSE  TRUE
## stem             1.007143     0.3558719   FALSE FALSE
## lodging         12.380952     0.3558719   FALSE FALSE
## stem.cankers     2.265823     0.7117438   FALSE FALSE
## canker.lesion    2.798165     0.7117438   FALSE FALSE
## fruiting.bodies  5.314607     0.3558719   FALSE FALSE
## ext.decay        3.162963     0.3558719   FALSE FALSE
## mycelium        92.666667     0.3558719   FALSE  TRUE
## int.discolor    11.318182     0.5338078   FALSE FALSE
## sclerotia       27.100000     0.3558719   FALSE  TRUE
## fruit.pods       3.539130     0.5338078   FALSE FALSE
## fruit.spots      3.450000     0.7117438   FALSE FALSE
## seed             5.314607     0.3558719   FALSE FALSE
## mold.growth      9.807692     0.3558719   FALSE FALSE
## seed.discolor   10.469388     0.3558719   FALSE FALSE
## seed.size       17.733333     0.3558719   FALSE FALSE
## shriveling      23.434783     0.3558719   FALSE  TRUE
## roots           55.100000     0.5338078   FALSE  TRUE
#looking at all the rows with NAs
soy_na_only_2 <- Soybean |> 
    filter(!complete.cases(across()))
## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `!complete.cases(across())`.
## Caused by warning:
## ! Using `across()` without supplying `.cols` was deprecated in dplyr 1.1.0.
## ℹ Please supply `.cols` instead.

Looking at the histogram above, some variables are very heavily skewed toward one value, like fruiting bodies, leaves, leaf shread, leaf malformation, int discolor, lodging, mold growth, mycelium, scierotia, seed discolor, seed size, shriveling (hail, seed, plant growth to a lesser extent).

After running the NZV function on the data, I can see that none of the variables have zero variance, but some have near-zero, including leaf malformation, leaf mild, mycelium, sclerotia, shriveling, and roots.

  1. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
#LIttle's MCAR test
library(naniar)
## Warning: package 'naniar' was built under R version 4.5.2
## 
## Attaching package: 'naniar'
## The following object is masked from 'package:tsibble':
## 
##     pedestrian
mcar_test(Soybean)
## Warning in norm::prelim.norm(data): NAs introduced by coercion to integer range
## # A tibble: 1 × 4
##   statistic    df p.value missing.patterns
##       <dbl> <dbl>   <dbl>            <int>
## 1     3712.   130       0                9
#The p value is zero, which means the data is not missing completely at random

#LIttle's Mcar test shows 9 patterns to the missing data

print(miss_var_summary(Soybean),n=36)
## # A tibble: 36 × 3
##    variable        n_miss pct_miss
##    <chr>            <int>    <num>
##  1 hail               121   17.7  
##  2 sever              121   17.7  
##  3 seed.tmt           121   17.7  
##  4 lodging            121   17.7  
##  5 germ               112   16.4  
##  6 leaf.mild          108   15.8  
##  7 fruiting.bodies    106   15.5  
##  8 fruit.spots        106   15.5  
##  9 seed.discolor      106   15.5  
## 10 shriveling         106   15.5  
## 11 leaf.shread        100   14.6  
## 12 seed                92   13.5  
## 13 mold.growth         92   13.5  
## 14 seed.size           92   13.5  
## 15 leaf.halo           84   12.3  
## 16 leaf.marg           84   12.3  
## 17 leaf.size           84   12.3  
## 18 leaf.malf           84   12.3  
## 19 fruit.pods          84   12.3  
## 20 precip              38    5.56 
## 21 stem.cankers        38    5.56 
## 22 canker.lesion       38    5.56 
## 23 ext.decay           38    5.56 
## 24 mycelium            38    5.56 
## 25 int.discolor        38    5.56 
## 26 sclerotia           38    5.56 
## 27 plant.stand         36    5.27 
## 28 roots               31    4.54 
## 29 temp                30    4.39 
## 30 crop.hist           16    2.34 
## 31 plant.growth        16    2.34 
## 32 stem                16    2.34 
## 33 date                 1    0.146
## 34 area.dam             1    0.146
## 35 Class                0    0    
## 36 leaves               0    0
#many of these numbers are similar, suggesting that certain missing data variables are related
#the first 19 values in this table are highly likely to be missing (~100 out of 700 values).


#There are a total of 121 rows with missing values 
rows_with_na <- sum(!complete.cases(Soybean))

#list of all classes
Soybean |>
  count(Class)
##                          Class  n
## 1                 2-4-d-injury 16
## 2          alternarialeaf-spot 91
## 3                  anthracnose 44
## 4             bacterial-blight 20
## 5            bacterial-pustule 20
## 6                   brown-spot 92
## 7               brown-stem-rot 44
## 8                 charcoal-rot 20
## 9                cyst-nematode 14
## 10 diaporthe-pod-&-stem-blight 15
## 11       diaporthe-stem-canker 20
## 12                downy-mildew 20
## 13          frog-eye-leaf-spot 91
## 14            herbicide-injury  8
## 15      phyllosticta-leaf-spot 20
## 16            phytophthora-rot 88
## 17              powdery-mildew 20
## 18           purple-seed-stain 20
## 19        rhizoctonia-root-rot 20
#because all the NA values seem to be relegated to the same 121 rows 
#where at least 4 variables are all missing
#let's see if there are any patterns there
Soybean |> 
  filter(is.na(hail)) |> 
  count(Class) |>
  arrange(desc(n))
##                         Class  n
## 1            phytophthora-rot 68
## 2                2-4-d-injury 16
## 3 diaporthe-pod-&-stem-blight 15
## 4               cyst-nematode 14
## 5            herbicide-injury  8
#these NA values impact only 5 classes, with phytophthora-rot composing 68/121

#here they are side by side

Soybean |> 
  count(Class, name = "total") |> 
  left_join(
    Soybean |> filter(is.na(hail)) |> count(Class, name = "na_count"),
    by = "Class"
  ) |> 
  mutate(na_count = coalesce(na_count, 0)) |> 
  arrange(desc(na_count))
##                          Class total na_count
## 1             phytophthora-rot    88       68
## 2                 2-4-d-injury    16       16
## 3  diaporthe-pod-&-stem-blight    15       15
## 4                cyst-nematode    14       14
## 5             herbicide-injury     8        8
## 6          alternarialeaf-spot    91        0
## 7                  anthracnose    44        0
## 8             bacterial-blight    20        0
## 9            bacterial-pustule    20        0
## 10                  brown-spot    92        0
## 11              brown-stem-rot    44        0
## 12                charcoal-rot    20        0
## 13       diaporthe-stem-canker    20        0
## 14                downy-mildew    20        0
## 15          frog-eye-leaf-spot    91        0
## 16      phyllosticta-leaf-spot    20        0
## 17              powdery-mildew    20        0
## 18           purple-seed-stain    20        0
## 19        rhizoctonia-root-rot    20        0
#4 classes are 100% affected by the missing values

#upSet plot that visualizes this
gg_miss_upset(Soybean)
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the UpSetR package.
##   Please report the issue to the authors.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the UpSetR package.
##   Please report the issue to the authors.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the UpSetR package.
##   Please report the issue to the authors.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

  1. develop a strategy for handling missing data, either by eliminating predictors or imputation.

Eliminating rows with missing values would mean eliminating a total of 121 rows (1/5 of the data set), and this would disproportionately affect five classes. Four classes would be dropped entirely if we eliminated the rows with missing data.

Of the top 4 missing variables, only one of them, lodging, has a variance near zero.

KNN wouldn’t work because for many missing values there is no nearest neighbor (the class has no values for that variable). Using the median also doesn’t make sense because many of these are categorical values. It may be reasonable to assume that these missing values are related to the class, since they are consistent across classes, and treat NA as its own categorical value. In this case, imputation would be inappropriate. The nature of the data may provide a clue: soybean plant growth and disease status correlate with factors like weather and climate, which can also affect data collection in the field.

Treating NA values as their own category wasn’t really an option in the question, so, as an alternative, I would explore dropping some of the commonly missing variables and imputing others. This data set has so many variables, I would consider dropping the 4-6 top missing variables (including germ, hail, sever, seed treatment, and lodging) and imputing other missing values using KNN. I would drop the variables with missing values that show as having near-zero variance, including leaf mild (108 missing values), mycelium (38 missing values), leaf malformation (84 missing values), sclerotia (38 missing values), shriveling (106 missing values). The rest, I would attempt to impute, if possible.