624 HW4

Jeff Shamp

2021-03-02

Question KJ 3.1

Question

This data set is related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

library(tidyverse)
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
  2. Do there appear to be any outliers in the data? Are any predictors skewed?
  3. Are there any relevant transformations of one or more predictors that might improve the classification model?

Answer

The answer combines parts (a), (b), and (c) into one narrative for a more holistic treatment of what is a sprawling process.

Variable Distributions

The first step is to understand the data. The call to str in the question is very valuable: it tells us that we can naïvely expect each predictor to come from a continuous distribution, since they are coded as floats/doubles. The exception is Type, a factor that takes six discrete values. If we wanted to use it as a predictor in a regression, we would have to consider dummy variables, as sketched below.
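
For illustration, one standard way to build those dummy variables is base R's model.matrix; this is a sketch for context, not something the exercise requires.

# One dummy column per Type level; dropping the intercept keeps every level
type_dummies <- model.matrix(~ Type - 1, data = Glass)
head(type_dummies)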

The first visualization step is usually to plot the distribution of each variable. Below are histograms for each of the nine numeric predictors; to keep things simple, we exclude Type.

# Reshape to long format so each predictor gets its own facet
Glass %>%
  pivot_longer(!Type,
               names_to = "key", 
               values_to = "value") %>%
  ggplot(aes(value)) +
  geom_histogram() +
  facet_wrap(~key, scales = 'free')

The plots show that some of the variables, such as Potassium (K), Barium (Ba), and Iron (Fe), have classic heavy right-skewed distributions. Others, such as Calcium (Ca) and the refractive index (RI), are moderately right-skewed, while Silicon (Si), Sodium (Na), and Aluminum (Al) are borderline symmetrical. Magnesium (Mg) does not fit any of these categories: it peaks at the ends and dips in the middle. Perhaps it follows a beta-like distribution, but that is not clear at this point. Type has no immediately recognizable shape either, but as a categorical variable it is of little concern here. We can quantify skew with the sample skewness statistic.

library(moments)
# Sample skewness for each numeric predictor, largest first
Glass %>%
  pivot_longer(!Type,
               names_to = "key", 
               values_to = "value") %>%
  group_by(key) %>%
  summarise(skewness = skewness(value)) %>%
  arrange(desc(skewness))
## # A tibble: 9 x 2
##   key   skewness
##   <chr>    <dbl>
## 1 K        6.51 
## 2 Ba       3.39 
## 3 Ca       2.03 
## 4 Fe       1.74 
## 5 RI       1.61 
## 6 Al       0.901
## 7 Na       0.451
## 8 Si      -0.725
## 9 Mg      -1.14

As seen in the histograms, K, Ba, and Ca show the heaviest right skew. We will likely need transformations to handle these variables.

Transforms

Scaling and Centering

The next step, with continuous data at least, is to look at the data after transformations. The simplest is z-scaling, which subtracts each variable's mean and divides by its standard deviation. This sets every mean to 0 and puts the variability on a common scale. It will, of course, not be applied to the Type variable.
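
Concretely, each value \(x\) of a predictor is replaced by \[ x^* = \frac{x - \bar{x}}{s} \] where \(\bar{x}\) and \(s\) are that predictor's sample mean and standard deviation.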

# Z-scale the numeric predictors, then re-attach Type
scaled <- 
  Glass %>%
  select(!Type) %>%
  scale() %>%
  as.data.frame()

glass_scale <- cbind(Glass$Type, scaled)

glass_scale %>%
  rename(Type = `Glass$Type`) %>%
  pivot_longer(!Type,
               names_to = "key",
               values_to = "value") %>%
  ggplot(aes(value)) +
  geom_histogram() +
  facet_wrap(~key, scales = 'free')

Unfortunately, this did not help matters much, and it did not reduce the skew at all, which stands to reason: centering and scaling do not change the shape of a distribution. In practice I find scaling and centering fairly unhelpful on their own; they add a layer of complexity for non-technical audiences while leaving issues such as skew untouched.

Logarithmic Transformation

A transformation often used to address skew in particular is the log transform. Again, this is inappropriate for the Type variable.

# Log-transform the numeric predictors, then re-attach Type
logged <- 
  Glass %>%
  select(!Type) %>%
  log()

glass_log <- cbind(Glass$Type, logged)

glass_log %>%
  rename(Type = `Glass$Type`) %>%
  pivot_longer(!Type,
               names_to = "key", 
               values_to = "value") %>%
  ggplot(aes(value)) +
  geom_histogram() +
  facet_wrap(~key, scales = 'free')

glass_log %>%
  rename(Type = `Glass$Type`) %>%
  pivot_longer(!Type,
               names_to = "key", 
               values_to = "value") %>%
  group_by(key) %>%
  summarise(skewness = skewness(value)) %>%
  arrange(desc(skewness))
## # A tibble: 9 x 2
##   key   skewness
##   <chr>    <dbl>
## 1 RI      1.60  
## 2 Ca      1.06  
## 3 Na      0.0341
## 4 Si     -0.794 
## 5 Al     -0.841 
## 6 Ba    NaN     
## 7 Fe    NaN     
## 8 K     NaN     
## 9 Mg    NaN

Now some of the distributions begin to look more symmetrical, suggesting the raw variables may be roughly lognormal. Magnesium, however, becomes heavily left-skewed after logging, which confirms to some extent our earlier observation that it does not follow a simple unimodal shape.

Note that many of the variables have observations equal to 0, which return -Inf when logged. This is the reason for the NaNs in the skewness calculations, and it also suggests that a plain log transform may not be the best choice for those variables.
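
One common workaround is to shift the data by one before taking the log, so that zeros map to zero instead of -Inf. A minimal sketch (the +1 shift is an ad hoc choice on our part):

# log1p computes log(x + 1), keeping zero-valued observations finite
glass_log1p <- Glass %>%
  select(!Type) %>%
  mutate(across(everything(), log1p))

glass_log1p %>%
  summarise(across(everything(), skewness))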

Other Transformations

Other transformations include power transformations and the Box-Cox family of transforms with parameter \(\lambda\): \[ x^* = \begin{cases} \frac{x^\lambda - 1}{\lambda}\qquad &\textrm{if }\lambda \neq 0\\ \log(x) &\textrm{if }\lambda = 0 \end{cases} \]

We have used this extensively in class already, so let's see whether Box-Cox is of use here. My guess is that it will only help for a handful of the elements. We will look at the estimated lambda values first and decide where to go from there.

# Estimate the Box-Cox lambda for each predictor
Glass %>%
  pivot_longer(!Type,
               names_to = "key", 
               values_to = "value") %>%
  group_by(key) %>%
  summarise(lambda_key = forecast::BoxCox.lambda(value))
## # A tibble: 9 x 2
##   key   lambda_key
##   <chr>      <dbl>
## 1 Al        0.362 
## 2 Ba        0.0884
## 3 Ca       -0.385 
## 4 Fe        0.131 
## 5 K         0.0635
## 6 Mg        1.00  
## 7 Na       -0.743 
## 8 RI       -1.00  
## 9 Si       -1.00

This is likely to do little for our data. RI and Si get an inverse (\(\lambda = -1\)) and Mg a power of one (that is, no transform), none of which helps. K, Ba, and Fe have \(\lambda\) near zero, consistent with the log transform we already tried. The cube root for Al, and perhaps a small negative power for Ca, might help, but that is about it. As such, we will pass on Box-Cox.
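
A related caveat: Box-Cox is only defined for strictly positive data, and several of these predictors contain zeros. The Yeo-Johnson transform extends Box-Cox to zero and negative values; a sketch using caret's preProcess, assuming caret is installed:

library(caret)
# Yeo-Johnson handles the zero-valued observations that break Box-Cox
yj <- preProcess(Glass %>% select(!Type), method = "YeoJohnson")
glass_yj <- predict(yj, Glass %>% select(!Type))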

Variable Correlations

To visually inspect correlation, the simplest approach is to create scatter plots of each variable against each other.

GGally::ggpairs(Glass)

There is not much in the way of overwhelming correlation, but there do seem to be some relationships. For example, the refractive index (RI) appears highly positively correlated with Ca and somewhat negatively correlated with Si.
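
To put numbers on these impressions, we can compute the correlation matrix directly (a quick check; output omitted):

# Pairwise correlations among the numeric predictors, rounded for readability
Glass %>%
  select(!Type) %>%
  cor() %>%
  round(2)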

Outliers

The scatterplots don’t show strong evidence of outliers on their own. However, boxplots or violin plots are often more useful for identifying them.

# Boxplots flag points beyond 1.5 * IQR of the quartiles as potential outliers
Glass %>%
  pivot_longer(!Type,
               names_to = "key", 
               values_to = "value") %>%
  ggplot(aes(key, value, fill=key)) +
  geom_boxplot()

There are clearly a few observations for each variable that exceed 1.5 times the IQR, the majority of them in Calcium. If outlier removal were the goal, we could further investigate whether these points have significant leverage using Cook's distance.
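
To make the boxplot reading concrete, this sketch counts the points beyond the 1.5 * IQR whiskers for each predictor (output omitted):

# Count observations outside the boxplot whiskers per predictor
Glass %>%
  pivot_longer(!Type, names_to = "key", values_to = "value") %>%
  group_by(key) %>%
  summarise(n_outside = sum(value < quantile(value, 0.25) - 1.5 * IQR(value) |
                            value > quantile(value, 0.75) + 1.5 * IQR(value)))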

Summary

The variables in the Glass dataset have a range of distributions, from highly skewed to nearly symmetrical, with an expected handful of outliers. A sequence of transformations was applied to address some of these issues, with the log transform appearing most helpful. Some dimension reduction could also be considered, depending on the desired interpretability. Textbooks generally advise PCA, but a simpler, more explainable reducer is often preferable in practice; I find tree-based importance measures more straightforward than PCA for identifying redundant predictors to remove.
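
As an illustration of that last point, here is a minimal sketch of tree-based importance, assuming the randomForest package is available; it is not part of the assignment:

library(randomForest)
# Fit a forest and inspect which predictors carry the most information
set.seed(42)
rf <- randomForest(Type ~ ., data = Glass, importance = TRUE)
importance(rf)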

Question KJ 3.2

Question

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

data(Soybean)
  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
  2. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
  3. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Answer

Part (a)

# Every column is a factor; cast to character so all values share one column
Soybean %>%
  mutate(across(everything(), as.character)) %>%
  pivot_longer(everything(),
               names_to = "key",
               values_to = "value") %>%
  ggplot(aes(value)) +
  geom_bar() +
  facet_wrap(~key, scales = 'free')

From the above, we see that no distribution is mathematically degenerate: every predictor takes more than one value. We should still check the frequency ratio of the most common level to the second most common level for each predictor.

# Frequency ratio: the most common level's count over the second most common's
name_col <- names(Soybean)
soy_ratios <- data.frame(variable = character(), ratio = double())
for (i in seq_len(ncol(Soybean))) {
  tab <- table(Soybean[, i])
  soy_ratios[i, 1] <- name_col[i]
  soy_ratios[i, 2] <- max(tab) / max(tab[-which.max(tab)])
}
soy_ratios[which(soy_ratios$ratio > 20), ]
##     variable  ratio
## 19 leaf.mild  26.75
## 26  mycelium 106.50
## 28 sclerotia  31.25

Here we have three variables with a ratio higher than 20, the cutoff suggested by Kuhn & Johnson. These predictors are candidates for removal or some other treatment due to their imbalance; mycelium, with the most extreme ratio, is the most likely candidate for removal.
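
caret packages this same check; its default freqCut of 95/5 = 19 is close to our cutoff of 20. A sketch, assuming caret is installed:

library(caret)
# saveMetrics = TRUE returns the frequency ratio and a near-zero-variance flag per column
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ]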

Part (b)

There is clearly a relationship between Class and the missingness of the data.

# Total missing values per class, keeping only classes with any missingness
Soybean %>% 
  group_by(Class) %>%
  summarise_all(~sum(is.na(.))) %>%
  mutate(sum_missing = rowSums(across(where(is.numeric)))) %>%
  select(Class, sum_missing) %>%
  filter(sum_missing > 0)
## # A tibble: 5 x 2
##   Class                       sum_missing
##   <fct>                             <dbl>
## 1 2-4-d-injury                        450
## 2 cyst-nematode                       336
## 3 diaporthe-pod-&-stem-blight         177
## 4 herbicide-injury                    160
## 5 phytophthora-rot                   1214

It’s clear that some classes have significant missingness while the other 14 classes have none.
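
Part (b) also asks which predictors are most likely to be missing; a quick tally per predictor (output omitted):

# Missing-value count per predictor, most affected first
Soybean %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(),
               names_to = "predictor",
               values_to = "n_missing") %>%
  arrange(desc(n_missing))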

Part (c)

From the above we see that when a variable is missing, it tends to be missing for the entirety of its class.

Lastly, of the 35 explanatory variables (not counting Class itself), the classes with missing data tend to be missing values for most of them:

# Missing-value counts per class for every predictor
Soybean %>% 
  group_by(Class) %>%
  summarise_all(~sum(is.na(.)))
## # A tibble: 19 x 36
##    Class  date plant.stand precip  temp  hail crop.hist area.dam sever seed.tmt
##    <fct> <int>       <int>  <int> <int> <int>     <int>    <int> <int>    <int>
##  1 2-4-…     1          16     16    16    16        16        1    16       16
##  2 alte…     0           0      0     0     0         0        0     0        0
##  3 anth…     0           0      0     0     0         0        0     0        0
##  4 bact…     0           0      0     0     0         0        0     0        0
##  5 bact…     0           0      0     0     0         0        0     0        0
##  6 brow…     0           0      0     0     0         0        0     0        0
##  7 brow…     0           0      0     0     0         0        0     0        0
##  8 char…     0           0      0     0     0         0        0     0        0
##  9 cyst…     0          14     14    14    14         0        0    14       14
## 10 diap…     0           6      0     0    15         0        0    15       15
## 11 diap…     0           0      0     0     0         0        0     0        0
## 12 down…     0           0      0     0     0         0        0     0        0
## 13 frog…     0           0      0     0     0         0        0     0        0
## 14 herb…     0           0      8     0     8         0        0     8        8
## 15 phyl…     0           0      0     0     0         0        0     0        0
## 16 phyt…     0           0      0     0    68         0        0    68       68
## 17 powd…     0           0      0     0     0         0        0     0        0
## 18 purp…     0           0      0     0     0         0        0     0        0
## 19 rhiz…     0           0      0     0     0         0        0     0        0
## # … with 26 more variables: germ <int>, plant.growth <int>, leaves <int>,
## #   leaf.halo <int>, leaf.marg <int>, leaf.size <int>, leaf.shread <int>,
## #   leaf.malf <int>, leaf.mild <int>, stem <int>, lodging <int>,
## #   stem.cankers <int>, canker.lesion <int>, fruiting.bodies <int>,
## #   ext.decay <int>, mycelium <int>, int.discolor <int>, sclerotia <int>,
## #   fruit.pods <int>, fruit.spots <int>, seed <int>, mold.growth <int>,
## #   seed.discolor <int>, seed.size <int>, shriveling <int>, roots <int>

All the classes with missing data are missing more than half of their values, with the exception of diaporthe-pod-&-stem-blight. Those classes are likely to be dropped from the dataset, since we would not want to impute the majority of their data.

As for diaporthe-pod-&-stem-blight, we would want to see how predictive its missing features are in the remaining data. They may not be of much value; if they are, we could impute the 15 or so affected cases.
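
A minimal sketch of that strategy, dropping the heavily missing classes and mode-imputing the rest; the impute_mode helper is our own illustration, not from the text:

# Drop classes missing most of their data, then fill remaining NAs with each
# predictor's most common level (simple mode imputation)
impute_mode <- function(x) {
  x[is.na(x)] <- names(which.max(table(x)))
  x
}

soy_clean <- Soybean %>%
  filter(!Class %in% c("2-4-d-injury", "cyst-nematode",
                       "herbicide-injury", "phytophthora-rot")) %>%
  mutate(across(!Class, impute_mode))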