Questions

Question 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
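library(mlbench)    # provides the Glass data
library(tidyverse)  # dplyr, tidyr and ggplot2
data(Glass)
# Reshape to long format and draw one histogram per predictor, filled by glass type
Glass %>%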
  gather(key='predictor', value='response', -Type) %>%
  ggplot(aes(response, fill=Type)) +
  geom_histogram(bins=20) +
  facet_wrap(~predictor, scales='free') +
  labs(x='', y='') +
  scale_x_continuous(expand=c(0.02, 0.02, 0.02, 0.02)) +
  scale_y_continuous(expand=c(0, 0, 0.05, 0.05)) +
  scale_fill_brewer(palette='Set1') + 
  theme_bw()

The nine predictors have very different distributions. Al, Na, and Mg appear to have strong predictive value: their histograms show some separation by glass type, which suggests they would help a model. Conversely, Ba and Fe take such a narrow range of values that it is unclear whether they can contribute to a model at all; for both, the large majority of observations are 0 or very close to it. Other predictors are closer to normally distributed and may be of some use as well. The predictor values may need to be scaled because they are measured on very different scales.
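The statements about near-zero values and differing scales can be checked numerically. The following is a minimal base R sketch, not part of the original analysis; `zero_share` and `value_range` are hypothetical helper names introduced here, and the sketch assumes `Glass` is the copy shipped with the `mlbench` package.

```r
# Hypothetical helper: fraction of observations that are exactly 0
# in each numeric column, sorted from most to least zero-heavy.
zero_share <- function(df) {
  num <- df[sapply(df, is.numeric)]
  sort(colMeans(num == 0), decreasing = TRUE)
}

# Hypothetical helper: minimum and maximum of each numeric column,
# to compare the scales the predictors are measured on.
value_range <- function(df) {
  num <- df[sapply(df, is.numeric)]
  t(sapply(num, range))
}

# Assumption: Glass is the data set from the mlbench package.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(round(zero_share(Glass), 2))
  print(value_range(Glass))
}
```

Counting exact zeros is a stricter test than "essentially 0", so the shares printed here are a lower bound on what the histograms suggest.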

  1. Do there appear to be any outliers in the data? Are any predictors skewed?
Glass %>%
  gather(key='predictor', value='response', -Type) %>%
  ggplot(aes(predictor, response)) +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
  facet_wrap(~predictor, scales='free') +
  labs(x='', y='') +
  theme_bw() +
  theme(panel.border = element_blank(),
        strip.background = element_rect(fill='grey80', color='white'))

Glass %>%
  gather(key='predictor', value='response', -Type) %>%
  ggplot(aes(predictor, response)) +
  geom_boxplot() +
  facet_wrap(~predictor, scales='free') +
  labs(x='', y='') +
  theme_bw() +
  theme(panel.border = element_blank(),
        strip.background = element_rect(fill='grey80', color='white'))

Comparing the violin plots with the boxplots, we can see that Ba, Fe, and K are heavily right-skewed with outliers, while Al, Ca, and Na are closer to normally distributed. Mg is bimodal. The boxplots mark every point that lies more than 1.5 times the interquartile range beyond the quartiles, which is the conventional definition of an outlier.
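The visual impression of skew and outliers can be backed by numbers. Below is a small base R sketch under stated assumptions: `skew`, `n_outliers`, and `summarise_shape` are hypothetical helpers, the skewness is the plain moment-based estimator (other packages apply small-sample corrections), and outliers are counted with the same 1.5 × IQR rule that `boxplot.stats` uses by default.

```r
# Hypothetical helper: moment-based sample skewness.
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Hypothetical helper: number of points flagged by the boxplot rule
# (beyond 1.5 * IQR from the hinges, the default in boxplot.stats).
n_outliers <- function(x) length(boxplot.stats(x)$out)

# One row per numeric column with its skewness and outlier count.
summarise_shape <- function(df) {
  num <- df[sapply(df, is.numeric)]
  data.frame(predictor = names(num),
             skewness  = sapply(num, skew),
             outliers  = sapply(num, n_outliers),
             row.names = NULL)
}

# Assumption: Glass is the data set from the mlbench package.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(summarise_shape(Glass))
}
```

A skewness near 0 indicates a roughly symmetric predictor; large positive values correspond to the long right tails seen in the plots.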

  1. Are there any relevant transformations of one or more predictors that might improve the classification model?

The transformations that should be applied depend on the model being developed. For example, random forests are robust to predictors that are skewed or that differ by orders of magnitude. On the other hand, if the model is a multiple logistic regression (or a GAM), the predictors will need to be transformed.

In this case, the predictors should be scaled, as they are measured on very different scales. For example, Si and Fe differ by orders of magnitude. Next, we would want to address the skew of the predictors, either with a Box-Cox transformation or by simply taking the natural log of the predictor values. Both require strictly positive values, so predictors that contain zeros (Ba and Fe are mostly zero) would first need a small offset or a transformation that tolerates zeros. Ba may need to be removed entirely, as it is a near-zero-variance predictor, which can cause trouble for regression models. Conversely, for tree-based models these modifications would not be necessary, as trees are robust to skewed predictors on different scales.
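The preprocessing described above could be sketched as follows. This is an illustration under assumptions rather than the analysis actually run: it assumes the `caret` and `mlbench` packages are installed, `preprocess_predictors` is a hypothetical helper, and the Yeo-Johnson transformation is swapped in for Box-Cox because it accepts zero values.

```r
# Hypothetical helper: Yeo-Johnson transform, then center and scale.
# Assumption: caret is installed. Yeo-Johnson replaces Box-Cox here
# because several Glass predictors contain zeros.
preprocess_predictors <- function(x) {
  pp <- caret::preProcess(x, method = c("YeoJohnson", "center", "scale"))
  predict(pp, x)
}

if (requireNamespace("mlbench", quietly = TRUE) &&
    requireNamespace("caret", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  predictors <- Glass[names(Glass) != "Type"]

  # Frequency-ratio and percent-unique diagnostics for each predictor,
  # to check whether any is flagged as near-zero variance.
  print(caret::nearZeroVar(predictors, saveMetrics = TRUE))

  # After preprocessing, every column should have unit standard deviation.
  transformed <- preprocess_predictors(predictors)
  print(round(sapply(transformed, sd), 2))
}
```

Whether Ba is actually flagged depends on the cutoffs used by `nearZeroVar`, so the diagnostics are worth inspecting before dropping it.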

Question 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
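data(Soybean, package='mlbench')  # Soybean ships with the mlbench package
# Count the observations at each level of every categorical predictor
Soybean %>%
  select_if(is.factor) %>%
  gather(key='predictor', value='value', -Class) %>%
  count(predictor, value) %>%
  ggplot(aes(value, n)) +
  geom_col() +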
  facet_wrap(~predictor, scales='free') +
  labs(x='', y='') +
  scale_y_continuous(expand=c(0, 0, 0.05, 0.05)) +
  scale_x_discrete(expand=c(0.15, 0.15, 0.15, 0.15)) + 
  theme_bw() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.y = element_blank())

There are a few degenerate distributions, that is, distributions in which a single value is so dominant that the predictor carries essentially no information. These are int.discolor, leaf.mild, mycelium, sclerotia, and possibly shriveling. It is important to note that none of these is truly degenerate, since every level of each predictor has at least one observation. However, the distributions are so imbalanced that they may cause problems, depending on the model selected. Each of these predictors should be investigated further before it is removed; removing a predictor should be a last resort.
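The visual judgment can be complemented by the two diagnostics commonly used for near-zero-variance predictors: the ratio of the most common to the second most common level, and the percentage of distinct values. The base R sketch below uses a hypothetical helper, `degeneracy_metrics`; the cutoffs of 20 and 10% are assumed rules of thumb, not thresholds taken from this analysis.

```r
# Hypothetical helper: frequency ratio and percent-unique for each factor.
# The cutoffs (ratio > 20, percent unique < 10) are assumed rules of thumb.
degeneracy_metrics <- function(df, ratio_cut = 20, unique_cut = 10) {
  fac <- df[sapply(df, is.factor)]
  metric <- function(x) {
    counts <- sort(table(x), decreasing = TRUE)  # table() ignores NA by default
    c(freq_ratio = if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf,
      pct_unique = 100 * sum(counts > 0) / length(x))
  }
  out <- as.data.frame(t(sapply(fac, metric)))
  out$flagged <- out$freq_ratio > ratio_cut & out$pct_unique < unique_cut
  out[order(-out$freq_ratio), ]
}

# Assumption: Soybean is the data set from the mlbench package.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  print(degeneracy_metrics(Soybean[names(Soybean) != "Class"]))
}
```

Predictors at the top of the output are the most imbalanced and are the first candidates for the further investigation suggested above.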

  1. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

## 
##  Variables sorted by number of missings: 
##         Variable Count
##             hail   121
##            sever   121
##         seed.tmt   121
##          lodging   121
##             germ   112
##        leaf.mild   108
##  fruiting.bodies   106
##      fruit.spots   106
##    seed.discolor   106
##       shriveling   106
##      leaf.shread   100
##             seed    92
##      mold.growth    92
##        seed.size    92
##        leaf.halo    84
##        leaf.marg    84
##        leaf.size    84
##        leaf.malf    84
##       fruit.pods    84
##           precip    38
##     stem.cankers    38
##    canker.lesion    38
##        ext.decay    38
##         mycelium    38
##     int.discolor    38
##        sclerotia    38
##      plant.stand    36
##            roots    31
##             temp    30
##        crop.hist    16
##     plant.growth    16
##             stem    16
##             date     1
##         area.dam     1
##            Class     0
##           leaves     0

There is a strong pattern in the missing data. The above plot indicates that if a sample is missing one predictor, it is likely missing several others as well. In fact, no sample is missing only one or two predictors. This is problematic because it makes imputation dangerous: we would be filling in numerous values for individual samples.
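The code that produced the summary above is not shown. Comparable counts, along with the number of missing predictors per sample, can be obtained with base R; the sketch below uses a hypothetical helper, `missing_summary`, and assumes `Soybean` comes from the `mlbench` package.

```r
# Hypothetical helper: missing values per variable and per sample.
missing_summary <- function(df) {
  list(
    # Number of missing values in each column, largest first.
    by_variable = sort(colSums(is.na(df)), decreasing = TRUE),
    # How many samples are missing 0, 1, 2, ... predictors.
    by_row = table(rowSums(is.na(df)))
  )
}

# Assumption: Soybean is the data set from the mlbench package.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  ms <- missing_summary(Soybean)
  print(ms$by_variable)
  print(ms$by_row)
}
```

The `by_row` table is a direct check on the claim that samples with missing data are missing many predictors at once.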

Soybean %>%
  mutate(complete = complete.cases(.)) %>%
  count(Class, complete) %>%
  ggplot(aes(reorder(Class, -n, FUN=sum), n, fill=complete)) +
  geom_bar(stat='identity') +
  coord_flip() +
  scale_y_continuous(expand=c(0, 0, 0.05, 0.05)) + 
  scale_fill_brewer(palette='Set1') +
  labs(y='Observations',
       x='Soybean Type',
       fill='Complete Observations') + 
  theme_bw() +
  theme(panel.grid.major.y = element_blank(),
        panel.border = element_blank(),
        legend.position = c(0.65, 0.75),
        legend.background = element_rect(fill='grey90', color='black'),
        legend.key = element_rect(color='grey90'))

The missing data are confined to 5 of the classes. For 4 of these classes, every single observation is missing multiple predictors, and for the fifth a sizeable majority of samples are missing multiple predictors.
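The same breakdown can be tabulated rather than read off the chart. The following base R sketch uses a hypothetical helper, `incomplete_by_class`, and assumes the class column is named `Class`, as it is in the `mlbench` copy of `Soybean`.

```r
# Hypothetical helper: per class, the number of samples and how many of
# them have at least one missing predictor. Only affected classes are kept.
incomplete_by_class <- function(df, class_col = "Class") {
  incomplete <- !complete.cases(df)
  cls <- as.character(df[[class_col]])
  n <- tapply(incomplete, cls, length)
  n_incomplete <- tapply(incomplete, cls, sum)
  out <- data.frame(class        = names(n),
                    n            = as.vector(n),
                    n_incomplete = as.vector(n_incomplete))
  out$share_incomplete <- out$n_incomplete / out$n
  out[out$n_incomplete > 0, ]
}

# Assumption: Soybean is the data set from the mlbench package.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  print(incomplete_by_class(Soybean), row.names = FALSE)
}
```

A `share_incomplete` of 1 marks a class in which every sample has missing predictors.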

  1. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The strategy taken depends on the goal of the model. If I were given free rein, I would eliminate the 5 classes that are missing predictors. This would leave me with approximately 82% of the available data and, more importantly, no missing data. Of course, eliminating classes is not something to be taken lightly (and may not even be appropriate given the task). If I am required to keep all of the classes, the next step would be to determine which (if any) of the predictors can be imputed and which may be better left out because of the number of missing values. For example, hail, sever, seed.tmt, and lodging are each missing in 121 observations. Furthermore, several plots (not shown) demonstrate that the soybeans that are missing data are all missing the same data. Imputing values is therefore a bad idea.
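The first option, dropping every class that contains incomplete samples, can be sketched in base R as follows. `drop_incomplete_classes` is a hypothetical helper, and the sketch assumes the `mlbench` copy of `Soybean` with its `Class` column; it reports how much data survives rather than assuming a particular figure.

```r
# Hypothetical helper: remove every class that has at least one sample
# with a missing predictor, then drop the now-unused factor levels.
drop_incomplete_classes <- function(df, class_col = "Class") {
  cls <- as.character(df[[class_col]])
  affected <- unique(cls[!complete.cases(df)])
  kept <- df[!(cls %in% affected), , drop = FALSE]
  if (is.factor(kept[[class_col]])) {
    kept[[class_col]] <- droplevels(kept[[class_col]])
  }
  kept
}

# Assumption: Soybean is the data set from the mlbench package.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  reduced <- drop_incomplete_classes(Soybean)
  cat("Rows kept:", nrow(reduced), "of", nrow(Soybean), "\n")
  cat("Classes kept:", nlevels(reduced$Class), "of", nlevels(Soybean$Class), "\n")
  cat("Remaining missing values:", sum(is.na(reduced)), "\n")
}
```

Because whole classes are removed, complete samples that belong to a partially affected class are dropped too, which is part of the cost of this strategy.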