3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

  a. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
library(mlbench)    # Glass and Soybean data sets
library(tidyverse)  # ggplot2, dplyr, tidyr, purrr
library(corrplot)   # corrplot()
library(e1071)      # skewness()
library(caret)      # BoxCoxTrans(), preProcess(), nearZeroVar(), findCorrelation()

data(Glass)

Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_histogram(bins = 15) + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms of Numerical Predictors")

Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors")

Glass %>%
  keep(is.numeric) %>%
  cor() %>%
  corrplot() 

Glass %>%
  ggplot() +
  geom_bar(aes(x = Type)) +
  ggtitle("Distribution of Types of Glass")

It can be seen that:

* Al is slightly right skewed
* Ba is right skewed and mostly centered around 0
* Ca is right skewed
* Fe is right skewed and mostly centered around 0
* K is right skewed
* Mg is left skewed and bimodal
* Na is almost normal with a slight right tail
* RI is right skewed
* Si is left skewed
* Type is mostly concentrated in Types 1, 2, and 7

There seems to be a strong positive correlation between RI and Ca. There are also notable negative correlations between RI and Si, Al and Mg, Ca and Mg, and Ba and Mg, and a notable positive correlation between Ba and Al.
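To flag such collinear predictors programmatically, one option (a sketch, not part of the original analysis; the 0.75 cutoff is an assumption) is caret's findCorrelation:

# Names of predictors whose pairwise correlations exceed the assumed cutoff;
# caret suggests these as candidates for removal.
Glass %>%
  keep(is.numeric) %>%
  cor() %>%
  findCorrelation(cutoff = 0.75, names = TRUE)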

  b. Do there appear to be any outliers in the data? Are any predictors skewed?

There appear to be outliers in Ba, K, RI, Ca, Fe, and possibly Na; the check below counts them per predictor. Several predictors are also skewed, as noted in part (a), which the skewness statistics confirm.
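As a quick numeric check (a sketch, not part of the original write-up), the 1.5 * IQR rule that geom_boxplot() uses to flag points can be applied to each predictor:

# Count observations outside the 1.5 * IQR whiskers for each numeric
# predictor (the same rule geom_boxplot() uses to flag outliers)
Glass %>%
  keep(is.numeric) %>%
  sapply(function(x) {
    qs  <- quantile(x, c(0.25, 0.75))
    iqr <- qs[2] - qs[1]
    sum(x < qs[1] - 1.5 * iqr | x > qs[2] + 1.5 * iqr)
  })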

# Skewness of each numeric predictor (skewness() comes from e1071)
Glass %>%
  keep(is.numeric) %>%
  apply(., 2, skewness) %>%
  round(4)
##      RI      Na      Mg      Al      Si       K      Ca      Ba      Fe 
##  1.6027  0.4478 -1.1365  0.8946 -0.7202  6.4601  2.0184  3.3687  1.7298
  c. Are there any relevant transformations of one or more predictors that might improve the classification model?

Since Ba, Fe, and K have a strong right skewness with a concentration of points at low values, they may benefit from a log transformation; Mg, which is left skewed, may also benefit from a transformation. The table below shows the optimal Box-Cox lambdas: RI could be inverse squared (lambda = -2), Si squared (lambda = 2), and Al square rooted (lambda = 0.5). It would also be interesting to see how the model performs without Ca, since it is correlated with other predictors.

# Optimal Box-Cox lambda for each numeric predictor (BoxCoxTrans() is from caret)
Glass %>%
  keep(is.numeric) %>%
  summarise_all(~ BoxCoxTrans(.)$lambda)
##   RI   Na Mg  Al Si  K   Ca Ba Fe
## 1 -2 -0.1 NA 0.5  2 NA -1.1 NA NA
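The NA lambdas correspond to predictors that contain zeros (Mg, K, Ba, Fe), which Box-Cox cannot handle because it requires strictly positive values. As a rough sketch of how the transformations could be applied (not part of the original analysis), caret's preProcess can estimate and apply Box-Cox to the eligible columns, with Yeo-Johnson as a possible alternative for the zero-containing predictors:

# Estimate Box-Cox transformations for the numeric predictors and apply them;
# columns containing non-positive values are left untransformed.
glass_predictors <- Glass %>% keep(is.numeric)
bc <- preProcess(glass_predictors, method = "BoxCox")
glass_transformed <- predict(bc, glass_predictors)

# Assumed alternative: Yeo-Johnson also handles zeros, so Mg, K, Ba, and Fe
# would be transformed as well.
yj <- preProcess(glass_predictors, method = "YeoJohnson")
glass_yj <- predict(yj, glass_predictors)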

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

  a. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
data(Soybean)

columns <- colnames(Soybean)

# One bar chart per predictor, flipped so the category labels are readable
lapply(columns,
  function(col) {
    ggplot(Soybean, aes(.data[[col]])) +
      geom_bar() +
      coord_flip() +
      ggtitle(col)})

Degenerate distributions are ones that take on essentially a single value. mycelium and sclerotia appear to be degenerate, and leaf.mild and leaf.malf are also nearly one-sided once the missing values are discounted.
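This can also be checked programmatically; a minimal sketch using caret's nearZeroVar, which flags the near-zero-variance (degenerate) predictors discussed in the chapter:

# freqRatio, percentUnique, and zero/near-zero-variance flags for each column;
# rows with nzv == TRUE are the degenerate candidates.
nzv_metrics <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, ]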

  b. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
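Before breaking this down by predictor, a quick overall check (a sketch, not part of the original write-up) of how much data is missing:

soy_predictors <- Soybean %>% select(-Class)
mean(is.na(soy_predictors))            # share of predictor cells that are NA
mean(!complete.cases(soy_predictors))  # share of samples with at least one NA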
# Proportion of missing values for each variable
Soybean %>%
  mutate(across(everything(), is.na)) %>%
  pivot_longer(everything(), names_to = "variables", values_to = "missing") %>%
  count(variables, missing) %>%
  ggplot(aes(y = variables, x = n, fill = missing)) +
  geom_col(position = "fill") +
  labs(title = "Proportion of Missing Values",
       x = "Proportion") +
  scale_fill_manual(values = c("grey", "red"))

# Proportion of incomplete samples within each class
Soybean %>%
  group_by(Class) %>%
  mutate(class_Total = n()) %>%
  ungroup() %>%
  filter(!complete.cases(.)) %>%
  group_by(Class) %>%
  mutate(Missing = n(),
         Proportion = Missing / class_Total) %>%
  ungroup() %>%
  select(Class, Proportion) %>%
  distinct()
## # A tibble: 5 x 2
##   Class                       Proportion
##   <fct>                            <dbl>
## 1 phytophthora-rot                 0.773
## 2 diaporthe-pod-&-stem-blight      1    
## 3 cyst-nematode                    1    
## 4 2-4-d-injury                     1    
## 5 herbicide-injury                 1
# Re-check missingness after dropping the five classes with incomplete cases
Soybean %>%
  filter(!Class %in% c("phytophthora-rot", "diaporthe-pod-&-stem-blight",
                       "cyst-nematode", "2-4-d-injury", "herbicide-injury")) %>%
  mutate(across(everything(), is.na)) %>%
  pivot_longer(everything(), names_to = "variables", values_to = "missing") %>%
  count(variables, missing) %>%
  ggplot(aes(y = variables, x = n, fill = missing)) +
  geom_col(position = "fill") +
  labs(title = "Proportion of Missing Values with Incomplete Classes Removed",
       x = "Proportion") +
  scale_fill_manual(values = c("grey", "red"))

There does appear to be a pattern: the missing values are concentrated in particular classes. Only the five classes listed above contain incomplete cases, and after they are removed from the data there is no missing data left.

  c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

One strategy would be to remove those five classes from the data entirely. Alternatively, the data can be subset by class, handling those five classes separately, and the variables with missing values can then be imputed using KNN within each subset. Any variable that is entirely missing for one of those classes can simply be removed from that subset.
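A minimal sketch of the first strategy (object names are illustrative): drop the five incomplete classes and confirm that no missing values remain.

# Drop the five classes that account for all of the missing values,
# then verify that the remaining data are complete.
incomplete_classes <- c("phytophthora-rot", "diaporthe-pod-&-stem-blight",
                        "cyst-nematode", "2-4-d-injury", "herbicide-injury")

soy_complete <- Soybean %>%
  filter(!Class %in% incomplete_classes) %>%
  droplevels()

sum(!complete.cases(soy_complete))  # expected to be 0, per the plot above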