library(mlbench)
library(tidyverse)
library(caret)
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Glass %>%
select(-Type) %>%
pivot_longer(cols = everything(), names_to = "predictors", values_to = "vals") %>%
ggplot(aes(x = vals))+
geom_histogram(bins = 30, fill = "coral", color = "black", alpha = 0.5)+
facet_wrap(~ predictors, scales = "free")+
theme_minimal()
library(corrplot)
## corrplot 0.92 loaded
# Remove the 'Type' column so only the numeric predictors remain
Glass_filtered <- Glass %>%
select(-Type)
# Calculate the correlation matrix
cor_matrix <- cor(Glass_filtered, use = "complete.obs")
# Create the correlation plot
corrplot(cor_matrix, method = "color", type = "upper",
tl.col = "black", tl.srt = 45,
addCoef.col = "black", number.cex = 0.7,
col = colorRampPalette(c("darkgreen", "white", "coral"))(200))
Checking for skewness, there are a few clear examples of right skew, namely the predictors Ba, Fe, and K, while Mg exhibits a left skew.
Al, Ca, Si, Na, and RI exhibit the most centrality across the data.
As for correlation, Ca and RI, as well as Al and Fe, exhibit the strongest positive relationships, while Si and K, and Ba and Mg, have the strongest negative relationships.
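These skewness observations can be checked numerically; a minimal sketch using e1071::skewness() (the e1071 package is assumed to be available, as it is not loaded elsewhere in this write-up; positive values indicate right skew, negative values left skew):
library(e1071)
# Skewness of each numeric predictor; positive = right skew, negative = left skew
Glass %>%
  select(-Type) %>%
  summarise(across(everything(), skewness)) %>%
  pivot_longer(everything(), names_to = "predictor", values_to = "skewness") %>%
  arrange(desc(skewness))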
# Using the same long-format reshape as in part (a):
Glass_long <- Glass %>% select(-Type) %>%
pivot_longer(cols = everything(), names_to = "predictors", values_to = "vals")
ggplot(Glass_long, aes(x = predictors, y = vals)) +
geom_boxplot(fill = "coral", color = "black", alpha = 0.7) +
labs(title = "Boxplots for All Variables in the Glass Dataset",
x = "Variable",
y = "Value") +
theme_minimal() +
facet_wrap(~ predictors, scales = "free")
* Nearly all of the predictors in this dataset contain significant outliers;
Mg is the main exception (a rough count using the boxplot rule is sketched below).
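A minimal sketch counting how many values of each predictor fall beyond the 1.5 * IQR boxplot fences:
# Count values flagged by the 1.5 * IQR boxplot rule for each predictor
Glass %>%
  select(-Type) %>%
  pivot_longer(everything(), names_to = "predictor", values_to = "vals") %>%
  group_by(predictor) %>%
  summarise(n_outliers = sum(vals < quantile(vals, 0.25) - 1.5 * IQR(vals) |
    vals > quantile(vals, 0.75) + 1.5 * IQR(vals)))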
bc_glass <- Glass %>%
select( -Type) %>%
preProcess(method = c("BoxCox"))
bc_glass
## Created from 214 samples and 5 variables
##
## Pre-processing:
## - Box-Cox transformation (5)
## - ignored (0)
##
## Lambda estimates for Box-Cox transformation:
## -2, -0.1, 0.5, 2, -1.1
There are a few different transformations that could work for this dataset. To start, a Box-Cox transformation could be very helpful here, since it can stabilize skewness and non-normal distributions across the dataset (only a few variables look close to normally distributed). A square-root transformation could also be used; however, this problem has Box-Cox written all over it, given the variability, skewness, and outliers present in the data.
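To actually apply the estimated transformations, the preProcess object can be passed to predict(); a minimal sketch using the bc_glass object above. (Box-Cox requires strictly positive values, so predictors containing zeros are skipped, which is why only five lambda estimates are reported; caret's "YeoJohnson" method is one alternative that tolerates zeros.)
# Apply the Box-Cox transformations estimated by preProcess() to the predictors
glass_transformed <- predict(bc_glass, Glass %>% select(-Type))
head(glass_transformed)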
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
library(ggplot2)
library(mlbench)
data(Soybean)
Soybean %>%
select(-Class) %>%
gather() %>%
ggplot(aes(x = value))+
geom_bar()+
facet_wrap(~ key)+
ggtitle(label = "Soybean Categorical Dispersion")
## Warning: attributes are not identical across measure variables; they will be
## dropped
A degenerate distribution is one that places essentially all of its probability on a single value (a single category in this instance); this is also referred to as a constant distribution. There do seem to be quite a few predictor distributions in the dataset that are (nearly) degenerate; rather than list them by eye, they can be flagged programmatically, as sketched below.
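A minimal sketch using caret's nearZeroVar(), which flags predictors whose distributions are (nearly) constant:
# Flag predictors with (near-)zero variance, i.e. near-degenerate distributions
nzv <- nearZeroVar(Soybean %>% select(-Class), saveMetrics = TRUE)
nzv[nzv$nzv, ]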
Dealing with missing data is tricky and depends on what you are trying to address and achieve with your research. I have not been able to find any firm best practice on when missingness justifies dropping a variable (i.e., "this column is missing more than 15% of its data, so drop it"), so I think imputation is the way to go here. To impute across the missing predictors I will use MICE; the basic code is found here: https://libguides.princeton.edu/R-Missingdata. The imputation will use predictive mean matching (PMM) to fill in the missing values.
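Before imputing, it helps to quantify how much of each predictor is actually missing; a minimal sketch of the proportion of NAs per column:
# Proportion of missing values in each column, highest first
Soybean %>%
  summarise(across(everything(), ~ mean(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "predictor", values_to = "prop_missing") %>%
  arrange(desc(prop_missing))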
library(mice)
## Warning: package 'mice' was built under R version 4.3.3
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
imputed_data <- mice(Soybean, m=5, method = "pmm", print=FALSE)
## Warning: Number of logged events: 1666
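The warning about logged events can be inspected through the returned mids object, which records skipped or adjusted steps in its loggedEvents component:
# Inspect what mice logged during imputation (first few events)
head(imputed_data$loggedEvents)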
Just pulling out the first completed dataset created by PMM as an example:
complete_data_1 <- complete(imputed_data, action = 1)
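A quick sanity check that the completed dataset no longer contains missing values:
# Should return 0 if every NA was imputed
sum(is.na(complete_data_1))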