Do problems 3.1 and 3.2 from Kuhn and Johnson's Applied Predictive Modeling.

library(mlbench)
library(tidyverse)
library(e1071)
library(skimr)
library(caret)
library(corrplot)
library(ggplot2)
library(mice)

Question 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

data(Glass)
str(Glass)
  a. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
# Gather the data into a long format
glass_his <- Glass %>%
  select(!Type) %>%
  gather()


ggplot(glass_his, aes(value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~key, scales = 'free') +
  labs(title = "Histogram of Variables in Data Set Glass")

glass_cor <- Glass %>%
  select(!Type)

cor_matrix <- cor(glass_cor)

corrplot(cor_matrix, 
         method="color",
         addCoef.col = "black", 
         type="upper")



Checking for relationships between predictors, or multicollinearity, involves calculating the Pearson correlation coefficient for each pair of predictors. Although Ca and RI exhibit a strong correlation of 0.81, it is still acceptable to keep them as separate predictors.
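
As an automated check, caret's findCorrelation() flags predictors whose pairwise correlations exceed a cutoff; a quick sketch (the 0.75 cutoff is an arbitrary choice for illustration, and here it should single out one of the Ca/RI pair):

# Names of predictors recommended for removal at |r| > 0.75
findCorrelation(cor_matrix, cutoff = 0.75, names = TRUE)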

  b. Do there appear to be any outliers in the data? Are any predictors skewed?
ggplot(glass_his, aes(value)) +
  geom_boxplot() +
  facet_wrap(~key, scales = 'free') +
  labs(title = "Boxplot of Variables in Data Set Glass")
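
To back the visual impression with numbers, we can count how many points each boxplot flags under the usual 1.5 * IQR rule; a quick sketch using base R's boxplot.stats():

# Number of observations flagged as outliers per predictor
sapply(glass_cor, function(x) length(boxplot.stats(x)$out))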



The boxplots show that every predictor except Mg contains outliers. Although the histograms already hint at skewness, let's quantify it for each variable.

(skewValues <- apply(glass_cor, 2, skewness))
##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107

A negative skewness value (Mg and Si) indicates that the tail is on the left side of the distribution, while a positive value indicates a tail on the right; a value of zero indicates a symmetric distribution. Na is closest to symmetric, whereas K (6.46), Ba (3.37), and Ca (2.02) are strongly right-skewed, suggesting those predictors could benefit from a transformation.

  c. Are there any relevant transformations of one or more predictors that might improve the classification model?

Centering and scaling the predictors helps mitigate the influence of their different scales and units, so that each feature contributes comparably to a given model and no single variable disproportionately drives the results. In addition, a Box-Cox transformation was applied to reduce skewness, and the spatial sign transformation was used to rein in the numerous outliers in the data.
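
As a sketch of what Box-Cox does to a single predictor, caret's BoxCoxTrans() estimates the transformation parameter lambda from the data; Ca is a natural candidate here because it is strongly right-skewed and strictly positive (caret only estimates Box-Cox for positive-valued predictors):

# Estimate and apply a Box-Cox transformation for Ca alone
ca_bc <- BoxCoxTrans(Glass$Ca)
ca_bc
head(predict(ca_bc, Glass$Ca))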

After applying these transformations (code below) and revisiting the histograms and skewness measurements, most of the predictors are much closer to symmetric, although Ba, which is mostly zeros, remains right-skewed. The boxplots likewise show that many of the outliers have been tamed.

glass_trans <- Glass %>%
  select(!Type)
trans <- preProcess(glass_trans, method = c("BoxCox", "center", "scale", "spatialSign"))
preprocessed_data <- predict(trans, newdata = glass_trans)
# Gather the data into a long format
glass_preprocessed_data <- preprocessed_data %>%
  gather()


ggplot(glass_preprocessed_data, aes(value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~key, scales = 'free') +
  labs(title = "Histogram of Transformed Variables in Data Set Glass")

(skewValues <- apply(preprocessed_data, 2, skewness))
##          RI          Na          Mg          Al          Si           K 
##  0.52441071 -0.00285868 -0.76629519 -0.02455364 -0.40553550  0.43586824 
##          Ca          Ba          Fe 
##  0.23207681  2.01690834  0.87338104
ggplot(glass_preprocessed_data, aes(value)) +
  geom_boxplot() +
  facet_wrap(~key, scales = 'free') +
  labs(title = "Boxplot of Transformed Variables in Data Set Glass")

Question 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

data(Soybean)
str(Soybean)
  a. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
# Gather the data into a long format
Soybean_bar <- Soybean %>%
  select(!Class) %>%
  gather()


ggplot(Soybean_bar, aes(value)) +
  geom_bar() +
  facet_wrap(~key) +
  labs(title = "Frequency Distributions of Variables in Data Set Soybean")



A degenerate distribution, in probability theory, is one in which all of the probability mass is concentrated at a single point. Several variables in this data set come close, with most of their observations falling in a single category; mycelium and sclerotia are the most prominent examples.
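
The chapter's concrete diagnostic for this is the near-zero-variance check, which caret implements as nearZeroVar(); a sketch with the default cutoffs (freqCut = 95/5, uniqueCut = 10), which should flag mycelium and sclerotia among the nearly constant predictors:

# Predictors flagged as near-zero variance
soy_pred <- Soybean %>% select(!Class)
nzv_metrics <- nearZeroVar(soy_pred, saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, ]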

  b. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
# Create a dataframe with missing value information
missing_data <- data.frame(
  variable = names(Soybean),
  missing_count = colSums(is.na(Soybean))
)

# Create the plot
ggplot(missing_data, aes(x = variable, y = missing_count)) +
  geom_bar(stat = "identity", fill = "skyblue", width = 0.5) +
  labs(title = "Missing Values in Each Variable",
       x = "Variable", y = "Number of Missing Values") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))



The predictors most likely to be missing are hail, sever, seed.tmt, and lodging.

(na.count <- Soybean %>%
  filter(rowSums(is.na(.)) > 0) %>%
  group_by(Class) %>%
  summarise(across(everything(), ~ sum(is.na(.x)))))
## # A tibble: 5 × 36
##   Class    date plant.stand precip  temp  hail crop.hist area.dam sever seed.tmt
##   <fct>   <int>       <int>  <int> <int> <int>     <int>    <int> <int>    <int>
## 1 2-4-d-…     1          16     16    16    16        16        1    16       16
## 2 cyst-n…     0          14     14    14    14         0        0    14       14
## 3 diapor…     0           6      0     0    15         0        0    15       15
## 4 herbic…     0           0      8     0     8         0        0     8        8
## 5 phytop…     0           0      0     0    68         0        0    68       68
## # ℹ 26 more variables: germ <int>, plant.growth <int>, leaves <int>,
## #   leaf.halo <int>, leaf.marg <int>, leaf.size <int>, leaf.shread <int>,
## #   leaf.malf <int>, leaf.mild <int>, stem <int>, lodging <int>,
## #   stem.cankers <int>, canker.lesion <int>, fruiting.bodies <int>,
## #   ext.decay <int>, mycelium <int>, int.discolor <int>, sclerotia <int>,
## #   fruit.pods <int>, fruit.spots <int>, seed <int>, mold.growth <int>,
## #   seed.discolor <int>, seed.size <int>, shriveling <int>, roots <int>

The missing values are confined to just five of the nineteen classes.
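
To make the class pattern explicit, the share of incomplete rows per class can be computed directly; a quick sketch:

# Proportion of rows containing at least one NA, by class
Soybean %>%
  mutate(incomplete = !complete.cases(.)) %>%
  group_by(Class) %>%
  summarise(n = n(), prop_incomplete = mean(incomplete)) %>%
  arrange(desc(prop_incomplete))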

  c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

I chose to impute the missing values: each categorical predictor is converted to its underlying numeric code, and every NA is replaced with the median of its column.

Soybean_trans <- Soybean %>%
  select(!Class) %>%
  mutate(across(everything(), as.numeric))

# Replace each NA with the median of its numeric-coded column
for (i in 1:ncol(Soybean_trans)) {
  Soybean_trans[, i][is.na(Soybean_trans[, i])] <- median(Soybean_trans[, i], na.rm = TRUE)
}

Checking to make sure NA values are replaced.
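
The quickest confirmation is a direct count of remaining NAs:

sum(is.na(Soybean_trans))  # expect 0 after median imputation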

preprocessed_Soybean_bar <- Soybean_trans %>%
  gather()


ggplot(preprocessed_Soybean_bar, aes(value)) +
  geom_bar() +
  facet_wrap(~key) +
  labs(title = "Frequency Distributions of Imputed Variables in Data Set Soybean")
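
Since mice is already loaded, a model-based alternative to median imputation is multiple imputation. Below is a minimal sketch, assuming the defaults are acceptable: m = 1 keeps a single completed data set, the seed is an arbitrary choice, and polyreg is mice's default method for unordered factors (this can be slow with 35 predictors):

# Multiple imputation on the predictors, excluding the outcome as above
imp <- mice(Soybean %>% select(!Class), m = 1, maxit = 5, seed = 123, printFlag = FALSE)
Soybean_mice <- complete(imp)
sum(is.na(Soybean_mice))  # expect 0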