Homework 4 - Data Preprocessing/Overfitting

Name: Charles Ugigabe.

Date: 10/14/23

library(mlbench)
library(tidyverse)
library(corrplot)

Question 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Solution 3.1

data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 15, aes(x = value, y = after_stat(density)), fill = "gray", color = "#e9ecef") +
  geom_density(aes(x = value), color = 'red', lwd = 0.8) +
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms of Numerical Predictors")

Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors")

Numeric Data

Al - right skewed
Ba - right skewed, outlier
Ca - right skewed, outlier
Fe - right skewed, outlier
K - right skewed, outlier, bimodal
Mg - left skewed, bimodal
Na - close to normal
RI - right skewed
Si - left skewed
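
The visual reads above can be cross-checked numerically. The following is a minimal sketch, assuming a moment-based definition of sample skewness; the function name sample_skewness and the toy vectors are hypothetical and not part of the original solution.

```r
# Moment-based sample skewness: third central moment over sd cubed.
# Positive values suggest a right skew, negative values a left skew.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  centered <- x - mean(x)
  (sum(centered^3) / n) / (sum(centered^2) / n)^1.5
}

# Hypothetical toy vectors, used only to show the sign convention.
right_tailed <- c(1, 1, 1, 2, 2, 3, 10)
left_tailed  <- -right_tailed

sample_skewness(right_tailed)  # positive
sample_skewness(left_tailed)   # negative

# Applied to the Glass predictors (requires the mlbench package).
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  numeric_cols <- Glass[sapply(Glass, is.numeric)]
  print(round(sort(sapply(numeric_cols, sample_skewness)), 2))
}
```

Values near zero would be consistent with a roughly symmetric predictor, while large positive or negative values would support the skew noted from the histograms.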

Glass %>%
  keep(is.numeric) %>%
  cor() %>%
  corrplot()

Glass %>%
  ggplot() +
  geom_bar(aes(x = Type)) +
  ggtitle("Distribution of Types of Glass")

Correlation Matrix

Strong negative relationships:
Ca / Mg; RI / Si; RI / Al; Mg / Al; Mg / Ba
Strong positive relationships:
Ca / RI; K / Al; Al / Ba; Na / Ba
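
Reading pair strengths off a correlation plot is approximate, so it can help to list the pairs by absolute correlation. This is a rough sketch using base R only; rank_correlations and the toy data frame are hypothetical names introduced for illustration.

```r
# List every predictor pair once, ordered by absolute Pearson correlation.
rank_correlations <- function(df) {
  cm  <- cor(df)
  idx <- which(upper.tri(cm), arr.ind = TRUE)   # upper triangle: each pair once
  pairs <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  pairs[order(-abs(pairs$r)), ]
}

# Hypothetical toy data: a and c are perfectly negatively correlated.
toy <- data.frame(a = 1:10, b = (1:10)^2, c = 10:1)
ranked <- rank_correlations(toy)
print(ranked)

# Applied to the Glass predictors (requires the mlbench package).
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(head(rank_correlations(Glass[sapply(Glass, is.numeric)]), 10))
}
```

The sign of r separates the negative from the positive relationships, and the magnitude shows which of the listed pairs are actually the strongest.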

There seem to be outliers in Ba, K, RI, Ca, and Fe. As mentioned earlier, Al, Ba, Ca, Fe, K, and RI are right skewed, while Mg and Si are left skewed.

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

Some of the variables in the data set appear to contain outliers: Ba, Ca, Fe, K, Mg, and Na each have observations that lie far from the rest of their values. Several predictors also have skewed distributions: Ca, Ba, Na, and RI are right skewed, while Mg and Si are left skewed.

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

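A Box-Cox transformation would help reduce the skewness of the predictors and could improve the classification model. Because Box-Cox requires strictly positive values, predictors that contain zeros (such as Ba and Fe in the str() output above) would first need a small offset or an alternative such as the Yeo-Johnson transformation. For the predictors with outliers, we can use a log, square root, or spatial sign transformation.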
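
Two of the transformations mentioned above can be sketched in base R. This is an illustration under stated assumptions, not the method used for the homework: spatial_sign and the toy data are hypothetical, and log1p() is swapped in for a plain log so that zero percentages stay finite.

```r
# Spatial sign: center and scale each predictor, then project every
# sample onto the unit sphere so extreme rows are pulled toward the bulk.
spatial_sign <- function(df) {
  z <- scale(as.matrix(df))        # center and scale the columns
  norms <- sqrt(rowSums(z^2))      # Euclidean length of each row
  norms[norms == 0] <- 1           # leave all-zero rows unchanged
  z / norms                        # divides row i by norms[i]
}

# Hypothetical toy data with one extreme sample in x.
toy <- data.frame(x = c(1, 2, 3, 4, 100), y = c(2, 1, 4, 3, 0))
ss <- spatial_sign(toy)
rowSums(ss^2)                      # every row now has unit length

# log1p(x) = log(1 + x) stays finite when a percentage is exactly zero.
log1p(c(0, 0.11, 0.26))
```

The spatial sign acts on all predictors jointly, so it would be applied to the whole numeric predictor set rather than to one column at a time.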

Question 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

data(Soybean)
Soybean %>%
  select(!Class) %>%
  drop_na() %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_bar() +
  facet_wrap(~ key)

## Warning: attributes are not identical across measure variables; they will be
## dropped

Degenerate distributions are those in which the variable takes one value almost all of the time and the other values occur at a very low rate. Here, ‘mycelium’, ‘sclerotia’, and ‘roots’ appear to be degenerate.
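
The definition above can be turned into a rough numeric check. The sketch below is a base R approximation of the usual rule of thumb; is_degenerate and its thresholds (freq_cut = 20, unique_cut = 10) are assumptions chosen for illustration, and caret::nearZeroVar offers a packaged version of the same idea.

```r
# Flag a predictor when its most common value dominates the runner-up
# and only a small share of distinct values is present.
is_degenerate <- function(x, freq_cut = 20, unique_cut = 10) {
  counts <- as.vector(table(x))                  # table() ignores NA
  counts <- sort(counts[counts > 0], decreasing = TRUE)
  if (length(counts) < 2) return(TRUE)           # at most one observed value
  freq_ratio     <- counts[1] / counts[2]
  percent_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && percent_unique < unique_cut
}

# Hypothetical toy factors.
mostly_one <- factor(c(rep("0", 98), rep("1", 2)))
balanced   <- factor(rep(c("a", "b"), 50))
is_degenerate(mostly_one)   # TRUE: 98 vs 2 gives a ratio of 49
is_degenerate(balanced)     # FALSE: the two levels are equally common

# Applied to the Soybean predictors (requires the mlbench package).
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  flags <- sapply(Soybean[names(Soybean) != "Class"], is_degenerate)
  print(names(flags)[flags])
}
```

Any predictor flagged this way is a candidate for removal before modeling, although the thresholds are a judgment call.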

(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Soybean %>%
  mutate(across(everything(), is.na)) %>%
  pivot_longer(everything(), names_to = "variables", values_to = "missing") %>%
  count(variables, missing) %>%
  ggplot(aes(y = variables, x = n, fill = missing)) +
  geom_col(position = "fill") +
  labs(title = "Proportion of Missing Values",
       x = "Proportion") +
  scale_fill_manual(values = c("grey", "violet"))

Soybean %>%
  group_by(Class) %>%
  mutate(class_Total = n()) %>%
  ungroup() %>%
  filter(!complete.cases(.)) %>%
  group_by(Class) %>%
  mutate(Missing = n(),
         Proportion =  Missing / class_Total) %>% 
  ungroup()%>%
  select(Class, Proportion) %>%
  distinct()

## # A tibble: 5 × 2
##   Class                       Proportion
##   <fct>                            <dbl>
## 1 phytophthora-rot                 0.773
## 2 diaporthe-pod-&-stem-blight      1    
## 3 cyst-nematode                    1    
## 4 2-4-d-injury                     1    
## 5 herbicide-injury                 1

There does seem to be a pattern: the rows with missing data are concentrated in certain classes. Once those five classes are removed from the data, no missing data remain.
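
To see which predictors are missing within each class, the share of missing values can be tabulated per class and per predictor. This is a minimal base R sketch; missing_by_class and the toy data frame are hypothetical names used only for illustration.

```r
# Share of missing values for every predictor, computed within each class.
missing_by_class <- function(df, class_col) {
  predictors <- setdiff(names(df), class_col)
  na_flags   <- as.data.frame(is.na(df[predictors]))
  aggregate(na_flags, by = list(class = df[[class_col]]), FUN = mean)
}

# Hypothetical toy data: class "b" is missing far more often than "a".
toy <- data.frame(
  label = c("a", "a", "b", "b"),
  p1    = c(1, NA, NA, NA),
  p2    = c(1, 2, 3, NA)
)
res <- missing_by_class(toy, "label")
print(res)

# Applied to the Soybean data (requires the mlbench package).
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  by_class <- missing_by_class(Soybean, "Class")
  # Keep only the classes that have at least one missing value.
  print(by_class[rowSums(by_class[-1]) > 0, 1:6])
}
```

A row of ones for a class would mean that a predictor is never recorded for that class, which points to missingness that is tied to the class rather than random.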

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Imputation is one way to account for missing data without discarding it altogether. With this strategy the missing entries can be filled with a summary of the column, such as the maximum, minimum, mean, or median. For numeric columns, filling in the missing data with the column mean would be the simplest approach; because the Soybean predictors are categorical, the analogous fill value is the most frequent level (the mode) of each column.
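
Since the Soybean predictors are stored as factors, a mean cannot be computed for them directly. The sketch below therefore swaps in the most frequent level as the fill value; impute_mode is a hypothetical helper written for factor or character columns, not a function from any package.

```r
# Replace NA entries of a factor or character vector with the most
# frequent observed value (ties go to the first level in table order).
impute_mode <- function(x) {
  counts <- table(x)                     # table() skips NA by default
  if (length(counts) == 0 || all(counts == 0)) return(x)  # nothing observed
  mode_value <- names(counts)[which.max(counts)]
  x[is.na(x)] <- mode_value
  x
}

# Hypothetical toy factor with two missing entries.
f <- factor(c("a", "b", "a", NA, "a", NA))
filled <- impute_mode(f)
print(filled)

# Applied to every Soybean predictor (requires the mlbench package).
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  predictors <- names(Soybean) != "Class"
  soy_imputed <- Soybean
  soy_imputed[predictors] <- lapply(Soybean[predictors], impute_mode)
  print(sum(is.na(soy_imputed)))
}
```

Mode imputation ignores the class structure of the missingness, so it is a baseline; a model-based approach such as K-nearest neighbors could be compared against it.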