Homework 4 Data 624

3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

library(corrplot)
library(GGally)
library(tidyverse)

corr_matrix <- cor(Glass[, 1:9])

# Visualize the correlation matrix as a heatmap
corrplot(corr_matrix, method = "color", addCoef.col = "black", tl.cex = 0.8)

# Histogram of each numeric predictor, one panel per variable
Glass %>%
  select_if(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_histogram(bins = 15) + 
  facet_wrap(~key, scales = 'free')

# Box plot of each predictor (Type excluded), one panel per variable
Glass %>%
  select(!Type) %>% 
  gather() %>% 
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free')

The heatmap shows that RI (refractive index) and Ca (calcium) have the strongest positive correlation (about 0.81). The strongest negative correlation is between RI and Si (silicon), at about -0.54, while the correlation between RI and Mg (magnesium) is weak.
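Reading extremes off a heatmap by eye is error-prone, so the pairs can also be ranked numerically. The following is a minimal base R sketch; the names `idx` and `corr_pairs` are introduced here for illustration and are not part of the original analysis.

```r
library(mlbench)
data(Glass)

# Pairwise Pearson correlations among the nine numeric predictors
corr_matrix <- cor(Glass[, 1:9])

# Keep each pair once by reading only the upper triangle
idx <- which(upper.tri(corr_matrix), arr.ind = TRUE)
corr_pairs <- data.frame(
  var1 = rownames(corr_matrix)[idx[, "row"]],
  var2 = colnames(corr_matrix)[idx[, "col"]],
  r    = corr_matrix[idx]
)

# Sort from most negative to most positive
corr_pairs <- corr_pairs[order(corr_pairs$r), ]
head(corr_pairs, 3)   # most negative pairs
tail(corr_pairs, 3)   # most positive pairs
```

Sorting by `abs(corr_pairs$r)` instead would rank the pairs by strength regardless of sign, which separates "weakest" from "most negative".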

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

Most predictors are right-skewed to some degree. The exceptions are Mg (magnesium) and Si (silicon); of the two, Si is the closer to a normal distribution. The box plots show outliers in every predictor except Mg.
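The visual impression can be backed with numbers. The sketch below computes a moment-based skewness and counts points beyond 1.5 times the interquartile range for each predictor. The helper names `skewness` and `count_outliers` are defined here for illustration, and the outlier rule only approximates what the box plots draw, since quartile conventions differ slightly between functions.

```r
library(mlbench)
data(Glass)

predictors <- Glass[, 1:9]

# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Count points more than 1.5 * IQR below the first or above the third quartile
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

summary_table <- data.frame(
  skewness = sapply(predictors, skewness),
  outliers = sapply(predictors, count_outliers)
)

# Most right-skewed predictors first
summary_table[order(-summary_table$skewness), ]
```

Positive skewness values indicate a longer right tail, negative values a longer left tail, and values near zero a roughly symmetric distribution.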

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

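The Box-Cox transformation suits the right-skewed predictors: it reduces skewness, helps stabilize variance, and brings the distribution closer to normal. It requires strictly positive values, so it cannot be applied directly to predictors that contain zeros, such as Ba and Fe. The spatial sign transformation is useful for predictors with evident outliers: after centering and scaling, it projects each sample onto the unit sphere, which limits the influence of extreme values.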
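Both transformations are available through `preProcess()` in the caret package. The sketch below is one possible way to apply them, assuming caret is installed; the object names `pp` and `transformed` are illustrative. As far as I understand caret's behavior, the Box-Cox step is estimated only for predictors whose values are all positive, so predictors containing zeros pass through that step unchanged.

```r
library(mlbench)
library(caret)
data(Glass)

predictors <- Glass[, 1:9]

# Estimate Box-Cox lambdas, then center, scale, and apply the spatial sign.
# caret applies these steps in its own fixed order, with spatialSign last.
pp <- preProcess(predictors,
                 method = c("BoxCox", "center", "scale", "spatialSign"))
pp

transformed <- predict(pp, predictors)

# After the spatial sign step every sample should have unit Euclidean norm
summary(sqrt(rowSums(transformed^2)))
```

For predictors with zeros, `method = "YeoJohnson"` is an alternative that accepts zero and negative values. Whether either transformation actually improves the classifier would still need to be checked with resampling.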

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

library(mlbench)
data(Soybean)

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

A degenerate distribution is one in which a single value accounts for nearly all observations, so the remaining categories have very few or no observations.

# Bar chart of each categorical predictor; note that drop_na() keeps only complete rows
Soybean %>%
  select(!Class)%>%  
  drop_na() %>%  
  gather() %>% 
  ggplot(aes(value)) +  
  geom_bar() +  
  facet_wrap(~ key) +
  labs(title="Soybean")

In our case, ‘mycelium’, ‘sclerotia’, and ‘roots’ appear degenerate in the plot above.
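The visual check can be complemented with caret's `nearZeroVar()`, which flags predictors using two metrics: the ratio of the most common to the second most common value, and the percentage of distinct values. This is a sketch assuming caret is installed; `nzv_metrics` is an illustrative name, and the result depends on the chosen cutoffs, so it need not match the visual assessment exactly.

```r
library(mlbench)
library(caret)
data(Soybean)

# freqRatio: count of the most common value / count of the second most common
# percentUnique: distinct values as a percentage of the number of samples
# Column 1 is the outcome (Class), so it is excluded
nzv_metrics <- nearZeroVar(Soybean[, -1], saveMetrics = TRUE)

# Predictors flagged under caret's default cutoffs (freqCut = 95/5, uniqueCut = 10)
nzv_metrics[nzv_metrics$nzv, ]
```

Lowering `freqCut` makes the check stricter and would flag more predictors.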

(b) Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

missing_percentage <- Soybean %>%
  summarise_all(~mean(is.na(.)) * 100) %>%
  gather(key = "predictor", value = "missing_percent") %>%
  arrange(desc(missing_percent))

print(missing_percentage)

##          predictor missing_percent
## 1             hail      17.7159590
## 2            sever      17.7159590
## 3         seed.tmt      17.7159590
## 4          lodging      17.7159590
## 5             germ      16.3982430
## 6        leaf.mild      15.8125915
## 7  fruiting.bodies      15.5197657
## 8      fruit.spots      15.5197657
## 9    seed.discolor      15.5197657
## 10      shriveling      15.5197657
## 11     leaf.shread      14.6412884
## 12            seed      13.4699854
## 13     mold.growth      13.4699854
## 14       seed.size      13.4699854
## 15       leaf.halo      12.2986823
## 16       leaf.marg      12.2986823
## 17       leaf.size      12.2986823
## 18       leaf.malf      12.2986823
## 19      fruit.pods      12.2986823
## 20          precip       5.5636896
## 21    stem.cankers       5.5636896
## 22   canker.lesion       5.5636896
## 23       ext.decay       5.5636896
## 24        mycelium       5.5636896
## 25    int.discolor       5.5636896
## 26       sclerotia       5.5636896
## 27     plant.stand       5.2708638
## 28           roots       4.5387994
## 29            temp       4.3923865
## 30       crop.hist       2.3426061
## 31    plant.growth       2.3426061
## 32            stem       2.3426061
## 33            date       0.1464129
## 34        area.dam       0.1464129
## 35           Class       0.0000000
## 36          leaves       0.0000000

# Create a barplot using ggplot2
ggplot(missing_percentage, aes(x = reorder(predictor, -missing_percent), y = missing_percent)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme_minimal() +
  labs(title = "Percentage of Missing Data by Predictor",
       x = "Predictor",
       y = "Missing Percentage (%)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Count the rows that have at least one missing value, by class
Soybean %>%
  filter_all(any_vars(is.na(.))) %>%
  select(Class) %>%
  group_by(Class) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

## # A tibble: 5 × 2
##   Class                       count
##   <fct>                       <int>
## 1 phytophthora-rot               68
## 2 2-4-d-injury                   16
## 3 diaporthe-pod-&-stem-blight    15
## 4 cyst-nematode                  14
## 5 herbicide-injury                8
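The table above shows that rows with missing values occur in only five classes, and the counts sum to 121 of 683 rows (about 17.7%). It does not show how large each class is, which matters for judging whether a class is partly or entirely affected. The base R sketch below puts the two side by side; `incomplete` and `by_class` are illustrative names.

```r
library(mlbench)
data(Soybean)

# TRUE for rows that contain at least one missing value
incomplete <- !complete.cases(Soybean)

# Per class: total rows, rows with missing values, and the share affected
by_class <- data.frame(
  total      = as.vector(table(Soybean$Class)),
  incomplete = as.vector(tapply(incomplete, Soybean$Class, sum)),
  row.names  = levels(Soybean$Class)
)
by_class$share <- by_class$incomplete / by_class$total

# Show only the classes that have incomplete rows
by_class[by_class$incomplete > 0, ]
```

A share close to 1 would mean that dropping incomplete rows removes that class almost entirely, which is worth knowing before choosing between deletion and imputation.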

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

I would use mode imputation to handle the missing data in this dataset. Since most of the predictors are categorical, each missing value can be filled with the most frequent category of its predictor. The approach is simple and avoids discarding rows or columns, so the full data set and its categorical structure are kept for the analysis that follows.
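Mode imputation as described can be sketched in base R as follows. The helpers `mode_level` and `impute_mode` and the object `soy_imputed` are names introduced for this illustration. The sketch imputes every predictor with its overall mode and leaves the outcome `Class` untouched.

```r
library(mlbench)
data(Soybean)

# Most frequent non-missing level of a factor (ties resolve to the first level)
mode_level <- function(x) {
  names(which.max(table(x)))
}

# Replace NAs in one factor column with its modal level
impute_mode <- function(x) {
  x[is.na(x)] <- mode_level(x)
  x
}

soy_imputed <- Soybean
predictor_cols <- setdiff(names(Soybean), "Class")
soy_imputed[predictor_cols] <- lapply(Soybean[predictor_cols], impute_mode)

# Count the missing values that remain after imputation
sum(is.na(soy_imputed))
```

One design caveat: because the missing values are concentrated in a few classes, filling them with the overall mode pulls those classes toward the majority pattern. Computing the mode within each class is a possible variation, though it uses the outcome and would have to be done inside resampling to avoid leakage.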

Nikoleta Emanouilidi

2024-09-24
