R Markdown

The data can be accessed via:

library(AppliedPredictiveModeling)
library(mlbench)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(corrplot)
## corrplot 0.94 loaded
library(purrr)
library(tidyr)

a. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
#Create a heat map with corrplot to see whether there is any correlation between the predictors

Glass %>%
  keep(is.numeric) %>%
  cor() %>%
  corrplot() 

#Basic bar chart to see the distribution of glass types

Glass %>%
  ggplot() +
  geom_bar(aes(x = Type)) +
  ggtitle("Glass Distributions")

#Histogram of each numeric predictor

Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_histogram(bins = 15) + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms of Predictors")

#Box-and-whisker plot of each numeric predictor

Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Predictors")

It appears that some predictors are more normally distributed than others (e.g., Al, Ca), while the remainder are skewed. Looking at the correlation plot, Ca and RI have the strongest positive correlation among all the predictors, while RI-Si, Al-Mg, Ca-Mg, and Ba-Mg show large negative correlations. From the bar plot, types 2, 1, and 7 have the largest counts, meaning the data is heavily concentrated in those classes.
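To double-check the reading of the correlation plot, the pairwise correlations can also be ranked numerically. A minimal sketch (the top-five cutoff is just an arbitrary choice for illustration):

cor_mat <- Glass %>%
  keep(is.numeric) %>%
  cor()

# Keep only one copy of each pair, then sort by absolute correlation
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA
as.data.frame(as.table(cor_mat)) %>%
  drop_na() %>%
  rename(correlation = Freq) %>%
  arrange(desc(abs(correlation))) %>%
  head(5)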

b. Do there appear to be any outliers in the data? Are any predictors skewed?

Yes, there appear to be outliers, namely in the K, Fe, Na, Ba, and RI predictors. Most of the predictors are skewed as well: Al, Ca, Fe, K, and Ba are right-skewed, while Mg and Si are left-skewed.
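To back up the skewness reading, per-predictor skewness statistics can be computed directly. A quick sketch using skewness() from the e1071 package (assumed to be installed):

library(e1071)

# Apply skewness() to each numeric predictor and sort from most
# right-skewed to most left-skewed
Glass %>%
  keep(is.numeric) %>%
  map_dbl(skewness) %>%
  sort(decreasing = TRUE)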

c. Are there any relevant transformations of one or more predictors that might improve the classification model?

The Box-Cox transformation may be useful since a good number of the predictors are right-skewed. It helps stabilize the variance and reduce the skew so that the data are closer to normally distributed. Because some of the predictors also contain outliers, the spatial sign transformation would be helpful to minimize the impact of those outliers on the data.
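As a sketch of how this might look with caret's preProcess() (assuming the caret package is available; note that Box-Cox only applies to strictly positive predictors, so columns containing zeros such as Ba and Fe are skipped, and the spatial sign is applied after centering and scaling):

library(caret)

# Estimate Box-Cox, centering/scaling, and spatial sign transformations
# on the numeric predictors (Type, the outcome, is excluded)
glass_pp <- preProcess(Glass %>% keep(is.numeric),
                       method = c("BoxCox", "center", "scale", "spatialSign"))
glass_trans <- predict(glass_pp, Glass %>% keep(is.numeric))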

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

a. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

data(Soybean)

Soybean %>%
  select(!Class) %>%
  drop_na() %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_bar() +
  facet_wrap(~ key)
## Warning: attributes are not identical across measure variables; they will be
## dropped

A degenerate distribution is one in which the variable takes primarily a single value and the other values occur at very low rates. From the bar charts, mycelium, sclerotia, and roots appear to be degenerate.
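One way to check this more formally is caret's nearZeroVar(), which flags predictors with near-zero variance. A sketch (assuming caret is installed):

library(caret)

# Frequency ratio and percent-unique metrics for each predictor;
# rows with nzv = TRUE are the degenerate candidates
nzv_metrics <- nearZeroVar(Soybean %>% select(!Class), saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, ]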

c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

As a strategy, I would use mode imputation to handle the missing data in this set. Because the majority of predictors are categorical, we can fill in missing values with the most frequent category of each predictor. This approach avoids losing too much data by removing rows or columns, and it preserves the dataset's structure and the integrity of the categorical variables while eliminating the gaps in the data.
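A minimal sketch of this idea (the impute_mode helper is hypothetical, written here only for illustration):

# Hypothetical helper: replace NAs in a factor with its most frequent level
impute_mode <- function(x) {
  mode_level <- names(which.max(table(x)))
  x[is.na(x)] <- mode_level
  x
}

Soybean_imputed <- Soybean %>%
  mutate(across(-Class, impute_mode))

# No missing values should remain in the predictors
sum(is.na(Soybean_imputed))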