library(cowplot)
library(psych)
library(MASS)
library(gridExtra)
library(tidyr)
library(mlbench)
library(dplyr)
library(ggplot2)
library(tsibble)
library(corrplot)
3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
#(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
corrplot(cor(Glass %>% dplyr::select(-Type)), type = "lower")
Glass %>%
  dplyr::select(-Type) %>%
  gather() %>%
  ggplot(aes(x = value)) +
  geom_histogram(fill = "red") +
  facet_wrap(~key, scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Type was excluded since it is non-numeric.
Al: mildly right skewed. Ba: unimodal, strongly right skewed, heavily concentrated around 0. Ca: right skewed, outliers present. Fe: unimodal, strongly right skewed, heavily concentrated around 0. K: not normally distributed, with outliers and a concentration around 0. Mg: left skewed, non-normal. Na: roughly normal with outliers. RI: right skewed with outliers. Si: slightly left skewed with outliers.
Ca and RI show a strong positive correlation, and Si and RI have the strongest negative correlation.
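As a quick numeric check on the skewness described above, the already-loaded psych package reports a skew column through describe(); a minimal sketch:
# Skewness per predictor; the skew column should match the histogram shapes.
describe(Glass %>% dplyr::select(-Type))[, c("mean", "sd", "skew")]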
#Do there appear to be any outliers in the data? Are any predictors skewed?
Glass %>%
  dplyr::select(-Type) %>%
  gather() %>%
  ggplot(aes(x = value)) +
  geom_boxplot() +
  facet_wrap(~key, scales = "free")
I mention the outliers above, but it’s worth visualizing again to note that all predictors except Mg have outliers.
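To quantify this, a minimal sketch counting points beyond the 1.5 * IQR whisker rule that geom_boxplot() uses by default:
Glass %>%
  dplyr::select(-Type) %>%
  gather() %>%
  group_by(key) %>%
  # flag values beyond the boxplot whiskers
  summarise(n_outliers = sum(value < quantile(value, 0.25) - 1.5 * IQR(value) |
                               value > quantile(value, 0.75) + 1.5 * IQR(value)))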
#Are there any relevant transformations of one or more predictors that might improve the classification model?
Depending on where we set our cutoff for correlation, we could remove RI most easily, given its correlation with Ca, Si, and Al. We could also, in an effort to treat the skewness and the concentration at 0, apply a Box-Cox transformation.
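A sketch of how that could look with caret’s preProcess (caret is not loaded above, so this is an assumption about the toolchain). Box-Cox requires strictly positive values, so zero-heavy predictors such as Ba and Fe would be skipped; Yeo-Johnson is an alternative that tolerates zeros.
library(caret)
# Estimate a Box-Cox lambda per predictor; columns containing zeros are skipped.
pp <- preProcess(Glass %>% dplyr::select(-Type), method = "BoxCox")
glass_trans <- predict(pp, Glass %>% dplyr::select(-Type))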
3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
#(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
library(inspectdf)
data(Soybean)
?Soybean
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
# Removing Class because it crowds out the other panels
cat_vars <- Soybean %>%
  select_if(is.factor) %>%
  select(-Class)
cat_vars_long <- cat_vars %>%
  gather(key = "variable", value = "value")
## Warning: attributes are not identical across measure variables; they will be
## dropped
ggplot(cat_vars_long, aes(x = value)) +
  geom_bar(fill = "purple", color = "black") +
  facet_wrap(~variable, scales = "free_x") +
  labs(x = "Categories", y = "Count", title = "Frequency Distribution of Categorical Predictors") +
  theme_minimal() +
  theme(axis.text.x = element_text(hjust = 1), axis.text.y = element_text(size = 6))
Examining the above output, we are looking for low variability when trying to identify degenerate distributions. Mycelium seems to fit, with almost all records falling into the 0 category unless missing. Sclerotia also fits this description.
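A programmatic check, as a sketch assuming caret is available: nearZeroVar() flags predictors whose dominant level swamps the rest, which is exactly the degeneracy described in the chapter.
library(caret)
# saveMetrics = TRUE returns frequency ratios and flags instead of column indices.
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ]  # the near-zero-variance (degenerate) predictors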
#(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
library(naniar)
vis_miss(Soybean)
9.5% of the cells in the dataset are missing (the question’s “roughly 18%” most likely counts incomplete rows rather than individual cells). hail and sever seem to be missing the most, along with lodging, all at 17.7%. Class is missing 0%. It is odd that some predictors are missing quite a bit while others are missing none. Additionally, the predictors missing the most all share the same percentage of missing values, suggesting they are missing for the same records and hinting at a pattern or relationship between those predictors, and possibly with the classes.
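naniar (already loaded) can check that relationship directly; a quick sketch:
miss_var_summary(Soybean)              # percent missing per variable
gg_miss_fct(x = Soybean, fct = Class)  # % missing per predictor, broken down by class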
#(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
Given the large number of predictors and the relatively low proportion of missing data, I would probably recommend imputation instead of removal. For binary variables like “hail” or “lodging,” imputing “no” makes sense, while for other categorical variables I would use the mode. In cases where imputing a value could distort the data, adding an “unknown” level could help; since the missingness appears tied to particular classes, an explicit missing level would preserve that signal rather than erase it.
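A minimal sketch of the mode-imputation piece (impute_mode is a helper written here, not a package function):
# Replace NAs in each factor with that factor's most frequent level.
impute_mode <- function(x) {
  x[is.na(x)] <- names(which.max(table(x)))
  x
}
soy_imputed <- Soybean %>%
  mutate(across(-Class, impute_mode))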
3.3.
Kuhn, Max; Johnson, Kjell. Applied Predictive Modeling (p. 59). Springer New York. Kindle Edition.