library(cowplot)
library(psych)
library(MASS)
library(gridExtra)
library(tidyr)
library(mlbench)
library(dplyr)
library(ggplot2)
library(tsibble)
library(corrplot)
3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
#(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
corrplot(cor(Glass %>% dplyr::select(-Type)), type = "lower")
Glass %>%
  dplyr::select(-Type) %>%
  gather() %>%
  ggplot(aes(x = value)) +
  geom_histogram(fill = "red") +
  facet_wrap(~key, scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Type was excluded since it is non-numeric.
Al: mildly right skewed. Ba: unimodal, strongly right skewed, heavily concentrated around 0. Ca: right skewed, outliers present. Fe: unimodal, strongly right skewed, heavily concentrated around 0. K: not normally distributed, with outliers and a concentration around 0. Mg: left skewed, non-normal. Na: roughly normal with outliers. RI: right skewed with outliers. Si: slightly left skewed with outliers.
Ca and RI show a strong positive correlation, and Si and RI have the strongest negative correlation.
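As a quick numeric check on the skewness described above, the already-loaded psych package reports a skew column through describe(); a minimal sketch:
# Skewness per predictor; the skew column should match the histogram shapes.
describe(Glass %>% dplyr::select(-Type))[, c("mean", "sd", "skew")]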
#Do there appear to be any outliers in the data? Are any predictors skewed?
Glass %>%
  dplyr::select(-Type) %>%
  gather() %>%
  ggplot(aes(x = value)) +
  geom_boxplot() +
  facet_wrap(~key, scales = "free")
I mention the outliers above, but it’s worth visualizing again to note that all predictors except Mg have outliers.
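To quantify this, a minimal sketch counting points beyond the 1.5 * IQR whisker rule that geom_boxplot() uses by default:
Glass %>%
  dplyr::select(-Type) %>%
  gather() %>%
  group_by(key) %>%
  # flag values beyond the boxplot whiskers
  summarise(n_outliers = sum(value < quantile(value, 0.25) - 1.5 * IQR(value) |
                               value > quantile(value, 0.75) + 1.5 * IQR(value)))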
#Are there any relevant transformations of one or more predictors that might improve the classification model?
Depending on where we set our cutoff for correlation, we could remove RI most easily, given its correlation with Ca, Si, and Al. We could also, in an effort to treat the skewness and the concentration at 0, apply a Box-Cox transformation.
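A sketch of how that could look with caret’s preProcess (caret is not loaded above, so this is an assumption about the toolchain). Box-Cox requires strictly positive values, so zero-heavy predictors such as Ba and Fe would be skipped; Yeo-Johnson is an alternative that tolerates zeros.
library(caret)
# Estimate a Box-Cox lambda per predictor; columns containing zeros are skipped.
pp <- preProcess(Glass %>% dplyr::select(-Type), method = "BoxCox")
glass_trans <- predict(pp, Glass %>% dplyr::select(-Type))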
3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
#(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
library(inspectdf)
data(Soybean)
?Soybean
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
# Removing Class because it crowds out the other panels
cat_vars <- Soybean %>%
  select_if(is.factor) %>%
  select(-Class)
cat_vars_long <- cat_vars %>%
  gather(key = "variable", value = "value")
## Warning: attributes are not identical across measure variables; they will be
## dropped
ggplot(cat_vars_long, aes(x = value)) +
  geom_bar(fill = "purple", color = "black") +
  facet_wrap(~variable, scales = "free_x") +
  labs(x = "Categories", y = "Count", title = "Frequency Distribution of Categorical Predictors") +
  theme_minimal() +
  theme(axis.text.x = element_text(hjust = 1), axis.text.y = element_text(size = 6))
Examining the above output, we are looking for low variability when trying to identify degenerate distributions. Mycelium seems to fit, with almost all records falling into the 0 category unless missing. Sclerotia also fits this description.
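A programmatic check, as a sketch assuming caret is available: nearZeroVar() flags predictors whose dominant level swamps the rest, which is exactly the degeneracy described in the chapter.
library(caret)
# saveMetrics = TRUE returns frequency ratios and flags instead of column indices.
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ]  # the near-zero-variance (degenerate) predictors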
#(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
library(naniar)
vis_miss(Soybean)
9.5% of the cells in the dataset are missing (the question’s “roughly 18%” most likely counts incomplete rows rather than individual cells). hail and sever seem to be missing the most, along with lodging, all at 17.7%. Class is missing 0%. It is odd that some predictors are missing quite a bit while others are missing none. Additionally, the predictors missing the most all share the same percentage of missing values, suggesting they are missing for the same records and hinting at a pattern or relationship between those predictors, and possibly with the classes.
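naniar (already loaded) can check that relationship directly; a quick sketch:
miss_var_summary(Soybean)              # percent missing per variable
gg_miss_fct(x = Soybean, fct = Class)  # % missing per predictor, broken down by class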
#(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
Given the large number of predictors and the relatively low proportion of missing data, I would probably recommend imputation instead of removal. For binary variables like “hail” or “lodging,” imputing “no” makes sense, while for other categorical variables I would use the mode. In cases where imputing a value could distort the data, adding an “unknown” level could help; since the missingness appears tied to particular classes, an explicit missing level would preserve that signal rather than erase it.
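A minimal sketch of the mode-imputation piece (impute_mode is a helper written here, not a package function):
# Replace NAs in each factor with that factor's most frequent level.
impute_mode <- function(x) {
  x[is.na(x)] <- names(which.max(table(x)))
  x
}
soy_imputed <- Soybean %>%
  mutate(across(-Class, impute_mode))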
3.3.
Kuhn, Max; Johnson, Kjell. Applied Predictive Modeling (p. 59). Springer New York. Kindle Edition.