### 3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(mlbench)
library(purrr)
library(corrplot)
## corrplot 0.94 loaded
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
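# Class balance: number of samples per glass type
Glass %>%
  ggplot() +
  geom_bar(aes(x = Type)) +
  ggtitle("Distribution of Glass Types")
# Histogram of each numeric predictor, one panel per variable
Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 15) +
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms")
# Boxplot of each numeric predictor to highlight skew and outliers
Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_boxplot() +
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots")
# Pairwise correlations between the numeric predictors
Glass %>%
  keep(is.numeric) %>%
  cor() %>%
  corrplot()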
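The correlation plot gives a visual overview. To read off the strongest relationships numerically, the same matrix can be flattened into a ranked list of predictor pairs. The following is a minimal sketch using base R only; the object names (`cors`, `pairs_df`) are illustrative and not part of the original analysis.

```r
library(mlbench)
data(Glass)

# Correlation matrix of the numeric predictors (Type is a factor and is dropped)
cors <- cor(Glass[sapply(Glass, is.numeric)])

# Keep each unordered pair once by blanking the diagonal and upper triangle
cors[upper.tri(cors, diag = TRUE)] <- NA

# Flatten to long format and rank the pairs by absolute correlation
pairs_df <- na.omit(as.data.frame(as.table(cors)))
names(pairs_df) <- c("var1", "var2", "r")
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]

head(pairs_df, 5)
```

Ranking by absolute value puts strong negative correlations alongside strong positive ones, which is usually what matters when checking for redundant predictors.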
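(c) Are there any relevant transformations of one or more predictors that might improve the classification model?
Several transformations could help. A Box-Cox transformation (with an estimated lambda) or a log transformation can reduce skew in the positively skewed predictors, although both require strictly positive values and some predictors, such as Ba and Fe, contain zeros. Centering and scaling may also help, since the predictors are on different scales.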
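As a rough illustration of the transformations discussed here, the sketch below computes a sample skewness for each predictor by hand and then applies Box-Cox, centering, and scaling through `caret::preProcess()`. This is one possible approach rather than a definitive recipe. The names `predictors`, `skewness`, `pp`, and `glass_transformed` are illustrative, and the sketch assumes, as is my understanding of caret, that the Box-Cox step is only applied to strictly positive predictors and leaves the others untransformed.

```r
library(mlbench)
library(caret)
data(Glass)

# Drop the outcome so only the nine predictors remain
predictors <- Glass[, names(Glass) != "Type"]

# Sample skewness (third standardized moment), computed by hand
skewness <- sapply(predictors, function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
})
round(sort(skewness, decreasing = TRUE), 2)

# Estimate Box-Cox lambdas where possible, then center and scale
# (assumption: predictors containing zeros are skipped by the Box-Cox step)
pp <- preProcess(predictors, method = c("BoxCox", "center", "scale"))
glass_transformed <- predict(pp, predictors)

summary(glass_transformed)
```

Printing `pp` reports how many predictors each step was applied to, which is a quick way to check which variables actually received a Box-Cox transformation.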
### 3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
data(Soybean) ## See ?Soybean for details
# Faceted geom_bar() plots are hard to read here; trying another approach
library(tidyr)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
index <- nearZeroVar(Soybean)
colnames(Soybean)[index]
## [1] "leaf.mild" "mycelium" "sclerotia"
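To see why these three predictors are flagged, it can help to look at the diagnostics behind the decision. By default, `nearZeroVar()` flags a predictor when the ratio of its most common value to its second most common value exceeds 95/5 and the percentage of unique values is at most 10. The sketch below requests those metrics with `saveMetrics = TRUE`; the name `nzv_metrics` is illustrative.

```r
library(mlbench)
library(caret)
data(Soybean)

# saveMetrics = TRUE returns diagnostics for every column
# instead of only the indices of the flagged ones
nzv_metrics <- nearZeroVar(Soybean, saveMetrics = TRUE)

# Frequency ratio and percent unique for the flagged predictors
nzv_metrics[nzv_metrics$nzv, ]

# Raw level counts (including NA) for one flagged predictor
table(Soybean$mycelium, useNA = "ifany")
```

Looking at the raw counts alongside the metrics makes it easier to judge whether a flagged predictor is truly uninformative or just rare but meaningful.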
# Proportion of missing values per column, largest first
missing_proportions <- sort(colMeans(is.na(Soybean)), decreasing = TRUE)
missing_df <- data.frame(variable = names(missing_proportions), missing_proportion = missing_proportions)
ggplot(missing_df, aes(x = reorder(variable, missing_proportion), y = missing_proportion)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Proportion of Missing Values",
       x = "Variable",
       y = "Proportion") +
  theme_minimal()
# Count the rows with at least one missing value, by class
classes_w_missing <- Soybean %>%
  filter(if_any(everything(), is.na)) %>%
  group_by(Class) %>%
  summarise(count = n())
classes_w_missing
## # A tibble: 5 × 2
## Class count
## <fct> <int>
## 1 2-4-d-injury 16
## 2 cyst-nematode 14
## 3 diaporthe-pod-&-stem-blight 15
## 4 herbicide-injury 8
## 5 phytophthora-rot 68
# For each class with missing data, list the variables that are missing in every row
for (cls in unique(classes_w_missing$Class)) {
  class_rows <- Soybean %>%
    filter(Class == cls)
  missing_proportions <- sort(colMeans(is.na(class_rows)), decreasing = TRUE)
  missing_100 <- names(missing_proportions[missing_proportions == 1])
  print(paste("Class:", cls))
  if (length(missing_100) > 0) {
    print(paste("Variables with 100% missing:", paste(missing_100, collapse = ", ")))
  } else {
    print("No variables with 100% missing values.")
  }
}
## [1] "Class: 2-4-d-injury"
## [1] "Variables with 100% missing: plant.stand, precip, temp, hail, crop.hist, sever, seed.tmt, germ, plant.growth, leaf.shread, leaf.mild, stem, lodging, stem.cankers, canker.lesion, fruiting.bodies, ext.decay, mycelium, int.discolor, sclerotia, fruit.pods, fruit.spots, seed, mold.growth, seed.discolor, seed.size, shriveling, roots"
## [1] "Class: cyst-nematode"
## [1] "Variables with 100% missing: plant.stand, precip, temp, hail, sever, seed.tmt, germ, leaf.halo, leaf.marg, leaf.size, leaf.shread, leaf.malf, leaf.mild, lodging, stem.cankers, canker.lesion, fruiting.bodies, ext.decay, mycelium, int.discolor, sclerotia, fruit.spots, seed.discolor, shriveling"
## [1] "Class: diaporthe-pod-&-stem-blight"
## [1] "Variables with 100% missing: hail, sever, seed.tmt, leaf.halo, leaf.marg, leaf.size, leaf.shread, leaf.malf, leaf.mild, lodging, roots"
## [1] "Class: herbicide-injury"
## [1] "Variables with 100% missing: precip, hail, sever, seed.tmt, germ, leaf.mild, lodging, stem.cankers, canker.lesion, fruiting.bodies, ext.decay, mycelium, int.discolor, sclerotia, fruit.spots, seed, mold.growth, seed.discolor, seed.size, shriveling"
## [1] "Class: phytophthora-rot"
## [1] "No variables with 100% missing values."
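The output above suggests that missingness is tied to class rather than spread at random. A compact way to check this is to compute, for each class, the share of rows that contain at least one missing value. The following is a minimal base R sketch; the names `incomplete`, `incomplete_share`, and `incomplete_counts` are illustrative and not part of the original analysis.

```r
library(mlbench)
data(Soybean)

# TRUE for rows with at least one missing value
incomplete <- !complete.cases(Soybean)

# Share of incomplete rows within each class, showing only affected classes
incomplete_share <- tapply(incomplete, Soybean$Class, mean)
round(incomplete_share[incomplete_share > 0], 2)

# Number of incomplete rows per class, comparable to classes_w_missing
incomplete_counts <- tapply(incomplete, Soybean$Class, sum)
incomplete_counts[incomplete_counts > 0]
```

A share of 1 means every sample in that class has at least one missing predictor, so dropping incomplete rows would remove that class entirely.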