In this assignment, problems 3.1 and 3.2 from Kuhn and Johnson's book, Applied Predictive Modeling, are solved.
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
library(fpp3)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.0 ──
## ✔ tibble 3.2.1 ✔ tsibble 1.1.5
## ✔ dplyr 1.1.4 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.1 ✔ feasts 0.3.2
## ✔ lubridate 1.9.3 ✔ fable 0.3.4
## ✔ ggplot2 3.5.1 ✔ fabletools 0.4.2
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following objects are masked from 'package:fabletools':
##
## MAE, RMSE
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(e1071)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:fabletools':
##
## interpolate
library(purrr)
##
## Attaching package: 'purrr'
## The following object is masked from 'package:caret':
##
## lift
library(dplyr)
library(corrplot)
## corrplot 0.94 loaded
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
head(Glass)
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
#?Glass
There are 10 variables in the Glass data set. The “Type” variable is the response, and the remaining nine are predictors, all of which are numeric. To visualize their distributions, a histogram is plotted for each predictor; a correlation plot is also produced to examine their relationships.
Data visualizations for the predictor variables:
# Plot histograms
Glass |>
keep(is.numeric)|>
gather()|>
ggplot(aes(value)) +
geom_histogram(bins = 30) +
facet_wrap(~key, scales = 'free') +
ggtitle("Histograms for the Numerical Predictors")
# Plot box plots
Glass |>
keep(is.numeric)|>
gather() |>
ggplot(aes(value)) +
geom_boxplot() +
facet_wrap(~key, scales = 'free') +
ggtitle("Boxplots for the Numerical Predictors")
Correlations of predictors:
Glass|>
keep(is.numeric) |>
cor() |>
corrplot()
Glass|>keep(is.numeric) |>
cor()|>
round(2)
## RI Na Mg Al Si K Ca Ba Fe
## RI 1.00 -0.19 -0.12 -0.41 -0.54 -0.29 0.81 0.00 0.14
## Na -0.19 1.00 -0.27 0.16 -0.07 -0.27 -0.28 0.33 -0.24
## Mg -0.12 -0.27 1.00 -0.48 -0.17 0.01 -0.44 -0.49 0.08
## Al -0.41 0.16 -0.48 1.00 -0.01 0.33 -0.26 0.48 -0.07
## Si -0.54 -0.07 -0.17 -0.01 1.00 -0.19 -0.21 -0.10 -0.09
## K -0.29 -0.27 0.01 0.33 -0.19 1.00 -0.32 -0.04 -0.01
## Ca 0.81 -0.28 -0.44 -0.26 -0.21 -0.32 1.00 -0.11 0.12
## Ba 0.00 0.33 -0.49 0.48 -0.10 -0.04 -0.11 1.00 -0.06
## Fe 0.14 -0.24 0.08 -0.07 -0.09 -0.01 0.12 -0.06 1.00
The correlation matrix shows that the refractive index (RI) has a strong positive correlation with calcium (Ca) (0.81) and a moderate negative correlation with silicon (Si) (-0.54). Other notable relationships include moderate negative correlations of magnesium (Mg) with aluminum (Al) (-0.48) and barium (Ba) (-0.49), and a weak-to-moderate positive correlation between sodium (Na) and barium (Ba) (0.33). Most other correlations are weak, suggesting minimal relationships among those elements.
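As a sanity check on these relationships, caret's findCorrelation() can flag predictors that may be worth removing due to high pairwise correlation. The sketch below is illustrative (it was not part of the original output), and the 0.75 cutoff is an assumed value:
# Flag predictors with pairwise correlation above an assumed 0.75 cutoff
corMatrix <- Glass |> keep(is.numeric) |> cor()
findCorrelation(corMatrix, cutoff = 0.75, names = TRUE)
Given the 0.81 correlation between RI and Ca, this would be expected to flag one of that pair.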
Based on the boxplots, the predictors Refractive Index, Sodium, Potassium, Calcium, and Iron show potential outliers; the outliers appear most pronounced for Potassium and Iron.
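To back up this impression, the boxplot rule (points beyond 1.5 times the IQR from the quartiles) can be tallied per predictor. This check is a sketch and was not part of the original output:
# Count boxplot-rule outliers (beyond 1.5 * IQR from the quartiles) per predictor
count_outliers <- function(x) {
q <- quantile(x, c(0.25, 0.75))
iqr <- diff(q)
sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}
sapply(Glass |> keep(is.numeric), count_outliers)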
The histograms above reveal that Sodium is slightly right skewed, Aluminum is moderately right skewed, and the refractive index, Potassium, Calcium, Barium, and Iron are strongly right skewed. The distributions of Magnesium and Silicon are left skewed, with Magnesium showing the stronger left skew.
The skewness statistic for each predictor is computed below:
Glass1<-Glass|>keep(is.numeric)
(skewValues <- apply(Glass1, 2, skewness))
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
Since all the predictors show some degree of skewness, replacing the data with the log, square root, or inverse may help remove the skew. Because Refractive Index (RI), Potassium (K), Calcium (Ca), Barium (Ba), and Iron (Fe) are strongly right skewed, a log transformation may work well for them. For the left-skewed Magnesium (Mg), a power transformation such as squaring may be more appropriate, since the log corrects right skew. Additionally, the Box-Cox method can be applied to estimate an appropriate transformation for each predictor individually.
Glass|>keep(is.numeric)|>
summarise(across(everything(), ~ BoxCoxTrans(.x)$lambda))
## RI Na Mg Al Si K Ca Ba Fe
## 1 -2 -0.1 NA 0.5 2 NA -1.1 NA NA
From the estimated lambda values above, Sodium (Na) (lambda of -0.1, close to 0) is a reasonable candidate for a log transformation; Aluminum (Al) (0.5) suggests a square root, Silicon (Si) (2) suggests squaring, and Refractive Index (RI) (-2) and Calcium (Ca) (-1.1) suggest inverse-square and inverse transformations, respectively. The NA values for Magnesium (Mg), Potassium (K), Barium (Ba), and Iron (Fe) indicate that the Box-Cox lambda could not be estimated because these predictors contain zeros; for them, a shifted log such as log(x + 1) or the Yeo-Johnson transformation could be used instead.
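As an illustrative sketch (not part of the original output), caret's preProcess() can estimate and apply such transformations in one step; Yeo-Johnson is used here instead of Box-Cox because it tolerates the zeros in Mg, K, Ba, and Fe:
# Estimate and apply Yeo-Johnson transformations, then re-check skewness
pp <- preProcess(Glass1, method = "YeoJohnson")
Glass_trans <- predict(pp, Glass1)
round(apply(Glass_trans, 2, skewness), 2)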
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
# ?Soybean
head(Soybean)
## Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker 6 0 2 1 0 1 1
## 2 diaporthe-stem-canker 4 0 2 1 0 2 0
## 3 diaporthe-stem-canker 3 0 2 1 0 1 0
## 4 diaporthe-stem-canker 3 0 2 1 0 1 0
## 5 diaporthe-stem-canker 6 0 2 1 0 2 0
## 6 diaporthe-stem-canker 5 0 2 1 0 3 0
## sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1 1 0 0 1 1 0 2 2
## 2 2 1 1 1 1 0 2 2
## 3 2 1 2 1 1 0 2 2
## 4 2 0 1 1 1 0 2 2
## 5 1 0 2 1 1 0 2 2
## 6 1 0 1 1 1 0 2 2
## leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1 0 0 0 1 1 3 1
## 2 0 0 0 1 0 3 1
## 3 0 0 0 1 0 3 0
## 4 0 0 0 1 0 3 0
## 5 0 0 0 1 0 3 1
## 6 0 0 0 1 0 3 0
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1 1 1 0 0 0 0
## 2 1 1 0 0 0 0
## 3 1 1 0 0 0 0
## 4 1 1 0 0 0 0
## 5 1 1 0 0 0 0
## 6 1 1 0 0 0 0
## fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1 4 0 0 0 0 0 0
## 2 4 0 0 0 0 0 0
## 3 4 0 0 0 0 0 0
## 4 4 0 0 0 0 0 0
## 5 4 0 0 0 0 0 0
## 6 4 0 0 0 0 0 0
# Exclude response variable 'Class' from the data set
Soybean1 <- Soybean[, !(names(Soybean) %in% "Class")]
# Bar plots to see frequency distribution of 35 categorical predictors
columns<-colnames(Soybean1)
lapply(columns,
function(col) {
ggplot(Soybean1, aes(x = .data[[col]])) + geom_bar() + ggtitle(col)})
## (Output: 35 bar plots, one per predictor, omitted here.)
Based on the frequency distributions, mycelium and sclerotia appear to be degenerate, with nearly all observations in a single level. leaf.mild and leaf.malf also look heavily imbalanced, especially after excluding the missing values.
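This impression can be verified with caret's near-zero-variance filter. The sketch below uses the function's default thresholds and was not part of the original output; the flagged rows would be expected to include mycelium and sclerotia:
# Flag degenerate (near-zero-variance) predictors using caret's defaults
nzv <- nearZeroVar(Soybean1, saveMetrics = TRUE)
nzv[nzv$nzv, ]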
Create bar plots to see the percentage of missing values in the predictor variables:
# Calculate missing percentages
miss_percent <- colSums(is.na(Soybean1)) / nrow(Soybean1) * 100
# Create data frame with missing percentages
miss_df <- data.frame(Predictor = names(miss_percent),
Miss_Percent = miss_percent)
# Bar plot of missing percentages for predictors
ggplot(miss_df, aes(x = reorder(Predictor, -Miss_Percent),
y = Miss_Percent)) +
geom_bar(stat = "identity", fill = "red") +
coord_flip() +
labs(title = "Percentage of Missing Values in Predictors",
x = "Predictors",
y = " % Missing") +
theme_minimal()
Determine missing counts by class and identify the classes whose predictors have missing values:
# Count missing values by Class
missing_by_class <- Soybean %>%
group_by(Class) %>%
summarise(across(everything(), ~ sum(is.na(.)), .names = "missing_{.col}")) %>%
pivot_longer(cols = starts_with("missing_"),
names_to = "Predictor",
values_to = "Missing_Count") %>%
mutate(Predictor = sub("missing_", "", Predictor))
# See missing counts by class
head(missing_by_class,20)
## # A tibble: 20 × 3
## Class Predictor Missing_Count
## <fct> <chr> <int>
## 1 2-4-d-injury date 1
## 2 2-4-d-injury plant.stand 16
## 3 2-4-d-injury precip 16
## 4 2-4-d-injury temp 16
## 5 2-4-d-injury hail 16
## 6 2-4-d-injury crop.hist 16
## 7 2-4-d-injury area.dam 1
## 8 2-4-d-injury sever 16
## 9 2-4-d-injury seed.tmt 16
## 10 2-4-d-injury germ 16
## 11 2-4-d-injury plant.growth 16
## 12 2-4-d-injury leaves 0
## 13 2-4-d-injury leaf.halo 0
## 14 2-4-d-injury leaf.marg 0
## 15 2-4-d-injury leaf.size 0
## 16 2-4-d-injury leaf.shread 16
## 17 2-4-d-injury leaf.malf 0
## 18 2-4-d-injury leaf.mild 16
## 19 2-4-d-injury stem 16
## 20 2-4-d-injury lodging 16
# Classes with missing counts
classes_with_missing <- missing_by_class %>%
filter(Missing_Count > 0) %>%
distinct(Class)
# See distinct classes
print(classes_with_missing)
## # A tibble: 5 × 1
## Class
## <fct>
## 1 2-4-d-injury
## 2 cyst-nematode
## 3 diaporthe-pod-&-stem-blight
## 4 herbicide-injury
## 5 phytophthora-rot
Bar plot of missing percentages after removing the classes that have missing values:
# Define classes to remove
classes_to_remove <- c("2-4-d-injury", "cyst-nematode", "diaporthe-pod-&-stem-blight", "herbicide-injury", "phytophthora-rot")
# Filter out defined classes from Soybean data set
Soybean2 <- Soybean %>%
filter(!Class %in% classes_to_remove)
# Calculate missing percentages
miss_percent <- colSums(is.na(Soybean2)) / nrow(Soybean2) * 100
# Create data frame for missing percentages
miss_data <- data.frame(Predictor = names(miss_percent),
Miss_Percent = miss_percent)
# Bar plot of missing percentages for predictors
ggplot(miss_data, aes(x = reorder(Predictor, -Miss_Percent),
y = Miss_Percent)) +
geom_bar(stat = "identity", fill = "red") +
coord_flip() +
labs(title = "Percentage of Missing Values in Predictors",
x = "Predictors",
y = "% Missing") +
theme_minimal()
The first bar plot above shows that, except for the predictor “leaves”, the remaining 34 predictors have some percentage of missing values in the Soybean data set. It is also observed that the missing values in those 34 predictors are concentrated in five classes: 2-4-d-injury, herbicide-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and phytophthora-rot, which suggests a pattern of missingness related to class. After removing these five classes, no missing data remain, as shown by the second bar plot above.
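A quick confirmation of this pattern (not included in the original output) is the proportion of fully observed rows within each class; the five classes listed above would be expected to fall well below 1:
# Proportion of complete (no-NA) rows within each class
prop_complete <- tapply(complete.cases(Soybean), Soybean$Class, mean)
sort(round(prop_complete, 2))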
To handle the missing data in the Soybean data set, I will adopt a strategy based on elimination and imputation. First, I will assess the volume of missing data for each predictor; any predictor with 20% or more missing values will be removed from the analysis, with the 20% threshold chosen based on the sample size. Next, for the remaining predictors with missing values, I will impute the gaps: replacing missing values with the mode (or the mean or median for numeric data) is a simple baseline, and more advanced techniques such as K-Nearest Neighbors (KNN) can better estimate the missing values. By combining these approaches, I should obtain a more reliable data set for this analysis while minimizing the impact of missing data.
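The sketch below implements this strategy under the stated 20% threshold, using mode imputation as the simple baseline; the helper mode_of() is introduced here for illustration, and a KNN alternative (e.g., VIM::kNN()) could be swapped in:
# 1) Drop predictors with 20% or more missing values (computed on the predictors only)
miss_pct <- colSums(is.na(Soybean1)) / nrow(Soybean1) * 100
Soybean_imp <- Soybean1[, miss_pct < 20]
# 2) Impute remaining NAs with each column's mode (most frequent level)
mode_of <- function(x) names(which.max(table(x)))
for (col in names(Soybean_imp)) {
x <- Soybean_imp[[col]]
x[is.na(x)] <- mode_of(x)
Soybean_imp[[col]] <- x
}
colSums(is.na(Soybean_imp)) # expect all zeros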