Load libraries
library(tidyverse)
library(fpp3)
library(corrplot)
Exercise 3.1
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
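# Reshape the nine numeric predictors to long format for faceted plots.
# (No factor columns remain once Type is dropped, so no conversion is needed.)
glass2 <- Glass %>%
  select(-Type) %>%
  pivot_longer(cols = everything(), names_to = "name", values_to = "value")
# One histogram per predictor, each on its own scale
ggplot(glass2, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~name, scales = "free") +
  labs(title = "Glass Distribution")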
# Columns 1 to 9 are the nine predictors (RI through Fe); column 10 is Type
glass <- Glass[1:9]
corrplot(cor(glass),
         method = "number",
         type = "upper")
ggplot(glass2, aes(name, value)) +
  geom_boxplot() +
  labs(title = "Boxplot for Glass")
ggplot(Glass, aes(Type)) +
  geom_bar() +
  labs(title = "Count for Type for Glass")
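Si stands apart on the shared-axis boxplot: most predictors take values below 20, while Si is much larger (about 72 to 73 in the rows shown by str()). This reflects a difference in scale rather than outlying samples, so outliers are better judged within each predictor on the free-scale histograms. The histograms show that Ba, Fe, and K are extremely right-skewed. Al, Ca, and RI are also right-skewed, but less so than Ba, Fe, and K.
A log or Box-Cox transformation will help with the right skewness in Ba, Fe, and K. Both require strictly positive values, and Ba and Fe contain zeros (visible in the str() output above), so in practice an offset such as log(x + 1) or a Yeo-Johnson transformation is needed.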
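As a rough check of the transformation idea above, the sketch below compares the sample skewness of Ba, Fe, and K before and after a log(x + 1) transformation. The `skewness` helper is a hypothetical, simple moment-based estimate written for this illustration (it is not from any package used in this report), and `log1p` is chosen only because it tolerates zeros; a Yeo-Johnson transformation would be another option.

```r
library(mlbench)
data(Glass)

# Hypothetical helper: moment-based sample skewness (illustration only)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skewed <- c("Ba", "Fe", "K")

# Skewness on the original scale
before <- sapply(Glass[skewed], skewness)

# Skewness after log(x + 1), which is defined for the zero values
after <- sapply(Glass[skewed], function(x) skewness(log1p(x)))

round(rbind(before, after), 2)
```

Values closer to zero in the `after` row would indicate a more symmetric distribution. A transformation cannot remove a large spike at zero, so some skewness may remain.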
Exercise 3.2
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
library(mlbench)
data("Soybean")
Several distributions appear degenerate, with most observations concentrated in a single level. In the histograms below this is the case for hail, sclerotia, seed, shriveling, stem, leaves, seed.size, seed.discolor, mycelium, lodging, plant.growth and roots. Many of these predictors also have a substantial number of missing values.
# Convert the factor predictors to their integer codes and reshape to long format
soybean_data <- Soybean %>%
  select(-Class, -date) %>%
  mutate(across(where(is.factor), as.numeric)) %>%
  pivot_longer(cols = everything(), names_to = "name", values_to = "value")
# One count plot per predictor
ggplot(soybean_data, aes(x = value)) +
  geom_histogram(stat = "count") +
  facet_wrap(~name, scales = "free") +
  labs(title = "Soybean Distribution")
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
## Warning: Removed 2336 rows containing non-finite outside the scale range
## (`stat_count()`).
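The visual impression of degenerate distributions can be checked numerically. The sketch below computes, for each predictor, the share of non-missing observations in the most common level and the ratio of the most common to the second most common level. The names `top_share` and `freq_ratio` are hypothetical and used only for this illustration; `caret::nearZeroVar()` provides a packaged version of a similar check.

```r
library(mlbench)
data(Soybean)

# All columns except the outcome
predictors <- Soybean[, names(Soybean) != "Class"]

# Share of non-missing observations in the most common level
top_share <- sapply(predictors, function(x) {
  counts <- table(x)  # table() drops NA by default
  max(counts) / sum(counts)
})

# Frequency of the most common level divided by the second most common
freq_ratio <- sapply(predictors, function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) Inf else counts[[1]] / counts[[2]]
})

# Predictors most dominated by a single level
head(sort(freq_ratio, decreasing = TRUE), 10)
```

A large frequency ratio means a single level dominates the predictor, which is the usual sign of a near-zero-variance (degenerate) distribution.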
The predictors with the most missing data are hail, sever, seed.tmt, and lodging, each with 121 missing values. The per-class summary below shows that the missing values are concentrated in five classes: 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury and phytophthora-rot.
colSums(is.na(Soybean))
##           Class            date     plant.stand          precip            temp
##               0               1              36              38              30
##            hail       crop.hist        area.dam           sever        seed.tmt
##             121              16               1             121             121
##            germ    plant.growth          leaves       leaf.halo       leaf.marg
##             112              16               0              84              84
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem
##              84             100              84             108              16
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay
##             121              38              38             106              38
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots
##              38              38              38              84             106
##            seed     mold.growth   seed.discolor       seed.size      shriveling
##              92              92             106              92             106
##           roots
##              31
# Sum the integer factor codes within each class. With na.rm = FALSE a sum is NA
# whenever the class has at least one missing value for that predictor.
Soybean %>%
  group_by(Class) %>%
  mutate(across(where(is.factor), as.numeric)) %>%
  summarise(across(where(is.numeric), \(x) sum(x, na.rm = FALSE)))
## # A tibble: 19 × 36
## Class date plant.stand precip temp hail crop.hist area.dam sever seed.tmt
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2-4-d… NA NA NA NA NA NA NA NA NA
## 2 alter… 552 124 264 233 101 261 236 129 146
## 3 anthr… 241 67 132 99 55 122 122 79 71
## 4 bacte… 90 25 50 43 30 56 50 30 30
## 5 bacte… 75 29 48 39 30 53 50 28 26
## 6 brown… 293 127 266 194 103 291 296 179 135
## 7 brown… 234 55 53 81 53 140 122 95 66
## 8 charc… 115 20 20 55 31 55 70 40 30
## 9 cyst-… 58 NA NA NA NA 45 34 NA NA
## 10 diapo… 88 NA 43 45 NA 44 54 NA NA
## 11 diapo… 110 20 60 40 21 61 23 46 29
## 12 downy… 90 31 60 35 29 56 50 34 30
## 13 frog-… 501 119 263 219 101 266 227 134 143
## 14 herbi… 15 16 NA 8 NA 12 20 NA NA
## 15 phyll… 68 31 31 50 29 50 52 26 32
## 16 phyto… 266 176 234 195 NA 262 178 NA NA
## 17 powde… 95 31 31 30 29 50 50 30 34
## 18 purpl… 113 20 60 39 29 50 50 20 28
## 19 rhizo… 45 38 60 20 22 50 40 51 24
## # ℹ 26 more variables: germ <dbl>, plant.growth <dbl>, leaves <dbl>,
## # leaf.halo <dbl>, leaf.marg <dbl>, leaf.size <dbl>, leaf.shread <dbl>,
## # leaf.malf <dbl>, leaf.mild <dbl>, stem <dbl>, lodging <dbl>,
## # stem.cankers <dbl>, canker.lesion <dbl>, fruiting.bodies <dbl>,
## # ext.decay <dbl>, mycelium <dbl>, int.discolor <dbl>, sclerotia <dbl>,
## # fruit.pods <dbl>, fruit.spots <dbl>, seed <dbl>, mold.growth <dbl>,
## # seed.discolor <dbl>, seed.size <dbl>, shriveling <dbl>, roots <dbl>
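The NA pattern in the summary above can also be read more directly by counting incomplete rows per class. The following is a minimal sketch in base R; the name `missing_by_class` is hypothetical and used only for this illustration.

```r
library(mlbench)
data(Soybean)

# TRUE for rows with at least one missing value
incomplete <- !complete.cases(Soybean)

# Total rows and incomplete rows per class (both ordered by factor level)
missing_by_class <- data.frame(
  n_rows       = as.vector(table(Soybean$Class)),
  n_incomplete = as.vector(tapply(incomplete, Soybean$Class, sum)),
  row.names    = levels(Soybean$Class)
)

# Show only the classes that contain incomplete rows
missing_by_class[missing_by_class$n_incomplete > 0, ]
```

If the missing values are tied to the class, as the summary suggests, only a few classes should appear in this table.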
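One strategy for handling the missing data is imputation with K-nearest neighbors. For each sample with a missing value, the K most similar samples are found using the predictors that are observed, and the missing value is filled in from those neighbors (for a categorical predictor, for example, with the most common level among them). This has to be repeated for every missing value, each time looking at the surrounding points to fill in the missing information.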
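The K-nearest neighbor idea described above can be sketched with the `kNN()` function from the VIM package. This is an assumption made for illustration: VIM is not used elsewhere in this report, and `k = 5` is an arbitrary choice. The outcome `Class` is left out of both the imputed variables and the distance calculation so that the outcome does not influence the imputed predictors.

```r
library(mlbench)
library(VIM)  # assumption: VIM is not loaded elsewhere in this report

data(Soybean)

# Impute predictors only; keep the outcome out of the distance calculation
predictors <- setdiff(names(Soybean), "Class")

imputed <- kNN(
  Soybean,
  variable = predictors,  # columns to impute
  dist_var = predictors,  # columns used to find neighbors
  k = 5,                  # number of neighbors (arbitrary choice)
  imp_var = FALSE         # do not append TRUE/FALSE indicator columns
)

# Compare the number of missing values before and after imputation
c(before = sum(is.na(Soybean)), after = sum(is.na(imputed)))
```

Because the missing values are concentrated in a few classes, it is worth checking afterwards that the imputed values for those classes look plausible. Eliminating the most heavily affected predictors is the alternative if they do not.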