library(fpp3)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tibble 3.2.1 ✔ tsibble 1.1.6
## ✔ dplyr 1.1.4 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.1 ✔ feasts 0.4.1
## ✔ lubridate 1.9.3 ✔ fable 0.4.1
## ✔ ggplot2 3.5.1
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ purrr 1.0.2 ✔ stringr 1.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(mlbench)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
##
## The following objects are masked from 'package:fabletools':
##
## MAE, RMSE
library(corrplot)
## corrplot 0.95 loaded
The UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/index.html) contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
# Faceted histograms of every predictor, with fill showing glass Type
Glass |>
  melt() |>
  ggplot() +
  geom_histogram(aes(x = value, fill = Type)) +
  facet_wrap(~variable, scales = "free")
## Using Type as id variables
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Glass |>
melt() |>
filter(variable == "K", value != 0) |>
ggplot() +
geom_histogram(aes(x=value), binwidth = .1)
## Using Type as id variables
Glass |>
melt() |>
filter(variable == "Ba", value != 0) |>
ggplot() +
geom_histogram(aes(x=value), binwidth = .1)
## Using Type as id variables
Glass |>
melt() |>
filter(variable == "Fe", value != 0) |>
ggplot() +
geom_histogram(aes(x=value), binwidth = .05)
## Using Type as id variables
Glass |>
ggpairs()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. (repeated for each panel)
There are outliers for a large number of predictors; in fact, most of the predictors seem to have outliers. For Mg, K, Ba and Fe there are a large number of 0 measurements, which may be valid values or may be errors in the data. Either way, it makes me wary of whether those values are correct. Fe and Ba are mostly 0 with a few small non-zero values, which probably means those elements are rarer in these glass samples than the rest of the elements.
None of the predictors are perfectly normally distributed, but some, like Si, Ca and Mg, are fairly close, while RI, Na, Al and Fe (after removing the 0 values) are skewed. Ba doesn't have enough non-zero values to show any meaningful distribution. K has an interesting distribution with a secondary, smaller peak closer to zero.
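To put a number on those skew claims, the sample skewness of each predictor can be computed; a minimal sketch, assuming the e1071 package (not loaded above) is installed:
library(e1071)
# Sample skewness per predictor: values near 0 are roughly symmetric,
# larger magnitudes match the skew visible in the histograms
Glass |>
  select(!Type) |>
  sapply(skewness) |>
  sort() |>
  round(2)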
PCA and the preProcess function with the BoxCox method would work well for this data. Most of the predictors are already fairly close to normal, and these transformations remove some of the remaining skew, normalizing the data into a cleaner dataset for modeling. A PCA alone would produce a reasonable set of transformed predictors, but with preProcess we also get a smaller set of predictors, which reduces the computational effort of whatever model we fit later since there is less data to consider.
# For each of the nine predictors (in column order RI through Fe), drop
# the 0 measurements, estimate the Box-Cox lambda with caret::BoxCoxTrans(),
# and plot the transformed (or raw, if lambda could not be estimated) values
for (x in Glass[1:9]){
  x <- x[x != 0]
  lambda <- BoxCoxTrans(x)$lambda
  print(lambda)
  if (!is.na(lambda)){
    print(ggplot(data.frame(box_cox(x, lambda))) +
      geom_histogram(aes(box_cox(x, lambda))) +
      ggtitle(paste("Lambda ", lambda)))
  } else {
    print(ggplot(data.frame(x)) +
      geom_histogram(aes(x)) +
      ggtitle("Lambda NA"))
  }
}
## [1] -2
## [1] -0.1
## [1] 2
## [1] 0.5
## [1] 2
## [1] 0.2
## [1] -1.1
## [1] 0.4
## [1] 0.5
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. (repeated after each histogram)
pcaGlass <- Glass|>
select(!Type) |>
prcomp(center = TRUE, scale. = TRUE)
pcaGlass
## Standard deviations (1, .., p=9):
## [1] 1.58466518 1.43180731 1.18526115 1.07604017 0.95603465 0.72638502 0.60741950
## [8] 0.25269141 0.04011007
##
## Rotation (n x k) = (9 x 9):
## PC1 PC2 PC3 PC4 PC5 PC6
## RI -0.5451766 0.28568318 0.0869108293 0.14738099 -0.073542700 0.11528772
## Na 0.2581256 0.27035007 -0.3849196197 0.49124204 0.153683304 -0.55811757
## Mg -0.1108810 -0.59355826 0.0084179590 0.37878577 0.123509124 0.30818598
## Al 0.4287086 0.29521154 0.3292371183 -0.13750592 0.014108879 -0.01885731
## Si 0.2288364 -0.15509891 -0.4587088382 -0.65253771 0.008500117 0.08609797
## K 0.2193440 -0.15397013 0.6625741197 -0.03853544 -0.307039842 -0.24363237
## Ca -0.4923061 0.34537980 -0.0009847321 -0.27644322 -0.188187742 -0.14866937
## Ba 0.2503751 0.48470218 0.0740547309 0.13317545 0.251334261 0.65721884
## Fe -0.1858415 -0.06203879 0.2844505524 -0.23049202 0.873264047 -0.24304431
## PC7 PC8 PC9
## RI -0.08186724 -0.75221590 -0.02573194
## Na -0.14858006 -0.12769315 0.31193718
## Mg 0.20604537 -0.07689061 0.57727335
## Al 0.69923557 -0.27444105 0.19222686
## Si -0.21606658 -0.37992298 0.29807321
## K -0.50412141 -0.10981168 0.26050863
## Ca 0.09913463 0.39870468 0.57932321
## Ba -0.35178255 0.14493235 0.19822820
## Fe -0.07372136 -0.01627141 0.01466944
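To see how much of the variance each component actually captures, summary() on the prcomp object reports the individual and cumulative proportions:
# Proportion of variance explained, per component and cumulative
summary(pcaGlass)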
# Box-Cox to reduce skew, center and scale, then PCA, in one pipeline
tfm <- preProcess(Glass, method = c("BoxCox", "center", "scale", "pca"))
pdt <- predict(tfm, Glass)
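As a quick sanity check on the dimensionality-reduction claim: caret's pca method keeps enough components to explain 95% of the variance by default, so the transformed data should have fewer columns than the nine original predictors.
# Compare the predictor count before and after preprocessing
dim(Glass)
dim(pdt)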
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:
data(Soybean)
A large number of these predictors have missing data, which is the biggest impact on their distributions. Since these predictors are categorical they don't necessarily have a normal distribution, so the only major issues are missing data and uneven distributions (which may not even be an issue, just an artifact of the data itself).
# Bar chart of the level counts for every column in the Soybean data
for (nm in names(Soybean)){
  print(ggplot(Soybean, aes(x = .data[[nm]])) +
    geom_bar() +
    ggtitle(nm))
}
By counting the number of missing entries per predictor and sorting them, we can see a lot of grouping: many predictors are missing exactly the same number of entries, which probably means there is an underlying issue with the data collection or the values were purposely left blank. Since the table is sorted, you can see that the predictors most likely to have missing data are hail, lodging, seed.tmt and sever.
# Overall count and percentage of missing values across all predictors
Soybean |>
  select(!Class) |>
  as.list() |>
  melt() |>
  summarise(numNA = sum(is.na(value)),
            n = n(),
            percentNA = round(numNA / n * 100, 2)) |>
  print()
## numNA n percentNA
## 1 2337 23905 9.78
# Missing values per predictor (count and percent of the 683 samples),
# sorted with the most incomplete predictors first
Soybean |>
  select(!Class) |>
  as.list() |>
  melt() |>
  rename('predictors' = 'L1') |>
  group_by(predictors) |>
  summarise(numNA = sum(is.na(value)),
            percentNA = round(numNA / 683 * 100, 2)) |>
  arrange(desc(numNA)) |>
  head(10) |>
  print()
## # A tibble: 10 × 3
## predictors numNA percentNA
## <chr> <int> <dbl>
## 1 hail 121 17.7
## 2 lodging 121 17.7
## 3 seed.tmt 121 17.7
## 4 sever 121 17.7
## 5 germ 112 16.4
## 6 leaf.mild 108 15.8
## 7 fruit.spots 106 15.5
## 8 fruiting.bodies 106 15.5
## 9 seed.discolor 106 15.5
## 10 shriveling 106 15.5
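One way to probe whether those blanks are systematic is to check whether the incomplete rows cluster in particular classes; a minimal sketch (incomplete is just a helper flag introduced here):
# Rows with at least one NA, counted per Class
Soybean |>
  mutate(incomplete = !complete.cases(Soybean)) |>
  group_by(Class) |>
  summarise(n = n(), n_incomplete = sum(incomplete)) |>
  arrange(desc(n_incomplete)) |>
  head(5)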
I've chosen to eliminate some irrelevant predictors by finding those with near-zero variance and those that are strongly correlated with other predictors. Imputing categorical data with this many categories seemed like a difficult endeavor considering I don't have the faintest clue what any of the data means; with more context into the data and how it works, I would be much more likely to try imputing it into a more useful form. That said, I converted the data to numeric to look for highly correlated predictors. While I know this approach has its own issues, I was unable to find another way to measure correlations for this data, so I removed the predictors with high correlations to others, eliminating 6 predictors in total.
# Flag near-zero-variance predictors by column index
removals <- Soybean |>
  nearZeroVar()
# Convert the remaining predictors to numeric to get a correlation matrix
lessbeans <- Soybean |>
  select(!all_of(removals) & !Class) |>
  sapply(as.numeric)
beancor <- lessbeans |>
  cor(use = "complete.obs")
corrplot(beancor, order = 'hclust')
# Select the highly correlated predictors by name, since the indices into
# beancor (which excludes Class and the near-zero-variance columns) do not
# line up with column positions in Soybean
delbean <- colnames(beancor)[findCorrelation(beancor, cutoff = .75)]
Soybean |>
  select(!all_of(removals) & !all_of(delbean)) |>
  head(10) |>
  print()
## Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker 6 0 2 1 0 1 1
## 2 diaporthe-stem-canker 4 0 2 1 0 2 0
## 3 diaporthe-stem-canker 3 0 2 1 0 1 0
## 4 diaporthe-stem-canker 3 0 2 1 0 1 0
## 5 diaporthe-stem-canker 6 0 2 1 0 2 0
## 6 diaporthe-stem-canker 5 0 2 1 0 3 0
## 7 diaporthe-stem-canker 5 0 2 1 0 2 0
## 8 diaporthe-stem-canker 4 0 2 1 1 1 0
## 9 diaporthe-stem-canker 6 0 2 1 0 3 0
## 10 diaporthe-stem-canker 4 0 2 1 0 2 0
## sever seed.tmt germ plant.growth leaves leaf.halo leaf.shread leaf.malf stem
## 1 1 0 0 1 1 0 0 0 1
## 2 2 1 1 1 1 0 0 0 1
## 3 2 1 2 1 1 0 0 0 1
## 4 2 0 1 1 1 0 0 0 1
## 5 1 0 2 1 1 0 0 0 1
## 6 1 0 1 1 1 0 0 0 1
## 7 1 1 0 1 1 0 0 0 1
## 8 1 0 2 1 1 0 0 0 1
## 9 1 1 1 1 1 0 0 0 1
## 10 2 0 2 1 1 0 0 0 1
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay int.discolor
## 1 1 3 1 1 1 0
## 2 0 3 1 1 1 0
## 3 0 3 0 1 1 0
## 4 0 3 0 1 1 0
## 5 0 3 1 1 1 0
## 6 0 3 0 1 1 0
## 7 1 3 1 1 1 0
## 8 0 3 1 1 1 0
## 9 0 3 1 1 1 0
## 10 0 3 1 1 1 0
## fruit.pods fruit.spots seed mold.growth seed.size shriveling roots
## 1 0 4 0 0 0 0 0
## 2 0 4 0 0 0 0 0
## 3 0 4 0 0 0 0 0
## 4 0 4 0 0 0 0 0
## 5 0 4 0 0 0 0 0
## 6 0 4 0 0 0 0 0
## 7 0 4 0 0 0 0 0
## 8 0 4 0 0 0 0 0
## 9 0 4 0 0 0 0 0
## 10 0 4 0 0 0 0 0
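To double-check that the two filters removed 6 predictors in total, the remaining column names can be compared against the originals; a small sketch reusing removals and delbean from above:
# Names of the predictors dropped by the near-zero-variance and
# correlation filters
kept <- Soybean |>
  select(!all_of(removals) & !all_of(delbean)) |>
  names()
setdiff(names(Soybean), kept)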