library(fpp3)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tibble 3.2.1 ✔ tsibble 1.1.6
## ✔ dplyr 1.1.4 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.1 ✔ feasts 0.4.1
## ✔ lubridate 1.9.3 ✔ fable 0.4.1
## ✔ ggplot2 3.5.1
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ purrr 1.0.2 ✔ stringr 1.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(mlbench)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
##
## The following objects are masked from 'package:fabletools':
##
## MAE, RMSE
library(corrplot)
## corrplot 0.95 loaded
The UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/index.html) contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
# Faceted histograms of every predictor, with fill showing glass Type
Glass |>
  melt() |>
  ggplot() +
  geom_histogram(aes(x = value, fill = Type)) +
  facet_wrap(~variable, scales = "free")
## Using Type as id variables
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Glass |>
melt() |>
filter(variable == "K", value != 0) |>
ggplot() +
geom_histogram(aes(x=value), binwidth = .1)
## Using Type as id variables
Glass |>
melt() |>
filter(variable == "Ba", value != 0) |>
ggplot() +
geom_histogram(aes(x=value), binwidth = .1)
## Using Type as id variables
Glass |>
melt() |>
filter(variable == "Fe", value != 0) |>
ggplot() +
geom_histogram(aes(x=value), binwidth = .05)
## Using Type as id variables
Glass |>
ggpairs()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. (repeated for each panel)
There are outliers for a large number of predictors; in fact, most of the predictors seem to have outliers. For Mg, K, Ba and Fe there are a large number of 0 measurements, which may be valid values or may be errors in the data. Either way, it makes me wary of whether those values are correct. Fe and Ba are mostly 0 with a few small non-zero values, which probably means those elements are rarer in these glass samples than the rest of the elements.
None of the predictors are perfectly normally distributed, but some, like Si, Ca and Mg, are fairly close, while RI, Na, Al and Fe (after removing the 0 values) are skewed. Ba doesn't have enough non-zero values to show any meaningful distribution. K has an interesting distribution with a secondary, smaller peak closer to zero.
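To put a number on those skew claims, the sample skewness of each predictor can be computed; a minimal sketch, assuming the e1071 package (not loaded above) is installed:
library(e1071)
# Sample skewness per predictor: values near 0 are roughly symmetric,
# larger magnitudes match the skew visible in the histograms
Glass |>
  select(!Type) |>
  sapply(skewness) |>
  sort() |>
  round(2)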
PCA and the preProcess function with the BoxCox method would work well for this data. Most of the predictors are already fairly close to normal, and these transformations remove some of the remaining skew, normalizing the data into a cleaner dataset for modeling. A PCA alone would produce a reasonable set of transformed predictors, but with preProcess we also get a smaller set of predictors, which reduces the computational effort of whatever model we fit later since there is less data to consider.
# For each of the nine predictors (in column order RI through Fe), drop
# the 0 measurements, estimate the Box-Cox lambda with caret::BoxCoxTrans(),
# and plot the transformed (or raw, if lambda could not be estimated) values
for (x in Glass[1:9]){
  x <- x[x != 0]
  lambda <- BoxCoxTrans(x)$lambda
  print(lambda)
  if (!is.na(lambda)){
    print(ggplot(data.frame(box_cox(x, lambda))) +
      geom_histogram(aes(box_cox(x, lambda))) +
      ggtitle(paste("Lambda ", lambda)))
  } else {
    print(ggplot(data.frame(x)) +
      geom_histogram(aes(x)) +
      ggtitle("Lambda NA"))
  }
}
## [1] -2
## [1] -0.1
## [1] 2
## [1] 0.5
## [1] 2
## [1] 0.2
## [1] -1.1
## [1] 0.4
## [1] 0.5
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. (repeated after each histogram)
pcaGlass <- Glass|>
select(!Type) |>
prcomp(center = TRUE, scale. = TRUE)
pcaGlass
## Standard deviations (1, .., p=9):
## [1] 1.58466518 1.43180731 1.18526115 1.07604017 0.95603465 0.72638502 0.60741950
## [8] 0.25269141 0.04011007
##
## Rotation (n x k) = (9 x 9):
## PC1 PC2 PC3 PC4 PC5 PC6
## RI -0.5451766 0.28568318 0.0869108293 0.14738099 -0.073542700 0.11528772
## Na 0.2581256 0.27035007 -0.3849196197 0.49124204 0.153683304 -0.55811757
## Mg -0.1108810 -0.59355826 0.0084179590 0.37878577 0.123509124 0.30818598
## Al 0.4287086 0.29521154 0.3292371183 -0.13750592 0.014108879 -0.01885731
## Si 0.2288364 -0.15509891 -0.4587088382 -0.65253771 0.008500117 0.08609797
## K 0.2193440 -0.15397013 0.6625741197 -0.03853544 -0.307039842 -0.24363237
## Ca -0.4923061 0.34537980 -0.0009847321 -0.27644322 -0.188187742 -0.14866937
## Ba 0.2503751 0.48470218 0.0740547309 0.13317545 0.251334261 0.65721884
## Fe -0.1858415 -0.06203879 0.2844505524 -0.23049202 0.873264047 -0.24304431
## PC7 PC8 PC9
## RI -0.08186724 -0.75221590 -0.02573194
## Na -0.14858006 -0.12769315 0.31193718
## Mg 0.20604537 -0.07689061 0.57727335
## Al 0.69923557 -0.27444105 0.19222686
## Si -0.21606658 -0.37992298 0.29807321
## K -0.50412141 -0.10981168 0.26050863
## Ca 0.09913463 0.39870468 0.57932321
## Ba -0.35178255 0.14493235 0.19822820
## Fe -0.07372136 -0.01627141 0.01466944
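To see how much of the variance each component actually captures, summary() on the prcomp object reports the individual and cumulative proportions:
# Proportion of variance explained, per component and cumulative
summary(pcaGlass)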
# Box-Cox to reduce skew, center and scale, then PCA, in one pipeline
tfm <- preProcess(Glass, method = c("BoxCox", "center", "scale", "pca"))
pdt <- predict(tfm, Glass)
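As a quick sanity check on the dimensionality-reduction claim: caret's pca method keeps enough components to explain 95% of the variance by default, so the transformed data should have fewer columns than the nine original predictors.
# Compare the predictor count before and after preprocessing
dim(Glass)
dim(pdt)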
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:
data(Soybean)
A large number of these predictors have missing data, which is the biggest impact on their distributions. Since these predictors are categorical they don't necessarily have a normal distribution, so the only major issues are missing data and uneven distributions (which may not even be an issue, just an artifact of the data itself).
# Bar chart of the level counts for every column in the Soybean data
for (nm in names(Soybean)){
  print(ggplot(Soybean, aes(x = .data[[nm]])) +
    geom_bar() +
    ggtitle(nm))
}
By counting the number of missing entries per predictor and sorting them, we can see a lot of grouping: many predictors are missing exactly the same number of entries, which probably means there is an underlying issue with the data collection or the values were purposely left blank. Since the table is sorted, you can see that the predictors most likely to have missing data are hail, lodging, seed.tmt and sever.
# Overall count and percentage of missing values across all predictors
Soybean |>
  select(!Class) |>
  as.list() |>
  melt() |>
  summarise(numNA = sum(is.na(value)),
            n = n(),
            percentNA = round(numNA / n * 100, 2)) |>
  print()
## numNA n percentNA
## 1 2337 23905 9.78
# Missing values per predictor (count and percent of the 683 samples),
# sorted with the most incomplete predictors first
Soybean |>
  select(!Class) |>
  as.list() |>
  melt() |>
  rename('predictors' = 'L1') |>
  group_by(predictors) |>
  summarise(numNA = sum(is.na(value)),
            percentNA = round(numNA / 683 * 100, 2)) |>
  arrange(desc(numNA)) |>
  head(10) |>
  print()
## # A tibble: 10 × 3
## predictors numNA percentNA
## <chr> <int> <dbl>
## 1 hail 121 17.7
## 2 lodging 121 17.7
## 3 seed.tmt 121 17.7
## 4 sever 121 17.7
## 5 germ 112 16.4
## 6 leaf.mild 108 15.8
## 7 fruit.spots 106 15.5
## 8 fruiting.bodies 106 15.5
## 9 seed.discolor 106 15.5
## 10 shriveling 106 15.5
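One way to probe whether those blanks are systematic is to check whether the incomplete rows cluster in particular classes; a minimal sketch (incomplete is just a helper flag introduced here):
# Rows with at least one NA, counted per Class
Soybean |>
  mutate(incomplete = !complete.cases(Soybean)) |>
  group_by(Class) |>
  summarise(n = n(), n_incomplete = sum(incomplete)) |>
  arrange(desc(n_incomplete)) |>
  head(5)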
I've chosen to eliminate some irrelevant predictors by finding those with near-zero variance and those that are strongly correlated with other predictors. Imputing categorical data with this many categories seemed like a difficult endeavor considering I don't have the faintest clue what any of the data means; with more context into the data and how it works, I would be much more likely to try imputing it into a more useful form. That said, I converted the data to numeric to look for highly correlated predictors. While I know this approach has its own issues, I was unable to find another way to measure correlations for this data, so I removed the predictors with high correlations to others, eliminating 6 predictors in total.
# Flag near-zero-variance predictors by column index
removals <- Soybean |>
  nearZeroVar()
# Convert the remaining predictors to numeric to get a correlation matrix
lessbeans <- Soybean |>
  select(!all_of(removals) & !Class) |>
  sapply(as.numeric)
beancor <- lessbeans |>
  cor(use = "complete.obs")
corrplot(beancor, order = 'hclust')
# Select the highly correlated predictors by name, since the indices into
# beancor (which excludes Class and the near-zero-variance columns) do not
# line up with column positions in Soybean
delbean <- colnames(beancor)[findCorrelation(beancor, cutoff = .75)]
Soybean |>
  select(!all_of(removals) & !all_of(delbean)) |>
  head(10) |>
  print()
## Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker 6 0 2 1 0 1 1
## 2 diaporthe-stem-canker 4 0 2 1 0 2 0
## 3 diaporthe-stem-canker 3 0 2 1 0 1 0
## 4 diaporthe-stem-canker 3 0 2 1 0 1 0
## 5 diaporthe-stem-canker 6 0 2 1 0 2 0
## 6 diaporthe-stem-canker 5 0 2 1 0 3 0
## 7 diaporthe-stem-canker 5 0 2 1 0 2 0
## 8 diaporthe-stem-canker 4 0 2 1 1 1 0
## 9 diaporthe-stem-canker 6 0 2 1 0 3 0
## 10 diaporthe-stem-canker 4 0 2 1 0 2 0
## sever seed.tmt germ plant.growth leaves leaf.halo leaf.shread leaf.malf stem
## 1 1 0 0 1 1 0 0 0 1
## 2 2 1 1 1 1 0 0 0 1
## 3 2 1 2 1 1 0 0 0 1
## 4 2 0 1 1 1 0 0 0 1
## 5 1 0 2 1 1 0 0 0 1
## 6 1 0 1 1 1 0 0 0 1
## 7 1 1 0 1 1 0 0 0 1
## 8 1 0 2 1 1 0 0 0 1
## 9 1 1 1 1 1 0 0 0 1
## 10 2 0 2 1 1 0 0 0 1
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay int.discolor
## 1 1 3 1 1 1 0
## 2 0 3 1 1 1 0
## 3 0 3 0 1 1 0
## 4 0 3 0 1 1 0
## 5 0 3 1 1 1 0
## 6 0 3 0 1 1 0
## 7 1 3 1 1 1 0
## 8 0 3 1 1 1 0
## 9 0 3 1 1 1 0
## 10 0 3 1 1 1 0
## fruit.pods fruit.spots seed mold.growth seed.size shriveling roots
## 1 0 4 0 0 0 0 0
## 2 0 4 0 0 0 0 0
## 3 0 4 0 0 0 0 0
## 4 0 4 0 0 0 0 0
## 5 0 4 0 0 0 0 0
## 6 0 4 0 0 0 0 0
## 7 0 4 0 0 0 0 0
## 8 0 4 0 0 0 0 0
## 9 0 4 0 0 0 0 0
## 10 0 4 0 0 0 0 0
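To double-check that the two filters removed 6 predictors in total, the remaining column names can be compared against the originals; a small sketch reusing removals and delbean from above:
# Names of the predictors dropped by the near-zero-variance and
# correlation filters
kept <- Soybean |>
  select(!all_of(removals) & !all_of(delbean)) |>
  names()
setdiff(names(Soybean), kept)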