All libraries needed for the Homework

library(fpp3)
## Warning: package 'fpp3' was built under R version 4.3.3
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## -- Attaching packages -------------------------------------------- fpp3 1.0.0 --
## v tibble      3.2.1     v tsibble     1.1.5
## v dplyr       1.1.2     v tsibbledata 0.4.1
## v tidyr       1.3.0     v feasts      0.3.2
## v lubridate   1.9.2     v fable       0.3.4
## v ggplot2     3.5.1     v fabletools  0.4.2
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'tsibble' was built under R version 4.3.3
## Warning: package 'tsibbledata' was built under R version 4.3.3
## Warning: package 'feasts' was built under R version 4.3.3
## Warning: package 'fabletools' was built under R version 4.3.3
## Warning: package 'fable' was built under R version 4.3.3
## -- Conflicts ------------------------------------------------- fpp3_conflicts --
## x lubridate::date()    masks base::date()
## x dplyr::filter()      masks stats::filter()
## x tsibble::intersect() masks base::intersect()
## x tsibble::interval()  masks lubridate::interval()
## x dplyr::lag()         masks stats::lag()
## x tsibble::setdiff()   masks base::setdiff()
## x tsibble::union()     masks base::union()
library(forecast)
## Warning: package 'forecast' was built under R version 4.3.3
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(tidyverse)
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v forcats 1.0.0     v readr   2.1.4
## v purrr   1.0.1     v stringr 1.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter()     masks stats::filter()
## x tsibble::interval() masks lubridate::interval()
## x dplyr::lag()        masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(lubridate)
library(tsibble)
library(pracma)
## Warning: package 'pracma' was built under R version 4.3.3
## 
## Attaching package: 'pracma'
## 
## The following object is masked from 'package:purrr':
## 
##     cross
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.3.3
library(corrplot)
## corrplot 0.92 loaded
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.1
## 
## Attaching package: 'e1071'
## 
## The following object is masked from 'package:pracma':
## 
##     sigmoid
## 
## The following object is masked from 'package:fabletools':
## 
##     interpolate
library(psych)
## Warning: package 'psych' was built under R version 4.3.1
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:pracma':
## 
##     logit, polar
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

3.1 - The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

  2. Do there appear to be any outliers in the data? Are any predictors skewed?

  3. Are there any relevant transformations of one or more predictors that might improve the classification model?

  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

# I will read in the glass data
data(Glass)

#I will create a histogram for each of the predictor variables to analyze their distributions
Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_histogram(bins = 20) + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Glass Predictor Variable Histograms")

#I will create a density plot for each of the predictor variables to analyze their distributions
Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_density() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Glass Predictor Variable density plots")

#I will create a correlation plot of the predictor variables to analyze their pairwise correlations
Glass %>%
  keep(is.numeric) %>%
  cor() %>%
  corrplot() 

Based on the histogram and density plots above, Na and Al have roughly normal distributions. K, Fe, RI, and Ca are right skewed, while Mg and Si are left skewed. Based on the correlation plot, there is a strong positive correlation between Ca and RI, and a negative correlation between Al and Mg.
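
To back up this reading of the correlation plot numerically, the pairwise correlations can also be ranked by magnitude. The short sketch below uses only the packages already loaded; the object name cor_mat and the head() cutoff are my own choices.

# Sketch: rank the pairwise correlations between numeric predictors by magnitude
cor_mat <- Glass %>%
  keep(is.numeric) %>%
  cor()

as.data.frame(as.table(cor_mat)) %>%
  rename(var1 = Var1, var2 = Var2, correlation = Freq) %>%
  filter(as.character(var1) < as.character(var2)) %>%  # keep each pair once, drop the diagonal
  arrange(desc(abs(correlation))) %>%
  head()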

  2. Do there appear to be any outliers in the data? Are any predictors skewed?

#I will now create a box plot to check for outliers. Only the numeric predictors are scaled,
#since dividing the factor column Type by a standard deviation is not meaningful.
Glass.norm <- Glass %>%
  keep(is.numeric) %>%
  mutate(across(everything(), ~ .x / sd(.x)))
Glass.norm %>%
  gather() %>% 
  ggplot(aes(x = key, y = value, color = key)) +
    geom_boxplot() +
    scale_y_continuous() +
    ggtitle("Boxplots of scaled Glass predictor variables to capture outliers")

#I will also plot the skewness of each numeric predictor to see which ones are skewed
Glass_Predictors <- describe(Glass %>% keep(is.numeric))
ggplot(Glass_Predictors, aes(x = row.names(Glass_Predictors), y = skew)) +
  geom_bar(stat = 'identity') +
  ggtitle("Skewness of Glass Predictors")

Based on the box plot, outliers appear in Ca, Na, and Si. In terms of skewness, the bar plot above shows that six predictors are noticeably skewed: Al, Ba, Ca, Fe, K, and RI.
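
As a rough numerical check on the box plot, the outliers can also be counted with the standard boxplot rule (points beyond 1.5 times the IQR). This is only a sketch, relying on base R's boxplot.stats().

# Sketch: count boxplot-rule outliers (beyond 1.5 * IQR) for each numeric predictor
Glass %>%
  keep(is.numeric) %>%
  summarise(across(everything(), ~ length(boxplot.stats(.x)$out)))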

  3. Are there any relevant transformations of one or more predictors that might improve the classification model?

I believe that to improve the classification model we need to address the outliers in Ca, Na, and Si as well as the skewed predictors. They are best addressed with a log transformation, which compresses extreme values and reduces skewness without removing observations and losing information.
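
A minimal sketch of what that transformation might look like, applied to the right-skewed predictors identified above (an offset of 1 is added because Ba, Fe, and K contain zeros, for which a plain log is undefined; the vector name skewed_vars is my own):

# Sketch: log-transform the right-skewed predictors (offset of 1 handles zeros)
skewed_vars <- c("RI", "K", "Ca", "Ba", "Fe")
Glass_trans <- Glass %>%
  mutate(across(all_of(skewed_vars), ~ log(.x + 1)))

# re-check skewness after the transformation (skewness() comes from e1071)
apply(Glass_trans %>% keep(is.numeric), 2, skewness)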

3.2 - The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

  2. Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

  3. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

# I will read in the Soybean data
data(Soybean)
# I will create a frequency bar plot for each of the categorical predictors
columns <- colnames(Soybean)

lapply(columns,
  function(col) {
    ggplot(Soybean, aes(x = .data[[col]])) +
      geom_bar() +
      ggtitle(col) +
      scale_x_discrete(guide = guide_axis(angle = 90))
  })
(The 36 frequency bar plots, one for each column of the Soybean data, are displayed here.)
A degenerate distribution is one that takes essentially a single value. Looking at the frequency distributions in the plots above, mycelium and sclerotia appear to be degenerate, with nearly all observations falling in a single level.
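
A quick numerical check of this (a sketch): for each predictor, compute the fraction of non-missing observations that fall in its single most common level; values close to 1 indicate a near-degenerate distribution.

# Sketch: proportion of non-missing observations in each predictor's most common level
Soybean %>%
  select(-Class) %>%
  sapply(function(x) max(table(x)) / sum(!is.na(x))) %>%
  sort(decreasing = TRUE) %>%
  head()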

  2. Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

#I will create a table listing how many missing values each predictor has, to try and identify a pattern. I will list the top 6 predictors.
Soybean  %>%
  mutate(across(everything(), is.na)) %>%
  pivot_longer(everything(),
               names_to = "variables", values_to = "missing") %>%
  count(variables, missing) %>%
  filter(missing == TRUE) %>%
  arrange(desc(n)) %>%
  head()
## # A tibble: 6 x 3
##   variables missing     n
##   <chr>     <lgl>   <int>
## 1 hail      TRUE      121
## 2 lodging   TRUE      121
## 3 seed.tmt  TRUE      121
## 4 sever     TRUE      121
## 5 germ      TRUE      112
## 6 leaf.mild TRUE      108

Based on the table above, there is a clear pattern: four of the predictors (hail, lodging, seed.tmt, and sever) each have exactly 121 missing values, which suggests they are missing together on the same observations; germ (112 missing) and leaf.mild (108 missing) are close behind.
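
To address whether the missingness is related to the classes, one could tabulate incomplete rows by Class; the sketch below is one way to do that (its output is not shown here).

# Sketch: count rows with at least one missing value, broken out by Class
Soybean %>%
  mutate(incomplete = !complete.cases(Soybean)) %>%
  count(Class, incomplete) %>%
  filter(incomplete) %>%
  arrange(desc(n))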

  3. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The strategy I would implement is imputation rather than eliminating predictors. Since the predictors are categorical, I would replace each missing value with the most common level of that predictor within the observed class (a mean is not defined for factors). I would not want to risk removing predictors, as they might give us useful information when the classes are compared. Keeping the heavily affected predictors, including the four that share 121 missing values, on an even playing field with the rest therefore makes the most sense; a sketch of this imputation follows.
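
A minimal sketch of that per-class mode imputation (the helper impute_mode and the object Soybean_imputed are my own names, and this is one reasonable choice rather than the only one):

# Sketch: impute each missing value with the most common level of that predictor within the same Class
impute_mode <- function(x) {
  # if a predictor is entirely missing within a class, this falls back to the factor's first level
  x[is.na(x)] <- names(which.max(table(x)))
  x
}

Soybean_imputed <- Soybean %>%
  group_by(Class) %>%
  mutate(across(everything(), impute_mode)) %>%
  ungroup()

# confirm that no missing values remain
sum(is.na(Soybean_imputed))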