DATA624 -HW4

library(mlbench)

## Warning: package 'mlbench' was built under R version 4.3.3

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.3

## Warning: package 'ggplot2' was built under R version 4.3.3

## Warning: package 'tidyr' was built under R version 4.3.2

## Warning: package 'readr' was built under R version 4.3.2

## Warning: package 'purrr' was built under R version 4.3.2

## Warning: package 'dplyr' was built under R version 4.3.2

## Warning: package 'stringr' was built under R version 4.3.2

## Warning: package 'lubridate' was built under R version 4.3.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

\[3.1\]

glass <- data(Glass)
head(Glass)

##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1

#a.

Glass %>% select(!Type) %>% gather() %>% ggplot(aes(value)) + geom_histogram(bins = 20) + facet_wrap(~key, scales = 'free')

Glass %>% select(!Type) %>% gather() %>% ggplot(aes(value)) + geom_boxplot() + facet_wrap(~key, scales = 'free')

# b. Do there appear to be any outliers in the data? Are any predictors skewed?

The histograms show us that several predictors are skewed. The boxplots show us that all predictors except ‘Mg’ have outliers.

#c. Are there any relevant transformations of one or more predictors that might improve the classification model?

For the predictors that show a skewness, a Box-Cox transformation would be applicable. For the predictors with outliers we can use a spatial sign transformation.

\[3.2\]

soybean <- data(Soybean)

#a.

Soybean %>%  select(!Class)%>%  drop_na() %>%  gather() %>% ggplot(aes(value)) +  geom_bar() +  facet_wrap(~ key)

## Warning: attributes are not identical across measure variables; they will be
## dropped

# Degenerate distributions are distributions where the variable primarily takes one value and others occur at a very low rate. Here we can say ‘mycelium’, ‘scleroita’, and ‘roots’ seem to be degenerate.

#b.

Soybean  %>%
  summarise_all(list(~is.na(.)))%>%
  pivot_longer(everything(),
               names_to = "variables", values_to="missing") %>%
  count(variables, missing) %>%
  ggplot(aes(y=variables,x=n,fill=missing))+
  geom_col()

## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## ℹ The deprecated feature was likely used in the dplyr package.
##   Please report the issue at <https://github.com/tidyverse/dplyr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Soybean %>% filter_all(any_vars(is.na(.))) %>% select(Class) %>% group_by(Class) %>% summarise(count = n())

## # A tibble: 5 × 2
##   Class                       count
##   <fct>                       <int>
## 1 2-4-d-injury                   16
## 2 cyst-nematode                  14
## 3 diaporthe-pod-&-stem-blight    15
## 4 herbicide-injury                8
## 5 phytophthora-rot               68

All missing data is contained within the five classes shown above.The predictors, it seems ‘hail’, ‘lodging’, ‘sever’ and ‘seed.tmt’ have the most missing values.

#c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Imputation is one way we can account for missing data without getting rid of it all together. For the imputation strategy we can fill the missing data with several values such as: maxia, minima, mean, or median. For this strategy filling in the missing data with the mean for that column would be the best approach.

DATA624 -HW4

CN

2024-09-14