Assignment 4

Daniel DeBonis

library(AppliedPredictiveModeling)
library(mlbench)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(missForest)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Question 3.1

a. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

A series of histograms can help us visualize these distributions:

Glass |>
  keep(is.numeric) |>
  gather() |>
  ggplot(aes(value)) +
  geom_histogram() +
  facet_wrap(~ key, scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

A correlation matrix can help quantify the pairwise relationships between predictors.

glass_cor <- cor(Glass |>
                   keep(is.numeric))
print(glass_cor)
##               RI          Na           Mg          Al          Si            K
## RI  1.0000000000 -0.19188538 -0.122274039 -0.40732603 -0.54205220 -0.289832711
## Na -0.1918853790  1.00000000 -0.273731961  0.15679367 -0.06980881 -0.266086504
## Mg -0.1222740393 -0.27373196  1.000000000 -0.48179851 -0.16592672  0.005395667
## Al -0.4073260341  0.15679367 -0.481798509  1.00000000 -0.00552372  0.325958446
## Si -0.5420521997 -0.06980881 -0.165926723 -0.00552372  1.00000000 -0.193330854
## K  -0.2898327111 -0.26608650  0.005395667  0.32595845 -0.19333085  1.000000000
## Ca  0.8104026963 -0.27544249 -0.443750026 -0.25959201 -0.20873215 -0.317836155
## Ba -0.0003860189  0.32660288 -0.492262118  0.47940390 -0.10215131 -0.042618059
## Fe  0.1430096093 -0.24134641  0.083059529 -0.07440215 -0.09420073 -0.007719049
##            Ca            Ba           Fe
## RI  0.8104027 -0.0003860189  0.143009609
## Na -0.2754425  0.3266028795 -0.241346411
## Mg -0.4437500 -0.4922621178  0.083059529
## Al -0.2595920  0.4794039017 -0.074402151
## Si -0.2087322 -0.1021513105 -0.094200731
## K  -0.3178362 -0.0426180594 -0.007719049
## Ca  1.0000000 -0.1128409671  0.124968219
## Ba -0.1128410  1.0000000000 -0.058691755
## Fe  0.1249682 -0.0586917554  1.000000000

Looking at the values in the correlation matrix, we can see that the majority of these variables are not highly correlated. The most prominent exception is the high correlation between Calcium and Refractive Index (.8104). Other noteworthy correlations exist between Silicon and Refractive Index (-.5421), Barium and Magnesium (-.4922), Aluminum and Magnesium (-.4818), Barium and Aluminum (.4794), and Calcium and Magnesium (-.4438).
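
A heatmap of the same matrix makes the strongly correlated pairs easier to spot; a minimal sketch using only the tidyverse packages already loaded:

glass_cor |>
  as.data.frame() |>
  rownames_to_column("var1") |>
  pivot_longer(-var1, names_to = "var2", values_to = "r") |>
  ggplot(aes(var1, var2, fill = r)) +
  geom_tile() +
  # a diverging scale centered at zero shows both sign and strength
  scale_fill_gradient2(limits = c(-1, 1)) +
  labs(x = NULL, y = NULL, fill = "correlation")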

b. Do there appear to be any outliers in the data? Are any predictors skewed?

Glass |>
  keep(is.numeric) |>
  gather() |>
  ggplot(aes(value)) +
  geom_boxplot() +
  facet_wrap(~ key, scales = "free")

The distributions of Aluminum, Sodium, and Silicon are the closest to normal among our predictors. Barium, Calcium, Iron, and Potassium appear to have right-skewed distributions. Iron and Barium appear this way due to the large number of zeroes in the data. The same pattern appears in Magnesium, producing a bimodal distribution with one mode at zero and another around 3.5. The clearest outliers are found in the Potassium, Iron, and Sodium distributions. However, given the individual points plotted in every box plot except Magnesium's, many observations across the data could be considered outliers.
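
To back this visual impression up numerically, skewness statistics can be computed for each predictor; a quick sketch, assuming the e1071 package (not loaded above) is installed:

# skewness > 0 indicates a right-skewed distribution
Glass |>
  keep(is.numeric) |>
  map_dbl(e1071::skewness) |>
  sort(decreasing = TRUE)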

c. Are there any relevant transformations of one or more predictors that might improve the classification model?

A Box-Cox transformation would be relevant for the skewed variables mentioned above, though it applies only to strictly positive values, so the zero-heavy predictors such as Barium and Iron would need a small shift or a Yeo-Johnson transformation instead. The box plot is also helpful for interpreting Magnesium, a variable a Box-Cox transformation would not help: there are many samples in our data set with no magnesium present at all, yet the box plot shows that roughly 75% of samples have at least some magnesium present (above a level of 2).
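
As a sketch of how such a transformation could be applied, caret's preProcess() can estimate and apply it (caret is assumed to be installed; it is not loaded above). Yeo-Johnson is used here because it tolerates the zero values:

# estimate a Yeo-Johnson transformation for each numeric predictor
glass_pp <- caret::preProcess(Glass |> select(-Type), method = "YeoJohnson")
# apply the estimated transformations to the predictors
glass_trans <- predict(glass_pp, Glass |> select(-Type))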

Question 3.2

data(Soybean)
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

a. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Soybean |>
  select(!Class) |>
  drop_na() |>
  gather() |>
  ggplot(aes(value)) +
  geom_bar() +
  facet_wrap(~key)
## Warning: attributes are not identical across measure variables; they will be
## dropped

By looking at the frequency distributions, we can identify degenerate distributions in the variables ‘mycelium’, ‘roots’, ‘seed.size’, ‘shriveling’, and ‘sclerotia’. Others that could also represent degenerate distributions are ‘seed.discolor’, ‘leaf.malf’, and ‘lodging’.
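
The same question can be checked numerically with caret's near-zero-variance diagnostic; a minimal sketch, assuming caret is installed (it is not loaded above):

# predictors flagged as having near-zero-variance (degenerate) distributions
nzv <- caret::nearZeroVar(Soybean |> select(!Class), saveMetrics = TRUE)
nzv[nzv$nzv, ]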

c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The NA values are only present in five of our classes, so there is a chance that there is a structural reason for the missing values. A botanist or biologist could help explain whether that is the case. On the other hand, some of these variables, such as ‘hail’, a binary indicator of whether or not hail occurred, do not suggest an obvious structural reason for being missing. If there is indeed no structural reason for these values, one option is eliminating the classes with missing data from further analysis, but this is inadvisable. Even if the missingness is not structural, there may still be some reason behind certain classes having NA values that could be relevant to our analysis. In this case, I would lean toward imputation, particularly using the most frequent level of each predictor (the categorical analogue of a median). Ideally, we could use a more sophisticated method, such as a random forest algorithm, to predict the missing values.
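
Before imputing, a quick tabulation of where the NA values fall confirms that they are concentrated in a handful of classes; this uses only the packages already loaded:

# missing values per predictor
colSums(is.na(Soybean))

# missing values per class (only classes with any NAs are shown)
Soybean |>
  mutate(n_missing = rowSums(is.na(across(!Class)))) |>
  count(Class, wt = n_missing, name = "total_missing") |>
  filter(total_missing > 0)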

set.seed(241241)
Soyforest <- missForest(Soybean)

After imputation, we have no more missing data. The out-of-bag (OOB) error is 0.1051 (the proportion of falsely classified entries, since all of the predictors are categorical), which indicates that the random forest imputations are fairly accurate.
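
The imputed data frame is stored in the $ximp element of the missForest result and the estimated imputation error in $OOBerror; a quick check that no NA values remain:

# pull out the completed data set and verify it has no NAs
Soybean_imputed <- Soyforest$ximp
sum(is.na(Soybean_imputed))
# out-of-bag estimate of the imputation error
Soyforest$OOBerror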