DATA 624 HW2

library(tidyverse)
library(corrplot)
library(e1071)
library(caret)
library(gridExtra)
library(fpp2)

KJ 3.1

The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

First we’ll take a look at histograms of the predictors:

# subset the data
data.sub = Glass[,1:9] 

# plot histograms of the subset data
par(mfrow = c(3,3), mar = c(2, 2, 2, 2))
for (col in colnames(data.sub)) {
  hist(data.sub[, col], main=col)
}

None of the predictors are normally distributed. Let’s take a look at relationships between the predictors.

par(mar = c(1,1,1,1))
pairs(data.sub)

There appears to be a linear relationship between RI and Ca, but otherwise nothing jumps out. Let’s take a look at correlation matrix.

# store correlation matrix
corrplot(cor(data.sub), order = "hclust")

There is some collinearity between RI and Ca. There is some correlation between a few of the other variables: RI and Si, Mg and AI, and Ba and AI.

b) Do there appear to be any outliers in the data? Are any predictors skewed?

As we observed from the histograms, all of the predictors are skewed to some degree, but the worst offenders are Mg, which appears to be bimodal, and K, Ba, and Fe, which appear to be right-skewed. We can confirm this by using the skewness function:

skewValues = apply(data.sub, 2, skewness)
skewValues

##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107

As far as outliers, let’s take a look at some boxplots,

data.sub |>
  pivot_longer(cols=all_of(c(1:9)),
               names_to = "name",
               values_to = "value") |>
  ggplot(aes(value)) +
  geom_boxplot() +
  facet_wrap(~name, scales = "free")

There are a lot of outliers present in each predictor with the exception of Mg.

c) Are there any relevant transformations of one or more predictors that might improve the classification model?

A Box-Cox transformation, centering and scaling the predictors, and extracting principal components might improve the classificiation model.

trans <- preProcess(data.sub, 
                    method = c("BoxCox", "center", "scale", "pca"))
trans

## Created from 214 samples and 9 variables
## 
## Pre-processing:
##   - Box-Cox transformation (5)
##   - centered (9)
##   - ignored (0)
##   - principal component signal extraction (9)
##   - scaled (9)
## 
## Lambda estimates for Box-Cox transformation:
## -2, -0.1, 0.5, 2, -1.1
## PCA needed 7 components to capture 95 percent of the variance

Apply the transformations:

transformed = predict(trans, data.sub)
head(transformed[, 1:5])

##          PC1        PC2        PC3        PC4        PC5
## 1 -1.2126444 -0.3942139 -0.1730756 -1.7193852  0.1913387
## 2  0.6179073 -0.7020476 -0.5507034 -0.8575350  0.1566312
## 3  0.9907027 -0.8876886 -0.6452946 -0.3027716  0.1363025
## 4  0.1510212 -0.9042336 -0.1622361 -0.4521567  0.4291846
## 5  0.3582849 -1.0160965 -0.5763959 -0.1667831  0.3634192
## 6  0.3408017 -1.3565637  0.7451275  1.0568333 -1.7762845

Additionally, as most of the predictor variables are right-skewed, a log or square root transformation may make the distribution more symmetrical and reduce the effect of extreme variables.

KJ 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

data(Soybean) # from mlbench library
head(Soybean)

##                   Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker    6           0      2    1    0         1        1
## 2 diaporthe-stem-canker    4           0      2    1    0         2        0
## 3 diaporthe-stem-canker    3           0      2    1    0         1        0
## 4 diaporthe-stem-canker    3           0      2    1    0         1        0
## 5 diaporthe-stem-canker    6           0      2    1    0         2        0
## 6 diaporthe-stem-canker    5           0      2    1    0         3        0
##   sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1     1        0    0            1      1         0         2         2
## 2     2        1    1            1      1         0         2         2
## 3     2        1    2            1      1         0         2         2
## 4     2        0    1            1      1         0         2         2
## 5     1        0    2            1      1         0         2         2
## 6     1        0    1            1      1         0         2         2
##   leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1           0         0         0    1       1            3             1
## 2           0         0         0    1       0            3             1
## 3           0         0         0    1       0            3             0
## 4           0         0         0    1       0            3             0
## 5           0         0         0    1       0            3             1
## 6           0         0         0    1       0            3             0
##   fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1               1         1        0            0         0          0
## 2               1         1        0            0         0          0
## 3               1         1        0            0         0          0
## 4               1         1        0            0         0          0
## 5               1         1        0            0         0          0
## 6               1         1        0            0         0          0
##   fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1           4    0           0             0         0          0     0
## 2           4    0           0             0         0          0     0
## 3           4    0           0             0         0          0     0
## 4           4    0           0             0         0          0     0
## 5           4    0           0             0         0          0     0
## 6           4    0           0             0         0          0     0

a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

par(mfrow = c(9,4), mar = c(2,2,2,2))
for (col in colnames(Soybean)) {
  freq = table(Soybean[,col])
  barplot(freq, main = col, xlab = "", ylab = "")
}

We can see that a lot of the variables are degenerate, including:

leaves leaf.malf leaf.mild lodging mycelium
scelrotia
seed.size
shriveling

b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

col_sums = colSums(is.na(Soybean))

data.frame(variable = names(col_sums), missing = col_sums, row.names = NULL) |> 
  ggplot(aes(y=reorder(variable,missing), x=missing)) +
  geom_col() +
  ggtitle('missing values') + xlab('') + ylab('')

We see a bunch of predictors with more than 80 NAs, related to seed (seed.tmt,seed.size,seed.discolor,seed), leaf information (leaf.size,leaf.shread,leaf.mild,leaf.marg,leaf.malf,leaf.halo), fruit (fruiting.bodies,fruit.spots,fruit.pods), and a few others that are not of a clear category (shriveling,sever,mold.growth,lodging,hail,germ).

Checking to see how these NAs relate by class:

Soybean |>
  group_by(class = Class) |>
  summarize(total_na = sum(is.na(across(-Class))),
            pct_na=sum(is.na(across(-Class)))
            / n()) |>
  arrange(desc(pct_na))

## # A tibble: 19 × 3
##    class                       total_na pct_na
##    <fct>                          <int>  <dbl>
##  1 2-4-d-injury                     450   28.1
##  2 cyst-nematode                    336   24  
##  3 herbicide-injury                 160   20  
##  4 phytophthora-rot                1214   13.8
##  5 diaporthe-pod-&-stem-blight      177   11.8
##  6 alternarialeaf-spot                0    0  
##  7 anthracnose                        0    0  
##  8 bacterial-blight                   0    0  
##  9 bacterial-pustule                  0    0  
## 10 brown-spot                         0    0  
## 11 brown-stem-rot                     0    0  
## 12 charcoal-rot                       0    0  
## 13 diaporthe-stem-canker              0    0  
## 14 downy-mildew                       0    0  
## 15 frog-eye-leaf-spot                 0    0  
## 16 phyllosticta-leaf-spot             0    0  
## 17 powdery-mildew                     0    0  
## 18 purple-seed-stain                  0    0  
## 19 rhizoctonia-root-rot               0    0

We see that five of the 19 classes are responsible for all of the NAs. This indicates to me a likelihood that certain measurements are either applicable or not available to these classes. This should be taken into consideration when considering how to deal with these missing values.

c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The degenerate values should be removed as they offer no predictive value.

Because NA values only appear in 5 of the 14 classes, missingness is potentially informative in predicting a class. Two ways to deal with this while retaining the predictive value of the missingness is to 1) create an indicator variable that represents the presence or absence of missing values in a specific variable, or 2) create a missingness pattern variable that indicates whether a specific combination of variables is missing or not.

It’s also possible to apply pattern-based imputation using the mice or amelia packages in R.

HA 7.1

Consider the `pigs` series – the number of pigs slaughtered in Victoria each month.

a) Use the `ses()` function in R to find the optimal values of

\(α\) and \(l_0\)

###, and generate forecasts for the next four months.

data(pigs)
fc <- ses(pigs, h=4)
fc$model

## Simple exponential smoothing 
## 
## Call:
##  ses(y = pigs, h = 4) 
## 
##   Smoothing parameters:
##     alpha = 0.2971 
## 
##   Initial states:
##     l = 77260.0561 
## 
##   sigma:  10308.58
## 
##      AIC     AICc      BIC 
## 4462.955 4463.086 4472.665

a = fc$model$par[1] #alpha
l = fc$model$par[2] # l0

cat(sprintf("Alpha: %f\nl0: %f", a, l))

## Alpha: 0.297149
## l0: 77260.056146

autoplot(pigs) + 
  autolayer(fc) + 
  ggtitle('Forecast from Simple Exponential Smoothing') +
  xlab('Year') + ylab('Pigs Slaughtered')

The forecasts can be accessed by calling the summary function on the fc object, or by accessing the mean attribute of the fc object,

summary(fc)

## 
## Forecast method: Simple exponential smoothing
## 
## Model Information:
## Simple exponential smoothing 
## 
## Call:
##  ses(y = pigs, h = 4) 
## 
##   Smoothing parameters:
##     alpha = 0.2971 
## 
##   Initial states:
##     l = 77260.0561 
## 
##   sigma:  10308.58
## 
##      AIC     AICc      BIC 
## 4462.955 4463.086 4472.665 
## 
## Error measures:
##                    ME    RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 385.8721 10253.6 7961.383 -0.922652 9.274016 0.7966249 0.01282239
## 
## Forecasts:
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Sep 1995       98816.41 85605.43 112027.4 78611.97 119020.8
## Oct 1995       98816.41 85034.52 112598.3 77738.83 119894.0
## Nov 1995       98816.41 84486.34 113146.5 76900.46 120732.4
## Dec 1995       98816.41 83958.37 113674.4 76092.99 121539.8

fc$mean

##           Sep      Oct      Nov      Dec
## 1995 98816.41 98816.41 98816.41 98816.41

b) Compute a 95% prediction interval for the first forecast using the

\(\hat{y} \verb|+=| 1.96s\).

where s is the standard deviation of the residuals. Compare your interval with the interval produced by R.

s = sd(residuals(fc)) # standard deviation of residuals 
first.fc = fc$mean[1] # first forecast
ci.95 = c(Lower = first.fc - 1.96*s, Upper = first.fc + 1.96*s)
ci.95

##     Lower     Upper 
##  78679.97 118952.84

HA 7.2

Write your own function to implement simple exponential smoothing.

The function should take arguments y (the time series), alpha (the smoothing parameter \(α\)) and level (the initial level) \(l_0\). It should return the forecast of the next observation in the series. Does it give the same forecast as ses()

simple_exponential_smoothing = function(y, alpha, level) {

  for (i in 1:length(y)) {
    level = alpha * y[i] + (1 - alpha) * level
  }
  
  return(level[[1]])
}

first.fc == simple_exponential_smoothing(pigs, alpha = a, level = l)

## [1] TRUE

HA 7.3

Modify your function from the previous exercise to return the sum of squared errors rather than the forecast of the next observation. Then use the `optim()` function to find the optimal values of

\(α\) and \(l_0\)

Does it give the same forecast as ses()?

# optim requires a vector as first argument
sum_squared_errors = function(pars = c(alpha, level), y) {
  alpha = pars[1]
  level = pars[2]
  sse = 0 # initialize sum of squared errors
  for (i in 1:length(y)) {
    error = y[i] - level # calculate the error
    sse = sse + error^2 # sum the squared error
    level = alpha * y[i] + (1 - alpha) * level # update the level
  }
  
  return(sse[[1]])
}

sse.pars = optim(par = c(0, pigs[1]), y = pigs, fn = sum_squared_errors)$par

simple.exponential.smoothing = c(alpha = a[[1]], l0 = l[[1]])
sum.squared.errors = c(alpha = sse.pars[1], l0 = sse.pars[2])
rbind(simple.exponential.smoothing, sum.squared.errors)

##                                  alpha       l0
## simple.exponential.smoothing 0.2971488 77260.06
## sum.squared.errors           0.2970860 77272.08

The forecasts are nearly the same, we can see that \(l_0\) for simple exponential smoothing is slightly lower than for sum of squared errors.

DATA 624 HW2

DATA 624 HW2

KJ 3.1

The data can be accessed via:

a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

b) Do there appear to be any outliers in the data? Are any predictors skewed?

c) Are there any relevant transformations of one or more predictors that might improve the classification model?

KJ 3.2

The data can be loaded via:

a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

HA 7.1

Consider the pigs series – the number of pigs slaughtered in Victoria each month.

a) Use the ses() function in R to find the optimal values of

b) Compute a 95% prediction interval for the first forecast using the

where s is the standard deviation of the residuals. Compare your interval with the interval produced by R.

HA 7.2

Write your own function to implement simple exponential smoothing.

HA 7.3

Modify your function from the previous exercise to return the sum of squared errors rather than the forecast of the next observation. Then use the optim() function to find the optimal values of

Consider the `pigs` series – the number of pigs slaughtered in Victoria each month.

a) Use the `ses()` function in R to find the optimal values of

Modify your function from the previous exercise to return the sum of squared errors rather than the forecast of the next observation. Then use the `optim()` function to find the optimal values of