DATA 624 HW2
library(tidyverse)
library(corrplot)
library(e1071)
library(caret)
library(gridExtra)
library(fpp2)
KJ 3.1
The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
First we’ll take a look at histograms of the predictors:
# subset the data
= Glass[,1:9]
data.sub
# plot histograms of the subset data
par(mfrow = c(3,3), mar = c(2, 2, 2, 2))
for (col in colnames(data.sub)) {
hist(data.sub[, col], main=col)
}
None of the predictors are normally distributed. Let’s take a look at relationships between the predictors.
par(mar = c(1,1,1,1))
pairs(data.sub)
There appears to be a linear relationship between RI
and
Ca
, but otherwise nothing jumps out. Let’s take a look at
correlation matrix.
# store correlation matrix
corrplot(cor(data.sub), order = "hclust")
There is some collinearity between RI
and
Ca
. There is some correlation between a few of the other
variables: RI
and Si
, Mg
and
AI
, and Ba
and AI
.
b) Do there appear to be any outliers in the data? Are any predictors skewed?
As we observed from the histograms, all of the predictors are skewed
to some degree, but the worst offenders are Mg
, which
appears to be bimodal, and K
, Ba
, and
Fe
, which appear to be right-skewed. We can confirm this by
using the skewness
function:
= apply(data.sub, 2, skewness)
skewValues skewValues
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
As far as outliers, let’s take a look at some boxplots,
|>
data.sub pivot_longer(cols=all_of(c(1:9)),
names_to = "name",
values_to = "value") |>
ggplot(aes(value)) +
geom_boxplot() +
facet_wrap(~name, scales = "free")
There are a lot of outliers present in each predictor with the
exception of Mg
.
c) Are there any relevant transformations of one or more predictors that might improve the classification model?
A Box-Cox transformation, centering and scaling the predictors, and extracting principal components might improve the classificiation model.
<- preProcess(data.sub,
trans method = c("BoxCox", "center", "scale", "pca"))
trans
## Created from 214 samples and 9 variables
##
## Pre-processing:
## - Box-Cox transformation (5)
## - centered (9)
## - ignored (0)
## - principal component signal extraction (9)
## - scaled (9)
##
## Lambda estimates for Box-Cox transformation:
## -2, -0.1, 0.5, 2, -1.1
## PCA needed 7 components to capture 95 percent of the variance
Apply the transformations:
= predict(trans, data.sub)
transformed head(transformed[, 1:5])
## PC1 PC2 PC3 PC4 PC5
## 1 -1.2126444 -0.3942139 -0.1730756 -1.7193852 0.1913387
## 2 0.6179073 -0.7020476 -0.5507034 -0.8575350 0.1566312
## 3 0.9907027 -0.8876886 -0.6452946 -0.3027716 0.1363025
## 4 0.1510212 -0.9042336 -0.1622361 -0.4521567 0.4291846
## 5 0.3582849 -1.0160965 -0.5763959 -0.1667831 0.3634192
## 6 0.3408017 -1.3565637 0.7451275 1.0568333 -1.7762845
Additionally, as most of the predictor variables are right-skewed, a log or square root transformation may make the distribution more symmetrical and reduce the effect of extreme variables.
KJ 3.2
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
data(Soybean) # from mlbench library
head(Soybean)
## Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker 6 0 2 1 0 1 1
## 2 diaporthe-stem-canker 4 0 2 1 0 2 0
## 3 diaporthe-stem-canker 3 0 2 1 0 1 0
## 4 diaporthe-stem-canker 3 0 2 1 0 1 0
## 5 diaporthe-stem-canker 6 0 2 1 0 2 0
## 6 diaporthe-stem-canker 5 0 2 1 0 3 0
## sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1 1 0 0 1 1 0 2 2
## 2 2 1 1 1 1 0 2 2
## 3 2 1 2 1 1 0 2 2
## 4 2 0 1 1 1 0 2 2
## 5 1 0 2 1 1 0 2 2
## 6 1 0 1 1 1 0 2 2
## leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1 0 0 0 1 1 3 1
## 2 0 0 0 1 0 3 1
## 3 0 0 0 1 0 3 0
## 4 0 0 0 1 0 3 0
## 5 0 0 0 1 0 3 1
## 6 0 0 0 1 0 3 0
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1 1 1 0 0 0 0
## 2 1 1 0 0 0 0
## 3 1 1 0 0 0 0
## 4 1 1 0 0 0 0
## 5 1 1 0 0 0 0
## 6 1 1 0 0 0 0
## fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1 4 0 0 0 0 0 0
## 2 4 0 0 0 0 0 0
## 3 4 0 0 0 0 0 0
## 4 4 0 0 0 0 0 0
## 5 4 0 0 0 0 0 0
## 6 4 0 0 0 0 0 0
a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
par(mfrow = c(9,4), mar = c(2,2,2,2))
for (col in colnames(Soybean)) {
= table(Soybean[,col])
freq barplot(freq, main = col, xlab = "", ylab = "")
}
We can see that a lot of the variables are degenerate, including:
leaves
leaf.malf
leaf.mild
lodging
mycelium
scelrotia
seed.size
shriveling
c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
The degenerate values should be removed as they offer no predictive value.
Because NA values only appear in 5 of the 14 classes, missingness is potentially informative in predicting a class. Two ways to deal with this while retaining the predictive value of the missingness is to 1) create an indicator variable that represents the presence or absence of missing values in a specific variable, or 2) create a missingness pattern variable that indicates whether a specific combination of variables is missing or not.
It’s also possible to apply pattern-based imputation using the
mice
or amelia
packages in R.
HA 7.1
Consider the pigs
series – the number of pigs
slaughtered in Victoria each month.
a) Use the ses()
function in R to find the optimal
values of
\(α\) and \(l_0\)
###, and generate forecasts for the next four months.
data(pigs)
<- ses(pigs, h=4)
fc $model fc
## Simple exponential smoothing
##
## Call:
## ses(y = pigs, h = 4)
##
## Smoothing parameters:
## alpha = 0.2971
##
## Initial states:
## l = 77260.0561
##
## sigma: 10308.58
##
## AIC AICc BIC
## 4462.955 4463.086 4472.665
= fc$model$par[1] #alpha
a = fc$model$par[2] # l0
l
cat(sprintf("Alpha: %f\nl0: %f", a, l))
## Alpha: 0.297149
## l0: 77260.056146
autoplot(pigs) +
autolayer(fc) +
ggtitle('Forecast from Simple Exponential Smoothing') +
xlab('Year') + ylab('Pigs Slaughtered')
The forecasts can be accessed by calling the summary
function on the fc
object, or by accessing the
mean
attribute of the fc
object,
summary(fc)
##
## Forecast method: Simple exponential smoothing
##
## Model Information:
## Simple exponential smoothing
##
## Call:
## ses(y = pigs, h = 4)
##
## Smoothing parameters:
## alpha = 0.2971
##
## Initial states:
## l = 77260.0561
##
## sigma: 10308.58
##
## AIC AICc BIC
## 4462.955 4463.086 4472.665
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 385.8721 10253.6 7961.383 -0.922652 9.274016 0.7966249 0.01282239
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Sep 1995 98816.41 85605.43 112027.4 78611.97 119020.8
## Oct 1995 98816.41 85034.52 112598.3 77738.83 119894.0
## Nov 1995 98816.41 84486.34 113146.5 76900.46 120732.4
## Dec 1995 98816.41 83958.37 113674.4 76092.99 121539.8
$mean fc
## Sep Oct Nov Dec
## 1995 98816.41 98816.41 98816.41 98816.41
b) Compute a 95% prediction interval for the first forecast using the
\(\hat{y} \verb|+=| 1.96s\).
where s is the standard deviation of the residuals. Compare your interval with the interval produced by R.
= sd(residuals(fc)) # standard deviation of residuals
s = fc$mean[1] # first forecast
first.fc .95 = c(Lower = first.fc - 1.96*s, Upper = first.fc + 1.96*s)
ci.95 ci
## Lower Upper
## 78679.97 118952.84
HA 7.2
Write your own function to implement simple exponential smoothing.
The function should take arguments y
(the time
series), alpha
(the smoothing parameter \(α\)) and level
(the initial
level) \(l_0\). It should return the
forecast of the next observation in the series. Does it give the same
forecast as ses()
= function(y, alpha, level) {
simple_exponential_smoothing
for (i in 1:length(y)) {
= alpha * y[i] + (1 - alpha) * level
level
}
return(level[[1]])
}
== simple_exponential_smoothing(pigs, alpha = a, level = l) first.fc
## [1] TRUE
HA 7.3
Modify your function from the previous exercise to return the sum of
squared errors rather than the forecast of the next observation. Then
use the optim()
function to find the optimal values of
\(α\) and \(l_0\)
Does it give the same forecast as
ses()
?
# optim requires a vector as first argument
= function(pars = c(alpha, level), y) {
sum_squared_errors = pars[1]
alpha = pars[2]
level = 0 # initialize sum of squared errors
sse for (i in 1:length(y)) {
= y[i] - level # calculate the error
error = sse + error^2 # sum the squared error
sse = alpha * y[i] + (1 - alpha) * level # update the level
level
}
return(sse[[1]])
}
= optim(par = c(0, pigs[1]), y = pigs, fn = sum_squared_errors)$par
sse.pars
= c(alpha = a[[1]], l0 = l[[1]])
simple.exponential.smoothing = c(alpha = sse.pars[1], l0 = sse.pars[2])
sum.squared.errors rbind(simple.exponential.smoothing, sum.squared.errors)
## alpha l0
## simple.exponential.smoothing 0.2971488 77260.06
## sum.squared.errors 0.2970860 77272.08
The forecasts are nearly the same, we can see that \(l_0\) for simple exponential smoothing is slightly lower than for sum of squared errors.