library(ggplot2)
library(ggfortify)
library(gridExtra)
library(forecast)
library(fpp2)
library(fma)
library(kableExtra)
library(e1071)
library(mlbench)
library(ggcorrplot)
library(DataExplorer)
library(timeDate)
library(caret)
library(tibble)
library(dplyr)
library(tidyverse)
library(tidyr)Use the help function to explore what the series gold, woolyrnq and gas represent.
a) Use autoplot() to plot each of these in separate plots.
autoplot(gold) +
ggtitle("Daily morning gold prices") +
xlab("Days since 1/1/1985") +
ylab("US dollars")autoplot(woolyrnq) +
ggtitle("Quarterly Wool Yarn Production in Australia") +
xlab("Days since 1/1/1985") +
ylab("US dollars")autoplot(gas) +
ggtitle("Monthly Gas Production in") +
xlab("Days since 1/1/1985") +
ylab("US dollars")b) What is the frequency of each series? Hint: apply the frequency() function.
## [1] 1
## [1] 4
## [1] 12
c) Use which.max() to spot the outlier in the gold series. Which observation was it?
## [1] 770
The outlier is the day the starting price of gold was $770.
Download some monthly Australian retail data from the book website. These represent retail sales in various categories for different Australian states, and are stored in a MS-Excel file.
You can read the data into R with the following script:
retaildata <- readxl::read_excel('/Users/aaronzalki/Downloads/retail.xlsx', skip=1)
#View(retaildata)a) Select one of the time series as follows (but replace the column name with your own chosen column):
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1982 63.9 64.0 62.7 65.6 62.6 64.4 66.0 65.3 77.9
## 1983 65.1 62.3 65.7 61.9 63.7 64.9 69.5 67.9 67.5 66.0 67.2 78.4
## 1984 69.2 68.2 70.7 67.4 71.0 67.9 67.4 70.7 68.1 73.5 75.2 81.9
## 1985 74.7 68.7 72.9 73.1 78.3 73.2 78.8 80.7 77.5 86.0 87.4 98.2
## 1986 85.4 76.5 83.7 84.9 91.9 86.7 90.5 91.3 89.6 93.0 92.3 107.4
## 1987 91.2 87.0 88.9 91.4 93.2 91.3 98.5 97.4 97.3 99.8 96.9 111.3
## 1988 100.1 95.0 101.4 96.2 97.8 91.3 94.6 95.4 98.3 101.4 100.7 120.3
## 1989 106.2 97.3 105.4 100.6 103.9 107.4 106.9 109.6 108.8 110.6 113.8 134.4
## 1990 121.4 114.6 123.5 118.8 123.6 127.5 125.9 130.1 116.9 123.3 125.3 136.5
## 1991 121.5 112.3 121.6 120.2 124.1 120.1 124.5 130.2 119.1 127.7 127.0 141.8
## 1992 133.5 125.9 135.1 135.1 137.7 129.5 134.2 134.9 132.7 138.3 134.9 153.8
## 1993 130.4 122.0 125.9 130.6 128.5 126.2 133.9 127.6 127.3 134.3 132.4 156.4
## 1994 145.7 131.7 143.9 137.1 136.1 127.8 130.5 135.7 138.5 153.9 151.6 179.2
## 1995 160.9 157.0 165.5 164.4 160.8 156.9 158.5 169.3 164.3 184.2 192.0 219.4
## 1996 168.0 166.6 170.2 163.3 167.9 174.3 169.1 173.1 164.1 178.5 172.1 188.8
## 1997 165.1 149.5 161.0 165.3 165.1 161.4 201.3 201.0 200.3 207.4 204.7 240.8
## 1998 233.7 207.8 214.0 218.9 219.3 212.4 216.7 211.9 218.6 223.4 209.4 247.8
## 1999 219.7 206.9 218.4 226.4 217.0 215.4 227.7 223.5 228.3 230.6 226.0 284.5
## 2000 185.8 183.1 186.9 176.9 172.2 176.8 169.9 171.9 169.3 153.4 150.9 174.9
## 2001 151.0 142.9 157.5 179.2 172.6 173.0 188.3 191.7 189.3 218.8 218.1 243.0
## 2002 230.5 210.4 216.0 212.6 226.1 213.6 272.8 268.3 272.3 251.0 249.8 289.3
## 2003 241.6 220.6 235.7 216.9 212.1 203.5 232.7 228.2 223.8 240.7 242.7 263.7
## 2004 231.3 215.4 228.6 234.5 221.4 218.5 224.6 228.2 226.1 236.4 248.1 296.5
## 2005 239.3 225.1 247.7 244.5 242.9 238.6 245.9 251.9 246.8 267.2 271.7 316.0
## 2006 287.0 265.7 291.5 268.2 261.7 253.6 264.8 273.6 269.1 289.6 289.6 341.9
## 2007 304.6 288.8 311.3 307.1 304.0 279.5 300.1 308.5 296.2 276.8 272.0 322.6
## 2008 239.7 229.9 241.3 223.8 234.7 221.4 238.0 250.0 243.7 250.6 254.7 320.6
## 2009 236.1 222.5 243.3 241.6 238.4 230.6 237.1 245.2 241.0 266.1 275.6 342.0
## 2010 255.4 215.5 229.6 214.4 224.5 205.6 198.7 203.3 203.2 214.4 201.4 251.8
## 2011 205.8 194.6 200.4 196.2 185.9 183.2 188.5 192.4 188.5 197.6 201.4 257.2
## 2012 188.1 200.1 199.7 214.5 216.2 208.6 204.5 213.9 211.7 230.6 219.5 276.2
## 2013 234.5 226.2 233.4 234.4 243.0 228.7 235.9 247.8 240.2 244.5 232.2 270.9
b) Explore your chosen retail time series using the following functions:
autoplot(), ggseasonplot(), ggsubseriesplot(), gglagplot(), ggAcf()
Can you spot any seasonality, cyclicity and trend? What do you learn about the series?
Conclusion: There is an overall positive trend in the sales data, with an increase in the December month. The downward slope in the final autocorrelation function plot also depicts the positive trend.
3.1. The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
kable(summary(Glass[,1:10]))%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))| RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type | |
|---|---|---|---|---|---|---|---|---|---|---|
| Min. :1.511 | Min. :10.73 | Min. :0.000 | Min. :0.290 | Min. :69.81 | Min. :0.0000 | Min. : 5.430 | Min. :0.000 | Min. :0.00000 | 1:70 | |
| 1st Qu.:1.517 | 1st Qu.:12.91 | 1st Qu.:2.115 | 1st Qu.:1.190 | 1st Qu.:72.28 | 1st Qu.:0.1225 | 1st Qu.: 8.240 | 1st Qu.:0.000 | 1st Qu.:0.00000 | 2:76 | |
| Median :1.518 | Median :13.30 | Median :3.480 | Median :1.360 | Median :72.79 | Median :0.5550 | Median : 8.600 | Median :0.000 | Median :0.00000 | 3:17 | |
| Mean :1.518 | Mean :13.41 | Mean :2.685 | Mean :1.445 | Mean :72.65 | Mean :0.4971 | Mean : 8.957 | Mean :0.175 | Mean :0.05701 | 5:13 | |
| 3rd Qu.:1.519 | 3rd Qu.:13.82 | 3rd Qu.:3.600 | 3rd Qu.:1.630 | 3rd Qu.:73.09 | 3rd Qu.:0.6100 | 3rd Qu.: 9.172 | 3rd Qu.:0.000 | 3rd Qu.:0.10000 | 6: 9 | |
| Max. :1.534 | Max. :17.38 | Max. :4.490 | Max. :3.500 | Max. :75.41 | Max. :6.2100 | Max. :16.190 | Max. :3.150 | Max. :0.51000 | 7:29 |
#omit type column
r <- round(cor(Glass[,1:9]),3)
ggcorrplot::ggcorrplot(r,
type = 'lower', lab=T, lab_size=2)par(mfrow=c(3,3))
for(i in names(Glass)[-10]){
#title of element boxplot
boxplot(Glass[i], main=paste('Boxplot of', i), horizontal = T)
}(b) Do there appear to be any outliers in the data? Are any predictors skewed? Magnesium Mg is the only boxplot to not have outliers.
## K Ba Ca Fe RI Al Na
## 6.4600889 3.3686800 2.0184463 1.7298107 1.6027151 0.8946104 0.4478343
## Si Mg
## -0.7202392 -1.1364523
Barium Ba, Calcium Ca, Iron Fe, Potassium K, Magnesium Mg and RI have high skewness.
(c) Are there any relevant transformations of one or more predictors that might improve the classification model?
#box-cox transformation
transform <- preProcess(Glass[-10], method=c('BoxCox', 'center', 'scale'))
predicted <- predict(transform, Glass[-10])
#same code as first boxplot with transformed data
par(mfrow=c(3,3))
for(i in names(predicted)[-10]){
boxplot(predicted[i], main=paste('Boxplot of', i), horizontal = T)
}The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmen- tal conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Degenerate distributions are defined as having very small variances or 0 variances, so we will check for distributions that match this definition.
## [1] "Degenerate variables are: leaf.mild, mycelium, sclerotia"
b) Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
The above graph captures a clear pattern of missing data related to classes. germ, lodging, seed.tmt, sever and hail have the most missing data.
c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
par(mar=c(1,1,1,1))
soybean_missing %>%
gather(-Class, key = "var", value = "val") %>%
ggplot(aes(x = val, fill=Class)) +
geom_bar(alpha=1) +
facet_wrap(~ var, scales = "free") +
scale_fill_manual("target",
values = c('#99d8c9','#8856a7','#43a2ca','#e34a33','#ece7f2',
'#1c9099','#dd1c77','#c994c7','#fde0dd','#addd8e',
'#f7fcb9','#7fcdbb','#8c510a','#bababa',
'#1f78b4','#bf812d','#80cdc1','#01665e','#a6cee3')) +
xlab("") +
ylab("") +
theme(panel.background = element_blank(), legend.position="top")The plastics data set consists of the monthly sales (in thousands) of product A for a plastics manufacturer for five years.
a) Plot the time series of sales of product A. Can you identify seasonal fluctuations and/or a trend-cycle?
The plastics time series has an increasing trend with a seasonal fluctuation of higher sales in the summer and lower sales in the winter. Based on the annual seasonal plot, the apex of plastics sales exist around August and September.
b) Use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.
decompose_plastics <- decompose(plastics, type="multiplicative")
decompose_plastics %>%
autoplot() +
ggtitle("Multiplicative Decomposition of Plastic Product")c) Do the results support the graphical interpretation from part a?
Yes. We can see in the above graphs that there is a seasonal component to plastic sales and the trend in increasing.
d) Compute and plot the seasonally adjusted data.
seasonal_plastics <- plastics / decompose_plastics$seasonal
autoplot(plastics, series = "original data") +
autolayer(seasonal_plastics, series = "seasonally adjusted") +
ylab("Sales (Thousands)") +
ggtitle("Plastic Product Sales") +
scale_color_brewer(palette = "Set2")e) Change one observation to be an outlier (e.g., add 500 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
plastics_outlier <- plastics
plastics_outlier[50] <- plastics_outlier[50] + 500
decomp_plastics_outlier <- decompose(plastics_outlier, type="multiplicative")
seasonal_plastics_outlier <- plastics_outlier / decomp_plastics_outlier$seasonal
autoplot(plastics, series = 'original data') +
autolayer(seasonal_plastics, series = 'without outlier') +
autolayer(seasonal_plastics_outlier, series = 'with outlier') +
ylab("Sales (Thousands)") +
ggtitle("Plastic Product Sales") +
scale_color_brewer(palette = "Set2")The outlier causes the series to be slightly higher than the original series. There are also throughs in the series that weren’t present in the original series.
f) Does it make any difference if the outlier is near the end rather than in the middle of the time series?
autoplot(plastics, series = 'original data') +
autolayer(seasonal_plastics, series = 'without outlier') +
autolayer(seasonal_plastics_outlier, series = 'with outlier') +
ylab("Sales (Thousands)") +
ggtitle("Plastic Product Sales") +
scale_color_brewer(palette = "Set2")#End
plastics_outlier_end <- plastics
plastics_outlier_end[50] <- plastics_outlier_end[50] + 500
decomp_plastics_outlier_end <- decompose(plastics_outlier_end, type="multiplicative")
seasonal_plastics_outlier_end <- plastics_outlier_end / decomp_plastics_outlier_end$seasonal
#Middle
plastics_outlier_mid <- plastics
plastics_outlier_mid[27] <- plastics_outlier_mid[27] + 500
decomp_plastics_outlier_mid <- decompose(plastics_outlier_mid, type="multiplicative")
seasonal_plastics_outlier_mid <- plastics_outlier_mid / decomp_plastics_outlier_mid$seasonal
#Plot
autoplot(plastics, series = 'original data') +
autolayer(seasonal_plastics_outlier_end, series = 'with outlier near the end') +
autolayer(seasonal_plastics_outlier_mid, series = 'with outlier in the middle') +
ylab("Sales (Thousands)") +
ggtitle("Plastic Product A Sales") +
scale_color_brewer(palette = "Set2")Where the outlier lies seems to make no large difference on the series overall, except where the through lies. The throughs occur slightly later for the series with the outlier towards the end.
Consider the pigs series — the number of pigs slaughtered in Victoria each month.
a) Use the ses() function in R to find the optimal values of \(\alpha\) and \(l_{0}\) and generate forecasts for the next four months.
##
## Forecast method: Simple exponential smoothing
##
## Model Information:
## Simple exponential smoothing
##
## Call:
## ses(y = pigs, h = 4)
##
## Smoothing parameters:
## alpha = 0.2971
##
## Initial states:
## l = 77260.0561
##
## sigma: 10308.58
##
## AIC AICc BIC
## 4462.955 4463.086 4472.665
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 385.8721 10253.6 7961.383 -0.922652 9.274016 0.7966249 0.01282239
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Sep 1995 98816.41 85605.43 112027.4 78611.97 119020.8
## Oct 1995 98816.41 85034.52 112598.3 77738.83 119894.0
## Nov 1995 98816.41 84486.34 113146.5 76900.46 120732.4
## Dec 1995 98816.41 83958.37 113674.4 76092.99 121539.8
b) Compute a 95% prediction interval for the first forecast using \(y \pm 1.96s\) where \(s\) is the standard deviation of the residuals. Compare your interval with the interval produced by R.
pig_mean <- ses_pigs$mean[1]
pig_stdev <- sd(ses_pigs$residuals)
pig_lower <- pig_mean - 1.96 * pig_stdev
pig_upper <- pig_mean + 1.96 * pig_stdevWe are 95% confident that the true mean lies in the interval (78679.97, 118952.8)
Modify your function from the previous exercise to return the sum of squared errors rather than the forecast of the next observation. Then use the optim() function to find the optimal values of \(\alpha\) and \(l_{0}\) .Do you get the same values as the ses() function?
Figure 8.31 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.
a) Explain the differences among these figures. Do they all indicate that the data are white noise?
As the number of numbers increase, the ACF bands become narrower. The data is likely white noise, since there is no discernable pattern in both the bars of the ACF charts and the bars are within the bands.
b) Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?
As we increase the number of observations, we become more certain of patterns within the data. Therefore outliers become more apparent and decrease, as observations increase.
A classic example of a non-stationary series is the daily closing IBM stock price series (data set ibmclose). Use R to plot the daily closing prices for IBM stock and the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.
APPROACH
tsggplot <- function(ts, title){
grid.arrange(
autoplot(ts) +
ggtitle(title),
grid.arrange(
ggAcf(ts),
ggPacf(ts), ncol=2), nrow=2)
}
tsggplot(ibmclose, "IBM Closing Stock Price")In the top plot, we can see that there is a downward trend in IBM’s stock price. The gradual decrease in the ACF implies that this time series is not stationary.
a) Use R to simulate and plot some data from simple ARIMA models. Use the following R code to generate data from an AR(1) model with \(\phi_1=.6\) and \(\sigma^2=1\). The process starts with \(y_1=0\).
b) Produce a time plot for the series. How does the plot change as you change \(\phi_1\)?
First, we will need to adjust the code to see the effects of changing \(\phi_1\).
phi_sim <- function(phi){
set.seed(50)
y <- ts(numeric(100))
e <- rnorm(100)
for(i in 2:100){
y[i] <- phi*y[i-1] + e[i]
}
return(y)
}
p <- autoplot(phi_sim(0.6))
for(phi in seq(0.1, 0.9, 0.1)){
p <- p + autolayer(phi_sim(phi), series = paste(phi))
}
p +
labs(title="Changes in Phi", color = "Phi") +
theme(axis.title = element_blank(), legend.position = "bottom") +
scale_color_brewer(palette = "Set1")We can see that as phi increases, the distance of the time series from 0 increases.
c) Write your own code to generate data from an MA(1) model with \(\theta_1=.6\) and \(\sigma^2=1\).
MA1 <- function(theta){
set.seed(50)
y <- ts(numeric(100))
e <- rnorm(100)
e[1] <- 0
for(i in 2:100){
y[i] <- theta*e[i-1]+ e[i]
}
return(y)
}
ma <- MA1(.6)
plot(ma)d) Produce a time plot for the series. How does the plot change as you change \(\theta_1\)?
p <- autoplot(MA1(0.6))
for(theta in seq(0.1, 0.9, 0.1)){
p <- p + autolayer(MA1(theta), series = paste(theta))
}
p +
labs(title="Changes of Theta", color = "Theta") +
theme(axis.title = element_blank(), legend.position = "bottom") +
scale_color_brewer(palette = "Set1")The effects of changing \(\theta\) are less drastic than changing \(\phi\). As \(\theta\) increases, the disctance from 0 also increases.
e) Generate data from an ARMA(1,1) model with \(\phi_1=.6\), \(\theta_1=.6\), and \(\sigma^2=1\).
ARMA <- function(phi, theta){
set.seed(50)
y <- ts(numeric(100))
e <- rnorm(100)
e[1] <- 0
for(i in 2:100)
y[i] <- phi*y[i-1] + theta*e[i-1] + e[i]
return(y)
}
arma <- ARMA(.6,.6)
plot(arma)f) Generate data from an AR(2) model with \(\phi_1=-.8\), \(\phi_2=.3\), and \(\sigma^2=1\).(Note that these parameters will give a non-stationary series.)
AR2 <- function(phi_1, phi_2){
set.seed(50)
y <- ts(numeric(100))
e <- rnorm(100)
for(i in 3:100)
y[i] <- phi_1*y[i-1] + phi_2*y[i-2] + e[i]
return(y)
}
ar2 <- AR2(-.8, .3)
plot(ar2)g) Graph the latter two series and compare them.
p <- autoplot(arma)
p <- p + autolayer(ar2)
p +
labs(title="ARMA vs AR2") +
theme(axis.title = element_blank(), legend.position = "bottom") +
scale_color_brewer(palette = "Set1")Consider austa, the total international visitors to Australia (in millions) for the period 1980-2015.
a) Use auto.arima() to find an appropriate ARIMA model. What model was selected. Check that the residuals look like white noise. Plot forecasts for the next 10 periods.
## Series: austa
## ARIMA(0,1,1) with drift
##
## Coefficients:
## ma1 drift
## 0.3006 0.1735
## s.e. 0.1647 0.0390
##
## sigma^2 estimated as 0.03376: log likelihood=10.62
## AIC=-15.24 AICc=-14.46 BIC=-10.57
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0008313383 0.1759116 0.1520309 -1.069983 5.513269 0.7461559
## ACF1
## Training set -0.000571993
Auto ARIMA chose the model ARIMA(0,1,1) with drift.
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1) with drift
## Q* = 2.297, df = 5, p-value = 0.8067
##
## Model df: 2. Total lags used: 7
The residuals are fairly normal.
b) Plot forecasts from an ARIMA(0,1,1) model with no drift and compare these to part a. Remove the MA term and plot again.
aa_austa_2<-arima(austa,order=c(0,1,1))
aa_austa_2_fc<-forecast(aa_austa_2,h=10)
autoplot(aa_austa_2_fc)aa_austa_3<-arima(austa,order=c(0,1,0))
aa_austa_3_fc<-forecast(aa_austa_3,h=10)
autoplot(aa_austa_3_fc)c) Plot forecasts from an ARIMA(2,1,3) model with drift. Remove the constant and see what happens.
aa_austa_4<-arima(austa,order=c(2,1,0))
aa_austa_4_fc<-forecast(aa_austa_4,h=10)
autoplot(aa_austa_4_fc)d) Plot forecasts from an ARIMA(0,0,1) model with a constant. Remove the MA term and plot again.
aa_austa5<-arima(austa, order=c(0,0,1))
aa_austa5_fc<-forecast(aa_austa5,h=10)
autoplot(aa_austa5_fc)aa_austa5<-arima(austa, order=c(0,0,1))
aa_austa5.fc<-forecast(aa_austa5,h=10)
autoplot(aa_austa5.fc)e) Plot forecasts from an ARIMA(0,2,1) model with no constant.
aa_austa6<-arima(austa, order=c(0,2,1))
aa_austa6_fc<-forecast(aa_austa6,h=10)
autoplot(aa_austa6_fc)