library(forecast)
library(knitr)
library(dplyr)
library(pander)
library(Hmisc)
The linear trend plotted on the base R plot of the Canadian work hours series
(a) If we computed the autocorrelation of this series, would the lag-1 autocorrelation exhibit negative, positive, or no autocorrelation? How can you see this from the plot?
The time series plot of the average weekly hours worked in Canadian manufacturing shows a largely downward trend. If we computed a lag-1 autocorrelation, we would see “stickiness”: consecutive values that move in the same direction, whether up or down. This is positive autocorrelation. When fitting a model to this series, we would want one that captures the trend, or we would detrend the data first.
(b) Compute the autocorrelation and produce an ACF plot. Verify your answer to the previous question.
The ACF plot exhibits strong positive autocorrelation at lag 1. This is apparent from the bar at lag 1 on the x-axis, which crosses the significance threshold and reaches nearly 1 on the y-axis. The next six lags are significant as well and are probably attributable to the trend described in 1a. Detrending the series with lag-1 differencing should remove most of the remaining autocorrelation.
canadaAcf <- Acf(canadian.ts, main = "Auto Correlation \nCanadian Work Hours")
canadaAcf
##
## Autocorrelations of series 'canadian.ts', by lag
##
## 0 1 2 3 4 5 6 7 8 9
## 1.000 0.928 0.839 0.752 0.665 0.571 0.473 0.369 0.265 0.164
## 10 11 12 13 14 15
## 0.047 -0.082 -0.185 -0.261 -0.310 -0.346
To verify that the ACF of the Canadian work hours series no longer exhibits stickiness once detrended with lag-1 differencing, see the plot and table below.
DiffAcf <- diff(canadian.ts, lag = 1)
canadaAcfDiff <- Acf(DiffAcf, main = "Acf Plot of Detrended Series \nwith lag-1 Differencing")
canadaAcfDiff
##
## Autocorrelations of series 'DiffAcf', by lag
##
## 0 1 2 3 4 5 6 7 8 9
## 1.000 0.253 -0.137 -0.040 -0.053 0.020 0.142 0.065 -0.055 0.166
## 10 11 12 13 14 15
## 0.228 -0.170 -0.231 -0.221 -0.095 0.129
Acf(walMart.ts, main = "ACF for WalMart Series", lag.max = 10)
Acf(diff(walMart.ts, lag=1), main="ACF Plot for Differenced WalMart Series", lag.max = 10)
(a) Create a time plot of the differenced series.
plot(diff(walMart.ts, lag = 1), main = "Differenced WalMart")
(b) Which of the following is/are relevant for testing whether this stock is a random walk?

- The autocorrelations of the closing price series: Relevant
- The AR(1) slope coefficient for the closing price series: Relevant. If the AR(1) slope coefficient \(\beta_1 = 1\), the series is a random walk.
- The AR(1) constant coefficient for the closing price series: Not relevant
- The autocorrelations of the differenced series: Relevant
- The AR(1) slope coefficient for the differenced series: Not relevant
- The AR(1) constant coefficient for the differenced series: Relevant (a minimal check is sketched after this list)
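A minimal sketch of that last check, assuming walMart.ts holds the closing prices loaded above: fit an AR(1) to the differenced series and ask whether its constant (the estimated mean of the period-to-period changes) differs from zero; a constant indistinguishable from zero is consistent with a random walk.

# Sketch: AR(1) on the differenced closing prices; test H0: constant = 0
diffFit <- arima(diff(walMart.ts, lag = 1), order = c(1, 0, 0))
tstat <- diffFit$coef["intercept"] / sqrt(diag(diffFit$var.coef))["intercept"]
2 * pnorm(-abs(tstat)) # two-tailed p-value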
(c) Recreate the AR(1) model output for the Close price series shown in the left of Table 7.4. Does the AR model indicate that this is a random walk? Explain how you reached your conclusion.
To test whether this series is a random walk, we check whether \(\beta_1 = 1\). We perform a hypothesis test with \(H_0: \beta_1 = 1\) and \(H_1: \beta_1 \ne 1\). The p-value is calculated using the coefficient and standard error from the AR(1) model output. At \(\alpha = 0.01\), the AR(1) results indicate the Walmart closing price series is a random walk, because \(p > \alpha\) under both the t distribution and the normal distribution. However, at \(\alpha = 0.05\), we can reject the null hypothesis, conclude there is an underlying pattern, and use a forecasting technique beyond the naive method.
I’ve recreated the AR(1) model output.
walMartFit <- arima(walMart.ts, order = c(1,0,0))
walMartFit
##
## Call:
## arima(x = walMart.ts, order = c(1, 0, 0))
##
## Coefficients:
## ar1 intercept
## 0.9558 52.9497
## s.e. 0.0187 1.3280
##
## sigma^2 estimated as 0.9735: log likelihood = -349.8, aic = 705.59
The p-value using the t distribution
#Calculate the two-tailed p-value from a t-distribution, p = 0.01896261
pander(2*pt(-abs((1 - walMartFit$coef["ar1"]) / 0.0187), df=length(walMart.ts)-1))
ar1 |
---|
0.01896 |
The p-value using a normal distribution
#Now calculate it using the normal distribution, p = 0.01818593
pander(2*pnorm(-abs((1 - walMartFit$coef["ar1"]) / 0.0187)))
ar1 |
---|
0.01819 |
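As a cross-check, a small sketch that pulls the AR(1) standard error directly from the fitted model instead of hard-coding 0.0187; it should give essentially the same p-value.

# Extract the AR(1) coefficient and its standard error from the fitted model
ar1 <- walMartFit$coef["ar1"]
ar1.se <- sqrt(diag(walMartFit$var.coef))["ar1"]
# Two-tailed test of H0: beta1 = 1
2 * pnorm(-abs((1 - ar1) / ar1.se))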
(d) What are the implications of finding that a time series is a random walk? Choose the correct statement(s) below.
- It is impossible to obtain useful forecasts of the series. True; we can only model it with naive forecasting.
- The series is random. Not correct.
- The changes in the series from one period to the other are random. This is a correct statement.
3. Souvenir Sales: The file SouvenirSales.xls contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia, between 1995 and 2001.
Back in 2001, the store wanted to use the data to forecast sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation set containing the last 12 months of data (year 2001). She then fit a regression model to sales, using the training period.
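The partition objects used below (souvenirTrain, souvenirValid, validLength) are created outside this section; a minimal sketch of how they could be defined, assuming souvenir.ts is the monthly series starting in January 1995:

# Assumed partition: last 12 months (2001) held out for validation
validLength <- 12
trainLength <- length(souvenir.ts) - validLength
souvenirTrain <- window(souvenir.ts, end = c(1995, trainLength))
souvenirValid <- window(souvenir.ts, start = c(1995, trainLength + 1))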
Log(souvenir) plot
plot(log(souvenir.ts), type = "n", xaxt = "n", yaxt = "n", xlab = "Year", ylab = "Souvenir Sales (thousands)", main = "Log Souvenir Sales")
lines(log(souvenir.ts), col = "grey48", lwd = 2, lty = 1, pch = 18, type = "o")
# Add the x-axis
axis(1, at=seq(1995,2002,1), labels=format(seq(1995,2002,1)))
# Add the y-axis
axis(2, at = seq(7, 11, 1), labels = format(round(exp(seq(7, 11, 1)), digits = 0)), las = 2)
(a) Run a regression model with log(Sales) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Use this model to forecast the sales in February 2002.
Log Regression trend and seasonality
LogFC <- tslm(log(souvenirTrain) ~ trend + season)
pander(summary(LogFC))
&nbsp; | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
trend | 0.02112 | 0.001086 | 19.45 | 2.416e-27 |
season2 | 0.282 | 0.109 | 2.587 | 0.01218 |
season3 | 0.695 | 0.109 | 6.374 | 3.078e-08 |
season4 | 0.3739 | 0.1091 | 3.428 | 0.001115 |
season5 | 0.4217 | 0.1091 | 3.865 | 0.0002791 |
season6 | 0.447 | 0.1092 | 4.095 | 0.0001301 |
season7 | 0.5834 | 0.1092 | 5.341 | 1.551e-06 |
season8 | 0.5469 | 0.1093 | 5.004 | 5.366e-06 |
season9 | 0.6356 | 0.1094 | 5.811 | 2.654e-07 |
season10 | 0.7295 | 0.1095 | 6.664 | 9.98e-09 |
season11 | 1.201 | 0.1096 | 10.96 | 7.38e-16 |
season12 | 1.952 | 0.1097 | 17.8 | 2.129e-25 |
(Intercept) | 7.646 | 0.08412 | 90.9 | 4.077e-65 |
Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
---|---|---|---|
72 | 0.1888 | 0.9424 | 0.9306 |
Forecast using a regression model with log(Sales) as the output variable and with a linear trend and monthly predictors
FCFC <- forecast(LogFC, h = validLength)
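The question asks for the February 2002 forecast. One way to obtain it from this training-period model, under the assumption that the training period ends in December 2000, is to extend the forecast horizon two months past the validation year:

# Extend the horizon to reach February 2002 (12 validation months + 2)
FCFeb2002 <- forecast(LogFC, h = validLength + 2)
exp(FCFeb2002$mean[validLength + 2]) # back-transform from log(Sales) to Sales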
(b) Create an ACF plot until lag-15 for the forecast errors. Now fit an AR model with lag-2 [ARIMA(2, 0, 0)] to the forecast errors.
Acf(FCFC$residuals, lag.max = 15, main = "ACF plot of Forecast Errors \nLag = 15")
Arima2 <- Arima(LogFC$residuals, order = c(2,0,0))
Arima2
## Series: LogFC$residuals
## ARIMA(2,0,0) with non-zero mean
##
## Coefficients:
## ar1 ar2 mean
## 0.3072 0.3687 -0.0025
## s.e. 0.1090 0.1102 0.0489
##
## sigma^2 estimated as 0.0205: log likelihood=39.03
## AIC=-70.05 AICc=-69.46 BIC=-60.95
Fitting an AR(2) model to the residuals suggests there is remaining autocorrelation that could be further modeled. To determine this, we need to check whether the AR(2) coefficients are statistically significant.
Tests for significance
# Calculate the t statistics for each: coefficient / s.e.
# Rough rule is anything > 2 (or < -2) is statistically significant
#AR(1) is statistically significant 2.818482
Arima2$coef["ar1"] / 0.1090
## ar1
## 2.818482
#AR(2) is statistically significant at 3.346186
Arima2$coef["ar2"] / 0.1102
## ar2
## 3.346186
#both t statistics are statistically significant, so there is trend remaining which might be worth correcting the forecast for.
#estimate the p values ar1 p value < alpha, 0.004825136
pander(2*pnorm(-abs(Arima2$coef["ar1"]/0.1090)))
ar1 |
---|
0.004825 |
#p = 0.0008193151, p<alpha,
pander(2*pnorm(-abs(Arima2$coef["ar2"]/0.1102)))
ar2 |
---|
0.0008193 |
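The same tests can be run without hard-coding the standard errors; a small sketch pulling them from the fitted AR(2) model:

# Extract both standard errors from the fitted model and test both coefficients
se <- sqrt(diag(Arima2$var.coef))[c("ar1", "ar2")]
tstats <- Arima2$coef[c("ar1", "ar2")] / se
2 * pnorm(-abs(tstats)) # two-tailed p-values, both below alpha = 0.01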
The ACF of the AR(2) residuals
Acf(Arima2$residuals, main = "Acf of AR(2) Residuals")
i. Examining the ACF plot and the estimated coefficients of the AR(2) model (and their statistical significance), what can we learn about the regression forecasts?
Looking at the ACF plot and the estimated coefficients for the forecast errors, we see structure that the regression forecast missed, specifically at lags 1 and 2. This is reinforced by the AR(2) model of the residuals; further modeling may capture the remaining information. We can perform hypothesis tests using the coefficients and standard errors to calculate t statistics. Above, I calculate the t statistics for ar1 and ar2; both are statistically significant because they are greater than 2. I also calculate the p-values from the coefficients and standard errors. Using a normal distribution, both p-values for ar1 and ar2 are smaller than \(\alpha = 0.01\). With a statistically significant AR(2) model, we see there is structure that can be further modeled to improve the souvenir sales forecast. The ACF of the AR(2) residuals shows that the autocorrelation remaining in the regression residuals is captured by the ARIMA model.
ii. Use the autocorrelation information to compute a forecast for January 2002, using the regression model and the AR(2) model above.
# Forecast the AR(2) error model over the validation horizon
FCArima <- forecast(Arima2, h = validLength)
# Add the error forecast to the regression forecast on the log scale, then back-transform
ErrorError <- exp(FCFC$mean + FCArima$mean)
ErrorError[1]
## [1] 10894.13
Plot of the actual souvenir sales, the original forecast from the regression model with trend and seasonality on log(souvenir sales), and the adjusted forecast
plot(souvenirValid, bty = "l", xaxt = "n", yaxt = "n", xlab = "2001", ylab = "Souvenir Sales (thousands)", main = "Actual Sales vs Forecasted Sales \nSouvenir Sales", ylim = c(9000, 110000), lwd = 2)
axis(1, at=seq(2001,2001+11/12,1/12), labels= c("Jan ", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
axis(2, at = seq(0, 110000, 20000), labels = format(seq(0, 110, 20)), las = 2)
lines(exp(FCFC$mean), col = "blue", lwd = 2)
lines(ErrorError, col = "green", lwd = 2, lty = 2)
legend(2001, 90000, c("Actual Sales", "Forecasted Sales MAPE = 15.52", "Adjusted Forecasted Sales MAPE = 15.35"), lty = c(1,1,2), lwd = c(2,2,2), bty = "n", col = c("black", "blue", "green"))
The predictive accuracy of the adjusted forecast differs only slightly from that of the original forecast.
accuracy(ErrorError, souvenirValid)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 4129.219 6862.888 4963.377 8.315146 15.35239 0.4680694 0.4595104
accuracy(exp(FCFC$mean), souvenirValid)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 4824.494 7101.444 5191.669 12.35943 15.5191 0.4245018 0.4610253
4. Shipments of Household Appliances: The file ApplianceShipments.xls contains the series of quarterly shipments (in millions of USD) of U.S. household appliances between 1985 and 1989. The series is plotted in Figure 7.13.
appliance <- read.csv("ApplianceShipments1.csv", header = TRUE, stringsAsFactors = FALSE)
appliance <- appliance %>% select(Quarter, Shipments)
#split Quarter column
col1Split <- strsplit(appliance$Quarter, "-")
#Make quarter column into proper columns in a dataframe
tempDF <- data.frame(matrix(unlist(col1Split), nrow = length(col1Split), byrow = TRUE), appliance$Shipments)
#reorder rows by year and then by quarter
tempDF <- tempDF[order(tempDF$X2, tempDF$X1),]
#merge quarter and year back together
m <- paste0(tempDF$X1, "-", tempDF$X2)
#put a new dataframe together
newDF <- data.frame(m, tempDF$appliance.Shipments)
#rename columns
names(newDF)[1] <- "Quarter"
names(newDF)[2] <- "Shipments"
#make this into a timeseries
applianceShipments.ts <- ts(newDF$Shipments, start = c(1985, 1), end = c(1989,4), frequency = 4)
plot(applianceShipments.ts, type = "o")
(a) If we compute the autocorrelation of the series, which lag (> 0) is most likely to have the largest coefficient (in absolute value)?
In the autocorrelation of the household appliance shipments data, the lag most likely to have the largest coefficient is lag-4, because the data are quarterly and appear to have a seasonal component.
(b) Create an ACF plot and compare it with your answer.
Acf(applianceShipments.ts, main = "Quarterly Appliance Shipments \nAcf Plot")
Acf(diff(applianceShipments.ts, lag = 4), main = "Lag-4 Differenced Data Series \nAcf Plot of Quarterly Appliance Shipments")
In the first ACF plot above, lag-4 has the largest coefficient. In the second plot, where the original series has been deseasonalized with lag-4 differencing, the large coefficient at lag-4 is no longer present.