Read in the .csv files

# The forecast package provides Acf(), Arima(), tslm(), and forecast(); tseries provides adf.test()
library(forecast)
library(tseries)

CAworkhours <- read.csv("CanadianWorkHours.csv", stringsAsFactors = FALSE)
walmartStock <- read.csv("WalMartStock.csv", stringsAsFactors = FALSE)
souvSales <- read.csv("SouvenirSales.csv", stringsAsFactors = FALSE)
appShipments <- read.csv("ApplianceShipments.csv", stringsAsFactors = FALSE)

Problem 1: Analysis of Canadian Manufacturing Workers Work-Hours

caHours.ts <- ts(CAworkhours$HoursPerWeek, start = c(1966), frequency = 1)

plot(c(1966,2000), ylim = c(34.5,38), xlim = c(1966,2000), type = "n", xlab = "Year", ylab = "Avg Hours Per Week", bty = "l", xaxt = "n", yaxt = "n")

lines(caHours.ts, bty = "l")

axis(1, at = seq(1966,2000,1), labels = format(seq(1966,2000,1)), las = 3, cex.axis = 0.75)
axis(2, at = seq(34.5,38,0.5), labels = format(seq(34.5,38,0.5)), las = 0, cex.axis = 0.6)

  1. If we computed the autocorrelation of this series, would the lag-1 autocorrelation exhibit negative, positive, or no autocorrelation? How can you see this from the plot?
# Use lag.max=1 to compute and plot the lag-1 autocorrelation
par(oma = c(0,0,1,0))
Acf(caHours.ts, lag.max=1)

Differencing can make the series more stationary.

# First difference of the time series
caHoursdiff1 <- diff(caHours.ts, differences = 1)
plot.ts(caHoursdiff1)

# Second difference of the time series
caHoursdiff2 <- diff(caHours.ts, differences = 2)
plot.ts(caHoursdiff2)

Taking the first difference removes the trend component, and the differenced series appears more stationary.

The original series exhibits positive lag-1 autocorrelation. The more stationary series created by differencing exhibits negative lag-1 autocorrelation.

caHoursACF <- Acf(caHours.ts)

caHoursACF
## 
## Autocorrelations of series 'caHours.ts', by lag
## 
##      0      1      2      3      4      5      6      7      8      9 
##  1.000  0.928  0.839  0.752  0.665  0.571  0.473  0.369  0.265  0.164 
##     10     11     12     13     14     15 
##  0.047 -0.082 -0.185 -0.261 -0.310 -0.346
Acf(caHoursdiff1)

Problem 2: Forecasting Wal-Mart Stock

wmt.ts <- ts(walmartStock$Close)

plot(wmt.ts, xlab = "Time", ylab = "Close Price ($)", bty = "l", xaxt = "n", yaxt = "n")
axis(1,at=c(18,59.33,100.66,141.99,183.33,224.66),labels=c("Mar-01","May-01","Jul-01","Sep-01","Nov-01","Jan-02"))
axis(2,at=seq(40,60,5),labels=format(seq(40,60,5)),las=2)

  1. Create a time plot of the differenced series
plot(diff(wmt.ts, lag = 1), ylim = c(-4,4), bty = "l", xaxt = "n", yaxt = "n", ylab = "Lag-1 Difference ($)")
axis(1,at=c(18,59.33,100.66,141.99,183.33,224.66),labels=c("Mar-01","May-01","Jul-01","Sep-01","Nov-01","Jan-02"))
axis(2,at=seq(-4,4,1),labels=format(seq(-4,4,1)),las=2)

  1. Which of the following is/are relevant for testing whether this stock is a random walk? Random walks are nonstationary: the mean of a random walk is constant, but its variance increases with t. (A short simulation illustrating the two relevant checks appears after the list below.)

• The autocorrelations of the closing price series (N)

• The AR(1) slope coefficient for the closing price series (Y)

One way to test whether a series is a random walk is to fit an AR(1) model and test whether the slope coefficient equals 1 (the null hypothesis of a random walk). If the p-value is smaller than the specified significance level, the null hypothesis is rejected and the series is not a random walk. Forecasts of a random walk are essentially naive forecasts.

• The AR(1) constant coefficient for the closing price series (N)

• The autocorrelations of the differenced series (Y)

Another way to determine if a series is a random walk is to compute the differenced series (lag-1 differenced series) and see if the ACF plot has autocorrelations at lags 1,2,3,… all approximately equal to 0.

• The AR(1) slope coefficient for the differenced series (N)

• The AR(1) constant coefficient for the differenced series (N)
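
A small simulation sketch illustrating the two relevant checks on an artificial random walk (the series name rw.sim and the seed are made up for illustration; assumes the forecast package is loaded):

# Simulate a pure random walk: y_t = y_(t-1) + e_t
set.seed(123)
rw.sim <- ts(cumsum(rnorm(250)))

# Check 1: the AR(1) slope coefficient should be close to 1
Arima(rw.sim, order = c(1,0,0))

# Check 2: the ACF of the lag-1 differenced series should show no significant autocorrelations
Acf(diff(rw.sim, lag = 1))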

  1. Recreate the AR(1) model output for the Close price series. Does the AR model indicate that this is a random walk? Explain how you reached your conclusion.
# Fit an AR(1) model
wmt.fit <- Arima(wmt.ts, order = c(1,0,0))
wmt.fit
## Series: wmt.ts 
## ARIMA(1,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1  intercept
##       0.9558    52.9497
## s.e.  0.0187     1.3280
## 
## sigma^2 estimated as 0.9815:  log likelihood=-349.8
## AIC=705.59   AICc=705.69   BIC=716.13

The standard error of the AR(1) slope coefficient is 0.0187. The slope coefficient is 0.9558, which is roughly 2.4 standard errors away from 1, indicating this may not be a random walk.

# Use the slope coefficient and its standard error to compute the test statistic, then get a two-sided p-value

# t-distribution
wmt.tdist <- 2*pt(-abs((1 - wmt.fit$coef["ar1"]) / 0.0187), df = length(wmt.ts) - 1)
wmt.tdist
##        ar1 
## 0.01896261
# normal distribution
wmt.ndist <- 2*pnorm(-abs((1 - wmt.fit$coef["ar1"]) / 0.0187))
wmt.ndist
##        ar1 
## 0.01818593

The normal distribution gives a slightly lower p-value than the t-distribution, but both are smaller than alpha = 0.05, so at the 5% level we reject the hypothesis that the slope equals 1 and conclude the series is not a random walk. At a 1% significance level, however, we could not reject the random-walk hypothesis.

Another way to test whether a time series is a random walk (i.e., non-stationary) is the adf.test function from the tseries package, which performs the Augmented Dickey-Fuller test.

adf.test(wmt.ts)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  wmt.ts
## Dickey-Fuller = -2.6351, Lag order = 6, p-value = 0.3085
## alternative hypothesis: stationary

The test fails to reject the null hypothesis of a unit root: the p-value of 0.3085 is greater than significance levels of 1%, 5%, and 10%, so the series appears non-stationary and may be a random walk.

  1. What are the implications of finding that a time series is a random walk? Choose the correct statements:

• It is impossible to obtain useful forecasts of the series.

Forecasts would reflect the lack of any other usable information and would most likely not improve over the naive forecast.

• The series is random.

Only if the autocorrelations at lags 1,2,3 are all approximately 0. As you can see from the ACF plot below this is not the case.

• The changes in the series from one period to the other are random.

This is the definition of a random walk, so assuming the series is nonstationary, this statement is correct.

wmtdiff <- diff(wmt.ts, lag = 1)
Acf(wmtdiff)
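
Since forecasts of a random walk reduce to naive forecasts (each point forecast is simply the last observed value), a minimal sketch of what such a forecast would look like for the closing prices, using naive() from the forecast package (the 10-day horizon is arbitrary):

# Naive forecast: every point forecast equals the last observed closing price
wmt.naive <- naive(wmt.ts, h = 10)
wmt.naive$mean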

Problem 3: Souvenir Sales

# Create time series object and plot it (SouvenirSales.csv was already read in above)
souvSales.ts <- ts(souvSales$Sales, start = c(1995,1), frequency = 12)

par(oma = c(0,1,0,0))
plot(c(1995,2001),ylim = c(0,120000), xlim = c(1995,2002),type="n",xlab="Year",ylab="Sales (Australian dollars)",bty="l",xaxt="n",yaxt="n")
lines(souvSales.ts,bty="l")
axis(1,at=seq(1995,2002,1),labels=format(seq(1995,2002,1)))
axis(2,at=seq(0,120000,20000),labels=format(seq(0,120000,20000)),las=2, cex.axis = 0.75)

  1. Run a regression model with log(Sales) as the output variable and with a linear trend and monthly predictors. Use this model to forecast the sales in February 2002.
# Set up training and validation periods
validLength <- 12
trainLength <- length(souvSales.ts) - validLength
souvSales.Train <- window(souvSales.ts,end=c(1995,trainLength))
souvSales.Valid <- window(souvSales.ts, start=c(1995,trainLength+1))

logSales <- tslm(log(souvSales.Train) ~ trend + season)
summary(logSales)
## 
## Call:
## tslm(formula = log(souvSales.Train) ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4529 -0.1163  0.0001  0.1005  0.3438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.646363   0.084120  90.898  < 2e-16 ***
## trend       0.021120   0.001086  19.449  < 2e-16 ***
## season2     0.282015   0.109028   2.587 0.012178 *  
## season3     0.694998   0.109044   6.374 3.08e-08 ***
## season4     0.373873   0.109071   3.428 0.001115 ** 
## season5     0.421710   0.109109   3.865 0.000279 ***
## season6     0.447046   0.109158   4.095 0.000130 ***
## season7     0.583380   0.109217   5.341 1.55e-06 ***
## season8     0.546897   0.109287   5.004 5.37e-06 ***
## season9     0.635565   0.109368   5.811 2.65e-07 ***
## season10    0.729490   0.109460   6.664 9.98e-09 ***
## season11    1.200954   0.109562  10.961 7.38e-16 ***
## season12    1.952202   0.109675  17.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared:  0.9424, Adjusted R-squared:  0.9306 
## F-statistic:  80.4 on 12 and 59 DF,  p-value: < 2.2e-16
# To forecast February 2002 using t = 86 and season = 2
logSalesForecast <- forecast(logSales,h=validLength)
logSalesFebForecast <- logSales$coefficients["(Intercept)"] + logSales$coefficients["trend"]*86 + logSales$coefficients["season2"]
# Convert back to the original scale
exp(logSalesFebForecast)
## (Intercept) 
##    17062.99

The forecast for February 2002 is 17062.99 AUD.
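
As a cross-check (a sketch reusing the training-period model above), the same February 2002 value can be obtained from forecast() by forecasting 14 months past the end of the training period (December 2000), since February 2002 is month 86 of the series and month 14 beyond the training period:

# Back-transform the 14th point forecast (February 2002) to the original scale
exp(forecast(logSales, h = 14)$mean[14])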

  1. Create an ACF plot until lag-15 for the forecast errors. Then fit an AR model with lag-2 [ARIMA(2,0,0)] to the forecast errors.
# Create an ACF plot with lag.max of 15
errACF <- Acf(logSales$residuals,lag.max=15)

errACF
## 
## Autocorrelations of series 'logSales$residuals', by lag
## 
##      0      1      2      3      4      5      6      7      8      9 
##  1.000  0.459  0.485  0.194  0.088  0.154  0.016  0.030  0.106  0.034 
##     10     11     12     13     14     15 
##  0.152 -0.055 -0.012 -0.047 -0.077 -0.023
# Fit an AR model with lag-2
AR2 <- Arima(logSales$residuals,order=c(2,0,0))
AR2
## Series: logSales$residuals 
## ARIMA(2,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1     ar2  intercept
##       0.3072  0.3687    -0.0025
## s.e.  0.1090  0.1102     0.0489
## 
## sigma^2 estimated as 0.0205:  log likelihood=39.03
## AIC=-70.05   AICc=-69.46   BIC=-60.95
# Calculate the t statistics for each coeff. / s.e.
AR2$coef["ar1"] / 0.1090
##      ar1 
## 2.818482
AR2$coef["ar2"] / 0.1102
##      ar2 
## 3.346186
# Calculate p-values for normal distribution
2*pnorm(-abs(AR2$coef["ar1"] / 0.1090))
##         ar1 
## 0.004825136
# p-value for the lag-2 coefficient (ar2); comes out well below 0.01
2*pnorm(-abs(AR2$coef["ar2"] / 0.1102))
# Create an ACF plot of the residuals from the AR(2) model
Acf(AR2$residuals)

i. Examining the ACF plot and the estimated coefficients of the AR(2) model (and their statistical significance), what can we learn about the regression forecasts?

All lags are within the significance thresholds, indicating no remaining significant autocorrelation: the residual autocorrelations of the AR(2) model can be treated as zero. Both the lag-1 and lag-2 coefficients are statistically significant (the lag-2 coefficient has the smaller p-value), which means the regression residuals still contained information; the regression forecasts can therefore be improved by adding a forecast of the forecast errors from the AR(2) model.

  1. Use the autocorrelation information to compute a forecast for January 2001, the first month of the validation period, using the regression model and the AR(2) model above.
# Use the forecast method to generate forecasts for the validation period
regForecast <- forecast(logSales, h = validLength)
regForecast
##          Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95
## Jan 2001       9.188097  8.917220  9.458974  8.769890  9.606304
## Feb 2001       9.491232  9.220354  9.762109  9.073024  9.909439
## Mar 2001       9.925335  9.654457 10.196212  9.507127 10.343542
## Apr 2001       9.625329  9.354452  9.896207  9.207122 10.043537
## May 2001       9.694286  9.423408  9.965163  9.276078 10.112493
## Jun 2001       9.740741  9.469864 10.011619  9.322534 10.158949
## Jul 2001       9.898195  9.627318 10.169072  9.479988 10.316402
## Aug 2001       9.882831  9.611954 10.153708  9.464624 10.301038
## Sep 2001       9.992619  9.721742 10.263496  9.574412 10.410826
## Oct 2001      10.107664  9.836787 10.378542  9.689457 10.525872
## Nov 2001      10.600248 10.329370 10.871125 10.182040 11.018455
## Dec 2001      11.372615 11.101738 11.643493 10.954408 11.790823
# Generate a forecast for the error terms using the residual AR(2) model
errForecast <- forecast(AR2, h = validLength)
errForecast
##          Point Forecast       Lo 80     Hi 80      Lo 95     Hi 95
## Jan 2001    0.107882119 -0.07561892 0.2913832 -0.1727585 0.3885227
## Feb 2001    0.098551352 -0.09341395 0.2905167 -0.1950342 0.3921370
## Mar 2001    0.069245043 -0.14069093 0.2791810 -0.2518243 0.3903144
## Apr 2001    0.056801003 -0.15830920 0.2719112 -0.2721817 0.3857837
## May 2001    0.042171322 -0.17774923 0.2620919 -0.2941681 0.3785108
## Jun 2001    0.033088136 -0.18905521 0.2552315 -0.3066508 0.3728271
## Jul 2001    0.024902959 -0.19881529 0.2486212 -0.3172446 0.3670505
## Aug 2001    0.019038932 -0.20554500 0.2436229 -0.3244325 0.3625104
## Sep 2001    0.014219137 -0.21092154 0.2393598 -0.3301038 0.3585421
## Oct 2001    0.010576068 -0.21489090 0.2360430 -0.3342459 0.3553980
## Nov 2001    0.007679567 -0.21798999 0.2333491 -0.3374522 0.3528114
## Dec 2001    0.005446339 -0.22034478 0.2312375 -0.3398714 0.3507641
# Create adjusted forecast by adding the two forecasts together
adjForecast <- regForecast$mean + errForecast$mean
adjForecast
##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 2001  9.295979  9.589783  9.994580  9.682130  9.736457  9.773830  9.923098
##            Aug       Sep       Oct       Nov       Dec
## 2001  9.901870 10.006838 10.118240 10.607927 11.378062
# Convert back to the original scale
exp(adjForecast[1])
## [1] 10894.13

The forecast for January 2001 is approximately 10,894.13 AUD.
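
As a quick sanity check (a sketch using the validation window defined earlier), the adjusted forecast can be compared against the actual January 2001 sales:

# Actual January 2001 sales and the forecast error of the adjusted forecast
souvSales.Valid[1]
exp(adjForecast[1]) - souvSales.Valid[1]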

Problem 4: Shipments of Household Appliances

appShip.ts <- ts(appShipments$Shipments, start = c(1985,1), frequency = 4)
plot(appShip.ts, ylim = c(3000, 5000), bty = "l", ylab = "Shipments (millions $)")

  1. If we compute the autocorrelation of the series, which lag (> 0) is most likely to have the largest coefficient (in absolute value)?

The plot shows a recurring within-year pattern, suggesting seasonality. Because the data are quarterly, lag-4 is most likely to have the largest autocorrelation coefficient in absolute value.

  1. Create an ACF plot and compare it with your answer.

The ACF plot confirms that lag-4 has the largest coefficient in absolute value. However, none of the lags > 0 exceed the significance threshold, so none of the autocorrelations are statistically significant.

Acf(appShip.ts)
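
As a quick programmatic check of the claims above (a sketch; reuses the ACF of the shipments series and the usual approximate ±1.96/sqrt(n) significance bound):

# Which lag (> 0) has the largest autocorrelation in absolute value?
appACF <- Acf(appShip.ts, plot = FALSE)
which.max(abs(appACF$acf[-1]))  # index = lag number, since lag 0 is dropped

# Approximate significance threshold for the sample autocorrelations
1.96 / sqrt(length(appShip.ts))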