# Load the packages used throughout: Acf, Arima, naive and forecast come from the forecast package; adf.test from tseries
library(forecast)
library(tseries)
# Read in the .csv files
CAworkhours <- read.csv("CanadianWorkHours.csv", stringsAsFactors = FALSE)
walmartStock <- read.csv("WalMartStock.csv", stringsAsFactors = FALSE)
souvSales <- read.csv("SouvenirSales.csv", stringsAsFactors = FALSE)
appShipments <- read.csv("ApplianceShipments.csv", stringsAsFactors = FALSE)
Problem 1: Analysis of Canadian Manufacturing Workers' Work-Hours
caHours.ts <- ts(CAworkhours$HoursPerWeek, start = c(1966), frequency = 1)
plot(c(1966,2000), ylim = c(34.5,38), xlim = c(1966,2000), type = "n", xlab = "Year", ylab = "Avg Hours Per Week", bty = "l", xaxt = "n", yaxt = "n")
lines(caHours.ts, bty = "l")
axis(1, at = seq(1966,2000,1), labels = format(seq(1966,2000,1)), las = 3, cex.axis = 0.75)
axis(2, at = seq(34.5,38,0.5), labels = format(seq(34.5,38,0.5)), las = 0, cex.axis = 0.6)
# Use lag.max=1 to compute and plot the lag-1 autocorrelation
par(oma = c(0,0,1,0))
Acf(caHours.ts, lag.max=1)
Differencing can make the series more stationary.
# First difference of the time series
caHoursdiff1 <- diff(caHours.ts, differences = 1)
plot.ts(caHoursdiff1)
# Second difference of the time series
caHoursdiff2 <- diff(caHours.ts, differences = 2)
plot.ts(caHoursdiff2)
Taking the first difference removes the trend component, and the differenced series appears much closer to stationary.
The original series exhibits strong positive lag-1 autocorrelation. The differenced series, which is closer to stationary, exhibits negative lag-1 autocorrelation.
caHoursACF <- Acf(caHours.ts)
caHoursACF
##
## Autocorrelations of series 'caHours.ts', by lag
##
## 0 1 2 3 4 5 6 7 8 9
## 1.000 0.928 0.839 0.752 0.665 0.571 0.473 0.369 0.265 0.164
## 10 11 12 13 14 15
## 0.047 -0.082 -0.185 -0.261 -0.310 -0.346
Acf(caHoursdiff1)
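A quick numeric check of the lag-1 claims above (a sketch; plot = FALSE suppresses the chart and prints the values):
# Lag-1 autocorrelation of the original series (positive)
Acf(caHours.ts, lag.max = 1, plot = FALSE)
# Lag-1 autocorrelation of the first-differenced series (negative)
Acf(caHoursdiff1, lag.max = 1, plot = FALSE)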
Problem 2: Forecasting Wal-Mart Stock
wmt.ts <- ts(walmartStock$Close)
plot(wmt.ts, xlab = "Time", ylab = "Close Price ($)", bty = "l", xaxt = "n", yaxt = "n")
axis(1,at=c(18,59.33,100.66,141.99,183.33,224.66),labels=c("Mar-01","May-01","Jul-01","Sep-01","Nov-01","Jan-02"))
axis(2,at=seq(40,60,5),labels=format(seq(40,60,5)),las=2)
plot(diff(wmt.ts,lag=1), ylim = c(-4,4), bty="l",xaxt="n",yaxt="n", ylab = "Lag-1 Difference ($)")
axis(1,at=c(18,59.33,100.66,141.99,183.33,224.66),labels=c("Mar-01","May-01","Jul-01","Sep-01","Nov-01","Jan-02"))
axis(2,at=seq(-4,4,1),labels=format(seq(-4,4,1)),las=2)
• The autocorrelations of the closing price series (N)
• The AR(1) slope coefficient for the closing price series (Y)
One way to test whether a series is a random walk is to fit an AR(1) model and test the null hypothesis that the slope coefficient equals 1. If the p-value is smaller than the specified significance level, the null hypothesis is rejected and the series is not a random walk. Forecasts of a random walk series are essentially naive forecasts.
• The AR(1) constant coefficient for the closing price series (N)
• The autocorrelations of the differenced series (Y)
Another way to determine if a series is a random walk is to compute the differenced series (lag-1 differenced series) and see if the ACF plot has autocorrelations at lags 1,2,3,… all approximately equal to 0.
• The AR(1) slope coefficient for the differenced series (N)
• The AR(1) constant coefficient for the difference series (N)
# Fit an AR(1) model
wmt.fit <- Arima(wmt.ts, order = c(1,0,0))
wmt.fit
## Series: wmt.ts
## ARIMA(1,0,0) with non-zero mean
##
## Coefficients:
## ar1 intercept
## 0.9558 52.9497
## s.e. 0.0187 1.3280
##
## sigma^2 estimated as 0.9815: log likelihood=-349.8
## AIC=705.59 AICc=705.69 BIC=716.13
The standard error of the AR(1) slope coefficient is 0.0187. The slope coefficient is 0.9558, which is (1 - 0.9558) / 0.0187 ≈ 2.4 standard errors away from 1, suggesting this may not be a random walk.
# Use the slope coefficient and its std error to compute the test statistic, then compute a two-sided p-value
# t-distribution
wmt.tdist <- 2*pt(-abs((1 - wmt.fit$coef["ar1"]) / 0.0187), df = length(wmt.ts) - 1)
wmt.tdist
## ar1
## 0.01896261
# normal distribution
wmt.ndist <- 2*pnorm(-abs((1 - wmt.fit$coef["ar1"]) / 0.0187))
wmt.ndist
## ar1
## 0.01818593
The normal distribution gives a slightly lower p-value than the t-distribution, but both are smaller than alpha = 0.05, suggesting the series is not a random walk. However, at a significance level of 0.01 we would fail to reject the null hypothesis, and the series would be treated as a random walk.
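Rather than hard-coding the 0.0187 standard error, it can be pulled from the fitted model itself. A short sketch, assuming the Arima fit exposes the coefficient covariance matrix via var.coef (as stats::arima does):
# Extract the AR(1) coefficient's standard error from the covariance matrix
ar1.se <- sqrt(diag(wmt.fit$var.coef))["ar1"]
# Recompute the two-sided normal p-value for H0: slope = 1
2*pnorm(-abs((1 - wmt.fit$coef["ar1"]) / ar1.se))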
Another way to test whether a time series is a random walk (i.e., non-stationary) is to use the adf.test function in R, which performs the Augmented Dickey-Fuller test.
adf.test(wmt.ts)
##
## Augmented Dickey-Fuller Test
##
## data: wmt.ts
## Dickey-Fuller = -2.6351, Lag order = 6, p-value = 0.3085
## alternative hypothesis: stationary
The p-value (0.3085) is greater than significance levels of 1%, 5% and 10%, so we fail to reject the null hypothesis that the series contains a unit root. The series therefore appears to be non-stationary and may be a random walk.
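If the level series really contains a unit root, its first difference should not. Rerunning the test on the differenced series is a quick complementary check (a sketch; for a random walk we would expect a small p-value here, i.e. rejection of the unit-root null):
adf.test(diff(wmt.ts))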
• It is impossible to obtain useful forecasts of the series.
Forecasts would reflect the lack of any other usable information, and forecasting would most likely not improve over the naive forecast (sketched after the ACF code below).
• The series is random.
Only if the autocorrelations at lags 1, 2, 3, … are all approximately 0. As the ACF plot below shows, this is not the case.
• The changes in the series from one period to the other are random.
This is the definition of a random walk, so assuming the series is nonstationary, this statement is correct.
wmtdiff <- diff(wmt.ts, lag = 1)
Acf(wmtdiff)
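Given the evidence that the series behaves like a random walk, the natural benchmark is the naive forecast, where each point forecast equals the last observed value. A minimal sketch using the forecast package's naive() function (the 40-period horizon is arbitrary, for illustration only):
# Naive forecast: every point forecast equals the most recent closing price
wmt.naive <- naive(wmt.ts, h = 40)
plot(wmt.naive, ylab = "Close Price ($)", bty = "l")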
Problem 3: Souvenir Sales
# Create time series object and plot it
souvSales.ts <- ts(souvSales$Sales,start=c(1995,1),frequency=12)
par(oma = c(0,1,0,0))
plot(c(1995,2001),ylim = c(0,120000), xlim = c(1995,2002),type="n",xlab="Year",ylab="Sales (Australian dollars)",bty="l",xaxt="n",yaxt="n")
lines(souvSales.ts,bty="l")
axis(1,at=seq(1995,2002,1),labels=format(seq(1995,2002,1)))
axis(2,at=seq(0,120000,20000),labels=format(seq(0,120000,20000)),las=2, cex.axis = 0.75)
# Set up training and validation periods
validLength <- 12
trainLength <- length(souvSales.ts) - validLength
souvSales.Train <- window(souvSales.ts,end=c(1995,trainLength))
souvSales.Valid <- window(souvSales.ts, start=c(1995,trainLength+1))
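A quick sanity check that the split is as intended (a sketch; the series spans Jan 1995 through Dec 2001, 84 months):
length(souvSales.Train)  # expect 72: Jan 1995 - Dec 2000
length(souvSales.Valid)  # expect 12: Jan 2001 - Dec 2001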
logSales <- tslm(log(souvSales.Train) ~ trend + season)
summary(logSales)
##
## Call:
## tslm(formula = log(souvSales.Train) ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4529 -0.1163 0.0001 0.1005 0.3438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.646363 0.084120 90.898 < 2e-16 ***
## trend 0.021120 0.001086 19.449 < 2e-16 ***
## season2 0.282015 0.109028 2.587 0.012178 *
## season3 0.694998 0.109044 6.374 3.08e-08 ***
## season4 0.373873 0.109071 3.428 0.001115 **
## season5 0.421710 0.109109 3.865 0.000279 ***
## season6 0.447046 0.109158 4.095 0.000130 ***
## season7 0.583380 0.109217 5.341 1.55e-06 ***
## season8 0.546897 0.109287 5.004 5.37e-06 ***
## season9 0.635565 0.109368 5.811 2.65e-07 ***
## season10 0.729490 0.109460 6.664 9.98e-09 ***
## season11 1.200954 0.109562 10.961 7.38e-16 ***
## season12 1.952202 0.109675 17.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared: 0.9424, Adjusted R-squared: 0.9306
## F-statistic: 80.4 on 12 and 59 DF, p-value: < 2.2e-16
# To forecast February 2002 using t = 86 and season = 2
logSalesFebForecast <- logSales$coefficients["(Intercept)"] + logSales$coefficients["trend"]*86 + logSales$coefficients["season2"]
# Convert back to original scale
exp(logSalesFebForecast)
## (Intercept)
## 17062.99
The point forecast for February 2002 is 17,062.99 AUD.
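As a cross-check on the hand calculation, the same number can be obtained from forecast() by extending the horizon to month 86, i.e. h = 14 from the end of the 72-month training period, and taking the 14th point forecast (February 2002). A sketch; febCheck is a throwaway name:
febCheck <- forecast(logSales, h = 14)
exp(febCheck$mean[14])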
# Create an ACF plot with lag.max of 15
errACF <- Acf(logSales$residuals,lag.max=15)
errACF
##
## Autocorrelations of series 'logSales$residuals', by lag
##
## 0 1 2 3 4 5 6 7 8 9
## 1.000 0.459 0.485 0.194 0.088 0.154 0.016 0.030 0.106 0.034
## 10 11 12 13 14 15
## 0.152 -0.055 -0.012 -0.047 -0.077 -0.023
# Fit an AR(2) model to the regression residuals
AR2 <- Arima(logSales$residuals,order=c(2,0,0))
AR2
## Series: logSales$residuals
## ARIMA(2,0,0) with non-zero mean
##
## Coefficients:
## ar1 ar2 intercept
## 0.3072 0.3687 -0.0025
## s.e. 0.1090 0.1102 0.0489
##
## sigma^2 estimated as 0.0205: log likelihood=39.03
## AIC=-70.05 AICc=-69.46 BIC=-60.95
# Calculate the t-statistic for each coefficient: estimate / s.e.
AR2$coef["ar1"] / 0.1090
## ar1
## 2.818482
AR2$coef["ar2"] / 0.1102
## ar2
## 3.346186
# Calculate p-values for normal distribution
2*pnorm(-abs(AR2$coef["ar1"] / 0.1090))
## ar1
## 0.004825136
2*pnorm(-abs(AR2$coef["ar1"] / 0.1102))
## ar1
## 0.005306886
# Create an ACF plot of the residuals from the AR(2) model
Acf(AR2$residuals)
i. Examining the ACF plot and the estimated coefficients of the AR(2) model (and their statistical significance), what can we learn about the regression forecasts?
All lags in the residual ACF are within the significance thresholds, indicating no remaining significant autocorrelation; the autocorrelations of the AR(2) residuals can be treated as 0. Both the lag-1 and lag-2 coefficients are statistically significant, with the lag-2 coefficient having the smaller p-value. The AR(2) model therefore captures real structure in the regression residuals, and adding its forecasts to the regression forecasts should improve them.
# Use the forecast method to generate forecasts for the validation period
regForecast <- forecast(logSales, h = validLength)
regForecast
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2001 9.188097 8.917220 9.458974 8.769890 9.606304
## Feb 2001 9.491232 9.220354 9.762109 9.073024 9.909439
## Mar 2001 9.925335 9.654457 10.196212 9.507127 10.343542
## Apr 2001 9.625329 9.354452 9.896207 9.207122 10.043537
## May 2001 9.694286 9.423408 9.965163 9.276078 10.112493
## Jun 2001 9.740741 9.469864 10.011619 9.322534 10.158949
## Jul 2001 9.898195 9.627318 10.169072 9.479988 10.316402
## Aug 2001 9.882831 9.611954 10.153708 9.464624 10.301038
## Sep 2001 9.992619 9.721742 10.263496 9.574412 10.410826
## Oct 2001 10.107664 9.836787 10.378542 9.689457 10.525872
## Nov 2001 10.600248 10.329370 10.871125 10.182040 11.018455
## Dec 2001 11.372615 11.101738 11.643493 10.954408 11.790823
# Generate a forecast for the error terms using the residual AR(2) model
errForecast <- forecast(AR2, h = validLength)
errForecast
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2001 0.107882119 -0.07561892 0.2913832 -0.1727585 0.3885227
## Feb 2001 0.098551352 -0.09341395 0.2905167 -0.1950342 0.3921370
## Mar 2001 0.069245043 -0.14069093 0.2791810 -0.2518243 0.3903144
## Apr 2001 0.056801003 -0.15830920 0.2719112 -0.2721817 0.3857837
## May 2001 0.042171322 -0.17774923 0.2620919 -0.2941681 0.3785108
## Jun 2001 0.033088136 -0.18905521 0.2552315 -0.3066508 0.3728271
## Jul 2001 0.024902959 -0.19881529 0.2486212 -0.3172446 0.3670505
## Aug 2001 0.019038932 -0.20554500 0.2436229 -0.3244325 0.3625104
## Sep 2001 0.014219137 -0.21092154 0.2393598 -0.3301038 0.3585421
## Oct 2001 0.010576068 -0.21489090 0.2360430 -0.3342459 0.3553980
## Nov 2001 0.007679567 -0.21798999 0.2333491 -0.3374522 0.3528114
## Dec 2001 0.005446339 -0.22034478 0.2312375 -0.3398714 0.3507641
# Create adjusted forecast by adding the two forecasts together
adjForecast <- regForecast$mean + errForecast$mean
adjForecast
## Jan Feb Mar Apr May Jun Jul
## 2001 9.295979 9.589783 9.994580 9.682130 9.736457 9.773830 9.923098
## Aug Sep Oct Nov Dec
## 2001 9.901870 10.006838 10.118240 10.607927 11.378062
# Convert back to original scale
exp(adjForecast[1])
## [1] 10894.13
The forecast for January 2001 is 10,894.13 AUD.
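To check whether the AR(2) adjustment actually improves on the regression-only forecasts, both can be scored against the actual validation-period sales. A sketch using the forecast package's accuracy() function:
# Regression-only vs. regression + AR(2) error forecasts, compared on the original scale
accuracy(exp(regForecast$mean), souvSales.Valid)
accuracy(exp(adjForecast), souvSales.Valid)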
Problem 4: Shipments of Household Appliances
appShip.ts <- ts(appShipments$Shipments, start = c(1985,1), frequency = 4)
plot(appShip.ts, ylim = c(3000, 5000), bty = "l", ylab = "Shipments (millions $)")
There appears to be a repeating within-year pattern, indicating seasonality. Because the data are quarterly, we would expect the lag-4 autocorrelation to have the largest coefficient in absolute value.
The ACF plot confirms that lag-4 has the largest coefficient in absolute value. However, none of the autocorrelations at lags greater than 0 cross the significance thresholds, so none are statistically significant.
Acf(appShip.ts)
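Printing the autocorrelation values backs up the visual read of the plot (a sketch):
Acf(appShip.ts, plot = FALSE)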