library(forecast)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: timeDate
## This is forecast 7.3
library(readxl)
Canada_Hours <- read_excel("~/rProjects/Assignment_6/Canada_Hours.xlsx")

Question 1

# Create an annual time series object from the hours data
hoursTS <- ts(Canada_Hours$HoursPerWeek, start=c(1966), frequency=1)

# Determine the y-axis range
yrange <- range(hoursTS)

# Set up the plot
plot(c(1966, 2000), yrange, type="n", xlab="Year",  ylab="Hours Per Week", bty="l", xaxt="n", yaxt="n")

# Add the time series
lines(hoursTS, bty="l")

# Add the x-axis
axis(1, at=seq(1965,2000,5), labels=format(seq(1965,2000,5)))

# Add the y-axis
axis(2, at=seq(34.5,40, 0.5), labels=format(seq(34.5,40, 0.5)), las=2)


A)

This is a difficult question to answer without looking at the ACF plot; my first question would be whether a single series can exhibit both positive and negative autocorrelation. I know that negative autocorrelation means that neighboring values differ greatly, while positive autocorrelation means that neighboring values tend to be similar. If I had to guess, I would say this series exhibits positive autocorrelation: the trend is gradual, which implies that even though the values differ from one time period to the next, they remain similar. Let’s check…

# ACF plot showing only the lag-1 autocorrelation
Acf(hoursTS, lag.max=1)

# Full ACF plot with the default number of lags
hoursACF <- Acf(hoursTS)

hoursACF
## 
## Autocorrelations of series 'hoursTS', by lag
## 
##      0      1      2      3      4      5      6      7      8      9 
##  1.000  0.928  0.839  0.752  0.665  0.571  0.473  0.369  0.265  0.164 
##     10     11     12     13     14     15 
##  0.047 -0.082 -0.185 -0.261 -0.310 -0.346

B)

The first plot above shows a positive lag-1 autocorrelation above 0.8, which is confirmed in the output of autocorrelation values: at lag 1 the autocorrelation is 0.928. We can also see that the correlations vary throughout the series, with the strongest being the first six. Looking at the second plot, the negative values at the longer lags imply that this series has a non-linear trend.
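As a quick sanity check of that lag-1 value, we can correlate the series with a one-period shifted copy of itself. This is a minimal sketch; note that Acf() uses a slightly different denominator than the Pearson correlation, so the result will be close to, but not exactly, 0.928.

# Correlate the series with itself shifted by one year
n <- length(hoursTS)
cor(hoursTS[-1], hoursTS[-n])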

Question 2

# Read in the Walmart Stock data
WalMartStock <- read_excel("~/rProjects/Assignment_6/WalMartStock.xlsx")

# Create a time series
stockTS <- ts(WalMartStock$Close, start=c(2001,2), frequency=225)

plot(stockTS, xlab="Time",  ylab="Closing Price ($)", main = "Daily Closing Time Series", bty="l", xaxt='n')

# Add the x-axis
axis(1, at=c(2001.0, 2001.2, 2001.4, 2001.6, 2001.8, 2002.0), labels=c("Mar-01", "May-01", "Jul-01","Sep-01", "Nov-01", "Jan-02"))


A)

To create a time plot of the differenced series, I plotted the lag-1 differenced series; differencing removes the trend from the stock closing prices.
plot(diff(stockTS, lag=1), ylab="Lag-1", xlab="Time", bty="l", main="Lag-1 Difference", xaxt = 'n')
axis(1, at=c(2001.0, 2001.2, 2001.4, 2001.6, 2001.8, 2002.0), labels=c("Mar-01", "May-01", "Jul-01","Sep-01", "Nov-01", "Jan-02"))


B)

The two items that are relevant for testing whether the stock series is a random walk are:
1. The AR(1) slope coefficient for the closing price series. “A random walk is a special case of an AR(1) model, where the slope coefficient is equal to 1.”
2. The autocorrelations of the differenced series; see the sketch after this list. “Another approach for evaluating predictability, is to examine the series of differences between each pair of consecutive values (lag-1 difference series). If the ACF plot [of the differenced series] indicates that the autocorrelations at lags 1, 2, 3, etc are all approximately 0, then a random walk is inferred”
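The second check can be applied directly, using the same lag-1 differencing as in part A. A minimal sketch; if the autocorrelations below are all approximately 0, a random walk is inferred.

# ACF of the lag-1 differenced closing prices
Acf(diff(stockTS, lag=1), main="ACF of Differenced Closing Prices")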
stockAR1 <- Arima(stockTS, order=c(1,0,0))
stockAR1
## Series: stockTS 
## ARIMA(1,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1  intercept
##       0.9558    52.9497
## s.e.  0.0187     1.3280
## 
## sigma^2 estimated as 0.9815:  log likelihood=-349.8
## AIC=705.59   AICc=705.69   BIC=716.13

C)

Using the AR table alone, I would say that this series is a random walk. I came to my conclusion by looking at the slope coefficient, which is 0.9558: a series is a random walk when its slope coefficient equals 1, and 0.9558 is very close to 1. We can evaluate the series further by testing the null hypothesis that the slope equals 1: calculate the p-value using a t-distribution and a normal distribution, then compare the p-values to alpha. If the p-values are larger than alpha, we cannot reject the null hypothesis, and a random walk is implied.

# Calculate the two-tailed p-value from a t-distribution
2*pt(-abs((1 - stockAR1$coef["ar1"]) / 0.0187), df=length(stockTS)-1)
##        ar1 
## 0.01896261
# Now calculate it using the normal distribution
2*pnorm(-abs((1 - stockAR1$coef["ar1"]) / 0.0187))
##        ar1 
## 0.01818593

Assuming alpha = .01, both p-values are larger than alpha, so we cannot reject the hypothesis that the slope equals 1; a random walk is implied.
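As a robustness check, the standard error does not have to be hard-coded as 0.0187; it can be pulled from the fitted model. A sketch, assuming the coefficient covariance matrix var.coef that Arima() returns:

# Standard errors are the square roots of the diagonal of the
# coefficient covariance matrix
se.ar1 <- sqrt(diag(stockAR1$var.coef))["ar1"]

# Same two-tailed normal-distribution p-value as above
2*pnorm(-abs((1 - stockAR1$coef["ar1"]) / se.ar1))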

D)

Each of the listed statements is true of random walks:
1. A random walk series is unpredictable, so forecasting it with any accuracy is next to impossible; naive forecasts are the best guesses for a random walk (see the sketch after this list).
2. By definition, a random walk implies that the series is random.
3. Again, random walks are defined as having period-to-period changes that are random.
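Following point 1, a naive forecast for the stock series can be generated with the forecast package’s naive() function. A minimal sketch; the 10-day horizon is an arbitrary choice for illustration.

# Naive forecast: every future value equals the last observed close
naiveForecast <- naive(stockTS, h=10)
naiveForecast$mean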

Question 3

# Import the data
SouvenirSales <- read_excel("~/rProjects/Assignment_6/SouvenirSales.xlsx")

A)

Same as questions C and Ciii from Assignment 5:
The forecast for February 2002 is $17,062.99.
# Create the time series
sales.ts <- ts(SouvenirSales$Sales, start=c(1995,1), frequency=12)

# Set up training and validation period lengths
validLength <- 12
trainLength <- length(sales.ts) - validLength

# Create training period and validation period
salesTrain <- window(sales.ts, end=c(1995, trainLength))
salesValid <- window(sales.ts, start=c(1995,trainLength+1))

# Regression Model with log(sales)
salesRegMod <- tslm(log(salesTrain) ~ trend + season)
summary(salesRegMod)
## 
## Call:
## tslm(formula = log(salesTrain) ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4529 -0.1163  0.0001  0.1005  0.3438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.646363   0.084120  90.898  < 2e-16 ***
## trend       0.021120   0.001086  19.449  < 2e-16 ***
## season2     0.282015   0.109028   2.587 0.012178 *  
## season3     0.694998   0.109044   6.374 3.08e-08 ***
## season4     0.373873   0.109071   3.428 0.001115 ** 
## season5     0.421710   0.109109   3.865 0.000279 ***
## season6     0.447046   0.109158   4.095 0.000130 ***
## season7     0.583380   0.109217   5.341 1.55e-06 ***
## season8     0.546897   0.109287   5.004 5.37e-06 ***
## season9     0.635565   0.109368   5.811 2.65e-07 ***
## season10    0.729490   0.109460   6.664 9.98e-09 ***
## season11    1.200954   0.109562  10.961 7.38e-16 ***
## season12    1.952202   0.109675  17.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared:  0.9424, Adjusted R-squared:  0.9306 
## F-statistic:  80.4 on 12 and 59 DF,  p-value: < 2.2e-16
# Compute the February 2002 forecast (trend value 86) from the named coefficients
febForecast <- salesRegMod$coefficients["(Intercept)"] + salesRegMod$coefficients["trend"]*86 + salesRegMod$coefficients["season2"]

# Back-transform from the log scale and print the forecast
exp(febForecast)
## (Intercept) 
##    17062.99
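The same point forecast can be cross-checked by letting forecast() extend the trend and seasonal terms automatically. A sketch; h=14 reaches February 2002, fourteen months past the December 2000 end of the training period.

# Month 86 of the series is 14 steps beyond the 72-month training period
febCheck <- forecast(salesRegMod, h=14)

# Back-transform the February 2002 point forecast
exp(febCheck$mean[14])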

Bi)

Looking at the ACF plot, it appears that we could have used sine and cosine functions to better capture the trend; I see this in the wave-like pattern of the lag terms. Also, the low coefficients of 0.3072 and 0.3687 imply that this series is not a random walk, and the statistical significance of the coefficients shows that this model is a good fit.
# ACF of the first 35 residuals (with 35 points, the default shows 15 lags)
resACF <- Acf(salesRegMod$residuals[1:35])

# Output lag-correlations
resACF
## 
## Autocorrelations of series 'salesRegMod$residuals[1:35]', by lag
## 
##      0      1      2      3      4      5      6      7      8      9 
##  1.000  0.322  0.287 -0.069 -0.108  0.033 -0.136 -0.057  0.086  0.116 
##     10     11     12     13     14     15 
##  0.257 -0.042 -0.096 -0.090 -0.207 -0.005
# Run an AR(2) on the residuals from the fitted model
ARModel <- Arima(salesRegMod$residuals, order=c(2,0,0))
ARModel
## Series: salesRegMod$residuals 
## ARIMA(2,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1     ar2  intercept
##       0.3072  0.3687    -0.0025
## s.e.  0.1090  0.1102     0.0489
## 
## sigma^2 estimated as 0.0205:  log likelihood=39.03
## AIC=-70.05   AICc=-69.46   BIC=-60.95

The AR1 t-statistic is 2.818482, which is greater than 2 and therefore implies significance.
# Calculate the t statistics for each: coefficient / s.e.
# Rough rule is anything > 2 (or < -2) is statistically significant
ARModel$coef["ar1"] / 0.1090
##      ar1 
## 2.818482

The AR2 t-statistic is 3.346186, which is greater than 2 and therefore implies significance.
ARModel$coef["ar2"] / 0.1102
##      ar2 
## 3.346186

The AR1 p-value is 0.004825136, which is less than .01 and therefore implies significance.
# Now estimate p-value based on the normal distribution
2*pnorm(-abs(ARModel$coef["ar1"] / 0.1090))
##         ar1 
## 0.004825136

The AR2 p-value is 0.0008193151, which is less than .01 and therefore implies significance.
2*pnorm(-abs(ARModel$coef["ar2"] / 0.1102))
##          ar2 
## 0.0008193151

Bii)

# Create an ACF plot of the residuals from the AR(2) model
Acf(ARModel$residuals)

# Generate forecasts for the validation period
lrForecast <- forecast(salesRegMod, h=validLength)
lrForecast
##          Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95
## Jan 2001       9.188097  8.917220  9.458974  8.769890  9.606304
## Feb 2001       9.491232  9.220354  9.762109  9.073024  9.909439
## Mar 2001       9.925335  9.654457 10.196212  9.507127 10.343542
## Apr 2001       9.625329  9.354452  9.896207  9.207122 10.043537
## May 2001       9.694286  9.423408  9.965163  9.276078 10.112493
## Jun 2001       9.740741  9.469864 10.011619  9.322534 10.158949
## Jul 2001       9.898195  9.627318 10.169072  9.479988 10.316402
## Aug 2001       9.882831  9.611954 10.153708  9.464624 10.301038
## Sep 2001       9.992619  9.721742 10.263496  9.574412 10.410826
## Oct 2001      10.107664  9.836787 10.378542  9.689457 10.525872
## Nov 2001      10.600248 10.329370 10.871125 10.182040 11.018455
## Dec 2001      11.372615 11.101738 11.643493 10.954408 11.790823
# Generate a forecast for the error terms using the AR(2) model of the residuals
errorForecast <- forecast(ARModel, h=validLength)
errorForecast
##          Point Forecast       Lo 80     Hi 80      Lo 95     Hi 95
## Jan 2001    0.107882119 -0.07561892 0.2913832 -0.1727585 0.3885227
## Feb 2001    0.098551352 -0.09341395 0.2905167 -0.1950342 0.3921370
## Mar 2001    0.069245043 -0.14069093 0.2791810 -0.2518243 0.3903144
## Apr 2001    0.056801003 -0.15830920 0.2719112 -0.2721817 0.3857837
## May 2001    0.042171322 -0.17774923 0.2620919 -0.2941681 0.3785108
## Jun 2001    0.033088136 -0.18905521 0.2552315 -0.3066508 0.3728271
## Jul 2001    0.024902959 -0.19881529 0.2486212 -0.3172446 0.3670505
## Aug 2001    0.019038932 -0.20554500 0.2436229 -0.3244325 0.3625104
## Sep 2001    0.014219137 -0.21092154 0.2393598 -0.3301038 0.3585421
## Oct 2001    0.010576068 -0.21489090 0.2360430 -0.3342459 0.3553980
## Nov 2001    0.007679567 -0.21798999 0.2333491 -0.3374522 0.3528114
## Dec 2001    0.005446339 -0.22034478 0.2312375 -0.3398714 0.3507641
# Create the adjusted forecast by adding the two forecasts together
adjustedForecast <- lrForecast$mean + errorForecast$mean

# Single out January 2001 forecast
janForecast <- adjustedForecast[1]

# Convert January's forecast back and print the result
exp(janForecast)
## [1] 10894.13

The adjusted forecast for January 2001 is $10,894.13.
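To see how the adjusted forecasts perform across the whole validation year, they can be back-transformed and scored against the held-out sales. A sketch using accuracy() from the forecast package:

# Back-transform all twelve adjusted forecasts and compare them
# with the validation period
accuracy(exp(adjustedForecast), salesValid)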

Question 4

A)

Looking at the plot of the series below, I would say that the largest autocorrelation would be at lag-4, meaning that the same quarters would be the most highly correlated time periods (all Q1s are similar, all Q2s are similar, and so on). However, after plotting the autocorrelations I was surprised to see that lag-7 had the highest absolute correlation, at 0.471. I also plotted the ACF for the differenced series and found that lag-4 is the third highest there at 0.309, with lag-7 at 0.318 and lag-1 at 0.334. Each of these correlations is negative, implying swings in the series.
# Read in shipment data
ApplianceShipments <- read_excel("~/rProjects/Assignment_6/ApplianceShipments_Mine.xlsx")

shipmentTS <- ts(ApplianceShipments$Shipments, frequency=4)

shipmentTS
##   Qtr1 Qtr2 Qtr3 Qtr4
## 1 4009 4123 4493 4595
## 2 4245 4321 4522 4806
## 3 4799 4900 4224 4657
## 4 4551 4417 4585 3944
## 5 4030 4485 4258 4533
library(ggplot2)
library(grid)  # provides unit(), used in plot.margin below
appliancePlot <- ggplot(data = ApplianceShipments, aes(x=Quarter, y=Shipments, group=1)) +
  geom_line() + 
  theme(axis.line = element_line(colour = "black")) +
  theme(axis.text.x=element_text(angle=-45, hjust = .001)) +
  theme(plot.margin=unit(c(1,1.75,1,1),"cm")) +
  labs(title = "Quarterly Appliance Shipments", x = "Quarter", y = "Shipments in Millions U.S. $")

appliancePlot

shipmentACF <- Acf(shipmentTS)

shipmentACF
## 
## Autocorrelations of series 'shipmentTS', by lag
## 
##      0      1      2      3      4      5      6      7      8      9 
##  1.000  0.279  0.044  0.021 -0.070  0.150 -0.088 -0.471 -0.367 -0.274 
##     10     11     12     13 
##  0.011  0.002 -0.022 -0.090
difACF <- Acf(diff(shipmentTS, lag=1), lag.max=12, main="ACF Plot for Differenced Series")

difACF
## 
## Autocorrelations of series 'diff(shipmentTS, lag = 1)', by lag
## 
##      0      1      2      3      4      5      6      7      8      9 
##  1.000 -0.334 -0.094  0.109 -0.309  0.324  0.166 -0.318  0.000 -0.093 
##     10     11     12 
##  0.021  0.107  0.001