QUESTION ONE

This question refers to the data set that you are using for your project. For your data set determine the best simple forecasting method (average, naïve, seasonal naïve, linear trend). Use 80% of the data for training and 20% of the data for testing. Using the best model, forecast your data set for 6 periods into the future.

Load Library and Indigo sales data, indicate that it is a time series, and create a name variable
indigo_data <- read.csv("~/Dropbox (Personal)/ECON4210/Assignment Two/Question One/indigo_data.csv")
summary(indigo_data)
##       year         quarter          sales      
##  Min.   :2002   Min.   :1.000   Min.   :135.0  
##  1st Qu.:2006   1st Qu.:1.500   1st Qu.:181.5  
##  Median :2009   Median :2.000   Median :194.0  
##  Mean   :2009   Mean   :2.476   Mean   :222.8  
##  3rd Qu.:2013   3rd Qu.:3.000   3rd Qu.:251.0  
##  Max.   :2017   Max.   :4.000   Max.   :400.0
indigo_data = ts(indigo_data, start=2002, frequency=4)
data = indigo_data[,3]
Plot Indigo Sales Data
plot(indigo_data[,3],main="Indigo Quarterly Sales", xlab="Year",ylab="Millions")

Set the training and testing windows
train <- window(data,start=c(2002, 1),end=c(2014, 2))
test <- window(data, start=c(2014,2),end=c(2017, 3))
both <- window(data,start=c(2002, 1))
h=length(test)
Forecast for Average, Naïve, Seasonal Naïve Methods with benchmarks
Indigofit1 <- meanf(train, h=h)
plot(Indigofit1,main="Indigo Quarterly Sales Average Forecast", xlab="Year",ylab="Millions")

Indigofit2 <- naive(train, h=h)
plot(Indigofit2,main="Indigo Quarterly Sales Naive Forecast ", xlab="Year",ylab="Millions")

Indigofit3 <- snaive(train, h=h)
plot(Indigofit3,main="Indigo Quarterly Sales Seasonal Naive Forecast", xlab="Year",ylab="Millions")

Forecast a Linear Trend
Indigofit4 <- tslm(data ~ trend)
f <- forecast(Indigofit4, h=6,level=c(80,95))
plot(f,main="Indigo Quarterly Sales Linear Trend", ylab="Millions", xlab="Year")
lines(fitted(Indigofit4),col="blue")

Determine the accuracy of the benchmark models
a1Indigofit1 = accuracy(Indigofit1, test)
a2Indigofit2 = accuracy(Indigofit2, test)
a3Indigofit3 = accuracy(Indigofit3, test)
Combine the forecast summary statistics into a table with row names
a.table<-rbind(a1Indigofit1,a2Indigofit2, a3Indigofit3)

row.names(a.table)<-c('Mean training','Mean test', 'Naive training', 'Naive test', 
                      'S. Naive training', 'S. Naive test' )

#Order the table according to MASE
a.table<-as.data.frame(a.table)
a.table<-a.table[order(a.table$MASE),]
a.table
##                              ME      RMSE       MAE       MPE      MAPE
## S. Naive training  2.891304e+00  12.11449  9.934783  1.217238  4.837041
## S. Naive test      2.023077e+01  28.44833 21.000000  7.895729  8.164698
## Mean training     -5.680872e-15  59.58758 47.872000 -6.431668 21.411814
## Mean test          3.090769e+01  85.70588 60.261538  4.626308 20.342311
## Naive test         6.830769e+01 105.14825 68.307692 21.109562 21.109562
## Naive training     8.979592e-01  88.40722 68.326531 -5.869436 29.343977
##                       MASE       ACF1 Theil's U
## S. Naive training 1.000000  0.4453930        NA
## S. Naive test     2.113786  0.5470580 0.2896050
## Mean training     4.818626 -0.1011995        NA
## Mean test         6.065713 -0.1526504 0.8400265
## Naive test        6.875610 -0.1526504 1.0194622
## Naive training    6.877506 -0.3814527        NA
Make the table look nice
kable(a.table, caption="Forecast accuracy",digits = 4 )
Forecast accuracy
ME RMSE MAE MPE MAPE MASE ACF1 Theil’s U
S. Naive training 2.8913 12.1145 9.9348 1.2172 4.8370 1.0000 0.4454 NA
S. Naive test 20.2308 28.4483 21.0000 7.8957 8.1647 2.1138 0.5471 0.2896
Mean training 0.0000 59.5876 47.8720 -6.4317 21.4118 4.8186 -0.1012 NA
Mean test 30.9077 85.7059 60.2615 4.6263 20.3423 6.0657 -0.1527 0.8400
Naive test 68.3077 105.1482 68.3077 21.1096 21.1096 6.8756 -0.1527 1.0195
Naive training 0.8980 88.4072 68.3265 -5.8694 29.3440 6.8775 -0.3815 NA

When determining how accurate the naïve, seasonal naïve and mean model is, I will be looking at the following: RSME, MAE, MAPE and MASE. It is evident that in all of those, the seasonal naïve model produces the least amount of error. This is evident because:

RSME: 105.14825 > 85.70588 > 28.44833
MAE: 68.30769 > 60.26154 > 21
MAPE: 21.10956 > 20.34231> 8.164698
MASE: 6.87561 > 6.065713 > 2.113786

In every single scenario, the season naive forecast has the lowest number which indicates the smallest error. Therefore, using these accuracies, the season naive forecast is the most accurate.

However, now we must take into consideration another model, the linear trend model.

Test Accuracy for Linear Trend Model
par(mfrow=c(1,2))
res1 <- ts(resid(Indigofit4), s = 2002, f = 4)
plot.ts(res1, main="Linear Trend Forecast Residuals", ylab = "res(Indigo Sales)")
abline(0, 0)
Acf(res1, main="Autocorrelation of Linear Trend Residuals")

summary(Indigofit4)
## 
## Call:
## tslm(formula = data ~ trend)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67.84 -42.30 -28.77  39.25 146.12 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 190.6687    16.3084  11.691   <2e-16 ***
## trend         1.0034     0.4431   2.265   0.0271 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63.95 on 61 degrees of freedom
## Multiple R-squared:  0.07755,    Adjusted R-squared:  0.06243 
## F-statistic: 5.128 on 1 and 61 DF,  p-value: 0.02711

When we analyze the residuals and the ACFs of the residuals, it is apparent that there are large spikes on lag 4, 8, 12 and 16 going upwards, and large spikes going downward at 2, 6, 10, and 14. This shows that there is a high degree of both pattern (which appears to be seasonal) and autocorrelation of the residuals. Therefore, the estimated model violates the assumption of no autocorrelation in the errors and our forecasts may be inefficient.

If we analyze this further (to determine if seasonality is the better forecast) we can consider the linear trend forecasts summary. When assessing how well the model fit the data, we will first look at the residual statistics which indicate (like our plots) that the distribution of the residuals does not appear to be strongly symmetrical. Taking into consideration that the linear trend forecast shows a high degree of error, and the seasonal naïve forecast shows a small degree of error, I would say that the best forecast is the seasonal naïve approach. With that conclusion, the forecasts for the next 6 periods would be:

Time 2017,4: 220
Time 2018, 1: 193
Time 2018, 2: 217
Time 2018, 3: 400
Time 2018, 4: 220
Time 2019, 1: 193

QUESTION TWO

This question explores how Magna sales (millions of US dollars) are affected by US GDP (billions of US dollars). Use 2000:1 to 2014:4 for training and 2015:1 to 2017:1 for testing. The data are in as_2.csv.

Using the training data to estimate a linear regression with Magna sales as the dependent variable and GDP as the explanatory variable. How does this regression fit?

Load Library, Indigo sales data, and indicate that it is a time series
as_2 <- read.csv("~/Dropbox (Personal)/ECON4210/Assignment Two/as_2.csv")
summary(as_2)
##       date         MGA            GDP       
##  2000Q1 : 1   Min.   :2354   Min.   :10031  
##  2000Q2 : 1   1st Qu.:4623   1st Qu.:12181  
##  2000Q3 : 1   Median :6050   Median :14566  
##  2000Q4 : 1   Mean   :6014   Mean   :14406  
##  2001Q1 : 1   3rd Qu.:7727   3rd Qu.:16297  
##  2001Q2 : 1   Max.   :9443   Max.   :19058  
##  (Other):63
as_2 = ts(as_2, start=2000, frequency=4)
Investigate the relationship between MGA and GDP
plot(jitter(MGA) ~ jitter(GDP),main= "Scatter plot of Magna Sales versus GDP", xlab="GDP (millions)",ylab="MGA (millions)",data=as_2)
fit <- lm(MGA ~ GDP, data=as_2)
abline(fit)

Set the training and testing windows
trainData <- window(as_2,start=c(2001, 1),end=c(2014, 4))
testData <- window(as_2, start=c(2015,1),end=c(2017, 1))
both <- window(as_2,start=c(2002, 1))
h=length(testData)
Investigate the relationship between Magna Sales and GDP using training data
fit1 <- lm(MGA ~ GDP, data=trainData)
plot(jitter(MGA) ~ jitter(GDP), main= "Scatter plot of Magna Sales versus GDP", xlab="GDP (millions)",ylab="Magna Sales (millions)",data=trainData)
abline(fit)

Test to see if this regression fits well using residuals
res <- residuals(fit1)
plot(jitter(res)~jitter(GDP), main= "Scatter plot of Residuals for Magna Sales versus GDP", ylab="Residuals", xlab="GDP", data=trainData)
abline(0,0)

This linear regression model fits well. This is evident because the straight line that goes through the plot shows that majority of the points land on or around it, beside one or two outliers. To further verify that this model fits well we must analyze the scatterplot of the residuals against the predictor variable. The residual plot shows that there is a high degree of randomness and no clear pattern, as the residuals are both positive and negative, spread out, and are relatively symmetrically distributed. This indicates that a simple linear regression is appropriate for this data.

Forecast Magna sales over the test period. Use a random walk with drift to forecast GDP over the test period

Extract GDP, use random walk with drift to forecast GDP
x = as_2[,3]
fit2 = rwf(trainData[,3],h=9,drift=T,level=c(80,95),fan=FALSE,lambda=NULL) 
plot(fit2,main="GDP Forecast using Random Walk with Drift", xlab="Year",ylab="Millions")

h=length(testData)
head(fit2)
## $method
## [1] "Random walk with drift"
## 
## $level
## [1] 80 95
## 
## $x
##         Qtr1    Qtr2    Qtr3    Qtr4
## 2001 10508.1 10638.4 10639.5 10701.3
## 2002 10834.4 10934.8 11037.1 11103.8
## 2003 11230.1 11370.7 11625.1 11816.8
## 2004 11988.4 12181.4 12367.7 12562.2
## 2005 12813.7 12974.1 13205.4 13381.6
## 2006 13648.9 13799.8 13908.5 14066.4
## 2007 14233.2 14422.3 14569.7 14685.3
## 2008 14668.4 14813.0 14843.0 14549.9
## 2009 14383.9 14340.4 14384.1 14566.5
## 2010 14681.1 14888.6 15057.7 15230.2
## 2011 15238.4 15460.9 15587.1 15785.3
## 2012 15973.9 16121.9 16227.9 16297.3
## 2013 16475.4 16541.4 16749.3 16999.9
## 2014 17031.3 17320.9 17622.3 17735.9
## 
## $mean
##          Qtr1     Qtr2     Qtr3     Qtr4
## 2015 17867.31 17998.73 18130.14 18261.56
## 2016 18392.97 18524.39 18655.80 18787.22
## 2017 18918.63                           
## 
## $lower
##              80%      95%
## 2015 Q1 17733.42 17662.54
## 2015 Q2 17807.66 17706.52
## 2015 Q3 17894.06 17769.08
## 2015 Q4 17986.57 17840.99
## 2016 Q1 18082.88 17918.73
## 2016 Q2 18181.83 18000.50
## 2016 Q3 18282.73 18085.24
## 2016 Q4 18385.13 18172.28
## 2017 Q1 18488.73 18261.15
## 
## $upper
##              80%      95%
## 2015 Q1 18001.21 18072.09
## 2015 Q2 18189.80 18290.94
## 2015 Q3 18366.23 18491.21
## 2015 Q4 18536.55 18682.12
## 2016 Q1 18703.06 18867.21
## 2016 Q2 18866.94 19048.28
## 2016 Q3 19028.87 19226.36
## 2016 Q4 19189.30 19402.15
## 2017 Q1 19348.53 19576.11

Forecast Magna Sales, accounting for random walk and drift for the GDP variable

fcast <- forecast(fit1, newdata=data.frame(GDP=c(17867.31,17998.73,18130.14,18261.56,18392.97,18524.39, 18655.80,18787.22,18918.63)))
plot(fcast, ylab="% change in Magna Sales", xlab="% change in GDP")

prediction <- predict(fit1, newdata = data.frame(testData))
head(fcast)
## $model
## 
## Call:
## lm(formula = MGA ~ GDP, data = trainData)
## 
## Coefficients:
## (Intercept)          GDP  
##  -5762.9113       0.8251  
## 
## 
## $mean
##        1        2        3        4        5        6        7        8 
## 8979.725 9088.162 9196.591 9305.028 9413.457 9521.894 9630.323 9738.760 
##        9 
## 9847.188 
## 
## $lower
##       [,1]     [,2]
## 1 7964.209 7410.474
## 2 8070.606 7515.759
## 3 8176.931 7620.937
## 4 8283.199 7726.023
## 5 8389.396 7831.002
## 6 8495.537 7935.891
## 7 8601.607 8040.675
## 8 8707.622 8145.370
## 9 8813.567 8249.960
## 
## $upper
##        [,1]     [,2]
## 1  9995.242 10548.98
## 2 10105.719 10660.57
## 3 10216.251 10772.25
## 4 10326.857 10884.03
## 5 10437.518 10995.91
## 6 10548.251 11107.90
## 7 10659.038 11219.97
## 8 10769.897 11332.15
## 9 10880.810 11444.42
## 
## $level
## [1] 80 95
## 
## $x
##  [1] 2863 2817 2517 2829 3121 2896 2962 3443 3496 3660 3566 4623 5103 5113
## [15] 4784 5653 5718 5858 5381 5854 6019 6369 5424 6368 6423 6731 6077 6836
## [29] 6622 6713 5533 4836 3574 3705 4669 5419 5512 6050 5942 6598 7189 7338
## [43] 6970 7251 7666 7727 7411 8033 8361 8962 8338 9174 8455 8911 8247 8790

Therefore, the forecasts for Magna sales over the test period are as follows:

Time 2015,1: 8979.725
Time 2015,2: 9088.162
Time 2015,3: 9196.591
Time 2015,4: 9305.028
Time 2016,1: 9413.457
Time 2016,2: 9521.894
Time 2016,3: 9630.323
Time 2016,4: 9738.760
Time 2017,1: 9847.188

Use a linear trend to forecast Magna sales for the test period

Forcasting Magna sales using a Linear Trend

data = trainData[,2]
fit3 <- tslm(data ~ trend)
f <- forecast(fit3, h=9,level=c(80,95))
plot(f,main="Magna Sales using Linear trend Forecast", ylab="Millions", xlab="Year")
lines(fitted(fit3),col="blue")

head(f)
## $model
## 
## Call:
## tslm(formula = data ~ trend)
## 
## Coefficients:
## (Intercept)        trend  
##      2974.0        100.2  
## 
## 
## $mean
##          Qtr1     Qtr2     Qtr3     Qtr4
## 2015 8686.675 8786.896 8887.118 8987.340
## 2016 9087.561 9187.783 9288.005 9388.226
## 2017 9488.448                           
## 
## $lower
##             [,1]     [,2]
## 2015 Q1 7477.222 6817.739
## 2015 Q2 7575.213 6914.513
## 2015 Q3 7673.130 7011.174
## 2015 Q4 7770.975 7107.724
## 2016 Q1 7868.749 7204.162
## 2016 Q2 7966.451 7300.491
## 2016 Q3 8064.082 7396.710
## 2016 Q4 8161.643 7492.820
## 2017 Q1 8259.134 7588.821
## 
## $upper
##              [,1]     [,2]
## 2015 Q1  9896.127 10555.61
## 2015 Q2  9998.580 10659.28
## 2015 Q3 10101.106 10763.06
## 2015 Q4 10203.704 10866.96
## 2016 Q1 10306.374 10970.96
## 2016 Q2 10409.115 11075.08
## 2016 Q3 10511.927 11179.30
## 2016 Q4 10614.809 11283.63
## 2017 Q1 10717.762 11388.07
## 
## $level
## [1] 80 95
## 
## $x
##      Qtr1 Qtr2 Qtr3 Qtr4
## 2001 2863 2817 2517 2829
## 2002 3121 2896 2962 3443
## 2003 3496 3660 3566 4623
## 2004 5103 5113 4784 5653
## 2005 5718 5858 5381 5854
## 2006 6019 6369 5424 6368
## 2007 6423 6731 6077 6836
## 2008 6622 6713 5533 4836
## 2009 3574 3705 4669 5419
## 2010 5512 6050 5942 6598
## 2011 7189 7338 6970 7251
## 2012 7666 7727 7411 8033
## 2013 8361 8962 8338 9174
## 2014 8455 8911 8247 8790

Therefore, the forecasts for Magna sales over the test period are as follows: ######2015,1: 8686.675 ######2015,2: 8786.896 ######2015,3: 8887.118 ######2015,4: 8987.340 ######2016,1: 9087.561 ######2016,2: 9187.783 ######2016,3: 9288.005 ######2016,4: 9388.226 ######2017,1: 9488.448

Compare the accuracy of the two forecasting models. Which model forecasts the best?

Comparing the accuracy of the forecasting models

par(mfrow=c(2,2))

#linear regression
res2 <- ts(resid(fcast),s=2000,f=4)
plot.ts(res2,main="Linear Regression Forecast Residuals", ylab="res (Magna Sales)")
abline(0,0)
Acf(res2, main="Autocorrelation of Linear Regression Residuals")

#linear trend 
res1 <- ts(resid(fit3),s=2000,f=4)
plot.ts(res1, main="Linear Trend Forecast Residuals", ylab = "res (Magna Sales)")
abline(0,0)
Acf(res1, main="Autocorrelation of Linear Trend Residuals")

To compare the accuracy of the two forecasting models, we examine the residuals and the ACFs of the residuals. Although both ACF’s show significant spikes both up and down near the lower and high lags, the autocorrelation of the linear trend model residuals show that more lines extend beyond the blue lines, meaning there is more autocorrelation in the results for that model. Therefore, the linear trend estimated model violates the assumption of no autocorrelation in the errors more than the linear regression model, and its forecasts are more inefficient. The linear regression model utilizes more information and there has better forecasts. The linear trend forecast has a larger prediction interval than the linear regression forecast.

Furthermore, when examining the residuals graphs, the linear regression residual points (plot) lie closer to and on zero (abline) than the linear trend residuals, meaning the former forecast has less error.

Therefore, the best forecast is the linear regression forecast.