R Notebook: Lab I

1. Read in data and data wrangling

1.1 Data exploration and visualization

Exercise 1 Read in the data in RSFHFSN_Furniture.xlsx and store it as furniture_data. Convert the entire data into a time series object and name it total_ts. The frequency should be 12. Visualize total_ts. What is the clear pattern that you see?

rm(list=ls())
library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(readxl)
furniture_data <- read_excel("./RSFHFSN_Furniture.xlsx")
total_ts <- ts(furniture_data$RSFHFSN, start = c(1992,1), freq=12)

show(total_ts)

##        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
## 1992  3699  3759  3997  3981  4109  4188  4237  4273  4161  4324  4508  5086
## 1993  3954  3766  4209  4212  4370  4402  4513  4561  4455  4602  4976  5432
## 1994  3881  4000  4658  4538  4698  4784  4809  5172  5027  5143  5542  6096
## 1995  4501  4270  4922  4664  5032  5033  5028  5334  5249  5332  5893  6210
## 1996  4639  4684  5242  5139  5424  5305  5493  5704  5409  5775  6221  6523
## 1997  5148  4960  5536  5493  5825  5625  5838  6071  5903  6169  6573  7320
## 1998  5646  5411  5971  5851  6034  6115  6265  6301  6135  6487  6968  7850
## 1999  5752  5823  6593  6289  6493  6656  6764  7037  6938  7132  7788  8643
## 2000  6461  6805  7420  6944  7432  7287  7310  7761  7337  7461  8125  8411
## 2001  6866  6619  7372  6753  7351  7291  7259  7757  6888  7399  8320  9178
## 2002  6956  6900  7566  7167  7770  7293  7455  7974  7247  7637  8589  9261
## 2003  6984  6523  7400  7158  7843  7526  7838  8143  7786  8111  8761  9882
## 2004  7437  7400  8259  7765  7886  8209  8649  8639  8211  8479  9336 10625
## 2005  7696  7762  8550  8190  8415  8723  8718  9350  8974  8870  9965 11008
## 2006  8381  8248  9178  8346  8980  9212  9018  9695  9278  8953  9941 10782
## 2007  8588  8220  9183  8336  9030  8932  8952  9601  8643  8812  9731 10192
## 2008  7939  7887  8158  7815  8381  7904  8269  8238  7474  7506  7907  8630
## 2009  6611  6416  6713  6420  6777  6722  6985  6954  6764  6650  7161  8156
## 2010  6145  6285  7062  6535  6739  6703  7036  7124  6898  6600  7418  8249
## 2011  6073  6311  7319  6705  6886  6839  7043  7470  7242  6771  7715  8787
## 2012  6677  6976  7568  6679  7468  7175  7341  7840  7239  7064  8044  8987
## 2013  7117  6920  7695  7146  7658  7390  7805  8285  7686  7897  8668  9013
## 2014  7071  7100  7966  7654  8244  7749  8261  8591  8235  8301  8799 10053
## 2015  7840  7407  8495  8150  8785  8479  9012  9056  8886  9026  9398 10810
## 2016  8117  8323  9198  8545  8943  9056  9114  9503  9461  8978  9866 10756
## 2017  8232  8200  9605  8541  9369  9317  9195  9813  9467  9436 10468 11181
## 2018  8679  8508  9963  9152  9883  9692  9703 10169  9560  9746 10563 10912
## 2019  8641  8463  9825  9394 10105  9565  9986 10402  9914 10027 10789 10926
## 2020  9225  9193  7897  3780  7072  9581 10172 10692 10671 10645 10697 11493
## 2021 10215  9690 12133 11591 11580 11350 11660 11728 11616 11704 12443 12340
## 2022 10107 10422 12363 11967 11977 11654 11503 12294 11852 11770 12368 12227
## 2023 11121 10678 12098 10868 11541 11338 10970 11726 11092 10707 11958 11490
## 2024  9745 10140 10801 10388 11186 10597 11145 11679 11038 11194 12345 12455
## 2025 10414 10444 11549 11219 11766 11053 11840 12018

plot(total_ts)

Exercise 2 Use the window() function to create a subset of the total_ts that spans from 1992/01 to 2019/12 and name the new time series object furniture_ts. Visualize it as well. What is the pattern you see and why?

furniture_ts <- ts(furniture_data$RSFHFSN, start = c(1992, 1), end = c(2019, 12), freq=12)

show(furniture_ts)

##        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
## 1992  3699  3759  3997  3981  4109  4188  4237  4273  4161  4324  4508  5086
## 1993  3954  3766  4209  4212  4370  4402  4513  4561  4455  4602  4976  5432
## 1994  3881  4000  4658  4538  4698  4784  4809  5172  5027  5143  5542  6096
## 1995  4501  4270  4922  4664  5032  5033  5028  5334  5249  5332  5893  6210
## 1996  4639  4684  5242  5139  5424  5305  5493  5704  5409  5775  6221  6523
## 1997  5148  4960  5536  5493  5825  5625  5838  6071  5903  6169  6573  7320
## 1998  5646  5411  5971  5851  6034  6115  6265  6301  6135  6487  6968  7850
## 1999  5752  5823  6593  6289  6493  6656  6764  7037  6938  7132  7788  8643
## 2000  6461  6805  7420  6944  7432  7287  7310  7761  7337  7461  8125  8411
## 2001  6866  6619  7372  6753  7351  7291  7259  7757  6888  7399  8320  9178
## 2002  6956  6900  7566  7167  7770  7293  7455  7974  7247  7637  8589  9261
## 2003  6984  6523  7400  7158  7843  7526  7838  8143  7786  8111  8761  9882
## 2004  7437  7400  8259  7765  7886  8209  8649  8639  8211  8479  9336 10625
## 2005  7696  7762  8550  8190  8415  8723  8718  9350  8974  8870  9965 11008
## 2006  8381  8248  9178  8346  8980  9212  9018  9695  9278  8953  9941 10782
## 2007  8588  8220  9183  8336  9030  8932  8952  9601  8643  8812  9731 10192
## 2008  7939  7887  8158  7815  8381  7904  8269  8238  7474  7506  7907  8630
## 2009  6611  6416  6713  6420  6777  6722  6985  6954  6764  6650  7161  8156
## 2010  6145  6285  7062  6535  6739  6703  7036  7124  6898  6600  7418  8249
## 2011  6073  6311  7319  6705  6886  6839  7043  7470  7242  6771  7715  8787
## 2012  6677  6976  7568  6679  7468  7175  7341  7840  7239  7064  8044  8987
## 2013  7117  6920  7695  7146  7658  7390  7805  8285  7686  7897  8668  9013
## 2014  7071  7100  7966  7654  8244  7749  8261  8591  8235  8301  8799 10053
## 2015  7840  7407  8495  8150  8785  8479  9012  9056  8886  9026  9398 10810
## 2016  8117  8323  9198  8545  8943  9056  9114  9503  9461  8978  9866 10756
## 2017  8232  8200  9605  8541  9369  9317  9195  9813  9467  9436 10468 11181
## 2018  8679  8508  9963  9152  9883  9692  9703 10169  9560  9746 10563 10912
## 2019  8641  8463  9825  9394 10105  9565  9986 10402  9914 10027 10789 10926

plot(furniture_ts)

Exercise 3 Based on furniture_ts, compute the annual total sales for each year from 1992 to 2019 using the aggregate() function and store it to annual_ts. Further plot annual_ts, which gives a clear visualization for trend.

annual_ts <- aggregate(furniture_ts, FUN=sum, nfrequency = 1)

show(annual_ts)

## Time Series:
## Start = 1992 
## End = 2019 
## Frequency = 1 
##  [1]  50322  53452  58348  61468  65558  70461  75034  81908  88754  89053
## [11]  91815  93955 100895 106221 110012 108220  96108  82329  82794  85161
## [21]  89058  93280  98024 105344 109860 112824 116530 118037

plot(annual_ts, main = "Annual sales")

Exercise 4 Compute the monthly sales averaged across 1992-2019 using the tapply() and cycle() function and store it to monthly_sales. Further plot monthly_sales, which gives a clear visualization for monthly seasonality.

monthly_sales <- tapply(furniture_ts, INDEX=cycle(furniture_ts), FUN=mean)

show(monthly_sales)

##        1        2        3        4        5        6        7        8 
## 6633.250 6569.500 7343.679 6912.929 7354.643 7256.143 7424.857 7743.500 
##        9       10       11       12 
## 7373.821 7453.143 8144.036 8891.393

plot(monthly_sales, main = "Monthly sales", type = "l")

2. Linear Regression based forecasting

2.1 Data partition

Note that for time series forecasting, data partition should respect the time order as we would like to evaluate the performance of statistical models on the most recent data. Partition furniture_ts into a validation set of length 3 years. Allocate the rest of the data to training data.

nTotal <- length(furniture_ts)
nValid <- 3 * 12
nTrain <- nTotal - nValid

train_ts <- window(furniture_ts, start=c(1992,1), end=c(1992, nTrain))
valid_ts <- window(furniture_ts, start = c(1992,nTrain+1))

show(nTotal)

## [1] 336

show(nValid)

## [1] 36

show(nTrain)

## [1] 300

show(start(train_ts))

## [1] 1992    1

show(end(train_ts))

## [1] 2016   12

show(start(valid_ts))

## [1] 2017    1

show(end(valid_ts))

## [1] 2019   12

2.2 An additive model

Exercise 5 Estimate an additive regression model with linear trend and dummy variable seasonality. Name it model1. Plot the fitted value of model1 for the training data, along with the training data.

model1 <- tslm(train_ts~trend+season)

summary(model1)

## 
## Call:
## tslm(formula = train_ts ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1672.0  -819.2  -156.9   824.8  2213.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4514.509    213.433  21.152  < 2e-16 ***
## trend         13.053      0.641  20.364  < 2e-16 ***
## season2      -69.213    271.720  -0.255 0.799121    
## season3      615.934    271.722   2.267 0.024148 *  
## season4      212.682    271.726   0.783 0.434445    
## season5      603.549    271.731   2.221 0.027123 *  
## season6      511.496    271.738   1.882 0.060805 .  
## season7      675.003    271.747   2.484 0.013564 *  
## season8      958.831    271.756   3.528 0.000487 ***
## season9      589.458    271.768   2.169 0.030905 *  
## season10     654.525    271.781   2.408 0.016658 *  
## season11    1310.832    271.795   4.823 2.30e-06 ***
## season12    2086.860    271.811   7.678 2.56e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 960.7 on 287 degrees of freedom
## Multiple R-squared:  0.6489, Adjusted R-squared:  0.6343 
## F-statistic: 44.21 on 12 and 287 DF,  p-value: < 2.2e-16

plot(train_ts)

lines(fitted(model1), col = "red")

legend("topleft", legend = c("Training", "Model 1"), col = c("black", "red"), lty = 1, bty = "n")

Exercise 6 Which month has the highest sales based on the estimated coefficients of model1? Does it match the visualization plot in Exercise 4?

# December. Yes, it matches the visualization plot in exercise 4

2.3 An additive model with an event effect

We further create an exogenous variable to capture the post-financial-crisis period using the function time() and we store it as xreg.

post_crisis_train <- as.numeric(time(train_ts)>=2008)
xreg <- cbind(post_crisis=post_crisis_train)

Exercise 7 Estimate an additive regression model with linear trend and dummy variable seasonality and the created exogeneous variable in xreg. Name it model2. What is the effect of the 2008 financial crisis?

model2 <- tslm(train_ts~trend+season+xreg)

summary(model2)

## 
## Call:
## tslm(formula = train_ts ~ trend + season + xreg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -752.95 -240.63  -76.32  128.40 2216.92 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3548.3401   103.8332  34.173  < 2e-16 ***
## trend          27.4732     0.5382  51.049  < 2e-16 ***
## season2       -83.6332   126.5522  -0.661 0.509234    
## season3       587.0936   126.5556   4.639 5.33e-06 ***
## season4       169.4204   126.5613   1.339 0.181750    
## season5       545.8672   126.5693   4.313 2.22e-05 ***
## season6       439.3940   126.5796   3.471 0.000598 ***
## season7       588.4808   126.5922   4.649 5.11e-06 ***
## season8       857.8876   126.6071   6.776 7.04e-11 ***
## season9       474.0944   126.6243   3.744 0.000219 ***
## season10      524.7413   126.6437   4.143 4.51e-05 ***
## season11     1166.6281   126.6654   9.210  < 2e-16 ***
## season12     1928.2349   126.6894  15.220  < 2e-16 ***
## xreg        -3124.4259    97.0198 -32.204  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 447.4 on 286 degrees of freedom
## Multiple R-squared:  0.9241, Adjusted R-squared:  0.9207 
## F-statistic: 267.9 on 13 and 286 DF,  p-value: < 2.2e-16

Exercise 8 Plot the fitted value of both model1 and model2 for the training data, along with the training data.

plot(train_ts)
lines(model1$fitted.values, col = 'red')
lines(model2$fitted.values, col = 'blue')

2.4 Forecast evaluation

Exercise 9 Let’s now evaluate the forecasting performance of the two models model1 and model2 on the validation data. First, let’s generate the prediction by each model and store them in model1_pred and model2_pred respectively. Set the level of confidence for the prediction interval as 90. We then use the accuracy() function to obtain the error metrics.

Note that to generate the forecast result for model2, we need to construct an accompanying data.matrix named newxreg for the post event indicator variable.

model1_pred <- forecast(model1, h=nValid, level=90)

model1_pred

##          Point Forecast    Lo 90    Hi 90
## Jan 2017       8443.392 6818.315 10068.47
## Feb 2017       8387.232 6762.155 10012.31
## Mar 2017       9085.432 7460.355 10710.51
## Apr 2017       8695.232 7070.155 10320.31
## May 2017       9099.152 7474.075 10724.23
## Jun 2017       9020.152 7395.075 10645.23
## Jul 2017       9196.712 7571.635 10821.79
## Aug 2017       9493.592 7868.515 11118.67
## Sep 2017       9137.272 7512.195 10762.35
## Oct 2017       9215.392 7590.315 10840.47
## Nov 2017       9884.752 8259.675 11509.83
## Dec 2017      10673.832 9048.755 12298.91
## Jan 2018       8600.025 6973.610 10226.44
## Feb 2018       8543.865 6917.450 10170.28
## Mar 2018       9242.065 7615.650 10868.48
## Apr 2018       8851.865 7225.450 10478.28
## May 2018       9255.785 7629.370 10882.20
## Jun 2018       9176.785 7550.370 10803.20
## Jul 2018       9353.345 7726.930 10979.76
## Aug 2018       9650.225 8023.810 11276.64
## Sep 2018       9293.905 7667.490 10920.32
## Oct 2018       9372.025 7745.610 10998.44
## Nov 2018      10041.385 8414.970 11667.80
## Dec 2018      10830.465 9204.050 12456.88
## Jan 2019       8756.658 7128.808 10384.51
## Feb 2019       8700.498 7072.648 10328.35
## Mar 2019       9398.698 7770.848 11026.55
## Apr 2019       9008.498 7380.648 10636.35
## May 2019       9412.418 7784.568 11040.27
## Jun 2019       9333.418 7705.568 10961.27
## Jul 2019       9509.978 7882.128 11137.83
## Aug 2019       9806.858 8179.008 11434.71
## Sep 2019       9450.538 7822.688 11078.39
## Oct 2019       9528.658 7900.808 11156.51
## Nov 2019      10198.018 8570.168 11825.87
## Dec 2019      10987.098 9359.248 12614.95

post_crisis_valid <- as.numeric(time(valid_ts)>=2008)

newxreg <- cbind(post_crisis=post_crisis_valid)

model2_pred <- forecast(model2, h=nValid, level=90, newdata=newxreg)

## Warning in forecast.lm(model2, h = nValid, level = 90, newdata = newxreg):
## newdata column names not specified, defaulting to first variable required.

model2_pred

##          Point Forecast     Lo 90     Hi 90
## Jan 2017       8693.346  7936.363  9450.329
## Feb 2017       8637.186  7880.203  9394.169
## Mar 2017       9335.386  8578.403 10092.369
## Apr 2017       8945.186  8188.203  9702.169
## May 2017       9349.106  8592.123 10106.089
## Jun 2017       9270.106  8513.123 10027.089
## Jul 2017       9446.666  8689.683 10203.649
## Aug 2017       9743.546  8986.563 10500.529
## Sep 2017       9387.226  8630.243 10144.209
## Oct 2017       9465.346  8708.363 10222.329
## Nov 2017      10134.706  9377.723 10891.689
## Dec 2017      10923.786 10166.803 11680.769
## Jan 2018       9023.024  8265.216  9780.832
## Feb 2018       8966.864  8209.056  9724.672
## Mar 2018       9665.064  8907.256 10422.872
## Apr 2018       9274.864  8517.056 10032.672
## May 2018       9678.784  8920.976 10436.592
## Jun 2018       9599.784  8841.976 10357.592
## Jul 2018       9776.344  9018.536 10534.152
## Aug 2018      10073.224  9315.416 10831.032
## Sep 2018       9716.904  8959.096 10474.712
## Oct 2018       9795.024  9037.216 10552.832
## Nov 2018      10464.384  9706.576 11222.192
## Dec 2018      11253.464 10495.656 12011.272
## Jan 2019       9352.702  8593.921 10111.484
## Feb 2019       9296.542  8537.761 10055.324
## Mar 2019       9994.742  9235.961 10753.524
## Apr 2019       9604.542  8845.761 10363.324
## May 2019      10008.462  9249.681 10767.244
## Jun 2019       9929.462  9170.681 10688.244
## Jul 2019      10106.022  9347.241 10864.804
## Aug 2019      10402.902  9644.121 11161.684
## Sep 2019      10046.582  9287.801 10805.364
## Oct 2019      10124.702  9365.921 10883.484
## Nov 2019      10794.062 10035.281 11552.844
## Dec 2019      11583.142 10824.361 12341.924

accuracy(model1_pred$mean, valid_ts)

##                ME     RMSE      MAE      MPE     MAPE      ACF1 Theil's U
## Test set 298.7718 404.0957 354.5877 2.952371 3.607991 0.1354622  0.490252

accuracy(model2_pred$mean, valid_ts)

##                 ME     RMSE      MAE      MPE     MAPE      ACF1 Theil's U
## Test set -124.2274 308.9465 233.1961 -1.43048 2.508064 0.1508157 0.3612082

Exercise 10 Which model provides better forecasting performance on the validation data? Does the additional event effect help with prediction?

# Model 2 provides better forecasting performance on the validation data and the additional event effect helps significantly with the prediction.

Exercise 11 Does model1 over or under estimate the furniture sales in the validation data?

# Model 1 underestimated the furniture sales in the validation data.

Exercise 12 Visualize the prediction of model2, together with the actual time series and the predicted mean of model1.

plot(model2_pred)
lines(valid_ts)
lines(model1_pred$mean, col='red')
abline(v=2017)
legend("topleft", legend = c("model1", "model2", "actual"), col = c("red", "blue", "black"), lty = 1, bty = "n")

2.5 An additive model with fourier seasonality and event effect (Optional)

Exercise 13-15 are optional.

We further create a fourier series based seasonality variables with K=4 and name it fs_train. Furthermore, we combine post_crisis_train and fs_train together into the data.matrix named xreg that stores all the exogenous covariates.

# fs_train <- fourier(train_ts, K=4)
# xreg <- cbind(post_crisis=post_crisis_train, fs=fs_train)

Exercise 13 Estimate an additive regression model with linear trend and fourier series based seasonality and the post event indicator stored in xreg. Name it model3.

Exercise 14 Generate the prediction by model3 and store it in model3_pred. Note that to generate the forecast result for model3, we need to construct an accompanying data.matrix named newxreg that contains 1. fourier series and 2. the post event indicator variable for the validation period. Set the level of confidence for the prediction interval as 90.

Use the accuracy() function to obtain the error metrics for model3. Among model1,model2,model3, which one performs the best in the validation data?

# fs_valid <- fourier(train_ts, K=4, h=nValid)
# newxreg <- cbind(post_crisis=post_crisis_valid, fs=fs_valid)

Exercise 15 What is the upper bound of the 90% Prediction Interval by model3 for 2019-December?