Homework 2 Predictive analytics

Salma Elshahawy

2021-02-16

Chapter 3 - The forecaster’s toolbox

Problem 3.1

For the following series, find an appropriate Box-Cox transformation in order to stabilize the variance.

usnetelec, usgdp, mcopper, enplanements

Solution

Let’s wrap the lambda calculation in a function to keep the code DRY.
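The code chunk itself isn’t rendered in this output; a minimal sketch of such a helper, assuming the fpp2 package is loaded (show_lambda is a hypothetical name), could look like this:

```r
library(fpp2)  # provides BoxCox.lambda(), autoplot(), and the four data sets

# Hypothetical helper: estimate lambda, report it, and plot the transformed series
show_lambda <- function(series, name) {
  lambda <- BoxCox.lambda(series)
  print(paste("The lambda for the", name, "is", round(lambda, 4)))
  print(autoplot(BoxCox(series, lambda)))
}

show_lambda(usnetelec, "usnetelec")
show_lambda(usgdp, "usgdp")
show_lambda(mcopper, "mcopper")
show_lambda(enplanements, "enplanements")
```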

#> [1] "The lambda for the usnetelec is 0.5168"

#> [1] "The lambda for the usgdp is 0.3664"

#> [1] "The lambda for the mcopper is 0.1919"

#> [1] "The lambda for the enplanements is -0.2269"

Problem 3.2

Why is a Box-Cox transformation unhelpful for the cangas data?

Solution
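The chunk isn’t shown here either; a sketch reusing the hypothetical helper from Problem 3.1 and comparing the raw and transformed series:

```r
show_lambda(cangas, "cangas")

# Side-by-side check: the seasonal variation still changes size over
# time in both the raw and the transformed series
lambda <- BoxCox.lambda(cangas)
autoplot(cangas)
autoplot(BoxCox(cangas, lambda))
```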

#> [1] "The lambda for the cangas is 0.5768"

As the plots illustrate, even with a lambda of about 0.58, the transformation fails to make the series easier to interpret: the size of the seasonal variation is still not uniform across the whole series. This is because the seasonal variation increases and then decreases again (roughly between 1965 and 1990) rather than changing monotonically with the level of the series, which is the only kind of changing variation a Box-Cox transformation can stabilize.


Problem 3.3

What Box-Cox transformation would you select for your retail data (from Exercise 3 in Section 2.10)?

Solution
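A sketch of the setup, assuming the retail series was read from retail.xlsx as in Exercise 3 of Section 2.10; the column A3349873A is a hypothetical stand-in for whichever series was actually used:

```r
# Assumed setup carried over from Exercise 3, Section 2.10
retaildata <- readxl::read_excel("retail.xlsx", skip = 1)
myts <- ts(retaildata[["A3349873A"]],  # hypothetical column choice
           frequency = 12, start = c(1982, 4))

lambda <- BoxCox.lambda(myts)
print(paste("The lambda for the Retail data is", round(lambda, 4)))
autoplot(BoxCox(myts, lambda))
```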

#> [1] "The lambda for the Retail data is 0.1276"

I would pick the Box-Cox transformation with a lambda of about 0.13. As the plot illustrates, this lambda succeeds in stabilizing and smoothing the variance of the retail time series.


Problem 3.8

For your retail time series (from Exercise 3 in Section 2.10):

  a. Split the data into two parts using window().
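Per the exercise prompt (and consistent with the forecasts below starting in January 2011):

```r
myts.train <- window(myts, end = c(2010, 12))
myts.test <- window(myts, start = 2011)
```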

  b. Check that your data have been split appropriately by producing the following plot.
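The plot is not rendered in this output; the exercise’s suggested code for it is:

```r
autoplot(myts) +
  autolayer(myts.train, series = "Training") +
  autolayer(myts.test, series = "Test")
```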

  c. Calculate forecasts using snaive applied to myts.train.
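A minimal chunk that would produce the forecast table below; for monthly data, snaive defaults to a two-year horizon:

```r
fc <- snaive(myts.train)  # default h = 2 * frequency = 24 months
fc
```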
#>          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
#> Jan 2011          266.2 240.2540 292.1460 226.5190 305.8810
#> Feb 2011          240.0 214.0540 265.9460 200.3190 279.6810
#> Mar 2011          267.5 241.5540 293.4460 227.8190 307.1810
#> Apr 2011          260.7 234.7540 286.6460 221.0190 300.3810
#> May 2011          272.8 246.8540 298.7460 233.1190 312.4810
#> Jun 2011          260.5 234.5540 286.4460 220.8190 300.1810
#> Jul 2011          268.5 242.5540 294.4460 228.8190 308.1810
#> Aug 2011          277.0 251.0540 302.9460 237.3190 316.6810
#> Sep 2011          278.7 252.7540 304.6460 239.0190 318.3810
#> Oct 2011          279.0 253.0540 304.9460 239.3190 318.6810
#> Nov 2011          319.3 293.3540 345.2460 279.6190 358.9810
#> Dec 2011          400.2 374.2540 426.1460 360.5190 439.8810
#> Jan 2012          266.2 229.5068 302.8932 210.0826 322.3174
#> Feb 2012          240.0 203.3068 276.6932 183.8826 296.1174
#> Mar 2012          267.5 230.8068 304.1932 211.3826 323.6174
#> Apr 2012          260.7 224.0068 297.3932 204.5826 316.8174
#> May 2012          272.8 236.1068 309.4932 216.6826 328.9174
#> Jun 2012          260.5 223.8068 297.1932 204.3826 316.6174
#> Jul 2012          268.5 231.8068 305.1932 212.3826 324.6174
#> Aug 2012          277.0 240.3068 313.6932 220.8826 333.1174
#> Sep 2012          278.7 242.0068 315.3932 222.5826 334.8174
#> Oct 2012          279.0 242.3068 315.6932 222.8826 335.1174
#> Nov 2012          319.3 282.6068 355.9932 263.1826 375.4174
#> Dec 2012          400.2 363.5068 436.8932 344.0826 456.3174
  d. Compare the accuracy of your forecasts against the actual values stored in myts.test.
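A one-liner that produces the table below, scoring the forecast object against the held-out test set:

```r
accuracy(fc, myts.test)
```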
#>                     ME     RMSE      MAE       MPE      MAPE     MASE      ACF1
#> Training set  7.772973 20.24576 15.95676  4.702754  8.109777 1.000000 0.7385090
#> Test set     55.300000 71.44309 55.78333 14.900996 15.082019 3.495907 0.5315239
#>              Theil's U
#> Training set        NA
#> Test set      1.297866
  e. Check the residuals.
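A one-liner that produces the residual diagnostics and the Ljung-Box output below:

```r
checkresiduals(fc)  # residual plots (not rendered here) plus the Ljung-Box test
```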

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from Seasonal naive method
#> Q* = 624.45, df = 24, p-value < 2.2e-16
#> 
#> Model df: 0.   Total lags used: 24

Do the residuals appear to be uncorrelated and normally distributed?

The residuals look roughly normally distributed, but they are not uncorrelated. They are not centered on zero (potential bias), the ACF plot shows several lags exceeding the 95% confidence bounds, and the Ljung-Box test returns a highly significant p-value, so the residuals are not white noise. This suggests that another model, or additional variables, could better capture the remaining signal in the data.

How sensitive are the accuracy measures to the training/test split?
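The code behind the output below isn’t shown, so the exact alternative evaluation is an assumption; one plausible reconstruction keeps the same training window, scores a longer horizon, and adds a mean-forecast benchmark via meanf:

```r
# Hypothetical reconstruction: the 36-month horizon and the meanf
# benchmark are assumptions, not the author's confirmed settings
snaive_fc <- snaive(myts.train, h = 36)
mean_fc <- meanf(myts.train, h = 36)

accuracy(snaive_fc, myts.test)
print("-----------")
print("Accuracy using mean_fc")
accuracy(mean_fc, myts.test)
```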

#>                     ME      RMSE      MAE       MPE      MAPE     MASE
#> Training set  7.772973  20.24576 15.95676  4.702754  8.109777 1.000000
#> Test set     81.744444 100.00869 82.06667 20.549055 20.669738 5.143067
#>                   ACF1 Theil's U
#> Training set 0.7385090        NA
#> Test set     0.6830879   1.67023
#> [1] "-----------"
#> [1] "Accuraccy using mean_fc"
#>                         ME      RMSE      MAE       MPE     MAPE     MASE
#> Training set -3.071617e-15  87.29571  73.3728 -26.93168 51.43421 4.598228
#> Test set      1.593357e+02 179.25923 159.3357  41.26616 41.26616 9.985472
#>                   ACF1 Theil's U
#> Training set 0.8503979        NA
#> Test set     0.6199172  2.985541

We see that the snaive model produces lower scores across the majority of measures, indicating better forecasting accuracy than the mean-forecast benchmark. The test-set errors also change substantially between the two evaluations, so the accuracy measures are quite sensitive to the choice of training/test split.

