1. DATA INTRODUCTION

This time we will do a time series & forecast analysis for average motorcycle sales data at a dealership from 1967-2019.

1.1. Data Preparation

Read The data.

(motosale <- read.csv("DAUTONSA.csv"))
  • DATE: the date when the sale was recorded
  • DAUTONSA: average sales.

1.2. Data Preprocessing

Change the DATE data type into datetime data type.

motosale <- motosale %>% 
  mutate(DATE = ymd (DATE))
glimpse(motosale)
## Rows: 630
## Columns: 2
## $ DATE     <date> 1967-01-01, 1967-02-01, 1967-03-01, 1967-04-01, 1967-05-01, …
## $ DAUTONSA <dbl> 564.1, 509.1, 670.4, 710.2, 744.8, 780.2, 627.2, 517.2, 547.3…

Check the completeness of the time interval. Since this is mandatory for time series data, no time period should be missed.

motosale <- motosale %>% 
  pad()
## pad applied on the interval: month

Check missing value. This is mandatory for time series data, there should be no missing value in the data.

anyNA(motosale)
## [1] FALSE
colSums(is.na(motosale))
##     DATE DAUTONSA 
##        0        0

Based on the above checks, there are no missing values in our data. So, we don’t need to fill in the missing values.

2. CREATING TIME SERIES OBJECT

Firstly we need to know the start and the end of the time interval for time series. And then we can create the time series object.

range(motosale$DATE)
## [1] "1967-01-01" "2019-06-01"

And now We can create the time series object.

motosale_ts <- ts(data = motosale$DAUTONSA, start = c(1967, 1), frequency = 12)
glimpse(motosale_ts)
##  Time-Series [1:630] from 1967 to 2019: 564 509 670 710 745 ...

View general data visualization.

motosale_ts %>% 
  autoplot()

glimpse(motosale_ts)
##  Time-Series [1:630] from 1967 to 2019: 564 509 670 710 745 ...

3.SEASONALITY ANALYSIS

Before doing the forecasting model, we need to observe the time series object from the decompose result. The main idea of decomposition is to describe the three components of the object ts (trend, seasonal, residual).

motosale_ts %>% 
  decompose() %>% 
  autoplot()

Based on the results of the decomposition plot above, now We know that the type of time series we have is additive.

4. MODEL FITTING & EVALUATION

4.1.Cross Validation

Before doing the time series analysis, we will doing data cross validation by splitting motosale_ts into data train and data test.

# data train for 50 years
(motosale_ts_train <- head(motosale_ts, 50*12))
##          Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep
## 1967 564.100 509.100 670.400 710.200 744.800 780.200 627.200 517.200 547.300
## 1968 630.200 623.700 766.800 728.800 811.000 781.000 736.600 634.500 563.300
## 1969 645.400 662.100 722.300 753.800 794.700 798.100 662.300 555.000 709.400
## 1970 539.200 598.100 646.000 691.300 698.600 800.200 640.900 526.100 488.500
## 1971 584.600 635.000 753.900 734.800 745.700 795.300 666.200 564.800 755.900
## 1972 605.000 692.700 765.700 768.300 880.200 869.300 762.400 650.300 735.100
## 1973 729.000 767.400 954.700 854.400 962.500 900.700 800.900 679.200 748.800
## 1974 545.400 561.500 645.300 693.500 756.600 686.800 681.100 660.000 586.100
## 1975 457.600 530.900 517.000 508.900 592.000 608.400 626.700 525.300 581.800
## 1976 580.100 640.900 804.100 776.800 781.900 816.900 725.900 606.800 637.000
## 1977 592.800 654.700 880.400 809.800 821.300 906.600 720.000 715.200 646.500
## 1978 538.100 619.600 869.200 848.800 947.100 931.300 745.900 740.100 653.700
## 1979 635.600 666.300 852.600 752.800 786.900 690.800 679.600 697.500 592.800
## 1980 588.000 591.500 670.400 541.000 498.700 511.100 542.400 486.800 486.100
## 1981 470.000 543.800 719.200 533.800 524.300 517.900 497.400 602.200 518.900
## 1982 368.200 457.200 576.000 498.900 584.500 451.900 430.000 409.400 488.500
## 1983 414.100 442.500 600.300 577.700 630.000 668.100 576.900 531.000 538.300
## 1984 583.400 655.000 756.200 721.100 803.300 727.400 684.100 603.500 566.700
## 1985 628.000 645.300 768.900 788.300 807.600 676.700 633.500 744.800 839.400
## 1986 635.900 613.400 649.400 719.600 786.300 736.300 648.700 673.300 924.900
## 1987 427.800 558.100 683.400 693.800 622.300 656.700 611.000 654.000 613.200
## 1988 531.200 649.100 733.700 651.600 702.000 722.400 605.300 603.000 578.800
## 1989 512.100 553.600 641.700 667.400 710.500 640.700 603.400 685.400 609.700
## 1990 546.800 533.600 625.800 598.800 644.300 635.100 595.800 571.500 584.800
## 1991 421.100 479.000 545.000 510.100 581.100 592.700 585.300 503.100 498.200
## 1992 417.300 497.100 541.000 545.700 565.100 635.300 573.100 483.000 515.600
## 1993 435.800 466.200 581.500 607.100 639.500 671.300 598.900 537.300 554.100
## 1994 488.500 573.300 700.500 637.500 661.400 702.600 562.900 603.800 599.900
## 1995 471.700 531.500 654.200 561.800 673.500 701.700 593.200 658.200 591.100
## 1996 476.500 593.100 670.900 653.300 749.500 673.500 619.300 615.900 588.000
## 1997 509.900 547.100 638.300 589.400 649.900 621.500 612.000 599.900 548.800
## 1998 429.000 507.200 599.800 593.700 665.600 689.300 505.000 534.000 579.000
## 1999 440.000 551.300 640.000 598.000 667.500 672.500 603.400 632.700 568.200
## 2000 477.700 603.800 648.800 601.100 647.300 650.100 561.200 606.000 570.200
## 2001 444.000 544.800 597.100 514.500 617.600 614.100 485.800 519.100 476.200
## 2002 360.400 451.400 534.700 526.300 554.300 560.400 534.100 598.500 427.300
## 2003 381.800 414.800 497.200 471.400 529.900 509.600 488.600 529.200 430.200
## 2004 356.500 422.800 498.500 451.800 537.500 479.100 491.800 454.600 434.500
## 2005 350.000 414.200 517.200 509.700 494.900 504.500 508.100 499.100 462.200
## 2006 419.600 422.100 494.200 501.600 506.100 494.600 481.500 493.100 441.900
## 2007 345.000 389.500 488.300 425.900 536.200 495.500 415.100 471.600 426.900
## 2008 346.700 387.300 448.100 422.300 519.500 423.200 388.100 414.800 335.600
## 2009 200.000 217.500 283.100 279.400 315.800 295.900 354.400 469.300 263.200
## 2010 240.043 272.668 375.165 343.517 389.739 335.316 336.816 319.060 305.593
## 2011 258.576 338.615 445.315 405.418 362.214 348.950 329.517 340.459 327.466
## 2012 323.424 430.209 528.054 431.834 491.050 459.854 393.252 452.837 411.972
## 2013 383.879 446.362 544.516 466.742 519.361 499.660 446.915 516.827 383.073
## 2014 344.189 409.921 534.342 470.222 564.577 499.801 479.974 556.870 414.495
## 2015 386.010 419.495 525.027 477.314 563.176 491.397 481.665 496.815 446.569
## 2016 361.397 427.270 502.115 447.760 468.254 452.263 437.189 433.380 419.560
##          Oct     Nov     Dec
## 1967 664.800 617.800 614.700
## 1968 884.800 785.400 678.700
## 1969 816.700 705.900 638.600
## 1970 629.500 435.500 425.500
## 1971 932.000 846.200 647.400
## 1972 925.500 884.400 713.700
## 1973 850.500 772.200 568.300
## 1974 621.800 500.200 423.500
## 1975 764.600 646.500 591.200
## 1976 722.700 713.200 685.700
## 1977 859.100 727.400 637.400
## 1978 874.100 759.200 636.800
## 1979 721.300 599.600 554.300
## 1980 664.200 529.600 471.600
## 1981 491.600 432.100 357.600
## 1982 487.500 558.100 448.000
## 1983 664.200 590.500 559.400
## 1984 689.600 600.700 560.700
## 1985 598.400 515.800 558.000
## 1986 632.800 521.100 673.300
## 1987 523.900 486.000 550.700
## 1988 591.700 553.800 616.800
## 1989 524.200 473.600 455.800
## 1990 598.100 485.800 476.500
## 1991 526.200 457.600 437.500
## 1992 537.100 464.700 501.600
## 1993 581.500 538.400 522.400
## 1994 605.400 553.400 566.000
## 1995 590.300 540.900 546.000
## 1996 571.100 491.900 503.300
## 1997 550.900 463.300 531.300
## 1998 593.500 457.900 551.100
## 1999 527.200 491.900 526.100
## 2000 506.300 449.700 439.400
## 2001 620.600 431.100 389.500
## 2002 436.700 395.000 437.400
## 2003 413.300 394.700 411.900
## 2004 397.100 353.800 455.600
## 2005 395.200 384.800 433.700
## 2006 374.400 357.500 430.100
## 2007 383.600 387.700 431.800
## 2008 286.900 233.100 285.400
## 2009 281.400 248.600 349.800
## 2010 279.713 256.271 337.598
## 2011 321.021 304.949 363.464
## 2012 369.820 382.692 444.846
## 2013 398.003 406.615 421.205
## 2014 430.949 428.752 475.786
## 2015 447.743 387.789 472.123
## 2016 377.266 374.906 444.215
# data test for 2 years
(motosale_ts_test <- tail(motosale_ts, 24))
##          Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep
## 2017                                                 376.212 394.215 406.634
## 2018 280.479 328.988 416.105 326.924 394.895 372.417 321.267 338.095 329.564
## 2019 277.863 285.093 372.433 295.505 346.105 327.864                        
##          Oct     Nov     Dec
## 2017 345.309 340.959 362.059
## 2018 329.158 306.410 342.587
## 2019

4.2.Model Fitting & Evaluation

Since the data having the trend & seasonal. So, We will doing model fitting with Holtwinters automatically & manually.

(HWmod_A <- HoltWinters(x = motosale_ts_train,
                                          seasonal = "additive"))
## Holt-Winters exponential smoothing with trend and additive seasonal component.
## 
## Call:
## HoltWinters(x = motosale_ts_train, seasonal = "additive")
## 
## Smoothing parameters:
##  alpha: 0.4346224
##  beta : 0.0090972
##  gamma: 0.3273598
## 
## Coefficients:
##              [,1]
## a    458.25406181
## b     -0.15374837
## s1  -115.02082013
## s2   -57.96014962
## s3    34.15414076
## s4   -20.28723146
## s5    34.02672743
## s6     0.08182971
## s7   -17.24927712
## s8    18.41107282
## s9   -44.38528371
## s10  -59.31287755
## s11  -78.45111822
## s12  -23.16461175
(HWmod_M <- HoltWinters(x = motosale_ts_train,
                                         alpha = 0.01,
                                         beta = 0.001,
                                         gamma = 0.2,
                                         seasonal = "additive"))
## Holt-Winters exponential smoothing with trend and additive seasonal component.
## 
## Call:
## HoltWinters(x = motosale_ts_train, alpha = 0.01, beta = 0.001,     gamma = 0.2, seasonal = "additive")
## 
## Smoothing parameters:
##  alpha: 0.01
##  beta : 0.001
##  gamma: 0.2
## 
## Coefficients:
##            [,1]
## a    2898.08795
## b       5.49666
## s1  -2343.54992
## s2  -2288.43978
## s3  -2197.63878
## s4  -2253.18654
## s5  -2204.86448
## s6  -2245.77533
## s7  -2273.08019
## s8  -2244.71445
## s9  -2313.84341
## s10 -2331.97370
## s11 -2354.33420
## s12 -2297.99273

Then We will do prediction for the model We have create based on the data test.

HWmod_A_fc <- forecast(object = HWmod_A, h = 24)

HWmod_M_fc <- forecast(object = HWmod_M, h = 24)

Forecast interpretation.

motosale_ts_train %>% 
  autoplot() +
  autolayer(object = HWmod_A_fc) +
  autolayer(object = HWmod_M_fc)

motosale_ts_test %>% 
  autoplot() + 
  autolayer(object = HWmod_A_fc) +
  autolayer(object = HWmod_M_fc)

Model evaluation

forecast::accuracy(HWmod_A_fc$mean, motosale_ts_test)
##                 ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## Test set -75.56059 80.70608 75.56059 -21.92006 21.92006 0.4272952  1.746043
forecast::accuracy(HWmod_M_fc$mean, motosale_ts_test)
##                 ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## Test set -345.6412 349.9037 345.6412 -100.0839 100.0839 0.7711069  7.584226

4.3.Compare Model

Before we decide the model that We will use, We will build another model that compatible to the data and compare which one is better. So, We build another model. Since the data having a trend & seasonal. So, we can not use the Exponential Smoothing model. We will try to build a model using SARIMA.

Data staionary check.

motosale_ts_train %>% 
  adf.test()
## Warning in adf.test(.): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -4.6672, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary

from adf.test() above we can see that the data already stationeer. So, We don’t need to do differencing and We can directly build the model.

(SARIMAmod <- auto.arima(y = motosale_ts_train))
## Series: motosale_ts_train 
## ARIMA(1,0,2)(0,1,2)[12] with drift 
## 
## Coefficients:
##          ar1      ma1      ma2     sma1     sma2    drift
##       0.9474  -0.3852  -0.1973  -0.6686  -0.1561  -0.4653
## s.e.  0.0176   0.0445   0.0414   0.0410   0.0400   0.2788
## 
## sigma^2 = 2819:  log likelihood = -3173.13
## AIC=6360.25   AICc=6360.44   BIC=6390.89

Then We will do prediction for the model We have create based on the data test.

SARIMAmod_fc <- forecast(object = SARIMAmod, h = 24)

Forecast interpretation.

motosale_ts_train %>% 
  autoplot() +
  autolayer(object = motosale_ts_test, series = "data validation") +
  autolayer(object = SARIMAmod_fc$mean, series = "forecast") +
  autolayer(object = SARIMAmod$fitted, series = "Fitted value")

Model evaluation

forecast::accuracy(SARIMAmod_fc$mean, motosale_ts_test)
##                 ME    RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## Test set -54.56376 61.8537 55.97338 -15.81701 16.16367 0.5279831  1.325008

5. CONCLUSION

From the evaluation of the 3 model that We build, SARIMA model is the best result.
After We know the best model We will use, now we will do Assumption Check for the residual.

5.1. No-Autocorrelation Residual

Box.test(SARIMAmod$residuals, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  SARIMAmod$residuals
## X-squared = 0.0064117, df = 1, p-value = 0.9362

\(H_0\): residual has no-autocorrelation \(H_1\): residual has autocorrelation

p-value > 0.05 (alpha), no-autocorrelation

Insight : residual has no-autocorrelation

5.2. Normality Residual

We will check the normality residual with visualization.

hist(SARIMAmod$residuals)

Based on the visualization, the residual is not normally distributed. Now, We will check using saphiro test.

shapiro.test(SARIMAmod$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  SARIMAmod$residuals
## W = 0.97023, p-value = 1.101e-09

H0 : Residual data is normally distributed
H1 : Residual data is not normally distributed

Reject H0, if p-value < alpha (0.05)

Insight : Residual data is not normally distributed

5.3. Summary

Based on the assumption check, there is no autocorrelation on our forecast residuals (p-value > 0.05). Still, our forecast’s residuals are not distributed normally, therefore it’s residuals may not be appeared around its mean as seen in the histogram. But, if we inspect the distribution of residuals through a line plot, it is actually resembles the error plot from our time series object decomposition.

In a time series, such errors might emerge from various unpredictable events and is actually quite unavoidable. One strategy to overcome it is to analyze what kinds of unpredictable events that might occur and occurs frequently. This can be done by time series analysis using seasonality adjustment.