Forecasting model for “TURN” show


1) Initial data: 3 full seasons each of 10 episodes. Two episodes in the raw data is located in a single row - need to be cleaned

  • Patterns and irregularities of initial data were examined

  • Used tsclean function to clean up possible outliers or missing values (results)

  • As for the frequency parameter in ts() object, we are specifying periodicity of the data, i.e., number of observations per season = 10

Tsclean function cleaned only 1 observation so we can use initial data for forecasting as there are hardly any outliers that could bias the model by skewing statistical summaries.

Alternative to TsClean function can be smoothing initial data with Moving averages ma(), but for “TURN” data we can work without it as datasample is quite small (MA can be used for other shows if delivery variable will be to volatile for them)


2) I have devided 33 observation into 30 training (first 3 seasons of Turn) and 3 test (first 3 episodes of 4th season)

SEASON BROADCAST_DATE EPISODE_NAME DELIVERY
3 2016-05-23 305 HYPOCRISY FRAUD AND TYRANNY 216
3 2016-05-30 306 MANY MICKLES MAKE A MUCKLE 211
3 2016-06-06 307 JUDGMENT 244
3 2016-06-13 308 MENDED 216
3 2016-06-20 309 BLADE ON THE FEATHER 268
3 2016-06-27 310 TRIAL AND EXECUTION 323
4 2017-06-17 401 SPYHUNTER GENERAL 236
4 2017-06-17 402 THE BLACK HOLE OF CALCUTTA 167
4 2017-06-24 403 BLOOD FOR BLOOD 189

3) Data decomposition

  • Does the series appear to have trends or seasonality?
  • Use decompose() or stl() to examine and possibly remove components of the series
  • For further analysis data should be converted into ts using ts() function

Stuctural components of Turn Data using str()

Stuctural components of Turn Data using decompose()

The building blocks of a time series analysis are seasonality, trend, and cycle. These intuitive components capture the historical patterns in the series. We have defined these components using str() function with additive model (Can later try and evaluate use of multiplicative model)


4) Substracting seasonal component

First, we calculate seasonal component of the data using stl(). STL is a flexible function for decomposing and forecasting the series. It calculates the seasonal component of the series using smoothing, and adjusts the original series by subtracting seasonality in two simple lines:


5) Substracting trend component

Further step is to exclude trend and analyze remainders of the initial dataset:

Trend can be excluded with numerous instruments and functions. From the graph you can see two analyzed options: decomposition of data using str() function and 1st differences of the initial dataset

As you can see both methods work well: successfully exclude trend. But stl() method result in relatively lower residuals (remainders)


6) Performing ADF tests (Stationarity)

Fitting an ARIMA model requires the series to be stationary. A series is said to be stationary when its mean, variance, and autocovariance are time invariant. This assumption makes intuitive sense: Since ARIMA uses previous lags of series to model its behavior, modeling stable series with consistent properties involves less uncertainty.

The augmented Dickey-Fuller (ADF) test is a formal statistical test for stationarity. The null hypothesis assumes that the series is non-stationary. ADF procedure tests whether the change in Y can be explained by lagged value and a linear trend. If contribution of the lagged value to the change in Y is non-significant and there is a presence of a trend component, the series is non-stationary and null hypothesis will not be rejected.

## 
##  Augmented Dickey-Fuller Test
## 
## data:  TrainingTurn
## Dickey-Fuller = -3.2068, Lag order = 3, p-value = 0.1091
## alternative hypothesis: stationary

## 
##  Augmented Dickey-Fuller Test
## 
## data:  Turn_d1
## Dickey-Fuller = -2.7484, Lag order = 3, p-value = 0.2857
## alternative hypothesis: stationary

## 
##  Augmented Dickey-Fuller Test
## 
## data:  detrendedTurn
## Dickey-Fuller = -5.1697, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary

Our initial TURN data is non-stationary; the average number of delivery checkouts changes through time, levels change, etc.

Usually, non-stationary series can be corrected by a simple transformation such as differencing. Differencing the series can help in removing its trend or cycles.

As can been seen in the graphs removal of trend and seasonality lead to stationary dataset that can be later used for forecasting by ARIMA models

ACF plots display correlation between a series and its lags. In addition to suggesting the order of differencing, ACF plots can help in determining the order of the M A (q) model. Partial autocorrelation plots (PACF), as the name suggests, display correlation between a variable and its lags that is not explained by previous lags. PACF plots are useful when determining the order of the AR(p) model.

  • ACF plots can help in determining the order of the M A (q) model - [ 1 | 2 | 3 | 4]
  • PACF plots are useful when determining the order of the AR(p) model - [1]

7) Forecasting seasonality

seasonality can be forecasted as similiar indexes (delivery values) provided by str() function:


8) Forecasting Trend component

Trend component obtained by stl() function can be forecasted numerous ways, the most common is to fit:

  • Linear regression with a function of time and its polynoms and/or log of time
  • Use exponential smoothing
  • Use HoltWinters forecasting method

On our data linear regression and HoltWinter model looks more promising, but they forecast trends differently:

For forecast purposes i will be using Trendforecast made by HoltWinters method


9) Forecasting of remainders with ARIMA

##                Model      AIC      BIC
## 1 Auto ARIMA (1,1,0) 321.9876 326.0895
## 2      ARIMA (1,1,1) 311.9321 316.0340
## 3      ARIMA (1,0,0) 318.3568 322.5604
## 4      ARIMA (1,0,1) 320.3128 325.9176
## 5      ARIMA (1,0,2) 320.1390 327.1449
## 6      ARIMA (1,0,3) 319.2352 327.6424
## 7      ARIMA (0,1,1) 311.1270 313.8616

There exist a number of such criteria for comparing quality of fit across multiple models. Two of the most widely used are Akaike information criteria (AIC) and Baysian information criteria (BIC). These criteria are closely related and can be interpreted as an estimate of how much information would be lost if a given model is chosen. When comparing models, we want to minimize AIC and BIC.

According to Training set results based on AIC and BIC we expect ARIMA(1,1,1) & ARIMA(0,1,1) to perform better on the test set, however there are only minor differences between AIC, BIC values for selected models, and residuals of all these models are not autocorelated, so the final decision about model quality will be based on the test error


10) Forecasting of remainders with ARIMA

##                Model      AIC      BIC     RMSE PercentageDifference
## 1 Auto ARIMA (1,1,0) 321.9876 326.0895 75.79421             31.66631
## 2      ARIMA (1,1,1) 311.9321 316.0340 48.44699             22.57598
## 3      ARIMA (1,0,0) 318.3568 322.5604 48.03851             22.48914
## 4      ARIMA (1,0,1) 320.3128 325.9176 47.64507             22.34958
## 5      ARIMA (1,0,2) 320.1390 327.1449 39.44283             19.08011
## 6      ARIMA (1,0,3) 319.2352 327.6424 47.64113             23.77766
## 7      ARIMA (0,1,1) 311.1270 313.8616 46.26052             22.01650

According to test error rate ARIMA(1,0,2) performed better than the other models, however we will add other TS models and also evaluate them on the TEST set

All forecast estimates above are provided with confidence bounds: 95% CI. Longer term forecasts will usually have more uncertainty, as the model will regress future “DELIVERY” on previously predicted values (HOWEVER in our case ARIMA(1,0,2) predicted remainders over 0, so the more influence will be expected from trend and seasonal component) ____________________________________________________________________________________________________________________

11) Forecasting of remainders with LOESS, HW

Model AIC BIC RMSE PercentageDifference
ARIMA (1,0,2) 320.14 327.14 39.44 19.08
Loess with span = 0.6 NA NA 43.63 21.26
ARIMA (0,1,1) 311.13 313.86 46.26 22.02
ARIMA (1,0,3) 319.24 327.64 47.64 23.78
ARIMA (1,0,1) 320.31 325.92 47.65 22.35
ARIMA (1,0,0) 318.36 322.56 48.04 22.49
ARIMA (1,1,1) 311.93 316.03 48.45 22.58
Auto ARIMA (1,1,0) 321.99 326.09 75.79 31.67
HW beta = .4, gamma = .1 NA NA 83.14 34.72
HW beta = FALSE, gamma = FALSE NA NA 104.75 54.16
HW beta = .5, gamma = .5 NA NA 123.49 58.34

According to test error rate the most suitable model is ARIMA(1,0,2) - with a given forecasts of Trend and Seasonal components


11) Evaluating ARIMA(1,0,2) on the whole dataset

We will need to evaluate choosen model on the whole dataset to base forecasts on the most updated data

## 
##  Augmented Dickey-Fuller Test
## 
## data:  detrendedTurnFinal
## Dickey-Fuller = -4.9955, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(BestModel)
## W = 0.9838, p-value = 0.8909

12) Forecast using ARIMA (1,0,2) for future 7 episodes of 4th season

SEASON DELIVERY EPISODE status
1 714.0000 101 Actual
1 747.0000 102 Actual
1 540.0000 103 Actual
1 553.0000 104 Actual
1 446.0000 105 Actual
1 362.0000 106 Actual
1 404.0000 107 Actual
1 443.0000 108 Actual
1 395.0000 109 Actual
1 585.0000 110 Actual
2 268.0000 201 Actual
2 304.0000 202 Actual
2 309.0000 203 Actual
2 198.0000 204 Actual
2 312.0000 205 Actual
2 320.0000 206 Actual
2 333.0000 207 Actual
2 354.0000 208 Actual
2 337.0000 209 Actual
2 427.0000 210 Actual
3 206.0000 301 Actual
3 198.0000 302 Actual
3 222.0000 303 Actual
3 208.0000 304 Actual
3 216.0000 305 Actual
3 211.0000 306 Actual
3 244.0000 307 Actual
3 216.0000 308 Actual
3 268.0000 309 Actual
3 323.0000 310 Actual
4 236.0000 401 Actual
4 167.0000 402 Actual
4 189.0000 403 Actual
4 145.7247 404 Forecasted
4 162.6368 405 Forecasted
4 143.9503 406 Forecasted
4 184.3546 407 Forecasted
4 206.1743 408 Forecasted
4 213.1575 409 Forecasted
4 331.4394 410 Forecasted

13) Further steps:

  • Do the same for other shows
  • Use manual ARIMA(1,0,2) + add more variables
  • Try to use methods similiar to crossvalidation for time series