Patterns and irregularities of initial data were examined
Used tsclean function to clean up possible outliers or missing values (results)
As for the frequency parameter in ts() object, we are specifying periodicity of the data, i.e., number of observations per season = 10
Tsclean function cleaned only 1 observation so we can use initial data for forecasting as there are hardly any outliers that could bias the model by skewing statistical summaries.
Alternative to TsClean function can be smoothing initial data with Moving averages ma(), but for “TURN” data we can work without it as datasample is quite small (MA can be used for other shows if delivery variable will be to volatile for them)
| SEASON | BROADCAST_DATE | EPISODE_NAME | DELIVERY |
|---|---|---|---|
| 3 | 2016-05-23 | 305 HYPOCRISY FRAUD AND TYRANNY | 216 |
| 3 | 2016-05-30 | 306 MANY MICKLES MAKE A MUCKLE | 211 |
| 3 | 2016-06-06 | 307 JUDGMENT | 244 |
| 3 | 2016-06-13 | 308 MENDED | 216 |
| 3 | 2016-06-20 | 309 BLADE ON THE FEATHER | 268 |
| 3 | 2016-06-27 | 310 TRIAL AND EXECUTION | 323 |
| 4 | 2017-06-17 | 401 SPYHUNTER GENERAL | 236 |
| 4 | 2017-06-17 | 402 THE BLACK HOLE OF CALCUTTA | 167 |
| 4 | 2017-06-24 | 403 BLOOD FOR BLOOD | 189 |
Stuctural components of Turn Data using str()
Stuctural components of Turn Data using decompose()
The building blocks of a time series analysis are seasonality, trend, and cycle. These intuitive components capture the historical patterns in the series. We have defined these components using str() function with additive model (Can later try and evaluate use of multiplicative model)
First, we calculate seasonal component of the data using stl(). STL is a flexible function for decomposing and forecasting the series. It calculates the seasonal component of the series using smoothing, and adjusts the original series by subtracting seasonality in two simple lines:
Further step is to exclude trend and analyze remainders of the initial dataset:
Trend can be excluded with numerous instruments and functions. From the graph you can see two analyzed options: decomposition of data using str() function and 1st differences of the initial dataset
As you can see both methods work well: successfully exclude trend. But stl() method result in relatively lower residuals (remainders)
Fitting an ARIMA model requires the series to be stationary. A series is said to be stationary when its mean, variance, and autocovariance are time invariant. This assumption makes intuitive sense: Since ARIMA uses previous lags of series to model its behavior, modeling stable series with consistent properties involves less uncertainty.
The augmented Dickey-Fuller (ADF) test is a formal statistical test for stationarity. The null hypothesis assumes that the series is non-stationary. ADF procedure tests whether the change in Y can be explained by lagged value and a linear trend. If contribution of the lagged value to the change in Y is non-significant and there is a presence of a trend component, the series is non-stationary and null hypothesis will not be rejected.
##
## Augmented Dickey-Fuller Test
##
## data: TrainingTurn
## Dickey-Fuller = -3.2068, Lag order = 3, p-value = 0.1091
## alternative hypothesis: stationary
##
## Augmented Dickey-Fuller Test
##
## data: Turn_d1
## Dickey-Fuller = -2.7484, Lag order = 3, p-value = 0.2857
## alternative hypothesis: stationary
##
## Augmented Dickey-Fuller Test
##
## data: detrendedTurn
## Dickey-Fuller = -5.1697, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
Our initial TURN data is non-stationary; the average number of delivery checkouts changes through time, levels change, etc.
Usually, non-stationary series can be corrected by a simple transformation such as differencing. Differencing the series can help in removing its trend or cycles.
As can been seen in the graphs removal of trend and seasonality lead to stationary dataset that can be later used for forecasting by ARIMA models
ACF plots display correlation between a series and its lags. In addition to suggesting the order of differencing, ACF plots can help in determining the order of the M A (q) model. Partial autocorrelation plots (PACF), as the name suggests, display correlation between a variable and its lags that is not explained by previous lags. PACF plots are useful when determining the order of the AR(p) model.
seasonality can be forecasted as similiar indexes (delivery values) provided by str() function:
Trend component obtained by stl() function can be forecasted numerous ways, the most common is to fit:
On our data linear regression and HoltWinter model looks more promising, but they forecast trends differently:
For forecast purposes i will be using Trendforecast made by HoltWinters method
## Model AIC BIC
## 1 Auto ARIMA (1,1,0) 321.9876 326.0895
## 2 ARIMA (1,1,1) 311.9321 316.0340
## 3 ARIMA (1,0,0) 318.3568 322.5604
## 4 ARIMA (1,0,1) 320.3128 325.9176
## 5 ARIMA (1,0,2) 320.1390 327.1449
## 6 ARIMA (1,0,3) 319.2352 327.6424
## 7 ARIMA (0,1,1) 311.1270 313.8616
There exist a number of such criteria for comparing quality of fit across multiple models. Two of the most widely used are Akaike information criteria (AIC) and Baysian information criteria (BIC). These criteria are closely related and can be interpreted as an estimate of how much information would be lost if a given model is chosen. When comparing models, we want to minimize AIC and BIC.
According to Training set results based on AIC and BIC we expect ARIMA(1,1,1) & ARIMA(0,1,1) to perform better on the test set, however there are only minor differences between AIC, BIC values for selected models, and residuals of all these models are not autocorelated, so the final decision about model quality will be based on the test error
## Model AIC BIC RMSE PercentageDifference
## 1 Auto ARIMA (1,1,0) 321.9876 326.0895 75.79421 31.66631
## 2 ARIMA (1,1,1) 311.9321 316.0340 48.44699 22.57598
## 3 ARIMA (1,0,0) 318.3568 322.5604 48.03851 22.48914
## 4 ARIMA (1,0,1) 320.3128 325.9176 47.64507 22.34958
## 5 ARIMA (1,0,2) 320.1390 327.1449 39.44283 19.08011
## 6 ARIMA (1,0,3) 319.2352 327.6424 47.64113 23.77766
## 7 ARIMA (0,1,1) 311.1270 313.8616 46.26052 22.01650
According to test error rate ARIMA(1,0,2) performed better than the other models, however we will add other TS models and also evaluate them on the TEST set
All forecast estimates above are provided with confidence bounds: 95% CI. Longer term forecasts will usually have more uncertainty, as the model will regress future “DELIVERY” on previously predicted values (HOWEVER in our case ARIMA(1,0,2) predicted remainders over 0, so the more influence will be expected from trend and seasonal component) ____________________________________________________________________________________________________________________
| Model | AIC | BIC | RMSE | PercentageDifference |
|---|---|---|---|---|
| ARIMA (1,0,2) | 320.14 | 327.14 | 39.44 | 19.08 |
| Loess with span = 0.6 | NA | NA | 43.63 | 21.26 |
| ARIMA (0,1,1) | 311.13 | 313.86 | 46.26 | 22.02 |
| ARIMA (1,0,3) | 319.24 | 327.64 | 47.64 | 23.78 |
| ARIMA (1,0,1) | 320.31 | 325.92 | 47.65 | 22.35 |
| ARIMA (1,0,0) | 318.36 | 322.56 | 48.04 | 22.49 |
| ARIMA (1,1,1) | 311.93 | 316.03 | 48.45 | 22.58 |
| Auto ARIMA (1,1,0) | 321.99 | 326.09 | 75.79 | 31.67 |
| HW beta = .4, gamma = .1 | NA | NA | 83.14 | 34.72 |
| HW beta = FALSE, gamma = FALSE | NA | NA | 104.75 | 54.16 |
| HW beta = .5, gamma = .5 | NA | NA | 123.49 | 58.34 |
According to test error rate the most suitable model is ARIMA(1,0,2) - with a given forecasts of Trend and Seasonal components
We will need to evaluate choosen model on the whole dataset to base forecasts on the most updated data
##
## Augmented Dickey-Fuller Test
##
## data: detrendedTurnFinal
## Dickey-Fuller = -4.9955, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
##
## Shapiro-Wilk normality test
##
## data: residuals(BestModel)
## W = 0.9838, p-value = 0.8909
| SEASON | DELIVERY | EPISODE | status |
|---|---|---|---|
| 1 | 714.0000 | 101 | Actual |
| 1 | 747.0000 | 102 | Actual |
| 1 | 540.0000 | 103 | Actual |
| 1 | 553.0000 | 104 | Actual |
| 1 | 446.0000 | 105 | Actual |
| 1 | 362.0000 | 106 | Actual |
| 1 | 404.0000 | 107 | Actual |
| 1 | 443.0000 | 108 | Actual |
| 1 | 395.0000 | 109 | Actual |
| 1 | 585.0000 | 110 | Actual |
| 2 | 268.0000 | 201 | Actual |
| 2 | 304.0000 | 202 | Actual |
| 2 | 309.0000 | 203 | Actual |
| 2 | 198.0000 | 204 | Actual |
| 2 | 312.0000 | 205 | Actual |
| 2 | 320.0000 | 206 | Actual |
| 2 | 333.0000 | 207 | Actual |
| 2 | 354.0000 | 208 | Actual |
| 2 | 337.0000 | 209 | Actual |
| 2 | 427.0000 | 210 | Actual |
| 3 | 206.0000 | 301 | Actual |
| 3 | 198.0000 | 302 | Actual |
| 3 | 222.0000 | 303 | Actual |
| 3 | 208.0000 | 304 | Actual |
| 3 | 216.0000 | 305 | Actual |
| 3 | 211.0000 | 306 | Actual |
| 3 | 244.0000 | 307 | Actual |
| 3 | 216.0000 | 308 | Actual |
| 3 | 268.0000 | 309 | Actual |
| 3 | 323.0000 | 310 | Actual |
| 4 | 236.0000 | 401 | Actual |
| 4 | 167.0000 | 402 | Actual |
| 4 | 189.0000 | 403 | Actual |
| 4 | 145.7247 | 404 | Forecasted |
| 4 | 162.6368 | 405 | Forecasted |
| 4 | 143.9503 | 406 | Forecasted |
| 4 | 184.3546 | 407 | Forecasted |
| 4 | 206.1743 | 408 | Forecasted |
| 4 | 213.1575 | 409 | Forecasted |
| 4 | 331.4394 | 410 | Forecasted |