1) Initial data: 3 full seasons each of 10 episodes. Two episodes in the raw data is located in a single row - need to be cleaned

Patterns and irregularities of initial data were examined
Used tsclean function to clean up possible outliers or missing values (results)
As for the frequency parameter in ts() object, we are specifying periodicity of the data, i.e., number of observations per season = 10

Tsclean function cleaned only 1 observation so we can use initial data for forecasting as there are hardly any outliers that could bias the model by skewing statistical summaries.

Alternative to TsClean function can be smoothing initial data with Moving averages ma(), but for “TURN” data we can work without it as datasample is quite small (MA can be used for other shows if delivery variable will be to volatile for them)

2) I have devided 33 observation into 30 training (first 3 seasons of Turn) and 3 test (first 3 episodes of 4th season)

SEASON	BROADCAST_DATE	EPISODE_NAME	DELIVERY
3	2016-05-23	305 HYPOCRISY FRAUD AND TYRANNY	216
3	2016-05-30	306 MANY MICKLES MAKE A MUCKLE	211
3	2016-06-06	307 JUDGMENT	244
3	2016-06-13	308 MENDED	216
3	2016-06-20	309 BLADE ON THE FEATHER	268
3	2016-06-27	310 TRIAL AND EXECUTION	323
4	2017-06-17	401 SPYHUNTER GENERAL	236
4	2017-06-17	402 THE BLACK HOLE OF CALCUTTA	167
4	2017-06-24	403 BLOOD FOR BLOOD	189

3) Data decomposition

Does the series appear to have trends or seasonality?
Use decompose() or stl() to examine and possibly remove components of the series
For further analysis data should be converted into ts using ts() function

Stuctural components of Turn Data using str()

Stuctural components of Turn Data using decompose()

The building blocks of a time series analysis are seasonality, trend, and cycle. These intuitive components capture the historical patterns in the series. We have defined these components using str() function with additive model (Can later try and evaluate use of multiplicative model)

4) Substracting seasonal component

First, we calculate seasonal component of the data using stl(). STL is a flexible function for decomposing and forecasting the series. It calculates the seasonal component of the series using smoothing, and adjusts the original series by subtracting seasonality in two simple lines:

5) Substracting trend component

Further step is to exclude trend and analyze remainders of the initial dataset:

Trend can be excluded with numerous instruments and functions. From the graph you can see two analyzed options: decomposition of data using str() function and 1st differences of the initial dataset

As you can see both methods work well: successfully exclude trend. But stl() method result in relatively lower residuals (remainders)

6) Performing ADF tests (Stationarity)

Fitting an ARIMA model requires the series to be stationary. A series is said to be stationary when its mean, variance, and autocovariance are time invariant. This assumption makes intuitive sense: Since ARIMA uses previous lags of series to model its behavior, modeling stable series with consistent properties involves less uncertainty.

The augmented Dickey-Fuller (ADF) test is a formal statistical test for stationarity. The null hypothesis assumes that the series is non-stationary. ADF procedure tests whether the change in Y can be explained by lagged value and a linear trend. If contribution of the lagged value to the change in Y is non-significant and there is a presence of a trend component, the series is non-stationary and null hypothesis will not be rejected.

## 
##  Augmented Dickey-Fuller Test
## 
## data:  TrainingTurn
## Dickey-Fuller = -3.2068, Lag order = 3, p-value = 0.1091
## alternative hypothesis: stationary

## 
##  Augmented Dickey-Fuller Test
## 
## data:  Turn_d1
## Dickey-Fuller = -2.7484, Lag order = 3, p-value = 0.2857
## alternative hypothesis: stationary

## 
##  Augmented Dickey-Fuller Test
## 
## data:  detrendedTurn
## Dickey-Fuller = -5.1697, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary

Our initial TURN data is non-stationary; the average number of delivery checkouts changes through time, levels change, etc.

Usually, non-stationary series can be corrected by a simple transformation such as differencing. Differencing the series can help in removing its trend or cycles.

As can been seen in the graphs removal of trend and seasonality lead to stationary dataset that can be later used for forecasting by ARIMA models

ACF plots display correlation between a series and its lags. In addition to suggesting the order of differencing, ACF plots can help in determining the order of the M A (q) model. Partial autocorrelation plots (PACF), as the name suggests, display correlation between a variable and its lags that is not explained by previous lags. PACF plots are useful when determining the order of the AR(p) model.

ACF plots can help in determining the order of the M A (q) model - [ 1 | 2 | 3 | 4]
PACF plots are useful when determining the order of the AR(p) model - [1]

7) Forecasting seasonality

seasonality can be forecasted as similiar indexes (delivery values) provided by str() function:

8) Forecasting Trend component

Trend component obtained by stl() function can be forecasted numerous ways, the most common is to fit:

Linear regression with a function of time and its polynoms and/or log of time
Use exponential smoothing
Use HoltWinters forecasting method

On our data linear regression and HoltWinter model looks more promising, but they forecast trends differently:

For forecast purposes i will be using Trendforecast made by HoltWinters method

9) Forecasting of remainders with ARIMA

##                Model      AIC      BIC
## 1 Auto ARIMA (1,1,0) 321.9876 326.0895
## 2      ARIMA (1,1,1) 311.9321 316.0340
## 3      ARIMA (1,0,0) 318.3568 322.5604
## 4      ARIMA (1,0,1) 320.3128 325.9176
## 5      ARIMA (1,0,2) 320.1390 327.1449
## 6      ARIMA (1,0,3) 319.2352 327.6424
## 7      ARIMA (0,1,1) 311.1270 313.8616

There exist a number of such criteria for comparing quality of fit across multiple models. Two of the most widely used are Akaike information criteria (AIC) and Baysian information criteria (BIC). These criteria are closely related and can be interpreted as an estimate of how much information would be lost if a given model is chosen. When comparing models, we want to minimize AIC and BIC.

According to Training set results based on AIC and BIC we expect ARIMA(1,1,1) & ARIMA(0,1,1) to perform better on the test set, however there are only minor differences between AIC, BIC values for selected models, and residuals of all these models are not autocorelated, so the final decision about model quality will be based on the test error

10) Forecasting of remainders with ARIMA

##                Model      AIC      BIC     RMSE PercentageDifference
## 1 Auto ARIMA (1,1,0) 321.9876 326.0895 75.79421             31.66631
## 2      ARIMA (1,1,1) 311.9321 316.0340 48.44699             22.57598
## 3      ARIMA (1,0,0) 318.3568 322.5604 48.03851             22.48914
## 4      ARIMA (1,0,1) 320.3128 325.9176 47.64507             22.34958
## 5      ARIMA (1,0,2) 320.1390 327.1449 39.44283             19.08011
## 6      ARIMA (1,0,3) 319.2352 327.6424 47.64113             23.77766
## 7      ARIMA (0,1,1) 311.1270 313.8616 46.26052             22.01650

According to test error rate ARIMA(1,0,2) performed better than the other models, however we will add other TS models and also evaluate them on the TEST set

All forecast estimates above are provided with confidence bounds: 95% CI. Longer term forecasts will usually have more uncertainty, as the model will regress future “DELIVERY” on previously predicted values (HOWEVER in our case ARIMA(1,0,2) predicted remainders over 0, so the more influence will be expected from trend and seasonal component) ____________________________________________________________________________________________________________________

11) Forecasting of remainders with LOESS, HW

Model	AIC	BIC	RMSE	PercentageDifference
ARIMA (1,0,2)	320.14	327.14	39.44	19.08
Loess with span = 0.6	NA	NA	43.63	21.26
ARIMA (0,1,1)	311.13	313.86	46.26	22.02
ARIMA (1,0,3)	319.24	327.64	47.64	23.78
ARIMA (1,0,1)	320.31	325.92	47.65	22.35
ARIMA (1,0,0)	318.36	322.56	48.04	22.49
ARIMA (1,1,1)	311.93	316.03	48.45	22.58
Auto ARIMA (1,1,0)	321.99	326.09	75.79	31.67
HW beta = .4, gamma = .1	NA	NA	83.14	34.72
HW beta = FALSE, gamma = FALSE	NA	NA	104.75	54.16
HW beta = .5, gamma = .5	NA	NA	123.49	58.34

According to test error rate the most suitable model is ARIMA(1,0,2) - with a given forecasts of Trend and Seasonal components

11) Evaluating ARIMA(1,0,2) on the whole dataset

We will need to evaluate choosen model on the whole dataset to base forecasts on the most updated data

## 
##  Augmented Dickey-Fuller Test
## 
## data:  detrendedTurnFinal
## Dickey-Fuller = -4.9955, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(BestModel)
## W = 0.9838, p-value = 0.8909

12) Forecast using ARIMA (1,0,2) for future 7 episodes of 4th season

SEASON	DELIVERY	EPISODE	status
1	714.0000	101	Actual
1	747.0000	102	Actual
1	540.0000	103	Actual
1	553.0000	104	Actual
1	446.0000	105	Actual
1	362.0000	106	Actual
1	404.0000	107	Actual
1	443.0000	108	Actual
1	395.0000	109	Actual
1	585.0000	110	Actual
2	268.0000	201	Actual
2	304.0000	202	Actual
2	309.0000	203	Actual
2	198.0000	204	Actual
2	312.0000	205	Actual
2	320.0000	206	Actual
2	333.0000	207	Actual
2	354.0000	208	Actual
2	337.0000	209	Actual
2	427.0000	210	Actual
3	206.0000	301	Actual
3	198.0000	302	Actual
3	222.0000	303	Actual
3	208.0000	304	Actual
3	216.0000	305	Actual
3	211.0000	306	Actual
3	244.0000	307	Actual
3	216.0000	308	Actual
3	268.0000	309	Actual
3	323.0000	310	Actual
4	236.0000	401	Actual
4	167.0000	402	Actual
4	189.0000	403	Actual
4	145.7247	404	Forecasted
4	162.6368	405	Forecasted
4	143.9503	406	Forecasted
4	184.3546	407	Forecasted
4	206.1743	408	Forecasted
4	213.1575	409	Forecasted
4	331.4394	410	Forecasted

13) Further steps:

Do the same for other shows
Use manual ARIMA(1,0,2) + add more variables
Try to use methods similiar to crossvalidation for time series

Forecasting model for “TURN” show