The data_A.csv file contains data that needs to be analyzed. Use a tool of your choice (e.g. Excel, Python, R, etc.) to analyze the “production” time series data in the provided file and choose a forecasting model that provides reasonable forecasts at a 1-4 quarter horizon.
Exploring, visualizing, and qualitative analysis can help with the initial approach for which forecast models to test. The decision on which model to use should be based on the best performance of an accuracy metric. I selected MSE so that the scale of the errors are more sensitive the worse that they are as large errors are undesirable. Based on MSE, the ARIMA(1,1,2)(0,1,1) model was selected since it had the lowest MSE values indicating it was the most accurate.
Plotting out the time series can help identify if there are trends or seasonality patterns within the data which can be useful in determining which model to use. Auto-correlation plots can also indicate whether or not the time series data has stationarity properties. Augmented Dickey-Fuller and KPSS tests can formally check for statinoarity. Tests such as Ljung-Box can formally tests if the residuals from the fitted model results in white noise.
The ARIMA model that was fitted using the first 214 observations and evlauted on the last 4 observations. The result was an RMSE value of 12.71 (MSE 161.5441) which was the lowest of the 3 models tested. Additionaly, Cross Validation was used for more robust testing by using a variety of train/test data, which also resulted in the ARIMA model having the lowest MSE values for each of the 1-4 quarter horizon. [1] 336.7409 [2] 328.3081 [3] 325.1302 [4] 333.8706
library(data.table)
library(forecast)
library(lubridate)
library(zoo)
library(ggplot2)
library(tseries)
df <- fread('C:/Users/antho/Documents/projects/hellofresh/data_A.csv')
This is a univariate times series dataset with 218 quarterly observations starting with 1956Q1 and ending at 2010Q2.
str(df)
## Classes 'data.table' and 'data.frame': 218 obs. of 2 variables:
## $ time : chr "1956 Q1" "1956 Q2" "1956 Q3" "1956 Q4" ...
## $ production: int 284 213 227 308 262 228 236 320 272 233 ...
## - attr(*, ".internal.selfref")=<externalptr>
head(df)
tail(df)
ts <- ts(df$production,
frequency = 4,
start = c(1956, 1))
The plot shows that there is a pretty clear trend in the data with a steady rise in the first 20 periods and then a slow decline. There also appears to be seasonality patterns as shown by the spikes and troughs within each year. Lastly, the variance appears consistent throughout the time period.
autoplot(ts)
ggseasonplot(ts, polar= TRUE)
ggsubseriesplot(ts)
A few different tests for stationarity confirms that this time series data is in fact non-stationary. This will be useful to keep in mind when determining the model.
ggAcf(ts)
Box.test(ts, lag=24, type="Ljung-Box")
##
## Box-Ljung test
##
## data: ts
## X-squared = 1967.6, df = 24, p-value < 2.2e-16
adf.test(ts)
##
## Augmented Dickey-Fuller Test
##
## data: ts
## Dickey-Fuller = -1.2827, Lag order = 6, p-value = 0.877
## alternative hypothesis: stationary
kpss.test(ts)
## Warning in kpss.test(ts): p-value smaller than printed p-value
##
## KPSS Test for Level Stationarity
##
## data: ts
## KPSS Level = 2.2019, Truncation lag parameter = 4, p-value = 0.01
Test out 3 different models: Seasonal Naive, ARIMA, and Neurual Net. The training data will include the first 214 quarters and test data will include the last 4 quarters. From analyzing the residuals, the ARIMA(1,1,2)(0,1,1) resulted in white noise based on the ACF plot and the Ljung-Box results. This may indicate it is the better of the 3 models. The accuracy results RMSE for ARIMA is also the lowest at 12.71 which indicates it is the better of the 3 models. Based on these results, the ARIMA model is the best performing.
train <- window(ts, start = c(1956, 1), end = c(2009, 2))
fit1 <- snaive(train)
fit2 <- auto.arima(train, stepwise = FALSE)
fit3 <- nnetar(train)
summary(fit1)
##
## Forecast method: Seasonal naive method
##
## Model Information:
## Call: snaive(y = train)
##
## Residual sd: 19.2035
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 3.233333 19.42862 15.64286 0.8761965 3.724128 1
## ACF1
## Training set 0.009222949
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2009 Q3 410 385.1012 434.8988 371.9206 448.0794
## 2009 Q4 488 463.1012 512.8988 449.9206 526.0794
## 2010 Q1 415 390.1012 439.8988 376.9206 453.0794
## 2010 Q2 398 373.1012 422.8988 359.9206 436.0794
## 2010 Q3 410 374.7878 445.2122 356.1476 463.8524
## 2010 Q4 488 452.7878 523.2122 434.1476 541.8524
## 2011 Q1 415 379.7878 450.2122 361.1476 468.8524
## 2011 Q2 398 362.7878 433.2122 344.1476 451.8524
summary(fit2)
## Series: train
## ARIMA(1,1,2)(0,1,1)[4]
##
## Coefficients:
## ar1 ma1 ma2 sma1
## 0.0279 -0.9966 0.3681 -0.7411
## s.e. 0.2005 0.1871 0.1545 0.0505
##
## sigma^2 estimated as 241.2: log likelihood=-869.71
## AIC=1749.41 AICc=1749.71 BIC=1766.13
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.02897946 15.19919 11.62055 0.02579476 2.759964 0.7428662
## ACF1
## Training set 0.005323207
summary(fit3)
## Length Class Mode
## x 214 ts numeric
## m 1 -none- numeric
## p 1 -none- numeric
## P 1 -none- numeric
## scalex 2 -none- list
## size 1 -none- numeric
## subset 214 -none- numeric
## model 20 nnetarmodels list
## nnetargs 0 -none- list
## fitted 214 ts numeric
## residuals 214 ts numeric
## lags 4 -none- numeric
## series 1 -none- character
## method 1 -none- character
## call 2 -none- call
checkresiduals(fit1)
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 61.272, df = 8, p-value = 2.622e-10
##
## Model df: 0. Total lags used: 8
checkresiduals(fit2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,2)(0,1,1)[4]
## Q* = 3.4231, df = 4, p-value = 0.4897
##
## Model df: 4. Total lags used: 8
checkresiduals(fit3)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.
fc1 <- forecast(fit1, h=4)
fc2 <- forecast(fit2, h=4)
fc3 <- forecast(fit3, h=4)
accuracy(fc1, ts)
## ME RMSE MAE MPE MAPE MASE
## Training set 3.233333 19.42862 15.64286 0.8761965 3.724128 1.000000
## Test set -4.000000 12.82576 8.50000 -1.1276717 2.201657 0.543379
## ACF1 Theil's U
## Training set 0.009222949 NA
## Test set 0.006734007 0.2379096
accuracy(fc2, ts)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.02897946 15.19919 11.62055 0.02579476 2.759964 0.7428662
## Test set -3.55112916 12.71731 11.14764 -1.00265292 2.787240 0.7126344
## ACF1 Theil's U
## Training set 0.005323207 NA
## Test set 0.269216590 0.2057462
accuracy(fc3, ts)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.04482403 16.19589 12.53834 -0.1622583 2.961866 0.8015378
## Test set -9.56746016 15.09339 11.60875 -2.4499074 2.937088 0.7421117
## ACF1 Theil's U
## Training set -0.01907935 NA
## Test set 0.11967922 0.2918909
fit1 %>% forecast(h=4) %>% autoplot()
fit2 %>% forecast(h=4) %>% autoplot()
fit3 %>% forecast(h=4) %>% autoplot()
Cross Validation can provide further training and testing results based on holding out various time periods for each 1 to 4 quarter horizon forecasts. Using MSE as the accuracy metric, the ARIMA model again has the best performance for each of the step ahead forecasts.
Seasonal Naive [1] 409.1843 [2] 422.4273 [3] 405.8949 [4] 400.3256
ARIMA [1] 336.7409 [2] 328.3081 [3] 325.1302 [4] 333.8706
NN [1] 619.2986 [2] 679.8911 [3] 701.8105 [4] 750.3337
fsnaive <- function(y, h) {
forecast(snaive(y), h = h)
}
farima <- function(x, h) {
forecast(auto.arima(x), h = h)
}
fnn <- function(y, h) {
forecast(nnetar(y), h = h)
}
sq <- function(u){u^2}
# snaive
for(h in 1:4){
ts %>% tsCV(fsnaive, h = h) %>% sq() %>% mean(na.rm=TRUE) %>% print()
}
# arima
for(h in 1:4){
ts %>% tsCV(farima, h = h) %>% sq() %>% mean(na.rm=TRUE) %>% print()
}
# nn
for(h in 1:4){
ts %>% tsCV(fnn, h = h) %>% sq() %>% mean(na.rm=TRUE) %>% print()
}