This is the final project of class BANA 4080 - Forecasting and Risk
Analysis at the University of Cincinnati. The focus is utilizing
knowledge of time series data and libraries
forecast, tseries.
This project was completed as part of the BANA 4080 - Forecasting and
Risk Analysis at the University of Cincinnati. It aims to model and
predict the total ticket sales for the next day, focusing on the use of
forecast and tseries libraries. Using R
Programming tool, State Space Model, and ARIMA model, the project
analyzes the given Titanic Ticket Sales dataset to uncover the patterns,
trends, and seasonal components for predicting purpose.
The dataset, Titanic Ticket Sales, is provided by according professor
Jiantong Wang of the course. It covers data from December 1997 to July
1998, containing daily sales records. The dataset consists of 182
observations and includes four variables: Day (the specific
date), Customer.per.Theater (total sales for the day),
Date (the specific date), and Day.of.Week (the
day of the week from Monday to Sunday).
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(tseries)
library(dplyr)
titanic_df <- read.csv('Titanic_box_office.csv')
head(titanic_df,5)
## Day Customers.per.Theater Date Day.of.Week
## 1 4 2086 12/22/97 MON
## 2 5 2272 12/23/97 TUE
## 3 6 1214 12/24/97 WED
## 4 7 3166 12/25/97 THU
## 5 8 3932 12/26/97 FRI
We will drop the column Day which indicates the nth
record in the whole dataset. We also convert the Date
column into correct data type Date. This way, we can easily
handle the data as time series data. With that being done, we transform
the whole dataset into time series data with a new object
titanic.ts:
titanic_df <- titanic_df[,-which(names(titanic_df)== 'Day')]
titanic_df$Date <- as.Date(titanic_df$Date, format = "%m/%d/%y")
titanic.ts <- ts(titanic_df$Customers.per.Theater, start = c(1,1), frequency = 7)
1. Time Series Plot
Firstly, we make the time series plot to take an overview of data. We can see that there is a decreasing trend over the week with number of customers per theater. There can be also an weekly seasonal pattern, which can be further reviewed in following plot.
plot(titanic.ts, main = "Number of Customers per Theater", xlab = "Week period", ylab = "Number of customers")
2. Seasonal Plot
Next, we want to see the seasonal component in the dataset:
ggseasonplot(titanic.ts, main = 'Seasonal plot of number of customers weekly', xlab='Week period', ylab='Number of customers')
From the above season plot, we can see that for each week, the highest ticket sale is on Fridays and the least is on Mondays to Wednesdays.
Before developing model, we split the dataset to two training set and
testing set for data performance with window()
function:
titanic.training <- window(titanic.ts, start = c(1,1), end = c(25,7))
titanic.testing <- window(titanic.ts, start = c(26,1), end = c(26,7))
1. State Space Model
From above analysis, it is suggested to use Holt-Winters model to include the seasonal and trend component. However, to make sure we fit the model correctly, we will use est()function to fit the best state space model for it.
state_space_model <- ets(titanic.training, damped = F)
state_space_model
## ETS(M,A,M)
##
## Call:
## ets(y = titanic.training, damped = F)
##
## Smoothing parameters:
## alpha = 0.8593
## beta = 0.0374
## gamma = 1e-04
##
## Initial states:
## l = 4601.9689
## b = -19.1649
## s = 1.3519 2.3921 1.3591 0.4679 0.415 0.4983
## 0.5156
##
## sigma: 0.2956
##
## AIC AICc BIC
## 2729.447 2731.372 2767.424
The suggested state space model to fit is “MAM” structure. We have the parameters: alpha = 0.8593, beta = 0.0374, gamma = 1e-04. The gamma value is low (about 0.001), which indicates that the model gives small weights to the seasonal component.
Model Fitting
state_space_model <- ets(titanic.training, damped = F)
state_space_model
## ETS(M,A,M)
##
## Call:
## ets(y = titanic.training, damped = F)
##
## Smoothing parameters:
## alpha = 0.8593
## beta = 0.0374
## gamma = 1e-04
##
## Initial states:
## l = 4601.9689
## b = -19.1649
## s = 1.3519 2.3921 1.3591 0.4679 0.415 0.4983
## 0.5156
##
## sigma: 0.2956
##
## AIC AICc BIC
## 2729.447 2731.372 2767.424
From the day 5 (26.57143), the prediction tends to increase again. There is no negative forecast values so this output is valid.
2. ARIMA Model Stationary Test We use ADF to test if our time series data is stationary or not. Our null hypothesis is that our data is not stationary, and the alternative hypothesis is otherwise.
adf.test(titanic.training )
##
## Augmented Dickey-Fuller Test
##
## data: titanic.training
## Dickey-Fuller = -2.767, Lag order = 5, p-value = 0.2559
## alternative hypothesis: stationary
Since the p-value is 0.2559 which is higher than 0.05, we fail to reject our null hypothesis and that our data is not stationary. Therefore, we will consider ARIMA model instead of ARMA model.
Model Fitting
arima1 <- auto.arima(titanic.training)
arima1
## Series: titanic.training
## ARIMA(0,0,2)(2,1,2)[7] with drift
##
## Coefficients:
## ma1 ma2 sar1 sar2 sma1 sma2 drift
## 0.7189 0.3012 0.2040 -0.3332 -0.6028 0.5000 -18.4883
## s.e. 0.0785 0.0821 0.6754 0.3393 0.6935 0.2043 5.9606
##
## sigma^2 = 116844: log likelihood = -1216.2
## AIC=2448.41 AICc=2449.31 BIC=2473.4
Based on our result of ARIMA model, the seasonal components follows
an ARIMA(2,1,2) model and the regular components follows an
ARIMA(0,0,2) model. We run the predictions based on this
suggested model as following:
arima1_preds <- forecast(arima1, h=7)
arima1_preds
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 26.00000 120.62824 -317.4367 558.6932 -549.3342 790.5907
## 26.14286 33.88883 -505.6398 573.4174 -791.2489 859.0266
## 26.28571 -21.03814 -576.4671 534.3908 -870.4934 828.4172
## 26.42857 -3.68654 -559.1155 551.7424 -853.1418 845.7688
## 26.57143 159.37306 -396.0559 714.8020 -690.0822 1008.8283
## 26.71429 408.42858 -147.0004 963.8575 -441.0267 1257.8839
## 26.85714 207.29988 -348.1291 762.7288 -642.1554 1056.7552
There are predictions of negative values from the third day of week 26 (26.28571) followed by positive predictions. This can potentially be explained by the strong decreasing trend as seen in time series plot. We can change the negative values and their followings to 0 or we can do log transformation to existing data and fit the model again. In this research, we prefer doing the log transformation to time series data. This can help stabilize the variance of data and produce better predictions.
3. Non-constant variance detects (Heteroscedasticity)
From our research, we do the residuals and scale-location plot of ARIMA models as below.
# Extract residuals from ARIMA model
residual_arima <- residuals(arima1)
fitted_values_arima <- fitted(arima1)
# Residual plot
plot(fitted_values_arima, residual_arima, main='Residuals vs. Fitted values (ARIMA)', xlab='Fitted values', ylab='Residuals')
abline(h=0,col='red')
# Scale-Location Plot
plot(fitted_values_arima, sqrt(abs(residual_arima)), main = "Scale-Location Plot (ARIMA)",
xlab = "Fitted Values", ylab = "Square Root of |Residuals|")
abline(h = 0, col = "red")
From the residuals plot, we observe that many of the residuals are close to zero, especially for lower fitted values.As the fitted values increase, the residuals tend to spread out more from the zero line, indicating potential non-constant variance.
In the scale-location plot, where we take the square root of the absolute residuals, we see a similar pattern: residuals spread out more as the fitted values increase. This spread suggests a non-constant variance in the residuals.
These observations indicate that our time series dataset exhibits non-constant variance, making a case for applying a log transformation to stabilize the variance and improve our predictions.
4. Log Transformation
We use function log() to transform the time series data
to log data. We add 1 within the log()function to avoid any potential
log(0).
ts_data_log <- log(titanic.training + 1)
After applying log transformation and fitting to the ARIMA model again, we get the predictions as below:
arima_log <- auto.arima(ts_data_log)
forecasts_log <- forecast(arima_log, h=7)
arima_log_preds <- arima1_preds
arima_log_preds$mean <- exp(forecasts_log$mean) - 1
arima_log_preds
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 26.00000 171.0125 -317.4367 558.6932 -549.3342 790.5907
## 26.14286 132.2312 -505.6398 573.4174 -791.2489 859.0266
## 26.28571 115.3961 -576.4671 534.3908 -870.4934 828.4172
## 26.42857 125.5328 -559.1155 551.7424 -853.1418 845.7688
## 26.57143 274.4740 -396.0559 714.8020 -690.0822 1008.8283
## 26.71429 491.7474 -147.0004 963.8575 -441.0267 1257.8839
## 26.85714 306.3188 -348.1291 762.7288 -642.1554 1056.7552
There are no longer negative values in our predictions. We also try to fit the log transformed data to State Space Model to see if it performs better.
ssmodel_log <- ets(ts_data_log, damped = F)
ssmodel_log
## ETS(A,N,A)
##
## Call:
## ets(y = ts_data_log, damped = F)
##
## Smoothing parameters:
## alpha = 0.6267
## gamma = 1e-04
##
## Initial states:
## l = 8.1539
## s = 0.6189 1.1242 0.5299 -0.5242 -0.671 -0.5437
## -0.5341
##
## sigma: 0.2773
##
## AIC AICc BIC
## 465.6778 467.0193 497.3257
The best state space model to fit is “ANA” structure. We have the parameters: alpha = 0.6267, gamma = 1e-04 (which is very low and nearly 0).
1. Out-of-sample RMSE
We can calculate the RMSE of each model to compare the performance and determine which model to use. In this research, we use the data of last week (period 26) as testing data. We use all 4 models above (before and after log transformation) to make predictions of this week.
For the state space model without log transformation, we have the following:
state_space_preds <- forecast(state_space_model, h=7)
sspace_rmse <- sqrt(mean((titanic.testing - state_space_preds$mean)^2))
sspace_rmse
## [1] 93.97602
For the state space model with log transformation, we have the following:
ssmodel_log_forecasts <- forecast(ssmodel_log, h = 7)
ssmodel_log_preds <- state_space_preds
ssmodel_log_preds$mean <- exp(ssmodel_log_forecasts$mean) - 1
ssmodel_log_rmse <- sqrt(mean((titanic.testing - ssmodel_log_preds$mean)^2))
ssmodel_log_rmse
## [1] 93.19824
There is no significant difference of RSME between the two models with log transformation. For the ARIMA model without log transformation, we have the following:
arima1_preds <- forecast(arima1, h=7)
arima1_rmse <- sqrt(mean((titanic.testing - arima1_preds$mean)^2))
arima1_rmse
## [1] 138.8161
For the ARIMA model with log transformation, we have the following:
forecasts_log <- forecast(arima_log, h=7)
arima_log_preds <- arima1_preds
arima_log_preds$mean <- exp(forecasts_log$mean)-1
arima_log_rmse <- sqrt(mean((titanic.testing - arima_log_preds$mean)^2))
arima_log_rmse
## [1] 58.20646
With the log transformation, the ARIMA model performs significantly better, exhibiting a lower RMSE value. By comparing the out-of-sample RMSE of all four models, we choose the ARIMA model with the log transformation since it has the lowest RMSE value among them. The performance comparison of the ARIMA models and State Space models is illustrated in the following plots:
plot_data_arima <- data.frame(
date = index(titanic.testing),
actual = coredata(titanic.testing),
base_arima_preds = coredata(arima1_preds$mean),
log_arima_preds = coredata(arima_log_preds$mean)
)
ggplot(plot_data_arima, aes(x=date)) +
geom_line(aes(y=actual, color='Actual values'), linewidth = 1) +
geom_line(aes(y=base_arima_preds, color='Before Log-transformed'), linewidth = 1, linetype = 'dotted') +
geom_line(aes(y=log_arima_preds, color = 'After Log-transformed'), linewidth = 1, linetype = 'dashed') +
geom_hline(yintercept = 0, color = "black", linetype = "twodash", size = 0.5) +
labs(title = "Comparison of ARIMA Models", x = "Date", y = "Sales") +
theme_minimal() +
scale_color_manual(name = "Legend", values = c('Actual values' = 'blue', 'Before Log-transformed' = 'red', 'After Log-transformed' = 'darkgreen'))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
plot_data_sspace <- data.frame(
date = index(titanic.testing),
actual = coredata(titanic.testing),
base_sspace_preds = coredata(state_space_preds$mean),
log_sspace_preds = coredata(ssmodel_log_preds$mean)
)
ggplot(plot_data_sspace, aes(x=date)) +
geom_line(aes(y=actual, color='Actual values'), linewidth = 1) +
geom_line(aes(y=base_sspace_preds, color='Before Log-transformed'), linewidth = 1, linetype = 'dotted') +
geom_line(aes(y=log_sspace_preds, color = 'After Log-transformed'), linewidth = 1, linetype = 'dashed') +
geom_hline(yintercept = 0, color = "black", linetype = "twodash", size = 0.5) +
labs(title = "Comparison of State Space Models", x = "Date", y = "Sales") +
theme_minimal() +
scale_color_manual(name = "Legend", values = c('Actual values' = 'blue', 'Before Log-transformed' = 'red', 'After Log-transformed' = 'darkgreen'))
We can also use Information Criterion to select a model. A model with lower information criterion is better. For the sample size is relatively large, we choose to use BIC. The BIC of 4 models presented as follows:
BIC(state_space_model)
## [1] 2767.424
BIC(ssmodel_log)
## [1] 497.3257
BIC(arima1)
## [1] 2473.399
BIC(arima_log)
## [1] 78.5153
We can easily determine that the ARIMA model with log transformation has the lowest BIC value, which is the model that we should choose.
Our project aimed to predict the number of customers going to the theater the next day using data from December 1997 to July 1998. While exploring the daily sales, we identified a seasonal pattern with peaks on Fridays and a decreasing trend throughout the weeks.
We trained two models, ARIMA and the State Space Model (Holt-Winters), on data up to week 25 and tested them on data from week 26. During the training phase, we encountered invalid predictions from the ARIMA model, indicating non-constant variance. To address this, we analyzed the residuals plot and the scale-location plot, which confirmed the presence of non-constant variance in the training data.
To mitigate this issue, we applied a log transformation to the data and refitted both the ARIMA and State Space models. We then evaluated the performance of all four models (original ARIMA, original State Space, log-transformed ARIMA, and log-transformed State Space) using out-of-sample RMSE and BIC criteria.
Based on these evaluations, we determined that the ARIMA model with log transformation provided the best forecasts for predicting the next day’s theater customers. Future work could include further exploration of the historical dataset and validation for broader applications.
For any questions or feedback, please reach out to:
Feel free to open an issue or submit a pull request if you have suggestions or improvements!