Overview

This is the final project of class BANA 4080 - Forecasting and Risk Analysis at the University of Cincinnati. The focus is utilizing knowledge of time series data and libraries forecast, tseries.

This project was completed as part of the BANA 4080 - Forecasting and Risk Analysis at the University of Cincinnati. It aims to model and predict the total ticket sales for the next day, focusing on the use of forecast and tseries libraries. Using R Programming tool, State Space Model, and ARIMA model, the project analyzes the given Titanic Ticket Sales dataset to uncover the patterns, trends, and seasonal components for predicting purpose.

Data Source

The dataset, Titanic Ticket Sales, is provided by according professor Jiantong Wang of the course. It covers data from December 1997 to July 1998, containing daily sales records. The dataset consists of 182 observations and includes four variables: Day (the specific date), Customer.per.Theater (total sales for the day), Date (the specific date), and Day.of.Week (the day of the week from Monday to Sunday).

Project Workflow

Setup and import the dataset

library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(tseries)
library(dplyr)

titanic_df <- read.csv('Titanic_box_office.csv')
head(titanic_df,5)
##   Day Customers.per.Theater     Date Day.of.Week
## 1   4                  2086 12/22/97         MON
## 2   5                  2272 12/23/97         TUE
## 3   6                  1214 12/24/97         WED
## 4   7                  3166 12/25/97         THU
## 5   8                  3932 12/26/97         FRI

We will drop the column Day which indicates the nth record in the whole dataset. We also convert the Date column into correct data type Date. This way, we can easily handle the data as time series data. With that being done, we transform the whole dataset into time series data with a new object titanic.ts:

titanic_df <- titanic_df[,-which(names(titanic_df)== 'Day')]
titanic_df$Date <- as.Date(titanic_df$Date, format = "%m/%d/%y")
titanic.ts <- ts(titanic_df$Customers.per.Theater, start = c(1,1), frequency = 7)

Data Visualization

1. Time Series Plot

Firstly, we make the time series plot to take an overview of data. We can see that there is a decreasing trend over the week with number of customers per theater. There can be also an weekly seasonal pattern, which can be further reviewed in following plot.

plot(titanic.ts, main = "Number of Customers per Theater", xlab = "Week period", ylab = "Number of customers")

2. Seasonal Plot

Next, we want to see the seasonal component in the dataset:

ggseasonplot(titanic.ts, main = 'Seasonal plot of number of customers weekly', xlab='Week period', ylab='Number of customers')

From the above season plot, we can see that for each week, the highest ticket sale is on Fridays and the least is on Mondays to Wednesdays.

Model Development

Before developing model, we split the dataset to two training set and testing set for data performance with window() function:

titanic.training <- window(titanic.ts, start = c(1,1), end = c(25,7))
titanic.testing <- window(titanic.ts, start = c(26,1), end = c(26,7))

1. State Space Model

From above analysis, it is suggested to use Holt-Winters model to include the seasonal and trend component. However, to make sure we fit the model correctly, we will use est()function to fit the best state space model for it.

state_space_model <- ets(titanic.training, damped = F)
state_space_model
## ETS(M,A,M) 
## 
## Call:
## ets(y = titanic.training, damped = F)
## 
##   Smoothing parameters:
##     alpha = 0.8593 
##     beta  = 0.0374 
##     gamma = 1e-04 
## 
##   Initial states:
##     l = 4601.9689 
##     b = -19.1649 
##     s = 1.3519 2.3921 1.3591 0.4679 0.415 0.4983
##            0.5156
## 
##   sigma:  0.2956
## 
##      AIC     AICc      BIC 
## 2729.447 2731.372 2767.424

The suggested state space model to fit is “MAM” structure. We have the parameters: alpha = 0.8593, beta = 0.0374, gamma = 1e-04. The gamma value is low (about 0.001), which indicates that the model gives small weights to the seasonal component.

Model Fitting

state_space_model <- ets(titanic.training, damped = F)
state_space_model
## ETS(M,A,M) 
## 
## Call:
## ets(y = titanic.training, damped = F)
## 
##   Smoothing parameters:
##     alpha = 0.8593 
##     beta  = 0.0374 
##     gamma = 1e-04 
## 
##   Initial states:
##     l = 4601.9689 
##     b = -19.1649 
##     s = 1.3519 2.3921 1.3591 0.4679 0.415 0.4983
##            0.5156
## 
##   sigma:  0.2956
## 
##      AIC     AICc      BIC 
## 2729.447 2731.372 2767.424

From the day 5 (26.57143), the prediction tends to increase again. There is no negative forecast values so this output is valid.

2. ARIMA Model Stationary Test We use ADF to test if our time series data is stationary or not. Our null hypothesis is that our data is not stationary, and the alternative hypothesis is otherwise.

adf.test(titanic.training )
## 
##  Augmented Dickey-Fuller Test
## 
## data:  titanic.training
## Dickey-Fuller = -2.767, Lag order = 5, p-value = 0.2559
## alternative hypothesis: stationary

Since the p-value is 0.2559 which is higher than 0.05, we fail to reject our null hypothesis and that our data is not stationary. Therefore, we will consider ARIMA model instead of ARMA model.

Model Fitting

arima1 <- auto.arima(titanic.training)
arima1
## Series: titanic.training 
## ARIMA(0,0,2)(2,1,2)[7] with drift 
## 
## Coefficients:
##          ma1     ma2    sar1     sar2     sma1    sma2     drift
##       0.7189  0.3012  0.2040  -0.3332  -0.6028  0.5000  -18.4883
## s.e.  0.0785  0.0821  0.6754   0.3393   0.6935  0.2043    5.9606
## 
## sigma^2 = 116844:  log likelihood = -1216.2
## AIC=2448.41   AICc=2449.31   BIC=2473.4

Based on our result of ARIMA model, the seasonal components follows an ARIMA(2,1,2) model and the regular components follows an ARIMA(0,0,2) model. We run the predictions based on this suggested model as following:

arima1_preds <- forecast(arima1, h=7)
arima1_preds
##          Point Forecast     Lo 80    Hi 80     Lo 95     Hi 95
## 26.00000      120.62824 -317.4367 558.6932 -549.3342  790.5907
## 26.14286       33.88883 -505.6398 573.4174 -791.2489  859.0266
## 26.28571      -21.03814 -576.4671 534.3908 -870.4934  828.4172
## 26.42857       -3.68654 -559.1155 551.7424 -853.1418  845.7688
## 26.57143      159.37306 -396.0559 714.8020 -690.0822 1008.8283
## 26.71429      408.42858 -147.0004 963.8575 -441.0267 1257.8839
## 26.85714      207.29988 -348.1291 762.7288 -642.1554 1056.7552

There are predictions of negative values from the third day of week 26 (26.28571) followed by positive predictions. This can potentially be explained by the strong decreasing trend as seen in time series plot. We can change the negative values and their followings to 0 or we can do log transformation to existing data and fit the model again. In this research, we prefer doing the log transformation to time series data. This can help stabilize the variance of data and produce better predictions.

3. Non-constant variance detects (Heteroscedasticity)

From our research, we do the residuals and scale-location plot of ARIMA models as below.

# Extract residuals from ARIMA model
residual_arima <- residuals(arima1)
fitted_values_arima <- fitted(arima1)
# Residual plot
plot(fitted_values_arima, residual_arima, main='Residuals vs. Fitted values (ARIMA)', xlab='Fitted values', ylab='Residuals')
abline(h=0,col='red')

# Scale-Location Plot
plot(fitted_values_arima, sqrt(abs(residual_arima)), main = "Scale-Location Plot (ARIMA)",
     xlab = "Fitted Values", ylab = "Square Root of |Residuals|")
abline(h = 0, col = "red")

From the residuals plot, we observe that many of the residuals are close to zero, especially for lower fitted values.As the fitted values increase, the residuals tend to spread out more from the zero line, indicating potential non-constant variance.

In the scale-location plot, where we take the square root of the absolute residuals, we see a similar pattern: residuals spread out more as the fitted values increase. This spread suggests a non-constant variance in the residuals.

These observations indicate that our time series dataset exhibits non-constant variance, making a case for applying a log transformation to stabilize the variance and improve our predictions.

4. Log Transformation

We use function log() to transform the time series data to log data. We add 1 within the log()function to avoid any potential log(0).

ts_data_log <- log(titanic.training + 1)

After applying log transformation and fitting to the ARIMA model again, we get the predictions as below:

arima_log <- auto.arima(ts_data_log)
forecasts_log <- forecast(arima_log, h=7)
arima_log_preds <- arima1_preds
arima_log_preds$mean <- exp(forecasts_log$mean) - 1
arima_log_preds
##          Point Forecast     Lo 80    Hi 80     Lo 95     Hi 95
## 26.00000       171.0125 -317.4367 558.6932 -549.3342  790.5907
## 26.14286       132.2312 -505.6398 573.4174 -791.2489  859.0266
## 26.28571       115.3961 -576.4671 534.3908 -870.4934  828.4172
## 26.42857       125.5328 -559.1155 551.7424 -853.1418  845.7688
## 26.57143       274.4740 -396.0559 714.8020 -690.0822 1008.8283
## 26.71429       491.7474 -147.0004 963.8575 -441.0267 1257.8839
## 26.85714       306.3188 -348.1291 762.7288 -642.1554 1056.7552

There are no longer negative values in our predictions. We also try to fit the log transformed data to State Space Model to see if it performs better.

ssmodel_log <- ets(ts_data_log, damped = F)
ssmodel_log
## ETS(A,N,A) 
## 
## Call:
## ets(y = ts_data_log, damped = F)
## 
##   Smoothing parameters:
##     alpha = 0.6267 
##     gamma = 1e-04 
## 
##   Initial states:
##     l = 8.1539 
##     s = 0.6189 1.1242 0.5299 -0.5242 -0.671 -0.5437
##            -0.5341
## 
##   sigma:  0.2773
## 
##      AIC     AICc      BIC 
## 465.6778 467.0193 497.3257

The best state space model to fit is “ANA” structure. We have the parameters: alpha = 0.6267, gamma = 1e-04 (which is very low and nearly 0).

Model Comparison

1. Out-of-sample RMSE

We can calculate the RMSE of each model to compare the performance and determine which model to use. In this research, we use the data of last week (period 26) as testing data. We use all 4 models above (before and after log transformation) to make predictions of this week.

For the state space model without log transformation, we have the following:

state_space_preds <- forecast(state_space_model, h=7)
sspace_rmse <- sqrt(mean((titanic.testing - state_space_preds$mean)^2))
sspace_rmse
## [1] 93.97602

For the state space model with log transformation, we have the following:

ssmodel_log_forecasts <- forecast(ssmodel_log, h = 7)
ssmodel_log_preds <- state_space_preds
ssmodel_log_preds$mean <- exp(ssmodel_log_forecasts$mean) - 1
ssmodel_log_rmse <- sqrt(mean((titanic.testing - ssmodel_log_preds$mean)^2))
ssmodel_log_rmse
## [1] 93.19824

There is no significant difference of RSME between the two models with log transformation. For the ARIMA model without log transformation, we have the following:

arima1_preds <- forecast(arima1, h=7)
arima1_rmse <- sqrt(mean((titanic.testing - arima1_preds$mean)^2))
arima1_rmse
## [1] 138.8161

For the ARIMA model with log transformation, we have the following:

forecasts_log <- forecast(arima_log, h=7)
arima_log_preds <- arima1_preds
arima_log_preds$mean <- exp(forecasts_log$mean)-1
arima_log_rmse <- sqrt(mean((titanic.testing - arima_log_preds$mean)^2))
arima_log_rmse
## [1] 58.20646

With the log transformation, the ARIMA model performs significantly better, exhibiting a lower RMSE value. By comparing the out-of-sample RMSE of all four models, we choose the ARIMA model with the log transformation since it has the lowest RMSE value among them. The performance comparison of the ARIMA models and State Space models is illustrated in the following plots:

plot_data_arima <- data.frame(
  date = index(titanic.testing),
  actual = coredata(titanic.testing),
  base_arima_preds = coredata(arima1_preds$mean),
  log_arima_preds = coredata(arima_log_preds$mean)
)

ggplot(plot_data_arima, aes(x=date)) + 
  geom_line(aes(y=actual, color='Actual values'), linewidth = 1) +
  geom_line(aes(y=base_arima_preds, color='Before Log-transformed'), linewidth = 1, linetype = 'dotted') + 
  geom_line(aes(y=log_arima_preds, color = 'After Log-transformed'), linewidth = 1, linetype = 'dashed') +
  geom_hline(yintercept = 0, color = "black", linetype = "twodash", size = 0.5) +
  labs(title = "Comparison of ARIMA Models", x = "Date", y = "Sales") +
  theme_minimal() +
  scale_color_manual(name = "Legend", values = c('Actual values' = 'blue', 'Before Log-transformed' = 'red', 'After Log-transformed' = 'darkgreen'))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

plot_data_sspace <- data.frame(
  date = index(titanic.testing),
  actual = coredata(titanic.testing),
  base_sspace_preds = coredata(state_space_preds$mean),
  log_sspace_preds = coredata(ssmodel_log_preds$mean)
)

ggplot(plot_data_sspace, aes(x=date)) + 
  geom_line(aes(y=actual, color='Actual values'), linewidth = 1) +
  geom_line(aes(y=base_sspace_preds, color='Before Log-transformed'), linewidth = 1, linetype = 'dotted') + 
  geom_line(aes(y=log_sspace_preds, color = 'After Log-transformed'), linewidth = 1, linetype = 'dashed') +
  geom_hline(yintercept = 0, color = "black", linetype = "twodash", size = 0.5) +
  labs(title = "Comparison of State Space Models", x = "Date", y = "Sales") +
  theme_minimal() +
  scale_color_manual(name = "Legend", values = c('Actual values' = 'blue', 'Before Log-transformed' = 'red', 'After Log-transformed' = 'darkgreen'))

Information Criterion

We can also use Information Criterion to select a model. A model with lower information criterion is better. For the sample size is relatively large, we choose to use BIC. The BIC of 4 models presented as follows:

BIC(state_space_model)
## [1] 2767.424
BIC(ssmodel_log)
## [1] 497.3257
BIC(arima1)
## [1] 2473.399
BIC(arima_log)
## [1] 78.5153

We can easily determine that the ARIMA model with log transformation has the lowest BIC value, which is the model that we should choose.

Summary

Our project aimed to predict the number of customers going to the theater the next day using data from December 1997 to July 1998. While exploring the daily sales, we identified a seasonal pattern with peaks on Fridays and a decreasing trend throughout the weeks.

We trained two models, ARIMA and the State Space Model (Holt-Winters), on data up to week 25 and tested them on data from week 26. During the training phase, we encountered invalid predictions from the ARIMA model, indicating non-constant variance. To address this, we analyzed the residuals plot and the scale-location plot, which confirmed the presence of non-constant variance in the training data.

To mitigate this issue, we applied a log transformation to the data and refitted both the ARIMA and State Space models. We then evaluated the performance of all four models (original ARIMA, original State Space, log-transformed ARIMA, and log-transformed State Space) using out-of-sample RMSE and BIC criteria.

Based on these evaluations, we determined that the ARIMA model with log transformation provided the best forecasts for predicting the next day’s theater customers. Future work could include further exploration of the historical dataset and validation for broader applications.

Contact

For any questions or feedback, please reach out to:

Feel free to open an issue or submit a pull request if you have suggestions or improvements!