Illustration by the author of the report.
Greetings all! It’s Gissella Nadya, and today we are back with a project: forecasting the average ticket price for the musical Hamilton on Broadway! The data were obtained from Playbill.com. For a brief review, Hamilton is a Broadway musical that tells the story of one of America’s Founding Fathers, Alexander Hamilton. Since Broadway is not open yet due to the global pandemic, let’s imagine what would happen if we used the price history to forecast the average ticket price to watch the musical. Okay, let’s get started!
library(tidyverse)
library(padr)
library(forecast) # autoplot()
library(tseries) # adf.test()
library(MLmetrics) # MAE()
library(plotly) # plotly
library(glue) # glue text
library(tictoc)
library(gganimate)
library(astsa)
library(lubridate)
bway <- read_csv("grosses.csv")
bway
hamilton <- bway %>%
  filter(show == "Hamilton") %>%
  select(week_ending, weekly_gross, avg_ticket_price, seats_sold)

bway %>%
  filter(show == "Hamilton") %>%
  select(week_ending, week_number, avg_ticket_price)

clean_ham <- hamilton %>%
  select(week_ending, avg_ticket_price) %>%
  mutate(week_ending = as.Date(week_ending))
clean_ham
range(clean_ham$week_ending)
#> [1] "2015-07-19" "2020-03-01"
A time series object must have a complete, evenly spaced index: every period between the start and end dates has to be present, with no gaps. Our data are weekly, so we need to make sure no week is missing. Therefore, we need to do time series padding.
pad_ham <- clean_ham %>%
  pad(start_val = head(clean_ham$week_ending, 1),
      end_val = tail(clean_ham$week_ending, 1))

pad_ham[is.na(pad_ham$avg_ticket_price), ]
# pad_ham <- pad_ham %>%
# mutate(avg_ticket_price = replace_na(avg_ticket_price, 0))
pad_ham <- pad_ham %>%
  fill(avg_ticket_price, .direction = "down")
pad_ham %>% is.na() %>% colSums()
#> week_ending avg_ticket_price
#> 0 0
index <- seq(from = as.Date("2015-07-19"), to = as.Date("2020-03-01"), by = "week")

ham_xts <- xts::xts(pad_ham$avg_ticket_price, order.by = index, frequency = 52)
ham_xts %>% autoplot() + ggtitle("Hamilton's Average Ticket Price per Week")
To create the time series object we use the ts() function and have to set its frequency. Here we use a frequency of 52 because we want to analyse the data weekly, and since one year contains approximately 52 weeks, one full seasonal cycle to be inspected spans 52 observations.
head(week(ham_xts),1)
#> [1] 29
ham_ts <- ts(data = ham_xts, frequency = 52, start = c(2015, head(week(ham_xts), 1)))
ham_ts %>%
  autoplot()

ham_ts %>%
  decompose() %>%
  autoplot()
There is both a trend and a seasonal component.
ham_msts <- pad_ham$avg_ticket_price %>%
  msts(seasonal.periods = c(12, 52), start = c(2015, head(week(ham_xts), 1))) # monthly and weekly seasonal periods

ham_msts %>%
  mstl() %>%
  autoplot()
freq_num <- 52 # one year of weekly observations
test_ham_msts <- tail(ham_msts, n = freq_num)
train_ham_msts <- head(ham_msts, length(ham_msts) - length(test_ham_msts))
res_adf <- train_ham_msts %>%
  adf.test()
#> [1] "p-value = 0.80081601590178"
#> [1] "p-value > alpha, which means data are non-stationary"
We must use diff() in order to make our data stationary.
diff_D1 <- train_ham_msts %>%
  diff(lag = 52) %>%
  adf.test()
#> [1] "p-value = 0.549881924195139"
#> [1] "p-value > alpha, which means data are non-stationary"
diff_D1_d1 <- train_ham_msts %>%
  diff(lag = 52) %>%
  diff() %>%
  adf.test()
#> [1] "p-value = 0.01"
#> [1] "p-value < alpha, which means data are stationary"
Using the Augmented Dickey-Fuller test, we confirm that after a seasonal difference (lag 52) followed by one regular difference, the data are stationary.
train_ham_msts %>%
  diff(lag = 52) %>%
  diff() %>%
  autoplot() +
  labs(title = "Autoplot for Train Dataset with Differencing") +
  theme(plot.title = element_text(face = "bold", color = "#a16526"))
train_ham_msts %>%
  diff(lag = 52) %>%
  diff() %>%
  tsdisplay(lag.max = 104)
To sum up, the candidate orders are: \(p\) = 1 or 9, \(d\) = 1, \(q\) = 1 or 9, \(P\) = 0 or 1, \(D\) = 1, \(Q\) = 0 or 1.
There are several models to choose from for this time series analysis:
We use Holt-Winters Exponential Smoothing, also known as Triple Exponential Smoothing, because our data contain both trend and seasonality.
SARIMA, or Seasonal AutoRegressive Integrated Moving Average, is another approach we are going to use, again because our dataset has both trend and seasonal factors. Here we use the Arima(), auto.arima(), and stlm() functions.
STLM, or Seasonal and Trend decomposition using Loess Model, is well suited to multiple seasonality. Since our data are additive we can use the stlm() function directly; if the data were multiplicative, we would apply the log() function first. The function can be combined with either ARIMA or Exponential Smoothing (ETS).
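To illustrate that last point, here is a minimal sketch (with hypothetical object names such as log_msts and model_log; our Hamilton data are treated as additive, so this step is not actually used in the analysis) of how a multiplicative series could be handled by modelling the log of the data and back-transforming the forecast:

# sketch only: handling a multiplicative series with a log transformation
log_msts  <- log(ham_msts)                       # stabilise multiplicative seasonality
model_log <- stlm(y = log_msts, s.window = 52, method = "ets")
fc_log    <- forecast(model_log, h = 52)
fc_back   <- exp(fc_log$mean)                    # back-transform to the original price scale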
The other alternative model is TBATS (Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components), which is fitted with the tbats() function.
tic()
model_hw1 <- HoltWinters(train_ham_msts, seasonal = "additive") # estimated alpha, beta, gamma ~ 0.2, 0.04, 0.4
model_hw2 <- HoltWinters(train_ham_msts, alpha = 0.5, beta = 0.01, gamma = 1,
                         seasonal = "additive")
model_hw3 <- HoltWinters(train_ham_msts, alpha = 0.2, beta = 0.01, gamma = 0.08,
                         seasonal = "additive")
model_arima_auto <- auto.arima(train_ham_msts)
model_arima1 <- stlm(y = train_ham_msts, s.window = 52, method = "arima")
model_arima2 <- stlm(y = train_ham_msts, method = "arima")
model_arima3 <- Arima(y = train_ham_msts, order = c(1, 1, 1), seasonal = c(0, 1, 1))
model_ets1 <- stlm(y = train_ham_msts, s.window = 52, method = "ets")
model_ets2 <- stlm(y = train_ham_msts, s.window = 12, method = "ets")
model_ets3 <- stlm(y = train_ham_msts, s.window = 12 * 4, method = "ets", robust = TRUE)
model_ets4 <- stlm(y = train_ham_msts, s.window = 53, method = "ets")
model_tbats <- tbats(train_ham_msts, use.box.cox = FALSE, seasonal.periods = 52)
toc()
#> 44.614 sec elapsed
model_sarima1 <- sarima.for(train_ham_msts, n.ahead = freq_num,
                            p = 9, d = 1, q = 1,
                            P = 0, D = 1, Q = 0, S = 53) # seasonal period 53
model_sarima2 <- sarima.for(train_ham_msts, n.ahead = freq_num,
                            p = 0, d = 1, q = 1,
                            P = 0, D = 1, Q = 0, S = 52) # seasonal period 52
model_sarima3 <- sarima.for(train_ham_msts, n.ahead = freq_num,
                            p = 0, d = 1, q = 1, # avoid q = 3
                            P = 0, D = 1, Q = 0, S = 53)
tic()
forecast_hw1 <- forecast(model_hw1, h = freq_num)
forecast_hw2 <- forecast(model_hw2, h = freq_num)
forecast_hw3 <- forecast(model_hw3, h = freq_num)
forecast_arima_auto <- forecast(model_arima_auto, h = freq_num)
forecast_arima1 <- forecast(model_arima1, h = freq_num)
forecast_arima2 <- forecast(model_arima2, h = freq_num)
forecast_arima3 <- forecast(model_arima3, h = freq_num)
forecast_ets1 <- forecast(model_ets1, h = freq_num)
forecast_ets2 <- forecast(model_ets2, h = freq_num)
forecast_ets3 <- forecast(model_ets3, h = freq_num)
forecast_ets4 <- forecast(model_ets4, h = freq_num)
forecast_tbats <- forecast(model_tbats, h = freq_num)
toc()
#> 1.418 sec elapsed
To evaluate the models we may use the accuracy() function from the forecast package, which reports the MAE, MAPE, and RMSE. Because we would like to compare these three metrics across all models, we instead use the MAE(), MAPE(), and RMSE() functions from MLmetrics, supplying the y_pred and y_true arguments, and collect the results in a data frame for easier comparison. The error can be judged from the Mean Absolute Error (MAE), the Mean Absolute Percentage Error (MAPE), or the Root Mean Square Error (RMSE). After compiling the metrics, we can compare the models below.
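The construction of the comparison table (comp_MAPE_1, used further down) is not shown in this extract; a minimal sketch of how it might be assembled, shown for a few of the models only and with illustrative labels, would be:

# sketch (assumed): collect the test-set error metrics into one data frame
comp_MAPE_1 <- data.frame(
  model = c("HW 1", "STLM ARIMA 1", "STLM ETS 1", "TBATS"),
  MAE   = c(MAE(forecast_hw1$mean, test_ham_msts),
            MAE(forecast_arima1$mean, test_ham_msts),
            MAE(forecast_ets1$mean, test_ham_msts),
            MAE(forecast_tbats$mean, test_ham_msts)),
  MAPE  = c(MAPE(forecast_hw1$mean, test_ham_msts),
            MAPE(forecast_arima1$mean, test_ham_msts),
            MAPE(forecast_ets1$mean, test_ham_msts),
            MAPE(forecast_tbats$mean, test_ham_msts)) * 100, # MLmetrics returns a proportion
  RMSE  = c(RMSE(forecast_hw1$mean, test_ham_msts),
            RMSE(forecast_arima1$mean, test_ham_msts),
            RMSE(forecast_ets1$mean, test_ham_msts),
            RMSE(forecast_tbats$mean, test_ham_msts))
)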
model_all_1 <- train_ham_msts %>%
  tail(n = 52) %>%
autoplot() +
autolayer(test_ham_msts, series = "Data Test") +
autolayer(forecast_hw1$mean, series = "forecast HW1") +
autolayer(forecast_hw2$mean, series = "forecast HW2") +
autolayer(forecast_hw3$mean, series = "forecast HW3") +
autolayer(forecast_arima_auto$mean, series = "forecast Arima Auto") +
autolayer(forecast_arima1$mean, series = "forecast Arima 1") +
autolayer(forecast_arima2$mean, series = "forecast Arima 2") +
autolayer(forecast_arima3$mean, series = "forecast Arima 3") +
autolayer(forecast_ets1$mean, series = "forecast ETS 1") +
autolayer(forecast_ets2$mean, series = "forecast ETS 2") +
autolayer(forecast_ets3$mean, series = "forecast ETS 3") +
autolayer(forecast_ets4$mean, series = "forecast ETS 4") +
autolayer(model_sarima1$pred, series = "forecast SARIMA 1") +
autolayer(model_sarima2$pred, series = "forecast SARIMA 2") +
autolayer(model_sarima3$pred, series = "forecast SARIMA 3") +
autolayer(forecast_tbats$mean, series = "forecast TBATS") +
labs(title = "Visualization of Data Forecast - Data Test",
subtitle = "For Hamilton's Average Ticket Price") +
theme(plot.title = element_text(face = "bold"))
model_all_1 %>%
  ggplotly() %>%
layout(title = list(text = paste0("<b>",
"Visualization of Data Forecast - Data Test",
"</b>",
"<br>", #line break
"<sup>", #superscript
"For Hamilton's Average Ticket Price",
"</sup>")))
comp_MAPE_1 %>%
  arrange(MAPE)
Now, we are going to visualize the top models that have a MAPE below 10%.
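The plotting code for this step is not in the extract; here is a rough sketch of what it could look like, assuming (for illustration only) that the ETS and the first Holt-Winters forecasts are among those under 10% MAPE:

# sketch: re-plot only the better-performing forecasts (illustrative subset of layers)
train_ham_msts %>%
  tail(n = 52) %>%
  autoplot() +
  autolayer(test_ham_msts, series = "Data Test") +
  autolayer(forecast_ets1$mean, series = "forecast ETS 1") +
  autolayer(forecast_ets2$mean, series = "forecast ETS 2") +
  autolayer(forecast_hw1$mean, series = "forecast HW1") +
  labs(title = "Forecasts with MAPE below 10%",
       subtitle = "For Hamilton's Average Ticket Price") +
  theme(plot.title = element_text(face = "bold"))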
As we can see, the forecasts follow almost the same pattern as the actual data; if we look at January 2020, the predictions are almost perfect. Because all of the ETS models have almost the same MAPE (around 5.67%), we will move on to the other models.
Every model must be checked with assumption tests; for a time series model specifically, we have to check for autocorrelation and for normality of the residuals. We may use the Ljung-Box test via the Box.test() function, with "Ljung-Box" as the type parameter.
a <- Box.test(forecast_hw1$residuals, type = "Ljung-Box")
b <- Box.test(forecast_hw2$residuals, type = "Ljung-Box")
c <- Box.test(forecast_hw3$residuals, type = "Ljung-Box")
d <- Box.test(forecast_arima_auto$residuals, type = "Ljung-Box")
e <- Box.test(forecast_arima1$residuals, type = "Ljung-Box")
f <- Box.test(forecast_arima2$residuals, type = "Ljung-Box")
g <- Box.test(forecast_arima3$residuals, type = "Ljung-Box")
i <- Box.test(forecast_ets1$residuals, type = "Ljung-Box")
j <- Box.test(forecast_ets2$residuals, type = "Ljung-Box")
k <- Box.test(forecast_ets3$residuals, type = "Ljung-Box")
l <- Box.test(forecast_ets4$residuals, type = "Ljung-Box")
m <- Box.test(model_sarima1$pred, type = "Ljung-Box")
n <- Box.test(model_sarima2$pred, type = "Ljung-Box")
o <- Box.test(model_sarima3$pred, type = "Ljung-Box")
q <- Box.test(forecast_tbats$residuals, type = "Ljung-Box")
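The assembly of model_assump1 (used just below) is not shown in the extract; a plausible sketch is to gather the Ljung-Box p-values into one data frame (the model labels here are mine):

# sketch (assumed): collect the Ljung-Box p-values from the tests above
model_assump1 <- data.frame(
  model   = c("HW1", "HW2", "HW3", "ARIMA auto", "ARIMA 1", "ARIMA 2", "ARIMA 3",
              "ETS 1", "ETS 2", "ETS 3", "ETS 4", "SARIMA 1", "SARIMA 2", "SARIMA 3", "TBATS"),
  p_value = c(a$p.value, b$p.value, c$p.value, d$p.value, e$p.value, f$p.value, g$p.value,
              i$p.value, j$p.value, k$p.value, l$p.value, m$p.value, n$p.value, o$p.value, q$p.value)
)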
model_assump1 %>%
  mutate(autocorrelation = case_when(p_value >= 0.05 ~ "failed to reject H0",
                                     p_value < 0.05  ~ "reject H0"))
From the results we can conclude that the residuals of all the models show no autocorrelation, except for the three SARIMA models, whose p-values are below alpha.
To test the normality of the residuals, we may use the Shapiro-Wilk normality test via the shapiro.test() function.
aa <- shapiro.test(forecast_hw1$residuals)
ab <- shapiro.test(forecast_hw2$residuals)
ac <- shapiro.test(forecast_hw3$residuals)
ad <- shapiro.test(forecast_arima_auto$residuals)
ae <- shapiro.test(forecast_arima1$residuals)
af <- shapiro.test(forecast_arima2$residuals)
ag <- shapiro.test(forecast_arima3$residuals)
ai <- shapiro.test(forecast_ets1$residuals)
aj <- shapiro.test(forecast_ets2$residuals)
ak <- shapiro.test(forecast_ets3$residuals)
al <- shapiro.test(forecast_ets4$residuals)
am <- shapiro.test(model_sarima1$pred)
an <- shapiro.test(model_sarima2$pred)
ao <- shapiro.test(model_sarima3$pred)
aq <- shapiro.test(forecast_tbats$residuals)
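model_normal is presumably built the same way, this time from the Shapiro-Wilk p-values (again a sketch with assumed labels):

# sketch (assumed): collect the Shapiro-Wilk p-values from the tests above
model_normal <- data.frame(
  model   = c("HW1", "HW2", "HW3", "ARIMA auto", "ARIMA 1", "ARIMA 2", "ARIMA 3",
              "ETS 1", "ETS 2", "ETS 3", "ETS 4", "SARIMA 1", "SARIMA 2", "SARIMA 3", "TBATS"),
  p_value = c(aa$p.value, ab$p.value, ac$p.value, ad$p.value, ae$p.value, af$p.value, ag$p.value,
              ai$p.value, aj$p.value, ak$p.value, al$p.value, am$p.value, an$p.value, ao$p.value, aq$p.value)
)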
model_normal %>%
  mutate(normality = case_when(p_value >= 0.05 ~ "failed to reject H0",
                               p_value < 0.05  ~ "reject H0"))
We can conclude that none of the models' residuals are normally distributed.
We all know that Broadway, as of April 2021, is still closed, but it couldn't hurt to take a look at what could have been if there were no global pandemic, right?
ham_msts
#> Multi-Seasonal Time Series:
#> Start: 2015 29
#> Seasonal Periods: 12 52
#> Data:
#> [1] 138.94 140.21 140.39 117.99 136.31 136.04 144.62 158.52 145.91 144.59
#> [11] 146.33 138.36 156.58 137.94 139.08 148.71 147.07 148.66 135.14 170.77
#> [21] 150.36 152.85 154.60 154.60 182.39 159.23 164.55 161.84 161.21 164.81
#> [31] 166.69 162.79 163.17 164.28 163.50 161.82 159.89 169.51 168.65 156.41
#> [41] 168.61 169.12 170.49 157.06 164.09 178.38 172.46 188.60 188.51 187.03
#> [51] 188.36 190.95 189.35 190.32 189.85 191.79 190.14 192.07 191.02 194.55
#> [61] 199.97 200.75 199.95 202.23 205.87 201.21 185.70 206.03 207.56 228.04
#> [71] 228.30 303.21 208.09 227.23 208.65 307.16 310.13 228.43 227.90 227.98
#> [81] 229.17 298.92 270.54 267.10 291.37 264.10 292.48 266.89 287.86 262.17
#> [91] 276.99 289.94 291.07 259.59 292.14 263.01 291.36 257.05 284.07 255.42
#> [101] 282.61 282.96 284.00 284.61 280.79 282.01 280.78 283.28 282.65 276.16
#> [111] 275.52 277.78 274.70 272.72 272.59 273.54 272.24 272.55 271.20 248.26
#> [121] 272.39 292.14 263.31 321.13 264.33 271.04 263.37 353.06 358.46 291.24
#> [131] 288.61 287.93 286.99 259.81 284.82 287.10 287.98 285.46 292.88 292.93
#> [141] 286.67 268.47 292.06 291.87 286.51 262.97 288.31 262.30 290.20 278.35
#> [151] 290.46 265.29 291.16 291.27 289.99 295.08 289.13 292.34 293.13 292.95
#> [161] 292.08 289.72 302.18 302.16 298.36 297.70 301.05 301.89 299.98 301.77
#> [171] 301.99 274.34 263.15 274.24 273.29 354.10 274.54 274.45 302.60 335.92
#> [181] 375.39 310.89 293.66 299.15 292.75 290.87 288.96 274.39 292.88 286.94
#> [191] 296.48 297.12 274.67 301.00 273.92 296.77 306.95 301.56 296.45 291.92
#> [201] 299.82 272.29 293.20 267.69 295.98 295.58 298.52 287.56 284.12 288.37
#> [211] 285.26 285.07 282.72 275.89 273.80 280.16 270.71 270.11 276.56 278.07
#> [221] 278.35 287.48 259.75 275.81 224.35 287.06 257.72 247.22 314.22 284.53
#> [231] 258.05 297.87 339.03 296.96 256.99 253.26 248.70 248.89 250.34 237.78
#> [241] 257.03 250.67
tic()
final_model_ets1 <- stlm(y = ham_msts, s.window = 52, method = "ets")
final_model_tbats <- tbats(ham_msts, use.box.cox = FALSE, seasonal.periods = 52)
final_model_hw1 <- HoltWinters(ham_msts, seasonal = "additive")

final_forecast_model_ets1 <- forecast(final_model_ets1, h = freq_num)
final_forecast_model_tbats <- forecast(final_model_tbats, h = freq_num)
final_forecast_model_hw1 <- forecast(final_model_hw1, h = freq_num)
toc()
#> 12.858 sec elapsed
final_model_plot <- ham_msts %>%
  autoplot() +
autolayer(final_forecast_model_ets1$mean, series = "forecast ETS1") +
autolayer(final_forecast_model_tbats$mean, series = "forecast TBATS") +
autolayer(final_forecast_model_hw1$mean, series = "forecast HW1") +
labs(title = "Visualization of Data Forecast - Data Test",
subtitle = "For Hamilton's Average Ticket Price") +
theme(plot.title = element_text(face = "bold"))
final_model_plot %>%
  ggplotly() %>%
layout(title = list(text = paste0("<b>",
"Visualization of Data Forecast",
"</b>",
"<br>", #line break
"<sup>", #superscript
"For Hamilton's Average Ticket Price",
"</sup>")))
Here we can see that the models predict that, if the show had carried on in the same state as when the last data point was taken (the first of March 2020), the average ticket price would keep decreasing. While this is good news for us theatregoers, the production could think of something to “spice things up,” such as bringing in famous actors to play the lead, or a marketing strategy to increase demand so that the average ticket price rises again.
Now that we know the best model, let's use it to interpret the seasonality.
ham_ts_decompose <- ham_ts %>%
  decompose()

ham_msts_mstl <- ham_msts %>%
  mstl() %>% # decompose the multi-seasonal series
  as.data.frame()
weekly_tickets <- matrix(ham_ts_decompose$seasonal,
                         ncol = 52,
                         byrow = TRUE)

week_plot <- data.frame(seasonal = weekly_tickets[1, ],
                        week = 1:52) %>%
  distinct(week, seasonal) %>%
  mutate(rounded_seasonal = round(seasonal)) %>%
  ggplot(mapping = aes(x = week, y = seasonal,
                       text = glue("Week: {week}
Seasonal: {rounded_seasonal}"))) +
  geom_col(aes(fill = seasonal)) +
  scale_fill_gradient2(low = "#404040", mid = "#f6c049", high = "#a16526") +
  scale_x_continuous(breaks = seq(1, 52, 3)) +
  labs(title = "The Seasonality Analysis",
       subtitle = "Based on the Weekly data") +
  theme(legend.position = "bottom")
week_plot %>%
  ggplotly(tooltip = "text") %>%
layout(title = list(text = paste0("<b>",
"The Seasonality Analysis of Hamilton's Average Ticket Price",
"</b>",
"<br>", #line break
"<sup>", #superscript
"Based on the Weekly data",
"</sup>")))
As we can see, the single-seasonal decomposition of ham_ts does not capture the weeks correctly: January is supposed to be the high season, and prices should be decreasing around the fall, from roughly week 31 until week 40.
Let's look at the multiple-seasonal decomposition instead. Here is ham_msts for comparison.
monthly_tickets <- ham_msts_mstl %>%
  mutate(week_ending = pad_ham$week_ending,
         Week = week(pad_ham$week_ending),
         Month = month(pad_ham$week_ending, label = TRUE, abbr = TRUE)) %>%
  group_by(Week, Month) %>%
  summarise(seasonal = mean(Seasonal12 + Seasonal52))
monthly_plot <- monthly_tickets %>%
  ggplot() +
geom_bar(aes(x = Week,
y = seasonal,
fill = Month),
stat ="identity",
position = "dodge",
width = 0.7) +
scale_x_continuous(breaks = seq(1,53,4)) +
geom_hline(yintercept = 0, col = "navy", lty = 1, lwd = 0.3) +
theme_minimal() +
# scale_fill_gradient(low = "#000000", high = "#a16526") +
labs(title = "The Multi-Seasonality Analysis on Hamilton's Average Ticket Price",
subtitle = "Based on the Weekly and Monthly Data") +
theme(legend.position = "right")
monthly_plot %>%
  ggplotly() %>%
layout(title = list(text = paste0("<b>",
"The Multi-Seasonality Analysis on Hamilton's Average Ticket Price",
"</b>",
"<br>", #line break
"<sup>", #superscript
"Based on the Weekly and Monthly Data",
"</sup>")))
As we can see, the msts decomposition captures the seasonality much better. We can conclude that the last weeks of the year through the first week of the next year have very high seasonality. After July (around week 30), presumably the end of the summer holiday period, ticket prices start to decrease, reaching their lowest point in week 44 (the first week of November), and then bounce back up out of the negative seasonal range.
There we go, thank you very much for taking the time to read my report. I hope it could be useful for you! -xn