Illustration by the author of the report.
Greetings all! It’s Gissella Nadya, and today we are back with a project: forecasting the average ticket price for the musical Hamilton on Broadway. The data were obtained from Playbill.com. For a brief review, Hamilton is a Broadway musical that tells the story of America’s Founding Fathers. Since Broadway is still closed due to the global pandemic, let’s imagine what the average ticket price to watch the musical would be if we used its past history to forecast it. Okay, let’s get started!
library(tidyverse)
library(padr)
library(forecast) # autoplot()
library(tseries) # adf.test()
library(MLmetrics) # MAE()
library(plotly) # plotly
library(glue) # glue text
library(tictoc)
library(gganimate)
library(astsa)
library(lubridate)

bway <- read_csv("grosses.csv")
bway

hamilton <- bway %>%
  filter(show == "Hamilton") %>%
  select(week_ending, weekly_gross, avg_ticket_price, seats_sold)

bway %>%
  filter(show == "Hamilton") %>%
  select(week_ending, week_number, avg_ticket_price)

Our target variable for the forecast is avg_ticket_price.
clean_ham <- hamilton %>%
  select(week_ending, avg_ticket_price) %>%
  mutate(week_ending = as.Date(week_ending))

clean_ham

range(clean_ham$week_ending)
#> [1] "2015-07-19" "2020-03-01"
A proper time series object must consist of complete, evenly spaced observations, with no missing periods in between.
Therefore, we need to do time series padding.
pad_ham <- clean_ham %>%
  pad(start_val = head(clean_ham$week_ending, 1),
      end_val = tail(clean_ham$week_ending, 1))

pad_ham[is.na(pad_ham$avg_ticket_price), ]

The padding introduces missing values for the weeks that were absent from the original data; we will fill them with the most recent observed price using fill() with direction "down".
# pad_ham <- pad_ham %>%
#   mutate(avg_ticket_price = replace_na(avg_ticket_price, 0))

pad_ham <- pad_ham %>%
  fill(avg_ticket_price, .direction = "down")

pad_ham %>% is.na() %>% colSums()
#>      week_ending avg_ticket_price
#>                0                0
index <- seq(from = as.Date("2015-07-19"), to = as.Date("2020-03-01"), by = "week")
ham_xts <- xts::xts(pad_ham$avg_ticket_price, order.by = index, frequency = 52)
ham_xts %>% autoplot() + ggtitle("Hamilton's Average Ticket Price per Week")

To make a time series object, we use the ts() function and need to determine the frequency of the dataset. Here we use 52 as the frequency because we want to analyze the data on a weekly basis, and since there are approximately 52 weeks in a year, one full seasonal cycle spans 52 weeks.
head(week(ham_xts), 1)
#> [1] 29
ham_ts <- ts(data = ham_xts, frequency = 52, start = c(2015, head(week(ham_xts),1)))
ham_ts %>%
  autoplot()

ham_ts %>%
  decompose() %>%
  autoplot()

From the decomposition plot we can see that there are both a trend and a seasonal component.
ham_msts <- pad_ham$avg_ticket_price %>%
  msts(seasonal.periods = c(12, 52), start = c(2015, head(week(ham_xts), 1))) # month, week

ham_msts %>%
  mstl() %>%
  autoplot()

freq_num <- 52
test_ham_msts <- tail(ham_msts, n = freq_num)
train_ham_msts <- head(ham_msts, length(ham_msts) - length(test_ham_msts))

res_adf <- train_ham_msts %>%
  adf.test()
#> [1] "p-value = 0.80081601590178"
#> [1] "p-value > alpha, which means data are non-stationary"
We must use diff() in order to make our data stationary.
diff_D1 <- train_ham_msts %>%
  diff(lag = 52) %>%
  adf.test()
#> [1] "p-value = 0.549881924195139"
#> [1] "p-value > alpha, which means data are non-stationary"
diff_D1_d1 <- train_ham_msts %>%
  diff(lag = 52) %>%
  diff() %>%
  adf.test()
#> [1] "p-value = 0.01"
#> [1] "p-value < alpha, which means data are stationary"
Using the Augmented Dickey-Fuller test, after one seasonal differencing (lag 52) and one regular differencing, we confirm that the data are stationary.
train_ham_msts %>%
  diff(lag = 52) %>%
  diff() %>%
  autoplot() +
  labs(title = "Autoplot for Train Dataset with Differencing") +
  theme(plot.title = element_text(face = "bold", color = "#a16526"))

train_ham_msts %>%
  diff(lag = 52) %>%
  diff() %>%
  tsdisplay(lag.max = 104)

To sum up:
\(p\) = 1 or 9, \(d\) = 1, \(q\) = 1 or 9, \(P\) = 0 or 1, \(D\) = 1, \(Q\) = 0 or 1
There are several models to choose from for time series analysis:
We are using Holt-Winters Exponential Smoothing, also known as Triple Exponential Smoothing, because our data contain both trend and seasonality.
SARIMA, or Seasonal Autoregressive Integrated Moving Average, is another alternative that we are going to use, again because there are trend and seasonal factors in our dataset. Here we use the Arima(), auto.arima(), and stlm() functions.
STLM, or Seasonal and Trend decomposition using Loess Model, is well suited to multiple seasonality. Since our data are additive, we can use the stlm() function directly; if our data were multiplicative, we would first apply the log() function (a brief sketch follows below). The stlm() approach can be combined with either ARIMA or Exponential Smoothing (ETS).
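The multiplicative case is not needed for our data, but as an illustration, here is a minimal sketch of it, assuming we simply model the log of the training series with stlm() and back-transform the point forecasts with exp(); the object names log_model, log_forecast, and price_back are made up for this example.

# Hypothetical sketch: STLM on a log-transformed (multiplicative) series,
# with the forecasts converted back to the original price scale.
log_model    <- stlm(y = log(train_ham_msts), s.window = 52, method = "ets")
log_forecast <- forecast(log_model, h = 52)
price_back   <- exp(log_forecast$mean)  # point forecasts back on the dollar scale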
Another alternative is TBATS (Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components), which is fitted with the tbats() function.
tic()

# Holt-Winters (Triple Exponential Smoothing), additive seasonality
model_hw1 <- HoltWinters(train_ham_msts, seasonal = "additive") # 0.2 , 0.04 , 0.4
model_hw2 <- HoltWinters(train_ham_msts, alpha = 0.5, beta = 0.01, gamma = 1,
                         seasonal = "additive")
model_hw3 <- HoltWinters(train_ham_msts, alpha = 0.2, beta = 0.01, gamma = 0.08,
                         seasonal = "additive") ##

# (S)ARIMA and STLM + ARIMA
model_arima_auto <- auto.arima(train_ham_msts)
model_arima1 <- stlm(y = train_ham_msts, s.window = 52, method = "arima") ##
model_arima2 <- stlm(y = train_ham_msts, method = "arima")
model_arima3 <- Arima(y = train_ham_msts, order = c(1,1,1), seasonal = c(0,1,1))

# STLM + ETS
model_ets1 <- stlm(y = train_ham_msts, s.window = 52, method = "ets") ##
model_ets2 <- stlm(y = train_ham_msts, s.window = 12, method = "ets") ##
model_ets3 <- stlm(y = train_ham_msts, s.window = 12*4, method = "ets", robust = T)
model_ets4 <- stlm(y = train_ham_msts, s.window = 53, method = "ets")

# TBATS
model_tbats <- tbats(train_ham_msts, use.box.cox = F, seasonal.periods = 52)

toc()
#> 44.614 sec elapsed
model_sarima1 <- sarima.for(train_ham_msts, n.ahead = freq_num,
                            p = 9, d = 1, q = 1,
                            P = 0, D = 1, Q = 0, S = 53)

model_sarima2 <- sarima.for(train_ham_msts, n.ahead = freq_num,
                            p = 0, d = 1, q = 1,
                            P = 0, D = 1, Q = 0, S = 52)

model_sarima3 <- sarima.for(train_ham_msts, n.ahead = freq_num,
                            p = 0, d = 1, q = 1, # don't use q = 3
                            P = 0, D = 1, Q = 0, S = 53)

tic()
forecast_hw1 <- forecast(model_hw1, h = freq_num)
forecast_hw2 <- forecast(model_hw2, h = freq_num)
forecast_hw3 <- forecast(model_hw3, h = freq_num)
forecast_arima_auto <- forecast(model_arima_auto, h = freq_num)
forecast_arima1 <- forecast(model_arima1, h = freq_num)
forecast_arima2 <- forecast(model_arima2, h = freq_num)
forecast_arima3 <- forecast(model_arima3, h = freq_num)
forecast_ets1 <- forecast(model_ets1, h = freq_num)
forecast_ets2 <- forecast(model_ets2, h = freq_num)
forecast_ets3 <- forecast(model_ets3, h = freq_num)
forecast_ets4 <- forecast(model_ets4, h = freq_num)
forecast_tbats <- forecast(model_tbats, h = freq_num)
toc()
#> 1.418 sec elapsed
# forecast_arima4 <- forecast(model_arima4, h = freq_num)
# forecast_tbats <- forecast(model_tbats, h = freq_num)

To evaluate the models we could use the accuracy() function from the forecast library to obtain the MAE, MAPE, and RMSE. Because we would like to compare the values of these three metrics across models, we use the MAE(), MAPE(), and RMSE() functions, specifying the y_pred and y_true parameters, and collect the results into a data frame for easier comparison.
To evaluate the error, we can look at the Mean Absolute Error (MAE), the Mean Absolute Percentage Error (MAPE), or the Root Mean Square Error (RMSE). After compiling the metrics, we can compare how each model performs.
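The comparison table comp_MAPE_1 used further below is not constructed in the code shown; here is a minimal sketch of how it could be built with the MLmetrics functions, assuming one row per model and MAPE expressed in percent. The helper object preds, the model labels, and the column names are my assumptions.

# Hypothetical sketch: pair each model's point forecasts with the test set
# and compute MAE, RMSE, and MAPE for every model in one pass.
preds <- list(
  "HW1" = forecast_hw1$mean,        "HW2" = forecast_hw2$mean,
  "HW3" = forecast_hw3$mean,        "Arima Auto" = forecast_arima_auto$mean,
  "Arima 1" = forecast_arima1$mean, "Arima 2" = forecast_arima2$mean,
  "Arima 3" = forecast_arima3$mean, "ETS 1" = forecast_ets1$mean,
  "ETS 2" = forecast_ets2$mean,     "ETS 3" = forecast_ets3$mean,
  "ETS 4" = forecast_ets4$mean,     "SARIMA 1" = model_sarima1$pred,
  "SARIMA 2" = model_sarima2$pred,  "SARIMA 3" = model_sarima3$pred,
  "TBATS" = forecast_tbats$mean
)
actual <- as.numeric(test_ham_msts)

comp_MAPE_1 <- data.frame(
  model = names(preds),
  MAE   = sapply(preds, MAE,  y_true = actual),
  RMSE  = sapply(preds, RMSE, y_true = actual),
  MAPE  = sapply(preds, MAPE, y_true = actual) * 100  # MLmetrics returns a fraction; shown here as percent
)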
model_all_1 <- train_ham_msts %>%
tail(n = 52) %>%
autoplot() +
autolayer(test_ham_msts, series = "Data Test") +
autolayer(forecast_hw1$mean, series = "forecast HW1") +
autolayer(forecast_hw2$mean, series = "forecast HW2") +
autolayer(forecast_hw3$mean, series = "forecast HW3") +
autolayer(forecast_arima_auto$mean, series = "forecast Arima Auto") +
autolayer(forecast_arima1$mean, series = "forecast Arima 1") +
autolayer(forecast_arima2$mean, series = "forecast Arima 2") +
autolayer(forecast_arima3$mean, series = "forecast Arima 3") +
autolayer(forecast_ets1$mean, series = "forecast ETS 1") +
autolayer(forecast_ets2$mean, series = "forecast ETS 2") +
autolayer(forecast_ets3$mean, series = "forecast ETS 3") +
autolayer(forecast_ets4$mean, series = "forecast ETS 4") +
autolayer(model_sarima1$pred, series = "forecast SARIMA 1") +
autolayer(model_sarima2$pred, series = "forecast SARIMA 2") +
autolayer(model_sarima3$pred, series = "forecast SARIMA 3") +
autolayer(forecast_tbats$mean, series = "forecast TBATS") +
labs(title = "Visualization of Data Forecast - Data Test",
subtitle = "For Hamilton's Average Ticket Price") +
theme(plot.title = element_text(face = "bold"))
model_all_1 %>%
ggplotly() %>%
layout(title = list(text = paste0("<b>",
"Visualization of Data Forecast - Data Test",
"</b>",
"<br>", #line break
"<sup>", #superscript
"For Hamilton's Average Ticket Price",
"</sup>")))comp_MAPE_1 %>%
arrange(MAPE)Now, we are going to visualize the top models that has MAPE below 10%.
As we can see, the forecasts follow almost the same pattern as the actual data. If we look at January 2020, the predictions are nearly spot on.
Because all of the ETS models have almost the same MAPE (around 5.67%), we are going to move on to the other models.
Every model must be checked against the assumption tests; specifically for a time series model, we have to check the no-autocorrelation assumption and the normality of the residuals.
We can run the Ljung-Box test with the Box.test() function, using "Ljung-Box" as the type parameter.
a <- Box.test(forecast_hw1$residuals, type = "Ljung-Box")
b <- Box.test(forecast_hw2$residuals, type = "Ljung-Box")
c <- Box.test(forecast_hw3$residuals, type = "Ljung-Box")
d <- Box.test(forecast_arima_auto$residuals, type = "Ljung-Box")
e <- Box.test(forecast_arima1$residuals, type = "Ljung-Box")
f <- Box.test(forecast_arima2$residuals, type = "Ljung-Box")
g <- Box.test(forecast_arima3$residuals, type = "Ljung-Box")
i <- Box.test(forecast_ets1$residuals, type = "Ljung-Box")
j <- Box.test(forecast_ets2$residuals, type = "Ljung-Box")
k <- Box.test(forecast_ets3$residuals, type = "Ljung-Box")
l <- Box.test(forecast_ets4$residuals, type = "Ljung-Box")
m <- Box.test(model_sarima1$pred, type = "Ljung-Box") # note: sarima.for() returns point forecasts in $pred, not residuals
n <- Box.test(model_sarima2$pred, type = "Ljung-Box")
o <- Box.test(model_sarima3$pred, type = "Ljung-Box")
q <- Box.test(forecast_tbats$residuals, type = "Ljung-Box")

model_assump1 %>%
  mutate(autocorrelation = case_when(p_value >= 0.05 ~ "failed to reject H0",
                                     p_value < 0.05 ~ "reject H0"))

From the results we can conclude that none of the models show autocorrelation, except for the three SARIMA models, whose p-values fall below the alpha.
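The code that builds model_assump1 is not shown above; here is a minimal sketch of how the Ljung-Box p-values could be collected into that data frame, assuming one row per model. The list names and the p_value column name are my assumptions.

# Hypothetical reconstruction: gather each Ljung-Box htest object and pull out its p-value.
ljung_tests <- list(
  "Holt-Winters 1" = a, "Holt-Winters 2" = b, "Holt-Winters 3" = c,
  "Arima Auto" = d, "Arima 1" = e, "Arima 2" = f, "Arima 3" = g,
  "ETS 1" = i, "ETS 2" = j, "ETS 3" = k, "ETS 4" = l,
  "SARIMA 1" = m, "SARIMA 2" = n, "SARIMA 3" = o, "TBATS" = q
)

model_assump1 <- data.frame(
  model   = names(ljung_tests),
  p_value = sapply(ljung_tests, function(x) x$p.value)
)

The model_normal data frame used in the normality check below could be assembled in the same way from the Shapiro-Wilk test objects.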
To test the normality of the residuals, we can use the Shapiro-Wilk normality test via the shapiro.test() function.
aa <- shapiro.test(forecast_hw1$residuals)
ab <- shapiro.test(forecast_hw2$residuals)
ac <- shapiro.test(forecast_hw3$residuals)
ad <- shapiro.test(forecast_arima_auto$residuals)
ae <- shapiro.test(forecast_arima1$residuals)
af <- shapiro.test(forecast_arima2$residuals)
ag <- shapiro.test(forecast_arima3$residuals)
ai <- shapiro.test(forecast_ets1$residuals)
aj <- shapiro.test(forecast_ets2$residuals)
ak <- shapiro.test(forecast_ets3$residuals)
al <- shapiro.test(forecast_ets4$residuals)
am <- shapiro.test(model_sarima1$pred)
an <- shapiro.test(model_sarima2$pred)
ao <- shapiro.test(model_sarima3$pred)
aq <- shapiro.test(forecast_tbats$residuals)

model_normal %>%
  mutate(normality = case_when(p_value >= 0.05 ~ "failed to reject H0",
                               p_value < 0.05 ~ "reject H0"))

We can conclude that the residuals are NOT normally distributed for any of the models.
We all know that Broadway, as of April 2021, is still closed, but it couldn't hurt to take a look at what could have been if there were no global pandemic, right?
ham_msts
#> Multi-Seasonal Time Series:
#> Start: 2015 29
#> Seasonal Periods: 12 52
#> Data:
#> [1] 138.94 140.21 140.39 117.99 136.31 136.04 144.62 158.52 145.91 144.59
#> [11] 146.33 138.36 156.58 137.94 139.08 148.71 147.07 148.66 135.14 170.77
#> [21] 150.36 152.85 154.60 154.60 182.39 159.23 164.55 161.84 161.21 164.81
#> [31] 166.69 162.79 163.17 164.28 163.50 161.82 159.89 169.51 168.65 156.41
#> [41] 168.61 169.12 170.49 157.06 164.09 178.38 172.46 188.60 188.51 187.03
#> [51] 188.36 190.95 189.35 190.32 189.85 191.79 190.14 192.07 191.02 194.55
#> [61] 199.97 200.75 199.95 202.23 205.87 201.21 185.70 206.03 207.56 228.04
#> [71] 228.30 303.21 208.09 227.23 208.65 307.16 310.13 228.43 227.90 227.98
#> [81] 229.17 298.92 270.54 267.10 291.37 264.10 292.48 266.89 287.86 262.17
#> [91] 276.99 289.94 291.07 259.59 292.14 263.01 291.36 257.05 284.07 255.42
#> [101] 282.61 282.96 284.00 284.61 280.79 282.01 280.78 283.28 282.65 276.16
#> [111] 275.52 277.78 274.70 272.72 272.59 273.54 272.24 272.55 271.20 248.26
#> [121] 272.39 292.14 263.31 321.13 264.33 271.04 263.37 353.06 358.46 291.24
#> [131] 288.61 287.93 286.99 259.81 284.82 287.10 287.98 285.46 292.88 292.93
#> [141] 286.67 268.47 292.06 291.87 286.51 262.97 288.31 262.30 290.20 278.35
#> [151] 290.46 265.29 291.16 291.27 289.99 295.08 289.13 292.34 293.13 292.95
#> [161] 292.08 289.72 302.18 302.16 298.36 297.70 301.05 301.89 299.98 301.77
#> [171] 301.99 274.34 263.15 274.24 273.29 354.10 274.54 274.45 302.60 335.92
#> [181] 375.39 310.89 293.66 299.15 292.75 290.87 288.96 274.39 292.88 286.94
#> [191] 296.48 297.12 274.67 301.00 273.92 296.77 306.95 301.56 296.45 291.92
#> [201] 299.82 272.29 293.20 267.69 295.98 295.58 298.52 287.56 284.12 288.37
#> [211] 285.26 285.07 282.72 275.89 273.80 280.16 270.71 270.11 276.56 278.07
#> [221] 278.35 287.48 259.75 275.81 224.35 287.06 257.72 247.22 314.22 284.53
#> [231] 258.05 297.87 339.03 296.96 256.99 253.26 248.70 248.89 250.34 237.78
#> [241] 257.03 250.67
tic()
final_model_ets1 <- stlm(y = ham_msts, s.window = 52, method = "ets")
final_model_tbats <- tbats(ham_msts, use.box.cox = F,seasonal.periods = 52)
final_model_hw1 <- HoltWinters(ham_msts, seasonal = "additive")
final_forecast_model_ets1 <- forecast(final_model_ets1, h = freq_num)
final_forecast_model_tbats <- forecast(final_model_tbats, h = freq_num)
final_forecast_model_hw1 <- forecast(final_model_hw1, h = freq_num)
toc()
#> 12.858 sec elapsed
final_model_plot <- ham_msts %>%
autoplot() +
autolayer(final_forecast_model_ets1$mean, series = "forecast ETS1") +
autolayer(final_forecast_model_tbats$mean, series = "forecast TBATS") +
autolayer(final_forecast_model_hw1$mean, series = "forecast HW1") +
labs(title = "Visualization of Data Forecast - Data Test",
subtitle = "For Hamilton's Average Ticket Price") +
theme(plot.title = element_text(face = "bold"))
final_model_plot %>%
ggplotly() %>%
layout(title = list(text = paste0("<b>",
"Visualization of Data Forecast",
"</b>",
"<br>", #line break
"<sup>", #superscript
"For Hamilton's Average Ticket Price",
"</sup>")))Here we could see that the models predicted that, if the musical is with the current state (until the last data were taken which is the First of March of 2020), the average ticket price could decreasing. While this is good news for us theatregoers, the productions could think on something to “spice things up” like bringing famous actors to play the lead, or have marketing strategy so that the demand are increasing to make the average ticket price heightening.
Now that we know the best model, we are going to use the decomposition to interpret the seasonality.
ham_ts_decompose <- ham_ts %>%
decompose()
ham_msts_mstl <- ham_msts %>%
  mstl() %>% # decompose for msts
  as.data.frame()

weekly_tickets <- matrix(ham_ts_decompose$seasonal,
                         ncol = 52,
                         byrow = T)
week_plot <- data.frame(seasonal = weekly_tickets[1, ],
                        week = 1:52) %>%
  distinct(week, seasonal) %>%
  mutate(rounded_seasonal = round(seasonal)) %>%
  ggplot(mapping = aes(x = week, y = seasonal,
text = glue("Week: {week}
Seasonal: {rounded_seasonal}"))) +
geom_col(aes(fill = seasonal)) +
# theme_minimal() +
scale_fill_gradient2(low = "#404040", mid = "#f6c049", high = "#a16526") + # #f6c049
scale_x_continuous(breaks = seq(1,52,3)) +
labs(title = "The Seasonality Analysis",
subtitle = "Based on the Weekly data") +
theme(legend.position = "bottom")
week_plot %>%
ggplotly(tooltip = "text") %>%
layout(title = list(text = paste0("<b>",
"The Seasonality Analysis of Hamilton's Average Ticket Price",
"</b>",
"<br>", #line break
"<sup>", #superscript
"Based on the Weekly data",
"</sup>")))As we can see here our time series data could not grasp the week correctly as seen on ham_ts, January supposed to be high season and around the fall or week 31 until around 40 should be decreasing.
Let's look at the multiple-seasonal plot instead.
Here is the ham_msts for comparison.
monthly_tickets <- ham_msts_mstl %>%
mutate(week_ending = pad_ham$week_ending,
Week = week(pad_ham$week_ending),
Month = month(pad_ham$week_ending, label = T, abbr = T)) %>%
group_by(Week, Month) %>%
summarise(seasonal = mean(Seasonal12 + Seasonal52))
monthly_plot <- monthly_tickets %>%
ggplot() +
geom_bar(aes(x = Week,
y = seasonal,
fill = Month),
stat ="identity",
position = "dodge",
width = 0.7) +
scale_x_continuous(breaks = seq(1,53,4)) +
geom_hline(yintercept = 0, col = "navy", lty = 1, lwd = 0.3) +
theme_minimal() +
# scale_fill_gradient(low = "#000000", high = "#a16526") +
labs(title = "The Multi-Seasonality Analysis on Hamilton's Average Ticket Price",
subtitle = "Based on the Weekly and Monthly Data") +
theme(legend.position = "right")
monthly_plot %>%
ggplotly() %>%
layout(title = list(text = paste0("<b>",
"The Multi-Seasonality Analysis on Hamilton's Average Ticket Price",
"</b>",
"<br>", #line break
"<sup>", #superscript
"Based on the Weekly and Monthly Data",
"</sup>")))As we can see our msts can plot the seasonal better. Here we can conclude that the last weeks of the year until the first week of the next year have really high seasonality. While, like the msts plot, we can see that after July (week 30), presumably the summer holiday, the ticket prices are starting to decrease with the lowest on week 44 (first week of November), then it bounce back up again breaking the negative seasonal.
There we go, thank you very much for taking the time to read my report. I hope it could be useful for you! -xn