Illustration by the author of the report.

1 Business Case

Greetings all! It’s Gissella Nadya, and today we are back with a project: forecasting the average ticket price for the musical Hamilton on Broadway! The data was obtained from Playbill.com. For a brief review, Hamilton is a Broadway musical that tells the story of America’s Founding Fathers. Since Broadway is still closed due to the global pandemic, let’s imagine what would happen if we used the price history to forecast the average ticket price to watch the musical. Okay, let’s get started!

2 Getting to know the Data

2.1 Installing Libraries

library(tidyverse)
library(padr)
library(forecast) # autoplot()
library(tseries) # adf.test()
library(MLmetrics) # MAE()
library(plotly) # plotly
library(glue) # glue text
library(tictoc)
library(gganimate)
library(astsa)
library(lubridate)

2.2 Reading the Data

bway <- read_csv("grosses.csv")
bway

3 Data Pre-processing

hamilton <- bway %>% 
  filter(show == "Hamilton" ) %>% 
  select(week_ending, weekly_gross, avg_ticket_price, seats_sold)
bway %>% 
  filter(show == "Hamilton" ) %>% 
  select(week_ending, week_number, avg_ticket_price)

3.1 Changing Data Types

We keep only the week_ending and avg_ticket_price columns, and convert week_ending to the Date type.

clean_ham <- hamilton %>% 
  select(week_ending, avg_ticket_price)  %>% 
  mutate(week_ending = as.Date(week_ending))

clean_ham 

3.2 Data Aggregation

range(clean_ham$week_ending)
#> [1] "2015-07-19" "2020-03-01"

3.3 Time Series Padding

The characteristics of a Time Series object must consist of:

  • an ordered structure for the period
  • no missing period, and
  • no missing value

Therefore, we need to do time series padding.

pad_ham <- clean_ham %>% 
  pad(start_val = head(clean_ham$week_ending, 1),
      end_val = tail(clean_ham$week_ending, 1))
pad_ham[is.na(pad_ham$avg_ticket_price),]

# Alternative: replace the missing weeks with 0 (not used here)
# pad_ham <- pad_ham %>% 
#    mutate(avg_ticket_price = replace_na(avg_ticket_price, 0))

# carry the previous week's price forward into the missing weeks
pad_ham <- pad_ham %>% 
   fill(avg_ticket_price, .direction = "down")

pad_ham %>% is.na() %>% colSums()
#>      week_ending avg_ticket_price 
#>                0                0
index <- seq(from = as.Date("2015-07-19"), to = as.Date("2020-03-01"), by = "week")

ham_xts <- xts::xts(pad_ham$avg_ticket_price, order.by = index, frequency = 52)
ham_xts %>%  autoplot() + ggtitle("Hamilton's Average Ticket Price per Week")

4 Seasonality Analysis

4.1 Decomposition Approach

To create a time series object, we use the ts() function and must specify the frequency of the dataset. Here we use 52 as the frequency because we want to analyze the data weekly: one year contains approximately 52 weeks, so each seasonal cycle to be inspected spans 52 observations.

head(week(ham_xts),1)
#> [1] 29
ham_ts <- ts(data = ham_xts, frequency = 52, start = c(2015, head(week(ham_xts),1)))

ham_ts %>% 
  autoplot()

ham_ts %>%
  decompose() %>% 
  autoplot()

The decomposition shows that the series contains both a trend and a seasonal component.

ham_msts <- pad_ham$avg_ticket_price %>% 
  msts(seasonal.periods = c(12, 52), start = c(2015, head(week(ham_xts),1))) # 12-week and 52-week (yearly) cycles

ham_msts %>% 
  mstl() %>%  
  autoplot()

5 Model Fitting and Evaluation

5.1 Cross Validation

freq_num <- 52 # 52
test_ham_msts <- tail(ham_msts, n = freq_num) 
train_ham_msts <- head(ham_msts, length(ham_msts) - length(test_ham_msts))

5.2 Model Fitting

  • \(H_0\) : data are non-stationary
  • \(H_1\) : data are stationary

res_adf <- train_ham_msts %>% 
  adf.test()
#> [1] "p-value = 0.80081601590178"
#> [1] "p-value > alpha, which means data are non-stationary"

We must use diff() in order to make our data stationary.

diff_D1 <- train_ham_msts %>% 
  diff(lag = 52) %>%
  adf.test()
#> [1] "p-value = 0.549881924195139"
#> [1] "p-value > alpha, which means data are non-stationary"
diff_D1_d1 <- train_ham_msts %>% 
  diff(lag = 52) %>%
  diff() %>% 
  adf.test()
#> [1] "p-value = 0.01"
#> [1] "p-value < alpha, which means data are non-stationary"

Using the Augmented Dickey-Fuller test, after one seasonal difference (lag = 52) plus one regular difference, we confirm that the data are stationary.
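As a cross-check, here is a minimal sketch using nsdiffs() and ndiffs() from the forecast package (assuming the msts object reports 52 as its dominant frequency):

# Sketch: let forecast estimate the required differencing orders
nsdiffs(train_ham_msts)                 # suggested seasonal differences (D)
ndiffs(diff(train_ham_msts, lag = 52))  # suggested regular differences (d) after D = 1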

train_ham_msts %>%
  diff(lag = 52) %>%
  diff() %>% 
  autoplot() +
  labs(title = "Autoplot for Train Dataset with Differencing") + 
  theme(plot.title = element_text(face = "bold", color = "#a16526"))

train_ham_msts %>% 
  diff(lag = 52) %>%
  diff() %>% 
  tsdisplay(lag.max = 104)

  • PACF: significant at lags 1, 2, 3, 4, 9, 23
  • ACF: cuts off after lag 1; also significant at lags 3, 5, 9, 24
  1. Since we have already differenced the data, \(d\) = 1 and \(D\) = 1
  2. On the PACF and ACF plots, lags 1 and 9 are significant, so \(p\) = 1 or 9 and \(q\) = 1 or 9
  3. On the PACF plot, lag 52 is significant but lag 104 is not, so \(P\) = 0 or 1
  4. On the ACF plot, lag 52 is significant but lag 104 is not, so \(Q\) = 0 or 1

To sum up,

\(p\) = 1 or 9, \(d\) = 1, \(q\) = 1 or 9, \(P\) = 0 or 1, \(D\) = 1, \(Q\) = 0 or 1

5.3 Building the Models

There are several models to choose from for time series analysis:

  1. Simple Moving Average (SMA) → no trend, no seasonality
  2. Exponential Smoothing
    • Simple Exponential Smoothing (SES) → no trend, no seasonality
    • Double Exponential Smoothing (Holt Exponential Smoothing) → trend, no seasonality
    • Triple Exponential Smoothing (Holt-Winters) → trend and seasonality
  3. Autoregressive Integrated Moving Average (ARIMA) → trend, no seasonality
  4. Seasonal ARIMA (SARIMA) → trend and seasonality

5.3.1 Holt-Winters Exponential Smoothing

We are using Holt-Winters Exponential Smoothing, also known as Triple Exponential Smoothing, because our data contain both trend and seasonality.

5.3.2 SARIMA

SARIMA, or Seasonal Autoregressive Integrated Moving Average, is the other alternative we are going to use. We choose SARIMA because there are both trend and seasonal factors in our dataset. Here we use the Arima(), auto.arima(), and stlm() functions.

5.3.3 STLM

STLM, or Seasonal and Trend decomposition using Loess Model, is well suited to multiple seasonality. Since our data are additive we can use the stlm() function directly; if the data were multiplicative, we would apply the log() function first (see the sketch below). The function can be used with both ARIMA and Exponential Smoothing (ETS).
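For illustration only, here is a minimal sketch of the multiplicative case (our data are additive, so this is not run; model_mult and forecast_mult are hypothetical names):

# Hypothetical sketch: log-transform a multiplicative series, then back-transform
model_mult    <- stlm(y = log(train_ham_msts), s.window = 52, method = "ets")
forecast_mult <- forecast(model_mult, h = freq_num)
forecast_mult$mean <- exp(forecast_mult$mean) # back to the original price scale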

5.3.4 TBATS

The other alternative model we may use is TBATS: Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components. A TBATS model can be fitted with the tbats() function.

tic()
model_hw1 <- HoltWinters(train_ham_msts, seasonal = "additive") # 0.2 , 0.04 , 0.4
model_hw2 <- HoltWinters(train_ham_msts, alpha = 0.5, beta = 0.01, gamma = 1, 
                         seasonal = "additive")
model_hw3 <- HoltWinters(train_ham_msts, alpha = 0.2, beta = 0.01, gamma = 0.08, 
                         seasonal = "additive") ##
model_arima_auto <- auto.arima(train_ham_msts)
model_arima1 <- stlm(y = train_ham_msts, s.window = 52, method = "arima") ##
model_arima2 <- stlm(y = train_ham_msts, method = "arima")
model_arima3 <- Arima(y = train_ham_msts, order = c(1,1,1), seasonal = c(0,1,1))
model_ets1 <- stlm(y = train_ham_msts, s.window = 52, method = "ets") ##
model_ets2 <- stlm(y = train_ham_msts, s.window = 12, method = "ets") ##
model_ets3 <- stlm(y = train_ham_msts, s.window = 12*4, method = "ets",robust = T)
model_ets4 <- stlm(y = train_ham_msts, s.window = 53, method = "ets")
model_tbats <- tbats(train_ham_msts, use.box.cox = F,seasonal.periods = 52)
toc()
#> 44.614 sec elapsed
model_sarima1 <- sarima.for(train_ham_msts, n.ahead = freq_num, 
                            p=9, d=1, q=1, 
                            P=0, D=1, Q=0,  S=53) #53

model_sarima2 <- sarima.for(train_ham_msts, n.ahead = freq_num, 
                            p=0, d=1, q=1, 
                            P=0, D=1, Q=0,  S=52) # 52

model_sarima3 <- sarima.for(train_ham_msts, n.ahead = freq_num, 
                            p=0, d=1, q=1, # avoid q = 3
                            P=0, D=1, Q=0,  S=53)

5.4 Forecasting

tic()
forecast_hw1 <- forecast(model_hw1, h = freq_num)
forecast_hw2 <- forecast(model_hw2, h = freq_num)
forecast_hw3 <- forecast(model_hw3, h = freq_num)
forecast_arima_auto <- forecast(model_arima_auto, h = freq_num)
forecast_arima1 <- forecast(model_arima1, h = freq_num)
forecast_arima2 <- forecast(model_arima2, h = freq_num)
forecast_arima3 <- forecast(model_arima3, h = freq_num)
forecast_ets1 <- forecast(model_ets1, h = freq_num)
forecast_ets2 <- forecast(model_ets2, h = freq_num)
forecast_ets3 <- forecast(model_ets3, h = freq_num)
forecast_ets4 <- forecast(model_ets4, h = freq_num)
forecast_tbats <- forecast(model_tbats, h = freq_num)
toc()
#> 1.418 sec elapsed

5.5 Accuracy

To evaluate the models we could use the accuracy() function from the forecast library to obtain the MAE, MAPE, and RMSE. Because we would like to compare the values of the three metrics across models, we instead use the MAE(), MAPE(), and RMSE() functions, supplying the y_pred and y_true parameters, and collect the results into a data frame for better visualization.

To evaluate the error, we look at the Mean Absolute Error (MAE), the Mean Absolute Percentage Error (MAPE), and the Root Mean Square Error (RMSE). After compiling the metrics, we can compare the models side by side.
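Here is a minimal sketch of how comp_MAPE_1 (used below) could be built, assuming the forecast objects created earlier; the model labels are illustrative:

# Sketch: one row of error metrics per model (labels are illustrative assumptions)
fc_list <- list("HW 1" = forecast_hw1$mean, "HW 2" = forecast_hw2$mean,
                "HW 3" = forecast_hw3$mean, "Arima Auto" = forecast_arima_auto$mean,
                "Arima 1" = forecast_arima1$mean, "Arima 2" = forecast_arima2$mean,
                "Arima 3" = forecast_arima3$mean, "ETS 1" = forecast_ets1$mean,
                "ETS 2" = forecast_ets2$mean, "ETS 3" = forecast_ets3$mean,
                "ETS 4" = forecast_ets4$mean, "TBATS" = forecast_tbats$mean,
                "SARIMA 1" = model_sarima1$pred, "SARIMA 2" = model_sarima2$pred,
                "SARIMA 3" = model_sarima3$pred)

comp_MAPE_1 <- data.frame(
  model = names(fc_list),
  MAE   = sapply(fc_list, function(f) MAE(y_pred = f, y_true = test_ham_msts)),
  MAPE  = sapply(fc_list, function(f) MAPE(y_pred = f, y_true = test_ham_msts) * 100), # in percent
  RMSE  = sapply(fc_list, function(f) RMSE(y_pred = f, y_true = test_ham_msts))
)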

5.6 Visualization

5.6.1 All Models

model_all_1 <- train_ham_msts %>% 
  tail(n = 52) %>%
  autoplot() +
  autolayer(test_ham_msts, series = "Data Test") +
  autolayer(forecast_hw1$mean, series = "forecast HW1") +
  autolayer(forecast_hw2$mean, series = "forecast HW2") +
  autolayer(forecast_hw3$mean, series = "forecast HW3") +
  autolayer(forecast_arima_auto$mean, series = "forecast Arima Auto") +
  autolayer(forecast_arima1$mean, series = "forecast Arima 1") +
  autolayer(forecast_arima2$mean, series = "forecast Arima 2") +
  autolayer(forecast_arima3$mean, series = "forecast Arima 3") +
  autolayer(forecast_ets1$mean, series = "forecast ETS 1") +
  autolayer(forecast_ets2$mean, series = "forecast ETS 2") +
  autolayer(forecast_ets3$mean, series = "forecast ETS 3") +
  autolayer(forecast_ets4$mean, series = "forecast ETS 4") +
  autolayer(model_sarima1$pred, series = "forecast SARIMA 1") +
  autolayer(model_sarima2$pred, series = "forecast SARIMA 2") +
  autolayer(model_sarima3$pred, series = "forecast SARIMA 3") +
  autolayer(forecast_tbats$mean, series = "forecast TBATS") +
  labs(title = "Visualization of Data Forecast - Data Test",
       subtitle = "For Hamilton's Average Ticket Price") +
  theme(plot.title = element_text(face = "bold"))

model_all_1 %>%  
  ggplotly() %>% 
  layout(title = list(text = paste0("<b>",
                                    "Visualization of Data Forecast - Data Test",
                                    "</b>",
                                    "<br>", #line break
                                    "<sup>", #superscript
                                    "For Hamilton's Average Ticket Price",
                                    "</sup>")))
comp_MAPE_1 %>% 
  arrange(MAPE)

5.6.2 Lowest MAPE Model

Now, we are going to visualize the top models, i.e., those with a MAPE below 10% (see the sketch below).
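Exactly which models fall below 10% depends on the table above; as a sketch, assuming the ETS models and TBATS qualify:

# Sketch: overlay the assumed low-MAPE forecasts on the test window
low_mape_plot <- train_ham_msts %>% 
  tail(n = 52) %>%
  autoplot() +
  autolayer(test_ham_msts, series = "Data Test") +
  autolayer(forecast_ets1$mean, series = "forecast ETS 1") +
  autolayer(forecast_ets2$mean, series = "forecast ETS 2") +
  autolayer(forecast_tbats$mean, series = "forecast TBATS") +
  labs(title = "Lowest MAPE Models - Data Test",
       subtitle = "For Hamilton's Average Ticket Price")
low_mape_plot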

As we can see here, the forecasts follow almost the same pattern as the actual data. If we look at January 2020, the predictions are almost perfect.

Because all of the ETS models have almost the same MAPE (around 5.67%), we move on to the other models.

6 Assumption Checking

Every model must pass the assumption checks; for time series models specifically, we check the no-autocorrelation assumption and the normality of the residuals.

6.1 Autocorrelation assumption

We may use the Ljung-Box test via the Box.test() function with "Ljung-Box" as the type parameter.

a <- Box.test(forecast_hw1$residuals, type = "Ljung-Box")
b <- Box.test(forecast_hw2$residuals, type = "Ljung-Box")
c <- Box.test(forecast_hw3$residuals, type = "Ljung-Box")
d <- Box.test(forecast_arima_auto$residuals, type = "Ljung-Box")
e <- Box.test(forecast_arima1$residuals, type = "Ljung-Box")
f <- Box.test(forecast_arima2$residuals, type = "Ljung-Box")
g <- Box.test(forecast_arima3$residuals, type = "Ljung-Box")
i <- Box.test(forecast_ets1$residuals, type = "Ljung-Box")
j <- Box.test(forecast_ets2$residuals, type = "Ljung-Box")
k <- Box.test(forecast_ets3$residuals, type = "Ljung-Box")
l <- Box.test(forecast_ets4$residuals, type = "Ljung-Box")
# sarima.for() returns only $pred and $se (no residuals), so these tests
# are applied to the forecast series itself
m <- Box.test(model_sarima1$pred, type = "Ljung-Box")
n <- Box.test(model_sarima2$pred, type = "Ljung-Box")
o <- Box.test(model_sarima3$pred, type = "Ljung-Box")
q <- Box.test(forecast_tbats$residuals, type = "Ljung-Box")
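# Sketch (assumption): collect the Ljung-Box p-values into model_assump1
model_assump1 <- data.frame(
  model = c("HW 1", "HW 2", "HW 3", "Arima Auto", "Arima 1", "Arima 2", "Arima 3",
            "ETS 1", "ETS 2", "ETS 3", "ETS 4",
            "SARIMA 1", "SARIMA 2", "SARIMA 3", "TBATS"),
  p_value = c(a$p.value, b$p.value, c$p.value, d$p.value, e$p.value, f$p.value,
              g$p.value, i$p.value, j$p.value, k$p.value, l$p.value,
              m$p.value, n$p.value, o$p.value, q$p.value)
)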
model_assump1 %>% 
  mutate(autocorrelation = case_when(p_value >= 0.05 ~ "failed to reject H0",
                                     p_value < 0.05 ~ "reject H0"))
  • \(H_0\) : residuals have no autocorrelation
  • \(H_1\) : residuals have autocorrelation

From the results we can conclude that all the models' residuals show no autocorrelation, except for the three SARIMA models, whose p-values fall below alpha (for those we tested the forecast series, since sarima.for() does not return residuals).

6.2 Normality of Residuals

To test the normality of the residuals, we may use the Shapiro-Wilk normality test via the shapiro.test() function.

aa <- shapiro.test(forecast_hw1$residuals)
ab <- shapiro.test(forecast_hw2$residuals)
ac <- shapiro.test(forecast_hw3$residuals)
ad <- shapiro.test(forecast_arima_auto$residuals)
ae <- shapiro.test(forecast_arima1$residuals)
af <- shapiro.test(forecast_arima2$residuals)
ag <- shapiro.test(forecast_arima3$residuals)
ai <- shapiro.test(forecast_ets1$residuals)
aj <- shapiro.test(forecast_ets2$residuals)
ak <- shapiro.test(forecast_ets3$residuals)
al <- shapiro.test(forecast_ets4$residuals)
# again, sarima.for() exposes no residuals, so we test the forecast series
am <- shapiro.test(model_sarima1$pred)
an <- shapiro.test(model_sarima2$pred)
ao <- shapiro.test(model_sarima3$pred)
aq <- shapiro.test(forecast_tbats$residuals)
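# Sketch (assumption): collect the Shapiro-Wilk p-values into model_normal
model_normal <- data.frame(
  model = c("HW 1", "HW 2", "HW 3", "Arima Auto", "Arima 1", "Arima 2", "Arima 3",
            "ETS 1", "ETS 2", "ETS 3", "ETS 4",
            "SARIMA 1", "SARIMA 2", "SARIMA 3", "TBATS"),
  p_value = c(aa$p.value, ab$p.value, ac$p.value, ad$p.value, ae$p.value, af$p.value,
              ag$p.value, ai$p.value, aj$p.value, ak$p.value, al$p.value,
              am$p.value, an$p.value, ao$p.value, aq$p.value)
)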
model_normal %>% 
  mutate(normality = case_when(p_value >= 0.05 ~ "failed to reject H0",
                                     p_value < 0.05 ~ "reject H0"))
  • \(H_0\): residuals are normally distributed
  • \(H_1\): residuals are NOT normally distributed

We can conclude that none of the models' residuals are normally distributed.

7 Predicting the “next year”

We all know that Broadway, as of April 2021, is still closed, but it can't hurt to look at what might have been if there were no global pandemic, right?

ham_msts
#> Multi-Seasonal Time Series:
#> Start: 2015 29
#> Seasonal Periods: 12 52
#> Data:
#>   [1] 138.94 140.21 140.39 117.99 136.31 136.04 144.62 158.52 145.91 144.59
#>  [11] 146.33 138.36 156.58 137.94 139.08 148.71 147.07 148.66 135.14 170.77
#>  [21] 150.36 152.85 154.60 154.60 182.39 159.23 164.55 161.84 161.21 164.81
#>  [31] 166.69 162.79 163.17 164.28 163.50 161.82 159.89 169.51 168.65 156.41
#>  [41] 168.61 169.12 170.49 157.06 164.09 178.38 172.46 188.60 188.51 187.03
#>  [51] 188.36 190.95 189.35 190.32 189.85 191.79 190.14 192.07 191.02 194.55
#>  [61] 199.97 200.75 199.95 202.23 205.87 201.21 185.70 206.03 207.56 228.04
#>  [71] 228.30 303.21 208.09 227.23 208.65 307.16 310.13 228.43 227.90 227.98
#>  [81] 229.17 298.92 270.54 267.10 291.37 264.10 292.48 266.89 287.86 262.17
#>  [91] 276.99 289.94 291.07 259.59 292.14 263.01 291.36 257.05 284.07 255.42
#> [101] 282.61 282.96 284.00 284.61 280.79 282.01 280.78 283.28 282.65 276.16
#> [111] 275.52 277.78 274.70 272.72 272.59 273.54 272.24 272.55 271.20 248.26
#> [121] 272.39 292.14 263.31 321.13 264.33 271.04 263.37 353.06 358.46 291.24
#> [131] 288.61 287.93 286.99 259.81 284.82 287.10 287.98 285.46 292.88 292.93
#> [141] 286.67 268.47 292.06 291.87 286.51 262.97 288.31 262.30 290.20 278.35
#> [151] 290.46 265.29 291.16 291.27 289.99 295.08 289.13 292.34 293.13 292.95
#> [161] 292.08 289.72 302.18 302.16 298.36 297.70 301.05 301.89 299.98 301.77
#> [171] 301.99 274.34 263.15 274.24 273.29 354.10 274.54 274.45 302.60 335.92
#> [181] 375.39 310.89 293.66 299.15 292.75 290.87 288.96 274.39 292.88 286.94
#> [191] 296.48 297.12 274.67 301.00 273.92 296.77 306.95 301.56 296.45 291.92
#> [201] 299.82 272.29 293.20 267.69 295.98 295.58 298.52 287.56 284.12 288.37
#> [211] 285.26 285.07 282.72 275.89 273.80 280.16 270.71 270.11 276.56 278.07
#> [221] 278.35 287.48 259.75 275.81 224.35 287.06 257.72 247.22 314.22 284.53
#> [231] 258.05 297.87 339.03 296.96 256.99 253.26 248.70 248.89 250.34 237.78
#> [241] 257.03 250.67
tic()
final_model_ets1 <- stlm(y = ham_msts, s.window = 52, method = "ets") 
final_model_tbats <- tbats(ham_msts, use.box.cox = F,seasonal.periods = 52)
final_model_hw1 <- HoltWinters(ham_msts, seasonal = "additive")
final_forecast_model_ets1 <- forecast(final_model_ets1, h = freq_num)
final_forecast_model_tbats <- forecast(final_model_tbats, h = freq_num)
final_forecast_model_hw1 <- forecast(final_model_hw1, h = freq_num)
toc()
#> 12.858 sec elapsed
final_model_plot <- ham_msts %>%
  autoplot() +
  autolayer(final_forecast_model_ets1$mean, series = "forecast ETS1") +
  autolayer(final_forecast_model_tbats$mean, series = "forecast TBATS") +
  autolayer(final_forecast_model_hw1$mean, series = "forecast HW1") +
  labs(title = "Visualization of Data Forecast - Data Test",
       subtitle = "For Hamilton's Average Ticket Price") +
  theme(plot.title = element_text(face = "bold"))

final_model_plot %>%  
  ggplotly() %>% 
  layout(title = list(text = paste0("<b>",
                                    "Visualization of Data Forecast",
                                    "</b>",
                                    "<br>", #line break
                                    "<sup>", #superscript
                                    "For Hamilton's Average Ticket Price",
                                    "</sup>")))

Here we can see that the models predict that, had the show continued as it was when the last data were taken (March 1, 2020), the average ticket price would likely decrease. While that is good news for us theatregoers, the production could "spice things up", for example by bringing in famous actors to play the lead or running a marketing push, so that demand rises and the average ticket price climbs again.

8 Interpreting the Seasonality Analysis

Now that we know the best models, let's interpret the seasonality.

ham_ts_decompose <-  ham_ts %>% 
  decompose()
ham_msts_mstl <-  ham_msts %>% 
  mstl() %>%  # decompose for msts
  as.data.frame()

8.1 Weekly

weekly_tickets <-  matrix(ham_ts_decompose$seasonal, 
                          ncol = 52, 
                          byrow = T)

week_plot <- data.frame(seasonal = weekly_tickets[1, ],
                        week = 1:52
                        ) %>% 
  distinct(week, seasonal) %>% 
  mutate(rounded_seasonal = round(seasonal)) %>% 
  ggplot(mapping = aes(x = week, y = seasonal,
                       text = glue("Week: {week}
                                   Seasonal: {rounded_seasonal}"))) + 
  geom_col(aes(fill = seasonal)) + 
  # theme_minimal() +
  scale_fill_gradient2(low = "#404040", mid = "#f6c049", high = "#a16526") + # #f6c049 
  scale_x_continuous(breaks = seq(1,52,3)) +
  labs(title = "The Seasonality Analysis",
       subtitle = "Based on the Weekly data") +
  theme(legend.position = "bottom")

week_plot %>% 
  ggplotly(tooltip = "text") %>% 
  layout(title = list(text = paste0("<b>",
                                    "The Seasonality Analysis of Hamilton's Average Ticket Price",
                                    "</b>",
                                    "<br>", #line break
                                    "<sup>", #superscript
                                    "Based on the Weekly data",
                                    "</sup>")))

As we can see here, our single-seasonality time series could not grasp the weeks correctly: as seen in ham_ts, January is supposed to be the high season, and prices should be decreasing around the fall, from week 31 until around week 40.

Let's see how the multiple-seasonality plot compares.

8.2 Monthly

Here is the ham_msts version for comparison.

monthly_tickets <- ham_msts_mstl %>% 
  mutate(week_ending = pad_ham$week_ending,
         Week = week(pad_ham$week_ending),
         Month = month(pad_ham$week_ending, label = T, abbr = T)) %>% 
  group_by(Week, Month) %>% 
  summarise(seasonal = mean(Seasonal12 + Seasonal52))

monthly_plot <- monthly_tickets %>% 
ggplot() +
  geom_bar(aes(x = Week, 
               y = seasonal, 
               fill = Month),
           stat ="identity", 
           position = "dodge", 
           width = 0.7) +
  scale_x_continuous(breaks = seq(1,53,4)) +
  geom_hline(yintercept = 0, col = "navy", lty = 1, lwd = 0.3) +
  theme_minimal() +
  # scale_fill_gradient(low = "#000000", high = "#a16526") +
  labs(title = "The Multi-Seasonality Analysis on Hamilton's Average Ticket Price",
       subtitle = "Based on the Weekly and Monthly Data") +
  theme(legend.position = "right") 

monthly_plot %>% 
  ggplotly() %>% 
  layout(title = list(text = paste0("<b>",
                                    "The Multi-Seasonality Analysis on Hamilton's Average Ticket Price",
                                    "</b>",
                                    "<br>", #line break
                                    "<sup>", #superscript
                                    "Based on the Weekly and Monthly Data",
                                    "</sup>")))

As we can see, the msts object captures the seasonality better. Here we can conclude that the last weeks of the year, through the first week of the next year, have very high seasonality. We can also see that after July (around week 30), presumably after the summer holiday, ticket prices start to decrease, hitting their lowest at week 44 (the first week of November), before bouncing back up out of the negative seasonal range.

There we go, thank you very much for taking the time to read my report. I hope it could be useful for you! -xn