Project 1: Broadway

Author

E Choi

Introduction

In this project, I explore attendance trends for popular Broadway musicals using data from The Broadway League. The dataset includes information about shows, theatres, dates, attendance, and gross revenue. I focus on four of the most-attended shows and how their attendance changed over the years. Data source: The Broadway League (https://www.broadwayleague.com)

# Load libraries and dataset
#| message: false
#| warning: false
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(RColorBrewer)
setwd("C:/Users/enomc/OneDrive - montgomerycollege.edu/Documents/Data Science")
Broadway <- read_csv("broadway.csv")
Rows: 31296 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Date.Full, Show.Name, Show.Theatre, Show.Type
dbl (8): Date.Day, Date.Month, Date.Year, Statistics.Attendance, Statistics....

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# New data frame ranking shows by total attendance, from most to least attended
new_df <- Broadway |>
  group_by(Show.Name) |>
  summarize(attend = sum(Statistics.Attendance)) |>
  arrange(desc(attend))
head(new_df)
# A tibble: 6 × 2
  Show.Name                  attend
  <chr>                       <dbl>
1 The Lion King            13207871
2 The Phantom Of The Opera 11582362
3 Wicked                    9524462
4 Chicago                   8123328
5 Beauty And The Beast      7609397
6 Mamma Mia!                7566124
# Create broadway5, keeping only the four most-attended Broadway shows
broadway5 <- Broadway |>
  filter(Show.Name %in% c("The Lion King", "The Phantom Of The Opera", "Wicked", "Chicago"))
broadway5
# A tibble: 3,734 × 12
   Date.Day Date.Full Date.Month Date.Year Show.Name      Show.Theatre Show.Type
      <dbl> <chr>          <dbl>     <dbl> <chr>          <chr>        <chr>    
 1        2 6/2/1996           6      1996 The Phantom O… Majestic     Musical  
 2        9 6/9/1996           6      1996 The Phantom O… Majestic     Musical  
 3       16 6/16/1996          6      1996 The Phantom O… Majestic     Musical  
 4       23 6/23/1996          6      1996 The Phantom O… Majestic     Musical  
 5       30 6/30/1996          6      1996 The Phantom O… Majestic     Musical  
 6        7 7/7/1996           7      1996 The Phantom O… Majestic     Musical  
 7       14 7/14/1996          7      1996 The Phantom O… Majestic     Musical  
 8       21 7/21/1996          7      1996 The Phantom O… Majestic     Musical  
 9       28 7/28/1996          7      1996 The Phantom O… Majestic     Musical  
10        4 8/4/1996           8      1996 The Phantom O… Majestic     Musical  
# ℹ 3,724 more rows
# ℹ 5 more variables: Statistics.Attendance <dbl>, Statistics.Capacity <dbl>,
#   Statistics.Gross <dbl>, `Statistics.Gross Potential` <dbl>,
#   Statistics.Performances <dbl>
# Group by show and year, then total each show's attendance for that year
broadway_summary <- broadway5 |>
  group_by(Show.Name, Date.Year) |>
  summarize(
    # na.rm = TRUE skips missing attendance values when summing
    yearly_attendance = sum(Statistics.Attendance, na.rm = TRUE))
`summarise()` has grouped output by 'Show.Name'. You can override using the
`.groups` argument.
#Check the summarized data
head(broadway_summary)
# A tibble: 6 × 3
# Groups:   Show.Name [1]
  Show.Name Date.Year yearly_attendance
  <chr>         <dbl>             <dbl>
1 Chicago        1996             95959
2 Chicago        1997            603797
3 Chicago        1998            593108
4 Chicago        1999            499617
5 Chicago        2000            463149
6 Chicago        2001            421767
# Line plot of yearly attendance trends for each show
# Comma-formatted axis labels adapted from:
# https://stackoverflow.com/questions/37713351/formatting-ggplot2-axis-labels-with-commas-and-k-mm-if-i-already-have-a-y-sc
ggplot(broadway_summary, aes(x = Date.Year, y = yearly_attendance, color = Show.Name)) +
  geom_line(linewidth = 1) + # line connecting each year's data points
  geom_point(size = 2.5) + # visible point for each year
  scale_color_brewer(palette = "Set3") + # ColorBrewer palette, one color per show
  scale_y_continuous(labels = scales::comma) + # commas instead of scientific notation on the y-axis
  labs(
    title = "Annual Attendance Trends for Top Broadway Musicals",
    x = "Year",
    y = "Total Attendance",
    color = "Show",
    caption = "Source: The Broadway League"
  ) +
  theme_minimal(base_size = 10) # minimal theme with adjusted text size so the plot is easier to follow

Build a multiple linear regression model predicting yearly attendance

# Turn off scientific notation for readability. Source:
# https://stackoverflow.com/questions/25946047/how-to-prevent-scientific-notation-in-r
options(scipen = 999)
rat1 <- lm(yearly_attendance ~ Date.Year + Show.Name, data = broadway_summary)
summary(rat1)

Call:
lm(formula = yearly_attendance ~ Date.Year + Show.Name, data = broadway_summary)

Residuals:
    Min      1Q  Median      3Q     Max 
-533484   -9639   25008   62831  201057 

Coefficients:
                                  Estimate Std. Error t value       Pr(>|t|)    
(Intercept)                        3934046    5017797   0.784          0.436    
Date.Year                            -1768       2501  -0.707          0.482    
Show.NameThe Lion King              274453      38565   7.117 0.000000000718 ***
Show.NameThe Phantom Of The Opera   164716      38072   4.326 0.000048676501 ***
Show.NameWicked                     299683      43457   6.896 0.000000001820 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 123400 on 71 degrees of freedom
Multiple R-squared:  0.4911,    Adjusted R-squared:  0.4625 
F-statistic: 17.13 on 4 and 71 DF,  p-value: 0.000000000708
# Diagnostic plots for the model, as mentioned in the assignment rubric. Source:
# https://www.statology.org/diagnostic-plots-in-r/
plot(rat1)
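
A note on reading these plots: plot() on a fitted lm draws four diagnostic plots one after another (Residuals vs Fitted, Q-Q, Scale-Location, and Residuals vs Leverage). The sketch below is only an illustration, not the project's model. It assumes a simplified year-only fit on the six Chicago rows printed by head(broadway_summary), and the names chicago, fit, diagnostics, and worst_year are hypothetical. It puts the four plots on one page and rebuilds by hand the numbers behind the first plot.

```r
# Illustration only: the six Chicago rows shown by head(broadway_summary).
chicago <- data.frame(
  Date.Year = 1996:2001,
  yearly_attendance = c(95959, 603797, 593108, 499617, 463149, 421767)
)

# Simplified model (assumption): year only, not the project's rat1.
fit <- lm(yearly_attendance ~ Date.Year, data = chicago)

# Show all four diagnostic plots in a 2 x 2 grid instead of one at a time.
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)

# The first plot (Residuals vs Fitted) is drawn from these two columns.
diagnostics <- data.frame(
  Date.Year = chicago$Date.Year,
  fitted    = fitted(fit),
  residual  = resid(fit)
)
print(diagnostics)

# The year the model misses by the most (largest absolute residual).
worst_year <- diagnostics$Date.Year[which.max(abs(diagnostics$residual))]
print(worst_year)
```

In this small example the largest miss is 1996, where attendance (95,959) is far below the following years, which may reflect a partial first year. Points like that sit far from zero in the Residuals vs Fitted plot. The large negative minimum residual in summary(rat1) (-533,484) suggests the full model may contain a similar point.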

Interpretation of linear regression stats and model

You may notice that Chicago is missing from the coefficient table. This is because R treats one category of a categorical predictor as the baseline (reference level); by default it is the first level alphabetically, which here is Chicago. The coefficients for the other shows estimate how much their yearly attendance differs from Chicago's in the same year. For example, The Lion King is estimated to draw about 274,453 more attendees per year than Chicago. The p-values for The Lion King, Wicked, and The Phantom of the Opera are all far below 0.05, so for each of these shows we reject the null hypothesis that its attendance is the same as Chicago's (the baseline). The coefficient for Date.Year, by contrast, is not significant (p = 0.482), so the model finds no clear linear trend in attendance over time once the show is accounted for. The adjusted R-squared value (~0.46) means that about 46% of the variation in yearly attendance is explained by the model, which is a moderate fit.
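
As a worked illustration, the estimates printed by summary(rat1) can be written out as a single equation. The indicator symbols (I = 1 for that show, 0 otherwise) are shorthand introduced here, and the coefficients are rounded as printed, so the arithmetic is approximate.

```latex
\begin{aligned}
\widehat{\text{yearly\_attendance}} ={}& 3{,}934{,}046 - 1{,}768 \cdot \text{Date.Year} \\
 &+ 274{,}453 \cdot I_{\text{Lion King}}
  + 164{,}716 \cdot I_{\text{Phantom}}
  + 299{,}683 \cdot I_{\text{Wicked}} \\[6pt]
\text{Chicago, 2010:}\quad & 3{,}934{,}046 - 1{,}768 \cdot 2010 = 380{,}366 \\
\text{Wicked, 2010:}\quad & 380{,}366 + 299{,}683 = 680{,}049
\end{aligned}
```

For Chicago all three indicators are zero, which is why it has no row of its own in the coefficient table: its prediction comes from the intercept and the Date.Year term alone.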

Essay

To clean the dataset, I filtered it to include only the four most-attended shows so the results would be easier to compare. I then grouped the data by show name and year and calculated total yearly attendance for each show, using na.rm = TRUE so that missing attendance values are skipped rather than turning a whole year's total into NA. I also turned off scientific notation to make the numbers easier to read and formatted the y-axis labels for clarity.

The visualization shows yearly attendance trends for the top Broadway musicals. From the plot, I noticed that “The Lion King” and “Wicked” had consistently higher attendance compared to “Chicago.” The lines make it easy to see how each show performed over time, and it was interesting that some shows maintained steady attendance while others declined slightly.

One thing I wanted to do but could not fully complete was to show the exact regression equation directly on the plot. I also considered including more shows, but I decided to limit it to four so the graph stayed clear. Overall, the model and visualization helped show how Broadway attendance patterns differed across popular shows. One weaker part of the project is the diagnostic plots: I saw them in the rubric, but I was not sure how to use them correctly or how they relate to the data in the visualization.
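
One possible way to get the regression equation onto the plot is to build it as a text string from the model's coefficients. The sketch below is a minimal illustration under stated assumptions: equation_label is a hypothetical helper that is not part of the project, and the demo fits a simplified year-only model to the six Chicago rows printed by head(broadway_summary), so its numbers differ from rat1.

```r
# Hypothetical helper (not in the original project): turn the coefficients
# of a fitted lm() into a one-line equation string.
equation_label <- function(model, digits = 0) {
  coefs  <- coef(model)
  slopes <- coefs[-1]          # every term except the intercept
  terms  <- names(slopes)

  # Each slope becomes "+ b * term" or "- b * term".
  pieces <- paste(
    ifelse(slopes < 0, "-", "+"),
    formatC(abs(slopes), format = "f", digits = digits, big.mark = ","),
    "*", terms
  )

  # Response name, intercept, then the slope terms.
  paste(
    all.vars(formula(model))[1], "=",
    formatC(coefs[[1]], format = "f", digits = digits, big.mark = ","),
    paste(pieces, collapse = " ")
  )
}

# Illustration only: the six Chicago rows shown by head(broadway_summary).
chicago <- data.frame(
  Date.Year = 1996:2001,
  yearly_attendance = c(95959, 603797, 593108, 499617, 463149, 421767)
)
fit <- lm(yearly_attendance ~ Date.Year, data = chicago)

label <- equation_label(fit)
print(label)
```

With the project's model, the string could then be passed to the existing ggplot call, for example by adding subtitle = equation_label(rat1) inside labs(). The show terms would appear under the names R gives them, such as Show.NameWicked, so the text may need shortening to fit.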