In this project I explore attendance trends for popular Broadway musicals using data from The Broadway League. The dataset includes information about shows, theatres, dates, attendance, and gross revenue. I focus on four of the most-attended shows and how their attendance changed over the years. Data source: The Broadway League (https://www.broadwayleague.com)
## load libraries and dataset
#| message: false
#| warning: false
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.2
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 31296 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Date.Full, Show.Name, Show.Theatre, Show.Type
dbl (8): Date.Day, Date.Month, Date.Year, Statistics.Attendance, Statistics....
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# New data frame ranking shows by total attendance, from most to least attended
new_df <- Broadway |>
  group_by(Show.Name) |>
  summarize(attend = sum(Statistics.Attendance)) |>
  arrange(desc(attend))
head(new_df)
# A tibble: 6 × 2
Show.Name attend
<chr> <dbl>
1 The Lion King 13207871
2 The Phantom Of The Opera 11582362
3 Wicked 9524462
4 Chicago 8123328
5 Beauty And The Beast 7609397
6 Mamma Mia! 7566124
# Make the broadway5 dataset with only four of the most popular Broadway shows
broadway5 <- Broadway |>
  filter(Show.Name %in% c("The Lion King", "The Phantom Of The Opera", "Wicked", "Chicago"))
broadway5
# Group by show and year, then summarize total attendance for each show in each year
broadway_summary <- broadway5 |>
  group_by(Show.Name, Date.Year) |>
  # ignore missing values
  summarize(yearly_attendance = sum(Statistics.Attendance, na.rm = TRUE))
`summarise()` has grouped output by 'Show.Name'. You can override using the
`.groups` argument.
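The message above is informational: after `summarize()`, the result is still grouped by `Show.Name`. A minimal sketch of how the `.groups` argument changes that, using a small made-up tibble (the `toy` data and its values are hypothetical, not taken from the Broadway dataset):

```r
library(dplyr)

# Toy data (hypothetical values) that mirrors the column names used above
toy <- tibble(
  Show.Name = c("A", "A", "A", "B", "B"),
  Date.Year = c(2000, 2000, 2001, 2000, 2001),
  Statistics.Attendance = c(100, 150, 200, 250, 300)
)

toy_summary <- toy |>
  group_by(Show.Name, Date.Year) |>
  summarize(
    yearly_attendance = sum(Statistics.Attendance, na.rm = TRUE),
    .groups = "drop"  # return an ungrouped tibble and silence the message
  )

is_grouped_df(toy_summary)  # FALSE, because the grouping was dropped
```

Dropping the grouping is optional here. The model and plot work either way, but an ungrouped result avoids surprises if more `dplyr` steps are added later.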
# Check the summarized data
head(broadway_summary)
# A tibble: 6 × 3
# Groups: Show.Name [1]
Show.Name Date.Year yearly_attendance
<chr> <dbl> <dbl>
1 Chicago 1996 95959
2 Chicago 1997 603797
3 Chicago 1998 593108
4 Chicago 1999 499617
5 Chicago 2000 463149
6 Chicago 2001 421767
# Create a line plot displaying yearly attendance trends for each show
ggplot(broadway_summary, aes(x = Date.Year, y = yearly_attendance, color = Show.Name)) +
  geom_line(linewidth = 1) +  # line connecting each year's data points
  geom_point(size = 2.5) +  # add visible points for each year
  scale_color_brewer(palette = "Set3") +  # custom color palette for each show
  # Format the y-axis with comma-separated numbers instead of scientific notation; this line is from:
  # https://stackoverflow.com/questions/37713351/formatting-ggplot2-axis-labels-with-commas-and-k-mm-if-i-already-have-a-y-sc
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Annual Attendance Trends for Top Broadway Musicals",
    x = "Year",
    y = "Total Attendance",
    color = "Show",
    caption = "Source: The Broadway League"
  ) +
  theme_minimal(base_size = 10)  # theme with adjusted text size so the plot is easier to read
Build a multiple linear regression model predicting yearly attendance
# Turn off scientific notation for readability; this line is from:
# https://stackoverflow.com/questions/25946047/how-to-prevent-scientific-notation-in-r
options(scipen = 999)
rat1 <- lm(yearly_attendance ~ Date.Year + Show.Name, data = broadway_summary)
summary(rat1)
Call:
lm(formula = yearly_attendance ~ Date.Year + Show.Name, data = broadway_summary)
Residuals:
Min 1Q Median 3Q Max
-533484 -9639 25008 62831 201057
Coefficients:
                                  Estimate Std. Error t value       Pr(>|t|)
(Intercept)                        3934046    5017797   0.784          0.436
Date.Year                            -1768       2501  -0.707          0.482
Show.NameThe Lion King              274453      38565   7.117 0.000000000718 ***
Show.NameThe Phantom Of The Opera   164716      38072   4.326 0.000048676501 ***
Show.NameWicked                     299683      43457   6.896 0.000000001820 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 123400 on 71 degrees of freedom
Multiple R-squared: 0.4911, Adjusted R-squared: 0.4625
F-statistic: 17.13 on 4 and 71 DF, p-value: 0.000000000708
# Diagnostic plots for the model, as mentioned in the assignment rubric; I learned how to produce them from:
# https://www.statology.org/diagnostic-plots-in-r/
plot(rat1)
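Calling `plot()` on an `lm` object draws four plots by default: Residuals vs Fitted (checks for non-linearity), Normal Q-Q (checks whether the residuals look normally distributed), Scale-Location (checks for constant variance), and Residuals vs Leverage (flags influential observations). A minimal sketch of viewing all four at once, using a made-up model (`toy` and `toy_fit` are hypothetical stand-ins, not the Broadway model):

```r
# Toy data (made-up numbers), only to have a model to diagnose
set.seed(1)
toy <- data.frame(x = 1:30)
toy$y <- 5 + 2 * toy$x + rnorm(30, sd = 3)
toy_fit <- lm(y ~ x, data = toy)

# Show all four default diagnostic plots in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(toy_fit)
par(old_par)  # restore the previous plotting layout

# The same information is also available as numbers
toy_resid <- resid(toy_fit)           # residuals: observed minus fitted values
toy_cook <- cooks.distance(toy_fit)   # influence of each observation on the fit
which.max(toy_cook)                   # row with the largest influence
```

The same pattern should work with `rat1` in place of `toy_fit`. Rows with large residuals or a large Cook's distance are the ones worth inspecting in the original data.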
Interpretation of linear regression stats and model
You may notice that Chicago is missing from the coefficient table. This is because R treats one category as the baseline (reference level); by default it picks the first level alphabetically, which here is Chicago. The coefficients for the other shows estimate how much their yearly attendance differs from Chicago's in the same year. The p-values for The Lion King, Wicked, and The Phantom of the Opera are all far below 0.05, so for each of them we can reject the null hypothesis of no difference from Chicago: their attendance is significantly higher than the baseline's. The coefficient for Date.Year, in contrast, is not significant (p = 0.482), so the model finds no clear linear trend in attendance over time once the show is accounted for. The adjusted R-squared (about 0.46) means the model explains roughly 46% of the variation in yearly attendance, which is a moderate fit.
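To see how the baseline category works, here is a minimal sketch with made-up attendance numbers (the `toy` data frame is hypothetical; only the show names match the report). It shows the default alphabetical baseline and how `relevel()` can choose a different one:

```r
# Toy data (made-up numbers) with a character predictor, as with Show.Name
toy <- data.frame(
  show = c("Chicago", "Chicago", "Wicked", "Wicked", "The Lion King", "The Lion King"),
  attendance = c(10, 12, 20, 22, 30, 32)
)

# lm() converts `show` to a factor; levels sort alphabetically, so "Chicago"
# becomes the reference level and is absorbed into the intercept
fit_default <- lm(attendance ~ show, data = toy)
names(coef(fit_default))

# relevel() picks a different baseline, here "Wicked"
toy$show <- relevel(factor(toy$show), ref = "Wicked")
fit_releveled <- lm(attendance ~ show, data = toy)
names(coef(fit_releveled))
```

In the first model the intercept is Chicago's mean and each show coefficient is that show's difference from Chicago. Changing the baseline changes which comparisons the p-values test, but not the fitted values.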
Essay
To clean the dataset, I first filtered it to only the four most popular shows so the results would be easier to compare. I then grouped the data by show name and year and calculated total yearly attendance for each show, using na.rm = TRUE so that missing attendance values are ignored instead of turning a whole year's total into NA. I also turned off scientific notation to make the numbers easier to read and formatted the y-axis labels for clarity.
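A small illustration of what `na.rm = TRUE` does inside `sum()`, using made-up values rather than rows from the dataset:

```r
attendance <- c(100, NA, 300)   # made-up weekly values with one gap

total_with_na <- sum(attendance)                    # NA: one missing value makes the total missing
total_skipping_na <- sum(attendance, na.rm = TRUE)  # 400: only the NA is skipped

# Caveat: if every value is missing, the "total" is 0 rather than NA
total_all_missing <- sum(c(NA_real_, NA_real_), na.rm = TRUE)
```

Because of that last case, a year with no recorded attendance at all would show up as 0 in the summary instead of as a gap, which is worth keeping in mind when reading the plot.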
The visualization shows yearly attendance trends for the top Broadway musicals. From the plot, I noticed that “The Lion King” and “Wicked” had consistently higher attendance compared to “Chicago.” The lines make it easy to see how each show performed over time, and it was interesting that some shows maintained steady attendance while others declined slightly.
One thing I wanted to do but could not fully complete was to show the exact regression equation directly on the plot. I also considered including more shows, but I limited it to four to keep the graph clear. Overall, the model and visualization helped show how attendance patterns differed across popular Broadway shows. The weakest part is probably the diagnostic plots: I saw them in the rubric, but I was not sure how to read them correctly or how they relate to the data in the visualization.
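One possible way to get the regression equation onto a plot is to build it as a text label from the fitted coefficients. The sketch below uses a made-up one-predictor model (`toy`, `toy_fit`, and `eq_label` are hypothetical names); the model in this report has several show terms, so its equation would be longer.

```r
# Toy data (made-up numbers) with a single numeric predictor
set.seed(1)
toy <- data.frame(year = 2000:2009)
toy$attendance <- 500 - 3 * (toy$year - 2000) + rnorm(10)
toy_fit <- lm(attendance ~ year, data = toy)

# Pull out the coefficients and format them as an equation string;
# "%+.1f" prints the slope with an explicit + or - sign
b <- coef(toy_fit)
eq_label <- sprintf(
  "attendance = %.1f %+.1f * year",
  b[["(Intercept)"]], b[["year"]]
)
eq_label
```

The resulting string could then be passed to a ggplot2 text slot, for example `labs(subtitle = eq_label)`, or placed inside the panel with `annotate("text", ...)`.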