The topic of the data, any variables included, what kind of variables they are, where the data came from and how you cleaned it up (be detailed and specific, using proper terminology where appropriate).
Data Description
For this project I used the CORGIS Broadway Dataset. This library holds data about Broadway shows, grouped over weeklong periods. Only shows that reported capacity were included, so the dataset includes the years 1999 through 2016. The dataset is made available by the Broadway League (the national trade association for the Broadway industry), and you can view the data online at http://www.broadwayleague.com/.
The datasets contains the following information:
Date.Day - (Integer) The day of the month that this performance’s week ended on Date.Full - (String) The full date representation that this performance’s week ended in “Month/Day/Year” format Date.Month = (Integer) The numeric month that this performance’s week on in (1 = January, 2 = February, etc) Date.Year - (Integer) The year that this week of performances occurred in Show.Name - (String) The name of the production Show.Theatre - (String) The name of the theatre Show.Type - (String) Whether it is a “Musical”, “Play”, or “Special”. Statistics.Attendance - (Integer) The total number of people who attended performance over the week Statistics.Capacity - (Integer) The percentage of the theatre that was filled during that week Statistics.Gross - (Integer) The “Gross Gross” of this performance or how much of it made in total across the entire week. Measured in dollars Statistics.Gross Potential - (Integer) The Gross Potential is the maximum amount an engagement can possibly earn based on calculations involving ticket prices, seating capacity, and thenumber of performances. The number is expressed as a percentage of what could have been achieved (Gross Gross/Gross Potential).
Data Shape number of observations (rows): 31,296 number of variables (columns): 12
Problem Statement: Examining the number of plays, musicals, and special production on Broadway and which is the largest and most profitable.
Cleaning the data: While the dataset was prepared by CORGIS and did not have many issues, it was necessary to clean the data to create column names that were simplier and in lower case. In addition, the data is organized in weekly format which required me to aggregrate gross revenues.
Why I chose this topic and dataset – what meaning does it have for you?
I enjoy Broadway shows and for that reason I chose this dataset.I am also interested in the length of time shows remain on broadway and how much money is being made through revenue(ticket sales). I was also interested in learning about the number of broadway revivals.
Incorporate brief background research about this topic. This background information will include information you find in an article, website, or book. Please source this background information within the essay or if you have multiple sources, include a bibliography.
The Broadway League developed a website that includes research reports and other data that captures the economic impact of broadway, research on demographics of ticket purchasers and attendees of shows, programs to support the performing arts, and touring shows that originate on broadway. Below is a link the Broadway League’s website. https://www.broadwayleague.com/research/research-reports/
#loading the libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
#get the working directory
getwd()
## [1] "/Users/grayce/Desktop"
#set working directory
setwd("/Users/grayce/Desktop")
#read working directory
Broadway <- read_csv("Broadway.csv")
## Rows: 31296 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Date.Full, Show.Name, Show.Theatre, Show.Type
## dbl (8): Date.Day, Date.Month, Date.Year, Statistics.Attendance, Statistics....
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#clean thedata to remove NA's and spaces
Broadway <- Broadway
Broadway [Broadway == ""] <- NA
colSums(is.na(Broadway))
## Date.Day Date.Full
## 0 0
## Date.Month Date.Year
## 0 0
## Show.Name Show.Theatre
## 0 0
## Show.Type Statistics.Attendance
## 0 0
## Statistics.Capacity Statistics.Gross
## 0 0
## Statistics.Gross Potential Statistics.Performances
## 0 0
#summary of the dataset
summary(Broadway)
## Date.Day Date.Full Date.Month Date.Year
## Min. : 1.00 Length:31296 Min. : 1.000 Min. :1990
## 1st Qu.: 8.00 Class :character 1st Qu.: 4.000 1st Qu.:2001
## Median :16.00 Mode :character Median : 6.000 Median :2006
## Mean :15.72 Mean : 6.509 Mean :2006
## 3rd Qu.:23.00 3rd Qu.:10.000 3rd Qu.:2011
## Max. :31.00 Max. :12.000 Max. :2016
## Show.Name Show.Theatre Show.Type Statistics.Attendance
## Length:31296 Length:31296 Length:31296 Min. : 24
## Class :character Class :character Class :character 1st Qu.: 5508
## Mode :character Mode :character Mode :character Median : 7802
## Mean : 8011
## 3rd Qu.:10348
## Max. :21631
## Statistics.Capacity Statistics.Gross Statistics.Gross Potential
## Min. : 10.00 Min. : 1037 Min. : 0.0
## 1st Qu.: 70.00 1st Qu.: 320108 1st Qu.: 45.0
## Median : 83.00 Median : 514520 Median : 63.0
## Mean : 80.74 Mean : 596069 Mean : 62.4
## 3rd Qu.: 95.00 3rd Qu.: 785269 3rd Qu.: 82.0
## Max. :892.00 Max. :3201333 Max. :226.0
## Statistics.Performances
## Min. : 0.000
## 1st Qu.: 8.000
## Median : 8.000
## Mean : 7.225
## 3rd Qu.: 8.000
## Max. :16.000
#clean data - making all headers lowercase and remove spaces
names(Broadway) <- tolower(names(Broadway))
names(Broadway) <- gsub(" ","",names(Broadway))
str(Broadway)
## spc_tbl_ [31,296 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ date.day : num [1:31296] 26 24 31 7 14 21 28 5 12 19 ...
## $ date.full : chr [1:31296] "8/26/1990" "3/24/1991" "3/31/1991" "4/7/1991" ...
## $ date.month : num [1:31296] 8 3 3 4 4 4 4 5 5 5 ...
## $ date.year : num [1:31296] 1990 1991 1991 1991 1991 ...
## $ show.name : chr [1:31296] "Tru" "Miss Saigon" "Miss Saigon" "Miss Saigon" ...
## $ show.theatre : chr [1:31296] "Booth" "Broadway" "Broadway" "Broadway" ...
## $ show.type : chr [1:31296] "Play" "Musical" "Musical" "Musical" ...
## $ statistics.attendance : num [1:31296] 5500 1737 12160 13921 10973 ...
## $ statistics.capacity : num [1:31296] 88 100 100 100 90 101 101 101 100 99 ...
## $ statistics.gross : num [1:31296] 134456 100647 634424 713353 573981 ...
## $ statistics.grosspotential: num [1:31296] 0 0 0 0 0 0 0 0 0 0 ...
## $ statistics.performances : num [1:31296] 8 0 0 0 4 8 8 8 8 8 ...
## - attr(*, "spec")=
## .. cols(
## .. Date.Day = col_double(),
## .. Date.Full = col_character(),
## .. Date.Month = col_double(),
## .. Date.Year = col_double(),
## .. Show.Name = col_character(),
## .. Show.Theatre = col_character(),
## .. Show.Type = col_character(),
## .. Statistics.Attendance = col_double(),
## .. Statistics.Capacity = col_double(),
## .. Statistics.Gross = col_double(),
## .. `Statistics.Gross Potential` = col_double(),
## .. Statistics.Performances = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# get column names
colnames(Broadway)
## [1] "date.day" "date.full"
## [3] "date.month" "date.year"
## [5] "show.name" "show.theatre"
## [7] "show.type" "statistics.attendance"
## [9] "statistics.capacity" "statistics.gross"
## [11] "statistics.grosspotential" "statistics.performances"
# Rename column for more clarity
names(Broadway)[names(Broadway) == "date.year"] <- "year"
names(Broadway)[names(Broadway) == "show.name"] <- "show"
names(Broadway)[names(Broadway) == "show.theatre"] <- "theatre"
names(Broadway)[names(Broadway) == "show.type"] <- "type"
names(Broadway)[names(Broadway) == "statistics.attendance"] <- "attendance"
names(Broadway)[names(Broadway) == "statistics.gross"] <- "gross"
names(Broadway)[names(Broadway) == "statistics.performances"] <-"performances"
names(Broadway)[names(Broadway) == "statistics.capacity"] <- "capacity"
names(Broadway)[names(Broadway) == "date.full"] <- "week"
#names(my_data)[names(my_data) == "Sepal.Width"] <- "sepal_width"
Broadway
## # A tibble: 31,296 × 12
## date.day week date.month year show theatre type attendance capacity
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 26 8/26/1990 8 1990 Tru Booth Play 5500 88
## 2 24 3/24/1991 3 1991 Miss S… Broadw… Musi… 1737 100
## 3 31 3/31/1991 3 1991 Miss S… Broadw… Musi… 12160 100
## 4 7 4/7/1991 4 1991 Miss S… Broadw… Musi… 13921 100
## 5 14 4/14/1991 4 1991 Miss S… Broadw… Musi… 10973 90
## 6 21 4/21/1991 4 1991 Miss S… Broadw… Musi… 14076 101
## 7 28 4/28/1991 4 1991 Miss S… Broadw… Musi… 14065 101
## 8 5 5/5/1991 5 1991 Miss S… Broadw… Musi… 14064 101
## 9 12 5/12/1991 5 1991 Miss S… Broadw… Musi… 13896 100
## 10 19 5/19/1991 5 1991 Miss S… Broadw… Musi… 13738 99
## # ℹ 31,286 more rows
## # ℹ 3 more variables: gross <dbl>, statistics.grosspotential <dbl>,
## # performances <dbl>
#check the dimensions of the new dataframe
dim(Broadway)
## [1] 31296 12
#table showing data by type of show
table(Broadway$type)
##
## Musical Play Special
## 22551 8406 339
#load library
library(ggplot2)
#facet wrap showing revenue by type of show from 1990 through 2016
plot10 <- Broadway %>%
group_by(year, type) %>%
summarize(gross = sum(gross)) %>%
ggplot(aes(x = year, y = gross, group = type, fill = type)) +
geom_bar(position = "dodge", stat = "identity") +
facet_wrap(~ type, nrow = 10) +
labs(title = "Broadway Revenue by Type & Year (1990 - 2016)", x = "Gross Revenue", y = "Gross Revenue") +
scale_y_continuous("Gross Revenue",
breaks = scales::breaks_extended(8),
labels = scales::label_dollar()) +
scale_x_continuous("Year",
breaks = scales::breaks_extended(6),
labels = scales::label_number_auto())
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
plot10
#interactivity
ggplotly(plot10)
#I ran the same data set but using geom_jitter and a smoother to show the linear regression. There were no real suprises as the lines through each type of show were linear with very few
plot2 <- Broadway %>%
group_by(year, type) %>%
summarize(gross = sum(gross)) %>%
ggplot(aes(x = year, y = gross, group = type, fill = type)) +
geom_jitter(aes(colour = type)) +
geom_smooth(method='lm', lwd = 0.5) +
xlab("Year") +
ylab("Gross Revenue") +
ggtitle("Broadway Shows - 1990 - 2016")
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
plot2
## `geom_smooth()` using formula = 'y ~ x'
#I then filtered for the highest gross across the data set by week. The answer is below through filtering - Wicked - the last week of 2013. The gross was $3,201,333
highestgross <- Broadway %>%
filter(gross == max(gross))
highestgross
## # A tibble: 1 × 12
## date.day week date.month year show theatre type attendance capacity gross
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 29 12/2… 12 2013 Wick… Gershw… Musi… 17352 100 3.20e6
## # ℹ 2 more variables: statistics.grosspotential <dbl>, performances <dbl>
#then I looked at the highest grossing year by type of show. The winners were: Musicials in 2014 and Plays and Specials in 2015.
aggregategross <- Broadway %>%
group_by(type, year) %>% summarise(gross = sum(gross)) %>%
filter(gross == max(gross))
## `summarise()` has grouped output by 'type'. You can override using the
## `.groups` argument.
aggregategross
## # A tibble: 3 × 3
## # Groups: type [3]
## type year gross
## <chr> <dbl> <dbl>
## 1 Musical 2014 1140640907
## 2 Play 2015 230404143
## 3 Special 2015 18588487
#After looking at the above facet chart, I noticed that 2014 musicals had the highest gross. I created a new datafram (newdf - 1630 obs. and 5 variables). I filtered by year (2014) and then removed columns that I did not need. I want to focus on 2014 shows.
newdf <- Broadway[ -c(1,3,6,8,9,11,12) ] %>%
filter(year == 2014)
tibble(newdf)
## # A tibble: 1,630 × 5
## week year show type gross
## <chr> <dbl> <chr> <chr> <dbl>
## 1 1/5/2014 2014 A Gentleman'S Guide To Love And Murder Musical 544066
## 2 1/5/2014 2014 A Night With Janis Joplin Musical 369944
## 3 1/5/2014 2014 After Midnight Musical 579804
## 4 1/5/2014 2014 Annie '12 Musical 1387296
## 5 1/5/2014 2014 Beautiful Musical 740451
## 6 1/5/2014 2014 Betrayal '13 Play 1367059
## 7 1/5/2014 2014 Chicago Musical 923074
## 8 1/5/2014 2014 Cinderella Musical 1123628
## 9 1/5/2014 2014 First Date Musical 563371
## 10 1/5/2014 2014 Jersey Boys Musical 928795
## # ℹ 1,620 more rows
#I created a visualization that focused on all shows in 2014 that is interactive. I summed the gross to avoid weekly gross and to look at aggregrate gross over the year.
plot21 <- newdf %>%
group_by(type) %>% summarise(gross = sum(gross)) %>%
ggplot(aes(x = gross, y = type, fill = type)) +
geom_col(stat="identity") +
theme(axis.text.x = element_text(angle=0.5, hjust=1)) +
xlab("Gross Revenune") +
ylab("Type of Show") +
ggtitle("2014 Broadway Musicals") +
coord_flip() +
theme(plot.title = element_text(family = "Georgia", face = "bold"))
## Warning in geom_col(stat = "identity"): Ignoring unknown parameters: `stat`
plot21
ggplotly(plot21)
str(newdf)
## tibble [1,630 × 5] (S3: tbl_df/tbl/data.frame)
## $ week : chr [1:1630] "1/5/2014" "1/5/2014" "1/5/2014" "1/5/2014" ...
## $ year : num [1:1630] 2014 2014 2014 2014 2014 ...
## $ show : chr [1:1630] "A Gentleman'S Guide To Love And Murder" "A Night With Janis Joplin" "After Midnight" "Annie '12" ...
## $ type : chr [1:1630] "Musical" "Musical" "Musical" "Musical" ...
## $ gross: num [1:1630] 544066 369944 579804 1387296 740451 ...
#table showing 2014 broadway productions
table(newdf$type)
##
## Musical Play Special
## 1252 373 5
#filter to isolate musicals
musical <- newdf %>%
filter(type == 'Musical')
#filter to isolate lowest grossing play in 2014 by week. Mothers and Sons, $29,638.
moneymusical <- newdf %>%
filter(gross == min(gross))
tibble(moneymusical)
## # A tibble: 1 × 5
## week year show type gross
## <chr> <dbl> <chr> <chr> <dbl>
## 1 2/23/2014 2014 Mothers And Sons Play 29638
moneymusical
## # A tibble: 1 × 5
## week year show type gross
## <chr> <dbl> <chr> <chr> <dbl>
## 1 2/23/2014 2014 Mothers And Sons Play 29638
#Final Visualization
#using this new data set, I added a visualization to show the 2014 musicals. The highest grossing show, for that year, was The Lion King, followed by Wicked, and Book of Mormon.
musical <- newdf %>%
filter(type == 'Musical') %>%
arrange(desc(show)) %>%
ggplot(aes(fill = gross, x= gross, y= show)) +
geom_bar(stat='identity')+
theme(axis.text.x = element_text(angle=50, hjust=1)) +
theme(axis.ticks = element_line(linewidth = 2)) +
xlab("Gross Revenue") +
ylab("Shows") +
ggtitle("2014 Broadway Musicals")
musical
library(dslabs)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'highcharter'
## The following object is masked from 'package:dslabs':
##
## stars
library(devtools)
## Loading required package: usethis
What the visualization represents.
I experimented with several different visualizations to show the larger data set whhich included over 31,000 observations and 12 variables. I wanted to first look at the data by show type and gross revenue. Why not surprising, it was clear that musicals represent the largest revenue of broadway productions. Even the highest grossing play, was far less financially lucrative than most musicals. Plays are less likely to run beyond more than six months. Musicals can run for several decades. I focused on the highest grossing year, 2014, and looked at the shows that were the highest grossing. The top three were: The Lion King, Wicked, and the Book of Mormon. My final visualization shows the massive revenue of these top three (3) shows in 2014.
Any interesting patterns or surprises that arise within the visualization
It was interesting to see that 2013, 2014, and 2016 were the highest grossing years for broadway. I expected to see a more proportionate gross across the years from 2010 - 2016. After the recession of 2007-2008, fewer people had disposable income and were not attending broadway shows. However, the data shows that the real increase was from 2013-2016.
Anything that could have been shown that you could not get to work or that you wished you could have included.
There were alot of coding that did not work. However, the data set had a number of categories that were not interesting (capacity, theatre name, and potential capacity). I used the categories that offered the best available analysis (type of show, gross revenue, and show name). It would have been interesting to use data on ticket prices, but that was not a part of this data set.