The topic of the data, any variables included, what kind of variables they are, where the data came from and how you cleaned it up (be detailed and specific, using proper terminology where appropriate).

Data Description

For this project I used the CORGIS Broadway Dataset. This library holds data about Broadway shows, grouped over weeklong periods. Only shows that reported capacity were included, so the dataset includes the years 1999 through 2016. The dataset is made available by the Broadway League (the national trade association for the Broadway industry), and you can view the data online at http://www.broadwayleague.com/.

The datasets contains the following information:

Date.Day - (Integer) The day of the month that this performance’s week ended on Date.Full - (String) The full date representation that this performance’s week ended in “Month/Day/Year” format Date.Month = (Integer) The numeric month that this performance’s week on in (1 = January, 2 = February, etc) Date.Year - (Integer) The year that this week of performances occurred in Show.Name - (String) The name of the production Show.Theatre - (String) The name of the theatre Show.Type - (String) Whether it is a “Musical”, “Play”, or “Special”. Statistics.Attendance - (Integer) The total number of people who attended performance over the week Statistics.Capacity - (Integer) The percentage of the theatre that was filled during that week Statistics.Gross - (Integer) The “Gross Gross” of this performance or how much of it made in total across the entire week. Measured in dollars Statistics.Gross Potential - (Integer) The Gross Potential is the maximum amount an engagement can possibly earn based on calculations involving ticket prices, seating capacity, and thenumber of performances. The number is expressed as a percentage of what could have been achieved (Gross Gross/Gross Potential).

Data Shape number of observations (rows): 31,296 number of variables (columns): 12

Problem Statement: Examining the number of plays, musicals, and special production on Broadway and which is the largest and most profitable.

Cleaning the data: While the dataset was prepared by CORGIS and did not have many issues, it was necessary to clean the data to create column names that were simplier and in lower case. In addition, the data is organized in weekly format which required me to aggregrate gross revenues.

Why I chose this topic and dataset – what meaning does it have for you?

I enjoy Broadway shows and for that reason I chose this dataset.I am also interested in the length of time shows remain on broadway and how much money is being made through revenue(ticket sales). I was also interested in learning about the number of broadway revivals.

Incorporate brief background research about this topic. This background information will include information you find in an article, website, or book. Please source this background information within the essay or if you have multiple sources, include a bibliography.

The Broadway League developed a website that includes research reports and other data that captures the economic impact of broadway, research on demographics of ticket purchasers and attendees of shows, programs to support the performing arts, and touring shows that originate on broadway. Below is a link the Broadway League’s website. https://www.broadwayleague.com/research/research-reports/

#loading the libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
#get the working directory
getwd()
## [1] "/Users/grayce/Desktop"
#set working directory 
setwd("/Users/grayce/Desktop")
#read working directory
Broadway <- read_csv("Broadway.csv")
## Rows: 31296 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Date.Full, Show.Name, Show.Theatre, Show.Type
## dbl (8): Date.Day, Date.Month, Date.Year, Statistics.Attendance, Statistics....
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#clean thedata to remove NA's and spaces
Broadway <- Broadway
Broadway [Broadway == ""] <- NA 
colSums(is.na(Broadway))
##                   Date.Day                  Date.Full 
##                          0                          0 
##                 Date.Month                  Date.Year 
##                          0                          0 
##                  Show.Name               Show.Theatre 
##                          0                          0 
##                  Show.Type      Statistics.Attendance 
##                          0                          0 
##        Statistics.Capacity           Statistics.Gross 
##                          0                          0 
## Statistics.Gross Potential    Statistics.Performances 
##                          0                          0
#summary of the dataset
summary(Broadway)
##     Date.Day      Date.Full           Date.Month       Date.Year   
##  Min.   : 1.00   Length:31296       Min.   : 1.000   Min.   :1990  
##  1st Qu.: 8.00   Class :character   1st Qu.: 4.000   1st Qu.:2001  
##  Median :16.00   Mode  :character   Median : 6.000   Median :2006  
##  Mean   :15.72                      Mean   : 6.509   Mean   :2006  
##  3rd Qu.:23.00                      3rd Qu.:10.000   3rd Qu.:2011  
##  Max.   :31.00                      Max.   :12.000   Max.   :2016  
##   Show.Name         Show.Theatre        Show.Type         Statistics.Attendance
##  Length:31296       Length:31296       Length:31296       Min.   :   24        
##  Class :character   Class :character   Class :character   1st Qu.: 5508        
##  Mode  :character   Mode  :character   Mode  :character   Median : 7802        
##                                                           Mean   : 8011        
##                                                           3rd Qu.:10348        
##                                                           Max.   :21631        
##  Statistics.Capacity Statistics.Gross  Statistics.Gross Potential
##  Min.   : 10.00      Min.   :   1037   Min.   :  0.0             
##  1st Qu.: 70.00      1st Qu.: 320108   1st Qu.: 45.0             
##  Median : 83.00      Median : 514520   Median : 63.0             
##  Mean   : 80.74      Mean   : 596069   Mean   : 62.4             
##  3rd Qu.: 95.00      3rd Qu.: 785269   3rd Qu.: 82.0             
##  Max.   :892.00      Max.   :3201333   Max.   :226.0             
##  Statistics.Performances
##  Min.   : 0.000         
##  1st Qu.: 8.000         
##  Median : 8.000         
##  Mean   : 7.225         
##  3rd Qu.: 8.000         
##  Max.   :16.000
#clean data - making all headers lowercase and remove spaces
names(Broadway) <- tolower(names(Broadway))
names(Broadway) <- gsub(" ","",names(Broadway))
str(Broadway)
## spc_tbl_ [31,296 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ date.day                 : num [1:31296] 26 24 31 7 14 21 28 5 12 19 ...
##  $ date.full                : chr [1:31296] "8/26/1990" "3/24/1991" "3/31/1991" "4/7/1991" ...
##  $ date.month               : num [1:31296] 8 3 3 4 4 4 4 5 5 5 ...
##  $ date.year                : num [1:31296] 1990 1991 1991 1991 1991 ...
##  $ show.name                : chr [1:31296] "Tru" "Miss Saigon" "Miss Saigon" "Miss Saigon" ...
##  $ show.theatre             : chr [1:31296] "Booth" "Broadway" "Broadway" "Broadway" ...
##  $ show.type                : chr [1:31296] "Play" "Musical" "Musical" "Musical" ...
##  $ statistics.attendance    : num [1:31296] 5500 1737 12160 13921 10973 ...
##  $ statistics.capacity      : num [1:31296] 88 100 100 100 90 101 101 101 100 99 ...
##  $ statistics.gross         : num [1:31296] 134456 100647 634424 713353 573981 ...
##  $ statistics.grosspotential: num [1:31296] 0 0 0 0 0 0 0 0 0 0 ...
##  $ statistics.performances  : num [1:31296] 8 0 0 0 4 8 8 8 8 8 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Date.Day = col_double(),
##   ..   Date.Full = col_character(),
##   ..   Date.Month = col_double(),
##   ..   Date.Year = col_double(),
##   ..   Show.Name = col_character(),
##   ..   Show.Theatre = col_character(),
##   ..   Show.Type = col_character(),
##   ..   Statistics.Attendance = col_double(),
##   ..   Statistics.Capacity = col_double(),
##   ..   Statistics.Gross = col_double(),
##   ..   `Statistics.Gross Potential` = col_double(),
##   ..   Statistics.Performances = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
# get column names
colnames(Broadway)
##  [1] "date.day"                  "date.full"                
##  [3] "date.month"                "date.year"                
##  [5] "show.name"                 "show.theatre"             
##  [7] "show.type"                 "statistics.attendance"    
##  [9] "statistics.capacity"       "statistics.gross"         
## [11] "statistics.grosspotential" "statistics.performances"
# Rename column for more clarity 
names(Broadway)[names(Broadway) == "date.year"] <- "year"
names(Broadway)[names(Broadway) == "show.name"] <- "show"
names(Broadway)[names(Broadway) == "show.theatre"] <- "theatre"
names(Broadway)[names(Broadway) == "show.type"] <- "type"
names(Broadway)[names(Broadway) == "statistics.attendance"] <- "attendance"
names(Broadway)[names(Broadway) == "statistics.gross"] <- "gross"
names(Broadway)[names(Broadway) == "statistics.performances"] <-"performances"
names(Broadway)[names(Broadway) == "statistics.capacity"] <- "capacity"
names(Broadway)[names(Broadway) == "date.full"] <- "week"
#names(my_data)[names(my_data) == "Sepal.Width"] <- "sepal_width"
Broadway
## # A tibble: 31,296 × 12
##    date.day week      date.month  year show    theatre type  attendance capacity
##       <dbl> <chr>          <dbl> <dbl> <chr>   <chr>   <chr>      <dbl>    <dbl>
##  1       26 8/26/1990          8  1990 Tru     Booth   Play        5500       88
##  2       24 3/24/1991          3  1991 Miss S… Broadw… Musi…       1737      100
##  3       31 3/31/1991          3  1991 Miss S… Broadw… Musi…      12160      100
##  4        7 4/7/1991           4  1991 Miss S… Broadw… Musi…      13921      100
##  5       14 4/14/1991          4  1991 Miss S… Broadw… Musi…      10973       90
##  6       21 4/21/1991          4  1991 Miss S… Broadw… Musi…      14076      101
##  7       28 4/28/1991          4  1991 Miss S… Broadw… Musi…      14065      101
##  8        5 5/5/1991           5  1991 Miss S… Broadw… Musi…      14064      101
##  9       12 5/12/1991          5  1991 Miss S… Broadw… Musi…      13896      100
## 10       19 5/19/1991          5  1991 Miss S… Broadw… Musi…      13738       99
## # ℹ 31,286 more rows
## # ℹ 3 more variables: gross <dbl>, statistics.grosspotential <dbl>,
## #   performances <dbl>
#check the dimensions of the new dataframe
dim(Broadway)
## [1] 31296    12
#table showing data by type of show
table(Broadway$type)
## 
## Musical    Play Special 
##   22551    8406     339
#load library
library(ggplot2)
#facet wrap showing revenue by type of show from 1990 through 2016
plot10 <- Broadway %>%
  group_by(year, type) %>% 
  summarize(gross = sum(gross)) %>% 
  ggplot(aes(x = year, y = gross, group = type, fill = type)) +
  geom_bar(position = "dodge", stat = "identity") +
   facet_wrap(~ type, nrow = 10) +
   labs(title = "Broadway Revenue by Type & Year (1990 - 2016)", x = "Gross Revenue", y = "Gross Revenue") +
  scale_y_continuous("Gross Revenue", 
                       breaks = scales::breaks_extended(8),
                       labels = scales::label_dollar()) + 
  scale_x_continuous("Year",
                     breaks = scales::breaks_extended(6),
                     labels = scales::label_number_auto()) 
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
plot10

#interactivity
ggplotly(plot10)
#I ran the same data set but using geom_jitter and a smoother to show the linear regression. There were no real suprises as the lines through each type of show were linear with very few
plot2 <- Broadway %>%
  group_by(year, type) %>% 
  summarize(gross = sum(gross)) %>% 
  ggplot(aes(x = year, y = gross, group = type, fill = type)) +
  geom_jitter(aes(colour = type)) +
  geom_smooth(method='lm', lwd = 0.5) +
  xlab("Year") +
  ylab("Gross Revenue") +
  ggtitle("Broadway Shows - 1990 - 2016") 
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
plot2
## `geom_smooth()` using formula = 'y ~ x'

#I then filtered for the highest gross across the data set by week. The answer is below through filtering - Wicked - the last week of 2013.  The gross was $3,201,333   
highestgross <- Broadway %>%
  filter(gross == max(gross))
highestgross
## # A tibble: 1 × 12
##   date.day week  date.month  year show  theatre type  attendance capacity  gross
##      <dbl> <chr>      <dbl> <dbl> <chr> <chr>   <chr>      <dbl>    <dbl>  <dbl>
## 1       29 12/2…         12  2013 Wick… Gershw… Musi…      17352      100 3.20e6
## # ℹ 2 more variables: statistics.grosspotential <dbl>, performances <dbl>
#then I looked at the highest grossing year by type of show.  The winners were: Musicials in 2014 and Plays and Specials in 2015.
aggregategross <- Broadway %>%
  group_by(type, year) %>% summarise(gross = sum(gross))  %>%  
  filter(gross == max(gross))
## `summarise()` has grouped output by 'type'. You can override using the
## `.groups` argument.
aggregategross
## # A tibble: 3 × 3
## # Groups:   type [3]
##   type     year      gross
##   <chr>   <dbl>      <dbl>
## 1 Musical  2014 1140640907
## 2 Play     2015  230404143
## 3 Special  2015   18588487
#After looking at the above facet chart, I noticed that 2014 musicals had the highest gross. I created a new datafram (newdf - 1630 obs. and 5 variables). I filtered by year (2014) and then removed columns that I did not need. I want to focus on 2014 shows.
newdf <- Broadway[ -c(1,3,6,8,9,11,12) ] %>%
filter(year == 2014)
tibble(newdf)
## # A tibble: 1,630 × 5
##    week      year show                                   type      gross
##    <chr>    <dbl> <chr>                                  <chr>     <dbl>
##  1 1/5/2014  2014 A Gentleman'S Guide To Love And Murder Musical  544066
##  2 1/5/2014  2014 A Night With Janis Joplin              Musical  369944
##  3 1/5/2014  2014 After Midnight                         Musical  579804
##  4 1/5/2014  2014 Annie '12                              Musical 1387296
##  5 1/5/2014  2014 Beautiful                              Musical  740451
##  6 1/5/2014  2014 Betrayal '13                           Play    1367059
##  7 1/5/2014  2014 Chicago                                Musical  923074
##  8 1/5/2014  2014 Cinderella                             Musical 1123628
##  9 1/5/2014  2014 First Date                             Musical  563371
## 10 1/5/2014  2014 Jersey Boys                            Musical  928795
## # ℹ 1,620 more rows
#I created a visualization that focused on all shows in 2014 that is interactive. I summed the gross to avoid weekly gross and to look at aggregrate gross over the year.
plot21 <- newdf %>% 
  group_by(type) %>% summarise(gross = sum(gross))  %>%       
  ggplot(aes(x = gross, y = type, fill = type)) + 
  geom_col(stat="identity") +
theme(axis.text.x = element_text(angle=0.5, hjust=1)) +
xlab("Gross Revenune") +  
ylab("Type of Show") +
ggtitle("2014 Broadway Musicals") +
coord_flip() +
theme(plot.title = element_text(family = "Georgia", face = "bold")) 
## Warning in geom_col(stat = "identity"): Ignoring unknown parameters: `stat`
plot21

ggplotly(plot21)
str(newdf)
## tibble [1,630 × 5] (S3: tbl_df/tbl/data.frame)
##  $ week : chr [1:1630] "1/5/2014" "1/5/2014" "1/5/2014" "1/5/2014" ...
##  $ year : num [1:1630] 2014 2014 2014 2014 2014 ...
##  $ show : chr [1:1630] "A Gentleman'S Guide To Love And Murder" "A Night With Janis Joplin" "After Midnight" "Annie '12" ...
##  $ type : chr [1:1630] "Musical" "Musical" "Musical" "Musical" ...
##  $ gross: num [1:1630] 544066 369944 579804 1387296 740451 ...
#table showing 2014 broadway productions
table(newdf$type)
## 
## Musical    Play Special 
##    1252     373       5
#filter to isolate musicals
musical <- newdf %>%
  filter(type == 'Musical')
#filter to isolate lowest grossing play in 2014 by week. Mothers and Sons, $29,638.  
moneymusical <- newdf %>%
  filter(gross == min(gross))
tibble(moneymusical)
## # A tibble: 1 × 5
##   week       year show             type  gross
##   <chr>     <dbl> <chr>            <chr> <dbl>
## 1 2/23/2014  2014 Mothers And Sons Play  29638
moneymusical
## # A tibble: 1 × 5
##   week       year show             type  gross
##   <chr>     <dbl> <chr>            <chr> <dbl>
## 1 2/23/2014  2014 Mothers And Sons Play  29638
#Final Visualization 
#using this new data set, I added a visualization to show the 2014 musicals.  The highest grossing show, for that year, was The Lion King, followed by Wicked, and Book of Mormon. 
musical <- newdf %>%
  filter(type == 'Musical') %>%
 arrange(desc(show)) %>%
ggplot(aes(fill = gross, x= gross, y= show)) +
geom_bar(stat='identity')+
theme(axis.text.x = element_text(angle=50, hjust=1)) +
theme(axis.ticks = element_line(linewidth = 2)) +
xlab("Gross Revenue") +  
ylab("Shows") +
ggtitle("2014 Broadway Musicals")

musical

library(dslabs)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## 
## Attaching package: 'highcharter'
## The following object is masked from 'package:dslabs':
## 
##     stars
library(devtools)
## Loading required package: usethis

What the visualization represents.

I experimented with several different visualizations to show the larger data set whhich included over 31,000 observations and 12 variables. I wanted to first look at the data by show type and gross revenue. Why not surprising, it was clear that musicals represent the largest revenue of broadway productions. Even the highest grossing play, was far less financially lucrative than most musicals. Plays are less likely to run beyond more than six months. Musicals can run for several decades. I focused on the highest grossing year, 2014, and looked at the shows that were the highest grossing. The top three were: The Lion King, Wicked, and the Book of Mormon. My final visualization shows the massive revenue of these top three (3) shows in 2014.

Any interesting patterns or surprises that arise within the visualization

It was interesting to see that 2013, 2014, and 2016 were the highest grossing years for broadway. I expected to see a more proportionate gross across the years from 2010 - 2016. After the recession of 2007-2008, fewer people had disposable income and were not attending broadway shows. However, the data shows that the real increase was from 2013-2016.

Anything that could have been shown that you could not get to work or that you wished you could have included.

There were alot of coding that did not work. However, the data set had a number of categories that were not interesting (capacity, theatre name, and potential capacity). I used the categories that offered the best available analysis (type of show, gross revenue, and show name). It would have been interesting to use data on ticket prices, but that was not a part of this data set.