Section 1: Introduction

Broadway shows are legendary for their high production value, popularity, and ability to showcase the talents of many artists over long periods of time. The more popular a Broadway show, the more people are likely to attend and the more money it is likely to make. The data set I will be investigating contains weekly records for Broadway shows, going back to the 1990s. Only shows that reported their capacity were included in this data set. The data comes from the Broadway League, which is the national trade association for the Broadway industry. As it is provided by an official national organization, I trust it as a source and will perform this analysis with the assumption that the data is reliable and accurate. The data set contains 31,296 rows and 12 columns. Each row represents a particular Broadway show over a single week-long period. Each column is a different variable that describes the show or its performances that week. Together, these variables provide a detailed description of what a successful Broadway show may look like.

Section 2: The Goal

This analysis explores two relationships: between the type of show and the percentage of the theater that was filled during the week, and between the type of show and the percentage of its gross potential that the show met during the week. Type of show is a categorical variable, separated into “musical,” “play,” and “special.” The percent capacity is a numeric variable, measured as the percent of the theater that was full, with a practical minimum of 0 and maximum of 100. The percent of gross potential met is a numeric variable as well, calculated by dividing the show’s gross income for a week by the potential gross income the show could have made. This variable is presented as a percentage, with a practical minimum of 0 and maximum of 100. These questions are of particular interest because they give key insight into how successful a Broadway show may be, what factors demonstrate success, and how success relates to the kind of show. A high percentage of gross potential met and a high percentage of the theater being filled are indicators that a show is successful, and those variables may be related to whether the show is a “musical,” “play,” or “special.” A theater company may want to know this when selecting which shows to book, as it would want to know what capacity demands to expect and how much potential a show has to bring in money. An artist may want to know this when deciding what to write or perform in, in order to maximize exposure and earning potential. The executives of a Broadway show may want to look at these statistics for several reasons. They may use them to decide whether better advertising is needed if a low share of potential gross is being met, or whether to consider different theaters depending on how often full capacity is reached.
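
The percentage of gross potential met described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name and the sample dollar amounts are made up for the example and are not taken from broadway.csv.

```python
def percent_of_potential(gross, potential_gross):
    """Return weekly gross as a percentage of potential gross, or None if undefined."""
    if potential_gross <= 0:
        # Without a usable potential gross, the ratio cannot be computed.
        return None
    return 100.0 * gross / potential_gross


# Hypothetical week: $450,000 earned out of a possible $600,000.
print(percent_of_potential(450_000, 600_000))  # 75.0
```

A result of 75.0 would mean the show earned three quarters of what it could have earned that week.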

Section 3: Theater Capacity

We start by graphing the extent to which a theater’s capacity was met, or what percentage of the theater was filled. This is a numeric variable, in units of percent, and provides insight into how full the audience was for a particular week of the show. A higher percentage indicates that more of the available seats were taken, which may give insight into how successfully the show is advertised or how it is regarded by the public. Figure 3.1 and Figure 3.2 show that there are several outliers within this data that go over a capacity of 100%. This does not mean that the data as a whole is unreliable; rather, there was likely some input error on the part of the researcher for these values. For example, the actual number of people present may have been entered instead of the percent of the theater that was full, or something of that nature. Therefore, I included two additional graphs, Figure 3.3 and Figure 3.4, where the x-axis is limited to a maximum theater capacity of 100%. Figures 3.1 and 3.3 are frequency polygons, which demonstrate that the majority of the percent capacity data lie above 50%. Figures 3.2 and 3.4 are histograms that similarly demonstrate this trend.
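
As a rough sketch of how the values above 100% could be set aside before drawing the restricted graphs, the Python below splits a list of percentages into in-range and out-of-range values. The function name and the sample numbers are hypothetical and are used only for illustration; this is not necessarily how the figures in this report were produced.

```python
def split_by_range(values, lower=0.0, upper=100.0):
    """Split values into those inside [lower, upper] and those outside it."""
    in_range = [v for v in values if lower <= v <= upper]
    out_of_range = [v for v in values if v < lower or v > upper]
    return in_range, out_of_range


# Made-up percent capacity values, not rows from broadway.csv.
sample = [62, 88, 101, 95, 892, 47]
kept, flagged = split_by_range(sample)
print(kept)     # [62, 88, 95, 47]
print(flagged)  # [101, 892]
```

Keeping the flagged values in a separate list, rather than deleting them, makes it possible to report how many there are and what range they cover.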

Section 4: Research Question 1

The research question of interest is: What is the relationship between the percent of the theater’s capacity met and the type of show? I have selected a histogram and a boxplot to represent the relationship between these two variables, as these graphs are well-suited to displaying the relationship between a numeric and a categorical variable. The percentage of the theater’s capacity is a numeric variable, meaning that its value is a number, in this case the percent to which a theater is at capacity. Show type is a categorical variable, meaning that its values are not numeric but are grouped into categories, in this case “musical,” “play,” or “special.” As Figure 4.1 and Figure 4.2 demonstrate, there are outliers present in the theater capacity variable. As stated in the prior section, reports of theater capacity over 100% may be due to researcher input error, but are worth noting. Of 31,296 entries, 2,128 have a reported percent capacity of over 100%, ranging from 101% to 892%. The first two graphs, Figures 4.1 and 4.2, represent the data over its full range. Figure 4.2 indicates that each show type has high outliers. Musicals have the highest outlier, at over 875%, followed by plays, at over 750% and below 875%. Specials have the lowest of the high outliers, at right around 750%. The second two graphs, Figures 4.3 and 4.4, represent the data with the maximum restricted to 100%. If I were presenting to a client, I would make note of the values above 100% and focus my interpretation on the representations limited to 100% capacity. Figure 4.3 demonstrates that musicals have the highest frequency in the data set, followed by plays, and finally specials. Figure 4.4 demonstrates that musicals have the highest average percent capacity, followed by specials, and plays have the lowest average percent capacity filled. The average percent capacity of musicals is above 75%, closer to 80%, while the average percent capacity of specials is right around 75%. The average percent capacity of plays is a bit below 75% and above 60%. Figure 4.4 also demonstrates that there are low outliers in each show type, the lowest for musicals and plays being around 13%. The lowest outlier for specials appears to be below 12%. Each show category has its first and third quartiles above 50%, meaning that at least three quarters of the weekly capacities in each category are above 50% regardless of show type. Therefore, based on the graphs, my answer to the original research question is that the show type “musical” tends to have the highest percent capacity, followed by “special,” and “play” tends to have the lowest percent capacity.
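
The quartiles read off a boxplot can also be computed directly. The sketch below assumes the data is available as (show type, percent capacity) pairs and uses Python's statistics.quantiles; the function name and the rows are invented for illustration and do not come from broadway.csv. It also assumes the usual convention that the line inside a boxplot marks the median.

```python
from collections import defaultdict
from statistics import quantiles


def quartiles_by_type(rows, max_value=100.0):
    """Return {show_type: (Q1, median, Q3)} using only values <= max_value."""
    groups = defaultdict(list)
    for show_type, pct_capacity in rows:
        if pct_capacity <= max_value:  # mirror the graphs limited to 100%
            groups[show_type].append(pct_capacity)
    # quantiles(..., n=4) returns the three quartile cut points.
    # Each group is assumed to hold at least two values.
    return {t: tuple(quantiles(v, n=4, method="inclusive"))
            for t, v in groups.items()}


# Made-up rows for illustration only.
rows = [("musical", 60), ("musical", 70), ("musical", 80), ("musical", 90),
        ("musical", 100), ("musical", 140),
        ("play", 40), ("play", 50), ("play", 60), ("play", 70), ("play", 80)]
summary = quartiles_by_type(rows)
print(summary["musical"])  # (70.0, 80.0, 90.0)
print(summary["play"])     # (50.0, 60.0, 70.0)
```

If the first quartile of a group is above 50%, then at least three quarters of that group's values are above 50%, which is the reasoning applied to Figure 4.4.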

Section 5: Research Question 2

The research question of interest is: What is the relationship between the type of show and the percentage gross potential met? I have selected a histogram and a boxplot to represent the relationship between these two variables, as type of show is a categorical variable and the percentage gross potential met is a numeric variable. A numeric variable means that the value is a number, in this case the percent to which the gross potential of a show was met. The calculation involves ticket prices, seating capacity, and the number of performances within the week. It ultimately compares how much was actually achieved, the gross, with what could have been achieved, the gross potential. The percentage is therefore the true gross divided by the gross potential. The variable description provided with this data set left the true meaning of the gross potential slightly vague. This analysis interprets gross potential as truly the “maximum” possible value of gross income, so the maximum percentage that could be reported is 100%. Due to the vagueness of this variable’s description, and the way in which it is being interpreted in this analysis, there are multiple data points in the data set that are above this restricted range, as made clear in Figure 5.2. These data points are likely due to researcher input mistakes or to different interpretations of the variable, which led to different maximum values being understood. Of the 31,296 entries, 2,745 are over 100%, ranging from 101% to 226%. Therefore, to give a comprehensive representation of the data, I have included two sets of graphs. The first set of graphs, Figure 5.1 and Figure 5.2, includes the data points that go over 100% gross potential met. Figure 5.2 demonstrates that there are high outliers in each show type, the highest being in the play category, at over 225%. The highest outlier in the musical category is over 150%, and the highest outlier in the special category is just over 125%. The second set of graphs, Figure 5.3 and Figure 5.4, is limited to data points that go up to 100%. If I were presenting this to a client, I would make note of the outliers and base my interpretations on the second set of graphs. Figure 5.4 demonstrates that specials and plays have around the same average percentage gross potential met, while musicals meet a higher percentage of their gross potential. Musicals, therefore, on average meet a higher percentage of their gross potential than plays and specials; that is, they come closer to the maximum amount of engagement possible. Figure 5.4 also demonstrates that musicals have one low outlier, at 0. However, the researchers input 0 as a default for gross potential when it could not be calculated. Therefore, this low outlier of 0 may not be a true value, but a default value put in place by the researchers. Figure 5.4 also indicates that all three categories have their first and third quartiles above 25%, meaning that at least three quarters of the weeks in each category met over 25% of their gross potential regardless of show type. Musicals on average met over 60% of their gross potential, plays on average met just under 50%, and specials also met just under 50% on average. Therefore, to answer the second research question based upon these graphs, “musicals” have the highest average percent of their gross potential met, followed by “specials” and “plays,” which are close to equal in their average value.
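
Because a 0 in this variable may be a default rather than a real measurement, it can pull an average down. The sketch below shows one possible way to check for that effect, by comparing a mean with and without the placeholder. The function name and numbers are hypothetical, and this step was not necessarily part of the analysis above.

```python
from statistics import mean


def mean_excluding_placeholder(values, placeholder=0):
    """Average the values after dropping entries equal to the placeholder."""
    real = [v for v in values if v != placeholder]
    return mean(real) if real else None


# Made-up weekly percentages; the 0 stands for "could not be calculated".
weeks = [0, 60, 70, 80]
print(mean(weeks))                        # 52.5
print(mean_excluding_placeholder(weeks))  # 70
```

In this made-up case a single placeholder lowers the average from 70 to 52.5, which is why the low outlier of 0 is treated with caution.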

Section 6:

This graph demonstrates the relationship between whether a person smokes and their household income. Whether a person smokes “every day” or “some days” is a categorical variable, meaning that it is not number-based but is rather a category. Participants were asked if they smoke, yes or no, but only data from those who replied yes are present on the graph, in the form of a percentage of participants. Household income is measured in thousands of dollars and is grouped into four income brackets: <$35k, $35k-$75k, $75k-$100k, and >$100k. Household income is measured by the total amount of money earned, including wages, investment income, and retirement and welfare payments, by all members of a household unit. Income is a numeric variable, meaning it is based on number values, but grouping it into income brackets in the bar graph turns it into a categorical variable. The story being told by this graph is that smoking and income have an inverse relationship, which means that as one increases, the other tends to decrease. This is not a cause-and-effect relationship, but rather a correlational one. Therefore, the data suggests that as household income decreases, the likelihood that a person smokes “every day” or “some days” tends to increase, but the two do not necessarily cause or affect each other. This provides insight into how income may be related to lifestyle factors. The largest difference between categories in the percent of participants that smoke is between those with a household income of <$35k and those with $35k-$75k, a difference of 7 percentage points. This difference is emphasized by the small scale of the graph. I have a couple of concerns regarding how this graph could be perceived by the public. The way that smoking is quantified, as “every day” or “some days,” is relatively vague, and it also depends upon self-reporting, which can be biased. “Some days” does not have a number associated with it, and it is therefore not as precise as it could be if the number of days were quantified. There is also a sizable lifestyle difference between smoking every day and smoking some days, so I would suggest that those groups be separated in data analysis. Also, the smoking variable is based on an individual’s smoking habits, while household income is potentially based on a group, being “all members of a household unit.” Therefore, one person could smoke and have a lower income, but that is not demonstrated in the graph if they live in a household with other people who do not smoke and have a higher income. The graph also does not demonstrate the relationship between income and those who do not smoke. Including data from individuals who said that they did not smoke would provide a more comprehensive image of how smoking habits, whether they are present or not, are correlated with income and what general income distributions look like. The x-axis is the percentage of participants who said that they smoked every day or some days, and the maximum value of 21% sits near the far end of the axis. This small scale may influence the public to assume larger differences between income brackets, when a larger scale would demonstrate that the differences are present but not as big as the original graph suggests. For marketing, however, using a smaller scale is beneficial for emphasizing the differences between groups. Overall, this graph was easy to read and provides useful information, but it has some potential faults in how it represents the data.
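
The effect of axis scale can be put into rough numbers by expressing the gap between two bars as a share of the full axis length. The sketch below is only illustrative: the function name is made up, the bar values of 21 and 14 are chosen to be 7 percentage points apart, and the axis maximum of 25 is an assumed stand-in for a short axis, not a measurement of the original graph.

```python
def gap_as_share_of_axis(value_a, value_b, axis_max):
    """Return the gap between two bars as a fraction of the full axis length."""
    return abs(value_a - value_b) / axis_max


# Illustrative values only: two groups 7 percentage points apart.
short_axis = gap_as_share_of_axis(21, 14, axis_max=25)   # assumed short axis
full_axis = gap_as_share_of_axis(21, 14, axis_max=100)   # axis running 0 to 100
print(short_axis, full_axis)  # 0.28 0.07
```

Under these assumptions the same 7-point gap takes up 28% of the short axis but only 7% of a 0 to 100 axis, which is why the differences look larger on the smaller scale.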

Section 7: References

Broadway. 2015. Broadway League. Data set [broadway.csv], accessed 05 02 2022, https://corgis-edu.github.io/corgis/csv/broadway/.

Centers for Disease Control and Prevention, The New York Times, digital image, accessed 06 02 2022, https://www.nytimes.com/2021/02/25/learning/whats-going-on-in-this-graph-smoking-income.html.