Section 1

Broadway shows are legendary for their high production value, popularity, and ability to showcase the talents of many artists over long periods of time. The more popular a Broadway show, the more people are likely to attend and the more money it is likely to make. The dataset I will be investigating contains data points about Broadway shows grouped into week-long periods, going back to the 1990s. Only shows that reported their capacity were included in this dataset. The data comes from the Broadway League, which is a national trade association for the Broadway industry. Because it is provided by an official national organization, I trust it as a source and will perform my analysis under the assumption that it is reliable and accurate. The dataset contains 31,296 rows and 12 columns. Each row represents one Broadway show over a single week-long period. Each column is a variable that describes the show or its performances during that week.

Section 2

This analysis explores two relationships: the one between the type of show and the percentage of the theater that was filled during the week, and the one between the type of show and the percentage of its gross potential that the show met during the week. These questions are of particular interest because they give insight into how successful a Broadway show is and how that success relates to one aspect of the show. A high percentage of gross potential met and a high percentage of the theater filled both indicate that a show is successful, and those variables may be related to whether the show is a “musical,” “play,” or “special.” A theater company may want to know this when selecting which shows to book, since it suggests how much money they can expect to make and how often their theater will be full. An artist may want to know this when deciding what to write or perform in, so that on average they meet more of their potential or earn as much as they can.

Section 3

We start by graphing the percentage of the theater that was filled, that is, the percent of the theater’s capacity that was met in a particular week. This is a numeric variable. A higher percentage indicates that a larger share of the seats was filled.

Section 4

The research question of interest is: What is the relationship between the % capacity of the theater and the show type? I have selected a histogram and a boxplot to represent the relationship between these two variables. The percentage of the capacity of the theater is a numeric variable, while show type is a categorical variable. A numeric variable means that the value is a number, in this case the percent to which a theater is at capacity. A categorical variable means that the values aren’t numeric, but rather grouped into categories, in this case the show types “musical,” “play,” and “special.” The original boxplot makes it apparent that there are multiple outliers in the data. This does not mean that the dataset as a whole is unreliable; more likely a mistake occurred during data entry, as it is impossible to have over 100% capacity. If the scope of the data is limited to a maximum of 100% capacity, as is practical, the results are clearer. Musicals have the highest average percent capacity, followed by specials, and plays have the lowest. The average percent capacity of musicals is above 75%, closer to 80%, while the average percent capacity of specials is right around 75%. The average percent capacity of plays is a bit below 75% and above 60%. Each show category has both its first and third quartiles above 50%, meaning that at least three quarters of the weekly observations in each category are above 50% capacity. Therefore, based on the graphs, my answer to the original research question is that the show type “Musical” tends to have the highest % capacity, followed by “Special,” and “Play” tends to have the lowest % capacity.
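The filtering and grouping step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows are made-up toy values, not observations from broadway.csv, and the function name `median_by_type` is hypothetical. It uses the median as the measure of center, which is the center line a boxplot normally draws.

```python
from collections import defaultdict
from statistics import median

# Toy rows of (show type, percent of capacity filled).
# These are illustrative values only, NOT data from broadway.csv.
rows = [
    ("Musical", 82), ("Musical", 95), ("Musical", 103),
    ("Play", 64), ("Play", 71), ("Play", 88),
    ("Special", 70), ("Special", 75), ("Special", 79),
]

def median_by_type(rows, upper=100):
    """Group percent-capacity values by show type, dropping values above `upper`."""
    groups = defaultdict(list)
    for show_type, pct in rows:
        if pct <= upper:  # treat anything over 100% as an input error
            groups[show_type].append(pct)
    return {show_type: median(values) for show_type, values in groups.items()}

medians = median_by_type(rows)
```

With the toy rows, the 103% entry is dropped before the musical median is computed, which mirrors limiting the plot to a maximum of 100% capacity.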

Section 5

The research question of interest is: What is the relationship between the type of show and the percentage gross potential met? I have selected a histogram and a boxplot to represent the relationship between these two variables, as type of show is a categorical variable and the percentage gross potential met is a numeric variable. A numeric variable means that the value is a number, in this case the percent to which the gross potential of a show was met. The gross potential is calculated from ticket prices, seating capacity, and the number of performances within the week. It represents what could have been earned, while the gross is what was actually earned. The percentage is therefore the actual gross divided by the gross potential. The initial graph shows outliers from cases where the gross potential met could not be calculated, so a 0 was put in its place. There are also outliers due to input error where the gross potential met is recorded as a percentage over 100, which is not possible. In a new plot with the percentage gross potential met capped at 100%, a clearer pattern emerges. Musicals on average met over 60% of their gross potential, while plays and specials each met just under 50% on average. This suggests that musicals on average meet a higher percentage of their gross potential than plays and specials, or come closer to the maximum amount they could earn, while plays and specials are very similar to each other. All three categories had both their first and third quartiles above 25%, meaning that at least three quarters of the weekly observations in each category met over 25% of their gross potential regardless of show type. Therefore, to answer the second research question, “Musicals” have the highest average percent of their gross potential met, followed by “Specials” and “Plays,” which are close to equal in their average value.
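As a rough sketch of the calculation described above, the percentage gross potential met is the gross divided by the gross potential, expressed as a percentage. The function names and the dollar amounts below are hypothetical and are not taken from broadway.csv. The sketch assumes a 0 is stored whenever the value cannot be calculated.

```python
def pct_gross_potential(gross, gross_potential):
    """Return gross as a percentage of gross potential, or 0 if it cannot be computed."""
    if not gross_potential:  # missing or zero potential: store the 0 placeholder
        return 0
    return 100 * gross / gross_potential

def keep_plausible(percentages, upper=100):
    """Drop the 0 placeholders and any value above `upper`."""
    return [p for p in percentages if 0 < p <= upper]

# Hypothetical weekly (gross, gross potential) pairs in dollars.
weeks = [
    (600_000, 1_000_000),
    (450_000, 900_000),
    (300_000, 0),            # potential unavailable, so the result is 0
    (1_300_000, 1_000_000),  # over 100%, treated here as an input error
]
percentages = [pct_gross_potential(g, p) for g, p in weeks]
plausible = keep_plausible(percentages)
```

Only the first two hypothetical weeks survive the filter, which corresponds to plotting with the 0 placeholders removed and a maximum of 100%.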

Section 6

This graph demonstrates the relationship between whether a person smokes and their household income. Whether a person smokes “every day” or “some days” is a categorical variable, meaning that it isn’t number-based and is rather a category. Participants were asked if they smoke, yes or no, but only data from those who replied yes are present on the graph, in the form of a percentage of participants. Household income is measured in thousands of dollars and is grouped into four income brackets: <$35k, $35k-$75k, $75k-$100k, and >$100k. Household income is measured by the total amount of money earned, including wages, investment income, and retirement and welfare payments, by all members of a household unit. Income is a numeric variable, meaning it is based on number values, but grouping it into income brackets in the bar graph turns it into a categorical variable.

The story being told by this graph is that smoking and income have an inverse relationship, which means that as one increases, the other tends to decrease. This is not a cause and effect relationship, but rather a correlational one. Therefore, the data suggests that as household income decreases, the likelihood that a person smokes “every day” or “some days” tends to increase, but the two do not necessarily cause or affect each other. This provides insight as to how income may be related to lifestyle factors. The largest difference between categories in the percent of participants that smoke is between those with a household income of <$35k and $35k-$75k, a difference of 7 percentage points. This difference is emphasized by the small scale of the graph.

I have a couple of concerns regarding how this graph could be perceived by the public. The way that smoking is quantified, as “every day” or “some days,” is relatively vague, and it also depends upon self-reporting, which can be biased. “Some days” does not have a number associated with it, so it is not as precise as it would be if the number of days were quantified. There is also a considerable lifestyle difference between smoking every day and smoking some days, so I would suggest that those groups be separated in data analysis. Also, the smoking variable is based on an individual’s smoking habits, while household income is potentially based on a group, being “all members of a household unit.” Therefore, one person could smoke and have a lower income, but that is not shown in the graph if they live in a household with other people who do not smoke and have a higher income. The graph also does not show the relationship between income and those who do not smoke. Including data from individuals who said that they did not smoke would provide a more comprehensive picture of how smoking habits, whether present or not, are correlated with income. The x-axis is the percentage of participants who said that they smoked every day or some days, and the maximum percentage of 21% sits at the far end of the axis. This small scale may lead the public to assume larger differences between income brackets, when a larger scale would show that the differences are present but not as big as the original graph suggests. For marketing, however, a smaller scale is useful for emphasizing the differences between groups.
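The effect of the axis scale can be made concrete with a small calculation. The sketch below assumes the x-axis starts at 0 and ends at about 21%, which is an assumption about the graph, and the function name `share_of_axis` is hypothetical. It computes how much of the axis length the 7-point difference occupies on that scale compared with a full 0 to 100% scale.

```python
def share_of_axis(difference, axis_max, axis_min=0):
    """Fraction of the axis length taken up by a difference between two bars."""
    return difference / (axis_max - axis_min)

# The 7-point gap on an axis assumed to run from 0 to 21 percent.
narrow = share_of_axis(7, 21)

# The same gap on an axis running from 0 to 100 percent.
wide = share_of_axis(7, 100)
```

Under these assumptions the same gap spans about a third of the narrow axis but only 7% of the wide one, which is why the differences look larger on the original graph.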

References

Broadway. 2015. Broadway League. Data set [broadway.csv], accessed 05 02 2022 from https://corgis-edu.github.io/corgis/csv/broadway/.

Centers for Disease Control and Prevention, The New York Times, digital image, accessed 06 02 2022, https://www.nytimes.com/2021/02/25/learning/whats-going-on-in-this-graph-smoking-income.html.