This report will analyze statistics concerning the television show The Office to answer questions about the show’s viewership, ratings, and how those two variables relate to each other.
The data set containing this information is called “office_ratings” and it derives its contents from two websites: https://github.com/rfordatascience/tidytuesday/blob/main/data/2020/2020-03-17/office_ratings.csv and https://en.wikipedia.org/wiki/List_of_The_Office_(American_TV_series)_episodes. These websites provide the data set with specific details on each episode of The Office, including information on when the episodes came out (both in relation to other episodes and the actual air date), their titles, the number of viewers for each episode, and statistics on ratings taken from IMDB.com. In total, there are 186 observations of 7 variables.
The variables presented by the data set are called season, episode, title, viewers, imdb_rating, total_votes, and air_date. Here are specific descriptions for each variable:
season = season during which the episode aired
episode = episode number within the season
title = title of episode
viewers = number of viewers in millions on original air date
imdb_rating = average fan rating on IMDB.com from 1 to 10
total_votes = number of ratings on IMDB.com
air_date = date episode originally aired
The whole data set is displayed below:
The tidyverse package will be used throughout this report. This makes it possible to create visualizations of the data.
library(tidyverse)
Before analyzing the relationship between viewers and ratings, we will analyze the distribution of each variable involved. We will first look at how the number of viewers each episode had on its original air date is distributed.
ggplot(data=office_ratings)+
geom_histogram(mapping=aes(x=viewers))+
labs(x="number of viewers (in millions)", y="amount", title="Viewership Distribution")
The most common viewership for an episode on its original air date was about 7.5 million. The most frequent values sit at the low end of the range, so the tallest bins appear on the left side of the visualization and the distribution has a long right tail. While viewership generally does not exceed about 10 million, there is one episode that received almost 25 million viewers. The next distribution we will look at is for the average fan rating received by each episode.
ggplot(data=office_ratings)+
geom_histogram(mapping=aes(x=imdb_rating))+
labs(x="average rating (1-10)", y="amount", title="Ratings Distribution")
The most commonly received fan rating is just above 8. The ratings are generally clustered between 7 and 9, with only a few just over or under this range. The height of the bins tends to decrease as they get farther away from the peak. All of the average ratings are in the upper half of the 1 to 10 scale. Finally, we will look at how the number of ratings each episode received is distributed.
ggplot(data=office_ratings)+
geom_histogram(mapping=aes(x=total_votes))+
labs(x="total number of ratings", y="amount", title="Total Votes Distribution")
Episodes most often received just over 1500 votes. This peak is on the far left side of the distribution, meaning that an episode that did not get around 1500 votes almost always received more. The height of the bins decreases as the total number of ratings increases. Although episodes generally do not get more than just over 4000 votes, there are exceptions at just under 6000 and just under 8000.
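The readings taken from the three histograms above can be checked numerically. The following is a minimal sketch in base R, not part of the original analysis: the helper name `summarize_distribution` is hypothetical, and the `toy_viewers` values are invented for illustration and are not taken from office_ratings.

```r
# Hypothetical helper: numeric summary of one column, to complement a histogram.
summarize_distribution <- function(x) {
  x <- x[!is.na(x)]  # drop missing values before summarizing
  c(
    min    = min(x),
    median = median(x),
    mean   = mean(x),
    max    = max(x)
  )
}

# Toy values (invented for illustration, NOT the real office_ratings data).
toy_viewers <- c(6.2, 7.1, 7.5, 7.6, 8.0, 8.4, 9.0, 22.9)

viewer_summary <- summarize_distribution(toy_viewers)
print(viewer_summary)

# A mean well above the median suggests a right-skewed distribution,
# that is, a long tail of unusually high values.
is_right_skewed <- viewer_summary[["mean"]] > viewer_summary[["median"]]
print(is_right_skewed)
```

On the real data, the same helper could be applied as summarize_distribution(office_ratings$viewers), and likewise for imdb_rating and total_votes.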
Now we can move on to analyzing how these variables relate to each other. First, we will figure out if there is a connection between viewership and average ratings.
ggplot(data=office_ratings)+
geom_point(mapping=aes(x=viewers, y=imdb_rating))+
labs(x="number of viewers (in millions)", y="average rating (1-10)", title="Average Rating vs. Viewership")
From this visualization, we can see a noticeable positive association between viewership and the average fan rating received by an episode. Although it is not a perfect trend, in general, the more viewers an episode had, the higher its rating is. There are a couple of exceptions to this trend: the point just above a 7.5 rating that has a lot of viewers and the point that has the highest rating but not a lot of viewers. A visualization showing the relationship between total number of votes and average ratings can help explain these outliers.
ggplot(data=office_ratings)+
geom_point(mapping=aes(x=total_votes, y=imdb_rating))+
labs(x="total number of ratings", y="average rating (1-10)", title="Average Rating vs. Total Votes")
We can see that the point just above a 7.5 rating got a lot more votes than other points at that rating. Because it has a lot of viewers and votes compared to other points with the same rating, it would make sense to assume it was a popular episode among the audience. There was likely some sort of discourse surrounding it or something else that caused an increase in fan engagement. As for the point with the highest rating, it also received a very large number of votes. To earn both such a high rating and such an abnormally high number of votes with so few viewers, the episode was probably very well received. It is likely that the people who watched it enjoyed it so much that they went out of their way to give it a high rating when they would normally not vote at all.
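To find out which episodes sit behind individual points like these, the rows can be sorted by the variable of interest. This is only a sketch under stated assumptions: `top_episodes` is a hypothetical helper, and the `toy_episodes` rows (including the placeholder titles) are invented and are not real entries from office_ratings.

```r
# Hypothetical helper: return the n rows with the largest values in `column`.
top_episodes <- function(data, column, n = 3) {
  ranked <- data[order(data[[column]], decreasing = TRUE), ]
  head(ranked, n)
}

# Toy rows (invented for illustration, NOT the real office_ratings data).
toy_episodes <- data.frame(
  title       = c("Episode A", "Episode B", "Episode C", "Episode D"),
  viewers     = c(8.1, 5.7, 7.3, 9.0),
  imdb_rating = c(8.2, 9.7, 7.6, 8.4),
  total_votes = c(2100, 7900, 1600, 2500),
  stringsAsFactors = FALSE
)

# The single most-voted episode, with its viewership and rating alongside.
most_voted <- top_episodes(toy_episodes, "total_votes", n = 1)
print(most_voted[, c("title", "viewers", "imdb_rating", "total_votes")])
```

On the real data, a call such as top_episodes(office_ratings, "imdb_rating", n = 1) would return the title of the highest-rated episode.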
Next, we will examine the relationship between viewership and total number of votes.
ggplot(data=office_ratings)+
geom_point(mapping=aes(x=viewers, y=total_votes))+
labs(x="number of viewers (in millions)", y="total number of ratings", title="Total Votes vs. Viewership")
This graph displays a positive relationship between the number of viewers and the total number of votes received by an episode. This means that episodes with higher viewership tended to receive a greater number of votes. However, there are a few outliers. The most noticeable one is the point that almost reaches 8000 votes. It has the highest number of votes but a small number of viewers. Once again, this can be explained by a visualization showing the relationship between total number of votes and average ratings (this time with the axes flipped to make the connection more obvious).
ggplot(data=office_ratings)+
geom_point(mapping=aes(x=imdb_rating, y=total_votes))+
labs(x="average rating (1-10)", y="total number of ratings", title="Total Votes vs. Average Rating")
This graph shows that the episode with almost 8000 votes has a very high average rating. This suggests that although it was not watched by many people, it was very well liked by those who did watch it.
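The pairwise relationships read from the scatterplots above can also be quantified with correlation coefficients. The sketch below is an illustration and not part of the original analysis: `rating_correlations` is a hypothetical helper, and the `toy_pairs` values are invented. It reports both Pearson and Spearman coefficients, since the rank-based Spearman coefficient is less sensitive to a single extreme point such as the episode with almost 25 million viewers.

```r
# Hypothetical helper: Pearson and Spearman correlation between two columns,
# using only rows where both values are present.
rating_correlations <- function(x, y) {
  c(
    pearson  = cor(x, y, use = "complete.obs", method = "pearson"),
    spearman = cor(x, y, use = "complete.obs", method = "spearman")
  )
}

# Toy values (invented for illustration, NOT the real office_ratings data).
toy_pairs <- data.frame(
  viewers     = c(5.0, 6.0, 7.0, 8.0, 9.0, 22.0),
  imdb_rating = c(7.8, 8.0, 8.1, 8.4, 8.6, 9.0)
)

toy_cor <- rating_correlations(toy_pairs$viewers, toy_pairs$imdb_rating)
print(round(toy_cor, 2))
```

On the real data, the call would be rating_correlations(office_ratings$viewers, office_ratings$imdb_rating), and the same helper applies to the other two pairs of variables.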
Now we will analyze different changes within the audience over time. The first of these changes will be in the show’s popularity.
ggplot(data=office_ratings)+
geom_boxplot(mapping=aes(x=factor(season), y=viewers))+
labs(x="season number", y="number of viewers (in millions)", title="Change in Popularity over Time")
After the first season, the show had a spike in popularity. The number of viewers fluctuated slightly in seasons 2, 3, and 4, then decreased each season for the rest of the show. There is a huge outlier in season 5: one episode got almost 25 million viewers, and it is by far the most watched episode in the entire show. This episode is titled “Stress Relief” and it aired immediately after the Super Bowl. The Super Bowl draws an extremely large audience, so this episode likely gained many of its viewers from that lead-in.
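The points a boxplot draws individually are those lying more than 1.5 times the interquartile range beyond the quartiles, which is the default rule in geom_boxplot. A minimal sketch of that rule follows, applied within each season. The helper `is_boxplot_outlier` is hypothetical, the `toy_season` values are invented, and the quartiles come from R's default quantile() method, so borderline cases could in principle differ slightly from what a plot shows.

```r
# Hypothetical helper: flag values more than coef * IQR beyond the quartiles.
is_boxplot_outlier <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  x < q[[1]] - coef * iqr | x > q[[2]] + coef * iqr
}

# Toy values (invented for illustration, NOT the real office_ratings data).
toy_season <- data.frame(
  season  = c(5, 5, 5, 5, 5, 5),
  title   = paste("Episode", LETTERS[1:6]),
  viewers = c(8.0, 8.3, 8.5, 8.7, 9.0, 22.9)
)

# ave() applies the helper separately within each season and returns a
# vector aligned with the original rows (as 0/1, hence as.logical()).
toy_season$outlier <- as.logical(
  ave(toy_season$viewers, toy_season$season, FUN = is_boxplot_outlier)
)
print(toy_season[toy_season$outlier, c("season", "title", "viewers")])
```

Run on office_ratings, the rows flagged in season 5 would be expected to include the episode discussed above.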
Next, we will look at how the show’s appeal changed over time by looking at the average ratings episodes received in a given season.
ggplot(data=office_ratings)+
geom_boxplot(mapping=aes(x=factor(season), y=imdb_rating))+
labs(x="season number", y="average rating (1-10)", title="Change in Appeal over Time")
The appeal did not change over the seasons as consistently as the popularity did. After season 1, the show began to receive higher ratings. These ratings rose slightly and then fell slightly between seasons 2 and 7. In season 8, ratings reached an all-time low. Although they rose slightly in season 9, they remained lower than they were in prior seasons. There are two outliers (one in season 6 and one in season 8) that have abnormally low ratings. Both are described by fans and/or critics as being boring and disappointing. There are a few outliers in season 9. One of these, specifically the highest rated one, was the show’s final episode. It makes sense for this to be a highly rated episode, as fans are bound to feel sentimental, resulting in fondness and high engagement.
While the changes in the show’s popularity and appeal follow similar trends, there is one key difference: season 9 has the lowest popularity by far, but it does not have the lowest appeal. In fact, it has some outliers that have very high ratings. This is likely because a lot of people had stopped watching the show, and only its bigger fans were still keeping up with it. Not only would this explain why viewership was so low, but also why the ratings were so high for the number of viewers it got. The show’s most dedicated fans would probably like it more than a casual viewer, meaning they would rate it higher. This can be seen most clearly in the three high-rated outliers in season 9. These are the last three episodes of the show and earned high ratings for their role in bringing it to a conclusion.
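The comparison between popularity and appeal can be put side by side in a per-season table of medians, which is the center line each boxplot draws. This is a sketch only, using base R's aggregate(). The `toy_seasons` values are invented for illustration and are not the real season figures.

```r
# Toy values (invented for illustration, NOT the real office_ratings data).
toy_seasons <- data.frame(
  season      = c(8, 8, 8, 9, 9, 9),
  viewers     = c(5.0, 5.5, 6.0, 3.8, 4.0, 5.7),
  imdb_rating = c(7.0, 7.5, 7.8, 7.6, 8.0, 9.5)
)

# Median viewership and median rating for each season, one row per season.
season_medians <- aggregate(
  cbind(viewers, imdb_rating) ~ season,
  data = toy_seasons,
  FUN  = median
)
print(season_medians)
```

In this toy table the later season has the lower median viewership but the higher median rating, which is the kind of divergence described above. Replacing toy_seasons with office_ratings would give the real medians.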
The final statistic we will examine is how viewership changed within each season.
ggplot(data=office_ratings)+
geom_smooth(mapping=aes(x=episode, y=viewers, color=factor(season)), se=FALSE)+
labs(x="episode number", y="number of viewers (in millions)", color="season number", title="Change in Viewership within Seasons")
In general (excluding seasons 1, 8, and 9), the episodes released in the middle of a season had the most viewers. Viewership started high, fell slightly, then rose as the season progressed until it peaked near the middle, then steadily declined until the season ended. While this trend is not followed perfectly in every season, it is a reasonable description of what tended to happen. As for seasons 1, 8, and 9, they had a general decline in viewership throughout the entire season (with a few exceptions). However, season 9 experienced an increase in viewers near the end of the season.
Both season 5 and season 9 have noticeable increases in viewership. For season 5, the increase is in the middle of the season, and it can be explained by the episode that came out after the Super Bowl. For season 9, the increase is at the end of the season, and it can be explained by the fact that the entire show was coming to an end.
In all, there are many factors that can influence an episode’s popularity, rating, and level of fan engagement. Our visualizations showed us how the number of viewers, average ratings, and the number of ratings an episode received all correlate with each other. We also saw how the behavior of the audience changed throughout the seasons. Viewership of an episode and fan reactions to it are closely related, but do not always mirror each other.