Introduction

This report seeks to answer the following questions:

What are the trends of popularity and appeal over time of the TV show The Office? Is there a relationship between the two?

We will be using a data set called office_ratings. This data set compiles the total number of U.S. viewers for all 186 episodes of The Office, taken from Wikipedia, along with the IMDb rating for each episode. The variables we are most interested in are viewers, the total number of viewers in the United States (in millions); imdb_rating, a rating on a scale of 1-10 calculated as the average rating given by users of imdb.com, where anybody can create a free account; and total_votes, the total number of votes that factor into the rating for each episode. The full data set can be viewed below:

Throughout, we will use the functionality of the tidyverse package, mainly to create visualizations.

library(tidyverse)

Distribution of Variables

First, we will look at each of our three variables of interest independently: the number of viewers, the average rating, and the total number of votes. We begin with the distribution of total viewers.

ggplot(data = office_ratings) +
  geom_histogram(mapping = aes(x = viewers), binwidth = 0.5) +
  labs(x = "Viewers (in millions)",
       y = "Count",
       title = "Distribution of Viewers")

We get an interesting and perhaps unexpected result. There is a single outlier to the right, Stress Relief, but apart from that the data appear to be skewed left: most episodes had high viewership, with a tail of less popular episodes that fewer people watched. Note, however, that there is a second, smaller peak to the left of the main peak. One possible explanation is that some seasons as a whole drew more viewers than others. Let’s test this hypothesis:

ggplot(data = office_ratings) +
  geom_histogram(mapping = aes(x = viewers, fill = season), binwidth = 0.5) +
  labs(x = "Viewers (in millions)",
       y = "Count",
       fill = "Season",
       title = "Distribution of Viewers")

By segmenting each bar into colors based on the season each episode belongs to, we can see that the low end of the spectrum consists mostly of episodes from seasons 1, 8, and 9. The main peak is made up of clusters from seasons 2 through 7.
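The same hypothesis can also be checked numerically. The following is a minimal sketch, not part of the original analysis: it summarises viewership per season so that the seasons can be ordered from least to most watched. The helper name viewers_by_season is hypothetical, and the code assumes office_ratings has the season and viewers columns used in the plots above.

```r
library(dplyr)

# Hypothetical helper: one row per season with the episode count and the
# median, minimum, and maximum viewers, sorted from least to most watched.
viewers_by_season <- function(df) {
  df %>%
    group_by(season) %>%
    summarise(
      episodes       = n(),
      median_viewers = median(viewers),
      min_viewers    = min(viewers),
      max_viewers    = max(viewers),
      .groups = "drop"
    ) %>%
    arrange(median_viewers)
}

# Runs only if the data set is already loaded in the session.
if (exists("office_ratings")) {
  print(viewers_by_season(office_ratings))
}
```

If the explanation above holds, seasons 1, 8, and 9 should appear in the first rows of the result.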

Now let’s take a look at the distribution of IMDb Ratings.

ggplot(data = office_ratings) +
  geom_histogram(mapping = aes(x = imdb_rating), binwidth = 0.1) +
  labs(x = "IMDb Rating",
       y = "Count",
       title = "Distribution of IMDb Ratings")

The ratings appear to follow an approximately normal distribution with no outliers. The peak is in the center at \(8.2\). There are some taller bars in unexpected spots, perhaps telling a story about how people gravitate toward choosing certain “random” numbers over others, but that is a different study altogether. (Video here if interested.)
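The shape of this distribution can be examined a little more formally. The sketch below is an addition under stated assumptions, not something the original analysis ran: it computes a moment-based skewness estimate and a Shapiro-Wilk test using base R. The helper name check_normality is hypothetical. Because IMDb ratings are rounded to one decimal place, the data contain many ties, so the test result should be read as a rough guide only.

```r
# Hypothetical helper: rough normality check for a numeric vector.
# Returns the sample skewness (0 for a symmetric sample) together with
# the Shapiro-Wilk statistic W and its p-value.
check_normality <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  skew <- mean(centered^3) / mean(centered^2)^1.5
  test <- shapiro.test(x)  # requires between 3 and 5000 observations
  list(skewness = skew, W = unname(test$statistic), p_value = test$p.value)
}

# Runs only if the data set is already loaded in the session.
if (exists("office_ratings")) {
  print(check_normality(office_ratings$imdb_rating))
  # A normal Q-Q plot gives a visual version of the same check.
  qqnorm(office_ratings$imdb_rating)
  qqline(office_ratings$imdb_rating)
}
```

A skewness near zero and points lying close to the Q-Q line would be consistent with the histogram's bell shape.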

Let’s take a look at our third variable, number of IMDb voters.

ggplot(data = office_ratings) +
  geom_histogram(mapping = aes(x = total_votes), binwidth = 250) +
  labs(x = "Total Votes",
       y = "Count",
       title = "Distribution of Total Votes")

This distribution is strongly skewed to the right, and we can see that there are three outliers, one of those being Stress Relief. More on the other two outliers in a little bit. This suggests that for most episodes, people don’t bother taking the time to go to imdb.com and vote, but they will if they feel particularly emotionally attached to an episode.
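The outliers above were read off the histogram. One way to list them explicitly is sketched below. It uses the common 1.5 × IQR rule, which is a convention swapped in here for illustration; the original analysis does not state how outliers were identified. The helper name flag_high_outliers is hypothetical.

```r
# Hypothetical helper: return the rows of `df` whose value in `column`
# lies above Q3 + 1.5 * IQR (the usual boxplot rule for high outliers).
flag_high_outliers <- function(df, column) {
  x <- df[[column]]
  cutoff <- quantile(x, 0.75, na.rm = TRUE) + 1.5 * IQR(x, na.rm = TRUE)
  df[!is.na(x) & x > cutoff, , drop = FALSE]
}

# Runs only if the data set is already loaded in the session.
if (exists("office_ratings")) {
  print(flag_high_outliers(office_ratings, "total_votes"))
}
```

With a strongly right-skewed variable this rule may flag more than the three episodes that stand out visually, so the result is best treated as a shortlist to inspect rather than a definitive set.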

Relationships between Variables

Now that we have seen all three variables, we will look for trends between them. Let’s start with a natural question that arises from these sets of data: Is it the case that the higher rated an episode is, the more people watched said episode? We can create a scatter plot with a regression line to find out:

ggplot(data = office_ratings, mapping = aes(x = viewers, y = imdb_rating)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Viewers (in millions)",
       y = "IMDb Rating",
       title = "IMDb Rating vs. Viewers")

It would appear that yes, there is a moderate positive relationship between popularity and appeal. The data are fairly scattered, i.e. the points don’t follow the line very closely. As a result, there are some episodes with high ratings but few viewers, and some with low ratings but many viewers, which the linear trend does not explain. We can quantify the strength of the relationship by checking the correlation coefficient:

cor(office_ratings$viewers, office_ratings$imdb_rating)
## [1] 0.4918702

A correlation coefficient of \(0.49\) indicates a moderate, positive correlation, which we were able to observe in the scatter plot.
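The coefficient alone does not show how precisely it is estimated. As a hedged extension that the original analysis did not include, base R's cor.test() reports a confidence interval and p-value for the Pearson correlation. The wrapper name correlation_summary is hypothetical.

```r
# Hypothetical helper: Pearson correlation with its 95% confidence
# interval and the p-value for the null hypothesis of zero correlation.
correlation_summary <- function(x, y) {
  test <- cor.test(x, y, method = "pearson")
  c(
    r       = unname(test$estimate),
    lower   = test$conf.int[1],
    upper   = test$conf.int[2],
    p_value = test$p.value
  )
}

# Runs only if the data set is already loaded in the session.
if (exists("office_ratings")) {
  print(correlation_summary(office_ratings$viewers, office_ratings$imdb_rating))
}
```

An interval that excludes zero would support describing the relationship as positive, while its width indicates how much the estimate could vary.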

One might also wonder whether there is a correlation between the number of people who viewed an episode and the number of IMDb votes. People can only form an opinion on an episode if they have already watched it, so it seems reasonable to expect a positive relationship. Let’s check this hypothesis with another scatter plot:

ggplot(data = office_ratings, mapping = aes(x = viewers, y = total_votes)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Viewers (in millions)",
       y = "Total Votes",
       title = "Total Votes vs. Viewers")

cor(office_ratings$viewers, office_ratings$total_votes)
## [1] 0.4749562

A correlation coefficient of \(0.47\) tells us again that the relationship is a moderate, positive correlation. We can see that the outliers all occur far above the regression line. First note that Stress Relief, the episode with over double the views of the second-most-watched episode, does not actually have the highest vote total, as we might expect. Instead, Goodbye, Michael received the most votes. This episode marked the departure of Michael, the office’s boss and central character for the previous seven seasons. Naturally, fans who watched this episode had an emotional connection to it and were more inclined to vote, even though that particular episode was on the low end of the viewership spectrum. The other episode with an abnormally high vote count is Finale, the final episode of the show. Although popularity had already died down by season 9, the loyal fans still watching were more inclined to vote, as they were sad to see the show come to an end.
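The episodes lying far above the regression line can also be found programmatically. The sketch below is an illustration under assumptions rather than the method used in this report: it fits the same straight line with lm() and returns the rows with the largest positive residuals. The helper name largest_residuals is hypothetical.

```r
# Hypothetical helper: fit total_votes ~ viewers and return the `n` rows
# that sit furthest above the fitted line (largest positive residuals).
largest_residuals <- function(df, n = 3) {
  # na.exclude keeps the residual vector aligned with the rows of `df`.
  fit <- lm(total_votes ~ viewers, data = df, na.action = na.exclude)
  df$residual <- residuals(fit)
  keep <- order(df$residual, decreasing = TRUE)[seq_len(min(n, nrow(df)))]
  df[keep, , drop = FALSE]
}

# Runs only if the data set is already loaded in the session.
if (exists("office_ratings")) {
  print(largest_residuals(office_ratings, n = 3))
}
```

If the reading of the scatter plot above is right, Goodbye, Michael and Finale should be among the rows returned.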

Analysis of Variables over Time

Now that we have seen that there are relationships between these three variables, we will analyze how they changed over time. First, we will take a look at how popularity shifted throughout the TV show. We will use another scatter plot for this, with the points color coded by season for ease of viewing.

ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = air_date, y = viewers, color = season)) +
  labs(x = "Air Date",
       y = "Viewers (in millions)",
       title = "Popularity over Time")

We can see that the show gained popularity in season 2, and then gradually declined until it was finished. One might assume that since popularity and appeal were related, the ratings over time might look similar. We can check this by creating a plot like the one above, but with IMDb rating on the y-axis instead.

ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = air_date, y = imdb_rating, color = season)) +
  ylim(4, 10) +
  labs(x = "Air Date",
       y = "IMDb Rating",
       title = "Appeal over Time")

We can see that the ratings do indeed pick up in season 2 and then slowly decline until the end. However, the decline in popularity was far more drastic than the decline in ratings. For example, the most viewed episode in season 9, Finale, still had a lower viewer count than every single episode in seasons 2-7. One possible explanation for the ratings remaining relatively constant is that the people still loyal to the show were the ones who liked it best, so theirs are the ratings we see reported. This may also be why season 9 actually had better ratings than season 8: since people knew the show was coming to an end, the people who liked the show best stayed and gave high ratings.
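To put the two declines on a comparable footing, each season's average can be expressed as a fraction of the best season's average. This is a minimal sketch added for illustration, and the helper name season_trends is hypothetical. It assumes the season, viewers, and imdb_rating columns used in the plots above.

```r
library(dplyr)

# Hypothetical helper: mean viewers and mean rating per season, each also
# expressed relative to the highest season mean (1 = the peak season).
season_trends <- function(df) {
  df %>%
    group_by(season) %>%
    summarise(
      mean_viewers = mean(viewers),
      mean_rating  = mean(imdb_rating),
      .groups = "drop"
    ) %>%
    mutate(
      viewers_vs_peak = mean_viewers / max(mean_viewers),
      rating_vs_peak  = mean_rating / max(mean_rating)
    )
}

# Runs only if the data set is already loaded in the session.
if (exists("office_ratings")) {
  print(season_trends(office_ratings))
}
```

If viewership fell much further than ratings, viewers_vs_peak should drop well below rating_vs_peak in the later seasons.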

Now that we have seen the overall trend in popularity over the course of the entire show, the last thing we would like to look at is how the popularity changed over the course of individual seasons. We can do so by creating a facet grid of scatter plots for each season:

ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = episode, y = viewers)) +
  facet_grid(cols = vars(season)) +
  labs(x = "Episode Number",
       y = "Viewers (in millions)",
       title = "Popularity by Season")

By looking at each season individually, we can see that the general trend (with a few exceptions) is that popularity decreased from the start to the finish of the season. This is likely because there was a lot of excitement about a new season when it premiered, but that excitement died down before the season was complete. Season 9 is one exception: that season actually gained in popularity from start to finish, perhaps because people knew the show was coming to a close. Seasons 2 and 3, oddly, peaked in the middle of the season. These seasons took place during the height of the show’s popularity, which likely led to viewership being more uniform, as people who started the season were finishing it as well.
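The within-season trend can be summarised with one number per season: the least-squares slope of viewers against episode number. The sketch below is an assumption-laden illustration, not part of the original analysis, and the helper name within_season_slopes is hypothetical. A straight-line slope cannot capture a mid-season peak such as the one seen in seasons 2 and 3, so it complements the facet plot rather than replacing it.

```r
library(dplyr)

# Hypothetical helper: per-season slope of viewers on episode number.
# cov(x, y) / var(x) is the ordinary least-squares slope of y on x.
# A negative slope means viewership fell over the course of the season.
within_season_slopes <- function(df) {
  df %>%
    group_by(season) %>%
    summarise(
      slope = cov(episode, viewers) / var(episode),
      .groups = "drop"
    )
}

# Runs only if the data set is already loaded in the session.
if (exists("office_ratings")) {
  print(within_season_slopes(office_ratings))
}
```

The slope is measured in millions of viewers per episode, so a value of -0.1 would mean a loss of roughly 100,000 viewers with each successive episode.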

Conclusion

In summary, we can conclude that there is a moderate positive relationship between the viewership of an episode and its rating. The number of votes for an episode was also positively related to its viewership, but with some important high outliers for episodes that people were more emotionally attached to. Both viewership and ratings peaked in seasons 2 and 3 and then declined over time, with viewership falling far more sharply. We also saw viewership decline from start to finish within most seasons.