Introduction

This report seeks to answer the following question:

What is the relationship between the popularity of The Office and its ratings over time?

We will be using a data set called office_ratings, obtained from [https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-03-17/readme.md] and [https://en.wikipedia.org/wiki/List_of_The_Office_(American_TV_series)_episodes]. Together these provide per-episode viewership for the entire run of The Office as well as the overall rating of each episode. There are 7 variables for each episode; the relevant ones for this report are viewers (the number of viewers in millions on the original air date), imdb_rating (the average fan rating on IMDb.com from 1 to 10), total_votes (the number of ratings on IMDb.com), and air_date (the date the episode originally aired). The full data table can be seen below:

library(DT)
datatable(office_ratings, options = list(scrollX = TRUE))

Throughout, we will need the functionality of the tidyverse package, mainly to create visualizations.

library(tidyverse)

Relationship Between the Number of Viewers and the Popularity of an Episode

The problem involves several continuous variables: viewers, imdb_rating, and total_votes. To answer the overall question, we first need to see how these variables are distributed and how they relate to one another. It is reasonable to assume that when more viewers watch an episode, more people will vote on its rating. We might also expect that the more votes an episode receives, the higher its overall rating will be. We can test these hypotheses with scatter plots:

ggplot(data = office_ratings, mapping = aes(x = total_votes, y = viewers)) +
  geom_point() +
  labs(x = "Total Votes",
       y = "Viewers",
       title = "Viewers vs. Total Votes")

ggplot(data = office_ratings, mapping = aes(x = imdb_rating, y = total_votes)) +
  geom_point() +
  geom_smooth() +
  labs(x = "IMDb Rating",
       y = "Total Votes",
       title = "Number of Votes vs. Rating")

These scatter plots indicate that our hypotheses weren’t entirely correct. More viewers does not necessarily mean more votes. In fact, one of the episodes with the most votes had far fewer viewers than some of the other episodes. Most episodes drew a similar number of viewers, and some simply attracted more votes than others. One episode had a very large audience, yet that did not translate into a correspondingly large number of votes. We can hypothesize that many of the same people vote on each episode, which would explain why the vote counts are so similar for the majority of episodes.
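
To put rough numbers on what the scatter plots suggest, we could also compute correlation coefficients. The following is a minimal sketch, not part of the original analysis: the helper name vote_correlations is hypothetical, and it assumes the data has the columns viewers, total_votes, and imdb_rating described above.

```r
library(dplyr)

# Hypothetical helper (not from the original report): Pearson correlations
# between the pairs of variables shown in the two scatter plots.
# Assumes columns named viewers, total_votes, and imdb_rating.
vote_correlations <- function(ratings) {
  ratings %>%
    summarise(
      # Do episodes with more viewers also receive more votes?
      viewers_vs_votes = cor(viewers, total_votes, use = "complete.obs"),
      # Do episodes with more votes also receive higher ratings?
      votes_vs_rating = cor(total_votes, imdb_rating, use = "complete.obs")
    )
}
```

Calling vote_correlations(office_ratings) would return one row with two values between -1 and 1. Values near 0 would support the reading that the relationship is weak, while values near 1 would indicate a strong positive linear relationship.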

On the other hand, our hypothesis that more votes go with a higher rating seems closer to correct. Most episode ratings fall between 7.5 and 8.5, but some of the more popular episodes that led people to vote have ratings of 9.5 or higher. After all, if an episode has a greater impact, more people will want to vote, and they will generally rate it highly. A natural follow-up is to compare the overall rating with the number of viewers, which will show whether episodes watched by more people are also better liked.

ggplot(data = office_ratings, mapping = aes(x = imdb_rating, y = viewers)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(x = "IMDb Rating",
       y = "Viewers",
       title = "Rating of an Episode vs. Viewership")

The general trend shows a slight increase in rating as the number of viewers increases. We can assume that the people who rate episodes are mostly avid fans of The Office, so a similar number of people vote each time. There are a few exceptions where the trend line dips or levels off. This could be because people were getting bored with the show, so fewer people were watching it. We can test this hypothesis by coloring the points by season.

ggplot(data = office_ratings, mapping = aes(x = imdb_rating, y = viewers, color = season)) +
  geom_point() +
  labs(x = "IMDb Rating",
       y = "Viewers",
       color = "Season",
       title = "IMDb Rating vs. Viewership")

Our hypothesis could be correct: many of the episodes with the lowest viewership are in the final two seasons. This could be because people were getting bored with the show or because the beloved character Michael left towards the end of season 7. But does this mean that there is an increase in votes when more people watch an episode? We can answer this question by adding color to one of the previous graphs.

ggplot(data = office_ratings, mapping = aes(x = total_votes, y = viewers)) +
  geom_point(mapping = aes(color = season)) +
  geom_smooth(se = FALSE) +
  labs(x = "Total Votes",
       y = "Viewers",
       color = "Season",
       title = "Viewers vs. Total Votes")

Here we can see that the majority of episodes received roughly 1,000 to 3,000 votes. This suggests that many of the votes come from the same people, generally avid fans of the show or those who feel strongly about particular episodes. In most cases, when more people watch, there tend to be more votes. For example, the season 5 episode with over 22 million viewers was also one of the episodes with the most votes. This is not always true, however: the episode with the most votes was one of the episodes with the fewest viewers. This could be because it was the final episode of the show, so its viewers wanted to make sure their votes were counted.
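
The individual episodes singled out above can also be looked up directly rather than read off the plot. The sketch below is an illustration under assumptions: the helper top_voted_episodes and the derived column votes_per_million_viewers are hypothetical names, and the data is assumed to contain season, air_date, viewers (in millions), and total_votes.

```r
library(dplyr)

# Hypothetical helper (not from the original report): the n episodes with
# the most IMDb votes, with votes per million viewers added as a rough
# measure of how engaged each episode's audience was.
# Assumes columns named season, air_date, viewers, and total_votes.
top_voted_episodes <- function(ratings, n = 5) {
  ratings %>%
    mutate(votes_per_million_viewers = total_votes / viewers) %>%
    # slice_max() returns rows ordered from most to fewest votes;
    # ties at the cutoff can yield more than n rows.
    slice_max(total_votes, n = n) %>%
    select(season, air_date, viewers, total_votes, votes_per_million_viewers)
}
```

Calling top_voted_episodes(office_ratings) would list the most-voted episodes alongside their viewership, which makes it easy to check whether the most-voted episode really is among the least watched.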

Relationship Between Popularity and Appeal Over Time

Now that we have some sense of how the rating of the show relates to the number of people watching it, we can look at how both changed over time. Generally, the first few seasons of a show tend to be the most popular because the writers’ ideas are fresh and new. Can this be said of The Office? We can answer this question with a line graph:

ggplot(data = office_ratings) +
  geom_line(mapping = aes(x = air_date, y = viewers, color = season)) +
  labs(x = "Air Date",
       y = "Viewers",
       color = "Season",
       title = "Popularity Over Time")

This theory is mostly true, as we can see a slight declining trend in viewers over time. The number of viewers stays fairly consistent throughout the first 6 seasons before it really starts to decline. This could be because fans were upset that Michael left the show in season 7 or because the writers’ ideas weren’t as fresh as they were at the beginning of the show. One episode had far more viewers than any other: it aired right after the Super Bowl, which brought it a much larger audience. While the popularity of the show decreased slightly over time, it is possible that the appeal of the show increased over time. We can take a closer look at this conjecture through a line graph:

ggplot(data = office_ratings) +
  geom_line(mapping = aes(x = air_date, y = imdb_rating, color = season)) +
  labs(x = "Air Date",
       y = "IMDb Rating",
       color = "Season",
       title = "Appeal Over Time")

Looking at the line graph, we can conclude that our conjecture was incorrect. The appeal of the show doesn’t rise steadily as the series progresses; it varies from season to season. Towards the end of the show the appeal tends to decrease, which could be related to people getting bored with the show or being disappointed that Michael left.

Comparing the two previous graphs, we can see that the popularity and the appeal of the show don’t change in the same way across the seasons. This could be because of character developments or departures, as we saw with Michael in season 7. It could also be because of viewer fatigue, where viewers get bored with a show the longer it runs, as can be seen in season 8. Just because fewer people are watching an episode doesn’t mean that its quality and rating are poor, and vice versa.
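
One way to make this comparison concrete is to put average viewership and average rating side by side for each season. This is a minimal sketch rather than part of the original analysis: the helper name season_summary is hypothetical, and it assumes the columns season, viewers, and imdb_rating.

```r
library(dplyr)

# Hypothetical helper (not from the original report): per-season averages
# of popularity (viewers) and appeal (imdb_rating).
# Assumes columns named season, viewers, and imdb_rating.
season_summary <- function(ratings) {
  ratings %>%
    group_by(season) %>%
    summarise(
      episodes = n(),
      mean_viewers = mean(viewers, na.rm = TRUE),
      mean_rating = mean(imdb_rating, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    arrange(season)
}
```

Calling season_summary(office_ratings) would give one row per season. If popularity and appeal moved together, the two mean columns would rise and fall in the same seasons.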

While we have looked at this at the series level, we can also see how viewership is distributed within each individual season through a box plot.

ggplot(data = office_ratings) +
  geom_boxplot(mapping = aes(x = viewers, y = factor(season))) +
  labs(x = "Viewers",
       y = "Season",
       title = "Trend in Total Viewership")

Here we can see that the first season had a reasonable number of viewers considering the show had just gotten off the ground. By season 2 the show began to find its footing and attracted more viewers. This momentum continued to build until the 4th season, where there is a slight decrease. This could be due to the writers’ strike that took place in 2007-2008. After this the show maintained good viewership until season 7, when viewership began to decline. This is most likely because Michael left the show. Afterwards the show couldn’t really get going again and continued to lose viewers in each of the following seasons.
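
The season-to-season changes described here can be checked against the median of each box. The sketch below is an assumption-laden illustration: the helper name median_viewers_change is hypothetical, and it assumes the columns season and viewers.

```r
library(dplyr)

# Hypothetical helper (not from the original report): median viewers per
# season and the change from the previous season's median.
# Assumes columns named season and viewers.
median_viewers_change <- function(ratings) {
  ratings %>%
    group_by(season) %>%
    summarise(median_viewers = median(viewers, na.rm = TRUE), .groups = "drop") %>%
    arrange(season) %>%
    # lag() shifts the medians down one row, so the first season has no
    # previous value and its change is NA.
    mutate(change_from_previous = median_viewers - lag(median_viewers))
}
```

Calling median_viewers_change(office_ratings) would show a negative change_from_previous in any season where the typical episode lost viewers relative to the season before.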

Conclusion

In summary, the popularity and ratings of The Office are generally related, but not always directly: an increase in one doesn’t guarantee an increase in the other. Our data shows this through the total votes behind each rating. Only a fraction of the people watching the show take the time to vote and give it a rating, and that fan engagement can depend on how strongly an individual relates to a particular episode. This helps explain why the popularity and rating results are so varied.