Introduction

This report seeks to answer the following question:

Are there any relationships between the number of viewers, the number of votes, and the IMDb ratings of episodes of The Office?

I will be using a data set called office_ratings obtained from https://github.com/. It contains information on each episode, including the season number, episode number, episode title, number of viewers, IMDb rating, total number of votes, and air date. Of these variables, the relevant ones in this report are viewers (the number of people who watched each episode on its original air date, in millions), total_votes (the number of ratings on IMDb.com), and imdb_rating (the average fan rating on IMDb.com, from 1 to 10). The full data set can be viewed below:

Throughout, I will need the functionality of the tidyverse package, mainly to create visualizations.

library(tidyverse)

The Distribution of Viewers, Total Votes, and IMDb Ratings

Before I find the possible relationships between the desired variables, it is important to look at visual representations of each variable individually. I do so using these frequency polygons:

ggplot(data = office_ratings) +
  geom_freqpoly(mapping = aes(x = viewers)) +
  labs(x = "Viewers",
       y = "Count",
       title = "Distribution of Viewers")

The distribution of viewers is positively skewed (skewed to the right): most episodes drew a similar, moderate number of viewers, and a few episodes drew far more. The peak of the graph is around 7.5 million viewers, which tells me that more episodes drew roughly this many viewers on their original air dates than any other number.
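As a rough numerical check on the skew described above, the mean and median can be compared, since a long right tail usually pulls the mean above the median. The sketch below uses base R only. The name skew_check is a hypothetical helper, and toy_viewers holds made-up values for illustration, not values from office_ratings.

```r
# Hypothetical helper (not part of the original report): compares the mean
# and median of a numeric vector. In a right-skewed distribution the mean
# is typically pulled above the median by the long right tail.
skew_check <- function(x) {
  x <- x[!is.na(x)]  # drop missing values before summarising
  c(mean = mean(x),
    median = median(x),
    mean_minus_median = mean(x) - median(x))
}

# Made-up illustrative values, NOT taken from office_ratings
toy_viewers <- c(6, 7, 7, 7.5, 7.5, 8, 8, 9, 22)
skew_check(toy_viewers)
```

With the data loaded, the same call would be skew_check(office_ratings$viewers). A clearly positive difference would be consistent with the right skew seen in the plot.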

ggplot(data = office_ratings) +
  geom_freqpoly(mapping = aes(x = imdb_rating)) +
  labs(x = "IMDb Rating",
       y = "Count",
       title = "Distribution of IMDb Ratings")

This graph shows that the distribution of IMDb ratings is roughly normal, although it is not perfectly smooth: there are peaks around 8.2, 8.7, and 9.4. These peaks indicate that more episodes received ratings near these values, while few episodes received ratings as low as 6.9 or as high as 9.8.
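One way to test the visual impression of normality is a Shapiro-Wilk test, available in base R as shapiro.test(). The following is a minimal sketch and not part of the original analysis. The name normality_p is a hypothetical helper, and the example input is made up and deliberately skewed.

```r
# Hypothetical helper: returns the Shapiro-Wilk p-value for a numeric vector.
# A small p-value (for example, below 0.05) is evidence against normality.
normality_p <- function(x) {
  x <- x[!is.na(x)]  # shapiro.test() needs between 3 and 5000 non-missing values
  shapiro.test(x)$p.value
}

# Made-up, strongly right-skewed values (NOT from office_ratings)
toy_skewed <- (1:100)^4
normality_p(toy_skewed)
```

With the data loaded, the call would be normality_p(office_ratings$imdb_rating). A small p-value would suggest that the ratings are only approximately bell-shaped rather than strictly normal.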

ggplot(data = office_ratings) +
  geom_freqpoly(mapping = aes(x = total_votes)) +
  labs(x = "Total Votes",
       y = "Count",
       title = "Distribution of Total Votes")

The distribution of total votes is also positively skewed (skewed to the right), similar to the distribution of viewers. The peak of the graph is around 1,800, which indicates that more episodes received about 1,800 votes than any other number. Far fewer episodes received more than 4,000 votes.

Relationship Between Viewers and IMDb Rating

Now that I have looked at each variable separately, I can compare them to see whether there are any notable relationships between them. First, I will look at the relationship between viewers and IMDb rating to determine whether episodes that more people watched were also better liked. I will use the following scatter plot and trend curve to do so:

ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = viewers, y = imdb_rating)) +
  geom_smooth(mapping = aes(x = viewers, y = imdb_rating), se = FALSE) +
  labs(x = "Viewers",
       y = "IMDb Rating",
       title = "IMDb Rating vs Viewers")

Based on this graph, I have reason to believe that, in most cases, the number of viewers did not determine the IMDb rating. The trend curve shows a slight positive trend, so episodes with more viewers tended to have slightly better ratings; however, this pattern is weak, and a scatter plot alone cannot show that more viewers cause a higher rating. When I add color to represent the season of each episode, I can see episodes that might contradict my finding.

ggplot(data = office_ratings) +
  # factor() gives each season its own discrete color, even if season is stored as a number
  geom_point(mapping = aes(x = viewers, y = imdb_rating, color = factor(season))) +
  geom_smooth(mapping = aes(x = viewers, y = imdb_rating), se = FALSE) +
  labs(x = "Viewers",
       y = "IMDb Rating",
       title = "IMDb Rating vs Viewers",
       color = "Season")

This graph shows me that the points lying on or near the trend curve are the episodes that would lead me to believe there is a positive correlation between viewers and rating. A specific example that stands out is the episode from season 5 with over 20 million viewers and an IMDb rating over 9.
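The visual impression from the scatter plots in this section can be complemented by a correlation coefficient. The sketch below is not part of the original analysis. It uses base R's cor() to compute both Pearson (linear) and Spearman (rank-based) correlations. The name relationship_strength is a hypothetical helper, and the toy vectors are made up for illustration.

```r
# Hypothetical helper: Pearson and Spearman correlations between two
# numeric vectors, using only pairs where both values are present.
relationship_strength <- function(x, y) {
  c(
    pearson  = cor(x, y, use = "complete.obs", method = "pearson"),
    spearman = cor(x, y, use = "complete.obs", method = "spearman")
  )
}

# Made-up values for illustration only (NOT from office_ratings)
toy_viewers <- c(5.5, 6.0, 7.2, 8.1, 9.0, 22.9)
toy_rating  <- c(8.0, 7.8, 8.4, 8.2, 8.9, 9.6)
relationship_strength(toy_viewers, toy_rating)
```

With the data loaded, the call would be relationship_strength(office_ratings$viewers, office_ratings$imdb_rating). Values near 0 indicate a weak relationship and values near 1 a strong positive one. Spearman is less sensitive to a single extreme episode than Pearson.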

Relationship Between Viewers and Total Votes

Now I will look at the relationship between viewers and total votes for each episode. By doing so, I can determine whether episodes that more people watched also tended to receive more IMDb votes. I will illustrate this using the scatter plot and trend curve below:

ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = viewers, y = total_votes)) +
  geom_smooth(mapping = aes(x = viewers, y = total_votes), se=FALSE) +
  labs(x = "Viewers",
       y = "Total Votes",
       title = "Total Votes vs Viewers")

Based on this visualization, I have reason to believe that the number of viewers had some effect on the number of people who voted, though the effect is modest. The points show a clearer relationship between viewers and total votes than the one between viewers and IMDb rating in the previous section. There are a few points that are exceptions to my conclusion, though, as shown in the visualization below:

ggplot(data = office_ratings) +
  # factor() gives each season its own discrete color, even if season is stored as a number
  geom_point(mapping = aes(x = viewers, y = total_votes, color = factor(season))) +
  geom_smooth(mapping = aes(x = viewers, y = total_votes), se = FALSE) +
  labs(x = "Viewers",
       y = "Total Votes",
       title = "Total Votes vs Viewers",
       color = "Season")

Coloring each point by season allows me to see that some episodes in seasons 7, 8, and 9 contradict the idea that more viewers leads to more votes. Among those episodes, the number of votes stays fairly constant even as the view count rises.
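To see whether the link between viewers and votes really differs from season to season, the correlation can be computed within each season separately. This is a minimal base R sketch, not part of the original analysis. The name cor_by_season is a hypothetical helper, and toy is a made-up data frame that only mimics the column names season, viewers, and total_votes.

```r
# Hypothetical helper: correlation between viewers and total_votes within
# each season. Assumes a data frame with columns season, viewers, total_votes.
cor_by_season <- function(df) {
  sapply(split(df, df$season), function(d) {
    cor(d$viewers, d$total_votes, use = "complete.obs")
  })
}

# Made-up data for illustration only (NOT from office_ratings)
toy <- data.frame(
  season      = c(1, 1, 1, 2, 2, 2),
  viewers     = c(5, 6, 7, 5, 6, 7),
  total_votes = c(1500, 1800, 2100, 2000, 2001, 1999)
)
cor_by_season(toy)
```

With the data loaded, the call would be cor_by_season(office_ratings). Seasons with correlations near zero or below would match the pattern described for seasons 7, 8, and 9. Note that cor() returns NA, with a warning, for a season in which either variable is constant.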

Overall Popularity and Appeal of The Office

Since I have investigated the relationships between the viewers, IMDb rating, and total votes, I now would like to look at the overall popularity and appeal of the show. First, I will discuss the popularity of the show using the box plot below:

ggplot(data = office_ratings) +
  # factor() draws one box per season, even if season is stored as a number
  geom_boxplot(mapping = aes(x = factor(season), y = viewers)) +
  labs(x = "Season",
       y = "Viewers",
       title = "Viewers vs Season")

This visualization shows the view count for each season of The Office separately, which lets me see trends over time. From this graph, I can see that the show was most popular when seasons 2-5 aired. There is a small decline in viewers in seasons 6 and 7, followed by much larger decreases in seasons 8 and 9. Over time, The Office grew in popularity until its peak in seasons 3 and 4. From there, its popularity dwindled until its final season.

Similarly, I can examine the overall appeal of the show season by season using the box plot below:

ggplot(data = office_ratings) +
  # factor() draws one box per season, even if season is stored as a number
  geom_boxplot(mapping = aes(x = factor(season), y = imdb_rating)) +
  labs(x = "Season",
       y = "IMDb Rating",
       title = "IMDb Rating vs Season")

This visualization shows the IMDb rating of the episodes in each season over time. Similar to the popularity, the show’s appeal increases until its peak in seasons 3 and 4. Then, the appeal slowly decreases; however, there is an increase in appeal with season 7. The decline then continues with season 8 but slightly improves with season 9.
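The season-by-season patterns read from the two box plots can also be tabulated, since the center line of each box is that season's median. The following base R sketch is not part of the original analysis. The name season_medians is a hypothetical helper, and toy is a made-up data frame that only mimics the column names season, viewers, and imdb_rating.

```r
# Hypothetical helper: median viewers and median IMDb rating per season.
# Assumes a data frame with columns season, viewers, imdb_rating.
# Rows with a missing value in any of these columns are dropped by aggregate().
season_medians <- function(df) {
  aggregate(cbind(viewers, imdb_rating) ~ season, data = df, FUN = median)
}

# Made-up data for illustration only (NOT from office_ratings)
toy <- data.frame(
  season      = c(1, 1, 1, 2, 2, 2),
  viewers     = c(5, 6, 7, 8, 9, 10),
  imdb_rating = c(8.0, 8.2, 8.4, 8.5, 8.7, 9.0)
)
season_medians(toy)
```

With the data loaded, season_medians(office_ratings) would return one row per season, which makes it easier to compare where the viewer and rating medians rise and fall.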

When I compare the overall popularity of The Office over time with its overall appeal over time, I find that the two follow slightly different patterns. The popularity increases, peaks, and then decreases, whereas the appeal increases, peaks, decreases, and then increases a few more times before the end of the show. One possible reason for the increase in appeal while popularity was decreasing, in seasons 7 and 9 specifically, is that only the loyal, “die hard” fans were watching these seasons and giving ratings, while the casual watchers either didn’t watch or watched but didn’t vote. This could explain why the number of viewers is low, yet the ratings slightly increase. I personally watched The Office, and one of the main characters leaves the show at the end of season 7. This could also be a reason why there were fewer viewers, while the people who did continue to watch still enjoyed the episodes.

Conclusion

In summary, I can conclude that there are relationships between the viewers, IMDb rating, and total votes for each episode of The Office, but some of these relationships are weak. I found that the number of viewers does not necessarily determine the appeal of an episode, but there is some evidence that episodes receive more votes when more people watch them. Overall, The Office was a very popular show and received good ratings through its final season.