This report answers the following question:
How does the popularity of The Office change, episode by episode, over all nine seasons?
We will be using a data set titled office_ratings, which
was obtained from https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv.
This set contains the total viewers on the original air date of each
episode over the nine seasons. There are a total of 186 observations
of the 7 variables measured for each episode; the variables relevant
to this report are viewers (the number of viewers in
millions on the original air date), imdb_rating (average fan
rating on IMDb.com from 1 to 10), total_votes (number of
ratings on IMDb.com), season (season during which the
episode aired), and episode (episode number within a
season). The full data set can be viewed below:
Throughout, the tidyverse package is needed to create the visualizations.
library(tidyverse)
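The report does not show how the data frame office_ratings is created, so the following is only a minimal sketch of one way to load it from the URL given above. The helper read_office_ratings and the variable ratings_url are hypothetical names introduced here for illustration; the column check simply confirms that the five variables used in this report are present.

```r
library(tidyverse)

# URL quoted in the report
ratings_url <- "https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv"

# Hypothetical helper (not part of the original report): read the CSV and
# confirm that the five columns this report relies on are present.
read_office_ratings <- function(path) {
  ratings <- read_csv(path, show_col_types = FALSE)
  needed <- c("season", "episode", "viewers", "imdb_rating", "total_votes")
  missing_cols <- setdiff(needed, names(ratings))
  if (length(missing_cols) > 0) {
    stop("Missing expected columns: ", paste(missing_cols, collapse = ", "))
  }
  ratings
}

office_ratings <- read_office_ratings(ratings_url)
glimpse(office_ratings)  # one row per episode, one column per variable
```

If season is read in as a number, ggplot2 will map it to a continuous color gradient and will not split box plots by season on its own; converting it with mutate(season = factor(season)) is one way to treat it as a categorical variable.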
The problem being investigated deals with three main continuous
variables: viewers, imdb_rating, and
total_votes. In order to find the trend in popularity across
the episodes, each of the three variables must first be investigated
individually. It seems reasonable to assume that more
viewers on an episode’s air date indicate greater popularity, and from
that we could also assume that the more viewers an episode had, the more
ratings it received on IMDb.com. However, it cannot be assumed that the
more viewers an episode had, the higher the IMDb.com rating for that
episode will be, as not everyone who watched that episode may have
enjoyed it, nor can it be assumed that the rating rises as a season
progresses. It may also be reasonable to assume that the number of viewers per
episode would increase over the show’s run as it grew in
popularity. We can test these hypotheses by graphing the following
scatter plots:
ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = episode, y = viewers, color = season)) +
  labs(y = "Viewers on Original Air Date (Millions)",
       x = "Episode within a Season",
       title = "Viewers on Original Air Date vs. Episode",
       caption = "Data Obtained from office_ratings.csv")
This scatter plot contradicts the hypothesis that viewership would increase as the show progressed. Within each season, viewership peaked at the beginning of the season, seemingly with the first episode, and then decreased; in some seasons, such as Seasons 3 through 5, it peaked again in the middle of the season. There is one noticeable outlier in the middle of Season 5, where viewership spiked to over 20 million on the original air date. The opening peak can plausibly be attributed to viewers being excited that a new season was starting and wanting to see how it played out. Interest then appears to have faded as each season continued, with an occasional peak in viewers, perhaps when the climax of that season occurred.
ggplot(data = office_ratings) +
  geom_point(mapping = aes(y = imdb_rating, x = episode, color = season)) +
  labs(y = "Average IMDb.com Rating",
       x = "Episode within a Season",
       title = "Average IMDb.com Rating vs. Episode",
       caption = "Data Obtained from office_ratings.csv")
The scatter plot above supports the original hypothesis that the average IMDb.com rating follows no strict pattern or trend by season or by episode. The majority of the points lie between 7.5 and 9, and four points stand out as having a rating well below or above that range. The two low points occurred in Seasons 6 and 8, in the middle to end of those seasons; one possible explanation is that the writing of those episodes was less enjoyable than that of earlier ones because the writers were running out of ideas. The two high ratings come from Seasons 7 and 9, both towards the end of the season. While there is no obvious reason for the high rating of the Season 7 episode, the rating of the Season 9 episode makes more sense considering that it was the final episode of both the season and the show.
ggplot(data = office_ratings) +
  geom_point(mapping = aes(y = total_votes, x = episode, color = season)) +
  labs(y = "Number of Ratings on IMDb.com",
       x = "Episode within a Season",
       title = "Total IMDb.com Votes vs. Episode",
       caption = "Data Obtained from office_ratings.csv")
This scatter plot, viewed alongside the Viewers vs. Episode graph from earlier, supports the hypothesis that the number of viewers per episode is generally positively correlated with the number of ratings on IMDb.com. One noticeable similarity is that across Season 9 both the viewers and the number of ratings were low. However, in this plot there is one outlier within Season 9, which spiked to almost 8000 ratings on IMDb.com while the rest of the season barely went past 2000. A likely reason for this spike is that it was the season finale that concluded the show, and people who had followed the show wanted to see how it would end.
From the graphs created above, it appears that the more viewers an episode has, the higher its IMDb.com rating tends to be. To test this, another scatter plot will be made to show the relationship between the viewers and the average IMDb.com rating.
ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = imdb_rating, y = viewers)) +
  labs(y = "Viewers on Original Air Date (Millions)",
       x = "Average IMDb.com Rating",
       title = "Viewers vs. Average IMDb.com Rating",
       caption = "Data Obtained from office_ratings.csv")
According to the graph above, the hypothesis that the average IMDb.com rating increases with viewership is supported; to put it simply, the graph suggests that the more people who watch an episode, the more the episode is liked. There is one distinct outlier in this graph that has shown up in a previous graph: an episode that had over 20 million viewers on its air date and a rating of over 9.5. It still cannot be concluded from these graphs why that specific episode was so well liked compared to the others.
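One way to put a number on the relationships read off the scatter plots is a correlation coefficient. The sketch below is not part of the original analysis: popularity_correlations is a hypothetical helper that computes the Pearson correlation of viewers with imdb_rating and with total_votes, assuming office_ratings has already been loaded with those columns.

```r
library(tidyverse)

# Hypothetical helper (not from the original report): Pearson correlations
# between viewers and the two IMDb.com measures, ignoring incomplete rows.
popularity_correlations <- function(ratings) {
  ratings %>%
    summarise(
      viewers_vs_rating = cor(viewers, imdb_rating, use = "complete.obs"),
      viewers_vs_votes  = cor(viewers, total_votes, use = "complete.obs")
    )
}

# Only runs if the data frame from this report is available in the session.
if (exists("office_ratings")) {
  print(popularity_correlations(office_ratings))
}
```

Values near 1 or -1 would indicate a strong linear relationship and values near 0 a weak one. A single extreme episode can pull the coefficient noticeably, so it is worth reading the result alongside the plot rather than in place of it.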
Considering the outlier in the previous graph, we can plot a graph that colors the episode points by season to gain some insight into why it was rated so highly.
ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = imdb_rating, y = viewers, color = season)) +
  labs(y = "Viewers on Original Air Date (Millions)",
       x = "Average IMDb.com Rating",
       title = "Viewers vs. IMDb.com Rating categorized by Season",
       caption = "Data Obtained from office_ratings.csv")
Color coding the graph by season allows us to connect it to the previous graphs and notice that this specific episode in Season 5 is a consistent outlier. We can make inferences as to why it was so well liked. According to the Viewers vs. Episode graph, the episode fell in the middle of the season. The climax of a story is often placed near the middle, so this episode was most likely the climax of a plot line that had been building throughout the season, and viewers were likely anticipating the high point of the season.
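Rather than reading the outlier off the plot, the row itself can be pulled from the data. This is a small sketch under the assumption that office_ratings is loaded; top_viewed_episodes is a hypothetical helper name, and it simply returns the episodes with the largest viewers values.

```r
library(tidyverse)

# Hypothetical helper (not from the original report): return the n episodes
# with the most viewers, keeping only the columns discussed in this report.
top_viewed_episodes <- function(ratings, n = 1) {
  ratings %>%
    slice_max(viewers, n = n, with_ties = FALSE) %>%
    select(season, episode, viewers, imdb_rating, total_votes)
}

# Only runs if the data frame from this report is available in the session.
if (exists("office_ratings")) {
  print(top_viewed_episodes(office_ratings, n = 3))
}
```

Listing a few rows instead of one makes it easy to see how far the top episode sits above the next most-watched ones.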
In order to gain more insight into how well the IMDb.com ratings represent the total viewership of each episode, we will create another plot to show the relationship between the number of viewers and the number of ratings.
ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = viewers, y = total_votes, color = season)) +
  labs(x = "Viewers on Original Air Date (Millions)",
       y = "Number of Ratings on IMDb.com",
       title = "Viewers vs. Total IMDb.com Votes",
       caption = "Data Obtained from office_ratings.csv")
From the graph above, it can be concluded that the number of ratings is roughly proportional to the number of viewers, although for some points the ratio of ratings to viewers is higher or lower than usual. The two points that stand out the most are the point from Season 9, which has almost 8000 ratings but only around 6 million viewers, and the point from Season 5, which has 6000 ratings for over 20 million viewers. The Season 5 point and the hypotheses on why it is consistently an outlier have been discussed with previous graphs, but this graph adds a new layer: its ratio of ratings to viewers is lower than that of the majority of episodes. From this, we can assume that the average rating of the episode may not fully reflect what the audience felt about it; in other words, the rating may be skewed by the relative lack of ratings. For the Season 9 point, by contrast, the ratio is extremely high, meaning that a much larger share of the people who watched that episode rated it on IMDb.com. With that knowledge, we can infer that its average rating on IMDb.com more accurately reflects the opinion of its viewers.
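The ratio discussed above can be computed directly. Because viewers is recorded in millions, dividing total_votes by viewers gives the number of IMDb.com ratings per million viewers. The sketch below assumes office_ratings is loaded; add_votes_per_million and the column votes_per_million are hypothetical names introduced for illustration.

```r
library(tidyverse)

# Hypothetical helper (not from the original report): add the number of
# IMDb.com ratings per million original viewers and sort from highest to lowest.
add_votes_per_million <- function(ratings) {
  ratings %>%
    mutate(votes_per_million = total_votes / viewers) %>%
    arrange(desc(votes_per_million))
}

# Only runs if the data frame from this report is available in the session.
if (exists("office_ratings")) {
  ranked <- add_votes_per_million(office_ratings)
  print(head(ranked, 5))  # episodes rated by the largest share of viewers
  print(tail(ranked, 5))  # episodes rated by the smallest share of viewers
}
```

Using the approximate figures quoted above, almost 8000 ratings for about 6 million viewers is roughly 1300 ratings per million, while 6000 ratings for over 20 million viewers is under 300 per million.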
In order to compare the popularity and appeal of The Office across seasons, two box plots are made: one showing how many people rated the episodes in each season, and one showing the average ratings in each season.
ggplot(data = office_ratings) +
  geom_boxplot(mapping = aes(x = season, y = total_votes)) +
  labs(x = "Season",
       y = "Number of Ratings on IMDb.com",
       title = "Total Ratings on IMDb.com vs Season",
       caption = "Data Obtained from office_ratings.csv")
Measuring popularity by the median line of each box, the graph above conveys that the popularity of The Office was at its peak in Season 1 and then decreased steadily as the show continued. This is most likely because a show is at its most popular when it is new, and if it does not gain a new audience as it continues, its popularity drops.
ggplot(data = office_ratings) +
  geom_boxplot(mapping = aes(x = season, y = imdb_rating)) +
  labs(x = "Season",
       y = "Average IMDb.com Rating",
       title = "IMDb.com Rating vs Season",
       caption = "Data Obtained from office_ratings.csv")
Once again using the median line, this time to measure the appeal of the show, the graph demonstrates that the appeal of The Office follows no single consistent trend across seasons. Overall, Seasons 3 and 4 seem to have had the highest appeal, as they had the highest median ratings, and the appeal of Season 9 was lower than that of Season 1, which suggests that people enjoyed the earlier seasons more than the later ones. That may have come down to personal preference for the content or writing style of particular seasons, or it may have been that the writers were running out of ideas on how to continue the show, which resulted in the appeal dropping.
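The median lines read from the two box plots can also be tabulated, which makes the season-to-season comparison easier to check. This is a minimal sketch assuming office_ratings is loaded; season_medians is a hypothetical helper name.

```r
library(tidyverse)

# Hypothetical helper (not from the original report): per-season medians of
# the two quantities shown in the box plots, plus the episode count.
season_medians <- function(ratings) {
  ratings %>%
    group_by(season) %>%
    summarise(
      episodes      = n(),
      median_votes  = median(total_votes, na.rm = TRUE),
      median_rating = median(imdb_rating, na.rm = TRUE),
      .groups = "drop"
    )
}

# Only runs if the data frame from this report is available in the session.
if (exists("office_ratings")) {
  print(season_medians(office_ratings))
}
```

The episode count is included because seasons differ in length, and a median based on a short season is less stable than one based on a long season.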
Popularity and appeal are different, and the trends in the individual graphs show that they do not move in parallel throughout the series. The popularity of the series is measured, in this case, by the number of ratings on IMDb.com, which steadily decreases over the course of the series. What we can infer from this is that a certain number of people watched the first season, enjoyed it, and continued to watch the series, while a smaller number did not enjoy it and stopped watching after the season finale. This would repeat for each season, and without enough new people finding the show appealing enough to watch the new seasons, the popularity drops. The appeal of a series is based not on how many people are watching but on how well each season does content-wise. From that we can infer that popularity and appeal do not track each other. We can see that the popularity of the show never recovered after Season 1 ended, but the appeal of the series increased after Season 1 and did not begin to drop until after Season 4. So if the appeal of the show increased during Season 3, why didn’t the popularity spike for Season 4? My impression is that Seasons 3 and 4 appealed more to the longtime audience who had been watching the show from the beginning and were not meant to bring in a new audience. So, even though the appeal increased, the popularity continued to decrease.
ggplot(data = office_ratings) +
  geom_line(mapping = aes(x = episode, y = viewers)) +
  geom_point(mapping = aes(x = episode, y = viewers, color = season)) +
  facet_grid(rows = vars(season)) +
  labs(x = "Episode",
       y = "Viewers on Original Air Date (Millions)",
       title = "Viewers vs. Episode",
       caption = "Data Obtained from office_ratings.csv")
Viewing the graph above, we can see that every season started with a relatively high number of viewers and that viewership then gradually decreased throughout the season. We can also see that the opening viewership became lower with the beginning of each new season, which demonstrates the decrease in popularity. Within Season 5, we once again come across the spike in viewers for episode 13. Episodes 12 and 14 had roughly the same number of viewers as each other, and there was no gradual increase or decrease around episode 13. From the spike, we can assume that this episode was highly anticipated and potentially promoted towards the original audience to try to bring back some of the viewership from earlier in the season. If that was the case, it succeeded, but the effect did not last, as viewership dropped back down to where it was before. At the end of Season 9, there is a small gradual increase in viewership leading up to the finale of the season and of the entire series. We can infer that the increase came from early audience members who had perhaps been less involved in the show since a certain point but came back at the end to see how a show they had enjoyed earlier would finish.
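The within-season pattern described above (a strong opening, a later decline, and occasional mid-season spikes) can be summarized in one table per season. The sketch below assumes office_ratings is loaded; season_viewer_summary and its output column names are hypothetical and were chosen only for this illustration.

```r
library(tidyverse)

# Hypothetical helper (not from the original report): for each season, report
# the viewers of the first and last episodes and where the peak occurred.
season_viewer_summary <- function(ratings) {
  ratings %>%
    group_by(season) %>%
    summarise(
      first_episode_viewers = viewers[which.min(episode)],
      last_episode_viewers  = viewers[which.max(episode)],
      peak_episode          = episode[which.max(viewers)],
      peak_viewers          = max(viewers, na.rm = TRUE),
      .groups = "drop"
    )
}

# Only runs if the data frame from this report is available in the session.
if (exists("office_ratings")) {
  print(season_viewer_summary(office_ratings))
}
```

Comparing first_episode_viewers across rows shows whether each season opened lower than the one before, and peak_episode shows whether the peak fell on the opener or later in the season.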
From the data displayed and the relationships between variables discovered, we can conclude that the popularity of The Office over its run is directly related to the number of viewers on the original air date of each episode and the total number of ratings on IMDb.com per episode, but is not tied to the average rating on IMDb.com. The data support this claim, and the visualizations above demonstrate that as the series progressed, the viewership and number of ratings, and therefore popularity, decreased overall.