In this report, I clarify and analyze this data by looking at several key relationships. The report analyzes the distributions of the continuous variables, examines the relationships between several of the key variables, isolates exceptions to the generalizations made about those relationships, and analyzes trends in individual seasons.
This data comes from the office_ratings data set, which describes
6 variables for each episode. The ones that are important for this
report are viewers, the number of viewers in millions on
the original air date; imdb_rating, the average fan rating
on IMDb.com on a scale of 1-10; and total_votes, the number of
ratings left by viewers on IMDb.com. The season and air_date variables
are also used to group and order the episodes.
In this report, I use the tidyverse to create visualizations.
library(tidyverse)
First, I am going to look at the distribution of each continuous variable.
ggplot(data = office_ratings) +
  geom_boxplot(mapping = aes(x = viewers)) +
  labs(x = "Viewers (Millions)",
       title = "Distribution of the Viewers Variable") +
  coord_flip()
In this box plot, we can see that for the viewers
variable, most episodes fall at roughly 6-8 million
viewers. A single episode drew about double the viewership of the next
highest episode, which strongly skews the distribution. This happened
because the episode aired on NBC directly after the broadcast of
Super Bowl XLIII, which greatly inflated the number of
viewers.
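To list the episodes that the box plot draws as outlying points, a small helper along these lines could be used. It is only a sketch: flag_boxplot_outliers is a hypothetical name, and it assumes the usual rule of flagging values more than 1.5 * IQR beyond the quartiles. It uses base R only, so it does not depend on any extra packages.

```r
# Hypothetical helper (not part of the original analysis): returns the rows
# whose value in `column` lies more than 1.5 * IQR beyond the quartiles.
flag_boxplot_outliers <- function(data, column) {
  values <- data[[column]]
  quartiles <- quantile(values, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- quartiles[[2]] - quartiles[[1]]
  lower <- quartiles[[1]] - 1.5 * iqr
  upper <- quartiles[[2]] + 1.5 * iqr
  # Missing values are never flagged as outliers
  is_outlier <- !is.na(values) & (values < lower | values > upper)
  data[is_outlier, , drop = FALSE]
}

# Example call, assuming office_ratings is already loaded:
# flag_boxplot_outliers(office_ratings, "viewers")
```

The same call with "imdb_rating" or "total_votes" would list the outlying points for the other two box plots.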
ggplot(data = office_ratings) +
  geom_boxplot(mapping = aes(x = imdb_rating)) +
  labs(x = "Rating on IMDb.com",
       title = "Distribution of the IMDb_Rating Variable") +
  coord_flip()
In this distribution, most points lie between about 7.9 and 8.6. There are a few exceptions below 7 and above 9.5, but no value is extremely far from the rest of the data.
ggplot(data = office_ratings) +
  geom_boxplot(mapping = aes(x = total_votes)) +
  labs(x = "Total Votes on IMDb.com",
       title = "Distribution of the Total Votes Variable") +
  coord_flip()
In this distribution, the vast majority of episodes have close to 2000 votes, with several episodes around 4000 and 6000 and one at about 8000. I would guess that the episodes with more votes than usual were season finales or fan-favorite episodes. The one that got about 8000 was the episode that aired after the Super Bowl.
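The guesses above about which episodes drew the most votes can be checked by sorting the data. The following is a minimal sketch in base R; top_by is a hypothetical helper name, and the result shows whichever identifying columns the data set happens to contain.

```r
# Hypothetical helper: returns the n rows with the largest values in `column`.
# Rows with a missing value in `column` are sorted to the end.
top_by <- function(data, column, n = 5) {
  ranked <- data[order(data[[column]], decreasing = TRUE), , drop = FALSE]
  head(ranked, n)
}

# Example call, assuming office_ratings is already loaded:
# top_by(office_ratings, "total_votes", n = 5)
```

Looking at the season and air_date values of the returned rows should make it clear whether the most-voted episodes really are finales or the post-Super Bowl episode.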
Next, this report looks at the relationships between the continuous variables. I would like to see whether it is true that the more people view an episode, the more it is liked.
ggplot(data = office_ratings, mapping = aes(x = viewers, y = imdb_rating)) +
  geom_smooth(se = FALSE) +
  geom_point() +
  labs(title = "Rating as a Function of Viewership",
       x = "Viewers (Millions)",
       y = "IMDb Rating")
As a general trend, it appears that the more people view an episode, the higher it is rated.
There is a good deal of variation in this data, however, which produces a fair number of exceptions.
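One way to put a number on how strong this trend is, given the variation, is a correlation coefficient. The sketch below is an illustration rather than part of the original analysis: relationship_strength is a hypothetical helper that reports both the Pearson correlation and the rank-based Spearman correlation, which is less sensitive to a single extreme episode.

```r
# Hypothetical helper: Pearson and Spearman correlations between two columns,
# using only the rows where both values are present.
relationship_strength <- function(data, x, y) {
  c(
    pearson = cor(data[[x]], data[[y]],
                  use = "complete.obs", method = "pearson"),
    spearman = cor(data[[x]], data[[y]],
                   use = "complete.obs", method = "spearman")
  )
}

# Example call, assuming office_ratings is already loaded:
# relationship_strength(office_ratings, "viewers", "imdb_rating")
```

Values near 1 would support a strong positive trend, while values closer to 0 would mean the exceptions dominate.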
ggplot(data = office_ratings, mapping = aes(x = viewers, y = imdb_rating)) +
  geom_smooth(se = FALSE) +
  # factor() treats season as a category, so each season gets its own color
  geom_point(mapping = aes(color = factor(season))) +
  labs(title = "Rating as a Function of Viewership With Respect to Season",
       x = "Viewers (Millions)",
       y = "IMDb Rating",
       color = "Season")
In this visualization, I added season as a color to make outlier episodes easier to place. Several notable points above the curve, which are well liked despite lower viewership, are more than likely fan-favorite episodes and season finales. The points below the curve, the episodes that were highly viewed but rated low, are more than likely highly anticipated episodes that did not live up to expectations. Also, many of the points below the curve come from the later seasons, after Steve Carell left.
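The points far above or below the curve can also be identified by computing each episode's distance from a fitted curve. The sketch below makes some assumptions: curve_residuals is a hypothetical helper, and it fits a loess curve with a span of 0.75, which I believe matches what geom_smooth() uses by default for a data set of this size, so the fit may differ slightly from the plotted curve.

```r
# Hypothetical helper: fits a loess curve of y on x, then returns the rows
# ordered by how far each point lies from the curve (largest distance first).
curve_residuals <- function(data, x, y, span = 0.75) {
  fit <- loess(reformulate(x, response = y), data = data, span = span)
  data$fitted <- predict(fit, newdata = data)
  # Positive residual: rated higher than the curve predicts for its viewership
  data$residual <- data[[y]] - data$fitted
  data[order(abs(data$residual), decreasing = TRUE), , drop = FALSE]
}

# Example call, assuming office_ratings is already loaded:
# head(curve_residuals(office_ratings, "viewers", "imdb_rating"), 10)
```

Checking the season of the rows with large negative residuals is a direct way to test whether the low-rated exceptions really cluster in the later seasons.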
Next, I want to look at whether more people tend to leave a rating when more people watch an episode.
ggplot(data = office_ratings, mapping = aes(x = viewers, y = total_votes)) +
  geom_smooth(se = FALSE) +
  geom_point() +
  labs(title = "Total IMDb Ratings as a Function of Viewership",
       x = "Viewers (Millions)",
       y = "Total IMDb Ratings")
It would seem that when more people watch an episode, more people leave a rating. Again, this relationship is skewed by the season 5 episode that aired after the Super Bowl.
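To see how much a single highly viewed episode drives this relationship, the correlation can be computed with and without it. This is a rough sketch: cor_without_top is a hypothetical helper, and it simply drops the one row with the largest x value, which I am assuming here is the post-Super Bowl episode.

```r
# Hypothetical helper: Pearson correlation of x and y on all complete rows,
# and again after dropping the single row with the largest x value.
cor_without_top <- function(data, x, y) {
  complete <- data[complete.cases(data[c(x, y)]), , drop = FALSE]
  trimmed <- complete[-which.max(complete[[x]]), , drop = FALSE]
  c(
    all_episodes = cor(complete[[x]], complete[[y]]),
    without_top = cor(trimmed[[x]], trimmed[[y]])
  )
}

# Example call, assuming office_ratings is already loaded:
# cor_without_top(office_ratings, "viewers", "total_votes")
```

If the two numbers are close, the trend does not depend on that one episode; a large gap would mean the outlier is doing most of the work.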
In this section, the report seeks to answer questions regarding the popularity and appeal of The Office to determine whether the show holds up season to season.
Firstly, I would like to examine the relationship between popularity and time to see if The Office remained popular for its duration, or if there were major fluctuations.
ggplot(data = office_ratings) +
  geom_line(mapping = aes(x = air_date, y = viewers)) +
  labs(x = "Air Date",
       y = "Viewers (Millions)",
       title = "Popularity of The Office Over Time")
The Office experienced a severe drop-off in viewers after the pilot episode, followed by a steady increase in viewership up to the third season. After that, The Office declined steadily in popularity until the beginning of the eighth season. From the beginning of the eighth season, viewership dropped off significantly and recovered only for the ninth season finale.
The next important step in determining whether The Office holds up across the seasons is to examine the appeal of the show over time.
ggplot(data = office_ratings) +
  geom_line(mapping = aes(x = air_date, y = imdb_rating)) +
  labs(x = "Air Date",
       y = "Average Rating",
       title = "Appeal of The Office Over Time")
This line graph shows an increase in rating up to late 2006, then a generally steady trend until about 2011, with a sharp drop-off afterwards. There do seem to be some stronger episodes around 2007-2008 and again in early 2011. This shows that the middle of the show was received much better than both the early episodes and the last few years of the show, with some stronger episodes in certain seasons.
Finally, I am going to examine trends in viewership within and between individual seasons.
ggplot(data = office_ratings) +
  # factor() treats season as a category, so each season gets its own line
  geom_line(mapping = aes(x = air_date, y = viewers, color = factor(season))) +
  labs(x = "Air Date",
       y = "Viewers (Millions)",
       color = "Season",
       title = "Viewers of The Office Season by Season")
In general, seasons 2-5 appear to be the most highly viewed, followed by a steady decline over the next two. Seasons 8 and 9 are viewed much less than every earlier season except season 1.
Season 1 sees a steep drop after the pilot. Based on the popularity and appeal graphs, I think this is because season 1 was not that well received. Seasons 2 and 3 both start with lower viewership, peak in the middle, and drop off around the end, which is also visible in the popularity and appeal graphs. Season 4 has fairly steady viewership with a drop-off around the end of the season. Season 5’s viewership trends down, with one extreme outlier caused by the Super Bowl. Seasons 6 and 7 both have higher viewership toward the beginning and end of the season and a small peak in the middle, probably because of pre-season hype and finale excitement.
Seasons 8 and 9 both trend downward, as many seasons do. However, those seasons have significantly lower viewership, as I would imagine a large number of people stopped watching after Steve Carell left the show. The exception is the finale of the ninth season, which drew a lot of viewers. I think this is because it was the last episode of the show and because Steve Carell made an appearance.
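The season-level comparisons above are read off the plot, so a per-season table of averages may help to back them up. The following is a minimal sketch in base R; season_summary is a hypothetical helper, and it assumes the columns are named season, viewers, and imdb_rating as in the plots.

```r
# Hypothetical helper: mean viewers and mean IMDb rating for each season.
# Rows with a missing value in any of the three columns are dropped.
season_summary <- function(data) {
  aggregate(cbind(viewers, imdb_rating) ~ season, data = data, FUN = mean)
}

# Example call, assuming office_ratings is already loaded:
# season_summary(office_ratings)
```

Because a mean is pulled upward by extreme values, the season 5 average will include the effect of the Super Bowl episode; swapping FUN = mean for FUN = median would give a more robust comparison.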
To summarize, several general conclusions can be drawn about this data. First, the most obvious conclusion is that the last two seasons of The Office were received far more poorly than the others. Additionally, with a few exceptions, the more an episode is viewed, the higher it is rated and the more people leave a rating. Furthermore, popularity and appeal stayed relatively stable after season 1 and before season 8, with only a very slight downward trend. Finally, each season had its ups and downs, but toward the end of the show, more people seemed to stop watching.