This report visualizes and analyzes viewership trends for the TV show The Office.
It examines the relationship between the number of viewers per episode and episode ratings, and it explores how the show’s overall popularity and appeal changed over time.
We will be using a data set called ‘office_ratings’ obtained and compiled from https://github.com/rfordatascience/tidytuesday/blob/main/data/2020/2020-03-17/readme.md. It contains all 186 episodes of the show that aired on TV. Each episode has 7 variables including: season (season during which the episode aired), episode (episode number within the season), title (title of the episode), viewers (number of viewers in millions on original air date), imdb_rating (average fan rating on IMDb.com from 1 to 10), total_votes (number of ratings on IMDb.com), and air_date (date episode originally aired). The full data set can be viewed below:
We will be using the tidyverse package to create and display the visuals.
library(tidyverse)
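The plots that follow assume a data frame named office_ratings is already in memory. Below is a minimal sketch of one way to load it. The URL is an assumption, reconstructed by adding "https://" to the address given in the plot captions; the file may have moved or changed. The helper name load_office_ratings is hypothetical.

```r
library(readr)

# Assumption: address taken from the plot captions, with "https://" prepended.
office_ratings_url <- "https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv"

# Hypothetical helper: read the ratings file and make sure air_date is a Date,
# which the date-based plots rely on. Assumes dates are stored as YYYY-MM-DD.
load_office_ratings <- function(path) {
  ratings <- read_csv(path)
  ratings$air_date <- as.Date(ratings$air_date)
  ratings
}
```

Calling `office_ratings <- load_office_ratings(office_ratings_url)` would then make the data available to the plotting code, provided the file is still hosted at that address.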
ggplot(data = office_ratings) +
  geom_histogram(mapping = aes(x = viewers), bins = 30) +
  labs(
    title = "Distribution of Viewership",
    x = "Viewers (in millions)",
    y = "Count",
    caption = "data obtained from raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv"
  )
ggplot(data = office_ratings) +
  geom_histogram(mapping = aes(x = imdb_rating), bins = 30) +
  labs(
    title = "Distribution of IMDb Ratings",
    x = "IMDb Rating",
    y = "Count",
    caption = "data obtained from raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv"
  )
ggplot(data = office_ratings) +
  geom_histogram(mapping = aes(x = total_votes), bins = 25) +
  labs(
    title = "Distribution of IMDb Total Votes",
    x = "Total Votes",
    y = "Count",
    caption = "data obtained from raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv"
  )
The distribution of viewership is roughly bell-shaped with a peak in the middle; however, there is one outlier far above the rest of the data points. The peak suggests a typical viewership of about 7.5 million viewers per episode. The distribution of IMDb ratings is also roughly bell-shaped and peaks around 8 to 8.5, meaning that The Office is well liked by the show’s viewers. Looking at the range of the ratings, there are very few low-rated episodes. Finally, the distribution of IMDb total votes is strongly skewed right, with the peak at around 1600 total votes. The outliers in the right tail are likely episodes that are more popular and favored by fans of The Office than the rest.
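The shapes read off the histograms can be cross-checked numerically. The sketch below is one possible way to do that, not part of the original analysis; summarise_distribution is a hypothetical helper name. It reports the mean, median, and range of a column, and a mean noticeably above the median would be consistent with the right skew described for total votes.

```r
library(dplyr)

# Hypothetical helper: mean, median, and range of one numeric column.
# A mean well above the median is consistent with a right-skewed distribution.
summarise_distribution <- function(data, column) {
  data %>%
    summarise(
      mean   = mean({{ column }}, na.rm = TRUE),
      median = median({{ column }}, na.rm = TRUE),
      min    = min({{ column }}, na.rm = TRUE),
      max    = max({{ column }}, na.rm = TRUE)
    )
}
```

For example, `summarise_distribution(office_ratings, total_votes)` would summarize the vote counts, assuming office_ratings has been loaded.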
ggplot(data = office_ratings, mapping = aes(x = imdb_rating, y = viewers)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Viewership vs. Rating",
    x = "IMDb Rating",
    y = "Viewership (in millions)",
    caption = "data obtained from raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv"
  )
This plot shows the viewership and IMDb rating of each episode, and we want to see whether there is a relationship between the two variables. Looking at the plot, there appears to be a weak positive relationship between viewership and IMDb ratings. The general trend is that episodes with more viewers tend to have better ratings, but this is not true for all episodes.
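One way to put a number on "weak positive" is the Pearson correlation coefficient. This is a sketch under the assumption that a linear measure is appropriate here; pearson_r is a hypothetical helper name, and the original report does not say a correlation was computed.

```r
# Hypothetical helper: Pearson correlation between two numeric vectors,
# using only the episodes where both values are present.
pearson_r <- function(x, y) {
  cor(x, y, use = "complete.obs", method = "pearson")
}
```

With the data loaded, `pearson_r(office_ratings$imdb_rating, office_ratings$viewers)` would return a value between -1 and 1; values close to 0 indicate a weak linear relationship and values close to 1 a strong positive one.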
ggplot(data = office_ratings, mapping = aes(x = imdb_rating, y = viewers, color = air_date)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Viewership vs. Rating",
    x = "IMDb Rating",
    y = "Viewership (in millions)",
    color = "Air Date",
    caption = "data obtained from raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv"
  )
Yes, there are some exceptions. As we can see from this plot, there are episodes that have high IMDb ratings despite low viewership and episodes that have low IMDb ratings despite high viewership. A few of the more recent episodes (lighter blue color) have low viewership but still have high ratings. There are also a few earlier episodes that have high viewership but lower ratings.
The problem we are investigating is whether there is a relationship between the number of viewers an episode gets and the number of IMDb rating votes it receives. It is reasonable to assume that an episode will get more rating votes the more viewers it has, but we won’t know for sure until we check. The following scatter plot will show whether there is a relationship between these two variables.
ggplot(data = office_ratings, mapping = aes(x = total_votes, y = viewers, color = air_date)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Viewership vs. IMDb Total Votes",
    x = "IMDb Total Votes",
    y = "Viewership (in millions)",
    color = "Air Date",
    caption = "data obtained from raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv"
  )
This plot reveals a moderately strong positive relationship between viewership and total votes for the episodes of The Office. Episodes with higher viewership typically have more IMDb rating votes. However, there are still some exceptions: a few episodes have a very high number of votes with lower viewership, and a few have a low number of votes with high viewership.
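The red line in the plot comes from geom_smooth(method = "lm"), which fits an ordinary least-squares line. As a sketch, the same line can be fitted directly with lm() to read off its slope and R-squared; fit_viewers_on_votes is a hypothetical helper name and this step is not in the original report.

```r
# Hypothetical helper: fit viewers as a linear function of total_votes and
# return the slope (millions of viewers per vote) and the R-squared of the fit.
fit_viewers_on_votes <- function(data) {
  model <- lm(viewers ~ total_votes, data = data)
  c(
    slope     = unname(coef(model)["total_votes"]),
    r_squared = summary(model)$r.squared
  )
}
```

Running `fit_viewers_on_votes(office_ratings)` would give a positive slope if the relationship is as the plot suggests, and an R-squared closer to 1 would indicate a tighter fit around the line.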
This problem investigates the show’s popularity over time and what changes the visuals reveal. Since the show ran for 8 consecutive years, viewership may have increased or decreased over that period, for reasons we may not be able to identify from the data alone. An increase in viewership could be due to a new audience being introduced to the show, and a decrease could be due to the existing audience losing interest. We can examine the trend with a line plot:
ggplot(data = office_ratings, mapping = aes(x = air_date, y = viewers)) +
  geom_line() +
  labs(
    title = "The Office's Popularity (Viewership) Over Time",
    x = "Episode Air Date",
    y = "Viewership (in millions)",
    caption = "data obtained from raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv"
  )
This plot shows a slow decline in the show’s popularity across the seasons: viewership grew in the first and second seasons but started declining towards the last few. Viewership spikes every now and then, but the overall trend is a slow decrease.
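The episode-level line is noisy, so the overall trend can be easier to see as one average per season. The following is a sketch of that aggregation, with mean_viewers_by_season as a hypothetical helper name; it is offered as a cross-check and is not part of the original analysis.

```r
library(dplyr)

# Hypothetical helper: average viewership (in millions) and episode count
# for each season, ordered by season number.
mean_viewers_by_season <- function(data) {
  data %>%
    group_by(season) %>%
    summarise(
      mean_viewers = mean(viewers, na.rm = TRUE),
      episodes     = n()
    ) %>%
    arrange(season)
}
```

If the decline described above holds, `mean_viewers_by_season(office_ratings)` should show the seasonal averages falling in the later seasons.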
Building on the previous question, this problem focuses on the show’s appeal and how it changed over time. Although viewership showed a declining trend, the same may not be true of the show’s appeal. To analyze this, we will look at the IMDb ratings and how they changed over the show’s run, using the line plot below:
ggplot(data = office_ratings, mapping = aes(x = air_date, y = imdb_rating)) +
  geom_line() +
  labs(
    title = "The Office's Appeal (IMDb Rating) Over Time",
    x = "Episode Air Date",
    y = "IMDb Rating",
    caption = "data obtained from raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv"
  )
Looking at the plot, we can see that episode ratings stayed consistently high throughout the show’s run, with a few exceptions falling just below 7. Even those ratings are fairly good, considering that viewership dropped after the early seasons. This suggests that the people who continued to watch, or who started watching later, enjoyed the show.
Comparing the previous two plots, the show’s popularity showed a declining trend, whereas its appeal stayed consistent over time. One possible explanation is that loyal fans stayed with the show, kept watching it live, and continued to give the episodes good ratings. The decline in popularity is likely due to more casual viewers not watching the episodes live as much as they did when the show first started airing.
This problem investigates the trend in viewership across the episodes within each season of the show. Given the decline in overall viewership over the show’s run, how does viewership behave within each individual season? We will explore this question with the line plot below:
ggplot(data = office_ratings, mapping = aes(x = episode, y = viewers, group = season)) +
  geom_line() +
  geom_point() +
  facet_grid(cols = vars(season)) +
  labs(
    title = "Viewership Trends Within Each Season",
    x = "Episode Number",
    y = "Viewership (in millions)",
    caption = "data obtained from raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv"
  )
Observing the individual seasons, we can see that viewership is usually highest within the first 3 episodes, and it sometimes rises again mid-season. The majority of the seasons show a sharp drop in viewership after the first 3 episodes or towards the middle of the season. An exception is the last season, where the highest viewership came in the last 3 episodes. These rises and drops in viewership could be explained by various factors, including major plot developments, cast changes, or the ending of the show.
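The claim about where each season peaks can be checked by picking out the most-watched episode per season. The following is a minimal sketch, assuming the column names listed in the data description; peak_episode_by_season is a hypothetical helper name.

```r
library(dplyr)

# Hypothetical helper: the single most-watched episode of each season.
# with_ties = FALSE keeps one row per season even if two episodes tie.
peak_episode_by_season <- function(data) {
  data %>%
    group_by(season) %>%
    slice_max(viewers, n = 1, with_ties = FALSE) %>%
    ungroup() %>%
    select(season, episode, title, viewers)
}
```

In the output of `peak_episode_by_season(office_ratings)`, a low value in the episode column would mean the season peaked early, while a value near the season’s episode count would mean it peaked at the end.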
In summary, we can conclude that both IMDb ratings and IMDb votes are positively related to the viewership of The Office, weakly for ratings and moderately strongly for votes. Though the show’s popularity declined over time, this did not affect the show’s appeal. These trends can be explained by the overall audience watching fewer episodes live, while those who kept watching consistently enjoyed the episodes and continued to give good ratings. …