This report explores the show ‘The Office’ and draws conclusions from its episode ratings. The dataset we will use is ‘office_ratings’, retrieved from https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv. The variable viewers is the number of people who watched the episode, in millions. The variable imdb_rating is the episode's average rating on a scale of 1 to 10 from IMDB, a website that allows users to rate shows and movies. Lastly, total_votes is the number of people who have given the episode an IMDB rating on the website. The rest of the variables are fairly self-explanatory and will also be used in the analysis. The code below reads in the dataset and displays it as an interactive table:
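library(DT)  # interactive HTML tables
# Read the ratings data directly from GitHub
office_ratings <- readr::read_csv('https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv')
# Treat season as a categorical label rather than a number
office_ratings$season <- as.character(office_ratings$season)
# Parse air dates written as month/day/year
office_ratings$air_date <- as.Date(office_ratings$air_date, format = "%m/%d/%Y")
# The option name is case-sensitive: scrollX enables horizontal scrolling
datatable(office_ratings, options = list(scrollX = TRUE))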
The tidyverse library will be used for visualizations throughout the report.
library(tidyverse)
The first relationship we will investigate is the one between Viewers, IMDB Rating, and Total Votes. The hypothesis is that all three are positively correlated with one another: the more people watch an episode, the more likely it is that the episode is well liked, and the more people will go on to vote on it.
ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = viewers, y = imdb_rating, color = total_votes)) +
  labs(x = "Viewers (in millions)",
       y = "IMDB Rating",
       color = "Total Votes",
       title = "IMDB Rating vs. Viewers vs. Total Votes")
We see that our hypothesis mostly holds in the graph, but there are a few outliers that we should look at. The main outlier, ‘Stress Relief’, has far more viewers than the average episode. It is a two-part episode that first aired immediately after the Super Bowl, which most likely explains its unusually large audience. The other two outliers stand out for their high IMDB Ratings. The first is ‘Finale’, the finale of Season 9, which is likely rated so highly because it is the last episode of The Office. The second is ‘Goodbye, Michael’, which is likely rated so highly because it is the departure episode of Michael, one of the main characters.
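The visual impression of a positive relationship can also be checked numerically. The sketch below is one possible way to do that, assuming the column names viewers, imdb_rating, and total_votes used in the plots; the helper name rating_correlations is hypothetical and not part of the original analysis.

```r
# Pairwise Pearson correlations between the three numeric variables.
# `rating_correlations` is a hypothetical helper name introduced for illustration.
rating_correlations <- function(df) {
  vars <- c("viewers", "imdb_rating", "total_votes")
  # complete.obs drops any row with a missing value in one of the three columns
  round(cor(df[, vars], use = "complete.obs"), 2)
}

# Usage, once office_ratings has been read in:
# rating_correlations(office_ratings)
```

Values close to 1 would support the hypothesis, while values near 0 would suggest little linear relationship. Because Pearson correlation is sensitive to extreme points, outliers such as ‘Stress Relief’ may pull the coefficients noticeably.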
The next hypothesis is that the more viewers an episode has, the higher its IMDB rating will be. This is because the more people watch an episode, the more often it is recommended by other viewers or by algorithms.
ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = viewers, y = imdb_rating)) +
  labs(x = "Viewers (in millions)",
       y = "IMDB Rating",
       title = "IMDB Rating vs. Viewers")
Viewers and IMDB Rating are positively correlated, and the same outliers discussed above are still visible. Overall, this supports our hypothesis. However, to see why some episodes are rated more highly than others, we will add the episode number to the plot as color.
ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = viewers, y = imdb_rating, color = episode)) +
  labs(x = "Viewers (in millions)",
       y = "IMDB Rating",
       color = "Episode",
       title = "IMDB Rating vs. Viewers",
       subtitle = "Differentiated by episode number")
This suggests that two of the three main outliers are season finales. This makes sense, as season finales tend to be the most popular episodes, followed by season premieres.
ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = viewers, y = total_votes)) +
  labs(x = "Viewers (in millions)",
       y = "Total Votes",
       title = "Total Votes vs. Viewers")
This plot shows a positive correlation between Viewers and Total Votes. We see the same outliers: ‘Finale’, ‘Stress Relief’, and ‘Goodbye Michael’. These episodes have many votes as well as high ratings, even where their viewership is only average.
The next two variables we will examine are Viewers and Season. The hypothesis is that viewers will decrease as the seasons progress, because shows tend to lose popularity over time.
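# One box per season; a continuous color aesthetic such as episode
# is dropped by the boxplot statistic, so it is omitted here
ggplot(data = office_ratings) +
  geom_boxplot(mapping = aes(x = season, y = viewers)) +
  labs(x = "Season",
       y = "Viewers (in millions)",
       title = "Viewers vs. Season")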
We can see that our hypothesis is false. Viewership started lower and then increased, peaking around season four. This is likely because the show started slowly and gained a following over time, peaking in its middle seasons. It likely falls off afterwards because the novelty of the show wore off for some people, who got bored or started watching something else. The first episode of season one, ‘Pilot’, is an outlier. This is likely because a first episode is something people check out before deciding whether to keep watching. In the second season, episode 12, ‘The Injury’, is an outlier. The general consensus is that this episode was high quality and enjoyable, with nothing else particularly special about it.
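The outlier episodes named above can also be listed from the data rather than read off the plot. The following is a minimal sketch, assuming the usual boxplot convention of flagging values more than 1.5 times the interquartile range beyond the quartiles; the helper name season_outliers is hypothetical, and because geom_boxplot computes its hinges slightly differently from quantile(), borderline points may not match the plot exactly.

```r
library(dplyr)

# Flag episodes that lie more than 1.5 * IQR beyond the first or third
# quartile of their own season, the usual rule for boxplot outlier points.
# `season_outliers` is a hypothetical helper name introduced for illustration.
season_outliers <- function(df, column = "viewers") {
  df %>%
    group_by(season) %>%
    filter(
      .data[[column]] < quantile(.data[[column]], 0.25, na.rm = TRUE) -
        1.5 * IQR(.data[[column]], na.rm = TRUE) |
        .data[[column]] > quantile(.data[[column]], 0.75, na.rm = TRUE) +
          1.5 * IQR(.data[[column]], na.rm = TRUE)
    ) %>%
    ungroup()
}

# Usage, once office_ratings has been read in:
# season_outliers(office_ratings, "viewers")
```

Passing "imdb_rating" as the column would apply the same rule to the ratings instead of the viewership.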
Now we want to see whether the show's appeal increased or decreased over time. We will do this by plotting IMDB Rating against Season. The hypothesis is that IMDB Ratings will be lower as the show goes on, as people often like the first season of a show more than later ones. This can happen when viewers like a show in its first season, but the show later changes into something they dislike.
ggplot(data = office_ratings) +
  geom_boxplot(mapping = aes(x = season, y = imdb_rating)) +
  labs(x = "Season",
       y = "IMDB Rating",
       title = "IMDB Rating vs. Season")
We can see that this hypothesis is also false. As with popularity, the most appealing season is season four: the ratings build up to it and then slowly come back down. One difference, however, is that the appeal of seasons seven and nine is higher relative to their popularity. This is likely because the committed fanbase enjoyed these seasons more than the general public, who were starting to watch less overall anyway. Those who kept up with the show likely voted more often and gave higher ratings to these seasons.
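The comparison between popularity and appeal across seasons can be made side by side in a small table. The sketch below is one possible way to do this, assuming the columns season, viewers, and imdb_rating; the helper name season_summary is hypothetical, and the median is chosen here only because it matches the center line of the boxplots.

```r
library(dplyr)

# One row per season with the number of episodes, the median viewership,
# and the median IMDB rating.
# `season_summary` is a hypothetical helper name introduced for illustration.
season_summary <- function(df) {
  df %>%
    group_by(season) %>%
    summarise(
      episodes = n(),
      median_viewers = median(viewers, na.rm = TRUE),
      median_rating = median(imdb_rating, na.rm = TRUE),
      .groups = "drop"
    )
}

# Usage, once office_ratings has been read in:
# season_summary(office_ratings)
```

Seasons where median_rating is high but median_viewers is comparatively low would correspond to the pattern described above for seasons seven and nine.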
The last analysis we will make is between Viewers and the episodes in each season. We will look at this by comparing the viewers per episode. The first hypothesis is that the first and last episodes of each season will have the most viewers in that season. The second hypothesis is that the finale of the first season will be the most popular of these.
ggplot(data = office_ratings) +
  geom_boxplot(mapping = aes(x = season, y = viewers)) +
  labs(x = "Season",
       y = "Viewers (in millions)",
       title = "Viewers vs. Season")
We can see that the first hypothesis is correct, but the second one is not. The outlier discussed in the first analysis is still the most-viewed episode overall, but not for a reason that is relevant to this analysis. Otherwise, we see that ‘Pilot’ is indeed the most-viewed episode of season one. The other seasons mostly follow suit.
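Since a boxplot by season does not show which episode number each point belongs to, the premiere and finale hypothesis may be easier to check by looking at episodes directly. The sketch below is one possible approach, assuming the columns season, episode, and viewers; both helper names are hypothetical and not part of the original analysis.

```r
library(dplyr)
library(ggplot2)

# The single most-viewed episode of each season.
# `top_viewed_by_season` is a hypothetical helper name.
top_viewed_by_season <- function(df) {
  df %>%
    group_by(season) %>%
    slice_max(viewers, n = 1, with_ties = FALSE) %>%
    ungroup()
}

# Viewers by episode number, one panel per season.
# `plot_viewers_by_episode` is a hypothetical helper name.
plot_viewers_by_episode <- function(df) {
  ggplot(data = df) +
    geom_line(mapping = aes(x = episode, y = viewers)) +
    geom_point(mapping = aes(x = episode, y = viewers)) +
    facet_wrap(~ season) +
    labs(x = "Episode",
         y = "Viewers (in millions)",
         title = "Viewers vs. Episode",
         subtitle = "One panel per season")
}

# Usage, once office_ratings has been read in:
# top_viewed_by_season(office_ratings)
# plot_viewers_by_episode(office_ratings)
```

If the hypothesis holds, the episode column of the first result should mostly contain either 1 or the last episode number of the season, and each panel of the plot should peak at its left or right edge.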
In summary, we can conclude that Viewers, IMDB Rating, and Total Votes are all positively correlated with each other, meaning that as one increases, the others tend to increase as well. The main outlier that we found is easily explained with a bit of research, and otherwise there are only a few other outliers. We also saw that viewership did not simply decline as the show went on: it rose through the early seasons, peaked around season four, and fell afterwards. The show's appeal followed a similar rise and fall across the seasons, but not in exactly the same manner as the viewers. One source of error is the inability to see how the variables have changed from the show's release to now. Since we are only working with a snapshot in time, our ability to see changes that might have occurred over time is limited.