library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DT)
We will use the “tidyverse” and “DT” libraries to explore and visualize the viewership of the show “The Office” over time. Some would say it’s one of the most popular shows ever, but did its viewership actually grow steadily over its run? To answer this question, we use a data set compiled from the following links: https://github.com/rfordatascience/tidytuesday/blob/main/data/2020/2020-03-17/readme.md and https://en.wikipedia.org/wiki/List_of_The_Office_(American_TV_series)_episodes.
datatable(office_ratings)
The variables used are as follows:
season (categorical) = season during which the episode aired
episode (categorical) = episode number within the season
title (categorical) = title of the episode
viewers (continuous) = number of viewers in millions on the air date
imdb_rating (continuous) = average fan rating on IMDb.com, from 1 to 10
total_votes (continuous) = number of ratings on IMDb.com
air_date (date) = date the episode originally aired
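The plots in this report treat season as a category and air_date as a date. The following is a minimal sketch of how those column types could be enforced, assuming the columns arrive as plain numbers and text. The data frame `toy_ratings` and its values are invented for illustration and only stand in for office_ratings.

```r
library(dplyr)

# Hypothetical stand-in for office_ratings; all values are made up.
toy_ratings <- tibble(
  season   = c(1, 1, 2),
  episode  = c(1, 2, 1),
  air_date = c("2005-01-01", "2005-01-08", "2005-09-01")
)

toy_ratings <- toy_ratings %>%
  mutate(
    season   = factor(season),    # categorical: one box or line per season
    air_date = as.Date(air_date)  # date: gives a proper time axis in ggplot
  )

str(toy_ratings)
```

With season stored as a factor, geom_boxplot() draws one box per season instead of a single box over a continuous axis.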
First, it is critical to know how our continuous variables are distributed in the dataset “office_ratings.”
ggplot(data = office_ratings) +
  geom_histogram(mapping = aes(x = viewers)) +
  labs(title = "Distribution of Viewership",
       x = "Total Number of Viewers (millions)",
       y = "Number of Episodes")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of viewers peaks at around 7.5 million, the most common viewership, and appears to be right-skewed. There is an extreme value close to 22.5 million viewers that will be investigated further.
ggplot(data = office_ratings) +
  geom_histogram(mapping = aes(x = imdb_rating)) +
  labs(title = "Distribution of IMDb Ratings",
       x = "Average IMDb Rating",
       y = "Number of Episodes")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of imdb_rating peaks at around 8.2 and is approximately normal, with no extreme values. This suggests that fans rating the show on IMDb.com judged its episodes fairly consistently.
ggplot(data = office_ratings) +
  geom_histogram(mapping = aes(x = total_votes)) +
  labs(title = "Distribution of Total Ratings on IMDb.com",
       x = "Total Votes",
       y = "Number of Episodes")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histogram of total_votes is right-skewed and peaks at around 1,500 votes, with outliers at close to 6,000 votes and at 8,000 votes.
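Each histogram above triggers the `stat_bin()` message about the default of 30 bins. As a sketch only, one way to choose the bin width explicitly is the Freedman-Diaconis rule, a common heuristic that the original analysis does not use. The `viewers` vector below is made up and merely stands in for the real column.

```r
library(ggplot2)

# Hypothetical stand-in for office_ratings$viewers (millions); values are made up.
viewers <- c(5.2, 6.1, 6.8, 7.2, 7.4, 7.5, 7.7, 8.0, 8.4, 9.1, 9.6, 22.5)

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
fd_binwidth <- 2 * IQR(viewers) / length(viewers)^(1 / 3)

# Passing binwidth explicitly silences the "using bins = 30" message
p <- ggplot(data.frame(viewers = viewers), aes(x = viewers)) +
  geom_histogram(binwidth = fd_binwidth) +
  labs(title = "Distribution of Viewership",
       x = "Total Number of Viewers (millions)",
       y = "Number of Episodes")

print(p)
```

A round value such as `binwidth = 1` (one bar per million viewers) may be easier to read. The rule is just a starting point.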
Now that we have the distributions of viewers, imdb_rating, and total_votes, we can look at how The Office's viewership and ratings changed over time.
First, we find that the viewership of The Office increased from its premiere to its peak around 2008, then decreased steadily until its finale in 2013, as shown by the code and visualization below.
ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = air_date, y = viewers)) +
  labs(title = "The Office's Viewership over Time",
       x = "Episode Air Date",
       y = "Viewers (Millions)")
ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = air_date, y = imdb_rating)) +
  labs(title = "The Office's Ratings Over Time",
       x = "Episode Air Date",
       y = "Ratings on IMDb.com")
There is a difference between the show’s popularity (viewership) and its appeal (IMDb rating), although the two scatter plots above share some features. Ratings begin a slight decrease after 2007. Viewership, already falling since its 2008 peak, never recovered after 2011, possibly because of Michael Scott’s (Steve Carell) departure from The Office.
One similarity is the spike in both ratings and viewership in 2009, due in part to the episode “Stress Relief” airing right after the Super Bowl, which drew a large audience and therefore many ratings, most of them high. However, while one might expect that the more people watch an episode, the better its rating, the visualization below suggests that this does not hold in general.
# aes() set in ggplot() is inherited by both layers, so geom_smooth() needs no mapping
ggplot(data = office_ratings, mapping = aes(x = viewers, y = imdb_rating)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(title = "Relationship between Viewers and IMDb Ratings",
       x = "Viewers (Millions)",
       y = "Average IMDb Rating")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
While the smoothed (loess) trend line shows an increase, the scatterplot shows an outlier at around 22.5 million viewers, which is an exception. If we did not include that Super Bowl episode, there would not be much of an association between an increase in viewership and an increase in ratings. As the scatter plot shows, an episode with around 6 million viewers received one of the highest ratings, and there is a large cluster around 7.5 million viewers that spans the full range of ratings.
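The effect of a single extreme episode on the association can be checked numerically. The sketch below compares the Pearson correlation with and without the most-watched episode. The data frame `toy` is made up for illustration and is not the real office_ratings data, so its numbers say nothing about the show itself.

```r
# Hypothetical toy data: made-up values, NOT the real office_ratings data.
toy <- data.frame(
  viewers     = c(6.0, 7.1, 7.4, 7.5, 7.6, 8.2, 8.9, 22.5),
  imdb_rating = c(9.4, 8.0, 8.6, 7.9, 8.3, 8.1, 8.4, 9.7)
)

# Pearson correlation using every episode
r_all <- cor(toy$viewers, toy$imdb_rating)

# Same correlation after dropping the single most-watched episode
toy_trimmed <- toy[toy$viewers < max(toy$viewers), ]
r_trimmed <- cor(toy_trimmed$viewers, toy_trimmed$imdb_rating)

# Spearman's rank correlation is less sensitive to one extreme value
rho_all <- cor(toy$viewers, toy$imdb_rating, method = "spearman")

print(c(all = r_all, trimmed = r_trimmed, spearman = rho_all))
```

Applied to office_ratings, the same comparison would show how much of the upward trend depends on the one episode that aired after the Super Bowl.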
However, it may be that some episodes simply received few ratings. To check, we compare each episode's viewership with its number of votes.
# labs() needs a lowercase `title`; `Title` is not a recognized label and is ignored
ggplot(data = office_ratings, mapping = aes(x = viewers, y = total_votes)) +
  geom_point() +
  geom_smooth() +
  labs(title = "Total Number of Ratings compared to Viewership",
       x = "Total Number of Viewers (Millions)",
       y = "Total Number of Votes")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
As the visualization shows, there is a slight positive association between an episode's viewership and its total number of ratings. There are two major exceptions: the episode with around 22.5 million viewers received about 6,000 ratings, while an episode with about 6 million viewers received about 8,000 ratings.
As we saw above, popularity and appeal both seem to change slightly over time. Grouping the episodes by season can help narrow down reasons for the dips and peaks.
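To see which episodes sit behind such exceptions, the rows with the largest viewers and total_votes values can be pulled out directly. This is a sketch on made-up data. The titles "Episode A" through "Episode D" are placeholders, not real episode names.

```r
library(dplyr)

# Hypothetical stand-in for office_ratings; titles and values are made up.
toy_ratings <- tibble(
  title       = c("Episode A", "Episode B", "Episode C", "Episode D"),
  viewers     = c(7.4, 22.5, 6.0, 8.1),
  total_votes = c(1500, 6000, 8000, 1700)
)

# Episode with the largest audience, and episode with the most IMDb votes
most_watched <- toy_ratings %>% slice_max(viewers, n = 1)
most_rated   <- toy_ratings %>% slice_max(total_votes, n = 1)

# Votes per million viewers highlights episodes rated far more than watched
votes_rate <- toy_ratings %>%
  mutate(votes_per_million = total_votes / viewers) %>%
  arrange(desc(votes_per_million))

print(most_watched)
print(most_rated)
print(votes_rate)
```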
ggplot(data = office_ratings) +
geom_boxplot(mapping = aes(x = season, y = imdb_rating)) +
labs(title = "Relationship Between IMDb Rating and Season",
x = "Season",
y = "IMDb Rating")
In this visualization, we see that the ratings fluctuated over time. There were peaks in seasons 3 and 4, and the largest peak in ratings came in season 7, followed by a dramatic drop. Michael Scott (Steve Carell) left The Office after season 7, which likely contributed to the dramatic loss in ratings.
Now let’s look at the popularity per season.
ggplot(data = office_ratings, mapping = aes(x = episode, y = viewers)) +
  geom_smooth(aes(color = season), se = FALSE) +
  labs(title = "Relationship between Viewership and Season Progression",
       x = "Episode Number",
       y = "Number of Viewers in Millions",
       color = "Season")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
As the visualization shows, the popularity of The Office peaked in the middle seasons, between seasons 2 and 5, with recurring dips along the way. After season 7, viewership dipped greatly following the departure of Michael Scott.
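The season-level patterns read off the plots can also be put into a small summary table. This is a sketch under the assumption that season is stored as a factor. The data frame `toy_ratings` is made up and only mimics the shape of office_ratings.

```r
library(dplyr)

# Hypothetical stand-in for office_ratings; all values are made up.
toy_ratings <- tibble(
  season      = factor(c(1, 1, 2, 2, 3, 3)),
  viewers     = c(6.0, 5.8, 8.1, 8.5, 7.2, 7.0),
  imdb_rating = c(7.8, 8.0, 8.4, 8.6, 8.1, 8.3)
)

# One row per season: episode count, typical rating, typical audience
season_summary <- toy_ratings %>%
  group_by(season) %>%
  summarise(
    episodes      = n(),
    median_rating = median(imdb_rating),
    mean_viewers  = mean(viewers),
    .groups = "drop"
  )

print(season_summary)
```

The median is used for ratings because it matches the center line that geom_boxplot() draws for each season.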
We cannot conclude that higher viewership goes with higher ratings, but the plots do suggest that more viewers tend to bring a larger number of ratings. There are relationships between the continuous variables (IMDb rating, original air date viewership, and total votes), some stronger than others, but we cannot call any of them a clear correlation, as many of the plots are affected by the extreme value of the episode that aired after the Super Bowl.
What the plots do show is that viewership decreased from season 6 onward as the show began to decline, losing Steve Carell as Michael Scott after season 7 and hitting its lowest viewership in seasons 8 and 9. While The Office lost a good chunk of its viewership after season 7, its ratings and viewership stayed fairly steady for most of its run. It remains a popular comedy today that many people miss dearly.