library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DT)

We will use the tidyverse and DT libraries to visualize how viewership of the show “The Office” changed over time. Some would say it is one of the most popular shows ever, but did its viewership actually increase over time? To answer this question, we use a data set compiled from the following sources: https://github.com/rfordatascience/tidytuesday/blob/main/data/2020/2020-03-17/readme.md and https://en.wikipedia.org/wiki/List_of_The_Office_(American_TV_series)_episodes.

datatable(office_ratings)

The variables that will be used are as follows:
season (categorical) = season during which the episode aired
episode (categorical) = episode number within the season
title (categorical) = title of the episode
viewers (continuous) = number of viewers in millions on the air date
imdb_rating (continuous) = average fan rating on IMDb.com, from 1 to 10
total_votes (continuous) = number of ratings on IMDb.com
air_date (date) = date the episode originally aired
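As a rough sketch of the structure described above, the tibble below mirrors the seven variables with column types that would suit the plots in this report. The object name `office_ratings_demo` and every value in it are made up for illustration and are not rows from the real data set. Storing season as a factor and episode as a number is an assumption, based on season being used as a discrete grouping and episode as a continuous x-axis later on.

```r
library(tibble)

# Hypothetical rows for illustration only; NOT values from the real data set.
office_ratings_demo <- tibble(
  season      = factor(c(1, 1, 2)),        # discrete, so boxplots split by season
  episode     = c(1L, 2L, 1L),             # numeric, so it can sit on a continuous axis
  title       = c("Episode A", "Episode B", "Episode C"),
  viewers     = c(10.0, 6.5, 8.1),         # millions of viewers
  imdb_rating = c(7.5, 8.2, 8.4),          # 1 to 10 scale
  total_votes = c(3000L, 2500L, 2800L),
  air_date    = as.Date(c("2005-01-01", "2005-01-08", "2005-09-01"))
)

office_ratings_demo
```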

Continuous Variables

First, we need to know how the continuous variables in the dataset “office_ratings” are distributed.

Viewers:

ggplot(data = office_ratings) +
  geom_histogram(mapping = aes(x = viewers)) +
  labs(title = "Distribution of Viewership",
       x = "Total Number of Viewers (millions)",
       y = "Number of Episodes")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution of viewers peaks at around 7.5 million, the most common viewership level, and appears to be right-skewed. There is one extreme value close to 22.5 million viewers that will be investigated further.
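One possible way to follow up on both the binning message and the extreme value is sketched below. The data frame `demo`, its titles, and its viewership numbers are invented for illustration. The 1-million binwidth is an assumed choice, not a recommendation derived from the real data.

```r
library(dplyr)
library(ggplot2)

# Made-up viewership values (millions), for illustration only.
demo <- tibble(
  title   = c("Ep A", "Ep B", "Ep C", "Ep D", "Ep E"),
  viewers = c(7.2, 7.8, 8.4, 6.9, 22.4)
)

# An explicit binwidth avoids the default of 30 bins: each bar spans 1 million viewers.
p <- ggplot(demo, aes(x = viewers)) +
  geom_histogram(binwidth = 1)

# Pull out the single most-watched episode to see which one is the extreme value.
top_episode <- slice_max(demo, viewers, n = 1)
top_episode
```

Running the same `slice_max()` call on the full data set would name the episode behind the extreme bar.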

IMDb_rating:

ggplot(data = office_ratings) +
  geom_histogram(mapping = aes(x = imdb_rating)) +
  labs(title = "Distribution of IMDb Ratings",
       x = "Average IMDb Rating",
       y = "Number of Episodes")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution of imdb_rating peaks at around 8.2 and is approximately normal, with no extreme values. This suggests that fans on IMDb.com rated the show's episodes fairly consistently.

Total_votes:

ggplot(data = office_ratings) +
  geom_histogram(mapping = aes(x = total_votes)) +
  labs(title = "Distribution of Total Ratings on IMDb.com",
       x = "Total Votes",
       y = "Number of Episodes")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram of total_votes is right-skewed and peaks at around 1,500 votes, with outliers at 8,000 votes and close to 6,000 votes.
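The outliers above were read off the histogram. One common convention for flagging them numerically is the 1.5 × IQR rule, sketched below. This is a swapped-in technique and not necessarily how the values were identified in this report, and the vote counts are made up for illustration.

```r
# Made-up vote counts, for illustration only.
total_votes <- c(1400, 1500, 1550, 1600, 1700, 1800, 6000, 8000)

# Upper fence: third quartile plus 1.5 times the interquartile range.
q <- quantile(total_votes, c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * IQR(total_votes)

# Any value above the fence is flagged as a potential outlier.
outliers <- total_votes[total_votes > upper_fence]
outliers
```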

Now that we have the distributions of viewers, imdb_rating, and total_votes, we can look at how the ratings and viewership of The Office trended over time.

Ratings over Time

ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = air_date, y = imdb_rating)) +
  labs(title = "The Office's Ratings Over Time",
       x = "Episode Air Date",
       y = "Ratings on IMDb.com")

Popularity and Appeal Season to Season

As we saw above, popularity and appeal both seem to change slightly over time, but grouping the episodes by season lets us narrow down some reasons for the dips and peaks.

ggplot(data = office_ratings) +
  geom_boxplot(mapping = aes(x = season, y = imdb_rating)) +
  labs(title = "Relationship Between IMDb Rating and Season",
       x = "Season",
       y = "IMDb Rating")

In this visualization, we see that the ratings fluctuated over time. There were peaks in seasons 3 and 4, and the largest peak came in season 7, after which ratings dropped dramatically. A little research shows that Steve Carell, who played Michael Scott, left The Office after season 7, which would explain the dramatic loss in ratings.
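The boxplot comparison can also be checked numerically by computing the median rating per season, which is the center line each box draws. The sketch below assumes season is stored as a factor. The data frame `demo` and its ratings are invented for illustration and are not real episode ratings.

```r
library(dplyr)

# Made-up ratings for three seasons, for illustration only.
demo <- tibble(
  season      = factor(c(6, 6, 7, 7, 8, 8)),
  imdb_rating = c(8.0, 8.4, 8.3, 8.9, 7.4, 7.8)
)

# One row per season: the median rating and the number of episodes behind it.
season_summary <- demo %>%
  group_by(season) %>%
  summarise(median_rating = median(imdb_rating),
            n_episodes = n())

season_summary
```

A drop in `median_rating` from one season to the next is the numeric counterpart of a box sitting lower in the plot.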

Now let’s look at the popularity per season.

ggplot(data = office_ratings, mapping = aes(x = episode, y = viewers)) +
  geom_smooth(aes(color = season), se = FALSE) +
  labs(title = "Relationship between Viewership and Season Progression",
       x = "Episode Number",
       y = "Number of Viewers in Millions",
       color = "Season")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

As the visualization shows, the popularity of The Office peaked between seasons 2 and 5, with recurring dips along the way. After season 7, as the research showed, viewership dropped sharply following the departure of Michael Scott.

Conclusion

We cannot conclude that higher viewership goes along with higher ratings, but episodes with more viewers do tend to receive more ratings in total. There are relationships between the continuous variables (IMDb rating, original air date viewership, and total ratings), some stronger than others, but we cannot claim a firm correlation, because many of the plots are affected by the extreme value of the episode that aired after the Super Bowl.

What we can say is that viewership decreased from season 6 onward as the show began to decline, losing Steve Carell as Michael Scott after season 7 and hitting its lowest viewership in seasons 8 and 9. Although The Office lost a good chunk of its viewership after season 7, its ratings and viewership were otherwise fairly steady for most of its run. It remains a popular comedy today that many people miss dearly.
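To illustrate why a single extreme episode makes the correlation hard to interpret, the sketch below computes Pearson's correlation with and without one very large point. All numbers are made up for illustration, and the cutoff of 20 million viewers is a hypothetical threshold chosen only to drop that point.

```r
# Made-up values for illustration only; the last row plays the role of an extreme episode.
demo <- data.frame(
  viewers     = c(6.5, 7.0, 7.5, 8.0, 8.5, 22.0),
  total_votes = c(1800, 1500, 1700, 1400, 1600, 6000)
)

# Correlation using every episode, extreme value included.
cor_all <- cor(demo$viewers, demo$total_votes)

# Correlation after dropping episodes above the hypothetical 20-million cutoff.
trimmed <- demo[demo$viewers < 20, ]
cor_trimmed <- cor(trimmed$viewers, trimmed$total_votes)

c(all = cor_all, trimmed = cor_trimmed)
```

In this toy example the correlation is strongly positive with the extreme point and negative without it, so reporting both versions on the real data would show how much of any relationship depends on that one episode.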
