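# Load the packages used in this report
library(tidyverse)
library(DT)

# Read the ratings data for The Office
office_ratings <- readr::read_csv('https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv')
# Treat season as a categorical label and parse air_date as a Date
office_ratings$season <- as.character(office_ratings$season)
office_ratings$air_date <- as.Date(office_ratings$air_date, format = "%m/%d/%Y")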

Introduction

This report aims to answer the following question:

Is there a relationship between the progression of a show and the changes in the audience?

We will use the data set office_ratings to examine this question. It contains 7 variables that track the popularity and appeal of the show The Office as it progressed. Four identify each episode: season, episode, title, and air_date. The other three are numerical: viewers, imdb_rating, and total_votes.

Overall distribution of viewers, imdb_rating, and total votes

First, we must get familiar with our variables. Below are three histograms showing the distribution of each variable. They show the highest and lowest values of each variable and how frequently each value occurred over the course of the show.

ggplot(data = office_ratings) +
  geom_histogram(mapping = aes(x = viewers)) +
  labs(x = "Number of Viewers (millions)",
       title = "Number of Viewers per Episode of The Office",
       caption = "data obtained from data set `office_ratings`")

ggplot(data = office_ratings) +
  geom_histogram(mapping = aes(x = imdb_rating)) +
  labs(x = "IMDb Rating",
       title = "IMDb Rating per Episode of The Office",
       caption = "data obtained from data set `office_ratings`")

ggplot(data = office_ratings) +
  geom_histogram(mapping = aes(x = total_votes)) +
  labs(x = "Number of Votes",
       title = "Number of Votes per Episode of The Office",
       caption = "data obtained from data set `office_ratings`")

These three histograms show each variable on its own, so they do not directly show any correlation between the variables. Within these graphs, we can see that viewers peaks at around 7.5 million, with the bulk of the episodes at or below 10 million, apart from the two values that sit outside the main body of the data. The IMDb ratings vary quite widely throughout the show, with the overall peak at roughly 8.1. The number of votes is more concentrated than the IMDb ratings: the peak is right around 1700 and the majority of episodes fall between 1000 and 3000 votes, apart from the few episodes further out on the graph.
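
To check the values read off the histograms, we can also summarise each variable numerically. The chunk below is a minimal sketch, assuming office_ratings has been loaded as in the first chunk. The helper name distribution_summary is our own and is not part of the original analysis.

```r
# Hypothetical helper (not in the original report): one row of summary
# statistics per requested column.
distribution_summary <- function(data, columns) {
  rows <- lapply(columns, function(column) {
    values <- data[[column]]
    data.frame(
      variable = column,
      minimum  = min(values, na.rm = TRUE),
      median   = median(values, na.rm = TRUE),
      mean     = mean(values, na.rm = TRUE),
      maximum  = max(values, na.rm = TRUE)
    )
  })
  # Stack the one-row data frames into a single table
  do.call(rbind, rows)
}
```

Calling distribution_summary(office_ratings, c("viewers", "imdb_rating", "total_votes")) would return one row per variable. A median is not the same thing as the tallest histogram bar, so the two need not match exactly.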

Correlation Between Viewership and Total Votes

Now that we are familiar with the distribution of the three variables above, we can look more closely at how they relate to one another. The graph below is a scatter plot of viewers against total votes.

ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = viewers, y = total_votes)) +
  labs(x = "Viewers", 
       y = "Total Votes", 
       title = "Relationship Between Total Viewership and Votes Per Episode",
       caption = "data obtained from data set `office_ratings`")

While the range of total votes makes the graph a little hard to read, we can see an overall upward trend from left to right. This means that, generally, as the number of viewers increases, so does the total number of votes for each episode. Three points sit far above the rest. Relative to the trend we just described, these three episodes received a disproportionately high number of votes for their viewership.
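
The upward trend can be given a number with a Pearson correlation coefficient, and the three high points can be identified by sorting on votes. This is a sketch under the assumption that office_ratings is loaded as in the first chunk. The helper names column_correlation and top_rows are hypothetical.

```r
# Hypothetical helper: Pearson correlation between two columns,
# using only rows where both values are present.
column_correlation <- function(data, x, y) {
  cor(data[[x]], data[[y]], use = "complete.obs")
}

# Hypothetical helper: the n rows with the largest values in `column`.
top_rows <- function(data, column, n = 3) {
  ordered <- data[order(data[[column]], decreasing = TRUE), , drop = FALSE]
  head(ordered, n)
}
```

For this report the calls would be column_correlation(office_ratings, "viewers", "total_votes") and top_rows(office_ratings, "total_votes"). A coefficient near 1 indicates a strong upward linear trend, and a value near 0 indicates little linear relationship.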

Correlation Between Total Viewers and IMDb Rating

Knowing that there is a general positive correlation between viewership and votes, we might expect to see the same between viewership and IMDb rating. The following graph is a scatter plot of viewers against IMDb rating.

ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = viewers, y = imdb_rating)) + 
  labs(x = "Viewers",
       y = "IMDb Rating",
       title = "Relationship Between Total Viewers and IMDb Ratings Per Episode",
       caption = "data obtained from data set `office_ratings`")

This graph shows generally the same trend as above, upward from left to right. It also shows a little more variation, which suggests that viewership is less strongly related to IMDb rating than it is to total votes. There are a few outliers in this graph, which are episodes whose viewership was out of proportion to their IMDb rating.
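
One way to judge how much of the variation in rating goes along with viewership is a simple linear regression. This report does not use that technique itself, so the chunk below is only a sketch, assuming office_ratings is loaded as in the first chunk. The helper name linear_fit_summary is hypothetical.

```r
# Hypothetical helper: fit y ~ x by ordinary least squares and return
# the slope and the share of variance explained (R squared).
linear_fit_summary <- function(data, x, y) {
  fit <- lm(reformulate(x, response = y), data = data)
  c(
    slope     = unname(coef(fit)[2]),
    r_squared = summary(fit)$r.squared
  )
}
```

Comparing linear_fit_summary(office_ratings, "viewers", "imdb_rating") with linear_fit_summary(office_ratings, "viewers", "total_votes") would show which relationship is tighter: the pair with the larger R squared has less scatter around its trend line.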

Correlation Between Progression and Audience Retention Including Ratings and Votes

Having looked at how these variables relate to one another, we can now look at how each of them changes over time. To measure time in this show we will use the season. Below are three box plots, each showing one of the previously mentioned variables by season.

ggplot(data = office_ratings) +
  geom_boxplot(mapping = aes(x = season, y = viewers)) + 
  labs(x = "Season", 
       y = "Viewers",
       title = "Relationship Between Progression and Audience",
       caption = "data obtained from data set `office_ratings`")

This box plot shows that viewership initially increases, seems to plateau after the 2nd season, and gradually decreases after season 5. In other words, the popularity of the show remained roughly the same from season 2 through season 5 but declined after the 5th season.
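
The line inside each box is that season's median, so the trend described above can also be read from a table of medians. The chunk below is a minimal sketch, assuming office_ratings is loaded as in the first chunk and has a season column. The helper name season_medians is our own.

```r
library(dplyr)

# Hypothetical helper: number of episodes and median of `column`
# for each season.
season_medians <- function(data, column) {
  data %>%
    group_by(season) %>%
    summarise(
      episodes     = n(),
      median_value = median(.data[[column]], na.rm = TRUE),
      .groups      = "drop"
    )
}
```

season_medians(office_ratings, "viewers") would give one row per season. The same helper works for "total_votes" and "imdb_rating". Because season is stored as text, rows are sorted as text, which is still the natural order for single-digit seasons.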

ggplot(data = office_ratings) +
  geom_boxplot(mapping = aes(x = season, y = total_votes)) +
  labs(x = "Season", 
       y = "Votes",
       title = "Relationship Between Progression and Votes",
       caption = "data obtained from data set `office_ratings`")

ggplot(data = office_ratings) +
  geom_boxplot(mapping = aes(x = season, y = imdb_rating)) +
  labs(x = "Season", 
       y = "IMDb Rating",
       title = "Relationship Between Progression and IMDb Rating",
       caption = "data obtained from data set `office_ratings`")

These two plots, while different in shape, show roughly the same overall downward trend. These two variables can be attributed to appeal. Votes have a very clear downward trend, apart from the many outliers in the graph. IMDb ratings vary a little more; however, there is a general downward trend after the 5th season. The IMDb rating and viewer graphs are more alike than the total votes graph, but all three show a general decline after the 5th season.

Outliers

From the graphs we’ve looked at, there are some considerable outliers. Why is that? Looking back at the graphs above, one episode in season 5 is an outlier on several of them. This is season 5, episode 13, called Stress Relief, with 22.91 million viewers. On further investigation, many people loved this episode because it captured the humor of the main character, Michael Scott, in a way that was very accessible to new viewers. Many claimed that this episode kept people hooked and raised overall viewership.
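
Outlier episodes can also be listed by code instead of being read off the plots. The sketch below uses base R's boxplot.stats(), which flags values more than 1.5 times the interquartile range beyond the box, the same convention the box plots use, although its quartile calculation can differ slightly from ggplot2's. It assumes office_ratings is loaded as in the first chunk. The helper name flag_outliers is hypothetical.

```r
# Hypothetical helper: rows whose value in `column` is a box plot
# outlier within its own group (by default, within its season).
flag_outliers <- function(data, column, group = "season") {
  groups <- split(data, data[[group]])
  flagged <- lapply(groups, function(rows) {
    values <- rows[[column]]
    # boxplot.stats()$out holds the values beyond the 1.5 * IQR fences
    rows[values %in% boxplot.stats(values)$out, , drop = FALSE]
  })
  do.call(rbind, flagged)
}
```

flag_outliers(office_ratings, "viewers") would return the episodes drawn as individual points on the viewers box plot, including their title and air_date.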

Another notable change in viewership raises the question: why are seasons 8 and 9 drastically lower in overall viewers than the other seasons? Michael Scott left the show in season 7, and there was a drastic decline in viewers after his departure. Many audience members decided not to tune in to the episodes following Michael Scott’s exit.
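
The size of this drop can be measured by comparing average viewers in seasons 8 and 9 with the average in the earlier seasons. This is a sketch, assuming office_ratings is loaded as in the first chunk with season stored as text. The helper name compare_periods and the labels "early" and "late" are our own.

```r
# Hypothetical helper: mean of `column` for the late seasons versus
# all earlier seasons.
compare_periods <- function(data, column, late_seasons = c("8", "9")) {
  period <- ifelse(data$season %in% late_seasons, "late", "early")
  tapply(data[[column]], period, mean, na.rm = TRUE)
}
```

compare_periods(office_ratings, "viewers") would return two averages, one labelled early and one labelled late. A gap between them describes the decline but does not on its own show what caused it.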

We must also consider the general population and common viewership trends. There may be more viewers during the first and last seasons because many people tune in to shows they are not familiar with for special episodes or features.

Conclusion

The leading question throughout this report has been: Is there a relationship between the progression of a show and the changes in the audience?

Ultimately, for this case, the answer is no. Viewership of The Office did not drop toward the end of the show because the audience got bored; it dropped because the main character left the show. Other factors affect the viewership of a show more prominently than its progression does. For the most part, we did see a direct correlation between viewership, IMDb ratings, and total votes.