Introduction

This report seeks to answer the following question:

Are there relationships among the variables in office_ratings that visualizations can reveal, and what meaningful information do they give us about all 9 seasons of The Office?

We will be using a data set called office_ratings, obtained from https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv. It contains 186 per-episode data points from all 9 seasons of The Office TV show. There are 7 variables in the data set: season (the season during which an episode aired), episode (the episode number within a given season), title (the title of the episode), viewers (the number of viewers, in millions, on the original air date), imdb_rating (the average fan rating on IMDB.com, on a scale from 1 to 10), total_votes (the number of ratings on IMDB.com), and air_date (when the episode originally aired). These variables are used to create the visualizations that will answer the questions below. The full data set can be viewed below:

Throughout, we will need the functionality of the tidyverse package, mainly to create visualizations.

library(tidyverse)
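The plots in this report assume that an object named office_ratings already exists. A minimal sketch of how it could be loaded is shown here; the helper name load_office_ratings is an assumption made for illustration and is not part of the original report, and the sketch assumes the CSV is still hosted at the URL given in the introduction.

```r
library(tidyverse)

# URL as given in the introduction of this report
office_ratings_url <- "https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv"

# Hypothetical helper (an assumption, not from the original report):
# read the CSV at `path` and return it as a tibble.
load_office_ratings <- function(path = office_ratings_url) {
  read_csv(path, show_col_types = FALSE)
}
```

Running office_ratings <- load_office_ratings() should then make the data available to the code that follows, provided the file can still be downloaded.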

External sources used in addition to the data set:

https://www.imdb.com/title/tt0664521/reviews?ref_=tt_urv

https://www.reddit.com/r/DunderMifflin/comments/vuc9wn/imdb_ratings_of_the_office/

https://www.imdb.com/title/tt1564719/reviews

https://www.imdb.com/title/tt1248736/reviews

https://www.imdb.com/title/tt2669746/reviews

https://www.imdb.com/title/tt1833197/reviews?ref_=tt_urv

https://www.reddit.com/r/DunderMifflin/comments/49q924/i_was_looking_at_the_viewership_for_each_episode/

The Distribution of Each of the Three Continuous Variables

The question here is what the distributions of the three continuous variables look like. Those three variables are viewers, imdb_rating, and total_votes. We might suspect that the first graph, for viewers, would be concentrated on the left side and quickly taper off toward the right, because more episodes would have viewership in the one million to ten million range than in larger ranges, such as more than 11 million. For the second graph, imdb_rating, we could suspect that the graph would be concentrated in the middle and taper off to the left and right, because most episodes would receive similar scores, forming the middle of the graph, while any great or bad episodes would fall farther from the average rating. Finally, the last graph, total_votes, would be clustered at the left side, where most of the vote totals on IMDB.com would be, and would drop off rapidly toward the right. This might happen because most episodes would have a similar number of IMDB.com ratings, and only the great and bad episodes would have more, since more people would want to voice their opinions on those particular episodes. We can test these hypotheses with three histograms:

ggplot(data = office_ratings) +
  geom_histogram(mapping = aes(x = viewers)) +
  labs(x = "total number of viewers on the original air date (in millions)",
       y = "total amount of episodes",
       title = "The Distribution of Viewers Based on Episodes",
       caption = "data obtained from githubusercontent.com")

The viewers distribution peaks at about 8-9 million viewers on the original air date, with a smaller peak at 4-5 million. This tells me that most episodes drew 8-9 million viewers when they originally aired, and a smaller group of episodes drew 4-5 million. The distribution is right-skewed, which tells me that the outliers sit at the far right end of the graph. My hypothesis for the two peaks is that the smaller one reflects the typical viewership of episodes that aired before the most-watched episode, and the larger one reflects the typical viewership of episodes that aired after it. There are two extreme values, at about 11.5 million and about 22 million. The first is an outlier because it is the “Pilot,” the first episode ever released, so a lot of people tuned in to see whether they would like the show and want to continue watching it. The 22 million episode is such a big outlier because it originally aired right after NBC’s broadcast of Super Bowl XLIII. It is also the episode that many fans say is the best in the series and a great starting point for new viewers to get hooked on the show.

ggplot(data = office_ratings) +
  geom_histogram(mapping = aes(x = imdb_rating)) +
  labs(x = "average fan rating on IMDB.com (scale from 1 to 10)",
       y = "total amount of episodes",
       title = "The Distribution of IMDB Ratings Based on Episodes",
       caption = "data obtained from githubusercontent.com")

The main peak is at an IMDB rating of about 8.15, with other peaks at about 8.6 and 9.3. This tells me that most episodes get an IMDB rating somewhere in the 8.15 to 9.3 range. The distribution is shaped like a bell curve: most of the data points are in the middle of the graph, with a few outliers on both ends, so fan opinion of the episodes is spread fairly evenly around the middle. The first spike, in the middle, marks the most common rating. The second spike tells me that some episodes were liked somewhat more. The last spike tells me that there were very few episodes that fans enjoyed the most.

There are a couple of extreme values on the graph. The low outliers are at about the 6.75 mark. One of them is “Get the Girl,” which many fans disliked because of the story line running at that point in the show. Many people felt the show was dipping in quality and that the character Nellie was hurting the episodes during this period. The other low outlier is “The Banker,” which was not liked as much because it was essentially a filler episode that did not move the story forward in any meaningful way.

The high outliers are at about the 9.65 mark. “Stress Relief” was beloved by fans for its many memorable and funny moments. It was also a smart move to air this episode right after the Super Bowl, so that the show could reach as many new viewers as possible with one of its best episodes. “Finale” was loved because it was seen as a perfect ending to the show, closing out the story that had been building for years. The last highly rated episode, “Goodbye, Michael,” was so well received because Michael’s departure was shown in a heartfelt way that left many fans emotional by the end of the episode.

ggplot(data = office_ratings) +
  geom_histogram(mapping = aes(x = total_votes)) +
  labs(x = "total amount of ratings on IMDB.com",
       y = "total amount of episodes",
       title = "The Distribution of Total Amount of Ratings Based on Episodes",
       caption = "data obtained from githubusercontent.com")

The peak for this graph is at about 1650 total votes on IMDB. This tells me that most episodes have about 1650 fan votes on IMDB. The distribution is right-skewed, which tells me that most of the data points are on the left side of the graph and the extreme outliers are at the far right end. There are four extreme outliers. The first, at about 4000 votes, is “Dinner Party,” which has a lot of votes because many people found it funny, relatable, and chaotic in a humorous way. The next, at about 5700, is “Goodbye, Michael,” a very emotional episode that had a lot of people crying, so fans wanted to express their emotions and talk about how great a character Michael was. The third, at about 5900, is “Stress Relief,” which many people found to be one of the funniest and best episodes of the show. It was also the most-viewed episode on its original air date, so a lot of newcomers and devoted fans probably came to vote and express their opinions. Finally, “Finale,” with about 7900 votes, has the most votes on IMDB because it was the last episode, and a lot of fans wanted to share what they thought of the show overall and of how it ended.
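The skew that we read off the histograms can also be checked numerically. The following is a minimal sketch, not part of the original analysis: the helper name skew_check is hypothetical, and the numbers passed to it are made-up toy values rather than values from office_ratings. It relies on the common rule of thumb that a mean noticeably above the median suggests a right-skewed distribution.

```r
# Hypothetical helper (an assumption for illustration): compare the mean
# and the median of a numeric vector, ignoring missing values.
skew_check <- function(x) {
  x <- x[!is.na(x)]
  c(
    mean = mean(x),
    median = median(x),
    mean_minus_median = mean(x) - median(x)
  )
}

# Made-up toy values: one large value pulls the mean above the median.
skew_check(c(1, 2, 2, 3, 20))
```

Calling skew_check(office_ratings$viewers) or skew_check(office_ratings$total_votes) would apply the same check to the variables discussed above.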

Relationship Between How Many People Watch an Episode and How Well It Is Liked

The question that needs investigating here is whether there is a relationship between the number of viewers of an episode and its average fan rating on IMDB.com. It seems reasonable that the more viewers an episode has, the more it is liked, and that the more a show is liked, the more viewers its episodes attract. We can therefore expect a direct relationship between the two: as the number of viewers of an episode increases, so does its rating on IMDB.com. We can check this theory with a scatter plot:

ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = viewers, y = imdb_rating)) +
  labs(x = "total number of viewers on the original air date (in millions)",
       y = "average fan rating on IMDB.com (scale from 1 to 10)",
       title = "Average Fan Rating as a Function of Viewership Total",
       caption = "data obtained from githubusercontent.com")

Yes, there appears to be a positive correlation between the number of people who watch an episode and how well the episode is rated on IMDB. The points trend up and to the right, with a fairly steep slope. However, quite a few data points in the top left of the scatter plot contradict this trend, since they are highly rated episodes with relatively few viewers. There is also one outlier at the bottom right that has quite a few viewers but is not rated as highly.
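The visual impression of a correlation could be backed up with a correlation coefficient. The sketch below is an illustration under stated assumptions rather than part of the original report: correlation_summary is a hypothetical helper, and the example values are made up. It uses base R's cor() to report both the Pearson coefficient (linear association) and the Spearman coefficient (rank-based, so less sensitive to outliers such as the Super Bowl episode).

```r
# Hypothetical helper (an assumption for illustration): Pearson and
# Spearman correlations between two numeric vectors, dropping pairs
# with missing values.
correlation_summary <- function(x, y) {
  c(
    pearson = cor(x, y, use = "complete.obs", method = "pearson"),
    spearman = cor(x, y, use = "complete.obs", method = "spearman")
  )
}

# Made-up toy values: y always increases with x, but not in a straight line.
correlation_summary(c(1, 2, 3, 4), c(1, 4, 9, 16))
```

For the scatter plot above, the call would be correlation_summary(office_ratings$viewers, office_ratings$imdb_rating). Values near 1 would support a strong direct relationship, while values closer to 0 would suggest the trend is weak.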

Exceptions From the Previous Graph

The problem that needs examining is whether there are any exceptions to the trend in the previous scatter plot. We can’t infer much from the graph in its current state. It would help to map another variable to an extra aesthetic so that the exceptions are easier to see. I think mapping the season variable to the color aesthetic could help us understand which data points are the exceptions and why. We can test this assumption with a scatter plot:

ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = viewers, y = imdb_rating, color = season)) +
  labs(x = "total number of viewers on the original air date (in millions)",
       y = "average fan rating on IMDB.com (scale from 1 to 10)",
       title = "Average Fan Rating as a Function of Viewership Total",
       color = "season number",
       caption = "data obtained from githubusercontent.com")

From the graph above, some exceptions can be seen. For example, there are a couple of highly liked episodes, at or above an IMDB rating of about 9, that had fewer than 10 million viewers when they first aired. My hypothesis is that many of these highly rated episodes with low viewership come from seasons closer to the end of the show. The longer a show runs, the more people lose interest and stop watching altogether, and only the most passionate, die-hard fans stay until the very end. There is also one data point at around 11 million viewers with an IMDB rating of only about 7.6. This outlier is so far from the rest of the scatter plot because it is the “Pilot.” Pilots are usually not the best episodes because they are laying the groundwork for the rest of the show and have little going on besides setting up the plot and introducing the characters, yet they draw a lot of viewers because people want to see what the show is about and decide whether to keep watching.
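The exceptions described above could also be listed by name instead of being read off the plot. The following sketch is an assumption-laden illustration, not the original method: high_rating_low_viewers is a hypothetical helper, and its default cutoffs (a rating of at least 9 and fewer than 10 million viewers) are simply the thresholds mentioned in the paragraph above.

```r
library(tidyverse)

# Hypothetical helper (an assumption for illustration): episodes rated at
# or above `min_rating` that drew fewer than `max_viewers` million viewers.
high_rating_low_viewers <- function(data, min_rating = 9, max_viewers = 10) {
  data %>%
    filter(imdb_rating >= min_rating, viewers < max_viewers) %>%
    select(season, episode, title, viewers, imdb_rating) %>%
    arrange(desc(imdb_rating))
}
```

Calling high_rating_low_viewers(office_ratings) would return one row per exception, which makes it easy to check whether those episodes really do come from the later seasons.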

Relationship Between Viewership and the Number of IMDB Ratings, and the Exceptions to the Trend

The question that we are exploring here is whether there is a correlation between the number of viewers of an episode and how many total ratings are left for it on IMDB.com, and whether there are any exceptions to the trend. To me, it seems sensible that the more people watch a given episode, the more people would leave an IMDB rating for it. We could therefore presume a direct relationship: as the viewership of an episode rises, so does the total number of ratings left for that episode on IMDB.com. To find any exceptions, we need to map an extra variable to another aesthetic. Since we are looking at the correlation between viewership and the total number of IMDB ratings, mapping the season variable to the color aesthetic could prove useful, because it may help us find the exceptions and explain them. We can test these conjectures with two scatter plots, one for each question:

ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = viewers, y = total_votes)) +
  labs(x = "total number of viewers on the original air date (in millions)",
       y = "total amount of ratings on IMDB.com",
       title = "Total Amount of Ratings as a Function of Viewership Total",
       caption = "data obtained from githubusercontent.com")

Yes, there appears to be a correlation between the number of people who watch an episode and how many people leave an IMDB rating. The points rise quickly up and to the right, and the pattern resembles an exponential curve. However, there are quite a few exceptions where an episode has a low number of viewers but a high number of total votes on IMDB.

ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = viewers, y = total_votes, color = season)) +
  labs(x = "total number of viewers on the original air date (in millions)",
       y = "total amount of ratings on IMDB.com",
       title = "Total Amount of Ratings as a Function of Viewership Total",
       color = "season number",
       caption = "data obtained from githubusercontent.com")

A lot of the exceptions in the graph are episodes that had the most impact on the viewers. For example, the data point at the very top, with the most votes but not as many viewers, is the “Finale.” This episode is the ending of the show, so a lot of people would want to express their opinion of how the show ended and what they thought of the show as a whole by leaving an IMDB rating. There is also an interesting pattern from left to right: the later seasons, like 7-9, had the fewest viewers, while the early seasons, like 1-3, sit to the right with the most engagement, since the show was still new at the time. My hypothesis is that as the show got older, only the most loyal fans stuck around and kept watching until the end, whereas the earlier seasons drew a lot of viewers because the show was new and people were interested in seeing what it was all about.
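One detail matters when reading season colors. If season is stored as a number, ggplot2 maps it to a continuous color gradient, which can make neighboring seasons hard to tell apart. A possible variation, sketched here as an assumption rather than as the code used for the plot above, wraps season in factor() so that each season gets its own distinct color. The function name plot_votes_by_viewers is hypothetical.

```r
library(tidyverse)

# Hypothetical helper (an assumption for illustration): the same scatter
# plot as above, but with season treated as a discrete variable so that
# every season receives a separate color in the legend.
plot_votes_by_viewers <- function(data) {
  ggplot(data = data) +
    geom_point(mapping = aes(x = viewers, y = total_votes, color = factor(season))) +
    labs(x = "total number of viewers on the original air date (in millions)",
         y = "total amount of ratings on IMDB.com",
         color = "season number")
}
```

Calling plot_votes_by_viewers(office_ratings) would draw the plot. If season is already a factor in the data, the factor() call changes nothing.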

The Show’s Popularity Over Time

The problem that we are trying to analyze is how the show’s popularity has changed over time. To do this, we need to see the trend between the season in which an episode aired and how many total viewers the episode got. My hypothesis is that the show’s popularity will increase at the beginning and then decrease the longer the show goes on. I think this because shows usually gain viewers in the first couple of seasons, when people are intrigued by the new show and want to see what the hype is about, and then lose them in the last couple of seasons, when people get bored and stop watching. We can potentially see a direct relationship between season and viewers in the first couple of seasons, and then an inverse relationship for the later seasons: as the season number increases, viewership increases at the beginning and then decreases at the end. We can see whether my hypothesis is true by using a box plot:

ggplot(data = office_ratings) +
  # factor() gives one box per season even if season was read in as a number
  geom_boxplot(mapping = aes(x = factor(season), y = viewers)) +
  labs(x = "season in which the episodes aired",
       y = "total number of viewers on the original air date (in millions)",
       title = "Viewership Total as a Function of Season Number",
       caption = "data obtained from githubusercontent.com")

The show’s popularity started off low in the first season, with typical viewership of only around 5 million per episode and one exception, the “Pilot,” for which lots of people tuned in. Viewership then gradually rose until the fourth season, where it peaked, and gradually fell after that, to the point where the ninth season’s typical viewership was lower than the first season’s. There were some outliers from the trend, but most of the data points followed the same trajectory. Overall, the trend in the box plot is an arc, going up and then back down at the end.

The Show’s Appeal Over Time

The question that we would like to answer here is how the show’s appeal has changed over time. To answer it, we need to look at the trend between the season in which an episode aired and the average fan rating of the episode on IMDB.com. My hypothesis is that as The Office goes on, the show’s appeal to viewers will grow in the earlier seasons but then diminish in the later seasons. I believe this because viewers usually love the beginning of a show, when there are good character arcs, good storytelling, good character development, and so on. But as the show goes on, the writers and producers have a hard time making it as good as or better than it was previously. As a result, fans lose interest, and the ratings slowly dwindle because of bad writing, bad characters, bad plot points, and so on. We could presumably see a direct relationship between season and IMDB rating in the first couple of seasons, and then an inverse relationship for the later seasons: as the season number goes up, the IMDB rating goes up at the start and then goes down at the end. A box plot would be best to see whether my speculation is correct:

ggplot(data = office_ratings) +
  # factor() gives one box per season even if season was read in as a number
  geom_boxplot(mapping = aes(x = factor(season), y = imdb_rating)) +
  labs(x = "season in which the episodes aired",
       y = "average fan rating on IMDB.com (scale from 1 to 10)",
       title = "Average Fan Rating as a Function of Season Number",
       caption = "data obtained from githubusercontent.com")

The show’s appeal started off pretty well, with the first season having a typical IMDB rating of about 8. It then gradually got higher until the fourth season, where it peaked. The following season had a dip in ratings, and after that the ratings went up and down until the last season. The trend does have some outliers, but most of the data points stayed on the same course. Overall, the trend for this graph is an arc, going up and then back down at the end.
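The season-by-season trends read off the two box plots could be confirmed with a small summary table. This is a minimal sketch under stated assumptions and not part of the original analysis: summarise_by_season is a hypothetical helper that reports, for each season, the number of episodes together with the median (the center line of a box plot) and the mean of whichever column is passed in.

```r
library(tidyverse)

# Hypothetical helper (an assumption for illustration): per-season episode
# count, median, and mean of the column given as `var`.
summarise_by_season <- function(data, var) {
  data %>%
    group_by(season) %>%
    summarise(
      episodes = n(),
      median = median({{ var }}, na.rm = TRUE),
      mean = mean({{ var }}, na.rm = TRUE),
      .groups = "drop"
    )
}
```

Calling summarise_by_season(office_ratings, viewers) would summarize popularity by season, and summarise_by_season(office_ratings, imdb_rating) would do the same for appeal. An arc-shaped trend would show up as medians that rise toward the middle seasons and fall afterward.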

Relationship Between Total Viewership and Individual Episodes in the Seasons

The question being answered here is whether there is a connection between the episode number within a season and the number of viewers of that episode. Because this question looks at the trend within individual seasons, it would be wise to add the season variable as an extra aesthetic, and the color aesthetic is probably the best choice. Together, these three variables help to analyze how viewership changes over the course of the episodes in a given season. It could reasonably be assumed that a season loses viewers as its episodes go on, so the trend would be somewhat of an inverse relationship between the episode number within the season and the number of viewers. There will possibly be some outlier episodes that do better than the previous one, but generally the trend should be downward because some people get bored and stop watching. As the episode number within a season rises, viewership in that season drops. A scatter plot is best suited to test my theory:

ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = episode, y = viewers, color = season)) +
  labs(x = "episode number within a given season",
       y = "total number of viewers on the original air date (in millions)",
       title = "Viewership Total as a Function of Episode Number",
       color = "season which the episodes aired in",
       caption = "data obtained from githubusercontent.com")

Yes, there is a trend in total viewership across the individual seasons. As the seasons progress, total viewership gradually decreases, with the earlier seasons having a much higher average viewership than the later seasons. A new show always gets a boost in viewership from fresh interest, so viewership on the original air date stayed consistently high in the early seasons. One of the big changes in viewership came on the day of the Super Bowl: “Stress Relief” aired right after the game and drew a large audience, which The Office used to bring new viewers in and old viewers back. For a time, this marketing technique worked. Although the show never saw viewership as high as that day’s, before or after, the episode brought back a good number of viewers. The effect did not last long, though. As the show approached season seven, the numbers decreased, and after episode 21 of that season, “Goodbye, Michael,” viewership never reached that level again. What remained from then on was the dedicated fan base who stayed to see the end of the show.
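The hypothesis for this section was about the trend within each season, and that trend can be measured directly. The sketch below is an illustration under stated assumptions, not the original method: viewer_trend_by_season is a hypothetical helper that computes, for every season, the least-squares slope of viewers against episode number using the identity slope = cov(x, y) / var(x). A negative slope would mean that viewership tends to fall as the season goes on.

```r
library(tidyverse)

# Hypothetical helper (an assumption for illustration): per-season
# least-squares slope of viewers (in millions) per episode.
# Seasons with fewer than two episodes yield NA.
viewer_trend_by_season <- function(data) {
  data %>%
    group_by(season) %>%
    summarise(
      slope = cov(episode, viewers) / var(episode),
      .groups = "drop"
    )
}
```

Calling viewer_trend_by_season(office_ratings) would give one slope per season. Outliers such as “Stress Relief” can pull a season's slope noticeably, so the numbers are best read alongside the scatter plot rather than in place of it.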

Conclusion

In summary, we can infer that as a show goes on, in this case The Office, viewership is high in the early seasons, when a lot of people want to get in on the hype around the new show, but then decreases as the show runs past its natural lifespan and the writers and producers run out of ideas, so the fans stop watching. Also, as the viewership for an episode increases, its IMDB rating tends to increase as well, and vice versa. Lastly, the substance of an individual episode can make or break the show’s success in later seasons. All in all, every variable in office_ratings plays a part in the data set and in the visualizations created to draw meaningful information from it.