This report seeks to answer the following questions about “The Office”:
- How is each of the three continuous variables distributed?
- Is it the case that the more people watch an episode, the better it’s liked?
- Are there any exceptions to that trend?
- Is it the case that the more people watch an episode, the more people leave an IMDb rating?
- How did the show’s popularity change over time?
- How did the show’s appeal change over time?
- The show’s popularity and appeal don’t change in exactly the same way throughout the series. What differences in the visualizations might explain why?
- Is there a trend in total viewership within the individual seasons? Are there any notable changes in viewership within any season?
We will be using a data set called office_ratings,
obtained from [https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv].
It contains information on each of the 186 individual episodes of The
Office over its nine-season history. There are 7 variables for each
episode; the relevant ones in this report are viewers
(number of viewers in millions on the original air date),
imdb_rating (average fan rating on IMDb.com from 1 to 10),
air_date (date the episode originally aired),
episode (episode number within the season),
season (season during which the episode aired), and
total_votes (number of ratings on IMDb.com). The full data
set can be viewed below:
Throughout, we will need the functionality of the tidyverse package, mainly to create visualizations. The DT package, which is used to display the data set, was loaded in the previous code chunk.
library(tidyverse)
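The chunk that reads the data is not shown in this report, so here is a minimal sketch of how office_ratings could be loaded. It assumes the CSV at the URL quoted above is still available; the helper name load_office_ratings is hypothetical and not part of the original analysis.

```r
library(readr)

# URL quoted in the report's data description.
data_url <- "https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv"

# Hypothetical helper: read the ratings CSV from a URL or a local path.
load_office_ratings <- function(path = data_url) {
  # show_col_types = FALSE only silences the column-type message.
  read_csv(path, show_col_types = FALSE)
}
```

Calling `office_ratings <- load_office_ratings()` would then create the data frame used in the plots that follow.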
I will be working through several questions about viewership and ratings throughout this narrative. Before going too far in depth, though, it is important to get a good grasp of the distribution of each of the continuous variables. I expect the distributions to look fairly similar, because viewership and ratings typically go hand in hand. However, this is certainly not always the case. To view the distribution of each variable, I created a histogram for each:
ggplot(data = office_ratings) +
geom_histogram(mapping = aes(x = viewers), bins= 30) +
labs(x = "viewers (millions)",
y = "# of episodes",
title = "Viewership Distribution",
caption = "data obtained from https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv")
ggplot(data = office_ratings) +
geom_histogram(mapping = aes(x = imdb_rating), bins= 30) +
labs(x = "imdb rating",
y = "# of episodes",
title = "Ratings Distribution",
caption = "data obtained from https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv")
ggplot(data = office_ratings) +
geom_histogram(mapping = aes(x = total_votes), bins= 30) +
labs(x = "total votes",
y = "# of episodes",
title = "Vote Distribution",
caption = "data obtained from https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv")
After plotting the histograms, it appears that the distributions are not obviously similar after all. The ratings distribution looks roughly symmetric and bell-shaped, while viewership and total votes are each skewed (viewership: skewed left, total votes: skewed right). Looking at each histogram individually, total votes is skewed right because very few episodes get a lot of votes: the majority get around 1250 votes, but a few very popular episodes stretch the tail to the right. The IMDb rating is approximately normally distributed, which makes sense: there are some very highly rated episodes and some poorly rated ones, but most episodes sit near the average of about 8.25. Lastly, viewership is skewed left (other than a few outliers): most episodes received around 8-9 million viewers, which appears to be the cap, and the rest are scattered between 2 and 8 million, producing the tail on the left.
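Reading skew off a histogram is easy to get backwards, so a numeric check can help. The sketch below defines a moment-based skewness statistic in base R; the function is an illustrative addition, not something used in the original analysis.

```r
# Moment-based sample skewness: positive means a longer right tail,
# negative means a longer left tail, and values near 0 suggest symmetry.
skewness <- function(x) {
  x <- x[!is.na(x)]          # ignore missing values
  centered <- x - mean(x)    # deviations from the mean
  mean(centered^3) / mean(centered^2)^1.5
}
```

Assuming office_ratings is loaded, `sapply(office_ratings[c("viewers", "imdb_rating", "total_votes")], skewness)` would report one value per variable. Because the statistic is sensitive to outliers, a single extreme episode can change its sign.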
Moving forward, I want to discover whether higher viewership goes along with better ratings. To do this, I will produce a scatter plot of the two variables to see their trend (I am also adding a trend line for better visualization). My hypothesis is that episodes with more viewers will have better ratings, because people are more likely to watch a higher-rated episode.
ggplot(data = office_ratings) +
geom_point(mapping = aes(x=viewers, y=imdb_rating)) +
geom_smooth(mapping = aes(x=viewers, y=imdb_rating), se = FALSE) +
labs(x = "viewers",
y = "imdb rating",
title = "Viewership vs. Ratings",
caption = "data obtained from https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv")
After viewing the plot, it appears that my hypothesis was correct. Other than a few outliers, the ratings did appear to increase as viewership increased. There was a small dip around 4 million, so I am going to create a visualization to explain why this dip is present. My hypothesis is that it’s because there are newer episodes that have good ratings but haven’t been around as long as other episodes, and that’s the reason for the lower viewership. To test this, I am going to create the same plot, but color each point by season.
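A correlation coefficient can back up what the trend line suggests. The sketch below is an illustrative addition using base R's cor(); the helper name trend_strength is hypothetical. It reports Pearson correlation (linear association) and Spearman correlation (monotonic association, less sensitive to outliers).

```r
# Hypothetical helper: summarize how strongly two variables move together.
trend_strength <- function(x, y) {
  c(
    # Pearson measures linear association.
    pearson  = cor(x, y, use = "complete.obs", method = "pearson"),
    # Spearman works on ranks, so it captures any monotonic trend.
    spearman = cor(x, y, use = "complete.obs", method = "spearman")
  )
}
```

Assuming office_ratings is loaded, `trend_strength(office_ratings$viewers, office_ratings$imdb_rating)` would give both values for the scatter plot above. A positive value is consistent with the increasing trend, but it does not show that viewership causes higher ratings.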
ggplot(data = office_ratings) +
# factor() treats season as a discrete category, so each season gets its own color
geom_point(mapping = aes(x=viewers, y=imdb_rating, color = factor(season))) +
geom_smooth(mapping = aes(x=viewers, y=imdb_rating), se = FALSE) +
labs(x = "viewers",
y = "imdb rating",
color = "season",
title = "Viewership vs. Ratings",
caption = "data obtained from https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv")
My hypothesis seems to be supported again. All of the points with the lowest viewership, which created the dip in the trend line, are from seasons 8 and 9, which are the most recent seasons. My expectation was that, over time, these seasons would continue to get more viewers, with the higher-rated episodes naturally getting more views, so that the increasing trend would hold throughout the entire visualization. One caveat: viewers is defined as the audience on the original air date, so that figure cannot grow after the fact; the low values may instead reflect a smaller live audience in the final seasons.
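A per-season summary table is one way to confirm what the colored scatter plot suggests. The sketch below is an illustrative addition that assumes the column names described at the start of the report; the helper name season_summary is hypothetical.

```r
library(dplyr)

# Hypothetical helper: episode count and average viewers/rating per season.
season_summary <- function(data) {
  data %>%
    group_by(season) %>%
    summarise(
      episodes     = n(),
      mean_viewers = mean(viewers, na.rm = TRUE),
      mean_rating  = mean(imdb_rating, na.rm = TRUE),
      .groups      = "drop"
    ) %>%
    arrange(season)
}
```

Calling `season_summary(office_ratings)` would show at a glance whether seasons 8 and 9 have the lowest average viewership.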
Looking at the next question, I need to determine whether more viewers goes along with more IMDb votes. I think that, once again, we will see an increasing trend between these two variables. When an episode has more views, it naturally should receive more ratings. I will use a scatter plot, just as in the previous two questions, and I will color each point by its season, in case we run into the same exceptions as in the previous problem.
ggplot(data = office_ratings) +
# factor() treats season as a discrete category, so each season gets its own color
geom_point(mapping = aes(x=viewers, y=total_votes, color = factor(season))) +
geom_smooth(mapping = aes(x=viewers, y=total_votes), se = FALSE) +
labs(x = "viewers",
y = "total votes",
color = "season",
title = "Viewership vs. Total Votes",
caption = "data obtained from https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv")
After looking at the plot, it appears that the result is much the same. There is indeed an increase in votes as viewership increases, and the dip that I expected to see is still present, with the newer seasons pulling the trend down a small amount. The newer episodes that drew a similar number of viewers to the older episodes have simply not been rated as much yet, because there has been less time for those votes to accumulate.
The next trend that I will try to discover is how the show’s popularity has changed over time. While it may seem like I would want to plot air date against the ratings, I believe that popularity is better represented by the viewers variable. I expect viewership to decrease over time, because the older episodes are not only more well-known, but have also been around longer, which would allow for increased viewership.
ggplot(data = office_ratings) +
geom_smooth(mapping = aes(x=air_date, y= viewers)) +
labs(x = "air date",
y = "viewers",
title = "Popularity Over Time",
caption = "data obtained from https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv")
This time, my hypothesis was incorrect. While viewership did decrease every year after 2008, it actually increased every year leading up to that point. I would guess that the reason for this is that it took a little while for the show to gain its popularity. Viewership increased through the first few seasons until it peaked around 2008, which is when the most “popular” episodes aired.
Next, I want to figure out how the show’s appeal changed over time. This is where I am going to plot the ratings against the air date. My guess is that this graph will look a lot like the prior graph because, as we found out earlier, higher viewership often goes along with higher ratings.
ggplot(data = office_ratings) +
geom_smooth(mapping = aes(x=air_date, y= imdb_rating)) +
labs(x = "air date",
y = "imdb rating",
title = "Appeal Over Time",
caption = "data obtained from https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv")
Sure enough, the visualization looks very similar to the previous one, with ratings increasing every year until 2008 and then declining nearly every season after. One difference between the visualizations is a small spike in ratings in the last season. This spike isn’t seen in the viewership trend graph, which I attribute to the pattern discussed earlier (new episodes can be rated highly, but haven’t been around long enough for high viewership). This would be a good sign for The Office if they were still making episodes (unfortunately they are not), because in the last season the ratings finally started to head in a better direction after 4 years of decline.
Next, I need to determine why the ratings and popularity graphs don’t change in the same way. While I already gave a few suggestions as to why this may be, I am going to show the two charts together and potentially come up with some other explanations for why this difference in trends exists.
ggplot(data = office_ratings) +
geom_smooth(mapping = aes(x=air_date, y= viewers)) +
labs(x = "air date",
y = "viewers",
title = "Popularity Over Time",
caption = "data obtained from https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv")
ggplot(data = office_ratings) +
geom_smooth(mapping = aes(x=air_date, y= imdb_rating)) +
labs(x = "air date",
y = "imdb rating",
title = "Appeal Over Time",
caption = "data obtained from https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv")
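The two chunks above produce separate plots. One way to line the trends up on a shared date axis, sketched here as an illustrative alternative rather than the approach used in this report, is to reshape the two measures into long form and facet by measure. The function name plot_popularity_and_appeal is hypothetical.

```r
library(ggplot2)
library(tidyr)

# Hypothetical helper: stack the viewers and imdb_rating trends in two
# panels that share the air_date axis but keep their own y scales.
plot_popularity_and_appeal <- function(data) {
  long <- pivot_longer(
    data,
    cols      = c(viewers, imdb_rating),
    names_to  = "measure",
    values_to = "value"
  )
  ggplot(long, aes(x = air_date, y = value)) +
    geom_smooth() +
    facet_wrap(vars(measure), ncol = 1, scales = "free_y") +
    labs(x = "air date", y = NULL, title = "Popularity and Appeal Over Time")
}
```

Calling `plot_popularity_and_appeal(office_ratings)` would draw both curves with aligned dates, which makes it easier to see where the ratings rise while viewership keeps falling.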
After viewing the two plots together, a thought about why the ratings increased at the end came to mind. I believe that the ratings increased because it was the last season. The way the show ended must have struck a chord with the people rating it. But just because the people rating liked the episodes more than before does not mean that viewership would increase beyond what it was in previous seasons.
The final problem that I will attempt to solve involves the viewership trend across episodes within the individual seasons. To visualize the trend in each season, I am going to use a facet grid, which will give me a trend line for each of the 9 seasons.
ggplot(data = office_ratings) +
geom_line(mapping = aes(x = episode, y = viewers, group = season)) +
geom_point(mapping = aes(x = episode, y = viewers)) +
facet_grid(rows = vars(season)) +
labs(x = "episode",
y = "viewers",
title = "Seasonal Viewership Trend",
caption = "data obtained from https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv")
The trend that appears to be present for most of the seasons (particularly 1, 4, and 8) is that viewership tends to decrease as the season goes on. This makes a lot of sense, because there is an initial excitement for the season, which drives high viewership for the first couple of episodes. That excitement dies down as the season goes on, which results in decreased viewership in later episodes. One thing that I wanted to mention, too, is that the facet grid is not completely ideal here, due to an outlier that makes the trend difficult to see. This outlier is an episode in season 5 that was a big hit, as it received more views than any other episode by a wide margin.
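One possible way to keep the season 5 outlier from flattening every other panel is to let each facet use its own y-axis. The sketch below is an illustrative variant of the plot above, not the version used in this report; the function name plot_season_viewership is hypothetical.

```r
library(ggplot2)

# Hypothetical helper: one panel per season, each with its own y scale,
# so a single extreme episode only stretches the axis of its own season.
plot_season_viewership <- function(data) {
  ggplot(data, aes(x = episode, y = viewers)) +
    geom_line() +
    geom_point() +
    facet_grid(rows = vars(season), scales = "free_y") +
    labs(x = "episode", y = "viewers", title = "Seasonal Viewership Trend")
}
```

Calling `plot_season_viewership(office_ratings)` would draw the panels. The trade-off is that with free scales, the heights of the lines can no longer be compared directly across seasons, only the shape of the trend within each one.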
In conclusion, episodes with higher viewership typically have higher ratings, unless they are newer episodes that are rated highly. The popularity and appeal also changed very similarly over time, with an initial increase in ratings and viewership until 2008, followed by both dropping every year until 2012, when there was a small spike in ratings. All in all, The Office was a massive hit that we still hear about today, and even the poorly rated episodes are still top-tier television.