This report seeks to answer the following question:
Are there relationships among the variables in office_ratings
that visualizations can reveal, and what meaningful information
do they give us about all 9 seasons of The Office?
We will be using a data set called office_ratings
obtained from https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv.
It contains 186 data points, one per episode, from all
9 seasons of The Office TV show. There are 7 variables in the
data set, and all of them are used in this report: season
(season during which an episode aired), episode (episode
number within a given season), title (the title of the
episode), viewers (number of viewers, in millions, on
the original air date), imdb_rating (the average fan
rating on IMDB.com, on a scale from 1 to 10), total_votes
(number of ratings on IMDB.com), and air_date (when the
episode originally aired). They all play a role in creating the
visualizations that will be used to answer the questions. The full data
set can be viewed below:
Throughout, we will need the functionality of the tidyverse package, mainly to create visualizations.
library(tidyverse)
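The report does not show how office_ratings is read in, so the following is only a minimal sketch of one way it could be done. The URL is the one given above; the helper name load_office_ratings is hypothetical, and the column types are whatever read_csv() guesses from the file.

```r
library(tidyverse)

# URL as stated in the report.
office_ratings_url <- "https://raw.githubusercontent.com/jafox11/MS282/main/office_ratings.csv"

# Hypothetical helper (not from the original report): read the CSV into a tibble.
# `path` defaults to the URL above but can also point to a local copy of the file.
load_office_ratings <- function(path = office_ratings_url) {
  # show_col_types = FALSE only silences the column-type message
  read_csv(path, show_col_types = FALSE)
}
```

Under these assumptions, office_ratings <- load_office_ratings() would create the data set, and glimpse(office_ratings) would show how each column was typed. That matters later, because season is treated as a grouping variable in the box plots.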
Outside sources I used in addition to the data set (links):
https://www.imdb.com/title/tt0664521/reviews?ref_=tt_urv
https://www.reddit.com/r/DunderMifflin/comments/vuc9wn/imdb_ratings_of_the_office/
https://www.imdb.com/title/tt1564719/reviews
https://www.imdb.com/title/tt1248736/reviews
https://www.imdb.com/title/tt2669746/reviews
https://www.imdb.com/title/tt1833197/reviews?ref_=tt_urv
The problem that we are dealing with here is what the distributions
of the three continuous variables look like. Those three continuous
variables are viewers, imdb_rating, and total_votes. We might suspect
that the first graph, for viewers, would be concentrated on the left
side and quickly get smaller toward the right. The reason might be that
more episodes have viewership totals in the one to ten million range
than larger ones, like more than 11 million. For the second graph,
imdb_rating, we could suspect that the graph would be concentrated in
the middle and get smaller toward both the left and the right. The
reason for this could be that most episodes would receive a similar
score, which forms the middle of the graph, while any great or bad
episode would fall further from the average rating. Finally, the last
graph, total_votes, would be clustered at the left side of the graph,
where most of the vote totals on IMDB.com would be, and it would
rapidly drop off toward the right. The reason this might happen is that
most episodes would have a similar number of IMDB.com ratings, and only
the great and bad episodes would have more, because more people would
want to voice their opinions on those particular episodes. We can test
these hypotheses with three histograms:
ggplot(data = office_ratings) +
geom_histogram(mapping = aes(x = viewers)) +
labs(x = "total number of viewers on the original air date (in millions)",
y = "total amount of episodes",
title = "The Distribution of Viewers Based on Episodes",
caption = "data obtained from githubusercontent.com")
The peak for the viewers variable is at about 8-9 million total
viewers on the original air date, and there is a smaller peak at the
4-5 million mark. The peaks tell me that episodes most often drew 8-9
million viewers when they originally aired, and somewhat less often 4-5
million. The distribution is right-skewed, which tells me that the
outliers will be at the far-right end of the graph. My hypothesis for
why there are two different spikes is that the smaller spike could
reflect the typical viewership of episodes that aired before the one
big episode with the most viewers, and the bigger spike could reflect
the typical viewership of episodes that aired after it. There are two
extreme values, at about the 11.5 million mark and the 22 million mark.
The 11.5 million episode is an outlier because it was the “Pilot”, the
first episode ever released, so a lot of people tuned in to see whether
they would like the show and want to continue it. The 22 million
episode, “Stress Relief”, is such a big outlier because it originally
aired right after NBC’s broadcast of Super Bowl XLIII. It is also the
episode that many fans of the show say is the best in the series and a
great opening for new viewers to get hooked on the show.
ggplot(data = office_ratings) +
geom_histogram(mapping = aes(x = imdb_rating)) +
labs(x = "average fan rating on IMDB.com (scale from 1 to 10)",
y = "total amount of episodes",
title = "The Distribution of IMDB Ratings Based on Episodes",
caption = "data obtained from githubusercontent.com")
The peak is at about the 8.15 IMDB rating mark, with smaller peaks at about the 8.6 and 9.3 marks. These peaks tell me that most episodes get an IMDB rating somewhere in the 8.15 to 9.3 range. The distribution is roughly bell-shaped: most of the data points are in the middle of the graph, with a few outliers on both ends, so the episodes were liked fairly evenly. The low outliers are at around the 6.75 mark, and the high outliers are at around the 9.65 mark. The first spike, in the middle, is the rating that episodes most commonly received. The second spike shows that a group of episodes was liked somewhat more, and the last spike shows that only a few episodes were the ones fans enjoyed the most.
The lowest-rated extreme value, at about the 6.75 mark, is the episode “Get the Girl.” A lot of fans did not like the story at this point in the show’s run. Many people felt that the show was dipping in quality and that the character Nellie was hurting the episodes during this stretch. The other low outlier was “The Banker.” This episode was not liked as much because it was basically a filler episode that didn’t move the story in any meaningful way.
The highest-rated extreme values are at about the 9.65 mark. “Stress Relief” was beloved by fans for being a great episode with many memorable and funny moments. It was also a smart move to air this episode right after the Super Bowl, so that as many new viewers as possible would see one of the best episodes the show made. “Finale” was loved because fans saw it as a perfect ending to the show: it closed out the story that had been building over all those years. “Goodbye, Michael” was rated so highly because the way it showed Michael leaving was heartwarming, and it made a lot of fans emotional at the end of the episode.
ggplot(data = office_ratings) +
geom_histogram(mapping = aes(x = total_votes)) +
labs(x = "total amount of ratings on IMDB.com",
y = "total amount of episodes",
title = "The Distribution of Total Amount of Ratings Based on Episodes",
caption = "data obtained from githubusercontent.com")
The peak for this graph is at about 1650 total votes on IMDB. This tells me that most episodes have about 1650 fan votes on IMDB. The distribution is right-skewed, so most of the data points are on the left side of the graph and the extreme outliers are at the far-right end.
There are four extreme outliers on the graph. The first one, at about the 4000 mark, is “Dinner Party”, which has a lot of votes because many people found the episode funny, relatable, and chaotic in a humorous way. The next one, at about 5700, is “Goodbye, Michael,” which has a lot of votes because it was a very emotional episode that had a lot of people crying, so fans wanted to express their emotions and talk about how great a character Michael was. The second to last one, at about the 5900 mark, is “Stress Relief,” which has a lot of votes because many people found it to be one of the funniest and best episodes of the show. It was also the most viewed episode on its original air date, so a lot of newcomers and devoted fans probably came to vote and express their opinions on it. Finally, “Finale” has the most votes on IMDB, about 7900, because it was the last episode of the show and a lot of fans wanted to express what they thought of the show overall and of its ending.
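The extreme values discussed for the three histograms can be looked up by title rather than read off the plots. The following is a minimal sketch, assuming office_ratings is a tibble with the columns described earlier; the helper name top_episodes is hypothetical.

```r
library(tidyverse)

# Hypothetical helper (not from the original report): return the n episodes
# with the largest values of one column, largest first.
top_episodes <- function(data, column, n = 4) {
  data %>%
    slice_max({{ column }}, n = n, with_ties = FALSE) %>%
    select(season, episode, title, {{ column }})
}
```

For example, top_episodes(office_ratings, total_votes) would list the four most-voted episodes, and top_episodes(office_ratings, viewers, n = 2) the two most-watched ones. Swapping slice_max() for slice_min() would give the lowest values instead.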
The question that needs investigating here is whether there is a relationship between the number of viewers of an episode and the average fan rating of that episode on IMDB.com. It seems reasonable that the more viewers an episode has, the more it is liked, and that the more a show is liked, the more viewers its episodes draw. We can therefore expect a direct relationship between the two variables: as the number of viewers of an episode increases, so does its IMDB.com rating. We can check this theory with a scatter plot:
ggplot(data = office_ratings) +
geom_point(mapping = aes(x = viewers, y = imdb_rating)) +
labs(x = "total number of viewers on the original air date (in millions)",
y = "average fan rating on IMDB.com (scale from 1 to 10)",
title = "Average Fan Rating as a Function of Viewership Total",
caption = "data obtained from githubusercontent.com")
Yes, there is some correlation between the number of people who watch an episode and how well liked the episode is on IMDB. The points trend up and to the right, so episodes with more viewers tend to have higher IMDB ratings, although the slope is steep. However, quite a few data points at the top left of the scatter plot contradict this correlation. There is also one outlier at the bottom right of the scatter plot that has quite a few viewers but is not rated as highly.
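The strength of the trend seen in the scatter plot can also be summarized with a correlation coefficient. This is only a sketch: the helper name episode_cor is hypothetical, it relies on base R's cor(), and it assumes the column names described earlier. Because viewers is right-skewed with a few extreme values, the rank-based Spearman version may be a useful cross-check on the default Pearson value.

```r
# Hypothetical helper (not from the original report): correlation between
# two numeric columns of the data set, ignoring rows with missing values.
episode_cor <- function(data, x = "viewers", y = "imdb_rating",
                        method = "pearson") {
  cor(data[[x]], data[[y]], use = "complete.obs", method = method)
}
```

Calling episode_cor(office_ratings) would give the Pearson correlation between viewers and imdb_rating, and episode_cor(office_ratings, method = "spearman") the rank-based one. A value near 0 would suggest that the visual trend is weaker than it looks.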
The problem that needs examining is whether there are any exceptions
to the trend in the previous scatter plot. We can’t infer much from the
graph in its current state. It would be better to map another variable
to an extra aesthetic so that the exceptions are easier to see. I think
mapping the season variable to the color aesthetic would help. Adding
the extra variable, season, may allow us to better understand which
data points are the exceptions and why. We can test this assumption by
using a scatter plot:
ggplot(data = office_ratings) +
geom_point(mapping = aes(x = viewers, y = imdb_rating, color = season)) +
labs(x = "total number of viewers on the original air date (in millions)",
y = "average fan rating on IMDB.com (scale from 1 to 10)",
title = "Average Fan Rating as a Function of Viewership Total",
color = "season number",
caption = "data obtained from githubusercontent.com")
From the graph above, some exceptions can be seen. For example, there are a couple of highly liked episodes, around and above the 9 IMDB rating mark, that had fewer than 10 million viewers when they first aired. My hypothesis for why many of the highly liked episodes also have low viewer totals is that these episodes are in seasons closer to the end of the show. The longer a show runs, the more people lose interest and stop watching altogether; only the most passionate and die-hard fans stay until the very end. There is also one data point at around the 11 million viewer mark that has an IMDB rating of only about 7.6. This outlier is so far from the rest of the scatter plot because it is the “Pilot”. Pilots are usually not the best episodes because they have to lay the groundwork for the rest of the show and don’t have much going on besides setting up the plot and introducing the characters. They also draw a lot of viewers because people want to see what the show is about and whether they want to continue watching it.
The question that we are exploring here is whether there could be a
correlation between the number of viewers of an episode and how many
total ratings are left on IMDB.com, and whether there are any exceptions
to the trend. To me, it seems sensible that the more people watch a
given episode, the more people will leave an IMDB rating for it. We
could presume a direct relationship between the two variables: as the
viewership of an episode rises, so does the total number of ratings
left for it on IMDB.com. To find any exceptions, we need to map an
extra variable to another aesthetic of the graph. Since we’re looking
for a correlation between viewership and total number of IMDB ratings,
mapping the season variable to the color aesthetic may prove useful.
Doing this could help us find the exceptions to this correlation and
why they might be exceptions. We can check these conjectures with two
scatter plots (the first for the correlation and the second for the
exceptions):
ggplot(data = office_ratings) +
geom_point(mapping = aes(x = viewers, y = total_votes)) +
labs(x = "total number of viewers on the original air date (in millions)",
y = "total amount of ratings on IMDB.com",
title = "Total Amount of Ratings as a Function of Viewership Total",
caption = "data obtained from githubusercontent.com")
Yes, there is some correlation between the number of people who watch an episode and how many people leave an IMDB rating. The points rise quickly up and to the right, and the overall shape looks roughly exponential. However, there are quite a few exceptions where an episode has a low number of viewers but a high number of total votes on IMDB.
ggplot(data = office_ratings) +
geom_point(mapping = aes(x = viewers, y = total_votes, color = season)) +
labs(x = "total number of viewers on the original air date (in millions)",
y = "total amount of ratings on IMDB.com",
title = "Total Amount of Ratings as a Function of Viewership Total",
color = "season number",
caption = "data obtained from githubusercontent.com")
A lot of the exceptions in the graph are episodes that had the most impact on the viewers. For example, the data point at the very top, with the most votes but not as many viewers, is the “Finale”. This episode is the ending of the show, so a lot of people would want to express their opinion of how the show ended and what they thought of the show as a whole by leaving an IMDB rating. There is also an interesting pattern from left to right: the later seasons, like 7-9, are on the left with the fewest viewers, while the early seasons, like 1-3, are on the right, where the show had the most engagement because it was still new at the time. My hypothesis is that as the show got older, only the most loyal fans stuck around and kept watching until the end, whereas the earlier seasons had a lot of viewers because the show was new when it first aired and people were interested in seeing what it was all about.
The problem that we are trying to analyze is how the show’s popularity has changed over time. To do this, we need to see the trend between the season in which an episode aired and how many total viewers it got. My hypothesis is that the show’s popularity will increase at the beginning and then decrease the longer the show goes on. I think this because shows usually gain viewers in the first couple of seasons, when people are intrigued by the new show and want to see what the hype is about, and then lose them in the last couple of seasons, when people start to get bored and stop watching. We can therefore expect a direct relationship between season and viewership in the first couple of seasons, and then an inverse relationship for the later seasons: as the season number increases, viewership rises at first and then falls. We can see if my hypothesis is true by using a box plot:
ggplot(data = office_ratings) +
# factor() draws one box per season even if season is stored as a number
geom_boxplot(mapping = aes(x = factor(season), y = viewers)) +
labs(x = "season which the episodes aired in",
y = "total number of viewers on the original air date (in millions)",
title = "Viewership Total as a Function of Season Number",
caption = "data obtained from githubusercontent.com")
The show’s popularity started off low in the first season, with a typical episode drawing only around 5 million viewers; the one exception was the “Pilot”, where lots of people tuned in. Viewership then gradually rose until the fourth season, where it peaked, and gradually went back down after that, to the point where the ninth season has typical viewership below that of the first season. There were some outliers from the trend, but most of the data points stayed on the same trajectory. Overall, the trend in the box plot is an arc, going up and then back down at the end.
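The per-season levels read off the box plot can be checked numerically. The following is a minimal sketch, assuming office_ratings has the season, viewers, and imdb_rating columns described earlier; the helper name season_medians is hypothetical. It uses the median because that is the center line a box plot draws.

```r
library(tidyverse)

# Hypothetical helper (not from the original report): one row per season with
# the episode count and the median viewership and median IMDB rating.
season_medians <- function(data) {
  data %>%
    group_by(season) %>%
    summarise(
      episodes       = n(),
      median_viewers = median(viewers, na.rm = TRUE),
      median_rating  = median(imdb_rating, na.rm = TRUE),
      .groups = "drop"
    )
}
```

Calling season_medians(office_ratings) would give a nine-row table in which the rise and fall across seasons can be read directly, without estimating values from the plot.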
The question that we would like to answer here is how the show’s appeal has changed over time. To answer it, we need to look at the trend between the season in which an episode aired and the average fan rating of the episode on IMDB.com. My hypothesis is that, as The Office goes on, the show’s appeal to viewers will grow in the earlier seasons but then diminish in the later seasons. I believe this because viewers usually love the beginning of a show, when there are good character arcs, good storytelling, and good character development. But as a show goes on, the writers and producers have a hard time making it as good as or better than it was previously, so the show loses its appeal and the ratings slowly dwindle because of weaker writing, characters, and plot points. We could presumably see a direct relationship between season and average fan rating in the first couple of seasons, and then an inverse relationship for the later seasons: as the season number goes up, the IMDB rating goes up at the start and then goes down at the end. A box plot would be best to see if my speculation is correct:
ggplot(data = office_ratings) +
# factor() draws one box per season even if season is stored as a number
geom_boxplot(mapping = aes(x = factor(season), y = imdb_rating)) +
labs(x = "season which the episodes aired in",
y = "average fan rating on IMDB.com (scale from 1 to 10)",
title = "Average Fan Rating as a Function of Season Number",
caption = "data obtained from githubusercontent.com")
The show’s appeal started off fairly well, with the first season having a typical IMDB rating of about 8. It then rose gradually until the fourth season, where it peaked. The following season had a dip in ratings, and after that the ratings went up and down until the last season. The trend does have some outliers, but most of the data points stayed on the same course. Overall, the trend for this graph is an arc, going up and then back down at the end.
The problem we are trying to get at here is whether the trends from the two previous graphs are more similar to or more different from each other. I believe that these two trends are more similar than they are different, for the following reasons.
The trends in the previous two problems are somewhat similar because they follow roughly the same trajectory. Both follow an arc: they start low, gradually go up, and fall lower at the end. I think the reason is that the two graphs describe closely related things. The first graph shows how, as the TV show goes on, it first gains a lot of viewers and then loses them, because viewers get tired of waiting for the next episode, lose interest in keeping up with the show, or stop for some other reason. The second graph shows how the show becomes better liked at first because it is interesting and funny, but as it goes on the writers start to run out of good story ideas and character arcs. This causes viewers to think the show lacks the quality it had before, so they rate the later seasons lower than the previous ones. Total viewership and fan liking of a show are closely related, meaning that if one starts to drop, the other tends to drop as well. This causes the two graphs from the previous problems to look similar.
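The comparison of the two trends is easier to judge when both box plots are drawn next to each other. The following is a sketch of one way to do that, not the method used in the report. The helper name plot_season_trends is hypothetical, and the code assumes the season, viewers, and imdb_rating columns described earlier.

```r
library(tidyverse)

# Hypothetical helper (not from the original report): box plots of viewers
# and imdb_rating by season, drawn in two panels for visual comparison.
plot_season_trends <- function(data) {
  data %>%
    select(season, viewers, imdb_rating) %>%
    # stack the two measures into one column so they can be faceted
    pivot_longer(c(viewers, imdb_rating),
                 names_to = "measure", values_to = "value") %>%
    ggplot(mapping = aes(x = factor(season), y = value)) +
    geom_boxplot() +
    # free_y gives each measure its own scale (millions vs. a 1 to 10 rating)
    facet_wrap(~ measure, scales = "free_y") +
    labs(x = "season which the episodes aired in", y = NULL)
}
```

Calling plot_season_trends(office_ratings) would draw the two arcs in adjacent panels. The free y scales are a deliberate choice: the two variables are measured in different units, so only the shapes of the trends are comparable, not their heights.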
The question being answered here is whether there is a connection
between the episode number within a season and the number of viewers of
an episode. Because this question looks at the trend within individual
seasons, it would be wise to also map the season variable to an extra
aesthetic; the color aesthetic would probably be best. These three
variables help to analyze the trend in viewership over the course of
the individual episodes in a given season. It could reasonably be
assumed that a season loses viewers as its episodes go on, so we would
expect something like an inverse relationship between the episode
number within the season and the number of viewers: as the episode
number rises, the viewership drops. There will possibly be some outlier
episodes that do better than the previous one, but generally the trend
should be downward because some people get bored and stop watching the
show. A scatter plot is best suited to test my theory:
ggplot(data = office_ratings) +
geom_point(mapping = aes(x = episode, y = viewers, color = season)) +
labs(x = "episode number within a given season",
y = "total number of viewers on the original air date (in millions)",
title = "Viewership Total as a Function of Episode Number",
color = "season which the episodes aired in",
caption = "data obtained from githubusercontent.com")
Yes, there is a trend in total viewership within the individual seasons. As the seasons progress, total viewership gradually decreases: the earlier seasons had a much higher viewership average than the later seasons. At the start of a new show there is usually a boost in viewership because of interest in the new show, so viewership on the original air date stayed high for the early seasons. One of the big changes in viewership came on the day of the Super Bowl: right after the game, the episode “Stress Relief” aired and drew a very large audience, which The Office used to bring in new viewers and bring old viewers back. For a time, this marketing technique worked. Viewership was never as high as on that day, either before or after the episode, but the episode did bring back a good number of viewers. The effect did not last long, though. As the show reached season seven, the numbers decreased, and after episode 21, “Goodbye, Michael”, viewership totals were never that high again. What remained until the end of the show’s run was the dedicated fan base who stayed to see how the show ended.
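With nine seasons overlaid in one scatter plot, the within-season trend can be hard to separate from the between-season trend. One possible alternative view, sketched below, gives each season its own panel. The helper name plot_within_season is hypothetical, and the code assumes the episode, viewers, and season columns described earlier.

```r
library(tidyverse)

# Hypothetical helper (not from the original report): viewers by episode
# number, one panel per season, so each season's own trend is visible.
plot_within_season <- function(data) {
  ggplot(data = data, mapping = aes(x = episode, y = viewers)) +
    geom_line() +
    geom_point() +
    facet_wrap(~ season) +
    labs(x = "episode number within a given season",
         y = "total number of viewers on the original air date (in millions)")
}
```

Calling plot_within_season(office_ratings) would show whether viewership really drifts downward as each season goes on. Because the panels share one y axis by default, a single very large value stretches the scale in every panel and flattens the other seasons.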
In summary, we can infer that as the seasons of a show go on, in
this case The Office, the viewership for that show starts out
strong, with a lot of people wanting to get in on the hype around the
new show, but then decreases as the show runs on and the writers and
producers run out of ideas, so fans stop watching. Also, as the
viewership for an episode increases, the IMDB rating for that episode
tends to increase as well, and vice versa, because the two work in
tandem. Lastly, the substance of an individual episode can either make
or break the show’s success moving forward into later seasons. All in
all, every variable in office_ratings plays a part in the data set and
in the visualizations that are created to decipher meaningful
information from it.