---
title: "R Notebook"
author: Noah Szarejko
date: 9/9/2024
output: html_notebook
---

### The Office: Rating and Viewership Trends Over Time

The Office is one of the most popular TV shows ever made, so how did viewership affect its ratings and popularity over time?


The data used for all visualizations in this project is sourced from <https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-03-17/readme.md> and <https://en.wikipedia.org/wiki/List_of_The_Office_(American_TV_series)_episodes>.

```{r}
library(tidyverse)
```
We use the tidyverse (in particular ggplot2) to create the visualizations.

```{r include=FALSE}
library(DT)
```
We use the DT package to display the data as an interactive table.
```{r}
datatable(office_ratings)
```
This is the data we will be using for the visualizations. It has 7 variables and 186 entries. The variables are `season` (season number), `episode` (episode number), `title` (episode title), `viewers` (number of viewers in millions), `imdb_rating` (average fan rating from 1 to 10), `total_votes` (number of votes on IMDb), and `air_date` (date the episode originally aired).
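
The notebook does not show how `office_ratings` is loaded, so it can help to confirm that the data frame really has the seven columns listed above before plotting. The chunk below is a minimal sketch in base R; `check_office_columns()` is a hypothetical helper, and the lowercase column names are assumed to match the names used in the plotting code.

```{r}
# Sketch: confirm the data frame has the seven columns described above.
# `check_office_columns()` is a hypothetical helper, not part of any package.
expected_cols <- c("season", "episode", "title", "viewers",
                   "imdb_rating", "total_votes", "air_date")

check_office_columns <- function(df, expected = expected_cols) {
  # Names that are expected but absent from the data frame
  missing <- setdiff(expected, names(df))
  if (length(missing) > 0) {
    stop("Missing columns: ", paste(missing, collapse = ", "))
  }
  invisible(TRUE)
}
```

Calling `check_office_columns(office_ratings)` stops with an error that names any missing column, and otherwise returns `TRUE` invisibly.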

# The Office trends and peaks in viewership, rating, and total votes
 
```{r}
# Votes vs. viewers for every episode, colored by IMDb rating
ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = viewers, y = total_votes, color = imdb_rating)) +
  labs(x = "Viewers (millions)",
       y = "Total votes (IMDb)",
       color = "Average rating",
       title = "The Office: IMDb votes vs. viewership")
```
On average, the graph follows the pattern that the more people watch an episode, the better it is liked. This trend holds for almost every point on the graph, but there are some major outliers that sit well outside it.

So why do we have such large outliers that don't match the pattern?

# Addressing outliers
```{r}
# Viewers for each episode number, colored by season
ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = episode, y = viewers, color = season)) +
  labs(x = "Episode number",
       y = "Viewers (millions)",
       color = "Season",
       title = "The Office: viewership by episode number")
```
This graph breaks down viewership by episode so the outlier is easier to see. The major outlier is episode 13 of season 5. According to the Wikipedia episode list (https://en.wikipedia.org/wiki/List_of_The_Office_(American_TV_series)_episodes), season 5 episode 13 was a two-part episode whose viewership was added together, which explains its high viewership but lower rating and votes.

Our other outlier had low viewership but high ratings and a high vote count. This episode was the series finale. It is most likely an outlier because mainly long-time fans watched the finale, and long-time fans are more likely to leave reviews and vote on an episode.
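
Rather than reading the outliers off the plot, the same episodes can be looked up directly by sorting the data. The chunk below is a small sketch in base R; `top_episodes()` is a hypothetical helper, and it assumes the column names used in the plots above.

```{r}
# Sketch: list the episodes with the largest values of one column.
# `top_episodes()` is a hypothetical helper; column names follow the notebook.
top_episodes <- function(df, column, n = 3) {
  # Sort rows from the largest to the smallest value of the chosen column
  ordered <- df[order(df[[column]], decreasing = TRUE), ]
  # Keep only the columns needed to identify and compare episodes
  head(ordered[, c("season", "episode", "title", "viewers",
                   "imdb_rating", "total_votes")], n)
}
```

For example, `top_episodes(office_ratings, "viewers")` lists the most-watched episodes, and `top_episodes(office_ratings, "total_votes")` lists the episodes with the most votes, so both outliers can be checked against the Wikipedia episode list.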



## Viewership and rating trends 

```{r}
# Same scatter plot with a smoothed trend line (se = FALSE hides the confidence band)
ggplot(data = office_ratings) +
  geom_point(mapping = aes(x = viewers, y = total_votes, color = imdb_rating)) +
  geom_smooth(mapping = aes(x = viewers, y = total_votes), se = FALSE) +
  labs(x = "Viewers (millions)",
       y = "Total votes (IMDb)",
       color = "Average rating",
       title = "The Office: IMDb votes vs. viewership with trend line")
```
Here we can see the trend line for viewers versus total votes for each episode. There is a general upward trend: the more viewers an episode gets, the more votes it gets. A more interesting question is why there is a large bump at the start of the graph. I don't have a great answer for this; perhaps these episodes were not very good, so many people voted even though not many watched.
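
The upward trend can also be put into a single number. The chunk below is a sketch that computes the Pearson correlation between `viewers` and `total_votes`, optionally leaving out the single most-watched episode to see how much that one outlier affects the result. `viewer_vote_cor()` is a hypothetical helper, and Pearson correlation is only one possible summary of the relationship.

```{r}
# Sketch: Pearson correlation between viewers and votes, with or without
# the single most-watched episode. `viewer_vote_cor()` is a hypothetical helper.
viewer_vote_cor <- function(df, drop_top_viewed = FALSE) {
  if (drop_top_viewed) {
    # Remove the row with the highest viewer count
    df <- df[-which.max(df$viewers), ]
  }
  # use = "complete.obs" skips episodes with a missing value in either column
  cor(df$viewers, df$total_votes, use = "complete.obs")
}
```

Comparing `viewer_vote_cor(office_ratings)` with `viewer_vote_cor(office_ratings, drop_top_viewed = TRUE)` shows whether the trend depends heavily on that one episode.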
  
  
# How did the show’s popularity change over time?
```{r}
# factor(season) gives one box per season even if season is stored as a number
ggplot(data = office_ratings) +
  geom_boxplot(mapping = aes(x = factor(season), y = viewers)) +
  labs(x = "Season",
       y = "Viewers (millions)",
       title = "The Office: viewership by season")
```
We can see from the box plot that the median viewership for each season followed a fairly predictable popularity curve, with the popularity of The Office peaking around seasons 3, 4, and 5 before starting to fall off during seasons 7 through 9.

```{r}
# IMDb vote counts per season; factor(season) gives one box per season
ggplot(data = office_ratings) +
  geom_boxplot(mapping = aes(x = factor(season), y = total_votes)) +
  labs(x = "Season",
       y = "Total votes (IMDb)",
       title = "The Office: IMDb vote count by season")
```
This trend can also be seen in the number of votes on IMDb, which decreases steadily over time apart from the larger outliers discussed earlier. The average number of votes declined as time went on. Over time, the show seems to have appealed less to the average viewer and more to the true fans who stayed to watch and leave votes and reviews.

```{r}
# IMDb ratings per season; factor(season) gives one box per season
ggplot(data = office_ratings) +
  geom_boxplot(mapping = aes(x = factor(season), y = imdb_rating)) +
  labs(x = "Season",
       y = "IMDb rating",
       title = "The Office: IMDb rating by season")
```
We can see this in the box plot of IMDb rating by season: ratings peak in seasons 3 through 5 and then decrease from season 5 onward. This suggests that fewer and fewer people watched and voted as time went on, seemingly because viewers didn't like seasons 6, 7, 8, and 9.
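
The medians that the three box plots show can also be tabulated, which makes the per-season comparison easier to read than the plots alone. The chunk below is a sketch using base R's `aggregate()`; `season_medians()` is a hypothetical helper, and it assumes the column names used in the plots above.

```{r}
# Sketch: per-season medians of the three metrics shown in the box plots.
# `season_medians()` is a hypothetical helper built on base R's aggregate().
season_medians <- function(df) {
  # One row per season; rows with a missing value in any of the three
  # metrics are dropped by aggregate()'s default na.action
  aggregate(cbind(viewers, total_votes, imdb_rating) ~ season,
            data = df, FUN = median)
}
```

`season_medians(office_ratings)` returns one row per season with the median viewers, votes, and rating side by side.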


## Conclusion 
 
Overall, based on the data, The Office's viewership and popularity stayed roughly the same for most of its lifespan, with a large decrease in viewership in seasons 7, 8, and 9. However, I think the most interesting thing about the data is how the vote count peaks in the first season and then stays on a negative trend until the last season. I believe this is due to the show changing over time and driving away long-time fans, who are more likely to go online and leave reviews of the show. Maybe the viewers grew bored or didn't like how the show changed? This suggests that The Office kept its popularity throughout its run, but its appeal to long-time fans faded as time went on. More data would be necessary to further support my proposed answer, such as counts of repeat voters or a poll asking viewers if and why they stopped watching. With the data available, though, I think it's a good answer.
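
One rough way to check the claimed negative trend in vote counts is to fit a straight line of `total_votes` against season number and look at the sign of the slope. The chunk below is only a sketch under that assumption; `vote_trend_per_season()` is a hypothetical helper, and a linear fit is a simplification that ignores outliers such as the series finale.

```{r}
# Sketch: slope of a straight-line fit of total_votes against season number.
# `vote_trend_per_season()` is a hypothetical helper; a negative result means
# votes per episode tend to fall from one season to the next.
vote_trend_per_season <- function(df) {
  # as.character() first so a factor season converts to its label,
  # not to its internal level index
  df$season_num <- as.numeric(as.character(df$season))
  fit <- lm(total_votes ~ season_num, data = df)
  unname(coef(fit)["season_num"])
}
```

`vote_trend_per_season(office_ratings)` gives the estimated change in votes per episode for each additional season.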


