Overview

Dataset Description

This project uses a TV show dataset from TMDB (The Movie Database). It contains 168,639 rows and 29 columns of metadata on TV shows, including the number of seasons and episodes, languages, networks, ratings, and genres.

The dataset includes the following key variables:

  • number_of_seasons: The number of seasons the TV show has aired.
  • number_of_episodes: The total number of episodes across all seasons.
  • vote_average: The average rating of the show, calculated from user votes.
  • genres: The genre(s) associated with the show (e.g., Drama, Crime, Sci-Fi).
  • networks: The networks on which the show is broadcast.

Project Goal

The main goal of this project is to investigate the relationship between TV show characteristics (e.g., the number of episodes, genres, networks) and their viewer ratings. Specifically, I want to answer the question:

  • Do TV shows with more episodes or seasons tend to have higher viewer ratings?

This will help explore whether there is a trend in audience preferences based on show length or production characteristics.

Visualizations

  • Visualization 1: Average Vote Rating by Genre
# Load necessary libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(readr)

# Load dataset
tv_data <- read_csv("/Users/saransh/Downloads/TMDB_tv_dataset_v3.csv")
## Rows: 168639 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (18): name, original_language, overview, backdrop_path, homepage, origi...
## dbl   (7): id, number_of_seasons, number_of_episodes, vote_count, vote_avera...
## lgl   (2): adult, in_production
## date  (2): first_air_date, last_air_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Group by genres and summarize show_count and avg_rating
genre_group <- tv_data |>
  group_by(genres) |>
  summarize(show_count = n(), avg_rating = mean(vote_average, na.rm = TRUE)) |>
  arrange(desc(show_count))

# Limit the visualization to the top 40 genres by show_count
top_40_genres <- genre_group |> slice_max(order_by = show_count, n = 40)

# Create the bar plot for top 40 genres
ggplot(top_40_genres, aes(x = reorder(genres, -avg_rating), y = avg_rating)) +
  geom_col(fill = "steelblue") +
  labs(title = "Average Vote Ratings by Top 40 Genres", x = "Genres", y = "Average Rating") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8))

This bar chart shows the mean vote average for the 40 most common values of the genres column. Genres such as Drama and Documentary tend to have higher averages, which suggests that shows in these genres appeal more to the audiences who rate them. The chart compares means only, so it cannot show whether the differences hold across a whole genre or are driven by a few highly rated shows.

Further investigation could explore whether these high ratings are influenced by other factors, such as the number of episodes or the networks that broadcast these shows.
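
One caveat for the genre chart: if the genres column stores several genres per show in one comma-separated string (an assumption about this file, though needing a "top 40" cut-off hints at it), then group_by(genres) groups by genre combination rather than by individual genre. A minimal sketch of splitting the column first with tidyr::separate_rows(), using a small made-up data frame (toy_shows is hypothetical, not the TMDB data):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for tv_data; all values are invented.
toy_shows <- tibble(
  name = c("A", "B", "C", "D"),
  genres = c("Drama, Crime", "Drama", "Documentary", "Crime, Documentary"),
  vote_average = c(8.0, 6.0, 7.0, 5.0)
)

# Put each genre on its own row, then summarize per individual genre.
genre_single <- toy_shows |>
  separate_rows(genres, sep = ",\\s*") |>
  group_by(genres) |>
  summarize(show_count = n(),
            avg_rating = mean(vote_average, na.rm = TRUE)) |>
  arrange(desc(show_count))

print(genre_single)
```

With this approach a show with several genres counts once toward each of them, so the per-genre counts no longer sum to the number of shows.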

  • Visualization 2: Average Episodes per Season vs. Vote Average
# Create new column for average episodes per season
# Note: a show with number_of_seasons == 0 yields Inf or NaN here
tv_data <- tv_data |>
  mutate(avg_episodes_per_season = number_of_episodes / number_of_seasons)

# Visualization: Average Episodes per Season vs Vote Average
ggplot(tv_data, aes(x = avg_episodes_per_season, y = vote_average)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Number of Episodes per Season vs. Vote Average", 
       x = "Average Episodes per Season", y = "Vote Average") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 22428 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 22393 rows containing missing values or values outside the scale range
## (`geom_point()`).

This scatter plot shows the relationship between the average number of episodes per season and the vote average. The red regression line has a slight upward slope, suggesting a very weak positive association: shows with more episodes per season might have slightly higher ratings. The relationship is not strong, implying that the number of episodes per season is not a major factor in audience ratings. Note that ggplot2 dropped more than 22,000 rows with missing or non-finite values, so the plot covers only part of the 168,639 shows.

The presence of potential outliers, such as shows with an unusually high number of episodes but low ratings, warrants further exploration to understand if these deviations are due to specific genres or networks.
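
The warnings about removed rows most likely come from the division itself: in R, dividing by zero gives Inf (or NaN for 0/0), so shows with zero recorded seasons produce non-finite ratios. A minimal sketch of guarding the division and counting the affected rows, on made-up data (toy_shows is hypothetical):

```r
library(dplyr)

# Hypothetical toy data; the last two rows have no recorded seasons.
toy_shows <- tibble(
  number_of_episodes = c(20, 10, 0, 5),
  number_of_seasons  = c(2, 1, 0, 0)
)

toy_shows <- toy_shows |>
  mutate(
    # Unguarded ratio: 0/0 is NaN and 5/0 is Inf.
    raw_ratio = number_of_episodes / number_of_seasons,
    # Guarded ratio: NA whenever no seasons are recorded.
    avg_episodes_per_season = if_else(number_of_seasons > 0,
                                      number_of_episodes / number_of_seasons,
                                      NA_real_)
  )

# Count rows that a plot or model would have to drop.
n_non_finite <- sum(!is.finite(toy_shows$raw_ratio))
print(n_non_finite)
```

Reporting this count next to the plot makes it explicit how much of the dataset the scatter plot actually represents.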

Plan Moving Forward

  • Hypothesis Testing: Test the relationships between the number of episodes/seasons and viewer ratings to see if there is statistical significance.
  • Data Cleaning: Address any missing values or inconsistencies in the dataset (e.g., missing episode counts or vote averages).
  • Refining Visualizations: Further refine visualizations to include other variables such as networks or languages, which might offer additional insights.
  • Explore Other Variables: Explore the effect of networks and production companies on viewer ratings to gain a more comprehensive understanding.
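
As a sketch of what the hypothesis-testing step could look like, the example below runs a Spearman rank correlation test with base R's cor.test(). The data frame is invented for illustration, and dropping shows with vote_count == 0 is an assumption about how unrated shows should be handled, not a step taken in this analysis.

```r
# Hypothetical toy data; the real analysis would use tv_data.
toy_shows <- data.frame(
  number_of_episodes = c(6, 10, 12, 24, 48, 100, 8, 13),
  vote_average       = c(6.1, 7.0, 6.8, 7.4, 8.2, 7.9, 0, 0),
  vote_count         = c(15, 40, 22, 310, 120, 900, 0, 0)
)

# Assumption: a show with no votes carries no rating information.
rated <- subset(toy_shows, vote_count > 0)

# Spearman's rho is rank-based, so it is less sensitive to skewed episode counts.
# exact = FALSE avoids a warning when tied values are present.
test_result <- cor.test(rated$number_of_episodes, rated$vote_average,
                        method = "spearman", exact = FALSE)

print(test_result$estimate)  # correlation coefficient (rho)
print(test_result$p.value)
```

A rank-based test is used here because episode counts are typically right-skewed; a Pearson test would be the alternative if the relationship looked linear.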

Initial Findings

Hypotheses

  • Hypothesis 1: TV shows with more seasons or episodes tend to have higher viewer ratings. This hypothesis assumes that longer-running shows may build larger fan bases, leading to higher ratings.

  • Hypothesis 2: TV shows in specific genres (e.g., Drama or Sci-Fi) receive higher average ratings than others. This hypothesis is based on the assumption that certain genres are more popular among audiences, resulting in higher ratings.

Hypothesis Visualizations

  • Visualization 1: Number of Seasons vs Vote Average
# Visualization: Number of Seasons vs Vote Average
ggplot(tv_data, aes(x = number_of_seasons, y = vote_average)) +
  geom_point(color = "darkgreen") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Number of Seasons vs. Vote Average", x = "Number of Seasons", y = "Vote Average") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

This plot suggests a positive but weak relationship between the number of seasons and vote average. Longer-running shows might attract larger audiences, but the length of a show alone does not guarantee higher ratings. Other factors, such as genre or network, might have a stronger influence.
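
To put a number on how weak the trend is, the slope and R-squared of the fitted line can be read from lm(), which fits the same straight-line model that geom_smooth(method = "lm") draws. A minimal sketch on invented data (toy_shows is hypothetical):

```r
# Hypothetical toy data; the real analysis would use tv_data.
toy_shows <- data.frame(
  number_of_seasons = c(1, 1, 2, 3, 5, 8, 10),
  vote_average      = c(6.5, 7.2, 6.9, 7.0, 7.4, 7.1, 7.6)
)

# Simple linear regression of rating on number of seasons.
fit <- lm(vote_average ~ number_of_seasons, data = toy_shows)

slope     <- unname(coef(fit)["number_of_seasons"])  # rating change per extra season
r_squared <- summary(fit)$r.squared                  # share of variance explained

print(slope)
print(r_squared)
```

An R-squared close to zero would confirm that the number of seasons explains little of the variation in ratings, even when the slope is positive.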

  • Visualization 2: Genres vs Vote Average
# Limit the visualization to the top 100 genres by avg_rating
# (slice_max() already returns rows sorted in descending order)
top_100_genres <- genre_group |>
  slice_max(order_by = avg_rating, n = 100)

# Create the bar plot for top 100 genres by vote average
ggplot(top_100_genres, aes(x = reorder(genres, -avg_rating), y = avg_rating)) +
  geom_col(fill = "steelblue") +
  labs(title = "Average Vote Ratings by Top 100 Genres", x = "Genres", y = "Average Rating") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 6))

This visualization shows the 100 genres with the highest average vote ratings. Genres like Documentary and Drama continue to rank highly, which is consistent with Hypothesis 2 that certain genres attract higher ratings. Because the ranking uses average rating alone, genres with very few shows can also reach the top, so the pattern is suggestive rather than conclusive. Further investigation could focus on whether these trends are consistent across different networks or regions.
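
Ranking by mean rating alone lets genres with only a handful of shows reach the top. One possible safeguard is to require a minimum show count before ranking. In the sketch below, the threshold min_shows and the table toy_genre_group are both hypothetical; the table only mimics the columns of genre_group.

```r
library(dplyr)

# Hypothetical summary table with the same columns as genre_group.
toy_genre_group <- tibble(
  genres     = c("Drama", "Documentary", "Comedy", "Rare Combo"),
  show_count = c(500, 300, 400, 1),
  avg_rating = c(6.8, 7.1, 6.2, 10.0)
)

min_shows <- 50  # assumed cut-off; tune it to the data

# Drop sparsely populated genres, then keep the highest-rated ones.
top_genres <- toy_genre_group |>
  filter(show_count >= min_shows) |>
  slice_max(order_by = avg_rating, n = 2)

print(top_genres)
```

Here the single-show "Rare Combo" row, which would otherwise top the ranking, is excluded before the top entries are selected.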

Conclusion

Based on the initial findings:

  • Hypothesis 1: The data show only a weak positive relationship between the number of seasons and viewer ratings. While longer-running shows may attract larger audiences, other factors such as genre or network likely play a stronger role in determining ratings.

  • Hypothesis 2: The data are consistent with this hypothesis, with certain genres (e.g., Drama and Documentary) receiving higher average ratings. This suggests that genre is related to how viewers rate shows, although the differences have not yet been tested for statistical significance.
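
Whether the genre differences are statistically meaningful, rather than only visible in the bar charts, could be checked with a rank-based test such as Kruskal-Wallis (kruskal.test() in base R). This is a suggested check, not something run in this analysis, and the data below are invented:

```r
# Hypothetical toy data: five invented ratings for each of three genres.
toy_shows <- data.frame(
  genres = rep(c("Drama", "Documentary", "Comedy"), each = 5),
  vote_average = c(7.1, 6.8, 7.5, 7.0, 6.9,
                   7.8, 7.4, 8.0, 7.6, 7.9,
                   5.9, 6.2, 6.0, 6.4, 5.8)
)

# Null hypothesis: ratings come from the same distribution in every genre.
kw <- kruskal.test(vote_average ~ genres, data = toy_shows)

print(kw$statistic)  # Kruskal-Wallis chi-squared
print(kw$p.value)
```

A small p-value would indicate that at least one genre differs from the others; it would not say which one, so pairwise follow-up comparisons would still be needed.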

Next Steps