R Notebook - Week 8

Introduction

This week’s data dive involves running ANOVA tests and building regression models using the TMDB TV show dataset, which includes detailed information about various TV shows, such as ratings, genres, episodes, and more. We will focus on identifying relationships between explanatory variables and a response variable of interest. The insights derived from these analyses could be valuable for understanding factors influencing TV show ratings.

Selecting the Response and Explanatory Variables

Response Variable: The response variable is vote_average (a show’s average user rating), which serves as an indicator of viewer satisfaction.

Explanatory Variable (Categorical): The categorical explanatory variable is genres, which may be associated with a show’s average rating.

Performing ANOVA to Test for Differences in Average Ratings across Genres

Null Hypothesis (H0): The mean rating (vote_average) is the same for every genre. The alternative is that at least one genre has a different mean rating.

Because genres has far more than 10 distinct values, we restrict the analysis to the 10 most frequent genre values rather than comparing every category.

# Load necessary libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(readr)

# Load the dataset
tv_data <- read_csv("/Users/saransh/Downloads/TMDB_tv_dataset_v3.csv")

## Rows: 168639 Columns: 29

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (18): name, original_language, overview, backdrop_path, homepage, origi...
## dbl   (7): id, number_of_seasons, number_of_episodes, vote_count, vote_avera...
## lgl   (2): adult, in_production
## date  (2): first_air_date, last_air_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Keep only the 10 most frequent genre values (the column has many more categories)
top_10_genres <- tv_data |>
  group_by(genres) |>
  summarize(show_count = n()) |>
  top_n(10, show_count) |>
  pull(genres)

tv_data_filtered <- tv_data |>
  filter(genres %in% top_10_genres)

# Run the ANOVA test
anova_result <- aov(vote_average ~ genres, data = tv_data_filtered)
cat("\n")

summary(anova_result)

##                Df Sum Sq Mean Sq F value Pr(>F)    
## genres          8  27325    3416   266.3 <2e-16 ***
## Residuals   62444 800761      13                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 68926 observations deleted due to missingness
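
The missingness note in the output is worth a closer look. The sketch below uses a small made-up data frame (the toy values and the name top_2 are assumptions, not taken from the TMDB data) to show how a missing genre can be counted as one of the most frequent groups, survive a %in% filter, and only be dropped later when the model is fitted. It is an illustration of the mechanism, not a confirmed diagnosis of this dataset.

```r
# Toy data (hypothetical): three genre labels plus several missing genres
toy <- data.frame(
  genres = c("Drama", "Drama", "Comedy", NA, NA, NA, "Reality"),
  vote_average = c(7.1, 6.8, 6.0, 0, 0, 0, 5.5)
)

# Count shows per genre, treating NA as its own group (as group_by() does)
counts <- sort(table(toy$genres, useNA = "ifany"), decreasing = TRUE)
top_2 <- names(counts)[1:2]          # the two largest groups: NA and "Drama"

# %in% matches NA to NA, so rows with a missing genre are kept by the filter
kept <- toy[toy$genres %in% top_2, ]
nrow(kept)                           # rows kept by the filter
sum(is.na(kept$genres))              # how many of them have no genre
nlevels(factor(kept$genres))         # genre levels a model would actually see

# One way to avoid this: drop missing genres before ranking
toy_known <- toy[!is.na(toy$genres), ]
```

If the same thing happens in the full data, removing rows with a missing genre before selecting the top 10 would leave 10 real genre values in the comparison.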

Interpretation

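Degrees of Freedom (Df): The “genres” term has 8 degrees of freedom, so 9 genre categories enter the comparison. This is one fewer than the 10 groups selected; a likely explanation is that missing genres (NA) counted as one of the top 10 groups and were then dropped by aov(), which is consistent with the 68,926 observations deleted due to missingness. The residual degrees of freedom are 62,444, the number of observations used (62,453) minus the number of groups (9).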

Sum of Squares (Sum Sq): The sum of squares for “genres” is 27,325, which indicates the variation in the average ratings that is explained by the genre categories. The residual sum of squares is 800,761, representing the variation in ratings that is not explained by the genres.

Mean Square (Mean Sq): The mean square for “genres” is 3,416, obtained by dividing the sum of squares by the degrees of freedom (27,325 / 8). The residual mean square is 13, calculated as 800,761 / 62,444.

F value: The F-statistic is 266.3, the ratio of the mean square between genres to the residual mean square within genres. The rounded table values give 3,416 / 13 ≈ 263; R computes the ratio from the unrounded mean squares.

p-value (Pr(>F)): The p-value is less than 2e-16, which is extremely small. This indicates a statistically significant result.
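
The arithmetic behind the table can be checked by hand. The sketch below copies the degrees of freedom and sums of squares from the output above and recomputes the mean squares, the F-statistic, and the p-value. It also computes eta squared (the share of total variation attributed to genre), which is not part of the original output and is added here as a simple effect-size measure.

```r
# Values copied from the ANOVA table above
df_genres <- 8
df_resid  <- 62444
ss_genres <- 27325
ss_resid  <- 800761

# Mean square = sum of squares / degrees of freedom
ms_genres <- ss_genres / df_genres
ms_resid  <- ss_resid / df_resid

# F = between-group mean square / within-group (residual) mean square
f_value <- ms_genres / ms_resid

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_genres, df_resid, lower.tail = FALSE)

# Eta squared: proportion of total variation associated with genre
eta_sq <- ss_genres / (ss_genres + ss_resid)

round(c(ms_genres = ms_genres, ms_resid = ms_resid, f_value = f_value), 2)
p_value < 2e-16
round(eta_sq, 3)
```

Eta squared comes out at roughly 0.033, so genre is associated with only about 3.3% of the variation in ratings even though the F-test is highly significant. With more than 60,000 observations, small differences are easy to detect.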

Implications

Since the p-value is much smaller than the common significance level (α = 0.05), we reject the null hypothesis. This means that there is strong evidence to conclude that there are significant differences in the average ratings among different genres.

What would it mean?

The significant result suggests that a show’s genre is associated with its average rating: mean ratings differ across genres, so some genres may be more appealing to audiences than others. Because the data are observational, this does not show that genre causes higher or lower ratings, and the ANOVA alone does not say which genres differ.

For content producers or streaming services, focusing on genres that typically receive higher ratings could be beneficial. This insight can help guide decisions about which genres to prioritize or explore further for content development.
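
To learn which genres actually differ, a post-hoc comparison is a common follow-up to a significant ANOVA. The sketch below runs Tukey’s HSD on simulated data; the three genre labels, group means, and sample sizes are made up for illustration and are not results from the TMDB data.

```r
# Simulated ratings for three hypothetical genres (not TMDB data)
set.seed(1)
toy <- data.frame(
  genres = factor(rep(c("Comedy", "Drama", "Reality"), each = 50)),
  vote_average = c(rnorm(50, mean = 6, sd = 1),
                   rnorm(50, mean = 7, sd = 1),
                   rnorm(50, mean = 5, sd = 1))
)

# One-way ANOVA, then all pairwise comparisons with Tukey's HSD
toy_fit <- aov(vote_average ~ genres, data = toy)
tukey <- TukeyHSD(toy_fit, "genres", conf.level = 0.95)

# One row per genre pair: difference in means, confidence bounds, adjusted p-value
tukey$genres
```

On the real data, the same call on anova_result would give one row per pair of genre values, which is a more direct basis for deciding which genres to prioritize than the overall F-test.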

Building a Linear Regression Model

The chosen continuous explanatory variable is number_of_episodes, which may be associated with a show’s average rating.

We will fit a linear regression model that uses number_of_episodes to predict vote_average.

# Fit a linear regression model
linear_model <- tv_data |>
  lm(vote_average ~ number_of_episodes, data = _)

# Display the model summary
summary(linear_model)

## 
## Call:
## lm(formula = vote_average ~ number_of_episodes, data = tv_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.493  -2.290  -2.278   3.665   7.725 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.275e+00  8.512e-03   267.3   <2e-16 ***
## number_of_episodes 2.386e-03  6.213e-05    38.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.439 on 168637 degrees of freedom
## Multiple R-squared:  0.008668,   Adjusted R-squared:  0.008662 
## F-statistic:  1474 on 1 and 168637 DF,  p-value: < 2.2e-16

Interpretation

The linear regression model was fitted to explore the relationship between the number of episodes and the average rating (vote_average) of TV shows. The results are as follows:

Intercept: The intercept is estimated to be 2.275, with a standard error of 0.0085. This means that when the number of episodes is zero, the predicted average rating is approximately 2.275. Although this value may not have practical meaning in this context (as a TV show would usually have at least one episode), it serves as a baseline for the regression equation.
Slope (number_of_episodes): The coefficient for the number of episodes is estimated to be 0.002386, with a standard error of 0.000062. This indicates that for each additional episode, the average rating is expected to increase by approximately 0.0024. The p-value is less than 2e-16, so the association is statistically significant, but the effect is very small: significance here reflects the large sample size, not a strong relationship.
Residuals: The residuals range from -44.493 to 7.725. Because TMDB ratings lie between 0 and 10, a residual of -44.493 means the model predicted a rating far above 10 for at least one show, which can only happen for a show with an extremely large number of episodes. The median residual (-2.278) is close to the negative of the intercept, which suggests that many shows have a vote_average of 0 (possibly unrated shows); this would be worth checking before modeling. The residual standard error is 3.439, indicating the typical size of the error in the model’s predictions.
R-squared: The R-squared value is 0.008668, which means that only about 0.87% of the variability in the average ratings is explained by the number of episodes. This low R-squared value indicates that the number of episodes alone is not a strong predictor of the ratings.
Adjusted R-squared: The adjusted R-squared (0.008662) is almost identical to the R-squared, which is expected with a single predictor and a very large sample. It confirms that the model explains less than 1% of the variability in ratings.
F-statistic: The F-statistic is 1474 with a p-value of less than 2.2e-16, indicating that the model as a whole is statistically significant. However, statistical significance does not necessarily imply practical significance, as indicated by the low R-squared value.
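
To make the size of the slope concrete, the sketch below plugs the fitted coefficients from the summary above into the regression equation. The helper predict_rating is introduced here for illustration, and the last step assumes that vote_average cannot be negative, which is how TMDB ratings are scaled.

```r
# Coefficients copied from the model summary above
intercept <- 2.275
slope     <- 2.386e-03

# Fitted line: predicted rating for a given number of episodes
predict_rating <- function(episodes) intercept + slope * episodes

# Predicted ratings across a wide range of episode counts
episodes <- c(1, 10, 100, 1000)
data.frame(episodes, predicted = predict_rating(episodes))

# The most negative residual is observed - predicted = -44.493.
# If observed ratings are never below 0, the prediction for that show was
# at least 44.493, which requires at least this many episodes:
min_residual <- -44.493
episodes_needed <- (-min_residual - intercept) / slope
episodes_needed
```

An extra 100 episodes moves the predicted rating by only about 0.24 points, and the largest negative residual implies a show with well over 17,000 episodes. A handful of such extreme shows can have a large influence on the fitted line.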

Implication

Although the number of episodes is statistically significant as a predictor of average ratings, the relationship is weak, as indicated by the low R-squared value. This suggests that the number of episodes alone does not adequately explain variations in TV show ratings.
Since the model explains less than 1% of the variance in the ratings, other factors likely play a more substantial role in influencing the average ratings of TV shows. These factors may include the show’s genre, network, production company, or even qualitative aspects like plot quality and character development.

Recommendations

To improve the model’s predictive power, we can consider including other variables such as genres, networks, or production companies. These factors may better capture the variations in ratings.
Given the weak linear relationship, it may be worth exploring non-linear models or transformations of the variables to capture more complex relationships.
Interaction effects between variables, such as the interaction between genres and the number of episodes, may provide further insights into how different factors jointly influence ratings.
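
The recommendations above can be sketched on simulated data. Everything in the block below is hypothetical: the genre labels, the data-generating formula, and the choice of log1p() as the transformation are assumptions made for illustration, and the numbers it produces say nothing about the TMDB data.

```r
# Simulated shows (not TMDB data): genre, episode count, and rating
set.seed(42)
n <- 300
toy <- data.frame(
  genre = factor(sample(c("Comedy", "Drama", "Reality"), n, replace = TRUE)),
  number_of_episodes = rpois(n, lambda = 40) + 1
)
toy$vote_average <- 5 + 0.8 * (toy$genre == "Drama") +
  0.3 * log1p(toy$number_of_episodes) + rnorm(n, sd = 1)

# 1. Transformed predictor only
fit_log <- lm(vote_average ~ log1p(number_of_episodes), data = toy)
# 2. Add genre as a second predictor
fit_add <- lm(vote_average ~ genre + log1p(number_of_episodes), data = toy)
# 3. Let the episode effect differ by genre (interaction)
fit_int <- lm(vote_average ~ genre * log1p(number_of_episodes), data = toy)

# Adjusted R-squared for each model
sapply(list(log = fit_log, additive = fit_add, interaction = fit_int),
       function(m) summary(m)$adj.r.squared)

# F-tests comparing the nested models step by step
anova(fit_log, fit_add, fit_int)
```

Applied to the real data, the same three-step comparison would show whether genre adds explanatory power beyond episode count and whether the interaction term is worth its extra parameters.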

Overall Insights

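The ANOVA results show that mean ratings differ significantly across genres, and the regression shows a statistically significant but very small association between the number of episodes and ratings. Both effects are small in practical terms: genre accounts for about 3.3% of the variation in ratings (27,325 / (27,325 + 800,761)), and the number of episodes for about 0.87%. The two analyses use different subsets of the data, so this comparison is only indicative, but it suggests that content type (genre) is more informative about ratings than show length.
Genre and episode count may also be related: certain genres (e.g., Drama or Documentary) may lend themselves to longer or shorter formats, so part of the association between episode count and ratings could reflect genre rather than length itself.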
These findings can help guide strategic decisions for content creators, such as focusing on producing high-quality content in popular genres rather than simply increasing the episode count.

Further Questions to Investigate

Are Certain Genres More Sensitive to Show Length?
What Other Variables Might Better Explain Ratings?
How Do Outliers or High-Variance Genres Impact Ratings?
Would a Multi-factor Model Improve Predictive Accuracy?
Could Non-linear Models Capture More Complex Relationships?