Introduction

This week, we extend last week’s regression model by adding new predictors to explore more complex relationships between variables in the TMDB TV show dataset. We’ll evaluate the expanded model using regression diagnostics, identifying any issues with model assumptions and assessing the significance of added predictors.

Extending the Regression Model

Response Variable: The response variable is vote_average, chosen because it serves as a key measure of audience satisfaction and show popularity.

Explanatory Variables:

1. number_of_episodes (continuous): The total number of episodes of a TV show.

2. avg_episodes_per_season (continuous): Calculated as number_of_episodes / number_of_seasons, representing the average number of episodes per season. It is undefined (NA) for shows with zero seasons, which are dropped before fitting.

3. binary_genre_comedy (binary): A binary variable for genre, coded 1 when the show’s genres field is exactly "Comedy" and 0 otherwise, so a show that lists Comedy alongside other genres is coded 0. This allows us to assess whether comedies are rated differently from other shows.
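
To make the coding of the comedy flag concrete, here is a minimal sketch on a made-up four-row table. The show names and genre strings are invented for illustration, and the sketch assumes that a multi-genre show stores its genres as one comma-separated string. It contrasts an exact match on the genres field with a substring match using base R’s grepl().

```r
# Toy data: hypothetical shows, not rows from the TMDB dataset
toy <- data.frame(
  name   = c("A", "B", "C", "D"),
  genres = c("Comedy", "Comedy, Drama", "Drama", NA),
  stringsAsFactors = FALSE
)

# Exact match: 1 only when the whole field equals "Comedy"; NA stays NA
toy$comedy_exact <- ifelse(toy$genres == "Comedy", 1, 0)

# Substring match: 1 whenever "Comedy" appears anywhere in the field
toy$comedy_any <- ifelse(
  is.na(toy$genres),
  NA_real_,
  ifelse(grepl("Comedy", toy$genres, fixed = TRUE), 1, 0)
)

print(toy)
```

Under these assumptions the two rules disagree only for show B. Which rule is appropriate depends on how genres is stored in the real file; in both cases a missing genres value yields NA, so that row is removed by an !is.na() filter.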

Adding these variables allows us to:

- Examine how content type (genre) and episode count relate to ratings within a single model. The model fitted below is additive, so testing a true interaction would require an explicit interaction term.
- Capture a non-linear combination of the raw counts through avg_episodes_per_season, which may relate to ratings differently from number_of_episodes alone.
- Test a specific genre effect by coding Comedy as a binary indicator, on the assumption that comedies have distinct characteristics that influence viewer ratings.
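
On the first point, an interaction between the comedy flag and episode count could be specified with R’s * formula operator. The sketch below is illustrative only: the data are simulated, and the sample size, coefficients, and noise level are arbitrary assumptions, so the output says nothing about the TMDB data.

```r
set.seed(1)  # simulated data, for illustration only
n <- 500
sim <- data.frame(
  number_of_episodes  = rpois(n, lambda = 30),
  binary_genre_comedy = rbinom(n, size = 1, prob = 0.3)
)

# Arbitrary data-generating process with a built-in interaction
sim$vote_average <- 5 + 0.01 * sim$number_of_episodes +
  0.3 * sim$binary_genre_comedy +
  0.02 * sim$number_of_episodes * sim$binary_genre_comedy +
  rnorm(n, sd = 1)

# a * b expands to a + b + a:b (both main effects plus their interaction)
interaction_model <- lm(
  vote_average ~ number_of_episodes * binary_genre_comedy,
  data = sim
)
coef(interaction_model)
```

The coefficient named number_of_episodes:binary_genre_comedy estimates how much the per-episode slope differs for comedies compared with other shows.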

# Load necessary libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(ggplot2)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
# Load and prepare the dataset
tv_data <- read_csv("/Users/saransh/Downloads/TMDB_tv_dataset_v3.csv")
## Rows: 168639 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (18): name, original_language, overview, backdrop_path, homepage, origi...
## dbl   (7): id, number_of_seasons, number_of_episodes, vote_count, vote_avera...
## lgl   (2): adult, in_production
## date  (2): first_air_date, last_air_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Add calculated and binary variables
tv_data <- tv_data |>
  mutate(
    avg_episodes_per_season = ifelse(number_of_seasons != 0, number_of_episodes / number_of_seasons, NA),
    binary_genre_comedy = ifelse(genres == "Comedy", 1, 0)
  ) |>
  # Remove rows with NA, NaN, or Inf values in the relevant columns
  filter(
    !is.na(vote_average),
    !is.na(number_of_episodes),
    !is.na(avg_episodes_per_season),
    !is.na(binary_genre_comedy),
    is.finite(avg_episodes_per_season)
  )

# Build the expanded regression model
expanded_model <- lm(vote_average ~ number_of_episodes + avg_episodes_per_season + binary_genre_comedy, data = tv_data)
summary(expanded_model)
## 
## Call:
## lm(formula = vote_average ~ number_of_episodes + avg_episodes_per_season + 
##     binary_genre_comedy, data = tv_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.388  -3.436  -1.928   3.577   6.600 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             3.400e+00  1.369e-02 248.416  < 2e-16 ***
## number_of_episodes      1.031e-03  7.437e-05  13.869  < 2e-16 ***
## avg_episodes_per_season 3.437e-03  3.371e-04  10.197  < 2e-16 ***
## binary_genre_comedy     1.486e-01  3.955e-02   3.756 0.000173 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.697 on 96341 degrees of freedom
## Multiple R-squared:  0.004893,   Adjusted R-squared:  0.004862 
## F-statistic: 157.9 on 3 and 96341 DF,  p-value: < 2.2e-16

Intercept: The intercept of 3.400 represents the expected average rating (vote_average) for a TV show with zero episodes, zero average episodes per season, and not categorized as a comedy. Although a hypothetical scenario, it serves as a baseline reference point in the model.

number_of_episodes (Coefficient: 0.001031): This coefficient suggests that for each additional episode, the average rating is expected to increase by 0.001, holding all other variables constant. While statistically significant (p-value < 2e-16), the effect size is very small, indicating that the number of episodes alone has a minor impact on a show’s rating.

avg_episodes_per_season (Coefficient: 0.003437): For each additional episode in the average episodes per season, the average rating is expected to increase by 0.0034, holding all other variables constant. This effect is also statistically significant but relatively small, suggesting that the pacing or density of episodes within a season has a limited influence on ratings.

binary_genre_comedy (Coefficient: 0.1486): Shows categorized as comedies have an expected average rating that is 0.15 points higher than non-comedies, holding other variables constant. The p-value (0.000173) indicates that this result is statistically significant. This implies that genre, specifically the comedy genre, can have a modest positive effect on the ratings.
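
As a worked example of how the coefficients combine, the sketch below plugs the rounded estimates from the summary into the fitted equation for a hypothetical show with 20 episodes across 2 seasons (10 per season). The show is invented for illustration.

```r
# Rounded estimates copied from the summary() output above
b0 <- 3.400      # intercept
b1 <- 0.001031   # number_of_episodes
b2 <- 0.003437   # avg_episodes_per_season
b3 <- 0.1486     # binary_genre_comedy

# Hypothetical show: 20 episodes over 2 seasons
episodes   <- 20
per_season <- episodes / 2

pred_other  <- b0 + b1 * episodes + b2 * per_season  # not a comedy
pred_comedy <- pred_other + b3                       # same show, coded as comedy

round(c(other = pred_other, comedy = pred_comedy), 3)
```

The two predictions differ by exactly the comedy coefficient, and the episode terms together add only about 0.05 points to the intercept, which illustrates how small the episode effects are in practical terms.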

R-squared (0.0049): The R-squared value of 0.0049 implies that only about 0.49% of the variance in the ratings is explained by this model. This low value suggests that while the predictors included are statistically significant, they are not practically substantial for predicting ratings. Other factors, such as storytelling quality, acting, or production values, likely contribute much more to the variance in ratings, and none of them is captured in this dataset.

Adjusted R-squared (0.00486): This value is similar to the R-squared, indicating no large penalty from including multiple predictors. However, the model still explains a very small amount of the variance.

F-statistic (157.9) and p-value (< 2.2e-16): The F-test p-value indicates that the model as a whole is statistically significant. This confirms that at least one of the predictors contributes to explaining the variance in ratings.
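
The adjusted R-squared and the F-statistic are both functions of R-squared, the number of predictors, and the residual degrees of freedom. As a consistency check, the sketch below recomputes them from the rounded values in the summary; small discrepancies due to rounding are expected.

```r
# Values copied from the summary() output above
r2       <- 0.004893          # Multiple R-squared (rounded)
p        <- 3                 # number of predictors
df_resid <- 96341             # residual degrees of freedom
n        <- df_resid + p + 1  # observations used in the fit

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

round(adj_r2, 6)
round(f_stat, 1)
```

With more than 96,000 observations the penalty factor (n - 1) / df_resid is almost exactly 1, which is why the adjusted value barely differs from R-squared, and why even a very small R-squared produces a large F-statistic.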

Residuals: The residual standard error of 3.697 suggests that the typical deviation of observed ratings from the model’s predicted ratings is approximately 3.7 rating points, indicating considerable variability in ratings that the model does not account for. The minimum residual of -18.388 also shows that at least one show is predicted more than 18 points above its observed rating.

Insights and Implications

Multicollinearity Check

We will use the Variance Inflation Factor (VIF) to check for multicollinearity, which helps identify if any predictors are strongly correlated with each other.

# Check for multicollinearity using VIF
vif(expanded_model)
##      number_of_episodes avg_episodes_per_season     binary_genre_comedy 
##                1.156898                1.157799                1.000921

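The VIF values above are all close to 1 (about 1.16 for number_of_episodes and avg_episodes_per_season, and 1.00 for binary_genre_comedy), well below the commonly used warning thresholds of 5 or 10. Multicollinearity is therefore not a concern in this model, even though the two episode variables are related by construction.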
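
For intuition about what the VIF measures, the sketch below computes it by hand as 1 / (1 - R^2), where R^2 comes from regressing each predictor on the others. The data are simulated and manual_vif is a hypothetical helper written for this illustration, not a function from the car package.

```r
set.seed(42)  # simulated predictors, for illustration only
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.4 * x1 + rnorm(n)   # mildly correlated with x1 by construction
x3 <- rnorm(n)              # unrelated to the others
y  <- 1 + x1 + x2 + x3 + rnorm(n)
sim <- data.frame(y, x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vif_values <- manual_vif(sim, c("x1", "x2", "x3"))
round(vif_values, 3)
```

A VIF of 1 means a predictor is uncorrelated with the others; larger values mean its coefficient’s variance is inflated by that factor relative to the uncorrelated case.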

Model Diagnostics

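Before analyzing the diagnostic plots, it’s important to keep in mind the assumptions of linear regression: a linear relationship between the predictors and the response, independent errors, constant error variance (homoscedasticity), and approximately normally distributed residuals.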

# Diagnostic plots for expanded_model
par(mfrow = c(2, 2)) # Arrange plots in a 2x2 grid
plot(expanded_model)

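By default, plot() on an lm object produces four diagnostic plots: Residuals vs Fitted (linearity), Normal Q-Q (normality of residuals), Scale-Location (constant variance), and Residuals vs Leverage (influential observations). Together they are used to evaluate the assumptions of linear regression and to flag potential problems with the model fit.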
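
The quantities behind these diagnostic plots can also be extracted directly, which helps when a plot needs to be followed up numerically. The sketch below uses a small simulated regression (the data and the model are assumptions made for illustration) and base R accessor functions only.

```r
set.seed(7)  # simulated data, for illustration only
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n)
fit <- lm(y ~ x)

diag_df <- data.frame(
  fitted    = fitted(fit),          # x-axis of Residuals vs Fitted
  resid     = residuals(fit),       # y-axis of Residuals vs Fitted
  std_resid = rstandard(fit),       # used in Normal Q-Q and Scale-Location
  leverage  = hatvalues(fit),       # x-axis of Residuals vs Leverage
  cooks_d   = cooks.distance(fit)   # contour lines on Residuals vs Leverage
)

# Observations with the largest Cook's distance are the most influential
head(diag_df[order(-diag_df$cooks_d), ], 5)
```

Applying the same accessors to expanded_model would list the shows that most influence the fit, for example those with unusually large episode counts.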

Recommendations and Next Steps

Insights

The results show that while the number of episodes, average episodes per season, and the comedy indicator are each statistically associated with ratings, their effects are minor compared to other, unobserved factors: together they explain less than 0.5% of the variance. This suggests that the quality of content or other qualitative factors likely play a more significant role in influencing show ratings.

Further Questions