Regression Model Diagnostics

Introduction

This week, we extend last week’s regression model by adding new predictors to explore more complex relationships between variables in the TMDB TV show dataset. We’ll evaluate the expanded model using regression diagnostics, identifying any issues with model assumptions and assessing the significance of added predictors.

Extending the Regression Model

Response Variable: The response variable is vote_average, which serves as an important measure of audience satisfaction with a show.

Explanatory Variables:

1. number_of_episodes (continuous): The total number of episodes of a TV show.

2. avg_episodes_per_season (continuous): Calculated as number_of_episodes / number_of_seasons, representing the average number of episodes per season. Shows with zero seasons are excluded because the ratio is undefined.

3. binary_genre_comedy (binary): An indicator for genre, where 1 represents Comedy and 0 represents other genres. As coded below (genres == "Comedy"), the indicator is 1 only when the genres field is exactly "Comedy", so a show that lists Comedy alongside other genres is coded 0. This allows us to assess whether comedies are rated differently from other shows.
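How a genre indicator is coded can change which shows count as comedies. The following is a minimal sketch on hypothetical toy data, assuming the genres column may hold a comma-separated list of genres (an assumption about the file, not something confirmed in this report). It contrasts an exact match with a pattern match; the names toy, comedy_exact, and comedy_any are illustrative only.

```r
# Hypothetical toy data; the real genres column is assumed to look similar
toy <- data.frame(
  genres = c("Comedy", "Comedy, Drama", "Drama", NA),
  stringsAsFactors = FALSE
)

# Exact match: 1 only when the whole field equals "Comedy"; NA stays NA
toy$comedy_exact <- ifelse(toy$genres == "Comedy", 1, 0)

# Pattern match: 1 whenever "Comedy" appears anywhere in the field
toy$comedy_any <- as.integer(grepl("Comedy", toy$genres, fixed = TRUE))
toy$comedy_any[is.na(toy$genres)] <- NA  # keep missing genres as missing

print(toy)
```

Under the exact match, the "Comedy, Drama" row is coded 0, while the pattern match codes it 1. Which definition is appropriate depends on the question being asked; either way, rows with a missing genre remain NA and would be dropped by a later is.na() filter.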

Adding these variables allows us to:

- Examine how content type (genre) and episode count jointly relate to ratings. Note that the model below includes main effects only; it does not contain an interaction term.
- Separate a show's total length from its per-season density by including avg_episodes_per_season, which may relate to ratings differently from number_of_episodes alone.
- Test a specific genre effect by including Comedy as a binary indicator, on the assumption that comedies may have distinct characteristics that influence viewer ratings.

# Load necessary libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)
library(ggplot2)
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

# Load and prepare the dataset
tv_data <- read_csv("/Users/saransh/Downloads/TMDB_tv_dataset_v3.csv")

## Rows: 168639 Columns: 29

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (18): name, original_language, overview, backdrop_path, homepage, origi...
## dbl   (7): id, number_of_seasons, number_of_episodes, vote_count, vote_avera...
## lgl   (2): adult, in_production
## date  (2): first_air_date, last_air_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Add calculated and binary variables
tv_data <- tv_data |>
  mutate(
    avg_episodes_per_season = ifelse(number_of_seasons != 0, number_of_episodes / number_of_seasons, NA),
    binary_genre_comedy = ifelse(genres == "Comedy", 1, 0)
  ) |>
  # Remove rows with NA, NaN, or Inf values in the relevant columns
  filter(
    !is.na(vote_average),
    !is.na(number_of_episodes),
    !is.na(avg_episodes_per_season),
    !is.na(binary_genre_comedy),
    is.finite(avg_episodes_per_season)
  )

# Build the expanded regression model
expanded_model <- lm(vote_average ~ number_of_episodes + avg_episodes_per_season + binary_genre_comedy, data = tv_data)
summary(expanded_model)

## 
## Call:
## lm(formula = vote_average ~ number_of_episodes + avg_episodes_per_season + 
##     binary_genre_comedy, data = tv_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.388  -3.436  -1.928   3.577   6.600 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             3.400e+00  1.369e-02 248.416  < 2e-16 ***
## number_of_episodes      1.031e-03  7.437e-05  13.869  < 2e-16 ***
## avg_episodes_per_season 3.437e-03  3.371e-04  10.197  < 2e-16 ***
## binary_genre_comedy     1.486e-01  3.955e-02   3.756 0.000173 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.697 on 96341 degrees of freedom
## Multiple R-squared:  0.004893,   Adjusted R-squared:  0.004862 
## F-statistic: 157.9 on 3 and 96341 DF,  p-value: < 2.2e-16

Intercept: The intercept of 3.400 represents the expected average rating (vote_average) for a TV show with zero episodes, zero average episodes per season, and not categorized as a comedy. Although a hypothetical scenario, it serves as a baseline reference point in the model.

number_of_episodes (Coefficient: 0.001031): This coefficient suggests that for each additional episode, the average rating is expected to increase by 0.001, holding all other variables constant. While statistically significant (p-value < 2e-16), the effect size is very small, indicating that the number of episodes alone has a minor impact on a show’s rating.

avg_episodes_per_season (Coefficient: 0.003437): For each additional episode in the average episodes per season, the average rating is expected to increase by 0.0034, holding all other variables constant. This effect is also statistically significant but relatively small, suggesting that the pacing or density of episodes within a season has a limited influence on ratings.

binary_genre_comedy (Coefficient: 0.1486): Shows categorized as comedies have an expected average rating that is 0.15 points higher than non-comedies, holding other variables constant. The p-value (0.000173) indicates that this result is statistically significant. This implies that genre, specifically the comedy genre, can have a modest positive effect on the ratings.

R-squared (0.0049): The R-squared value of 0.0049 implies that only about 0.49% of the variance in the ratings is explained by this model. This low value suggests that while the predictors included are statistically significant, they are of little practical use for predicting ratings. With more than 96,000 observations, even very small effects reach statistical significance. Other factors, such as storytelling quality, acting, or production values, likely contribute much more to the variance in ratings, and these are not captured in this dataset.

Adjusted R-squared (0.00486): This value is similar to the R-squared, indicating no large penalty from including multiple predictors. However, the model still explains a very small amount of the variance.

F-statistic (157.9) and p-value (< 2.2e-16): The F-test p-value indicates that the model as a whole is statistically significant. This confirms that at least one of the predictors contributes to explaining the variance in ratings.

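Residuals: The residual standard error of 3.697 suggests that observed ratings typically deviate from the model’s predicted ratings by about 3.7 rating points, indicating considerable variability in ratings that the model does not account for. The residual quartiles are also asymmetric (median of -1.928 against a third quartile of 3.577), which already hints that the residuals are not normally distributed.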
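The summary statistics interpreted above can be reproduced from the residuals directly. The following is a minimal sketch on simulated data (the variables x1, x2, and y are hypothetical, not columns of the TMDB dataset) that computes R-squared, adjusted R-squared, and the residual standard error by hand and compares them with what summary() reports.

```r
# Simulated data; x1, x2, y are hypothetical stand-ins
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 3 + 0.5 * x1 + rnorm(n, sd = 2)
fit <- lm(y ~ x1 + x2)

res <- residuals(fit)
rss <- sum(res^2)               # residual sum of squares
tss <- sum((y - mean(y))^2)     # total sum of squares
p   <- length(coef(fit)) - 1    # number of predictors (excluding intercept)

r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
rse    <- sqrt(rss / df.residual(fit))

# Compare with the values stored in the model summary
s <- summary(fit)
all.equal(c(r2, adj_r2, rse), c(s$r.squared, s$adj.r.squared, s$sigma))
```

Applying the adjusted R-squared formula to the reported values (R-squared of 0.004893, 96,345 observations, 3 predictors) gives approximately 0.004862, which matches the model output.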

Insights and Implications

Influence of show length and genre: The small but statistically significant coefficients for number_of_episodes and avg_episodes_per_season imply that show length and density have limited but positive associations with ratings. Comedies are rated slightly higher on average, which may be a factor for content creators to consider.

Given the low R-squared value, this model’s predictors do not capture a substantial portion of the factors influencing ratings. This suggests that other elements, such as subjective audience preferences, production quality, or critical acclaim, might play a larger role in determining a show’s success.

Multicollinearity Check

We will use the Variance Inflation Factor (VIF) to check for multicollinearity, which helps identify if any predictors are strongly correlated with each other.

# Check for multicollinearity using VIF
vif(expanded_model)

##      number_of_episodes avg_episodes_per_season     binary_genre_comedy 
##                1.156898                1.157799                1.000921

The results provided above indicate the variance inflation factor (VIF) values for each variable included in the multiple regression model.

number_of_episodes (VIF = 1.1569): This VIF is close to 1, indicating very low multicollinearity with the other variables. The total number of episodes is only weakly linearly related to the other predictors in the model, which is ideal.
avg_episodes_per_season (VIF = 1.1578): Similarly, the VIF for the average episodes per season is close to 1, indicating very low multicollinearity. This means that avg_episodes_per_season does not exhibit a strong linear relationship with the other predictors, even though it is derived from number_of_episodes.
binary_genre_comedy (VIF = 1.0009): The VIF for this binary variable (indicating whether the genre is Comedy) is also close to 1, showing no signs of multicollinearity. This indicates that binary_genre_comedy is essentially uncorrelated with number_of_episodes and avg_episodes_per_season. A VIF near 1 rules out a linear relationship among predictors, not every form of dependence.
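To make the VIF values above concrete: for a predictor j, the VIF is 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors. The following is a minimal sketch on simulated data using only base R; the helper vif_by_hand and the variables x1, x2, x3 are hypothetical and are not part of the car package.

```r
# Simulated predictors; names are hypothetical
set.seed(2)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.4 * x1 + rnorm(n)    # mildly correlated with x1
x3 <- rbinom(n, 1, 0.3)      # binary predictor
dat <- data.frame(x1, x2, x3)

# VIF for each predictor from its auxiliary regression on the others
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    aux <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vifs <- vif_by_hand(dat, c("x1", "x2", "x3"))
print(vifs)
```

Read in reverse, a VIF of about 1.157 corresponds to an auxiliary R-squared of roughly 0.136, meaning the other predictors explain about 14% of the variance in that predictor.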

Model Diagnostics

Before analyzing the diagnostic plots, it’s important to keep in mind the assumptions of linear regression:

Linearity: The relationship between predictors and response is linear.
Independence: Observations are independent of each other.
Homoscedasticity: Constant variance of residuals.
Normality: Residuals should follow a normal distribution.
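The homoscedasticity assumption listed above can also be checked numerically, as a complement to the plots. The following is a minimal sketch on simulated data of a studentized (Koenker-style) Breusch-Pagan statistic computed by hand: the squared residuals are regressed on the predictors, and n times the auxiliary R-squared is compared with a chi-squared distribution. The variables x and y are hypothetical, and this is one possible check rather than the procedure used in this report.

```r
# Simulated data whose error spread grows with x (heteroscedastic by design)
set.seed(5)
n <- 400
x <- runif(n, 1, 10)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.3 * x)
fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictor(s)
aux <- lm(residuals(fit)^2 ~ x)
k   <- length(coef(aux)) - 1                 # number of auxiliary predictors

stat    <- n * summary(aux)$r.squared        # test statistic, n * R^2
p_value <- pchisq(stat, df = k, lower.tail = FALSE)

print(c(statistic = stat, df = k, p_value = p_value))
```

A small p-value is evidence against constant residual variance. With very large samples, such as the one used here, tests of this kind reject for even minor departures, so the plots remain the more informative guide to how serious the problem is.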

# Diagnostic plots for expanded_model
par(mfrow = c(2, 2)) # Arrange plots in a 2x2 grid
plot(expanded_model)

The diagnostic plots provided are crucial for evaluating the assumptions of linear regression and understanding potential issues with the model fit.

Residuals vs. Fitted Plot: This plot helps assess the assumption of linearity and the homoscedasticity (constant variance) of residuals. The residuals do not appear to be randomly scattered around the horizontal line at zero; there is a noticeable curve, indicating that the model may not capture all the underlying patterns in the data. This curved pattern suggests that the model may be missing important non-linear relationships, and it violates the assumption of linearity.
Q-Q Plot (Quantile-Quantile Plot): This plot assesses whether the residuals follow a normal distribution. The points deviate substantially from the diagonal line, especially at both ends. This departure indicates that the residuals are not normally distributed. The lack of normality in residuals can impact the validity of significance tests and confidence intervals.
Scale-Location Plot (Spread-Location Plot): This plot checks for homoscedasticity by displaying the spread of residuals across fitted values. The red line is not flat and increases with fitted values, suggesting increasing variance of residuals as fitted values increase. This non-constant spread of residuals confirms the heteroscedasticity issue seen in the Residuals vs. Fitted plot.
Residuals vs. Leverage Plot: This plot helps identify influential points that might unduly affect the model fit. Points with high leverage or high standardized residuals are potential outliers. A few points, such as observations labeled 22770, 77146, and 68387, have higher leverage and may significantly influence the model. These influential points could be outliers or points with extreme values that affect the regression results.
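The points highlighted in the Residuals vs. Leverage plot can also be listed numerically. The following is a minimal sketch on simulated data using base R influence measures (hatvalues, rstandard, cooks.distance). The cutoffs used (leverage above 2p/n, absolute standardized residual above 3, Cook's distance above 4/n) are common rules of thumb chosen here for illustration, not thresholds taken from this report.

```r
# Simulated data with one planted high-leverage, poorly fitted point
set.seed(3)
n <- 100
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)
x[1] <- 8
y[1] <- -5
fit <- lm(y ~ x)

infl <- data.frame(
  obs      = seq_len(n),
  leverage = hatvalues(fit),       # diagonal of the hat matrix
  std_res  = rstandard(fit),       # standardized residuals
  cooks_d  = cooks.distance(fit)   # combined influence measure
)

# Rule-of-thumb cutoffs (conventions, not hard rules)
p <- length(coef(fit))
flagged <- subset(infl, leverage > 2 * p / n | abs(std_res) > 3 | cooks_d > 4 / n)

# Most influential observations first
print(flagged[order(-flagged$cooks_d), ])
```

Flagged rows are candidates for inspection, not automatic deletion: a show with an extreme episode count may be a perfectly valid observation.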

Recommendations and Next Steps

Consider transforming the response variable (e.g., a log transformation, or log1p if vote_average contains zeros, since log(0) is undefined) to address non-linearity and heteroscedasticity.
Experiment with adding polynomial or interaction terms to better capture any non-linear relationships.
Examine points with high leverage or large residuals to determine if they are valid observations or potential outliers that may distort the model.
Since the model has a low R-squared, explore adding other relevant predictors, such as production company, network, or specific genres.
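As an illustration of the polynomial and interaction terms suggested above, the following is a minimal sketch on simulated data. The variables rating, episodes, and comedy are hypothetical stand-ins, and the simulated relationship is built to contain both curvature and an interaction, so the result says nothing about the TMDB data itself. Nested models are compared with anova().

```r
# Simulated data with curvature in episodes and an episodes-by-comedy interaction
set.seed(4)
n <- 300
episodes <- runif(n, 1, 100)
comedy   <- rbinom(n, 1, 0.4)
rating   <- 5 + 0.06 * episodes - 0.0005 * episodes^2 +
  0.5 * comedy + 0.01 * comedy * episodes + rnorm(n)
sim <- data.frame(rating, episodes, comedy)

# Three nested models: linear, quadratic, quadratic with interaction
m_linear <- lm(rating ~ episodes + comedy, data = sim)
m_poly   <- lm(rating ~ poly(episodes, 2) + comedy, data = sim)
m_inter  <- lm(rating ~ poly(episodes, 2) * comedy, data = sim)

# Sequential F-tests: does each added set of terms reduce residual variance?
print(anova(m_linear, m_poly, m_inter))
```

The same pattern could be applied to the expanded model by swapping in the real column names, then re-running the diagnostic plots to see whether the curvature in the Residuals vs. Fitted plot is reduced.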

Insights

The results show that while the number of episodes and genre may impact ratings, their effects are minor compared to other, unobserved factors. This suggests that the quality of content or other qualitative factors likely play a more significant role in influencing show ratings.

Further Questions

Would a non-linear regression model capture the relationship between episode count and ratings more effectively?
Could additional variables improve the model?
What role do subjective qualities play?
Do viewer engagement metrics have a larger impact?