Regression Model Diagnostics

Introduction

This data dive focuses on building a logistic regression model using the TMDB TV show dataset. The goal is to model a binary response variable that represents whether a TV show has a high average rating, using relevant predictors. Logistic regression is suitable for modeling the probability of a binary outcome, allowing us to explore how factors such as show length and genre influence high ratings.

Response Variable

Binary Response Variable: For this task, we define Binary_high_rating as the response variable, indicating whether a TV show has a high average rating (1 for high rating, 0 otherwise). We classify a show as having a “high rating” if its vote_average is above 7 out of 10.

Explanatory Variables

We selected three variables that may influence the likelihood of a high rating:

1. number_of_episodes (continuous): Total number of episodes aired for a show.
2. avg_episodes_per_season (continuous): Average episodes per season, calculated as number_of_episodes / number_of_seasons. Shows with zero seasons have no defined value and are excluded.
3. binary_genre_comedy (binary): 1 if the show's genres field is exactly “Comedy”, 0 otherwise. Shows that list Comedy alongside other genres are therefore coded 0.

These variables allow us to investigate whether factors like show length, density of episodes, or genre type (Comedy or non-Comedy) contribute to a high rating.
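The model-building code in this document sets binary_genre_comedy with an exact comparison, genres == "Comedy". As a minimal sketch of what that choice implies, the example below compares exact matching with substring matching on a few made-up genre strings. The strings are hypothetical, and the sketch assumes the genres column can hold comma-separated lists; substring matching is shown only as an alternative, not as what this analysis uses.

```r
# Hypothetical genre strings; the real `genres` column is assumed to look similar.
genres <- c("Comedy", "Comedy, Drama", "Drama", NA)

# Exact match, as in the model-building code:
# only shows whose genres field is exactly "Comedy" get 1.
exact_match <- ifelse(genres == "Comedy", 1, 0)

# Substring match (an alternative, not used in this analysis):
# any show that lists Comedy among its genres gets 1; missing genres stay NA.
any_match <- ifelse(is.na(genres), NA,
                    as.numeric(grepl("Comedy", genres, fixed = TRUE)))

data.frame(genres, exact_match, any_match)
```

Under exact matching, the "Comedy, Drama" row falls into the non-Comedy group, which is worth keeping in mind when reading the Comedy coefficient.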

Building the Logistic Regression Model

# Load necessary libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)
library(ggplot2)
library(broom)

# Load and prepare the dataset
tv_data <- read_csv("/Users/saransh/Downloads/TMDB_tv_dataset_v3.csv")

## Rows: 168639 Columns: 29

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (18): name, original_language, overview, backdrop_path, homepage, origi...
## dbl   (7): id, number_of_seasons, number_of_episodes, vote_count, vote_avera...
## lgl   (2): adult, in_production
## date  (2): first_air_date, last_air_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Create the binary response (high rating if vote_average > 7) and the two derived predictors,
# then drop rows where either derived predictor is missing
tv_data <- tv_data |>
  mutate(
    Binary_high_rating = ifelse(vote_average > 7, 1, 0),
    avg_episodes_per_season = ifelse(number_of_seasons != 0, number_of_episodes / number_of_seasons, NA),
    binary_genre_comedy = ifelse(genres == "Comedy", 1, 0)
  ) |>
  filter(!is.na(avg_episodes_per_season), !is.na(binary_genre_comedy))

# Build the logistic regression model
logistic_model <- glm(Binary_high_rating ~ number_of_episodes + avg_episodes_per_season + binary_genre_comedy, 
                      data = tv_data, family = binomial)
summary(logistic_model)

## 
## Call:
## glm(formula = Binary_high_rating ~ number_of_episodes + avg_episodes_per_season + 
##     binary_genre_comedy, family = binomial, data = tv_data)
## 
## Coefficients:
##                           Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)             -1.128e+00  8.654e-03 -130.291  < 2e-16 ***
## number_of_episodes       1.933e-04  4.763e-05    4.059 4.93e-05 ***
## avg_episodes_per_season  1.336e-03  2.196e-04    6.086 1.16e-09 ***
## binary_genre_comedy     -1.260e-01  2.549e-02   -4.945 7.60e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 107914  on 96344  degrees of freedom
## Residual deviance: 107794  on 96341  degrees of freedom
## AIC: 107802
## 
## Number of Fisher Scoring iterations: 4

Interpretation of Model Coefficients

Each coefficient in a logistic regression model represents the change in the log-odds of the response (here, the log-odds of a high rating) associated with a one-unit increase in the predictor, holding all other variables constant. Log-odds are not probabilities; they must be passed through the logistic function to obtain a probability.

Intercept (-1.128): The intercept is the log-odds of a high rating when all predictors are zero, that is, for a non-Comedy show with zero episodes. This scenario is hypothetical, but the intercept serves as the baseline from which probabilities are calculated.
number_of_episodes (0.0001933): For each additional episode, the log-odds of a high rating increase by 0.0001933, holding the other predictors constant. This effect is statistically significant (p < 0.001) but very small, so episode count alone has little impact on the likelihood of a high rating.
avg_episodes_per_season (0.001336): For each one-episode increase in the average number of episodes per season, the log-odds of a high rating increase by 0.001336, holding the other predictors constant. Although statistically significant, this effect is small, suggesting that episode density has only a slight impact on high ratings.
binary_genre_comedy (-0.126): Shows classified as Comedy have log-odds of a high rating that are 0.126 lower than those of non-Comedy shows, holding the other predictors constant. This statistically significant negative coefficient suggests that Comedy shows are less likely to receive high ratings than shows in other genres.
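To make the log-odds scale more concrete, the sketch below plugs the reported coefficients into the logistic function for two made-up shows that differ only in genre. The helper predict_high_rating and the example shows are hypothetical and exist only for illustration; the coefficients are copied (rounded) from the summary() output.

```r
# Coefficients copied from the summary() output above (rounded as printed).
b <- c(intercept = -1.128, episodes = 1.933e-04,
       avg_per_season = 1.336e-03, comedy = -0.126)

# Hypothetical helper: build the linear predictor (log-odds), then convert it
# to a probability with the logistic function plogis().
predict_high_rating <- function(episodes, avg_per_season, comedy) {
  log_odds <- b[["intercept"]] +
    b[["episodes"]] * episodes +
    b[["avg_per_season"]] * avg_per_season +
    b[["comedy"]] * comedy
  plogis(log_odds)
}

# Two made-up shows: 20 episodes over 2 seasons, differing only in genre.
p_non_comedy <- predict_high_rating(20, 10, 0)
p_comedy <- predict_high_rating(20, 10, 1)
round(c(non_comedy = p_non_comedy, comedy = p_comedy), 3)
```

With these inputs the predicted probability is roughly 0.25 for the non-Comedy show and a couple of percentage points lower for the Comedy show.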

Statistical Significance

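The small p-values for all three predictors (< 0.001) indicate strong evidence of an association between each predictor and the likelihood of a high rating. With more than 96,000 shows in the model, however, even very small effects reach statistical significance, so the p-values say little about practical importance.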

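Null Deviance (107914) vs. Residual Deviance (107794): The three predictors reduce the deviance by 120 on 3 degrees of freedom. This reduction is statistically significant but amounts to only about 0.1% of the null deviance, so the predictors explain very little of the variation in high ratings.
AIC (107802): The Akaike Information Criterion is meaningful only relative to other models fit to the same data; among such models, a lower AIC indicates a better trade-off between fit and complexity.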
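The deviance comparison can be checked by hand. The sketch below uses only the numbers printed in the summary() output to compute a likelihood-ratio test against the intercept-only model, the share of deviance explained (which equals McFadden's pseudo R-squared for a 0/1 response), and the AIC. The variable names are illustrative.

```r
# Values copied from the summary() output above.
null_dev <- 107914
resid_dev <- 107794
null_df <- 96344
resid_df <- 96341
n_params <- 4  # intercept plus three predictors

# Likelihood-ratio test: the drop in deviance is compared to a chi-squared
# distribution with df equal to the number of added predictors.
lr_stat <- null_dev - resid_dev
lr_df <- null_df - resid_df
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# Share of the null deviance explained by the predictors.
pseudo_r2 <- 1 - resid_dev / null_dev

# For a binary response, AIC is the residual deviance plus 2 per parameter.
aic <- resid_dev + 2 * n_params

c(lr_stat = lr_stat, lr_df = lr_df, pseudo_r2 = round(pseudo_r2, 5), aic = aic)
```

The test statistic is large relative to 3 degrees of freedom, yet the explained share is close to zero, which is the pattern expected when a very large sample meets weak predictors.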

Insights and Implications

Although number_of_episodes and avg_episodes_per_season have statistically significant coefficients, their small sizes indicate only a minor association with high ratings. The negative coefficient for Comedy is larger, suggesting that genre may play a more meaningful role than show length in determining ratings.
The small reduction in deviance suggests that other, unmeasured factors, such as production quality, acting, or plot, may better explain a show’s rating.

Odds Ratios

To understand the effect sizes more intuitively, we can convert each coefficient to an odds ratio by exponentiating.

# Exponentiate coefficients to interpret them as odds ratios
exp(coef(logistic_model))

##             (Intercept)      number_of_episodes avg_episodes_per_season 
##               0.3238270               1.0001933               1.0013372 
##     binary_genre_comedy 
##               0.8815721

Intercept (0.3238): When all predictors are zero, the odds of a high rating are approximately 0.324, which corresponds to a probability of about 0.24. This scenario is hypothetical.
number_of_episodes (1.0002): Each additional episode multiplies the odds of a high rating by 1.0002, an increase of about 0.02%. Episode count alone therefore has a minimal effect on the odds of a high rating.
avg_episodes_per_season (1.0013): Each one-episode increase in the average number of episodes per season multiplies the odds of a high rating by 1.0013, an increase of about 0.13%, showing only a slight positive relationship.
binary_genre_comedy (0.8816): Shows classified as Comedy have odds of a high rating that are approximately 0.882 times those of non-Comedy shows, or about 12% lower. This odds ratio indicates a small but meaningful negative relationship.
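Per-episode odds ratios are hard to read because one episode is a very small unit. As a rough sketch, the code below rescales the number_of_episodes effect to a 100-episode difference and converts the baseline odds to a probability. The 100-episode step is an arbitrary choice for illustration; the inputs are the values printed above.

```r
# Values copied from the model output above.
b_episodes <- 1.933e-04
odds_baseline <- 0.3238270
or_comedy <- 0.8815721

# Odds ratio for a one-episode and for a 100-episode difference:
# multiply the coefficient by the step size before exponentiating.
or_per_episode <- exp(b_episodes)
or_per_100 <- exp(100 * b_episodes)

# Convert the baseline odds to a probability: p = odds / (1 + odds).
p_baseline <- odds_baseline / (1 + odds_baseline)

# Express the Comedy odds ratio as a percent change in the odds.
pct_change_comedy <- (or_comedy - 1) * 100

round(c(or_per_100 = or_per_100, p_baseline = p_baseline,
        pct_change_comedy = pct_change_comedy), 3)
```

Even a difference of 100 episodes changes the odds by only about 2%, which is still much smaller than the Comedy effect of roughly -12%.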

Confidence Interval for Coefficients

A confidence interval for the binary_genre_comedy coefficient provides insight into the reliability of this estimate.

# Calculate confidence interval for the binary_genre_comedy coefficient
confint_genre_comedy <- confint(logistic_model, parm = "binary_genre_comedy")

## Waiting for profiling to be done...

confint_genre_comedy

##       2.5 %      97.5 % 
## -0.17619190 -0.07627382

The confidence interval for binary_genre_comedy is:

Lower Bound (2.5%): -0.1762
Upper Bound (97.5%): -0.0763

This interval tells us, with 95% confidence, that the true effect of the Comedy genre on the log-odds of a high rating lies between -0.176 and -0.076. Because the interval excludes zero, it confirms the negative relationship.
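The "Waiting for profiling to be done..." message indicates that confint() computed a profile-likelihood interval. For comparison, a Wald interval can be built by hand from the estimate and standard error in the summary() output, as in the sketch below. With a sample this large the two intervals should be nearly identical; the variable names are illustrative.

```r
# Estimate and standard error for binary_genre_comedy, from summary() above.
est <- -1.260e-01
se <- 2.549e-02

# Wald 95% interval: estimate plus or minus the normal quantile times the SE.
z <- qnorm(0.975)
wald_ci <- est + c(-1, 1) * z * se

# Profile-likelihood interval reported by confint(), for comparison.
profile_ci <- c(-0.17619190, -0.07627382)

round(rbind(wald = wald_ci, profile = profile_ci), 4)
```

On a fitted model, confint.default(logistic_model) would return Wald intervals for all coefficients directly.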

Translating to Odds Ratio

To interpret this in terms of odds ratios, we can exponentiate the confidence interval:

# Exponentiate to interpret in terms of odds ratios
exp(confint_genre_comedy)

##     2.5 %    97.5 % 
## 0.8384571 0.9265625

The resulting 95% confidence interval for the odds ratio runs from 0.8385 to 0.9266. In other words, we can be 95% confident that the true odds ratio for Comedy shows, compared with non-Comedy shows, lies in this range.
Since the entire interval is below 1, this suggests that Comedy shows are less likely to receive a high rating compared to non-Comedy shows, holding other factors constant. Specifically, the odds of receiving a high rating are approximately 7-16% lower for Comedy shows than for non-Comedy shows.
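The 7-16% range follows directly from the interval endpoints: an odds ratio of r corresponds to odds that are (1 - r) x 100 percent lower. A short check using the printed interval:

```r
# Odds-ratio confidence limits copied from the output above.
or_ci <- c(lower = 0.8384571, upper = 0.9265625)

# Percent reduction in odds implied by each limit.
# The lower odds-ratio limit gives the larger reduction.
pct_lower_odds <- (1 - or_ci) * 100
round(pct_lower_odds, 1)
```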

Further Questions to Investigate

Would the effect of the number of episodes differ for different genres?
Would adjusting the rating threshold (e.g., >6 or >8) significantly alter the model results?
How does this model perform in predicting high ratings?
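One possible starting point for these questions is sketched below on simulated data, not on the TMDB dataset. The data frame sim, the helper fit_at_threshold, and the 0.5 classification cutoff are all assumptions made for illustration. Because the simulated ratings are independent of the predictors, the fitted values carry no meaning; only the mechanics carry over.

```r
set.seed(42)  # reproducible simulated data; this is NOT the TMDB dataset

n <- 2000
sim <- data.frame(
  number_of_episodes = rpois(n, lambda = 30),
  binary_genre_comedy = rbinom(n, size = 1, prob = 0.3),
  vote_average = round(runif(n, min = 0, max = 10), 1)
)
sim$Binary_high_rating <- as.numeric(sim$vote_average > 7)

# Question 1: an interaction term lets the episode effect differ by genre.
interaction_model <- glm(
  Binary_high_rating ~ number_of_episodes * binary_genre_comedy,
  data = sim, family = binomial
)

# Question 2: refit a main-effects model under different rating thresholds
# and collect the coefficients side by side.
fit_at_threshold <- function(threshold) {
  sim$high <- as.numeric(sim$vote_average > threshold)  # local copy only
  coef(glm(high ~ number_of_episodes + binary_genre_comedy,
           data = sim, family = binomial))
}
threshold_coefs <- sapply(c(6, 7, 8), fit_at_threshold)
colnames(threshold_coefs) <- c("gt6", "gt7", "gt8")

# Question 3: in-sample accuracy at a 0.5 cutoff, as a first look at prediction.
predicted_class <- as.numeric(fitted(interaction_model) > 0.5)
accuracy <- mean(predicted_class == sim$Binary_high_rating)

threshold_coefs
accuracy
```

For an unbalanced outcome like this one, accuracy at a fixed cutoff can be misleading, so a held-out test set and a threshold-free measure would be natural next steps.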
