Introduction

This data dive focuses on building a logistic regression model using the TMDB TV show dataset. The goal is to model a binary response variable that represents whether a TV show has a high average rating, using relevant predictors. Logistic regression is suitable for modeling the probability of a binary outcome, allowing us to explore how factors such as show length and genre influence high ratings.

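Response Variable

The response variable is Binary_high_rating (binary): 1 if a show's vote_average is greater than 7, and 0 otherwise.
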
Explanatory Variables

We selected three variables that may influence the likelihood of a high rating:

1. number_of_episodes (continuous): Total episodes aired for a show.
2. avg_episodes_per_season (continuous): Average episodes per season, calculated as number_of_episodes / number_of_seasons. Shows with zero seasons are excluded.
3. binary_genre_comedy (binary): 1 if the show's genres value is exactly "Comedy", 0 otherwise. Shows with a missing genres value are excluded.

These variables allow us to investigate whether factors like show length, density of episodes, or genre type (Comedy or non-Comedy) contribute to a high rating.
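
Because the indicator is built with an exact string comparison, it is worth checking how it behaves on multi-genre values. The sketch below uses a small hypothetical vector (not rows from the dataset) and assumes that genres can hold comma-separated lists such as "Comedy, Drama". Under that assumption, the exact match codes such shows as 0, while a substring match with grepl() codes them as 1.

```r
# Hypothetical genres values; the comma-separated format is an assumption
genres <- c("Comedy", "Comedy, Drama", "Drama", NA)

# Exact match, as in the model: multi-genre shows become 0, missing stays NA
exact_match <- ifelse(genres == "Comedy", 1, 0)

# Substring match (alternative, not used in the model): any mention of Comedy becomes 1
# grepl() returns FALSE for NA, so missing values must be restored by hand
any_match <- ifelse(is.na(genres), NA, as.integer(grepl("Comedy", genres, fixed = TRUE)))

data.frame(genres, exact_match, any_match)
```

Which definition is appropriate depends on whether "Comedy" is meant as the sole genre or as one of several.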

Building the Logistic Regression Model

# Load necessary libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(ggplot2)
library(broom)

# Load and prepare the dataset
tv_data <- read_csv("/Users/saransh/Downloads/TMDB_tv_dataset_v3.csv")
## Rows: 168639 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (18): name, original_language, overview, backdrop_path, homepage, origi...
## dbl   (7): id, number_of_seasons, number_of_episodes, vote_count, vote_avera...
## lgl   (2): adult, in_production
## date  (2): first_air_date, last_air_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Create binary response variable (e.g., high rating if vote_average > 7)
tv_data <- tv_data |>
  mutate(
    Binary_high_rating = ifelse(vote_average > 7, 1, 0),
    avg_episodes_per_season = ifelse(number_of_seasons != 0, number_of_episodes / number_of_seasons, NA),
    binary_genre_comedy = ifelse(genres == "Comedy", 1, 0)
  ) |>
  filter(!is.na(avg_episodes_per_season), !is.na(binary_genre_comedy))

# Build the logistic regression model
logistic_model <- glm(Binary_high_rating ~ number_of_episodes + avg_episodes_per_season + binary_genre_comedy, 
                      data = tv_data, family = binomial)
summary(logistic_model)
## 
## Call:
## glm(formula = Binary_high_rating ~ number_of_episodes + avg_episodes_per_season + 
##     binary_genre_comedy, family = binomial, data = tv_data)
## 
## Coefficients:
##                           Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)             -1.128e+00  8.654e-03 -130.291  < 2e-16 ***
## number_of_episodes       1.933e-04  4.763e-05    4.059 4.93e-05 ***
## avg_episodes_per_season  1.336e-03  2.196e-04    6.086 1.16e-09 ***
## binary_genre_comedy     -1.260e-01  2.549e-02   -4.945 7.60e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 107914  on 96344  degrees of freedom
## Residual deviance: 107794  on 96341  degrees of freedom
## AIC: 107802
## 
## Number of Fisher Scoring iterations: 4

Interpretation of Model Coefficients

The coefficients in a logistic regression model represent the change in the log-odds of the response (here, the log-odds of a high rating) associated with a one-unit increase in the predictor, holding all other variables constant. Log-odds are not probabilities; they must be transformed with the inverse logit to obtain a probability.

  • Intercept (-1.128): The intercept represents the log-odds of a high rating when all predictor variables are zero, that is, for a non-Comedy show with zero episodes and zero episodes per season. Though this situation is hypothetical, the intercept serves as a baseline for calculating probabilities.

  • number_of_episodes (0.0001933): For each additional episode, the log-odds of receiving a high rating increase by 0.0001933, assuming other predictors are constant. This effect is statistically significant (p-value < 0.001) but very small, suggesting that the number of episodes alone has a minimal impact on the likelihood of a high rating.

  • avg_episodes_per_season (0.001336): For each additional average episode per season, the log-odds of a high rating increase by 0.001336, holding other predictors constant. Although statistically significant, this small effect size suggests only a slight impact of episode density on high ratings.

  • binary_genre_comedy (-0.126): Shows classified as Comedy have log-odds of receiving a high rating that are 0.126 lower than non-Comedy shows, assuming other factors are equal. This statistically significant negative coefficient suggests Comedy shows are less likely to receive high ratings than other genres.
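
To see what these log-odds mean on the probability scale, the following sketch plugs the reported estimates into the inverse logit (plogis() in base R). The show profile (20 episodes, 10 per season) is a hypothetical example chosen for illustration, and the helper name log_odds is not part of the original analysis.

```r
# Coefficients copied from the model summary above
b0    <- -1.128
b_ep  <- 1.933e-04
b_avg <- 1.336e-03
b_com <- -1.260e-01

# Linear predictor (log-odds) for a given show profile
log_odds <- function(episodes, avg_per_season, comedy) {
  b0 + b_ep * episodes + b_avg * avg_per_season + b_com * comedy
}

# Hypothetical show: 20 episodes, 10 episodes per season
p_non_comedy <- plogis(log_odds(20, 10, comedy = 0))
p_comedy     <- plogis(log_odds(20, 10, comedy = 1))

round(c(non_comedy = p_non_comedy, comedy = p_comedy), 3)
```

With the fitted model object available, predict(logistic_model, newdata = ..., type = "response") returns the same kind of probability without copying coefficients by hand.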

Statistical Significance

The small p-values for all three predictors (< 0.001) indicate strong evidence for a relationship between each predictor and the likelihood of a high rating.

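  • Null Deviance (107914) vs. Residual Deviance (107794): The deviance drops by 120 on 3 degrees of freedom, so the predictors jointly improve on the intercept-only model. Relative to the total deviance, however, the reduction is very small, so the practical improvement is modest.

  • AIC (107802): The Akaike Information Criterion is only meaningful relative to other models fitted to the same data; among such models, a lower AIC indicates a better trade-off between fit and complexity. This value serves as a baseline for later comparisons.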
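
The deviance comparison can be made concrete with a likelihood-ratio test and a pseudo R-squared. The sketch below uses only the numbers printed in the model summary. McFadden's pseudo R-squared is an added summary that the original analysis does not report, and computing it from deviances relies on the response being ungrouped 0/1 data, as it is here.

```r
# Values copied from the model summary above
null_dev  <- 107914
resid_dev <- 107794
df_diff   <- 96344 - 96341   # number of predictors added to the null model

# Likelihood-ratio test: does the model improve on the intercept-only model?
lr_stat <- null_dev - resid_dev
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

# McFadden's pseudo R-squared (valid from deviances for ungrouped binary data)
mcfadden_r2 <- 1 - resid_dev / null_dev

# AIC check: residual deviance plus 2 * number of estimated coefficients
aic_check <- resid_dev + 2 * 4

c(lr_stat = lr_stat, p_value = p_value, mcfadden_r2 = mcfadden_r2, aic = aic_check)
```

With the fitted model object, anova(logistic_model, test = "Chisq") gives a sequential version of the same test.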

Insights and Implications

  • While number_of_episodes and avg_episodes_per_season have statistically significant coefficients, their per-episode effects are very small, indicating a minor impact on high ratings. The negative association for Comedy is larger in magnitude, suggesting genre may play a more meaningful role in determining ratings, although a per-episode coefficient and a 0/1 indicator are measured on different scales and are not directly comparable.

  • The small reduction in deviance suggests that other, unmeasured factors, such as production quality, acting, or plot, may better explain a show’s rating. With more than 96,000 observations, even very small effects reach statistical significance.

Odds Ratios

To understand the effect sizes more intuitively, we can convert each coefficient to an odds ratio by exponentiating.

# Exponentiate coefficients to interpret them as odds ratios (the intercept becomes baseline odds)
exp(coef(logistic_model))
##             (Intercept)      number_of_episodes avg_episodes_per_season 
##               0.3238270               1.0001933               1.0013372 
##     binary_genre_comedy 
##               0.8815721
  • Intercept (0.3238): When all predictors are zero, the odds of a high rating are approximately 0.324, though this scenario is hypothetical.

  • number_of_episodes (1.0002): For each additional episode, the odds of a high rating increase by a factor of 1.0002. This tiny effect size indicates that episode count alone has a minimal effect on the probability of a high rating.

  • avg_episodes_per_season (1.0013): For each one-episode increase in the average number of episodes per season, the odds of a high rating are multiplied by 1.0013 (about 0.13% higher), showing only a slight positive relationship.

  • binary_genre_comedy (0.8816): Shows classified as Comedy have odds of a high rating that are approximately 0.882 times those of non-Comedy shows, or about 12% lower. This is a modest negative relationship, but it is far larger than the per-episode effects above.
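
Per-episode odds ratios are hard to read because a single episode is a very small change. The sketch below restates the reported coefficients as percentage changes in odds and rescales the two episode effects to larger increments. The increments of 100 episodes and 10 episodes per season are arbitrary choices for illustration.

```r
# Coefficients copied from the model summary above
coefs <- c(
  number_of_episodes      = 1.933e-04,
  avg_episodes_per_season = 1.336e-03,
  binary_genre_comedy     = -1.260e-01
)

# Percentage change in odds for a one-unit increase in each predictor
pct_change <- (exp(coefs) - 1) * 100

# Odds ratios for larger, more interpretable increments (illustrative choices)
or_100_episodes  <- exp(100 * coefs[["number_of_episodes"]])
or_10_per_season <- exp(10 * coefs[["avg_episodes_per_season"]])

round(pct_change, 3)
round(c(or_100_episodes = or_100_episodes, or_10_per_season = or_10_per_season), 4)
```

Multiplying a coefficient by k before exponentiating is equivalent to raising the one-unit odds ratio to the power k.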

Confidence Interval for Coefficients

A confidence interval for the binary_genre_comedy coefficient provides insight into the reliability of this estimate.

# Calculate confidence interval for the binary_genre_comedy coefficient
confint_genre_comedy <- confint(logistic_model, parm = "binary_genre_comedy")
## Waiting for profiling to be done...
confint_genre_comedy
##       2.5 %      97.5 % 
## -0.17619190 -0.07627382

The 95% confidence interval for the binary_genre_comedy coefficient is (-0.176, -0.076) on the log-odds scale. The interval lies entirely below zero, confirming the negative relationship between the Comedy genre and the log-odds of receiving a high rating.
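
For comparison, confint() on a glm object profiles the likelihood, which is why the output shows a "Waiting for profiling" message. A simpler Wald interval can be computed by hand from the estimate and standard error in the model summary. The sketch below is that approximation, not the method used above; with a sample this large the two intervals should agree closely.

```r
# Estimate and standard error copied from the model summary above
estimate  <- -1.260e-01
std_error <- 2.549e-02

# Wald 95% interval: estimate +/- z * SE, with z taken from the standard normal
z <- qnorm(0.975)
wald_ci <- c(lower = estimate - z * std_error,
             upper = estimate + z * std_error)

round(wald_ci, 4)
```

With the fitted model object, confint.default(logistic_model, parm = "binary_genre_comedy") returns the same Wald interval.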

Translating to Odds Ratio

To interpret this in terms of odds ratios, we can exponentiate the confidence interval:

# Exponentiate to interpret in terms of odds ratios
exp(confint_genre_comedy)
##     2.5 %    97.5 % 
## 0.8384571 0.9265625
  • The resulting 95% confidence interval for the odds ratio runs from 0.8385 (lower bound, 2.5%) to 0.9266 (upper bound, 97.5%).

  • With 95% confidence, the true odds ratio of a high rating for Comedy shows compared to non-Comedy shows lies within this range.

  • Since the entire interval is below 1, this suggests that Comedy shows are less likely to receive a high rating compared to non-Comedy shows, holding other factors constant. Specifically, the odds of receiving a high rating are approximately 7-16% lower for Comedy shows than for non-Comedy shows.
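
The 7-16% range follows directly from the interval bounds, since an odds ratio r below 1 corresponds to odds that are (1 - r) * 100 percent lower. A minimal check using the printed bounds:

```r
# Odds-ratio interval bounds copied from the output above
or_ci <- c(lower = 0.8384571, upper = 0.9265625)

# Percentage by which the odds of a high rating are lower for Comedy shows
pct_lower_odds <- (1 - or_ci) * 100

round(pct_lower_odds, 1)
```

Note that the lower bound of the odds ratio gives the larger reduction, and the upper bound gives the smaller one.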

Further Questions to Investigate

  1. Would the effect of the number of episodes differ for different genres?
  2. Would adjusting the rating threshold (e.g., >6 or >8) significantly alter the model results?
  3. How does this model perform in predicting high ratings?
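
As a starting point for the first two questions, the sketch below refits a model with an episode-by-genre interaction at three rating thresholds. It runs on simulated data so that it is self-contained; the simulated distributions, sample size, and the column name high are assumptions for illustration, and the resulting coefficients say nothing about the TMDB data.

```r
set.seed(1)

# Simulated stand-in for the real data (all distributions are assumptions)
n <- 500
sim <- data.frame(
  number_of_episodes  = rpois(n, lambda = 30),
  binary_genre_comedy = rbinom(n, size = 1, prob = 0.3)
)
sim$vote_average <- pmin(10, pmax(0,
  rnorm(n, mean = 6.5 - 0.3 * sim$binary_genre_comedy, sd = 1.2)))

# Refit the model at each threshold, letting the episode effect vary by genre
thresholds <- c(6, 7, 8)
fits <- lapply(thresholds, function(cutoff) {
  sim$high <- as.integer(sim$vote_average > cutoff)  # modifies a local copy only
  glm(high ~ number_of_episodes * binary_genre_comedy,
      data = sim, family = binomial)
})
names(fits) <- paste0("threshold_", thresholds)

# One column of coefficients per threshold
round(sapply(fits, coef), 4)
```

On the real data, the same pattern would apply to tv_data with vote_average as the thresholded column. The interaction row of the table addresses question 1, and comparing columns addresses question 2.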