This data dive builds a logistic regression model on the TMDB TV show dataset. The goal is to model a binary response variable that indicates whether a TV show has a high average rating, using relevant predictors. Logistic regression is suited to modeling the probability of a binary outcome, which lets us explore how factors such as show length and genre are associated with high ratings.
We selected three variables that may influence the likelihood of a high rating:
1. number_of_episodes (continuous): Total number of episodes aired for a show.
2. avg_episodes_per_season (continuous): Average episodes per season, calculated as number_of_episodes / number_of_seasons.
3. binary_genre_comedy (binary): Indicates whether the show's genre is Comedy (1 if the genres field is exactly "Comedy", 0 otherwise).
These variables allow us to investigate whether factors like show length, density of episodes, or genre type (Comedy or non-Comedy) contribute to a high rating.
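As a rough illustration of how these predictors are constructed, the sketch below builds them on a small made-up data frame in base R. The toy rows, the comma-separated genre strings, and the any_comedy column are assumptions for illustration only, not values or definitions taken from the TMDB data.

```r
# Hypothetical toy data; the values are invented for illustration.
toy <- data.frame(
  number_of_episodes = c(24, 10, 0),
  number_of_seasons  = c(2, 1, 0),
  genres             = c("Comedy", "Comedy, Drama", "Drama"),
  stringsAsFactors   = FALSE
)

# Average episodes per season; NA when a show lists zero seasons.
toy$avg_episodes_per_season <- ifelse(
  toy$number_of_seasons != 0,
  toy$number_of_episodes / toy$number_of_seasons,
  NA
)

# Exact-match flag: 1 only when the genres field is exactly "Comedy".
toy$binary_genre_comedy <- ifelse(toy$genres == "Comedy", 1, 0)

# Alternative (an assumption, not the definition used in this analysis):
# flag any show whose genre string contains "Comedy".
toy$any_comedy <- as.integer(grepl("Comedy", toy$genres, fixed = TRUE))

print(toy)
```

Under the exact-match definition, the made-up "Comedy, Drama" row is coded 0. Whether multi-genre strings occur in the real genres column would need to be checked against the data.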
# Load necessary libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(ggplot2)
library(broom)
# Load and prepare the dataset
tv_data <- read_csv("/Users/saransh/Downloads/TMDB_tv_dataset_v3.csv")
## Rows: 168639 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): name, original_language, overview, backdrop_path, homepage, origi...
## dbl (7): id, number_of_seasons, number_of_episodes, vote_count, vote_avera...
## lgl (2): adult, in_production
## date (2): first_air_date, last_air_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Create binary response variable (e.g., high rating if vote_average > 7)
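tv_data <- tv_data |>
  mutate(
    # 1 if the show's average vote exceeds 7, otherwise 0
    Binary_high_rating = ifelse(vote_average > 7, 1, 0),
    # Guard against division by zero for shows listing zero seasons
    avg_episodes_per_season = ifelse(number_of_seasons != 0, number_of_episodes / number_of_seasons, NA),
    # Exact match: 1 only when the genres field is exactly "Comedy"
    binary_genre_comedy = ifelse(genres == "Comedy", 1, 0)
  ) |>
  # Keep only rows where both derived predictors are available
  filter(!is.na(avg_episodes_per_season), !is.na(binary_genre_comedy))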
# Build the logistic regression model
logistic_model <- glm(
  Binary_high_rating ~ number_of_episodes + avg_episodes_per_season + binary_genre_comedy,
  data = tv_data, family = binomial
)
summary(logistic_model)
##
## Call:
## glm(formula = Binary_high_rating ~ number_of_episodes + avg_episodes_per_season +
## binary_genre_comedy, family = binomial, data = tv_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.128e+00 8.654e-03 -130.291 < 2e-16 ***
## number_of_episodes 1.933e-04 4.763e-05 4.059 4.93e-05 ***
## avg_episodes_per_season 1.336e-03 2.196e-04 6.086 1.16e-09 ***
## binary_genre_comedy -1.260e-01 2.549e-02 -4.945 7.60e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 107914 on 96344 degrees of freedom
## Residual deviance: 107794 on 96341 degrees of freedom
## AIC: 107802
##
## Number of Fisher Scoring iterations: 4
The coefficients in a logistic regression model represent the change in the log-odds of the response (here, the logarithm of the odds of a high rating, not the probability itself) associated with a one-unit increase in the predictor, holding all other variables constant.
Intercept (-1.128): The intercept represents the log-odds of a high rating when all predictor variables are zero, that is, for a non-Comedy show with zero episodes and zero episodes per season. Though this situation is hypothetical, the intercept serves as a baseline for calculating probabilities.
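To make the baseline concrete, the log-odds can be converted to a probability with the inverse logit. The short sketch below uses the rounded intercept from the summary output, so the result is approximate (roughly 0.24).

```r
# Intercept as reported in the summary output (rounded).
intercept <- -1.128

# plogis() is the inverse logit: exp(x) / (1 + exp(x)).
baseline_prob <- plogis(intercept)

# The same probability via the odds: p = odds / (1 + odds).
baseline_odds <- exp(intercept)
baseline_prob_from_odds <- baseline_odds / (1 + baseline_odds)

print(round(baseline_prob, 3))
```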
number_of_episodes (0.0001933): For each additional episode, the log-odds of receiving a high rating increase by 0.0001933, assuming other predictors are constant. This effect is statistically significant (p-value < 0.001) but very small, suggesting that the number of episodes alone has a minimal impact on the likelihood of a high rating.
avg_episodes_per_season (0.001336): For each one-episode increase in the average number of episodes per season, the log-odds of a high rating increase by 0.001336, holding other predictors constant. Although statistically significant, this small effect size suggests only a slight impact of episode density on high ratings.
binary_genre_comedy (-0.126): Shows classified as Comedy have log-odds of receiving a high rating that are 0.126 lower than non-Comedy shows, assuming other factors are equal. This statistically significant negative coefficient suggests Comedy shows are less likely to receive high ratings than other genres.
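Because log-odds are hard to read directly, it can help to plug the coefficients into the model equation for a specific show. The sketch below does this for a hypothetical profile (50 episodes in total, 10 per season; these numbers and the predict_prob helper are assumptions chosen for illustration) using the rounded coefficients from the summary output. Under these assumptions the predicted probability is roughly 0.25 for a non-Comedy show and roughly 0.23 for a Comedy show.

```r
# Coefficients as reported in the summary output (rounded).
coefs <- c(
  intercept               = -1.128,
  number_of_episodes      = 1.933e-04,
  avg_episodes_per_season = 1.336e-03,
  binary_genre_comedy     = -1.260e-01
)

# Hypothetical helper: predicted probability for a given show profile.
predict_prob <- function(episodes, per_season, comedy) {
  eta <- coefs[["intercept"]] +
    coefs[["number_of_episodes"]] * episodes +
    coefs[["avg_episodes_per_season"]] * per_season +
    coefs[["binary_genre_comedy"]] * comedy
  plogis(eta)  # inverse logit: log-odds -> probability
}

# Hypothetical profile: 50 episodes in total, 10 per season.
p_non_comedy <- predict_prob(50, 10, comedy = 0)
p_comedy     <- predict_prob(50, 10, comedy = 1)

print(round(c(non_comedy = p_non_comedy, comedy = p_comedy), 3))
```

With the fitted model object available, predict(logistic_model, newdata = ..., type = "response") would give the same kind of result without retyping the coefficients.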
The small p-values for all three predictors (< 0.001) indicate strong evidence for a relationship between each predictor and the likelihood of a high rating.
Null Deviance (107914) vs. Residual Deviance (107794): The deviance falls by 120 on 3 degrees of freedom, which suggests the predictors improve the model, though the improvement is modest (about 0.1% of the null deviance).
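The drop in deviance can be turned into a likelihood-ratio test by comparing it to a chi-squared distribution. The sketch below works from the rounded deviances in the summary output, so it is an approximation; it also computes the share of null deviance explained, which for a 0/1 response corresponds to McFadden's pseudo R-squared.

```r
# Deviances and degrees of freedom as reported in the summary output.
null_deviance     <- 107914
null_df           <- 96344
residual_deviance <- 107794
residual_df       <- 96341

# Likelihood-ratio statistic and its degrees of freedom.
lr_stat <- null_deviance - residual_deviance   # 120
lr_df   <- null_df - residual_df               # 3

# Upper-tail chi-squared probability for the deviance drop.
lr_p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# Share of the null deviance explained by the predictors.
deviance_explained <- 1 - residual_deviance / null_deviance

print(c(lr_stat = lr_stat, lr_df = lr_df))
print(lr_p_value)
print(round(deviance_explained, 5))
```

The test is overwhelmingly significant, yet the explained share is only about 0.1%, which is the pattern one expects when small effects are estimated on a very large sample.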
AIC (107802): This Akaike Information Criterion value provides a baseline for model comparison; lower AIC values suggest better model fit.
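For a 0/1 response the AIC equals the deviance plus twice the number of estimated parameters, so the reported value can be checked by hand and compared with an intercept-only model. The sketch below uses the rounded deviances from the summary output; the intercept-only AIC is derived from the null deviance rather than from a separately fitted model.

```r
# Deviances as reported in the summary output (rounded).
residual_deviance <- 107794
null_deviance     <- 107914

# Number of estimated parameters in each model.
k_full <- 4  # intercept + three predictors
k_null <- 1  # intercept only

# AIC = deviance + 2 * k for ungrouped binary data.
aic_full <- residual_deviance + 2 * k_full
aic_null <- null_deviance + 2 * k_null

print(c(aic_full = aic_full, aic_null = aic_null, difference = aic_null - aic_full))
```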
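While number_of_episodes and avg_episodes_per_season have statistically significant coefficients, their small sizes indicate a minor per-episode impact on high ratings. The negative association for Comedy is larger per unit, although the units differ (one additional episode versus a change of genre), so the coefficients are not directly comparable. With that caveat, the result implies that Comedy shows are less likely to receive high ratings, suggesting genre may play a more meaningful role in determining ratings.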
The small reduction in deviance suggests that other, unmeasured factors, such as production quality, acting, or plot, may better explain a show’s rating.
To understand the effect sizes more intuitively, we can convert each coefficient to an odds ratio by exponentiating.
# Exponentiating coefficients to interpret as odds
exp(coef(logistic_model))
## (Intercept) number_of_episodes avg_episodes_per_season
## 0.3238270 1.0001933 1.0013372
## binary_genre_comedy
## 0.8815721
Intercept (0.3238): When all predictors are zero, the odds of a high rating are approximately 0.324, though this scenario is hypothetical.
number_of_episodes (1.0002): For each additional episode, the odds of a high rating increase by a factor of 1.0002. This tiny effect size indicates that episode count alone has a minimal effect on the probability of a high rating.
avg_episodes_per_season (1.0013): For each one-episode increase in the average number of episodes per season, the odds of a high rating increase by a factor of 1.0013, showing only a slight positive relationship.
binary_genre_comedy (0.8816): Shows classified as Comedy have odds of a high rating that are approximately 0.882 times those of non-Comedy shows. This odds ratio suggests a small but meaningful negative relationship, where Comedy shows are less likely to achieve high ratings than non-Comedy shows.
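Odds ratios are often easier to read as percentage changes in the odds, and per-episode effects are easier to judge over a larger increment. The sketch below uses the rounded coefficients from the summary output; the increments of 100 episodes and 10 episodes per season are arbitrary choices for illustration.

```r
# Slope coefficients as reported in the summary output (rounded).
coefs <- c(
  number_of_episodes      = 1.933e-04,
  avg_episodes_per_season = 1.336e-03,
  binary_genre_comedy     = -1.260e-01
)

# Percentage change in the odds for a one-unit increase in each predictor.
pct_change_one_unit <- (exp(coefs) - 1) * 100

# Odds ratios for larger, hypothetical increments: multiply the
# coefficient by the increment before exponentiating.
or_100_episodes  <- exp(100 * coefs[["number_of_episodes"]])
or_10_per_season <- exp(10 * coefs[["avg_episodes_per_season"]])

print(round(pct_change_one_unit, 3))
print(round(c(or_100_episodes = or_100_episodes,
              or_10_per_season = or_10_per_season), 4))
```

Even an extra 100 episodes raises the odds by only about 2%, compared with a reduction of about 12% for Comedy.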
A confidence interval for the binary_genre_comedy coefficient provides insight into the reliability of this estimate.
# Calculate confidence interval for the binary_genre_comedy coefficient
confint_genre_comedy <- confint(logistic_model, parm = "binary_genre_comedy")
## Waiting for profiling to be done...
confint_genre_comedy
## 2.5 % 97.5 %
## -0.17619190 -0.07627382
The 95% confidence interval for the binary_genre_comedy coefficient is (-0.176, -0.076).
This interval tells us, with 95% confidence, that the true effect of the Comedy genre on the log-odds of receiving a high rating is between -0.176 and -0.076. Because the interval excludes zero, it confirms the negative relationship.
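The interval reported by confint() is based on profile likelihood. As a cross-check, a simpler Wald interval can be computed by hand from the estimate and its standard error; the sketch below uses the rounded values from the summary output and is an approximation, not the method used above.

```r
# Estimate and standard error as reported in the summary output (rounded).
estimate  <- -0.1260
std_error <- 0.02549

# Wald 95% interval: estimate +/- z * SE, with z the 97.5% normal quantile.
z <- qnorm(0.975)
wald_ci <- c(lower = estimate - z * std_error,
             upper = estimate + z * std_error)

print(round(wald_ci, 4))
```

With a sample this large the Wald and profile-likelihood intervals agree to about three decimal places.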
To interpret this in terms of odds ratios, we can exponentiate the confidence interval:
# Exponentiate to interpret in terms of odds ratios
exp(confint_genre_comedy)
## 2.5 % 97.5 %
## 0.8384571 0.9265625
The resulting 95% confidence interval for the odds ratio runs from 0.8385 (lower bound, 2.5%) to 0.9266 (upper bound, 97.5%). With 95% confidence, the true odds ratio for Comedy shows relative to non-Comedy shows lies in this range.
Since the entire interval is below 1, this suggests that Comedy shows are less likely to receive a high rating compared to non-Comedy shows, holding other factors constant. Specifically, the odds of receiving a high rating are approximately 7-16% lower for Comedy shows than for non-Comedy shows.
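The percentage range quoted above follows directly from the odds ratios: a ratio r below 1 corresponds to odds that are (1 - r) * 100 percent lower. A minimal sketch using the values printed earlier:

```r
# Odds ratio and its 95% interval, as printed in the output above.
or_point <- 0.8815721
or_ci    <- c(lower = 0.8384571, upper = 0.9265625)

# Percentage by which the odds are lower for Comedy shows.
# Note that the lower bound of the odds ratio gives the larger reduction.
pct_lower_odds <- (1 - c(or_ci, point = or_point)) * 100

print(round(pct_lower_odds, 1))
```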