Linear Modeling and Generalized Linear Models

Introduction

The purpose of this data dive is to expand our knowledge of linear modeling and generalized linear models. This week, we use the TMDB TV Show dataset to build a linear model of the relationship between TV show attributes and average ratings. Specifically, we explore how the number of seasons, the number of episodes, and episode run time relate to a show’s average rating.

Data Overview

The key variables used in this analysis are:

Response Variable: vote_average (Average rating of the show)
Explanatory Variables: number_of_seasons, number_of_episodes, episode_run_time

We chose these variables to explore how different aspects of a TV show might contribute to its overall rating. The number of seasons and episodes may reflect the popularity and longevity of a show, while the episode run time might indicate a certain level of content quality and viewer engagement.

Building the Linear Model

To explore the relationship between the average rating (vote_average) and the selected explanatory variables (number_of_seasons, number_of_episodes, episode_run_time), we fit a linear model using the lm() function in R. Below is the code used to build the model:

# Load necessary packages
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readr)

# Load the dataset
tmdb_data <- read_csv("/Users/saransh/Downloads/TMDB_tv_dataset_v3.csv")

## Rows: 168639 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (18): name, original_language, overview, backdrop_path, homepage, origi...
## dbl   (7): id, number_of_seasons, number_of_episodes, vote_count, vote_avera...
## lgl   (2): adult, in_production
## date  (2): first_air_date, last_air_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Fit a linear model
linear_model <- lm(vote_average ~ number_of_seasons + number_of_episodes + episode_run_time, 
                   data = tmdb_data)

# Summary of the linear model
summary(linear_model)

## 
## Call:
## lm(formula = vote_average ~ number_of_seasons + number_of_episodes + 
##     episode_run_time, data = tmdb_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -66.600  -2.143  -1.985   3.347   8.177 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.823e+00  9.955e-03  183.08   <2e-16 ***
## number_of_seasons  1.619e-01  3.071e-03   52.73   <2e-16 ***
## number_of_episodes 7.880e-04  6.695e-05   11.77   <2e-16 ***
## episode_run_time   1.067e-02  1.715e-04   62.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.37 on 168635 degrees of freedom
## Multiple R-squared:  0.04834,    Adjusted R-squared:  0.04832 
## F-statistic:  2855 on 3 and 168635 DF,  p-value: < 2.2e-16

Explanation of Outputs

Coefficients

The output provides the estimated coefficients for the intercept and each of the explanatory variables:

Intercept: The intercept value is 1.823, which represents the expected value of vote_average when all explanatory variables are equal to zero. In this context, it may not have a meaningful interpretation, as it is unlikely that a show has zero seasons, episodes, or runtime.
Number of Seasons: The coefficient for number_of_seasons is 0.1619. On average, each additional season is associated with an increase of approximately 0.1619 in the average rating (vote_average), holding all other variables constant. Shows with more seasons therefore tend to be rated higher, which is consistent with greater viewer loyalty and sustained quality over time. The p-value for this coefficient is very small (< 2e-16), indicating that the association is statistically significant.
Number of Episodes: The coefficient for number_of_episodes is 0.000788. Each additional episode is associated with an increase of 0.000788 in the average rating, holding all other variables constant, so 100 additional episodes correspond to only about 0.08 rating points. Although this effect is small, it is statistically significant (p-value < 2e-16), which is unsurprising with more than 168,000 observations. One possible explanation is that more content allows for greater character development and plot complexity.
Episode Run Time: The coefficient for episode_run_time is 0.01067. On average, each additional minute of episode runtime is associated with an increase of 0.01067 in the average rating, holding other factors constant. The positive coefficient suggests that longer episodes may go along with higher quality or greater engagement, but the model alone cannot show that longer episodes cause higher ratings. The p-value is again very small (< 2e-16), showing that this association is statistically significant.
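
To make the coefficient list concrete, the fitted equation can be evaluated by hand. The sketch below uses the estimates reported in the summary output; the example show (3 seasons, 30 episodes, 45-minute episodes) is hypothetical and is not taken from the dataset.

```r
# Coefficient estimates copied from the summary(linear_model) output
coefs <- c(intercept          = 1.823,
           number_of_seasons  = 0.1619,
           number_of_episodes = 0.000788,
           episode_run_time   = 0.01067)

# Hypothetical show (an assumption for illustration only);
# the leading 1 multiplies the intercept
example_show <- c(1, 3, 30, 45)

# Predicted rating = intercept + sum of coefficient * value
predicted_rating <- sum(coefs * example_show)
round(predicted_rating, 2)
# 2.81
```

With the fitted object in memory, predict(linear_model, newdata = ...) should give the same value up to the rounding of the printed coefficients.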

Residuals

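The residuals are the differences between the observed ratings and the values predicted by the model. The summary shows the minimum, 1st quartile, median, 3rd quartile, and maximum residuals. Because the model includes an intercept, the residuals average to zero by construction, so the median of -1.985 is not close to zero: it points to a skewed residual distribution in which the model overpredicts at least half of the shows by roughly two points or more. The wide range of residuals (from -66.600 to 8.177) shows that some shows are predicted very poorly. TMDB ratings run from 0 to 10, so a residual of -66.600 means the model predicted a rating of at least 66.6 for that show, far outside the valid range.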
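
When reading residual quantiles, it helps to remember that a least-squares fit with an intercept forces the mean residual to zero, but not the median. The sketch below illustrates this on simulated data with right-skewed errors; the data and the names x, y, and skew_fit are assumptions for illustration and have nothing to do with the TMDB file.

```r
set.seed(1)

# Simulated data (an assumption): linear signal plus right-skewed noise
n <- 1000
x <- runif(n)
y <- 2 + x + rexp(n)   # exponential errors: mean 1, median log(2)

skew_fit <- lm(y ~ x)

# Mean residual is zero up to floating-point error
mean(residuals(skew_fit))

# Median residual is clearly negative because the errors are skewed
median(residuals(skew_fit))
```

A median residual well below zero, as in the summary above, is therefore a sign of asymmetry in the residuals rather than evidence of accurate predictions.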

Model Fit Metrics

Residual Standard Error: The residual standard error is 3.37, which provides an estimate of the typical size of the residuals. This value suggests that there is considerable variability in the ratings that is not explained by the model, indicating that other factors not included in the model likely play a significant role in determining a show’s rating.
Multiple R-squared: The R-squared value is 0.04834, which indicates that approximately 4.8% of the variance in vote_average is explained by the model. This is a relatively low R-squared value, suggesting that other factors not included in the model likely have a larger impact on a show’s rating.
Adjusted R-squared: The adjusted R-squared value is 0.04832, which is very close to the R-squared value. This metric penalizes R-squared for the number of predictors in the model. With 168,639 observations and only three predictors, the penalty is negligible, so the two values nearly coincide; the similarity by itself says little about whether additional variables would improve the fit.
F-statistic: The F-statistic is 2855 with a p-value of < 2.2e-16, indicating that the model as a whole is statistically significant. In other words, at least one of the explanatory variables has a significant relationship with the response variable.
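
The fit metrics above are linked by standard formulas, so they can be cross-checked from the printed values alone. The sketch below recomputes the adjusted R-squared and the F-statistic from the reported R-squared, the number of rows, and the number of predictors; small differences can arise because the printed R-squared is rounded.

```r
# Values copied from the output above
r_squared <- 0.04834
n <- 168639          # rows reported by read_csv()
p <- 3               # number of predictors
df_resid <- n - p - 1   # 168635, matching the summary

# Adjusted R-squared: penalize for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_resid

# Overall F-statistic expressed through R-squared
f_statistic <- (r_squared / p) / ((1 - r_squared) / df_resid)

round(adj_r_squared, 5)   # 0.04832
round(f_statistic)        # 2855
```

The residual degrees of freedom equal n - 4, which also indicates that lm() did not drop any rows for missing values in these four columns.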

Interpretation

The model shows that the number of seasons, number of episodes, and episode runtime all have statistically significant relationships with the average rating of a TV show. However, the low R-squared value suggests that these variables explain only a small portion of the variation in ratings. This implies that other factors, such as genre, production quality, or viewer demographics, may play an important role in determining ratings and should be explored in future analyses.

Model Diagnostics

To ensure the model’s validity, we performed diagnostic checks to evaluate the assumptions of linear regression. The following tools were used to diagnose the model:

# Diagnostic plots
par(mfrow = c(2, 2))
plot(linear_model)

Key Diagnostic Observations

Residuals vs Fitted plot shows that the residuals have a clear pattern, which suggests non-linearity and potential heteroscedasticity, meaning that the relationship between the predictors and the response may not be purely linear and the variance of residuals is not constant. This non-linearity indicates that the model may not be fully capturing the relationships in the data, and the presence of heteroscedasticity violates one of the key assumptions of linear regression.

Normal Q-Q plot indicates that the residuals deviate significantly from normality, especially in the tails. This implies that the residuals are not normally distributed, which could reduce the reliability of the p-values and confidence intervals reported for the model.

Scale-Location plot also shows an increasing trend in the spread of residuals with fitted values, confirming the presence of heteroscedasticity. This means that the variability in the response variable changes with the fitted value, so predictions are less precise in some ranges than in others and the usual standard errors may be unreliable.

Residuals vs Leverage plot highlights several influential points that could disproportionately affect the model’s fit. These influential points need to be investigated further to determine whether they represent genuine data points or outliers that should be handled differently. Influential points may skew the model’s results and lead to biased estimates.
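
One way to follow up on the Residuals vs Leverage plot is to rank observations by Cook’s distance. The sketch below uses base R on simulated data with one planted extreme point; the data, the name sim_fit, and the 4/n cutoff (a common rule of thumb, not a formal test) are assumptions for illustration.

```r
set.seed(42)

# Simulated data (an assumption): a clean linear relationship
n <- 200
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)

# Plant one high-leverage point that is far from the trend
x[n] <- 10
y[n] <- -20

sim_fit <- lm(y ~ x)

# Cook's distance measures how much the fit changes when a row is dropped
cooks_d <- cooks.distance(sim_fit)

# Flag rows above the rule-of-thumb cutoff 4 / n
flagged <- which(cooks_d > 4 / n)

# Show the three most influential rows
head(sort(cooks_d, decreasing = TRUE), 3)
```

Applied to the real model, cooks.distance(linear_model) would give one value per show, and the top-ranked rows could then be inspected for implausible run times or episode counts.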

Issues Highlighted by Diagnostics

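There appears to be non-constant variance (heteroscedasticity), as indicated by the Residuals vs Fitted and Scale-Location plots. Heteroscedasticity does not by itself bias the coefficient estimates, but it makes their standard errors, and therefore the reported p-values and confidence intervals, unreliable. The assumption of homoscedasticity is violated.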
Some outliers and influential points were observed in the Residuals vs Leverage plot, suggesting that certain shows have a significant impact on the model’s fit. Further investigation might be needed to understand these points, as they could be exerting undue influence on the model.
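
One common response to non-constant variance is to keep the fitted coefficients and recompute the standard errors with a heteroscedasticity-consistent (sandwich) estimator. The sketch below computes the basic HC0 version by hand in base R on simulated data whose error spread grows with the predictor; the data and the name sim_fit are assumptions, and this is an illustration of the idea rather than the analysis performed above.

```r
set.seed(7)

# Simulated data (an assumption): error spread increases with x
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 0.3 * x + rnorm(n, sd = 0.5 * x)

sim_fit <- lm(y ~ x)

# HC0 sandwich: (X'X)^-1  X' diag(e^2) X  (X'X)^-1
X <- model.matrix(sim_fit)
e <- residuals(sim_fit)
bread <- solve(crossprod(X))
meat <- crossprod(X * e)      # each row of X scaled by its residual
vcov_hc0 <- bread %*% meat %*% bread

# Compare classical and robust standard errors side by side
classical_se <- sqrt(diag(vcov(sim_fit)))
robust_se <- sqrt(diag(vcov_hc0))
cbind(classical_se, robust_se)
```

In practice, the sandwich package’s vcovHC() provides this estimator along with small-sample refinements, so the manual version is mainly useful for seeing what is being computed.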

Interpretation of Coefficients

Number of Seasons: The coefficient for number_of_seasons is positive, indicating that, on average, an increase in the number of seasons is associated with an increase in the average rating of the show, holding other factors constant. Specifically, each additional season is associated with an average rating that is approximately 0.1619 units higher. Longer-running shows therefore tend to have higher average ratings, which could reflect greater viewer engagement and loyalty over time.

The significance of this coefficient (p-value < 2e-16) indicates strong evidence of a positive association between the number of seasons and a show’s average rating. This finding is relevant for producers and networks, but it should be read with care: the model is observational, so it does not show that extending a show for more seasons would raise its rating. Well-rated shows may simply be the ones that get renewed.

Insights

The positive relationship between the number of seasons and average rating is consistent with successful shows being extended for additional seasons, in which case high ratings lead to more seasons rather than the reverse. However, the model’s reliability is affected by heteroscedasticity and influential points, which suggests that additional steps (e.g., transforming variables or investigating and possibly removing outliers) may be needed to improve model accuracy.

Further Questions

What other factors might influence the average rating of a show? For example, how do genres or production companies impact ratings?
Would a generalized linear model (e.g., Poisson regression for count data) be more appropriate for predicting other outcomes, such as the number of episodes?
Should transformations, such as taking the log of the response variable, be applied to address heteroscedasticity?

Further exploration could involve transforming variables to address non-linearity, using robust regression to limit the impact of influential points, or experimenting with other types of models to better understand the relationships in the dataset. Additionally, including categorical variables like genre and production companies could help improve the explanatory power of the model.
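
The Poisson regression raised in the questions above can be sketched with glm(). The example below fits a Poisson model with a log link to simulated count data; the data frame sim_shows, its columns, and the true coefficients (1.5 and 0.4) are assumptions chosen for illustration, not estimates from the TMDB data.

```r
set.seed(123)

# Simulated data (an assumption): episode counts driven by number of seasons
n <- 1000
seasons <- rpois(n, lambda = 2) + 1                      # at least one season
episodes <- rpois(n, lambda = exp(1.5 + 0.4 * seasons))  # log-linear mean
sim_shows <- data.frame(seasons, episodes)

# Poisson GLM with the canonical log link
poisson_fit <- glm(episodes ~ seasons,
                   family = poisson(link = "log"),
                   data = sim_shows)

# Coefficients are on the log scale
coef(poisson_fit)

# Exponentiate to get the multiplicative change in expected episodes per season
exp(coef(poisson_fit)["seasons"])
```

Real episode counts are often more variable than a Poisson model allows; in that case family = quasipoisson or a negative binomial model such as MASS::glm.nb() would be natural alternatives to consider.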

Conclusion

In this data dive, we explored how various attributes of TV shows relate to their average ratings using a linear model. We highlighted some issues with the model, such as heteroscedasticity and influential points, and interpreted the impact of the number of seasons on ratings. Future work should focus on improving model diagnostics, transforming variables to better meet the assumptions of linear regression, and exploring additional predictors to increase the model’s explanatory power.

Linear Modeling and Generalized Linear Models - Week 11