The purpose of this data dive is to expand our knowledge of linear modeling and generalized linear models. This week, we use the TMDB TV Show dataset to build a linear model of the relationship between TV show attributes and average ratings. Specifically, we explore how the number of seasons, the number of episodes, and the episode run time relate to a show’s average rating.
The key variables used in this analysis are:
vote_average (average rating of the show)
number_of_seasons, number_of_episodes, episode_run_time
We chose these variables to explore how different aspects of a TV show might contribute to its overall rating. The number of seasons and episodes may reflect the popularity and longevity of a show, while the episode run time might indicate a certain level of content quality and viewer engagement.
To explore the relationship between the average rating (vote_average) and the selected explanatory variables (number_of_seasons, number_of_episodes, episode_run_time), we fit a linear model using the lm() function in R. Below is the code used to build the model:
# Load necessary packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
# Load the dataset
tmdb_data <- read_csv("/Users/saransh/Downloads/TMDB_tv_dataset_v3.csv")
## Rows: 168639 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): name, original_language, overview, backdrop_path, homepage, origi...
## dbl (7): id, number_of_seasons, number_of_episodes, vote_count, vote_avera...
## lgl (2): adult, in_production
## date (2): first_air_date, last_air_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Fit a linear model
linear_model <- lm(vote_average ~ number_of_seasons + number_of_episodes + episode_run_time,
                   data = tmdb_data)
# Summary of the linear model
summary(linear_model)
##
## Call:
## lm(formula = vote_average ~ number_of_seasons + number_of_episodes +
## episode_run_time, data = tmdb_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -66.600  -2.143  -1.985   3.347   8.177
##
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)        1.823e+00  9.955e-03  183.08   <2e-16 ***
## number_of_seasons  1.619e-01  3.071e-03   52.73   <2e-16 ***
## number_of_episodes 7.880e-04  6.695e-05   11.77   <2e-16 ***
## episode_run_time   1.067e-02  1.715e-04   62.24   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.37 on 168635 degrees of freedom
## Multiple R-squared: 0.04834, Adjusted R-squared: 0.04832
## F-statistic: 2855 on 3 and 168635 DF, p-value: < 2.2e-16
The output provides the estimated coefficients for the intercept and each of the explanatory variables:
Intercept: The intercept value is 1.823, which represents the expected value of vote_average when all explanatory variables are equal to zero. In this context, it may not have a meaningful interpretation, as it is unlikely that a show has zero seasons, episodes, or runtime.
Number of Seasons: The coefficient for number_of_seasons is 0.1619. This means that, on average, each additional season is associated with an increase of approximately 0.1619 in the average rating (vote_average), holding all other variables constant. This suggests that shows with more seasons are generally rated higher, potentially indicating greater viewer loyalty and sustained quality over time. The p-value for this coefficient is very small (< 2e-16), indicating that this association is statistically significant.
Number of Episodes: The coefficient for number_of_episodes is 0.000788. This suggests that each additional episode is associated with an increase of 0.000788 in the average rating, holding all other variables constant. Although this effect is small, it is statistically significant (p-value < 2e-16), suggesting that having more episodes may slightly contribute to better ratings, possibly because more content allows for greater character development and plot complexity.
Episode Run Time: The coefficient for episode_run_time is 0.01067. This indicates that, on average, each additional minute of episode runtime is associated with an increase of 0.01067 in the average rating, holding other factors constant. The positive coefficient suggests that longer episodes may be indicative of higher quality or greater engagement, although the model alone cannot show that longer episodes cause higher ratings. The p-value is again very small (< 2e-16), showing that this association is statistically significant.
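To make the coefficient interpretations concrete, the sketch below plugs the rounded estimates from the summary output into the fitted equation for a hypothetical show. The show’s attributes (3 seasons, 30 episodes, 45-minute episodes) are made up for illustration and do not come from the dataset.

```r
# Rounded coefficient estimates copied from the summary output
coefs <- c(intercept = 1.823,
           number_of_seasons = 0.1619,
           number_of_episodes = 0.000788,
           episode_run_time = 0.01067)

# Hypothetical show (illustrative values, not taken from the dataset)
show <- c(number_of_seasons = 3, number_of_episodes = 30, episode_run_time = 45)

# Fitted value = intercept + sum of (coefficient * attribute)
predicted_rating <- unname(coefs["intercept"] + sum(coefs[names(show)] * show))
print(round(predicted_rating, 3))
```

With the fitted model in memory, predict(linear_model, newdata = ...) on a one-row data frame with the same three columns should give essentially the same number, using the unrounded coefficients.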
The residuals are the differences between the observed values and the values predicted by the model. The summary shows the minimum, 1st quartile, median, 3rd quartile, and maximum residuals. The median residual is -1.985, which is not close to zero: the model overpredicts the rating for at least half of the shows, and the residual distribution is skewed. The wide range of residuals (from -66.600 to 8.177) shows that some shows are predicted very poorly. A residual of -66.600 means the fitted value exceeds the observed rating by about 66.6 points, far larger in magnitude than the largest positive residual of 8.177.
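One useful reference point when reading the residual summary: in a least-squares fit with an intercept, the mean of the residuals is zero by construction, so a median well away from zero points to a skewed residual distribution. The sketch below illustrates this on simulated data. The variables x and y are invented for the example and are unrelated to the TMDB data.

```r
set.seed(1)

# Simulated data with right-skewed noise (illustrative only)
n <- 1000
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rexp(n, rate = 1)

fit <- lm(y ~ x)
res <- residuals(fit)

# Five-number summary, as in the Residuals block of summary(fit)
print(quantile(res))

# The mean is zero up to floating-point error; the median is not
print(c(mean = mean(res), median = median(res)))
```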
Residual Standard Error: The residual standard error is 3.37, which provides an estimate of the typical size of the residuals. This value suggests that there is considerable variability in the ratings that is not explained by the model, indicating that other factors not included in the model likely play a significant role in determining a show’s rating.
Multiple R-squared: The R-squared value is 0.04834, which indicates that approximately 4.8% of the variance in vote_average is explained by the model. This is a relatively low R-squared value, suggesting that other factors not included in the model likely have a larger impact on a show’s rating.
Adjusted R-squared: The adjusted R-squared value is 0.04832, which is very close to the R-squared value. This metric penalizes R-squared for the number of predictors in the model. With only three predictors and 168,635 residual degrees of freedom, the penalty is negligible, so the two values are nearly identical.
F-statistic: The F-statistic is 2855 with a p-value of < 2.2e-16, indicating that the model as a whole is statistically significant. In other words, at least one of the explanatory variables has a significant relationship with the response variable.
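The adjusted R-squared and the F-statistic can both be recovered from the reported R-squared, the number of predictors, and the residual degrees of freedom. The sketch below does this with the rounded values from the summary output, so small rounding differences from R’s own figures are expected.

```r
# Values reported in the summary output
r2 <- 0.04834        # Multiple R-squared
p <- 3               # number of predictors
df_resid <- 168635   # residual degrees of freedom (n - p - 1)
n <- df_resid + p + 1

# Adjusted R-squared: penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic: explained variance per predictor divided by
# unexplained variance per residual degree of freedom
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

print(round(adj_r2, 5))
print(round(f_stat))
```

Because n is so large relative to p, the factor (n - 1) / df_resid is almost exactly 1, which is why the adjusted value barely differs from the unadjusted one.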
The model shows that the number of seasons, number of episodes, and episode runtime all have statistically significant relationships with the average rating of a TV show. However, the low R-squared value suggests that these variables explain only a small portion of the variation in ratings. This implies that other factors, such as genre, production quality, or viewer demographics, may play an important role in determining ratings and should be explored in future analyses.
To ensure the model’s validity, we performed diagnostic checks to evaluate the assumptions of linear regression. The following tools were used to diagnose the model:
# Diagnostic plots
par(mfrow = c(2, 2))
plot(linear_model)
Residuals vs Fitted plot shows that the residuals have a clear pattern, which suggests non-linearity and potential heteroscedasticity, meaning that the relationship between the predictors and the response may not be purely linear and the variance of residuals is not constant. This non-linearity indicates that the model may not be fully capturing the relationships in the data, and the presence of heteroscedasticity violates one of the key assumptions of linear regression.
Normal Q-Q plot indicates that the residuals deviate significantly from normality, especially in the tails. This implies that the residuals are not normally distributed, which could impact the reliability of statistical tests, such as p-values and confidence intervals, used in the model.
Scale-Location plot also shows an increasing trend in the spread of residuals with fitted values, consistent with heteroscedasticity. This means that the variability in the response variable changes depending on the fitted value, so the model’s predictions are less precise in some ranges of fitted values than in others.
Residuals vs Leverage plot highlights several influential points that could disproportionately affect the model’s fit. These influential points need to be investigated further to determine whether they represent genuine data points or outliers that should be handled differently. Influential points may skew the model’s results and lead to biased estimates.
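The heteroscedasticity suggested by the Residuals vs Fitted and Scale-Location plots can also be checked numerically. The sketch below computes a studentized Breusch-Pagan statistic by hand in base R on simulated data. This is one possible check, not the method used in the analysis above, and the variables x and y are invented for the example.

```r
set.seed(42)

# Simulated data whose noise grows with x (illustrative only)
n <- 500
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = x)

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the predictors
e2 <- residuals(fit)^2
aux <- lm(e2 ~ x)

# Studentized Breusch-Pagan statistic: n * R-squared of the auxiliary fit,
# compared against a chi-squared distribution with one df per predictor
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(statistic = bp_stat, p_value = bp_p))
```

A small p-value is evidence against constant variance. For the TMDB model the auxiliary regression would use all three predictors, giving three degrees of freedom. If the lmtest package is available, its bptest() function performs this test directly.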
There appears to be non-constant variance (heteroscedasticity), as indicated by the Scale-Location plot, so the assumption of homoscedasticity is violated. Under heteroscedasticity the coefficient estimates remain unbiased, but the usual standard errors, and therefore the t-tests and p-values reported above, may be unreliable.
Some outliers and influential points were observed in the Residuals vs Leverage plot, suggesting that certain shows have a significant impact on the model’s fit. Further investigation might be needed to understand these points, as they could be exerting undue influence on the model.
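One way to follow up on influential points is Cook’s distance, which measures how much the fitted values change when a single observation is dropped. The sketch below uses simulated data with one planted influential point. The 4 / n cutoff is a common rule of thumb, not a formal test, and all variable names are invented for the example.

```r
set.seed(7)

# Simulated data with one planted influential observation (illustrative only)
n <- 100
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)
d$x[n] <- 10    # extreme predictor value (high leverage)
d$y[n] <- -20   # far from the trend of the other points

fit <- lm(y ~ x, data = d)

# Cook's distance for every observation
cooks <- cooks.distance(fit)

# Flag observations above the rule-of-thumb cutoff
flagged <- which(cooks > 4 / n)
print(flagged)

# Compare the slope with and without the most influential point
worst <- unname(which.max(cooks))
fit_drop <- lm(y ~ x, data = d[-worst, ])
print(c(with_point = coef(fit)[["x"]], without_point = coef(fit_drop)[["x"]]))
```

For the TMDB model, the same calls on linear_model would identify which shows to inspect. Whether a flagged show is a data error or a genuine extreme case still has to be judged by looking at the record itself.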
Number of Seasons: The coefficient for number_of_seasons is positive, indicating that, on average, an increase in the number of seasons is associated with an increase in the average rating of the show, holding other factors constant. Specifically, each additional season is associated with an increase in the average rating of approximately 0.1619 units. This suggests that longer-running shows tend to have higher average ratings, which could reflect greater viewer engagement and loyalty over time.
The significance of this coefficient (p-value < 2e-16) indicates strong evidence of a positive association between the number of seasons and a show’s average rating. This finding is relevant for producers and networks, although the data are observational and do not show that extending a show would raise its rating; well-rated shows may simply be more likely to be renewed.
Further exploration could involve transforming variables to address non-linearity, using robust regression to reduce the impact of influential points, or experimenting with other types of models to better understand the relationships in the dataset. Additionally, including categorical variables such as genre and production company could help improve the explanatory power of the model.
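As a minimal sketch of the variable-transformation idea, the example below compares a raw count predictor with its log1p() transform on simulated data. The variables episodes and rating are simulated stand-ins, and the response is generated from a logarithmic relationship by construction, so the example only shows the mechanics; it says nothing about what the TMDB data would show.

```r
set.seed(123)

# Simulated right-skewed count predictor (illustrative only)
n <- 1000
episodes <- round(rexp(n, rate = 1 / 50))
rating <- 2 + 1.5 * log1p(episodes) + rnorm(n, sd = 0.5)

# Same response, two specifications of the predictor
fit_raw <- lm(rating ~ episodes)
fit_log <- lm(rating ~ log1p(episodes))

# Compare the share of variance explained by each specification
r2 <- c(raw = summary(fit_raw)$r.squared,
        log1p = summary(fit_log)$r.squared)
print(round(r2, 3))
```

log1p() is used instead of log() so that a count of zero stays defined. After any transformation, the diagnostic plots should be checked again for the refitted model.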
In this data dive, we explored how various attributes of TV shows relate to their average ratings using a linear model. We highlighted some issues with the model, such as heteroscedasticity and influential points, and interpreted the impact of the number of seasons on ratings. Future work should focus on improving model diagnostics, transforming variables to better meet the assumptions of linear regression, and exploring additional predictors to increase the model’s explanatory power.