Netflix Data Dive - Regression Diagnostics

Load the Netflix:

netflix_data <- read.csv("~/Netflix_dataset.csv")


# Preview structure and initial summary
str(netflix_data)

## 'data.frame':    5806 obs. of  15 variables:
##  $ id                  : chr  "ts300399" "tm84618" "tm127384" "tm70993" ...
##  $ title               : chr  "Five Came Back: The Reference Films" "Taxi Driver" "Monty Python and the Holy Grail" "Life of Brian" ...
##  $ type                : chr  "SHOW" "MOVIE" "MOVIE" "MOVIE" ...
##  $ description         : chr  "This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discu"| __truncated__ "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ "Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as"| __truncated__ ...
##  $ release_year        : int  1945 1976 1975 1979 1973 1969 1971 1964 1980 1967 ...
##  $ age_certification   : chr  "TV-MA" "R" "PG" "R" ...
##  $ runtime             : int  48 113 91 94 133 30 102 170 104 110 ...
##  $ genres              : chr  "['documentation']" "['crime', 'drama']" "['comedy', 'fantasy']" "['comedy']" ...
##  $ production_countries: chr  "['US']" "['US']" "['GB']" "['GB']" ...
##  $ seasons             : num  1 NA NA NA NA 4 NA NA NA NA ...
##  $ imdb_id             : chr  "" "tt0075314" "tt0071853" "tt0079470" ...
##  $ imdb_score          : num  NA 8.3 8.2 8 8.1 8.8 7.7 7.8 5.8 7.7 ...
##  $ imdb_votes          : num  NA 795222 530877 392419 391942 ...
##  $ tmdb_popularity     : num  0.6 27.6 18.2 17.5 95.3 ...
##  $ tmdb_score          : num  NA 8.2 7.8 7.8 7.7 8.3 7.5 7.6 6.2 7.5 ...

summary(netflix_data)

##       id               title               type           description       
##  Length:5806        Length:5806        Length:5806        Length:5806       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   release_year  age_certification     runtime          genres         
##  Min.   :1945   Length:5806        Min.   :  0.00   Length:5806       
##  1st Qu.:2015   Class :character   1st Qu.: 44.00   Class :character  
##  Median :2018   Mode  :character   Median : 84.00   Mode  :character  
##  Mean   :2016                      Mean   : 77.64                     
##  3rd Qu.:2020                      3rd Qu.:105.00                     
##  Max.   :2022                      Max.   :251.00                     
##                                                                       
##  production_countries    seasons         imdb_id            imdb_score   
##  Length:5806          Min.   : 1.000   Length:5806        Min.   :1.500  
##  Class :character     1st Qu.: 1.000   Class :character   1st Qu.:5.800  
##  Mode  :character     Median : 1.000   Mode  :character   Median :6.600  
##                       Mean   : 2.166                      Mean   :6.533  
##                       3rd Qu.: 2.000                      3rd Qu.:7.400  
##                       Max.   :42.000                      Max.   :9.600  
##                       NA's   :3759                        NA's   :523    
##    imdb_votes      tmdb_popularity       tmdb_score    
##  Min.   :      5   Min.   :   0.0094   Min.   : 0.500  
##  1st Qu.:    521   1st Qu.:   3.1553   1st Qu.: 6.100  
##  Median :   2279   Median :   7.4780   Median : 6.900  
##  Mean   :  23407   Mean   :  22.5257   Mean   : 6.818  
##  3rd Qu.:  10144   3rd Qu.:  17.7757   3rd Qu.: 7.500  
##  Max.   :2268288   Max.   :1823.3740   Max.   :10.000  
##  NA's   :539       NA's   :94          NA's   :318

# Data Preparation
netflix_data <- netflix_data %>%
  drop_na(imdb_score, tmdb_score, runtime, type) %>%
  mutate(type_binary = ifelse(type == "SHOW", 1, 0))  # Creating a binary variable for type
# Preview structure and initial summary
str(netflix_data)

## 'data.frame':    5055 obs. of  16 variables:
##  $ id                  : chr  "tm84618" "tm127384" "tm70993" "tm190788" ...
##  $ title               : chr  "Taxi Driver" "Monty Python and the Holy Grail" "Life of Brian" "The Exorcist" ...
##  $ type                : chr  "MOVIE" "MOVIE" "MOVIE" "MOVIE" ...
##  $ description         : chr  "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ "Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as"| __truncated__ "12-year-old Regan MacNeil begins to adapt an explicit new personality as strange events befall the local area o"| __truncated__ ...
##  $ release_year        : int  1976 1975 1979 1973 1969 1971 1964 1980 1967 1966 ...
##  $ age_certification   : chr  "R" "PG" "R" "R" ...
##  $ runtime             : int  113 91 94 133 30 102 170 104 110 117 ...
##  $ genres              : chr  "['crime', 'drama']" "['comedy', 'fantasy']" "['comedy']" "['horror']" ...
##  $ production_countries: chr  "['US']" "['GB']" "['GB']" "['US']" ...
##  $ seasons             : num  NA NA NA NA 4 NA NA NA NA NA ...
##  $ imdb_id             : chr  "tt0075314" "tt0071853" "tt0079470" "tt0070047" ...
##  $ imdb_score          : num  8.3 8.2 8 8.1 8.8 7.7 7.8 5.8 7.7 7.3 ...
##  $ imdb_votes          : num  795222 530877 392419 391942 72895 ...
##  $ tmdb_popularity     : num  27.6 18.2 17.5 95.3 12.9 ...
##  $ tmdb_score          : num  8.2 7.8 7.8 7.7 8.3 7.5 7.6 6.2 7.5 7.1 ...
##  $ type_binary         : num  0 0 0 0 1 0 0 0 0 0 ...

summary(netflix_data)

##       id               title               type           description       
##  Length:5055        Length:5055        Length:5055        Length:5055       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   release_year  age_certification     runtime          genres         
##  Min.   :1953   Length:5055        Min.   :  0.00   Length:5055       
##  1st Qu.:2015   Class :character   1st Qu.: 46.00   Class :character  
##  Median :2018   Mode  :character   Median : 87.00   Mode  :character  
##  Mean   :2016                      Mean   : 79.53                     
##  3rd Qu.:2020                      3rd Qu.:106.00                     
##  Max.   :2022                      Max.   :235.00                     
##                                                                       
##  production_countries    seasons         imdb_id            imdb_score   
##  Length:5055          Min.   : 1.000   Length:5055        Min.   :1.600  
##  Class :character     1st Qu.: 1.000   Class :character   1st Qu.:5.800  
##  Mode  :character     Median : 1.000   Mode  :character   Median :6.600  
##                       Mean   : 2.277                      Mean   :6.536  
##                       3rd Qu.: 3.000                      3rd Qu.:7.400  
##                       Max.   :42.000                      Max.   :9.500  
##                       NA's   :3269                                       
##    imdb_votes      tmdb_popularity      tmdb_score      type_binary    
##  Min.   :      5   Min.   :   0.600   Min.   : 1.000   Min.   :0.0000  
##  1st Qu.:    616   1st Qu.:   3.614   1st Qu.: 6.100   1st Qu.:0.0000  
##  Median :   2534   Median :   8.320   Median : 6.900   Median :0.0000  
##  Mean   :  24349   Mean   :  24.219   Mean   : 6.811   Mean   :0.3533  
##  3rd Qu.:  11039   3rd Qu.:  19.164   3rd Qu.: 7.500   3rd Qu.:1.0000  
##  Max.   :2268288   Max.   :1823.374   Max.   :10.000   Max.   :1.0000  
##  NA's   :14

The dataset is cleaned to remove rows with missing values in key variables and a binary variable (type_binary) is created to differentiate between “SHOW” and “MOVIE”. This allows for the inclusion of categorical variables in a linear regression model.

Referencing the Simple Model from Week 8

# Simple linear regression model from Week 8
lm_model_basic <- lm(imdb_score ~ tmdb_score, data = netflix_data)
summary(lm_model_basic)

## 
## Call:
## lm(formula = imdb_score ~ tmdb_score, data = netflix_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7174 -0.4582  0.1218  0.5970  4.4042 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.50379    0.07918   31.62   <2e-16 ***
## tmdb_score   0.59200    0.01147   51.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9316 on 5053 degrees of freedom
## Multiple R-squared:  0.3454, Adjusted R-squared:  0.3452 
## F-statistic:  2666 on 1 and 5053 DF,  p-value: < 2.2e-16

The simple model uses tmdb_score to predict imdb_score. The coefficient for tmdb_score is 0.592, meaning that for every one-point increase in the TMDb score, the IMDb score increases by about 0.59 points. The R-squared value is 0.345, so the model explains about 34.5% of the variance in IMDb scores. While this is decent, the residual standard error of 0.9316 indicates there’s still quite a bit of unexplained variability. Clearly, additional predictors could help improve the model’s fit.

Expanding the Model with Additional Predictors

Variables Selection and Reasoning

type_binary (binary term): Whether the content is a “SHOW” or “MOVIE” could affect the relationship between TMDb and IMDb scores.
runtime (continuous variable): Runtime is a relevant factor as it may affect viewer engagement and rating patterns.
Interaction Term - tmdb_score:type_binary: This interaction allows the effect of tmdb_score to vary depending on whether the content is a show or a movie.

# Enhanced regression model
lm_model_extended <- lm(imdb_score ~ tmdb_score * type_binary + runtime, data = netflix_data)
summary(lm_model_extended)

## 
## Call:
## lm(formula = imdb_score ~ tmdb_score * type_binary + runtime, 
##     data = netflix_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6422 -0.4322  0.1074  0.5746  4.2319 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.3713179  0.1122275  12.219   <2e-16 ***
## tmdb_score              0.6803382  0.0156441  43.489   <2e-16 ***
## type_binary             2.8946661  0.1866062  15.512   <2e-16 ***
## runtime                 0.0050629  0.0005174   9.785   <2e-16 ***
## tmdb_score:type_binary -0.3380755  0.0256591 -13.176   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9045 on 5050 degrees of freedom
## Multiple R-squared:  0.3832, Adjusted R-squared:  0.3827 
## F-statistic: 784.5 on 4 and 5050 DF,  p-value: < 2.2e-16

In the expanded model:

tmdb_score: For movies, a one-point increase in TMDb score leads to a 0.68-point increase in IMDb score. For shows, the effect is less because of the interaction term.

type_binary: Shows tend to have IMDb scores that are about 2.89 points higher than movies, all else being equal.
runtime: Each additional minute of runtime is associated with a small (0.005-point) increase in IMDb score.
Interaction term: The relationship between TMDb and IMDb scores is weaker for shows compared to movies, with the effect being reduced by 0.34 points.

The R-squared for the enhanced model is 0.383, meaning it explains 38.3% of the variance in IMDb scores—an improvement over the simple model. The residual standard error dropped slightly to 0.9045, meaning the model fits better, but there’s still room for improvement.

Multicollinearity Check

Multicollinearity can inflate variances, so I will check the Variance Inflation Factor (VIF) values to assess it.

# Check for multicollinearity
vif_values <- vif(lm_model_extended)

## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif

vif_values

##             tmdb_score            type_binary                runtime 
##               1.974737              49.154635               2.488030 
## tmdb_score:type_binary 
##              53.719631

A high VIF (>5) suggests multicollinearity, which can inflate standard errors and reduce interpretability. In this case:

tmdb_score:type_binary has a high VIF, indicating multicollinearity with tmdb_score and type_binary.

If multicollinearity impacts interpretability or robustness, we may consider alternative specifications, like centering variables or removing redundant terms.

Model Diagnostics Using Plots

I’ll assess key assumptions of the regression model using five diagnostic plots.

1. Residuals vs. Fitted Values Plot

This plot checks for homoscedasticity (constant variance of errors) across all fitted values.

# Residuals vs Fitted Values
gg_resfitted(lm_model_extended) + geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Interpretation: Ideally, residuals should scatter around zero without forming a discernible pattern. Here, a slight “fanning” effect suggests heteroscedasticity, where variance increases at higher fitted values. If this pattern worsens, alternative models or weighted regression may be needed.

2. Residuals vs. Predictor Plots

To further examine the assumptions, I plot residuals against each continuous predictor (tmdb_score and runtime) to check for patterns, which would suggest non-linearity or non-constant variance.

# Residuals vs tmdb_score
ggplot(netflix_data, aes(x = tmdb_score, y = residuals(lm_model_extended))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Residuals vs TMDb Score")

## `geom_smooth()` using formula = 'y ~ x'

# Residuals vs runtime
ggplot(netflix_data, aes(x = runtime, y = residuals(lm_model_extended))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Residuals vs Runtime")

## `geom_smooth()` using formula = 'y ~ x'

Interpretation: These plots check for linearity. A non-linear pattern or “fanning” shape in residuals would indicate a violation of the constant variance or linearity assumption. Here, residuals appear randomly distributed, generally supporting linearity for both predictors.

3. Residual Histogram

# Residual Histogram
gg_reshist(lm_model_extended)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Interpretation: The histogram should show a roughly normal distribution for residuals. While slight asymmetry or skewness may be present, it does not appear severe enough to invalidate model assumptions. If normality is significantly violated, transformations could be considered.

4. QQ Plot

A QQ plot checks for normality in residuals by plotting them against a normal distribution.

# QQ Plot
gg_qqplot(lm_model_extended)

Interpretation: Residuals falling along the 45-degree line indicate normality. Deviations in the tails suggest minor non-normality, but given the general linearity of the points, this assumption is adequately met for practical purposes.

5. Cook’s Distance Plot

Cook’s Distance measures the influence of each data point on the model, with high values indicating high influence.

# Cook’s Distance Plot

gg_cooksd(lm_model_extended, threshold = ‘matlab’)

# Calculate Cook's distance
cooksd <- cooks.distance(lm_model_extended)

# Get the indices of the top 10 highest Cook's distances
top_10_indices <- order(cooksd, decreasing = TRUE)[1:10]

# Plot Cook's distance with labels for the top 10 points
ggplot(data.frame(cooksd = cooksd), aes(x = 1:length(cooksd), y = cooksd)) +
  geom_point() +
  geom_line() +
  geom_hline(yintercept = 4 / length(cooksd), color = "red") +
  geom_text(data = data.frame(cooksd = cooksd[top_10_indices],
                             x = top_10_indices,
                             label = top_10_indices),
            aes(x = x, y = cooksd, label = label),
            vjust = -0.5, size = 3) +
  labs(x = "Observation Number", y = "Cook's Distance",
       title = "Cook's Distance Plot")

Interpretation: Points with high Cook’s D values exert an outsize influence on the model. Several points exceed the threshold, suggesting influential observations. These points should be reviewed to determine if they represent anomalies, potential outliers, or valid data points with high leverage.

Evaluating Error Metrics for Model Performance

The following metrics provide a quantitative assessment of model fit, with lower values indicating better predictive accuracy.

# Error Metrics for the Basic Model (Week 8)
sse_basic <- sum(lm_model_basic$residuals ^ 2)
mse_basic <- mean(lm_model_basic$residuals ^ 2)
rmse_basic <- sqrt(mse_basic)
mae_basic <- mean(abs(lm_model_basic$residuals))
mape_basic <- mean((abs(lm_model_basic$residuals) + 1) / (predict(lm_model_basic) + 1))

# Error Metrics for the Extended Model (Week 9)
sse_extended <- sum(lm_model_extended$residuals ^ 2)
mse_extended <- mean(lm_model_extended$residuals ^ 2)
rmse_extended <- sqrt(mse_extended)
mae_extended <- mean(abs(lm_model_extended$residuals))
mape_extended <- mean((abs(lm_model_extended$residuals) + 1) / (predict(lm_model_extended) + 1))

# Display metrics for comparison
data.frame(
  Metric = c("SSE", "MSE", "RMSE", "MAE", "MAPE"),
  Basic_Model = c(sse_basic, mse_basic, rmse_basic, mae_basic, mape_basic),
  Extended_Model = c(sse_extended, mse_extended, rmse_extended, mae_extended, mape_extended)
)

##   Metric  Basic_Model Extended_Model
## 1    SSE 4385.6711006   4131.9435811
## 2    MSE    0.8675907      0.8173973
## 3   RMSE    0.9314455      0.9041003
## 4    MAE    0.6933627      0.6675413
## 5   MAPE    0.2270968      0.2242176

To quantify model performance, I calculated several error metrics and compared them to the simpler Week 8 model and the above table shows these metrics for both models:

Sum of Squared Errors (SSE) and Mean Squared Error (MSE) measure overall error, with lower values in the enhanced model suggesting better fit.
Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) provide average deviations, again lower in the enhanced model, showing improvement.
Mean Absolute Percentage Error (MAPE) offers a percentage-based error, and the slight reduction in MAPE for the enhanced model implies it provides a better predictive fit.

Summary of Insights and Future Directions

Throughout this analysis, I gathered several insights from the diagnostic plots and model evaluation, each pointing to areas of improvement and further research opportunities.

Homoscedasticity: The Residuals vs. Fitted plot revealed minor heteroscedasticity, indicated by a slight fanning pattern. While this isn’t severe enough to invalidate the model, it suggests that the variance of errors increases at higher predicted values. If this pattern persists in future models, I may need to consider alternative approaches, such as using weighted regression or applying a transformation to the response variable to stabilize the variance.
Residual Normality: The QQ plot and the histogram of residuals both suggest minor deviations from normality, particularly in the tails. However, since the deviations are not drastic, I consider the normality assumption reasonably met for practical purposes. If future models display more significant non-normality, I might explore transformations or even non-linear models to better capture the underlying distribution of residuals.
Influential Points: The Cook’s Distance plot identified several points with high leverage that could disproportionately influence the model. It’s important to investigate these observations further to determine whether they represent genuine outliers or if they are simply high-leverage data points that reflect real-world variability. Depending on the outcome, I could consider either removing these points or using robust regression techniques to reduce their influence on the model.
Multicollinearity: The high Variance Inflation Factor (VIF) for the interaction term indicates the presence of multicollinearity between the interaction and its component terms (i.e., tmdb_score and type_binary). Multicollinearity inflates the standard errors of the coefficients, making it harder to assess the individual impact of predictors. To address this, I may try centering the predictor variables or removing the interaction term if it’s not critical to the model. Alternatively, ridge regression could be used to mitigate multicollinearity without dropping important variables.

Future Directions

Additional Variables: Including variables such as genre, age certification, or IMDb votes could capture additional variability in imdb_score.
Alternative Models: If linearity assumptions continue to show strain, exploring non-linear models, such as polynomial or generalized additive models, might yield a better fit.
Outlier Analysis: Further investigation into influential points, as identified in the Cook’s Distance plot, may improve model robustness by addressing high-leverage cases.

This analysis underscores the value of considering runtime, TMDb ratings, and content type in predicting IMDb scores, providing insights into audience preferences that could inform Netflix content strategies. Through ongoing refinement of variable selection and diagnostics, this model can continue to yield meaningful, actionable insights for content analysts and creators.