Load the Netflix:
netflix_data <- read.csv("~/Netflix_dataset.csv")
# Preview structure and initial summary
str(netflix_data)
## 'data.frame': 5806 obs. of 15 variables:
## $ id : chr "ts300399" "tm84618" "tm127384" "tm70993" ...
## $ title : chr "Five Came Back: The Reference Films" "Taxi Driver" "Monty Python and the Holy Grail" "Life of Brian" ...
## $ type : chr "SHOW" "MOVIE" "MOVIE" "MOVIE" ...
## $ description : chr "This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discu"| __truncated__ "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ "Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as"| __truncated__ ...
## $ release_year : int 1945 1976 1975 1979 1973 1969 1971 1964 1980 1967 ...
## $ age_certification : chr "TV-MA" "R" "PG" "R" ...
## $ runtime : int 48 113 91 94 133 30 102 170 104 110 ...
## $ genres : chr "['documentation']" "['crime', 'drama']" "['comedy', 'fantasy']" "['comedy']" ...
## $ production_countries: chr "['US']" "['US']" "['GB']" "['GB']" ...
## $ seasons : num 1 NA NA NA NA 4 NA NA NA NA ...
## $ imdb_id : chr "" "tt0075314" "tt0071853" "tt0079470" ...
## $ imdb_score : num NA 8.3 8.2 8 8.1 8.8 7.7 7.8 5.8 7.7 ...
## $ imdb_votes : num NA 795222 530877 392419 391942 ...
## $ tmdb_popularity : num 0.6 27.6 18.2 17.5 95.3 ...
## $ tmdb_score : num NA 8.2 7.8 7.8 7.7 8.3 7.5 7.6 6.2 7.5 ...
summary(netflix_data)
## id title type description
## Length:5806 Length:5806 Length:5806 Length:5806
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## release_year age_certification runtime genres
## Min. :1945 Length:5806 Min. : 0.00 Length:5806
## 1st Qu.:2015 Class :character 1st Qu.: 44.00 Class :character
## Median :2018 Mode :character Median : 84.00 Mode :character
## Mean :2016 Mean : 77.64
## 3rd Qu.:2020 3rd Qu.:105.00
## Max. :2022 Max. :251.00
##
## production_countries seasons imdb_id imdb_score
## Length:5806 Min. : 1.000 Length:5806 Min. :1.500
## Class :character 1st Qu.: 1.000 Class :character 1st Qu.:5.800
## Mode :character Median : 1.000 Mode :character Median :6.600
## Mean : 2.166 Mean :6.533
## 3rd Qu.: 2.000 3rd Qu.:7.400
## Max. :42.000 Max. :9.600
## NA's :3759 NA's :523
## imdb_votes tmdb_popularity tmdb_score
## Min. : 5 Min. : 0.0094 Min. : 0.500
## 1st Qu.: 521 1st Qu.: 3.1553 1st Qu.: 6.100
## Median : 2279 Median : 7.4780 Median : 6.900
## Mean : 23407 Mean : 22.5257 Mean : 6.818
## 3rd Qu.: 10144 3rd Qu.: 17.7757 3rd Qu.: 7.500
## Max. :2268288 Max. :1823.3740 Max. :10.000
## NA's :539 NA's :94 NA's :318
# Data Preparation
netflix_data <- netflix_data %>%
drop_na(imdb_score, tmdb_score, runtime, type) %>%
mutate(type_binary = ifelse(type == "SHOW", 1, 0)) # Creating a binary variable for type
# Preview structure and initial summary
str(netflix_data)
## 'data.frame': 5055 obs. of 16 variables:
## $ id : chr "tm84618" "tm127384" "tm70993" "tm190788" ...
## $ title : chr "Taxi Driver" "Monty Python and the Holy Grail" "Life of Brian" "The Exorcist" ...
## $ type : chr "MOVIE" "MOVIE" "MOVIE" "MOVIE" ...
## $ description : chr "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ "Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as"| __truncated__ "12-year-old Regan MacNeil begins to adapt an explicit new personality as strange events befall the local area o"| __truncated__ ...
## $ release_year : int 1976 1975 1979 1973 1969 1971 1964 1980 1967 1966 ...
## $ age_certification : chr "R" "PG" "R" "R" ...
## $ runtime : int 113 91 94 133 30 102 170 104 110 117 ...
## $ genres : chr "['crime', 'drama']" "['comedy', 'fantasy']" "['comedy']" "['horror']" ...
## $ production_countries: chr "['US']" "['GB']" "['GB']" "['US']" ...
## $ seasons : num NA NA NA NA 4 NA NA NA NA NA ...
## $ imdb_id : chr "tt0075314" "tt0071853" "tt0079470" "tt0070047" ...
## $ imdb_score : num 8.3 8.2 8 8.1 8.8 7.7 7.8 5.8 7.7 7.3 ...
## $ imdb_votes : num 795222 530877 392419 391942 72895 ...
## $ tmdb_popularity : num 27.6 18.2 17.5 95.3 12.9 ...
## $ tmdb_score : num 8.2 7.8 7.8 7.7 8.3 7.5 7.6 6.2 7.5 7.1 ...
## $ type_binary : num 0 0 0 0 1 0 0 0 0 0 ...
summary(netflix_data)
## id title type description
## Length:5055 Length:5055 Length:5055 Length:5055
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## release_year age_certification runtime genres
## Min. :1953 Length:5055 Min. : 0.00 Length:5055
## 1st Qu.:2015 Class :character 1st Qu.: 46.00 Class :character
## Median :2018 Mode :character Median : 87.00 Mode :character
## Mean :2016 Mean : 79.53
## 3rd Qu.:2020 3rd Qu.:106.00
## Max. :2022 Max. :235.00
##
## production_countries seasons imdb_id imdb_score
## Length:5055 Min. : 1.000 Length:5055 Min. :1.600
## Class :character 1st Qu.: 1.000 Class :character 1st Qu.:5.800
## Mode :character Median : 1.000 Mode :character Median :6.600
## Mean : 2.277 Mean :6.536
## 3rd Qu.: 3.000 3rd Qu.:7.400
## Max. :42.000 Max. :9.500
## NA's :3269
## imdb_votes tmdb_popularity tmdb_score type_binary
## Min. : 5 Min. : 0.600 Min. : 1.000 Min. :0.0000
## 1st Qu.: 616 1st Qu.: 3.614 1st Qu.: 6.100 1st Qu.:0.0000
## Median : 2534 Median : 8.320 Median : 6.900 Median :0.0000
## Mean : 24349 Mean : 24.219 Mean : 6.811 Mean :0.3533
## 3rd Qu.: 11039 3rd Qu.: 19.164 3rd Qu.: 7.500 3rd Qu.:1.0000
## Max. :2268288 Max. :1823.374 Max. :10.000 Max. :1.0000
## NA's :14
The dataset is cleaned to remove rows with missing values in key
variables and a binary variable (type_binary
) is created to
differentiate between “SHOW” and “MOVIE”. This allows for the inclusion
of categorical variables in a linear regression model.
Referencing the Simple Model from Week 8
# Simple linear regression model from Week 8
lm_model_basic <- lm(imdb_score ~ tmdb_score, data = netflix_data)
summary(lm_model_basic)
##
## Call:
## lm(formula = imdb_score ~ tmdb_score, data = netflix_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7174 -0.4582 0.1218 0.5970 4.4042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.50379 0.07918 31.62 <2e-16 ***
## tmdb_score 0.59200 0.01147 51.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9316 on 5053 degrees of freedom
## Multiple R-squared: 0.3454, Adjusted R-squared: 0.3452
## F-statistic: 2666 on 1 and 5053 DF, p-value: < 2.2e-16
The simple model uses tmdb_score
to predict
imdb_score
. The coefficient for tmdb_score
is
0.592, meaning that for every one-point increase in the TMDb score, the
IMDb score increases by about 0.59 points. The R-squared value is 0.345,
so the model explains about 34.5% of the variance in IMDb scores. While
this is decent, the residual standard error of 0.9316 indicates there’s
still quite a bit of unexplained variability. Clearly, additional
predictors could help improve the model’s fit.
type_binary
(binary term): Whether
the content is a “SHOW” or “MOVIE” could affect the relationship between
TMDb and IMDb scores.
runtime
(continuous variable):
Runtime is a relevant factor as it may affect viewer engagement and
rating patterns.
Interaction Term -
tmdb_score:type_binary
: This interaction allows the effect
of tmdb_score
to vary depending on whether the content is a
show or a movie.
# Enhanced regression model
lm_model_extended <- lm(imdb_score ~ tmdb_score * type_binary + runtime, data = netflix_data)
summary(lm_model_extended)
##
## Call:
## lm(formula = imdb_score ~ tmdb_score * type_binary + runtime,
## data = netflix_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6422 -0.4322 0.1074 0.5746 4.2319
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3713179 0.1122275 12.219 <2e-16 ***
## tmdb_score 0.6803382 0.0156441 43.489 <2e-16 ***
## type_binary 2.8946661 0.1866062 15.512 <2e-16 ***
## runtime 0.0050629 0.0005174 9.785 <2e-16 ***
## tmdb_score:type_binary -0.3380755 0.0256591 -13.176 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9045 on 5050 degrees of freedom
## Multiple R-squared: 0.3832, Adjusted R-squared: 0.3827
## F-statistic: 784.5 on 4 and 5050 DF, p-value: < 2.2e-16
In the expanded model:
type_binary: Shows tend to have IMDb scores that are about 2.89 points higher than movies, all else being equal.
runtime: Each additional minute of runtime is associated with a small (0.005-point) increase in IMDb score.
Interaction term: The relationship between TMDb and IMDb scores is weaker for shows compared to movies, with the effect being reduced by 0.34 points.
The R-squared for the enhanced model is 0.383, meaning it explains 38.3% of the variance in IMDb scores—an improvement over the simple model. The residual standard error dropped slightly to 0.9045, meaning the model fits better, but there’s still room for improvement.
Multicollinearity Check
Multicollinearity can inflate variances, so I will check the Variance Inflation Factor (VIF) values to assess it.
# Check for multicollinearity
vif_values <- vif(lm_model_extended)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
vif_values
## tmdb_score type_binary runtime
## 1.974737 49.154635 2.488030
## tmdb_score:type_binary
## 53.719631
A high VIF (>5) suggests multicollinearity, which can inflate standard errors and reduce interpretability. In this case:
tmdb_score:type_binary
has a high VIF,
indicating multicollinearity with tmdb_score
and
type_binary
.If multicollinearity impacts interpretability or robustness, we may consider alternative specifications, like centering variables or removing redundant terms.
Model Diagnostics Using Plots
I’ll assess key assumptions of the regression model using five diagnostic plots.
1. Residuals vs. Fitted Values Plot
This plot checks for homoscedasticity (constant variance of errors) across all fitted values.
# Residuals vs Fitted Values
gg_resfitted(lm_model_extended) + geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Interpretation: Ideally, residuals should scatter around zero without forming a discernible pattern. Here, a slight “fanning” effect suggests heteroscedasticity, where variance increases at higher fitted values. If this pattern worsens, alternative models or weighted regression may be needed.
2. Residuals vs. Predictor Plots
To further examine the assumptions, I plot residuals against each
continuous predictor (tmdb_score
and runtime
)
to check for patterns, which would suggest non-linearity or non-constant
variance.
# Residuals vs tmdb_score
ggplot(netflix_data, aes(x = tmdb_score, y = residuals(lm_model_extended))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Residuals vs TMDb Score")
## `geom_smooth()` using formula = 'y ~ x'
# Residuals vs runtime
ggplot(netflix_data, aes(x = runtime, y = residuals(lm_model_extended))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Residuals vs Runtime")
## `geom_smooth()` using formula = 'y ~ x'
Interpretation: These plots check for linearity. A non-linear pattern or “fanning” shape in residuals would indicate a violation of the constant variance or linearity assumption. Here, residuals appear randomly distributed, generally supporting linearity for both predictors.
3. Residual Histogram
# Residual Histogram
gg_reshist(lm_model_extended)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Interpretation: The histogram should show a roughly normal distribution for residuals. While slight asymmetry or skewness may be present, it does not appear severe enough to invalidate model assumptions. If normality is significantly violated, transformations could be considered.
4. QQ Plot
A QQ plot checks for normality in residuals by plotting them against a normal distribution.
# QQ Plot
gg_qqplot(lm_model_extended)
Interpretation: Residuals falling along the 45-degree line indicate normality. Deviations in the tails suggest minor non-normality, but given the general linearity of the points, this assumption is adequately met for practical purposes.
5. Cook’s Distance Plot
Cook’s Distance measures the influence of each data point on the model, with high values indicating high influence.
# Cook’s Distance Plot
gg_cooksd(lm_model_extended, threshold = ‘matlab’)
# Calculate Cook's distance
cooksd <- cooks.distance(lm_model_extended)
# Get the indices of the top 10 highest Cook's distances
top_10_indices <- order(cooksd, decreasing = TRUE)[1:10]
# Plot Cook's distance with labels for the top 10 points
ggplot(data.frame(cooksd = cooksd), aes(x = 1:length(cooksd), y = cooksd)) +
geom_point() +
geom_line() +
geom_hline(yintercept = 4 / length(cooksd), color = "red") +
geom_text(data = data.frame(cooksd = cooksd[top_10_indices],
x = top_10_indices,
label = top_10_indices),
aes(x = x, y = cooksd, label = label),
vjust = -0.5, size = 3) +
labs(x = "Observation Number", y = "Cook's Distance",
title = "Cook's Distance Plot")
Interpretation: Points with high Cook’s D values exert an outsize influence on the model. Several points exceed the threshold, suggesting influential observations. These points should be reviewed to determine if they represent anomalies, potential outliers, or valid data points with high leverage.
The following metrics provide a quantitative assessment of model fit, with lower values indicating better predictive accuracy.
# Error Metrics for the Basic Model (Week 8)
sse_basic <- sum(lm_model_basic$residuals ^ 2)
mse_basic <- mean(lm_model_basic$residuals ^ 2)
rmse_basic <- sqrt(mse_basic)
mae_basic <- mean(abs(lm_model_basic$residuals))
mape_basic <- mean((abs(lm_model_basic$residuals) + 1) / (predict(lm_model_basic) + 1))
# Error Metrics for the Extended Model (Week 9)
sse_extended <- sum(lm_model_extended$residuals ^ 2)
mse_extended <- mean(lm_model_extended$residuals ^ 2)
rmse_extended <- sqrt(mse_extended)
mae_extended <- mean(abs(lm_model_extended$residuals))
mape_extended <- mean((abs(lm_model_extended$residuals) + 1) / (predict(lm_model_extended) + 1))
# Display metrics for comparison
data.frame(
Metric = c("SSE", "MSE", "RMSE", "MAE", "MAPE"),
Basic_Model = c(sse_basic, mse_basic, rmse_basic, mae_basic, mape_basic),
Extended_Model = c(sse_extended, mse_extended, rmse_extended, mae_extended, mape_extended)
)
## Metric Basic_Model Extended_Model
## 1 SSE 4385.6711006 4131.9435811
## 2 MSE 0.8675907 0.8173973
## 3 RMSE 0.9314455 0.9041003
## 4 MAE 0.6933627 0.6675413
## 5 MAPE 0.2270968 0.2242176
To quantify model performance, I calculated several error metrics and compared them to the simpler Week 8 model and the above table shows these metrics for both models:
Sum of Squared Errors (SSE) and Mean Squared Error (MSE) measure overall error, with lower values in the enhanced model suggesting better fit.
Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) provide average deviations, again lower in the enhanced model, showing improvement.
Mean Absolute Percentage Error (MAPE) offers a percentage-based error, and the slight reduction in MAPE for the enhanced model implies it provides a better predictive fit.
Throughout this analysis, I gathered several insights from the diagnostic plots and model evaluation, each pointing to areas of improvement and further research opportunities.
Homoscedasticity: The Residuals vs. Fitted plot revealed minor heteroscedasticity, indicated by a slight fanning pattern. While this isn’t severe enough to invalidate the model, it suggests that the variance of errors increases at higher predicted values. If this pattern persists in future models, I may need to consider alternative approaches, such as using weighted regression or applying a transformation to the response variable to stabilize the variance.
Residual Normality: The QQ plot and the histogram of residuals both suggest minor deviations from normality, particularly in the tails. However, since the deviations are not drastic, I consider the normality assumption reasonably met for practical purposes. If future models display more significant non-normality, I might explore transformations or even non-linear models to better capture the underlying distribution of residuals.
Influential Points: The Cook’s Distance plot identified several points with high leverage that could disproportionately influence the model. It’s important to investigate these observations further to determine whether they represent genuine outliers or if they are simply high-leverage data points that reflect real-world variability. Depending on the outcome, I could consider either removing these points or using robust regression techniques to reduce their influence on the model.
Multicollinearity: The high Variance Inflation
Factor (VIF) for the interaction term indicates the presence of
multicollinearity between the interaction and its component terms (i.e.,
tmdb_score
and type_binary
). Multicollinearity
inflates the standard errors of the coefficients, making it harder to
assess the individual impact of predictors. To address this, I may try
centering the predictor variables or removing the interaction term if
it’s not critical to the model. Alternatively, ridge regression could be
used to mitigate multicollinearity without dropping important
variables.
Additional Variables: Including variables such
as genre, age certification, or IMDb votes could capture additional
variability in imdb_score
.
Alternative Models: If linearity assumptions continue to show strain, exploring non-linear models, such as polynomial or generalized additive models, might yield a better fit.
Outlier Analysis: Further investigation into influential points, as identified in the Cook’s Distance plot, may improve model robustness by addressing high-leverage cases.
This analysis underscores the value of considering runtime, TMDb ratings, and content type in predicting IMDb scores, providing insights into audience preferences that could inform Netflix content strategies. Through ongoing refinement of variable selection and diagnostics, this model can continue to yield meaningful, actionable insights for content analysts and creators.