Netflix Data Dive - GLMs Part 2

Loading and Preparing the Dataset

First, I’ll load the Netflix data and inspect its structure and missing values. Then I’ll drop rows that are missing any of the modeling variables and create a binary column, is_show, indicating whether a title is a show (1) or a movie (0).

# Load the Netflix dataset
netflix_data <- read.csv("~/Netflix_dataset.csv")

# Check the structure of the dataset
str(netflix_data)

## 'data.frame':    5806 obs. of  15 variables:
##  $ id                  : chr  "ts300399" "tm84618" "tm127384" "tm70993" ...
##  $ title               : chr  "Five Came Back: The Reference Films" "Taxi Driver" "Monty Python and the Holy Grail" "Life of Brian" ...
##  $ type                : chr  "SHOW" "MOVIE" "MOVIE" "MOVIE" ...
##  $ description         : chr  "This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discu"| __truncated__ "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ "Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as"| __truncated__ ...
##  $ release_year        : int  1945 1976 1975 1979 1973 1969 1971 1964 1980 1967 ...
##  $ age_certification   : chr  "TV-MA" "R" "PG" "R" ...
##  $ runtime             : int  48 113 91 94 133 30 102 170 104 110 ...
##  $ genres              : chr  "['documentation']" "['crime', 'drama']" "['comedy', 'fantasy']" "['comedy']" ...
##  $ production_countries: chr  "['US']" "['US']" "['GB']" "['GB']" ...
##  $ seasons             : num  1 NA NA NA NA 4 NA NA NA NA ...
##  $ imdb_id             : chr  "" "tt0075314" "tt0071853" "tt0079470" ...
##  $ imdb_score          : num  NA 8.3 8.2 8 8.1 8.8 7.7 7.8 5.8 7.7 ...
##  $ imdb_votes          : num  NA 795222 530877 392419 391942 ...
##  $ tmdb_popularity     : num  0.6 27.6 18.2 17.5 95.3 ...
##  $ tmdb_score          : num  NA 8.2 7.8 7.8 7.7 8.3 7.5 7.6 6.2 7.5 ...

# Summarize the data to see missing values
summary(netflix_data)

##       id               title               type           description       
##  Length:5806        Length:5806        Length:5806        Length:5806       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   release_year  age_certification     runtime          genres         
##  Min.   :1945   Length:5806        Min.   :  0.00   Length:5806       
##  1st Qu.:2015   Class :character   1st Qu.: 44.00   Class :character  
##  Median :2018   Mode  :character   Median : 84.00   Mode  :character  
##  Mean   :2016                      Mean   : 77.64                     
##  3rd Qu.:2020                      3rd Qu.:105.00                     
##  Max.   :2022                      Max.   :251.00                     
##                                                                       
##  production_countries    seasons         imdb_id            imdb_score   
##  Length:5806          Min.   : 1.000   Length:5806        Min.   :1.500  
##  Class :character     1st Qu.: 1.000   Class :character   1st Qu.:5.800  
##  Mode  :character     Median : 1.000   Mode  :character   Median :6.600  
##                       Mean   : 2.166                      Mean   :6.533  
##                       3rd Qu.: 2.000                      3rd Qu.:7.400  
##                       Max.   :42.000                      Max.   :9.600  
##                       NA's   :3759                        NA's   :523    
##    imdb_votes      tmdb_popularity       tmdb_score    
##  Min.   :      5   Min.   :   0.0094   Min.   : 0.500  
##  1st Qu.:    521   1st Qu.:   3.1553   1st Qu.: 6.100  
##  Median :   2279   Median :   7.4780   Median : 6.900  
##  Mean   :  23407   Mean   :  22.5257   Mean   : 6.818  
##  3rd Qu.:  10144   3rd Qu.:  17.7757   3rd Qu.: 7.500  
##  Max.   :2268288   Max.   :1823.3740   Max.   :10.000  
##  NA's   :539       NA's   :94          NA's   :318

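# Load the packages that provide %>%, mutate(), and drop_na()
library(dplyr)
library(tidyr)

# Data cleaning: drop rows missing any modeling variable, then flag shows
netflix_data <- netflix_data %>%
  drop_na(imdb_score, tmdb_score, runtime, release_year, type) %>%
  mutate(is_show = ifelse(type == "SHOW", 1, 0))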
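
As a quick sanity check on a cleaning step like this, the sketch below counts missing values per column and compares row counts before and after dropping incomplete rows. It runs on a small made-up data frame (toy_data and its values are hypothetical, not the Netflix file) and uses base R's complete.cases() as a stand-in for drop_na(). Note that an empty string, such as a blank imdb_id, is not treated as NA by either approach.

```r
# Hypothetical toy data standing in for the Netflix file
toy_data <- data.frame(
  type       = c("SHOW", "MOVIE", "MOVIE", "SHOW"),
  imdb_id    = c("", "id2", "id3", "id4"),
  imdb_score = c(NA, 8.3, 8.2, 7.1),
  tmdb_score = c(NA, 8.2, NA, 7.0),
  stringsAsFactors = FALSE
)

# Missing values per column (the blank imdb_id is not counted as NA)
na_counts <- colSums(is.na(toy_data))
print(na_counts)

# Keep only rows with both scores present (base R equivalent of drop_na)
model_cols <- c("imdb_score", "tmdb_score")
clean_data <- toy_data[complete.cases(toy_data[, model_cols]), ]
clean_data$is_show <- ifelse(clean_data$type == "SHOW", 1, 0)

cat("Rows before:", nrow(toy_data), " after:", nrow(clean_data), "\n")
```

Printing the before and after counts makes it easy to see how much data the cleaning step removes before any model is fit.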

Selecting Response and Explanatory Variables

For this analysis, I’m interested in understanding the relationships between various features of Netflix titles and their IMDb scores. My response variable will be imdb_score, and I’ll select a mix of explanatory variables that could logically influence this score:

runtime: This variable measures the length of the content in minutes, with the hypothesis that longer runtime could indicate either higher production value or increased viewer engagement, potentially impacting the IMDb rating.
tmdb_score: TMDb scores are another well-known rating metric, and I suspect there may be a positive correlation between TMDb and IMDb scores.
release_year: This variable represents the year the title was released. Over time, changes in production quality, viewer preferences, and platform strategies could impact IMDb scores.
is_show: A binary variable indicating whether the title is a show or a movie. Shows and movies may inherently receive different types of ratings.

I’ll build a linear regression model using these variables.

Building the Linear Regression Model

To model IMDb scores, I’ll use a linear regression approach, including the selected explanatory variables.

# Building the linear regression model
lm_model <- lm(imdb_score ~ runtime + tmdb_score + release_year + is_show, data = netflix_data)

# Viewing the summary of the model
summary(lm_model)

## 
## Call:
## lm(formula = imdb_score ~ runtime + tmdb_score + release_year + 
##     is_show, data = netflix_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5621 -0.4423  0.1196  0.5898  4.1329 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  43.9340894  3.5698361  12.307  < 2e-16 ***
## runtime       0.0042126  0.0005266   7.999 1.54e-15 ***
## tmdb_score    0.5557144  0.0124473  44.645  < 2e-16 ***
## release_year -0.0206805  0.0017667 -11.706  < 2e-16 ***
## is_show       0.4825849  0.0445700  10.828  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9077 on 5050 degrees of freedom
## Multiple R-squared:  0.3789, Adjusted R-squared:  0.3784 
## F-statistic: 770.1 on 4 and 5050 DF,  p-value: < 2.2e-16

Model Summary Interpretation:

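Intercept: The estimated IMDb score when all predictors are zero. Because that point lies far outside the data (zero runtime, release year 0), the value 43.93 has no practical meaning and only anchors the fitted line.
runtime: Positive coefficient (0.00421). Each additional minute is associated with a 0.0042-point higher IMDb score, holding the other predictors fixed, or about 0.25 points per hour.
tmdb_score: Positive coefficient (0.5557). A title rated one point higher on TMDb is expected to score about 0.56 points higher on IMDb. This is the strongest relationship in the model by t value.
release_year: Slight negative trend (-0.0207 per year, roughly 0.2 points per decade), possibly indicating a shift in rating standards or user expectations over time.
is_show: Positive coefficient (0.4826). Shows are rated about half a point higher on IMDb than movies with the same runtime, TMDb score, and release year.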
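
To make the coefficients concrete, here is a small worked example that plugs them into the regression equation by hand. The coefficient values are copied from the summary output; the title being scored (100 minutes, TMDb score 7.0, released in 2018) and the helper predict_score() are hypothetical and only there for illustration.

```r
# Coefficients copied from the summary(lm_model) output
coefs <- c(
  intercept    = 43.9340894,
  runtime      = 0.0042126,
  tmdb_score   = 0.5557144,
  release_year = -0.0206805,
  is_show      = 0.4825849
)

# Hypothetical helper: evaluates the fitted equation for one title
predict_score <- function(runtime, tmdb_score, release_year, is_show) {
  unname(
    coefs["intercept"] +
      coefs["runtime"] * runtime +
      coefs["tmdb_score"] * tmdb_score +
      coefs["release_year"] * release_year +
      coefs["is_show"] * is_show
  )
}

# Same hypothetical title, scored once as a movie and once as a show
movie_pred <- predict_score(100, 7.0, 2018, is_show = 0)
show_pred  <- predict_score(100, 7.0, 2018, is_show = 1)

round(c(movie = movie_pred, show = show_pred), 2)
# movie  show
#  6.51  6.99
```

The two predictions differ by exactly the is_show coefficient, which is what "holding the other predictors fixed" means in practice.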

Interpretation of the Coefficient for tmdb_score:

The coefficient for tmdb_score is 0.5557: holding runtime, release year, and title type constant, a one-point increase in TMDb score is associated with a 0.56-point increase in the expected IMDb score. This is expected, as higher ratings on one platform often correlate with higher ratings on another.

Confidence Interval for tmdb_score

To further interpret the coefficient of tmdb_score, I’ll calculate its confidence interval to evaluate the precision of the estimate.

# Confidence Interval for tmdb_score coefficient
confint(lm_model, "tmdb_score", level = 0.95)

##                2.5 %    97.5 %
## tmdb_score 0.5313123 0.5801165

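The interval runs from about 0.531 to 0.580. It is narrow and excludes zero, so the positive association between TMDb and IMDb scores is estimated precisely: with 95% confidence, a one-point increase in TMDb score corresponds to an increase of 0.53 to 0.58 points in the expected IMDb score, holding the other predictors fixed.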
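
For readers curious where these bounds come from, the interval can be rebuilt by hand as estimate ± critical t value × standard error. The sketch below uses only numbers copied from the model summary, so it should agree with confint() up to rounding of the printed inputs.

```r
# Values copied from the summary(lm_model) output
estimate  <- 0.5557144   # tmdb_score coefficient
std_error <- 0.0124473   # its standard error
df_resid  <- 5050        # residual degrees of freedom

# Critical t value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)
round(ci, 4)
```

With this many degrees of freedom the critical value is very close to the familiar normal value of 1.96.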

Diagnostic Analysis of the Model

To ensure the model’s reliability, I’ll examine diagnostic plots and statistics. This will allow me to assess any potential issues in the model, such as multicollinearity, non-normality, or heteroscedasticity.

Residuals vs. Fitted Values

I plotted residuals against fitted values to check for homoscedasticity and linearity.

# Plotting Residuals vs Fitted Values to check for homoscedasticity and linearity
plot(lm_model, which = 1, main = "Residuals vs Fitted")
abline(h = 0, col = "red")  # Horizontal reference line at 0
# lowess smoother to visualize trends in the residuals
lines(lowess(lm_model$fitted.values, residuals(lm_model)), col = "blue", lwd = 2)

Residuals vs Fitted Values Interpretation:

The spread of the residuals looks roughly constant across the range of fitted values, which is consistent with homoscedasticity. However, the smoothed curve departs from the zero line, which suggests that the relationship between the response and at least one predictor is not purely linear. This could be addressed by transforming variables, investigating outliers, or adding polynomial terms.
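
One way to test whether a polynomial term is worth adding is to fit the model with and without it and compare the two with a nested-model F test. The sketch below illustrates the idea on simulated data with a known quadratic relationship. The variables x and y are made up; this is not a result from the Netflix model.

```r
set.seed(42)  # simulated data, not the Netflix dataset

n <- 500
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x - 0.04 * x^2 + rnorm(n, sd = 0.5)  # true curve is quadratic
sim <- data.frame(x = x, y = y)

linear_fit    <- lm(y ~ x, data = sim)
quadratic_fit <- lm(y ~ x + I(x^2), data = sim)

# Nested-model F test: does the squared term improve the fit?
comparison <- anova(linear_fit, quadratic_fit)
print(comparison)

# Adjusted R-squared for each model
c(linear    = summary(linear_fit)$adj.r.squared,
  quadratic = summary(quadratic_fit)$adj.r.squared)
```

A small p-value in the anova table indicates that the squared term explains variation the straight-line model misses.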

Checking for Normality

To validate the normality assumption, I’ll use a Q-Q plot.

# Q-Q plot for normality check
plot(lm_model, which = 2)

Q-Q Plot Interpretation: Slight deviations from the line suggest minor non-normality, particularly at the tails. With more than 5,000 observations, the coefficient estimates and their confidence intervals are fairly robust to this kind of departure, so it is generally acceptable here. Severe deviations could still affect confidence intervals, but the overall distribution is reasonably close to normal.

Residual Distribution

For a clearer view of the residual spread, I plotted a histogram.

# Histogram of residuals
hist(residuals(lm_model), breaks = 30, main = "Distribution of Residuals", xlab = "Residuals")

Interpretation: The residuals are nearly normally distributed with slight skewness, which falls within acceptable limits for large datasets.

Multicollinearity Check

Using the Variance Inflation Factor (VIF) will help me detect multicollinearity, which can inflate the variances of the estimated regression coefficients.

# Checking VIF to detect multicollinearity
library(car)  # provides vif(), crPlots(), and influencePlot()
vif(lm_model)

##      runtime   tmdb_score release_year      is_show 
##     2.559091     1.241384     1.047751     2.784495

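VIF Interpretation: All VIF values are below 5, a common rule-of-thumb cutoff, indicating no concerning multicollinearity among predictors. The values for runtime (2.56) and is_show (2.78) are the highest, likely because shows tend to have shorter runtimes than movies, so the two predictors overlap to some degree. The overlap is mild enough that the coefficients remain stable and interpretable.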
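
To show what a VIF actually measures, the sketch below computes it by hand as 1 / (1 - R²), where R² comes from regressing one predictor on all the others. It uses simulated predictors (x1, x2, x3 are hypothetical), with x2 deliberately built to correlate with x1, rather than the Netflix variables.

```r
set.seed(1)  # simulated predictors, not the Netflix dataset

n  <- 1000
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.7)  # deliberately correlated with x1
x3 <- rnorm(n)                       # independent of the others
predictors <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
manual_vif <- sapply(names(predictors), function(name) {
  others <- setdiff(names(predictors), name)
  fit <- lm(reformulate(others, response = name), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})

round(manual_vif, 2)
```

The correlated pair ends up with noticeably higher values than the independent predictor, whose VIF stays close to 1.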

VIF Bar Plot:

# Bar plot of the VIF values
vif_values <- vif(lm_model)
barplot(vif_values, main = "Variance Inflation Factor", col = "lightblue", ylim = c(0, 5))

The bar plot shows the same values at a glance: every bar stays well under the cutoff of 5, with runtime and is_show the tallest.

Residuals vs. Leverage Plot:

plot(lm_model, which = 5, main = "Residuals vs Leverage")

The Residuals vs. Leverage Plot helps identify points with both high residuals and leverage, confirming which observations may be highly influential.

Partial Residual Plots:

crPlots(lm_model)

Partial residual plots provide a visual confirmation of the linear relationship between each predictor and IMDb scores, while adjusting for the other predictors. Any non-linear patterns here might suggest the need for polynomial terms.

Outlier Detection Using Cook’s Distance

To identify any influential points that might unduly affect the model, I’ll examine Cook’s Distance.

# Cook's Distance plot
plot(lm_model, which = 4)

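Cook’s Distance Interpretation: Points 2888 and 4705 have the largest Cook’s Distance values (about 0.03 to 0.04). These are far below the conventional warning level of 1 but well above the 4/n rule of thumb (4/5055 ≈ 0.0008), suggesting they influence the fit more than typical observations. They should be investigated to confirm they are legitimate observations and not data entry errors.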
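
Beyond reading the plot, influential points can be listed programmatically by comparing cooks.distance() against a cutoff. The sketch below does this on simulated data with one deliberately planted unusual observation. The 4/n cutoff is one common rule of thumb, not the only choice, and the data are made up rather than taken from the Netflix model.

```r
set.seed(7)  # simulated data, not the Netflix dataset

n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# Plant one unusual observation: extreme x with a y far off the trend
x[n] <- 6
y[n] <- -10

fit <- lm(y ~ x)
cooks_d <- cooks.distance(fit)

# Rule-of-thumb cutoff; other cutoffs (0.5, 1) are also in common use
cutoff  <- 4 / n
flagged <- which(cooks_d > cutoff)

flagged_table <- data.frame(obs = flagged, cooks_d = round(cooks_d[flagged], 3))
print(flagged_table)
```

The planted observation is flagged and has the largest distance, which is the pattern to look for when deciding which rows to inspect first.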

Influence Plot Interpretation

The Influence Plot helps identify high-leverage points with significant residuals, potentially affecting the model.

# Influence plot
influencePlot(lm_model)

##         StudRes          Hat        CookD
## 24   -0.4557303 0.0149152196 0.0006290281
## 28   -0.4835200 0.0145713907 0.0006915135
## 2888  4.0818402 0.0094111939 0.0315608367
## 4595 -5.9870117 0.0008972439 0.0063938768
## 4705  4.5819363 0.0086923165 0.0366723768
## 4835 -6.1528922 0.0010205767 0.0076792619

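Data Points 2888 and 4705: Large positive studentized residuals (4.08 and 4.58) combined with above-average leverage give these points the largest Cook’s Distance values (0.032 and 0.037). They exert more influence than any other observation, though not enough to distort the model severely.
Data Points 4595 and 4835: The largest residuals in absolute terms (about -6), meaning the model strongly overpredicts these titles. Their leverage is low, however, so their Cook’s Distance values stay small.
Data Points 24 and 28: The highest leverage in the table (hat values near 0.015) but small residuals, so they have minimal impact on the model fit.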
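
The three columns of the influence table are linked: Cook's Distance combines the size of the residual with the leverage. The sketch below recomputes it from the StudRes and Hat columns, assuming StudRes is the externally studentized residual (as returned by rstudent()), and also expresses each hat value as a multiple of the average leverage p/n. All numbers are copied from the output above; the column names in infl are hypothetical.

```r
# Values copied from the influencePlot(lm_model) output
infl <- data.frame(
  obs      = c(24, 28, 2888, 4595, 4705, 4835),
  stud_res = c(-0.4557303, -0.4835200, 4.0818402,
               -5.9870117, 4.5819363, -6.1528922),
  hat      = c(0.0149152196, 0.0145713907, 0.0094111939,
               0.0008972439, 0.0086923165, 0.0010205767),
  cook_d   = c(0.0006290281, 0.0006915135, 0.0315608367,
               0.0063938768, 0.0366723768, 0.0076792619)
)

p <- 5     # estimated coefficients, including the intercept
n <- 5055  # observations used in the fit (5050 residual df + 5)

# Leverage relative to the average hat value p / n
infl$hat_ratio <- infl$hat / (p / n)

# Convert externally studentized residuals to internally standardized ones,
# then apply Cook's D = (r^2 / p) * h / (1 - h)
r_sq <- infl$stud_res^2 * (n - p) / (n - p - 1 + infl$stud_res^2)
infl$cook_d_check <- (r_sq / p) * infl$hat / (1 - infl$hat)

print(round(infl, 4))
```

The hat_ratio column shows which points have far more leverage than average and which are close to average, which explains why a large residual alone does not produce a large Cook's Distance.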

Summary and Insights

This linear model provides several insights:

Variable Impact: Each predictor is statistically significant. tmdb_score, runtime, and is_show are positively associated with IMDb score, while release_year has a slight negative trend. Together they explain about 38% of the variance in IMDb scores (R-squared = 0.379).
Model Diagnostics: The residuals vs. fitted plot hints at some non-linearity, and the Cook’s Distance and influence plots flag a few points that may slightly influence the model, but no severe violations of assumptions were identified.
Coefficient Confidence: The 95% confidence interval for tmdb_score (0.531 to 0.580) is narrow, indicating that this coefficient is estimated precisely.

Further Investigation

Potential areas to explore include:

Non-linear Terms: Adding polynomial terms might enhance model fit if the linearity assumption is weak.
Data Transformation: To address any remaining heteroscedasticity, transformation techniques could be employed.
Validation: Testing the model on a hold-out set would provide insights into generalization performance.
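
As a rough illustration of the validation idea, the sketch below splits a data set into training and hold-out portions, fits the model on the training rows only, and compares root mean squared error (RMSE) on both. The data are simulated (x1, x2, and y are hypothetical), and the 80/20 split is an arbitrary but common choice.

```r
set.seed(123)  # simulated data, not the Netflix dataset

n <- 1000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 0.5 * sim$x1 - 0.3 * sim$x2 + rnorm(n)

# Hold out 20% of rows for testing
test_idx   <- sample(seq_len(n), size = 0.2 * n)
train_data <- sim[-test_idx, ]
test_data  <- sim[test_idx, ]

# Fit on the training rows only
fit <- lm(y ~ x1 + x2, data = train_data)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

train_rmse <- rmse(train_data$y, fitted(fit))
test_rmse  <- rmse(test_data$y, predict(fit, newdata = test_data))

round(c(train = train_rmse, test = test_rmse), 3)
```

A hold-out RMSE close to the training RMSE suggests the model generalizes; a much larger hold-out value would point to overfitting.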

This refined analysis enhances our understanding of the Netflix dataset, especially factors influencing IMDb scores, and sets the stage for future model improvements.