First, I’ll load the Netflix data and perform a basic inspection to understand the data structure and identify any missing values. Then, I’ll create a binary column, is_show, to indicate whether each title is a show or a movie.
# Load the Netflix dataset
netflix_data <- read.csv("~/Netflix_dataset.csv")
# Check the structure of the dataset
str(netflix_data)
## 'data.frame': 5806 obs. of 15 variables:
## $ id : chr "ts300399" "tm84618" "tm127384" "tm70993" ...
## $ title : chr "Five Came Back: The Reference Films" "Taxi Driver" "Monty Python and the Holy Grail" "Life of Brian" ...
## $ type : chr "SHOW" "MOVIE" "MOVIE" "MOVIE" ...
## $ description : chr "This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discu"| __truncated__ "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ "Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as"| __truncated__ ...
## $ release_year : int 1945 1976 1975 1979 1973 1969 1971 1964 1980 1967 ...
## $ age_certification : chr "TV-MA" "R" "PG" "R" ...
## $ runtime : int 48 113 91 94 133 30 102 170 104 110 ...
## $ genres : chr "['documentation']" "['crime', 'drama']" "['comedy', 'fantasy']" "['comedy']" ...
## $ production_countries: chr "['US']" "['US']" "['GB']" "['GB']" ...
## $ seasons : num 1 NA NA NA NA 4 NA NA NA NA ...
## $ imdb_id : chr "" "tt0075314" "tt0071853" "tt0079470" ...
## $ imdb_score : num NA 8.3 8.2 8 8.1 8.8 7.7 7.8 5.8 7.7 ...
## $ imdb_votes : num NA 795222 530877 392419 391942 ...
## $ tmdb_popularity : num 0.6 27.6 18.2 17.5 95.3 ...
## $ tmdb_score : num NA 8.2 7.8 7.8 7.7 8.3 7.5 7.6 6.2 7.5 ...
# Summarize the data to see missing values
summary(netflix_data)
## id title type description
## Length:5806 Length:5806 Length:5806 Length:5806
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## release_year age_certification runtime genres
## Min. :1945 Length:5806 Min. : 0.00 Length:5806
## 1st Qu.:2015 Class :character 1st Qu.: 44.00 Class :character
## Median :2018 Mode :character Median : 84.00 Mode :character
## Mean :2016 Mean : 77.64
## 3rd Qu.:2020 3rd Qu.:105.00
## Max. :2022 Max. :251.00
##
## production_countries seasons imdb_id imdb_score
## Length:5806 Min. : 1.000 Length:5806 Min. :1.500
## Class :character 1st Qu.: 1.000 Class :character 1st Qu.:5.800
## Mode :character Median : 1.000 Mode :character Median :6.600
## Mean : 2.166 Mean :6.533
## 3rd Qu.: 2.000 3rd Qu.:7.400
## Max. :42.000 Max. :9.600
## NA's :3759 NA's :523
## imdb_votes tmdb_popularity tmdb_score
## Min. : 5 Min. : 0.0094 Min. : 0.500
## 1st Qu.: 521 1st Qu.: 3.1553 1st Qu.: 6.100
## Median : 2279 Median : 7.4780 Median : 6.900
## Mean : 23407 Mean : 22.5257 Mean : 6.818
## 3rd Qu.: 10144 3rd Qu.: 17.7757 3rd Qu.: 7.500
## Max. :2268288 Max. :1823.3740 Max. :10.000
## NA's :539 NA's :94 NA's :318
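The summary above reports NA counts only for numeric columns, and the str() output shows that imdb_id can hold an empty string rather than NA. A minimal sketch of counting both kinds of missingness per column follows; the `toy` data frame is a made-up stand-in for netflix_data, not the real data.

```r
# Toy data frame standing in for netflix_data (values are illustrative only)
toy <- data.frame(
  imdb_id    = c("", "tt0075314", "tt0071853"),
  imdb_score = c(NA, 8.3, 8.2),
  seasons    = c(1, NA, NA),
  stringsAsFactors = FALSE
)

# Count NA values in every column
na_counts <- colSums(is.na(toy))

# Count empty strings in character columns, which is.na() does not flag
empty_counts <- sapply(toy, function(col) {
  if (is.character(col)) sum(col == "", na.rm = TRUE) else 0L
})

print(na_counts)
print(empty_counts)
```

Running the same two calls on netflix_data would show whether any character columns hide missing values as empty strings.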
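# Data cleaning: drop rows missing any model variable, then flag shows
library(dplyr)  # provides %>% and mutate()
library(tidyr)  # provides drop_na()
netflix_data <- netflix_data %>%
  drop_na(imdb_score, tmdb_score, runtime, release_year, type) %>%
  mutate(is_show = ifelse(type == "SHOW", 1, 0))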
For this analysis, I’m interested in understanding the relationships between various features of Netflix titles and their IMDb scores. My response variable will be imdb_score, and I’ll select a mix of explanatory variables that could logically influence this score:
runtime: This variable measures the length of the content in minutes, with the hypothesis that longer runtime could indicate either higher production value or increased viewer engagement, potentially impacting the IMDb rating.
tmdb_score: TMDb scores are another well-known rating metric, and I suspect there may be a positive correlation between TMDb and IMDb scores.
release_year: This variable represents the year the title was released. Over time, changes in production quality, viewer preferences, and platform strategies could impact IMDb scores.
is_show: A binary variable indicating whether the title is a show or a movie. Shows and movies may inherently receive different types of ratings.
To model IMDb scores, I’ll fit a linear regression on the selected explanatory variables.
# Building the linear regression model
lm_model <- lm(imdb_score ~ runtime + tmdb_score + release_year + is_show, data = netflix_data)
# Viewing the summary of the model
summary(lm_model)
##
## Call:
## lm(formula = imdb_score ~ runtime + tmdb_score + release_year +
## is_show, data = netflix_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5621 -0.4423 0.1196 0.5898 4.1329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.9340894 3.5698361 12.307 < 2e-16 ***
## runtime 0.0042126 0.0005266 7.999 1.54e-15 ***
## tmdb_score 0.5557144 0.0124473 44.645 < 2e-16 ***
## release_year -0.0206805 0.0017667 -11.706 < 2e-16 ***
## is_show 0.4825849 0.0445700 10.828 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9077 on 5050 degrees of freedom
## Multiple R-squared: 0.3789, Adjusted R-squared: 0.3784
## F-statistic: 770.1 on 4 and 5050 DF, p-value: < 2.2e-16
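Intercept: The estimated IMDb score when all predictors are zero. Because zero values are unrealistic here (e.g., zero runtime or a release_year of 0), it serves only as a baseline and has no practical interpretation.
runtime: Positively associated with IMDb score (0.00421 per minute, or about 0.42 points per 100 minutes), suggesting longer content might attract higher ratings when the other predictors are held constant.
tmdb_score: Positive coefficient (0.5557) indicates that TMDb ratings align well with IMDb scores: each additional TMDb point is associated with an IMDb score about 0.56 points higher.
release_year: Slight negative trend (-0.0207 per year, or about 0.2 points per decade), possibly indicating a shift in rating standards or user expectations over time.
is_show: Shows (1) are rated about 0.48 points higher on IMDb than movies with the same runtime, TMDb score, and release year, as reflected by the positive coefficient (0.4826).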
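To make the coefficients concrete, here is a small worked prediction. The coefficients are copied from the summary output above; the title itself (100 minutes, TMDb score 7.0, released in 2018) is hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(lm_model) output
coefs <- c(intercept    = 43.9340894,
           runtime      = 0.0042126,
           tmdb_score   = 0.5557144,
           release_year = -0.0206805,
           is_show      = 0.4825849)

# Hypothetical title: 100-minute movie, TMDb score 7.0, released in 2018
x_movie <- c(1, 100, 7.0, 2018, 0)
pred_movie <- sum(coefs * x_movie)

# The same hypothetical title coded as a show (is_show = 1)
x_show <- c(1, 100, 7.0, 2018, 1)
pred_show <- sum(coefs * x_show)

round(c(movie = pred_movie, show = pred_show), 2)
```

The large intercept is mostly cancelled by the release_year term (about -41.7 for 2018), which is why the intercept alone is not interpretable. The two predictions differ by exactly the is_show coefficient.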
Interpretation of the Coefficient for tmdb_score:
The positive coefficient for tmdb_score indicates that as the TMDb rating increases, the predicted IMDb rating also increases. This is expected, as higher ratings on one platform often correlate with higher ratings on another.
To further interpret the coefficient of tmdb_score, I’ll calculate its confidence interval to evaluate the precision of the estimate.
# Confidence Interval for tmdb_score coefficient
confint(lm_model, "tmdb_score", level = 0.95)
## 2.5 % 97.5 %
## tmdb_score 0.5313123 0.5801165
We can be 95% confident that the true coefficient for tmdb_score lies between about 0.531 and 0.580. The interval is narrow and excludes zero, reinforcing the significance of tmdb_score as a predictor.
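The interval can also be reproduced by hand from the regression table as estimate ± t-critical × standard error. This sketch uses only the numbers printed in the summary output (estimate, standard error, and residual degrees of freedom).

```r
# Values copied from the summary(lm_model) output for tmdb_score
estimate  <- 0.5557144
std_error <- 0.0124473
df_resid  <- 5050

# Two-sided 95% critical value from the t distribution
t_crit <- qt(0.975, df = df_resid)

# Lower and upper bounds of the confidence interval
ci <- estimate + c(-1, 1) * t_crit * std_error
round(ci, 4)
```

With over 5,000 residual degrees of freedom the t critical value is nearly identical to the normal value of 1.96, so the result should agree with confint() to several decimals.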
To ensure the model’s reliability, I’ll examine diagnostic plots and statistics. This will allow me to assess any potential issues in the model, such as multicollinearity, non-normality, or heteroscedasticity.
I plotted residuals against fitted values to check for homoscedasticity and linearity.
# Plotting Residuals vs Fitted Values to check for homoscedasticity and linearity
plot(lm_model, which = 1, main = "Residuals vs Fitted")
abline(h = 0, col = "red") # Horizontal reference line at 0
# Overlay a lowess smoother (note: lowess(), not loess()) to visualize trends in the residuals
lines(lowess(fitted(lm_model), residuals(lm_model)), col = "blue", lwd = 2)
Residuals vs Fitted Values Interpretation:
The spread of the residuals looks roughly constant across the range of fitted values, which is consistent with homoscedasticity. However, the smoothed curve departs from the zero line, suggesting a non-linear relationship between the response and the predictors. This could be addressed by transforming variables, investigating outliers, or adding polynomial terms.
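A visual check can be complemented with a formal test. The sketch below is not part of the original analysis: it computes the studentized (Koenker) form of the Breusch-Pagan test by hand in base R, on simulated data where the error spread deliberately grows with the predictor. All variable names are illustrative.

```r
set.seed(1)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)  # error spread grows with x
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the predictor(s)
aux <- lm(residuals(fit)^2 ~ x)

# Test statistic is n * R^2 of the auxiliary regression,
# chi-squared with df equal to the number of predictors (1 here)
bp_stat <- n * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(statistic = bp_stat, p_value = bp_pval)
```

A small p-value indicates that the residual variance depends on the predictors. Applied to lm_model, the auxiliary regression would use all four predictors and 4 degrees of freedom.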
To validate the normality assumption, I’ll use a Q-Q plot.
# Q-Q plot for normality check
plot(lm_model, which = 2)
Q-Q Plot Interpretation: Slight deviations from the line suggest minor non-normality, particularly at the tails. However, this is generally acceptable for large datasets like ours. Any severe deviations could impact confidence intervals, but here the overall distribution is reasonably close to normal.
For a clearer view of the residual spread, I plotted a histogram.
# Histogram of residuals
hist(residuals(lm_model), breaks = 30, main = "Distribution of Residuals", xlab = "Residuals")
Interpretation: The residuals are nearly normally distributed with slight skewness, which falls within acceptable limits for large datasets.
Using the Variance Inflation Factor (VIF) will help me detect multicollinearity, which can inflate the variances of the estimated regression coefficients.
# Checking VIF to detect multicollinearity
library(car)  # provides vif(), crPlots(), and influencePlot()
vif(lm_model)
## runtime tmdb_score release_year is_show
## 2.559091 1.241384 1.047751 2.784495
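VIF Interpretation: All VIF values are below 5, indicating no concerning multicollinearity among the predictors. The two largest values belong to runtime (2.56) and is_show (2.78), which is unsurprising because show episodes tend to be shorter than movies, but the overlap is not strong enough to destabilize the coefficient estimates.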
VIF Bar Plot:
vif_values <- vif(lm_model)
barplot(vif_values, main = "Variance Inflation Factor", col = "lightblue", ylim = c(0, 5))
The bar plot makes it easy to compare VIF values across predictors at a glance: every bar sits well below the threshold of 5.
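For intuition about what vif() reports, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. This minimal base-R sketch uses simulated predictors (x1, x2, x3 are made up, with x2 deliberately correlated with x1) and a hypothetical helper named vif_manual.

```r
set.seed(42)
n <- 300
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)  # deliberately correlated with x1
x3 <- rnorm(n)                       # independent of the others

# Hypothetical helper: VIF of `target` given a data frame of other predictors
vif_manual <- function(target, others) {
  d <- cbind(target = target, others)
  r2 <- summary(lm(target ~ ., data = d))$r.squared
  1 / (1 - r2)
}

vifs <- c(
  x1 = vif_manual(x1, data.frame(x2, x3)),
  x2 = vif_manual(x2, data.frame(x1, x3)),
  x3 = vif_manual(x3, data.frame(x1, x2))
)
round(vifs, 2)
```

The correlated pair should show clearly higher VIFs than the independent predictor, which stays close to 1.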
Residuals vs. Leverage Plot:
plot(lm_model, which = 5, main = "Residuals vs Leverage")
The Residuals vs. Leverage Plot helps identify points with both high residuals and leverage, confirming which observations may be highly influential.
Partial Residual Plots:
crPlots(lm_model)
Partial residual plots provide a visual confirmation of the linear relationship between each predictor and IMDb scores, while adjusting for the other predictors. Any non-linear patterns here might suggest the need for polynomial terms.
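If a partial residual plot does show curvature, one way to check whether a polynomial term is worthwhile is a nested-model F test. The sketch below uses simulated data with a built-in quadratic effect; it is an illustration of the technique, not a result from the Netflix model.

```r
set.seed(7)
n <- 400
x <- runif(n, 0, 10)
y <- 1 + 0.8 * x - 0.06 * x^2 + rnorm(n, sd = 0.5)  # true relationship is curved

# Straight-line model versus model with an added squared term
fit_linear <- lm(y ~ x)
fit_quad   <- lm(y ~ x + I(x^2))

# F test comparing the nested models
cmp <- anova(fit_linear, fit_quad)
cmp
```

A small p-value in the second row suggests the squared term captures structure that the straight-line model misses. The same comparison could be tried on lm_model by adding, for example, I(runtime^2).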
To identify any influential points that might unduly affect the model, I’ll examine Cook’s Distance.
# Cook's Distance plot
plot(lm_model, which = 4)
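Cook’s Distance Interpretation: Points 2888 and 4705 have the largest Cook’s Distance values in the data (roughly 0.03 to 0.04). These are well above the common 4/n rule of thumb (about 0.0008 for this sample) but far below the 0.5 to 1 range usually taken to signal serious influence. They should be investigated to confirm they are legitimate observations and not data entry errors.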
The Influence Plot helps identify high-leverage points with significant residuals, potentially affecting the model.
# Influence plot
influencePlot(lm_model)
## StudRes Hat CookD
## 24 -0.4557303 0.0149152196 0.0006290281
## 28 -0.4835200 0.0145713907 0.0006915135
## 2888 4.0818402 0.0094111939 0.0315608367
## 4595 -5.9870117 0.0008972439 0.0063938768
## 4705 4.5819363 0.0086923165 0.0366723768
## 4835 -6.1528922 0.0010205767 0.0076792619
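Data Point 2888: A large positive studentized residual (4.08) with the second-largest Cook’s Distance (0.032) marks a potential outlier that might exert some influence on the model.
Data Point 4705: A large positive studentized residual (4.58) and the largest Cook’s Distance (0.037) imply influence, though it does not distort the model severely.
Points 24 and 28: These have the highest leverage (hat values near 0.015) but small residuals, so their Cook’s Distance values are tiny and their impact on the fit is minimal.
Points 4595 and 4835: These have very large negative studentized residuals (about -6.0 and -6.2) but low leverage, so they are clear outliers in IMDb score without much pull on the coefficients.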
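The three quantities in the influence table can be pulled out directly with base R, which makes it easy to list flagged rows instead of reading them off a plot. This sketch uses simulated data with one planted problem point; the 4/n cutoff is a common rule of thumb, not a strict test.

```r
set.seed(3)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
x[1] <- 6    # plant one point with extreme x (high leverage)...
y[1] <- -5   # ...and a response far from the true line
fit <- lm(y ~ x)

cd   <- cooks.distance(fit)  # overall influence
lev  <- hatvalues(fit)       # leverage
stud <- rstudent(fit)        # studentized residuals

# Flag observations above the 4/n rule of thumb
flagged <- which(cd > 4 / n)

# Show the flagged rows with all three measures, largest Cook's distance first
infl <- data.frame(StudRes = stud, Hat = lev, CookD = cd)[flagged, ]
infl[order(-infl$CookD), ]
```

Replacing `fit` with lm_model would produce the same kind of table for the Netflix data, including any rows that influencePlot() did not label.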
This linear model provides several insights:
Variable Impact: Each predictor has a statistically significant association with IMDb scores. tmdb_score, runtime, and is_show have positive coefficients, while release_year has a slight negative trend.
Model Diagnostics: Diagnostic checks, including the Cook’s Distance and influence plots, suggest that a few points may slightly influence the model. The main caveat is the possible non-linearity seen in the Residuals vs Fitted plot; no other major violations of assumptions were identified.
Coefficient Confidence: The confidence interval for tmdb_score is narrow, indicating a precise estimate and reinforcing its association with imdb_score.
Potential areas to explore include:
Non-linear Terms: Adding polynomial terms might enhance model fit if the linearity assumption is weak.
Data Transformation: To address any remaining heteroscedasticity, transformation techniques could be employed.
Validation: Testing the model on a hold-out set would provide insights into generalization performance.
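As a rough illustration of the validation idea, the following sketch performs an 80/20 hold-out split and compares in-sample and out-of-sample RMSE. It runs on simulated data (the `sim` data frame and its columns are made up); the same pattern could be applied to netflix_data with the model formula used earlier.

```r
set.seed(123)
n <- 1000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 0.5 * sim$x1 - 0.3 * sim$x2 + rnorm(n, sd = 1)

# Randomly assign 80% of rows to training, the rest to the hold-out set
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train   <- sim[train_idx, ]
holdout <- sim[-train_idx, ]

# Fit on the training rows only
fit <- lm(y ~ x1 + x2, data = train)

# Compare in-sample and out-of-sample root mean squared error
rmse_train   <- sqrt(mean(residuals(fit)^2))
rmse_holdout <- sqrt(mean((holdout$y - predict(fit, newdata = holdout))^2))

round(c(train = rmse_train, holdout = rmse_holdout), 3)
```

A hold-out RMSE close to the training RMSE suggests the model generalizes; a much larger hold-out error would point to overfitting.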
This refined analysis enhances our understanding of the Netflix dataset, especially factors influencing IMDb scores, and sets the stage for future model improvements.