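Load the Netflix dataset:
# Packages used below: tidyverse (dplyr, tidyr, ggplot2) and broom for tidy()
library(tidyverse)
library(broom)
netflix_data <- read.csv("~/Netflix_dataset.csv")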
# Preview structure and initial summary
str(netflix_data)
## 'data.frame': 5806 obs. of 15 variables:
## $ id : chr "ts300399" "tm84618" "tm127384" "tm70993" ...
## $ title : chr "Five Came Back: The Reference Films" "Taxi Driver" "Monty Python and the Holy Grail" "Life of Brian" ...
## $ type : chr "SHOW" "MOVIE" "MOVIE" "MOVIE" ...
## $ description : chr "This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discu"| __truncated__ "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ "Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as"| __truncated__ ...
## $ release_year : int 1945 1976 1975 1979 1973 1969 1971 1964 1980 1967 ...
## $ age_certification : chr "TV-MA" "R" "PG" "R" ...
## $ runtime : int 48 113 91 94 133 30 102 170 104 110 ...
## $ genres : chr "['documentation']" "['crime', 'drama']" "['comedy', 'fantasy']" "['comedy']" ...
## $ production_countries: chr "['US']" "['US']" "['GB']" "['GB']" ...
## $ seasons : num 1 NA NA NA NA 4 NA NA NA NA ...
## $ imdb_id : chr "" "tt0075314" "tt0071853" "tt0079470" ...
## $ imdb_score : num NA 8.3 8.2 8 8.1 8.8 7.7 7.8 5.8 7.7 ...
## $ imdb_votes : num NA 795222 530877 392419 391942 ...
## $ tmdb_popularity : num 0.6 27.6 18.2 17.5 95.3 ...
## $ tmdb_score : num NA 8.2 7.8 7.8 7.7 8.3 7.5 7.6 6.2 7.5 ...
summary(netflix_data)
##         id               title               type           description
##  Length:5806        Length:5806        Length:5806        Length:5806
##  Class :character   Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
##
##
##
##
##   release_year  age_certification     runtime           genres
##  Min.   :1945   Length:5806        Min.   :  0.00   Length:5806
##  1st Qu.:2015   Class :character   1st Qu.: 44.00   Class :character
##  Median :2018   Mode  :character   Median : 84.00   Mode  :character
##  Mean   :2016                      Mean   : 77.64
##  3rd Qu.:2020                      3rd Qu.:105.00
##  Max.   :2022                      Max.   :251.00
##
##  production_countries      seasons           imdb_id          imdb_score
##  Length:5806            Min.   : 1.000   Length:5806        Min.   :1.500
##  Class :character       1st Qu.: 1.000   Class :character   1st Qu.:5.800
##  Mode  :character       Median : 1.000   Mode  :character   Median :6.600
##                         Mean   : 2.166                      Mean   :6.533
##                         3rd Qu.: 2.000                      3rd Qu.:7.400
##                         Max.   :42.000                      Max.   :9.600
##                         NA's   :3759                        NA's   :523
##     imdb_votes      tmdb_popularity       tmdb_score
##  Min.   :      5   Min.   :   0.0094   Min.   : 0.500
##  1st Qu.:    521   1st Qu.:   3.1553   1st Qu.: 6.100
##  Median :   2279   Median :   7.4780   Median : 6.900
##  Mean   :  23407   Mean   :  22.5257   Mean   : 6.818
##  3rd Qu.:  10144   3rd Qu.:  17.7757   3rd Qu.: 7.500
##  Max.   :2268288   Max.   :1823.3740   Max.   :10.000
##  NA's   :539       NA's   :94          NA's   :318
Selecting and Preparing a Binary Variable
In this analysis, I’ll use the type column, which specifies whether a title is a “SHOW” or a “MOVIE.” I’ll convert this into a binary variable called is_show, where 1 represents a “SHOW” and 0 represents a “MOVIE.”
# Drop rows with missing values in the modelling columns, then create the binary column
netflix_data <- netflix_data %>%
  drop_na(imdb_score, tmdb_score, runtime, type) %>%
  mutate(is_show = ifelse(type == "SHOW", 1, 0))
# Previewing the dataset with the new binary variable
table(netflix_data$is_show)
##
##    0    1
## 3269 1786
Explanation: Choosing is_show as the response variable is meaningful because it lets us explore which features are associated with a title being a show rather than a movie. This helps characterize the structural differences between the two types of content.
Building a Logistic Regression Model
Now, I’ll fit a logistic regression model where is_show is the response variable. For explanatory variables, I’ll use:
runtime: The length of the content in minutes, as longer runtimes could indicate movies.
tmdb_score: The TMDb score, as shows and movies may score differently.
release_year: Year of release, since trends in content type might evolve over time.
# Logistic regression model
logistic_model <- glm(is_show ~ runtime + tmdb_score + release_year,
                      data = netflix_data,
                      family = binomial)
# Summary of the logistic model
summary(logistic_model)
##
## Call:
## glm(formula = is_show ~ runtime + tmdb_score + release_year,
## family = binomial, data = netflix_data)
##
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -76.817200  24.797424  -3.098  0.00195 **
## runtime       -0.107090   0.003462 -30.932  < 2e-16 ***
## tmdb_score     0.985397   0.063577  15.499  < 2e-16 ***
## release_year   0.037887   0.012294   3.082  0.00206 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6566.2 on 5054 degrees of freedom
## Residual deviance: 1808.8 on 5051 degrees of freedom
## AIC: 1816.8
##
## Number of Fisher Scoring iterations: 7
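Explanation: These predictors were chosen to examine how content features (runtime and score) and trends over time (release year) relate to the probability that a title is a “SHOW.” Logistic regression is suitable here because the outcome is binary and the model estimates that probability directly. In the fitted model all three predictors are significant at the 0.01 level, and the residual deviance (1808.8) is far below the null deviance (6566.2), so the predictors account for much of the variation in content type.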
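To make the link between the coefficients and a probability concrete, here is a minimal sketch that plugs the estimates printed in the model summary into the inverse-logit function. The helper predict_show_prob and the two example titles are hypothetical and exist only for illustration. With the fitted model object available, predict(logistic_model, newdata = ..., type = "response") would be the more usual route.

```r
# Coefficient estimates copied from the summary(logistic_model) output
coefs <- c(intercept = -76.817200, runtime = -0.107090,
           tmdb_score = 0.985397, release_year = 0.037887)

# Hypothetical helper: predicted probability that a title is a show
predict_show_prob <- function(runtime, tmdb_score, release_year) {
  log_odds <- coefs[["intercept"]] +
    coefs[["runtime"]] * runtime +
    coefs[["tmdb_score"]] * tmdb_score +
    coefs[["release_year"]] * release_year
  plogis(log_odds)  # inverse logit: 1 / (1 + exp(-log_odds))
}

# Two made-up titles that differ only in runtime
prob_short <- predict_show_prob(runtime = 45, tmdb_score = 7.5, release_year = 2020)
prob_long  <- predict_show_prob(runtime = 110, tmdb_score = 7.5, release_year = 2020)
print(c(short = prob_short, long = prob_long))
```

Under these estimates the 45-minute title comes out at roughly 0.9 and the 110-minute title below 0.01, which shows how strongly runtime drives the prediction when score and year are held fixed.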
Visualizing Relationships of Predictors with Response Variable
Plotting the relationships between each predictor and is_show can provide insights into the nature of these associations.
# Scatter plot of runtime by is_show
ggplot(netflix_data, aes(x = runtime, y = is_show)) +
  geom_jitter(height = 0.05, alpha = 0.3) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), color = "blue") +
  labs(title = "Probability of Title Being a Show vs. Runtime")
## `geom_smooth()` using formula = 'y ~ x'
# Scatter plot of tmdb_score by is_show
ggplot(netflix_data, aes(x = tmdb_score, y = is_show)) +
  geom_jitter(height = 0.05, alpha = 0.3) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), color = "blue") +
  labs(title = "Probability of Title Being a Show vs. TMDb Score")
## `geom_smooth()` using formula = 'y ~ x'
# Scatter plot of release_year by is_show
ggplot(netflix_data, aes(x = release_year, y = is_show)) +
  geom_jitter(height = 0.05, alpha = 0.3) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), color = "blue") +
  labs(title = "Probability of Title Being a Show vs. Release Year")
## `geom_smooth()` using formula = 'y ~ x'
Interpretation of Visuals:
Runtime: The fitted probability of being a show falls as runtime increases, consistent with movies typically having longer runtimes.
TMDb Score: The fitted probability rises with the score, indicating that shows may score higher on TMDb, perhaps due to episodic content or audience engagement differences.
Release Year: The positive trend suggests newer content is more likely to be a show, reflecting potential shifts in production preferences.
Interpreting the Coefficients of the Logistic Model
Examining the model summary provides coefficient estimates, which are interpreted on the log-odds scale. Each slope describes the effect of one predictor with the other two held constant.
Intercept: The intercept (-76.8) is the log odds of a title being a show when runtime, tmdb_score, and release_year are all zero. Since a release year of zero lies far outside the data, it has no practical interpretation on its own and mainly serves to anchor the other terms.
runtime: The coefficient (-0.107) is the change in the log odds of is_show for each additional minute of runtime. The negative value indicates that longer runtimes are associated with movies.
tmdb_score: The coefficient (0.985) is the change in the log odds for each one-point increase in TMDb score. The positive value indicates that higher-scoring titles are more likely to be shows.
release_year: The coefficient (0.038) is the change in the log odds for each later year of release. The positive value points to a modest temporal trend toward shows.
# Extracting coefficients
tidy(logistic_model)
## # A tibble: 4 × 5
##   term         estimate std.error statistic   p.value
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  -76.8      24.8        -3.10 1.95e-  3
## 2 runtime       -0.107     0.00346   -30.9  4.42e-210
## 3 tmdb_score     0.985     0.0636     15.5  3.51e- 54
## 4 release_year   0.0379    0.0123      3.08 2.06e-  3
Explanation: All three predictors are significantly associated with the likelihood of a title being a “SHOW” (p < 0.01 for each). The coefficients are on the log-odds scale; exponentiating one gives an odds ratio, which is easier to interpret. For example, exp(-0.107) is about 0.90, so each additional minute of runtime multiplies the odds of being a show by roughly 0.90, with TMDb score and release year held constant. The positive tmdb_score coefficient likewise suggests that shows tend to score higher on TMDb.
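The conversion from log odds to odds ratios can be sketched as follows. The estimates are copied from the model output so the snippet runs on its own; the variable names are illustrative only.

```r
# Slope estimates copied from the model output (log-odds scale)
log_odds_coefs <- c(runtime = -0.107090,
                    tmdb_score = 0.985397,
                    release_year = 0.037887)

# Exponentiate to get the odds ratio for a one-unit increase in each predictor
odds_ratios <- exp(log_odds_coefs)

# Percentage change in the odds of being a show per one-unit increase
pct_change <- (odds_ratios - 1) * 100

print(round(odds_ratios, 3))
print(round(pct_change, 1))
```

With the model object in memory, tidy(logistic_model, exponentiate = TRUE) from broom should report the same odds ratios directly.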
Constructing a Confidence Interval for a Coefficient
To quantify the uncertainty around the runtime coefficient, I’ll construct a 95% confidence interval.
# Calculate the 95% confidence interval for the runtime coefficient
confint(logistic_model, parm = "runtime", level = 0.95)
## Waiting for profiling to be done...
##      2.5 %     97.5 %
## -0.1140880 -0.1005069
Explanation: This interval provides a range within which we can be 95% confident the true effect of runtime on the log odds of a title being a show lies. A CI that does not include zero indicates that runtime is a statistically significant predictor of whether a title is a show or a movie.
Interpreting the CI: The interval (-0.114, -0.101) is fully negative, which confirms a significant negative relationship between runtime and the probability of being a show and suggests that longer runtimes are characteristic of movies. Had the interval included zero, runtime could not be considered a reliable predictor of content type.
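As a cross-check on the profile-likelihood interval reported by confint(), a simpler Wald interval (estimate plus or minus z times the standard error) can be computed by hand. This is a different, approximate method rather than the one used in the analysis; the sketch below uses the estimate and standard error printed in the model summary, and its variable names are illustrative.

```r
# Estimate and standard error for runtime, copied from the model summary
estimate  <- -0.107090
std_error <- 0.003462

# Wald interval: estimate +/- z * SE, with z the 97.5% standard normal quantile
z <- qnorm(0.975)
wald_ci <- estimate + c(-1, 1) * z * std_error

# The same interval on the odds-ratio scale
wald_ci_or <- exp(wald_ci)

print(round(wald_ci, 4))
print(round(wald_ci_or, 3))
```

The Wald limits land very close to the profile-likelihood limits (-0.1141, -0.1005), as expected with a sample of about 5,000 titles. With the fitted model available, confint.default(logistic_model) should give the Wald version for all coefficients.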
Summary and Insights
This GLM analysis with logistic regression allowed me to:
Identify meaningful predictors for whether a title is a show or movie.
Estimate the impact of runtime, TMDb score, and release year on the likelihood of a title being a show.
Evaluate coefficient significance and construct confidence intervals to quantify uncertainty in predictor effects.
Future Considerations:
Adding an interaction between runtime and tmdb_score could reveal deeper insights into how these features collectively impact content type.
Including age_certification or other categorical variables could add depth, potentially revealing if audience-targeted trends further impact classification.
This GLM analysis offers a clear view of the structural and temporal factors affecting whether Netflix content is categorized as a show or movie, providing insights valuable for Netflix content strategy and audience engagement.
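The interaction idea from the future considerations could be explored along the following lines. This is only a sketch on simulated data, since the fitted interaction model is not part of this analysis: the data frame sim, its generating coefficients, and the model names are all made up for illustration.

```r
set.seed(42)

# Simulated stand-in for the Netflix data (values are invented, not real titles)
n <- 500
sim <- data.frame(
  runtime    = round(runif(n, 20, 180)),
  tmdb_score = round(runif(n, 3, 9), 1)
)
true_log_odds <- 4 - 0.08 * sim$runtime + 0.4 * sim$tmdb_score
sim$is_show <- rbinom(n, 1, plogis(true_log_odds))

# Main-effects model versus a model with a runtime x tmdb_score interaction
main_model <- glm(is_show ~ runtime + tmdb_score,
                  data = sim, family = binomial)
interaction_model <- glm(is_show ~ runtime * tmdb_score,
                         data = sim, family = binomial)

# Likelihood-ratio test: does the interaction term improve the fit?
comparison <- anova(main_model, interaction_model, test = "Chisq")
print(comparison)
```

On the real data, the same comparison would use netflix_data and include release_year in both formulas; a small p-value in the second row would suggest that the effect of runtime depends on the TMDb score.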