Netflix Data Dive

Load the Netflix dataset:

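# Packages used below: dplyr, tidyr, and ggplot2 (via tidyverse), plus broom for tidy()
library(tidyverse)
library(broom)

netflix_data <- read.csv("~/Netflix_dataset.csv")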

# Preview structure and initial summary
str(netflix_data)

## 'data.frame':    5806 obs. of  15 variables:
##  $ id                  : chr  "ts300399" "tm84618" "tm127384" "tm70993" ...
##  $ title               : chr  "Five Came Back: The Reference Films" "Taxi Driver" "Monty Python and the Holy Grail" "Life of Brian" ...
##  $ type                : chr  "SHOW" "MOVIE" "MOVIE" "MOVIE" ...
##  $ description         : chr  "This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discu"| __truncated__ "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ "Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as"| __truncated__ ...
##  $ release_year        : int  1945 1976 1975 1979 1973 1969 1971 1964 1980 1967 ...
##  $ age_certification   : chr  "TV-MA" "R" "PG" "R" ...
##  $ runtime             : int  48 113 91 94 133 30 102 170 104 110 ...
##  $ genres              : chr  "['documentation']" "['crime', 'drama']" "['comedy', 'fantasy']" "['comedy']" ...
##  $ production_countries: chr  "['US']" "['US']" "['GB']" "['GB']" ...
##  $ seasons             : num  1 NA NA NA NA 4 NA NA NA NA ...
##  $ imdb_id             : chr  "" "tt0075314" "tt0071853" "tt0079470" ...
##  $ imdb_score          : num  NA 8.3 8.2 8 8.1 8.8 7.7 7.8 5.8 7.7 ...
##  $ imdb_votes          : num  NA 795222 530877 392419 391942 ...
##  $ tmdb_popularity     : num  0.6 27.6 18.2 17.5 95.3 ...
##  $ tmdb_score          : num  NA 8.2 7.8 7.8 7.7 8.3 7.5 7.6 6.2 7.5 ...

summary(netflix_data)

##       id               title               type           description       
##  Length:5806        Length:5806        Length:5806        Length:5806       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   release_year  age_certification     runtime          genres         
##  Min.   :1945   Length:5806        Min.   :  0.00   Length:5806       
##  1st Qu.:2015   Class :character   1st Qu.: 44.00   Class :character  
##  Median :2018   Mode  :character   Median : 84.00   Mode  :character  
##  Mean   :2016                      Mean   : 77.64                     
##  3rd Qu.:2020                      3rd Qu.:105.00                     
##  Max.   :2022                      Max.   :251.00                     
##                                                                       
##  production_countries    seasons         imdb_id            imdb_score   
##  Length:5806          Min.   : 1.000   Length:5806        Min.   :1.500  
##  Class :character     1st Qu.: 1.000   Class :character   1st Qu.:5.800  
##  Mode  :character     Median : 1.000   Mode  :character   Median :6.600  
##                       Mean   : 2.166                      Mean   :6.533  
##                       3rd Qu.: 2.000                      3rd Qu.:7.400  
##                       Max.   :42.000                      Max.   :9.600  
##                       NA's   :3759                        NA's   :523    
##    imdb_votes      tmdb_popularity       tmdb_score    
##  Min.   :      5   Min.   :   0.0094   Min.   : 0.500  
##  1st Qu.:    521   1st Qu.:   3.1553   1st Qu.: 6.100  
##  Median :   2279   Median :   7.4780   Median : 6.900  
##  Mean   :  23407   Mean   :  22.5257   Mean   : 6.818  
##  3rd Qu.:  10144   3rd Qu.:  17.7757   3rd Qu.: 7.500  
##  Max.   :2268288   Max.   :1823.3740   Max.   :10.000  
##  NA's   :539       NA's   :94          NA's   :318

Selecting and Preparing a Binary Variable

In this analysis, I’ll use the type column, which specifies whether a title is a “SHOW” or a “MOVIE.” I’ll convert this into a binary variable called is_show, where 1 represents a “SHOW” and 0 represents a “MOVIE.”

# Create binary column
netflix_data <- netflix_data %>%
  drop_na(imdb_score, tmdb_score, runtime, type) %>%
  mutate(is_show = ifelse(type == "SHOW", 1, 0))

# Previewing the dataset with the new binary variable
table(netflix_data$is_show)

## 
##    0    1 
## 3269 1786

Explanation: Choosing is_show as the response variable is meaningful because it allows us to explore the factors influencing the likelihood of a title being a show versus a movie. This information could be helpful for understanding the structural characteristics associated with each type of content.
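Before modeling, it can help to check how balanced the two classes are. The sketch below only reuses the counts printed by table() above; the object names (counts, share) are mine and not part of the original analysis.

```r
# Counts copied from the table(netflix_data$is_show) output above
counts <- c(movie = 3269, show = 1786)

# Share of each class among the rows kept after drop_na()
share <- counts / sum(counts)
print(round(share, 3))

# Accuracy of a naive rule that always predicts the majority class
baseline_accuracy <- max(share)
print(round(baseline_accuracy, 3))
```

With the data frame loaded, prop.table(table(netflix_data$is_show)) should give the same shares directly.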

Building a Logistic Regression Model

Now, I’ll fit a logistic regression model where is_show is the response variable. For explanatory variables, I’ll use:

runtime: The length of the content in minutes, as longer runtimes could indicate movies.
tmdb_score: The TMDb score, as shows and movies may score differently.
release_year: Year of release, since trends in content type might evolve over time.

# Logistic regression model
logistic_model <- glm(is_show ~ runtime + tmdb_score + release_year, 
                      data = netflix_data, 
                      family = binomial)

# Summary of the logistic model
summary(logistic_model)

## 
## Call:
## glm(formula = is_show ~ runtime + tmdb_score + release_year, 
##     family = binomial, data = netflix_data)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -76.817200  24.797424  -3.098  0.00195 ** 
## runtime       -0.107090   0.003462 -30.932  < 2e-16 ***
## tmdb_score     0.985397   0.063577  15.499  < 2e-16 ***
## release_year   0.037887   0.012294   3.082  0.00206 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6566.2  on 5054  degrees of freedom
## Residual deviance: 1808.8  on 5051  degrees of freedom
## AIC: 1816.8
## 
## Number of Fisher Scoring iterations: 7

Explanation: The choice of predictors here aims to uncover how content features (like runtime and score) and trends over time (release year) impact the probability of content being categorized as a “SHOW.” The logistic regression approach is suitable for modeling binary outcomes and estimating the probability of a title being a show.
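To make the “estimating the probability” step concrete, here is a minimal sketch that turns the printed coefficients into predicted probabilities with the inverse logit (plogis). The helper prob_show() and the two example titles are hypothetical illustrations, and the coefficients are the rounded values from the summary, so results are approximate.

```r
# Coefficients copied from summary(logistic_model) above (rounded as printed)
b <- c(intercept = -76.817200, runtime = -0.107090,
       tmdb_score = 0.985397, release_year = 0.037887)

# Hypothetical helper: linear predictor on the log-odds scale, then inverse logit
prob_show <- function(runtime, tmdb_score, release_year) {
  eta <- b[["intercept"]] +
    b[["runtime"]] * runtime +
    b[["tmdb_score"]] * tmdb_score +
    b[["release_year"]] * release_year
  plogis(eta)
}

# Two made-up titles that differ only in runtime
p_short <- prob_show(runtime = 45, tmdb_score = 7, release_year = 2018)
p_long <- prob_show(runtime = 100, tmdb_score = 7, release_year = 2018)
print(c(short = p_short, long = p_long))
```

With the fitted model in memory, predict(logistic_model, newdata = ..., type = "response") should return the same probabilities without hard-coding the coefficients.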

Visualizing Relationships of Predictors with Response Variable

Plotting the relationships between each predictor and is_show can provide insights into the nature of these associations.

# Scatter plot of runtime by is_show
ggplot(netflix_data, aes(x = runtime, y = is_show)) +
  geom_jitter(height = 0.05, alpha = 0.3) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), color = "blue") +
  labs(title = "Probability of Title Being a Show vs. Runtime")

## `geom_smooth()` using formula = 'y ~ x'

# Scatter plot of tmdb_score by is_show
ggplot(netflix_data, aes(x = tmdb_score, y = is_show)) +
  geom_jitter(height = 0.05, alpha = 0.3) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), color = "blue") +
  labs(title = "Probability of Title Being a Show vs. TMDb Score")

## `geom_smooth()` using formula = 'y ~ x'

# Scatter plot of release_year by is_show
ggplot(netflix_data, aes(x = release_year, y = is_show)) +
  geom_jitter(height = 0.05, alpha = 0.3) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), color = "blue") +
  labs(title = "Probability of Title Being a Show vs. Release Year")

## `geom_smooth()` using formula = 'y ~ x'

Interpretation of Visuals:

Runtime: A decreasing trend as runtime increases suggests movies typically have longer runtimes.
TMDb Score: An increasing probability indicates that shows may score higher on TMDb, perhaps due to episodic content or audience engagement differences.
Release Year: A positive trend suggests newer content is more likely to be categorized as shows, reflecting potential shifts in production preferences.
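One way to sanity-check a smoothed logistic curve is to compare it with the raw proportion of shows within bins of the predictor. The sketch below uses simulated data rather than the Netflix dataset; the simulated relationship, the 30-minute bin width, and all object names are assumptions for illustration only.

```r
set.seed(1)
n <- 1000

# Simulated runtimes in minutes and a simulated binary outcome (not real data)
runtime <- round(runif(n, 10, 180))
is_show <- rbinom(n, 1, plogis(4 - 0.06 * runtime))

# Group runtimes into 30-minute bins and compute the share of shows per bin
bins <- cut(runtime, breaks = seq(0, 180, by = 30))
prop_show <- tapply(is_show, bins, mean)
print(round(prop_show, 2))
```

If the binned proportions fall steadily as runtime increases, they agree with the decreasing curve that geom_smooth() draws.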

Interpreting the Coefficients of the Logistic Model

Examining the model summary provides coefficient estimates, which can be interpreted in terms of log odds.

Intercept: The intercept (-76.8) is the log odds of a title being a show when runtime, tmdb_score, and release_year are all zero. Because a release year of zero lies far outside the observed data, the intercept has no practical interpretation on its own; it only anchors the linear predictor.
runtime: The coefficient (-0.107) is the change in the log odds of is_show for each additional minute of runtime, holding the other predictors fixed. The negative sign means longer runtimes are associated with movies.
tmdb_score: The coefficient (0.985) is the change in the log odds of is_show for each one-point increase in TMDb score, holding the other predictors fixed. The positive sign means higher-scoring titles are more likely to be shows.
release_year: The coefficient (0.038) is the change in the log odds of is_show for each one-year increase in release year, holding the other predictors fixed. The positive sign points to a modest trend toward shows among more recent titles.

# Extracting coefficients
tidy(logistic_model)

## # A tibble: 4 × 5
##   term         estimate std.error statistic   p.value
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  -76.8     24.8         -3.10 1.95e-  3
## 2 runtime       -0.107    0.00346    -30.9  4.42e-210
## 3 tmdb_score     0.985    0.0636      15.5  3.51e- 54
## 4 release_year   0.0379   0.0123       3.08 2.06e-  3

Explanation: These coefficients show which features are significantly associated with a title being categorized as a “SHOW”; all three predictors have p-values below 0.01. Each coefficient is on the log-odds scale and can be exponentiated to an odds ratio for easier interpretation. The negative runtime coefficient indicates that, holding the other predictors fixed, longer titles are more likely to be movies, while the positive tmdb_score coefficient indicates that higher-scoring titles are more likely to be shows.
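The conversion from log odds to odds ratios is just exponentiation. A minimal sketch, using the rounded estimates from the tidy() output rather than the fitted object (the names est and odds_ratio are mine):

```r
# Estimates copied from the coefficient table above (log-odds scale)
est <- c(runtime = -0.107090, tmdb_score = 0.985397, release_year = 0.037887)

# Odds ratio per one-unit increase in each predictor
odds_ratio <- exp(est)
print(round(odds_ratio, 3))

# Multiplicative change in the odds for ten extra minutes of runtime
or_10_min <- exp(10 * est[["runtime"]])
print(round(or_10_min, 3))
```

An odds ratio below 1 means the odds of being a show shrink as the predictor increases; above 1 means they grow. With the model in memory, tidy(logistic_model, exponentiate = TRUE) reports the estimates on the odds-ratio scale directly.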

Constructing a Confidence Interval for a Coefficient

To quantify the uncertainty around the runtime coefficient, I’ll construct a 95% confidence interval.

# Calculate confidence intervals for the coefficients
confint(logistic_model, parm = "runtime", level = 0.95)

## Waiting for profiling to be done...

##      2.5 %     97.5 % 
## -0.1140880 -0.1005069

Explanation: This interval provides a range within which we can be 95% confident the true effect of runtime on the log odds of a title being a show lies. A CI that does not include zero indicates that runtime is a statistically significant predictor in determining whether a title is a show or movie.

Interpreting the CI: The interval for the runtime coefficient, roughly (-0.114, -0.101), is entirely negative. This confirms a significant negative relationship between runtime and the log odds of being a show: longer runtimes are characteristic of movies. Had the interval included zero, runtime would not be a reliable predictor of content type.
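For comparison, a Wald interval can be computed by hand from the estimate and standard error in the model summary. Note that confint() above reports a profile-likelihood interval, so the two are not expected to match exactly; the sketch below is a rough cross-check, and the object names are mine.

```r
# Estimate and standard error for runtime, copied from the model summary
est <- -0.107090
se <- 0.003462

# Wald 95% interval: estimate +/- z * SE on the log-odds scale
z <- qnorm(0.975)
wald_ci <- c(lower = est - z * se, upper = est + z * se)
print(round(wald_ci, 4))

# Profile-likelihood interval reported by confint() above
profile_ci <- c(lower = -0.1140880, upper = -0.1005069)

# The same interval expressed as odds ratios per extra minute of runtime
or_ci <- exp(profile_ci)
print(round(or_ci, 3))
```

Both intervals lie entirely below zero on the log-odds scale, which corresponds to an odds-ratio interval entirely below 1.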

Summary and Insights

This GLM analysis with logistic regression allowed me to:

Identify meaningful predictors for whether a title is a show or movie.
Estimate the impact of runtime, TMDb score, and release year on the likelihood of a title being a show.
Evaluate coefficient significance and construct confidence intervals to quantify uncertainty in predictor effects.

Future Considerations:

Interactions and Nonlinear Terms: Examining interactions between runtime and tmdb_score could reveal deeper insights into how these features collectively impact content type.

Inclusion of Additional Categorical Features: Including age_certification or other categorical variables could add depth, potentially revealing whether audience-targeted trends further impact classification.
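Both extensions can be expressed directly in the model formula. The sketch below fits a main-effects model and an extended model on simulated data (not the Netflix dataset) and compares them with a likelihood-ratio test; the simulated values, effect sizes, and object names are all assumptions for illustration.

```r
set.seed(42)
n <- 800

# Simulated stand-in for the cleaned data frame (not real Netflix data)
sim <- data.frame(
  runtime = round(runif(n, 20, 150)),
  tmdb_score = round(runif(n, 3, 9), 1),
  release_year = sample(2000:2022, n, replace = TRUE),
  age_certification = factor(sample(c("PG", "R", "TV-14", "TV-MA"), n, replace = TRUE))
)

# Outcome generated from an assumed main-effects relationship
eta <- 1 - 0.04 * sim$runtime + 0.4 * sim$tmdb_score
sim$is_show <- rbinom(n, 1, plogis(eta))

# Baseline model with the same structure as the one fitted above
base_fit <- glm(is_show ~ runtime + tmdb_score + release_year,
                data = sim, family = binomial)

# Extended model: runtime x tmdb_score interaction plus a categorical predictor
ext_fit <- glm(is_show ~ runtime * tmdb_score + release_year + age_certification,
               data = sim, family = binomial)

# Likelihood-ratio test of the added terms (the models are nested)
print(anova(base_fit, ext_fit, test = "Chisq"))
```

In the formula, runtime * tmdb_score expands to both main effects plus their interaction, and a factor such as age_certification is expanded into indicator columns automatically.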

This GLM analysis shows how structural features (runtime, TMDb score) and release year relate to whether Netflix content is categorized as a show or a movie, which can inform questions about content strategy and audience engagement.

Netflix Data Dive - GLMs

Junaid Ahmed Mohammed

2024-11-03