Netflix Data Dive - Regression Modeling

Load the Netflix Dataset

I’ll first load the Netflix dataset.

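# Load packages: dplyr (%>%), tidyr (drop_na), and ggplot2 (plots)
library(tidyverse)

# Load the Netflix dataset
netflix_data <- read.csv("~/Netflix_dataset.csv")

# Remove rows with missing values in the variables used below
netflix_data <- netflix_data %>%
  drop_na(imdb_score, tmdb_score, type)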

# First, inspect the dataset
str(netflix_data)

## 'data.frame':    5055 obs. of  15 variables:
##  $ id                  : chr  "tm84618" "tm127384" "tm70993" "tm190788" ...
##  $ title               : chr  "Taxi Driver" "Monty Python and the Holy Grail" "Life of Brian" "The Exorcist" ...
##  $ type                : chr  "MOVIE" "MOVIE" "MOVIE" "MOVIE" ...
##  $ description         : chr  "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ "Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as"| __truncated__ "12-year-old Regan MacNeil begins to adapt an explicit new personality as strange events befall the local area o"| __truncated__ ...
##  $ release_year        : int  1976 1975 1979 1973 1969 1971 1964 1980 1967 1966 ...
##  $ age_certification   : chr  "R" "PG" "R" "R" ...
##  $ runtime             : int  113 91 94 133 30 102 170 104 110 117 ...
##  $ genres              : chr  "['crime', 'drama']" "['comedy', 'fantasy']" "['comedy']" "['horror']" ...
##  $ production_countries: chr  "['US']" "['GB']" "['GB']" "['US']" ...
##  $ seasons             : num  NA NA NA NA 4 NA NA NA NA NA ...
##  $ imdb_id             : chr  "tt0075314" "tt0071853" "tt0079470" "tt0070047" ...
##  $ imdb_score          : num  8.3 8.2 8 8.1 8.8 7.7 7.8 5.8 7.7 7.3 ...
##  $ imdb_votes          : num  795222 530877 392419 391942 72895 ...
##  $ tmdb_popularity     : num  27.6 18.2 17.5 95.3 12.9 ...
##  $ tmdb_score          : num  8.2 7.8 7.8 7.7 8.3 7.5 7.6 6.2 7.5 7.1 ...

# Preview the dataset
head(netflix_data)

##         id                           title  type
## 1  tm84618                     Taxi Driver MOVIE
## 2 tm127384 Monty Python and the Holy Grail MOVIE
## 3  tm70993                   Life of Brian MOVIE
## 4 tm190788                    The Exorcist MOVIE
## 5  ts22164    Monty Python's Flying Circus  SHOW
## 6  tm14873                     Dirty Harry MOVIE
##                                                                                                                                                                                                                                                                                                                                                                                                                                                          description
## 1                                                                                                                                                                                                                                A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived decadence and sleaze feed his urge for violent action, attempting to save a preadolescent prostitute in the process.
## 2                                    King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Robin the Not-Quite-So-Brave-As-Sir-Lancelot and Sir Galahad the Pure. On the way, Arthur battles the Black Knight who, despite having had all his limbs chopped off, insists he can still fight. They reach Camelot, but Arthur decides not  to enter, as "it is a silly place".
## 3 Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as the Messiah. When he's not dodging his followers or being scolded by his shrill mother, the hapless Brian has to contend with the pompous Pontius Pilate and acronym-obsessed members of a separatist movement. Rife with Monty Python's signature absurdity, the tale finds Brian's life paralleling Biblical lore, albeit with many more laughs.
## 4                                                                                                                12-year-old Regan MacNeil begins to adapt an explicit new personality as strange events befall the local area of Georgetown. Her mother becomes torn between science and superstition in a desperate bid to save her daughter, and ultimately turns to her last hope: Father Damien Karras, a troubled priest who is struggling with his own faith.
## 5                                                                                                                                                                                                                                                                                             A British sketch comedy series with the shows being composed of surreality, risqué or innuendo-laden humour, sight gags and observational sketches without punchlines.
## 6                                                                    When a madman dubbed 'Scorpio' terrorizes San Francisco, hard-nosed cop, Harry Callahan – famous for his take-no-prisoners approach to law enforcement – is tasked with hunting down the psychopath. Harry eventually collars Scorpio in the process of rescuing a kidnap victim, only to see him walk on technicalities. Now, the maverick detective is determined to nail the maniac himself.
##   release_year age_certification runtime                          genres
## 1         1976                 R     113              ['crime', 'drama']
## 2         1975                PG      91           ['comedy', 'fantasy']
## 3         1979                 R      94                      ['comedy']
## 4         1973                 R     133                      ['horror']
## 5         1969             TV-14      30          ['comedy', 'european']
## 6         1971                 R     102 ['thriller', 'crime', 'action']
##   production_countries seasons   imdb_id imdb_score imdb_votes tmdb_popularity
## 1               ['US']      NA tt0075314        8.3     795222          27.612
## 2               ['GB']      NA tt0071853        8.2     530877          18.216
## 3               ['GB']      NA tt0079470        8.0     392419          17.505
## 4               ['US']      NA tt0070047        8.1     391942          95.337
## 5               ['GB']       4 tt0063929        8.8      72895          12.919
## 6               ['US']      NA tt0066999        7.7     153463          14.745
##   tmdb_score
## 1        8.2
## 2        7.8
## 3        7.8
## 4        7.7
## 5        8.3
## 6        7.5

Task 1: Selecting Response and Explanatory Variables

Response Variable

For this analysis, I’ve decided that the most valuable variable to focus on is the IMDb score (imdb_score). IMDb ratings are often used by viewers to judge the quality of a movie or show, making it a key metric of interest.

Explanatory Variable

I suspect that the type of content, whether it’s a Movie or a Show (captured in the type column), might influence IMDb scores. So, I will use type as my categorical explanatory variable for the ANOVA test.

# Define response and explanatory variables
response_var <- netflix_data$imdb_score
explanatory_var <- as.factor(netflix_data$type)

ANOVA Test

Null Hypothesis: The null hypothesis (H0) I’ll test is that the mean IMDb score is the same for movies and shows.

To test this, I’ll run an ANOVA using the aov() function in R:

# Run ANOVA test
anova_model <- aov(imdb_score ~ type, data = netflix_data)
summary(anova_model)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## type           1    666   665.9   557.7 <2e-16 ***
## Residuals   5053   6034     1.2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

par(mfrow = c(2, 2))
plot(anova_model)

The p-value is extremely small (<2e-16), so I can reject the null hypothesis at any conventional significance level and conclude that mean IMDb scores differ between movies and shows. Because type has only two levels, this one-way ANOVA is equivalent to a two-sample t-test that assumes equal variances.
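With more than 5,000 titles, a tiny p-value says little about how large the difference is. As a rough sketch, eta squared (the share of total variation attributed to type) can be computed from the sums of squares printed in the ANOVA table. The values below are copied from that rounded output, and the helper name eta_squared is my own, not part of base R.

```r
# Sums of squares copied from the printed ANOVA table (rounded values)
ss_type  <- 666    # between-group sum of squares for `type`
ss_resid <- 6034   # residual (within-group) sum of squares

# Hypothetical helper: eta squared = SS_effect / (SS_effect + SS_residual)
eta_squared <- function(ss_effect, ss_residual) {
  ss_effect / (ss_effect + ss_residual)
}

eta_sq <- eta_squared(ss_type, ss_resid)
print(round(eta_sq, 3))
```

This comes out to roughly 0.10, so content type accounts for about 10% of the variation in IMDb scores: a clear difference, but most of the variation lies within each type.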

Visualization: IMDb Scores by Content Type

In addition to the statistical test, I will visualize the distribution of IMDb scores for movies and shows using a boxplot. This will help to illustrate the differences in IMDb scores between content types.

ggplot(netflix_data, aes(x = type, y = imdb_score, fill = type)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 2, color = "black") +
  labs(title = "IMDb Scores by Content Type", x = "Type of Content", y = "IMDb Score") +
  theme_minimal()

Interpretation

This result suggests that the type of content, whether it’s a movie or a show, is associated with its IMDb score. The ANOVA alone does not show that type causes the difference, since movies and shows also differ in genre mix, release year, and audience. For those interested in analyzing Netflix’s content, this insight could mean that viewers rate movies and shows differently, possibly due to differences in how they consume or perceive these types of content.

Linear Regression

Next, I aim to explore how well a continuous variable, tmdb_score, predicts IMDb scores. Since TMDb scores are another widely used rating metric, I expect them to be a good predictor.

Building the Regression Model

To begin, I will create a simple linear regression model to predict IMDb scores using TMDb scores.

# Define continuous explanatory variable
continuous_var <- netflix_data$tmdb_score

# Build the regression model
regression_model <- lm(imdb_score ~ tmdb_score, data = netflix_data)
summary(regression_model)

## 
## Call:
## lm(formula = imdb_score ~ tmdb_score, data = netflix_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7174 -0.4582  0.1218  0.5970  4.4042 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.50379    0.07918   31.62   <2e-16 ***
## tmdb_score   0.59200    0.01147   51.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9316 on 5053 degrees of freedom
## Multiple R-squared:  0.3454, Adjusted R-squared:  0.3452 
## F-statistic:  2666 on 1 and 5053 DF,  p-value: < 2.2e-16

# Checking assumptions of the model
par(mfrow = c(2, 2))
plot(regression_model)

The p-value for the tmdb_score coefficient is very small (p < 2e-16), and the R-squared value is about 0.345. This means that around 34.5% of the variation in IMDb scores can be explained by TMDb scores, which indicates a moderate positive relationship.
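In a simple linear regression with one predictor, R-squared equals the squared Pearson correlation between the two variables. The sketch below checks that identity on the first six rows shown in the head() output (far too few rows for inference, used only for illustration) and then backs out the correlation implied by the full-data R-squared of 0.3454.

```r
# First six rows of the dataset, copied from the head() output above
# (illustration only; the real model uses all 5,055 rows)
imdb <- c(8.3, 8.2, 8.0, 8.1, 8.8, 7.7)
tmdb <- c(8.2, 7.8, 7.8, 7.7, 8.3, 7.5)

fit <- lm(imdb ~ tmdb)
r_squared_from_model <- summary(fit)$r.squared
r_squared_from_cor   <- cor(imdb, tmdb)^2

# The two quantities agree up to floating-point error
print(all.equal(r_squared_from_model, r_squared_from_cor))

# Correlation implied by the full-data R-squared reported above
# (positive root, because the fitted slope is positive)
implied_r <- sqrt(0.3454)
print(round(implied_r, 2))
```

The implied correlation is about 0.59, which is usually described as moderate rather than strong.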

Visualization: Linear Relationship Between IMDb and TMDb Scores

The relationship between TMDb and IMDb scores can be visualized with a scatter plot and a linear regression line.

# Plot the relationship between tmdb_score and imdb_score
ggplot(netflix_data, aes(x = tmdb_score, y = imdb_score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Linear Regression of IMDb Score vs TMDb Score",
       x = "TMDb Score", y = "IMDb Score")

## `geom_smooth()` using formula = 'y ~ x'

The plot confirms the positive linear relationship between the TMDb score and the IMDb score, as expected.

Interpretation of the Regression Coefficients

In the regression model, the intercept is approximately 2.50, and the slope for tmdb_score is about 0.59. This means that for every 1-point increase in the TMDb score, the predicted IMDb score increases by 0.59 points on average. The intercept represents the predicted IMDb score when the TMDb score is zero, which isn’t practically meaningful because it lies outside the typical range of scores; it simply anchors the fitted line.
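To make the coefficients concrete, here is a small sketch that applies the fitted equation by hand, using the estimates copied from the summary output. The function name predict_imdb and the example TMDb scores are my own choices for illustration; with the fitted model in memory, predict(regression_model, newdata = data.frame(tmdb_score = 7)) should give the same kind of result.

```r
# Coefficient estimates copied from the regression summary above
intercept <- 2.50379
slope     <- 0.59200

# Hypothetical helper: predicted IMDb score for a given TMDb score
predict_imdb <- function(tmdb_score) {
  intercept + slope * tmdb_score
}

# Predicted IMDb scores for three example TMDb scores
example_scores <- c(5, 7, 9)
predictions <- predict_imdb(example_scores)
print(data.frame(tmdb_score = example_scores, predicted_imdb = predictions))
```

A title rated 7 on TMDb is predicted to score about 6.65 on IMDb, though the residual standard error of roughly 0.93 means individual titles can land well away from that value.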

What This Means

TMDb scores are a moderate predictor of IMDb scores: titles with high TMDb scores tend to have high IMDb scores as well. This could be valuable for content creators or analysts at Netflix, as it suggests that TMDb ratings give a useful, though imprecise, indication of how a movie or show is rated on IMDb. However, the R-squared value of 0.345 leaves about 65% of the variation unexplained, so other factors, such as genre, runtime, or age rating, likely play a role in influencing IMDb scores.

Conclusions and Future Investigations

ANOVA Test: The ANOVA test revealed a significant difference in IMDb scores between movies and shows, suggesting that content type plays a role in how audiences rate Netflix content.

Linear Regression: TMDb scores explain about 35% of the variance in IMDb scores, leaving most of it unaccounted for. Further investigations could focus on other potential predictors like genre, age rating, and runtime.

Future Research

In future analyses, I plan to investigate the influence of additional factors such as:

Genre: Does the genre (e.g., comedy, drama) affect IMDb scores?
Runtime: Does a longer runtime correlate with higher IMDb ratings?
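One practical hurdle for the genre question is that the genres column stores each title's genres as a single string such as "['crime', 'drama']". A minimal sketch of one way to split those strings and keep the first listed genre is shown below, using the six values from the head() output. Treating the first listed genre as the "primary" one is my own simplifying assumption, and the variable names are illustrative.

```r
# Genre strings copied from the head() output above
genres_raw <- c("['crime', 'drama']", "['comedy', 'fantasy']", "['comedy']",
                "['horror']", "['comedy', 'european']",
                "['thriller', 'crime', 'action']")

# Strip brackets and quotes, then split on the comma separator
cleaned    <- gsub("\\[|\\]|'", "", genres_raw)
genre_list <- strsplit(cleaned, ", ", fixed = TRUE)

# Assumption: the first listed genre is the title's primary genre
# (titles with an empty list would get NA here)
primary_genre <- vapply(genre_list, function(g) g[1], character(1))
print(primary_genre)

# How many genres each title lists
print(lengths(genre_list))
```

A column like primary_genre could then be converted to a factor and used in the same aov() or lm() calls as type, keeping in mind that rare genres may need to be grouped together.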

These questions can provide further insights into the factors that influence audience ratings on IMDb and help content creators optimize their offerings to cater to viewer preferences.