Load the Netflix Dataset
I’ll first load the Netflix dataset.
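# Load the packages used below (%>% and drop_na() for cleaning, ggplot2 for plots)
library(tidyverse)
# Load the Netflix dataset
netflix_data <- read.csv("~/Netflix_dataset.csv")
# Remove rows with missing values in the columns used for the analysis
netflix_data <- netflix_data %>%
  drop_na(imdb_score, tmdb_score, type)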
# First, inspect the dataset
str(netflix_data)
## 'data.frame': 5055 obs. of 15 variables:
## $ id : chr "tm84618" "tm127384" "tm70993" "tm190788" ...
## $ title : chr "Taxi Driver" "Monty Python and the Holy Grail" "Life of Brian" "The Exorcist" ...
## $ type : chr "MOVIE" "MOVIE" "MOVIE" "MOVIE" ...
## $ description : chr "A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived "| __truncated__ "King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wis"| __truncated__ "Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as"| __truncated__ "12-year-old Regan MacNeil begins to adapt an explicit new personality as strange events befall the local area o"| __truncated__ ...
## $ release_year : int 1976 1975 1979 1973 1969 1971 1964 1980 1967 1966 ...
## $ age_certification : chr "R" "PG" "R" "R" ...
## $ runtime : int 113 91 94 133 30 102 170 104 110 117 ...
## $ genres : chr "['crime', 'drama']" "['comedy', 'fantasy']" "['comedy']" "['horror']" ...
## $ production_countries: chr "['US']" "['GB']" "['GB']" "['US']" ...
## $ seasons : num NA NA NA NA 4 NA NA NA NA NA ...
## $ imdb_id : chr "tt0075314" "tt0071853" "tt0079470" "tt0070047" ...
## $ imdb_score : num 8.3 8.2 8 8.1 8.8 7.7 7.8 5.8 7.7 7.3 ...
## $ imdb_votes : num 795222 530877 392419 391942 72895 ...
## $ tmdb_popularity : num 27.6 18.2 17.5 95.3 12.9 ...
## $ tmdb_score : num 8.2 7.8 7.8 7.7 8.3 7.5 7.6 6.2 7.5 7.1 ...
# Preview the dataset
head(netflix_data)
## id title type
## 1 tm84618 Taxi Driver MOVIE
## 2 tm127384 Monty Python and the Holy Grail MOVIE
## 3 tm70993 Life of Brian MOVIE
## 4 tm190788 The Exorcist MOVIE
## 5 ts22164 Monty Python's Flying Circus SHOW
## 6 tm14873 Dirty Harry MOVIE
## description
## 1 A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived decadence and sleaze feed his urge for violent action, attempting to save a preadolescent prostitute in the process.
## 2 King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Robin the Not-Quite-So-Brave-As-Sir-Lancelot and Sir Galahad the Pure. On the way, Arthur battles the Black Knight who, despite having had all his limbs chopped off, insists he can still fight. They reach Camelot, but Arthur decides not to enter, as "it is a silly place".
## 3 Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as the Messiah. When he's not dodging his followers or being scolded by his shrill mother, the hapless Brian has to contend with the pompous Pontius Pilate and acronym-obsessed members of a separatist movement. Rife with Monty Python's signature absurdity, the tale finds Brian's life paralleling Biblical lore, albeit with many more laughs.
## 4 12-year-old Regan MacNeil begins to adapt an explicit new personality as strange events befall the local area of Georgetown. Her mother becomes torn between science and superstition in a desperate bid to save her daughter, and ultimately turns to her last hope: Father Damien Karras, a troubled priest who is struggling with his own faith.
## 5 A British sketch comedy series with the shows being composed of surreality, risqué or innuendo-laden humour, sight gags and observational sketches without punchlines.
## 6 When a madman dubbed 'Scorpio' terrorizes San Francisco, hard-nosed cop, Harry Callahan – famous for his take-no-prisoners approach to law enforcement – is tasked with hunting down the psychopath. Harry eventually collars Scorpio in the process of rescuing a kidnap victim, only to see him walk on technicalities. Now, the maverick detective is determined to nail the maniac himself.
## release_year age_certification runtime genres
## 1 1976 R 113 ['crime', 'drama']
## 2 1975 PG 91 ['comedy', 'fantasy']
## 3 1979 R 94 ['comedy']
## 4 1973 R 133 ['horror']
## 5 1969 TV-14 30 ['comedy', 'european']
## 6 1971 R 102 ['thriller', 'crime', 'action']
## production_countries seasons imdb_id imdb_score imdb_votes tmdb_popularity
## 1 ['US'] NA tt0075314 8.3 795222 27.612
## 2 ['GB'] NA tt0071853 8.2 530877 18.216
## 3 ['GB'] NA tt0079470 8.0 392419 17.505
## 4 ['US'] NA tt0070047 8.1 391942 95.337
## 5 ['GB'] 4 tt0063929 8.8 72895 12.919
## 6 ['US'] NA tt0066999 7.7 153463 14.745
## tmdb_score
## 1 8.2
## 2 7.8
## 3 7.8
## 4 7.7
## 5 8.3
## 6 7.5
For this analysis, I’ve decided that the most valuable variable to focus on is the IMDb score (imdb_score). IMDb ratings are often used by viewers to judge the quality of a movie or show, making it a key metric of interest.
I suspect that the type of content, whether it’s a Movie or a Show (captured in the type column), might influence IMDb scores. So, I will use type as my categorical explanatory variable for the ANOVA test.
# Define response and explanatory variables
response_var <- netflix_data$imdb_score
explanatory_var <- as.factor(netflix_data$type)
Null Hypothesis: The null hypothesis (H0) I’ll test is that the mean IMDb score is the same for movies and shows.
To test this, I’ll run an ANOVA using the aov() function in R:
# Run ANOVA test
anova_model <- aov(imdb_score ~ type, data = netflix_data)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## type 1 666 665.9 557.7 <2e-16 ***
## Residuals 5053 6034 1.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Check the ANOVA assumptions with the standard diagnostic plots (2 x 2 grid)
par(mfrow = c(2, 2))
plot(anova_model)
The p-value is extremely small (< 2e-16), far below the conventional 0.05 threshold. I can therefore reject the null hypothesis and conclude that mean IMDb scores differ significantly between movies and shows.
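With roughly 5,000 observations, a tiny p-value does not by itself say how large the difference is. A rough effect-size check, sketched here from the rounded sums of squares printed in the ANOVA table above (so the result is approximate), is eta-squared: the share of total variation in IMDb scores attributable to type. The variable names are illustrative and not part of the original analysis.

```r
# Values copied from the rounded ANOVA output above
ss_type  <- 665.9   # sum of squares for type (1 df, so it equals Mean Sq)
ss_resid <- 6034    # residual (within-group) sum of squares
df_resid <- 5053    # residual degrees of freedom

# Eta-squared: proportion of total variation in imdb_score explained by type
eta_squared <- ss_type / (ss_type + ss_resid)
round(eta_squared, 3)

# Sanity check: rebuild the F statistic from the same table entries
f_rebuilt <- (ss_type / 1) / (ss_resid / df_resid)
round(f_rebuilt, 1)
```

With these inputs, eta-squared comes out near 0.10, so content type accounts for roughly 10% of the variation in IMDb scores. The rebuilt F value lands close to the reported 557.7; the small gap comes from rounding in the printed table.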
In addition to the statistical test, I will visualize the distribution of IMDb scores for movies and shows using a boxplot. This will help to illustrate the differences in IMDb scores between content types.
ggplot(netflix_data, aes(x = type, y = imdb_score, fill = type)) +
geom_boxplot() +
stat_summary(fun = mean, geom = "point", shape = 23, size = 2, color = "black") +
labs(title = "IMDb Scores by Content Type", x = "Type of Content", y = "IMDb Score") +
theme_minimal()
This result suggests that the type of content, whether it’s a movie or a show, is associated with its IMDb score. The ANOVA alone does not explain why. For those interested in analyzing Netflix’s content, one possible reading is that viewers rate movies and shows differently, perhaps because of differences in how they consume or perceive these types of content.
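Because type has only two levels, the one-way ANOVA above is equivalent to a pooled two-sample t-test (the F statistic equals the squared t statistic), and both assume equal variances in the two groups. The sketch below illustrates this on simulated data and adds Welch's version, which drops the equal-variance assumption. The data frame toy and its group means and spreads are arbitrary, hypothetical values, not estimates from the Netflix data.

```r
set.seed(1)

# Simulated stand-in for netflix_data (hypothetical values for illustration)
toy <- data.frame(
  type = rep(c("MOVIE", "SHOW"), times = c(60, 40)),
  imdb_score = c(rnorm(60, mean = 6.3, sd = 1.2),
                 rnorm(40, mean = 7.0, sd = 1.0))
)

# Classic one-way ANOVA table, as used in the analysis above
anova_table <- summary(aov(imdb_score ~ type, data = toy))[[1]]
f_value <- anova_table[["F value"]][1]

# Pooled two-sample t-test: with two groups, F equals t squared
pooled_t <- t.test(imdb_score ~ type, data = toy, var.equal = TRUE)
isTRUE(all.equal(f_value, unname(pooled_t$statistic)^2))

# Welch's one-way test: does not assume equal group variances
welch <- oneway.test(imdb_score ~ type, data = toy, var.equal = FALSE)
welch$p.value
```

On the real data, the same check would be oneway.test(imdb_score ~ type, data = netflix_data, var.equal = FALSE). If its conclusion matches the aov() result, the finding does not hinge on the equal-variance assumption.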
Next, I aim to explore how well another continuous variable, tmdb_score, predicts IMDb scores. Since TMDb scores are another widely used metric, I expect them to be a good predictor.
To begin, I will create a simple linear regression model to predict IMDb scores using TMDb scores.
# Define continuous explanatory variable
continuous_var <- netflix_data$tmdb_score
# Build the regression model
regression_model <- lm(imdb_score ~ tmdb_score, data = netflix_data)
summary(regression_model)
##
## Call:
## lm(formula = imdb_score ~ tmdb_score, data = netflix_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7174 -0.4582 0.1218 0.5970 4.4042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.50379 0.07918 31.62 <2e-16 ***
## tmdb_score 0.59200 0.01147 51.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9316 on 5053 degrees of freedom
## Multiple R-squared: 0.3454, Adjusted R-squared: 0.3452
## F-statistic: 2666 on 1 and 5053 DF, p-value: < 2.2e-16
# Checking assumptions of the model
par(mfrow = c(2, 2))
plot(regression_model)
The coefficient for tmdb_score is highly significant (p < 2e-16), and the R-squared value is about 0.345. This means that around 34.5% of the variation in IMDb scores can be explained by TMDb scores, which indicates a moderate relationship.
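To put the R-squared value in more familiar terms, the following sketch uses only the numbers printed in the regression summary above. In a simple linear regression, R-squared is the square of the Pearson correlation between the two variables, and it can also be recovered from the F statistic and the residual degrees of freedom. The variable names are illustrative.

```r
# Values copied from the regression summary above
r_squared <- 0.3454
f_stat    <- 2666
df_resid  <- 5053

# Pearson correlation implied by R-squared
# (positive root, because the fitted slope is positive)
r <- sqrt(r_squared)
round(r, 3)

# With one predictor, R-squared = F / (F + residual df)
r2_from_f <- f_stat / (f_stat + df_resid)
round(r2_from_f, 4)
```

The implied correlation is about 0.59, and the value rebuilt from the F statistic agrees with the reported 0.3454. On the full data, cor(netflix_data$tmdb_score, netflix_data$imdb_score) should return the same correlation.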
The relationship between TMDb and IMDb scores can be visualized with a scatter plot and a linear regression line.
# Plot the relationship between tmdb_score and imdb_score
ggplot(netflix_data, aes(x = tmdb_score, y = imdb_score)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Linear Regression of IMDb Score vs TMDb Score",
x = "TMDb Score", y = "IMDb Score")
## `geom_smooth()` using formula = 'y ~ x'
The plot confirms the positive linear relationship between the TMDb score and the IMDb score, as expected.
In the regression model, the intercept is approximately 2.50, and the slope for tmdb_score is about 0.59. This means that for every 1-point increase in the TMDb score, the IMDb score is predicted to increase by 0.59 points on average. The intercept represents the predicted IMDb score when the TMDb score is zero, which isn’t practically meaningful but is just part of the mathematical model.
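As a worked example of how the coefficients translate into predictions, here is a minimal sketch that plugs the estimates from the summary into the fitted line. The helper predict_imdb() is a hypothetical name introduced only for illustration.

```r
# Coefficient estimates copied from the regression summary above
intercept <- 2.50379
slope     <- 0.59200

# Fitted line: predicted IMDb score for a given TMDb score
predict_imdb <- function(tmdb_score) {
  intercept + slope * tmdb_score
}

# Predicted IMDb scores for titles with TMDb scores of 7 and 8
predictions <- predict_imdb(c(7, 8))
round(predictions, 2)

# The gap between the two predictions is exactly the slope
diff(predictions)
```

A TMDb score of 7 maps to a predicted IMDb score of about 6.65, and a score of 8 to about 7.24. With the fitted model object available, predict(regression_model, newdata = data.frame(tmdb_score = c(7, 8))) gives the same predictions without copying coefficients by hand.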
TMDb scores are a moderately good predictor of IMDb scores: high TMDb scores tend to correspond to high IMDb scores. This could be valuable for content creators or analysts at Netflix, as it suggests that TMDb ratings give a useful, though imperfect, indication of how well a movie or show will perform on IMDb. However, the R-squared value of 0.345 leaves about two thirds of the variation unexplained, so other factors, such as genre, runtime, or age rating, likely play a role in influencing IMDb scores.
ANOVA Test: The ANOVA test revealed a significant difference in mean IMDb scores between movies and shows, suggesting that content type is associated with how audiences rate Netflix content.
Linear Regression: TMDb scores explain about a third of the variance in IMDb scores (R-squared of about 0.345) but do not account for all of it. Further investigations could focus on other potential predictors like genre, age rating, and runtime.
Future Research
In future analyses, I plan to investigate the influence of additional factors such as:
Genre: Does the genre (e.g., comedy, drama) affect IMDb scores?
Runtime: Does a longer runtime correlate with higher IMDb ratings?
These questions can provide further insights into the factors that influence audience ratings on IMDb and help content creators optimize their offerings to cater to viewer preferences.
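One practical hurdle for the genre question is that the genres column stores each title's genres as a single string such as "['crime', 'drama']". The sketch below shows one possible way to split those strings into separate genre labels, using the first four values shown in the str() output above. The helper parse_genres() is a hypothetical name, and this is only a starting point, not a full analysis.

```r
# First four genres values from the str() output above
genres_raw <- c("['crime', 'drama']", "['comedy', 'fantasy']",
                "['comedy']", "['horror']")

# Strip brackets and quotes, then split on commas
parse_genres <- function(x) {
  cleaned <- gsub("\\[|\\]|'", "", x)
  lapply(strsplit(cleaned, ",\\s*"), trimws)
}

genre_list <- parse_genres(genres_raw)

# Count how often each genre appears across titles
genre_counts <- table(unlist(genre_list))
genre_counts
```

Since a title can belong to several genres, a follow-up model would need to decide how to handle that, for example with one indicator column per genre.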