Load the Netflix Dataset
I’ll first load the Netflix dataset.
# Load the Netflix dataset
netflix_data <- read.csv("~/Netflix_dataset.csv")
# Preview the dataset
head(netflix_data)
## id title type
## 1 ts300399 Five Came Back: The Reference Films SHOW
## 2 tm84618 Taxi Driver MOVIE
## 3 tm127384 Monty Python and the Holy Grail MOVIE
## 4 tm70993 Life of Brian MOVIE
## 5 tm190788 The Exorcist MOVIE
## 6 ts22164 Monty Python's Flying Circus SHOW
## description
## 1 This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discussed in the docuseries "Five Came Back."
## 2 A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived decadence and sleaze feed his urge for violent action, attempting to save a preadolescent prostitute in the process.
## 3 King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Robin the Not-Quite-So-Brave-As-Sir-Lancelot and Sir Galahad the Pure. On the way, Arthur battles the Black Knight who, despite having had all his limbs chopped off, insists he can still fight. They reach Camelot, but Arthur decides not to enter, as "it is a silly place".
## 4 Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as the Messiah. When he's not dodging his followers or being scolded by his shrill mother, the hapless Brian has to contend with the pompous Pontius Pilate and acronym-obsessed members of a separatist movement. Rife with Monty Python's signature absurdity, the tale finds Brian's life paralleling Biblical lore, albeit with many more laughs.
## 5 12-year-old Regan MacNeil begins to adapt an explicit new personality as strange events befall the local area of Georgetown. Her mother becomes torn between science and superstition in a desperate bid to save her daughter, and ultimately turns to her last hope: Father Damien Karras, a troubled priest who is struggling with his own faith.
## 6 A British sketch comedy series with the shows being composed of surreality, risqué or innuendo-laden humour, sight gags and observational sketches without punchlines.
## release_year age_certification runtime genres
## 1 1945 TV-MA 48 ['documentation']
## 2 1976 R 113 ['crime', 'drama']
## 3 1975 PG 91 ['comedy', 'fantasy']
## 4 1979 R 94 ['comedy']
## 5 1973 R 133 ['horror']
## 6 1969 TV-14 30 ['comedy', 'european']
## production_countries seasons imdb_id imdb_score imdb_votes tmdb_popularity
## 1 ['US'] 1 NA NA 0.600
## 2 ['US'] NA tt0075314 8.3 795222 27.612
## 3 ['GB'] NA tt0071853 8.2 530877 18.216
## 4 ['GB'] NA tt0079470 8.0 392419 17.505
## 5 ['US'] NA tt0070047 8.1 391942 95.337
## 6 ['GB'] 4 tt0063929 8.8 72895 12.919
## tmdb_score
## 1 NA
## 2 8.2
## 3 7.8
## 4 7.8
## 5 7.7
## 6 8.3
Over the past few weeks, I’ve explored the Netflix dataset and identified several aspects worth investigating. For this data dive, I will formulate two hypotheses and conduct hypothesis tests using the Neyman-Pearson framework and Fisher’s Significance Testing framework.
Null Hypothesis (H0): There is no difference in the average IMDb score between movies and TV shows.
Alternative Hypothesis (H1): There is a difference in the average IMDb score between movies and TV shows.
I chose an alpha level of 0.05, which allows for a 5% chance of rejecting the null hypothesis when it is actually true (a Type I error). This is a standard threshold and works well for comparing IMDb scores between movies and TV shows, where the consequences of a false positive aren’t as critical as in more sensitive fields like medicine. A stricter alpha (e.g., 0.01) isn’t necessary in this context.
For the power level, I selected 0.80, ensuring an 80% chance of detecting a true difference in scores if one exists. While a higher power (0.90) could reduce the risk of missing real effects, it would also require a larger sample size, so 0.80 strikes the right balance for this analysis.
I chose a medium effect size (Cohen’s d = 0.5) because I expect a moderate difference in IMDb scores between movies and TV shows. Assuming a smaller effect size would demand more data, while assuming a larger one would leave the test underpowered if the actual difference turned out to be smaller.
Sample Size Calculation and Data Suitability
To ensure I have enough data to conduct this test, I calculated the required sample size.
# Load necessary libraries
library(pwr)   # power analysis (pwr.t.test)
library(dplyr) # %>%, group_by(), summarize()
# Calculate required sample size per group for a two-sample t-test (Cohen's d = 0.5, power = 0.80, alpha = 0.05)
sample_size <- pwr.t.test(d = 0.5, power = 0.8, sig.level = 0.05, type = "two.sample")$n
sample_size
## [1] 63.76561
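As a rough cross-check that does not depend on the pwr package, base R's power.t.test() can solve for the per-group sample size under the same settings. The sketch below also varies the effect size and power to illustrate the trade-offs described earlier. The alternative values of d (0.2, 0.8) and power (0.90) are illustrative assumptions, not values used in this analysis, and n_for() is a hypothetical helper written for this sketch.

```r
# Cross-check with base R (stats package); with sd = 1, delta plays the role of Cohen's d
n_for <- function(d, power = 0.80, alpha = 0.05) {
  power.t.test(delta = d, sd = 1, sig.level = alpha, power = power,
               type = "two.sample", alternative = "two.sided")$n
}

n_medium  <- n_for(0.5)                # settings used in this analysis
n_small   <- n_for(0.2)                # illustrative: smaller assumed effect
n_large   <- n_for(0.8)                # illustrative: larger assumed effect
n_power90 <- n_for(0.5, power = 0.90)  # illustrative: higher power

# Round up, since a group cannot contain a fraction of a title
ceiling(c(small = n_small, medium = n_medium, large = n_large, power90 = n_power90))
```

Rounding the medium-effect result up gives 64 titles per group, in line with the pwr output above.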
# Check if we have enough data in each group
netflix_data %>%
  group_by(type) %>%
  summarize(n = n()) # Count the number of movies and shows
## # A tibble: 2 × 2
## type n
## <chr> <int>
## 1 MOVIE 3759
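## 2 SHOW 2047
The power analysis calls for about 64 titles per group, and the dataset contains 3,759 movies and 2,047 shows, so both groups are far larger than required.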
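One caveat: n() counts every row, including titles whose imdb_score is missing, and t.test() drops those rows. A minimal sketch of counting only the usable scores per group is shown below. The toy data frame is invented purely for illustration and stands in for netflix_data.

```r
# Toy stand-in for netflix_data (values are invented for illustration only)
toy <- data.frame(
  type       = c("MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"),
  imdb_score = c(6.1, NA, 7.4, 8.0, NA)
)

# Count non-missing IMDb scores per content type
usable_n <- tapply(!is.na(toy$imdb_score), toy$type, sum)
usable_n

# Compare against the required per-group sample size (64, rounded up from 63.77)
required_n <- 64
usable_n >= required_n
```

Applied to netflix_data, the same tapply() call would give the group sizes the t-test actually uses. The boxplot warning later in this report mentions 523 removed rows, so even in the worst case each group would still retain well over 64 scored titles.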
# Perform two-sample t-test (assuming unequal variance)
t_test_result <- t.test(imdb_score ~ type, data = netflix_data, var.equal = FALSE)
# Show results
t_test_result
##
## Welch Two Sample t-test
##
## data: imdb_score by type
## t = -23.875, df = 3976.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group MOVIE and group SHOW is not equal to 0
## 95 percent confidence interval:
## -0.8120173 -0.6887780
## sample estimates:
## mean in group MOVIE mean in group SHOW
## 6.266980 7.017377
After performing the t-test, I found that the p-value is < 2.2e-16, which is well below my alpha level of 0.05, so I reject the null hypothesis. The test also provides a 95% confidence interval for the difference in means (movies minus TV shows), ranging from -0.812 to -0.689.
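The confidence interval can be reproduced approximately from the numbers printed in the test output, which is a useful sanity check on how it is built. This sketch uses only the rounded values from that output (the two means, the t statistic, and the Welch degrees of freedom), so it should match the reported interval only up to rounding.

```r
# Values copied from the Welch t-test output above
mean_movie <- 6.266980
mean_show  <- 7.017377
t_stat     <- -23.875
df_welch   <- 3976.8

# Difference in means (MOVIE minus SHOW) and its implied standard error
diff_means <- mean_movie - mean_show
std_error  <- diff_means / t_stat

# 95% CI: estimate +/- critical t value * standard error
t_crit <- qt(0.975, df = df_welch)
ci <- diff_means + c(-1, 1) * t_crit * std_error
round(ci, 3)
```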
Interpretation:
The mean IMDb score for TV shows is 7.02, while the mean score for movies is 6.27. This indicates that, on average, TV shows tend to have significantly higher IMDb scores than movies.
With such a low p-value, the difference is statistically significant. The confidence interval suggests that TV shows score, on average, between 0.689 and 0.812 points higher than movies.
Thus, I conclude that TV shows, on average, tend to receive higher IMDb ratings than movies.
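Because the power analysis assumed a medium effect (d = 0.5), it may be worth comparing that assumption with the observed standardized difference. The helper below, cohens_d(), is a hypothetical function written for this sketch (it is not part of the analysis above) and uses the pooled standard deviation. The example vectors are toy values, not Netflix data.

```r
# Hypothetical helper: Cohen's d with a pooled standard deviation
cohens_d <- function(x, y) {
  x <- x[!is.na(x)]  # drop missing scores
  y <- y[!is.na(y)]
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Toy example (invented values): means differ by 1 and both SDs equal 1
d_toy <- cohens_d(c(1, 2, 3, NA), c(2, 3, 4))
d_toy
```

On the real data, this could be called with the movie scores as x and the show scores as y, for example netflix_data$imdb_score[netflix_data$type == "MOVIE"] and netflix_data$imdb_score[netflix_data$type == "SHOW"].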
Null Hypothesis (H0): There is no correlation between the runtime of content and its TMDB popularity.
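For this hypothesis, I will use Pearson’s correlation test and, following Fisher’s significance testing framework, interpret the p-value as a measure of evidence against the null hypothesis.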
# Perform Pearson correlation test between runtime and TMDB popularity
cor_test_result <- cor.test(netflix_data$runtime, netflix_data$tmdb_popularity, method = "pearson")
# Show results
cor_test_result
##
## Pearson's product-moment correlation
##
## data: netflix_data$runtime and netflix_data$tmdb_popularity
## t = -2.0783, df = 5710, p-value = 0.03773
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.053388722 -0.001559761
## sample estimates:
## cor
## -0.02749272
Interpretation of Results:
I performed Pearson’s correlation test and found a p-value of 0.03773, which is below 0.05. This allows me to reject the null hypothesis and conclude that there is a statistically significant correlation between runtime and TMDB popularity. The correlation coefficient is -0.0275, which indicates a very weak negative correlation.
The correlation coefficient suggests that as runtime increases, TMDB popularity slightly decreases, although this relationship is weak.
The confidence interval for the correlation ranges from -0.0534 to -0.0016, indicating that the true correlation is likely close to zero but still negative.
Given the weak correlation, runtime may not be a strong driver of TMDB popularity. Other factors likely play a more prominent role in determining a title’s popularity.
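To put the size of this correlation in perspective, the sketch below works only from the numbers in the cor.test output (r and the degrees of freedom). It computes the share of variance the two variables have in common (r squared) and re-derives the t statistic, p-value, and confidence interval using the standard formulas for Pearson's test. Small discrepancies from the printed output would be due to rounding.

```r
# Values copied from the cor.test output above
r  <- -0.02749272
df <- 5710
n  <- df + 2   # Pearson's test uses n - 2 degrees of freedom

# Proportion of variance shared by the two variables
r_squared <- r^2

# t statistic and two-sided p-value for H0: correlation = 0
t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = df)

# 95% CI via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(c(r_squared = r_squared, t = t_stat, p = p_value,
        lower = ci[1], upper = ci[2]), 5)
```

Since r squared is below 0.001, runtime accounts for less than 0.1% of the variation in TMDB popularity, which reinforces that the result is statistically significant but practically negligible.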
To support the results of my hypothesis tests, I’ve created two visualizations:
# Load ggplot2 for plotting
library(ggplot2)
# Plot: IMDb Scores by Type (Movies vs Shows)
ggplot(netflix_data, aes(x = type, y = imdb_score, fill = type)) +
  geom_boxplot() +
  labs(title = "IMDb Scores for Movies vs TV Shows",
       x = "Content Type", y = "IMDb Score") +
  theme_minimal()
## Warning: Removed 523 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
I visualized these differences using a boxplot, which clearly shows that the median IMDb score for TV shows is higher than that of movies. This visual supports the results of the t-test.
# Plot: Runtime vs TMDB Popularity
ggplot(netflix_data, aes(x = runtime, y = tmdb_popularity)) +
  geom_point(color = "blue", alpha = 0.6) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Relationship Between Runtime and TMDB Popularity",
       x = "Runtime (minutes)", y = "TMDB Popularity") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 94 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 94 rows containing missing values or values outside the scale range
## (`geom_point()`).
I visualized the relationship between runtime and TMDB popularity using a scatter plot with a trend line. The plot shows a slight downward trend, consistent with the weak negative correlation.
Key Insights:
TV shows tend to have significantly higher IMDb scores than movies.
Although there is a statistically significant correlation between runtime and TMDB popularity, the relationship is weak and negative, suggesting that other factors influence popularity more than runtime.
Further Questions:
What other factors might explain the higher IMDb scores for TV shows? Could it be related to audience preferences or the nature of serialized content?
Are there other variables (e.g., genre, release year) that could explain TMDB popularity better than runtime?