Load the Netflix Dataset
I’ll first load the Netflix dataset and preview it to understand its structure.
# Load the packages used throughout (dplyr, tidyr, stringr, ggplot2)
library(tidyverse)
# Load the Netflix dataset
netflix_data <- read.csv("~/Netflix_dataset.csv")
# Preview the dataset
head(netflix_data)
## id title type
## 1 ts300399 Five Came Back: The Reference Films SHOW
## 2 tm84618 Taxi Driver MOVIE
## 3 tm127384 Monty Python and the Holy Grail MOVIE
## 4 tm70993 Life of Brian MOVIE
## 5 tm190788 The Exorcist MOVIE
## 6 ts22164 Monty Python's Flying Circus SHOW
## description
## 1 This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discussed in the docuseries "Five Came Back."
## 2 A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived decadence and sleaze feed his urge for violent action, attempting to save a preadolescent prostitute in the process.
## 3 King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Robin the Not-Quite-So-Brave-As-Sir-Lancelot and Sir Galahad the Pure. On the way, Arthur battles the Black Knight who, despite having had all his limbs chopped off, insists he can still fight. They reach Camelot, but Arthur decides not to enter, as "it is a silly place".
## 4 Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as the Messiah. When he's not dodging his followers or being scolded by his shrill mother, the hapless Brian has to contend with the pompous Pontius Pilate and acronym-obsessed members of a separatist movement. Rife with Monty Python's signature absurdity, the tale finds Brian's life paralleling Biblical lore, albeit with many more laughs.
## 5 12-year-old Regan MacNeil begins to adapt an explicit new personality as strange events befall the local area of Georgetown. Her mother becomes torn between science and superstition in a desperate bid to save her daughter, and ultimately turns to her last hope: Father Damien Karras, a troubled priest who is struggling with his own faith.
## 6 A British sketch comedy series with the shows being composed of surreality, risqué or innuendo-laden humour, sight gags and observational sketches without punchlines.
## release_year age_certification runtime genres
## 1 1945 TV-MA 48 ['documentation']
## 2 1976 R 113 ['crime', 'drama']
## 3 1975 PG 91 ['comedy', 'fantasy']
## 4 1979 R 94 ['comedy']
## 5 1973 R 133 ['horror']
## 6 1969 TV-14 30 ['comedy', 'european']
## production_countries seasons imdb_id imdb_score imdb_votes tmdb_popularity
## 1 ['US'] 1 NA NA 0.600
## 2 ['US'] NA tt0075314 8.3 795222 27.612
## 3 ['GB'] NA tt0071853 8.2 530877 18.216
## 4 ['GB'] NA tt0079470 8.0 392419 17.505
## 5 ['US'] NA tt0070047 8.1 391942 95.337
## 6 ['GB'] 4 tt0063929 8.8 72895 12.919
## tmdb_score
## 1 NA
## 2 8.2
## 3 7.8
## 4 7.8
## 5 7.7
## 6 8.3
For this project, I am using the Netflix Dataset, which contains detailed information about the titles available on Netflix. It includes fields like IMDb scores, TMDb popularity, genres, and more. You can explore the dataset here along with the documentation.
There are 15 columns in the dataset, covering various attributes such as the title’s ID, type (Movie or TV Show), age certification, IMDb score, genres, and production countries. My aim is to explore the relationships between these factors and identify interesting patterns that Netflix titles might exhibit.
Main Goal:
My main focus in this analysis is to explore the relationship between IMDb ratings and genres, and how a title’s popularity (TMDB popularity score) is influenced by its runtime and production country.
Initial Seed Question:
Is there a strong correlation between a title’s IMDb score and its TMDB popularity? Are certain genres consistently rated higher than others on IMDb?
Visualizations for Further Investigation:
set.seed(123)
# taking a random sample of 50% of the dataset
netflix_data_1 <- netflix_data %>% sample_frac(0.5)
# cleaning the data by removing empty genres and formatting the genres column
netflix_data_1_clean <- netflix_data_1 %>%
  filter(genres != "[]" & genres != "" & !is.na(genres)) %>% # Remove records with empty or NA genres
  separate_rows(genres, sep = ",") %>% # Split genres into separate rows
  mutate(genres = str_replace_all(genres, "\\[|\\]|'", "")) %>% # Remove brackets and quotes
  mutate(genres = trimws(genres)) # Remove any extra whitespace
# Visualize median IMDb score by genre
netflix_data_1_clean %>%
  group_by(genres) %>%
  summarize(median_imdb = median(imdb_score, na.rm = TRUE)) %>%
  ggplot(aes(x = reorder(genres, -median_imdb), y = median_imdb, fill = genres)) +
  geom_col() + # Same as geom_bar(stat = "identity")
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Median IMDb Score by Genre",
       x = "Genres", y = "IMDb Score")
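To make the genre-cleaning step concrete, here is a minimal base-R sketch of the same idea applied to a single string. The input string mimics the format seen in the preview, and the variable names `raw_genres` and `clean_genres` are hypothetical; this is an illustration, not the pipeline used above.

```r
# One genres value in the same bracketed, quoted format as the dataset preview
raw_genres <- "['crime', 'drama']"

# Strip brackets and single quotes, split on commas, then trim whitespace
stripped <- gsub("\\[|\\]|'", "", raw_genres)
clean_genres <- trimws(strsplit(stripped, ",")[[1]])

print(clean_genres)
```

Each element of `clean_genres` corresponds to one row that `separate_rows()` would produce for that title.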
Why this is interesting: This plot will allow me to see whether certain genres tend to score higher on IMDb. Genres with consistently high or low median scores could reveal trends in user preferences or biases in Netflix’s content.
I use the median rather than the mean because it is a more robust measure of central tendency when the data are skewed or contain outliers.
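As a small illustration of that robustness, the sketch below compares the mean and the median on a made-up set of scores with one extreme value. The numbers are hypothetical and are not drawn from the Netflix dataset.

```r
# Hypothetical scores for one genre; the last value is a deliberate outlier
scores <- c(6.8, 7.0, 7.1, 7.2, 7.3, 1.5)

mean_score <- mean(scores)     # pulled down by the single low value
median_score <- median(scores) # stays close to the bulk of the scores

print(c(mean = mean_score, median = median_score))
```

The single low score moves the mean well below most of the values, while the median barely changes.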
I also want to look at the relationship between TMDB popularity and a title’s runtime to see if longer titles are generally more or less popular.
# Scatter plot for Popularity vs Runtime
ggplot(netflix_data_1, aes(x = runtime, y = tmdb_popularity)) +
  geom_point(aes(color = tmdb_popularity), alpha = 0.6, size = 3) +
  scale_color_gradient(low = "lightblue", high = "darkblue") +
  # `linewidth` replaces `size` for lines as of ggplot2 3.4.0
  geom_smooth(method = "lm", col = "red", se = FALSE, linewidth = 1) +
  labs(
    title = "How Runtime Affects TMDB Popularity",
    subtitle = "Are longer movies more popular?",
    x = "Runtime (Minutes)",
    y = "Popularity (TMDB)",
    color = "TMDB Popularity"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 14, face = "italic", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12),
    legend.position = "right"
  ) +
  scale_x_continuous(labels = scales::comma)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 48 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 48 rows containing missing values or values outside the scale range
## (`geom_point()`).
Why this is interesting: This scatter plot will help me understand whether runtime is associated with how popular a title is. It might show whether viewers prefer longer movies and shows or whether shorter, binge-worthy content dominates.
Test Hypotheses: My next step will be to develop and test specific hypotheses related to IMDb scores, genres, and TMDB popularity.
Age Certification Analysis: I plan to investigate whether the age certification of titles impacts their IMDb rating or TMDB popularity.
Dive Deeper into Genres: Exploring which genres influence IMDb scores and popularity will be a priority.
Handling Missing Data: I need to clean the dataset and deal with missing values.
Monte Carlo Simulations: Using simulations could help understand the variability in IMDb ratings or popularity scores.
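One simple form the simulation step could take is a bootstrap: resample the observed scores with replacement many times and look at how much the median moves. The sketch below is a minimal base-R version under stated assumptions: `toy_scores` is a made-up vector standing in for the IMDb scores of one genre, and `n_sims` is an arbitrary choice.

```r
set.seed(123) # reproducible resampling

# Hypothetical IMDb scores standing in for one genre (not taken from the dataset)
toy_scores <- c(5.9, 6.4, 6.8, 7.0, 7.1, 7.3, 7.6, 8.0, 8.2, 4.1)

# Resample with replacement many times and record the median of each resample
n_sims <- 2000
boot_medians <- replicate(n_sims, median(sample(toy_scores, replace = TRUE)))

# Percentile interval describing how much the median varies across resamples
boot_ci <- quantile(boot_medians, probs = c(0.025, 0.975))
print(boot_ci)
```

With the real data, `toy_scores` would be replaced by the non-missing `imdb_score` values for a genre of interest.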
Hypothesis: Higher IMDb ratings are associated with higher TMDB popularity scores.
I hypothesize that titles with higher IMDb ratings also tend to have higher TMDB popularity scores, which I use here as a proxy for how popular a title is with viewers.
# Scatter plot for IMDb Score vs TMDB Popularity
ggplot(netflix_data_1, aes(x = imdb_score, y = tmdb_popularity)) +
  geom_point(aes(color = tmdb_popularity), alpha = 0.6, size = 3) +
  scale_color_gradient(low = "lightblue", high = "darkblue") +
  # `linewidth` replaces `size` for lines as of ggplot2 3.4.0
  geom_smooth(method = "lm", col = "red", se = FALSE, linewidth = 1.2) +
  labs(
    title = "How IMDb Score Relates to TMDB Popularity",
    subtitle = "Are higher-rated titles more popular?",
    x = "IMDB Score",
    y = "TMDB Popularity",
    color = "TMDB Popularity"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 14, face = "italic", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12),
    legend.position = "right"
  ) +
  scale_x_continuous(breaks = seq(0, 10, 1), labels = scales::comma)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 301 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 301 rows containing missing values or values outside the scale range
## (`geom_point()`).
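The plot gives a visual impression only. One way to put a number on the association would be a rank correlation test. The sketch below uses Spearman's method, which is my assumption rather than something settled in this analysis; I chose it because it does not require a linear relationship. The data frame `toy` is made up for illustration.

```r
# Hypothetical titles; one IMDb score is missing, as in the real data
toy <- data.frame(
  imdb_score      = c(5.1, 6.3, 7.2, 8.0, NA, 6.9),
  tmdb_popularity = c(3.2, 10.5, 8.1, 40.0, 12.0, 9.7)
)

# cor.test() drops incomplete pairs before computing the statistic
spearman_res <- cor.test(toy$imdb_score, toy$tmdb_popularity, method = "spearman")

rho <- unname(spearman_res$estimate)
print(spearman_res)
```

On the real data, the same call would take `netflix_data_1$imdb_score` and `netflix_data_1$tmdb_popularity`; ties in the scores would make R fall back to an approximate p-value.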
Hypothesis: Certain genres (like ‘Drama’ or ‘Documentary’, which this dataset labels ‘documentation’) will have higher IMDb ratings compared to genres like ‘Comedy’ or ‘Action’.
This hypothesis suggests that more serious or thoughtful genres may receive higher ratings on IMDb compared to lighthearted genres.
# Enhanced IMDb score distribution by genre using boxplot
# Order genres by median IMDb score, ignoring missing scores
ggplot(netflix_data_1_clean,
       aes(x = reorder(genres, imdb_score, FUN = median, na.rm = TRUE),
           y = imdb_score, fill = genres)) +
  geom_boxplot(outlier.color = "red", outlier.size = 2, notch = TRUE) + # Highlight outliers and use notched boxplots
  scale_fill_viridis_d(option = "C", direction = -1) +
  labs(
    title = "How IMDb Scores Vary Across Genres",
    subtitle = "Visualizing the distribution of IMDb scores by genre",
    x = "Genres",
    y = "IMDb Score"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 14, face = "italic", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text.x = element_text(size = 12, angle = 45, hjust = 1),
    legend.position = "none"
  ) +
  scale_y_continuous(breaks = seq(0, 10, 1), labels = scales::comma)
## Warning: Removed 300 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
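To go beyond the boxplot, a rank-based test such as Kruskal-Wallis could check whether score distributions differ across genres. This is a sketch of one possible approach, not a method I have committed to, and the `toy` data frame below is invented for illustration. One caveat for the real data: after splitting genres into rows, a title appears once per genre, so the groups are not independent.

```r
# Hypothetical scores for three genres, five titles each
toy <- data.frame(
  genres = rep(c("drama", "comedy", "action"), each = 5),
  imdb_score = c(7.9, 7.4, 8.1, 6.9, 7.6,
                 6.2, 6.8, 5.9, 7.0, 6.5,
                 5.8, 6.1, 6.6, 5.5, 6.3)
)

# Kruskal-Wallis compares the groups using ranks, so it does not assume normality
kw <- kruskal.test(imdb_score ~ genres, data = toy)
print(kw)
```

A small p-value would suggest that at least one genre's scores tend to differ from the others, without saying which one.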
The summary so far focuses on the structure of the dataset and an initial exploration into how genres and IMDb ratings interact. The two visualizations (IMDb by genre and popularity vs runtime) provide early insights into possible relationships worth investigating.
My first hypothesis looks at whether highly rated titles also tend to be popular, and my second hypothesis explores whether some genres are more highly rated than others.
These initial findings will guide my further analysis, and my next steps will be testing the hypotheses more thoroughly, cleaning up any missing data, and exploring any further patterns that emerge.
This should give a strong foundation to build on in the meeting, and I’m looking forward to discussing these insights in more detail.