Netflix - Project Data Discovery

Load the Netflix Dataset

I’ll first load the Netflix dataset and preview it to understand its structure.

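# Load the packages used below (assumed: the tidyverse, which provides
# %>%, sample_frac(), separate_rows(), str_replace_all(), and ggplot())
library(tidyverse)

# Load the Netflix dataset
netflix_data <- read.csv("~/Netflix_dataset.csv")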

# Preview the dataset
head(netflix_data)

##         id                               title  type
## 1 ts300399 Five Came Back: The Reference Films  SHOW
## 2  tm84618                         Taxi Driver MOVIE
## 3 tm127384     Monty Python and the Holy Grail MOVIE
## 4  tm70993                       Life of Brian MOVIE
## 5 tm190788                        The Exorcist MOVIE
## 6  ts22164        Monty Python's Flying Circus  SHOW
##                                                                                                                                                                                                                                                                                                                                                                                                                                                          description
## 1                                                                                                                                                                                                                                                                                                            This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discussed in the docuseries "Five Came Back."
## 2                                                                                                                                                                                                                                A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived decadence and sleaze feed his urge for violent action, attempting to save a preadolescent prostitute in the process.
## 3                                    King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Robin the Not-Quite-So-Brave-As-Sir-Lancelot and Sir Galahad the Pure. On the way, Arthur battles the Black Knight who, despite having had all his limbs chopped off, insists he can still fight. They reach Camelot, but Arthur decides not  to enter, as "it is a silly place".
## 4 Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as the Messiah. When he's not dodging his followers or being scolded by his shrill mother, the hapless Brian has to contend with the pompous Pontius Pilate and acronym-obsessed members of a separatist movement. Rife with Monty Python's signature absurdity, the tale finds Brian's life paralleling Biblical lore, albeit with many more laughs.
## 5                                                                                                                12-year-old Regan MacNeil begins to adapt an explicit new personality as strange events befall the local area of Georgetown. Her mother becomes torn between science and superstition in a desperate bid to save her daughter, and ultimately turns to her last hope: Father Damien Karras, a troubled priest who is struggling with his own faith.
## 6                                                                                                                                                                                                                                                                                             A British sketch comedy series with the shows being composed of surreality, risqué or innuendo-laden humour, sight gags and observational sketches without punchlines.
##   release_year age_certification runtime                 genres
## 1         1945             TV-MA      48      ['documentation']
## 2         1976                 R     113     ['crime', 'drama']
## 3         1975                PG      91  ['comedy', 'fantasy']
## 4         1979                 R      94             ['comedy']
## 5         1973                 R     133             ['horror']
## 6         1969             TV-14      30 ['comedy', 'european']
##   production_countries seasons   imdb_id imdb_score imdb_votes tmdb_popularity
## 1               ['US']       1                   NA         NA           0.600
## 2               ['US']      NA tt0075314        8.3     795222          27.612
## 3               ['GB']      NA tt0071853        8.2     530877          18.216
## 4               ['GB']      NA tt0079470        8.0     392419          17.505
## 5               ['US']      NA tt0070047        8.1     391942          95.337
## 6               ['GB']       4 tt0063929        8.8      72895          12.919
##   tmdb_score
## 1         NA
## 2        8.2
## 3        7.8
## 4        7.8
## 5        7.7
## 6        8.3

Dataset Description:

For this project, I am using the Netflix Dataset, which contains detailed information about the titles available on Netflix. It includes fields like IMDb scores, TMDb popularity, genres, and more. You can explore the dataset here along with the documentation.

There are 15 columns in the dataset, covering attributes such as the title’s ID, type (MOVIE or SHOW), age certification, IMDb score, genres, and production countries. My aim is to explore the relationships between these factors and identify interesting patterns among Netflix titles.
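Before exploring relationships, it may help to see how much of each column is missing. The sketch below is only illustrative: `toy` is a hypothetical stand-in holding three columns from the first three previewed rows, and `count_missing()` is a helper name introduced here, not part of the original analysis. It treats empty strings as missing as well as `NA`, since the preview shows a blank `imdb_id` rather than `NA`.

```r
# Hypothetical stand-in for netflix_data, using the first three previewed rows
toy <- data.frame(
  imdb_id    = c("", "tt0075314", "tt0071853"),
  imdb_score = c(NA, 8.3, 8.2),
  seasons    = c(1, NA, NA),
  stringsAsFactors = FALSE
)

# Count values that are NA or, for text columns, empty strings
count_missing <- function(df) {
  vapply(df, function(col) {
    empty <- if (is.character(col)) !is.na(col) & col == "" else FALSE
    sum(is.na(col) | empty)
  }, integer(1))
}

missing_counts <- count_missing(toy)
print(missing_counts)
```

Applied to the full data frame, the same helper would be called as `count_missing(netflix_data)`.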

Main Goal:

My main focus in this analysis is to explore the relationship between IMDb ratings and genres, and how a title’s popularity (TMDB popularity score) is influenced by its runtime and production country.

Initial Seed Question:

Is there a strong correlation between a title’s IMDb score and its TMDB popularity? Are certain genres consistently rated higher than others on IMDb?
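One way to put a number on the first question is a rank correlation. The sketch below is a minimal illustration, not a result: `toy` holds only the six previewed rows, so it says nothing about the full dataset. Spearman's method is my assumption here, chosen because popularity scores tend to have a long right tail.

```r
# The six previewed rows, used only to show the mechanics
toy <- data.frame(
  imdb_score      = c(NA, 8.3, 8.2, 8.0, 8.1, 8.8),
  tmdb_popularity = c(0.600, 27.612, 18.216, 17.505, 95.337, 12.919)
)

# Keep only rows where both measures are present
complete <- toy[complete.cases(toy), ]

# Spearman's rank correlation does not assume a linear relationship
result <- cor.test(complete$imdb_score, complete$tmdb_popularity,
                   method = "spearman")
print(result$estimate)
print(result$p.value)
```

On the real data, `netflix_data_1` would replace `toy`.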

Visualizations for Further Investigation:

1. IMDb Score by Genre

I first want to see whether the median IMDb score differs across genres.

set.seed(123)

# taking a random sample of 50% of the dataset
netflix_data_1 <- netflix_data %>% sample_frac(0.5)

# cleaning the data by removing empty genres and formatting the genres column
netflix_data_1_clean <- netflix_data_1 %>%
  filter(genres != "[]" & genres != "" & !is.na(genres)) %>%  # Remove records with empty or NA genres
  separate_rows(genres, sep = ",") %>%  # Split genres into separate rows
  mutate(genres = str_replace_all(genres, "\\[|\\]|'", "")) %>%  # Remove brackets and quotes
  mutate(genres = trimws(genres))  # Remove any extra whitespace

# Visualize median IMDb score by genre
netflix_data_1_clean %>%
  group_by(genres) %>%
  summarize(median_imdb = median(imdb_score, na.rm = TRUE)) %>%
  ggplot(aes(x = reorder(genres, -median_imdb), y = median_imdb, fill = genres)) +
  geom_col() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Median IMDb Score by Genre",
       x = "Genres", y = "IMDb Score")

Why this is interesting: This plot lets me see whether certain genres tend to score higher on IMDb. Genres with consistently high or low median scores could reveal trends in user preferences or biases in Netflix’s content.

I use the median rather than the mean because it is more robust to skewed distributions and outliers.
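A small example shows why the choice matters. The scores below are hypothetical and chosen to include one very low outlier; they are not taken from the dataset.

```r
# Hypothetical IMDb scores for one genre, including one very low outlier
scores <- c(6.8, 7.0, 7.1, 7.2, 7.4, 1.5)

mean_score   <- mean(scores)
median_score <- median(scores)

# How far each statistic moves when the outlier (6th value) is dropped
mean_shift   <- abs(mean(scores[-6]) - mean_score)
median_shift <- abs(median(scores[-6]) - median_score)

print(c(mean = mean_score, median = median_score))
print(c(mean_shift = mean_shift, median_shift = median_shift))
```

Here the single outlier moves the mean by almost a full point, while the median barely changes.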

2. Popularity vs Runtime

I also want to look at the relationship between TMDB popularity and a title’s runtime to see if longer titles are generally more or less popular.

#  Scatter plot for Popularity vs Runtime
ggplot(netflix_data_1, aes(x = runtime, y = tmdb_popularity)) +
  geom_point(aes(color = tmdb_popularity), alpha = 0.6, size = 3) + 
  scale_color_gradient(low = "lightblue", high = "darkblue") + 
  geom_smooth(method = "lm", col = "red", se = FALSE, linewidth = 1) +
  labs(
    title = "How Runtime Affects TMDB Popularity",
    subtitle = "Are longer movies more popular?",
    x = "Runtime (Minutes)",
    y = "Popularity (TMDB)",
    color = "TMDB Popularity"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 14, face = "italic", hjust = 0.5),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12),
    legend.position = "right"
  ) +
  scale_x_continuous(labels = scales::comma)

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 48 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 48 rows containing missing values or values outside the scale range
## (`geom_point()`).

Why this is interesting: This scatter plot will help me see whether runtime is associated with a title’s popularity. It might show whether viewers prefer longer movies or shows, or whether shorter, binge-worthy content dominates.
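The red line in the plot comes from `geom_smooth(method = "lm")`, which fits a simple linear regression of popularity on runtime. Fitting the same model directly with `lm()` gives the slope and R-squared behind that line. The sketch below uses only the six previewed rows as a hypothetical stand-in, so its numbers are illustrative and not a finding.

```r
# The six previewed rows, used only to show the mechanics
toy <- data.frame(
  runtime         = c(48, 113, 91, 94, 133, 30),
  tmdb_popularity = c(0.600, 27.612, 18.216, 17.505, 95.337, 12.919)
)

# Same model that geom_smooth(method = "lm") draws: y ~ x
fit <- lm(tmdb_popularity ~ runtime, data = toy)

slope     <- coef(fit)[["runtime"]]   # change in popularity per extra minute
r_squared <- summary(fit)$r.squared   # share of variance explained

print(c(slope = slope, r_squared = r_squared))
```

A low R-squared on the real data would suggest that runtime alone explains little of the variation in popularity, even if the fitted line has a visible slope.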

To-Do List:

Test Hypotheses: My next step will be to develop and test specific hypotheses related to IMDb scores, genres, and TMDB popularity.
Age Certification Analysis: I plan to investigate whether the age certification of titles impacts their IMDb rating or TMDB popularity.
Dive Deeper into Genres: Exploring which genres influence IMDb scores and popularity will be a priority.
Handling Missing Data: I need to clean the dataset and deal with missing values.
Monte Carlo Simulations: Using simulations could help understand the variability in IMDb ratings or popularity scores.
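
As a rough idea of what the simulation item could look like, the sketch below bootstraps the median IMDb score for one genre. The bootstrap is my assumed choice of Monte Carlo method, and the scores are hypothetical values rather than data from the file.

```r
set.seed(123)

# Hypothetical IMDb scores for a single genre
scores <- c(5.9, 6.4, 6.8, 7.0, 7.1, 7.3, 7.6, 8.0, 8.2, 8.8)

# Resample with replacement many times and record each resample's median
boot_medians <- replicate(2000, median(sample(scores, replace = TRUE)))

# Percentile interval summarising the variability of the median
boot_ci <- quantile(boot_medians, probs = c(0.025, 0.975))
print(boot_ci)
```

Comparing such intervals across genres would show whether differences in median score are larger than the resampling noise.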

Initial Findings

Hypothesis 1:

Hypothesis: Higher IMDb ratings are associated with higher TMDB popularity scores.

I expect titles with higher IMDb ratings to also score higher on TMDB popularity. Note that TMDB popularity is a TMDB metric rather than a direct measure of viewership on Netflix.

Visualization for Hypothesis 1:

#  Scatter plot for IMDb Score vs TMDB Popularity
ggplot(netflix_data_1, aes(x = imdb_score, y = tmdb_popularity)) +
  geom_point(aes(color = tmdb_popularity), alpha = 0.6, size = 3) + 
  scale_color_gradient(low = "lightblue", high = "darkblue") + 
  geom_smooth(method = "lm", col = "red", se = FALSE, linewidth = 1.2) +
  labs(
    title = "How IMDb Score Relates to TMDB Popularity",
    subtitle = "Are higher-rated titles more popular?",
    x = "IMDb Score",
    y = "TMDB Popularity",
    color = "TMDB Popularity"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),  
    plot.subtitle = element_text(size = 14, face = "italic", hjust = 0.5),  
    axis.title = element_text(size = 14),  
    axis.text = element_text(size = 12), 
    legend.position = "right"  
  ) +
  scale_x_continuous(breaks = seq(0, 10, 1), labels = scales::comma)

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 301 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 301 rows containing missing values or values outside the scale range
## (`geom_point()`).

Hypothesis 2:

Hypothesis: Certain genres (such as drama or documentary, which the dataset labels ‘documentation’) will have higher IMDb ratings than genres such as comedy or action.

This hypothesis suggests that more serious or thoughtful genres may receive higher ratings on IMDb compared to lighthearted genres.
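
One possible formal check of this hypothesis is a Kruskal-Wallis test, which compares score distributions across groups without assuming normality. The sketch below uses hypothetical scores in the same long format as `netflix_data_1_clean` (one row per title-genre pair). One caveat I am assuming matters: titles with several genres appear in several rows, so the rows are not fully independent and the p-value should be read with caution.

```r
# Hypothetical long-format data: one row per title-genre pair
toy <- data.frame(
  genres     = rep(c("drama", "comedy", "action"), each = 5),
  imdb_score = c(7.8, 8.1, 7.4, 8.3, 7.9,
                 6.5, 7.0, 6.2, 6.9, 6.6,
                 6.0, 6.4, 5.8, 6.7, 6.1)
)

# Rank-based test of whether at least one genre differs from the others
kw <- kruskal.test(imdb_score ~ genres, data = toy)
print(kw)
```

A small p-value only says that some genre differs; pairwise follow-up comparisons would be needed to say which ones.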

Visualization for Hypothesis 2:

# Enhanced IMDb score distribution by genre using boxplot
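# Order genres by median score; na.rm = TRUE stops missing scores from turning a genre's ordering value into NA
ggplot(netflix_data_1_clean, aes(x = reorder(genres, imdb_score, FUN = median, na.rm = TRUE), y = imdb_score, fill = genres)) +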
  geom_boxplot(outlier.color = "red", outlier.size = 2, notch = TRUE) +  # Highlight outliers and use notched boxplots
  scale_fill_viridis_d(option = "C", direction = -1) + 
  labs(
    title = "How IMDb Scores Vary Across Genres",
    subtitle = "Visualizing the distribution of IMDb scores by genre",
    x = "Genres", 
    y = "IMDb Score"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),  
    plot.subtitle = element_text(size = 14, face = "italic", hjust = 0.5), 
    axis.title = element_text(size = 14),
    axis.text.x = element_text(size = 12, angle = 45, hjust = 1),  
    legend.position = "none" 
  ) +
  scale_y_continuous(breaks = seq(0, 10, 1), labels = scales::comma)

## Warning: Removed 300 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?

Insights Gathered So Far:

The summary so far focuses on the structure of the dataset and an initial exploration into how genres and IMDb ratings interact. The two visualizations (IMDb by genre and popularity vs runtime) provide early insights into possible relationships worth investigating.

My first hypothesis looks at whether highly rated titles also tend to be popular, and my second hypothesis explores whether some genres are more highly rated than others.

These initial findings will guide my further analysis, and my next steps will be testing the hypotheses more thoroughly, cleaning up any missing data, and exploring any further patterns that emerge.

This should give a strong foundation to build on in the meeting, and I’m looking forward to discussing these insights in more detail.