1. Load and Preview the Data

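# Load the packages used later in the report (%>%, group_by, ggplot)
library(dplyr)
library(ggplot2)

# Load the Netflix dataset
netflix_data <- read.csv("~/Netflix_dataset.csv")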

# Preview the dataset
head(netflix_data)
##         id                               title  type
## 1 ts300399 Five Came Back: The Reference Films  SHOW
## 2  tm84618                         Taxi Driver MOVIE
## 3 tm127384     Monty Python and the Holy Grail MOVIE
## 4  tm70993                       Life of Brian MOVIE
## 5 tm190788                        The Exorcist MOVIE
## 6  ts22164        Monty Python's Flying Circus  SHOW
##                                                                                                                                                                                                                                                                                                                                                                                                                                                          description
## 1                                                                                                                                                                                                                                                                                                            This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discussed in the docuseries "Five Came Back."
## 2                                                                                                                                                                                                                                A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived decadence and sleaze feed his urge for violent action, attempting to save a preadolescent prostitute in the process.
## 3                                    King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Robin the Not-Quite-So-Brave-As-Sir-Lancelot and Sir Galahad the Pure. On the way, Arthur battles the Black Knight who, despite having had all his limbs chopped off, insists he can still fight. They reach Camelot, but Arthur decides not  to enter, as "it is a silly place".
## 4 Brian Cohen is an average young Jewish man, but through a series of ridiculous events, he gains a reputation as the Messiah. When he's not dodging his followers or being scolded by his shrill mother, the hapless Brian has to contend with the pompous Pontius Pilate and acronym-obsessed members of a separatist movement. Rife with Monty Python's signature absurdity, the tale finds Brian's life paralleling Biblical lore, albeit with many more laughs.
## 5                                                                                                                12-year-old Regan MacNeil begins to adapt an explicit new personality as strange events befall the local area of Georgetown. Her mother becomes torn between science and superstition in a desperate bid to save her daughter, and ultimately turns to her last hope: Father Damien Karras, a troubled priest who is struggling with his own faith.
## 6                                                                                                                                                                                                                                                                                             A British sketch comedy series with the shows being composed of surreality, risqué or innuendo-laden humour, sight gags and observational sketches without punchlines.
##   release_year age_certification runtime                 genres
## 1         1945             TV-MA      48      ['documentation']
## 2         1976                 R     113     ['crime', 'drama']
## 3         1975                PG      91  ['comedy', 'fantasy']
## 4         1979                 R      94             ['comedy']
## 5         1973                 R     133             ['horror']
## 6         1969             TV-14      30 ['comedy', 'european']
##   production_countries seasons   imdb_id imdb_score imdb_votes tmdb_popularity
## 1               ['US']       1                   NA         NA           0.600
## 2               ['US']      NA tt0075314        8.3     795222          27.612
## 3               ['GB']      NA tt0071853        8.2     530877          18.216
## 4               ['GB']      NA tt0079470        8.0     392419          17.505
## 5               ['US']      NA tt0070047        8.1     391942          95.337
## 6               ['GB']       4 tt0063929        8.8      72895          12.919
##   tmdb_score
## 1         NA
## 2        8.2
## 3        7.8
## 4        7.8
## 5        7.7
## 6        8.3
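
Row 1 of the preview has an empty imdb_id rather than NA, which suggests that blank text fields are read in as empty strings. The following is a minimal sketch of one way to handle this, assuming blanks should count as missing values. The inline CSV and its rows are made up for illustration and only stand in for the real file.

```r
# Toy CSV standing in for Netflix_dataset.csv (hypothetical rows)
csv_text <- "id,title,imdb_id,age_certification
ts1,Show A,,TV-MA
tm2,Movie B,tt0000002,
tm3,Movie C,tt0000003,R"

# Default behaviour: blank character fields become empty strings
raw <- read.csv(text = csv_text)

# Treat blank fields as missing values as well
clean <- read.csv(text = csv_text, na.strings = c("", "NA"))

colSums(raw == "")     # blanks per column with the default read
colSums(is.na(clean))  # missing values per column with na.strings set
```

With na.strings set this way, functions such as table() and summary() report the blanks as NA instead of as an unlabeled category.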


2. Numeric Summary of Two Columns

# Summary statistics for runtime and imdb_score
summary(netflix_data$runtime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   44.00   84.00   77.64  105.00  251.00
summary(netflix_data$imdb_score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.500   5.800   6.600   6.533   7.400   9.600     523
# Quantiles for runtime
quantile(netflix_data$runtime, probs = seq(0, 1, 0.25))
##   0%  25%  50%  75% 100% 
##    0   44   84  105  251
# Quantiles for imdb_score
quantile(netflix_data$imdb_score, probs = seq(0, 1, 0.25), na.rm = TRUE)
##   0%  25%  50%  75% 100% 
##  1.5  5.8  6.6  7.4  9.6

Explanation:

We summarize two key numeric columns:

  • runtime: The length of a title in minutes.

  • imdb_score: The title's rating out of 10 on the IMDb website.

  • Summary Statistics:

    • For runtime, values range from 0 to 251 minutes, with a median of 84 and a mean of 77.64. The mean sits below the median, which indicates that a large group of short titles pulls the average down.

    • For imdb_score, 523 titles have no score (NA). summary() reports that count and leaves those rows out, whereas quantile() needs na.rm = TRUE to do the same. Scores range from 1.5 to 9.6, and the middle half falls between 5.8 and 7.4.

  • Key Insights:

    • Runtime varies widely between short and long content. The minimum of 0 minutes is not a plausible length, so those entries are worth checking before further analysis.

    • IMDb scores are concentrated around the median of 6.6: half of all scored titles lie within a 1.6-point band.
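
The data-quality points above (missing scores and zero runtimes) can be counted directly. This is a small sketch on a made-up data frame named toy, which only mimics the two columns; the values are illustrative, not taken from the dataset.

```r
# Toy stand-in for netflix_data (hypothetical values)
toy <- data.frame(
  runtime    = c(0, 44, 84, 105, 251, 0),
  imdb_score = c(NA, 5.8, 6.6, 7.4, 9.6, NA)
)

# How many scores are missing, and how many runtimes are zero?
n_missing_score <- sum(is.na(toy$imdb_score))
n_zero_runtime  <- sum(toy$runtime == 0)

# Interquartile range of the scores, ignoring missing values
score_iqr <- IQR(toy$imdb_score, na.rm = TRUE)

n_missing_score
n_zero_runtime
score_iqr
```

Running the same three expressions on netflix_data would show how many rows each issue affects before deciding whether to filter them.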


3. Categorical Column Summary

# Count titles per 'type' (MOVIE, SHOW)
table(netflix_data$type)
## 
## MOVIE  SHOW 
##  3759  2047
# Count titles per 'age_certification' (the unlabeled first column is the empty string)
table(netflix_data$age_certification)
## 
##           G NC-17    PG PG-13     R TV-14  TV-G TV-MA TV-PG  TV-Y TV-Y7 
##  2610   131    14   246   440   575   470    76   841   186   105   112
# 10 most frequent values of 'genres' (each value is a full genre list)
head(sort(table(netflix_data$genres), decreasing = TRUE), 10)
## 
##                  ['comedy']                   ['drama'] 
##                         510                         350 
##           ['documentation']         ['comedy', 'drama'] 
##                         320                         141 
##         ['drama', 'comedy']                 ['reality'] 
##                         128                         120 
##        ['drama', 'romance'] ['comedy', 'documentation'] 
##                         112                          93 
##               ['animation']                          [] 
##                          69                          68

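Explanation:

Here we tabulate the categorical columns:

  • type: The dataset contains 3759 movies and 2047 shows.

  • age_certification: The number of titles per age certification (e.g., TV-MA, PG-13). The unlabeled first column is the empty string: 2610 titles have no certification recorded.

  • genres: The 10 most frequent values of the genres column. Each value is a whole genre list stored as text, so the counts are per combination, not per individual genre. For example, ['comedy', 'drama'] (141) and ['drama', 'comedy'] (128) are counted separately.

  • Key Insights:

    • Movies outnumber shows by roughly 65% to 35%, so the dataset is not balanced between the two types.

    • About 45% of titles (2610 of 5806) lack an age certification, which limits any target-audience analysis. Among certified titles, TV-MA is the most common (841), followed by R (575).

    • Comedy, drama, and documentation (documentary) dominate the most frequent genre lists, and 68 titles have an empty genre list ([]).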
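
Because each genres value is a text-encoded list, counting individual genres requires splitting the strings first. A minimal base R sketch follows, assuming the format seen in the output above (square brackets, single quotes, comma-plus-space separators). The genres vector here is a made-up stand-in for netflix_data$genres.

```r
# Toy stand-in for netflix_data$genres (hypothetical values)
genres <- c("['comedy', 'drama']", "['drama', 'comedy']", "['comedy']", "[]")

# Strip brackets and quotes, then split on the ", " separator
cleaned   <- gsub("\\[|\\]|'", "", genres)
per_title <- strsplit(cleaned, ", ", fixed = TRUE)

# Count every individual genre across all titles
genre_counts <- sort(table(unlist(per_title)), decreasing = TRUE)
genre_counts
```

With this approach the two orderings of comedy and drama contribute to the same per-genre totals instead of appearing as separate categories.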

4. Novel Questions to Investigate

  1. Does the average runtime differ between movies and TV shows?
  2. What is the most common genre among Netflix’s top-rated content?
  3. How has the distribution of genres changed over time?

To address the first question, I use dplyr's group_by() and summarize() functions to compare the average runtime of movies and TV shows.

# Group by type and calculate average runtime
avg_runtime <- netflix_data %>%
  group_by(type) %>%
  summarize(average_runtime = mean(runtime, na.rm = TRUE))
avg_runtime
## # A tibble: 2 × 2
##   type  average_runtime
##   <chr>           <dbl>
## 1 MOVIE            98.8
## 2 SHOW             38.8

Explanation:

We group the data by type (SHOW or MOVIE) and calculate the mean runtime for each group, ignoring missing values.

  • Key Insights:

    • Movies average 98.8 minutes and shows average 38.8 minutes, a gap of 60 minutes. For shows, the runtime presumably reflects episode length rather than the length of a whole series, which would explain the shorter figure.
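
A difference in group means can also be checked formally. The sketch below runs a Welch two-sample t-test with base R's t.test(); this test is not part of the original analysis and is shown only as one possible follow-up. The data frame toy and its values are made up for illustration.

```r
# Toy stand-in for netflix_data (hypothetical runtimes in minutes)
toy <- data.frame(
  type    = rep(c("MOVIE", "SHOW"), each = 5),
  runtime = c(95, 110, 88, 120, 102, 30, 45, 25, 50, 40)
)

# Mean runtime per type, as in the grouped summary
group_means <- tapply(toy$runtime, toy$type, mean)

# Welch two-sample t-test (unequal variances is the t.test default)
test <- t.test(runtime ~ type, data = toy)

group_means
test$p.value
```

A small p-value would indicate that the gap between the two means is unlikely to be due to chance alone, under the usual assumptions of the test.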


5. Visual Summaries of Columns

5.1. Distribution of Runtime

# Histogram of runtime
ggplot(netflix_data, aes(x = runtime)) +
  geom_histogram(binwidth = 10, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Runtime", x = "Runtime (minutes)", y = "Frequency")

Explanation:

This histogram shows the distribution of runtime across all titles, using 10-minute bins (binwidth = 10).

  • Interpretation of the Plot:

    • The height of each bar is the number of titles whose runtime falls within that 10-minute range.

    • Key Insights: A peak at short runtimes may indicate a prevalence of shows, while a tail at the higher end might reflect longer movies. This would be consistent with the group averages in Section 4 (38.8 minutes for shows, 98.8 minutes for movies).
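
To see which type drives each part of the histogram, the bin counts can be tabulated per type. This is a rough base R sketch using cut() and table() on made-up data. The bin edges here start at 0, which is an assumption and may not match the default bin alignment that ggplot2 uses.

```r
# Toy stand-in for netflix_data (hypothetical runtimes in minutes)
toy <- data.frame(
  type    = c("MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW"),
  runtime = c(95, 104, 121, 24, 28, 45)
)

# 10-minute bins: [0,10), [10,20), ..., [120,130)
breaks <- seq(0, 130, by = 10)
bins   <- cut(toy$runtime, breaks = breaks, right = FALSE)

# Rows are bins, columns are types
bin_counts <- table(bins, toy$type)

# Show only the bins that contain at least one title
bin_counts[rowSums(bin_counts) > 0, ]
```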


5.2. Scatter Plot: Correlation Between IMDb and TMDB Scores

# Filter data for MOVIES only
movies_data <- netflix_data %>%
  filter(type == "MOVIE")

# Scatter plot of imdb_score vs tmdb_score
ggplot(movies_data, aes(x = imdb_score, y = tmdb_score)) +
  geom_point(color = "blue", alpha = 0.5) +
  labs(title = "IMDB Score vs TMDB Score (Movies)", x = "IMDB Score", y = "TMDB Score") +
  geom_smooth(method = "lm", col = "red")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 490 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 490 rows containing missing values or values outside the scale range
## (`geom_point()`).

Explanation:

This scatter plot shows the relationship between imdb_score and tmdb_score for movies. A linear regression line (red) is added to help visualize any trend. The warnings report that 490 movies were dropped from the plot, most likely because at least one of their two scores is missing.

  • Interpretation of the Plot:

    • Each point is one movie, positioned by its IMDb score (x) and its TMDB score (y).

    • Key Insights: If the points are tightly clustered along the red line, it suggests a strong positive correlation between the two rating systems. Points far from the line are movies on which the two sites disagree.
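
The strength of the relationship can be put into a single number with the Pearson correlation. A minimal sketch follows, assuming rows with a missing score should simply be dropped; toy_movies and its values are hypothetical.

```r
# Toy stand-in for movies_data (hypothetical scores, with some missing)
toy_movies <- data.frame(
  imdb_score = c(8.3, 8.2, 8.0, 8.1, NA, 6.0),
  tmdb_score = c(8.2, 7.8, 7.8, 7.7, 7.0, NA)
)

# Rows that have both scores present
complete  <- complete.cases(toy_movies[, c("imdb_score", "tmdb_score")])
n_dropped <- sum(!complete)

# Pearson correlation on the complete rows only
r <- cor(toy_movies$imdb_score, toy_movies$tmdb_score, use = "complete.obs")

n_dropped
r
```

Values of r close to 1 correspond to points lying close to an upward-sloping line, and values near 0 to no linear relationship.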


5.3. Bar Plot: Most Common Genres by Age Certification

# Count titles for each combination of genre list and age certification
genres_age <- netflix_data %>%
  group_by(genres, age_certification) %>%
  tally()

# Find the 10 most frequent genre lists by total count
# (slice_max() replaces the superseded top_n(); requires dplyr >= 1.0.0)
top_10_genres <- genres_age %>%
  group_by(genres) %>%
  summarise(total_count = sum(n)) %>%
  slice_max(total_count, n = 10)

# Keep only the rows of genres_age that belong to those 10 genre lists
genres_age_top10 <- genres_age %>%
  filter(genres %in% top_10_genres$genres)

# Bar plot for the top 10 genres and age certification
ggplot(genres_age_top10, aes(x = reorder(genres, -n), y = n, fill = age_certification)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Top 10 Genres by Age Certification", x = "Genres", y = "Count")

Note: With every genre list included, the bars were too crowded to read, so the plot is restricted to the 10 most frequent genre lists.

Explanation:

This bar plot shows how the 10 most frequent genre lists are distributed across age certifications (e.g., TV-MA, PG-13). The bars are color-coded by age_certification, and position = "dodge" places them side by side for comparison.

  • Interpretation of the Plot:

    • Each bar is the number of titles with a given genre list and age certification. For example, you might see that certain genres (like drama or crime) are more common in the TV-MA category. Titles with no recorded certification appear as their own blank category.

    • Key Insights: This helps in understanding the relationship between content type and target audience (e.g., adult-oriented vs family-friendly genres).
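
Raw counts favor the largest genre lists, so shares within each genre list can make the comparison easier. The following is a small base R sketch using prop.table() on a made-up data frame; the rows are illustrative only.

```r
# Toy stand-in for netflix_data (hypothetical rows)
toy <- data.frame(
  genres = c("['comedy']", "['comedy']", "['comedy']",
             "['drama']", "['drama']", "['drama']"),
  age_certification = c("PG-13", "PG-13", "TV-MA",
                        "TV-MA", "TV-MA", "PG-13")
)

# Cross-tabulate genre list against age certification
counts <- table(toy$genres, toy$age_certification)

# Share of each certification within a genre list (each row sums to 1)
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```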


6. Conclusions

Summarize the insights gathered from the data and visualizations: