Setup and Data Loading

1 Phase 1: Data Preparation and Cleaning

1.1 Loading Libraries

library(tidyverse)
library(ggplot2)
library(dplyr)
library(stringr)
library(rpart)

1.2 Question 1: Loading the Data

netflix_data <- read.csv("netflix_titles.csv")

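Interpretation: read.csv() imports netflix_titles.csv into an R data frame. Blank fields in text columns are read as empty strings rather than NA, which matters for the missing-value check in Question 3.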

1.3 Question 2: Structure of the dataset.

str(netflix_data)
## 'data.frame':    8807 obs. of  12 variables:
##  $ show_id     : chr  "s1" "s2" "s3" "s4" ...
##  $ type        : chr  "Movie" "TV Show" "TV Show" "TV Show" ...
##  $ title       : chr  "Dick Johnson Is Dead" "Blood & Water" "Ganglands" "Jailbirds New Orleans" ...
##  $ director    : chr  "Kirsten Johnson" "" "Julien Leclercq" "" ...
##  $ cast        : chr  "" "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile "| __truncated__ "Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, G"| __truncated__ "" ...
##  $ country     : chr  "United States" "South Africa" "" "" ...
##  $ date_added  : chr  "September 25, 2021" "September 24, 2021" "September 24, 2021" "September 24, 2021" ...
##  $ release_year: int  2020 2021 2021 2021 2021 2021 2021 1993 2021 2021 ...
##  $ rating      : chr  "PG-13" "TV-MA" "TV-MA" "TV-MA" ...
##  $ duration    : chr  "90 min" "2 Seasons" "1 Season" "1 Season" ...
##  $ listed_in   : chr  "Documentaries" "International TV Shows, TV Dramas, TV Mysteries" "Crime TV Shows, International TV Shows, TV Action & Adventure" "Docuseries, Reality TV" ...
##  $ description : chr  "As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical wa"| __truncated__ "After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is h"| __truncated__ "To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled "| __truncated__ "Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Or"| __truncated__ ...
summary(netflix_data)
##    show_id              type              title             director        
##  Length:8807        Length:8807        Length:8807        Length:8807       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      cast             country           date_added         release_year 
##  Length:8807        Length:8807        Length:8807        Min.   :1925  
##  Class :character   Class :character   Class :character   1st Qu.:2013  
##  Mode  :character   Mode  :character   Mode  :character   Median :2017  
##                                                           Mean   :2014  
##                                                           3rd Qu.:2019  
##                                                           Max.   :2021  
##     rating            duration          listed_in         description       
##  Length:8807        Length:8807        Length:8807        Length:8807       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
## 

Interpretation: The dataset contains 8,807 Netflix titles (rows) described by 12 columns. release_year is the only numeric column; the rest are character.

1.4 Question 3: Handling Missing Values

colSums(is.na(netflix_data) | netflix_data == "")
##      show_id         type        title     director         cast      country 
##            0            0            0         2634          825          831 
##   date_added release_year       rating     duration    listed_in  description 
##           10            0            4            3            0            0

Interpretation: Counts the NA or empty-string values in each column (most common in director, cast, and country).

Overall Status: Of the 12 columns, 6 contain missing or empty values.

Major Gaps: director has the most missing values (2,634), followed by country (831) and cast (825). Production details are therefore unavailable for a sizeable share of titles.

Minor Gaps: date_added (10), rating (4), and duration (3) have very few missing values and can be cleaned easily.

Clean Columns: show_id, type, title, release_year, listed_in, and description are 100% complete and ready for analysis.
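To put these counts in proportion, they can be expressed as a percentage of the 8,807 rows. The snippet below is a small sketch that hard-codes the counts from the output above rather than recomputing them from the file; the variable names are illustrative, not part of the original analysis.

```r
# Missing/empty counts copied from the colSums() output above
missing_counts <- c(director = 2634, cast = 825, country = 831,
                    date_added = 10, rating = 4, duration = 3)
n_rows <- 8807  # total rows reported by str()

# Share of rows affected, as a percentage rounded to two decimals
missing_pct <- round(100 * missing_counts / n_rows, 2)
print(missing_pct)
```

On these figures roughly 30% of titles have no director listed, while cast and country are each missing for a little over 9%.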

1.5 Question 4: Standardizing Dates

netflix_data$date_added <- as.Date(trimws(netflix_data$date_added), format = "%B %d, %Y")

Interpretation: Converts the character date into a formal R Date object for time-series analysis.
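The behavior of this conversion is easiest to see on a few toy values. The sketch below assumes month names are in English (the "%B" format is locale-dependent, so the time locale is set to "C" first); the input strings are made up for illustration.

```r
# Use English month names regardless of the system locale
invisible(Sys.setlocale("LC_TIME", "C"))

# Toy input: a clean date, one with leading whitespace, and an empty string
raw_dates <- c("September 25, 2021", " August 4, 2017", "")

# trimws() removes stray spaces; the format must match "Month day, year"
parsed <- as.Date(trimws(raw_dates), format = "%B %d, %Y")
print(parsed)

# Values that cannot be parsed (here the empty string) become NA
n_unparsed <- sum(is.na(parsed))
print(n_unparsed)
```

Checking the NA count after conversion is a quick way to confirm that only the rows with a missing date_added failed to parse.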

1.6 Question 5: Extracting Year and Month

netflix_data$year_added <- as.numeric(format(netflix_data$date_added, "%Y"))
netflix_data$month_added <- format(netflix_data$date_added, "%B")

Interpretation: Breaks down the date to see which months or years Netflix adds the most content.

1.7 Question 6: Cleaning the ‘Duration’ Column

netflix_data$duration_val <- as.numeric(gsub("([0-9]+).*", "\\1", netflix_data$duration))

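Interpretation: Extracts the number from “90 min” or “2 Seasons” so we can perform math on it. The resulting value is in minutes for Movies and in seasons for TV Shows, so the two should not be compared directly.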
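A short sketch of the extraction on toy values follows. The duration_unit column is a hypothetical addition, not part of the original analysis, that records which unit each number refers to.

```r
# Toy values mirroring the formats seen in the duration column
duration <- c("90 min", "2 Seasons", "1 Season", "")

# Keep the leading digits; an empty string becomes NA
duration_val <- as.numeric(gsub("([0-9]+).*", "\\1", duration))

# Hypothetical helper column: record the unit so minutes and seasons are not mixed
duration_unit <- ifelse(grepl("min", duration, fixed = TRUE), "minutes",
                 ifelse(grepl("Season", duration, fixed = TRUE), "seasons", NA))

print(data.frame(duration, duration_val, duration_unit))
```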

2 Phase 2: Exploratory Data Analysis (EDA)

2.1 Question 7: Movie vs. TV Show Distribution

table(netflix_data$type)
## 
##   Movie TV Show 
##    6131    2676

Interpretation: A simple count shows that the library is dominated by Movies (6,131), which outnumber TV Shows (2,676) by more than two to one.
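The same counts can be expressed as shares of the catalog. This sketch hard-codes the two counts from the output above; with the full data frame, prop.table(table(netflix_data$type)) would presumably give the same proportions.

```r
# Counts copied from the table() output above
type_counts <- c(Movie = 6131, "TV Show" = 2676)

# Convert counts to percentages of the whole catalog
type_share <- round(100 * prop.table(type_counts), 1)
print(type_share)
```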

2.2 Question 8: Plotting a Bar Chart (“Movies” vs “TV Shows”)

type_counts <- as.data.frame(table(netflix_data$type))
colnames(type_counts) <- c("Type", "Count")

# Plot the bar chart
ggplot(type_counts, aes(x = Type, y = Count, fill = Type)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("Movie" = "red", "TV Show" = "black")) +
  theme_minimal() +
  labs(title = "Distribution of Netflix Content",
       x = "Content Type",
       y = "Total Count")

Interpretation: Compares the total number of Movies and TV Shows side by side.

2.3 Question 9: Content Growth Over Time

# summarize dataset
release_counts <- as.data.frame(table(netflix_data$release_year))
colnames(release_counts) <- c("Year", "Count")
release_counts$Year <- as.numeric(as.character(release_counts$Year))

# Plot the number of titles per release year
ggplot(release_counts, aes(x = Year, y = Count)) +
  geom_line(color = "red", linewidth = 1) +  # linewidth replaces the size argument deprecated in ggplot2 3.4.0
  geom_area(fill = "red", alpha = 0.2) +
  theme_minimal() +
  labs(title = "Netflix Content Release Trend",
       subtitle = "Visualizing the growth of titles over decades",
       x = "Release Year",
       y = "Number of Titles")

Interpretation: Plots the number of titles by release year to show the balance of “New” vs. “Classic” content. Note that this tracks when titles were originally released, not when Netflix added them.

2.4 Question 10: Top 10 Countries

head(sort(table(netflix_data$country), decreasing = TRUE), 10)
## 
##  United States          India                United Kingdom          Japan 
##           2818            972            831            419            245 
##    South Korea         Canada          Spain         France         Mexico 
##            199            181            145            124            110

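Interpretation: Identifies which countries produce the most content available on Netflix. The unlabeled third entry (831) is the count of titles whose country field is empty. Because table() is applied to the raw string, co-productions listed as several comma-separated countries are counted as separate values rather than credited to each country.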

2.5 Question 11: Rating Distribution

# summarize data
rating_data <- as.data.frame(table(netflix_data$rating))
colnames(rating_data) <- c("Rating", "Count")

# Plot an ordered bar chart
ggplot(rating_data, aes(x = reorder(Rating, -Count), y = Count, fill = Rating)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Netflix Content Ratings Distribution",
       x = "Content Rating",
       y = "Number of Titles") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate the x-axis labels

Objective: This visualization represents a categorical distribution of the rating column to understand the target audience of Netflix.

Action: We converted the frequency table into a data frame and used ggplot2 to create an ordered bar chart.

Key Insight: The chart highlights which age-rating categories (like TV-MA or TV-14) dominate the platform.

Formatting: We used reorder() to sort the bars from highest to lowest count, making the graph easier to read for stakeholders.

2.6 Question 12: Rating vs Type Analysis (Grouped Bar Chart)

ggplot(netflix_data, aes(x = rating, fill = type)) +
  geom_bar(position = "dodge") +
  theme_minimal() +
  labs(title = "Rating Distribution by Content Type")

Interpretation: This grouped bar chart is a bivariate analysis showing how content ratings are distributed across Movies and TV Shows.

2.7 Question 13: Duration Analysis for Movies

# Histogram of movie runtimes in 10-minute bins
netflix_data %>%
  filter(type == "Movie" & !is.na(duration_val)) %>%
  ggplot(aes(x = duration_val)) +
  geom_histogram(binwidth = 10, fill = "red", color = "white") +
  theme_minimal() +
  labs(title = "Distribution of Movie Durations",
       x = "Duration (Minutes)",
       y = "Frequency")

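Interpretation: Shows how movie runtimes are distributed in 10-minute bins. The histogram reveals the typical length and spread of Netflix movies but does not by itself report the average.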

2.8 Question 14: TV Show Season Analysis

netflix_data %>%
  filter(type == "TV Show") %>%
  summarise(Average_Seasons = mean(duration_val, na.rm = TRUE))
##   Average_Seasons
## 1        1.764948

Interpretation: This approach uses the dplyr library to create a clean data pipeline. We first filter for ‘TV Shows’ and then use the summarise function to calculate the average number of seasons, which comes to about 1.76.

2.9 Question 15: Genre Frequency (First Genre Listed)

genre_analysis <- netflix_data %>%
  mutate(primary_genre = str_split_i(listed_in, ",", 1)) %>% 
  group_by(primary_genre) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count)) 

# Result
print(genre_analysis)
## # A tibble: 36 × 2
##    primary_genre            Count
##    <chr>                    <int>
##  1 Dramas                    1600
##  2 Comedies                  1210
##  3 Action & Adventure         859
##  4 Documentaries              829
##  5 International TV Shows     774
##  6 Children & Family Movies   605
##  7 Crime TV Shows             399
##  8 Kids' TV                   388
##  9 Stand-Up Comedy            334
## 10 Horror Movies              275
## # ℹ 26 more rows

Interpretation: Extracts the first genre listed for each title and counts how often each appears. Dramas (1,600) and Comedies (1,210) are the most common primary genres.

2.10 Question 16: Correlation: Release Year vs Date Added

netflix_data %>%
  mutate(
    release_year = as.numeric(as.character(release_year)),
    year_added = as.numeric(as.character(year_added))
  ) %>%
  filter(!is.na(release_year) & !is.na(year_added)) %>%
  summarise(correlation = cor(release_year, year_added))
##   correlation
## 1    0.111531

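Interpretation: Checks whether Netflix is focusing on adding newer releases or older library titles by measuring the linear association between release year and year added. The coefficient of about 0.11 is a weak positive correlation, so release year says little about when a title joins the catalog.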

2.11 Question 17: Analyzing Content Added by Month

month_summary <- netflix_data %>%
  group_by(month_added) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count)) 
# Result
print(month_summary)
## # A tibble: 13 × 2
##    month_added Count
##    <chr>       <int>
##  1 July          827
##  2 December      813
##  3 September     770
##  4 April         764
##  5 October       760
##  6 August        755
##  7 March         742
##  8 January       738
##  9 June          728
## 10 November      705
## 11 May           632
## 12 February      563
## 13 <NA>           10

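Interpretation: Looks for a “seasonal” trend in when Netflix drops new content. July (827) and December (813) are the busiest months and February (563) is the quietest. The <NA> row holds the 10 titles with no date_added.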
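Sorting by count hides the calendar order, which is what matters when looking for seasonality. The sketch below hard-codes the counts from the output above (omitting the NA row) and reorders them chronologically using R's built-in month.name constant.

```r
# Counts copied from the month_summary output above (NA row omitted)
month_counts <- c(July = 827, December = 813, September = 770, April = 764,
                  October = 760, August = 755, March = 742, January = 738,
                  June = 728, November = 705, May = 632, February = 563)

# Reorder from January to December by indexing with the built-in month names
chronological <- month_counts[month.name]
print(chronological)
```

In the full pipeline, a similar effect could be had by converting month_added to a factor with levels = month.name before grouping, so that tables and plots follow calendar order.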

2.12 Question 18: Most Prolific Directors

top_directors <- netflix_data %>%
  filter(director != "") %>%             
  group_by(director) %>%                 
  summarise(Total_Content = n()) %>%     
  arrange(desc(Total_Content)) %>%       
  head(10)                               

# Result
print(top_directors)
## # A tibble: 10 × 2
##    director               Total_Content
##    <chr>                          <int>
##  1 Rajiv Chilaka                     19
##  2 Raúl Campos, Jan Suter            18
##  3 Marcus Raboy                      16
##  4 Suhas Kadav                       16
##  5 Jay Karas                         14
##  6 Cathy Garcia-Molina               13
##  7 Jay Chapman                       12
##  8 Martin Scorsese                   12
##  9 Youssef Chahine                   12
## 10 Steven Spielberg                  11

Interpretation: Lists directors with the most titles on the platform.

2.13 Question 19: Identifying “Oldies”

netflix_data[netflix_data$release_year < 1960, c("title", "release_year")]
##                                                 title release_year
## 543                                             Ujala         1959
## 1332              Five Came Back: The Reference Films         1945
## 1700                                  White Christmas         1954
## 2369                                    Cairo Station         1958
## 2370                                      Dark Waters         1956
## 2376                                  The Blazing Sun         1954
## 4251                Pioneers: First Women Filmmakers*         1925
## 6432                            Cat on a Hot Tin Roof         1958
## 6785                                 Forbidden Planet         1956
## 6854                                             Gigi         1958
## 7220                          Know Your Enemy - Japan         1945
## 7295                               Let There Be Light         1946
## 7576                         Nazi Concentration Camps         1945
## 7744              Pioneers of African-American Cinema         1946
## 7791                                   Prelude to War         1942
## 7840                            Rebel Without a Cause         1955
## 7931                                       San Pietro         1945
## 7955                              Scandal in Sorrento         1955
## 8206                             The Battle of Midway         1942
## 8420 The Memphis Belle: A Story of a\nFlying Fortress         1944
## 8437                                The Negro Soldier         1944
## 8507                                The Sign of Venus         1955
## 8588                                      Thunderbolt         1947
## 8641                                 Tunisian Victory         1944
## 8661    Undercover: How to Operate Behind Enemy Lines         1943
## 8740               Why We Fight: The Battle of Russia         1943
## 8764                  WWII: Report from the Aleutians         1943

Interpretation: Filters for classic content released before 1960.

2.14 Question 20: Title Length Analysis

netflix_data$title_length <- nchar(netflix_data$title)
mean(netflix_data$title_length)
## [1] 17.72579

Interpretation: Computes the average character count of titles, which is about 17.7.

3 Phase 3: Advanced Analysis & Machine Learning

3.1 Question 22: Feature Engineering: Gap Years

netflix_data$gap_years <- netflix_data$year_added - netflix_data$release_year

Interpretation: Calculates the number of years between a title's release year and the year it was added to Netflix. Titles with a missing date_added get NA.

3.2 Question 23: Top Cast Members

head(sort(table(netflix_data$cast), decreasing = TRUE), 10)
## 
##                                                                                                                        
##                                                                                                                    825 
##                                                                                                     David Attenborough 
##                                                                                                                     19 
##                                Vatsal Dubey, Julie Tejwani, Rupa Bhimani, Jigna Bhardwaj, Rajesh Kava, Mousam, Swapnil 
##                                                                                                                     14 
##                                                                                                            Samuel West 
##                                                                                                                     10 
##                                                                                                            Jeff Dunham 
##                                                                                                                      7 
##                                                                                                          Craig Sechler 
##                                                                                                                      6 
##                                                                           David Spade, London Hughes, Fortune Feimster 
##                                                                                                                      6 
##                                                                                                             Kevin Hart 
##                                                                                                                      6 
## Michela Luci, Jamie Watson, Eric Peterson, Anna Claire Bartlam, Nicolas Aqui, Cory Doran, Julie Lemieux, Derek McGrath 
##                                                                                                                      6 
##                                                                                                              Bill Burr 
##                                                                                                                      5

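Interpretation: Because table() is applied to the raw cast string, this counts identical cast lists rather than individual actors, and the top entry (825) is the empty string for titles with no cast listed. Identifying the most frequently appearing actors would require splitting each list on commas first.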
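One way to count individual names is to split each comma-separated string before tabulating. The following is a minimal sketch in base R; count_split_values is a hypothetical helper, and the cast strings are made-up placeholders rather than rows from the dataset.

```r
# Hypothetical helper: split comma-separated strings and count each value
count_split_values <- function(x, sep = ",") {
  parts <- trimws(unlist(strsplit(x, sep, fixed = TRUE)))
  parts <- parts[!is.na(parts) & parts != ""]  # drop blanks and NAs
  sort(table(parts), decreasing = TRUE)
}

# Toy cast column with placeholder names and one empty entry
toy_cast <- c("Actor A", "Actor A, Actor B", "", "Actor C, Actor A")

actor_counts <- count_split_values(toy_cast)
print(head(actor_counts, 10))
```

The same helper could be applied to the country column, where co-productions are also stored as comma-separated strings.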

3.3 Question 24: Identifying Kids Content

netflix_data$for_kids <- ifelse(netflix_data$rating %in% c("G", "TV-G", "TV-Y"), 1, 0)

Interpretation: Creates a binary flag (1 or 0) marking family-friendly content, defined here as titles rated G, TV-G, or TV-Y.

3.4 Question 25: Subsetting for Modeling

top_directors <- netflix_data %>%
  filter(director != "") %>%             
  group_by(director) %>%                 
  summarise(Total_Content = n()) %>%     
  arrange(desc(Total_Content)) %>%       
  head(10)                               

# Result
print(top_directors)
## # A tibble: 10 × 2
##    director               Total_Content
##    <chr>                          <int>
##  1 Rajiv Chilaka                     19
##  2 Raúl Campos, Jan Suter            18
##  3 Marcus Raboy                      16
##  4 Suhas Kadav                       16
##  5 Jay Karas                         14
##  6 Cathy Garcia-Molina               13
##  7 Jay Chapman                       12
##  8 Martin Scorsese                   12
##  9 Youssef Chahine                   12
## 10 Steven Spielberg                  11

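Interpretation: The code above repeats the director ranking from Question 18; it does not itself build a subset for the training model. The notes that follow describe that ranking.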

Objective: This analysis identifies the Top 10 most prolific directors on Netflix based on the total number of titles they have directed.

Data Cleaning: We first filtered out records where the director’s name was missing or blank to ensure the accuracy of the ranking.

Methodology: By using the group_by() and summarise() functions, we performed a frequency count of each director’s contributions.

Result: The final list is sorted in descending order to highlight the directors with the highest volume of content, providing insight into Netflix’s most frequent creative collaborators.
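For the modeling subset named in this section's heading, a minimal sketch might look like the following. It is an assumption about what such a step could contain, not the author's code: it keeps only the columns used later by the decision tree and drops rows with missing values, shown here on a toy data frame.

```r
# Toy stand-in for netflix_data with the columns used by the model
toy <- data.frame(
  type         = c("Movie", "TV Show", "Movie"),
  release_year = c(2020, 2021, 1993),
  duration_val = c(90, 2, NA),
  title        = c("A", "B", "C")
)

# Hypothetical choice of modeling columns (label plus two predictors)
model_cols <- c("type", "release_year", "duration_val")

# Keep complete rows only, restricted to the modeling columns
model_subset <- toy[complete.cases(toy[, model_cols]), model_cols]
print(model_subset)
```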

3.5 Question 26: Encoding Categorical Variables

model_data <- netflix_data


model_data <- model_data %>%
  mutate(
    type = as.factor(type),
    rating = as.factor(rating)
  )


str(model_data$type)
##  Factor w/ 2 levels "Movie","TV Show": 1 2 2 2 2 2 1 1 2 1 ...

Interpretation: Converts characters to factors so R models can interpret them as categories.

3.6 Question 27: Splitting the Data into Training and Test Sets

train_idx <- sample(1:nrow(model_data), 0.8 * nrow(model_data))
train_set <- model_data[train_idx, ]
test_set <- model_data[-train_idx, ]

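Interpretation: Splits data into 80% for training the model and 20% for testing its accuracy. Because no random seed is set, the rows chosen, and therefore the results, will vary from run to run.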
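A reproducible version of the split can be sketched by fixing the random seed first. The seed value 42 is an arbitrary assumption, and the example works on row indices only, so it runs without the dataset.

```r
set.seed(42)   # arbitrary seed (an assumption) so the split can be repeated
n <- 8807      # number of rows reported by str()

# Draw 80% of the row indices for training; the rest form the test set
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

print(length(train_idx))
print(length(test_idx))
```

With 8,807 rows this yields 7,045 training rows and 1,762 test rows, which matches the total of the confusion matrix reported in Question 31.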

3.7 Question 28: Training a Decision Tree Model

# Model Training step
fit <- rpart(type ~ release_year + duration_val, 
             data = train_set, 
             method = "class")

Data Partitioning: The dataset was partitioned into an 80% training set and a 20% testing set (Question 27) to ensure unbiased model evaluation.

Model Induction: The training phase involved using the rpart algorithm to build a decision tree. The model ‘learned’ the relationship between features like release_year and duration_val to classify titles as Movies or TV Shows.

Pattern Recognition: During training, the algorithm identifies split points in the data (nodes) that best separate the classes based on statistical purity.

Validation: Post-training, the model’s predictive performance was validated using the unseen testing set to check for overfitting or underfitting.
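The idea of choosing split points by purity can be illustrated with the Gini impurity, the default splitting criterion rpart uses for classification. The sketch below is a simplified illustration on toy data, not rpart's actual implementation (which also handles priors, surrogate splits, and pruning); the function names are hypothetical.

```r
# Gini impurity of a set of class labels: 1 - sum of squared class shares
gini <- function(labels) {
  if (length(labels) == 0) return(0)
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Reduction in impurity from splitting x at a threshold
gini_gain <- function(x, y, threshold) {
  left  <- y[x < threshold]
  right <- y[x >= threshold]
  weighted <- (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  gini(y) - weighted
}

# Toy data: movies measured in minutes, TV shows in seasons
duration_val <- c(90, 120, 100, 1, 2, 3)
type <- c("Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show")

gain_clean <- gini_gain(duration_val, type, threshold = 10)  # separates the classes perfectly
gain_mixed <- gini_gain(duration_val, type, threshold = 95)  # leaves one side mixed
print(c(clean = gain_clean, mixed = gain_mixed))
```

The threshold that separates the two classes completely gives the larger impurity reduction, which is the kind of split a tree would prefer.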

3.8 Question 29: Visualizing the Model

plot(fit); text(fit)

Interpretation: Displays the decision rules the model learned.

3.9 Question 30: Making Predictions

predictions <- predict(fit, test_set, type = "class")

Interpretation: Uses the model on unseen data to see how well it performs.

3.10 Question 31: Evaluating Accuracy (Confusion Matrix)

table(predictions, test_set$type)
##            
## predictions Movie TV Show
##     Movie    1237       3
##     TV Show     1     521

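Interpretation: Compares predicted values vs actual values to calculate the final model accuracy. Of the 1,762 test rows, 1,758 are classified correctly (1,237 Movies and 521 TV Shows), an accuracy of about 99.8%. This near-perfect score is expected, because duration_val is measured in minutes for Movies and in seasons for TV Shows, so it almost directly encodes the label.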
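The accuracy can be computed directly from the confusion matrix. This sketch rebuilds the matrix from the four counts shown above instead of rerunning the model; the object names are illustrative.

```r
# Confusion matrix rebuilt from the output above (rows = predicted, columns = actual)
cm <- matrix(c(1237, 1, 3, 521), nrow = 2,
             dimnames = list(predicted = c("Movie", "TV Show"),
                             actual    = c("Movie", "TV Show")))

# Accuracy: correctly classified rows (the diagonal) over all test rows
accuracy <- sum(diag(cm)) / sum(cm)

# Recall for TV Shows: correctly predicted TV Shows over all actual TV Shows
tv_recall <- cm["TV Show", "TV Show"] / sum(cm[, "TV Show"])

print(round(c(accuracy = accuracy, tv_recall = tv_recall), 4))
```

In the original pipeline the same calculation should work on the table returned by table(predictions, test_set$type).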