Setup and Data Loading
Interpretation: Imports the CSV file into an R data frame
## 'data.frame': 8807 obs. of 12 variables:
## $ show_id : chr "s1" "s2" "s3" "s4" ...
## $ type : chr "Movie" "TV Show" "TV Show" "TV Show" ...
## $ title : chr "Dick Johnson Is Dead" "Blood & Water" "Ganglands" "Jailbirds New Orleans" ...
## $ director : chr "Kirsten Johnson" "" "Julien Leclercq" "" ...
## $ cast : chr "" "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile "| __truncated__ "Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, G"| __truncated__ "" ...
## $ country : chr "United States" "South Africa" "" "" ...
## $ date_added : chr "September 25, 2021" "September 24, 2021" "September 24, 2021" "September 24, 2021" ...
## $ release_year: int 2020 2021 2021 2021 2021 2021 2021 1993 2021 2021 ...
## $ rating : chr "PG-13" "TV-MA" "TV-MA" "TV-MA" ...
## $ duration : chr "90 min" "2 Seasons" "1 Season" "1 Season" ...
## $ listed_in : chr "Documentaries" "International TV Shows, TV Dramas, TV Mysteries" "Crime TV Shows, International TV Shows, TV Action & Adventure" "Docuseries, Reality TV" ...
## $ description : chr "As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical wa"| __truncated__ "After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is h"| __truncated__ "To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled "| __truncated__ "Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Or"| __truncated__ ...
## show_id type title director
## Length:8807 Length:8807 Length:8807 Length:8807
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## cast country date_added release_year
## Length:8807 Length:8807 Length:8807 Min. :1925
## Class :character Class :character Class :character 1st Qu.:2013
## Mode :character Mode :character Mode :character Median :2017
## Mean :2014
## 3rd Qu.:2019
## Max. :2021
## rating duration listed_in description
## Length:8807 Length:8807 Length:8807 Length:8807
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
Interpretation: This dataset contains Netflix catalog data: 8,807 records across 12 columns.
## show_id type title director cast country
## 0 0 0 2634 825 831
## date_added release_year rating duration listed_in description
## 10 0 4 3 0 0
Interpretation: Counts the missing or empty-string values in each column (common in director, cast, and country).
Overall Status: Out of 12 columns, 6 contain missing or empty values.
Major Gaps: director has the most missing values (2,634), followed by country (831) and cast (825). This suggests that production details are unavailable for a significant portion of titles.
Minor Gaps: date_added (10), rating (4), and duration (3) have very few missing values and can be easily cleaned.
Clean Columns: show_id, type, title, release_year, listed_in, and description are 100% complete and ready for analysis.
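The code that produced the counts above is not shown in the report. A minimal sketch of one way to get them, assuming both NA and empty strings count as missing; the `toy` data frame is a hypothetical stand-in for `netflix_data`:

```r
# Hypothetical three-row stand-in for netflix_data (not the real dataset)
toy <- data.frame(
  title    = c("A", "B", "C"),
  director = c("Jane Doe", "", NA),
  country  = c("India", "", "Japan"),
  stringsAsFactors = FALSE
)

# A cell is treated as missing if it is NA or an empty string;
# colSums() then counts the TRUE values in each column
missing_counts <- colSums(is.na(toy) | toy == "")
print(missing_counts)
```

Applied to the full data frame, the same expression would give one count per column, in the same layout as the output above.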
Interpretation: Converts the character date into a formal R Date object for time-series analysis.
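The conversion code is hidden in the report. The sketch below shows one plausible approach in base R, assuming the "Month day, year" format seen in `date_added`. The sample vector is hypothetical, and deriving `year_added` and `month_added` this way is an assumption about how those later columns were built:

```r
# Hypothetical sample of the date_added column; the last value is empty
date_added <- c("September 25, 2021", " August 4, 2017", "")

# Month names are parsed according to the locale, so use the C locale,
# which has English month names
invisible(Sys.setlocale("LC_TIME", "C"))

# Trim stray whitespace, then parse; unparseable or empty strings become NA
date_parsed <- as.Date(trimws(date_added), format = "%B %d, %Y")

# Derived columns with the names used later in this report
year_added  <- as.numeric(format(date_parsed, "%Y"))
month_added <- format(date_parsed, "%B")
```

Rows with an empty `date_added` end up as NA in all three vectors, which would explain an `<NA>` group in any later summary by month.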
##
## Movie TV Show
## 6131 2676
Interpretation: A simple count shows that the library is dominated by Movies (6,131) over TV Shows (2,676).
type_counts <- as.data.frame(table(netflix_data$type))
colnames(type_counts) <- c("Type", "Count")
# Bar chart of title counts by content type
ggplot(type_counts, aes(x = Type, y = Count, fill = Type)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("Movie" = "red", "TV Show" = "black")) +
  theme_minimal() +
  labs(title = "Distribution of Netflix Content",
       x = "Content Type",
       y = "Total Count")
Interpretation: Compares the total number of Movies and TV Shows side by side.
# summarize dataset
release_counts <- as.data.frame(table(netflix_data$release_year))
colnames(release_counts) <- c("Year", "Count")
release_counts$Year <- as.numeric(as.character(release_counts$Year))
# Plot the release trend as a line over a shaded area
# (`linewidth` replaces `size`, which is deprecated for lines since ggplot2 3.4.0)
ggplot(release_counts, aes(x = Year, y = Count)) +
  geom_line(color = "red", linewidth = 1) +
  geom_area(fill = "red", alpha = 0.2) +
  theme_minimal() +
  labs(title = "Netflix Content Release Trend",
       subtitle = "Visualizing the growth of titles over decades",
       x = "Release Year",
       y = "Number of Titles")
Interpretation: Visualizes the release years of titles to see
the trend of “New” vs. “Classic” content.
##
##  United States          India        (blank) United Kingdom          Japan
##           2818            972            831            419            245
##    South Korea         Canada          Spain         France         Mexico
##            199            181            145            124            110
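Interpretation: Identifies which countries produce the most content available on Netflix. The United States leads with 2,818 titles, followed by India with 972. The count of 831 belongs to titles with an empty country field, matching the missing-value count for that column.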
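The ranking code is not shown. A minimal sketch of a frequency ranking in base R follows. The `country` vector is hypothetical, and dropping empty strings before ranking is an optional step that the output above does not appear to apply:

```r
# Hypothetical sample of the country column, including one empty value
country <- c("India", "United States", "", "India",
             "United States", "United States", "Japan")

# Frequency table sorted from most to least common, keeping the top 2
top_countries <- head(sort(table(country), decreasing = TRUE), 2)
print(top_countries)

# Optional: drop empty strings first so "no country" cannot enter the ranking
top_known <- head(sort(table(country[country != ""]), decreasing = TRUE), 2)
```

Filtering before counting keeps the blank category out of the top ten, at the cost of hiding how many titles lack a country.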
# summarize data
rating_data <- as.data.frame(table(netflix_data$rating))
colnames(rating_data) <- c("Rating", "Count")
# Bar chart of ratings, ordered from most to least common
ggplot(rating_data, aes(x = reorder(Rating, -Count), y = Count, fill = Rating)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Netflix Content Ratings Distribution",
       x = "Content Rating",
       y = "Number of Titles") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate the x-axis labels for readability
Objective: This visualization shows the categorical distribution of the rating column to understand Netflix's target audience.
Action: We converted the frequency table into a data frame and used ggplot2 to create an ordered bar chart.
Key Insight: The chart highlights which age-rating categories (like TV-MA or TV-14) dominate the platform.
Formatting: We used reorder() to sort the bars from highest to lowest count, making the graph easier to read for stakeholders.
ggplot(netflix_data, aes(x = rating, fill = type)) +
  geom_bar(position = "dodge") +
  theme_minimal() +
  labs(title = "Rating Distribution by Content Type")
Interpretation: This visualization performs a bivariate analysis to show how content ratings are distributed across Movies and TV Shows.
# Histogram of movie durations in minutes
netflix_data %>%
  filter(type == "Movie", !is.na(duration_val)) %>%
  ggplot(aes(x = duration_val)) +
  geom_histogram(binwidth = 10, fill = "red", color = "white") +
  theme_minimal() +
  labs(title = "Distribution of Movie Durations",
       x = "Duration (Minutes)",
       y = "Frequency")
Interpretation: Displays the distribution of movie durations in 10-minute bins, which shows the typical length of a Netflix movie.
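The `duration_val` column used here is not defined in the visible code. One plausible derivation, offered as an assumption, is to keep the leading number of the `duration` string; the sample values below are taken from the structure output at the top of the report, plus one empty string:

```r
# Sample duration strings; the empty string stands for a missing value
duration <- c("90 min", "2 Seasons", "1 Season", "")

# Drop everything from the first space onward, then convert to numeric;
# an empty string converts to NA
duration_val <- as.numeric(sub(" .*$", "", duration))
print(duration_val)
```

Under this derivation the column mixes two units, minutes for movies and seasons for TV shows, which is why the analyses filter by `type` before using it.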
netflix_data %>%
  filter(type == "TV Show") %>%
  summarise(Average_Seasons = mean(duration_val, na.rm = TRUE))
## Average_Seasons
## 1 1.764948
Interpretation: This approach uses the dplyr library to create a clean data pipeline. We first filter for TV Shows and then use the summarise function to calculate the average number of seasons, which comes to about 1.76.
genre_analysis <- netflix_data %>%
  mutate(primary_genre = str_split_i(listed_in, ",", 1)) %>%
  group_by(primary_genre) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count))
# Result
print(genre_analysis)
## # A tibble: 36 × 2
## primary_genre Count
## <chr> <int>
## 1 Dramas 1600
## 2 Comedies 1210
## 3 Action & Adventure 859
## 4 Documentaries 829
## 5 International TV Shows 774
## 6 Children & Family Movies 605
## 7 Crime TV Shows 399
## 8 Kids' TV 388
## 9 Stand-Up Comedy 334
## 10 Horror Movies 275
## # ℹ 26 more rows
Interpretation: Takes the first genre listed for each title as its primary genre and counts titles per genre. Dramas (1,600) and Comedies (1,210) are the most common.
netflix_data %>%
  mutate(
    release_year = as.numeric(as.character(release_year)),
    year_added = as.numeric(as.character(year_added))
  ) %>%
  filter(!is.na(release_year) & !is.na(year_added)) %>%
  summarise(correlation = cor(release_year, year_added))
## correlation
## 1 0.111531
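Interpretation: Checks whether Netflix focuses on adding newer releases or older library titles. The correlation of about 0.11 between release year and year added is weakly positive, so the year a title was added is only loosely related to its release year.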
month_summary <- netflix_data %>%
  group_by(month_added) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count))
# Result
print(month_summary)
## # A tibble: 13 × 2
## month_added Count
## <chr> <int>
## 1 July 827
## 2 December 813
## 3 September 770
## 4 April 764
## 5 October 760
## 6 August 755
## 7 March 742
## 8 January 738
## 9 June 728
## 10 November 705
## 11 May 632
## 12 February 563
## 13 <NA> 10
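Interpretation: Determines whether there is a seasonal pattern in when Netflix adds new content. July (827) and December (813) are the busiest months and February (563) is the quietest. The <NA> row holds the 10 titles with no date_added value.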
top_directors <- netflix_data %>%
  filter(director != "") %>%
  group_by(director) %>%
  summarise(Total_Content = n()) %>%
  arrange(desc(Total_Content)) %>%
  head(10)
# Result
print(top_directors)
## # A tibble: 10 × 2
## director Total_Content
## <chr> <int>
## 1 Rajiv Chilaka 19
## 2 Raúl Campos, Jan Suter 18
## 3 Marcus Raboy 16
## 4 Suhas Kadav 16
## 5 Jay Karas 14
## 6 Cathy Garcia-Molina 13
## 7 Jay Chapman 12
## 8 Martin Scorsese 12
## 9 Youssef Chahine 12
## 10 Steven Spielberg 11
Interpretation: Lists directors with the most titles on the platform.
## title release_year
## 543 Ujala 1959
## 1332 Five Came Back: The Reference Films 1945
## 1700 White Christmas 1954
## 2369 Cairo Station 1958
## 2370 Dark Waters 1956
## 2376 The Blazing Sun 1954
## 4251 Pioneers: First Women Filmmakers* 1925
## 6432 Cat on a Hot Tin Roof 1958
## 6785 Forbidden Planet 1956
## 6854 Gigi 1958
## 7220 Know Your Enemy - Japan 1945
## 7295 Let There Be Light 1946
## 7576 Nazi Concentration Camps 1945
## 7744 Pioneers of African-American Cinema 1946
## 7791 Prelude to War 1942
## 7840 Rebel Without a Cause 1955
## 7931 San Pietro 1945
## 7955 Scandal in Sorrento 1955
## 8206 The Battle of Midway 1942
## 8420 The Memphis Belle: A Story of a\nFlying Fortress 1944
## 8437 The Negro Soldier 1944
## 8507 The Sign of Venus 1955
## 8588 Thunderbolt 1947
## 8641 Tunisian Victory 1944
## 8661 Undercover: How to Operate Behind Enemy Lines 1943
## 8740 Why We Fight: The Battle of Russia 1943
## 8764 WWII: Report from the Aleutians 1943
Interpretation: Filters for classic content released before 1960.
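The filter itself is not shown. A minimal base R sketch, assuming a strict cutoff of `release_year < 1960`; the `toy` data frame reuses two titles from the listing above and adds one hypothetical modern title:

```r
# Small stand-in for netflix_data; "Some Recent Title" is hypothetical
toy <- data.frame(
  title        = c("Rebel Without a Cause", "Some Recent Title", "Prelude to War"),
  release_year = c(1955L, 2019L, 1942L),
  stringsAsFactors = FALSE
)

# Keep rows released before 1960 and show only the two relevant columns
classics <- toy[toy$release_year < 1960, c("title", "release_year")]
print(classics)
```

Subsetting this way keeps the original row numbers, which is consistent with the non-consecutive row labels in the output above.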
## [1] 17.72579
Interpretation: Computes the average character count of titles, which is about 17.7 characters.
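The calculation is not shown; a minimal sketch, assuming it is simply the mean of `nchar()` over the title column. The three titles are taken from outputs elsewhere in this report:

```r
# Three titles from the dataset, used as a small sample
titles <- c("Gigi", "Ganglands", "Dark Waters")

# nchar() counts characters (spaces included); mean() averages the lengths
avg_title_length <- mean(nchar(titles))
print(avg_title_length)
```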
netflix_data$is_dark <- grepl("dark|death|murder|crime", netflix_data$description, ignore.case = TRUE)
sum(netflix_data$is_dark)
## [1] 816
Interpretation: A simple text-mining step that flags descriptions containing “dark” keywords (dark, death, murder, crime). It matches 816 of the 8,807 titles, about 9.3%.
Interpretation: Calculates how long it took for a movie to arrive on Netflix after its theater release.
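Neither the code nor the result of this step appears in the report. A minimal sketch of the lag calculation, assuming a `year_added` column has already been derived from `date_added`; the vectors and the name `years_to_netflix` are hypothetical:

```r
# Hypothetical release and addition years for three titles
release_year <- c(2020, 1993, 2021)
year_added   <- c(2021, 2021, 2021)

# Lag in whole years between release and arrival on Netflix
years_to_netflix <- year_added - release_year
print(years_to_netflix)

# Average lag, ignoring titles with a missing year
mean_lag <- mean(years_to_netflix, na.rm = TRUE)
```

A year-level difference is coarse, since `release_year` carries no month or day, so small lags should be read as approximate.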
##
##
## 825
## David Attenborough
## 19
## Vatsal Dubey, Julie Tejwani, Rupa Bhimani, Jigna Bhardwaj, Rajesh Kava, Mousam, Swapnil
## 14
## Samuel West
## 10
## Jeff Dunham
## 7
## Craig Sechler
## 6
## David Spade, London Hughes, Fortune Feimster
## 6
## Kevin Hart
## 6
## Michela Luci, Jamie Watson, Eric Peterson, Anna Claire Bartlam, Nicolas Aqui, Cory Doran, Julie Lemieux, Derek McGrath
## 6
## Bill Burr
## 5
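Interpretation: Tabulates the most frequent values of the cast column. Because each value is a title's full cast list, the counts reflect identical cast strings rather than individual actors, and the top entry (825) is the empty string for titles with no cast recorded.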
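To count individual actors instead of whole cast strings, the comma-separated lists can be split first. This is a sketch of that alternative, not the method used for the output above; the last element of the sample vector is hypothetical:

```r
# Sample cast strings; the last entry is a hypothetical combination
cast <- c("Kevin Hart", "",
          "David Spade, London Hughes, Fortune Feimster",
          "Kevin Hart, David Spade")

# Drop empty entries, split on commas, flatten, and trim surrounding spaces
actors <- trimws(unlist(strsplit(cast[cast != ""], ",")))

# Count appearances per actor, most frequent first
actor_counts <- sort(table(actors), decreasing = TRUE)
print(actor_counts)
```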
Interpretation: Creates a binary flag for family-friendly content
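The flag definition is not shown in the report. A minimal sketch, assuming "family-friendly" means a rating in a fixed set; both the set in `family_ratings` and the name `is_family` are assumptions:

```r
# Hypothetical sample of the rating column, including one empty value
rating <- c("PG-13", "TV-MA", "TV-Y", "G", "")

# Assumed set of family-friendly ratings (not confirmed by the report)
family_ratings <- c("G", "PG", "TV-G", "TV-Y", "TV-Y7")

# TRUE when the rating is in the assumed set; empty ratings become FALSE
is_family <- rating %in% family_ratings
print(is_family)
```

Using `%in%` rather than `==` means missing or empty ratings yield FALSE instead of NA, so the flag can be summed directly.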
top_directors <- netflix_data %>%
  filter(director != "") %>%
  group_by(director) %>%
  summarise(Total_Content = n()) %>%
  arrange(desc(Total_Content)) %>%
  head(10)
# Result
print(top_directors)
## # A tibble: 10 × 2
## director Total_Content
## <chr> <int>
## 1 Rajiv Chilaka 19
## 2 Raúl Campos, Jan Suter 18
## 3 Marcus Raboy 16
## 4 Suhas Kadav 16
## 5 Jay Karas 14
## 6 Cathy Garcia-Molina 13
## 7 Jay Chapman 12
## 8 Martin Scorsese 12
## 9 Youssef Chahine 12
## 10 Steven Spielberg 11
Objective: This analysis identifies the Top 10 most prolific directors on Netflix based on the total number of titles they have directed.
Data Cleaning: We first filtered out records where the director’s name was missing or blank to ensure the accuracy of the ranking.
Methodology: By using the group_by() and summarise() functions, we performed a frequency count of each director’s contributions.
Result: The final list is sorted in descending order to highlight the directors with the highest volume of content, providing insight into Netflix’s most frequent creative collaborators.
Interpretation: Prepares a clean dataset specifically for the training model.
model_data <- netflix_data
model_data <- model_data %>%
  mutate(
    type = as.factor(type),
    rating = as.factor(rating)
  )
str(model_data$type)
## Factor w/ 2 levels "Movie","TV Show": 1 2 2 2 2 2 1 1 2 1 ...
Interpretation: Converts character columns to factors so that R models treat them as categories.
train_idx <- sample(1:nrow(model_data), 0.8 * nrow(model_data))
train_set <- model_data[train_idx, ]
test_set <- model_data[-train_idx, ]
Interpretation: Splits data into 80% for training the model and 20% for testing its accuracy.
# Model training step
fit <- rpart(type ~ release_year + duration_val,
             data = train_set,
             method = "class")
Data Partitioning: The dataset was partitioned into an 80% training set and a 20% testing set to ensure unbiased model evaluation.
Model Induction: The training phase involved using the rpart algorithm to build a decision tree. The model ‘learned’ the relationship between features like release_year and duration_val to classify titles as Movies or TV Shows.
Pattern Recognition: During training, the algorithm identifies split points in the data (nodes) that best separate the classes based on statistical purity.
Validation: Post-training, the model’s predictive performance was validated using the unseen testing set to check for overfitting or underfitting.
Interpretation: Displays the decision rules the model
learned.
Interpretation: Uses the model on unseen data to see how well it performs.
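The code for displaying the rules and scoring the test set is not shown. The sketch below runs the whole train, inspect, and predict cycle on synthetic data, since the real dataset is not available here. The `toy` data frame, the seed, and the accuracy calculation are all assumptions; only the `rpart()` formula is taken from the report:

```r
library(rpart)

set.seed(42)  # fix the seed so the random split is reproducible

# Synthetic stand-in: movies get durations in minutes, shows get season counts
toy <- data.frame(
  type         = factor(rep(c("Movie", "TV Show"), each = 30)),
  release_year = sample(2000:2021, 60, replace = TRUE),
  duration_val = c(sample(60:150, 30, replace = TRUE),
                   sample(1:5, 30, replace = TRUE))
)

# 80/20 split, mirroring the partitioning step
train_idx <- sample(seq_len(nrow(toy)), 0.8 * nrow(toy))
train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]

# Fit the classification tree with the same formula as the report
fit <- rpart(type ~ release_year + duration_val,
             data = train_set, method = "class")

# Display the learned decision rules as text
print(fit)

# Predict classes for the held-out rows and compare with the true labels
pred     <- predict(fit, newdata = test_set, type = "class")
conf_mat <- table(Predicted = pred, Actual = test_set$type)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```

In this synthetic data, `duration_val` alone separates the two classes because minutes and season counts do not overlap. If the real `duration_val` is built the same way, a high test accuracy would mostly reflect that encoding, not a pattern in release years.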