Netflix is one of the most popular streaming platforms worldwide, offering a diverse collection of movies, TV shows, and original content. With the rapid growth of digital entertainment, understanding content trends and user preferences has become essential.
This project performs a detailed exploratory data analysis (EDA) of Netflix content. It includes data understanding, filtering, grouping, and advanced visualization to extract meaningful insights from the dataset.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.5.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.3
data <- read.csv("C:/Volume D/R project/netflix_titles.csv")
head(data)
## show_id type title director
## 1 s1 Movie Dick Johnson Is Dead Kirsten Johnson
## 2 s2 TV Show Blood & Water
## 3 s3 TV Show Ganglands Julien Leclercq
## 4 s4 TV Show Jailbirds New Orleans
## 5 s5 TV Show Kota Factory
## 6 s6 TV Show Midnight Mass Mike Flanagan
## cast
## 1
## 2 Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile Tshabalala, Getmore Sithole, Cindy Mahlangu, Ryle De Morny, Greteli Fincham, Sello Maake Ka-Ncube, Odwa Gwanya, Mekaila Mathys, Sandi Schultz, Duane Williams, Shamilla Miller, Patrick Mofokeng
## 3 Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, Geert Van Rampelberg, Bakary Diombera
## 4
## 5 Mayur More, Jitendra Kumar, Ranjan Raj, Alam Khan, Ahsaas Channa, Revathi Pillai, Urvi Singh, Arun Kumar
## 6 Kate Siegel, Zach Gilford, Hamish Linklater, Henry Thomas, Kristin Lehman, Samantha Sloyan, Igby Rigney, Rahul Kohli, Annarah Cymone, Annabeth Gish, Alex Essoe, Rahul Abburi, Matt Biedel, Michael Trucco, Crystal Balint, Louis Oliver
## country date_added release_year rating duration
## 1 United States September 25, 2021 2020 PG-13 90 min
## 2 South Africa September 24, 2021 2021 TV-MA 2 Seasons
## 3 September 24, 2021 2021 TV-MA 1 Season
## 4 September 24, 2021 2021 TV-MA 1 Season
## 5 India September 24, 2021 2021 TV-MA 2 Seasons
## 6 September 24, 2021 2021 TV-MA 1 Season
## listed_in
## 1 Documentaries
## 2 International TV Shows, TV Dramas, TV Mysteries
## 3 Crime TV Shows, International TV Shows, TV Action & Adventure
## 4 Docuseries, Reality TV
## 5 International TV Shows, Romantic TV Shows, TV Comedies
## 6 TV Dramas, TV Horror, TV Mysteries
## description
## 1 As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.
## 2 After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth.
## 3 To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled into a violent and deadly turf war.
## 4 Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Orleans on this gritty reality series.
## 5 In a city of coaching centers known to train India’s finest collegiate minds, an earnest but unexceptional student and his friends navigate campus life.
## 6 The arrival of a charismatic young priest brings glorious miracles, ominous mysteries and renewed religious fervor to a dying town desperate to believe.
cat("Rows:", nrow(data), "\n")
## Rows: 8807
cat("Columns:", ncol(data), "\n")
## Columns: 12
str(data)
## 'data.frame': 8807 obs. of 12 variables:
## $ show_id : chr "s1" "s2" "s3" "s4" ...
## $ type : chr "Movie" "TV Show" "TV Show" "TV Show" ...
## $ title : chr "Dick Johnson Is Dead" "Blood & Water" "Ganglands" "Jailbirds New Orleans" ...
## $ director : chr "Kirsten Johnson" "" "Julien Leclercq" "" ...
## $ cast : chr "" "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile "| __truncated__ "Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, G"| __truncated__ "" ...
## $ country : chr "United States" "South Africa" "" "" ...
## $ date_added : chr "September 25, 2021" "September 24, 2021" "September 24, 2021" "September 24, 2021" ...
## $ release_year: int 2020 2021 2021 2021 2021 2021 2021 1993 2021 2021 ...
## $ rating : chr "PG-13" "TV-MA" "TV-MA" "TV-MA" ...
## $ duration : chr "90 min" "2 Seasons" "1 Season" "1 Season" ...
## $ listed_in : chr "Documentaries" "International TV Shows, TV Dramas, TV Mysteries" "Crime TV Shows, International TV Shows, TV Action & Adventure" "Docuseries, Reality TV" ...
## $ description : chr "As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical wa"| __truncated__ "After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is h"| __truncated__ "To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled "| __truncated__ "Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Or"| __truncated__ ...
Interpretation:
The dataset has 8807 rows and 12 columns. Eleven columns are stored as
character, including categorical fields such as type, rating, and
country; release_year is the only integer column. Note that date_added
and duration are also read as text, so they need parsing before any
numerical analysis.
colSums(is.na(data))
## show_id type title director cast country
## 0 0 0 0 0 0
## date_added release_year rating duration listed_in description
## 0 0 0 0 0 0
Interpretation:
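is.na() reports no missing values because read.csv() loaded blank
fields as empty strings ("") rather than NA. The head() and str()
output above shows such blanks in director, cast, and country, so they
must be counted and handled separately to ensure accurate analysis.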
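The is.na() check above returns zero for every column, yet the head() output shows blank director, cast, and country fields. The following is a minimal sketch of counting those blanks and recoding them to NA. The `toy` data frame and the helper names `count_blanks` and `blank_to_na` are hypothetical, introduced only for illustration.

```r
# Toy stand-in for the real data frame (hypothetical rows, not the full dataset)
toy <- data.frame(
  director     = c("Kirsten Johnson", "", "Julien Leclercq", ""),
  country      = c("United States", "South Africa", "", ""),
  release_year = c(2020L, 2021L, 2021L, 2021L),
  stringsAsFactors = FALSE
)

# Count blank strings per column; non-character columns count as 0
count_blanks <- function(df) {
  vapply(df, function(col) {
    if (is.character(col)) sum(!is.na(col) & col == "") else 0L
  }, integer(1))
}

# Recode blank strings to NA so that is.na() can see them
blank_to_na <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.character(col)) col[!is.na(col) & col == ""] <- NA
    col
  })
  df
}

blank_counts <- count_blanks(toy)
toy_clean    <- blank_to_na(toy)
na_counts    <- colSums(is.na(toy_clean))

print(blank_counts)
print(na_counts)
```

An alternative, assuming the file can be re-read, is to pass `na.strings = c("", "NA")` to read.csv() so that blanks become NA at load time.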
unique(data$type)
## [1] "Movie" "TV Show"
unique(data$rating)
## [1] "PG-13" "TV-MA" "PG" "TV-14" "TV-PG" "TV-Y"
## [7] "TV-Y7" "R" "TV-G" "G" "NC-17" "74 min"
## [13] "84 min" "66 min" "NR" "" "TV-Y7-FV" "UR"
Interpretation:
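Netflix content is divided into Movies and TV Shows, with multiple
rating categories indicating audience suitability. The rating column
also contains three values that look like durations ("74 min",
"84 min", "66 min") and a blank value; these are data-entry issues
rather than real ratings.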
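The unique() output above includes "74 min", "84 min", and "66 min" among the ratings. One way to flag such values is a pattern match, sketched below on a hypothetical `rating` vector. The pattern is an assumption based only on the three values seen above.

```r
# Hypothetical sample of rating values, including misplaced durations and a blank
rating <- c("PG-13", "TV-MA", "74 min", "", "84 min", "R")

# TRUE where the value looks like "<digits> min" instead of a rating
is_duration <- grepl("^[0-9]+ min$", rating)
is_blank    <- rating == ""

# Treat both kinds of bad value as missing
rating_clean <- rating
rating_clean[is_duration | is_blank] <- NA

print(table(rating_clean, useNA = "ifany"))
```

Whether these rows should be dropped, or the value moved to the duration column, depends on what duration holds for the same rows, which this sketch does not check.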
table(data$type)
##
## Movie TV Show
## 6131 2676
Interpretation:
Movies outnumber TV Shows by more than two to one (6131 vs. 2676), so
the catalog in this dataset is weighted toward movie content.
prop.table(table(data$type))
##
## Movie TV Show
## 0.6961508 0.3038492
Interpretation:
Movies make up about 69.6% of titles and TV Shows about 30.4%, which
puts the raw counts above on a common scale.
sort(table(na.omit(data$country)), decreasing=TRUE)[1:5]
##
## United States India United Kingdom Japan
## 2818 972 831 419 245
Interpretation:
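The United States leads with 2818 titles, followed by India (972), the
United Kingdom (419), and Japan (245). The unlabeled third entry (831)
is the blank country string: na.omit() removes only NA values, not
empty strings. Counts are per exact string, so a title that lists
several countries is tallied under its combined string.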
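In the output above, the third count (831) has no label because it belongs to the blank country string, which na.omit() does not remove. The sketch below drops blanks before ranking and splits comma-separated entries so each country is counted once per title. It uses a hypothetical `country` vector and assumes multi-country values are separated by commas.

```r
# Hypothetical sample of country values, including blanks and a co-production
country <- c("United States", "India", "", "United States, India",
             "United Kingdom", "", "India")

# Exact-string ranking with blanks (and any NA) removed first
known        <- country[!is.na(country) & country != ""]
exact_counts <- sort(table(known), decreasing = TRUE)

# Per-country ranking: split co-productions on commas and trim spaces
parts        <- trimws(unlist(strsplit(known, ",", fixed = TRUE)))
parts        <- parts[parts != ""]
split_counts <- sort(table(parts), decreasing = TRUE)

print(exact_counts)
print(split_counts)
```

The two rankings answer different questions: the first counts titles by their full country field, the second counts how often each country appears in any title.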
sum(data$type=="Movie" & data$release_year > 2015, na.rm=TRUE)
## [1] 3619
Interpretation:
3619 of the 6131 movies (about 59%) were released after 2015, so the
movie catalog is weighted toward recent releases.
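A raw count is easier to interpret as a share of all movies. The sketch below shows the calculation on hypothetical `type` and `release_year` vectors, then applies the same arithmetic to the two figures reported above (3619 recent movies out of 6131 movies).

```r
# Hypothetical sample rows
type         <- c("Movie", "Movie", "TV Show", "Movie", "TV Show")
release_year <- c(2010L, 2018L, 2019L, 2020L, 2012L)

is_movie <- type == "Movie"

# Count and share of movies released after 2015
recent_movies <- sum(is_movie & release_year > 2015)
share_recent  <- mean(release_year[is_movie] > 2015)

# Same arithmetic using the counts printed in this report
reported_share <- 3619 / 6131

print(recent_movies)
print(round(100 * share_recent, 1))
print(round(100 * reported_share, 1))
```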
top3 <- names(sort(table(na.omit(data$country)), decreasing=TRUE)[1:3])
sum(data$type=="TV Show" & data$country %in% top3, na.rm=TRUE)
## [1] 1230
Interpretation:
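Because the blank country string ranks third in the table, top3 here
is the United States, India, and the blank value, not the United
Kingdom. The 1230 TV Shows counted therefore include titles with no
country recorded, and the figure should be read with that caveat.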
sum(data$rating=="TV-MA", na.rm=TRUE)
## [1] 3207
Interpretation:
TV-MA is the most common rating, covering 3207 of 8807 titles (about
36%). This describes the make-up of the catalog; it does not by itself
measure what viewers prefer.
sum(data$country=="India" & data$type=="Movie", na.rm=TRUE)
## [1] 893
Interpretation:
893 movies have country recorded exactly as "India", showing its
importance in Netflix's content library. Co-productions that list
India alongside other countries are not included in this count.
sum(data$release_year < 2000, na.rm=TRUE)
## [1] 525
Interpretation:
Only 525 titles (about 6% of the catalog) were released before 2000,
so older content is a small share compared with modern releases.
data %>% group_by(country) %>% summarise(count=n()) %>% arrange(desc(count))
## # A tibble: 749 × 2
## country count
## <chr> <int>
## 1 "United States" 2818
## 2 "India" 972
## 3 "" 831
## 4 "United Kingdom" 419
## 5 "Japan" 245
## 6 "South Korea" 199
## 7 "Canada" 181
## 8 "Spain" 145
## 9 "France" 124
## 10 "Mexico" 110
## # ℹ 739 more rows
Interpretation:
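This grouping ranks country values by number of titles. There are 749
distinct values because a title with several countries is stored as
one combined string, and the blank string (831 titles with no country
recorded) ranks third.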
aggregate(release_year ~ type, data, mean)
## type release_year
## 1 Movie 2013.122
## 2 TV Show 2016.606
Interpretation:
TV Shows have a mean release year about 3.5 years later than Movies
(2016.6 vs. 2013.1).
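Release years run back as far as 1925, so a few old titles can pull a group mean down. Reporting the median next to the mean is one way to check for this. The sketch below uses a hypothetical `toy` data frame, so its numbers are illustrative and are not results from the dataset.

```r
# Hypothetical rows: one very old movie among recent ones
toy <- data.frame(
  type         = c("Movie", "Movie", "Movie", "Movie",
                   "TV Show", "TV Show", "TV Show"),
  release_year = c(1960, 2015, 2018, 2019, 2016, 2017, 2018),
  stringsAsFactors = FALSE
)

# Mean and median release year per type
means   <- tapply(toy$release_year, toy$type, mean)
medians <- tapply(toy$release_year, toy$type, median)

by_type <- data.frame(
  type   = names(means),
  mean   = as.numeric(means),
  median = as.numeric(medians)
)

print(by_type)
```

In this toy example the single 1960 title moves the movie mean to 2003 while the median stays at 2016.5, which is why the two statistics are worth reading together.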
table(data$rating, data$type)
##
## Movie TV Show
## 2 2
## 66 min 1 0
## 74 min 1 0
## 84 min 1 0
## G 41 0
## NC-17 3 0
## NR 75 5
## PG 287 0
## PG-13 490 0
## R 797 2
## TV-14 1427 733
## TV-G 126 94
## TV-MA 2062 1145
## TV-PG 540 323
## TV-Y 131 176
## TV-Y7 139 195
## TV-Y7-FV 5 1
## UR 3 0
Interpretation:
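TV-MA is the most common rating for both Movies (2062) and TV Shows
(1145), followed by TV-14. Film ratings such as G, PG, PG-13, and
NC-17 appear only on Movies. The first row (blank rating) and the
"66 min", "74 min", and "84 min" rows are data-entry issues.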
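Because there are more than twice as many Movies as TV Shows, raw counts in the cross-tabulation are hard to compare across columns. A sketch of converting each column to shares with prop.table() follows, using hypothetical `rating` and `type` vectors.

```r
# Hypothetical sample rows
rating <- c("TV-MA", "TV-MA", "TV-14", "PG-13", "TV-MA", "TV-14")
type   <- c("Movie", "TV Show", "Movie", "Movie", "TV Show", "TV Show")

counts <- table(rating, type)

# margin = 2 divides each cell by its column total,
# so every type column sums to 1
col_share <- prop.table(counts, margin = 2)

print(round(100 * col_share, 1))
```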
sort(table(data$listed_in), decreasing=TRUE)[1:10]
##
## Dramas, International Movies
## 362
## Documentaries
## 359
## Stand-Up Comedy
## 334
## Comedies, Dramas, International Movies
## 274
## Dramas, Independent Movies, International Movies
## 252
## Kids' TV
## 220
## Children & Family Movies
## 215
## Children & Family Movies, Comedies
## 201
## Documentaries, International Movies
## 186
## Dramas, International Movies, Romantic Movies
## 180
Interpretation:
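These counts are for exact listed_in strings, each of which may
combine several genres. The most frequent combinations are "Dramas,
International Movies" (362), "Documentaries" (359), and "Stand-Up
Comedy" (334); drama and comedy labels recur across the top ten.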
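The table above counts whole listed_in strings such as "Dramas, International Movies". To count individual genres instead, each string can be split on commas first. The sketch below does this on a hypothetical `listed_in` vector.

```r
# Hypothetical sample of listed_in values
listed_in <- c("Dramas, International Movies",
               "Documentaries",
               "Comedies, Dramas, International Movies",
               "Stand-Up Comedy",
               "Documentaries, International Movies")

# Split each string on a comma followed by optional whitespace
genre_list <- strsplit(listed_in, ",\\s*")

# Number of genres attached to each title
genres_per_title <- lengths(genre_list)

# Frequency of each individual genre across all titles
genre_counts <- sort(table(unlist(genre_list)), decreasing = TRUE)

print(genres_per_title)
print(genre_counts)
```

Note that the per-genre counts sum to more than the number of titles, since a title with three genres contributes three times.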
sort(table(data$release_year), decreasing=TRUE)[1]
## 2018
## 1147
Interpretation:
2018 is the release year with the most titles (1147), showing that
the catalog is concentrated in recent years.
ggplot(data, aes(x=type)) + geom_bar(fill="steelblue")
Interpretation:
The bar chart clearly shows the dominance of Movies over TV Shows.
ggplot(data, aes(x=release_year)) + geom_histogram(fill="blue", bins=30)
Interpretation:
Most content is concentrated in recent years.
top_country <- as.data.frame(sort(table(data$country), decreasing=TRUE)[1:10])
ggplot(top_country, aes(x=reorder(Var1, Freq), y=Freq)) +
geom_bar(stat="identity", fill="green") + coord_flip()
Interpretation:
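A few countries dominate Netflix content production. Because blank
country strings are not removed before plotting, one of the ten bars
has an empty label.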
pie(table(data$type), col=c("red","blue"))
Interpretation:
The pie chart shows the proportion of Movies vs TV Shows.
ggplot(data, aes(x=rating)) +
geom_bar(fill="purple") +
theme(axis.text.x = element_text(angle=90))
Interpretation:
TV-MA and TV-14 are the most frequent ratings. The axis also shows
the blank and misplaced "min" values from the rating column.
top_genre <- as.data.frame(sort(table(data$listed_in), decreasing=TRUE)[1:10])
ggplot(top_genre, aes(x=reorder(Var1, Freq), y=Freq)) +
geom_bar(stat="identity", fill="orange") + coord_flip()
Interpretation:
The bars show the ten most frequent genre combinations, since each
listed_in value may contain several genres.
ggplot(data, aes(x=release_year)) +
geom_line(stat="count", color="green")
Interpretation:
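Title counts by release year climb toward a peak in 2018 and are
lower for 2019 and 2020. Note that release_year is when a title was
first released, not when it was added to Netflix.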
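The line chart above is based on release_year. To look at when titles were added to the platform, the date_added text column ("September 25, 2021" style) would need parsing first. A minimal sketch on a hypothetical `date_added` vector follows. The leading space and the blank entry are assumed edge cases, and the "%B" month-name format depends on the session locale, so the sketch switches LC_TIME to "C" while parsing.

```r
# Hypothetical sample values in the same style as the date_added column
date_added <- c("September 25, 2021", " September 24, 2021", "",
                "January 1, 2016")

# Month names are locale dependent; use the C locale while parsing
old_locale <- Sys.getlocale("LC_TIME")
invisible(Sys.setlocale("LC_TIME", "C"))

# Blank strings do not match the format and become NA
parsed <- as.Date(trimws(date_added), format = "%B %d, %Y")

invisible(Sys.setlocale("LC_TIME", old_locale))

# Year each title was added, and titles added per year
year_added     <- as.integer(format(parsed, "%Y"))
added_per_year <- table(year_added)

print(year_added)
print(added_per_year)
```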
ggplot(data, aes(x=type, y=release_year)) +
geom_boxplot(fill="cyan")
Interpretation:
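The box plot compares the release-year distributions of Movies and TV
Shows. Given the group means computed earlier, TV Shows are expected
to sit higher (more recent) than Movies.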
ggplot(data, aes(x=release_year)) +
geom_density(fill="purple", alpha=0.5)
Interpretation:
The density plot highlights the concentration of releases in recent
years.
ggplot(data, aes(x=release_year)) +
geom_histogram() +
facet_wrap(~type)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Interpretation:
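Faceting by type allows the release-year distributions of Movies and
TV Shows to be compared side by side. The message above is ggplot2
noting that it used the default of 30 bins because neither bins nor
binwidth was given.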
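The stat_bin() message above appears because no bin width was set. One option is to choose an explicit width, for example five-year bins. The sketch below shows the binning with base R on a hypothetical `release_year` vector, followed by the matching ggplot2 call. The five-year width and the 1990 starting boundary are illustrative choices, not values taken from this report.

```r
# Hypothetical sample of release years
release_year <- c(1993L, 2008L, 2016L, 2017L, 2018L, 2018L,
                  2019L, 2020L, 2021L)

# Five-year bins, closed on the left: [1990,1995), [1995,2000), ...
breaks     <- seq(1990, 2025, by = 5)
bins       <- cut(release_year, breaks = breaks, right = FALSE, dig.lab = 4)
bin_counts <- table(bins)

print(bin_counts)

# Equivalent histogram in ggplot2, built only if the package is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  toy <- data.frame(release_year = release_year)
  p <- ggplot2::ggplot(toy, ggplot2::aes(x = release_year)) +
    ggplot2::geom_histogram(binwidth = 5, boundary = 1990, closed = "left")
}
```

Setting binwidth explicitly removes the message and makes the bins easy to describe in the text.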
sort(table(data$release_year), decreasing=TRUE)[1:5]
##
## 2018 2017 2019 2020 2016
## 1147 1032 1030 953 902
Interpretation:
The five busiest release years all fall between 2016 and 2020, with a
peak of 1147 titles in 2018.
sort(table(data$rating), decreasing=TRUE)[1:5]
##
## TV-MA TV-14 TV-PG R PG-13
## 3207 2160 863 799 490
Interpretation:
TV-MA and TV-14 together account for 5367 of 8807 titles (about 61%),
so the catalog leans toward teen and adult audiences.
trend <- data %>% group_by(release_year) %>% summarise(count=n())
head(trend)
## # A tibble: 6 × 2
## release_year count
## <int> <int>
## 1 1925 1
## 2 1942 2
## 3 1943 3
## 4 1944 3
## 5 1945 4
## 6 1946 2
Interpretation:
The first rows of the trend table cover the earliest release years,
1925 to 1946, each with only one to four titles.
This analysis shows that movies make up about 70% of the Netflix catalog in this dataset and that title counts by release year peak in 2018. The United States and India are the most common production countries. The statistical summaries and visualizations also reveal which genre combinations and ratings are most frequent, with TV-MA and TV-14 leading the ratings.
End of Report