Course Details

  • Student Name: Sumit Kumar
  • Roll Number: 45
  • Course Code: CAP397

Introduction

Netflix is one of the most popular streaming platforms worldwide, offering a diverse collection of movies, TV shows, and original content. With the rapid growth of digital entertainment, understanding content trends and user preferences has become essential.

This project performs a detailed exploratory data analysis (EDA) of Netflix content. It includes data understanding, filtering, grouping, and advanced visualization to extract meaningful insights from the dataset.


Load Libraries and Dataset

library(dplyr)
library(ggplot2)
data <- read.csv("C:/Volume D/R project/netflix_titles.csv")
head(data)
##   show_id    type                 title        director
## 1      s1   Movie  Dick Johnson Is Dead Kirsten Johnson
## 2      s2 TV Show         Blood & Water                
## 3      s3 TV Show             Ganglands Julien Leclercq
## 4      s4 TV Show Jailbirds New Orleans                
## 5      s5 TV Show          Kota Factory                
## 6      s6 TV Show         Midnight Mass   Mike Flanagan
##                                                                                                                                                                                                                                                                                                              cast
## 1                                                                                                                                                                                                                                                                                                                
## 2 Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile Tshabalala, Getmore Sithole, Cindy Mahlangu, Ryle De Morny, Greteli Fincham, Sello Maake Ka-Ncube, Odwa Gwanya, Mekaila Mathys, Sandi Schultz, Duane Williams, Shamilla Miller, Patrick Mofokeng
## 3                                                                                                                                                             Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, Geert Van Rampelberg, Bakary Diombera
## 4                                                                                                                                                                                                                                                                                                                
## 5                                                                                                                                                                                                        Mayur More, Jitendra Kumar, Ranjan Raj, Alam Khan, Ahsaas Channa, Revathi Pillai, Urvi Singh, Arun Kumar
## 6                                                                        Kate Siegel, Zach Gilford, Hamish Linklater, Henry Thomas, Kristin Lehman, Samantha Sloyan, Igby Rigney, Rahul Kohli, Annarah Cymone, Annabeth Gish, Alex Essoe, Rahul Abburi, Matt Biedel, Michael Trucco, Crystal Balint, Louis Oliver
##         country         date_added release_year rating  duration
## 1 United States September 25, 2021         2020  PG-13    90 min
## 2  South Africa September 24, 2021         2021  TV-MA 2 Seasons
## 3               September 24, 2021         2021  TV-MA  1 Season
## 4               September 24, 2021         2021  TV-MA  1 Season
## 5         India September 24, 2021         2021  TV-MA 2 Seasons
## 6               September 24, 2021         2021  TV-MA  1 Season
##                                                       listed_in
## 1                                                 Documentaries
## 2               International TV Shows, TV Dramas, TV Mysteries
## 3 Crime TV Shows, International TV Shows, TV Action & Adventure
## 4                                        Docuseries, Reality TV
## 5        International TV Shows, Romantic TV Shows, TV Comedies
## 6                            TV Dramas, TV Horror, TV Mysteries
##                                                                                                                                                description
## 1 As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.
## 2      After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth.
## 3       To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled into a violent and deadly turf war.
## 4      Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Orleans on this gritty reality series.
## 5 In a city of coaching centers known to train India’s finest collegiate minds, an earnest but unexceptional student and his friends navigate campus life.
## 6 The arrival of a charismatic young priest brings glorious miracles, ominous mysteries and renewed religious fervor to a dying town desperate to believe.

Level 1: Understanding the Data

Question 1.1: What is the overall structure of the dataset, including the number of observations and variables, and what types of data are present?

cat("Rows:", nrow(data), "\n")
## Rows: 8807
cat("Columns:", ncol(data), "\n")
## Columns: 12
str(data)
## 'data.frame':    8807 obs. of  12 variables:
##  $ show_id     : chr  "s1" "s2" "s3" "s4" ...
##  $ type        : chr  "Movie" "TV Show" "TV Show" "TV Show" ...
##  $ title       : chr  "Dick Johnson Is Dead" "Blood & Water" "Ganglands" "Jailbirds New Orleans" ...
##  $ director    : chr  "Kirsten Johnson" "" "Julien Leclercq" "" ...
##  $ cast        : chr  "" "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile "| __truncated__ "Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, G"| __truncated__ "" ...
##  $ country     : chr  "United States" "South Africa" "" "" ...
##  $ date_added  : chr  "September 25, 2021" "September 24, 2021" "September 24, 2021" "September 24, 2021" ...
##  $ release_year: int  2020 2021 2021 2021 2021 2021 2021 1993 2021 2021 ...
##  $ rating      : chr  "PG-13" "TV-MA" "TV-MA" "TV-MA" ...
##  $ duration    : chr  "90 min" "2 Seasons" "1 Season" "1 Season" ...
##  $ listed_in   : chr  "Documentaries" "International TV Shows, TV Dramas, TV Mysteries" "Crime TV Shows, International TV Shows, TV Action & Adventure" "Docuseries, Reality TV" ...
##  $ description : chr  "As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical wa"| __truncated__ "After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is h"| __truncated__ "To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled "| __truncated__ "Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Or"| __truncated__ ...

Interpretation:
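The dataset has 8,807 observations and 12 variables. Eleven are character columns (for example type, rating, country, and duration) and one, release_year, is an integer. Several character columns, such as director, cast, and country, contain empty strings, and date_added and duration are stored as text, so they need parsing before they can be used numerically.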


Question 1.2: How many missing values exist in each column, and how might they impact the reliability of the analysis?

colSums(is.na(data))
##      show_id         type        title     director         cast      country 
##            0            0            0            0            0            0 
##   date_added release_year       rating     duration    listed_in  description 
##            0            0            0            0            0            0

Interpretation:
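colSums(is.na(data)) reports zero missing values in every column, but this is misleading. By default, read.csv() reads blank character fields as empty strings ("") rather than NA, and the head() and str() output above shows blank entries in director, cast, and country. These blanks are missing values in practice and must be counted and handled separately for the analysis to be reliable.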
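
Since blank strings are not detected by is.na(), one option is to test for empty strings as well. The following is a minimal sketch on a small made-up data frame; the toy data and the count_missing() helper are illustrative assumptions, not part of the original analysis.

```r
# Toy data standing in for the Netflix file; values are made up for illustration.
toy <- data.frame(
  director     = c("Kirsten Johnson", "", "Julien Leclercq", ""),
  country      = c("United States", "South Africa", "", ""),
  release_year = c(2020L, 2021L, 2021L, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: per column, count NA values plus empty or
# whitespace-only strings (the latter only for character columns).
count_missing <- function(df) {
  sapply(df, function(col) {
    blank <- if (is.character(col)) !is.na(col) & trimws(col) == "" else FALSE
    sum(is.na(col) | blank)
  })
}

missing_counts <- count_missing(toy)
print(missing_counts)
```

An alternative is to convert blanks to NA at load time with the na.strings argument, for example read.csv(path, na.strings = c("", "NA")), after which colSums(is.na(data)) would report them directly.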


Question 1.3: What are the unique categories present in content type and rating, and what do they reveal about Netflix’s classification system?

unique(data$type)
## [1] "Movie"   "TV Show"
unique(data$rating)
##  [1] "PG-13"    "TV-MA"    "PG"       "TV-14"    "TV-PG"    "TV-Y"    
##  [7] "TV-Y7"    "R"        "TV-G"     "G"        "NC-17"    "74 min"  
## [13] "84 min"   "66 min"   "NR"       ""         "TV-Y7-FV" "UR"

Interpretation:
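Content is classified into two types, Movie and TV Show. The rating column mixes TV Parental Guidelines ratings (TV-Y through TV-MA) with film ratings (G through NC-17), plus NR and UR for unrated titles. It also contains a blank value and three entries ("74 min", "84 min", "66 min") that look like durations entered in the wrong column, so the column needs cleaning before ratings are compared.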
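
The unique(data$rating) output above includes values such as "74 min" and a blank string. A simple pattern check can flag such entries for review. This is a sketch on made-up values; the regular expression assumes stray durations always have the form "<digits> min".

```r
# Made-up rating values, including the kinds of stray entries seen in the output above.
ratings <- c("PG-13", "TV-MA", "74 min", "", "TV-14", "84 min", "NR")

# Flag values that look like a duration ("<digits> min") instead of a rating.
looks_like_duration <- grepl("^[0-9]+ min$", ratings)

# Flag blank entries separately.
is_blank <- ratings == ""

# Collect everything that should be reviewed before analysing ratings.
suspect <- ratings[looks_like_duration | is_blank]
print(suspect)
```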


Level 2: Basic Analysis

Question 2.1: What is the distribution of Movies versus TV Shows on Netflix, and what does it suggest about platform focus?

table(data$type)
## 
##   Movie TV Show 
##    6131    2676

Interpretation:
Movies outnumber TV Shows by more than two to one (6,131 versus 2,676), which suggests the catalog in this dataset is weighted toward movies.


Question 2.2: What percentage of the total content is Movies and TV Shows, and how balanced is the distribution?

prop.table(table(data$type))
## 
##     Movie   TV Show 
## 0.6961508 0.3038492

Interpretation:
Movies make up about 69.6% of the titles and TV Shows about 30.4%, so the distribution is clearly unbalanced in favor of movies.


Level 3: Filtering

Question 3.1: How many movies have been released after 2015, and what does this indicate about Netflix’s recent growth?

sum(data$type=="Movie" & data$release_year > 2015, na.rm=TRUE)
## [1] 3619

Interpretation:
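Of the 6,131 movies, 3,619 (about 59%) have a release year after 2015, so the movie catalog skews recent. Note that release_year is the title's original release year, not the date it was added to Netflix, so this figure describes the age of the content rather than the platform's growth directly.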


Question 3.2: How many TV Shows originate from the top contributing countries, and what insights can be drawn about regional content dominance?

top3 <- names(sort(table(na.omit(data$country)), decreasing=TRUE)[1:3])
sum(data$type=="TV Show" & data$country %in% top3, na.rm=TRUE)
## [1] 1230

Interpretation:
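The count of 1,230 should be read with caution. Because country contains no NA values, na.omit() removes nothing, and the third most frequent value is the empty string (831 titles, as the grouping in Question 4.1 shows). The "top three" are therefore United States, India, and blank, so the total includes TV Shows with no recorded country. The comparison also matches only rows whose country field is exactly one of those values.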
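
Because blank country values are stored as "" rather than NA, na.omit() does not remove them before ranking. One way to keep blanks out of the ranking is to filter them explicitly. The sketch below uses a small made-up data frame and takes the top two countries for brevity; it is not a recomputation of the figure above.

```r
# Made-up rows; two of them have a blank country.
toy <- data.frame(
  type    = c("TV Show", "Movie", "TV Show", "TV Show", "Movie", "TV Show", "TV Show"),
  country = c("United States", "United States", "", "", "India", "India", "Japan"),
  stringsAsFactors = FALSE
)

# Drop blank countries before ranking, so "" cannot be counted as a top country.
known <- toy$country[toy$country != ""]
top2 <- names(sort(table(known), decreasing = TRUE)[1:2])

# Count TV Shows whose country is exactly one of the top countries.
tv_in_top <- sum(toy$type == "TV Show" & toy$country %in% top2)
print(top2)
print(tv_in_top)
```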


Question 3.3: How prevalent is TV-MA rated content on Netflix, and what does it suggest about target audience maturity?

sum(data$rating=="TV-MA", na.rm=TRUE)
## [1] 3207

Interpretation:
TV-MA is the most common rating, applied to 3,207 of 8,807 titles (about 36%), which suggests a large share of the catalog is aimed at adult audiences.


Question 3.4: How many movies are produced in India, and what does it indicate about India’s contribution to Netflix?

sum(data$country=="India" & data$type=="Movie", na.rm=TRUE)
## [1] 893

Interpretation:
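893 movies have country recorded as exactly "India", about 15% of all movies. Any titles that list India alongside other countries are not matched by this exact comparison, so the figure is best read as a lower bound on India's contribution.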
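
The comparison data$country == "India" matches only rows where India is the sole country. Assuming the country field can hold several comma-separated countries (the 749 distinct values in Question 4.1 suggest it does), co-productions can be included by splitting the field first. A sketch on made-up rows:

```r
# Made-up rows; the country field may list several countries separated by commas.
toy <- data.frame(
  type    = c("Movie", "Movie", "Movie", "TV Show", "Movie"),
  country = c("India", "India, United States", "United Kingdom, India", "India", "Indonesia"),
  stringsAsFactors = FALSE
)

# Split each country string and check for an exact "India" entry,
# so that "Indonesia" is not matched by accident.
has_india <- sapply(strsplit(toy$country, ","), function(x) "India" %in% trimws(x))

exact_only  <- sum(toy$type == "Movie" & toy$country == "India")  # sole country
with_coprod <- sum(toy$type == "Movie" & has_india)               # includes co-productions
print(c(exact_only = exact_only, with_coprod = with_coprod))
```

Splitting and matching whole entries is used here instead of a substring search, which would also match unrelated names that merely contain the text "India".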


Question 3.5: How much content was released before the year 2000, and how significant is older content in Netflix’s library?

sum(data$release_year < 2000, na.rm=TRUE)
## [1] 525

Interpretation:
Only 525 of 8,807 titles (about 6%) were released before 2000, so older content is a small part of the library.


Level 4: Grouping & Summarization

Question 4.1: How does the distribution of content vary across different countries, and which regions dominate the platform?

data %>% group_by(country) %>% summarise(count=n()) %>% arrange(desc(count))
## # A tibble: 749 × 2
##    country          count
##    <chr>            <int>
##  1 "United States"   2818
##  2 "India"            972
##  3 ""                 831
##  4 "United Kingdom"   419
##  5 "Japan"            245
##  6 "South Korea"      199
##  7 "Canada"           181
##  8 "Spain"            145
##  9 "France"           124
## 10 "Mexico"           110
## # ℹ 739 more rows

Interpretation:
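The United States (2,818 titles) and India (972) lead, followed by the United Kingdom, Japan, and South Korea. The third-largest group is titles with a blank country (831), which should be treated as missing rather than as a region. The 749 distinct values suggest that many entries list more than one country, so the counts above cover exact matches only.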


Question 4.2: What is the average release year for Movies and TV Shows, and what does it reveal about content recency?

aggregate(release_year ~ type, data, mean)
##      type release_year
## 1   Movie     2013.122
## 2 TV Show     2016.606

Interpretation:
TV Shows have a mean release year of about 2016.6, roughly 3.5 years later than Movies (2013.1), so the TV catalog is somewhat more recent on average.


Question 4.3: How are content ratings distributed across Movies and TV Shows, and are there noticeable differences?

table(data$rating, data$type)
##           
##            Movie TV Show
##                2       2
##   66 min       1       0
##   74 min       1       0
##   84 min       1       0
##   G           41       0
##   NC-17        3       0
##   NR          75       5
##   PG         287       0
##   PG-13      490       0
##   R          797       2
##   TV-14     1427     733
##   TV-G       126      94
##   TV-MA     2062    1145
##   TV-PG      540     323
##   TV-Y       131     176
##   TV-Y7      139     195
##   TV-Y7-FV     5       1
##   UR           3       0

Interpretation:
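TV-MA and TV-14 are the two most common ratings for both Movies and TV Shows. Film ratings (G, PG, PG-13, NC-17, and nearly all R titles) apply almost exclusively to Movies, while TV-Y and TV-Y7 are the only ratings where TV Shows outnumber Movies. The three duration-like values and the four blank ratings appear to be data-entry errors.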


Question 4.4: What are the most frequent genres on Netflix, and what does this indicate about audience preferences?

sort(table(data$listed_in), decreasing=TRUE)[1:10]
## 
##                     Dramas, International Movies 
##                                              362 
##                                    Documentaries 
##                                              359 
##                                  Stand-Up Comedy 
##                                              334 
##           Comedies, Dramas, International Movies 
##                                              274 
## Dramas, Independent Movies, International Movies 
##                                              252 
##                                         Kids' TV 
##                                              220 
##                         Children & Family Movies 
##                                              215 
##               Children & Family Movies, Comedies 
##                                              201 
##              Documentaries, International Movies 
##                                              186 
##    Dramas, International Movies, Romantic Movies 
##                                              180

Interpretation:
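The table counts exact listed_in strings, so each row is a combination of genres rather than a single genre. The most frequent combinations are "Dramas, International Movies" (362), "Documentaries" (359), and "Stand-Up Comedy" (334). Dramas and International Movies recur across several of the top combinations, but ranking individual genres requires splitting the field first.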
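
Since listed_in holds comma-separated genre lists, counting individual genres requires splitting each string first. A minimal sketch on a few made-up entries (the listed_in vector here is illustrative, not the real column):

```r
# A few made-up listed_in values in the same comma-separated style.
listed_in <- c(
  "Dramas, International Movies",
  "Documentaries",
  "Comedies, Dramas, International Movies",
  "Documentaries, International Movies"
)

# Split each string on commas, trim whitespace, and flatten to one genre per element.
genres <- trimws(unlist(strsplit(listed_in, ",")))

# Count how often each individual genre appears, most frequent first.
genre_counts <- sort(table(genres), decreasing = TRUE)
print(genre_counts)
```

Applied to data$listed_in, the same steps would rank single genres, with each title counted once for every genre it lists.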


Question 4.5: Which release year accounts for the highest number of titles, and what trend does it represent?

sort(table(data$release_year), decreasing=TRUE)[1]
## 2018 
## 1147

Interpretation:
2018 is the most common release year, with 1,147 titles. This counts titles by original release year, not by the date they were added to Netflix.


Level 5: Visualization

V1: How does the distribution of Movies and TV Shows appear visually, and what insights can be derived from it?

ggplot(data, aes(x=type)) + geom_bar(fill="steelblue")

Interpretation:
The bar chart shows Movies (6,131) well ahead of TV Shows (2,676).


V2: How are content releases distributed over the years, and what trend does the histogram reveal?

ggplot(data, aes(x=release_year)) + geom_histogram(fill="blue", bins=30)

Interpretation:
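The histogram is strongly left-skewed: more than half of all titles have release years from 2016 onward, with a long thin tail reaching back to 1925.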


V3: Which countries dominate Netflix content production based on visualization?

# Exclude blank country entries so "" is not plotted as a country
top_country <- as.data.frame(sort(table(data$country[data$country != ""]), decreasing=TRUE)[1:10])
ggplot(top_country, aes(x=reorder(Var1, Freq), y=Freq)) +
  geom_col(fill="green") + coord_flip()

Interpretation:
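The United States dominates, followed by India; the other countries in the top ten each account for far fewer titles. Titles with a blank country should be excluded from this ranking.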


V4: What does the proportional distribution of content types indicate when represented as a pie chart?

pie(table(data$type), col=c("red","blue"))

Interpretation:
The pie chart shows the same split as Question 2.2: roughly 70% Movies and 30% TV Shows.


Level 6: Advanced Visualization

V5: How are different content ratings distributed, and what does it suggest about audience segmentation?

ggplot(data, aes(x=rating)) +
  geom_bar(fill="purple") +
  theme(axis.text.x = element_text(angle=90))

Interpretation:
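TV-MA and TV-14 are by far the most frequent ratings, followed by TV-PG and R. This reflects how the catalog is composed; it does not by itself measure what viewers prefer. The chart also includes the blank and duration-like rating values.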


V6: Which genres are most dominant, and how does visualization help identify them clearly?

top_genre <- as.data.frame(sort(table(data$listed_in), decreasing=TRUE)[1:10])
ggplot(top_genre, aes(x=reorder(Var1, Freq), y=Freq)) +
  geom_bar(stat="identity", fill="orange") + coord_flip()

Interpretation:
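Because listed_in is plotted as-is, the bars show the most common genre combinations, led by "Dramas, International Movies", "Documentaries", and "Stand-Up Comedy".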


V7: What trend can be observed in content release over time using a line chart?

ggplot(data, aes(x=release_year)) +
  geom_line(stat="count", color="green")

Interpretation:
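The number of titles per release year rises steeply in recent years, peaks in 2018, and falls afterwards. The drop in the final years likely reflects, at least in part, the dataset's cutoff in 2021.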


V8: How does the spread of release years differ between Movies and TV Shows using a boxplot?

ggplot(data, aes(x=type, y=release_year)) +
  geom_boxplot(fill="cyan")

Interpretation:
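The boxplot compares the release-year distributions of the two types. Consistent with the means in Question 4.2, TV Shows sit at more recent release years than Movies.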


V9: What does the density distribution of release years indicate about content concentration?

ggplot(data, aes(x=release_year)) +
  geom_density(fill="purple", alpha=0.5)

Interpretation:
The density curve highlights the concentration of releases in recent years, with very little mass before 2000.


V10: How does content distribution differ when separated by type using facet visualization?

ggplot(data, aes(x=release_year)) +
  geom_histogram(bins=30) +
  facet_wrap(~type)

Interpretation:
Faceting places the Movie and TV Show histograms side by side on shared axes, so their release-year distributions can be compared directly.


Level 7: Advanced Analysis

Question 7.1: Which years recorded the highest number of content releases, and what does this trend signify?

sort(table(data$release_year), decreasing=TRUE)[1:5]
## 
## 2018 2017 2019 2020 2016 
## 1147 1032 1030  953  902

Interpretation:
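The five most common release years are all recent: 2018 (1,147), 2017 (1,032), 2019 (1,030), 2020 (953), and 2016 (902). Releases peak in 2018 and decline slightly afterwards.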


Question 7.2: What are the most common content ratings on Netflix, and what does this imply about viewer preferences?

sort(table(data$rating), decreasing=TRUE)[1:5]
## 
## TV-MA TV-14 TV-PG     R PG-13 
##  3207  2160   863   799   490

Interpretation:
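TV-MA (3,207) and TV-14 (2,160) together cover about 61% of titles, so most of the catalog is aimed at older teens and adults. This describes the catalog rather than measured viewer preferences.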


Question 7.3: How does the trend of content growth evolve over time based on grouped analysis?

trend <- data %>% group_by(release_year) %>% summarise(count=n())
head(trend)
## # A tibble: 6 × 2
##   release_year count
##          <int> <int>
## 1         1925     1
## 2         1942     2
## 3         1943     3
## 4         1944     3
## 5         1945     4
## 6         1946     2

Interpretation:
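head(trend) shows only the earliest years, where counts are in single digits (one title from 1925, two to four per year in the 1940s). The growth in recent years appears at the other end of the table, which tail(trend) or sorting by count would show.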
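
The head() output above covers only the oldest release years. To see the recent end of the trend and the peak year, the grouped table can be inspected from the bottom or by maximum count. The sketch below uses base R and a made-up release_year vector so that it runs on its own; the numbers are illustrative only.

```r
# Made-up release years; the real analysis would use data$release_year.
release_year <- c(1925L, 2016L, 2017L, 2017L, 2018L, 2018L, 2018L, 2019L)

# Count titles per release year (base R equivalent of group_by + summarise).
trend <- as.data.frame(table(release_year), stringsAsFactors = FALSE)
names(trend) <- c("release_year", "count")
trend$release_year <- as.integer(trend$release_year)

# Most recent years in the table.
recent <- tail(trend[order(trend$release_year), ], 3)

# Year with the highest count.
peak <- trend[which.max(trend$count), ]

print(recent)
print(peak)
```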


Level 8: Correlation Analysis

Question 8.1: What relationships exist between numerical variables in the dataset, and how can correlation analysis help uncover hidden patterns?

library(corrplot)
data_numeric <- data %>%
  mutate(
    is_movie = ifelse(type == "Movie", 1, 0),
    is_tv = ifelse(type == "TV Show", 1, 0)
  )

num_data <- data_numeric %>%
  select(release_year, is_movie, is_tv)

cor_matrix <- cor(num_data, use="complete.obs")

corrplot(cor_matrix, method="color")

Interpretation:
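The correlation matrix carries limited information. is_movie and is_tv are exact complements (is_tv = 1 - is_movie), so their correlation is -1 by construction, and the correlation of release_year with is_tv is the negative of its correlation with is_movie. The only substantive relationship is between release year and content type; the later mean release year of TV Shows (Question 4.2) implies a positive correlation between release_year and is_tv.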
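
The two indicator columns built above are complements of each other, which fixes part of the correlation matrix in advance. The sketch below illustrates this on six made-up titles; the values are assumptions chosen for illustration, not taken from the dataset.

```r
# Six made-up titles.
type <- c("Movie", "TV Show", "Movie", "Movie", "TV Show", "TV Show")
release_year <- c(2010, 2019, 2015, 2012, 2021, 2018)

# Same indicator construction as in the analysis above.
is_movie <- ifelse(type == "Movie", 1, 0)
is_tv    <- ifelse(type == "TV Show", 1, 0)

cor_matrix <- cor(cbind(release_year, is_movie, is_tv))
print(round(cor_matrix, 3))
```

Since is_tv equals 1 - is_movie, one of the two indicators is redundant; keeping only one of them would give the same information in a smaller matrix.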


Final Conclusion

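This analysis shows that movies make up about 70% of the Netflix catalog in this dataset and that most titles have recent release years, peaking in 2018. The United States and India are the largest sources of content, and TV-MA and TV-14 are the most common ratings. Blank values in country, director, and cast, and a few misplaced values in rating, should be cleaned before drawing firmer conclusions about genres, ratings, and release patterns.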


End of Report