Netflix started as a DVD rental company and is now one of the best-known streaming platforms in use today. As of December 31, 2021, Netflix had over 221.8 million subscribers worldwide. Moreover, as of February 2022 it was the second-largest entertainment/media company by market capitalization.
Netflix, Inc. is an American subscription streaming service and production company. It was founded in 1997 by Reed Hastings and Marc Randolph in Scotts Valley, California. Netflix initially both sold and rented DVDs by mail, but sales were eliminated within a year to focus on the DVD rental business. In 2007, Netflix introduced streaming media and video on demand, which is the service most of us are familiar with today.
Currently, Netflix streaming holds more than 3,600 movies and more than 1,800 TV shows available to its subscribers. Netflix also produces a wide variety of original content, all of which is labeled on the streaming service as a ‘Netflix Original’. The site really caters for audiences of all demographics with a wide variety of genres.
In this report, we are going to explore movies and TV shows on Netflix. Exploratory Data Analysis and Data Visualization will be performed to give more insights about shows on Netflix. The process includes Data Input, Data Cleansing & Coercion, Data Summary, Data Transformation & Visualization, and Conclusion. The objective of this report is to give insights and interesting statistics about its shows.
Reference: Wikipedia
# Library Input
library(tidyverse)
library(tidyselect)
library(lubridate)
library(glue)
library(scales)
library(ggplot2)
library(plotly)
library(DT)
library(RColorBrewer)
library(leaflet)
library(sf)
library(sp)
library(spatial)
library(highcharter)
library(shiny)

# Data Input
netflix <- read.csv("data/netflix_titles.csv", stringsAsFactors = TRUE, na.strings = c("", " ", "NA"))

The data used in this report contains 8,807 observations and 12 variables. The data set consists of the following variables:
show_id : Unique ID for every movie / TV show
type : Identifier, either Movie or TV Show
title : Title of the movie / TV show
director : Director of the movie
cast : Actors involved in the movie / show
country : Country where the movie / show was produced
date_added : Date it was added on Netflix
release_year : Actual release year of the movie / show
rating : TV rating of the movie / show
duration : Total duration, in minutes or number of seasons
listed_in : Genre
description : The summary description

# Checking Data Types
str(netflix)
## 'data.frame': 8807 obs. of 12 variables:
## $ show_id : Factor w/ 8807 levels "s1","s10","s100",..: 1 1112 2223 3334 4445 5556 6667 7778 8697 2 ...
## $ type : Factor w/ 2 levels "Movie","TV Show": 1 2 2 2 2 2 1 1 2 1 ...
## $ title : Factor w/ 8807 levels "'76","'89","#Alive",..: 1990 1104 2661 3522 3882 4590 4899 6075 7234 7789 ...
## $ director : Factor w/ 4528 levels "Ömer Faruk Sorak",..: 2308 NA 2122 NA NA 2880 3555 1518 311 4169 ...
## $ cast : Factor w/ 7692 levels "'Najite Dede, Jude Chukwuka, Taiwo Arimoro, Odenike Odetola, Funmi Eko, Keppy Ekpenyong",..: NA 421 6312 NA 4839 3804 7283 4052 4863 4873 ...
## $ country : Factor w/ 748 levels ", France, Algeria",..: 604 427 NA NA 252 NA NA 664 507 604 ...
## $ date_added : Factor w/ 1767 levels " April 15, 2018",..: 1712 1707 1707 1707 1707 1707 1707 1707 1707 1707 ...
## $ release_year: int 2020 2021 2021 2021 2021 2021 2021 1993 2021 2021 ...
## $ rating : Factor w/ 17 levels "66 min","74 min",..: 8 12 12 12 12 12 7 12 10 8 ...
## $ duration : Factor w/ 220 levels "1 Season","10 min",..: 211 111 1 1 111 1 212 33 210 8 ...
## $ listed_in : Factor w/ 514 levels "Action & Adventure",..: 275 416 243 298 395 502 122 320 115 198 ...
## $ description : Factor w/ 8775 levels "\"Bridgerton\" cast members share behind-the-scenes stories from the hit show, plus comedian Nikki Glaser break"| __truncated__,..: 2569 1756 7339 3620 4376 6390 3460 5511 1148 1339 ...
# Data coercion
netflix[ , c("show_id", "title", "description")] <-
  lapply(netflix[ , c("show_id", "title", "description")], as.character)
netflix$date_added <- mdy(netflix$date_added)
netflix$release_year <- as.factor(netflix$release_year)

All variables have been converted to the desired data types. Next, we check for missing values.
# Checking missing values
colSums(is.na(netflix))
## show_id type title director cast country
## 0 0 0 2634 825 831
## date_added release_year rating duration listed_in description
## 10 0 4 3 0 0
# Calculating the proportion of missing values
colSums(is.na(netflix))/nrow(netflix)
## show_id type title director cast country
## 0.0000000000 0.0000000000 0.0000000000 0.2990802771 0.0936754854 0.0943567617
## date_added release_year rating duration listed_in description
## 0.0011354604 0.0000000000 0.0004541842 0.0003406381 0.0000000000 0.0000000000
In this report, we are not going to analyze the director variable (about 30% of its values are missing), hence we will drop this column. We can also drop show_id and description, since these two hold little information for our analysis. The missing values in each of the remaining columns make up only a small part of the observations (under 10% per column), so any row with a missing value in those columns will be dropped.
# Drop NA
netflix_clean <- netflix %>% select(-director, -show_id, -description)
netflix_clean <- netflix_clean %>% drop_na(cast, country, date_added, rating, duration)
anyNA(netflix_clean)
## [1] FALSE
# Checking the dimension of cleaned data
dim(netflix_clean)
## [1] 7290 9

After data cleansing, our data contains 7,290 observations and 9 variables. All variables have been converted to the desired data types and there are no more missing values.
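As a minimal sketch of the cleaning logic above, the following base R snippet measures the share of missing values per column, drops one column, and keeps only rows that are complete in the remaining columns of interest. The data frame `toy` and its values are hypothetical and are not taken from netflix_titles.csv; `complete.cases()` is used here as a base R stand-in for `drop_na()`.

```r
# Toy data frame with hypothetical values (not from netflix_titles.csv)
toy <- data.frame(
  title    = c("A", "B", "C", "D"),
  director = c(NA, "X", NA, "Y"),
  cast     = c("p, q", NA, "r", "s"),
  country  = c("India", "Japan", NA, "Canada"),
  stringsAsFactors = FALSE
)

# Share of missing values per column (same idea as colSums(is.na(x)) / nrow(x))
na_share <- colMeans(is.na(toy))

# Drop the column that is not analysed
toy_clean <- toy[, setdiff(names(toy), "director")]

# Keep only rows with no missing value in the selected columns
toy_clean <- toy_clean[complete.cases(toy_clean[, c("cast", "country")]), ]

print(na_share)
print(dim(toy_clean))
```

Checking `dim()` right after the drop, as the report does, is a cheap way to confirm that the number of remaining columns matches what the text states.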
# Data summary
summary(netflix_clean)
## type title cast
## Movie :5277 Length:7290 David Attenborough: 19
## TV Show:2013 Class :character Samuel West : 10
## Mode :character Jeff Dunham : 7
## Craig Sechler : 6
## Kevin Hart : 6
## Bill Burr : 5
## (Other) :7237
## country date_added release_year rating
## United States :2479 Min. :2008-01-01 2018 : 935 TV-MA :2657
## India : 940 1st Qu.:2018-04-01 2017 : 862 TV-14 :1755
## United Kingdom: 350 Median :2019-06-28 2019 : 822 R : 779
## Japan : 238 Mean :2019-04-27 2016 : 751 TV-PG : 653
## South Korea : 196 3rd Qu.:2020-07-05 2020 : 744 PG-13 : 470
## Canada : 162 Max. :2021-09-24 2015 : 472 PG : 275
## (Other) :2925 (Other):2704 (Other): 701
## duration listed_in
## 1 Season :1252 Dramas, International Movies : 337
## 2 Seasons: 348 Stand-Up Comedy : 302
## 3 Seasons: 169 Comedies, Dramas, International Movies : 260
## 94 min : 136 Dramas, Independent Movies, International Movies: 243
## 93 min : 130 Children & Family Movies, Comedies : 181
## 95 min : 130 Documentaries : 166
## (Other) :5125 (Other) :5801
Based on the data summary, we can conclude that:
The most common type of show on Netflix is Movie.
Most shows on Netflix were released in 2018.
The rating of most of the shows is TV-MA.
Most TV shows on Netflix consist of 1 season.
The most common movie duration on Netflix is 94 minutes.

However, for country, listed_in (Genre), and cast we can’t draw conclusions from this data summary, as the values have not been separated yet. In this data frame, each of those 3 variables can hold more than one value per row, separated by commas.
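To illustrate why comma-separated columns need to be split before counting, here is a minimal base R sketch. The vector `toy_country` is hypothetical; its second entry imitates the leading-comma values visible in the `str()` output (for example ", France, Algeria"), which produce an empty string after splitting and therefore need to be filtered out.

```r
# Hypothetical multi-valued entries, one string per title
toy_country <- c("United States, India", ", France, Algeria", "India")

# Split every entry on ", " and flatten into one long vector
parts <- unlist(strsplit(toy_country, ", ", fixed = TRUE))

# A leading separator yields an empty string; keep non-empty values only
parts <- parts[nzchar(parts)]

# Frequency per individual value, most frequent first
country_counts <- sort(table(parts), decreasing = TRUE)
print(country_counts)
```

Counting the unsplit strings instead would treat "United States, India" and "India" as unrelated categories, which is why `summary()` cannot be used for these three variables.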
proportion <- aggregate.data.frame(x = (netflix_clean$type),
by = list(Type = netflix_clean$type),
FUN = length)
proportion <- proportion %>%
mutate(Percentage = x/ sum(proportion$x) * 100) %>%
mutate(Total = x) %>%
select(-x)
datatable(proportion)
div(hchart(proportion, "pie", hcaes(name = Type, y = {round(Percentage, 1)})) %>%
  hc_tooltip(pointFormat = "{point.y:.1f}%") %>%
  hc_colors(color = c("#221f1f", "#b20710")) %>%
  hc_title(text = "<b>Shows Type Ratio on Netflix</b>"), align = "right")

Based on the pie chart above, most shows on Netflix are movies. The ratio of movies to TV shows is about 72.4% : 27.6%, with a total of 5,277 movies and 2,013 TV shows.
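The percentages quoted above can be reproduced from the two totals alone. This is a small sketch using base R's `prop.table()`; the vector name `type_counts` is introduced here for illustration, while the counts 5277 and 2013 come from the data summary.

```r
# Totals per type, as reported in the data summary
type_counts <- c(Movie = 5277, "TV Show" = 2013)

# Convert counts to percentages of the overall total
type_pct <- round(prop.table(type_counts) * 100, 1)
print(type_pct)
```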
# Genre
listed_in <- netflix_clean$listed_in
Genre <- unlist(strsplit(as.character(listed_in), ", "))
genre_list <- as.data.frame(table(Genre)) %>% arrange(desc(Freq))
datatable(genre_list)

Based on the plot, International Movies is the genre Netflix provides the most, with a total of 2,392 movies / shows, followed by Dramas with 2,309 and Comedies with 1,574.
Next, we look at which genres Netflix provides the most in each category, movies and TV shows.
# Movie Genre
genre_mov <- netflix_clean %>%
filter(type == "Movie") %>%
select(listed_in)
Movie_Genre <- unlist(strsplit(as.character(genre_mov$listed_in), ", "))
genre_mov_list <- as.data.frame(table(Movie_Genre)) %>% arrange(desc(Freq))

The top 3 movie genres provided by Netflix are International Movies with 2,392 movies, followed by Dramas with 2,309 and Comedies with 1,574. This result matches the previous plot, “Top 10 Genre Netflix Provides The Most”, which means the overall top 3 genres are all movie genres and not TV show genres.
# TV Shows Genre
genre_tv <- netflix_clean %>%
filter(type == "TV Show") %>%
select(listed_in)
TV_Genre <- unlist(strsplit(as.character(genre_tv$listed_in), ", "))
genre_tv_list <- as.data.frame(table(TV_Genre)) %>% arrange(desc(Freq))

The top 3 TV show genres provided by Netflix are International TV Shows with 1,047 shows, followed by TV Dramas with 657 and TV Comedies with 480. These top 3 spots mirror the previous plot, “Top 10 Movie Genre Netflix Provides The Most”: international titles, dramas, and comedies are the most common categories for both movies and TV shows. The difference starts in 4th place, which is Action & Adventure for movies and Crime TV Shows for TV shows.
# Movies & TV Shows Added to Netflix Overtime
day_added_all <- netflix_clean %>%
select(date_added) %>%
mutate(year = year(date_added)) %>%
group_by(year) %>%
summarise(n = n())
colnames(day_added_all) <- c("Year", "Total")
# Time Series
div(day_added_all %>%
hchart('area', hcaes(x = 'Year', y = 'Total')) %>%
hc_tooltip(pointFormat = "Total Added:{point.y}") %>%
hc_colors(color = "#b20710") %>%
hc_title(text = "<b>Amount of Movies & TV Shows Added to Netflix</b>") %>%
  hc_chart(backgroundColor = "white"), align = "right")

Based on the time series above, 2019 is the year in which Netflix added the most movies and TV shows to its platform, with a total of 1,722. A significant increase can be seen in 2017, with 989 shows added compared with 358 in the previous year. In 2020, however, additions declined: Netflix added 1,640 shows, fewer than the 1,722 added in 2019.
From 2008 until 2012, Netflix only added movies to its platform. In 2013, Netflix started to add TV shows as well. Now, we are going to compare the two in terms of the total added to Netflix over time.
# Movies VS TV Shows Added to Netflix Overtime
day_added <- netflix_clean %>%
select(type, date_added) %>%
mutate(year = year(date_added)) %>%
group_by(year, type) %>%
summarise(n = n()) %>%
filter(year > 2012)
colnames(day_added) <- c("Year", "Type", "Total")
# Data Table
day_added_wide <- pivot_wider(data = day_added,
names_from = Type,
values_from = Total)
datatable(day_added_wide)
# Grouped Time Series
div(day_added %>%
hchart('area', hcaes(x = 'Year', y = 'Total', group = "Type")) %>%
hc_tooltip(pointFormat = "Total Added:{point.y}") %>%
hc_colors(color = c("#b20710", "#221f1f")) %>%
hc_title(text = "<b>Amount of Movies / Shows Added to Netflix</b>") %>%
  hc_chart(backgroundColor = "white"), align = "right")

Based on the time series and the table above, more movies than TV shows were added in every year. Netflix added the most movies in 2019, with a total of 1,261, while the most TV shows were added in 2020, with a total of 476. After their respective peaks, both seem to have declined.
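The year-by-type table behind this comparison can be sketched in base R as a two-way contingency table, from which the peak year per type follows directly. The data frame `toy_added` below is hypothetical and only imitates the shape of the `type` and `date_added` columns; the report itself uses `group_by()` and `pivot_wider()` for the same purpose.

```r
# Hypothetical additions (not real Netflix records)
toy_added <- data.frame(
  type = c("Movie", "TV Show", "Movie", "Movie", "TV Show", "TV Show"),
  date_added = as.Date(c("2019-03-01", "2019-07-15", "2020-01-10",
                         "2019-11-30", "2020-05-05", "2020-08-20"))
)

# Extract the year in which each title was added
toy_added$year <- as.integer(format(toy_added$date_added, "%Y"))

# Rows are years, columns are types
year_type <- table(Year = toy_added$year, Type = toy_added$type)

# For each type, find the year with the largest count
peak_year <- rownames(year_type)[apply(year_type, 2, which.max)]
names(peak_year) <- colnames(year_type)

print(year_type)
print(peak_year)
```

Note that `which.max()` returns the first maximum, so ties between years would be resolved in favour of the earlier year.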
# Years of the Movies / Shows on Netflix Released
released <- netflix_clean %>%
select(type, release_year) %>%
mutate(year = as.character(release_year)) %>%
mutate(year = as.numeric(year)) %>%
group_by(year, type) %>%
summarise(n = n())
colnames(released) <- c("Year", "Type", "Total")
# Data Table
released_wide <- pivot_wider(data = released,
names_from = Type,
values_from = Total) %>%
arrange(desc(Year))
datatable(released_wide)
# Grouped Time Series
div(released %>%
hchart('area', hcaes(x = 'Year', y = 'Total', group = "Type")) %>%
hc_tooltip(pointFormat = "Total: {point.y} Released") %>%
hc_colors(color = c("#b20710", "#221f1f")) %>%
hc_title(text = "<b>Actual Release Year of Movies & Shows on Netflix</b>") %>%
  hc_chart(backgroundColor = "white"), align = "right")

Based on the time series above, more movies than TV shows on Netflix were released in every year. No TV show on Netflix was released before 1963, and the oldest movie on Netflix was released in 1942.
Furthermore, 2018 is the release year with the most movies on the platform (653), while 2020 is the release year with the most TV shows (327). For movies, a significant jump can be seen in 2016, with 574 movies released compared with only 344 in the previous year. For TV shows, the most significant jump is in 2018, with 282 TV shows released compared with 213 in the previous year. From 2019 until 2021, however, movie releases decline, and the same happens to TV show releases in 2021.
Now, we are going to compare the number of movies and TV shows in the top 10 years in which most of the movies / shows on Netflix were released.
# Aggregate Table with Top 10 Years When Most of the Movies / Shows on Netflix Released
most_release <- released %>%
mutate(Year = as.factor(Year)) %>%
group_by(Year) %>%
mutate(total_released = sum(Total)) %>%
  arrange(desc(total_released))

Based on the bar chart above, 2018 is the year in which most of the movies / shows on Netflix were released, with a total of 653 movies and 282 TV shows, followed by 2017 and 2019. It is also interesting to see how fast the catalogue grows from year to year; were it not for the pandemic, there might not have been a decline.
# Type of Movies / Shows Rating
rating <- netflix_clean %>%
select(type, rating) %>%
group_by(rating, type) %>%
summarise(n = n())
colnames(rating) <- c("Rating", "Type", "Total")
datatable(rating)

Based on the bar chart above, TV-MA (Mature Audience Only) is the rating of most movies and TV shows on Netflix. This suggests that Netflix's main target audience is young adults and adults (people above 17 years old). TV-14 (content may be unsuitable for minors younger than 14 years of age) comes second, which suggests that teenagers are a target audience as well. The third most common rating is R (restricted for viewers under 17), which only movies carry; this is consistent with the observation that Netflix mainly targets young adults and adults.
# Country
Country <- netflix_clean$country
Country <- unlist(strsplit(as.character(Country),", " ))
country_list <- as.data.frame(table(Country))
country_list <- country_list %>%
filter(Country != "") %>%
arrange(desc(Freq))
datatable(country_list)

Now, we look at the distribution of show production on Netflix through the map below.
Based on the table and map above, the United States is the country that produces the most movies / shows on Netflix, with a total of 3,274, followed by India with 1,007 and the United Kingdom with 707.
However, we haven’t seen how the show type varies for each country. Hence, we are going to see the proportion of each country’s show production on Netflix below.
# Content by Country
country_content <- netflix_clean %>%
select(type, country) %>%
separate_rows(country, sep = ", ") %>%
group_by(country, type) %>%
summarise(n = n()) %>%
filter(country != "") %>%
arrange(desc(n))
colnames(country_content) <- c("Country", "Type", "Total Produced")
datatable(country_content)

There are some interesting insights from this plot: movies make up 93.5% of India’s total show production, whereas South Korea and Japan mostly produce TV shows rather than movies (74% of South Korea’s production and 62.9% of Japan’s production are TV shows).
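Within-country shares like the ones quoted above can be derived from a country-by-type count table by normalising each row. The following base R sketch uses `xtabs()` and `prop.table()` with `margin = 1`; the data frame `toy_cc` and its counts are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Hypothetical counts per country and type (not the real figures)
toy_cc <- data.frame(
  country = c("India", "India", "Japan", "Japan"),
  type    = c("Movie", "TV Show", "Movie", "TV Show"),
  n       = c(9, 1, 2, 3)
)

# Cross-tabulate counts: rows are countries, columns are types
wide <- xtabs(n ~ country + type, data = toy_cc)

# Row-wise percentages: each country's row sums to 100
share <- round(prop.table(wide, margin = 1) * 100, 1)
print(share)
```

Using `margin = 1` matters here: without it, `prop.table()` would divide by the grand total and the result would describe each cell's share of all titles, not each type's share within a country.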
# Casts
Cast <- netflix_clean$cast
Cast <- unlist(strsplit(as.character(Cast), ", "))
cast_list <- as.data.frame(table(Cast)) %>% arrange(desc(Freq))
colnames(cast_list) <- c("Cast", "Total Shows Starred")
datatable(cast_list)

Based on the plot and the table, Indian actors and actresses dominate the top positions, even though, as the previous section showed, the United States is the country that produces the most movies / shows on Netflix. This could indicate an unequal distribution of roles among actors and actresses in the Indian movie industry, with a relatively small group appearing in many titles.
The most frequently featured cast member in Netflix movies / shows is Anupam Kher with 43 shows, followed by Shah Rukh Khan with 34 shows and Naseeruddin Shah with 31 shows. Since this list is dominated by Indian actors and actresses, we now look at the top cast list for the country that produces the most movies / shows on Netflix, the United States.
# Cast of USA
# Note: separate() with into = "cast" keeps only the first-listed cast member
# of each title and discards the rest, which triggers the warning below.
cast_USA <- netflix_clean %>%
  filter(grepl('United States', country)) %>%
  select(cast) %>%
  separate(cast, into = "cast" , sep = ", ") %>%
  group_by(cast) %>%
  summarise(n = n()) %>%
  arrange(desc(n))
## Warning: Expected 1 pieces. Additional pieces discarded in 2785 rows [1, 2, 3,
## 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

Based on the plot above, Adam Sandler is the most frequent first-listed cast member in Netflix movies / shows from the United States with 20 shows, followed by Nicolas Cage with 14 shows and David Attenborough with 11 shows. Because the code keeps only the first name in each cast list, these counts cover leading credits rather than every appearance.
In this section, we look at the duration of shows on Netflix. Since duration is measured differently for movies (in minutes) and TV shows (in seasons), we look at them separately.
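The warning "Additional pieces discarded" in the output above means that only the first name of each cast list is kept, so the counts refer to first-listed cast members. The base R sketch below contrasts the two ways of counting on a hypothetical vector `toy_cast`; the names in it are invented for illustration.

```r
# Hypothetical cast lists, one string per title
toy_cast <- c("Ann Lee, Bob Ray", "Bob Ray, Cy Dee", "Bob Ray")

# Split each cast list into its individual names
pieces <- strsplit(toy_cast, ", ", fixed = TRUE)

# Variant 1: keep only the first-listed name per title
first_billed <- vapply(pieces, function(p) p[1], character(1))

# Variant 2: keep every name from every title
all_members <- unlist(pieces)

first_counts <- table(first_billed)
all_counts   <- table(all_members)

print(first_counts)
print(all_counts)
```

In this toy example the same person is counted twice under the first variant and three times under the second, so the ranking of a real cast list may differ depending on which variant is used.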
# Movie Duration
duration_mov <- netflix_clean %>%
filter(type == "Movie") %>%
select(duration) %>%
group_by(duration) %>%
summarise(n = n())
duration_mov$duration <- gsub(" min", "" , as.character(duration_mov$duration))
duration_mov$duration <- as.numeric(duration_mov$duration)
duration_mov <- duration_mov %>%
  arrange(duration)

From the plot, we can see that most movies on Netflix run around 90 - 100 minutes. There are also some short movies with a duration below 30 minutes, and one outlier with a duration of 312 minutes, titled “Black Mirror: Bandersnatch”.
# TV Show Duration
duration_tv <- netflix_clean %>%
filter(type == "TV Show") %>%
select(duration) %>%
group_by(duration) %>%
summarise(n = n())
duration_tv$duration <- gsub(" Seasons", "" , as.character(duration_tv$duration))
duration_tv$duration <- gsub(" Season", "" , as.character(duration_tv$duration))
duration_tv$duration <- as.numeric(duration_tv$duration)
duration_tv <- duration_tv %>%
  arrange(duration)
colnames(duration_tv) <- c("Duration (in Season)", "Total TV Show")
datatable(duration_tv)

From the plot, we can see that most TV shows on Netflix have only 1 season: 1,252 TV shows, a large share considering that the total number of TV shows on Netflix is only 2,013. Meanwhile, the longest TV show on Netflix, “Grey’s Anatomy”, has 17 seasons.
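Both duration formats ("94 min" and "1 Season" / "17 Seasons") can be parsed with a single pattern instead of separate `gsub()` calls. This is a minimal base R sketch on a hypothetical vector `toy_duration`; it strips everything after the first space to get the number and derives the unit from the suffix.

```r
# Hypothetical duration strings in the two formats used by the data set
toy_duration <- c("94 min", "1 Season", "17 Seasons", "312 min")

# Numeric part: remove the first space and everything after it
value <- as.numeric(sub(" .*$", "", toy_duration))

# Unit: strings ending in "min" are minutes, the rest are seasons
unit <- ifelse(grepl("min$", toy_duration), "minutes", "seasons")

print(data.frame(duration = toy_duration, value = value, unit = unit))
```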
Based on the data used in this report and the data transformation & visualization process that has been done, we can conclude that:
Most shows on Netflix are movies, with a ratio of movies to TV shows of about 72.4% : 27.6% and a total of 5,277 movies and 2,013 TV shows.
The top 3 movie genres provided by Netflix are International Movies with 2,392 movies, followed by Dramas with 2,309 and Comedies with 1,574. Meanwhile, the top 3 TV show genres are International TV Shows with 1,047 shows, followed by TV Dramas with 657 and TV Comedies with 480.
2019 is the year in which Netflix added the most movies and TV shows to its platform, with a total of 1,722. Since 2020, however, the number has decreased. It is worth noting that more movies than TV shows were added in every year.
More movies than TV shows on Netflix were released in every year. No TV show on Netflix was released before 1963, and the oldest movie on Netflix was released in 1942. Meanwhile, 2018 is the year in which most of the movies / shows on Netflix were released, with a total of 653 movies and 282 TV shows.
TV-MA is the rating of most movies and TV shows on Netflix, followed by TV-14 and R, which suggests that Netflix's main target audience is young adults and adults.
The United States is the country that produces the most movies / shows on Netflix, with a total of 3,274, followed by India with 1,007 and the United Kingdom with 707.
South Korea and Japan mostly produce TV shows rather than movies: 74% of South Korea’s production and 62.9% of Japan’s production are TV shows.
Indian actors and actresses dominate the top cast positions, even though the United States (not India) produces the most movies / shows on Netflix. This could indicate an unequal distribution of roles among actors and actresses in the Indian movie industry.
Adam Sandler is the most frequent first-listed cast member in Netflix movies / shows from the United States with 20 shows, followed by Nicolas Cage with 14 shows and David Attenborough with 11 shows.
Most movies on Netflix run around 90 - 100 minutes, while most TV shows on Netflix have only 1 season.