This data comes from Kaggle, where it was scraped from Flixable, a third-party Netflix catalogue. In this document I visualize the data to extract insights that are hard or even impossible to see when it is presented only as raw data.
The goal of this project is to create various visualizations of the movies in the Netflix catalogue for further analysis.
# Data Preparation
As usual, import the libraries needed.
library(ggplot2)
library(dplyr)
library(lubridate)
library(stringr)
Import the data.
nf <- read.csv('netflix_titles.csv')
head(nf, 2)
Assign the right data type to each column and create new columns derived from existing ones.
nf <- nf %>%
mutate(type = as.factor(type),
country = as.factor(country),
date_added = mdy(date_added),
listed_in = as.character(listed_in),
year_added = as.integer(year(date_added)),
month_added = as.factor(month(date_added)),
rating = as.factor(rating),
durationinminute = as.integer(str_remove(duration, 'min')))
head(nf, 2)
Check the data.
summary(nf)
#> show_id type title director
#> Min. : 247747 Movie :4265 Length:6234 Length:6234
#> 1st Qu.:80035802 TV Show:1969 Class :character Class :character
#> Median :80163367 Mode :character Mode :character
#> Mean :76703679
#> 3rd Qu.:80244889
#> Max. :81235729
#>
#> cast country date_added release_year
#> Length:6234 United States :2032 Min. :2008-01-01 Min. :1925
#> Class :character India : 777 1st Qu.:2017-10-01 1st Qu.:2013
#> Mode :character : 476 Median :2018-09-30 Median :2016
#> United Kingdom: 348 Mean :2018-07-01 Mean :2013
#> Japan : 176 3rd Qu.:2019-06-08 3rd Qu.:2018
#> Canada : 141 Max. :2020-01-18 Max. :2020
#> (Other) :2284 NA's :11
#> rating duration listed_in description
#> TV-MA :2027 Length:6234 Length:6234 Length:6234
#> TV-14 :1698 Class :character Class :character Class :character
#> TV-PG : 701 Mode :character Mode :character Mode :character
#> R : 508
#> PG-13 : 286
#> NR : 218
#> (Other): 796
#> year_added month_added durationinminute
#> Min. :2008 12 : 696 Min. : 3.0
#> 1st Qu.:2017 10 : 646 1st Qu.: 86.0
#> Median :2018 11 : 612 Median : 98.0
#> Mean :2018 1 : 610 Mean : 99.1
#> 3rd Qu.:2019 3 : 551 3rd Qu.:115.0
#> Max. :2020 (Other):3108 Max. :312.0
#> NA's :11 NA's : 11 NA's :1969
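The NA counts in this summary come out of the type conversions. The sketch below replays the same conversions on a made-up three-row data frame (the `toy` rows are invented for illustration and are not taken from the dataset; the empty date string is an assumption about how a missing date appears in the CSV). It shows how `str_remove()` plus `as.integer()` turns a movie duration into minutes and a TV-show duration into `NA`, and how `mdy()` returns `NA` for a date it cannot parse.

```r
library(stringr)
library(lubridate)

# Hypothetical rows, invented for illustration only
toy <- data.frame(
  type       = c("Movie", "TV Show", "Movie"),
  date_added = c("September 9, 2019", "January 1, 2018", ""),
  duration   = c("90 min", "1 Season", "125 min"),
  stringsAsFactors = FALSE
)

# "90 min" -> "90 " -> 90; "1 Season" contains no "min" and is not a number -> NA
toy$durationinminute <- suppressWarnings(
  as.integer(str_remove(toy$duration, "min"))
)

# A string that is not a month-day-year date (here: empty) -> NA
toy$date_added <- suppressWarnings(mdy(toy$date_added))

toy
```

This would explain why the number of NA values in durationinminute equals the number of TV shows: their duration is given in seasons, not minutes.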
The date_added column has missing values, but only 11 of the 6234 rows are affected, so it is safe to remove them. The range of years is narrowed later, at plotting time.
nfc <- subset(nf, !is.na(date_added))
Great, now we can move on to the next part: processing and visualizing!
Let’s start with the easiest and most obvious question: how many titles of each type (Movie and TV Show) did Netflix add to its catalogue each year?
catadd <- nfc %>% group_by(type, year_added) %>% summarise(count = n()) %>% ungroup() %>% arrange(year_added)
catadd
Netflix has been adding titles to its catalogue since 2008, but the numbers only become meaningful from 2015 onward. The data was also collected in early 2020, so that year is incomplete. I will therefore subset the data to year_added 2015 through 2019.
catadd %>% filter(year_added >= 2015, year_added <= 2019) %>% ggplot(aes(x=year_added, y=count))+
geom_col(aes(fill = type), position = 'dodge')+
# group = type makes the labels dodge together with their bars
geom_text(aes(label = count, group = type), position = position_dodge(width = 0.9), vjust=-0.5)+
scale_y_continuous(limits = c(0, 1600))+
labs(y='', x='Year', title = 'Number of content added to Netflix')+
theme(plot.background = element_rect(fill='white'),
panel.background = element_rect(fill='white'),
panel.grid = element_blank(),
plot.title = element_text(color = 'red', hjust=0.5, size=16, face='bold'))
It seems Netflix is really boosting the number of movies in its catalogue each year. Remember, though, that a TV show consists of many episodes and often multiple seasons.
Let's focus the analysis on movies. We also filter out anything with a duration of 50 minutes or less, since content that short should be a TV show.
nfmov <- nfc %>% filter(type == 'Movie', durationinminute > 50)
summary(nfmov)
#> show_id type title director
#> Min. : 247747 Movie :4078 Length:4078 Length:4078
#> 1st Qu.:70301392 TV Show: 0 Class :character Class :character
#> Median :80157890 Mode :character Mode :character
#> Mean :75480328
#> 3rd Qu.:80990812
#> Max. :81235729
#>
#> cast country date_added release_year
#> Length:4078 United States :1383 Min. :2008-01-01 Min. :1942
#> Class :character India : 721 1st Qu.:2017-10-13 1st Qu.:2011
#> Mode :character : 163 Median :2018-09-15 Median :2016
#> United Kingdom: 151 Mean :2018-07-07 Mean :2012
#> Canada : 80 3rd Qu.:2019-06-06 3rd Qu.:2017
#> Spain : 79 Max. :2020-01-18 Max. :2020
#> (Other) :1501
#> rating duration listed_in description
#> TV-MA :1317 Length:4078 Length:4078 Length:4078
#> TV-14 :1009 Class :character Class :character Class :character
#> R : 506 Mode :character Mode :character Mode :character
#> TV-PG : 398
#> PG-13 : 286
#> NR : 197
#> (Other): 365
#> year_added month_added durationinminute
#> Min. :2008 1 : 439 Min. : 51
#> 1st Qu.:2017 12 : 438 1st Qu.: 88
#> Median :2018 10 : 423 Median : 99
#> Mean :2018 11 : 402 Mean :102
#> 3rd Qu.:2019 3 : 377 3rd Qu.:116
#> Max. :2020 8 : 320 Max. :312
#> (Other):1679
nfmov %>% ggplot(aes(x=durationinminute))+
geom_histogram()+
scale_y_continuous(limits = c(0, 1000))+
labs(y='count', x='Duration (Minutes)', title = 'Movie Duration on Netflix')+
theme(plot.background = element_rect(fill='white'),
panel.background = element_rect(fill='white'),
panel.grid = element_blank(),
plot.title = element_text(color = 'red', hjust=0.5, size=16, face='bold'))
There, we can see that most movies on Netflix run about 100 minutes, which matches the summary (median 99, mean 102).
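To back the reading of the histogram with numbers, one option is to report the median and the share of titles that fall in a band around it. This is only a sketch: the helper name `duration_share()` and the sample vector are hypothetical, and on the real data the helper would be called on `nfmov$durationinminute`.

```r
# Hypothetical helper: share of values within [lo, hi], ignoring NAs
duration_share <- function(x, lo, hi) {
  x <- x[!is.na(x)]
  mean(x >= lo & x <= hi)
}

# Made-up durations in minutes, for illustration only
toy_minutes <- c(85, 92, 99, 101, 110, 118, 150, NA)

# Middle value of the non-missing durations
toy_median <- median(toy_minutes, na.rm = TRUE)

# Fraction of titles that run between 80 and 120 minutes
toy_share <- duration_share(toy_minutes, 80, 120)

toy_median
toy_share
```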
Let's see which content ratings are most common in the catalogue so far, so we can judge the trend going forward.
nfprating <- nfmov %>% group_by(rating) %>% summarise(count = n()) %>% ungroup() %>% arrange(desc(count))
nfprating %>% ggplot(aes(x=reorder(rating, count), y=count))+
geom_col()
It makes sense that TV-MA ranks first on Netflix: TV-MA content is often not allowed on public broadcast, or gets censored so heavily that the point of the content is lost.
nfryc <- nfmov %>% group_by(rating, year_added) %>% summarise(count = n()) %>% ungroup() %>% filter(rating %in% c('TV-MA','TV-14','R'), year_added > 2014, year_added < 2020)
nfryc %>% ggplot(aes(x=year_added, y=count))+
geom_line(aes(color = rating))
All three content ratings show an upward trend of varying strength. TV-MA rises strongly up to 2018 and then declines in 2019. TV-14, on the other hand, keeps rising, although more slowly from 2018 to 2019. R starts low but rises steadily over the years, with a significant increase in 2019.
It looks like R-rated movies are gaining significant popularity, at least within the Netflix catalogue.
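The trend reading above is made by eye from the line chart. One way to put numbers on it is to compute the year-over-year change per rating. The sketch below does this on a made-up table: `toy_counts` and its values are invented for illustration and are not the counts in nfryc, which has the same three columns and could be passed through the same pipeline.

```r
library(dplyr)

# Hypothetical counts, invented for illustration; not the values from nfryc
toy_counts <- data.frame(
  rating     = rep(c("R", "TV-MA"), each = 3),
  year_added = rep(2017:2019, times = 2),
  count      = c(50, 60, 120, 300, 400, 350)
)

# Year-over-year change within each rating:
# this year's count minus the previous year's count, also as a percentage
yoy <- toy_counts %>%
  group_by(rating) %>%
  arrange(year_added, .by_group = TRUE) %>%
  mutate(change     = count - dplyr::lag(count),
         pct_change = 100 * change / dplyr::lag(count)) %>%
  ungroup()

yoy
```

The first year of each rating has no previous year to compare against, so its change is `NA`. A negative change marks a decline like the one described for TV-MA in 2019.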