1 Intro

This data comes from Kaggle, where it was scraped from Flixable, a third-party Netflix catalogue. In this document, I visualize the data to extract useful insights that are hard or even impossible to see when it is presented only as raw data.

1.1 Project Goal

The goal of this project is to create various visualizations of the titles in the Netflix catalogue for further analysis.

1.2 Data Dictionary

  • show_id : ID number of each title.
  • type : content type, Movie or TV Show.
  • director : name of the director.
  • cast : names of the cast members involved.
  • country : country in which the content is available.
  • date_added : date the content was added to the Netflix catalogue.
  • release_year : release year of the content.
  • rating : assigned TV rating.
  • duration : length of the content from start to finish.
  • listed_in : catalogue section the content is listed in, not necessarily a genre.

Data Preparation

As usual, load the required libraries.

library(ggplot2)
library(dplyr)
library(lubridate)
library(stringr)

Import data.

nf <- read.csv('netflix_titles.csv')
head(nf, 2)

Assign the right data type to each column and derive new columns from existing ones.

nf <- nf %>%
  mutate(type = as.factor(type),
         country = as.factor(country),
         date_added = mdy(date_added),
         listed_in = as.character(listed_in),
         year_added = as.integer(year(date_added)),
         month_added = as.factor(month(date_added)),
         rating = as.factor(rating),
         durationinminute = as.integer(str_remove(duration, 'min')))
head(nf, 2)
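
The durationinminute conversion is worth a closer look, because it explains the 1969 NA's that show up in the summary below (the same number as the TV Show count). A minimal sketch, using a hypothetical vector of duration strings in the same shape as the duration column:

```r
library(stringr)

# Hypothetical duration strings shaped like the `duration` column
duration <- c("90 min", "1 Season", "123 min", "2 Seasons")

# Remove the unit, then convert to integer. Strings that are still
# non-numeric after the removal (the season counts) become NA, and R
# emits a coercion warning that is silenced here.
minutes <- suppressWarnings(as.integer(str_remove(duration, "min")))
minutes

# Number of entries that could not be read as minutes
sum(is.na(minutes))
```

So the NA's in durationinminute are most likely TV shows whose duration is given in seasons, not missing data.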

Check the data.

summary(nf)
#>     show_id              type         title             director        
#>  Min.   :  247747   Movie  :4265   Length:6234        Length:6234       
#>  1st Qu.:80035802   TV Show:1969   Class :character   Class :character  
#>  Median :80163367                  Mode  :character   Mode  :character  
#>  Mean   :76703679                                                       
#>  3rd Qu.:80244889                                                       
#>  Max.   :81235729                                                       
#>                                                                         
#>      cast                     country       date_added          release_year 
#>  Length:6234        United States :2032   Min.   :2008-01-01   Min.   :1925  
#>  Class :character   India         : 777   1st Qu.:2017-10-01   1st Qu.:2013  
#>  Mode  :character                 : 476   Median :2018-09-30   Median :2016  
#>                     United Kingdom: 348   Mean   :2018-07-01   Mean   :2013  
#>                     Japan         : 176   3rd Qu.:2019-06-08   3rd Qu.:2018  
#>                     Canada        : 141   Max.   :2020-01-18   Max.   :2020  
#>                     (Other)       :2284   NA's   :11                         
#>      rating       duration          listed_in         description       
#>  TV-MA  :2027   Length:6234        Length:6234        Length:6234       
#>  TV-14  :1698   Class :character   Class :character   Class :character  
#>  TV-PG  : 701   Mode  :character   Mode  :character   Mode  :character  
#>  R      : 508                                                           
#>  PG-13  : 286                                                           
#>  NR     : 218                                                           
#>  (Other): 796                                                           
#>    year_added    month_added   durationinminute
#>  Min.   :2008   12     : 696   Min.   :  3.0   
#>  1st Qu.:2017   10     : 646   1st Qu.: 86.0   
#>  Median :2018   11     : 612   Median : 98.0   
#>  Mean   :2018   1      : 610   Mean   : 99.1   
#>  3rd Qu.:2019   3      : 551   3rd Qu.:115.0   
#>  Max.   :2020   (Other):3108   Max.   :312.0   
#>  NA's   :11     NA's   :  11   NA's   :1969

The date_added column has missing values, but only in 11 of the 6,234 rows, so it is safe to remove those rows. The year range is narrowed later, inside the filter of each plot.

nfc <- subset(nf, !is.na(date_added))
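
As a quick check of how dropping rows on a missing date behaves, here is a small sketch on a toy data frame. The toy object and its values are hypothetical and only illustrate the `!is.na()` pattern:

```r
# Toy data frame with hypothetical rows, one of them missing a date
toy <- data.frame(
  title = c("A", "B", "C"),
  date_added = as.Date(c("2019-01-01", NA, "2016-05-10"))
)

# Keep only the rows where date_added is present
toy_complete <- subset(toy, !is.na(date_added))

nrow(toy)           # rows before filtering
nrow(toy_complete)  # rows after filtering
```

Comparing the two row counts is a cheap way to confirm that only the rows with a missing date were dropped.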

Great, now we can move on to the next part: processing and visualizing!

2 Data Processing and Visualization

2.1 How Much Content Is Added Each Year?

Let's start with the easiest and most obvious question: how much content of each type (Movie and TV Show) did Netflix add to its catalogue each year?

catadd <- nfc %>% group_by(type, year_added) %>% summarise(count = n()) %>% ungroup() %>% arrange(year_added)
catadd
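
The group_by() / summarise() / ungroup() chain above can also be written with dplyr's count(), which does the grouping, tallying, and ungrouping in one call. A small sketch on a hypothetical mini catalogue (the toy data is made up for illustration):

```r
library(dplyr)

# Hypothetical mini catalogue, not the real Netflix data
toy <- data.frame(
  type = c("Movie", "Movie", "TV Show", "Movie", "TV Show"),
  year_added = c(2018L, 2018L, 2018L, 2019L, 2019L)
)

# count() groups by the given columns, tallies the rows,
# and returns an ungrouped result
toy_counts <- toy %>%
  count(type, year_added, name = "count") %>%
  arrange(year_added)
toy_counts
```

Either form should give the same table; count() is shorter and avoids the grouping message that summarise() can print.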

The earliest additions in this data date from 2008, but the catalogue only started growing in meaningful numbers from 2015. The data was also collected in early 2020, so that year is incomplete. For both reasons, the plot is limited to year_added 2015 through 2019.

catadd %>% filter(year_added >= 2015, year_added <=2019) %>% ggplot(aes(x=year_added, y=count))+
  geom_col(aes(fill = type), position = 'dodge')+
  geom_text(aes(label = count, group = type), position = position_dodge(width = 0.9), vjust = -0.5)+
  scale_y_continuous(limits = c(0, 1600))+
  labs(y='', x='Year', title = 'Number of content added to Netflix')+
  theme(plot.background = element_rect(fill='white'),
        panel.background = element_rect(fill='white'), 
        panel.grid = element_blank(),
        plot.title = element_text(color = 'red', hjust=0.5, size=16, face='bold'))

Netflix appears to be adding more movies to its catalogue each year. Keep in mind, though, that a single TV show consists of many episodes and often multiple seasons.

2.2 How Long are the movies?

Let's focus the analysis on Movie content. We also keep only titles longer than 50 minutes, since anything shorter is more likely to be TV-style content.

nfmov <- nfc %>% filter(type == 'Movie', durationinminute > 50)
summary(nfmov)
#>     show_id              type         title             director        
#>  Min.   :  247747   Movie  :4078   Length:4078        Length:4078       
#>  1st Qu.:70301392   TV Show:   0   Class :character   Class :character  
#>  Median :80157890                  Mode  :character   Mode  :character  
#>  Mean   :75480328                                                       
#>  3rd Qu.:80990812                                                       
#>  Max.   :81235729                                                       
#>                                                                         
#>      cast                     country       date_added          release_year 
#>  Length:4078        United States :1383   Min.   :2008-01-01   Min.   :1942  
#>  Class :character   India         : 721   1st Qu.:2017-10-13   1st Qu.:2011  
#>  Mode  :character                 : 163   Median :2018-09-15   Median :2016  
#>                     United Kingdom: 151   Mean   :2018-07-07   Mean   :2012  
#>                     Canada        :  80   3rd Qu.:2019-06-06   3rd Qu.:2017  
#>                     Spain         :  79   Max.   :2020-01-18   Max.   :2020  
#>                     (Other)       :1501                                      
#>      rating       duration          listed_in         description       
#>  TV-MA  :1317   Length:4078        Length:4078        Length:4078       
#>  TV-14  :1009   Class :character   Class :character   Class :character  
#>  R      : 506   Mode  :character   Mode  :character   Mode  :character  
#>  TV-PG  : 398                                                           
#>  PG-13  : 286                                                           
#>  NR     : 197                                                           
#>  (Other): 365                                                           
#>    year_added    month_added   durationinminute
#>  Min.   :2008   1      : 439   Min.   : 51     
#>  1st Qu.:2017   12     : 438   1st Qu.: 88     
#>  Median :2018   10     : 423   Median : 99     
#>  Mean   :2018   11     : 402   Mean   :102     
#>  3rd Qu.:2019   3      : 377   3rd Qu.:116     
#>  Max.   :2020   8      : 320   Max.   :312     
#>                 (Other):1679

nfmov %>% ggplot(aes(x=durationinminute))+
  geom_histogram()+
  scale_y_continuous(limits = c(0, 1000))+
  labs(y='count', x='Duration (Minutes)', title = 'Movie Duration on Netflix')+
  theme(plot.background = element_rect(fill='white'),
        panel.background = element_rect(fill='white'), 
        panel.grid = element_blank(),
        plot.title = element_text(color = 'red', hjust=0.5, size=16, face='bold'))

Here we can see that most movies on Netflix run about 100 minutes: the median is 99 minutes, and half of all titles fall between 88 and 116 minutes.
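
The "typical" runtime can also be read off numerically with median() and quantile(), without going through summary(). A minimal sketch on a hypothetical vector of runtimes (the values are made up and are not taken from the Netflix data):

```r
# Hypothetical movie runtimes in minutes
runtime <- c(62, 85, 90, 95, 99, 101, 104, 110, 118, 150)

# The median is the middle of the distribution
med <- median(runtime)

# The 25th and 75th percentiles bound the middle half of the titles
iqr_bounds <- quantile(runtime, probs = c(0.25, 0.75))

med
iqr_bounds
```

On the real data, the same two calls on nfmov$durationinminute should reproduce the Median, 1st Qu., and 3rd Qu. entries from the summary above.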

2.3 Who is the most prominent target audience?

Let's see which content ratings are most common in the catalogue so far, so we can judge where the trend is heading.

nfprating <- nfmov %>% group_by(rating) %>% summarise(count = n()) %>% ungroup() %>% arrange(desc(count))
nfprating %>% ggplot(aes(x=reorder(rating, count), y=count))+
  geom_col()

It makes sense that TV-MA ranks first on Netflix: TV-MA content is often not allowed on public broadcast, or it gets censored so heavily that the content loses its point.

nfryc <- nfmov %>% group_by(rating, year_added) %>% summarise(count = n()) %>% ungroup() %>% filter(rating %in% c('TV-MA','TV-14','R'), year_added > 2014, year_added < 2020)
nfryc %>% ggplot(aes(x=year_added, y=count))+
  geom_line(aes(color = rating))

All three content ratings show an upward trend of varying strength. TV-MA rises strongly up to 2018 and then declines in 2019. TV-14 keeps rising, although more slowly from 2018 to 2019. R starts low, increases steadily over the years, and jumps significantly in 2019.
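
To put numbers on the "strength" of each trend, the year-over-year change per rating can be computed with lag(). This is only a sketch under stated assumptions: the counts below are hypothetical and are not the real Netflix figures.

```r
library(dplyr)

# Hypothetical yearly counts for two ratings (not the real Netflix numbers)
toy <- data.frame(
  rating = rep(c("R", "TV-MA"), each = 3),
  year_added = rep(2017:2019, times = 2),
  count = c(50L, 60L, 90L, 300L, 400L, 380L)
)

# Within each rating, subtract the previous year's count;
# the first year of each rating has no predecessor and gets NA
toy_change <- toy %>%
  group_by(rating) %>%
  arrange(year_added, .by_group = TRUE) %>%
  mutate(change = count - lag(count)) %>%
  ungroup()
toy_change
```

Applied to nfryc, a positive change means more titles were added than in the previous year, and a negative one flags a decline like the one described for TV-MA in 2019.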

It looks like R-rated movies are becoming significantly more common, at least within the Netflix catalogue.