Netflix: Data Visualization and Key Insights

1. Overview

Netflix is the world’s leading streaming entertainment service with 193 million paid memberships in over 190 countries enjoying TV series, documentaries and feature films across a wide variety of genres and languages.

In 2018, Netflix released an interesting report showing that the number of TV shows on the streaming service has nearly tripled since 2010, while its number of movies has decreased by more than 2,000 titles. It will be interesting to explore what other insights can be obtained from the same dataset.

In this DataViz, we will conduct a short exploratory analysis of the Netflix dataset, which is available on Kaggle. This dataset consists of TV shows and movies available on Netflix as of 2019. The dataset was collected from Flixable, a third-party Netflix search engine.

This DataViz is presented with the author’s love for TV shows and movies, and aims to find interesting insights about the Netflix dataset through visualizations made with the Plotly package for R.

2. Dataset

The dataset contains 6,234 observations of the following 12 variables describing the TV shows and movies:

show_id - Unique ID for every Movie / TV Show
type - Identifier, Movie or TV Show
title - Title of the Movie / TV Show
director - Director of the Movie
cast - Actors involved in the movie / show
country - Country where the movie / show was produced
date_added - Date it was added on Netflix
release_year - Actual release year of the movie / show
rating - TV Rating of the movie / show
duration - Total Duration, in minutes or in number of seasons
listed_in - Genre
description - The summary description

3. Sketch of Proposed DataViz Design

4. Data Preparation Steps

4.1. Load R packages and Data

First, load the necessary R packages in RStudio.

tidyverse contains a set of essential packages for data manipulation and exploration.
kableExtra to build common complex tables and manipulate table styles.
plotly to create interactive web graphics, either from ggplot2 graphs or directly with plot_ly().
lubridate to make it easier to work with dates and times.

packages = c('tidyverse', 'kableExtra', 'plotly', 'lubridate')

for (p in packages){
  if (!require(p,character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

Load the data.

df <- read_csv('data/netflix_titles.csv')

4.2. Data Cleaning

4.2.1. Variable Reduction

First, we remove uninformative variables from the dataset; in this case, that is the show_id variable. The description variable will not be used in this exploratory data analysis either, although it could be used in further analysis to find similar movies and TV shows through text similarity; that is out of scope here.

# remove uninformative variables
df <- tibble::as_tibble(df) %>% 
  select(-c(show_id, description))

# check the data
kable(df[1:10,]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  scroll_box(width = "100%", height = "500px")

type	title	director	cast	country	date_added	release_year	rating	duration	listed_in
Movie	Norm of the North: King Sized Adventure	Richard Finn, Tim Maltby	Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Durupt, Maya Kay, Michael Dobson	United States, India, South Korea, China	September 9, 2019	2019	TV-PG	90 min	Children & Family Movies, Comedies
Movie	Jandino: Whatever it Takes	NA	Jandino Asporaat	United Kingdom	September 9, 2016	2016	TV-MA	94 min	Stand-Up Comedy
TV Show	Transformers Prime	NA	Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton, Steve Blum, Andy Pessoa, Ernie Hudson, Daran Norris, Will Friedle	United States	September 8, 2018	2013	TV-Y7-FV	1 Season	Kids’ TV
TV Show	Transformers: Robots in Disguise	NA	Will Friedle, Darren Criss, Constance Zimmer, Khary Payton, Mitchell Whitfield, Stuart Allan, Ted McGinley, Peter Cullen	United States	September 8, 2018	2016	TV-Y7	1 Season	Kids’ TV
Movie	#realityhigh	Fernando Lebrija	Nesta Cooper, Kate Walsh, John Michael Higgins, Keith Powers, Alicia Sanz, Jake Borelli, Kid Ink, Yousef Erakat, Rebekah Graf, Anne Winters, Peter Gilroy, Patrick Davis	United States	September 8, 2017	2017	TV-14	99 min	Comedies
TV Show	Apaches	NA	Alberto Ammann, Eloy Azorín, Verónica Echegui, Lucía Jiménez, Claudia Traisac	Spain	September 8, 2017	2016	TV-MA	1 Season	Crime TV Shows, International TV Shows, Spanish-Language TV Shows
Movie	Automata	Gabe Ibáñez	Antonio Banderas, Dylan McDermott, Melanie Griffith, Birgitte Hjort Sørensen, Robert Forster, Christa Campbell, Tim McInnerny, Andy Nyman, David Ryall	Bulgaria, United States, Spain, Canada	September 8, 2017	2014	R	110 min	International Movies, Sci-Fi & Fantasy, Thrillers
Movie	Fabrizio Copano: Solo pienso en mi	Rodrigo Toro, Francisco Schultz	Fabrizio Copano	Chile	September 8, 2017	2017	TV-MA	60 min	Stand-Up Comedy
TV Show	Fire Chasers	NA	NA	United States	September 8, 2017	2017	TV-MA	1 Season	Docuseries, Science & Nature TV
Movie	Good People	Henrik Ruben Genz	James Franco, Kate Hudson, Tom Wilkinson, Omar Sy, Sam Spruell, Anna Friel, Thomas Arnold, Oliver Dimsdale, Diana Hardcastle, Michael Jibson, Diarmaid Murtagh	United States, United Kingdom, Denmark, Sweden	September 8, 2017	2014	R	90 min	Action & Adventure, Thrillers

4.2.2. Handling Missing Values

Check if we have missing values in the dataset.

# print number of missing values for each variable
data.frame(var = c(colnames(df)), 
           missing = sapply(df, function(x) sum(is.na(x))), row.names = NULL) %>%
  mutate(missing = cell_spec(missing, "html", 
                             color = ifelse(missing > 0, 'red', 'black'))) %>% 
  rename(`Variable` = var, `Missing Value Count` = missing) %>%
  kable(format = "html", escape = F, align = c("l", "c")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)

Variable	Missing Value Count
type	0
title	0
director	1969
cast	570
country	476
date_added	11
release_year	0
rating	10
duration	0
listed_in	0
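
The counting step can be separated from the table styling. As a minimal sketch on a toy data frame (the values below are hypothetical and not taken from the dataset), colSums(is.na(...)) gives the same per-variable counts as the sapply() call above:

```r
# toy data frame with a few missing values (hypothetical data)
toy <- data.frame(
  director = c("A", NA, NA),
  cast = c("x", "y", NA),
  release_year = c(2017, 2018, 2019),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; summing each column counts the NAs
missing_count <- colSums(is.na(toy))
print(missing_count)
```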

From the output above, we see that the variables director, cast, country, date_added and rating have missing values. Since rating is a categorical variable with 14 levels, we can approximate its missing values with the mode, using the getmode() function.

# function to find a mode
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}
df$rating[is.na(df$rating)] <- getmode(df$rating)
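
To see what this imputation does, here is a small sketch on a toy vector (hypothetical ratings, not the real data). It reuses the getmode() function from above unchanged:

```r
# function to find the mode (same definition as in the text)
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# toy ratings vector with two missing values (hypothetical data)
toy_rating <- c("TV-MA", "TV-14", NA, "TV-MA", "R", NA, "TV-MA")

# NA is counted like any other value, but "TV-MA" is more frequent here
toy_mode <- getmode(toy_rating)

# replace the missing entries with the mode
toy_rating[is.na(toy_rating)] <- toy_mode
print(toy_rating)
```

Note that getmode() treats NA as a value of its own, so it would return NA if missing values were the most frequent entry. That is not the case here, with only 10 missing ratings, but passing v[!is.na(v)] would guard against it.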

Unlike rating, the missing values for the variables director, cast, country and date_added cannot be easily approximated. So for now, we will continue without filling them and drop these missing values only at the points where it becomes necessary.

We also drop duplicated rows in the dataset based on the type, title, country, release_year variables.

# drop duplicated rows based on the title, country, type and release_year
df <- distinct(df, type, title, country, release_year, .keep_all = T)
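
As a quick illustration of how distinct() behaves with .keep_all, here is a sketch on a toy tibble (hypothetical rows). Rows are compared only on the listed columns, and the first row of each duplicate set is kept with all of its other columns:

```r
library(dplyr)

# toy data: the first two rows share type, title, country and release_year
toy <- tibble(
  type = c("Movie", "Movie", "TV Show"),
  title = c("A", "A", "A"),
  country = c("Spain", "Spain", "Spain"),
  release_year = c(2017, 2017, 2017),
  rating = c("R", "TV-MA", "TV-14")
)

# keep one row per (type, title, country, release_year) combination
toy_dedup <- distinct(toy, type, title, country, release_year, .keep_all = TRUE)
print(toy_dedup)
```

The second row is dropped even though its rating differs, because rating is not one of the columns used for the comparison.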

4.2.3. Date Formatting

Convert the date_added variable from text to the Date class to make later manipulation easier.

# change date format 
df$date_added <- as.Date(df$date_added, format = "%B %d, %Y")

# check the date format
kable(df[1:10,]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>% 
  column_spec(6, color = 'red', bold = T) %>% 
  scroll_box(width = "100%", height = "352px")

type	title	director	cast	country	date_added	release_year	rating	duration	listed_in
Movie	Norm of the North: King Sized Adventure	Richard Finn, Tim Maltby	Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Durupt, Maya Kay, Michael Dobson	United States, India, South Korea, China	2019-09-09	2019	TV-PG	90 min	Children & Family Movies, Comedies
Movie	Jandino: Whatever it Takes	NA	Jandino Asporaat	United Kingdom	2016-09-09	2016	TV-MA	94 min	Stand-Up Comedy
TV Show	Transformers Prime	NA	Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton, Steve Blum, Andy Pessoa, Ernie Hudson, Daran Norris, Will Friedle	United States	2018-09-08	2013	TV-Y7-FV	1 Season	Kids’ TV
TV Show	Transformers: Robots in Disguise	NA	Will Friedle, Darren Criss, Constance Zimmer, Khary Payton, Mitchell Whitfield, Stuart Allan, Ted McGinley, Peter Cullen	United States	2018-09-08	2016	TV-Y7	1 Season	Kids’ TV
Movie	#realityhigh	Fernando Lebrija	Nesta Cooper, Kate Walsh, John Michael Higgins, Keith Powers, Alicia Sanz, Jake Borelli, Kid Ink, Yousef Erakat, Rebekah Graf, Anne Winters, Peter Gilroy, Patrick Davis	United States	2017-09-08	2017	TV-14	99 min	Comedies
TV Show	Apaches	NA	Alberto Ammann, Eloy Azorín, Verónica Echegui, Lucía Jiménez, Claudia Traisac	Spain	2017-09-08	2016	TV-MA	1 Season	Crime TV Shows, International TV Shows, Spanish-Language TV Shows
Movie	Automata	Gabe Ibáñez	Antonio Banderas, Dylan McDermott, Melanie Griffith, Birgitte Hjort Sørensen, Robert Forster, Christa Campbell, Tim McInnerny, Andy Nyman, David Ryall	Bulgaria, United States, Spain, Canada	2017-09-08	2014	R	110 min	International Movies, Sci-Fi & Fantasy, Thrillers
Movie	Fabrizio Copano: Solo pienso en mi	Rodrigo Toro, Francisco Schultz	Fabrizio Copano	Chile	2017-09-08	2017	TV-MA	60 min	Stand-Up Comedy
TV Show	Fire Chasers	NA	NA	United States	2017-09-08	2017	TV-MA	1 Season	Docuseries, Science & Nature TV
Movie	Good People	Henrik Ruben Genz	James Franco, Kate Hudson, Tom Wilkinson, Omar Sy, Sam Spruell, Anna Friel, Thomas Arnold, Oliver Dimsdale, Diana Hardcastle, Michael Jibson, Diarmaid Murtagh	United States, United Kingdom, Denmark, Sweden	2017-09-08	2014	R	90 min	Action & Adventure, Thrillers
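
The conversion above relies on the %B format code, which matches full month names in the session's time locale. The sketch below shows the same call on a few toy strings (hypothetical values); switching to the C locale is an assumption added here so that English month names are recognised regardless of the system settings:

```r
# Assumption: force the C time locale, where month names are English
old_locale <- Sys.getlocale("LC_TIME")
invisible(Sys.setlocale("LC_TIME", "C"))

# toy date strings in the same style as date_added (hypothetical values)
toy_dates <- c("September 9, 2019", "January 1, 2020", NA)

# %B = full month name, %d = day of month, %Y = four-digit year
parsed <- as.Date(toy_dates, format = "%B %d, %Y")
print(parsed)

# restore the previous locale
invisible(Sys.setlocale("LC_TIME", old_locale))
```

Missing input stays NA after the conversion, which is why the rows without date_added still need to be dropped later.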

5. Final Visualization Steps and Insights

With the data cleaning done, we can continue with exploring the data. We will conduct exploratory data analysis (EDA) with interactive data visualization using the plot_ly() function of the Plotly R package. Useful insights and findings for each visualization are provided in this section.

5.1. Proportion of Content by Type

First, we will visualize the proportion of Netflix content by type.

content_by_type <- df %>% group_by(type) %>% 
  summarise(count = n())

plot_ly(content_by_type, labels = ~type, values = ~count, 
        type = 'pie', marker = list(colors = c("#bd3939", "#399ba3"))) %>% 
  layout(xaxis = list(showgrid = F, zeroline = F, showticklabels = F),
         yaxis = list(showgrid = F, zeroline = F, showticklabels = F),
         title = "Proportion of Content by Type", margin = list(t = 54),
         legend = list(x = 100, y = 0.5))

As we see from above, there are more than twice as many Movies as TV Shows on Netflix.

5.2. Top 12 Countries by Amount of Produced Content

Many Movies and TV Shows are co-produced by several countries (ref: country variable). In order to correctly count the total amount of content produced by each country, we need to split strings in country variable and count the total occurrence of each country on its own.
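
As a small illustration of this splitting idea, here is a sketch on a toy data frame (hypothetical rows). Each title is repeated once per country it lists, so that every country gets its own row:

```r
# toy data: one co-production, one single-country title, one missing country
toy <- data.frame(
  type = c("Movie", "TV Show", "Movie"),
  country = c("United States, India", "Japan", NA),
  stringsAsFactors = FALSE
)

# split each country string into its parts (NA stays a single NA)
s <- strsplit(toy$country, split = ", ")

# repeat the type once per country, then flatten the list of countries
long <- data.frame(
  type = rep(toy$type, lengths(s)),
  country = unlist(s),
  stringsAsFactors = FALSE
)

# drop titles without a country before counting
long <- na.omit(long)
print(long)
```

The code for the full dataset follows the same pattern.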

s <- strsplit(df$country, split = ", ")
cntry_split <- data.frame(type = rep(df$type, sapply(s, length)), country = unlist(s))
cntry_split$country <- as.character(gsub(",","", cntry_split$country))

amount_by_country <- na.omit(cntry_split) %>%
  group_by(country, type) %>%
  summarise(count = n())

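# reshape to wide format: one row per country, one count column per type
# note: top_n() without wt ranks by the last column of the data,
# i.e. the column renamed to count_tv_show below
w <- reshape(data = data.frame(amount_by_country), 
             idvar = "country",
             v.names = "count",
             timevar = "type",
             direction = "wide") %>% 
  arrange(desc(count.Movie)) %>% top_n(12)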

names(w)[2] <- "count_movie"
names(w)[3] <- "count_tv_show"
w <- w[order(desc(w$count_movie + w$count_tv_show)),] 

plot_ly(w, x = w$country, y = ~count_movie, 
        type = 'bar', name = 'Movie', marker = list(color = '#bd3939')) %>% 
  add_trace(y = ~count_tv_show, name = 'TV Show', marker = list(color = '#399ba3')) %>% 
  layout(xaxis = list(categoryorder = "array", 
                      categoryarray = w$country, 
                      title = "Country"), 
         yaxis = list(title = 'Amount of content'),
         barmode = 'stack', 
         title = "Top 12 Countries by Amount of Produced Content", margin = list(t = 54),
         legend = list(x = 100, y = 0.5))

We see that the United States is the clear leader in the amount of content on Netflix. Countries such as Japan, South Korea and Taiwan have more TV Shows than Movies on Netflix.

5.3. Growth in Content over the Years

We will visualize Netflix’s growth in amount of content as a function of time.
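
The growth curves are running totals of the number of titles added per day. A minimal sketch of that idea on toy data (hypothetical dates and types), using count() and cumsum() from dplyr and base R:

```r
library(dplyr)

# toy data: five titles added over three days (hypothetical values)
toy <- tibble(
  date_added = as.Date(c("2019-01-01", "2019-01-01", "2019-01-01",
                         "2019-01-02", "2019-01-03")),
  type = c("Movie", "Movie", "TV Show", "Movie", "TV Show")
)

toy_growth <- toy %>%
  # number of titles added per day and type
  count(date_added, type, name = "added_today") %>%
  # running total within each type, in date order
  group_by(type) %>%
  arrange(date_added, .by_group = TRUE) %>%
  mutate(total_number_of_content = cumsum(added_today)) %>%
  ungroup()

print(toy_growth)
```

The full computation below also adds a "Total" series across both types.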

df_by_date_full <- df %>% group_by(date_added) %>% 
  summarise(added_today = n()) %>% 
  mutate(total_number_of_content = cumsum(added_today), type = "Total")

df_by_date <- df %>% group_by(date_added, type) %>% 
  summarise(added_today = n()) %>% 
  ungroup() %>% 
  group_by(type) %>%
  mutate(total_number_of_content = cumsum(added_today))

full_data <- rbind(as.data.frame(df_by_date_full), as.data.frame(df_by_date))

plot_ly(full_data, x = ~date_added, y = ~total_number_of_content, 
        mode = 'lines', type = 'scatter',
        color = ~type, colors = c("#bd3939",  "#9addbd", "#399ba3")) %>% 
  layout(yaxis = list(title = 'Count'), 
         xaxis = list(title = 'Date'), 
         title = "Growth in Content over the Years", margin = list(t = 54),
         legend = list(x = 100, y = 0.5))

From above, we see that starting from the year 2016, the total amount of content grew rapidly, with the curve becoming steeper each year. We also notice how quickly the number of Movies on Netflix overtook the number of TV Shows.

5.4. Content Added per Month

Next, visualize the amount of content added per month.
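
Monthly totals are obtained by rounding each date down to the first day of its month. A short sketch on toy dates (hypothetical values), using floor_date() from lubridate:

```r
library(lubridate)

# toy dates: two in November 2019, one in December 2019 (hypothetical values)
toy_dates <- as.Date(c("2019-11-01", "2019-11-30", "2019-12-15"))

# round each date down to the first day of its month
month_added <- floor_date(toy_dates, "month")
print(month_added)

# count how many dates fall into each month
per_month <- table(format(month_added, "%Y-%m"))
print(per_month)
```

The code below applies the same bucketing to date_added, split by type.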

df_by_date_month <- df %>% group_by(month_added = floor_date(date_added, "month"), type) %>%
  summarise(added_today = n())

wd <- reshape(data = data.frame(df_by_date_month),
              idvar = "month_added",
              v.names = "added_today",
              timevar = "type",
              direction = "wide")

names(wd)[2] <- "added_today_movie"
names(wd)[3] <- "added_today_tv_show"
wd$added_today_movie[is.na(wd$added_today_movie)] <- 0
wd$added_today_tv_show[is.na(wd$added_today_tv_show)] <- 0
# drop the remaining row with a missing month (titles without date_added)
wd <- na.omit(wd)

plot_ly(wd, x = wd$month_added, y = ~added_today_movie, 
        type = 'bar', name = 'Movie', 
        marker = list(color = '#bd3939')) %>% 
  add_trace(y = ~added_today_tv_show, 
            name = 'TV Show', 
            marker = list(color = '#399ba3')) %>% 
  layout(xaxis = list(categoryorder = 'array', 
                      categoryarray = wd$month_added, 
                      title = 'Date'), 
         yaxis = list(title = 'Count'), 
         barmode = 'stack', 
         title = "Amount of Content Added per Month", margin = list(t = 54),
         legend = list(x = 100, y = 0.5))

We can see from above that November 2019 was the peak month for the amount of content added to Netflix.

5.5. Distribution of Content by Rating

We will look at the distribution of Netflix content by rating classes.

df_by_rating <- df %>% group_by(rating) %>% 
  summarise(count = n())

plot_ly(df_by_rating, type = 'pie',
        labels = ~rating, values = ~count) %>% 
  layout(xaxis = list(showgrid = F, zeroline = F, showticklabels = F),
         yaxis = list(showgrid = F, zeroline = F, showticklabels = F),
         title = "Distribution of Content by Rating", margin = list(t = 54),
         legend = list(x = 100, y = 0.5))

The largest share of content carries the TV-MA rating. TV-MA is a rating assigned under the TV Parental Guidelines to television programs designed for mature audiences only.

The second largest is TV-14, which stands for content that may be inappropriate for children younger than 14 years of age.

The third largest is the very popular R rating. R-rated content is assessed by the Motion Picture Association of America (MPAA) as having material that may be unsuitable for children under the age of 17; the MPAA writes “Under 17 requires accompanying parent or adult guardian”.

5.6. Top Genres on Netflix

s_genres <- strsplit(df$listed_in, split = ", ")
genres_listed_in <- data.frame(type = rep(df$type, sapply(s_genres, length)), 
                               listed_in = unlist(s_genres))
genres_listed_in$listed_in <- as.character(gsub(",","",genres_listed_in$listed_in))

df_by_listed_in <- genres_listed_in %>% 
  group_by(type, listed_in) %>% 
  summarise(count = n()) %>% 
  arrange(desc(count)) %>% top_n(10)

plot_ly(df_by_listed_in, x = ~listed_in, y = ~count,
        type = 'bar', color = ~type,
        colors = c("#bd3939", "#399ba3")) %>%
  layout(xaxis = list(categoryorder = "array", 
                      categoryarray = df_by_listed_in$listed_in, 
                      title = 'Genre',
                      tickangle = 45), 
         yaxis = list(title = 'Count'), 
         title = "Top Genres (Movie vs. TV Show)", margin = list(t = 54),
         legend = list(x = 100, y = 0.5))

We see that International Movies / TV Shows is the dominant category for both Movies and TV Shows, followed by Dramas and Comedies. These are the three genres with the highest amount of content on Netflix!
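
The ranking above is computed per type, because top_n() is applied to a data frame that is still grouped by type. A minimal sketch of the same per-group selection on toy counts (hypothetical values), written with slice_max(), which dplyr offers as the replacement for the superseded top_n():

```r
library(dplyr)

# toy genre counts per type (hypothetical values)
toy <- tibble(
  type = c("Movie", "Movie", "Movie", "TV Show", "TV Show"),
  listed_in = c("Dramas", "Comedies", "Thrillers", "Kids' TV", "Docuseries"),
  count = c(5, 3, 1, 4, 2)
)

# keep the two largest counts within each type
top2 <- toy %>%
  group_by(type) %>%
  slice_max(count, n = 2) %>%
  ungroup()

print(top2)
```

Unlike top_n(), slice_max() takes the ranking column explicitly, so the result does not depend on column order.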

5.7. Movie Duration in Top 12 Countries

mov_duration_cntry <- na.omit(df[df$type == "Movie",][,c("country", "duration")])

s_dur <- strsplit(mov_duration_cntry$country, split = ", ")
duration_full <- data.frame(duration = rep(mov_duration_cntry$duration,
                                           sapply(s_dur, length)),
                            country = unlist(s_dur))
duration_full$duration <- as.numeric(gsub(" min","", duration_full$duration))

duration_full_subset <- duration_full[duration_full$country %in% 
                                        c("United States", "India", "United Kingdom",
                                          "Canada", "France", "Japan", "Spain", "South Korea",
                                          "Mexico", "Australia", "China", "Taiwan"),]

plot_ly(duration_full_subset, y = ~duration, color = ~country, type = "box") %>% 
  layout(xaxis = list(title = "Country"), 
         yaxis = list(title = 'Duration (in min)'),
         title = "Box-Plots of Movie Duration in Top 12 Countries", margin = list(t = 54),
         legend = list(x = 100, y = 0.5))

It can be seen from above that Movies produced in India tend to be the longest, with an average duration of 127 min.
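
Box plots display the median rather than the mean, so the average has to be computed separately. A sketch of one way to do that, shown on toy data (hypothetical durations, not the real figures) with base R's aggregate():

```r
# toy movie durations per country (hypothetical values)
toy <- data.frame(
  country = c("India", "India", "Spain"),
  duration = c("150 min", "100 min", "90 min"),
  stringsAsFactors = FALSE
)

# strip the " min" suffix and convert to numbers, as in the code above
toy$duration <- as.numeric(gsub(" min", "", toy$duration))

# mean duration per country
avg_duration <- aggregate(duration ~ country, data = toy, FUN = mean)
print(avg_duration)
```

Applied to duration_full_subset, the same call would give the average duration for each of the 12 countries.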