1. Overview

Netflix is the world’s leading streaming entertainment service with 193 million paid memberships in over 190 countries enjoying TV series, documentaries and feature films across a wide variety of genres and languages.

In 2018, Netflix released an interesting report showing that the number of TV shows on the streaming service has nearly tripled since 2010, while its number of movies has decreased by more than 2,000 titles. It will be interesting to explore what other insights can be obtained from the same dataset.

In this DataViz, we will conduct a short exploratory analysis of the Netflix dataset, which is available on Kaggle. The dataset consists of TV shows and movies available on Netflix as of 2019 and was collected from Flixable, a third-party Netflix search engine.

Motivated by the author’s love for TV shows and movies, this DataViz looks for interesting insights in the Netflix dataset through visualizations made with the Plotly package for R.

2. Dataset

The dataset contains 6,234 observations of the following 12 variables describing the TV shows and movies:

  • show_id - Unique ID for every Movie / TV Show
  • type - Identifier, Movie or TV Show
  • title - Title of the Movie / TV Show
  • director - Director of the Movie
  • cast - Actors involved in the movie / show
  • country - Country where the movie / show was produced
  • date_added - Date it was added on Netflix
  • release_year - Actual release year of the movie / show
  • rating - TV Rating of the movie / show
  • duration - Total Duration, in minutes or in number of seasons
  • listed_in - Genre
  • description - The summary description

3. Sketch of Proposed DataViz Design


4. Data Preparation Steps

4.1. Load R packages and Data

First, load the necessary R packages in RStudio.

  • tidyverse provides a set of essential packages for data manipulation and exploration.
  • kableExtra builds complex tables and controls table styles.
  • plotly creates interactive web graphics, including from ggplot2 graphs.
  • lubridate makes it easier to work with dates and times.

packages = c('tidyverse', 'kableExtra', 'plotly', 'lubridate')

for (p in packages){
  if (!require(p,character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

Load the data.

df <- read_csv('data/netflix_titles.csv')

4.2. Data Cleaning

4.2.1. Variable Reduction

First, we remove the variables that are not needed for this analysis. The show_id variable is uninformative, so it is dropped. The description variable is also dropped: it is not used in this exploratory analysis, although it could be used in further work to find similar movies and TV shows through text similarity, which is out of scope here.

# remove uninformative variables
df <- tibble::as_tibble(df) %>% 
  select(-c(show_id, description))
# check the data
kable(df[1:10,]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  scroll_box(width = "100%", height = "500px")

type | title | director | cast | country | date_added | release_year | rating | duration | listed_in
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Movie | Norm of the North: King Sized Adventure | Richard Finn, Tim Maltby | Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Durupt, Maya Kay, Michael Dobson | United States, India, South Korea, China | September 9, 2019 | 2019 | TV-PG | 90 min | Children & Family Movies, Comedies
Movie | Jandino: Whatever it Takes | NA | Jandino Asporaat | United Kingdom | September 9, 2016 | 2016 | TV-MA | 94 min | Stand-Up Comedy
TV Show | Transformers Prime | NA | Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton, Steve Blum, Andy Pessoa, Ernie Hudson, Daran Norris, Will Friedle | United States | September 8, 2018 | 2013 | TV-Y7-FV | 1 Season | Kids’ TV
TV Show | Transformers: Robots in Disguise | NA | Will Friedle, Darren Criss, Constance Zimmer, Khary Payton, Mitchell Whitfield, Stuart Allan, Ted McGinley, Peter Cullen | United States | September 8, 2018 | 2016 | TV-Y7 | 1 Season | Kids’ TV
Movie | #realityhigh | Fernando Lebrija | Nesta Cooper, Kate Walsh, John Michael Higgins, Keith Powers, Alicia Sanz, Jake Borelli, Kid Ink, Yousef Erakat, Rebekah Graf, Anne Winters, Peter Gilroy, Patrick Davis | United States | September 8, 2017 | 2017 | TV-14 | 99 min | Comedies
TV Show | Apaches | NA | Alberto Ammann, Eloy Azorín, Verónica Echegui, Lucía Jiménez, Claudia Traisac | Spain | September 8, 2017 | 2016 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, Spanish-Language TV Shows
Movie | Automata | Gabe Ibáñez | Antonio Banderas, Dylan McDermott, Melanie Griffith, Birgitte Hjort Sørensen, Robert Forster, Christa Campbell, Tim McInnerny, Andy Nyman, David Ryall | Bulgaria, United States, Spain, Canada | September 8, 2017 | 2014 | R | 110 min | International Movies, Sci-Fi & Fantasy, Thrillers
Movie | Fabrizio Copano: Solo pienso en mi | Rodrigo Toro, Francisco Schultz | Fabrizio Copano | Chile | September 8, 2017 | 2017 | TV-MA | 60 min | Stand-Up Comedy
TV Show | Fire Chasers | NA | NA | United States | September 8, 2017 | 2017 | TV-MA | 1 Season | Docuseries, Science & Nature TV
Movie | Good People | Henrik Ruben Genz | James Franco, Kate Hudson, Tom Wilkinson, Omar Sy, Sam Spruell, Anna Friel, Thomas Arnold, Oliver Dimsdale, Diana Hardcastle, Michael Jibson, Diarmaid Murtagh | United States, United Kingdom, Denmark, Sweden | September 8, 2017 | 2014 | R | 90 min | Action & Adventure, Thrillers

4.2.2. Handling Missing Values

Check if we have missing values in the dataset.

# print number of missing values for each variable
data.frame(var = c(colnames(df)), 
           missing = sapply(df, function(x) sum(is.na(x))), row.names = NULL) %>%
  mutate(missing = cell_spec(missing, "html", 
                             color = ifelse(missing > 0, 'red', 'black'))) %>% 
  rename(`Variable` = var, `Missing Value Count` = missing) %>%
  kable(format = "html", escape = F, align = c("l", "c")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)

Variable | Missing Value Count
--- | ---
type | 0
title | 0
director | 1969
cast | 570
country | 476
date_added | 11
release_year | 0
rating | 10
duration | 0
listed_in | 0

The output above shows missing values for the variables director, cast, country, date_added and rating. Since rating is a categorical variable with 14 levels, we can approximate its missing values with the mode, computed by the getmode() function below.

# function to find a mode
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}
df$rating[is.na(df$rating)] <- getmode(df$rating)
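
To make the imputation step concrete, here is a minimal sketch on an invented ratings vector (`ratings` is a hypothetical stand-in for `df$rating`). It uses a variant of the helper that drops NAs before counting; that safeguard is an assumption added here so that NA itself can never be returned as the mode.

```r
# Mode helper as in the text, with NAs dropped first (an added safeguard)
getmode <- function(v) {
  v <- v[!is.na(v)]
  uniqv <- unique(v)
  # match() maps each value to its index in uniqv; tabulate() counts the indices
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Invented ratings vector with two missing entries
ratings <- c("TV-MA", "R", NA, "TV-MA", "TV-14", NA)

# Replace each NA with the most frequent observed rating
ratings[is.na(ratings)] <- getmode(ratings)
print(ratings)
```

Mode imputation slightly inflates the share of the most common class, which is a small effect when only a handful of values are missing.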

Unlike rating, the missing values for director, cast, country and date_added cannot be easily approximated, so for now we leave them unfilled and drop them only at the points where it becomes necessary.

We also drop duplicated rows in the dataset based on the type, title, country, release_year variables.

# drop duplicated rows based on the title, country, type and release_year
df <- distinct(df, type, title, country, release_year, .keep_all = T)
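
The effect of `distinct()` with `.keep_all` can be seen on a small example. The sketch below uses an invented tibble (`toy` and its values are hypothetical) in which two rows share the same key columns but differ in rating.

```r
library(dplyr)

# Invented rows: the first two share type, title, country and release_year
toy <- tibble(
  type         = c("Movie", "Movie", "TV Show"),
  title        = c("Alpha", "Alpha", "Beta"),
  country      = c("Spain", "Spain", "Japan"),
  release_year = c(2017, 2017, 2018),
  rating       = c("R", "TV-MA", "TV-14")
)

# Keep the first row for each key combination;
# .keep_all = TRUE retains the columns that are not part of the key
deduped <- distinct(toy, type, title, country, release_year, .keep_all = TRUE)
print(deduped)
```

Only the first of the two "Alpha" rows survives, so the values kept for the non-key columns depend on row order.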

4.2.3. Date Formatting

Convert the date_added variable from a character string to a Date object to make later manipulation easier.

# change date format 
df$date_added <- as.Date(df$date_added, format = "%B %d, %Y")
# check the date format
kable(df[1:10,]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>% 
  column_spec(6, color = 'red', bold = T) %>% 
  scroll_box(width = "100%", height = "352px")

type | title | director | cast | country | date_added | release_year | rating | duration | listed_in
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Movie | Norm of the North: King Sized Adventure | Richard Finn, Tim Maltby | Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Durupt, Maya Kay, Michael Dobson | United States, India, South Korea, China | 2019-09-09 | 2019 | TV-PG | 90 min | Children & Family Movies, Comedies
Movie | Jandino: Whatever it Takes | NA | Jandino Asporaat | United Kingdom | 2016-09-09 | 2016 | TV-MA | 94 min | Stand-Up Comedy
TV Show | Transformers Prime | NA | Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton, Steve Blum, Andy Pessoa, Ernie Hudson, Daran Norris, Will Friedle | United States | 2018-09-08 | 2013 | TV-Y7-FV | 1 Season | Kids’ TV
TV Show | Transformers: Robots in Disguise | NA | Will Friedle, Darren Criss, Constance Zimmer, Khary Payton, Mitchell Whitfield, Stuart Allan, Ted McGinley, Peter Cullen | United States | 2018-09-08 | 2016 | TV-Y7 | 1 Season | Kids’ TV
Movie | #realityhigh | Fernando Lebrija | Nesta Cooper, Kate Walsh, John Michael Higgins, Keith Powers, Alicia Sanz, Jake Borelli, Kid Ink, Yousef Erakat, Rebekah Graf, Anne Winters, Peter Gilroy, Patrick Davis | United States | 2017-09-08 | 2017 | TV-14 | 99 min | Comedies
TV Show | Apaches | NA | Alberto Ammann, Eloy Azorín, Verónica Echegui, Lucía Jiménez, Claudia Traisac | Spain | 2017-09-08 | 2016 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, Spanish-Language TV Shows
Movie | Automata | Gabe Ibáñez | Antonio Banderas, Dylan McDermott, Melanie Griffith, Birgitte Hjort Sørensen, Robert Forster, Christa Campbell, Tim McInnerny, Andy Nyman, David Ryall | Bulgaria, United States, Spain, Canada | 2017-09-08 | 2014 | R | 110 min | International Movies, Sci-Fi & Fantasy, Thrillers
Movie | Fabrizio Copano: Solo pienso en mi | Rodrigo Toro, Francisco Schultz | Fabrizio Copano | Chile | 2017-09-08 | 2017 | TV-MA | 60 min | Stand-Up Comedy
TV Show | Fire Chasers | NA | NA | United States | 2017-09-08 | 2017 | TV-MA | 1 Season | Docuseries, Science & Nature TV
Movie | Good People | Henrik Ruben Genz | James Franco, Kate Hudson, Tom Wilkinson, Omar Sy, Sam Spruell, Anna Friel, Thomas Arnold, Oliver Dimsdale, Diana Hardcastle, Michael Jibson, Diarmaid Murtagh | United States, United Kingdom, Denmark, Sweden | 2017-09-08 | 2014 | R | 90 min | Action & Adventure, Thrillers
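
The date conversion used in this subsection can be tried in isolation. The following sketch parses a few strings in the same "Month day, Year" layout; `raw_dates` is an invented vector, and the code assumes an English locale because the %B month-name specifier is locale-dependent.

```r
# Strings in the same layout as the raw date_added column, plus one missing value
raw_dates <- c("September 9, 2019", "September 8, 2017", NA)

# %B = full month name, %d = day of month, %Y = four-digit year.
# %B is locale-dependent, so this assumes an English locale.
parsed <- as.Date(raw_dates, format = "%B %d, %Y")

print(parsed)
print(class(parsed))
```

Strings that do not match the format come back as NA rather than raising an error, so it is worth re-checking the missing-value count for date_added after the conversion.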

5. Final Visualization Steps and Insights

With the data cleaning done, we can move on to exploring the data. We will conduct exploratory data analysis (EDA) with interactive visualizations built with the plot_ly() function of the Plotly R package. Insights and findings are given after each visualization in this section.

5.1. Proportion of Content by Type

First, we will visualize the proportion of Netflix content by type.

content_by_type <- df %>% group_by(type) %>% 
  summarise(count = n())

plot_ly(content_by_type, labels = ~type, values = ~count, 
        type = 'pie', marker = list(colors = c("#bd3939", "#399ba3"))) %>% 
  layout(xaxis = list(showgrid = F, zeroline = F, showticklabels = F),
         yaxis = list(showgrid = F, zeroline = F, showticklabels = F),
         title = "Proportion of Content by Type", margin = list(t = 54),
         legend = list(x = 100, y = 0.5))

As the chart shows, there are more than twice as many Movies as TV Shows on Netflix.
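
The ratio read off the pie chart can also be computed directly. Below is a minimal sketch on an invented `type` column (the values are made up, so the resulting shares are not those of the real dataset); `count()` is used here as a shorthand for the group_by() and summarise() pair.

```r
library(dplyr)

# Invented type column; the real proportions come from df
toy <- tibble(type = c("Movie", "Movie", "Movie", "TV Show"))

shares <- toy %>%
  count(type, name = "count") %>%        # same result as group_by + summarise(n())
  mutate(share = count / sum(count))     # fraction of all titles per type

print(shares)
```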

5.2. Top 12 Countries by Amount of Produced Content

Many Movies and TV Shows are co-produced by several countries (see the country variable). To count the total amount of content produced by each country correctly, we need to split the strings in the country variable and count the occurrences of each country separately.

s <- strsplit(df$country, split = ", ")
cntry_split <- data.frame(type = rep(df$type, sapply(s, length)), country = unlist(s))
cntry_split$country <- as.character(gsub(",","", cntry_split$country))

amount_by_country <- na.omit(cntry_split) %>%
  group_by(country, type) %>%
  summarise(count = n())

w <- reshape(data = data.frame(amount_by_country), 
             idvar = "country",
             v.names = "count",
             timevar = "type",
             direction = "wide") %>% 
  arrange(desc(count.Movie)) %>% top_n(12)

names(w)[2] <- "count_movie"
names(w)[3] <- "count_tv_show"
w <- w[order(desc(w$count_movie + w$count_tv_show)),] 

plot_ly(w, x = w$country, y = ~count_movie, 
        type = 'bar', name = 'Movie', marker = list(color = '#bd3939')) %>% 
  add_trace(y = ~count_tv_show, name = 'TV Show', marker = list(color = '#399ba3')) %>% 
  layout(xaxis = list(categoryorder = "array", 
                      categoryarray = w$country, 
                      title = "Country"), 
         yaxis = list(title = 'Amount of content'),
         barmode = 'stack', 
         title = "Top 12 Countries by Amount of Produced Content", margin = list(t = 54),
         legend = list(x = 100, y = 0.5)) 

We see that the United States is the clear leader in the amount of content on Netflix. Countries such as Japan, South Korea and Taiwan have more TV Shows than Movies on Netflix.
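
The country-splitting step in this section can also be sketched with tidyr's `separate_rows()`, shown here as an alternative to the strsplit() approach rather than the code used for the chart. The tibble `toy` and its values are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Invented rows; one title is co-produced and one has no country
toy <- tibble(
  type    = c("Movie", "TV Show", "Movie", "Movie"),
  country = c("United States, India", "Japan", "India", NA)
)

country_counts <- toy %>%
  filter(!is.na(country)) %>%                # drop titles without a country
  separate_rows(country, sep = ",\\s*") %>%  # one row per (title, country) pair
  filter(country != "") %>%                  # guard against trailing commas
  count(country, type, name = "count")

print(country_counts)
```

Because a co-produced title contributes one row per country, the counts add up to more than the number of titles.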

5.3. Growth in Content over the Years

We will visualize Netflix’s growth in amount of content as a function of time.

df_by_date_full <- df %>% group_by(date_added) %>% 
  summarise(added_today = n()) %>% 
  mutate(total_number_of_content = cumsum(added_today), type = "Total")

df_by_date <- df %>% group_by(date_added, type) %>% 
  summarise(added_today = n()) %>% 
  ungroup() %>% 
  group_by(type) %>%
  mutate(total_number_of_content = cumsum(added_today))

full_data <- rbind(as.data.frame(df_by_date_full), as.data.frame(df_by_date))

plot_ly(full_data, x = ~date_added, y = ~total_number_of_content, 
        mode = 'lines', type = 'scatter',
        color = ~type, colors = c("#bd3939",  "#9addbd", "#399ba3")) %>% 
  layout(yaxis = list(title = 'Count'), 
         xaxis = list(title = 'Date'), 
         title = "Growth in Content over the Years", margin = list(t = 54),
         legend = list(x = 100, y = 0.5))

From the chart above, we see that from 2016 onward the total amount of content grew very rapidly, in a roughly exponential fashion. We also notice how quickly the number of Movies on Netflix overtook the number of TV Shows.
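
The running totals behind the growth curves come from a cumulative sum taken within each content type. The sketch below reproduces that idea on a few invented dates (`toy` is hypothetical); the explicit `arrange()` is an added precaution, since cumsum() only gives a meaningful running total when rows are in chronological order.

```r
library(dplyr)

# Invented addition dates
toy <- tibble(
  date_added = as.Date(c("2016-01-01", "2016-01-01", "2016-02-01",
                         "2016-02-01", "2016-03-01")),
  type       = c("Movie", "TV Show", "Movie", "Movie", "TV Show")
)

growth <- toy %>%
  count(date_added, type, name = "added_today") %>%  # titles added per day and type
  arrange(date_added) %>%                            # chronological order for cumsum
  group_by(type) %>%
  mutate(total_number_of_content = cumsum(added_today)) %>%
  ungroup()

print(growth)
```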

5.4. Content Added per Month

Next, visualize the amount of content added per month.

df_by_date_month <- df %>% group_by(month_added = floor_date(date_added, "month"), type) %>%
  summarise(added_today = n())

wd <- reshape(data = data.frame(df_by_date_month),
              idvar = "month_added",
              v.names = "added_today",
              timevar = "type",
              direction = "wide")

names(wd)[2] <- "added_today_movie"
names(wd)[3] <- "added_today_tv_show"
wd$added_today_movie[is.na(wd$added_today_movie)] <- 0
wd$added_today_tv_show[is.na(wd$added_today_tv_show)] <- 0
wd <-na.omit(wd)

plot_ly(wd, x = wd$month_added, y = ~added_today_movie, 
        type = 'bar', name = 'Movie', 
        marker = list(color = '#bd3939')) %>% 
  add_trace(y = ~added_today_tv_show, 
            name = 'TV Show', 
            marker = list(color = '#399ba3')) %>% 
  layout(xaxis = list(categoryorder = 'array', 
                      categoryarray = wd$month_added, 
                      title = 'Date'), 
         yaxis = list(title = 'Count'), 
         barmode = 'stack', 
         title = "Amount of Content Added per Month", margin = list(t = 54),
         legend = list(x = 100, y = 0.5))

The chart above shows that November 2019 was the peak month for content added to Netflix.
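
The monthly aggregation relies on rounding each date down to the first day of its month. Here is a minimal sketch of that step on invented dates, using tidyr's `pivot_wider()` in place of reshape(); that swap is a choice made for this example, and `toy` and `added` are hypothetical names.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Invented addition dates spanning two months
toy <- tibble(
  date_added = as.Date(c("2019-11-01", "2019-11-20", "2019-11-25", "2019-12-03")),
  type       = c("Movie", "Movie", "TV Show", "Movie")
)

monthly <- toy %>%
  mutate(month_added = floor_date(date_added, "month")) %>%  # e.g. 2019-11-20 -> 2019-11-01
  count(month_added, type, name = "added") %>%
  # one column per type; months with no titles of a type get 0 instead of NA
  pivot_wider(names_from = type, values_from = added, values_fill = 0)

print(monthly)
```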

5.5. Distribution of Content by Rating

We will look at the distribution of Netflix content by rating classes.

df_by_rating <- df %>% group_by(rating) %>% 
  summarise(count = n())

plot_ly(df_by_rating, type = 'pie',
        labels = ~rating, values = ~count) %>% 
  layout(xaxis = list(showgrid = F, zeroline = F, showticklabels = F),
         yaxis = list(showgrid = F, zeroline = F, showticklabels = F),
         title = "Distribution of Content by Rating", margin = list(t = 54),
         legend = list(x = 100, y = 0.5))

The largest share of content carries the TV-MA rating, which the TV Parental Guidelines assign to television programs designed for mature audiences only.

The second largest is TV-14, which marks content that may be inappropriate for children younger than 14 years of age.

The third largest is the very popular R rating. The Motion Picture Association of America (MPAA) assesses R-rated content as having material that may be unsuitable for children under the age of 17; in the MPAA's words, “Under 17 requires accompanying parent or adult guardian”.

5.6. Top Genres on Netflix

s_genres <- strsplit(df$listed_in, split = ", ")
genres_listed_in <- data.frame(type = rep(df$type, sapply(s_genres, length)), 
                               listed_in = unlist(s_genres))
genres_listed_in$listed_in <- as.character(gsub(",","",genres_listed_in$listed_in))

df_by_listed_in <- genres_listed_in %>% 
  group_by(type, listed_in) %>% 
  summarise(count = n()) %>% 
  arrange(desc(count)) %>% top_n(10)

plot_ly(df_by_listed_in, x = ~listed_in, y = ~count,
        type = 'bar', color = ~type,
        colors = c("#bd3939", "#399ba3")) %>%
  layout(xaxis = list(categoryorder = "array", 
                      categoryarray = df_by_listed_in$listed_in, 
                      title = 'Genre',
                      tickangle = 45), 
         yaxis = list(title = 'Count'), 
         title = "Top Genres (Movie vs. TV Show)", margin = list(t = 54),
         legend = list(x = 100, y = 0.5))

We see that International Movies / TV Shows show up as the dominant category for both Movies and TV Shows, followed by Dramas and Comedies. These three genres have the most content on Netflix!
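
Because the genre counts are grouped by type, the top-N selection happens within Movies and within TV Shows separately. The sketch below illustrates this per-group selection with `slice_max()`, the dplyr verb that has superseded `top_n()`; the counts in `toy` are invented.

```r
library(dplyr)

# Invented genre counts for two content types
toy <- tibble(
  type      = c("Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show"),
  listed_in = c("Dramas", "Comedies", "Thrillers",
                "Kids' TV", "TV Dramas", "Docuseries"),
  count     = c(30, 20, 10, 8, 12, 4)
)

top_genres <- toy %>%
  group_by(type) %>%
  slice_max(count, n = 2, with_ties = FALSE) %>%  # two largest counts per type
  ungroup()

print(top_genres)
```

Setting `with_ties = FALSE` guarantees exactly n rows per group, whereas `top_n()` may return more when counts are tied.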

5.7. Movie Duration in Top 12 Countries

mov_duration_cntry <- na.omit(df[df$type == "Movie",][,c("country", "duration")])

s_dur <- strsplit(mov_duration_cntry$country, split = ", ")
duration_full <- data.frame(duration = rep(mov_duration_cntry$duration,
                                           sapply(s_dur, length)),
                            country = unlist(s_dur))
duration_full$duration <- as.numeric(gsub(" min","", duration_full$duration))

duration_full_subset <- duration_full[duration_full$country %in% 
                                        c("United States", "India", "United Kingdom",
                                          "Canada", "France", "Japan", "Spain", "South Korea",
                                          "Mexico", "Australia", "China", "Taiwan"),]

plot_ly(duration_full_subset, y = ~duration, color = ~country, type = "box") %>% 
  layout(xaxis = list(title = "Country"), 
         yaxis = list(title = 'Duration (in min)'),
         title = "Box-Plots of Movie Duration in Top 12 Countries", margin = list(t = 54),
         legend = list(x = 100, y = 0.5))

The box plots above show that Movies produced in India tend to be the longest, with an average duration of 127 min.
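
A per-country average like the one quoted above can be computed by stripping the " min" suffix and aggregating. The following base R sketch shows the calculation on invented durations (`toy` is hypothetical, so its averages are not those of the real data).

```r
# Invented movie durations in the same "<n> min" layout as the dataset
toy <- data.frame(
  country  = c("India", "India", "Spain", "Spain"),
  duration = c("130 min", "120 min", "90 min", "100 min"),
  stringsAsFactors = FALSE
)

# Strip the unit and convert to a number
toy$minutes <- as.numeric(sub(" min", "", toy$duration, fixed = TRUE))

# Mean duration per country, longest first
avg <- aggregate(minutes ~ country, data = toy, FUN = mean)
avg <- avg[order(-avg$minutes), ]
print(avg)
```

Note that the line inside each box of a box plot marks the median, so a mean computed this way can differ from what the plot shows.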