Netflix Visualization

Introduction

This document describes the preparation behind a dashboard that the author built with Shiny, which can be viewed from the link here. The dashboard is intended to help movie enthusiasts discover Netflix content through several visualizations. The data was sourced from Kaggle, where it was published by Shivam Bansal. The data set consists of the TV shows and movies available on the Netflix platform as of 2021 (the updated version) and was collected from Flixable, a third-party Netflix search engine. The variables in the data set are described below:

show_id: unique ID of each title (TV Show/Movie)
type: whether the title is a Movie or a TV Show
title: the title of the content
director: name(s) of the director(s) of the content
cast: name(s) of the cast member(s) of the content
country: the country or countries in which the content was produced
date_added: the date the content was added to the platform
release_year: the original release year of the content
rating: the maturity rating of the content (e.g. TV-MA, PG-13)
duration: length of the content (number of seasons for TV Shows, number of minutes for Movies)
listed_in: the list of genres under which the content is listed
description: a short synopsis of the content

This document walks through how the visualizations used in that dashboard were made.

Data Preparation

The first step in visualizing the data set is to prepare the data. Before that, the packages required for the visualizations are loaded as below.

library(plotly)
library(tidyverse)
library(scales)
library(lubridate)
library(visdat)
library(igraph)
library(networkD3)

Next, the data in the data_input directory is imported using read_csv from the readr package. This function is optional, as base R's read.csv could read the file as well. The main differences are that read_csv is generally faster on large files and returns a tibble. The data set is shown in the table below.

df_netflix <- read_csv("data_input/netflix_titles.csv")
df_netflix
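
The difference between the two readers can be sketched on a small temporary file. The file contents below are made up for illustration and are not the real data set.

```r
library(readr)

# Write a tiny illustrative CSV to a temporary file (not the real data set)
tmp <- tempfile(fileext = ".csv")
writeLines(c("show_id,type,release_year",
             "s1,TV Show,2020",
             "s2,Movie,2016"), tmp)

# readr::read_csv returns a tibble and guesses the column types
# (suppressMessages hides the column specification message)
df_readr <- suppressMessages(read_csv(tmp))

# base::read.csv returns a plain data.frame
df_base <- read.csv(tmp)

class(df_readr)
class(df_base)
```

Both calls read the same rows; the choice mostly affects speed on larger files and how the result prints.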

The data can also be inspected with the glimpse function, as in the code chunk below. glimpse prints the dimensions of the data set and, for each variable, its data type and first few values.

glimpse(df_netflix)

## Rows: 7,787
## Columns: 12
## $ show_id      <chr> "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", "s1~
## $ type         <chr> "TV Show", "Movie", "Movie", "Movie", "Movie", "TV Show",~
## $ title        <chr> "3%", "7:19", "23:59", "9", "21", "46", "122", "187", "70~
## $ director     <chr> NA, "Jorge Michel Grau", "Gilbert Chan", "Shane Acker", "~
## $ cast         <chr> "João Miguel, Bianca Comparato, Michel Gomes, Rodolfo Val~
## $ country      <chr> "Brazil", "Mexico", "Singapore", "United States", "United~
## $ date_added   <chr> "August 14, 2020", "December 23, 2016", "December 20, 201~
## $ release_year <dbl> 2020, 2016, 2011, 2009, 2008, 2016, 2019, 1997, 2019, 200~
## $ rating       <chr> "TV-MA", "TV-MA", "R", "PG-13", "PG-13", "TV-MA", "TV-MA"~
## $ duration     <chr> "4 Seasons", "93 min", "78 min", "80 min", "123 min", "1 ~
## $ listed_in    <chr> "International TV Shows, TV Dramas, TV Sci-Fi & Fantasy",~
## $ description  <chr> "In a future where the elite inhabit an island paradise f~

Note that several categorical variables (e.g. type) are still stored as characters. They are converted to the appropriate format in the Data Cleaning and Manipulation section.

Finding missing values and duplicate data

Before the transformation, it is worth checking whether the data set contains any missing or duplicated values. To find the missing values, the vis_miss function from the visdat package can be used. This function visualizes the missing data in each variable and shows the proportion of missing values for each variable and for the data set as a whole. The visualization is shown below.

vis_miss(df_netflix, sort_miss = T, cluster = T)

The chart shows that several variables, including director, cast, country, date_added, and rating, have missing values, and that director has the largest proportion of them. To tackle this issue, the missing values are filled rather than dropped, since dropping them would lose a large proportion of the information. There are several ways to fill missing values, such as using a central value of the variable (mean, median, or mode), a specific value that is relevant to the case, or 0; the right choice depends on the study or business case. For the country, rating, and date_added columns, the mode is filled in by using the helper function below. For the director and cast variables, the string "Unknown" is filled in, as the actual names cannot be inferred.

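# Return the most frequent non-missing value of x
# (note that this masks base::mode for the rest of the session)
mode <- function(x){
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}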

df_netflix <- df_netflix %>% 
  replace_na(list(country = mode(df_netflix$country), rating = mode(df_netflix$rating), 
                  director = "Unknown", cast = "Unknown", 
                  date_added = mode(df_netflix$date_added)))
vis_miss(df_netflix)
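
The filling strategy can be tried on a toy table first. This is a minimal sketch: the helper is named most_frequent here (a hypothetical name, chosen to avoid masking base::mode), it drops NA values before counting, and the toy data is made up.

```r
library(tidyr)

# Most frequent non-missing value of a vector (hypothetical helper name)
most_frequent <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Made-up data with missing values in both columns
toy <- data.frame(
  country  = c("India", "United States", NA, "United States"),
  director = c(NA, "A", "B", NA),
  stringsAsFactors = FALSE
)

# Fill country with its mode and director with a fixed label
filled <- replace_na(toy, list(country = most_frequent(toy$country),
                               director = "Unknown"))
filled
```

Dropping NA before counting matters for a column where missing values are the most frequent entry; without it, the helper could return NA itself.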

The second vis_miss chart shows that all of the missing values have been filled. The next step is to check for duplicated rows in the data set by using the anyDuplicated function.

anyDuplicated(df_netflix)

## [1] 0

The result of 0 means that there are no duplicated rows in the data set, so the process can continue.
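
For reference, the behaviour of anyDuplicated can be sketched on a made-up data frame that does contain a duplicated row.

```r
# Made-up data frame in which the third row repeats the second
toy <- data.frame(show_id = c("s1", "s2", "s2"),
                  title   = c("3%", "7:19", "7:19"))

# Index of the first duplicated row, or 0 if there is none
first_dup <- anyDuplicated(toy)
first_dup

# No duplicates among the first two rows
anyDuplicated(toy[1:2, ])

# If duplicates were found, they could be dropped with unique()
deduped <- unique(toy)
deduped
```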

Data Cleaning and Manipulation

In this step, the data set is cleaned and manipulated to extract the information needed for the dashboard. The steps in this section are summarized in the list below:

* Changing several data types:
  - type to factor
  - date_added to date

* Adding several columns:
  - year_added
  - main_country
  - main_cast
  - target_age
  - genre

In the code chunk below, the main_country, main_cast, and genre variables are added by extracting the first item from each of the original variables (country, cast, listed_in). This is done because the original variables hold long lists that would be difficult to visualize. The first item in each list is assumed to be the main class of that variable (e.g. main_country is the first item in the country list). Afterwards, the new columns are un-nested with the unnest function, since extracting from the result of str_split returns list columns. Next, the type and date_added columns are converted to the appropriate formats. Finally, the year of date_added is extracted into the year_added column, and the rating labels are mapped to four categories in target_age: ‘Kids’, ‘Older Kids’, ‘Teens’, and ‘Adults’.

df_netflix <- df_netflix %>%
  mutate(main_country = map(str_split(country, ", "), 1), 
         main_cast = map(str_split(cast, ", "), 1), 
         genre = map(str_split(listed_in, ", "), 1)) %>% 
  unnest(cols = c(main_country, main_cast, genre)) %>% 
  mutate(type = as.factor(type), 
         date_added = mdy(date_added),
         year_added = year(date_added),
         main_country = str_remove(main_country, ","),
         target_age = factor(sapply(rating, switch, 
                             'TV-PG' = 'Older Kids', 
                             'TV-MA' = 'Adults', 
                             'TV-Y7-FV' = 'Older Kids',
                             'TV-Y7' = 'Older Kids',
                             'TV-14' = 'Teens',
                             'R' = 'Adults',
                             'TV-Y' = 'Kids',
                             'NR' = 'Adults',
                             'PG-13' = 'Teens',
                             'TV-G' = 'Kids',
                             'PG' = 'Older Kids',
                             'G' = 'Kids',
                             'UR' = 'Adults',
                             'NC-17' = 'Adults'), levels = c("Kids", "Older Kids", "Teens", "Adults"))
         ) 
head(df_netflix)
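
The two main ideas in this chunk, taking the first item of a comma-separated list and mapping ratings to age groups, can be sketched on a toy table. This is an alternative formulation rather than the code used above: it uses map_chr (so no unnest is needed) and a named lookup vector instead of switch. The toy data and the age_lookup name are assumptions for illustration, and only three ratings are covered.

```r
library(dplyr)
library(stringr)
library(purrr)

# Made-up rows with the same column layout as the real data
toy <- tibble(
  country = c("United States, India", "Mexico", "United Kingdom, France"),
  rating  = c("TV-MA", "PG", "TV-Y")
)

# Named vector acting as a lookup table (only a subset of ratings shown)
age_lookup <- c("TV-MA" = "Adults", "PG" = "Older Kids", "TV-Y" = "Kids")

toy_clean <- toy %>%
  mutate(
    # map_chr returns a character vector directly, so no unnest() is needed
    main_country = map_chr(str_split(country, ", "), 1),
    # Look each rating up by name, then order the levels from young to old
    target_age   = factor(unname(age_lookup[rating]),
                          levels = c("Kids", "Older Kids", "Teens", "Adults"))
  )
toy_clean
```

A rating that is missing from the lookup vector would become NA here, which makes unmapped ratings easy to spot.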

Creating Visualization

There are many ways to create an interactive visualization in R (e.g. plotly and highcharter). In this document, however, most of the plots are built with the ggplot function and then passed to the ggplotly function to make them interactive.

Growth for Number of Contents Each Year

The visualization in this section shows the growth of Netflix content by year. The plot uses an animation to present the changes in the cumulative counts of each content type (Movies/TV Shows). To build the plot, the first step is to define a helper function called accumulate_by, which creates the accumulated frames for the animation. Next, the plot is built with the ggplot function after several transformations that accumulate the total content for each year. Finally, the ggplotly function is used to create the interactive plot along with its animation.

accumulate_by <- function(dat, var) {
  var <- lazyeval::f_eval(var, dat)
  lvls <- plotly:::getLevels(var)
  dats <- lapply(seq_along(lvls), function(x) {
    cbind(dat[var %in% lvls[seq(1, x)], ], frame = lvls[[x]])
  })
  dplyr::bind_rows(dats)
}

mtv_growth <- df_netflix %>% 
  group_by(year_added, type) %>% 
  summarise(movie_count = n()) %>% 
  ungroup() %>% 
  mutate(cumulative_count = ave(movie_count, type, FUN = cumsum)) %>% 
  accumulate_by(~year_added) %>% 
  ggplot(aes(x = year_added, y = cumulative_count, color = type, frame = frame)) +
  geom_line(size = 1.5, alpha = 0.8)+
  geom_point(size = 3.5, shape = 21, fill = "white", aes(text = paste0("Year: ", year_added, "<br>",
                                                   "Content Count: ", cumulative_count))) + 
  scale_color_manual(values = c("firebrick", "grey16")) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Growth Numbers of Movies and TV Shows by Year",
       x = "",
       y = "Number of Movies/TV Shows",
       color = "",
       ) +
  theme_minimal() +
  theme(title = element_text(face = "bold"))


ggplotly(mtv_growth, tooltip = c("text", "frame")) %>%
  config(displayModeBar = F)

A line plot is one of the basic charts for identifying a trend in data over time, and here it shows the trend of accumulated Netflix content by year. Based on the visualization, the content on the platform increased dramatically from 2015 to 2020. This may indicate that the platform gained traction over time, with its peak in 2018.

To use a line chart effectively, several aspects need to be considered. First, keep in mind the time span to focus on. Even if the visualization needs to focus on a certain time frame, it is best not to cut the chart off to a smaller range, as that would hide the overall trend. Second, the number of categories presented should be kept as small as possible to keep the plot tidy.
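
The cumulative counts behind the growth plot can be sketched on a toy table. This is a minimal dplyr-only version (count plus a grouped cumsum) rather than the ave call used in the plot code, and the toy data is made up.

```r
library(dplyr)

# Made-up rows: one row per title, with the year it was added and its type
toy <- tibble(
  year_added = c(2015, 2015, 2016, 2016, 2016, 2017),
  type       = c("Movie", "TV Show", "Movie", "Movie", "TV Show", "Movie")
)

growth <- toy %>%
  # Number of titles per year and type
  count(year_added, type, name = "content_count") %>%
  # Running total within each type, in chronological order
  group_by(type) %>%
  arrange(year_added, .by_group = TRUE) %>%
  mutate(cumulative_count = cumsum(content_count)) %>%
  ungroup()
growth
```

Sorting by year inside each group before cumsum is what makes the running total chronological.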

Choropleth Map for Netflix Content Distribution

Map visualization is one of the common approaches for presenting patterns on a global scale, such as the distribution of a product or the demand for a product in each country. Several variations can be used, including connection maps, choropleth maps, hexbin maps, and bubble maps. In this document, a choropleth map is used, as it is well suited to showing how a counted variable is distributed across countries. The choropleth map presented here visualizes the distribution of Netflix content by country.

The plot is built in five steps. First, the world map data is imported through ggplot2's map_data function; most map plots need this step, because the polygons of the map are drawn from the latitudes and longitudes of each country's borders. Second, several country labels in the data frame are changed to match the labels used in the map data. Third, the number of titles per country is counted. Fourth, the map data and the counts are joined with left_join. After this step, countries with no content on the Netflix platform have missing counts, which are replaced with 0. Lastly, the plot is generated with ggplot and passed to plotly to make it interactive.

A quick tip for this plot is to consider transforming the fill scale for the number of titles. The usual choice is a log transformation. However, a log transformation cannot place the value 0 on the scale, since log(0) is not finite. An alternative is the ‘pseudo_log’ transformation, which can also scale 0.
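
The difference between the two transformations can be checked numerically. This is a small sketch using pseudo_log_trans from the scales package with its default parameters, applied to the break values used in the map.

```r
library(scales)

counts <- c(0, 7, 56, 403, 3000)

# A plain log transform sends 0 to -Inf, which cannot be placed on a scale
log_values <- log(counts)
log_values

# pseudo_log is roughly linear near 0 and logarithmic for larger values,
# so 0 maps to a finite value
pl <- pseudo_log_trans(sigma = 1, base = exp(1))
pseudo_values <- pl$transform(counts)
pseudo_values
```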

# Importing map data from ggplot2
mapdata <- map_data("world")

#changing country labels in netflix dataframe
netflix_for_map <- df_netflix %>% 
  mutate(main_country = str_replace_all(main_country, 
                             c("United States" = "USA",
                               "United Kingdom" = "UK",
                               "Hong Kong" = "China",
                               "Soviet Union" = "Russia",
                               "West Germany" = "Germany")))

#summarize content counts by country
count_country <- netflix_for_map %>% 
  group_by(main_country) %>% 
  summarise(content_count = n()) %>% 
  ungroup() %>% 
  arrange(desc(content_count))

#join map data with the dataframe
map_join <- mapdata %>% 
  left_join(. , count_country, by = c("region"="main_country")) %>% 
  mutate(content_count = replace_na(content_count, 0))

# creating choropleth map
temp <- ggplot() +
  geom_polygon(data = map_join, 
               aes(fill = content_count, x = long, y = lat, 
                   group = group, 
                   text = paste0(region, "<br>",
                                 "Netflix Contents: ", content_count)),
               size = 0, alpha = .9, color = "black"
               ) + 
  labs(title = "Distribution of Netflix Contents by Country") +
  theme_void() +
  scale_fill_gradient(name = "Content Count", 
                      trans = "pseudo_log",
                      breaks = c(0, 7, 56, 403, 3000),
                      labels = c(0, 7, 56, 403, 3000),
                      low =  "bisque2",
                      high = "#b20710") +
  theme(panel.grid.major = element_blank(),
        axis.line = element_blank(),
        plot.title = element_text(face = "bold")) 


ggplotly(temp, tooltip = "text") %>% 
  config(displayModeBar = F) %>% 
  layout(legend = list(x = .1, y = .9))

As seen on the plot, most of the content was produced in the United States of America. This is to be expected, as the platform itself was founded there. Following the US, India holds the second position, with 956 titles from India distributed by the platform.
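
The join step of the map can be sketched on toy tables to show where the missing counts come from. The table names and values below are made up; the anti_join check at the end is an extra suggestion, not part of the original workflow.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins for the map data and the per-country counts
toy_map <- tibble(region = c("USA", "India", "Greenland"))
toy_counts <- tibble(main_country = c("USA", "India"),
                     content_count = c(10L, 4L))

# Regions without a matching country get NA, which is then replaced by 0
toy_join <- toy_map %>%
  left_join(toy_counts, by = c("region" = "main_country")) %>%
  mutate(content_count = replace_na(content_count, 0L))
toy_join

# Countries in the counts that never match a map region are dropped
# silently by the join; anti_join() lists them so their labels can be fixed
unmatched <- anti_join(toy_counts, toy_map,
                       by = c("main_country" = "region"))
unmatched
```

Running such a check on the real data is one way to find labels like "United States" that need renaming before the join.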

Genre Distributions by Countries

To visualize the most common genres in each country, a bar plot can be used, as it makes it easy to compare counts across categories. In the visualization below, the countries compared are the United States, India, the United Kingdom, and Indonesia.

us_genre <- df_netflix %>% 
  filter(main_country == "United States") %>% 
  group_by(genre) %>% 
  summarise(num_of_contents = n()) %>% 
  ungroup() %>% 
  arrange(desc(num_of_contents)) %>% 
  head(5) %>% 
  ggplot(aes(x = num_of_contents, y = fct_reorder(genre, num_of_contents))) +
  geom_col(fill = "grey16", alpha = .9, aes(text = paste0(genre, "<br>",
                                                          "Num. of Contents: ", 
                                                          num_of_contents))) +
  scale_y_discrete(labels = function(x) stringr::str_trunc(x, 12)) +
  labs(title = "United States",
       x = "Number of Contents",
       y = "",
       color = "",
       ) +
  theme_minimal() +
  theme(title = element_text(hjust = 0.5),
        text = element_text(family = "lato"),
        panel.grid = element_blank(),
        panel.grid.major.x = element_line(color = "grey88"),
        axis.text.y = element_text(angle = 45)) 

india_genre <- df_netflix %>% 
  filter(main_country == "India") %>% 
  group_by(genre) %>% 
  summarise(num_of_contents = n()) %>% 
  ungroup() %>% 
  arrange(desc(num_of_contents)) %>% 
  head(5) %>% 
  ggplot(aes(x = num_of_contents, y = fct_reorder(genre, num_of_contents))) +
  geom_col(fill = "grey16", alpha = .9, aes(text = paste0(genre, "<br>",
                                                          "Num. of Contents: ", 
                                                          num_of_contents))) +
  scale_y_discrete(labels = function(x) stringr::str_trunc(x, 12)) +
  labs(title = "India",
       x = "Number of Contents",
       y = "",
       color = "",
       ) +
  theme_minimal() +
  theme(title = element_text(hjust = 0.5),
        text = element_text(family = "lato"),
        panel.grid = element_blank(),
        panel.grid.major.x = element_line(color = "grey88"),
        axis.text.y = element_text(angle = 45))

uk_genre <- df_netflix %>% 
  filter(main_country == "United Kingdom") %>% 
  group_by(genre) %>% 
  summarise(num_of_contents = n()) %>% 
  ungroup() %>% 
  arrange(desc(num_of_contents)) %>% 
  head(5) %>% 
  ggplot(aes(x = num_of_contents, y = fct_reorder(genre, num_of_contents))) +
  geom_col(fill = "grey16", alpha = .9, aes(text = paste0(genre, "<br>",
                                                          "Num. of Contents: ", 
                                                          num_of_contents))) +
  scale_y_discrete(labels = function(x) stringr::str_trunc(x, 12)) +
  labs(title = "UK",
       x = "Number of Contents",
       y = "",
       color = "",
       ) +
  theme_minimal() +
  theme(title = element_text(hjust = 0.5),
        text = element_text(family = "lato"),
        panel.grid = element_blank(),
        panel.grid.major.x = element_line(color = "grey88"),
        axis.text.y = element_text(angle = 45))

indo_genre <- df_netflix %>% 
  filter(main_country == "Indonesia") %>% 
  group_by(genre) %>% 
  summarise(num_of_contents = n()) %>% 
  ungroup() %>% 
  arrange(desc(num_of_contents)) %>% 
  head(5) %>% 
  ggplot(aes(x = num_of_contents, y = fct_reorder(genre, num_of_contents))) +
  geom_col(fill = "grey16", alpha = .9, aes(text = paste0(genre, "<br>",
                                                          "Num. of Contents: ", 
                                                          num_of_contents))) +
  scale_y_discrete(labels = function(x) stringr::str_trunc(x, 12)) +
  labs(title = "Indonesia",
       x = "Number of Contents",
       y = "",
       color = "",
       ) +
  theme_minimal() +
  theme(title = element_text(hjust = 0.5),
        text = element_text(family = "lato"),
        panel.grid = element_blank(),
        panel.grid.major.x = element_line(color = "grey88"),
        axis.text.y = element_text(angle = 45))

library(gridExtra)
grid.arrange(us_genre, india_genre, uk_genre, indo_genre, ncol = 2, top = "Top 5 Genres by Countries")

Interestingly, the most common genres differ between the United States, India, the UK, and Indonesia. In the US, the most common genre is ‘Documentaries’, whereas in the UK it is ‘British TV Shows’. India and Indonesia both have ‘Dramas’ as their most common genre. This is surprising, as widely known genres such as ‘Action & Adventure’ or ‘Horror’ did not make it to the top of the list for any of these countries.
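
Since the four bar plots differ only in the country, the shared steps could be wrapped in a function. The sketch below uses a hypothetical helper name, top_genre_plot, a simplified theme, and made-up toy data; it is one possible refactoring, not the code used for the dashboard.

```r
library(dplyr)
library(ggplot2)
library(forcats)

# Hypothetical helper: top-n genres for one country as a horizontal bar chart
top_genre_plot <- function(data, country_name, n = 5) {
  data %>%
    filter(main_country == country_name) %>%
    # Count titles per genre, most frequent first
    count(genre, name = "num_of_contents", sort = TRUE) %>%
    head(n) %>%
    ggplot(aes(x = num_of_contents,
               y = fct_reorder(genre, num_of_contents))) +
    geom_col(fill = "grey16", alpha = 0.9) +
    labs(title = country_name, x = "Number of Contents", y = "") +
    theme_minimal()
}

# Toy data with the two columns the helper expects
toy <- tibble(
  main_country = c("India", "India", "India", "Indonesia"),
  genre        = c("Dramas", "Dramas", "Comedies", "Dramas")
)

p_india <- top_genre_plot(toy, "India")
p_india
```

With such a helper, each country's plot becomes a single call, and a change to the styling only has to be made in one place.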

Age of Contents Distributions by Selected Countries

To visualize how long titles have been on the platform, a violin plot is used, as it shows the shape of the distribution for each category of a variable. Its weakness is that it does not mark outliers. A boxplot, by comparison, is appropriate when the outliers need to be shown, and it also marks the median and quartiles of each distribution. To make the plot, geom_violin and geom_boxplot are combined in ggplot, as seen in the code chunk below. Note that the outlier points of the boxplot are hidden there (outlier.shape = NA), so the boxplot only adds the median and quartiles to each violin.

us_box <- df_netflix %>% 
  mutate(age_dist = max(df_netflix$date_added) - date_added) %>% 
  filter(main_country == "United States") %>% 
  ggplot(aes(x = type, y = age_dist)) +
  geom_violin(width = 0.8, aes(fill = type)) +
  geom_boxplot(width = 0.05, alpha = 0.8, outlier.shape = NA) + 
  scale_fill_manual(values = c("firebrick", "grey16")) +
      labs(
       title = "US",
       x = "",
       y = "Num. of Days",
       ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "none",
        panel.grid.major.x = element_blank())

uk_box <- df_netflix %>% 
  mutate(age_dist = max(df_netflix$date_added) - date_added) %>% 
  filter(main_country == "United Kingdom") %>% 
  ggplot(aes(x = type, y = age_dist)) +
  geom_violin(width = 0.8, aes(fill = type)) +
  geom_boxplot(width = 0.05, alpha = 0.8, outlier.shape = NA) + 
  scale_fill_manual(values = c("firebrick", "grey16")) +
      labs(
       title = "UK", 
       x = "",
       y = "Num. of Days",
       ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "none",
        panel.grid.major.x = element_blank())

india_box <- df_netflix %>% 
  mutate(age_dist = max(df_netflix$date_added) - date_added) %>% 
  filter(main_country == "India") %>% 
  ggplot(aes(x = type, y = age_dist)) +
  geom_violin(width = 0.8, aes(fill = type)) +
  geom_boxplot(width = 0.05, alpha = 0.8, outlier.shape = NA) + 
  scale_fill_manual(values = c("firebrick", "grey16")) +
      labs(
       title = "India",
       x = "",
       y = "Num. of Days",
       ) +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5),
        panel.grid.major.x = element_blank())

indo_box <- df_netflix %>% 
  mutate(age_dist = max(df_netflix$date_added) - date_added) %>% 
  filter(main_country == "Indonesia") %>% 
  ggplot(aes(x = type, y = age_dist)) +
  geom_violin(width = 0.8, aes(fill = type)) +
  geom_boxplot(width = 0.05, alpha = 0.8, outlier.shape = NA) + 
  scale_fill_manual(values = c("firebrick", "grey16")) +
      labs(
       title = "Indonesia",
       x = "",
       y = "Num. of Days",
       ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "none",
        panel.grid.major.x = element_blank())

grid.arrange(us_box, uk_box, india_box, indo_box, ncol = 2, 
             top = "Age of Content Distributions")

The violin plots show the shape of the distribution of content age for the Movie and TV Show categories in the selected countries: the United States, the United Kingdom, India, and Indonesia. The age of each title is calculated by subtracting its date_added value from the maximum date_added value in the data set. Based on the plots above, the distribution of content age differs between countries. Specifically, the US and the UK have more old content than India and Indonesia. This may indicate that in the early stage of the Netflix platform, most of the content came from the UK and the US rather than from India or Indonesia.
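
The age calculation can be sketched on a few made-up dates. Converting the difference to a plain number of days, as done here, is optional and is an addition to the approach used in the plots.

```r
library(lubridate)

# Made-up dates in the same "Month day, year" format as date_added
date_added <- mdy(c("August 14, 2020", "December 23, 2016", "January 1, 2021"))

# Subtracting Date objects gives a difftime; convert it to numeric days
age_days <- as.numeric(max(date_added) - date_added, units = "days")
age_days
```

The most recently added title always has an age of 0, so the ages are relative to the end of the data set rather than to today.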

Network Diagram of Director-Cast Relationship

In this section, a network diagram between the directors and cast members of each title is presented. A network diagram is a visualization that represents the connections or links between entities. One way to make it is with simpleNetwork from the networkD3 package, which also makes the diagram interactive. Before creating the plot, the director and cast variables are split so that each row holds a single name, by using the str_split and unnest functions. This generates a data frame that pairs each director with each cast member of the same title.

for_network <- df_netflix %>% 
  select(director, cast) %>% 
  mutate(director = str_split(director, ", ")) %>% 
  unnest(cols = c(director)) %>% 
  mutate(cast = str_split(cast, ", ")) %>% 
  unnest(c(cast)) %>% 
  filter(director != "Unknown",
         cast != "Unknown",
         cast %in% c("Adam Sandler", "Kevin Hart"))
for_network$director[1:10]

##  [1] "Peter Segal"        "Steve Brill"        "Peter Segal"       
##  [4] "Adam Shankman"      "Dennis Dugan"       "Frank Coraci"      
##  [7] "Neil LaBute"        "Michael Whitton"    "Genndy Tartakovsky"
## [10] "Steve Brill"

simpleNetwork(for_network, height = "100px", width = "100px",
              Source = 1,
              Target = 2,
              zoom = T,
              linkDistance = 100,
              fontSize = 14,
              charge = -100,
              fontFamily = "Candara",
              nodeColour = "#A93226")

The diagram above shows the relation of the cast members “Adam Sandler” and “Kevin Hart” to the directors of the Movies/TV Shows in which they were cast. In general, Adam Sandler appeared in Movies/TV Shows by 11 different directors, while Kevin Hart appeared in Movies/TV Shows by 7 different directors. This diagram could be enhanced in the Shiny dashboard by using selectizeInput to let the dashboard user select the cast members they want to search for.
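
The construction of the director-cast pairs, and the count of distinct directors per cast member, can be sketched on toy data. The names below are placeholders, and the distinct and count steps are additions that are not part of the plotting code above.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Made-up titles with comma-separated directors and cast members
toy <- tibble(
  director = c("Dir A, Dir B", "Dir C", "Dir A"),
  cast     = c("Actor X, Actor Y", "Actor X", "Actor X")
)

# One row per director-cast pair, with repeated pairs removed
edges <- toy %>%
  mutate(director = str_split(director, ", "),
         cast     = str_split(cast, ", ")) %>%
  unnest(director) %>%
  unnest(cast) %>%
  distinct()
edges

# Number of distinct directors each cast member has worked with
director_counts <- edges %>%
  count(cast, name = "n_directors")
director_counts
```

Removing repeated pairs first means the count reflects different directors rather than the number of shared titles.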

Conclusion

In conclusion, the Netflix data set holds a lot of information that can be explored. This document explored the growth of content over the years, the distribution of content by country, the common genres in selected countries, the distribution of content age by country, and the network of directors and cast members. Interestingly, the content on the Netflix platform increased dramatically from 2015 to 2019, which suggests that the platform gained traction during that period. The content mostly comes from the US, India, and the UK, as these three countries have the highest numbers of titles. The common genres and the distributions of content age vary between those countries. The network diagram also shows the connections between cast members and the directors of the Movies/TV Shows they have appeared in.

Overall, the visualizations ease the exploration of the data set, which could then be processed for machine learning purposes. The right type of visualization depends on the insight or information to be presented. As a recommendation for this data set, a recommender system could be built that clusters titles with similar words in their descriptions, directors, genres, and other variables in the data set.