IS428 - Assignment 5

1. Overview

This interactive visualization explores patterns in the Netflix dataset, which is available on Kaggle (https://www.kaggle.com/shivamb/netflix-shows). The dataset consists of the TV Shows and Movies available on Netflix as of January 2021, collected from Flixable, a third-party Netflix search engine.

1.1 Design Challenges

The first design challenge is to give an overview of how Netflix Movies and TV Shows are distributed across the producing countries. To do so, a choropleth map shows the number of Movies/TV Shows produced in each country, and a bar chart shows the top 10 countries that produced the most Movies/TV Shows.
The second design challenge is to find out whether the duration of Movies/TV Shows differs noticeably between countries. A box plot is selected because it summarises the spread and variation of the Netflix content in each country, shows any outliers or skewness, and allows easy comparison between countries.
The third design challenge is to determine the top 20 genres for both Movies and TV Shows. A bar chart is used for this.
The last design challenge is to find the top 10 directors and actors for both Netflix Movies and TV Shows, which are shown in a data table.

2. Proposed Design Sketch

3. Data Viz Step-By-Step

3.1 Install and Load R packages

Let us start by installing and loading the required packages.

knitr::opts_chunk$set(echo = TRUE)

packages = c( 'dplyr', 'tidyverse', "RColorBrewer", "maps", "viridis", "plotly", 'gridExtra')

# Install any package that is missing, then attach it
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

3.2 Dataset Overview

Next, load the dataset. The Netflix dataset has 7787 rows: 2410 are listed as TV Shows and the remaining 5377 as Movies. There are 12 columns in total.

Column Names	Description
show_id	Unique ID for every Movie / TV Show
type	Identifier - A Movie or TV Show
title	Title of the Movie / TV Show
director	Director of the Movie
cast	Actors involved in the movie / TV show
country	Country where the movie / TV show was produced
date_added	Date it was added on Netflix
release_year	Actual Release year of the Movie / TV show
rating	TV Rating of the Movie / TV Show
duration	Total Duration - in minutes or number of seasons
listed_in	Genres of the Movie / TV Show
description	A brief summary of the Movie / TV Show

netflix_data <- read_csv("data/netflix_titles.csv") 


tv_data <- netflix_data %>% filter(type=="TV Show")
movies_data <- netflix_data %>% filter(type=="Movie")

world_map <- map_data("world")

dim(netflix_data)[1]
dim(netflix_data)[2]
nrow(tv_data)
nrow(movies_data)

3.3 Line Chart to show the amount of content added to the platform over the years.

The date_added column is stored as text in the form "Month day, year", so it is first converted to a Date (YYYY-MM-DD).

netflix_data$date_added <- as.Date(netflix_data$date_added, format = "%B %d, %Y")

df_by_date <- netflix_data %>% group_by(date_added,type) %>% summarise(added_today = n()) %>% 
            ungroup() %>% group_by(type) %>% mutate(total_number_of_content = cumsum(added_today))

netflix_line <- plot_ly(df_by_date, x = ~date_added, y = ~total_number_of_content, color = ~type, type = 'scatter', mode = 'lines', colors=c("#DA705F",  "#999999")) 
netflix_line <- netflix_line %>% layout(yaxis = list(title = 'Count'), xaxis = list(title = 'Date'), title="Amount Of Content Over Time")

netflix_line

From the line chart, we can see that the amount of content added to the Netflix platform has grown rapidly since 2016. Also, more Movies than TV Shows have been added to the platform.
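As a minimal sketch of the running total behind the line chart, the same idea can be written in base R. The toy data frame below is made up for illustration and stands in for netflix_data; the column names n and total are assumptions of this sketch, not part of the original code.

```r
# Toy data standing in for netflix_data (values are invented for illustration)
toy <- data.frame(
  date_added = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01",
                         "2020-02-01", "2020-03-01")),
  type = c("Movie", "Movie", "TV Show", "Movie", "TV Show"),
  stringsAsFactors = FALSE
)

# Count the titles added on each date, separately for each type
daily <- aggregate(n ~ date_added + type, data = transform(toy, n = 1), FUN = sum)

# Sort by type and date so the running total accumulates in time order
daily <- daily[order(daily$type, daily$date_added), ]

# Running total within each type
daily$total <- ave(daily$n, daily$type, FUN = cumsum)
daily
```

The sort matters: cumsum accumulates in row order, so the rows of each type must already be in date order before the running total is taken.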

3.4 TV Shows

3.4.1 TV Shows - Choropleth Map showing the distribution in each country, Bar Chart showing the Top 10 countries with the highest number of TV Shows produced.

Some rows list more than one country.
Hence, the country string is split on the delimiter ", " and the number of titles is then counted per country.
Plotly uses ISO codes to generate the world map. To plot the map, the ISO codes are taken from a CSV file in the Plotly documentation and merged with the current data frame.

To plot the bar chart, the data is first sorted in descending order and then filtered to the top 10.
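The split-and-count step described above can be sketched on a small example. The toy data frame and its values are invented for illustration; only the technique (split on ", ", repeat the other columns once per piece, then count) mirrors the report.

```r
# Toy data with multi-country rows and a missing value (invented for illustration)
toy <- data.frame(
  title = c("A", "B", "C", "D"),
  country = c("United States, Canada", "Japan", NA, "United States"),
  stringsAsFactors = FALSE
)

# Split each country string into its individual countries
parts <- strsplit(toy$country, split = ", ", fixed = TRUE)

# Repeat each title once per country so the two columns line up
long <- data.frame(
  title = rep(toy$title, lengths(parts)),
  country = unlist(parts),
  stringsAsFactors = FALSE
)

# Drop rows with no country, then count titles per country
long <- long[!is.na(long$country), ]
counts <- sort(table(long$country), decreasing = TRUE)
counts
```

A title produced in two countries is counted once for each of them, so the per-country counts can sum to more than the number of titles.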

s1 <- strsplit(tv_data$country, split = ", ")
titles_countries_fuul <- data.frame(type = rep(tv_data$type, sapply(s1, length)), country = unlist(s1))
titles_countries_fuul$country <- as.character(gsub(",","",titles_countries_fuul$country))
amount_by_country <- na.omit(titles_countries_fuul)

tv_data_map <-amount_by_country %>% 
  group_by(country) %>% 
  summarise (count = n()) %>% 
  dplyr::rename(region = country, num_shows = count)


df <- read.csv("https://raw.githubusercontent.com/plotly/datasets/master/2014_world_gdp_with_codes.csv")
df <- df %>% select(COUNTRY, CODE)
tv_data_map2 <-tv_data_map %>% dplyr::rename(COUNTRY = region)

tv_map2 <-left_join(tv_data_map2, df, by = "COUNTRY")


l <- list(color = toRGB("grey"), width = 0.5)
g <- list(
  showframe = FALSE,
  showcoastlines = FALSE,
  projection = list(type = 'Mercator')
)

fig <- plot_geo(tv_map2)
fig <- fig %>% add_trace(
    z = ~num_shows, color = ~num_shows, colors = 'RdGy',
    text = ~COUNTRY, locations = ~CODE
  )
fig <- fig %>% colorbar(title = 'Number of TV Shows Produced')
fig <- fig %>% layout(
    title = 'Distribution of produced Netflix TV Shows<br>Source:<a href="https://www.kaggle.com/shivamb/netflix-shows">Flixable</a>',
    geo = g
  )

fig

ordered_tv_data_map <-tv_data_map[order(tv_data_map$num_shows, decreasing = TRUE),]  

ordered_tv_data_map <- ordered_tv_data_map[1:10,]


tv_countries_bar <- plot_ly(ordered_tv_data_map, x = ~region, y = ~num_shows, type = 'bar', marker = list(color = '#D5A1A3'))
tv_countries_bar <- tv_countries_bar %>% layout(xaxis=list(categoryorder = "array", categoryarray = ordered_tv_data_map$region, title="Countries"), yaxis = list(title = 'Number of TV Shows Produced'), title="Top 10 countries with the highest number of TV Shows produced")

tv_countries_bar

Both plots show that the United States leads in the number of TV Shows produced, followed by the United Kingdom and Japan. Most of the countries in the list are developed countries.
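Because the map relies on joining country names to ISO codes, any name spelled differently in the two sources ends up without a code and cannot be placed on the map. The sketch below shows one way to check for this. Both data frames are hypothetical, and the spelling "Korea, South" in the lookup table is an assumption made purely for illustration.

```r
# Hypothetical counts per country (values invented for illustration)
counts <- data.frame(
  COUNTRY = c("United States", "Japan", "South Korea"),
  num_shows = c(10, 5, 3),
  stringsAsFactors = FALSE
)

# Hypothetical ISO lookup; the spelling "Korea, South" is an assumption
codes <- data.frame(
  COUNTRY = c("United States", "Japan", "Korea, South"),
  CODE = c("USA", "JPN", "KOR"),
  stringsAsFactors = FALSE
)

# Left join: keep every counted country, even those without a code
joined <- merge(counts, codes, by = "COUNTRY", all.x = TRUE)

# Countries that found no matching code
unmatched <- joined$COUNTRY[is.na(joined$CODE)]
unmatched

# Recode the mismatched name and join again
counts$COUNTRY[counts$COUNTRY == "South Korea"] <- "Korea, South"
fixed <- merge(counts, codes, by = "COUNTRY", all.x = TRUE)
```

Running a check like this on the real joined data frame before plotting would show which countries, if any, are missing from the choropleth.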

3.4.2 TV Shows - Box Plot showing the spread in the duration of TV shows in the Top 10 countries.

For TV Shows, the duration field is recorded as a string (a number followed by a unit).
Hence, it is converted to an integer by splitting it on the delimiter " " and keeping the number.
The countries are ordered according to the Top 10 countries with the highest number of TV Shows produced for easier analysis.
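As a small sketch of this conversion, the number and the unit can be pulled apart with base R regular expressions. The sample strings below are illustrative and are not taken from the dataset.

```r
# Illustrative duration strings: seasons for TV Shows, minutes for Movies
durations <- c("1 Season", "3 Seasons", "90 min", NA)

# Keep everything before the first space and convert it to an integer
value <- as.integer(sub(" .*$", "", durations))

# Keep everything after the leading number as the unit
unit <- sub("^[0-9]+ ", "", durations)

data.frame(durations, value, unit)
```

Keeping the unit alongside the number makes it easy to confirm that seasons and minutes are never mixed on the same axis.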

tv_by_duration_country<-na.omit(tv_data[,c("country", "duration")])

tv_by_duration_country <- separate(tv_by_duration_country, duration, c("duration", "word"), sep= " ")
tv_by_duration_country$duration <- as.integer(tv_by_duration_country$duration) 
tv_by_duration_country <- tv_by_duration_country %>% select(country, duration)

s4 <- strsplit(tv_by_duration_country$country, split = ", ")

tv_by_duration_country_full <- data.frame(duration = rep(tv_by_duration_country$duration, sapply(s4, length)), country = unlist(s4))
tv_by_duration_country_full$duration <- as.numeric(gsub(" seasons","", tv_by_duration_country_full$duration))

tv_by_duration_country_full_subset <- tv_by_duration_country_full[tv_by_duration_country_full$country %in% ordered_tv_data_map$region,]


tv_duration_boxplot <- plot_ly(tv_by_duration_country_full_subset, y = ~duration, color = ~country, colors = "RdGy", type = "box")
tv_duration_boxplot <- tv_duration_boxplot %>% layout(xaxis=list(categoryorder = "array", categoryarray = c("United States", "United Kingdom", "Japan", "South Korea", "Canada", "France", "India", "Taiwan", "Australia", "Spain"), title="Countries"), yaxis = list(title = 'Duration (in seasons)'), 
        title="Box-Plots Of TV Shows Duration In Top 10 Countries")

tv_duration_boxplot

From the box plot, TV Shows produced in most of the countries last 1 season. Canada is the only country with a median of 2 seasons. The United States has the widest spread and the highest number of outliers.
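The medians and outliers read off the box plot can also be checked numerically. The sketch below uses base R on invented data; boxplot.stats() applies the usual 1.5 x IQR rule based on hinges, and Plotly's quartile method may differ slightly, so the flagged points are not guaranteed to match the chart exactly.

```r
# Invented season counts for two hypothetical countries
toy <- data.frame(
  country = c(rep("A", 7), rep("B", 5)),
  seasons = c(1, 1, 1, 1, 2, 2, 9, 1, 2, 2, 3, 3),
  stringsAsFactors = FALSE
)

# Median number of seasons per country
medians <- tapply(toy$seasons, toy$country, median)

# Points beyond 1.5 x IQR from the hinges, per country
outliers <- tapply(toy$seasons, toy$country, function(x) boxplot.stats(x)$out)

medians
outliers
```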

3.4.3 TV Shows - Top 20 Genres.

Similarly, the listed_in field can hold more than one value, so it is also split on the delimiter ", ".
The data is then sorted in descending order.

s6 <- strsplit(tv_data$listed_in, split = ", ")
tv_listed_in <- data.frame(type = rep(tv_data$type, sapply(s6, length)), listed_in = unlist(s6))
tv_listed_in$listed_in <- as.character(gsub(",","",tv_listed_in$listed_in))

df_by_listed_in_full <- tv_listed_in %>% group_by(listed_in) %>% summarise(count = n()) %>%
  arrange(desc(count)) %>% top_n(20)

## Selecting by count

tv_genre_bar <- plot_ly(df_by_listed_in_full, x = ~listed_in, y = ~count, type = 'bar', marker = list(color = '#E9CCC4'))
tv_genre_bar <- tv_genre_bar %>% layout(xaxis=list(categoryorder = "array", categoryarray = df_by_listed_in_full$listed_in, title="Genre"), yaxis = list(title = 'Count'), title="Top 20 TV Shows Genres On Netflix")

tv_genre_bar

International TV Shows, TV Dramas and TV Comedies have the highest number of content on Netflix.

3.4.4 TV Shows - Top Directors (top 5 by count, ties included).

s8 <- strsplit(tv_data$director, split = ", ")
tv_director <- data.frame(type = rep(tv_data$type, sapply(s8, length)), director = unlist(s8))
tv_director$director <- as.character(gsub(",","",tv_director$director))
tv_director<-na.omit(tv_director) %>%  group_by(director)  %>% summarise(count = n()) %>% arrange(desc(count)) %>% top_n(5)

## Selecting by count

tv_director<-as.data.frame(tv_director)
tv_director

##              director count
## 1 Alastair Fothergill     3
## 2           Ken Burns     3
## 3      Iginio Straffi     2
## 4          Jung-ah Im     2
## 5         Lynn Novick     2
## 6     Rob Seidenglanz     2
## 7         Shin Won-ho     2
## 8         Stan Lathan     2

3.4.5 TV Shows - Top 10 Actors.

s10 <- strsplit(tv_data$cast, split = ", ")
tv_actor <- data.frame(type = rep(tv_data$type, sapply(s10, length)), actor = unlist(s10))
tv_actor$actor <- as.character(gsub(",","",tv_actor$actor))
tv_actor<-na.omit(tv_actor) %>%  group_by(actor)  %>% summarise(count = n()) %>% arrange(desc(count)) %>% top_n(10)

## Selecting by count

tv_actor<-as.data.frame(tv_actor)
tv_actor

##                 actor count
## 1    Takahiro Sakurai    22
## 2           Yuki Kaji    17
## 3           Ai Kayano    16
## 4         Daisuke Ono    16
## 5      Junichi Suwabe    15
## 6    Yoshimasa Hosoya    14
## 7     Yuichi Nakamura    14
## 8  David Attenborough    13
## 9      Hiroshi Kamiya    13
## 10       Jun Fukuyama    13
## 11      Kana Hanazawa    13
## 12       Vincent Tong    13

Many of the actors who appear in the highest number of TV Shows are Japanese.
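The director and actor tables contain more rows than the number passed to top_n() because top_n() keeps every row tied with the last qualifying count. The sketch below emulates that behaviour in base R on invented data and contrasts it with taking exactly n rows. If a fixed number of rows is wanted in the dplyr pipeline, slice_max() with with_ties = FALSE is one option.

```r
# Invented counts with a tie for third place
toy <- data.frame(
  actor = c("A", "B", "C", "D", "E"),
  count = c(5, 4, 3, 3, 1),
  stringsAsFactors = FALSE
)
n <- 3

# Emulates top_n(): keep every row whose rank (ties share the lowest rank) is <= n
with_ties <- toy[rank(-toy$count, ties.method = "min") <= n, ]

# Exactly n rows: sort by count and cut off, breaking ties by original order
exactly_n <- head(toy[order(-toy$count), ], n)

nrow(with_ties)
nrow(exactly_n)
```

Keeping ties avoids an arbitrary cut among equally ranked names, at the cost of a table that is longer than its heading suggests.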

3.5 Movies

3.5.1 Movies - Choropleth Map showing the distribution in each country, Bar Chart showing the Top 10 countries with the highest number of Movies produced.

s2 <- strsplit(movies_data$country, split = ", ")
titles_countries_fuul <- data.frame(type = rep(movies_data$type, sapply(s2, length)), country = unlist(s2))
titles_countries_fuul$country <- as.character(gsub(",","",titles_countries_fuul$country))
amount_by_country <- na.omit(titles_countries_fuul) 

movie_data_map <-amount_by_country %>% 
  group_by(country) %>% 
  summarise (count = n()) %>% 
  dplyr::rename(region = country, num_movie = count)

movie_data_map2 <-movie_data_map %>% dplyr::rename(COUNTRY = region)

movie_map2 <-left_join(movie_data_map2, df, by = "COUNTRY")


l <- list(color = toRGB("grey"), width = 0.5)
g <- list(
  showframe = FALSE,
  showcoastlines = FALSE,
  projection = list(type = 'Mercator')
)

fig <- plot_geo(movie_map2)
fig <- fig %>% add_trace(
    z = ~num_movie, color = ~num_movie, colors = 'RdGy',
    text = ~COUNTRY, locations = ~CODE
  )
fig <- fig %>% colorbar(title = 'Number of Movies Produced')
fig <- fig %>% layout(
    title = 'Distribution of produced Netflix Movies<br>Source:<a href="https://www.kaggle.com/shivamb/netflix-shows">Flixable</a>',
    geo = g
  )

fig

ordered_movie_data_map <-movie_data_map[order(movie_data_map$num_movie, decreasing = TRUE),]  

ordered_movie_data_map <- ordered_movie_data_map[1:10,]

movie_countries_bar <- plot_ly(ordered_movie_data_map, x = ~region, y = ~num_movie, type = 'bar', marker = list(color = '#D5A1A3'))
movie_countries_bar <- movie_countries_bar %>% layout(xaxis=list(categoryorder = "array", categoryarray = ordered_movie_data_map$region, title="Countries"), yaxis = list(title = 'Number of Movies Produced'), title="Top 10 countries with the highest number of Movies produced")

movie_countries_bar

The United States also leads in the number of Movies produced, followed by India and the United Kingdom.

3.5.2 Movies - Box Plot showing the spread in the duration of Movies in the Top 10 countries.

movies_by_duration_country<-na.omit(movies_data[,c("country", "duration")])
s3 <- strsplit(movies_by_duration_country$country, split = ", ")

movies_by_duration_country_full <- data.frame(duration = rep(movies_by_duration_country$duration, sapply(s3, length)), country = unlist(s3))
movies_by_duration_country_full$duration <- as.numeric(gsub(" min","", movies_by_duration_country_full$duration))

movies_by_duration_country_full_subset <- movies_by_duration_country_full[movies_by_duration_country_full$country %in% ordered_movie_data_map$region,]


movie_duration_boxplot <- plot_ly(movies_by_duration_country_full_subset, y = ~duration, color = ~country, colors = "RdGy", type = "box")
movie_duration_boxplot <- movie_duration_boxplot %>% layout(xaxis=list(categoryorder = "array", categoryarray = c("United States", "India", "United Kingdom", "Canada", "France", "Spain", "Germany","Japan", "China", "Mexico"), title="Countries"), yaxis = list(title = 'Duration (in min)'), 
        title="Box-Plots Of Movie Duration In Top 10 Countries")


movie_duration_boxplot

From the box plot, the spread is fairly even among the countries. The United States has the highest number of outliers while India has the highest median duration of 127 mins.

3.5.3 Movies - Top 20 Genres.

s5 <- strsplit(movies_data$listed_in, split = ", ")
movies_listed_in <- data.frame(type = rep(movies_data$type, sapply(s5, length)), listed_in = unlist(s5))
movies_listed_in$listed_in <- as.character(gsub(",","",movies_listed_in$listed_in))

df_by_listed_in_full <- movies_listed_in %>% group_by(listed_in) %>% summarise(count = n()) %>%
  arrange(desc(count)) %>% top_n(20)

## Selecting by count

movie_genre_bar <- plot_ly(df_by_listed_in_full, x = ~listed_in, y = ~count, type = 'bar', marker = list(color = '#E9CCC4'))
movie_genre_bar <- movie_genre_bar %>% layout(xaxis=list(categoryorder = "array", categoryarray = df_by_listed_in_full$listed_in, title="Genre"), yaxis = list(title = 'Count'), title="Top 20 Movie Genres On Netflix")

movie_genre_bar

Similar to TV Shows, the top 3 genres are International Movies, Dramas and Comedies.

3.5.4 Movies - Top 10 Directors.

s7 <- strsplit(movies_data$director, split = ", ")
movie_director <- data.frame(type = rep(movies_data$type, sapply(s7, length)), director = unlist(s7))
movie_director$director <- as.character(gsub(",","",movie_director$director))
movie_director<-na.omit(movie_director) %>%  group_by(director)  %>% summarise(count = n()) %>% arrange(desc(count)) %>% top_n(10)

## Selecting by count

movie_director<-as.data.frame(movie_director)
movie_director

##               director count
## 1            Jan Suter    21
## 2          Raúl Campos    19
## 3            Jay Karas    15
## 4         Marcus Raboy    15
## 5  Cathy Garcia-Molina    13
## 6          Jay Chapman    12
## 7      Martin Scorsese    12
## 8      Youssef Chahine    12
## 9     Steven Spielberg    10
## 10        David Dhawan     9
## 11     Shannon Hartman     9

3.5.5 Movies - Top 10 Actors.

s9 <- strsplit(movies_data$cast, split = ", ")
movie_actor <- data.frame(type = rep(movies_data$type, sapply(s9, length)), actor = unlist(s9))
movie_actor$actor <- as.character(gsub(",","",movie_actor$actor))
movie_actor<-na.omit(movie_actor) %>%  group_by(actor)  %>% summarise(count = n()) %>% arrange(desc(count)) %>% top_n(10)

## Selecting by count

movie_actor<-as.data.frame(movie_actor)
movie_actor

##               actor count
## 1       Anupam Kher    41
## 2    Shah Rukh Khan    35
## 3  Naseeruddin Shah    30
## 4           Om Puri    30
## 5      Akshay Kumar    29
## 6  Amitabh Bachchan    27
## 7       Boman Irani    27
## 8      Paresh Rawal    27
## 9    Kareena Kapoor    25
## 10       Ajay Devgn    21

An obvious pattern is that the actors who appear in the most Netflix Movies are mostly from India.

4. Conclusion

The amount of content added to Netflix has been growing rapidly, which suggests that the value of the platform is also rising.

The choropleth map and bar chart give insight into the distribution of Netflix content produced around the world. The United States takes the lead, and the United Kingdom ranks in the top 3 for both Movies and TV Shows. There is also a strong presence of Netflix content produced in India, Canada, France, Spain and Japan.

From the box plots for TV Shows, Canada and the United States produced the longest-running TV Shows, with a high number of outliers, among the 10 countries that produced the most Netflix TV Shows. For the Movies box plot, movies produced in India have the longest duration while those produced in Mexico have the shortest. Also, there does not seem to be any relationship between the duration of content and the amount of content produced in a country.

The most prevalent genres on Netflix for both TV Shows and Movies are International titles, Dramas and Comedies. One reason might be that they appeal to the widest audience.

For directors, there does not seem to be any pattern, as they originate from all over the world for both TV Shows and Movies. For actors, there is a high number of Japanese actors for TV Shows and a high number of Indian actors for Movies. One reason for this might be the smaller pool of actors and actresses who appear in specialised content such as anime and Indian movies.