This interactive visualization aims to explore patterns in the Netflix dataset, which is available on Kaggle (https://www.kaggle.com/shivamb/netflix-shows). The dataset consists of TV Shows and Movies available on Netflix as of January 2021, collected from Flixable, a third-party Netflix search engine.
The first design challenge is to get an overview of how the Netflix Movies and TV Shows are distributed across the countries that produced them. To do so, a choropleth map shows the number of Movies / TV Shows produced in each country. A bar chart of the top 10 countries that produced the most Movies / TV Shows is also included.
The second design challenge is to find out whether there are any obvious differences in the duration of Movies / TV Shows between countries. A box plot is selected because it summarises the spread and variation of the durations in each country. It also shows any outliers or skewness and allows for easy comparisons between countries.
The third design challenge is to determine the top 20 genres for both Movies and TV Shows. A bar chart is used for this.
The last design challenge is to find the top 10 directors and actors for both Netflix Movies and TV Shows, which are shown in a data table.
Let us start by installing the required packages and libraries.
knitr::opts_chunk$set(echo = TRUE)
packages <- c("dplyr", "tidyverse", "RColorBrewer", "maps", "viridis", "plotly", "gridExtra")
# Install any package that is missing, then attach it
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
Next, load the dataset. The Netflix dataset has 7787 rows: 2410 are listed as TV Shows and the remaining 5377 as Movies. There are 12 columns in total.
| Column Names | Description |
|---|---|
| show_id | Unique ID for every Movie / TV Show |
| type | Identifier - A Movie or TV Show |
| title | Title of the Movie / TV Show |
| director | Director of the Movie / TV Show |
| cast | Actors involved in the movie / TV show |
| country | Country where the movie / TV show was produced |
| date_added | Date it was added on Netflix |
| release_year | Actual Release year of the Movie / TV show |
| rating | TV Rating of the Movie / TV Show |
| duration | Total Duration - in minutes or number of seasons |
| listed_in | Genres of the Movie / TV Show |
| description | A brief summary of the Movie / TV Show |
netflix_data <- read_csv("data/netflix_titles.csv")
tv_data <- netflix_data %>% filter(type=="TV Show")
movies_data <- netflix_data %>% filter(type=="Movie")
world_map <- map_data("world")
dim(netflix_data)[1]
dim(netflix_data)[2]
nrow(tv_data)
nrow(movies_data)
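The date_added field is stored as text (month name, day, year), so it is first converted to a Date in the YYYY-MM-DD format. The titles are then counted per date and type, and a cumulative sum gives the amount of content over time.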
netflix_data$date_added <- as.Date(netflix_data$date_added, format = "%B %d, %Y")
df_by_date <- netflix_data %>% group_by(date_added,type) %>% summarise(added_today = n()) %>%
ungroup() %>% group_by(type) %>% mutate(total_number_of_content = cumsum(added_today))
netflix_line <- plot_ly(df_by_date, x = ~date_added, y = ~total_number_of_content, color = ~type, type = 'scatter', mode = 'lines', colors=c("#DA705F", "#999999"))
netflix_line <- netflix_line %>% layout(yaxis = list(title = 'Count'), xaxis = list(title = 'Date'), title="Amount Of Content Over Time")
netflix_line
From the line chart, we can see that the amount of content added to the Netflix platform has grown sharply since 2016. Also, more Movies than TV Shows have been added to the platform.
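The date conversion and running total used for the line chart can be tried on a small example. This is a minimal base-R sketch rather than the code used above: the `toy` data frame and its values are made up, and English month names are assumed for `%B`.

```r
# Toy data standing in for netflix_data (values are made up for illustration)
toy <- data.frame(
  type = c("Movie", "Movie", "TV Show", "Movie", "TV Show"),
  date_added = c("January 1, 2020", "January 1, 2020", "March 5, 2020",
                 "June 10, 2020", "June 10, 2020"),
  stringsAsFactors = FALSE
)

# %B is the full month name, %d the day, %Y the four-digit year.
# Month names are matched in the current locale (English is assumed here).
toy$date_added <- as.Date(toy$date_added, format = "%B %d, %Y")

# Titles added per date and type, then a running total within each type
daily <- aggregate(added_today ~ date_added + type,
                   data = transform(toy, added_today = 1L), FUN = sum)
daily <- daily[order(daily$type, daily$date_added), ]
daily$total_number_of_content <- ave(daily$added_today, daily$type, FUN = cumsum)
daily
```

Because the running total is computed within each type, the Movie and TV Show lines accumulate independently of each other.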
I noticed that some rows list more than one country.
Hence, I split the country string on the delimiter ", " so that each country gets its own row, then populated a dataframe with the count per country.
Plotly uses ISO codes to generate the world map. Hence, to plot the map, I took the ISO codes from a CSV file in the Plotly documentation and merged them with the current dataframe.
To plot the bar chart, the dataset is first sorted in descending order and then filtered to the top 10.
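The split-and-count step described above can be sketched on a toy vector. This is a simplified base-R illustration with made-up values, not the exact code that follows.

```r
# Toy country column; the second entry lists two countries and the last is missing
country <- c("United States", "United States, Canada", "Japan", NA)

# Split on ", " so each country becomes its own element, then flatten
parts <- unlist(strsplit(country, split = ", "))

# Strip stray commas (assumed possible at the end of an entry) and drop NAs
parts <- gsub(",", "", parts)
parts <- parts[!is.na(parts)]

# Count titles per country and sort in descending order
counts <- sort(table(parts), decreasing = TRUE)
counts
```

Note that a co-production is counted once for every country it lists, so the per-country counts can add up to more than the number of titles.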
s1 <- strsplit(tv_data$country, split = ", ")
titles_countries_fuul <- data.frame(type = rep(tv_data$type, sapply(s1, length)), country = unlist(s1))
titles_countries_fuul$country <- as.character(gsub(",","",titles_countries_fuul$country))
amount_by_country <- na.omit(titles_countries_fuul)
tv_data_map <-amount_by_country %>%
group_by(country) %>%
summarise (count = n()) %>%
dplyr::rename(region = country, num_shows = count)
df <- read.csv("https://raw.githubusercontent.com/plotly/datasets/master/2014_world_gdp_with_codes.csv")
df <- df %>% select(COUNTRY, CODE)
tv_data_map2 <-tv_data_map %>% dplyr::rename(COUNTRY = region)
tv_map2 <-left_join(tv_data_map2, df, by = "COUNTRY")
l <- list(color = toRGB("grey"), width = 0.5)
g <- list(
showframe = FALSE,
showcoastlines = FALSE,
projection = list(type = 'Mercator')
)
fig <- plot_geo(tv_map2)
fig <- fig %>% add_trace(
z = ~num_shows, color = ~num_shows, colors = 'RdGy',
text = ~COUNTRY, locations = ~CODE
)
fig <- fig %>% colorbar(title = 'Number of TV Shows Produced')
fig <- fig %>% layout(
title = 'Distribution of produced Netflix TV Shows<br>Source:<a href="https://www.kaggle.com/shivamb/netflix-shows">Flixable</a>',
geo = g
)
fig
ordered_tv_data_map <-tv_data_map[order(tv_data_map$num_shows, decreasing = TRUE),]
ordered_tv_data_map <- ordered_tv_data_map[1:10,]
tv_countries_bar <- plot_ly(ordered_tv_data_map, x = ~region, y = ~num_shows, type = 'bar', marker = list(color = '#D5A1A3'))
tv_countries_bar <- tv_countries_bar %>% layout(xaxis=list(categoryorder = "array", categoryarray = ordered_tv_data_map$region, title="Countries"), yaxis = list(title = 'Number of TV Shows Produced'), title="Top 10 countries with the highest number of TV Shows produced")
tv_countries_bar
In both plots, the United States leads in the number of TV Shows produced, followed by the United Kingdom and Japan. Most of the countries in the list are developed countries.
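Since the map relies on matching country names to ISO codes, it may be worth checking whether any country failed to match, because a row with a missing code would not be drawn. The sketch below uses two made-up tables. The "Korea, South" spelling in the lookup is an assumption for illustration, and `merge(..., all.x = TRUE)` stands in for `left_join()`.

```r
# Toy counts, shaped like tv_data_map2 (numbers are made up)
counts <- data.frame(COUNTRY = c("United States", "Japan", "South Korea"),
                     num_shows = c(5, 3, 2), stringsAsFactors = FALSE)

# Toy lookup, shaped like the COUNTRY/CODE columns of the Plotly CSV.
# The "Korea, South" spelling is an assumption for illustration.
codes <- data.frame(COUNTRY = c("United States", "Japan", "Korea, South"),
                    CODE = c("USA", "JPN", "KOR"), stringsAsFactors = FALSE)

# Base-R equivalent of left_join(): keep every row of counts
joined <- merge(counts, codes, by = "COUNTRY", all.x = TRUE)

# Countries with a missing CODE would not appear on the choropleth
unmatched <- joined$COUNTRY[is.na(joined$CODE)]
unmatched

# One possible fix: recode to the lookup's spelling before joining
counts$COUNTRY[counts$COUNTRY == "South Korea"] <- "Korea, South"
fixed <- merge(counts, codes, by = "COUNTRY", all.x = TRUE)
```

Running the same check on the real joined table would show whether any of the top countries are affected.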
For TV Shows, the duration field is recorded as a string (a number followed by a word).
Hence, it is converted to an integer by splitting on the delimiter " " and keeping the number.
The countries are ordered in the same way as the top 10 countries with the highest number of TV Shows produced, for easier comparison.
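As a quick illustration of the conversion, the helper below keeps everything before the first space and turns it into an integer. `to_number` is a hypothetical helper name, and the example strings are assumed to resemble the values in the duration column.

```r
# Toy duration strings (assumed to resemble the real column values)
tv_duration <- c("1 Season", "2 Seasons", "10 Seasons")
movie_duration <- c("90 min", "127 min")

# Keep the text before the first space and convert it to an integer
to_number <- function(x) as.integer(sub(" .*$", "", x))

to_number(tv_duration)     # 1 2 10
to_number(movie_duration)  # 90 127
```

The same helper works for both units, but seasons and minutes should still be plotted on separate axes.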
tv_by_duration_country<-na.omit(tv_data[,c("country", "duration")])
tv_by_duration_country <- separate(tv_by_duration_country, duration, c("duration", "word"), sep= " ")
tv_by_duration_country$duration <- as.integer(tv_by_duration_country$duration)
tv_by_duration_country <- tv_by_duration_country %>% select(country, duration)
s4 <- strsplit(tv_by_duration_country$country, split = ", ")
tv_by_duration_country_full <- data.frame(duration = rep(tv_by_duration_country$duration, sapply(s4, length)), country = unlist(s4))
tv_by_duration_country_full$duration <- as.numeric(gsub(" seasons","", tv_by_duration_country_full$duration))
tv_by_duration_country_full_subset <- tv_by_duration_country_full[tv_by_duration_country_full$country %in% ordered_tv_data_map$region,]
tv_duration_boxplot <- plot_ly(tv_by_duration_country_full_subset, y = ~duration, color = ~country, colors = "RdGy", type = "box")
tv_duration_boxplot <- tv_duration_boxplot %>% layout(xaxis=list(categoryorder = "array", categoryarray = c("United States", "United Kingdom", "Japan", "South Korea", "Canada", "France", "India", "Taiwan", "Australia", "Spain"), title="Countries"), yaxis = list(title = 'Duration (in seasons)'),
title="Box-Plots Of TV Shows Duration In Top 10 Countries")
tv_duration_boxplot
From the box plot, TV Shows produced in most of the countries have a median duration of 1 season. Canada is the only country with a median of 2 seasons. The United States has the largest spread and the highest number of outliers.
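To make the terms median, spread and outlier concrete, the numbers behind a box plot can be computed with base R's `boxplot.stats()`. The season counts below are made up, and Plotly may compute its quartiles slightly differently, so this is only a rough guide to what the chart shows.

```r
# Toy season counts for one country (made-up values)
seasons <- c(1, 1, 1, 2, 2, 3, 9)

# boxplot.stats() returns the five numbers that define the box and whiskers
# ($stats) and the points that fall beyond the whiskers ($out)
bs <- boxplot.stats(seasons)
bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # values more than 1.5 box lengths beyond the hinges
```

Here the median is 2 and the single value of 9 is flagged as an outlier, which is the kind of point that appears above the whiskers in the chart.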
Similarly, the listed_in field can hold more than one value, hence I split it on the delimiter ", " as well.
The genres are then counted and sorted in descending order.
s6 <- strsplit(tv_data$listed_in, split = ", ")
tv_listed_in <- data.frame(type = rep(tv_data$type, sapply(s6, length)), listed_in = unlist(s6))
tv_listed_in$listed_in <- as.character(gsub(",","",tv_listed_in$listed_in))
df_by_listed_in_full <- tv_listed_in %>% group_by(listed_in) %>% summarise(count = n()) %>%
arrange(desc(count)) %>% top_n(20)
## Selecting by count
tv_genre_bar <- plot_ly(df_by_listed_in_full, x = ~listed_in, y = ~count, type = 'bar', marker = list(color = '#E9CCC4'))
tv_genre_bar <- tv_genre_bar %>% layout(xaxis=list(categoryorder = "array", categoryarray = df_by_listed_in_full$listed_in, title="Genre"), yaxis = list(title = 'Count'), title="Top 20 TV Shows Genres On Netflix")
tv_genre_bar
International TV Shows, TV Dramas and TV Comedies are the genres with the most content on Netflix.
s8 <- strsplit(tv_data$director, split = ", ")
tv_director <- data.frame(type = rep(tv_data$type, sapply(s8, length)), director = unlist(s8))
tv_director$director <- as.character(gsub(",","",tv_director$director))
tv_director<-na.omit(tv_director) %>% group_by(director) %>% summarise(count = n()) %>% arrange(desc(count)) %>% top_n(5)
## Selecting by count
tv_director<-as.data.frame(tv_director)
tv_director
## director count
## 1 Alastair Fothergill 3
## 2 Ken Burns 3
## 3 Iginio Straffi 2
## 4 Jung-ah Im 2
## 5 Lynn Novick 2
## 6 Rob Seidenglanz 2
## 7 Shin Won-ho 2
## 8 Stan Lathan 2
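The table above has eight rows even though `top_n(5)` was requested, because `top_n()` keeps every row that ties at the cut-off. The base-R sketch below mimics that rule on made-up counts with the same tie pattern. It illustrates the behaviour and is not the dplyr implementation.

```r
# Toy counts mirroring the tie pattern in the table above
count <- c(3, 3, 2, 2, 2, 2, 2, 2, 1, 1)

# Keep values whose minimum rank is at most 5; ties at the cut-off all stay,
# which is why more than five rows come back
kept_with_ties <- count[rank(-count, ties.method = "min") <= 5]
length(kept_with_ties)  # 8

# Taking the first five of the sorted counts gives exactly five values
kept_exact <- head(sort(count, decreasing = TRUE), 5)
length(kept_exact)      # 5
```

If exactly five rows are wanted in the dplyr pipeline, `slice_max(count, n = 5, with_ties = FALSE)` is one option, although which of the tied directors are kept is then arbitrary.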
s10 <- strsplit(tv_data$cast, split = ", ")
tv_actor <- data.frame(type = rep(tv_data$type, sapply(s10, length)), actor = unlist(s10))
tv_actor$actor <- as.character(gsub(",","",tv_actor$actor))
tv_actor<-na.omit(tv_actor) %>% group_by(actor) %>% summarise(count = n()) %>% arrange(desc(count)) %>% top_n(10)
## Selecting by count
tv_actor<-as.data.frame(tv_actor)
tv_actor
## actor count
## 1 Takahiro Sakurai 22
## 2 Yuki Kaji 17
## 3 Ai Kayano 16
## 4 Daisuke Ono 16
## 5 Junichi Suwabe 15
## 6 Yoshimasa Hosoya 14
## 7 Yuichi Nakamura 14
## 8 David Attenborough 13
## 9 Hiroshi Kamiya 13
## 10 Jun Fukuyama 13
## 11 Kana Hanazawa 13
## 12 Vincent Tong 13
Most of the actors who appear in the highest number of TV Shows are Japanese.
s2 <- strsplit(movies_data$country, split = ", ")
titles_countries_fuul <- data.frame(type = rep(movies_data$type, sapply(s2, length)), country = unlist(s2))
titles_countries_fuul$country <- as.character(gsub(",","",titles_countries_fuul$country))
amount_by_country <- na.omit(titles_countries_fuul)
movie_data_map <-amount_by_country %>%
group_by(country) %>%
summarise (count = n()) %>%
dplyr::rename(region = country, num_movie = count)
movie_data_map2 <-movie_data_map %>% dplyr::rename(COUNTRY = region)
movie_map2 <-left_join(movie_data_map2, df, by = "COUNTRY")
l <- list(color = toRGB("grey"), width = 0.5)
g <- list(
showframe = FALSE,
showcoastlines = FALSE,
projection = list(type = 'Mercator')
)
fig <- plot_geo(movie_map2)
fig <- fig %>% add_trace(
z = ~num_movie, color = ~num_movie, colors = 'RdGy',
text = ~COUNTRY, locations = ~CODE
)
fig <- fig %>% colorbar(title = 'Number of Movies Produced')
fig <- fig %>% layout(
title = 'Distribution of produced Netflix Movies<br>Source:<a href="https://www.kaggle.com/shivamb/netflix-shows">Flixable</a>',
geo = g
)
fig
ordered_movie_data_map <-movie_data_map[order(movie_data_map$num_movie, decreasing = TRUE),]
ordered_movie_data_map <- ordered_movie_data_map[1:10,]
movie_countries_bar <- plot_ly(ordered_movie_data_map, x = ~region, y = ~num_movie, type = 'bar', marker = list(color = '#D5A1A3'))
movie_countries_bar <- movie_countries_bar %>% layout(xaxis=list(categoryorder = "array", categoryarray = ordered_movie_data_map$region, title="Countries"), yaxis = list(title = 'Number of Movies Produced'), title="Top 10 countries with the highest number of Movies produced")
movie_countries_bar
The United States also leads with the highest number of Movies produced, followed by India and the United Kingdom.
movies_by_duration_country<-na.omit(movies_data[,c("country", "duration")])
s3 <- strsplit(movies_by_duration_country$country, split = ", ")
movies_by_duration_country_full <- data.frame(duration = rep(movies_by_duration_country$duration, sapply(s3, length)), country = unlist(s3))
movies_by_duration_country_full$duration <- as.numeric(gsub(" min","", movies_by_duration_country_full$duration))
movies_by_duration_country_full_subset <- movies_by_duration_country_full[movies_by_duration_country_full$country %in% ordered_movie_data_map$region,]
movie_duration_boxplot <- plot_ly(movies_by_duration_country_full_subset, y = ~duration, color = ~country, colors = "RdGy", type = "box")
movie_duration_boxplot <- movie_duration_boxplot %>% layout(xaxis=list(categoryorder = "array", categoryarray = c("United States", "India", "United Kingdom", "Canada", "France", "Spain", "Germany","Japan", "China", "Mexico"), title="Countries"), yaxis = list(title = 'Duration (in min)'),
title="Box-Plots Of Movie Duration In Top 10 Countries")
movie_duration_boxplot
From the box plot, the spread is fairly even among the countries. The United States has the highest number of outliers, while India has the highest median duration at 127 min.
s5 <- strsplit(movies_data$listed_in, split = ", ")
movies_listed_in <- data.frame(type = rep(movies_data$type, sapply(s5, length)), listed_in = unlist(s5))
movies_listed_in$listed_in <- as.character(gsub(",","",movies_listed_in$listed_in))
df_by_listed_in_full <- movies_listed_in %>% group_by(listed_in) %>% summarise(count = n()) %>%
arrange(desc(count)) %>% top_n(20)
## Selecting by count
movie_genre_bar <- plot_ly(df_by_listed_in_full, x = ~listed_in, y = ~count, type = 'bar', marker = list(color = '#E9CCC4'))
movie_genre_bar <- movie_genre_bar %>% layout(xaxis=list(categoryorder = "array", categoryarray = df_by_listed_in_full$listed_in, title="Genre"), yaxis = list(title = 'Count'), title="Top 20 Movie Genres On Netflix")
movie_genre_bar
Similar to TV Shows, the top 3 genres are International Movies, Dramas and Comedies.
s7 <- strsplit(movies_data$director, split = ", ")
movie_director <- data.frame(type = rep(movies_data$type, sapply(s7, length)), director = unlist(s7))
movie_director$director <- as.character(gsub(",","",movie_director$director))
movie_director<-na.omit(movie_director) %>% group_by(director) %>% summarise(count = n()) %>% arrange(desc(count)) %>% top_n(10)
## Selecting by count
movie_director<-as.data.frame(movie_director)
movie_director
## director count
## 1 Jan Suter 21
## 2 Raúl Campos 19
## 3 Jay Karas 15
## 4 Marcus Raboy 15
## 5 Cathy Garcia-Molina 13
## 6 Jay Chapman 12
## 7 Martin Scorsese 12
## 8 Youssef Chahine 12
## 9 Steven Spielberg 10
## 10 David Dhawan 9
## 11 Shannon Hartman 9
s9 <- strsplit(movies_data$cast, split = ", ")
movie_actor <- data.frame(type = rep(movies_data$type, sapply(s9, length)), actor = unlist(s9))
movie_actor$actor <- as.character(gsub(",","",movie_actor$actor))
movie_actor<-na.omit(movie_actor) %>% group_by(actor) %>% summarise(count = n()) %>% arrange(desc(count)) %>% top_n(10)
## Selecting by count
movie_actor<-as.data.frame(movie_actor)
movie_actor
## actor count
## 1 Anupam Kher 41
## 2 Shah Rukh Khan 35
## 3 Naseeruddin Shah 30
## 4 Om Puri 30
## 5 Akshay Kumar 29
## 6 Amitabh Bachchan 27
## 7 Boman Irani 27
## 8 Paresh Rawal 27
## 9 Kareena Kapoor 25
## 10 Ajay Devgn 21
An obvious pattern is that the actors who appear in the most Netflix Movies are mostly from India.