Visualization of Netflix Titles Data

The dataset was published on the “TidyTuesday” blog with a contribution from “Kaggle”; credit goes to Shivam Bansal.

This dataset contains 7,787 Netflix titles as of 2019.

The dataset has 12 variables, each listed below with its class and number of distinct values:

## show_id - character 7787
## type - factor 2
## title - character 7787
## director - character 4050
## cast - character 6832
## country - character 682
## date_added - character 1513
## release_year - numeric 73
## rating - character 15
## duration - character 216
## listed_in - character 492
## description - character 7769
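
A listing like the one above can be produced with a few base R calls. The sketch below is only an illustration: it uses a small made-up data frame (`toy`, not the Netflix data) and reports each column's class and number of distinct values. Note that `unique()` counts `NA` as a value.

```r
# Toy stand-in for the Netflix data (hypothetical values, for illustration only)
toy <- data.frame(
  show_id      = c("s1", "s2", "s3", "s4"),
  type         = factor(c("Movie", "TV Show", "Movie", "Movie")),
  release_year = c(2018, 2019, 2018, 2020),
  director     = c("A", NA, "A", "B"),
  stringsAsFactors = FALSE
)

# One row per variable: its class and its number of distinct values
overview <- data.frame(
  variable   = names(toy),
  class      = vapply(toy, function(x) class(x)[1], character(1)),
  n_distinct = vapply(toy, function(x) length(unique(x)), integer(1)),
  row.names  = NULL
)

print(overview)
```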

 

The following table summarizes the number of missing values in each variable of this dataset.

Notably, a substantial share of titles has no director information (2,389 of 7,787, about 31%) or no cast information (718, about 9%).

## # A tibble: 12 × 2
##    variables    NA_count
##    <chr>           <dbl>
##  1 show_id             0
##  2 type                0
##  3 title               0
##  4 director         2389
##  5 cast              718
##  6 country           507
##  7 date_added         10
##  8 release_year        0
##  9 rating              7
## 10 duration            0
## 11 listed_in           0
## 12 description         0
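
The code behind this table is not shown, so the following is a minimal sketch of one way to count missing values per column, assuming dplyr (1.0 or later) and tidyr are available. The data frame `toy` is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the Netflix data (hypothetical values)
toy <- tibble(
  title    = c("T1", "T2", "T3"),
  director = c("A", NA, NA),
  cast     = c(NA, "X, Y", "Z")
)

# Count NAs in every column, then reshape to one row per variable
na_table <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variables", values_to = "NA_count")

print(na_table)
```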

 

Summary statistics for the Netflix Titles dataset are shown below.

Each title is categorised as either a TV Show or a Movie.

##    show_id               type         title             director        
##  Length:7787        TV Show:2410   Length:7787        Length:7787       
##  Class :character   Movie  :5377   Class :character   Class :character  
##  Mode  :character                  Mode  :character   Mode  :character  
##                                                                         
##                                                                         
##                                                                         
##      cast             country           date_added         release_year 
##  Length:7787        Length:7787        Length:7787        Min.   :1925  
##  Class :character   Class :character   Class :character   1st Qu.:2013  
##  Mode  :character   Mode  :character   Mode  :character   Median :2017  
##                                                           Mean   :2014  
##                                                           3rd Qu.:2018  
##                                                           Max.   :2021  
##     rating            duration          listed_in         description       
##  Length:7787        Length:7787        Length:7787        Length:7787       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
## 

 

Top 10 Directors (by Number of Titles)

The following bar graph shows the top 10 directors with the most TV shows or movies on Netflix. Eight directors are tied for 10th place, so the top 10 list contains 17 directors.

data %>%
  filter(!is.na(director)) %>%
  group_by(director) %>%
  summarize(count=n()) %>%
  # slice_max() keeps ties, so more than 10 rows can be returned
  slice_max(count, n=10) %>%
  ggplot() +
  geom_bar(aes(x=forcats::fct_reorder(director,count), y=count), width=0.5, stat="identity", colour="blue", fill="skyblue") +
  coord_flip() +
  theme_light() +
  xlab("Directors") +
  ylab("Count")
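
The list is longer than 10 because `slice_max()` keeps tied rows by default (`with_ties = TRUE`). The sketch below shows this on made-up counts; the data frame `toy` is hypothetical.

```r
library(dplyr)

# Made-up counts: three directors share the second-highest count
toy <- tibble(
  director = c("A", "B", "C", "D", "E"),
  count    = c(5, 3, 3, 3, 1)
)

# Default behaviour: every row tied at the cut-off is kept
top2_with_ties <- toy %>% slice_max(count, n = 2)

# Strict variant: exactly n rows, ties broken by row order
top2_strict <- toy %>% slice_max(count, n = 2, with_ties = FALSE)

print(top2_with_ties)
print(top2_strict)
```

Whether to keep ties is a presentation choice: keeping them avoids an arbitrary cut among equally ranked directors, at the cost of a longer chart.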

Most titles by the top 10 directors are rated for mature audiences (TV-MA), followed by those rated for viewers aged 14 and over (TV-14).

# names of the top 10 directors (ties included)
directors_10 <- data %>%
  filter(!is.na(director)) %>%
  group_by(director) %>%
  summarize(count=n()) %>%
  slice_max(count, n=10) %>%
  pull(director)

data[data$director %in% directors_10,] %>%
  ggplot() + 
  # bars use one constant fill; to split them by type, map fill=type inside aes() and drop the constant fill
  geom_bar(aes(x=forcats::fct_infreq(rating)),stat="count",colour="blue",fill="skyblue") + 
  theme_light() +
  xlab("Ratings") +
  ylab("Count")

The following bar graphs break down the ratings of titles by director.

data[data$director %in% directors_10,] %>% 
  ggplot() + 
  geom_bar(aes(x=forcats::fct_infreq(rating)),stat="count",colour="blue",fill="skyblue") + 
  facet_wrap(~director, nrow=3) + 
  theme_light() + 
  theme(axis.text.x = element_text(angle=90,vjust=0.5))

Top 10 Countries (by Number of Titles)

Visualisations similar to those for directors were prepared for countries.

The top 10 countries that produced content on Netflix are shown below. The USA has by far the highest number of titles, followed by India.

data %>% # top 10 countries that produced content in Netflix
  filter(!is.na(country)) %>% 
  group_by(country) %>% 
  summarize(count=n()) %>% 
  slice_max(count, n=10) %>%
  ggplot() +
  geom_bar(aes(x=forcats::fct_reorder(country,count), y=count), width=0.5,colour="blue", fill="skyblue", stat="identity") +
  coord_flip() +
  theme_bw() +
  xlab("Country") +
  ylab("Count")
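
The country column has 682 distinct values, far more than there are countries, which suggests that some entries list several countries in one string. If so, the grouping above treats each distinct string as its own "country". The sketch below shows one possible way to count each listed country separately, using `tidyr::separate_rows()`. The data frame `toy` and the comma-separated format are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Toy stand-in (hypothetical rows); the second title lists two countries
toy <- tibble(
  show_id = c("s1", "s2", "s3"),
  country = c("United States", "United States, India", NA)
)

# Split multi-country strings into one row per country, then count
country_counts <- toy %>%
  filter(!is.na(country)) %>%
  separate_rows(country, sep = ",\\s*") %>%
  count(country, name = "count", sort = TRUE)

print(country_counts)
```

With this approach a co-production is counted once for every country it lists, so the counts no longer sum to the number of titles.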

Most titles are again rated for mature audiences, with the second-largest category being titles for viewers aged 14 and over.

country_10 <- data %>% 
  filter(!is.na(country)) %>% 
  group_by(country) %>% 
  summarize(count=n()) %>% 
  slice_max(count, n=10) %>% 
  select(country) %>%
  unlist()

# ratings of shows from top 10 countries
data[data$country %in% country_10,] %>%
  ggplot() +
  geom_bar(aes(x=forcats::fct_infreq(rating)), stat="count", colour="blue", fill="skyblue") +
  coord_flip() +
  theme_light() +
  xlab("Rating") +
  ylab("Count")

The ratings of titles from each of these countries are shown below.

data[data$country %in% country_10,] %>%
  ggplot() +
  geom_bar(aes(x=forcats::fct_infreq(rating)), stat="count", colour="blue", fill="skyblue") +
  facet_wrap(~country) +
  theme_light() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  xlab("Rating") +
  ylab("Count")

TV Shows vs Movies

Among titles released around 2000, Netflix carries more movies than TV shows; among titles released in recent years, the two types are similar in number.

The largest share of titles on Netflix was released in 2017-2018, which makes them 4 to 5 years old.

This may reflect audience demand for more TV shows from the entertainment industry. It may also signal a decline in the movie industry.

data %>% 
  group_by(type, release_year) %>% 
  summarize(count=n()) %>% 
  ggplot(aes(release_year, count)) + 
  geom_point(aes(color=type), fill=NA) + 
  geom_line(aes(color=type)) + 
  scale_color_manual(values=c('blue', 'skyblue')) + 
  xlim(c(2000,2022)) + 
  theme_light()

Duration of Content

The plot below shows that content rated for younger audiences tends to be shorter than other content.

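data <- data %>% mutate(
  # number before " min" (movies); NA where the pattern does not match
  minutes = as.integer(str_extract(duration, "\\d+(?= min)")),
  # number before " Seasons" (TV shows); matches the plural only, so "1 Season" would yield NA
  seasons = as.integer(str_extract(duration, "\\d+(?= Seasons)"))
)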
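
The `seasons` pattern above looks ahead for the plural "Seasons", so a duration written as "1 Season" would not match. The sketch below compares it with a variant that looks ahead for "Season" and therefore accepts both forms. The `duration` values are made up, and whether the dataset contains singular entries is an assumption to check against the data.

```r
library(stringr)

# Hypothetical duration strings covering the three shapes of interest
duration <- c("90 min", "1 Season", "2 Seasons")

# Pattern from the text: requires the plural "Seasons"
plural_only <- as.integer(str_extract(duration, "\\d+(?= Seasons)"))

# Variant: "Season" is a prefix of "Seasons", so both forms match
any_season <- as.integer(str_extract(duration, "\\d+(?= Season)"))

# Minutes are extracted the same way as in the text
minutes <- as.integer(str_extract(duration, "\\d+(?= min)"))

print(data.frame(duration, minutes, plural_only, any_season))
```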

data %>% 
  select(rating, type, minutes) %>%
  # collapse related rating codes into broader groups (first matching rule wins)
  mutate(rating=case_when(
    str_detect(rating,"PG") ~ "PG",
    str_detect(rating,"-G") ~ "G",
    str_detect(rating,"UR") ~ "NR",
    str_detect(rating,"NC-17") ~ "R",
    str_detect(rating,"TV-MA") ~ "R",
    str_detect(rating,"-FV") ~ "TV-Y7",
    TRUE ~ rating
  )) %>%
  filter(!is.na(minutes)&type=="Movie"&!is.na(rating)) %>%
  ggplot() +
  geom_density(aes(x=minutes, colour=rating), fill=NA) +
  facet_wrap(~rating, ncol=1) + 
  theme_bw()

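In the plot below, every TV show has at least 2 seasons. The extraction pattern above matches only the plural “Seasons”, so a show whose duration is written as “1 Season” would be left out of this count.

The largest number of seasons for any show is 16.

On closer examination, this long-running show is Grey’s Anatomy.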

data %>%
  filter(type=="TV Show") %>%
  ggplot() +
  geom_bar(aes(x=seasons), colour="blue", fill="skyblue", alpha=0.3) +
  theme_bw() + scale_x_continuous(name="Seasons",breaks=c(0,4,8,12,16))

The code below builds one indicator column per genre and then combines genres that are the same but have slightly different labels (for example, “TV Dramas” and “Dramas”).

The result is a tidier data frame that is easier to work with.

types <- data$listed_in %>% stringr::str_split(", ") %>% unlist() %>% unique()

show_types <- data %>% select(show_id, listed_in)

for (i in types) {
  #print(i)
  show_types[,i] <- 0
}

for (j in seq_len(nrow(show_types))) {
  # split the genre string of row j into individual genre names
  j_types <- stringr::str_split(show_types$listed_in[j],", ") %>% unlist()
  for (k in j_types) {
    show_types[j,k] <- 1
  }
}

data1 <- left_join(data %>% select(-listed_in), show_types %>% select(-listed_in), by="show_id")
#colnames(data1)
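
The nested loops above can also be written without explicit iteration. The sketch below is one possible alternative, not the method used in this report: it splits the genre string into rows and spreads them into 0/1 columns with `tidyr::pivot_wider()`. The data frame `toy` is made up, and the sketch assumes no title lists the same genre twice.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the show_id / listed_in columns (hypothetical rows)
toy <- tibble(
  show_id   = c("s1", "s2", "s3"),
  listed_in = c("Dramas, International Movies", "TV Dramas", "Comedies, Dramas")
)

# One row per (show, genre), then one 0/1 column per genre
wide <- toy %>%
  separate_rows(listed_in, sep = ", ") %>%
  mutate(flag = 1) %>%
  pivot_wider(names_from = listed_in, values_from = flag, values_fill = 0)

print(wide)
```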

data2 <- data1 %>% 
  mutate(International = if_else(`International TV Shows`==1 | `International Movies` == 1,1,0)) %>%
  select(-`International TV Shows`, -`International Movies`) %>%
  mutate(`Sci-Fi & Fantasy`=if_else(`TV Sci-Fi & Fantasy`==1, 1, `Sci-Fi & Fantasy`)) %>% 
  select(-`TV Sci-Fi & Fantasy`) %>%
  mutate(Dramas = if_else(`TV Dramas` == 1, 1, Dramas)) %>% 
  select(-`TV Dramas`) %>%
  mutate(Horror = if_else(`Horror Movies` == 1 | `TV Horror` == 1, 1, 0)) %>% 
  select(-`TV Horror`,-`Horror Movies`) %>%
  mutate(Romantic = if_else(`Romantic Movies` == 1 | `Romantic TV Shows` == 1, 1, 0)) %>% 
  select(-`Romantic Movies`, -`Romantic TV Shows`) %>%
  mutate(Comedies = if_else(`TV Comedies` == 1, 1, Comedies)) %>% 
  select(-`TV Comedies`) %>%
  mutate(Thrillers = if_else(`TV Thrillers` == 1, 1, Thrillers)) %>% 
  select(-`TV Thrillers`) %>%
  mutate(`Action & Adventure` = if_else(`TV Action & Adventure` == 1, 1, `Action & Adventure`)) %>% 
  select(-`TV Action & Adventure`) %>%
  mutate(`Anime` = if_else(`Anime Series`==1 | `Anime Features`==1, 1, 0)) %>% 
  select(-`Anime Series`,-`Anime Features`) %>%
  mutate(Documentaries = if_else(Docuseries == 1, 1, Documentaries)) %>% 
  select(-Docuseries) %>%
  mutate(Cult = if_else(`Cult Movies` == 1 | `Classic & Cult TV` == 1, 1, 0)) %>% 
  select(-`Cult Movies`, -`Classic & Cult TV`) %>%
  select(-Movies, -`TV Shows`)
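
Each merge above follows the same pattern: two 0/1 indicator columns are combined into one that is 1 if either input is 1. For 0/1 columns this is the element-wise maximum, so `pmax()` is a compact equivalent. The sketch below shows this on a made-up data frame (`toy` is hypothetical).

```r
library(dplyr)

# Toy indicator columns (hypothetical values)
toy <- tibble(
  show_id     = c("s1", "s2", "s3"),
  Dramas      = c(1, 0, 0),
  `TV Dramas` = c(0, 1, 0)
)

# Combine the two indicators: 1 if either column is 1, else 0
merged <- toy %>%
  mutate(Dramas = pmax(Dramas, `TV Dramas`)) %>%
  select(-`TV Dramas`)

print(merged)
```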

show_type_names <- colnames(data2)[14:length(colnames(data2))]
#show_type_names

data3 <- data2[c("release_year", show_type_names)] %>% 
  gather("genre", "count", -release_year) %>%
  group_by(release_year, genre) %>%
  summarize(count=sum(count), .groups="drop") 

#top 6 show types in number of contents
top_types <- data3 %>% group_by(genre) %>%
  summarize(count=sum(count)) %>%
  slice_max(order_by=count, n=6) %>%
  select(genre) %>%
  unlist()

The plot below shows, by release year, the number of titles in the six genres with the most content.

These genres are International, Dramas, Comedies, Documentaries, Action & Adventure, and Romantic.

data3 %>% 
  filter(genre %in% top_types) %>%
  ggplot(aes(release_year, count)) +
  geom_line(aes(group=genre,colour=genre)) +
  geom_point(aes(shape=genre,colour=genre),fill=NA) +
  theme_bw()

Word Sentiments in TV Show/Movie Titles

For the analysis below, the lexicon used is the NRC lexicon, published in Saif M. Mohammad and Peter Turney (2013), “Crowdsourcing a Word-Emotion Association Lexicon,” Computational Intelligence, 29(3): 436-465.

The following script breaks the title text down into words and assigns sentiments to each word that appears in the titles.

#library(tidytext)
#sentiment <- get_sentiments("nrc")
#readr::write_csv(sentiment, "sentiment.csv")

sentiment <- readr::read_csv("sentiment.csv", show_col_types = FALSE)

title_words <- data2 %>% select(title) %>% mutate(title=str_split(title,pattern=" "))

title_words <- title_words[["title"]]

title_sentiment <- vector("list", length(title_words))

for (i in seq_along(title_words)) {
  # lower-case the words of title i, strip punctuation, drop empty strings
  words <- title_words[[i]] %>% str_to_lower() %>% str_remove_all("[[:punct:]]")
  words <- words[words != ""]
  
  # look up every word; a word can carry several sentiments,
  # and a word missing from the lexicon counts as "neutral"
  title_sentiment[i] <- list(unlist(lapply(words, function(x){
    found <- sentiment$sentiment[sentiment$word == x]
    if (length(found) == 0) "neutral" else found
  })))
}

data2 <- cbind(data2, tibble::tibble(sentiments=title_sentiment))
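
The same word-by-word lookup can be expressed as a join. The sketch below is an illustrative alternative, not the script used in this report: it tokenises titles, joins them to a lexicon, and labels unmatched words as neutral. `toy_lexicon` and `toy_titles` are made up, and the lexicon rows are hypothetical rather than copied from the NRC lexicon.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical mini-lexicon; these rows are illustrative, not taken from NRC
toy_lexicon <- tibble(
  word      = c("love", "love", "storm"),
  sentiment = c("joy", "positive", "fear")
)

# Made-up titles
toy_titles <- tibble(title = c("Love Actually", "The Storm!", "100 Days"))

word_sentiments <- toy_titles %>%
  # one row per word, lower-cased and stripped of punctuation
  mutate(word = str_split(str_to_lower(title), " ")) %>%
  unnest(cols = c(word)) %>%
  mutate(word = str_remove_all(word, "[[:punct:]]")) %>%
  filter(word != "") %>%
  # a word with several lexicon entries yields several rows
  left_join(toy_lexicon, by = "word") %>%
  # words without a lexicon entry are labelled neutral
  mutate(sentiment = coalesce(sentiment, "neutral"))

print(word_sentiments)
```

The join keeps the result in a flat, one-row-per-word-and-sentiment table, which can be summarised directly without a list column.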

Each word used in the Netflix titles is classified into one or more of the following categories.

  • neutral, fear, negative, sadness, anticipation, joy, positive, trust, disgust, surprise, anger

Most of the words are neutral; these include numbers, prepositions, and nouns that have no positive or negative association.

These sentiments are further grouped into positive or negative sentiments for visualisation.

Surprise and anticipation can be either positive or negative and are therefore left as their own categories.

color_palette  <- c("#69A6D1", "#94DFFF", "#C9EBEF", "#FFD4B1", "#FCADB0")

pie_data <- data2 %>% select(title, sentiments) %>% unnest(cols=c(sentiments)) %>%
  mutate(sentiments = if_else(sentiments %in% c("fear","sadness","disgust","anger"),"negative",sentiments))  %>%
  mutate(sentiments = if_else(sentiments %in% c("joy", "trust"),"positive",sentiments)) %>%
  # share of each sentiment among all unnested rows, so the percentages sum to 100
  count(sentiments) %>%
  mutate(prop=n/sum(n)*100)

pie(pie_data$prop,
    labels=paste(pie_data$sentiments,paste0(round(pie_data$prop,0),"%")),
    radius=1,
    col=color_palette,
    main="Words Used in Netflix Content Titles"
    )