Exploratory Data Analysis

Introduction

This dataset consists of TV shows and movies available on Netflix as of 2021 (updated). The dataset was collected from Flixable, a third-party Netflix search engine.

Some interesting questions (tasks) that can be explored with this dataset:

  1. Understanding what content is available in different countries
  2. Identifying similar content by matching text-based features
  3. Network analysis of actors and directors to find interesting insights
  4. Has Netflix focused more on TV shows than on movies in recent years?

Packages Required

The following packages were used:

    • tidyverse: Used to read, clean and plot the data
    • tidytext: Used to split the description text into words and remove stop words
    • wordcloud: Used to draw the word clouds in the description text analysis
    • lubridate: Used for date manipulation
    • plotly: Used to plot interactive charts
    • RColorBrewer: Used for picking color palettes
    • naniar: Used for plotting missing values and examining imputations
    • scales: Used to control the appearance of axis and legend labels
    • janitor: Used to clean the data frame column names and related functions
    • DT: Used to display a data frame as a scrollable table

library(tidyverse)
library(tidytext)
library(wordcloud)
library(lubridate)
library(plotly)
library(RColorBrewer)
library(naniar)
library(scales)
library(janitor)
library(DT)

Import Data

Import the dataset into a data frame, parse date_added as a date, and clean the column names with the function clean_names().

NetFlix <- read_csv("netflix_titles.csv") %>% 
  mutate(date_added = mdy(date_added)) %>% clean_names()

glimpse(NetFlix)
## Rows: 7,787
## Columns: 12
## $ show_id      <chr> "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", "s1…
## $ type         <chr> "TV Show", "Movie", "Movie", "Movie", "Movie", "TV Show",…
## $ title        <chr> "3%", "7:19", "23:59", "9", "21", "46", "122", "187", "70…
## $ director     <chr> NA, "Jorge Michel Grau", "Gilbert Chan", "Shane Acker", "…
## $ cast         <chr> "João Miguel, Bianca Comparato, Michel Gomes, Rodolfo Val…
## $ country      <chr> "Brazil", "Mexico", "Singapore", "United States", "United…
## $ date_added   <date> 2020-08-14, 2016-12-23, 2018-12-20, 2017-11-16, 2020-01-…
## $ release_year <dbl> 2020, 2016, 2011, 2009, 2008, 2016, 2019, 1997, 2019, 200…
## $ rating       <chr> "TV-MA", "TV-MA", "R", "PG-13", "PG-13", "TV-MA", "TV-MA"…
## $ duration     <chr> "4 Seasons", "93 min", "78 min", "80 min", "123 min", "1 …
## $ listed_in    <chr> "International TV Shows, TV Dramas, TV Sci-Fi & Fantasy",…
## $ description  <chr> "In a future where the elite inhabit an island paradise f…

Data Screening

Missing data

gg_miss_which(NetFlix)

  • director, cast, country, date_added and rating have missing data!

Let us see which variables have missing values together!

gg_miss_upset(NetFlix)

This tells us:

  • Only director, cast, country, date_added and rating have missing values
  • director has the most missing values
  • There are 242 cases where both director and country have missing values together
  • There are 241 cases where both director and cast have missing values together
  • There are 58 cases where director, cast and country have missing values together
  • There are 38 cases where both cast and country have missing values together
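
The counts read off the plots can also be checked numerically. The following is a minimal sketch, assuming a small made-up data frame (`toy`, with invented values) in place of NetFlix; the names `missing_per_column` and `both_missing` are only illustrative. Note that this filter counts every row where both columns are missing, while an UpSet plot typically reports each combination exclusively, so the two numbers need not match.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the NetFlix data frame (values are made up)
toy <- tibble(
  show_id  = c("s1", "s2", "s3", "s4", "s5"),
  director = c(NA, "A. Director", NA, "B. Director", NA),
  cast     = c("X, Y", NA, NA, "Z", "W"),
  country  = c("Brazil", "Mexico", NA, NA, "India")
)

# Number of missing values per column, from most to least
missing_per_column <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  arrange(desc(n_missing))

# Rows where director and cast are both missing (inclusive count)
both_missing <- toy %>%
  filter(is.na(director), is.na(cast)) %>%
  nrow()

print(missing_per_column)
print(both_missing)
```

Replacing `toy` with NetFlix should give the per-column totals behind the first plot.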

Data Analysis

Are there more Movies than TV shows on Netflix?

NetFlix %>% count(type, sort = TRUE) %>%
  mutate(prop = paste0(round(n / sum(n) * 100, 0), "%")) %>%
  # y must be the numeric count so the slices are sized correctly;
  # prop is a character string and is only used as the label
  ggplot(aes(x = "", y = n, fill = type)) +
  geom_bar(
    stat = "identity",
    width = 1,
    color = "steelblue",
    size = 1
  ) +
  coord_polar("y", start = 0) +
  geom_text(
    aes(label = prop),
    position = position_stack(vjust = 0.5),
    size = 6,
    col = "white",
    fontface = "bold"
  ) +
  scale_fill_manual(values = c('#e41a1c', '#377eb8')) +
  theme_void() +
  labs(
    title = "Are there more Movies than TV shows on Netflix?",
    subtitle = "Pie Plot, proportion of Movies to TV shows",
    caption = "Kaggle: Netflix Movies and TV Shows",
    fill = ""
  )

It is clear that there are more Movies in this dataset than TV shows!

Year difference between release year and year added

Let's create a new variable, “year_diff”: the number of years between the release year and the year the title was added.

NetFlix <-  NetFlix %>% 
  mutate(year_diff = year(date_added)-release_year) 

NetFlix %>% count(year_diff, sort = F)
## # A tibble: 75 × 2
##    year_diff     n
##        <dbl> <int>
##  1        -3     1
##  2        -2     1
##  3        -1    10
##  4         0  2825
##  5         1  1485
##  6         2   644
##  7         3   439
##  8         4   336
##  9         5   226
## 10         6   218
## # … with 65 more rows

Titles added in the same year as their release are the most common, with 2,825.

  • 10 titles were added 1 year before their release year

  • 1 title was added 2 years before its release year

  • 1 title was added 3 years before its release year

Let's check them:

datatable(
  NetFlix %>% select(-cast, -description) %>%
    filter(year_diff < 0) %>%
    arrange(year_diff),
  caption = NULL,
  options = list(dom = 't')
)

Year difference distribution

NetFlix %>% select(year_diff) %>%
  filter(!is.na(year_diff)) %>%
  plot_ly(x = ~ year_diff,
          type = "histogram",
          marker = list(line = list(color = "darkgray",
                                    width = 1))) %>%
  layout(
    title = "Year difference between release_year and date_added",
    yaxis = list(title = "Count",
                 zeroline = FALSE),
    xaxis = list(title = "difference (Years)",
                 zeroline = FALSE)
  )

The histogram has a long tail: there are even titles with a gap of more than 90 years.

Let's check the titles with a gap of more than 60 years:

datatable(NetFlix %>% select(title, type, release_year, date_added, year_diff) %>%
  filter(year_diff > 60) %>% 
    arrange(desc(year_diff)),
  caption = NULL,
  options = list(dom = 't')
)

Rating by Type

NetFlix %>% select(rating, type) %>%
  filter(!is.na(rating)) %>%
  mutate(rating = fct_lump(rating, 5)) %>%
  group_by(rating, type) %>%
  summarise(Count = n()) %>%
  arrange(Count) %>%
  plot_ly(
    x = ~ type ,
    y = ~ Count,
    type = "bar",
    color = ~ rating,
    text = ~ Count,
    textposition = 'outside',
    textfont = list(color = '#000000', size = 12)
  ) %>%
  # type is on the x axis and Count on the y axis, so label them accordingly
  layout(
    title = "Rating by Type",
    xaxis = list(title = "Type"),
    yaxis = list(title = "Count"),
    legend = list(title = list(text = '<b> Rating </b>'))
  )

The majority of both Movies and TV shows are rated TV-MA.

Here is a short explanation of the ratings:

  • TV-MA: shouldn't be seen by anyone below the age of 17.

  • TV-14: shouldn't be seen by anyone under 14.

  • TV-PG: can be viewed by younger audiences, but shouldn't be seen without their parents in the room.

  • PG-13: Parents Strongly Cautioned, Some Material May Be Inappropriate for Children Under 13.

  • R: Restricted, Children Under 17 Require Accompanying Parent or Adult Guardian.

Distribution by Countries Top 10

NetFlix %>% select(country) %>%
  filter(!is.na(country)) %>%
  mutate(country = fct_lump(country, 10)) %>%
  group_by(country) %>%
  summarise(Count = n()) %>%
  arrange(Count) %>%
  plot_ly(
    x = ~ Count ,
    y = ~ country,
    type = "bar",
    orientation = 'h'
  ) %>%
  layout(yaxis = list(categoryorder = "array", categoryarray = ~ Count)) %>%
  layout(
    title = "Items distribution by Country",
    yaxis = list(title = "Country"),
    xaxis = list(title = "Count")
  )

As expected, the United States comes first, followed by India and the United Kingdom. However, some titles list several countries together, such as United States-United Kingdom, United States-Canada and United States-France, and these joint entries do not appear in the top 10.

Let's check them by widening the selection to the top 45:

NetFlix %>% select(country) %>%
  filter(!is.na(country)) %>%
  mutate(country = fct_lump(country, 45)) %>%
  group_by(country) %>%
  summarise(Count = n()) %>%
  arrange(Count) %>%
  plot_ly(
    x = ~ Count ,
    y = ~ country,
    type = "bar",
    orientation = 'h'
  ) %>%
  layout(yaxis = list(categoryorder = "array", categoryarray = ~ Count)) %>%
  layout(
    title = "Items distribution by Country",
    yaxis = list(title = "Country"),
    xaxis = list(title = "Count")
  )

Other joint entries that show up include:

  • Mexico-United States

  • United States-Japan

  • United States-Australia

  • Hong Kong-China

  • United States-Germany
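
One way to count such joint entries under each individual country is to split the country field into one row per country before counting. This is only a sketch: it assumes that co-producing countries are stored in a single comma-separated string (for example "United States, United Kingdom"), and it uses a made-up data frame `toy` in place of NetFlix.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for NetFlix; the comma-separated format is an assumption
toy <- tibble(
  title   = c("A", "B", "C", "D"),
  country = c("United States",
              "United States, United Kingdom",
              "India",
              "United Kingdom, India, United States")
)

country_counts <- toy %>%
  filter(!is.na(country)) %>%
  # one row per (title, country) pair; sep is a regular expression
  separate_rows(country, sep = ",\\s*") %>%
  filter(country != "") %>%
  count(country, sort = TRUE)

print(country_counts)
```

With this shape a title produced by three countries is counted three times, once per country, so the counts no longer add up to the number of titles.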

Dataset split to check the durations

movies <- NetFlix %>% select(country, type, duration, rating, title) %>%
  filter(type == "Movie") %>%
  drop_na() %>% 
  mutate(duration_min = parse_number(duration))

tv_show <- NetFlix %>% select(country, type, duration, rating, title) %>%
  filter(type == "TV Show") %>% 
  drop_na() %>% 
  mutate(duration_season = parse_number(duration))
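
Two details of this split are worth seeing on a small example: parse_number() keeps only the number from strings such as "93 min" or "4 Seasons", and drop_na() without arguments removes a row if any selected column is missing, so titles without a country would also drop out of the duration analysis. The sketch below uses a made-up data frame `toy`; whether the stricter or the looser filter is wanted depends on the question being asked.

```r
library(dplyr)
library(tidyr)
library(readr)

# Hypothetical rows; title "B" has a duration but no country
toy <- tibble(
  title    = c("A", "B", "C"),
  type     = c("Movie", "Movie", "TV Show"),
  country  = c("Mexico", NA, "Brazil"),
  duration = c("93 min", "78 min", "4 Seasons")
)

# parse_number() extracts the number and ignores the unit text
toy <- toy %>% mutate(duration_value = parse_number(duration))

# drop_na() with no arguments drops rows with NA in ANY column
movies_all_complete <- toy %>% filter(type == "Movie") %>% drop_na()

# naming a column only requires that column to be non-missing
movies_with_duration <- toy %>% filter(type == "Movie") %>% drop_na(duration)

print(toy$duration_value)
print(nrow(movies_all_complete))
print(nrow(movies_with_duration))
```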

Movies Durations

movies %>%
  plot_ly(
    x = ~ duration_min,
    type = "histogram",
    nbinsx = 40,
    marker = list(
      color = "darkblue",
      line = list(color = "black",
                  width = 1)
    )
  ) %>%
  layout(
    title = "Duration distribution",
    yaxis = list(title = "Count",
                 zeroline = FALSE),
    xaxis = list(title = "Duration (min)",
                 zeroline = FALSE)
  ) 

Most movies run 90-99 min, followed by 100-109 min, 80-89 min and 110-119 min.

I'm curious to see the movies that are longer than 200 min.

datatable(movies %>% select(title, duration_min) %>% 
            filter(duration_min >200) %>% arrange(desc(duration_min)),
      caption = NULL,
      options = list(dom = 't'))

TV-Show Durations

tv_show %>% select(duration_season) %>%
  count(duration_season, sort = TRUE) %>%
  ggplot(aes(
    x = as.factor(duration_season),
    y = n,
    label = n
  )) +
  geom_col(aes(fill = duration_season)) +
  geom_text(vjust = -0.5, size = 3, col = "darkgreen") +
  theme_light() +
  theme(legend.position = "none") +
  labs(x = "Season duration",
       y = "Count",
    title = "Season distribution",
    subtitle = "Column Plot, Season distribution",
    caption = "Kaggle: Netflix Movies and TV Shows",
    fill = ""
  )

Single-season shows are the most common among TV shows, followed by shows with two and three seasons.

Before moving on, let's check which TV show has 16 seasons; it is interesting to know.

datatable(tv_show %>% select(title, duration_season) %>% 
            filter(duration_season >15) %>% arrange(desc(duration_season)),
      caption = NULL,
      options = list(dom = 't'))

Time series

ggplotly(
  NetFlix %>% select(date_added) %>%
    filter(!is.na(date_added)) %>%
    mutate(year_added = year(date_added)) %>%
    group_by(year_added) %>%
    summarise(Count = n()) %>%
    arrange(desc(Count)) %>%
    ggplot(aes(
      x = year_added,
      y = Count,
      label = Count
    )) +
    geom_line(size = 1, col = "darkred", alpha = 0.5) +
    geom_col(alpha = 0.6, fill = "steelblue") +
    geom_text(vjust = -0.7, size = 3) +
    theme_light() +
    scale_y_continuous(labels = comma) +
    labs(
      x = "Year Added",
      y = "Count",
      title = "Number of Items added per year",
      subtitle = "Column and line Plot, Number of Items added per year",
      caption = "Kaggle: Netflix Movies and TV Shows"
    )
)

Most titles were added to the streaming service between 2016 and 2019, with the peak in 2019, followed by 2020. Remember that we are talking about titles added here, not viewer numbers. The increase stopped in 2020; I believe this is due to the Covid-19 pandemic reducing the production of Movies and TV shows.
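
The yearly totals do not say whether the mix of Movies and TV shows changed over time, which is the fourth question from the introduction. A minimal sketch of how that could be checked is to compute the share of each type per year added. The data frame `toy` and the name `share_by_year` are made up for illustration; the real figures would come from running the same steps on NetFlix.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for NetFlix with only the two columns needed
toy <- tibble(
  type = c("Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"),
  date_added = as.Date(c("2018-03-01", "2018-07-15", "2018-11-30",
                         "2019-01-10", "2019-05-05", "2019-09-09"))
)

share_by_year <- toy %>%
  filter(!is.na(date_added)) %>%
  mutate(year_added = year(date_added)) %>%
  count(year_added, type) %>%
  group_by(year_added) %>%
  # share of each type within its year
  mutate(share = n / sum(n)) %>%
  ungroup()

print(share_by_year)
```

A rising TV show share across recent years would support the idea of a growing focus on TV shows; a flat share would not.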

Most frequent words in description variable For Movies (word cloud)

desc_words_m <- NetFlix %>% select(type, show_id, description) %>%
  filter(type == "Movie") %>% 
    unnest_tokens(word, description) %>%
    anti_join(stop_words)

count_word <- desc_words_m %>%
   count(word, sort = TRUE)


wordcloud(words = count_word$word,  
          freq = count_word$n, 
          min.freq = 50,  
          max.words = nrow(count_word), 
          random.order = FALSE,  
          rot.per = 0.1,  
          colors = brewer.pal(8, "Dark2")) 

  • Life, Women, Love, Friends, Family, Father, Home, Son and Daughter!

Nice Outcome…

Let's see the TV show word cloud.

Most frequent words in description variable For TV-Shows (word cloud)

desc_words_tv <- NetFlix %>% select(type, show_id, description) %>%
  filter(type == "TV Show") %>% 
    unnest_tokens(word, description) %>%
    anti_join(stop_words)

count_word <- desc_words_tv %>%
   count(word, sort = TRUE)


wordcloud(words = count_word$word,  
          freq = count_word$n, 
          min.freq = 30,  
          max.words = nrow(count_word), 
          random.order = FALSE,  
          rot.per = 0.1,  
          colors = brewer.pal(8, "Dark2")) 

If we set the word “series” aside, Life, Love, Lives, Friends, Family, Crime, Drama and Romance are the most frequent words!

Interesting!
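
Instead of ignoring a dominant word by eye, it can be removed before counting by extending the stop word list. The sketch below assumes a made-up data frame `toy` with invented descriptions; `custom_stop_words` is an illustrative name, and "series" is the only extra word added.

```r
library(dplyr)
library(tidytext)

# Hypothetical descriptions standing in for the TV show rows of NetFlix
toy <- tibble(
  show_id = c("s1", "s2"),
  description = c("A crime series about family and love.",
                  "This drama series follows the lives of friends.")
)

# stop_words has the columns word and lexicon; add one custom entry
custom_stop_words <- bind_rows(
  stop_words,
  tibble(word = "series", lexicon = "custom")
)

word_counts <- toy %>%
  unnest_tokens(word, description) %>%
  anti_join(custom_stop_words, by = "word") %>%
  count(word, sort = TRUE)

print(word_counts)
```

The resulting counts can be passed to wordcloud() in the same way as count_word above.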

Summary

To wrap up this notebook: Netflix is very popular among various social classes and, as we all know, its popularity increased during the Covid-19 pandemic.

For me, the important thing is to be aware of how much this digital platform matters, both for entertainment and for its impact on future generations. At the end of this notebook, I think we have to turn our daily work into something fun in order to be creative.