Netflix Data Analysis

By,
Sai Mounica Gudimella

Introduction

Netflix, an online streaming platform, started in 1997 as a rent-by-mail DVD service that operated on a pay-per-rental model. Users would browse and add movies to their order, and Netflix would send them by mail. After watching, users would mail the DVDs back to Netflix. Rentals cost around $4 each, with an additional $2 postage charge.

Later, Netflix switched to a model where users could keep the DVDs for as long as they liked, but could only rent a new movie after returning their existing one.

As of the fourth quarter of 2020, Netflix had about 203.67 million paid subscribers worldwide.

The name Netflix is derived from the words Internet and flick.

Project Description

  • The aim of this project is to analyze the Netflix data and derive the following insights:
    • TV Show vs. Movie distribution
    • Identify the rating distribution on Netflix
    • Popular genre with respect to movies and TV shows
    • Comparison of content release across the world
    • Distribution of TV Show and Movies with respect to countries
    • Highest frequency words from TV Shows and Movies

Packages Required

library(dplyr)
library(tidyverse)
library(ggplot2)
library(data.table)
library(lubridate)
library(DT)
library(wordcloud)
library(tidytext)
library(ggthemes)

Data Cleaning

Step 1: Examine the structure of the dataset and understand the variables

## 'data.frame':    7787 obs. of  12 variables:
##  $ show_id     : chr  "s1" "s2" "s3" "s4" ...
##  $ type        : chr  "TV Show" "Movie" "Movie" "Movie" ...
##  $ title       : chr  "3%" "7:19" "23:59" "9" ...
##  $ director    : chr  NA "Jorge Michel Grau" "Gilbert Chan" "Shane Acker" ...
##  $ cast        : chr  "João Miguel, Bianca Comparato, Michel Gomes, Rodolfo Valente, Vaneza Oliveira, Rafael Lozano, Viviane Porto, M"| __truncated__ "Demián Bichir, Héctor Bonilla, Oscar Serrano, Azalia Ortiz, Octavio Michel, Carmen Beato" "Tedd Chan, Stella Chung, Henley Hii, Lawrence Koh, Tommy Kuan, Josh Lai, Mark Lee, Susan Leong, Benjamin Lim" "Elijah Wood, John C. Reilly, Jennifer Connelly, Christopher Plummer, Crispin Glover, Martin Landau, Fred Tatasc"| __truncated__ ...
##  $ country     : chr  "Brazil" "Mexico" "Singapore" "United States" ...
##  $ date_added  : chr  "August 14, 2020" "December 23, 2016" "December 20, 2018" "November 16, 2017" ...
##  $ release_year: int  2020 2016 2011 2009 2008 2016 2019 1997 2019 2008 ...
##  $ rating      : chr  "TV-MA" "TV-MA" "R" "PG-13" ...
##  $ duration    : chr  "4 Seasons" "93 min" "78 min" "80 min" ...
##  $ listed_in   : chr  "International TV Shows, TV Dramas, TV Sci-Fi & Fantasy" "Dramas, International Movies" "Horror Movies, International Movies" "Action & Adventure, Independent Movies, Sci-Fi & Fantasy" ...
##  $ description : chr  "In a future where the elite inhabit an island paradise far from the crowded slums, you get one chance to join t"| __truncated__ "After a devastating earthquake hits Mexico City, trapped survivors from all walks of life wait to be rescued wh"| __truncated__ "When an army recruit is found dead, his fellow soldiers are forced to confront a terrifying secret that's haunt"| __truncated__ "In a postapocalyptic world, rag-doll robots hide in fear from dangerous machines out to exterminate them, until"| __truncated__ ...
  • The structure shows that the dataset has 7787 records and 12 variables. Eleven variables are of character type, and release_year is an integer.
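The listing above is the kind of output that str() produces. The following is a minimal sketch rather than the report's actual code: netflix_toy is a hypothetical three-row stand-in whose values are copied from the first rows shown above, and it illustrates both the structure check and a quick tally of column types.

```r
# Hypothetical toy stand-in for the real dataset (first rows shown above).
netflix_toy <- data.frame(
  show_id      = c("s1", "s2", "s3"),
  type         = c("TV Show", "Movie", "Movie"),
  release_year = c(2020L, 2016L, 2011L),
  stringsAsFactors = FALSE
)

# Print dimensions, column names, types, and the first few values.
str(netflix_toy)

# Tally the columns by class to summarise the data types at a glance.
type_counts <- table(sapply(netflix_toy, class))
print(type_counts)
```

On the full dataset, the same tally would be expected to report 11 character columns and 1 integer column, matching the observation above.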

Step 2: Change data type of necessary variables

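# Convert categorical columns to factors, writing back to the data frame
netflix_data$type <- as.factor(netflix_data$type)
netflix_data$country <- as.factor(netflix_data$country)
netflix_data$rating <- as.factor(netflix_data$rating)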

Step 3: Create new columns from existing data if required

  • Here, we split the date_added column into year_added, month_added, day_added to perform further analysis.
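# Parse strings such as "August 14, 2020" into Date values
netflix_data$date_added <- mdy(netflix_data$date_added)
# Extract year, full month name, and day of month as character columns
netflix_data$year_added <- format(netflix_data$date_added, "%Y")
netflix_data$month_added <- format(netflix_data$date_added, "%B")
netflix_data$day_added <- format(netflix_data$date_added, "%d")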

Step 4: Identify missing values

##      show_id         type        title     director         cast      country 
##            0            0            0         2389          718          507 
##   date_added release_year       rating     duration    listed_in  description 
##           10            0            7            0            0            0 
##   year_added  month_added    day_added 
##           10           10           10
  • The director, cast, country, rating and date_added columns contain missing values, as do the three columns derived from date_added. These are categorical or text fields, so they cannot be imputed with a median or similar statistic without attaching incorrect information to individual titles. We exclude the missing values in the analyses that follow.
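A per-column count like the one above can be obtained by combining is.na() with colSums(). This is a sketch under assumptions, not the report's own code: netflix_toy is a hypothetical data frame, and its gaps are placed arbitrarily for illustration.

```r
# Hypothetical toy data frame; the NA positions are invented for illustration.
netflix_toy <- data.frame(
  title    = c("3%", "7:19", "23:59"),
  director = c(NA, "Jorge Michel Grau", "Gilbert Chan"),
  country  = c("Brazil", NA, NA),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column.
missing_counts <- colSums(is.na(netflix_toy))
print(missing_counts)
```

Empty strings are not counted by is.na(), so this approach assumes blank cells were read in as NA (for example through the na.strings argument of read.csv()).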

Step 5: Separate columns that have multiple values in the same cell

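The listed_in column holds several comma-separated genres per title and needs to be separated.

The country column can likewise hold several comma-separated countries per title and needs to be separated.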
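One way to do this separation is tidyr's separate_rows(), which expands a delimited cell into one row per value. The sketch below assumes a ", " delimiter, as seen in the listed_in values above; netflix_toy is hypothetical, and the multi-country value in it is invented.

```r
library(tidyr)

# Hypothetical toy data; the second country value is invented for illustration.
netflix_toy <- data.frame(
  show_id   = c("s1", "s2"),
  country   = c("Brazil", "United States, Mexico"),
  listed_in = c("International TV Shows, TV Dramas",
                "Dramas, International Movies"),
  stringsAsFactors = FALSE
)

# One row per (title, genre) pair.
genres_long <- separate_rows(netflix_toy, listed_in, sep = ", ")

# One row per (title, country) pair.
countries_long <- separate_rows(netflix_toy, country, sep = ", ")

print(genres_long)
print(countries_long)
```

Keeping the two long tables separate avoids the row inflation that would result from expanding both columns in the same data frame, which would count each title once per genre and country combination.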

Exploratory Data Analysis

Type and country Analysis

Which type of content is more common on Netflix: TV Shows or Movies?

  • Movies make up about 70% of Netflix content, with the rest being TV Shows.
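A share like this can be computed by counting titles per type and dividing by the total. The following sketch uses a hypothetical ten-row data frame, netflix_toy, constructed with a 7:3 split purely to mirror the figure above; it is not the real data.

```r
library(dplyr)

# Hypothetical toy data: 7 movies and 3 TV shows, chosen for illustration only.
netflix_toy <- data.frame(
  show_id = paste0("s", 1:10),
  type    = c(rep("Movie", 7), rep("TV Show", 3)),
  stringsAsFactors = FALSE
)

# Count titles per type and convert the counts to percentages.
type_share <- netflix_toy %>%
  count(type) %>%
  mutate(percent = 100 * n / sum(n))

print(type_share)
```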

Which countries have the most TV Shows on Netflix?

  • The plot shows that the majority of TV Shows are produced in the US, followed by the UK and Japan.

Which countries have the most Movies on Netflix?

  • The plot shows that the majority of Movies are produced in the US, followed by India and the UK. Clearly, the largest share of Netflix content is US-produced.

Top 10 countries with the most content in 2020

  • The plot shows that in 2020 the majority of the content was produced in the US, followed by India and the UK.
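A country ranking of this kind can be sketched as a filter, a row separation, and a sorted count. The example below is an assumption-laden illustration: netflix_toy and its values are hypothetical, and "2020" is taken to refer to year_added (release_year could be substituted if the release date was meant).

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, invented for illustration.
netflix_toy <- data.frame(
  show_id    = c("s1", "s2", "s3", "s4"),
  country    = c("United States", "United States, India",
                 "India", "United Kingdom"),
  year_added = c("2020", "2020", "2019", "2020"),
  stringsAsFactors = FALSE
)

top_countries_2020 <- netflix_toy %>%
  filter(year_added == "2020") %>%         # format() stores the year as character
  separate_rows(country, sep = ", ") %>%   # one row per (title, country) pair
  filter(!is.na(country)) %>%              # ignore titles with no country
  count(country, sort = TRUE) %>%          # most frequent countries first
  slice_head(n = 10)                       # keep at most the top 10

print(top_countries_2020)
```

Adding a filter on type before the count would give the separate TV Show and Movie rankings.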

Genre and Rating Analysis

Which genres are most prominent on Netflix?

  • International Movies, followed by Dramas, Comedies and Action & Adventure, form a major part of Netflix's movie content. Anime, Faith and Spirituality, and Cult are the least represented movie genres.

  • Among TV Shows, International, Dramas, Comedies, Crime and Kids TV are the most prominent genres. A major portion of these titles are rated TV-MA, followed by TV-14.
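Genre prominence can be measured by counting how often each genre label occurs within each content type once listed_in has been split into one genre per row. This is a minimal sketch on a hypothetical data frame, netflix_toy, whose rows are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, invented for illustration.
netflix_toy <- data.frame(
  type      = c("Movie", "Movie", "Movie", "TV Show"),
  listed_in = c("Dramas, International Movies",
                "Horror Movies, International Movies",
                "Dramas, Comedies",
                "International TV Shows, TV Dramas"),
  stringsAsFactors = FALSE
)

# Split genres into rows, then count each genre within each content type.
genre_counts <- netflix_toy %>%
  separate_rows(listed_in, sep = ", ") %>%
  count(type, listed_in, sort = TRUE)

print(genre_counts)
```

Because a title can carry several genres, these counts sum to more than the number of titles.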

What is the rating distribution on Netflix?

  • The graph shows that the most common rating among TV Shows and Movies is TV-MA, followed by R.
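A rating distribution of this kind can be tabulated with count() and drawn as a bar chart split by content type. The sketch below is illustrative only: netflix_toy is hypothetical, and the chart styling is an assumption rather than a reproduction of the report's graph.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data, invented for illustration.
netflix_toy <- data.frame(
  type   = c("TV Show", "Movie", "Movie", "Movie", "TV Show"),
  rating = c("TV-MA", "TV-MA", "R", "PG-13", NA),
  stringsAsFactors = FALSE
)

# Drop titles with no rating, then count titles per rating.
rated <- filter(netflix_toy, !is.na(rating))
rating_counts <- count(rated, rating, sort = TRUE)
print(rating_counts)

# Build a bar chart of ratings, with bars split by content type.
rating_plot <- ggplot(rated, aes(x = rating, fill = type)) +
  geom_bar() +
  labs(x = "Rating", y = "Number of titles", fill = "Type")
```

Printing rating_plot renders the chart.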

Year

How does content addition change over time from 2011 to 2021?
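The trend can be summarised by counting titles per year_added and type. This is a sketch on a hypothetical data frame, netflix_toy, with invented values; year_added is converted from character to integer so that the 2011 to 2021 window can be applied numerically.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration.
netflix_toy <- data.frame(
  type       = c("TV Show", "Movie", "Movie", "Movie", "TV Show"),
  year_added = c("2020", "2016", "2018", "2018", NA),
  stringsAsFactors = FALSE
)

added_per_year <- netflix_toy %>%
  filter(!is.na(year_added)) %>%                        # skip titles with no date
  mutate(year_added = as.integer(year_added)) %>%       # character to integer
  filter(year_added >= 2011, year_added <= 2021) %>%    # keep the study window
  count(year_added, type)                               # titles per year and type

print(added_per_year)
```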

Wordcloud

TV Show

Movie
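The word frequencies behind a word cloud can be sketched with tidytext. The example assumes the words are drawn from the description column and that common English stop words are removed; neither choice is confirmed by the report, and the descriptions in netflix_toy are invented.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data; the descriptions are invented for illustration.
netflix_toy <- data.frame(
  type        = c("Movie", "Movie", "TV Show"),
  description = c("A family documentary about life.",
                  "Two friends change the life of a family.",
                  "A series about love and school friends."),
  stringsAsFactors = FALSE
)

word_counts <- netflix_toy %>%
  unnest_tokens(word, description) %>%      # one lowercase word per row
  anti_join(stop_words, by = "word") %>%    # drop common English stop words
  count(type, word, sort = TRUE)            # word frequency per content type

print(word_counts)
```

The counts for one content type can then be passed to wordcloud() through its words and freq arguments to draw the cloud.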

Conclusion

  • Based on the analysis, the following observations could help grow the Netflix user base further:
    • Netflix can increase the number of TV Shows, considering the significant increase over the years 2011 to 2021. This suggests a shift in viewership towards TV Shows.
    • Genre also plays a vital role: International Movies/Shows, Dramas and Comedies are the most prominent among both Movies and TV Shows. This can be extended to target audiences based on country-specific genre and rating distributions.
    • From the word cloud, it can be inferred that for Movies, words like family, documentary, friends, life, woman and world are recurrent. Similarly, for TV Shows, words like series, love, friends, life and school are significant. This suggests that content with themes related to these words tends to attract a larger audience.