Brief Explanation of the Data

This dataset consists of tv shows and movies available on Netflix as of January 2021. The dataset is collected from Flixable which is a third-party Netflix search engine.

Read the data

First of all, let’s read the data

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(glue)

## 
## Attaching package: 'glue'

## The following object is masked from 'package:dplyr':
## 
##     collapse

netflix <-   read.csv("netflix_titles.csv")

Data Inspection

head(netflix)

dim(netflix)

## [1] 7787   12

names(netflix)

##  [1] "show_id"      "type"         "title"        "director"     "cast"        
##  [6] "country"      "date_added"   "release_year" "rating"       "duration"    
## [11] "listed_in"    "description"

From the inspections above we can conclude that there are 7787 rows and 12 columns in Netflix dataset.

The columns listed includes: “show_id”, “type”, “title”, “director”, “cast”, “country”, “date_added”, “release_year”, “rating”, “duration”, “listed_in” and “description”

Data Cleansing

str(netflix)

## 'data.frame':    7787 obs. of  12 variables:
##  $ show_id     : chr  "s1" "s2" "s3" "s4" ...
##  $ type        : chr  "TV Show" "Movie" "Movie" "Movie" ...
##  $ title       : chr  "3%" "7:19" "23:59" "9" ...
##  $ director    : chr  "" "Jorge Michel Grau" "Gilbert Chan" "Shane Acker" ...
##  $ cast        : chr  "JoÃ£o Miguel, Bianca Comparato, Michel Gomes, Rodolfo Valente, Vaneza Oliveira, Rafael Lozano, Viviane Porto, M"| __truncated__ "DemiÃ¡n Bichir, HÃ©ctor Bonilla, Oscar Serrano, Azalia Ortiz, Octavio Michel, Carmen Beato" "Tedd Chan, Stella Chung, Henley Hii, Lawrence Koh, Tommy Kuan, Josh Lai, Mark Lee, Susan Leong, Benjamin Lim" "Elijah Wood, John C. Reilly, Jennifer Connelly, Christopher Plummer, Crispin Glover, Martin Landau, Fred Tatasc"| __truncated__ ...
##  $ country     : chr  "Brazil" "Mexico" "Singapore" "United States" ...
##  $ date_added  : chr  "August 14, 2020" "December 23, 2016" "December 20, 2018" "November 16, 2017" ...
##  $ release_year: int  2020 2016 2011 2009 2008 2016 2019 1997 2019 2008 ...
##  $ rating      : chr  "TV-MA" "TV-MA" "R" "PG-13" ...
##  $ duration    : chr  "4 Seasons" "93 min" "78 min" "80 min" ...
##  $ listed_in   : chr  "International TV Shows, TV Dramas, TV Sci-Fi & Fantasy" "Dramas, International Movies" "Horror Movies, International Movies" "Action & Adventure, Independent Movies, Sci-Fi & Fantasy" ...
##  $ description : chr  "In a future where the elite inhabit an island paradise far from the crowded slums, you get one chance to join t"| __truncated__ "After a devastating earthquake hits Mexico City, trapped survivors from all walks of life wait to be rescued wh"| __truncated__ "When an army recruit is found dead, his fellow soldiers are forced to confront a terrifying secret that's haunt"| __truncated__ "In a postapocalyptic world, rag-doll robots hide in fear from dangerous machines out to exterminate them, until"| __truncated__ ...

There are some columns with wrong data types. Let’s covert some columns to its correct data types.

netflix$type <- as.factor(netflix$type)
netflix$country <- as.factor(netflix$country)
netflix$rating <- as.factor(netflix$rating)
netflix$director <- as.factor(netflix$director)
netflix$date_added <- as.Date(netflix$date_added, format = "%B %d, %Y")
str(netflix)

## 'data.frame':    7787 obs. of  12 variables:
##  $ show_id     : chr  "s1" "s2" "s3" "s4" ...
##  $ type        : Factor w/ 2 levels "Movie","TV Show": 2 1 1 1 1 2 1 1 1 1 ...
##  $ title       : chr  "3%" "7:19" "23:59" "9" ...
##  $ director    : Factor w/ 4050 levels "","Ã–mer Faruk Sorak",..: 1 1853 1302 3459 3189 3432 4004 2037 3499 3905 ...
##  $ cast        : chr  "JoÃ£o Miguel, Bianca Comparato, Michel Gomes, Rodolfo Valente, Vaneza Oliveira, Rafael Lozano, Viviane Porto, M"| __truncated__ "DemiÃ¡n Bichir, HÃ©ctor Bonilla, Oscar Serrano, Azalia Ortiz, Octavio Michel, Carmen Beato" "Tedd Chan, Stella Chung, Henley Hii, Lawrence Koh, Tommy Kuan, Josh Lai, Mark Lee, Susan Leong, Benjamin Lim" "Elijah Wood, John C. Reilly, Jennifer Connelly, Christopher Plummer, Crispin Glover, Martin Landau, Fred Tatasc"| __truncated__ ...
##  $ country     : Factor w/ 682 levels "","Argentina",..: 41 310 381 551 551 450 139 551 231 231 ...
##  $ date_added  : Date, format: "2020-08-14" "2016-12-23" ...
##  $ release_year: int  2020 2016 2011 2009 2008 2016 2019 1997 2019 2008 ...
##  $ rating      : Factor w/ 15 levels "","G","NC-17",..: 10 10 7 6 6 10 10 7 8 10 ...
##  $ duration    : chr  "4 Seasons" "93 min" "78 min" "80 min" ...
##  $ listed_in   : chr  "International TV Shows, TV Dramas, TV Sci-Fi & Fantasy" "Dramas, International Movies" "Horror Movies, International Movies" "Action & Adventure, Independent Movies, Sci-Fi & Fantasy" ...
##  $ description : chr  "In a future where the elite inhabit an island paradise far from the crowded slums, you get one chance to join t"| __truncated__ "After a devastating earthquake hits Mexico City, trapped survivors from all walks of life wait to be rescued wh"| __truncated__ "When an army recruit is found dead, his fellow soldiers are forced to confront a terrifying secret that's haunt"| __truncated__ "In a postapocalyptic world, rag-doll robots hide in fear from dangerous machines out to exterminate them, until"| __truncated__ ...

The Column “duration” needs special codes to convert the data type since it has 2 kinds of information, duration in minutes, and seasons. In this case I decided to convert Seasons to minutes. Typically a season will have 20-26 episodes per season and each episode will have 37-42 minutes of runtime. So we will convert a season to 23 x 39.5 = 908.5 minutes. Source: Wikipedia, US Runtime.

netflix2 <- separate(netflix, duration, c("duration_time", "duration_type"), sep= " ")
netflix2$duration_time <- as.integer(netflix2$duration_time) 
netflix2$duration_in_min <- ifelse(netflix2$duration_type == "Seasons", netflix2$duration_time*908.5, netflix2$duration_time)
netflix2$duration_in_hour <- netflix2$duration_in_min/60
netflix2

Check for missing values & Drop unnecessary columns

colSums(is.na(netflix2))

##          show_id             type            title         director 
##                0                0                0                0 
##             cast          country       date_added     release_year 
##                0                0               98                0 
##           rating    duration_time    duration_type        listed_in 
##                0                0                0                0 
##      description  duration_in_min duration_in_hour 
##                0                0                0

netflix[is.na(netflix2$date_added),]

There are 98 rows of missing values in “date_added” column which represents about 1% of data. Considering the 1% data representation is considered minimal, we can drop the data. But then, we still don’t know whether we will use “date_added” column as a parameter to find the answer to our business question. Therefore I decided to keep it as it is for the moment.

On the other hand, We will drop director, cast and description column since we are not going to use them.The blank data in country will be converted to Missing Value data and we will drop them since they represents only about 6% of the total data.

netflix2 = subset(netflix2, select = -c(director, cast, description))

var1 <- c("", "NA na", "hai")
var2 <- c("trial", "trial", "trial")
dat <- data.frame(var1, var2)
dat

netflix2[netflix$country == "",] <- NA
netflix2 <- na.omit(netflix2)
netflix2

We are also going to break down “listed_in” column. Although this dataset is not compatible to answer all business questions, it will be useful to answer business question regarding “listed_in” column.

s <- strsplit(netflix2$listed_in, split = ",")
netflix_list <- data.frame(show_id = rep(netflix2$show_id, sapply(s, length)), listed_in = unlist(s))
netflix_listedin <- merge(x = netflix_list, y = netflix2, by = "show_id", all.x = TRUE)
netflix_listedin$listed_in.x <- as.factor(netflix_listedin$listed_in.x)
netflix_listedin

Data Explanation

summary(netflix2)

##    show_id               type         title                     country    
##  Length:7186        Movie  :5147   Length:7186        United States :2515  
##  Class :character   TV Show:2039   Class :character   India         : 922  
##  Mode  :character                  Mode  :character   United Kingdom: 381  
##                                                       Japan         : 222  
##                                                       South Korea   : 179  
##                                                       Canada        : 169  
##                                                       (Other)       :2798  
##    date_added          release_year      rating     duration_time   
##  Min.   :2008-01-01   Min.   :1942   TV-MA  :2681   Min.   :  1.00  
##  1st Qu.:2018-01-30   1st Qu.:2013   TV-14  :1769   1st Qu.:  3.00  
##  Median :2019-03-13   Median :2017   TV-PG  : 716   Median : 90.00  
##  Mean   :2019-01-04   Mean   :2014   R      : 663   Mean   : 72.24  
##  3rd Qu.:2020-01-24   3rd Qu.:2018   PG-13  : 383   3rd Qu.:107.00  
##  Max.   :2021-01-16   Max.   :2021   PG     : 244   Max.   :312.00  
##                                      (Other): 730                   
##  duration_type       listed_in         duration_in_min   duration_in_hour   
##  Length:7186        Length:7186        Min.   :    1.0   Min.   :  0.01667  
##  Class :character   Class :character   1st Qu.:   63.0   1st Qu.:  1.05000  
##  Mode  :character   Mode  :character   Median :   95.0   Median :  1.58333  
##                                        Mean   :  350.2   Mean   :  5.83660  
##                                        3rd Qu.:  118.0   3rd Qu.:  1.96667  
##                                        Max.   :14536.0   Max.   :242.26667  
##

netflix2

From the Data, we can summarize that:

The first movie added to Netflix was in 1st January 2008 and the last one on this data was on 16th January 2021.
There are 7186 unique Titles in Netflix.
Most of the data (Movie/TV shows) came from The US. (2515 titles)
Most titles in Netflix are Rated TV-MA (TV-Mature Audience, 17+), followed with TV-14(14 Years and above) and TV-PG (Parental Guidance).
In average, the titles have a run time of 350.2 minutes, with 1 minute as the shortest and 14,536 minutes as the longest.

Data Manipulation

Data Manipulation & Transformation

1. What is the average runtime of Movies?

Also check whether there are outliers and whether they can be tolerated.

movies <- netflix2[netflix2$type == "Movie",]
aggregate(duration_in_min ~ type, movies, mean)

aggregate(duration_in_min ~ type, movies, sd)

ggplot(data = movies, mapping = aes(x = type, y = duration_in_min)) +
  geom_boxplot(aes(fill = type))

The average run time of Netflix movies are 100.16 minutes From the Data Above we will calculate CV (Coefficient of Variation) CV = SD / MEAN CV = 27.58 / 100.16 = 0.28 The CV <1 is considered to be low, therefore the distribution of run time duration of Netflix Movies can still be tolerated.

2. How about TV Shows?

tvs <- netflix2[netflix2$type == "TV Show",]
aggregate(duration_in_min ~ type, tvs, mean)

aggregate(duration_in_min ~ type, tvs, sd)

ggplot(data = tvs, mapping = aes(x = type, y = duration_in_min)) +
  geom_boxplot(aes(fill = type))

The average run-time of TV Shows are 981.35 minutes. By using the same formula of CV, we can calculate CV = 1728.11 / 981.35 = 1.76 Therefore we can summarize that the distribution of TV Shows run time in Netflix varies a lot. This can happen because we calculate the total run time of the entire show, rather than a single episode. The big outliers might also show because some titles might have a lot of Seasons.

Please note that there are currently no data on how many episodes there are in a season & run-time of each episode in each TV Show titles. But, assuming we take the average episode per season and run time per episode data from US TV Shows (Source: Wikipedia), we can summarize that the Run time per Season is about 908.5 minutes.

Box Plot containing both data

ggplot(data = netflix2, mapping = aes(x = type, y = duration_in_min)) +
  geom_boxplot(aes(fill = type))

In my opinion, the box plot containing both data is not very informative since, most of the Movie data are not shown, this happens due to the big Variance in TV Shows data.

3. Find out the distribution of Titles’ type.

dist1 <- as.data.frame(round(prop.table(table(netflix2$type))*100,2))
dist1

ggplot(dist1, aes(x = "", y = Freq, fill = Var1)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  labs(title = "Netflix Titles' Distribution", x = NULL, y = NULL) +
  theme_minimal()

From the table above, we can conclude that Netflix’s shows are dominated by Movies.

4. Analyze the data coming from The US

4.1 What are the distribution of type in The US?

us <- netflix2[netflix2$country == "United States",]
dist2 <- as.data.frame(round(prop.table(table(us$type))*100, 2))
ggplot(dist2, aes(x = "", y = Freq, fill = Var1)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  labs(title = "Netflix US Titles' Distribution", x = NULL, y = NULL) +
  theme_minimal()

The distribution of show titles from The US are dominated by Movies.

4.2 What are the average run time of shows from The US?

aggregate(duration_in_min ~ type, us, mean)

Movie titles from The US has an average run time of 89.4 minutes, while TV Shows from The US has an average run time of 1436.4 minutes.

4.3 What are the Distribution of rating of titles coming from The US?

us_rating <- as.data.frame(table(us$type, us$rating), margin=1)
us_rated <- ggplot(us_rating, aes(x = Freq, 
                                  y = reorder(Var1, Freq),
                                  text = glue("Rating {Var2} 
                                  Frequency {Freq}
                                  Type {Var1}")
                                  )) +
  geom_col(mapping = aes(fill = Var2), position = "fill") +
  labs(x = "", y = NULL, 
       fill = NULL,
       title = "Distribution of Netflix Titles' Rating",
       subtitle = NULL) 
ggplotly(us_rated, tooltip = "text")

Conclusion: Both Movies and TV Shows from The US are mostly rated “TV-MA”

5. Find out about Category’s distribution of the whole data.

cat1 <- as.data.frame(table(netflix_listedin$listed_in.x))
cat1 <- head(cat1[order(cat1$Freq, decreasing = T),], 5)
cat1_plot <- ggplot(cat1,aes(y=reorder(Var1, Freq), 
                                    x=Freq, text = glue("Category {Var1}
                                                        Frequency {Freq}"))) +
  geom_col(aes(fill=Var1))+
  labs(title = "Top 5 Netflix Titles' Category", x = NULL, y = "Category") +
  theme_minimal() +
  theme(legend.position = "none")
ggplotly(cat1_plot, tooltip = "text")

Most of the titles fall under “International Movies” category.

6. Analyze Netflix’s trend for the last 2 years

year <- netflix2[netflix2$release_year >= "2019",]

6.1 Where does most titles come from?

year_country <- as.data.frame(table(year$country))
year_country <- year_country[order(year_country$Freq, decreasing = T),]
yc_plot <- ggplot(year_country[1:5,],aes(y= reorder(Var1, Freq), x=Freq,
                                         text = glue("Country {Var1}
                                                      Frequency {Freq}"))) +
  geom_col(aes(fill = Var1)) +
  labs(x=NULL, y= NULL, title= "Top 5 Netflix Titles' Origin Country") +
  theme_minimal() +
  theme(legend.position = "none")
ggplotly(yc_plot, tooltip = "text")

Most titles came from The US

6.2 Is Netflix has increasingly focusing on TV rather than movies in recent years.

table(year$type)

## 
##   Movie TV Show 
##     925     801

Netflix still has more new movies than TV Shows, but considering the total distribution from the main data, Netflix seems to add more TV Shows in the last two years. In the main database, TV Shows only has a distribution of around 31%. While in the last two years, the distribution of New TV Shows are around 46%. We can conclude that the trend are starting to change, more TV Shows are added in the past two years, and if the trend continues, there might be more TV Shows added than Movies in Netflix.

6.3 What are the distribution of ratings?

year_rating <- as.data.frame(table(year$rating))
year_rating <- year_rating[order(year_rating$Freq, decreasing = T),]
yr_plot <- ggplot(year_rating[0:5,],aes(x = Freq, y = reorder(Var1, Freq),
                                                              text = glue("Rating {Var1}
                                                                          Frequency {Freq}"))) +
  geom_col(mapping = aes(fill = Freq), col= "red") +
    labs(x = "", y = NULL, 
       fill = NULL,
       title = "Distribution of Netflix New Titles' Rating (2019-2021)",
       subtitle = NULL) +
  theme_minimal()
ggplotly(yr_plot, tooltip = "text")

Again, the rating of Netflix’s movie are dominated by TV-MA

6.4 Where are most of titles’ listed in?

For this question we will use netflix_listedin, then subset it.

year2 <- netflix_listedin[netflix_listedin$release_year >= "2019",]
year2_plot <- as.data.frame(table(year2$listed_in.x))
year2_plot <- head(year2_plot[order(year2_plot$Freq, decreasing = T),], 5)
y2_plot <- ggplot(year2_plot,aes(y=reorder(Var1, Freq), x=Freq, 
                                 text = glue("Category {Var1}
                                                        Frequency {Freq}"))) +
  geom_col(aes(fill=Var1))+
  labs(title = "Top 5 Netflix Titles' Category 2019 - Jan 2021", x = NULL, y = "Category") +
  theme_minimal() +
  theme(legend.position = "none")
ggplotly(y2_plot, tooltip = "text")

Most of Netflix’s new titles are listed in International Movies. But, looking at the difference with the whole data, we can conclude that the other category are starting to catch up.

Business Summary & Insight

Business Summary

In the 12 years since January 1st 2008, Netflix has uploaded a total of 7186 titles from both Movies and TV Shows. (After drop, 7787 before drop.) Most of the show titles came from The US, followed by India and The UK. More than two-thirds of the titles in Netflix are Movies. More than half of Netflix’s titles are rated “TV-MA” and “TV-14”, therefore we can conclude that Netflix’s target audience are that of a more mature audience. The oldest release year of Netflix’s titles is from 1942, with 2021 as the newest.

The run time duration of Netflix’s Movies are averaged in 100 minutes. There are some outlier but after counting the CV (Coefficient of Variation) we can conclude that it is within the normal range. TV Shows on the other hand, has a big Variation. This can be explained by the number of seasons mixed together as one title. There were also a lack of exact data, since there is only a number of season. The run time data of TV Shows should consists of the number of episode and average run time of the show. It would also be better if we separate each season to its own titles as to decrease the variance in run time. Regarding the lack of exact data, I decided to compute each season to 908.5 minutes per season. This number are acquired by multiplying the average run time of US TV Shows (since most of Netflix’s show titles came from The US) and the average episode per season.

The biggest distribution of shows’ category are from International Movies, followed by Dramas and Comedies. Though in the recent years, there has been an increase of TV Dramas popularity in Netflix, while Comedies has fall short out of the top 5 category.

Although more than two-thirds of Netflix’s shows are Movies (about 69% to be exact), we found out that Netflix has added more TV Shows in the recent years. In the past two years, Newly added TV Shows represents 46% of the show types added to Netflix. From this finding we can conclude that the trend in Netflix has shifted more towards TV Shows during the last two years. If the trend continues, there might be more TV Shows added to Netflix in the future than Movies.

Recommendations

From our findings, we can say that if you are looking to partner up with Netflix to make a movie or TV show, it would be better to make a “TV-MA” or “TV-14” rating show. Since Netflix’s target audience are more likely to be that of a more mature audience.
The trend of show types added to Netflix seems to have shifted a little. Netflix seems to add more TV Shows in the recent years. This can be a new opportunity for movie makers.
From our findings we can also conclude that “International Movies”, “TV Dramas” and “Dramas” are the most popular category in Netflix. Therefore we would like movie makers to make Movies or TV Shows on those categories.

Exploration of Netflix Data

by Calvin Christopher

March 28 2021