Overview

This project uses a CSV file of Netflix originals from Kaggle. In Project 2, I looked at which language accounts for the largest share of Netflix originals and which shows in other languages have the highest IMDB scores. I also created line charts of the number of shows released by year and by month, and looked at the genres of Netflix shows.

Set up the libraries

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.6
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(dplyr)
library(tidyr)

Load the dataset

Netflix <- read.csv("~/Downloads/NetflixOriginals.csv")
str(Netflix)
## 'data.frame':    584 obs. of  6 variables:
##  $ Title     : chr  "Enter the Anime" "Dark Forces" "The App" "The Open House" ...
##  $ Genre     : chr  "Documentary" "Thriller" "Science fiction/Drama" "Horror thriller" ...
##  $ Premiere  : chr  "August 5, 2019" "August 21, 2020" "December 26, 2019" "January 19, 2018" ...
##  $ Runtime   : int  58 81 79 94 90 147 112 149 73 139 ...
##  $ IMDB.Score: num  2.5 2.6 2.6 3.2 3.4 3.5 3.7 3.7 3.9 4.1 ...
##  $ Language  : chr  "English/Japanese" "Spanish" "Italian" "English" ...

Which languages are most common on Netflix?

# Count the number of titles per language
NLang <- Netflix %>% 
  group_by(Language) %>% 
  summarise(count = n())
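
As a cross-check on the per-language counts, the same tally can be sketched in base R with table(). The language labels below are made up for illustration and are not the real data; the name NLang_check is also hypothetical.

```r
# Hypothetical language labels, made up for illustration (not the real data).
langs <- c("English", "Spanish", "English", "Italian",
           "English/Japanese", "Spanish", "English")

# table() counts each distinct label; sort() puts the largest count first.
lang_counts <- sort(table(langs), decreasing = TRUE)

# Convert to a data frame with the column names used for NLang.
NLang_check <- as.data.frame(lang_counts, stringsAsFactors = FALSE)
names(NLang_check) <- c("Language", "count")
print(NLang_check)
```

Note that a combined label such as "English/Japanese" is counted as its own language here, just as it is in the grouped summary.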

NLang %>%
  filter(count > 5) %>%
  ggplot(aes(count, Language, fill = Language)) +
  geom_col() +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Top Netflix Language", x = "Language Count", y = "Language") +
  theme(panel.background = element_rect(fill = "lightgrey", size = 0.5, linetype = "solid"),
    panel.grid.major = element_line(size = 0.5, linetype = 'solid', color = "white"), 
    panel.grid.minor = element_line(size = 0.25, linetype = 'solid', color = "white")) 

It seems that most of the shows are made in English.

I also created a packed bubble chart in Tableau to compare the proportions of the languages other than English.

Since English is the outlier, let’s pick out the top 20 shows made in languages other than English.

SepLang <- Netflix %>%
  separate(Language, c('Language1','Language2','Language3'), sep='/')  # Split the Languages
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 581 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
NotEng <- SepLang %>%
  filter(Language1 != "English") %>%
  arrange(desc(IMDB.Score)) %>%
  head(20)
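
The filter above keeps a title whenever its first listed language is not English. The base R sketch below shows what that rule does to multi-language labels and compares it with a stricter rule that drops any title listing English at all. The first four labels come from the str() output above; "Spanish/English" is a hypothetical value added for illustration, and the variable names are assumptions. Separately, the "Expected 3 pieces" warning from separate() can be silenced with its fill = "right" argument, which pads missing pieces with NA on the right.

```r
# First four Language values from the str() output, plus one
# hypothetical value ("Spanish/English") added for illustration.
Language <- c("English/Japanese", "Spanish", "Italian", "English",
              "Spanish/English")

# Primary language: everything before the first "/".
primary <- sub("/.*", "", Language)

# Rule used in the report: the primary language is not English.
not_primary_english <- primary != "English"

# Stricter alternative: English appears nowhere in the label.
parts <- strsplit(Language, "/", fixed = TRUE)
no_english_at_all <- !vapply(parts, function(p) "English" %in% p, logical(1))

print(data.frame(Language, primary, not_primary_english, no_english_at_all))
```

Under the first rule the hypothetical "Spanish/English" title counts as non-English; under the stricter rule it does not. Which rule is appropriate depends on how "non-English" is meant to be read.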

NotEng %>%
  ggplot(aes(x = reorder(Title, IMDB.Score), y = IMDB.Score, fill = Language1)) +
  geom_col() +
  coord_flip() +
  theme(panel.background = element_rect(fill = "lightgrey", size = 0.5, linetype = "solid"),
    panel.grid.major = element_line(size = 0.5, linetype = 'solid', color = "white")) +
  labs(title = "Top 20 Netflix Originals (non English)", x = "Name of Shows", y = "IMDB Score", fill = "Language") +
  scale_fill_discrete()

The bar chart above shows the top 20 Netflix shows made in languages other than English.

Now I would like to compare the average IMDB score of all shows, including those in English, with the average IMDB score of the top 20 non-English shows.

mean(Netflix$IMDB.Score) 
## [1] 6.271747
mean(NotEng$IMDB.Score)
## [1] 7.435

The graph above shows the top 20 shows not made in English. The overall average rating of Netflix shows is 6.27 and the average rating of the shows in this list is 7.44, so these shows are also worth watching.

Now let’s find out which years and months have the most releases.
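
One caveat when reading this comparison: a list chosen by highest score will average above the overall mean almost by construction. A fuller picture could also include the average of all non-English shows. The sketch below illustrates the difference with a toy data frame; every value in it is made up for illustration and none comes from the dataset.

```r
# Toy data, made up for illustration only (not the Netflix dataset).
toy <- data.frame(
  Language1  = c("English", "English", "Spanish", "Spanish", "Korean", "Hindi"),
  IMDB.Score = c(6.0, 7.0, 5.0, 8.0, 7.5, 5.5)
)

# All non-English rows, not just the best-rated ones.
non_english <- toy[toy$Language1 != "English", ]

mean_all         <- mean(toy$IMDB.Score)
mean_non_english <- mean(non_english$IMDB.Score)

# Top 2 non-English rows by score, analogous to the top 20 in the report.
top2      <- head(non_english[order(-non_english$IMDB.Score), ], 2)
mean_top2 <- mean(top2$IMDB.Score)

print(c(all = mean_all, non_english = mean_non_english, top2 = mean_top2))
```

In this toy example the overall and non-English averages are equal, yet the top-2 average is higher, which shows that the gap comes from the selection rather than from the language.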

Reformat the Dates

Netflix$dates <- as.Date(Netflix$Premiere, format = "%B %d, %Y")
head(Netflix$dates)
## [1] "2019-08-05" "2020-08-21" "2019-12-26" "2018-01-19" "2020-10-30"
## [6] "2019-11-01"
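
The %B specifier matches month names in the session's LC_TIME locale, so the parsing above assumes an English locale. The sketch below is one way to make that assumption explicit and to count rows that fail to parse. It uses the first four Premiere values from the str() output; the variable names are assumptions.

```r
# First four Premiere values shown in the str() output.
premiere <- c("August 5, 2019", "August 21, 2020",
              "December 26, 2019", "January 19, 2018")

# Switch LC_TIME to "C" so %B matches English month names,
# then restore the previous setting afterwards.
old_locale <- Sys.getlocale("LC_TIME")
invisible(Sys.setlocale("LC_TIME", "C"))
dates <- as.Date(premiere, format = "%B %d, %Y")
invisible(Sys.setlocale("LC_TIME", old_locale))

# Values that do not match the format come back as NA.
n_failed <- sum(is.na(dates))
print(dates)
print(n_failed)
```

Running the NA count on the full Premiere column is a quick way to spot entries written in a slightly different format.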

Split the Month and Year.

Netflix$Release <- mdy(Netflix$Premiere)
str(Netflix)
## 'data.frame':    584 obs. of  8 variables:
##  $ Title     : chr  "Enter the Anime" "Dark Forces" "The App" "The Open House" ...
##  $ Genre     : chr  "Documentary" "Thriller" "Science fiction/Drama" "Horror thriller" ...
##  $ Premiere  : chr  "August 5, 2019" "August 21, 2020" "December 26, 2019" "January 19, 2018" ...
##  $ Runtime   : int  58 81 79 94 90 147 112 149 73 139 ...
##  $ IMDB.Score: num  2.5 2.6 2.6 3.2 3.4 3.5 3.7 3.7 3.9 4.1 ...
##  $ Language  : chr  "English/Japanese" "Spanish" "Italian" "English" ...
##  $ dates     : Date, format: "2019-08-05" "2020-08-21" ...
##  $ Release   : Date, format: "2019-08-05" "2020-08-21" ...
Netflix$Month <- format(as.POSIXct(Netflix$Release), "%B")  # Split the Months
Netflix$Year <- format(as.POSIXct(Netflix$Release), "%Y")   # Split the Years
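
Because Release is already a Date, the month and year can also be extracted without converting to POSIXct. Depending on the R version and the session's time zone, as.POSIXct() on a Date may be treated as midnight UTC and then formatted in local time, which can shift a date by one day. The sketch below is a base R alternative under that assumption. It looks up names in month.name, which is always English and matches the factor levels used later for plotting.

```r
# Four Release dates taken from the parsed output above.
release <- as.Date(c("2019-08-05", "2020-08-21", "2019-12-26", "2018-01-19"))

# format() works directly on Date objects, with no time zone involved.
year <- format(release, "%Y")

# Month number first, then an English month name via month.name,
# which does not depend on the session locale.
month_num <- as.integer(format(release, "%m"))
month     <- month.name[month_num]

print(data.frame(release, year, month))
```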

Number of Shows by Year in Netflix

# Count the number of titles per release year
Year <- Netflix %>% 
  group_by(Year) %>% 
  summarise(count = n())

ggplot(data = Year, aes(x = Year, y = count)) +
     geom_point(size = 4, shape = "diamond", color = "pink") +
     geom_line(group = 4, color = "purple") +
     labs(title = "Number of Shows by Year in Netflix", x = "Year", y = "Number of Shows")

The number of Netflix shows rises year after year. The CSV file I downloaded was compiled on 6/1/21, so the 2021 count covers only part of the year and is expected to end up well above the 2020 count by the end of this year.

Number of Shows by Month in Netflix

# Count the number of titles per release month
Month <- Netflix %>% 
  group_by(Month) %>%
  summarise(count = n())

ggplot(data = Month, aes(x = factor(Month, levels = month.name), y = count)) +
     geom_point(size = 4, shape = "diamond", color = "pink") +
     geom_line(group=4, color = "purple") +
     labs(title = "Number of Shows by Month in Netflix", x = "Month", y = "Number of Shows")

Netflix releases the most shows in April and October and the fewest in January and July. The pattern is not perfect, but it looks roughly symmetric.

Pie chart of the 20 most common genres on Netflix
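
The calendar ordering on the x axis comes from turning Month into a factor whose levels follow month.name. The sketch below shows the difference between alphabetical and calendar order; the counts are made up for illustration and the variable names are assumptions.

```r
# Hypothetical monthly counts, made up for illustration.
monthly <- data.frame(
  Month = c("April", "December", "January", "October"),
  count = c(3, 1, 2, 4)
)

# A plain character column sorts alphabetically.
alphabetical <- sort(monthly$Month)

# A factor with levels = month.name sorts in calendar order instead.
monthly$Month <- factor(monthly$Month, levels = month.name)
calendar <- monthly[order(monthly$Month), ]

print(alphabetical)
print(as.character(calendar$Month))
```

ggplot2 places a factor on a discrete axis in the order of its levels, which is why the factor() call in aes() puts January first and December last.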

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
genre <- Netflix %>% 
  group_by(Genre) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>% 
  top_n(20)
## Selecting by count
pie <- plot_ly(genre, labels = ~Genre, values = ~count, type = "pie") %>%
  layout(title = "Proportion of Shows by Genre")
pie

Documentaries account for the largest share of Netflix shows, followed by dramas, comedies, romantic comedies, and thrillers.
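
One thing to keep in mind is that compound labels such as "Science fiction/Drama" are counted as their own genre in the pie chart. As a rough sketch of an alternative (not the method used above), the labels could be split on "/" so that each component is counted separately. The example uses the first four Genre values from the str() output; the variable names are assumptions.

```r
# First four Genre values shown in the str() output.
genre <- c("Documentary", "Thriller", "Science fiction/Drama", "Horror thriller")

# Split compound labels on "/" so each component is counted on its own.
components <- unlist(strsplit(genre, "/", fixed = TRUE))

# Count components and convert the counts to shares of the total.
component_counts <- sort(table(components), decreasing = TRUE)
shares <- round(prop.table(component_counts), 2)
print(shares)
```

With this approach a title can contribute to more than one genre, so the shares describe genre mentions rather than titles.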

Essay

I selected this dataset from Kaggle. In Project 2, I looked into which languages are represented in shows on Netflix, meaning their original languages. I also analyzed the data by creating line charts of the year and month in which Netflix’s shows were released, and looked at the genres of Netflix shows.

The dataset includes variables for Genre, Language, IMDB.Score, and premiere date. I spent the most time on dates, especially months, and on languages. To split the Language column I used the separate() command, and to split the Premiere column I used the as.POSIXct() command. After learning how to separate these columns, I had to work out how to arrange the months from January to December rather than alphabetically. With scale_x_discrete(limits = month.name), the months were sorted in the order I wanted, but the line had a very ugly zigzag shape. Removing scale_x_discrete(limits = month.name) and changing the mapping to aes(x = factor(Month, levels = month.name)) gave me the shape I wanted.

The most interesting part of studying this dataset was the monthly line chart. I spent almost a full day fixing this graph, and when the lines that had kept zigzagging were finally drawn properly, I was surprised to see that the most shows are released in April and October. The pattern suggests that releases peak about every half year. Finally, the pie chart by genre surprised me: I had no idea that documentaries were such a big part of Netflix’s shows.

Work Cited

NetflixOriginals.csv, Netflix Original Films & IMDB Scores as of 06/01/21, https://www.kaggle.com/luiscorter/netflix-original-films-imdb-scores