I used a CSV file from Kaggle. In Project 2, I looked into which language accounts for the largest share of Netflix's shows, and which shows in the other languages have the highest IMDB scores. I also analyzed the data by creating line charts of the year and month in which Netflix’s shows were released, and looked at the genres of Netflix videos.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.6
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(dplyr)
library(tidyr)
Netflix <- read.csv("~/Downloads/NetflixOriginals.csv")
str(Netflix)
## 'data.frame': 584 obs. of 6 variables:
## $ Title : chr "Enter the Anime" "Dark Forces" "The App" "The Open House" ...
## $ Genre : chr "Documentary" "Thriller" "Science fiction/Drama" "Horror thriller" ...
## $ Premiere : chr "August 5, 2019" "August 21, 2020" "December 26, 2019" "January 19, 2018" ...
## $ Runtime : int 58 81 79 94 90 147 112 149 73 139 ...
## $ IMDB.Score: num 2.5 2.6 2.6 3.2 3.4 3.5 3.7 3.7 3.9 4.1 ...
## $ Language : chr "English/Japanese" "Spanish" "Italian" "English" ...
# Count the number of titles per language
NLang <- Netflix %>%
  group_by(Language) %>%
  summarise(count = n())
NLang %>%
  filter(count > 5) %>%
  ggplot(aes(count, Language, fill = Language)) +
  geom_col() +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Top Netflix Language", x = "Language Count", y = "Language") +
  theme(panel.background = element_rect(fill = "lightgrey", size = 0.5, linetype = "solid"),
        panel.grid.major = element_line(size = 0.5, linetype = 'solid', color = "white"),
        panel.grid.minor = element_line(size = 0.25, linetype = 'solid', color = "white"))
# Split the Language column into up to three separate language columns
SepLang <- Netflix %>%
  separate(Language, c('Language1', 'Language2', 'Language3'), sep = '/')
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 581 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
# Top 20 titles by IMDB score whose first listed language is not English
NotEng <- SepLang %>%
  filter(Language1 != "English") %>%
  arrange(desc(IMDB.Score)) %>%
  head(20)
NotEng %>%
  ggplot(aes(x = reorder(Title, IMDB.Score), y = IMDB.Score, fill = Language1)) +
  geom_col() +
  coord_flip() +
  theme(panel.background = element_rect(fill = "lightgrey", size = 0.5, linetype = "solid"),
        panel.grid.major = element_line(size = 0.5, linetype = 'solid', color = "white")) +
  labs(title = "Top 20 Netflix Originals (non English)", x = "Name of Shows", y = "IMDB Score", fill = "Language") +
  scale_fill_discrete()
mean(Netflix$IMDB.Score)
## [1] 6.271747
mean(NotEng$IMDB.Score)
## [1] 7.435
Netflix$dates <- as.Date(Netflix$Premiere, format = "%B %d, %Y")
head(Netflix$dates)
## [1] "2019-08-05" "2020-08-21" "2019-12-26" "2018-01-19" "2020-10-30"
## [6] "2019-11-01"
Netflix$Release <- mdy(Netflix$Premiere)
str(Netflix)
## 'data.frame': 584 obs. of 8 variables:
## $ Title : chr "Enter the Anime" "Dark Forces" "The App" "The Open House" ...
## $ Genre : chr "Documentary" "Thriller" "Science fiction/Drama" "Horror thriller" ...
## $ Premiere : chr "August 5, 2019" "August 21, 2020" "December 26, 2019" "January 19, 2018" ...
## $ Runtime : int 58 81 79 94 90 147 112 149 73 139 ...
## $ IMDB.Score: num 2.5 2.6 2.6 3.2 3.4 3.5 3.7 3.7 3.9 4.1 ...
## $ Language : chr "English/Japanese" "Spanish" "Italian" "English" ...
## $ dates : Date, format: "2019-08-05" "2020-08-21" ...
## $ Release : Date, format: "2019-08-05" "2020-08-21" ...
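# Format in UTC so the local time zone cannot shift a date into the previous month or year
Netflix$Month <- format(as.POSIXct(Netflix$Release), "%B", tz = "UTC") # Extract the month name
Netflix$Year <- format(as.POSIXct(Netflix$Release), "%Y", tz = "UTC") # Extract the year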
# Count the number of titles released per year
Year <- Netflix %>%
  group_by(Year) %>%
  summarise(count = n())
# Year is a character column, so x is discrete and a constant group is needed to connect the points
ggplot(data = Year, aes(x = Year, y = count)) +
  geom_point(size = 4, shape = "diamond", color = "pink") +
  geom_line(group = 1, color = "purple") +
  labs(title = "Number of Shows by Year in Netflix", x = "Year", y = "Number of Shows")
# Count the number of titles released per month
Month <- Netflix %>%
  group_by(Month) %>%
  summarise(count = n())
# Convert Month to a factor with levels from month.name so the axis runs January to December
ggplot(data = Month, aes(x = factor(Month, levels = month.name), y = count)) +
  geom_point(size = 4, shape = "diamond", color = "pink") +
  geom_line(group = 1, color = "purple") +
  labs(title = "Number of Shows by Month in Netflix", x = "Month", y = "Number of Shows")
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
genre <- Netflix %>%
  group_by(Genre) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(20)
## Selecting by count
pie <- plot_ly(genre, labels = ~Genre, values = ~count, type = "pie") %>%
  layout(title = "Proportion of Shows by Genre")
pie
I selected this dataset from Kaggle. In Project 2, I looked into which languages are represented in shows on Netflix, meaning their original languages. I also analyzed the data by creating line charts of the year and month in which Netflix’s shows were released, and looked at the genres of Netflix videos.
The dataset contains variables for genre, language, IMDB score, and release date. I spent most of my time on the dates, especially the months, and on the languages. To split the Language column, I needed the separate() function, and to split the Premiere column into month and year, I needed the as.POSIXct() function. After learning how to split these two columns, I had to study how to arrange the months from January to December rather than alphabetically. With scale_x_discrete(limits = month.name), the months were sorted in the order I wanted, but the line had a very ugly zigzag shape. Removing scale_x_discrete(limits = month.name) and changing the mapping to aes(x = factor(Month, levels = month.name)) gave me exactly the shape I wanted.
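The month ordering can be illustrated with a minimal base R sketch. The data frame month_counts and its numbers are made up for illustration and are not taken from the dataset; only factor() and the built-in month.name constant are shared with the analysis.

```r
# Hypothetical month counts (illustrative values, not from the Netflix data)
month_counts <- data.frame(
  Month = c("April", "October", "January", "December"),
  count = c(10, 12, 5, 7)
)

# As plain character strings, the months sort alphabetically
alphabetical <- sort(month_counts$Month)

# As a factor with levels taken from month.name, they sort in calendar order
month_counts$Month <- factor(month_counts$Month, levels = month.name)
calendar <- month_counts[order(month_counts$Month), ]

print(alphabetical)
print(as.character(calendar$Month))
```

Because the calendar order is stored in the factor levels, any plot that maps Month to the x axis should pick up the same order without an extra scale call.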
The most interesting part of studying this dataset was the monthly line chart. I spent almost a full day fixing this graph, and when the lines that had kept zigzagging were finally drawn properly, I was surprised to see that the most shows are released in April and October. This suggests a pattern in which Netflix releases many shows roughly every half year. Finally, the pie chart by genre also surprised me: I had no idea that documentaries were such a big part of Netflix’s shows.
NetflixOriginals.csv, Netflix Original Films & IMDB Scores as of 06/01/21, https://www.kaggle.com/luiscorter/netflix-original-films-imdb-scores