This project explores data on Netflix Original films. You can find the data here, and an interactive dashboard that I built with this dataset here. The dataset covers all Netflix Original films released as of June 1st, 2021, including Netflix documentaries and specials.
The dataset includes six variables: Title, Genre, Premiere, Language, Runtime and IMDB Score.
The main goal of this project is to get a first idea of which variables are related to the IMDB rating. Since the dataset is comprehensive (all Netflix Originals are included) and predicting future IMDB ratings is not part of this project, only descriptive statistics are used.
library(data.table)
library(here)
library(janitor)
library(tidyverse)
library(skimr)
library(yarrr)
library(lubridate)
library(rstudioapi)
library(effsize)
The data was downloaded from Kaggle (login required) and stored locally. We first take a look at the dataset:
#get data
netflix <- fread(here("Data", "NetflixOriginals.csv"))
skim(netflix)
| Name | netflix |
| Number of rows | 584 |
| Number of columns | 6 |
| Key | NULL |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Title | 0 | 1 | 2 | 106 | 0 | 584 | 0 |
| Genre | 0 | 1 | 3 | 36 | 0 | 115 | 0 |
| Premiere | 0 | 1 | 11 | 18 | 0 | 390 | 0 |
| Language | 0 | 1 | 4 | 26 | 0 | 38 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Runtime | 0 | 1 | 93.58 | 27.76 | 4.0 | 86.0 | 97.00 | 108 | 209 | ▁▂▇▁▁ |
| IMDB Score | 0 | 1 | 6.27 | 0.98 | 2.5 | 5.7 | 6.35 | 7 | 9 | ▁▂▇▇▁ |
The dataset has no missing values. The only two numeric variables are IMDB Score and Runtime. Further analyses focus mainly on IMDB Score.
Basic preprocessing was done to clean the variable names and prepare the relevant variables for further analyses.
# Clean variable names
netflix <- clean_names(netflix)
# Turn genre into lowercase
netflix$genre <- tolower(netflix$genre)
# have a look at unique genres
head(netflix[,.(count = .N),by =genre][order(-count)], 20)
## genre count
## 1: documentary 159
## 2: drama 77
## 3: comedy 49
## 4: romantic comedy 39
## 5: thriller 33
## 6: comedy-drama 14
## 7: crime drama 11
## 8: horror 9
## 9: biopic 9
## 10: action 7
## 11: aftershow / interview 6
## 12: romance 6
## 13: concert film 6
## 14: action comedy 5
## 15: romantic drama 5
## 16: animation 5
## 17: variety show 5
## 18: science fiction 4
## 19: psychological thriller 4
## 20: science fiction/thriller 4
# preprocess genre variable with the goal to get more entries in more common genres
netflix[,`:=` (genre = gsub(".*action.*", "action", genre)),]
netflix[,`:=` (genre = gsub(".*horror.*", "horror", genre)),]
netflix[,`:=` (genre = gsub(".*comedy.*", "comedy", genre)),]
netflix[,`:=` (genre = gsub(".*drama.*", "drama", genre)),]
netflix[,`:=` (genre = gsub(".*thriller.*", "thriller", genre)),]
netflix[,`:=` (genre = gsub(".*animation.*", "animation", genre)),]
netflix[,`:=` (genre = gsub(".*anime.*", "animation", genre)),]
netflix[,`:=` (genre = gsub(".*musical.*", "musical", genre)),]
netflix[, genre_count := .(.N), by = genre]
netflix[genre_count < 11, genre := "other"]
netflix <- netflix[order(-imdb_score),]
netflix$genre_count <- NULL
#save data for later use in shiny app
saveRDS(netflix, here("Shiny_App", "netflix_data.Rds"))
#look closer at distribution of IMDB Score
ggplot(data = netflix, aes(x = imdb_score)) +
geom_histogram(bins = 40) +
theme_classic()
The distribution of IMDB Score looks approximately normal, but there also seem to be slight deviations from normality. Since effect sizes are used in further analyses, we take a closer look at the assumption of normality with a QQ plot.
# QQ plot of IMDB Score to check the normality assumption
ggplot(netflix, aes(sample = imdb_score)) +
stat_qq() + stat_qq_line() +
theme_classic()
Since the points fall approximately along the reference line, the QQ plot gives further evidence of approximate normality.
We start the analysis by looking at the relationship between IMDB Score and genre. Only the most common genres (more than 20 films) are used for this purpose.
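A visual check like the QQ plot can be complemented with a formal test. The following is only a sketch under stated assumptions: it runs the Shapiro-Wilk test from base R on simulated scores (the vector `simulated_scores` is hypothetical and merely borrows the sample size, mean and standard deviation from the summary table above), not on the actual dataset.

```r
# Simulated stand-in for imdb_score (hypothetical data, for illustration only)
set.seed(1)
simulated_scores <- rnorm(584, mean = 6.27, sd = 0.98)

# Shapiro-Wilk test: the null hypothesis is that the sample comes from
# a normal distribution
normality_check <- shapiro.test(simulated_scores)
normality_check$statistic  # W statistic, values close to 1 indicate normality
normality_check$p.value    # small p-values indicate deviations from normality
```

With several hundred observations, such a test can flag deviations that are too small to matter in practice, so it is best read together with the QQ plot rather than on its own.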
# subset of data with most used genres
netflix_genres <- setorder(netflix[,.(.N), by = .(genre)], -N)
popular_genres <- netflix_genres[N>20,genre]
netflix_subset <- netflix[genre %in% popular_genres]
#look at imdb score per genre
pirateplot(imdb_score ~ genre,
data = netflix_subset,
gl.col = "white")
Among the common Netflix genres, documentaries have the highest average IMDB Score. The lowest average is found in the horror genre. The average IMDB Scores of thriller, action and comedy films are also quite low compared with Netflix documentaries. To get further information about the magnitude of these differences, Cohen’s d is used as an effect size.
# Compute effect sizes
cohen.d(netflix[genre == "documentary", imdb_score], netflix[genre == "action", imdb_score])
##
## Cohen's d
##
## d estimate: 1.377766 (large)
## 95 percent confidence interval:
## lower upper
## 0.8976975 1.8578339
cohen.d(netflix[genre == "documentary", imdb_score], netflix[genre == "horror", imdb_score])
##
## Cohen's d
##
## d estimate: 1.751763 (large)
## 95 percent confidence interval:
## lower upper
## 1.267515 2.236011
cohen.d(netflix[genre == "documentary", imdb_score], netflix[genre == "drama", imdb_score])
##
## Cohen's d
##
## d estimate: 0.7689461 (medium)
## 95 percent confidence interval:
## lower upper
## 0.5148376 1.0230546
cohen.d(netflix[genre == "horror", imdb_score], netflix[genre == "drama", imdb_score])
##
## Cohen's d
##
## d estimate: -0.9677064 (large)
## 95 percent confidence interval:
## lower upper
## -1.4455280 -0.4898848
In line with the first impression, the average IMDB Score of documentaries is considerably higher than that of other genres such as horror or action. Even when comparing documentaries with the second-highest-rated genre (drama), we find a medium to large effect size (Cohen’s d = 0.77). Viewers therefore seem to be especially satisfied with the quality of Netflix documentaries. Further analysis could look at possible reasons for the high perceived quality of Netflix documentaries. One possibility is that other film studios simply do not allocate much money to the creation of documentaries.
In the next step, we look at the relationship between runtime and IMDB Score.
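To make the effect sizes above easier to interpret, here is a minimal sketch of how Cohen's d can be computed by hand for two independent groups. It assumes the pooled-standard-deviation variant without Hedges' correction, which to my understanding is the default of `effsize::cohen.d`. The function name `cohens_d_manual` and the two score vectors are hypothetical and not part of the original analysis.

```r
# Cohen's d for two independent samples, based on the pooled standard deviation
cohens_d_manual <- function(x, y) {
  n_x <- length(x)
  n_y <- length(y)
  # Pooled SD: variances weighted by their degrees of freedom
  pooled_sd <- sqrt(((n_x - 1) * var(x) + (n_y - 1) * var(y)) / (n_x + n_y - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Hypothetical scores, made up for illustration (not taken from the dataset)
group_a <- c(7.1, 6.8, 7.4, 6.9, 7.3)
group_b <- c(6.2, 5.9, 6.5, 6.0, 6.4)
cohens_d_manual(group_a, group_b)
```

A positive value means the first group has the higher mean, expressed in units of the pooled standard deviation. The labels in the output above follow the usual convention of roughly 0.2 for a small, 0.5 for a medium and 0.8 for a large effect.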
with(netflix, plot(runtime, imdb_score))
abline(lm(imdb_score ~ runtime, data = netflix), col = "blue")
At first glance, there is no systematic pattern in the scatterplot, and the regression line is accordingly rather flat. Since the Pearson correlation coefficient is itself an effect size, we look at its absolute size. Based on the plot, we expect the coefficient to be quite small.
with(netflix, cor(runtime, imdb_score))
## [1] -0.04089629
The correlation coefficient is -0.04 and therefore very small. There is no meaningful linear relationship between runtime and IMDB Score.
Another variable that could plausibly be related to IMDB Score is language. English-language movies might be rated higher, since the movie industry is most developed in the United States. We take a closer look at this assumption by comparing the average IMDB Scores of the five most common languages.
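To put the size of this coefficient into perspective, it can be squared, which gives the share of variance in one variable that a linear relationship with the other would account for. A minimal sketch using the coefficient reported above (the variable names are my own):

```r
# Pearson correlation between runtime and IMDB Score, as reported above
r <- -0.04089629

# Squared correlation: share of variance accounted for by a linear relationship
r_squared <- r^2
round(100 * r_squared, 2)  # expressed as a percentage
```

This comes to roughly 0.17 percent, so runtime accounts for practically none of the variation in IMDB Score.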
popular_languages <- setorder(netflix[,.(.N),by = language], -N)[1:5,language]
netflix_lan <- netflix[language %in% popular_languages]
#have a look at imdb_score by language
pirateplot(formula = imdb_score ~ language,
data = netflix_lan,
gl.col = "white")
English movies are indeed the highest-rated among the most common languages, but Spanish movies are rated quite similarly. Italian, French and Hindi movies are rated lower than English and Spanish movies. Effect sizes are used again to get further information about the magnitude of the differences found.
cohen.d(netflix_lan[language == "English", imdb_score], netflix_lan[language == "Spanish", imdb_score])
##
## Cohen's d
##
## d estimate: 0.08315353 (negligible)
## 95 percent confidence interval:
## lower upper
## -0.2832934 0.4496004
cohen.d(netflix_lan[language == "English", imdb_score], netflix_lan[language == "Italian", imdb_score])
##
## Cohen's d
##
## d estimate: 0.909821 (large)
## 95 percent confidence interval:
## lower upper
## 0.3717736 1.4478684
cohen.d(netflix_lan[language == "English", imdb_score], netflix_lan[language == "Hindi", imdb_score])
##
## Cohen's d
##
## d estimate: 0.4297891 (small)
## 95 percent confidence interval:
## lower upper
## 0.07269172 0.78688654
cohen.d(netflix_lan[language == "Hindi", imdb_score], netflix_lan[language == "Italian", imdb_score])
##
## Cohen's d
##
## d estimate: 0.3835436 (small)
## 95 percent confidence interval:
## lower upper
## -0.2637848 1.0308720
We get a negligible Cohen’s d when comparing the IMDB Scores of English and Spanish movies. The difference between English and Italian movies can be described as large, whereas the difference between Hindi and Italian movies is small (d = 0.38), with a confidence interval that includes zero. In summary, there are meaningful differences in the perceived quality of movies in different languages.
In the last section, we look at how IMDB Scores evolved over time. We start with the year 2016, since only one movie was published in 2014 and only nine movies were published in 2015.
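The confidence intervals reported above carry additional information: when an interval includes zero, the data are also compatible with no difference at all. As a rough sketch, the reported estimates can be collected in a small table and flagged accordingly. The numbers are copied from the output above; the object and column names are my own.

```r
# Effect sizes and 95% confidence intervals, copied from the output above
language_effects <- data.frame(
  comparison = c("English vs Spanish", "English vs Italian",
                 "English vs Hindi", "Hindi vs Italian"),
  d     = c(0.08315353, 0.909821, 0.4297891, 0.3835436),
  lower = c(-0.2832934, 0.3717736, 0.07269172, -0.2637848),
  upper = c(0.4496004, 1.4478684, 0.78688654, 1.0308720)
)

# Flag comparisons whose interval lies entirely on one side of zero
language_effects$ci_excludes_zero <-
  language_effects$lower > 0 | language_effects$upper < 0
language_effects
```

Only the comparisons of English with Italian and of English with Hindi have intervals that exclude zero, so the other two differences should be interpreted with caution.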
netflix[,premiere := mdy(premiere)]
netflix[,year := year(premiere)]
netflix_year <- setorder(netflix[,.(.N),by = year], year)
netflix_year
## year N
## 1: 2014 1
## 2: 2015 9
## 3: 2016 30
## 4: 2017 66
## 5: 2018 99
## 6: 2019 125
## 7: 2020 183
## 8: 2021 71
netflix_year <- netflix_year[N>29,year]
netflix_subset3 <- netflix[year %in% netflix_year]
pirateplot(imdb_score ~ year,
data = netflix_subset3,
gl.col = "white")
Surprisingly, IMDB Scores show a steady negative trend over time. At the same time, the number of published movies increases steadily (except for 2021, where the data only cover the first months of the year). One possible reason could be that Netflix is changing its strategy towards publishing more movies in many languages, with less money allocated to each production. Unfortunately, this explanation cannot be analyzed further with the data at hand, since the dataset does not cover the amount spent on each movie production.
Another reason for the decrease in IMDB ratings could be that fewer documentaries were published in more recent years, since documentaries are the best-rated genre. We can take a closer look at whether this is actually the case:
# see if the share of documentaries differs over the years, since these are rated best
netflix_subset3[order(year),.(per_documentaries = sum(genre=="documentary")/.N),by=year]
## year per_documentaries
## 1: 2016 0.4000000
## 2: 2017 0.3333333
## 3: 2018 0.2525253
## 4: 2019 0.3200000
## 5: 2020 0.2131148
## 6: 2021 0.1971831
In line with the reasoning above, the share of documentaries among published movies is decreasing in more recent years (from 40% in 2016 to about 20% in 2021). Therefore, the question arises whether we still see the decrease in IMDB Scores over the years if we exclude documentaries.
#see changes over years without documentaries
pirateplot(imdb_score ~ year, theme = 1,
data = netflix_subset3[genre != "documentary"],
gl.col = "white")
Indeed, the decreasing trend over the years is less clear after excluding documentaries. This gives first evidence that the lower share of documentaries in more recent years contributes to the slightly negative trend in IMDB Scores over time.
The main goal of this project was to get an overview of which variables are related to the IMDB Scores of Netflix Original films published before June 2021. We found large differences in the IMDB ratings of different genres, with documentaries reaching the highest average IMDB Score among the common genres. Aside from genre, there are also meaningful differences in IMDB Scores depending on the language of the film, with the highest average scores for English and Spanish films. The relationship between runtime and IMDB rating, on the other hand, is negligible. Regarding time, we see a steady decrease in average IMDB ratings in more recent years, which is partly explained by the shrinking share of documentaries.
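The mechanism behind this can be illustrated with a small toy example. The data below are entirely made up (the data frame `toy` and its values are hypothetical): within each genre the scores stay the same in both years, and only the share of documentaries drops from 40% to 20%.

```r
# Toy data (hypothetical): genre scores are constant, only the genre mix changes
toy <- data.frame(
  year  = rep(c(2016, 2021), each = 10),
  genre = c(rep("documentary", 4), rep("other", 6),
            rep("documentary", 2), rep("other", 8)),
  imdb_score = c(rep(7, 4), rep(6, 6), rep(7, 2), rep(6, 8))
)

# Average score per year over all films
overall_mean <- tapply(toy$imdb_score, toy$year, mean)

# Average score per year after excluding documentaries
toy_without_docs <- subset(toy, genre != "documentary")
without_docs <- tapply(toy_without_docs$imdb_score, toy_without_docs$year, mean)

overall_mean
without_docs
```

In this toy example the overall average falls from 6.4 to 6.2 although no genre is rated lower than before, while the average without documentaries stays at 6. A shrinking share of a highly rated genre is therefore enough to produce a downward trend in the overall average.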