The topic of my project is the top 80 most watched original Netflix shows. I chose this data set partly because I was not sure at first what I wanted this project to be about, but I ended up finding it interesting. I found the data set on Kaggle; it was built using IMDb as the source. There are a couple of shows on this list that I have watched and particularly enjoyed. I will be using most of the variables in this data set. First, I will rename the two variables that describe the rank and title of the show. Then there is the variable runtime, from which I will remove the character string ‘min’ to make it quantitative. I will split genre into three new categorical variables: a main, second and third genre. Finally, I will be using rating and votes, which are self-explanatory quantitative variables.
Loading everything in
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggfortify)
library(highcharter)
Warning: package 'highcharter' was built under R version 4.3.3
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Rows: 80 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): lister-item-header, certificate, runtime, genre
dbl (2): lister-item-index, rating
num (1): votes
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 × 6
`lister-item-index` `lister-item-header` runtime genre rating votes
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1 Stranger Things 60 min Drama, Fa… 8.7 1.33e6
2 2 13 Reasons Why 60 min Drama, My… 7.5 3.14e5
3 3 Orange Is the New Black 59 min Comedy, C… 8 3.19e5
4 4 Black Mirror 60 min Drama, My… 8.7 6.36e5
5 5 Money Heist 60 min Action, C… 8.2 5.29e5
6 6 Lucifer 4,393 min Crime, Dr… 8.1 3.54e5
Tidying variables
## cleaning the names
top_netflix2 <- top_netflix |>
  rename("title" = `lister-item-header`, "rank" = `lister-item-index`) |>
  ## removing 'min' from runtime to make the variable a double
  mutate(runtime = parse_number(runtime))
head(top_netflix2)
# A tibble: 6 × 6
rank title runtime genre rating votes
<dbl> <chr> <dbl> <chr> <dbl> <dbl>
1 1 Stranger Things 60 Drama, Fantasy, Horror 8.7 1327188
2 2 13 Reasons Why 60 Drama, Mystery, Thriller 7.5 314321
3 3 Orange Is the New Black 59 Comedy, Crime, Drama 8 319342
4 4 Black Mirror 60 Drama, Mystery, Sci-Fi 8.7 636319
5 5 Money Heist 60 Action, Crime, Drama 8.2 529086
6 6 Lucifer 4393 Crime, Drama, Fantasy 8.1 354155
For my main visualization I will probably filter out the very long run times (such as Lucifer's 4393) so they don’t skew my plot.
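The step that splits genre into three columns is not shown in this write-up. A minimal sketch of how it could be done, assuming `tidyr::separate()` on the comma and using three rows copied from the table above (the names `genre_demo` and `genre_split` are made up for illustration, and the trimming step is optional):

```r
library(dplyr)
library(tidyr)
library(stringr)

# Three rows copied from the cleaned table above (illustrative subset only)
genre_demo <- tibble(
  title = c("Stranger Things", "13 Reasons Why", "Money Heist"),
  genre = c("Drama, Fantasy, Horror",
            "Drama, Mystery, Thriller",
            "Action, Crime, Drama")
)

genre_split <- genre_demo |>
  # split on commas; shows with fewer than three genres get NA on the right
  separate(genre,
           into = c("main_genre", "genre_two", "genre_three"),
           sep = ",", fill = "right") |>
  # strip the leading space left behind by the ", " separator
  mutate(across(c(main_genre, genre_two, genre_three), str_trim))

genre_split
```

Without the `str_trim()` step the second and third genres keep a leading space (for example " Fantasy"), which can make otherwise identical genres look different in a legend.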
Genre plots
Now let's take a look at the genre composition of these shows.
ggplot(top_netflix3) +
  geom_bar(aes(x = main_genre, fill = genre_two)) +
  coord_flip() +
  labs(title = "Genres of the most watched Netflix Originals",
       x = "Main Genre",
       y = "Number",
       fill = "Secondary Genre",
       caption = "Source: https://www.imdb.com/list/ls049223775/")
Although the most common main genre is Comedy, most of the shows seem to have Drama as a secondary or third genre, especially those whose main genre is Crime. Lastly, I will explore votes and rating through linear regression.
Scatter plot to precede linear regression
In my linear regression model I am going to explore whether there is a linear relationship between rating and votes.
options(scipen = 999)
ggplot(top_netflix3, aes(x = rating, y = votes)) +
  geom_point()
Let's find out what the outlier is
top_netflix3 |> slice_max(votes)
# A tibble: 1 × 8
rank title runtime main_genre genre_two genre_three rating votes
<dbl> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1 Stranger Things 60 Drama " Fantasy" " Horror" 8.7 1327188
Stranger Things could be skewing the plot.
Removing Stranger Things to see if a truer relationship exists
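The code for this step is not shown. A minimal sketch of one way to drop the outlier and redraw the scatter plot, using a small subset of the rows shown earlier (the names `demo` and `demo_no_outlier` are made up for illustration; in the analysis itself the result is presumably the `netflix` data frame used below):

```r
library(dplyr)
library(ggplot2)

# Small illustrative subset of the table shown earlier, not the full data
demo <- tibble(
  title  = c("Stranger Things", "13 Reasons Why", "Orange Is the New Black",
             "Black Mirror", "Money Heist", "Lucifer"),
  rating = c(8.7, 7.5, 8.0, 8.7, 8.2, 8.1),
  votes  = c(1327188, 314321, 319342, 636319, 529086, 354155)
)

# drop the single show with by far the most votes
demo_no_outlier <- demo |>
  filter(title != "Stranger Things")

# same scatter plot as before, now without the outlier
ggplot(demo_no_outlier, aes(x = rating, y = votes)) +
  geom_point()
```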
This plot shows that there might be a moderate correlation between rating and votes. Now let's find the r value.
Linear regression
cor(netflix$rating, netflix$votes)
[1] 0.605176
cor(top_netflix3$rating, top_netflix3$votes)
[1] 0.5474993
Even after removing the outlier Stranger Things, which has over a million votes, the correlation has not changed much (about 0.61 without it versus about 0.55 with it).
For now I will proceed with the stronger correlation value, which comes from the data without Stranger Things.
fit1 <- lm(rating ~ votes, data = netflix)
summary(fit1)
Call:
lm(formula = rating ~ votes, data = netflix)
Residuals:
Min 1Q Median 3Q Max
-1.42544 -0.38986 0.06732 0.45466 1.13589
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.2966002082 0.0887210926 82.242 < 0.0000000000000002 ***
votes 0.0000029114 0.0000004364 6.671 0.00000000347 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5674 on 77 degrees of freedom
Multiple R-squared: 0.3662, Adjusted R-squared: 0.358
F-statistic: 44.5 on 1 and 77 DF, p-value: 0.000000003471
With a moderate correlation of about 0.6, the model yielded very low p-values, which suggests that the linear relationship between votes and rating is statistically significant. However, the R-squared value shows that only a little over a third of the variation in rating can be explained by the regression model.
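As a quick check on how the correlation and the R-squared value relate: in a regression with a single predictor, R-squared is the square of the correlation coefficient. A small sketch using the r value printed above:

```r
r <- 0.605176        # correlation reported above for the `netflix` data
r_squared <- r^2     # with one predictor, R-squared equals r squared
round(r_squared, 4)  # matches the Multiple R-squared of 0.3662 in the summary
```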
Regression Model
rating = 2.911e-06(votes) + 7.297
This is the equation of the model. Next I will look into the diagnostic plots.
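To make the equation concrete, here is a small sketch that plugs a vote count into it. The helper `predict_rating()` is made up for illustration and simply reuses the coefficients printed by `summary(fit1)`; with the fitted model available, `predict(fit1, newdata = data.frame(votes = 300000))` should give the same answer.

```r
# coefficients copied from the summary(fit1) output above
intercept <- 7.2966002082
slope     <- 0.0000029114

# hypothetical helper: predicted rating for a given number of votes
predict_rating <- function(votes) intercept + slope * votes

predict_rating(300000)  # about 8.17
```

In other words, the model predicts that every additional 100,000 votes goes along with a rating that is about 0.29 points higher.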
Diagnostic Plots
autoplot(fit1, 1:4, nrow=2, ncol=2)
The residuals vs. fitted plot is messy: the line is not straight, and the points lack balance as well as randomness. This suggests that a linear model might not be appropriate for describing the relationship between votes and rating. On the other hand, the Normal Q-Q plot looks fairly good, even though the points stray somewhat from the line at the right end. Overall, there are mixed signals about whether a linear model is appropriate for describing the relationship between rating and votes.
Plotting
Getting colors and filtering out abnormal run times for the graph
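The code for this step is not shown. A minimal sketch of the filtering part only, assuming a simple cutoff on runtime (the 120 minute threshold and the names `demo`, `max_runtime` and `demo_filtered` are assumptions for illustration, not values taken from the original analysis):

```r
library(dplyr)

# Three rows from the cleaned table above (illustrative subset only)
demo <- tibble(
  title   = c("Stranger Things", "Orange Is the New Black", "Lucifer"),
  runtime = c(60, 59, 4393)
)

max_runtime <- 120  # hypothetical cutoff in minutes

# keep only shows whose listed runtime looks like a single episode
demo_filtered <- demo |>
  filter(runtime <= max_runtime)

demo_filtered
```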
As expected, the graph generally shows that the ranking of drama shows has a direct relationship with rating and number of votes: the higher a show is ranked, the higher its rating and vote count tend to be. The same can generally be said about action and adventure shows. Additionally, since the secondary genre of most crime shows is drama, it is to be expected that crime shows follow a pattern similar to that of drama shows. Interestingly, the votes and ratings of comedy shows seem to be much more spread out and random with respect to rank than those of some of the other genres. The same could be said about animation shows. Finally, taken as a whole and including some of the one-off genres, the top 80 shows have a somewhat direct relationship between rank, votes and rating, similar to what was suggested by the regression model.
If I had more time, I would have explored making my own theme or finding an interesting pre-built theme for highcharter. Perhaps for the last project I will try some more advanced visualizations and regression modeling.
Background Research
The article referenced below describes how Stranger Things season 4, from the rank 1 show in this data set, broke multiple Netflix streaming records while shooting to the top when it was released. The article explains some of the metrics used to rank shows, including viewing hours and minutes streamed. Lastly, the article mentions some of Netflix’s other hits, including some from this data set such as Ozark.