Project 2

Author

Andrew George

Intro

The topic of my project is the top 80 most watched original Netflix shows. I chose this data set partly because I was not sure at first what I wanted this project to be about, but I ended up finding it interesting. I found the data set on Kaggle, and its source is IMDb. There are a couple of shows on this list that I have watched and particularly enjoyed. I will be using most of the variables in this data set. First, I will rename the two variables that hold the rank and title of each show. Then I will strip the character string ‘min’ from the runtime variable to make it quantitative. I will split genre into three new categorical variables: a main, second, and third genre. Finally, I will use rating and votes, which are self-explanatory quantitative variables.

Loading everything in

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggfortify)
library(highcharter)

Warning: package 'highcharter' was built under R version 4.3.3

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

setwd("C:/Users/andre/Downloads/Data 110")
top_netflixtv <- read_csv("imdb.csv")

Rows: 80 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): lister-item-header, certificate, runtime, genre
dbl (2): lister-item-index, rating
num (1): votes

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

top_netflix <- top_netflixtv |>
  select(-certificate)
head(top_netflix)

# A tibble: 6 × 6
  `lister-item-index` `lister-item-header`    runtime   genre      rating  votes
                <dbl> <chr>                   <chr>     <chr>       <dbl>  <dbl>
1                   1 Stranger Things         60 min    Drama, Fa…    8.7 1.33e6
2                   2 13 Reasons Why          60 min    Drama, My…    7.5 3.14e5
3                   3 Orange Is the New Black 59 min    Comedy, C…    8   3.19e5
4                   4 Black Mirror            60 min    Drama, My…    8.7 6.36e5
5                   5 Money Heist             60 min    Action, C…    8.2 5.29e5
6                   6 Lucifer                 4,393 min Crime, Dr…    8.1 3.54e5

Tidying variables

## cleaning the names
top_netflix2 <- top_netflix |>
  rename("title" = `lister-item-header`,
         "rank" = `lister-item-index`) |>
## removing 'min' from runtime to make the variable a double
  mutate(
    runtime = parse_number(runtime)
  )
head(top_netflix2)

# A tibble: 6 × 6
   rank title                   runtime genre                    rating   votes
  <dbl> <chr>                     <dbl> <chr>                     <dbl>   <dbl>
1     1 Stranger Things              60 Drama, Fantasy, Horror      8.7 1327188
2     2 13 Reasons Why               60 Drama, Mystery, Thriller    7.5  314321
3     3 Orange Is the New Black      59 Comedy, Crime, Drama        8    319342
4     4 Black Mirror                 60 Drama, Mystery, Sci-Fi      8.7  636319
5     5 Money Heist                  60 Action, Crime, Drama        8.2  529086
6     6 Lucifer                    4393 Crime, Drama, Fantasy       8.1  354155
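
The Lucifer row above is worth a second look: `4,393 min` is parsed as 4393, which is presumably not a per-episode runtime. The sketch below repeats the cleaning step in base R on the six runtime strings printed above and flags values over 250 minutes, the same threshold used later when filtering for the final graph. The names `runtime_raw`, `runtime_min`, and `suspicious` are illustrative assumptions, and `gsub()` is only a stand-in for `parse_number()` on these particular values.

```r
# Runtime strings exactly as they appear in the first six rows above
runtime_raw <- c("60 min", "60 min", "59 min", "60 min", "60 min", "4,393 min")

# Drop everything except digits, then convert to numeric
# (a base-R stand-in for readr::parse_number() on these values)
runtime_min <- as.numeric(gsub("[^0-9]", "", runtime_raw))

# Flag run times above 250 minutes as worth checking by hand
suspicious <- runtime_min > 250

print(runtime_min)
print(which(suspicious))
```

Flagging instead of deleting keeps the row available for the genre and regression steps, where runtime is not used.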

Splitting the genre variable

top_netflix3 <- top_netflix2 |>
  separate_wider_delim(genre, delim = ",",
                       names = c("main_genre", "genre_two", "genre_three"),
                       too_few = "align_start")
head(top_netflix3)

# A tibble: 6 × 8
   rank title             runtime main_genre genre_two genre_three rating  votes
  <dbl> <chr>               <dbl> <chr>      <chr>     <chr>        <dbl>  <dbl>
1     1 Stranger Things        60 Drama      " Fantas… " Horror"      8.7 1.33e6
2     2 13 Reasons Why         60 Drama      " Myster… " Thriller"    7.5 3.14e5
3     3 Orange Is the Ne…      59 Comedy     " Crime"  " Drama"       8   3.19e5
4     4 Black Mirror           60 Drama      " Myster… " Sci-Fi"      8.7 6.36e5
5     5 Money Heist            60 Action     " Crime"  " Drama"       8.2 5.29e5
6     6 Lucifer              4393 Crime      " Drama"  " Fantasy"     8.1 3.54e5
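
The printed values of genre_two and genre_three begin with a space (for example " Fantasy") because the original genre strings separate entries with a comma followed by a space. One possible fix, sketched here in base R on three genre strings from the output above, is to trim each piece after splitting. Inside the tidyverse pipeline, passing `delim = ", "` should have a similar effect. The object names `pieces` and `genres` are hypothetical.

```r
# Three genre strings taken from the output above
genre <- c("Drama, Fantasy, Horror", "Comedy, Crime, Drama", "Crime, Drama, Fantasy")

# Split on commas, then strip the whitespace left around each piece
pieces <- lapply(strsplit(genre, ","), trimws)

# Assemble into a data frame with the same column names used above
# (this sketch assumes every show lists exactly three genres)
genres <- as.data.frame(do.call(rbind, pieces), stringsAsFactors = FALSE)
names(genres) <- c("main_genre", "genre_two", "genre_three")

print(genres)
```

Without the trim, " Drama" and "Drama" would count as different categories in any later comparison across the three genre columns.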

Now that I am done cleaning, I am ready for the exploratory phase.

Exploring run times

ggplot(top_netflix3) +
  geom_density(aes(x = runtime))

For my main visualization I will probably filter out the long run times so they don’t skew my plot.

Genre plots

Now let’s take a look at the genre composition of these shows.

ggplot(top_netflix3) +
  geom_bar(aes(x = main_genre, fill = genre_two)) +
  coord_flip() +
  labs(title = "Genres of the most watched Netflix Originals",
       x = "Main Genre",
       y = "Number", 
       fill = "Secondary Genre",
       caption = "Source: https://www.imdb.com/list/ls049223775/")

ggplot(top_netflix3, aes(x = genre_three)) +
  geom_bar()

Although the most common main genre is Comedy, most of the shows seem to have Drama as a second or third genre, especially those whose main genre is Crime. Lastly, I will explore votes and rating through linear regression.
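
The observation about Drama can also be checked by counting rather than reading it off the bar charts. The sketch below is a minimal base-R illustration using only the six shows printed earlier, with the leading spaces already removed, so its result describes those six rows and not the full data set. The names `shows` and `has_drama` are hypothetical.

```r
# Genres of the first six shows, as printed earlier (whitespace trimmed)
shows <- data.frame(
  main_genre  = c("Drama", "Drama", "Comedy", "Drama", "Action", "Crime"),
  genre_two   = c("Fantasy", "Mystery", "Crime", "Mystery", "Crime", "Drama"),
  genre_three = c("Horror", "Thriller", "Drama", "Sci-Fi", "Drama", "Fantasy")
)

# TRUE when "Drama" appears in any of the three genre slots
has_drama <- rowSums(shows == "Drama") > 0

# Cross-tabulate main genre against whether Drama appears anywhere
print(table(shows$main_genre, has_drama))

# Share of these shows that list Drama in any slot
print(mean(has_drama))
```

Applied to all 80 rows, the same cross-tabulation would show how often Drama accompanies each main genre.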

Scatter plot to precede linear regression

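In my linear regression model I am going to explore the relationship between rating and votes. Note that the scatter plots place rating on the x-axis, while the model fitted later uses votes to predict rating.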

options(scipen = 999)
ggplot(top_netflix3, aes(x = rating, y = votes)) +
  geom_point()

Let’s find out which show is the outlier.

top_netflix3 |>
  slice_max(votes)

# A tibble: 1 × 8
   rank title           runtime main_genre genre_two  genre_three rating   votes
  <dbl> <chr>             <dbl> <chr>      <chr>      <chr>        <dbl>   <dbl>
1     1 Stranger Things      60 Drama      " Fantasy" " Horror"      8.7 1327188

Stranger Things could be skewing the plot.

Removing Stranger Things to see if a truer relationship exists

options(scipen = 999)
netflix <- top_netflix3[top_netflix3$title != "Stranger Things",]
ggplot(netflix, aes(x = rating, y = votes)) +
  geom_point()

This plot shows that there might be a moderate correlation between rating and votes. Now let’s find the r value.

Linear regression

cor(netflix$rating, netflix$votes)

[1] 0.605176

cor(top_netflix3$rating, top_netflix3$votes)

[1] 0.5474993

Removing the outlier Stranger Things, with over a million votes, raises the correlation only slightly, from about 0.55 to about 0.61.

For now I will proceed with the stronger correlation value, that is, with the data set that excludes Stranger Things.

fit1 <- lm(rating ~ votes, data = netflix)
summary(fit1)


Call:
lm(formula = rating ~ votes, data = netflix)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.42544 -0.38986  0.06732  0.45466  1.13589 

Coefficients:
                Estimate   Std. Error t value             Pr(>|t|)    
(Intercept) 7.2966002082 0.0887210926  82.242 < 0.0000000000000002 ***
votes       0.0000029114 0.0000004364   6.671        0.00000000347 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5674 on 77 degrees of freedom
Multiple R-squared:  0.3662,    Adjusted R-squared:  0.358 
F-statistic:  44.5 on 1 and 77 DF,  p-value: 0.000000003471

The correlation of about 0.6 is only moderate, but the model yielded very low p-values, which suggests that the slope is significantly different from zero and that there is a linear association between votes and rating. However, the R-squared value shows that only about 37% of the variation in rating is explained by the regression model.
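
In a regression with a single predictor, R-squared is the square of the correlation coefficient, so the two numbers reported above can be cross-checked. A short sketch, using the r value printed by `cor()` for the data without Stranger Things (the variable names are illustrative):

```r
# Correlation between rating and votes, copied from the cor() output above
r <- 0.605176

# With one predictor, Multiple R-squared equals r squared
r_squared <- r^2

print(round(r_squared, 4))
```

The result agrees with the Multiple R-squared of 0.3662 in the model summary.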

Regression Model

rating = 2.911e-06(votes) + 7.297

This is the equation of the model. Next I will look into the diagnostic plots.
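
To make the size of the slope easier to read, the equation can be evaluated at a few vote counts. The sketch below hard-codes the coefficients from the summary output; `predict_rating` is a hypothetical helper and the three vote counts are arbitrary illustrative inputs, not values taken from the data.

```r
# Coefficients copied from the summary(fit1) output
intercept <- 7.2966002082
slope     <- 0.0000029114

# Predicted rating for a given number of votes
predict_rating <- function(votes) intercept + slope * votes

# Predictions at three illustrative vote counts
print(predict_rating(c(100000, 300000, 500000)))

# Change in predicted rating per additional 100,000 votes
print(slope * 100000)
```

Under this model, each additional 100,000 votes corresponds to a predicted rating about 0.29 points higher. Since the model was fitted without Stranger Things, predictions near its vote count would be extrapolations.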

Diagnostic Plots

autoplot(fit1, 1:4, nrow=2, ncol=2)

The residuals vs. fitted plot is messy: the line is not straight, and the points lack balance as well as randomness. This suggests that a linear model might not be appropriate for depicting the relationship between votes and rating. On the other hand, the Normal Q-Q plot looks fairly good, even though the points stray somewhat from the line at the right end. Overall, the signals are mixed as to whether a linear model is appropriate for describing the relationship between rating and votes.

Plotting

Getting colors and filtering out abnormal run times for the graph

colors <- c("red", "maroon", "orange", "gold", "violet", "green", "darkgreen", "blue", "purple")
top_netflix4 <- top_netflix3 |>
  filter(runtime < 250)

Graphing

highchart() |>
  hc_add_series(data = top_netflix4,
                   type = "point",
                   hcaes(x = rank,
                   y = rating, 
                   group = main_genre,
                   size = votes)) |>
  hc_chart(style = list(fontFamily = "Aptos")) |>
  hc_colors(colors) |>
  hc_title(text="Most Watched Netflix Originals (Top 80)") |>
  hc_xAxis(title = list(text="Rank")) |>
  hc_yAxis(title = list(text="Rating (1-10)")) |>
  hc_caption(text = "Source: https://www.imdb.com/list/ls049223775/") |>
  hc_tooltip(borderColor = "black",
             pointFormat = "Title: {point.title}<br>Genre: {point.main_genre}<br>Secondary Genre: {point.genre_two}<br>Rank: {point.rank}<br>Rating: {point.rating}<br>Votes: {point.votes}<br>Runtime(mins): {point.runtime}"
  )

Conclusion Essay

As expected, the graph generally shows that the rank of drama shows has a direct relationship with rating and number of votes: the closer a show is to rank 1, the higher its rating and vote count tend to be. The same can generally be said about action and adventure shows. Additionally, since the secondary genre of crime shows is usually drama, it is to be expected that crime shows follow a pattern similar to the one drama shows exhibit. Interestingly, the votes and ratings of comedy shows seem to be much more spread out and random with respect to rank than those of some other genres. The same could be said about animation shows. Finally, taken as a whole and including some of the one-off genres, the top 80 shows have a somewhat direct relationship between rank, votes, and rating, similar to what was suggested by the regression model. If I had more time, I would have tried to make my own theme or find an interesting pre-built theme for highcharter. Perhaps for the last project I will try some more advanced visualizations and regression modeling.

Background Research

The article referenced below describes how season 4 of Stranger Things, the rank 1 show in this data set, broke multiple Netflix streaming records while shooting to the top when it was released. The article explains some of the metrics used to rank shows, including viewing hours and minutes streamed. Lastly, the article mentions some of Netflix’s other hits, including some from this data set such as Ozark.

Reference:

https://collider.com/stranger-things-season-4-volume-1-breaks-neilsen-record/