Background

How Netflix, Prime Video, and Hulu compare to new streaming rivals like Disney Plus and HBO Max

Article Source & Reason for Selection:

The article that I based my project on comes from businessinsider.com. I selected this article because, as an avid TV watcher, I was curious how these streaming services stack up against each other. I want to compare the services and consider which might be best for me and my interests. Unfortunately, I cannot afford all of them, so my question is: which providers have the highest number of highly rated TV shows?

Variables of Interest:

  1. Rotten Tomatoes Score: Score from 0 to 100 based on the share of critics who give the show a positive review
  2. IMDb Rating: Users score a show from 0 to 10 and the scores are aggregated into an average
  3. Year: Year the TV show was released
  4. Age: Targeted age group

Article Key Ideas & My Opinion:

The article argues that Netflix has the most subscribers, at 183 million, and the most “quality” and “high-quality” TV shows of any of the streamers. I agree that Netflix is the most popular, and I have personally loved the shows Netflix offers. I therefore expect to agree with the article that Netflix offers the most quality TV shows of all the streaming platforms. Through this analysis I will assess the article’s argument that Netflix has the most “high-quality” TV shows of any of the providers.

Data

Data Appropriateness:

The data was collected from Kaggle. The data set is composed of TV shows available on Netflix, Hulu, Prime Video, and Disney+. Each row is a unique show with the year it was produced, the targeted age group, the IMDb rating, and the Rotten Tomatoes score. In addition, there is a column for each streaming service (Netflix, Hulu, Prime Video, and Disney+) with a binary response (0 = does not have show, 1 = has show). There are 5,368 rows. The data was scraped from these providers’ APIs and updated two months ago; it was then aggregated with data scraped from Rotten Tomatoes and IMDb. This data set will be a valuable extension of the information provided in the article. The article discusses the streaming platforms in terms of quantity of shows and movies, price, and number of providers. I will use this data to assess the quality of the content on each of these platforms based on their ratings.

Data Validation:

  1. Read in Data
# read in data & required packages
library(readr) 
library(tidyverse)
library(stringr)
library(gt)
library(ggplot2)
library(tidyr)
library(reshape2)
library(plotly)
shows <- read_csv("tv_shows.csv")
  1. Change Data Types: Although the data is mostly in the correct type and format, I still need to adjust some of the variables to best answer my questions. For example, both ratings are currently stored as character strings in fraction form, with the score written over 10 or 100. I want to convert these into numbers containing only the numerator so I can compute things such as the average rating per show for each streaming service provider. I am also going to remove the column type because it is not explained in the data dictionary of the dataset, so I do not know what it means. Later on I also convert the one-hot encoded columns “Netflix”, “Hulu”, and “Prime Video” into one column called Provider. I will eventually do the same for the Rotten Tomatoes and IMDb columns, combining them into one column called Rating Source, so that I can use packages such as ggplot2.
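# Is the data type correct for this field?
# convert the IMDb & Rotten Tomatoes ratings from a fraction string to just the numerator
shows$IMDb <- as.numeric(str_split_fixed(shows$IMDb, "/", 2)[, 1])
shows$`Rotten Tomatoes` <- as.numeric(str_split_fixed(shows$`Rotten Tomatoes`, "/", 2)[, 1])
# keep columns 2 through 10 (ID through Prime Video); this drops the leading index column
# and the columns after Prime Video, including type
shows <- shows[, 2:10]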

# Is the value within the valid range or part of a domain or enumerated list?
max1 <- max(shows$ID)

# Check for duplicates, for example of a unique key.
dup1 <- shows$ID[duplicated(shows$ID)]
dup2 <- shows$Title[duplicated(shows$Title)]

# Check for nulls. Are there mandatory values, or are null / empty values allowed? Are the null types consistent (NaN, infinity, empty strings, etc.)?

empty1 <- sum(is.na(shows))
empty2 <- sum(is.na(shows$Age)) 
empty3 <- sum(is.na(shows$IMDb))

shows <- shows[!is.na(shows$IMDb), ]  
  1. Remove Duplicates/Nulls: There are no duplicates of ID, Title, or X1. These are the only columns I must check, because duplicates are allowed and expected in all the other columns. There are nulls in the Age and IMDb columns: 2,127 in the Age column and 962 in the IMDb ratings column. Because I am not going to use target age to answer any of my questions (so far), I am going to ignore that column for now. However, I know I want to use the IMDb ratings to answer some of my questions, so I am going to omit the rows that contain NAs for IMDb ratings. That leaves me with 4,406 rows of complete data that I can use to answer my questions.
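
To make the validation steps above concrete, here is a minimal sketch on a small made-up data frame. The values and the names `raw`, `clean`, `n_dup_id`, `n_dup_title`, `na_per_column`, and `in_range` are hypothetical and only illustrate the idea; this is not the real data set.

```r
# Toy data standing in for the raw file (hypothetical values)
raw <- data.frame(
  ID = c(1, 2, 3, 4),
  Title = c("Show A", "Show B", "Show C", "Show D"),
  IMDb = c("9.4/10", "8.7/10", NA, "7.1/10"),
  stringsAsFactors = FALSE
)

# Keep only the numerator of the "score/scale" string
raw$IMDb <- as.numeric(sub("/.*$", "", raw$IMDb))

# Duplicate check on the columns that should be unique
n_dup_id <- sum(duplicated(raw$ID))
n_dup_title <- sum(duplicated(raw$Title))

# Count missing values per column
na_per_column <- colSums(is.na(raw))

# Range check: IMDb ratings should fall between 0 and 10
in_range <- all(raw$IMDb >= 0 & raw$IMDb <= 10, na.rm = TRUE)

# Drop rows that have no IMDb rating
clean <- raw[!is.na(raw$IMDb), ]
```

Counting NAs per column with `colSums(is.na(...))` gives the same kind of numbers as the individual `sum(is.na(...))` calls, just for every column at once.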

Data Head

Here is a peek at what the data looks like from the first few rows.

# create table that shows first few rows of cleaned dataset
head(shows) %>% gt()  %>% tab_header(
    title = md("First Rows of Data")
  )
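First Rows of Data

| ID | Title | Year | Age | IMDb | Rotten Tomatoes | Netflix | Hulu | Prime Video |
|---|---|---|---|---|---|---|---|---|
| 1 | Breaking Bad | 2008 | 18+ | 9.4 | 100 | 1 | 0 | 0 |
| 2 | Stranger Things | 2016 | 16+ | 8.7 | 96 | 1 | 0 | 0 |
| 3 | Attack on Titan | 2013 | 18+ | 9.0 | 95 | 1 | 1 | 0 |
| 4 | Better Call Saul | 2015 | 18+ | 8.8 | 94 | 1 | 0 | 0 |
| 5 | Dark | 2017 | 16+ | 8.8 | 93 | 1 | 0 | 0 |
| 6 | Avatar: The Last Airbender | 2005 | 7+ | 9.3 | 93 | 1 | 0 | 1 |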

Data Summary

Some quick summary statistics about our data are provided below.

# create some summary statistics
total <-  prettyNum(nrow(shows) , big.mark=",",scientific=FALSE)  # total number of shows

# number of netflix, hulu, and prime shows
netflix <- prettyNum(sum(shows$Netflix) , big.mark=",",scientific=FALSE)
hulu <-  prettyNum(sum(shows$Hulu) , big.mark=",",scientific=FALSE)
prime <-  prettyNum(sum(shows$`Prime Video`) , big.mark=",",scientific=FALSE)

# number of shows by age group
# na.rm = TRUE keeps rows with a missing Age out of every count
eighteen <- prettyNum(sum(shows$Age == "18+", na.rm = TRUE), big.mark = ",", scientific = FALSE)
sixteen <- prettyNum(sum(shows$Age == "16+", na.rm = TRUE), big.mark = ",", scientific = FALSE)
thirteen <- prettyNum(sum(shows$Age == "13+", na.rm = TRUE), big.mark = ",", scientific = FALSE)
seven <- prettyNum(sum(shows$Age == "7+", na.rm = TRUE), big.mark = ",", scientific = FALSE)
all_ages <- prettyNum(sum(shows$Age == "all", na.rm = TRUE), big.mark = ",", scientific = FALSE)


# looking at mean rating for each rating system
rt <- round(mean(shows$`Rotten Tomatoes`), 1)
imdb <- round(mean(shows$IMDb), 1)

# putting it all together in a data frame
Summary <- c("Number of Shows","Mean Rotten Tomatoes Score","Mean IMDb Score","Netflix Shows","Hulu Shows","Prime Video Shows","18+ Shows","16+ Shows","13+ Shows","7+ Shows","All Ages Allowed Shows")
Statistic <- c(total, rt, imdb, netflix, hulu, prime, eighteen, sixteen, thirteen, seven, all_ages)
summary_tbl <- data.frame(Summary, Statistic)
summary_tbl %>% gt() %>% tab_header(
    title = md("Summary Statistics Data Table")
  )
Summary Statistics Data Table

| Summary | Statistic |
|---|---|
| Number of Shows | 4,406 |
| Mean Rotten Tomatoes Score | 53.8 |
| Mean IMDb Score | 7.1 |
| Netflix Shows | 1,875 |
| Hulu Shows | 1,418 |
| Prime Video Shows | 1,182 |
| 18+ Shows | 852 |
| 16+ Shows | 987 |
| 13+ Shows | 9 |
| 7+ Shows | 824 |
| All Ages Allowed Shows | 535 |
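
One caveat when counting rows by a condition on a column with missing values, as Age has here: subsetting a data frame with a logical vector that contains NA keeps one all-NA row for each NA, so `nrow()` on the subset overcounts. The sketch below shows the difference on made-up data; the names `ages`, `inflated`, and `exact` are hypothetical.

```r
# Toy Age column with two missing values (hypothetical data)
ages <- data.frame(Age = c("18+", "16+", NA, "18+", NA), stringsAsFactors = FALSE)

# Logical row indexing returns an NA row for every NA in the condition,
# so the two missing ages are counted along with the two real matches
inflated <- nrow(ages[ages$Age == "18+", , drop = FALSE])  # 4

# Summing the condition with na.rm = TRUE counts only the true matches
exact <- sum(ages$Age == "18+", na.rm = TRUE)  # 2
```

Wrapping the condition in `which()` before indexing would also drop the NA rows.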

Plots

Number of Shows for Each Provider

Because we want to answer the question of who has the highest number of highly rated TV shows, it is important to first consider who offers the most TV shows, regardless of rating.

We see from the bar graph below that Netflix has the highest number of TV show offerings, with 1,875 shows among those that have an IMDb rating. This is 400+ more than Hulu and 600+ more than Prime Video. This bar graph supports the article’s claim that Netflix has the most options when it comes to TV shows.

# create df for number of TV shows per provider
df <- data.frame(
  Provider = c("Netflix", "Hulu", "Prime Video"),
  Show_Count = c(sum(shows$Netflix), sum(shows$Hulu), sum(shows$`Prime Video`))
)

# plot; fill colors follow the alphabetical order of Provider (Hulu, Netflix, Prime Video)
ggplot(df, aes(x = reorder(Provider, -Show_Count), y = Show_Count, fill = Provider)) +
  geom_col() +
  geom_text(aes(label = Show_Count), position = position_dodge(width = 0.9), vjust = -0.25) +
  labs(x = "Provider", y = "Number of TV Shows", title = "Count of TV Shows by Provider") +
  scale_fill_manual(values = c("#66aa33", "#E50914", "#146eb4")) +
  theme(plot.title = element_text(hjust = 0.5))

Mean Rating for Each Provider

Because we are interested in ratings and how they are distributed across the different providers, I created a bar graph of the average rating by provider for both Rotten Tomatoes and IMDb scores.

The chart below shows that Prime Video has the highest average IMDb score and Hulu has the highest average Rotten Tomatoes score, both ahead of Netflix. IMDb ratings are fairly consistent across all three providers, with Prime Video holding a 0.1 advantage over the other two. The difference in Rotten Tomatoes scores is more noticeable: Hulu has the highest average, followed by Netflix, then Prime Video. However, because this chart only looks at averages, it does not directly answer our question of which provider has the highest number of highly rated TV shows.
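
To illustrate why an average alone cannot settle a question about counts, here is a small sketch with made-up ratings for two hypothetical providers, "A" and "B". The 90-point threshold for "highly rated" is an assumption chosen only for this example.

```r
# Hypothetical ratings: A has a large, mixed catalog; B has a small, solid one
toy <- data.frame(
  Provider = c(rep("A", 6), rep("B", 3)),
  Rating = c(95, 92, 90, 40, 35, 30, 85, 80, 75)
)

# Mean rating per provider
mean_by_provider <- tapply(toy$Rating, toy$Provider, mean)

# Number of shows at or above the (assumed) threshold of 90
high_by_provider <- tapply(toy$Rating >= 90, toy$Provider, sum)
```

In this toy data B has the higher mean, yet A has three shows at or above the threshold and B has none, so the two measures can rank providers differently.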

# breakdown data frames into netflix, hulu, and prime video dataframes
netflix <- shows[shows$Netflix == 1,]
hulu <- shows[shows$Hulu == 1,]
prime <- shows[shows$`Prime Video` == 1,]

# find means for rotten tomatoes and IMDb for each provider

# netflix
nrt <- round(mean(netflix$`Rotten Tomatoes`),1)
nimdb <- round(mean(netflix$IMDb),1)

#hulu
hrt <- round(mean(hulu$`Rotten Tomatoes`),1)
himdb <- round(mean(hulu$IMDb),1)

#prime

prt <- round(mean(prime$`Rotten Tomatoes`),1)
pimdb <- round(mean(prime$IMDb),1)

#rating table

ratings <- data.frame(Provider = c(rep(c("Netflix","Hulu","Prime Video"),2)), rating_type = c(rep("Rotten Tomatoes",3), rep("IMDb",3)), rating = c(nrt, hrt, prt,nimdb, himdb, pimdb) )

# plot ratings table; each rating type gets its own panel with its own y-axis scale
ggplot(ratings, aes(x = Provider, y = rating, fill = Provider)) +
  geom_col(position = "dodge") +
  facet_wrap(~rating_type, scales = "free") +
  geom_text(aes(label = rating), position = position_dodge(width = 0.9), vjust = -0.25) +
  labs(x = "Provider", y = "Average Rating", title = "Ratings by Provider") +
  scale_fill_manual(values = c("#66aa33", "#E50914", "#146eb4")) +
  theme(plot.title = element_text(hjust = 0.5))

Boxplot by Rating Type by Provider

To look more clearly at the distribution of these ratings across providers, it is helpful to use a boxplot. From these boxplots we once again see similar trends across the providers for IMDb ratings. For Rotten Tomatoes, Hulu's distribution still sits higher overall, but Netflix has more outlier shows scoring 93 or above. This points to the possibility that Netflix has the largest number of quality TV shows.

# convert the provider indicator columns into factors
shows$Hulu <- as.factor(shows$Hulu)
shows$Netflix <- as.factor(shows$Netflix)
shows$`Prime Video` <- as.factor(shows$`Prime Video`)

# melt data so that 0,1, columns for netflix, hulu, and prime become one column
long <- melt(shows, id.vars = c("ID","Title","Year","Age","IMDb","Rotten Tomatoes"),
             measure.vars = c("Netflix","Hulu","Prime Video"))

# remove 0 values
long <- long[long$value == 1,]

# remove 0,1 column
long <- long[,-8]

colnames(long)[7] <- 'Provider'
# making rotten tomatoes and IMDb one column

long2 <- melt(long, id.vars = c("ID","Title","Year","Age","Provider"),
             measure.vars = c("IMDb","Rotten Tomatoes"))

# change variable names 
colnames(long2)[c(6,7)] <- c('Rating Source','Rating')

# plot this; the black point marks the mean of each box
plot <- ggplot(long2, aes(x = Provider, y = Rating, fill = Provider, label = Title)) +
  geom_boxplot() +
  labs(x = "Provider", y = "Rating", title = "Scores by Provider") +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 3, color = "black") +
  scale_fill_manual(values = c("#E50914", "#66aa33", "#146eb4")) +
  facet_wrap(~`Rating Source`, scales = "free") +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(plot)
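
For reference, the same wide-to-long reshaping can also be sketched with `tidyr::pivot_longer()` instead of `reshape2::melt()`. This is an alternative under stated assumptions, not the code used for the plots in this report; the toy data and the names `toy`, `toy_long`, and `available` are hypothetical.

```r
library(tidyr)
library(dplyr)

# Toy one-hot data (hypothetical shows and ratings)
toy <- data.frame(
  Title = c("Show A", "Show B", "Show C"),
  IMDb = c(9.4, 8.7, 7.1),
  `Rotten Tomatoes` = c(100, 96, 60),
  Netflix = c(1, 1, 0),
  Hulu = c(0, 1, 1),
  check.names = FALSE
)

toy_long <- toy %>%
  # stack the provider indicator columns into Provider / available pairs
  pivot_longer(c(Netflix, Hulu), names_to = "Provider", values_to = "available") %>%
  # keep only the provider-show pairs that actually exist
  filter(available == 1) %>%
  select(-available) %>%
  # stack the two rating columns into Rating Source / Rating pairs
  pivot_longer(c(IMDb, `Rotten Tomatoes`), names_to = "Rating Source", values_to = "Rating")
```

Each show ends up with one row per provider it is available on and per rating source, which is the shape that `facet_wrap()` and the `fill` aesthetic expect.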

Bar Graph of Low-, Medium-, and High-Rated Shows (Rotten Tomatoes)

We can see more clearly here that Netflix does have the highest number of highly rated TV shows. However, the margin is small: Hulu offers only 8 fewer highly rated shows than Netflix. We also see that Netflix offers the highest number of medium-rated shows. This bar graph supports the author’s point that Netflix has the highest number of highly rated TV shows, although it is not as much of a clean sweep for Netflix as the article made it out to be, since Hulu follows closely behind.

# create cut-offs for high-, medium-, and low-rated shows; cutoffs based on averaging Q3 and Q1 across providers
shows2 <- long2[long2$`Rating Source` == "Rotten Tomatoes", ]
shows2 <- shows2 %>% mutate(RatingCategory = case_when(
  Rating >= 64 ~ "High",
  Rating <= 45.08 ~ "Low",
  Rating > 45.08 & Rating < 64 ~ "Medium"
))

# count shows per provider and rating category
grouped <- shows2 %>%
  group_by(Provider, RatingCategory) %>%
  summarise(count = n(), .groups = "drop")

# Provider levels are Netflix, Hulu, Prime Video (the melt order), so fill colors are listed in that order
ggplot(grouped, aes(x = Provider, y = count, fill = Provider)) +
  geom_col(position = "dodge") +
  facet_wrap(~RatingCategory) +
  labs(x = "Providers", y = "Number of TV Shows", title = "Count of Shows by Rating Category") +
  scale_fill_manual(values = c("#E50914", "#66aa33", "#146eb4")) +
  geom_text(aes(label = count), position = position_dodge(width = 0.9), vjust = -0.25) +
  theme(plot.title = element_text(hjust = 0.5))
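
The cutoffs used above are described as averages of the first and third quartiles across providers. As a sketch of one way such cutoffs could be computed, assuming that reading of the comment is right, the example below derives them from made-up ratings. The data and the names `toy`, `q1`, `q3`, `low_cutoff`, and `high_cutoff` are hypothetical.

```r
# Toy ratings, five per provider (hypothetical values)
toy <- data.frame(
  Provider = rep(c("Netflix", "Hulu", "Prime Video"), each = 5),
  Rating = c(10, 20, 30, 40, 50,
             20, 40, 60, 80, 100,
             30, 45, 60, 75, 90)
)

# First and third quartile of the ratings within each provider
q1 <- tapply(toy$Rating, toy$Provider, quantile, probs = 0.25, names = FALSE)
q3 <- tapply(toy$Rating, toy$Provider, quantile, probs = 0.75, names = FALSE)

# Average the per-provider quartiles to get one shared pair of cutoffs
low_cutoff <- mean(q1)
high_cutoff <- mean(q3)

# Apply the cutoffs with the same boundary rules as the case_when() above
toy$RatingCategory <- ifelse(toy$Rating >= high_cutoff, "High",
                      ifelse(toy$Rating <= low_cutoff, "Low", "Medium"))

# Cross-tabulate provider against category
category_counts <- table(toy$Provider, toy$RatingCategory)
```

Using one shared pair of cutoffs, rather than separate cutoffs per provider, keeps the "High" label meaning the same thing for every provider, which matters when the counts are compared side by side.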

Conclusion

Ultimately, what we discovered through these plots is that Netflix offers the most TV shows of the three providers (Netflix, Hulu, and Prime Video). At first, when we looked at the average rating for these providers, all three services appeared to average similar scores for Rotten Tomatoes and IMDb, with Hulu potentially having slightly higher ratings on average. However, when we looked at a boxplot to better understand this, we saw that Netflix has a higher number of highly rated outliers. This realization led us to break the ratings down into categories of highly rated, medium-rated, and low-rated shows. We see from this final bar graph that Netflix does have the highest number of highly rated shows. Thus, we can agree with the article. Overall, it was interesting to observe how the TV streaming services compare, and it will be interesting to follow this data in the coming years as these services continue to compete and grow.

Appendix

Boxplot by Age

Checking for evidence of Simpson’s Paradox. This is work that led me to the plots I wanted to use in the main analysis:

# turn age into a factor
long2$Age <- as.factor(long2$Age)

# plot faceted by age group
plot2 <- ggplot(long2, aes(x = Provider, y = Rating, fill = Provider, label = Title)) +
  geom_boxplot() +
  labs(x = "Provider", y = "Rating", title = "Scores by Provider") +
  scale_fill_manual(values = c("#E50914", "#66aa33", "#146eb4")) +
  facet_wrap(~Age, scales = "free") +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 3, color = "black") +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(plot2)
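
As a sketch of what a Simpson's Paradox check looks for, the example below uses made-up ratings in which one hypothetical provider scores higher within every age group but lower overall, because the two providers have very different mixes of age groups. The providers "A" and "B" and all values are invented for illustration.

```r
# Hypothetical ratings: A is mostly 7+ shows, B is mostly 18+ shows
toy <- data.frame(
  Provider = c(rep("A", 5), rep("B", 5)),
  Age = c(rep("7+", 4), "18+", "7+", rep("18+", 4)),
  Rating = c(50, 50, 50, 50, 90, 45, 85, 85, 85, 85)
)

# Overall mean rating per provider
overall <- tapply(toy$Rating, toy$Provider, mean)

# Mean rating per provider within each age group (rows = provider, columns = age)
by_age <- tapply(toy$Rating, list(toy$Provider, toy$Age), mean)
```

Here A beats B in both age groups (50 vs. 45 and 90 vs. 85), yet B has the higher overall mean (77 vs. 58). If the faceted boxplots ranked the providers differently from the pooled plot in this way, that would be evidence of the paradox.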

Boxplot by Year

# make sure year is numeric before binning it
long2$Year <- as.numeric(long2$Year)

# collapse year into 1900s and 2000s 
long2 <- long2 %>% mutate(YearGroup = case_when(Year >= 2000~ "2000s",
                                                  Year < 2000 ~ "1900s"))
# turn collapsed variable into a factor
long2$YearGroup <- as.factor(long2$YearGroup)

# plot faceted by year group
plot3 <- ggplot(long2, aes(x = Provider, y = Rating, fill = Provider, label = Title)) +
  geom_boxplot() +
  labs(x = "Provider", y = "Rating", title = "Scores by Provider") +
  scale_fill_manual(values = c("#E50914", "#66aa33", "#146eb4")) +
  facet_wrap(~YearGroup, scales = "free") +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 3, color = "black") +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(plot3)