How Netflix, Prime Video, and Hulu compare to new streaming rivals like Disney Plus and HBO Max
The article this project is based on comes from businessinsider.com. I selected it because, as an avid TV watcher, I was curious how these streaming services stack up against each other. I want to compare the services and consider which might be best for me and my interests. Unfortunately, I cannot afford all of them, so I want to know: which providers have the highest number of highly rated TV shows?
Variables of Interest:
The article argues that Netflix has the most subscribers, at 183 million, and the most “quality” and “high-quality” TV shows of any of the streamers. I agree that Netflix is the most popular, and I have personally loved all of the shows Netflix offers. My starting expectation is therefore that the article is right and Netflix offers the most quality TV shows of all streaming platforms. Through this analysis I will assess the article’s argument that Netflix has the most “high-quality” TV shows of any of the providers.
Data was collected from Kaggle. This data set is composed of TV shows available on Netflix, Hulu, Prime Video, and Disney+. Each row is a unique show with the year it was produced, the targeted age group, the IMDb rating, and the Rotten Tomatoes score. In addition, there is a column for each streaming service (Netflix, Hulu, Prime Video, and Disney+) with a binary indicator (0 = does not have show, 1 = has show). There are 5,368 rows. The data was scraped from these providers’ APIs and updated two months ago; it was then aggregated with data scraped from Rotten Tomatoes and IMDb. This data set will be a valuable extension of the information provided in the article. The article discusses the streaming platforms in terms of quantity of shows and movies, price, and number of providers. I will use this data to assess the quality of the content on each of these platforms based on their ratings.
# read in data & required packages
library(readr)
library(tidyverse)
library(stringr)
library(gt)
library(ggplot2)
library(tidyr)
library(reshape2)
library(plotly)
shows <- read_csv("tv_shows.csv")
# Is the data type correct for this field?
# ratings are stored as fraction strings; split on "/" and keep only the numerator as a number
shows$IMDb <- as.numeric(str_split_fixed(shows$IMDb, "/",2)[,1])
shows$`Rotten Tomatoes` <- as.numeric(str_split_fixed(shows$`Rotten Tomatoes`, "/",2)[,1])
# keep columns 2 through 10 (ID through Prime Video); the Disney+ column is not kept
shows <- shows[,2:10]
# Is the value within the valid range or part of a domain or enumerated list?
max1 <- max(shows$ID)
# Check for duplicates, for example of a unique key.
dup1 <- shows$ID[duplicated(shows$ID)]
dup2 <- shows$Title[duplicated(shows$Title)]
# Check for nulls. Are there mandatory values, or are null / empty values allowed? Are the null types consistent (NaN, infinity, empty strings, etc.)?
empty1 <- sum(is.na(shows))
empty2 <- sum(is.na(shows$Age))
empty3 <- sum(is.na(shows$IMDb))
# drop shows that have no IMDb rating
shows <- shows[!is.na(shows$IMDb), ]
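As a minimal sketch of the cleaning steps above, the following uses a small made-up data frame to show the same parsing, duplicate check, and missing-value filter in base R. The titles and values in `toy` are hypothetical, and the "numerator/denominator" string format is an assumption inferred from the split on "/".

```r
# Hypothetical toy data; the real analysis reads tv_shows.csv
toy <- data.frame(
  ID = c(1, 2, 3, 3),
  Title = c("Show A", "Show B", "Show C", "Show C"),
  IMDb = c("9.4/10", "8.7/10", NA, NA),
  stringsAsFactors = FALSE
)

# Keep the text before the "/" and convert it to a number; NA stays NA
toy$IMDb <- as.numeric(sub("/.*$", "", toy$IMDb))

# IDs that appear more than once
dup_ids <- toy$ID[duplicated(toy$ID)]

# Count missing ratings, then drop those rows
n_missing <- sum(is.na(toy$IMDb))
toy_clean <- toy[!is.na(toy$IMDb), ]
```

Storing the results of each check in a named object, as the original code does, makes it easy to print them or reference them in the text later.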
Here is a peek at the first few rows of the data.
# create table that shows first few rows of cleaned dataset
head(shows) %>% gt() %>% tab_header(
title = md("First Rows of Data")
)
First Rows of Data

| ID | Title | Year | Age | IMDb | Rotten Tomatoes | Netflix | Hulu | Prime Video |
|---|---|---|---|---|---|---|---|---|
| 1 | Breaking Bad | 2008 | 18+ | 9.4 | 100 | 1 | 0 | 0 |
| 2 | Stranger Things | 2016 | 16+ | 8.7 | 96 | 1 | 0 | 0 |
| 3 | Attack on Titan | 2013 | 18+ | 9.0 | 95 | 1 | 1 | 0 |
| 4 | Better Call Saul | 2015 | 18+ | 8.8 | 94 | 1 | 0 | 0 |
| 5 | Dark | 2017 | 16+ | 8.8 | 93 | 1 | 0 | 0 |
| 6 | Avatar: The Last Airbender | 2005 | 7+ | 9.3 | 93 | 1 | 0 | 1 |
Some quick summary statistics about our data are provided below.
# create some summary statistics
total <- prettyNum(nrow(shows) , big.mark=",",scientific=FALSE) # total number of shows
# number of netflix, hulu, and prime shows
netflix <- prettyNum(sum(shows$Netflix) , big.mark=",",scientific=FALSE)
hulu <- prettyNum(sum(shows$Hulu) , big.mark=",",scientific=FALSE)
prime <- prettyNum(sum(shows$`Prime Video`) , big.mark=",",scientific=FALSE)
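# number of shows by age group
# note: shows$Age == "18+" evaluates to NA where Age is missing, and NA indices return all-NA rows,
# so each count below also includes the rows with a missing Age;
# sum(shows$Age == "18+", na.rm = TRUE) would count only the matching rows
eighteen <- prettyNum(nrow(shows[shows$Age == "18+",]),big.mark=",",scientific=FALSE)
sixteen <- prettyNum(nrow(shows[shows$Age == "16+",]),big.mark=",",scientific=FALSE)
thirteen <- prettyNum(nrow(shows[shows$Age == "13+",]),big.mark=",",scientific=FALSE)
seven <- prettyNum(nrow(shows[shows$Age == "7+",]),big.mark=",",scientific=FALSE)
all <- prettyNum(nrow(shows[shows$Age == "all",]),big.mark=",",scientific=FALSE)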
# looking at mean rating for each rating system
rt <- round(mean(shows$`Rotten Tomatoes`),1)
imdb <- round(mean(shows$IMDb),1)
# putting it all together in a dataframe
Summary <- c("Number of Shows","Mean Rotten Tomatoes Score","Mean IMDb Score","Netflix Shows","Hulu Shows","Prime Video Shows","18+ Shows","16+ Shows","13+ Shows","7+ Shows","All Ages Allowed Shows")
Statistic <- c(total, rt, imdb, netflix, hulu, prime, eighteen, sixteen, thirteen, seven, all)
summary <- data.frame(Summary, Statistic)
summary %>% gt() %>% tab_header(
title = md("Summary Statistics Data Table")
)
Summary Statistics Data Table

| Summary | Statistic |
|---|---|
| Number of Shows | 4,406 |
| Mean Rotten Tomatoes Score | 53.8 |
| Mean IMDb Score | 7.1 |
| Netflix Shows | 1,875 |
| Hulu Shows | 1,418 |
| Prime Video Shows | 1,182 |
| 18+ Shows | 2,051 |
| 16+ Shows | 2,186 |
| 13+ Shows | 1,208 |
| 7+ Shows | 2,023 |
| All Ages Allowed Shows | 1,734 |
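The five age-group counts in the table add up to more than the total number of shows, which is consistent with `Age` having missing values: indexing a data frame with a logical vector that contains `NA` returns an extra all-`NA` row for each missing value. A small sketch with made-up data (the `toy` data frame is hypothetical) shows the difference between the two ways of counting.

```r
# Hypothetical toy data with one missing age group
toy <- data.frame(
  Title = c("Show A", "Show B", "Show C", "Show D"),
  Age = c("18+", NA, "7+", "18+"),
  stringsAsFactors = FALSE
)

# Row-subsetting keeps an all-NA row for the missing Age, so it over-counts
n_by_rows <- nrow(toy[toy$Age == "18+", ])

# Summing the logical vector with na.rm = TRUE counts only true matches
n_by_sum <- sum(toy$Age == "18+", na.rm = TRUE)

# The gap between the two equals the number of missing ages
n_missing_age <- sum(is.na(toy$Age))
```

If the same pattern holds in the real data, each age-group figure would be inflated by the same amount, namely the number of shows with no age group recorded.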
Because we want to know which provider has the highest number of highly rated TV shows, it is first important to consider who offers the most TV shows, regardless of rating.
We see from the bar graph below that Netflix has the most TV shows, with 1,875. This is 400+ more than Hulu and 600+ more than Prime Video. This bar graph supports the article’s claim that Netflix has the most options when it comes to TV shows.
# create df for number of TV shows per provider
df <- data.frame(Provider = c("Netflix","Hulu","Prime Video"), Show_Count = c(sum(shows$Netflix), sum(shows$Hulu), sum(shows$`Prime Video`)))
# plot providers in descending order of show count
ggplot(df, aes(x = reorder(Provider, -Show_Count), y = Show_Count, fill = Provider)) + geom_bar(stat = "identity")+ geom_text(aes(label = Show_Count),position = position_dodge(width=0.9),vjust=-0.25) + labs( x = "Provider", y = "Number of TV Shows", title = "Count of TV Shows by Provider") + scale_fill_manual(values=c("#66aa33","#E50914", "#146eb4"))+
theme(plot.title = element_text(hjust = 0.5))
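The gaps between providers follow directly from the three counts in the summary table. As a quick check, using only those figures:

```r
# Show counts per provider, taken from the summary statistics table
show_count <- c(Netflix = 1875, Hulu = 1418, `Prime Video` = 1182)

# How many more shows Netflix offers than each provider
gap_to_netflix <- unname(show_count["Netflix"]) - show_count
gap_to_netflix
# Hulu is 457 behind and Prime Video is 693 behind
```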
Because we are interested in how ratings are distributed across the providers, I created a bar graph of the average Rotten Tomatoes and IMDb scores for each provider.
The chart below shows that Prime Video has the highest average IMDb score and Hulu has the highest average Rotten Tomatoes score; Netflix leads on neither. IMDb ratings are fairly consistent across all three providers, with Prime Video having a 0.1 advantage over the other two. The difference in Rotten Tomatoes scores is larger: Hulu has the highest average, followed by Netflix, then Prime Video. However, because this only looks at averages, it still does not answer our question of which provider has the highest number of highly rated TV shows.
# breakdown data frames into netflix, hulu, and prime video dataframes
netflix <- shows[shows$Netflix == 1,]
hulu <- shows[shows$Hulu == 1,]
prime <- shows[shows$`Prime Video` == 1,]
# find means for rotten tomatoes and IMDb for each provider
# netflix
nrt <- round(mean(netflix$`Rotten Tomatoes`),1)
nimdb <- round(mean(netflix$IMDb),1)
#hulu
hrt <- round(mean(hulu$`Rotten Tomatoes`),1)
himdb <- round(mean(hulu$IMDb),1)
#prime
prt <- round(mean(prime$`Rotten Tomatoes`),1)
pimdb <- round(mean(prime$IMDb),1)
#rating table
ratings <- data.frame(Provider = c(rep(c("Netflix","Hulu","Prime Video"),2)), rating_type = c(rep("Rotten Tomatoes",3), rep("IMDb",3)), rating = c(nrt, hrt, prt,nimdb, himdb, pimdb) )
# plot ratings table, one panel per rating source
ggplot(ratings, aes(x = Provider, y = rating, fill = Provider)) + geom_bar(stat = "identity", position = "dodge") + facet_wrap(~rating_type, scales = "free") + geom_text(aes(label = rating),position = position_dodge(width=0.9),vjust=-0.25) + labs( x = "Provider", y = "Average Rating", title = "Ratings by Provider") + scale_fill_manual(values=c("#66aa33","#E50914", "#146eb4"))+
theme(plot.title = element_text(hjust = 0.5))
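Because a show can be available on more than one provider, each provider’s average is taken over its own subset of rows rather than over mutually exclusive groups. A compact sketch of the same calculation, using a made-up `toy` data frame (all titles and values are hypothetical) and `sapply` over the indicator columns:

```r
# Hypothetical toy data; a show may belong to more than one provider
toy <- data.frame(
  Title = c("Show A", "Show B", "Show C", "Show D"),
  IMDb = c(9.0, 8.0, 7.0, 6.0),
  Netflix = c(1, 1, 0, 0),
  Hulu = c(0, 1, 1, 0),
  `Prime Video` = c(0, 0, 1, 1),
  check.names = FALSE
)

providers <- c("Netflix", "Hulu", "Prime Video")

# Mean IMDb score over the rows where each provider's indicator equals 1
mean_imdb <- sapply(providers, function(p) {
  mean(toy$IMDb[toy[[p]] == 1], na.rm = TRUE)
})
```

Looping over the column names avoids repeating the subset-and-average code once per provider, and `na.rm = TRUE` keeps a single missing score from turning a whole provider average into `NA`.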
To look more closely at the distribution of these ratings across providers, it is helpful to use boxplots. These boxplots again show similar patterns across the providers for IMDb ratings. For Rotten Tomatoes, Hulu’s distribution still sits higher, but Netflix has a larger number of outlier shows receiving scores of 93+. This suggests that Netflix may have the most quality TV shows.
# convert the provider indicator columns into factors
shows$Hulu <- as.factor(shows$Hulu)
shows$Netflix <- as.factor(shows$Netflix)
shows$`Prime Video` <- as.factor(shows$`Prime Video`)
# melt data so that 0,1, columns for netflix, hulu, and prime become one column
long <- melt(shows, id.vars = c("ID","Title","Year","Age","IMDb","Rotten Tomatoes"),
measure.vars = c("Netflix","Hulu","Prime Video"))
# remove 0 values
long <- long[long$value == 1,]
# remove 0,1 column
long <- long[,-8]
colnames(long)[7] <- 'Provider'
# making rotten tomatoes and IMDb one column
long2 <- melt(long, id.vars = c("ID","Title","Year","Age","Provider"),
measure.vars = c("IMDb","Rotten Tomatoes"))
# change variable names
colnames(long2)[c(6,7)] <- c('Rating Source','Rating')
# plot boxplots by provider, one panel per rating source, with the mean marked as a point
plot <- ggplot(long2, aes(x = Provider, y = Rating, fill = Provider, label = Title)) + geom_boxplot() + labs( x = "Provider", y = "Rating", title = "Scores by Provider") + stat_summary(fun=mean, geom="point", shape=20, size=3, fill = "black") +
scale_fill_manual(values=c("#E50914","#66aa33", "#146eb4")) + facet_wrap(~`Rating Source`, scales = "free") + theme(plot.title = element_text(hjust = 0.5))
ggplotly(plot)
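The two `melt` calls reshape the data from one indicator column per provider to one row per show and provider. As an alternative sketch of that first reshaping step, `tidyr::pivot_longer` (tidyr is already loaded at the top of the analysis) can do the same thing. The `toy` data frame and the `available` column name are made up for illustration.

```r
library(tidyr)

# Hypothetical wide data: one 0/1 column per provider
toy <- data.frame(
  Title = c("Show A", "Show B", "Show C"),
  IMDb = c(8.1, 7.4, 6.9),
  Netflix = c(1, 0, 1),
  Hulu = c(1, 1, 0)
)

# Stack the provider columns into a Provider column plus a 0/1 value column
long_toy <- pivot_longer(
  toy,
  cols = c(Netflix, Hulu),
  names_to = "Provider",
  values_to = "available"
)

# Keep only the show/provider pairs that actually exist, then drop the 0/1 column
long_toy <- long_toy[long_toy$available == 1, c("Title", "IMDb", "Provider")]
```

A show carried by two providers appears twice in the long data, once per provider, which is what lets a single boxplot call group by `Provider`.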
In the bar graph below, we can see more clearly that Netflix does have the highest number of highly rated TV shows. However, the margin is small: Hulu offers only 8 fewer highly rated shows than Netflix. Netflix also offers the highest number of medium-rated shows. This bar graph supports the author’s point that Netflix has the highest number of highly rated TV shows, although it is not as much of a clean sweep for Netflix as the article made it out to be, with Hulu following closely behind.
# create cutoffs for high-, medium-, and low-rated shows; cutoffs are based on averaging Q3 and Q1 for each provider
shows2 <- long2[long2$`Rating Source` == "Rotten Tomatoes",]
shows2 <- shows2 %>% mutate(RatingCategory = case_when(Rating >= 64 ~ "High",
Rating <= 45.08 ~ "Low",
Rating > 45.08 & Rating < 64 ~ "Medium"))
# count shows in each provider and rating category
grouped <- shows2 %>% group_by(Provider, RatingCategory) %>% summarise(count = n(), .groups = "drop")
# fill (not color) is the mapped aesthetic; Provider levels are Netflix, Hulu, Prime Video, so colors follow that order
ggplot(grouped, aes(x = Provider, y = count, fill = Provider)) + geom_bar(stat = "identity", position = "dodge") + facet_wrap(~RatingCategory) + labs( x = "Providers", y = "Number of TV Shows", title = "Count of Shows by Rating Category") + scale_fill_manual(values=c("#E50914","#66aa33", "#146eb4")) + geom_text(aes(label = count),position = position_dodge(width=0.9),vjust=-0.25)+
theme(plot.title = element_text(hjust = 0.5))
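The code comment describes the cutoffs as an average of each provider’s first and third quartiles. The exact calculation behind 45.08 and 64 is not shown, so the following is only one plausible reading, sketched on made-up ratings (the `toy` data frame is hypothetical): average the per-provider quartiles, then label each show relative to those two cutoffs.

```r
# Hypothetical Rotten Tomatoes scores, four shows per provider
toy <- data.frame(
  Provider = rep(c("Netflix", "Hulu", "Prime Video"), each = 4),
  Rating = c(30, 50, 70, 95, 40, 60, 64, 80, 20, 45, 55, 90)
)

# First and third quartile of the ratings within each provider
q1 <- tapply(toy$Rating, toy$Provider, quantile, probs = 0.25)
q3 <- tapply(toy$Rating, toy$Provider, quantile, probs = 0.75)

# Assumed rule: average the per-provider quartiles to get shared cutoffs
low_cut <- mean(q1)
high_cut <- mean(q3)

# Same boundary handling as the case_when above: >= high is High, <= low is Low
toy$RatingCategory <- ifelse(
  toy$Rating >= high_cut, "High",
  ifelse(toy$Rating <= low_cut, "Low", "Medium")
)

# Number of shows per provider and category
counts <- table(toy$Provider, toy$RatingCategory)
```

Using one shared pair of cutoffs for all providers, rather than each provider’s own quartiles, is what makes the category counts comparable across providers.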
Ultimately, what we discovered through these plots is that Netflix offers the most TV shows of the three providers (Netflix, Hulu, and Prime Video). At first, when we looked at the average rating for these providers, all three services appeared to average similar scores for Rotten Tomatoes and IMDb, with Hulu potentially having higher ratings on average. However, when we look at a boxplot to better understand this, we see that Netflix has a higher number of highly rated outliers. This realization inspired us to break the ratings down into categories of highly rated, medium-rated, and low-rated shows. We see from this final bar graph that Netflix does have the highest number of highly rated shows. Thus, we can agree with the article. Overall, it was very interesting to observe how TV streaming services compared. It will be interesting to follow this data in the coming years as these services continue to compete and grow.
Checking for evidence of Simpson’s paradox. This is the exploratory work that led me to the plots I used in the main analysis:
# turn age into a factor
long2$Age <- as.factor(long2$Age)
# boxplots of scores by provider, one panel per age group
plot2 <- ggplot(long2, aes(x = Provider, y = Rating, fill = Provider, label = Title)) + geom_boxplot() + labs( x = "Provider", y = "Rating", title = "Scores by Provider") +
scale_fill_manual(values=c("#E50914","#66aa33", "#146eb4")) + facet_wrap(~Age, scales = "free") + stat_summary(fun=mean, geom="point", shape=20, size=3, fill = "black")+ theme(plot.title = element_text(hjust = 0.5))
ggplotly(plot2)
# turn age into numeric
# note: as.numeric() on a factor returns the integer level codes, not the ages in the labels
long2$Age <- as.numeric(long2$Age)
# collapse year into 1900s and 2000s
long2 <- long2 %>% mutate(YearGroup = case_when(Year >= 2000~ "2000s",
Year < 2000 ~ "1900s"))
# turn collapsed variable into a factor
long2$YearGroup <- as.factor(long2$YearGroup)
# boxplots of scores by provider, one panel per year group
plot3 <- ggplot(long2, aes(x = Provider, y = Rating, fill = Provider, label = Title)) + geom_boxplot() + labs( x = "Provider", y = "Rating", title = "Scores by Provider") +
scale_fill_manual(values=c("#E50914","#66aa33", "#146eb4")) + facet_wrap(~YearGroup, scales = "free") + stat_summary(fun=mean, geom="point", shape=20, size=3, fill = "black")+ theme(plot.title = element_text(hjust = 0.5))
ggplotly(plot3)
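Simpson’s paradox would show up here as a provider ranking that holds overall but reverses inside every subgroup. One way to check this numerically, sketched with made-up ratings (the `toy` data frame and its values are hypothetical and chosen so that the reversal occurs), is to compare the overall provider means with the means inside each year group.

```r
# Hypothetical ratings constructed so the overall ranking flips within groups
toy <- data.frame(
  Provider = c("Netflix", "Netflix", "Netflix", "Netflix",
               "Hulu", "Hulu", "Hulu", "Hulu"),
  YearGroup = c("1900s", "2000s", "2000s", "2000s",
                "1900s", "1900s", "1900s", "2000s"),
  Rating = c(50, 80, 82, 84, 55, 57, 59, 90)
)

# Mean rating per provider, ignoring year group
overall <- tapply(toy$Rating, toy$Provider, mean)

# Mean rating per year group (rows) and provider (columns)
by_group <- tapply(toy$Rating, list(toy$YearGroup, toy$Provider), mean)

# Provider with the highest mean overall and within each year group
overall_leader <- names(which.max(overall))
group_leaders <- colnames(by_group)[apply(by_group, 1, which.max)]

# TRUE when the overall leader does not lead in any subgroup
reversal <- all(group_leaders != overall_leader)
```

In this toy data Netflix leads overall only because most of its shows fall in the higher-rated year group, while Hulu leads inside both groups. Running the same comparison on `long2` by `Age` or `YearGroup` would give a numeric counterpart to the faceted boxplots.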