As someone who loves streaming services and having plenty of TV shows to choose from, I often run into a common problem: I barely use the services I pay for each month because, when it’s time to decide, I have too many options and can’t make a choice. There are now so many streaming services that I may not even know what’s available to watch or where I could cut costs. The data set I have chosen is from Kaggle and lists TV shows available on Netflix, Hulu, Disney+ and Prime Video (https://www.kaggle.com/ruchi798/tv-shows-on-netflix-prime-video-hulu-and-disney). I did some initial clean-up of the services columns because I subscribe to all four of these services and the layout of the raw file was not conducive to my purposes.
The file includes the title of the TV show, year it was first released (many of the shows ran for multiple years so this is just the year it began), age demographic the show is intended for, IMDb rating (scale of 1-10), Rotten Tomatoes rating (percent), the streaming service the TV show is available on, and type which is listed as ‘1’ for every row (this appears to be how the data collector indicated these titles as TV shows versus something else like a movie).
colnames(df_TVShows)
## [1] "Title" "Year" "Age"
## [4] "IMDb" "Rotten.Tomatoes" "Streaming.Service"
## [7] "type"
head(df_TVShows)
## Title Year Age IMDb Rotten.Tomatoes Streaming.Service type
## 1 Breaking Bad 2008 18+ 9.5 0.96 Netflix 1
## 2 Stranger Things 2016 16+ 8.8 0.93 Netflix 1
## 3 Money Heist 2017 18+ 8.4 0.91 Netflix 1
## 4 Sherlock 2010 16+ 9.1 0.78 Netflix 1
## 5 Better Call Saul 2015 18+ 8.7 0.97 Netflix 1
## 6 The Office 2005 16+ 8.9 0.81 Netflix 1
In order to not only evaluate what show I may watch, but also which streaming services potentially have the most value to me, I’m going to be answering the following questions through data visualizations:
• Which services have the highest rated TV shows?
• What titles are the highest rated TV shows?
• In which decade were the titles on each streaming service released?
• Out of the highest rated TV shows, which streaming service has the most value at their given price point?
• Which streaming services could I potentially get rid of?
• Which TV shows do I want to watch?
As there are so many titles and factors to consider, I knew I wanted to clean up the data before creating and analyzing the visualizations. Since there are two rating systems, IMDb and Rotten Tomatoes, the two numbers need to be combined. Luckily, IMDb is on a scale of 1-10 and Rotten Tomatoes is a percentage, stored in this file as a proportion between 0 and 1. Seeing that I don’t prefer one rating system over the other, I’m going to simply multiply them to create a “Weighted Rating”, which stays on IMDb’s 10-point scale and will be used for most of the visuals.
I have also narrowed the data set down in R because there are over 5,600 titles between the four streaming services. I’m not likely to consider or want to watch many of these especially if they’re titles that I consider to be “edge” or poorly rated. Therefore, I’ve used the script to eliminate any titles that have a weighted rating below 7.5. In terms of the data analysis I’m going to do to determine what’s available on the different streaming services, their value and what I ultimately want to watch, I only want to consider these “higher rated” shows based on both IMDb and Rotten Tomatoes. There was then also a need to remove some null values that were in the data either due to the information available or no ratings from either source.
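The clean-up script itself is not shown here, so the following is only a minimal sketch of the steps described above, written in base R. The first four rows reuse values from the head() output earlier; the last two rows and the names df_toy and df_rated_toy are hypothetical, added only to exercise the threshold and the missing-value handling.

```r
# Toy stand-in for df_TVShows. The first four rows come from the head() output
# above; the last two rows are hypothetical and exist only to test the filters.
df_toy <- data.frame(
  Title = c("Breaking Bad", "Stranger Things", "Money Heist", "Sherlock",
            "Hypothetical Low-Rated Show", "Hypothetical Unrated Show"),
  IMDb = c(9.5, 8.8, 8.4, 9.1, 6.0, NA),
  Rotten.Tomatoes = c(0.96, 0.93, 0.91, 0.78, 0.50, 0.80),
  stringsAsFactors = FALSE
)

# Weighted rating: IMDb (1-10) multiplied by Rotten Tomatoes (stored as 0-1).
df_toy$WeightedRating <- df_toy$IMDb * df_toy$Rotten.Tomatoes

# Drop rows with a missing rating, keep 7.5 and above, and sort best first.
df_rated_toy <- na.omit(df_toy)
df_rated_toy <- df_rated_toy[df_rated_toy$WeightedRating >= 7.5, ]
df_rated_toy <- df_rated_toy[order(-df_rated_toy$WeightedRating), ]
df_rated_toy
```

Sorting in descending order matters for the later charts, since taking the first 25 rows with head() only yields the top 25 if the data frame is already ordered by weighted rating.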
As you can see below, the main columns in the data set “Rated TV Shows” after clean-up are still:
• Title
• Year
• Age
• IMDb Rating
• Rotten Tomatoes Rating
• Streaming Service
and now the additional column of “Weighted Rating” has been created from both the IMDb rating and Rotten Tomatoes rating.
colnames(df_RatedTVShows)
## [1] "Title" "Year" "Age"
## [4] "IMDb" "Rotten.Tomatoes" "Streaming.Service"
## [7] "WeightedRating"
str(df_RatedTVShows)
## 'data.frame': 228 obs. of 7 variables:
## $ Title : chr "Avatar: The Last Airbender" "Breaking Bad" "Fullmetal Alchemist: Brotherhood" "The Planets" ...
## $ Year : int 2005 2008 2009 2019 2019 2017 2012 2001 2015 1969 ...
## $ Age : chr "7+" "18+" "18+" "all" ...
## $ IMDb : num 9.2 9.5 9.1 9.1 9.1 9.1 8.9 9.4 8.8 8.8 ...
## $ Rotten.Tomatoes : num 1 0.96 1 1 1 0.98 1 0.94 1 1 ...
## $ Streaming.Service: chr "Netflix" "Netflix" "Netflix & Hulu" "Prime Video" ...
## $ WeightedRating : num 9.2 9.12 9.1 9.1 9.1 ...
## - attr(*, "na.action")= 'omit' Named int [1:4603] 86 94 97 120 129 133 152 160 178 184 ...
## ..- attr(*, "names")= chr [1:4603] "NA" "NA.1" "NA.2" "NA.3" ...
This narrows the data down to the top 228 shows (based on a weighted rating of 7.5 or above) across all four platforms and any combination of shows on multiple platforms.
length(unique(df_RatedTVShows$Title))
## [1] 228
As we seek to analyze the value of each streaming service, the first question we will attempt to answer is “Which services have the highest rated TV shows?” Since we’ve already cleaned up the data to include only higher rated titles, the sheer volume of titles in the original data set that are not quality contenders I would want to watch is no longer a concern. In the pie chart below, the top 228 rated titles are separated by the streaming service they’re available on, including titles that are available on multiple platforms. As you can see, Netflix and Hulu have the highest percentage of titles on this list, followed by Prime Video and titles that are on both Netflix and Hulu.
TopRatedServices <- count(df_RatedTVShows, Streaming.Service)
Percent_TopRatedServices <- TopRatedServices[TopRatedServices$Streaming.Service %in% c("Disney+", "Hulu", "Hulu & Disney+", "Hulu & Prime Video", "Netflix", "Netflix & Hulu", "Netflix & Prime Video", "Netflix, Hulu & Prime Video", "Prime Video"),"n"] / sum(TopRatedServices$n)
Percent_TopRatedServices <- round(100*Percent_TopRatedServices,1)
TopRatedServices$Percent <- Percent_TopRatedServices
TopRatedServices$Streaming.Service = factor(TopRatedServices$Streaming.Service, levels=c("Netflix","Hulu", "Prime Video", "Netflix & Hulu", "Hulu & Prime Video", "Netflix & Prime Video", "Netflix, Hulu & Prime Video", "Disney+", "Hulu & Disney+"))
ggplot(data = TopRatedServices, aes(x = "", y = n, fill = Streaming.Service)) +
geom_bar(stat = "identity", position = "fill") +
coord_polar(theta = "y", start = 0) +
labs(fill = "Streaming Services", x= NULL, y= NULL, title = "Highest Rated Shows by Streaming Service",
caption = "Show ratings are determined by a weighted rating of IMDb scores X Rotten Tomatoes
scores where the Weighted Rating is greater than or equal to 7.5
Slices under 2% are not labeled") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank()) +
scale_fill_manual(values=c("#8C96C6","#9EBCDA","#BFD3E6", "#E0ECF4","#F7FCFD","#8C6BB1", "#88419D", "#810F7C", "#4D004B")) +
geom_text(aes(x=1.55, label=ifelse(Percent>2,paste0(Percent, "%"),"")),
size=3,
position=position_fill(vjust = 0.5))
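The percentage labels in the chart above come down to a share-of-total calculation plus a rule that hides small slices. A minimal base R sketch of that logic follows; the counts in toy_counts are hypothetical and are not the real counts from the data set.

```r
# Hypothetical counts per service (NOT the real counts from df_RatedTVShows).
toy_counts <- data.frame(
  Streaming.Service = c("Netflix", "Hulu", "Prime Video", "Disney+"),
  n = c(60, 25, 14, 1),
  stringsAsFactors = FALSE
)

# Share of the total, as a percentage rounded to one decimal place.
toy_counts$Percent <- round(100 * toy_counts$n / sum(toy_counts$n), 1)

# Only slices above 2% get a label; smaller slices get an empty string.
toy_counts$Label <- ifelse(toy_counts$Percent > 2,
                           paste0(toy_counts$Percent, "%"), "")
toy_counts
```

Dividing the whole n column by its sum gives the same result as selecting every service by name first, as long as every row is meant to be included.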
Continuing our analysis, seeing that the highest rated TV shows are on Netflix and Hulu raises the question, “What titles are the highest rated TV shows?” regardless of the service. Since there are 228 titles in our data set, I’ve narrowed this visual down to just the top 25 titles, seeing that it could take a while to get through even a few titles depending on how long they were on the air. The top two titles, Avatar and Breaking Bad, are exclusively on Netflix (two titles I’ve watched in part but not completed in their entirety). The third highest rated title is The Planets on Prime Video (a title I’ve not heard of yet). The fourth is The Imagineering Story on Disney+ and the fifth is Fullmetal Alchemist, which is available on both Netflix and Hulu. Just looking at the colors of the bars, Netflix and Hulu prove to be very prevalent in the top 25 titles, just as the pie chart indicated.
df_Top25Shows <- head(df_RatedTVShows,25)
ggplot(df_Top25Shows, aes(x= WeightedRating, y= reorder(Title, WeightedRating), fill = Streaming.Service)) +
geom_bar(stat = "identity") +
scale_fill_manual(values=c("#810F7C","#9EBCDA", "#4D004B", "#8C96C6","#E0ECF4","#88419D","#8C6BB1", "#BFD3E6")) +
labs(title= "Top 25 Rated Shows by Streaming Service", x = "Weighted Rating", y = "Title", fill = "Streaming Service")
This chart was a bit of a wildcard, born of general curiosity about the types of shows available on the different streaming services. As much as I enjoy binge watching modern TV shows, I also enjoy shows from various decades. However, the focus of this analysis is on the quality of shows available on each service, so decade could indicate which services aren’t hosting only older shows and which aren’t hosting only brand new shows either. This chart is a breakdown of how many TV shows from each decade are available on each platform. At the very least, this will give me a better idea of the modernity of the shows I might have to choose from on each service. By and large, the TV shows available were released in the 2010s, followed by the 2000s, with a few on Netflix from just the past year (the 2020s decade).
df_TitleCountbyYearandDecade <- df_RatedTVShows %>%
select(Title, Year, Streaming.Service)
df_TitleCountbyYearandDecade$Decade <- floor(df_RatedTVShows$Year/10)*10
df_TitleCountbyYearandDecade <- df_TitleCountbyYearandDecade %>%
group_by(Streaming.Service, Decade)
df_TitleCountbyYearandDecade2 <- df_TitleCountbyYearandDecade %>%
select(Streaming.Service, Decade)
df_TitleCountbyYearandDecadeSUMMARY <- df_TitleCountbyYearandDecade2 %>%
group_by(Streaming.Service, Decade) %>%
mutate(count = n())
df_TitleCountbyYearandDecadeSUMMARY <- df_TitleCountbyYearandDecadeSUMMARY[!duplicated(df_TitleCountbyYearandDecadeSUMMARY), ]
StackedBar <- ggplot(df_TitleCountbyYearandDecadeSUMMARY, aes(x = Decade, y = count, fill = Streaming.Service)) +
geom_bar(stat = "identity") +
geom_text(size = 2, color = "white", position= position_stack(vjust = 0.5),
aes(x = Decade, y = count, label = count)) +
scale_fill_manual(values=c("#810F7C","#9EBCDA","#4D004B","#F7FCFD", "#8C96C6","#E0ECF4","#8C6BB1", "#88419D","#BFD3E6")) +
labs(title = "Number of Titles by Decade by Streaming Service", x = "Decade", y = "Number of Titles", fill = "Streaming Service") +
theme(plot.title = element_text(hjust = 0.5))
StackedBar <- StackedBar + scale_x_continuous(labels = df_TitleCountbyYearandDecadeSUMMARY$Decade, breaks = df_TitleCountbyYearandDecadeSUMMARY$Decade)
StackedBar
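The decade binning and the per-service tally behind the stacked bar chart can be illustrated on a few rows. This is a minimal base R sketch, not the script used above: the four rows are taken from the str() output earlier, and the names toy and decade_counts are hypothetical.

```r
# Four rows taken from the str() output of df_RatedTVShows shown earlier.
toy <- data.frame(
  Title = c("Avatar: The Last Airbender", "Breaking Bad",
            "Fullmetal Alchemist: Brotherhood", "The Planets"),
  Year = c(2005, 2008, 2009, 2019),
  Streaming.Service = c("Netflix", "Netflix", "Netflix & Hulu", "Prime Video"),
  stringsAsFactors = FALSE
)

# Integer-divide the year by 10 and scale back up: 2008 -> 200.8 -> 200 -> 2000.
toy$Decade <- floor(toy$Year / 10) * 10

# One row per service/decade pair with the number of titles in that pair.
decade_counts <- aggregate(Title ~ Streaming.Service + Decade,
                           data = toy, FUN = length)
names(decade_counts)[names(decade_counts) == "Title"] <- "count"
decade_counts
```

Counting once per group in this way produces one summary row per service and decade directly, so no duplicate rows need to be removed afterwards.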
Out of pure curiosity, I also wanted to see which of the highest rated titles were released just this past year. The seven titles are listed below; some I have already seen and some I have not. This information might help me decide what to watch if I’m looking for something new and well-liked. All of these newer but highly rated shows are available on Netflix (one of the higher cost platforms, as you’ll see in the next visualization).
df_2020Titles <- df_TitleCountbyYearandDecade[(df_TitleCountbyYearandDecade$Decade >= 2020),]
df_2020Titles
## # A tibble: 7 x 4
## # Groups: Streaming.Service, Decade [1]
## Title Year Streaming.Service Decade
## <chr> <int> <chr> <dbl>
## 1 Middleditch & Schwartz 2020 Netflix 2020
## 2 The Innocence Files 2020 Netflix 2020
## 3 Cheer 2020 Netflix 2020
## 4 Never Have I Ever 2020 Netflix 2020
## 5 Unorthodox 2020 Netflix 2020
## 6 The Midnight Gospel 2020 Netflix 2020
## 7 Feel Good 2020 Netflix 2020
The next question we seek to answer is “Which streaming service has the greatest value at its given price point?” In this case, we will judge value based on the total number of shows available on the platform that have a weighted rating of 7.5 or above. This is shown by the bar chart using the main X and Y axes. A line has then been added to the chart showing the yearly price of each service on the X axis, read against the second Y axis; the price is based on what I’m currently paying, rounded up to the next whole dollar. For titles that are on multiple services, the chart shows the lowest possible yearly cost among the platforms the title is on. Based on this chart, the platforms whose number of highly rated TV show titles is most in line with their yearly cost prove to be Netflix and Hulu. Netflix has the highest yearly cost among the platforms, but also the largest number of highly rated titles. While Hulu has fewer highly rated titles than Netflix, it has a much lower yearly cost too.
df_RatedTVShowsCountbyService <- df_RatedTVShows %>%
select(Title, Streaming.Service)
df_RatedTVShowsCountbyService <- df_RatedTVShowsCountbyService %>%
group_by(Streaming.Service) %>%
mutate(count = n())
df_RatedTVShowsCountbyService <- df_RatedTVShowsCountbyService %>%
select(Streaming.Service, count)
df_RatedTVShowsCountbyService <- df_RatedTVShowsCountbyService[!duplicated(df_RatedTVShowsCountbyService), ]
df_RatedTVShowsCountbyService <- df_RatedTVShowsCountbyService %>%
# case_when() uses the first condition that is TRUE, so each label is matched
# exactly; with endsWith(), "Hulu & Disney+" would hit the "Disney+" rule (84)
# and the "... & Prime Video" labels would hit "Prime Video" (108) instead of 72.
mutate(LowestYearlyCost = case_when(
Streaming.Service == "Netflix" ~ "168",
Streaming.Service == "Hulu" ~ "72",
Streaming.Service == "Disney+" ~ "84",
Streaming.Service == "Prime Video" ~ "108",
Streaming.Service == "Netflix & Hulu" ~ "72",
Streaming.Service == "Hulu & Disney+" ~ "72",
Streaming.Service == "Hulu & Prime Video" ~ "72",
Streaming.Service == "Netflix & Prime Video" ~ "108",
Streaming.Service == "Netflix, Hulu & Prime Video" ~ "72",
TRUE ~ NA_character_))
df_RatedTVShowsCountbyService <- transform(df_RatedTVShowsCountbyService, LowestYearlyCost = as.numeric(LowestYearlyCost))
ggplot(df_RatedTVShowsCountbyService, aes(x = Streaming.Service, y = count, fill = Streaming.Service)) +
geom_bar(stat = "identity") +
scale_fill_manual(values=c("#810F7C","#9EBCDA","#4D004B","#F7FCFD", "#8C96C6","#E0ECF4","#8C6BB1", "#88419D","#BFD3E6")) +
labs(title = "Number of Titles by Streaming Service w/ Cost", x = "Streaming Service", y = "Number of Titles", fill = "Streaming Service") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(axis.text.x = element_text(angle = 90)) +
theme(legend.position = "none") +
geom_line(inherit.aes = FALSE, data = df_RatedTVShowsCountbyService,
aes(x = Streaming.Service, y = LowestYearlyCost, colour = "Lowest Yearly Cost", group=1), size=1) +
scale_color_manual(NULL, values = "grey") +
scale_y_continuous(sec.axis = sec_axis(~., labels=scales::dollar_format(), name = "Cost per Year")) +
geom_point(inherit.aes = FALSE, data = df_RatedTVShowsCountbyService,
aes(x = Streaming.Service, y = LowestYearlyCost, colour = "Lowest Yearly Cost", group=1),
size = 2, shape = 21, fill = "white", color = "black")
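The “lowest possible yearly cost” rule for titles on several platforms can also be derived from the four single-service prices instead of being listed per combination. The following base R sketch assumes the yearly prices used in the code above (168, 72, 84 and 108) and that combined labels are separated by commas or ampersands; the names yearly_cost and lowest_yearly_cost are hypothetical.

```r
# Yearly prices per single service, as used in the chart code above.
yearly_cost <- c("Netflix" = 168, "Hulu" = 72, "Disney+" = 84, "Prime Video" = 108)

# Hypothetical helper: split a label such as "Netflix, Hulu & Prime Video"
# into its services and return the cheapest yearly price among them.
lowest_yearly_cost <- function(label) {
  services <- trimws(strsplit(label, ",|&")[[1]])
  min(yearly_cost[services])
}

labels <- c("Netflix", "Hulu & Disney+", "Netflix & Prime Video",
            "Netflix, Hulu & Prime Video")
costs <- vapply(labels, lowest_yearly_cost, numeric(1))
costs
```

With this approach a new combination of services needs no extra rule, and a label containing an unknown service yields NA rather than a silently wrong price.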
Given that Netflix and Hulu are providing the most value at their yearly price point, this helps to narrow in on a TV show to watch. Given that I may be looking to stop services I don’t find value in anymore, I would not want to start watching a show on a service that I might be looking to quit. That being said, below are the top 25 shows that are available on Hulu and Netflix or any combination of platforms that include these two services.
Not_Hulu_or_Netflix <- c("Prime Video", "Disney+")
Disney_or_Prime_Only <- which(df_RatedTVShows$Streaming.Service %in% Not_Hulu_or_Netflix)
df_NetflixHulu <- df_RatedTVShows[-Disney_or_Prime_Only,]
df_Top25NetflixHulu <- head(df_NetflixHulu, 25)
ggplot(df_Top25NetflixHulu, aes(x= WeightedRating, y= reorder(Title, WeightedRating), fill = Streaming.Service)) +
geom_bar(stat = "identity") +
scale_fill_manual(values=c("#9EBCDA","#4D004B","#F7FCFD","#8C96C6","#E0ECF4","#88419D")) +
labs(title= "Top 25 Rated Shows on Hulu or Netflix", x = "Weighted Rating", y = "Title", fill = "Streaming Service")
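The filter above works by listing the labels to exclude. An alternative is to keep every row whose label mentions a wanted service, which can be sketched in base R as follows. The rows come from the str() output earlier, and the helper name on_any_service is hypothetical.

```r
# Four rows taken from the str() output of df_RatedTVShows shown earlier.
toy <- data.frame(
  Title = c("Avatar: The Last Airbender", "Breaking Bad",
            "Fullmetal Alchemist: Brotherhood", "The Planets"),
  Streaming.Service = c("Netflix", "Netflix", "Netflix & Hulu", "Prime Video"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose label mentions any of the given services.
# fixed = TRUE treats the name literally, so the "+" in "Disney+" is not a regex.
on_any_service <- function(df, services) {
  matches <- lapply(services, function(s) {
    grepl(s, df$Streaming.Service, fixed = TRUE)
  })
  keep <- Reduce(`|`, matches)
  df[keep, , drop = FALSE]
}

toy_netflix_or_hulu <- on_any_service(toy, c("Netflix", "Hulu"))
toy_hulu <- on_any_service(toy, "Hulu")
toy_netflix_or_hulu
```

The same helper would cover a single service as well as a pair. Logical indexing also sidesteps a pitfall of negative indexing with which(): if nothing matches the exclusion list, df[-which(...), ] returns zero rows rather than all of them.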
Primarily out of curiosity, I wanted to continue to break this down further into just the top rated titles on Netflix versus just the top rated titles on Hulu. This is mainly due to occasionally just wanting to use one service versus the other to make sure I’m “getting my money’s worth” out of the service. This might be an additional data point/list I would want to have as I’m making my watching decisions or looking to offload unnecessary costs. The chart below shows the top 25 rated titles on Netflix or a combination of Netflix and any other services.
Not_Netflix <- c("Prime Video", "Disney+", "Hulu", "Hulu & Disney+", "Hulu & Prime Video")
Not_Netflix_Platforms <- which (df_RatedTVShows$Streaming.Service %in% Not_Netflix)
df_NetflixOnly <- df_RatedTVShows[-Not_Netflix_Platforms,]
df_Top25Netflix <- head(df_NetflixOnly, 25)
ggplot(df_Top25Netflix, aes(x= WeightedRating, y= reorder(Title, WeightedRating), fill = Streaming.Service)) +
geom_bar(stat = "identity") +
scale_fill_manual(values=c("#8C96C6","#E0ECF4","#88419D")) +
labs(title= "Top 25 Rated Shows on Netflix", x = "Weighted Rating", y = "Title", fill = "Streaming Service") +
theme(axis.text=element_text(size=5))
Similar to the previous tab, “Top 25 Titles on Netflix”, this chart shows titles that are on Hulu alone or on a combination of services that includes Hulu. Sometimes, for whatever reason, I may be in the mood to just watch Hulu over Netflix. Beyond some level of analysis on potentially dropping certain services, these two charts of just Netflix or just Hulu would allow me to narrow things down further if I ever got to the point where I wanted to maintain the cost of just a single streaming service.
Not_Hulu <- c("Prime Video", "Disney+", "Netflix", "Netflix & Prime Video")
Not_Hulu_Platforms <- which (df_RatedTVShows$Streaming.Service %in% Not_Hulu)
df_HuluOnly <- df_RatedTVShows[-Not_Hulu_Platforms,]
df_Top25Hulu <- head(df_HuluOnly, 25)
ggplot(df_Top25Hulu, aes(x= WeightedRating, y= reorder(Title, WeightedRating), fill = Streaming.Service)) +
geom_bar(stat = "identity") +
scale_fill_manual(values=c("#9EBCDA","#4D004B","#F7FCFD","#8C96C6","#88419D")) +
labs(title= "Top 25 Rated Shows on Hulu", x = "Weighted Rating", y = "Title", fill = "Streaming Service") +
theme(axis.text=element_text(size=5))
In conclusion, this analysis shows that when it comes to the highest rated TV shows (a weighted rating of IMDb and Rotten Tomatoes scores of 7.5 or above), Netflix and Hulu present the best titles by volume. While Hulu is considerably cheaper on a monthly and yearly basis than Netflix, Netflix outshines the other streaming services in the sheer number of highly rated TV shows, which would potentially make it worth the cost. Therefore, if I were looking to eliminate some monthly streaming costs, I could likely drop Disney+ and Prime Video without too much impact on the number of shows I could choose from.
When it comes to the ultimate question explored through this data, I can look at this last set of charts of the Top 25 Highest Rated Titles on Netflix and Hulu and begin working my way down the list. I could also start working my way down the list of shows on just Netflix versus shows on just Hulu. Some of the titles on these lists are ones I have heard about, such as Avatar, Breaking Bad, Middleditch & Schwartz, and Rick and Morty, and others I have not the slightest clue about. Currently, the most promising title based on my interests and no additional research is “Chef’s Table”, so I may start there. At the very least, this analysis gives me a ranked list of shows to explore the next time I’m stuck in indecision about what to binge watch next.