library(reticulate)
use_python("/Users/andrew/opt/anaconda3/bin/python")
Video streaming services offer a vast number of movies, television shows, and other audio-visual titles. A growing share of what people watch comes through these services as more users flock to them. As the number of services also grows, it is becoming increasingly difficult to decide which service or services to subscribe to. This analysis looks at four of the major services in the “streaming wars” to compare and contrast their offerings: Disney+, Hulu, Netflix, and Prime Video.
The data came from Kaggle. It is nine months old at the time of analysis (February 2021), so it is fairly out of date, but the general content on each service shouldn’t have changed that drastically since then.
There are 16744 rows in the data with the following columns:
Column Name | Description |
---|---|
Title | The title of media |
Year | The release year of the title |
Age | The recommended age for the title |
IMDb | The IMDb score for the title |
Rotten Tomatoes | The Rotten Tomatoes score for the title |
Netflix, Hulu, Prime Video, Disney+ | Four columns of a one-hot style coding indicating if the title is available on the service |
Directors | The director(s) of the title |
Genres | The genre(s) of the title |
Country | The country or countries where the title was filmed |
Language | The language or languages used in the title |
Runtime | The runtime in minutes of the title |
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
path = "/Users/andrew/DS736/py_datafiles/MoviesOnStreamingPlatforms_updated.csv"
movies_df = pd.read_csv(path, index_col="index")
# Drop the two columns that are not used in the analysis
movies_df = movies_df.drop(columns=["ID", "Type"])
movies_df[["Year","IMDb", "Runtime"]].describe()
## Year IMDb Runtime
## count 16744.000000 16173.000000 16152.000000
## mean 2003.014035 5.902751 93.413447
## std 20.674321 1.347867 28.219222
## min 1902.000000 0.000000 1.000000
## 25% 2000.000000 5.100000 82.000000
## 50% 2012.000000 6.100000 92.000000
## 75% 2016.000000 6.900000 104.000000
## max 2020.000000 9.300000 1256.000000
The numeric columns are all in pretty good shape. Year has an entry for every row and is bounded by 1902 and 2020. IMDb ratings and Runtime are also present for the large majority of rows. IMDb ratings range from 0 to 9.3. Runtime has a reasonable minimum of 1 minute, but the maximum runtime is 1256 minutes (almost 21 hours). That might be an issue we’ll have to deal with.
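One way to look at the suspicious runtimes is to list the rows above a cutoff. The sketch below is only an illustration: `toy_df` is a made-up stand-in for `movies_df`, and the 300-minute threshold (`RUNTIME_CUTOFF`) is an arbitrary assumption, not something taken from the data.

```python
import pandas as pd

# Toy stand-in for movies_df; titles and runtimes are invented for illustration.
toy_df = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Runtime": [92.0, 1256.0, None, 104.0],
})

# Hypothetical cutoff: anything longer than five hours is treated as suspect.
RUNTIME_CUTOFF = 300

# Comparisons against NaN are False, so rows with a missing runtime are not flagged.
suspect = toy_df[toy_df["Runtime"] > RUNTIME_CUTOFF]
print(suspect)
```

Whether such rows should be dropped, capped, or kept depends on what they turn out to be, so this only surfaces them for inspection.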
movies_df[["Netflix", "Hulu", "Prime Video", "Disney+"]].describe()
## Netflix Hulu Prime Video Disney+
## count 16744.000000 16744.000000 16744.000000 16744.000000
## mean 0.212613 0.053930 0.737817 0.033684
## std 0.409169 0.225886 0.439835 0.180419
## min 0.000000 0.000000 0.000000 0.000000
## 25% 0.000000 0.000000 0.000000 0.000000
## 50% 0.000000 0.000000 1.000000 0.000000
## 75% 0.000000 0.000000 1.000000 0.000000
## max 1.000000 1.000000 1.000000 1.000000
There’s a value in every row for the service columns, and they are either one or zero. Good to go here!
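Because the service columns are 0/1 flags, column sums give the number of titles per service (which is also why each column mean above equals that service's share of all titles), and row sums show how many services carry a given title. A minimal sketch on invented data, where `toy_df` is a hypothetical stand-in for `movies_df`:

```python
import pandas as pd

# Invented titles and flags; only the column names mirror the real data.
toy_df = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Netflix": [1, 0, 1],
    "Hulu": [0, 0, 1],
    "Prime Video": [0, 1, 1],
    "Disney+": [0, 0, 0],
})
service_cols = ["Netflix", "Hulu", "Prime Video", "Disney+"]

# Column sums: number of titles available on each service.
titles_per_service = toy_df[service_cols].sum()

# Row sums: number of services each title appears on.
services_per_title = toy_df[service_cols].sum(axis=1)
shared_titles = int((services_per_title > 1).sum())

print(titles_per_service)
print("Titles on more than one service:", shared_titles)
```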
movies_df["Rotten Tomatoes"].value_counts()
## 100% 407
## 80% 162
## 50% 136
## 83% 131
## 67% 126
## ...
## 5% 10
## 7% 10
## 4% 9
## 2% 4
## 3% 4
## Name: Rotten Tomatoes, Length: 99, dtype: int64
movies_df["Rotten Tomatoes"].isna().sum()
## 11586
The Rotten Tomatoes scores look reasonable. However, 11,586 of the 16,744 rows (about 69%) are missing. That’s probably too many to make any use of this column, especially given that IMDb scores are present for most titles.
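If the column were kept, the scores would first need converting from strings such as `100%` to numbers. The following is a small sketch of one way to do that and to measure the missing share; the series `rt` is invented sample data, not the real column.

```python
import pandas as pd

# Invented sample in the same "NN%" string format as the Rotten Tomatoes column.
rt = pd.Series(["100%", "80%", None, "5%"], name="Rotten Tomatoes")

# Strip the trailing percent sign and convert; missing entries become NaN.
rt_numeric = pd.to_numeric(rt.str.rstrip("%"), errors="coerce")

# The mean of a boolean mask is the fraction of True values.
missing_share = rt_numeric.isna().mean()

print(rt_numeric.tolist())
print(missing_share)
```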
movies_df.Age.value_counts()
## 18+ 3474
## 7+ 1462
## 13+ 1255
## all 843
## 16+ 320
## Name: Age, dtype: int64
movies_df.Age.isna().sum()
## 9390
The Age categories all make sense, although 9,390 rows (about 56%) are missing a rating.
movies_df.Genres.value_counts()
## Drama 1341
## Documentary 1229
## Comedy 1040
## Comedy,Drama 446
## Horror 436
## ...
## Adventure,Drama,Fantasy,Music 1
## Adventure,Comedy,Family,Romance 1
## Action,Crime,Horror,Mystery,Sport 1
## Action,Adventure,Comedy,Fantasy,Horror,Mystery,Thriller 1
## Documentary,Biography,Comedy,Family 1
## Name: Genres, Length: 1909, dtype: int64
movies_df.Genres.isna().sum()
## 275
There are a lot of genres, and many titles belong to several of them. I’ll probably have to break these up to make them more useful.
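One possible way to break the combined genre strings apart is to split on the comma and give each genre its own row. This is a sketch on an invented series (`genres`), not necessarily the approach taken later in the analysis.

```python
import pandas as pd

# Invented sample in the same comma-separated format as the Genres column.
genres = pd.Series(["Drama", "Comedy,Drama", None, "Action,Comedy"])

# Split each string into a list, then explode so every genre gets its own row.
genre_counts = genres.dropna().str.split(",").explode().value_counts()

print(genre_counts)
```

A title with several genres is counted once per genre, so these counts add up to more than the number of titles.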
movies_df.Country.value_counts()
## United States 8776
## India 1064
## United Kingdom 905
## Canada 555
## Australia 202
## ...
## India,Sweden 1
## United States,Nicaragua 1
## France,Israel 1
## Italy,Switzerland,Albania,Poland 1
## Argentina,Uruguay,Serbia 1
## Name: Country, Length: 1303, dtype: int64
movies_df.Country.isna().sum()
## 435
Most of these titles were filmed in the US. India is a distant second, which suggests that the services carry a fair number of Bollywood films.
movies_df.Language.value_counts()
## English 10955
## Hindi 503
## English,Spanish 276
## Spanish 267
## English,French 174
## ...
## English,Malay 1
## English,Malay,Hokkien 1
## Haitian,English,French 1
## Mandarin,Min Nan,Cantonese 1
## Hindi,Gujarati,English 1
## Name: Language, Length: 1102, dtype: int64
movies_df.Language.isna().sum()
## 599
The overwhelming majority of titles feature English. This makes sense for US-based streaming services. Things look good here.
I skipped over Title and Directors because there are so many different values in those columns. I do not plan to analyze or visualize them, so there isn’t a need to check for NaNs and empty values.
Below are visualizations showing differences and similarities between the different streaming services. Depending on consumer preferences they may help to decide which service(s) to subscribe to.
One thing to look at in the different services is the age ratings for the titles available. Is one service more “adult” than the others? Perhaps one service is better for small kids and family?
movies_df["Age"] = movies_df["Age"].fillna("No Age")
my_age_order = ["all", "7+", "13+", "16+", "No Age", "18+"]
movies_df.Age = pd.Categorical(movies_df.Age, categories=my_age_order, ordered=True)
movies_df.sort_values(by="Age", inplace=True)
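# .copy() makes each subset an independent frame, so adding a column to it
# afterwards does not trigger pandas' SettingWithCopyWarning
prime_df = movies_df[movies_df["Prime Video"] == 1].copy()
hulu_df = movies_df[movies_df["Hulu"] == 1].copy()
netflix_df = movies_df[movies_df["Netflix"] == 1].copy()
disney_df = movies_df[movies_df["Disney+"] == 1].copy()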
prime_df["Service"] = "Prime Video"
netflix_df["Service"] = "Netflix"
hulu_df["Service"] = "Hulu"
disney_df["Service"] = "Disney+"
fig = plt.figure(figsize=(18,16))
nf_ax = fig.add_subplot(2,2,1)
di_ax = fig.add_subplot(2,2,2)
pr_ax = fig.add_subplot(2,2,3)
hu_ax = fig.add_subplot(2,2,4)
num_ages = len(movies_df.Age.unique())
colormap = plt.get_cmap("Set2")
my_colors = colormap(np.arange(num_ages))
netflix_df.groupby(["Age"])["Title"].count().plot(
kind="pie", radius=1, ax=nf_ax, colors=my_colors,
pctdistance=1.15, labels=None,
wedgeprops=dict(edgecolor="w", width=.5), textprops={"fontsize":14},
startangle=90,
autopct= "%1.2f%%"
)
disney_df.groupby(["Age"])["Title"].count().plot(
kind="pie", radius=1, ax=di_ax, colors=my_colors,
pctdistance=1.15, labels=None,
wedgeprops=dict(edgecolor="w", width=.5), textprops={"fontsize":14},
startangle=90,
autopct= "%1.2f%%"
)
prime_df.groupby(["Age"])["Title"].count().plot(
kind="pie", radius=1, ax=pr_ax, colors=my_colors,
pctdistance=1.15, labels=None,
wedgeprops=dict(edgecolor="w", width=.5), textprops={"fontsize":14},
startangle=90,
autopct= "%1.2f%%"
)
hulu_df.groupby(["Age"])["Title"].count().plot(
kind="pie", radius=1, ax=hu_ax, colors=my_colors,
pctdistance=1.15, labels=None,
wedgeprops=dict(edgecolor="w", width=.5), textprops={"fontsize":14},
startangle=90,
autopct= "%1.2f%%",
)
di_ax.set_title("Disney+", fontsize=20, fontweight="bold")
pr_ax.set_title("Prime Video", fontsize=20, fontweight="bold")
nf_ax.set_title("Netflix", fontsize=20, fontweight="bold")
hu_ax.set_title("Hulu", fontsize=20, fontweight="bold")
di_ax.yaxis.set_visible(False)
pr_ax.yaxis.set_visible(False)
nf_ax.yaxis.set_visible(False)
hu_ax.yaxis.set_visible(False)
di_ax.text(0,0, "Total Titles:\n{:,}".format(disney_df.shape[0]), ha="center", va="center", size=18)
nf_ax.text(0,0, "Total Titles:\n{:,}".format(netflix_df.shape[0]), ha="center", va="center", size=18)
hu_ax.text(0,0, "Total Titles:\n{:,}".format(hulu_df.shape[0]), ha="center", va="center", size=18)
pr_ax.text(0,0, "Total Titles:\n{:,}".format(prime_df.shape[0]), ha="center", va="center", size=18)
handles, labels = hu_ax.get_legend_handles_labels()
legend = plt.legend(movies_df.Age.unique(), bbox_to_anchor=(1.15,1.2), fontsize=16, title="Age Rating", fancybox=True)
legend.get_title().set_fontsize(16)
plt.suptitle("Age Ratings for Each Service", fontsize=24, fontweight="bold")
plt.tight_layout(pad=4)
plt.show()
Disney+ is a very different service from the other three. Over 80% of Disney+’s offerings are aimed at all ages or kids aged seven and up. There’s also almost no content on Disney+ that is rated 16+ or 18+. It is very much the family-friendly/kid-friendly service of the four. Disney+ is also the most thorough in its ratings, with only 11.35% of its content having no rating, compared to the other three services, which each have at least three times as much unrated content. The other three services have roughly similar distributions of content for the various ages.
Also of note here is that Prime Video has most of the titles available in this data.
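The percentages shown on the pie charts can also be tabulated directly, which makes the services easier to compare side by side. A minimal sketch, assuming a long-format frame with one row per title and service; `toy_df` and its values are invented for illustration.

```python
import pandas as pd

# Invented rows: one per (service, title) pair with its age rating.
toy_df = pd.DataFrame({
    "Service": ["Disney+", "Disney+", "Disney+", "Disney+", "Netflix", "Netflix"],
    "Age": ["all", "all", "7+", "No Age", "18+", "No Age"],
})

# normalize="index" divides each row by its total, giving per-service shares.
age_share = pd.crosstab(toy_df["Service"], toy_df["Age"], normalize="index") * 100

print(age_share.round(2))
```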
To some users, the release years of the movies on a service may be an important differentiator. Some may want older movies so they can rewatch the classics. Others may want only the newest releases.
fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(1,1,1)
# sns.distplot is deprecated (seaborn 0.11+); sns.kdeplot draws the equivalent
# filled density curve for each service
for service_df, color, label in [
    (disney_df, "blue", "Disney+"),
    (hulu_df, "green", "Hulu"),
    (netflix_df, "red", "Netflix"),
    (prime_df, "orange", "Prime Video"),
]:
    sns.kdeplot(
        x=service_df.Year, ax=ax, color=color, label=label,
        clip=(service_df.Year.min(), service_df.Year.max()), linewidth=3, fill=True,
    )
ax.set_xlabel("Year", fontsize=18, fontweight="bold")
ax.set_ylabel("Density", fontsize=18, fontweight="bold")
plt.xticks(np.arange(movies_df.Year.min(), movies_df.Year.max(), 5), fontsize=14, rotation=60)
plt.yticks(fontsize=14)
plt.title("Density Plots of Titles by Year", fontsize=20, fontweight="bold")
plt.legend(fontsize=16, loc="upper left", fancybox=True)
plt.tight_layout()
plt.show()
The above chart shows the density of the titles over the years. That is, given a year, what proportion of titles from that service were released that year? Netflix and Hulu have mostly recent content. The majority of their titles were produced in the late 2010s, as evidenced by the large peaks on the right of the graph. Netflix is probably dominated by its original content, which is mostly from the last five years. Hulu was created a decade or so ago to re-broadcast over-the-air television shows, so it would likely only have recent titles. Prime Video has the oldest titles in the collection, going back to 1902. Prime also has the most titles of any service, so the long tail isn’t entirely surprising. Disney+ has a fairly even distribution of titles from the last thirty or so years. It also has the thickest tail going back in time. Given that Disney has been producing film, animation, and television since the 1920s, it is able to provide quite a bit of older content. The other three services are producing their own content, but they are all also fairly recent media companies, especially compared to Disney.
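A density plot shows the shape of each distribution, but a single number per service can make "mostly recent" more concrete. The sketch below computes the share of titles released in or after a cutoff year; `toy_df` is invented, and the 2015 cutoff (`RECENT_CUTOFF`) is an arbitrary assumption.

```python
import pandas as pd

# Invented rows: one per (service, title) pair with its release year.
toy_df = pd.DataFrame({
    "Service": ["Netflix"] * 4 + ["Disney+"] * 2,
    "Year": [2018, 2019, 2005, 2016, 1995, 2019],
})

# Hypothetical definition of "recent".
RECENT_CUTOFF = 2015

# The mean of a boolean mask within each group is the share of recent titles.
recent_share = (toy_df["Year"] >= RECENT_CUTOFF).groupby(toy_df["Service"]).mean()

print(recent_share)
```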
The quality of titles on a service is very important to some people. They may want to know that they are getting the top quality for their money. Using the IMDb ratings of the titles, can we see which service has the “best” titles in its collection?
movies_df_cold = pd.concat([hulu_df, netflix_df, prime_df, disney_df], ignore_index=True)
movies_df_cold.shape
fig = plt.figure(figsize=(15,10))
ax = fig.add_subplot(1,1,1)
sns.violinplot(
x="IMDb", y="Service", data=movies_df_cold,
palette=["blue", "green", "red", "orange"],
order=["Disney+", "Hulu", "Netflix", "Prime Video"],
cut=0,
scale="width",
inner="quartiles",
ax=ax
)
plt.xticks(np.arange(0, 10, 0.5), fontsize=14)
plt.yticks(fontsize=14)
ax.set_xlabel("IMDb Rating", fontsize=18, labelpad=20, fontweight="bold")
ax.set_ylabel("Streaming Service", fontsize=18, labelpad=20, fontweight="bold")
plt.title("IMDb Ratings for Titles on Each Service", fontsize=20, fontweight="bold")
plt.tight_layout()
plt.show()
Disney+ has the best titles on average. It also has the highest-rated upper quartile of titles. However, unlike the other three services, it does not have any titles rated over 9.0. Prime Video, which has the most titles, has the longest tail of lower-rated titles. It also has the lowest mean score for titles. Prime Video appears to have prioritized quantity over quality. Netflix and Hulu are fairly similar in quality, with Hulu slightly worse than Netflix by IMDb rating. Netflix has the second most titles, so it’s doing a pretty good job of balancing quality and quantity.
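The violin plot marks quartiles rather than means, so a small summary table is a useful cross-check on statements about averages and maximums. This is a sketch on invented ratings; `toy_df` stands in for the combined per-service frame.

```python
import pandas as pd

# Invented ratings: one row per (service, title) pair.
toy_df = pd.DataFrame({
    "Service": ["Disney+"] * 3 + ["Prime Video"] * 4,
    "IMDb": [6.0, 7.0, 8.0, 4.0, 5.0, 6.0, 9.0],
})

# Mean, median, and maximum IMDb rating for each service.
summary = toy_df.groupby("Service")["IMDb"].agg(["mean", "median", "max"])

print(summary)
```

In this toy data the service with the highest single rating is not the one with the highest mean, the same kind of pattern described above for the real data.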
Each of these services presents a broad spectrum of genres for their users. People may be concerned with certain genres over another, so it would be good to see how the services distribute their collections among the various genres. The following plots are based on ten popular genres: Action, Adventure, Animation, Comedy, Documentary, Drama, Family, Horror, Romance, and Sci-Fi.
from collections import defaultdict
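def get_genre_counts(df, genre_list):
    """Count how many titles in df list each genre in genre_list."""
    genre_dict = defaultdict(int)
    for row in df.Genres:
        # Missing genres are NaN (a float), so count the title once as Unknown
        if not isinstance(row, str):
            genre_dict["Unknown"] += 1
            continue
        for genre in genre_list:
            if genre in row:
                genre_dict[genre] += 1
    return genre_dict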
my_genres = sorted(["Comedy", "Romance", "Animation", "Family", "Drama",
"Adventure", "Sci-Fi", "Action", "Documentary", "Horror"]
)
fig = plt.figure(figsize=(12,12))
di_ax = plt.subplot(221, projection="polar")
hu_ax = plt.subplot(222, projection="polar")
nf_ax = plt.subplot(223, projection="polar")
pr_ax = plt.subplot(224, projection="polar")
values = get_genre_counts(prime_df, my_genres)
values = [values[genre] for genre in my_genres]
angles = [n/len(values) *2 * np.pi for n in range(len(my_genres))]
values += [values[0]]
angles += [angles[0]]
pr_ax.plot(angles, values, marker=".", color="orange")
pr_ax.fill(angles, values, alpha=0.25, color="orange")
values = get_genre_counts(hulu_df, my_genres)
values = [values[genre] for genre in my_genres]
values += [values[0]]
hu_ax.plot(angles, values, marker=".", color="green")
hu_ax.fill(angles, values, alpha=0.25, color="green")
values = get_genre_counts(disney_df, my_genres)
values = [values[genre] for genre in my_genres]
values += [values[0]]
di_ax.plot(angles, values, marker=".", color="blue")
di_ax.fill(angles, values, alpha=0.25, color="blue")
values = get_genre_counts(netflix_df, my_genres)
values = [values[genre] for genre in my_genres]
values += [values[0]]
nf_ax.plot(angles, values, marker=".", color="red")
nf_ax.fill(angles, values, alpha=0.25, color="red")
del(angles[-1])
pr_ax.set_thetagrids(np.degrees(angles), my_genres, fontsize=11)
pr_ax.set_title("Prime Video", fontsize=18, fontweight="bold", pad=20)
pr_ax.set_theta_offset(np.pi/2)
pr_ax.tick_params(pad=19)
hu_ax.set_thetagrids(np.degrees(angles), my_genres, fontsize=11)
hu_ax.set_title("Hulu", fontsize=18, fontweight="bold", pad=20)
hu_ax.set_theta_offset(np.pi/2)
hu_ax.tick_params(pad=19)
di_ax.set_thetagrids(np.degrees(angles), my_genres, fontsize=11)
di_ax.set_title("Disney+", fontsize=18, fontweight="bold", pad=20)
di_ax.set_theta_offset(np.pi/2)
di_ax.tick_params(pad=19)
nf_ax.set_thetagrids(np.degrees(angles), my_genres, fontsize=11)
nf_ax.set_title("Netflix", fontsize=18, fontweight="bold", pad=20)
nf_ax.set_theta_offset(np.pi/2)
nf_ax.tick_params(pad=19)
fig.subplots_adjust(top=0.8)
fig.suptitle("Radar Plots of Titles by Selected Genre by Service", fontsize=24, fontweight="bold")
plt.tight_layout(pad=4)
plt.show()
Disney+ is pretty unique in what genres of titles it contains. It is very much oriented towards family titles. It also focuses on Adventure and Comedy. Surprisingly for a company that started with animation, titles in the Animation genre do not stick out for Disney+. Disney+ has no Horror titles and barely any Action, Documentary, Romance, or Sci-Fi titles.
The other three services are mostly oriented towards Comedy and Drama. They also have a good number of Action, Documentary, and Romance titles. Hulu and Prime Video also devote a good share of their titles to the Horror genre, something Netflix does not.
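The radar plots use raw counts, and the services differ greatly in size, so it can also help to look at each genre as a share of a service's titles. The helper below, `genre_shares`, is a hypothetical illustration run on invented data, not part of the original analysis.

```python
import pandas as pd

def genre_shares(genres, genre_list):
    """Return the fraction of titles whose genre string mentions each genre."""
    n_titles = len(genres)
    return {
        # regex=False matches the genre literally; na=False treats missing as no match.
        genre: genres.str.contains(genre, regex=False, na=False).sum() / n_titles
        for genre in genre_list
    }

# Invented sample in the same comma-separated format as the Genres column.
toy_genres = pd.Series(["Comedy,Family", "Family", None, "Drama"])
shares = genre_shares(toy_genres, ["Comedy", "Drama", "Family", "Horror"])

print(shares)
```

Titles with a missing genre stay in the denominator here, which is a design choice; dropping them first would give shares among titles with a known genre.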
Looking at the runtimes of titles on a service gives a good idea of the type of each title. Runtimes under ten minutes indicate shorts. Runtimes between ten and thirty minutes are likely sitcoms and other television comedies. Runtimes between thirty minutes and an hour are likely television dramas, runtimes between one and two hours are movies, and runtimes over two hours are long movies. Viewers may want to differentiate the services based on what types of titles are available.
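One detail worth keeping in mind when binning with `pd.cut` is that bins are right-closed by default, so a title of exactly ten minutes lands in the shorts category rather than the sitcom one. A small sketch with invented runtimes and illustrative text labels (the analysis itself uses integer codes):

```python
import pandas as pd

# Invented runtimes in minutes, chosen to sit on and around the bin edges.
runtimes = pd.Series([5, 10, 25, 45, 95, 150])

# Illustrative labels only.
labels = ["short", "sitcom", "drama", "movie", "long movie"]

# Default right-closed bins: (0, 10], (10, 30], (30, 60], (60, 120], (120, inf]
categories = pd.cut(runtimes, bins=[0, 10, 30, 60, 120, float("inf")], labels=labels)

print(categories.tolist())
```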
movies_df_cold["MovieLength"] = pd.cut(
movies_df_cold.Runtime, bins=[-1,0,10,30,60,120,movies_df_cold.Runtime.max()], labels=[-1,0,1,2,3,4]
)
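# Assign the result back; inplace fillna on an attribute-accessed column may not update the frame
movies_df_cold["MovieLength"] = movies_df_cold["MovieLength"].fillna(-1)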
movies_df_cold.MovieLength.unique()
length_df = movies_df_cold.groupby(["Service", "MovieLength"])["Title"].count().reset_index(name="Count")
stacked_len_df = length_df.pivot(index="Service", columns="MovieLength", values="Count")
stacked_len_df = stacked_len_df.div(stacked_len_df.sum(axis=1), axis=0) #Normalize it
stacked_len_df
fig = plt.figure(figsize=(12,12))
ax = fig.add_subplot(1,1,1)
stacked_len_df.plot(
kind="barh",
#stacked=True,
ax=ax,
color=my_colors,
width=.8
)
rects = ax.patches
for rect in rects:
    width = rect.get_width()
    width_label = "{:.2f}%".format(width*100)
    ax.text(width+.05, rect.get_height()/2 + rect.get_y(), width_label, ha="center", va="center", fontsize=14)
handles, labels = ax.get_legend_handles_labels()
my_labels = ["Unknown Length", "Shorts (Under 10 Min)", "Sitcoms (10-30 Min)",
"Dramas (30 Min-1 Hour)", "Movies (Under 2 Hours)", "Long Movies (Over 2 Hours)"]
legend = plt.legend(
handles, my_labels, bbox_to_anchor=(.95,-.06),
fontsize=12, ncol=3, title="Runtime Category", fancybox=True
)
legend.get_title().set_fontsize(14)
plt.title("Normalized Runtime Category by Service", fontsize=18, fontweight="bold", pad=15)
ax.set_ylabel("Service", fontsize=18)
ax.set_xlabel("Percent", fontsize=18)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlim(0, 1)
plt.grid(True, axis="x")
plt.tight_layout()
plt.show()
The runtime categories have been normalized for each service to a percentage out of 100. The overwhelming majority of titles on each service are movies with runtimes under two hours.
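The normalization amounts to dividing each service's row of counts by that row's total. A minimal sketch with invented counts, where `counts` is a hypothetical stand-in for the pivoted table:

```python
import pandas as pd

# Invented counts: rows are services, columns are runtime categories.
counts = pd.DataFrame(
    {"short": [10, 5], "movie": [30, 95]},
    index=["Disney+", "Prime Video"],
)

# Divide every row by its own total so each service sums to 1.
shares = counts.div(counts.sum(axis=1), axis=0)

print(shares)
```

Dividing along `axis=0` aligns the row totals with the row index, which is what keeps a large service from dwarfing a small one in the chart.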
All four of the services have similar distributions in terms of runtime categories. Hulu is a surprise here because it was built to stream television shows, and yet it has a smaller proportion of sitcoms and TV dramas than any of the other services.
Disney+ is also a bit of an outlier here with its comparatively large percentage of shorts. This is probably its collection of old cartoons that other services don’t have access to.
Netflix, Hulu, and Prime Video are built similarly in the collection of titles that they have amassed. They cater to the same age ranges, have mostly recent titles, have similar distributions of genres, and have similar runtimes of titles. Their biggest differences are in quantity and quality of titles.
Disney+ is pretty unique in what it offers. It leans into its family-friendly history by focusing on family and all-ages content. Disney+ can also take advantage of that history by delving into its large back catalog. This is something no other service can offer.
Disney+ has also focused on quality over quantity. It has the fewest titles but the highest average IMDb rating. Prime Video has taken the opposite tack and has a lot of titles but their quality is lacking compared to the other services.