Streaming Visualizations

library(reticulate)
use_python("/Users/andrew/opt/anaconda3/bin/python")

Introduction

Video streaming services provide a vast number of movies, television shows, and other audio-visual titles. More and more of what people watch arrives through a streaming video service as more users flock to them. As the number of services also grows, it is becoming increasingly difficult to decide which service or services to subscribe to. This analysis looks at four of the major services in the “streaming wars” to compare and contrast their offerings: Disney+, Hulu, Netflix, and Prime Video.

Dataset

The data came from Kaggle. It is nine months old at the time of analysis (February 2021), so it is fairly out of date, but the general content on each service shouldn’t have changed that drastically since then.

There are 16744 rows in the data with the following columns:

Column Name	Description
Title	The title of media
Year	The release year of the title
Age	The recommended age for the title
IMDb	The IMDb score for the title
Rotten Tomatoes	The Rotten Tomatoes score for the title
Netflix, Hulu, Prime Video, Disney+	Four columns of a one-hot style coding indicating if the title is available on the service
Directors	The director(s) of the title
Genres	The genre(s) of the title
Country	The country or countries where the title was filmed
Language	The language or languages used in the title
Runtime	The runtime in minutes of the title

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
path = "/Users/andrew/DS736/py_datafiles/MoviesOnStreamingPlatforms_updated.csv"
movies_df = pd.read_csv(path, index_col="index")
# Drop the two columns that are not used in this analysis
movies_df = movies_df.drop(columns=["ID", "Type"])
movies_df[["Year","IMDb", "Runtime"]].describe()

##                Year          IMDb       Runtime
## count  16744.000000  16173.000000  16152.000000
## mean    2003.014035      5.902751     93.413447
## std       20.674321      1.347867     28.219222
## min     1902.000000      0.000000      1.000000
## 25%     2000.000000      5.100000     82.000000
## 50%     2012.000000      6.100000     92.000000
## 75%     2016.000000      6.900000    104.000000
## max     2020.000000      9.300000   1256.000000

The numeric columns are all in pretty good shape. Year has an entry for every row and is bounded by 1902 and 2020. IMDb ratings and Runtime are also present for a large majority of the rows. IMDb ratings have a min and max of 0 and 9.3. Runtime has a reasonable minimum of 1 minute, but the max runtime is 1256 minutes (almost 21 hours). That might be an issue we’ll have to deal with.
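One way to look at the long runtimes is to pull out the rows above a cutoff and inspect them by hand. This is only a sketch on made-up data: the `toy` frame and the 300-minute cutoff are my assumptions, not something taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for movies_df; titles and runtimes are made up
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Runtime": [92.0, 1256.0, None, 7.0],
})

# Assumed cutoff: anything over five hours is worth a manual look
MAX_PLAUSIBLE_MINUTES = 300

# Comparisons with NaN are False, so missing runtimes are not flagged
suspicious = toy[toy["Runtime"] > MAX_PLAUSIBLE_MINUTES]
print(suspicious)
```

Whether flagged rows should be dropped, capped, or kept depends on what they turn out to be, so the sketch only lists them.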

movies_df[["Netflix", "Hulu", "Prime Video", "Disney+"]].describe()

##             Netflix          Hulu   Prime Video       Disney+
## count  16744.000000  16744.000000  16744.000000  16744.000000
## mean       0.212613      0.053930      0.737817      0.033684
## std        0.409169      0.225886      0.439835      0.180419
## min        0.000000      0.000000      0.000000      0.000000
## 25%        0.000000      0.000000      0.000000      0.000000
## 50%        0.000000      0.000000      1.000000      0.000000
## 75%        0.000000      0.000000      1.000000      0.000000
## max        1.000000      1.000000      1.000000      1.000000

There’s a value in every row for the service columns, and they are either one or zero. Good to go here!
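The describe() output only shows the min and max, so a stricter check that every value really is zero or one could look like the sketch below. The `toy` frame is made up for illustration; the column names match the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the four service columns of movies_df
toy = pd.DataFrame({
    "Netflix": [1, 0, 1],
    "Hulu": [0, 0, 1],
    "Prime Video": [1, 1, 0],
    "Disney+": [0, 0, 0],
})
service_cols = ["Netflix", "Hulu", "Prime Video", "Disney+"]

# True only if every cell in every service column is 0 or 1
all_binary = toy[service_cols].isin([0, 1]).all().all()

# Number of services that carry each title (a title can be on several)
services_per_title = toy[service_cols].sum(axis=1)
print(all_binary, services_per_title.tolist())
```

The per-title sum is a useful side effect: titles with a sum above one appear on more than one service.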

movies_df["Rotten Tomatoes"].value_counts()

## 100%    407
## 80%     162
## 50%     136
## 83%     131
## 67%     126
##        ... 
## 5%       10
## 7%       10
## 4%        9
## 2%        4
## 3%        4
## Name: Rotten Tomatoes, Length: 99, dtype: int64

movies_df["Rotten Tomatoes"].isna().sum()

## 11586

The Rotten Tomatoes scores look reasonable. However, over 11k are missing. That’s probably too many to be able to make any use of this column, especially given that the IMDb scores are also present.
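For reference, 11,586 missing out of 16,744 rows is roughly 69% of the column. If the scores were ever needed, the percent strings would first have to become numbers. A minimal sketch on made-up values, assuming the scores always look like "87%":

```python
import pandas as pd

# Hypothetical sample of the Rotten Tomatoes column; None marks a missing score
toy = pd.DataFrame({"Rotten Tomatoes": ["100%", "80%", None, None, "5%"]})

# Strip the trailing percent sign, then convert; missing values stay NaN
rt_score = pd.to_numeric(toy["Rotten Tomatoes"].str.rstrip("%"), errors="coerce")

# Share of rows with no score
missing_share = rt_score.isna().mean()
print(rt_score.tolist(), missing_share)
```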

movies_df.Age.value_counts()

## 18+    3474
## 7+     1462
## 13+    1255
## all     843
## 16+     320
## Name: Age, dtype: int64

movies_df.Age.isna().sum()

## 9390

The Age categories all make sense, although there are quite a few missing.

movies_df.Genres.value_counts()

## Drama                                                      1341
## Documentary                                                1229
## Comedy                                                     1040
## Comedy,Drama                                                446
## Horror                                                      436
##                                                            ... 
## Adventure,Drama,Fantasy,Music                                 1
## Adventure,Comedy,Family,Romance                               1
## Action,Crime,Horror,Mystery,Sport                             1
## Action,Adventure,Comedy,Fantasy,Horror,Mystery,Thriller       1
## Documentary,Biography,Comedy,Family                           1
## Name: Genres, Length: 1909, dtype: int64

movies_df.Genres.isna().sum()

## 275

There are a lot of genre combinations, and many of the titles are in multiple genres. I’ll probably have to break these up to make them more useful.
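One possible way to break the comma-separated genres apart is to split each string into a list and give every genre its own row. This is a sketch on made-up rows, not necessarily the approach used later in this report; the `Genre` column name is my own.

```python
import pandas as pd

# Hypothetical stand-in for movies_df with comma-separated genres
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Genres": ["Comedy,Drama", "Horror", None],
})

# Split into lists, then explode so each (title, genre) pair is one row
long_genres = (
    toy.assign(Genre=toy["Genres"].str.split(","))
       .explode("Genre")
       .dropna(subset=["Genre"])  # titles with no genre drop out here
)

genre_counts = long_genres["Genre"].value_counts()
print(genre_counts)
```

A title with several genres is counted once per genre, so the counts add up to more than the number of titles.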

movies_df.Country.value_counts()

## United States                       8776
## India                               1064
## United Kingdom                       905
## Canada                               555
## Australia                            202
##                                     ... 
## India,Sweden                           1
## United States,Nicaragua                1
## France,Israel                          1
## Italy,Switzerland,Albania,Poland       1
## Argentina,Uruguay,Serbia               1
## Name: Country, Length: 1303, dtype: int64

movies_df.Country.isna().sum()

## 435

Most of these titles were filmed in the US. There are also a good number from India, suggesting that the services carry a lot of Bollywood films.

movies_df.Language.value_counts()

## English                       10955
## Hindi                           503
## English,Spanish                 276
## Spanish                         267
## English,French                  174
##                               ...  
## English,Malay                     1
## English,Malay,Hokkien             1
## Haitian,English,French            1
## Mandarin,Min Nan,Cantonese        1
## Hindi,Gujarati,English            1
## Name: Language, Length: 1102, dtype: int64

movies_df.Language.isna().sum()

## 599

The overwhelming majority of titles feature English. This makes sense for US-based streaming services. Things look good here.

I skipped over Title and Directors because there are so many different values for those columns. I do not plan to analyze or visualize those columns, so there isn’t a need to check them for NaNs and empty values.

Findings

Below are visualizations showing differences and similarities between the different streaming services. Depending on consumer preferences they may help to decide which service(s) to subscribe to.

Age Ratings

One thing to look at in the different services is the age ratings for the titles available. Is one service more “adult” than the others? Perhaps one service is better for small kids and family?

movies_df["Age"] = movies_df["Age"].fillna("No Age")

my_age_order = ["all", "7+", "13+", "16+", "No Age", "18+"]
movies_df.Age = pd.Categorical(movies_df.Age, categories=my_age_order, ordered=True)

movies_df.sort_values(by="Age", inplace=True)

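# .copy() gives each service its own frame, so adding the Service column
# below does not trigger pandas' SettingWithCopyWarning
prime_df = movies_df[movies_df["Prime Video"] == 1].copy()
hulu_df = movies_df[movies_df["Hulu"] == 1].copy()
netflix_df = movies_df[movies_df["Netflix"] == 1].copy()
disney_df = movies_df[movies_df["Disney+"] == 1].copy()

prime_df["Service"] = "Prime Video"
netflix_df["Service"] = "Netflix"
hulu_df["Service"] = "Hulu"
disney_df["Service"] = "Disney+"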

fig = plt.figure(figsize=(18,16))
nf_ax = fig.add_subplot(2,2,1)
di_ax = fig.add_subplot(2,2,2)
pr_ax = fig.add_subplot(2,2,3)
hu_ax = fig.add_subplot(2,2,4)

num_ages = len(movies_df.Age.unique())
colormap = plt.get_cmap("Set2")
my_colors = colormap(np.arange(num_ages))


netflix_df.groupby(["Age"])["Title"].count().plot(
    kind="pie", radius=1, ax=nf_ax, colors=my_colors,
    pctdistance=1.15, labels=None,
    wedgeprops=dict(edgecolor="w", width=.5), textprops={"fontsize":14},
    startangle=90,
    autopct= "%1.2f%%"
)
disney_df.groupby(["Age"])["Title"].count().plot(
    kind="pie", radius=1, ax=di_ax, colors=my_colors,
    pctdistance=1.15, labels=None,
    wedgeprops=dict(edgecolor="w", width=.5), textprops={"fontsize":14},
    startangle=90,
    autopct= "%1.2f%%"
)
prime_df.groupby(["Age"])["Title"].count().plot(
    kind="pie", radius=1, ax=pr_ax, colors=my_colors,
    pctdistance=1.15, labels=None,
    wedgeprops=dict(edgecolor="w", width=.5), textprops={"fontsize":14},
    startangle=90,
    autopct= "%1.2f%%"
)
hulu_df.groupby(["Age"])["Title"].count().plot(
    kind="pie", radius=1, ax=hu_ax, colors=my_colors,
    pctdistance=1.15, labels=None,
    wedgeprops=dict(edgecolor="w", width=.5), textprops={"fontsize":14},
    startangle=90,
    autopct= "%1.2f%%",
)
di_ax.set_title("Disney+", fontsize=20, fontweight="bold")
pr_ax.set_title("Prime Video", fontsize=20, fontweight="bold")
nf_ax.set_title("Netflix", fontsize=20, fontweight="bold")
hu_ax.set_title("Hulu", fontsize=20, fontweight="bold")

di_ax.yaxis.set_visible(False)
pr_ax.yaxis.set_visible(False)
nf_ax.yaxis.set_visible(False)
hu_ax.yaxis.set_visible(False)

di_ax.text(0,0, "Total Titles:\n{:,}".format(disney_df.shape[0]), ha="center", va="center", size=18)
nf_ax.text(0,0, "Total Titles:\n{:,}".format(netflix_df.shape[0]), ha="center", va="center", size=18)
hu_ax.text(0,0, "Total Titles:\n{:,}".format(hulu_df.shape[0]), ha="center", va="center", size=18)
pr_ax.text(0,0, "Total Titles:\n{:,}".format(prime_df.shape[0]), ha="center", va="center", size=18)


# Label the wedges in category order, which is the order groupby uses for the pies
legend = plt.legend(my_age_order, bbox_to_anchor=(1.15,1.2), fontsize=16, title="Age Rating", fancybox=True)
legend.get_title().set_fontsize(16)

plt.suptitle("Age Ratings for Each Service", fontsize=24, fontweight="bold")
plt.tight_layout(pad=4)
plt.show()

Disney+ is a much different service than the other three. Over 80% of Disney+’s offerings are aimed at all ages or at kids aged seven and up, and almost none of its content is rated 16+ or 18+. It is very much the family-friendly/kid-friendly service of the four. Disney+ also has the most complete age ratings: only 11.35% of its content has no rating, while each of the other three services has at least three times that share of unrated content. The other three services have roughly similar distributions of content across the age ratings.
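The percentages printed on the pies can also be read from a table. The sketch below uses a made-up long-format frame with one row per title and service; the numbers are illustrative and are not the real shares.

```python
import pandas as pd

# Hypothetical long-format data: one row per (service, title)
toy = pd.DataFrame({
    "Service": ["Disney+", "Disney+", "Disney+", "Disney+", "Hulu", "Hulu"],
    "Age": ["all", "all", "7+", "No Age", "18+", "No Age"],
})

# normalize="index" makes each service's row sum to 1; scale to percent
age_share = pd.crosstab(toy["Service"], toy["Age"], normalize="index") * 100
print(age_share.round(2))
```

Each row sums to 100, so the rows are directly comparable even though the services have very different numbers of titles.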

Also of note here is that Prime Video has most of the titles available in this data.

Release Years

To some users, the release years of the movies on a service may be an important discriminator. Some users may want older movies so they can rewatch the classics. Others may want to watch only the newest releases.

fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(1,1,1)
# sns.distplot is deprecated; sns.kdeplot draws the same filled density curves
sns.kdeplot(
    x=disney_df.Year, ax=ax, color="blue", label="Disney+",
    clip=(disney_df.Year.min(), disney_df.Year.max()), linewidth=3, fill=True,
)
sns.kdeplot(
    x=hulu_df.Year, ax=ax, color="green", label="Hulu",
    clip=(hulu_df.Year.min(), hulu_df.Year.max()), linewidth=3, fill=True,
)
sns.kdeplot(
    x=netflix_df.Year, ax=ax, color="red", label="Netflix",
    clip=(netflix_df.Year.min(), netflix_df.Year.max()), linewidth=3, fill=True,
)
sns.kdeplot(
    x=prime_df.Year, ax=ax, color="orange", label="Prime Video",
    clip=(prime_df.Year.min(), prime_df.Year.max()), linewidth=3, fill=True,
)

ax.set_xlabel("Year", fontsize=18, fontweight="bold")
ax.set_ylabel("Density", fontsize=18, fontweight="bold")
plt.xticks(np.arange(movies_df.Year.min(), movies_df.Year.max(), 5), fontsize=14, rotation=60)

plt.yticks(fontsize=14)

plt.title("Density Plots of Titles by Year", fontsize=20, fontweight="bold")
plt.legend(fontsize=16, loc="upper left", fancybox=True)
plt.tight_layout()
plt.show()

The above chart shows the density of the titles over the years. That is, given a year, roughly what proportion of titles from that service were released around that year? Netflix and Hulu have mostly recent content. The majority of their titles were produced in the late 2010s, as evidenced by the large peaks on the right of the graph. Netflix is probably dominated by its original content, which is mostly from the last five years. Hulu was created a decade or so ago to re-broadcast over-the-air television shows, so it would likely have mostly recent titles. Prime Video has the oldest titles in the collection, going back to 1902. Prime also has the most titles of any service, so the long tail isn’t entirely surprising. Disney+ has a fairly even distribution of titles from the last thirty or so years. It also has the thickest tail going back in time. Given that Disney has been producing film, animation, and television since the 1920s, it is able to provide quite a bit of older content. The other three services produce their own content as well, but they are all fairly recent media companies, especially compared to Disney.
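The density curves are smoothed, so they do not give exact per-year figures. The unsmoothed version of the same question, the share of a service’s titles released in a given year, can be computed directly. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical long-format data: one row per (service, title)
toy = pd.DataFrame({
    "Service": ["Netflix"] * 4 + ["Disney+"] * 2,
    "Year": [2018, 2018, 2019, 2001, 1995, 2019],
})

# Within each service, the fraction of titles released in each year
year_share = toy.groupby("Service")["Year"].value_counts(normalize=True)
print(year_share)
```

In this toy data half of the Netflix titles come from 2018, which a density plot would show as a peak near the right edge.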

IMDb Ratings

The quality of titles on a service is very important to some people. They may want to know that they are getting the top quality for their money. Using the IMDb ratings of the titles, can we see which service has the “best” titles in its collection?

movies_df_cold = pd.concat([hulu_df, netflix_df, prime_df, disney_df], ignore_index=True)
movies_df_cold.shape

fig = plt.figure(figsize=(15,10))
ax = fig.add_subplot(1,1,1)
sns.violinplot(
    x="IMDb", y="Service", data=movies_df_cold, 
    palette=["blue", "green", "red", "orange"],
    order=["Disney+", "Hulu", "Netflix", "Prime Video"],
    cut=0,
    scale="width",
    inner="quartiles",
    ax=ax
)
plt.xticks(np.arange(0, 10, 0.5), fontsize=14)

plt.yticks(fontsize=14)

ax.set_xlabel("IMDb Rating", fontsize=18, labelpad=20, fontweight="bold")
ax.set_ylabel("Streaming Service", fontsize=18, labelpad=20, fontweight="bold")
plt.title("IMDb Ratings for Titles on Each Service", fontsize=20, fontweight="bold")
plt.tight_layout()
plt.show()

Disney+ has the best titles on average. It also has the highest rated upper quartile of titles. However, unlike the other three services, it does not have any titles rated over 9.0. Prime Video, which has the most titles, has the longest tail of lower rated titles. It also has the lowest mean score for titles. Prime Video appears to have focused on increasing quantity and does not care as much about quality. Netflix and Hulu are fairly similar in their quality, with Hulu being slightly worse by IMDb rating than Netflix. Netflix has the second most titles, so it’s doing a pretty good job of balancing quality and quantity.
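The violin plot draws quartile lines rather than means, so statements about averages are easier to check against a summary table. A sketch on made-up ratings, which are not the real values:

```python
import pandas as pd

# Hypothetical long-format data with one IMDb rating per (service, title)
toy = pd.DataFrame({
    "Service": ["Disney+"] * 3 + ["Prime Video"] * 3,
    "IMDb": [6.0, 7.0, 8.0, 4.0, 5.0, 9.0],
})

# Mean, median, and max rating per service, side by side
imdb_summary = toy.groupby("Service")["IMDb"].agg(["mean", "median", "max"])
print(imdb_summary)
```

In the toy data the second service has the single highest-rated title but the lower mean and median, which is the same kind of pattern described above.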

Genres

Each of these services presents a broad spectrum of genres for their users. People may be concerned with certain genres over another, so it would be good to see how the services distribute their collections among the various genres. The following plots are based on ten popular genres: Action, Adventure, Animation, Comedy, Documentary, Drama, Family, Horror, Romance, and Sci-Fi.

from collections import defaultdict
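def get_genre_counts(df, genre_list):
    """Count titles per genre; a title in several genres counts once for each."""
    genre_dict = defaultdict(int)
    for row in df.Genres:
        if pd.isna(row):
            # Missing genres: count the title once as Unknown
            genre_dict["Unknown"] += 1
            continue
        # Split so that genres are matched exactly rather than as substrings
        row_genres = row.split(",")
        for genre in genre_list:
            if genre in row_genres:
                genre_dict[genre] += 1
    return genre_dict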

            
my_genres = sorted(["Comedy", "Romance", "Animation", "Family", "Drama", 
             "Adventure", "Sci-Fi", "Action", "Documentary", "Horror"]
)

# constrained_layout is left off because it conflicts with the
# subplots_adjust and tight_layout calls further down
fig = plt.figure(figsize=(12,12))
di_ax = plt.subplot(221, projection="polar")
hu_ax = plt.subplot(222, projection="polar")
nf_ax = plt.subplot(223, projection="polar")
pr_ax = plt.subplot(224, projection="polar")

values = get_genre_counts(prime_df, my_genres)
values = [values[genre] for genre in my_genres]
angles = [n/len(values) *2 * np.pi for n in range(len(my_genres))]
values += [values[0]]
angles += [angles[0]]
pr_ax.plot(angles, values, marker=".", color="orange")
pr_ax.fill(angles, values, alpha=0.25, color="orange")

values = get_genre_counts(hulu_df, my_genres)
values = [values[genre] for genre in my_genres]
values += [values[0]]
hu_ax.plot(angles, values, marker=".", color="green")
hu_ax.fill(angles, values, alpha=0.25, color="green")

values = get_genre_counts(disney_df, my_genres)
values = [values[genre] for genre in my_genres]
values += [values[0]]
di_ax.plot(angles, values, marker=".", color="blue")
di_ax.fill(angles, values, alpha=0.25, color="blue")

values = get_genre_counts(netflix_df, my_genres)
values = [values[genre] for genre in my_genres]
values += [values[0]]
nf_ax.plot(angles, values, marker=".", color="red")
nf_ax.fill(angles, values, alpha=0.25, color="red")

del(angles[-1])
pr_ax.set_thetagrids(np.degrees(angles), my_genres, fontsize=11)

pr_ax.set_title("Prime Video", fontsize=18, fontweight="bold", pad=20)
pr_ax.set_theta_offset(np.pi/2)
pr_ax.tick_params(pad=19)

hu_ax.set_thetagrids(np.degrees(angles), my_genres, fontsize=11)

hu_ax.set_title("Hulu", fontsize=18, fontweight="bold", pad=20)
hu_ax.set_theta_offset(np.pi/2)
hu_ax.tick_params(pad=19)

di_ax.set_thetagrids(np.degrees(angles), my_genres, fontsize=11)

di_ax.set_title("Disney+", fontsize=18, fontweight="bold", pad=20)
di_ax.set_theta_offset(np.pi/2)
di_ax.tick_params(pad=19)

nf_ax.set_thetagrids(np.degrees(angles), my_genres, fontsize=11)

nf_ax.set_title("Netflix", fontsize=18, fontweight="bold", pad=20)
nf_ax.set_theta_offset(np.pi/2)
nf_ax.tick_params(pad=19)

fig.subplots_adjust(top=0.8)

fig.suptitle("Radar Plots of Titles by Selected Genre by Service", fontsize=24, fontweight="bold")
plt.tight_layout(pad=4)
plt.show()

Disney+ is pretty unique in what genres of titles it contains. It is very much oriented towards family titles. It also focuses on Adventure and Comedy. Surprisingly for a company that started with animation, titles in the Animation genre do not stick out for Disney+. Disney+ has no Horror titles and barely any Action, Documentary, Romance, or Sci-Fi titles.

The other three services are mostly oriented towards Comedy and Drama. They also have a good number of Action, Documentary, and Romance titles. Hulu and Prime Video also devote a good share of their titles to the Horror genre, which Netflix does not.

Runtimes

Looking at the runtimes of titles on a service, we can get a good idea of the type of each title. Runtimes under ten minutes indicate shorts. Runtimes between ten and thirty minutes are likely sitcoms and other television comedies. Runtimes between thirty minutes and an hour are likely television dramas, runtimes between one and two hours are movies, and runtimes over two hours are long movies. Viewers may want to differentiate the services based on what types of titles are available.
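The bucketing described above can be sketched with pd.cut on a handful of made-up runtimes. The category names and the "Unknown" bucket are my own labels for illustration.

```python
import pandas as pd

# Hypothetical runtimes in minutes; None stands for a missing value
runtimes = pd.Series([5, 22, 45, 95, 150, None])

# Right-closed bins: (0, 10], (10, 30], (30, 60], (60, 120], (120, inf)
bins = [0, 10, 30, 60, 120, float("inf")]
labels = ["Short", "Sitcom", "Drama", "Movie", "Long Movie"]
category = pd.cut(runtimes, bins=bins, labels=labels)

# Missing runtimes fall in no bin, so give them an explicit category
category = category.cat.add_categories("Unknown").fillna("Unknown")
print(category.tolist())
```

Because the bins are closed on the right, a title that runs exactly ten minutes lands in the short bucket and one that runs exactly two hours still counts as a regular movie.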

movies_df_cold["MovieLength"] = pd.cut(
    movies_df_cold.Runtime, bins=[-1,0,10,30,60,120,movies_df_cold.Runtime.max()], labels=[-1,0,1,2,3,4]
)

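# Assign the result back; inplace fillna through attribute access is unreliable in newer pandas
movies_df_cold["MovieLength"] = movies_df_cold["MovieLength"].fillna(-1)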
movies_df_cold.MovieLength.unique()

length_df = movies_df_cold.groupby(["Service", "MovieLength"])["Title"].count().reset_index(name="Count")
stacked_len_df = length_df.pivot(index="Service", columns="MovieLength", values="Count")
stacked_len_df = stacked_len_df.div(stacked_len_df.sum(axis=1), axis=0) #Normalize it
stacked_len_df

fig = plt.figure(figsize=(12,12))
ax = fig.add_subplot(1,1,1)

stacked_len_df.plot(
    kind="barh",
    #stacked=True,
    ax=ax,
    color=my_colors,
    width=.8
)
rects = ax.patches
for rect in rects:
    width = rect.get_width()
    width_label = "{:.2f}%".format(width*100)
    ax.text(width+.05, rect.get_height()/2 + rect.get_y(), width_label, ha="center", va="center", fontsize=14)
handles, labels = ax.get_legend_handles_labels()
my_labels = ["Unknown Length", "Shorts (Under 10 Min)", "Sitcoms (10-30 Min)", 
             "Dramas (30 Min-1 Hour)", "Movies (Under 2 Hours)", "Long Movies (Over 2 Hours)"]
legend = plt.legend(
    handles, my_labels, bbox_to_anchor=(.95,-.06), 
    fontsize=12, ncol=3, title="Runtime Category", fancybox=True
)
legend.get_title().set_fontsize(14)
plt.title("Normalized Runtime Category by Service", fontsize=18, fontweight="bold", pad=15)
ax.set_ylabel("Service", fontsize=18)
ax.set_xlabel("Proportion of Titles", fontsize=18)
plt.xticks(fontsize=14)

plt.yticks(fontsize=14)

plt.xlim(0, 1)

plt.grid(True, axis="x")
plt.tight_layout()
plt.show()

The runtime categories have been normalized for each service to a percentage out of 100. The overwhelming majority of titles on each service are movies with runtimes under two hours.
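The normalization divides each service’s category counts by that service’s total, so every row sums to one. A small sketch with made-up counts:

```python
import pandas as pd

# Hypothetical counts of titles per runtime category for two services
counts = pd.DataFrame(
    {"Short": [10, 5], "Movie": [30, 95]},
    index=["Disney+", "Prime Video"],
)

# Divide every row by its own total (axis=0 aligns the totals on the index)
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares)
```

Here the first service has fewer titles overall but a larger share of shorts (25% versus 5%), which raw counts alone would hide.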

All four of the services have similar distributions in terms of runtime categories. Hulu is a surprise here because it was built to stream television shows and yet has a smaller proportion of sitcoms and television dramas than any of the other services.

Disney+ is also a bit of an outlier here with its comparatively large percentage of shorts. This is probably its collection of old cartoons that other services don’t have access to.

Conclusion

Netflix, Hulu, and Prime Video are built similarly in the collection of titles that they have amassed. They cater to the same age ranges, have mostly recent titles, have similar distributions of genres, and have similar runtimes of titles. Their biggest differences are in quantity and quality of titles.

Disney+ is pretty unique in what it offers. It leans into its family-friendly history by focusing on family and all-ages content. Disney+ can also take advantage of that history by delving into its large back catalog. This is something no other service can offer.

Disney+ has also focused on quality over quantity. It has the fewest titles but the highest average IMDb rating. Prime Video has taken the opposite tack and has a lot of titles but their quality is lacking compared to the other services.