Introduction

Video streaming services have become increasingly common in American households. By delivering a wide variety of content to subscribers, often at a lower price than traditional cable, streaming has become a go-to for many people looking to kick back and relax at the end of a long day. Netflix was an early visionary of video streaming, but competitors have grown stronger and more numerous in recent years, giving consumers a wide range of platforms to choose from.

For any player in this industry, it is critical to be informed of the content that each service provides in order to identify strengths, weaknesses, opportunities, and threats. The purpose of this analysis is to examine the content within Netflix's catalog and describe the current state of its offering. We will investigate trends regarding content types, genres, ratings, release dates, countries of production, and more. We hope that this analysis will give readers a deeper understanding of the Netflix catalog and spark a conversation about its competitive advantages and disadvantages.

The Data

Throughout this analysis, we will use a dataset containing information on the 7,787 titles in Netflix's catalog as of January 18, 2021. This dataset was downloaded as a CSV file from Kaggle and placed in a local directory for subsequent analysis. The data contains a variety of interesting features for each title, including, but not limited to:

  • The type of the content (TV Show or Movie)
  • A list of the cast from the production
  • The country that the title was produced in
  • The date that the title was added to the Netflix platform
  • The rating of the title (either the TV Parental Guidelines rating for TV shows, or the Motion Picture Association film rating for movies)
  • The duration of the title, in either minutes or seasons
  • A list of genres associated with the title
# Import relevant modules
from math import ceil
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import matplotlib.patches as mpatches
import numpy as np
import pandas as pd

# The dataset is stored in a csv file in a local directory
file_path = 'C:/Users/Rick/Documents/Loyola/DS_736/Modules/2_Python/Project/netflix_titles.csv'

# Read in the file
netflix_df = pd.read_csv(file_path)
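
Before moving on, it can help to confirm that the columns used later in this analysis are present and to see how many values are missing in each. The following is a minimal sketch, not part of the original workflow: the helper name summarize_columns and the small sample frame are hypothetical, and the column list is simply the set of column names referenced elsewhere in this notebook.

```python
import pandas as pd

# Column names referenced elsewhere in this notebook
EXPECTED_COLUMNS = [
    'show_id', 'type', 'title', 'cast', 'country',
    'date_added', 'rating', 'duration', 'listed_in',
]

def summarize_columns(df, expected=EXPECTED_COLUMNS):
    """Return (columns missing from df, null counts for the expected columns present)."""
    missing_cols = [c for c in expected if c not in df.columns]
    present = [c for c in expected if c in df.columns]
    null_counts = df[present].isna().sum()
    return missing_cols, null_counts

# Hypothetical two-row sample standing in for netflix_df
sample = pd.DataFrame({
    'show_id': ['s1', 's2'],
    'type': ['Movie', 'TV Show'],
    'title': ['Film A', 'Show B'],
    'cast': ['Ann Lee, Bob Ray', None],
})

missing_cols, null_counts = summarize_columns(sample)
print(missing_cols)   # expected columns that the frame does not contain
print(null_counts)    # number of missing values per expected column
```

With the real data, the call would be summarize_columns(netflix_df).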

Titles Released Over Time

Netflix regularly updates its catalog, adding movies and television shows to its taxonomy while removing older content from the platform, in an effort to stay relevant to streaming users' ever-changing interests. A question we may ask is: when was the material currently available on Netflix added to the platform? We investigate this question by plotting the number of titles added in each year and quarter, with separate lines for movies and television shows.

###########################
# Start Data Preparation
###########################

# Extract relevant columns
title_by_qtr = netflix_df.loc[:, ['type', 'date_added']]

# Convert date field to datetime
# (strip stray whitespace first, since padded date strings can fail to parse)
title_by_qtr['date_added'] = pd.to_datetime(title_by_qtr['date_added'].str.strip())

# Drop rows with missing values
title_by_qtr.dropna(inplace = True)

# Extract the year and convert to integer
title_by_qtr['year_added'] = title_by_qtr['date_added'].dt.year
title_by_qtr['year_added'] = title_by_qtr['year_added'].astype(int)

# Extract the quarter
title_by_qtr['quarter_added'] = 'Q' + title_by_qtr['date_added'].dt.quarter.astype(str)

# Tabulate titles released by content type, year, and quarter
title_by_qtr = title_by_qtr.groupby(['type', 'year_added', 'quarter_added'])['year_added'].count().reset_index(name = 'titles_released')

# Combine year and quarter into a period feature
title_by_qtr['period_added'] = title_by_qtr['year_added'].astype(str) + ' ' + title_by_qtr['quarter_added']

# Extract relevant columns
title_by_qtr = title_by_qtr[['type', 'period_added', 'titles_released']]

# Next we will add in rows for periods that had no data available

# Define the scope of our analysis
year_scope = np.arange(2008, 2022)
quarter_scope = [1, 2, 3, 4]

# Iteratively check that each period has a record for each type (Movie or TV Show)
# Add in rows when a missing period is detected 
for period in [str(y) + ' Q' + str(q) for y in year_scope for q in quarter_scope]:
    if period not in title_by_qtr.loc[title_by_qtr['type'] == 'Movie', 'period_added'].values:
        new_row = pd.DataFrame(
            {
                'type': ['Movie'],
                'period_added': [period],
                'titles_released': [0]
            }
        )
        title_by_qtr = pd.concat([title_by_qtr, new_row], axis = 0)
    if period not in title_by_qtr.loc[title_by_qtr['type'] == 'TV Show', 'period_added'].values:
        new_row = pd.DataFrame(
            {
                'type': ['TV Show'],
                'period_added': [period],
                'titles_released': [0]
            }
        )
        title_by_qtr = pd.concat([title_by_qtr, new_row], axis = 0)

# Drop a few excess periods that are not needed
# (This dataset only went up to Q1 of 2021)
title_by_qtr = title_by_qtr.loc[~title_by_qtr['period_added'].isin(['2021 Q2', '2021 Q3', '2021 Q4']),:]

# Sort values and reset index
title_by_qtr = title_by_qtr.sort_values(['type', 'period_added'])
title_by_qtr.reset_index(drop = True, inplace = True)
###########################
# Start Visualization
###########################

fig = plt.figure(figsize = (18, 10))

ax = fig.add_subplot(1, 1, 1)

ax.set_facecolor('black')

ax.plot(
    title_by_qtr.loc[title_by_qtr['type'] == 'Movie', 'period_added'], 
    title_by_qtr.loc[title_by_qtr['type'] == 'Movie', 'titles_released'],
    label = 'Movies',
    color = 'red',
    marker = '$M$',
    linewidth = 1.5,
    markersize = 12,
    linestyle = 'dotted'
)
ax.plot(
    title_by_qtr.loc[title_by_qtr['type'] == 'TV Show', 'period_added'], 
    title_by_qtr.loc[title_by_qtr['type'] == 'TV Show', 'titles_released'],
    label = 'TV Shows',
    color = 'white',
    marker = '$TV$',
    linewidth =1.5,
    markersize = 12,
    linestyle = 'dotted'
)
plt.xticks(rotation = 90)
plt.legend(facecolor = 'lightgrey', title = 'Content Type:')
plt.title('Number of Titles Released by Year and Quarter\n Netflix Catalog as of Jan 18, 2021', 
          fontsize = 18)
plt.xlabel('Year and Quarter', fontsize = 14)
plt.ylabel('Titles Released', fontsize = 14)
plt.yticks(np.arange(0, title_by_qtr['titles_released'].max() + 100, 50))
plt.grid(linestyle = 'dotted', linewidth = 0.5)

plt.show()

We can observe that the oldest content still on Netflix was added in the first quarter of 2008, but such long-standing titles are rare. Relatively few of the existing titles were added prior to 2016, and the bulk of the current catalog was added within the 5 years preceding the January 2021 snapshot. In particular, the 4th quarter of 2019 is the period with the most content still available on Netflix; this is true for both movies and television shows, as each reaches its maximum in that quarter.
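
The tabulation behind this plot fills in empty quarters with a loop. As an alternative, the same zero-filled table can be sketched with pandas quarterly periods and a reindex. This is a minimal illustration on a hypothetical four-row sample rather than the code used for the figure, and it assumes dates are formatted like 'January 1, 2020'.

```python
import pandas as pd

# Hypothetical sample standing in for netflix_df[['type', 'date_added']]
sample = pd.DataFrame({
    'type': ['Movie', 'Movie', 'TV Show', 'Movie'],
    'date_added': ['January 1, 2020', ' March 15, 2020', 'August 4, 2020', None],
})

# Strip stray whitespace, then parse; values that do not parse become NaT
sample['date_added'] = pd.to_datetime(
    sample['date_added'].str.strip(), format='%B %d, %Y', errors='coerce'
)

# Drop undated rows and label each title with its quarter, e.g. 2020Q1
dated = sample.dropna(subset=['date_added']).assign(
    period_added=lambda d: d['date_added'].dt.to_period('Q')
)

# Count titles per content type and quarter
counts = dated.groupby(['type', 'period_added']).size()

# Build every (type, quarter) pair in scope and fill the gaps with 0
all_periods = pd.period_range('2020Q1', '2020Q4', freq='Q')
full_index = pd.MultiIndex.from_product(
    [['Movie', 'TV Show'], all_periods], names=['type', 'period_added']
)
counts = counts.reindex(full_index, fill_value=0).reset_index(name='titles_released')
print(counts)
```

Because the quarters are real period values rather than strings, they sort chronologically without any extra handling; they can be converted with astype(str) when string axis labels are preferred.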

Analysis of Content Length

Streaming is a serious time commitment for some customers, and the rise of 'binge-watching' is closely associated with Netflix's growth in popularity. It can be informative to know what you're getting yourself into when starting a new TV show or movie, so we'll next investigate the content lengths of the titles in our dataset.

For movies in our dataset, the content duration is measured in minutes, whereas the television shows in our dataset have their durations measured in number of seasons. We'll plot separate histograms for each type and inspect the results below:

###########################
# Start Data Preparation
###########################

# Extract relevant columns
duration_df = netflix_df.loc[:,['type', 'duration']]

# TV shows are measured in seasons and movies are measured in minutes
# We will remove these unit strings from the duration column
# ('Seasons' is listed before 'Season' so the longer string is removed first)
str_to_remove = ['Seasons', 'Season', 'min']
for s in str_to_remove:
    duration_df['duration'] = duration_df['duration'].str.replace(s, '', regex = False)

# Now convert the column to an integer
duration_df['duration'] = duration_df['duration'].str.strip().astype(int)
###########################
# Start Visualization
###########################

fig = plt.figure(figsize = (18, 10))

plt.suptitle('Histograms of Content Length for Movies and TV Shows\n Netflix Catalog as of Jan 18, 2021', fontsize = 18)
ax1 = fig.add_subplot(2, 1, 1)

ax1.set_facecolor('black')

movie_hist = ax1.hist(
    duration_df.loc[duration_df['type'] == 'Movie', 'duration'],
    label = 'Movies',
    color = 'red',
    edgecolor = 'white',
    bins = np.arange(0, duration_df.loc[duration_df['type'] == 'Movie', 'duration'].max() + 15, 15)
)

ax1.legend(facecolor = 'lightgrey', title = 'Content Type:')
ax1.set_xlabel('Movie Length in Minutes', fontsize = 14)
ax1.set_xticks(np.arange(0, duration_df.loc[duration_df['type'] == 'Movie', 'duration'].max() + 15, 15))

# Annotate each bar (bins are 15 minutes wide and start at 0, so bar i is centered at 15*i + 7.5)
for i, height in enumerate(movie_hist[0]):
    ax1.text(
        15*i + (15/2),
        height + 50,
        format(int(height),','),
        color = 'white',
        va = 'center', ha = 'center',
        fontweight = 'bold'
    )
    
ax1.set_ylim(0, movie_hist[0].max() + 200)
ax1.grid(linestyle = 'dotted', linewidth = 0.5)

ax1.get_yaxis().set_major_formatter(FuncFormatter(lambda x, p: format(int(x), ',')))

ax2 = fig.add_subplot(2, 1, 2)

ax2.set_facecolor('black')

tv_hist = ax2.hist(
    duration_df.loc[duration_df['type'] == 'TV Show', 'duration'],
    label = 'TV Shows',
    color = 'white',
    edgecolor = 'red',
    bins = np.arange(1, duration_df.loc[duration_df['type'] == 'TV Show', 'duration'].max() + 2, 1)
)

ax2.set_xlabel('TV Show Length in Seasons', fontsize = 14)
ax2.set_xticks(
    ticks = np.arange(1, duration_df.loc[duration_df['type'] == 'TV Show', 'duration'].max() + 1, 1) + 0.5
)
ax2.set_xticklabels(
    labels = np.arange(1, duration_df.loc[duration_df['type'] == 'TV Show', 'duration'].max() + 1, 1)
)
ax2.set_xlim(0, 18)
ax2.legend(facecolor = 'lightgrey', title = 'Content Type:')
ax2.get_yaxis().set_major_formatter(FuncFormatter(lambda x, p: format(int(x), ',')))

# Annotate each bar
for i, height in enumerate(tv_hist[0]):
    ax2.text(
        1 + 1*i + (1/2),
        height + 50,
        format(int(height),','),
        color = 'white',
        va = 'center', ha = 'center',
        fontweight = 'bold'
    )
    
ax2.set_ylim(0, tv_hist[0].max() + 200)
ax2.grid(linestyle = 'dotted', linewidth = 0.5)

fig.add_subplot(111, frame_on=False)
plt.tick_params(labelcolor="none", bottom=False, left=False)
plt.ylabel('Count of Titles', fontsize = 14, labelpad = 20)
plt.show()

For movies, we see a roughly bell-shaped distribution, with the peak falling between 90 and 105 minutes. There are generally fewer movies as we move farther from this center. For example, there are 13 movies under 15 minutes in total length, and there is 1 movie (Black Mirror: Bandersnatch) that runs a whopping 312 minutes.

Our distribution of television shows by number of seasons tells a very different story. The majority of television shows have only a single season available on Netflix. Presumably many of these are either intentional mini-series or shows that were canceled shortly after inception. Titles become progressively rarer as the number of seasons increases. Only 10 shows have more than 10 seasons on Netflix, the longest of which is Grey's Anatomy, with 16 seasons.
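
Both histograms rely on turning duration strings into integers. One way to do this while keeping the unit alongside the number is a single regular-expression extract, sketched below. The sample frame is hypothetical, and the pattern assumes every value looks like '90 min', '1 Season', or '16 Seasons'.

```python
import pandas as pd

# Hypothetical sample standing in for netflix_df[['type', 'duration']]
sample = pd.DataFrame({
    'type': ['Movie', 'TV Show', 'TV Show'],
    'duration': ['90 min', '1 Season', '16 Seasons'],
})

# Capture the number and its unit in one pass; rows that do not match become NaN
parts = sample['duration'].str.extract(r'^(?P<length>\d+)\s+(?P<unit>min|Seasons?)$')

# astype(int) raises if any row failed to match, which surfaces unexpected formats early
sample['length'] = parts['length'].astype(int)
sample['unit'] = parts['unit']
print(sample)
```

Keeping the unit column makes it possible to check that movies are always measured in minutes and TV shows in seasons before plotting.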

Content Genres and Ratings

Netflix offers enough content to fit just about any individual's interests. Our dataset includes a list of genres associated with each title, and we can use this data to learn more about the taxonomy of titles on Netflix. Below, we plot the number of titles listed under each genre, separately for movies and television shows, color-coded by rating.

###########################
# Start Data Preparation
###########################

# Start data preparation for Movies

# Subset columns
movie_genres_by_rating = netflix_df.loc[netflix_df['type'] == 'Movie', ['rating', 'listed_in']]

# Drop missing values
movie_genres_by_rating = movie_genres_by_rating.dropna()

# Create genre list
movie_genres_by_rating['genre_list'] = movie_genres_by_rating['listed_in'].apply(lambda x: [g.strip() for g in x.split(',')])

# Remove unnecessary columns
movie_genres_by_rating = movie_genres_by_rating.loc[:,['rating', 'genre_list']]

# Extract ratings and genres into long format (one row per title-genre pair)
ratings = []
genres = []

for index, row in movie_genres_by_rating.iterrows():
    ratings.extend([row['rating']] * len(row['genre_list']))
    genres.extend(row['genre_list'])

movie_genres_by_rating = pd.DataFrame({
    'rating': ratings,
    'genre': genres
})

# Drop invalid ratings
valid_movie_ratings = ['G', 'PG', 'PG-13', 'R', 'NC-17']
movie_genres_by_rating = movie_genres_by_rating.loc[movie_genres_by_rating['rating'].isin(valid_movie_ratings),:]

# Count movies by rating and genres
tmp = movie_genres_by_rating.groupby(['rating', 'genre'])['genre'].count().reset_index(name = 'count')

# Pivot results
movie_pivot = tmp.pivot(index='genre', columns='rating', values='count').fillna(0)

# Sort values by most common genre
movie_pivot['tmp_total'] = movie_pivot.sum(axis = 1)
movie_pivot = movie_pivot.sort_values('tmp_total', ascending = True)

# Remove temp column
del movie_pivot['tmp_total']

# Start data preparation for TV Shows

# Subset columns
tv_genres_by_rating = netflix_df.loc[netflix_df['type'] == 'TV Show', ['rating', 'listed_in']]

# Drop missing values
tv_genres_by_rating = tv_genres_by_rating.dropna()

# Create genre list
tv_genres_by_rating['genre_list'] = tv_genres_by_rating['listed_in'].apply(lambda x: [g.strip() for g in x.split(',')])

# Remove unnecessary columns
tv_genres_by_rating = tv_genres_by_rating.loc[:,['rating', 'genre_list']]

# Extract ratings and genres into long format (one row per title-genre pair)
ratings = []
genres = []

for index, row in tv_genres_by_rating.iterrows():
    ratings.extend([row['rating']] * len(row['genre_list']))
    genres.extend(row['genre_list'])
    
tv_genres_by_rating = pd.DataFrame({
    'rating': ratings,
    'genre': genres
})

# Drop invalid ratings
valid_tv_ratings = ['TV-Y','TV-Y7','TV-G','TV-PG','TV-14','TV-MA']
tv_genres_by_rating = tv_genres_by_rating.loc[tv_genres_by_rating['rating'].isin(valid_tv_ratings),:]

# Count titles by genre and rating
tmp = tv_genres_by_rating.groupby(['rating', 'genre'])['genre'].count().reset_index(name = 'count')

# Pivot the results
tv_pivot = tmp.pivot(index='genre', columns='rating', values='count').fillna(0)

# Sort by most common genres
tv_pivot['tmp_total'] = tv_pivot.sum(axis = 1)
tv_pivot = tv_pivot.sort_values('tmp_total', ascending = True)

# Remove temporary column
del tv_pivot['tmp_total']
###########################
# Start Visualization
###########################

fig = plt.figure(figsize = (21, 17))

# Movies

ax1 = fig.add_subplot(2, 1, 1)

ax1.set_facecolor('black')

bar_start = np.zeros(len(movie_pivot.index))

my_cmap = plt.get_cmap('YlOrRd')

for index, rating in enumerate(valid_movie_ratings):
    ax1.barh(
        movie_pivot.index,
        movie_pivot[rating],
        left = bar_start,
        label = rating,
        edgecolor = 'black',
        linewidth = 0.35,
        color = my_cmap(index/len(valid_movie_ratings))
    )
    bar_start += movie_pivot[rating]
ax1.legend(facecolor = 'lightgrey', title = 'Movie Rating:', loc = 'lower right')
ax1.get_xaxis().set_major_formatter(FuncFormatter(lambda x, p: format(int(x), ',')))

ax1.set_title('Movies by Genre and Rating', fontsize = 16)
ax1.set_xlabel('Number of Movies', fontsize = 14)
ax1.set_ylabel('Movie Genre', fontsize = 14)
ax1.set_xticks(np.arange(0, int(ceil(movie_pivot.sum(axis = 1).max() / 100.0)) * 100, 50))
ax1.grid(linestyle = 'dotted', linewidth = 0.5)

# TV Shows

ax2 = fig.add_subplot(2, 1, 2)

ax2.set_facecolor('black')

bar_start = np.zeros(len(tv_pivot.index))

for index, rating in enumerate(valid_tv_ratings):
    ax2.barh(
        tv_pivot.index,
        tv_pivot[rating],
        left = bar_start,
        label = rating,
        edgecolor = 'black',
        linewidth = 0.35,
        color = my_cmap(index/len(valid_tv_ratings))
    )
    bar_start += tv_pivot[rating]
ax2.legend(facecolor = 'lightgrey', title = 'TV Rating:', loc = 'lower right')
ax2.get_xaxis().set_major_formatter(FuncFormatter(lambda x, p: format(int(x), ',')))

ax2.set_title('TV Shows by Genre and Rating', fontsize = 16)
ax2.set_xlabel('Number of TV Shows', fontsize = 14)
ax2.set_ylabel('TV Genre', fontsize = 14)
ax2.set_xticks(np.arange(0, int(ceil(tv_pivot.sum(axis = 1).max() / 100.0)) * 100 + 100, 100))
ax2.grid(linestyle = 'dotted', linewidth = 0.5)

plt.suptitle('Bar Charts of Genres and Ratings for Movies and TV Shows\n Netflix Catalog as of Jan 18, 2021', 
             fontsize = 18,
             y = 0.95
            )
plt.show()

For movies, we observe that dramas, comedies, and action & adventure are the 3 most common genres among Netflix titles. In terms of ratings, the most family-friendly genre is, unsurprisingly, children and family movies - a large number of these titles are rated either G or PG. On the other hand, independent movies, thrillers, horror movies, cult movies, LGBTQ movies, and stand-up comedy are most commonly suited for adults - very few of these titles are rated G or PG.

For television shows, international TV shows, TV dramas, and TV comedies make up the top 3 most common genres. Kids' shows are by far (and again unsurprisingly) the most family-friendly titles in terms of ratings. Crime TV shows, TV mysteries, and TV horror appear to be the genres most suited to adults, given their higher rates of TV-14 and TV-MA ratings.
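
The genre-by-rating tables behind these charts are built by splitting each title's genre list and counting every (genre, rating) pair. The same reshaping can be sketched more compactly with explode and crosstab. This is an illustrative alternative on a hypothetical three-row sample, not the code that produced the figures.

```python
import pandas as pd

# Hypothetical sample standing in for netflix_df[['rating', 'listed_in']]
sample = pd.DataFrame({
    'rating': ['PG', 'R', 'R'],
    'listed_in': ['Comedies, Dramas', 'Dramas', 'Dramas, Thrillers'],
})

# Split the comma-separated genres and give each genre its own row
long_df = (
    sample.assign(genre=sample['listed_in'].str.split(','))
          .explode('genre', ignore_index=True)
)
long_df['genre'] = long_df['genre'].str.strip()

# Count titles for every genre and rating combination (missing pairs become 0)
pivot = pd.crosstab(long_df['genre'], long_df['rating'])

# Order genres from least to most common, as in the horizontal bar charts
pivot = pivot.loc[pivot.sum(axis=1).sort_values().index]
print(pivot)
```

A title listed under several genres is counted once per genre, so the bar totals add up to more than the number of titles.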

Content Production by Country

In an increasingly globalized world, content from other countries is becoming more accessible, so we may be interested in analyzing our Netflix data by the country in which each title was produced. Below, we find the top 10 countries in terms of content production and plot the number of titles they have on Netflix, with separate bars for movies and television shows.

###########################
# Start Data Preparation
###########################

# Determine top 10 countries in content production
top10_countries = netflix_df.groupby('country')['show_id'].count().sort_values(ascending = False)[0:10]

# Filter dataframe to include only titles produced in these countries
country_df = netflix_df.loc[netflix_df['country'].isin(top10_countries.index),:]

# Count titles released by country and content type
country_df = country_df.groupby(['country', 'type'])['show_id'].count().reset_index(name = 'titles_released')

# Convert country to ordered categorical
country_df['country'] = pd.Categorical(country_df['country'], top10_countries.index)

# Sort by country and type
country_df = country_df.sort_values(['country', 'type'])
###########################
# Start Visualization
###########################

fig = plt.figure(figsize = (18, 10))

ax = fig.add_subplot(1, 1, 1)

ax.set_facecolor('black')

movie_bars = ax.bar(
    np.arange(len(country_df.loc[country_df['type'] == 'Movie', 'country'])) - 0.2,
    country_df.loc[country_df['type'] == 'Movie', 'titles_released'],
    label = 'Movies',
    color = 'red',
    edgecolor = 'white',
    width = 0.4
)

tv_bars = ax.bar(
    np.arange(len(country_df.loc[country_df['type'] == 'TV Show', 'country'])) + 0.2,
    country_df.loc[country_df['type'] == 'TV Show', 'titles_released'],
    label = 'TV Shows',
    color = 'white',
    edgecolor = 'red',
    width = 0.4
)

plt.xticks(rotation = 0)
plt.legend(facecolor = 'lightgrey', loc = 'center right', title = 'Content Type:')
plt.title('Top 10 Countries for Content Production\n Netflix Catalog as of Jan 18, 2021', fontsize = 18)
plt.xlabel('Country of Production', fontsize = 14)
plt.ylabel('Number of Titles Produced', fontsize = 14)
plt.yticks(np.arange(0, country_df['titles_released'].max() + 300, 100))
plt.grid(linestyle = 'dotted', linewidth = 0.5)

plt.xticks(
    np.arange(len(country_df.loc[country_df['type'] == 'Movie', 'country'])),
    country_df.loc[country_df['type'] == 'Movie', 'country']
)

# Annotate each movie bar
for bar in movie_bars:
    ax.text(
        bar.get_x() + bar.get_width()/2, 
        bar.get_height() + 20,
        format(int(bar.get_height()), ','),
        color = 'white',
        ha = 'center', va = 'center',
        fontweight = 'bold'
    )
    
# Annotate each TV bar
for bar in tv_bars:
    ax.text(
        bar.get_x() + bar.get_width()/2, 
        bar.get_height() + 20,
        format(int(bar.get_height()), ','),
        color = 'white',
        ha = 'center', va = 'center',
        fontweight = 'bold'
    )

# Add annotations for total titles produced at the top of our visualization 
for i in range(len(top10_countries)):
    ax.text(
        i,
        country_df['titles_released'].max() + 150,
        f'Total Titles:\n{format(int(top10_countries.iloc[i]), ",")}',
        color = 'white',
        ha = 'center', va = 'center',
        fontweight = 'bold',
        bbox=dict(facecolor='black', edgecolor='red')
    )
    
ax.get_yaxis().set_major_formatter(FuncFormatter(lambda x, p: format(int(x), ',')))
    
plt.show()

We can see that the United States produced the most content on Netflix, by a fairly wide margin. The U.S. provided more movies than TV shows, though both are still sizeable relative to the production from other countries. India produced the second most titles, and the majority of its contribution came in the form of movies. The U.K. produced the third most content, providing similar numbers of movies and television shows.
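
One caveat about these counts: the grouping above treats each distinct value of the country column as a single country. If a title's country field lists several countries separated by commas (for example, a co-production), that title falls into its own combined group instead of being credited to each country. Whether and how often this occurs should be checked against the data. Under that assumption, a per-country count could be sketched as follows, using a hypothetical three-row sample.

```python
import pandas as pd

# Hypothetical sample; the second title is assumed to be a co-production
sample = pd.DataFrame({
    'show_id': ['s1', 's2', 's3'],
    'type': ['Movie', 'Movie', 'TV Show'],
    'country': ['United States', 'United States, India', 'India'],
})

# Give each listed country its own row
per_country = (
    sample.dropna(subset=['country'])
          .assign(country=lambda d: d['country'].str.split(','))
          .explode('country', ignore_index=True)
)
per_country['country'] = per_country['country'].str.strip()

# Discard empty entries left behind by stray commas
per_country = per_country[per_country['country'] != '']

# Count distinct titles per country and content type
counts = (
    per_country.groupby(['country', 'type'])['show_id']
               .nunique()
               .unstack(fill_value=0)
)
print(counts)
```

With this approach a co-production counts toward every country involved, so the per-country totals can sum to more than the number of titles.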

Actor Collaboration Graphs

Another interesting feature of our Netflix data is the inclusion of cast information for each of the titles in the taxonomy. With a bit of work, we can build a process to find, and visualize, the cast members that a particular actor has collaborated with. Think of this exercise as a helpful tool to make the parlor game Six Degrees of Kevin Bacon a bit easier.

In this example, we'll arbitrarily select the actor Jack Nicholson as our base actor that we want to visualize. The result is a graph-like structure that maps Nicholson to his collaborators on the Netflix platform, color-coded by content title.

###########################
# Start Data Preparation
###########################

# First we will define a function that given an actor's name,
# will return a list of other actors that they have collaborated with,
# along with the name of the title collaborated on.
def find_collaborators(origin):
    collaborators = []
    origin = origin.lower()
    # Build the row filter once; regex = False matches the name literally
    mask = netflix_df['cast'].fillna('NA').str.lower().str.contains(origin, regex = False)
    cast_series = netflix_df.loc[mask, 'cast']
    title_series = netflix_df.loc[mask, 'title'].reset_index(drop = True)
    for i, cast in enumerate(cast_series):
        for member in cast.split(','):
            if (member.strip().lower() != origin):
                collaborators.append((member.strip(), title_series[i]))
    return collaborators

# Find collaborators for the actor Jack Nicholson
collabs = find_collaborators('Jack Nicholson')
collab_df = pd.DataFrame(
    {
        'Collaborator': [item[0] for item in collabs],
        'Title': [item[1] for item in collabs]
    }
)

# Add an angle column to evenly space out collaborators in our graph 
angle_step = 2*np.pi/len(collab_df['Collaborator'])
collab_df['angle'] = [i*angle_step for i in range(0, len(collab_df['Collaborator']))]
###########################
# Start Visualization
###########################

fig = plt.figure(figsize = (16, 16), facecolor = 'black')

colors = {
    'Anger Management':'red',
    'As Good as It Gets':'yellow',
    "Something's Gotta Give":'blue',
    'The Bucket List':'orange',
    'The Departed':'green',
}

anger = mpatches.Patch(color = 'red', label = 'Anger Management')
good = mpatches.Patch(color = 'yellow', label = 'As Good as It Gets')
something = mpatches.Patch(color = 'blue', label = "Something's Gotta Give")
bucket = mpatches.Patch(color = 'orange', label = 'The Bucket List')
departed = mpatches.Patch(color = 'green', label = 'The Departed')

ax = fig.add_subplot(1, 1, 1, projection = 'polar')

ax.set_facecolor('black')

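for c in range(len(collab_df)):
    # Convert the angle from radians to degrees for the label rotation
    rot = np.degrees(collab_df['angle'][c])
    # Flip labels on the left half of the circle so they are not upside down
    if 90 < rot < 270:
        rot += 180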
    ax.text(
        collab_df['angle'][c],
        1.40,
        collab_df.loc[c,'Collaborator'],
        rotation = rot,
        va = 'center',
        ha = 'center',
        color = 'white',
        fontsize = 14
    )
    ax.plot(
        [0, collab_df['angle'][c]],
        [0, 1],
        color = colors[collab_df['Title'][c]],
        zorder = 1
    )
ax.scatter(
    collab_df['angle'],
    [1] * len(collab_df['Collaborator']),
    color = collab_df['Title'].apply(lambda x: colors[x]),
    edgecolor = 'black',
    s = 300,
    zorder = 2
    
)
    
ax.set_ylim(0, 1.75)
ax.set_yticklabels([])
ax.grid(False)

plt.title('Collaborators with Actor Jack Nicholson\n Netflix Catalog as of Jan 18, 2021', fontsize = 18, color = 'white')
ax.legend(
    title = 'Title of Content Collaborated With:',
    handles = [anger, good, something, bucket, departed], 
    loc = [.9,0],
    facecolor = 'white',
    framealpha = 1.0,
    edgecolor = 'red'
)
ax.fill_between(np.linspace(0.0, 2*np.pi,100), [0.4]*100, color = 'white', zorder = 3)
ax.text(0, 0, 'Jack Nicholson',
       va = 'center', 
       ha = 'center', fontsize = 14)
plt.xticks([])
plt.show()

We can see that Nicholson collaborated on 5 Netflix titles, all of which happen to be movies. He's worked with a wide variety of individuals, including Morgan Freeman, Leonardo DiCaprio, Matt Damon, Adam Sandler, Woody Harrelson, Helen Hunt, and Diane Keaton to name a few.
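
The lookup above selects titles by checking whether the actor's name appears anywhere in the cast string, so a name that is a substring of another name would also match. A stricter variant compares whole cast-member names instead. The sketch below is an illustrative alternative: the function name find_collaborators_exact, its DataFrame argument, and the sample data are all hypothetical.

```python
import pandas as pd

def find_collaborators_exact(df, origin):
    """Return (collaborator, title) pairs for titles whose cast lists `origin` exactly."""
    target = origin.strip().lower()
    pairs = []
    for cast, title in zip(df['cast'].fillna(''), df['title']):
        # Split the cast string into clean individual names
        members = [m.strip() for m in cast.split(',') if m.strip()]
        # Keep the title only if one member matches the full name
        if target in (m.lower() for m in members):
            pairs.extend((m, title) for m in members if m.lower() != target)
    return pairs

# Hypothetical sample: 'Ann Lee' is a substring of 'Ann Leeson'
sample = pd.DataFrame({
    'title': ['Film A', 'Film B', 'Film C'],
    'cast': ['Ann Lee, Bob Ray', 'Ann Leeson, Cy Day', None],
})

pairs = find_collaborators_exact(sample, 'Ann Lee')
print(pairs)
```

Passing the DataFrame in as an argument, rather than reading a global, also makes the function easier to test on small samples like this one.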

Conclusion

Throughout our analysis, we've investigated the release dates of the content on Netflix, analyzed the content durations on the platform, stratified titles by genres and ratings, and reviewed the major countries that produce content for the platform. We also created a fun tool to visualize the collaborators of a given actor based on the cast information provided in our Netflix dataset.

As competitors continue to enter and fight in the streaming wars, it's important for companies such as Netflix to have a deep understanding of their content and how it compares to that of their competitors. Understanding the strengths and weaknesses of a streaming catalog can lead to more informed business decisions, and it is critical that companies such as Netflix routinely perform these analyses to ensure that they are not being left behind in an increasingly competitive industry.