Introduction

The dataset I chose to analyze consists of the movies and TV shows available on Netflix as of 2021. Netflix announced in 2007 that it would start streaming video; however, streaming did not take off until 2015. This analysis looks at the content available on Netflix and at the attributes that make each title unique. The visualizations examine when titles were added to Netflix, which can reveal seasonal patterns in additions. They also show where movies and TV shows were produced and which directors contribute the most content. The visualizations also compare the number of movies with the number of TV shows.

Dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import wget
import datetime

warnings.filterwarnings("ignore")

filename = '/Users/danwigley/Desktop/netflix_titles.csv'
df = pd.read_csv(filename)
df.columns
## Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
##        'release_year', 'rating', 'duration', 'listed_in', 'description'],
##       dtype='object')
df.info
## <bound method DataFrame.info of      show_id  ...                                        description
## 0         s1  ...  In a future where the elite inhabit an island ...
## 1         s2  ...  After a devastating earthquake hits Mexico Cit...
## 2         s3  ...  When an army recruit is found dead, his fellow...
## 3         s4  ...  In a postapocalyptic world, rag-doll robots hi...
## 4         s5  ...  A brilliant group of students become card-coun...
## ...      ...  ...                                                ...
## 7782   s7783  ...  When Lebanon's Civil War deprives Zozo of his ...
## 7783   s7784  ...  A scrappy but poor boy worms his way into a ty...
## 7784   s7785  ...  In this documentary, South African rapper Nast...
## 7785   s7786  ...  Dessert wizard Adriano Zumbo looks for the nex...
## 7786   s7787  ...  This documentary delves into the mystique behi...
## 
## [7787 rows x 12 columns]>

The dataset was collected by Flixable, a third-party Netflix search engine, and I downloaded it from kaggle.com. This public dataset includes 7,787 titles available on Netflix. Each title has a unique show ID, accompanied by its type (movie or TV show), director, cast, country of production, date added to Netflix, release year, age rating, duration, genre, and a brief description.
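
A quick structural check can confirm the column list and show how many values are missing in each column. The sketch below uses a small made-up frame (`toy`, with invented values and only a subset of the columns) rather than the Netflix file, so everything in it is an assumption for illustration. Note that `info` has to be called with parentheses to produce the summary.

```python
import io
import pandas as pd

# Hypothetical stand-in for a few of the dataset's columns; values are invented.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "type": ["Movie", "TV Show", "Movie"],
    "director": ["Jane Doe", None, None],
    "date_added": ["January 1, 2020", None, "March 5, 2019"],
})

# info() writes a per-column summary; passing buf captures it as text.
buffer = io.StringIO()
toy.info(buf=buffer)
summary = buffer.getvalue()

# Count missing values per column.
missing = toy.isna().sum()

print(summary)
print(missing)
```

Running the same two calls on `df` would show which columns, such as the director or the date added, have gaps that later group-by steps will silently skip.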

Findings

Where is Netflix pulling content from?

# Load only the country column and label missing values explicitly
country_df = pd.read_csv(filename, usecols = ['country'])
country_df['country'] = country_df['country'].fillna('Unknown')

# Count titles per country value, most frequent first
country_count = country_df.groupby('country').size().reset_index(name='Count')
country_count.columns = ['Country', 'Count']
country_count = country_count.sort_values('Count', ascending=False)

# Drop the placeholder rows for titles with no country listed (~ negates a boolean mask)
BadRows = country_count[country_count['Country'].str.contains('Unknown')]
country_count = country_count[~country_count['Country'].isin(BadRows.Country)]
country_count.reset_index(inplace=True, drop=True)
fig = plt.figure(figsize=(18,10))
# Create the axes first so the bars and the axis labels share one subplot
x = fig.add_subplot(1,1,1)
x.bar(country_count.loc[0:9, 'Country'], country_count.loc[0:9, 'Count'], label = 'Country Count')
#plt.legend(loc = 'upper right', fontsize =14)
plt.title('Top 10 Countries', color = 'blue', fontsize = 22)
x.set_xlabel('Title Country', fontsize=18)
x.set_ylabel('Title Count', fontsize=18)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.show()

The bar chart above identifies the top ten countries that Netflix movies and TV shows originate from. The United States has the most Netflix titles, with over 2,500. This is likely because of Hollywood, which produces a large share of the world’s movies and TV shows. India is second with just under 1,000 titles. India’s large film industry, known as “Bollywood”, is based in Mumbai and is one of the major centers of film production outside the United States. The United Kingdom is third with just under 500 titles. The steep drop-off after the United States shows how heavily Netflix’s catalog draws on the American film industry.
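
The counts above treat each distinct value of the country column as one category. If co-productions are stored as a single comma-separated string (an assumption about the file, not something shown above), each title could instead be credited to every country it lists. The sketch below illustrates that alternative on made-up data; `toy` and `count_countries` are hypothetical names.

```python
import pandas as pd

# Made-up stand-in for the 'country' column; the values are invented.
toy = pd.DataFrame({
    "country": [
        "United States",
        "India",
        "United States, India",
        None,
        "United Kingdom, United States",
    ]
})

def count_countries(frame: pd.DataFrame) -> pd.Series:
    """Count titles per individual country, crediting co-productions to each country listed."""
    countries = (
        frame["country"]
        .dropna()          # skip titles with no country recorded
        .str.split(",")    # "A, B" -> ["A", " B"]
        .explode()         # one row per (title, country) pair
        .str.strip()       # remove the space left after each comma
    )
    countries = countries[countries != ""]  # guard against trailing commas
    return countries.value_counts()

country_totals = count_countries(toy)
print(country_totals)
```

With this counting rule a title can contribute to more than one bar, so the totals across countries can exceed the number of titles.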

What part of the year is content being added?

# Convert the date_added strings to datetime values
df['date_added'] = pd.to_datetime(df['date_added'])
df['Year'] = df['date_added'].dt.year
df['Quarter'] = df['date_added'].dt.quarter
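# Cast to a nullable integer first so the label reads "Quarter 4" rather than "Quarter 4.0" when some dates are missing
df['Quarter Name'] = "Quarter " + df.Quarter.astype('Int64').astype('string')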
df['Month'] = df['date_added'].dt.month
df['Day'] = df['date_added'].dt.day
df['DayOfTheWeek'] = df['date_added'].dt.dayofweek
df['DayOfTheYear'] = df['date_added'].dt.dayofyear
df['Monthname'] = df['date_added'].dt.strftime('%B')
df['Weekday'] = df['date_added'].dt.strftime('%A')
date_added_df = df.groupby(['Month', 'Weekday'])['show_id'].count().reset_index(name='Title Count')
date_added_df = date_added_df.pivot(index = 'Month', columns = 'Weekday', values = 'Title Count')
fig2 = plt.figure(figsize=(18,12))
ax = fig2.add_subplot(1,1,1)
date_added_df.plot(kind='bar', stacked=True, ax = ax)
plt.ylabel("Number of Movies/TV Shows Added", fontsize=18, labelpad=10)
plt.xlabel("Month Added to Netflix", fontsize=18)
plt.xticks(rotation=45, horizontalalignment = 'center', fontsize = 14)
plt.yticks(fontsize=14)
labels = ['January',
'February',
'March',
'April',
'May',
'June',
'July',
'August',
'September',
'October',
'November',
'December']
ax.set_xticklabels(labels)
plt.title('Titles Added by Month and Weekday', fontsize=20)
plt.show()

The stacked bar chart above shows the number of titles added to Netflix in each month, with each bar broken down by the day of the week on which the title was added. The months with the most additions are October, November, December, and January. One likely reason is that these months include major holidays: Halloween and Christmas movies are in demand around this time of year, so Netflix adds them then. There are fewer additions during the warmer summer months, possibly because more people spend time outside, whereas people are more inclined to stay inside and watch Netflix during the colder months. Most additions occur on Fridays. This could be because the work week is ending and people have more time to watch Netflix, so there is more demand for new content.
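
The weekday comparison can also be checked numerically rather than read off the chart. The sketch below counts additions per weekday in calendar order on a few made-up dates; `toy` and `weekday_order` are hypothetical names, and the dates are invented for illustration.

```python
import pandas as pd

# Made-up addition dates; one missing value to mimic titles without a date.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4", "s5"],
    "date_added": pd.to_datetime(
        ["2020-01-03", "2020-01-10", "2020-01-06", "2020-10-02", None]
    ),
})

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# value_counts skips missing dates; reindex puts the days in calendar order
# and fills weekdays that never occur with zero.
counts = (
    toy["date_added"].dt.day_name()
    .value_counts()
    .reindex(weekday_order, fill_value=0)
)
shares = counts / counts.sum() * 100  # percentage of dated titles per weekday

print(counts)
print(shares.round(1))
```

Applying the same steps to `df['date_added']` would give the exact share of additions that fall on each weekday.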

Content Over the Years

year_added_df = df.groupby(['Year', 'type'])['show_id'].count().reset_index(name='Title Count')
year_added_df = year_added_df.pivot(index = 'Year', columns = 'type', values = 'Title Count')
# Keep only the years from 2015 onward (label-based slice on the Year index)
year_added_df = year_added_df.loc[2015:]
fig3 = plt.figure(figsize=(18,10))
ax1 = fig3.add_subplot(1,1,1)
year_added_df.plot(kind='bar', stacked=True, ax = ax1)
plt.ylabel("Number of Movies/TV Shows Added", fontsize=18, labelpad=10)
plt.xlabel("Year Added to Netflix", fontsize=18)
plt.xticks(rotation=45, horizontalalignment = 'center', fontsize = 14)
plt.yticks(fontsize=14)
labels = ['2015', '2016', '2017', '2018', '2019', '2020', '2021']
ax1.set_xticklabels(labels)
plt.title('Number of Movies/TV Shows Added to Netflix by Year', fontsize=20)
plt.show()

The bar chart above shows the number of titles added each year starting in 2015, with each bar split into movies and TV shows. 2019 saw the most additions as Netflix grew in popularity around the world and streaming services became a main way for people to watch movies and TV shows. Netflix added more content every year until 2020, when the number dropped. This drop is likely due to the Covid-19 pandemic, which disrupted the film industry and kept it from producing as much content as in prior years. 2021 already has about as many additions as 2015, even though the year is not yet halfway done. Note that the number of TV shows added has not increased sharply over the years, in part because a show with multiple seasons is counted only once. As the bar chart shows, Netflix adds more movies than TV shows each year.
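
Year-over-year growth and the movie share can be computed directly from a table shaped like `year_added_df`. The sketch below uses invented counts (`toy_counts` is a hypothetical name, and the numbers are not the Netflix figures) to show the calculation.

```python
import pandas as pd

# Invented yearly counts in the same layout as the pivoted table: one row per year,
# one column per title type.
toy_counts = pd.DataFrame(
    {"Movie": [100, 150, 120], "TV Show": [50, 50, 40]},
    index=pd.Index([2018, 2019, 2020], name="Year"),
)

totals = toy_counts.sum(axis=1)                    # all titles added per year
yoy_change = totals.pct_change() * 100             # percent change from the prior year
movie_share = toy_counts["Movie"] / totals * 100   # percent of each year's additions that are movies

print(yoy_change.round(2))
print(movie_share.round(2))
```

The first year has no prior year to compare against, so its change is reported as missing.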

Titles by Quarter

pie_df = df.groupby(['Quarter Name', 'Monthname'])['show_id'].count().reset_index(name='TitleCount')
pie_df
number_outside_colors = len(pie_df['Quarter Name'].unique())
outside_color_ref_number = np.arange(number_outside_colors)*4
print(outside_color_ref_number)
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(1,1,1)

colormap = plt.get_cmap("tab20c")
outer_colors = colormap(outside_color_ref_number)

total_titles = pie_df.TitleCount.sum()
# print(total_titles)

pie_df.groupby(['Quarter Name'])['TitleCount'].sum().plot(
        kind='pie', radius=1, colors= outer_colors, pctdistance=0.85, labeldistance=1.1,
        wedgeprops = dict(edgecolor='w'), textprops={'fontsize':14}, 
        autopct = lambda p: '{:.2f}%\n{:.0f} titles'.format(p,(p/100)*total_titles),
        startangle=90)

hole = plt.Circle((0,0),0.3, fc='white')
fig1 = plt.gcf()
fig1.gca().add_artist(hole)

ax.yaxis.set_visible(False)
plt.title('Added Titles by Quarter', fontsize=18)

ax.text(0,0, 'Total Titles Added\n' + str(total_titles), ha = 'center', va= 'center', fontsize=18)

ax.axis('equal')
plt.tight_layout()
plt.show()

The donut chart above shows the percentage of titles added in each quarter of the year. Across all years, Netflix has added the most titles in quarter 4: 2,356 of the 7,777 titles with a recorded date added, or 30.29%. Again, this may be attributable to the cold holiday season, which encourages staying inside. Quarter 2 accounts for the fewest additions, with 21.68%. Quarter 2, which includes April, May, and June, tends to bring warmer weather, which may reduce streaming. The donut chart suggests that Netflix’s additions follow a somewhat seasonal cycle, with certain months attracting more business.
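
The quarter shares can be reproduced without a chart. The sketch below computes the share per quarter on a few made-up dates (`toy_dates` and `quarter_shares` are hypothetical names) and then rechecks the quarter 4 percentage from the two counts reported above.

```python
import pandas as pd

def quarter_shares(dates: pd.Series) -> pd.Series:
    """Percentage of dated titles per calendar quarter; missing dates are ignored."""
    quarters = dates.dropna().dt.quarter
    return quarters.value_counts(normalize=True).sort_index() * 100

# Made-up dates: one in Q1, one in Q2, two in Q4, and one missing.
toy_dates = pd.Series(
    pd.to_datetime(["2019-01-15", "2019-05-01", "2019-10-31", "2019-12-25", None])
)
toy_shares = quarter_shares(toy_dates)

# Arithmetic check of the reported figure: 2,356 quarter 4 titles out of 7,777.
reported_q4_share = 2356 / 7777 * 100

print(toy_shares)
print(f"{reported_q4_share:.2f}%")
```

Because rows with a missing date are dropped before counting, the denominator here is the number of dated titles, not the number of rows in the file.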

Directors on Netflix

director_df = df.groupby(['director'])['show_id'].count().reset_index()
director_df = director_df.sort_values('show_id', ascending = False)
director_df.reset_index(inplace=True, drop=True)
fig6 = plt.figure(figsize=(28,16))
# Create the axes first so the bars and the axis labels share one subplot
x1 = fig6.add_subplot(1,1,1)
x1.barh(director_df.loc[:9, 'director'], director_df.loc[:9, 'show_id'], label = 'Title Count')
#plt.legend(loc = 'upper right', fontsize =14)
plt.title('Netflix Titles Added by Top Ten Directors', color = 'blue', fontsize = 24)
x1.set_xlabel('Title Count', fontsize=18)
x1.set_ylabel('Director', fontsize=19)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
plt.show()

The horizontal bar chart above shows the ten directors with the most titles on Netflix. The second most common director, Marcus Raboy, has directed 16 titles. Raboy has many titles on Netflix because most of them are stand-up comedy specials, which are common on the platform. Well-known directors such as Martin Scorsese and Steven Spielberg also appear multiple times. This shows that Netflix carries both newer films by younger directors and older classics. No single director accounts for a large number of titles, which suggests that Netflix has a balanced array of content.
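
The claim that no single director dominates can be backed with a simple share calculation. The sketch below ranks directors with `value_counts` and reports the top director's share of all titles that list a director; `toy` and `top_directors` are hypothetical names and the data is invented.

```python
import pandas as pd

# Invented director column; None mimics titles with no director listed.
toy = pd.DataFrame({"director": ["A", "A", "B", None, "C", "A", "B"]})

def top_directors(frame: pd.DataFrame, n: int = 10) -> pd.Series:
    """Return the n most frequent directors; missing values are excluded by value_counts."""
    return frame["director"].value_counts().head(n)

top = top_directors(toy, n=2)

# Share of titles with a listed director that belong to the most frequent one.
titles_with_director = toy["director"].notna().sum()
top_share = top.iloc[0] / titles_with_director * 100

print(top)
print(f"{top_share:.1f}%")
```

In this toy example the top director holds half of the titles; a small share on the real data would support the balanced-catalog reading.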

Conclusion

After visualizing the Netflix dataset, it’s clear that Netflix has grown its content count over the years as streaming services have taken off. The largest share of Netflix’s additions, about 30%, comes in quarter 4, which includes October, November, and December. The high number of additions at this time of year suggests that Netflix’s additions follow a seasonal cycle. The United States produces by far the largest share of the content Netflix has to offer. Most of the content added to Netflix is movies rather than TV shows. Finally, looking at the date added shows that Friday is the most popular day for Netflix to add content, leading into the weekend.