Na’ol Kebede
April 2, 2023
For this assignment I will be looking at the Netflix Movies and TV Shows dataset. This tabular dataset lists all the movies and TV shows available on Netflix, along with details such as cast, directors, ratings, release year, and duration.
With this data, I will explore the distribution of entertainment on Netflix and try to understand emerging trends.
Let's look at the data first to make sure it is ready for our visualizations: it should have no null values, and each variable should be in the correct domain.
## loading necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
## setting up environment
df = pd.read_csv("netflix_titles.csv")
df.head()
## show_id ... description
## 0 s1 ... As her father nears the end of his life, filmm...
## 1 s2 ... After crossing paths at a party, a Cape Town t...
## 2 s3 ... To protect his family from a powerful drug lor...
## 3 s4 ... Feuds, flirtations and toilet talk go down amo...
## 4 s5 ... In a city of coaching centers known to train I...
##
## [5 rows x 12 columns]
# Percentage of missing values in each column that has gaps
for i in ["director", "cast", "country", "date_added", "rating"]:
    print("{} missing values: {}".format(i, round(df[i].isna().sum()*100/len(df),2)))
## director missing values: 29.91
## cast missing values: 9.37
## country missing values: 9.44
## date_added missing values: 0.11
## rating missing values: 0.05
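The same check can be run over every column at once. Below is a minimal sketch, assuming a small made-up frame named `sample` in place of `netflix_titles.csv` (the name and the values are invented for illustration). `isna().mean()` gives the fraction of missing entries per column, which is the same quantity as `isna().sum() / len(df)`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Netflix table; the values are made up
sample = pd.DataFrame({
    "director": ["A", np.nan, np.nan, "D"],
    "cast": ["w", "x", np.nan, "z"],
    "rating": ["PG", "R", "TV-MA", "TV-14"],
})

# Share of missing values per column, as a percentage
missing_pct = (sample.isna().mean() * 100).round(2)

# Keep only the columns that actually have gaps, largest first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

This avoids having to know in advance which columns contain nulls.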
df['country'] = df['country'].fillna(df['country'].mode()[0])
df['cast'] = df['cast'].fillna('No Data')
df['director'] = df['director'].fillna('No Data')
df.dropna(inplace=True)
df.drop_duplicates(inplace= True)
df["date_added"] = pd.to_datetime(df['date_added'])
df['month_added']=df['date_added'].dt.month
df['month_name_added']=df['date_added'].dt.month_name()
df['year_added'] = df['date_added'].dt.year
df.info()
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 8790 entries, 0 to 8806
## Data columns (total 15 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 show_id 8790 non-null object
## 1 type 8790 non-null object
## 2 title 8790 non-null object
## 3 director 8790 non-null object
## 4 cast 8790 non-null object
## 5 country 8790 non-null object
## 6 date_added 8790 non-null datetime64[ns]
## 7 release_year 8790 non-null int64
## 8 rating 8790 non-null object
## 9 duration 8790 non-null object
## 10 listed_in 8790 non-null object
## 11 description 8790 non-null object
## 12 month_added 8790 non-null int64
## 13 month_name_added 8790 non-null object
## 14 year_added 8790 non-null int64
## dtypes: datetime64[ns](1), int64(3), object(11)
## memory usage: 1.1+ MB
With missing values filled or dropped and the month and year columns added, the data is now correctly formatted for graphing.
# Ten countries with the most titles
top_producers = df["country"].value_counts().sort_values(ascending=False).head(10)
# Title counts per (country, type); the slicing below assumes every country has both types
movietvsplit = df.loc[df.country.isin(top_producers.index)].groupby(['country', 'type'])["country"].value_counts()
d = {'Movie':movietvsplit[::2].values, 'TV Show':movietvsplit[1::2].values}
movietv = pd.DataFrame(data=d, index=movietvsplit[1::2].index.get_level_values(0))
# Convert the counts to row-wise percentages
movietv = movietv.apply(lambda x: round(x*100/x.sum(),1), axis=1).sort_values("Movie", ascending=False)
ax = movietv.plot(kind='barh', stacked=True, figsize=(8,6), title="Top 10 Countries Split by Type",
                  xlabel="Percent Share",
                  ylabel="Country",
                  colormap="Wistia")
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
for c in ax.containers:
    ax.bar_label(c, label_type='center')
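The `[::2]` and `[1::2]` slicing above only lines up when every country has both a movie and a TV show entry. As a possible alternative, here is a minimal sketch using `pd.crosstab` with `normalize="index"`, which computes the row-wise shares directly. The frame `sample` and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are made up
sample = pd.DataFrame({
    "country": ["India", "India", "India", "India", "Japan", "Japan", "Egypt"],
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show", "Movie"],
})

# Row-normalised cross-tabulation: each country's row sums to 100.
# Combinations that never occur (Egypt has no TV Show here) become 0
# instead of shifting the positions of the other entries.
share = (pd.crosstab(sample["country"], sample["type"], normalize="index") * 100).round(1)
share = share.sort_values("Movie", ascending=False)
print(share)
```

Calling `share.plot(kind='barh', stacked=True)` on the result would draw the same kind of stacked chart.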
ratings_ages = {
    'TV-PG': 'Older Kids',
    'TV-MA': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'TV-Y7': 'Older Kids',
    'TV-14': 'Teens',
    'R': 'Adults',
    'TV-Y': 'Kids',
    'NR': 'Adults',
    'PG-13': 'Teens',
    'TV-G': 'Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'UR': 'Adults',
    'NC-17': 'Adults'
}
df['target_ages'] = df['rating'].replace(ratings_ages)
targetsplit = df.loc[df.country.isin(top_producers.index)].groupby(['country', 'target_ages'])["country"].value_counts()
d = {'Adults':targetsplit[::4].values,
     'Kids':targetsplit[1::4].values,
     'Older Kids':targetsplit[2::4].values,
     'Teens':targetsplit[3::4].values}
targetage = pd.DataFrame(data=d, index=targetsplit[::4].index.get_level_values(0))
targetage = targetage.apply(lambda x: round(x*100/x.sum(),0), axis=1).sort_values("Adults", ascending=False)
plt.figure(figsize=(14,10))
plt.title("Target ages proportion of content by country")
sns.heatmap(targetage, cmap='RdGy_r', annot=True)
moviebyyear = np.insert(df.groupby('year_added')['type'].value_counts().values, [3,4,5,6], 0)[::2]
tvbyyear = np.insert(df.groupby('year_added')['type'].value_counts().values, [3,4,5,6], 0)[1::2]
d = {'Movie':moviebyyear, 'TV Show':tvbyyear}
year_added = pd.DataFrame(data=d, index=np.arange(2008,2022))
year_added.plot.area(figsize=(12,8), title='Programs Added Over Time by Type', xlabel='Year', ylabel='Count',colormap='RdGy')
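The hard-coded `np.insert` positions above pad the years in which only one type was added, so they are tied to this particular snapshot of the data. Note also that `value_counts` orders each group's entries by frequency, so the positional slicing relies on movies outnumbering TV shows in every year. A minimal sketch of a version that does not depend on positions, assuming a made-up frame `sample` (invented values, not the real counts):

```python
import pandas as pd

# Hypothetical stand-in for df; the values are made up
sample = pd.DataFrame({
    "year_added": [2008, 2008, 2009, 2011, 2011, 2011],
    "type": ["Movie", "TV Show", "Movie", "TV Show", "TV Show", "Movie"],
})

# Count titles per (year, type), then pivot the types into columns.
# fill_value=0 covers a year with only one type (2009 here).
counts = sample.groupby(["year_added", "type"]).size().unstack(fill_value=0)

# Reindex over the full range so a year with no titles at all
# (2010 here) still gets a row of zeros.
full_years = range(int(counts.index.min()), int(counts.index.max()) + 1)
counts = counts.reindex(full_years, fill_value=0)
print(counts)
```

The resulting frame has one row per year and one column per type, so `counts.plot.area(...)` can be called on it directly.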
months = pd.get_dummies(df['month_name_added']).sum()
new_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
             'August', 'September', 'October', 'November', 'December']
months = months.reindex(new_order, axis=0)
plt.figure(figsize=(10,8))
ax = months.plot(kind='pie', autopct='%1.1f%%', pctdistance=.85, colormap='RdGy', title='Content by Month');
hole = plt.Circle((0,0), 0.3, fc='white')
fig1 = plt.gcf()
fig1.gca().add_artist(hole)
ax.text(0, 0, f'Largest slice:\n {months.idxmax()}', ha='center', va='center')
plt.show()
# Count titles per rating and type; crosstab fills missing combinations with 0
# (assumes the `type` column holds the values 'Movie' and 'TV Show')
rating = pd.crosstab(df['rating'], df['type'])
rating.columns.name = None
# Negate the TV show counts so their bars extend below the axis
rating['TV Show'] = -rating['TV Show']
rating = rating.sort_values('Movie', ascending=False).drop("UR")
ax1 = rating.plot(kind='bar', stacked=True, title='Rating Count by Type', figsize=(10,8), yticks=[], colormap='RdGy')
for c in ax1.containers:
    ax1.bar_label(c)
The data shows an interesting spread across the different features. Program type appears to be an important feature, as the country, rating, and date-added breakdowns all show clear trends when split by type. Looking at the dates added, we can see a dip during 2020, which might be due to the pandemic; the month-added graph, however, shows a roughly similar rate for each month. Age demographics differ considerably between countries, which could also be tied to program type, as India and South Korea have quite different profiles. Overall, the features explored all tie into the evolution of Netflix content and display noteworthy trends.