Python Assignment

Na’ol Kebede

April 2, 2023






1. Introduction

For this assignment I will be looking at the Netflix Movies and TV Shows dataset. This tabular dataset lists all the movies and TV shows available on Netflix, along with details such as cast, directors, ratings, release year, and duration.

With this data, I will explore the distribution of entertainment on Netflix and try to understand emerging trends.

2. Data processing

Let's have a look at our data to make sure it is clean: that it has no remaining null values, that each variable has an appropriate type, and that it is ready for our visualizations.

## loading necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
## setting up environment
df = pd.read_csv("netflix_titles.csv")
df.head()
##   show_id  ...                                        description
## 0      s1  ...  As her father nears the end of his life, filmm...
## 1      s2  ...  After crossing paths at a party, a Cape Town t...
## 2      s3  ...  To protect his family from a powerful drug lor...
## 3      s4  ...  Feuds, flirtations and toilet talk go down amo...
## 4      s5  ...  In a city of coaching centers known to train I...
## 
## [5 rows x 12 columns]
for i in ["director", "cast", "country", "date_added", "rating"]:
    print("{} missing values: {}".format(i, round(df[i].isna().sum()*100/len(df),2)))
## director missing values: 29.91
## cast missing values: 9.37
## country missing values: 9.44
## date_added missing values: 0.11
## rating missing values: 0.05
df['country'] = df['country'].fillna(df['country'].mode()[0])
df['cast'] = df['cast'].fillna('No Data')
df['director'] = df['director'].fillna('No Data')
df.dropna(inplace=True)
df.drop_duplicates(inplace= True)

df["date_added"] = pd.to_datetime(df['date_added'].str.strip())

df['month_added']=df['date_added'].dt.month
df['month_name_added']=df['date_added'].dt.month_name()
df['year_added'] = df['date_added'].dt.year
df.info()
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 8790 entries, 0 to 8806
## Data columns (total 15 columns):
##  #   Column            Non-Null Count  Dtype         
## ---  ------            --------------  -----         
##  0   show_id           8790 non-null   object        
##  1   type              8790 non-null   object        
##  2   title             8790 non-null   object        
##  3   director          8790 non-null   object        
##  4   cast              8790 non-null   object        
##  5   country           8790 non-null   object        
##  6   date_added        8790 non-null   datetime64[ns]
##  7   release_year      8790 non-null   int64         
##  8   rating            8790 non-null   object        
##  9   duration          8790 non-null   object        
##  10  listed_in         8790 non-null   object        
##  11  description       8790 non-null   object        
##  12  month_added       8790 non-null   int64         
##  13  month_name_added  8790 non-null   object        
##  14  year_added        8790 non-null   int64         
## dtypes: datetime64[ns](1), int64(3), object(11)
## memory usage: 1.1+ MB

The missing values have now been filled or dropped, and the date columns needed for graphing have been added in the correct format.
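
The cleaning steps above can be tried out on a small frame first. The sketch below is only an illustration: the `toy` frame and its values are made up, and the explicit date format is an assumption about how `date_added` is written. It computes the missing share with `isna().mean()`, fills by assignment, and strips stray whitespace before parsing dates.

```python
import numpy as np
import pandas as pd

# Toy stand-in for netflix_titles.csv (values are made up for illustration)
toy = pd.DataFrame({
    "director": ["A. Director", np.nan, "B. Director", np.nan],
    "country": ["India", "India", np.nan, "Japan"],
    "date_added": ["September 25, 2021", " August 4, 2017", "July 1, 2019", np.nan],
})

# Share of missing values per column, in percent
missing_pct = (toy.isna().mean() * 100).round(2)

# Fill by assigning the result back to the column
toy["director"] = toy["director"].fillna("No Data")
toy["country"] = toy["country"].fillna(toy["country"].mode()[0])

# Drop rows that still have gaps, then parse the dates
toy = toy.dropna().copy()
toy["date_added"] = pd.to_datetime(toy["date_added"].str.strip(), format="%B %d, %Y")
toy["year_added"] = toy["date_added"].dt.year

print(missing_pct)
print(toy)
```

Assigning the filled column back avoids relying on an in-place change to a column selection, which newer pandas versions may not propagate to the parent frame.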

3. Visualization

3.1 Country Program Split

top_producers = df["country"].value_counts().sort_values(ascending=False).head(10)
movietvsplit = df.loc[df.country.isin(top_producers.index)].groupby(['country', 'type'])["country"].value_counts()
d = {'Movie':movietvsplit[::2].values, 'Tv Show':movietvsplit[1::2].values}
movietv = pd.DataFrame(data=d, index=movietvsplit[1::2].index.get_level_values(0))
movietv = movietv.apply(lambda x: round(x*100/x.sum(),1), axis=1).sort_values("Movie", ascending=False)

ax = movietv.plot(kind='barh', stacked=True, figsize=(8,6), title="Top 10 Countries Split by Type", 
                  xlabel=("Percent Share"), 
                  ylabel=("Country"), 
                  colormap="Wistia")
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
for c in ax.containers:
    ax.bar_label(c, label_type='center')
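
The percentage table behind this chart can also be built by label rather than by position. The following is a minimal sketch on a made-up `toy` frame, not the code used for the figure: `pd.crosstab` with `normalize="index"` gives the share of each type within each country directly.

```python
import pandas as pd

# Toy data (made up): two countries with different movie/TV mixes
toy = pd.DataFrame({
    "country": ["India", "India", "India", "India",
                "Japan", "Japan", "Japan", "Japan"],
    "type": ["Movie", "Movie", "Movie", "TV Show",
             "Movie", "TV Show", "TV Show", "TV Show"],
})

# Rows are countries, columns are types; normalize="index" makes each row sum to 1
split = pd.crosstab(toy["country"], toy["type"], normalize="index")

# Convert to percent and order by movie share, as in the chart above
split = (split * 100).round(1).sort_values("Movie", ascending=False)

print(split)
```

Because the columns come from the values of `type`, a country with only one kind of program would get a 0 in the other column instead of shifting the remaining rows.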

3.2 Target Age by Country

ratings_ages = {
    'TV-PG': 'Older Kids',
    'TV-MA': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'TV-Y7': 'Older Kids',
    'TV-14': 'Teens',
    'R': 'Adults',
    'TV-Y': 'Kids',
    'NR': 'Adults',
    'PG-13': 'Teens',
    'TV-G': 'Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'UR': 'Adults',
    'NC-17': 'Adults'
}

df['target_ages'] = df['rating'].replace(ratings_ages)
targetsplit = df.loc[df.country.isin(top_producers.index)].groupby(['country', 'target_ages'])["country"].value_counts()
d = {'Adults':targetsplit[::4].values,
     'Kids':targetsplit[1::4].values,
     'Older Kids':targetsplit[2::4].values,
     'Teens':targetsplit[3::4].values,}
targetage = pd.DataFrame(data=d, index=targetsplit[::4].index.get_level_values(0))
targetage = targetage.apply(lambda x: round(x*100/x.sum(),0), axis=1).sort_values("Adults", ascending=False)
  
plt.figure(figsize=(14,10))
plt.title("Target ages proportion of content by country")
sns.heatmap(targetage, cmap='RdGy_r', annot=True)
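
The `[::4]` slicing above assumes that every country has titles in all four age groups. A label-based variant, sketched here on made-up data with a shortened copy of the `ratings_ages` dictionary, does not need that assumption: `unstack(fill_value=0)` inserts an explicit 0 for any missing country and age group pair.

```python
import pandas as pd

# Shortened copy of the ratings_ages mapping, for illustration only
ratings_ages = {"TV-MA": "Adults", "R": "Adults", "TV-14": "Teens",
                "TV-Y": "Kids", "PG": "Older Kids"}

# Toy data (made up): India has no 'Kids' titles, Japan has no 'Teens' titles
toy = pd.DataFrame({
    "country": ["India", "India", "India", "India", "Japan", "Japan"],
    "rating": ["TV-MA", "TV-14", "TV-14", "PG", "TV-MA", "TV-Y"],
})

# map() leaves NaN for any rating missing from the dict, so gaps are easy to spot
toy["target_ages"] = toy["rating"].map(ratings_ages)
assert toy["target_ages"].notna().all()

# Count per (country, age group), one column per age group, 0 where none exist
counts = toy.groupby(["country", "target_ages"]).size().unstack(fill_value=0)

# Row-wise percentages, rounded as in the heatmap
share = counts.div(counts.sum(axis=1), axis=0).mul(100).round(0)

print(share)
```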

3.3 Year to Year Program Additions

moviebyyear = np.insert(df.groupby('year_added')['type'].value_counts().values, [3,4,5,6], 0)[::2]
tvbyyear = np.insert(df.groupby('year_added')['type'].value_counts().values, [3,4,5,6], 0)[1::2]
d = {'Movie':moviebyyear, 'Tv Show':tvbyyear}
year_added = pd.DataFrame(data=d, index=np.arange(2008,2022))

year_added.plot.area(figsize=(12,8), title='Programs Added Over Time by Type', xlabel='Year', ylabel='Count',colormap='RdGy')
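
The `np.insert` indices above pad years that have no TV shows by position. Note also that a grouped `value_counts()` sorts by count within each group by default, so the `[::2]` and `[1::2]` slices only line up with "Movie" and "Tv Show" as long as movies outnumber TV shows in every year. A hedged alternative, sketched on made-up data, aligns by label and fills empty years with `reindex`:

```python
import pandas as pd

# Toy data (made up): 2010 has no titles, 2009 has no TV shows,
# and 2011 has more TV shows than movies
toy = pd.DataFrame({
    "year_added": [2008, 2008, 2009, 2011, 2011, 2011],
    "type": ["Movie", "TV Show", "Movie", "TV Show", "TV Show", "Movie"],
})

# One row per year, one column per type, aligned by label
counts = pd.crosstab(toy["year_added"], toy["type"])

# Make sure every year in the range appears, with 0 where nothing was added
full_years = range(toy["year_added"].min(), toy["year_added"].max() + 1)
counts = counts.reindex(full_years, fill_value=0)

print(counts)
```

The resulting frame can be passed to `plot.area` in the same way as `year_added` above.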
  

3.4 Month to Month Program Additions

months = pd.get_dummies(df['month_name_added']).sum()
new_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
             'August', 'September', 'October', 'November', 'December']
months = months.reindex(new_order, axis=0)

plt.figure(figsize=(10,8))
ax = months.plot(kind='pie', autopct='%1.1f%%', pctdistance=.85, colormap='RdGy', title='Content by Month');
hole = plt.Circle((0,0), 0.3, fc='white')
fig1 = plt.gcf()
fig1.gca().add_artist(hole)
ax.text(0, 0, f'Largest slice:\n {months.idxmax()}', ha='center', va='center')
plt.show()
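
The monthly totals can also be obtained without building dummy columns. This is a small sketch on a made-up series: `value_counts()` followed by `reindex` puts the counts in calendar order and gives months with no titles a 0 instead of NaN.

```python
import pandas as pd

month_order = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]

# Toy month names (made up for illustration)
toy = pd.Series(["July", "January", "July", "December", "April", "July"],
                name="month_name_added")

# Count titles per month, then arrange in calendar order
months = toy.value_counts().reindex(month_order, fill_value=0)

# Percentage share per month, comparable to the pie chart labels
share = (months / months.sum() * 100).round(1)

print(months)
print(share)
```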
  

3.5 Rating Spread by Program Type

## count titles per rating and type by label, so ratings with only one type get an explicit 0
rating = pd.crosstab(df['rating'], df['type']).rename_axis(columns=None)
## negate the TV show counts so those bars extend below the axis
rating['TV Show'] = -rating['TV Show']
rating = rating.sort_values('Movie', ascending=False).drop("UR")
  
ax1 = rating.plot(kind='bar', stacked=True, title='Rating Count by Type', figsize=(10,8), yticks=[], colormap='RdGy')
for c in ax1.containers:
    ax1.bar_label(c)

4. Conclusion

The data shows an interesting spread across the features explored. Program type appears to be an important feature, since the country, rating, and date-added breakdowns all show clear trends when split by type. Looking at the dates added, we can see a dip during 2020, which might be due to the pandemic; the month-added graph, however, shows a roughly similar share of additions for each month. Target age groups differ considerably between countries, which could also be tied to program type: India and South Korea, for example, show quite different profiles. Overall, the features explored all tie into the evolution of Netflix content and display noteworthy trends.