Introduction

Television entertainment has changed dramatically in the short time it has been around. It's hard to believe how far the technology has come since TVs became widely available in 1947. In just 77 years, the industry has seen several major trends emerge.

Dataset

This dataset comprises 1,816 rows, each telling the story of a different animated TV show. It reaches back to the late 1940s, giving us a great view of the trends that have emerged over the years. Among the most important columns are the title, episode count, years aired, original channel, production companies, animation techniques, and ratings from Google and IMDb. Check out the dataset on Kaggle.com (published March 8, 2024).

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
import matplotlib.patches as mpatches
import seaborn as sns
from matplotlib.ticker import FuncFormatter

# The dataset can't be downloaded directly from Kaggle here, so it was downloaded manually to a local folder
data = 'Data/Animated_Shows.csv'

Findings

One interesting finding is that people have grown less satisfied with TV shows over the years; perhaps the massive increase in the number of shows airing has made viewers far more critical. Another finding that really surprised me: Sesame Street has 4,633 episodes, while the closest second, The Simpsons, has 753. Finally, the COVID year brought a big increase in the number of shows being produced, followed quickly by a downturn as the pandemic faded.

Descriptive Statistics

The dataset was in great condition and needed only minimal cleanup, with the exception of the Technique column. That column gets messy quickly because shows often use many different animation techniques over the course of their development and overall run.

# This will create and manipulate the data for the visualization
data = 'Data/Animated_Shows.csv'
df = pd.read_csv(data, encoding='latin1')


# Print the number of rows and columns in the dataframe
print("Number of rows:", df.shape[0])
## Number of rows: 1816
print("Number of columns:", df.shape[1])
## Number of columns: 10
# Print the column names
print("Column names:", df.columns.tolist())
## Column names: ['Id', 'Title', 'Episodes', 'Year', 'Original channel', 'American company', 'Note', 'Technique', 'IMDb', 'Google users']
# Print the first few rows of the dataframe
print("First few rows:")
## First few rows:
print(df.head())
##    Id                     Title  Episodes  ...    Technique IMDb Google users
## 0   1             2 Stupid Dogs      26.0  ...  Traditional  7.2          91%
## 1   2           3-2-1 Penguins!      27.0  ...          CGI  6.5          91%
## 2   3                   3-South      13.0  ...  Traditional  8.0          87%
## 3   4  3Below: Tales of Arcadia      28.0  ...          CGI  7.6          77%
## 4   5                3rd & Bird      51.0  ...        Flash  7.8          92%
## 
## [5 rows x 10 columns]
# Print the summary statistics of numerical columns
print("Summary statistics:")
## Summary statistics:
print(df.describe())
##                 Id     Episodes         IMDb
## count  1816.000000  1798.000000  1812.000000
## mean    908.500000    45.366518     6.712031
## std     524.378362   120.085662     1.228804
## min       1.000000     1.000000     1.500000
## 25%     454.750000    13.000000     6.200000
## 50%     908.500000    26.000000     6.900000
## 75%    1362.250000    52.000000     7.500000
## max    1816.000000  4633.000000     9.300000
# Print the number of missing values in each column
print("Missing values per column:")
## Missing values per column:
print(df.isnull().sum())
## Id                     0
## Title                  0
## Episodes              18
## Year                   0
## Original channel       7
## American company      45
## Note                1328
## Technique              0
## IMDb                   4
## Google users           4
## dtype: int64
# Print the unique values in the 'Technique' column
print("Unique techniques:", df['Technique'].unique())
## Unique techniques: ['Traditional' 'CGI' 'Flash' 'Stop-Motion' 'Traditional/Live-Action'
##  'CGI/Flash/Stop-Motion/Live-Action' 'Flash/Live-Action'
##  'CGI/Flash/Stop-motion/Traditional'
##  'Traditional (Seasons 1-15, 20-25)\r\nCel (Seasons 1-3)\r\nDigital ink-and-paint (Seasons 4-15, 20-25)\r\nFlash (Seasons 16-25)\r\nAdobe Flash (Seasons 16-19)\r\nToon Boom Harmony (Seasons 20-25)'
##  'Traditional/Live-Action/Flash/CGI' 'CGI/Live-Action'
##  'Traditional (season 1)/Flash (season 2-3)' 'Flash/Traditional/CGI'
##  'Traditional/Stop-Motion/CGI/Flash'
##  'Traditional (season 1)/Flash (season 2)'
##  'Traditional/Digital ink-and-paint (seasons 1-5)/Flash/Adobe Flash (season 6-present)'
##  'Flash/Traditional' 'Traditional (seasons 1-9)/Flash (season 10)'
##  'Stop-motion' 'Flash/Live-action' 'Traditional/Live-action'
##  'Live-Action/Traditional' 'Live-Action/Flash' 'CGI/Flash/Live-action'
##  'CGI/Live-action'
##  'Traditional/Digital ink-and-paint (season 1)/Flash/Adobe Flash (seasons 2-6)'
##  'Traditional/CGI' 'Traditional/Flash' 'Traditional/CGI/Stop-Motion'
##  'Traditional/CGI/Stop-Motion/Live-Action' 'Stop-Motion/Live-Action'
##  'Stop-Motion/CGI/Flash' 'Flash/Live action'
##  'Traditional/Flash/CGI/Stop-Motion' 'CGI/Flash' 'CGI/Traditional/Flash'
##  'CGI/Flash/Stop-Motion/Traditional/Live-action'
##  'CGI/Stop-Motion/Flash/Traditional' 'Flash/Live-Action/Traditional/CGI'
##  'Flash/Traditional/Live-Action' 'Stop-motion/Live-action']
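
As a rough gauge of how messy this column is, here is a minimal sketch (reusing the df loaded above) that strips the parenthetical season notes, treats the embedded newlines and slashes as separators, and tallies the distinct base techniques; the stacked bar graph section below applies the same normalization in full:

# Normalize combined technique strings like 'CGI/Flash/Stop-Motion' into
# individual base techniques and count how many distinct ones remain
base_techniques = (df['Technique']
                   .str.replace(r'\s*\([^)]*\)', '', regex=True)  # drop '(Seasons 1-15)' notes
                   .str.replace('\r\n', '/', regex=False)         # newlines act as separators
                   .str.split('/')
                   .explode()
                   .str.strip()
                   .str.title())
print('Distinct base techniques:', base_techniques.nunique())
print(base_techniques.value_counts().head())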

Scatterplot

This graph shows an interesting correlation between a show's release year and its Google Users rating. Yearly averages sat between 80% and 90% from 1965 to 2004.

After that period of contentment, viewers grew dissatisfied with their shows. From 2004 to 2024, average user satisfaction declined roughly linearly from about 80% to 65% (a simple linear fit after the plotting code below puts a number on this). The decline also coincided with the increase in the number of shows released; the year with the most releases was 2021, with around 80 new shows entering the dataset.

# This will be for the function to split the year column into two columns. 
# I.e: 2005-2008 to StartYear : 2005 EndYear : 2008
# Handles scenarios where a show was cancelled and came back. 2001-2003, 2005-2008
def extract_first_start_last_end(years):
    current_year = datetime.now().year  
    periods = [period.split('-') for period in years.split(', ')]
    start_years = [int(period[0]) for period in periods]
    end_years = []
    for period in periods:
        if period[-1] == 'present':
            end_years.append(current_year)  # Use current year for 'present'
        else:
            end_years.append(int(period[-1]))
    return min(start_years), max(end_years)
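
# Quick sanity checks (input formats assumed from the comments above):
#   extract_first_start_last_end('2005-2008')            -> (2005, 2008)
#   extract_first_start_last_end('2001-2003, 2005-2008') -> (2001, 2008)
#   extract_first_start_last_end('2010-present')         -> (2010, current year)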

# Cleans the Google users rating column by removing percent signs
# Handles scenarios where a show was cancelled and came back, e.g. '90%, 80%, 70%'
def clean_google_users_rating(rating):
    if isinstance(rating, str):
        # Split on '%' and strip comma/space separators so multi-value
        # strings parse cleanly, then average the individual ratings
        ratings = [float(r.strip(' ,')) for r in rating.split('%') if r.strip(' ,')]
        return sum(ratings) / len(ratings) if ratings else None
    return rating
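
# Quick sanity checks (assumed input formats):
#   clean_google_users_rating('91%')           -> 91.0
#   clean_google_users_rating('90%, 80%, 70%') -> 80.0
#   Non-string values (e.g. NaN) pass through unchanged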

# Creating the dataframe for visualization 1
df = pd.read_csv(data, encoding='latin1', usecols=['Year', 'Title', 'Google users', 'IMDb'])

# This will be for cleaning the Google users rating column
df['Google users'] = df['Google users'].apply(clean_google_users_rating)

# Applying the function to clean the year column and create two new columns
df[['StartYear', 'EndYear']] = pd.DataFrame(df['Year'].apply(lambda x: extract_first_start_last_end(x)).tolist(), index=df.index)

# Drop rows with NaN values
df = df[df['IMDb'].notna() & df['Google users'].notna()]  

# Group by 'StartYear' to calculate the average Google user rating and count the shows
agg_df = df.groupby('StartYear').agg(
    avg_google_rating=('Google users', 'mean'),
    count=('Title', 'count')
).reset_index()


# Creating the first visualization
plt.figure(figsize=(18,10))
plt.scatter(agg_df['StartYear'], agg_df['avg_google_rating'],
             marker='8',label='Google Ratings',c=agg_df['count'], cmap='viridis',
             s=agg_df['count']*10, edgecolors='black')

plt.title('Average Google User Rating per Release Year', fontsize=20)
plt.xlabel('Release Year', fontsize=15)
plt.ylabel('Average Google User Rating', fontsize=15)

plt.colorbar().set_label('Number of Shows',rotation=270,fontsize=15,color='black', labelpad=20)

my_x_ticks = [*range(agg_df['StartYear'].min(), agg_df['StartYear'].max()+1, 5)]
plt.xticks(my_x_ticks, fontsize=12)
my_y_ticks = np.arange(np.floor(agg_df['avg_google_rating'].min()), np.ceil(agg_df['avg_google_rating'].max())+1, 2.5)
plt.yticks(my_y_ticks, fontsize=12)
plt.gca().set_yticklabels(['{:.0f}%'.format(y) for y in my_y_ticks])
plt.show()
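
To put a number on the decline described above, here is a minimal sketch (reusing the agg_df built above; the 2004 cutoff comes from the trend noted earlier) that fits a least-squares line to the post-2004 yearly averages:

# Estimate the post-2004 trend in average ratings with a simple linear fit
recent = agg_df[agg_df['StartYear'] >= 2004]
slope, intercept = np.polyfit(recent['StartYear'], recent['avg_google_rating'], 1)
print('Estimated change: {:.2f} percentage points per year'.format(slope))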

Top Bar Charts

These bar graphs show the top 10 shows by episode count and the top 10 rated shows on IMDb. In the episode-count chart, one show towers above the rest: Sesame Street holds a 3,880-episode lead over The Simpsons in second place. After that outlier of more than 4,600 episodes, the rest range from roughly 750 down to 325.

None of the shows with the most episodes are contenders for the highest-rated list. When analyzing the highest-rated shows, I was surprised to see Rick and Morty in only 4th place. It should have been 1st, but it was sadly beaten out by Avatar: The Last Airbender.

# Creating the dataframe for visualization 2
df = pd.read_csv(data, encoding='latin1')

df2 = df[['Title', 'Episodes', 'IMDb', 'Google users', 'Year']]
df2 = df2.sort_values(by='IMDb', ascending=False).reset_index(drop=True)
df2 = df2.head(10)

df = df.sort_values(by='Episodes', ascending=False).reset_index(drop=True)
df = df.dropna(subset=['Episodes'])
df = df.head(10)

# Creating the second visualization
fig = plt.figure(figsize=(16, 13))
fig.suptitle('Highest Episode Count & Highest Rated Shows:\nTop 10', fontsize=20, fontweight='bold')
plt.subplots_adjust(hspace=0.5, bottom = 0.15) 

# For the first subplot
ax1 = fig.add_subplot(2, 1, 1)
ax1.bar(df['Title'], df['Episodes'], color="#4e79a7", label='Episodes Count')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)

# Adding labels to the bars. Getting rid of the .0 in the floats
df['Episodes'] = df['Episodes'].astype(int)
for row_counter, value_at_row_counter in enumerate(df.Episodes):
    ax1.text(row_counter, value_at_row_counter + 50, str(value_at_row_counter), size=12, fontweight='bold', ha='center')

ax1.set_ylabel('Episodes Count', fontsize=14)
ax1.set_title('Shows with most episodes', fontsize=20)
ax1.set_xticks(range(len(df['Title'])))  
ax1.set_xticklabels(df['Title'], rotation=45, fontsize=10, ha='right')


# Colors for second plot
colors = ['#70E2F7','#64E4E9','#57E6DA','#4BE8CC','#3EEABE','#32ECAF','#26EEA1','#19F093','#0DF285','#00F477']

# For the second subplot
ax2 = fig.add_subplot(2, 1, 2)
ax2.bar(df2['Title'], df2['IMDb'], color=colors, label='IMDb Rating')


for row_counter, value_at_row_counter in enumerate(df2.IMDb):
    ax2.text(row_counter, value_at_row_counter + .05, str(value_at_row_counter), size=12, fontweight='bold', ha='center')


ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.set_ylabel('IMDb Rating', fontsize=14)
ax2.set_title('Top Rated Shows by IMDb Rating', fontsize=20)
ax2.set_xticks(range(len(df2['Title'])))
ax2.set_xticklabels(df2['Title'], rotation=45, fontsize=10, ha='right')
ax2.set_ylim(7, 10)
plt.show()

Stacked Bar Graph

This stacked bar graph shows not only the increase in shows over time, but also the evolution of the techniques used to make them. In the early days, traditional animation was essentially the only technique in use; CGI and Flash only entered the market decades later. Even so, traditional remained the leading technique for the foreseeable future, with Flash the second most popular and CGI third.

The number of shows went up dramatically over time. Starting at just 1-2 shows in 1949, the count reached 25 within ten years. After a small slump in the mid-1970s, it continued to climb, peaking at around 230 in 2021. It has since fallen to around 170 shows in the dataset for 2024 as the pandemic subsided, a substantial 60-show decrease in just three years.

# Creating the dataframe for visualization 3
data = 'Data/Animated_Shows.csv'
df = pd.read_csv(data, encoding='latin1')

df['Google users'] = df['Google users'].apply(clean_google_users_rating)

# Apply the year function and create new columns for StartYear and EndYear
df[['StartYear', 'EndYear']] = pd.DataFrame(df['Year'].apply(lambda x: extract_first_start_last_end(x)).tolist(), index=df.index)
df = df.drop(columns=['Year'])


# Formatting all of the techniques the same
df['Technique'] = df['Technique'].str.replace('Live Action', 'Live-Action', case=False, regex=False)
df['Technique'] = df['Technique'].str.replace('Adobe Flash', 'Flash', case=False, regex=False)

# Removing seasons, numbers, and other text from the technique column
df['Technique'] = df['Technique'].str.replace(r'\s\([^)]*\)', '', regex=True).replace('\r\n', '/', regex=True)

# Formatting all of the data the same
df['Technique'] = df['Technique'].str.title()

# Split on the / and create a new row for each technique a show used
df['Technique'] = df['Technique'].str.split('/')
df = df.explode('Technique')


# This will create a new row for each year the show was on air
df['YearOnAir'] = df.apply(lambda x: list(range(x['StartYear'], x['EndYear'] + 1)), axis=1)
df = df.explode('YearOnAir')
df = df.drop(columns=['StartYear', 'EndYear', 'Title'])

# Grouping and organizing data for visualization
tech_count_df = df.groupby(['Technique', 'YearOnAir'])['Technique'].count().reset_index(name='TechniqueCount')
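# Keep only techniques that appear in more than 25 distinct years, dropping rare one-off techniques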
tech_count_df = tech_count_df.groupby('Technique').filter(lambda x: len(x['YearOnAir'].unique()) > 25)

stacked_df = tech_count_df.pivot(index='YearOnAir', columns='Technique', values='TechniqueCount')
stacked_df = stacked_df.fillna(0)

# str.title() above renders 'CGI' as 'Cgi', hence the spelling in this column order
data_order = ['Traditional', 'Flash', 'Cgi', 'Live-Action', 'Stop-Motion']
stacked_df = stacked_df.reindex(columns=data_order)


# This will create the stacked bar plot
fig = plt.figure(figsize=(25, 15))
ax = fig.add_subplot(1,1,1)

stacked_df.plot(kind='bar', stacked=True, ax=ax)
plt.title('Techniques Used in Animated Shows Over Time', fontsize=35)
plt.ylabel('Number of Shows', fontsize=25)
plt.xlabel('Year', fontsize=25)
plt.xticks(rotation=60, fontsize=17)
plt.yticks(np.arange(0, 250, 10), fontsize = 20)
plt.legend(loc='best', fontsize = 20)

plt.tight_layout()
plt.subplots_adjust(left=0.05, right=0.95, top=0.95, bottom=0.05)
plt.show()

Pie Chart

This pie chart shows an interesting difference between the main networks producing TV shows. Syndication, which accounts for 24% of the shows in the dataset, was the dominant distribution method from the 1970s through the 1990s: producers sold the rights to broadcast their shows directly to individual TV stations rather than going through a broadcast network. This starkly contrasts with the second-place network, Netflix, whose shows have an average start year of 2019.6.

# Creating the dataframe for visualization 4
data = 'Data/Animated_Shows.csv'
df = pd.read_csv(data, encoding='latin1')

df['Google users'] = df['Google users'].apply(clean_google_users_rating)

# Apply the year function and create new columns for StartYear and EndYear
df[['StartYear', 'EndYear']] = pd.DataFrame(df['Year'].apply(lambda x: extract_first_start_last_end(x)).tolist(), index=df.index)
df = df.drop(columns=['Year'])

# Compute the value counts of the original channels
channel_counts = df['Original channel'].value_counts(dropna=False)

# Map the counts back to the original channels of the shows
df['Original Channel Count'] = df['Original channel'].map(channel_counts)
df = df.sort_values(by='Original Channel Count', ascending=False)

# Calculate the average StartYear for each channel
channel_avg_year = df.groupby('Original channel')['StartYear'].mean()

# Map the average year back to the original channels of the shows
df['Average Start Year'] = df['Original channel'].map(channel_avg_year)
df['Average Start Year'] = df['Average Start Year'].round(1)

# Grabbing the top 6 channels
top_channels = channel_counts.head(6).index

# Filter the DataFrame to include only the top 6 channels
df_top_channels = df[df['Original channel'].isin(top_channels)]

# Sorting to get the most popular channels at the top
df_top_channels = df_top_channels.sort_values(by='Original Channel Count', ascending=False)

# Drop duplicate rows based on 'Original channel'
df_unique_channels = df_top_channels.drop_duplicates(subset=['Original channel'])

# Manipulating the data for the pie chart
pie_df = df_unique_channels[['Original channel', 'Original Channel Count', 'Average Start Year']]
pie_df = pie_df.sort_values(by='Original Channel Count', ascending=False)

number_of_channels = len(df_unique_channels)
outside_color_ref_number = np.arange(number_of_channels)*2

all_channels = pie_df['Original Channel Count'].sum()
pie_df['Percentage'] = (pie_df['Original Channel Count'] / all_channels) * 100

# Creating the plot for visualization 4
fig = plt.figure(figsize=(15, 15))
ax = fig.add_subplot(1, 1, 1)

# Creating the colors section for the pie chart
colormap = plt.get_cmap('tab20c')
outer_colors = colormap(outside_color_ref_number)

# Assigning the colors to a list for the inner pie chart
inside_color_ref_number = outside_color_ref_number + 1  # the next, lighter shade in each tab20c color group


# Creating the outside pie chart
pie_df['Original Channel Count'].plot(
    kind='pie', 
    radius=1, 
    colors = outer_colors,
    labels = pie_df['Original channel'],
    pctdistance = 0.85,
    labeldistance = 1.1,
    wedgeprops = dict(edgecolor='w'), 
    textprops = {'fontsize':18},
    autopct = lambda p: ' {:.2f}% \n({:.0f})\n'.format(p,(p/100)*all_channels),
    startangle=90,)


# Creating the inside pie chart
inner_colors = colormap(inside_color_ref_number)
average_year_labels = ['Average Year\n{}'.format(year) for year in pie_df['Average Start Year']]

pie_df['Original Channel Count'].plot(
    kind='pie', 
    radius=.7, 
    colors = inner_colors,
    pctdistance = .55,
    labeldistance = 0.65,
    wedgeprops = dict(edgecolor='w'), 
    textprops = {'fontsize':15},
     labels=average_year_labels,
    startangle=90,)

ax.yaxis.set_visible(False)
plt.title('Top Networks by Show Count', fontsize=18)


# Creating the hole in the center and the text to go with it
hole = plt.Circle((0, 0), 0.3, fc='white')
fig1 = plt.gcf()
fig1.gca().add_artist(hole)
ax.text(0, 0, 'Shows on top 6 channels:\n' + str(all_channels) + ' of ' + str(channel_counts.sum()), fontsize=18, ha='center')
ax.axis('equal')
plt.tight_layout()

plt.show()

Heatmap

This heatmap shows the sweet spot for good ratings. Shows with 35-50 episodes were the most dominant on the heatmap, accounting for around 120 of the observations; these averaged ratings of 70%-80%. In the 80%-90% rating categories, the sweet spot was 50-70 episodes, and shows with 15-25 episodes also accounted for many of the highly rated entries.

# This will create and manipulate data for visualization 5
data = 'Data/Animated_Shows.csv'
df = pd.read_csv(data, encoding='latin1', usecols=['Google users', 'IMDb', 'Episodes'])


df['Google users'] = df['Google users'].apply(clean_google_users_rating)
df = df.sort_values(by='IMDb', ascending=False)
df6 = df.copy()

# This will be to remove the shows that have a NaN rating in IMDb or Google Users
df6 = df6.dropna(subset=['IMDb', 'Google users'])

# Average the two ratings on a common 0-100 scale: IMDb is on a 0-10 scale, so multiply it by 10 first
df6['Average_Rating'] = ((df6['Google users']) + (df6['IMDb']*10))/2

# Convert Episodes to numeric
df6['Episodes'] = pd.to_numeric(df6['Episodes'])

# Drop rows with NaN values
df6 = df6.dropna(subset=['Episodes'])

# Defining bin edges
bin_edges = [0, 5, 10, 15, 25, 35, 50, 70, 100, int(df6['Episodes'].max())]
rating_bin_edges = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Generate descriptive labels for the Episode and Rating bins
episode_bin_labels = ['{} - {}'.format(bin_edges[i], bin_edges[i+1]) for i in range(len(bin_edges)-1)]
rating_bin_labels = ['{}%-{}%'.format(rating_bin_edges[i], rating_bin_edges[i+1]) for i in range(len(rating_bin_edges)-1)]

# Create the bins
df6['Episodes_Bins'] = pd.cut(df6['Episodes'], bins=bin_edges, labels=False)
df6['Rating_Bin'] = pd.cut(df6['Average_Rating'], bins=rating_bin_edges, labels=False)
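# Note: labels=False makes pd.cut return integer bin codes; the descriptive labels above are applied later as heatmap tick labels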

# Creating a pivot table where the rows are the episode bins and the columns are rating bins
heatmap_data = pd.pivot_table(df6, values='Average_Rating', 
                              index='Episodes_Bins', 
                              columns='Rating_Bin', 
                              aggfunc='count', fill_value=0)
# Convert the pivot table to a numpy matrix for plotting the heatmap
heatmap_data = heatmap_data.to_numpy()


# This will create the heatmap for visualization 5
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(1,1,1)
comma_fmt = FuncFormatter(lambda x, p: format(int(x), ','))

# Creating the heatmap
ax = sns.heatmap(heatmap_data, linewidth = 0.2, annot = False, cmap = 'coolwarm', 
                square = True, annot_kws={'size':8}, fmt='d',
                cbar_kws = {'orientation': 'vertical'},
                xticklabels=rating_bin_labels,
                yticklabels=episode_bin_labels)

# Adding descriptive labels and title
plt.title('Heatmap of Average Rating and Number of Episodes in a Show', fontsize=18, pad = 15)
plt.xlabel('Rating %', fontsize=14, labelpad=10)
plt.ylabel('Episode Count', fontsize=14, labelpad=10, rotation=0, ha='right')
plt.yticks(rotation=0, size=14)
plt.xticks(size=14)
ax.invert_yaxis()
# Creating the necessary pieces for the colorbar key
cbar = ax.collections[0].colorbar
max_count = heatmap_data.max()
min_count = heatmap_data.min()
my_colorbar_ticks = [*range(min_count, max_count+1, 10)]
cbar.set_ticks(my_colorbar_ticks)
cbar.set_ticklabels(my_colorbar_ticks)
cbar.set_label('Number of Shows', rotation=270, labelpad=20, fontsize=14, color='black')

plt.show()

Conclusion

Throughout this analysis, I discovered many patterns in the dataset. The most notable trends I identified were the decline in user satisfaction over time, the COVID-era spike in TV show releases, and the average network start years, which highlight the difference between legacy media networks and newer streaming services like Netflix.