Introduction

This report takes a brief look at a selection of songs included in Spotify-curated playlists that were available to users in January 2020. Spotify's algorithms build playlists for users based on their tastes and preferences, along with a large amount of data on song characteristics. These playlists aim to help users enjoy their favorite types of music while also discovering new music.

Dataset

The data comprises over 28,000 song occurrences within some of the playlists curated by Spotify. The data set does not include the entire song catalog; it includes only songs from six genres: EDM, Rock, Rap, Latin, R&B, and Pop. The playlists were available to users in January 2020. The data was sourced from the public website Kaggle and was originally downloaded from Spotify’s API using the spotifyr package. It includes over 23 variables, including track name, artist, release date, playlist, genre, sub-genre, and quantitative descriptors such as tempo and danceability. The data was last updated in January 2020 and includes songs dating back to 1957.

import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import seaborn as sns
import colorcet as cc  # importing colorcet registers the 'cet_*' colormaps with matplotlib

warnings.filterwarnings("ignore")

path = "U:\\"
filename = "spotify_songs.csv"
spotify = pd.read_csv(path + filename)

spotify['track_album_release_date'] = pd.to_datetime(spotify['track_album_release_date'], format = '%Y-%m-%d')
spotify['Year'] = spotify['track_album_release_date'].dt.year
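
The strict '%Y-%m-%d' format above assumes every release date is a full date. If some rows carry only a year or a year and month (an assumption about the data, not something verified here), newer pandas versions may raise an error on those rows. A minimal sketch of a more tolerant year extraction, using a small made-up series in place of the real column:

```python
import pandas as pd

# Hypothetical sample values; the real column is spotify['track_album_release_date'].
dates = pd.Series(["2019-06-14", "2012", "1998-03"])

# In an ISO-style date string the first four characters are the year,
# so slicing them out sidesteps parsing failures on partial dates.
years = dates.str[:4].astype(int)
print(years.tolist())
```

This only recovers the year, which is all the charts in this report use; it does not produce a full datetime column.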

Findings

Average Song Popularity by Year and Genre

The scatter plot below shows the average popularity score of songs included in the Spotify-curated playlists, by year and genre. Spotify's popularity scores range from 0 to 100; the averages are multiplied by 10 for plotting, so the chart's scale runs from 0 to 1,000, with higher values being more popular. Pop stands out as the most popular genre. There is also a clear recency bias toward newer songs, which is expected given that users tend to listen to more recent music.

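# Mean popularity per (year, genre), keeping the 126 most recent year-genre groups
prod = spotify.groupby(['Year','playlist_genre']).track_popularity.mean().sort_index(ascending = False).head(126).reset_index(name='mean')
# Scale the 0-100 scores to 0-1000 so the values also work as marker sizes
prod['mean'] = prod['mean']*10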

plt.figure(figsize=(22,15))

plt.scatter(prod['playlist_genre'],prod['Year'], marker ='8', cmap = 'YlOrRd',
            c=prod['mean'], s = prod['mean'], edgecolors = 'black')
plt.title('Average Song Popularity by Year and Genre', fontsize = 20)
plt.xlabel('Genre', fontsize = 18, labelpad = 35)
plt.ylabel('Year', fontsize = 18,labelpad = 35 )

cbar = plt.colorbar()
cbar.set_label('Average Popularity', rotation=270, fontsize=18, color = 'black', labelpad = 35);
colorbar_ticks = [200,250,300,350,400,450,500,550];
cbar.set_ticks(colorbar_ticks);


plt.xticks(fontsize = 14, color = 'black');

y_ticks = [*range(int(prod['Year'].min()), 2021,1   )];
plt.yticks(y_ticks, fontsize = 18, color = 'black');
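
The observation that pop leads the other genres can be checked numerically by averaging popularity per genre and sorting. The sketch below uses a small hypothetical frame standing in for `spotify`; the numbers are illustrative only and are not results from the data set.

```python
import pandas as pd

# Hypothetical stand-in for the spotify DataFrame (values are made up).
toy = pd.DataFrame({
    "playlist_genre": ["pop", "pop", "rock", "rock", "edm"],
    "track_popularity": [60, 50, 40, 30, 20],
})

# Mean popularity per genre, highest first.
genre_rank = (
    toy.groupby("playlist_genre")["track_popularity"]
       .mean()
       .sort_values(ascending=False)
)
print(genre_rank)
```

Running the same groupby on the real `spotify` frame would give a single ranking across all years, which complements the year-by-year view in the scatter plot.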

Trend of Songs Included in Spotify Curated Playlists by Release Date Year

The line plot below shows the number of songs included in the playlists by release year. Because more popular songs are more likely to be included, the recency bias seen in the previous chart appears here as well. The data may also be skewed by the relative newness of music streaming: many artists with older songs may not have released their music on Spotify. This effect is likely compounded by the age range of Spotify users, since many people who would listen to older songs may not use Spotify to listen to music.


# Count tracks per release year
spotifyRY = spotify.groupby(['Year'])['track_name'].count().reset_index(name='Tracks_included');
# Drop the last row (the most recent year, which the data presumably covers only in part)
spotifyRY = spotifyRY.drop(spotifyRY.index[-1]);

fig = plt.figure(figsize = (18,10));
ax= fig.add_subplot(1,1,1);

plt.plot(spotifyRY['Year'], spotifyRY['Tracks_included'], c = 'DarkOrange')
plt.title('Trend of Songs Included in Spotify Curated Playlists by Release Date Year', fontsize = 20)
plt.xlabel('Year', fontsize = 15, labelpad = 35)
plt.ylabel('Songs Released', fontsize = 15,labelpad = 35 )
plt.grid(axis = 'y')

ax.yaxis.set_major_formatter(FuncFormatter(lambda x, p: format(int(x), ',')));

Top 10 Tracks by Number of Times Song Was Included in Different Spotify Curated Playlists

This bar chart shows the 10 songs included most often across different curated playlists. This is another way to gauge the most popular songs at the time (January 2020): the more popular a song, the more playlists it is likely to appear on.


trackpop = spotify.groupby('track_name').track_popularity.count().reset_index().sort_values(by='track_popularity', ascending = False).head(10)
trackpop = pd.DataFrame(trackpop)

plt.figure(figsize = (16,9))
plt.bar(trackpop['track_name'], trackpop['track_popularity'], label = 'Track_popularity', color= 'DarkOrange')
plt.title('Top 10 Tracks by Number of Times Song Was Included in Different Spotify Curated Playlists', fontsize = 17)
plt.xlabel('Song', fontsize = 15, labelpad = 25)
plt.ylabel('Times Included', fontsize = 15, labelpad = 25 )
plt.grid(axis = 'y')
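
Counting by track name alone merges different songs that happen to share a title. A hedged variant groups by both name and artist instead. The column name `track_artist` is an assumption about the data set, and the rows below are made up for illustration.

```python
import pandas as pd

# Hypothetical rows: two different songs share the title "Stay".
toy = pd.DataFrame({
    "track_name": ["Stay", "Stay", "Stay", "Closer"],
    "track_artist": ["Artist A", "Artist A", "Artist B", "Artist C"],
})

# Count occurrences per (name, artist) pair rather than per name.
counts = (
    toy.groupby(["track_name", "track_artist"])
       .size()
       .reset_index(name="times_included")
       .sort_values("times_included", ascending=False)
)
print(counts.head(10))
```

With name-only grouping, "Stay" would be counted three times; here it is split into two entries, one per artist.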

Number of Songs From Each Genre and Sub-Genre

The pie chart shows the share of each genre and its sub-genres among the songs and playlists in the data set. The data was clearly drawn about evenly from the six genres: EDM, Rock, Rap, Latin, R&B, and Pop. I do not think this split accurately describes the entire Spotify song catalog, but it does provide a good sample for evaluating trends across the genres and sub-genres.


genre = spotify.groupby(['playlist_genre'])['playlist_subgenre'].value_counts().reset_index(name='Total')
genre = pd.DataFrame(genre);

num_out_c = len(genre.playlist_genre.unique());
out_c_ref = np.arange(num_out_c)*4;

num_in_c = len(genre.playlist_subgenre.unique());
all_c_ref = np.arange(num_out_c + num_in_c);

in_c_ref =[]
for each in all_c_ref:
    if each not in out_c_ref:
        in_c_ref.append(each);

fig = plt.figure(figsize = (12,12))
ax = fig.add_subplot(1,1,1);

colormap = plt.get_cmap('cet_glasbey_cool')
outer_colors = colormap(out_c_ref);

total_songs = genre.Total.sum();

genre.groupby(['playlist_genre'])['Total'].sum().plot(
kind = 'pie', radius = 1, colors = outer_colors, pctdistance = 0.85, labeldistance = 1.1, 
wedgeprops = dict(edgecolor = 'w'), textprops = {'fontsize':18}, 
    autopct = lambda p: '{:.1f}%\n({:.0f})'.format(p,(p/100)*total_songs), startangle = 90)

inner_colors = colormap(in_c_ref);

genre.Total.plot(
kind = 'pie', radius = .7, colors = inner_colors, pctdistance = 0.47, labeldistance = .77, 
wedgeprops = dict(edgecolor = 'w'), rotatelabels =True,
    textprops = dict(rotation_mode = 'anchor', va='center', ha='center'), 
    labels = genre.playlist_subgenre,
    autopct = '%1.1f%%', startangle = 90)


ax.yaxis.set_visible(False);
plt.title('Number of Songs From Each Genre and Sub-Genre', fontsize = 20);
ax.axis('equal');
plt.tight_layout()
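
The loop that builds the inner-ring color indices can also be written as a set difference. This is a sketch of an equivalent approach, not a change to the chart; the counts of 6 genres and 24 sub-genres are assumed here rather than read from the data.

```python
import numpy as np

# Assumed sizes: six genres, 24 sub-genres in total.
num_outer, num_inner = 6, 24

# Outer ring takes every fourth colormap index: 0, 4, 8, ...
outer_idx = np.arange(num_outer) * 4
all_idx = np.arange(num_outer + num_inner)

# Inner ring takes every index not already used by the outer ring.
inner_idx = np.setdiff1d(all_idx, outer_idx)
print(outer_idx, inner_idx)
```

`np.setdiff1d` returns the sorted values of the first array that are absent from the second, which matches what the `for` loop collects.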

Number of Songs Included on Curated Playlists by Energy and Tempo

The heat map below splits the sample songs into 10 equal-width energy rating buckets and 10 equal-width tempo buckets. Songs with a higher energy rating and a tempo between 96 and 144 BPM are clearly far more likely to be included in the playlists. Many of these songs with higher energy and tempo are viewed as “catchy” or “happy”, which likely increases their popularity.

labels = ["0-24", "24-48", "48-72", "72-96", "96-120", "120-144", "144-168", "168-192", "192-216", "216-240"];

spotify['bucket'] = pd.cut(spotify['tempo'],bins = [0, 24, 48, 72, 96, 120, 144, 168, 192, 216, 240],  labels= labels);

labels2 = ["0.0-0.1", "0.1-0.2", "0.2-0.3", "0.3-0.4", "0.4-0.5", "0.5-0.6", "0.6-0.7", "0.7-0.8", "0.8-0.9", "0.9-1.0"];

spotify['enbuck'] = pd.cut(spotify['energy'],bins = [0, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1],  labels= labels2);

hm = spotify.groupby(['bucket','enbuck'])['track_name'].count().reset_index(name='new');

hmcount = spotify.groupby(['bucket'])['danceability'].mean().reset_index();

hm_df = pd.pivot_table(hm, index='bucket', columns = 'enbuck',values = 'new');

fig = plt.figure(figsize = (15,12));
ax = fig.add_subplot(1,1,1);

comma_fmt = FuncFormatter(lambda x , p: format(int(x), ','));

ax = sns.heatmap(hm_df, linewidth = 0.2, annot =True, cmap = 'coolwarm', fmt = ',.0f', 
                square = True, annot_kws = {'size':11},
                cbar_kws = {'format': comma_fmt, 'orientation':'vertical'})
plt.title('Number of Songs Included on Curated Playlists by Energy and Tempo', fontsize = 18, pad = 15)
plt.xlabel('Energy Rating', fontsize = 18, labelpad = 10);
plt.ylabel('Tempo', fontsize = 18, labelpad =10);
plt.yticks(rotation = 0, fontsize = 14);
plt.xticks(fontsize = 14);

ax.invert_yaxis();

cbar = ax.collections[0].colorbar;
cbar.set_label('Number of Songs', rotation = 270, color = 'black', fontsize = 14, labelpad = 25);
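
One detail of the bucketing is worth checking: by default `pd.cut` uses right-closed intervals, so a value equal to a bin edge falls into the lower bucket and a value equal to the lowest edge falls outside all buckets. A small sketch with made-up tempo values:

```python
import pandas as pd

# Made-up tempo values chosen to sit on or near bin edges.
tempos = pd.Series([0.0, 24.0, 24.1, 120.0, 239.9])
edges = [0, 24, 48, 72, 96, 120, 144, 168, 192, 216, 240]
labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]

# Default right=True gives intervals (0, 24], (24, 48], ...
buckets = pd.cut(tempos, bins=edges, labels=labels)
print(buckets.tolist())
```

A tempo of exactly 0 is left unbucketed (NaN) under these defaults; passing `include_lowest=True` would place it in the first bucket instead.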

Conclusion

Within this sample of songs from a selection of Spotify-curated playlists, a few factors clearly affect the popularity of songs: release date, genre, energy, and tempo. The trends across these factors suggest that a song's popularity increases the likelihood that it will be included on a playlist for users.