import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'D:/Anaconda3/Library/plugins/platforms'
This public data set contains the top 50 songs on Spotify in 2019. The analysis below examines how popularity varies across songs, artists, and genres, with a further focus on each song's beats per minute.
The analysis begins with a Line Chart of the average popularity of songs by genre. Although the data set holds only 50 songs, it is notable how many genres are represented and how much the average popularity varies from one genre to the next.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("U:/Spotify_Dataset2019.csv", usecols = ['Popularity', 'Genre'])
avg_popularity = df.groupby('Genre')['Popularity'].mean()
fig, ax = plt.subplots(figsize=(10,6))
avg_popularity.plot(kind='line', ax=ax)
ax.set_title('Average Popularity by Genre')
ax.set_ylabel('Popularity', fontsize=14)
ax.set_xlabel('Genre', fontsize=14)
# Place one tick per genre and rotate the labels so they stay readable
ax.set_xticks(list(range(len(avg_popularity.index))))
ax.set_xticklabels(list(avg_popularity.index), rotation=90)
plt.show()
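Because the data set holds only 50 songs, some genre averages may rest on very few tracks. The following is a minimal sketch of how each average could be read next to the number of songs behind it. The `toy` rows and the `summary` name are made up for illustration and are not taken from the Spotify file.

```python
import pandas as pd

# Made-up rows for illustration only; NOT taken from the Spotify file.
toy = pd.DataFrame({
    "Genre": ["pop", "pop", "pop", "latin", "latin", "edm"],
    "Popularity": [90, 84, 87, 92, 88, 79],
})

# Mean popularity per genre, next to the number of songs behind each mean,
# sorted so the highest average comes first.
summary = (
    toy.groupby("Genre")["Popularity"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

A genre with a `count` of 1 contributes a single song's score as its "average", which is worth keeping in mind when comparing points on the line.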
A Scatter Plot of popularity by genre looks further into the same relationship by showing how many artists fall at each combination of genre and popularity score. While the Line Chart shows only the average for each genre, the Scatter Plot shows the individual values that make up those averages and gives more insight into why an average is as low or as high as it is.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("U:/Spotify_Dataset2019.csv", usecols = ['Popularity', 'Genre', 'Artist.Name'])
x = df.groupby(['Genre', 'Popularity'])['Artist.Name'].count().reset_index(name='Count of Artists')
plt.figure(figsize=(14,9))
plt.scatter(x['Popularity'], x['Genre'], c=x['Count of Artists'], s=100, edgecolors='black')
plt.title('Top 50 Songs by Artist Count of Genre and Popularity', fontsize=18)
plt.xlabel('Popularity', fontsize=14)
plt.ylabel('Genre', fontsize=14)
cbar = plt.colorbar()
cbar.set_label('Number of Artists', rotation=270, fontsize=14, color='black', labelpad=10)
my_colorbar_ticks = [1,2]
cbar.set_ticks(my_colorbar_ticks)
my_x_ticks = [*range(x['Popularity'].min(), x['Popularity'].max()+1,1)]
plt.xticks(my_x_ticks, fontsize=11, color='black')
plt.show()
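The grouping step behind the Scatter Plot can be sketched on a few rows to show what the `Count of Artists` column holds. The `toy` rows are made up for illustration. `colorbar_ticks` is a hypothetical alternative to hard-coding the tick values, assuming the counts are small whole numbers.

```python
import pandas as pd

# Made-up rows for illustration only; NOT taken from the Spotify file.
toy = pd.DataFrame({
    "Genre": ["pop", "pop", "pop", "latin", "latin"],
    "Popularity": [88, 88, 90, 91, 91],
    "Artist.Name": ["A", "B", "C", "D", "E"],
})

# One row per (Genre, Popularity) pair, with the number of rows in that pair.
counts = (
    toy.groupby(["Genre", "Popularity"])["Artist.Name"]
    .count()
    .reset_index(name="Count of Artists")
)

# Derive the colorbar ticks from the data instead of hard-coding them.
colorbar_ticks = list(range(1, int(counts["Count of Artists"].max()) + 1))

print(counts)
print(colorbar_ticks)
```

Note that `count()` counts non-null rows, so an artist with two songs at the same genre and popularity score would be counted twice. If distinct artists are wanted, `nunique()` could be used in its place.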
One hypothesis was that a song's beats per minute drives its popularity. As a first look, a Donut Chart shows each genre's share of the total beats per minute across all 50 songs, expressed as a percentage.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("U:/Spotify_Dataset2019.csv", usecols = ['Genre', 'Beats.Per.Minute'])
total_bpm_by_genre = df.groupby("Genre")["Beats.Per.Minute"].sum()
total_bpm = total_bpm_by_genre.sum()
labels = total_bpm_by_genre.index
sizes = total_bpm_by_genre.values
fig, ax = plt.subplots(figsize=(12,12))
ax.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=45, pctdistance=0.85, textprops={'fontsize':8})
# Draw a white circle over the middle of the pie to turn it into a donut
center_circle = plt.Circle((0, 0), 0.5, fc='white', linewidth=0)
ax.add_artist(center_circle)
plt.text(0, 0, f"{total_bpm}\nTotal BPM", ha='center', va='center', fontsize=18)
plt.title("Total Beats per Minute by Genre", fontsize = 18)
plt.show()
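The percentages in the Donut Chart are each genre's summed beats per minute divided by the overall sum. A small sketch shows that this share depends on how many songs a genre has as well as on how fast they are. The `toy` rows and the variable names are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; NOT taken from the Spotify file.
toy = pd.DataFrame({
    "Genre": ["pop", "pop", "pop", "latin"],
    "Beats.Per.Minute": [100, 100, 100, 150],
})

by_genre = toy.groupby("Genre")["Beats.Per.Minute"]

# Share of the summed BPM, which is the quantity the donut percentages show.
share_of_total_bpm = by_genre.sum() / toy["Beats.Per.Minute"].sum() * 100

# Average tempo and number of songs per genre, for comparison.
mean_bpm = by_genre.mean()
song_count = by_genre.count()

print(share_of_total_bpm.round(1))
print(mean_bpm)
print(song_count)
```

In this made-up example the genre with the slower songs takes the larger slice simply because it has more songs, so a large slice should not be read as a fast genre without also checking the mean.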
Continuing the analysis of beats per minute by genre, a Box Plot was used to show the distribution within each genre and to identify any outliers.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("U:/Spotify_Dataset2019.csv", usecols = ['Genre', 'Beats.Per.Minute'])
# One box per genre showing the spread of beats per minute
sns.boxplot(x='Genre', y='Beats.Per.Minute', data=df)
plt.xticks(rotation=90)
plt.xlabel("Genre", fontsize=14)
plt.ylabel("Beats Per Minute", fontsize=14)
plt.title("Genre by Beats per Minute", fontsize=20)
plt.show()
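The points a Box Plot draws as outliers can also be listed numerically. The sketch below assumes the common 1.5 × IQR whisker rule (seaborn's default `whis=1.5`). The `bpm` values are made up for illustration and stand in for the songs of a single genre.

```python
import pandas as pd

# Made-up tempos for one hypothetical genre; NOT taken from the Spotify file.
bpm = pd.Series([90, 95, 100, 105, 110, 190], name="Beats.Per.Minute")

# Quartiles and interquartile range.
q1 = bpm.quantile(0.25)
q3 = bpm.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are treated as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = bpm[(bpm < lower) | (bpm > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(outliers.tolist())
```

Applied per genre (for example inside a `groupby`), the same rule would give the tracks that appear as individual points beyond the whiskers.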
Finally, two bar charts compare the top 15 tracks by beats per minute with the top 15 tracks by popularity. The track with the most beats per minute is NOT the most popular song. This comparison suggests that there is no direct correlation between a song's popularity and its beats per minute.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("U:/Spotify_Dataset2019.csv", usecols = ['Track.Name', 'Beats.Per.Minute'])
df2 = df[['Track.Name', 'Beats.Per.Minute']].sort_values(by = 'Beats.Per.Minute', ascending=False)
df2.reset_index(drop=True,inplace=True)
df3 = df2.head(15)
plt.bar("Track.Name", "Beats.Per.Minute", data=df3, color="blue")
plt.xlabel("Track Name", fontsize=14)
plt.xticks(rotation=90)
plt.ylabel("Beats Per Minute", fontsize=14)
plt.title("Top 15 Songs by Beats per Minute", fontsize=20)
plt.show()
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("U:/Spotify_Dataset2019.csv", usecols = ['Track.Name', 'Popularity'])
df2 = df[['Track.Name', 'Popularity']].sort_values(by = 'Popularity', ascending=False)
df2.reset_index(drop=True,inplace=True)
df3 = df2.head(15)
plt.bar("Track.Name", "Popularity", data=df3, color="red")
plt.xlabel("Track Name", fontsize=14)
plt.xticks(rotation=90)
plt.ylabel("Popularity", fontsize=14)
plt.title("Top 15 Songs by Popularity", fontsize=20)
plt.show()
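The two bar charts compare rankings by eye. A correlation coefficient is one way the same question could be checked numerically across all songs. The sketch below uses made-up values, so its result says nothing about the actual Spotify data; the `toy` frame and the names `r` and `rho` are assumptions for illustration.

```python
import pandas as pd

# Made-up values for illustration only; NOT taken from the Spotify file.
toy = pd.DataFrame({
    "Popularity": [95, 92, 90, 88, 85, 80],
    "Beats.Per.Minute": [100, 140, 95, 176, 120, 105],
})

# Pearson correlation (the default for Series.corr): linear association.
r = toy["Popularity"].corr(toy["Beats.Per.Minute"])

# Rank-based (Spearman-style) correlation: Pearson applied to the ranks.
rho = toy["Popularity"].rank().corr(toy["Beats.Per.Minute"].rank())

print(f"Pearson r = {r:.3f}")
print(f"Rank correlation = {rho:.3f}")
```

Values near 0 would be consistent with the conclusion above, while values near 1 or -1 would contradict it. Running the same two lines on the full `Popularity` and `Beats.Per.Minute` columns would test the claim on the real data.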