The dataset is a collection of 30,000 songs from Spotify!
It contains many songs, genres, and subgenres, plus some stats like how danceable a song is, its energy, key, etc.
I didn’t make it this time; I got it from Kaggle!
https://www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs/data
You could log in to download it, or obtain it through R:
spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
# Or read in with tidytuesdayR package (https://github.com/thebioengineer/tidytuesdayR)
# PLEASE NOTE TO USE 2020 DATA YOU NEED TO UPDATE tidytuesdayR from GitHub
# Either ISO-8601 date or year/week works!
# Install via devtools::install_github("thebioengineer/tidytuesdayR")
tuesdata <- tidytuesdayR::tt_load('2020-01-21')
tuesdata <- tidytuesdayR::tt_load(2020, week = 4)
spotify_songs <- tuesdata$spotify_songs
# Save as a CSV file
write.csv(spotify_songs, "raw_spotify_data.csv", row.names = FALSE)
I made a Python script that tries to keep only English songs (it wasn’t super effective at it).
It then also filters out other things like remixes, live recordings, edits, etc. so that only the original music is left.
import pandas as pd
import langid
df = pd.read_csv('Project Folder/raw_spotify_data.csv')
# Detect the language of each song's title
try:
    df['language'] = df['track_name'].astype(str).apply(lambda x: langid.classify(x)[0])
except Exception as e:
    # If detection fails, leave the column empty (the filter below then drops every row)
    df['language'] = None
# Filter for English songs only
df = df[df['language'] == 'en']
# Drop the language column as it's no longer needed
df = df.drop(columns=['language'])
# Let me know that the filtering is done
print("Filtered to English songs only.")
# Drop remixes or mixes as they may not represent the original song
df = df[~df['track_name'].str.contains('remix', case=False, na=False)]
df = df[~df['track_name'].str.contains('mix', case=False, na=False)]
# Let me know that the remix filtering is done
print("Removed remixes and mixes from the dataset.")
# Then also drop edits and live versions
df = df[~df['track_name'].str.contains('edit', case=False, na=False)]
df = df[~df['track_name'].str.contains('live', case=False, na=False)]
# Save the cleaned data to a new CSV file
df.to_csv('cleaned_spotify_data.csv', index=False)
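One caveat with plain substring checks like the ones above: 'mix' already covers 'remix', and 'live' also matches titles such as "Alive". Here is a minimal sketch of a stricter variant, assuming whole-word matching is what is wanted. The sample titles are made up for illustration and the name `version_pattern` is my own.

```python
import re

# Hypothetical sample titles, only for illustration
titles = [
    "Alive",
    "Stay - Live at Wembley",
    "Closer (Remix)",
    "Mixed Signals",
    "Blinding Lights",
    "Levitating - Radio Edit",
]

# \b marks a word boundary, so "live" no longer matches inside "Alive"
# and "mix" no longer matches inside "Mixed"
version_pattern = re.compile(r"\b(?:remix|mix|edit|live)\b", flags=re.IGNORECASE)

# Keep only titles that do not look like an alternate version
originals = [t for t in titles if not version_pattern.search(t)]
print(originals)
```

With pandas, the same pattern string could be passed to `df['track_name'].str.contains(version_pattern.pattern, case=False, na=False)`. Whether the stricter match is actually better for this dataset is a judgment call, since it would let titles like "Remixed" through.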
Again, this only attempts to keep English songs.
After filtering, around 21440 songs remain.
I also did an entire separate analysis that extracts the Modulation Mel-Frequency Cepstral Coefficients (MMFCC).
Here is the paper that it was based on.
This one is the 115 slide presentation for MA694.
MMFCC Workflow
This would take hours to explain, so let’s pretend a fairy flew by and gave me this part of the dataset ;3.
track_id, track_name, track_artist, track_popularity, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id, playlist_genre, playlist_subgenre
danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_ms
feature_1, feature_2, feature_3, … , feature_112
This results in 21440 songs x 134 columns!
First, I wanted to check for any missing and duplicate values in track_name.
proc freq data=PROJECT.spotify_data order=freq;
tables track_name / noprint out=freq_out;
run;
We see that there are missing values (" ") and many songs that appear more than once in the dataset.
Counting each name only once, we realistically have 14529 songs.
Also check for any missing and duplicate values in track_id.
proc freq data=PROJECT.spotify_data order=freq;
tables track_id / noprint out=freq_out;
run;
We also see that there are duplicates.
We also have missing values for the MMFCC features. This could be because the song wasn’t available to me when I ran the MMFCC analysis.
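The two frequency checks above boil down to counting how often each value occurs. As a rough Python sketch of the same idea (the rows are invented, and an empty name stands in for a missing value):

```python
from collections import Counter

# Hypothetical rows: (track_id, track_name); "" stands for a missing name
rows = [
    ("id1", "Song A"),
    ("id2", "Song B"),
    ("id1", "Song A"),  # same ID appears twice
    ("id3", "Song A"),  # same name under a different ID
    ("id4", ""),
]

name_counts = Counter(name.strip() for _, name in rows)
id_counts = Counter(track_id for track_id, _ in rows)

missing_names = name_counts[""]
repeated_names = {n: c for n, c in name_counts.items() if n and c > 1}
repeated_ids = {i: c for i, c in id_counts.items() if c > 1}
unique_names = sum(1 for n in name_counts if n)

print(missing_names, repeated_names, repeated_ids, unique_names)
```

The `unique_names` value plays the same role as the "realistic" song count read off the frequency table.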
Removing the duplicates
/* First remove missing */
data spotify_clean1;
set PROJECT.spotify_data;
if strip(track_name) ne "";
run;
/* Then remove duplicate for ID (nodupkey is what actually drops the repeats) */
PROC SORT DATA = spotify_clean1 out = spotify_clean2
nodupkey;
by track_id;
run;
/* Then for track_name */
PROC SORT DATA = spotify_clean2 out = spotify_clean3
nodupkey;
by track_name;
run;
/* Then also missing for features, we only need to see 1 to consider it missing */
data spotify_clean;
set spotify_clean3;
if feature_1 ne .;
run;
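For readers who don't use SAS, the cleaning steps above can be sketched in plain Python. This is only an illustration on invented rows: the helper `drop_duplicates` is hypothetical, and it keeps the first row it sees per key, whereas PROC SORT with nodupkey keeps the first row in sorted order, so the surviving rows may differ even though the counts should agree.

```python
def drop_duplicates(rows, key):
    """Keep the first row seen for each value of `key` (hypothetical helper)."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept

# Hypothetical rows; None stands for a missing MMFCC feature
rows = [
    {"track_id": "id1", "track_name": "Song A", "feature_1": 0.3},
    {"track_id": "id1", "track_name": "Song A", "feature_1": 0.3},
    {"track_id": "id2", "track_name": "Song A", "feature_1": 0.5},
    {"track_id": "id3", "track_name": "  ", "feature_1": 0.1},
    {"track_id": "id4", "track_name": "Song B", "feature_1": None},
    {"track_id": "id5", "track_name": "Song C", "feature_1": 0.7},
]

step1 = [r for r in rows if r["track_name"].strip()]      # drop missing names
step2 = drop_duplicates(step1, "track_id")                # one row per ID
step3 = drop_duplicates(step2, "track_name")              # one row per name
clean = [r for r in step3 if r["feature_1"] is not None]  # drop missing features

print([r["track_id"] for r in clean])
```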
Results
Count of songs per subgenre
First, I wanted to know how many songs are in each subgenre.
PROC FREQ DATA = spotify_clean;
tables playlist_subgenre;
run;
Most frequent artists in the dataset
To find the most frequent artists, we can use PROC FREQ.
/* Grab the top 10 most frequent artists */
PROC FREQ DATA = spotify_clean noprint;
tables track_artist / out = spotify_data_freq_artist;
run;
/* Sort */
PROC SORT DATA = spotify_data_freq_artist;
by descending count;
run;
proc print data=spotify_data_freq_artist(obs=10);
run;
What kind of songs does Ariana Grande have in the dataset?
We can print out and simply filter using where.
proc print data = spotify_clean;
where track_artist = "Ariana Grande";
var track_name track_artist;
run;
Average duration of each subgenre
First I need to convert ms to minutes.
data spotify_data_minutes;
set spotify_clean;
duration_min = duration_ms / 60000;
run;
Then to get average just grab the mean.
proc means data=spotify_data_minutes mean;
class playlist_subgenre;
var duration_min;
run;
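The same two steps (convert milliseconds to minutes, then average per subgenre) look like this in a small Python sketch. The durations are invented purely to show the arithmetic.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (playlist_subgenre, duration_ms) pairs
songs = [
    ("dance pop", 180000),
    ("dance pop", 240000),
    ("indie pop", 210000),
    ("indie pop", 195000),
]

minutes_by_subgenre = defaultdict(list)
for subgenre, duration_ms in songs:
    # 1 minute = 60,000 ms, same conversion as in the data step
    minutes_by_subgenre[subgenre].append(duration_ms / 60000)

avg_minutes = {s: mean(v) for s, v in minutes_by_subgenre.items()}
print(avg_minutes)
```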
Average Duration (in minutes)
We can see that most of the songs are between 3 and 4 minutes long.
We are looking into these 3 columns
danceability: How danceable a song is (from 0 - 1)
energy: How intense and active a song feels (from 0 - 1)
valence: How positive a song is (from 0 - 1)
Since we are doing pop songs, we will make a subset of just pop songs
data spotify_clean_pop;
set spotify_clean;
where playlist_genre = "pop";
run;
Then apply a t-test to each feature, for every pair of subgenres.
/* First, dance pop */
title "T-Test between dance pop and electropop";
proc ttest data=spotify_clean_pop;
where playlist_subgenre in ("dance pop", "electropo");
class playlist_subgenre;
var danceability energy valence;
run;
title "T-Test between dance pop and indie pop";
proc ttest data=spotify_clean_pop;
where playlist_subgenre in ("dance pop", "indie pop");
class playlist_subgenre;
var danceability energy valence;
run;
title "T-Test between dance pop and post-teen";
proc ttest data=spotify_clean_pop;
where playlist_subgenre in ("dance pop", "post-teen");
class playlist_subgenre;
var danceability energy valence;
run;
/* Then electropop*/
title "T-Test between electropop and indie pop";
proc ttest data=spotify_clean_pop;
where playlist_subgenre in ("electropo", "indie pop");
class playlist_subgenre;
var danceability energy valence;
run;
title "T-Test between electropop and post-teen";
proc ttest data=spotify_clean_pop;
where playlist_subgenre in ("electropo", "post-teen");
class playlist_subgenre;
var danceability energy valence;
run;
/* Then post-teen*/
title "T-Test between post-teen and indie pop";
proc ttest data=spotify_clean_pop;
where playlist_subgenre in ("post-teen", "indie pop");
class playlist_subgenre;
var danceability energy valence;
run;
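To make the comparisons above a bit more concrete, here is a minimal sketch of the pooled (equal-variance) two-sample t statistic, plus a way to list all six subgenre pairs. The danceability values are invented, the subgenre labels are copied from the where clauses, and `pooled_t` is a hypothetical helper. PROC TTEST reports more than this (for example the unequal-variance version), so treat it as an illustration of the formula only.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """Two-sample t statistic assuming equal variances (hypothetical helper)."""
    na, nb = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))

# Hypothetical danceability values for two subgenres
group_a = [0.6, 0.7, 0.8]
group_b = [0.5, 0.6, 0.7]
t_value = pooled_t(group_a, group_b)
print(round(t_value, 4))

# Four subgenres give six pairwise comparisons
subgenres = ["dance pop", "electropo", "indie pop", "post-teen"]
pairs = list(combinations(subgenres, 2))
print(len(pairs))
```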
We can get specifically the p-values for a better look
| Comparison | Feature | Mean Difference | t-Value | p-Value | Significant |
|---|---|---|---|---|---|
| Dance Pop vs Electropop | Danceability | 0.0210 | 2.82 | <0.001 | **Yes** |
| | Energy | 0.00291 | 0.32 | 0.7477 | No |
| | Valence | 0.0250 | 1.87 | 0.0623 | No |
| Dance Pop vs Indie Pop | Danceability | 0.0205 | 2.97 | 0.0030 | **Yes** |
| | Energy | 0.1165 | 11.84 | <0.001 | **Yes** |
| | Valence | 0.0496 | - | 0.0028 | **Yes** |
| Dance Pop vs Post-Teen | Danceability | 0.0237 | 3.07 | 0.0022 | **Yes** |
| | Energy | 0.00489 | 0.52 | 0.6031 | No |
| | Valence | -0.0606 | -5.04 | <0.001 | **Yes** |
| Electropop vs Indie Pop | Danceability | -0.00109 | -0.14 | 0.8876 | No |
| | Energy | 0.1136 | 10.37 | <0.001 | **Yes** |
| | Valence | 0.0245 | 1.84 | 0.0659 | No |
| Electropop vs Post-Teen | Danceability | 0.00204 | 0.24 | 0.8125 | No |
| | Energy | 0.00198 | 0.19 | 0.8466 | No |
| | Valence | -0.0856 | -6.53 | <0.001 | **Yes** |
| Post-Teen vs Indie Pop | Danceability | 0.00314 | 0.42 | 0.6768 | No |
| | Energy | -0.1116 | -11.09 | <0.001 | **Yes** |
| | Valence | -0.1101 | -9.32 | <0.001 | **Yes** |
Note: Bold indicates statistical significance at p < 0.01.
To me, there are enough differences between the subgenres that we can try to model them with a multinomial model.
A multinomial model is basically like a binomial model, but for situations where you have more than two possible outcomes. In our case, we have 4 subgenres: indie pop, electropop, post-teen, and dance pop.
\[ P(X_1=x_1, X_2=x_2, \dots, X_k=x_k) = \frac{n!}{x_1! x_2! \dots x_k!} p_1^{x_1} p_2^{x_2} \dots p_k^{x_k} \]
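As a quick sanity check of the formula above, here is a small Python sketch that evaluates it directly. The function name `multinomial_pmf` and the example (four songs, four equally likely subgenres) are my own choices for illustration.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(X_1 = x_1, ..., X_k = x_k) for category counts and probabilities."""
    n = sum(counts)
    # n! / (x_1! x_2! ... x_k!), computed with exact integer division
    coef = factorial(n)
    for x in counts:
        coef //= factorial(x)
    return coef * prod(p ** x for p, x in zip(probs, counts))

# Four songs, four equally likely subgenres:
# probability that each subgenre shows up exactly once
p = multinomial_pmf([1, 1, 1, 1], [0.25, 0.25, 0.25, 0.25])
print(p)
```

Here the coefficient is 4! = 24 and each term contributes 0.25, so the result is 24 × 0.25⁴.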
Our goal is to create a model that can predict the subgenre of a song based on some predictors.
Let’s start simple, predicting the subgenre from only danceability, energy, and valence.
/* This could be a justification for a multinomial model */
proc logistic data=spotify_clean_pop descending;
class playlist_subgenre (ref="dance pop") / param=ref;
model playlist_subgenre = danceability energy valence / link=glogit;
/* we can also get the predictions */
output out=pred_probs predicted=pred_prob;
run;
If you are confused this is the R translation:
library(nnet)
fit <- multinom(playlist_subgenre ~ danceability + energy + valence,
data = spotify_clean_pop)
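What the generalized logit link does under the hood: every non-reference subgenre gets its own linear predictor, the reference subgenre is fixed at 0, and the probabilities come from exponentiating and normalizing. Here is a minimal sketch, where `glogit_probs` is a hypothetical helper and the linear predictor values are invented, not fitted coefficients.

```python
from math import exp

def glogit_probs(linear_predictors, reference):
    """Turn per-class linear predictors into probabilities (hypothetical helper).

    linear_predictors maps each non-reference class to its value of
    log(P(class) / P(reference)); the reference class is fixed at 0.
    """
    scores = {reference: 0.0, **linear_predictors}
    denom = sum(exp(v) for v in scores.values())
    return {label: exp(v) / denom for label, v in scores.items()}

# Invented linear predictors for one song (not real model output)
probs = glogit_probs(
    {"electropop": 0.2, "indie pop": -0.5, "post-teen": 0.1},
    reference="dance pop",
)
print(probs)
```

The probabilities always sum to 1, and a song whose linear predictors are all 0 gets 25% for each of the four subgenres.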
I then did the following to see how accurate our model is.
Converted to long data, then transposed… (I don’t think this is necessary):
proc print data=pred_probs(obs=10);
title "Long dataset";
run;
proc sort data=pred_probs;
by track_id playlist_subgenre track_name track_artist track_popularity;
run;
proc transpose data=pred_probs out=pred_probs_wide(drop=_NAME_);
by track_id playlist_subgenre track_name track_artist track_popularity;
id _LEVEL_;
var pred_prob;
run;
Then grabbed the highest probability and class:
data accuracy_spotify_data1;
set pred_probs_wide;
/* find what is the class; labels must match the values of playlist_subgenre exactly (spaces, not underscores) */
if 'post-teen'n > max('dance pop'n, electropop, 'indie pop'n) then pred_class = "post-teen";
else if 'dance pop'n > max('post-teen'n, electropop, 'indie pop'n) then pred_class = "dance pop";
else if 'indie pop'n > max('post-teen'n, electropop, 'dance pop'n) then pred_class = "indie pop";
else pred_class = "electropop";
/* then compare if it is accurate */
if pred_class = playlist_subgenre then accuracy = 1;
else accuracy = 0;
run;
Then grab the mean for accuracy to get the percentage:
proc means data = accuracy_spotify_data1;
var accuracy;
run;
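The accuracy calculation itself is just "pick the most probable class, compare it with the truth, average the hits". Here is a rough Python sketch on invented probabilities. One thing it makes obvious is that the predicted label has to be spelled exactly like the true label, or a correct prediction gets counted as a miss.

```python
from statistics import mean

# Hypothetical predicted probabilities for four songs, paired with the true subgenre
songs = [
    ({"dance pop": 0.40, "electropop": 0.30, "indie pop": 0.20, "post-teen": 0.10}, "dance pop"),
    ({"dance pop": 0.10, "electropop": 0.20, "indie pop": 0.60, "post-teen": 0.10}, "indie pop"),
    ({"dance pop": 0.25, "electropop": 0.45, "indie pop": 0.20, "post-teen": 0.10}, "post-teen"),
    ({"dance pop": 0.20, "electropop": 0.20, "indie pop": 0.10, "post-teen": 0.50}, "post-teen"),
]

hits = []
for probs, actual in songs:
    # The predicted class is the label with the highest probability
    predicted = max(probs, key=probs.get)
    hits.append(1 if predicted == actual else 0)

accuracy = mean(hits)
print(accuracy)
```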
These are the results of model 1:
10%… not very good, is it?
Maybe interactions would make things better?
proc logistic data=spotify_clean_pop descending;
class playlist_subgenre (ref="dance pop") / param=ref;
model playlist_subgenre =
danceability|energy|valence / link=glogit;
output out=pred_probs2 predicted=pred_prob;
run;
R Translation
fit2 <- multinom(playlist_subgenre ~ danceability * energy * valence,
data = spotify_clean_pop)
These are the results of model 2:
Somehow the accuracy dropped??
Even though a multinomial model is a natural fit for modeling multiple genres, it clearly has some issues here.
We do have one more model to try.
KNN (k-nearest neighbors) is a model that classifies a point by looking at the labeled points closest to it.
k is the number of nearest points we check, and the class is decided by majority vote among them.
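To show the idea on a toy example, here is a minimal KNN sketch in plain Python. It uses plain Euclidean distance on invented 2-D points; the function name `knn_predict` is hypothetical, and the distance and tie-breaking rules may differ from what SAS does internally.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Indices of the training points, sorted from nearest to farthest
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    # most_common breaks ties by which label was counted first
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points forming two small groups
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (0.5, 0.5), k=3))
print(knn_predict(points, labels, (5.0, 5.5), k=3))
```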
First we need to split the data into a training and test dataset. I split it 60:40 (60% training, 40% testing).
proc surveyselect data=spotify_clean_pop
out=spotify_clean_pop_split_data
seed=505
samprate=0.6 /* This is for 60% */
outall
method=srs;
run;
data spotify_clean_pop_training spotify_clean_pop_testing;
set spotify_clean_pop_split_data;
if selected = 1 then output spotify_clean_pop_training;
else output spotify_clean_pop_testing;
drop selected;
run;
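In case the SAS version is hard to follow, the split is just "shuffle, then cut at 60%". Here is a rough Python sketch. The helper `split_rows` is hypothetical, and reusing the seed 505 here will not reproduce the same sample that SAS draws.

```python
import random

def split_rows(rows, train_fraction=0.6, seed=505):
    """Randomly split rows into training and testing lists (hypothetical helper)."""
    rng = random.Random(seed)   # local generator, so the split is repeatable
    shuffled = rows[:]          # copy, leaving the input untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical data: ten row indices standing in for songs
rows = list(range(10))
training, testing = split_rows(rows)
print(len(training), len(testing))
```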
Then we can implement it with SAS’s built-in KNN support (thank heavens).
proc discrim data=spotify_clean_pop_training test=spotify_clean_pop_testing method=npar k=40;
class playlist_subgenre; /* this is the column I am predicting */
var feature_1-feature_112; /* this is the column(s) that I am using as predictors */
priors proportional;
run;