MA505 Presentation: Poor Man’s Music Analysis Project

Mark Ira Galang

Overview

Dataset
- What is in the dataset?
- Where did I get the dataset?
- Some Additions/Changes
- Summary of dataset
“Research” Questions
- General Questions
- Statistical Questions
- Model Building
Preliminary
- Cleaning the dataset
Analysis
- Exploratory Analysis
- Statistical Analysis
- Model Building

THE DATASET

What is the dataset?

The dataset is a collection of 30000 songs from Spotify!

It contains many songs, genres and subgenres, and some stats like how danceable a song is, the energy, key, etc.

Where did I get the dataset?

I didn’t make it this time; I got it from Kaggle!

https://www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs/data

You can log in to Kaggle to download it, or obtain it through R:

spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

# Or read in with tidytuesdayR package (https://github.com/thebioengineer/tidytuesdayR)
# PLEASE NOTE TO USE 2020 DATA YOU NEED TO UPDATE tidytuesdayR from GitHub

# Either ISO-8601 date or year/week works!

# Install via devtools::install_github("thebioengineer/tidytuesdayR")

tuesdata <- tidytuesdayR::tt_load('2020-01-21') 
tuesdata <- tidytuesdayR::tt_load(2020, week = 4)


spotify_songs <- tuesdata$spotify_songs

# Save as a CSV file
write.csv(spotify_songs, "raw_spotify_data.csv", row.names = FALSE)

Some Additions/Changes

I made a Python script that tries to keep only English songs (it wasn’t super effective at it).

It also filters out remixes, live recordings, edits, etc. so that only the original versions remain.

import pandas as pd
import langid

df = pd.read_csv('Project Folder/raw_spotify_data.csv')

# Detect the language of each song's title
try:
    df['language'] = df['track_name'].astype(str).apply(lambda x: langid.classify(x)[0])
except Exception as e:
    df['language'] = None

# Filter for English songs only
df = df[df['language'] == 'en']

# Drop the language column as it's no longer needed
df = df.drop(columns=['language'])

# Let me know that the filtering is done
print("Filtered to English songs only.")

# Drop remixes or mixes as they may not represent the original song
df = df[~df['track_name'].str.contains('remix', case=False, na=False)]
df = df[~df['track_name'].str.contains('mix', case=False, na=False)]

# Let me know that the remix filtering is done
print("Removed remixes and mixes from the dataset.")

# Then also drop edits and live versions
df = df[~df['track_name'].str.contains('edit', case=False, na=False)]
df = df[~df['track_name'].str.contains('live', case=False, na=False)]

# Save the cleaned data to a new CSV file
df.to_csv('cleaned_spotify_data.csv', index=False)
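One side effect of the substring checks above is that they also drop titles that merely contain those letters: "Alive" contains "live" and "Credit" contains "edit". Below is a minimal sketch of a stricter variant that only matches whole words. The helper name `is_alternate_version` and the sample titles are made up for illustration, and this is not the script that produced the counts in this presentation.

```python
import re

# Match the keywords only as whole words, ignoring case.
# "mix" is listed separately because \bmix\b does not match inside "remix".
ALT_VERSION = re.compile(r"\b(?:remix|mix|edit|live)\b", re.IGNORECASE)


def is_alternate_version(title: str) -> bool:
    """Return True if the title looks like a remix, mix, edit, or live version."""
    return ALT_VERSION.search(title) is not None


# Hypothetical titles, not taken from the dataset
titles = ["Alive", "Credit Card", "Stay - Radio Edit", "Halo (Live)", "Levels - Remix"]
originals = [t for t in titles if not is_alternate_version(t)]
print(originals)
```

The same pattern could be passed to `str.contains` in the pandas script, although the number of remaining songs would then differ from the figure reported here.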

After these filters, around 21440 songs remain.

I also did an entirely separate analysis that extracts the Modulation Mel-Frequency Cepstral Coefficients (MMFCC).

Here is the paper that it was based on.

This one is the 115-slide presentation for MA694.

MMFCC Workflow

have audio waves ready in a song folder
pre-emphasize each audio file
frame each signal into short overlapping windows
apply Hamming window to reduce spectral leakage
compute magnitude/power spectrum via Fast Fourier Transform (FFT)
apply Mel filterbank
take log of filterbank energies
compute cepstral coefficients (MFCCs)
create 8 subbands
modulate by taking another FFT
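The workflow above can be sketched in plain Python on a toy signal. This is only an illustration under my own assumptions, not the code used for the project: the frame length, hop size, filter count, and coefficient count are arbitrary, the DFT is a slow textbook version where a real pipeline would use an FFT library, and the 8-subband step is left out because its details are not given here.

```python
import cmath
import math


def pre_emphasize(signal, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]; boosts high frequencies
    return [signal[0]] + [signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))]


def frame_signal(signal, frame_len, hop):
    # Overlapping frames: consecutive frames start `hop` samples apart
    return [signal[s:s + frame_len] for s in range(0, len(signal) - frame_len + 1, hop)]


def hamming(frame):
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1))) for i, x in enumerate(frame)]


def power_spectrum(values):
    # Naive O(n^2) DFT, kept for readability; returns power for bins 0..n/2
    n = len(values)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(values))) ** 2 / n
            for k in range(n // 2 + 1)]


def hz_to_mel(f):
    return 2595 * math.log10(1 + f / 700)


def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)


def mel_filterbank(n_filters, n_bins, sample_rate):
    # Triangular filters whose edges are evenly spaced on the mel scale
    top = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    edges = [f / (sample_rate / 2) * (n_bins - 1) for f in edges_hz]  # fractional bin positions
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for b in range(n_bins):
            if lo < b <= mid:
                row.append((b - lo) / (mid - lo))
            elif mid < b < hi:
                row.append((hi - b) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct2(values, n_out):
    # Type-II DCT (unnormalized): turns log filterbank energies into cepstral coefficients
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(n_out)]


def mfcc(signal, sample_rate, frame_len=64, hop=32, n_filters=10, n_coeffs=5):
    frames = frame_signal(pre_emphasize(signal), frame_len, hop)
    bank = mel_filterbank(n_filters, frame_len // 2 + 1, sample_rate)
    rows = []
    for frame in frames:
        spec = power_spectrum(hamming(frame))
        energies = [sum(w * p for w, p in zip(row, spec)) for row in bank]
        rows.append(dct2([math.log(e + 1e-10) for e in energies], n_coeffs))
    return rows  # one list of n_coeffs values per frame


def modulation_spectrum(coeffs):
    # Second transform: spectrum of each coefficient's trajectory across frames
    return [power_spectrum([row[c] for row in coeffs]) for c in range(len(coeffs[0]))]


# Toy input: 0.1 s of a 440 Hz sine sampled at 8 kHz (not a real song)
sample_rate = 8000
signal = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(800)]
coeffs = mfcc(signal, sample_rate)
mod = modulation_spectrum(coeffs)
print(len(coeffs), "frames,", len(mod), "coefficient tracks,", len(mod[0]), "modulation bins")
```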

Explaining all of this would take hours, so let’s pretend a fairy flew by and gave me this part of the dataset ;3.

Summary of Dataset

Track Information:

track_id, track_name, track_artist, track_popularity, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id, playlist_genre, playlist_subgenre

Track features:

danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_ms

MMFCC Features:

feature_1, feature_2, feature_3, … , feature_112

This results in a dataset of 21440 songs x 134 columns!

RESEARCH QUESTIONS

Research Questions

Exploratory Analysis
- Which subgenres are in the dataset, and how many songs does each have?
- Who are the most frequent artists in the dataset, and which songs does one of them have?
- What is the average duration of each subgenre?
Statistical Analysis
- Looking at the subgenres within pop, are the features significantly different?
Model Building to Predict Subgenre
- Multinomial model building using features
  - Attempt to improve models
- KNN approach through SAS

Preliminary

First, I wanted to check for any missing and duplicate values from track_name.

proc freq data=PROJECT.spotify_data order=freq;
    tables track_name / noprint out=freq_out;
run;

We see that there are missing values (" ") and many songs that appear more than once in the dataset.

It shows that we realistically have 14529 songs.

Also check any missing and duplicate values from track_id.

proc freq data=PROJECT.spotify_data order=freq;
    tables track_id / noprint out=freq_out;
run;

We also see that there are duplicates.

The MMFCC features also have missing values, probably because I didn’t have the audio for those songs when computing the MMFCC.

Removing the duplicates

/* First remove missing */
data spotify_clean1;
    set PROJECT.spotify_data;
    if strip(track_name) ne "";  
run;

/* Then remove duplicate for ID */
PROC SORT DATA = spotify_clean1 out = spotify_clean2
    nodupkey;
    by track_id;
run;

/* Then for track_name */
PROC SORT DATA = spotify_clean2 out = spotify_clean3
    nodupkey;
    by track_name;
run;

/* Then also missing for features, we only need to see 1 to consider it missing */
data spotify_clean;
    set spotify_clean3;
    if feature_1 ne .;
run;
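For anyone who doesn’t read SAS, the cleaning steps above can be sketched in plain Python. The rows and the helper `drop_duplicates` are made up for illustration. One difference to keep in mind: this sketch keeps the first row it sees for each key, while PROC SORT with nodupkey keeps the first row in sorted order.

```python
# Hypothetical rows, not taken from the dataset
rows = [
    {"track_id": "a1", "track_name": "Song A", "feature_1": 0.5},
    {"track_id": "a1", "track_name": "Song A", "feature_1": 0.5},   # same ID listed twice
    {"track_id": "b2", "track_name": "Song A", "feature_1": 0.4},   # same title, different ID
    {"track_id": "c3", "track_name": "  ", "feature_1": 0.1},       # blank title
    {"track_id": "d4", "track_name": "Song B", "feature_1": None},  # no MMFCC features
    {"track_id": "e5", "track_name": "Song C", "feature_1": 0.9},
]


def drop_duplicates(records, key):
    """Keep the first record seen for each value of `key`."""
    seen, kept = set(), []
    for rec in records:
        if rec[key] not in seen:
            seen.add(rec[key])
            kept.append(rec)
    return kept


step1 = [r for r in rows if r["track_name"].strip() != ""]  # remove blank titles
step2 = drop_duplicates(step1, "track_id")                   # one row per track_id
step3 = drop_duplicates(step2, "track_name")                 # one row per track_name
clean = [r for r in step3 if r["feature_1"] is not None]     # remove rows without features
print([r["track_id"] for r in clean])
```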

Results

Exploratory Analysis

Count of songs per subgenre

First, I wanted to know how many songs are in each subgenre.

PROC FREQ DATA = spotify_clean;
    tables playlist_subgenre;
run;

Frequent Artist in dataset

To find the most frequent artist, we can use FREQ.

/* Grab the top 10 frequent artists */
PROC FREQ DATA = spotify_clean noprint;
    tables track_artist / out = spotify_data_freq_artist;
run;

/* Sort */
PROC SORT DATA = spotify_data_freq_artist;
    by descending count;
run;

proc print data=spotify_data_freq_artist(obs=10);
run;

What kind of songs does Ariana Grande have in the dataset?

We can print out and simply filter using where.

proc print data = spotify_clean;
    where track_artist = "Ariana Grande";
    var track_name track_artist;
run;

Average duration of each subgenre

First I need to convert ms to minutes.

data spotify_data_minutes;
    set spotify_clean;
    duration_min = duration_ms / 60000; 
run;

Then to get average just grab the mean.

proc means data=spotify_data_minutes mean;
    class playlist_subgenre;
    var duration_min;
run;
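The same two steps (convert milliseconds to minutes, then average per subgenre) look like this in plain Python. The durations are made-up values, not rows from the dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (subgenre, duration_ms) pairs
tracks = [
    ("dance pop", 210000),
    ("dance pop", 195000),
    ("indie pop", 240000),
    ("indie pop", 180000),
]

# Group durations by subgenre, converting ms to minutes (60000 ms per minute)
minutes_by_subgenre = defaultdict(list)
for subgenre, duration_ms in tracks:
    minutes_by_subgenre[subgenre].append(duration_ms / 60000)

averages = {subgenre: mean(values) for subgenre, values in minutes_by_subgenre.items()}
for subgenre, avg in averages.items():
    print(f"{subgenre}: {avg:.2f} min")
```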

Average Duration (in minutes)

We can see that most subgenres average between 3 and 4 minutes.

Statistical Analysis

Within the pop genre, do the features differ between subgenres?

We are looking at these 3 columns:

danceability: How danceable a song is (from 0 - 1)

energy: How intense and active a song feels (from 0 - 1)

valence: How positive a song is (from 0 - 1)

Since we are doing pop songs, we will make a subset of just pop songs

data spotify_clean_pop;
    set spotify_clean;
    where playlist_genre = "pop";
run;

There are 2370 rows (songs)

6 pairwise t-tests

With 4 pop subgenres there are 6 possible pairs, so apply a t-test to each feature for each pair.

/* First, dance pop */
title "T-Test between dance pop and electropop";
proc ttest data=spotify_clean_pop;
    where playlist_subgenre in ("dance pop", "electropo");
    class playlist_subgenre;
    var danceability energy valence;
run;


title "T-Test between dance pop and indie pop";
proc ttest data=spotify_clean_pop;
    where playlist_subgenre in ("dance pop", "indie pop");
    class playlist_subgenre;
    var danceability energy valence;
run;

title "T-Test between dance pop and post-teen";
proc ttest data=spotify_clean_pop;
    where playlist_subgenre in ("dance pop", "post-teen");
    class playlist_subgenre;
    var danceability energy valence;
run;

/* Then electropop*/

title "T-Test between electropop and indie pop";
proc ttest data=spotify_clean_pop;
    where playlist_subgenre in ("electropo", "indie pop");
    class playlist_subgenre;
    var danceability energy valence;
run;

title "T-Test between electropop and post-teen";
proc ttest data=spotify_clean_pop;
    where playlist_subgenre in ("electropo", "post-teen");
    class playlist_subgenre;
    var danceability energy valence;
run;

/* Then post-teen*/

title "T-Test between post-teen and indie pop";
proc ttest data=spotify_clean_pop;
    where playlist_subgenre in ("post-teen", "indie pop");
    class playlist_subgenre;
    var danceability energy valence;
run;
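To show where the 6 comparisons come from, here is a small Python sketch that loops over every pair of subgenres and computes a pooled two-sample t statistic. The danceability values are invented, the function name `pooled_t` is my own, and PROC TTEST reports more than this (including the unequal-variance Satterthwaite version and p-values), so treat it as an illustration only.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample t statistic assuming equal variances (pooled)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))


# Hypothetical danceability values, NOT taken from the dataset
samples = {
    "dance pop": [0.70, 0.72, 0.68, 0.75],
    "electropop": [0.66, 0.69, 0.64, 0.70],
    "indie pop": [0.60, 0.65, 0.62, 0.58],
    "post-teen": [0.67, 0.71, 0.66, 0.69],
}

# 4 subgenres taken 2 at a time gives 6 pairs
pairs = list(combinations(samples, 2))
for first, second in pairs:
    t = pooled_t(samples[first], samples[second])
    print(f"{first} vs {second}: t = {t:.2f}")
```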

Results of ttest

Get p-values

We can get specifically the p-values for a better look

Comparison	Feature	Mean Difference	t-Value	p-Value	Significant
Dance Pop vs Electropop	Danceability	0.0210	2.82	<0.001	Yes
	Energy	0.00291	0.32	0.7477	No
	Valence	0.0250	1.87	0.0623	No
Dance Pop vs Indie Pop	Danceability	0.0205	2.97	0.0030	Yes
	Energy	0.1165	11.84	<0.001	Yes
	Valence	0.0496	-	0.0028	Yes
Dance Pop vs Post-Teen	Danceability	0.0237	3.07	0.0022	Yes
	Energy	0.00489	0.52	0.6031	No
	Valence	-0.0606	-5.04	<0.001	Yes
Electropop vs Indie Pop	Danceability	-0.00109	-0.14	0.8876	No
	Energy	0.1136	10.37	<0.001	Yes
	Valence	0.0245	1.84	0.0659	No
Electropop vs Post-Teen	Danceability	0.00204	0.24	0.8125	No
	Energy	0.00198	0.19	0.8466	No
	Valence	-0.0856	-6.53	<0.001	Yes
Post-Teen vs Indie Pop	Danceability	0.00314	0.42	0.6768	No
	Energy	-0.1116	-11.09	<0.001	Yes
	Valence	-0.1101	-9.32	<0.001	Yes

Note: "Yes" in the Significant column indicates statistical significance at p < 0.01.

To me, there are enough differences between the subgenres that we can try to model them with a multinomial model.

Model Building (Multinomial)

What is a multinomial model?

A multinomial model is like a binomial model, but for situations with more than two possible outcomes. In our case, there are 4 subgenres: indie pop, electropop, post-teen, and dance pop.

\[ P(X_1=x_1, X_2=x_2, \dots, X_k=x_k) = \frac{n!}{x_1! x_2! \dots x_k!} p_1^{x_1} p_2^{x_2} \dots p_k^{x_k} \]

Our goal is to create a model that can predict the subgenre of a song based on some predictors.
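For reference, the models that follow use the generalized logit link, which compares each subgenre with a baseline subgenre. Assuming dance pop as the baseline and the three features as predictors, it can be written as below; the coefficient names \(\beta_{0j}, \dots, \beta_{3j}\) are notation introduced here for illustration.

```latex
\[
\log \frac{P(Y = j \mid x)}{P(Y = \text{dance pop} \mid x)}
  = \beta_{0j} + \beta_{1j}\,\text{danceability} + \beta_{2j}\,\text{energy} + \beta_{3j}\,\text{valence},
\qquad j \in \{\text{electropop},\ \text{indie pop},\ \text{post-teen}\}
\]
```

Each non-baseline subgenre gets its own set of coefficients, so three features plus an intercept means 12 estimated parameters in total.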

Model 1

Let’s start simple, using only the three features as main effects.

/* This could be a justification for a multinomial model */
proc logistic data=spotify_clean_pop descending;
    class playlist_subgenre (ref="dance pop") / param=ref;
    model playlist_subgenre = danceability energy valence / link=glogit;
    
    /* we can also get the predictions */
    output out=pred_probs predicted=pred_prob;
run;

If you are confused this is the R translation:

library(nnet) 
fit <- multinom(playlist_subgenre ~ danceability + energy + valence,
                 data = spotify_clean_pop)

How I got Accuracy

I then did the following to see how accurate our model is.

Converted to long data, then transposed… (I don’t think this is necessary):

proc print data=pred_probs(obs=10);
    title "Long dataset";
run;

proc sort data=pred_probs;
    by track_id playlist_subgenre track_name track_artist track_popularity;
run;

proc transpose data=pred_probs out=pred_probs_wide(drop=_NAME_);
    by track_id playlist_subgenre track_name track_artist track_popularity;
    id _LEVEL_;      
    var pred_prob;   
run;

Then grabbed the highest probability and class:

data accuracy_spotify_data1;
    set pred_probs_wide;
    
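    /* Find the class with the highest predicted probability.
       Caution: pred_class must match the values of playlist_subgenre exactly;
       "dance_pop" and "indie_pop" (underscores) never equal "dance pop" or
       "indie pop", so those predictions are always scored as wrong. */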
    if 'post-teen'n > max('dance pop'n, electropop, 'indie pop'n) then pred_class = "post-teen";
    else if 'dance pop'n > max('post-teen'n, electropop, 'indie pop'n) then pred_class = "dance_pop";
    else if 'indie pop'n > max('post-teen'n, electropop, 'dance pop'n) then pred_class = "indie_pop";
    else pred_class = "electropop";
    
    /* then compare if it is accurate */
    if pred_class = playlist_subgenre then accuracy = 1;
    else accuracy = 0;
run;

Then grab the mean for accuracy to get the percentage:

proc means data = accuracy_spotify_data1;
    var accuracy;
run;
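The accuracy calculation boils down to: pick the class with the highest predicted probability, compare it with the actual class, and average the hits. A minimal Python sketch with invented probabilities (not output from the model) is below. Note that the predicted label has to be spelled exactly like the actual label for a match to count.

```python
# Hypothetical predicted probabilities per track, NOT model output
predictions = [
    {"actual": "dance pop",
     "probs": {"dance pop": 0.40, "electropop": 0.30, "indie pop": 0.20, "post-teen": 0.10}},
    {"actual": "electropop",
     "probs": {"dance pop": 0.25, "electropop": 0.35, "indie pop": 0.20, "post-teen": 0.20}},
    {"actual": "indie pop",
     "probs": {"dance pop": 0.45, "electropop": 0.15, "indie pop": 0.30, "post-teen": 0.10}},
    {"actual": "post-teen",
     "probs": {"dance pop": 0.10, "electropop": 0.20, "indie pop": 0.20, "post-teen": 0.50}},
]


def predicted_class(probs):
    """Return the label with the highest predicted probability."""
    return max(probs, key=probs.get)


# 1 when the predicted label equals the actual label, else 0
hits = [int(predicted_class(p["probs"]) == p["actual"]) for p in predictions]
accuracy = sum(hits) / len(hits)
print(f"accuracy = {accuracy:.0%}")
```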

Accuracy (Model 1)

These are the results of model 1

10%… worse than guessing among the 4 subgenres at random (25%). Not very good, is it?

Model 2 (Adding interaction)

Maybe interaction would make things better?

proc logistic data=spotify_clean_pop descending;
    class playlist_subgenre (ref="dance pop") / param=ref;
    model playlist_subgenre = 
        danceability|energy|valence / link=glogit;
        
    output out=pred_probs2 predicted=pred_prob;
run;

R Translation

fit2 <- multinom(playlist_subgenre ~ danceability * energy * valence,
                 data = spotify_clean_pop)

Accuracy (Model 2)

These are the results of model 2

Somehow the accuracy dropped??

Multinomial Issues Conclusion

Even though a multinomial model is good for modeling multiple subgenres, it is clear that there are some issues.

- The predictors don’t clearly separate the subgenres
- 3 predictors (danceability, energy, and valence) aren’t enough; we may need more
- Adding interactions didn’t help much because it just increases complexity
- Accuracy stayed low, meaning a multinomial model may not be a good fit

We do have 1 more model to try

KNN Modeling

What is KNN?

KNN (k-nearest neighbors) classifies a point by looking at the labeled points closest to it.

k is the number of nearest points we check. The point is assigned to the class that wins the majority vote among those k neighbors.
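Here is a minimal Python sketch of the idea, using Euclidean distance and a majority vote. The 2-D points and labels are made up; the actual model later uses the 112 MMFCC features and SAS, not this code.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical (features, label) pairs
train = [
    ((0.10, 0.20), "indie pop"),
    ((0.20, 0.10), "indie pop"),
    ((0.15, 0.15), "indie pop"),
    ((0.90, 0.80), "dance pop"),
    ((0.80, 0.90), "dance pop"),
]

print(knn_predict(train, (0.2, 0.2), k=3))
print(knn_predict(train, (0.85, 0.85), k=3))
```

A larger k smooths out noisy neighbors but can blur the boundary between classes that sit close together.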

Implementation of KNN

First we need to split the data into a training and test dataset. I split it 60:40 (60% training, 40% testing).

proc surveyselect data=spotify_clean_pop 
    out=spotify_clean_pop_split_data
    seed=505              
    samprate=0.6 /* This is for 60% */     
    outall                   
    method=srs;              
run;

data spotify_clean_pop_training spotify_clean_pop_testing;
    set spotify_clean_pop_split_data;
    if selected = 1 then output spotify_clean_pop_training;  
    else output spotify_clean_pop_testing;                   
    drop selected;
run;
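The split above is a simple random sample without replacement: 60% of the rows are selected for training and the rest go to testing. A rough Python equivalent is sketched below. The function name `split_train_test` is my own, and using the seed 505 in Python will not reproduce the rows that SAS selected.

```python
import random


def split_train_test(rows, train_rate=0.6, seed=505):
    """Randomly assign a share of the rows to training, the rest to testing."""
    rng = random.Random(seed)
    n_train = round(len(rows) * train_rate)
    train_idx = set(rng.sample(range(len(rows)), n_train))  # sampled without replacement
    train = [row for i, row in enumerate(rows) if i in train_idx]
    test = [row for i, row in enumerate(rows) if i not in train_idx]
    return train, test


rows = list(range(10))  # stand-in for the song rows
train, test = split_train_test(rows)
print(len(train), "training rows,", len(test), "testing rows")
```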

KNN Modeling

Then we can implement it with SAS’s built-in KNN support in PROC DISCRIM (thank heavens).

proc discrim data=spotify_clean_pop_training test=spotify_clean_pop_testing method=npar k=40;
    class playlist_subgenre; /* this is the column I am predicting */
    var feature_1-feature_112; /* this is the column(s) that I am using as predictors */
    priors proportional; 
run;

Results

We see a 64% error rate on the training data and a 66% error rate on the testing data. This means we have 36% training accuracy and 34% testing accuracy.

This is much better!

Conclusion

Currently, the MMFCC features alone are enough to beat randomly guessing among the 4 subgenres (25%).
The multinomial model may need more predictors to improve accuracy, but that could make it too complex. It is also possible that the subgenres are simply too similar.

Summary

Dataset
- What is in the dataset?
- Where did I get the dataset?
- Some Additions/Changes
- Summary of dataset
“Research” Questions
- General Questions
- Statistical Questions
- Model Building
Preliminary
- Cleaning the dataset
Analysis
- Exploratory Analysis
- Statistical Analysis
- Model Building