Predicting mood with my Spotify data
Intro
Spotify users are surely familiar with the playlists we listen to regularly, whether curated by Spotify or built by ourselves. But have we ever wondered what the songs in those playlists say about our own mood? In this article I try to predict mood using my own Spotify data.
For this article I pulled six of my saved playlists using the spotifyr package, which can be installed by following the guide at this link. The next step is clustering with the K-Means algorithm to determine the mood clusters/labels. After that comes the modeling stage, using KNN, Support Vector Machine, Random Forest, and Naive Bayes.
This project was inspired by this article. Let's get straight to work!
Import libraries
library(tidyverse) # wrangling
library(ggplot2) # visualization
library(spotifyr) # Spotify Web API client
library(lubridate)
library(RColorBrewer)
library(class) # knn
library(caret)
library(e1071) # naive Bayes and SVM
Setting up
First we set up our Spotify Developer account. The steps are described at this link.
# Sys.setenv(SPOTIFY_CLIENT_ID = 'XXXXXXXX') # Client ID
# Sys.setenv(SPOTIFY_CLIENT_SECRET = 'XXXXXX') # Client Secret
# access_token <- get_spotify_access_token() # token
Get the data
The data pulled for this project includes the audio features of each track in every playlist. The package can also fetch further information, such as the tracks of a particular artist, genres, and so on; see this link.
# pull the data from my playlists
# playlist <- get_playlist_audio_features(
# username = "M Taufiq",
# playlist_uris =
# c("0O9bUEzVmXu3mjZ0VfF291?si=849df101321c445b",
# "37i9dQZF1DX4SBhb3fqCJd?si=38a7e7d6a5804749",
# "37i9dQZF1DWYtDSKIiDhua?si=c740499ad3b24d32",
# "37i9dQZF1DWUa8ZRTfalHk?si=2abc9064a7be4559",
# "37i9dQZF1DX70RN3TfWWJh?si=1bc4bfa4bcac4743",
# "37i9dQZF1DWSfMe9z89s9B",
# "37i9dQZF1DWUa8ZRTfalHk"
# ))
#
# # convert as df
# playlist <- playlist %>%
# as.data.frame()
#
# head(playlist)
The playlist dataset has 61 columns and 924 rows. Next, we select the columns needed for the rest of the analysis.
# my_playlist <- playlist[,c("playlist_name",
# "track.name",
# "playlist_owner_name",
# "danceability",
# "energy",
# "loudness",
# "speechiness",
# "acousticness",
# "instrumentalness",
# "liveness",
# "valence",
# "tempo",
# "track.popularity",
# "track.album.name",
# "track.album.release_date",
# "track.duration_ms")]
# save as csv
# write.csv(x = my_playlist, file = 'data/my_playlist.csv')
Data Preparation and EDA
my_playlist <- read.csv("data/my_playlist.csv")
head(my_playlist)
##   X playlist_name                 track.name playlist_owner_name danceability
## 1 1        My RnB             Because Of You            M Taufiq        0.810
## 2 2        My RnB                    So Sick            M Taufiq        0.452
## 3 3        My RnB                     My Boo            M Taufiq        0.662
## 4 4        My RnB No Air (feat. Chris Brown)            M Taufiq        0.466
## 5 5        My RnB                     No One            M Taufiq        0.644
## 6 6        My RnB                 Supermodel            M Taufiq        0.613
##   energy loudness speechiness acousticness instrumentalness liveness valence
## 1  0.538   -5.784      0.0356       0.5280       0.00000000   0.0951   0.828
## 2  0.574   -8.336      0.3100       0.2460       0.00000000   0.1890   0.580
## 3  0.507   -8.238      0.1180       0.2570       0.00000000   0.0465   0.676
## 4  0.759   -4.978      0.1990       0.0521       0.00000000   0.0587   0.328
## 5  0.549   -5.415      0.0285       0.0209       0.00000885   0.1340   0.167
## 6  0.442   -8.874      0.2880       0.6510       0.00000000   0.2600   0.252
##     tempo track.popularity               track.album.name
## 1 109.970               75                 Because Of You
## 2  92.791               76                In My Own Words
## 3  86.412               77 Confessions (Expanded Edition)
## 4 160.033               61                  Jordin Sparks
## 5  90.040               78     As I Am (Expanded Edition)
## 6 119.737               71                           Ctrl
##   track.album.release_date track.duration_ms
## 1               2007-01-01            266840
## 2               2006-01-01            207186
## 3               2004-03-23            223440
## 4               2007-11-19            264373
## 5               2007-11-09            253813
## 6               2017-06-09            181120
Check for missing values
##                        X            playlist_name               track.name
##                        0                        0                        0
##      playlist_owner_name             danceability                   energy
##                        0                        0                        0
##                 loudness              speechiness             acousticness
##                        0                        0                        0
##         instrumentalness                 liveness                  valence
##                        0                        0                        0
##                    tempo         track.popularity         track.album.name
##                        0                        0                        0
## track.album.release_date        track.duration_ms
##                        0                        0
There are no missing values in our data.
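The call that produced the output above is not shown in this write-up. As a minimal sketch, here is one common way to get per-column missing-value counts in base R. The data frame `toy_playlist` is hypothetical and only stands in for `my_playlist`.

```r
# Hypothetical toy data standing in for my_playlist
toy_playlist <- data.frame(
  track.name   = c("Song A", "Song B", "Song C"),
  danceability = c(0.81, NA, 0.66),
  energy       = c(0.54, 0.57, 0.51)
)

# Count missing values per column
na_counts <- colSums(is.na(toy_playlist))
print(na_counts)

# TRUE only when the whole data frame is free of missing values
no_missing <- !anyNA(toy_playlist)
print(no_missing)
```

On the real data, a vector of zeros like the one shown above means no imputation or row removal is needed.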
Next we look at the summary of the my_playlist data frame. For reference, here is what each audio feature in our data means according to the Spotify API docs:
acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 db.
speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
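To make thresholds like these concrete, here is a small sketch that labels tracks using the speechiness cut-offs quoted above (0.33 and 0.66). The label names and the last two input values are made up for illustration; the first two values come from the head(my_playlist) output.

```r
# Speechiness values: the first two are from head(my_playlist),
# the last two are made up to cover the higher ranges
speechiness <- c(0.0356, 0.3100, 0.45, 0.70)

# Bin the values at the documented thresholds.
# Intervals are right-closed, so exactly 0.33 counts as "music".
speech_type <- cut(
  speechiness,
  breaks = c(0, 0.33, 0.66, 1),
  labels = c("music", "music and speech", "spoken word"),
  include.lowest = TRUE
)

print(data.frame(speechiness, speech_type))
```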
Check the summary of my_playlist
##        X         playlist_name       track.name         playlist_owner_name
##  Min.   :  1.0   Length:924          Length:924         Length:924
##  1st Qu.:231.8   Class :character    Class :character   Class :character
##  Median :462.5   Mode  :character    Mode  :character   Mode  :character
##  Mean   :462.5
##  3rd Qu.:693.2
##  Max.   :924.0
##   danceability        energy           loudness         speechiness
##  Min.   :0.2360   Min.   :0.1930   Min.   :-18.118   Min.   :0.02400
##  1st Qu.:0.5630   1st Qu.:0.4850   1st Qu.: -7.810   1st Qu.:0.03950
##  Median :0.6630   Median :0.6280   Median : -6.309   Median :0.05270
##  Mean   :0.6376   Mean   :0.6067   Mean   : -6.559   Mean   :0.08095
##  3rd Qu.:0.7210   3rd Qu.:0.7390   3rd Qu.: -4.965   3rd Qu.:0.08360
##  Max.   :0.9130   Max.   :0.9600   Max.   : -1.856   Max.   :0.61100
##   acousticness      instrumentalness      liveness          valence
##  Min.   :0.00125   Min.   :0.000000   Min.   :0.01900   Min.   :0.0389
##  1st Qu.:0.04620   1st Qu.:0.000000   1st Qu.:0.09527   1st Qu.:0.3098
##  Median :0.19600   Median :0.000000   Median :0.11700   Median :0.4595
##  Mean   :0.27244   Mean   :0.006044   Mean   :0.16561   Mean   :0.4762
##  3rd Qu.:0.43550   3rd Qu.:0.000054   3rd Qu.:0.18425   3rd Qu.:0.6240
##  Max.   :0.95500   Max.   :0.833000   Max.   :0.89200   Max.   :0.9560
##      tempo        track.popularity  track.album.name   track.album.release_date
##  Min.   : 45.78   Min.   :  0.00    Length:924         Length:924
##  1st Qu.: 94.72   1st Qu.: 64.00    Class :character   Class :character
##  Median :115.46   Median : 73.00    Mode  :character   Mode  :character
##  Mean   :117.27   Mean   : 69.16
##  3rd Qu.:134.06   3rd Qu.: 79.00
##  Max.   :199.89   Max.   :100.00
##  track.duration_ms
##  Min.   :108613
##  1st Qu.:173929
##  Median :187488
##  Mean   :194329
##  3rd Qu.:213837
##  Max.   :588139
The summary() output gives us several insights:
- There is one column we do not need: X.
- We need to change the data types of playlist_name, playlist_owner_name, and track.album.release_date.
- We need to tidy the format of track.duration_ms into %M:%S.
- According to the Spotify API docs, most audio features range from 0 to 1. The summary, however, shows that danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, and valence do not span that full range (and loudness is measured in dB), so we need to normalize them to the 0 to 1 range.
Let's tidy up the data.
# wrangling
playlist.clean <- my_playlist %>%
  select(-X) %>%
  mutate_at(.vars = c("playlist_name","playlist_owner_name"), .funs = as.factor) %>%
  mutate(track.album.release_date = as.Date(track.album.release_date)) %>%
  mutate(track.duration_ms = format(
    as.POSIXct(Sys.Date(), tz = "GMT") + track.duration_ms/1000, "%M:%S"))
# rename
playlist.clean <- playlist.clean %>%
  rename(track.duration = track.duration_ms)
Popular Track
Next, let's see which tracks in my playlists rank highest by popularity score.
The plot shows that As It Was by Harry Styles is the most popular track among those I listen to often. Next, we look at which release years dominate my listening.
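The plotting code for this chart is not included in the write-up. A minimal sketch of how such a ranking could be drawn with ggplot2 is shown below. The data frame `toy_tracks` and its popularity values are made up, and the top-3 cut-off is an arbitrary choice for illustration.

```r
library(ggplot2)

# Hypothetical toy data; the popularity values are made up
toy_tracks <- data.frame(
  track.name       = c("Track A", "Track B", "Track C", "Track D"),
  track.popularity = c(75, 92, 61, 88)
)

# Keep the top 3 tracks by popularity
top_tracks <- head(toy_tracks[order(-toy_tracks$track.popularity), ], 3)

# Horizontal bar chart with the most popular track at the top
p <- ggplot(top_tracks,
            aes(x = reorder(track.name, track.popularity),
                y = track.popularity)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Popularity", title = "Most popular tracks")
print(p)
```

Keep in mind that track.popularity is Spotify's global popularity score, so this ranks the tracks in the playlists rather than measuring personal play counts.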
Songs by Year
The bar plot shows that most of the tracks I listen to were released in 2022. Next, we examine what the relationships between the audio features can tell us.
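The code behind the bar plot is not shown. As a rough sketch, counting tracks per release year could look like this in base R. The six dates are the ones visible in the head(my_playlist) output, so the counts below describe only those six rows, not the full dataset.

```r
# Release dates taken from the head(my_playlist) output
release_date <- as.Date(c("2007-01-01", "2006-01-01", "2004-03-23",
                          "2007-11-19", "2007-11-09", "2017-06-09"))

# Extract the year and count tracks per year, most frequent first
release_year  <- format(release_date, "%Y")
songs_by_year <- sort(table(release_year), decreasing = TRUE)
print(songs_by_year)

# Base R bar plot of the counts
barplot(songs_by_year, xlab = "Release year", ylab = "Number of tracks")
```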
Heatmap correlation audio feature
The heatmap reveals several insights:
- energy and loudness are strongly positively correlated (0.8)
- danceability and valence are positively correlated (0.5)
- energy and loudness are both positively correlated with valence (0.3)
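The heatmap code is not included above. Below is a minimal sketch of how a correlation heatmap can be built from a correlation matrix with ggplot2. It uses only the six rows visible in head(my_playlist) and four of the features, so its coefficients will not match the values quoted for the full dataset.

```r
library(ggplot2)

# Audio features of the six tracks shown in head(my_playlist)
features <- data.frame(
  danceability = c(0.810, 0.452, 0.662, 0.466, 0.644, 0.613),
  energy       = c(0.538, 0.574, 0.507, 0.759, 0.549, 0.442),
  loudness     = c(-5.784, -8.336, -8.238, -4.978, -5.415, -8.874),
  valence      = c(0.828, 0.580, 0.676, 0.328, 0.167, 0.252)
)

# Pearson correlation matrix
cor_mat <- cor(features)

# Reshape to long format: one row per pair of features
cor_long <- as.data.frame(as.table(cor_mat))
names(cor_long) <- c("feature_x", "feature_y", "correlation")

# One tile per pair, coloured and labelled by the coefficient
p <- ggplot(cor_long, aes(feature_x, feature_y, fill = correlation)) +
  geom_tile() +
  geom_text(aes(label = round(correlation, 1))) +
  scale_fill_gradient2(limits = c(-1, 1)) +
  labs(x = NULL, y = NULL)
print(p)
```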
Next we visualize the relationships found above as scatterplots.
The scatterplots show that these pairs of variables are linearly related, with the strongest positive correlation between energy and loudness.
Next we look at the distribution of audio features for each playlist with a radar chart. First, however, we check the data for outliers.
Check for outliers
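The outlier check itself is not shown in this write-up. One common approach, sketched here under the assumption that a boxplot-style rule was used, flags points that lie more than 1.5 times the interquartile range beyond the hinges. The loudness values are made up, with the last one placed far from the rest on purpose.

```r
# Toy loudness values in dB; the last value is made up to act as an outlier
loudness <- c(-5.8, -8.3, -8.2, -5.0, -5.4, -8.9, -6.3, -7.1, -30.0)

# boxplot.stats() returns the points beyond 1.5 * IQR from the hinges
outlier_values <- boxplot.stats(loudness)$out

# Row positions of those points, usable for negative indexing
outlier_rows <- which(loudness %in% outlier_values)
print(outlier_rows)

# Data with the flagged rows removed
loudness_clean <- loudness[-outlier_rows]
```

Row positions collected this way can be dropped with negative indexing, which is the same pattern the removal step uses.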
Removing outliers
outlier <- c(271,286,316,317,665,666,680,709,747,748)
playlist_normal <- playlist.clean[-outlier,]
After removing the outliers, we normalize the data with min-max scaling.
# use track names as row names
data.playlist <- playlist_normal %>%
  distinct(track.name, .keep_all = TRUE) %>%
  as_tibble() %>%
  column_to_rownames(var = "track.name") %>%
  select(-c("playlist_name", "playlist_owner_name",
            "tempo", "track.popularity", "track.album.name",
            "track.album.release_date", "track.duration"))
# Normalize the data (min-max scaling to the 0-1 range)
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
audio_features <- c("acousticness", "danceability", "energy", "instrumentalness",
                    "liveness", "loudness", "speechiness", "valence")
# apply normalize() to each audio feature column
data.playlist[audio_features] <- lapply(data.playlist[audio_features], normalize)
data.playlist$avg_feature <- data.playlist %>%
  rowMeans()
data_radar <- data.playlist %>%
  arrange(-avg_feature)
# Add the maximum (1) and minimum (0) rows above our dataset
playlist.web <- rbind(1, 0, data_radar) %>%
  select(-avg_feature) %>%
  head(15)
Playlist Analysis
The radar chart shows the following audio feature profile for each playlist:
- all playlists have low instrumentalness
- the My RnB playlist leads on speechiness and acousticness
- the mood playlist leads only on acousticness, while the Pop Rising playlist leads only on valence
- the Are & Be playlist leads on three audio features (energy, loudness, and danceability), while the Workout playlist leads on loudness, energy, and valence
- the Alternative R&B playlist leads on two audio features: speechiness and liveness
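The comparison above boils down to averaging each feature per playlist and seeing which playlist has the highest mean. A small sketch of that calculation in base R follows. The data frame `toy` and all of its values are made up, and only two of the playlist names are reused as labels.

```r
# Hypothetical normalized features for two playlists (values are made up)
toy <- data.frame(
  playlist_name = c("My RnB", "My RnB", "Workout", "Workout"),
  energy        = c(0.40, 0.50, 0.90, 0.80),
  acousticness  = c(0.70, 0.60, 0.10, 0.20),
  valence       = c(0.50, 0.30, 0.70, 0.60)
)

# Mean of every audio feature per playlist:
# each row corresponds to one polygon on a radar chart
playlist_profile <- aggregate(. ~ playlist_name, data = toy, FUN = mean)
print(playlist_profile)

# For each feature, the playlist with the highest mean
feature_cols <- setdiff(names(playlist_profile), "playlist_name")
dominant <- sapply(
  playlist_profile[feature_cols],
  function(x) as.character(playlist_profile$playlist_name)[which.max(x)]
)
print(dominant)
```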
The next step is to determine the optimal value of k by comparing the Elbow method with the Silhouette method. This is needed because this project clusters the tracks with the K-Means algorithm, which requires the number of clusters k to be chosen in advance.
Clustering
Elbow method vs Silhouette Method
Based on the elbow plot, we will use k = 4 as the optimal value.
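The elbow and silhouette plots are not reproduced here. As a rough sketch of what the two methods compute, the code below uses base kmeans() and silhouette() from the cluster package on synthetic data. The three blobs, the range of k, and nstart = 25 are all assumptions for illustration, not the article's settings.

```r
library(cluster)  # provides silhouette()

set.seed(123)
# Synthetic data: three well-separated 2-D blobs (made up for illustration)
x <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2)
)

k_values <- 2:8

# Elbow method: total within-cluster sum of squares for each k.
# The "elbow" is where adding clusters stops reducing it much.
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# Silhouette method: average silhouette width for each k (higher is better)
d <- dist(x)
avg_sil <- sapply(k_values, function(k) {
  cl <- kmeans(x, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})

best_k_silhouette <- k_values[which.max(avg_sil)]
print(data.frame(k = k_values, wss = wss, avg_sil = avg_sil))
print(best_k_silhouette)
```

The two criteria do not always agree, which is why it is worth comparing them before settling on a value of k.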
Now that we know the optimal k, we build the K-Means model.
K-Means Cluster
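The K-Means fitting code is not shown in this write-up. A minimal sketch with base kmeans() and k = 4 is given below. The data frame `toy_features` is made up, and nstart = 25 is an assumption. The resulting label is stored in a column named cluster, which is the column name the profiling code relies on.

```r
set.seed(123)
# Hypothetical normalized features (made up); each row is a track
toy_features <- data.frame(
  energy       = runif(40),
  acousticness = runif(40),
  valence      = runif(40)
)

# Fit K-Means with k = 4; nstart = 25 restarts is an assumption
km <- kmeans(toy_features, centers = 4, nstart = 25)

# Attach the cluster label to each track
toy_features$cluster <- km$cluster

# Cluster sizes and centroids
print(table(toy_features$cluster))
print(km$centers)
```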
Cluster Profiling
# profile each cluster by averaging its features
playlist.profil <- data.playlist %>%
group_by(cluster) %>%
summarise_all(mean)
playlist.profil %>%
pivot_longer(-cluster, names_to = "audio_feature") %>%
group_by(audio_feature) %>%
summarize(cluster_min_val = which.min(value),
cluster_max_val = which.max(value))
## # A tibble: 8 x 3
##   audio_feature    cluster_min_val cluster_max_val
##   <chr>                      <int>           <int>
## 1 acousticness                   3               1
## 2 danceability                   1               3
## 3 energy                         1               3
## 4 instrumentalness               3               1
## 5 liveness                       1               4
## 6 loudness                       1               3
## 7 speechiness                    3               2
## 8 valence                        4               3
The cluster profiling reveals the characteristics of each cluster.
- Cluster 1 (Chill) : High acousticness and instrumentalness; low danceability, energy, liveness, loudness
## # A tibble: 5 x 1
## value
## <chr>
## 1 Supermodel
## 2 Drew Barrymore
## 3 The Weekend
## 4 Go Gina
## 5 20 Something
- Cluster 2 (Romantic) : High speechiness
## # A tibble: 5 x 1
## value
## <chr>
## 1 Because Of You
## 2 So Sick
## 3 My Boo
## 4 Garden (Say It Like Dat)
## 5 No Love (with SZA)
- Cluster 3 (Energetic) : High danceability, energy, loudness; low acousticness, instrumentalness, speechiness
## # A tibble: 5 x 1
## value
## <chr>
## 1 Kiss Me More (feat. SZA)
## 2 Hit Different
## 3 All The Stars (with SZA)
## 4 Consideration
## 5 With You
- Cluster 4 (Sad) : High liveness, low valence
## # A tibble: 5 x 1
## value
## <chr>
## 1 No Air (feat. Chris Brown)
## 2 No One
## 3 Love Galore (feat. Travis Scott)
## 4 Prom
## 5 Broken Clocks
Modeling My Mood Cluster
We classify the mood clusters obtained above with KNN, Naive Bayes, Random Forest, and SVM. Before that, we split the data for hold-out validation: 80% for training and the remaining 20% for testing.
data.modeling <- data.playlist %>% `rownames<-`(NULL)
data.modeling <- data.modeling %>%
  mutate(cluster = recode(cluster, "1" = "Chill",
                          "2" = "Romantic",
                          "3" = "Energetic",
                          "4" = "Sad"))
RNGkind(sample.kind = "Rounding")
set.seed(123)
index <- sample(x = nrow(data.modeling), size = nrow(data.modeling)*0.8)
# splitting
mood_train <- data.modeling[index,]
mood_test <- data.modeling[-index,]
# target
mood_train_y <- mood_train[,"cluster"]
mood_test_y <- mood_test[,"cluster"]
KNN
# build knn (column 9 is the cluster label, so it is dropped from the predictors)
knn_mood <- knn(train = mood_train[,-9],
                test = mood_test[,-9],
                cl = mood_train$cluster,
                # knn() coerces k to an integer, so truncate explicitly
                k = floor(sqrt(nrow(mood_train))))
# confusion matrix
knn_result <- confusionMatrix(as.factor(knn_mood),
                              as.factor(mood_test_y))
df_knn_result <- as.data.frame(as.table(knn_result)) %>%
  mutate(
    model = 'KNN'
  )
model_result <- df_knn_result
# result for Accuracy
acc_knn <- as.data.frame(as.table(
  as.matrix(knn_result,
            what = "overall"))) %>%
  mutate(
    Model = 'KNN'
  ) %>%
  filter(
    Var1 == 'Accuracy'
  ) %>%
  rename(
    Accuracy = Freq
  ) %>%
  select(Model, Accuracy)
accuracy <- acc_knn
Naive Bayes
# build model naive bayes
naives_mood <- naiveBayes(mood_train %>% select(-cluster),
                          mood_train$cluster)
# predict
naive_pred <- predict(naives_mood, mood_test, type = "class")
# confusion matrix
naive_result <- confusionMatrix(as.factor(naive_pred),
                                as.factor(mood_test$cluster))
df_naive_result <- as.data.frame(as.table(naive_result)) %>%
  mutate(
    model = 'NAIVE BAYES'
  )
model_result <- rbind(model_result, df_naive_result)
# result for Accuracy
acc_naive <- as.data.frame(as.table(
  as.matrix(naive_result,
            what = "overall"))) %>%
  mutate(
    Model = 'NAIVE BAYES'
  ) %>%
  filter(
    Var1 == 'Accuracy'
  ) %>%
  rename(
    Accuracy = Freq
  ) %>%
  select(Model, Accuracy)
accuracy <- rbind(accuracy, acc_naive)
Random Forest
library(partykit)
library(randomForest) # backend used by caret's method = "rf"
set.seed(123)
# 6-fold cross-validation, repeated 3 times, for tuning
ctrl <- trainControl(method = "repeatedcv",
                     number = 6, repeats = 3)
rf_mood <- train(cluster ~ .,
                 data = mood_train,
                 method = "rf",
                 trControl = ctrl)
# predict
rf_pred <- predict(rf_mood, mood_test, type = "raw")
# confusion matrix
rf_result <- confusionMatrix(as.factor(rf_pred), as.factor(mood_test$cluster))
df_rf_result <- as.data.frame(as.table(rf_result)) %>%
  mutate(
    model = 'RANDOM FOREST'
  )
model_result <- rbind(model_result, df_rf_result)
# result for Accuracy
acc_rf <- as.data.frame(as.table(
  as.matrix(rf_result,
            what = "overall"))) %>%
  mutate(
    Model = 'RANDOM FOREST'
  ) %>%
  filter(
    Var1 == 'Accuracy'
  ) %>%
  rename(
    Accuracy = Freq
  ) %>%
  select(Model, Accuracy)
accuracy <- rbind(accuracy, acc_rf)
SVM
# preparation
mood_train <- mood_train %>%
  mutate(cluster = as.factor(cluster))
# build model svm
svm_mood <- svm(
  formula = cluster ~ .,
  data = mood_train,
  type = 'C-classification',
  kernel = "linear"
)
# predict with data test
svm_pred <- predict(svm_mood, mood_test)
svm_result <- confusionMatrix(as.factor(svm_pred),
                              as.factor(mood_test$cluster))
df_svm_result <- as.data.frame(as.table(svm_result)) %>%
  mutate(
    model = 'SVM'
  )
model_result <- rbind(model_result, df_svm_result)
# result for Accuracy
acc_svm <- as.data.frame(as.table(
  as.matrix(svm_result, what = "overall"))) %>%
  mutate(
    Model = 'SVM'
  ) %>%
  filter(
    Var1 == 'Accuracy'
  ) %>%
  rename(
    Accuracy = Freq
  ) %>%
  select(Model, Accuracy)
accuracy <- rbind(accuracy, acc_svm)
# show all model performance by accuracy
accuracy$Accuracy <- round(accuracy$Accuracy,2)*100
accuracy
##           Model Accuracy
## 1           KNN       92
## 2   NAIVE BAYES       88
## 3 RANDOM FOREST       84
## 4           SVM       92
The accuracy results show that KNN and SVM tie for the best score at 92%, followed by Naive Bayes (88%) and Random Forest (84%). Either KNN or SVM is therefore good enough at predicting our mood as listeners.
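For readers who want to see where an accuracy figure comes from, here is a small worked sketch in base R. The predicted and actual labels are made up and are not the article's test set. Accuracy is the share of predictions on the diagonal of the confusion matrix, and per-class recall shows which moods a model gets wrong.

```r
# Hypothetical predictions and true labels (made up for illustration)
mood_levels <- c("Chill", "Energetic", "Romantic", "Sad")
actual <- factor(
  c("Chill", "Chill", "Energetic", "Romantic", "Sad", "Sad", "Energetic", "Romantic"),
  levels = mood_levels
)
predicted <- factor(
  c("Chill", "Sad", "Energetic", "Romantic", "Sad", "Sad", "Energetic", "Chill"),
  levels = mood_levels
)

# Confusion matrix: rows are predictions, columns are true labels
conf_mat <- table(Prediction = predicted, Reference = actual)
print(conf_mat)

# Accuracy = correct predictions / all predictions
accuracy_value <- sum(diag(conf_mat)) / sum(conf_mat)

# Per-class recall = correct predictions for a class / true members of that class
recall_by_class <- diag(conf_mat) / colSums(conf_mat)
print(accuracy_value)
print(recall_by_class)
```

Two models with the same overall accuracy can still differ in which classes they confuse, so the per-class view is a useful tie-breaker.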
Conclusion
This project is a fun one for active Spotify users. It gives us a rough idea of our mood based on the songs in the playlists we listen to most. It also helps us understand listeners' feelings at scale once we know the characteristics of the music each user listens to. Beyond that, there is real potential in building a song recommendation system based on a user's mood, using methods different from the ones in this project.