Spotify: Profiling My Songs through Mood Clustering and Recommending Songs
Introduction
Spotify users will be familiar with features such as “Uniquely Yours”, “Sesuaikan dengan Aktivitasmu” (“Match Your Activity”), and “Made for You”, where Spotify seems to know both the music we like and our emotional state. In this article we will try to “profile” the music in our own Spotify library by clustering it by mood, and then recommend songs for each mood from popular tracks released from 2005 to the present.
The first step is to retrieve our playlists and the songs in them. Each song is described by its audio features. I used Spotipy, a Python library, to pull the data from Spotify and obtain the features of each song. See the Spotipy documentation for details.
The next step is to cluster the songs in our playlists by mood using K-Means. After that, we train classifiers (KNN, among others) to predict the mood of a song, and then recommend the most popular songs in that mood.
Data Preparation
Load the Dataset
Here are the first 10 rows of the data:
The data consists of 3,099 rows and 20 variables. Each variable is described below:
- playlist: Playlist name.
- playlist_owner: Creator of the playlist.
- added_at: When the song was added to the playlist.
- artist_id: Spotify ID of the artist.
- artist_name: Name of the artist.
- track_id: Spotify ID of the song.
- track_name: Song title.
- release_year: Year the song was released.
- duration_ms: Duration of the song in milliseconds.
- track_pop: Popularity score of the song, from 0 to 100, where 100 is the most popular. Popularity is calculated by Spotify's algorithm and is based mostly on how often the song is played.
The remaining variables are the audio features of each song:
- danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
- speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
- acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
- instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
- valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
See Spotify's audio features reference for the full list of available features.
Before we start clustering, let's see whether my taste in music has changed from year to year:
Clustering
Elbow Method
Before running the cluster analysis, we need to determine the optimal number of clusters using the Elbow Method:
Based on the elbow plot above, our songs fall into 4 clusters.
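The elbow curve can be computed along the following lines. This is only a minimal sketch in base R: the `features` data frame is filled with synthetic values standing in for the six audio features, and the range of k from 1 to 10 is an assumption, not necessarily what was used for the plot above.

```r
# Synthetic stand-in for the six audio features (hypothetical data)
set.seed(42)
features <- as.data.frame(matrix(runif(200 * 6), ncol = 6))
names(features) <- c("danceability", "energy", "speechiness",
                     "acousticness", "instrumentalness", "valence")

# Total within-cluster sum of squares for each candidate k
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(features, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which the curve starts to flatten
plot(k_values, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

On real playlist data, the k at which the curve bends is the candidate number of clusters; on uniform random data like this there is no clear elbow.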
K-Means Clustering
Create the clusters using the K-Means clustering method:
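A minimal sketch of this step, assuming the six audio features are clustered as-is (they all already range from 0 to 1; whether the original analysis scaled them first is not stated). The `songs` data frame is synthetic, and `nstart = 25` is an illustrative choice.

```r
# Synthetic stand-in for the six audio features (hypothetical data)
set.seed(42)
songs <- as.data.frame(matrix(runif(200 * 6), ncol = 6))
names(songs) <- c("danceability", "energy", "speechiness",
                  "acousticness", "instrumentalness", "valence")

# Fit K-Means with the 4 clusters suggested by the elbow plot
km <- kmeans(songs, centers = 4, nstart = 25)

# Attach the cluster assignment to each song as a factor
songs$cluster <- as.factor(km$cluster)

# Number of songs per cluster
print(table(songs$cluster))
```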
Cluster Profiling
Based on the data above, we give each cluster a name:
- Cluster 1 (Romantic Mood) : High acousticness, danceability. Low instrumentalness, energy, speechiness, valence.
## label
## 1 Emily Hackett,Will Anderson - Take My Hand (The Wedding Song) [feat. Will Anderson]
## 2 Chris Medina - What Are Words
## 3 Collin Raye - Love, Me
## 4 Cristian Castro - Por Amarte Así
## 5 Shayne Ward - No Promises
## 6 Collin Raye - In This Life
- Cluster 2 (Cheerful Mood) : High danceability, energy, valence. Low acousticness, instrumentalness, speechiness.
## label
## 1 Andru Donalds - All Out Of Love
## 2 R. City,Adam Levine - Locked Away
## 3 Alan Walker - Spectre
## 4 Alesso,Sirena - Sweet Escape
## 5 Antoine Clamaran,Fenja - This Is My Goodbye - Radio Edit
## 6 Avicii - The Nights
- Cluster 3 (Chill Mood) : High acousticness, instrumentalness. Low danceability, energy, speechiness, valence.
## label
## 1 Neneh Yacobi - Too Good At Goodbyes
## 2 Neneh Yacobi - Love Me Like You Do
## 3 Neneh Yacobi - Here Comes The Sun
## 4 Rita May - Colors of the Wind
## 5 Serenity for Sleep - Home
## 6 Moux - Back Home
- Cluster 4 (Energetic Mood) : High danceability, energy. Low acousticness, instrumentalness, speechiness, valence.
## label
## 1 Skye Townsend - Noreg
## 2 Anima - Bintang
## 3 Avril Lavigne - When You're Gone
## 4 One Direction - More Than This
## 5 Shayne Ward - You're Not Alone
## 6 Collin Raye - If I Were You
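The high/low descriptions above can be read off a table of per-cluster feature means. The sketch below shows one way to build such a table in base R; the `songs` data frame is synthetic and the cluster numbering it produces will not correspond to the moods named above.

```r
# Synthetic stand-in for the six audio features (hypothetical data)
set.seed(42)
songs <- as.data.frame(matrix(runif(200 * 6), ncol = 6))
names(songs) <- c("danceability", "energy", "speechiness",
                  "acousticness", "instrumentalness", "valence")

# Assign each song to one of 4 clusters
songs$cluster <- kmeans(songs, centers = 4, nstart = 25)$cluster

# Mean of every feature within each cluster
profile <- aggregate(. ~ cluster, data = songs, FUN = mean)
print(round(profile, 2))
```

Comparing each row with the overall feature means shows which features are high or low in a given cluster.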
Modeling My Mood Labels
Before choosing a model, we split the data into a training set and a test set: 70% of the data is used for training and the rest for testing.
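A 70/30 split could be done as in the sketch below. The names `data_train` and `data_test` match the code that follows; the `songs` data frame, the seed, and the use of simple random sampling (rather than, say, stratified sampling) are assumptions for illustration.

```r
# Synthetic stand-in: six audio features plus a cluster label in column 7
set.seed(100)
songs <- as.data.frame(matrix(runif(200 * 6), ncol = 6))
names(songs) <- c("danceability", "energy", "speechiness",
                  "acousticness", "instrumentalness", "valence")
songs$cluster <- factor(sample(1:4, nrow(songs), replace = TRUE))

# Randomly pick 70% of the row indices for training
train_idx <- sample(nrow(songs), size = round(0.7 * nrow(songs)))

data_train <- songs[train_idx, ]   # 70% of the rows
data_test  <- songs[-train_idx, ]  # remaining 30%
```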
KNN
library(class)   # knn()
library(caret)   # confusionMatrix(), trainControl(), train()
library(dplyr)   # %>%, mutate(), filter(), rename(), select()
# Column 7 is assumed to hold the cluster label, so it is excluded from the predictors.
# k follows the square-root-of-n rule of thumb.
pred_label <- knn(train = data_train[, -7],
                  test = data_test[, -7],
                  cl = data_train$cluster,
                  k = round(sqrt(nrow(data_train))))
# result for confusionMatrix
results1 <- confusionMatrix(as.factor(pred_label), as.factor(data_test$cluster), positive = "1")
df_result1 <- as.data.frame(as.table(results1)) %>%
  mutate(
    model = 'KNN'
  )
df_result <- df_result1
# result for Accuracy
accuracy1 <- as.data.frame(as.table(as.matrix(results1, what = "overall"))) %>%
  mutate(
    Model = 'KNN'
  ) %>%
  filter(
    Var1 == 'Accuracy'
  ) %>%
  rename(
    Accuracy = Freq
  ) %>%
  select(Model, Accuracy)
accuracy <- accuracy1
NAIVE BAYES CLASSIFIER
library(e1071)   # naiveBayes()
# Naive Bayes with Laplace smoothing, using every other column as a predictor
model_2 <- naiveBayes(formula = cluster ~ ., data = data_train, laplace = 1)
pred <- predict(model_2, data_test)
# result for confusionMatrix
results2 <- confusionMatrix(as.factor(pred), as.factor(data_test$cluster), positive = "1")
df_result2 <- as.data.frame(as.table(results2)) %>%
  mutate(
    model = 'NAIVE BAYES'
  )
# append to the confusion-matrix results collected so far
df_result <- rbind(df_result, df_result2)
# result for Accuracy
accuracy2 <- as.data.frame(as.table(as.matrix(results2, what = "overall"))) %>%
  mutate(
    Model = 'NAIVE BAYES'
  ) %>%
  filter(
    Var1 == 'Accuracy'
  ) %>%
  rename(
    Accuracy = Freq
  ) %>%
  select(Model, Accuracy)
accuracy <- rbind(accuracy, accuracy2)
DECISION TREE
library(partykit)   # ctree(), ctree_control(); assumption: the older party package also provides both
# Conditional inference tree with very permissive split settings, so the tree can grow deep
model_3 <- ctree(formula = cluster ~ .,
                 data = data_train,
                 control = ctree_control(mincriterion = 0.005, minsplit = 0, minbucket = 0))
pred <- predict(model_3, data_test)
# result for confusionMatrix
results3 <- confusionMatrix(as.factor(pred), as.factor(data_test$cluster), positive = "1")
df_result3 <- as.data.frame(as.table(results3)) %>%
  mutate(
    model = 'DECISION TREE'
  )
df_result <- rbind(df_result, df_result3)
# result for Accuracy
accuracy3 <- as.data.frame(as.table(as.matrix(results3, what = "overall"))) %>%
  mutate(
    Model = 'DECISION TREE'
  ) %>%
  filter(
    Var1 == 'Accuracy'
  ) %>%
  rename(
    Accuracy = Freq
  ) %>%
  select(Model, Accuracy)
accuracy <- rbind(accuracy, accuracy3)
RANDOM FOREST
# 5-fold cross-validation repeated 3 times.
# `repeats` only takes effect with method = "repeatedcv"; plain "cv" would ignore it.
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
# caret's "rf" method requires the randomForest package to be installed
model_4 <- train(cluster ~ ., data = data_train, method = "rf", trControl = ctrl)
pred <- predict(model_4, data_test)
# result for confusionMatrix
results4 <- confusionMatrix(as.factor(pred), as.factor(data_test$cluster), positive = "1")
df_result4 <- as.data.frame(as.table(results4)) %>%
  mutate(
    model = 'RANDOM FOREST'
  )
df_result <- rbind(df_result, df_result4)
# result for Accuracy
accuracy4 <- as.data.frame(as.table(as.matrix(results4, what = "overall"))) %>%
  mutate(
    Model = 'RANDOM FOREST'
  ) %>%
  filter(
    Var1 == 'Accuracy'
  ) %>%
  rename(
    Accuracy = Freq
  ) %>%
  select(Model, Accuracy)
accuracy <- rbind(accuracy, accuracy4)
SVM
library(e1071)   # svm()
# Support vector classifier with a linear kernel
model_5 <- svm(formula = cluster ~ .,
               data = data_train,
               type = 'C-classification',
               kernel = 'linear')
pred <- predict(model_5, data_test)
# result for confusionMatrix
results5 <- confusionMatrix(as.factor(pred), as.factor(data_test$cluster), positive = "1")
df_result5 <- as.data.frame(as.table(results5)) %>%
  mutate(
    model = 'SVM'
  )
df_result <- rbind(df_result, df_result5)
# result for Accuracy
accuracy5 <- as.data.frame(as.table(as.matrix(results5, what = "overall"))) %>%
  mutate(
    Model = 'SVM'
  ) %>%
  filter(
    Var1 == 'Accuracy'
  ) %>%
  rename(
    Accuracy = Freq
  ) %>%
  select(Model, Accuracy)
accuracy <- rbind(accuracy, accuracy5)