library(rsample)
library(tidymodels)
library(caret)
library(dplyr)
library(plotly)
Spotify offers digital copyright restricted recorded music and podcasts, including more than 60 million songs, from record labels and media companies.[7] As a freemium service, basic features are free with advertisements and limited control, while additional features, such as offline listening and commercial-free listening, are offered via paid subscriptions. Users can search for music based on artist, album, or genre, and can create, edit, and share playlists.
In this case, we want to analyze the relationship between valence and the other audio measures that the Spotify API provides.
Source of data and information:
library(tidyr)
library(tidyverse)
library(factoextra)
library(FactoMineR)
library(animation)
The dataset was obtained from the kaggle.com website and contains a variety of music tracks together with the audio feature indices provided by the Spotify API.
music_yes <- read.csv("SpotifyFeatures.csv", stringsAsFactors = T)
head(music_yes)
glimpse(music_yes)
#> Rows: 232,725
#> Columns: 18
#> $ ï..genre <fct> Movie, Movie, Movie, Movie, Movie, Movie, Movie, M...
#> $ artist_name <fct> Henri Salvador, Martin & les fées, Joseph William...
#> $ track_name <fct> "C'est beau de faire un Show", "Perdu d'avance (pa...
#> $ track_id <fct> 0BRjO6ga9RKCKjfDqeFgWV, 0BjC1NfoEOOusryehmNudP, 0C...
#> $ popularity <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, ...
#> $ acousticness <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.749...
#> $ danceability <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0...
#> $ duration_ms <int> 99373, 137373, 170267, 152427, 82625, 160627, 2122...
#> $ energy <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0....
#> $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0....
#> $ key <fct> C#, F#, C, C#, F, C#, C#, F#, C, G, E, C, F#, D#, ...
#> $ liveness <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0....
#> $ loudness <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970...
#> $ mode <fct> Major, Minor, Minor, Major, Major, Major, Major, M...
#> $ speechiness <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0....
#> $ tempo <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479...
#> $ time_signature <fct> 4/4, 4/4, 5/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, ...
#> $ valence <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0....
Column description:
colSums(is.na(music_yes))
#>         ï..genre      artist_name       track_name         track_id
#>                0                0                0                0
#>       popularity     acousticness     danceability      duration_ms
#>                0                0                0                0
#>           energy instrumentalness              key         liveness
#>                0                0                0                0
#>         loudness             mode      speechiness            tempo
#>                0                0                0                0
#>   time_signature          valence
#>                0                0
# Rename the first column, whose name was garbled on import
music_yes <- music_yes %>%
  rename(genre = "ï..genre")
head(music_yes)
glimpse(music_yes)
#> Rows: 232,725
#> Columns: 18
#> $ genre <fct> Movie, Movie, Movie, Movie, Movie, Movie, Movie, M...
#> $ artist_name <fct> Henri Salvador, Martin & les fées, Joseph William...
#> $ track_name <fct> "C'est beau de faire un Show", "Perdu d'avance (pa...
#> $ track_id <fct> 0BRjO6ga9RKCKjfDqeFgWV, 0BjC1NfoEOOusryehmNudP, 0C...
#> $ popularity <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, ...
#> $ acousticness <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.749...
#> $ danceability <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0...
#> $ duration_ms <int> 99373, 137373, 170267, 152427, 82625, 160627, 2122...
#> $ energy <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0....
#> $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0....
#> $ key <fct> C#, F#, C, C#, F, C#, C#, F#, C, G, E, C, F#, D#, ...
#> $ liveness <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0....
#> $ loudness <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970...
#> $ mode <fct> Major, Minor, Minor, Major, Major, Major, Major, M...
#> $ speechiness <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0....
#> $ tempo <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479...
#> $ time_signature <fct> 4/4, 4/4, 5/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, ...
#> $ valence <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0....
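The `ï..genre` name seen right after import is what a UTF-8 byte-order mark at the start of a CSV file typically turns into. As an alternative to renaming the column afterwards, the mark can be dropped at read time. The sketch below is an assumption about how this could be done, not the approach used in this analysis. It builds a tiny toy CSV (not the real `SpotifyFeatures.csv`) and reads it back with `fileEncoding = "UTF-8-BOM"`.

```r
# Build a tiny CSV that starts with a UTF-8 byte-order mark (toy data only).
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("genre,valence\nMovie,0.814\n")), tmp)

# "UTF-8-BOM" asks R to skip the mark while reading, so the first
# column name arrives clean and no rename() is needed afterwards.
toy <- read.csv(tmp, fileEncoding = "UTF-8-BOM", stringsAsFactors = TRUE)
names(toy)
```

For the real file, the same argument would simply be added to the existing `read.csv()` call.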
genre : Genre of each track
artist_name : Name of the artist
track_name : Name of each track
track_id : ID of each track
popularity : The popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by an algorithm and is based, for the most part, on the total number of plays the track has had and how recent those plays are.
acousticness : A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
danceability : Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
duration_ms : The duration of the track in milliseconds.
energy : Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
instrumentalness : Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater the likelihood that the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
key : The key the track is in. In the Spotify API, integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. In this dataset the key is stored as a pitch name instead (e.g. C#, F#).
liveness : Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
loudness : The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode : Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. In the Spotify API, major is represented by 1 and minor by 0. In this dataset the mode is stored as the labels Major and Minor.
speechiness : Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
tempo : The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
time_signature : An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
valence : A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
head(music_yes)
# Select the numeric columns needed for the analysis
music_ready <- music_yes %>%
  select(-c(genre, artist_name, track_name, track_id, key, mode, time_signature))
GGally::ggcorr(music_ready, label = T)
plot(prcomp(music_ready))
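The plot above shows that almost all of the variance is captured by PC1 alone. This happens because the variables are on very different scales (duration_ms is in the tens of thousands, while most other features lie between 0 and 1), so we need to scale the data before transforming it into principal components.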
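The effect of scaling can be illustrated on a small example. The sketch below uses made-up toy data (three independent columns that only borrow the names `duration_ms`, `energy` and `valence`), not the Spotify dataset, and the helper `var_share()` is a hypothetical name introduced here. It compares the share of variance per component with and without scaling.

```r
set.seed(101)

# Toy data: one column on a huge scale, two columns between 0 and 1.
toy <- data.frame(
  duration_ms = rnorm(200, mean = 200000, sd = 50000),
  energy      = runif(200),
  valence     = runif(200)
)

# Hypothetical helper: proportion of variance explained by each component.
var_share <- function(pca) pca$sdev^2 / sum(pca$sdev^2)

raw_share    <- var_share(prcomp(toy))                 # no scaling
scaled_share <- var_share(prcomp(toy, scale. = TRUE))  # unit variance first

round(raw_share, 4)     # PC1 takes essentially everything
round(scaled_share, 4)  # variance is spread across the components
```

Without scaling, the first component mostly reproduces the column with the largest raw variance, which says nothing about the structure of the data.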
# Scaling
music_z <- scale(music_ready)
prcomp(music_z)$rotation
#> PC1 PC2 PC3 PC4 PC5
#> popularity 0.23639321 -0.29846005 0.09943228 0.422119586 -0.4736815401
#> acousticness -0.42022853 0.18772369 -0.20855119 0.006879475 0.0282681519
#> danceability 0.33424065 0.06019282 -0.45070673 0.239688252 0.2376309068
#> duration_ms -0.06002993 -0.03275554 0.59352497 0.418326167 0.6447009218
#> energy 0.44628732 0.09692738 0.24698541 -0.097178416 0.0002462744
#> instrumentalness -0.32180898 -0.18288404 0.06985215 -0.132956545 0.1569641273
#> liveness 0.02965112 0.61932242 0.25373682 -0.056658396 -0.1249026942
#> loudness 0.46703496 -0.02465578 0.15341481 0.011412322 -0.0633230456
#> speechiness 0.02970675 0.64556159 0.02852151 0.068078940 -0.0954483000
#> tempo 0.15717921 -0.14937864 0.25897753 -0.733221667 0.0571792098
#> valence 0.32385768 0.07008504 -0.41173869 -0.128880686 0.4960754526
#> PC6 PC7 PC8 PC9 PC10
#> popularity 0.32083437 -0.42232755 0.31413684 0.24820465 0.01528871
#> acousticness 0.30090066 0.04694641 0.15387384 0.26137273 -0.69955892
#> danceability 0.22268142 -0.29141242 -0.18169713 -0.58512988 -0.14978489
#> duration_ms 0.22709391 0.01628894 0.01447981 0.01538797 -0.01145835
#> energy -0.35002606 -0.03904784 -0.11993763 0.22449855 -0.14230425
#> instrumentalness -0.44386715 -0.75377712 -0.07019165 -0.01147147 -0.17749694
#> liveness -0.11179760 -0.13086640 0.60551414 -0.36598616 -0.01034614
#> loudness -0.15515093 0.14147365 -0.11624804 0.01293022 -0.62277521
#> speechiness 0.21165146 -0.28954981 -0.51727247 0.33736400 0.15979238
#> tempo 0.55013273 -0.18408344 -0.01084837 -0.08771153 -0.02238735
#> valence -0.04261779 -0.09837511 0.42101780 0.47043909 0.14856079
#> PC11
#> popularity -0.0278823444
#> acousticness -0.2640209720
#> danceability -0.1878030923
#> duration_ms -0.0008992083
#> energy -0.7154782335
#> instrumentalness 0.1184748572
#> liveness 0.0451919382
#> loudness 0.5549575890
#> speechiness 0.1795962710
#> tempo -0.0135260826
#> valence 0.1607522457
eigen(cov(music_z))
#> eigen() decomposition
#> $values
#> [1] 3.6104585 1.7100322 1.1712478 0.9998348 0.8617223 0.7567566 0.6378557
#> [8] 0.4853785 0.3751915 0.2767450 0.1147772
#>
#> $vectors
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.23639321 -0.29846005 -0.09943228 -0.422119586 0.4736815401
#> [2,] -0.42022853 0.18772369 0.20855119 -0.006879475 -0.0282681519
#> [3,] 0.33424065 0.06019282 0.45070673 -0.239688252 -0.2376309068
#> [4,] -0.06002993 -0.03275554 -0.59352497 -0.418326167 -0.6447009218
#> [5,] 0.44628732 0.09692738 -0.24698541 0.097178416 -0.0002462744
#> [6,] -0.32180898 -0.18288404 -0.06985215 0.132956545 -0.1569641273
#> [7,] 0.02965112 0.61932242 -0.25373682 0.056658396 0.1249026942
#> [8,] 0.46703496 -0.02465578 -0.15341481 -0.011412322 0.0633230456
#> [9,] 0.02970675 0.64556159 -0.02852151 -0.068078940 0.0954483000
#> [10,] 0.15717921 -0.14937864 -0.25897753 0.733221667 -0.0571792098
#> [11,] 0.32385768 0.07008504 0.41173869 0.128880686 -0.4960754526
#> [,6] [,7] [,8] [,9] [,10] [,11]
#> [1,] 0.32083437 0.42232755 0.31413684 -0.24820465 -0.01528871 -0.0278823444
#> [2,] 0.30090066 -0.04694641 0.15387384 -0.26137273 0.69955892 -0.2640209720
#> [3,] 0.22268142 0.29141242 -0.18169713 0.58512988 0.14978489 -0.1878030923
#> [4,] 0.22709391 -0.01628894 0.01447981 -0.01538797 0.01145835 -0.0008992083
#> [5,] -0.35002606 0.03904784 -0.11993763 -0.22449855 0.14230425 -0.7154782335
#> [6,] -0.44386715 0.75377712 -0.07019165 0.01147147 0.17749694 0.1184748572
#> [7,] -0.11179760 0.13086640 0.60551414 0.36598616 0.01034614 0.0451919382
#> [8,] -0.15515093 -0.14147365 -0.11624804 -0.01293022 0.62277521 0.5549575890
#> [9,] 0.21165146 0.28954981 -0.51727247 -0.33736400 -0.15979238 0.1795962710
#> [10,] 0.55013273 0.18408344 -0.01084837 0.08771153 0.02238735 -0.0135260826
#> [11,] -0.04261779 0.09837511 0.42101780 -0.47043909 -0.14856079 0.1607522457
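Comparing the two outputs above, the columns of the `prcomp()` rotation matrix and the eigenvectors of the covariance matrix of the scaled data agree up to sign (PC3, for example, is flipped), and the eigenvalues are the squared standard deviations of the components. The sketch below checks this relationship on R's built-in `USArrests` data, which is used here only as a stand-in for the Spotify data.

```r
# Scale a small built-in dataset (stand-in for music_z).
z <- scale(USArrests)

pca <- prcomp(z)
eig <- eigen(cov(z))

# Eigenvalues of the covariance matrix equal the component variances.
same_values <- isTRUE(all.equal(pca$sdev^2, eig$values))

# Eigenvectors equal the rotation columns, except that each column
# may be multiplied by -1, so compare absolute values.
same_vectors <- isTRUE(all.equal(abs(unname(pca$rotation)), abs(eig$vectors)))

c(same_values = same_values, same_vectors = same_vectors)
```

The sign of a component is arbitrary, so a flipped column does not change the interpretation.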
pca_music <- prcomp(x = music_ready, scale. = TRUE)
head(pca_music$x)
#> PC1 PC2 PC3 PC4 PC5 PC6
#> [1,] 0.9904481 0.9993903 -0.1597105 -3.0962912 0.71325308 -0.74345195
#> [2,] 1.2096105 0.2730414 -0.6842276 -2.7298245 1.28269932 -0.12134335
#> [3,] -2.1123978 0.3531611 -1.8668172 -0.2666744 0.70972338 0.39952609
#> [4,] -1.9545010 -0.1868432 0.2513141 -2.6613814 -0.02211554 0.60691571
#> [5,] -2.9355815 0.3915754 -1.1231315 -2.0937683 -0.02518811 0.41551544
#> [6,] -2.2612410 0.7007202 -1.7307093 -0.1446502 0.52331036 0.03778243
#> PC7 PC8 PC9 PC10 PC11
#> [1,] 1.3200974 0.4246202 0.576058781 -1.16234095 -0.04739096
#> [2,] 0.9083572 -0.4478824 -0.057434104 -0.09104371 0.13059333
#> [3,] 1.4628733 -0.4420655 -0.872496223 -0.59542626 0.09975048
#> [4,] 1.7930776 -0.5882799 -0.041750169 -0.17564117 0.22897820
#> [5,] 1.1238702 0.2555989 -0.007753993 0.30349575 -0.41629524
#> [6,] 1.5234275 -0.7598055 -0.626642226 0.09946476 0.44224934
summary(pca_music)
#> Importance of components:
#> PC1 PC2 PC3 PC4 PC5 PC6 PC7
#> Standard deviation 1.9001 1.3077 1.0822 0.99992 0.92829 0.8699 0.79866
#> Proportion of Variance 0.3282 0.1555 0.1065 0.09089 0.07834 0.0688 0.05799
#> Cumulative Proportion 0.3282 0.4837 0.5902 0.68105 0.75939 0.8282 0.88617
#> PC8 PC9 PC10 PC11
#> Standard deviation 0.69669 0.61253 0.52607 0.33879
#> Proportion of Variance 0.04413 0.03411 0.02516 0.01043
#> Cumulative Proportion 0.93030 0.96441 0.98957 1.00000
From the summary above, if we want to retain more than 85% of the variance in the data, we need to keep the first 7 principal components (cumulative proportion 0.886).
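The number of components can also be computed rather than read off the table. The sketch below takes the standard deviations printed in the summary above, with `n_components()` being a hypothetical helper name introduced for illustration.

```r
# Standard deviations copied from the summary(pca_music) output above.
sdev <- c(1.9001, 1.3077, 1.0822, 0.99992, 0.92829, 0.8699,
          0.79866, 0.69669, 0.61253, 0.52607, 0.33879)

# Cumulative proportion of variance explained.
cum_var <- cumsum(sdev^2) / sum(sdev^2)

# Hypothetical helper: first component count that reaches the threshold.
n_components <- function(cum_var, threshold) which(cum_var >= threshold)[1]

round(cum_var, 4)
n_components(cum_var, 0.85)
```

With the fitted object at hand, `pca_music$sdev` could be used in place of the typed-in vector.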
GGally::ggcorr(pca_music$x, label = T)
The plot above shows no correlation between any pair of dimensions. This is what we expect from a PCA, since principal components are uncorrelated by construction.
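The same check can be done numerically instead of visually. The sketch below again uses the built-in `USArrests` data as a stand-in and looks at the largest absolute correlation between any two different components.

```r
# PCA on a small built-in dataset (stand-in for music_ready).
pca <- prcomp(USArrests, scale. = TRUE)

# Correlation matrix of the component scores.
score_cor <- cor(pca$x)

# Largest absolute off-diagonal correlation, expected to be ~0.
max_off_diag <- max(abs(score_cor[upper.tri(score_cor)]))
max_off_diag
```

Any value that differs from zero here is floating-point noise.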
library(ggfortify)
ggplot2::autoplot(pca_music, label = FALSE, loadings.label = TRUE)
Since Dimension 1 carries more information than any other dimension, we want to know which 3 variables contribute most to it, using the plot below.
fviz_contrib(X = pca_music,
             choice = "var", # show contributions by variable
             axes = 1)
From the plot above, we can see that loudness, energy, and acousticness contribute more to Dimension 1 than any other variable.
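The ranking can be cross-checked from the PC1 loadings printed earlier. The sketch below assumes that a variable's contribution to a component is its squared loading as a percentage of the component total, which is my understanding of what the contribution plot displays for a `prcomp` object.

```r
# PC1 loadings copied from the prcomp(music_z)$rotation output above.
pc1 <- c(popularity = 0.23639321, acousticness = -0.42022853,
         danceability = 0.33424065, duration_ms = -0.06002993,
         energy = 0.44628732, instrumentalness = -0.32180898,
         liveness = 0.02965112, loudness = 0.46703496,
         speechiness = 0.02970675, tempo = 0.15717921,
         valence = 0.32385768)

# Assumed definition: squared loading as a percentage of the total.
contrib <- 100 * pc1^2 / sum(pc1^2)

# Three largest contributors to PC1.
top3 <- head(sort(contrib, decreasing = TRUE), 3)
round(top3, 1)
```

The sign of a loading does not matter for its contribution, which is why acousticness ranks high despite its negative loading.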
Next, we cluster the data so that we can extract insights from it. The code below runs k-means on the scaled data (music_z).
RNGkind(sample.kind = "Rounding")
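kmeansTunning <- function(data, maxK) {
  withinall <- NULL  # total within-cluster sum of squares for each k
  total_k <- NULL    # the values of k that were tried
  for (i in 2:maxK) {
    set.seed(101)  # same seed for every k so the runs are reproducible
    temp <- kmeans(data, i)$tot.withinss
    withinall <- append(withinall, temp)
    total_k <- append(total_k, i)
  }
  # Elbow plot: look for the k after which the curve flattens
  plot(x = total_k, y = withinall, type = "o", xlab = "Number of Cluster", ylab = "Total within")
}
# kmeansTunning(your_data, maxK = 5)
kmeansTunning(music_z, maxK = 5)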
From the plot above, 3 clusters is a reasonable choice, because beyond this point the decrease in the total within-cluster sum of squares is no longer significant.
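The elbow can also be put into numbers by looking at how much the total within-cluster sum of squares falls each time one more cluster is added. The sketch below uses made-up toy data (three well-separated 2-D blobs), not the Spotify data. The `nstart = 25` setting is an addition that is not part of the analysis above.

```r
set.seed(101)

# Toy data: three well-separated blobs of 50 points each.
centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, centers[i, 1]), rnorm(50, centers[i, 2]))
}))

# Total within-cluster sum of squares for k = 2..5.
ks <- 2:5
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)
names(wss) <- ks

# Drop in WSS when moving from k to k + 1, absolute and relative.
drops <- -diff(wss)
rel_drops <- drops / wss[-length(wss)]
round(rel_drops, 3)
```

On data with three real groups, the drop from 2 to 3 clusters is large and every later drop is small, which is the pattern the elbow plot shows.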
set.seed(101)
music_km <- kmeans(x = music_z,centers = 3)
music_yes$cluster <- music_km$cluster
head(music_yes)
music_yes %>%
  group_by(cluster) %>%
  select(-c(genre, artist_name, track_name, track_id, key, mode, time_signature)) %>%
  summarise_all("mean")
From the table above, we can draw the following insights:
popularity : Cluster 2 has the highest average
acousticness : Cluster 3 has the highest average
danceability : Cluster 2 has the highest average
duration_ms : Cluster 3 has the highest average
energy : Cluster 2 has the highest average
instrumentalness : Cluster 3 has the highest average
liveness : Cluster 1 has the highest average
loudness : Cluster 2 has the highest average
speechiness : Cluster 1 has the highest average
tempo : Cluster 2 has the highest average
valence : Cluster 2 has the highest average
Now, if we are big fans of Henri Salvador and want to know the character of his music, we can read it from our clustering.
music_yes %>%
  filter(artist_name == "Henri Salvador",
         cluster == 1)
music_yes %>%
  filter(artist_name == "Henri Salvador",
         cluster == 2)
music_yes %>%
  filter(artist_name == "Henri Salvador",
         cluster == 3)
From the information above, we know that most of Henri Salvador's music falls in clusters 3 and 2, which means that his tracks tend to have, for example, high popularity and danceability (cluster 2) or high acousticness (cluster 3).
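Instead of filtering once per cluster, the distribution of an artist's tracks across clusters can be counted in one step. The sketch below uses a made-up toy table with hypothetical artist names. On the real data, the same pipeline would start from `music_yes` and filter on "Henri Salvador".

```r
library(dplyr)

# Toy table standing in for music_yes (artist names are hypothetical).
toy <- data.frame(
  artist_name = c("Artist A", "Artist A", "Artist A", "Artist B", "Artist B"),
  cluster     = c(3, 3, 2, 1, 1)
)

# Number of tracks per cluster for one artist, largest group first.
artist_clusters <- toy %>%
  filter(artist_name == "Artist A") %>%
  count(cluster, sort = TRUE)

artist_clusters
```

The first row of the result is the cluster that holds most of the artist's tracks.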