Spotify
Spotify is a digital music streaming service that offers millions of songs and podcasts from across the world. The firm, which was founded in Sweden in 2006, has swiftly grown to become one of the most popular music platforms, with over 365 million monthly active users and 165 million subscribers in over 170 countries. Spotify has altered the way people listen to music with its easy user interface and tailored suggestions, and has emerged as a prominent competitor in the highly competitive music market.
In the big data age, Spotify learns about its listeners by asking questions when they first log in (for example, which music genres they prefer) and uses machine learning algorithms to recommend songs daily and weekly. Listening history is also collected to identify each user's preferences.
The method by which Spotify categorizes music into broad categories is of interest. What are the distinguishing features of each genre, and how are they utilized to classify them? We’ll go further into the challenges in this project.
Based on the data, each song is allocated 12 audio features, 6 broad genres, and 24 subgenres. These 14 variables will be the focus of the next sections.
The goal of completing this project is to:
To achieve such objectives, we will:
The data set is publicly accessible here
What is unsupervised learning ?
In this project, we are going to use K-Means as a simple model of unsupervised learning.
The k-means algorithm partitions data into K groups, where K is chosen by the user. It is a flat clustering algorithm, meaning that every group sits at the same level as the others and there is no hierarchy between groups.
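To make the idea of "K groups defined by the user" concrete, here is a minimal sketch using base R's kmeans() on made-up data. The toy matrix and the object names (toy, toy_km) are assumptions for illustration only and are not part of the Spotify analysis.

```r
# Toy data: two well-separated groups in two dimensions (made up for illustration)
set.seed(42)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)

# Ask for K = 2 groups; nstart = 10 repeats the random initialisation 10 times
toy_km <- kmeans(toy, centers = 2, nstart = 10)

toy_km$size     # number of points in each cluster
toy_km$centers  # one centroid (row) per cluster
```

Both clusters are plain labels (1 and 2) with no ordering or nesting between them, which is what "flat" clustering means.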
Workflow K-means :
library(dplyr) # function data wrangling
library(tidyr)
library(factoextra) # for fviz_contrib(), visualization fviz_pca_biplot()
library(FactoMineR) # for PCA()
library(ggiraphExtra) # for visualization profiling
library(GGally) # for ggcorr()
library(ggplot2) # Visual Graph
library(plotly) # Visual Graph
spotify <- read.csv("data_input/SpotifyFeatures.csv", stringsAsFactors = T,encoding = "UTF-8")
head(spotify)
dim(spotify)
#> [1] 232725 18
Insight :
Each row is a track_name, grouped by genre, with its own popularity, acousticness, danceability, duration_ms, and so on through valence.
glimpse(spotify)
#> Rows: 232,725
#> Columns: 18
#> $ genre <fct> Movie, Movie, Movie, Movie, Movie, Movie, Movie, Movi…
#> $ artist_name <fct> "Henri Salvador", "Martin & les fées", "Joseph Willia…
#> $ track_name <fct> "C'est beau de faire un Show", "Perdu d'avance (par G…
#> $ track_id <fct> 0BRjO6ga9RKCKjfDqeFgWV, 0BjC1NfoEOOusryehmNudP, 0CoSD…
#> $ popularity <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
#> $ acousticness <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,…
#> $ danceability <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41…
#> $ duration_ms <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
#> $ energy <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270…
#> $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.123…
#> $ key <fct> C#, F#, C, C#, F, C#, C#, F#, C, G, E, C, F#, D#, G, …
#> $ liveness <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105…
#> $ loudness <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -…
#> $ mode <fct> Major, Minor, Minor, Major, Major, Major, Major, Majo…
#> $ speechiness <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953…
#> $ tempo <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8…
#> $ time_signature <fct> 4/4, 4/4, 5/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4…
#> $ valence <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533…
Description of variables/columns :
genre: the genre of a song track.
artist_name: singer name.
track_name: track title.
track_id: id number of the track.
popularity: the popularity rating of the track.
acousticness: a measure of confidence that the song is acoustic, ranges from 0-1.
danceability: describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythmic stability, beat strength, and overall regularity. A value of 0 is the least danceable and 1 is the most danceable.
duration_ms: song duration in milliseconds.
energy: represents a perceived measure of intensity and activity with a range of 0-1. Typically, energetic tracks feel fast, loud, and noisy. Values closer to 1 indicate a faster, louder track, while values closer to 0 indicate a slower, softer one.
instrumentalness: detects whether a song contains no vocals. The sounds "Ooh" and "aah" are treated as instrumental in this context, while rap or spoken-word tracks count as vocal. The closer the instrumentalness value is to 1, the more likely the song contains no vocal content. Values above 0.5 are meant to represent an instrumental track, but confidence is higher as values get closer to 1.
key: the estimated overall key of the track. Value-to-pitch mapping uses standard Pitch Class notation, for example 0 = C, 1 = C♯ / D♭, 2 = D, and so on. If no key is detected, the value is -1.
liveness: detects the presence of an audience in the recording. A higher liveness value indicates an increased probability that the song was performed live. Values above 0.8 give a strong possibility that the track is live.
loudness: the overall loudness of the track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks.
mode: indicates the modality (major or minor) of a track, the type of scale from which the melodic content originates. Major is represented by 1 and minor is 0.
speechiness: detects the presence of spoken words on the track. The more exclusively the recording resembles speech (e.g. talk shows, audiobooks, poetry), the closer to 1 the attribute value is. Values above 0.66 describe tracks that may be made up entirely of spoken words. A value between 0.33 and 0.66 describes a track that may contain both music and speech, either in sections or in layers, including cases such as rap music. Values below 0.33 most likely represent music and other non-speech tracks.
tempo: the approximate tempo of the entire track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a particular piece and is derived directly from the average beat duration.
time_signature: approximation of the overall time signature of a track. The time signature (meter) is a notational convention for determining how many beats are in each bar (or measure).
valence: a measure from 0-1 that describes the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, excited), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
Because this project is based on popularity, here is a summary of the popularity variable:
range(spotify$popularity)
#> [1] 0 100
summary(spotify$popularity)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.00 29.00 43.00 41.13 55.00 100.00
Insight :
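In this case, we are going to classify the tracks into 4 classes based on popularity.
spotify <- spotify %>%
mutate(popularity.class = as.numeric(case_when(
((popularity > 0) & (popularity < 20)) ~ "1",
((popularity >= 20) & (popularity < 40))~ "2",
((popularity >= 40) & (popularity < 60)) ~ "3",
# Everything else becomes class 4. Note that popularity == 0 fails `popularity > 0`
# above, so zero-popularity tracks also land here; use `popularity >= 0` in the
# first condition if they should belong to class 1 instead.
TRUE ~ "4"))
)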
table(spotify$popularity.class)
#>
#> 1 2 3 4
#> 24138 70473 95348 42766
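As an alternative way to express the same kind of binning, base R's cut() maps a numeric vector onto labelled intervals in one call. The sketch below uses a made-up popularity vector and assumes the intended bins are [0, 20), [20, 40), [40, 60) and [60, 100]. Unlike the case_when() call above, where a popularity of exactly 0 does not satisfy `popularity > 0` and therefore falls through to class 4, this version places 0 in class 1.

```r
# Made-up popularity scores covering the bin edges
popularity <- c(0, 5, 19, 20, 39, 40, 59, 60, 100)

popularity_class <- cut(
  popularity,
  breaks = c(0, 20, 40, 60, 100),
  labels = 1:4,
  right = FALSE,          # intervals are closed on the left: [lower, upper)
  include.lowest = TRUE   # makes the last interval [60, 100] so that 100 is kept
)

table(popularity_class)
```

Whether zero-popularity tracks belong in the lowest class is a modelling decision; the point of the sketch is only that the bin edges become explicit and easy to audit.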
colSums(is.na(spotify))
#> genre artist_name track_name track_id
#> 0 0 0 0
#> popularity acousticness danceability duration_ms
#> 0 0 0 0
#> energy instrumentalness key liveness
#> 0 0 0 0
#> loudness mode speechiness tempo
#> 0 0 0 0
#> time_signature valence popularity.class
#> 0 0 0
Insight :
ggcorr(spotify,label=T)
Insight :
We are going to draw insights from the popular-music subset (popularity class 4).
spotify_popular <- spotify %>%
filter(popularity.class == 4)
head(spotify_popular)
What genres do streamers listen to the most?
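spotify_genre_rank <- spotify_popular %>%
group_by(genre) %>%
count() %>%
rename(total = n) %>%
ungroup()
spotify_genre_rank <- spotify_genre_rank %>%
arrange(desc(total)) %>% # sort by count first so that head(20) really returns the top 20
head(20) %>%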
ggplot(mapping = aes(x = total, y = reorder(genre, total))) +
geom_col(aes(fill=total))+
scale_fill_gradient(low ='#90e0ef', high ='#415a77')+
geom_label(aes(label=total),color="black",size=2,nudge_x= 0.8)+
theme(legend.position = 'none')+
labs(title = "Top 20 Genres Spotify Streaming",
subtitle = 'Across All Categories',
x = 'Total Frequent',
y ='Genre',
fill='count',
caption = 'Source Spotify Track DB')
spotify_genre_rank
Insight :
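Which artists have the most songs on Spotify?
spotify_artist_rank <- spotify_popular %>%
group_by(artist_name) %>%
count() %>%
rename(total = n) %>%
ungroup()
spotify_artist_rank <- spotify_artist_rank %>%
arrange(desc(total)) %>% # sort by count first so that head(20) really returns the top 20
head(20) %>%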
ggplot(mapping = aes(x = total, y = reorder(artist_name, total))) +
geom_col(aes(fill=total))+
scale_fill_gradient(low ='#90e0ef', high ='#415a77')+
geom_label(aes(label=total),color="black",size=2,nudge_x= 0.8)+
theme(legend.position = 'none')+
labs(title = "Top 20 Artist who have the most songs on spotify",
subtitle = 'Across All Categories',
x = 'Total Frequent',
y ='Artist Name',
fill='count',
caption = 'Source Spotify Track DB')
spotify_artist_rank
Insight :
How are genres distributed across the popularity classes?
spotify_genre_data <- spotify %>%
mutate(popularity.class = as.factor(popularity.class)) %>% # convert before grouping
group_by(genre,popularity.class) %>%
count() %>%
rename(total = n) %>%
ggplot(mapping = aes(fill=genre, y=total, x=popularity.class)) + # the piped data frame is the plot data
geom_bar(position="dodge", stat="identity")
ggplotly(spotify_genre_data)
Insight :
spotify_clean <- spotify %>%
select(-track_id) %>% # Drop out column track id
mutate_at(vars(genre,artist_name,track_name,time_signature,mode,key),as.factor)
summary(spotify_clean)
#> genre artist_name track_name
#> Comedy : 9681 Giuseppe Verdi : 1394 Home : 100
#> Soundtrack: 9646 Giacomo Puccini : 1137 You : 71
#> Indie : 9543 Kimbo Children's Music : 971 Intro : 69
#> Jazz : 9441 Nobuo Uematsu : 825 Stay : 63
#> Pop : 9386 Richard Wagner : 804 Wake Up: 59
#> Electronic: 9377 Wolfgang Amadeus Mozart: 800 Closer : 58
#> (Other) :175651 (Other) :226794 (Other):232305
#> popularity acousticness danceability duration_ms
#> Min. : 0.00 Min. :0.0000 Min. :0.0569 Min. : 15387
#> 1st Qu.: 29.00 1st Qu.:0.0376 1st Qu.:0.4350 1st Qu.: 182857
#> Median : 43.00 Median :0.2320 Median :0.5710 Median : 220427
#> Mean : 41.13 Mean :0.3686 Mean :0.5544 Mean : 235122
#> 3rd Qu.: 55.00 3rd Qu.:0.7220 3rd Qu.:0.6920 3rd Qu.: 265768
#> Max. :100.00 Max. :0.9960 Max. :0.9890 Max. :5552917
#>
#> energy instrumentalness key liveness
#> Min. :0.0000203 Min. :0.0000000 C :27583 Min. :0.00967
#> 1st Qu.:0.3850000 1st Qu.:0.0000000 G :26390 1st Qu.:0.09740
#> Median :0.6050000 Median :0.0000443 D :24077 Median :0.12800
#> Mean :0.5709577 Mean :0.1483012 C# :23201 Mean :0.21501
#> 3rd Qu.:0.7870000 3rd Qu.:0.0358000 A :22671 3rd Qu.:0.26400
#> Max. :0.9990000 Max. :0.9990000 F :20279 Max. :1.00000
#> (Other):88524
#> loudness mode speechiness tempo
#> Min. :-52.457 Major:151744 Min. :0.0222 Min. : 30.38
#> 1st Qu.:-11.771 Minor: 80981 1st Qu.:0.0367 1st Qu.: 92.96
#> Median : -7.762 Median :0.0501 Median :115.78
#> Mean : -9.570 Mean :0.1208 Mean :117.67
#> 3rd Qu.: -5.501 3rd Qu.:0.1050 3rd Qu.:139.05
#> Max. : 3.744 Max. :0.9670 Max. :242.90
#>
#> time_signature valence popularity.class
#> 0/4: 8 Min. :0.0000 Min. :1.000
#> 1/4: 2608 1st Qu.:0.2370 1st Qu.:2.000
#> 3/4: 24111 Median :0.4440 Median :3.000
#> 4/4:200760 Mean :0.4549 Mean :2.674
#> 5/4: 5238 3rd Qu.:0.6600 3rd Qu.:3.000
#> Max. :1.0000 Max. :4.000
#>
Insight :
# Change popularity class into factor
spotify_clean <- spotify_clean %>%
mutate(popularity.class = as.factor(popularity.class))
glimpse(spotify_clean)
#> Rows: 232,725
#> Columns: 18
#> $ genre <fct> Movie, Movie, Movie, Movie, Movie, Movie, Movie, Movi…
#> $ artist_name <fct> "Henri Salvador", "Martin & les fées", "Joseph Willia…
#> $ track_name <fct> "C'est beau de faire un Show", "Perdu d'avance (par G…
#> $ popularity <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
#> $ acousticness <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,…
#> $ danceability <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41…
#> $ duration_ms <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
#> $ energy <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270…
#> $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.123…
#> $ key <fct> C#, F#, C, C#, F, C#, C#, F#, C, G, E, C, F#, D#, G, …
#> $ liveness <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105…
#> $ loudness <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -…
#> $ mode <fct> Major, Minor, Minor, Major, Major, Major, Major, Majo…
#> $ speechiness <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953…
#> $ tempo <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8…
#> $ time_signature <fct> 4/4, 4/4, 5/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4…
#> $ valence <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533…
#> $ popularity.class <fct> 4, 1, 1, 4, 1, 4, 1, 1, 4, 1, 4, 1, 1, 1, 4, 4, 4, 1,…
Because the data has a large number of observations (232,725 rows), we take a sample to reduce computation time in later stages, in particular when searching for the optimum K.
RNGkind(sample.kind = "Rounding")
set.seed(100) # Lock random values
reduction1 <- sample(x = nrow(spotify_clean), size = nrow(spotify_clean)*0.0044) # Take a sample of around a thousand observations
spotify_keep <- spotify_clean[reduction1,]
# Names of the numeric (quantitative) columns
quanti <- spotify_keep %>%
select_if(is.numeric) %>%
colnames()
# index numeric columns
quantivar <- which(colnames(spotify_keep) %in% quanti)
# Column's name category (qualitative)
quali <- spotify_keep %>%
select_if(is.factor) %>%
colnames()
# index category columns
qualivar <- which(colnames(spotify_keep) %in% quali)
We choose K optimum by :
In this project, we are going to use the elbow method.
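Conceptually, the elbow method runs k-means for several values of K and looks for the K after which the total within-cluster sum of squares (WSS) stops dropping sharply. The following is a minimal base R sketch of that idea on made-up data; the toy matrix and the names k_values and wss are assumptions for illustration, and it is independent of the fviz_nbclust() call used in this project.

```r
set.seed(100)
# Made-up data with three groups of 30 points each
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 6), ncol = 2),
  matrix(rnorm(60, mean = 12), ncol = 2)
)
toy_scaled <- scale(toy)  # same z-score scaling as applied to the Spotify sample

# Total within-cluster sum of squares for K = 1, ..., 6
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the K where the curve flattens out
plot(k_values, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```

For K = 1 the WSS equals the total sum of squares of the scaled data, so the curve always starts at its maximum; the interesting part is where the decrease levels off.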
spotify_num <- spotify_keep %>%
select_if(is.numeric) %>%
scale()
library(factoextra)
fviz_nbclust(x = spotify_num,
FUNcluster = kmeans,
method = "wss")
Insight :
# Use the PCA individual factor map to check for outliers.
pca_spotify <- PCA(X = spotify_keep, # the sample drawn earlier
scale.unit = T, # scaling
quali.sup = qualivar, # indices of the qualitative columns
graph = F,
ncp = 11) # 11 numeric columns in spotify_keep
plot.PCA(x = pca_spotify, # object returned by PCA() from the FactoMineR library
choix = "ind", # type of plot to display, ind -> individual factor map
invisible = "quali", # hide the labels of the categorical variables
select = "contrib 10", # label the 10 observations with the highest contribution (outlier candidates)
habillage = "popularity.class") # color the observations by a categorical variable (given as column index or name)
spotify_num1 <- spotify_keep %>%
select_if(is.numeric)
spotify_num1[c("130688","71625"),]
Insight :
plot.PCA(x = pca_spotify, # object returned by PCA()
choix = "var")
Insight :
The percentages shown on the Dim 1 (31.53%) and Dim 2 (16.60%) axes indicate how much information each axis summarizes. Together, the two dimensions explain about 48.13% of the information in the original data.
PC1 mostly summarizes these variables: energy, valence, danceability, loudness
PC2 mostly summarizes these variables: speechiness, liveness, tempo, popularity, instrumentalness, acousticness, duration_ms
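The percentage attached to each dimension is that component's variance divided by the total variance, and the share covered by the first two dimensions is simply their sum (adding 31.53% and 16.60% gives 48.13%). The sketch below reproduces that calculation with base R's prcomp() on the built-in iris data, which stands in for the Spotify sample here purely as an assumption for illustration.

```r
# PCA on the four numeric iris columns, scaled to unit variance
pca_toy <- prcomp(iris[, 1:4], scale. = TRUE)

# Variance of each component as a percentage of the total variance
var_explained <- pca_toy$sdev^2 / sum(pca_toy$sdev^2) * 100
round(var_explained, 2)

# Share of the information summarized by the first two dimensions
first_two <- sum(var_explained[1:2])
first_two
```

With FactoMineR, the same percentages should be available in the eig element of the object returned by PCA(), so pca_spotify$eig can be inspected instead of reading the values off the plot axes.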
Pairs of variables that are highly positively correlated:
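# Remove the outliers identified above
outlier <- c(130688,216889,130346,81385,128160,105057,175004,172070,211861)
# Note: these values are row names, not row positions. spotify_num has only about a
# thousand rows, so these negative indices are out of range and R silently ignores
# them: no rows are actually dropped here (the cluster sizes below still sum to 1023).
# Dropping by name would need spotify_num[!rownames(spotify_num) %in% outlier, ],
# and the cluster assignments further down would then need the same filter.
spotify_no <- spotify_num[-outlier,]
# clustering with optimum k
RNGkind(sample.kind = "Rounding")
set.seed(100)
spotify_kmeans <- kmeans(x = spotify_no,
centers = 3)
spotify_kmeans
#> K-means clustering with 3 clusters of sizes 197, 766, 60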
#>
#> Cluster means:
#> popularity acousticness danceability duration_ms energy instrumentalness
#> 1 -0.6834540 1.2567025 -1.06133891 0.2596139 -1.3961322 1.1940948
#> 2 0.2643987 -0.4172683 0.27122789 -0.0502160 0.3292904 -0.2713331
#> 3 -1.1314829 1.2009523 0.02205331 -0.2113080 0.3800266 -0.4565918
#> liveness loudness speechiness tempo valence
#> 1 -0.2010456 -1.4650248 -0.3943022 -0.3381463 -0.9665134
#> 2 -0.1311695 0.4071616 -0.1786706 0.1481276 0.2628519
#> 3 2.3346967 -0.3879322 3.5756535 -0.7808484 -0.1823572
#>
#> Clustering vector:
#> 71625 59967 128539 13122 109042 112584 189062 86181 127194 39623 145447
#> 2 2 1 2 2 2 2 2 1 1 1
#> 205293 65242 92733 177455 155689 47616 83199 83653 160635 124686 165407
#> 1 2 2 2 2 2 2 2 2 1 2
#> 125276 174288 97759 39890 179249 205229 127774 64626 113627 216058 81139
#> 1 3 2 1 1 1 1 2 2 2 1
#> 222025 161785 206967 41979 146452 230259 30317 76940 201300 180931 192499
#> 2 2 1 2 2 2 2 2 1 3 2
#> 140383 114300 181574 205741 48331 71452 76907 46228 54840 63959 137584
#> 2 2 2 1 2 1 2 2 2 1 2
#> 58957 28732 53492 139037 49188 107888 150558 223490 157373 103569 83240
#> 1 2 1 2 2 2 2 2 2 2 2
#> 106031 103630 57023 161545 95910 76247 133209 224975 153964 145336 199300
#> 1 2 2 2 2 2 2 2 2 2 1
#> 180251 194034 21290 106907 139447 213967 228647 8795 134452 170598 57867
#> 2 3 2 1 2 2 1 2 2 3 1
#> 69963 170631 210990 48811 83315 104289 210863 90596 120377 29135 7013
#> 2 3 2 1 2 1 2 2 2 2 2
#> 179543 76165 90603 9550 84069 132822 159317 225901 163289 2686 124574
#> 2 1 2 2 2 2 2 2 2 1 1
#> 194600 187689 18826 55574 224700 8763 213165 168927 46702 195436 92266
#> 1 2 2 2 2 1 2 3 2 2 2
#> 91345 109919 135760 81966 6624 231457 222727 128313 23715 55338 200054
#> 2 2 2 1 2 2 2 1 2 2 1
#> 171715 115671 134885 3765 109813 9860 107724 146528 156588 20243 33396
#> 3 2 2 2 2 2 1 2 2 2 2
#> 211296 28499 169519 221038 10084 4559 46300 117471 215454 32189 39350
#> 3 2 3 2 2 2 2 2 2 2 2
#> 141084 189749 196145 183282 4424 162352 189563 132190 111645 37529 20758
#> 2 2 2 2 1 2 2 2 2 2 2
#> 37834 6216 165031 176981 199411 101689 97007 136195 191848 184446 75847
#> 2 2 2 2 1 2 2 2 2 1 2
#> 222767 151859 106664 140928 67141 166928 214730 156952 43239 80940 28724
#> 2 2 1 2 2 2 2 2 2 1 2
#> 25142 69286 194912 230855 101649 47103 224131 153561 69322 27732 139460
#> 2 2 2 1 2 2 2 2 2 2 2
#> 27948 183331 85941 222369 212428 191444 74287 204083 186097 142152 16840
#> 2 1 2 2 2 2 1 1 2 2 2
#> 98026 80094 174790 50863 67925 82764 146938 207522 172994 106308 8393
#> 2 1 2 2 2 1 2 2 3 1 2
#> 132183 102073 139615 221199 62644 153180 17780 16574 86101 69076 128123
#> 2 1 2 1 2 2 2 2 2 2 1
#> 85998 196617 144329 92797 69669 88607 163006 220134 179672 51043 166485
#> 2 1 2 2 2 2 2 1 2 2 2
#> 154420 164191 65009 165584 153656 9535 14214 65038 70200 222388 90623
#> 2 2 2 2 2 2 2 2 2 2 2
#> 86431 196007 196957 74235 30804 143458 183952 78585 210484 45925 184595
#> 2 2 1 2 2 1 1 2 2 2 1
#> 175416 211863 75012 20031 211815 222036 156011 173175 102284 26742 157054
#> 3 3 2 2 3 2 2 3 2 3 2
#> 170052 112502 39760 157099 61131 79376 48871 3794 87576 130704 158056
#> 3 2 2 2 2 1 2 2 2 1 2
#> 173258 220834 37909 75486 30848 148130 76919 150867 70401 16557 153959
#> 3 1 2 2 2 2 2 2 2 2 2
#> 176554 128616 125341 197467 151826 221008 143457 114547 226844 113966 152274
#> 2 1 2 2 2 1 2 2 2 2 2
#> 139172 220231 85532 204103 105719 115385 107053 143609 140409 182645 129049
#> 2 2 2 1 1 2 1 2 1 3 1
#> 178716 93959 118694 121672 230729 99851 231474 182806 119897 116824 211861
#> 2 2 2 2 2 2 2 1 2 2 3
#> 61449 40404 93035 125110 56928 87345 134845 48559 186219 148449 171113
#> 2 2 2 1 2 2 2 2 2 2 3
#> 101973 134739 59501 107760 39183 142657 222642 111122 174552 4784 39729
#> 2 2 1 1 2 2 2 2 2 2 2
#> 148711 38337 82428 43320 208588 55254 228869 4957 24989 56889 167851
#> 2 2 1 2 2 2 2 2 2 2 2
#> 7607 127648 158658 69764 90335 170189 223642 176729 135155 107550 82790
#> 2 1 2 2 2 3 2 2 2 1 1
#> 89205 47814 32246 90403 61816 163409 94769 61703 93165 45845 192828
#> 2 2 2 2 2 2 2 2 2 2 2
#> 122673 91805 133355 225593 151054 77508 77007 218959 88545 131110 118841
#> 2 2 2 2 2 2 2 2 2 2 2
#> 32262 55710 166813 69053 118774 64519 83789 101648 186570 120949 161730
#> 2 2 2 2 2 2 2 2 2 2 2
#> 196968 196473 91040 35664 148548 66714 215665 36156 223911 293 164364
#> 2 3 2 2 2 2 2 2 2 2 2
#> 146397 179591 207358 118755 173996 215200 21315 115374 45880 230936 8554
#> 2 2 2 2 3 2 2 2 2 2 2
#> 50540 206961 27629 85726 37719 38721 225904 188247 221447 92895 197966
#> 1 1 2 2 1 2 2 2 2 2 1
#> 14988 70273 113929 150115 118611 185188 131240 82116 153094 55578 61602
#> 2 2 2 2 2 2 2 1 1 2 2
#> 120381 171397 181082 26641 142236 211583 147263 63796 81952 155811 210597
#> 2 3 2 2 2 3 2 2 1 2 2
#> 169607 213627 88825 198383 12713 58991 85269 83745 65152 90276 80856
#> 3 2 2 1 2 1 2 2 2 2 1
#> 64781 186054 231092 174940 140126 137943 135235 55428 105505 231202 19218
#> 2 2 2 3 1 2 2 2 1 2 2
#> 12055 114680 12540 220731 147051 62690 231495 83327 86336 76690 24244
#> 2 2 2 2 1 2 2 1 2 1 2
#> 134671 77354 878 230973 85746 51144 172580 71051 93402 91396 102353
#> 2 2 2 2 2 2 3 2 2 2 2
#> 179461 48666 173727 135929 75342 87092 6256 112237 225264 92 24546
#> 2 2 3 2 2 2 2 2 2 2 1
#> 102159 72518 225550 135903 170422 154073 167879 25778 50755 144560 222743
#> 2 3 2 2 3 2 3 2 2 2 2
#> 188618 5147 158483 192904 187568 195470 43575 22046 104210 139425 200511
#> 2 2 2 2 2 1 2 2 1 2 1
#> 197697 162015 149521 152010 204403 198965 232493 75812 127002 170506 186971
#> 1 2 2 2 1 1 2 2 1 3 2
#> 47931 204351 91959 228227 130346 210954 166032 54708 907 111296 202479
#> 2 1 2 2 1 1 2 2 2 2 1
#> 60903 154018 204221 211616 90617 200853 119473 172075 222918 132262 72223
#> 2 2 1 3 2 1 2 3 2 2 1
#> 172560 104579 110015 110136 171955 24885 149081 98009 185835 143184 57890
#> 3 1 2 2 3 2 2 2 2 2 1
#> 158141 80725 206945 152571 217900 31148 225743 81422 55016 142908 173434
#> 2 1 1 2 2 2 2 1 2 2 3
#> 223663 216505 183890 72363 21684 48119 7374 134564 35795 29079 34351
#> 2 2 2 3 2 2 2 2 2 1 2
#> 212006 59755 125916 150650 34631 197052 68127 128160 202666 150173 92191
#> 2 2 2 2 1 2 2 1 1 2 2
#> 6855 82830 162138 69504 186815 179714 49006 87541 72271 208327 96661
#> 2 1 2 2 2 2 2 2 2 2 2
#> 194008 74606 31890 143369 156761 63230 43218 76666 28274 54665 216503
#> 2 2 2 2 2 2 2 1 1 2 2
#> 181618 104849 25716 206248 188820 45671 139301 9274 109951 180317 134566
#> 2 2 2 1 2 2 2 2 2 1 2
#> 74727 230541 115172 158146 171252 210631 102455 173617 175100 103354 85024
#> 1 2 2 2 3 1 2 3 3 2 2
#> 18356 47619 205376 212783 143313 84944 156846 95242 155324 211206 96274
#> 2 2 1 2 2 2 2 2 2 2 1
#> 35494 87594 108088 56684 27433 1922 151033 13126 30192 222924 60233
#> 2 2 2 2 1 2 2 2 2 2 2
#> 2104 230388 151501 168110 165742 191437 34399 156626 196505 230314 70560
#> 2 2 2 3 2 2 2 2 1 2 2
#> 33737 120433 193085 213043 80770 77193 194639 103539 161163 209473 6839
#> 2 2 1 2 1 2 1 1 2 1 2
#> 109470 94871 78177 153892 202583 20130 77643 216643 213126 159335 88987
#> 2 2 2 2 1 2 2 2 2 2 2
#> 2084 167033 111579 864 132734 18713 88137 231893 22613 186579 19542
#> 1 2 2 1 2 2 2 2 2 2 1
#> 203506 227948 141485 158879 114744 64028 232007 154142 127196 138274 71533
#> 1 2 2 2 2 2 2 2 2 2 1
#> 208161 212537 225419 79427 215381 213108 160404 178608 139698 178467 22730
#> 2 2 2 1 2 2 2 2 2 2 2
#> 53762 163969 6686 146409 155923 105057 156789 2179 14726 64774 108535
#> 2 2 1 2 2 1 2 2 2 2 2
#> 180112 85114 208935 120947 67179 56734 136656 88320 163627 26917 138816
#> 2 2 2 2 2 1 2 2 2 2 2
#> 196251 81385 73336 180969 59099 194885 174285 71622 106243 96234 75386
#> 3 1 1 2 1 2 3 2 1 2 2
#> 31456 211094 34881 194349 126239 26894 10259 65568 204397 194118 13397
#> 2 1 2 2 1 2 2 2 1 1 2
#> 11301 73439 30484 8260 28026 131708 29534 175813 23060 53328 102888
#> 2 1 2 2 2 2 2 3 2 2 2
#> 119604 214503 10311 34593 196707 35474 191324 200188 151755 5105 231435
#> 2 2 2 1 2 1 2 2 2 2 2
#> 34242 19146 184993 79792 168969 79092 199631 94437 131851 209072 216889
#> 1 2 2 1 3 2 1 2 2 2 1
#> 196466 96012 80570 105954 209582 83547 134302 145672 165452 5116 115228
#> 2 2 1 1 1 2 2 1 2 2 2
#> 102148 129334 40724 130688 141370 148177 183881 80335 77334 228975 166130
#> 2 1 2 1 2 2 1 1 2 2 2
#> 95311 132227 24336 201098 110808 64593 107362 9527 202735 172358 16842
#> 1 2 2 1 2 2 1 2 1 3 2
#> 98768 119591 171280 62145 112987 223143 195201 62164 60455 211776 138908
#> 2 2 3 2 2 2 1 2 2 3 2
#> 14762 166325 172070 60910 122887 186491 168745 200641 10757 1874 217794
#> 2 2 3 2 2 2 3 1 2 2 1
#> 56029 208592 118553 161579 221885 12559 30046 178099 199685 168748 209627
#> 2 2 2 2 2 2 2 2 1 3 2
#> 171908 37444 111848 95063 108012 102816 6150 151604 90548 6080 67461
#> 3 2 2 2 2 2 2 2 2 1 2
#> 105 149634 191699 175004 119345 143635 92261 99806 147139 111223 129668
#> 2 2 2 3 2 2 2 2 1 1 1
#> 118165 191595 129837 3079 110247 8448 56755 221302 62501 10945 222926
#> 2 2 1 2 2 2 2 1 2 2 2
#> 9545 131795 113437 189074 121980 193976 16649 172816 78221 143805 166421
#> 2 2 2 2 2 1 2 3 2 1 2
#> 220114 63036 225254 109498 184879 210667 229627 51299 44606 164550 196017
#> 1 1 2 2 2 2 2 2 2 2 1
#> 10142 168810 69346 59726 227251 200565 153925 201381 195478 224048 211667
#> 2 3 2 2 2 1 2 1 2 2 3
#> 65732 106106 201383 119719 61621 201315 159955 137785 214048 34621 186902
#> 2 1 1 2 2 1 2 2 2 2 2
#> 72180 213419 150544 41335 102292 174890 59721 231504 121078 231311 93304
#> 2 2 2 2 2 3 2 2 2 2 2
#> 135410 206673 197777 99472 64604 206681 228799 32349 125231 71975 197458
#> 2 1 1 1 1 2 2 2 1 2 2
#> 10088 86245 119569 177916 66588 189243 1223 5170 56303 192648 17159
#> 2 2 2 1 2 2 2 2 2 2 2
#> 25923 144583 155505 84786 42447 6192 113647 138087 221554 83045 41583
#> 2 2 2 2 2 2 2 2 1 1 1
#> 229977 7114 167552 10294 65398 13417 152431 208767 111671 166397 232057
#> 1 2 3 2 2 2 2 1 2 2 2
#>
#> Within cluster sum of squares by cluster:
#> [1] 2048.1328 4863.1969 411.4773
#> (between_SS / total_SS = 34.9 %)
#>
#> Available components:
#>
#> [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
#> [6] "betweenss" "size" "iter" "ifault"
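The between_SS / total_SS figure in the output measures how much of the total variation is explained by the cluster assignment. It can be checked by hand: the three within-cluster sums of squares add up to 2048.1328 + 4863.1969 + 411.4773 = 7322.807, and, assuming the clustered matrix is the scaled sample of 1,023 rows and 11 columns, the total sum of squares is (1023 - 1) × 11 = 11242, so the ratio is (11242 - 7322.807) / 11242 ≈ 0.349. The sketch below shows the same identity on the built-in iris data, used here only as a stand-in.

```r
set.seed(100)
toy <- scale(iris[, 1:4])
km <- kmeans(toy, centers = 3, nstart = 10)

# Total variation splits into a between-cluster and a within-cluster part
all.equal(km$totss, km$betweenss + km$tot.withinss)

# Share of the variation explained by the clustering, in percent
ratio <- km$betweenss / km$totss
round(100 * ratio, 1)
```

A higher ratio means more compact, better separated clusters, but it always increases with K, which is why it is read together with the elbow plot rather than maximized on its own.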
Insight :
Each cluster has its own songs, as listed here :
the higher k :
# Make a new data set containing only the numeric columns
spotify_qty <- as.data.frame(spotify_num)
# Cluster profiling
spotify_keep$cluster <- as.factor(spotify_kmeans$cluster)
spotify_qty$cluster <- as.factor(spotify_kmeans$cluster)
spotify_qty <- spotify_qty %>%
group_by(cluster) %>%
summarise_all(mean)
spotify_qty
spotify_kmeans$centers
#> popularity acousticness danceability duration_ms energy instrumentalness
#> 1 -0.6834540 1.2567025 -1.06133891 0.2596139 -1.3961322 1.1940948
#> 2 0.2643987 -0.4172683 0.27122789 -0.0502160 0.3292904 -0.2713331
#> 3 -1.1314829 1.2009523 0.02205331 -0.2113080 0.3800266 -0.4565918
#> liveness loudness speechiness tempo valence
#> 1 -0.2010456 -1.4650248 -0.3943022 -0.3381463 -0.9665134
#> 2 -0.1311695 0.4071616 -0.1786706 0.1481276 0.2628519
#> 3 2.3346967 -0.3879322 3.5756535 -0.7808484 -0.1823572
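Because the data were scaled before clustering, each value in the centers table is the cluster's average in standard deviations from the overall mean, so a quick profile is to ask which cluster is lowest and highest on every feature. A minimal base R sketch of that lookup, again on the built-in iris data as a stand-in and with the hypothetical name profile for the result:

```r
set.seed(100)
toy <- scale(iris[, 1:4])
km <- kmeans(toy, centers = 3, nstart = 10)

# For every feature (column of the centers matrix), find the cluster
# with the smallest and the largest average value
profile <- data.frame(
  feature     = colnames(km$centers),
  cluster_min = apply(km$centers, 2, which.min),
  cluster_max = apply(km$centers, 2, which.max),
  row.names   = NULL
)
profile
```

Applied to the table above, the same reading gives, for example, cluster 3 as the highest on speechiness (3.58) and liveness (2.33), and cluster 1 as the lowest on energy (-1.40) and loudness (-1.47).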
spotify_qty %>%
tidyr::pivot_longer(-cluster) %>%
group_by(name) %>% # one group per column (feature)
summarize(cluster_min_val = which.min(value),
cluster_max_val = which.max(value))
library(ggiraphExtra)
library(ggplot2)
ggRadar(data=spotify_qty, aes(colour=cluster), interactive=TRUE)
Insight :
spotify_keep %>%
select(c(artist_name, track_name, cluster))
Insight :
fviz_cluster(object = spotify_kmeans, # kmeans object
data = spotify_no)+ # numeric data used for clustering
labs(title = "Cluster Plot of Spotify Dataset using k=3")+
theme(panel.grid.minor = element_line(linetype = "dashed"),
panel.grid.major = element_line(linetype = "dashed"))
Insight :
After exploring the data and clustering Spotify's songs by their characteristics, we get :
Clustering is grouping data based on its characteristics. Clustering aims to produce clusters where:
K-means is a centroid-based clustering algorithm, meaning that each cluster has a centroid/center point that represents the cluster. K-means is an iterative process consisting of:
Workflow K-means :
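The iterative process can be sketched in a few lines of base R: pick K starting centroids, assign every observation to its nearest centroid, move each centroid to the mean of its members, and repeat until the assignments stop changing. The function simple_kmeans() below is a hypothetical teaching helper written for this sketch, not the algorithm variant that stats::kmeans() runs by default, and the toy data are made up.

```r
# Hypothetical helper for illustration only
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Initialise: use k randomly chosen observations as starting centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Assign: squared Euclidean distance from every point to every centroid
    dist_sq <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-dist_sq, ties.method = "first")
    # Stop once no observation changes cluster
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Update: move each centroid to the mean of its current members
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }
  }
  list(cluster = cluster, centers = centers)
}

# Usage on two well-separated made-up groups of 20 points each
set.seed(100)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
toy_fit <- simple_kmeans(toy, k = 2)
table(toy_fit$cluster)
```

Because the starting centroids are random, results on less clearly separated data can differ between runs, which is why the analysis above fixes the seed before calling kmeans().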
For further projects, you can try hierarchical clustering, fuzzy C-means, or DBSCAN.
Bruce, P., & Bruce, A. (2017). Practical statistics for data scientists: 50 essential concepts. O’Reilly Media, Inc.