Intro

Background

What is Spotify? Spotify is a digital music, podcast, and video streaming service that gives you access to millions of songs and other content from artists all over the world. Today it is one of the biggest digital music and podcast services in the world.

Spotify is also a technology company with very advanced technology. For example, every track uploaded to the platform is analyzed, and we can retrieve audio feature information for each track very easily through the Spotify Web API. In this case we will use a Spotify dataset collected from that API and published on Kaggle.

We will analyze the popularity of each track and look for relationships between popularity and the other features. We will also perform a clustering analysis with the K-means method and reduce dimensionality using Principal Component Analysis (PCA).

Dataset

We will use a dataset from Kaggle, which can be downloaded from this source.

Initial Setup and Libraries

# Core collection of data science packages
library(tidyverse)
# String interpolation
library(glue)
# Date data type processing
library(lubridate)
# Multivariate data analysis
library(factoextra)
# Multivariate data analysis
library(FactoMineR)
# Data visualization
library(ggplot2)
library(viridis)
library(GGally)
library(scales)

Import Data

We now import the dataset downloaded from Kaggle. It contains the audio features of each track.

tracks <- read_csv("data/SpotifyFeatures.csv")

Observe the structure and preview the imported dataset:

glimpse(tracks)
## Observations: 232,725
## Variables: 18
## $ genre            <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie"…
## $ artist_name      <chr> "Henri Salvador", "Martin & les fées", "Joseph Willi…
## $ track_name       <chr> "C'est beau de faire un Show", "Perdu d'avance (par …
## $ track_id         <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", …
## $ popularity       <dbl> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0,…
## $ acousticness     <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900…
## $ danceability     <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.4…
## $ duration_ms      <dbl> 99373, 137373, 170267, 152427, 82625, 160627, 212293…
## $ energy           <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.27…
## $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.12…
## $ key              <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "…
## $ liveness         <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.10…
## $ loudness         <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, …
## $ mode             <chr> "Major", "Minor", "Minor", "Major", "Major", "Major"…
## $ speechiness      <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.95…
## $ tempo            <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, …
## $ time_signature   <chr> "4/4", "4/4", "5/4", "4/4", "4/4", "4/4", "4/4", "4/…
## $ valence          <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.53…

Variable explanation:
1. genre : Track genre
2. artist_name : Artist name
3. track_name : Title of track
4. track_id : The Spotify ID for the track.
5. popularity : Popularity score (0-100)
6. acousticness : A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
7. danceability : Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
8. duration_ms : The duration of the track in milliseconds.
9. energy : Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale.
10. instrumentalness : Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
11. key : The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
12. liveness : Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
13. loudness : The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks.
14. mode : Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
15. speechiness : Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
16. tempo : The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
17. time_signature : An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
18. valence : A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Data Wrangling

First of all, we check for NA or empty values in each variable. We did not find any NA in the data.

colSums(is.na(tracks))
##            genre      artist_name       track_name         track_id 
##                0                0                0                0 
##       popularity     acousticness     danceability      duration_ms 
##                0                0                0                0 
##           energy instrumentalness              key         liveness 
##                0                0                0                0 
##         loudness             mode      speechiness            tempo 
##                0                0                0                0 
##   time_signature          valence 
##                0                0
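
Note that is.na() does not flag empty strings. As a small additional check (not part of the original analysis), we can also count "" values in the character columns:

# Count empty strings in every character column; the is.na() check above only counts NA
tracks %>% 
  summarise_if(is.character, ~ sum(. == "", na.rm = TRUE))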

Some variables have the wrong data type, so we need to convert them:

  • genre : to factor (also removing punctuation from the genre names)
  • key : to factor
  • mode : to factor
tracks <- tracks %>% 
  mutate(genre = as.factor(str_replace_all(genre, "[[:punct:]]", "")),
         key = as.factor(key),
         mode = as.factor(mode))

Next we drop variables that we think are not related to our case; we prioritize variables with a numerical data type. Based on the summary below, we will drop track_id, time_signature, and track_name.

summary(tracks)
##              genre        artist_name         track_name       
##  Childrens Music: 14756   Length:232725      Length:232725     
##  Comedy         :  9681   Class :character   Class :character  
##  Soundtrack     :  9646   Mode  :character   Mode  :character  
##  Indie          :  9543                                        
##  Jazz           :  9441                                        
##  Pop            :  9386                                        
##  (Other)        :170272                                        
##    track_id           popularity      acousticness     danceability   
##  Length:232725      Min.   :  0.00   Min.   :0.0000   Min.   :0.0569  
##  Class :character   1st Qu.: 29.00   1st Qu.:0.0376   1st Qu.:0.4350  
##  Mode  :character   Median : 43.00   Median :0.2320   Median :0.5710  
##                     Mean   : 41.13   Mean   :0.3686   Mean   :0.5544  
##                     3rd Qu.: 55.00   3rd Qu.:0.7220   3rd Qu.:0.6920  
##                     Max.   :100.00   Max.   :0.9960   Max.   :0.9890  
##                                                                       
##   duration_ms          energy          instrumentalness         key       
##  Min.   :  15387   Min.   :0.0000203   Min.   :0.0000000   C      :27583  
##  1st Qu.: 182857   1st Qu.:0.3850000   1st Qu.:0.0000000   G      :26390  
##  Median : 220427   Median :0.6050000   Median :0.0000443   D      :24077  
##  Mean   : 235122   Mean   :0.5709577   Mean   :0.1483012   C#     :23201  
##  3rd Qu.: 265768   3rd Qu.:0.7870000   3rd Qu.:0.0358000   A      :22671  
##  Max.   :5552917   Max.   :0.9990000   Max.   :0.9990000   F      :20279  
##                                                            (Other):88524  
##     liveness          loudness          mode         speechiness    
##  Min.   :0.00967   Min.   :-52.457   Major:151744   Min.   :0.0222  
##  1st Qu.:0.09740   1st Qu.:-11.771   Minor: 80981   1st Qu.:0.0367  
##  Median :0.12800   Median : -7.762                  Median :0.0501  
##  Mean   :0.21501   Mean   : -9.570                  Mean   :0.1208  
##  3rd Qu.:0.26400   3rd Qu.: -5.501                  3rd Qu.:0.1050  
##  Max.   :1.00000   Max.   :  3.744                  Max.   :0.9670  
##                                                                     
##      tempo        time_signature        valence      
##  Min.   : 30.38   Length:232725      Min.   :0.0000  
##  1st Qu.: 92.96   Class :character   1st Qu.:0.2370  
##  Median :115.78   Mode  :character   Median :0.4440  
##  Mean   :117.67                      Mean   :0.4549  
##  3rd Qu.:139.05                      3rd Qu.:0.6600  
##  Max.   :242.90                      Max.   :1.0000  
## 
tracks <- tracks %>% select(-c(track_id,time_signature,track_name))

Exploratory Data Analysis

The dataset contains a genre variable which we can use to group the data. To keep the focus on popularity, we will select the 5 genres with the highest average popularity; these genres span the full range from low to high popularity. First we visualize the data:

genre_popularity <- tracks %>% 
  select(popularity, genre) %>% 
  group_by(genre) %>% 
  summarise(average_popularity = round(mean(popularity)))

ggplot(data = genre_popularity, mapping = aes(x = reorder(genre, average_popularity), y = average_popularity, fill = genre)) +
  geom_col() +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(
    y = "Average popularity",
    x = "Genre"
  )

The top 5 genres with the highest average popularity are Pop, Rap, Rock, HipHop and Dance. We filter our dataset to keep only these 5 genres.
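
As a hedged alternative to reading the genres off the chart, the same five genres can be picked programmatically from genre_popularity (assuming no ties at the cut-off); the filter below could equally use this top_genres vector:

# Take the five genres with the highest average popularity
top_genres <- genre_popularity %>% 
  arrange(desc(average_popularity)) %>% 
  head(5) %>% 
  pull(genre)
top_genres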

# Filter
tracks <- tracks %>% filter(genre %in% c("Pop", "Rap", "Rock", "HipHop", "Dance"))

# Total row
NROW(tracks)
## [1] 45886

With the filtered data, we can check the distribution of popularity again. The distribution has a spike in the middle, around a popularity of 50.

hist(tracks$popularity)
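
As an optional sketch (not part of the original output), the same distribution can be drawn with ggplot2, with a dashed line at the median to highlight the spike around 50:

# ggplot2 version of the popularity histogram, with the median marked
ggplot(tracks, aes(x = popularity)) +
  geom_histogram(binwidth = 5, fill = "steelblue", colour = "white") +
  geom_vline(xintercept = median(tracks$popularity), linetype = "dashed") +
  theme_minimal() +
  labs(x = "Popularity", y = "Count")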

Clustering Opportunity

Before we use K-means for clustering, we can look for simple grouping patterns between some factor variables and popularity. Here we visualize boxplots of popularity by key and by genre.

tracks %>% 
  ggplot(aes(x = key, y = popularity, fill = key)) +
  geom_boxplot() +
  scale_fill_viridis(discrete = TRUE, alpha=0.6) +
  theme_minimal()

tracks %>% 
  ggplot(aes(x = genre, y = popularity, fill = genre)) +
  geom_boxplot() +
  scale_fill_viridis(discrete = TRUE, alpha=0.6) +
  theme_minimal()

From both plots, we can see that in general neither genre nor key has a significant relation to popularity. Even so, tracks whose overall key is A# have slightly higher and more stable popularity than the others.

We also found that the Pop genre has more stable popularity than the other 4 genres. So, as an opinion, if a producer wants more popularity on the Spotify platform, making a Pop track with an overall key of A# could help.
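
As a rough numerical check of this reading (a sketch, not part of the original analysis), we can compare the mean and spread of popularity per key:

# Mean and standard deviation of popularity for each key
tracks %>% 
  group_by(key) %>% 
  summarise(mean_popularity = mean(popularity),
            sd_popularity = sd(popularity)) %>% 
  arrange(desc(mean_popularity))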

So far we have only visualized genre and key, and neither shows a clearly significant effect on popularity. Next we visualize the other numerical variables to see the correlations between them.

ggcorr(tracks, low = "blue", high = "red")
## Warning in ggcorr(tracks, low = "blue", high = "red"): data in column(s)
## 'genre', 'artist_name', 'key', 'mode' are not numeric and were ignored

It shows that popularity does not have a strong correlation with any other numerical variable. However, some variables are strongly correlated with each other, which indicates that this dataset has multicollinearity and might not be suitable for various classification algorithms.
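
For reference (a sketch, not part of the original analysis), the same correlations can be printed as numbers to see exactly which pairs, for example energy with loudness and with acousticness, are strongly related:

# Rounded correlation matrix of the numeric variables
tracks %>% 
  select_if(is.numeric) %>% 
  cor() %>% 
  round(2)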

To find more interesting, undiscovered patterns in the data, we will use K-means clustering. Principal Component Analysis (PCA) will also be performed to produce data without multicollinearity, reducing the dimensionality while retaining as much information as possible. The result of this analysis can be used further for classification with lower computation cost.

Data Pre-processing

Since we will apply the K-means method and PCA, we need to pre-process the data first. We do not use all of the data; instead we take a random sample of around 5% of the rows.

set.seed(100)
tracks_sample <- sample_n(tracks, (nrow(tracks) * 0.05))
NROW(tracks_sample)
## [1] 2294

After sampling, we check the distribution of popularity again. The popularity distribution of the sample still has the same pattern as before sampling.

hist(tracks_sample$popularity)

Next, we perform feature scaling. Feature scaling is a method used to standardize the range of independent variables or features of the data; it is also known as data normalization and is generally performed during the data preprocessing step. Here we use z-score standardization via scale(), which centers each variable on its mean and divides by its standard deviation. Only the numerical variables are scaled, so we select them first.

tracks_num <- tracks_sample %>% select(-c(genre,artist_name,key,mode))
str(tracks_num)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 2294 obs. of  11 variables:
##  $ popularity      : num  48 55 50 65 67 82 67 62 52 66 ...
##  $ acousticness    : num  0.263 0.0409 0.109 0.000491 0.039 0.0776 0.00118 0.154 0.00487 0.73 ...
##  $ danceability    : num  0.629 0.621 0.438 0.522 0.673 0.643 0.528 0.699 0.542 0.425 ...
##  $ duration_ms     : num  227373 199113 197852 232067 229507 ...
##  $ energy          : num  0.787 0.827 0.792 0.751 0.758 0.904 0.858 0.668 0.714 0.406 ...
##  $ instrumentalness: num  0 0 0 0.00000222 0 0 0.00000156 0.0000032 0.298 0.00000359 ...
##  $ liveness        : num  0.357 0.0815 0.241 0.158 0.341 0.189 0.282 0.362 0.334 0.107 ...
##  $ loudness        : num  -5.62 -7.31 -6.88 -5.46 -3.63 ...
##  $ speechiness     : num  0.376 0.0454 0.346 0.0435 0.158 0.0739 0.0493 0.0336 0.028 0.176 ...
##  $ tempo           : num  85.7 100 175.9 139.5 136 ...
##  $ valence         : num  0.713 0.65 0.463 0.605 0.542 0.481 0.219 0.314 0.76 0.124 ...
tracks_scale <- scale(tracks_num)
head(tracks_scale)
##      popularity acousticness danceability duration_ms    energy
## [1,] -1.3144482    0.3159738 -0.103066412   0.0855727 0.7311049
## [2,] -0.5755485   -0.6849108 -0.156994529  -0.4137869 0.9617990
## [3,] -1.1033340   -0.3780209 -1.390600214  -0.4360690 0.7599417
## [4,]  0.4800225   -0.8670123 -0.824354981   0.1685166 0.5234803
## [5,]  0.6911367   -0.6934731  0.193538234   0.1232809 0.5638518
## [6,]  2.2744932   -0.5195237 -0.008692206  -0.4790606 1.4058849
##      instrumentalness     liveness    loudness speechiness      tempo
## [1,]       -0.2252330  1.162875416  0.39131252   2.0399495 -1.2071715
## [2,]       -0.2252330 -0.731538175 -0.22022169  -0.6602825 -0.7196655
## [3,]       -0.2252330  0.365227588 -0.06606864   1.7949194  1.8713091
## [4,]       -0.2252151 -0.205503185  0.45043474  -0.6758011  0.6297919
## [5,]       -0.2252330  1.052855026  1.11347420   0.2593973  0.5084701
## [6,]       -0.2252330  0.007661321  1.09098599  -0.4275039 -0.6155827
##          valence
## [1,]  1.00184410
## [2,]  0.72282618
## [3,] -0.10536990
## [4,]  0.52352766
## [5,]  0.24450973
## [6,] -0.02565049
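
As a minimal sanity check (not part of the original analysis), scale() performs z-score standardization, so the value computed by hand below should match tracks_scale[1, "popularity"] above (about -1.31):

# (x - mean(x)) / sd(x) for the first popularity value
(tracks_num$popularity[1] - mean(tracks_num$popularity)) / sd(tracks_num$popularity)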

Clustering

Find optimal number of Clusters

We will try to find the optimal number of clusters using 3 methods: the Elbow method, the Silhouette method, and the Gap statistic. After we get the results from these 3 methods, we will choose the optimal number of clusters by majority voting.

Elbow Method

The usual practice for the elbow method is to read the graph and choose the point at the “bend of an elbow”.

fviz_nbclust(tracks_num, kmeans, method = "wss", k.max = 10) +
  scale_y_continuous(labels = number_format(scale = 10^(-9), big.mark = ",", suffix = " bil.")) +
  labs(subtitle = "Elbow method")

We find that 2 clusters is good enough, since there isn't a significant decline in the total within-cluster sum of squares at higher numbers of clusters.

Silhouette Method

The silhouette method measures the silhouette coefficient, calculating the mean intra-cluster distance and the mean nearest-cluster distance for each observation. We get the optimal number of clusters by choosing the number of clusters with the highest silhouette score (the peak).

fviz_nbclust(tracks_num, kmeans, method = "silhouette", k.max = 10) 

Based on the silhouette method, the number of clusters with the maximum score, and therefore the optimal k, is 2.

Gap Statistic

The gap statistic compares the total within-cluster variation for different values of k with their expected values under a null reference distribution of the data. The estimated optimal number of clusters is the value that maximizes the gap statistic.

fviz_nbclust(tracks_num, kmeans, "gap_stat", k.max = 10) + labs(subtitle = "Gap Statistic method")

Based on the gap statistic method, the optimal k is 1.

Majority Voting for Optimum K

The results of our 3 methods are: Elbow method k = 2, Silhouette method k = 2, and Gap statistic k = 1. Two of the three methods give k = 2, so we use that result; besides, if we chose k = 1 we could not analyze any differences between clusters or segments.

K-Means Clustering

Here we run K-means with the optimal k from the process above, k = 2.

set.seed(100)
km_tracks <-kmeans(tracks_scale, centers = 2)
km_tracks
## K-means clustering with 2 clusters of sizes 736, 1558
## 
## Cluster means:
##     popularity acousticness danceability duration_ms     energy
## 1  0.013047091    0.7995148  -0.13930852  0.12854175 -1.0607826
## 2 -0.006163453   -0.3776912   0.06580942 -0.06072319  0.5011143
##   instrumentalness   liveness   loudness speechiness      tempo    valence
## 1       0.19655118 -0.2476437 -0.9122186 -0.07286271 -0.2282781 -0.5674398
## 2      -0.09285088  0.1169870  0.4309325  0.03442038  0.1078387  0.2680588
## 
## Clustering vector:
##    [1] 2 2 2 2 2 2 2 2 2 1 1 2 1 1 2 2 2 2 2 2 2 1 2 1 1 1 2 2 1 1 1 2 2 2 1 2 1
##   [38] 1 1 2 2 1 1 2 2 2 2 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 2 2 2 2 2 1 2 1 2 2 2 2
##   [75] 2 1 2 2 2 2 1 1 2 2 2 2 2 1 1 2 2 1 2 2 2 1 2 1 2 1 1 1 2 1 1 2 1 2 1 1 1
##  [112] 2 1 2 2 2 1 1 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 1 1 1 2 1 2 2 1 1 2 1 2
##  [149] 1 2 2 2 1 1 2 1 1 2 2 1 2 1 2 1 2 2 2 1 1 1 2 2 2 1 2 2 2 2 1 2 1 2 2 2 2
##  [186] 2 2 2 2 2 2 2 1 1 2 2 2 1 2 2 2 2 1 2 1 2 2 2 2 2 1 2 1 2 1 2 2 2 2 2 1 2
##  [223] 1 2 1 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 1 1 2 1 1 2 2 2 2 1 2 1 1 1 2 1 2 1
##  [260] 2 1 2 2 2 1 2 1 2 2 2 2 1 2 2 1 1 1 2 1 2 2 2 2 2 1 1 1 2 1 1 1 1 2 2 2 2
##  [297] 1 1 2 2 1 1 2 2 2 2 1 1 2 2 1 1 2 2 2 2 2 2 2 1 2 2 2 2 1 2 1 2 1 2 2 2 2
##  [334] 1 2 2 2 2 1 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2 1
##  [371] 1 1 2 2 2 2 1 1 2 1 2 2 1 2 2 2 2 2 1 1 1 2 2 2 1 2 2 1 2 2 2 2 2 2 1 1 1
##  [408] 2 2 1 1 2 2 1 2 2 2 2 1 2 2 1 2 1 2 1 2 1 2 1 2 1 2 2 2 2 2 2 2 2 2 2 1 1
##  [445] 1 2 1 2 2 1 2 1 2 1 2 2 2 2 1 1 2 2 2 1 2 1 1 1 2 2 2 1 2 1 1 1 1 2 1 2 2
##  [482] 2 1 1 2 2 2 2 1 1 2 1 2 2 2 1 2 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2 1 1 1 2 2 1
##  [519] 2 2 1 2 1 2 2 2 2 2 2 1 2 1 1 1 2 1 1 2 1 2 2 2 2 1 2 2 2 1 2 1 2 1 1 2 1
##  [556] 2 1 2 1 2 2 1 2 1 2 1 2 1 2 2 1 2 1 2 1 1 2 2 2 2 2 1 1 2 1 2 2 2 1 2 1 1
##  [593] 1 2 1 2 2 2 1 1 2 2 2 1 2 1 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 1 2 1 2 1 2 1 1
##  [630] 2 1 2 2 2 1 2 1 2 2 2 2 1 1 1 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2 1 2 2 2
##  [667] 2 2 2 2 1 2 1 2 1 2 1 1 1 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 1
##  [704] 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 1 1 1 1 2 1 2 2 1 2 1 2 2 2 2 2 2 2
##  [741] 2 1 2 1 2 1 2 2 2 2 1 2 1 2 2 2 1 2 1 2 1 2 2 2 1 2 2 1 2 1 1 2 2 2 2 2 2
##  [778] 1 2 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 1 2 1 2 2 2 1
##  [815] 1 2 1 1 2 1 2 2 2 1 2 2 1 1 1 2 2 1 2 2 1 2 2 2 2 1 2 2 2 1 2 2 1 2 1 2 2
##  [852] 2 2 1 2 2 1 1 1 2 1 2 2 2 2 2 2 2 2 2 1 2 2 1 1 2 2 2 2 2 2 2 2 1 2 2 1 1
##  [889] 2 1 2 2 2 2 1 2 2 2 2 2 1 2 2 1 2 1 1 1 1 2 2 2 1 2 2 2 2 2 2 2 2 1 1 2 1
##  [926] 1 1 2 1 2 1 2 2 1 2 2 2 2 1 2 2 1 2 1 2 2 2 2 1 2 2 2 2 2 2 1 2 2 1 2 2 1
##  [963] 2 2 2 1 2 2 2 2 1 1 2 2 2 1 2 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2
## [1000] 2 2 2 2 2 2 1 2 1 2 2 2 2 1 2 2 2 2 2 1 1 1 2 2 2 2 2 2 1 1 2 2 2 2 1 1 2
## [1037] 2 1 2 1 1 1 1 2 2 1 2 1 2 1 2 2 2 2 2 2 1 1 1 2 2 1 2 2 2 1 2 1 2 2 1 1 1
## [1074] 1 2 1 2 1 1 1 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 1 1
## [1111] 2 2 1 2 1 1 2 2 2 2 1 2 1 2 2 1 1 1 1 2 2 2 1 1 1 1 1 2 1 2 1 1 2 2 2 2 2
## [1148] 2 1 1 1 1 2 2 1 2 2 2 2 1 1 2 2 2 2 2 2 1 2 2 2 1 1 1 1 2 1 2 2 2 1 1 2 2
## [1185] 1 2 1 1 2 2 2 2 1 2 2 2 1 2 1 2 2 2 2 1 2 1 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2
## [1222] 1 2 1 1 1 2 2 1 1 1 2 1 2 2 2 1 2 1 2 2 2 2 2 2 2 2 1 2 1 2 2 1 1 1 2 2 1
## [1259] 1 2 1 2 2 2 2 2 1 1 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2
## [1296] 2 2 1 2 2 1 2 2 2 2 2 2 2 1 2 1 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 1 2 1 1 2 2
## [1333] 2 1 2 2 2 1 1 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 1 2 2 2 2 2 2 2 1
## [1370] 2 2 2 2 2 2 1 2 2 2 1 1 2 2 1 2 1 1 2 2 2 2 1 2 1 2 2 2 1 2 1 2 1 2 2 2 2
## [1407] 2 2 2 2 2 2 2 2 1 1 2 2 2 1 1 1 2 2 2 2 2 1 1 1 2 2 2 1 2 2 2 2 2 1 2 1 2
## [1444] 2 2 2 1 2 2 2 2 2 1 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1481] 1 2 2 2 2 2 2 2 1 2 2 1 2 2 1 1 2 1 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 1 2 2 1
## [1518] 1 1 2 2 1 2 2 1 2 1 1 1 2 1 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 1
## [1555] 2 1 2 2 1 1 1 2 2 2 1 2 2 2 2 1 2 2 1 2 2 2 2 2 2 1 1 1 1 2 2 1 2 1 2 2 2
## [1592] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 1 2 2 2 2 2 2 1 1 2 1 2 2 2 1 1 2 2 2 1
## [1629] 2 1 1 2 2 2 2 1 1 2 2 1 1 1 2 1 2 2 2 1 1 2 1 2 2 2 1 2 1 2 1 2 2 2 2 2 2
## [1666] 1 2 1 1 2 2 2 1 2 1 1 2 1 2 2 1 2 2 1 1 2 1 1 1 2 1 2 2 2 2 2 2 2 1 2 1 2
## [1703] 2 2 2 2 2 2 2 2 2 2 1 1 1 2 1 2 2 2 1 2 1 2 2 2 2 2 2 2 1 2 2 1 1 2 2 2 1
## [1740] 1 2 1 2 2 1 2 2 1 1 2 1 2 1 1 2 1 2 1 2 1 1 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2
## [1777] 2 2 2 1 2 1 1 1 2 2 1 2 1 2 2 1 2 1 2 2 2 2 2 2 2 1 2 1 2 1 2 1 1 1 2 2 1
## [1814] 2 2 2 2 2 2 2 1 2 2 1 1 2 1 1 1 2 2 2 2 2 1 1 1 2 1 2 2 2 2 2 2 1 2 2 2 2
## [1851] 1 2 1 1 2 1 1 1 1 1 1 1 2 1 1 1 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 1 1
## [1888] 2 1 2 2 2 2 1 2 2 2 1 1 2 2 1 1 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2
## [1925] 2 2 2 1 2 2 2 1 2 1 2 2 2 2 2 2 1 1 2 1 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2
## [1962] 2 1 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2
## [1999] 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 1 1 2 2 2 2
## [2036] 1 2 1 2 2 2 2 2 2 1 1 1 2 2 2 2 2 2 1 2 1 2 1 1 2 2 2 2 1 2 1 2 2 2 1 2 2
## [2073] 1 1 2 2 2 1 2 2 2 1 2 2 2 2 2 1 1 2 1 2 1 2 1 1 2 1 2 2 2 2 1 2 1 1 1 2 2
## [2110] 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 1 2 1 1 2 2 1 1
## [2147] 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 1 2 2 1 1 2 2 1 2 1 1 2 2 2 1 2 2 2 1 2 2 1
## [2184] 2 2 1 1 1 1 2 2 2 2 2 1 2 2 2 2 2 2 2 1 1 1 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2
## [2221] 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 1 2 2 1 1
## [2258] 2 2 2 2 2 2 1 2 2 1 2 2 2 1 2 2 1 2 1 1 1 1 2 2 2 2 2 2 2 1 2 2 2 1 2 2 1
## 
## Within cluster sum of squares by cluster:
## [1]  8839.964 13010.500
##  (between_SS / total_SS =  13.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Based on the summary, 736 observations go to cluster 1 and the other 1,558 go to cluster 2. Now that we have the cluster assignment of each observation, we can join the cluster vector to our sample dataset.

tracks_sample$cluster <- as.factor(km_tracks$cluster)

head(tracks_sample)
## # A tibble: 6 x 16
##   genre artist_name popularity acousticness danceability duration_ms energy
##   <fct> <chr>            <dbl>        <dbl>        <dbl>       <dbl>  <dbl>
## 1 HipH… Talib Kweli         48     0.263           0.629      227373  0.787
## 2 Dance Andy Gramm…         55     0.0409          0.621      199113  0.827
## 3 Rap   Mac Miller          50     0.109           0.438      197852  0.792
## 4 Pop   Simple Plan         65     0.000491        0.522      232067  0.751
## 5 HipH… Pitbull             67     0.039           0.673      229507  0.758
## 6 Rap   Jason Deru…         82     0.0776          0.643      195419  0.904
## # … with 9 more variables: instrumentalness <dbl>, key <fct>, liveness <dbl>,
## #   loudness <dbl>, mode <fct>, speechiness <dbl>, tempo <dbl>, valence <dbl>,
## #   cluster <fct>

Cluster Analysis

We will now analyze and explore the data based on the clusters obtained with K-means. Our focus is popularity, so let's see whether there is a relation between the clusters and the popularity score.

tracks_sample %>% 
  select(cluster, popularity) %>% 
  group_by(cluster) %>% 
  summarise_all("mean")
## # A tibble: 2 x 2
##   cluster popularity
##   <fct>        <dbl>
## 1 1             60.6
## 2 2             60.4

Clusters 1 and 2 show almost no difference in average popularity; cluster 1 is only about 0.2 higher than cluster 2. Let's take a look with a boxplot.

tracks_sample %>% ggplot(aes(x = cluster, y = popularity, fill = cluster)) +
  geom_boxplot() +
  theme_minimal()

As in the analysis above, there is no clear difference in popularity between the two clusters, so we can assume that the clustering is not driven by the popularity variable. Let's break it down by genre.

tracks_sample %>% 
  select(cluster, genre) %>% 
  group_by(genre, cluster) %>% 
  summarize(n = n()) %>% 
  ungroup() %>%
  spread(genre, n, fill=0)
## # A tibble: 2 x 6
##   cluster Dance HipHop   Pop   Rap  Rock
##   <fct>   <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 1         108    146   176   137   169
## 2 2         299    322   303   336   298
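
As a small follow-up sketch (not part of the original analysis), the same counts can be shown as within-genre proportions, which makes the split easier to compare across genres:

# Proportion of each genre that falls into each cluster
tracks_sample %>% 
  count(genre, cluster) %>% 
  group_by(genre) %>% 
  mutate(prop = n / sum(n))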

The clusters do not separate the genres either. Let's visualize another variable, acousticness.

tracks_sample %>% ggplot(aes(x = cluster, y = acousticness, fill = cluster)) +
  geom_boxplot() +
  theme_minimal()

Acousticness is one of the variables that the clustering does pick up. Let's look at all the variables to see which other ones differ clearly between the clusters.

tracks_sample %>%
  select_if(is.numeric) %>% 
  mutate(cluster = as.factor(km_tracks$cluster)) %>% 
  group_by(cluster) %>% 
  summarise_all("mean")
## # A tibble: 2 x 12
##   cluster popularity acousticness danceability duration_ms energy
##   <fct>        <dbl>        <dbl>        <dbl>       <dbl>  <dbl>
## 1 1             60.6        0.370        0.624     229805.  0.476
## 2 2             60.4        0.109        0.654     219094.  0.747
## # … with 6 more variables: instrumentalness <dbl>, liveness <dbl>,
## #   loudness <dbl>, speechiness <dbl>, tempo <dbl>, valence <dbl>

We can see that clusters 1 and 2 differ noticeably in acousticness, energy, instrumentalness, liveness, and loudness. Even though the two clusters do not differ significantly in popularity, we might assume that tracks with a profile closer to cluster 1 have a slightly better chance of reaching higher popularity.
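
One way to rank these differences (a sketch, not part of the original analysis) is to compare the standardized cluster centres stored in km_tracks; the larger the absolute difference, the more strongly a feature separates the two clusters:

# Absolute difference between the two standardized cluster centres, largest first
centre_diff <- km_tracks$centers[2, ] - km_tracks$centers[1, ]
sort(abs(centre_diff), decreasing = TRUE)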

PCA

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. PCA is sensitive to the relative scaling of the original variables.

Dimensionality Reduction

We will build a PCA from our sample dataset and inspect the eigenvalues and the percentage of variance for each dimension. The eigenvalues are largest for the first PCs and decrease steadily towards the last PC.

non_numeric <- which(sapply(tracks_sample, negate(is.numeric)))

tracks_pca <- PCA(tracks_sample,
                  scale.unit = T,
                  quali.sup = non_numeric,
                  graph = F,
                  ncp = 11)

summary(tracks_pca)
## 
## Call:
## PCA(X = tracks_sample, scale.unit = T, ncp = 11, quali.sup = non_numeric,  
##      graph = F) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               2.309   1.505   1.142   1.027   0.985   0.948   0.847
## % of var.             20.987  13.682  10.379   9.332   8.955   8.619   7.696
## Cumulative % of var.  20.987  34.668  45.047  54.379  63.334  71.954  79.650
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.833   0.645   0.543   0.217
## % of var.              7.575   5.867   4.937   1.971
## Cumulative % of var.  87.225  93.092  98.029 100.000
## 
## Individuals (the 10 first)
##                      Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## 1                |  3.250 |  1.012  0.019  0.097 |  0.942  0.026  0.084 |
## 2                |  2.008 |  0.805  0.012  0.161 | -0.047  0.000  0.001 |
## 3                |  3.315 |  0.852  0.014  0.066 | -0.663  0.013  0.040 |
## 4                |  1.841 |  1.040  0.020  0.319 | -0.978  0.028  0.282 |
## 5                |  2.029 |  1.579  0.047  0.606 | -0.136  0.001  0.004 |
## 6                |  3.075 |  1.616  0.049  0.276 | -0.091  0.000  0.001 |
## 7                |  2.559 |  1.031  0.020  0.162 | -1.282  0.048  0.251 |
## 8                |  2.441 |  0.762  0.011  0.097 | -0.048  0.000  0.000 |
## 9                |  3.198 |  0.304  0.002  0.009 | -1.388  0.056  0.188 |
## 10               |  4.496 | -2.847  0.153  0.401 | -1.205  0.042  0.072 |
##                   Dim.3    ctr   cos2  
## 1                 2.302  0.202  0.502 |
## 2                -0.555  0.012  0.076 |
## 3                 1.927  0.142  0.338 |
## 4                -0.786  0.024  0.182 |
## 5                 0.188  0.001  0.009 |
## 6                -1.807  0.125  0.345 |
## 7                -0.480  0.009  0.035 |
## 8                -0.049  0.000  0.000 |
## 9                 0.976  0.036  0.093 |
## 10                0.025  0.000  0.000 |
## 
## Variables (the 10 first)
##                     Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## popularity       |  0.002  0.000  0.000 |  0.095  0.596  0.009 | -0.617 33.386
## acousticness     | -0.647 18.115  0.418 |  0.158  1.653  0.025 | -0.003  0.001
## danceability     |  0.125  0.673  0.016 |  0.769 39.243  0.591 | -0.018  0.030
## duration_ms      | -0.149  0.966  0.022 | -0.471 14.759  0.222 |  0.122  1.306
## energy           |  0.872 32.917  0.760 | -0.267  4.720  0.071 |  0.015  0.021
## instrumentalness | -0.310  4.171  0.096 | -0.405 10.926  0.164 |  0.156  2.131
## liveness         |  0.237  2.434  0.056 | -0.099  0.654  0.010 |  0.595 31.035
## loudness         |  0.818 28.979  0.669 | -0.112  0.831  0.013 | -0.149  1.951
## speechiness      |  0.083  0.298  0.007 |  0.482 15.415  0.232 |  0.578 29.288
## tempo            |  0.159  1.093  0.025 | -0.254  4.297  0.065 |  0.084  0.612
##                    cos2  
## popularity        0.381 |
## acousticness      0.000 |
## danceability      0.000 |
## duration_ms       0.015 |
## energy            0.000 |
## instrumentalness  0.024 |
## liveness          0.354 |
## loudness          0.022 |
## speechiness       0.334 |
## tempo             0.007 |
## 
## Supplementary categories (the 10 first)
##                       Dist     Dim.1    cos2  v.test     Dim.2    cos2  v.test
## Dance            |   0.678 |   0.273   0.162   3.998 |  -0.339   0.249  -6.138
## HipHop           |   0.843 |   0.024   0.001   0.382 |   0.592   0.492  11.692
## Pop              |   0.735 |  -0.082   0.013  -1.334 |   0.025   0.001   0.491
## Rap              |   0.678 |   0.104   0.024   1.675 |   0.457   0.456   9.100
## Rock             |   1.084 |  -0.283   0.068  -4.512 |  -0.786   0.526 -15.517
## *NSYNC           |   3.488 |   2.905   0.694   1.912 |   0.196   0.003   0.159
## $uicideBoy$      |   2.143 |  -0.348   0.026  -0.648 |   0.681   0.101   1.572
## 03 Greedo        |   2.493 |   1.291   0.268   0.850 |   0.623   0.062   0.508
## 070 Shake        |   2.524 |  -1.786   0.501  -1.176 |   0.331   0.017   0.269
## 10cc             |   5.564 |  -3.790   0.464  -2.494 |  -1.795   0.104  -1.463
##                      Dim.3    cos2  v.test  
## Dance            |   0.015   0.000   0.313 |
## HipHop           |   0.500   0.351  11.333 |
## Pop              |  -0.534   0.528 -12.296 |
## Rap              |   0.259   0.146   5.912 |
## Rock             |  -0.228   0.044  -5.167 |
## *NSYNC           |   0.245   0.005   0.229 |
## $uicideBoy$      |   0.337   0.025   0.892 |
## 03 Greedo        |  -0.436   0.031  -0.408 |
## 070 Shake        |   0.424   0.028   0.397 |
## 10cc             |   0.606   0.012   0.567 |

After building the PCA, to make it easier to understand, we can visualize the variance captured by each dimension.

fviz_eig(tracks_pca, ncp = 11, addlabels = T, main = "Variance Explained by Dimensions")

We can see that around 54% of the variance in the sample data is explained by the first 4 dimensions alone.

Our target is to reduce the number of features for lighter computation while still keeping more than 80% of the information with as few dimensions as possible. We can keep 8 of the 11 dimensions, since the cumulative variance of the first 8 dimensions already exceeds 80% of the total.
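
The same choice can be made programmatically (a sketch, not part of the original analysis) by finding the smallest number of dimensions whose cumulative variance exceeds 80%; get_eigenvalue() comes from factoextra:

# First dimension at which the cumulative variance passes 80% (should be 8)
eig <- get_eigenvalue(tracks_pca)
which(eig$cumulative.variance.percent >= 80)[1]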

tracks_pca_min <- data.frame(tracks_pca$ind$coord[ ,1:8]) %>% 
  bind_cols(cluster = as.factor(tracks_sample$cluster))

head(tracks_pca_min)
##       Dim.1       Dim.2      Dim.3      Dim.4      Dim.5       Dim.6
## 1 1.0122734  0.94229908  2.3020987 -1.1828077  0.1633709  0.04650929
## 2 0.8052256 -0.04725789 -0.5549460 -0.6090916 -1.2256628 -0.01547659
## 3 0.8515912 -0.66258819  1.9270233  2.0635383  0.1928861 -0.46747834
## 4 1.0398390 -0.97782829 -0.7863463  0.3407197  0.1184999  0.05866493
## 5 1.5792762 -0.13596040  0.1878639  0.1504676  0.9698811  0.19449559
## 6 1.6158722 -0.09069233 -1.8066716 -0.3757709  1.1769794  0.31947824
##         Dim.7      Dim.8 cluster
## 1 -0.05281645 -0.4811671       2
## 2 -0.29699955 -0.4206434       2
## 3 -0.04919005 -0.6600425       2
## 4  0.18860331 -0.2629920       2
## 5  0.11174311  0.4921659       2
## 6 -0.40082276  1.1204166       2
pca_dimdesc<-dimdesc(tracks_pca)
pca_dimdesc$Dim.1$quanti
##                  correlation       p.value
## energy            0.87172091  0.000000e+00
## loudness          0.81792061  0.000000e+00
## valence           0.48891303 3.760595e-138
## liveness          0.23705070  1.148005e-30
## tempo             0.15882568  1.985875e-14
## danceability      0.12460398  2.123139e-09
## speechiness       0.08291389  7.011264e-05
## duration_ms      -0.14935922  6.472797e-13
## instrumentalness -0.31029361  2.197482e-52
## acousticness     -0.64667005 7.190923e-272

Energy and loudness are the 2 variables that contribute the most to PC1, so PC1 mostly carries the information of these two variables.
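
As an optional visual companion (a sketch, not part of the original analysis), factoextra's fviz_contrib() draws the variable contributions to the first component as a bar chart:

# Bar chart of variable contributions to PC1
fviz_contrib(tracks_pca, choice = "var", axes = 1)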

Individual and Variable Factor Map

As discussed in the previous section, PCA can be combined with clustering to obtain a better visualization of the clustering result, or simply to understand the pattern in our dataset.

fviz_cluster(object = km_tracks, data = tracks_scale) + 
  theme_minimal()

The plot above is an example of an individual factor map. The points represent observations and are colored by the K-means cluster. Dim1 and Dim2 are PC1 and PC2 respectively, each with its own share (percentage) of the total information in the dataset. With this visualization alone we cannot really understand the pattern, so we add another visualization, the variable factor map.

fviz_pca_var(tracks_pca) +
  theme_minimal()

The plot above shows that the variables lie inside the circle, meaning that we would need more than two components to represent our data perfectly. The distance between a variable and the origin measures the quality of that variable on the factor map. Variables that are correlated with PC1 and PC2 are the most important for explaining the variability in the dataset. Variables that are not correlated with any PC, or only with the last dimensions, have low contribution and might be removed to simplify the overall analysis.
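
As a hedged extra visualization (not part of the original analysis), a PCA biplot overlays the variable arrows on the individual factor map, coloured here by the K-means cluster, so the two previous plots can be read together:

# Biplot of individuals and variables, coloured by k-means cluster
fviz_pca_biplot(tracks_pca,
                habillage = as.factor(km_tracks$cluster),
                label = "var",
                addEllipses = TRUE) +
  theme_minimal()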

Some insights we can get from the individual factor map and the variable factor map:

  • Cluster 1: tracks with more acousticness, more instrumentalness, and longer durations, with slightly higher average popularity than cluster 2.
  • Cluster 2: tracks with more danceability, speechiness, valence, loudness, energy, tempo, and liveness.

PCA Clustering

PCA can also be combined with the result of the K-means clustering to help visualize our data in fewer dimensions than the original features.

fviz_pca_ind(tracks_pca, habillage = 1)

Some insights we can get from this visualization:
1. Every genre is split across the 2 clusters.
2. Judging from the variable factor map, the Rock genre tends to get lower popularity than the other 4 genres.
3. Pop and HipHop tracks with more danceability, speechiness, and valence have a better chance of higher popularity.

Conclusion

From the unsupervised learning analysis above, we can summarize:

  1. Popularity is only weakly related to the other features, but more danceability, speechiness, and valence can help a track reach higher popularity.
  2. Based on the clustering, tracks in Cluster 1 have a slightly better chance of higher popularity than tracks in Cluster 2.
  3. The Rock genre, which typically has more loudness, energy, tempo, liveness, instrumentalness, and longer durations, tends to get lower popularity than the other genres.
  4. Dimensionality reduction can be performed on this dataset by picking PCs from the 11 available according to the amount of information retained. Targeting more than 80% of the information with as few dimensions as possible, we can keep 8 of the 11 dimensions, since those 8 already capture more than 80% of the total variance.
  5. The improved dataset obtained from unsupervised learning (e.g. PCA) can be used further for supervised learning (classification) or for better visualization of high-dimensional data, with various insights.