Instructions:

Install and load all the packages required for this workshop all at once with this code:

if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load(
  psych,
  knitr,
  DT, 
  tidyverse, 
  tidyr, 
  ggrepel, 
  gganimate, 
  magick, 
  GGally, 
  RColorBrewer, 
  viridis, 
  gridExtra, 
  factoextra, 
  fpc, 
  magrittr 
)

You can also install and load the packages individually from the list below.

#Data inspection
library(psych)

#Data frame
library(knitr)
library(DT)

#Data manipulation
library(tidyverse)
library(tidyr)

#Data visualizations
library(ggrepel)
library(gganimate)
library(magick)
library(GGally)
library(RColorBrewer)
library(viridis)
library(gridExtra)

#K-Means
library(factoextra)
library(fpc)

#For variable conversions
library(magrittr)

The task at hand:

Identify clusters among different music genres using Spotify data. The input variables for this analysis will be number of songs and popularity scores per artist.

What is clustering? Generally speaking, clustering is the process of grouping data points so that points in the same group are close to (similar to) one another, while points in different groups are far apart (different).
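As a minimal illustration (using made-up toy data, not the workshop dataset), k-means can split a handful of two-dimensional points into two groups based purely on the distances between them:

set.seed(1)
# Two obvious groups of points: one centered around 0, the other around 5
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))
kmeans(toy, centers = 2)$cluster  # each point is assigned to cluster 1 or 2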

Phases of this Workshop

Let’s get started!

PART 1 - Data inspection

Load dataset.

library(readr)
music <- read.csv('https://raw.githubusercontent.com/karendelarosa/datasets/master/musicgenre-clustering.csv')

Check number of cases and types of variables.

str(music)
## 'data.frame':    3242 obs. of  5 variables:
##  $ artist    : chr  "10000 Maniacs" "12 Stones" "311" "4 Non Blondes" ...
##  $ songs     : int  110 75 196 15 13 36 36 13 192 0 ...
##  $ popularity: num  0.3 0.3 0.5 7.5 0 0.1 0.1 0 10.8 0 ...
##  $ link      : chr  "/10000-maniacs/" "/12-stones/" "/311/" "/4-non-blondes/" ...
##  $ genre     : chr  "Rock" "Rock" "Rock" "Rock" ...

The dataset contains 3242 cases. “songs” indicates the number of songs per artist, and “popularity” shows popularity scores. R is reading songs as integer and genre as character. In order to make genre comparisons, we should convert the genre variable to a factor. We will also convert the songs variable to numeric so that both input variables (songs and popularity) are stored as the same type.

Make the variable conversions.

#Convert genre to factor
music$genre <- factor(music$genre)

#Convert songs to numeric
music$songs <- as.numeric(music$songs)

Now inspect the songs variable to make sure that the conversion was performed correctly, and use the table function to check the levels of the genre factor and the number of cases in each level.

str(music$songs)
##  num [1:3242] 110 75 196 15 13 36 36 13 192 0 ...
table(music$genre)
## 
## Funk Carioca      Hip Hop          Pop         Rock        Samba    Sertanejo 
##          302          537          796          797          193          617

The songs variable was converted to numeric, and the genre factor has six levels (Funk Carioca, Hip Hop, Pop, Rock, Samba, and Sertanejo).

Everything looks good so far!

Visualizations 1 and 2 - Plots for Genre Count and Popularity Mean Scores

Plot - Genre count

Let’s create a plot showing genre count.

p1 <- ggplot(music, aes(x=factor(genre))) +
  geom_bar(width=0.7, 
           aes(fill=genre), 
           alpha=0.7) + 
  scale_fill_brewer(palette = "Paired") + 
  ggtitle("Plot 1 : Genre Count") + 
  theme(plot.title = element_text(hjust = 0.5)) + 
  xlab("Genre") +
  coord_flip()

avg_popularity <- music %>% 
  select(popularity, genre) %>% 
  group_by(genre) %>% 
  summarise("average_popularity" = round(mean(popularity)))

p1

As shown in the previous frequency table and this plot, Rock, Pop, and Sertanejo are the most recurrent genres in the dataset.

Plot - Popularity mean scores by genre

Now let’s take a look at the mean popularity scores by genre.

p2 <- ggplot(data=avg_popularity, 
             mapping = aes(x = (genre), 
                           y = average_popularity, 
                           fill = genre)) + 
  geom_col(width = 0.7,alpha=0.7) + 
  scale_fill_brewer(palette = "Paired") + 
  ggtitle("Plot 2 : Genre Popularity") + 
  xlab("Genre") + ylab("Mean Popularity") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()

p2

This plot shows that Rock and Pop have the highest mean popularity scores of all the genres in the dataset.

PART 2 - Clustering Analysis

Before conducting the clustering analysis, let’s compute descriptive statistics for the two input variables (songs and popularity). The purpose of this preliminary assessment is to check whether songs and popularity are measured on a similar scale.

describe(music$songs)
##    vars    n  mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 3242 53.91 78.21     17    37.4 23.72   0 759   759 2.62     9.99 1.37
describe(music$popularity)
##    vars    n mean   sd median trimmed mad min   max range  skew kurtosis   se
## X1    1 3242 1.52 6.97      0    0.31   0   0 246.8 246.8 17.69   512.88 0.12

Look at the ‘max’ column on the two outputs. The highest number of songs per artist is 759, while the highest popularity score is 246.8. Hence, songs is measured on a much wider scale than popularity. To prevent the songs variable from dominating the distance calculations in the clustering analysis, both variables should be normalized.
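As a quick check of the claim about scales, the two ranges can also be printed side by side before normalizing:

sapply(music[, c("songs", "popularity")], range)  # row 1 = minimum, row 2 = maximum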

The normalize procedure places all the values of a numerical column on a range between 0 and 1. This prevents variables measured on larger scales from dominating distance-based methods such as k-means. Some packages contain normalizing functions, but such a function can also be written when needed.

Let’s create a normalize function.

normalize <- function(x) {
  num <- x - min(x)
  denom <- max(x) - min(x)
  return (num/denom)
}
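A quick sanity check of the function on a small vector:

normalize(c(2, 4, 6, 10))  # returns 0.00 0.25 0.50 1.00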

Normalize the song and popularity variables.

musicnorm <- normalize(music[,c("songs","popularity")])

Inspect the normalized columns.

describe(musicnorm)
##            vars    n mean   sd median trimmed  mad min  max range  skew
## songs         1 3242 0.07 0.10   0.02    0.05 0.03   0 1.00  1.00  2.62
## popularity    2 3242 0.00 0.01   0.00    0.00 0.00   0 0.33  0.33 17.69
##            kurtosis se
## songs          9.99  0
## popularity   512.88  0

All the values for songs and popularity now fall between 0 and 1. Note that popularity tops out at about 0.33 rather than 1 because normalize() was applied to the whole two-column data frame, so the minimum and maximum were taken over both columns together. The two variables are now on comparable scales, which is what we need.
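If you prefer each variable to span the full 0 to 1 range independently, the same function can be applied column by column. This step is optional, and using it would change the values shown above and in the later outputs; the object name musicnorm_bycol is only for illustration:

# Optional: column-wise normalization, so each variable spans the full 0-1 range
musicnorm_bycol <- as.data.frame(lapply(music[, c("songs", "popularity")], normalize))
summary(musicnorm_bycol)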

K-means clustering

We will use the k-means algorithm to compute the clustering analysis and the “elbow method” to determine the optimal number of clusters for our fit.

Create this function to compute the clustering curve.

wss <- function(data, maxCluster = 9) {
  SSw <- vector()
  # For k = 1, the within-groups sum of squares equals the total variance of the data
  SSw[1] <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:maxCluster) {
    SSw[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:maxCluster, SSw, type = "o", xlab = "Number of Clusters", 
       ylab = "Within groups sum of squares", pch=19)
}

Visualization 3 - K-means curve

Clustering curve

Now let’s check the curve.

wss(musicnorm)

Looks like the “elbow” point (the spot where the curve starts to flatten) happens at 3 clusters. This is the number of clusters that will be used to perform the k-means clustering analysis.

Alternative visualization for clustering curve

Note: In this code, the parameter ‘nstart’ indicates the number of random initial assignments used to compute the k-means; the best solution is kept. We will use 25 random starts for this step.

Enter the optimal number of clusters from the previous graphic in the ‘xintercept’ parameter. As indicated previously, the optimal number of clusters seems to be 3.

fviz_nbclust(musicnorm, kmeans, nstart=25, 
             method = "wss") + 
  geom_vline(xintercept = 3, linetype = 1)

This visualization shows the elbow point more clearly.

Visualization 4 - Clusters

Fit k-means.

set.seed(13437885)
fit <- kmeans(musicnorm, centers = 3, nstart = 25)

Plot clusters.

fviz_cluster(fit, 
             geom = c("point", "text"),  
             data = musicnorm, 
             palette = "Set4",
             main = "K Means Clustering with 3 Centers", 
             alpha = 0.9) + 
  theme(plot.title = element_text(hjust = 0.5)) +  
  expand_limits(y=40) +  
  expand_limits(x=10)

Three clusters were computed. This is consistent with our previous fitting procedures.

See the number of cases within each cluster.

fit$size
## [1]  736 2328  178

Print results for the cluster fit. This output will show the means of the input variables (songs and popularity) per cluster, a clustering vector indicating the cluster placement for each data point (1, 2, or 3 in this case), and the between_SS / total_SS measure to assess the cluster fit (this measure is explained below).

print(fit)
## K-means clustering with 3 clusters of sizes 736, 2328, 178
## 
## Cluster means:
##        songs   popularity
## 1 0.15548741 0.0050047975
## 2 0.02031408 0.0002854067
## 3 0.38500540 0.0121071487
## 
## Clustering vector:
##    [1] 1 1 1 2 2 2 2 2 1 2 2 1 2 3 1 3 1 2 1 2 2 2 1 2 1 1 2 2 1 2 2 2 2 2 2 1 2
##   [38] 2 1 2 1 2 2 2 2 1 1 1 2 2 1 2 1 1 3 2 2 2 1 1 1 2 1 2 1 2 1 1 2 1 2 1 1 2
##   [75] 2 2 2 3 1 2 2 2 2 1 1 2 1 2 1 3 1 2 2 1 2 2 1 2 1 1 2 2 2 2 2 2 2 2 2 1 2
##  [112] 3 3 2 2 2 1 2 2 2 2 2 2 2 2 1 2 1 2 2 2 1 1 3 2 2 2 1 1 2 2 1 1 2 2 1 2 1
##  [149] 2 2 2 2 2 1 1 2 2 2 1 1 2 2 2 1 2 1 1 2 2 2 1 2 2 1 1 2 2 3 2 1 2 2 2 2 1
##  [186] 2 1 2 1 2 1 2 1 2 2 2 1 2 2 3 2 2 2 2 1 2 2 2 2 2 3 2 2 1 1 2 2 2 2 1 2 1
##  [223] 2 2 2 2 2 1 1 3 3 1 3 1 1 2 2 2 2 2 2 2 2 1 2 2 2 2 1 1 2 2 1 1 1 2 1 1 3
##  [260] 1 1 1 2 2 2 1 2 3 1 3 1 1 3 2 2 3 1 2 2 2 2 2 2 2 3 2 2 2 3 2 2 3 2 2 2 3
##  [297] 2 3 3 1 2 2 3 2 1 2 2 1 1 1 2 1 1 2 2 3 1 2 2 1 2 2 1 1 3 2 1 2 1 2 2 2 1
##  [334] 2 1 2 1 3 1 1 1 2 1 2 2 1 2 1 1 1 1 2 2 2 2 2 2 1 1 2 2 1 1 3 2 1 1 1 2 1
##  [371] 2 2 1 1 2 3 2 2 2 2 2 3 2 1 3 2 1 1 2 2 1 2 2 1 1 1 1 2 1 2 1 2 2 2 2 1 2
##  [408] 1 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 1 1 1 2 2 2 2 3 2 2 1 2 2 1 2 2 2 2 2
##  [445] 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 1 3
##  [482] 2 2 1 2 2 1 2 2 2 2 2 2 2 1 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [519] 1 2 2 2 1 1 3 2 2 2 2 2 2 2 2 2 2 2 1 2 3 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 2
##  [556] 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [593] 2 2 1 2 3 2 2 2 2 2 2 2 2 1 2 1 2 2 1 2 1 2 2 2 2 2 2 2 1 2 1 1 2 2 2 1 2
##  [630] 2 2 2 3 2 2 2 2 2 2 1 1 2 2 2 1 2 2 2 3 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 1 2
##  [667] 2 2 2 2 1 3 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 1 2 2 1 2 2 2 2 2 2 1 2 1
##  [704] 2 2 2 2 2 2 2 2 1 2 2 1 2 2 1 1 2 2 2 2 1 2 2 1 2 2 2 2 2 2 1 2 2 2 1 2 2
##  [741] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 1 2 2 2 3 2 1 2 2 1 2 1 2 1
##  [778] 1 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2
##  [815] 2 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 3 2
##  [852] 2 2 2 2 3 2 1 2 2 3 2 2 2 2 1 3 2 1 2 1 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 1 2
##  [889] 2 1 1 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [926] 1 2 2 1 2 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 3 2 2 2 1 2 1 3 2
##  [963] 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 3 1 1 2 2
## [1000] 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2
## [1037] 1 2 1 1 2 1 2 1 1 1 2 2 1 2 2 2 1 2 1 2 2 2 2 3 1 2 2 1 2 1 2 2 1 2 2 2 1
## [1074] 1 2 1 2 2 2 2 2 2 2 2 2 2 2 1 3 2 2 1 2 2 2 3 2 1 2 2 2 1 2 2 2 2 2 2 2 2
## [1111] 2 2 2 2 1 3 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2
## [1148] 1 1 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 1 1 2 2 2 2 3 3 2 2 3 1 2 1 2 1 2 1 2 1
## [1185] 2 2 2 2 2 1 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 1 2 2 2 2 2
## [1222] 2 2 2 1 2 3 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 1 1
## [1259] 2 1 2 2 1 2 3 2 3 3 1 2 1 2 2 2 3 2 2 2 3 2 2 2 1 1 2 2 2 2 1 2 2 2 1 2 1
## [1296] 1 2 2 2 2 3 1 1 2 2 2 1 2 2 2 1 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 3 1 2 2 1
## [1333] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2
## [1370] 2 2 2 3 3 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 3 2
## [1407] 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 1 1 2 1 2 2 2 3 2 2 2 2 2 2 1 3 2 1 2
## [1444] 1 2 1 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 3 2 1 2 2 2 2 2 1 2 1 2 2
## [1481] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 3 2 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1518] 2 2 2 2 2 3 2 2 2 2 2 2 1 2 2 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 1 2
## [1555] 2 3 3 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 1 2 1 1 1 1 3 1 2
## [1592] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1629] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2
## [1666] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1703] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2
## [1740] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1777] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2
## [1814] 2 2 2 2 2 2 2 2 2 2 1 1 3 1 2 2 2 1 1 2 2 2 2 2 3 2 2 2 2 1 1 2 2 2 1 2 2
## [1851] 2 2 1 2 1 2 1 2 1 1 3 2 1 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1
## [1888] 2 2 1 2 2 1 3 1 2 2 2 1 2 2 1 2 2 2 2 2 2 2 1 3 2 2 2 2 2 1 2 2 2 1 2 2 2
## [1925] 1 1 1 2 2 1 1 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 3 1 1 2 1 2 2 1 2 1 2 2 2 2 1
## [1962] 1 2 2 1 3 2 2 2 2 2 2 2 2 2 1 3 2 1 2 2 1 2 2 2 1 3 3 2 2 2 3 2 1 2 2 2 1
## [1999] 2 1 2 1 2 2 2 1 1 2 1 2 2 2 2 2 2 1 2 1 2 1 2 2 2 2 3 2 1 2 2 2 1 3 1 1 2
## [2036] 1 1 1 1 2 2 2 2 2 1 2 3 2 1 2 2 1 1 1 1 2 1 2 2 2 1 1 1 2 2 1 1 1 1 1 2 2
## [2073] 2 2 1 1 2 1 1 1 1 3 2 2 1 1 2 2 2 1 1 3 2 2 2 2 2 1 3 1 3 1 2 1 1 2 2 1 2
## [2110] 3 2 3 2 2 1 2 1 1 2 2 2 1 2 2 3 2 1 2 1 3 2 2 3 1 2 2 1 2 2 1 1 3 2 2 1 2
## [2147] 1 2 2 3 2 1 2 1 2 1 2 2 1 2 2 1 2 2 1 1 1 1 3 2 2 1 2 2 2 2 2 1 1 1 2 1 1
## [2184] 1 3 2 2 2 1 1 1 1 2 1 2 1 2 1 1 2 1 1 1 1 2 1 1 3 2 3 2 1 1 2 2 2 2 2 2 3
## [2221] 1 1 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 1 1 2 2 1 2 2 2 2 1 2 2 2 2 1 2 2 2 3
## [2258] 2 2 2 1 2 1 2 1 3 1 2 2 3 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 1 2 2 2
## [2295] 2 1 1 2 2 2 2 1 2 1 2 2 2 2 2 2 1 1 2 2 2 2 1 1 2 2 1 2 2 2 3 2 2 2 2 2 2
## [2332] 2 2 2 1 1 2 2 1 2 3 1 2 2 2 2 2 2 1 3 1 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2
## [2369] 2 2 1 1 2 2 1 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 3 1 2 2 2 2 2 1 1 2 1
## [2406] 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 1 2 2 2 1 2 1 2 2 1 2 2 2 2 1
## [2443] 2 1 3 2 3 2 3 2 2 1 2 2 2 3 2 2 2 1 2 2 2 2 1 1 1 2 2 2 2 2 2 1 1 2 2 2 2
## [2480] 2 2 2 2 2 2 2 2 1 2 3 2 2 3 2 1 2 2 2 1 2 1 3 1 2 2 1 2 1 2 2 1 2 2 3 2 2
## [2517] 2 2 2 2 2 2 1 1 1 3 1 2 3 2 2 2 2 2 2 2 2 1 2 2 2 2 2 3 2 1 1 2 2 2 1 3 2
## [2554] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 1 1 2 2
## [2591] 2 2 2 2 3 1 2 2 2 2 2 2 2 3 2 2 2 1 1 2 1 2 1 2 2 2 2 2 3 2 1 1 1 3 2 2 2
## [2628] 2 1 2 2 2 2 2 1 2 2 2 3 2 3 2 2 2 3 3 1 1 2 2 2 2 1 2 2 2 2 1 2 2 1 1 2 2
## [2665] 2 3 2 2 1 2 1 3 2 2 2 1 2 2 2 2 1 2 2 1 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2
## [2702] 2 2 2 2 2 1 2 2 1 2 2 2 1 1 3 2 2 2 2 3 3 1 2 2 1 2 1 2 2 2 3 2 1 2 2 2 1
## [2739] 2 2 2 2 1 1 2 1 2 2 1 2 3 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 1 2 2 2 2 2 1
## [2776] 2 2 2 2 2 2 2 2 2 2 1 1 1 2 2 2 1 2 2 3 2 2 2 2 1 2 1 2 3 1 2 1 3 2 3 1 1
## [2813] 1 2 2 1 2 2 2 2 1 2 2 1 2 2 2 2 2 2 1 1 2 1 1 2 1 1 2 1 2 1 2 2 2 1 1 3 2
## [2850] 3 2 2 2 2 1 1 2 1 1 1 1 2 1 2 1 1 2 1 1 1 1 2 2 2 1 2 2 2 2 1 1 2 1 2 1 2
## [2887] 1 2 3 2 2 1 2 2 1 2 1 1 1 2 2 1 2 1 1 2 2 1 2 3 1 1 1 1 1 2 2 2 1 1 1 2 1
## [2924] 2 2 1 1 2 2 2 1 2 1 2 2 1 2 3 2 1 1 1 2 2 2 2 3 2 2 2 2 2 2 2 2 1 2 2 3 1
## [2961] 2 1 1 2 2 1 2 2 1 3 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 1 2 3 2 2 2
## [2998] 2 2 2 2 2 3 2 2 2 2 1 1 2 3 2 2 1 2 2 2 3 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2
## [3035] 1 1 2 2 3 2 1 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
## [3072] 2 2 3 2 2 2 2 3 2 1 1 2 2 2 3 2 2 2 3 1 2 1 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2
## [3109] 1 3 2 1 2 2 2 1 2 2 2 1 2 2 3 1 2 1 1 1 1 2 2 3 2 1 2 2 2 1 1 2 2 2 2 1 2
## [3146] 2 2 2 1 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 1 2 1 2 2 2 1 2 2 1 2 2 2 2 2
## [3183] 3 2 1 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [3220] 2 2 2 2 2 1 2 2 2 2 2 3 1 3 1 2 2 1 1 1 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 1.827114 1.218094 2.826217
##  (between_SS / total_SS =  83.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Interpretation of k-means clustering results and Conclusions

The main objective in clustering is to obtain low variance within each cluster and high variance between clusters. Hence, we can use the between_SS / total_SS ratio to assess these aspects of the clustering analysis.

Concepts:

between_SS (Between Sum of Squares/BSS): the variance between cluster centers, i.e., how well separated the clusters are.

total_SS (Total Sum of Squares/TSS): the total variance in the data; it equals the between-cluster sum of squares plus the within-cluster sum of squares.

The closer the between_SS/total_SS ratio is to 100% (1.0 as a decimal), the more of the total variance is explained by the cluster assignments and the better the clustering fit.

The printed fit results show a BSS/TSS ratio of 83.1% (0.831), which indicates a good clustering fit.
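As a quick check, this ratio can be recomputed directly from the components of the fit object (betweenss and totss, both listed under ‘Available components’ above):

fit$betweenss / fit$totss  # roughly 0.831, i.e. 83.1%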

Cluster attributes

In order to assess the cluster attributes, let’s create a new data frame by scaling the normalized data. The scale function centers each column (subtracting its mean, so the new mean is 0) and divides it by its standard deviation (so the new standard deviation is 1).

Scale normalized columns for songs and popularity.

musicnorm2 <- as.data.frame(scale(musicnorm[,c(1:2)]))

Check scaled columns.

describe(musicnorm2)
##            vars    n mean sd median trimmed mad   min   max range  skew
## songs         1 3242    0  1  -0.47   -0.21 0.3 -0.69  9.02  9.70  2.62
## popularity    2 3242    0  1  -0.22   -0.17 0.0 -0.22 35.22 35.43 17.69
##            kurtosis   se
## songs          9.99 0.02
## popularity   512.88 0.02

The mean for songs and popularity is 0 and the standard deviation is 1, so the scaling procedure was performed correctly.
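A quick way to double-check the centering and scaling directly is to look at the column means and standard deviations:

round(colMeans(musicnorm2), 10)  # both means are (numerically) zero
apply(musicnorm2, 2, sd)         # both standard deviations are 1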

PCA (Principal Component Analysis)

Principal Component Analysis (PCA) is an unsupervised machine learning technique that derives a small set of low-dimensional features from a larger set of variables while preserving as much of the variance as possible. PCA is commonly used to visualize and assess cluster attributes.

Check the following steps in order to compute the principal components:

Create these objects to fit the k-means for the next visualizations: 3 is the optimal number of clusters found previously, and 25 is the number of random starts that we will use for this step.

set.seed(212)
km1 <- kmeans(musicnorm,3,nstart = 25)

km_cl <- km1$cluster
km_ct <- data.frame(km1$centers,clust = rownames(km1$centers))


pc1 <- prcomp(musicnorm[,1:2],center = T)
summary(pc1)
## Importance of components:
##                           PC1      PC2
## Standard deviation     0.1031 0.008567
## Proportion of Variance 0.9931 0.006860
## Cumulative Proportion  0.9931 1.000000
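The summary shows that PC1 alone accounts for roughly 99% of the variance in the two normalized inputs. For reference, the loadings of each input variable on the components can be inspected with:

pc1$rotation  # loadings of songs and popularity on PC1 and PC2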

Combine the cluster results with PC values and the initial data.

musicnorm <- data.frame(musicnorm,music[,-c(1:2)])
musicnorm$clust <- as.factor(km_cl)

pr1 <- data.frame(pc1$x, clust = factor(km_cl), 
      genre = music$genre)

Include the cluster column in the scaled data.

musicnorm2<- data.frame(musicnorm2,musicnorm[,-c(1:2)])
musicnorm2$clust <- as.factor(km_cl)

pr1 <- data.frame(pc1$x, clust = factor(km_cl), 
                  popularity = musicnorm$popularity, 
                  songs = musicnorm$songs)

Normalize the scaled data again.

musicgg <- normalize(musicnorm2[,c(1:2)])
musicgg <- cbind(musicgg,musicnorm2[,-c(1:2)])

Create a new data frame for the next visualizations (note that this replaces the musicgg object created in the previous step).

musicgg <- data.frame(clust = factor(km_cl), musicnorm$songs,
                      musicnorm$popularity)

Check variables in the created musicgg data frame.

str(musicgg)
## 'data.frame':    3242 obs. of  3 variables:
##  $ clust               : Factor w/ 3 levels "1","2","3": 3 3 3 2 2 2 2 2 3 2 ...
##  $ musicnorm.songs     : num  0.1449 0.0988 0.2582 0.0198 0.0171 ...
##  $ musicnorm.popularity: num  0.000395 0.000395 0.000659 0.009881 0 ...

Rename the ‘musicnorm.songs’ and ‘musicnorm.popularity’ columns as ‘songs’ and ‘popularity’ for consistency.

names(musicgg)[2:3] <-c("songs","popularity")

Check the variables again.

str(musicgg)
## 'data.frame':    3242 obs. of  3 variables:
##  $ clust     : Factor w/ 3 levels "1","2","3": 3 3 3 2 2 2 2 2 3 2 ...
##  $ songs     : num  0.1449 0.0988 0.2582 0.0198 0.0171 ...
##  $ popularity: num  0.000395 0.000395 0.000659 0.009881 0 ...

The ‘clust’ column is a factor with 3 levels (the 3 clusters), and the songs and popularity variables were renamed correctly.

Visualization 5 - Attributes per cluster

See the attributes of each cluster using the musicgg data frame.

musicgg %>% 
  group_by(clust) %>% 
  summarise(Popularity = mean(popularity),
            Songs = mean(songs)) %>% 
  select(Popularity,Songs,clust) %>% 
  gather("Name","Value",-clust) %>% 
  ggplot(aes(y=Value,x = clust,col=Name,group = Name)) +
  geom_point()+
  geom_line()+
  facet_wrap(~Name,scales = "free_y")+
  scale_color_brewer(palette = "Pastel1")+
  theme_dark()+
  labs(x="Cluster",col = "Attributes",
       title = "Attributes for each cluster")

The attribute values for both the number of songs and the popularity scores are highest in cluster 1 and lowest in cluster 2.
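If you prefer to see the same per-cluster means as a table rather than a plot, the summarised data frame can be passed to knitr::kable() (one of the packages loaded at the start), for example:

musicgg %>% 
  group_by(clust) %>% 
  summarise(Popularity = mean(popularity),
            Songs = mean(songs)) %>% 
  kable(digits = 3, caption = "Mean attributes per cluster")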

Visualization 6 - Genre distribution per cluster

Let’s create a new data frame to plot the genre distribution per cluster.

musicgg2 <- data.frame(clust = factor(km_cl),
                      musicnorm$popularity,
                      music$genre)

Check the variables.

str(musicgg2)
## 'data.frame':    3242 obs. of  3 variables:
##  $ clust               : Factor w/ 3 levels "1","2","3": 3 3 3 2 2 2 2 2 3 2 ...
##  $ musicnorm.popularity: num  0.000395 0.000395 0.000659 0.009881 0 ...
##  $ music.genre         : Factor w/ 6 levels "Funk Carioca",..: 4 4 4 4 4 4 4 4 4 4 ...

Just like before, rename the variables for consistency.

names(musicgg2) <-c("cluster","popularity","genre")

Recheck variable names.

str(musicgg2)
## 'data.frame':    3242 obs. of  3 variables:
##  $ cluster   : Factor w/ 3 levels "1","2","3": 3 3 3 2 2 2 2 2 3 2 ...
##  $ popularity: num  0.000395 0.000395 0.000659 0.009881 0 ...
##  $ genre     : Factor w/ 6 levels "Funk Carioca",..: 4 4 4 4 4 4 4 4 4 4 ...

This looks good.

Now, let’s plot the genre distribution per cluster.

ggplot(data = musicgg2, aes(y=cluster)) +
  geom_bar(aes(fill=genre)) +
  theme (plot.title = element_text(hjust = 0.5))

Looking at the genre distribution on this plot, Pop appears to be the most recurrent genre in cluster # 2 (the largest cluster on this plot), followed by Sertanejo and Rock. However, the distribution is somewhat different in cluster # 3 (the second largest cluster on this plot), where Rock holds the largest share, followed by Pop and Sertanejo.
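The visual impression can be checked numerically with a simple contingency table of cluster by genre:

table(musicgg2$cluster, musicgg2$genre)  # counts of each genre within each cluster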

It is important to note that the numbers of cases for Rock (n = 797) and Pop (n = 796) are almost identical; a single additional case should not make much of a statistical difference. With that in mind, it would be interesting to perform additional analyses to determine why Pop was the most recurrent genre in one cluster and Rock in another. Clustering is, after all, an unsupervised learning method that separates data points based on their differences as much as it groups them based on their similarities. These results suggest that there may be intrinsic attribute differences between the Rock and Pop genres, despite their proximity in terms of number of songs and popularity scores.

We have reached the end of this workshop. Thank you for participating!