Introduction

This article aims to evaluate possible recommendations of potential popular songs on Spotify, using association rules in unsupervised machine learning. The goal was to answer my research question “What are the most common song features that co-occur in popular tracks on Spotify, and how can this information be used to identify potential hit songs?” using association rules.

As a music lover and data science enthusiast, I’ve always been fascinated by the factors that make a song successful. I have been using Spotify every day for most of the day for the last couple of years, so I also have personal intuition and experience in this area, which made it easier to verify my results with reality. With the rise of streaming platforms like Spotify, there’s now an abundance of data available on the features of popular songs. For my project, I decided to use the Association Rules Method to explore this data and identify the most common song features that co-occur in popular tracks on Spotify. My research question was simple: What makes a hit song? By analyzing the data and generating association rules, I was able to uncover some interesting insights into the relationships between different song features and their impact on popularity. In this report, I’ll share my findings and discuss how this information can be used to identify potential hit songs.

Data

For my project, I used the “tracks.csv” file from the Spotify Dataset 1921-2020 available on Kaggle, which contains data on over 600,000 tracks released between 1921 and 2020. This dataset includes a wide range of information on each track, such as its name, artist, release date, popularity, and various audio features. I focused on analyzing the audio features of the tracks, such as danceability, energy, loudness, and valence, which were calculated using Spotify’s audio analysis algorithm. Overall, the dataset provided a rich source of information for exploring the factors that contribute to the success of a song.

Association Rules methodology

Association rule mining is a data mining technique used to identify relationships and patterns between different variables in a dataset. The method works by identifying frequent item sets, or sets of items that frequently occur together in the data, and then generating association rules that describe the relationships between these item sets. These rules can then be used to make predictions or recommendations based on the patterns that have been identified.

Strengths:

Can uncover hidden relationships in data that may not be apparent through traditional statistical methods (Kotsiantis & Kanellopoulos, 2006).
Can handle large datasets with many variables (like in my project) (Toivonen, 1996).
Can be applied to a wide range of applications, such as market basket analysis, medical diagnosis, and bioinformatics (Kotsiantis & Kanellopoulos, 2006).
Provides intuitive and easy-to-understand results in the form of association rules (Toivonen, 1996).

Weaknesses:

The method can generate a large number of association rules, some of which may be spurious or irrelevant (Kotsiantis & Kanellopoulos, 2006).
The quality of the results depends heavily on the quality of the input data (Kotsiantis & Kanellopoulos, 2006) That is why I tried to clean my data well.
The method may not work well with datasets that contain rare items or infrequent patterns (Toivonen, 1996).
The method is computationally expensive and may require significant computational resources (Kotsiantis & Kanellopoulos, 2006). The computation time for my project was rather long.

Analysis

Loading packages

To do the analysis, the following packages will be needed.

# Install the package if not yet installed
packages <- c("tidyverse", "readr", "corrplot", "ggplot2", "GGally", 
              "patchwork", "tibble", "patchwork", "gridExtra", "RColorBrewer", 
              "factoextra", "labdsv", "maptools", "psych", "ClusterR", 
              "readxl", "cluster", "flexclust", "fpc", "clustertend", 
              "ggthemes", "plotly", "stringr", "missMDA", "ade4", "smacof", 
              "Rtsne", "psy", "scales", "kableExtra", "pdp", "jpeg", "raster")


# Load packages if not already loaded
for (package in packages) {
         if (!require(package, character.only = TRUE)) {
                  install.packages(package)
                  library(package, character.only = TRUE)
         }
}

Data input

The first step is to input a “tracks.csv” file. Afterwards, I printed the column names, the first six rows of my database and the str(df) function, to see the structure of my dataset.

# Read in the artists and tracks datasets
df <- read.csv("/Users/annaczarnocka/Desktop/UL3/tracks.csv", stringsAsFactors = FALSE)

colnames(df)

##  [1] "id"               "name"             "popularity"       "duration_ms"     
##  [5] "explicit"         "artists"          "id_artists"       "release_date"    
##  [9] "danceability"     "energy"           "key"              "loudness"        
## [13] "mode"             "speechiness"      "acousticness"     "instrumentalness"
## [17] "liveness"         "valence"          "tempo"            "time_signature"

head(df)

##                       id                                name popularity
## 1 35iwgR4jXetI318WEWsa1Q                               Carve          6
## 2 021ht4sdgPcrDgSk7JTbKY Capítulo 2.16 - Banquero Anarquista          0
## 3 07A5yehtSnoedViJAZkNnc  Vivo para Quererte - Remasterizado          0
## 4 08FmqUhxtyLTn6pAh6bk45       El Prisionero - Remasterizado          0
## 5 08y9GfoqCWfOGsKdwojr5e                 Lady of the Evening          0
## 6 0BRXJHRNGQ3W4v9frnSfhu                           Ave Maria          0
##   duration_ms explicit             artists                 id_artists
## 1      126903        0             ['Uli'] ['45tIt06XoI0Iio4LBEVpls']
## 2       98200        0 ['Fernando Pessoa'] ['14jtPCOoNZwquk5wd9DxrY']
## 3      181640        0 ['Ignacio Corsini'] ['5LiOoJbxVSAMkBS2fUm3X2']
## 4      176907        0 ['Ignacio Corsini'] ['5LiOoJbxVSAMkBS2fUm3X2']
## 5      163080        0     ['Dick Haymes'] ['3BiJGZsyX9sJchTqcSA7Su']
## 6      178933        0     ['Dick Haymes'] ['3BiJGZsyX9sJchTqcSA7Su']
##   release_date danceability energy key loudness mode speechiness acousticness
## 1   1922-02-22        0.645 0.4450   0  -13.338    1      0.4510        0.674
## 2   1922-06-01        0.695 0.2630   0  -22.136    1      0.9570        0.797
## 3   1922-03-21        0.434 0.1770   1  -21.180    1      0.0512        0.994
## 4   1922-03-21        0.321 0.0946   7  -27.961    1      0.0504        0.995
## 5         1922        0.402 0.1580   3  -16.900    0      0.0390        0.989
## 6         1922        0.227 0.2610   5  -12.343    1      0.0382        0.994
##   instrumentalness liveness valence   tempo time_signature
## 1           0.7440   0.1510  0.1270 104.851              3
## 2           0.0000   0.1480  0.6550 102.009              1
## 3           0.0218   0.2120  0.4570 130.418              5
## 4           0.9180   0.1040  0.3970 169.980              3
## 5           0.1300   0.3110  0.1960 103.220              4
## 6           0.2470   0.0977  0.0539 118.891              4

str(df)

## 'data.frame':    586672 obs. of  20 variables:
##  $ id              : chr  "35iwgR4jXetI318WEWsa1Q" "021ht4sdgPcrDgSk7JTbKY" "07A5yehtSnoedViJAZkNnc" "08FmqUhxtyLTn6pAh6bk45" ...
##  $ name            : chr  "Carve" "Capítulo 2.16 - Banquero Anarquista" "Vivo para Quererte - Remasterizado" "El Prisionero - Remasterizado" ...
##  $ popularity      : int  6 0 0 0 0 0 0 0 0 0 ...
##  $ duration_ms     : int  126903 98200 181640 176907 163080 178933 134467 161427 310073 181173 ...
##  $ explicit        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ artists         : chr  "['Uli']" "['Fernando Pessoa']" "['Ignacio Corsini']" "['Ignacio Corsini']" ...
##  $ id_artists      : chr  "['45tIt06XoI0Iio4LBEVpls']" "['14jtPCOoNZwquk5wd9DxrY']" "['5LiOoJbxVSAMkBS2fUm3X2']" "['5LiOoJbxVSAMkBS2fUm3X2']" ...
##  $ release_date    : chr  "1922-02-22" "1922-06-01" "1922-03-21" "1922-03-21" ...
##  $ danceability    : num  0.645 0.695 0.434 0.321 0.402 0.227 0.51 0.563 0.488 0.548 ...
##  $ energy          : num  0.445 0.263 0.177 0.0946 0.158 0.261 0.355 0.184 0.475 0.0391 ...
##  $ key             : int  0 0 1 7 3 5 4 4 0 6 ...
##  $ loudness        : num  -13.3 -22.1 -21.2 -28 -16.9 ...
##  $ mode            : int  1 1 1 1 0 1 1 1 0 1 ...
##  $ speechiness     : num  0.451 0.957 0.0512 0.0504 0.039 0.0382 0.124 0.0512 0.0399 0.153 ...
##  $ acousticness    : num  0.674 0.797 0.994 0.995 0.989 0.994 0.965 0.993 0.62 0.996 ...
##  $ instrumentalness: num  7.44e-01 0.00 2.18e-02 9.18e-01 1.30e-01 2.47e-01 0.00 1.55e-05 6.45e-03 9.33e-01 ...
##  $ liveness        : num  0.151 0.148 0.212 0.104 0.311 0.0977 0.155 0.325 0.107 0.148 ...
##  $ valence         : num  0.127 0.655 0.457 0.397 0.196 0.0539 0.727 0.654 0.544 0.612 ...
##  $ tempo           : num  105 102 130 170 103 ...
##  $ time_signature  : int  3 1 5 3 4 4 5 3 4 3 ...

Data cleaning

Choice of variables

First, I removed any duplicate entries based on the song name, ensuring that each song only appears once in the data frame. I then printed the first few rows of the modified data frame to confirm that the duplicates were removed. This was useful because association rule mining algorithms are designed to work with a set of unique items or transactions. Duplicate entries in the dataset can skew the results of the association rule mining process, leading to inaccurate or misleading association rules. In the context of my project, removing duplicate entries based on the song name ensures that each song is only considered once in the analysis. This helps to identify the most common song features that co-occur in popular tracks on Spotify more accurately and avoids over-representing certain songs or features.

Next, I removed columns from the data frame that were not necessary for my analysis, keeping only the columns containing information on danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, and tempo.

The list of variables before change:

name: name of the track
popularity: a measure of how popular the track is, ranging from 0 to 100
duration_ms: duration of the track in milliseconds
explicit: a binary indicator (0 or 1) that indicates whether the lyrics contain explicit language
artists: name(s) of the artist(s) who created the track
id_artists: unique identifier(s) of the artist(s) who created the track
release_date: date when the track was released
danceability: a measure of how suitable a track is for dancing, ranging from 0 to 1
energy: a measure of how energetic the track is, ranging from 0 to 1
key: the key of the track (0 = C, 1 = C#, 2 = D, and so on)
loudness: the overall loudness of the track in decibels (dB)
mode: the modality of the track (major = 1, minor = 0)
speechiness: a measure of how much spoken word is in the track, ranging from 0 to 1
acousticness: a measure of how acoustic the track is, ranging from 0 to 1
instrumentalness: a measure of how instrumental the track is, ranging from 0 to 1
liveness: a measure of how live the track is, ranging from 0 to 1
valence: a measure of the positivity of the track, ranging from 0 to 1
tempo: the tempo of the track in beats per minute (BPM)
time_signature: the time signature of the track, which determines the number of beats in each bar.

In particular, variables like danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence and tempo are measures of the musical characteristics of the tracks, and could be used to identify patterns in the song features that tend to co-occur in popular tracks.

Release date could also be used to investigate how the features of popular tracks have evolved over time. However, I chose not to include this variable, because it would have to be in a date format, and it was more accurate to work on only numeric data types.

I excluded these variables because they are either irrelevant or not informative for my research question:

name and artists are simply the names of the track and the artist(s) who created it. These variables do not provide any useful information for the analysis of song features and their potential co-occurrence.
popularity, duration_ms, explicit, key, mode, and time_signature are characteristics of the tracks that are not related to the audio features, which are the focus of my research question.
id_artists is a unique identifier for the artist(s) who created the track and is not relevant to my research question either.

Therefore, it is best to focus on the variables related to audio features such as danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, and tempo for my association rules analysis.

Deleting missing values

I then checked for any missing values in the data frame using the colSums(is.na()) function, and removed any rows that contained missing values using the complete.cases() function.

Finally, I viewed the cleaned data using the head() function to inspect the first few rows, and used the str() function to check the structure of the data frame. I also used the summary() function to obtain summary statistics for each column in the data frame.

# Remove duplicates based on the song name
df <- df[!duplicated(df$name),]
# Print the first few rows of the modified data frame
head(df)

##                       id                                name popularity
## 1 35iwgR4jXetI318WEWsa1Q                               Carve          6
## 2 021ht4sdgPcrDgSk7JTbKY Capítulo 2.16 - Banquero Anarquista          0
## 3 07A5yehtSnoedViJAZkNnc  Vivo para Quererte - Remasterizado          0
## 4 08FmqUhxtyLTn6pAh6bk45       El Prisionero - Remasterizado          0
## 5 08y9GfoqCWfOGsKdwojr5e                 Lady of the Evening          0
## 6 0BRXJHRNGQ3W4v9frnSfhu                           Ave Maria          0
##   duration_ms explicit             artists                 id_artists
## 1      126903        0             ['Uli'] ['45tIt06XoI0Iio4LBEVpls']
## 2       98200        0 ['Fernando Pessoa'] ['14jtPCOoNZwquk5wd9DxrY']
## 3      181640        0 ['Ignacio Corsini'] ['5LiOoJbxVSAMkBS2fUm3X2']
## 4      176907        0 ['Ignacio Corsini'] ['5LiOoJbxVSAMkBS2fUm3X2']
## 5      163080        0     ['Dick Haymes'] ['3BiJGZsyX9sJchTqcSA7Su']
## 6      178933        0     ['Dick Haymes'] ['3BiJGZsyX9sJchTqcSA7Su']
##   release_date danceability energy key loudness mode speechiness acousticness
## 1   1922-02-22        0.645 0.4450   0  -13.338    1      0.4510        0.674
## 2   1922-06-01        0.695 0.2630   0  -22.136    1      0.9570        0.797
## 3   1922-03-21        0.434 0.1770   1  -21.180    1      0.0512        0.994
## 4   1922-03-21        0.321 0.0946   7  -27.961    1      0.0504        0.995
## 5         1922        0.402 0.1580   3  -16.900    0      0.0390        0.989
## 6         1922        0.227 0.2610   5  -12.343    1      0.0382        0.994
##   instrumentalness liveness valence   tempo time_signature
## 1           0.7440   0.1510  0.1270 104.851              3
## 2           0.0000   0.1480  0.6550 102.009              1
## 3           0.0218   0.2120  0.4570 130.418              5
## 4           0.9180   0.1040  0.3970 169.980              3
## 5           0.1300   0.3110  0.1960 103.220              4
## 6           0.2470   0.0977  0.0539 118.891              4

# Remove unnecessary columns
df <- subset(df, select = c(danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo))

# Check for missing values
colSums(is.na(df))

##     danceability           energy         loudness      speechiness 
##                0                0                0                0 
##     acousticness instrumentalness         liveness          valence 
##                0                0                0                0 
##            tempo 
##                0

df <- df[complete.cases(df), ]

# View the cleaned data
head(df)

##   danceability energy loudness speechiness acousticness instrumentalness
## 1        0.645 0.4450  -13.338      0.4510        0.674           0.7440
## 2        0.695 0.2630  -22.136      0.9570        0.797           0.0000
## 3        0.434 0.1770  -21.180      0.0512        0.994           0.0218
## 4        0.321 0.0946  -27.961      0.0504        0.995           0.9180
## 5        0.402 0.1580  -16.900      0.0390        0.989           0.1300
## 6        0.227 0.2610  -12.343      0.0382        0.994           0.2470
##   liveness valence   tempo
## 1   0.1510  0.1270 104.851
## 2   0.1480  0.6550 102.009
## 3   0.2120  0.4570 130.418
## 4   0.1040  0.3970 169.980
## 5   0.3110  0.1960 103.220
## 6   0.0977  0.0539 118.891

str(df)

## 'data.frame':    446475 obs. of  9 variables:
##  $ danceability    : num  0.645 0.695 0.434 0.321 0.402 0.227 0.51 0.563 0.488 0.548 ...
##  $ energy          : num  0.445 0.263 0.177 0.0946 0.158 0.261 0.355 0.184 0.475 0.0391 ...
##  $ loudness        : num  -13.3 -22.1 -21.2 -28 -16.9 ...
##  $ speechiness     : num  0.451 0.957 0.0512 0.0504 0.039 0.0382 0.124 0.0512 0.0399 0.153 ...
##  $ acousticness    : num  0.674 0.797 0.994 0.995 0.989 0.994 0.965 0.993 0.62 0.996 ...
##  $ instrumentalness: num  7.44e-01 0.00 2.18e-02 9.18e-01 1.30e-01 2.47e-01 0.00 1.55e-05 6.45e-03 9.33e-01 ...
##  $ liveness        : num  0.151 0.148 0.212 0.104 0.311 0.0977 0.155 0.325 0.107 0.148 ...
##  $ valence         : num  0.127 0.655 0.457 0.397 0.196 0.0539 0.727 0.654 0.544 0.612 ...
##  $ tempo           : num  105 102 130 170 103 ...

summary(df)

##   danceability        energy          loudness        speechiness    
##  Min.   :0.0000   Min.   :0.0000   Min.   :-60.000   Min.   :0.0000  
##  1st Qu.:0.4550   1st Qu.:0.3450   1st Qu.:-12.990   1st Qu.:0.0344  
##  Median :0.5800   Median :0.5500   Median : -9.317   Median :0.0455  
##  Mean   :0.5653   Mean   :0.5431   Mean   :-10.300   Mean   :0.1171  
##  3rd Qu.:0.6880   3rd Qu.:0.7490   3rd Qu.: -6.552   3rd Qu.:0.0828  
##  Max.   :0.9910   Max.   :1.0000   Max.   :  5.376   Max.   :0.9710  
##   acousticness    instrumentalness       liveness         valence     
##  Min.   :0.0000   Min.   :0.0000000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.0992   1st Qu.:0.0000000   1st Qu.:0.0990   1st Qu.:0.352  
##  Median :0.4270   Median :0.0000214   Median :0.1420   Median :0.570  
##  Mean   :0.4517   Mean   :0.1126884   Mean   :0.2196   Mean   :0.556  
##  3rd Qu.:0.7850   3rd Qu.:0.0091250   3rd Qu.:0.2870   3rd Qu.:0.772  
##  Max.   :0.9960   Max.   :1.0000000   Max.   :1.0000   Max.   :1.000  
##      tempo       
##  Min.   :  0.00  
##  1st Qu.: 95.49  
##  Median :117.42  
##  Mean   :118.40  
##  3rd Qu.:136.37  
##  Max.   :243.51

Correlation matrix

The plot of correlation matrix shows a color-coded correlation matrix, with darker shades of blue indicating stronger positive correlations and lighter shades indicating weaker or negative correlations. The diagonal line is always colored in dark blue, as it represents the correlation of each variable with itself, which is always perfect (1.0).

It can be seen that energy and loudness are strongly or moderately positively correlated, suggesting that they may be measuring similar aspects of music. However, they also show some level of independence from the other variables. Therefore, it may be beneficial to include these variables in your analysis, as they may provide additional information that is not highly correlated with the other variables you have selected.

Similarly, danceability and valence are positively correlated, indicating that they may capture similar aspects of musical enjoyment. However, their correlation is below 0.7, so it is not a strong correlation.

Each of the variables may have a certain level of association or correlation with one another, and including all of them in analysis may provide a more complete picture of the relationships between these variables.

library(RColorBrewer)
library(corrplot)

## corrplot 0.92 loaded

# correlation matrix
cor.matrix <- cor(df[, 1:9], method = "pearson")
mypal <- brewer.pal(9, "Blues")
corrplot(cor.matrix, type = "lower", order = "alphabet", tl.col = "black", tl.cex = 1, col = mypal)

# Compute the correlation matrix
cor.matrix <- cor(df[, 1:9], method = "pearson")
# Round the correlation matrix to one decimal place
cor.matrix <- round(cor.matrix, 1)
# Print the correlation matrix
print(cor.matrix)

##                  danceability energy loudness speechiness acousticness
## danceability              1.0    0.2      0.2         0.2         -0.2
## energy                    0.2    1.0      0.8        -0.1         -0.7
## loudness                  0.2    0.8      1.0        -0.2         -0.5
## speechiness               0.2   -0.1     -0.2         1.0          0.1
## acousticness             -0.2   -0.7     -0.5         0.1          1.0
## instrumentalness         -0.2   -0.2     -0.3        -0.1          0.2
## liveness                 -0.1    0.1      0.0         0.2          0.0
## valence                   0.5    0.4      0.3         0.0         -0.2
## tempo                     0.0    0.2      0.2        -0.1         -0.2
##                  instrumentalness liveness valence tempo
## danceability                 -0.2     -0.1     0.5   0.0
## energy                       -0.2      0.1     0.4   0.2
## loudness                     -0.3      0.0     0.3   0.2
## speechiness                  -0.1      0.2     0.0  -0.1
## acousticness                  0.2      0.0    -0.2  -0.2
## instrumentalness              1.0      0.0    -0.2   0.0
## liveness                      0.0      1.0     0.0   0.0
## valence                      -0.2      0.0     1.0   0.1
## tempo                         0.0      0.0     0.1   1.0

Outliers

First, I calculated z-scores for each variable in the dataset to help me identify potential outliers. I added the z-scores to the dataframe and then looked for any z-scores that were greater than 3 or less than -3. I found that there were a total of 446472 potential outliers in the dataset, which are all the observations, so unfrotunately it wasn’t useful for detecting actual outliers. Looking at the boxplots, tempo has the most outliers, however, I consider tempo as a highly varying variable, and even the most extreme values may provide useful insight to my analysis, as music tempo can be extraordinary high or very low, and it does not imply an error in observation, but it is simply a unique piece of music.

To further investigate, I used a function to calculate the number of outliers in each column of the data frame. This allowed me to see which variables had the most outliers.

Finally, I used another function to find the outliers in the dataset. This function used the interquartile range (IQR) to identify potential outliers. I created a variable containing the outliers and checked its dimensions and head to confirm that the function had worked properly.

Overall, the process of identifying and investigating outliers in my dataset allowed me to better understand the distribution of the data and to ensure that any extreme values were appropriately handled in my analysis. However, the outliers may be simply due to the high differences in results and characterstics of songs, which can indeed be very different from one another, and still providing well measured data, so I decided not to delete the outliers.

# Calculate z-scores for each variable
df_z <- apply(df[, 1:9], 2, function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))
# Add z-scores to the dataframe
df[, paste0(names(df)[1:9], "_z")] <- df_z
# Identify potential outliers (z-score > 3 or < -3)
outliers <- apply(df[, 1:9], 1, function(x) any(abs(x) > 3))
# Print number of outliers
cat("Number of potential outliers:", sum(outliers))

## Number of potential outliers: 446472

# Calculate potential outliers
outliers <- boxplot(df)$out

# Calculate outliers for each column
num_outliers_all <- function(df) {
         num_vars <- sapply(df, is.numeric)
         outlier_counts <- sapply(df[num_vars], function(x) {
                  length(boxplot(x, plot = FALSE)$out)
         })
         names(outlier_counts) <- names(df)[num_vars]
         outlier_counts
}
num_outliers_all(df)

##       danceability             energy           loudness        speechiness 
##               1573                  0              11701              66118 
##       acousticness   instrumentalness           liveness            valence 
##                  0              97918              31888                  0 
##              tempo     danceability_z           energy_z         loudness_z 
##               4264               1573                  0              11701 
##      speechiness_z     acousticness_z instrumentalness_z         liveness_z 
##              66118                  0              97918              31888 
##          valence_z            tempo_z 
##                  0               4264

# Run the function to find outliers
# Find outliers in each variable of the data frame
find_outliers <- function(data) {
         outliers <- lapply(data, function(x) {
                  qnt <- quantile(x, probs=c(.25, .75), na.rm = TRUE)
                  H <- 1.5 * IQR(x, na.rm = TRUE)
                  x[x < (qnt[1] - H) | x > (qnt[2] + H)]
         })
         return(outliers)
}

# Generate a variable containing the outliers in the data frame
outliers <- find_outliers(df)

# Check the dimensions and head of the outliers variable
dim(outliers)

## NULL

summary(df)

##   danceability        energy          loudness        speechiness    
##  Min.   :0.0000   Min.   :0.0000   Min.   :-60.000   Min.   :0.0000  
##  1st Qu.:0.4550   1st Qu.:0.3450   1st Qu.:-12.990   1st Qu.:0.0344  
##  Median :0.5800   Median :0.5500   Median : -9.317   Median :0.0455  
##  Mean   :0.5653   Mean   :0.5431   Mean   :-10.300   Mean   :0.1171  
##  3rd Qu.:0.6880   3rd Qu.:0.7490   3rd Qu.: -6.552   3rd Qu.:0.0828  
##  Max.   :0.9910   Max.   :1.0000   Max.   :  5.376   Max.   :0.9710  
##   acousticness    instrumentalness       liveness         valence     
##  Min.   :0.0000   Min.   :0.0000000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.0992   1st Qu.:0.0000000   1st Qu.:0.0990   1st Qu.:0.352  
##  Median :0.4270   Median :0.0000214   Median :0.1420   Median :0.570  
##  Mean   :0.4517   Mean   :0.1126884   Mean   :0.2196   Mean   :0.556  
##  3rd Qu.:0.7850   3rd Qu.:0.0091250   3rd Qu.:0.2870   3rd Qu.:0.772  
##  Max.   :0.9960   Max.   :1.0000000   Max.   :1.0000   Max.   :1.000  
##      tempo        danceability_z        energy_z          loudness_z     
##  Min.   :  0.00   Min.   :-3.39488   Min.   :-2.15575   Min.   :-9.6893  
##  1st Qu.: 95.49   1st Qu.:-0.66236   1st Qu.:-0.78639   1st Qu.:-0.5245  
##  Median :117.42   Median : 0.08833   Median : 0.02729   Median : 0.1915  
##  Mean   :118.40   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.:136.37   3rd Qu.: 0.73693   3rd Qu.: 0.81715   3rd Qu.: 0.7306  
##  Max.   :243.51   Max.   : 2.55661   Max.   : 1.81341   Max.   : 3.0560  
##  speechiness_z     acousticness_z     instrumentalness_z   liveness_z     
##  Min.   :-0.5854   Min.   :-1.29615   Min.   :-0.4235    Min.   :-1.1579  
##  1st Qu.:-0.4133   1st Qu.:-1.01152   1st Qu.:-0.4235    1st Qu.:-0.6359  
##  Median :-0.3578   Median :-0.07096   Median :-0.4234    Median :-0.4092  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000    Mean   : 0.0000  
##  3rd Qu.:-0.1713   3rd Qu.: 0.95625   3rd Qu.:-0.3892    3rd Qu.: 0.3553  
##  Max.   : 4.2700   Max.   : 1.56168   Max.   : 3.3344    Max.   : 4.1147  
##    valence_z           tempo_z        
##  Min.   :-2.16309   Min.   :-3.97182  
##  1st Qu.:-0.79375   1st Qu.:-0.76864  
##  Median : 0.05431   Median :-0.03318  
##  Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.84013   3rd Qu.: 0.60273  
##  Max.   : 1.72709   Max.   : 4.19653

# find column names containing "_z"
z_colnames <- grep("_z", colnames(df), value = TRUE)

# select columns not containing "_z"
df <- df[, !colnames(df) %in% z_colnames]

str(df)

## 'data.frame':    446475 obs. of  9 variables:
##  $ danceability    : num  0.645 0.695 0.434 0.321 0.402 0.227 0.51 0.563 0.488 0.548 ...
##  $ energy          : num  0.445 0.263 0.177 0.0946 0.158 0.261 0.355 0.184 0.475 0.0391 ...
##  $ loudness        : num  -13.3 -22.1 -21.2 -28 -16.9 ...
##  $ speechiness     : num  0.451 0.957 0.0512 0.0504 0.039 0.0382 0.124 0.0512 0.0399 0.153 ...
##  $ acousticness    : num  0.674 0.797 0.994 0.995 0.989 0.994 0.965 0.993 0.62 0.996 ...
##  $ instrumentalness: num  7.44e-01 0.00 2.18e-02 9.18e-01 1.30e-01 2.47e-01 0.00 1.55e-05 6.45e-03 9.33e-01 ...
##  $ liveness        : num  0.151 0.148 0.212 0.104 0.311 0.0977 0.155 0.325 0.107 0.148 ...
##  $ valence         : num  0.127 0.655 0.457 0.397 0.196 0.0539 0.727 0.654 0.544 0.612 ...
##  $ tempo           : num  105 102 130 170 103 ...

Descriptive statistics

Histograms

The plot shows histograms for each variable in the dataset, which is useful for visualizing the distribution of my variables. From the histograms, we can see that some of the variables are normally distributed, such as “acousticness” and “instrumentalness”, while others have a skewed distribution, such as “energy” and “loudness”. In addition, we can also see that some of the variables have a large number of values close to zero, such as “speechiness” and “instrumentalness”. This indicates that these variables may be highly skewed and may require special handling or transformation before they can be used in analysis. However, it is because most of the songs have similar values of speechiness and intrumentalness, so I wouldn’t worry about this particular distributions, as they indicate the reallity accurately. The histograms suggest that different variables may require different data pre-processing techniques depending on their distribution. For this, I wil turn the variables into categorica, using firstly clustering method to find the optimal number of categories.

library(ggplot2)

ggplot(df, aes(x = danceability)) + 
  geom_histogram(binwidth = 0.05, color = "black", fill = "lightblue") + 
  labs(title = "Distribution of Danceability")

ggplot(df, aes(x = energy)) + 
  geom_histogram(binwidth = 0.05, color = "black", fill = "lightblue") + 
  labs(title = "Distribution of Energy")

ggplot(df, aes(x = loudness)) + 
  geom_histogram(binwidth = 0.5, color = "black", fill = "lightblue") + 
  labs(title = "Distribution of Loudness")

ggplot(df, aes(x = speechiness)) + 
  geom_histogram(binwidth = 0.05, color = "black", fill = "lightblue") + 
  labs(title = "Distribution of Speechiness")

ggplot(df, aes(x = acousticness)) + 
  geom_histogram(binwidth = 0.05, color = "black", fill = "lightblue") + 
  labs(title = "Distribution of Acousticness")

ggplot(df, aes(x = instrumentalness)) + 
  geom_histogram(binwidth = 0.05, color = "black", fill = "lightblue") + 
  labs(title = "Distribution of Instrumentalness")

ggplot(df, aes(x = liveness)) + 
  geom_histogram(binwidth = 0.05, color = "black", fill = "lightblue") + 
  labs(title = "Distribution of Liveness")

ggplot(df, aes(x = valence)) + 
  geom_histogram(binwidth = 0.05, color = "black", fill = "lightblue") + 
  labs(title = "Distribution of Valence")

ggplot(df, aes(x = tempo)) + 
  geom_histogram(binwidth = 10, color = "black", fill = "lightblue") + 
  labs(title = "Distribution of Tempo")

# Load required library
library(RColorBrewer)

# Define the color palette
mypal <- brewer.pal(9, "Blues")

# Subset the first 4 colors from the palette
mycols <- mypal[1:4]

# Create histograms for each variable
par(mfrow=c(3,3))
for (i in 1:9) {
  hist(df[,i], main = colnames(df)[i], xlab = "", col = mycols)
}

Variablity

The code for variablity calculates and prints the variances of the 9 variables in my dataframe.

The values indicate the degree of variability or spread in the data for each variable. For example, the variance of loudness is much higher than the variances of the other variables, which suggests that there is a wide range of loudness values in the data. In contrast, the variance of liveness is much lower than the variances of the other variables, which suggests that there is less variability in the liveness values in the data. Variables with low variance may not provide much information and may not be as useful in generating association rules. However, for better range of feature I decided to include all the 9 variables.

# Set numeric output format
options(scipen = 999)
# Calculate variances and print them without scientific notation
variances <- apply(df[,1:9], 2, var)
print(variances)

##     danceability           energy         loudness      speechiness 
##       0.02772659       0.06347506      26.31088561       0.03999337 
##     acousticness instrumentalness         liveness          valence 
##       0.12146366       0.07081435       0.03596980       0.06607852 
##            tempo 
##     888.69963461

Clustering

Hierarchical clustering

Converting the variables to categorical using binning can help to capture non-linear relationships between the variables and improve the accuracy of my association rules. To appropriately chose the number of categories, I performed clustering. To do this, I firstly standardized (centers and scales) all the variables in my dataframe ussing lappy() function. Centering means subtracting the mean of each variable from its values, so that each variable has a mean of zero. Scaling means dividing each variable by its standard deviation, so that each variable has a standard deviation of one. Standardizing variables can be useful for clustering algorithms, as it ensures that all variables are treated equally and have the same impact on the clustering result, regardless of their original scale.

The clustering was done using hierarchical clustering method with complete linkage criterion. I plotted two dendrograms with different numbers of clusters, i.e., k = 3 and k = 4, but I displayed them on one plot for better comparison. The green lines indicate 4 clusters and blue indicate 3 clusters.

The cutree() function is used to cut the other two plots at a specific number of clusters. The output of the plots suggests that the data can be grouped into either three or four distinct clusters based on their similarity. The “fviz_cluster” function is used to visualize the clustering results in two dimensions using the first two principal components of the data. The clusters are represented by different colors, and the plots suggest that the clusters identified by the algorithm are fairly distinct from each other. Both the dendrogram and the scatterplot matrix suggest that 3 or 4 clusters could be reasonable choices, but further analysis and comparison of different clustering solutions would be necessary to make a final determination. I used elbow method and gap statistic method to help me in choosing the optimal number of clusters.

library(factoextra)

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

# clustering

df <- as.data.frame(lapply(df, scale))

distance.m<-dist(t(df))
hc<-hclust(distance.m, method="complete") 
plot(hc, hang=-1)
rect.hclust(hc, k = 3, border='blue')

rect.hclust(hc, k = 4, border='green')

sub_grp<-cutree(hc, k=3) 
fviz_cluster(list(data = distance.m, cluster = sub_grp), palette=c("aquamarine4", "cyan3", "darkblue", "darkgreen", "brown" ))

sub_grp<-cutree(hc, k=4) 
fviz_cluster(list(data = distance.m, cluster = sub_grp), palette=c("aquamarine4", "cyan3", "darkblue", "darkgreen", "brown" ))

Elbow method

Here, I determined the optimal number of clusters for the dataset by using the elbow method. First, I selected 10% of the rows from the dataset randomly, because there was no computational memory to perform elbow method on the whole dataset. Then I scaled the numeric data. The elbow method was then applied by calculating the within-cluster sum of squares (WSS) for the number of clusters ranging from 1 to 10. Finally, the plot of the number of clusters versus WSS was displayed to visualize the “elbow” point, which indicates the optimal number of clusters. The elbow point is the point of inflection where the WSS starts to level off, suggesting that adding more clusters does not improve the model significantly. In this plot, the optimal number of clusters appears to be between 2 and 4, with a slight elbow at 3.

# finding optimal number of clusters
library(cluster)

# Randomly select 10% of the rows from the dataset 
df_sampled <- df[sample(nrow(df), nrow(df)*0.1), ]

# Scale the numeric data
df_scaled <- scale(df_sampled)

# Determine the optimal number of clusters using the elbow method
set.seed(123)
wss <- c()
for(i in 1:10){
         kmeans_fit <- kmeans(df_scaled, centers = i, nstart = 25)
         wss[i] <- kmeans_fit$tot.withinss
}

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2232350)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2232350)

## Warning: did not converge in 10 iterations

plot(1:10, wss, type = "b", xlab = "Number of clusters", ylab = "Within-cluster sum of squares")

Gap statistic method

The gap statistic method is a statistical approach for determining the optimal number of clusters in a dataset. It compares the total within-cluster variation for different values of k number of clusters with their expected values under null reference distribution of the data. If the gap statistic is large for a certain value of k, it suggests that k is a good number of clusters for the data.

The steps involved in the gap statistic method are as follows: 1. Clustering the original dataset for a range of k values. 2. For each k value, generating B reference datasets, each of the same size as the original dataset, by sampling from a uniform distribution over the range of the original dataset. 3. Clustering each reference dataset for the same range of k values as the original dataset. 4. Calculating the gap statistic for each value of k as the difference between the expected log of the within-cluster sum of squares for the reference data and the log of the within-cluster sum of squares for the original data. 5. Selecting the k value with the largest gap statistic as the optimal number of clusters.

The code for this method is displayed below, but it worked only in the console of .R file, and running it in .Rmd file used too much of computing memory and it could not give the outcome of the plot. This is why I displayed here only the image of the plot which was the outcome of the code below.

The graph suggests that the optimal number of clusters is 4, which is the same outcome as from the Elbow method, so I will use this number of clusters as optimal for converting my data to categorical variables, with 4 categories.

# Randomly select 1% of the rows from the dataset (grater part damaged R memory)
# df_sampled <- df[sample(nrow(df), nrow(df)*0.01), ]

# Determine the optimal number of clusters using the gap statistic method
# set.seed(123)
# gap <- clusGap(df_scaled, kmeans, K.max = 10, B = 50)
# Plot the results
# plot(gap, main = "Gap Statistic for KMeans Clustering", xlab = "Number of Clusters")

# Based on the plot, choose the number of clusters for which the gap is the largest: 4

library(raster)

## Loading required package: sp

image <- stack("/Users/annaczarnocka/Desktop/UL3/plots/gap_statistic_clustering.jpeg")

## Warning: [rast] unknown extent

plotRGB(image)

## Warning: [rast] unknown extent

## Warning: [rast] unknown extent

## Warning: [rast] unknown extent

Data Conversion

Using the cut() function, I changed my data from numeric to categorical, with 4 categories, as indicated by my clustering analysis. The output of this change is displayed below by the str() function.

I also created histograms of converted variables, to see their distribution:

The distribution of danceability is roughly bell-shaped with a peak at around 0.7, indicating that most of the songs have a high level of danceability.
The distribution of energy is slightly skewed to the left with a peak at around 0.6, suggesting that most of the songs have a moderate energy level.
The distribution of loudness is roughly bell-shaped with a peak at around 2, indicating that most of the songs have a moderate to high loudness level.
The distribution of speechiness is heavily skewed to the right with a long tail, suggesting that most of the songs have a low speechiness level, but there are a few songs with high speechiness.
The distribution of acousticness is heavily skewed to the right with a long tail, suggesting that most of the songs have a low acousticness level, but there are a few songs with high acousticness.
The distribution of instrumentalness is heavily skewed to the right with a long tail, suggesting that most of the songs have a low instrumentalness level, but there are a few songs with high instrumentalness.
The distribution of liveness is heavily skewed to the right with a long tail, suggesting that most of the songs have a low liveness level, but there are a few songs with high liveness.
The distribution of valence is roughly bell-shaped with a peak at around 0.5, indicating that most of the songs have a moderate valence level.
The distribution of tempo is roughly bell-shaped, indicating that most of the songs have a moderate tempo.

Overall, the distributions of the variables appear to be suitable for performing association rule analysis, as there is sufficient variation in the data to identify patterns and relationships between the variables.

# Convert numeric variables to categorical using binning, according to optimal number of clusters

df$danceability <- cut(df$danceability, breaks = 4)

df$energy <- cut(df$energy, breaks = 4)

df$loudness <- cut(df$loudness, breaks = 4)

df$speechiness <- cut(df$speechiness, breaks = 4)

df$acousticness <- cut(df$acousticness, breaks = 4)

df$instrumentalness <- cut(df$instrumentalness, breaks = 4)

df$liveness <- cut(df$liveness, breaks = 4)

df$valence <- cut(df$valence, breaks = 4)

df$tempo <- cut(df$tempo, breaks = 4)

str(df)

## 'data.frame':    446475 obs. of  9 variables:
##  $ danceability    : Factor w/ 4 levels "(-3.4,-1.91]",..: 3 3 2 2 2 1 3 3 2 3 ...
##  $ energy          : Factor w/ 4 levels "(-2.16,-1.16]",..: 2 2 1 1 1 2 2 1 2 1 ...
##  $ loudness        : Factor w/ 4 levels "(-9.7,-6.5]",..: 3 3 3 2 3 3 3 3 3 3 ...
##  $ speechiness     : Factor w/ 4 levels "(-0.59,0.628]",..: 2 4 1 1 1 1 1 1 1 1 ...
##  $ acousticness    : Factor w/ 4 levels "(-1.3,-0.582]",..: 3 4 4 4 4 4 4 4 3 4 ...
##  $ instrumentalness: Factor w/ 4 levels "(-0.427,0.516]",..: 3 1 1 4 1 1 1 1 1 4 ...
##  $ liveness        : Factor w/ 4 levels "(-1.16,0.16]",..: 1 1 1 1 2 1 1 2 1 1 ...
##  $ valence         : Factor w/ 4 levels "(-2.17,-1.19]",..: 1 3 2 2 1 1 3 3 3 3 ...
##  $ tempo           : Factor w/ 4 levels "(-3.98,-1.93]",..: 2 2 3 3 2 2 2 3 3 2 ...

library(ggplot2)

mypal <- brewer.pal(9, "Blues")

ggplot(df, aes(x = danceability)) +
         geom_bar(fill = "lightblue") +
         labs(title = "Distribution of Danceability")

ggplot(df, aes(x = energy)) +
         geom_bar(fill = "lightblue") +
         labs(title = "Distribution of Energy")

ggplot(df, aes(x = loudness)) +
         geom_bar(fill = "lightblue") +
         labs(title = "Distribution of Loudness")

ggplot(df, aes(x = speechiness)) +
         geom_bar(fill = "lightblue") +
         labs(title = "Distribution of Speechiness")

ggplot(df, aes(x = acousticness)) +
         geom_bar(fill = "lightblue") +
         labs(title = "Distribution of Acousticness")

ggplot(df, aes(x = instrumentalness)) +
         geom_bar(fill = "lightblue") +
         labs(title = "Distribution of Instrumentalness")

ggplot(df, aes(x = liveness)) +
         geom_bar(fill = "lightblue") +
         labs(title = "Distribution of Liveness")

ggplot(df, aes(x = valence)) +
         geom_bar(fill = "lightblue") +
         labs(title = "Distribution of Valence")

ggplot(df, aes(x = tempo)) +
         geom_bar(fill = "lightblue") +
         labs(title = "Distribution of Tempo")

Association rules

Installing and loading required packages

########### ASSOCIATION RULES #####################################################


# Install the package if not yet installed
packages <- c("tidyverse", "readr", "arules", "arulesViz", "gridExtra", "RColorBrewer")

# Load packages if not already loaded
for (package in packages) {
         if (!require(package, character.only = TRUE)) {
                  install.packages(package)
                  library(package, character.only = TRUE)
         }
}

Apriori algorithm

Converting the data frame into a transaction object The trans <- as(df, “transactions”) line of code is used to convert the data frame df into a transaction object, which is a specific data format needed for association rule mining using the apriori algorithm. In order to identify potential hit songs on Spotify, a transaction object was created where each row represents a single track and contains a set of song features (e.g., danceability, energy, and tempo) that were present in that track. This allows the association rules algorithm to analyze the co-occurrence of these song features and identify the most common patterns that are associated with popular tracks on Spotify. This format is used to find frequent itemsets (here sets of song features that frequently appear together) and generate association rules, so the rules that describe the relationships between the items in the frequent itemsets.
Generating association rules Then I generated association rules from frequent itemsets using the Apriori algorithm. The ruleParameters variable is used to set the support, confidence, and maximum length of the rules.

Support (0.05) refers to the frequency of occurrence of a set of items in the entire dataset. In this case, it means that a set of items must appear in at least 5% of all transactions to be considered frequent and included in the rule generation process.
Confidence (0.5) refers to the likelihood of the rule being true. It is the conditional probability that an item B appears in a transaction given that item A appears in the same transaction. In this case, it means that a rule is considered significant if it occurs in at least 50% of the transactions in which the antecedent itemset appears.
Maximum length (2) refers to the maximum number of items that can appear in a rule (the length of the rule). In this case, it means that the algorithm will only consider rules with up to 2 items (or antecedents) and one consequent item.

The apriori function was applied to the trans transaction object with these parameters. The resulting associationRules object contains the generated association rules.

library(arules)

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

# Convert the data frame to a transaction object
trans <- as(df, "transactions")

# generating the association rules from frequent itemsets
ruleParameters <- list(supp = 0.05, conf = 0.5, maxlen = 2)
associationRules <- apriori(trans, parameter = ruleParameters, control = list(verbose = FALSE))

Summary of rules

The summary provides information about the rules generated from the apriori algorithm with the given parameters.

The set contains 138 rules, where 6 rules have a length of 1 (i.e., only one item in the rule) and 132 rules have a length of 2 (i.e., two items in the rule).
The support represents the proportion of transactions that contain the items in the rule. The values range from 0.05238 to 0.89530, with an average of 0.26955.
The confidence represents the proportion of transactions containing the items in the rule - that also contain the consequent item(s). The values range from 0.5039 to 0.9793, with an average of 0.7165.
The lift represents the ratio of observed support to expected support if the items in the rule were independent. The values range from 0.7882 to 2.9181, with an average of 1.0983. All the values are positive, so there is positive dependency for all association rules.
The count represents the number of transactions that contain the items in the rule. The values range from 23385 to 399728, with an average of 120346.
The mining info section provides information about the dataset used, including the number of transactions and the support and confidence parameters used in the algorithm.

Scatter plot of rules

The plot shows the relationships between the support, confidence, and lift measures for the association rules generated by the apriori algorithm. Each point on the plot represents a rule, and the size of the point indicates the number of transactions that support the rule. The color of the point indicates the level of lift, with blue indicating low lift and red indicating high lift.

From the plot, you can see that the majority of the rules have relatively low support and confidence values, indicating that they are not very strong rules. However, there are a few rules with high support and confidence, which are represented by larger points on the plot. Additionally, there are a few rules with high lift, which are represented by red points on the plot. These rules may be particularly interesting, as they suggest that certain itemsets are more strongly associated with each other than would be expected by chance.

library(arulesViz)

# summarise and plot associationRules
summary(associationRules)

## set of 138 rules
## 
## rule length distribution (lhs + rhs):sizes
##   1   2 
##   6 132 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   1.957   2.000   2.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.05238   Min.   :0.5039   Min.   :0.08503   Min.   :0.7882  
##  1st Qu.:0.14043   1st Qu.:0.5860   1st Qu.:0.20905   1st Qu.:0.9849  
##  Median :0.20743   Median :0.6997   Median :0.28433   Median :1.0246  
##  Mean   :0.26955   Mean   :0.7165   Mean   :0.38196   Mean   :1.0983  
##  3rd Qu.:0.34766   3rd Qu.:0.8578   3rd Qu.:0.54042   3rd Qu.:1.0933  
##  Max.   :0.89530   Max.   :0.9793   Max.   :1.00000   Max.   :2.9181  
##      count       
##  Min.   : 23385  
##  1st Qu.: 62700  
##  Median : 92612  
##  Mean   :120346  
##  3rd Qu.:155223  
##  Max.   :399728  
## 
## mining info:
##   data ntransactions support confidence
##  trans        446475    0.05        0.5
##                                                                                call
##  apriori(data = trans, parameter = ruleParameters, control = list(verbose = FALSE))

plot(associationRules, col="blue")

Graph of rules

This plot shows a visual representation of the association rules in a graph format. Each node in the graph represents an item or set of items, and the size of the node indicates its support. The thickness of the edges between the nodes represents the strength of the association, with thicker edges indicating higher confidence and/or lift values. The shading of the edges indicates the lift value, with lighter shading indicating higher lift values. Overall, this plot allows to easily identify which items are frequently purchased together and which associations are the strongest.

By looking at the plot, we can see that there are several groups of items that are strongly associated with each other, as they appear in clusters that are closer together. The size of the nodes indicates the frequency of the itemsets in the data, and the thickness of the edges represents the strength of the association between the items. From the plot, it can be inferred that the rules can be applied to a substantial amount of inputs, and there are some strong associations between certain groups of items. From this plot it is hard to see which groups have the the strongest associations.

# plot associationRules
mypal <- brewer.pal(9, "Blues")
set.seed(123)
plot(associationRules, method = "graph", measure = "support", shading = "lift", main = "Association Rules Graph", 
     control = list(type = "items", border = "black", col = mypal))

## Warning: Unknown control parameters: type, border, main

## Available control parameters (with default values):
## layout    =  stress
## circular  =  FALSE
## ggraphdots    =  NULL
## edges     =  <environment>
## nodes     =  <environment>
## nodetext  =  <environment>
## colors    =  c("#EE0000FF", "#EEEEEEFF")
## engine    =  ggplot2
## max   =  100
## verbose   =  FALSE

## Warning: Too many rules supplied. Only plotting the best 100 using
## 'lift' (change control parameter max if needed).

Matrix for rules

The matrix plot can help us to visually identify the most interesting and useful rules from the association rule analysis.

The plot has two axes: the horizontal axis represents the support of the rule, and the vertical axis represents the confidence of the rule. Each rule is represented by a square in the plot, with the size of the square indicating the lift of the rule. The color of the square indicates the value of the lift, with brighter colors indicating higher values of lift. For my plot, most rules has lift smaller than 2.

The plot helps us to identify the most interesting rules based on their support, confidence, and lift. The rules that are located in the upper right corner of the plot have high support, high confidence, and high lift, which means that they are strong rules that are applicable to a large number of transactions and have a high likelihood of being correct. On the other hand, the rules that are located in the lower left corner of the plot have low support, low confidence, and low lift, which means that they are weak rules that are not very useful.

The plot shows the lift values for each rule, which indicates how strongly the antecedent and consequent are associated beyond what would be expected by chance. Rules with a lift value greater than 1 indicate a positive association between the antecedent and consequent, while values less than 1 indicate a negative association.

library(arulesViz)
mypal <- rev(brewer.pal(9, "Blues"))
plot(associationRules, method = "matrix", measure = c("support", "confidence"), shading = "lift", col = mypal)

## Itemsets in Antecedent (LHS)
##  [1] "{energy=(-2.16,-1.16]}"            "{instrumentalness=(2.39,3.34]}"   
##  [3] "{energy=(0.821,1.82]}"             "{danceability=(1.07,2.56]}"       
##  [5] "{acousticness=(0.847,1.56]}"       "{energy=(-1.16,-0.171]}"          
##  [7] "{valence=(-2.17,-1.19]}"           "{acousticness=(-1.3,-0.582]}"     
##  [9] "{energy=(-0.171,0.821]}"           "{loudness=(-0.13,3.07]}"          
## [11] "{acousticness=(-0.582,0.133]}"     "{valence=(0.755,1.73]}"           
## [13] "{tempo=(0.112,2.15]}"              "{valence=(-0.218,0.755]}"         
## [15] "{acousticness=(0.133,0.847]}"      "{instrumentalness=(-0.427,0.516]}"
## [17] "{liveness=(0.16,1.48]}"            "{danceability=(-0.419,1.07]}"     
## [19] "{}"                                "{liveness=(-1.16,0.16]}"          
## [21] "{speechiness=(-0.59,0.628]}"       "{loudness=(-3.32,-0.13]}"         
## [23] "{valence=(-1.19,-0.218]}"          "{tempo=(-1.93,0.112]}"            
## [25] "{danceability=(-1.91,-0.419]}"    
## Itemsets in Consequent (RHS)
##  [1] "{instrumentalness=(-0.427,0.516]}" "{speechiness=(-0.59,0.628]}"      
##  [3] "{liveness=(-1.16,0.16]}"           "{danceability=(-0.419,1.07]}"     
##  [5] "{tempo=(-1.93,0.112]}"             "{loudness=(-0.13,3.07]}"          
##  [7] "{tempo=(0.112,2.15]}"              "{acousticness=(-1.3,-0.582]}"     
##  [9] "{loudness=(-3.32,-0.13]}"          "{valence=(0.755,1.73]}"           
## [11] "{acousticness=(0.847,1.56]}"

Head and tail of rules

The head and tail of the associationRules data frame show different rules and values, as they represent the first and last few rows of the data frame respectively.

The rules in the head of associationRules are the most general rules, with an empty antecedent (i.e., {}) and a consequent that specifies a range of values for a particular feature. Since the antecedent is empty, the coverage and lift values for these rules are both 1, which means that they cover the entire dataset and there is no association between the antecedent and consequent.

In contrast, the rules in the tail of associationRules have more specific antecedents that correspond to smaller subsets of the data. As a result, their coverage and lift values are lower than 1, indicating that they cover only a portion of the data and there is some association between the antecedent and consequent.

For example, the last rule in the tail of associationRules has an antecedent of {speechiness=(-0.59,0.628]} and a consequent of {instrumentalness=(-0.427,0.516]}. The coverage of this rule is 0.8952976, which means that almost 90% of the data that has a speechiness value within the range of (-0.59,0.628]. The lift of this rule is 0.9848941, which means that the probability of the consequent occurring is only 0.9848941 times higher when the antecedent occurs, compared to when it does not occur. This suggests that there is a weak association between speechiness and instrumentalness for this subset of the data.

associationRules <- as(associationRules, "data.frame")
head(associationRules, )

##                                     rules   support confidence coverage lift
## 1      {} => {danceability=(-0.419,1.07]} 0.5404177  0.5404177        1    1
## 2             {} => {tempo=(-1.93,0.112]} 0.5491618  0.5491618        1    1
## 3           {} => {loudness=(-0.13,3.07]} 0.6243642  0.6243642        1    1
## 4           {} => {liveness=(-1.16,0.16]} 0.7043149  0.7043149        1    1
## 5 {} => {instrumentalness=(-0.427,0.516]} 0.8559964  0.8559964        1    1
## 6       {} => {speechiness=(-0.59,0.628]} 0.8952976  0.8952976        1    1
##    count
## 1 241283
## 2 245187
## 3 278763
## 4 314459
## 5 382181
## 6 399728

tail(associationRules, )

##                                                                rules   support
## 133     {liveness=(-1.16,0.16]} => {instrumentalness=(-0.427,0.516]} 0.5956011
## 134     {instrumentalness=(-0.427,0.516]} => {liveness=(-1.16,0.16]} 0.5956011
## 135           {liveness=(-1.16,0.16]} => {speechiness=(-0.59,0.628]} 0.6524553
## 136           {speechiness=(-0.59,0.628]} => {liveness=(-1.16,0.16]} 0.6524553
## 137 {instrumentalness=(-0.427,0.516]} => {speechiness=(-0.59,0.628]} 0.7547948
## 138 {speechiness=(-0.59,0.628]} => {instrumentalness=(-0.427,0.516]} 0.7547948
##     confidence  coverage      lift  count
## 133  0.8456460 0.7043149 0.9879084 265921
## 134  0.6957986 0.8559964 0.9879084 265921
## 135  0.9263688 0.7043149 1.0347049 291305
## 136  0.7287581 0.8952976 1.0347049 291305
## 137  0.8817733 0.8559964 0.9848941 336997
## 138  0.8430658 0.8952976 0.9848941 336997

Extracting LHS and RHS

This code is performed to extract the left and right side items from the association rules and to clean up the resulting data frame and for the better readability and interpretation of my association rules.

First, I created two new columns in the data frame associationRules called “rules_left” and “rules_right”. These columns are populated with the left and right side items of each association rule using the strsplit() function to split the original “rules” column by the “=>” character. Then, the original “rules” column was removed using associationRules$rules <- NULL.

Next, I used the separate() function to split the “rules_left” and “rules_right” columns into separate columns for the variable name and range. The into argument specifies the new column names, while the sep argument specifies the regular expression used to separate the variable name from the range.

Finally, the gsub() function is used to remove unnecessary characters (in this case, curly braces and equals signs) from the “left_var” and “right_var” columns, cleaning up the resulting data frame. The result is shown by tail() and str() functions.

#extract the left and right side items from the association rules
library(tidyr)

## 
## Attaching package: 'tidyr'

## The following objects are masked from 'package:Matrix':
## 
##     expand, pack, unpack

## The following object is masked from 'package:raster':
## 
##     extract

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:arules':
## 
##     intersect, recode, setdiff, setequal, union

## The following objects are masked from 'package:raster':
## 
##     intersect, select, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

associationRules[c("rules_left", "rules_right")] <- 
         t(sapply(associationRules$rules, function(x) strsplit(x, " => ")[[1]]))

associationRules$rules <- NULL
tail(associationRules, )

##       support confidence  coverage      lift  count
## 133 0.5956011  0.8456460 0.7043149 0.9879084 265921
## 134 0.5956011  0.6957986 0.8559964 0.9879084 265921
## 135 0.6524553  0.9263688 0.7043149 1.0347049 291305
## 136 0.6524553  0.7287581 0.8952976 1.0347049 291305
## 137 0.7547948  0.8817733 0.8559964 0.9848941 336997
## 138 0.7547948  0.8430658 0.8952976 0.9848941 336997
##                            rules_left                       rules_right
## 133           {liveness=(-1.16,0.16]} {instrumentalness=(-0.427,0.516]}
## 134 {instrumentalness=(-0.427,0.516]}           {liveness=(-1.16,0.16]}
## 135           {liveness=(-1.16,0.16]}       {speechiness=(-0.59,0.628]}
## 136       {speechiness=(-0.59,0.628]}           {liveness=(-1.16,0.16]}
## 137 {instrumentalness=(-0.427,0.516]}       {speechiness=(-0.59,0.628]}
## 138       {speechiness=(-0.59,0.628]} {instrumentalness=(-0.427,0.516]}

library(tidyr)

associationRules <- associationRules %>%
         separate(rules_left, into = c("left_var", "left_range"), sep = "\\(|\\]") %>%
         separate(rules_right, into = c("right_var", "right_range"), sep = "\\(|\\]")

## Warning: Expected 2 pieces. Additional pieces discarded in 132 rows [7, 8, 9, 10, 11,
## 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, ...].

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 6 rows [1, 2, 3, 4, 5,
## 6].

## Warning: Expected 2 pieces. Additional pieces discarded in 138 rows [1, 2, 3, 4, 5, 6,
## 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

# Remove unnecessary characters from "left_var" and "right_var" columns
associationRules$left_var <- gsub("[{}=]", "", associationRules$left_var)
associationRules$right_var <- gsub("[{}=]", "", associationRules$right_var)

tail(associationRules, )

##       support confidence  coverage      lift  count         left_var
## 133 0.5956011  0.8456460 0.7043149 0.9879084 265921         liveness
## 134 0.5956011  0.6957986 0.8559964 0.9879084 265921 instrumentalness
## 135 0.6524553  0.9263688 0.7043149 1.0347049 291305         liveness
## 136 0.6524553  0.7287581 0.8952976 1.0347049 291305      speechiness
## 137 0.7547948  0.8817733 0.8559964 0.9848941 336997 instrumentalness
## 138 0.7547948  0.8430658 0.8952976 0.9848941 336997      speechiness
##       left_range        right_var  right_range
## 133   -1.16,0.16 instrumentalness -0.427,0.516
## 134 -0.427,0.516         liveness   -1.16,0.16
## 135   -1.16,0.16      speechiness  -0.59,0.628
## 136  -0.59,0.628         liveness   -1.16,0.16
## 137 -0.427,0.516      speechiness  -0.59,0.628
## 138  -0.59,0.628 instrumentalness -0.427,0.516

str(associationRules)

## 'data.frame':    138 obs. of  9 variables:
##  $ support    : num  0.54 0.549 0.624 0.704 0.856 ...
##  $ confidence : num  0.54 0.549 0.624 0.704 0.856 ...
##  $ coverage   : num  1 1 1 1 1 ...
##  $ lift       : num  1 1 1 1 1 ...
##  $ count      : int  241283 245187 278763 314459 382181 399728 23385 23643 29608 37180 ...
##  $ left_var   : chr  "" "" "" "" ...
##  $ left_range : chr  NA NA NA NA ...
##  $ right_var  : chr  "danceability" "tempo" "loudness" "liveness" ...
##  $ right_range: chr  "-0.419,1.07" "-1.93,0.112" "-0.13,3.07" "-1.16,0.16" ...

Frequent song features

Identifying

I also generated association rules from frequent itemsets. This way I could identify the most frequent itemsets of song features, which can be used to gain insights into the common patterns of features that occur together in popular songs. The output of “print(sort(freq_items, decreasing = TRUE))” shows the frequent itemsets sorted in decreasing order. Each line represents an itemset and the value in the right column is its support. The itemsets are defined by their range values for the different audio features.

For example, the first line “speechiness=(-0.000971,0.243] instrumentalness=(-0.001,0.25]” represents the itemset of songs with speechiness in the range of (-0.000971,0.243] and instrumentalness in the range of (-0.001,0.25], and its support is 0.8952976090. This means that almost 90% of the songs in the dataset have these two features falling in this range.

Similarly, the second line shows an itemset of songs with instrumentalness in the range of (-0.001,0.25], liveness in the range of (-0.001,0.25], and its support is 0.8559964164. This means that almost 85% of the songs in the dataset have these two features falling in this range.

The rest of the lines can be interpreted in a similar way to understand the frequent itemsets and their support values.

# Identify frequent itemsets
freq_items <- itemFrequency(trans)

# Print frequent itemsets
print(sort(freq_items, decreasing = TRUE))

##       speechiness=(-0.59,0.628] instrumentalness=(-0.427,0.516] 
##                    0.8952976090                    0.8559964164 
##           liveness=(-1.16,0.16]           loudness=(-0.13,3.07] 
##                    0.7043149112                    0.6243641861 
##             tempo=(-1.93,0.112]      danceability=(-0.419,1.07] 
##                    0.5491617672                    0.5404177166 
##              tempo=(0.112,2.15]      acousticness=(-1.3,-0.582] 
##                    0.4211523602                    0.3806596114 
##          loudness=(-3.32,-0.13]           energy=(-0.171,0.821] 
##                    0.3675905706                    0.3158833081 
##          valence=(-0.218,0.755]           energy=(-1.16,-0.171] 
##                    0.3099188084                    0.2843294697 
##     danceability=(-1.91,-0.419]       acousticness=(0.847,1.56] 
##                    0.2821792934                    0.2811803572 
##            valence=(0.755,1.73]          valence=(-1.19,-0.218] 
##                    0.2754801501                    0.2634458816 
##             energy=(0.821,1.82]            liveness=(0.16,1.48] 
##                    0.2493510275                    0.2090464192 
##      acousticness=(0.133,0.847]     acousticness=(-0.582,0.133] 
##                    0.1736603393                    0.1644996920 
##           valence=(-2.17,-1.19]            energy=(-2.16,-1.16] 
##                    0.1511551599                    0.1504361946 
##        danceability=(1.07,2.56]    instrumentalness=(2.39,3.34] 
##                    0.1375978498                    0.0850327566 
##             liveness=(1.48,2.8]         speechiness=(3.06,4.27] 
##                    0.0535685089                    0.0479623719 
##        speechiness=(0.628,1.84]       danceability=(-3.4,-1.91] 
##                    0.0475480150                    0.0398051403 
##             liveness=(2.8,4.12]    instrumentalness=(1.46,2.39] 
##                    0.0330701607                    0.0305011479 
##   instrumentalness=(0.516,1.46]                tempo=(2.15,4.2] 
##                    0.0284696792                    0.0227067585 
##         speechiness=(1.84,3.06]           loudness=(-6.5,-3.32] 
##                    0.0091920040                    0.0079287754 
##             tempo=(-3.98,-1.93]            loudness=(-9.7,-6.5] 
##                    0.0069791142                    0.0001164679

Top 10 most frequent song features

Using the sot() function and head( ,10), I displayed the top 10 most often occuring song features. The results are displayed below, with ranges for variables in brackets, and after that there is a level of support, so the share of songs that have this feature’s value falling into a given range. For example, 86% of songs has instrumentalness in the range of (-0.427, 0.516]. The values from the table are also summarised in the barplot, but only three x-axis descriptions could be displayed, because when I set these descriptions to vertical, they were beneath the edge of the plot and not visible.

# Sort items in decreasing order of frequency
sorted_items <- sort(freq_items, decreasing = TRUE)

# Print the top 10 most frequent itemsets
head(sorted_items, 10)

##       speechiness=(-0.59,0.628] instrumentalness=(-0.427,0.516] 
##                       0.8952976                       0.8559964 
##           liveness=(-1.16,0.16]           loudness=(-0.13,3.07] 
##                       0.7043149                       0.6243642 
##             tempo=(-1.93,0.112]      danceability=(-0.419,1.07] 
##                       0.5491618                       0.5404177 
##              tempo=(0.112,2.15]      acousticness=(-1.3,-0.582] 
##                       0.4211524                       0.3806596 
##          loudness=(-3.32,-0.13]           energy=(-0.171,0.821] 
##                       0.3675906                       0.3158833

# Create a barplot of the 10 most frequent itemsets
barplot(sorted_items[1:10], xlab = "Itemset", ylab = "Frequency", main = "Top 10 Frequent Itemsets", col = mypal)

Determining the most common and strongest associations between song features

Then I identified frequent itemsets with high lift values to determine the most common and strongest associations between song features. These frequent itemsets can help to identify potential hit songs by providing insights into the preferences of the listeners.

These are the top 10 association rules with the highest lift values in the data. The lift value measures the strength of association between two items, and a lift value greater than 1 indicates a positive association between the items.

Here’s how to interpret the output: - support: the proportion of transactions containing both items in the rule - confidence: the proportion of transactions containing the left-hand side item that also contain the right-hand side item - coverage: the proportion of transactions containing the left-hand side item - lift: the measure of how much more often the two items appear together than if they were statistically independent (a lift of 1 means there is no association) - count: the number of transactions containing both items in the rule - rules_left: the item or items on the left-hand side of the rule (antecedent) - rules_right: the item or items on the right-hand side of the rule (consequent)

For example, the first rule with the highest lift has a support of 0.1239, which means that 12.39% of the transactions contain both items in the rule, and a confidence of 0.8191, which means that 81.91% of the transactions containing the left-hand side item also contain the right-hand side item. The lift value for this rule is 2.9216, which is greater than 1, indicating a positive association between the left-hand side item “energy=(-0.001,0.25]” and the right-hand side item “acousticness=(0.747,0.997]”. This suggests that songs with low energy tend to have high acousticness. Similarly, the other rules can be interpreted in a similar way.

# Sort associationRules by lift in decreasing order
sorted_rules <- associationRules[order(-associationRules$lift),]

# Print the top 10 rules with the highest lift values
head(sorted_rules, 10)

##       support confidence   coverage     lift  count         left_var
## 18 0.12343356  0.8205044 0.15043619 2.918072  55110           energy
## 19 0.13398735  0.8906590 0.15043619 2.422965  59822           energy
## 7  0.05237695  0.6159621 0.08503276 2.190630  23385 instrumentalness
## 46 0.18923344  0.7589038 0.24935103 1.993655  84488           energy
## 11 0.07345092  0.5338086 0.13759785 1.937739  32794     danceability
## 65 0.18368106  0.6532500 0.28118036 1.777113  82009     acousticness
## 8  0.05295481  0.6227578 0.08503276 1.694162  23643 instrumentalness
## 74 0.16171118  0.5687458 0.28432947 1.547226  72200           energy
## 49 0.23830002  0.9556809 0.24935103 1.530647 106395           energy
## 24 0.08085559  0.5349178 0.15115516 1.455200  36100          valence
##      left_range    right_var right_range
## 18  -2.16,-1.16 acousticness  0.847,1.56
## 19  -2.16,-1.16     loudness -3.32,-0.13
## 7     2.39,3.34 acousticness  0.847,1.56
## 46   0.821,1.82 acousticness -1.3,-0.582
## 11    1.07,2.56      valence  0.755,1.73
## 65   0.847,1.56     loudness -3.32,-0.13
## 8     2.39,3.34     loudness -3.32,-0.13
## 74 -1.16,-0.171     loudness -3.32,-0.13
## 49   0.821,1.82     loudness  -0.13,3.07
## 24  -2.17,-1.19     loudness -3.32,-0.13

Top 10 rules with the highest supportand confidence values

Support and confidence are other measures that can be used to evaluate the strength of association between the antecedent and consequent of a rule. I identified the rules with the highest support and confidence values to determine the most common combinations of song features that co-occur in popular tracks on Spotify.

The rules with the highest support values indicate the most frequent combinations of song features in the data set, while the rules with the highest confidence values indicate the strongest associations between the left-hand and right-hand side features.

From the output, we can see that the rule with the highest support value is: - instrumentalness -> speechiness with support 0.8952976.

The rule with the highest confidence value is: - instrumentalness (range 2.39,3.34) -> speechiness (range -0.59,0.628) with confidence 0.9793231.

These rules suggest that instrumentalness and speechiness are the most frequent features in popular tracks on Spotify and that when a song has a high instrumentalness value within the range of 2.39,3.34, it is highly likely to have a high speechiness value within the range of -0.59,0.628.

# Sort associationRules by support in decreasing order
sorted_rules_support <- associationRules[order(-associationRules$support),]

# Print the top 10 rules with the highest support values
head(sorted_rules_support, 10)

##       support confidence  coverage      lift  count         left_var
## 6   0.8952976  0.8952976 1.0000000 1.0000000 399728                 
## 5   0.8559964  0.8559964 1.0000000 1.0000000 382181                 
## 137 0.7547948  0.8817733 0.8559964 0.9848941 336997 instrumentalness
## 138 0.7547948  0.8430658 0.8952976 0.9848941 336997      speechiness
## 4   0.7043149  0.7043149 1.0000000 1.0000000 314459                 
## 135 0.6524553  0.9263688 0.7043149 1.0347049 291305         liveness
## 136 0.6524553  0.7287581 0.8952976 1.0347049 291305      speechiness
## 3   0.6243642  0.6243642 1.0000000 1.0000000 278763                 
## 133 0.5956011  0.8456460 0.7043149 0.9879084 265921         liveness
## 134 0.5956011  0.6957986 0.8559964 0.9879084 265921 instrumentalness
##       left_range        right_var  right_range
## 6           <NA>      speechiness  -0.59,0.628
## 5           <NA> instrumentalness -0.427,0.516
## 137 -0.427,0.516      speechiness  -0.59,0.628
## 138  -0.59,0.628 instrumentalness -0.427,0.516
## 4           <NA>         liveness   -1.16,0.16
## 135   -1.16,0.16      speechiness  -0.59,0.628
## 136  -0.59,0.628         liveness   -1.16,0.16
## 3           <NA>         loudness   -0.13,3.07
## 133   -1.16,0.16 instrumentalness -0.427,0.516
## 134 -0.427,0.516         liveness   -1.16,0.16

# Sort associationRules by confidence in decreasing order
sorted_rules_confidence <- associationRules[order(-associationRules$confidence),]

# Print the top 10 rules with the highest confidence values
head(sorted_rules_confidence, 10)

##        support confidence   coverage     lift  count         left_var
## 10  0.08327454  0.9793231 0.08503276 1.093852  37180 instrumentalness
## 73  0.27422140  0.9717985 0.28217929 1.085447 122433     danceability
## 28  0.14452545  0.9561397 0.15115516 1.067957  64527          valence
## 49  0.23830002  0.9556809 0.24935103 1.530647 106395           energy
## 33  0.15560334  0.9459187 0.16449969 1.105050  69473     acousticness
## 39  0.16227112  0.9344167 0.17366034 1.091613  72450     acousticness
## 103 0.35378465  0.9293990 0.38065961 1.038089 157956     acousticness
## 52  0.23160087  0.9288146 0.24935103 1.037437 103404           energy
## 135 0.65245534  0.9263688 0.70431491 1.034705 291305         liveness
## 108 0.38927376  0.9243062 0.42115236 1.032401 173801            tempo
##       left_range        right_var  right_range
## 10     2.39,3.34      speechiness  -0.59,0.628
## 73  -1.91,-0.419      speechiness  -0.59,0.628
## 28   -2.17,-1.19      speechiness  -0.59,0.628
## 49    0.821,1.82         loudness   -0.13,3.07
## 33  -0.582,0.133 instrumentalness -0.427,0.516
## 39   0.133,0.847 instrumentalness -0.427,0.516
## 103  -1.3,-0.582      speechiness  -0.59,0.628
## 52    0.821,1.82      speechiness  -0.59,0.628
## 135   -1.16,0.16      speechiness  -0.59,0.628
## 108   0.112,2.15      speechiness  -0.59,0.628

Item frequency plot

The plot below displays the frequency of occurrence of different items in the trans dataset. The horizontal axis represents the support level, which is the percentage of transactions that contain a particular itemset. The vertical axis represents the itemsets, which are sorted by decreasing frequency of occurrence. Each itemset is represented by a horizontal line, with the length of the line indicating the support level of the itemset.

The plot shows that the most frequent itemsets in the dataset have a support level of around 0.01. The itemsets with lower support levels are less frequent and are represented by shorter lines. The color of the lines indicates the frequency of occurrence of the itemsets, with darker colors representing more frequent itemsets.

Overall, this plot provides a visual representation of the frequency distribution of itemsets in the dataset.

library(arulesViz)

mypal <- brewer.pal(9, "Blues")

# Item frequency plot
itemFrequencyPlot(trans, support = 0.01, col = mypal)

Conclusions

Association rules analysis can be a useful technique for identifying common song features that co-occur in popular tracks on Spotify.
The dataset used in this project contained information on the song features of popular tracks on Spotify, including characteristics such as tempo, energy, danceability, and popularity.
By applying the Apriori algorithm and setting appropriate support and confidence levels, I was able to identify frequent itemsets and association rules that reveal interesting relationships between the song features.
The results showed that certain song features tend to co-occur more frequently than others.
These associations can be used to identify potential hit songs by looking for tracks that exhibit these common characteristics.
However, it is important to note that association rules do not necessarily imply causality, and further analysis would be required to establish any causal relationships between these song features and a song’s popularity.
In conclusion, association rules analysis can be a useful tool for identifying common song features that co-occur in popular tracks on Spotify, and this information can be used to help identify potential hit songs, although caution should be exercised in interpreting the results.

References

Kotsiantis, S., & Kanellopoulos, D. (2006). Association Rules Mining: A Recent Overview. GESTS International Transactions on Computer Science and Engineering, 32(1), 71–82. https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=73a19026fb8a6ef5bf238ff472f31100c33753d0

RPubs - Project 3: Recommendation System with Association Rules. (2021, March 1). https://rpubs.com/okowalewska/association_rules_and_recommendations

Toivonen, H. (1996). Sampling Large Databases for Association Rules. Very Large Data Bases, 134–145. https://www.cin.ufpe.br/~jtalr/Mestrado/toiSample.pdf

Association rules of Spotify co-occuring most popular song features

Anna Czarnocka

2023-02-24