This article aims to evaluate possible recommendations of potential popular songs on Spotify, using association rules in unsupervised machine learning. The goal was to answer my research question “What are the most common song features that co-occur in popular tracks on Spotify, and how can this information be used to identify potential hit songs?” using association rules.
As a music lover and data science enthusiast, I’ve always been fascinated by the factors that make a song successful. I have been using Spotify every day for most of the day for the last couple of years, so I also have personal intuition and experience in this area, which made it easier to verify my results with reality. With the rise of streaming platforms like Spotify, there’s now an abundance of data available on the features of popular songs. For my project, I decided to use the Association Rules Method to explore this data and identify the most common song features that co-occur in popular tracks on Spotify. My research question was simple: What makes a hit song? By analyzing the data and generating association rules, I was able to uncover some interesting insights into the relationships between different song features and their impact on popularity. In this report, I’ll share my findings and discuss how this information can be used to identify potential hit songs.
For my project, I used the “tracks.csv” file from the Spotify Dataset 1921-2020 available on Kaggle, which contains data on over 600,000 tracks released between 1921 and 2020. This dataset includes a wide range of information on each track, such as its name, artist, release date, popularity, and various audio features. I focused on analyzing the audio features of the tracks, such as danceability, energy, loudness, and valence, which were calculated using Spotify’s audio analysis algorithm. Overall, the dataset provided a rich source of information for exploring the factors that contribute to the success of a song.
Association rule mining is a data mining technique used to identify relationships and patterns between different variables in a dataset. The method works by identifying frequent item sets, or sets of items that frequently occur together in the data, and then generating association rules that describe the relationships between these item sets. These rules can then be used to make predictions or recommendations based on the patterns that have been identified.
Strengths:
Weaknesses:
To do the analysis, the following packages will be needed.
# Install the package if not yet installed
packages <- c("tidyverse", "readr", "corrplot", "ggplot2", "GGally",
"patchwork", "tibble", "patchwork", "gridExtra", "RColorBrewer",
"factoextra", "labdsv", "maptools", "psych", "ClusterR",
"readxl", "cluster", "flexclust", "fpc", "clustertend",
"ggthemes", "plotly", "stringr", "missMDA", "ade4", "smacof",
"Rtsne", "psy", "scales", "kableExtra", "pdp", "jpeg", "raster")
# Load packages if not already loaded
for (package in packages) {
if (!require(package, character.only = TRUE)) {
install.packages(package)
library(package, character.only = TRUE)
}
}
The first step is to input a “tracks.csv” file. Afterwards, I printed the column names, the first six rows of my database and the str(df) function, to see the structure of my dataset.
# Read in the artists and tracks datasets
df <- read.csv("/Users/annaczarnocka/Desktop/UL3/tracks.csv", stringsAsFactors = FALSE)
colnames(df)
## [1] "id" "name" "popularity" "duration_ms"
## [5] "explicit" "artists" "id_artists" "release_date"
## [9] "danceability" "energy" "key" "loudness"
## [13] "mode" "speechiness" "acousticness" "instrumentalness"
## [17] "liveness" "valence" "tempo" "time_signature"
head(df)
## id name popularity
## 1 35iwgR4jXetI318WEWsa1Q Carve 6
## 2 021ht4sdgPcrDgSk7JTbKY Capítulo 2.16 - Banquero Anarquista 0
## 3 07A5yehtSnoedViJAZkNnc Vivo para Quererte - Remasterizado 0
## 4 08FmqUhxtyLTn6pAh6bk45 El Prisionero - Remasterizado 0
## 5 08y9GfoqCWfOGsKdwojr5e Lady of the Evening 0
## 6 0BRXJHRNGQ3W4v9frnSfhu Ave Maria 0
## duration_ms explicit artists id_artists
## 1 126903 0 ['Uli'] ['45tIt06XoI0Iio4LBEVpls']
## 2 98200 0 ['Fernando Pessoa'] ['14jtPCOoNZwquk5wd9DxrY']
## 3 181640 0 ['Ignacio Corsini'] ['5LiOoJbxVSAMkBS2fUm3X2']
## 4 176907 0 ['Ignacio Corsini'] ['5LiOoJbxVSAMkBS2fUm3X2']
## 5 163080 0 ['Dick Haymes'] ['3BiJGZsyX9sJchTqcSA7Su']
## 6 178933 0 ['Dick Haymes'] ['3BiJGZsyX9sJchTqcSA7Su']
## release_date danceability energy key loudness mode speechiness acousticness
## 1 1922-02-22 0.645 0.4450 0 -13.338 1 0.4510 0.674
## 2 1922-06-01 0.695 0.2630 0 -22.136 1 0.9570 0.797
## 3 1922-03-21 0.434 0.1770 1 -21.180 1 0.0512 0.994
## 4 1922-03-21 0.321 0.0946 7 -27.961 1 0.0504 0.995
## 5 1922 0.402 0.1580 3 -16.900 0 0.0390 0.989
## 6 1922 0.227 0.2610 5 -12.343 1 0.0382 0.994
## instrumentalness liveness valence tempo time_signature
## 1 0.7440 0.1510 0.1270 104.851 3
## 2 0.0000 0.1480 0.6550 102.009 1
## 3 0.0218 0.2120 0.4570 130.418 5
## 4 0.9180 0.1040 0.3970 169.980 3
## 5 0.1300 0.3110 0.1960 103.220 4
## 6 0.2470 0.0977 0.0539 118.891 4
str(df)
## 'data.frame': 586672 obs. of 20 variables:
## $ id : chr "35iwgR4jXetI318WEWsa1Q" "021ht4sdgPcrDgSk7JTbKY" "07A5yehtSnoedViJAZkNnc" "08FmqUhxtyLTn6pAh6bk45" ...
## $ name : chr "Carve" "Capítulo 2.16 - Banquero Anarquista" "Vivo para Quererte - Remasterizado" "El Prisionero - Remasterizado" ...
## $ popularity : int 6 0 0 0 0 0 0 0 0 0 ...
## $ duration_ms : int 126903 98200 181640 176907 163080 178933 134467 161427 310073 181173 ...
## $ explicit : int 0 0 0 0 0 0 0 0 0 0 ...
## $ artists : chr "['Uli']" "['Fernando Pessoa']" "['Ignacio Corsini']" "['Ignacio Corsini']" ...
## $ id_artists : chr "['45tIt06XoI0Iio4LBEVpls']" "['14jtPCOoNZwquk5wd9DxrY']" "['5LiOoJbxVSAMkBS2fUm3X2']" "['5LiOoJbxVSAMkBS2fUm3X2']" ...
## $ release_date : chr "1922-02-22" "1922-06-01" "1922-03-21" "1922-03-21" ...
## $ danceability : num 0.645 0.695 0.434 0.321 0.402 0.227 0.51 0.563 0.488 0.548 ...
## $ energy : num 0.445 0.263 0.177 0.0946 0.158 0.261 0.355 0.184 0.475 0.0391 ...
## $ key : int 0 0 1 7 3 5 4 4 0 6 ...
## $ loudness : num -13.3 -22.1 -21.2 -28 -16.9 ...
## $ mode : int 1 1 1 1 0 1 1 1 0 1 ...
## $ speechiness : num 0.451 0.957 0.0512 0.0504 0.039 0.0382 0.124 0.0512 0.0399 0.153 ...
## $ acousticness : num 0.674 0.797 0.994 0.995 0.989 0.994 0.965 0.993 0.62 0.996 ...
## $ instrumentalness: num 7.44e-01 0.00 2.18e-02 9.18e-01 1.30e-01 2.47e-01 0.00 1.55e-05 6.45e-03 9.33e-01 ...
## $ liveness : num 0.151 0.148 0.212 0.104 0.311 0.0977 0.155 0.325 0.107 0.148 ...
## $ valence : num 0.127 0.655 0.457 0.397 0.196 0.0539 0.727 0.654 0.544 0.612 ...
## $ tempo : num 105 102 130 170 103 ...
## $ time_signature : int 3 1 5 3 4 4 5 3 4 3 ...
First, I removed any duplicate entries based on the song name, ensuring that each song only appears once in the data frame. I then printed the first few rows of the modified data frame to confirm that the duplicates were removed. This was useful because association rule mining algorithms are designed to work with a set of unique items or transactions. Duplicate entries in the dataset can skew the results of the association rule mining process, leading to inaccurate or misleading association rules. In the context of my project, removing duplicate entries based on the song name ensures that each song is only considered once in the analysis. This helps to identify the most common song features that co-occur in popular tracks on Spotify more accurately and avoids over-representing certain songs or features.
Next, I removed columns from the data frame that were not necessary for my analysis, keeping only the columns containing information on danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, and tempo.
The list of variables before change:
name: name of the trackpopularity: a measure of how popular
the track is, ranging from 0 to 100duration_ms: duration of the track in
millisecondsexplicit: a binary indicator (0 or 1)
that indicates whether the lyrics contain explicit languageartists: name(s) of the artist(s) who
created the trackid_artists: unique identifier(s) of
the artist(s) who created the trackrelease_date: date when the track was
releaseddanceability: a measure of how
suitable a track is for dancing, ranging from 0 to 1energy: a measure of how energetic the
track is, ranging from 0 to 1key: the key of the track (0 = C, 1 =
C#, 2 = D, and so on)loudness: the overall loudness of the
track in decibels (dB)mode: the modality of the track (major
= 1, minor = 0)speechiness: a measure of how much
spoken word is in the track, ranging from 0 to 1acousticness: a measure of how
acoustic the track is, ranging from 0 to 1instrumentalness: a measure of how
instrumental the track is, ranging from 0 to 1liveness: a measure of how live the
track is, ranging from 0 to 1valence: a measure of the positivity
of the track, ranging from 0 to 1tempo: the tempo of the track in beats
per minute (BPM)time_signature: the time signature of
the track, which determines the number of beats in each bar.
In particular, variables like danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence and tempo are measures of the musical characteristics of the tracks, and could be used to identify patterns in the song features that tend to co-occur in popular tracks.
Release date could also be used to investigate how the features of popular tracks have evolved over time. However, I chose not to include this variable, because it would have to be in a date format, and it was more accurate to work on only numeric data types.
I excluded these variables because they are either irrelevant or not informative for my research question:
name and
artists are simply the names of the track
and the artist(s) who created it. These variables do not provide any
useful information for the analysis of song features and their potential
co-occurrence.popularity,
duration_ms,
explicit,
key, mode,
and time_signature are characteristics of
the tracks that are not related to the audio features, which are the
focus of my research question.id_artists is a unique identifier for
the artist(s) who created the track and is not relevant to my research
question either.
Therefore, it is best to focus on the variables related to audio features such as danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, and tempo for my association rules analysis.
I then checked for any missing values in the data frame using the colSums(is.na()) function, and removed any rows that contained missing values using the complete.cases() function.
Finally, I viewed the cleaned data using the head() function to inspect the first few rows, and used the str() function to check the structure of the data frame. I also used the summary() function to obtain summary statistics for each column in the data frame.
# Remove duplicates based on the song name
df <- df[!duplicated(df$name),]
# Print the first few rows of the modified data frame
head(df)
## id name popularity
## 1 35iwgR4jXetI318WEWsa1Q Carve 6
## 2 021ht4sdgPcrDgSk7JTbKY Capítulo 2.16 - Banquero Anarquista 0
## 3 07A5yehtSnoedViJAZkNnc Vivo para Quererte - Remasterizado 0
## 4 08FmqUhxtyLTn6pAh6bk45 El Prisionero - Remasterizado 0
## 5 08y9GfoqCWfOGsKdwojr5e Lady of the Evening 0
## 6 0BRXJHRNGQ3W4v9frnSfhu Ave Maria 0
## duration_ms explicit artists id_artists
## 1 126903 0 ['Uli'] ['45tIt06XoI0Iio4LBEVpls']
## 2 98200 0 ['Fernando Pessoa'] ['14jtPCOoNZwquk5wd9DxrY']
## 3 181640 0 ['Ignacio Corsini'] ['5LiOoJbxVSAMkBS2fUm3X2']
## 4 176907 0 ['Ignacio Corsini'] ['5LiOoJbxVSAMkBS2fUm3X2']
## 5 163080 0 ['Dick Haymes'] ['3BiJGZsyX9sJchTqcSA7Su']
## 6 178933 0 ['Dick Haymes'] ['3BiJGZsyX9sJchTqcSA7Su']
## release_date danceability energy key loudness mode speechiness acousticness
## 1 1922-02-22 0.645 0.4450 0 -13.338 1 0.4510 0.674
## 2 1922-06-01 0.695 0.2630 0 -22.136 1 0.9570 0.797
## 3 1922-03-21 0.434 0.1770 1 -21.180 1 0.0512 0.994
## 4 1922-03-21 0.321 0.0946 7 -27.961 1 0.0504 0.995
## 5 1922 0.402 0.1580 3 -16.900 0 0.0390 0.989
## 6 1922 0.227 0.2610 5 -12.343 1 0.0382 0.994
## instrumentalness liveness valence tempo time_signature
## 1 0.7440 0.1510 0.1270 104.851 3
## 2 0.0000 0.1480 0.6550 102.009 1
## 3 0.0218 0.2120 0.4570 130.418 5
## 4 0.9180 0.1040 0.3970 169.980 3
## 5 0.1300 0.3110 0.1960 103.220 4
## 6 0.2470 0.0977 0.0539 118.891 4
# Remove unnecessary columns
df <- subset(df, select = c(danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo))
# Check for missing values
colSums(is.na(df))
## danceability energy loudness speechiness
## 0 0 0 0
## acousticness instrumentalness liveness valence
## 0 0 0 0
## tempo
## 0
df <- df[complete.cases(df), ]
# View the cleaned data
head(df)
## danceability energy loudness speechiness acousticness instrumentalness
## 1 0.645 0.4450 -13.338 0.4510 0.674 0.7440
## 2 0.695 0.2630 -22.136 0.9570 0.797 0.0000
## 3 0.434 0.1770 -21.180 0.0512 0.994 0.0218
## 4 0.321 0.0946 -27.961 0.0504 0.995 0.9180
## 5 0.402 0.1580 -16.900 0.0390 0.989 0.1300
## 6 0.227 0.2610 -12.343 0.0382 0.994 0.2470
## liveness valence tempo
## 1 0.1510 0.1270 104.851
## 2 0.1480 0.6550 102.009
## 3 0.2120 0.4570 130.418
## 4 0.1040 0.3970 169.980
## 5 0.3110 0.1960 103.220
## 6 0.0977 0.0539 118.891
str(df)
## 'data.frame': 446475 obs. of 9 variables:
## $ danceability : num 0.645 0.695 0.434 0.321 0.402 0.227 0.51 0.563 0.488 0.548 ...
## $ energy : num 0.445 0.263 0.177 0.0946 0.158 0.261 0.355 0.184 0.475 0.0391 ...
## $ loudness : num -13.3 -22.1 -21.2 -28 -16.9 ...
## $ speechiness : num 0.451 0.957 0.0512 0.0504 0.039 0.0382 0.124 0.0512 0.0399 0.153 ...
## $ acousticness : num 0.674 0.797 0.994 0.995 0.989 0.994 0.965 0.993 0.62 0.996 ...
## $ instrumentalness: num 7.44e-01 0.00 2.18e-02 9.18e-01 1.30e-01 2.47e-01 0.00 1.55e-05 6.45e-03 9.33e-01 ...
## $ liveness : num 0.151 0.148 0.212 0.104 0.311 0.0977 0.155 0.325 0.107 0.148 ...
## $ valence : num 0.127 0.655 0.457 0.397 0.196 0.0539 0.727 0.654 0.544 0.612 ...
## $ tempo : num 105 102 130 170 103 ...
summary(df)
## danceability energy loudness speechiness
## Min. :0.0000 Min. :0.0000 Min. :-60.000 Min. :0.0000
## 1st Qu.:0.4550 1st Qu.:0.3450 1st Qu.:-12.990 1st Qu.:0.0344
## Median :0.5800 Median :0.5500 Median : -9.317 Median :0.0455
## Mean :0.5653 Mean :0.5431 Mean :-10.300 Mean :0.1171
## 3rd Qu.:0.6880 3rd Qu.:0.7490 3rd Qu.: -6.552 3rd Qu.:0.0828
## Max. :0.9910 Max. :1.0000 Max. : 5.376 Max. :0.9710
## acousticness instrumentalness liveness valence
## Min. :0.0000 Min. :0.0000000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0992 1st Qu.:0.0000000 1st Qu.:0.0990 1st Qu.:0.352
## Median :0.4270 Median :0.0000214 Median :0.1420 Median :0.570
## Mean :0.4517 Mean :0.1126884 Mean :0.2196 Mean :0.556
## 3rd Qu.:0.7850 3rd Qu.:0.0091250 3rd Qu.:0.2870 3rd Qu.:0.772
## Max. :0.9960 Max. :1.0000000 Max. :1.0000 Max. :1.000
## tempo
## Min. : 0.00
## 1st Qu.: 95.49
## Median :117.42
## Mean :118.40
## 3rd Qu.:136.37
## Max. :243.51
The plot of correlation matrix shows a color-coded correlation matrix, with darker shades of blue indicating stronger positive correlations and lighter shades indicating weaker or negative correlations. The diagonal line is always colored in dark blue, as it represents the correlation of each variable with itself, which is always perfect (1.0).
It can be seen that energy and loudness are strongly or moderately positively correlated, suggesting that they may be measuring similar aspects of music. However, they also show some level of independence from the other variables. Therefore, it may be beneficial to include these variables in your analysis, as they may provide additional information that is not highly correlated with the other variables you have selected.
Similarly, danceability and valence are positively correlated, indicating that they may capture similar aspects of musical enjoyment. However, their correlation is below 0.7, so it is not a strong correlation.
Each of the variables may have a certain level of association or correlation with one another, and including all of them in analysis may provide a more complete picture of the relationships between these variables.
library(RColorBrewer)
library(corrplot)
## corrplot 0.92 loaded
# correlation matrix
cor.matrix <- cor(df[, 1:9], method = "pearson")
mypal <- brewer.pal(9, "Blues")
corrplot(cor.matrix, type = "lower", order = "alphabet", tl.col = "black", tl.cex = 1, col = mypal)
# Compute the correlation matrix
cor.matrix <- cor(df[, 1:9], method = "pearson")
# Round the correlation matrix to one decimal place
cor.matrix <- round(cor.matrix, 1)
# Print the correlation matrix
print(cor.matrix)
## danceability energy loudness speechiness acousticness
## danceability 1.0 0.2 0.2 0.2 -0.2
## energy 0.2 1.0 0.8 -0.1 -0.7
## loudness 0.2 0.8 1.0 -0.2 -0.5
## speechiness 0.2 -0.1 -0.2 1.0 0.1
## acousticness -0.2 -0.7 -0.5 0.1 1.0
## instrumentalness -0.2 -0.2 -0.3 -0.1 0.2
## liveness -0.1 0.1 0.0 0.2 0.0
## valence 0.5 0.4 0.3 0.0 -0.2
## tempo 0.0 0.2 0.2 -0.1 -0.2
## instrumentalness liveness valence tempo
## danceability -0.2 -0.1 0.5 0.0
## energy -0.2 0.1 0.4 0.2
## loudness -0.3 0.0 0.3 0.2
## speechiness -0.1 0.2 0.0 -0.1
## acousticness 0.2 0.0 -0.2 -0.2
## instrumentalness 1.0 0.0 -0.2 0.0
## liveness 0.0 1.0 0.0 0.0
## valence -0.2 0.0 1.0 0.1
## tempo 0.0 0.0 0.1 1.0
First, I calculated z-scores for each variable in the dataset to help me identify potential outliers. I added the z-scores to the dataframe and then looked for any z-scores that were greater than 3 or less than -3. I found that there were a total of 446472 potential outliers in the dataset, which are all the observations, so unfrotunately it wasn’t useful for detecting actual outliers. Looking at the boxplots, tempo has the most outliers, however, I consider tempo as a highly varying variable, and even the most extreme values may provide useful insight to my analysis, as music tempo can be extraordinary high or very low, and it does not imply an error in observation, but it is simply a unique piece of music.
To further investigate, I used a function to calculate the number of outliers in each column of the data frame. This allowed me to see which variables had the most outliers.
Finally, I used another function to find the outliers in the dataset. This function used the interquartile range (IQR) to identify potential outliers. I created a variable containing the outliers and checked its dimensions and head to confirm that the function had worked properly.
Overall, the process of identifying and investigating outliers in my dataset allowed me to better understand the distribution of the data and to ensure that any extreme values were appropriately handled in my analysis. However, the outliers may be simply due to the high differences in results and characterstics of songs, which can indeed be very different from one another, and still providing well measured data, so I decided not to delete the outliers.
# Calculate z-scores for each variable
df_z <- apply(df[, 1:9], 2, function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))
# Add z-scores to the dataframe
df[, paste0(names(df)[1:9], "_z")] <- df_z
# Identify potential outliers (z-score > 3 or < -3)
outliers <- apply(df[, 1:9], 1, function(x) any(abs(x) > 3))
# Print number of outliers
cat("Number of potential outliers:", sum(outliers))
## Number of potential outliers: 446472
# Calculate potential outliers
outliers <- boxplot(df)$out
# Calculate outliers for each column
num_outliers_all <- function(df) {
num_vars <- sapply(df, is.numeric)
outlier_counts <- sapply(df[num_vars], function(x) {
length(boxplot(x, plot = FALSE)$out)
})
names(outlier_counts) <- names(df)[num_vars]
outlier_counts
}
num_outliers_all(df)
## danceability energy loudness speechiness
## 1573 0 11701 66118
## acousticness instrumentalness liveness valence
## 0 97918 31888 0
## tempo danceability_z energy_z loudness_z
## 4264 1573 0 11701
## speechiness_z acousticness_z instrumentalness_z liveness_z
## 66118 0 97918 31888
## valence_z tempo_z
## 0 4264
# Run the function to find outliers
# Find outliers in each variable of the data frame
find_outliers <- function(data) {
outliers <- lapply(data, function(x) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = TRUE)
H <- 1.5 * IQR(x, na.rm = TRUE)
x[x < (qnt[1] - H) | x > (qnt[2] + H)]
})
return(outliers)
}
# Generate a variable containing the outliers in the data frame
outliers <- find_outliers(df)
# Check the dimensions and head of the outliers variable
dim(outliers)
## NULL
summary(df)
## danceability energy loudness speechiness
## Min. :0.0000 Min. :0.0000 Min. :-60.000 Min. :0.0000
## 1st Qu.:0.4550 1st Qu.:0.3450 1st Qu.:-12.990 1st Qu.:0.0344
## Median :0.5800 Median :0.5500 Median : -9.317 Median :0.0455
## Mean :0.5653 Mean :0.5431 Mean :-10.300 Mean :0.1171
## 3rd Qu.:0.6880 3rd Qu.:0.7490 3rd Qu.: -6.552 3rd Qu.:0.0828
## Max. :0.9910 Max. :1.0000 Max. : 5.376 Max. :0.9710
## acousticness instrumentalness liveness valence
## Min. :0.0000 Min. :0.0000000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0992 1st Qu.:0.0000000 1st Qu.:0.0990 1st Qu.:0.352
## Median :0.4270 Median :0.0000214 Median :0.1420 Median :0.570
## Mean :0.4517 Mean :0.1126884 Mean :0.2196 Mean :0.556
## 3rd Qu.:0.7850 3rd Qu.:0.0091250 3rd Qu.:0.2870 3rd Qu.:0.772
## Max. :0.9960 Max. :1.0000000 Max. :1.0000 Max. :1.000
## tempo danceability_z energy_z loudness_z
## Min. : 0.00 Min. :-3.39488 Min. :-2.15575 Min. :-9.6893
## 1st Qu.: 95.49 1st Qu.:-0.66236 1st Qu.:-0.78639 1st Qu.:-0.5245
## Median :117.42 Median : 0.08833 Median : 0.02729 Median : 0.1915
## Mean :118.40 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.:136.37 3rd Qu.: 0.73693 3rd Qu.: 0.81715 3rd Qu.: 0.7306
## Max. :243.51 Max. : 2.55661 Max. : 1.81341 Max. : 3.0560
## speechiness_z acousticness_z instrumentalness_z liveness_z
## Min. :-0.5854 Min. :-1.29615 Min. :-0.4235 Min. :-1.1579
## 1st Qu.:-0.4133 1st Qu.:-1.01152 1st Qu.:-0.4235 1st Qu.:-0.6359
## Median :-0.3578 Median :-0.07096 Median :-0.4234 Median :-0.4092
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.1713 3rd Qu.: 0.95625 3rd Qu.:-0.3892 3rd Qu.: 0.3553
## Max. : 4.2700 Max. : 1.56168 Max. : 3.3344 Max. : 4.1147
## valence_z tempo_z
## Min. :-2.16309 Min. :-3.97182
## 1st Qu.:-0.79375 1st Qu.:-0.76864
## Median : 0.05431 Median :-0.03318
## Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.84013 3rd Qu.: 0.60273
## Max. : 1.72709 Max. : 4.19653
# find column names containing "_z"
z_colnames <- grep("_z", colnames(df), value = TRUE)
# select columns not containing "_z"
df <- df[, !colnames(df) %in% z_colnames]
str(df)
## 'data.frame': 446475 obs. of 9 variables:
## $ danceability : num 0.645 0.695 0.434 0.321 0.402 0.227 0.51 0.563 0.488 0.548 ...
## $ energy : num 0.445 0.263 0.177 0.0946 0.158 0.261 0.355 0.184 0.475 0.0391 ...
## $ loudness : num -13.3 -22.1 -21.2 -28 -16.9 ...
## $ speechiness : num 0.451 0.957 0.0512 0.0504 0.039 0.0382 0.124 0.0512 0.0399 0.153 ...
## $ acousticness : num 0.674 0.797 0.994 0.995 0.989 0.994 0.965 0.993 0.62 0.996 ...
## $ instrumentalness: num 7.44e-01 0.00 2.18e-02 9.18e-01 1.30e-01 2.47e-01 0.00 1.55e-05 6.45e-03 9.33e-01 ...
## $ liveness : num 0.151 0.148 0.212 0.104 0.311 0.0977 0.155 0.325 0.107 0.148 ...
## $ valence : num 0.127 0.655 0.457 0.397 0.196 0.0539 0.727 0.654 0.544 0.612 ...
## $ tempo : num 105 102 130 170 103 ...
The plot shows histograms for each variable in the dataset, which is useful for visualizing the distribution of my variables. From the histograms, we can see that some of the variables are normally distributed, such as “acousticness” and “instrumentalness”, while others have a skewed distribution, such as “energy” and “loudness”. In addition, we can also see that some of the variables have a large number of values close to zero, such as “speechiness” and “instrumentalness”. This indicates that these variables may be highly skewed and may require special handling or transformation before they can be used in analysis. However, it is because most of the songs have similar values of speechiness and intrumentalness, so I wouldn’t worry about this particular distributions, as they indicate the reallity accurately. The histograms suggest that different variables may require different data pre-processing techniques depending on their distribution. For this, I wil turn the variables into categorica, using firstly clustering method to find the optimal number of categories.
library(ggplot2)
ggplot(df, aes(x = danceability)) +
geom_histogram(binwidth = 0.05, color = "black", fill = "lightblue") +
labs(title = "Distribution of Danceability")
ggplot(df, aes(x = energy)) +
geom_histogram(binwidth = 0.05, color = "black", fill = "lightblue") +
labs(title = "Distribution of Energy")
ggplot(df, aes(x = loudness)) +
geom_histogram(binwidth = 0.5, color = "black", fill = "lightblue") +
labs(title = "Distribution of Loudness")
ggplot(df, aes(x = speechiness)) +
geom_histogram(binwidth = 0.05, color = "black", fill = "lightblue") +
labs(title = "Distribution of Speechiness")
ggplot(df, aes(x = acousticness)) +
geom_histogram(binwidth = 0.05, color = "black", fill = "lightblue") +
labs(title = "Distribution of Acousticness")
ggplot(df, aes(x = instrumentalness)) +
geom_histogram(binwidth = 0.05, color = "black", fill = "lightblue") +
labs(title = "Distribution of Instrumentalness")
ggplot(df, aes(x = liveness)) +
geom_histogram(binwidth = 0.05, color = "black", fill = "lightblue") +
labs(title = "Distribution of Liveness")
ggplot(df, aes(x = valence)) +
geom_histogram(binwidth = 0.05, color = "black", fill = "lightblue") +
labs(title = "Distribution of Valence")
ggplot(df, aes(x = tempo)) +
geom_histogram(binwidth = 10, color = "black", fill = "lightblue") +
labs(title = "Distribution of Tempo")
# Load required library
library(RColorBrewer)
# Define the color palette
mypal <- brewer.pal(9, "Blues")
# Subset the first 4 colors from the palette
mycols <- mypal[1:4]
# Create histograms for each variable
par(mfrow=c(3,3))
for (i in 1:9) {
hist(df[,i], main = colnames(df)[i], xlab = "", col = mycols)
}
The code for variablity calculates and prints the variances of the 9 variables in my dataframe.
The values indicate the degree of variability or spread in the data for each variable. For example, the variance of loudness is much higher than the variances of the other variables, which suggests that there is a wide range of loudness values in the data. In contrast, the variance of liveness is much lower than the variances of the other variables, which suggests that there is less variability in the liveness values in the data. Variables with low variance may not provide much information and may not be as useful in generating association rules. However, for better range of feature I decided to include all the 9 variables.
# Set numeric output format
options(scipen = 999)
# Calculate variances and print them without scientific notation
variances <- apply(df[,1:9], 2, var)
print(variances)
## danceability energy loudness speechiness
## 0.02772659 0.06347506 26.31088561 0.03999337
## acousticness instrumentalness liveness valence
## 0.12146366 0.07081435 0.03596980 0.06607852
## tempo
## 888.69963461
Converting the variables to categorical using binning can help to capture non-linear relationships between the variables and improve the accuracy of my association rules. To appropriately chose the number of categories, I performed clustering. To do this, I firstly standardized (centers and scales) all the variables in my dataframe ussing lappy() function. Centering means subtracting the mean of each variable from its values, so that each variable has a mean of zero. Scaling means dividing each variable by its standard deviation, so that each variable has a standard deviation of one. Standardizing variables can be useful for clustering algorithms, as it ensures that all variables are treated equally and have the same impact on the clustering result, regardless of their original scale.
The clustering was done using hierarchical clustering method with complete linkage criterion. I plotted two dendrograms with different numbers of clusters, i.e., k = 3 and k = 4, but I displayed them on one plot for better comparison. The green lines indicate 4 clusters and blue indicate 3 clusters.
The cutree() function is used to cut the other two plots at a specific number of clusters. The output of the plots suggests that the data can be grouped into either three or four distinct clusters based on their similarity. The “fviz_cluster” function is used to visualize the clustering results in two dimensions using the first two principal components of the data. The clusters are represented by different colors, and the plots suggest that the clusters identified by the algorithm are fairly distinct from each other. Both the dendrogram and the scatterplot matrix suggest that 3 or 4 clusters could be reasonable choices, but further analysis and comparison of different clustering solutions would be necessary to make a final determination. I used elbow method and gap statistic method to help me in choosing the optimal number of clusters.
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# clustering
df <- as.data.frame(lapply(df, scale))
distance.m<-dist(t(df))
hc<-hclust(distance.m, method="complete")
plot(hc, hang=-1)
rect.hclust(hc, k = 3, border='blue')
rect.hclust(hc, k = 4, border='green')
sub_grp<-cutree(hc, k=3)
fviz_cluster(list(data = distance.m, cluster = sub_grp), palette=c("aquamarine4", "cyan3", "darkblue", "darkgreen", "brown" ))
sub_grp<-cutree(hc, k=4)
fviz_cluster(list(data = distance.m, cluster = sub_grp), palette=c("aquamarine4", "cyan3", "darkblue", "darkgreen", "brown" ))
Here, I determined the optimal number of clusters for the dataset by using the elbow method. First, I selected 10% of the rows from the dataset randomly, because there was no computational memory to perform elbow method on the whole dataset. Then I scaled the numeric data. The elbow method was then applied by calculating the within-cluster sum of squares (WSS) for the number of clusters ranging from 1 to 10. Finally, the plot of the number of clusters versus WSS was displayed to visualize the “elbow” point, which indicates the optimal number of clusters. The elbow point is the point of inflection where the WSS starts to level off, suggesting that adding more clusters does not improve the model significantly. In this plot, the optimal number of clusters appears to be between 2 and 4, with a slight elbow at 3.
# finding optimal number of clusters
library(cluster)
# Randomly select 10% of the rows from the dataset
df_sampled <- df[sample(nrow(df), nrow(df)*0.1), ]
# Scale the numeric data
df_scaled <- scale(df_sampled)
# Determine the optimal number of clusters using the elbow method
set.seed(123)
wss <- c()
for(i in 1:10){
kmeans_fit <- kmeans(df_scaled, centers = i, nstart = 25)
wss[i] <- kmeans_fit$tot.withinss
}
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2232350)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2232350)
## Warning: did not converge in 10 iterations
plot(1:10, wss, type = "b", xlab = "Number of clusters", ylab = "Within-cluster sum of squares")
The gap statistic method is a statistical approach for determining the optimal number of clusters in a dataset. It compares the total within-cluster variation for different values of k number of clusters with their expected values under null reference distribution of the data. If the gap statistic is large for a certain value of k, it suggests that k is a good number of clusters for the data.
The steps involved in the gap statistic method are as follows: 1. Clustering the original dataset for a range of k values. 2. For each k value, generating B reference datasets, each of the same size as the original dataset, by sampling from a uniform distribution over the range of the original dataset. 3. Clustering each reference dataset for the same range of k values as the original dataset. 4. Calculating the gap statistic for each value of k as the difference between the expected log of the within-cluster sum of squares for the reference data and the log of the within-cluster sum of squares for the original data. 5. Selecting the k value with the largest gap statistic as the optimal number of clusters.
The code for this method is displayed below, but it worked only in the console of .R file, and running it in .Rmd file used too much of computing memory and it could not give the outcome of the plot. This is why I displayed here only the image of the plot which was the outcome of the code below.
The graph suggests that the optimal number of clusters is 4, which is the same outcome as from the Elbow method, so I will use this number of clusters as optimal for converting my data to categorical variables, with 4 categories.
# Randomly select 1% of the rows from the dataset (grater part damaged R memory)
# df_sampled <- df[sample(nrow(df), nrow(df)*0.01), ]
# Determine the optimal number of clusters using the gap statistic method
# set.seed(123)
# gap <- clusGap(df_scaled, kmeans, K.max = 10, B = 50)
# Plot the results
# plot(gap, main = "Gap Statistic for KMeans Clustering", xlab = "Number of Clusters")
# Based on the plot, choose the number of clusters for which the gap is the largest: 4
library(raster)
## Loading required package: sp
image <- stack("/Users/annaczarnocka/Desktop/UL3/plots/gap_statistic_clustering.jpeg")
## Warning: [rast] unknown extent
plotRGB(image)
## Warning: [rast] unknown extent
## Warning: [rast] unknown extent
## Warning: [rast] unknown extent
Using the cut() function, I changed my data from numeric to categorical, with 4 categories, as indicated by my clustering analysis. The output of this change is displayed below by the str() function.
I also created histograms of converted variables, to see their distribution:
Overall, the distributions of the variables appear to be suitable for performing association rule analysis, as there is sufficient variation in the data to identify patterns and relationships between the variables.
# Convert numeric variables to categorical using binning, according to optimal number of clusters
df$danceability <- cut(df$danceability, breaks = 4)
df$energy <- cut(df$energy, breaks = 4)
df$loudness <- cut(df$loudness, breaks = 4)
df$speechiness <- cut(df$speechiness, breaks = 4)
df$acousticness <- cut(df$acousticness, breaks = 4)
df$instrumentalness <- cut(df$instrumentalness, breaks = 4)
df$liveness <- cut(df$liveness, breaks = 4)
df$valence <- cut(df$valence, breaks = 4)
df$tempo <- cut(df$tempo, breaks = 4)
str(df)
## 'data.frame': 446475 obs. of 9 variables:
## $ danceability : Factor w/ 4 levels "(-3.4,-1.91]",..: 3 3 2 2 2 1 3 3 2 3 ...
## $ energy : Factor w/ 4 levels "(-2.16,-1.16]",..: 2 2 1 1 1 2 2 1 2 1 ...
## $ loudness : Factor w/ 4 levels "(-9.7,-6.5]",..: 3 3 3 2 3 3 3 3 3 3 ...
## $ speechiness : Factor w/ 4 levels "(-0.59,0.628]",..: 2 4 1 1 1 1 1 1 1 1 ...
## $ acousticness : Factor w/ 4 levels "(-1.3,-0.582]",..: 3 4 4 4 4 4 4 4 3 4 ...
## $ instrumentalness: Factor w/ 4 levels "(-0.427,0.516]",..: 3 1 1 4 1 1 1 1 1 4 ...
## $ liveness : Factor w/ 4 levels "(-1.16,0.16]",..: 1 1 1 1 2 1 1 2 1 1 ...
## $ valence : Factor w/ 4 levels "(-2.17,-1.19]",..: 1 3 2 2 1 1 3 3 3 3 ...
## $ tempo : Factor w/ 4 levels "(-3.98,-1.93]",..: 2 2 3 3 2 2 2 3 3 2 ...
library(ggplot2)
mypal <- brewer.pal(9, "Blues")
ggplot(df, aes(x = danceability)) +
geom_bar(fill = "lightblue") +
labs(title = "Distribution of Danceability")
ggplot(df, aes(x = energy)) +
geom_bar(fill = "lightblue") +
labs(title = "Distribution of Energy")
ggplot(df, aes(x = loudness)) +
geom_bar(fill = "lightblue") +
labs(title = "Distribution of Loudness")
ggplot(df, aes(x = speechiness)) +
geom_bar(fill = "lightblue") +
labs(title = "Distribution of Speechiness")
ggplot(df, aes(x = acousticness)) +
geom_bar(fill = "lightblue") +
labs(title = "Distribution of Acousticness")
ggplot(df, aes(x = instrumentalness)) +
geom_bar(fill = "lightblue") +
labs(title = "Distribution of Instrumentalness")
ggplot(df, aes(x = liveness)) +
geom_bar(fill = "lightblue") +
labs(title = "Distribution of Liveness")
ggplot(df, aes(x = valence)) +
geom_bar(fill = "lightblue") +
labs(title = "Distribution of Valence")
ggplot(df, aes(x = tempo)) +
geom_bar(fill = "lightblue") +
labs(title = "Distribution of Tempo")
########### ASSOCIATION RULES #####################################################
# Install the package if not yet installed
packages <- c("tidyverse", "readr", "arules", "arulesViz", "gridExtra", "RColorBrewer")
# Load packages if not already loaded
for (package in packages) {
if (!require(package, character.only = TRUE)) {
install.packages(package)
library(package, character.only = TRUE)
}
}
Converting the data frame into a transaction object The trans <- as(df, “transactions”) line of code is used to convert the data frame df into a transaction object, which is a specific data format needed for association rule mining using the apriori algorithm. In order to identify potential hit songs on Spotify, a transaction object was created where each row represents a single track and contains a set of song features (e.g., danceability, energy, and tempo) that were present in that track. This allows the association rules algorithm to analyze the co-occurrence of these song features and identify the most common patterns that are associated with popular tracks on Spotify. This format is used to find frequent itemsets (here sets of song features that frequently appear together) and generate association rules, so the rules that describe the relationships between the items in the frequent itemsets.
Generating association rules Then I generated association rules from frequent itemsets using the Apriori algorithm. The ruleParameters variable is used to set the support, confidence, and maximum length of the rules.
The apriori function was applied to the trans transaction object with these parameters. The resulting associationRules object contains the generated association rules.
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
# Convert the data frame to a transaction object
trans <- as(df, "transactions")
# generating the association rules from frequent itemsets
ruleParameters <- list(supp = 0.05, conf = 0.5, maxlen = 2)
associationRules <- apriori(trans, parameter = ruleParameters, control = list(verbose = FALSE))
The summary provides information about the rules generated from the apriori algorithm with the given parameters.
The plot shows the relationships between the support, confidence, and lift measures for the association rules generated by the apriori algorithm. Each point on the plot represents a rule, and the size of the point indicates the number of transactions that support the rule. The color of the point indicates the level of lift, with blue indicating low lift and red indicating high lift.
From the plot, you can see that the majority of the rules have relatively low support and confidence values, indicating that they are not very strong rules. However, there are a few rules with high support and confidence, which are represented by larger points on the plot. Additionally, there are a few rules with high lift, which are represented by red points on the plot. These rules may be particularly interesting, as they suggest that certain itemsets are more strongly associated with each other than would be expected by chance.
library(arulesViz)
# summarise and plot associationRules
summary(associationRules)
## set of 138 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2
## 6 132
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 2.000 1.957 2.000 2.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.05238 Min. :0.5039 Min. :0.08503 Min. :0.7882
## 1st Qu.:0.14043 1st Qu.:0.5860 1st Qu.:0.20905 1st Qu.:0.9849
## Median :0.20743 Median :0.6997 Median :0.28433 Median :1.0246
## Mean :0.26955 Mean :0.7165 Mean :0.38196 Mean :1.0983
## 3rd Qu.:0.34766 3rd Qu.:0.8578 3rd Qu.:0.54042 3rd Qu.:1.0933
## Max. :0.89530 Max. :0.9793 Max. :1.00000 Max. :2.9181
## count
## Min. : 23385
## 1st Qu.: 62700
## Median : 92612
## Mean :120346
## 3rd Qu.:155223
## Max. :399728
##
## mining info:
## data ntransactions support confidence
## trans 446475 0.05 0.5
## call
## apriori(data = trans, parameter = ruleParameters, control = list(verbose = FALSE))
plot(associationRules, col="blue")
This plot shows a visual representation of the association rules in a graph format. Each node in the graph represents an item or set of items, and the size of the node indicates its support. The thickness of the edges between the nodes represents the strength of the association, with thicker edges indicating higher confidence and/or lift values. The shading of the edges indicates the lift value, with lighter shading indicating higher lift values. Overall, this plot allows to easily identify which items are frequently purchased together and which associations are the strongest.
By looking at the plot, we can see that there are several groups of items that are strongly associated with each other, as they appear in clusters that are closer together. The size of the nodes indicates the frequency of the itemsets in the data, and the thickness of the edges represents the strength of the association between the items. From the plot, it can be inferred that the rules can be applied to a substantial amount of inputs, and there are some strong associations between certain groups of items. From this plot it is hard to see which groups have the the strongest associations.
# plot associationRules
mypal <- brewer.pal(9, "Blues")
set.seed(123)
plot(associationRules, method = "graph", measure = "support", shading = "lift", main = "Association Rules Graph",
control = list(type = "items", border = "black", col = mypal))
## Warning: Unknown control parameters: type, border, main
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
## Warning: Too many rules supplied. Only plotting the best 100 using
## 'lift' (change control parameter max if needed).
The matrix plot can help us to visually identify the most interesting and useful rules from the association rule analysis.
The plot has two axes: the horizontal axis represents the support of the rule, and the vertical axis represents the confidence of the rule. Each rule is represented by a square in the plot, with the size of the square indicating the lift of the rule. The color of the square indicates the value of the lift, with brighter colors indicating higher values of lift. For my plot, most rules has lift smaller than 2.
The plot helps us to identify the most interesting rules based on their support, confidence, and lift. The rules that are located in the upper right corner of the plot have high support, high confidence, and high lift, which means that they are strong rules that are applicable to a large number of transactions and have a high likelihood of being correct. On the other hand, the rules that are located in the lower left corner of the plot have low support, low confidence, and low lift, which means that they are weak rules that are not very useful.
The plot shows the lift values for each rule, which indicates how strongly the antecedent and consequent are associated beyond what would be expected by chance. Rules with a lift value greater than 1 indicate a positive association between the antecedent and consequent, while values less than 1 indicate a negative association.
library(arulesViz)
mypal <- rev(brewer.pal(9, "Blues"))
plot(associationRules, method = "matrix", measure = c("support", "confidence"), shading = "lift", col = mypal)
## Itemsets in Antecedent (LHS)
## [1] "{energy=(-2.16,-1.16]}" "{instrumentalness=(2.39,3.34]}"
## [3] "{energy=(0.821,1.82]}" "{danceability=(1.07,2.56]}"
## [5] "{acousticness=(0.847,1.56]}" "{energy=(-1.16,-0.171]}"
## [7] "{valence=(-2.17,-1.19]}" "{acousticness=(-1.3,-0.582]}"
## [9] "{energy=(-0.171,0.821]}" "{loudness=(-0.13,3.07]}"
## [11] "{acousticness=(-0.582,0.133]}" "{valence=(0.755,1.73]}"
## [13] "{tempo=(0.112,2.15]}" "{valence=(-0.218,0.755]}"
## [15] "{acousticness=(0.133,0.847]}" "{instrumentalness=(-0.427,0.516]}"
## [17] "{liveness=(0.16,1.48]}" "{danceability=(-0.419,1.07]}"
## [19] "{}" "{liveness=(-1.16,0.16]}"
## [21] "{speechiness=(-0.59,0.628]}" "{loudness=(-3.32,-0.13]}"
## [23] "{valence=(-1.19,-0.218]}" "{tempo=(-1.93,0.112]}"
## [25] "{danceability=(-1.91,-0.419]}"
## Itemsets in Consequent (RHS)
## [1] "{instrumentalness=(-0.427,0.516]}" "{speechiness=(-0.59,0.628]}"
## [3] "{liveness=(-1.16,0.16]}" "{danceability=(-0.419,1.07]}"
## [5] "{tempo=(-1.93,0.112]}" "{loudness=(-0.13,3.07]}"
## [7] "{tempo=(0.112,2.15]}" "{acousticness=(-1.3,-0.582]}"
## [9] "{loudness=(-3.32,-0.13]}" "{valence=(0.755,1.73]}"
## [11] "{acousticness=(0.847,1.56]}"
The head and tail of the associationRules data frame show different rules and values, as they represent the first and last few rows of the data frame respectively.
The rules in the head of associationRules are the most general rules, with an empty antecedent (i.e., {}) and a consequent that specifies a range of values for a particular feature. Since the antecedent is empty, the coverage and lift values for these rules are both 1, which means that they cover the entire dataset and there is no association between the antecedent and consequent.
In contrast, the rules in the tail of associationRules have more specific antecedents that correspond to smaller subsets of the data. As a result, their coverage and lift values are lower than 1, indicating that they cover only a portion of the data and there is some association between the antecedent and consequent.
For example, the last rule in the tail of associationRules has an antecedent of {speechiness=(-0.59,0.628]} and a consequent of {instrumentalness=(-0.427,0.516]}. The coverage of this rule is 0.8952976, which means that almost 90% of the data that has a speechiness value within the range of (-0.59,0.628]. The lift of this rule is 0.9848941, which means that the probability of the consequent occurring is only 0.9848941 times higher when the antecedent occurs, compared to when it does not occur. This suggests that there is a weak association between speechiness and instrumentalness for this subset of the data.
associationRules <- as(associationRules, "data.frame")
head(associationRules, )
## rules support confidence coverage lift
## 1 {} => {danceability=(-0.419,1.07]} 0.5404177 0.5404177 1 1
## 2 {} => {tempo=(-1.93,0.112]} 0.5491618 0.5491618 1 1
## 3 {} => {loudness=(-0.13,3.07]} 0.6243642 0.6243642 1 1
## 4 {} => {liveness=(-1.16,0.16]} 0.7043149 0.7043149 1 1
## 5 {} => {instrumentalness=(-0.427,0.516]} 0.8559964 0.8559964 1 1
## 6 {} => {speechiness=(-0.59,0.628]} 0.8952976 0.8952976 1 1
## count
## 1 241283
## 2 245187
## 3 278763
## 4 314459
## 5 382181
## 6 399728
tail(associationRules, )
## rules support
## 133 {liveness=(-1.16,0.16]} => {instrumentalness=(-0.427,0.516]} 0.5956011
## 134 {instrumentalness=(-0.427,0.516]} => {liveness=(-1.16,0.16]} 0.5956011
## 135 {liveness=(-1.16,0.16]} => {speechiness=(-0.59,0.628]} 0.6524553
## 136 {speechiness=(-0.59,0.628]} => {liveness=(-1.16,0.16]} 0.6524553
## 137 {instrumentalness=(-0.427,0.516]} => {speechiness=(-0.59,0.628]} 0.7547948
## 138 {speechiness=(-0.59,0.628]} => {instrumentalness=(-0.427,0.516]} 0.7547948
## confidence coverage lift count
## 133 0.8456460 0.7043149 0.9879084 265921
## 134 0.6957986 0.8559964 0.9879084 265921
## 135 0.9263688 0.7043149 1.0347049 291305
## 136 0.7287581 0.8952976 1.0347049 291305
## 137 0.8817733 0.8559964 0.9848941 336997
## 138 0.8430658 0.8952976 0.9848941 336997
This code is performed to extract the left and right side items from the association rules and to clean up the resulting data frame and for the better readability and interpretation of my association rules.
First, I created two new columns in the data frame associationRules called “rules_left” and “rules_right”. These columns are populated with the left and right side items of each association rule using the strsplit() function to split the original “rules” column by the “=>” character. Then, the original “rules” column was removed using associationRules$rules <- NULL.
Next, I used the separate() function to split the “rules_left” and “rules_right” columns into separate columns for the variable name and range. The into argument specifies the new column names, while the sep argument specifies the regular expression used to separate the variable name from the range.
Finally, the gsub() function is used to remove unnecessary characters (in this case, curly braces and equals signs) from the “left_var” and “right_var” columns, cleaning up the resulting data frame. The result is shown by tail() and str() functions.
#extract the left and right side items from the association rules
library(tidyr)
##
## Attaching package: 'tidyr'
## The following objects are masked from 'package:Matrix':
##
## expand, pack, unpack
## The following object is masked from 'package:raster':
##
## extract
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:arules':
##
## intersect, recode, setdiff, setequal, union
## The following objects are masked from 'package:raster':
##
## intersect, select, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
associationRules[c("rules_left", "rules_right")] <-
t(sapply(associationRules$rules, function(x) strsplit(x, " => ")[[1]]))
associationRules$rules <- NULL
tail(associationRules, )
## support confidence coverage lift count
## 133 0.5956011 0.8456460 0.7043149 0.9879084 265921
## 134 0.5956011 0.6957986 0.8559964 0.9879084 265921
## 135 0.6524553 0.9263688 0.7043149 1.0347049 291305
## 136 0.6524553 0.7287581 0.8952976 1.0347049 291305
## 137 0.7547948 0.8817733 0.8559964 0.9848941 336997
## 138 0.7547948 0.8430658 0.8952976 0.9848941 336997
## rules_left rules_right
## 133 {liveness=(-1.16,0.16]} {instrumentalness=(-0.427,0.516]}
## 134 {instrumentalness=(-0.427,0.516]} {liveness=(-1.16,0.16]}
## 135 {liveness=(-1.16,0.16]} {speechiness=(-0.59,0.628]}
## 136 {speechiness=(-0.59,0.628]} {liveness=(-1.16,0.16]}
## 137 {instrumentalness=(-0.427,0.516]} {speechiness=(-0.59,0.628]}
## 138 {speechiness=(-0.59,0.628]} {instrumentalness=(-0.427,0.516]}
library(tidyr)
associationRules <- associationRules %>%
separate(rules_left, into = c("left_var", "left_range"), sep = "\\(|\\]") %>%
separate(rules_right, into = c("right_var", "right_range"), sep = "\\(|\\]")
## Warning: Expected 2 pieces. Additional pieces discarded in 132 rows [7, 8, 9, 10, 11,
## 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, ...].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 6 rows [1, 2, 3, 4, 5,
## 6].
## Warning: Expected 2 pieces. Additional pieces discarded in 138 rows [1, 2, 3, 4, 5, 6,
## 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
# Remove unnecessary characters from "left_var" and "right_var" columns
associationRules$left_var <- gsub("[{}=]", "", associationRules$left_var)
associationRules$right_var <- gsub("[{}=]", "", associationRules$right_var)
tail(associationRules, )
## support confidence coverage lift count left_var
## 133 0.5956011 0.8456460 0.7043149 0.9879084 265921 liveness
## 134 0.5956011 0.6957986 0.8559964 0.9879084 265921 instrumentalness
## 135 0.6524553 0.9263688 0.7043149 1.0347049 291305 liveness
## 136 0.6524553 0.7287581 0.8952976 1.0347049 291305 speechiness
## 137 0.7547948 0.8817733 0.8559964 0.9848941 336997 instrumentalness
## 138 0.7547948 0.8430658 0.8952976 0.9848941 336997 speechiness
## left_range right_var right_range
## 133 -1.16,0.16 instrumentalness -0.427,0.516
## 134 -0.427,0.516 liveness -1.16,0.16
## 135 -1.16,0.16 speechiness -0.59,0.628
## 136 -0.59,0.628 liveness -1.16,0.16
## 137 -0.427,0.516 speechiness -0.59,0.628
## 138 -0.59,0.628 instrumentalness -0.427,0.516
str(associationRules)
## 'data.frame': 138 obs. of 9 variables:
## $ support : num 0.54 0.549 0.624 0.704 0.856 ...
## $ confidence : num 0.54 0.549 0.624 0.704 0.856 ...
## $ coverage : num 1 1 1 1 1 ...
## $ lift : num 1 1 1 1 1 ...
## $ count : int 241283 245187 278763 314459 382181 399728 23385 23643 29608 37180 ...
## $ left_var : chr "" "" "" "" ...
## $ left_range : chr NA NA NA NA ...
## $ right_var : chr "danceability" "tempo" "loudness" "liveness" ...
## $ right_range: chr "-0.419,1.07" "-1.93,0.112" "-0.13,3.07" "-1.16,0.16" ...
I also generated association rules from frequent itemsets. This way I could identify the most frequent itemsets of song features, which can be used to gain insights into the common patterns of features that occur together in popular songs. The output of “print(sort(freq_items, decreasing = TRUE))” shows the frequent itemsets sorted in decreasing order. Each line represents an itemset and the value in the right column is its support. The itemsets are defined by their range values for the different audio features.
For example, the first line “speechiness=(-0.000971,0.243] instrumentalness=(-0.001,0.25]” represents the itemset of songs with speechiness in the range of (-0.000971,0.243] and instrumentalness in the range of (-0.001,0.25], and its support is 0.8952976090. This means that almost 90% of the songs in the dataset have these two features falling in this range.
Similarly, the second line shows an itemset of songs with instrumentalness in the range of (-0.001,0.25], liveness in the range of (-0.001,0.25], and its support is 0.8559964164. This means that almost 85% of the songs in the dataset have these two features falling in this range.
The rest of the lines can be interpreted in a similar way to understand the frequent itemsets and their support values.
# Identify frequent itemsets
freq_items <- itemFrequency(trans)
# Print frequent itemsets
print(sort(freq_items, decreasing = TRUE))
## speechiness=(-0.59,0.628] instrumentalness=(-0.427,0.516]
## 0.8952976090 0.8559964164
## liveness=(-1.16,0.16] loudness=(-0.13,3.07]
## 0.7043149112 0.6243641861
## tempo=(-1.93,0.112] danceability=(-0.419,1.07]
## 0.5491617672 0.5404177166
## tempo=(0.112,2.15] acousticness=(-1.3,-0.582]
## 0.4211523602 0.3806596114
## loudness=(-3.32,-0.13] energy=(-0.171,0.821]
## 0.3675905706 0.3158833081
## valence=(-0.218,0.755] energy=(-1.16,-0.171]
## 0.3099188084 0.2843294697
## danceability=(-1.91,-0.419] acousticness=(0.847,1.56]
## 0.2821792934 0.2811803572
## valence=(0.755,1.73] valence=(-1.19,-0.218]
## 0.2754801501 0.2634458816
## energy=(0.821,1.82] liveness=(0.16,1.48]
## 0.2493510275 0.2090464192
## acousticness=(0.133,0.847] acousticness=(-0.582,0.133]
## 0.1736603393 0.1644996920
## valence=(-2.17,-1.19] energy=(-2.16,-1.16]
## 0.1511551599 0.1504361946
## danceability=(1.07,2.56] instrumentalness=(2.39,3.34]
## 0.1375978498 0.0850327566
## liveness=(1.48,2.8] speechiness=(3.06,4.27]
## 0.0535685089 0.0479623719
## speechiness=(0.628,1.84] danceability=(-3.4,-1.91]
## 0.0475480150 0.0398051403
## liveness=(2.8,4.12] instrumentalness=(1.46,2.39]
## 0.0330701607 0.0305011479
## instrumentalness=(0.516,1.46] tempo=(2.15,4.2]
## 0.0284696792 0.0227067585
## speechiness=(1.84,3.06] loudness=(-6.5,-3.32]
## 0.0091920040 0.0079287754
## tempo=(-3.98,-1.93] loudness=(-9.7,-6.5]
## 0.0069791142 0.0001164679
Using the sot() function and head( ,10), I displayed the top 10 most often occuring song features. The results are displayed below, with ranges for variables in brackets, and after that there is a level of support, so the share of songs that have this feature’s value falling into a given range. For example, 86% of songs has instrumentalness in the range of (-0.427, 0.516]. The values from the table are also summarised in the barplot, but only three x-axis descriptions could be displayed, because when I set these descriptions to vertical, they were beneath the edge of the plot and not visible.
# Sort items in decreasing order of frequency
sorted_items <- sort(freq_items, decreasing = TRUE)
# Print the top 10 most frequent itemsets
head(sorted_items, 10)
## speechiness=(-0.59,0.628] instrumentalness=(-0.427,0.516]
## 0.8952976 0.8559964
## liveness=(-1.16,0.16] loudness=(-0.13,3.07]
## 0.7043149 0.6243642
## tempo=(-1.93,0.112] danceability=(-0.419,1.07]
## 0.5491618 0.5404177
## tempo=(0.112,2.15] acousticness=(-1.3,-0.582]
## 0.4211524 0.3806596
## loudness=(-3.32,-0.13] energy=(-0.171,0.821]
## 0.3675906 0.3158833
# Create a barplot of the 10 most frequent itemsets
barplot(sorted_items[1:10], xlab = "Itemset", ylab = "Frequency", main = "Top 10 Frequent Itemsets", col = mypal)
Then I identified frequent itemsets with high lift values to determine the most common and strongest associations between song features. These frequent itemsets can help to identify potential hit songs by providing insights into the preferences of the listeners.
These are the top 10 association rules with the highest lift values in the data. The lift value measures the strength of association between two items, and a lift value greater than 1 indicates a positive association between the items.
Here’s how to interpret the output: -
support: the proportion of transactions
containing both items in the rule -
confidence: the proportion of transactions
containing the left-hand side item that also contain the right-hand side
item - coverage: the proportion of
transactions containing the left-hand side item -
lift: the measure of how much more often
the two items appear together than if they were statistically
independent (a lift of 1 means there is no association) -
count: the number of transactions
containing both items in the rule -
rules_left: the item or items on the
left-hand side of the rule (antecedent) -
rules_right: the item or items on the
right-hand side of the rule (consequent)
For example, the first rule with the highest lift has a support of 0.1239, which means that 12.39% of the transactions contain both items in the rule, and a confidence of 0.8191, which means that 81.91% of the transactions containing the left-hand side item also contain the right-hand side item. The lift value for this rule is 2.9216, which is greater than 1, indicating a positive association between the left-hand side item “energy=(-0.001,0.25]” and the right-hand side item “acousticness=(0.747,0.997]”. This suggests that songs with low energy tend to have high acousticness. Similarly, the other rules can be interpreted in a similar way.
# Sort associationRules by lift in decreasing order
sorted_rules <- associationRules[order(-associationRules$lift),]
# Print the top 10 rules with the highest lift values
head(sorted_rules, 10)
## support confidence coverage lift count left_var
## 18 0.12343356 0.8205044 0.15043619 2.918072 55110 energy
## 19 0.13398735 0.8906590 0.15043619 2.422965 59822 energy
## 7 0.05237695 0.6159621 0.08503276 2.190630 23385 instrumentalness
## 46 0.18923344 0.7589038 0.24935103 1.993655 84488 energy
## 11 0.07345092 0.5338086 0.13759785 1.937739 32794 danceability
## 65 0.18368106 0.6532500 0.28118036 1.777113 82009 acousticness
## 8 0.05295481 0.6227578 0.08503276 1.694162 23643 instrumentalness
## 74 0.16171118 0.5687458 0.28432947 1.547226 72200 energy
## 49 0.23830002 0.9556809 0.24935103 1.530647 106395 energy
## 24 0.08085559 0.5349178 0.15115516 1.455200 36100 valence
## left_range right_var right_range
## 18 -2.16,-1.16 acousticness 0.847,1.56
## 19 -2.16,-1.16 loudness -3.32,-0.13
## 7 2.39,3.34 acousticness 0.847,1.56
## 46 0.821,1.82 acousticness -1.3,-0.582
## 11 1.07,2.56 valence 0.755,1.73
## 65 0.847,1.56 loudness -3.32,-0.13
## 8 2.39,3.34 loudness -3.32,-0.13
## 74 -1.16,-0.171 loudness -3.32,-0.13
## 49 0.821,1.82 loudness -0.13,3.07
## 24 -2.17,-1.19 loudness -3.32,-0.13
Support and confidence are other measures that can be used to evaluate the strength of association between the antecedent and consequent of a rule. I identified the rules with the highest support and confidence values to determine the most common combinations of song features that co-occur in popular tracks on Spotify.
The rules with the highest support values indicate the most frequent combinations of song features in the data set, while the rules with the highest confidence values indicate the strongest associations between the left-hand and right-hand side features.
From the output, we can see that the rule with the highest support
value is: - instrumentalness ->
speechiness with support 0.8952976.
The rule with the highest confidence value is: -
instrumentalness (range 2.39,3.34) ->
speechiness (range -0.59,0.628) with
confidence 0.9793231.
These rules suggest that instrumentalness and speechiness are the most frequent features in popular tracks on Spotify and that when a song has a high instrumentalness value within the range of 2.39,3.34, it is highly likely to have a high speechiness value within the range of -0.59,0.628.
# Sort associationRules by support in decreasing order
sorted_rules_support <- associationRules[order(-associationRules$support),]
# Print the top 10 rules with the highest support values
head(sorted_rules_support, 10)
## support confidence coverage lift count left_var
## 6 0.8952976 0.8952976 1.0000000 1.0000000 399728
## 5 0.8559964 0.8559964 1.0000000 1.0000000 382181
## 137 0.7547948 0.8817733 0.8559964 0.9848941 336997 instrumentalness
## 138 0.7547948 0.8430658 0.8952976 0.9848941 336997 speechiness
## 4 0.7043149 0.7043149 1.0000000 1.0000000 314459
## 135 0.6524553 0.9263688 0.7043149 1.0347049 291305 liveness
## 136 0.6524553 0.7287581 0.8952976 1.0347049 291305 speechiness
## 3 0.6243642 0.6243642 1.0000000 1.0000000 278763
## 133 0.5956011 0.8456460 0.7043149 0.9879084 265921 liveness
## 134 0.5956011 0.6957986 0.8559964 0.9879084 265921 instrumentalness
## left_range right_var right_range
## 6 <NA> speechiness -0.59,0.628
## 5 <NA> instrumentalness -0.427,0.516
## 137 -0.427,0.516 speechiness -0.59,0.628
## 138 -0.59,0.628 instrumentalness -0.427,0.516
## 4 <NA> liveness -1.16,0.16
## 135 -1.16,0.16 speechiness -0.59,0.628
## 136 -0.59,0.628 liveness -1.16,0.16
## 3 <NA> loudness -0.13,3.07
## 133 -1.16,0.16 instrumentalness -0.427,0.516
## 134 -0.427,0.516 liveness -1.16,0.16
# Sort associationRules by confidence in decreasing order
sorted_rules_confidence <- associationRules[order(-associationRules$confidence),]
# Print the top 10 rules with the highest confidence values
head(sorted_rules_confidence, 10)
## support confidence coverage lift count left_var
## 10 0.08327454 0.9793231 0.08503276 1.093852 37180 instrumentalness
## 73 0.27422140 0.9717985 0.28217929 1.085447 122433 danceability
## 28 0.14452545 0.9561397 0.15115516 1.067957 64527 valence
## 49 0.23830002 0.9556809 0.24935103 1.530647 106395 energy
## 33 0.15560334 0.9459187 0.16449969 1.105050 69473 acousticness
## 39 0.16227112 0.9344167 0.17366034 1.091613 72450 acousticness
## 103 0.35378465 0.9293990 0.38065961 1.038089 157956 acousticness
## 52 0.23160087 0.9288146 0.24935103 1.037437 103404 energy
## 135 0.65245534 0.9263688 0.70431491 1.034705 291305 liveness
## 108 0.38927376 0.9243062 0.42115236 1.032401 173801 tempo
## left_range right_var right_range
## 10 2.39,3.34 speechiness -0.59,0.628
## 73 -1.91,-0.419 speechiness -0.59,0.628
## 28 -2.17,-1.19 speechiness -0.59,0.628
## 49 0.821,1.82 loudness -0.13,3.07
## 33 -0.582,0.133 instrumentalness -0.427,0.516
## 39 0.133,0.847 instrumentalness -0.427,0.516
## 103 -1.3,-0.582 speechiness -0.59,0.628
## 52 0.821,1.82 speechiness -0.59,0.628
## 135 -1.16,0.16 speechiness -0.59,0.628
## 108 0.112,2.15 speechiness -0.59,0.628
The plot below displays the frequency of occurrence of different items in the trans dataset. The horizontal axis represents the support level, which is the percentage of transactions that contain a particular itemset. The vertical axis represents the itemsets, which are sorted by decreasing frequency of occurrence. Each itemset is represented by a horizontal line, with the length of the line indicating the support level of the itemset.
The plot shows that the most frequent itemsets in the dataset have a support level of around 0.01. The itemsets with lower support levels are less frequent and are represented by shorter lines. The color of the lines indicates the frequency of occurrence of the itemsets, with darker colors representing more frequent itemsets.
Overall, this plot provides a visual representation of the frequency distribution of itemsets in the dataset.
library(arulesViz)
mypal <- brewer.pal(9, "Blues")
# Item frequency plot
itemFrequencyPlot(trans, support = 0.01, col = mypal)
Kotsiantis, S., & Kanellopoulos, D. (2006). Association Rules Mining: A Recent Overview. GESTS International Transactions on Computer Science and Engineering, 32(1), 71–82. https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=73a19026fb8a6ef5bf238ff472f31100c33753d0
RPubs - Project 3: Recommendation System with Association Rules. (2021, March 1). https://rpubs.com/okowalewska/association_rules_and_recommendations
Toivonen, H. (1996). Sampling Large Databases for Association Rules. Very Large Data Bases, 134–145. https://www.cin.ufpe.br/~jtalr/Mestrado/toiSample.pdf