As a dataset, I used my top 100 Spotify songs of 2023. I got the information from Exportify, where I linked my Spotify account to access the data. Exportify provided the dataset as a CSV file. To make this CSV file publicly accessible, I uploaded it to my GitHub. Click here to access the file.
This data relates to my music taste in 2023. The 100 observations it contains are my most listened-to songs on Spotify in 2023. However, the true population of music I enjoy is, of course, much larger than this small sample, since only a fraction of the music I listen to is represented here. Nevertheless, it should provide a good overview.
To load the dataset, I used the raw version of the file I posted on GitHub and read it in as dt_raw.
dt_raw <- read.csv("https://raw.githubusercontent.com/canevancane/WU/main/deine_top-songs2023.csv", header = TRUE, sep = ",", dec = ".")
This dataset contains a lot of IDs and metadata that are not really of interest for this analysis. To keep it simple, I reduced it to a few variables.
dt <- dt_raw[c("Track.Name", "Album.Name", "Artist.Name.s.","Release.Date", "Duration..ms.", "Popularity", "Danceability", "Energy", "Tempo", "Mode", "Key")]
Here is a short overview of the included variables:
| Variable Name | Explanation | Data Type |
|---|---|---|
| Track.Name | The title of the song | String |
| Album.Name | The title of the album the song is on | String |
| Artist.Name.s. | Name of the artist(s) | String |
| Release.Date | Release date of the song | String |
| Duration..ms. | Duration of the song in ms | Integer |
| Popularity | Popularity among all Spotify users | Numeric |
| Danceability | Danceability of the song as rated by Spotify | Numeric |
| Energy | Energy of the song as rated by Spotify | Numeric |
| Tempo | Tempo of the song in beats per minute (BPM) | Numeric |
| Mode | Mode of the song (Major or minor) | Integer (binary) |
| Key | Key of the song | Integer |
However, not all of these variables are in a desirable format. The ‘Mode’ variable, which indicates whether the song is in a major or minor key, should be converted to a factor with meaningful labels. The same applies to the ‘Key’ variable. This is done by the following code:
dt$Mode <- factor(dt$Mode, levels = c(0,1), labels = c("Minor", "Major"))
dt$Key <- factor(dt$Key, levels = c(0:11), labels = c("C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"))
In addition, it would be nice to have a variable indicating the ranking of each song, i.e., at which position a song sits in my top 100. Fortunately, the observations are already in the correct order, so I can simply create that variable using the following code:
dt$Rank <- c(1:100)
Including the Rank variable, the dataset consists of 12 variables.
Okay, let’s get an overview of the 6 songs I listened to the most and
the last 6 songs that barely made it into my top 100. For that, I can
use the head() and tail() commands.
head(dt)
## Track.Name
## 1 This Charming Man - 2011 Remaster
## 2 Cemetry Gates - 2011 Remaster
## 3 These Days
## 4 This Night Has Opened My Eyes - 2011 Remaster
## 5 When The Sun Goes Down
## 6 Under the Bridge
## Album.Name Artist.Name.s.
## 1 The Smiths The Smiths
## 2 The Queen Is Dead The Smiths
## 3 Chelsea Girl Nico
## 4 Hatful of Hollow The Smiths
## 5 Whatever People Say I Am, That's What I'm Not Arctic Monkeys
## 6 Blood Sugar Sex Magik (Deluxe Edition) Red Hot Chili Peppers
## Release.Date Duration..ms. Popularity Danceability Energy Tempo Mode Key
## 1 1984-02-20 162920 79 0.611 0.846 103.912 Major B
## 2 1986-06-16 161040 65 0.550 0.796 105.354 Major G
## 3 1967-10-09 210653 67 0.401 0.130 93.091 Major F
## 4 1984 221533 75 0.696 0.628 100.038 Major D
## 5 2006-01-29 202133 74 0.348 0.875 169.152 Minor B
## 6 1991-09-24 264306 83 0.559 0.345 84.581 Major E
## Rank
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
tail(dt)
## Track.Name
## 95 After The Storm (feat. Tyler, The Creator & Bootsy Collins)
## 96 Lago Maggiore
## 97 Please, Please, Please, Let Me Get What I Want - 2011 Remaster
## 98 Mardy Bum
## 99 Harness Your Hopes - B-side
## 100 In Nächten wie diesen - House Remix
## Album.Name
## 95 Isolation
## 96 Blue Sky Szenario
## 97 Hatful of Hollow
## 98 Whatever People Say I Am, That's What I'm Not
## 99 Brighten the Corners: Nicene Creedence Ed.
## 100 In Nächten wie diesen Remixes
## Artist.Name.s. Release.Date Duration..ms.
## 95 Kali Uchis,Tyler, The Creator,Bootsy Collins 2018-04-06 207454
## 96 Harry Quintana,Blanko Malte 2021-02-14 191753
## 97 The Smiths 1984 112706
## 98 Arctic Monkeys 2006-01-29 175440
## 99 Pavement 2008-12-09 206946
## 100 Harry Quintana,Blanko Malte 2022-10-21 160589
## Popularity Danceability Energy Tempo Mode Key Rank
## 95 81 0.702 0.659 79.640 Minor E 95
## 96 31 0.688 0.740 91.838 Minor D 96
## 97 79 0.241 0.468 91.581 Major D 97
## 98 76 0.634 0.599 112.215 Major D 98
## 99 73 0.604 0.573 107.862 Major D 99
## 100 34 0.960 0.617 118.038 Minor F# 100
However, the output above looks rather messy. Fortunately, the R package “knitr” provides a nice way to create tables. These tables can be rendered in every common output format (HTML, PDF, Word) and are highly customizable. For the following overview table, I chose to include only the title, album, artist, and rank variables. However, the included variables can be changed by adapting the column selection passed to the head()/tail() function. I further adapted the column names with the col.names argument because the raw variable names do not look too nice. In addition to the “knitr” package, I also used the “kableExtra” package to add some spacing between the columns. For that, I used the pipe operator and the kable_styling() command, which already applies a nice template to the table. From here on, one could further personalize the table; however, I believe that for this quick demonstration it is enough. Nevertheless, the documentation for this command can be found here.
library(knitr)
library(kableExtra)
kable(head(dt[c("Rank", "Track.Name", "Album.Name", "Artist.Name.s.")]), format = "html", col.names =
c("Rank", "Title", "Album", "Artist"))%>%
kable_styling()
| Rank | Title | Album | Artist |
|---|---|---|---|
| 1 | This Charming Man - 2011 Remaster | The Smiths | The Smiths |
| 2 | Cemetry Gates - 2011 Remaster | The Queen Is Dead | The Smiths |
| 3 | These Days | Chelsea Girl | Nico |
| 4 | This Night Has Opened My Eyes - 2011 Remaster | Hatful of Hollow | The Smiths |
| 5 | When The Sun Goes Down | Whatever People Say I Am, That’s What I’m Not | Arctic Monkeys |
| 6 | Under the Bridge | Blood Sugar Sex Magik (Deluxe Edition) | Red Hot Chili Peppers |
kable(tail(dt[c("Rank", "Track.Name", "Album.Name", "Artist.Name.s.")]), format = "html", col.names =
c("Rank", "Title", "Album", "Artist"))%>%
kable_styling()
|  | Rank | Title | Album | Artist |
|---|---|---|---|---|
| 95 | 95 | After The Storm (feat. Tyler, The Creator & Bootsy Collins) | Isolation | Kali Uchis,Tyler, The Creator,Bootsy Collins |
| 96 | 96 | Lago Maggiore | Blue Sky Szenario | Harry Quintana,Blanko Malte |
| 97 | 97 | Please, Please, Please, Let Me Get What I Want - 2011 Remaster | Hatful of Hollow | The Smiths |
| 98 | 98 | Mardy Bum | Whatever People Say I Am, That’s What I’m Not | Arctic Monkeys |
| 99 | 99 | Harness Your Hopes - B-side | Brighten the Corners: Nicene Creedence Ed. | Pavement |
| 100 | 100 | In Nächten wie diesen - House Remix | In Nächten wie diesen Remixes | Harry Quintana,Blanko Malte |
When looking at the data presented in the tables, there are a few (minor) things that can still be improved. Text data is often a bit messy and needs some cleaning. Therefore, I will use some regular expressions to tidy it up. I am aware that regular expressions are not within the scope of this course, yet I prefer my data to be clean and consistently structured. To skip this part, please click here.
The first thing I noticed that I would change is the remaster year that is appended to the title. This is not just for cosmetic reasons, but also because Spotify often carries both versions of an album. It can happen that you listen to two different songs from the same album, but one is listed under the “original” version of the album and one under the “remastered” version. This creates unnecessary confusion, since R would treat those two albums as separate albums. Therefore, it makes sense to remove the remaster suffix from the title. I did this in the following way:
dt$Track.Name <- gsub("\\s-\\s.*Remaster.*", "", dt$Track.Name)
The gsub() function is a base R function that allows substituting specific parts of strings. Its main arguments are pattern, replacement, and x, where x is the data to work on. The pattern argument takes a regular expression, i.e., a sequence of symbols that represents specific elements of text (or non-text) in a string; the replacement argument takes the string that the match is replaced with. In this case, alongside normal characters, the elements \\s, ., and * were used, and the replacement is "" (an empty string/nothing), so the matched part is simply removed. Let’s examine how a title with “Remaster” looks. One example is “This Charming Man - 2011 Remaster”. Here, I want to keep “This Charming Man” and substitute ” - 2011 Remaster” with "". The first three characters of what I want to substitute are matched by \\s-\\s, where \\s represents a space and - a literal dash. Then, I want to remove the year that follows. However, this year might change from song to song. Therefore, I used ., which stands for any character (text, digit, space, special character). The * after it means “zero or more times”, indicating that there can be any number of characters between the dash and the term “Remaster”, or none at all. Then, I added “Remaster”, which matches exactly those characters. Lastly, I added another .* after the expression to also match titles that contain additional characters after “Remaster”.
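As a quick sanity check, the pattern can be tried on a single made-up string (a standalone call, separate from the cleaning step above):
gsub("\\s-\\s.*Remaster.*", "", "This Charming Man - 2011 Remaster")
## [1] "This Charming Man"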
Furthermore, I would like to remove the featured artists mentioned in the title of a song. Since the featured artists are already included in the artist section, there is no need to mention them twice.
dt$Track.Name <- gsub("\\s\\(feat.*\\)", "", dt$Track.Name)
The pattern starts with \\s to match the space between the end of the title and the start of the “(feat. …)” part. After that come the opening parenthesis and “feat”. Since parentheses have a special meaning in regular expressions, they need to be escaped with two backslashes (\\) when they are meant literally. Afterwards, .* is added again to match everything between “feat” and the closing parenthesis, which is escaped as well.
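Again, a quick check on one example title (a standalone call, not run on the full dataset):
gsub("\\s\\(feat.*\\)", "", "After The Storm (feat. Tyler, The Creator & Bootsy Collins)")
## [1] "After The Storm"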
In addition, I want to modify the artist names. Currently, if there are multiple artist names, they are listed as one string where the artists are separated by a comma. The problem here is that when analysing the artists later, “DOMi & JD BECK,Thundercat” would be seen as a different artist than “DOMi & JD BECK”. Let’s normalise it so only the main artist is shown to make the analysis easier.
dt$Artist.Name.s. <- sub("([\\s\\w\\d&]*),\\S.*", "\\1", dt$Artist.Name.s.)
In the pattern argument, this regular expression starts with square brackets, defining a set of allowed characters: \\s allows space characters, \\w word characters, \\d digit characters, and & the literal ‘&’ character. The * behind the bracket expression indicates that there can be zero or more characters from this set. The whole expression is wrapped in parentheses so it forms a group that can later be referenced in the replacement argument. After this group comes a comma ‘,’. This matches all entries that list further artists after the main artist. Then, \\S requires the character directly after the comma to be a non-whitespace character. This prevents a comma inside an artist name (e.g., “Tyler, The Creator”, where the comma is followed by a space) from being treated as the separator between multiple artists. The .* after that matches everything that comes afterwards. In the replacement argument, \\1 refers back to the first (and only) group, so only the main artist is kept.
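To illustrate the idea of a capture group and a back-reference in isolation, here is a deliberately simplified pattern. It stops at the first comma and would therefore not handle names like “Tyler, The Creator”; it is only meant to show how \\1 works, not to replace the pattern above:
sub("([^,]*),.*", "\\1", "DOMi & JD BECK,Thundercat")
## [1] "DOMi & JD BECK"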
I also want to clean up the format of the release date. Some dates only contain the year, while others contain a specific day. To make the variable usable, it would be better if all entries were in the same format. Therefore, I reduced the variable to only the release year with the following regular expression.
dt$Release.Date <- sub("(\\d{4}).*", "\\1", dt$Release.Date)
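As a quick check on the two date formats that occur (made-up inputs), the pattern keeps only the first four digits:
sub("(\\d{4}).*", "\\1", c("1984-02-20", "1984"))
## [1] "1984" "1984"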
Now, I can also convert the string values of release date to numeric values.
dt$Release.Date <- as.numeric(dt$Release.Date)
Okay, now that the data is imported and cleaned, it would be nice to
get an overview of the data. This can be nicely done with the
summary() command. However, there is not really any sense
in summarizing non-numeric variables, so let me exclude them.
summary(dt[sapply(dt, is.numeric)])
## Release.Date Duration..ms. Popularity Danceability
## Min. :1964 Min. :101466 Min. : 4.0 Min. :0.2020
## 1st Qu.:1986 1st Qu.:173633 1st Qu.:56.0 1st Qu.:0.4530
## Median :2006 Median :210054 Median :63.5 Median :0.5405
## Mean :2001 Mean :222000 Mean :62.7 Mean :0.5473
## 3rd Qu.:2017 3rd Qu.:263986 3rd Qu.:73.0 3rd Qu.:0.6460
## Max. :2022 Max. :445186 Max. :83.0 Max. :0.9600
## Energy Tempo Rank
## Min. :0.1300 Min. : 73.10 Min. : 1.00
## 1st Qu.:0.4718 1st Qu.: 96.93 1st Qu.: 25.75
## Median :0.6400 Median :113.89 Median : 50.50
## Mean :0.6216 Mean :120.80 Mean : 50.50
## 3rd Qu.:0.7945 3rd Qu.:141.15 3rd Qu.: 75.25
## Max. :0.9710 Max. :194.91 Max. :100.00
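As a small aside, the logical vector that does the column selection can be inspected on its own (assuming the cleaned dt from above):
sapply(dt, is.numeric)
It returns TRUE for the numeric columns (such as Popularity, Tempo, and Rank) and FALSE for the character and factor columns.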
This is exactly what happens inside the subsetting above: sapply() applies is.numeric() to every column of dt, and the resulting logical vector selects only the numeric columns. The summary output still looks quite messy, though, so let’s put it in a kable() table with kable_styling() piped into it.
kable(summary(dt[sapply(dt, is.numeric)])) %>%
kable_styling()
|  | Release.Date | Duration..ms. | Popularity | Danceability | Energy | Tempo | Rank |
|---|---|---|---|---|---|---|---|
|  | Min. :1964 | Min. :101466 | Min. : 4.0 | Min. :0.2020 | Min. :0.1300 | Min. : 73.10 | Min. : 1.00 |
|  | 1st Qu.:1986 | 1st Qu.:173633 | 1st Qu.:56.0 | 1st Qu.:0.4530 | 1st Qu.:0.4718 | 1st Qu.: 96.93 | 1st Qu.: 25.75 |
|  | Median :2006 | Median :210054 | Median :63.5 | Median :0.5405 | Median :0.6400 | Median :113.89 | Median : 50.50 |
|  | Mean :2001 | Mean :222000 | Mean :62.7 | Mean :0.5473 | Mean :0.6216 | Mean :120.80 | Mean : 50.50 |
|  | 3rd Qu.:2017 | 3rd Qu.:263986 | 3rd Qu.:73.0 | 3rd Qu.:0.6460 | 3rd Qu.:0.7945 | 3rd Qu.:141.15 | 3rd Qu.: 75.25 |
|  | Max. :2022 | Max. :445186 | Max. :83.0 | Max. :0.9600 | Max. :0.9710 | Max. :194.91 | Max. :100.00 |
This output looks slightly better; however, it is annoying that the name of each statistic is repeated in every cell of the table. Using the describe() command of the “psych” package, we can get a nicer overview table. By setting IQR = TRUE and quant = c(0.25, 0.75), I also include the quartiles and the interquartile range, as in the summary() command. By writing the expression
[c("mean", "median", "sd", "min", "max", "IQR", "Q0.25", "Q0.75")]
behind it, I only keep the statistics of interest and drop others such as skewness and kurtosis. Lastly, I wrapped everything in round() to limit the output to three decimal digits. Putting this expression into the kable() command gives the following table:
library(psych)
kable(round(describe(dt[sapply(dt, is.numeric)],IQR=TRUE, quant = c(0.25,0.75))[c("mean","median", "sd", "min", "max", "IQR", "Q0.25", "Q0.75")], 3)) %>%
kable_styling()
|  | mean | median | sd | min | max | IQR | Q0.25 | Q0.75 |
|---|---|---|---|---|---|---|---|---|
| Release.Date | 2000.800 | 2006.000 | 17.896 | 1964.000 | 2022.000 | 31.500 | 1985.750 | 2017.250 |
| Duration..ms. | 222000.520 | 210053.500 | 69854.767 | 101466.000 | 445186.000 | 90353.250 | 173633.250 | 263986.500 |
| Popularity | 62.700 | 63.500 | 13.900 | 4.000 | 83.000 | 17.000 | 56.000 | 73.000 |
| Danceability | 0.547 | 0.540 | 0.148 | 0.202 | 0.960 | 0.193 | 0.453 | 0.646 |
| Energy | 0.622 | 0.640 | 0.211 | 0.130 | 0.971 | 0.323 | 0.472 | 0.794 |
| Tempo | 120.803 | 113.894 | 30.836 | 73.104 | 194.914 | 44.217 | 96.932 | 141.150 |
| Rank | 50.500 | 50.500 | 29.011 | 1.000 | 100.000 | 49.500 | 25.750 | 75.250 |
Ok, let me explain this table a bit using tempo as an example. The mean tempo is about 120.8 bpm, which is roughly the standard for modern western music (120 bpm). This means that the average song in the sample has about 120.8 bpm. The median is a bit lower, at about 113.89 bpm, meaning that 50% of the songs are faster than 113.89 bpm and 50% are slower. The fact that the mean is higher than the median is already a sign that the distribution is somewhat skewed to the right, since a symmetric distribution (such as the normal distribution) has the mean equal to the median.
The standard deviation is a measure of spread around the mean. For approximately normally distributed data, about 68% of the observations lie within one standard deviation of the mean, about 95% within two, and about 99.7% within three. Under this assumption, a standard deviation of 30.84 means that roughly 68% of all observations lie between 89.97 and 151.64 bpm, i.e., within \(\bar x \pm s\).
The minimum and maximum are pretty self-explanatory; they represent the lowest and highest tempo in the sample. The IQR (interquartile range) is the difference between the 75th and the 25th percentile. This means that the middle 50% of all songs lie between 96.93 and 141.15 bpm. The size of the IQR also gives some information about the spread/variance of a variable.
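These numbers can also be reproduced directly from the data; the following base R calls should match the Tempo row of the table above (shown here only as a minimal sketch):
mean(dt$Tempo)
median(dt$Tempo)
sd(dt$Tempo)
IQR(dt$Tempo)
quantile(dt$Tempo, c(0.25, 0.75))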
The issue with looking at descriptive statistics in tables is that we do not see too many details about the distribution. Something like a bimodal distribution, for example, would get lost in a table. Therefore, it is helpful to create some density plots (or histograms) to have a look at the distributions.
Since I did not want to type out the same code for every numeric variable, I wrote a loop. First, I created a sub-dataset that only consists of the numeric variables. Then, I got the names of the variables in this dataset with names(dtnum) and looped through them, referring to each element of names(dtnum) as variable_name. Inside the loop, the variable datavar holds the data of dtnum for the specific variable the loop is currently at. The loop then creates a density plot of datavar and sets the main title to the name of the variable.
This loop could also have iterated directly over the columns of dtnum instead of over the names; however, the main titles of the density plots would then have looked worse. Loops are very useful in every programming language and are essential when dealing with big data. However, customization becomes a bit harder, and in R, ggplot objects created inside a loop are not printed automatically, so an explicit print() call would be needed (see the sketch a bit further below).
dtnum <- dt[sapply(dt, is.numeric)]            # keep only the numeric columns
for (variable_name in names(dtnum)) {
  datavar <- dtnum[[variable_name]]            # data of the current variable
  plot(density(datavar), main = variable_name) # density plot titled with the variable name
}
From the above density plots, we can observe that the distributions of release year and energy are somewhat bimodal. This suggests that I may have preferences for either old or new songs, and that I may prefer either very energetic or less energetic songs.
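Coming back to the loop itself: as mentioned, the same plots could be produced with ggplot2, the only catch being that the plot object must be printed explicitly inside the loop. This is just a sketch of that idea (using the .data pronoun to select the current column), not the code used for the plots above:
library(ggplot2)
for (variable_name in names(dtnum)) {
  p <- ggplot(dtnum, aes(x = .data[[variable_name]])) +
    geom_density() +
    ggtitle(variable_name)
  print(p)  # inside a for loop, ggplot objects are only drawn when printed explicitly
}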
It would also be nice to get an overview of the categorical variables. So let's have a look at the frequencies of mode and key. With the table() command I get the frequencies.
table(dt$Mode)
##
## Minor Major
## 39 61
We can observe that most songs in this sample are in a major mode. The table() command can also be used for the key of the song.
table(dt$Key)
##
## C C# D D# E F F# G G# A A# B
## 11 9 8 2 11 7 10 13 6 7 6 10
The dataset at hand has exactly 100 observations, so it is quite simple to calculate relative frequencies from the absolute frequencies in one's head. However, with most datasets this is not the case, so relative frequencies are useful. They can be calculated with the following code:
print(table(dt$Mode)/sum(table(dt$Mode)))
##
## Minor Major
## 0.39 0.61
print(table(dt$Key)/sum(table(dt$Key)))
##
## C C# D D# E F F# G G# A A# B
## 0.11 0.09 0.08 0.02 0.11 0.07 0.10 0.13 0.06 0.07 0.06 0.10
Here, the frequency of each category is simply divided by the sum over all categories to obtain the relative frequency.
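Base R also offers prop.table(), which performs exactly this division; the following calls should give the same proportions:
prop.table(table(dt$Mode))
prop.table(table(dt$Key))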
Now that we have an overview of the data, let's dive a bit deeper into it. An interesting question is which artists appear most often in my top 100. For that, we can use the table() command to get the frequency of each artist. I converted this frequency table to a dataframe, saved as “dti”, and then ordered “dti” by descending frequency. Now, we can see my six most-listened-to artists with the head() command.
dti <- as.data.frame(table(dt$Artist.Name.s.))
dti <- dti[order(-dti$Freq),]
head(dti)
## Var1 Freq
## 46 The Smiths 17
## 5 Arctic Monkeys 7
## 13 DOMi & JD BECK 7
## 24 Mac DeMarco 5
## 9 Cocteau Twins 3
## 18 Harry Quintana 3
It is now already quite clear what I really enjoyed listening to, but a graphical overview would also be nice. Therefore, let's create a bar chart. For that, I used “ggplot2”, which offers a variety of customization options. A basic bar chart using ggplot looks like this:
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(data = dti, mapping = aes(x = Var1, y = Freq)) +
geom_bar(stat = "identity")
This looks quite messy, so let's tidy it up a bit. First, I would like to have the bars in descending order of frequency. This improves readability and makes comparisons between the bars easier. To achieve this order, I set the factor levels of the naming variable of the frequency data I created earlier, because ggplot orders the bars by the factor levels rather than by the row order of the dataframe. Furthermore, I changed the number of data points shown. There is no need to display all 51 different artists, since most of them have only one occurrence. Therefore, I used the argument data = head(dti) to get only the first 6 observations. In addition, I changed the theme of the ggplot and adjusted the axis labels.
dti$Var1 <- factor(dti$Var1, levels = dti$Var1[order(dti$Freq, decreasing = TRUE)])
ggplot(data = head(dti), mapping = aes(x = Var1, y = Freq)) +
geom_bar(stat = "identity")+
theme_classic()+
xlab("Artist Name")+
ylab("Frequency")
Here, the same information can be seen as when we print head(dti), but the bar chart makes it much easier to compare the frequencies. We can see that The Smiths are by far the band with the most songs in my top 100. Arctic Monkeys and DOMi & JD BECK also had quite a few songs in my top 100, but even combined they do not reach The Smiths. This strongly suggests that The Smiths were my favourite band in 2023.
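As a side note on the ordering step above, the same descending order could also be achieved inside the aes() mapping with reorder(), without changing the factor levels in dti beforehand (a sketch of the alternative, not the code used above):
ggplot(data = head(dti), mapping = aes(x = reorder(Var1, -Freq), y = Freq)) +
  geom_bar(stat = "identity") +
  theme_classic() +
  xlab("Artist Name") +
  ylab("Frequency")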
Besides comparing frequencies, it would also be interesting to compare some numeric variables and (visually) look for a relationship. What interests me is whether a song that is more popular in the Spotify community is also more popular with me. Spotify measures the popularity of a song in the variable Popularity; how popular the song is for me is reflected in its rank. So let's plot that.
ggplot(data=dt, mapping = aes(x=Rank, y=Popularity))+
geom_point()+
geom_smooth()+
theme_classic()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
There does not really seem to be a correlation between the popularity on Spotify and the popularity for me: the slope of the trend line is almost zero. There is a slight slope, but from a purely visual inspection I would say that there is no relation. A regression analysis could be used to assess statistical significance, but since this will be introduced later in the course, I will not do it now.
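For a quick numeric check of this visual impression (without any claim about significance), the correlation coefficient can be computed directly:
cor(dt$Rank, dt$Popularity)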
Let's see if songs in a major or minor key are ranked higher in this sample. We can do that with the boxplot geometry of ggplot2.
ggplot(data=dt, mapping = aes(y=Rank, x=Mode))+
geom_boxplot()+
theme_classic()+
scale_y_reverse()
This plot also uses the classic theme, like the plots above. Furthermore, I added scale_y_reverse() to reverse the y-axis. I did this because we intuitively assume that a value further away from the origin of a plot is better. With the reversed axis, we can follow this intuition, since a song with a better ranking (i.e., a lower number) is now further away from the origin.
From the boxplot, we can see that the median (the black line in the middle of the box) for songs in a major key is a bit higher than the median for songs in a minor key. Furthermore, the IQR (the box), and hence the spread, is a bit smaller for the major key than for the minor key. Both groups look fairly symmetric, since the median lies approximately in the center of the box.
However, just because the median for the major key is a bit higher than the median for the minor key, we should not jump to conclusions about statistical significance. In general, I would expect no statistically significant difference, since the IQRs overlap to a large extent. Nevertheless, we would need a t-test to be certain.
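For completeness, such a test could be run with the base R t.test() function using the formula interface; interpreting the result properly is left for later in the course:
t.test(Rank ~ Mode, data = dt)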