Load and Clean the Dataset

As a dataset, I used my top 100 Spotify songs of 2023. I got the information from Exportify, where I linked my Spotify account to access the data. Exportify provided the dataset as a CSV file. To make this CSV file publicly accessible, I uploaded it to my GitHub. Click here to access the file.

This data relates to my music taste in 2023. The 100 observations it contains are my most listened-to songs on Spotify in 2023. However, the true population of music I enjoy is, of course, much larger than this small sample and differs to some extent from this sample, as only a fraction of the music I listen to is represented. Nevertheless, it should provide a good overview.

To load the dataset, I used the raw version of the dataset I posted on GitHub and loaded it as dt_raw.

dt_raw <- read.csv("https://raw.githubusercontent.com/canevancane/WU/main/deine_top-songs2023.csv", header = TRUE, sep = ",", dec = ".")

This dataset contains a lot of IDs and metadata that are not really of interest for this analysis. To keep it simple, I reduced it to a few variables.

dt <- dt_raw[c("Track.Name", "Album.Name", "Artist.Name.s.","Release.Date", "Duration..ms.", "Popularity", "Danceability", "Energy", "Tempo", "Mode", "Key")]

Here is a short overview of the included variables:

Variable Name Explanation Data Type
Track.Name The title of the song String
Album.Name The title of the album the song is on String
Artist.Name.s. Name of the artist(s) String
Release.Date Release date of the song String
Duration..ms. Duration of the song in ms Integer
Popularity Popularity among all Spotify users Numeric
Danceability Danceability of the song ranked by Spotify Numeric
Energy Energy of the song ranked by spotify Numeric
Tempo Tempo of the song in beats per minute (BPM) Numeric
Mode Mode of the song (Major or minor) Integer (binary)
Key Key of the song Integer

However, not all of those variables are in a desirable format. The ‘Mode’ variable, which indicates if the song is in a major or minor chord, should be transformed to a string and factorised. Furthermore, the ‘Key’ variable should be transformed to a string as well. This is done by the following code:

dt$Mode <- factor(dt$Mode, levels = c(0,1), labels = c("Minor", "Major"))
dt$Key <- factor(dt$Key, levels = c(0:11), labels = c("C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"))

In addition, it would also be nice to have a variable indicating the ranking of each song, explaining at which place a song is. Fortunately, the observations are already in the correct order, so I can simply create that variable using the following code:

dt$Rank <- c(1:100)

Including the Rank variable, the dataset consists of 12 variables.

Okay, let’s get an overview of the 6 songs I listened to the most and the last 6 songs that barely made it into my top 100. For that, I can use the head() and tail() commands.

head(dt)
##                                      Track.Name
## 1             This Charming Man - 2011 Remaster
## 2                 Cemetry Gates - 2011 Remaster
## 3                                    These Days
## 4 This Night Has Opened My Eyes - 2011 Remaster
## 5                        When The Sun Goes Down
## 6                              Under the Bridge
##                                      Album.Name        Artist.Name.s.
## 1                                    The Smiths            The Smiths
## 2                             The Queen Is Dead            The Smiths
## 3                                  Chelsea Girl                  Nico
## 4                              Hatful of Hollow            The Smiths
## 5 Whatever People Say I Am, That's What I'm Not        Arctic Monkeys
## 6        Blood Sugar Sex Magik (Deluxe Edition) Red Hot Chili Peppers
##   Release.Date Duration..ms. Popularity Danceability Energy   Tempo  Mode Key
## 1   1984-02-20        162920         79        0.611  0.846 103.912 Major   B
## 2   1986-06-16        161040         65        0.550  0.796 105.354 Major   G
## 3   1967-10-09        210653         67        0.401  0.130  93.091 Major   F
## 4         1984        221533         75        0.696  0.628 100.038 Major   D
## 5   2006-01-29        202133         74        0.348  0.875 169.152 Minor   B
## 6   1991-09-24        264306         83        0.559  0.345  84.581 Major   E
##   Rank
## 1    1
## 2    2
## 3    3
## 4    4
## 5    5
## 6    6
tail(dt)
##                                                         Track.Name
## 95     After The Storm (feat. Tyler, The Creator & Bootsy Collins)
## 96                                                   Lago Maggiore
## 97  Please, Please, Please, Let Me Get What I Want - 2011 Remaster
## 98                                                       Mardy Bum
## 99                                     Harness Your Hopes - B-side
## 100                            In Nächten wie diesen - House Remix
##                                        Album.Name
## 95                                      Isolation
## 96                              Blue Sky Szenario
## 97                               Hatful of Hollow
## 98  Whatever People Say I Am, That's What I'm Not
## 99     Brighten the Corners: Nicene Creedence Ed.
## 100                 In Nächten wie diesen Remixes
##                                   Artist.Name.s. Release.Date Duration..ms.
## 95  Kali Uchis,Tyler, The Creator,Bootsy Collins   2018-04-06        207454
## 96                   Harry Quintana,Blanko Malte   2021-02-14        191753
## 97                                    The Smiths         1984        112706
## 98                                Arctic Monkeys   2006-01-29        175440
## 99                                      Pavement   2008-12-09        206946
## 100                  Harry Quintana,Blanko Malte   2022-10-21        160589
##     Popularity Danceability Energy   Tempo  Mode Key Rank
## 95          81        0.702  0.659  79.640 Minor   E   95
## 96          31        0.688  0.740  91.838 Minor   D   96
## 97          79        0.241  0.468  91.581 Major   D   97
## 98          76        0.634  0.599 112.215 Major   D   98
## 99          73        0.604  0.573 107.862 Major   D   99
## 100         34        0.960  0.617 118.038 Minor  F#  100

However, looking at the output of that, we see that it looks rather messy. Fortunately, the R package “knitr” provides us with a nice way to create tables. These tables can be adapted for every format possible (html, pdf, word) and are highly customizable. For the following overview table, I chose to include only the title, album, artist, and rank variables. However, the included variables can be adapted by changing the first argument in the head()/tail() function.

I further adapted the column names with the col.names argument because the variable names do not look too nice. In addition to the “knitr” package, I also used the “kableExtra” package to add some spacing between the columns. For that, I used the pipe operator and the kable_styling() command. This command already applies a nice template to the table. From here on, one could further personalize the table, however, I believe that for this quick demonstration it is enough. Nevertheless, the documentation about this command can be found here.

library(knitr)
library(kableExtra)
kable(head(dt[c("Rank", "Track.Name", "Album.Name", "Artist.Name.s.")]), format = "html", col.names = 
        c("Rank", "Title", "Album", "Artist"))%>%
  kable_styling()
Rank Title Album Artist
1 This Charming Man - 2011 Remaster The Smiths The Smiths
2 Cemetry Gates - 2011 Remaster The Queen Is Dead The Smiths
3 These Days Chelsea Girl Nico
4 This Night Has Opened My Eyes - 2011 Remaster Hatful of Hollow The Smiths
5 When The Sun Goes Down Whatever People Say I Am, That’s What I’m Not Arctic Monkeys
6 Under the Bridge Blood Sugar Sex Magik (Deluxe Edition) Red Hot Chili Peppers
kable(tail(dt[c("Rank", "Track.Name", "Album.Name", "Artist.Name.s.")]), format = "html", col.names = 
        c("Rank", "Title", "Album", "Artist"))%>%
  kable_styling()
Rank Title Album Artist
95 95 After The Storm (feat. Tyler, The Creator & Bootsy Collins) Isolation Kali Uchis,Tyler, The Creator,Bootsy Collins
96 96 Lago Maggiore Blue Sky Szenario Harry Quintana,Blanko Malte
97 97 Please, Please, Please, Let Me Get What I Want - 2011 Remaster Hatful of Hollow The Smiths
98 98 Mardy Bum Whatever People Say I Am, That’s What I’m Not Arctic Monkeys
99 99 Harness Your Hopes - B-side Brighten the Corners: Nicene Creedence Ed. Pavement
100 100 In Nächten wie diesen - House Remix In Nächten wie diesen Remixes Harry Quintana,Blanko Malte

When looking at the data presented in the tables, there are a few (minor) things that can still be adapted. The problem with text data is that the text itself is always a bit dirty and must also be cleaned. Therefore, I will use some regular expressions to tidy it. I am aware that regular expressions are not within the scope of this course, yet I prefer my data to be clean and equally structured. To skip this part, please click here.

The first thing I noticed that I would change is the display of the year an album got remastered behind the album title. This is not just for cosmetic reasons, but also because Spotify often has both versions of albums. Sometimes it occurs that you listen to two different songs from the same album, however, one is listed in the “original” version of the album and one is listed in the “remastered” version of the album. This creates unnecessary confusion, since R would handle those two albums as separate albums. Therefore, it makes sense to remove “remastered” from the album title. I did this in the following way:

dt$Track.Name <- gsub("\\s-\\s.*Remaster.*", "", dt$Track.Name)
Explanation of the Regular Expression


The gsub() function is a function of base R that allows for substitutions of specific elements of strings. The main arguments are pattern, replacement, and x, where x corresponds to the data to analyze. Both pattern and replacement take inputs in the form of a regular expression. A regular expression is a pattern of symbols that represent specific elements of text (or non-text) in a string. In this case, alongside normal characters, the elements \\s, ., and * were used.

Since I want to substitute the expression “Remaster”, I had to create a regular expression that allows me to replace it with "" (an empty string/nothing). Let’s examine how a title with “Remaster” looks. One example is “This Charming Man - 2011 Remaster”. Here, I want to keep “This Charming Man” and substitute ” - 2011 Remaster” with "". The first three elements of what I want to substitute are matched by \\s-\\s. Here, \\s represents a space, and - represents a normal dash. Then, I want to remove the year that follows. However, this year might change. Therefore, I used . which stands for any character (text, digit, space, special character). The * after it means “at least 0 times”, indicating that there can be any number of characters between the dash and the term “Remaster”, or none at all. Then, I added “Remaster”, which will match exactly those characters. Lastly, I added another .* after the expression to match a title that perhaps contains additional characters after “Remaster”.

Furthermore, I would like to remove the featured artists mentioned in the title of a song. Since the featured artists are already included in the artist section, there is no need to mention them twice.

dt$Track.Name <- gsub("\\s\\(feat.*\\)", "", dt$Track.Name)
Explanation of the Regular Expression
This regular expression follows the same logic as the last one. It starts with a \\s to indicate a space between the end of the title and the start of the “(feat. …)”. Afterwards, the “feat.” is indicated in parentheses. Since parentheses have an effect in the regular expression, they need to be “escaped” if they are not used as a command. This is done by indicating it with two backslashes (\\). Afterwards, the .* is added again to exclude everything between the “feat.” and the closing parenthesis.

In addition, I want to modify the artist names. Currently, if there are multiple artist names, they are listed as one string where the artists are separated by a comma. The problem here is that when analysing the artists later, “DOMi & JD BECK,Thundercat” would be seen as a different artist than “DOMi & JD BECK”. Let’s normalise it so only the main artist is shown to make the analysis easier.

dt$Artist.Name.s. <- sub("([\\s\\w\\d&]*),\\S.*", "\\1", dt$Artist.Name.s.)
Explanation of the Regular Expression


In the pattern argument, this regular expression starts with square brackets, setting a range of allowable characters. With \\s, I allow for space characters; with \\w, I allow for word characters; with \\d, I allow for digit characters; and with &, I allow for the ‘&’ character. Setting * behind the expression indicates that there can be none or more characters that are in the predefined character range. This whole expression is grouped with parentheses so it can be later referenced in the replace argument. After this group, there is a comma ‘,’. This fits all the expressions that have multiple artists listed after the main artist. Then, \\S indicates that there cannot be whitespace afterwards. This expression prevents a comma in an artist name (e.g., Tyler, the creator) from being seen as a comma separating multiple artists. The .* after that indicates that everything coming after that can fit the expression.

In the replace argument, the group that was created in the first expression is called again. This is done with \\1.

I also want to clean the format of the release date. Some dates only contain the year, while others contain specific days. To make it possible to use the variable, however, it would be better if all were in the same format. Therefore, I reduced the variables to only include the release year with the following regular expression.

dt$Release.Date <- sub("(\\d{4}).*", "\\1", dt$Release.Date)
Explanation of the Regular Expression
Here, the pattern argument expects a four-digit number followed by everything. Then, this four-digit number is captured into the replace argument. With this, every entry for the release date variable is substituted by only the year it contains.

Now, I can also convert the string values of release date to numeric values.

dt$Release.Date <- as.numeric(dt$Release.Date)

Overview of the Data

Descriptive Statistics

Okay, now that the data is imported and cleaned, it would be nice to get an overview of the data. This can be nicely done with the summary() command. However, there is not really any sense in summarizing non-numeric variables, so let me exclude them.

summary(dt[sapply(dt, is.numeric)])
##   Release.Date  Duration..ms.      Popularity    Danceability   
##  Min.   :1964   Min.   :101466   Min.   : 4.0   Min.   :0.2020  
##  1st Qu.:1986   1st Qu.:173633   1st Qu.:56.0   1st Qu.:0.4530  
##  Median :2006   Median :210054   Median :63.5   Median :0.5405  
##  Mean   :2001   Mean   :222000   Mean   :62.7   Mean   :0.5473  
##  3rd Qu.:2017   3rd Qu.:263986   3rd Qu.:73.0   3rd Qu.:0.6460  
##  Max.   :2022   Max.   :445186   Max.   :83.0   Max.   :0.9600  
##      Energy           Tempo             Rank       
##  Min.   :0.1300   Min.   : 73.10   Min.   :  1.00  
##  1st Qu.:0.4718   1st Qu.: 96.93   1st Qu.: 25.75  
##  Median :0.6400   Median :113.89   Median : 50.50  
##  Mean   :0.6216   Mean   :120.80   Mean   : 50.50  
##  3rd Qu.:0.7945   3rd Qu.:141.15   3rd Qu.: 75.25  
##  Max.   :0.9710   Max.   :194.91   Max.   :100.00

Here, with the sapply() command, a subset of the dataframe (dt) is created that follows the properties of is.numeric, meaning that the value must be numeric. This output looks quite messy again, so let’s put it in a kable() table with the kable_styling() piped into it.

kable(summary(dt[sapply(dt, is.numeric)])) %>% 
  kable_styling()
Release.Date Duration..ms. Popularity Danceability Energy Tempo Rank
Min. :1964 Min. :101466 Min. : 4.0 Min. :0.2020 Min. :0.1300 Min. : 73.10 Min. : 1.00
1st Qu.:1986 1st Qu.:173633 1st Qu.:56.0 1st Qu.:0.4530 1st Qu.:0.4718 1st Qu.: 96.93 1st Qu.: 25.75
Median :2006 Median :210054 Median :63.5 Median :0.5405 Median :0.6400 Median :113.89 Median : 50.50
Mean :2001 Mean :222000 Mean :62.7 Mean :0.5473 Mean :0.6216 Mean :120.80 Mean : 50.50
3rd Qu.:2017 3rd Qu.:263986 3rd Qu.:73.0 3rd Qu.:0.6460 3rd Qu.:0.7945 3rd Qu.:141.15 3rd Qu.: 75.25
Max. :2022 Max. :445186 Max. :83.0 Max. :0.9600 Max. :0.9710 Max. :194.91 Max. :100.00

This output looks slightly better, however, it is annoying that the classification of the measured value is in every cell of the table. By using the describe() command of the “psych” package, we can get a nicer overview table. By setting IQR = TRUE and quant = c(0.25,0.75), I also include the quantiles and interquartile range as in the summary() command. By writing the expression [c("mean", "median", "sd", "min", "max", "IQR", "Q0.25", "Q0.75")] behind it, I only include parameters of interest, and not other stuff like skew and kurtosis. Lastly, I added the round() command to limit the decimal digits to three. When putting this expression in the kable() command, we get the following table:

library(psych)
kable(round(describe(dt[sapply(dt, is.numeric)],IQR=TRUE, quant = c(0.25,0.75))[c("mean","median", "sd", "min", "max", "IQR", "Q0.25", "Q0.75")], 3)) %>% 
  kable_styling()
mean median sd min max IQR Q0.25 Q0.75
Release.Date 2000.800 2006.000 17.896 1964.000 2022.000 31.500 1985.750 2017.250
Duration..ms. 222000.520 210053.500 69854.767 101466.000 445186.000 90353.250 173633.250 263986.500
Popularity 62.700 63.500 13.900 4.000 83.000 17.000 56.000 73.000
Danceability 0.547 0.540 0.148 0.202 0.960 0.193 0.453 0.646
Energy 0.622 0.640 0.211 0.130 0.971 0.323 0.472 0.794
Tempo 120.803 113.894 30.836 73.104 194.914 44.217 96.932 141.150
Rank 50.500 50.500 29.011 1.000 100.000 49.500 25.750 75.250

Ok, let me explain this table a bit with the example of tempo. The mean of tempo is about 120.8 bpm, which is somewhat the standard for modern western music (120 bpm). This means that the average song of the sample has about 120.8 bpm. The median is a bit lower, at about 113.89. This means that 50% of the songs are faster than 113.89 bpm and 50% are slower. The fact that the mean is higher than the median is already a sign that the distribution is somewhat skewed to the left, since a normal distribution would have the mean equal to the median.

The standard deviation is a measure of spread around the mean. One standard deviation contains the 68% of closest observations around the mean, two standard deviations contain the 95% observations around the mean, and three standard deviations contain 99.7% spread around the mean. A standard deviation of 30.84 around the mean means that 68% of all observations are between 89.97 and 151.64, which is \(\bar x ± s\).

The minimum and maximum are pretty self-explanatory; they represent the lowest and highest tempo in the sample. The IQR (interquartile range) is the difference between the 25th percentile and the 75th percentile. This means that 50% of all songs are between 96.93 and 141.15. The size of the IQR also gives some information about the spread/variance of a variable.

Distribution of Variables

The issue with looking at descriptive statistics in tables is that we do not see too many details about the distribution. Something like a bimodal distribution, for example, would get lost in a table. Therefore, it is helpful to create some density plots (or histograms) to have a look at the distributions.

Since I did not want to type out the same code for every numeric variable, I coded a loop. First, I created a sub-dataset that only consists of numeric variables. Then, I got the names of the variables in this dataset with names(dtnum) and looped through them, referring to every element in names(dtnum) as variable_name. In this loop, the variable datavar was created, which is the data of dtnum for only the specific variable the loop is currently at. Then, the loop creates a density plot for the datavar and sets the main label to the name of the variable.

This whole loop would also have been possible by just looping through the dataset dtnum instead of the names. However, then the main titles of the density plots would have been looking worse. Loops are very nice in every programming language and are essential when dealing with big data. However, customization is not as easy anymore and in R they unfortunately do not function with ggplot.

dtnum <- dt[sapply(dt, is.numeric)]

for (variable_name in names(dtnum)) {
  datavar <- dtnum[[variable_name]]
  plot(density(datavar), main = variable_name)
}

From the above density plots, we can observe that the distributions of release year and energy are somewhat bimodal. This suggests that I may have preferences for either old or new songs, and that I may prefer either very energetic or less energetic songs.

Summary Statistics

It would also be nice to get an overview about the categorical variables. So lets have a look at the frequencies of mode and key. With the table() command I get the frequencies.

table(dt$Mode)
## 
## Minor Major 
##    39    61

We can observe that most songs of this sample are in a major mode.

The table()command can also be used for the key of the song.

table(dt$Key)
## 
##  C C#  D D#  E  F F#  G G#  A A#  B 
## 11  9  8  2 11  7 10 13  6  7  6 10

The dataset at hand has exactly 100 variables, so it is quite simple to calculate relative numbers from the frequencies in head. However, with moth datasets this is not the case, so relative numbers are usefull. Those numbers can be calculated with the following code:

print(table(dt$Mode)/sum(table(dt$Mode)))
## 
## Minor Major 
##  0.39  0.61
print(table(dt$Key)/sum(table(dt$Key)))
## 
##    C   C#    D   D#    E    F   F#    G   G#    A   A#    B 
## 0.11 0.09 0.08 0.02 0.11 0.07 0.10 0.13 0.06 0.07 0.06 0.10

Here, just each individual category is divided by the sum of all categories, to get to the percent value.

Analysis of the Data

Most Listened Artists (Bar Chart)

Now that we have an overview of the data, let’s dive a bit deeper into it. An interesting point to examine is to see which artists made it the most often into my top 100. For that, we can use the table() command to get the frequency of artists. I converted this frequency table to a dataframe, saved as “dti”. Then I ordered “dti” by descending frequency. Now, we can see my top 6 most listened artists with the head() command.

dti <- as.data.frame(table(dt$Artist.Name.s.))
dti <- dti[order(-dti$Freq),]
head(dti)
##              Var1 Freq
## 46     The Smiths   17
## 5  Arctic Monkeys    7
## 13 DOMi & JD BECK    7
## 24    Mac DeMarco    5
## 9   Cocteau Twins    3
## 18 Harry Quintana    3

It is now already clear to see what I really enjoyed listening to, but a graphical overview would also be nice. Therefore, let’s create a boxplot. For that, I used “ggplot2” to have a variety of customization options. A basic boxplot using ggplot looks like this:

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(data = dti, mapping = aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity")

This looks quite messy, so let’s tidy it up a bit. First, I would like to have the variables in descending order by their frequency. This improves readability and makes comparisons between the bars easier. To achieve this descending order, I factored the naming variable of the frequency data I created earlier. Unfortunately, ggplot does not take the order that was defined earlier. Furthermore, I changed the number of data points shown. There is no need to display all 51 different artists, since most of them have only one occurrence. Therefore, I used the argument data = head(dti) to get only the first 6 observations. In addition, I changed the theme of the ggplot and adjusted the labels on the axis.

dti$Var1 <- factor(dti$Var1, levels = dti$Var1[order(dti$Freq, decreasing = TRUE)])

ggplot(data = head(dti), mapping = aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity")+
  theme_classic()+
  xlab("Artist Name")+
  ylab("Frequency")

Here, the same information can be observed as when we print head(dti). However, we can easily compare the different frequencies. We can see that The Smiths are by far the band that had the most songs in my top 100. Furthermore, Arctic Monkeys and DOMi & JD BECK also had quite a lot of songs in my top 100, but even if you combine them, they do not have more songs than The Smiths. This strongly suggests that The Smiths were my favourite band in 2023.

Rank vs. Popularity (Scatterplot)

Next to comparing frequencies, it would be furthermore interesting to compare some numeric variables and (visually) observe some relation. What interests me is if a more popular song in the Spotify community is also more popular for me. Spotify measures the popularity of a song and it is listed in the variable Popularity. How popular the song was for myself can be seen with the rank the song has. So lets plot that.

ggplot(data=dt, mapping = aes(x=Rank, y=Popularity))+
  geom_point()+
  geom_smooth()+
  theme_classic()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

It does not really seem like there is a correlation between the popularity on Spotify and the popularity for me, since the slope of the trendline is almost equal to 0. There is a small slope, however, but from just visually looking at it, I would say that there is no relation. A regression analysis could be used now to analyse the statistical significance, but since this will be introduced later in the course, I will not do it now.

Rank vs. Mode (Boxplot)

Let’s see, if songs in a major or minor key are ranked higher in this sample. We can do that with the boxplot function of ggplot2.

ggplot(data=dt, mapping = aes(y=Rank, x=Mode))+
  geom_boxplot()+
  theme_classic()+
  scale_y_reverse()

The ggplot2 is also in the classic theme as the plots above. Furthermore, I added scale_y_reverse(), to reverse the y-axis. I did this because intuitively we assume that a value further away from the origin of a plot is better. With this reversed axis, we can follow this intuition, since a song that has a better ranking (e.g., a lower number) is now further away from the origin.

From the output of the boxplot, we can see that the median (the black line in the middle of the box) of the major key is a bit higher than the median of the minor key. Furthermore, the IQR (the box), and hence the variance, for the major key is a bit smaller than for the minor key. Both variables seem somewhat equally distributed, since the median is approximately in the center of the box.

However, only because the median of the major key is a bit higher than the median of the minor key, we should not judge statistical significance from that. In general, I would conclude that there is no statistical significance, since the IQR ranges are pretty much overlapping. Nevertheless, we would need a t-test for certainty.