| Name | Matriculation number |
|---|---|
| Hew Li Yang | A0200905U |
| Zhu Le Yao | A0223207U |
| Harry Chang | A0201825N |
| Brandon Chia | A0216337H |
| Kaaviya Selvam | A0219611L |
Spotify is one of the largest music streaming services in the world, with over 406 million monthly active users (Spotify, 2022). In this project, we look to answer the following questions regarding the music hosted on the platform:

- How do the audio features of songs differ across genres, and could these differences be used to automatically classify songs by genre?
- How does the number of songs released in each genre vary with the season in which they are released?
To do so, we use the Spotify song metadata dataset published by TidyTuesday on 2020-01-21, obtained from tidytuesday/data/2020/2020-01-21/.
# load the tidyverse (dplyr, ggplot2, tidyr, readr), used throughout, and read in the data
library(tidyverse)

spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
summary(spotify_songs)
## track_id track_name track_artist track_popularity
## Length:32833 Length:32833 Length:32833 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 24.00
## Mode :character Mode :character Mode :character Median : 45.00
## Mean : 42.48
## 3rd Qu.: 62.00
## Max. :100.00
## track_album_id track_album_name track_album_release_date
## Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## playlist_name playlist_id playlist_genre playlist_subgenre
## Length:32833 Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171
## Median :0.6720 Median :0.721000 Median : 6.000 Median : -6.166
## Mean :0.6548 Mean :0.698619 Mean : 5.374 Mean : -6.720
## 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median :1.0000 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean :0.5657 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
A brief description of the variables relevant to our analysis is given in the table below:
| variable | class | description |
|---|---|---|
| track_popularity | double | Song popularity from 0-100 where higher is better |
| playlist_genre | character | Genre of playlist |
| danceability | double | Describes how suitable a track is for dancing from 0.0-1.0 where higher means more danceable. |
| energy | double | Represents a perceptual measure of intensity and activity from 0.0-1.0. Energetic tracks are typically fast, loud and noisy. |
| key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | double | The average loudness across the entire track in decibels (dB). Values typically range from -60 to 0 dB; values closer to 0 indicate a louder track. |
| mode | double | Indicates the modality (major or minor) of a track. Major is represented by 1 and minor is 0. |
| speechiness | double | Detects the presence of spoken words in a track from 0.0-1.0. More speechy tracks are closer to 1 and vice versa. |
| acousticness | double | A confidence measure from 0.0-1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | double | A confidence measure from 0.0-1.0 of whether the track contains no vocals, i.e. is purely instrumental. |
| liveness | double | A confidence measure from 0.0-1.0 of whether the track was performed live. |
| valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | double | The overall estimated tempo of a track in beats per minute (BPM). |
| duration_ms | double | Duration of song in milliseconds |
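Since key and mode are stored as integers, they can optionally be recoded into human-readable labels when inspecting individual tracks. The sketch below is illustrative only; the columns key_name and mode_name are our own additions and are not used in the rest of the analysis.
# optional: recode key (pitch class) and mode (major/minor) into readable labels
pitch_classes = c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                  "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")
spotify_labelled = spotify_songs %>%
  mutate(key_name  = pitch_classes[match(key, 0:11)],  # NA when key == -1 (no key detected)
         mode_name = ifelse(mode == 1, "major", "minor"))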
# check for missing values
spotify_songs %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
## # A tibble: 1 × 23
## track_id track_name track_artist track_popularity track_album_id
## <int> <int> <int> <int> <int>
## 1 0 5 5 0 0
## # … with 18 more variables: track_album_name <int>,
## # track_album_release_date <int>, playlist_name <int>, playlist_id <int>,
## # playlist_genre <int>, playlist_subgenre <int>, danceability <int>,
## # energy <int>, key <int>, loudness <int>, mode <int>, speechiness <int>,
## # acousticness <int>, instrumentalness <int>, liveness <int>, valence <int>,
## # tempo <int>, duration_ms <int>
We can remove these observations, as the small number of tracks with missing values is unlikely to affect our final results.
spotify_songs = spotify_songs %>%
na.omit()
Next, we check for duplicated instances.
spotify_songs %>%
  summarise(across(everything(), ~ sum(duplicated(.x))))
## # A tibble: 1 × 23
## track_id track_name track_artist track_popularity track_album_id
## <int> <int> <int> <int> <int>
## 1 4476 9379 22136 32727 10285
## # … with 18 more variables: track_album_name <int>,
## # track_album_release_date <int>, playlist_name <int>, playlist_id <int>,
## # playlist_genre <int>, playlist_subgenre <int>, danceability <int>,
## # energy <int>, key <int>, loudness <int>, mode <int>, speechiness <int>,
## # acousticness <int>, instrumentalness <int>, liveness <int>, valence <int>,
## # tempo <int>, duration_ms <int>
It appears that some tracks appear more than once in the dataset. To investigate, we can find the track ID with the most repeats as follows:
# find the track ID with most repeats
spotify_songs %>%
group_by(track_id) %>%
count() %>%
filter(n > 1) %>%
arrange(desc(n)) %>% head(1)
## # A tibble: 1 × 2
## # Groups: track_id [1]
## track_id n
## <chr> <int>
## 1 7BKLCZ1jbUBVqRi2FVlTVw 10
# display all observations with this track ID
spotify_songs %>%
filter(track_id == "7BKLCZ1jbUBVqRi2FVlTVw")
## # A tibble: 10 × 23
## track_id track_name track_artist track_popularity track_album_id
## <chr> <chr> <chr> <dbl> <chr>
## 1 7BKLCZ1jbUBVqRi2FVlT… Closer (f… The Chainsm… 85 0rSLgV8p5Fzfn…
## 2 7BKLCZ1jbUBVqRi2FVlT… Closer (f… The Chainsm… 85 0rSLgV8p5Fzfn…
## 3 7BKLCZ1jbUBVqRi2FVlT… Closer (f… The Chainsm… 85 0rSLgV8p5Fzfn…
## 4 7BKLCZ1jbUBVqRi2FVlT… Closer (f… The Chainsm… 85 0rSLgV8p5Fzfn…
## 5 7BKLCZ1jbUBVqRi2FVlT… Closer (f… The Chainsm… 85 0rSLgV8p5Fzfn…
## 6 7BKLCZ1jbUBVqRi2FVlT… Closer (f… The Chainsm… 85 0rSLgV8p5Fzfn…
## 7 7BKLCZ1jbUBVqRi2FVlT… Closer (f… The Chainsm… 85 0rSLgV8p5Fzfn…
## 8 7BKLCZ1jbUBVqRi2FVlT… Closer (f… The Chainsm… 85 0rSLgV8p5Fzfn…
## 9 7BKLCZ1jbUBVqRi2FVlT… Closer (f… The Chainsm… 85 0rSLgV8p5Fzfn…
## 10 7BKLCZ1jbUBVqRi2FVlT… Closer (f… The Chainsm… 85 0rSLgV8p5Fzfn…
## # … with 18 more variables: track_album_name <chr>,
## # track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## # playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## # energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
## # acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## # tempo <dbl>, duration_ms <dbl>
From the results, we note that the same song can appear multiple times under different playlists and genres. Depending on the use case, additional steps such as removing duplicate tracks may be needed. Finally, we should also note that the data is not tidy: columns like loudness, mode and speechiness are all features of a track. We will handle this issue during plotting.
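For illustration, one possible way to keep a single row per track, and to reshape the feature columns into a tidy long format, is sketched below; the names spotify_unique and spotify_long are our own and are not used in the rest of the analysis.
# one possible deduplication: keep the first occurrence of each track_id
spotify_unique = spotify_songs %>%
  distinct(track_id, .keep_all = TRUE)

# one possible tidy reshape: one row per (track, feature) pair
spotify_long = spotify_songs %>%
  pivot_longer(danceability:duration_ms, names_to = "feature", values_to = "value")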
After this preliminary cleaning, the dimensions of the dataset are:
dim(spotify_songs)
## [1] 32828 23
For this question, we are interested in investigating the differences between songs of different genres. This could be useful in designing an algorithm to automatically classify songs by genre, which could in turn be used in recommendation systems. This question seemed natural to ask since the genre and the numerical features of each song, such as danceability, energy, … duration_ms, are given in the dataset.
Plot 1: The first plot is a ridgeline faceted density plot where each facet represents one feature variable whose distribution is separated by genre. This visualization was inspired by Kaylin Pavlik (2019) in her article on classifying song genres. In our version of the plot, we separated the genres into ridges instead of simply using coloured lines, to improve the visibility of the differences between distributions. A colour-blind-friendly fill is also used to enhance readability for colour-blind readers.
Plot 2: In order to further investigate the
feature speechiness, we use a boxplot to
visualize its distribution by genre. A box plot is appropriate here as
it will enable us to clearly compare the median speechiness
levels of each genre. In addition, the y-axis is log-scaled in order to
minimise the distortion due to outliers.
# additional packages needed
library(ggthemes)
library(ggridges)
library(viridis)
plot = spotify_songs %>%
  select(c("playlist_genre", "danceability":"duration_ms")) %>%
  gather("danceability":"duration_ms", key = "feature", value = "val") %>%
  ggplot(aes(x = val, y = playlist_genre)) +
  geom_density_ridges(aes(fill = playlist_genre), alpha = 0.6) +
  facet_wrap(~feature, ncol = 3, scales = "free") +
  labs(title = 'Density of Song Features - by Genre',
       x = '', y = 'density') +
  scale_fill_colorblind(name = "Genre") +
  theme_clean() +
  # theme() tweaks come after theme_clean() so they are not overridden by it
  theme(legend.position = "bottom",
        axis.text.y = element_blank())
plot
The plot displays the distribution of values for each numerical feature, separated by genre. At first glance, liveness, key, instrumentalness and mode seem to have similar distributions across all genres and therefore may not help much as predictor variables for genre classification. The remaining features are more interesting to observe, as their distributions vary from genre to genre. The most interesting observations for each feature are listed in the table below.
| Feature | Observations |
|---|---|
| Acousticness | Most rock and EDM tracks are less likely to be acoustic |
| Danceability | Rock music is significantly less danceable than other genres, while most Latin songs are very danceable |
| Energy | EDM tracks are mostly high-energy while R&B tracks are most likely to be low-energy |
| Loudness | High across all genres, but EDM is the loudest by a slight margin |
| Speechiness | Rap music is by far the most speechy genre |
| Tempo | The tempo of EDM music is the most concentrated, at around 120 bpm, while the other genres are similarly distributed |
The table suggests that EDM is the genre that can be most easily classified from the above features, as its distributions differ the most from those of the other genres.
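As a rough numerical check on these visual observations, the per-genre medians of each feature can also be tabulated. The sketch below uses the same feature columns as the plot.
# median of each audio feature within each genre, for a numerical comparison
spotify_songs %>%
  group_by(playlist_genre) %>%
  summarise(across(danceability:duration_ms, median), .groups = "drop")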
Next, it would be interesting to explore one feature more closely to figure out more specifically how each genre differs. For this, we will look at speechiness.
spotify_songs %>%
ggplot(aes(x = playlist_genre, y = speechiness, fill = playlist_genre)) +
geom_boxplot() +
labs(title = "Boxplot of Speechiness by Genre", x = "Genre", y = "Speechiness") +
scale_fill_colorblind() +
scale_y_log10() +
theme_clean()
From the graph, it can be clearly seen that most rap tracks have a significantly higher speechiness level. In fact, the median speechiness for rap is higher than the 75th percentile of speechiness for any other genre. We can therefore conclude that speechiness is an informative predictor variable for identifying songs of the rap genre. Similar analysis can be done for the other feature variables in order to select the most informative features as input to a classification algorithm such as decision trees, naive Bayes or SVM.
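As a minimal sketch of such a classifier, assuming the rpart package is installed, a decision tree could be fitted on a handful of the features above; the train/test split, feature subset and seed below are illustrative only.
# illustrative genre classifier using a decision tree (assumes rpart is installed)
library(rpart)

set.seed(42)  # illustrative seed
idx   = sample(nrow(spotify_songs), floor(0.8 * nrow(spotify_songs)))
train = spotify_songs[idx, ]
test  = spotify_songs[-idx, ]

tree_fit = rpart(factor(playlist_genre) ~ danceability + energy + loudness +
                   speechiness + acousticness + tempo,
                 data = train, method = "class")

# accuracy on the held-out set
pred = predict(tree_fit, test, type = "class")
mean(pred == test$playlist_genre)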
This question explores the relationship between the number of songs produced per genre and the season in which the songs are released. The parts of the dataset needed are therefore the playlist genre and the track album release date. From the track album release date, we extract the month and year and group the months into seasons: across all years, track albums released from March to May are grouped under "Spring", June to August under "Summer", September to November under "Autumn", and December to February under "Winter". We are interested in this question because of the conjecture that certain genres are associated with certain moods and festivals and will therefore be more popular (and profitable) during certain seasons. For instance, winter could be a more melancholic season due to Christmas, favouring slower and moodier songs. As a result, the number of releases in each genre may increase or decrease depending on the season.
Plot 1: The first plot is a line chart which shows how the number of songs released varies over the four seasons, with each line representing a different genre of music. Each point on the graph represents the total number of songs released in that season across all years, grouped by genre according to the lines.
In addition, we have used a colour-blind-friendly palette to ensure that the graph is easily interpretable by all readers.
Plot 2: The second plot is a faceted donut chart showing, for each genre, how its releases are distributed across the four seasons. This second plot is meant to supplement the previous plot by conveying the breakdown of songs released in each season more precisely.
spotify2 = spotify_songs %>%
  select(-c(track_id, track_album_id, playlist_name, playlist_id, duration_ms))
# parse the release date; entries that are not full dates become NA and are dropped later
spotify2$track_album_release_date = as.Date(spotify2$track_album_release_date)

library(lubridate)

# extract the month and year of release
spotify2 = spotify2 %>%
  mutate(month = month(track_album_release_date),
         year  = year(track_album_release_date)) %>%
  select(-c("track_album_release_date"))

# map months to seasons
spotify2 = spotify2 %>%
  mutate(quarter = case_when(
    month == 3  | month == 4  | month == 5  ~ 'Spring',
    month == 6  | month == 7  | month == 8  ~ 'Summer',
    month == 9  | month == 10 | month == 11 ~ 'Autumn',
    month == 12 | month == 1  | month == 2  ~ 'Winter'))

# count songs per genre within each season
spotify2 = spotify2 %>%
  group_by(quarter) %>%
  count(playlist_genre)

spotify2 = na.omit(spotify2)
spotify2$quarter = factor(spotify2$quarter, levels = c("Spring", "Summer", "Autumn", "Winter"))
ggplot(spotify2, aes(x = quarter, y = n, group = playlist_genre)) +
  geom_point(aes(colour = playlist_genre)) +
  geom_line(aes(colour = playlist_genre)) +
  xlab("Season") + ylab("Number of songs") +
  ggtitle("Number of songs produced in each season per genre") +
  theme_clean() +
  scale_color_colorblind()
From the plot, we observe that all genres show an overall increase in songs produced across the seasons: releases are at their lowest in Spring and rise over the following seasons. EDM, pop and Latin share a similar trend, with a slight dip in songs produced in Winter, while the other genres, R&B, rap and rock, continue to increase in Winter.
We can also observe that the rock genre is consistently lower than the other genres regardless of the season. We hypothesize that this may be due to a multitude of reasons:
First, there have been many controversies surrounding the rock industry as a whole, which may erode its fan base: listeners become disillusioned when the artists they looked up to as idols no longer fit that image, while those artists may not realize the impact their actions have on the industry as a whole (Joan CA, 2013).
Secondly, there seems to be a lack of ethnic diversity among rock artists. Rock music appeals mainly to white audiences, while Black, Latino and Asian youth may be less enticed to listen to a genre whose artists hardly resemble them.
Lastly, there is an overt sexualization and masculinity associated with the rock genre which may not appeal to women, especially of the younger generations; they may find more enjoyment in other genres such as R&B and rap, which feature a larger variety of artists, including more who are female. All of these factors, the poor choices made by rock idols, the inability to capture a younger demographic, and the alienation of women from the genre, may explain why rock consistently has a lower number of songs produced, as fewer artists work in the genre.
spotify3 = spotify2 %>%
  group_by(playlist_genre) %>%
  mutate(percent = n / sum(n) * 100)

# donut chart: a stacked bar in polar coordinates, hollowed out by the xlim
ggplot(data = spotify3, aes(x = 2, y = percent, fill = quarter)) +
  geom_bar(stat = "identity") +
  coord_polar(theta = "y") +
  geom_text(aes(label = paste0(round(percent, 2), "%")),
            size = 2.75, position = position_stack(vjust = 0.5)) +
  facet_wrap(facets = . ~ playlist_genre) +
  theme_void() +
  scale_fill_manual(values = c("#76b7b2", "#59a14f", "#edc948", "#b07aa1")) +
  xlim(0.7, 2.5) +
  ggtitle("Distribution of different genres of songs across all 4 seasons") +
  theme(plot.title = element_text(vjust = 3, face = "bold"),
        strip.text.x = element_text(size = 10))
The faceted donut chart is an adjunct to the previous plot: it simplifies the line graph into a breakdown of each genre's songs by season.
While each donut in the faceted plot is labelled with its percentages, there are only subtle differences in the breakdown across seasons, with each season capturing roughly the same share of a genre's releases. We believe these percentages can reflect the popularity of the genres: since they do not fluctuate much, they give an idea of which genres have remained popular throughout the year.
In the chart above, the season with the most songs for the EDM, Latin and pop genres is Autumn, while the season with the most songs for R&B, rap and rock is Winter. This is particularly interesting given that the faceted donuts are arranged by genre in alphabetical order: coincidentally, the top three donut charts share one observation (most songs released in Autumn) while the bottom three share another (most songs released in Winter).
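For reference, the facet order in the donut chart simply follows the factor levels of playlist_genre, which default to alphabetical order. If a different arrangement were preferred, the levels could be set explicitly before plotting; the sketch below, which orders the facets by each genre's total number of songs, is purely illustrative.
# illustrative: order facets by total number of songs instead of alphabetically
genre_order = spotify3 %>%
  group_by(playlist_genre) %>%
  summarise(total = sum(n), .groups = "drop") %>%
  arrange(desc(total)) %>%
  pull(playlist_genre)

spotify3$playlist_genre = factor(spotify3$playlist_genre, levels = genre_order)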