While I was browsing through Kaggle, I came across a dataset that interested me called “Spotify Stats for 2023.” As someone who loves music, I thought it would be interesting to compare statistics compiled by Spotify and see if any overlapped with my music preferences.
# load packages
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# dplyr and ggplot2 are already attached by library(tidyverse) above
# read data
spotify <- read.csv("/Users/victorzheng/Documents/NYU/R/spotify_2023.csv")
# show column names
print(names(spotify))
## [1] "track_name" "artist.s._name" "artist_count"
## [4] "released_year" "released_month" "released_day"
## [7] "in_spotify_playlists" "in_spotify_charts" "streams"
## [10] "in_apple_playlists" "in_apple_charts" "in_deezer_playlists"
## [13] "in_deezer_charts" "in_shazam_charts" "bpm"
## [16] "key" "mode" "dance"
## [19] "valence_." "energy" "acousticness_."
## [22] "instrumentalness_." "liveness_." "wordiness"
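A side note on those column names: entries such as `artist.s._name` and `valence_.` look like the result of `read.csv()` passing the file's headers through `make.names()`, which swaps characters that are not valid in R names for dots. The sketch below assumes the raw CSV headers were `artist(s)_name` and `valence_%`; that is my guess, not something shown in the output above.

```r
# Hypothetical raw headers (assumed, not read from the actual file)
raw_headers <- c("track_name", "artist(s)_name", "valence_%")

# read.csv() applies make.names() to headers when check.names = TRUE (the default)
clean_headers <- make.names(raw_headers)
print(clean_headers)
```

Passing `check.names = FALSE` to `read.csv()` would keep the headers exactly as they are written in the file.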
Below is a summary of the data as well as a few superlatives of interest.
# summarize data
summary(spotify)
## track_name artist.s._name artist_count released_year
## Length:953 Length:953 Min. :1.000 Min. :1930
## Class :character Class :character 1st Qu.:1.000 1st Qu.:2020
## Mode :character Mode :character Median :1.000 Median :2022
## Mean :1.556 Mean :2018
## 3rd Qu.:2.000 3rd Qu.:2022
## Max. :8.000 Max. :2023
## released_month released_day in_spotify_playlists in_spotify_charts
## Min. : 1.000 Min. : 1.00 Min. : 31 Min. : 0.00
## 1st Qu.: 3.000 1st Qu.: 6.00 1st Qu.: 875 1st Qu.: 0.00
## Median : 6.000 Median :13.00 Median : 2224 Median : 3.00
## Mean : 6.034 Mean :13.93 Mean : 5200 Mean : 12.01
## 3rd Qu.: 9.000 3rd Qu.:22.00 3rd Qu.: 5542 3rd Qu.: 16.00
## Max. :12.000 Max. :31.00 Max. :52898 Max. :147.00
## streams in_apple_playlists in_apple_charts in_deezer_playlists
## Min. :2.762e+03 Min. : 0.00 Min. : 0.00 Length:953
## 1st Qu.:1.414e+08 1st Qu.: 13.00 1st Qu.: 7.00 Class :character
## Median :2.902e+08 Median : 34.00 Median : 38.00 Mode :character
## Mean :5.136e+08 Mean : 67.81 Mean : 51.91
## 3rd Qu.:6.738e+08 3rd Qu.: 88.00 3rd Qu.: 87.00
## Max. :3.704e+09 Max. :672.00 Max. :275.00
## in_deezer_charts in_shazam_charts bpm key
## Min. : 0.000 Length:953 Min. : 65.0 Length:953
## 1st Qu.: 0.000 Class :character 1st Qu.:100.0 Class :character
## Median : 0.000 Mode :character Median :121.0 Mode :character
## Mean : 2.666 Mean :122.5
## 3rd Qu.: 2.000 3rd Qu.:140.0
## Max. :58.000 Max. :206.0
## mode dance valence_. energy
## Length:953 Min. :23.00 Min. : 4.00 Min. : 9.00
## Class :character 1st Qu.:57.00 1st Qu.:32.00 1st Qu.:53.00
## Mode :character Median :69.00 Median :51.00 Median :66.00
## Mean :66.97 Mean :51.43 Mean :64.28
## 3rd Qu.:78.00 3rd Qu.:70.00 3rd Qu.:77.00
## Max. :96.00 Max. :97.00 Max. :97.00
## acousticness_. instrumentalness_. liveness_. wordiness
## Min. : 0.00 Min. : 0.000 Min. : 3.00 Min. : 2.00
## 1st Qu.: 6.00 1st Qu.: 0.000 1st Qu.:10.00 1st Qu.: 4.00
## Median :18.00 Median : 0.000 Median :12.00 Median : 6.00
## Mean :27.06 Mean : 1.581 Mean :18.21 Mean :10.13
## 3rd Qu.:43.00 3rd Qu.: 0.000 3rd Qu.:24.00 3rd Qu.:11.00
## Max. :97.00 Max. :91.000 Max. :97.00 Max. :64.00
# Track added to the most Spotify playlists in 2023
in_most_playlists <- spotify$track_name[spotify$in_spotify_playlists == max(spotify$in_spotify_playlists)]
print(in_most_playlists)
## [1] "Get Lucky - Radio Edit"
# Track with the most Spotify streams in 2023
most_streams <- spotify$track_name[spotify$streams == max(spotify$streams)]
print(most_streams)
## [1] "Blinding Lights"
# Track with the highest danceability in Spotify's Top Songs of 2023
highest_danceability <- spotify$track_name[spotify$dance == max(spotify$dance)]
print(highest_danceability)
## [1] "Peru"
# Track with the highest beats per minute (bpm) in Spotify's Top Songs of 2023
highest_bpm <- spotify$track_name[spotify$bpm == max(spotify$bpm)]
print(highest_bpm)
## [1] "We Don't Talk About Bruno" "Lover"
# Track with the lowest beats per minute (bpm) in Spotify's Top Songs of 2023
lowest_bpm <- spotify$track_name[spotify$bpm == min(spotify$bpm)]
print(lowest_bpm)
# Oldest track in Spotify's Top Songs of 2023
oldest_song <- spotify$track_name[spotify$released_year == min(spotify$released_year)]
print(oldest_song)
## [1] "Agudo"
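Since each of these lookups follows the same pattern, a small helper can cut down on copy-and-paste slips between `max` and `min`. This is just a sketch: `track_where()` is a name I made up, and the three-row data frame is toy data, not the real dataset.

```r
# Toy data for illustration only
toy <- data.frame(
  track_name = c("Song A", "Song B", "Song C"),
  bpm        = c(120, 65, 206),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return every track whose `column` equals pick(column),
# where pick is a summary function such as max or min (ties are all returned)
track_where <- function(data, column, pick) {
  values <- data[[column]]
  data$track_name[which(values == pick(values, na.rm = TRUE))]
}

highest_bpm <- track_where(toy, "bpm", max)
lowest_bpm  <- track_where(toy, "bpm", min)
print(highest_bpm)
print(lowest_bpm)
```

On the real data, the call would look like `track_where(spotify, "bpm", min)`.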
Below is a plot of each song’s release year against the number of times it was streamed (on a log scale). The dense cluster of points on the right side of the graph shows that most of the popular songs of 2023 were released in the late 2010s and early 2020s. There is definitely something to be said for songs from earlier eras, but it seems that more recent releases have captured the hearts of Spotify users.
ggplot(data = spotify, aes(x = released_year, y = streams)) +
  geom_point() +
  labs(x = "Release Year", y = "# of Streams", title = "Release Year vs. Number of Streams") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  scale_y_continuous(trans = "log10") +
  scale_x_continuous(breaks = seq(1900, 2025, by = 20))
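To put a number on that clustering, one option is to count tracks per release decade and compute the share of recent releases. Here is a minimal sketch with made-up release years (on the real data, the column would be `spotify$released_year`).

```r
# Made-up release years for illustration only
released_year <- c(1975, 1999, 2013, 2019, 2020, 2022, 2022, 2023)

# Integer-divide by 10 to find the decade each track belongs to
decade <- (released_year %/% 10) * 10
decade_counts <- table(decade)
print(decade_counts)

# Share of tracks released in 2020 or later
recent_share <- mean(released_year >= 2020)
print(recent_share)
```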
Have you ever wondered whether there is any correlation between a song’s danceability rating and the number of times it gets streamed? The data suggests there isn’t.
ggplot(data = spotify, aes(x = dance, y = streams)) +
  geom_point() +
  scale_x_continuous(breaks = seq(0, 100, by = 10))
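If you’d rather check this numerically than by eye, a rank correlation is one way to do it. The sketch below computes Spearman’s rho on made-up numbers; on the real data you would pass `spotify$dance` and `spotify$streams` instead.

```r
# Made-up values for illustration only
dance   <- c(40, 55, 60, 70, 85, 90)
streams <- c(3e8, 1e8, 9e8, 2e8, 7e8, 1.5e8)

# Spearman's rho works on ranks, so a few huge stream counts don't dominate
rho <- cor(dance, streams, method = "spearman")
print(rho)
```

A value near 0 points to little or no monotonic relationship, while values near 1 or -1 point to a strong one.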
Plotting a song’s BPM against the number of times it gets streamed yields similar results: there is no clear trend. However, it does appear that listeners favor songs within a certain BPM range, and the data shows a preference for songs between 90 BPM and 120 BPM. If you are an aspiring artist hoping to break into the industry, I’d recommend writing a song within that range ;)
ggplot(data = spotify, aes(x = bpm, y = streams)) +
  geom_point() +
  scale_x_continuous(breaks = seq(0, 220, by = 20))
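One way to check a range claim like this is to bin the BPM values and count the tracks in each bin. Here’s a rough sketch using `cut()`; both the BPM values and the bin edges are made up for illustration.

```r
# Made-up BPM values for illustration only
bpm <- c(65, 92, 95, 100, 110, 118, 121, 140, 170, 206)

# Left-closed bins: [60,90), [90,120), [120,150), [150,210)
bpm_bin <- cut(bpm, breaks = c(60, 90, 120, 150, 210), right = FALSE)
bin_counts <- table(bpm_bin)
print(bin_counts)
```

On the real data, swapping in `spotify$bpm` would show how many of the 953 tracks actually land in the 90 to 120 range.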
Songs written in major keys generally sound happier, while songs written in minor keys tend to sound more melancholy or serious. Does the mode a song is written in affect the number of streams? The answer is a very candid “it depends.” While songs written in major keys generated more streams than songs written in minor keys, there is no strong relationship between whether a song is in a major or minor key and the number of streams it generates.
ggplot(spotify, aes(x = mode, y = streams)) +
  geom_boxplot() +
  labs(x = "Major/Minor Key", y = "Streams", title = "Major/Minor Keys vs. # of Streams")
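Box plots compare medians, and you can pull those numbers out directly to see how far apart the two groups really are. A minimal sketch with `tapply()` on made-up data (the `Major` and `Minor` labels are assumed values for the `mode` column):

```r
# Made-up data for illustration only
toy <- data.frame(
  mode    = c("Major", "Major", "Major", "Minor", "Minor", "Minor"),
  streams = c(2e8, 5e8, 9e8, 1e8, 4e8, 6e8),
  stringsAsFactors = FALSE
)

# Median streams within each mode
median_by_mode <- tapply(toy$streams, toy$mode, median)
print(median_by_mode)
```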
To use more or fewer words in a song: that is the question of the century. Well, lucky for you, the data shows a strong preference for songs that use fewer words. Some of the tracks with the most streams had a wordiness rating below 20 (ratings are on a scale of 1 to 100, with 100 being incredibly wordy).
ggplot(data = spotify, aes(x = wordiness, y = streams)) +
  geom_point() +
  scale_x_continuous(breaks = seq(0, 100, by = 10))
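To back this up with a number, you could look at what share of the most-streamed tracks have a wordiness rating under 20. Here is a sketch on made-up data; the cutoff of 20 is the one mentioned above, while the choice of the top 3 tracks is arbitrary.

```r
# Made-up data for illustration only
toy <- data.frame(
  wordiness = c(4, 6, 35, 11, 50, 8),
  streams   = c(9e8, 7e8, 1e8, 8e8, 2e8, 3e8)
)

# Take the 3 most-streamed rows, then check what fraction sit below 20
top_tracks <- head(toy[order(toy$streams, decreasing = TRUE), ], 3)
share_below_20 <- mean(top_tracks$wordiness < 20)
print(share_below_20)
```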
Does this make sense?
Wholeheartedly, yes.
Now, is there a time and place for wordy songs? Of course. However, you must keep in mind that you are a storyteller without thousands of pages to work with. You must be able to convey your message in short, simple, but memorable phrases. This makes it easier for the audience to remember your lyrics and leaves room for your music to shine through!