Main Question/Goal
The key question: What characteristics of a song contribute most to its popularity on Spotify?
The Spotify Songs dataset contains detailed information on over 32,000 tracks pulled from Spotify’s Web API in January 2020. Each song is described by musical features such as energy, danceability, loudness, acousticness, and speechiness, as well as metadata like playlist genre, artist name, track popularity, and album release date. These attributes make the dataset well suited to exploring how musical characteristics relate to a song’s popularity, how genres are distributed, and how trends change over time.
Dataset: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/spotify_songs.csv
Documentation: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md
The primary goal of this project is to explore how various musical features, such as energy, danceability, loudness, and tempo, impact a song’s popularity on Spotify. Understanding these relationships can provide insights into which musical attributes resonate most with listeners.
The purpose of this analysis is twofold:
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
spotify_songs <- read.csv("C:/Users/priya/Downloads/spotify_songs.csv")
str(spotify_songs)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
track_id (chr): The unique identifier for each track on Spotify (e.g., “6f807x0ima9a1j3VPbc7VN”).
track_name (chr): The name of the track (e.g., “I Don’t Care (with Justin Bieber) - Loud Luxury Remix”).
track_artist (chr): The name of the artist who performed the track (e.g., “Ed Sheeran”).
track_popularity (int): A numeric value representing the popularity of the track on Spotify, ranging from 0 to 100 (e.g., 66).
track_album_id (chr): The unique identifier for the album that contains the track (e.g., “2oCs0DGTsRO98Gh5ZSl2Cx”).
track_album_name (chr): The name of the album that contains the track (e.g., “I Don’t Care (with Justin Bieber) [Loud Luxury Remix]”).
track_album_release_date (chr): The release date of the album in the format “YYYY-MM-DD” (e.g., “2019-06-14”).
playlist_name (chr): The name of the playlist that the track belongs to (e.g., “Pop Remix”).
playlist_id (chr): The unique identifier for the playlist (e.g., “37i9dQZF1DXcZDD7cfEKhW”).
playlist_genre (chr): The genre of the playlist (e.g., “pop”).
playlist_subgenre (chr): The subgenre of the playlist (e.g., “dance pop”).
danceability (num): A measure of how suitable the track is for dancing, ranging from 0.0 to 1.0 (e.g., 0.748).
energy (num): A measure of intensity and activity in the track, ranging from 0.0 to 1.0 (e.g., 0.916).
key (int): The estimated key of the track as an integer from 0 to 11 in standard Pitch Class notation, where 0 = C, 1 = C♯/D♭, 2 = D, and so on (e.g., 6).
loudness (num): The overall loudness of the track in decibels (dB), averaged across the entire track. Values are mostly negative, and values closer to 0 are louder (e.g., -2.63).
mode (int): Indicates the modality of the track, where 1 represents major and 0 represents minor (e.g., 1).
speechiness (num): A measure of the presence of spoken words in the track, ranging from 0.0 to 1.0 (e.g., 0.0583).
acousticness (num): A measure of the acoustic quality of the track, ranging from 0.0 to 1.0 (e.g., 0.102).
instrumentalness (num): A measure of the presence of instrumental parts, ranging from 0.0 to 1.0. Higher values indicate a greater likelihood of the track being instrumental (e.g., 0.00).
liveness (num): A measure of the likelihood that the track was recorded live, ranging from 0.0 to 1.0 (e.g., 0.0653).
valence (num): A measure of the musical positivity of the track, ranging from 0.0 to 1.0 (e.g., 0.518).
tempo (num): The speed or pace of the track, measured in beats per minute (BPM) (e.g., 122).
duration_ms (int): The length of the track in milliseconds (e.g., 194754).
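Several of the columns described above are easier to read after light recoding. The sketch below is illustrative only: `songs` is a hypothetical three-row sample that borrows values from the str() output, and the derived column names (`key_name`, `mode_name`, `duration_min`, `release_year`) are assumptions introduced here, not columns of the dataset.

```r
# Hypothetical three-row sample using column names from the str() output.
songs <- data.frame(
  key = c(6L, 11L, 1L),
  mode = c(1L, 1L, 0L),
  duration_ms = c(194754L, 162600L, 176616L),
  track_album_release_date = c("2019-06-14", "2019-12-13", "2019-07-05"),
  stringsAsFactors = FALSE
)

# Pitch Class notation: 0 = C, 1 = C#/Db, ..., 11 = B.
pitch_classes <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                   "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

# match() maps key values 0..11 to positions 1..12; any other value becomes NA.
songs$key_name <- pitch_classes[match(songs$key, 0:11)]

# mode is coded 1 = major, 0 = minor.
songs$mode_name <- ifelse(songs$mode == 1L, "major", "minor")

# Convert milliseconds to minutes (60,000 ms per minute).
songs$duration_min <- songs$duration_ms / 60000

# Assumes the date string starts with a four-digit year.
songs$release_year <- as.integer(substr(songs$track_album_release_date, 1, 4))

print(songs[, c("key_name", "mode_name", "duration_min", "release_year")])
```

The same assignments could be applied to `spotify_songs` once it is loaded; the year extraction only relies on the first four characters of the date string.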
summary(spotify_songs)
## track_id track_name track_artist track_popularity
## Length:32833 Length:32833 Length:32833 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 24.00
## Mode :character Mode :character Mode :character Median : 45.00
## Mean : 42.48
## 3rd Qu.: 62.00
## Max. :100.00
## track_album_id track_album_name track_album_release_date
## Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## playlist_name playlist_id playlist_genre playlist_subgenre
## Length:32833 Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171
## Median :0.6720 Median :0.721000 Median : 6.000 Median : -6.166
## Mean :0.6548 Mean :0.698619 Mean : 5.374 Mean : -6.720
## 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median :1.0000 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean :0.5657 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
The summary of the spotify_songs dataset provides a statistical overview of its 23 variables:
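Track details: The dataset contains 32,833 rows and includes identifying attributes such as track_id, track_name, and track_artist, which are character-type variables. Because every row also carries playlist fields, the same track may appear more than once, so this figure is a row count rather than a confirmed count of unique tracks.
Track popularity: The popularity score ranges from 0 to 100, with a median of 45 and a mean of 42.48.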
Musical attributes:
- danceability: Ranges from 0.0 to 0.983 (median: 0.672).
- energy: Ranges from 0.0002 to 1.0 (median: 0.721).
- loudness: Ranges from -46.45 dB to 1.28 dB (median: -6.17 dB).
- speechiness, acousticness, and instrumentalness: Generally low values, indicating mostly non-instrumental, less acoustic, and not very speech-heavy tracks.
- valence: Indicates musical positivity, ranging from 0.0 to 0.991 (median: 0.512).
- tempo: Ranges from 0 to 239 BPM (median: 122 BPM).
Duration: Track lengths range from 4,000 to 517,810 milliseconds (about 8.6 minutes), with a median duration of 216,000 milliseconds (about 3.6 minutes).
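The summary() output counts rows, so it cannot show whether a track is listed more than once, and it does not say how many rows sit at the extreme values it reports (such as a tempo of 0). A minimal sketch of both checks follows, using a hypothetical four-row `songs` frame in which one track appears in two playlists; the values are made up for illustration.

```r
# Hypothetical sample: track "a1" is listed in two different playlists.
songs <- data.frame(
  track_id = c("a1", "b2", "a1", "c3"),
  playlist_id = c("p1", "p1", "p2", "p2"),
  track_popularity = c(66L, 67L, 66L, 0L),
  tempo = c(122, 100, 122, 0),
  stringsAsFactors = FALSE
)

n_rows <- nrow(songs)                                # total rows
n_unique_tracks <- length(unique(songs$track_id))    # distinct track IDs
n_repeated_rows <- sum(duplicated(songs$track_id))   # rows repeating an earlier ID
n_zero_tempo <- sum(songs$tempo == 0)                # rows at the tempo minimum

# Keep the first row for each track ID to get one row per track.
tracks <- songs[!duplicated(songs$track_id), ]

cat("rows:", n_rows, "unique tracks:", n_unique_tracks,
    "repeated rows:", n_repeated_rows, "zero tempo:", n_zero_tempo, "\n")
```

Whether to analyze all rows or one row per track is a design choice: keeping repeats gives extra weight to tracks that appear in many playlists.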
ggplot(spotify_songs, aes(x = energy, y = track_popularity)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Energy vs Popularity", x = "Energy", y = "Popularity") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
This scatter plot shows each song’s energy (x-axis, a 0.0 to 1.0 measure of intensity and activity) against its popularity score (y-axis, 0 to 100). Each point represents one song, and a fitted linear trend line summarizes the overall direction of the relationship between the two variables.
Energy is a crucial characteristic in music analysis as it captures the overall intensity of a song, ranging from softer, slower tracks to fast, dynamic ones. High-energy songs are often upbeat, intense, and suited for activities like workouts or parties, which may make them more appealing to certain audiences. By exploring this relationship, we aim to understand if there is a positive correlation, meaning songs with higher energy levels tend to be more popular on Spotify. Identifying such patterns could provide valuable insights for artists and producers in creating music that aligns with listener preferences. This analysis also offers a foundation for deeper investigation into how musical attributes influence popularity.
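A trend line shows direction, but a correlation coefficient puts a number on how strong the linear relationship is. The sketch below computes Pearson's r with base R's `cor()` on a hypothetical five-row `songs` frame; the values are invented so the arithmetic is easy to follow and say nothing about the real dataset.

```r
# Hypothetical toy data, not taken from spotify_songs.
songs <- data.frame(
  energy = c(0.2, 0.4, 0.6, 0.8, 1.0),
  track_popularity = c(50, 40, 45, 30, 35)
)

# Pearson correlation: +1 is a perfect increasing line, -1 a perfect
# decreasing line, and 0 means no linear relationship.
r <- cor(songs$energy, songs$track_popularity)

# Squared correlation: share of variance in popularity that a straight
# line in energy accounts for.
r_squared <- r^2

cat("r =", r, " r^2 =", r_squared, "\n")
```

On the full data, the same call would be `cor(spotify_songs$energy, spotify_songs$track_popularity)`. A sign opposite to the expected one, or a value close to 0, would both count as evidence against a positive relationship.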
ggplot(spotify_songs, aes(x = danceability, y = track_popularity)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Danceability vs Popularity", x = "Danceability", y = "Popularity") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
This plot explores the relationship between a song’s danceability and its popularity on Spotify. Danceability measures how suitable a track is for dancing, considering elements like tempo, rhythm stability, beat strength, and overall musical patterns. Each point on the plot represents a song, with danceability scores on the x-axis and popularity scores on the y-axis, while a linear trend line helps visualize potential correlations.
Danceability is a key characteristic for many Spotify listeners, particularly for those who enjoy genres like pop, electronic, or dance music, where rhythmic appeal plays a major role in audience engagement. Songs with higher danceability scores often have more consistent beats and rhythmic structures that make them more enjoyable in social settings like clubs or parties. Understanding how this feature influences a song’s popularity is important for both artists and music producers, as it helps them craft tracks that resonate with listeners’ preferences for danceable music.
By examining this relationship, we aim to identify whether more danceable songs tend to be more popular on Spotify, which could provide actionable insights for improving playlist recommendations, music marketing, and understanding audience behavior. Because the dataset records a popularity score rather than stream counts, the analysis evaluates whether rhythmic appeal is associated with higher popularity scores.
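The line drawn by geom_smooth(method = "lm") is an ordinary least-squares fit, so its slope and fit quality can be read directly from `lm()`. The sketch below does this on a hypothetical five-row `songs` frame; the numbers are invented for illustration and are not results from the dataset.

```r
# Hypothetical toy data, not taken from spotify_songs.
songs <- data.frame(
  danceability = c(0.2, 0.4, 0.6, 0.8, 1.0),
  track_popularity = c(30, 35, 45, 40, 50)
)

# Ordinary least-squares fit of popularity on danceability.
fit <- lm(track_popularity ~ danceability, data = songs)

# Slope: expected change in popularity for a one-unit change in
# danceability, which spans the feature's whole 0-to-1 scale.
slope <- unname(coef(fit)["danceability"])
intercept <- unname(coef(fit)["(Intercept)"])

# R-squared: share of popularity variance explained by the fitted line.
r_squared <- summary(fit)$r.squared

cat("slope =", slope, " intercept =", intercept, " R^2 =", r_squared, "\n")
```

A low R-squared alongside a clearly non-zero slope would suggest that danceability shifts popularity on average but leaves most of the variation unexplained.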
To further explore the relationships between musical features and song popularity, the next steps in this project are as follows:
Investigate Additional Features: Examine other song attributes such as loudness, tempo, valence (positivity), and acousticness to see how they correlate with popularity. This broader analysis will provide a more comprehensive view of the factors that influence a song’s success.
Genre-Based Analysis: Analyze whether certain musical features contribute more significantly to popularity within specific genres. This can help reveal whether the importance of features like danceability or energy varies across genres like pop, hip-hop, or rock.
Refine Visualizations: Create more advanced visualizations, including heatmaps and pair plots, to better understand feature correlations and trends over time.
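The first two steps above can be sketched with base R: correlate several features with popularity at once, then repeat the calculation within each genre. The `songs` frame below is hypothetical and deliberately constructed so that the overall energy correlation is zero while the two genres show opposite relationships, which is the kind of pattern a genre-level analysis is meant to reveal.

```r
# Hypothetical toy data with two genres, not taken from spotify_songs.
songs <- data.frame(
  playlist_genre = rep(c("pop", "rock"), each = 4),
  energy = c(0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8),
  danceability = c(0.8, 0.6, 0.4, 0.2, 0.3, 0.5, 0.5, 0.7),
  track_popularity = c(10, 20, 30, 40, 40, 30, 20, 10),
  stringsAsFactors = FALSE
)

features <- c("energy", "danceability")

# Correlation of each feature with popularity across all rows.
overall <- sapply(features, function(f) {
  cor(songs[[f]], songs$track_popularity)
})

# The same correlations computed separately within each genre.
# Result: a matrix with one row per feature and one column per genre.
by_genre <- sapply(split(songs, songs$playlist_genre), function(g) {
  sapply(features, function(f) cor(g[[f]], g$track_popularity))
})

print(round(overall, 3))
print(round(by_genre, 3))
```

A matrix shaped like `by_genre` is also a natural input for the heatmaps mentioned in the third step, with features on one axis and genres on the other.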
Hypothesis 1: Songs with higher energy tend to be more popular on Spotify.
Energy, as defined in this dataset, measures the intensity and activity of a song. Higher energy levels are typically associated with fast tempos, louder sounds, and more dynamic musical structures, often found in genres like electronic dance music, pop, or rock. These songs tend to be more stimulating and engaging, making them suitable for high-energy activities such as workouts, parties, or events.
This hypothesis assumes that listeners gravitate towards energetic tracks, particularly for upbeat playlists or social activities, which can lead to higher popularity scores on Spotify. We aim to test whether there is a positive correlation between a song’s energy level and its popularity score. If the hypothesis holds, the finding could be useful for artists and producers in creating tracks that meet audience demand for lively, energetic music.
Hypothesis 2: Songs with higher danceability tend to be more popular on Spotify.
Danceability measures how well a song can be danced to, based on elements like tempo, rhythm stability, beat strength, and musical patterns. Songs with higher danceability scores often have a steady rhythm and predictable structure, making them ideal for social settings, particularly in genres like pop, electronic, or hip-hop.
The hypothesis here is that listeners prefer tracks that are easier to dance to, leading to higher popularity scores for these songs. This is particularly relevant in today’s music consumption landscape, where social settings, parties, and curated playlists often feature highly danceable songs. If this hypothesis holds true, it would highlight the importance of rhythmic appeal in driving song popularity and could guide artists, DJs, and producers in tailoring their music to listeners who prioritize danceable beats.
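Both hypotheses predict a positive correlation, so one way to test them is a one-sided Pearson test with base R's `cor.test()`. This is a sketch under stated assumptions, not the project's established method: `songs` is a hypothetical five-row frame with invented values, and `test_positive` is a helper name introduced here for illustration.

```r
# Hypothetical toy data, not taken from spotify_songs.
songs <- data.frame(
  energy = c(1.0, 0.8, 0.6, 0.4, 0.2),
  danceability = c(0.2, 0.4, 0.6, 0.8, 1.0),
  track_popularity = c(30, 35, 45, 40, 50)
)

# Hypothetical helper: one-sided test of H0 "correlation <= 0" against
# H1 "correlation > 0", returning the estimate and the p-value.
test_positive <- function(x, y) {
  res <- cor.test(x, y, alternative = "greater", method = "pearson")
  c(r = unname(res$estimate), p_value = res$p.value)
}

h1 <- test_positive(songs$energy, songs$track_popularity)        # Hypothesis 1
h2 <- test_positive(songs$danceability, songs$track_popularity)  # Hypothesis 2

print(round(h1, 4))
print(round(h2, 4))
```

In this toy data the energy correlation is negative, so the one-sided test gives no support to Hypothesis 1, while the danceability correlation is positive. With tens of thousands of rows, even a very small correlation can yield a tiny p-value, so the size of r deserves as much attention as its significance.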
# Visualization for Hypothesis 1
ggplot(spotify_songs, aes(x = energy, y = track_popularity)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "green") +
  labs(title = "Energy vs Popularity (Hypothesis 1)", x = "Energy", y = "Popularity") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Visualization for Hypothesis 2
ggplot(spotify_songs, aes(x = danceability, y = track_popularity)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "purple") +
  labs(title = "Danceability vs Popularity (Hypothesis 2)", x = "Danceability", y = "Popularity") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'