Dataset Description

The Spotify Songs dataset contains detailed information on over 32,000 playlist tracks pulled from Spotify’s Web API in January 2020. Each song is described by a set of audio features, such as energy, danceability, loudness, acousticness, and speechiness, as well as metadata such as playlist genre, artist name, track popularity, and album release date. Together these attributes make it possible to explore how musical characteristics relate to a song’s popularity, how songs are distributed across genres, and how both change over time.

The dataset is available on GitHub (https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/spotify_songs.csv), along with the corresponding documentation (https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md).

Main Question/Goal

The key question: What characteristics of a song contribute most to its popularity on Spotify?

Project Goal:

The primary goal of this project is to explore how various musical features, such as energy, danceability, loudness, and tempo, impact a song’s popularity on Spotify. Understanding these relationships can provide insights into which musical attributes resonate most with listeners.

Purpose

The purpose of this analysis is twofold:

  1. To identify patterns and trends in musical features that correlate with higher popularity.
  2. To utilize these insights to predict and better understand the elements of a successful song, potentially informing music recommendations and playlist curation.

Load the data

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Read the CSV downloaded from the TidyTuesday repository (local path; adjust to your machine)
spotify_songs <- read.csv("C:/Users/priya/Downloads/spotify_songs.csv")
str(spotify_songs)
## 'data.frame':    32833 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...

Variables Breakdown:

track_id (chr): The unique identifier for each track on Spotify (e.g., “6f807x0ima9a1j3VPbc7VN”).

track_name (chr): The name of the track (e.g., “I Don’t Care (with Justin Bieber) - Loud Luxury Remix”).

track_artist (chr): The name of the artist who performed the track (e.g., “Ed Sheeran”).

track_popularity (int): A numeric value representing the popularity of the track on Spotify, ranging from 0 to 100 (e.g., 66).

track_album_id (chr): The unique identifier for the album that contains the track (e.g., “2oCs0DGTsRO98Gh5ZSl2Cx”).

track_album_name (chr): The name of the album that contains the track (e.g., “I Don’t Care (with Justin Bieber) [Loud Luxury Remix]”).

track_album_release_date (chr): The release date of the album in the format “YYYY-MM-DD” (e.g., “2019-06-14”).

playlist_name (chr): The name of the playlist that the track belongs to (e.g., “Pop Remix”).

playlist_id (chr): The unique identifier for the playlist (e.g., “37i9dQZF1DXcZDD7cfEKhW”).

playlist_genre (chr): The genre of the playlist (e.g., “pop”).

playlist_subgenre (chr): The subgenre of the playlist (e.g., “dance pop”).

danceability (num): A measure of how suitable the track is for dancing, ranging from 0.0 to 1.0 (e.g., 0.748).

energy (num): A measure of intensity and activity in the track, ranging from 0.0 to 1.0 (e.g., 0.916).

key (int): The estimated overall key of the track as an integer from 0 to 11 in standard Pitch Class notation, where 0 = C, 1 = C♯/D♭, 2 = D, and so on (e.g., 6).

loudness (num): The overall loudness of the track in decibels (dB), averaged across the entire track. Values typically fall between -60 and 0 dB (e.g., -2.63).

mode (int): Indicates the modality of the track, where 1 represents major and 0 represents minor (e.g., 1).

speechiness (num): A measure of the presence of spoken words in the track, ranging from 0.0 to 1.0 (e.g., 0.0583).

acousticness (num): A measure of the acoustic quality of the track, ranging from 0.0 to 1.0 (e.g., 0.102).

instrumentalness (num): A measure of the presence of instrumental parts, ranging from 0.0 to 1.0. Higher values indicate a greater likelihood of the track being instrumental (e.g., 0.00).

liveness (num): A measure of the likelihood that the track was recorded live, ranging from 0.0 to 1.0 (e.g., 0.0653).

valence (num): A measure of the musical positivity of the track, ranging from 0.0 to 1.0 (e.g., 0.518).

tempo (num): The speed or pace of the track, measured in beats per minute (BPM) (e.g., 122).

duration_ms (int): The length of the track in milliseconds (e.g., 194754).
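
Several of the variables above are stored as plain text or integers even though they are dates or categories. The sketch below shows one way to derive a release year and to recode the categorical columns. It runs on a small made-up data frame (`songs`, with illustrative values only) and assumes that some release dates may be partial, such as a bare year, which is why it takes the first four characters instead of calling `as.Date()`.

```r
# Toy stand-in for a few rows of spotify_songs (values are illustrative only).
songs <- data.frame(
  track_album_release_date = c("2019-06-14", "2012", "1998-03"),
  playlist_genre = c("pop", "rock", "pop"),
  mode = c(1L, 0L, 1L),
  stringsAsFactors = FALSE
)

# The first four characters are the year for "YYYY", "YYYY-MM" and "YYYY-MM-DD".
songs$release_year <- as.integer(substr(songs$track_album_release_date, 1, 4))

# Treat genre and mode as categories rather than free text or plain integers.
songs$playlist_genre <- factor(songs$playlist_genre)
songs$mode <- factor(songs$mode, levels = c(0, 1), labels = c("minor", "major"))

str(songs)
```

The same three assignments could be applied to spotify_songs directly, since the column names match the str() output above.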

summary(spotify_songs)
##    track_id          track_name        track_artist       track_popularity
##  Length:32833       Length:32833       Length:32833       Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 45.00  
##                                                           Mean   : 42.48  
##                                                           3rd Qu.: 62.00  
##                                                           Max.   :100.00  
##  track_album_id     track_album_name   track_album_release_date
##  Length:32833       Length:32833       Length:32833            
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##  playlist_name      playlist_id        playlist_genre     playlist_subgenre 
##  Length:32833       Length:32833       Length:32833       Length:32833      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171  
##  Median :0.6720   Median :0.721000   Median : 6.000   Median : -6.166  
##  Mean   :0.6548   Mean   :0.698619   Mean   : 5.374   Mean   : -6.720  
##  3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##       mode         speechiness      acousticness    instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   :0.5657   Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472  
##  3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187819  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225800  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810

Summary

The summary of the spotify_songs dataset provides a statistical overview of its 23 variables:

Track details: The dataset contains 32,833 rows, one per track in a playlist, so a track that belongs to more than one playlist can appear more than once. Identifying attributes such as track_id, track_name, and track_artist are character-type variables.
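
The number of rows and the number of distinct tracks are not necessarily the same, because one track can sit in several playlists. A minimal sketch of the check, using a made-up four-row data frame (`songs`) in place of spotify_songs:

```r
# Toy data: track "a1" appears in two different playlists.
songs <- data.frame(
  track_id = c("a1", "b2", "a1", "c3"),
  playlist_id = c("p1", "p1", "p2", "p2"),
  stringsAsFactors = FALSE
)

n_rows <- nrow(songs)                          # number of track-playlist rows
n_unique <- length(unique(songs$track_id))     # number of distinct tracks

# Keep the first occurrence of each track for per-track analyses.
unique_songs <- songs[!duplicated(songs$track_id), ]

c(rows = n_rows, unique_tracks = n_unique)
```

If the two counts differ on the real data, deduplicating by track_id before computing correlations avoids counting the same song several times.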

Track popularity: The popularity score ranges from 0 to 100, with a median of 45.

Musical attributes:

  - danceability: Ranges from 0.0 to 0.983 (median: 0.672).
  - energy: Ranges from 0.0002 to 1.0 (median: 0.721).
  - loudness: Ranges from -46.45 dB to 1.28 dB (median: -6.17 dB).
  - speechiness, acousticness, and instrumentalness: Generally low values, indicating mostly non-instrumental, less acoustic, and not very speech-heavy tracks.
  - valence: Indicates musical positivity, ranging from 0.0 to 0.991 (median: 0.512).
  - tempo: Ranges from 0 to 239 BPM (median: 122 BPM).

Duration: Track lengths range from 4,000 to 517,810 milliseconds (about 8.6 minutes), with a median duration of 216,000 milliseconds (about 3.6 minutes).
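
The minute values quoted above follow from dividing milliseconds by 60,000. A short sketch of that arithmetic, where `ms_to_minutes` is a helper name introduced here for illustration:

```r
# Convert a duration in milliseconds to minutes (60,000 ms per minute).
ms_to_minutes <- function(ms) ms / 60000

# Minimum, median and maximum of duration_ms taken from the summary() output.
durations_ms <- c(min = 4000, median = 216000, max = 517810)

durations_min <- round(ms_to_minutes(durations_ms), 2)
durations_min
```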

Visualizations of interesting aspects

Plot energy vs popularity

ggplot(spotify_songs, aes(x = energy, y = track_popularity)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Energy vs Popularity", x = "Energy", y = "Popularity") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Explanation:

This visualization presents a scatter plot that examines the relationship between a song’s energy and its popularity. Each point on the plot represents a song, with the x-axis showing the energy level (a measure of intensity and activity) and the y-axis representing the song’s popularity score (on a scale of 0 to 100). A linear trend line has been added to help visualize potential correlations between these two variables.

Energy captures the overall intensity of a song, from softer, slower tracks to fast, dynamic ones. High-energy songs are often upbeat and suited to activities like workouts or parties, which may make them more appealing to certain audiences. The plot is used to check whether there is a positive association, that is, whether songs with higher energy tend to have higher popularity scores on Spotify. A clear pattern of this kind could inform artists and producers, and it gives a starting point for investigating how other musical attributes relate to popularity.
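
A trend line is easier to interpret alongside numbers. The sketch below computes the Pearson correlation, the slope of the same linear fit that geom_smooth(method = "lm") draws, and the R-squared. It runs on simulated data (`toy`) and not on the real dataset, so the printed values say nothing about Spotify. `describe_relationship` is a helper name introduced here for illustration.

```r
# Simulated stand-in for spotify_songs; the relationship is invented.
set.seed(42)
n <- 200
toy <- data.frame(energy = runif(n))
toy$track_popularity <- 40 + 5 * toy$energy + rnorm(n, sd = 20)

# Summarise the linear relationship between one feature and popularity.
describe_relationship <- function(data, feature, outcome = "track_popularity") {
  fit <- lm(reformulate(feature, response = outcome), data = data)
  c(
    pearson_r = cor(data[[feature]], data[[outcome]]),
    slope = unname(coef(fit)[feature]),
    r_squared = summary(fit)$r.squared
  )
}

res <- describe_relationship(toy, "energy")
round(res, 3)
```

Calling describe_relationship(spotify_songs, "energy") would give the corresponding figures for the real data. For a single predictor, R-squared equals the squared correlation, which gives a quick sanity check.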

Plot danceability vs popularity

ggplot(spotify_songs, aes(x = danceability, y = track_popularity)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Danceability vs Popularity", x = "Danceability", y = "Popularity") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Explanation:

This plot explores the relationship between a song’s danceability and its popularity on Spotify. Danceability measures how suitable a track is for dancing, considering elements like tempo, rhythm stability, beat strength, and overall musical patterns. Each point on the plot represents a song, with danceability scores on the x-axis and popularity scores on the y-axis, while a linear trend line helps visualize potential correlations.

Danceability is a key characteristic for many Spotify listeners, particularly for those who enjoy genres like pop, electronic, or dance music, where rhythmic appeal plays a major role in audience engagement. Songs with higher danceability scores often have more consistent beats and rhythmic structures that make them more enjoyable in social settings like clubs or parties. Understanding how this feature influences a song’s popularity is important for both artists and music producers, as it helps them craft tracks that resonate with listeners’ preferences for danceable music.

By examining this relationship, we aim to identify whether more danceable songs tend to be more popular on Spotify, which could provide actionable insights for playlist recommendations, music marketing, and understanding audience behavior. Because the dataset records a popularity score rather than stream counts, the analysis evaluates whether rhythmic appeal is associated with higher popularity scores.
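
With tens of thousands of points, a scatter plot can hide the trend behind overplotting. One complementary view, sketched here on simulated data (`toy`, with invented values), is to group danceability into equal-width bins and compare the mean popularity per bin. The bin width of 0.2 is an arbitrary choice for illustration.

```r
# Simulated stand-in for spotify_songs; the relationship is invented.
set.seed(1)
n <- 300
toy <- data.frame(danceability = runif(n))
toy$track_popularity <- pmin(100, pmax(0,
  round(45 + 10 * toy$danceability + rnorm(n, sd = 20))))

# Cut danceability into five equal-width bins covering 0 to 1.
toy$dance_bin <- cut(toy$danceability, breaks = seq(0, 1, by = 0.2),
                     include.lowest = TRUE)

# Mean popularity and number of songs in each bin.
bin_means <- tapply(toy$track_popularity, toy$dance_bin, mean)
bin_counts <- table(toy$dance_bin)
bin_summary <- data.frame(
  bin = names(bin_means),
  mean_popularity = round(as.numeric(bin_means), 1),
  n = as.integer(bin_counts)
)
print(bin_summary)
```

Reporting the count per bin matters because a mean based on very few songs at the extremes is less reliable than one from the crowded middle of the range.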

Plan for Moving Forward

To further explore the relationships between musical features and song popularity, the next steps in this project are as follows:

  1. Investigate Additional Features: Examine other song attributes such as loudness, tempo, valence (positivity), and acousticness to see how they correlate with popularity. This broader analysis will provide a more comprehensive view of the factors that influence a song’s success.

  2. Genre-Based Analysis: Analyze whether certain musical features contribute more significantly to popularity within specific genres. This can help reveal whether the importance of features like danceability or energy varies across genres like pop, hip-hop, or rock.

  3. Refine Visualizations: Create more advanced visualizations, including heatmaps and pair plots, to better understand feature correlations and trends over time.
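
The first two steps above can be sketched in base R as a vector of feature-popularity correlations, computed once over all rows and once within each genre. The data frame `toy` is randomly generated, so its correlations are near zero by construction. Only the column names are taken from the dataset.

```r
# Randomly generated stand-in for spotify_songs (no real relationships).
set.seed(7)
n <- 150
toy <- data.frame(
  playlist_genre = sample(c("pop", "rock", "rap"), n, replace = TRUE),
  energy = runif(n),
  danceability = runif(n),
  valence = runif(n),
  track_popularity = sample(0:100, n, replace = TRUE),
  stringsAsFactors = FALSE
)
features <- c("energy", "danceability", "valence")

# Correlation of every feature with popularity, over all rows.
overall <- sapply(features, function(f) cor(toy[[f]], toy$track_popularity))

# The same correlations computed separately within each genre
# (rows are features, columns are genres).
by_genre <- sapply(split(toy, toy$playlist_genre), function(g) {
  sapply(features, function(f) cor(g[[f]], g$track_popularity))
})

round(overall, 2)
round(by_genre, 2)
```

The by_genre matrix has the shape a heatmap needs, so it could feed the third step once it is reshaped for ggplot2.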

Hypotheses

Hypothesis 1: Songs with higher energy levels tend to be more popular.

Energy, as defined in this dataset, measures the intensity and activity of a song. Higher energy levels are typically associated with fast tempos, louder sounds, and more dynamic musical structures, often found in genres like electronic dance music, pop, or rock. These songs tend to be more stimulating and engaging, making them suitable for high-energy activities such as workouts, parties, or events.

This hypothesis assumes that listeners gravitate towards energetic tracks, particularly for upbeat playlists or social activities, which can lead to higher popularity scores on Spotify. We aim to test whether there is a positive correlation between a song’s energy level and its track_popularity score. If the hypothesis holds, the finding could be useful for artists and producers in creating tracks that meet audience demand for lively, energetic music.

Hypothesis 2: Songs with higher danceability tend to be more popular.

Danceability measures how well a song can be danced to, based on elements like tempo, rhythm stability, beat strength, and musical patterns. Songs with higher danceability scores often have a steady rhythm and predictable structure, making them ideal for social settings, particularly in genres like pop, electronic, or hip-hop.

The hypothesis here is that listeners prefer tracks that are easier to dance to, leading to higher popularity scores for these songs. This is relevant because social settings, parties, and curated playlists often feature highly danceable songs. If the hypothesis holds, it would highlight the importance of rhythmic appeal in song popularity and could guide artists, DJs, and producers in tailoring their music to listeners who prioritize danceable beats.
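
Both hypotheses predict a positive correlation, so one possible formal check is a one-sided Pearson correlation test. The sketch below uses base R's cor.test() with alternative = "greater" on simulated data (`toy`). In this invented data only danceability is given a built-in effect, which says nothing about the real dataset. `test_positive_association` is a helper name introduced here for illustration.

```r
# Simulated stand-in for spotify_songs; effects are invented.
set.seed(123)
n <- 250
toy <- data.frame(energy = runif(n), danceability = runif(n))
toy$track_popularity <- 40 + 10 * toy$danceability + rnorm(n, sd = 20)

# One-sided test of H0: correlation <= 0 against H1: correlation > 0.
test_positive_association <- function(data, feature) {
  result <- cor.test(data[[feature]], data$track_popularity,
                     alternative = "greater", method = "pearson")
  c(estimate = unname(result$estimate), p_value = result$p.value)
}

h1 <- test_positive_association(toy, "energy")
h2 <- test_positive_association(toy, "danceability")
rbind(energy = h1, danceability = h2)
```

With more than 32,000 rows, even a very weak correlation can produce a small p-value, so the size of the estimate deserves as much attention as its significance.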

Visualization for Hypothesis 1

# Visualization for Hypothesis 1
ggplot(spotify_songs, aes(x = energy, y = track_popularity)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "green") +
  labs(title = "Energy vs Popularity (Hypothesis 1)", x = "Energy", y = "Popularity") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Visualization for Hypothesis 2

# Visualization for Hypothesis 2
ggplot(spotify_songs, aes(x = danceability, y = track_popularity)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "purple") +
  labs(title = "Danceability vs Popularity (Hypothesis 2)", x = "Danceability", y = "Popularity") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'