Project 2

Author

Jhonathan Urquilla

Final Project: Spotify Song Popularity Influences

Source: Spotify website, Katy Perry artist page

This project explores the relationship between various audio features and the popularity of songs on Spotify from 2010 onward. The data set includes quantitative variables such as popularity, danceability, energy, loudness, tempo, and duration, as well as categorical variables such as artist name and song name. The data comes from a collection of popular Spotify songs from 1998-2020. Unfortunately, no ReadMe file or detailed documentation is provided, so the exact methodology behind the data collection and feature calculation is not publicly available. This analysis assumes the data reflects Spotify’s internal audio analysis and popularity measures. The data was cleaned by removing duplicates, standardizing names, converting data types, and filtering incomplete records. The focus is on tracks by the five artists with the most songs in the filtered data: Katy Perry, Drake, Calvin Harris, Ariana Grande, and David Guetta.

I chose this topic due to a personal interest in music and the desire to explore how measurable audio traits relate to success. Analyzing well-known artists makes the study more relatable and engaging to me, while leveraging Spotify data provides insight into trends in popular music.

Sources:

Music Marketing Monday. (2023). How do people discover and consume music? Retrieved July 4, 2025, from https://www.musicmarketingmonday.com/p/how-do-people-discover-and-consume-music

Loading the libraries we may use and the data set

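library(tidyverse)  # attaches dplyr, ggplot2, readr, tidyr, and related packages
library(tidyr)      # already attached by tidyverse
library(leaflet)

# Point R at the folder that holds the CSV file, then read it in
setwd("C:/Users/ubjho/Downloads")
songs <- read_csv("spotifysongs.csv")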

Looking at data

head(songs)

# A tibble: 6 × 18
  artist   song  duration_ms explicit  year popularity danceability energy   key
  <chr>    <chr>       <dbl> <lgl>    <dbl>      <dbl>        <dbl>  <dbl> <dbl>
1 Britney… Oops…      211160 FALSE     2000         77        0.751  0.834     1
2 blink-1… All …      167066 FALSE     1999         79        0.434  0.897     0
3 Faith H… Brea…      250546 FALSE     1999         66        0.529  0.496     7
4 Bon Jovi It's…      224493 FALSE     2000         78        0.551  0.913     0
5 *NSYNC   Bye …      200560 FALSE     2000         65        0.614  0.928     8
6 Sisqo    Thon…      253733 TRUE      1999         69        0.706  0.888     2
# ℹ 9 more variables: loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, genre <chr>

Converting data types, creating new variables, and filtering incomplete records

songs <- mutate(songs, explicit = as.factor(explicit))
songs <- mutate(songs, mode = as.factor(mode))
songs <- mutate(songs, year = as.integer(year))
songs <- mutate(songs, duration_min = duration_ms / 60000)
# Drop rows with missing popularity or danceability values
songs <- filter(songs, !is.na(popularity), !is.na(danceability))
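
The filter above is meant to drop incomplete records. As a minimal sketch on a made-up data frame (the `toy` values are hypothetical and are not rows from the Spotify file), this is one way to count missing values per column before filtering and to confirm how many complete rows remain:

```r
# Hypothetical toy data standing in for the real songs table
toy <- data.frame(
  song         = c("a", "b", "c", "d"),
  popularity   = c(77, NA, 66, 78),
  danceability = c(0.751, 0.434, NA, 0.551)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the rows where both columns are present
toy_complete <- toy[!is.na(toy$popularity) & !is.na(toy$danceability), ]
print(nrow(toy_complete))
```

Running the same `colSums(is.na(...))` check on the real `songs` table would show whether any rows were actually incomplete.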

Filtering for songs from 2010 onward and adding cleaned song and artist names for duplicate checking

songs_2010 <- filter(songs, year >= 2010)

songs_2010 <- mutate(
  songs_2010,
  song_clean = tolower(trimws(song)),
  artist_clean = tolower(trimws(artist))
)

head(songs_2010)

# A tibble: 6 × 21
  artist   song  duration_ms explicit  year popularity danceability energy   key
  <chr>    <chr>       <dbl> <fct>    <int>      <dbl>        <dbl>  <dbl> <dbl>
1 Gigi D'… L'Am…      238759 FALSE     2011          1        0.617  0.728     7
2 Chicane  Don'…      210786 FALSE     2016         47        0.644  0.72     10
3 Samanth… Gott…      201946 FALSE     2018         43        0.729  0.632     0
4 DJ Ötzi  Hey …      219240 FALSE     2010         58        0.666  0.968    10
5 Mariah … Agai…      199480 FALSE     2011          0        0.471  0.514     1
6 Faithle… We C…      222435 FALSE     2015         53        0.645  0.903     5
# ℹ 12 more variables: loudness <dbl>, mode <fct>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, genre <chr>, duration_min <dbl>, song_clean <chr>,
#   artist_clean <chr>

Checking the top 10 songs by popularity

songs_sorted <- arrange(songs_2010, desc(popularity))
head(songs_sorted, 10)

# A tibble: 10 × 21
   artist  song  duration_ms explicit  year popularity danceability energy   key
   <chr>   <chr>       <dbl> <fct>    <int>      <dbl>        <dbl>  <dbl> <dbl>
 1 The Ne… Swea…      240400 FALSE     2013         89        0.612  0.807    10
 2 Tom Od… Anot…      244360 TRUE      2013         88        0.445  0.537     4
 3 WILLOW  Wait…      196520 FALSE     2015         86        0.764  0.705     3
 4 Billie… love…      200185 FALSE     2018         86        0.351  0.296     4
 5 Billie… love…      200185 FALSE     2018         86        0.351  0.296     4
 6 Bruno … Lock…      233478 FALSE     2012         85        0.726  0.698     5
 7 Bruno … Lock…      233478 FALSE     2012         85        0.726  0.698     5
 8 The Ne… Dadd…      260173 FALSE     2015         85        0.588  0.521    10
 9 Avicii  The …      176658 FALSE     2014         85        0.527  0.835     6
10 Ed She… Perf…      263400 FALSE     2017         85        0.599  0.448     8
# ℹ 12 more variables: loudness <dbl>, mode <fct>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, genre <chr>, duration_min <dbl>, song_clean <chr>,
#   artist_clean <chr>

Checking if “Locked out of Heaven” is a duplicate

locked_out_songs <- filter(songs_2010, song == "Locked out of Heaven" & artist == "Bruno Mars")

# Check the number of times the song appears
nrow(locked_out_songs)

[1] 2

Confirmed: the song appears more than once

Remove potential duplicates

songs_2010 <- distinct(songs_2010, song_clean, artist_clean, .keep_all = TRUE)
songs_2010 <- select(songs_2010, -song_clean, -artist_clean)
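
As a rough base R sketch of the same idea (not the code used in this project), duplicates can be counted on a cleaned artist-plus-song key before anything is removed. The toy rows are assumptions for illustration: the second Bruno Mars row is a made-up variant that differs only in case and spacing, and "Example Artist" is hypothetical.

```r
# Toy rows; only the first row's spelling comes from this report
toy <- data.frame(
  artist = c("Bruno Mars", "bruno mars ", "Example Artist"),
  song   = c("Locked out of Heaven", "Locked Out Of Heaven", "Example Song"),
  stringsAsFactors = FALSE
)

# Build a cleaned key so case and stray spaces do not hide duplicates
key <- paste(tolower(trimws(toy$artist)), tolower(trimws(toy$song)), sep = "||")

# duplicated() flags every repeat after the first occurrence
n_dupes <- sum(duplicated(key))
print(n_dupes)

# Keep the first occurrence of each key
toy_unique <- toy[!duplicated(key), ]
print(toy_unique)
```

Counting first makes it easy to report how many rows the cleaning step removed.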

Check that duplicates were removed

songs_sorted <- arrange(songs_2010, desc(popularity))
head(songs_sorted, 10)

# A tibble: 10 × 19
   artist  song  duration_ms explicit  year popularity danceability energy   key
   <chr>   <chr>       <dbl> <fct>    <int>      <dbl>        <dbl>  <dbl> <dbl>
 1 The Ne… Swea…      240400 FALSE     2013         89        0.612  0.807    10
 2 Tom Od… Anot…      244360 TRUE      2013         88        0.445  0.537     4
 3 WILLOW  Wait…      196520 FALSE     2015         86        0.764  0.705     3
 4 Billie… love…      200185 FALSE     2018         86        0.351  0.296     4
 5 Bruno … Lock…      233478 FALSE     2012         85        0.726  0.698     5
 6 The Ne… Dadd…      260173 FALSE     2015         85        0.588  0.521    10
 7 Avicii  The …      176658 FALSE     2014         85        0.527  0.835     6
 8 Ed She… Perf…      263400 FALSE     2017         85        0.599  0.448     8
 9 Post M… Circ…      215280 FALSE     2019         85        0.695  0.762     0
10 Arctic… Why'…      161123 FALSE     2013         84        0.691  0.631     2
# ℹ 10 more variables: loudness <dbl>, mode <fct>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, genre <chr>, duration_min <dbl>

locked_out_songs <- filter(songs_2010, song == "Locked out of Heaven" & artist == "Bruno Mars")

nrow(locked_out_songs)

[1] 1

Confirmed duplicates were removed

Finding the top 5 artists by number of songs

top_artists <- songs_2010 |>
  group_by(artist) |>
  summarize(song_count = n()) |>
  arrange(desc(song_count)) |>
  slice_head(n = 5)

head(top_artists)

# A tibble: 5 × 2
  artist        song_count
  <chr>              <int>
1 Drake                 20
2 Calvin Harris         18
3 David Guetta          18
4 Ariana Grande         13
5 Katy Perry            13

I will start by focusing only on Katy Perry

Converting popularity to numeric and ordering her songs by popularity

katy_songs <- songs_2010 |>
  filter(tolower(artist) == "katy perry")

katy_songs <- katy_songs |>
  mutate(popularity = as.numeric(popularity)) |>
  arrange(desc(popularity))

Creating a graph that compares only Katy Perry's songs with each other

ggplot(katy_songs, aes(x = reorder(song, popularity), y = popularity, fill = factor(year))) +
  geom_col() +
  scale_fill_manual(values = c(
    "2010" = "violetred3", "2012" = "orchid", "2013" = "brown3", "2017" = "gold"
  )) +
  labs(
    title = "Katy Perry Songs (2010+) by Popularity",
    x = "Song Title",
    y = "Popularity",
    fill = "Release Year",
    caption = "Data Source: Spotify Web"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(face = "bold", size = 14),
    axis.text.y = element_text(size = 10)
  )

katy_songs <- filter(songs_2010, artist == "Katy Perry") |>
  select(song, year, popularity, danceability, energy, loudness, tempo) |>
  arrange(desc(popularity))

katy_songs

# A tibble: 13 × 7
   song                       year popularity danceability energy loudness tempo
   <chr>                     <int>      <dbl>        <dbl>  <dbl>    <dbl> <dbl>
 1 Last Friday Night (T.G.I…  2012         74        0.649  0.815    -3.80 126. 
 2 Dark Horse                 2013         74        0.647  0.585    -6.12 132. 
 3 Part Of Me                 2012         73        0.678  0.918    -4.63 130. 
 4 Roar                       2013         73        0.554  0.772    -4.82 180. 
 5 California Gurls           2012         72        0.791  0.754    -3.73 125. 
 6 Firework                   2010         72        0.638  0.832    -5.04 124. 
 7 The One That Got Away      2012         72        0.687  0.792    -4.02 134. 
 8 Teenage Dream              2010         69        0.719  0.798    -4.58 120. 
 9 Chained To The Rhythm      2017         69        0.562  0.8      -5.40  95.0
10 E.T.                       2012         65        0.62   0.869    -5.25 152. 
11 Wide Awake                 2012         65        0.514  0.683    -5.10 160. 
12 This Is How We Do          2013         60        0.69   0.636    -6.03  96  
13 Unconditionally            2013          0        0.555  0.729    -4.81 129.

The bar graph above compares Katy Perry’s songs from 2010 onward with each other. Each song title in the data set is on the x-axis, and the popularity rating is on the y-axis. Each bar is filled with a color that corresponds to the song’s release year as recorded in the data set. I wanted to compare only Katy Perry songs for my first graph because, growing up, I remember always having her songs stuck in my head. I would always sing along to everything from “Teenage Dream” to “Last Friday Night (T.G.I.F.)”, so it is no surprise to me that they rank among her popular songs: “Last Friday Night (T.G.I.F.)” ties with “Dark Horse” for her top score of 74, and “Teenage Dream” scores 69. The interesting thing I saw was that her song “Unconditionally” did not show up as a bar at all. Initially I assumed that I had made a mistake in the graph, but I then confirmed that the song has a popularity score of 0 in the data set, which shocked me. It is the only one of her 13 songs in this period with a score of 0.
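
To see how much the single score of 0 affects a summary of her songs, here is a small sketch that uses only the popularity values printed in the table above. Excluding the 0 is just one possible choice for comparison, not a claim that the score is wrong.

```r
# Popularity scores copied from the Katy Perry table above
katy_popularity <- c(74, 74, 73, 73, 72, 72, 72, 69, 69, 65, 65, 60, 0)

mean_all     <- mean(katy_popularity)                       # includes the 0 score
mean_nonzero <- mean(katy_popularity[katy_popularity > 0])  # leaves the 0 score out
median_all   <- median(katy_popularity)                     # less sensitive to one outlier

print(round(c(mean_all = mean_all,
              mean_nonzero = mean_nonzero,
              median_all = median_all), 2))
# mean_all = 64.46, mean_nonzero = 69.83, median_all = 72
```

The one outlier lowers the mean by about five points, while the median stays at 72.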

Katy Perry Songs Correlation Analysis

katy_songs <- filter(songs_2010, artist == "Katy Perry")

katy_songs_numeric <- select(katy_songs, popularity, danceability, energy, loudness, tempo)

library(DataExplorer)

plot_correlation(katy_songs_numeric)
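
The heatmap shows the correlations as colors. To read the coefficients as numbers, a minimal sketch with base R's `cor()` is given here. It assumes the rounded values printed in the Katy Perry table earlier in this report (tempo in particular is heavily rounded there), so the results will differ slightly from those based on the full data. Because there are only 13 songs and one has a score of 0, the Spearman rank correlation is included as a comparison that is less sensitive to a single extreme value.

```r
# Values copied from the printed Katy Perry table (rounded as displayed)
katy <- data.frame(
  popularity   = c(74, 74, 73, 73, 72, 72, 72, 69, 69, 65, 65, 60, 0),
  danceability = c(0.649, 0.647, 0.678, 0.554, 0.791, 0.638, 0.687,
                   0.719, 0.562, 0.62, 0.514, 0.69, 0.555),
  energy       = c(0.815, 0.585, 0.918, 0.772, 0.754, 0.832, 0.792,
                   0.798, 0.8, 0.869, 0.683, 0.636, 0.729),
  loudness     = c(-3.80, -6.12, -4.63, -4.82, -3.73, -5.04, -4.02,
                   -4.58, -5.40, -5.25, -5.10, -6.03, -4.81),
  tempo        = c(126, 132, 130, 180, 125, 124, 134, 120, 95, 152, 160, 96, 129)
)

# Pearson (the default) measures linear association
pearson <- cor(katy)

# Spearman works on ranks, so one extreme score has less pull
spearman <- cor(katy, method = "spearman")

print(round(pearson, 2))
print(round(spearman, 2))
```

With so few songs, any of these coefficients should be read as a rough description of this sample rather than a general rule.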

For my third graph, I will plot an interactive scatter plot for the top 5 artists in my data set

Filtering and mutating songs for those artists

library(plotly)

top5_artists <- c("Drake", "Calvin Harris", "David Guetta", "Ariana Grande", "Katy Perry")

top5_allsongs <- filter(songs_2010, artist %in% top5_artists)

top5_allsongs <- mutate(
  top5_allsongs,
  # Map popularity (0-100) to a marker size between 5 and 40
  # (help from ChatGPT; I was having issues with sizing on my graph)
  popularity_scaled = (popularity / 100) * 35 + 5
)
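
The scaling formula is a linear map from the 0-100 popularity range onto marker sizes from 5 to 40. As a small worked sketch, the helper `scale_popularity()` below is a hypothetical name introduced only for illustration; its default arguments reproduce the `35` and `5` used above.

```r
# Hypothetical helper: linearly map a 0-100 popularity score to a marker size
scale_popularity <- function(popularity, min_size = 5, max_size = 40) {
  (popularity / 100) * (max_size - min_size) + min_size
}

# A score of 0 gets the minimum size and a score of 100 gets the maximum
sizes <- scale_popularity(c(0, 50, 89, 100))
print(sizes)
# 5.00 22.50 36.15 40.00
```

Because the map is linear, a song with popularity 89 gets a marker only about 1.6 times as wide as a song with popularity 50, so differences among the most popular songs are subtle on the plot.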

Setting up the graph to display

scatter_plot <- plot_ly(
  data = top5_allsongs,
  x = ~energy,
  y = ~danceability,
  color = ~artist,
  colors = c("pink2", "darkturquoise", "gold", "forestgreen", "orchid"),
  type = "scatter",
  mode = "markers",
  text = ~paste(
    "Song:", song,
    "<br>Artist:", artist,
    "<br>Popularity:", popularity,
    "<br>Year:", year,
    "<br>Tempo:", tempo
  ),
  marker = list(
    size = ~popularity_scaled,
    sizemode = "diameter",
    opacity = 0.7,
    sizemin = 1
  )
)

Editing the labels and titles on the graph

# I didn't think adding the "Artist" legend title would change the pop-up color,
# but I'm very glad I did, and it didn't break anything
scatter_plot <- layout(
  scatter_plot,
  title = "Energy vs Danceability (Point Size by Popularity)",
  xaxis = list(title = "Energy"),
  yaxis = list(title = "Danceability"),
  legend = list(title = list(text = "Artist"))
)

Showing the graph created

scatter_plot

This interactive scatter plot shows how energy and danceability relate to song popularity for each of the top five artists in the data set: Drake, Calvin Harris, David Guetta, Ariana Grande, and Katy Perry. Each point is a song: its position reflects energy and danceability, its color shows the artist, and its size shows how popular it is. With this plot I noticed that Calvin Harris and David Guetta tend to release more high-energy, danceable hits, while Katy Perry and Ariana Grande also lean upbeat but with more variety. For Drake, I saw that songs with less energy tend to have higher danceability, while his high-energy songs tend to have lower danceability, which I found pretty interesting. Overall, the most popular songs appear to cluster in the high-energy, high-danceability corner, which suggests these traits may be associated with mainstream success, although a plot alone cannot show that they cause it. Hovering over each point reveals more about the song, making it easy to explore trends by artist.
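
One way to put a number on the clustering described above is to compare mean popularity inside and outside a "high energy, high danceability" corner. The sketch below is only an illustration: it uses the Katy Perry values printed earlier in this report rather than all five artists, and the cut-offs of 0.75 for energy and 0.6 for danceability are hypothetical choices.

```r
# Values copied from the printed Katy Perry table (rounded as displayed)
katy <- data.frame(
  popularity   = c(74, 74, 73, 73, 72, 72, 72, 69, 69, 65, 65, 60, 0),
  danceability = c(0.649, 0.647, 0.678, 0.554, 0.791, 0.638, 0.687,
                   0.719, 0.562, 0.62, 0.514, 0.69, 0.555),
  energy       = c(0.815, 0.585, 0.918, 0.772, 0.754, 0.832, 0.792,
                   0.798, 0.8, 0.869, 0.683, 0.636, 0.729)
)

# Hypothetical cut-offs for what counts as "high"
high_corner <- katy$energy >= 0.75 & katy$danceability >= 0.6

# Mean popularity for songs outside (FALSE) and inside (TRUE) the corner
mean_by_group <- tapply(katy$popularity, high_corner, mean)
print(table(high_corner))
print(round(mean_by_group, 2))
```

The same comparison could be run on `top5_allsongs` to check whether the pattern holds across all five artists, and different cut-offs may change the result.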

Conclusion

The visualizations offered helpful insights into how audio features like energy and danceability relate to popularity, especially across different artists. One noticeable pattern was that the most popular songs tended to be high in both energy and danceability, particularly among artists like Calvin Harris and David Guetta. A surprising finding was that some commercially released songs, such as Katy Perry’s “Unconditionally”, received very low popularity scores on Spotify even though they were widely promoted at the time of release. I attempted to scale point sizes by popularity in the interactive scatter plot, but the sizing only partially worked and did not display as clearly as intended. With more time, I would have liked to improve that visual and include additional features, such as genre or playlist placement, to better understand what drives a song’s success beyond its audio characteristics.