Project 2
Final Project: Spotify Song Popularity Influences
This project explores the relationship between various audio features and the popularity of songs on Spotify from 2010 onward. The data set includes quantitative variables such as popularity, danceability, energy, loudness, tempo, and duration, as well as categorical variables such as artist name and song name. The data comes from a collection of popular Spotify songs from 1998 to 2020. Unfortunately, no ReadMe file or detailed documentation is provided, so the exact methodology behind the data collection and feature calculation is not publicly available. This analysis assumes the data reflects Spotify’s internal audio analysis and popularity measures. The data was cleaned by removing duplicates, standardizing names, converting data types, and filtering incomplete records. The focus is on tracks by the five artists with the most songs in the data set: Katy Perry, Drake, Calvin Harris, Ariana Grande, and David Guetta.
I chose this topic because of a personal interest in music and a desire to explore how measurable audio traits relate to success. Analyzing well-known artists makes the study more relatable and engaging to me, and the Spotify data provides insight into trends in popular music.
Sources:
Music Marketing Monday. (2023). How do people discover and consume music? Retrieved July 4, 2025, from https://www.musicmarketingmonday.com/p/how-do-people-discover-and-consume-music
Loading the libraries we may use and the data set
library(tidyverse)
library(tidyr)
library(leaflet)
setwd("C:/Users/ubjho/Downloads")
songs <- read_csv("spotifysongs.csv")
Looking at the data
head(songs)
# A tibble: 6 × 18
artist song duration_ms explicit year popularity danceability energy key
<chr> <chr> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Britney… Oops… 211160 FALSE 2000 77 0.751 0.834 1
2 blink-1… All … 167066 FALSE 1999 79 0.434 0.897 0
3 Faith H… Brea… 250546 FALSE 1999 66 0.529 0.496 7
4 Bon Jovi It's… 224493 FALSE 2000 78 0.551 0.913 0
5 *NSYNC Bye … 200560 FALSE 2000 65 0.614 0.928 8
6 Sisqo Thon… 253733 TRUE 1999 69 0.706 0.888 2
# ℹ 9 more variables: loudness <dbl>, mode <dbl>, speechiness <dbl>,
# acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
# tempo <dbl>, genre <chr>
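Converting data types and creating new variables
# Treat explicit and mode as categories and year as a whole number
songs <- mutate(songs, explicit = as.factor(explicit))
songs <- mutate(songs, mode = as.factor(mode))
songs <- mutate(songs, year = as.integer(year))
# Track length in minutes instead of milliseconds
songs <- mutate(songs, duration_min = duration_ms / 60000)
# Drop rows with missing popularity or danceability
songs <- filter(songs, !is.na(popularity), !is.na(danceability))
Filtering for songs from 2010 onward and standardizing song and artist names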
songs_2010 <- filter(songs, year >= 2010)
songs_2010 <- mutate(
songs_2010,
song_clean = tolower(trimws(song)),
artist_clean = tolower(trimws(artist))
)
head(songs_2010)
# A tibble: 6 × 21
artist song duration_ms explicit year popularity danceability energy key
<chr> <chr> <dbl> <fct> <int> <dbl> <dbl> <dbl> <dbl>
1 Gigi D'… L'Am… 238759 FALSE 2011 1 0.617 0.728 7
2 Chicane Don'… 210786 FALSE 2016 47 0.644 0.72 10
3 Samanth… Gott… 201946 FALSE 2018 43 0.729 0.632 0
4 DJ Ötzi Hey … 219240 FALSE 2010 58 0.666 0.968 10
5 Mariah … Agai… 199480 FALSE 2011 0 0.471 0.514 1
6 Faithle… We C… 222435 FALSE 2015 53 0.645 0.903 5
# ℹ 12 more variables: loudness <dbl>, mode <fct>, speechiness <dbl>,
# acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
# tempo <dbl>, genre <chr>, duration_min <dbl>, song_clean <chr>,
# artist_clean <chr>
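To see why the cleaned name columns help, here is a minimal sketch on a made-up tibble. The rows in `toy` are hypothetical, not taken from the data set. The sketch assumes that duplicate entries may differ only in letter case or surrounding whitespace, which is what `tolower(trimws())` normalizes.

```r
library(dplyr)

# Hypothetical rows: the same song typed three slightly different ways
toy <- tibble(
  artist = c("Bruno Mars", "bruno mars ", " Bruno Mars"),
  song   = c("Locked out of Heaven", "locked out of heaven", "Locked Out Of Heaven ")
)

# Same cleaning step as applied to songs_2010
toy_clean <- mutate(
  toy,
  song_clean   = tolower(trimws(song)),
  artist_clean = tolower(trimws(artist))
)

# Number of distinct artist/song pairs before and after cleaning
n_raw   <- nrow(distinct(toy, artist, song))
n_clean <- nrow(distinct(toy_clean, artist_clean, song_clean))
```

Before cleaning, the three spellings count as three different songs; after cleaning they collapse into one pair, which is what lets `distinct()` catch them later.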
Checking the 10 most popular songs
songs_sorted <- arrange(songs_2010, desc(popularity))
head(songs_sorted, 10)
# A tibble: 10 × 21
artist song duration_ms explicit year popularity danceability energy key
<chr> <chr> <dbl> <fct> <int> <dbl> <dbl> <dbl> <dbl>
1 The Ne… Swea… 240400 FALSE 2013 89 0.612 0.807 10
2 Tom Od… Anot… 244360 TRUE 2013 88 0.445 0.537 4
3 WILLOW Wait… 196520 FALSE 2015 86 0.764 0.705 3
4 Billie… love… 200185 FALSE 2018 86 0.351 0.296 4
5 Billie… love… 200185 FALSE 2018 86 0.351 0.296 4
6 Bruno … Lock… 233478 FALSE 2012 85 0.726 0.698 5
7 Bruno … Lock… 233478 FALSE 2012 85 0.726 0.698 5
8 The Ne… Dadd… 260173 FALSE 2015 85 0.588 0.521 10
9 Avicii The … 176658 FALSE 2014 85 0.527 0.835 6
10 Ed She… Perf… 263400 FALSE 2017 85 0.599 0.448 8
# ℹ 12 more variables: loudness <dbl>, mode <fct>, speechiness <dbl>,
# acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
# tempo <dbl>, genre <chr>, duration_min <dbl>, song_clean <chr>,
# artist_clean <chr>
Checking if “Locked out of Heaven” is a duplicate
locked_out_songs <- filter(songs_2010, song == "Locked out of Heaven" & artist == "Bruno Mars")
# Check the number of times the song appears
nrow(locked_out_songs)
[1] 2
Confirmed: the song appears more than once
Removing duplicates
# Keep the first row for each cleaned song/artist pair, then drop the helper columns
songs_2010 <- distinct(songs_2010, song_clean, artist_clean, .keep_all = TRUE)
songs_2010 <- select(songs_2010, -song_clean, -artist_clean)
Checking that duplicates were removed
songs_sorted <- arrange(songs_2010, desc(popularity))
head(songs_sorted, 10)
# A tibble: 10 × 19
artist song duration_ms explicit year popularity danceability energy key
<chr> <chr> <dbl> <fct> <int> <dbl> <dbl> <dbl> <dbl>
1 The Ne… Swea… 240400 FALSE 2013 89 0.612 0.807 10
2 Tom Od… Anot… 244360 TRUE 2013 88 0.445 0.537 4
3 WILLOW Wait… 196520 FALSE 2015 86 0.764 0.705 3
4 Billie… love… 200185 FALSE 2018 86 0.351 0.296 4
5 Bruno … Lock… 233478 FALSE 2012 85 0.726 0.698 5
6 The Ne… Dadd… 260173 FALSE 2015 85 0.588 0.521 10
7 Avicii The … 176658 FALSE 2014 85 0.527 0.835 6
8 Ed She… Perf… 263400 FALSE 2017 85 0.599 0.448 8
9 Post M… Circ… 215280 FALSE 2019 85 0.695 0.762 0
10 Arctic… Why'… 161123 FALSE 2013 84 0.691 0.631 2
# ℹ 10 more variables: loudness <dbl>, mode <fct>, speechiness <dbl>,
# acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
# tempo <dbl>, genre <chr>, duration_min <dbl>
locked_out_songs <- filter(songs_2010, song == "Locked out of Heaven" & artist == "Bruno Mars")
nrow(locked_out_songs)
[1] 1
Confirmed duplicates were removed
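The check above looks at a single song. A minimal sketch of checking every artist/song pair at once follows. The tibble `toy` is a hypothetical stand-in for `songs_2010`, and the column name `n_rows` is my own choice for illustration.

```r
library(dplyr)

# Hypothetical stand-in for songs_2010 with one repeated song
toy <- tibble(
  artist     = c("Bruno Mars", "Bruno Mars", "Avicii"),
  song       = c("Locked out of Heaven", "Locked out of Heaven", "The Nights"),
  popularity = c(85, 85, 85)
)

# Count rows per artist/song pair and keep only pairs that occur more than once
dupes <- toy |>
  count(artist, song, name = "n_rows") |>
  filter(n_rows > 1)

# Remove the repeats, then run the same check again
deduped <- distinct(toy, artist, song, .keep_all = TRUE)
remaining <- deduped |>
  count(artist, song, name = "n_rows") |>
  filter(n_rows > 1)
```

If `remaining` has zero rows, no artist/song pair is repeated anywhere in the table, not just for the one song that was spot-checked.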
Finding the top 5 artists by number of songs
top_artists <- songs_2010 |>
group_by(artist) |>
summarize(song_count = n()) |>
arrange(desc(song_count)) |>
slice_head(n = 5)
head(top_artists)
# A tibble: 5 × 2
artist song_count
<chr> <int>
1 Drake 20
2 Calvin Harris 18
3 David Guetta 18
4 Ariana Grande 13
5 Katy Perry 13
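Two artists are tied at 13 songs at the cutoff. The output above does not show whether a sixth artist also has 13 songs. If one did, `slice_head(n = 5)` would drop that artist arbitrarily. A minimal sketch of the difference, using hypothetical counts in `toy_counts`:

```r
library(dplyr)

# Hypothetical counts with a tie at the cutoff
toy_counts <- tibble(
  artist     = c("A", "B", "C", "D"),
  song_count = c(20L, 18L, 13L, 13L)
)

# slice_head() keeps exactly n rows, so one of the tied artists is dropped
top3_head <- toy_counts |>
  arrange(desc(song_count)) |>
  slice_head(n = 3)

# slice_max() keeps every row tied with the n-th largest value
top3_max <- toy_counts |>
  slice_max(song_count, n = 3, with_ties = TRUE)
```

Here `top3_head` has three rows while `top3_max` has four, because artists C and D share the third-highest count.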
I am going to start by focusing only on Katy Perry
Converting popularity to numeric and ordering her songs by popularity
katy_songs <- songs_2010 |>
filter(tolower(artist) == "katy perry")
katy_songs <- katy_songs |>
mutate(popularity = as.numeric(popularity)) |>
arrange(desc(popularity))
Creating a graph of only Katy Perry songs to compare them with each other
ggplot(katy_songs, aes(x = reorder(song, popularity), y = popularity, fill = factor(year))) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("2010" = "violetred3", "2012" = "orchid", "2013" = "brown3", "2017" = "gold")) +
labs(
title = "Katy Perry Songs (2010+) by Popularity",
x = "Song Title",
y = "Popularity",
fill = "Release Year",
caption = "Data Source: Spotify Web"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(face = "bold", size = 14),
axis.text.y = element_text(size = 10)
)
katy_songs <- filter(songs_2010, artist == "Katy Perry") |>
select(song, year, popularity, danceability, energy, loudness, tempo) |>
arrange(desc(popularity))
katy_songs
# A tibble: 13 × 7
song year popularity danceability energy loudness tempo
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Last Friday Night (T.G.I… 2012 74 0.649 0.815 -3.80 126.
2 Dark Horse 2013 74 0.647 0.585 -6.12 132.
3 Part Of Me 2012 73 0.678 0.918 -4.63 130.
4 Roar 2013 73 0.554 0.772 -4.82 180.
5 California Gurls 2012 72 0.791 0.754 -3.73 125.
6 Firework 2010 72 0.638 0.832 -5.04 124.
7 The One That Got Away 2012 72 0.687 0.792 -4.02 134.
8 Teenage Dream 2010 69 0.719 0.798 -4.58 120.
9 Chained To The Rhythm 2017 69 0.562 0.8 -5.40 95.0
10 E.T. 2012 65 0.62 0.869 -5.25 152.
11 Wide Awake 2012 65 0.514 0.683 -5.10 160.
12 This Is How We Do 2013 60 0.69 0.636 -6.03 96
13 Unconditionally 2013 0 0.555 0.729 -4.81 129.
The graph above compares Katy Perry’s songs from 2010 onward with one another. The titles of her songs in the data set are on the x-axis, and the popularity rating is on the y-axis. Each bar is filled with a color that corresponds to the song’s release year. I wanted my first graph to compare only Katy Perry songs because, growing up, I always had her songs stuck in my head. I would sing along to everything from “Teenage Dream” to “Last Friday Night (T.G.I.F.)”, so it is no surprise to me that they are among her popular songs here, with “Last Friday Night (T.G.I.F.)” tied for the top score of 74. The interesting thing I noticed was that her song “Unconditionally” did not show up as a bar at all. At first I assumed I had made a mistake in the graph, but I then confirmed that it has a popularity score of 0, which shocked me. This 2013 release is the only one of her songs in the data set with a popularity of 0.
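A score of 0 can be confirmed directly rather than by eye. The sketch below uses three rows copied from the table above. The data set has no documentation, so it cannot tell us whether a 0 reflects real listening behavior or a gap in the data. The comparison of means is only meant to show how much a single 0 can pull a summary down.

```r
library(dplyr)

# Three of the 13 rows, copied from the katy_songs output above
toy <- tibble(
  song       = c("Dark Horse", "This Is How We Do", "Unconditionally"),
  popularity = c(74, 60, 0)
)

# List every song with a popularity of exactly 0
zero_pop <- filter(toy, popularity == 0)

# Mean popularity with and without the zero-scored song
mean_all     <- mean(toy$popularity)
mean_nonzero <- mean(toy$popularity[toy$popularity > 0])
```

Running the same `filter()` on `katy_songs` would list any zero-scored songs among all 13 rows.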
Katy Perry Songs Correlation Analysis
katy_songs <- filter(songs_2010, artist == "Katy Perry")
katy_songs_numeric <- select(katy_songs, popularity, danceability, energy, loudness, tempo)
library(DataExplorer)
plot_correlation(katy_songs_numeric)
For my third graph, I will plot an interactive scatter plot for the top 5 artists in my data set
Filtering and mutating songs for those artists
library(plotly)
top5_artists <- c("Drake", "Calvin Harris", "David Guetta", "Ariana Grande", "Katy Perry")
top5_allsongs <- filter(songs_2010, artist %in% top5_artists)
top5_allsongs <- mutate(
  top5_allsongs,
  popularity_scaled = (popularity / 100) * 35 + 5 # Help from ChatGPT; I was having issues with marker sizing on my graph
)
Setting up the graph to display
scatter_plot <- plot_ly(
data = top5_allsongs,
x = ~energy,
y = ~danceability,
color = ~artist,
colors = c("pink2", "darkturquoise", "gold", "forestgreen", "orchid"),
type = "scatter",
mode = "markers",
text = ~paste(
"Song:", song,
"<br>Artist:", artist,
"<br>Popularity:", popularity,
"<br>Year:", year,
"<br>Tempo:", tempo
),
marker = list(
size = ~popularity_scaled,
sizemode = "diameter",
opacity = 0.7,
sizemin = 1
)
)
Editing the labels and titles on the graph
scatter_plot <- layout(
scatter_plot,
title = "Energy vs Danceability (Point Size by Popularity)",
xaxis = list(title = "Energy"),
yaxis = list(title = "Danceability"),
legend = list(title = list(text = "Artist"))) # I didn't think adding the Artist legend title would change the little pop-up color, but I'm very glad I did, and it didn't break
Showing the graph created
scatter_plot
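The marker sizes in this plot come from the `popularity_scaled` formula, which maps popularity from the 0 to 100 range onto diameters from 5 to 40. A minimal sketch of that arithmetic, where `scale_popularity` is a hypothetical helper name introduced only for this check:

```r
# Same formula as popularity_scaled; the function name is hypothetical
scale_popularity <- function(popularity) {
  (popularity / 100) * 35 + 5
}

# Popularity 0, a midpoint, Katy Perry's top score (74), and the maximum
sizes <- scale_popularity(c(0, 50, 74, 100))
```

Because of the `+ 5` offset, a song with a popularity of 0 still gets a marker of diameter 5 and stays visible on the plot instead of vanishing.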