Spotify Visualization

Author

Phoebe Lam

Source: https://www.amazon.com/Spotify-Music/dp/B00KLBR6IC

I’m using a data-set from Spotify that lists songs available on the platform along with their stats and other information, including genre, artist, duration, release year, energy, tempo, and popularity. The variables I want to focus on are popularity and genre. In other words: which genre is the most popular?

## Retrieving Data-set

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd("C:/Users/Phoeb/Downloads/Data 110/DatasetsData110")
spotify <- read_csv("spotifysongs.csv")

Rows: 2000 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): artist, song, genre
dbl (14): duration_ms, year, popularity, danceability, energy, key, loudness...
lgl  (1): explicit

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(spotify)

# A tibble: 6 × 18
  artist   song  duration_ms explicit  year popularity danceability energy   key
  <chr>    <chr>       <dbl> <lgl>    <dbl>      <dbl>        <dbl>  <dbl> <dbl>
1 Britney… Oops…      211160 FALSE     2000         77        0.751  0.834     1
2 blink-1… All …      167066 FALSE     1999         79        0.434  0.897     0
3 Faith H… Brea…      250546 FALSE     1999         66        0.529  0.496     7
4 Bon Jovi It's…      224493 FALSE     2000         78        0.551  0.913     0
5 *NSYNC   Bye …      200560 FALSE     2000         65        0.614  0.928     8
6 Sisqo    Thon…      253733 TRUE      1999         69        0.706  0.888     2
# ℹ 9 more variables: loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, genre <chr>

I want to see what songs were the most popular and focus on 100 of them.

spotifypopular <- spotify |>
  arrange(desc(popularity)) |> #rearranges data to start with the song with the highest popularity value then the next
  head(100)
spotifypopular

# A tibble: 100 × 18
   artist  song  duration_ms explicit  year popularity danceability energy   key
   <chr>   <chr>       <dbl> <lgl>    <dbl>      <dbl>        <dbl>  <dbl> <dbl>
 1 The Ne… Swea…      240400 FALSE     2013         89        0.612  0.807    10
 2 Tom Od… Anot…      244360 TRUE      2013         88        0.445  0.537     4
 3 Eminem  With…      290320 TRUE      2002         87        0.908  0.669     7
 4 Eminem  The …      284200 TRUE      2000         86        0.949  0.661     5
 5 WILLOW  Wait…      196520 FALSE     2015         86        0.764  0.705     3
 6 Billie… love…      200185 FALSE     2018         86        0.351  0.296     4
 7 Billie… love…      200185 FALSE     2018         86        0.351  0.296     4
 8 Eminem  'Til…      297786 TRUE      2002         85        0.548  0.847     1
 9 Bruno … Lock…      233478 FALSE     2012         85        0.726  0.698     5
10 Bruno … Lock…      233478 FALSE     2012         85        0.726  0.698     5
# ℹ 90 more rows
# ℹ 9 more variables: loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, genre <chr>
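The output above shows some songs twice (rows 6 and 7, and rows 9 and 10), so `head(100)` may return fewer than 100 unique songs. Below is a minimal sketch of one way to handle that, assuming the repeated rows are identical in every column. The `toy` tibble is made-up stand-in data, not the real `spotify` table, and `top_songs` is a hypothetical name.

```r
library(dplyr)

# Toy stand-in for the spotify tibble (made-up rows, not the real data)
toy <- tibble::tibble(
  artist     = c("A", "A", "B", "C", "D"),
  song       = c("One", "One", "Two", "Three", "Four"),
  popularity = c(90, 90, 85, 85, 70)
)

top_songs <- toy |>
  distinct() |>                                    # drop rows identical in every column
  slice_max(popularity, n = 3, with_ties = FALSE)  # keep exactly the 3 most popular rows

top_songs
```

On the real data the same idea would use `n = 100`. Setting `with_ties = FALSE` keeps the row count fixed even when several songs share the cutoff popularity value.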

Now I want to see which genre had the most songs in the top 100.

spotifygenre <- spotifypopular |>
  group_by(genre) |> #grouped all the same genres together
  summarize(number_rows = n()) #then displayed how many rows(songs) each genre had
spotifygenre

# A tibble: 20 × 2
   genre                          number_rows
   <chr>                                <int>
 1 Dance/Electronic                         1
 2 Folk/Acoustic, pop                       1
 3 hip hop                                 19
 4 hip hop, Dance/Electronic                4
 5 hip hop, pop                            10
 6 hip hop, pop, Dance/Electronic           2
 7 hip hop, pop, R&B                        5
 8 hip hop, pop, latin                      3
 9 latin                                    1
10 metal                                    1
11 pop                                     19
12 pop, Dance/Electronic                   10
13 pop, R&B                                 2
14 pop, R&B, Dance/Electronic               1
15 pop, latin                               1
16 pop, rock, Dance/Electronic              1
17 rock                                     8
18 rock, metal                              3
19 rock, pop                                7
20 rock, pop, Dance/Electronic              1
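Many rows in the table above combine several genres in one string, such as "hip hop, pop", so a single genre's total is spread over several rows. The sketch below shows one possible way to count each genre separately, assuming the labels are always separated by a comma and a space. The `toy` tibble is made-up stand-in data and `genre_counts` is a hypothetical name.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the top-100 table (made-up rows, not the real data)
toy <- tibble::tibble(
  song  = c("s1", "s2", "s3", "s4"),
  genre = c("pop", "hip hop, pop", "rock, pop", "hip hop")
)

# Shorthand for group_by(genre) |> summarize(n = n()), sorted by count
toy |> count(genre, sort = TRUE)

# Split combined labels into one row per genre, then count each genre
genre_counts <- toy |>
  separate_rows(genre, sep = ", ") |>
  count(genre, sort = TRUE, name = "number_rows")

genre_counts
```

With this approach a song tagged with two genres is counted once under each, so the counts no longer add up to the number of songs.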

To see whether genre actually distinguishes popular songs from unpopular ones, I want to look at the bottom 100 songs.

spotifyunpop <- spotify |>
  arrange(popularity) |>
  head(100)
spotifyunpop

# A tibble: 100 × 18
   artist  song  duration_ms explicit  year popularity danceability energy   key
   <chr>   <chr>       <dbl> <lgl>    <dbl>      <dbl>        <dbl>  <dbl> <dbl>
 1 Oasis   Go L…      278666 FALSE     2000          0        0.408  0.849     2
 2 Mariah… Agai…      199480 FALSE     2011          0        0.471  0.514     1
 3 Jennif… Ain'…      246160 FALSE     2001          0        0.707  0.869     5
 4 DB Bou… Poin…      231166 FALSE     2018          0        0.676  0.715     6
 5 Musiq … Love       304666 FALSE     2000          0        0.569  0.385     1
 6 Baseme… Romeo      217493 FALSE     2001          0        0.713  0.829     2
 7 Aaliyah Rock…      275026 FALSE     2019          0        0.641  0.72      5
 8 Electr… Dang…      214600 FALSE     2003          0        0.66   0.698    11
 9 Baseme… Good…      282306 FALSE     2003          0        0.571  0.968     5
10 Mariah… It's…      203360 FALSE     2005          0        0.8    0.633     8
# ℹ 90 more rows
# ℹ 9 more variables: loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, genre <chr>

At this point I realized that the least popular songs included many of the same genres as the most popular ones. I decided to count how many distinct values the genre variable takes, to gauge how feasible it is to include them all in my visualization.

spotifygenre2 <- spotify |>
  group_by(genre) |>
  summarize(number_rows = n()) 
spotifygenre2

# A tibble: 59 × 2
   genre                                 number_rows
   <chr>                                       <int>
 1 Dance/Electronic                               41
 2 Folk/Acoustic, pop                              2
 3 Folk/Acoustic, rock                             1
 4 Folk/Acoustic, rock, pop                        1
 5 R&B                                            13
 6 World/Traditional, Folk/Acoustic                1
 7 World/Traditional, hip hop                      2
 8 World/Traditional, pop                          1
 9 World/Traditional, pop, Folk/Acoustic           2
10 World/Traditional, rock                         2
# ℹ 49 more rows

There are 59 distinct genre values in total, which is far too many. I also noticed that 22 songs have the genre “set()”. This poses another problem: if I exclude these songs, my visualization will not fully reflect the data-set. Should I include them, and if so, how? I decided to center the visualization on songs that are categorized under only one genre. This brings the 59 distinct values down to 9, or 10 if “set()” is included. In the end, I removed “set()” and “easy listening” from the genres because of their vague nature, which leaves 8 genres.

targetgenre <- c('pop', 'hip hop', 'rock', 'Dance/Electronic', 'latin', 'R&B', 'country', 'metal')
spotify2 <- spotify |> filter(genre %in% targetgenre) #narrowed down the data set to the genres I care about
spotify2

# A tibble: 698 × 18
   artist  song  duration_ms explicit  year popularity danceability energy   key
   <chr>   <chr>       <dbl> <lgl>    <dbl>      <dbl>        <dbl>  <dbl> <dbl>
 1 Britne… Oops…      211160 FALSE     2000         77        0.751  0.834     1
 2 *NSYNC  Bye …      200560 FALSE     2000         65        0.614  0.928     8
 3 Eminem  The …      284200 TRUE      2000         86        0.949  0.661     5
 4 Modjo   Lady…      307153 FALSE     2001         77        0.72   0.808     6
 5 Gigi D… L'Am…      238759 FALSE     2011          1        0.617  0.728     7
 6 Eiffel… Move…      268863 FALSE     1999         56        0.745  0.958     7
 7 Bomfun… Free…      306333 FALSE     2000         55        0.822  0.922    11
 8 Anasta… I'm …      245400 FALSE     1999         64        0.761  0.716    10
 9 Alice … Bett…      214883 FALSE     2000         73        0.671  0.88      8
10 Gigi D… The …      285426 FALSE     1999         64        0.74   0.876     6
# ℹ 688 more rows
# ℹ 9 more variables: loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, genre <chr>
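The list of single-genre labels does not have to be typed out by hand. As a rough sketch, assuming a comma always marks a multi-genre label, the labels can be derived from the data and the two vague ones dropped afterwards. The `toy` tibble is made-up stand-in data, and `single_genres` and `targetgenre_toy` are hypothetical names.

```r
library(dplyr)
library(stringr)

# Toy stand-in for the genre column (made-up rows, not the real data)
toy <- tibble::tibble(
  genre = c("pop", "hip hop, pop", "set()", "rock", "pop", "easy listening")
)

single_genres <- toy |>
  filter(!str_detect(genre, ",")) |>  # keep labels that name only one genre
  distinct(genre) |>                  # one row per distinct label
  pull(genre)                         # return a plain character vector

single_genres

# Drop the two vague labels, mirroring the choice described in the text
targetgenre_toy <- setdiff(single_genres, c("set()", "easy listening"))
targetgenre_toy
```

Deriving the vector this way avoids typos in the genre names, which would otherwise silently drop rows in the `%in%` filter.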

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(RColorBrewer)
spotviz <- spotify2 |>
  ggplot(aes(x=popularity, y=genre, color=genre, text = paste("Song: ", song, "\n", "Artist: ", artist, "\n", "Released in ", year, "\n", "Popularity in %: ", popularity, "\n", sep = ""))) + #each tooltip field goes on its own line
  geom_point(size=3) +
  theme(axis.title.y = element_text(margin = margin(r = 30))) +
  theme(legend.position = "none") + #got rid of redundant legend
  labs(caption = "Source: Spotify", y = "Music Genres", x = "Popularity in Percent", title = "Songs' Popularity by Genre on Spotify") 
spotviz + scale_color_brewer(palette = "Paired") +
  theme(panel.background = element_rect(fill = "dimgrey"), plot.background = element_rect(fill = "darkgrey"))

ggplotly(spotviz, tooltip = "text")

The chunk generated two separate scatterplots, one with the RColorBrewer palette and another with plotly’s interactivity. I couldn’t figure out how to have the RColorBrewer palette show up in plotly, so I decided to add a separate color palette manually.
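One likely reason the palette did not carry over: in the chunk above, `spotviz + scale_color_brewer(...)` is printed but never assigned, so the `spotviz` object passed to `ggplotly()` does not contain the palette. The sketch below shows the general pattern of saving the scale into the plot object before converting it. It uses a made-up `toy` data frame instead of `spotify2`, and `base_plot`, `brewer_plot`, and `brewer_interactive` are hypothetical names.

```r
library(ggplot2)
library(plotly)

# Toy stand-in for spotify2 (made-up rows, not the real data)
toy <- data.frame(
  song       = c("s1", "s2", "s3"),
  popularity = c(80, 60, 40),
  genre      = c("pop", "rock", "metal")
)

base_plot <- ggplot(toy, aes(x = popularity, y = genre, color = genre,
                             text = paste0("Song: ", song))) +
  geom_point(size = 3)

# Store the palette in the plot object instead of only printing it
brewer_plot <- base_plot + scale_color_brewer(palette = "Paired")

# Convert the version of the plot that already carries the palette
brewer_interactive <- ggplotly(brewer_plot, tooltip = "text")
brewer_interactive
```

Whether every theme setting survives the conversion is a separate question, so the result is worth checking visually.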

spotviz2 <- spotify2 |>
  ggplot(aes(x=popularity, y=genre, color=genre, text = paste("Song: ", song, "\n", "Artist: ", artist, "\n", "Released in ", year,"\n", "Popularity: ", popularity, "%", "\n", sep = ""))) +
  geom_point(size=3) +
  scale_color_manual(values = c("rock" = "#434279", "R&B" = "#58508d", "pop" = "#8a508f", "metal" = "#bc5090", "latin" = "#de5a79", "hip hop" = "#ff6361", "Dance/Electronic" = "#ff8531", "country" = "#ffa600")) + 
  labs(caption = "Source: Spotify",y = "Music Genres
       
       
       
       ", x = "Popularity in Percent", title = "Songs' Popularity by Genre on Spotify") +
  theme_gray() +
  theme(legend.position = "none", text=element_text(family="Times New Roman")) +
  xlim(0,100)
ggplotly(spotviz2, tooltip = "text")

Plotly rendered the plot somewhat differently, so I had to make a few awkward adjustments. The y-axis label was far too close to the genre labels on the y-axis. I initially tried to use “” “ or”” to increase the distance, as well as “theme(axis.title.y=element_text(vjust=-0.5))”, but neither worked. I eventually just added several line breaks after my y-axis label. Plotly also dropped my caption. I made a few more aesthetic changes, such as switching the theme to “theme_gray()” and changing the font.

This visualization shows how the selected songs on Spotify are distributed across the selected genres, along with each song’s popularity. The mouse-over interactivity displays each song’s title, artist, release year, and popularity. I was not surprised to see how many more pop and hip hop songs there were than songs in other genres. I was, however, surprised at how few R&B songs there were and at how Dance/Electronic songs were mostly concentrated below 75%. Even though there is an abundance of pop songs, they are fairly spread out across the popularity range. I wonder whether the average popularity of each genre would come out to around the same value.
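The question about average popularity per genre could be checked with a grouped summary. Here is a minimal sketch on a made-up `toy` tibble (not the real data); `genre_summary` and its column names are hypothetical. On the real data, `spotify2` would take the place of `toy`.

```r
library(dplyr)

# Toy stand-in for spotify2 (made-up rows, not the real data)
toy <- tibble::tibble(
  genre      = c("pop", "pop", "rock", "rock", "metal"),
  popularity = c(80, 40, 70, 50, 65)
)

genre_summary <- toy |>
  group_by(genre) |>
  summarize(
    songs             = n(),                # number of songs in the genre
    mean_popularity   = mean(popularity),   # average popularity
    median_popularity = median(popularity)  # less sensitive to songs scored 0
  ) |>
  arrange(desc(mean_popularity))

genre_summary
```

Reporting the median next to the mean is a design choice: the data-set contains songs with a popularity of 0, and those can pull a genre's mean down noticeably.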

I only have a few aesthetic regrets. The plot is centered oddly, and I wish I could figure out how to position the x-axis and y-axis labels better. I would have liked the genres to be capitalized on the plot. It also looks a bit plain, but it gets the job done.
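For the capitalization, one possible approach is to leave the data untouched and relabel the axis with a lookup vector passed to `scale_y_discrete()`. The sketch below uses a made-up `toy` data frame; `genre_labels` and `labelled_plot` are hypothetical names, and the display labels are only an example.

```r
library(ggplot2)

# Toy stand-in for spotify2 (made-up rows, not the real data)
toy <- data.frame(
  popularity = c(80, 60, 40),
  genre      = c("pop", "hip hop", "R&B")
)

# Hypothetical lookup: names are the values in the data, values are the labels to show
genre_labels <- c("pop" = "Pop", "hip hop" = "Hip Hop", "R&B" = "R&B")

labelled_plot <- ggplot(toy, aes(x = popularity, y = genre, color = genre)) +
  geom_point(size = 3) +
  scale_y_discrete(labels = genre_labels) +  # relabel the axis without changing the data
  theme(legend.position = "none")

labelled_plot
```

A lookup vector is used here rather than a blanket function such as `stringr::str_to_title()`, because that function would turn "R&B" into "R&b".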