I’m using a data-set from Spotify that lists songs available on Spotify along with their stats and information, including genre, artist, duration, release year, energy, tempo, and popularity. The variables I want to focus on are popularity and genre. In other words, which genre is the most popular?
## Retrieving Data-set
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 2000 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): artist, song, genre
dbl (14): duration_ms, year, popularity, danceability, energy, key, loudness...
lgl (1): explicit
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
I want to see what songs were the most popular and focus on 100 of them.
spotifypopular <- spotify |>
  arrange(desc(popularity)) |> # rearranges data to start with the song with the highest popularity value, then the next
  head(100)
spotifypopular
Now I want to see which genre had the most songs in the top 100.
spotifygenre <- spotifypopular |>
  group_by(genre) |> # grouped all the same genres together
  summarize(number_rows = n()) # then displayed how many rows (songs) each genre had
spotifygenre
# A tibble: 20 × 2
genre number_rows
<chr> <int>
1 Dance/Electronic 1
2 Folk/Acoustic, pop 1
3 hip hop 19
4 hip hop, Dance/Electronic 4
5 hip hop, pop 10
6 hip hop, pop, Dance/Electronic 2
7 hip hop, pop, R&B 5
8 hip hop, pop, latin 3
9 latin 1
10 metal 1
11 pop 19
12 pop, Dance/Electronic 10
13 pop, R&B 2
14 pop, R&B, Dance/Electronic 1
15 pop, latin 1
16 pop, rock, Dance/Electronic 1
17 rock 8
18 rock, metal 3
19 rock, pop 7
20 rock, pop, Dance/Electronic 1
To see if there’s any distinction between popularity in genres, I want to look at the bottom 100 songs.
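A minimal sketch of how the bottom 100 could be pulled out, mirroring the top-100 step but sorting in ascending order. The `toy` tibble and its values are made up for illustration and stand in for the real `spotify` data; on the full data-set `head(2)` would be `head(100)`.

```r
library(dplyr)

# Hypothetical stand-in for the real `spotify` tibble (values are invented).
toy <- tibble(
  song       = c("A", "B", "C", "D", "E"),
  genre      = c("pop", "rock", "pop", "hip hop", "rock"),
  popularity = c(80, 5, 0, 62, 33)
)

# Sort ascending so the least popular songs come first, then keep the first n.
toy_bottom <- toy |>
  arrange(popularity) |>
  head(2) # would be head(100) on the full data

# Count how many of those songs fall under each genre label.
toy_bottom_genre <- toy_bottom |>
  count(genre, name = "number_rows")
toy_bottom_genre
```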
It was at this point that I realized the least popular songs included a lot of the same genres as the most popular ones. I decided to see how many distinct genre labels there are, just so I could gauge how feasible it would be to include them all in my visualization.
# A tibble: 59 × 2
genre number_rows
<chr> <int>
1 Dance/Electronic 41
2 Folk/Acoustic, pop 2
3 Folk/Acoustic, rock 1
4 Folk/Acoustic, rock, pop 1
5 R&B 13
6 World/Traditional, Folk/Acoustic 1
7 World/Traditional, hip hop 2
8 World/Traditional, pop 1
9 World/Traditional, pop, Folk/Acoustic 2
10 World/Traditional, rock 2
# ℹ 49 more rows
There are 59 distinct genre labels in total. This is far too many. I also noticed that a good chunk of songs have “set()” as their genre. This poses another problem: if I exclude these songs, my visualization would not be accurate to the data-set. Should I include these 22 “set()” songs? If I do, how should I incorporate them? I decided to center the visualization on songs that are categorized under only one genre. This brings the 59 labels down to 9, or 10 if “set()” is included. In the end, I removed “set()” and “easy listening” from the genres because of their vague nature, which leaves me with 8 genres.
targetgenre <- c('pop', 'hip hop', 'rock', 'Dance/Electronic', 'latin', 'R&B', 'country', 'metal')
spotify2 <- spotify %>%
  filter(genre %in% targetgenre) # narrowed down data set to focus on genres I cared about
spotify2
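One way to count the distinct genre labels and keep only single-genre songs, as a sketch. It assumes multi-genre labels are always comma-separated, as in the output above; the `toy` tibble is hypothetical and not part of the real data.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the real data (labels follow the format seen above).
toy <- tibble(
  song  = c("A", "B", "C", "D", "E", "F"),
  genre = c("pop", "pop, rock", "set()", "hip hop", "rock", "hip hop, pop")
)

# Number of distinct genre labels, counting each combination separately.
n_labels <- n_distinct(toy$genre)

# Keep songs whose label has no comma (one genre only) and is not "set()".
single_genre <- toy |>
  filter(!str_detect(genre, ","), genre != "set()")

single_genre |>
  count(genre, sort = TRUE)
```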
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(plotly) # provides ggplotly()
library(RColorBrewer)
spotviz <- spotify2 |>
  ggplot(aes(x = popularity, y = genre, color = genre,
             text = paste("Song: ", song, "\n", "Artist: ", artist, "\n", "Released in ", year, "\n", "Popularity in %: ", popularity, "\n", sep = ""))) +
  geom_point(size = 3) +
  theme(axis.title.y = element_text(margin = margin(r = 30))) +
  theme(legend.position = "none") + # got rid of redundant legend
  labs(caption = "Source: Spotify", y = "Music Genres", x = "Popularity in Percent", title = "Songs' Popularity by Genre on Spotify")
# prints a copy of the plot with the Brewer palette and darker backgrounds
spotviz +
  scale_color_brewer(palette = "Paired") +
  theme(panel.background = element_rect(fill = "dimgrey"), plot.background = element_rect(fill = "darkgrey"))
ggplotly(spotviz, tooltip = "text")
The chunk generated 2 separate scatterplots, one with the RColorBrewer palette and another with Plotly’s interactivity. I couldn’t figure out how to have the RColorBrewer palette show up on Plotly, so I decided to manually add a separate color palette.
spotviz2 <- spotify2 |>
  ggplot(aes(x = popularity, y = genre, color = genre,
             text = paste("Song: ", song, "\n", "Artist: ", artist, "\n", "Released in ", year, "\n", "Popularity: ", popularity, "%", "\n", sep = ""))) +
  geom_point(size = 3) +
  scale_color_manual(values = c("rock" = "#434279", "R&B" = "#58508d", "pop" = "#8a508f", "metal" = "#bc5090", "latin" = "#de5a79", "hip hop" = "#ff6361", "Dance/Electronic" = "#ff8531", "country" = "#ffa600")) +
  labs(caption = "Source: Spotify", y = "Music Genres ", x = "Popularity in Percent", title = "Songs' Popularity by Genre on Spotify") +
  theme_gray() +
  theme(legend.position = "none", text = element_text(family = "Times New Roman")) +
  xlim(0, 100)
ggplotly(spotviz2, tooltip = "text")
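A possible reason the Brewer palette never reached Plotly: in the earlier chunk, `scale_color_brewer()` was added to `spotviz` without assigning the result, so R printed a modified copy and `spotviz` itself kept the default colors. The sketch below shows the difference on a made-up data frame (`toy`, `base_plot`, and `paired_plot` are illustrative names, not from the original code).

```r
library(ggplot2)

# Hypothetical stand-in for `spotify2` (values are invented).
toy <- data.frame(
  genre      = c("pop", "rock", "metal"),
  popularity = c(80, 55, 30)
)

base_plot <- ggplot(toy, aes(x = popularity, y = genre, color = genre)) +
  geom_point(size = 3)

# `base_plot + scale_color_brewer(...)` on its own only prints a modified copy.
# Assigning the result keeps the palette on an object that can be reused.
paired_plot <- base_plot + scale_color_brewer(palette = "Paired")

# Compare the colors each plot object actually resolves to.
default_colors <- layer_data(base_plot)$colour
paired_colors  <- layer_data(paired_plot)$colour
```

Passing the assigned object to `ggplotly()`, for example `ggplotly(paired_plot)`, should then keep the palette, though this is untested against the full plot.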
Plotly rendered the plot differently from ggplot2, so I had to adjust some things in an awkward way. The y-axis label was way too close to the y-variable labels. I initially tried padding the label with spacing characters to increase the distance, as well as “theme(axis.title.y=element_text(vjust=-0.5))”, but neither worked. I eventually just included a bunch of ‘enters’ after my y-axis label. Plotly also got rid of my caption. I made a few more aesthetic changes, like changing the theme to “theme_gray()” and changing the font.
This visualization shows how many of the selected songs on Spotify fall under each of the selected genres, along with their respective popularity. The mouse-over interactivity displays each song’s title, artist, release year, and popularity. I was not surprised to see how many more pop and hip hop songs there were in comparison to other genres. I was, however, surprised at how few R&B songs there were and how Dance/Electronic songs were mostly concentrated below 75%. Even though there is an abundance of pop songs, they are pretty spread out across popularity. I wonder if the average popularity of each genre would be around the same value.
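The question about averages could be checked with a grouped summary. This is a sketch on made-up numbers (`toy` is hypothetical), so the values say nothing about the real data; on the real data `toy` would be replaced by `spotify2`.

```r
library(dplyr)

# Hypothetical stand-in for `spotify2` (values are invented).
toy <- tibble(
  genre      = c("pop", "pop", "pop", "rock", "rock", "R&B"),
  popularity = c(90, 40, 20, 70, 60, 66)
)

# One row per genre: how many songs it has and how popular they are on average.
genre_summary <- toy |>
  group_by(genre) |>
  summarize(
    n_songs           = n(),
    mean_popularity   = mean(popularity),
    median_popularity = median(popularity)
  ) |>
  arrange(desc(mean_popularity))
genre_summary
```

Reporting the median next to the mean is a design choice here: a genre with many songs spread widely across popularity can have a mean that hides that spread.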
I only have a few aesthetic regrets. The way the plot is centered is odd, and I wish I could figure out how to position the x-axis and y-axis labels better. I would’ve liked the genres to be capitalized on the plot. It also looks a bit plain, but it gets the job done.
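For the capitalization, one option that leaves the data untouched is to relabel the axis with `scale_y_discrete()`. A sketch, assuming the eight genres from `targetgenre`; `genre_labels` and `toy` are illustrative names, and how the labels carry through `ggplotly()` is not verified here.

```r
library(ggplot2)

# Display labels keyed by the genre values as they appear in the data.
genre_labels <- c(
  "pop" = "Pop", "hip hop" = "Hip Hop", "rock" = "Rock",
  "Dance/Electronic" = "Dance/Electronic", "latin" = "Latin",
  "R&B" = "R&B", "country" = "Country", "metal" = "Metal"
)

# Hypothetical stand-in for `spotify2` (values are invented).
toy <- data.frame(
  genre      = c("pop", "hip hop", "rock"),
  popularity = c(80, 60, 40)
)

# The named vector maps each axis break to its capitalized label.
labelled_plot <- ggplot(toy, aes(x = popularity, y = genre)) +
  geom_point(size = 3) +
  scale_y_discrete(labels = genre_labels)
```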