Spotify

What will I be doing, and why did I choose this data set?

I chose the “spotifysongs.csv” data set from the class Google Drive; it was scraped directly from Spotify. I am going to explore the top 5 artists for one particular year, 2019. My plan is to identify each artist’s top song, plot how popular those songs were that year, and also look at how danceable they were. The variables I will focus on are artist, song, year, danceability, and popularity. Popularity is a quantitative variable.

To clean this data set, I first filter for the year 2019. Next, I find the top 5 artists: I use the group_by, summarise, and arrange functions to group each artist’s songs together and add up their popularity points.

I chose this data set because I love listening to songs without being interrupted, and Spotify is the best platform I have used (and I am still using it). I picked 2019 because of COVID; I just wanted to see how things went for these artists in that particular year. The variables in this data set are as follows: artist name, song release year, genre, tempo, danceability, popularity, valence, whether a song is explicit or not, energy, acousticness, key, and some more.

What is Spotify?

Spotify is a Swedish audio streaming and media service provider founded on 23 April 2006 by Daniel Ek and Martin Lorentzon. It is one of the largest music streaming service providers, with over 602 million monthly active users, including 236 million paying subscribers, as of December 2023. This data set includes popular songs spanning from 1998 to 2020, drawn from a variety of genres and artists.

Let’s load the libraries that we need to achieve this!!!

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tinytex)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(dplyr)

Setting the working directory and calling the data set.

setwd("/Users/janithrithilakasiri/Downloads")
Spotify <- read_csv("spotifysongs.csv")
## Rows: 2000 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): artist, song, genre
## dbl (14): duration_ms, year, popularity, danceability, energy, key, loudness...
## lgl  (1): explicit
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Let’s choose the year 2019…

spotifyy_2019 <- Spotify %>%
  filter(year == 2019)
spotifyy_2019
## # A tibble: 89 × 18
##    artist  song  duration_ms explicit  year popularity danceability energy   key
##    <chr>   <chr>       <dbl> <lgl>    <dbl>      <dbl>        <dbl>  <dbl> <dbl>
##  1 K-Ci &… Crazy      262773 FALSE     2019         30        0.68   0.644     0
##  2 Angie … If I…      244466 FALSE     2019         40        0.583  0.643     9
##  3 Aaliyah Rock…      275026 FALSE     2019          0        0.641  0.72      5
##  4 Libert… Just…      237359 FALSE     2019         43        0.786  0.614     5
##  5 Lil' K… Magi…      359973 TRUE      2019         47        0.849  0.498     2
##  6 Hinder  Lips…      261053 FALSE     2019         35        0.474  0.744     2
##  7 Hinder  Bett…      223533 FALSE     2019         30        0.451  0.682     2
##  8 Chris … Beau…      225881 FALSE     2019         53        0.415  0.775     5
##  9 Hayden… NUMB       217296 TRUE      2019         47        0.617  0.558    10
## 10 Nicky … X          172854 FALSE     2019         74        0.594  0.749     9
## # ℹ 79 more rows
## # ℹ 9 more variables: loudness <dbl>, mode <dbl>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, genre <chr>

Now we are going to find the top 5 artists of 2019.

top_5 <- spotifyy_2019 %>%
  group_by(artist) %>%
  summarise(total_popularity = sum(popularity)) %>%
  arrange(desc(total_popularity)) %>%
  top_n(5)
## Selecting by total_popularity
top_5
## # A tibble: 5 × 2
##   artist        total_popularity
##   <chr>                    <dbl>
## 1 Post Malone                244
## 2 Ariana Grande              236
## 3 Lil Nas X                  226
## 4 Ed Sheeran                 193
## 5 Billie Eilish              158
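
As a side note, `top_n()` has been superseded in dplyr since version 1.0.0, and `slice_max()` is the suggested replacement. Below is a minimal sketch of the same "sum popularity per artist, keep the largest totals" idea on a small made-up table. The data frame `toy` and its values are hypothetical and are not taken from spotifysongs.csv.

```r
library(dplyr)

# Hypothetical toy data (NOT from spotifysongs.csv), just to show the pattern
toy <- data.frame(
  artist = c("A", "A", "B", "C", "C", "D"),
  popularity = c(80, 70, 90, 60, 50, 40),
  stringsAsFactors = FALSE
)

top_2 <- toy %>%
  group_by(artist) %>%
  summarise(total_popularity = sum(popularity)) %>%  # one row per artist
  slice_max(total_popularity, n = 2)                 # keep the 2 largest totals

top_2
```

Unlike `top_n()`, `slice_max()` names the ranking column explicitly, so there is no "Selecting by ..." message, and it returns the rows already sorted from largest to smallest.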

Let’s find the top songs of these artists…

top_songs_by_top5 <- spotifyy_2019 %>%
  filter(artist %in% top_5$artist) %>%
  group_by(artist, song, danceability) %>%
  summarise(total_popularity = sum(popularity)) %>%
  arrange(artist, desc(total_popularity)) %>%
  group_by(artist) %>%
  top_n(1)
## `summarise()` has grouped output by 'artist', 'song'. You can override using
## the `.groups` argument.
## Selecting by total_popularity
top_songs_by_top5
## # A tibble: 5 × 4
## # Groups:   artist [5]
##   artist        song                               danceability total_popularity
##   <chr>         <chr>                                     <dbl>            <dbl>
## 1 Ariana Grande 7 rings                                   0.778               83
## 2 Billie Eilish bad guy                                   0.701               83
## 3 Ed Sheeran    Take Me Back to London (feat. Sto…        0.885               66
## 4 Lil Nas X     Old Town Road - Remix                     0.878               79
## 5 Post Malone   Circles                                   0.695               85
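
The "top song per artist" step can also be sketched more compactly with a grouped `slice_max()`. This is only an illustration on made-up data, and it assumes each song appears once per artist, so no summing is needed. The names `toy_songs` and `top_song_per_artist` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data (NOT from spotifysongs.csv)
toy_songs <- data.frame(
  artist = c("A", "A", "B", "B"),
  song = c("a1", "a2", "b1", "b2"),
  danceability = c(0.5, 0.7, 0.6, 0.8),
  popularity = c(80, 70, 60, 90),
  stringsAsFactors = FALSE
)

top_song_per_artist <- toy_songs %>%
  group_by(artist) %>%
  slice_max(popularity, n = 1) %>%  # most popular song within each artist
  ungroup()                         # drop the grouping so later steps are not grouped

top_song_per_artist
```

Ending with `ungroup()` returns a plain, ungrouped table, which avoids surprises if the result is summarised again later.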
artist_colors <- c("Post Malone" = "hotpink",   
                    "Ariana Grande" = "orange", 
                    "Lil Nas X" = "green",
                    "Ed Sheeran" =  "red",
                    "Billie Eilish" = "purple"  )


ggplot(top_songs_by_top5, aes(x = total_popularity, y = song, color = artist, size = danceability)) +
  geom_point(alpha = 0.7, position = "jitter") +
  labs(title = "Top Song of Each of the Top 5 Artists in 2019",
       x = "Popularity Rating",
       y = "Song",
       size= "Danceability",
       color = "Artist",
       caption= "Source: Spotify") + 
  scale_color_manual(values = artist_colors) +
  theme_gray()

Summary

The visualization above shows the top 5 artists of 2019 and each artist’s top song, listed here in order of the artist’s total popularity. In 1st place is Post Malone, whose “Circles” was the most popular of the five songs (85) but had the lowest danceability (0.695). In 2nd place is Ariana Grande with “7 rings” (popularity 83), and its danceability was fine (0.778). In 3rd place is Lil Nas X with “Old Town Road - Remix” (popularity 79); its danceability was more than fine (0.878), as we all know. In 4th place is Ed Sheeran with “Take Me Back to London” (popularity 66), which had the highest danceability of the 5 songs (0.885). Last is Billie Eilish with “bad guy” (popularity 83, danceability 0.701). So the most popular songs were not the most danceable ones: the song with the highest popularity had the lowest danceability of the five. At first I tried to plot the top 10 songs, but that did not work as well as I had hoped because the plot points overlapped. Other than that, I had so much fun with this project.
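
One rough way to check how popularity and danceability relate for these five songs is a correlation coefficient. The sketch below re-types the five values from the top_songs_by_top5 table above into plain vectors (the vector names are my own). With only five points this is a quick sanity check and not a real statistical conclusion.

```r
# Values copied from the top_songs_by_top5 output above, in table order:
# 7 rings, bad guy, Take Me Back to London, Old Town Road - Remix, Circles
popularity   <- c(83, 83, 66, 79, 85)
danceability <- c(0.778, 0.701, 0.885, 0.878, 0.695)

# Pearson correlation between the two variables
r <- cor(popularity, danceability)
r
```

A negative value of `r` would mean that, among these five songs, higher popularity tends to go with lower danceability, which matches what the plot suggests.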