This data set contains a comprehensive list of the most famous songs of 2023 as listed on Spotify. The data set offers a wealth of features beyond what is typically available in similar data sets. It provides insights into each song’s attributes, popularity, and presence on various music platforms. For my analysis I thought it would be interesting to see what songs and artists get the most streams, if the songs most frequently in playlist differ across platforms. Lastly, it might be fun to see if there is a relationship between the amount a song gets streamed and the dancability score.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(forcats)
url <- "https://raw.githubusercontent.com/D-hartog/DATA607/main/PROJECT2/spotify_untidy.csv"

spotify <- read_csv(url)

DATA CLEANING

# Rename playlist and chart columns
spotify <- spotify %>% rename("spotify_playlists" = "in_spotify_playlists",
                   "spotify_charts" = "in_spotify_charts",
                   "apple_playlists" = "in_apple_playlists",
                   "apple_charts" = "in_apple_charts",
                   "deezer_playlists" = "in_deezer_playlists",
                   "deezer_charts" =  "in_deezer_charts",
                   "shazam_charts" = "in_shazam_charts",
                   "artist" = "artist(s)_name")

spotify$streams <- as.numeric(spotify$streams)
## Warning: NAs introduced by coercion

DATA TIDYING/TRANSFORMATION

  1. I took the in_playlist and in_chart column and pivoted to create two new columns called playlist and chart, that listed the music app containing the playlist or chart and the number of times the song appears in each one.
spotify <- spotify %>% pivot_longer(
    cols = c("spotify_playlists", "apple_playlists", "deezer_playlists"),
    names_to = "playlists",
    values_to = "playlist_count"
  )

spotify <- spotify %>% pivot_longer(
  cols = c("spotify_charts", "apple_charts", "deezer_charts","shazam_charts"),
  names_to = "charts",
  values_to = "charts_count"
)

write.csv(spotify,file='/Users/dirkhartog/Desktop/CUNY_MSDS/DATA_607/PROJECT2/spotify/spotify_tidy.csv', row.names=FALSE)

DATA ANALYSIS/VISUALIZATIONS

  1. For my analysis I thought it would be interesting to see what songs and artists get the most streams, if the songs most frequently in playlist differ across platforms. Lastly, it might be fun to see if there is a relationship between the amount a song gets streamed and the dancability score.
top10artists <- spotify %>% group_by(artist) %>%
  summarise(total_streams = round(sum(streams)/1000000),2) %>% 
  arrange(desc(total_streams)) %>%
  head(10)

ggplot(data = top10artists, aes(x = fct_reorder(artist, total_streams), 
                                y = total_streams, fill = artist)) +
  geom_col(color = "black") +
  xlab("Artist") +
  ylab("Total streams (millions)") +
  ggtitle("Top 10 Streaming Artists") +
    theme(axis.title.x = element_text(size=12),
          axis.text.x = element_text(size=8, angle = 90),
          axis.title.y = element_text(size=12),
          legend.position = "none",
          plot.title = element_text(color="black",
                                    size=20))

  1. Next I wanted to look at the top 10 streaming songs
top10tracks <- spotify %>% group_by(track_name) %>%
  summarise(total_streams = round(sum(streams)/1000000)) %>% 
  arrange(desc(total_streams)) %>%
  head(10)

ggplot(data = top10tracks, aes(x = fct_reorder(track_name, total_streams), 
                                y = total_streams, fill = track_name)) +
  geom_col(color = "black") +
  xlab("Track name") +
  ylab("Total streams (millions)") +
  ggtitle("Top 10 Streaming Tracks") +
  theme(axis.title.x = element_text(size=12),
        axis.text.x = element_text(size=8, angle = 90),
        axis.title.y = element_text(size=12),
        legend.position = "none",
        plot.title = element_text(color="black",
                                  size=20)) +
  scale_x_discrete(labels=c('Starboy', 'Closer', 'Believer', 'STAY',
                            "One Dance", "Sunflower", "Dance Monkey",
                            "Someone You Loved", "Shape of You", 
                            "Blinding Lights"))

  1. Relationship between danceability and streams
spotify <- spotify %>% rename(danceability = 'danceability_%')


ggplot(data = spotify, aes(x = danceability , y = streams)) +
  geom_point() +
  xlab("Danceability (percent)") +
  ylab("Total streams (millions)") +
  ggtitle("Relationship Between Danceability Score 
          and Streams") +
  theme(axis.title.x = element_text(size=12),
        axis.text.x = element_text(size=8, angle = 90),
        axis.title.y = element_text(size=12),
        legend.position = "none",
        plot.title = element_text(color="black",
                                  size=18))
## Warning: Removed 12 rows containing missing values (`geom_point()`).

Conclusions

There are a lot of other interesting analyses that could be done using this data set. According to this data set it we can see that The Weekend, Taylor Swift, and Ed Sheeran are the top three streaming artists and also have songs on the top 10 most streamed tracks.

When we are looking at relationships between danceability and streams it appears that there is a larger concentration of observations where danceability percentage is 70% or higher and are on the lower end of the streaming range.