Analysis

This project is 2/3 in a continuation of a case study on the effects of various attributes on a song’s popularity. Part 1 focused on cleaning our data using SQL in Azure Data Studio. Part 3 is an Tableau dashboard visualization.

1. Installing/Loading Packages

Before we get started with our analysis, we’ll install the packages we’ll be using throughout.

install.packages("ggpubr")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("tidyverse")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("here")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("skimr")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("janitor")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("lubridate")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("ggrepel")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

library("ggpubr")

## Loading required package: ggplot2

library("tidyverse")

## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──

## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ✔ purrr   1.0.1      
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library("here")

## here() starts at /cloud/project

library("skimr")
library("janitor")

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library("lubridate")

## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library("ggrepel")

2.1 Importing Dataset

Now we’ll import our CSV dataset. Since we already cleaned and joined this data with SQL in an earlier project, we’ll only need to import 1 file.

music_info <- read_csv(file="/cloud/project/FullSpot.csv")

## Rows: 52812 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (9): album_id, track_id, art_id, track_name, artist_name, album_name, k...
## dbl (12): popularity, artist_popularity, duration_ms, danceability, tempo, e...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2.2 Verifying structure

We’ll run the head and str functions to get a quick summary of the data we’re working with. This will be a helpful reference point throughout our analysis.

head(music_info)

str(music_info)

## spc_tbl_ [52,812 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ album_id         : chr [1:52812] "7JZi4kTfYCJcqOzVbKAYh5" "16vClA3h0HIZSy6O0H5yyU" "4gPpWZjXUvbrIacogrTKrl" "1H9VCendIQLE3QFcKDwQ5Q" ...
##  $ track_id         : chr [1:52812] "43O1LuFW6QIJ20RSV8gC4g" "0QyZr2HXSVJgTci0DPfcB9" "0eCtJer01h42lZEKkAhQBK" "62ynwuxd3SXdbW7mzZaHuG" ...
##  $ art_id           : chr [1:52812] "23zg3TcAtWQy7J6upgbUnj" "2rBb2WwvpcIHNiofmw22vn" "4x1nvY2FN8jxqAFA0DA02H" "0qk9IE3icvkYLRii8qAko8" ...
##  $ track_name       : chr [1:52812] "You Decide" "Ainsi soit-il" "Lennon / McCartney: Come together" "El Barzón" ...
##  $ artist_name      : chr [1:52812] "Usher" "Louis Chedid" "John Lennon" "Amparo Ochoa" ...
##  $ album_name       : chr [1:52812] "\"A\"" "\"Ainsi Soit-Il\"" "\"Fiction\"" "\"Por Siempre\"" ...
##  $ popularity       : num [1:52812] 59 29 36 36 36 18 42 51 69 64 ...
##  $ artist_popularity: num [1:52812] 81 33 71 36 58 34 61 70 70 70 ...
##  $ key_let          : chr [1:52812] "C#Db" "F" "G" "A" ...
##  $ mode_let         : chr [1:52812] "Maj" "min" "Maj" "Maj" ...
##  $ duration_ms      : num [1:52812] 212920 190533 185676 193893 316680 ...
##  $ danceability     : num [1:52812] 0.747 0.7 0.522 0.532 0.389 0.654 0.59 0.598 0.702 0.553 ...
##  $ tempo            : num [1:52812] 139.9 124.6 174.1 200.2 72.3 ...
##  $ energy           : num [1:52812] 0.533 0.491 0.294 0.461 0.474 0.869 0.971 0.244 0.632 0.864 ...
##  $ followers        : num [1:52812] 6806588 4357 2329020 14462 151818 ...
##  $ liveness         : num [1:52812] 0.119 0.186 0.118 0.156 0.349 0.0784 0.276 0.106 0.127 0.138 ...
##  $ loudness         : num [1:52812] -6.33 -14.86 -9.24 -13.21 -7.99 ...
##  $ release_date     : num [1:52812] 2018 1981 2010 2006 2015 ...
##  $ speechiness      : num [1:52812] 0.0403 0.0278 0.0369 0.314 0.0306 0.0304 0.0452 0.0374 0.0247 0.0414 ...
##  $ valence          : num [1:52812] 0.38 0.56 0.541 0.617 0.172 0.739 0.725 0.225 0.457 0.437 ...
##  $ song_pop_seg     : chr [1:52812] "middle" "low" "low" "low" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   album_id = col_character(),
##   ..   track_id = col_character(),
##   ..   art_id = col_character(),
##   ..   track_name = col_character(),
##   ..   artist_name = col_character(),
##   ..   album_name = col_character(),
##   ..   popularity = col_double(),
##   ..   artist_popularity = col_double(),
##   ..   key_let = col_character(),
##   ..   mode_let = col_character(),
##   ..   duration_ms = col_double(),
##   ..   danceability = col_double(),
##   ..   tempo = col_double(),
##   ..   energy = col_double(),
##   ..   followers = col_double(),
##   ..   liveness = col_double(),
##   ..   loudness = col_double(),
##   ..   release_date = col_double(),
##   ..   speechiness = col_double(),
##   ..   valence = col_double(),
##   ..   song_pop_seg = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

2.3 Further verification

To get an even better idea of what we’re working with, let’s get a count of unique artist and track IDs.

n_unique(music_info$art_id)

## [1] 20183

n_unique(music_info$track_id)

## [1] 52812

A total of 20,183 individual artists and 52,812 individual tracks are included in our cleaned dataset.

3.1 Segmenting by Popularity

We’ll now segment tracks out into 5 different groups based on their popularity - “very low”, “low”, “middle”, “high”, and “very high”. These segments will come in handy when determining popularity distribution.

average_pop <- music_info %>%
  group_by(track_id) %>%
  summarise (mean_song_pop = mean(popularity))

head(average_pop)

pop_segs <- average_pop %>%
    mutate(pop_segs = case_when(
    mean_song_pop < 21 ~ "very low",
    mean_song_pop >= 21 & mean_song_pop < 41 ~ "low", 
    mean_song_pop >= 41 & mean_song_pop < 61 ~ "middle", 
    mean_song_pop >= 61 & mean_song_pop <81 ~ "high",
    mean_song_pop >= 81 ~ "very high"
  ))

head(pop_segs)

3.2 Determining segment distribution

Let’s take a look at the percentages of each popularity segment to get a better understanding of our sample distribution.

pop_seg_perc <- pop_segs %>%
  group_by(pop_segs) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(pop_segs) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

pop_seg_perc$pop_segs <- factor(pop_seg_perc$pop_segs , levels = c("very high", "high", "middle", "low", "very low"))


head(pop_seg_perc)

From the numbers alone, it’s easy to see that the most songs fall in the “middle” or “low” popularity segments while the “very high” segment is only 0.3% of the population.

Visualizing this information should paint a clearer picture as to how few songs fall into the “very high” popularity segment.

pop_seg_perc %>%
  ggplot(aes(x="",y=total_percent, fill=pop_segs)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
  scale_fill_brewer(palette = "Pastel1") +
  geom_label_repel(data = pop_seg_perc,
                   aes(y = 0.5, x = 0.5, label = labels),
                   size = 4, nudge_x = 0.5, nudge_y = 0.5, show.legend = FALSE) +
  labs(title="Track Popularty Distribution")

Clearly, releasing a popular song is not easy!

4.1 Identifying Correlations

Let’s take a look at what type of correlations might exist between a song’s popularity and various other attributes such as tempo, danceability, valence, and others. We can spot these correlations by visualizing them as scatterplots and letting R map out the smooth trend lines for us.

ggarrange(
ggplot(music_info, aes(x=duration_ms, y=popularity))+
  geom_jitter() +
  geom_smooth(color = "red") + 
  labs(title = "Song Length", x = "Duration (ms)", y= "Track Popularity") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14)), 
ggplot(music_info, aes(x=danceability, y=popularity))+
  geom_jitter() +
  geom_smooth(color = "red") + 
  labs(title = "Danceability", x = "Danceability", y= "Track Popularity") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14)), 
ggplot(music_info, aes(x=energy, y=popularity))+
  geom_jitter() +
  geom_smooth(color = "red") + 
  labs(title = "Energy", x = "Energy", y= "Track Popularity") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14)), 
ggplot(music_info, aes(x=liveness, y=popularity))+
  geom_jitter() +
  geom_smooth(color = "red") + 
  labs(title = "Liveness", x = "Liveness", y= "Track Popularity") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14)), 
ggplot(music_info, aes(x=loudness, y=popularity))+
  geom_jitter() +
  geom_smooth(color = "red") + 
  labs(title = "Loudness", x = "Loudness", y= "Track Popularity") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14)), 
ggplot(music_info, aes(x=speechiness, y=popularity))+
  geom_jitter() +
  geom_smooth(color = "red") + 
  labs(title = "Speechiness", x = "Speechiness", y= "Track Popularity") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14)), 
ggplot(music_info, aes(x=valence, y=popularity))+
  geom_jitter() +
  geom_smooth(color = "red") + 
  labs(title = "Valence", x = "Valence", y= "Track Popularity") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14)), 
ggplot(music_info, aes(x=tempo, y=popularity))+
  geom_jitter() +
  geom_smooth(color = "red") + 
  labs(title = "Tempo (BPM)", x = "Tempo (BPM)", y= "Track Popularity") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14)), 
ggplot(music_info, aes(x=artist_popularity, y=popularity))+
  geom_jitter() +
  geom_smooth(color = "red") + 
  labs(title = "Artist Popularity", x = "Artist Popularity", y= "Track Popularity") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14))
)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

A few points stand out:

The most glaring and perhaps only real correlation we see is between artist popularity and track popularity. The more popular the artist, the more popular the track.
Track popularity starts to fall off after a certain point if a track is too “speechy”
There’s a “sweet spot” for track length but popularity does appear to decrease the longer a track goes on

Spotify Data Analysis

Sam Sherman

January 23, 2023