Analysis

This project is 2/3 in a continuation of a case study on the effects of various attributes on a song’s popularity. Part 1 focused on cleaning our data using SQL in Azure Data Studio. Part 3 is an Tableau dashboard visualization.

1. Installing/Loading Packages

Before we get started with our analysis, we’ll install the packages we’ll be using throughout.

install.packages("ggpubr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("here")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("skimr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("lubridate")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("ggrepel")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library("ggpubr")
## Loading required package: ggplot2
library("tidyverse")
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ✔ purrr   1.0.1      
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library("here")
## here() starts at /cloud/project
library("skimr")
library("janitor")
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library("lubridate")
## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library("ggrepel")

2.1 Importing Dataset

Now we’ll import our CSV dataset. Since we already cleaned and joined this data with SQL in an earlier project, we’ll only need to import 1 file.

music_info <- read_csv(file="/cloud/project/FullSpot.csv")
## Rows: 52812 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (9): album_id, track_id, art_id, track_name, artist_name, album_name, k...
## dbl (12): popularity, artist_popularity, duration_ms, danceability, tempo, e...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2.2 Verifying structure

We’ll run the head and str functions to get a quick summary of the data we’re working with. This will be a helpful reference point throughout our analysis.

head(music_info)
str(music_info)
## spc_tbl_ [52,812 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ album_id         : chr [1:52812] "7JZi4kTfYCJcqOzVbKAYh5" "16vClA3h0HIZSy6O0H5yyU" "4gPpWZjXUvbrIacogrTKrl" "1H9VCendIQLE3QFcKDwQ5Q" ...
##  $ track_id         : chr [1:52812] "43O1LuFW6QIJ20RSV8gC4g" "0QyZr2HXSVJgTci0DPfcB9" "0eCtJer01h42lZEKkAhQBK" "62ynwuxd3SXdbW7mzZaHuG" ...
##  $ art_id           : chr [1:52812] "23zg3TcAtWQy7J6upgbUnj" "2rBb2WwvpcIHNiofmw22vn" "4x1nvY2FN8jxqAFA0DA02H" "0qk9IE3icvkYLRii8qAko8" ...
##  $ track_name       : chr [1:52812] "You Decide" "Ainsi soit-il" "Lennon / McCartney: Come together" "El Barzón" ...
##  $ artist_name      : chr [1:52812] "Usher" "Louis Chedid" "John Lennon" "Amparo Ochoa" ...
##  $ album_name       : chr [1:52812] "\"A\"" "\"Ainsi Soit-Il\"" "\"Fiction\"" "\"Por Siempre\"" ...
##  $ popularity       : num [1:52812] 59 29 36 36 36 18 42 51 69 64 ...
##  $ artist_popularity: num [1:52812] 81 33 71 36 58 34 61 70 70 70 ...
##  $ key_let          : chr [1:52812] "C#Db" "F" "G" "A" ...
##  $ mode_let         : chr [1:52812] "Maj" "min" "Maj" "Maj" ...
##  $ duration_ms      : num [1:52812] 212920 190533 185676 193893 316680 ...
##  $ danceability     : num [1:52812] 0.747 0.7 0.522 0.532 0.389 0.654 0.59 0.598 0.702 0.553 ...
##  $ tempo            : num [1:52812] 139.9 124.6 174.1 200.2 72.3 ...
##  $ energy           : num [1:52812] 0.533 0.491 0.294 0.461 0.474 0.869 0.971 0.244 0.632 0.864 ...
##  $ followers        : num [1:52812] 6806588 4357 2329020 14462 151818 ...
##  $ liveness         : num [1:52812] 0.119 0.186 0.118 0.156 0.349 0.0784 0.276 0.106 0.127 0.138 ...
##  $ loudness         : num [1:52812] -6.33 -14.86 -9.24 -13.21 -7.99 ...
##  $ release_date     : num [1:52812] 2018 1981 2010 2006 2015 ...
##  $ speechiness      : num [1:52812] 0.0403 0.0278 0.0369 0.314 0.0306 0.0304 0.0452 0.0374 0.0247 0.0414 ...
##  $ valence          : num [1:52812] 0.38 0.56 0.541 0.617 0.172 0.739 0.725 0.225 0.457 0.437 ...
##  $ song_pop_seg     : chr [1:52812] "middle" "low" "low" "low" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   album_id = col_character(),
##   ..   track_id = col_character(),
##   ..   art_id = col_character(),
##   ..   track_name = col_character(),
##   ..   artist_name = col_character(),
##   ..   album_name = col_character(),
##   ..   popularity = col_double(),
##   ..   artist_popularity = col_double(),
##   ..   key_let = col_character(),
##   ..   mode_let = col_character(),
##   ..   duration_ms = col_double(),
##   ..   danceability = col_double(),
##   ..   tempo = col_double(),
##   ..   energy = col_double(),
##   ..   followers = col_double(),
##   ..   liveness = col_double(),
##   ..   loudness = col_double(),
##   ..   release_date = col_double(),
##   ..   speechiness = col_double(),
##   ..   valence = col_double(),
##   ..   song_pop_seg = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

2.3 Further verification

To get an even better idea of what we’re working with, let’s get a count of unique artist and track IDs.

n_unique(music_info$art_id)
## [1] 20183
n_unique(music_info$track_id)
## [1] 52812

A total of 20,183 individual artists and 52,812 individual tracks are included in our cleaned dataset.

3.1 Segmenting by Popularity

We’ll now segment tracks out into 5 different groups based on their popularity - “very low”, “low”, “middle”, “high”, and “very high”. These segments will come in handy when determining popularity distribution.

average_pop <- music_info %>%
  group_by(track_id) %>%
  summarise (mean_song_pop = mean(popularity))

head(average_pop)
pop_segs <- average_pop %>%
    mutate(pop_segs = case_when(
    mean_song_pop < 21 ~ "very low",
    mean_song_pop >= 21 & mean_song_pop < 41 ~ "low", 
    mean_song_pop >= 41 & mean_song_pop < 61 ~ "middle", 
    mean_song_pop >= 61 & mean_song_pop <81 ~ "high",
    mean_song_pop >= 81 ~ "very high"
  ))

head(pop_segs)

3.2 Determining segment distribution

Let’s take a look at the percentages of each popularity segment to get a better understanding of our sample distribution.

pop_seg_perc <- pop_segs %>%
  group_by(pop_segs) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(pop_segs) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

pop_seg_perc$pop_segs <- factor(pop_seg_perc$pop_segs , levels = c("very high", "high", "middle", "low", "very low"))


head(pop_seg_perc)

From the numbers alone, it’s easy to see that the most songs fall in the “middle” or “low” popularity segments while the “very high” segment is only 0.3% of the population.

Visualizing this information should paint a clearer picture as to how few songs fall into the “very high” popularity segment.

pop_seg_perc %>%
  ggplot(aes(x="",y=total_percent, fill=pop_segs)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
  scale_fill_brewer(palette = "Pastel1") +
  geom_label_repel(data = pop_seg_perc,
                   aes(y = 0.5, x = 0.5, label = labels),
                   size = 4, nudge_x = 0.5, nudge_y = 0.5, show.legend = FALSE) +
  labs(title="Track Popularty Distribution")

Clearly, releasing a popular song is not easy!

4.1 Identifying Correlations

Let’s take a look at what type of correlations might exist between a song’s popularity and various other attributes such as tempo, danceability, valence, and others. We can spot these correlations by visualizing them as scatterplots and letting R map out the smooth trend lines for us.

ggarrange(
ggplot(music_info, aes(x=duration_ms, y=popularity))+
  geom_jitter() +
  geom_smooth(color = "red") + 
  labs(title = "Song Length", x = "Duration (ms)", y= "Track Popularity") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14)), 
ggplot(music_info, aes(x=danceability, y=popularity))+
  geom_jitter() +
  geom_smooth(color = "red") + 
  labs(title = "Danceability", x = "Danceability", y= "Track Popularity") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14)), 
ggplot(music_info, aes(x=energy, y=popularity))+
  geom_jitter() +
  geom_smooth(color = "red") + 
  labs(title = "Energy", x = "Energy", y= "Track Popularity") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14)), 
ggplot(music_info, aes(x=liveness, y=popularity))+
  geom_jitter() +
  geom_smooth(color = "red") + 
  labs(title = "Liveness", x = "Liveness", y= "Track Popularity") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14)), 
ggplot(music_info, aes(x=loudness, y=popularity))+
  geom_jitter() +
  geom_smooth(color = "red") + 
  labs(title = "Loudness", x = "Loudness", y= "Track Popularity") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14)), 
ggplot(music_info, aes(x=speechiness, y=popularity))+
  geom_jitter() +
  geom_smooth(color = "red") + 
  labs(title = "Speechiness", x = "Speechiness", y= "Track Popularity") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14)), 
ggplot(music_info, aes(x=valence, y=popularity))+
  geom_jitter() +
  geom_smooth(color = "red") + 
  labs(title = "Valence", x = "Valence", y= "Track Popularity") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14)), 
ggplot(music_info, aes(x=tempo, y=popularity))+
  geom_jitter() +
  geom_smooth(color = "red") + 
  labs(title = "Tempo (BPM)", x = "Tempo (BPM)", y= "Track Popularity") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14)), 
ggplot(music_info, aes(x=artist_popularity, y=popularity))+
  geom_jitter() +
  geom_smooth(color = "red") + 
  labs(title = "Artist Popularity", x = "Artist Popularity", y= "Track Popularity") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14))
)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

A few points stand out:

  • The most glaring and perhaps only real correlation we see is between artist popularity and track popularity. The more popular the artist, the more popular the track.
  • Track popularity starts to fall off after a certain point if a track is too “speechy”
  • There’s a “sweet spot” for track length but popularity does appear to decrease the longer a track goes on

Conclusion

Based on the limits of the data and the lack of trends identified, we can’t provide a formula for a popular song as confidently as we can provide a set of guidelines that would give a song the best chances to succeed. That set of guidelines is as follows