This project is 2/3 in a continuation of a case study on the effects of various attributes on a song’s popularity. Part 1 focused on cleaning our data using SQL in Azure Data Studio. Part 3 is an Tableau dashboard visualization.
Before we get started with our analysis, we’ll install the packages we’ll be using throughout.
install.packages("ggpubr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("here")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("skimr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("lubridate")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("ggrepel")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library("ggpubr")
## Loading required package: ggplot2
library("tidyverse")
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library("here")
## here() starts at /cloud/project
library("skimr")
library("janitor")
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library("lubridate")
## Loading required package: timechange
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library("ggrepel")
Now we’ll import our CSV dataset. Since we already cleaned and joined this data with SQL in an earlier project, we’ll only need to import 1 file.
music_info <- read_csv(file="/cloud/project/FullSpot.csv")
## Rows: 52812 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): album_id, track_id, art_id, track_name, artist_name, album_name, k...
## dbl (12): popularity, artist_popularity, duration_ms, danceability, tempo, e...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We’ll run the head and str functions to get a quick summary of the data we’re working with. This will be a helpful reference point throughout our analysis.
head(music_info)
str(music_info)
## spc_tbl_ [52,812 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ album_id : chr [1:52812] "7JZi4kTfYCJcqOzVbKAYh5" "16vClA3h0HIZSy6O0H5yyU" "4gPpWZjXUvbrIacogrTKrl" "1H9VCendIQLE3QFcKDwQ5Q" ...
## $ track_id : chr [1:52812] "43O1LuFW6QIJ20RSV8gC4g" "0QyZr2HXSVJgTci0DPfcB9" "0eCtJer01h42lZEKkAhQBK" "62ynwuxd3SXdbW7mzZaHuG" ...
## $ art_id : chr [1:52812] "23zg3TcAtWQy7J6upgbUnj" "2rBb2WwvpcIHNiofmw22vn" "4x1nvY2FN8jxqAFA0DA02H" "0qk9IE3icvkYLRii8qAko8" ...
## $ track_name : chr [1:52812] "You Decide" "Ainsi soit-il" "Lennon / McCartney: Come together" "El Barzón" ...
## $ artist_name : chr [1:52812] "Usher" "Louis Chedid" "John Lennon" "Amparo Ochoa" ...
## $ album_name : chr [1:52812] "\"A\"" "\"Ainsi Soit-Il\"" "\"Fiction\"" "\"Por Siempre\"" ...
## $ popularity : num [1:52812] 59 29 36 36 36 18 42 51 69 64 ...
## $ artist_popularity: num [1:52812] 81 33 71 36 58 34 61 70 70 70 ...
## $ key_let : chr [1:52812] "C#Db" "F" "G" "A" ...
## $ mode_let : chr [1:52812] "Maj" "min" "Maj" "Maj" ...
## $ duration_ms : num [1:52812] 212920 190533 185676 193893 316680 ...
## $ danceability : num [1:52812] 0.747 0.7 0.522 0.532 0.389 0.654 0.59 0.598 0.702 0.553 ...
## $ tempo : num [1:52812] 139.9 124.6 174.1 200.2 72.3 ...
## $ energy : num [1:52812] 0.533 0.491 0.294 0.461 0.474 0.869 0.971 0.244 0.632 0.864 ...
## $ followers : num [1:52812] 6806588 4357 2329020 14462 151818 ...
## $ liveness : num [1:52812] 0.119 0.186 0.118 0.156 0.349 0.0784 0.276 0.106 0.127 0.138 ...
## $ loudness : num [1:52812] -6.33 -14.86 -9.24 -13.21 -7.99 ...
## $ release_date : num [1:52812] 2018 1981 2010 2006 2015 ...
## $ speechiness : num [1:52812] 0.0403 0.0278 0.0369 0.314 0.0306 0.0304 0.0452 0.0374 0.0247 0.0414 ...
## $ valence : num [1:52812] 0.38 0.56 0.541 0.617 0.172 0.739 0.725 0.225 0.457 0.437 ...
## $ song_pop_seg : chr [1:52812] "middle" "low" "low" "low" ...
## - attr(*, "spec")=
## .. cols(
## .. album_id = col_character(),
## .. track_id = col_character(),
## .. art_id = col_character(),
## .. track_name = col_character(),
## .. artist_name = col_character(),
## .. album_name = col_character(),
## .. popularity = col_double(),
## .. artist_popularity = col_double(),
## .. key_let = col_character(),
## .. mode_let = col_character(),
## .. duration_ms = col_double(),
## .. danceability = col_double(),
## .. tempo = col_double(),
## .. energy = col_double(),
## .. followers = col_double(),
## .. liveness = col_double(),
## .. loudness = col_double(),
## .. release_date = col_double(),
## .. speechiness = col_double(),
## .. valence = col_double(),
## .. song_pop_seg = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
To get an even better idea of what we’re working with, let’s get a count of unique artist and track IDs.
n_unique(music_info$art_id)
## [1] 20183
n_unique(music_info$track_id)
## [1] 52812
A total of 20,183 individual artists and 52,812 individual tracks are included in our cleaned dataset.
We’ll now segment tracks out into 5 different groups based on their popularity - “very low”, “low”, “middle”, “high”, and “very high”. These segments will come in handy when determining popularity distribution.
average_pop <- music_info %>%
group_by(track_id) %>%
summarise (mean_song_pop = mean(popularity))
head(average_pop)
pop_segs <- average_pop %>%
mutate(pop_segs = case_when(
mean_song_pop < 21 ~ "very low",
mean_song_pop >= 21 & mean_song_pop < 41 ~ "low",
mean_song_pop >= 41 & mean_song_pop < 61 ~ "middle",
mean_song_pop >= 61 & mean_song_pop <81 ~ "high",
mean_song_pop >= 81 ~ "very high"
))
head(pop_segs)
Let’s take a look at the percentages of each popularity segment to get a better understanding of our sample distribution.
pop_seg_perc <- pop_segs %>%
group_by(pop_segs) %>%
summarise(total = n()) %>%
mutate(totals = sum(total)) %>%
group_by(pop_segs) %>%
summarise(total_percent = total / totals) %>%
mutate(labels = scales::percent(total_percent))
pop_seg_perc$pop_segs <- factor(pop_seg_perc$pop_segs , levels = c("very high", "high", "middle", "low", "very low"))
head(pop_seg_perc)
From the numbers alone, it’s easy to see that the most songs fall in the “middle” or “low” popularity segments while the “very high” segment is only 0.3% of the population.
Visualizing this information should paint a clearer picture as to how few songs fall into the “very high” popularity segment.
pop_seg_perc %>%
ggplot(aes(x="",y=total_percent, fill=pop_segs)) +
geom_bar(stat = "identity", width = 1)+
coord_polar("y", start=0)+
theme_minimal()+
theme(axis.title.x= element_blank(),
axis.title.y = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_blank(),
plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
scale_fill_brewer(palette = "Pastel1") +
geom_label_repel(data = pop_seg_perc,
aes(y = 0.5, x = 0.5, label = labels),
size = 4, nudge_x = 0.5, nudge_y = 0.5, show.legend = FALSE) +
labs(title="Track Popularty Distribution")
Clearly, releasing a popular song is not easy!
Let’s take a look at what type of correlations might exist between a song’s popularity and various other attributes such as tempo, danceability, valence, and others. We can spot these correlations by visualizing them as scatterplots and letting R map out the smooth trend lines for us.
ggarrange(
ggplot(music_info, aes(x=duration_ms, y=popularity))+
geom_jitter() +
geom_smooth(color = "red") +
labs(title = "Song Length", x = "Duration (ms)", y= "Track Popularity") +
theme(panel.background = element_blank(),
plot.title = element_text( size=14)),
ggplot(music_info, aes(x=danceability, y=popularity))+
geom_jitter() +
geom_smooth(color = "red") +
labs(title = "Danceability", x = "Danceability", y= "Track Popularity") +
theme(panel.background = element_blank(),
plot.title = element_text( size=14)),
ggplot(music_info, aes(x=energy, y=popularity))+
geom_jitter() +
geom_smooth(color = "red") +
labs(title = "Energy", x = "Energy", y= "Track Popularity") +
theme(panel.background = element_blank(),
plot.title = element_text( size=14)),
ggplot(music_info, aes(x=liveness, y=popularity))+
geom_jitter() +
geom_smooth(color = "red") +
labs(title = "Liveness", x = "Liveness", y= "Track Popularity") +
theme(panel.background = element_blank(),
plot.title = element_text( size=14)),
ggplot(music_info, aes(x=loudness, y=popularity))+
geom_jitter() +
geom_smooth(color = "red") +
labs(title = "Loudness", x = "Loudness", y= "Track Popularity") +
theme(panel.background = element_blank(),
plot.title = element_text( size=14)),
ggplot(music_info, aes(x=speechiness, y=popularity))+
geom_jitter() +
geom_smooth(color = "red") +
labs(title = "Speechiness", x = "Speechiness", y= "Track Popularity") +
theme(panel.background = element_blank(),
plot.title = element_text( size=14)),
ggplot(music_info, aes(x=valence, y=popularity))+
geom_jitter() +
geom_smooth(color = "red") +
labs(title = "Valence", x = "Valence", y= "Track Popularity") +
theme(panel.background = element_blank(),
plot.title = element_text( size=14)),
ggplot(music_info, aes(x=tempo, y=popularity))+
geom_jitter() +
geom_smooth(color = "red") +
labs(title = "Tempo (BPM)", x = "Tempo (BPM)", y= "Track Popularity") +
theme(panel.background = element_blank(),
plot.title = element_text( size=14)),
ggplot(music_info, aes(x=artist_popularity, y=popularity))+
geom_jitter() +
geom_smooth(color = "red") +
labs(title = "Artist Popularity", x = "Artist Popularity", y= "Track Popularity") +
theme(panel.background = element_blank(),
plot.title = element_text( size=14))
)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
A few points stand out:
Based on the limits of the data and the lack of trends identified, we can’t provide a formula for a popular song as confidently as we can provide a set of guidelines that would give a song the best chances to succeed. That set of guidelines is as follows