Can We Explain Song Popularity?

1. Introduction

1.1. Project Objectives

Analyze the Spotify database to:

Understand how song characteristics (e.g. danceability, liveness) might be associated with different song genres (e.g. pop, rock).
Create a model to predict song popularity based on song characteristics.

1.2. Plan to Deliver Against Project Objectives

Conduct preliminary data analyses to determine what cleaning steps, if any, are needed
Clean the data
Determine what variables might be correlated
Fit various types of models to the data, starting with the simplest models (e.g. multiple linear regression, trees)
Select best model
Re-evaluate variables to determine whether further data cleaning and/or collection should be recommended
Final summary and recommendations

1.3. Analysis and Modeling Proposal

This project will be executed in two major phases:

Phase 1: Analyze the data and look for associations between song characteristics and song genres & sub-genres. This will include data clean-up, data wrangling and data visualization.
Phase 2: Create models to predict song popularity based on most relevant song characteristics identified in phase 1. This phase will include variable selection and evaluation of various model architectures (to be delivered on 8/13/21)

1.4. Expected Output

Learnings from these analyses and song popularity models will be used by the MakeYourSong (made up) start-up to guide its users on what song characteristics are likely to drive popularity. The predictive model will be available to users of the MakeYourSong start-up.

2. Packages Required

2.1. Packages Required

2.2. Messages and warnings resulting from loading the packages are suppressed

#install.packages("tidyverse")
#install.packages("dplyr")
#install.packages("ggplot2")
#install.packages("plotly")
#install.packages("corrplot")

library(tidyverse)
library(dplyr)
library(ggplot2)
library(plotly)
library(corrplot)
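
One way to get the quiet loading described in 2.2 is to wrap each library() call in suppressPackageStartupMessages(). The sketch below is an assumption about how this could be done, not necessarily how this report was knitted (R Markdown chunk options such as message = FALSE achieve a similar effect). The requireNamespace() guard is added only so the loop does not stop if a package is missing.

```r
# Packages used in this report
pkgs <- c("tidyverse", "plotly", "corrplot")

for (pkg in pkgs) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    # Attach the package without printing its start-up messages
    suppressPackageStartupMessages(library(pkg, character.only = TRUE))
  } else {
    message("Package not installed: ", pkg)
  }
}
```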

2.3. Package Short Description

tidyverse - for interacting with data through subsetting, transformation, visualization, etc.

dplyr - for data manipulation in R by combining, selecting, grouping, subsetting and transforming all or parts of dataset

ggplot2 - for declaratively creating graphics, based on The Grammar of Graphics

plotly - for creating interactive web-based graphs via the open source JavaScript graphing library plotly.js

corrplot - for visualizing correlation matrices and confidence intervals

3. Data Preparation

3.1. Data Source

The dataset is available on GitHub. Link to the data source is here.

3.2. Explanation of Data Source

The data to be analyzed is an excerpt of the Spotify database containing 32,833 rows. The data set of Spotify songs contains 23 variables and 32,833 songs from 1957 to 2020. There are 10,693 artists and 6 main genres with sub-categories for each. There are 12 audio features for each track, including confidence measures like acousticness, liveness, speechiness and instrumentalness, perceptual measures like energy, loudness, danceability and valence (positiveness), and descriptors like duration, tempo, key, and mode.

Genres were selected from Every Noise, a visualization of the Spotify genre-space maintained by a genre taxonomist. The top four sub-genres for each were used to query Spotify for 20 playlists each, resulting in about 5000 songs for each genre, split across a varied sub-genre space.

You can find the code for generating the dataset in spotify_dataset.R in the full GitHub repo.

3.3. Data Importing and Cleaning

# Code to import the data
spotify <- read.csv("C:/Users/king.nm/OneDrive - Procter and Gamble/UC/MS Business Analytics/Classes/Summer 2021/Data Wrangling BANA 7025 - Jun2021/Final Project/spotify.csv")

Identifying and reviewing the codebook

dictionary_spotify <- read.csv("C:/Users/king.nm/OneDrive - Procter and Gamble/UC/MS Business Analytics/Classes/Summer 2021/Data Wrangling BANA 7025 - Jun2021/Final Project/dictionary_spotify.csv")

# Code to view data Spotify codebook
# Use library knitr to format codebook table
library(knitr)

## Warning: package 'knitr' was built under R version 4.0.5

kable(dictionary_spotify[,], caption = "Spotify Codebook")

Spotify Codebook
variable	class	description
track_id	character	Song unique ID
track_name	character	Song Name
track_artist	character	Song Artist
track_popularity	double	Song Popularity (0-100) where higher is better
track_album_id	character	Album unique ID
track_album_name	character	Song album name
track_album_release_date	character	Date when album released
playlist_name	character	Name of playlist
playlist_id	character	Playlist ID
playlist_genre	character	Playlist genre
playlist_subgenre	character	Playlist subgenre
danceability	double	Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy	double	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key	double	The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness	double	The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode	double	Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness	double	Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness	double	A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness	double	Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness	double	Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence	double	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo	double	The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms	double	Duration of song in milliseconds

Assessing dimensions of the dataset

dim(spotify)

## [1] 28352    43

Check for duplicate rows or columns

# Checking to see whether there are songs with the same ID
length(unique(spotify$track_id))

## [1] 28352

# Creating a new file with unique songs
spotify_unique = spotify[!duplicated(spotify$track_id),]
str(spotify_unique)

# Shortening the name of spotify_unique to spotify only
spotify <- spotify_unique

# Checking whether the de-duplicated data still contains 28,352 rows
str(spotify)
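
The duplicate check above can also be made explicit by counting how many IDs repeat before dropping them. This is a minimal sketch on a toy data frame; `toy` and its values are invented for illustration and are not taken from the Spotify file.

```r
# Toy data standing in for the Spotify file (values are invented)
toy <- data.frame(
  track_id = c("a1", "b2", "a1", "c3", "b2"),
  track_popularity = c(10, 55, 10, 80, 55),
  stringsAsFactors = FALSE
)

# Number of rows whose track_id has already appeared in an earlier row
n_dup <- sum(duplicated(toy$track_id))

# Keep only the first occurrence of each track_id
toy_unique <- toy[!duplicated(toy$track_id), ]

n_dup
nrow(toy_unique)
```

If `n_dup` is zero, as the unique-ID count suggests for this dataset, the de-duplication step leaves the data unchanged.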

Viewing the head and tail of the data

head(spotify, n=5)
tail(spotify, n=5)

Cleaning the Data (explanation of the data cleaning steps)

Identify missing data
Determine how to handle missing data
Looking for outliers
Determine how to handle outliers
Frequency distribution for the variables

Identifying missing data

sum(is.na(spotify))
colSums(is.na(spotify))

# Eliminating missing data since there are not too many missing values
spotify <- na.omit(spotify)

# Checking whether missing data was omitted
str(spotify)
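
Before dropping rows with na.omit(), it can help to quantify how much data would be lost. The sketch below uses a toy data frame, and the 5% threshold is an arbitrary assumption chosen only for illustration.

```r
# Toy data with a few missing values (values are invented)
toy <- data.frame(
  danceability = c(0.5, NA, 0.7, 0.9),
  energy = c(0.8, 0.6, NA, 0.4),
  tempo = c(120, 98, 101, 140)
)

# Share of rows that contain at least one missing value
share_incomplete <- mean(!complete.cases(toy))

# Drop incomplete rows only if they are a small share of the data;
# the 5% threshold is an assumption made for this sketch
max_share <- 0.05
toy_clean <- if (share_incomplete <= max_share) na.omit(toy) else toy

share_incomplete
nrow(toy_clean)
```

In this toy case half of the rows are incomplete, so nothing is dropped and a different strategy (for example imputation) would need to be considered.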

Computing summary statistics for the variables

summary(spotify)

Learn about the data visually by plotting:

Histograms of numeric variables

hist(spotify$danceability)
hist(spotify$energy)
hist(spotify$loudness)
hist(spotify$speechiness)
hist(spotify$acousticness)
hist(spotify$instrumentalness)
hist(spotify$liveness)
hist(spotify$valence)
hist(spotify$tempo)
hist(spotify$key)
hist(spotify$mode)
hist(spotify$track_popularity)
hist(spotify$duration_ms)
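
The repeated hist() calls above could be collapsed into a loop over the numeric columns. This is a sketch on simulated data; the column names mirror the Spotify variables, but the values are random.

```r
set.seed(1)

# Simulated stand-in for the Spotify data
toy <- data.frame(
  danceability = runif(100),
  energy = runif(100),
  playlist_genre = sample(c("pop", "rock"), 100, replace = TRUE),
  stringsAsFactors = FALSE
)

# Names of the numeric columns only
num_cols <- names(toy)[sapply(toy, is.numeric)]

# One histogram per numeric column, titled with the column name
for (col in num_cols) {
  hist(toy[[col]], main = col, xlab = col)
}
```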

Tables for character variables

library(knitr)
kable(table(spotify$playlist_genre), align = "l", caption = "Playlist genre frequencies")

Playlist genre frequencies
Var1	Freq
edm	4877
latin	4136
pop	5132
r&b	4504
rap	5398
rock	4305

kable(table(spotify$playlist_subgenre),align = "l", caption = "Playlist sub-genre frequencies" )

Playlist sub-genre frequencies
Var1	Freq
album rock	1039
big room	1034
classic rock	1100
dance pop	1298
electro house	1416
electropop	1251
gangster rap	1314
hard rock	1202
hip hop	1296
hip pop	803
indie poptimism	1547
latin hip hop	1194
latin pop	1097
neo soul	1478
new jack swing	1036
permanent wave	964
pop edm	967
post-teen pop	1036
progressive electro house	1460
reggaeton	687
southern hip hop	1582
trap	1206
tropical	1158
urban contemporary	1187

Bar Plots for Song Genre

barplot(table(spotify$playlist_genre))

Box plots – looking for outliers

boxplot(spotify$track_popularity,xlab = "popularity")
boxplot(spotify$danceability,xlab = "danceability")
boxplot(spotify$duration_ms, xlab = "duration_ms")
boxplot(spotify$energy, xlab = "energy")
boxplot(spotify$loudness, xlab = "loudness")
boxplot(spotify$speechiness, xlab = "speechiness")
boxplot(spotify$acousticness, xlab = "acousticness")
boxplot(spotify$instrumentalness, xlab = "instrumentalness")
boxplot(spotify$liveness, xlab = "liveness")
boxplot(spotify$valence, xlab = "valence")
boxplot(spotify$tempo, xlab = "tempo")

The box plots show outliers for most of the variables evaluated: danceability, duration, energy, loudness, speechiness, acousticness, instrumentalness, liveness and tempo.
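
To move from a visual impression to a count, the points drawn beyond the box plot whiskers can be tallied with boxplot.stats(), which applies the same 1.5 x IQR rule that boxplot() uses by default. The helper `count_outliers` and the toy values are made up for this sketch.

```r
# Count values that fall beyond the box plot whiskers (1.5 * IQR rule)
count_outliers <- function(x) {
  length(boxplot.stats(x)$out)
}

# Toy data: tempo contains one extreme value, valence contains none
toy <- data.frame(
  tempo = c(100, 102, 98, 101, 99, 103, 97, 250),
  valence = c(0.40, 0.50, 0.45, 0.55, 0.50, 0.60, 0.48, 0.52)
)

# Number of outliers per column
outlier_counts <- sapply(toy, count_outliers)
outlier_counts
```

Applied to the numeric Spotify columns, a table like this would show which variables have only a handful of outliers and which have many.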

Number of Artists

length(unique(spotify$track_artist))

## [1] 10692

Number of Playlists IDs

length(unique(spotify$playlist_id))

## [1] 470

Number of Playlist Names

length(unique(spotify$playlist_name))

Scatter plots to Look for Correlations Between Variables

plot(spotify$liveness, spotify$tempo)
plot(spotify$speechiness, spotify$liveness)
plot(spotify$liveness, spotify$track_popularity)
plot(spotify$energy, spotify$track_popularity)
plot(spotify$loudness, spotify$track_popularity)
plot(spotify$key, spotify$track_popularity)
plot(spotify$speechiness, spotify$track_popularity)

# Creating a subset of the data with numeric variables only to more easily check for correlations
library(tidyverse)
spotify_num <- select(spotify, track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, key, mode)

# Checking for variable correlations
spotify_corr <- cor(spotify_num)
corrplot(spotify_corr, type = "lower", tl.srt = 20)
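
Since the goal is to explain popularity, it may be easier to read one column of the correlation matrix than the whole plot. The sketch below ranks variables by the absolute value of their correlation with track_popularity. The data are simulated so that only acousticness is related to popularity by construction; nothing here reflects the real Spotify correlations.

```r
set.seed(42)
n <- 200

# Simulated features
toy <- data.frame(
  energy = runif(n),
  acousticness = runif(n),
  tempo = rnorm(n, mean = 120, sd = 20)
)

# Popularity depends on acousticness only (by construction)
toy$track_popularity <- 50 + 30 * toy$acousticness + rnorm(n, sd = 5)

corr <- cor(toy)

# Correlation of every other variable with track_popularity, strongest first
pop_corr <- corr[rownames(corr) != "track_popularity", "track_popularity"]
pop_corr <- pop_corr[order(abs(pop_corr), decreasing = TRUE)]
round(pop_corr, 2)
```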

Create the final clean file with numeric, genre and sub-genre variables that will be used for modeling

library(tidyverse)
spotify_m <- select(spotify, track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, key, mode, playlist_genre, playlist_subgenre)

3.4. Clean Data (show the data in the most condensed form possible)

Transforming genre and sub-genre variables into factors

# Code to transform the character variables into factors
spotify_m$playlist_genre <- as.factor(spotify_m$playlist_genre)
spotify_m$playlist_subgenre <- as.factor(spotify_m$playlist_subgenre)
# Checking whether the factors were created
str(spotify_m)

# Summary of the clean dataset
summary(spotify_m)

Learning: Variables speechiness, acousticness, instrumentalness and liveness are highly skewed, with a significant number of outliers. These variables will need to be analyzed to decide whether they should be part of the analyses and predictive model.

# Summarize the clean dataset using means
# Columns are referenced by name inside summarise(); the duplicated loud_mean entry was dropped
summ1 <- summarise(spotify_m, popular_mean = mean(track_popularity, na.rm = TRUE),
          danceab_mean = mean(danceability, na.rm = TRUE),
          energ_mean = mean(energy, na.rm = TRUE),
          loud_mean = mean(loudness, na.rm = TRUE),
          speech_mean = mean(speechiness, na.rm = TRUE),
          acoust_mean = mean(acousticness, na.rm = TRUE),
          instr_mean = mean(instrumentalness, na.rm = TRUE),
          liven_mean = mean(liveness, na.rm = TRUE),
          valen_mean = mean(valence, na.rm = TRUE),
          tempo_mean = mean(tempo, na.rm = TRUE),
          key_mean = mean(key, na.rm = TRUE),
          mode_mean = mean(mode, na.rm = TRUE),
          n = n())

# Summarize the clean dataset using ranges
# range() returns two values (min, max), so summ2 has two rows: minimums first, then maximums
summ2 <- summarise(spotify_m, popular_range = range(track_popularity, na.rm = TRUE),
          danceab_range = range(danceability, na.rm = TRUE),
          energ_range = range(energy, na.rm = TRUE),
          loud_range = range(loudness, na.rm = TRUE),
          speech_range = range(speechiness, na.rm = TRUE),
          acoust_range = range(acousticness, na.rm = TRUE),
          instr_range = range(instrumentalness, na.rm = TRUE),
          liven_range = range(liveness, na.rm = TRUE),
          valen_range = range(valence, na.rm = TRUE),
          tempo_range = range(tempo, na.rm = TRUE),
          key_range = range(key, na.rm = TRUE),
          mode_range = range(mode, na.rm = TRUE),
          n = n() )

# Printing the two key summary tables
print(list(summ1, summ2))
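
The two summaries above list every variable by hand. A more compact alternative, assuming dplyr 1.0.0 or later is installed, is across(), which applies the same functions to every numeric column. The toy data frame below is invented for illustration.

```r
library(dplyr)

# Toy stand-in for spotify_m (values are invented)
toy <- data.frame(
  danceability = c(0.2, 0.4, 0.9),
  energy = c(0.5, 0.7, 0.6),
  playlist_genre = c("pop", "rock", "pop")
)

# Mean, minimum and maximum of every numeric column in one call;
# output columns are named <variable>_<statistic>
summ <- toy %>%
  summarise(across(where(is.numeric),
                   list(mean = mean, min = min, max = max),
                   .names = "{.col}_{.fn}"))
summ
```

This returns a single row with one column per variable and statistic, which avoids copy-and-paste slips such as repeating a variable.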

3.5 Provide summary information about the variables of concern in your cleaned data set.

As shown below, the variables speechiness, acousticness, instrumentalness and liveness are highly skewed. In the case of instrumentalness, the median is essentially zero (0.00002). The median for the other three variables is also much closer to the minimum value than to the maximum value. These variables may need to be re-scaled or eliminated from the model.

# Variables of concerns
spotify_conc <- select(spotify_m, speechiness, acousticness, instrumentalness, liveness)
summary(spotify_conc)

##   speechiness      acousticness    instrumentalness       liveness     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000   Min.   :0.0000  
##  1st Qu.:0.0410   1st Qu.:0.0143   1st Qu.:0.0000000   1st Qu.:0.0926  
##  Median :0.0626   Median :0.0797   Median :0.0000207   Median :0.1270  
##  Mean   :0.1079   Mean   :0.1772   Mean   :0.0911294   Mean   :0.1910  
##  3rd Qu.:0.1330   3rd Qu.:0.2600   3rd Qu.:0.0065725   3rd Qu.:0.2490  
##  Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000   Max.   :0.9960
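
As an illustration of the re-scaling option, the sketch below compares the skewness of a right-skewed variable before and after two common concave transformations, log1p() and sqrt(). The `skewness` helper is a simple moment-based definition written for this sketch, and the data are simulated rather than taken from the Spotify file. For values bounded between 0 and 1, log1p() changes the shape only modestly, so sqrt() is shown as a stronger alternative.

```r
# Simple moment-based skewness (third central moment over variance^1.5)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(7)

# Right-skewed toy variable concentrated near zero (simulated)
x <- rexp(500, rate = 10)

skew_raw  <- skewness(x)
skew_log  <- skewness(log1p(x))
skew_sqrt <- skewness(sqrt(x))

round(c(raw = skew_raw, log1p = skew_log, sqrt = skew_sqrt), 2)
```

Whether either transformation is enough for a variable like instrumentalness, where most values sit at or near zero, would need to be checked on the real data.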

4. Proposed Exploratory Data Analysis

4.1 Discuss how you plan to uncover new information in the data that is not self-evident. What are different ways you could look at this data to answer the questions you want to answer? Do you plan to slice and dice the data in different ways, create new variables, or join separate data frames to create new summary information? How could you summarize your data to answer key questions?

As part of the data analysis and modeling of this data, I looked at correlation, skewness, outliers and value frequency measures. I sliced the data into low (bottom quartile) and high (top quartile) popularity scores to try to explain song popularity.

4.2 What types of plots and tables will help you to illustrate the findings to your questions?

Correlation and clustered column plots helped me determine which song characteristics are more present in the most popular songs.

Continuing the analysis with break-outs

spotify %>%
  group_by(playlist_genre) %>%
  summarise(
    mean_popul = mean(track_popularity, na.rm = TRUE),
    mean_liven = mean(liveness, na.rm = TRUE),
    mean_speech = mean(speechiness, na.rm = TRUE),
    mean_instr = mean(instrumentalness, na.rm = TRUE),
    mean_acoust = mean(acousticness, na.rm = TRUE),
    mean_loud = mean(loudness, na.rm = TRUE),
    mean_danc = mean(danceability, na.rm = TRUE),
    mean_energy = mean(energy, na.rm = TRUE),
    mean_valence = mean(valence, na.rm = TRUE),
    mean_durat = mean(duration_ms, na.rm = TRUE),
    mean_mode = mean(mode, na.rm = TRUE),
    mean_tempo = mean(tempo, na.rm = TRUE),
    mean_key = mean(key, na.rm = TRUE))%>%
  arrange(desc(mean_popul))

## # A tibble: 6 x 14
##   playlist_genre mean_popul mean_liven mean_speech mean_instr mean_acoust
##   <chr>               <dbl>      <dbl>       <dbl>      <dbl>       <dbl>
## 1 pop                  45.9      0.177      0.0742     0.0634      0.172 
## 2 rap                  41.8      0.191      0.197      0.0802      0.197 
## 3 latin                41.4      0.182      0.100      0.0526      0.213 
## 4 rock                 39.7      0.205      0.0579     0.0664      0.147 
## 5 r&b                  35.9      0.176      0.116      0.0285      0.264 
## 6 edm                  30.7      0.214      0.0879     0.245       0.0769
## # ... with 8 more variables: mean_loud <dbl>, mean_danc <dbl>,
## #   mean_energy <dbl>, mean_valence <dbl>, mean_durat <dbl>, mean_mode <dbl>,
## #   mean_tempo <dbl>, mean_key <dbl>

# Focusing the analysis on top quartile of popularity (goal is to determine whether the most popular songs have common characteristics)
topqtpop <-
spotify %>%
  filter(track_popularity > 58) %>%
  arrange(desc(track_popularity))
str(topqtpop)
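
The cut-off of 58 above is hard-coded. Assuming it is meant to be the 75th percentile of track_popularity, the quartile thresholds could instead be computed from the data with quantile(), so they stay correct if the data change. The toy values below are invented.

```r
# Toy popularity scores (invented)
toy <- data.frame(
  track_popularity = c(0, 5, 12, 20, 33, 41, 47, 58, 66, 71, 83, 98)
)

# 25th and 75th percentiles of popularity
cuts <- quantile(toy$track_popularity, probs = c(0.25, 0.75), na.rm = TRUE)

# Rows strictly above the 75th percentile and strictly below the 25th
top_q    <- toy[toy$track_popularity > cuts[["75%"]], , drop = FALSE]
bottom_q <- toy[toy$track_popularity < cuts[["25%"]], , drop = FALSE]

cuts
nrow(top_q)
nrow(bottom_q)
```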

# Summarizing the characteristics for the top quartile (most popular songs)

t1 <- topqtpop %>%
  group_by(playlist_genre) %>%
  summarise(
    mean_popul1 = mean(track_popularity, na.rm = TRUE),
    mean_liven1 = mean(liveness, na.rm = TRUE),
    mean_speech1 = mean(speechiness, na.rm = TRUE),
    mean_instr1 = mean(instrumentalness, na.rm = TRUE),
    mean_acoust1 = mean(acousticness, na.rm = TRUE),
    mean_loud1 = mean(loudness, na.rm = TRUE),
    mean_danc1 = mean(danceability, na.rm = TRUE),
    mean_energy1 = mean(energy, na.rm = TRUE),
    mean_valence1 = mean(valence, na.rm = TRUE),
    mean_durat1 = mean(duration_ms, na.rm = TRUE),
    mean_mode1 = mean(mode, na.rm = TRUE),
    mean_tempo1 = mean(tempo, na.rm = TRUE),
    mean_key1 = mean(key, na.rm = TRUE))%>%
  arrange(desc(mean_popul1))

# Sort t1 by playlist_genre
t1 <- arrange(t1, playlist_genre)

# Focusing the analysis on bottom quartile of popularity (goal is to determine whether the least popular songs have common characteristics)
botqtpop <-
spotify %>%
  filter(track_popularity < 21) %>%
  arrange(desc(track_popularity))
str(botqtpop)

# Summarizing the characteristics for the bottom quartile (least popular songs)

b1 <- botqtpop %>%
  group_by(playlist_genre) %>%
  summarise(
    mean_popul2 = mean(track_popularity, na.rm = TRUE),
    mean_liven2 = mean(liveness, na.rm = TRUE),
    mean_speech2 = mean(speechiness, na.rm = TRUE),
    mean_instr2 = mean(instrumentalness, na.rm = TRUE),
    mean_acoust2 = mean(acousticness, na.rm = TRUE),
    mean_loud2 = mean(loudness, na.rm = TRUE),
    mean_danc2 = mean(danceability, na.rm = TRUE),
    mean_energy2 = mean(energy, na.rm = TRUE),
    mean_valence2 = mean(valence, na.rm = TRUE),
    mean_durat2 = mean(duration_ms, na.rm = TRUE),
    mean_mode2 = mean(mode, na.rm = TRUE),
    mean_tempo2 = mean(tempo, na.rm = TRUE),
    mean_key2 = mean(key, na.rm = TRUE))%>%
  arrange(desc(mean_popul2))

# Sort b1 by playlist_genre
b1 <- arrange(b1, playlist_genre)

# Calculating the difference in percent between the most and least popular songs for all song characteristics
dif_db <- 100 * (t1[-1] - b1[-1]) / t1[-1]

# Rounding the numbers of the dataset with the percent difference between most and least popular songs
#round(dif_db[-1], 0)

# Add the column 1 back
# Add the columns from the second dataframe to the first

dif_db <- cbind(dif_db, b1[1])

# Renaming the column names
# colnames(df) <- c('C1','C2','C3')
colnames(dif_db) <- c("popul", "liven", "speech", "instrum", "acoust", "loud", "dance", "energy", "valence", "durat", "mode", "tempo", "key", "genre")
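
The percent-difference calculation relies on t1 and b1 listing the genres in the same order, which is why both are sorted by playlist_genre first. A minimal worked example with invented means for two genres makes the arithmetic explicit; `top` and `bottom` are hypothetical stand-ins for t1 and b1.

```r
# Invented per-genre means for the top and bottom popularity quartiles
top <- data.frame(genre = c("pop", "rock"),
                  acoust = c(0.20, 0.10),
                  instrum = c(0.02, 0.05))
bottom <- data.frame(genre = c("pop", "rock"),
                     acoust = c(0.16, 0.12),
                     instrum = c(0.04, 0.05))

# Both tables must list genres in the same order before subtracting
stopifnot(identical(top$genre, bottom$genre))

# Percent difference relative to the top-quartile mean
pct_diff <- 100 * (top[-1] - bottom[-1]) / top[-1]
pct_diff$genre <- top$genre
pct_diff
```

Here pop acousticness is 20% higher in the top quartile, while pop instrumentalness is 100% lower. Because the denominator is the top-quartile mean, variables whose means are close to zero (such as instrumentalness) can produce very large percentages from small absolute differences.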

Conclusion #1: Song popularity seems to be associated with a higher level of acousticness and a lower level of instrumentalness, as shown in Chart #1 below.

library(tidyr)
library(ggplot2)
dat.g <- gather(dif_db[2:14], type, value, -genre)
ggplot(dat.g, aes(type, value)) + 
  geom_bar(aes(fill = genre), stat = "identity", position = "dodge") +
  geom_vline(xintercept = seq(1.5, 11.5, by = 1)) +
  theme_get() +
  scale_x_discrete(name = "Musical Characteristics") +
  scale_y_continuous(name = "%") +
  ggtitle ("Chart #1: Comparing Most (top quartile) vs. Least (bottom quartile) Popular Songs", 
           subtitle = "% variation between top vs. bottom quartile of song popularity for 12 musical characteristics")

# Reduce dataset to include only relevant variables for visualization and modeling
most_pop <-
  select(spotify, track_artist, track_popularity, playlist_genre, playlist_subgenre, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_ms)

# Previewing the numeric variables rounded to 3 digits (the result is printed, not stored in most_pop)
library(dplyr)
most_pop %>% 
 mutate_if(is.numeric, round, digits = 3)

# Artists with the most popular tracks, sorted by mean popularity for each artist

library(dplyr)
chart2_db <- 
most_pop %>%
  group_by(track_artist)%>%
  summarize (
    mean_pop = mean(track_popularity, na.rm = TRUE),
    min_pop = min(track_popularity, na.rm = TRUE),
    max_pop = max(track_popularity, na.rm = TRUE),
    pop_range = (max(track_popularity, na.rm = TRUE) - min(track_popularity, na.rm = TRUE)),
    mean_acoust = mean(acousticness, na.rm = TRUE),
    mean_instrum = mean(instrumentalness, na.rm = TRUE) ,
    mean_energy = mean(energy, na.rm = TRUE),
    mean_loud = mean(loudness, na.rm = TRUE)) %>%
  mutate_if(is.numeric, round, digits = 2) %>%
  arrange(desc(mean_pop))

# Artists with the most popular tracks, sorted by highest range for each artist

library(dplyr)
chart3_db <-
most_pop %>%
  group_by(track_artist)%>%
  summarize (
    mean_pop = mean(track_popularity, na.rm = TRUE),
    min_pop = min(track_popularity, na.rm = TRUE),
    max_pop = max(track_popularity, na.rm = TRUE),
    pop_range = (max(track_popularity, na.rm = TRUE) - min(track_popularity, na.rm = TRUE)),
    mean_acoust = mean(acousticness, na.rm = TRUE),
    mean_instrum = mean(instrumentalness, na.rm = TRUE),
    mean_energy = mean(energy, na.rm = TRUE),
    mean_loud = mean(loudness, na.rm = TRUE)) %>%
  mutate_if(is.numeric, round, digits = 2) %>%
  arrange(desc(pop_range))

# Creating dataset with the top 50 artists ranked by mean song popularity
most_pop_50 <- head(chart2_db, n = 50)
str(most_pop_50)

Conclusion #2: Although acousticness and instrumentalness are the two characteristics that best discriminate between the most and least popular songs, they are not consistent across the most popular artists (Table 1). According to Table 1, Trevor Daniel (ranked 1st in average popularity) and Roddy Ricch (ranked 8th in average popularity) have nearly the same mean acousticness (0.12 vs. 0.10) and the same mean instrumentalness (0.00), yet the average popularity of their songs differs by about 14 points (97.00 vs. 83.43). To reinforce the point that acousticness may not be driving song popularity, Table 1 also shows that artists with the same average popularity for their songs (79) have mean acousticness ranging from 0.00 to 0.75.

# Table showing the 50 artists with the highest mean popularity
# This shows that there is NOT a lot of similarity among song characteristics from the 50 most popular artists
# Popularity seems to be driven by variables not captured by this dataset

library(knitr)
kable(most_pop_50, caption = "Table 1: Top 50 Artists Ranked by the Average Popularity of All Their Songs")

Table 1: Top 50 Artists Ranked by the Average Popularity of All Their Songs
track_artist	mean_pop	min_pop	max_pop	pop_range	mean_acoust	mean_instrum	mean_energy	mean_loud
Trevor Daniel	97.00	97	97	0	0.12	0.00	0.43	-8.76
Y2K	91.00	91	91	0	0.18	0.00	0.39	-7.90
Don Toliver	87.50	83	92	9	0.41	0.00	0.70	-4.78
Kina	85.50	85	86	1	0.81	0.01	0.18	-17.63
JACKBOYS	84.33	82	87	5	0.07	0.00	0.62	-5.11
Dadá Boladão	84.00	84	84	0	0.25	0.00	0.55	-7.03
DaBaby	83.67	69	93	24	0.08	0.00	0.69	-4.86
Roddy Ricch	83.43	78	98	20	0.10	0.00	0.53	-8.15
Baby Keem	83.00	83	83	0	0.18	0.00	0.56	-7.82
Dayvi	83.00	83	83	0	0.01	0.47	0.97	-3.51
Internet Money	83.00	83	83	0	0.36	0.00	0.67	-6.66
Olivia Rodrigo	83.00	83	83	0	0.09	0.00	0.43	-6.58
Omar Montes	82.00	82	82	0	0.21	0.00	0.75	-4.70
Harry Styles	81.78	68	91	23	0.22	0.00	0.63	-5.06
YNW Melly	81.57	76	90	14	0.18	0.00	0.54	-8.02
Camilo	80.25	77	88	11	0.34	0.00	0.67	-4.33
Apache 207	80.00	79	81	2	0.08	0.13	0.79	-6.69
Kaash Paige	80.00	80	80	0	0.83	0.00	0.36	-10.60
Ludmilla	80.00	80	80	0	0.48	0.00	0.86	-4.28
RIN	80.00	80	80	0	0.30	0.00	0.60	-6.41
Lil Tjay	79.67	78	82	4	0.35	0.00	0.52	-9.66
Alex & Sierra	79.00	79	79	0	0.75	0.00	0.29	-8.55
BlocBoy JB	79.00	79	79	0	0.00	0.00	0.58	-7.50
Don Patricio	79.00	79	79	0	0.44	0.00	0.52	-10.48
Ghetto Kids	79.00	79	79	0	0.05	0.00	0.74	-4.54
K CAMP	79.00	79	79	0	0.01	0.00	0.56	-8.60
King Princess	79.00	79	79	0	0.65	0.00	0.54	-7.04
KSI	79.00	79	79	0	0.35	0.00	0.80	-4.96
PUBLIC	79.00	79	79	0	0.01	0.00	0.80	-4.45
Spice Girls	79.00	79	79	0	0.10	0.00	0.86	-6.14
Ali Gatie	78.50	68	89	21	0.37	0.00	0.46	-6.97
Lil Skies	78.50	76	81	5	0.39	0.00	0.55	-7.13
A Great Big World	78.00	78	78	0	0.86	0.00	0.15	-8.82
Diego & Arnaldo	78.00	78	78	0	0.05	0.00	0.94	-1.40
Dj Guuga	78.00	78	78	0	0.14	0.00	0.92	0.30
Ed Maverick	78.00	78	78	0	0.95	0.00	0.16	-14.46
Gradur	78.00	78	78	0	0.21	0.00	0.78	-3.38
Keala Settle	78.00	78	78	0	0.01	0.00	0.70	-7.28
Likybo	78.00	78	78	0	0.31	0.00	0.70	-9.48
MC Ingryd	78.00	78	78	0	0.60	0.00	0.56	-5.70
Ms Nina	78.00	78	78	0	0.05	0.00	0.90	-7.07
Peachy!	78.00	78	78	0	0.74	0.00	0.36	-12.02
Rocco Hunt	78.00	78	78	0	0.01	0.00	0.72	-6.06
Sub Urban	78.00	78	78	0	0.26	0.00	0.59	-1.86
Tate McRae	78.00	78	78	0	0.07	0.00	0.58	-6.03
WILLOW	78.00	78	78	0	0.04	0.00	0.70	-5.28
Rvssian	77.50	73	82	9	0.13	0.00	0.74	-4.88
Samra	77.50	75	80	5	0.22	0.00	0.70	-4.56
FINNEAS	77.00	77	77	0	0.80	0.00	0.41	-7.94
Loredana	77.00	77	77	0	0.16	0.00	0.53	-5.86
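
Many artists in Table 1 have a pop_range of 0, which is what happens when an artist has a single track in the data (or several tracks with identical scores). A track count per artist would make this visible and would allow artists with too few tracks to be excluded from the ranking. The sketch below uses invented data, and the minimum of two tracks is an arbitrary choice for illustration.

```r
library(dplyr)

# Toy data: artist B has a single, very popular track (values are invented)
toy <- data.frame(
  track_artist = c("A", "A", "A", "B", "C", "C"),
  track_popularity = c(10, 60, 90, 97, 70, 74)
)

artist_summary <- toy %>%
  group_by(track_artist) %>%
  summarise(
    n_tracks = n(),
    mean_pop = mean(track_popularity),
    pop_range = max(track_popularity) - min(track_popularity)
  ) %>%
  # Keep artists with at least 2 tracks (threshold is an assumption)
  filter(n_tracks >= 2) %>%
  arrange(desc(mean_pop))

artist_summary
```

In this toy case artist B, who would otherwise top the ranking on the strength of one song, is excluded, and the remaining artists are ranked on averages over several tracks.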

# Creating dataset with top 50 highest popularity ranges by artist
# This shows that there's a large popularity variation even for the same artist
most_range_50 <- head(chart3_db, n = 50)
str(most_range_50)

## tibble [50 x 9] (S3: tbl_df/tbl/data.frame)
##  $ track_artist: chr [1:50] "Maroon 5" "The Weeknd" "Post Malone" "The Black Eyed Peas" ...
##  $ mean_pop    : num [1:50] 42.4 47 57.5 43.8 67.6 ...
##  $ min_pop     : num [1:50] 0 0 1 1 0 0 0 0 2 3 ...
##  $ max_pop     : num [1:50] 98 98 98 96 94 93 93 93 93 94 ...
##  $ pop_range   : num [1:50] 98 98 97 95 94 93 93 93 91 91 ...
##  $ mean_acoust : num [1:50] 0.14 0.26 0.28 0.08 0.06 0.35 0.09 0.11 0.36 0.16 ...
##  $ mean_instrum: num [1:50] 0 0.01 0 0 0 0 0.01 0 0 0 ...
##  $ mean_energy : num [1:50] 0.7 0.62 0.63 0.74 0.62 0.61 0.57 0.7 0.71 0.61 ...
##  $ mean_loud   : num [1:50] -5.75 -7.13 -5.33 -5.53 -5.64 -6.41 -6.95 -5.59 -4.02 -6.41 ...

Conclusion #3: Table #2 (below) shows that there’s a wide variation in song popularity for the same artist. That is, the data shows that the same artist can get a song popularity score of zero and 98.

# Table showing the top 50 highest popularity ranges among artists 
# This shows that song popularity varies widely even among songs from the same artist
library(knitr)
kable(most_range_50, caption = "Table 2: Top 50 Biggest Popularity Variation for the Same Artist")

Table 2: Top 50 Biggest Popularity Variation for the Same Artist
track_artist	mean_pop	min_pop	max_pop	pop_range	mean_acoust	mean_instrum	mean_energy	mean_loud
Maroon 5	42.41	0	98	98	0.14	0.00	0.70	-5.75
The Weeknd	46.98	0	98	98	0.26	0.01	0.62	-7.13
Post Malone	57.52	1	98	97	0.28	0.00	0.63	-5.33
The Black Eyed Peas	43.79	1	96	95	0.08	0.00	0.74	-5.53
Travis Scott	67.58	0	94	94	0.06	0.00	0.62	-5.64
Bad Bunny	57.97	0	93	93	0.35	0.00	0.61	-6.41
Future	53.63	0	93	93	0.09	0.01	0.57	-6.95
Selena Gomez	59.56	0	93	93	0.11	0.00	0.70	-5.59
Anuel AA	48.53	2	93	91	0.36	0.00	0.71	-4.02
blackbear	64.90	3	94	91	0.16	0.00	0.61	-6.41
Ed Sheeran	65.06	0	91	91	0.27	0.00	0.65	-6.05
J Balvin	47.77	0	91	91	0.16	0.00	0.75	-4.95
Justin Bieber	64.88	4	95	91	0.27	0.00	0.68	-5.78
Rauw Alejandro	60.33	1	92	91	0.31	0.00	0.69	-4.67
Shawn Mendes	54.33	2	93	91	0.18	0.00	0.73	-5.12
Tyga	45.12	0	91	91	0.12	0.00	0.63	-6.21
Ariana Grande	53.40	0	90	90	0.21	0.00	0.60	-5.96
Daddy Yankee	46.65	0	90	90	0.13	0.00	0.83	-4.76
Mustard	38.82	0	90	90	0.18	0.00	0.64	-6.36
Nicky Jam	53.63	0	90	90	0.20	0.00	0.72	-5.29
Juice WRLD	57.57	3	92	89	0.22	0.00	0.60	-6.53
Marshmello	57.46	0	89	89	0.07	0.10	0.83	-3.63
Dalex	62.30	3	91	88	0.55	0.00	0.62	-6.02
DJ Snake	41.92	0	88	88	0.08	0.08	0.79	-4.67
Drake	41.81	0	88	88	0.14	0.01	0.55	-7.57
Ellie Goulding	50.74	0	88	88	0.16	0.01	0.76	-5.02
Halsey	51.50	1	88	87	0.15	0.00	0.70	-4.87
Imagine Dragons	38.91	1	88	87	0.14	0.02	0.71	-5.70
Jonas Brothers	63.11	0	87	87	0.02	0.00	0.75	-4.86
Lady Gaga	48.80	1	88	87	0.09	0.00	0.76	-4.63
Red Velvet	52.83	0	87	87	0.15	0.00	0.84	-3.43
Sech	52.75	1	88	87	0.17	0.00	0.71	-3.99
XXXTENTACION	71.22	0	87	87	0.24	0.00	0.61	-6.34
Regard	57.00	8	94	86	0.06	0.00	0.76	-6.31
Stormzy	62.91	2	88	86	0.25	0.00	0.70	-5.19
Taylor Swift	59.62	0	86	86	0.12	0.00	0.67	-6.17
BTS	53.27	0	85	85	0.06	0.00	0.76	-4.80
Calvin Harris	51.14	0	85	85	0.09	0.06	0.85	-4.07
Chris Brown	51.36	0	85	85	0.22	0.00	0.57	-5.74
Conan Gray	52.33	0	85	85	0.32	0.00	0.57	-7.06
J. Cole	50.80	1	86	85	0.30	0.02	0.53	-9.88
John Legend	30.28	0	85	85	0.27	0.00	0.63	-7.06
The Chainsmokers	49.23	0	85	85	0.11	0.00	0.73	-5.75
Arcangel	45.05	0	84	84	0.19	0.01	0.71	-5.76
Avicii	41.90	0	84	84	0.08	0.08	0.78	-4.96
Lauv	64.80	1	85	84	0.29	0.00	0.54	-6.87
Major Lazer	30.30	0	84	84	0.06	0.05	0.80	-4.74
Meek Mill	49.67	1	85	84	0.12	0.00	0.72	-4.44
Myke Towers	66.82	0	84	84	0.30	0.00	0.70	-4.32
Queen	42.40	0	84	84	0.29	0.02	0.62	-8.39

Conclusion #4: Chart #2 visualizes the variation in song popularity for the same artist. The chart includes data for the artists with the 50 highest average popularity scores. Even for the most popular artists, there’s a large variation in popularity for their songs. This suggests that the artist’s name is not a main driver of popularity as measured by this dataset.

chart2 <-
  most_pop_50 %>%
  ggplot(aes(x = mean_pop, y = pop_range, color = mean_acoust)) +
  geom_point() +
  theme_get() +
  scale_x_continuous(name = "Song Popularity for the Top 50 Most Popular Artists") +
  scale_y_continuous(name = "Popularity Range by Artist") +
  ggtitle("Chart #2: Top 50 Most Popular Artists",
          subtitle = "Popularity Range (Max Pop - Min Pop) for All Songs from Same Artist")
  
chart2

Conclusion #5: Chart #3 (below) shows that song popularity varies far more for less popular artists. In this case, the popularity of songs from the same artist can span as much as 96 points (one song with a popularity of 0 and another with a popularity of 96).

chart3 <-
  most_range_50 %>%
  ggplot(aes(x = mean_pop, y = pop_range, color = mean_acoust)) +
  geom_point() +
  theme_get() +
  scale_x_continuous(name = "Mean Popularity by Artist") +
  scale_y_continuous(name = "Top 50 Highest Popularity Range by Artist") +
  ggtitle("Chart #3: Top 50 Popularity Range for Songs from the Same Artist",
          subtitle = "Popularity Range (Max - Min)")
  
chart3

# Creating a smaller data set for modeling
spotify_mod <- spotify %>% select(23:24, 30:43)
spotify_mod
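Selecting columns by position works only while the column order of spotify stays fixed. A minimal sketch of the same step done by name is shown below. The data frame toy and the vector model_cols are hypothetical stand-ins, not the actual columns at positions 23:24 and 30:43, which this report does not list.

```r
library(dplyr)

# Hypothetical stand-in for the spotify data frame (the real one has many more columns)
toy <- data.frame(
  track_artist     = c("Artist A", "Artist B"),
  track_popularity = c(10, 90),
  energy           = c(0.5, 0.8),
  loudness         = c(-7.1, -4.2)
)

# Assumed list of modeling columns; adjust to the columns actually needed
model_cols <- c("track_popularity", "energy", "loudness")

# all_of() raises an error if a listed column is missing,
# instead of silently selecting the wrong column
toy_mod <- toy %>% select(all_of(model_cols))
names(toy_mod)
```

Selecting by name keeps the modeling data set stable if columns are added to or reordered in the source data.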

Modeling

Conclusion #6: The data set is highly variable and inconsistent: popularity scores are explained by neither artist name nor musical characteristics. It is therefore highly unlikely that a useful model to predict song popularity can be created without additional information. Below are three multiple linear regression models with a maximum R-squared of 0.058. Each model is statistically significant overall (p-value < 2.2e-16), but even the largest explains less than 6% of the variation in song popularity.
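A model can have a very small p-value and still explain little of the response when the sample is large. The sketch below illustrates this with simulated data only. The sample size matches the 28,352 observations implied by the model outputs, while the effect size of 0.2 and the variables x and y are assumptions chosen for illustration and are unrelated to the Spotify data.

```r
# Simulated illustration: a weak but real effect in a large sample
set.seed(1)
n <- 28352                  # same number of observations as the models in this report
x <- rnorm(n)
y <- 0.2 * x + rnorm(n)     # assumed weak effect plus a lot of noise

fit_sim <- lm(y ~ x)
sim_summary <- summary(fit_sim)

sim_r2 <- sim_summary$r.squared                      # share of variance explained
sim_p  <- sim_summary$coefficients["x", "Pr(>|t|)"]  # p-value for the slope

sim_r2
sim_p
```

The slope is clearly distinguishable from zero, yet most of the variation in y remains unexplained, which is the same pattern seen in the popularity models.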

# Model 1: multiple regression of song popularity on instrumentalness and acousticness
fit_lm1 <- lm(track_popularity ~ instrumentalness + acousticness , data = spotify_mod)
summary(fit_lm1)

## 
## Call:
## lm(formula = track_popularity ~ instrumentalness + acousticness, 
##     data = spotify_mod)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.228 -17.778   3.166  18.507  59.724 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       38.7698     0.1860  208.49   <2e-16 ***
## instrumentalness -12.6629     0.5980  -21.18   <2e-16 ***
## acousticness       9.7042     0.6242   15.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.42 on 28349 degrees of freedom
## Multiple R-squared:  0.02384,    Adjusted R-squared:  0.02377 
## F-statistic: 346.1 on 2 and 28349 DF,  p-value: < 2.2e-16
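To make the coefficients concrete, the sketch below applies the fitted equation from the output above by hand. The helper predict_lm1 is a hypothetical name introduced for illustration, and the two input combinations are arbitrary examples. In practice predict(fit_lm1, newdata = ...) would give the same values from the unrounded coefficients.

```r
# Fitted equation of fit_lm1, using the rounded estimates printed above
predict_lm1 <- function(instrumentalness, acousticness) {
  38.7698 - 12.6629 * instrumentalness + 9.7042 * acousticness
}

# A non-instrumental track with acousticness of 0.5
pred_acoustic <- predict_lm1(instrumentalness = 0, acousticness = 0.5)
pred_acoustic      # 43.6219

# A fully instrumental track with acousticness of 0
pred_instrumental <- predict_lm1(instrumentalness = 1, acousticness = 0)
pred_instrumental  # 26.1069
```

Both predictions are within about 13 points of the intercept, while the residual standard error is 23.42, so the model separates tracks far less than the noise around its predictions.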

# Model 2: multiple regression of song popularity on energy and loudness
fit_lm2 <- lm(track_popularity ~ energy + loudness , data = spotify_mod)
summary(fit_lm2)

## 
## Call:
## lm(formula = track_popularity ~ energy + loudness, data = spotify_mod)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.091 -17.565   3.044  18.417  65.599 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  71.83742    1.06637   67.37   <2e-16 ***
## energy      -31.15559    1.03181  -30.20   <2e-16 ***
## loudness      1.57586    0.06236   25.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.31 on 28349 degrees of freedom
## Multiple R-squared:  0.03251,    Adjusted R-squared:  0.03244 
## F-statistic: 476.3 on 2 and 28349 DF,  p-value: < 2.2e-16

# Model 3: multiple regression of song popularity on all twelve predictors
fit_lm3 <- lm(
  track_popularity ~ energy + speechiness + loudness + acousticness +
    instrumentalness + danceability + liveness + valence + tempo +
    duration_ms + key + mode,
  data = spotify_mod
)
summary(fit_lm3)

## 
## Call:
## lm(formula = track_popularity ~ energy + speechiness + loudness + 
##     acousticness + instrumentalness + danceability + liveness + 
##     valence + tempo + duration_ms + key + mode, data = spotify_mod)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.610 -17.197   2.948  18.103  60.513 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6.783e+01  1.704e+00  39.808  < 2e-16 ***
## energy           -2.319e+01  1.220e+00 -19.011  < 2e-16 ***
## speechiness      -6.273e+00  1.380e+00  -4.547 5.47e-06 ***
## loudness          1.155e+00  6.527e-02  17.701  < 2e-16 ***
## acousticness      4.322e+00  7.465e-01   5.790 7.11e-09 ***
## instrumentalness -9.300e+00  6.254e-01 -14.871  < 2e-16 ***
## danceability      3.708e+00  1.072e+00   3.458 0.000546 ***
## liveness         -4.280e+00  8.990e-01  -4.761 1.94e-06 ***
## valence           1.784e+00  6.565e-01   2.718 0.006573 ** 
## tempo             2.596e-02  5.239e-03   4.955 7.29e-07 ***
## duration_ms      -4.341e-05  2.294e-06 -18.923  < 2e-16 ***
## key               4.677e-03  3.844e-02   0.122 0.903175    
## mode              8.572e-01  2.809e-01   3.052 0.002278 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.01 on 28339 degrees of freedom
## Multiple R-squared:  0.05804,    Adjusted R-squared:  0.05764 
## F-statistic: 145.5 on 12 and 28339 DF,  p-value: < 2.2e-16
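The adjusted R-squared and F-statistic reported for fit_lm3 follow directly from the multiple R-squared, the number of predictors, and the residual degrees of freedom. The sketch below reproduces both from the rounded values in the output above, using the standard formulas for ordinary least squares. The variable names are introduced here for illustration.

```r
# Values taken from the fit_lm3 summary above
r2       <- 0.05804   # multiple R-squared
p        <- 12        # number of predictors
resid_df <- 28339     # residual degrees of freedom
n        <- resid_df + p + 1   # number of observations: 28352

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / resid_df
round(adj_r2, 5)   # 0.05764

# Overall F-statistic: explained variance per predictor
# over unexplained variance per residual degree of freedom
f_stat <- (r2 / p) / ((1 - r2) / resid_df)
round(f_stat, 1)   # 145.5
```

With more than 28,000 observations the penalty for 12 predictors is negligible, so the adjusted value stays close to 0.058. The large F-statistic reflects the sample size rather than a strong fit.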