1. Introduction

1.1. Project Objectives

Analyze the Spotify database to:

  • Understand how song characteristics (e.g. danceability, liveness) might be associated with different song genres (e.g. pop, rock).
  • Create a model to predict song popularity based on song characteristics.

1.2. Plan to Deliver Against Project Objectives

  • Conduct preliminary data analyses to determine what cleaning steps, if any, are needed
  • Clean the data
  • Determine what variables might be correlated
  • Fit various types of models to the data, starting with the simplest models (e.g. multiple linear regression, trees)
  • Select best model
  • Re-evaluate variables to determine whether further data cleaning and/or collection should be recommended
  • Final summary and recommendations

1.3. Analysis and Modeling Proposal

This project will be executed in two major phases:

  • Phase 1: Analyze the data and look for associations between song characteristics and song genres & sub-genres. This will include data clean-up, data wrangling and data visualization.

  • Phase 2: Create models to predict song popularity based on the most relevant song characteristics identified in phase 1. This phase will include variable selection and evaluation of various model architectures (to be delivered on 8/13/21).

1.4. Expected Output

Learnings from these analyses and song popularity models will be used by the MakeYourSong (made up) start-up to guide its users on what song characteristics are likely to drive popularity. The predictive model will be available to users of the MakeYourSong start-up.

2. Packages Required

2.1. Packages Required

2.2. Messages and warnings resulting from loading the packages are suppressed

#install.packages("tidyverse")
#install.packages("dplyr")
#install.packages("ggplot2")
#install.packages("plotly")
#install.packages("corrplot")
library(tidyverse)
library(dplyr)
library(ggplot2)
library(plotly)
library(corrplot)

2.3. Package Short Description

tidyverse - for interacting with data through subsetting, transformation, visualization, etc.

dplyr - for data manipulation in R by combining, selecting, grouping, subsetting and transforming all or parts of dataset

ggplot2 - for declaratively creating graphics, based on The Grammar of Graphics

plotly - for creating interactive web-based graphs via the open source JavaScript graphing library plotly.js

corrplot - for visualizing correlation matrices and confidence intervals

3. Data Preparation

3.1. Data Source

The dataset is available on GitHub. Link to the data source is here.

3.2. Explanation of Data Source

The data to be analyzed is an excerpt of the Spotify database containing 32,833 rows. The data set of Spotify songs contains 23 variables and 32,833 songs from 1957-2020. There are 10,693 artists and 6 main genres, each with its own sub-genres. There are 12 audio features for each track, including confidence measures like acousticness, liveness, speechiness and instrumentalness, perceptual measures like energy, loudness, danceability and valence (positiveness), and descriptors like duration, tempo, key, and mode.

Genres were selected from Every Noise, a visualization of the Spotify genre-space maintained by a genre taxonomist. The top four sub-genres for each were used to query Spotify for 20 playlists each, resulting in about 5000 songs for each genre, split across a varied sub-genre space.

You can find the code for generating the dataset in spotify_dataset.R in the full GitHub repo.

3.3. Data Importing and Cleaning

# Code to import the data
spotify <- read.csv("C:/Users/king.nm/OneDrive - Procter and Gamble/UC/MS Business Analytics/Classes/Summer 2021/Data Wrangling BANA 7025 - Jun2021/Final Project/spotify.csv")

Identifying and reviewing the codebook

dictionary_spotify <- read.csv("C:/Users/king.nm/OneDrive - Procter and Gamble/UC/MS Business Analytics/Classes/Summer 2021/Data Wrangling BANA 7025 - Jun2021/Final Project/dictionary_spotify.csv")
# Code to view the Spotify codebook
# Use the knitr library to format the codebook table
library(knitr)
## Warning: package 'knitr' was built under R version 4.0.5
kable(dictionary_spotify[,], caption = "Spotify Codebook")
Spotify Codebook
variable class description
track_id character Song unique ID
track_name character Song Name
track_artist character Song Artist
track_popularity double Song Popularity (0-100) where higher is better
track_album_id character Album unique ID
track_album_name character Song album name
track_album_release_date character Date when album released
playlist_name character Name of playlist
playlist_id character Playlist ID
playlist_genre character Playlist genre
playlist_subgenre character Playlist subgenre
danceability double Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy double Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key double The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness double The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode double Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness double Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness double A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness double Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness double Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence double A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo double The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms double Duration of song in milliseconds

Assessing dimensions of the dataset

dim(spotify)
## [1] 28352    43

Check for duplicate rows or columns

# Checking to see whether there are songs with the same ID
length(unique(spotify$track_id))
## [1] 28352
# Creating a new file with unique songs
spotify_unique = spotify[!duplicated(spotify$track_id),]
str(spotify_unique)

# Shortening the name of spotify_unique to spotify only
spotify <- spotify_unique

# Checking whether the de-duplicated file contains only the 28352 unique songs counted above
str(spotify)
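
Because dim() and the count of unique track_id values both return 28,352, the file as loaded does not appear to contain repeated IDs, so the de-duplication step should leave it unchanged. The sketch below illustrates what that step does when duplicates are present. The data frame songs and its values are made up for illustration and are not taken from the Spotify file.

```r
# Toy data standing in for the Spotify file (hypothetical values)
songs <- data.frame(
  track_id = c("a1", "b2", "a1", "c3", "b2"),
  playlist_genre = c("pop", "rock", "edm", "rap", "latin"),
  stringsAsFactors = FALSE
)

# Number of rows whose track_id already appeared earlier in the data
n_dup <- sum(duplicated(songs$track_id))

# Keep only the first occurrence of each track_id
songs_unique <- songs[!duplicated(songs$track_id), ]

# After de-duplication the row count equals the number of distinct IDs
nrow(songs_unique) == length(unique(songs$track_id))
```

Note that keeping the first occurrence also keeps only the first genre listed for a song that appears in several playlists.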

Viewing the head and tail of the data

head(spotify, n=5)
tail(spotify, n=5)

Cleaning the Data (explanation of the data cleaning steps)

  • Identify missing data
  • Determine how to handle missing data
  • Look for outliers
  • Determine how to handle outliers
  • Review the frequency distribution of the variables

Identifying missing data

sum(is.na(spotify))
colSums(is.na(spotify))
# Eliminating missing data since there are not too many missing values
spotify <- na.omit(spotify)

# Checking whether missing data was omitted
str(spotify)
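
Before dropping rows, it can help to quantify how much data na.omit() would remove. The following is a minimal sketch on a made-up data frame (songs and its values are assumptions, not the Spotify data), using base R only.

```r
# Toy data with a few missing values (hypothetical)
songs <- data.frame(
  track_name = c("Song A", NA, "Song C", "Song D"),
  danceability = c(0.71, 0.55, NA, 0.62),
  energy = c(0.80, 0.43, 0.91, 0.66),
  stringsAsFactors = FALSE
)

# Missing values per column
na_per_column <- colSums(is.na(songs))

# Drop every row that has at least one missing value
n_before <- nrow(songs)
songs_complete <- na.omit(songs)

# How many rows, and what share of the data, were removed
rows_dropped <- n_before - nrow(songs_complete)
share_dropped <- rows_dropped / n_before
```

If the share dropped is small, as the comment in the code above suggests for this dataset, removing incomplete rows is a reasonable default.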

Computing summary statistics for the variables

summary(spotify)

Learn about the data visually by plotting:


Histograms of numeric variables

hist(spotify$danceability)
hist(spotify$energy)
hist(spotify$loudness)
hist(spotify$speechiness)
hist(spotify$acousticness)
hist(spotify$instrumentalness)
hist(spotify$liveness)
hist(spotify$valence)
hist(spotify$tempo)
hist(spotify$key)
hist(spotify$mode)
hist(spotify$track_popularity)
hist(spotify$duration_ms)

Tables for character variables

library(knitr)
kable(table(spotify$playlist_genre), align = "l", caption = "Playlist genre frequencies")
Playlist genre frequencies
Var1 Freq
edm 4877
latin 4136
pop 5132
r&b 4504
rap 5398
rock 4305
kable(table(spotify$playlist_subgenre),align = "l", caption = "Playlist sub-genre frequencies" )
Playlist sub-genre frequencies
Var1 Freq
album rock 1039
big room 1034
classic rock 1100
dance pop 1298
electro house 1416
electropop 1251
gangster rap 1314
hard rock 1202
hip hop 1296
hip pop 803
indie poptimism 1547
latin hip hop 1194
latin pop 1097
neo soul 1478
new jack swing 1036
permanent wave 964
pop edm 967
post-teen pop 1036
progressive electro house 1460
reggaeton 687
southern hip hop 1582
trap 1206
tropical 1158
urban contemporary 1187

Bar Plots for Song Genre

barplot(table(spotify$playlist_genre))

Box plots – looking for outliers

boxplot(spotify$track_popularity,xlab = "popularity")
boxplot(spotify$danceability,xlab = "danceability")
boxplot(spotify$duration_ms, xlab = "duration_ms")
boxplot(spotify$energy, xlab = "energy")
boxplot(spotify$loudness, xlab = "loudness")
boxplot(spotify$speechiness, xlab = "speechiness")
boxplot(spotify$acousticness, xlab = "acousticness")
boxplot(spotify$instrumentalness, xlab = "instrumentalness")
boxplot(spotify$liveness, xlab = "liveness")
boxplot(spotify$valence, xlab = "valence")
boxplot(spotify$tempo, xlab = "tempo")

The box plots show outliers for danceability, duration, energy, loudness, speechiness, acousticness, instrumentalness, liveness and tempo.
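
The points drawn beyond the whiskers of a box plot are those lying more than 1.5 times the interquartile range outside the hinges. A minimal sketch of that rule follows, using a made-up vector x in place of a Spotify variable. boxplot.stats() is the base R helper that boxplot() relies on.

```r
# Hypothetical right-skewed values, similar in shape to speechiness
x <- c(0.02, 0.03, 0.03, 0.04, 0.04, 0.05, 0.05, 0.06, 0.07, 0.45, 0.92)

# Lower and upper hinges (Tukey's version of the quartiles)
hinges <- fivenum(x)[c(2, 4)]

# Fences at 1.5 times the hinge spread
fence_low <- hinges[1] - 1.5 * diff(hinges)
fence_high <- hinges[2] + 1.5 * diff(hinges)

# Values flagged by the manual rule
flagged <- x[x < fence_low | x > fence_high]

# The same values as reported by the helper behind boxplot()
out_builtin <- boxplot.stats(x)$out
```

Counting length(boxplot.stats(x)$out) for each variable would turn the visual impression into a number that can be compared across variables.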

Number of Artists

length(unique(spotify$track_artist))
## [1] 10692

Number of Playlists IDs

length(unique(spotify$playlist_id))
## [1] 470

Number of Playlist Names

# Counting distinct playlist names (the 470 reported above counts playlist_id values)
length(unique(spotify$playlist_name))

Scatter plots to Look for Correlations Between Variables

plot(spotify$liveness, spotify$tempo)
plot(spotify$speechiness, spotify$liveness)
plot(spotify$liveness, spotify$track_popularity)
plot(spotify$energy, spotify$track_popularity)
plot(spotify$loudness, spotify$track_popularity)
plot(spotify$key, spotify$track_popularity)
plot(spotify$speechiness, spotify$track_popularity)

# Creating a subset of the data with numeric variables only to more easily check for correlations
library(tidyverse)
spotify_num <- select(spotify, track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, key, mode)

# Checking for variable correlations
spotify_corr <- cor(spotify_num)
corrplot(spotify_corr, type = "lower", tl.srt = 20)
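
A correlation plot is easier to act on when paired with a ranked list of the strongest pairs. The sketch below shows one way to build that list in base R. The data frame toy and its values are made up for illustration, and the same steps would apply to spotify_corr.

```r
# Hypothetical numeric features (not the Spotify data)
toy <- data.frame(
  energy = c(0.2, 0.4, 0.5, 0.7, 0.9, 0.6),
  loudness = c(-14, -10, -9, -6, -3, -7),
  acousticness = c(0.9, 0.6, 0.5, 0.2, 0.1, 0.3),
  tempo = c(120, 95, 140, 100, 128, 110)
)

corr <- cor(toy)

# Keep each pair once by blanking the diagonal and the upper triangle
corr[upper.tri(corr, diag = TRUE)] <- NA

# Reshape the matrix into one row per variable pair
corr_pairs <- as.data.frame(as.table(corr), stringsAsFactors = FALSE)
corr_pairs <- corr_pairs[!is.na(corr_pairs$Freq), ]
names(corr_pairs) <- c("var1", "var2", "r")

# Strongest relationships first, regardless of sign
corr_pairs <- corr_pairs[order(-abs(corr_pairs$r)), ]
corr_pairs
```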

Create the final clean file with numeric, genre and sub-genre variables that will be used for modeling

library(tidyverse)
spotify_m <- select(spotify, track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, key, mode, playlist_genre, playlist_subgenre)

3.4. Clean Data (show the data in the most condensed form possible)

Transforming genre and sub-genre variables into factors

# Code to transform the character variables into factors
spotify_m$playlist_genre <- as.factor(spotify_m$playlist_genre)
spotify_m$playlist_subgenre <- as.factor(spotify_m$playlist_subgenre)
# Checking whether the factors were created
str(spotify_m)
# Summary of the clean dataset
summary(spotify_m)

Learning: The variables speechiness, acousticness, instrumentalness and liveness are highly skewed, with a significant number of outliers. These variables will need to be analyzed to decide whether they should be part of the analyses and predictive model.

# Summarize the clean dataset using means
# Columns are referenced by name inside summarise(); the duplicated loud_mean entry was removed
summ1 <- summarise(spotify_m,
          popular_mean = mean(track_popularity, na.rm = TRUE),
          danceab_mean = mean(danceability, na.rm = TRUE),
          energ_mean = mean(energy, na.rm = TRUE),
          loud_mean = mean(loudness, na.rm = TRUE),
          speech_mean = mean(speechiness, na.rm = TRUE),
          acoust_mean = mean(acousticness, na.rm = TRUE),
          instr_mean = mean(instrumentalness, na.rm = TRUE),
          liven_mean = mean(liveness, na.rm = TRUE),
          valen_mean = mean(valence, na.rm = TRUE),
          tempo_mean = mean(tempo, na.rm = TRUE),
          key_mean = mean(key, na.rm = TRUE),
          mode_mean = mean(mode, na.rm = TRUE),
          n = n())
# Summarize the clean dataset using ranges
# range() returns c(min, max), so summ2 has two rows: minimums first, maximums second
summ2 <- summarise(spotify_m,
          popular_range = range(track_popularity, na.rm = TRUE),
          danceab_range = range(danceability, na.rm = TRUE),
          energ_range = range(energy, na.rm = TRUE),
          loud_range = range(loudness, na.rm = TRUE),
          speech_range = range(speechiness, na.rm = TRUE),
          acoust_range = range(acousticness, na.rm = TRUE),
          instr_range = range(instrumentalness, na.rm = TRUE),
          liven_range = range(liveness, na.rm = TRUE),
          valen_range = range(valence, na.rm = TRUE),
          tempo_range = range(tempo, na.rm = TRUE),
          key_range = range(key, na.rm = TRUE),
          mode_range = range(mode, na.rm = TRUE),
          n = n())
# Printing the two key summary tables
print(list(summ1, summ2))

3.5 Provide summary information about the variables of concern in your cleaned data set.

As shown below, the variables speechiness, acousticness, instrumentalness and liveness are highly skewed. In the case of instrumentalness, the median is practically zero (0.00002). The medians of the other three variables are also much closer to the minimum than to the maximum. These variables may need to be re-scaled or eliminated from the model.

# Variables of concern
spotify_conc <- select(spotify_m, speechiness, acousticness, instrumentalness, liveness)
summary(spotify_conc)
##   speechiness      acousticness    instrumentalness       liveness     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000   Min.   :0.0000  
##  1st Qu.:0.0410   1st Qu.:0.0143   1st Qu.:0.0000000   1st Qu.:0.0926  
##  Median :0.0626   Median :0.0797   Median :0.0000207   Median :0.1270  
##  Mean   :0.1079   Mean   :0.1772   Mean   :0.0911294   Mean   :0.1910  
##  3rd Qu.:0.1330   3rd Qu.:0.2600   3rd Qu.:0.0065725   3rd Qu.:0.2490  
##  Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000   Max.   :0.9960

4. Proposed Exploratory Data Analysis

4.1 Discuss how you plan to uncover new information in the data that is not self-evident. What are different ways you could look at this data to answer the questions you want to answer? Do you plan to slice and dice the data in different ways, create new variables, or join separate data frames to create new summary information? How could you summarize your data to answer key questions?

As part of the data analysis and modeling of this data, I looked at correlation, skewness, outliers and value frequency measures. I sliced the data into low (bottom quartile) and high (top quartile) popularity scores to try to explain song popularity.

4.2 What types of plots and tables will help you to illustrate the findings to your questions?

Correlation and clustered column plots helped me determine which song characteristics are more prevalent in the most popular songs.

Continuing the analysis with break-outs

spotify %>%
  group_by(playlist_genre) %>%
  summarise(
    mean_popul = mean(track_popularity, na.rm = TRUE),
    mean_liven = mean(liveness, na.rm = TRUE),
    mean_speech = mean(speechiness, na.rm = TRUE),
    mean_instr = mean(instrumentalness, na.rm = TRUE),
    mean_acoust = mean(acousticness, na.rm = TRUE),
    mean_loud = mean(loudness, na.rm = TRUE),
    mean_danc = mean(danceability, na.rm = TRUE),
    mean_energy = mean(energy, na.rm = TRUE),
    mean_valence = mean(valence, na.rm = TRUE),
    mean_durat = mean(duration_ms, na.rm = TRUE),
    mean_mode = mean(mode, na.rm = TRUE),
    mean_tempo = mean(tempo, na.rm = TRUE),
    mean_key = mean(key, na.rm = TRUE))%>%
  arrange(desc(mean_popul))
## # A tibble: 6 x 14
##   playlist_genre mean_popul mean_liven mean_speech mean_instr mean_acoust
##   <chr>               <dbl>      <dbl>       <dbl>      <dbl>       <dbl>
## 1 pop                  45.9      0.177      0.0742     0.0634      0.172 
## 2 rap                  41.8      0.191      0.197      0.0802      0.197 
## 3 latin                41.4      0.182      0.100      0.0526      0.213 
## 4 rock                 39.7      0.205      0.0579     0.0664      0.147 
## 5 r&b                  35.9      0.176      0.116      0.0285      0.264 
## 6 edm                  30.7      0.214      0.0879     0.245       0.0769
## # ... with 8 more variables: mean_loud <dbl>, mean_danc <dbl>,
## #   mean_energy <dbl>, mean_valence <dbl>, mean_durat <dbl>, mean_mode <dbl>,
## #   mean_tempo <dbl>, mean_key <dbl>
# Focusing the analysis on top quartile of popularity (goal is to determine whether the most popular songs have common characteristics)
topqtpop <-
spotify %>%
  filter(track_popularity > 58) %>%
  arrange(desc(track_popularity))
str(topqtpop)
# Summarizing the characteristics of the top-quartile (most popular) songs

t1 <- topqtpop %>%
  group_by(playlist_genre) %>%
  summarise(
    mean_popul1 = mean(track_popularity, na.rm = TRUE),
    mean_liven1 = mean(liveness, na.rm = TRUE),
    mean_speech1 = mean(speechiness, na.rm = TRUE),
    mean_instr1 = mean(instrumentalness, na.rm = TRUE),
    mean_acoust1 = mean(acousticness, na.rm = TRUE),
    mean_loud1 = mean(loudness, na.rm = TRUE),
    mean_danc1 = mean(danceability, na.rm = TRUE),
    mean_energy1 = mean(energy, na.rm = TRUE),
    mean_valence1 = mean(valence, na.rm = TRUE),
    mean_durat1 = mean(duration_ms, na.rm = TRUE),
    mean_mode1 = mean(mode, na.rm = TRUE),
    mean_tempo1 = mean(tempo, na.rm = TRUE),
    mean_key1 = mean(key, na.rm = TRUE))%>%
  arrange(desc(mean_popul1))
# Sort t1 by playlist_genre
t1 <- arrange(t1, playlist_genre)
# Focusing the analysis on bottom quartile of popularity (goal is to determine whether the least popular songs have common characteristics)
botqtpop <-
spotify %>%
  filter(track_popularity < 21) %>%
  arrange(desc(track_popularity))
str(botqtpop)
# Summarizing the characteristics of the bottom-quartile (least popular) songs

b1 <- botqtpop %>%
  group_by(playlist_genre) %>%
  summarise(
    mean_popul2 = mean(track_popularity, na.rm = TRUE),
    mean_liven2 = mean(liveness, na.rm = TRUE),
    mean_speech2 = mean(speechiness, na.rm = TRUE),
    mean_instr2 = mean(instrumentalness, na.rm = TRUE),
    mean_acoust2 = mean(acousticness, na.rm = TRUE),
    mean_loud2 = mean(loudness, na.rm = TRUE),
    mean_danc2 = mean(danceability, na.rm = TRUE),
    mean_energy2 = mean(energy, na.rm = TRUE),
    mean_valence2 = mean(valence, na.rm = TRUE),
    mean_durat2 = mean(duration_ms, na.rm = TRUE),
    mean_mode2 = mean(mode, na.rm = TRUE),
    mean_tempo2 = mean(tempo, na.rm = TRUE),
    mean_key2 = mean(key, na.rm = TRUE))%>%
  arrange(desc(mean_popul2))
# Sort b1 by playlist_genre
b1 <- arrange(b1, playlist_genre)
# Calculating the difference in percent between the most and least popular songs for all song characteristics
dif_db <- 100 * (t1[-1] - b1[-1]) / t1[-1]
# Optional: round the percent differences to whole numbers
# round(dif_db, 0)

# Adding the genre column (column 1 of b1) back as the last column
dif_db <- cbind(dif_db, b1[1])
# Renaming the columns
colnames(dif_db) <- c("popul", "liven", "speech", "instrum", "acoust", "loud", "dance", "energy", "valence", "durat", "mode", "tempo", "key", "genre")
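
The cut-offs 58 and 21 in the filters above are hard-coded. A sketch of how such cut-offs could be derived from the data, and of the percent-difference formula applied to a single characteristic, is shown below. The vectors pop and acoust are made up for illustration.

```r
# Hypothetical popularity scores and acousticness values
pop <- c(0, 5, 12, 20, 21, 35, 40, 47, 58, 59, 70, 85)
acoust <- c(0.10, 0.20, 0.15, 0.30, 0.20, 0.25, 0.10, 0.30, 0.20, 0.30, 0.20, 0.25)

# Quartile cut-offs computed from the data instead of typed in by hand
cuts <- quantile(pop, probs = c(0.25, 0.75))

# Logical flags for the top and bottom quartile of popularity
top <- pop > cuts[["75%"]]
bottom <- pop < cuts[["25%"]]

# Percent difference relative to the top-quartile mean, as in dif_db
pct_diff <- 100 * (mean(acoust[top]) - mean(acoust[bottom])) / mean(acoust[top])
pct_diff
```

Because the difference is divided by the top-quartile mean, the sign of the result flips for a characteristic whose mean is negative, such as loudness measured in dB.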

Conclusion #1: Song popularity seems to be associated with a higher level of acousticness and a lower level of instrumentalness, as shown in Chart #1 below.

library(tidyr)
library(ggplot2)
dat.g <- gather(dif_db[2:14], type, value, -genre)
ggplot(dat.g, aes(type, value)) + 
  geom_bar(aes(fill = genre), stat = "identity", position = "dodge") +
  geom_vline(xintercept = c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5, 11.5, 12.5, 13.5 )) +
  theme_get() +
  scale_x_discrete(name = "Musical Characteristics") +
  scale_y_continuous(name = "%") +
  ggtitle ("Chart #1: Comparing Most (top quartile) vs. Least (bottom quartile) Popular Songs", 
           subtitle = "% variation between top vs. bottom quartile of song popularity for 12 musical characteristics")

# Reduce dataset to include only relevant variables for visualization and modeling
most_pop <-
  select(spotify, track_artist, track_popularity, playlist_genre, playlist_subgenre, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_ms)
# Previewing the numeric variables rounded to 3 digits (the result is printed, not stored in most_pop)
library(dplyr)
most_pop %>% 
 mutate_if(is.numeric, round, digits = 3)
# Artists with the most popular tracks, sorted by mean popularity for each artist

library(dplyr)
chart2_db <- 
most_pop %>%
  group_by(track_artist)%>%
  summarize (
    mean_pop = mean(track_popularity, na.rm = TRUE),
    min_pop = min(track_popularity, na.rm = TRUE),
    max_pop = max(track_popularity, na.rm = TRUE),
    pop_range = (max(track_popularity, na.rm = TRUE) - min(track_popularity, na.rm = TRUE)),
    mean_acoust = mean(acousticness, na.rm = TRUE),
    mean_instrum = mean(instrumentalness, na.rm = TRUE) ,
    mean_energy = mean(energy, na.rm = TRUE),
    mean_loud = mean(loudness, na.rm = TRUE)) %>%
  mutate_if(is.numeric, round, digits = 2) %>%
  arrange(desc(mean_pop))
# Same per-artist summary, sorted by the widest popularity range for each artist

library(dplyr)
chart3_db <-
most_pop %>%
  group_by(track_artist)%>%
  summarize (
    mean_pop = mean(track_popularity, na.rm = TRUE),
    min_pop = min(track_popularity, na.rm = TRUE),
    max_pop = max(track_popularity, na.rm = TRUE),
    pop_range = (max(track_popularity, na.rm = TRUE) - min(track_popularity, na.rm = TRUE)),
    mean_acoust = mean(acousticness, na.rm = TRUE),
    mean_instrum = mean(instrumentalness, na.rm = TRUE),
    mean_energy = mean(energy, na.rm = TRUE),
    mean_loud = mean(loudness, na.rm = TRUE)) %>%
  mutate_if(is.numeric, round, digits = 2) %>%
  arrange(desc(pop_range))
# Creating dataset with the top 50 artists by mean song popularity
most_pop_50 <- head(chart2_db, n = 50)
str(most_pop_50)

Conclusion #2: Despite being the two most discriminating characteristics to explain song popularity, acousticness and instrumentalness are not consistent across the most popular artists (Table 1). According to Table 1, artists such as Trevor Daniel (ranked 1st in average popularity) and Roddy Ricch (ranked 8th in average popularity) have nearly the same mean acousticness (0.12 vs. 0.10) and the same mean instrumentalness (0.00) for their songs, yet the average popularity of their songs differs by more than 13 points (97.00 vs. 83.43). To reinforce the point that acousticness may not be driving song popularity, Table 1 also shows that artists with the same average popularity for their songs (79) have mean acousticness ranging from 0.00 to 0.75.

# Table showing the top 50 artists by average song popularity
# This shows that there is NOT a lot of similarity among song characteristics from the 50 most popular artists
# Popularity seems to be driven by variables not captured by this dataset

library(knitr)
kable(most_pop_50, caption = "Table 1: Top 50 Artists Ranked by the Average Popularity of All Their Songs")
Table 1: Top 50 Artists Ranked by the Average Popularity of All Their Songs
track_artist mean_pop min_pop max_pop pop_range mean_acoust mean_instrum mean_energy mean_loud
Trevor Daniel 97.00 97 97 0 0.12 0.00 0.43 -8.76
Y2K 91.00 91 91 0 0.18 0.00 0.39 -7.90
Don Toliver 87.50 83 92 9 0.41 0.00 0.70 -4.78
Kina 85.50 85 86 1 0.81 0.01 0.18 -17.63
JACKBOYS 84.33 82 87 5 0.07 0.00 0.62 -5.11
Dadá Boladão 84.00 84 84 0 0.25 0.00 0.55 -7.03
DaBaby 83.67 69 93 24 0.08 0.00 0.69 -4.86
Roddy Ricch 83.43 78 98 20 0.10 0.00 0.53 -8.15
Baby Keem 83.00 83 83 0 0.18 0.00 0.56 -7.82
Dayvi 83.00 83 83 0 0.01 0.47 0.97 -3.51
Internet Money 83.00 83 83 0 0.36 0.00 0.67 -6.66
Olivia Rodrigo 83.00 83 83 0 0.09 0.00 0.43 -6.58
Omar Montes 82.00 82 82 0 0.21 0.00 0.75 -4.70
Harry Styles 81.78 68 91 23 0.22 0.00 0.63 -5.06
YNW Melly 81.57 76 90 14 0.18 0.00 0.54 -8.02
Camilo 80.25 77 88 11 0.34 0.00 0.67 -4.33
Apache 207 80.00 79 81 2 0.08 0.13 0.79 -6.69
Kaash Paige 80.00 80 80 0 0.83 0.00 0.36 -10.60
Ludmilla 80.00 80 80 0 0.48 0.00 0.86 -4.28
RIN 80.00 80 80 0 0.30 0.00 0.60 -6.41
Lil Tjay 79.67 78 82 4 0.35 0.00 0.52 -9.66
Alex & Sierra 79.00 79 79 0 0.75 0.00 0.29 -8.55
BlocBoy JB 79.00 79 79 0 0.00 0.00 0.58 -7.50
Don Patricio 79.00 79 79 0 0.44 0.00 0.52 -10.48
Ghetto Kids 79.00 79 79 0 0.05 0.00 0.74 -4.54
K CAMP 79.00 79 79 0 0.01 0.00 0.56 -8.60
King Princess 79.00 79 79 0 0.65 0.00 0.54 -7.04
KSI 79.00 79 79 0 0.35 0.00 0.80 -4.96
PUBLIC 79.00 79 79 0 0.01 0.00 0.80 -4.45
Spice Girls 79.00 79 79 0 0.10 0.00 0.86 -6.14
Ali Gatie 78.50 68 89 21 0.37 0.00 0.46 -6.97
Lil Skies 78.50 76 81 5 0.39 0.00 0.55 -7.13
A Great Big World 78.00 78 78 0 0.86 0.00 0.15 -8.82
Diego & Arnaldo 78.00 78 78 0 0.05 0.00 0.94 -1.40
Dj Guuga 78.00 78 78 0 0.14 0.00 0.92 0.30
Ed Maverick 78.00 78 78 0 0.95 0.00 0.16 -14.46
Gradur 78.00 78 78 0 0.21 0.00 0.78 -3.38
Keala Settle 78.00 78 78 0 0.01 0.00 0.70 -7.28
Likybo 78.00 78 78 0 0.31 0.00 0.70 -9.48
MC Ingryd 78.00 78 78 0 0.60 0.00 0.56 -5.70
Ms Nina 78.00 78 78 0 0.05 0.00 0.90 -7.07
Peachy! 78.00 78 78 0 0.74 0.00 0.36 -12.02
Rocco Hunt 78.00 78 78 0 0.01 0.00 0.72 -6.06
Sub Urban 78.00 78 78 0 0.26 0.00 0.59 -1.86
Tate McRae 78.00 78 78 0 0.07 0.00 0.58 -6.03
WILLOW 78.00 78 78 0 0.04 0.00 0.70 -5.28
Rvssian 77.50 73 82 9 0.13 0.00 0.74 -4.88
Samra 77.50 75 80 5 0.22 0.00 0.70 -4.56
FINNEAS 77.00 77 77 0 0.80 0.00 0.41 -7.94
Loredana 77.00 77 77 0 0.16 0.00 0.53 -5.86
# Creating dataset with top 50 highest popularity ranges by artist
# This shows that there's a large popularity variation even for the same artist
most_range_50 <- head(chart3_db, n = 50)
str(most_range_50)
## tibble [50 x 9] (S3: tbl_df/tbl/data.frame)
##  $ track_artist: chr [1:50] "Maroon 5" "The Weeknd" "Post Malone" "The Black Eyed Peas" ...
##  $ mean_pop    : num [1:50] 42.4 47 57.5 43.8 67.6 ...
##  $ min_pop     : num [1:50] 0 0 1 1 0 0 0 0 2 3 ...
##  $ max_pop     : num [1:50] 98 98 98 96 94 93 93 93 93 94 ...
##  $ pop_range   : num [1:50] 98 98 97 95 94 93 93 93 91 91 ...
##  $ mean_acoust : num [1:50] 0.14 0.26 0.28 0.08 0.06 0.35 0.09 0.11 0.36 0.16 ...
##  $ mean_instrum: num [1:50] 0 0.01 0 0 0 0 0.01 0 0 0 ...
##  $ mean_energy : num [1:50] 0.7 0.62 0.63 0.74 0.62 0.61 0.57 0.7 0.71 0.61 ...
##  $ mean_loud   : num [1:50] -5.75 -7.13 -5.33 -5.53 -5.64 -6.41 -6.95 -5.59 -4.02 -6.41 ...

Conclusion #3: Table #2 (below) shows that there’s a wide variation in song popularity for the same artist. That is, the data shows that songs by the same artist can have popularity scores ranging from zero to 98.

# Table showing the top 50 highest popularity ranges among artists 
# This shows that song popularity varies widely even among songs from the same artist
library(knitr)
kable(most_range_50, caption = "Table 2: Top 50 Biggest Popularity Variation for the Same Artist")
Table 2: Top 50 Biggest Popularity Variation for the Same Artist
track_artist mean_pop min_pop max_pop pop_range mean_acoust mean_instrum mean_energy mean_loud
Maroon 5 42.41 0 98 98 0.14 0.00 0.70 -5.75
The Weeknd 46.98 0 98 98 0.26 0.01 0.62 -7.13
Post Malone 57.52 1 98 97 0.28 0.00 0.63 -5.33
The Black Eyed Peas 43.79 1 96 95 0.08 0.00 0.74 -5.53
Travis Scott 67.58 0 94 94 0.06 0.00 0.62 -5.64
Bad Bunny 57.97 0 93 93 0.35 0.00 0.61 -6.41
Future 53.63 0 93 93 0.09 0.01 0.57 -6.95
Selena Gomez 59.56 0 93 93 0.11 0.00 0.70 -5.59
Anuel AA 48.53 2 93 91 0.36 0.00 0.71 -4.02
blackbear 64.90 3 94 91 0.16 0.00 0.61 -6.41
Ed Sheeran 65.06 0 91 91 0.27 0.00 0.65 -6.05
J Balvin 47.77 0 91 91 0.16 0.00 0.75 -4.95
Justin Bieber 64.88 4 95 91 0.27 0.00 0.68 -5.78
Rauw Alejandro 60.33 1 92 91 0.31 0.00 0.69 -4.67
Shawn Mendes 54.33 2 93 91 0.18 0.00 0.73 -5.12
Tyga 45.12 0 91 91 0.12 0.00 0.63 -6.21
Ariana Grande 53.40 0 90 90 0.21 0.00 0.60 -5.96
Daddy Yankee 46.65 0 90 90 0.13 0.00 0.83 -4.76
Mustard 38.82 0 90 90 0.18 0.00 0.64 -6.36
Nicky Jam 53.63 0 90 90 0.20 0.00 0.72 -5.29
Juice WRLD 57.57 3 92 89 0.22 0.00 0.60 -6.53
Marshmello 57.46 0 89 89 0.07 0.10 0.83 -3.63
Dalex 62.30 3 91 88 0.55 0.00 0.62 -6.02
DJ Snake 41.92 0 88 88 0.08 0.08 0.79 -4.67
Drake 41.81 0 88 88 0.14 0.01 0.55 -7.57
Ellie Goulding 50.74 0 88 88 0.16 0.01 0.76 -5.02
Halsey 51.50 1 88 87 0.15 0.00 0.70 -4.87
Imagine Dragons 38.91 1 88 87 0.14 0.02 0.71 -5.70
Jonas Brothers 63.11 0 87 87 0.02 0.00 0.75 -4.86
Lady Gaga 48.80 1 88 87 0.09 0.00 0.76 -4.63
Red Velvet 52.83 0 87 87 0.15 0.00 0.84 -3.43
Sech 52.75 1 88 87 0.17 0.00 0.71 -3.99
XXXTENTACION 71.22 0 87 87 0.24 0.00 0.61 -6.34
Regard 57.00 8 94 86 0.06 0.00 0.76 -6.31
Stormzy 62.91 2 88 86 0.25 0.00 0.70 -5.19
Taylor Swift 59.62 0 86 86 0.12 0.00 0.67 -6.17
BTS 53.27 0 85 85 0.06 0.00 0.76 -4.80
Calvin Harris 51.14 0 85 85 0.09 0.06 0.85 -4.07
Chris Brown 51.36 0 85 85 0.22 0.00 0.57 -5.74
Conan Gray 52.33 0 85 85 0.32 0.00 0.57 -7.06
J. Cole 50.80 1 86 85 0.30 0.02 0.53 -9.88
John Legend 30.28 0 85 85 0.27 0.00 0.63 -7.06
The Chainsmokers 49.23 0 85 85 0.11 0.00 0.73 -5.75
Arcangel 45.05 0 84 84 0.19 0.01 0.71 -5.76
Avicii 41.90 0 84 84 0.08 0.08 0.78 -4.96
Lauv 64.80 1 85 84 0.29 0.00 0.54 -6.87
Major Lazer 30.30 0 84 84 0.06 0.05 0.80 -4.74
Meek Mill 49.67 1 85 84 0.12 0.00 0.72 -4.44
Myke Towers 66.82 0 84 84 0.30 0.00 0.70 -4.32
Queen 42.40 0 84 84 0.29 0.02 0.62 -8.39

Conclusion #4: Chart #2 visualizes the variation in song popularity for the same artist. The chart includes data for the artists with the 50 highest average popularity scores. Even for the most popular artists, there’s a large variation in popularity for their songs. This suggests that the artist’s name is not a main driver of popularity as measured by this dataset.

chart2 <-
  most_pop_50 %>%
  ggplot(aes(x = mean_pop, y = pop_range, color = mean_acoust)) +
  geom_point() +
  theme_get() +
  scale_x_continuous(name = "Song Popularity for the Top 50 Most Popular Artists") +
  scale_y_continuous(name = "Popularity Range by Artist") +
  ggtitle("Chart #2: Top 50 Most Popular Artists",
          subtitle = "Popularity Range (Max Pop - Min Pop) for All Songs from Same Artist")
  
chart2

Conclusion #5: Chart #3 (below) shows that song popularity varies considerably more for artists that are less popular. In this case, the popularity range for songs from the same artist can reach 96 points (one song with a popularity of 0 and another with a popularity of 96).

chart3 <-
  most_range_50 %>%
  ggplot(aes(x = mean_pop, y = pop_range, color = mean_acoust)) +
  geom_point() +
  theme_get() +
  scale_x_continuous(name = "Mean Popularity by Artist") +
  scale_y_continuous(name = "Top 50 Highest Popularity Range by Artist") +
  ggtitle("Chart #3: Top 50 Popularity Range for Songs from the Same Artist",
          subtitle = "Popularity Range (Max - Min)")
  
chart3
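
The per-artist columns plotted in Charts #2 and #3 (mean_pop, pop_range, mean_acoust) can be derived with a grouped summary. The sketch below is one possible way to do it, not necessarily how most_pop_50 and most_range_50 were built. It runs on a small made-up data frame, and the grouping column name track_artist is an assumption.

```r
library(dplyr)

# Toy data for illustration only; these are not rows from the Spotify data set.
# `track_artist` is an assumed column name.
toy <- data.frame(
  track_artist     = c("A", "A", "A", "B", "B"),
  track_popularity = c(0, 50, 96, 60, 70),
  acousticness     = c(0.10, 0.20, 0.30, 0.50, 0.70)
)

artist_summary <- toy %>%
  group_by(track_artist) %>%
  summarise(
    mean_pop    = mean(track_popularity),
    pop_range   = max(track_popularity) - min(track_popularity),  # Max - Min
    mean_acoust = mean(acousticness),
    .groups     = "drop"
  ) %>%
  slice_max(pop_range, n = 50)  # keep the (up to) 50 widest ranges, widest first

artist_summary
```

Artist "A" has songs at 0 and 96, so its pop_range is 96, which is the situation described in Conclusion #5.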

# Creating a smaller data set for modeling
spotify_mod <- spotify %>% select(23:24, 30:43)
spotify_mod

Modeling

Conclusion #6: Popularity scores in this data set are explained by neither artist name nor musical characteristics. With data this variable and inconsistent, it’s highly unlikely that a useful model to predict song popularity can be created without additional information. Below are three multiple linear regression models. Each is statistically significant overall (p-value < 2.2e-16), yet the best one reaches an R-squared of only 0.058, that is, it explains about 5.8% of the variance in track popularity.

# Model 1: predict song popularity from instrumentalness and acousticness
fit_lm1 <- lm(track_popularity ~ instrumentalness + acousticness , data = spotify_mod)
summary(fit_lm1)
## 
## Call:
## lm(formula = track_popularity ~ instrumentalness + acousticness, 
##     data = spotify_mod)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.228 -17.778   3.166  18.507  59.724 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       38.7698     0.1860  208.49   <2e-16 ***
## instrumentalness -12.6629     0.5980  -21.18   <2e-16 ***
## acousticness       9.7042     0.6242   15.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.42 on 28349 degrees of freedom
## Multiple R-squared:  0.02384,    Adjusted R-squared:  0.02377 
## F-statistic: 346.1 on 2 and 28349 DF,  p-value: < 2.2e-16
# Model 2: predict song popularity from energy and loudness
fit_lm2 <- lm(track_popularity ~ energy + loudness , data = spotify_mod)
summary(fit_lm2)
## 
## Call:
## lm(formula = track_popularity ~ energy + loudness, data = spotify_mod)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.091 -17.565   3.044  18.417  65.599 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  71.83742    1.06637   67.37   <2e-16 ***
## energy      -31.15559    1.03181  -30.20   <2e-16 ***
## loudness      1.57586    0.06236   25.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.31 on 28349 degrees of freedom
## Multiple R-squared:  0.03251,    Adjusted R-squared:  0.03244 
## F-statistic: 476.3 on 2 and 28349 DF,  p-value: < 2.2e-16
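
To make the energy and loudness coefficients concrete, the sketch below applies them by hand to a single track. The coefficients are copied from the output above; the track values (energy 0.80, loudness -5) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(fit_lm2) output.
coefs <- c(intercept = 71.83742, energy = -31.15559, loudness = 1.57586)

# Hypothetical track (illustrative values, not a row from the data set).
track <- c(energy = 0.80, loudness = -5)

# Linear predictor: intercept + b_energy * energy + b_loudness * loudness
pred <- coefs[["intercept"]] +
  coefs[["energy"]]   * track[["energy"]] +
  coefs[["loudness"]] * track[["loudness"]]

round(pred, 2)  # about 39.03
```

The same number should come out of predict(fit_lm2, newdata = data.frame(energy = 0.80, loudness = -5)), up to rounding of the printed coefficients. With a residual standard error of 23.31, a point prediction of about 39 says very little about any individual song.
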
# Model 3: predict song popularity from all twelve song characteristics
fit_lm3 <- lm(track_popularity ~ energy + speechiness + loudness + acousticness + instrumentalness + danceability + liveness + valence + tempo + duration_ms + key + mode, data = spotify_mod)
summary(fit_lm3)
## 
## Call:
## lm(formula = track_popularity ~ energy + speechiness + loudness + 
##     acousticness + instrumentalness + danceability + liveness + 
##     valence + tempo + duration_ms + key + mode, data = spotify_mod)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.610 -17.197   2.948  18.103  60.513 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6.783e+01  1.704e+00  39.808  < 2e-16 ***
## energy           -2.319e+01  1.220e+00 -19.011  < 2e-16 ***
## speechiness      -6.273e+00  1.380e+00  -4.547 5.47e-06 ***
## loudness          1.155e+00  6.527e-02  17.701  < 2e-16 ***
## acousticness      4.322e+00  7.465e-01   5.790 7.11e-09 ***
## instrumentalness -9.300e+00  6.254e-01 -14.871  < 2e-16 ***
## danceability      3.708e+00  1.072e+00   3.458 0.000546 ***
## liveness         -4.280e+00  8.990e-01  -4.761 1.94e-06 ***
## valence           1.784e+00  6.565e-01   2.718 0.006573 ** 
## tempo             2.596e-02  5.239e-03   4.955 7.29e-07 ***
## duration_ms      -4.341e-05  2.294e-06 -18.923  < 2e-16 ***
## key               4.677e-03  3.844e-02   0.122 0.903175    
## mode              8.572e-01  2.809e-01   3.052 0.002278 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.01 on 28339 degrees of freedom
## Multiple R-squared:  0.05804,    Adjusted R-squared:  0.05764 
## F-statistic: 145.5 on 12 and 28339 DF,  p-value: < 2.2e-16
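
The three fits are easier to compare side by side than by reading each summary. The helper below, compare_r2, is a hypothetical function (not part of any package) that collects R-squared and adjusted R-squared from a named list of lm objects. It is demonstrated on simulated toy data so that it runs on its own.

```r
# Hypothetical helper: one row of fit statistics per model in a named list.
compare_r2 <- function(models) {
  data.frame(
    model         = names(models),
    n_predictors  = vapply(models, function(m) length(coef(m)) - 1L, integer(1)),
    r_squared     = vapply(models, function(m) summary(m)$r.squared, numeric(1)),
    adj_r_squared = vapply(models, function(m) summary(m)$adj.r.squared, numeric(1)),
    row.names     = NULL
  )
}

# Simulated data for illustration only; not the Spotify data set.
set.seed(1)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- 2 * toy$x1 + rnorm(200)

fits <- list(
  small = lm(y ~ x1, data = toy),
  full  = lm(y ~ x1 + x2, data = toy)
)

res <- compare_r2(fits)
res
```

For the models above, the call would be compare_r2(list(fit_lm1 = fit_lm1, fit_lm2 = fit_lm2, fit_lm3 = fit_lm3)). Going by the printed summaries, that should list R-squared values of 0.02384, 0.03251 and 0.05804: even with all twelve predictors, about 94% of the variance in track popularity stays unexplained.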