Final Project: Spotify Data

Introduction
Packages Required
Data Preparation
Exploratory Data Analysis
- Analysis plans
- Data Visualizations
Summary

Introduction

This project explores the Spotify Songs data set to find relationships between track popularity and track attributes such as danceability, tempo, and genre.

Track popularity will be measured against danceability, artist, genre, subgenre, and energy. A track’s popularity may be affected by its musical components and also by the information surrounding the track. For a thorough search, track popularity will be checked against the popularity of the artist, album, and genre.

The exploration is conducted in R to show each step in the process: data cleaning, exploration, visualization, and a brief analysis.

The following questions will be explored:

How does genre or subgenre affect track popularity?
How does danceability affect track popularity?
How do a track’s musical factors influence its popularity? (Regression)

Once the analysis is complete, music companies could use the findings to identify the ideal factors for creating smash-hit songs.

Packages Required

The following packages are needed to properly analyze the Spotify data set:

tidyverse: This package contains a collection of packages used for data preparation.
ggplot2: For visualizing data.
dplyr: For data manipulation.
ggpubr: For displaying graphs from ggplot2.
ggcorrplot: To display correlation matrix with ggplot2.

Both ggplot2 and dplyr are part of tidyverse, but the code to load them is shown explicitly:

library(tidyverse) # general data preparation
library(ggplot2) # visualization
library(dplyr) # data manipulation
library(ggpubr) # to show ggplot2 neatly
library(ggcorrplot) #for correlation matrix visualization

Data Preparation

The Spotify Songs data set is first read into a data frame named spotify_songs_df.

spotify_songs_df <- read_csv("spotify_songs.csv")

The value types, data frame size, missing values, and the general data will be explored in the following code:

dim(spotify_songs_df) # check row # and col #

## [1] 32833    23

str(spotify_songs_df) # Check variable type

## tibble [32,833 x 23] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ track_id                : chr [1:32833] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr [1:32833] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr [1:32833] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr [1:32833] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr [1:32833] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr [1:32833] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr [1:32833] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr [1:32833] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr [1:32833] "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr [1:32833] "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num [1:32833] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num [1:32833] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : num [1:32833] 6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num [1:32833] -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : num [1:32833] 1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num [1:32833] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num [1:32833] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num [1:32833] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num [1:32833] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num [1:32833] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num [1:32833] 122 100 124 122 124 ...
##  $ duration_ms             : num [1:32833] 194754 162600 176616 169093 189052 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   track_id = col_character(),
##   ..   track_name = col_character(),
##   ..   track_artist = col_character(),
##   ..   track_popularity = col_double(),
##   ..   track_album_id = col_character(),
##   ..   track_album_name = col_character(),
##   ..   track_album_release_date = col_character(),
##   ..   playlist_name = col_character(),
##   ..   playlist_id = col_character(),
##   ..   playlist_genre = col_character(),
##   ..   playlist_subgenre = col_character(),
##   ..   danceability = col_double(),
##   ..   energy = col_double(),
##   ..   key = col_double(),
##   ..   loudness = col_double(),
##   ..   mode = col_double(),
##   ..   speechiness = col_double(),
##   ..   acousticness = col_double(),
##   ..   instrumentalness = col_double(),
##   ..   liveness = col_double(),
##   ..   valence = col_double(),
##   ..   tempo = col_double(),
##   ..   duration_ms = col_double()
##   .. )

colSums(is.na(spotify_songs_df)) # check # of missing values

##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

The output shows 32833 observations with 23 variables each in spotify_songs_df. Of those variables, 10 are character and 13 are double. There is also a total of 15 missing values.
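The type breakdown can also be tallied in code rather than read off the str() output. The block below is a minimal sketch on a small stand-in data frame; toy_df and its values are hypothetical and only mimic a few columns of the real data.

```r
# Toy stand-in for spotify_songs_df (hypothetical values, not the real data)
toy_df <- data.frame(
  track_name = c("Song A", "Song B", "Song C"),
  playlist_genre = c("pop", "rap", "rock"),
  track_popularity = c(66, 70, 41),
  danceability = c(0.748, 0.675, 0.512),
  stringsAsFactors = FALSE
)

# Storage type of every column, then a count of columns per type
col_types <- vapply(toy_df, typeof, character(1))
type_counts <- table(col_types)
print(type_counts)
```

Run against the real data frame, the same two lines should reproduce the 10 character and 13 double columns listed by str().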

Missing Values

These missing values are distributed as follows:

track_artist: 5 missing values
track_name: 5 missing values
track_album_name: 5 missing values

To decide whether to revise or remove the missing values, the incomplete records are collected in a new data frame called spotify_songs_missing_df.

spotify_songs_missing_df <- spotify_songs_df[!complete.cases(spotify_songs_df),] # data frame of missing values
sum(is.na(spotify_songs_df)) # to check if all missing values are accounted for

## [1] 15

view(spotify_songs_missing_df) # view actual values

The output shows that all 15 missing values come from 5 incomplete records, all with a track popularity of 0. The records barely provide enough information to fill in the missing values with a manual search. The missing values are all character values, so replacing them with an average would not make sense given what the variables mean.
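The same checks can be made in the console without opening the viewer. The sketch below uses a hypothetical toy data frame (toy_songs) to show how incomplete records might be summarized by column and by popularity; the names and values are illustrative only.

```r
# Toy data with two incomplete records (hypothetical values)
toy_songs <- data.frame(
  track_name = c("Song A", NA, "Song C", NA),
  track_artist = c("Artist X", NA, "Artist Y", NA),
  track_popularity = c(66, 0, 70, 0),
  stringsAsFactors = FALSE
)

# Keep only the rows that contain at least one NA
incomplete <- toy_songs[!complete.cases(toy_songs), ]

# Missing values per column within the incomplete rows
na_per_column <- colSums(is.na(incomplete))
print(na_per_column)

# Popularity values of the incomplete rows, to see whether they are all 0
print(table(incomplete$track_popularity))
```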

Since these missing values only affect 5 observations, they will be removed for this analysis. Losing 5 of 32833 records will not have a significant impact.

spotify_songs_df <- spotify_songs_df[complete.cases(spotify_songs_df), ] # keep complete observations
sum(is.na(spotify_songs_df)) #check for remaining missing

## [1] 0

There are now 0 missing values.

Data Revisions

Now that there are no more missing values, the data set will be manipulated to have the following attributes:

Only unique tracks
Ordered from highest popularity to lowest
Includes only track_name, track_artist, track_popularity, playlist_genre, playlist_subgenre, and all of the numeric variables.

The code script below alters the data set to match those attributes and also checks the current status of the updated data set.

spotify_songs_df <- spotify_songs_df %>%
  select(track_name, track_artist, track_popularity, playlist_genre,
    playlist_subgenre, danceability, energy, key, loudness,
    mode, speechiness, acousticness, instrumentalness,
    liveness, valence, tempo, duration_ms)  %>% # keep certain variables
  distinct(track_name, .keep_all = TRUE) %>% # include unique tracks only
  arrange(desc(track_popularity)) # order highest popularity first
  

dim(spotify_songs_df) # check row # and col #

## [1] 23449    17

Now the data has been altered to answer this project’s key questions.
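One caveat of deduplicating on track_name alone is that different songs sharing a title are collapsed into one row. The sketch below illustrates the difference on hypothetical toy data using base R duplicated(); the song and artist names are made up.

```r
# Toy data: two different songs share a title, one song appears twice
toy_tracks <- data.frame(
  track_name = c("Song A", "Song A", "Song B", "Song B"),
  track_artist = c("Artist X", "Artist Y", "Artist Z", "Artist Z"),
  track_popularity = c(97, 90, 98, 98),
  stringsAsFactors = FALSE
)

# Unique by name only: the Artist Y song is dropped along with the true duplicate
by_name <- toy_tracks[!duplicated(toy_tracks$track_name), ]

# Unique by name and artist: only the true duplicate is dropped
key_cols <- toy_tracks[, c("track_name", "track_artist")]
by_name_artist <- toy_tracks[!duplicated(key_cols), ]

print(nrow(by_name))         # 2
print(nrow(by_name_artist))  # 3
```

In the dplyr pipeline above, the equivalent change would be distinct(track_name, track_artist, .keep_all = TRUE). Whether that is preferable depends on how same-titled tracks should be treated, and it would change the row count reported here.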

Cleaned Dataset

After this cleaning process, the first 10 observations in the data set can be observed with the following code:

head(spotify_songs_df, 10)

## # A tibble: 10 x 17
##    track_name track_artist track_popularity playlist_genre playlist_subgen~
##    <chr>      <chr>                   <dbl> <chr>          <chr>           
##  1 Tusa       KAROL G                    98 pop            dance pop       
##  2 Memories   Maroon 5                   98 pop            dance pop       
##  3 Blinding ~ The Weeknd                 98 pop            dance pop       
##  4 The Box    Roddy Ricch                98 rap            hip hop         
##  5 everythin~ Billie Eili~               97 pop            dance pop       
##  6 Don't Sta~ Dua Lipa                   97 pop            post-teen pop   
##  7 Falling    Trevor Dani~               97 pop            electropop      
##  8 RITMO (Ba~ The Black E~               96 pop            dance pop       
##  9 bad guy    Billie Eili~               95 pop            dance pop       
## 10 Yummy      Justin Bieb~               95 pop            dance pop       
## # ... with 12 more variables: danceability <dbl>, energy <dbl>, key <dbl>,
## #   loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## #   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## #   duration_ms <dbl>

The following table lists the variables with explanations based on the data dictionary provided with the data set.

Variable Name	Data Type	Missing Values	Explanation
`track_name`	character	None	Song name
`track_artist`	character	None	Song artist
`track_popularity`	double	None	Song popularity from 0 - 100 where higher values mean a more popular track.
`playlist_genre`	character	None	Playlist genre
`playlist_subgenre`	character	None	Playlist subgenre
`danceability`	double	None	0 - 1.0 scale of how suitable a track is to dance to.
`energy`	double	None	0 - 1.0 scale giving a perceptual measure of a track’s activity and intensity where higher values are more energetic.
`key`	double	None	The estimated overall key of a track. Integers map to pitches using pitch class notation.
`loudness`	double	None	The average loudness of a track measured in decibels (dB).
`mode`	double	None	Indicates whether a track is in a major or minor key, with major equal to 1 and minor equal to 0.
`speechiness`	double	None	0 - 1.0 scale of how much of a track consists of spoken words where higher values likely indicate a voice recording.
`acousticness`	double	None	0 - 1.0 scale that measures how likely a track is to be acoustic.
`instrumentalness`	double	None	0 - 1.0 scale that measures how likely a track is to contain no vocals where 1.0 is a track without vocals.
`liveness`	double	None	Detects likelihood of the track having an audience in the recording.
`valence`	double	None	0 - 1.0 scale that measures the positiveness conveyed by the track where 1 is positive and 0 is negative.
`tempo`	double	None	Average estimated beats per minute (BPM) for a track.
`duration_ms`	double	None	Song duration in milliseconds.
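Two of these variables are easier to read after a small conversion. The sketch below maps key integers to pitch class names and converts duration_ms to minutes. The helper key_to_pitch and the lookup vector are hypothetical names introduced for illustration, and the handling of out-of-range values (returned as NA) is an assumption.

```r
# Standard pitch class notation: 0 = C, 1 = C#/Db, ..., 11 = B
pitch_classes <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                   "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

# Hypothetical helper: translate key integers to note names.
# match() returns NA for anything outside 0-11, which then indexes to NA.
key_to_pitch <- function(key) {
  pitch_classes[match(key, 0:11)]
}

# First three key values shown in the str() output above
first_keys <- key_to_pitch(c(6, 11, 1))
print(first_keys)  # "F#/Gb" "B" "C#/Db"

# Duration of the first track in the str() output, in minutes
duration_min <- 194754 / 60000
print(round(duration_min, 2))  # 3.25
```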

Exploratory Data Analysis

Using the revised data set, the key questions mentioned previously will be explored with different graphs to gain new insights. Any additional changes to the data set will be conducted inside each individual analysis going forward.

Analysis plans

Several types of visualization forms will be used to learn more about track popularity. The following questions will be explored with the various plot types:

Track Popularity and Genre/Subgenre: Box plots
Track Popularity and Danceability: Bar plot categorized by genre and subgenre.
Track Popularity Regression: Correlation matrix and linear regression models.

Visualizing these 3 relationships will hopefully yield valuable insights to what makes a track popular.

Data Visualizations

Track Popularity and Genre

Track popularity will be compared against playlist genre and playlist subgenre using box plots. The code below creates the first box plot:

#genre_popularity_box
spotify_songs_df %>%
  ggplot(aes(x = playlist_genre ,y = track_popularity, fill =playlist_genre)) + 
  geom_boxplot()+ 
  scale_y_continuous(name = "Track Popularity", labels = scales::comma) +
  scale_x_discrete(name = "Playlist Genre") +
  ggtitle("Track Popularity based on Playlist Genre")

The box plot shows that pop, latin, and rap are the 3 genres with the highest median track popularity (the center line of each box). Pop and latin have the highest boxes, while edm has the box with the lowest values. Even so, every genre has whiskers reaching close to both 100 and 0. The plot shows that genre influences popularity but is not the sole factor in determining it.
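A box plot draws medians, so a ranking of genres by "average" popularity is easier to confirm with a numeric summary. The sketch below computes both the median and the mean per genre with tapply() on hypothetical toy values; toy_genres stands in for spotify_songs_df.

```r
# Toy data (hypothetical popularity scores for two genres)
toy_genres <- data.frame(
  playlist_genre = c("pop", "pop", "pop", "edm", "edm", "edm"),
  track_popularity = c(60, 70, 95, 10, 30, 35),
  stringsAsFactors = FALSE
)

# Median and mean popularity per genre
genre_median <- tapply(toy_genres$track_popularity, toy_genres$playlist_genre, median)
genre_mean <- tapply(toy_genres$track_popularity, toy_genres$playlist_genre, mean)

genre_summary <- data.frame(
  genre = names(genre_median),
  median = as.numeric(genre_median),
  mean = as.numeric(genre_mean)
)

# Highest median first
genre_summary <- genre_summary[order(-genre_summary$median), ]
print(genre_summary)
```

The median and mean can rank genres differently when a genre has a long tail of very popular or very unpopular tracks, so it is worth reporting which one a claim is based on.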

The next box plot will compare popularity against subgenre with the following code:

#subgenre_popularity_box
spotify_songs_df %>%
  ggplot(aes(x = playlist_genre ,y = track_popularity, fill =playlist_subgenre)) + 
  geom_boxplot()+ 
  scale_y_continuous(name = "Track Popularity", labels = scales::comma) +
  scale_x_discrete(name = "Playlist Genre") +
  ggtitle("Track Popularity based on Playlist Subgenre")

The chart gives a more detailed look at how playlist genres relate to popularity once subgenres are accounted for. Post-teen pop, hip-hop, and progressive electro house have the highest average popularity. Something unusual is that hip-pop tracks in r&b playlists do better on average than hip-hop tracks in rap playlists. However, hip-hop tracks in rap playlists have several high-popularity outliers. So how a track is categorized by both playlist genre and subgenre can affect the way it is perceived by its audience.

Track Popularity and Danceability

Track popularity will be viewed against danceability in two different bar plots. The first is categorized by playlist genre to see how the main genres relate to popularity and danceability. The second is categorized by playlist subgenre instead.

The following code creates the first bar plot:

#genre_popularity_hist
# geom_bar() counts rows, so bar height is the number of tracks at each
# danceability value; a y argument placed outside aes() is ignored.
spotify_songs_df %>%
  ggplot(aes(x = danceability, fill = playlist_genre)) +
  geom_bar() +
  scale_x_continuous(name = "Danceability", labels = scales::comma) +
  scale_y_continuous(name = "Number of Tracks") +
  ggtitle("Track Count and Track Danceability by Playlist Genre")

Because geom_bar() counts rows, the bar heights in this chart are the number of tracks at each danceability value, not popularity scores. The distribution peaks toward the right: danceability values from about 0.69 to 0.75 each cover more than 50 tracks, and several values cover more than 75.

It is also apparent that tracks belonging to edm, latin, and pop playlists most frequently have higher danceability scores.

Based on this chart, most tracks sit in the upper-middle part of the danceability scale, and danceability is most prominent in edm, latin, and pop playlists. The chart alone cannot show whether those tracks are also more popular; the regression models later in this section test that relationship directly.
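To relate danceability to popularity directly, popularity has to be summarized within danceability ranges. The sketch below bins danceability into tenths with cut() and computes the mean popularity and track count per bin. The data is simulated with no built-in relationship, so only the mechanics carry over; toy_dance and the bin width of 0.1 are assumptions.

```r
set.seed(42)

# Simulated stand-in for spotify_songs_df (hypothetical, no real relationship)
toy_dance <- data.frame(
  danceability = runif(500),
  track_popularity = round(runif(500, min = 0, max = 100))
)

# Assign each track to a danceability bin of width 0.1
toy_dance$dance_bin <- cut(toy_dance$danceability,
                           breaks = seq(0, 1, by = 0.1),
                           include.lowest = TRUE)

# Mean popularity per bin (what a popularity bar plot would need)
bin_summary <- aggregate(track_popularity ~ dance_bin, data = toy_dance, FUN = mean)

# Track count per bin (what geom_bar() shows by default)
bin_counts <- table(toy_dance$dance_bin)

print(bin_summary)
print(bin_counts)
```

In ggplot2, a comparable chart could be drawn by mapping the bin to x and track_popularity to y inside aes() and using geom_bar(stat = "summary", fun = "mean").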

The bar plot below takes a deeper look by breaking danceability down by playlist subgenre:

#subgenre_popularity_hist
# As with geom_bar() in general, bar height is the number of tracks at each
# danceability value; a y argument placed outside aes() is ignored.
spotify_songs_df %>%
  ggplot(aes(x = danceability, fill = playlist_subgenre)) +
  geom_bar() +
  scale_x_continuous(name = "Danceability", labels = scales::comma) +
  scale_y_continuous(name = "Number of Tracks") +
  ggtitle("Track Count and Track Danceability by Playlist Subgenre")

Similar patterns appear here. geom_bar() again plots track counts, which are highest when danceability is relatively high and drop off past about 0.75, so most tracks have a moderate-to-high danceability mix rather than an extreme one.

In this bar plot, album rock, big room, classic rock, and dance pop appear as the top 4 subgenres among tracks with high danceability.

Track Popularity Regression

To see statistically which numeric variables affect track popularity, a correlation matrix and multivariate regression models will be used.

The numeric variables in this data set will be examined with a correlation matrix. The following code creates the matrix:
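The chunk that produced the matrix does not appear in this write-up. What follows is a minimal sketch of how such a matrix could be built with cor() and drawn with ggcorrplot(), not the original code. The toy data is simulated, loudness is deliberately constructed to correlate with energy, and the plot is skipped if ggcorrplot is not installed.

```r
set.seed(1)
n <- 200
energy <- runif(n)

# Simulated numeric columns (hypothetical values, not the real data)
toy_numeric <- data.frame(
  track_popularity = round(runif(n, min = 0, max = 100)),
  danceability = runif(n),
  energy = energy,
  loudness = -12 + 8 * energy + rnorm(n)  # built to correlate with energy
)

# Pearson correlation matrix; every entry lies between -1 and 1
corr_matrix <- round(cor(toy_numeric), 2)
print(corr_matrix)

# Draw the lower triangle with the coefficients printed in each cell
if (requireNamespace("ggcorrplot", quietly = TRUE)) {
  corr_plot <- ggcorrplot::ggcorrplot(corr_matrix, type = "lower", lab = TRUE)
  print(corr_plot)
}
```

On the real data, the input to cor() would be the numeric columns of spotify_songs_df only, since cor() cannot handle character columns.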

Based on this matrix, only a few variables have some correlation between each other. The following are key takeaways from the matrix:

Mode and liveness have a high correlation of 0.93.
Loudness and key have a high correlation of 0.82.
Mode, key, acousticness, tempo, and instrumentalness have the most correlations with other variables.

These findings show that some variables are positively correlated with others. There are also variables, like key or instrumentalness, that correlate strongly with many other variables.

For a better grasp on the relationship between track popularity and numeric track factors, 3 different linear regression models will be made.

Model 1

This simple model further explores the relationship between track popularity and danceability.

\[ \text{track popularity} = \beta_1 + \beta_2 \cdot \text{danceability} \]

Model 2

This multivariate linear model compares only danceability, energy, and key against track popularity.

\[ \text{track popularity} = \beta_1 + \beta_2 \cdot \text{danceability} + \beta_3 \cdot \text{energy} + \beta_4 \cdot \text{key} \]

Model 3

This multivariate linear model compares track popularity to every numeric variable in the Spotify data set.

\[ \begin{aligned} \text{track popularity} ={}& \beta_1 + \beta_2 \cdot \text{danceability} + \beta_3 \cdot \text{energy} + \beta_4 \cdot \text{key} + \beta_5 \cdot \text{loudness} + \beta_6 \cdot \text{mode} \\ &+ \beta_7 \cdot \text{speechiness} + \beta_8 \cdot \text{acousticness} + \beta_9 \cdot \text{instrumentalness} \\ &+ \beta_{10} \cdot \text{liveness} + \beta_{11} \cdot \text{valence} + \beta_{12} \cdot \text{tempo} + \beta_{13} \cdot \text{duration ms} \end{aligned} \]

The code below creates the regression models and prints a summary of the coefficients for each:

#A simple regression model of track popularity and danceability
linearMod1 <- lm(track_popularity ~ danceability, data=spotify_songs_df)

#A multivariate linear regression model against danceability, energy, and key
linearMod2 <- lm(track_popularity ~ danceability + energy + 
                   key, data=spotify_songs_df)

#A linear multivariate regression model with all numeric spotify data 
linearMod3 <- lm(track_popularity ~ danceability + energy + key +
                loudness + mode + speechiness + acousticness +
                instrumentalness +liveness + valence +
                tempo + duration_ms, data=spotify_songs_df)

# Checks significance of each variable for every model
summary(linearMod1)
summary(linearMod2)
summary(linearMod3)

Several variables in these regressions turn out to be significant for a track’s popularity. The following table lists the coefficients for each model and marks those significant at the 95% confidence level with a *:

Variable Name	Model 1	Model 2	Model 3
danceability	8.6956*	7.51518*	4.524*
energy		-12.80025*	-23.54*
key		-0.02244	0.03034
loudness			1.183*
mode			1.118*
speechiness			-5.755*
acousticness			4.310*
instrumentalness			-9.760*
liveness			-3.961*
valence			2.330*
tempo			0.02647*
duration_ms			-4.440e-05*
intercept	34.0454	43.85666	67.66
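A table like this can be assembled from the model summaries instead of being copied by hand. The sketch below pulls estimates and p-values out of summary() for a small simulated model and flags coefficients with p < 0.05. The data, the model toy_fit, and the column names of coef_table are hypothetical.

```r
set.seed(7)
n <- 300

# Simulated data: popularity depends on danceability but not on key
toy_reg <- data.frame(
  danceability = runif(n),
  key = sample(0:11, n, replace = TRUE)
)
toy_reg$track_popularity <- 35 + 20 * toy_reg$danceability + rnorm(n, sd = 5)

toy_fit <- lm(track_popularity ~ danceability + key, data = toy_reg)

# coef(summary()) returns estimate, standard error, t value, and p-value
coef_table <- as.data.frame(coef(summary(toy_fit)))
names(coef_table) <- c("estimate", "std_error", "t_value", "p_value")

# Star the coefficients that are significant at the 5% level
coef_table$significant <- ifelse(coef_table$p_value < 0.05, "*", "")
print(coef_table)
```

Applied to linearMod1, linearMod2, and linearMod3, the same steps would give one such table per model, which could then be merged by variable name.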

Across all three models, the danceability coefficient was significant, showing that the variable has an impact on track popularity. Even more interesting, every coefficient was significant at the 95% confidence level except key. According to these models, key does not matter for determining popularity.

All coefficients in the final model were positive except for energy, speechiness, instrumentalness, liveness, and duration. The negative coefficients for both speechiness and instrumentalness suggest that songs need a balance of the two, and that genres relying heavily on either factor might have lower popularity scores.

This simple analysis reveals much about the significance of different musical factors, but a more complex model is needed that takes into account correlation between variables and weighted coefficients. A non-linear model may also benefit this analysis.
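One common way to check how much correlated predictors affect a linear model is the variance inflation factor (VIF), defined for each predictor as 1 / (1 - R²) from regressing it on the other predictors. The sketch below computes it in base R on simulated data; vif_manual is a hypothetical helper, and the toy loudness column is deliberately built from energy so that both receive a high VIF.

```r
set.seed(3)
n <- 200
energy <- runif(n)

# Simulated predictors: loudness is constructed from energy, danceability is independent
toy_pred <- data.frame(
  danceability = runif(n),
  energy = energy,
  loudness = -12 + 8 * energy + rnorm(n)
)

# Hypothetical helper: VIF of each predictor given the others
vif_manual <- function(data, predictors) {
  vapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  }, numeric(1))
}

vifs <- vif_manual(toy_pred, c("danceability", "energy", "loudness"))
print(round(vifs, 2))
```

Values close to 1 indicate a predictor that is nearly independent of the others, and larger values indicate coefficients whose estimates are less stable. The car package provides vif() for fitted models as an alternative to this hand-rolled version.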

Summary

The purpose of this project was to explore how track_popularity relates to playlist genre and to the musical factors of a track. This initial purpose was divided into questions about the impact of genre, subgenre, danceability, and numeric musical factors on track popularity. To answer these questions, the Spotify data set was ranked by popularity and reduced to complete records and relevant variables. The data was then used to create 5 different visualizations meant to provide insights on the above questions.

Creating those plots allowed for some interesting points to be discovered:

Every playlist genre and subgenre had tracks with high popularity
Pop, latin, and r&b genres had the highest average track popularity
Post-teen pop, hip-hop, and progressive electro house subgenres had the highest average popularity
Overall, popular tracks have a fairly high level of danceability
EDM, latin, and pop most frequently combine high danceability with high popularity
There’s a fine mix of danceability needed for a track to be popular
Album rock, big room, classic rock, and dance pop are the top 4 subgenres where a mix of high danceability results in high popularity.

The linear regression analysis also yielded great insight:

Mode, key, acousticness, tempo, and instrumentalness have the most correlations with other numeric music variables
Mode and liveness are highly correlated
Loudness and key are highly correlated
Nearly all numeric variables are significant to track popularity
Key is not significant in determining track popularity
Energy, speechiness, instrumentalness, liveness, and duration are the only negative coefficients.

These insights from both the visual and regression analysis can be valuable to music companies. Based on the results, music companies can focus on creating pop, r&b, or latin tracks that can also be categorized as post-teen pop, the highest rated subgenre. These companies can also balance a track’s danceability depending on its genre to create popular songs to dance to. However, popular tracks can be made without a large focus on danceability. The linear regression models show that paying attention to the musical factors is crucial to a track’s popularity, and further modeling can narrow down which factors have the biggest impact. This could be extended to account for the different genres as well. Overall, making music associated with the most popular genres and subgenres, while taking the significant musical factors into account, is likely to result in popular tracks.

Although this analysis offers some insight for music companies, it does have some limitations. This project only fit linear regressions and could benefit greatly from more complex models. It also did not explore how popular artists influence track popularity or whether track popularity correlates with monetary gains for music companies. These gaps could be further explored with a more complex analysis in the future. Despite these limitations, this analysis still provides interesting insights into how music companies could create a formulaic approach to making smash-hit songs.