This project explores the Spotify Songs data set to find relationships between track popularity and track features such as danceability, tempo, and genre.
Track popularity will be measured against danceability, artist, genre, subgenre, and energy. A track’s popularity may be affected both by its musical components and by the information surrounding the track. For a thorough search, track popularity will be checked against the popularity of the artist, album, and genre.
This exploration will be conducted in R to show each step in the process: data cleaning, exploration, visualization, and a brief analysis.
The following questions will be explored:
After this analysis is completed, the findings could be used by music companies to identify the factors that help create smash-hit songs.
The following packages are needed to properly analyze the Spotify data set:
Both ggplot2 and dplyr are part of the tidyverse, but the code to load them is shown explicitly:
library(tidyverse) #general data preparation
library(ggplot2) # visualization
library(dplyr) # manipulate data
library(ggpubr) # to show ggplot2 neatly
library(ggcorrplot) #for correlation matrix visualization
The Spotify Songs data set will first be loaded into a data frame named spotify_songs_df.
spotify_songs_df <- read_csv("spotify_songs.csv")
The value types, data frame size, missing values, and the general data will be explored in the following code:
dim(spotify_songs_df) # check row # and col #
## [1] 32833 23
str(spotify_songs_df) # Check variable type
## tibble [32,833 x 23] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ track_id : chr [1:32833] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr [1:32833] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr [1:32833] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr [1:32833] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr [1:32833] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr [1:32833] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr [1:32833] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr [1:32833] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr [1:32833] "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr [1:32833] "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num [1:32833] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num [1:32833] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : num [1:32833] 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num [1:32833] -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : num [1:32833] 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num [1:32833] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num [1:32833] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num [1:32833] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num [1:32833] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num [1:32833] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num [1:32833] 122 100 124 122 124 ...
## $ duration_ms : num [1:32833] 194754 162600 176616 169093 189052 ...
## - attr(*, "spec")=
## .. cols(
## .. track_id = col_character(),
## .. track_name = col_character(),
## .. track_artist = col_character(),
## .. track_popularity = col_double(),
## .. track_album_id = col_character(),
## .. track_album_name = col_character(),
## .. track_album_release_date = col_character(),
## .. playlist_name = col_character(),
## .. playlist_id = col_character(),
## .. playlist_genre = col_character(),
## .. playlist_subgenre = col_character(),
## .. danceability = col_double(),
## .. energy = col_double(),
## .. key = col_double(),
## .. loudness = col_double(),
## .. mode = col_double(),
## .. speechiness = col_double(),
## .. acousticness = col_double(),
## .. instrumentalness = col_double(),
## .. liveness = col_double(),
## .. valence = col_double(),
## .. tempo = col_double(),
## .. duration_ms = col_double()
## .. )
colSums(is.na(spotify_songs_df)) # check # of missing values
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
This output shows that spotify_songs_df holds 32,833 observations of 23 variables. Each variable is either character or numeric (double) in type. There are also 15 missing values in total.
These missing values are distributed as follows:
- track_name: 5 missing values
- track_artist: 5 missing values
- track_album_name: 5 missing values

To determine whether to revise or remove the missing values, a new data frame called spotify_songs_missing_df will be created to inspect the incomplete records.
spotify_songs_missing_df <- spotify_songs_df[!complete.cases(spotify_songs_df),] # data frame of missing values
sum(is.na(spotify_songs_df)) # to check if all missing values are accounted for
## [1] 15
view(spotify_songs_missing_df) # view actual values
This shows that all 15 missing values come from 5 incomplete records, all with a track popularity of 0. The records also provide too little information to fill in the missing values through a manual search. The missing values are all character values, so imputing them with an average would not make sense.
Since these missing values only affect 5 observations, they will be removed for this analysis. Dropping 5 of 32,833 records will not have a significant impact.
spotify_songs_df <- spotify_songs_df[complete.cases(spotify_songs_df), ] # keep complete observations
sum(is.na(spotify_songs_df)) #check for remaining missing
## [1] 0
There are now 0 missing values.
Now that there are no more missing values, the data set will be reduced to the following attributes:
track_name, track_artist, track_popularity, playlist_genre, playlist_subgenre, and all of the numeric variables.
The code below selects those attributes, keeps one row per track name, sorts by popularity, and checks the dimensions of the updated data set.
spotify_songs_df <- spotify_songs_df %>%
select(track_name, track_artist, track_popularity, playlist_genre,
playlist_subgenre, danceability, energy, key, loudness,
mode, speechiness, acousticness, instrumentalness,
liveness, valence, tempo, duration_ms) %>% # keep certain variables
distinct(track_name, .keep_all = TRUE) %>% # include unique tracks only
arrange(desc(track_popularity)) # order highest popularity first
dim(spotify_songs_df) # check row # and col #
## [1] 23449 17
Now the data has been altered to answer this project’s key questions.
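One detail of the de-duplication step is worth illustrating: `distinct()` keeps the first row it encounters for each key, so the row order at that point decides which duplicate survives. The sketch below uses a small made-up data frame (the track names, artists, and popularity values are hypothetical) and keys on both `track_name` and `track_artist`, which is an assumption rather than what the cleaning code above does. It shows how sorting by popularity first would keep the most popular copy of a repeated track.

```r
library(dplyr)

# Toy data (hypothetical values): the same track appears twice
toy_df <- tibble(
  track_name       = c("Song A", "Song A", "Song B"),
  track_artist     = c("Artist 1", "Artist 1", "Artist 2"),
  track_popularity = c(40, 85, 60)
)

# distinct() keeps the first row seen for each key,
# so here the less popular copy of "Song A" survives
first_seen <- toy_df %>%
  distinct(track_name, track_artist, .keep_all = TRUE)

# Sorting before de-duplicating keeps the most popular copy instead
most_popular <- toy_df %>%
  arrange(desc(track_popularity)) %>%
  distinct(track_name, track_artist, .keep_all = TRUE)

first_seen$track_popularity    # popularity of the rows kept without sorting
most_popular$track_popularity  # popularity of the rows kept after sorting
```

Keying on the artist as well as the name avoids merging different songs that happen to share a title. Whether that matters for this data set would need to be checked against the real records.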
After this cleaning process, the first 10 observations in the data set can be observed with the following code:
head(spotify_songs_df, 10)
## # A tibble: 10 x 17
## track_name track_artist track_popularity playlist_genre playlist_subgen~
## <chr> <chr> <dbl> <chr> <chr>
## 1 Tusa KAROL G 98 pop dance pop
## 2 Memories Maroon 5 98 pop dance pop
## 3 Blinding ~ The Weeknd 98 pop dance pop
## 4 The Box Roddy Ricch 98 rap hip hop
## 5 everythin~ Billie Eili~ 97 pop dance pop
## 6 Don't Sta~ Dua Lipa 97 pop post-teen pop
## 7 Falling Trevor Dani~ 97 pop electropop
## 8 RITMO (Ba~ The Black E~ 96 pop dance pop
## 9 bad guy Billie Eili~ 95 pop dance pop
## 10 Yummy Justin Bieb~ 95 pop dance pop
## # ... with 12 more variables: danceability <dbl>, energy <dbl>, key <dbl>,
## # loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## # instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## # duration_ms <dbl>
The following table describes each variable, based on the data dictionary provided with the data set.
| Variable Name | Data Type | Missing Values | Explanation |
|---|---|---|---|
| track_name | character | None | Song name |
| track_artist | character | None | Song artist |
| track_popularity | double | None | Song popularity from 0 - 100, where a larger number is better. |
| playlist_genre | character | None | Playlist genre |
| playlist_subgenre | character | None | Playlist subgenre |
| danceability | double | None | 0 - 1.0 scale of how suitable a track is for dancing. |
| energy | double | None | 0 - 1.0 scale of a track’s perceptual intensity and activity, where higher values are more energetic. |
| key | double | None | The estimated overall key of a track. Integers map to pitches using pitch class notation (e.g. 0 = C). |
| loudness | double | None | The average loudness of a track measured in decibels (dB). |
| mode | double | None | Indicates whether a track is in a major or minor key, with major equal to 1 and minor equal to 0. |
| speechiness | double | None | 0 - 1.0 scale of how much of a track consists of spoken words, where higher values are more likely to be a voice recording. |
| acousticness | double | None | 0 - 1.0 scale that measures how likely a track is to be acoustic. |
| instrumentalness | double | None | 0 - 1.0 scale that measures how likely a track is to contain no vocals, where 1.0 is a track without vocals. |
| liveness | double | None | Likelihood that an audience is present in the recording. |
| valence | double | None | 0 - 1.0 scale that measures the positiveness conveyed by the track, where 1 is positive and 0 is negative. |
| tempo | double | None | Average estimated beats per minute (BPM) for a track. |
| duration_ms | double | None | Song duration in milliseconds. |
Using the revised data set, the key questions mentioned previously will be explored with different graphs to gain new insights. Any additional changes to the data set will be conducted inside each individual analysis going forward.
Several types of visualization will be used to learn more about track popularity. The following questions will be explored with the various plot types:
Visualizing these 3 relationships will hopefully yield valuable insights to what makes a track popular.
Track Popularity and Genre
Track popularity will be compared against playlist genre and playlist subgenre using box plots. The code below creates the first box plot:
#genre_popularity_box
spotify_songs_df %>%
ggplot(aes(x = playlist_genre ,y = track_popularity, fill =playlist_genre)) +
geom_boxplot()+
scale_y_continuous(name = "Track Popularity", labels = scales::comma) +
scale_x_discrete(name = "Playlist Genre") +
ggtitle("Track Popularity based on Playlist Genre")
The box plot shows that pop, latin, and rap are the 3 genres with the highest median track popularity. Pop and latin have the highest boxes, while edm has the box with the lowest values. Despite these differences, every genre has whiskers that reach close to both 100 and 0. The plot suggests that genre influences popularity but is not the sole factor determining it.
The next box plot will compare popularity against subgenre with the following code:
#subgenre_popularity_box
spotify_songs_df %>%
ggplot(aes(x = playlist_genre ,y = track_popularity, fill =playlist_subgenre)) +
geom_boxplot()+
scale_y_continuous(name = "Track Popularity", labels = scales::comma) +
scale_x_discrete(name = "Playlist Genre") +
ggtitle("Track Popularity based on Playlist Subgenre")
The chart gives a more detailed look at how playlist genres relate to popularity once subgenres are taken into account. Post-teen pop, hip hop, and progressive electro house have the highest median popularity. Something unusual is that hip pop tracks in r&b playlists do better on average than hip hop tracks in rap playlists. However, hip hop tracks in rap playlists include several high-popularity outliers. How a track is categorized by playlist genre and subgenre may therefore affect how it is perceived by its audience.
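Reading medians off a crowded box plot can be error-prone, so a numeric summary is a useful companion. The following is a minimal sketch on made-up data (all values are hypothetical). It assumes the same column names as spotify_songs_df and could be pointed at the real data frame instead of `toy_df`.

```r
library(dplyr)

# Toy data (hypothetical values) with the same column names as spotify_songs_df
toy_df <- tibble(
  playlist_genre    = c("pop", "pop", "pop", "rap", "rap", "edm"),
  playlist_subgenre = c("dance pop", "dance pop", "electropop",
                        "hip hop", "hip hop", "big room"),
  track_popularity  = c(70, 50, 90, 40, 60, 30)
)

# Median and mean popularity for each genre/subgenre pair, highest median first
genre_summary <- toy_df %>%
  group_by(playlist_genre, playlist_subgenre) %>%
  summarise(
    n_tracks          = n(),
    median_popularity = median(track_popularity),
    mean_popularity   = mean(track_popularity),
    .groups = "drop"
  ) %>%
  arrange(desc(median_popularity))

genre_summary
```

Reporting the median next to the mean makes it clear which statistic a ranking is based on. The line inside each box of a box plot is the median, not the mean.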
Track Popularity and Danceability
Danceability will be examined in two bar plots. The first is colored by playlist genre to show how the main genres are spread across danceability values. The second is colored by playlist subgenre.
The following code creates the first bar plot:
#genre_danceability_bar
spotify_songs_df %>%
ggplot(aes(x = danceability, fill = playlist_genre)) + # a y argument outside aes() is ignored, so none is given
geom_bar() + # geom_bar() counts rows, so bar height is the number of tracks
scale_x_continuous(name = "Danceability", labels = scales::comma) +
scale_y_continuous(name = "Number of Tracks") +
ggtitle("Track Danceability by Playlist Genre")
Because geom_bar() counts rows, the height of each bar is the number of tracks at that danceability value, not a popularity rating. The distribution peaks toward the right: danceability values from about 0.69 to 0.75 are the most common, each shared by more than 50 tracks, and a few values are shared by more than 75 tracks.
It is also apparent that tracks belonging to edm, latin, and pop playlists most frequently have higher danceability scores.
Based on this chart, most tracks in the data set have danceability scores around the 3rd quartile of the scale, and high danceability is especially common in edm, latin, and pop playlists. The chart on its own does not show how danceability relates to popularity. That relationship is tested with the regression models later in this analysis.
The bar plot below takes a closer look by coloring the same distribution by playlist subgenre:
#subgenre_danceability_bar
spotify_songs_df %>%
ggplot(aes(x = danceability, fill = playlist_subgenre)) +
geom_bar() + # bar height is the number of tracks at each danceability value
scale_x_continuous(name = "Danceability", labels = scales::comma) +
scale_y_continuous(name = "Number of Tracks") +
ggtitle("Track Danceability by Playlist Subgenre")
Because geom_bar() counts rows, the bars again show the number of tracks rather than popularity. Most tracks have relatively high danceability, and the number of tracks drops off past a danceability of about 0.75. This chart on its own cannot show whether a particular danceability mix makes a smash hit.
Album rock, big room, classic rock, and dance pop are the 4 subgenres that stand out most in the stacked bars. A direct comparison of danceability and popularity within each subgenre would be needed before concluding that high danceability leads to high popularity for them.
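One way to relate danceability to popularity directly is to bin danceability and average the popularity within each bin. The sketch below does this on made-up data (all values are hypothetical). The bin width of 0.2 is an arbitrary choice for illustration, not something taken from the analysis.

```r
library(dplyr)

# Toy data (hypothetical values) with the same column names as spotify_songs_df
toy_df <- tibble(
  danceability     = c(0.35, 0.45, 0.62, 0.68, 0.71, 0.79, 0.88),
  track_popularity = c(20, 30, 50, 60, 70, 40, 30)
)

# Cut danceability into equal-width bins, then summarise popularity per bin
binned <- toy_df %>%
  mutate(dance_bin = cut(danceability,
                         breaks = seq(0, 1, by = 0.2),
                         include.lowest = TRUE)) %>%
  group_by(dance_bin) %>%
  summarise(
    n_tracks        = n(),
    mean_popularity = mean(track_popularity),
    .groups = "drop"
  )

binned
```

Plotting the result with `ggplot(binned, aes(x = dance_bin, y = mean_popularity)) + geom_col()` would give a bar chart whose height really is popularity. Keeping `n_tracks` alongside the mean shows how many tracks each bar is based on.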
Track Popularity Regression
To see statistically which numeric variables relate to track popularity, a correlation matrix and multivariate regression models will be used.
The numeric variables in this data set will first be examined with a correlation matrix. The following code creates the matrix:
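A minimal sketch of how such a matrix could be built is given here. It uses only the first eight rows printed in the `str()` output earlier, so the resulting correlations are illustrative and will not match those of the full data set. The `ggcorrplot()` call is optional and runs only if that package is installed.

```r
# First eight rows of four numeric columns, copied from the str() output above,
# plus a character column to show that non-numeric columns must be dropped
toy_df <- data.frame(
  track_popularity = c(66, 67, 70, 60, 69, 67, 62, 69),
  danceability     = c(0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.449, 0.542),
  energy           = c(0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.856, 0.903),
  valence          = c(0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.152, 0.367),
  track_name       = paste("Track", 1:8)
)

# Keep only the numeric columns, then compute pairwise Pearson correlations
numeric_df <- toy_df[sapply(toy_df, is.numeric)]
corr_mat   <- round(cor(numeric_df, method = "pearson"), 2)
corr_mat

# Optional heat map of the lower triangle with the coefficients printed in each cell
if (requireNamespace("ggcorrplot", quietly = TRUE)) {
  corr_plot <- ggcorrplot::ggcorrplot(corr_mat, type = "lower", lab = TRUE)
}
```

On the cleaned data frame, the numeric columns could be selected in the same spirit with `select(where(is.numeric))` before calling `cor()`.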
Based on this matrix, only a few variables show some correlation with each other. The following are key takeaways from the matrix:
These findings show that some variables are positively correlated with others. There are also variables, like key or instrumentalness, that correlate strongly with many other variables.
For a better grasp of the relationship between track popularity and the numeric track features, 3 different linear regression models will be fit.
Model 1
This simple model further explores the relationship between track popularity and danceability.
\[ \text{track popularity} = \beta_1 + \beta_2 \cdot \text{danceability} \]
Model 2
This multivariate linear model compares only danceability, energy, and key against track popularity.
\[ \text{track popularity} = \beta_1 + \beta_2 \cdot \text{danceability} + \beta_3 \cdot \text{energy} + \beta_4 \cdot \text{key} \]
Model 3
This multivariate linear model compares track popularity to every numeric variable in the Spotify data set.
\[ \begin{aligned} \text{track popularity} = {} & \beta_1 + \beta_2 \cdot \text{danceability} + \beta_3 \cdot \text{energy} + \beta_4 \cdot \text{key} + \beta_5 \cdot \text{loudness} + \beta_6 \cdot \text{mode} \\ & + \beta_7 \cdot \text{speechiness} + \beta_8 \cdot \text{acousticness} + \beta_9 \cdot \text{instrumentalness} \\ & + \beta_{10} \cdot \text{liveness} + \beta_{11} \cdot \text{valence} + \beta_{12} \cdot \text{tempo} + \beta_{13} \cdot \text{duration ms} \end{aligned} \]
The code below fits the regression models and prints a summary of the coefficients for each:
#A simple regression model of track popularity and danceability
linearMod1 <- lm(track_popularity ~ danceability, data=spotify_songs_df)
#A multivariate linear regression model against danceability, energy, and key
linearMod2 <- lm(track_popularity ~ danceability + energy +
key, data=spotify_songs_df)
#A linear multivariate regression model with all numeric spotify data
linearMod3 <- lm(track_popularity ~ danceability + energy + key +
loudness + mode + speechiness + acousticness +
instrumentalness +liveness + valence +
tempo + duration_ms, data=spotify_songs_df)
# Checks significance of each variable for every model
summary(linearMod1)
summary(linearMod2)
summary(linearMod3)
These regressions identify several variables that are significantly related to a track’s popularity. The following table lists the coefficient estimates for each model and marks with a * those that are significant at the 95% confidence level:
| Variable Name | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| danceability | 8.6956* | 7.51518* | 4.524e+00* |
| energy | | -12.80025* | -2.354e+01* |
| key | | -0.02244 | 3.034e-02 |
| loudness | | | 1.183e+00* |
| mode | | | 1.118e+00* |
| speechiness | | | -5.755e+00* |
| acousticness | | | 4.310e+00* |
| instrumentalness | | | -9.760e+00* |
| liveness | | | -3.961e+00* |
| valence | | | 2.330e+00* |
| tempo | | | 2.647e-02* |
| duration ms | | | -4.440e-05* |
| intercept | 34.0454 | 43.85666 | 6.766e+01 |
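A significance table like this could be assembled programmatically from the coefficient matrix that `summary()` returns for an `lm` fit, rather than copied by hand. The sketch below is an illustration only. It fits a small model on the first eight rows printed in the `str()` output earlier, so its estimates and significance flags will not match the table above.

```r
# First eight rows of three columns, copied from the str() output above
toy_df <- data.frame(
  track_popularity = c(66, 67, 70, 60, 69, 67, 62, 69),
  danceability     = c(0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.449, 0.542),
  key              = c(6, 11, 1, 7, 1, 8, 5, 4)
)

fit <- lm(track_popularity ~ danceability + key, data = toy_df)

# summary()$coefficients is a matrix with the columns
# "Estimate", "Std. Error", "t value", and "Pr(>|t|)"
coef_table <- summary(fit)$coefficients

# Flag the coefficients whose p-value is below the 5% level
sig_table <- data.frame(
  term        = rownames(coef_table),
  estimate    = coef_table[, "Estimate"],
  p_value     = coef_table[, "Pr(>|t|)"],
  significant = coef_table[, "Pr(>|t|)"] < 0.05,
  row.names   = NULL
)

sig_table
```

Applying the same steps to linearMod1, linearMod2, and linearMod3 would give one such table per model.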
Across all three models, the danceability coefficient is significant, indicating that the variable is associated with track popularity. Even more interesting, every coefficient is significant at the 95% confidence level except key. According to these models, key is not a meaningful predictor of popularity.
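To put the size of the danceability effect in context, the Model 1 estimates can be plugged into its equation. This is a worked illustration using the reported coefficients, with danceability values of 0.50 and 0.75 chosen arbitrarily:

\[ \begin{aligned} \widehat{\text{track popularity}}(0.50) &= 34.0454 + 8.6956 \cdot 0.50 = 38.3932 \\ \widehat{\text{track popularity}}(0.75) &= 34.0454 + 8.6956 \cdot 0.75 = 40.5671 \\ \text{difference} &= 8.6956 \cdot 0.25 = 2.1739 \end{aligned} \]

Under Model 1, raising danceability by 0.25 corresponds to roughly 2 more points on the 0 - 100 popularity scale. A coefficient can therefore be statistically significant while its practical effect remains modest.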
All coefficients in the final model are positive except for energy, speechiness, instrumentalness, liveness, and duration. The negative coefficients for both speechiness and instrumentalness suggest that songs need a balance of the two, and that genres relying heavily on either factor might have lower popularity scores.
This simple analysis says much about the significance of different music factors, but a more complex model is needed that takes into account correlation between the variables and weighted coefficients. A non-linear model may also benefit this analysis.
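One common way to check how much correlation among predictors affects a linear model is the variance inflation factor (VIF). It is not part of the analysis above, so the following is a sketch of a possible next step. The helper `vif_manual()` is a hypothetical name. It computes the VIF from its definition, 1 / (1 - R^2), by regressing each predictor on the others, and is demonstrated on the first eight rows printed in the `str()` output earlier.

```r
# Hypothetical helper: VIF for each predictor, computed from first principles
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    # Regress predictor p on all of the other predictors
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

# First eight rows of three predictors, copied from the str() output above
toy_df <- data.frame(
  danceability = c(0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.449, 0.542),
  energy       = c(0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.856, 0.903),
  valence      = c(0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.152, 0.367)
)

vif_values <- vif_manual(toy_df, c("danceability", "energy", "valence"))
vif_values
```

A VIF of 1 means a predictor is uncorrelated with the others, and larger values mean its coefficient is estimated less precisely. Running the helper on the full data frame with all twelve predictors of Model 3 would show which coefficients are most affected.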
The purpose of this project was to explore how track_popularity relates to a track’s playlist genre and to its musical factors. This initial purpose was divided into questions about the impact of genre, subgenre, danceability, and numeric music factors on track popularity. To answer these questions, the Spotify data set was sorted by popularity and reduced to complete records and relevant variables. This data was then used to create 5 different visualizations meant to provide insights on the above questions.
Creating those plots allowed for some interesting points to be discovered:
The linear regression analysis also yielded useful insights:
These insights from both the visual and regression analyses can be valuable to music companies. Based on the results, music companies can focus on creating pop, r&b, or latin tracks that can also be categorized as post-teen pop, the highest rated subgenre. These companies can also balance a track’s danceability according to its genre to create popular songs to dance to. However, popular tracks can be made without a large focus on danceability. The linear regression models indicate that the music factors are related to a track’s popularity, and further modeling can narrow down which music factors have the biggest impact. This could be extended to account for the different genres as well. Overall, making music associated with the most popular genres and subgenres while taking the significant music factors into account is more likely to result in popular tracks.
Although this analysis offers some insight for music companies, it does have some limitations. This project only fit linear regressions and could benefit greatly from more complex models. It also did not explore how popular artists influence track popularity or whether track popularity correlates with monetary gains for music companies. These gaps could be explored further with a more complex analysis in the future. Despite these limitations, this analysis still provides interesting insights into how music companies could create a formulaic approach to making smash-hit songs.