Spotify is one of the most popular music streaming services, offering over 50 million songs and 700,000 podcasts. About 40,000 new songs are added to Spotify every day! So how does a song become popular on Spotify? Do the most popular songs share any common characteristics? In this project, I will visually and statistically examine a data set of over 30,000 songs to determine which song features are correlated with popularity score. This type of information would be very useful for artists and producers looking for the “formula” for creating the next big hit.
The following packages were used in this analysis.
library(tidyverse) #for data cleaning and manipulation
library(rccdates) #for converting date variables
The dataset was originally obtained from Spotify using the spotifyr package. The data for this project was downloaded via this GitHub link, which became available in January 2020. According to GitHub, Kaylin Pavlik recently used a Spotify dataset of 5000 songs to try to classify song genres based on the audio features. The spotifyr package allows users to pull data from Spotify's Web API for similar analysis.
#importing the data
spotify <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
The source data does not have any missing values and contains 32,833 observations and 23 variables that are a mixture of categorical and numeric. Descriptions for the non-intuitive variables can be found in the table below and a full description of all variables can be found here.
| name | type | description |
|---|---|---|
| track_popularity | double | popularity score (0-100) |
| danceability | double | how suitable the song is for dancing (0-1) |
| energy | double | measure of song intensity and activity (0-1) |
| key | double | key of track (mapped to integer where C=0) |
| loudness | double | loudness in decibels (dB) |
| mode | double | modality (major=1, minor=0) |
| speechiness | double | presence of spoken word in song (0-1) |
| acousticness | double | confidence (0-1) whether song is acoustic |
| instrumentalness | double | predicts if the track contains no vocals (0-1) |
| liveness | double | detects presence of audience in recording (0-1) |
| valence | double | measure of how positive the song sounds (0-1) |
| tempo | double | estimated tempo in beats per minute (BPM) |
| duration_ms | double | length of song in milliseconds (ms) |
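The dimension and missing-value claims can be checked directly after import. The sketch below uses a tiny made-up data frame (`toy`, with placeholder values) rather than the real download, so it runs on its own; the same calls should apply unchanged to `spotify`.

```r
# Toy stand-in for the imported data; all values are made up for illustration
toy <- data.frame(
  track_name       = c("Song A", NA, "Song C"),
  track_popularity = c(66, 40, 0),
  danceability     = c(0.75, 0.61, 0.52),
  stringsAsFactors = FALSE
)

# Number of rows (observations) and columns (variables)
dim(toy)

# Count of NA values in each column, then keep only columns that have any
missing_per_column <- colSums(is.na(toy))
missing_per_column[missing_per_column > 0]
```

Running `colSums(is.na(spotify))` on the full data is a cheap way to confirm that no column contains missing values before moving on.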
As previously mentioned, this data doesn’t contain any missing values or appear to have any outliers. It is also already in tidy format where each variable corresponds to its own column and each observation corresponds to its own row. The additional cleaning I’ve done is to make the data easier to analyze. First I removed the unique identifier columns for song, album, and playlist as well as the columns for album name and playlist name. Identifier variables are not relevant in my analysis and playlist and album name have nothing to do with the characteristics of a song that could influence the popularity score. Therefore, they will not be used in any visualizations or calculations.
#removing columns 1,5,6,8,& 9
spotify <- spotify[,-c(1,5,6,8,9)]
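Dropping columns by position works, but it silently removes the wrong columns if the column order ever changes. A possible alternative, sketched below, drops them by name with `dplyr::select()`. The raw column names in `drop_cols` are my assumption about how the identifier and name columns are labeled in the source file, and `toy` is a made-up one-row stand-in.

```r
library(dplyr)

# Toy frame using the assumed raw column names; values are placeholders
toy <- data.frame(
  track_id                 = "id1",
  track_name               = "Song A",
  track_artist             = "Artist A",
  track_popularity         = 50,
  track_album_id           = "alb1",
  track_album_name         = "Album A",
  track_album_release_date = "2019-01-01",
  playlist_name            = "List A",
  playlist_id              = "pl1",
  playlist_genre           = "pop"
)

# Columns to drop (assumed names for the identifier and name columns)
drop_cols <- c("track_id", "track_album_id", "track_album_name",
               "playlist_name", "playlist_id")

# all_of() raises an error if any listed column is missing
toy_clean <- select(toy, -all_of(drop_cols))
names(toy_clean)
```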
I also think it would be more useful to only look at ‘year’ for the track album release date. It is originally in “YYYY-MM-DD” format for the majority of rows, but 1,886 rows only contain the year. Using the tidyr separate() function, I split the data into three columns and then deleted day and month so only year remains. The song release years in this data set span from 1957 to 2020.
#separating track_album_release_date; fill = "right" pads year-only rows with NA for month and day
spotify <- spotify %>% separate(track_album_release_date, c("release_year", "release_month", "release_day"), sep = "-", fill = "right")
#deleting release_month and release_day
spotify <- spotify[,-c(5,6)]
#changing year to a factor
spotify$release_year <- as.factor(spotify$release_year)
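Because the year is always the first four characters, whether the row holds a full date or only a year, the same result could also be reached without splitting into three columns. This is a minimal sketch of that alternative on made-up date strings, not the approach used above.

```r
# Made-up examples covering both formats found in the data
release_dates <- c("2019-06-14", "2012", "1957-01-01")

# The first four characters are the year in either format
release_year <- substr(release_dates, 1, 4)

# Store as a factor, matching the conversion used in this project
release_year <- factor(release_year)
levels(release_year)
```

If year is later needed on a numeric axis (for example, to plot popularity over time), `as.integer(as.character(release_year))` converts the factor labels back to numbers.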
I also changed playlist genre and playlist subgenre from characters to factors because I think these points may be relevant in my analysis of song popularity.
#changing genre to a factor
spotify$playlist_genre <- as.factor(spotify$playlist_genre)
#changing subgenre to a factor
spotify$playlist_subgenre <- as.factor(spotify$playlist_subgenre)
Finally, I wanted to simplify some of the variable names to make them easier to reference in my analysis.
#simplifying variable names
names(spotify) <- c("name", "artist", "popularity", "year", "genre", "subgenre", "danceability", "energy", "key", "loudness", "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "duration")
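Assigning all names at once relies on the columns being in exactly the expected order. A hedged alternative is to rename only the columns that change, pairing each new name with its old one via `dplyr::rename()`. The sketch below uses a made-up one-row frame; `track_name` and `track_artist` are assumed raw column names.

```r
library(dplyr)

# Toy frame with a few of the raw column names; values are placeholders
toy <- data.frame(
  track_name       = "Song A",
  track_artist     = "Artist A",
  track_popularity = 50,
  duration_ms      = 200000
)

# rename() takes new_name = old_name pairs and leaves other columns untouched
toy <- rename(
  toy,
  name       = track_name,
  artist     = track_artist,
  popularity = track_popularity,
  duration   = duration_ms
)
names(toy)
```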
A condensed snapshot of the cleaned data set is shown below.
| name | artist | popularity | year | genre | subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Let It Be Me | Steve Aoki | 52 | 2019 | pop | dance pop | 0.661 | 0.758 | 7 | -5.299 | 1 | 0.0864 | 0.0797 |
| Lovers + Strangers | Starley | 58 | 2019 | pop | dance pop | 0.653 | 0.690 | 1 | -5.003 | 1 | 0.0756 | 0.1090 |
The data set now contains 32,833 observations and 18 variables. The variable of interest, “popularity,” has values ranging from 0 to 100 with a mean of 42.48. Six genres of music are represented in this data set (EDM, Latin, pop, R&B, rap, and rock), along with 24 sub-genres. The years of the songs span from 1957 to 2020. Finally, many of the song characteristics are on a 0-1 scale, with values closer to 1 indicating the song has more of that characteristic.
The main questions I want to answer in my analysis are:
To find commonalities in the characteristics of popular songs, I can sort the dataset by popularity score and start by examining a smaller portion of the data containing only the most popular songs. I also plan to examine the pairwise correlations between the numeric attributes in a correlation matrix. Regression analysis could also be helpful in quantifying these relationships if a linear relationship is found. Cluster analysis on the smaller dataset might also be helpful to see how similar popular songs are to one another.
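The first three steps of this plan can be sketched in base R. The data below is synthetic: I generate three 0-1 features at random and build `popularity` with an arbitrary linear dependence on `danceability`, purely so the example has something to find. None of the numbers reflect the real Spotify data.

```r
set.seed(42)
n <- 200

# Synthetic features on a 0-1 scale (illustration only)
toy <- data.frame(
  danceability = runif(n),
  energy       = runif(n),
  valence      = runif(n)
)

# Popularity constructed with an arbitrary dependence on danceability plus noise
toy$popularity <- 40 + 20 * toy$danceability + rnorm(n, sd = 5)

# Step 1: sort by popularity and keep the 20 most popular songs
top_songs <- head(toy[order(toy$popularity, decreasing = TRUE), ], 20)

# Step 2: correlation matrix of all numeric columns
cor_matrix <- cor(toy)
round(cor_matrix["popularity", ], 2)

# Step 3: linear regression of popularity on the song features
fit <- lm(popularity ~ danceability + energy + valence, data = toy)
summary(fit)$coefficients
```

On the real data, the row of `cor_matrix` for `popularity` would be the first place to look for candidate predictors before fitting a regression.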
For the categorical variables, I plan to run ANOVA tests to see whether mean popularity differs significantly across the genre categories. I also want to summarize average popularity score by artist to uncover the most popular artists. Bar plots and mean plots could be helpful for visualizing this.
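A one-way ANOVA and a per-artist summary might look like the following sketch. The genres, artist labels, and popularity scores are simulated for illustration, with group means I chose arbitrarily.

```r
set.seed(1)

# Simulated data: 3 genres, 6 placeholder artists, 90 songs
toy <- data.frame(
  genre      = factor(rep(c("pop", "rap", "rock"), each = 30)),
  artist     = rep(paste("Artist", LETTERS[1:6]), each = 15),
  popularity = c(rnorm(30, 55, 10), rnorm(30, 45, 10), rnorm(30, 40, 10))
)

# One-way ANOVA: does mean popularity differ across genres?
anova_fit <- aov(popularity ~ genre, data = toy)
summary(anova_fit)

# Average popularity per artist, sorted from highest to lowest
artist_means <- aggregate(popularity ~ artist, data = toy, FUN = mean)
artist_means <- artist_means[order(artist_means$popularity, decreasing = TRUE), ]
artist_means
```

A significant F-test only says that at least one genre mean differs; a follow-up such as `TukeyHSD(anova_fit)` would be needed to see which pairs of genres differ.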
Finally, I want to try to parse keywords from song titles to determine whether any of these words have a relationship with the popularity score. A word cloud might be helpful to visualize words that appear commonly in song titles.
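One simple way to approach the title parsing is to split each title into lowercase words, drop common filler words, and then count each word and average the popularity of the songs it appears in. The titles, scores, and the short `stop_words` list below are all made up for this sketch.

```r
# Made-up titles and popularity scores
titles     <- c("Love Me Again", "Love in the Dark", "Dark Night")
popularity <- c(80, 60, 40)

# Split lowercase titles on anything that is not a letter or apostrophe
words_per_title <- strsplit(tolower(titles), "[^a-z']+")

# One row per (word, song), carrying the song's popularity along
word_table <- data.frame(
  word       = unlist(words_per_title),
  popularity = rep(popularity, lengths(words_per_title))
)

# Drop empty strings and a small hypothetical stop-word list
stop_words <- c("the", "in", "me", "a", "of")
keep       <- nzchar(word_table$word) & !(word_table$word %in% stop_words)
word_table <- word_table[keep, ]

# Frequency and mean popularity for each remaining word
word_counts <- table(word_table$word)
word_means  <- tapply(word_table$popularity, word_table$word, mean)
word_stats  <- data.frame(
  word            = names(word_counts),
  n               = as.vector(word_counts),
  mean_popularity = as.vector(word_means[names(word_counts)])
)
word_stats
```

The `word` and `n` columns are the kind of input a word cloud typically needs, while `mean_popularity` addresses the question of whether particular words go with higher scores.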
We have not yet covered regression or clustering, but I have some knowledge of these methods from my other classes that I will draw on. I am also less familiar with how to visualize categorical variables, so I will be looking into word clouds and other plots that can effectively communicate these insights.