Introduction

Spotify is one of the most popular music streaming services, offering over 50 million songs and 700,000 podcasts. About 40,000 new songs are added to Spotify every day! So how does a song become popular on Spotify? Do the most popular songs share any common characteristics? In this project, I will visually and statistically examine a data set of over 30,000 songs to determine which song features are correlated with popularity score. This type of information would be very useful for artists and producers who want to know the “formula” for creating the next big hit.

Packages Required

The following packages were used in this analysis.

library(tidyverse) #for data cleaning and manipulation
library(rccdates) #for converting date variables

Data Preparation

The dataset was originally obtained from Spotify using the spotifyr package. The data for this project was downloaded via this GitHub link, which became available in January 2020. According to GitHub, Kaylin Pavlik recently used a Spotify dataset of 5000 songs to try to classify song genres based on their audio features. The spotifyr package allows users to pull data from Spotify's API for similar analysis.

#importing the data
spotify <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

The Variables

The source data does not have any missing values and contains 32,833 observations and 23 variables that are a mixture of categorical and numeric. Descriptions for the non-intuitive variables can be found in the table below and a full description of all variables can be found here.

name              type    description
track_popularity  double  popularity score (0-100)
danceability      double  how suitable the song is for dancing (0-1)
energy            double  measure of song intensity and activity (0-1)
key               double  key of track (mapped to integer where C=0)
loudness          double  loudness in decibels (dB)
mode              double  modality (major=1, minor=0)
speechiness       double  presence of spoken word in song (0-1)
acousticness      double  confidence that the song is acoustic (0-1)
instrumentalness  double  predicts if the track contains no vocals (0-1)
liveness          double  detects presence of an audience in the recording (0-1)
valence           double  measure of how positive the song sounds (0-1)
tempo             double  estimated tempo in beats per minute (BPM)
duration_ms       double  length of song in milliseconds (ms)
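
The row, column, and missing-value counts quoted above can be checked directly once the data is loaded. The sketch below uses a small made-up data frame (`toy`, which is not the real data) so that it runs on its own; the same calls are assumed to work unchanged on `spotify`.

```r
# hypothetical miniature stand-in for the imported data (values are made up)
toy <- data.frame(
  track_name       = c("Song A", "Song B", NA),
  track_popularity = c(52, 58, 71),
  danceability     = c(0.661, 0.653, 0.702),
  stringsAsFactors = FALSE
)

# number of observations (rows) and variables (columns)
dim(toy)

# count of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# storage class of each column (character vs. numeric)
col_types <- sapply(toy, class)
col_types
```

Running `colSums(is.na(spotify))` on the full data is a quick way to confirm the no-missing-values claim before moving on to cleaning.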

Data Cleaning

As previously mentioned, this data doesn’t contain any missing values or appear to have any outliers. It is also already in tidy format where each variable corresponds to its own column and each observation corresponds to its own row. The additional cleaning I’ve done is to make the data easier to analyze. First I removed the unique identifier columns for song, album, and playlist as well as the columns for album name and playlist name. Identifier variables are not relevant in my analysis and playlist and album name have nothing to do with the characteristics of a song that could influence the popularity score. Therefore, they will not be used in any visualizations or calculations.

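#removing the identifier columns and the album/playlist names (columns 1, 5, 6, 8, & 9)
spotify <- spotify %>%
  select(-track_id, -track_album_id, -track_album_name, -playlist_name, -playlist_id)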

I also think it would be more useful to only look at ‘year’ for the track album release date. It is originally in “YYYY-MM-DD” format for the majority of rows, but 1,886 rows only contain the year. Using the tidyr separate() function, I split the data into three columns and then deleted day and month so only year remains. The song release years in this data set span from 1957 to 2020.

#separating track_album_release_date; fill = "right" pads year-only rows with NA for month and day
spotify <- spotify %>% separate(track_album_release_date, c("release_year", "release_month", "release_day"), sep = "-", fill = "right")

#deleting release_month and release_day
spotify <- spotify[,-c(5,6)]

#changing year to a factor
spotify$release_year <- as.factor(spotify$release_year)
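
Because every date string starts with the four-digit year, the same result could also be reached without splitting into three columns. The sketch below is an alternative to the separate() step, not the approach used in this analysis, and the `dates` vector is made up for illustration.

```r
# made-up release dates in both formats found in the data
dates <- c("2019-06-14", "2012", "1957-03-01")

# the year is always the first four characters, whichever format is used
release_year <- substr(dates, 1, 4)

# convert to a factor, as in the step above
release_year <- as.factor(release_year)
levels(release_year)
```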

I also changed playlist genre and playlist subgenre from characters to factors because I think these points may be relevant in my analysis of song popularity.

#changing genre to a factor
spotify$playlist_genre <- as.factor(spotify$playlist_genre)

#changing subgenre to a factor
spotify$playlist_subgenre <- as.factor(spotify$playlist_subgenre)

Finally, I wanted to simplify some of the variable names to make them easier to reference in my analysis.

#simplifying variable names
names(spotify) <- c("name", "artist", "popularity", "year", "genre", "subgenre", "danceability", "energy", "key", "loudness", "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "duration")

A condensed snapshot of the cleaned data set is shown below.

name                artist      popularity  year  genre  subgenre   danceability  energy  key  loudness  mode  speechiness  acousticness
Let It Be Me        Steve Aoki  52          2019  pop    dance pop  0.661         0.758   7    -5.299    1     0.0864       0.0797
Lovers + Strangers  Starley     58          2019  pop    dance pop  0.653         0.690   1    -5.003    1     0.0756       0.1090

Data Summary

The data set now contains 32,833 observations and 18 variables. The variable of interest, “popularity,” has values ranging from 0 to 100 with a mean of 42.48. Six genres of music are represented in this data set (EDM, Latin, pop, R&B, rap, and rock), along with 24 sub-genres. The release years of the songs span from 1957 to 2020. Finally, many of the song characteristics are on a 0-1 scale, with values closer to 1 indicating the song has more of that characteristic.
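
Summary figures like these can be produced with a few base R calls. The sketch below runs on a made-up five-row data frame (`toy`) whose column names follow the simplified names above; the values are invented and do not come from the real data.

```r
# hypothetical miniature version of the cleaned data (values are made up)
toy <- data.frame(
  popularity = c(52, 58, 0, 100, 40),
  genre      = factor(c("pop", "pop", "rock", "rap", "edm")),
  year       = factor(c("2019", "2019", "1957", "2020", "2015"))
)

# range and mean of the variable of interest
pop_range <- range(toy$popularity)
pop_mean  <- mean(toy$popularity)

# number of songs in each genre
genre_counts <- table(toy$genre)

# earliest and latest release year (year is a factor, so convert via character)
year_span <- range(as.integer(as.character(toy$year)))
```

Converting the factor through `as.character()` first matters here: calling `as.integer()` directly on a factor returns the internal level codes rather than the years.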

Proposed Exploratory Data Analysis

The main questions I want to answer in my analysis are:

  1. What characteristics do popular songs have in common, if any?
  2. Does the name of the song have any influence on its popularity?
  3. Which artists have the most popular songs?
  4. Are there any characteristics that negatively impact popularity score?
  5. What genre has the most popular songs?

Methods

To find commonalities in the characteristics of popular songs, I can sort the dataset by popularity score and start by examining a smaller portion of the data containing only the most popular songs. I also plan to look at the correlation between the different numeric attributes in a correlation matrix. Regression analysis could also be helpful in uncovering these relationships if linear dependency is discovered. Cluster analysis on the smaller dataset might also be helpful to see how similar popular songs are to one another.
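
As a rough sketch of what these steps might look like in base R, the block below computes a correlation matrix, fits a linear regression, and runs k-means on the most popular songs. The data frame `toy` and all of its values are made up, the choice of two clusters is arbitrary, and k-means is only one of several clustering methods that could be used here.

```r
# made-up numeric features for eight songs; not real Spotify values
toy <- data.frame(
  popularity   = c(20, 35, 40, 55, 60, 70, 80, 90),
  danceability = c(0.30, 0.42, 0.45, 0.58, 0.60, 0.71, 0.78, 0.88),
  energy       = c(0.90, 0.40, 0.75, 0.55, 0.35, 0.80, 0.50, 0.65),
  loudness     = c(-4.1, -9.5, -5.0, -7.2, -10.1, -4.8, -8.0, -6.3)
)

# pairwise Pearson correlations between all numeric columns
cor_matrix <- cor(toy)
round(cor_matrix, 2)

# linear regression of popularity on the audio features
fit <- lm(popularity ~ danceability + energy + loudness, data = toy)
summary(fit)

# sort by popularity and keep only the most popular songs
top_songs <- toy[order(-toy$popularity), ][1:4, ]

# k-means on the standardized audio features of those songs
set.seed(1)  # k-means starts from randomly chosen centers
clusters <- kmeans(scale(top_songs[, -1]), centers = 2)
clusters$cluster
```

Standardizing with `scale()` before clustering keeps loudness, which is measured in decibels, from dominating the 0-1 features.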

For the categorical variables, I plan to run ANOVA tests to see whether mean popularity differs significantly across the genre categories. I also want to summarize the average popularity score by artist to uncover the most popular artists. Bar plots and mean plots could be helpful for visualizing this.
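
One possible shape for the ANOVA and the per-artist summary is sketched below using base R. The data frame `toy`, the artist labels, and the scores are all invented for illustration.

```r
# made-up popularity scores for three genres and three artists
toy <- data.frame(
  popularity = c(70, 75, 80, 40, 45, 50, 60, 62, 58),
  genre      = factor(rep(c("pop", "rock", "rap"), each = 3)),
  artist     = rep(c("Artist A", "Artist B", "Artist C"), times = 3)
)

# one-way ANOVA: does mean popularity differ between genres?
genre_aov <- aov(popularity ~ genre, data = toy)
summary(genre_aov)

# p-value of the genre effect, pulled out of the ANOVA table
p_value <- summary(genre_aov)[[1]][["Pr(>F)"]][1]

# average popularity per artist, highest first
artist_means <- aggregate(popularity ~ artist, data = toy, FUN = mean)
artist_means <- artist_means[order(-artist_means$popularity), ]
artist_means

# mean popularity per genre; this vector could be passed to barplot()
genre_means <- tapply(toy$popularity, toy$genre, mean)
genre_means
```

A significant ANOVA result only says that at least one genre mean differs; a follow-up such as `TukeyHSD(genre_aov)` would be needed to see which pairs of genres differ.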

Finally, I want to try and parse keywords from song titles to determine if any of these words have a relationship with the popularity score. A word cloud might be helpful to visualize words that appear commonly in song titles.
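
A minimal sketch of the keyword step is shown below: it splits titles into lower-case words and counts them. The titles are made up, and splitting on anything outside a-z is a simplifying assumption that would discard accented letters and digits in real titles.

```r
# made-up song titles for illustration
titles <- c("Let It Be Me", "Let Me Love You", "Love Me Like You Do")

# lower-case the titles and split them on runs of non-letter characters
words <- unlist(strsplit(tolower(titles), "[^a-z]+"))
words <- words[words != ""]  # drop empty strings left by leading punctuation

# count how often each word appears, most frequent first
word_counts <- sort(table(words), decreasing = TRUE)
head(word_counts)
```

The resulting frequency table is the kind of input a word cloud needs, and very common words such as "me" or "you" would likely need to be filtered out first.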

Additional Learning

We have not yet covered regression or clustering, but I have some knowledge of these methods from my other classes that I will draw on. I am also less familiar with visualizing categorical variables, so I will be looking into word clouds and other plots that can effectively communicate these insights.