Spotify is used by people around the world, and playlists make it easy to collect and store songs. Playlists are created for many different situations and events, including weddings, birthdays, and hangouts with friends. At many of these events people get together to have fun and enjoy themselves, and one of the main activities is dancing. This brings us to our research question: What variables affect danceability the most?
Our plan is to use the Spotify dataset to answer this question. We will identify the variables that are correlated with danceability and examine the direction and strength of each correlation.
We will use a linear regression analysis and a decision tree to address our problem. As an initial analysis, we created a correlation plot to see which numerical variables have a significant correlation with danceability.
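The initial correlation check described above could be sketched as follows. This is only an illustration under stated assumptions: the `toy` data frame is simulated (it is not the Spotify data), and the column names simply mirror a few of the dataset's numeric variables.

```r
# Simulated stand-in for the Spotify data (hypothetical values, not real tracks)
set.seed(1)
n <- 200
toy <- data.frame(
  valence = runif(n),
  energy  = runif(n),
  tempo   = runif(n, 60, 200)
)
# Assumption for the sketch: danceability depends on valence plus noise
toy$danceability <- 0.4 + 0.3 * toy$valence + rnorm(n, sd = 0.1)

# Correlation of every numeric column with danceability
num_cols <- toy[sapply(toy, is.numeric)]
cors <- cor(num_cols, use = "complete.obs")[, "danceability"]
cors <- cors[names(cors) != "danceability"]

# Order by absolute strength, strongest first
cors_sorted <- cors[order(abs(cors), decreasing = TRUE)]
print(round(cors_sorted, 3))
```

On the real data, the same pattern would be applied to the numeric columns of `spotify` instead of `toy`.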
Our analysis will be especially helpful for DJs, music producers, and leaders of dance teams who are creating dance routines. It will help them choose which songs to play at different points in their events or routines.
tidyverse
rpart
rpart.plot
knitr
kableExtra
ggplot2
broom
library(tidyverse)
library(rpart)
library(rpart.plot)
library(knitr)
library(kableExtra)
library(ggplot2)
library(broom)
spotify <- read.csv("spotify_songs.csv")
The source data comes from Spotify via the spotifyr package and was used in a blog post where audio features were explored by genre. It was created by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff on January 21, 2020. The dataset includes 32,833 records and 23 variables.
Change track name, track artist, and album name to character variables.
spotify$track_name <- as.character(spotify$track_name)
spotify$track_artist <- as.character(spotify$track_artist)
spotify$track_album_name <- as.character(spotify$track_album_name)
Check for NAs and replace missing text values with "Not_Stated"
colSums(is.na(spotify))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
spotify$track_name[is.na(spotify$track_name)] <- "Not_Stated"
spotify$track_artist[is.na(spotify$track_artist)] <- "Not_Stated"
spotify$track_album_name[is.na(spotify$track_album_name)] <- "Not_Stated"
Remove duplicate tracks
After keeping one row per track_id, we have 28,356 records.
spotify_unique <- spotify %>% distinct(track_id, .keep_all = TRUE)
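Convert duration from milliseconds to seconds
# Divide by 1000 to go from milliseconds to seconds
spotify$duration_ms <- spotify$duration_ms / 1000
# Rename by column name rather than by position, so the code still works if the column order changes
names(spotify)[names(spotify) == "duration_ms"] <- "duration_seconds"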
Rename danceability to dance_ability to make it easier to read.
# Rename by column name rather than by position
names(spotify)[names(spotify) == "danceability"] <- "dance_ability"
track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | dance_ability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_seconds |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6f807x0ima9a1j3VPbc7VN | I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194.754 |
0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162.600 |
1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176.616 |
75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169.093 |
1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189.052 |
7fvUMiyapMsRRxr07cU8Ef | Beautiful People (feat. Khalid) - Jack Wins Remix | Ed Sheeran | 67 | 2yiy9cd2QktrNvWC2EUi0k | Beautiful People (feat. Khalid) [Jack Wins Remix] | 2019-07-11 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.919 | 8 | -5.385 | 1 | 0.1270 | 0.0799 | 0.00e+00 | 0.1430 | 0.585 | 124.982 | 163.049 |
We set out to learn how to build a linear regression and a decision tree using the variables in the Spotify data. Of those variables, we found valence, speechiness, tempo, energy, and acousticness to have the most significant impact on danceability.
We are interested in what makes a song danceable, and we used two different models to see whether our findings could be replicated. Before running these models, we converted the variable duration_ms from milliseconds to seconds and renamed it duration_seconds.
We initially used a histogram, violin chart, and box plot to explore relationships with danceability, but we ultimately switched to models to get more insightful results. We chose numerical variables because there is so much variation within each genre that genre did not give us a dependable way to measure danceability. We ran two model types to show which variables have a significant influence on danceability. First, we use a linear regression to measure the strength of the relationship between danceability and each numeric variable. Second, we run a decision tree to further support our findings on which song features matter most when deciding which songs are danceable. We also use graphs to validate our findings.
To answer our questions, we needed to learn how to run a linear regression model and a decision tree, and how to build graphs that display our results clearly.
We implement linear regression and the decision tree model to answer our questions.
We used two models in this project: a linear regression model and a decision tree model.
The linear regression model fits a line of best fit to estimate how strongly each variable is associated with danceability while the other variables are held constant. After putting different variables into the regression model, we determined which variables are significant when predicting the danceability of a song. Valence and speechiness had the strongest positive associations with danceability in the regression model.
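A minimal sketch of this kind of regression, assuming simulated data, is shown below. The `toy` data frame and its coefficients are made up for illustration and are not the project's results. With the broom package loaded, `tidy(fit)` would return a similar table with the columns term, estimate, std.error, statistic, and p.value.

```r
# Simulated stand-in for the Spotify data (hypothetical values, not real tracks)
set.seed(42)
n <- 500
toy <- data.frame(
  valence     = runif(n),
  speechiness = runif(n, 0, 0.5),
  energy      = runif(n)
)
# Assumption for the sketch: a known linear relationship plus noise
toy$dance_ability <- 0.5 + 0.25 * toy$valence + 0.2 * toy$speechiness -
  0.2 * toy$energy + rnorm(n, sd = 0.1)

# Fit the multiple linear regression
fit <- lm(dance_ability ~ valence + speechiness + energy, data = toy)

# Coefficient table: estimate, standard error, t statistic, p-value
coef_table <- coef(summary(fit))
print(round(coef_table, 4))
```

The sign of each estimate gives the direction of the association, and the t statistic and p-value indicate how reliably it differs from zero.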
The decision tree model is another way of determining which variables are significant in predicting the danceability of a song. It allows us to see how different combinations of attributes can make a song more danceable. From the decision tree, we were able to determine that high valence, high tempo, and high energy have a significant effect on making a song more danceable.
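A regression tree of this kind could be fit with rpart roughly as follows. This is a sketch on simulated data, so the `toy` data frame, the thresholds built into it, and the resulting splits are assumptions and do not reproduce the project's tree.

```r
library(rpart)

# Simulated stand-in for the Spotify data (hypothetical values, not real tracks)
set.seed(7)
n <- 1000
toy <- data.frame(
  valence = runif(n),
  tempo   = runif(n, 60, 200),
  energy  = runif(n)
)
# Assumption for the sketch: valence matters most, and tempo only helps
# when valence is already high (an interaction a tree can pick up)
toy$dance_ability <- 0.45 + 0.3 * (toy$valence > 0.5) +
  0.1 * (toy$tempo > 120) * (toy$valence > 0.5) + rnorm(n, sd = 0.05)

# method = "anova" fits a regression tree for a numeric outcome
tree_fit <- rpart(dance_ability ~ valence + tempo + energy,
                  data = toy, method = "anova")
print(tree_fit)
```

Each printed node shows the split rule, the number of tracks in the node, and the mean predicted danceability. `rpart.plot::rpart.plot(tree_fit)` would draw the same tree as a diagram.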
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.9194 | 0.0075 | 121.7878 | 0.0000 |
track_popularity | 0.0001 | 0.0000 | 4.8092 | 0.0000 |
energy | -0.2434 | 0.0062 | -39.1344 | 0.0000 |
key | -0.0002 | 0.0002 | -1.0339 | 0.3012 |
loudness | 0.0091 | 0.0003 | 26.9353 | 0.0000 |
mode | -0.0125 | 0.0014 | -8.7348 | 0.0000 |
speechiness | 0.2401 | 0.0070 | 34.2217 | 0.0000 |
acousticness | -0.1001 | 0.0038 | -26.2459 | 0.0000 |
instrumentalness | 0.0777 | 0.0033 | 23.4794 | 0.0000 |
liveness | -0.0914 | 0.0046 | -19.8037 | 0.0000 |
valence | 0.2272 | 0.0031 | 72.4964 | 0.0000 |
tempo | -0.0009 | 0.0000 | -35.0393 | 0.0000 |
duration_seconds | -0.0001 | 0.0000 | -12.1331 | 0.0000 |
We addressed our problem statement by building a linear regression model and a decision tree model, which together helped us determine which variables had the greatest influence on danceability.
First, we created a linear regression to find which variables have the most statistically significant effect on danceability. In the linear regression, we found that valence and speechiness had the strongest positive associations with danceability. After our initial findings, we used a decision tree model to see whether a second type of model would support them. Valence stood out in both models. The decision tree further indicated that songs that are happy (high valence), upbeat (fast tempo), and energetic are most danceable, while tracks that are speech-heavy or very acoustic tend to be less danceable, even if they are positive or fast-paced. In the regression table, by contrast, energy and tempo have negative coefficients and speechiness has a positive one, so the two models do not fully agree on those features.
Valence (positivity) is the most important factor. Songs with higher valence are generally more danceable, while lower-valence songs need other features (like tempo or energy) to boost their danceability.
Tempo interacts with valence: upbeat songs with higher tempo (≥ ~147–149 BPM) tend to have higher danceability, especially when valence is already high.
Speechiness reduces danceability. Even if a track has high valence or tempo, high speechiness (lots of talking or rapping) brings danceability down.
Energy plays a role alongside tempo and valence. Energy is measured on a scale from 0 to 1, with higher values indicating a more energetic track. Higher-energy tracks in the high-valence group are more danceable than low-energy ones.
Acousticness lowers danceability when valence is low. Songs that are both low in valence and high in acousticness are predicted to have the lowest danceability.
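One way to check a ranking like the one above is to read the variable importance scores that rpart stores on a fitted tree. The sketch below uses simulated data, so the `toy` data frame and the resulting ranking are illustrative assumptions only.

```r
library(rpart)

# Simulated stand-in for the Spotify data (hypothetical values, not real tracks)
set.seed(11)
n <- 1000
toy <- data.frame(
  valence      = runif(n),
  speechiness  = runif(n, 0, 0.5),
  acousticness = runif(n)
)
# Assumption for the sketch: valence has the largest effect,
# and high speechiness lowers danceability
toy$dance_ability <- 0.5 + 0.25 * (toy$valence > 0.5) -
  0.1 * (toy$speechiness > 0.3) + rnorm(n, sd = 0.05)

tree_fit <- rpart(dance_ability ~ valence + speechiness + acousticness,
                  data = toy, method = "anova")

# variable.importance is a named numeric vector on the fitted rpart object
imp <- sort(tree_fit$variable.importance, decreasing = TRUE)

# Express each score as a share of the total, in percent
imp_share <- round(100 * imp / sum(imp), 1)
print(imp_share)
```

Variables that never improve a split, such as `acousticness` in this toy setup, receive little or no importance.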
Our findings can help users of this analysis decide which song attributes to keep in mind when setting up a playlist. The most important step is to keep valence high, which means choosing more positive music. Along with high valence, it is important to choose music with a higher tempo and more energy. Finally, avoid songs with high speechiness, meaning songs with a lot of talking or rapping. If the valence is low, keep the acousticness low as well by playing songs with more electronic elements and avoiding songs with more natural sounds.
The data is largely based on US music tastes. Cultural differences may lead those outside of the US to look for other attributes when choosing music to dance to. The original creators of the dataset only looked at 6 genres, and there are many more genres of music, so music belonging to other genres is underrepresented. The variables are also not easily accessible to consumers and can be confusing, because most people don't break down music using this kind of language. Anyone who wants to build on our findings could add non-numeric variables to improve the model.