Spotify is used by people around the world, and many of them use playlists to collect and store songs. Playlists are created for many different situations and events, including weddings, birthdays, and hangouts with friends. At many of these events, people get together to have fun and enjoy themselves, and one of the main activities is dancing. This brings us to our research question: What variables affect danceability the most?
Our plan is to use the Spotify dataset to answer this question. We will separate out the variables that are correlated with danceability and examine how strong each correlation is.
We will use a linear regression analysis and a decision tree to solve our problem.
Our analysis will be especially helpful for DJs as well as people who are throwing parties and events with friends and family.
library(tidyverse) # provides dplyr (distinct, %>%) for data wrangling and ggplot2 for data visualizations
library(rpart) # includes the decision tree function that we are using for our analysis
library(rpart.plot) # adds the ability to plot the decision trees built with rpart
library(knitr) # provides kable(), which displays our dataset in a condensed table format
library(kableExtra) # adds table styling and a scroll box so we can scroll all the way across the table
spotify <- read.csv("spotify_songs.csv")
The source data comes from Spotify via the spotifyr package and was used in a blog post where audio features were explored by genre. It was created by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff on January 21, 2020. The dataset includes 32,833 records and 23 variables.
Replacing NAs in the dataset
colSums(is.na(spotify))
##                 track_id               track_name             track_artist
##                        0                        5                        5
##         track_popularity           track_album_id         track_album_name
##                        0                        0                        5
## track_album_release_date            playlist_name              playlist_id
##                        0                        0                        0
##           playlist_genre        playlist_subgenre             danceability
##                        0                        0                        0
##                   energy                      key                 loudness
##                        0                        0                        0
##                     mode              speechiness             acousticness
##                        0                        0                        0
##         instrumentalness                 liveness                  valence
##                        0                        0                        0
##                    tempo              duration_ms
##                        0                        0
spotify$track_name[is.na(spotify$track_name)] <- "Not_Stated"
spotify$track_artist[is.na(spotify$track_artist)] <- "Not_Stated"
spotify$track_album_name[is.na(spotify$track_album_name)] <- "Not_Stated"
Removing duplicate tracks
spotify_unique <- spotify %>% distinct(track_id, .keep_all = TRUE)
After keeping one row per track_id, we now have 28,356 records.
Converting duration from milliseconds to seconds
spotify_unique$duration_ms <- spotify_unique$duration_ms / 1000 # milliseconds to seconds
names(spotify_unique)[names(spotify_unique) == "duration_ms"] <- "duration" # rename by name rather than by column position
spotify_preview <- head(spotify_unique) # first six rows, used only for the preview table
spotify_preview %>%
kable(format = "html", align = "lccrr",
caption = "Spotify Preview") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
scroll_box(width = "100%", height = "300px")
track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6f807x0ima9a1j3VPbc7VN | I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194.754 |
0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162.600 |
1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176.616 |
75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169.093 |
1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189.052 |
7fvUMiyapMsRRxr07cU8Ef | Beautiful People (feat. Khalid) - Jack Wins Remix | Ed Sheeran | 67 | 2yiy9cd2QktrNvWC2EUi0k | Beautiful People (feat. Khalid) [Jack Wins Remix] | 2019-07-11 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.919 | 8 | -5.385 | 1 | 0.1270 | 0.0799 | 0.00e+00 | 0.1430 | 0.585 | 124.982 | 163.049 |
We plan on learning how to build a linear regression and a decision tree using the variables included in the Spotify dataset. Of these variables, we found only valence, speechiness, tempo, energy, and acousticness to have a significant impact on danceability.
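One simple way to screen which variables relate to danceability is to rank the numeric columns by their correlation with it. The sketch below is only an illustration of that idea and is not necessarily how the variables listed above were selected. The helper name rank_danceability_correlations is hypothetical, and it assumes a data frame with a numeric danceability column.

```r
# Hypothetical helper (not part of the original analysis): Pearson
# correlation of every numeric column with danceability, sorted so the
# strongest relationships (positive or negative) come first.
rank_danceability_correlations <- function(data) {
  # Keep only numeric columns; text columns such as track_name are dropped.
  numeric_cols <- data[sapply(data, is.numeric)]
  others <- setdiff(names(numeric_cols), "danceability")

  cors <- sapply(others, function(col) {
    cor(numeric_cols[[col]], numeric_cols$danceability,
        use = "complete.obs")
  })

  # Order by absolute value so strong negative correlations are not hidden.
  cors[order(abs(cors), decreasing = TRUE)]
}

# Usage (assumes a cleaned data frame such as spotify_unique):
# rank_danceability_correlations(spotify_unique)
```

A high correlation here only suggests a linear relationship with danceability; the regression and decision tree are still needed to judge each variable alongside the others.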
We are interested to see what makes a song danceable. We plan on using two different models to see if our findings can be replicated. Before running these models, we converted the duration_ms variable from milliseconds to seconds and renamed it duration.
We intend to run two different model types to show which variables have a significant influence on danceability. First, we will use a linear regression to measure how strongly each of the numeric variables is related to danceability. Second, we plan on running a decision tree to further support our findings on which song features matter most when deciding which songs would be considered danceable. We plan on creating graphs to validate our findings.
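The two models described above could be fit along the following lines. This is a minimal sketch under our own assumptions rather than the final analysis: the helper name fit_danceability_models is hypothetical, the predictor list is simply the numeric columns of the dataset (with duration already renamed), and the tree uses rpart's "anova" method because danceability is a continuous value.

```r
library(rpart)

# Numeric columns assumed to be available as predictors; adjust this
# list if your column names differ.
features <- c("track_popularity", "energy", "key", "loudness", "mode",
              "speechiness", "acousticness", "instrumentalness",
              "liveness", "valence", "tempo", "duration")

# Hypothetical helper: fits a linear regression and a regression tree
# on the same formula so their results can be compared.
fit_danceability_models <- function(data, predictors = features) {
  # Builds danceability ~ track_popularity + energy + ... from the names.
  model_formula <- reformulate(predictors, response = "danceability")
  list(
    linear = lm(model_formula, data = data),
    tree   = rpart(model_formula, data = data, method = "anova")
  )
}

# Usage (assumes a cleaned data frame such as spotify_unique):
# models <- fit_danceability_models(spotify_unique)
# summary(models$linear)            # coefficients and p-values
# models$tree$variable.importance   # features ranked by the tree
```

Fitting both models on the same formula makes it easier to check whether the variables that are significant in the regression are also the ones the tree splits on.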
We will need to learn how to run a linear regression model, as well as a decision tree, to answer our questions. We will also need to learn how to build different graphs that display our results clearly.
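As one example of the kind of graph we have in mind, the sketch below plots a single feature against danceability and overlays a fitted straight line. The helper name plot_feature_vs_danceability is hypothetical, and the styling choices are placeholders rather than final design decisions.

```r
library(ggplot2)

# Hypothetical helper: scatter plot of one feature against danceability
# with a least-squares line drawn on top.
plot_feature_vs_danceability <- function(data, feature) {
  ggplot(data, aes(x = .data[[feature]], y = danceability)) +
    geom_point(alpha = 0.2) +                                  # faded points reduce overplotting
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # fitted straight line
    labs(x = feature, y = "danceability",
         title = paste("danceability vs.", feature))
}

# Usage (assumes a cleaned data frame such as spotify_unique):
# plot_feature_vs_danceability(spotify_unique, "valence")
```

For the decision tree, the rpart.plot() function from the rpart.plot package loaded earlier draws a fitted rpart model directly.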
We will implement the linear regression and decision tree models to answer our questions.