The dataset spotify_songs contains key attributes of Spotify music, such as name, artist, album, genre, danceability, etc.
By analyzing the data, we can understand the features of Spotify music in general and, further, the elements that make some songs more popular than others. Songs can be separated into two groups by popularity, and EDA conducted to identify the differences between the two groups. This would help new artists catch the current trends in popular music and increase the chance that their new songs become popular.
Since we have the track_album_release_date information, we extract the month and store it in a separate column, in order to explore whether the features of songs change by month.
Popular artists could also be identified by manipulating the data. Among the top Spotify songs, who are the most popular artists?
We could also cluster the songs based on their characteristics. Clustering analysis could be applied to achieve this. By putting the songs with similar features together, we could better recommend new songs to Spotify customers and help them enjoy more songs that are similar to the songs in their preferred playlists.
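As a rough illustration of the clustering idea, the sketch below runs k-means on a few scaled audio features. The toy data frame `toy`, the three features chosen, and `centers = 2` are assumptions made for illustration only, not choices taken from this analysis.

```r
# Toy data standing in for a few audio features (values are made up).
set.seed(42)
toy <- data.frame(
  danceability = c(0.75, 0.72, 0.68, 0.30, 0.25, 0.35),
  energy       = c(0.92, 0.82, 0.93, 0.20, 0.15, 0.30),
  valence      = c(0.52, 0.69, 0.61, 0.10, 0.20, 0.15)
)

# Scale the features so that no single variable dominates the distance.
toy_scaled <- scale(toy)

# k-means with an assumed k = 2; nstart reruns from several random starts.
fit <- kmeans(toy_scaled, centers = 2, nstart = 10)

# Attach the cluster label to each song.
toy$cluster <- fit$cluster
print(toy)
```

On the real data, the number of clusters would need to be chosen more carefully, for example by comparing the within-cluster sum of squares across several values of k.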
Linear regression or logistic regression methods could be used to predict the popularity of a song before it is released. This could be helpful to determine whether it is worthwhile to invest a lot of money in music videos or other marketing channels to promote the song.
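A minimal sketch of the logistic regression idea follows, using simulated data rather than the Spotify dataset. The data frame `toy`, the binary `popular` label, and the rule used to simulate it are all hypothetical; the point is only to show the shape of a `glm()` fit and a prediction for an unreleased song.

```r
# Simulated songs; the features and the "popular" label are made up.
set.seed(1)
n <- 200
toy <- data.frame(
  danceability = runif(n),
  energy       = runif(n)
)

# Hypothetical rule: more danceable songs are more likely to be popular.
p <- plogis(-1 + 2 * toy$danceability)
toy$popular <- rbinom(n, size = 1, prob = p)

# Logistic regression of the binary label on the audio features.
fit <- glm(popular ~ danceability + energy, data = toy, family = binomial)

# Predicted probability of being popular for a new, unreleased song.
new_song <- data.frame(danceability = 0.8, energy = 0.6)
prob <- predict(fit, newdata = new_song, type = "response")
print(prob)
```

With the real data, the binary label could come from a cut-off on track_popularity, and the model should be evaluated on held-out songs before being used for any decision.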
These packages are required for data manipulation and visualization.
library(dplyr) # manipulate data
library(ggplot2) # visualizations
library(magrittr) # Pipe operator
library(DT) # create tables
library(knitr) # display tables
The data came from Spotify via the spotifyr package and was provided by TidyTuesday. I downloaded the dataset on 4/3/2020.
Data Source: Spotify Songs
I imported the data into RStudio and checked the dimensions and the first few rows of the dataset.
songs <- read.csv("spotify_songs.csv",stringsAsFactors=FALSE)
dim(songs)
## [1] 32833 23
head(songs)
## track_id track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7 Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone You Loved - Future Humans Remix
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful People (feat. Khalid) - Jack Wins Remix
## track_artist track_popularity track_album_id
## 1 Ed Sheeran 66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2 Maroon 5 67 63rPSO264uRjW1X5E6cWv6
## 3 Zara Larsson 70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers 60 1nqYsOef1yKKuGOVchbsk6
## 5 Lewis Capaldi 69 7m7vv9wlQ4i0LFuJiE2zsQ
## 6 Ed Sheeran 67 2yiy9cd2QktrNvWC2EUi0k
## track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2 Memories (Dillon Francis Remix)
## 3 All the Time (Don Diablo Remix)
## 4 Call You Mine - The Remixes
## 5 Someone You Loved (Future Humans Remix)
## 6 Beautiful People (feat. Khalid) [Jack Wins Remix]
## track_album_release_date playlist_name playlist_id playlist_genre
## 1 2019-06-14 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 2 2019-12-13 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 3 2019-07-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 4 2019-07-19 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 5 2019-03-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 6 2019-07-11 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## playlist_subgenre danceability energy key loudness mode speechiness
## 1 dance pop 0.748 0.916 6 -2.634 1 0.0583
## 2 dance pop 0.726 0.815 11 -4.969 1 0.0373
## 3 dance pop 0.675 0.931 1 -3.432 0 0.0742
## 4 dance pop 0.718 0.930 7 -3.778 1 0.1020
## 5 dance pop 0.650 0.833 1 -4.672 1 0.0359
## 6 dance pop 0.675 0.919 8 -5.385 1 0.1270
## acousticness instrumentalness liveness valence tempo duration_ms
## 1 0.1020 0.00e+00 0.0653 0.518 122.036 194754
## 2 0.0724 4.21e-03 0.3570 0.693 99.972 162600
## 3 0.0794 2.33e-05 0.1100 0.613 124.008 176616
## 4 0.0287 9.43e-06 0.2040 0.277 121.956 169093
## 5 0.0803 0.00e+00 0.0833 0.725 123.976 189052
## 6 0.0799 0.00e+00 0.1430 0.585 124.982 163049
The original dataset has 32833 rows and 23 variables. Five rows have missing values in the columns “track_name”, “track_artist”, and “track_album_name”. Because these rows lack the essential name and artist information, I deleted them. After cleaning, the dataset contains no missing values.
colSums(is.na(songs))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
subset(songs,is.na(songs$track_artist))
## track_id track_name track_artist track_popularity
## 8152 69gRFGOWY9OMpFJgFol1u0 <NA> <NA> 0
## 9283 5cjecvX0CmC9gK0Laf5EMQ <NA> <NA> 0
## 9284 5TTzhRSWQS4Yu8xTgAuq6D <NA> <NA> 0
## 19569 3VKFip3OdAvv4OfNTgFWeQ <NA> <NA> 0
## 19812 69gRFGOWY9OMpFJgFol1u0 <NA> <NA> 0
## track_album_id track_album_name track_album_release_date
## 8152 717UG2du6utFe7CdmpuUe3 <NA> 2012-01-05
## 9283 3luHJEPw434tvNbme3SP8M <NA> 2017-12-01
## 9284 3luHJEPw434tvNbme3SP8M <NA> 2017-12-01
## 19569 717UG2du6utFe7CdmpuUe3 <NA> 2012-01-05
## 19812 717UG2du6utFe7CdmpuUe3 <NA> 2012-01-05
## playlist_name playlist_id playlist_genre
## 8152 HIP&HOP 5DyJsJZOpMJh34WvUrQzMV rap
## 9283 GANGSTA Rap 5GA8GDo7RQC3JEanT81B3g rap
## 9284 GANGSTA Rap 5GA8GDo7RQC3JEanT81B3g rap
## 19569 Reggaeton viejito🔥 0si5tw70PIgPkY1Eva6V8f latin
## 19812 latin hip hop 3nH8aytdqNeRbcRCg3dw9q latin
## playlist_subgenre danceability energy key loudness mode speechiness
## 8152 southern hip hop 0.714 0.821 6 -7.635 1 0.1760
## 9283 gangster rap 0.678 0.659 11 -5.364 0 0.3190
## 9284 gangster rap 0.465 0.820 10 -5.907 0 0.3070
## 19569 reggaeton 0.675 0.919 11 -6.075 0 0.0366
## 19812 latin hip hop 0.714 0.821 6 -7.635 1 0.1760
## acousticness instrumentalness liveness valence tempo duration_ms
## 8152 0.0410 0.00000 0.1160 0.649 95.999 282707
## 9283 0.0534 0.00000 0.5530 0.191 146.153 202235
## 9284 0.0963 0.00000 0.0888 0.505 86.839 206465
## 19569 0.0606 0.00653 0.1030 0.726 97.017 252773
## 19812 0.0410 0.00000 0.1160 0.649 95.999 282707
new_songs <- filter(songs, !is.na(track_artist))
colSums(is.na(new_songs))
dim(new_songs)
summary(new_songs)
After removing the rows with missing values, there are 32,828 rows.
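A boxplot of duration_ms shows that there are outliers. Songs whose duration lies more than 1.5 times the interquartile range below the first quartile or above the third quartile are removed.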
boxplot(new_songs$duration_ms)
lowerq <- quantile(new_songs$duration_ms,na.rm = TRUE)[2]
upperq <- quantile(new_songs$duration_ms,na.rm = TRUE)[4]
iqr <- upperq - lowerq
mild.threshold.upper <- (iqr * 1.5) + upperq
mild.threshold.lower <- lowerq - (iqr * 1.5)
# Keep rows inside the thresholds; a logical mask still works when no row is an outlier
new_songs_no_outliers <- new_songs[new_songs$duration_ms >= mild.threshold.lower & new_songs$duration_ms <= mild.threshold.upper,]
dim(new_songs_no_outliers)
## [1] 31441 23
There were 1387 songs considered outliers. The dataset now contains 31441 songs, with a maximum duration of 352187 ms (around 5.87 minutes).
summary(new_songs_no_outliers$duration_ms)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 89250 187013 214115 219288 248000 352187
boxplot(new_songs_no_outliers$duration_ms)
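The same 1.5 × IQR rule can be wrapped in a small helper so that it is easy to reuse on other variables. This is only a sketch: the function name `iqr_fences` and the toy `durations` vector are hypothetical and are not part of the analysis above.

```r
# Return the lower and upper k * IQR fences for a numeric vector.
iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

# Made-up durations in milliseconds; the last value is deliberately extreme.
durations <- c(180000, 200000, 210000, 220000, 240000, 600000)
fences <- iqr_fences(durations)

# Logical mask of the values that fall inside the fences.
keep <- durations >= fences["lower"] & durations <= fences["upper"]
print(fences)
print(durations[keep])
```

Filtering with a logical mask rather than `-which(...)` has the advantage that nothing is dropped by accident when no value falls outside the fences.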
The variables track_id, track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id, playlist_genre, and playlist_subgenre are character vectors of length 31441. Below are the summary statistics of the numeric variables.
summary(new_songs_no_outliers[,-c(1:3,5:11)])
## track_popularity danceability energy key
## Min. : 0.00 Min. :0.0771 Min. :0.000175 Min. : 0.000
## 1st Qu.: 25.00 1st Qu.:0.5650 1st Qu.:0.582000 1st Qu.: 2.000
## Median : 46.00 Median :0.6720 Median :0.722000 Median : 6.000
## Mean : 43.04 Mean :0.6563 Mean :0.699792 Mean : 5.359
## 3rd Qu.: 62.00 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000
## Max. :100.00 Max. :0.9810 Max. :1.000000 Max. :11.000
## loudness mode speechiness acousticness
## Min. :-46.448 Min. :0.0000 Min. :0.0224 Min. :0.0000014
## 1st Qu.: -8.069 1st Qu.:0.0000 1st Qu.:0.0411 1st Qu.:0.0158000
## Median : -6.083 Median :1.0000 Median :0.0631 Median :0.0817000
## Mean : -6.635 Mean :0.5651 Mean :0.1078 Mean :0.1753684
## 3rd Qu.: -4.612 3rd Qu.:1.0000 3rd Qu.:0.1330 3rd Qu.:0.2550000
## Max. : 1.275 Max. :1.0000 Max. :0.9180 Max. :0.9920000
## instrumentalness liveness valence tempo
## Min. :0.0000000 Min. :0.00936 Min. :0.00001 Min. : 37.11
## 1st Qu.:0.0000000 1st Qu.:0.09310 1st Qu.:0.33400 1st Qu.: 99.93
## Median :0.0000121 Median :0.12800 Median :0.51500 Median :121.91
## Mean :0.0755050 Mean :0.19032 Mean :0.51312 Mean :120.88
## 3rd Qu.:0.0032900 3rd Qu.:0.24900 3rd Qu.:0.69500 3rd Qu.:133.99
## Max. :0.9940000 Max. :0.99400 Max. :0.99100 Max. :239.44
## duration_ms
## Min. : 89250
## 1st Qu.:187013
## Median :214115
## Mean :219288
## 3rd Qu.:248000
## Max. :352187
A new variable “month” (based on track_album_release_date) is added to help analyze music patterns by month.
new_songs_no_outliers$month <- months(as.Date(new_songs_no_outliers$track_album_release_date))
glimpse(new_songs_no_outliers)
## Observations: 31,441
## Variables: 24
## $ track_id <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCY...
## $ track_name <chr> "I Don't Care (with Justin Bieber) - Loud ...
## $ track_artist <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", ...
## $ track_popularity <int> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58...
## $ track_album_id <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X...
## $ track_album_name <chr> "I Don't Care (with Justin Bieber) [Loud L...
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", ...
## $ playlist_name <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Po...
## $ playlist_id <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD...
## $ playlist_genre <chr> "pop", "pop", "pop", "pop", "pop", "pop", ...
## $ playlist_subgenre <chr> "dance pop", "dance pop", "dance pop", "da...
## $ danceability <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, ...
## $ energy <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, ...
## $ key <int> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5,...
## $ loudness <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5...
## $ mode <int> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, ...
## $ speechiness <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0....
## $ acousticness <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.0803...
## $ instrumentalness <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0....
## $ liveness <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0....
## $ valence <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, ...
## $ tempo <dbl> 122.036, 99.972, 124.008, 121.956, 123.976...
## $ duration_ms <int> 194754, 162600, 176616, 169093, 189052, 16...
## $ month <chr> "June", "December", "July", "July", "March...
Below is a preview of the cleaned data.
new_songs_no_outliers %>%
head(100) %>%
datatable()
Below is the data dictionary of variable names, data type, and variable descriptions.
new_songs.type <- lapply(new_songs_no_outliers, class)
new_songs.var_desc <- c("Song unique ID",
"Song Name",
"Song Artist",
"Song Popularity (0-100) where higher is better",
"Album unique ID",
"Song album name",
"Date when album released",
"Name of playlist",
"Playlist ID",
"Playlist genre",
"Playlist subgenre",
"Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.",
"Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.",
"The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.",
"The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.",
"Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.",
"Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.",
"A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.",
"Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.",
"Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.",
"A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).",
"The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.",
"Duration of song in milliseconds",
"Month when album released")
new_songs.var_names <- colnames(new_songs_no_outliers)
data.description <- as_data_frame(cbind(new_songs.var_names,new_songs.type,new_songs.var_desc))
colnames(data.description) <- c("Variable Names","Data Type","Variable Description")
kable(data.description)
| Variable Names | Data Type | Variable Description |
|---|---|---|
| track_id | character | Song unique ID |
| track_name | character | Song Name |
| track_artist | character | Song Artist |
| track_popularity | integer | Song Popularity (0-100) where higher is better |
| track_album_id | character | Album unique ID |
| track_album_name | character | Song album name |
| track_album_release_date | character | Date when album released |
| playlist_name | character | Name of playlist |
| playlist_id | character | Playlist ID |
| playlist_genre | character | Playlist genre |
| playlist_subgenre | character | Playlist subgenre |
| danceability | numeric | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | numeric | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | integer | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | numeric | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db. |
| mode | integer | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | numeric | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | numeric | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | numeric | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | numeric | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | numeric | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | numeric | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | integer | Duration of song in milliseconds |
| month | character | Month when album released |
The dataset might be split into two groups based on the variable track_popularity, and the summary statistics of the two groups compared.
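One possible way to make such a split is sketched below on a toy data frame. Using the median of track_popularity as the cut-off is an assumption made here for illustration; the analysis does not fix a threshold, and the data values are invented.

```r
# Toy songs with made-up popularity scores and one audio feature.
toy <- data.frame(
  track_name       = c("a", "b", "c", "d", "e", "f"),
  track_popularity = c(10, 25, 46, 62, 80, 95),
  danceability     = c(0.50, 0.55, 0.60, 0.70, 0.75, 0.80)
)

# Assumed cut-off: songs above the median popularity count as "high".
cutoff <- median(toy$track_popularity)
toy$popularity_group <- ifelse(toy$track_popularity > cutoff, "high", "low")

# Compare a summary statistic between the two groups.
group_means <- aggregate(danceability ~ popularity_group, data = toy, FUN = mean)
print(group_means)
```

The same `aggregate()` call can be repeated for each numeric variable, or `summary()` can be applied to each subset separately.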
Songs will also be grouped by month, and the numeric variables analyzed for patterns. For example, is a certain variable very high in one month but low in another? Does the month or season affect the features of songs?
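A by-month comparison could look roughly like the following. The toy data and the choice of `energy` as the variable to summarize are assumptions for illustration; the month is extracted as a two-digit number with `format()` so that the result does not depend on the locale.

```r
# Toy songs with made-up release dates and energy values.
toy <- data.frame(
  track_album_release_date = c("2019-06-14", "2019-06-28", "2019-12-13", "2019-12-20"),
  energy                   = c(0.9, 0.7, 0.5, 0.3),
  stringsAsFactors         = FALSE
)

# Extract the month number ("01" to "12") from the release date.
toy$month <- format(as.Date(toy$track_album_release_date), "%m")

# Mean energy per month.
monthly <- aggregate(energy ~ month, data = toy, FUN = mean)
print(monthly)
```

Replacing `mean` with `median`, or plotting one boxplot per month, would show whether any month stands out.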
Among the top 100 Spotify songs, which artists contribute the most? Counting the songs per distinct artist name would identify these artists, and their albums as well.
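The counting step might be sketched as follows. The toy data frame and the choice of the top 4 songs (instead of 100) are assumptions made to keep the example small.

```r
# Toy songs with made-up popularity scores.
toy <- data.frame(
  track_artist     = c("Ed Sheeran", "Maroon 5", "Ed Sheeran",
                       "Zara Larsson", "Ed Sheeran", "Maroon 5"),
  track_popularity = c(66, 67, 67, 70, 90, 50),
  stringsAsFactors = FALSE
)

# Take the most popular songs (top 4 here; top 100 on the real data).
top_songs <- head(toy[order(-toy$track_popularity), ], 4)

# Count how many of those songs each artist contributes.
artist_counts <- sort(table(top_songs$track_artist), decreasing = TRUE)
print(artist_counts)
```

Because the same track_id can appear under more than one playlist (as in the rows with missing names shown earlier), it may be worth dropping duplicated track_id values before ranking.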
Histograms, ROC curves, and similar plots would help display the findings.
I am not yet sure how to group values into bins and look at the frequency in each bin, or how to split a character variable (track_album_release_date) into year, month, and day.
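Both points can be handled in base R; the sketch below is one possible approach on made-up values. The bin width of 25 and the example dates are assumptions for illustration.

```r
# Binning: group made-up popularity scores into intervals of width 25.
popularity <- c(0, 12, 25, 46, 62, 88, 100)
bins <- cut(popularity, breaks = seq(0, 100, by = 25), include.lowest = TRUE)

# Frequency of values in each bin.
bin_counts <- table(bins)
print(bin_counts)

# Date splitting: convert the character dates, then format each part.
dates <- as.Date(c("2019-06-14", "2019-12-13"))
parts <- data.frame(
  year  = as.integer(format(dates, "%Y")),
  month = as.integer(format(dates, "%m")),
  day   = as.integer(format(dates, "%d"))
)
print(parts)
```

`cut()` returns a factor, so the result can be passed directly to `table()` or used as a grouping variable in plots.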
Linear regression, logistic regression, tree model, or cluster analysis might be used to answer the questions.