Spotify is an audio streaming platform that provides DRM-restricted music, videos and podcasts from record labels and media companies. It has over 50 million tracks, which users can browse by parameters such as artist, album, genre, or playlist. It pays artists and rights holders royalties amounting to approximately 70% of its revenue. Thus, it is also a good platform for musicians not only to showcase their talent but also to earn money.
Thus, from a musician's point of view, we have the following problem statements:
To find popular singers.
To find popular genres.
To find whether there is a relationship between artist and genre.
To find the average duration of popular songs.
To find whether there are specific parameters of a song that characterise a particular genre.
Answering these questions might help a musician identify a suitable singer and compose songs with a higher probability of becoming popular.
Methodology: We are using a Spotify dataset that was collected using the Spotify API and is available via the spotifyr package. We begin with an analysis of the data structure, then clean and process the data for further use. Next, we explore the data to find relationships between variables using lists, tables and visualizations.
library(DT)
library(tidyverse)
library(ggplot2)
The DT library is used to create an interactive table view of the dataset.
The tidyverse library is used for cleaning and summarizing the dataset.
The ggplot2 library is used for Exploratory Data Analysis (EDA).
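For completeness, the dataset can be read in before the steps below. This is a minimal sketch assuming the data is saved locally as spotify_songs.csv (an assumed file name, not shown in the original analysis); strings are read as factors to match the structure output shown later.
# Read the dataset from a local CSV file (assumed file name);
# stringsAsFactors = TRUE reproduces the factor columns shown in str() below
spotify_data = read.csv("spotify_songs.csv", stringsAsFactors = TRUE)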
There are 23 variables in the dataset.
Here are the variables we have:
Variables | Description
---|---
track_id | Song unique ID
track_name | Song name
track_artist | Song artist
track_popularity | Song popularity (0-100) where higher is better
track_album_id | Album unique ID
track_album_name | Song album name
track_album_release_date | Date when the album was released
playlist_name | Name of playlist
playlist_id | Playlist ID
playlist_genre | Playlist genre
playlist_subgenre | Playlist subgenre
danceability | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1 (see the sketch after this table).
loudness | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor by 0.
speechiness | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence that the track is acoustic.
instrumentalness | Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater the likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides a strong likelihood that the track is live.
valence | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms | Duration of the song in milliseconds
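As an illustration of the key encoding, the integer codes can be mapped to pitch names with a lookup vector. This is an optional sketch, not part of the original analysis; pitch_labels and key_name are names introduced here purely for illustration.
# Map Pitch Class integers (0-11) to note names; -1 (no key detected) becomes NA
pitch_labels = c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")
spotify_data$key_name = ifelse(spotify_data$key == -1, NA,
                               pitch_labels[spotify_data$key + 1])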
Renaming the data
We have renamed a few columns of the dataset so that the column names are not too long and their meaning is easily understood.
spotify_data = spotify_data %>%
rename(
genre = playlist_genre,
sub_genre = playlist_subgenre,
release_date = track_album_release_date,
artist = track_artist,
popularity = track_popularity,
album_name = track_album_name
)
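To confirm that the renaming worked as intended, one could list the column names; this check is an addition and not part of the original output.
# Verify the new column names
names(spotify_data)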
Number of Rows and Columns
# Number of rows and columns in the dataset
nrow(spotify_data)
## [1] 32833
ncol(spotify_data)
## [1] 23
There are 32,833 rows and 23 columns in the dataset.
# Count missing values in each column
missing_val = colSums(is.na(spotify_data))
missing_val
## track_id track_name artist popularity
## 0 5 5 0
## track_album_id album_name release_date playlist_name
## 0 5 0 0
## playlist_id genre sub_genre danceability
## 0 0 0 0
## energy key loudness mode
## 0 0 0 0
## speechiness acousticness instrumentalness liveness
## 0 0 0 0
## valence tempo duration_ms
## 0 0 0
We can see there are 5 missing values each in album_name, artist and track_name.
datatable(
head(spotify_data,50),
extensions = 'FixedColumns',
options = list(
scrollY = "400px",
scrollX = TRUE,
fixedColumns = TRUE
)
)
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
The track_name, artist and album_name columns have missing values at the same row indices.
index_track_name = which(is.na(spotify_data$track_name))
index_track_name
## [1] 8152 9283 9284 19569 19812
index_artist = which(is.na(spotify_data$artist))
index_artist
## [1] 8152 9283 9284 19569 19812
index_album_name = which(is.na(spotify_data$album_name))
index_album_name
## [1] 8152 9283 9284 19569 19812
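A quick way to confirm that all three columns are missing in exactly the same rows (this check is an addition, not part of the original output):
# TRUE if all three index vectors are identical
identical(index_track_name, index_artist) && identical(index_artist, index_album_name)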
We are not going to remove the missing values from the dataset, as they will be handled later during data modelling.
Structure of Variables
str(spotify_data)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : Factor w/ 28356 levels "0017A6SJgTbfQVU2EtsPNo",..: 22912 2531 7160 25706 4705 26672 9521 22445 26146 5283 ...
## $ track_name : Factor w/ 23449 levels "'39 - 2011 Mix",..: 9368 12887 944 3111 18360 1968 13859 15785 20934 9823 ...
## $ artist : Factor w/ 10692 levels "'Til Tuesday",..: 2848 6185 10633 9373 5530 2848 5000 8320 761 8562 ...
## $ popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : Factor w/ 22545 levels "000f3dTtvpazVzv35NuZmn",..: 7684 17645 4144 4691 21907 8636 21592 17795 21050 13719 ...
## $ album_name : Factor w/ 19743 levels "'74 - '75 (feat. Susan Tyler)",..: 7928 10684 981 2869 15185 1882 11515 13093 17788 8155 ...
## $ release_date : Factor w/ 4530 levels "1957-01-01","1957-03",..: 4316 4493 4336 4349 4221 4341 4356 4389 4316 4321 ...
## $ playlist_name : Factor w/ 449 levels "\"Permanent Wave\"",..: 309 309 309 309 309 309 309 309 309 309 ...
## $ playlist_id : Factor w/ 471 levels "0275i1VNfBnsNbPl0QIBpG",..: 237 237 237 237 237 237 237 237 237 237 ...
## $ genre : Factor w/ 6 levels "edm","latin",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ sub_genre : Factor w/ 24 levels "album rock","big room",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness: num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
Looking at the structure of the dataset, all the columns are already of the types required for EDA and modelling, so we will not be changing any data types.
Summary and Identifying Outliers
summary(spotify_data)
## track_id track_name artist
## 7BKLCZ1jbUBVqRi2FVlTVw: 10 Poison : 22 Martin Garrix : 161
## 14sOS5L36385FJ3OL8hew4: 9 Breathe : 21 Queen : 136
## 3eekarcy7kvN4yt5ZFzltW: 9 Alive : 20 The Chainsmokers: 123
## 0nbXyq5TXYPCO7pr3N8S4I: 8 Forever : 20 David Guetta : 110
## 0qaWEvPkts34WF68r8Dzx9: 8 Paradise: 19 Don Omar : 102
## 0rIAC4PXANcKmitJfoqmVm: 8 (Other) :32726 (Other) :32196
## (Other) :32781 NA's : 5 NA's : 5
## popularity track_album_id
## Min. : 0.00 5L1xcowSxwzFUSJzvyMp48: 42
## 1st Qu.: 24.00 5fstCqs5NpIlF42VhPNv23: 29
## Median : 45.00 7CjJb2mikwAWA1V6kewFBF: 28
## Mean : 42.48 4VFG1DOuTeDMBjBLZT7hCK: 26
## 3rd Qu.: 62.00 2HTbQ0RHwukKVXAlTmCZP2: 21
## Max. :100.00 4CzT5ueFBRpbILw34HQYxi: 21
## (Other) :32666
## album_name release_date
## Greatest Hits : 139 2020-01-10: 270
## Ultimate Freestyle Mega Mix: 42 2019-11-22: 244
## Gold : 35 2019-12-06: 235
## Malibu : 30 2019-12-13: 220
## Rock & Rios (Remastered) : 29 2013-01-01: 219
## (Other) :32553 2019-11-15: 215
## NA's : 5 (Other) :31430
## playlist_name
## Indie Poptimism : 308
## 2020 Hits & 2019 Hits â\200“ Top Global Tracks 🔥🔥🔥 : 247
## Permanent Wave : 244
## Hard Rock Workout : 219
## Ultimate Indie Presents... Best Indie Tracks of the 2010s : 198
## Fitness Workout Electro | House | Dance | Progressive House: 195
## (Other) :31422
## playlist_id genre sub_genre
## 4JkkvMpVl4lSioqQjeAL0q: 247 edm :6043 progressive electro house: 1809
## 37i9dQZF1DWTHM4kX49UKs: 198 latin:5155 southern hip hop : 1675
## 6KnQDwp0syvhfHOR4lWP7x: 195 pop :5507 indie poptimism : 1672
## 3xMQTDLOIGvj3lWH5e5x6F: 189 r&b :5431 latin hip hop : 1656
## 3Ho3iO0iJykgEQNbjB2sic: 182 rap :5746 neo soul : 1637
## 25ButZrVb1Zj1MJioMs09D: 109 rock :4951 pop edm : 1517
## (Other) :31713 (Other) :22867
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171
## Median :0.6720 Median :0.721000 Median : 6.000 Median : -6.166
## Mean :0.6548 Mean :0.698619 Mean : 5.374 Mean : -6.720
## 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
##
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median :1.0000 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean :0.5657 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
##
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
##
At first glance at the summary above, we can see that a few of the columns, such as instrumentalness, liveness and acousticness, contain outliers.
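One way to make these outliers visible is a box plot of the affected features. This is a minimal sketch added here for illustration; it reshapes the three columns into long format and draws one box per feature using the packages already loaded.
# Box plots of the features that appear to contain outliers
spotify_data %>%
  select(instrumentalness, liveness, acousticness) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "value") %>%
  ggplot(aes(x = feature, y = value)) +
  geom_boxplot()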
The main objective of the exploratory data analysis is to determine patterns between the variables of interest and discover insights that help answer our questions.
We will be using ggplot2 to make plots that illustrate and describe the data.
Example - Scatter Plots
ggplot(spotify_data, aes(x = energy, y = liveness , color = genre)) +
geom_point()
Filtering the data to the artist Ed Sheeran and displaying the same scatter plot for that subset.
spotify_edsheeran = spotify_data %>%
filter(artist == "Ed Sheeran")
ggplot(spotify_edsheeran, aes(x = energy, y = liveness , color = genre)) +
geom_point()
Example - Bar Graph
ggplot(spotify_data, aes(x = genre, y = duration_ms)) +
geom_col()
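Note that geom_col() sums duration_ms within each genre, so the bar heights reflect both track duration and the number of tracks per genre. To compare typical durations across genres (relevant to the question about the average duration of songs), one could instead plot the mean duration; this is an illustrative sketch, not part of the original analysis.
# Average track duration (in minutes) per genre
spotify_data %>%
  group_by(genre) %>%
  summarise(mean_duration_min = mean(duration_ms, na.rm = TRUE) / 60000) %>%
  ggplot(aes(x = genre, y = mean_duration_min)) +
  geom_col()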
Correlation plots, scatter plots and bar plots are some of the plots that will help us with our analysis.
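As a preview of the correlation plot, a heatmap of pairwise correlations between the numeric audio features can be built with the packages already loaded. This is a minimal sketch assuming the renamed spotify_data frame; it is not part of the original analysis.
# Correlation heatmap of the numeric audio features
spotify_data %>%
  select(danceability, energy, loudness, speechiness, acousticness,
         instrumentalness, liveness, valence, tempo, duration_ms) %>%
  cor(use = "complete.obs") %>%
  as.data.frame() %>%
  rownames_to_column("feature1") %>%
  pivot_longer(-feature1, names_to = "feature2", values_to = "correlation") %>%
  ggplot(aes(x = feature1, y = feature2, fill = correlation)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))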