ANALYZING SONGS FROM SPOTIFY
Spotify is an audio streaming provider which offers recorded music and podcasts for more than 70 million songs. The focus of this project is to analyze data about various songs which are streaming on Spotify and uncover some interesting trends and most important factors that lead to the popularity of songs.
library(tidyr) # Tidyr package is used to tidy data. Tidy data is data that’s easy to work with.
library(dplyr) # dplyr is a package for making tabular data wrangling easier by using a limited set of functions that can be combined to extract and summarize insights from your data
library(ggplot2) # This package can be used to create interesting visualisations
library(DT) # It is used to display tables in HTML
Source: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md
Importing the data into R:
spotify <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
knitr::kable(head(spotify), align = "lccrr")
| track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6f807x0ima9a1j3VPbc7VN | I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 |
| 0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 |
| 1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 |
| 75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 |
| 1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 |
| 7fvUMiyapMsRRxr07cU8Ef | Beautiful People (feat. Khalid) - Jack Wins Remix | Ed Sheeran | 67 | 2yiy9cd2QktrNvWC2EUi0k | Beautiful People (feat. Khalid) [Jack Wins Remix] | 2019-07-11 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.919 | 8 | -5.385 | 1 | 0.1270 | 0.0799 | 0.00e+00 | 0.1430 | 0.585 | 124.982 | 163049 |
The data contains 32,833 rows and 23 columns in the data and was collected in Jan-20. Our target variable is track_popularity, and the data has various other features dealing with our target variable like song credits and song features like danceability, energy, acousticness etc. There 15 missing values in the data which are removed in data cleaning.
d <- read.csv("spotify_dd.csv")
DT::datatable(d,options = list( pageLength=50, scrollX = T,autoWidth = TRUE),class = 'cell-border stripe')
Data Cleaning
ids <- c("track_id","track_album_id","playlist_id")
spotify.data <- data.frame(spotify[,!(names(spotify) %in% ids)])
Identifying missing data in the dataset and removing the rows
colSums(is.na(spotify.data))
## track_name track_artist track_popularity
## 5 5 0
## track_album_name track_album_release_date playlist_name
## 5 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
spotify.data <- na.omit(spotify.data)
colSums(is.na(spotify.data))
## track_name track_artist track_popularity
## 0 0 0
## track_album_name track_album_release_date playlist_name
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
Converting categorical variables ‘key’ and ‘mode’ as factors
spotify.data$key <- as.factor(spotify.data$key)
spotify.data$mode <- as.factor(spotify.data$mode)
knitr::kable(head(spotify.data), align = "lccrr")
| track_name | track_artist | track_popularity | track_album_name | track_album_release_date | playlist_name | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 |
| Memories - Dillon Francis Remix | Maroon 5 | 67 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 |
| All the Time - Don Diablo Remix | Zara Larsson | 70 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 |
| Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 |
| Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 |
| Beautiful People (feat. Khalid) - Jack Wins Remix | Ed Sheeran | 67 | Beautiful People (feat. Khalid) [Jack Wins Remix] | 2019-07-11 | Pop Remix | pop | dance pop | 0.675 | 0.919 | 8 | -5.385 | 1 | 0.1270 | 0.0799 | 0.00e+00 | 0.1430 | 0.585 | 124.982 | 163049 |
Checking for outliers and distribution for each numerical column:
Boxplots
num_cols<-c('danceability','energy','loudness','speechiness','acousticness','instrumentalness',
'liveness','valence','tempo')
par(mfrow=c(3,3))
for (i in num_cols){
boxplot(spotify.data[[i]], main=sprintf('Histogram of %s',i))
}
Histograms
par(mfrow=c(3,3))
for (i in num_cols){
hist(x = spotify.data[[i]],
col="blue",
lty=1,
freq = FALSE,
main = sprintf('Histogram of %s',i),
xlab = i)
lines(density(x = spotify.data[[i]],na.rm=TRUE), lwd=2, col='red')
}
Outliers: There are a few outliers we can see from the box plots but we see that the majority of these metrics are normally distributed. Manipulating these outliers may not add much value to the analysis.
d1 <- read.csv("summary.csv")
DT::datatable(d1,options = list(pageLength=50, scrollX = T,autoWidth = TRUE),class = 'cell-border stripe')
We will be doing a correlation analysis to check if there is any correlation between the variables themselves
We can create a few new columns from the existing columns to check if they can explain the data in a better way. For Example, checking if ‘remix’ songs are more popular than other songs by creating a new column for remix with values 0 or 1
We can summarize the data mainly by picking most important qualitative variables such as genre and some numerical variables such as energy, beat and danceability to see how they are impacting popularity