Problem Statement: The dataset for this project is based on the music streaming application Spotify. It contains more than 32,000 track records across various genres, artists and sentimental factors such as motivation, energy, romance, etc. Our goal is to analyze the popular genres among the Spotify population and categorize the artists and their famous tracks on the basis of their genre. Moreover, for each sentiment we will observe the most played artists and their tracks.
Motivation for choosing this topic: Music reaches our feelings and connects people across boundaries. Our analysis will help people choose the most popular tracks according to their mood.
Solution: Our study of the dataset provides key insights into which songs to play according to one’s mood and identifies the artist with the highest number of songs in each of the six genres.
We used the following packages to analyze the dataset:
library(knitr) #display an aligned table on the screen
library(readr) #load .csv file
library(ggplot2) #visualize the data
library(dplyr) #manipulate data
library(tidyr) #tidy the data
library(DT) #output data in a table
library(GGally) #visualize the data
library(plotly) #visualize the data
Steps followed to prepare data for analysis:
Data Import
The spotify_songs data file can be downloaded from the TidyTuesday GitHub repository, as in the code below. The dataset originally comes from the spotifyr package, which was written to make it easy for anyone to get their own data or general metadata about songs from Spotify’s API.
#Loading the dataset
sp_data <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
Data Description
#Display the dimensions of raw dataset
dim(sp_data)
## [1] 32833 23
The dataset contains 32833 observations and 23 variables. Names of the variables are below:
colnames(sp_data)
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_ms"
Not all of the 23 variables are relevant for our analysis. First, we check the data types: seven of the character variables are better cast to factors for the analysis.
#Checking the datatype of the columns
lapply(sp_data, typeof)
## $track_id
## [1] "character"
##
## $track_name
## [1] "character"
##
## $track_artist
## [1] "character"
##
## $track_popularity
## [1] "double"
##
## $track_album_id
## [1] "character"
##
## $track_album_name
## [1] "character"
##
## $track_album_release_date
## [1] "character"
##
## $playlist_name
## [1] "character"
##
## $playlist_id
## [1] "character"
##
## $playlist_genre
## [1] "character"
##
## $playlist_subgenre
## [1] "character"
##
## $danceability
## [1] "double"
##
## $energy
## [1] "double"
##
## $key
## [1] "double"
##
## $loudness
## [1] "double"
##
## $mode
## [1] "double"
##
## $speechiness
## [1] "double"
##
## $acousticness
## [1] "double"
##
## $instrumentalness
## [1] "double"
##
## $liveness
## [1] "double"
##
## $valence
## [1] "double"
##
## $tempo
## [1] "double"
##
## $duration_ms
## [1] "double"
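The list output above is long. A more compact overview of the column types can be obtained by collapsing the result into a named character vector. The following is a minimal sketch on a small made-up data frame; `toy_songs` is a hypothetical stand-in for `sp_data`, not part of the original analysis.

```r
# Toy stand-in for sp_data (values are made up for illustration)
toy_songs <- data.frame(
  track_name       = c("Song A", "Song B"),
  track_popularity = c(66, 67),
  danceability     = c(0.748, 0.726),
  stringsAsFactors = FALSE
)

# vapply() returns a named character vector instead of a list
col_types <- vapply(toy_songs, typeof, character(1))
print(col_types)

# Tabulate how many columns there are of each storage type
type_counts <- table(col_types)
print(type_counts)
```

Applied to `sp_data`, the same two calls would summarise the 23 columns without printing one list element per column.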
The variable “playlist_genre” contains 6 distinct categories and “playlist_subgenre” contains 24, so converting them to factors makes them easier to analyze. We also prune the variable list and explore the dataset further with respect to the retained variables only.
#Converting the non-numerical variables into categorical variables
sp_data$track_id <- as.factor(sp_data$track_id)
sp_data$track_artist <- as.factor(sp_data$track_artist)
sp_data$track_name <- as.factor(sp_data$track_name)
sp_data$track_album_name <- as.factor(sp_data$track_album_name)
sp_data$playlist_name <- as.factor(sp_data$playlist_name)
sp_data$playlist_genre <- as.factor(sp_data$playlist_genre)
sp_data$playlist_subgenre <- as.factor(sp_data$playlist_subgenre)
#Selecting the interesting variables
sp_songs <- select(sp_data,-c(5,7,9,14:19,22))
dim(sp_songs)
## [1] 32833 13
colnames(sp_songs)
## [1] "track_id" "track_name" "track_artist"
## [4] "track_popularity" "track_album_name" "playlist_name"
## [7] "playlist_genre" "playlist_subgenre" "danceability"
## [10] "energy" "liveness" "valence"
## [13] "duration_ms"
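The conversion and selection steps above can also be written by column name, which keeps working if the column order of the source file changes. This is a sketch under assumptions: `toy_songs` is a made-up stand-in for `sp_data`, and `across()` requires dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy stand-in for sp_data (values are made up for illustration)
toy_songs <- tibble(
  track_id       = c("a1", "b2", "a1"),
  track_album_id = c("x", "y", "x"),
  playlist_genre = c("pop", "rock", "pop"),
  energy         = c(0.9, 0.8, 0.9),
  key            = c(6, 11, 6)
)

toy_selected <- toy_songs %>%
  # Convert several character columns to factors in one step
  mutate(across(c(track_id, playlist_genre), as.factor)) %>%
  # Drop columns by name rather than by position
  select(-track_album_id, -key)

print(colnames(toy_selected))
```

Selecting by name also documents which variables are dropped, which positional indices such as `14:19` do not.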
Now, we have shortened our variable list from 23 to 13. The metadata for these variables is provided below:
Variable Name | Description |
---|---|
track_id | Song unique ID |
track_name | Song Name |
track_artist | Song Artist |
track_popularity | Song Popularity (0-100) where higher is better |
track_album_name | Song album name |
playlist_name | Name of playlist |
playlist_genre | Playlist genre |
playlist_subgenre | Playlist subgenre |
danceability | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
energy | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
liveness | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
valence | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
duration_ms | Duration of song in milliseconds |
#Dimensions of the updated dataset
dim(sp_songs)
## [1] 32833 13
Renaming the columns:
sp_songs = sp_songs %>% rename(track_danceability = danceability,
track_energy_level = energy,
live_performed = liveness,
musical_positivity = valence,
song_duration = duration_ms
)
Finding missing values
#Finding missing values
missing = colSums(is.na(sp_songs))
missing
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_name playlist_name
## 0 5 0
## playlist_genre playlist_subgenre track_danceability
## 0 0 0
## track_energy_level live_performed musical_positivity
## 0 0 0
## song_duration
## 0
There are 5 missing values each in track_name, track_artist and track_album_name. These correspond to the same 5 observations, which make up about 0.015% of the dataset. We do not want to keep the missing values, so we remove these rows.
sp_songs = na.omit(sp_songs)
#Dimensions of the cleansed dataset
dim(sp_songs)
## [1] 32828 13
Our final dataset after removing missing observations contains 32828 observations with 13 variables.
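Whether the missing values fall in the same rows, and what share of the data is dropped, can be checked directly before calling `na.omit()`. The sketch below uses a made-up data frame; `toy_songs` and its values are hypothetical.

```r
# Toy stand-in for sp_songs with two incomplete rows (made-up values)
toy_songs <- data.frame(
  track_name   = c("Song A", NA, "Song C", NA),
  track_artist = c("Artist A", NA, "Artist C", NA),
  energy       = c(0.9, 0.8, 0.7, 0.6),
  stringsAsFactors = FALSE
)

# Number of rows with at least one missing value
n_incomplete <- sum(!complete.cases(toy_songs))

# TRUE when the NAs of both columns sit in exactly the same rows
same_rows <- all(is.na(toy_songs$track_name) == is.na(toy_songs$track_artist))

# Share of rows that na.omit() would drop, in percent
pct_dropped <- 100 * n_incomplete / nrow(toy_songs)

toy_clean <- na.omit(toy_songs)
```

For the real data the corresponding share is 100 * 5 / 32833, or roughly 0.015%.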
Table View of the dataset:
datatable(head(sp_songs,100),extensions = 'FixedColumns', options = list(scrollX = TRUE, scrollY = "400px",fixedColumns= TRUE))
Structure of the data
str(sp_songs)
## Classes 'tbl_df', 'tbl' and 'data.frame': 32828 obs. of 13 variables:
## $ track_id : Factor w/ 28356 levels "0017A6SJgTbfQVU2EtsPNo",..: 22912 2531 7160 25706 4705 26672 9521 22445 26146 5283 ...
## $ track_name : Factor w/ 23449 levels "'39 - 2011 Mix",..: 9370 12876 1076 3237 18356 2101 13856 15783 20922 9819 ...
## $ track_artist : Factor w/ 10692 levels "'Til Tuesday",..: 2840 6171 10632 9370 5519 2840 4993 8313 771 8556 ...
## $ track_popularity : num 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_name : Factor w/ 19743 levels "'74 - '75 (feat. Susan Tyler)",..: 7926 10675 1059 2942 15182 1959 11512 13091 17780 8153 ...
## $ playlist_name : Factor w/ 449 levels "\"Permanent Wave\"",..: 309 309 309 309 309 309 309 309 309 309 ...
## $ playlist_genre : Factor w/ 6 levels "edm","latin",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ playlist_subgenre : Factor w/ 24 levels "album rock","big room",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ track_danceability: num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ track_energy_level: num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ live_performed : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ musical_positivity: num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ song_duration : num 194754 162600 176616 169093 189052 ...
## - attr(*, "na.action")= 'omit' Named int 8152 9283 9284 19569 19812
## ..- attr(*, "names")= chr "8152" "9283" "9284" "19569" ...
The cleansed data contains information about tracks, artists, genre, duration and other relevant information. We know there are 7 categorical and 6 numerical variables now.
Inference:
Converting the non-numerical variables to factors was useful: it reveals that columns such as track_id (the unique song ID) contain duplicate values, since track_id has 28356 levels, fewer than the 32828 observations. These 7 categorical variables will be used in the further analysis to group values together and derive specific results. Three of them have a small enough number of categories to yield insights that are easy to interpret.
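The duplicate track IDs can be counted explicitly instead of being read off the factor levels. A minimal sketch, assuming a made-up tibble `toy_songs` in which one track appears in several playlists:

```r
library(dplyr)

# Toy stand-in for sp_songs: track "a1" appears in three playlists
toy_songs <- tibble(
  track_id      = c("a1", "b2", "a1", "c3", "a1"),
  playlist_name = c("P1", "P1", "P2", "P2", "P3")
)

# Number of rows versus number of distinct tracks
n_rows   <- nrow(toy_songs)
n_tracks <- n_distinct(toy_songs$track_id)

# Tracks that occur more than once, most repeated first
repeated <- toy_songs %>%
  count(track_id, sort = TRUE) %>%
  filter(n > 1)

print(repeated)
```

A gap between `n_rows` and `n_tracks` means that row counts per artist or genre count some tracks more than once.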
Summary statistics of variables
#Summary of the data
summary(sp_songs)
## track_id track_name track_artist
## 7BKLCZ1jbUBVqRi2FVlTVw: 10 Poison : 22 Martin Garrix : 161
## 14sOS5L36385FJ3OL8hew4: 9 Breathe : 21 Queen : 136
## 3eekarcy7kvN4yt5ZFzltW: 9 Alive : 20 The Chainsmokers: 123
## 0nbXyq5TXYPCO7pr3N8S4I: 8 Forever : 20 David Guetta : 110
## 0qaWEvPkts34WF68r8Dzx9: 8 Paradise: 19 Don Omar : 102
## 0rIAC4PXANcKmitJfoqmVm: 8 Stay : 19 Drake : 100
## (Other) :32776 (Other) :32707 (Other) :32096
## track_popularity track_album_name
## Min. : 0.00 Greatest Hits : 139
## 1st Qu.: 24.00 Ultimate Freestyle Mega Mix: 42
## Median : 45.00 Gold : 35
## Mean : 42.48 Malibu : 30
## 3rd Qu.: 62.00 Rock & Rios (Remastered) : 29
## Max. :100.00 Appetite For Destruction : 28
## (Other) :32525
## playlist_name
## Indie Poptimism : 308
## 2020 Hits & 2019 Hits – Top Global Tracks <U+0001F525><U+0001F525><U+0001F525>: 247
## Permanent Wave : 244
## Hard Rock Workout : 219
## Ultimate Indie Presents... Best Indie Tracks of the 2010s : 198
## Fitness Workout Electro | House | Dance | Progressive House : 195
## (Other) :31417
## playlist_genre playlist_subgenre track_danceability
## edm :6043 progressive electro house: 1809 Min. :0.0000
## latin:5153 southern hip hop : 1674 1st Qu.:0.5630
## pop :5507 indie poptimism : 1672 Median :0.6720
## r&b :5431 latin hip hop : 1655 Mean :0.6549
## rap :5743 neo soul : 1637 3rd Qu.:0.7610
## rock :4951 pop edm : 1517 Max. :0.9830
## (Other) :22864
## track_energy_level live_performed musical_positivity song_duration
## Min. :0.000175 Min. :0.0000 Min. :0.0000 Min. : 4000
## 1st Qu.:0.581000 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.:187805
## Median :0.721000 Median :0.1270 Median :0.5120 Median :216000
## Mean :0.698603 Mean :0.1902 Mean :0.5106 Mean :225797
## 3rd Qu.:0.840000 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:253581
## Max. :1.000000 Max. :0.9960 Max. :0.9910 Max. :517810
##
The above summary statistics show that there are no missing values left in the dataset. The results for the categorical columns are consistent with the inferences made earlier.
Inference on numerical variables:
#Summary Statistics Table
data.table::data.table(
Variable.Name = c("track_popularity",
"track_danceability","track_energy_level","live_performed",
"musical_positivity","song_duration(in min)"),
Min = c(0, 0, 0.000175, 0, 0, 0.067),
Mean = c(42.48, 0.65, 0.698, 0.19, 0.51, 3.76),
Median = c(45, 0.67, 0.721, 0.13, 0.51, 3.6),
Max = c(100, 0.98, 1, 0.99, 0.99, 8.63)
)
## Variable.Name Min Mean Median Max
## 1: track_popularity 0.000000 42.480 45.000 100.00
## 2: track_danceability 0.000000 0.650 0.670 0.98
## 3: track_energy_level 0.000175 0.698 0.721 1.00
## 4: live_performed 0.000000 0.190 0.130 0.99
## 5: musical_positivity 0.000000 0.510 0.510 0.99
## 6: song_duration(in min) 0.067000 3.760 3.600 8.63
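The table above was typed in by hand from the `summary()` output. The same statistics could be computed directly, which avoids transcription errors. The sketch below is one possible way, using base R on a made-up data frame; `toy_songs` and its values are hypothetical.

```r
# Toy stand-in for the numeric columns of sp_songs (made-up values)
toy_songs <- data.frame(
  track_popularity = c(0, 40, 50, 100),
  song_duration    = c(120000, 180000, 240000, 300000)  # milliseconds
)

# Convert duration from milliseconds to minutes before summarising
toy_songs$song_duration <- toy_songs$song_duration / 60000

# sapply() gives one column per variable; t() turns that into one row per variable
stats <- t(sapply(toy_songs, function(x) {
  c(Min = min(x), Mean = mean(x), Median = median(x), Max = max(x))
}))

summary_table <- data.frame(Variable.Name = rownames(stats), round(stats, 2))
print(summary_table)
```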
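For track_popularity, a higher value means a more popular song. The remaining metrics describe a track’s character (how danceable, energetic, live or positive it sounds) rather than its quality.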
(count <- sp_songs %>% count(playlist_genre) %>% knitr::kable())
playlist_genre | n |
---|---|
edm | 6043 |
latin | 5153 |
pop | 5507 |
r&b | 5431 |
rap | 5743 |
rock | 4951 |
Inference: The table shows us the count of songs in each genre.
ggcorr(sp_songs,label = TRUE)
Inference: The above graph shows no strong correlation among the variables. track_danceability and musical_positivity have the highest correlation, 0.3.
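The correlations can also be inspected as numbers with base R's `cor()`, which computes Pearson correlations by default. This is a sketch on made-up values; `toy_songs` is a hypothetical stand-in for `sp_songs`.

```r
# Toy stand-in for sp_songs (made-up values)
toy_songs <- data.frame(
  playlist_genre     = c("pop", "rock", "pop", "rap", "rock"),
  track_danceability = c(0.75, 0.40, 0.68, 0.80, 0.45),
  musical_positivity = c(0.60, 0.35, 0.52, 0.70, 0.30),
  song_duration      = c(200000, 250000, 210000, 180000, 260000)
)

# Keep only numeric columns, since cor() needs numeric input
numeric_cols <- toy_songs[sapply(toy_songs, is.numeric)]

# Correlation matrix of all numeric variables
cor_matrix <- cor(numeric_cols)
print(round(cor_matrix, 1))

# A single pair can be looked up by name
dance_valence <- cor_matrix["track_danceability", "musical_positivity"]
```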
green <- "#1ed760"
yellow <- "#e7e247"
pink <- "#ff6f59"
blue <- "#17bebb"
orange <- "#ffa500"
grey <- "#808080"
#Plotting density distributions
#1. Danceability feature
viz1 <- ggplot(sp_songs, aes(x=track_danceability, fill=playlist_genre,
text = paste(playlist_genre)))+
geom_density(alpha=0.7, color=NA)+
scale_fill_manual(values=c(green, yellow, grey, blue, orange, pink))+
labs(x="Danceability", y="Density") +
guides(fill=guide_legend(title="Genres"))+
theme_minimal()+
ggtitle("Distribution of Danceability Data")
ggplotly(viz1, tooltip=c("text"))
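Inference: The distributions of all genres are concentrated toward the high end of the danceability scale, with a longer tail toward low values (left-skewed), except for the rock genre, which is roughly symmetric. We can also infer that the latin genre has the highest peak density.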
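Skewness can also be checked numerically rather than read off the density plot. The sketch below defines a simple moment-based skewness helper and applies it per genre; the helper `skewness()` and the data in `toy_songs` are illustrative assumptions, not part of the original analysis.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5.
# Negative values indicate a longer left tail, positive a longer right tail.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Toy stand-in for sp_songs (made-up values)
toy_songs <- data.frame(
  playlist_genre     = rep(c("latin", "rock"), each = 5),
  track_danceability = c(0.20, 0.70, 0.75, 0.80, 0.85,  # tail toward low values
                         0.30, 0.40, 0.50, 0.60, 0.70)  # symmetric around 0.5
)

# One skewness value per genre
skew_by_genre <- tapply(toy_songs$track_danceability,
                        toy_songs$playlist_genre,
                        skewness)
print(round(skew_by_genre, 2))
```

On the real data, the sign of each genre's value would indicate on which side of the peak the longer tail lies.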
#2. Popularity feature
viz2 <- ggplot(sp_songs, aes(x=track_popularity, fill=playlist_genre,
text = paste(playlist_genre)))+
geom_density(alpha=0.7, color=NA)+
scale_fill_manual(values=c(green, yellow, grey, blue, orange, pink))+
labs(x="Tracks popularity score", y="Density") +
guides(fill=guide_legend(title="Genres"))+
theme_minimal()+
ggtitle("Distribution of Tracks popularity")
ggplotly(viz2, tooltip=c("text"))
Inference: Every genre contains some tracks with low popularity, but the majority of tracks are spread across the popularity range of 15-100.
gen_valence <- sp_songs %>%
group_by(playlist_genre)%>%
mutate(max=max(musical_positivity))%>%
mutate(min=min(musical_positivity))%>%
select(playlist_genre, max, min)%>%
unique()
viz3 <- plot_ly(gen_valence, color = I("gray80"),
hoverinfo = 'text') %>%
add_segments(x = ~max, xend = ~min, y = ~playlist_genre, yend = ~playlist_genre, showlegend = FALSE) %>%
add_markers(x = ~max, y = ~playlist_genre, name = "High Positivity", color = I(pink), text=~paste('Max Valence: ', max)) %>%
add_markers(x = ~min, y = ~playlist_genre, name = "Low Positivity", color = I(blue), text=~paste('Min Valence: ', min))%>%
layout(
title = "Genres' Positivity Range",
xaxis = list(title = "Positivity Level"),
yaxis= list(title=""))
viz3 #already a plotly object, so no ggplotly() conversion is needed
Inference: The above graph shows the range of musical positivity in each genre. Rock and r&b have the highest maximum positivity, while latin and rock have the lowest minimum positivity.
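The per-genre range used for this kind of plot can be computed with `summarise()`, which collapses each group to one row. A minimal sketch on made-up data; `toy_songs` and the column names `max_valence` and `min_valence` are hypothetical.

```r
library(dplyr)

# Toy stand-in for sp_songs (made-up values)
toy_songs <- tibble(
  playlist_genre     = c("pop", "pop", "rock", "rock", "rock"),
  musical_positivity = c(0.20, 0.90, 0.05, 0.50, 0.99)
)

# summarise() returns one row per genre directly,
# so no select() and unique() step is needed afterwards
toy_range <- toy_songs %>%
  group_by(playlist_genre) %>%
  summarise(
    max_valence = max(musical_positivity),
    min_valence = min(musical_positivity)
  )

print(toy_range)
```

Naming the result columns something other than `max` and `min` also avoids confusion with the functions of the same name.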
gen_energy <- sp_songs %>%
group_by(playlist_genre)%>%
mutate(max=max(track_energy_level))%>%
mutate(min=min(track_energy_level))%>%
select(playlist_genre, max, min)%>%
unique()
viz4 <- plot_ly(gen_energy, color = I("gray80"),
hoverinfo = 'text') %>%
add_segments(x = ~max, xend = ~min, y = ~playlist_genre, yend = ~playlist_genre, showlegend = FALSE) %>%
add_markers(x = ~max, y = ~playlist_genre, name = "Maximum Energy Level Value", color = I(pink), text=~paste('Max Energy Level: ', max)) %>%
add_markers(x = ~min, y = ~playlist_genre, name = "Minimum Energy Level Value", color = I(blue), text=~paste('Min Energy Level: ', min))%>%
layout(
title = "Genres' Energy Level Range",
xaxis = list(title = "Energy Level"),
yaxis= list(title=""))
viz4 #already a plotly object, so no ggplotly() conversion is needed
Inference: The above graph shows the range of energy levels in each genre. Pop and latin have the lowest minimum energy level, while all the genres reach the top of the scale at their maximum.
sp_most_played <- sp_songs %>%
  group_by(playlist_genre) %>%
  count(track_artist) %>%
  arrange(-n) %>%
  top_n(1, n) #rank by the track count n explicitly
datatable(sp_most_played)
Inference: The above table gives the artist with the largest number of tracks in each genre, along with that track count.
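The top artist per genre could also be obtained with `slice_max()`. This is a sketch under assumptions: `slice_max()` requires dplyr 1.0.0 or later, and `toy_songs` and the column name `n_tracks` are made up for illustration.

```r
library(dplyr)

# Toy stand-in for sp_songs (made-up values)
toy_songs <- tibble(
  playlist_genre = c("pop", "pop", "pop", "rock", "rock", "rock"),
  track_artist   = c("Artist A", "Artist A", "Artist B",
                     "Artist C", "Artist D", "Artist D")
)

top_artists <- toy_songs %>%
  # One row per genre and artist with the number of rows for that pair
  count(playlist_genre, track_artist, name = "n_tracks") %>%
  group_by(playlist_genre) %>%
  # Keep the artist(s) with the most rows; ties are kept by default
  slice_max(n_tracks, n = 1) %>%
  ungroup()

print(top_artists)
```

Because a track can appear in several playlists, these counts are row counts. If counts of distinct tracks are wanted, one option is to call `distinct(track_id, .keep_all = TRUE)` on the data first.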
#Finding the most popular track and its artist as per mood
final <- sp_songs %>%
group_by(playlist_genre, playlist_subgenre) %>%
select(c(2:4,9,10,12)) %>%
slice(which.max(track_popularity)) %>%
arrange(-track_popularity)
datatable(final[with(final, order(playlist_genre, playlist_subgenre)),])
Inference: The above table provides the most popular song and its artist for each genre and for each subgenre within the genres.
Inference:
Based on your mood, choose the most popular track in each genre from the table above. You have your Spotify…Music for every mood!