Spotify is one of the largest music streaming services in the world. With 271 million monthly active users, including 124 million paying subscribers, it is an ideal platform for artists to reach their audience. At the heart of Spotify lies a massive and growing dataset. What if we could analyze the music we listen to using data science?
In this analysis, we mine the insights hidden in the Spotify data to better understand the genres, tracks and artists that listeners have been streaming on Spotify.
Broadly, we will be performing the following steps to accomplish the project objectives:
Our analysis can help explain consumer behavior, suggest what music listeners are looking for, and hence provide direction to artists and music producers.
Let’s explore the Spotify dataset to discover the patterns and insights.
The following packages have been used for the analysis:
library(dplyr)
library(ggplot2)
library(plotly)
library(tidyverse)
library(kableExtra)
library(DT)
library(corrplot)
library(gridExtra)
library(treemap)
library(viridisLite)
library(fmsb)
library(cowplot)
library(factoextra)
library(formattable)
We use a subset of Spotify tracks’ metadata. This dataset was created using the spotifyr package and can be downloaded from this link
The dataset consists of 32,833 observations, one per track entry, each described by 23 attributes.
Below is the detailed data dictionary to understand all the variables present in the dataset.
DataDictionary <- read.csv("DataDictionary.csv")
songs <- read.csv("spotify_songs.csv")
DataDictionary%>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed", "responsive"), full_width = F)
variable | class | description |
---|---|---|
track_id | character | Song unique ID |
track_name | character | Song Name |
track_artist | character | Song Artist |
track_popularity | double | Song Popularity (0-100) where higher is better |
track_album_id | character | Album unique ID |
track_album_name | character | Song album name |
track_album_release_date | character | Date when album released |
playlist_name | character | Name of playlist |
playlist_id | character | Playlist ID |
playlist_genre | character | Playlist genre |
playlist_subgenre | character | Playlist subgenre |
danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
mode | double | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
speechiness | double | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
acousticness | double | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
instrumentalness | double | Predicts whether a track contains no vocals. ‘Ooh’ and ‘aah’ sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly ‘vocal’. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
liveness | double | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
tempo | double | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
duration_ms | double | Duration of song in milliseconds |
Several of the features used to describe songs in the dataset are scaled between 0 and 1 for ease of comparison and interpretability.
We now take a look at the structure and summary statistics of the dataset. The summaries help us spot anomalies such as negative values, and they indicate which fields have missing values and how many.
str(songs)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : Factor w/ 28356 levels "0017A6SJgTbfQVU2EtsPNo",..: 22912 2531 7160 25706 4705 26672 9521 22445 26146 5283 ...
## $ track_name : Factor w/ 23449 levels "_away","¡Corre!",..: 9042 12696 896 3057 18176 1921 13676 15603 20742 9637 ...
## $ track_artist : Factor w/ 10692 levels "_tag","-M-","!!!",..: 2818 6149 10610 9348 5497 2818 4971 8291 749 8534 ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : Factor w/ 22545 levels "000f3dTtvpazVzv35NuZmn",..: 7684 17645 4144 4691 21907 8636 21592 17795 21050 13719 ...
## $ track_album_name : Factor w/ 19743 levels "_away","!","¡Hola!",..: 7760 10569 951 2836 15075 1853 11406 12985 17674 8047 ...
## $ track_album_release_date: Factor w/ 4530 levels "1957-01-01","1957-03",..: 4316 4493 4336 4349 4221 4341 4356 4389 4316 4321 ...
## $ playlist_name : Factor w/ 449 levels "¡Viva Latino!",..: 309 309 309 309 309 309 309 309 309 309 ...
## $ playlist_id : Factor w/ 471 levels "0275i1VNfBnsNbPl0QIBpG",..: 237 237 237 237 237 237 237 237 237 237 ...
## $ playlist_genre : Factor w/ 6 levels "edm","latin",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ playlist_subgenre : Factor w/ 24 levels "album rock","big room",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
summary(songs)
## track_id track_name track_artist
## 7BKLCZ1jbUBVqRi2FVlTVw: 10 Poison : 22 Martin Garrix : 161
## 14sOS5L36385FJ3OL8hew4: 9 Breathe : 21 Queen : 136
## 3eekarcy7kvN4yt5ZFzltW: 9 Alive : 20 The Chainsmokers: 123
## 0nbXyq5TXYPCO7pr3N8S4I: 8 Forever : 20 David Guetta : 110
## 0qaWEvPkts34WF68r8Dzx9: 8 Paradise: 19 Don Omar : 102
## 0rIAC4PXANcKmitJfoqmVm: 8 (Other) :32726 (Other) :32196
## (Other) :32781 NA's : 5 NA's : 5
## track_popularity track_album_id
## Min. : 0.00 5L1xcowSxwzFUSJzvyMp48: 42
## 1st Qu.: 24.00 5fstCqs5NpIlF42VhPNv23: 29
## Median : 45.00 7CjJb2mikwAWA1V6kewFBF: 28
## Mean : 42.48 4VFG1DOuTeDMBjBLZT7hCK: 26
## 3rd Qu.: 62.00 2HTbQ0RHwukKVXAlTmCZP2: 21
## Max. :100.00 4CzT5ueFBRpbILw34HQYxi: 21
## (Other) :32666
## track_album_name track_album_release_date
## Greatest Hits : 139 2020-01-10: 270
## Ultimate Freestyle Mega Mix: 42 2019-11-22: 244
## Gold : 35 2019-12-06: 235
## Malibu : 30 2019-12-13: 220
## Rock & Rios (Remastered) : 29 2013-01-01: 219
## (Other) :32553 2019-11-15: 215
## NA's : 5 (Other) :31430
## playlist_name
## Indie Poptimism : 308
## 2020 Hits & 2019 Hits – Top Global Tracks \U0001f525\U0001f525\U0001f525: 247
## Permanent Wave : 244
## Hard Rock Workout : 219
## Ultimate Indie Presents... Best Indie Tracks of the 2010s : 198
## Fitness Workout Electro | House | Dance | Progressive House : 195
## (Other) :31422
## playlist_id playlist_genre
## 4JkkvMpVl4lSioqQjeAL0q: 247 edm :6043
## 37i9dQZF1DWTHM4kX49UKs: 198 latin:5155
## 6KnQDwp0syvhfHOR4lWP7x: 195 pop :5507
## 3xMQTDLOIGvj3lWH5e5x6F: 189 r&b :5431
## 3Ho3iO0iJykgEQNbjB2sic: 182 rap :5746
## 25ButZrVb1Zj1MJioMs09D: 109 rock :4951
## (Other) :31713
## playlist_subgenre danceability energy
## progressive electro house: 1809 Min. :0.0000 Min. :0.000175
## southern hip hop : 1675 1st Qu.:0.5630 1st Qu.:0.581000
## indie poptimism : 1672 Median :0.6720 Median :0.721000
## latin hip hop : 1656 Mean :0.6548 Mean :0.698619
## neo soul : 1637 3rd Qu.:0.7610 3rd Qu.:0.840000
## pop edm : 1517 Max. :0.9830 Max. :1.000000
## (Other) :22867
## key loudness mode speechiness
## Min. : 0.000 Min. :-46.448 Min. :0.0000 Min. :0.0000
## 1st Qu.: 2.000 1st Qu.: -8.171 1st Qu.:0.0000 1st Qu.:0.0410
## Median : 6.000 Median : -6.166 Median :1.0000 Median :0.0625
## Mean : 5.374 Mean : -6.720 Mean :0.5657 Mean :0.1071
## 3rd Qu.: 9.000 3rd Qu.: -4.645 3rd Qu.:1.0000 3rd Qu.:0.1320
## Max. :11.000 Max. : 1.275 Max. :1.0000 Max. :0.9180
##
## acousticness instrumentalness liveness valence
## Min. :0.0000 Min. :0.0000000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0151 1st Qu.:0.0000000 1st Qu.:0.0927 1st Qu.:0.3310
## Median :0.0804 Median :0.0000161 Median :0.1270 Median :0.5120
## Mean :0.1753 Mean :0.0847472 Mean :0.1902 Mean :0.5106
## 3rd Qu.:0.2550 3rd Qu.:0.0048300 3rd Qu.:0.2480 3rd Qu.:0.6930
## Max. :0.9940 Max. :0.9940000 Max. :0.9960 Max. :0.9910
##
## tempo duration_ms
## Min. : 0.00 Min. : 4000
## 1st Qu.: 99.96 1st Qu.:187819
## Median :121.98 Median :216000
## Mean :120.88 Mean :225800
## 3rd Qu.:133.92 3rd Qu.:253585
## Max. :239.44 Max. :517810
##
Missing values treatment
We can observe that the columns ‘track_name’, ‘track_artist’ and ‘track_album_name’ each have 5 missing values. Since these 5 records make up only about 0.015% of the dataset (5 of 32,833 rows), we remove them before further analysis.
songs_clean <- songs %>% filter(!is.na(track_name) & !is.na(track_artist) & !is.na(track_album_name))
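The missing-value counts can also be checked directly rather than read off the summary. The sketch below uses a small made-up data frame (`toy` and its values are assumptions for illustration, not the Spotify data) to count missing values per column and the share of incomplete rows.

```r
# Toy data frame standing in for `songs` (values are made up for illustration)
toy <- data.frame(
  track_name   = c("A", NA, "C", "D"),
  track_artist = c("x", NA, "y", "z"),
  tempo        = c(120, 98, 101, 140),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Share of rows with at least one missing value, in percent
pct_incomplete <- 100 * mean(!complete.cases(toy))

print(na_counts)
print(pct_incomplete)
```

Running the same two expressions on `songs` would give the per-column counts and the percentage of rows affected.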
Checking duplicate records
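# Show every row that is an exact duplicate of another row (all columns are compared)
songs_clean[duplicated(songs_clean) | duplicated(songs_clean, fromLast = TRUE), ]
If this check returns no rows, every row in our dataset is unique. Note that a track can still appear in more than one playlist: ‘track_id’ has 28,356 distinct values across 32,833 rows.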
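The difference between a duplicated row and a repeated track can be illustrated on a small made-up example. In this sketch, `toy` and its values are assumptions for illustration; only base R's `duplicated()` is used.

```r
# Toy data: track "t1" appears in two different playlists
toy <- data.frame(
  track_id      = c("t1", "t2", "t1", "t3"),
  playlist_name = c("p1", "p1", "p2", "p1"),
  stringsAsFactors = FALSE
)

# Rows that are exact copies of another row (all columns compared)
full_row_dups <- toy[duplicated(toy) | duplicated(toy, fromLast = TRUE), ]

# Rows whose track_id occurs more than once, regardless of playlist
repeated_ids <- toy[duplicated(toy$track_id) |
                      duplicated(toy$track_id, fromLast = TRUE), ]

print(nrow(full_row_dups))
print(nrow(repeated_ids))
```

Here no row is an exact duplicate, yet two rows share a track ID, so "unique rows" and "unique tracks" are different statements.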
Variable datatypes cleaning
We observed that the variables in the Spotify dataset already have valid datatypes, so no changes are required.
Creation of new variable
We generate year-wise trends later in the project, so we extract the year from the ‘track_album_release_date’ variable and store it in a new column.
songs_clean$year <- as.numeric(substring(songs_clean$track_album_release_date,1,4))
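The release dates are not all full dates: the factor levels in the structure output include both "1957-01-01" and "1957-03". A minimal sketch of why taking the first four characters still works for both formats (the vector `dates` is a small stand-in built from values that appear in the output above):

```r
# Release dates stored as a factor, in full-date and year-month formats
dates <- factor(c("1957-01-01", "1957-03", "2020-01-10"))

# substring() converts the factor to character before slicing,
# so the first four characters are the year in either format
year <- as.numeric(substring(dates, 1, 4))

print(year)
```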
Deletion of unnecessary columns
We won’t need columns such as ‘track_id’, ‘track_album_id’ and ‘playlist_id’ for the analysis because they contain only long alphanumeric identifiers. Let’s drop these columns.
songs_clean <- songs_clean%>%dplyr::select(-track_id,-track_album_id,-playlist_id)
Checking final dimensions of cleaned dataset
After the data cleaning, we check the final number of rows and columns, as shown in the code below. The result shows 32,828 rows and 21 columns in the cleaned dataset.
dim(songs_clean)
## [1] 32828 21
A preview of the clean dataset is given below:
head(songs_clean, 20) %>%
datatable(options = list(scrollCollapse = TRUE,scrollX = TRUE,
columnDefs = list(list(className = 'dt-center', targets = 1:4))
))
To begin our analysis, we plot the proportion of each playlist genre across our dataset.
Proportion of playlist genres
songs_clean_pie_data <- songs_clean %>%
group_by(playlist_genre) %>%
summarise(Total_number_of_tracks = length(playlist_genre))
ggplot(songs_clean_pie_data, aes(x="", y=Total_number_of_tracks, fill=playlist_genre)) +
geom_bar(width = 1, stat = "identity") +
coord_polar("y", start=0) +
geom_text(aes(label = paste(round(Total_number_of_tracks / sum(Total_number_of_tracks) * 100, 1), "%")),
position = position_stack(vjust = 0.5))
Correlation between variables
To understand the correlation among variables, we use the corrplot function from the corrplot package.
songs_correlation <- cor(songs_clean[,-c(1,2,4,5,6,7,8)])
corrplot(songs_correlation, type = "upper", tl.srt = 45)
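Dropping columns by position depends on the column order staying fixed. As a hedged alternative, the numeric columns can be selected by type before calling `cor()`. The sketch below uses a made-up data frame (`toy` and its values are assumptions for illustration).

```r
# Toy data with one non-numeric column and three numeric features
toy <- data.frame(
  name         = c("a", "b", "c", "d"),
  energy       = c(0.2, 0.4, 0.6, 0.8),
  loudness     = c(-12, -9, -6, -3),
  acousticness = c(0.9, 0.7, 0.4, 0.1),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns, whatever their position
numeric_cols <- toy[sapply(toy, is.numeric)]

# Pearson correlation matrix of the numeric features
cor_matrix <- cor(numeric_cols)

print(round(cor_matrix, 2))
```

The resulting matrix could be passed to `corrplot()` in the same way as `songs_correlation`.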
Density plots of variables
Let’s see how energy, danceability, valence, acousticness, speechiness and liveness are distributed over all the observations in our dataset. We plot the densities of all six variables together because they share the same scale, ranging from 0 to 1.
correlated_density <- ggplot(songs_clean) +
geom_density(aes(energy, fill ="energy"), alpha = 0.1) +
geom_density(aes(danceability, fill ="danceability"), alpha = 0.1) +
geom_density(aes(valence, fill ="valence"), alpha = 0.1) +
geom_density(aes(acousticness, fill ="acousticness"), alpha = 0.1) +
geom_density(aes(speechiness, fill ="speechiness"), alpha = 0.1) +
geom_density(aes(liveness, fill ="liveness"), alpha = 0.1) +
scale_x_continuous(name = "Energy, Danceability, Valence, Acousticness, Speechiness and Liveness") +
scale_y_continuous(name = "Density") +
ggtitle("Density plot of Energy, Danceability, Valence, Acousticness, Speechiness and Liveness") +
theme_bw() +
theme(plot.title = element_text(size = 10, face = "bold"),
text = element_text(size = 10)) +
theme(legend.title=element_blank()) +
scale_fill_brewer(palette="Accent")
correlated_density
Density plots of loudness, duration and track_popularity
loudness_density <- ggplot(songs_clean) +
geom_density(aes(loudness, fill ="loudness")) +
scale_x_continuous(name = "Loudness") +
scale_y_continuous(name = "Density") +
ggtitle("Density plot of Loudness") +
theme_bw() +
theme(plot.title = element_text(size = 14, face = "bold"),
text = element_text(size = 12)) +
theme(legend.title=element_blank()) +
scale_fill_brewer(palette="Paired")
duration_ms_density <- ggplot(songs_clean) +
geom_density(aes(duration_ms, fill ="duration_ms")) +
scale_x_continuous(name = "duration_ms") +
scale_y_continuous(name = "Density") +
ggtitle("Density plot of duration_ms") +
theme_bw() +
theme(plot.title = element_text(size = 14, face = "bold"),
text = element_text(size = 12)) +
theme(legend.title=element_blank()) +
scale_fill_brewer(palette="Dark2")
track_popularity_density <- ggplot(songs_clean) +
geom_density(aes(track_popularity, fill ="track_popularity")) +
scale_x_continuous(name = "track_popularity") +
scale_y_continuous(name = "Density") +
ggtitle("Density plot of track_popularity") +
theme_bw() +
theme(plot.title = element_text(size = 14, face = "bold"),
text = element_text(size = 12)) +
theme(legend.title=element_blank()) +
scale_fill_brewer(palette="RdBu")
grid.arrange(loudness_density, duration_ms_density,track_popularity_density, nrow = 3)
Visualizing top artists within each genre
top_genre <- songs_clean %>% select(playlist_genre, track_artist, track_popularity) %>% group_by(playlist_genre,track_artist) %>% summarise(n = n()) %>% top_n(15, n)
tm <- treemap(top_genre, index = c("playlist_genre", "track_artist"), vSize = "n", vColor = 'playlist_genre', palette = viridis(6),title="Top 15 Track Artists within each Playlist Genre")
Let’s take a look at how the characteristics of our tracks differ among the 6 genres using radar charts.
A radar chart is useful for comparing the musical vibes of genres visually. To plot it, we normalize the danceability, energy, loudness, speechiness, valence, instrumentalness and acousticness values to range from 0 to 1, which makes the chart clearer and more readable.
To generate these radar plots we built a user-defined function which takes a playlist genre as its argument and draws the corresponding radar chart.
Plots showing variability of characteristics across genres
radar_chart <- function(arg){
songs_clean_filtered <- songs_clean %>% filter(playlist_genre==arg)
radar_data_v1 <- songs_clean_filtered %>%
select(danceability,energy,loudness,speechiness,valence,instrumentalness,acousticness)
radar_data_v2 <- apply(radar_data_v1,2,function(x){(x-min(x)) / diff(range(x))})
radar_data_v3 <- apply(radar_data_v2,2,mean)
# 7 features are selected above, so the max and min rows need 7 values each
radar_data_v4 <- rbind(rep(1,7) , rep(0,7) , radar_data_v3)
return(radarchart(as.data.frame(radar_data_v4),title=arg))
}
par(mfrow = c(2, 3))
Chart_pop<-radar_chart("pop")
Chart_rb<-radar_chart("r&b")
Chart_edm<-radar_chart("edm")
Chart_latin<-radar_chart("latin")
Chart_rap<-radar_chart("rap")
Chart_rock<-radar_chart("rock")
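The normalization step inside `radar_chart` is a min-max rescaling followed by a column mean. A minimal sketch on made-up values (`toy`, `min_max` and `radar_input` are names introduced here for illustration) shows the rescaling and the three-row layout, maximum row first and minimum row second, that `fmsb::radarchart` expects by default:

```r
# Min-max rescaling: maps the smallest value to 0 and the largest to 1
min_max <- function(x) (x - min(x)) / diff(range(x))

# Toy features on very different scales (values are made up)
toy <- data.frame(
  loudness = c(-10, -6, -2),
  energy   = c(0.2, 0.5, 0.8),
  valence  = c(0.1, 0.4, 0.7)
)

scaled <- apply(toy, 2, min_max)   # every column now spans 0 to 1
feature_means <- colMeans(scaled)  # one average per feature

# Row 1 = axis maximum, row 2 = axis minimum, row 3 = the values to draw
radar_input <- as.data.frame(rbind(
  max  = rep(1, ncol(toy)),
  min  = rep(0, ncol(toy)),
  mean = feature_means
))

print(radar_input)
```

Using `ncol(toy)` for the length of the max and min rows keeps them in step with the number of selected features.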
In this part of our project, we look at how the features change over time. We group the songs by release year, compute the average of each feature per year and visualize it. To generate these plots we built a user-defined function which takes a track feature as its argument and returns the corresponding trend chart.
Plots showing the change in tracks’ feature values over the last decade
trend_chart <- function(arg){
# funs() is deprecated in dplyr; look the column up by name with the .data pronoun instead
trend_change <- songs_clean %>% filter(year>2010) %>% group_by(year) %>% summarise(Average = mean(.data[[arg]]))
chart<- ggplot(data = trend_change, aes(x = year, y = Average)) +
geom_line(color = "#00AFBB", size = 1) +
scale_x_continuous(breaks=seq(2011, 2020, 1)) + scale_y_continuous(name=paste("",arg,sep=""))
return(chart)
}
trend_chart_track_popularity<-trend_chart("track_popularity")
trend_chart_danceability<-trend_chart("danceability")
trend_chart_energy<-trend_chart("energy")
trend_chart_loudness<-trend_chart("loudness")
trend_chart_duration_ms<-trend_chart("duration_ms")
trend_chart_speechiness<-trend_chart("speechiness")
plot_grid(trend_chart_track_popularity, trend_chart_danceability, trend_chart_energy, trend_chart_loudness, trend_chart_duration_ms, trend_chart_speechiness,ncol = 2, label_size = 1)
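The aggregation behind each trend chart is a per-year mean of one feature chosen by name. A base R sketch on made-up data (`toy` and the helper `yearly_mean` are assumptions introduced for illustration):

```r
# Toy data: two tracks per year (values are made up)
toy <- data.frame(
  year   = c(2011, 2011, 2012, 2012),
  energy = c(0.6, 0.8, 0.5, 0.7),
  tempo  = c(100, 120, 110, 130)
)

# Average of the column named `feature` for each year
yearly_mean <- function(df, feature) {
  out <- aggregate(df[[feature]], by = list(year = df$year), FUN = mean)
  names(out)[2] <- "Average"
  out
}

energy_trend <- yearly_mean(toy, "energy")
print(energy_trend)
```

Passing the feature as a string is what lets a single function serve every chart.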
In this section, we perform K-means clustering on the Spotify dataset and analyze how the output changes as the number of clusters increases. We then try to identify the optimal number of clusters K using the elbow method.
K-means clustering is a centroid-based (distance-based) algorithm: each cluster is associated with a centroid, and each point is assigned to the cluster whose centroid is nearest. The main objective of the K-means algorithm is to minimize the sum of squared distances between the points and their respective cluster centroids.
Since K-means is distance-based, features measured on very different scales can create a problem: duration_ms runs into the hundreds of thousands, while features such as danceability lie between 0 and 1. So let’s first bring all the variables to the same magnitude.
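Written out, the K-means objective is usually stated with squared Euclidean distances. The notation is an assumption introduced here for illustration: $C_1, \dots, C_K$ are the clusters and $\mu_k$ is the centroid of cluster $C_k$.

```latex
\min_{C_1,\dots,C_K} \; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

The double sum is the total within-cluster sum of squares, which R's `kmeans()` reports as `tot.withinss`.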
# select required song features for clustering
cluster.input <- songs_clean[, c('energy', 'liveness','tempo', 'speechiness', 'acousticness',
'instrumentalness', 'danceability', 'duration_ms' ,'loudness','valence')]
# scale features for input to clustering
cluster.input.scaled <- scale(cluster.input[, c('energy', 'liveness', 'tempo', 'speechiness'
, 'acousticness', 'instrumentalness', 'danceability'
, 'duration_ms' ,'loudness', 'valence')])
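A small made-up example shows why scaling matters for a distance-based method. In this sketch (`toy`, its values and the helper `nearest_to_first` are assumptions for illustration), the nearest neighbour of the first track changes once the features are standardized with `scale()`:

```r
# Toy tracks: duration in milliseconds, danceability between 0 and 1
toy <- data.frame(
  duration_ms  = c(180000, 240000, 181000, 300000),
  danceability = c(0.20, 0.22, 0.90, 0.60)
)

# Row index of the track closest (Euclidean distance) to the first track
nearest_to_first <- function(m) {
  d <- as.matrix(dist(m))
  unname(which.min(d[1, -1])) + 1
}

# Raw features: duration_ms dominates the distance
nearest_raw <- nearest_to_first(toy)

# Standardized features: each column has mean 0 and standard deviation 1
scaled <- scale(toy)
nearest_scaled <- nearest_to_first(scaled)

print(c(raw = nearest_raw, scaled = nearest_scaled))
```

On the raw data the closest track is the one with a similar duration, even though its danceability is very different; after scaling, both features contribute comparably.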
Visualizing clusters for different values of k
We will first fit multiple k-means models and in each successive model, we will increase the number of clusters. We will then plot the results for visualization as below.
# kmeans with different k values
k2 <- kmeans(cluster.input.scaled, centers = 2, nstart = 25)
k3 <- kmeans(cluster.input.scaled, centers = 3, nstart = 25)
k4 <- kmeans(cluster.input.scaled, centers = 4, nstart = 25)
k5 <- kmeans(cluster.input.scaled, centers = 5, nstart = 25)
# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = cluster.input.scaled) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = cluster.input.scaled) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = cluster.input.scaled) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = cluster.input.scaled) + ggtitle("k = 5")
grid.arrange(p1, p2, p3, p4, nrow = 2)
Identifying optimal number of clusters in K-means clustering
The elbow method is widely used to determine the optimal number of clusters in K-means clustering. It considers the total within-cluster sum of squares (WSS) as a function of the number of clusters K. The optimal K is the value beyond which adding another cluster no longer reduces the total WSS by much.
set.seed(100)
fviz_nbclust(cluster.input[1:1000,], kmeans, method = "wss")
Looking at the above elbow curve, we can say that the optimal value of k is 3.
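The quantity behind the elbow curve can be computed by hand from `kmeans()`. The sketch below runs on synthetic data with two well-separated groups (`toy` is made up for illustration, not the Spotify features) and collects `tot.withinss` for several values of k.

```r
set.seed(100)

# Synthetic data: two well-separated blobs of 50 points each
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# Reduction in WSS gained by each additional cluster
wss_drop <- -diff(wss)

print(round(wss, 2))
print(round(wss_drop, 2))
```

For this synthetic data the large drop occurs when moving from one cluster to two, and later drops are small by comparison; that change in slope is the "elbow".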
Let’s take a look at how our within sum of squared error changes with k in a table format.
n_clust<-fviz_nbclust(cluster.input[1:1000,], kmeans, method = "wss")
n_clust<-n_clust$data
n_clust %>% rename(Number_of_clusters=clusters,Within_sum_of_squared_error=y) %>%
mutate(Within_sum_of_squared_error = color_tile("white", "red")(Within_sum_of_squared_error)) %>%
kable("html", escape = F) %>%
kable_styling("hover", full_width = F) %>%
column_spec(2, width = "5cm") %>%
row_spec(3:3, bold = T, color = "white", background = "grey")
Number_of_clusters | Within_sum_of_squared_error |
---|---|
1 | 1.191996e+12 |
2 | 5.466280e+11 |
3 | 3.070086e+11 |
4 | 1.973334e+11 |
5 | 1.370321e+11 |
6 | 1.091560e+11 |
7 | 7.225112e+10 |
8 | 6.101159e+10 |
9 | 5.167471e+10 |
10 | 4.697773e+10 |
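We observe that the decrease in the within-cluster sum of squared error levels off after k = 3, which agrees with the elbow curve above. Note that these values were computed on the first 1,000 rows of the unscaled cluster.input, so duration_ms (measured in milliseconds) dominates the distances; rerunning the elbow method on cluster.input.scaled may give a different picture.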
5.1 Problem Statement
The analysis was intended to understand the evolution of music over time as well as the characteristics of various genres of music. In addition, we identified the underlying patterns and relationships among the features that describe music, using Spotify data.
5.2 Methodology
5.3 Insights
5.4 Implications
This analysis was conducted to explore the evolution of music over time. Diving deeper into trends such as what makes one song more danceable than another can help DJs, artists, and producers create music based on characteristics like tempo or level of speechiness.
Netflix has done a commendable job of leveraging data to produce video content, and the next music revolution could be brought about by similar techniques.
5.5 Limitations
Even though there are millions of songs that exist, we only had about 32k records for our analysis, and hence we couldn’t obtain a full picture of the features of music.
The analysis could also be strengthened by incorporating user-related features such as demographic attributes and listening history.