Spotify Data Analysis

Introduction

Spotify is a Swedish audio streaming and media services provider founded on 23rd April 2006 by Daniel Ek and Martin Lorentzon.

It has substantially changed the music industry’s operating model: instead of buying tapes and CDs, listeners can play whatever music they want, whenever and wherever they want, on a smartphone or tablet.

Spotify learns its users’ listening habits by asking questions when they first log in, such as which music genres they prefer, and uses machine learning algorithms to recommend songs daily and weekly. Listening history is also compiled in order to determine each user’s preferences.

It is interesting to ask how Spotify sorts songs into broad categories. What are the characteristics of each genre, and how are they used to categorize songs? This project explores these questions.

Based on the data, each song is described by 12 audio features and assigned to one of 6 broad genres and one of 24 subgenres. In the following sections, we concentrate on these 14 variables.

The purpose of this project is to:

understand the relationships between different features
identify patterns in audio characteristics across genres
understand which features make a song popular

To fulfill those goals, we will perform:

Exploratory Data Analyses:

checking and learning the correlation between different features

Model to use:

K-Means Clustering

Libraries

library(tibble) : Used to create tibbles
library(tidyr) : Used to tidy up data
library(prettydoc) : Document themes for R Markdown
library(DT) : used for displaying R data objects (matrices or data frames) as tables on HTML pages
library(lubridate) : used for date/time functions
library(magrittr) : used for piping
library(ggplot2) : used for data visualization
library(dplyr) : used for data manipulation
library(corrplot) : for displaying correlation matrices and confidence intervals
library(tm) : for text mining the “Genre” column
library(treemap) : For visualizing the treemap plots
library(factoextra) : For visualizing the clusters

Data Preparation

In this section, we’ll go over the procedures for preparing data for analysis.

About

We will be using the data set provided in the curriculum

The Spotify song data analyzed here was collected via the spotifyr package. This package was created by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff to make it easier to acquire your own data or general metadata from Spotify’s API. See the spotifyr package website to learn how to collect your own data.

Data Import

# Get the data
library(tibble)
url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv"
# as_tibble() replaces the deprecated as_data_frame()
spotify_df <- as_tibble(read.csv(url, stringsAsFactors = FALSE))

First Six

head(spotify_df)

## # A tibble: 6 x 23
##   track_id               track_name track_artist track_popularity track_album_id
##   <chr>                  <chr>      <chr>                   <int> <chr>         
## 1 6f807x0ima9a1j3VPbc7VN I Don't C~ Ed Sheeran                 66 2oCs0DGTsRO98~
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories ~ Maroon 5                   67 63rPSO264uRjW~
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the T~ Zara Larsson               70 1HoSmj2eLcsrR~
## 4 75FpbthrwQmzHlBJLuGdC7 Call You ~ The Chainsm~               60 1nqYsOef1yKKu~
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone Y~ Lewis Capal~               69 7m7vv9wlQ4i0L~
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful~ Ed Sheeran                 67 2yiy9cd2QktrN~
## # ... with 18 more variables: track_album_name <chr>,
## #   track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## #   playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## #   energy <dbl>, key <int>, loudness <dbl>, mode <int>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, duration_ms <int>

Shape

dim(spotify_df)

## [1] 32833    23

Class

str(spotify_df)

## tibble [32,833 x 23] (S3: tbl_df/tbl/data.frame)
##  $ track_id                : chr [1:32833] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr [1:32833] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr [1:32833] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr [1:32833] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr [1:32833] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr [1:32833] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr [1:32833] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr [1:32833] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr [1:32833] "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr [1:32833] "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num [1:32833] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num [1:32833] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int [1:32833] 6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num [1:32833] -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int [1:32833] 1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num [1:32833] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num [1:32833] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num [1:32833] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num [1:32833] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num [1:32833] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num [1:32833] 122 100 124 122 124 ...
##  $ duration_ms             : int [1:32833] 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...

Columns

names(spotify_df)

##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_ms"

Summary

summary(spotify_df)

##    track_id          track_name        track_artist       track_popularity
##  Length:32833       Length:32833       Length:32833       Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 45.00  
##                                                           Mean   : 42.48  
##                                                           3rd Qu.: 62.00  
##                                                           Max.   :100.00  
##  track_album_id     track_album_name   track_album_release_date
##  Length:32833       Length:32833       Length:32833            
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##  playlist_name      playlist_id        playlist_genre     playlist_subgenre 
##  Length:32833       Length:32833       Length:32833       Length:32833      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171  
##  Median :0.6720   Median :0.721000   Median : 6.000   Median : -6.166  
##  Mean   :0.6548   Mean   :0.698619   Mean   : 5.374   Mean   : -6.720  
##  3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##       mode         speechiness      acousticness    instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   :0.5657   Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472  
##  3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187819  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225800  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810

Data Clean

For good analysis results, it’s important to clean our data and make it analysis ready. To clean the data, we will perform the following steps:

Removing Duplicates
Removing Unwanted Variables
Removing Null Values
Creating New Variables
Transforming variables wherever required

Removing Duplicates

We start by removing duplicate records, identified by track_id. Duplicates would skew our results, so it is critical to dedupe the data before proceeding with the analysis.

spotify_df <- spotify_df[!duplicated(spotify_df$track_id),]
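Before dropping rows, it can help to see how many would be removed and which copy survives. The toy example below uses made-up IDs and genres, not rows from the dataset. It shows that `duplicated()` flags every occurrence after the first, so the retained row carries the genre of the first playlist in which a track appears.

```r
# Toy data with hypothetical IDs: track "a" appears in two playlists
toy <- data.frame(
  track_id       = c("a", "b", "a", "c"),
  playlist_genre = c("pop", "rock", "edm", "rap"),
  stringsAsFactors = FALSE
)

# duplicated() marks the second and later occurrences of each ID
n_dupes <- sum(duplicated(toy$track_id))

# Same idiom as above: keep only the first occurrence of each track
deduped <- toy[!duplicated(toy$track_id), ]

n_dupes
deduped
```

On the real data, `sum(duplicated(spotify_df$track_id))` run before the deduplication step would report how many rows are about to be removed.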

Dropping Variables

We can remove various IDs from the dataset because they are just used as unique identifiers and will not affect the analysis

spotify_df <- spotify_df %>% 
select(-ends_with("id"))
dim(spotify_df)

## [1] 28356    20

Treating Nulls

Null values can have a significant impact on our analysis and the interpretations we acquire, hence it’s critical to first identify the columns that contain null values and then treat them

Let’s check the number of missing values in each of the columns

colSums(is.na(spotify_df))

##               track_name             track_artist         track_popularity 
##                        4                        4                        0 
##         track_album_name track_album_release_date            playlist_name 
##                        4                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

Three columns, track_name, track_artist, and track_album_name, each have 4 missing values. We can drop these records, because removing 4 of the 28,356 records that remain after deduplication will have no material effect on our analysis

spotify_df <- na.omit(spotify_df)
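The missing-value counts can also be expressed as percentages, which makes it easier to judge whether dropping rows is safe. The sketch below uses a small made-up data frame rather than the Spotify data; the column names are borrowed only for readability.

```r
# Toy data with hypothetical values; NA marks a missing entry
toy <- data.frame(
  track_name   = c("x", NA, "z", "w"),
  track_artist = c("p", NA, "r", "s"),
  energy       = c(0.5, 0.7, 0.9, 0.1)
)

# is.na() gives a logical matrix; column means are the share of NAs
pct_missing <- round(100 * colMeans(is.na(toy)), 2)

# na.omit() drops every row that has at least one NA
complete <- na.omit(toy)

pct_missing
nrow(complete)
```

Applied to the real data, the same expression would be `round(100 * colMeans(is.na(spotify_df)), 2)`.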

Transforming Variables

Converting the following variables to factors to facilitate our analysis:

Genre
Sub Genre
Mode
Key

spotify_df <- spotify_df %>%
  mutate(playlist_genre = as.factor(playlist_genre),
         playlist_subgenre = as.factor(playlist_subgenre),
         mode = as.factor(mode),
         key = as.factor(key))

Let’s check if the conversion was successful

str(spotify_df)

## tibble [28,352 x 20] (S3: tbl_df/tbl/data.frame)
##  $ track_name              : chr [1:28352] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr [1:28352] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int [1:28352] 66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_name        : chr [1:28352] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr [1:28352] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr [1:28352] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_genre          : Factor w/ 6 levels "edm","latin",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ playlist_subgenre       : Factor w/ 24 levels "album rock","big room",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ danceability            : num [1:28352] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num [1:28352] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : Factor w/ 12 levels "0","1","2","3",..: 7 12 2 8 2 9 6 5 9 3 ...
##  $ loudness                : num [1:28352] -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 1 1 2 2 ...
##  $ speechiness             : num [1:28352] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num [1:28352] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num [1:28352] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num [1:28352] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num [1:28352] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num [1:28352] 122 100 124 122 124 ...
##  $ duration_ms             : int [1:28352] 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
##  - attr(*, "na.action")= 'omit' Named int [1:4] 7669 8693 8694 17666
##   ..- attr(*, "names")= chr [1:4] "7669" "8693" "8694" "17666"

We can also observe that duration is given in milliseconds; let’s convert it to minutes

spotify_df <- spotify_df %>% mutate(duration_min = duration_ms/60000)

Creating New Variables

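Let’s create a variable that assigns a rank from 1 to 4 based on the value in the track_popularity column. Note that the first condition uses a strict inequality (track_popularity > 0), so tracks with a popularity of exactly 0 match none of the first three conditions and fall into group 4: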

spotify_df <- spotify_df %>% 
  mutate(popularity_group = as.numeric(case_when(
    ((track_popularity > 0) & (track_popularity < 20)) ~ "1",
    ((track_popularity >= 20) & (track_popularity < 40))~ "2",
    ((track_popularity >= 40) & (track_popularity < 60)) ~ "3",
    TRUE ~ "4"))
    )
table(spotify_df$popularity_group)

## 
##    1    2    3    4 
## 4182 6162 8975 9033
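An alternative sketch of the same binning uses base R’s `cut()`. It is not the original code: the breaks mirror the thresholds above, but the first bin here is closed at 0, so a popularity of 0 lands in group 1. The input vector is hypothetical and chosen to hit every bin edge.

```r
# Hypothetical popularity scores covering every bin edge
popularity <- c(0, 19, 20, 39, 40, 59, 60, 100)

# right = FALSE gives [0,20), [20,40), [40,60); include.lowest = TRUE
# closes the last interval so that it becomes [60,100]
popularity_group <- cut(popularity,
                        breaks = c(0, 20, 40, 60, 100),
                        right = FALSE,
                        include.lowest = TRUE,
                        labels = FALSE)

popularity_group
```

With `labels = FALSE`, `cut()` returns integer bin codes directly, so no conversion from character to numeric is needed.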

Data Preview

library(DT)
datatable(head(spotify_df,5))

Exploratory Data Analysis

Exploratory Data Analysis (EDA) can help us find relevant information in data that isn’t immediately obvious, but only if it’s done appropriately. Before we begin to create a model on the data, EDA is required. We can use EDA to identify data patterns, spot outliers or unusual events, and discover interesting relationships between variables.

Correlation Between Features

We’ll start by looking at the correlation between the variables. Correlation tells us if the variables are interdependent. The magnitude of the correlation helps in determining the relationship’s strength, whilst the sign helps in determining whether the variables are moving in the same direction or in opposite directions
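The code behind the correlation figure is not shown in this report. The following is a minimal sketch of how such a figure could be produced, assuming the corrplot package from the Libraries section. The data are synthetic stand-ins with made-up values, so the numbers say nothing about the real tracks.

```r
# Synthetic stand-in for the audio features (values are made up)
set.seed(1)
energy <- runif(200)
toy <- data.frame(
  energy       = energy,
  loudness     = -20 + 15 * energy + rnorm(200, sd = 2),
  acousticness = 1 - energy + rnorm(200, sd = 0.2)
)

# Pearson correlation matrix of the numeric columns
corr <- cor(toy)
round(corr, 2)

# Draw the matrix if corrplot is installed
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(corr, method = "circle", type = "upper")
}
```

On the real data, `cor()` would be applied to the numeric columns of `spotify_df` only, since the factor and character columns cannot be correlated this way.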

On the basis of the figure, we can see that a few pairs of variables are highly correlated. To avoid multicollinearity, we must either keep only one variable from each such pair or use dimensionality reduction techniques

The correlation plot shows that energy and loudness are strongly correlated. Let’s draw a scatter plot to visualize the relationship

b <- ggplot(spotify_df,aes(x = energy, y = loudness)) 
b + geom_point()

The graph indicates a strong relationship between the audio features energy and loudness

Genre Characteristics

Spotify has six broad genres into which songs can be categorized. Let’s look at the number of songs in each genre in our database.

# songs per genre
spotify_df %>% group_by(Genre = playlist_genre) %>%
  summarise(No_of_tracks = n()) %>% knitr::kable()

Genre	No_of_tracks
edm	4877
latin	4136
pop	5132
r&b	4504
rap	5398
rock	4305

From the above data, we can see that the rap genre has the most songs (5,398), followed by pop (5,132)

Let’s check the % of tracks belonging to each of the genres

spotify_df_pie_data <- spotify_df %>% 
  group_by(playlist_genre) %>% 
  summarise(Total_number_of_tracks = length(playlist_genre))

ggplot(spotify_df_pie_data, aes(x="", y=Total_number_of_tracks, fill=playlist_genre)) + 
  geom_bar(width = 1, stat = "identity") + 
  coord_polar("y", start=0) + 
  geom_text(aes(label = paste(round(Total_number_of_tracks / sum(Total_number_of_tracks) * 100, 1), "%")),
            position = position_stack(vjust = 0.5))

We can see from the pie chart that the tracks are spread fairly evenly across genres

Rap has the highest percentage of tracks (19%)
Latin and Rock have the lowest percentages (14.6% and 15.2%, respectively).
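The shares can be recomputed directly from the per-genre counts in the table above, which is a quick way to check the labels on the pie chart. The sketch below hard-codes those counts; `genre_counts` and `genre_share` are names introduced here for illustration.

```r
# Track counts per genre, copied from the table above
genre_counts <- c(edm = 4877, latin = 4136, pop = 5132,
                  `r&b` = 4504, rap = 5398, rock = 4305)

# prop.table() divides each count by the total
genre_share <- round(100 * prop.table(genre_counts), 1)

# Largest share first
sort(genre_share, decreasing = TRUE)
```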

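Let’s see if the number of tracks in a given genre has an impact on its popularity:

# Assumed definition (not shown in the original): track counts per genre and popularity group
spotify_df_bar_data <- spotify_df %>% count(playlist_genre, popularity_group, name = "Total_playlist_genre")
ggplot(spotify_df_bar_data, aes(fill=playlist_genre, y=Total_playlist_genre, x=popularity_group)) + 
    geom_bar(position="dodge", stat="identity")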

This plot suggests some interesting inferences:

Pop is the most popular genre, followed by rap
Pop’s popularity may be due to how easy it is to sing along to, which matters to many people who want to listen to vibrant music in the car while commuting or passing time
EDM is the least popular genre. This is unlikely to be due to the number of songs, since EDM has more tracks than Latin, R&B, and rock

Let’s now check the relation between genre and the other variables in the data

library(ggpubr)

p1 <- spotify_df %>% 
  ggplot(aes(x = playlist_genre, y = valence, color = playlist_genre)) +
  geom_boxplot(alpha = 0.5, notch = TRUE) +
  theme_bw() +
  labs(title = 'Which genre is the happiest?', x= 'Genres', y = 'Happiness' )



p2 <- spotify_df %>% ggplot(aes(x = playlist_genre, y = energy, color = playlist_genre)) +
  geom_boxplot(alpha = 0.1, notch = TRUE) +
  theme_bw() +
  labs(title = 'How energetic are different Genres?', x= 'Genres', y = 'Energy' )


p3 <- spotify_df %>% ggplot(aes(x = playlist_genre, y = danceability, color = playlist_genre)) +
  geom_boxplot(alpha = 0.5, notch = TRUE) +
  theme_bw() +
  labs(title = 'Which genre is the most danceable?', x= 'Genres', y = 'Danceability' )


p4 <- spotify_df %>% ggplot(aes(x = playlist_genre, y = tempo, color = playlist_genre)) +
  geom_boxplot(alpha = 0.5, notch = TRUE) +
  theme_bw() +
  labs(title = 'Tempo across different Genres', x= 'Genres', y = 'Tempo' )

ggarrange(p1,p2,p3,p4 , nrow = 2, ncol = 2)

The graphs above depict the genres’ characteristics in terms of happiness, energy, danceability, and tempo. Let’s have a look at some of the findings:

Happiness : Latin songs have a syncopation feature that makes them sound jubilant and alive, and as can be seen in the graph, they have a high valence value
Energy : EDM and rock songs are full of loudness, beats, structural changes and the sounds of instruments and hence they are energetic
Danceability : Rap has the music to set the mood, drop the beat, and create the motivation needed to start moving and hence it has a high value when talking about danceability
Tempo : EDM tracks vary widely in tempo, and some EDM songs have a very high tempo

Subgenres, Albums and Artists

Let’s look at the top 3 subgenres within each genre

top <- spotify_df %>% select(playlist_genre, playlist_subgenre, track_popularity) %>% group_by(playlist_genre,playlist_subgenre) %>% summarise(n = n()) %>% top_n(3, n)

## `summarise()` has grouped output by 'playlist_genre'. You can override using the `.groups` argument.

tm <- treemap(top, index = c("playlist_genre", "playlist_subgenre"), vSize = "n", vColor = 'playlist_genre', palette="RdYlBu")

The top 15 artists within each genre:

top <- spotify_df %>% select(playlist_genre, track_artist, track_popularity) %>% group_by(playlist_genre,track_artist) %>% summarise(n = n()) %>% top_n(15, n)

## `summarise()` has grouped output by 'playlist_genre'. You can override using the `.groups` argument.

tm <- treemap(top, index = c("playlist_genre", "track_artist"), vSize = "n", vColor = 'playlist_genre', palette="RdYlBu")

The top 15 songs overall:

library(ggplot2)
library(plotly)
# finding popular songs
popular_artists <- spotify_df %>% group_by(Songs = track_name) %>%
summarise(No_of_tracks = n(),Popularity = mean(track_popularity))  %>% 
  filter(No_of_tracks > 2) %>%
  arrange(desc(Popularity)) %>%
  top_n(15, wt = Popularity) %>% 
  ggplot(aes(x = Songs, y = Popularity)) +
        geom_bar(stat = "identity") +
        coord_flip() + labs(title = "popular songs overall", x = "Songs", y = "Popularity")

ggplotly(popular_artists)

The top 15 artists overall:

library(ggplot2)
library(plotly)
#finding popular artists
popular_artists <- spotify_df %>% group_by(Artist = track_artist) %>%
summarise(No_of_tracks = n(),Popularity = mean(track_popularity))  %>% 
  filter(No_of_tracks > 2) %>%
  arrange(desc(Popularity)) %>%
  top_n(15, wt = Popularity) %>% 
  ggplot(aes(x = Artist, y = Popularity)) +
        geom_bar(stat = "identity") +
        coord_flip() + labs(title = "popular artists overall", x = "Artists", y = "Popularity")

ggplotly(popular_artists)

Modelling

To identify the songs that belong to the same group we will be performing K-Means clustering. K-Means clustering will group the songs into groups having similar audio characteristics.

To perform K-Means, we start by selecting the input variables: ‘energy’, ‘liveness’, ‘tempo’, ‘speechiness’, ‘acousticness’, ‘instrumentalness’, ‘danceability’, ‘duration_ms’, ‘loudness’, ‘valence’

spotify.inp <- spotify_df[, c('energy', 'liveness','tempo', 'speechiness', 'acousticness','instrumentalness', 'danceability', 'duration_ms' ,'loudness','valence')]

The next step is to scale the data which is performed in order to standardize all the columns

cluster.spotify_df.scaled <- scale(spotify.inp[, c('energy', 'liveness', 'tempo', 'speechiness' , 'acousticness', 'instrumentalness', 'danceability' , 'duration_ms' ,'loudness', 'valence')])
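K-Means relies on Euclidean distance, so a column measured in large units, such as duration_ms, would dominate the distance if left unscaled. The toy example below uses hypothetical values to show what `scale()` does to each column: it subtracts the column mean and divides by the column standard deviation.

```r
# Two hypothetical features on very different scales
toy <- cbind(duration_ms = c(180000, 210000, 240000, 300000),
             energy      = c(0.40, 0.90, 0.60, 0.75))

scaled <- scale(toy)   # subtract column mean, divide by column sd

# Equivalent manual computation for one column
manual <- (toy[, "energy"] - mean(toy[, "energy"])) / sd(toy[, "energy"])

round(colMeans(scaled), 10)    # each column is centred on 0
apply(scaled, 2, sd)           # each column has standard deviation 1
```

The means and standard deviations used are stored as the `scaled:center` and `scaled:scale` attributes of the result, which is useful when new observations have to be put on the same scale later.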

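K-Means groups the data into K clusters, so we need to identify the optimal number of clusters. We will use the elbow method for this. Note that the call below runs on the first 2,000 rows of the unscaled spotify.inp, whereas the model is later fit on the scaled data; passing cluster.spotify_df.scaled[1:2000, ] would keep the two steps consistent.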
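The elbow method fits K-Means for a range of k and records the total within-cluster sum of squares for each fit. The sketch below computes that quantity by hand with base R on synthetic data that contains three well-separated groups. It is an illustration of the technique only, not a re-run of the analysis in this report.

```r
# Synthetic 2-D data with three well-separated groups (not Spotify data)
set.seed(100)
toy <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
             matrix(rnorm(100, mean = 6),  ncol = 2),
             matrix(rnorm(100, mean = 12), ncol = 2))
toy_scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1, ..., 8
ks  <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which the curve flattens out
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Using `nstart = 10` repeats each fit from several random starts and keeps the best one, which makes the curve less sensitive to an unlucky initialisation.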

set.seed(100)
fviz_nbclust(spotify.inp[1:2000,], kmeans, method = "wss")

We can see that the elbow in the above graph is at 3, so we will select k as 3 and fit the model

k <- kmeans(cluster.spotify_df.scaled, centers = 3)

fviz_cluster(k, geom = "point",  data = cluster.spotify_df.scaled) + ggtitle("Grouping similar songs")

These are the clusters obtained with K-Means clustering for k = 3

Let’s check the song characteristics within each cluster
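The `insights` object printed below is not defined anywhere in this report. One way such a per-cluster summary could be built is to average each feature within each cluster label. The helper `cluster_profile()` is a hypothetical name introduced here, and the toy values are made up.

```r
# Mean of every feature within each cluster
cluster_profile <- function(features, clusters) {
  aggregate(features, by = list(kclust = clusters), FUN = mean)
}

# Toy check with hypothetical values and a hand-made cluster vector
toy <- data.frame(energy  = c(0.2, 0.4, 0.8, 1.0),
                  valence = c(0.1, 0.3, 0.5, 0.9))
profile <- cluster_profile(toy, clusters = c(1, 1, 2, 2))
profile
```

Under that assumption, a call such as `cluster_profile(spotify_df[, c("acousticness", "danceability", "energy", "instrumentalness", "speechiness", "valence", "liveness", "track_popularity")], k$cluster)` would give a table with the same columns as the output shown here.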

insights

## # A tibble: 3 x 9
##   kclust acousticness danceability energy instrumentalness speechiness valence
##    <int>        <dbl>        <dbl>  <dbl>            <dbl>       <dbl>   <dbl>
## 1      1       0.478         0.615  0.440           0.143       0.0889   0.399
## 2      2       0.148         0.742  0.714           0.0274      0.137    0.653
## 3      3       0.0596        0.560  0.812           0.147       0.0800   0.384
## # ... with 2 more variables: liveness <dbl>, track_popularity <dbl>

These are the characteristics of the songs within each cluster. Based on this analysis the artists can determine which group their song will fall into and the average popularity it may have.
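To place a new song into one of the fitted clusters, its features have to be scaled with the same means and standard deviations as the training data and then matched to the nearest centroid. The sketch below shows one way to do this in base R. `assign_cluster()` is a hypothetical helper, not part of the original analysis, and the data are synthetic.

```r
# Assign new observations to the nearest K-Means centroid
assign_cluster <- function(new_songs, scaled_train, km) {
  # Re-use the centring and scaling stored by scale() on the training matrix
  z <- scale(new_songs,
             center = attr(scaled_train, "scaled:center"),
             scale  = attr(scaled_train, "scaled:scale"))
  # Squared Euclidean distance from every new song to every centroid
  d <- apply(km$centers, 1, function(ctr) rowSums(sweep(z, 2, ctr)^2))
  d <- matrix(d, nrow = nrow(z))   # keep matrix shape when there is one song
  max.col(-d, ties.method = "first")
}

# Synthetic training data: two well-separated groups of 50 points each
set.seed(42)
train <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
               matrix(rnorm(100, mean = 8), ncol = 2))
colnames(train) <- c("f1", "f2")
scaled <- scale(train)
km <- kmeans(scaled, centers = 2, nstart = 10)

# Two hypothetical new songs, one near each group
new_songs <- matrix(c(0, 0,
                      8, 8), ncol = 2, byrow = TRUE,
                    dimnames = list(NULL, c("f1", "f2")))
assigned <- assign_cluster(new_songs, scaled, km)
assigned
```

The columns of `new_songs` must be in the same order as the columns used to fit the model. The cluster’s mean popularity is a rough guide only, since popularity was not one of the clustering inputs.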

Summary

1. Objective :
The objective of this study was to understand the features of different musical genres. Using Spotify data, we also explored the underlying patterns and relationships among the audio features that describe music

2. Data :
There were 32,833 records and 23 columns in the Spotify data used for this analysis. The dataset contained both categorical and continuous variables. This information was sufficient to analyze the relationship between genres and audio characteristics, as well as to examine the most popular songs and artists

3. Methodology :

We started by looking at the relationship between audio features and making scatter plots to explore the relationship between variables with a high correlation value
Then we looked at genre-level aspects such as popularity, the genres with the most tracks, and each genre’s characteristics such as valence, energy, danceability, and so on
Using a tree map, we found popular subgenres within the 6 broad genres, as well as popular artists
We also came up with a list of the top 15 most popular artists and songs

4. Insights :

Certain audio characteristics were found to be substantially correlated with one another: energy and loudness have a strong positive linear relationship, whereas energy and acousticness have a strong inverse relationship
We found that pop songs were the most popular, followed by rap. Rap had the most songs, although the distribution of tracks across genres was fairly even
EDM songs were found to be the most energetic, and the tempo variance of EDM tracks was high
Latin songs were found to have the highest valence, as well as high danceability, most likely because of their syncopation, which makes them sound jubilant and alive
From our analysis, “Just the Way You Are” was found to be the most popular track and “JACKBOYS” the most popular artist

5. Insights from Model Fitting:

We performed K-Means clustering to group similar songs on the basis of their audio characteristics, irrespective of the genre they belonged to
We identified 3 groups, which indicates that some of the genres are similar
With the K-Means model that we have developed, an artist can roughly estimate the likely popularity of a song, provided its audio characteristics are known

Mid-term Project

Amritha Sharma | Ayushi | Yujia Zhang
