Spotify Data Analysis

Introduction

Spotify is one of the largest global music streaming services and a market leader. In this project we analyze Spotify’s music library across characteristics such as popularity, genres, and releases to develop an understanding of Spotify’s strategic standing.

The questions we aim to answer in this report are:

Is there a correlation between the popularity of a song and its audio properties such as loudness, danceability, and acousticness?
What is the distribution of these characteristics across Spotify’s library?
What is the genre distribution across Spotify’s library?
How do various acoustic properties relate to different genres?
How have the interests of Spotify users evolved over time?

Package Information

The following R packages have been used for the data analysis in this project:

library('tidyverse') 
library('dplyr')
library('ggplot2')
library('hrbrthemes')
library('DT')
library('corrplot')
library('funModeling')

Library	Description
‘tidyverse’	Used for data manipulation.
‘dplyr’	Used for data wrangling & manipulation.
‘ggplot2’	Used for creating data visualizations.
‘hrbrthemes’	Used to add themes for plots (theme_ipsum).
‘DT’	Used for creating data tables.
‘corrplot’	Used to create correlation plots.
‘funModeling’	Used for data pre-processing and exploratory data analysis.

Data Pre-processing

Data Source

The Spotify songs data for analysis has been sourced from this GitHub repository. The data comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API.

Data Import

The imported Spotify data contains 32833 track entries and 23 attributes, including track_popularity, danceability, loudness, tempo and other characteristics of songs released from the late 1950s to 2019.

1.1 Read & View Data

Here we read the data from a CSV file and load it to the spotify_data variable and view the first 6 rows of the data to check the content.

spotify_data <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
print(head(spotify_data))

## # A tibble: 6 × 23
##   track_id               track_name track_artist track_popularity track_album_id
##   <chr>                  <chr>      <chr>                   <dbl> <chr>         
## 1 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran                 66 2oCs0DGTsRO98…
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories … Maroon 5                   67 63rPSO264uRjW…
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the T… Zara Larsson               70 1HoSmj2eLcsrR…
## 4 75FpbthrwQmzHlBJLuGdC7 Call You … The Chainsm…               60 1nqYsOef1yKKu…
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone Y… Lewis Capal…               69 7m7vv9wlQ4i0L…
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful… Ed Sheeran                 67 2yiy9cd2QktrN…
## # … with 18 more variables: track_album_name <chr>,
## #   track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## #   playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## #   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, duration_ms <dbl>

1.2 Check Data Dimensions

Here we check the number of rows and columns in the spotify_data which gives us 32833 rows and 23 attributes.

print(paste('The data has',dim(spotify_data)[1],'rows and',dim(spotify_data)[2],'attributes'))

## [1] "The data has 32833 rows and 23 attributes"

Data Dictionary

The data dictionary for the columns of spotify_data is shown below.

spotify_data_dictionary <- read_csv("spotify_data_dictionary.csv")
datatable(spotify_data_dictionary, options = list(
  autoWidth = TRUE,
  columnDefs = list(list(className = 'dt-center', targets = 3)),
  pageLength = 25,
  lengthMenu = c(5, 10, 15, 20, 25)
))

Structure & Summary

3.1 Data Structure

The structure of the spotify_data dataset, with its column names and data types, is displayed below. Every column is either of numeric or of character type.

str(spotify_data)

## spec_tbl_df [32,833 × 23] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ track_id                : chr [1:32833] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr [1:32833] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr [1:32833] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr [1:32833] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr [1:32833] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr [1:32833] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr [1:32833] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr [1:32833] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr [1:32833] "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr [1:32833] "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num [1:32833] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num [1:32833] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : num [1:32833] 6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num [1:32833] -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : num [1:32833] 1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num [1:32833] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num [1:32833] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num [1:32833] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num [1:32833] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num [1:32833] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num [1:32833] 122 100 124 122 124 ...
##  $ duration_ms             : num [1:32833] 194754 162600 176616 169093 189052 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   track_id = col_character(),
##   ..   track_name = col_character(),
##   ..   track_artist = col_character(),
##   ..   track_popularity = col_double(),
##   ..   track_album_id = col_character(),
##   ..   track_album_name = col_character(),
##   ..   track_album_release_date = col_character(),
##   ..   playlist_name = col_character(),
##   ..   playlist_id = col_character(),
##   ..   playlist_genre = col_character(),
##   ..   playlist_subgenre = col_character(),
##   ..   danceability = col_double(),
##   ..   energy = col_double(),
##   ..   key = col_double(),
##   ..   loudness = col_double(),
##   ..   mode = col_double(),
##   ..   speechiness = col_double(),
##   ..   acousticness = col_double(),
##   ..   instrumentalness = col_double(),
##   ..   liveness = col_double(),
##   ..   valence = col_double(),
##   ..   tempo = col_double(),
##   ..   duration_ms = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

3.2 Data Summary

The summary below reports the length of each character column and, for each numeric column, the minimum, mean, median, quartiles and maximum. It can be seen that the following variables are skewed, as there is a large gap between their mean and maximum values:

speechiness
acousticness
instrumentalness
liveness
tempo

Upon initial review, these variables need further outlier analysis using boxplots and histograms to check whether the outliers should be retained for analysis or treated/removed.

summary(spotify_data)

##    track_id          track_name        track_artist       track_popularity
##  Length:32833       Length:32833       Length:32833       Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 45.00  
##                                                           Mean   : 42.48  
##                                                           3rd Qu.: 62.00  
##                                                           Max.   :100.00  
##  track_album_id     track_album_name   track_album_release_date
##  Length:32833       Length:32833       Length:32833            
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##  playlist_name      playlist_id        playlist_genre     playlist_subgenre 
##  Length:32833       Length:32833       Length:32833       Length:32833      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171  
##  Median :0.6720   Median :0.721000   Median : 6.000   Median : -6.166  
##  Mean   :0.6548   Mean   :0.698619   Mean   : 5.374   Mean   : -6.720  
##  3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##       mode         speechiness      acousticness    instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   :0.5657   Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472  
##  3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187819  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225800  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810
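
The mean-versus-median comparison behind the skewness remark can be made explicit by computing both, together with a moment-based skewness coefficient, for every numeric column. The following is a minimal base R sketch on made-up values; the helper name skew_summary and the toy data frame are assumptions for illustration and are not part of the original analysis.

```r
# Hypothetical helper: mean, median and moment-based skewness per numeric column.
skew_summary <- function(df) {
  num_cols <- df[vapply(df, is.numeric, logical(1))]
  out <- t(vapply(num_cols, function(x) {
    x <- x[!is.na(x)]
    m <- mean(x)
    s <- sqrt(mean((x - m)^2))                 # population standard deviation
    c(mean = m, median = median(x),
      skewness = mean((x - m)^3) / s^3)        # > 0: right skew, < 0: left skew
  }, numeric(3)))
  as.data.frame(out)
}

# Toy data (illustrative values only).
toy <- data.frame(
  speechiness = c(0.03, 0.04, 0.05, 0.06, 0.90),  # long right tail
  valence     = c(0.10, 0.30, 0.50, 0.70, 0.90)   # symmetric
)
skew_result <- skew_summary(toy)
print(skew_result)
```

A mean well above the median together with a positive coefficient points to a right-skewed column; applied to spotify_data, the same helper would give one row per audio attribute.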

Data Cleaning

4.1 Check the data

We are taking a glimpse at the type of data in the spotify_data dataset.

glimpse(spotify_data)

## Rows: 32,833
## Columns: 23
## $ track_id                 <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdfa…
## $ track_name               <chr> "I Don't Care (with Justin Bieber) - Loud Lux…
## $ track_artist             <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "Th…
## $ track_popularity         <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, 6…
## $ track_album_id           <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E6…
## $ track_album_name         <chr> "I Don't Care (with Justin Bieber) [Loud Luxu…
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "20…
## $ playlist_name            <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop R…
## $ playlist_id              <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7cf…
## $ playlist_genre           <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po…
## $ playlist_subgenre        <chr> "dance pop", "dance pop", "dance pop", "dance…
## $ danceability             <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.4…
## $ energy                   <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.8…
## $ key                      <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5,…
## $ loudness                 <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.38…
## $ mode                     <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
## $ speechiness              <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.127…
## $ acousticness             <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030, …
## $ instrumentalness         <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00e…
## $ liveness                 <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.143…
## $ valence                  <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.1…
## $ tempo                    <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, 1…
## $ duration_ms              <dbl> 194754, 162600, 176616, 169093, 189052, 16304…

4.2 Treatment of Missing Values

Here we count the missing values per column to decide whether the affected values need to be dropped, retained or imputed with the mean/median.

colSums(is.na(spotify_data))

##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

spotify_data %>% 
  filter_all(any_vars(is.na(.)))

## # A tibble: 5 × 23
##   track_id               track_name track_artist track_popularity track_album_id
##   <chr>                  <chr>      <chr>                   <dbl> <chr>         
## 1 69gRFGOWY9OMpFJgFol1u0 <NA>       <NA>                        0 717UG2du6utFe…
## 2 5cjecvX0CmC9gK0Laf5EMQ <NA>       <NA>                        0 3luHJEPw434tv…
## 3 5TTzhRSWQS4Yu8xTgAuq6D <NA>       <NA>                        0 3luHJEPw434tv…
## 4 3VKFip3OdAvv4OfNTgFWeQ <NA>       <NA>                        0 717UG2du6utFe…
## 5 69gRFGOWY9OMpFJgFol1u0 <NA>       <NA>                        0 717UG2du6utFe…
## # … with 18 more variables: track_album_name <chr>,
## #   track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## #   playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## #   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, duration_ms <dbl>

Only 5 rows contain missing values, confined to 3 columns: track_name, track_artist and track_album_name. As this is less than 0.1% of the data, we decided to drop the rows with NAs, since doing so will not impact our analysis.

spotify_data <- spotify_data %>% drop_na()
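
For reference, 5 of 32833 rows is roughly 0.015% of the data. The share of incomplete rows can be checked before dropping them; the sketch below does this in base R on a small made-up data frame (the toy values and variable names are assumptions for illustration).

```r
# Toy data frame standing in for spotify_data (illustrative values only).
toy <- data.frame(
  track_id   = c("a1", "b2", "c3", "d4"),
  track_name = c("Song A", NA, "Song C", "Song D"),
  energy     = c(0.8, 0.5, 0.7, 0.9),
  stringsAsFactors = FALSE
)

incomplete    <- !complete.cases(toy)        # TRUE for rows with at least one NA
share_missing <- mean(incomplete) * 100      # percentage of affected rows
cat(sprintf("%d of %d rows (%.1f%%) have missing values\n",
            sum(incomplete), nrow(toy), share_missing))

# Keep only complete rows; intended to mirror what drop_na() does with no columns named.
toy_clean <- toy[!incomplete, ]
```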

4.3 Treatment of Duplicate Values

Here we check for duplicate rows to decide whether any need to be dropped. There are no rows that are complete duplicates, as the dimensions returned by the query below are the same as those of the data.

spotify_data %>% distinct() %>% dim()

## [1] 32828    23

As the data dictionary describes “track_id” as a unique identifier for the songs in the data set, we verified whether the “track_id” column has any duplicates. It contained 4476 duplicate rows, which were dropped, leaving cleaned data with 28352 rows and 23 attributes.

spotify_data %>% distinct(track_id, .keep_all=TRUE) %>% dim()

## [1] 28352    23

spotify_data <- spotify_data %>% distinct(track_id,.keep_all=TRUE)
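
The difference between fully duplicated rows and repeated identifiers can be illustrated on a small example. The sketch below uses base R and made-up rows; it assumes, as the results above suggest, that a track_id can repeat while other columns (for instance the playlist) differ.

```r
# Toy rows: track_id repeats, but the rows differ in playlist_name.
toy <- data.frame(
  track_id      = c("a1", "b2", "a1", "c3", "b2"),
  playlist_name = c("Pop Remix", "Pop Remix", "Dance Pop", "Rock Hits", "Indie Mix"),
  stringsAsFactors = FALSE
)

n_full_dupes <- sum(duplicated(toy))            # rows identical in every column
n_id_dupes   <- sum(duplicated(toy$track_id))   # repeats of a track_id after its first occurrence

# Keep the first row per track_id, comparable to distinct(track_id, .keep_all = TRUE).
toy_unique <- toy[!duplicated(toy$track_id), ]
print(toy_unique)
```

Keeping only the first occurrence also keeps only the first playlist-level values for that track, which is worth bearing in mind when reading the genre comparisons later on.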

4.4 Modify column data

The “duration_ms” column is provided in milliseconds, which is not a convenient measure for the duration of songs, so we created a new variable “duration_m” that stores the duration of the songs in minutes. The data was mutated with the conversion factor, and the “duration_ms” column was then removed as it is no longer required for further analysis.

spotify_data <- spotify_data %>% mutate(duration_m = duration_ms/60000)
spotify_data <- select(spotify_data, -duration_ms)
colnames(spotify_data)

##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_m"

4.5 Variable Transformation

On analyzing the data, we see that track popularity varies with time and genre, and we would like to analyse this relation further in the exploratory data analysis section. To do so, we extract the year from the “track_album_release_date” column and create a new variable “track_album_release_year”, so that we can carry out a yearly trend analysis instead of a day-level analysis.

spotify_data$track_album_release_date <- as.Date(spotify_data$track_album_release_date)
spotify_data$track_album_release_year <- as.numeric(format(spotify_data$track_album_release_date, "%Y"))
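
as.Date() applied to a character vector expects full dates; values in other shapes may come back as NA or raise an error, depending on the input. If some release dates were partial (for example only a year, which is an assumption about the data and not something verified here), the year could instead be taken from the leading four characters. A minimal base R sketch with made-up values:

```r
# Illustrative release dates; the partial forms are an assumption about the data.
release_date <- c("2019-06-14", "2012", "1998-05", NA)

# Take the leading four characters as the year; anything that is not four digits becomes NA.
year_chr     <- substr(release_date, 1, 4)
release_year <- ifelse(grepl("^[0-9]{4}$", year_chr), as.numeric(year_chr), NA_real_)
print(release_year)
```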

4.6 Data Binning

The values in the “track_popularity” column range from 0 to 100, which makes an overall analysis of popularity trends against attributes like genre, sub-genre and release year inconvenient, particularly when fitting models to predict the popularity of new tracks.

Therefore, we have binned the “track_popularity” data into the following 6 bins and stored the result in a new column called “track_popularity_tag”. The last bin, (100+], only guards against out-of-range values and remains empty because the maximum popularity is 100:

(60-80]
(40-60]
(20-40]
[0-20]
(80-100]
(100+]

track_popularity_uniques <- spotify_data %>% distinct(track_popularity) %>% select(track_popularity)
tags <- c("[0-20]","(20-40]", "(40-60]", "(60-80]", "(80-100]", "(100+]")

spotify_data_binned <- spotify_data %>% 
  mutate(track_popularity_tag = case_when(
    track_popularity <= 20 ~ tags[1],
    track_popularity > 20 & track_popularity <= 40 ~ tags[2],
    track_popularity > 40 & track_popularity <= 60 ~ tags[3],
    track_popularity > 60 & track_popularity <= 80 ~ tags[4],
    track_popularity > 80 & track_popularity <= 100 ~ tags[5],
    track_popularity > 100 ~ tags[6]
    ))
spotify_data_binned %>% distinct(track_popularity_tag)

## # A tibble: 5 × 1
##   track_popularity_tag
##   <chr>               
## 1 (60-80]             
## 2 (40-60]             
## 3 (20-40]             
## 4 [0-20]              
## 5 (80-100]
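
The same binning can also be expressed with base R's cut(), which takes the break points directly. This is an alternative sketch on made-up scores, not the code used in this report; it reuses the tag labels from above and omits the empty (100+] bin.

```r
tags <- c("[0-20]", "(20-40]", "(40-60]", "(60-80]", "(80-100]")
popularity <- c(0, 20, 21, 45, 60, 80, 81, 100)   # illustrative scores

# right = TRUE closes each interval on the right; include.lowest = TRUE keeps 0 in the first bin.
popularity_tag <- cut(popularity,
                      breaks = c(0, 20, 40, 60, 80, 100),
                      labels = tags,
                      right = TRUE,
                      include.lowest = TRUE)
print(table(popularity_tag))
```

cut() returns an ordered set of factor levels, so the bins keep their natural order in tables and plots, whereas the character column produced by case_when() sorts alphabetically.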

4.7 Outlier Treatment

Next, to decide whether the outliers in the dataset need to be removed, retained or imputed, we plot a boxplot for each numeric attribute in the group of song characteristics.

spotify_pivot <- spotify_data_binned %>% select(12:22) %>% pivot_longer(cols = danceability:tempo, names_to = 
"Var", values_to = "val")
ggplot(spotify_pivot, aes(y = val, fill  = Var))+
  geom_boxplot(show.legend = FALSE, width = .6, position = "dodge")+
  coord_flip() +
  facet_wrap(vars(Var), ncol=3, scales = "free") + scale_fill_grey()

These boxplots show that, apart from “key”, “mode” and “valence”, every column has several outlier data points. Without domain expertise on how much information these outliers contribute to the final analysis, we do not remove them, as they may provide insights into how track popularity relates to audience preferences.
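
The visual impression from the boxplots can be backed by a count per column. The sketch below applies the 1.5 × IQR convention, which is the default rule geom_boxplot() uses to draw points beyond the whiskers, to made-up values in base R; the helper name count_iqr_outliers is hypothetical.

```r
# Hypothetical helper: count values outside the 1.5 * IQR fences.
count_iqr_outliers <- function(x, k = 1.5) {
  x <- x[!is.na(x)]
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- k * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

# Toy data (illustrative values only).
toy <- data.frame(
  liveness = c(0.09, 0.10, 0.12, 0.13, 0.15, 0.95),
  valence  = c(0.20, 0.35, 0.50, 0.55, 0.70, 0.85)
)
outlier_counts <- vapply(toy, count_iqr_outliers, integer(1))
print(outlier_counts)
```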

4.8 Trends In Dataset

To study the skewness of the data set, we plot histograms.

ggplot(spotify_pivot, aes(x = val, fill  = Var))+
  geom_histogram(show.legend = FALSE,  position = "dodge") +
  facet_wrap(vars(Var), ncol=3, scales = "free") + scale_fill_grey()

On analysis, we see that only the attribute “valence” is approximately normally distributed, whereas:

Loudness, Danceability and Energy are left skewed.
Liveness, Speechiness, Acousticness and Instrumentalness are right skewed.

This leads to the following insights:

Most songs have high values of Loudness, Danceability and Energy, which suggests that more beats or energy may be a desirable characteristic, but this needs to be analysed further through a correlation plot.
Songs with low speechiness and acousticness are the most common, as these attributes are right skewed. This suggests looking more closely at the EDM genre, or at beats per minute, which might be a better predictor variable for identifying good songs.

Data Preview

The final preview of the cleaned data is displayed below after removing missing values and duplicates, adding new variables to gain insights in exploratory data analysis section, transforming the variables, verifying outliers and binning data for model predictions.

spotify_data_cleaned <- spotify_data_binned
datatable(head(spotify_data_cleaned, 25), options = list(
  scrollCollapse = TRUE,scrollX = TRUE,
  autoWidth = TRUE,
  columnDefs = list(list(className = 'dt-center', targets = 5)),
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20, 25)
))

Exploratory Data Analysis

1.1 Correlation Analysis

To start off, we look at the correlation between the song attributes to see if there are any statistically dependent variables. This insight can help us reduce the number of features in the data, either by applying Principal Component Analysis or by some other form of feature selection, before fitting any model to the data in the future.

corr_data <-select(spotify_data_cleaned,track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo)

corrplot(cor(corr_data), tl.col = 'black')

Insights:

From the correlation plot, it is evident that track popularity does not have much correlation with any of the audio characteristics.
There is a strong positive correlation between the energy and the loudness of a track.
Energy and acousticness have a strong negative correlation; loudness and acousticness are also negatively correlated.
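
The correlation plot shows the size of each coefficient but not whether it is statistically distinguishable from zero. A single pair can be checked with cor.test(). The sketch below uses synthetic data built to mimic the pattern described above (the generating formulas are assumptions for illustration only, not estimates from the Spotify data).

```r
set.seed(42)                                             # reproducible toy data
n <- 200
energy       <- runif(n)
loudness     <- -20 + 15 * energy + rnorm(n, sd = 2)     # built to rise with energy
acousticness <- 1 - energy + rnorm(n, sd = 0.2)          # built to fall with energy
popularity   <- runif(n, 0, 100)                         # unrelated by construction

toy <- data.frame(popularity, energy, loudness, acousticness)
cor_mat <- cor(toy)                                      # Pearson correlation matrix
print(round(cor_mat, 2))

# Test whether one coefficient differs from zero.
ct <- cor.test(toy$energy, toy$loudness)
print(ct$estimate)
print(ct$p.value)
```

On the real data, the same call on columns of corr_data would report a p-value and confidence interval for each pair of interest.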

1.2 Skewness of Audio Characteristics

audio_characteristics <- select(spotify_data_cleaned,c(12:22))
plot_num(audio_characteristics)

Insights:

Most tracks have a loudness level of around -5 dB.
Valence is approximately normally distributed.
Danceability and energy have left-skewed distributions.
The majority of tracks have instrumentalness values no higher than 0.1.

This suggests that high instrumentalness is uncommon, whereas danceability and energy are prominent in the majority of the tracks.

1.3 Significance of Playlist_Genres

Here we visualize the distribution of genres among all the tracks in the data provided.

spotify_genre_pie_data <- spotify_data_cleaned %>% 
  group_by(playlist_genre) %>% 
  summarise(Total_number_of_tracks = length(playlist_genre))

ggplot(spotify_genre_pie_data, aes(x="", y=Total_number_of_tracks, fill=playlist_genre)) + 
  geom_bar(width = 1, stat = "identity") + 
  coord_polar("y", start=0) + 
  geom_text(aes(label = paste(round(Total_number_of_tracks / sum(Total_number_of_tracks) * 100, 1), "%")),
            position = position_stack(vjust = 0.35))

Insights:

Pop has the highest proportion of tracks across playlist genres.
Each playlist genre accounts for approximately 15-18% of the songs, which shows that the genres are distributed fairly uniformly in the Spotify dataset; no genre stands out with a clear majority.
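
The percentages shown on the pie chart can also be read off as a table. A minimal base R sketch on a made-up genre vector (illustrative values, not the real counts):

```r
# Toy genre labels (illustrative only).
genre <- c("pop", "rap", "rock", "pop", "edm", "latin", "r&b", "edm", "pop", "rock")

genre_counts <- table(genre)
genre_share  <- round(100 * prop.table(genre_counts), 1)   # percentage per genre
print(sort(genre_share, decreasing = TRUE))
```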

1.4 Track characteristics by genre

plot_list <- 
  map(names(spotify_data_cleaned %>% select(where(is.numeric)) %>% select(-mode,-key)), 
      function(colName) {
        spotify_data_cleaned %>% 
          ggplot(aes(x = playlist_genre,
                     y = !! sym(colName),
                     fill = playlist_genre)) +
          geom_boxplot() +
          theme(legend.position = "NONE") +
          labs(title = colName, x = "", y = "")
    })
gridExtra::grid.arrange(grobs = plot_list[c(1:6)])

Insights:

Pop has the highest popularity across all genres.
Energy and loudness of EDM songs are the highest among all genres, which is expected.
Acousticness is high for Latin and pop and very low for EDM.
Rap has the highest danceability.

1.5 Release of Tracks by Genres(Time Analysis)

Here we plot the number of tracks released per year from 2006 to 2019 for each genre and derive insights.

song_years_genre_df <- spotify_data_cleaned %>%
  filter(track_album_release_year> 2005 & track_album_release_year<=2019)%>%
  select('track_album_release_year', 'playlist_genre')  %>%
  group_by(track_album_release_year, playlist_genre) %>%
  summarise(songs_released = n()) %>%
  ungroup()
ggplot(song_years_genre_df, aes(x = track_album_release_year, y = songs_released)) +
  geom_line(aes(color = playlist_genre)) + 
    ggtitle("Number of songs released over the years for each genre") + 
      ylab("songs released") +xlab("Release Year")

Insights:

EDM was not very popular before 2010, but the number of EDM songs released increased drastically after 2013 and became the highest of any genre by 2019.
The number of rap songs released yearly is the lowest among all genres.

1.6 Popular Track_Artists

Here we analyse which of the artists are high in popularity and have their songs on the top of the charts more frequently.

# Take the ten most popular tracks released after 2010, then keep one row per artist.
top_10_artist_popularity <- spotify_data_cleaned %>%
  select(track_artist, track_popularity, track_album_release_year) %>%
  filter(track_popularity > 0, track_album_release_year > 2010) %>%
  arrange(desc(track_popularity)) %>%
  slice_head(n = 10) %>%
  distinct(track_artist, .keep_all = TRUE)

# fill belongs in aes(), not inside reorder(), so that scale_fill_brewer() takes effect.
ggplot(data = top_10_artist_popularity,
       mapping = aes(x = reorder(track_artist, track_popularity),
                     weight = track_popularity,
                     fill = track_artist)) +
  geom_bar(show.legend = FALSE) +
  coord_flip() +
  scale_fill_brewer(palette = "Spectral") +
  ggtitle("Top 10 Artists(By Popularity) From 2010") +
  xlab("Artist Name") + ylab("Popularity Index")

From the analysis we see that the ten most popular tracks released after 2010 come from the following artists, whose tracks garner more popularity than the others:

Tones and I
Arizona Zervas
The Weeknd
Roddy Ricch
Post Malone
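
Because the code selects the ten most popular tracks before collapsing repeated artists, it can return fewer than ten names. One possible variant, shown here as a base R sketch on made-up data and not as the method used in this report, ranks artists by their best track first and only then keeps the top n.

```r
# Toy data (illustrative values only).
toy <- data.frame(
  track_artist     = c("A", "A", "B", "C", "C", "D"),
  track_popularity = c(98, 97, 95, 90, 96, 80),
  stringsAsFactors = FALSE
)

# Best (maximum) popularity per artist, then sort and keep the top n artists.
best <- aggregate(track_popularity ~ track_artist, data = toy, FUN = max)
best <- best[order(-best$track_popularity), ]
top_artists <- head(best, 3)
print(top_artists)
```

On this toy data, taking the top three tracks first would leave only two distinct artists (A twice, then C), whereas ranking per artist returns three.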

From all the above analysis, we get a better picture of which features in the dataset add value to insights and predictions. Further EDA can be done before proceeding with fitting the data to models and predicting the dependent target variables.

Summary

By analyzing the data we have developed the following insights:

There is very little correlation between track popularity and any of the audio characteristics such as danceability, loudness, energy, and liveness.
The audio characteristics ‘loudness’ and ‘energy’ are positively correlated, i.e. louder songs are perceived to be more energetic.
Energy and acousticness have a negative correlation.
While valence is approximately normally distributed, danceability and energy have left-skewed distributions.
Pop is the most popular genre, with the highest proportion of tracks in the Spotify library as well as the most popular tracks.
Genres such as Latin and Pop have high acousticness and genres such as EDM have low acousticness.
The Rap genre has the highest danceability.
The popularity of EDM has grown in the past decade. Before 2010, EDM had the fewest songs released of any genre, and it has grown to become the genre with the most tracks released in 2019.

As a roadmap, we can proceed with a more detailed exploratory data analysis and decide which model is best suited to predict our target variable, which could be the popularity of the song. Depending on whether popularity is modelled as the binned tag (classification) or as the raw score (regression), we may use models such as SVM or linear regression, as the data fits.

Authors: Ananya Chakraborty | Sourav Roy | Devang Joshi