Spotify Data Analysis

Introduction

Spotify is one of the largest global music streaming services and a market leader. In this project we analyze Spotify’s music library across characteristics such as popularity, genres, and releases to develop an understanding of Spotify’s strategic standing.

The questions we aim to answer in this report are:

Is there a correlation between the popularity of a song and its audio properties such as loudness, danceability, and acousticness?
What is the distribution of these characteristics across Spotify’s library?
What is the genre distribution across Spotify’s library?
How do various acoustic properties relate to different genres?
How have the interests of Spotify users evolved over time?

Package Information

The following R packages have been used for the data analysis in this project:

library('tidyverse') 
library('dplyr')
library('ggplot2')
library('hrbrthemes')
library('DT')
library('corrplot')
library('funModeling')

Library	Description
‘tidyverse’	Used for data manipulation.
‘dplyr’	Used for data wrangling & manipulation.
‘ggplot2’	Used for creating data visualizations.
‘hrbrthemes’	Used to add themes for plots (theme_ipsum).
‘DT’	Used for creating data tables.
‘corrplot’	Used to create correlation plots.
‘funModeling’	Used for data pre-processing and exploratory data analysis.

Data Pre-processing

Data Source

The Spotify songs data for analysis has been sourced from this GitHub repository. The data comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API.

Data Import

The imported Spotify data contains 32833 rows and 23 attributes detailing track_popularity, danceability, loudness, tempo and other such characteristics of songs released between the late 1950s and 2019.

1.1 Read & View Data

Here we read the data from a CSV file and load it to the spotify_data variable and view the first 6 rows of the data to check the content.

spotify_data <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
print(head(spotify_data))

## # A tibble: 6 × 23
##   track_id               track_name track_artist track_popularity track_album_id
##   <chr>                  <chr>      <chr>                   <dbl> <chr>         
## 1 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran                 66 2oCs0DGTsRO98…
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories … Maroon 5                   67 63rPSO264uRjW…
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the T… Zara Larsson               70 1HoSmj2eLcsrR…
## 4 75FpbthrwQmzHlBJLuGdC7 Call You … The Chainsm…               60 1nqYsOef1yKKu…
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone Y… Lewis Capal…               69 7m7vv9wlQ4i0L…
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful… Ed Sheeran                 67 2yiy9cd2QktrN…
## # … with 18 more variables: track_album_name <chr>,
## #   track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## #   playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## #   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, duration_ms <dbl>

1.2 Check Data Dimensions

Here we check the number of rows and columns in the spotify_data which gives us 32833 rows and 23 attributes.

print(paste('The data has',dim(spotify_data)[1],'rows and',dim(spotify_data)[2],'attributes'))

## [1] "The data has 32833 rows and 23 attributes"

Data Dictionary

The variables in spotify_data are described below.

spotify_data_dictionary <- read_csv("spotify_data_dictionary.csv")
datatable(spotify_data_dictionary, options = list(
  autoWidth = TRUE,
  columnDefs = list(list(className = 'dt-center', targets = 3)),
  pageLength = 25,
  lengthMenu = c(5, 10, 15, 20, 25)
))

Structure & Summary

3.1 Data Structure

The structure of the spotify_data dataset, with its data types and column names, is displayed below. Every column is either of numeric type or of character type.

str(spotify_data)

## spec_tbl_df [32,833 × 23] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ track_id                : chr [1:32833] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr [1:32833] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr [1:32833] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr [1:32833] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr [1:32833] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr [1:32833] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr [1:32833] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr [1:32833] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr [1:32833] "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr [1:32833] "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num [1:32833] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num [1:32833] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : num [1:32833] 6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num [1:32833] -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : num [1:32833] 1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num [1:32833] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num [1:32833] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num [1:32833] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num [1:32833] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num [1:32833] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num [1:32833] 122 100 124 122 124 ...
##  $ duration_ms             : num [1:32833] 194754 162600 176616 169093 189052 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   track_id = col_character(),
##   ..   track_name = col_character(),
##   ..   track_artist = col_character(),
##   ..   track_popularity = col_double(),
##   ..   track_album_id = col_character(),
##   ..   track_album_name = col_character(),
##   ..   track_album_release_date = col_character(),
##   ..   playlist_name = col_character(),
##   ..   playlist_id = col_character(),
##   ..   playlist_genre = col_character(),
##   ..   playlist_subgenre = col_character(),
##   ..   danceability = col_double(),
##   ..   energy = col_double(),
##   ..   key = col_double(),
##   ..   loudness = col_double(),
##   ..   mode = col_double(),
##   ..   speechiness = col_double(),
##   ..   acousticness = col_double(),
##   ..   instrumentalness = col_double(),
##   ..   liveness = col_double(),
##   ..   valence = col_double(),
##   ..   tempo = col_double(),
##   ..   duration_ms = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

3.2 Data Summary

The summary below reports the lengths of the character columns and, for the numeric columns, the minimum, quartiles, median, mean and maximum. It can be seen that the following variables are skewed, as there is a large difference between their mean and maximum values:

speechiness
acousticness
instrumentalness
liveness
tempo

Upon initial review, these variables need further outlier analysis using boxplots and histograms to decide whether the outliers should be retained for analysis or treated/removed.

summary(spotify_data)

##    track_id          track_name        track_artist       track_popularity
##  Length:32833       Length:32833       Length:32833       Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 45.00  
##                                                           Mean   : 42.48  
##                                                           3rd Qu.: 62.00  
##                                                           Max.   :100.00  
##  track_album_id     track_album_name   track_album_release_date
##  Length:32833       Length:32833       Length:32833            
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##  playlist_name      playlist_id        playlist_genre     playlist_subgenre 
##  Length:32833       Length:32833       Length:32833       Length:32833      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171  
##  Median :0.6720   Median :0.721000   Median : 6.000   Median : -6.166  
##  Mean   :0.6548   Mean   :0.698619   Mean   : 5.374   Mean   : -6.720  
##  3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##       mode         speechiness      acousticness    instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   :0.5657   Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472  
##  3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187819  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225800  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810

Data Cleaning

4.1 Check the data

We are taking a glimpse at the type of data in the spotify_data dataset.

head(spotify_data)

## # A tibble: 6 × 23
##   track_id               track_name track_artist track_popularity track_album_id
##   <chr>                  <chr>      <chr>                   <dbl> <chr>         
## 1 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran                 66 2oCs0DGTsRO98…
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories … Maroon 5                   67 63rPSO264uRjW…
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the T… Zara Larsson               70 1HoSmj2eLcsrR…
## 4 75FpbthrwQmzHlBJLuGdC7 Call You … The Chainsm…               60 1nqYsOef1yKKu…
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone Y… Lewis Capal…               69 7m7vv9wlQ4i0L…
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful… Ed Sheeran                 67 2yiy9cd2QktrN…
## # … with 18 more variables: track_album_name <chr>,
## #   track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## #   playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## #   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, duration_ms <dbl>

4.2 Treatment of Missing Values

Here we count the missing values per column to decide whether they should be dropped, retained, or imputed with the mean/median.

colSums(is.na(spotify_data))

##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

# Show every row that has a missing value in at least one column
spotify_data %>% 
  filter(if_any(everything(), is.na))

## # A tibble: 5 × 23
##   track_id               track_name track_artist track_popularity track_album_id
##   <chr>                  <chr>      <chr>                   <dbl> <chr>         
## 1 69gRFGOWY9OMpFJgFol1u0 <NA>       <NA>                        0 717UG2du6utFe…
## 2 5cjecvX0CmC9gK0Laf5EMQ <NA>       <NA>                        0 3luHJEPw434tv…
## 3 5TTzhRSWQS4Yu8xTgAuq6D <NA>       <NA>                        0 3luHJEPw434tv…
## 4 3VKFip3OdAvv4OfNTgFWeQ <NA>       <NA>                        0 717UG2du6utFe…
## 5 69gRFGOWY9OMpFJgFol1u0 <NA>       <NA>                        0 717UG2du6utFe…
## # … with 18 more variables: track_album_name <chr>,
## #   track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## #   playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## #   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, duration_ms <dbl>

Only 5 rows contain missing values, and they are missing in the same 3 columns:

track_name
track_album_name
track_artist

As this is less than 0.1% of the data, we decided to drop the rows with NAs, since doing so will not impact our analysis.

spotify_data <- spotify_data %>% drop_na()
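
The decision to drop rather than impute rests on how small the affected share is (5 of 32833 rows, roughly 0.015%). A minimal base-R sketch of that check is shown here on a toy data frame; `toy` and its values are hypothetical stand-ins for spotify_data, not the real data.

```r
# Toy stand-in for spotify_data (hypothetical values)
toy <- data.frame(
  track_id   = c("a", "b", "c", "d"),
  track_name = c("Song A", NA, "Song C", "Song D"),
  tempo      = c(120, 98, 130, 101)
)

# Flag rows that contain at least one NA
incomplete <- !complete.cases(toy)

# Number of affected rows and their share of the data, in percent
n_incomplete     <- sum(incomplete)
share_incomplete <- mean(incomplete) * 100

# Base-R equivalent of tidyr::drop_na(): keep only the complete rows
toy_clean <- toy[!incomplete, ]

cat(n_incomplete, "row(s) dropped,", share_incomplete, "% of the data\n")
```

Reporting the share alongside the count makes it easier to justify dropping rows instead of imputing them.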

4.3 Treatment of Duplicate Values

Here we check for duplicate rows to decide whether any need to be dropped. There are no rows that are complete duplicates, as the dimensions remain the same after keeping only distinct rows.

spotify_data %>% distinct() %>% dim()

## [1] 32828    23

As the data dictionary describes “track_id” as a unique identifier for the songs in the dataset, we checked whether the “track_id” column has any duplicates. It contained 4476 duplicates, which were dropped, giving new dimensions of 28352 rows and 23 attributes.

spotify_data %>% distinct(track_id, .keep_all=TRUE) %>% dim()

## [1] 28352    23

spotify_data <- spotify_data %>% distinct(track_id,.keep_all=TRUE)
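
The difference between a fully duplicated row and a duplicated identifier can be sketched in base R as follows. The `toy` data frame is hypothetical. It assumes a track can appear under more than one playlist, which would explain why the rows are distinct as a whole while track_id repeats.

```r
# Toy stand-in: the same track_id listed under different playlists (hypothetical)
toy <- data.frame(
  track_id      = c("t1", "t2", "t1", "t3", "t2"),
  playlist_name = c("Pop Remix", "Pop Remix", "Dance Pop", "Dance Pop", "Hip Hop"),
  stringsAsFactors = FALSE
)

# Rows that repeat an earlier row in every column
n_full_dupes <- sum(duplicated(toy))

# Rows whose track_id has already been seen
n_id_dupes <- sum(duplicated(toy$track_id))

# Keep the first occurrence of each track_id, as distinct(track_id, .keep_all = TRUE) does
toy_unique <- toy[!duplicated(toy$track_id), ]

cat("full duplicates:", n_full_dupes, "| duplicated ids:", n_id_dupes, "\n")
```

Because only the first occurrence is kept, each track retains the playlist and genre of the first playlist it was listed under, which is worth bearing in mind when reading the genre-level results.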

4.4 Modify column data

The “duration_ms” column is given in milliseconds, which is not a convenient unit for the duration of songs, so we created a new variable “duration_m” that stores the duration of the songs in minutes. After the conversion, the “duration_ms” column was dropped as it is no longer required for further analysis.

spotify_data <- spotify_data %>% mutate(duration_m = duration_ms/60000)
spotify_data <- select(spotify_data, -duration_ms)
colnames(spotify_data)

##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_m"

4.5 Variable Transformation

On analyzing the data, we see that track popularity varies with time and genre, and we would like to analyse this relation further in the exploratory data analysis section. To do so, we extract the year from the “track_album_release_date” column and create a new variable “track_album_release_year”, which lets us analyse yearly trends instead of individual dates.

spotify_data$track_album_release_date <- as.Date(spotify_data$track_album_release_date)
spotify_data$track_album_release_year <- as.numeric(format(spotify_data$track_album_release_date, "%Y"))
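
If some release dates were stored with less than day precision (for example only a year, or a year and month), converting them with as.Date() may yield NA and lose the year. This is an assumption about the data, not something verified in this report. Under that assumption, a hedged alternative is to take the year directly from the first four characters of the string. The `release_date` vector below is hypothetical.

```r
# Hypothetical release dates with mixed precision
release_date <- c("2019-06-14", "2012", "1998-07")

# Take the first four characters as the year, regardless of precision
release_year <- as.numeric(substr(release_date, 1, 4))

print(release_year)
```

Comparing sum(is.na(...)) on the year column before and after such a change would show whether any rows are affected in practice.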

4.6 Data Binning

The data in the “track_popularity” column ranges from 0 to 100, which makes it inconvenient to analyse the overall trend of popularity against attributes such as genre, sub-genre and release year, or to fit models that predict the popularity of new tracks in the future.

Therefore, we have binned the “track_popularity” data into the following 6 bins and stored the result in a new column called “track_popularity_tag”:

[0-20]
(20-40]
(40-60]
(60-80]
(80-100]
(100+]

The (100+] bin stays empty in this data set, since the maximum popularity is 100.

track_popularity_uniques <- spotify_data %>% distinct(track_popularity) %>% select(track_popularity)
tags <- c("[0-20]","(20-40]", "(40-60]", "(60-80]", "(80-100]", "(100+]")

spotify_data_binned <- spotify_data %>% 
  mutate(track_popularity_tag = case_when(
    track_popularity <= 20 ~ tags[1],
    track_popularity > 20 & track_popularity <= 40 ~ tags[2],
    track_popularity > 40 & track_popularity <= 60 ~ tags[3],
    track_popularity > 60 & track_popularity <= 80 ~ tags[4],
    track_popularity > 80 & track_popularity <= 100 ~ tags[5],
    track_popularity > 100 ~ tags[6]
    ))
spotify_data_binned %>% distinct(track_popularity_tag)

## # A tibble: 5 × 1
##   track_popularity_tag
##   <chr>               
## 1 (60-80]             
## 2 (40-60]             
## 3 (20-40]             
## 4 [0-20]              
## 5 (80-100]
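
The same bins can also be produced with base R's cut(), which returns an ordered set of factor levels that is convenient for plotting. This is a sketch of an alternative, not the method used above; the `popularity` vector is hypothetical, and the empty (100+] bin is left out.

```r
# Hypothetical popularity scores, including values on the bin edges
popularity <- c(0, 20, 21, 40, 66, 80, 81, 100)

tags <- c("[0-20]", "(20-40]", "(40-60]", "(60-80]", "(80-100]")

# right = TRUE closes each interval on the right, e.g. (20, 40];
# include.lowest = TRUE also closes the first interval on the left: [0, 20]
popularity_tag <- cut(
  popularity,
  breaks = c(0, 20, 40, 60, 80, 100),
  labels = tags,
  right = TRUE,
  include.lowest = TRUE
)

print(table(popularity_tag))
```

Unlike the case_when() version, values outside 0 to 100 become NA here rather than falling into the first or last bin, which makes unexpected inputs easier to spot.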

4.7 Outlier Treatment

Next, to decide whether the outliers in the dataset need to be removed, retained or imputed, we plot boxplots for each of the numeric audio characteristics of the songs.

spotify_pivot <- spotify_data_binned %>% 
  select(12:22) %>% 
  pivot_longer(cols = danceability:tempo, names_to = "Var", values_to = "val")
ggplot(spotify_pivot, aes(y = val, fill  = Var))+
  geom_boxplot(show.legend = FALSE, width = .6, position = "dodge")+
  coord_flip() +
  facet_wrap(vars(Var), ncol=3, scales = "free") + scale_fill_grey()

These boxplots show that, apart from “key”, “mode”, and “valence”, every column has several outlier data points. Without domain expertise on how much information these outliers contribute to our final analysis, we do not remove them, as they may provide insights into how track popularity trends with the audience, which can be used to increase popularity.
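
To put a number on what the boxplots show, the points beyond the whiskers can be counted with the common 1.5 × IQR rule. The sketch below applies it to a hypothetical `loudness` vector. It assumes the default quantile definition and is meant as an illustration of the rule, not as the exact computation behind the plots.

```r
# Hypothetical loudness values in dB, with one extreme value on each side
loudness <- c(-46.4, -8.2, -7.5, -6.2, -5.9, -5.1, -4.6, -4.0, -3.2, 1.3)

# First and third quartiles and the interquartile range
q   <- unname(quantile(loudness, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences at 1.5 * IQR beyond the quartiles
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Values outside the fences are flagged as outliers
is_outlier <- loudness < lower | loudness > upper

cat("outliers:", loudness[is_outlier], "\n")
```

Applying the same rule column by column gives an outlier count per characteristic, which can support the decision to retain or treat them.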

4.8 Trends In Dataset

To study the skewness of the data set, we plot histograms.

ggplot(spotify_pivot, aes(x = val, fill  = Var))+
  geom_histogram(show.legend = FALSE,  position = "dodge") +
  facet_wrap(vars(Var), ncol=3, scales = "free") + scale_fill_grey()

On analysis, we see that only the attribute “valence” is approximately normally distributed, whereas:

Loudness, Danceability and Energy are left skewed.
Liveness, Speechiness, Acousticness and Instrumentalness are right skewed.

This helps us draw the insights below:

Listeners appear to like songs that have higher values of Loudness, Danceability and Energy, which suggests that the more beats or energy a song has, the more it scores on the popularity index.
Listeners appear to like less speechy and less acoustic songs, as these attributes are right skewed, which suggests looking more closely at the EDM genre, or at songs with more beats per minute, which would be more likeable.
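
The direction of skew read off the histograms can be cross-checked numerically with the moment-based skewness coefficient, $g_1 = m_3 / m_2^{3/2}$, where $m_k$ is the k-th central moment. Base R has no built-in skewness function, so `skewness()` below is a hypothetical helper, and the three vectors are made-up examples.

```r
# Hypothetical helper: moment-based sample skewness g1 = m3 / m2^(3/2)
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Made-up examples: long right tail, its mirror image, and a symmetric sample
right_tail <- c(0.01, 0.02, 0.03, 0.05, 0.90)
left_tail  <- 1 - right_tail
symmetric  <- c(0.1, 0.3, 0.5, 0.7, 0.9)

cat("right tail:", skewness(right_tail), "\n")  # positive: right skewed
cat("left tail: ", skewness(left_tail), "\n")   # negative: left skewed
cat("symmetric: ", skewness(symmetric), "\n")   # close to zero
```

A positive value indicates a right-skewed variable and a negative value a left-skewed one; values near zero are consistent with a roughly symmetric distribution.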

Data Preview

The final preview of the cleaned data is displayed below after removing missing values and duplicates, adding new variables to gain insights in exploratory data analysis section, transforming the variables, verifying outliers and binning data for model predictions.

spotify_data_cleaned <- spotify_data_binned
datatable(head(spotify_data_cleaned, 25), options = list(
  scrollCollapse = TRUE,scrollX = TRUE,
  autoWidth = TRUE,
  columnDefs = list(list(className = 'dt-center', targets = 5)),
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20, 25)
))

Exploratory Data Analysis

corr_data <-select(spotify_data_cleaned,track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo)

corrplot(cor(corr_data), tl.col = 'black')

Insights:

From the correlation plot, track popularity does not show a strong correlation with any of the audio characteristics
There is a strong positive correlation of energy with loudness
Energy and acousticness have a strong negative correlation; loudness and acousticness also have a strong negative correlation
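
The correlation plot shows strength and direction only. To read off the exact coefficients for track popularity, or to test whether a given correlation differs from zero, the correlation matrix and cor.test() can be used directly. The sketch below runs on a small hypothetical data frame, not on the Spotify data.

```r
# Hypothetical toy data: loudness rises with energy, acousticness falls with it
energy <- seq(0.1, 1.0, by = 0.1)
toy <- data.frame(
  track_popularity = c(50, 40, 60, 45, 55, 50, 42, 58, 47, 53),
  energy           = energy,
  loudness         = -12 + 10 * energy +
    c(0.3, -0.2, 0.1, -0.3, 0.2, -0.1, 0.3, -0.2, 0.1, -0.3),
  acousticness     = 1 - energy +
    c(0.02, -0.03, 0.01, 0.03, -0.02, 0.02, -0.01, 0.03, -0.02, 0.01)
)

# Pearson correlation matrix, the same input corrplot() receives
corr_matrix <- cor(toy)

# Correlations of each audio characteristic with popularity, strongest first
popularity_corr <- sort(corr_matrix["track_popularity", -1], decreasing = TRUE)
print(round(popularity_corr, 2))

# Significance test for a single pair of variables
energy_loudness_test <- cor.test(toy$energy, toy$loudness)
print(energy_loudness_test$p.value)
```

With tens of thousands of tracks, even a weak correlation can have a very small p-value, so the size of the coefficient is usually the more informative number.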

audio_characteristics <- select(spotify_data_cleaned,c(12:22))
plot_num(audio_characteristics)

Insights:

The majority of the tracks have a loudness level of around -5 dB
Valence is approximately normally distributed
Danceability and energy have distributions that are left skewed
The majority of the tracks do not have instrumentalness values above 0.1

spotify_genre_pie_data <- spotify_data_cleaned %>% 
  group_by(playlist_genre) %>% 
  summarise(Total_number_of_tracks = length(playlist_genre))

ggplot(spotify_genre_pie_data, aes(x="", y=Total_number_of_tracks, fill=playlist_genre)) + 
  geom_bar(width = 1, stat = "identity") + 
  coord_polar("y", start=0) + 
  geom_text(aes(label = paste(round(Total_number_of_tracks / sum(Total_number_of_tracks) * 100, 1), "%")),
            position = position_stack(vjust = 0.35))

Insights:

Pop has the highest proportion of tracks across playlist genres
The share of songs per playlist genre ranges from approximately 15% to 18%, which shows that tracks are distributed roughly uniformly across genres in the Spotify dataset
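
The percentages on the pie chart can also be obtained as a plain table, which is easier to quote in text. A minimal base-R sketch with a hypothetical `playlist_genre` vector:

```r
# Hypothetical genre labels, one per track
playlist_genre <- c("pop", "pop", "pop", "rap", "rap",
                    "edm", "edm", "rock", "latin", "r&b")

# Count tracks per genre, convert counts to shares, and express them in percent
genre_share <- round(prop.table(table(playlist_genre)) * 100, 1)

print(sort(genre_share, decreasing = TRUE))
```

Sorting the shares makes the largest and smallest genres visible at a glance, which a pie chart with similar slice sizes does not.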

plot_list <- 
  map(names(spotify_data_cleaned %>% select(where(is.numeric)) %>% select(-mode,-key)), 
      function(colName) {
        spotify_data_cleaned %>% 
          ggplot(aes(x = playlist_genre,
                     y = !! sym(colName),
                     fill = playlist_genre)) +
          geom_boxplot() +
          theme(legend.position = "NONE") +
          labs(title = colName, x = "", y = "")
    })
gridExtra::grid.arrange(grobs = plot_list[c(1:6)])

Insights:

Pop has the highest popularity across all genres
Energy and loudness of EDM songs are the highest among all genres, which is expected
Acousticness is high for Latin and pop and very low for EDM
Rap has the highest danceability

song_years_genre_df = spotify_data_cleaned %>%
  filter(track_album_release_year> 2005 & track_album_release_year<=2019)%>%
  select('track_album_release_year', 'playlist_genre')  %>%
  group_by(track_album_release_year, playlist_genre) %>%
  summarise(songs_released = n()) %>%
  ungroup()
ggplot(song_years_genre_df, aes(x = track_album_release_year, y = songs_released)) +
  geom_line(aes(color = playlist_genre)) + 
    ggtitle("Number of songs released over the years for each genre") + 
      ylab("songs released") +xlab("Release Year")

Insights:

EDM was not so popular before 2010, but the number of EDM songs released increased drastically after 2013 and became the highest by 2019
The number of rap songs released yearly is the lowest among all genres
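
Statements about which genre leads in a given year can be checked against the yearly counts rather than read off the line chart. The sketch below builds the year-by-genre counts in base R and picks the genre with the most releases per year. The `toy` data frame is hypothetical.

```r
# Hypothetical tracks with a release year and a genre
toy <- data.frame(
  year  = c(2009, 2009, 2009, 2015, 2015, 2015, 2019, 2019, 2019, 2019),
  genre = c("pop", "pop", "rap", "edm", "edm", "pop", "edm", "edm", "edm", "pop")
)

# Count songs per year and genre, including zero counts for missing combinations
counts <- as.data.frame(
  table(year = toy$year, genre = toy$genre),
  responseName = "songs_released",
  stringsAsFactors = FALSE
)

# For each year, the genre with the most songs released
top_genre <- sapply(
  split(counts, counts$year),
  function(d) d$genre[which.max(d$songs_released)]
)

print(top_genre)
```

When two genres tie in a year, which.max() returns the first one in alphabetical order, so ties are worth checking separately.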

Summary

By analyzing the data we have developed the following insights:

There is no strong correlation between track popularity and any of the audio characteristics such as danceability, loudness, energy, and liveness
The audio characteristics ‘loudness’ and ‘energy’ are positively correlated, i.e. louder songs are perceived to be more energetic
Energy and acousticness have a negative correlation
While valence is approximately normally distributed, danceability and energy have distributions which are left skewed.
Pop is the most popular genre, with the highest proportion of tracks in the Spotify library as well as the most popular tracks
Genres such as Latin and pop have high acousticness, and genres such as EDM have low acousticness.
The rap genre has the highest danceability.
The popularity of EDM has grown in the past decade. Before 2010, EDM had the least number of songs released of any genre, and it has grown to become the genre with the most tracks released in 2019

Authors: Ananya Chakraborty | Sourav Roy | Devang Joshi
