
Spotify Songs Data Analysis

Jiaoyao Liu

4/26/2020

Introduction

The dataset spotify_songs contains key attributes of Spotify music, such as name, artist, album, genre, danceability, etc.

By analyzing the data, we can understand the features of Spotify music and, further, the elements that make some songs more popular than others. Songs can be separated into two groups by popularity, and EDA is conducted to identify the differences between the two groups. This would help current artists improve and help new artists catch the current trend in popular music, increasing the chance that their new songs become popular.

Since we have the track_album_release_date information, we extract the “year” and “month” and store them in separate columns. This lets us explore whether the number of song releases varies by month.

Popular songs and artists could also be identified by manipulating the data. Among the top Spotify songs, what are the most trendy songs and who are the most popular artists?

Packages Required

These packages are required for data manipulation and visualization.

library(dplyr) # manipulate data
library(ggplot2) # visualizations
library(magrittr) # pipe operator
library(DT) # create tables
library(knitr) # display tables
library(lubridate) # manipulate date and time

Data Preparation

The data came from Spotify via the spotifyr package and was provided by tidytuesday. I downloaded the dataset on 4/3/2020.

Data Source: Spotify Songs

I imported the data into RStudio and checked the dimensions and the first few rows of the dataset.

songs <- read.csv("spotify_songs.csv",stringsAsFactors=FALSE)
dim(songs)

## [1] 32833    23

head(songs)

##                 track_id                                            track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31                       Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l                       All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7                     Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x               Someone You Loved - Future Humans Remix
## 6 7fvUMiyapMsRRxr07cU8Ef     Beautiful People (feat. Khalid) - Jack Wins Remix
##       track_artist track_popularity         track_album_id
## 1       Ed Sheeran               66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2         Maroon 5               67 63rPSO264uRjW1X5E6cWv6
## 3     Zara Larsson               70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers               60 1nqYsOef1yKKuGOVchbsk6
## 5    Lewis Capaldi               69 7m7vv9wlQ4i0LFuJiE2zsQ
## 6       Ed Sheeran               67 2yiy9cd2QktrNvWC2EUi0k
##                                        track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2                       Memories (Dillon Francis Remix)
## 3                       All the Time (Don Diablo Remix)
## 4                           Call You Mine - The Remixes
## 5               Someone You Loved (Future Humans Remix)
## 6     Beautiful People (feat. Khalid) [Jack Wins Remix]
##   track_album_release_date playlist_name            playlist_id playlist_genre
## 1               2019-06-14     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 2               2019-12-13     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 3               2019-07-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 4               2019-07-19     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 5               2019-03-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 6               2019-07-11     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
##   playlist_subgenre danceability energy key loudness mode speechiness
## 1         dance pop        0.748  0.916   6   -2.634    1      0.0583
## 2         dance pop        0.726  0.815  11   -4.969    1      0.0373
## 3         dance pop        0.675  0.931   1   -3.432    0      0.0742
## 4         dance pop        0.718  0.930   7   -3.778    1      0.1020
## 5         dance pop        0.650  0.833   1   -4.672    1      0.0359
## 6         dance pop        0.675  0.919   8   -5.385    1      0.1270
##   acousticness instrumentalness liveness valence   tempo duration_ms
## 1       0.1020         0.00e+00   0.0653   0.518 122.036      194754
## 2       0.0724         4.21e-03   0.3570   0.693  99.972      162600
## 3       0.0794         2.33e-05   0.1100   0.613 124.008      176616
## 4       0.0287         9.43e-06   0.2040   0.277 121.956      169093
## 5       0.0803         0.00e+00   0.0833   0.725 123.976      189052
## 6       0.0799         0.00e+00   0.1430   0.585 124.982      163049

In the original dataset, there are 32833 rows and 23 variables, with 5 rows containing missing values in the columns “track_name”, “track_artist”, and “track_album_name”. Since these rows cannot provide the important name and artist information, I deleted them. After cleaning, there are no missing values in this dataset.

colSums(is.na(songs))

##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

subset(songs,is.na(songs$track_artist))

##                     track_id track_name track_artist track_popularity
## 8152  69gRFGOWY9OMpFJgFol1u0       <NA>         <NA>                0
## 9283  5cjecvX0CmC9gK0Laf5EMQ       <NA>         <NA>                0
## 9284  5TTzhRSWQS4Yu8xTgAuq6D       <NA>         <NA>                0
## 19569 3VKFip3OdAvv4OfNTgFWeQ       <NA>         <NA>                0
## 19812 69gRFGOWY9OMpFJgFol1u0       <NA>         <NA>                0
##               track_album_id track_album_name track_album_release_date
## 8152  717UG2du6utFe7CdmpuUe3             <NA>               2012-01-05
## 9283  3luHJEPw434tvNbme3SP8M             <NA>               2017-12-01
## 9284  3luHJEPw434tvNbme3SP8M             <NA>               2017-12-01
## 19569 717UG2du6utFe7CdmpuUe3             <NA>               2012-01-05
## 19812 717UG2du6utFe7CdmpuUe3             <NA>               2012-01-05
##               playlist_name            playlist_id playlist_genre
## 8152                HIP&HOP 5DyJsJZOpMJh34WvUrQzMV            rap
## 9283            GANGSTA Rap 5GA8GDo7RQC3JEanT81B3g            rap
## 9284            GANGSTA Rap 5GA8GDo7RQC3JEanT81B3g            rap
## 19569 Reggaeton viejito🔥 0si5tw70PIgPkY1Eva6V8f          latin
## 19812         latin hip hop 3nH8aytdqNeRbcRCg3dw9q          latin
##       playlist_subgenre danceability energy key loudness mode speechiness
## 8152   southern hip hop        0.714  0.821   6   -7.635    1      0.1760
## 9283       gangster rap        0.678  0.659  11   -5.364    0      0.3190
## 9284       gangster rap        0.465  0.820  10   -5.907    0      0.3070
## 19569         reggaeton        0.675  0.919  11   -6.075    0      0.0366
## 19812     latin hip hop        0.714  0.821   6   -7.635    1      0.1760
##       acousticness instrumentalness liveness valence   tempo duration_ms
## 8152        0.0410          0.00000   0.1160   0.649  95.999      282707
## 9283        0.0534          0.00000   0.5530   0.191 146.153      202235
## 9284        0.0963          0.00000   0.0888   0.505  86.839      206465
## 19569       0.0606          0.00653   0.1030   0.726  97.017      252773
## 19812       0.0410          0.00000   0.1160   0.649  95.999      282707

new_songs <- filter(songs, !is.na(track_artist))
colSums(is.na(new_songs))
dim(new_songs)
summary(new_songs)

After removing the rows with missing values, there are 32,828 rows.

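Checking the boxplot of duration_ms, we see there are outliers. Values more than 1.5 times the interquartile range below the first quartile or above the third quartile are treated as outliers and removed.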

boxplot(new_songs$duration_ms)

lowerq <-  quantile(new_songs$duration_ms,na.rm = TRUE)[2]
upperq <-  quantile(new_songs$duration_ms,na.rm = TRUE)[4]
iqr <-  upperq - lowerq
mild.threshold.upper <-  (iqr * 1.5) + upperq
mild.threshold.lower <-  lowerq - (iqr * 1.5)

is_outlier <- new_songs$duration_ms < mild.threshold.lower | new_songs$duration_ms > mild.threshold.upper
new_songs_no_outliers <- new_songs[!is_outlier, ] # logical indexing stays correct even when no row is an outlier
dim(new_songs_no_outliers)

## [1] 31441    23

A total of 1387 songs were flagged as outliers and removed. The dataset now contains 31441 songs, with a maximum duration of 352187 ms (around 5.87 minutes).
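The 1.5 × IQR rule applied above can be wrapped in a small helper so the same fences are reusable for other columns. This is a minimal sketch on made-up durations; the helper name iqr_fences and the toy values are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: Tukey fences at k * IQR beyond the quartiles.
iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# Toy durations in milliseconds (made up for illustration).
toy_duration <- c(150000, 180000, 200000, 210000, 220000, 250000, 600000)
fences <- iqr_fences(toy_duration)

# Logical indexing keeps every value when nothing falls outside the fences.
keep <- toy_duration >= fences["lower"] & toy_duration <= fences["upper"]
toy_duration[keep]
```

On the toy data only the 600000 ms value falls outside the fences, so the other six durations are kept.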

summary(new_songs_no_outliers$duration_ms)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   89250  187013  214115  219288  248000  352187

boxplot(new_songs_no_outliers$duration_ms)

The variables track_id, track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id, playlist_genre, and playlist_subgenre are character vectors of length 31441. Below is the statistical summary of the numeric variables.

summary(new_songs_no_outliers[,-c(1:3,5:11)])

##  track_popularity  danceability        energy              key        
##  Min.   :  0.00   Min.   :0.0771   Min.   :0.000175   Min.   : 0.000  
##  1st Qu.: 25.00   1st Qu.:0.5650   1st Qu.:0.582000   1st Qu.: 2.000  
##  Median : 46.00   Median :0.6720   Median :0.722000   Median : 6.000  
##  Mean   : 43.04   Mean   :0.6563   Mean   :0.699792   Mean   : 5.359  
##  3rd Qu.: 62.00   3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000  
##  Max.   :100.00   Max.   :0.9810   Max.   :1.000000   Max.   :11.000  
##     loudness            mode         speechiness      acousticness      
##  Min.   :-46.448   Min.   :0.0000   Min.   :0.0224   Min.   :0.0000014  
##  1st Qu.: -8.069   1st Qu.:0.0000   1st Qu.:0.0411   1st Qu.:0.0158000  
##  Median : -6.083   Median :1.0000   Median :0.0631   Median :0.0817000  
##  Mean   : -6.635   Mean   :0.5651   Mean   :0.1078   Mean   :0.1753684  
##  3rd Qu.: -4.612   3rd Qu.:1.0000   3rd Qu.:0.1330   3rd Qu.:0.2550000  
##  Max.   :  1.275   Max.   :1.0000   Max.   :0.9180   Max.   :0.9920000  
##  instrumentalness       liveness          valence            tempo       
##  Min.   :0.0000000   Min.   :0.00936   Min.   :0.00001   Min.   : 37.11  
##  1st Qu.:0.0000000   1st Qu.:0.09310   1st Qu.:0.33400   1st Qu.: 99.93  
##  Median :0.0000121   Median :0.12800   Median :0.51500   Median :121.91  
##  Mean   :0.0755050   Mean   :0.19032   Mean   :0.51312   Mean   :120.88  
##  3rd Qu.:0.0032900   3rd Qu.:0.24900   3rd Qu.:0.69500   3rd Qu.:133.99  
##  Max.   :0.9940000   Max.   :0.99400   Max.   :0.99100   Max.   :239.44  
##   duration_ms    
##  Min.   : 89250  
##  1st Qu.:187013  
##  Median :214115  
##  Mean   :219288  
##  3rd Qu.:248000  
##  Max.   :352187

New variables “year” and “month” (based on track_album_release_date) are added to help analyze the music patterns by time.

new_songs_no_outliers$year <- year(ymd(new_songs_no_outliers$track_album_release_date))
new_songs_no_outliers$month <- month(ymd(new_songs_no_outliers$track_album_release_date))
glimpse(new_songs_no_outliers)

## Observations: 31,441
## Variables: 25
## $ track_id                 <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCY...
## $ track_name               <chr> "I Don't Care (with Justin Bieber) - Loud ...
## $ track_artist             <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", ...
## $ track_popularity         <int> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58...
## $ track_album_id           <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X...
## $ track_album_name         <chr> "I Don't Care (with Justin Bieber) [Loud L...
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", ...
## $ playlist_name            <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Po...
## $ playlist_id              <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD...
## $ playlist_genre           <chr> "pop", "pop", "pop", "pop", "pop", "pop", ...
## $ playlist_subgenre        <chr> "dance pop", "dance pop", "dance pop", "da...
## $ danceability             <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, ...
## $ energy                   <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, ...
## $ key                      <int> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5,...
## $ loudness                 <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5...
## $ mode                     <int> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, ...
## $ speechiness              <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0....
## $ acousticness             <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.0803...
## $ instrumentalness         <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0....
## $ liveness                 <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0....
## $ valence                  <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, ...
## $ tempo                    <dbl> 122.036, 99.972, 124.008, 121.956, 123.976...
## $ duration_ms              <int> 194754, 162600, 176616, 169093, 189052, 16...
## $ year                     <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, ...
## $ month                    <dbl> 6, 12, 7, 7, 3, 7, 7, 8, 6, 6, 6, 8, 3, 5,...
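The good-songs summary later in this report shows 176 NA's in year and month, which means ymd() could not parse some release dates. One plausible cause, which is an assumption and was not verified against the data here, is that some albums carry only a year or a year and month. Under that assumption, a minimal base R sketch that still recovers the year from such strings (the dates below are made up):

```r
# Made-up release dates: two complete, one year-only, one year-month.
release_dates <- c("2019-06-14", "2012-01-05", "1978", "2005-03")

# The first four characters are the year in every one of these formats.
release_year <- as.integer(substr(release_dates, 1, 4))

# Characters 6-7 hold the month when present; an empty string becomes NA.
release_month <- as.integer(substr(release_dates, 6, 7))

data.frame(release_dates, release_year, release_month)
```

With this approach a year-only date keeps its year and only the month is missing, so fewer songs would drop out of year-based comparisons.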

Below is a preview of the cleaned data.

new_songs_no_outliers %>% 
  head(100) %>% 
  datatable()

Below is the data dictionary of variable names, data type, and variable descriptions.

new_songs.type <- lapply(new_songs_no_outliers, class)
new_songs.var_desc <- c("Song unique ID",
                        "Song Name",
                        "Song Artist",
                        "Song Popularity (0-100) where higher is better",
                        "Album unique ID",
                        "Song album name",
                        "Date when album released",
                        "Name of playlist",
                        "Playlist ID",
                        "Playlist genre",
                        "Playlist subgenre",
                        "Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.",
                        "Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.",
                        "The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.",
                        "The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.",
                        "Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.",
                        "Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.",
                        "A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.",
                        "Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.",
                        "Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.",
                        "A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).",
                        "The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.",
                        "Duration of song in milliseconds",
    
                        "Year when album released",
                        "Month when album released")
new_songs.var_names <- colnames(new_songs_no_outliers)
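data.description <- data.frame(new_songs.var_names, unlist(new_songs.type), new_songs.var_desc, row.names = NULL, stringsAsFactors = FALSE) # as_data_frame() is deprecated; a plain data frame is enough for kable()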
colnames(data.description) <- c("Variable Names","Data Type","Variable Description")
kable(data.description)

Variable Names	Data Type	Variable Description
track_id	character	Song unique ID
track_name	character	Song Name
track_artist	character	Song Artist
track_popularity	integer	Song Popularity (0-100) where higher is better
track_album_id	character	Album unique ID
track_album_name	character	Song album name
track_album_release_date	character	Date when album released
playlist_name	character	Name of playlist
playlist_id	character	Playlist ID
playlist_genre	character	Playlist genre
playlist_subgenre	character	Playlist subgenre
danceability	numeric	Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy	numeric	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key	integer	The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness	numeric	The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.
mode	integer	Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness	numeric	Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness	numeric	A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness	numeric	Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness	numeric	Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence	numeric	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo	numeric	The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms	integer	Duration of song in milliseconds
year	numeric	Year when album released
month	numeric	Month when album released

Exploratory Data Analysis

From Good to Great Songs

Good songs are popular and people like them already. But how can they be improved to get into the top tiers? What are the key differences between the most popular songs and the good ones?

In this analysis, songs are categorized as great or good based on their popularity rank. The top 3% are considered great songs, and those in the top 10% to 20% range are defined as good songs. Below are their summary statistics.

great_songs <- filter(new_songs_no_outliers, track_popularity>=quantile(new_songs_no_outliers$track_popularity, probs = .97))
summary(great_songs)

##    track_id          track_name        track_artist       track_popularity
##  Length:1045        Length:1045        Length:1045        Min.   : 83.00  
##  Class :character   Class :character   Class :character   1st Qu.: 84.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 87.00  
##                                                           Mean   : 87.65  
##                                                           3rd Qu.: 90.00  
##                                                           Max.   :100.00  
##  track_album_id     track_album_name   track_album_release_date
##  Length:1045        Length:1045        Length:1045             
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##  playlist_name      playlist_id        playlist_genre     playlist_subgenre 
##  Length:1045        Length:1045        Length:1045        Length:1045       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   danceability        energy            key           loudness      
##  Min.   :0.3100   Min.   :0.1250   Min.   : 0.00   Min.   :-18.717  
##  1st Qu.:0.6390   1st Qu.:0.5290   1st Qu.: 2.00   1st Qu.: -7.026  
##  Median :0.7290   Median :0.6450   Median : 6.00   Median : -5.678  
##  Mean   :0.7062   Mean   :0.6309   Mean   : 5.69   Mean   : -5.938  
##  3rd Qu.:0.7950   3rd Qu.:0.7440   3rd Qu.: 9.00   3rd Qu.: -4.219  
##  Max.   :0.9320   Max.   :0.9720   Max.   :11.00   Max.   : -1.940  
##       mode        speechiness      acousticness      instrumentalness   
##  Min.   :0.000   Min.   :0.0232   Min.   :0.000248   Min.   :0.0000000  
##  1st Qu.:0.000   1st Qu.:0.0456   1st Qu.:0.045100   1st Qu.:0.0000000  
##  Median :1.000   Median :0.0735   Median :0.141000   Median :0.0000000  
##  Mean   :0.534   Mean   :0.1223   Mean   :0.222963   Mean   :0.0113265  
##  3rd Qu.:1.000   3rd Qu.:0.1560   3rd Qu.:0.331000   3rd Qu.:0.0000219  
##  Max.   :1.000   Max.   :0.5030   Max.   :0.952000   Max.   :0.6570000  
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0197   Min.   :0.0528   Min.   : 72.54   Min.   :104591  
##  1st Qu.:0.0912   1st Qu.:0.3500   1st Qu.: 95.98   1st Qu.:180522  
##  Median :0.1130   Median :0.5290   Median :115.00   Median :201040  
##  Mean   :0.1640   Mean   :0.5140   Mean   :119.61   Mean   :204875  
##  3rd Qu.:0.1810   3rd Qu.:0.6680   3rd Qu.:135.13   3rd Qu.:222347  
##  Max.   :0.9620   Max.   :0.9650   Max.   :205.27   Max.   :342040  
##       year          month       
##  Min.   :1978   Min.   : 1.000  
##  1st Qu.:2018   1st Qu.: 5.000  
##  Median :2019   Median : 8.000  
##  Mean   :2018   Mean   : 7.569  
##  3rd Qu.:2019   3rd Qu.:10.000  
##  Max.   :2020   Max.   :12.000

good_songs <- filter(new_songs_no_outliers, track_popularity>=quantile(new_songs_no_outliers$track_popularity, probs = .8), track_popularity<quantile(new_songs_no_outliers$track_popularity, probs = .9))
summary(good_songs)

##    track_id          track_name        track_artist       track_popularity
##  Length:3209        Length:3209        Length:3209        Min.   :66.00   
##  Class :character   Class :character   Class :character   1st Qu.:67.00   
##  Mode  :character   Mode  :character   Mode  :character   Median :69.00   
##                                                           Mean   :69.31   
##                                                           3rd Qu.:71.00   
##                                                           Max.   :73.00   
##                                                                           
##  track_album_id     track_album_name   track_album_release_date
##  Length:3209        Length:3209        Length:3209             
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##                                                                
##  playlist_name      playlist_id        playlist_genre     playlist_subgenre 
##  Length:3209        Length:3209        Length:3209        Length:3209       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   danceability        energy            key            loudness      
##  Min.   :0.1570   Min.   :0.0167   Min.   : 0.000   Min.   :-28.309  
##  1st Qu.:0.5650   1st Qu.:0.5780   1st Qu.: 2.000   1st Qu.: -7.508  
##  Median :0.6750   Median :0.7110   Median : 6.000   Median : -5.830  
##  Mean   :0.6625   Mean   :0.6905   Mean   : 5.337   Mean   : -6.299  
##  3rd Qu.:0.7700   3rd Qu.:0.8200   3rd Qu.: 9.000   3rd Qu.: -4.469  
##  Max.   :0.9790   Max.   :0.9930   Max.   :11.000   Max.   : -0.739  
##                                                                      
##       mode         speechiness      acousticness       instrumentalness   
##  Min.   :0.0000   Min.   :0.0233   Min.   :0.0000129   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0401   1st Qu.:0.0244000   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0612   Median :0.0985000   Median :0.0000018  
##  Mean   :0.5964   Mean   :0.1020   Mean   :0.1803308   Mean   :0.0217440  
##  3rd Qu.:1.0000   3rd Qu.:0.1240   3rd Qu.:0.2640000   3rd Qu.:0.0002100  
##  Max.   :1.0000   Max.   :0.6090   Max.   :0.9830000   Max.   :0.9340000  
##                                                                           
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0212   Min.   :0.0371   Min.   : 61.66   Min.   : 92093  
##  1st Qu.:0.0912   1st Qu.:0.3640   1st Qu.: 99.91   1st Qu.:189296  
##  Median :0.1230   Median :0.5350   Median :121.01   Median :213338  
##  Mean   :0.1823   Mean   :0.5369   Mean   :121.28   Mean   :217770  
##  3rd Qu.:0.2280   3rd Qu.:0.7130   3rd Qu.:136.02   3rd Qu.:240907  
##  Max.   :0.9830   Max.   :0.9850   Max.   :210.16   Max.   :352160  
##                                                                     
##       year          month       
##  Min.   :1958   Min.   : 1.000  
##  1st Qu.:2008   1st Qu.: 3.000  
##  Median :2017   Median : 7.000  
##  Mean   :2011   Mean   : 6.502  
##  3rd Qu.:2019   3rd Qu.:10.000  
##  Max.   :2020   Max.   :12.000  
##  NA's   :176    NA's   :176
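The two filters above select popularity bands by quantile. The same logic can be sketched on toy data to make the cut points explicit; the helper in_quantile_band and the toy scores are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: flag values whose rank falls in the band [lower, upper).
# With upper = 1 the band is open-ended, i.e. everything at or above the lower cut.
in_quantile_band <- function(x, lower, upper = 1) {
  cuts <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  if (upper >= 1) x >= cuts[1] else x >= cuts[1] & x < cuts[2]
}

# Toy popularity scores without ties (made up for illustration).
toy_popularity <- 1:100
great <- toy_popularity[in_quantile_band(toy_popularity, 0.97)]
good  <- toy_popularity[in_quantile_band(toy_popularity, 0.80, 0.90)]
```

In the real data track_popularity is an integer with many ties, so the group sizes only approximate the nominal percentages (1045 of 31441 songs is about 3.3%).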

From the summaries, we can see there are 1045 songs in the top 3% and 3209 songs in the top 10%~20%. Among the factors, “year” appears to differ most between the two groups. The most popular songs (top 3%) are much newer than the good songs: at least 75% of them were released in 2018 or later, and the median year is 2019. For the good songs, 25% were released in or before 2008 and the median year is 2017. The year could not be parsed for 176 of the good songs, which appear as NA's in the summary.

The numbers suggest that most songs have a popularity “lifespan”: the older a song gets, the less popular it tends to be. From this perspective, new songs by artists with little experience in the music industry could gain more popularity than older songs by artists who have been in the field for a long time. In other words, a song's popularity does not appear to be limited by the artist's years of experience. It also implies that you may never have heard of a singer even though he/she created many songs in the past. This is a good sign for new singers with limited or no experience who can create songs that satisfy the needs of the general public.

great_songs$year <- (year(ymd(great_songs$track_album_release_date)))
ggplot(data = as_tibble(great_songs), aes(y = year)) +
  geom_boxplot() +
  ggtitle("Year Distribution of Great Songs") +
  theme(plot.title = element_text(hjust = 0.5))

good_songs$year <- (year(ymd(good_songs$track_album_release_date)))
good_songs_1 <- filter(good_songs, !is.na(year))
ggplot(data = as_tibble(good_songs_1), aes(y = year)) +
  geom_boxplot() +
  ggtitle("Year Distribution of Good Songs") +
  theme(plot.title = element_text(hjust = 0.5))

Another noticeable factor is danceability. Great songs tend to have higher danceability overall than good songs: the median danceability is 0.728 for great songs and 0.676 for good songs. The minimum, 1st quartile and 3rd quartile show the same pattern. The maximum, however, is lower for great songs (0.932) than for good songs (0.979), which might suggest that extreme danceability values make a song less popular. Below is the density plot of danceability for great songs (in red) and good songs (in blue); the dashed lines mark the medians.

plot(density(great_songs$danceability), col="red", main = "Danceability of Great and Good Songs", xlab = "Danceability")
lines(density(good_songs$danceability), col="blue")
abline(v=median(great_songs$danceability), col="red", lty=2, lwd=1.5)
abline(v=median(good_songs$danceability), col="blue", lty=2, lwd=1.5)
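
The quartile comparison described above can also be laid out as one small table. This is a minimal sketch with made-up numbers: `dance_great` and `dance_good` are hypothetical vectors standing in for `great_songs$danceability` and `good_songs$danceability`.

```r
# Hypothetical danceability values standing in for the two groups.
dance_great <- c(0.52, 0.64, 0.70, 0.73, 0.78, 0.81, 0.90)
dance_good  <- c(0.35, 0.58, 0.63, 0.68, 0.74, 0.80, 0.97)

# Minimum, quartiles and maximum for each group, one row per group.
probs <- c(0, 0.25, 0.5, 0.75, 1)
dance_quantiles <- rbind(
  great = quantile(dance_great, probs),
  good  = quantile(dance_good, probs)
)
round(dance_quantiles, 3)
```

Reading across a column compares the same quantile between the groups, which is the check the paragraph performs by eye on the two `summary()` outputs.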

Besides a more recent release year and higher danceability, great songs also have lower energy, higher acousticness, slower tempo and shorter duration (duration_ms) than good songs. Newcomers to the music industry could refer to these characteristics and adjust their songs before publishing. Below are the recommended ranges (25%~75%) of song features for beginners.

recommend <- matrix(
  c(0.795, 0.744, 0.331, 135.13, 222347,
    0.639, 0.529, 0.045, 95.98, 180522),
  ncol = 5, byrow = TRUE
)
colnames(recommend) <- c("Danceability", "Energy", "Acousticness", "Tempo", "Duration(in ms)")
rownames(recommend) <- c("Upper Bound", "Lower Bound")
recommend <- as.table(recommend)
recommend

##             Danceability     Energy Acousticness      Tempo Duration(in ms)
## Upper Bound        0.795      0.744        0.331    135.130      222347.000
## Lower Bound        0.639      0.529        0.045     95.980      180522.000
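
The bounds in the table above were typed in by hand from the summary output. As a sketch of an alternative, the same kind of 25%~75% range can be computed directly with `quantile()`. `toy_songs` is a hypothetical data frame with made-up values standing in for `great_songs`, and only three of the five features are shown.

```r
# Hypothetical stand-in for great_songs; the values are made up.
toy_songs <- data.frame(
  danceability = c(0.60, 0.65, 0.70, 0.75, 0.80),
  energy       = c(0.50, 0.55, 0.65, 0.70, 0.75),
  tempo        = c(90, 100, 110, 120, 130)
)

features <- c("danceability", "energy", "tempo")

# 75th and 25th percentiles of each feature, one column per feature.
iqr_range <- sapply(toy_songs[features], quantile,
                    probs = c(0.75, 0.25), na.rm = TRUE)
rownames(iqr_range) <- c("Upper Bound", "Lower Bound")
iqr_range
```

Computing the bounds this way keeps the table in step with the data if the definition of the great songs group ever changes.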

More Release in Certain Months?

Is the release of popular songs affected by month? We will be using the top 3% songs for studying the trend.

From the summary statistics, we see that most of the top 3% of songs were released in 2019. Therefore, we separate the 2019 data from the rest and use it to study the monthly pattern. From the bar chart below, we can tell that November, October and June are the top 3 months for releases, while January, February and April have the fewest.

Assuming all the 2019 songs were released in the northern hemisphere, warm and pleasant weather might be a factor that influences the number of releases and the market. Imagine that nights are long and days are gloomy and rainy: people may feel more upset and be less likely to enjoy music. Affected by the unappealing weather, musicians might become less inspired and passionate, which could lead to fewer releases. Nature and the world itself are where inspiration originates, and music represents the passion of human beings. Without nature, we could do nothing.

However, further research on how weather and environment affect the music market and musicians' productivity would be needed to support this explanation.

dim(great_songs)

## [1] 1045   25

# Keep only the great songs released in 2019 (year is numeric)
great_songs_2019 <- filter(great_songs, year == 2019)
ggplot(data = great_songs_2019, aes(x = month)) +
  geom_bar(stat = "count") +
  scale_x_continuous(breaks = seq(1, 12, 1)) +
  ggtitle("Release of Great Songs in 2019") +
  theme(plot.title = element_text(hjust = 0.5))
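
The ranking of months can also be read from a table instead of the chart. The following is a minimal sketch: `toy_2019` is a hypothetical data frame with made-up `month` values standing in for `great_songs_2019`, and `n_songs` and `month_name` are illustrative column names.

```r
library(dplyr)

# Hypothetical stand-in for great_songs_2019; the month values are made up.
toy_2019 <- data.frame(month = c(1, 3, 6, 6, 10, 10, 11, 11, 11))

# Count songs per month, attach the month abbreviation, and rank by count.
month_counts <- toy_2019 %>%
  count(month, name = "n_songs") %>%
  mutate(month_name = month.abb[month]) %>%
  arrange(desc(n_songs))
month_counts
```

The first rows of `month_counts` are the busiest months and the last rows the quietest, which makes the "top 3" and "fewest" statements easy to verify.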

Top Tier Songs and Artists in 2019

What are the most popular songs in 2019? After removing duplicate track_name values, the most popular songs in 2019 are listed below.

great_songs_2019_nodup<- great_songs_2019 %>% distinct(track_name, .keep_all = TRUE)
great_songs_2019_nodup %>% 
  select(track_name,track_artist,track_popularity,playlist_genre) %>%
  arrange(desc(track_popularity)) %>% 
  datatable()

Who are the most popular artists in 2019? Billie Eilish, Ariana Grande, DaBaby, and Post Malone are the top 4. The list below includes the name of the artist and the number of songs that are in the great_songs_2019 list (ranked by the number of songs, from high to low).

great_songs_2019_nodup %>% 
  group_by(track_artist) %>%
  tally() %>% 
  filter(n>1) %>% 
  arrange(desc(n)) %>% 
  datatable()

In the great_songs_2019 list, which genres do the songs belong to? After counting the songs in each genre, we can see that most of the songs are in the “pop” genre, some are in “rap” and “latin,” and very few are in “r&b” or “edm.”

great_songs_2019_nodup %>% 
  group_by(playlist_genre) %>%
  tally() %>% 
  filter(n>1) %>% 
  arrange(desc(n))

## # A tibble: 5 x 2
##   playlist_genre     n
##   <chr>          <int>
## 1 pop              102
## 2 rap               41
## 3 latin             29
## 4 r&b                5
## 5 edm                2
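
To express these counts as shares, the table above can be converted to percentages. This sketch copies the five counts from the output above into a small data frame; because of the `filter(n>1)` step, any genre with a single song is not included in the total.

```r
library(dplyr)

# Genre counts copied from the tibble printed above.
genre_counts <- data.frame(
  playlist_genre = c("pop", "rap", "latin", "r&b", "edm"),
  n = c(102, 41, 29, 5, 2)
)

# Share of each genre among the listed songs, in percent.
genre_share <- genre_counts %>%
  mutate(share = round(100 * n / sum(n), 1))
genre_share
```

The 102 pop songs make up more than half of the 179 songs in the listed genres.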

# Order the genres by count so the bars appear from most to least common
great_songs_2019_nodup$playlist_genre <- factor(
  great_songs_2019_nodup$playlist_genre,
  levels = c("pop", "rap", "latin", "r&b", "edm")
)
ggplot(data = great_songs_2019_nodup, aes(x = playlist_genre)) +
  geom_bar() +
  ggtitle("Genres of Great Songs in 2019") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_discrete(name = "Playlist Genre")

Summary

The analysis on the Spotify Songs dataset mainly answered the following three questions:

What are the key elements that make super popular songs great, compared with good songs in general?
Does the number of great songs released vary by month?
What are the top songs and artists in 2019?

Definitions:

Great Songs: the ones at the top 3% of all songs based on popularity.

Good Songs: the ones in the top 10%~20% range based on popularity.

By comparing the summary statistics of great songs and good songs, we found that five elements seem to be important: danceability, energy, acousticness, tempo and duration. Great songs tend to have higher danceability, lower energy, higher acousticness, slower tempo and shorter duration than good songs.

For current artists, adjusting your next songs based on the features of the great songs could improve their chances of becoming popular. For new artists, below are the recommended song feature ranges to help you get a foot in the door and improve the quality of your songs.

Danceability (0.639~0.795), Energy (0.529~0.744), Acousticness (0.045~0.331), Tempo (95.98~135.13), Duration in ms (180522~221872).

Since most of the “great songs” were released in 2019, we used the data from that year to study the relationship between release and month. The bar chart above shows that November, October and June are the top 3 months for releases, while January, February and April have the fewest. In general, releases are much lower in winter and early spring, much higher in fall, and fairly stable in summer. The weather might affect the music market and the productivity of musicians, with nice weather energizing both. Further research and analysis would be needed to confirm this.

We dived further into the 2019 “great songs” and ranked the songs by track popularity and the artists by their number of songs in the 2019 “great songs” list. The top 4 artists are Billie Eilish, Ariana Grande, DaBaby, and Post Malone. As for genre, more than half of the songs are pop (102 of the 179 songs in the genres listed).