The dataset spotify_songs contains key attributes of Spotify music, such as name, artist, album, genre, danceability, etc.
By analyzing the data, we can characterize the songs in this dataset and explore which elements make some songs more popular than others. Songs are separated into two groups by popularity, and exploratory data analysis (EDA) is used to identify the differences between the two groups. This could help current artists improve and help new artists follow current trends in popular music, increasing the chance that their new songs become popular.
Since we have the track_album_release_date information, we extract the “year” and “month” information into separate columns. This is to explore whether the number of song releases varies by month.
Popular songs and artists could also be identified by manipulating the data. Among the top Spotify songs, what are the most trendy songs and who are the most popular artists?
These packages are required for data manipulation and visualization.
library(dplyr) # manipulate data
library(ggplot2) # visualizations
library(magrittr) # pipe operator
library(DT) # create tables
library(knitr) # display tables
library(lubridate) # manipulate date and time
The data came from Spotify via the spotifyr package and was provided by tidytuesday. I downloaded the dataset on 4/3/2020.
Data Source: Spotify Songs
I imported the data into RStudio and checked the dimensions and the first few rows of the dataset.
songs <- read.csv("spotify_songs.csv",stringsAsFactors=FALSE)
dim(songs)
## [1] 32833 23
head(songs)
## track_id track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7 Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone You Loved - Future Humans Remix
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful People (feat. Khalid) - Jack Wins Remix
## track_artist track_popularity track_album_id
## 1 Ed Sheeran 66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2 Maroon 5 67 63rPSO264uRjW1X5E6cWv6
## 3 Zara Larsson 70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers 60 1nqYsOef1yKKuGOVchbsk6
## 5 Lewis Capaldi 69 7m7vv9wlQ4i0LFuJiE2zsQ
## 6 Ed Sheeran 67 2yiy9cd2QktrNvWC2EUi0k
## track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2 Memories (Dillon Francis Remix)
## 3 All the Time (Don Diablo Remix)
## 4 Call You Mine - The Remixes
## 5 Someone You Loved (Future Humans Remix)
## 6 Beautiful People (feat. Khalid) [Jack Wins Remix]
## track_album_release_date playlist_name playlist_id playlist_genre
## 1 2019-06-14 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 2 2019-12-13 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 3 2019-07-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 4 2019-07-19 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 5 2019-03-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 6 2019-07-11 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## playlist_subgenre danceability energy key loudness mode speechiness
## 1 dance pop 0.748 0.916 6 -2.634 1 0.0583
## 2 dance pop 0.726 0.815 11 -4.969 1 0.0373
## 3 dance pop 0.675 0.931 1 -3.432 0 0.0742
## 4 dance pop 0.718 0.930 7 -3.778 1 0.1020
## 5 dance pop 0.650 0.833 1 -4.672 1 0.0359
## 6 dance pop 0.675 0.919 8 -5.385 1 0.1270
## acousticness instrumentalness liveness valence tempo duration_ms
## 1 0.1020 0.00e+00 0.0653 0.518 122.036 194754
## 2 0.0724 4.21e-03 0.3570 0.693 99.972 162600
## 3 0.0794 2.33e-05 0.1100 0.613 124.008 176616
## 4 0.0287 9.43e-06 0.2040 0.277 121.956 169093
## 5 0.0803 0.00e+00 0.0833 0.725 123.976 189052
## 6 0.0799 0.00e+00 0.1430 0.585 124.982 163049
In the original dataset, there are 32833 rows and 23 variables, with 5 rows containing missing values in the columns “track_name”, “track_artist”, and “track_album_name”. Since these rows do not provide the important name and artist information, I deleted them. After cleaning, there are no missing values in this dataset.
colSums(is.na(songs))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
subset(songs,is.na(songs$track_artist))
## track_id track_name track_artist track_popularity
## 8152 69gRFGOWY9OMpFJgFol1u0 <NA> <NA> 0
## 9283 5cjecvX0CmC9gK0Laf5EMQ <NA> <NA> 0
## 9284 5TTzhRSWQS4Yu8xTgAuq6D <NA> <NA> 0
## 19569 3VKFip3OdAvv4OfNTgFWeQ <NA> <NA> 0
## 19812 69gRFGOWY9OMpFJgFol1u0 <NA> <NA> 0
## track_album_id track_album_name track_album_release_date
## 8152 717UG2du6utFe7CdmpuUe3 <NA> 2012-01-05
## 9283 3luHJEPw434tvNbme3SP8M <NA> 2017-12-01
## 9284 3luHJEPw434tvNbme3SP8M <NA> 2017-12-01
## 19569 717UG2du6utFe7CdmpuUe3 <NA> 2012-01-05
## 19812 717UG2du6utFe7CdmpuUe3 <NA> 2012-01-05
## playlist_name playlist_id playlist_genre
## 8152 HIP&HOP 5DyJsJZOpMJh34WvUrQzMV rap
## 9283 GANGSTA Rap 5GA8GDo7RQC3JEanT81B3g rap
## 9284 GANGSTA Rap 5GA8GDo7RQC3JEanT81B3g rap
## 19569 Reggaeton viejito🔥 0si5tw70PIgPkY1Eva6V8f latin
## 19812 latin hip hop 3nH8aytdqNeRbcRCg3dw9q latin
## playlist_subgenre danceability energy key loudness mode speechiness
## 8152 southern hip hop 0.714 0.821 6 -7.635 1 0.1760
## 9283 gangster rap 0.678 0.659 11 -5.364 0 0.3190
## 9284 gangster rap 0.465 0.820 10 -5.907 0 0.3070
## 19569 reggaeton 0.675 0.919 11 -6.075 0 0.0366
## 19812 latin hip hop 0.714 0.821 6 -7.635 1 0.1760
## acousticness instrumentalness liveness valence tempo duration_ms
## 8152 0.0410 0.00000 0.1160 0.649 95.999 282707
## 9283 0.0534 0.00000 0.5530 0.191 146.153 202235
## 9284 0.0963 0.00000 0.0888 0.505 86.839 206465
## 19569 0.0606 0.00653 0.1030 0.726 97.017 252773
## 19812 0.0410 0.00000 0.1160 0.649 95.999 282707
new_songs <- filter(songs, !is.na(track_artist))
colSums(is.na(new_songs))
dim(new_songs)
summary(new_songs)
After removing the rows with missing values, there are 32,828 rows.
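As a minimal sketch of the cleaning step above, the toy data frame below shows how counting and dropping rows with missing names works. The rows are made up and are not taken from spotify_songs; `toy`, `na_counts` and `toy_clean` are hypothetical names used only for illustration.

```r
# Toy data frame with made-up rows; only the column names mirror spotify_songs.
toy <- data.frame(
  track_name       = c("Song A", NA, "Song C"),
  track_artist     = c("Artist A", NA, "Artist C"),
  track_popularity = c(50, 0, 70),
  stringsAsFactors = FALSE
)

# Count missing values per column, in the same way as colSums(is.na(songs)).
na_counts <- colSums(is.na(toy))

# Keep only the rows where the artist is present (base R logical indexing).
toy_clean <- toy[!is.na(toy$track_artist), ]

print(na_counts)
print(nrow(toy_clean))
```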
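The boxplot of duration_ms shows that there are outliers, so songs whose duration lies more than 1.5 times the interquartile range beyond the quartiles are removed.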
boxplot(new_songs$duration_ms)
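# First and third quartiles of duration_ms
lowerq <- quantile(new_songs$duration_ms, probs = 0.25, na.rm = TRUE)
upperq <- quantile(new_songs$duration_ms, probs = 0.75, na.rm = TRUE)
iqr <- upperq - lowerq
# Mild outlier thresholds: 1.5 * IQR beyond the quartiles
mild.threshold.upper <- (iqr * 1.5) + upperq
mild.threshold.lower <- lowerq - (iqr * 1.5)
# Logical indexing stays correct even if no row is an outlier
is_outlier <- new_songs$duration_ms < mild.threshold.lower | new_songs$duration_ms > mild.threshold.upper
new_songs_no_outliers <- new_songs[!is_outlier, ]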
dim(new_songs_no_outliers)
## [1] 31441 23
There were 1387 songs considered outliers. There are now 31441 songs in the dataset, with a maximum duration of 352187 ms (around 5.87 minutes).
summary(new_songs_no_outliers$duration_ms)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 89250 187013 214115 219288 248000 352187
boxplot(new_songs_no_outliers$duration_ms)
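The 1.5 × IQR rule applied above can be wrapped in a small reusable function. The following is only a sketch under stated assumptions: `iqr_fences` is a hypothetical helper, and the durations are made-up values rather than rows from the dataset.

```r
# Hypothetical helper: fences at k * IQR beyond the first and third quartiles.
iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# Made-up durations in milliseconds; the last value is deliberately extreme.
durations <- c(180000, 200000, 210000, 220000, 240000, 900000)
fences <- iqr_fences(durations)

# Flag values outside the fences and keep the rest with logical indexing,
# which also behaves correctly when no value is flagged.
is_outlier <- durations < fences["lower"] | durations > fences["upper"]
kept <- durations[!is_outlier]

print(fences)
print(kept)
```

With the real data, `iqr_fences(new_songs$duration_ms)` would return the same two thresholds that are computed step by step above.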
The variables track_id, track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id, playlist_genre, and playlist_subgenre are character vectors of length 31441. Below are the summary statistics of the numeric variables.
summary(new_songs_no_outliers[,-c(1:3,5:11)])
## track_popularity danceability energy key
## Min. : 0.00 Min. :0.0771 Min. :0.000175 Min. : 0.000
## 1st Qu.: 25.00 1st Qu.:0.5650 1st Qu.:0.582000 1st Qu.: 2.000
## Median : 46.00 Median :0.6720 Median :0.722000 Median : 6.000
## Mean : 43.04 Mean :0.6563 Mean :0.699792 Mean : 5.359
## 3rd Qu.: 62.00 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000
## Max. :100.00 Max. :0.9810 Max. :1.000000 Max. :11.000
## loudness mode speechiness acousticness
## Min. :-46.448 Min. :0.0000 Min. :0.0224 Min. :0.0000014
## 1st Qu.: -8.069 1st Qu.:0.0000 1st Qu.:0.0411 1st Qu.:0.0158000
## Median : -6.083 Median :1.0000 Median :0.0631 Median :0.0817000
## Mean : -6.635 Mean :0.5651 Mean :0.1078 Mean :0.1753684
## 3rd Qu.: -4.612 3rd Qu.:1.0000 3rd Qu.:0.1330 3rd Qu.:0.2550000
## Max. : 1.275 Max. :1.0000 Max. :0.9180 Max. :0.9920000
## instrumentalness liveness valence tempo
## Min. :0.0000000 Min. :0.00936 Min. :0.00001 Min. : 37.11
## 1st Qu.:0.0000000 1st Qu.:0.09310 1st Qu.:0.33400 1st Qu.: 99.93
## Median :0.0000121 Median :0.12800 Median :0.51500 Median :121.91
## Mean :0.0755050 Mean :0.19032 Mean :0.51312 Mean :120.88
## 3rd Qu.:0.0032900 3rd Qu.:0.24900 3rd Qu.:0.69500 3rd Qu.:133.99
## Max. :0.9940000 Max. :0.99400 Max. :0.99100 Max. :239.44
## duration_ms
## Min. : 89250
## 1st Qu.:187013
## Median :214115
## Mean :219288
## 3rd Qu.:248000
## Max. :352187
New variables “year” and “month” (based on track_album_release_date) are added to help analyze the music patterns by time.
new_songs_no_outliers$year <- year(ymd(new_songs_no_outliers$track_album_release_date))
new_songs_no_outliers$month <- month(ymd(new_songs_no_outliers$track_album_release_date))
glimpse(new_songs_no_outliers)
## Observations: 31,441
## Variables: 25
## $ track_id <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCY...
## $ track_name <chr> "I Don't Care (with Justin Bieber) - Loud ...
## $ track_artist <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", ...
## $ track_popularity <int> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58...
## $ track_album_id <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X...
## $ track_album_name <chr> "I Don't Care (with Justin Bieber) [Loud L...
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", ...
## $ playlist_name <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Po...
## $ playlist_id <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD...
## $ playlist_genre <chr> "pop", "pop", "pop", "pop", "pop", "pop", ...
## $ playlist_subgenre <chr> "dance pop", "dance pop", "dance pop", "da...
## $ danceability <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, ...
## $ energy <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, ...
## $ key <int> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5,...
## $ loudness <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5...
## $ mode <int> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, ...
## $ speechiness <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0....
## $ acousticness <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.0803...
## $ instrumentalness <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0....
## $ liveness <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0....
## $ valence <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, ...
## $ tempo <dbl> 122.036, 99.972, 124.008, 121.956, 123.976...
## $ duration_ms <int> 194754, 162600, 176616, 169093, 189052, 16...
## $ year <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, ...
## $ month <dbl> 6, 12, 7, 7, 3, 7, 7, 8, 6, 6, 6, 8, 3, 5,...
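One caveat: `ymd()` returns NA for strings that are not complete year-month-day dates. If some track_album_release_date values hold only a year, or only a year and month (an assumption; the listing above shows complete dates only), the year would be lost for those rows. A base R sketch that still recovers the year in such cases is shown below; `extract_year` and `extract_month` are hypothetical helpers and the dates are made-up examples.

```r
# Hypothetical helpers: read year and month from "YYYY", "YYYY-MM" or
# "YYYY-MM-DD" strings without requiring a complete date.
extract_year <- function(x) {
  as.integer(substr(x, 1, 4))
}

extract_month <- function(x) {
  # The month is only available when the string is at least "YYYY-MM" long.
  ifelse(nchar(x) >= 7, as.integer(substr(x, 6, 7)), NA_integer_)
}

dates <- c("2019-06-14", "2012-01", "1978")  # made-up examples
years <- extract_year(dates)
months <- extract_month(dates)

print(years)
print(months)
```

The design choice here is to keep the year whenever it is present and to leave the month as NA when the string does not contain one, rather than guessing a month.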
Below is a preview of the cleaned data.
new_songs_no_outliers %>%
head(100) %>%
datatable()
Below is the data dictionary of variable names, data type, and variable descriptions.
new_songs.type <- lapply(new_songs_no_outliers, class)
new_songs.var_desc <- c("Song unique ID",
"Song Name",
"Song Artist",
"Song Popularity (0-100) where higher is better",
"Album unique ID",
"Song album name",
"Date when album released",
"Name of playlist",
"Playlist ID",
"Playlist genre",
"Playlist subgenre",
"Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.",
"Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.",
"The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.",
"The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.",
"Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.",
"Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.",
"A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.",
"Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.",
"Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.",
"A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).",
"The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.",
"Duration of song in milliseconds",
"Year when album released",
"Month when album released")
new_songs.var_names <- colnames(new_songs_no_outliers)
# Build the dictionary as a plain data frame (as_data_frame() is deprecated)
data.description <- data.frame(new_songs.var_names, unname(unlist(new_songs.type)), new_songs.var_desc, stringsAsFactors = FALSE)
colnames(data.description) <- c("Variable Names","Data Type","Variable Description")
kable(data.description)
| Variable Names | Data Type | Variable Description |
|---|---|---|
| track_id | character | Song unique ID |
| track_name | character | Song Name |
| track_artist | character | Song Artist |
| track_popularity | integer | Song Popularity (0-100) where higher is better |
| track_album_id | character | Album unique ID |
| track_album_name | character | Song album name |
| track_album_release_date | character | Date when album released |
| playlist_name | character | Name of playlist |
| playlist_id | character | Playlist ID |
| playlist_genre | character | Playlist genre |
| playlist_subgenre | character | Playlist subgenre |
| danceability | numeric | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | numeric | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | integer | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | numeric | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db. |
| mode | integer | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | numeric | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | numeric | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | numeric | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | numeric | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | numeric | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | numeric | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | integer | Duration of song in milliseconds |
| year | numeric | Year when album released |
| month | numeric | Month when album released |
Good songs are popular and people like them already. But how can they be improved to get into the top tiers? What are the key differences between the most popular songs and the good ones?
In this analysis, songs are categorized as great or good based on their popularity rank. The top 3% are considered great songs, and those within the top 10% to 20% range are defined as good songs. Below are their summary statistics.
great_songs <- filter(new_songs_no_outliers, track_popularity>=quantile(new_songs_no_outliers$track_popularity, probs = .97))
summary(great_songs)
## track_id track_name track_artist track_popularity
## Length:1045 Length:1045 Length:1045 Min. : 83.00
## Class :character Class :character Class :character 1st Qu.: 84.00
## Mode :character Mode :character Mode :character Median : 87.00
## Mean : 87.65
## 3rd Qu.: 90.00
## Max. :100.00
## track_album_id track_album_name track_album_release_date
## Length:1045 Length:1045 Length:1045
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## playlist_name playlist_id playlist_genre playlist_subgenre
## Length:1045 Length:1045 Length:1045 Length:1045
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## danceability energy key loudness
## Min. :0.3100 Min. :0.1250 Min. : 0.00 Min. :-18.717
## 1st Qu.:0.6390 1st Qu.:0.5290 1st Qu.: 2.00 1st Qu.: -7.026
## Median :0.7290 Median :0.6450 Median : 6.00 Median : -5.678
## Mean :0.7062 Mean :0.6309 Mean : 5.69 Mean : -5.938
## 3rd Qu.:0.7950 3rd Qu.:0.7440 3rd Qu.: 9.00 3rd Qu.: -4.219
## Max. :0.9320 Max. :0.9720 Max. :11.00 Max. : -1.940
## mode speechiness acousticness instrumentalness
## Min. :0.000 Min. :0.0232 Min. :0.000248 Min. :0.0000000
## 1st Qu.:0.000 1st Qu.:0.0456 1st Qu.:0.045100 1st Qu.:0.0000000
## Median :1.000 Median :0.0735 Median :0.141000 Median :0.0000000
## Mean :0.534 Mean :0.1223 Mean :0.222963 Mean :0.0113265
## 3rd Qu.:1.000 3rd Qu.:0.1560 3rd Qu.:0.331000 3rd Qu.:0.0000219
## Max. :1.000 Max. :0.5030 Max. :0.952000 Max. :0.6570000
## liveness valence tempo duration_ms
## Min. :0.0197 Min. :0.0528 Min. : 72.54 Min. :104591
## 1st Qu.:0.0912 1st Qu.:0.3500 1st Qu.: 95.98 1st Qu.:180522
## Median :0.1130 Median :0.5290 Median :115.00 Median :201040
## Mean :0.1640 Mean :0.5140 Mean :119.61 Mean :204875
## 3rd Qu.:0.1810 3rd Qu.:0.6680 3rd Qu.:135.13 3rd Qu.:222347
## Max. :0.9620 Max. :0.9650 Max. :205.27 Max. :342040
## year month
## Min. :1978 Min. : 1.000
## 1st Qu.:2018 1st Qu.: 5.000
## Median :2019 Median : 8.000
## Mean :2018 Mean : 7.569
## 3rd Qu.:2019 3rd Qu.:10.000
## Max. :2020 Max. :12.000
good_songs <- filter(new_songs_no_outliers, track_popularity>=quantile(new_songs_no_outliers$track_popularity, probs = .8), track_popularity<quantile(new_songs_no_outliers$track_popularity, probs = .9))
summary(good_songs)
## track_id track_name track_artist track_popularity
## Length:3209 Length:3209 Length:3209 Min. :66.00
## Class :character Class :character Class :character 1st Qu.:67.00
## Mode :character Mode :character Mode :character Median :69.00
## Mean :69.31
## 3rd Qu.:71.00
## Max. :73.00
##
## track_album_id track_album_name track_album_release_date
## Length:3209 Length:3209 Length:3209
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## playlist_name playlist_id playlist_genre playlist_subgenre
## Length:3209 Length:3209 Length:3209 Length:3209
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## danceability energy key loudness
## Min. :0.1570 Min. :0.0167 Min. : 0.000 Min. :-28.309
## 1st Qu.:0.5650 1st Qu.:0.5780 1st Qu.: 2.000 1st Qu.: -7.508
## Median :0.6750 Median :0.7110 Median : 6.000 Median : -5.830
## Mean :0.6625 Mean :0.6905 Mean : 5.337 Mean : -6.299
## 3rd Qu.:0.7700 3rd Qu.:0.8200 3rd Qu.: 9.000 3rd Qu.: -4.469
## Max. :0.9790 Max. :0.9930 Max. :11.000 Max. : -0.739
##
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0233 Min. :0.0000129 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0401 1st Qu.:0.0244000 1st Qu.:0.0000000
## Median :1.0000 Median :0.0612 Median :0.0985000 Median :0.0000018
## Mean :0.5964 Mean :0.1020 Mean :0.1803308 Mean :0.0217440
## 3rd Qu.:1.0000 3rd Qu.:0.1240 3rd Qu.:0.2640000 3rd Qu.:0.0002100
## Max. :1.0000 Max. :0.6090 Max. :0.9830000 Max. :0.9340000
##
## liveness valence tempo duration_ms
## Min. :0.0212 Min. :0.0371 Min. : 61.66 Min. : 92093
## 1st Qu.:0.0912 1st Qu.:0.3640 1st Qu.: 99.91 1st Qu.:189296
## Median :0.1230 Median :0.5350 Median :121.01 Median :213338
## Mean :0.1823 Mean :0.5369 Mean :121.28 Mean :217770
## 3rd Qu.:0.2280 3rd Qu.:0.7130 3rd Qu.:136.02 3rd Qu.:240907
## Max. :0.9830 Max. :0.9850 Max. :210.16 Max. :352160
##
## year month
## Min. :1958 Min. : 1.000
## 1st Qu.:2008 1st Qu.: 3.000
## Median :2017 Median : 7.000
## Mean :2011 Mean : 6.502
## 3rd Qu.:2019 3rd Qu.:10.000
## Max. :2020 Max. :12.000
## NA's :176 NA's :176
From the summaries, we can see there are 1045 songs in the top 3% and 3209 songs in the top 10%~20%. Among the factors, “year” stands out as a clear difference between the two groups. The most popular songs (top 3%) are much newer than the good songs. 75% of the most popular songs were released in 2018 or later, and the median (50%) year is 2019. For the good ones, 25% of the songs were released in or before 2008 and the median year is 2017. Note that the year could not be parsed for 176 of the good songs (shown as NA's in the summary), so these are excluded from the year statistics.
The numbers suggest that most songs have a popularity “lifespan”: the older a song gets, the less popular it tends to be. From this perspective, new songs by people with less experience in the music industry could gain more popularity than older songs by people who have been in the field for a long while. Popularity does not appear to be limited by the artist's years of experience, although this dataset does not measure experience directly. It also implies that you may never have heard of a singer even though he or she created many songs in the past. This would be a good sign for new singers with limited or no experience who can create songs that satisfy the general public's needs.
great_songs$year <- (year(ymd(great_songs$track_album_release_date)))
ggplot(data = as_tibble(great_songs), aes(y = year)) +
geom_boxplot() +
ggtitle("Year Distribution of Great Songs") +
theme(plot.title = element_text(hjust = 0.5))
good_songs$year <- (year(ymd(good_songs$track_album_release_date)))
good_songs_1 <- filter(good_songs, !is.na(year))
ggplot(data = as_tibble(good_songs_1), aes(y = year)) +
geom_boxplot() +
ggtitle("Year Distribution of Good Songs") +
theme(plot.title = element_text(hjust = 0.5))
Another noticeable factor is danceability. Great songs tend to have higher danceability overall than good songs. The median danceability of great songs is 0.729, compared with 0.675 for good songs. The minimum, first quartile, and third quartile show the same pattern. However, the maximum danceability of great songs (0.932) is lower than that of good songs (0.979). This might suggest that extremely high or low danceability makes a song less popular. Below is the density plot of danceability for both great (in red) and good songs (in blue); the dashed lines mark the medians.
plot(density(great_songs$danceability), col="red", main = "Danceability of Great and Good Songs", xlab = "Danceability")
lines(density(good_songs$danceability), col="blue")
abline(v=median(great_songs$danceability), col="red", lty=2, lwd=1.5)
abline(v=median(good_songs$danceability), col="blue", lty=2, lwd=1.5)
Besides a newer release year and higher danceability, great songs have other features such as lower energy, higher acousticness, slower tempo and shorter duration_ms, compared with good songs. Newcomers to the music industry could refer to the characteristics of the great songs and adjust their songs before publishing. Below are the recommended ranges (25%~75%, i.e. the first to third quartile of the great songs) of song features for beginners.
recommend <- matrix(c(0.795,0.744,0.331,135.13,222347,0.639,0.529,0.045,95.98,180522),ncol=5,byrow=TRUE)
colnames(recommend) <- c("Danceability","Energy","Acousticness","Tempo","Duration(in ms)")
rownames(recommend) <- c("Upper Bound", "Lower Bound")
recommend <- as.table(recommend)
recommend
## Danceability Energy Acousticness Tempo Duration(in ms)
## Upper Bound 0.795 0.744 0.331 135.130 222347.000
## Lower Bound 0.639 0.529 0.045 95.980 180522.000
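The bounds in the table above were typed in by hand from the summary output. As a hedged alternative, they could be computed directly with `quantile()`. The sketch below uses a made-up stand-in for great_songs with only two features; `feature_ranges` and `toy_great` are hypothetical names.

```r
# Hypothetical helper: 75% and 25% quantiles of the selected columns,
# returned as a matrix with one column per feature.
feature_ranges <- function(df, features) {
  bounds <- sapply(df[features], quantile, probs = c(0.75, 0.25), names = FALSE)
  rownames(bounds) <- c("Upper Bound", "Lower Bound")
  bounds
}

# Made-up stand-in for great_songs, limited to two of the five features.
toy_great <- data.frame(
  danceability = c(0.60, 0.65, 0.70, 0.75, 0.80),
  tempo        = c(90, 100, 110, 120, 130)
)

ranges <- feature_ranges(toy_great, c("danceability", "tempo"))
print(ranges)
```

With the real data, `feature_ranges(great_songs, c("danceability", "energy", "acousticness", "tempo", "duration_ms"))` would avoid transcription errors when the data or the popularity cut-off changes.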
Is the release of popular songs affected by month? We use the top 3% of songs to study the trend.
From the summary statistics, the median release year of the top 3% of songs is 2019, so a large share of them were released in that year. Therefore, we separate the 2019 data from the others and use it to study the monthly pattern. From the bar chart below, we can tell that November, October and June are the top 3 months for releases. January, February and April have the fewest releases.
Supposing all the songs in 2019 were released in the northern hemisphere, warm and pleasant weather might be a factor that influences the number of releases and the market. Imagine that nights are long and days are gloomy and rainy: people may feel more upset and be less likely to enjoy music. Affected by the unappealing weather, musicians might become less inspired and passionate, which could lead to fewer releases. Nature and the world itself are where inspiration originates. Music represents the passion of human beings. Without nature, we could do nothing.
However, further research would be needed on how weather and environment affect the music market and musician productivity.
dim(great_songs)
## [1] 1045 25
# year is numeric, so compare against the number 2019
great_songs_2019 <- filter(great_songs, year == 2019)
# Bar chart of release counts per month (geom_bar counts rows by default)
ggplot(data=great_songs_2019, aes(x=month)) +
geom_bar()+
scale_x_continuous(breaks=1:12)+
ggtitle("Release of Great Songs in 2019") +
theme(plot.title = element_text(hjust = 0.5))
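The month ranking described above can also be read from a frequency table instead of the chart. The sketch below uses made-up month values standing in for great_songs_2019$month; `months`, `month_counts` and `ranked` are hypothetical names.

```r
# Made-up release months standing in for great_songs_2019$month.
months <- c(11, 11, 10, 6, 6, 11, 1, 10)

# Fixing the levels to 1:12 keeps months with zero releases in the table.
month_counts <- table(factor(months, levels = 1:12, labels = month.abb))

# Rank the months from most to fewest releases.
ranked <- sort(month_counts, decreasing = TRUE)
print(ranked)
```

Keeping all twelve levels matters when comparing months, because a month without any release would otherwise disappear from the table instead of showing a count of zero.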
What are the most popular songs in 2019? After removing rows with a duplicate track_name, the most popular songs in 2019 are listed below.
great_songs_2019_nodup<- great_songs_2019 %>% distinct(track_name, .keep_all = TRUE)
great_songs_2019_nodup %>%
select(track_name,track_artist,track_popularity,playlist_genre) %>%
arrange(desc(track_popularity)) %>%
datatable()
Who are the most popular artists in 2019? Billie Eilish, Ariana Grande, DaBaby, and Post Malone are the top 4. The list below includes the name of the artist and the number of songs that are in the great_songs_2019 list (ranked by the number of songs, from high to low).
great_songs_2019_nodup %>%
group_by(track_artist) %>%
tally() %>%
filter(n>1) %>%
arrange(desc(n)) %>%
datatable()
In the great_songs_2019 list, which genres do the songs belong to? After counting the songs in each genre, we can see that most of the songs are in the “pop” genre, some are in “rap” and “latin”, and very few are in “r&b” or “edm.”
great_songs_2019_nodup %>%
group_by(playlist_genre) %>%
tally() %>%
filter(n>1) %>%
arrange(desc(n))
## # A tibble: 5 x 2
## playlist_genre n
## <chr> <int>
## 1 pop 102
## 2 rap 41
## 3 latin 29
## 4 r&b 5
## 5 edm 2
great_songs_2019_nodup$playlist_genre <- factor(great_songs_2019_nodup$playlist_genre,levels = c("pop", "rap", "latin", "r&b", "edm"))
ggplot(data=great_songs_2019_nodup, aes(x = playlist_genre)) +
geom_bar()+
ggtitle("Genres of Great Songs in 2019") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_x_discrete(name = "Playlist Genre")
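To put numbers on the genre split, the counts from the tally above can be turned into percentages. This is a small worked example; the counts are copied from the printed tibble, which only lists genres with more than one song, so the shares are relative to those listed genres. `genre_counts` and `genre_share` are hypothetical names.

```r
# Genre counts copied from the tally printed above.
genre_counts <- c(pop = 102, rap = 41, latin = 29, `r&b` = 5, edm = 2)

# Share of each genre among the listed songs, in percent.
genre_share <- round(100 * genre_counts / sum(genre_counts), 1)
print(genre_share)
```

Pop accounts for 102 of the 179 listed songs, which is roughly 57%.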
The analysis on the Spotify Songs dataset mainly answered the following three questions:
What are the key elements that make super popular songs great, compared with good songs in general?
Is the release amount of great songs affected by the factor month?
What are the top songs and artists in 2019?
Definitions:
Great Songs: the ones at the top 3% of all songs based on popularity.
Good Songs: the ones in the top 10%~20% range based on popularity.
By comparing the summary statistics of great songs and good songs, we found that five elements seem to be important: danceability, energy, acousticness, tempo and duration. Great songs tend to have higher danceability, lower energy, higher acousticness, slower tempo and shorter duration than good songs.
For current artists, adjusting your next songs toward the features of the great songs could improve their chances of becoming popular. For new artists, below are the recommended song feature ranges to help you get a foot in the door and improve the quality of your songs.
Danceability (0.639~0.795), Energy (0.529~0.744), Acousticness (0.045~0.331), Tempo (95.98~135.13), Duration in ms (180522~222347).
Since a large share of the “great songs” were released in 2019, we used the data from this specific year to study the relationship between releases and month. The bar chart above shows that November, October and June are the top 3 months for releases, while January, February and April have the fewest. In general, releases are much lower in winter and early spring, much higher in fall, and fairly stable in summer. The weather might affect the music market and the productivity of musicians - nice weather may energize the music market and production. This would need further thorough research and analysis to confirm.
We dug further into the 2019 “great songs” and ranked the songs by track popularity and the artists by the number of songs in the 2019 “great songs” list. The top 4 artists are Billie Eilish, Ariana Grande, DaBaby, and Post Malone. When it comes to genre, most of the songs (more than 50%) are pop music.