Mid-term Project

Jiaoyao Liu

4/3/2020

Introduction

The dataset spotify_songs contains key attributes of Spotify music, such as name, artist, album, genre, danceability, etc.

By analyzing the data, we can understand the features of Spotify music and identify the elements that make some songs more popular than others. Songs can be separated into two groups by popularity, and EDA can then be conducted to identify the differences between the two groups. This would help new artists follow current trends in popular music and increase the chance that their new songs become popular.

Since we have the track_album_release_date information, we extract the month and store it in a separate column. This lets us explore whether the features of songs change by month.

Popular artists could also be identified by manipulating the data. Among the top Spotify songs, who are the most popular artists?

We could also cluster the songs based on their characteristics using cluster analysis. By grouping songs with similar features, we could better recommend to Spotify customers new songs that are similar to the ones in their preferred playlists.
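As a rough illustration of the clustering idea, the sketch below runs k-means on a small made-up table of audio features. The data frame `toy_features`, its values, and the choice of two clusters are assumptions for illustration only; on the real data the number of clusters would need to be chosen and justified.

```r
# Toy stand-in for the audio features (hypothetical values, not from the dataset)
toy_features <- data.frame(
  danceability = c(0.20, 0.25, 0.30, 0.80, 0.85, 0.90),
  energy       = c(0.30, 0.20, 0.25, 0.90, 0.80, 0.85),
  tempo        = c(80, 85, 90, 125, 130, 128)
)

# Standardize the columns so that tempo (in BPM) does not dominate the 0-1 features
scaled_features <- scale(toy_features)

# k-means with an assumed k = 2; nstart repeats the random initialization
set.seed(42)
fit <- kmeans(scaled_features, centers = 2, nstart = 10)

# Attach the cluster label to each song
toy_features$cluster <- fit$cluster
print(toy_features)
```

Songs that share a cluster label have similar standardized features, which is the property a recommendation step would rely on.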

Linear regression or logistic regression methods could be used to predict the popularity of a song before it is released. This could be helpful to determine whether it is worthwhile to invest a lot of money in music videos or other marketing channels to promote the song.

Packages Required

These packages are required for data manipulation and visualization.

library(dplyr) # manipulate data
library(ggplot2) # visualizations
library(magrittr) # Pipe operator
library(DT) # create tables
library(knitr) # display tables

Data Preparation

The data came from Spotify via the spotifyr package and was provided by tidytuesday. I downloaded the dataset on 4/3/2020.

Data Source: Spotify Songs

I imported the data into RStudio and checked the dimensions and the first few rows of the dataset.

songs <- read.csv("spotify_songs.csv",stringsAsFactors=FALSE)
dim(songs)
## [1] 32833    23
head(songs)
##                 track_id                                            track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31                       Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l                       All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7                     Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x               Someone You Loved - Future Humans Remix
## 6 7fvUMiyapMsRRxr07cU8Ef     Beautiful People (feat. Khalid) - Jack Wins Remix
##       track_artist track_popularity         track_album_id
## 1       Ed Sheeran               66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2         Maroon 5               67 63rPSO264uRjW1X5E6cWv6
## 3     Zara Larsson               70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers               60 1nqYsOef1yKKuGOVchbsk6
## 5    Lewis Capaldi               69 7m7vv9wlQ4i0LFuJiE2zsQ
## 6       Ed Sheeran               67 2yiy9cd2QktrNvWC2EUi0k
##                                        track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2                       Memories (Dillon Francis Remix)
## 3                       All the Time (Don Diablo Remix)
## 4                           Call You Mine - The Remixes
## 5               Someone You Loved (Future Humans Remix)
## 6     Beautiful People (feat. Khalid) [Jack Wins Remix]
##   track_album_release_date playlist_name            playlist_id playlist_genre
## 1               2019-06-14     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 2               2019-12-13     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 3               2019-07-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 4               2019-07-19     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 5               2019-03-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 6               2019-07-11     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
##   playlist_subgenre danceability energy key loudness mode speechiness
## 1         dance pop        0.748  0.916   6   -2.634    1      0.0583
## 2         dance pop        0.726  0.815  11   -4.969    1      0.0373
## 3         dance pop        0.675  0.931   1   -3.432    0      0.0742
## 4         dance pop        0.718  0.930   7   -3.778    1      0.1020
## 5         dance pop        0.650  0.833   1   -4.672    1      0.0359
## 6         dance pop        0.675  0.919   8   -5.385    1      0.1270
##   acousticness instrumentalness liveness valence   tempo duration_ms
## 1       0.1020         0.00e+00   0.0653   0.518 122.036      194754
## 2       0.0724         4.21e-03   0.3570   0.693  99.972      162600
## 3       0.0794         2.33e-05   0.1100   0.613 124.008      176616
## 4       0.0287         9.43e-06   0.2040   0.277 121.956      169093
## 5       0.0803         0.00e+00   0.0833   0.725 123.976      189052
## 6       0.0799         0.00e+00   0.1430   0.585 124.982      163049

In the original dataset, there are 32833 rows and 23 variables, with 5 rows containing missing values in the columns “track_name”, “track_artist”, and “track_album_name”. Since these rows cannot provide the important name and artist information, I deleted them. After cleaning, there are no missing values in the dataset.

colSums(is.na(songs))
##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
subset(songs,is.na(songs$track_artist))
##                     track_id track_name track_artist track_popularity
## 8152  69gRFGOWY9OMpFJgFol1u0       <NA>         <NA>                0
## 9283  5cjecvX0CmC9gK0Laf5EMQ       <NA>         <NA>                0
## 9284  5TTzhRSWQS4Yu8xTgAuq6D       <NA>         <NA>                0
## 19569 3VKFip3OdAvv4OfNTgFWeQ       <NA>         <NA>                0
## 19812 69gRFGOWY9OMpFJgFol1u0       <NA>         <NA>                0
##               track_album_id track_album_name track_album_release_date
## 8152  717UG2du6utFe7CdmpuUe3             <NA>               2012-01-05
## 9283  3luHJEPw434tvNbme3SP8M             <NA>               2017-12-01
## 9284  3luHJEPw434tvNbme3SP8M             <NA>               2017-12-01
## 19569 717UG2du6utFe7CdmpuUe3             <NA>               2012-01-05
## 19812 717UG2du6utFe7CdmpuUe3             <NA>               2012-01-05
##               playlist_name            playlist_id playlist_genre
## 8152                HIP&HOP 5DyJsJZOpMJh34WvUrQzMV            rap
## 9283            GANGSTA Rap 5GA8GDo7RQC3JEanT81B3g            rap
## 9284            GANGSTA Rap 5GA8GDo7RQC3JEanT81B3g            rap
## 19569 Reggaeton viejito🔥 0si5tw70PIgPkY1Eva6V8f          latin
## 19812         latin hip hop 3nH8aytdqNeRbcRCg3dw9q          latin
##       playlist_subgenre danceability energy key loudness mode speechiness
## 8152   southern hip hop        0.714  0.821   6   -7.635    1      0.1760
## 9283       gangster rap        0.678  0.659  11   -5.364    0      0.3190
## 9284       gangster rap        0.465  0.820  10   -5.907    0      0.3070
## 19569         reggaeton        0.675  0.919  11   -6.075    0      0.0366
## 19812     latin hip hop        0.714  0.821   6   -7.635    1      0.1760
##       acousticness instrumentalness liveness valence   tempo duration_ms
## 8152        0.0410          0.00000   0.1160   0.649  95.999      282707
## 9283        0.0534          0.00000   0.5530   0.191 146.153      202235
## 9284        0.0963          0.00000   0.0888   0.505  86.839      206465
## 19569       0.0606          0.00653   0.1030   0.726  97.017      252773
## 19812       0.0410          0.00000   0.1160   0.649  95.999      282707
new_songs <- filter(songs, !is.na(track_artist))
colSums(is.na(new_songs))
dim(new_songs)
summary(new_songs)

After removing the rows with missing values, there are 32,828 rows.

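A boxplot of duration_ms shows that there are outliers. Songs whose duration lies more than 1.5 times the interquartile range below the first quartile or above the third quartile are removed.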

boxplot(new_songs$duration_ms)

lowerq <-  quantile(new_songs$duration_ms,na.rm = TRUE)[2]
upperq <-  quantile(new_songs$duration_ms,na.rm = TRUE)[4]
iqr <-  upperq - lowerq
mild.threshold.upper <-  (iqr * 1.5) + upperq
mild.threshold.lower <-  lowerq - (iqr * 1.5)

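# Keep rows inside the thresholds; a logical index (unlike -which()) is also safe when no row is an outlier
is_outlier <- new_songs$duration_ms < mild.threshold.lower | new_songs$duration_ms > mild.threshold.upper
new_songs_no_outliers <- new_songs[!is_outlier, ]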
dim(new_songs_no_outliers)
## [1] 31441    23

There were 1387 songs that were considered outliers. There are now 31441 songs in the dataset, with a maximum duration of 352187 ms (around 5.87 minutes).

summary(new_songs_no_outliers$duration_ms)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   89250  187013  214115  219288  248000  352187
boxplot(new_songs_no_outliers$duration_ms)

The variables track_id, track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id, playlist_genre, and playlist_subgenre are character variables. Below is the statistical summary of the numeric variables.

summary(new_songs_no_outliers[,-c(1:3,5:11)])
##  track_popularity  danceability        energy              key        
##  Min.   :  0.00   Min.   :0.0771   Min.   :0.000175   Min.   : 0.000  
##  1st Qu.: 25.00   1st Qu.:0.5650   1st Qu.:0.582000   1st Qu.: 2.000  
##  Median : 46.00   Median :0.6720   Median :0.722000   Median : 6.000  
##  Mean   : 43.04   Mean   :0.6563   Mean   :0.699792   Mean   : 5.359  
##  3rd Qu.: 62.00   3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000  
##  Max.   :100.00   Max.   :0.9810   Max.   :1.000000   Max.   :11.000  
##     loudness            mode         speechiness      acousticness      
##  Min.   :-46.448   Min.   :0.0000   Min.   :0.0224   Min.   :0.0000014  
##  1st Qu.: -8.069   1st Qu.:0.0000   1st Qu.:0.0411   1st Qu.:0.0158000  
##  Median : -6.083   Median :1.0000   Median :0.0631   Median :0.0817000  
##  Mean   : -6.635   Mean   :0.5651   Mean   :0.1078   Mean   :0.1753684  
##  3rd Qu.: -4.612   3rd Qu.:1.0000   3rd Qu.:0.1330   3rd Qu.:0.2550000  
##  Max.   :  1.275   Max.   :1.0000   Max.   :0.9180   Max.   :0.9920000  
##  instrumentalness       liveness          valence            tempo       
##  Min.   :0.0000000   Min.   :0.00936   Min.   :0.00001   Min.   : 37.11  
##  1st Qu.:0.0000000   1st Qu.:0.09310   1st Qu.:0.33400   1st Qu.: 99.93  
##  Median :0.0000121   Median :0.12800   Median :0.51500   Median :121.91  
##  Mean   :0.0755050   Mean   :0.19032   Mean   :0.51312   Mean   :120.88  
##  3rd Qu.:0.0032900   3rd Qu.:0.24900   3rd Qu.:0.69500   3rd Qu.:133.99  
##  Max.   :0.9940000   Max.   :0.99400   Max.   :0.99100   Max.   :239.44  
##   duration_ms    
##  Min.   : 89250  
##  1st Qu.:187013  
##  Median :214115  
##  Mean   :219288  
##  3rd Qu.:248000  
##  Max.   :352187

A new variable “month” (based on track_album_release_date) is added to help analyze music patterns by month.

new_songs_no_outliers$month <- months(as.Date(new_songs_no_outliers$track_album_release_date))
glimpse(new_songs_no_outliers)
## Observations: 31,441
## Variables: 24
## $ track_id                 <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCY...
## $ track_name               <chr> "I Don't Care (with Justin Bieber) - Loud ...
## $ track_artist             <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", ...
## $ track_popularity         <int> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58...
## $ track_album_id           <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X...
## $ track_album_name         <chr> "I Don't Care (with Justin Bieber) [Loud L...
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", ...
## $ playlist_name            <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Po...
## $ playlist_id              <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD...
## $ playlist_genre           <chr> "pop", "pop", "pop", "pop", "pop", "pop", ...
## $ playlist_subgenre        <chr> "dance pop", "dance pop", "dance pop", "da...
## $ danceability             <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, ...
## $ energy                   <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, ...
## $ key                      <int> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5,...
## $ loudness                 <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5...
## $ mode                     <int> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, ...
## $ speechiness              <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0....
## $ acousticness             <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.0803...
## $ instrumentalness         <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0....
## $ liveness                 <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0....
## $ valence                  <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, ...
## $ tempo                    <dbl> 122.036, 99.972, 124.008, 121.956, 123.976...
## $ duration_ms              <int> 194754, 162600, 176616, 169093, 189052, 16...
## $ month                    <chr> "June", "December", "July", "July", "March...

Below is a preview of the cleaned data.

new_songs_no_outliers %>% 
  head(100) %>% 
  datatable()

Below is the data dictionary of variable names, data type, and variable descriptions.

new_songs.type <- lapply(new_songs_no_outliers, class)
new_songs.var_desc <- c("Song unique ID",
                        "Song Name",
                        "Song Artist",
                        "Song Popularity (0-100) where higher is better",
                        "Album unique ID",
                        "Song album name",
                        "Date when album released",
                        "Name of playlist",
                        "Playlist ID",
                        "Playlist genre",
                        "Playlist subgenre",
                        "Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.",
                        "Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.",
                        "The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.",
                        "The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.",
                        "Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.",
                        "Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.",
                        "A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.",
                        "Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.",
                        "Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.",
                        "A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).",
                        "The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.",
                        "Duration of song in milliseconds",
    
                        "Month when album released")
new_songs.var_names <- colnames(new_songs_no_outliers)
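# as_data_frame() is deprecated; build the table with base data.frame() instead
data.description <- data.frame(new_songs.var_names, unlist(new_songs.type), new_songs.var_desc, row.names = NULL, stringsAsFactors = FALSE)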
colnames(data.description) <- c("Variable Names","Data Type","Variable Description")
kable(data.description)
| Variable Names | Data Type | Variable Description |
|---|---|---|
| track_id | character | Song unique ID |
| track_name | character | Song Name |
| track_artist | character | Song Artist |
| track_popularity | integer | Song Popularity (0-100) where higher is better |
| track_album_id | character | Album unique ID |
| track_album_name | character | Song album name |
| track_album_release_date | character | Date when album released |
| playlist_name | character | Name of playlist |
| playlist_id | character | Playlist ID |
| playlist_genre | character | Playlist genre |
| playlist_subgenre | character | Playlist subgenre |
| danceability | numeric | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | numeric | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | integer | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | numeric | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db. |
| mode | integer | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | numeric | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | numeric | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | numeric | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | numeric | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | numeric | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | numeric | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | integer | Duration of song in milliseconds |
| month | character | Month when album released |

Proposed Exploratory Data Analysis

The dataset might be separated into two groups based on the variable track_popularity. The summary statistics of the two groups will then be compared.
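One way this split could look is sketched below with base R. The data frame `toy_songs`, its values, and the median cutoff are assumptions for illustration; the actual analysis would use new_songs_no_outliers and a cutoff chosen from the popularity distribution.

```r
# Toy stand-in for new_songs_no_outliers (hypothetical values, not from the dataset)
toy_songs <- data.frame(
  track_popularity = c(10, 35, 50, 62, 80, 95),
  danceability     = c(0.40, 0.55, 0.60, 0.70, 0.75, 0.80),
  energy           = c(0.90, 0.65, 0.70, 0.60, 0.80, 0.50)
)

# Assumed rule: songs above the median popularity form the "high" group
cutoff <- median(toy_songs$track_popularity)
toy_songs$popularity_group <- ifelse(toy_songs$track_popularity > cutoff, "high", "low")

# Mean of each audio feature within each popularity group
group_means <- aggregate(cbind(danceability, energy) ~ popularity_group,
                         data = toy_songs, FUN = mean)
print(group_means)
```

Replacing `FUN = mean` with `median` or another summary function gives the other group-level statistics in the same layout.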

Songs will also be categorized by month, and the numeric variables will be analyzed to see if there is any pattern. For example, is a certain variable very high in one month but low in another? Does the month or season affect the features of songs?
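A minimal sketch of the monthly comparison follows, again on made-up data. The data frame `toy_songs` and its values are hypothetical. The month is derived from the month number rather than from `months()` so that the labels do not depend on the system locale, and it is stored as an ordered set of factor levels so that results appear in calendar order.

```r
# Toy stand-in for new_songs_no_outliers (hypothetical values, not from the dataset)
toy_songs <- data.frame(
  track_album_release_date = c("2019-06-14", "2019-06-28", "2019-12-13", "2018-12-01"),
  valence = c(0.50, 0.70, 0.30, 0.40),
  tempo   = c(120, 124, 100, 90),
  stringsAsFactors = FALSE
)

# Month number (1-12) from the release date, mapped to English month names
month_num <- as.integer(format(as.Date(toy_songs$track_album_release_date), "%m"))
toy_songs$month <- factor(month.name[month_num], levels = month.name)

# Mean of each numeric feature per release month
monthly_means <- aggregate(cbind(valence, tempo) ~ month, data = toy_songs, FUN = mean)
print(monthly_means)
```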

Among the top 100 Spotify songs, which artists contribute the most? By counting distinct artist names, we can identify these artists and their albums as well.
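The counting step could be sketched as follows. The data frame `toy_songs` and the cutoff `top_n` are assumptions for illustration (the text above uses 100). Because the same track_id can appear in more than one playlist, the sketch keeps one row per track before ranking.

```r
# Toy stand-in for new_songs_no_outliers (hypothetical values, not from the dataset)
toy_songs <- data.frame(
  track_id         = c("t1", "t2", "t3", "t4", "t5", "t6", "t1"),
  track_artist     = c("A", "B", "A", "C", "A", "B", "A"),
  track_popularity = c(90, 85, 80, 70, 60, 20, 90),
  stringsAsFactors = FALSE
)

# Keep one row per track, since a track can be listed in several playlists
unique_songs <- toy_songs[!duplicated(toy_songs$track_id), ]

# Take the top_n most popular songs (5 here; 100 for the real data)
top_n <- 5
top_songs <- head(unique_songs[order(-unique_songs$track_popularity), ], top_n)

# Count how many of the top songs each artist contributes
artist_counts <- sort(table(top_songs$track_artist), decreasing = TRUE)
print(artist_counts)
```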

Histograms, ROC curves, and similar plots would be helpful to display the findings.

I’m not sure how to group values into bins and look at the frequency in each bin, or how to separate a character variable (track_album_release_date) into year, month, and day.
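Both questions can probably be handled with base R. The sketch below uses `cut()` and `table()` for binning and `as.Date()` with `format()` for splitting a date. The vectors `popularity` and `release_chr` hold made-up values, and the bin width of 25 is an arbitrary choice for illustration.

```r
# --- Binning: group values into intervals and count each interval ---
popularity <- c(0, 12, 25, 46, 62, 77, 100)   # hypothetical popularity scores

# Breaks at 0, 25, 50, 75, 100; include.lowest keeps the value 0 in the first bin
bins <- cut(popularity, breaks = seq(0, 100, by = 25), include.lowest = TRUE)
bin_counts <- table(bins)
print(bin_counts)

# --- Splitting a character date into year, month, and day ---
release_chr <- c("2019-06-14", "2012-01-05")  # hypothetical release dates
release <- as.Date(release_chr, format = "%Y-%m-%d")

date_parts <- data.frame(
  year  = as.integer(format(release, "%Y")),
  month = as.integer(format(release, "%m")),
  day   = as.integer(format(release, "%d"))
)
print(date_parts)
```

With an explicit `format`, entries that do not match the pattern become NA rather than raising an error, so it is worth checking `sum(is.na(release))` after the conversion.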

Linear regression, logistic regression, tree model, or cluster analysis might be used to answer the questions.
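As one example of these methods, a logistic regression could be sketched as below. Everything here is simulated: the data frame `toy`, the binary outcome `is_popular`, and the relationship between danceability and popularity are assumptions made only so that the code runs. In the real analysis the outcome would come from a cutoff on track_popularity.

```r
# Simulated stand-in data (hypothetical, not from the dataset)
set.seed(1)
n <- 200
toy <- data.frame(
  danceability = runif(n),
  energy       = runif(n)
)

# Assumed relationship: more danceable songs are more likely to be popular
prob_popular <- plogis(-1 + 2 * toy$danceability)
toy$is_popular <- rbinom(n, size = 1, prob = prob_popular)

# Logistic regression of the binary outcome on two audio features
model <- glm(is_popular ~ danceability + energy, data = toy, family = binomial)
print(summary(model)$coefficients)

# Predicted probability of being popular for a new, hypothetical song
new_song <- data.frame(danceability = 0.8, energy = 0.5)
pred <- predict(model, newdata = new_song, type = "response")
print(pred)
```

The predicted probabilities are also the input an ROC curve would need to evaluate such a model.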