In our analysis, we explore what factors influence the popularity of a song, based on a Spotify data set from the TidyTuesday series. Our study may be of interest to musicians or producers who want to understand how they can make music that is more popular with their target audience on Spotify. The factors we identify may also help artists get their music heard by a larger audience. Even if we find no relationship between these metrics and popularity, that is itself an interesting conclusion that can inform the decisions of those who make and listen to music. As a consumer, it can be hard to pinpoint why a song is or is not enjoyable. Our findings may help Spotify listeners identify songs similar to ones they already enjoy, which could improve their listening experience.
Since there are many variables in this data set, we will use our cleaning and analysis of the data to pinpoint a handful to explore further. Addressing our problem therefore involves choosing a few key variables and not getting lost in the data. After cleaning, we plan to explore the relationships between popularity and the following variables: valence, energy, mode, and loudness, which are further explained in our variable dictionary. We may also explore the relationship between release date and popularity of a song, which will require care in data type conversion.
Tidyverse is a collection of packages that is designed to simplify data analysis. A number of the functions contained in library(tidyverse) make it easier to sort through data, look at specific variables or columns, rename or create variables, group data differently, and much more.
library(tidyverse)
The data used in this project was obtained from the TidyTuesday GitHub repository via the following code.
spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
## Parsed with column specification:
## cols(
## .default = col_double(),
## track_id = col_character(),
## track_name = col_character(),
## track_artist = col_character(),
## track_album_id = col_character(),
## track_album_name = col_character(),
## track_album_release_date = col_character(),
## playlist_name = col_character(),
## playlist_id = col_character(),
## playlist_genre = col_character(),
## playlist_subgenre = col_character()
## )
## See spec(...) for full column specifications.
This 2020 Spotify data comes from the spotifyr package, which is an R wrapper that was created by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff to make it easier to access your own Spotify data or general data about songs from Spotify’s API.
The data set explored here was gathered by Kaylin Pavlik using audio features of the Spotify data in pursuit of exploration and classification of a collection of songs from 6 main genres (EDM, Latin, Pop, R&B, Rap, and Rock).
When initially downloaded, this data contained the following 23 variables:
| Variable | Class | Description |
|---|---|---|
| track_id | character | Unique ID of a song |
| track_name | character | Song name |
| track_artist | character | Song artist |
| track_popularity | double | Song popularity on a scale of 0 to 100 where a higher number means more popular |
| track_album_id | character | Unique ID of an album |
| track_album_name | character | Album name that the song belongs to |
| track_album_release_date | character | Date the album was released |
| playlist_name | character | Name of the playlist |
| playlist_id | character | Unique playlist ID |
| playlist_genre | character | Genre of a playlist |
| playlist_subgenre | character | Subgenre of a playlist |
| danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | double | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | double | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | double | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | double | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | double | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | double | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | double | Duration of song in milliseconds |
First, we look at the first and last few rows in order to see what our actual data looks like.
head(spotify_songs, 5)
tail(spotify_songs, 10)
Next, we investigate the structure of our data set and its classes. We learn that there are 32,833 total entries and that the Spotify data inherits the attributes of multiple classes. Knowing which classes this data belongs to gives insight into the methods we can use to conduct our data analysis.
class(spotify_songs)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
str(spotify_songs)
## tibble [32,833 × 23] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ track_id : chr [1:32833] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr [1:32833] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr [1:32833] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr [1:32833] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr [1:32833] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr [1:32833] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr [1:32833] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr [1:32833] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr [1:32833] "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr [1:32833] "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num [1:32833] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num [1:32833] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : num [1:32833] 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num [1:32833] -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : num [1:32833] 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num [1:32833] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num [1:32833] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num [1:32833] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num [1:32833] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num [1:32833] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num [1:32833] 122 100 124 122 124 ...
## $ duration_ms : num [1:32833] 194754 162600 176616 169093 189052 ...
## - attr(*, "spec")=
## .. cols(
## .. track_id = col_character(),
## .. track_name = col_character(),
## .. track_artist = col_character(),
## .. track_popularity = col_double(),
## .. track_album_id = col_character(),
## .. track_album_name = col_character(),
## .. track_album_release_date = col_character(),
## .. playlist_name = col_character(),
## .. playlist_id = col_character(),
## .. playlist_genre = col_character(),
## .. playlist_subgenre = col_character(),
## .. danceability = col_double(),
## .. energy = col_double(),
## .. key = col_double(),
## .. loudness = col_double(),
## .. mode = col_double(),
## .. speechiness = col_double(),
## .. acousticness = col_double(),
## .. instrumentalness = col_double(),
## .. liveness = col_double(),
## .. valence = col_double(),
## .. tempo = col_double(),
## .. duration_ms = col_double()
## .. )
Looking at a summary of each variable allows us to identify any initially abnormal values. In the cleaning steps, the character variables will be changed to factors in order to categorize the data into levels so we can learn more. From this initial summary, none of the numeric variables show obviously impossible values, although a few extremes (a minimum tempo of 0, a minimum duration of 4,000 ms, and a maximum loudness above 0 dB) may merit a closer look.
summary(spotify_songs)
## track_id track_name track_artist track_popularity
## Length:32833 Length:32833 Length:32833 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 24.00
## Mode :character Mode :character Mode :character Median : 45.00
## Mean : 42.48
## 3rd Qu.: 62.00
## Max. :100.00
## track_album_id track_album_name track_album_release_date
## Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## playlist_name playlist_id playlist_genre playlist_subgenre
## Length:32833 Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171
## Median :0.6720 Median :0.721000 Median : 6.000 Median : -6.166
## Mean :0.6548 Mean :0.698619 Mean : 5.374 Mean : -6.720
## 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median :1.0000 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean :0.5657 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
The following gives us a concise look at as much data as possible by printing the columns of a data frame downward instead of across.
glimpse(spotify_songs)
## Rows: 32,833
## Columns: 23
## $ track_id <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdf…
## $ track_name <chr> "I Don't Care (with Justin Bieber) - Loud Lu…
## $ track_artist <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "T…
## $ track_popularity <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, …
## $ track_album_id <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E…
## $ track_album_name <chr> "I Don't Care (with Justin Bieber) [Loud Lux…
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "2…
## $ playlist_name <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop …
## $ playlist_id <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7c…
## $ playlist_genre <chr> "pop", "pop", "pop", "pop", "pop", "pop", "p…
## $ playlist_subgenre <chr> "dance pop", "dance pop", "dance pop", "danc…
## $ danceability <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.…
## $ energy <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.…
## $ key <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5…
## $ loudness <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.3…
## $ mode <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0,…
## $ speechiness <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.12…
## $ acousticness <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030,…
## $ instrumentalness <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00…
## $ liveness <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.14…
## $ valence <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.…
## $ tempo <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, …
## $ duration_ms <dbl> 194754, 162600, 176616, 169093, 189052, 1630…
Before changing any data types, we have five missing values each in track_name, track_artist, and track_album_name.
colSums(is.na(spotify_songs))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
In the next section, we will do more investigation and decide how to handle these missing values.
To begin cleaning the data, we decided to change the following character variables to factors to better understand the data for now: track_name, track_artist, track_album_name, track_album_release_date, playlist_genre, playlist_subgenre, playlist_name.
spotify_songs$track_name <- as.factor(spotify_songs$track_name)
spotify_songs$track_artist <- as.factor(spotify_songs$track_artist)
spotify_songs$track_album_name <- as.factor(spotify_songs$track_album_name)
spotify_songs$track_album_release_date <- as.factor(spotify_songs$track_album_release_date)
spotify_songs$playlist_genre <- as.factor(spotify_songs$playlist_genre)
spotify_songs$playlist_subgenre <- as.factor(spotify_songs$playlist_subgenre)
spotify_songs$playlist_name <- as.factor(spotify_songs$playlist_name)
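The same kind of conversion can be written more compactly. The sketch below is one possible alternative, not the code used in this report. It runs on a small made-up tibble (`toy_songs`, a hypothetical stand-in for `spotify_songs`) and assumes dplyr 1.0.0 or later for `across()`.

```r
library(dplyr)

# Hypothetical stand-in for spotify_songs (values invented for illustration)
toy_songs <- tibble(
  track_name       = c("Song A", "Song B", "Song A"),
  playlist_genre   = c("pop", "rock", "pop"),
  track_popularity = c(66, 40, 66)
)

# Convert several character columns to factors in a single step;
# columns that are not listed (here track_popularity) are left unchanged
toy_songs <- toy_songs %>%
  mutate(across(c(track_name, playlist_genre), as.factor))

str(toy_songs)
```

Listing the columns once inside `across()` avoids the copy-and-paste slips that repeated assignments invite.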
What is the story with these tracks with NAs?
spotify_songs[rowSums(is.na(spotify_songs)) != 0,]
Five rows lack track_name, track_artist, and track_album_name, and we will delete these from the data set.
# Keep only the rows whose track_name is not missing
spotify_songs <- spotify_songs %>%
  filter(!is.na(track_name))
Quality check:
colSums(is.na(spotify_songs))
## track_id track_name track_artist
## 0 0 0
## track_popularity track_album_id track_album_name
## 0 0 0
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
Now we have cleaned our data set of missing values.
The duplicates of songs make sense because different playlists may contain the same song. The output below shows that there are fewer unique songs than our total observations.
Number of unique track ID’s (which would delineate the number of unique songs):
spotify_songs %>%
distinct(track_id) %>%
tally()
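To make the comparison between total observations and unique songs explicit, both counts can be computed side by side. This is a minimal sketch on a made-up tibble (`toy_tracks` and its values are hypothetical); the same `summarise()` call could be applied to `spotify_songs`.

```r
library(dplyr)

# Hypothetical data: track "a" appears in two playlists, track "c" in three
toy_tracks <- tibble(
  track_id    = c("a", "a", "b", "c", "c", "c"),
  playlist_id = c("p1", "p2", "p1", "p1", "p2", "p3")
)

dup_summary <- toy_tracks %>%
  summarise(
    rows          = n(),                   # total observations
    unique_tracks = n_distinct(track_id),  # distinct songs
    extra_rows    = rows - unique_tracks   # rows that repeat a song
  )

dup_summary
```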
This shows us concisely which artists are in this data and how many times they appear.
spotify_songs %>%
group_by(track_artist) %>%
summarise(count = n()) %>%
arrange(-count) %>%
head()
Using unique(spotify_songs$track_artist), we observe that there are 10,692 distinct artists.
The album names are displayed here.
spotify_songs %>%
group_by(track_album_name) %>%
summarise(count = n()) %>%
arrange(-count) %>%
head()
While it may seem logical to convert track_album_release_date to a date variable, we will need to treat this carefully because the dates are recorded inconsistently. Below we see that some of these dates are just years, as opposed to the full year-month-day format.
summary(spotify_songs$track_album_release_date)
## 2020-01-10 2019-11-22 2019-12-06 2019-12-13 2013-01-01 2019-11-15 2012-01-01
## 270 244 235 220 219 215 209
## 2010-01-01 2019-11-08 2008-01-01 2019-10-25 2019-11-01 2019-12-20 2006-01-01
## 192 192 188 188 180 180 170
## 2019-10-18 2019-11-29 2019-10-11 2005-01-01 2019-10-04 2019-09-27 2019-06-28
## 168 168 164 163 159 156 154
## 2009-01-01 2014-01-01 2007-01-01 2019-09-06 2019-09-13 2019-06-21 2019-09-20
## 147 147 146 140 139 138 135
## 2019-08-16 2020-01-03 2011-01-01 2020-01-17 2019-08-30 2004-01-01 2019-12-27
## 133 133 132 131 130 116 115
## 2019-08-23 2019-07-26 2019-05-10 2019-05-17 2019-07-12 2003-01-01 2019-07-19
## 114 110 109 109 107 104 101
## 2002-01-01 2019-04-26 2019-05-31 2019-08-09 2019-05-24 2019-04-05 2019-06-14
## 95 95 95 95 92 90 90
## 2019-07-05 2019-05-03 2001-01-01 2019-06-07 2019-03-22 2018-12-14 2019-01-18
## 90 88 87 87 86 85 85
## 2019-02-22 2019-03-29 1998-01-01 2005 2019-02-08 2019-08-02 1999-01-01
## 85 84 80 80 79 79 78
## 2019-03-01 2019-03-08 2019-02-01 2017-06-09 2018-04-06 2018-10-05 2018-11-09
## 76 76 74 71 71 71 70
## 2000-01-01 2019-10-31 2018-10-19 2019-04-12 2019-01-25 1998 2016-10-21
## 69 69 68 67 66 63 63
## 2018-11-30 1976 1988-01-01 2018-08-17 2019-01-11 2018-11-16 2004
## 62 59 59 59 59 58 57
## 2018-11-02 2019-02-15 2019-04-19 2006 2015-08-28 2018-09-28 2018-12-07
## 57 57 57 56 56 56 56
## 2001 2018-10-26 1987-01-01 2016-06-24 2018-04-27 2003 2016-05-06
## 55 55 53 53 53 52 52
## 2018-08-24 (Other)
## 52 22126
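One cautious way to work with these mixed formats is to extract only the year, which every entry has, and to flag which entries carry a full date. The sketch below is an assumption about how this could be handled, not a step performed in this report. The tibble `toy_dates` is hypothetical, and the entries are converted with `as.character()` first because the column is stored as a factor at this point.

```r
library(dplyr)

# Hypothetical release dates in mixed formats (full dates and year-only entries)
toy_dates <- tibble(
  track_album_release_date = factor(c("2019-06-14", "2005", "1998-01-01", "1976"))
)

toy_dates <- toy_dates %>%
  mutate(
    date_chr      = as.character(track_album_release_date),
    release_year  = as.integer(substr(date_chr, 1, 4)),  # first four characters are the year
    has_full_date = nchar(date_chr) == 10                # TRUE only for YYYY-MM-DD entries
  )

toy_dates
```

Working at the year level avoids inventing a month and day for the year-only entries.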
There appear to be strange characters in some playlist names in the R Console. However, when we use View(summary(spotify_songs$playlist_name)), we see that these strange characters are emojis in the playlist titles. So, although this seemed abnormal, we don’t need to change these.
spotify_songs %>%
group_by(playlist_name) %>%
summarise(count = n()) %>%
arrange(-count) %>%
head()
As described in the data background, there are six genres in this data.
summary(spotify_songs$playlist_genre)
## edm latin pop r&b rap rock
## 6043 5153 5507 5431 5743 4951
In the relationships with playlist_genre explored through boxplots below, the plot of danceability suggests the presence of outliers.
#plot(spotify_songs$playlist_genre)
boxplot(danceability ~ playlist_genre , spotify_songs)
This is a boxplot of valence with playlist genre.
boxplot(valence ~ playlist_genre , spotify_songs)
This is a boxplot of key with playlist genre.
boxplot(key ~ playlist_genre , spotify_songs)
This is a boxplot of mode and playlist genre.
boxplot(mode ~ playlist_genre , spotify_songs)
This is a boxplot of track popularity and playlist genre.
boxplot(track_popularity ~ playlist_genre , spotify_songs)
There are 24 playlist subgenres.
unique(spotify_songs$playlist_subgenre) #24 Levels
## [1] dance pop post-teen pop
## [3] electropop indie poptimism
## [5] hip hop southern hip hop
## [7] gangster rap trap
## [9] album rock classic rock
## [11] permanent wave hard rock
## [13] tropical latin pop
## [15] reggaeton latin hip hop
## [17] urban contemporary hip pop
## [19] new jack swing neo soul
## [21] electro house big room
## [23] pop edm progressive electro house
## 24 Levels: album rock big room classic rock dance pop ... urban contemporary
If we plan to do more exploration into subgenre, we need to determine a better way to label the x-axis so we know which subgenre is being referred to more specifically. While the following boxplots reveal some interesting outliers, the unclear labelling of the axis makes it hard to see which subgenre in particular has the outliers.
There are some interesting outliers between danceability and playlist subgenre.
#plot(spotify_songs$playlist_subgenre)
boxplot(danceability ~ playlist_subgenre , spotify_songs)
In addition there are interesting outliers with valence.
boxplot(valence ~ playlist_subgenre , spotify_songs)
No outliers are apparent between playlist subgenre and key.
boxplot(key ~ playlist_subgenre , spotify_songs)
There appears to be an interesting outlier that could be investigated further between playlist subgenre and mode.
boxplot(mode ~ playlist_subgenre , spotify_songs)
There appear to be outliers present here.
boxplot(track_popularity ~ playlist_subgenre , spotify_songs)
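The labelling problem with the subgenre boxplots can be approached in base R by rotating the axis labels, and the outliers can be tied back to their subgenre through the value that `boxplot()` returns. The following is a sketch on simulated data (the data frame `toy` and the planted outlier are invented for illustration), not output from the Spotify data.

```r
set.seed(1)

# Simulated stand-in for spotify_songs with three long subgenre names
subgenres <- c("progressive electro house", "urban contemporary", "indie poptimism")
toy <- data.frame(
  playlist_subgenre = factor(rep(subgenres, each = 30)),
  danceability      = rnorm(90, mean = 0.65, sd = 0.05)
)
toy$danceability[1] <- 0.05  # plant one obvious outlier in the first group

# Widen the bottom margin and rotate the labels (las = 2) so long names fit
old_par <- par(mar = c(11, 4, 2, 1))
bp <- boxplot(danceability ~ playlist_subgenre, data = toy,
              las = 2, xlab = "", cex.axis = 0.8)
par(old_par)

# boxplot() returns the outlier values (out) and their group index (group),
# so each outlier can be matched to a subgenre name
outliers <- data.frame(
  subgenre = bp$names[bp$group],
  value    = bp$out
)
outliers
```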
When grouped by playlist_name, there are 449 playlists based on the output of unique(spotify_songs$playlist_name).
However, when grouped by playlist_id, there are 471 playlists, based on the output of unique(spotify_songs$playlist_id). Some playlist names must therefore be shared by more than one playlist ID.
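If the gap between the two counts needed investigating, one option would be to list the playlist names that map to more than one ID. This is a minimal sketch on hypothetical data (`toy_playlists` is invented); it is not part of the cleaning done here.

```r
library(dplyr)

# Hypothetical data: the name "Pop Hits" is used by two different playlist IDs
toy_playlists <- tibble(
  playlist_name = c("Pop Hits", "Pop Hits", "Pop Hits", "Rock Classics"),
  playlist_id   = c("id1", "id1", "id2", "id3")
)

# Count distinct IDs per name and keep the names that have more than one
shared_names <- toy_playlists %>%
  group_by(playlist_name) %>%
  summarise(n_ids = n_distinct(playlist_id), .groups = "drop") %>%
  filter(n_ids > 1)

shared_names
```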
We will not do further cleaning on playlists here because playlists are not necessarily what we are concerned about in the problems we plan to address with this data.
Here is a glimpse at our final data in the most condensed form possible.
glimpse(spotify_songs)
## Rows: 32,828
## Columns: 23
## $ track_id <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdf…
## $ track_name <fct> I Don't Care (with Justin Bieber) - Loud Lux…
## $ track_artist <fct> Ed Sheeran, Maroon 5, Zara Larsson, The Chai…
## $ track_popularity <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, …
## $ track_album_id <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E…
## $ track_album_name <fct> I Don't Care (with Justin Bieber) [Loud Luxu…
## $ track_album_release_date <fct> 2019-06-14, 2019-12-13, 2019-07-05, 2019-07-…
## $ playlist_name <fct> Pop Remix, Pop Remix, Pop Remix, Pop Remix, …
## $ playlist_id <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7c…
## $ playlist_genre <fct> pop, pop, pop, pop, pop, pop, pop, pop, pop,…
## $ playlist_subgenre <fct> dance pop, dance pop, dance pop, dance pop, …
## $ danceability <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.…
## $ energy <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.…
## $ key <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5…
## $ loudness <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.3…
## $ mode <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0,…
## $ speechiness <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.12…
## $ acousticness <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030,…
## $ instrumentalness <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00…
## $ liveness <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.14…
## $ valence <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.…
## $ tempo <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, …
## $ duration_ms <dbl> 194754, 162600, 176616, 169093, 189052, 1630…
Of the 23 original variables, we plan on keeping 11 of them: track_id, track_name, track_artist, track_album_id, track_album_name, valence, energy, mode, loudness, track_popularity, track_album_release_date.
Among those, the variables of particular interest to us are valence, energy, mode, loudness, and track_popularity.
While there may be insights that could come from playlist genre or subgenre, they are not of interest to us here and would not answer our initial questions as directly as the other song metrics available to us through the Spotify data. So, we will drop those variables in our analysis.
One area of concern is using popularity as the response variable in a linear regression on release date, since the dates are recorded inconsistently. Another aspect of the data we need to be careful with is mode, since it is a binary variable.
The original data frame has multiple observations of many songs because many playlists can have the same song, as we can see here. Of our original data set, approximately 5,000 observations are duplicates.
spotify_songs %>% distinct(track_id, .keep_all = TRUE)
For our analysis, we want to make sure we have only one observation of each song. However, we will leave different versions of a song in the data set, since the variables of interest, like popularity, do change across versions.
In order to have only 1 observation of each song we need to create a subset of the original data frame in this way:
spotify_songs %>% distinct(track_id, .keep_all = TRUE) -> tib_single_songs
summary(tib_single_songs)
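To illustrate what `distinct()` does here, the sketch below applies the same call to a small made-up tibble (`toy_songs` and its values are hypothetical). With `.keep_all = TRUE`, the first row for each `track_id` is kept along with all of its other columns, and versions of a song that have their own `track_id` are retained.

```r
library(dplyr)

# Hypothetical data: track "t1" sits in two playlists; "t2" is a remix with its own ID
toy_songs <- tibble(
  track_id         = c("t1", "t1", "t2", "t3"),
  track_name       = c("Song", "Song", "Song - Remix", "Other"),
  playlist_id      = c("p1", "p2", "p1", "p1"),
  track_popularity = c(70, 70, 55, 30)
)

# Keep the first row per track_id, retaining every column
toy_single <- toy_songs %>%
  distinct(track_id, .keep_all = TRUE)

toy_single
```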
Finally, the numeric variables we are most interested in exploring are summarized here, taken from the output summary of the subset we created in the code directly above this section.
| Variable | Q1 | Mean | Q3 | Comment |
|---|---|---|---|---|
| track_popularity | 21 | 39.33 | 58 | Response Variable |
| energy | 0.579 | 0.698 | 0.843 | Explanatory Variable 1 |
| loudness | -8.309 | -6.818 | -4.709 | Explanatory Variable 2 |
| valence | 0.329 | 0.510 | 0.695 | Explanatory Variable 3 |
| mode | 0 | 0.566 | 1 | Explanatory Variable 4, Binary Variable |
Moving forward, we will split our columns of interest into two data frames in order to find new information in our data that is not readily apparent. Both of these new data frames will contain song and album names and IDs in addition to popularity. The difference will be that
We are considering creating a new variable that is a composite of the explanatory variables. For instance, combining loudness and valence might be interesting to explore further.
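As one possible way to build such a composite, each variable could be standardized and the results averaged, so that loudness (in dB) and valence (on a 0 to 1 scale) contribute on a comparable footing. This is only a sketch of the idea under that assumption. The helper `z_score()`, the column name `loud_valence`, and the data in `toy` are all hypothetical.

```r
library(dplyr)

# Hypothetical values for illustration
toy <- tibble(
  loudness = c(-12, -8, -6, -4, -2),
  valence  = c(0.2, 0.9, 0.5, 0.4, 0.7)
)

# Standardize a numeric vector to mean 0 and standard deviation 1
z_score <- function(x) (x - mean(x)) / sd(x)

# Average the two standardized variables into one composite score
toy <- toy %>%
  mutate(loud_valence = (z_score(loudness) + z_score(valence)) / 2)

toy
```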
In order to illustrate our findings, we hope to create plots and tables using ggplot, which we will be learning soon. We would also like to find a type of plot that properly represents our binary variable, mode.
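One candidate for displaying a binary variable such as mode is a boxplot of popularity for each of its two levels. The sketch below uses ggplot2 on simulated data (the data frame `toy` is invented and says nothing about the real relationship), with mode relabelled as minor and major following the variable dictionary.

```r
library(ggplot2)

set.seed(1)

# Simulated data: 50 minor (0) and 50 major (1) tracks with random popularity
toy <- data.frame(
  mode             = rep(c(0, 1), each = 50),
  track_popularity = round(runif(100, min = 0, max = 100))
)

# Treat mode as a labelled factor so it is drawn as two categories
toy$mode_label <- factor(toy$mode, levels = c(0, 1), labels = c("minor", "major"))

p <- ggplot(toy, aes(x = mode_label, y = track_popularity)) +
  geom_boxplot() +
  labs(x = "Mode", y = "Track popularity")

print(p)
```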
We look forward to learning about the following topics to be able to answer our questions:
How to see outliers in such a large data set. There is a lot of information to fit onto a small plot, and it will be hard to choose views of the data that provide insight into our problem.
The usage and syntax of summarise(). While we attained the output we desired, there was a message we suppressed.
Whether we will need the release date to refine our analysis, because there are duplicate songs with different versions. We anticipate that we may need to choose the earliest release of a song.
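On the `summarise()` question, we assume the suppressed message was the note about grouping that some dplyr versions print after a grouped `summarise()`. Under that assumption, the sketch below shows two ways to get the same counts without it: setting `.groups = "drop"` explicitly, or using `count()`. The tibble `toy_artists` is hypothetical.

```r
library(dplyr)

# Hypothetical artist column for illustration
toy_artists <- tibble(track_artist = c("A", "B", "A", "C", "A", "B"))

# Option 1: state explicitly that the result should be ungrouped
by_summarise <- toy_artists %>%
  group_by(track_artist) %>%
  summarise(count = n(), .groups = "drop") %>%
  arrange(desc(count))

# Option 2: count() groups, tallies, and ungroups in one call
by_count <- toy_artists %>%
  count(track_artist, sort = TRUE, name = "count")

by_summarise
by_count
```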
We currently plan on incorporating linear regression of our song metrics of interest with track_popularity as our response variable. When we do regression, we won’t use playlist influences. We may also look at the five most and five least popular songs.
Overall, we are not sure where this analysis will lead us, but the variables we choose to explore more deeply will be revealing regardless of whether any relationships exist.
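The planned regression might look roughly like the following. This is a sketch on simulated data, not a result: the data frame `toy`, its coefficients, and its noise are invented so the code runs on its own, and the model formula is our assumption about how the four explanatory variables would enter. Mode is wrapped in `factor()` so it is treated as a category, not as a number.

```r
set.seed(42)
n <- 200

# Simulated stand-in for the de-duplicated song data (all values invented)
toy <- data.frame(
  valence  = runif(n),
  energy   = runif(n),
  loudness = runif(n, min = -20, max = 0),
  mode     = rbinom(n, size = 1, prob = 0.5)
)
toy$track_popularity <- 40 + 5 * toy$valence + rnorm(n, sd = 10)

# Linear regression with track_popularity as the response variable
fit <- lm(track_popularity ~ valence + energy + loudness + factor(mode), data = toy)

summary(fit)
```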