1.1 Spotify was created as an alternative to pirating music online. It lets people listen to over 50 million songs (for a fee), which gives exposure to thousands of artists and songs that might otherwise never have reached a broad audience. This dataset, available through TidyTuesday, gives insight into track popularity, danceability, tempo, genre, and loudness, among other variables. Artists want a high track popularity score, which means more people are listening to their song relative to other songs. Consumers, in turn, want to find songs they enjoy, which depends on a number of variables. We want to explore whether variables such as danceability, tempo, or loudness have a relationship with track popularity, and whether certain genres have higher track popularity than others.
1.2 Any potential relationships will be explored through statistical measures as well as graphical visualizations after the data is cleaned and it is determined which variables can be explored thoroughly.
1.3 Once the data is cleaned, we will explore potential relationships between track popularity and several other variables, including danceability, tempo, genre, and loudness. We will do this by examining summary statistics for these variables and by graphing them individually and against one another to see whether any relationships emerge.
1.4 This analysis will help artists and producers understand whether, and to what degree, there is a relationship between track popularity and the variables described above. They can then decide whether to create or alter songs in the hope of achieving a higher track popularity rating within Spotify, meaning more people listen to their music.
2.1 & 2.2
library(tidyverse)
library(DT)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(factoextra)
# load packages (messages and warnings suppressed)
2.3
| Package | Function |
|---|---|
| tidyverse | data manipulation and analysis |
| DT | HTML display of data |
| dplyr | data manipulation and analysis |
| ggplot2 | visual graphs of data |
| yaml | commonly used for configuration files |
| gridExtra | arranging multiple grid-based plots |
| factoextra | extracting and visualizing results of multivariate analyses (e.g., clustering) |
3.1 The Spotify data was obtained through TidyTuesday via GitHub.
3.2 This data was collected with the spotifyr package, which provides a way to download general metadata about songs from Spotify’s API. The package was authored by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff. The data can be used to examine variables such as track popularity and danceability for a particular artist, song, or genre. The dataset was pre-generated, but directions were also provided for Spotify users who want to download their own data. It was originally published on 1/21/2020. The original dataset has 32,833 rows and 23 columns (variables). There are only a few missing values in the data (5 each in track_name, track_artist, and track_album_name); the rows containing them will be removed as part of data cleaning.
3.3 After the data is imported, we will check whether there are any missing values.
spotify <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
## Rows: 32833 Columns: 23
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(spotify)
## # A tibble: 6 x 23
## track_id track_name track_artist track_popularity track_album_id
## <chr> <chr> <chr> <dbl> <chr>
## 1 6f807x0ima9a1j3VPbc7VN I Don't C~ Ed Sheeran 66 2oCs0DGTsRO98~
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories ~ Maroon 5 67 63rPSO264uRjW~
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the T~ Zara Larsson 70 1HoSmj2eLcsrR~
## 4 75FpbthrwQmzHlBJLuGdC7 Call You ~ The Chainsm~ 60 1nqYsOef1yKKu~
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone Y~ Lewis Capal~ 69 7m7vv9wlQ4i0L~
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful~ Ed Sheeran 67 2yiy9cd2QktrN~
## # ... with 18 more variables: track_album_name <chr>,
## # track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## # playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## # energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
## # acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## # tempo <dbl>, duration_ms <dbl>
The first thing we need to do is convert duration_ms from milliseconds to seconds, since seconds are much easier for most readers to interpret than milliseconds. We then rename duration_ms to duration_s.
spotify$duration_ms <- spotify$duration_ms / 1000
spotify <- spotify %>% rename(duration_s = duration_ms)
Some variables are categorical and need to be converted from character vectors to factors. For example, an artist can have multiple songs, but that artist is a single category in this instance.
spotify$track_artist <- as.factor(spotify$track_artist)
spotify$track_album_name <- as.factor(spotify$track_album_name)
spotify$playlist_genre <- as.factor(spotify$playlist_genre)
spotify$playlist_subgenre <- as.factor(spotify$playlist_subgenre)
There are five missing values in each of the following three columns: track_name, track_artist, and track_album_name. Because this is such a small number, the affected rows will be deleted.
colSums(is.na(spotify))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_s
## 0 0
spotify <- na.omit(spotify)
Some values in our data set don’t belong. Loudness is expected to have a maximum value of 0, but right now the maximum is 1.275. We will remove all observations with a loudness above 0.
There is also a minimum duration of 4 seconds, which is implausibly short for a song. Looking closer, 2 observations fall below 30 seconds, one of them being the 4-second song. We’ll exclude any song shorter than 30 seconds.
spotify <- spotify %>% filter(duration_s >= 30)
spotify <- spotify %>% filter(loudness <= 0)
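The counts quoted above can be checked before any rows are dropped. The following is a minimal base-R sketch on a made-up data frame (the `toy` object and all of its values are invented for illustration, not taken from the Spotify data). It counts the rows that violate each rule and then applies the same two conditions used in the filters:

```r
# Toy data: values are invented for illustration only.
toy <- data.frame(
  track_id   = c("a", "b", "c", "d", "e"),
  loudness   = c(-5.2, 1.275, -0.5, -12.0, 0.3),
  duration_s = c(210, 4, 185, 29.5, 240)
)

# Count the rows that break each rule before dropping them.
n_too_loud  <- sum(toy$loudness > 0)
n_too_short <- sum(toy$duration_s < 30)

# Apply the same two conditions used in the cleaning step.
toy_clean <- toy[toy$duration_s >= 30 & toy$loudness <= 0, ]

n_too_loud
n_too_short
nrow(toy_clean)
```

Reporting the counts alongside the filter makes it easy to confirm that only a handful of observations are being removed.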
There are duplicated track IDs, which probably means the same track shows up in different playlists or genres. We will remove the duplicates so that each track appears only once. Summary statistics can then be computed for the full data set.
spotify <- spotify[!duplicated(spotify$track_id), ]
dim(spotify)
## [1] 28344 23
summary(spotify)
## track_id track_name track_artist
## Length:28344 Length:28344 Queen : 130
## Class :character Class :character Martin Garrix : 87
## Mode :character Mode :character Don Omar : 84
## David Guetta : 81
## Dimitri Vegas & Like Mike: 68
## Drake : 68
## (Other) :27826
## track_popularity track_album_id track_album_name
## Min. : 0.00 Length:28344 Greatest Hits : 135
## 1st Qu.: 21.00 Class :character Ultimate Freestyle Mega Mix: 42
## Median : 42.00 Mode :character Gold : 34
## Mean : 39.34 Rock & Rios (Remastered) : 29
## 3rd Qu.: 58.00 Asian Dreamer : 20
## Max. :100.00 Trip Stories : 20
## (Other) :28064
## track_album_release_date playlist_name playlist_id playlist_genre
## Length:28344 Length:28344 Length:28344 edm :4875
## Class :character Class :character Class :character latin:4136
## Mode :character Mode :character Mode :character pop :5132
## r&b :4504
## rap :5394
## rock :4303
##
## playlist_subgenre danceability energy
## southern hip hop : 1581 Min. :0.0771 Min. :0.000175
## indie poptimism : 1547 1st Qu.:0.5610 1st Qu.:0.579000
## neo soul : 1478 Median :0.6700 Median :0.722000
## progressive electro house: 1460 Mean :0.6534 Mean :0.698337
## electro house : 1415 3rd Qu.:0.7600 3rd Qu.:0.843000
## gangster rap : 1313 Max. :0.9830 Max. :1.000000
## (Other) :19550
## key loudness mode speechiness
## Min. : 0.000 Min. :-46.448 Min. :0.0000 Min. :0.0224
## 1st Qu.: 2.000 1st Qu.: -8.310 1st Qu.:0.0000 1st Qu.:0.0410
## Median : 6.000 Median : -6.262 Median :1.0000 Median :0.0626
## Mean : 5.368 Mean : -6.819 Mean :0.5654 Mean :0.1079
## 3rd Qu.: 9.000 3rd Qu.: -4.710 3rd Qu.:1.0000 3rd Qu.:0.1330
## Max. :11.000 Max. : -0.046 Max. :1.0000 Max. :0.9180
##
## acousticness instrumentalness liveness valence
## Min. :0.0000014 Min. :0.0000000 Min. :0.00936 Min. :0.00001
## 1st Qu.:0.0143000 1st Qu.:0.0000000 1st Qu.:0.09260 1st Qu.:0.32900
## Median :0.0797000 Median :0.0000207 Median :0.12700 Median :0.51200
## Mean :0.1771834 Mean :0.0911491 Mean :0.19093 Mean :0.51042
## 3rd Qu.:0.2600000 3rd Qu.:0.0065725 3rd Qu.:0.24900 3rd Qu.:0.69500
## Max. :0.9940000 Max. :0.9940000 Max. :0.99600 Max. :0.99100
##
## tempo duration_s
## Min. : 35.48 Min. : 31.43
## 1st Qu.: 99.97 1st Qu.:187.75
## Median :121.99 Median :216.93
## Mean :120.96 Mean :226.60
## 3rd Qu.:134.00 3rd Qu.:254.98
## Max. :239.44 Max. :517.81
##
The data will then be reduced to exclude the following columns, which are not needed in our analysis:
* track_id
* track_album_id
* track_album_release_date
* playlist_name
* playlist_id
spotify <- spotify %>%
select(track_name, track_artist, track_popularity, track_album_name, playlist_genre, playlist_subgenre, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_s)
str(spotify)
## tibble [28,344 x 18] (S3: tbl_df/tbl/data.frame)
## $ track_name : chr [1:28344] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : Factor w/ 10692 levels "'Til Tuesday",..: 2840 6171 10632 9370 5519 2840 4993 8313 771 8556 ...
## $ track_popularity : num [1:28344] 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_name : Factor w/ 19743 levels "'74 - '75 (feat. Susan Tyler)",..: 7926 10675 1059 2942 15182 1959 11512 13091 17780 8153 ...
## $ playlist_genre : Factor w/ 6 levels "edm","latin",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ playlist_subgenre: Factor w/ 24 levels "album rock","big room",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ danceability : num [1:28344] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num [1:28344] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : num [1:28344] 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num [1:28344] -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : num [1:28344] 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num [1:28344] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num [1:28344] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num [1:28344] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num [1:28344] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num [1:28344] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num [1:28344] 122 100 124 122 124 ...
## $ duration_s : num [1:28344] 195 163 177 169 189 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 8152 9283 9284 19569 19812
## ..- attr(*, "names")= chr [1:5] "8152" "9283" "9284" "19569" ...
3.4 Below is our cleaned data set.
knitr::kable(head(spotify, 5), align = "lccrr")
| track_name | track_artist | track_popularity | track_album_name | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194.754 |
| Memories - Dillon Francis Remix | Maroon 5 | 67 | Memories (Dillon Francis Remix) | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162.600 |
| All the Time - Don Diablo Remix | Zara Larsson | 70 | All the Time (Don Diablo Remix) | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176.616 |
| Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | Call You Mine - The Remixes | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169.093 |
| Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | Someone You Loved (Future Humans Remix) | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189.052 |
3.5 Below is a list of all variables, their class and description. The original data can be found in the TidyTuesday repository on GitHub (data/2020/2020-01-21). We were unable to render the variable list in this version of the report.
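One possible way to produce the variable-and-class part of such a list directly from a data frame is sketched here. It uses a small stand-in data frame (`toy`, with invented values and only three of the report's column names); the descriptions would still have to come from the TidyTuesday data dictionary and are not reproduced here.

```r
# Toy stand-in for the cleaned data; column names follow the report,
# values are invented.
toy <- data.frame(
  track_name       = c("Song A", "Song B"),
  playlist_genre   = factor(c("pop", "rock")),
  track_popularity = c(66, 41),
  stringsAsFactors = FALSE
)

# One row per variable with its (first) class.
var_overview <- data.frame(
  variable = names(toy),
  class    = unname(vapply(toy, function(x) class(x)[1], character(1))),
  stringsAsFactors = FALSE
)
var_overview
```

The resulting data frame could then be passed to `knitr::kable()` in the same way as the cleaned data above.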
4.1 We will use bivariate analysis to determine whether a relationship exists between track popularity and several other variables, including danceability, tempo, genre, and loudness. We will also use clustering to group the observations into varying numbers of clusters and see how the mean of track popularity and the other variables changes with the number of groups.
4.2 Tables that highlight the most popular song, artist, and genre will be used to give an overview of track popularity. Scatterplots, histograms, and correlation charts may be used to show bivariate relationships between track popularity and other variables in the data. Clustering via grid graphs will help us see how the means of different variables shift depending on the number of groups we cluster the data into.
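As a rough illustration of the bivariate step, the sketch below computes the Pearson correlation of track popularity with each numeric variable and the mean popularity per genre. It runs on an invented data frame (`toy`); the column names follow the report, but the values and any resulting numbers are made up and say nothing about the real data.

```r
# Toy data: invented values, column names follow the report.
toy <- data.frame(
  track_popularity = c(10, 35, 50, 62, 70, 88),
  danceability     = c(0.40, 0.55, 0.60, 0.72, 0.68, 0.80),
  tempo            = c(140, 95, 120, 128, 100, 118),
  loudness         = c(-12.0, -9.5, -7.0, -6.1, -5.0, -4.2),
  playlist_genre   = factor(c("rock", "rock", "edm", "edm", "pop", "pop"))
)

# Pearson correlation of each numeric variable with track popularity.
num_vars <- c("danceability", "tempo", "loudness")
pop_cor <- vapply(
  num_vars,
  function(v) cor(toy$track_popularity, toy[[v]]),
  numeric(1)
)
round(pop_cor, 2)

# Mean popularity per genre, for the categorical comparison.
genre_means <- tapply(toy$track_popularity, toy$playlist_genre, mean)
genre_means
```

Correlation only captures linear association, so it is best read alongside the scatterplots rather than on its own.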
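The idea of watching cluster means shift as the number of groups changes can be sketched as follows. This assumes k-means (base R `kmeans()`) as the clustering method, which the report does not specify, and uses simulated data (`toy`) with two invented groups of tracks.

```r
set.seed(42)  # k-means starts from random centers

# Toy data: two invented, well-separated groups of tracks.
toy <- data.frame(
  track_popularity = c(rnorm(20, mean = 30, sd = 5), rnorm(20, mean = 70, sd = 5)),
  danceability     = c(rnorm(20, mean = 0.5, sd = 0.05), rnorm(20, mean = 0.8, sd = 0.05))
)

# Scale so that no variable dominates the distance calculation.
toy_scaled <- scale(toy)

# Fit k-means for several k and report per-cluster means on the original scale.
cluster_means <- lapply(2:4, function(k) {
  fit <- kmeans(toy_scaled, centers = k, nstart = 25)
  aggregate(toy, by = list(cluster = fit$cluster), FUN = mean)
})
names(cluster_means) <- paste0("k", 2:4)
cluster_means$k2
```

Each element of `cluster_means` holds one row per cluster, so comparing `k2`, `k3`, and `k4` shows how the group means move as more clusters are allowed.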
4.3 We need to learn more about data visualizations in R to create high quality visualizations of our analysis. We also need to learn more about how to code the regression model in R.
4.4 We hope to fit a logistic regression model, a classification approach that will let us predict whether a song is deemed popular based on the other variables in our data set. We’ll also cluster the observations into k groups to analyze how the numeric variables group together, which lets us examine the mean of each variable within each group. This should give us insight into how danceability, tempo, genre, and loudness relate to track popularity as observations are assigned to their respective groups.
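A minimal sketch of the planned classification model, using base R `glm()` on simulated data, might look like the following. The cutoff of 50 for calling a track "popular" is a hypothetical choice made only for this example, and the `toy` data are generated so that popularity rises with danceability; none of this reflects results from the real data.

```r
set.seed(1)

# Toy data: invented values; popularity is built to rise with danceability.
n <- 200
toy <- data.frame(
  danceability = runif(n, 0.2, 0.95),
  loudness     = runif(n, -15, -2)
)
toy$track_popularity <- pmin(100, pmax(0,
  60 * toy$danceability + rnorm(n, mean = 10, sd = 12)
))

# Hypothetical cutoff: call a track "popular" if it scores at or above 50.
toy$popular <- as.integer(toy$track_popularity >= 50)

# Logistic regression: log-odds of being popular as a linear function
# of the predictors.
fit <- glm(popular ~ danceability + loudness, data = toy, family = binomial)

# Predicted probabilities and a simple 0.5 decision rule.
toy$p_hat <- predict(fit, type = "response")
toy$pred  <- as.integer(toy$p_hat >= 0.5)

mean(toy$pred == toy$popular)  # in-sample accuracy
coef(fit)
```

In practice the cutoff would need to be justified (for example, from the distribution of track popularity), and accuracy should be measured on held-out data rather than in-sample.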