Mid-Term Project

Spotify

Introduction

1.1 Spotify was created as an alternative to pirating music online. It lets people listen to over 50 million songs (for a fee), which gives exposure to thousands of artists and songs that might otherwise never have reached a broad audience. This dataset, obtained through TidyTuesday, gives insight into track popularity, danceability, tempo, genre, and loudness, among other variables. Artists want a high track popularity score, which means more people are listening to their song than to other songs. Consumers, in turn, want to find songs they enjoy, and that enjoyment depends on a number of variables. We want to explore whether variables such as danceability, tempo, or loudness are related to track popularity, and whether certain genres have higher track popularity than others.

1.2 Once the data is cleaned and we have determined which variables can be explored thoroughly, any potential relationships will be examined through statistical measures and graphical visualizations.

1.3 Once the data is cleaned, we will explore potential relationships between track popularity and several variables, including danceability, tempo, genre, and loudness. We will do this by looking at summary statistics for these variables and by graphing each variable alone and against other variables to see whether any relationships exist.

1.4 This analysis will help artists and producers understand whether, and to what degree, track popularity is related to the variables described above. They can then decide whether to create or alter songs in the hope of achieving a higher track popularity rating on Spotify, meaning more people listen to their music.

Packages

2.1 & 2.2

library(tidyverse)
library(DT)
library(dplyr)
library(ggplot2) 
library(gridExtra)
library(factoextra)
library(knitr)
library(hrbrthemes)
library(reshape2)
# load packages (messages and warnings are suppressed in the rendered report)

2.3

Package	Function
tidyverse	data manipulation and analysis
DT	HTML display of data
dplyr	data manipulation and analysis
ggplot2	visual graphs of data
yaml	commonly used for configuration files
gridExtra	arranging multiple grid-based plots
factoextra	extracting and visualizing results of multivariate analyses such as clustering
knitr	a general-purpose package for dynamic report generation in R
hrbrthemes	visualization
reshape2	data manipulation

Data Preparation

3.1 The Spotify data was obtained through TidyTuesday via GitHub.

3.2 This data was collected to provide general metadata about songs from Spotify’s API. It was retrieved with the spotifyr package, authored by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff. The data can be used to examine variables such as track popularity and danceability for a particular artist, song, or genre. Alongside the generated dataset, the source also gives directions for downloading your own Spotify data if you are a user. It was originally published on 21 January 2020. The original dataset has 32,833 rows and 23 columns (variables). There are only a few missing values (5 each in track_name, track_artist, and track_album_name); the rows containing them will be removed as part of data cleaning.

3.3 After the data is imported, we will next clean the data.

spotify <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

## Rows: 32833 Columns: 23

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(spotify)

## # A tibble: 6 x 23
##   track_id               track_name track_artist track_popularity track_album_id
##   <chr>                  <chr>      <chr>                   <dbl> <chr>         
## 1 6f807x0ima9a1j3VPbc7VN I Don't C~ Ed Sheeran                 66 2oCs0DGTsRO98~
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories ~ Maroon 5                   67 63rPSO264uRjW~
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the T~ Zara Larsson               70 1HoSmj2eLcsrR~
## 4 75FpbthrwQmzHlBJLuGdC7 Call You ~ The Chainsm~               60 1nqYsOef1yKKu~
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone Y~ Lewis Capal~               69 7m7vv9wlQ4i0L~
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful~ Ed Sheeran                 67 2yiy9cd2QktrN~
## # ... with 18 more variables: track_album_name <chr>,
## #   track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## #   playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## #   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, duration_ms <dbl>

The first thing we need to do is convert duration_ms from milliseconds to seconds, since seconds are much easier to understand than milliseconds. We then rename duration_ms to duration_s.

spotify$duration_ms <- spotify$duration_ms / 1000

spotify <- spotify %>% rename(duration_s = duration_ms)

Some variables are categorical and need to be converted from ‘character’ variables to factors. For example, an artist can have multiple songs, but that artist is a single ‘category’ in this instance.

spotify$track_artist <- as.factor(spotify$track_artist)
spotify$track_album_name <- as.factor(spotify$track_album_name)
spotify$playlist_genre <- as.factor(spotify$playlist_genre)
spotify$playlist_subgenre <- as.factor(spotify$playlist_subgenre)
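The same conversion can also be written as a single dplyr step. A minimal sketch on a small made-up tibble (the `toy` data and its values are assumptions for illustration, not rows from the Spotify data):

```r
library(dplyr)

# Hypothetical toy data standing in for the character columns of `spotify`
toy <- tibble(
  track_artist     = c("Artist A", "Artist A", "Artist B"),
  playlist_genre   = c("pop", "rock", "pop"),
  track_popularity = c(66, 70, 45)
)

# Convert every listed character column to a factor in one call
toy <- toy %>%
  mutate(across(c(track_artist, playlist_genre), as.factor))

# Each artist is now a single level, however many songs they have
nlevels(toy$track_artist)
levels(toy$playlist_genre)
```

Listing the columns once inside `across()` avoids repeating the assignment for each variable.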

There are five missing values in each of the following three columns: track_name, track_artist, and track_album_name. Because this is such a small number, the rows containing them will be removed.

colSums(is.na(spotify))

##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo               duration_s 
##                        0                        0

spotify <- na.omit(spotify)

Some values in our data set don’t belong. Loudness should have a maximum value of 0, but the current maximum is 1.275. We will remove all observations where loudness exceeds 0.

There is also a minimum value of 4 seconds for the duration of a song. That’s a very short song. Looking closer, 2 observations fall below 30 seconds, one of them being the 4-second song. We’ll exclude any song shorter than 30 seconds.

spotify <- spotify %>% filter(duration_s >= 30)
spotify <- spotify %>% filter(loudness <= 0)
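The counts quoted above can be checked before any rows are dropped. A sketch on made-up values (the `toy` data frame is an assumption for illustration; only the thresholds 0 and 30 come from the text):

```r
# Hypothetical toy data: one track with loudness above 0, two shorter than 30 s
toy <- data.frame(
  loudness   = c(-5.2, 1.275, -8.1, -3.3, -6.0),
  duration_s = c(210, 195, 4, 29, 240)
)

# Count offending rows before dropping them
n_loud  <- sum(toy$loudness > 0)
n_short <- sum(toy$duration_s < 30)

# Keep only rows that satisfy both conditions (equivalent to the two filters)
toy_clean <- subset(toy, duration_s >= 30 & loudness <= 0)

n_loud
n_short
nrow(toy_clean)
```

Counting first makes it easy to confirm that the filters remove only the handful of rows they are meant to.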

There are duplicated track IDs, which probably means the same track shows up in more than one playlist or genre. We will remove the duplicates so that each track appears only once. Summary data can then be observed for the full data set.

spotify <- spotify[!duplicated(spotify$track_id), ]
dim(spotify)

## [1] 28344    23
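Because `duplicated()` flags only the second and later occurrences of each ID, a track that appears in several playlists keeps the genre of whichever row comes first. A small sketch with made-up IDs (assumed values, not taken from the data) shows the effect:

```r
# Hypothetical toy data: track "a1" appears in two playlists
toy <- data.frame(
  track_id         = c("a1", "b2", "a1", "c3"),
  playlist_genre   = c("pop", "rock", "edm", "rap"),
  stringsAsFactors = FALSE
)

# Number of rows that repeat an earlier track_id
n_dup <- sum(duplicated(toy$track_id))

# Keep the first row for each track_id, as in the step above
toy_unique <- toy[!duplicated(toy$track_id), ]

n_dup
toy_unique$playlist_genre  # "a1" keeps "pop"; its "edm" row is dropped
```

This matters for any later comparison by genre, since each track counts toward only one genre after deduplication.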

summary(spotify)

##    track_id          track_name                           track_artist  
##  Length:28344       Length:28344       Queen                    :  130  
##  Class :character   Class :character   Martin Garrix            :   87  
##  Mode  :character   Mode  :character   Don Omar                 :   84  
##                                        David Guetta             :   81  
##                                        Dimitri Vegas & Like Mike:   68  
##                                        Drake                    :   68  
##                                        (Other)                  :27826  
##  track_popularity track_album_id                        track_album_name
##  Min.   :  0.00   Length:28344       Greatest Hits              :  135  
##  1st Qu.: 21.00   Class :character   Ultimate Freestyle Mega Mix:   42  
##  Median : 42.00   Mode  :character   Gold                       :   34  
##  Mean   : 39.34                      Rock & Rios (Remastered)   :   29  
##  3rd Qu.: 58.00                      Asian Dreamer              :   20  
##  Max.   :100.00                      Trip Stories               :   20  
##                                      (Other)                    :28064  
##  track_album_release_date playlist_name      playlist_id        playlist_genre
##  Length:28344             Length:28344       Length:28344       edm  :4875    
##  Class :character         Class :character   Class :character   latin:4136    
##  Mode  :character         Mode  :character   Mode  :character   pop  :5132    
##                                                                 r&b  :4504    
##                                                                 rap  :5394    
##                                                                 rock :4303    
##                                                                               
##                  playlist_subgenre  danceability        energy        
##  southern hip hop         : 1581   Min.   :0.0771   Min.   :0.000175  
##  indie poptimism          : 1547   1st Qu.:0.5610   1st Qu.:0.579000  
##  neo soul                 : 1478   Median :0.6700   Median :0.722000  
##  progressive electro house: 1460   Mean   :0.6534   Mean   :0.698337  
##  electro house            : 1415   3rd Qu.:0.7600   3rd Qu.:0.843000  
##  gangster rap             : 1313   Max.   :0.9830   Max.   :1.000000  
##  (Other)                  :19550                                      
##       key            loudness            mode         speechiness    
##  Min.   : 0.000   Min.   :-46.448   Min.   :0.0000   Min.   :0.0224  
##  1st Qu.: 2.000   1st Qu.: -8.310   1st Qu.:0.0000   1st Qu.:0.0410  
##  Median : 6.000   Median : -6.262   Median :1.0000   Median :0.0626  
##  Mean   : 5.368   Mean   : -6.819   Mean   :0.5654   Mean   :0.1079  
##  3rd Qu.: 9.000   3rd Qu.: -4.710   3rd Qu.:1.0000   3rd Qu.:0.1330  
##  Max.   :11.000   Max.   : -0.046   Max.   :1.0000   Max.   :0.9180  
##                                                                      
##   acousticness       instrumentalness       liveness          valence       
##  Min.   :0.0000014   Min.   :0.0000000   Min.   :0.00936   Min.   :0.00001  
##  1st Qu.:0.0143000   1st Qu.:0.0000000   1st Qu.:0.09260   1st Qu.:0.32900  
##  Median :0.0797000   Median :0.0000207   Median :0.12700   Median :0.51200  
##  Mean   :0.1771834   Mean   :0.0911491   Mean   :0.19093   Mean   :0.51042  
##  3rd Qu.:0.2600000   3rd Qu.:0.0065725   3rd Qu.:0.24900   3rd Qu.:0.69500  
##  Max.   :0.9940000   Max.   :0.9940000   Max.   :0.99600   Max.   :0.99100  
##                                                                             
##      tempo          duration_s    
##  Min.   : 35.48   Min.   : 31.43  
##  1st Qu.: 99.97   1st Qu.:187.75  
##  Median :121.99   Median :216.93  
##  Mean   :120.96   Mean   :226.60  
##  3rd Qu.:134.00   3rd Qu.:254.98  
##  Max.   :239.44   Max.   :517.81  
##

We then drop the following variables, which are not needed in our analysis:

* track_id
* track_album_id
* track_album_release_date
* playlist_name
* playlist_id

spotify <- spotify %>%
  select(track_name, track_artist, track_popularity, track_album_name, playlist_genre, playlist_subgenre, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_s)
str(spotify)

## tibble [28,344 x 18] (S3: tbl_df/tbl/data.frame)
##  $ track_name       : chr [1:28344] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist     : Factor w/ 10692 levels "'Til Tuesday",..: 2840 6171 10632 9370 5519 2840 4993 8313 771 8556 ...
##  $ track_popularity : num [1:28344] 66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_name : Factor w/ 19743 levels "'74 - '75 (feat. Susan Tyler)",..: 7926 10675 1059 2942 15182 1959 11512 13091 17780 8153 ...
##  $ playlist_genre   : Factor w/ 6 levels "edm","latin",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ playlist_subgenre: Factor w/ 24 levels "album rock","big room",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ danceability     : num [1:28344] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy           : num [1:28344] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key              : num [1:28344] 6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness         : num [1:28344] -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode             : num [1:28344] 1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness      : num [1:28344] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness     : num [1:28344] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness : num [1:28344] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness         : num [1:28344] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence          : num [1:28344] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo            : num [1:28344] 122 100 124 122 124 ...
##  $ duration_s       : num [1:28344] 195 163 177 169 189 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 8152 9283 9284 19569 19812
##   ..- attr(*, "names")= chr [1:5] "8152" "9283" "9284" "19569" ...

3.4 Below is our cleaned data set.

datatable(head(spotify, 5),options = list(scrollX = TRUE))

3.5 Below is a list of all variables used in the analysis, with their class and description. For definitions of all variables, see the linked data dictionary.

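# Summary_Table1.csv is a local file listing each variable, its class, and description
var_summary <- read_csv("D:/Summary_Table1.csv")
datatable(var_summary, options = list(scrollX = TRUE))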

EDA

4.1 We will use bivariate analysis to determine whether a relationship exists between track popularity and several other variables, including danceability, tempo, genre, and loudness. We will also use clustering to split the observations into different numbers of groups and see how the means of track popularity and other variables change with the number of groups.

We’re also creating a new binary variable called pop, which is 1 if the song has a track_popularity of 75 or higher and 0 if it is below 75. This will help with the correlation plot and with running a logistic regression model.

spotify$pop <- ifelse(spotify$track_popularity >= 75, 1, 0)
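Before modelling, it is worth checking how unbalanced the two classes are. A sketch on made-up popularity scores (the vector below is an assumption for illustration):

```r
# Hypothetical popularity scores on the 0-100 scale
track_popularity <- c(0, 0, 12, 40, 55, 60, 74, 75, 80, 100)

# Same rule as above: 1 when popularity is 75 or higher, 0 otherwise
pop <- ifelse(track_popularity >= 75, 1, 0)

# Counts per class and the share of tracks labelled popular
table(pop)
mean(pop)
```

Run on the real `spotify$pop` column, the same two calls would show how few tracks clear the threshold.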

4.2 Tables that highlight the most popular song, artist, and genre will give an overview of track popularity. Scatterplots and histograms may be used for bivariate analysis between track popularity and other variables in the data. Clustering via grid graphs will help us see how the means of different variables shift depending on the number of groups we cluster the data into.
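One way to do the clustering described here is k-means, shown below as a sketch on simulated data. The choice of k-means, the two simulated variables, and the values of k are all assumptions for illustration; the report does not fix the method at this stage.

```r
set.seed(42)  # k-means uses random starts, so fix the seed

# Simulated stand-ins for two numeric audio features (not real Spotify values)
toy <- data.frame(
  danceability = c(rnorm(50, 0.4, 0.05), rnorm(50, 0.8, 0.05)),
  loudness     = c(rnorm(50, -12, 1),    rnorm(50, -5, 1))
)

# Standardize so that loudness (in dB) does not dominate the distances
toy_scaled <- scale(toy)

# Fit k-means for several group counts and report per-cluster means
for (k in 2:3) {
  fit <- kmeans(toy_scaled, centers = k, nstart = 25)
  cat("k =", k, "\n")
  print(aggregate(toy, by = list(cluster = fit$cluster), FUN = mean))
}
```

Comparing the printed means across values of k is the "means shift with the number of groups" comparison in tabular form.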

Below is a histogram of track_popularity. Far more tracks have a popularity of 0 than any other value. Few tracks have a popularity over 75, so we may have to rethink the threshold for our binary variable pop. The majority of tracks fall in the range 25-75.

track_pop_hist <- spotify %>%
  ggplot( aes(x=track_popularity)) +
  geom_histogram( binwidth=3, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
  ggtitle("Distribution of Track Popularity") +
  theme_ipsum() +
  theme(
    plot.title = element_text(size=15)
  )

track_pop_hist

danceability, loudness, and acousticness are the variables most strongly correlated with track_popularity, with correlations of 0.06, 0.06, and 0.09 respectively; all three are weak. We see the same pattern with the binary variable pop, which is 1 if the song has a track popularity of 75 or higher and 0 otherwise: danceability and loudness are its most strongly correlated variables, at 0.06 and 0.07.
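Correlations of this kind can be computed with `cor()`. A sketch on simulated data (the variables and their relationship are made up, so the numbers will not match those in the report):

```r
set.seed(1)

# Simulated stand-ins for numeric columns of `spotify`
n <- 200
toy <- data.frame(
  danceability = runif(n),
  loudness     = rnorm(n, mean = -7, sd = 3)
)
# Popularity weakly related to danceability, plus noise
toy$track_popularity <- 40 + 10 * toy$danceability + rnorm(n, sd = 20)
toy$pop <- ifelse(toy$track_popularity >= 75, 1, 0)

# Pearson correlation matrix of all numeric columns
cor_mat <- cor(toy)

# Correlation of each variable with track_popularity, largest to smallest
sort(cor_mat[, "track_popularity"], decreasing = TRUE)
```

On the real data, `cor()` would be applied to the numeric columns only, since the factor and character columns cannot be correlated this way.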

4.3 We need to learn more about data visualizations in R to create high quality visualizations of our analysis. We also need to learn more about how to code the regression model in R.

4.4 We hope to run a logistic regression model, a classification technique, which will allow us to predict whether a song is popular based on other variables in our data set. We’ll also cluster the numeric variables into k groups and examine the mean of each variable within each group. This should give us insight into how danceability, tempo, genre, and loudness relate to track popularity as observations are assigned to their respective groups.
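A logistic regression of this kind could be fit with base R’s `glm()`. The sketch below uses simulated data; the predictors, coefficients, and sample size are assumptions, and the model eventually fit to the Spotify data may differ.

```r
set.seed(123)

# Simulated stand-ins for two predictors (not real Spotify values)
n <- 500
toy <- data.frame(
  danceability = runif(n),
  loudness     = rnorm(n, mean = -7, sd = 3)
)

# Simulate a binary outcome whose log-odds rise with danceability
log_odds <- -2 + 2 * toy$danceability + 0.1 * toy$loudness
toy$pop  <- rbinom(n, size = 1, prob = plogis(log_odds))

# Logistic regression: family = binomial uses the logit link by default
fit <- glm(pop ~ danceability + loudness, data = toy, family = binomial)
summary(fit)$coefficients

# Predicted probability that each track is "popular"
toy$p_hat <- predict(fit, type = "response")
head(toy$p_hat)
```

The fitted probabilities can be turned into a 0/1 prediction by choosing a cutoff, which is where the class imbalance in pop becomes important.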

Group 10: Alex Goellner, Mihir Patel, Ashley Webber

07 November 2021
