Spotify Project

Introduction

1.1 Problem Statement: The objective of this project is to explore the Spotify Genre Data and see what factors (if any) make a song popular, which genre is the most popular, which genres are the most produced, and so on. We have 23 different variables to help us answer these questions. Answering them lets us determine what a song needs in order to become popular in today’s world, which could aid the music recording industry and hopeful recording artists.

1.2 Plan: The first step will be to clean and organize the data, spotify_songs.csv. Next, we will start simple and find out which genres are the most produced, which genre is the most popular, etc. After that, we will compare a few variables to each other to look for correlation. This step will be tricky because correlation doesn’t always mean causation, so we want to avoid comparing variables that may not impact each other.

1.3 Approach: Our current approach consists of using basic statistical techniques, such as mean, variance, correlation, and hypothesis testing, to partially address our problem. We know that this approach will not fully address all our problems, but it is a good start as we continue to learn more. We will use packages such as tidyverse to help clean the data. We may use ggplot2 to help with the correlation graphs of the data, but we have never used ggplot2 before, so we might stick with base R graphics instead.

1.4 Consumer Advantage: Our analysis can help a consumer, such as a recording artist or a music producer, decide how to approach creating a song and what factors to consider in order to maximize the popularity of their song. It can also allow the consumer to decide what genre to get involved with.

Packages Required

tidyr - clean and organize the data
ggplot2 - visualize the data
dplyr - helps with common data manipulation tasks
readr - imports the CSV file and reports the name and type of each column
tibble - stores data as a tibble, which makes it easier to handle and manipulate data

*all of the above packages are included in tidyverse, so it’s easiest to load that

knitr - helps create easy-to-read tables
broom - turns fitted models, such as lm objects, into tidy tibbles

library(tidyverse)
library(knitr)
library(broom)

Data Preparation

Spotify Songs

The data comes from Spotify via the spotifyr package. Kaylin Pavlik gathered the original data to compare six genres of music to summarize what variables stand out in specific genres.

The original dataset contains 23 variables and 32,833 songs spanning 6 genres (EDM, Latin, Pop, R&B, Rap, & Rock). One peculiarity that we noticed in the original data set was that many observations within track_artist, track_name, playlist_name, and track_album_name contained unusual characters. For example, a couple of the observations had BeyoncÃ© as the track_artist. This is likely a character-encoding error from when the data was imported (the UTF-8 bytes for é displayed as Ã©), and these observations are presumably supposed to read Beyoncé. Our challenge will be to fix these unusual characters.
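One possible way to repair such values is sketched below. It assumes the garbling came from UTF-8 text being decoded as Latin-1, which we have not confirmed for every affected row. fix_mojibake() is a hypothetical helper written for this illustration, not a function from any package.

```r
# Illustrative sketch: repair UTF-8 text that was decoded as Latin-1.
# fix_mojibake() is a hypothetical helper, not part of any package.
fix_mojibake <- function(x) {
  # Convert each character back to the single byte it came from ...
  bytes <- iconv(x, from = "UTF-8", to = "latin1")
  # ... then declare that those bytes are really UTF-8.
  Encoding(bytes) <- "UTF-8"
  # Keep the original value wherever the round trip failed or gave invalid text.
  ok <- !is.na(bytes) & validUTF8(bytes)
  ifelse(ok, bytes, x)
}

garbled <- "Beyonc\u00c3\u00a9"   # the garbled form, written with escapes
fixed <- fix_mojibake(garbled)
fixed
```

Values that are already clean, such as plain ASCII names, pass through unchanged, so the helper could in principle be applied to a whole column.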

Data Import

Let’s take a look at the data:

spotify <- read_csv("spotify_songs.csv")

## Warning: Missing column names filled in: 'X24' [24]

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   track_id = col_character(),
##   track_name = col_character(),
##   track_artist = col_character(),
##   track_album_id = col_character(),
##   track_album_name = col_character(),
##   track_album_release_date = col_character(),
##   playlist_name = col_character(),
##   playlist_id = col_character(),
##   playlist_genre = col_character(),
##   playlist_subgenre = col_character(),
##   X24 = col_logical()
## )
## i Use `spec()` for the full column specifications.

class(spotify)

## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

dim(spotify)

## [1] 32833    24

glimpse(spotify)

## Rows: 32,833
## Columns: 24
## $ track_id                 <chr> "0017A6SJgTbfQVU2EtsPNo", "002xjHwzEx66OWF...
## $ track_name               <chr> "Pangarap", "The Others", "I Feel Alive", ...
## $ track_artist             <chr> "Barbie's Cradle", "RIKA", "Steady Rollin"...
## $ track_popularity         <dbl> 41, 15, 28, 24, 38, 21, 0, 41, 37, 28, 65,...
## $ track_album_id           <chr> "1srJQ0njEQgd8w4XSqI4JQ", "1ficfUnZMaY1QkN...
## $ track_album_name         <chr> "Trip", "The Others", "Love & Loss", "Liqu...
## $ track_album_release_date <chr> "1/1/2001", "1/26/2018", "11/21/2017", "8/...
## $ playlist_name            <chr> "Pinoy Classic Rock", "Groovy // Funky // ...
## $ playlist_id              <chr> "37i9dQZF1DWYDQ8wBxd7xt", "0JmBB9HfrzDiZoP...
## $ playlist_genre           <chr> "rock", "r&b", "rock", "pop", "pop", "pop"...
## $ playlist_subgenre        <chr> "classic rock", "neo soul", "hard rock", "...
## $ danceability             <dbl> 0.682, 0.582, 0.303, 0.659, 0.662, 0.763, ...
## $ energy                   <dbl> 0.401, 0.704, 0.880, 0.794, 0.838, 0.763, ...
## $ key                      <dbl> 2, 5, 9, 10, 1, 10, 6, 5, 6, 5, 9, 7, 6, 1...
## $ loudness                 <dbl> -10.068, -6.242, -4.739, -5.644, -6.300, -...
## $ mode                     <dbl> 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, ...
## $ speechiness              <dbl> 0.0236, 0.0347, 0.0442, 0.0540, 0.0499, 0....
## $ acousticness             <dbl> 0.279000, 0.065100, 0.011700, 0.000761, 0....
## $ instrumentalness         <dbl> 1.17e-02, 0.00e+00, 9.94e-03, 1.32e-01, 6....
## $ liveness                 <dbl> 0.0887, 0.2120, 0.3470, 0.3220, 0.0881, 0....
## $ valence                  <dbl> 0.566, 0.698, 0.404, 0.852, 0.496, 0.953, ...
## $ tempo                    <dbl> 97.091, 150.863, 135.225, 128.041, 129.884...
## $ duration_ms              <dbl> 235440, 197286, 373512, 228565, 236308, 21...
## $ X24                      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

Looking at the initial data, we can see that the variable names don’t need adjusting as they are already in an easy to read and organized format. The variables are also the correct type except for track_album_release_date which is currently a character. We don’t expect to use this variable though so we will leave it as is for now and change it later if we need to.
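If we do need the release date later, a minimal sketch of the conversion could look like this. It assumes every value follows the month/day/year pattern visible in the glimpse() output; any value that does not match would become NA and need separate handling.

```r
# Sample values copied from the glimpse() output above
release_chr <- c("1/1/2001", "1/26/2018", "11/21/2017")

# Parse month/day/four-digit-year strings into Date objects
release_date <- as.Date(release_chr, format = "%m/%d/%Y")

# Example use: extract the release year as a number
release_year <- as.numeric(format(release_date, "%Y"))
release_year
```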

Data Cleaning

The first part of data cleaning is removing the outliers, starting with song duration. There is one song that is 4 seconds long, and multiple songs around 30 seconds long. We are going to remove the 4 second song, but keep the 30 second songs: many of these are interludes, and they could hint at whether an entire album is popular, since people listening to a popular album would play it all the way through, interludes included. The longest song is 8 minutes and 37 seconds, which we will keep.
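Because duration_ms is stored in milliseconds, it helps to convert values to minutes and seconds when judging outliers. The sketch below uses a hypothetical helper, format_duration(), and rounded stand-in values rather than actual rows from the data.

```r
# Hypothetical helper: turn milliseconds into a "minutes:seconds" label
format_duration <- function(ms) {
  total_seconds <- as.integer(ms %/% 1000)   # drop the fractional second
  sprintf("%d:%02d", total_seconds %/% 60L, total_seconds %% 60L)
}

# Stand-in durations: roughly 4 seconds, 30 seconds, and 8 min 37 s
durations <- format_duration(c(4000, 30000, 517000))
durations
```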

The second part of data cleaning is removing any abnormalities. We are going to remove a few songs whose loudness goes above 0 dB, as this is considered abnormal.

The third part of data cleaning is assessing any missing values. We have a total of 15 missing values spread across 5 rows (ignoring X24, which is empty in every row). Since we aren’t sure what these songs are and since the popularity is 0 for all 5 rows, we decided to delete them.

spotify <- spotify[!(spotify$duration_ms < 5000), ] # removing the 4 second song
spotify <- spotify[!(spotify$loudness > 0), ] # removing songs over 0 dB (considered too loud)
colSums(is.na(spotify))

##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms                      X24 
##                        0                        0                    32826

spotify <- spotify[!is.na(spotify$track_name), ] # remove NA values
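Before deleting rows with missing values, it can be useful to look at them directly. The sketch below shows one way to do that on a small made-up tibble (toy is not the real data); X24 is excluded first because it is NA in every row and would otherwise flag everything.

```r
library(dplyr)

# Made-up stand-in for the real data: one row has missing text fields
toy <- tibble(
  track_name       = c("A", NA, "C"),
  track_artist     = c("x", NA, "z"),
  track_popularity = c(10, 0, 30),
  X24              = NA
)

# Keep only rows with at least one NA outside the empty X24 column
na_rows <- toy %>%
  select(-X24) %>%
  filter(if_any(everything(), is.na))
na_rows
```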

There are also 4477 duplicates of songs (using the track_id variable) due to the same song being on multiple playlists. After examining the data, the playlist and playlist_genre values were the only differences between these duplicates (all other values remained the same), so we determined that it was okay to remove these duplicates.
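A check like this can also be done programmatically. The following is a sketch on made-up data (toy is not the real dataset): if dropping the playlist columns and removing exact duplicates leaves exactly one row per track_id, then the duplicates differ only in their playlist information.

```r
library(dplyr)

# Made-up stand-in: song "a" appears on two playlists
toy <- tibble(
  track_id         = c("a", "a", "b"),
  track_popularity = c(50, 50, 10),
  energy           = c(0.5, 0.5, 0.9),
  playlist_name    = c("P1", "P2", "P3"),
  playlist_genre   = c("pop", "rap", "rock")
)

# Drop playlist columns, then collapse rows that are now identical
song_level <- toy %>%
  select(-starts_with("playlist")) %>%
  distinct()

# TRUE when every track_id is left with a single row
only_playlist_differs <- nrow(song_level) == n_distinct(toy$track_id)
only_playlist_differs
```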

However, before we removed the duplicates, we did some manipulation of the data so we could keep the information from those observations. We gathered the data to see which track_ids appeared on the most playlists and which playlists they appeared on. We then counted the number of duplicates for each song, and counted how often each number of duplicates occurs, as this can give us a hint at what makes a song popular.

We also created a new identification variable, so that we could have an observation for every playlist_genre the song appeared on, and a count of repetitions for that genre.

spotify_dup_count <- spotify %>% 
  count(track_id, sort=TRUE) %>% 
  rename(dup_count = n) #creating a tibble to store track_id's and number of duplicates

spotify_freq_count <- spotify_dup_count %>% 
  count(dup_count,sort=TRUE) %>% 
  rename(freq_count = n) #creating a tibble to store number of duplicates and their frequencies

# Adding counts of song duplicates and duplicate frequencies to full data set
spotify_dup <- spotify %>%
  full_join(spotify_dup_count, by = "track_id") %>%
  full_join(spotify_freq_count, by = "dup_count")

# Creating another identification variable (track_genre_id), counting duplicates of that, and adding to full data set
spotify_id_genre <- spotify_dup %>%
  mutate(track_genre_id = paste(track_id,playlist_genre))

id_genre_count <- spotify_id_genre %>%
  count(track_genre_id) %>%
  rename(id_genre_dup = n)

spotify_full <- spotify_id_genre %>%
  full_join(id_genre_count, by = "track_genre_id")

# Proving that it worked
spotify_full %>%
  select(track_id, track_name, playlist_genre, freq_count, dup_count, id_genre_dup)%>%
  filter(dup_count == 10)

## # A tibble: 10 x 6
##    track_id       track_name    playlist_genre freq_count dup_count id_genre_dup
##    <chr>          <chr>         <chr>               <int>     <int>        <int>
##  1 7BKLCZ1jbUBVq~ Closer (feat~ pop                     1        10            4
##  2 7BKLCZ1jbUBVq~ Closer (feat~ pop                     1        10            4
##  3 7BKLCZ1jbUBVq~ Closer (feat~ pop                     1        10            4
##  4 7BKLCZ1jbUBVq~ Closer (feat~ pop                     1        10            4
##  5 7BKLCZ1jbUBVq~ Closer (feat~ rap                     1        10            1
##  6 7BKLCZ1jbUBVq~ Closer (feat~ latin                   1        10            3
##  7 7BKLCZ1jbUBVq~ Closer (feat~ latin                   1        10            3
##  8 7BKLCZ1jbUBVq~ Closer (feat~ latin                   1        10            3
##  9 7BKLCZ1jbUBVq~ Closer (feat~ r&b                     1        10            1
## 10 7BKLCZ1jbUBVq~ Closer (feat~ edm                     1        10            1
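As an aside, dplyr's add_count() can attach the same kind of counts without separate count-and-join steps. This is a sketch on made-up data, not the code we ran; the column names simply mirror ours.

```r
library(dplyr)

# Made-up stand-in: song "a" is on three playlists, two of them pop
toy <- tibble(
  track_id       = c("a", "a", "a", "b"),
  playlist_genre = c("pop", "pop", "rap", "rock")
)

toy_counts <- toy %>%
  add_count(track_id, name = "dup_count") %>%                    # rows per song
  add_count(track_id, playlist_genre, name = "id_genre_dup")     # rows per song and genre
toy_counts
```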

Doing this gives us three datasets: one with every observation plus the new variables, one with a single observation of each song, and one with a single observation of a song for each playlist genre it appeared in. We wanted the spotify dataset to consist only of single observations, so that when analysis is done, the overall output is not skewed by multiple entries of the same song. We also wanted the spotify_id_genre dataset to consist only of a single observation of a song for each playlist genre, so that additional analysis is not skewed by observations of playlist_genre being removed. spotify_full is our full dataset with the addition of the new variables.

# Removing the desired duplicates from spotify and spotify_id_genre
spotify_id_genre <- spotify_full[!duplicated(spotify_full$track_genre_id),]
spotify <- spotify[!duplicated(spotify$track_id), ]
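One thing to keep in mind is that !duplicated() keeps the first row it meets for each key, so the playlist_genre that survives in spotify depends on row order. The sketch below illustrates this on made-up data and shows the dplyr equivalent, distinct() with .keep_all = TRUE.

```r
library(dplyr)

# Made-up stand-in: song "a" appears first on a pop playlist, then on a rap one
toy <- tibble(
  track_id       = c("a", "a", "b"),
  playlist_genre = c("pop", "rap", "rock")
)

# Base R approach used above: keep the first row for each track_id
by_duplicated <- toy[!duplicated(toy$track_id), ]

# dplyr approach: same idea, keeping all other columns from the first row
by_distinct <- toy %>% distinct(track_id, .keep_all = TRUE)

by_duplicated
by_distinct
```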

After cleaning the data, we are left with the following observations and columns for our three datasets:

dim(spotify)

## [1] 28345    24

dim(spotify_id_genre)

## [1] 30373    28

dim(spotify_full)

## [1] 32821    28

Data Preview

glimpse(spotify)

## Rows: 28,345
## Columns: 24
## $ track_id                 <chr> "0017A6SJgTbfQVU2EtsPNo", "002xjHwzEx66OWF...
## $ track_name               <chr> "Pangarap", "The Others", "I Feel Alive", ...
## $ track_artist             <chr> "Barbie's Cradle", "RIKA", "Steady Rollin"...
## $ track_popularity         <dbl> 41, 15, 28, 24, 38, 21, 0, 41, 37, 28, 65,...
## $ track_album_id           <chr> "1srJQ0njEQgd8w4XSqI4JQ", "1ficfUnZMaY1QkN...
## $ track_album_name         <chr> "Trip", "The Others", "Love & Loss", "Liqu...
## $ track_album_release_date <chr> "1/1/2001", "1/26/2018", "11/21/2017", "8/...
## $ playlist_name            <chr> "Pinoy Classic Rock", "Groovy // Funky // ...
## $ playlist_id              <chr> "37i9dQZF1DWYDQ8wBxd7xt", "0JmBB9HfrzDiZoP...
## $ playlist_genre           <chr> "rock", "r&b", "rock", "pop", "pop", "pop"...
## $ playlist_subgenre        <chr> "classic rock", "neo soul", "hard rock", "...
## $ danceability             <dbl> 0.682, 0.582, 0.303, 0.659, 0.662, 0.763, ...
## $ energy                   <dbl> 0.401, 0.704, 0.880, 0.794, 0.838, 0.763, ...
## $ key                      <dbl> 2, 5, 9, 10, 1, 10, 6, 5, 6, 5, 9, 7, 6, 1...
## $ loudness                 <dbl> -10.068, -6.242, -4.739, -5.644, -6.300, -...
## $ mode                     <dbl> 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, ...
## $ speechiness              <dbl> 0.0236, 0.0347, 0.0442, 0.0540, 0.0499, 0....
## $ acousticness             <dbl> 0.279000, 0.065100, 0.011700, 0.000761, 0....
## $ instrumentalness         <dbl> 1.17e-02, 0.00e+00, 9.94e-03, 1.32e-01, 6....
## $ liveness                 <dbl> 0.0887, 0.2120, 0.3470, 0.3220, 0.0881, 0....
## $ valence                  <dbl> 0.566, 0.698, 0.404, 0.852, 0.496, 0.953, ...
## $ tempo                    <dbl> 97.091, 150.863, 135.225, 128.041, 129.884...
## $ duration_ms              <dbl> 235440, 197286, 373512, 228565, 236308, 21...
## $ X24                      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

glimpse(spotify_id_genre)

## Rows: 30,373
## Columns: 28
## $ track_id                 <chr> "0017A6SJgTbfQVU2EtsPNo", "002xjHwzEx66OWF...
## $ track_name               <chr> "Pangarap", "The Others", "I Feel Alive", ...
## $ track_artist             <chr> "Barbie's Cradle", "RIKA", "Steady Rollin"...
## $ track_popularity         <dbl> 41, 15, 28, 24, 38, 21, 0, 41, 37, 28, 65,...
## $ track_album_id           <chr> "1srJQ0njEQgd8w4XSqI4JQ", "1ficfUnZMaY1QkN...
## $ track_album_name         <chr> "Trip", "The Others", "Love & Loss", "Liqu...
## $ track_album_release_date <chr> "1/1/2001", "1/26/2018", "11/21/2017", "8/...
## $ playlist_name            <chr> "Pinoy Classic Rock", "Groovy // Funky // ...
## $ playlist_id              <chr> "37i9dQZF1DWYDQ8wBxd7xt", "0JmBB9HfrzDiZoP...
## $ playlist_genre           <chr> "rock", "r&b", "rock", "pop", "pop", "pop"...
## $ playlist_subgenre        <chr> "classic rock", "neo soul", "hard rock", "...
## $ danceability             <dbl> 0.682, 0.582, 0.303, 0.659, 0.662, 0.763, ...
## $ energy                   <dbl> 0.401, 0.704, 0.880, 0.794, 0.838, 0.763, ...
## $ key                      <dbl> 2, 5, 9, 10, 1, 10, 6, 5, 6, 5, 9, 7, 6, 1...
## $ loudness                 <dbl> -10.068, -6.242, -4.739, -5.644, -6.300, -...
## $ mode                     <dbl> 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, ...
## $ speechiness              <dbl> 0.0236, 0.0347, 0.0442, 0.0540, 0.0499, 0....
## $ acousticness             <dbl> 0.279000, 0.065100, 0.011700, 0.000761, 0....
## $ instrumentalness         <dbl> 1.17e-02, 0.00e+00, 9.94e-03, 1.32e-01, 6....
## $ liveness                 <dbl> 0.0887, 0.2120, 0.3470, 0.3220, 0.0881, 0....
## $ valence                  <dbl> 0.566, 0.698, 0.404, 0.852, 0.496, 0.953, ...
## $ tempo                    <dbl> 97.091, 150.863, 135.225, 128.041, 129.884...
## $ duration_ms              <dbl> 235440, 197286, 373512, 228565, 236308, 21...
## $ X24                      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ dup_count                <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ freq_count               <int> 25180, 25180, 25180, 25180, 25180, 25180, ...
## $ track_genre_id           <chr> "0017A6SJgTbfQVU2EtsPNo rock", "002xjHwzEx...
## $ id_genre_dup             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...

Preview of a single song and the genre of playlists it appears in.

spotify_id_genre %>%
  select(track_id, track_name, playlist_genre, freq_count, dup_count, id_genre_dup)%>%
  filter(dup_count == 10)

## # A tibble: 5 x 6
##   track_id       track_name     playlist_genre freq_count dup_count id_genre_dup
##   <chr>          <chr>          <chr>               <int>     <int>        <int>
## 1 7BKLCZ1jbUBVq~ Closer (feat.~ pop                     1        10            4
## 2 7BKLCZ1jbUBVq~ Closer (feat.~ rap                     1        10            1
## 3 7BKLCZ1jbUBVq~ Closer (feat.~ latin                   1        10            3
## 4 7BKLCZ1jbUBVq~ Closer (feat.~ r&b                     1        10            1
## 5 7BKLCZ1jbUBVq~ Closer (feat.~ edm                     1        10            1

Variable Summary

Variable.type <- lapply(spotify, class)
# One description per column, in column order (24 columns including the empty X24)
Variable.desc <- c("Unique ID assigned to each song", "Song name", "Song artist", 
                   "Song popularity (0-100) where higher is better", "Unique ID assigned to each album",
                   "Album name that song is on", "Date when album was released", 
                   "Name of playlist that has song on it",
                   "Unique ID of playlist", "Genre of the playlist", "Subgenre of the playlist", 
                   "How suitable a track is for dancing (0 is least danceable and 1 is most danceable)",
                   "Energy represents the measure of intensity and activity (range from 0 to 1)",
                   "Overall key of the track (0 = C, 1 = C#, 2 = D, etc. -1 = no key detected)",
                   "Loudness of a track in decibels (dB)", "Modality of a track (major = 1, minor = 0)",
                   "Presence of spoken words in a track (range from 0 to 1)",
                   "Confidence measure from 0 to 1 of whether the track is acoustic", 
                   "Predicts whether a track contains vocals or not (1 indicates no vocals, 0 is high vocals)",
                   "Detects the presence of an audience in the recording (values above .8 are most likely live tracks)",
                   "Describes positivity of a song (1 is cheerful, 0 is sad or angry)", 
                   "Tempo of a track in beats per minute (BPM)", "Duration of song in milliseconds",
                   "Empty column created on import (all values are NA)")
Variable.name1 <- colnames(spotify)
data.desc <- as_tibble(cbind(Variable.name1, Variable.type, Variable.desc))
colnames(data.desc) <- c("Variable Name", "Data Type", "Variable Description")
kable(data.desc)

Variable Name	Data Type	Variable Description
track_id	character	Unique ID assigned to each song
track_name	character	Song name
track_artist	character	Song artist
track_popularity	numeric	Song popularity (0-100) where higher is better
track_album_id	character	Unique ID assigned to each album
track_album_name	character	Album name that song is on
track_album_release_date	character	Date when album was released
playlist_name	character	Name of playlist that has song on it
playlist_id	character	Unique ID of playlist
playlist_genre	character	Genre of the playlist
playlist_subgenre	character	Subgenre of the playlist
danceability	numeric	How suitable a track is for dancing (0 is least danceable and 1 is most danceable)
energy	numeric	Energy represents the measure of intensity and activity (range from 0 to 1)
key	numeric	Overall key of the track (0 = C, 1 = C#, 2 = D, etc. -1 = no key detected)
loudness	numeric	Loudness of a track in decibels (dB)
mode	numeric	Modality of a track (major = 1, minor = 0)
speechiness	numeric	Presence of spoken words in a track (range from 0 to 1)
acousticness	numeric	Confidence measure from 0 to 1 of whether the track is acoustic
instrumentalness	numeric	Predicts whether a track contains vocals or not (1 indicates no vocals, 0 is high vocals)
liveness	numeric	Detects the presence of an audience in the recording (values above .8 are most likely live tracks)
valence	numeric	Describes positivity of a song (1 is cheerful, 0 is sad or angry)
tempo	numeric	Tempo of a track in beats per minute (BPM)
duration_ms	numeric	Duration of song in milliseconds
X24	logical	Empty column created on import (all values are NA)

Data Analysis

Exploratory Data Analysis

We broke up our analysis into two parts. The first part investigates the spotify dataset to see which variables have a greater impact on a song’s popularity. The second part is a continuation of the first, where we investigate the spotify_id_genre data to see if those impactful variables have an effect on the number of playlists a song is on.

`spotify` dataset

This dataset was broken down into 6 different genres: edm, latin, pop, r&b, rap, and rock. We decided to divide the data into these different genres and see what variables may make a song in a particular genre more popular.

First, we decided to see if there was any correlation between variables and track popularity based on genre:

genre_cor <- spotify %>% 
  split(.$playlist_genre) %>% 
  map(~{
    cor(.x[12:23], .x$track_popularity)
  })
genre_cor

## $edm
##                           [,1]
## danceability      0.0075376101
## energy           -0.0668702825
## key               0.0209147980
## loudness          0.0271986813
## mode              0.0007403464
## speechiness       0.0313314807
## acousticness      0.1564316725
## instrumentalness -0.1560880174
## liveness          0.0165835867
## valence           0.0896622935
## tempo            -0.0211804472
## duration_ms      -0.2342469540
## 
## $latin
##                          [,1]
## danceability      0.026975650
## energy           -0.094950355
## key              -0.013667132
## loudness          0.131388909
## mode              0.036887548
## speechiness       0.027512401
## acousticness      0.123371070
## instrumentalness -0.116688999
## liveness         -0.045433453
## valence           0.006128485
## tempo             0.034920050
## duration_ms      -0.074868461
## 
## $pop
##                           [,1]
## danceability      0.0905809945
## energy           -0.0550889626
## key               0.0030874651
## loudness          0.1147850423
## mode              0.0026690334
## speechiness       0.0905472909
## acousticness      0.0289534168
## instrumentalness -0.1409987507
## liveness         -0.0003136197
## valence           0.0207726249
## tempo            -0.0353436168
## duration_ms      -0.1460902174
## 
## $`r&b`
##                         [,1]
## danceability     -0.04995012
## energy           -0.11546595
## key              -0.03462840
## loudness          0.07985844
## mode              0.04151223
## speechiness      -0.01992277
## acousticness      0.09701521
## instrumentalness -0.06204294
## liveness         -0.06084676
## valence          -0.12925181
## tempo             0.02357607
## duration_ms      -0.14218905
## 
## $rap
##                          [,1]
## danceability      0.135056064
## energy           -0.120348171
## key              -0.009297088
## loudness         -0.037419085
## mode             -0.031245856
## speechiness      -0.078878945
## acousticness      0.075451701
## instrumentalness  0.029341513
## liveness         -0.060381634
## valence          -0.027577613
## tempo             0.049371542
## duration_ms      -0.137329192
## 
## $rock
##                          [,1]
## danceability      0.063008912
## energy           -0.049757882
## key              -0.015977131
## loudness          0.018858356
## mode             -0.001623460
## speechiness      -0.001189018
## acousticness      0.004641102
## instrumentalness -0.098389698
## liveness         -0.102005851
## valence          -0.009733179
## tempo            -0.002827498
## duration_ms      -0.031150549

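As you can see from the results, there is at most a weak correlation between the measurable variables and track popularity: the largest in magnitude is -0.23 (duration_ms in edm), and most are below 0.1 in absolute value.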
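To scan a list of correlation matrices like this more easily, the results can be stacked into one tibble and the strongest correlation per genre pulled out. The sketch below uses a small stand-in list (toy_cor) with a few rounded values from the output above, rather than the full genre_cor object.

```r
library(dplyr)
library(purrr)
library(tibble)

# Stand-in for genre_cor: a named list of one-column matrices
vars <- c("danceability", "duration_ms")
toy_cor <- list(
  edm  = matrix(c(0.01, -0.23), ncol = 1, dimnames = list(vars, NULL)),
  rock = matrix(c(0.06, -0.03), ncol = 1, dimnames = list(vars, NULL))
)

# One row per genre and variable
cor_tbl <- toy_cor %>%
  imap(~ tibble(genre = .y, variable = rownames(.x), r = unname(.x[, 1]))) %>%
  bind_rows()

# Strongest correlation (by absolute value) within each genre
strongest <- cor_tbl %>%
  group_by(genre) %>%
  slice_max(abs(r), n = 1) %>%
  ungroup()
strongest
```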

We did not give up here though. We decided to run a regression analysis to determine how each variable impacts the track popularity:

genre_lm <- spotify %>% 
  split(.$playlist_genre) %>% 
  map(~{
    lm(track_popularity ~ ., data = .x[c(4, 12:23)])
  })
genre_lm

## $edm
## 
## Call:
## lm(formula = track_popularity ~ ., data = .x[c(4, 12:23)])
## 
## Coefficients:
##      (Intercept)      danceability            energy               key  
##        4.311e+01         5.465e+00        -6.782e+00         1.818e-01  
##         loudness              mode       speechiness      acousticness  
##        8.157e-02         5.928e-01        -2.640e+00         1.468e+01  
## instrumentalness          liveness           valence             tempo  
##       -4.652e+00         7.648e-01         2.747e+00         3.207e-03  
##      duration_ms  
##       -5.656e-05  
## 
## 
## $latin
## 
## Call:
## lm(formula = track_popularity ~ ., data = .x[c(4, 12:23)])
## 
## Coefficients:
##      (Intercept)      danceability            energy               key  
##        7.216e+01         3.987e+00        -3.039e+01        -5.505e-02  
##         loudness              mode       speechiness      acousticness  
##        1.930e+00         1.096e+00        -3.757e-01         9.304e+00  
## instrumentalness          liveness           valence             tempo  
##       -7.475e+00        -2.971e+00         3.141e-01         3.298e-02  
##      duration_ms  
##       -2.242e-05  
## 
## 
## $pop
## 
## Call:
## lm(formula = track_popularity ~ ., data = .x[c(4, 12:23)])
## 
## Coefficients:
##      (Intercept)      danceability            energy               key  
##        7.942e+01         1.242e+01        -2.828e+01         8.428e-02  
##         loudness              mode       speechiness      acousticness  
##        2.013e+00         2.896e-01         2.314e+01         1.527e-01  
## instrumentalness          liveness           valence             tempo  
##       -8.585e+00         1.082e+00        -3.988e-01        -1.279e-02  
##      duration_ms  
##       -4.144e-05  
## 
## 
## $`r&b`
## 
## Call:
## lm(formula = track_popularity ~ ., data = .x[c(4, 12:23)])
## 
## Coefficients:
##      (Intercept)      danceability            energy               key  
##        7.448e+01         1.796e-01        -2.184e+01        -1.042e-01  
##         loudness              mode       speechiness      acousticness  
##        1.486e+00         1.270e+00        -5.692e+00         2.909e+00  
## instrumentalness          liveness           valence             tempo  
##       -9.176e+00        -7.086e+00        -7.046e+00         1.905e-02  
##      duration_ms  
##       -4.555e-05  
## 
## 
## $rap
## 
## Call:
## lm(formula = track_popularity ~ ., data = .x[c(4, 12:23)])
## 
## Coefficients:
##      (Intercept)      danceability            energy               key  
##        4.661e+01         2.197e+01        -1.503e+01        -4.076e-02  
##         loudness              mode       speechiness      acousticness  
##        5.248e-01        -1.258e+00        -1.145e+01         4.171e+00  
## instrumentalness          liveness           valence             tempo  
##       -2.799e+00        -6.590e-01        -2.282e+00         4.971e-02  
##      duration_ms  
##       -4.297e-05  
## 
## 
## $rock
## 
## Call:
## lm(formula = track_popularity ~ ., data = .x[c(4, 12:23)])
## 
## Coefficients:
##      (Intercept)      danceability            energy               key  
##        5.516e+01         1.171e+01        -1.630e+01        -1.009e-01  
##         loudness              mode       speechiness      acousticness  
##        6.925e-01        -2.162e-01         9.512e+00        -4.155e+00  
## instrumentalness          liveness           valence             tempo  
##       -1.119e+01        -1.102e+01        -3.273e+00         2.017e-02  
##      duration_ms  
##       -5.573e-06

Judging by the size of the coefficients, four variables appear to impact track popularity the most:

Danceability
Energy
Speechiness
Instrumentalness
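One caveat when ranking variables by coefficient size is that raw coefficients depend on each variable's units: duration_ms has a tiny coefficient partly because it is measured in milliseconds. A possible cross-check, which we did not run on the real data, is to standardise the predictors first. The sketch below uses simulated data (toy), so the numbers mean nothing beyond illustrating the idea.

```r
set.seed(1)

# Simulated stand-in data: one 0-1 predictor and one measured in milliseconds
n <- 200
toy <- data.frame(
  danceability = runif(n),
  duration_ms  = runif(n, 120000, 300000)
)
toy$track_popularity <- 40 + 10 * toy$danceability -
  5e-05 * toy$duration_ms + rnorm(n, sd = 5)

# Raw coefficients: change in popularity per one unit of each predictor
raw_fit <- lm(track_popularity ~ danceability + duration_ms, data = toy)

# Standardised coefficients: change in popularity per one standard deviation
predictors <- c("danceability", "duration_ms")
toy_std <- toy
toy_std[predictors] <- lapply(toy[predictors], function(x) as.numeric(scale(x)))
std_fit <- lm(track_popularity ~ danceability + duration_ms, data = toy_std)

coef(raw_fit)
coef(std_fit)
```

Each standardised coefficient equals the raw coefficient multiplied by that predictor's standard deviation, so the two models fit the data equally well; only the scale used for comparison changes.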

Let’s take a closer look at these variables, and see how much they change the track popularity based on genre.

Danceability

dance_graph <- genre_lm %>% 
  map(tidy) %>% 
  imap(~.x %>% 
         mutate(genre = .y)) %>% 
  bind_rows() %>% 
  filter(term == 'danceability') %>% 
  ggplot(aes(x = reorder(genre, estimate), estimate)) +
  geom_col(fill = 'royalblue2') +
  geom_text(aes(label = round(estimate, digits = 1), 
                vjust = -0.3)) +
  labs(title = 'Impact of Danceability on Track Popularity',
       x = 'Genre', y = 'Estimate') +
  theme(plot.title = element_text(size = rel(1.4), face = "bold", 
                                  color = "royalblue2"),
        panel.background = element_rect(fill = "white"))
dance_graph

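According to this graph, if we increase the danceability of a rap song by 1, the track’s popularity is predicted to grow by about 22 points (the rap coefficient in the regression output above is 21.97). However, danceability ranges from 0 to 1, so a full 1-point increase is only possible when moving from the least to the most danceable track; a more realistic increase of 0.1 corresponds to roughly 2.2 points. Regardless, it does seem that if you want to slightly increase the popularity of your rap song, you might want to increase the danceability a bit.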
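The arithmetic behind reading a coefficient on a 0 to 1 scale can be written out as a short sketch. The coefficient is the rap danceability estimate printed in the regression output (2.197e+01); the 0.1 step is just an illustrative choice.

```r
# Rap danceability coefficient, copied from the regression output (2.197e+01)
rap_danceability_coef <- 21.97

# An illustrative, realistic change on the 0-1 danceability scale
change_in_danceability <- 0.1

# Predicted change in popularity, holding the other variables fixed
predicted_change <- rap_danceability_coef * change_in_danceability
predicted_change
```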

Energy

energy_graph <- genre_lm %>% 
  map(tidy) %>% 
  imap(~.x %>% 
         mutate(genre = .y)) %>% 
  bind_rows() %>% 
  filter(term == 'energy') %>% 
  ggplot(aes(x = reorder(genre, estimate), estimate)) +
  geom_col(fill = 'springgreen3') +
  geom_text(aes(label = round(estimate, digits = 1), vjust = 1.2)) +
  labs(title = 'Impact of Energy on Track Popularity',
       x = 'Genre', y = 'Estimate') +
  theme(plot.title = element_text(size = rel(1.4), face = "bold", 
                                  color = "springgreen3"),
        panel.background = element_rect(fill = "white"))
energy_graph

Energy can impact a track’s popularity by a lot, especially if the track is in the Latin genre. We were pretty surprised to see these results, as raising the energy in Latin and pop results in the biggest popularity losses. We typically think of Latin and pop music as energetic, but the genres do have a large range of tracks, and the standard error for this regression was quite high.

Speechiness

speechiness_graph <- genre_lm %>% 
  map(tidy) %>% 
  imap(~.x %>% 
         mutate(genre = .y)) %>% 
  bind_rows() %>% 
  filter(term == 'speechiness') %>% 
  ggplot(aes(x = reorder(genre, estimate), estimate)) +
  geom_col(fill = 'orangered3') +
  geom_text(aes(label = round(estimate, digits = 1), vjust = 1.2)) +
  labs(title = 'Impact of Speechiness on Track Popularity',
       x = 'Genre', y = 'Estimate') +
  theme(plot.title = element_text(size = rel(1.4), face = "bold", 
                                  color = "orangered3"),
        panel.background = element_rect(fill = "white"))
speechiness_graph

Looking at this graph, speechiness (the presence of spoken words) can impact a track’s popularity positively or negatively depending on the genre. Speechiness can help a pop song but hurt a rap song, which is the opposite of what we expected.

Instrumentalness

instrumentalness_graph <- genre_lm %>% 
  map(tidy) %>% 
  imap(~.x %>% 
         mutate(genre = .y)) %>% 
  bind_rows() %>% 
  filter(term == 'instrumentalness') %>% 
  ggplot(aes(x = reorder(genre, estimate), estimate)) +
  geom_col(fill = 'purple3') +
  geom_text(aes(label = round(estimate, digits = 1), vjust = 1.2)) +
  labs(title = 'Impact of Instrumentalness on Track Popularity',
       x = 'Genre', y = 'Estimate') +
  theme(plot.title = element_text(size = rel(1.4), face = "bold", 
                                  color = "purple3"),
        panel.background = element_rect(fill = "white"))
instrumentalness_graph

Increasing instrumentalness negatively impacts a track’s popularity, regardless of the genre. We were most surprised to see the rock genre leading on this graph, since we typically associate rock with a lot of guitar solos.

`spotify_id_genre` dataset

Just like the spotify dataset, the spotify_id_genre and spotify_full datasets can still be broken down into the 6 playlist genres, but they can also be broken down by the number of times a song is repeated. This lets us see how repeated songs are distributed across genres. We used spotify_full to generate the table below so that we had the most accurate counts for the dataset.

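# Count songs by duplicate count and playlist genre.
# Named dup_genre_counts so it does not reuse the name of base R's subset(), which is called later.
dup_genre_counts <- spotify_full %>%
    select(dup_count, playlist_genre) %>%
    group_by(dup_count) %>%
    count(playlist_genre) %>%
    rename(genre_count = n)

# Reshape to one row per duplicate count and one column per genre.
count_table <- dup_genre_counts %>%
  spread(playlist_genre, genre_count)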

kable(count_table)

dup_count	edm	latin	pop	r&b	rap	rock
1	4511	3729	3988	4270	4822	3860
2	1013	770	796	729	624	834
3	296	308	352	192	171	211
4	102	140	157	79	59	31
5	45	79	86	61	18	11
6	36	51	61	39	22	1
7	17	38	29	27	8	NA
8	18	30	29	30	12	1
9	2	5	5	3	3	NA
10	1	3	4	1	1	NA

Because the counts differ so much between duplicate counts, we decided to continue the rest of our analysis by focusing on the duplicate counts of 3, 4, and 5. This is for a few reasons:

Each of these songs was repeated more than 1 time.
There are enough observations in each group to see whether there is a pattern: not so many that nothing stands out (as with 1 and 2), and not so few that a strong pattern cannot be detected (as with 6, 7, 8, 9, and 10).
There are enough duplicate-count levels to show consistency across duplicate counts.
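The number of observations behind this choice can be totaled per duplicate count from the table above. A small sketch that re-enters the table's counts by hand; treating NA cells as zero is an assumption, and the column name rnb stands in for r&b:

```r
# Counts copied from the table above; NA means no songs in that cell.
dup_counts_by_genre <- data.frame(
  dup_count = 1:10,
  edm   = c(4511, 1013, 296, 102, 45, 36, 17, 18, 2, 1),
  latin = c(3729, 770, 308, 140, 79, 51, 38, 30, 5, 3),
  pop   = c(3988, 796, 352, 157, 86, 61, 29, 29, 5, 4),
  rnb   = c(4270, 729, 192, 79, 61, 39, 27, 30, 3, 1),  # "r&b" in the table
  rap   = c(4822, 624, 171, 59, 18, 22, 8, 12, 3, 1),
  rock  = c(3860, 834, 211, 31, 11, 1, NA, 1, NA, NA)
)

# Total observations per duplicate count, treating NA as zero (assumption).
dup_counts_by_genre$total <- rowSums(dup_counts_by_genre[, -1], na.rm = TRUE)
dup_counts_by_genre[, c("dup_count", "total")]
```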

Our analysis of the spotify dataset identified 4 variables that have the most impact on the popularity of a track.

These are:

Danceability
Energy
Speechiness
Instrumentalness

Our analysis of the spotify_id_genre dataset will take a closer look at these variables:

Danceability

# Track_popularity vs. Danceability

ggplot(subset(spotify_id_genre, dup_count %in% c(3:5)), aes(x=danceability,y=track_popularity, color=playlist_genre))+
  geom_point()+
  geom_smooth()+
  facet_grid(playlist_genre~dup_count)

Across the board, high danceability together with high track_popularity indicates a popular song.

Energy

#Energy

ggplot(subset(spotify_id_genre, dup_count %in% c(3:5)), aes(x=energy,y=track_popularity, color=playlist_genre))+
  geom_point()+
  geom_smooth()+
  facet_grid(playlist_genre~dup_count)

The relationship between energy and popularity depends more on the genre, but the biggest indicator is track_popularity. No matter what the energy is, a well-liked track will most likely be found on multiple playlists.

Speechiness

#Speechiness

ggplot(subset(spotify_id_genre, dup_count %in% c(3:5)), aes(x=speechiness,y=track_popularity, color=playlist_genre))+
  geom_point()+
  geom_smooth()+
  facet_grid(playlist_genre~dup_count)

Speechiness is a fairly consistent indicator across genres, although its values vary from genre to genre.

Instrumentalness

#Instrumentalness

ggplot(subset(spotify_id_genre, dup_count %in% c(3:5)), aes(x=instrumentalness,y=track_popularity, color=playlist_genre))+
  geom_point()+
  geom_smooth()+
  coord_cartesian(ylim = c(-300,300))+
  facet_grid(playlist_genre~dup_count)

Instrumentalness varies widely, but its range narrows consistently as a song appears in more playlists:

3 playlists: instrumentalness ranges from 0 to around 0.90
4 playlists: instrumentalness generally ranges from 0 to about 0.50
5 playlists: instrumentalness ranges from 0 to 0.125
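These ranges were read off the plots. A sketch of how they could be computed directly, assuming spotify_id_genre has the dup_count and instrumentalness columns used in the plot; the toy data frame below is made up so that the block runs on its own:

```r
# Made-up stand-in rows; swap `toy` for spotify_id_genre to get real values.
toy <- data.frame(
  dup_count        = c(3, 3, 3, 4, 4, 5, 5, 6),
  instrumentalness = c(0.00, 0.40, 0.90, 0.00, 0.50, 0.00, 0.10, 0.30)
)

# Keep duplicate counts 3 to 5, then find the min and max within each.
kept <- toy[toy$dup_count %in% 3:5, ]
ranges <- aggregate(
  instrumentalness ~ dup_count,
  data = kept,
  FUN = function(x) c(min = min(x), max = max(x))
)
ranges
```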

Summary

Analyzing this data helped us understand what might impact a song’s popularity. Here are a few key points that we discovered during this analysis:

The biggest indicators of popularity are instrumentalness, speechiness, danceability, and energy.

Of those 4, when compared to popularity, speechiness, danceability, and energy are the biggest indicators of whether or not a song will be repeated across multiple playlists.

Appearing on more playlists across more genres means a wider range of people listening, which can mean more sales.

Data Wrangling Final

Massey Pierce & Megan Heuker

11/27/2020

Music Exploratory Analysis

