Spotify Final Project - What makes a popular song?

INTRODUCTION:

What makes a good song? Why do we enjoy certain songs more than others? How can we make the next most popular song?

We are going to explore the data in the spotify_songs.csv file to find out which factors can affect the popularity of a song. This data can help us predict features of future popular songs that the majority of Spotify users might enjoy listening to.

To look for a relationship between these song factors and popularity, we can use the given data to build a collection of graphs that compare each factor with the track’s popularity. We can observe aspects of a song like its playlist genre, rhythm, release date, acousticness, loudness, danceability, energy, speechiness, and duration (in ms), with the aim of seeing which factors, and which values of these factors, matter most to the success and overall popularity of a song.

By creating bar charts of the tracks’ release month and playlist genre, we will be able to see how many popular songs fall into each genre and in which months most of these popular songs were released. In addition, by plotting the tracks’ acousticness, loudness, danceability, energy, speechiness, and duration against popularity, we can see the range in which these popular songs most commonly fall for each of these 6 factors.

PACKAGES REQUIRED

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(grid)
library(gridExtra)

## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

Here in this chunk we have loaded tidyverse because it includes the packages tidyr and ggplot2, which we will be using to clean and tidy up the data (tidyr) and to create graphics that display the data (ggplot2).

Tidyr and tibble will allow us to reshape the data into a form that is easy to plot. Ggplot2 will create graphics from the cleaned data to help us see the relationships between the variables and, therefore, the factors that make a popular song.

The package “lubridate” makes it easier to work with dates and times, as I want to extract the months from the track release dates to use in my tibble and bar graph.

The packages grid and gridExtra will allow us to combine graphs and add graphical objects on them as well (such as textGrob). This helps make the graphs appear more concise and clear to the audience because we are labeling and plotting graphs together, allowing us to look at them simultaneously in the same workspace without having to run them separately.

DATA PREPARATION

The data used was obtained from this website: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md (created by Tom Mock on GitHub on Jan 20, 2020).

Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff collected data of 5000 songs from EDM, Latin, Pop, R&B, Rap, & Rock song categories. There are NAs present in the date_released column, which I will omit so that the graphs do not include the tracks with a date_released of NA.

In this first chunk for data preparation, I brought in the csv file from my Documents folder as a data frame and named it spotify. Then, I looked at the first few rows and last few rows of spotify to see what the data frame contains. Next, to decide which variables I wanted to analyze, I looked at the column names of the data frame and chose the ones most relevant to the issue that I wanted to tackle.

I also wanted to see how many rows and columns are included in this data frame, and I found that there are 32833 rows and 24 columns. The names of the 24 columns are listed with the function colnames(). I will not be using all of the columns, but I will be working specifically with track_name, playlist_genre, track_popularity, track_album_release_date, acousticness, loudness, danceability, energy, speechiness, and duration_ms from this original data frame. Since this is a very large data set, I separate this data into smaller tibbles to make it easier to analyze and graph.

spotify <- read.csv('/Users/vanilla/Documents/fall22/329/spotify_songs.csv')
head(spotify)

##   X               track_id
## 1 1 6f807x0ima9a1j3VPbc7VN
## 2 2 0r7CVbZTWZgbTCYdfa2P31
## 3 3 1z1Hg7Vb0AhHDiEmnDE79l
## 4 4 75FpbthrwQmzHlBJLuGdC7
## 5 5 1e8PAfcKUYoKkxPhrHqw4x
## 6 6 7fvUMiyapMsRRxr07cU8Ef
##                                              track_name     track_artist
## 1 I Don't Care (with Justin Bieber) - Loud Luxury Remix       Ed Sheeran
## 2                       Memories - Dillon Francis Remix         Maroon 5
## 3                       All the Time - Don Diablo Remix     Zara Larsson
## 4                     Call You Mine - Keanu Silva Remix The Chainsmokers
## 5               Someone You Loved - Future Humans Remix    Lewis Capaldi
## 6     Beautiful People (feat. Khalid) - Jack Wins Remix       Ed Sheeran
##   track_popularity         track_album_id
## 1               66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2               67 63rPSO264uRjW1X5E6cWv6
## 3               70 1HoSmj2eLcsrR0vE9gThr4
## 4               60 1nqYsOef1yKKuGOVchbsk6
## 5               69 7m7vv9wlQ4i0LFuJiE2zsQ
## 6               67 2yiy9cd2QktrNvWC2EUi0k
##                                        track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2                       Memories (Dillon Francis Remix)
## 3                       All the Time (Don Diablo Remix)
## 4                           Call You Mine - The Remixes
## 5               Someone You Loved (Future Humans Remix)
## 6     Beautiful People (feat. Khalid) [Jack Wins Remix]
##   track_album_release_date playlist_name            playlist_id playlist_genre
## 1               2019-06-14     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 2               2019-12-13     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 3               2019-07-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 4               2019-07-19     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 5               2019-03-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 6               2019-07-11     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
##   playlist_subgenre danceability energy key loudness mode speechiness
## 1         dance pop        0.748  0.916   6   -2.634    1      0.0583
## 2         dance pop        0.726  0.815  11   -4.969    1      0.0373
## 3         dance pop        0.675  0.931   1   -3.432    0      0.0742
## 4         dance pop        0.718  0.930   7   -3.778    1      0.1020
## 5         dance pop        0.650  0.833   1   -4.672    1      0.0359
## 6         dance pop        0.675  0.919   8   -5.385    1      0.1270
##   acousticness instrumentalness liveness valence   tempo duration_ms
## 1       0.1020         0.00e+00   0.0653   0.518 122.036      194754
## 2       0.0724         4.21e-03   0.3570   0.693  99.972      162600
## 3       0.0794         2.33e-05   0.1100   0.613 124.008      176616
## 4       0.0287         9.43e-06   0.2040   0.277 121.956      169093
## 5       0.0803         0.00e+00   0.0833   0.725 123.976      189052
## 6       0.0799         0.00e+00   0.1430   0.585 124.982      163049

tail(spotify)

##           X               track_id                           track_name
## 32828 32828 0aBDrRTgDCwWbcOnEIp7DJ               Many Ways - Radio Edit
## 32829 32829 7bxnKAamR3snQ1VGLuVfC1 City Of Lights - Official Radio Edit
## 32830 32830 5Aevni09Em4575077nkWHz  Closer - Sultan & Ned Shepard Remix
## 32831 32831 7ImMqPP3Q1yfUHvsdn7wEo         Sweet Surrender - Radio Edit
## 32832 32832 2m69mhnfQ1Oq6lGtXuYhgX       Only For You - Maor Levi Remix
## 32833 32833 29zWqhca3zt5NsckZqDf6c               Typhoon - Original Mix
##                              track_artist track_popularity
## 32828 Ferry Corsten feat. Jenny Wahlstrom               27
## 32829                        Lush & Simon               42
## 32830                      Tegan and Sara               20
## 32831                         Starkillers               14
## 32832                              Mat Zo               15
## 32833                        Julian Calor               27
##               track_album_id             track_album_name
## 32828 59XOfNjuYZB6feC6QUzS3e                    Many Ways
## 32829 2azRoBBWEEEYhqV6sb7JrT   City Of Lights (Vocal Mix)
## 32830 6kD6KLxj7s8eCE3ABvAyf5               Closer Remixed
## 32831 0ltWNSY9JgxoIZO4VzuCa6 Sweet Surrender (Radio Edit)
## 32832 1fGrOkHnHJcStl14zNx8Jy       Only For You (Remixes)
## 32833 0X3mUOm6MhxR7PzxG95rAo                Typhoon/Storm
##       track_album_release_date   playlist_name            playlist_id
## 32828                     2013 ♥ EDM LOVE 2020 6jI1gFr6ANFtT8MmTvA2Ux
## 32829               2014-04-28 ♥ EDM LOVE 2020 6jI1gFr6ANFtT8MmTvA2Ux
## 32830               2013-03-08 ♥ EDM LOVE 2020 6jI1gFr6ANFtT8MmTvA2Ux
## 32831               2014-04-21 ♥ EDM LOVE 2020 6jI1gFr6ANFtT8MmTvA2Ux
## 32832               2014-01-01 ♥ EDM LOVE 2020 6jI1gFr6ANFtT8MmTvA2Ux
## 32833               2014-03-03 ♥ EDM LOVE 2020 6jI1gFr6ANFtT8MmTvA2Ux
##       playlist_genre         playlist_subgenre danceability energy key loudness
## 32828            edm progressive electro house        0.581  0.640   5   -8.367
## 32829            edm progressive electro house        0.428  0.922   2   -1.814
## 32830            edm progressive electro house        0.522  0.786   0   -4.462
## 32831            edm progressive electro house        0.529  0.821   6   -4.899
## 32832            edm progressive electro house        0.626  0.888   2   -3.361
## 32833            edm progressive electro house        0.603  0.884   5   -4.571
##       mode speechiness acousticness instrumentalness liveness valence   tempo
## 32828    1      0.0365     0.026600         0.00e+00   0.5720  0.2880 128.001
## 32829    1      0.0936     0.076600         0.00e+00   0.0668  0.2100 128.170
## 32830    1      0.0420     0.001710         4.27e-03   0.3750  0.4000 128.041
## 32831    0      0.0481     0.108000         1.11e-06   0.1500  0.4360 127.989
## 32832    1      0.1090     0.007920         1.27e-01   0.3430  0.3080 128.008
## 32833    0      0.0385     0.000133         3.41e-01   0.7420  0.0894 127.984
##       duration_ms
## 32828      196993
## 32829      204375
## 32830      353120
## 32831      210112
## 32832      367432
## 32833      337500

class(spotify)

## [1] "data.frame"

colnames(spotify)

##  [1] "X"                        "track_id"                
##  [3] "track_name"               "track_artist"            
##  [5] "track_popularity"         "track_album_id"          
##  [7] "track_album_name"         "track_album_release_date"
##  [9] "playlist_name"            "playlist_id"             
## [11] "playlist_genre"           "playlist_subgenre"       
## [13] "danceability"             "energy"                  
## [15] "key"                      "loudness"                
## [17] "mode"                     "speechiness"             
## [19] "acousticness"             "instrumentalness"        
## [21] "liveness"                 "valence"                 
## [23] "tempo"                    "duration_ms"

dim(spotify)

## [1] 32833    24

Here in this second chunk for data preparation, I gathered all the variables from the spotify data frame that I wanted. I used them to create three tibbles: gen, dates, and numerics.

(The explanation for each of the variables I used is written in the code, so check there if you want a summary of why I used each variable!)

First, I converted the track_album_release_date column to Date values so that the program recognizes each entry as year-month-day. Entries that contain only a year, such as 2013, do not match this format and become NA. Then, using format() with the '%b' code, I added a new column called month_name that holds the abbreviated month name of each release date. This is so that when I graph it, the abbreviated month names are shown rather than the month numbers.
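As a small illustration of this step, the sketch below uses three made-up release strings (not rows taken from the file) to show how as.Date() treats an entry that contains only a year, and one way to get month abbreviations that does not depend on the computer's language settings. The names release, parsed, month_num, and month_abbr are only for illustration.

```r
# Made-up release strings: two full dates and one year-only entry
release <- c("2019-06-14", "2013", "2014-04-28")

# as.Date() returns NA when a string does not match the full format
parsed <- as.Date(release, format = "%Y-%m-%d")
print(parsed)

# Month number -> built-in English abbreviations in month.abb
month_num  <- as.integer(format(parsed, "%m"))
month_abbr <- month.abb[month_num]
print(month_abbr)
```

Indexing month.abb by the month number always gives the English three-letter names, whereas format() with '%b' follows the locale of the machine running the code.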

Then, I copied the columns I needed from the spotify data frame into separate vectors with shorter names, so that they are easier to refer to and I don’t have to type ‘spotify$’ every single time I call upon a column. Using these vectors, I created the tibbles, each of which holds a slice of the spotify data, so that I can work with the specific variables that will help tell me what makes a popular song.

In gen, I used the track name, track popularity, and playlist genre for the purpose of a bar chart later on. You will notice that I filtered the popularity values to only include those that are greater than 80 in each of these tibbles. Popularity runs from 0 to 100 and a higher number is better, so this keeps only the most popular tracks. I arranged gen in ascending order of popularity, so the first song we see in the tibble has a popularity of 81.

For the dates tibble, I used the track name, date released, month name, and popularity, and I arranged the rows from oldest to newest so that the tracks are listed chronologically by release date and then by popularity. I also used the factor() function so that when I graph the bar chart for this tibble, the month names are shown in calendar order from left to right, rather than in alphabetical order. With this tibble, I will make a bar chart of the months in which the songs were released.
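One detail to watch with factor(): the levels have to match the strings in the column exactly, and any value without a matching level becomes NA. In an English locale, format() with '%b' produces three-letter names such as 'Jun' and 'Jul'. The sketch below shows the difference on a made-up vector; months_seen, bad, and good are illustrative names.

```r
# Made-up month abbreviations as '%b' would produce them in an English locale
months_seen <- c("Oct", "Jun", "Jan", "Jul", "Oct")

# 'June' and 'July' do not match 'Jun' and 'Jul', so those entries become NA
bad <- factor(months_seen, levels = c("Jan", "June", "July", "Oct"))
print(sum(is.na(bad)))

# month.abb holds the twelve English abbreviations in calendar order
good <- factor(months_seen, levels = month.abb)
print(table(good))
```

Because a later na.omit() would silently drop the NA rows, mismatched levels can remove whole months from a bar chart without any warning.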

In numerics, I included 6 numeric columns from spotify, which I will use to make jitter plots that show the ranges these variables most often take in a popular song. The tibble holds genre, popularity, and the 6 variables: acousticness, loudness, danceability, energy, speechiness, and duration. As with the other tibbles, it only includes tracks with popularity greater than 80, arranged in ascending order of popularity.

spotify$date_released <- as.Date(spotify$track_album_release_date, '%Y-%m-%d')
spotify$month_name <- format(spotify$date_released, '%b')

month_name <- spotify$month_name
date_released <- spotify$date_released
popularity <- spotify$track_popularity
genre <- spotify$playlist_genre
track <- spotify$track_name
acousticness <- spotify$acousticness
loudness <- spotify$loudness
danceability <- spotify$danceability
energy <- spotify$energy
speechiness <- spotify$speechiness
duration <- spotify$duration_ms

#date released and month name will show us which month these popular songs are released during the year.
#popularity is what we want to compare all of this chosen data to, as we want to see what makes a song more popular. It goes from (0-100) where the higher value is better.
#playlist genre will be used in a bar plot to show us how these songs are distributed amongst the genres.
#acousticness, loudness, danceability, energy, speechiness, and duration are the 6 variables that I want to compare to popularity. Since the classes for these 6 columns are doubles and not characters, I will use it to create jitter plots rather than a bar plot.

gen <- tibble(track, popularity, genre)
gen <- gen %>%
  filter(popularity > 80) %>%
  arrange(popularity)
gen

## # A tibble: 1,340 × 3
##    track                                     popularity genre
##    <chr>                                          <int> <chr>
##  1 Sixteen                                           81 pop  
##  2 What I Like About You (feat. Theresa Rex)         81 pop  
##  3 Electricity (with Dua Lipa)                       81 pop  
##  4 Scared to Be Lonely                               81 pop  
##  5 Let Me Love You                                   81 pop  
##  6 Sorry                                             81 pop  
##  7 Instagram                                         81 pop  
##  8 Happy Now                                         81 pop  
##  9 Without You (feat. Sandro Cavazza)                81 pop  
## 10 HIP                                               81 pop  
## # ℹ 1,330 more rows

dates <- tibble(track, date_released, month_name, popularity)
dates <- dates %>%
  drop_na(date_released) %>%
  separate(col=date_released, into=c('year', 'month', 'date'), sep='-') %>%
  arrange(year, month, date, popularity) %>%
  filter(popularity > 80)
# Levels must match the '%b' output exactly ('Jun' and 'Jul', not 'June' and 'July')
dates$month_name <- factor(dates$month_name, levels = month.abb)
dates

## # A tibble: 1,340 × 6
##    track                                 year  month date  month_name popularity
##    <chr>                                 <chr> <chr> <chr> <fct>           <int>
##  1 Rocket Man (I Think It's Going To Be… 1972  05    19    May                81
##  2 Sweet Home Alabama                    1974  04    15    Apr                81
##  3 Sweet Home Alabama                    1974  04    15    Apr                81
##  4 Sweet Home Alabama                    1974  04    15    Apr                81
##  5 Bohemian Rhapsody - 2011 Mix          1975  11    21    Nov                84
##  6 Hotel California - 2013 Remaster      1976  12    08    Dec                82
##  7 Hotel California - 2013 Remaster      1976  12    08    Dec                82
##  8 Hotel California - 2013 Remaster      1976  12    08    Dec                82
##  9 Hotel California - 2013 Remaster      1976  12    08    Dec                82
## 10 Don't Stop Me Now - 2011 Mix          1978  11    10    Nov                83
## # ℹ 1,330 more rows
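
The repeated rows in the output above (for example, Sweet Home Alabama appears three times) suggest that a track is listed once for every playlist it belongs to, so the same song may be counted more than once in a bar chart. A minimal sketch of one way to keep a single row per track is shown below. It assumes the track_id column has been carried into the tibble, which the code above does not do, and the ids and rows here are made up.

```r
# Made-up stand-in for the dates tibble, with a hypothetical track_id column
toy <- data.frame(
  track_id   = c("a1", "a1", "a1", "b2", "c3", "c3"),
  track      = c("Sweet Home Alabama", "Sweet Home Alabama", "Sweet Home Alabama",
                 "Bohemian Rhapsody - 2011 Mix", "Sorry", "Sorry"),
  popularity = c(81, 81, 81, 84, 81, 81)
)

# duplicated() flags every repeat after the first, so !duplicated() keeps one row per id
unique_tracks <- toy[!duplicated(toy$track_id), ]
print(nrow(toy))
print(nrow(unique_tracks))
```

Whether to deduplicate depends on the question: counting playlist appearances and counting distinct songs are both reasonable, but they give different bar heights.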

numerics <- tibble(genre, popularity, acousticness, loudness, danceability, energy, speechiness, duration)
numerics <- numerics %>%
  filter(popularity > 80) %>%
  arrange(popularity)
numerics

## # A tibble: 1,340 × 8
##    genre popularity acousticness loudness danceability energy speechiness
##    <chr>      <int>        <dbl>    <dbl>        <dbl>  <dbl>       <dbl>
##  1 pop           81      0.268      -5.44        0.669  0.801      0.136 
##  2 pop           81      0.289      -3.58        0.46   0.8        0.05  
##  3 pop           81      0.0104     -6.44        0.588  0.67       0.0473
##  4 pop           81      0.0895     -7.79        0.584  0.54       0.0576
##  5 pop           81      0.0863     -5.37        0.649  0.716      0.0349
##  6 pop           81      0.0797     -3.67        0.654  0.76       0.045 
##  7 pop           81      0.125      -2.10        0.765  0.906      0.0965
##  8 pop           81      0.374      -7.00        0.693  0.575      0.0801
##  9 pop           81      0.00163    -4.84        0.662  0.858      0.0428
## 10 pop           81      0.0376     -4.58        0.782  0.731      0.143 
## # ℹ 1,330 more rows
## # ℹ 1 more variable: duration <dbl>

EXPLORATORY DATA ANALYSIS

For this first chunk, I made bar charts showing the number of popular songs by genre and by month released. I used my tibbles gen and dates created in the previous section because I mainly want to access the genre and month names of the songs with popularity over 80.

Using grid.arrange() to view both of these bar charts, we can observe that most of the songs with popularity > 80 are latin or pop. We also observe that most of these songs were released in October or, more generally, during the months from mid-autumn to early winter. (So if a new artist wanted to release a new song, late fall would probably be a good time to do so.)

base <- ggplot(gen)
genre1<-base + geom_bar(aes(x=genre, fill=genre)) + theme_classic() + ggtitle('Genre vs. Popularity') + ylab('Number of Popular Songs')

base1<-ggplot(na.omit(dates))
monthplot <- base1 + geom_bar(aes(x=month_name, fill=month_name)) + theme_classic() + ggtitle('Month Released vs. Popularity') + ylab('Number of Popular Songs')

grid.arrange(genre1, monthplot)

numbase <- ggplot(na.omit(numerics))
acousdot <- numbase + geom_jitter(aes(x=acousticness, y=popularity, color=genre), alpha = 0.3) + theme(legend.position="none")

louddot <- numbase + geom_jitter(aes(x=loudness, y=popularity, color=genre), alpha = 0.3) + theme(legend.position="none")

dancedot <- numbase + geom_jitter(aes(x=danceability, y=popularity, color=genre), alpha = 0.3) + theme(legend.position="none")

energydot <- numbase + geom_jitter(aes(x=energy, y=popularity, color=genre), alpha = 0.3) + theme(legend.position="none")

speechdot <- numbase + geom_jitter(aes(x=speechiness, y=popularity, color=genre), alpha = 0.3) + theme(legend.position="none")

durdot <- numbase + geom_jitter(aes(x=duration, y=popularity, color=genre), alpha = 0.3) + xlab('duration in ms')

grid.arrange(acousdot, louddot, dancedot, energydot, speechdot, durdot, nrow=3, 
             top = textGrob("Plots of 6 Track Factors vs. Popularity"))

In this second chunk, I’ve created 6 jitter plots comparing acousticness, loudness, danceability, energy, speechiness, and duration with the popularity of the songs with popularity over 80. I used the data in the numerics tibble to graph these 6 different factors with the popularity value on the y-axis. Since I turned the opacity down for the points, the darkest areas, where many points blend together, show where the songs are most concentrated.

(For aesthetic purposes, I also added in a legend where the colors of the points are based on their genre.)

Here we can see the most common ranges for these popular songs: for acousticness, (0.00, 0.25); for loudness, (-10, -2); for danceability, (0.5, 0.9); for energy, (0.5, 0.8); for speechiness, (0.0, 0.15); and for duration, (1.5e+05, 3e+05) ms.
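
These ranges were read off the plots by eye. As a rough cross-check, a range could also be computed from the data. The sketch below takes the 10th and 90th percentiles, which is an arbitrary choice of cutoff rather than something used in this report, and it runs on simulated values rather than the real numerics$danceability column. The helper central_range is hypothetical.

```r
# Hypothetical helper: the range that holds the middle share of the values
central_range <- function(x, lower = 0.10, upper = 0.90) {
  quantile(x, probs = c(lower, upper), na.rm = TRUE)
}

set.seed(1)
# Simulated values standing in for numerics$danceability
danceability_sim <- runif(200, min = 0.3, max = 1)

rng <- central_range(danceability_sim)
print(round(rng, 2))
```

The same helper could be applied to each of the 6 columns in turn to get a table of ranges that does not depend on how the plot is read.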

So if an artist wanted to make a new popular song, they could use these ranges as a guide to the traits that popular songs tend to share.

SUMMARY

All in all, the issue that I wanted to address in this report is what the common characteristics of a popular song are, specifically in terms of month released, genre, acousticness, loudness, danceability, energy, speechiness, and duration.

To address this problem, I used the spotify_songs.csv file to create a tibble for each topic and compared it to the popularity values. With these tibbles, I created bar charts for the genre and month released and jitter plots for acousticness, loudness, danceability, energy, speechiness, and duration vs. the popularity values. My graphs show that songs with popularity over 80 are mostly latin and pop. In addition, I found the ranges of acousticness, loudness, danceability, energy, speechiness, and duration in which these popular songs are most densely plotted. In the separate plots, the points were most concentrated in the range (0.00, 0.25) for acousticness, (-10, -2) for loudness, (0.5, 0.9) for danceability, (0.5, 0.8) for energy, (0.0, 0.15) for speechiness, and (1.5e+05, 3e+05) ms for duration.

These findings suggest the characteristics we would want for a popular song. So if a music artist or producer was interested in making a song that has a higher chance of becoming popular, they could use this report as a reference for recreating the traits of past popular songs.

One of the limitations of my analysis is that I focused on the tracks with popularity values over 80. This disregards the tracks with popularity values of 80 and lower in my graphs. To improve on this, I could’ve taken into consideration the data values where popularity <= 80. However, I did not want to include these in my graphs because the data set was already so extensive, which would’ve impacted the visuals of the graphs (especially the jitter plots). The points would’ve been more scattered and cluttered, and the bar plots would’ve shown that EDM has the greatest number of songs. Another limitation is that I did not include the other numeric variables, such as valence, tempo, liveness, and mode. I could’ve used these variables to create scatter or jitter plots as well.
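
One simple way to bring the remaining tracks back into the picture without crowding the plots would be to compare a summary statistic between the two groups. The sketch below does this for the median danceability on a small made-up data frame; toy and its values are illustrative only, and the cutoff of 80 is the one used in this report.

```r
# Made-up data standing in for the full spotify data frame
toy <- data.frame(
  track_popularity = c(85, 90, 82, 40, 55, 10, 70, 95),
  danceability     = c(0.75, 0.80, 0.70, 0.50, 0.65, 0.40, 0.60, 0.85)
)

# Label each track by whether it passes the popularity cutoff
toy$group <- ifelse(toy$track_popularity > 80, "popular", "other")

# Median danceability within each group
medians <- tapply(toy$danceability, toy$group, median)
print(medians)
```

A clear gap between the two medians would suggest that the feature is related to popularity, while similar medians would suggest that the range seen among popular songs is simply the range most songs fall in.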

project

2022-10-10