What makes a good song? Why do we enjoy certain songs more than others? How can we make the next most popular song?
We are going to explore the data in the spotify_songs.csv file to approach the issue of what factors can impact the popularity of any song. This data can help us predict certain features of future popular songs that the majority of Spotify users might enjoy listening to.
To examine whether these song features are related to popularity, we can use the data to build a set of graphs that compare each feature with the track’s popularity. We can look at aspects of a song such as its playlist genre, release date, acousticness, loudness, danceability, energy, speechiness, and duration (in ms), with the aim of seeing which factors, and which values of those factors, matter most for the success and overall popularity of a song.
By creating bar charts of the tracks’ genre and release month, we will be able to see how the popular songs are distributed across genres and in which months most of them were released. In addition, by plotting the tracks’ acousticness, loudness, danceability, energy, speechiness, and duration against popularity, we can see the most common range for each of these 6 factors among popular songs.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(grid)
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
In this chunk we load tidyverse because it includes the packages tidyr and ggplot2, which we will use to clean and tidy the data (tidyr) and to create graphics that display it (ggplot2).
tidyr and tibble let us reshape the data into a form that is easy to plot. ggplot2 then creates graphics from the cleaned data to help us see the relationships between the variables and, therefore, the factors that make a popular song.
The lubridate package makes it easier to work with dates and times, which matters here because I want to extract the months from the track release dates to use in my tibble and bar graph. (As the startup message above shows, tidyverse 2.0.0 already attaches lubridate, so the separate library(lubridate) call is harmless but redundant.)
The packages grid and gridExtra allow us to combine graphs and add graphical objects to them (such as textGrob). This makes the graphs clearer and more compact for the audience because we can label and plot several graphs together and look at them side by side without having to run them separately.
The data was obtained from this website: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md, created by Tom Mock on GitHub on Jan 20, 2020.
Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff collected data of 5000 songs from EDM, Latin, Pop, R&B, Rap, & Rock song categories. There are NAs in the date_released column, which arise when a release date is given only as a year (for example, 2013) and so cannot be parsed as a full date. I will omit these rows so that the graphs do not include tracks with a date_released of NA.
In this first chunk for data preparation, I read the csv file from my Documents folder into a data frame named spotify. Then I looked at the first few and last few rows of spotify to see what the data frame contains. Next, to decide which variables to analyze, I looked at the column names of the data frame and chose the ones most relevant to the question I wanted to tackle.
I also wanted to see how many rows and columns the data frame has, and found that there are 32833 rows and 24 columns. The names of the 24 columns are listed with the function colnames(). I will not be using all of the columns; I will work specifically with track_name, playlist_genre, track_popularity, track_album_release_date, acousticness, loudness, danceability, energy, speechiness, and duration_ms from the original data frame. Since this is a very large data set, I split it into smaller tibbles to make it easier to analyze and graph.
spotify <- read.csv('/Users/vanilla/Documents/fall22/329/spotify_songs.csv')
head(spotify)
## X track_id
## 1 1 6f807x0ima9a1j3VPbc7VN
## 2 2 0r7CVbZTWZgbTCYdfa2P31
## 3 3 1z1Hg7Vb0AhHDiEmnDE79l
## 4 4 75FpbthrwQmzHlBJLuGdC7
## 5 5 1e8PAfcKUYoKkxPhrHqw4x
## 6 6 7fvUMiyapMsRRxr07cU8Ef
## track_name track_artist
## 1 I Don't Care (with Justin Bieber) - Loud Luxury Remix Ed Sheeran
## 2 Memories - Dillon Francis Remix Maroon 5
## 3 All the Time - Don Diablo Remix Zara Larsson
## 4 Call You Mine - Keanu Silva Remix The Chainsmokers
## 5 Someone You Loved - Future Humans Remix Lewis Capaldi
## 6 Beautiful People (feat. Khalid) - Jack Wins Remix Ed Sheeran
## track_popularity track_album_id
## 1 66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2 67 63rPSO264uRjW1X5E6cWv6
## 3 70 1HoSmj2eLcsrR0vE9gThr4
## 4 60 1nqYsOef1yKKuGOVchbsk6
## 5 69 7m7vv9wlQ4i0LFuJiE2zsQ
## 6 67 2yiy9cd2QktrNvWC2EUi0k
## track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2 Memories (Dillon Francis Remix)
## 3 All the Time (Don Diablo Remix)
## 4 Call You Mine - The Remixes
## 5 Someone You Loved (Future Humans Remix)
## 6 Beautiful People (feat. Khalid) [Jack Wins Remix]
## track_album_release_date playlist_name playlist_id playlist_genre
## 1 2019-06-14 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 2 2019-12-13 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 3 2019-07-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 4 2019-07-19 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 5 2019-03-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 6 2019-07-11 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## playlist_subgenre danceability energy key loudness mode speechiness
## 1 dance pop 0.748 0.916 6 -2.634 1 0.0583
## 2 dance pop 0.726 0.815 11 -4.969 1 0.0373
## 3 dance pop 0.675 0.931 1 -3.432 0 0.0742
## 4 dance pop 0.718 0.930 7 -3.778 1 0.1020
## 5 dance pop 0.650 0.833 1 -4.672 1 0.0359
## 6 dance pop 0.675 0.919 8 -5.385 1 0.1270
## acousticness instrumentalness liveness valence tempo duration_ms
## 1 0.1020 0.00e+00 0.0653 0.518 122.036 194754
## 2 0.0724 4.21e-03 0.3570 0.693 99.972 162600
## 3 0.0794 2.33e-05 0.1100 0.613 124.008 176616
## 4 0.0287 9.43e-06 0.2040 0.277 121.956 169093
## 5 0.0803 0.00e+00 0.0833 0.725 123.976 189052
## 6 0.0799 0.00e+00 0.1430 0.585 124.982 163049
tail(spotify)
## X track_id track_name
## 32828 32828 0aBDrRTgDCwWbcOnEIp7DJ Many Ways - Radio Edit
## 32829 32829 7bxnKAamR3snQ1VGLuVfC1 City Of Lights - Official Radio Edit
## 32830 32830 5Aevni09Em4575077nkWHz Closer - Sultan & Ned Shepard Remix
## 32831 32831 7ImMqPP3Q1yfUHvsdn7wEo Sweet Surrender - Radio Edit
## 32832 32832 2m69mhnfQ1Oq6lGtXuYhgX Only For You - Maor Levi Remix
## 32833 32833 29zWqhca3zt5NsckZqDf6c Typhoon - Original Mix
## track_artist track_popularity
## 32828 Ferry Corsten feat. Jenny Wahlstrom 27
## 32829 Lush & Simon 42
## 32830 Tegan and Sara 20
## 32831 Starkillers 14
## 32832 Mat Zo 15
## 32833 Julian Calor 27
## track_album_id track_album_name
## 32828 59XOfNjuYZB6feC6QUzS3e Many Ways
## 32829 2azRoBBWEEEYhqV6sb7JrT City Of Lights (Vocal Mix)
## 32830 6kD6KLxj7s8eCE3ABvAyf5 Closer Remixed
## 32831 0ltWNSY9JgxoIZO4VzuCa6 Sweet Surrender (Radio Edit)
## 32832 1fGrOkHnHJcStl14zNx8Jy Only For You (Remixes)
## 32833 0X3mUOm6MhxR7PzxG95rAo Typhoon/Storm
## track_album_release_date playlist_name playlist_id
## 32828 2013 ♥ EDM LOVE 2020 6jI1gFr6ANFtT8MmTvA2Ux
## 32829 2014-04-28 ♥ EDM LOVE 2020 6jI1gFr6ANFtT8MmTvA2Ux
## 32830 2013-03-08 ♥ EDM LOVE 2020 6jI1gFr6ANFtT8MmTvA2Ux
## 32831 2014-04-21 ♥ EDM LOVE 2020 6jI1gFr6ANFtT8MmTvA2Ux
## 32832 2014-01-01 ♥ EDM LOVE 2020 6jI1gFr6ANFtT8MmTvA2Ux
## 32833 2014-03-03 ♥ EDM LOVE 2020 6jI1gFr6ANFtT8MmTvA2Ux
## playlist_genre playlist_subgenre danceability energy key loudness
## 32828 edm progressive electro house 0.581 0.640 5 -8.367
## 32829 edm progressive electro house 0.428 0.922 2 -1.814
## 32830 edm progressive electro house 0.522 0.786 0 -4.462
## 32831 edm progressive electro house 0.529 0.821 6 -4.899
## 32832 edm progressive electro house 0.626 0.888 2 -3.361
## 32833 edm progressive electro house 0.603 0.884 5 -4.571
## mode speechiness acousticness instrumentalness liveness valence tempo
## 32828 1 0.0365 0.026600 0.00e+00 0.5720 0.2880 128.001
## 32829 1 0.0936 0.076600 0.00e+00 0.0668 0.2100 128.170
## 32830 1 0.0420 0.001710 4.27e-03 0.3750 0.4000 128.041
## 32831 0 0.0481 0.108000 1.11e-06 0.1500 0.4360 127.989
## 32832 1 0.1090 0.007920 1.27e-01 0.3430 0.3080 128.008
## 32833 0 0.0385 0.000133 3.41e-01 0.7420 0.0894 127.984
## duration_ms
## 32828 196993
## 32829 204375
## 32830 353120
## 32831 210112
## 32832 367432
## 32833 337500
class(spotify)
## [1] "data.frame"
colnames(spotify)
## [1] "X" "track_id"
## [3] "track_name" "track_artist"
## [5] "track_popularity" "track_album_id"
## [7] "track_album_name" "track_album_release_date"
## [9] "playlist_name" "playlist_id"
## [11] "playlist_genre" "playlist_subgenre"
## [13] "danceability" "energy"
## [15] "key" "loudness"
## [17] "mode" "speechiness"
## [19] "acousticness" "instrumentalness"
## [21] "liveness" "valence"
## [23] "tempo" "duration_ms"
dim(spotify)
## [1] 32833 24
Here in this second chunk for data preparation, I gathered all the variables from the spotify data frame that I wanted. I used them to create three tibbles: gen, dates, and numerics.
(The explanation for each of the variables I used is written in the code, so check there if you want a summary of why I used each variable!)
First, I converted the track_album_release_date column to a Date so that the program recognizes it as year-month-day. Then I added a new column called month_name that holds the abbreviated month name of each release date, extracted with format() and the '%b' code. This way, the graph shows abbreviated month names rather than month numbers.
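The date handling described above can be tried on a few toy values. This is a minimal sketch in base R: the `release` vector is made up for illustration, and the month labels produced by `'%b'` depend on the session locale (an English locale is assumed here).

```r
# Toy release dates (hypothetical values): two full dates and one year-only entry
release <- c("2019-06-14", "2013", "2014-04-28")

# Parse as year-month-day; entries that do not match the format become NA
parsed <- as.Date(release, format = "%Y-%m-%d")

# Abbreviated month name for each parsed date (NA stays NA)
month_abbrev <- format(parsed, "%b")

# Which entries failed to parse?
is.na(parsed)
```

The year-only entry is the kind of value that ends up as NA in date_released, which is why those rows are dropped before plotting.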
Then I copied the columns I needed from the spotify data frame into separate vectors with shorter names, so that they are easier to refer to and I don’t have to type ‘spotify$’ every time I use a column. From these vectors I created the tibbles, each of which holds a subset of the spotify data, so that I can work with the specific variables that will help tell me what makes a popular song.
In gen, I used the track name, track popularity, and playlist genre for a bar chart later on. You will notice that I filtered each of these tibbles to include only tracks with a popularity greater than 80. This is because popularity runs from 0 to 100, where a higher value is better, so these are the most popular tracks in the data. I arranged gen in ascending order of popularity, so the first songs we see in the tibble have a popularity of 81.
For the tibble dates, I used the track name, date released, month name, and popularity, and I arranged the data from oldest to newest so that the tracks are listed chronologically by release date and then by popularity. I also used the factor() function so that when I graph the bar chart for this tibble, the month names appear in calendar order from left to right rather than in alphabetical order. With this tibble, I will make a bar chart of the months in which the songs were released.
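To illustrate how factor levels control the left-to-right order of bars, here is a small sketch with made-up month labels. It uses base R's built-in `month.abb` constant ('Jan' through 'Dec'); the vector `months_seen` and the shortened level list in the second example are hypothetical and only there to show what happens when a label is missing from the levels.

```r
# Toy month labels in the three-letter form that format(x, '%b') produces
# in an English locale (values are made up for illustration)
months_seen <- c("Oct", "Jan", "Jun", "Oct")

# Levels in calendar order: a bar chart would plot Jan, Feb, ..., Dec
ordered_months <- factor(months_seen, levels = month.abb)
levels(ordered_months)

# Any label that is not listed in `levels` silently becomes NA,
# e.g. 'Jun' when the level is spelled 'June'
mismatched <- factor(months_seen, levels = c("Jan", "June", "Oct"))
sum(is.na(mismatched))
```

Because mismatched labels turn into NA without a warning, it is worth checking `anyNA()` on the factor before handing it to a plot.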
In numerics, I included 6 numerical columns from spotify, which I will use to make jitter plots that show which ranges of these variables are most common in popular songs. This tibble holds genre, popularity, acousticness, loudness, danceability, energy, speechiness, and duration. As mentioned previously, it only includes tracks with a popularity greater than 80, and I arranged it in ascending order of popularity as well.
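The same filter-and-arrange pattern can also be written directly on a data frame, without first copying columns into separate vectors. The sketch below is only an illustration of that alternative, not the code used in this report: `toy` is a made-up data frame that borrows three column names from spotify, and it assumes the dplyr package is installed.

```r
library(dplyr)

# Made-up stand-in for a few rows of the spotify data frame
toy <- data.frame(
  track_name = c("A", "B", "C", "D"),
  track_popularity = c(85L, 40L, 81L, 90L),
  playlist_genre = c("pop", "edm", "latin", "pop"),
  stringsAsFactors = FALSE
)

# Pick and rename the columns, keep popularity > 80, sort ascending
popular <- toy %>%
  select(track = track_name,
         popularity = track_popularity,
         genre = playlist_genre) %>%
  filter(popularity > 80) %>%
  arrange(popularity)

popular
```

Keeping everything in one pipeline means the renaming, filtering, and sorting all stay attached to the same rows, so the columns cannot drift out of alignment.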
spotify$date_released <- as.Date(spotify$track_album_release_date, '%Y-%m-%d')
spotify$month_name <- format(spotify$date_released, '%b')
month_name <- spotify$month_name
date_released <- spotify$date_released
popularity <- spotify$track_popularity
genre <- spotify$playlist_genre
track <- spotify$track_name
acousticness <- spotify$acousticness
loudness <- spotify$loudness
danceability <- spotify$danceability
energy <- spotify$energy
speechiness <- spotify$speechiness
duration <- spotify$duration_ms
#date released and month name will show us which month these popular songs are released during the year.
#popularity is what we want to compare all of this chosen data to, as we want to see what makes a song more popular. It goes from (0-100) where the higher value is better.
#playlist genre will be used in a bar plot to show us how these songs are distributed amongst the genres.
#acousticness, loudness, danceability, energy, speechiness, and duration are the 6 variables that I want to compare to popularity. Since the classes for these 6 columns are doubles and not characters, I will use it to create jitter plots rather than a bar plot.
gen <- tibble(track, popularity, genre)
gen <- gen %>%
filter(popularity > 80) %>%
arrange(popularity)
gen
## # A tibble: 1,340 × 3
## track popularity genre
## <chr> <int> <chr>
## 1 Sixteen 81 pop
## 2 What I Like About You (feat. Theresa Rex) 81 pop
## 3 Electricity (with Dua Lipa) 81 pop
## 4 Scared to Be Lonely 81 pop
## 5 Let Me Love You 81 pop
## 6 Sorry 81 pop
## 7 Instagram 81 pop
## 8 Happy Now 81 pop
## 9 Without You (feat. Sandro Cavazza) 81 pop
## 10 HIP 81 pop
## # ℹ 1,330 more rows
dates <- tibble(track, date_released, month_name, popularity)
dates <- dates %>%
drop_na(date_released) %>%
separate(col=date_released, into=c('year', 'month', 'date'), sep='-') %>%
arrange(year, month, date, popularity) %>%
filter(popularity > 80)
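# Use month.abb ('Jan', ..., 'Dec') so the levels match the '%b' abbreviations;
# levels spelled 'June' or 'July' would not match 'Jun'/'Jul' and would turn those months into NA
dates$month_name <- factor(dates$month_name, levels = month.abb)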
dates
## # A tibble: 1,340 × 6
## track year month date month_name popularity
## <chr> <chr> <chr> <chr> <fct> <int>
## 1 Rocket Man (I Think It's Going To Be… 1972 05 19 May 81
## 2 Sweet Home Alabama 1974 04 15 Apr 81
## 3 Sweet Home Alabama 1974 04 15 Apr 81
## 4 Sweet Home Alabama 1974 04 15 Apr 81
## 5 Bohemian Rhapsody - 2011 Mix 1975 11 21 Nov 84
## 6 Hotel California - 2013 Remaster 1976 12 08 Dec 82
## 7 Hotel California - 2013 Remaster 1976 12 08 Dec 82
## 8 Hotel California - 2013 Remaster 1976 12 08 Dec 82
## 9 Hotel California - 2013 Remaster 1976 12 08 Dec 82
## 10 Don't Stop Me Now - 2011 Mix 1978 11 10 Nov 83
## # ℹ 1,330 more rows
numerics <- tibble(genre, popularity, acousticness, loudness, danceability, energy, speechiness, duration)
numerics <- numerics %>%
filter(popularity > 80) %>%
arrange(popularity)
numerics
## # A tibble: 1,340 × 8
## genre popularity acousticness loudness danceability energy speechiness
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 pop 81 0.268 -5.44 0.669 0.801 0.136
## 2 pop 81 0.289 -3.58 0.46 0.8 0.05
## 3 pop 81 0.0104 -6.44 0.588 0.67 0.0473
## 4 pop 81 0.0895 -7.79 0.584 0.54 0.0576
## 5 pop 81 0.0863 -5.37 0.649 0.716 0.0349
## 6 pop 81 0.0797 -3.67 0.654 0.76 0.045
## 7 pop 81 0.125 -2.10 0.765 0.906 0.0965
## 8 pop 81 0.374 -7.00 0.693 0.575 0.0801
## 9 pop 81 0.00163 -4.84 0.662 0.858 0.0428
## 10 pop 81 0.0376 -4.58 0.782 0.731 0.143
## # ℹ 1,330 more rows
## # ℹ 1 more variable: duration <dbl>
For the first plotting chunk, I made bar charts that show the number of popular songs by genre and by release month. I used the tibbles gen and dates created in the previous section because I mainly want the genre and month names of the songs with popularity over 80.
Using grid.arrange() to view both bar charts together, we can observe that most of the songs with popularity > 80 are Latin or pop. We also observe that these songs were mostly released in October or, more generally, from mid-autumn to early winter. (So if a new artist wanted to release a new song, the best time to do so would probably be during late fall.)
base <- ggplot(gen)
genre1<-base + geom_bar(aes(x=genre, fill=genre)) + theme_classic() + ggtitle('Genre vs. Popularity') + ylab('Number of Popular Songs')
base1<-ggplot(na.omit(dates))
monthplot <- base1 + geom_bar(aes(x=month_name, fill=month_name)) + theme_classic() + ggtitle('Month Released vs. Popularity') + ylab('Number of Popular Songs')
grid.arrange(genre1, monthplot)
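The bar heights that geom_bar() draws are simple row counts per category, so they can be checked numerically as well. Here is a minimal sketch using dplyr's `count()`; `toy_gen` is a made-up data frame standing in for gen, and the genres in it are illustrative only.

```r
library(dplyr)

# Made-up stand-in for the gen tibble: one row per popular track
toy_gen <- data.frame(
  genre = c("pop", "latin", "pop", "rap", "latin", "pop"),
  stringsAsFactors = FALSE
)

# One row per genre with its number of tracks, largest first
genre_counts <- toy_gen %>%
  count(genre, sort = TRUE)

genre_counts
```

Running the same call on gen (or on dates with month_name) would give the exact numbers behind each bar.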
numbase <- ggplot(na.omit(numerics))
acousdot <- numbase + geom_jitter(aes(x=acousticness, y=popularity, color=genre), alpha = 0.3) + theme(legend.position="none")
louddot <- numbase + geom_jitter(aes(x=loudness, y=popularity, color=genre), alpha = 0.3) + theme(legend.position="none")
dancedot <- numbase + geom_jitter(aes(x=danceability, y=popularity, color=genre), alpha = 0.3) + theme(legend.position="none")
energydot <- numbase + geom_jitter(aes(x=energy, y=popularity, color=genre), alpha = 0.3) + theme(legend.position="none")
speechdot <- numbase + geom_jitter(aes(x=speechiness, y=popularity, color=genre), alpha = 0.3) + theme(legend.position="none")
durdot <- numbase + geom_jitter(aes(x=duration, y=popularity, color=genre), alpha = 0.3) + xlab('duration in ms')
grid.arrange(acousdot, louddot, dancedot, energydot, speechdot, durdot, nrow=3,
top = textGrob("Plots of 6 Track Factors vs. Popularity"))
In this second chunk, I’ve created 6 jitter plots comparing acousticness, loudness, danceability, energy, speechiness, and duration with popularity of the songs with popularity over 80. So I used the data in the numerics tibble to graph these 6 different factors with the popularity value on the y-axis. Since I turned the opacity down for the points, we can see that the most concentrated areas are where most of the points are blended together.
(For aesthetic purposes, I also colored the points by genre. The legend is shown only on the duration plot, since it is the same for all six.)
Here we can see the most common ranges for these popular songs: for acousticness, (0.00, 0.25); for loudness, (-10, -2); for danceability, (0.5, 0.9); for energy, (0.5, 0.8); for speechiness, (0.0, 0.15); and for duration, (1.5e+05, 3e+05) ms, that is, 150,000 to 300,000 ms.
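These ranges were read off the plots by eye. A rough numeric counterpart is the interval that contains the central bulk of the values, which can be sketched with base R's `quantile()`. The helper name `central_range`, the 80% coverage, and the toy energy values below are all assumptions made for illustration, not part of the analysis above.

```r
# Hypothetical helper: the interval holding the central `coverage` share of x
central_range <- function(x, coverage = 0.8) {
  tail_prob <- (1 - coverage) / 2
  quantile(x, probs = c(tail_prob, 1 - tail_prob), na.rm = TRUE)
}

# Made-up energy values standing in for numerics$energy
toy_energy <- c(0.45, 0.52, 0.58, 0.61, 0.66, 0.70, 0.74, 0.79, 0.83, 0.91)
energy_range <- central_range(toy_energy)
energy_range

# Reading the axis labels: 1.5e+05 ms is 150,000 ms
1.5e+05 == 150000
```

Applying the same helper to each column of numerics would give ranges that can be compared with the ones estimated from the jitter plots.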
So if an artist wanted to make a new popular song, they could use these graphs as a guide and aim for values of these factors that fall within the ranges seen in popular songs.
All in all, the question I wanted to address in this report is what the common characteristics of popular songs are, specifically with respect to their month released, genre, acousticness, loudness, danceability, energy, speechiness, and duration.
To address this question, I used the spotify_songs.csv file to create a tibble for each topic and compared it to the popularity values. With these tibbles, I created bar charts for the genre and month released and jitter plots for acousticness, loudness, danceability, energy, speechiness, and duration vs. the popularity values. My graphs show that songs with popularity over 80 are mostly Latin and pop. In addition, I found the ranges of acousticness, loudness, danceability, energy, speechiness, and duration in which these popular songs are most densely plotted. In the separate plots, the points were most concentrated in the range (0.00, 0.25) for acousticness, (-10, -2) for loudness, (0.5, 0.9) for danceability, (0.5, 0.8) for energy, (0.0, 0.15) for speechiness, and (1.5e+05, 3e+05) ms for duration.
These findings suggest the characteristics we would want in a popular song. So if a music artist or producer were interested in making a song with a higher chance of becoming popular, they could use this report as a reference for recreating the traits of past popular songs.
One limitation of my analysis is that I focused on tracks with popularity values over 80, which leaves tracks with popularity of 80 and lower out of my graphs. To improve on this, I could have taken the remaining tracks (popularity <= 80) into consideration. However, I did not include them because the data set is so large that they would have hurt the readability of the graphs, especially the jitter plots: the points would have been more scattered and overplotted, and the bar plots would have shown EDM as the genre with the greatest number of songs. Another limitation is that I did not include the other numeric variables, such as valence, tempo, liveness, and mode. I could have used these variables to create scatter or jitter plots as well.
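One possible way to bring the tracks with popularity of 80 and lower back into the picture without crowding the plots is to compare, for each genre, the share of its tracks that are popular. The sketch below only illustrates that idea under stated assumptions: `toy_all` is a made-up data frame, the 80 cutoff follows the rest of this report, and dplyr is assumed to be installed.

```r
library(dplyr)

# Made-up stand-in for the full data: genre and popularity of every track
toy_all <- data.frame(
  genre = c("edm", "edm", "edm", "edm", "pop", "pop", "latin", "latin"),
  popularity = c(85, 30, 20, 10, 90, 50, 82, 88),
  stringsAsFactors = FALSE
)

# For each genre: how many tracks it has, and what share has popularity > 80
popular_share <- toy_all %>%
  group_by(genre) %>%
  summarise(
    n_tracks = n(),
    share_popular = mean(popularity > 80)
  ) %>%
  arrange(desc(share_popular))

popular_share
```

A share like this does not depend on how many tracks each genre contributes to the data set, so a genre with many tracks overall does not dominate the comparison.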