By: Nuranissa and Branden

Introduction

Our report explores what contributes to a song’s valence (positivity). We want to learn whether there are any patterns in what produces valence. We used the R programming language to produce the relevant graphs. We used data containing songs from Spotify and graphed their valence against their other song factors using barplots and scatterplots. This information could help psychologists and therapists learn which factors of a song are likely to improve a listener’s happiness. This could help in recommending or producing songs for the purpose of boosting a patient’s emotions to help relieve their anxiety or depression.

Packages Required

We used the following package:

library(tidyverse)

Tidyverse is a collection of packages used to transform and visualise data, so it is required to graph our data.

Data Preparation

Originally, the data contained a total of 32,833 observations with 15 missing values. Of the 15 missing values, 5 were track names, 5 were artist names, and 5 were album names, all marked with NA. We found these missing values peculiar, as they are all names; however, our research does not consider names, since names are subjective and we focused on musical elements relevant to valence.

We imported and cleaned the data with the following steps:

  1. Import the Spotify data file called “spotify_songs.csv” using read.csv() into a new data frame called “spotify”. The function reads and imports .csv files into a data frame in R.
spotify <- read.csv("spotify_songs.csv")
  2. Check for missing values in “spotify” using colSums(is.na()). First, is.na() checks whether each entry in the data frame is missing. colSums() returns the sum of each column/category. Combined, colSums(is.na()) produces a list of the categories in the data frame showing how many missing values each category/column has; 0 means there are no missing values in that category. As mentioned earlier, we found that track names, artist names, and album names each have 5 missing values (totaling 15 missing values).
colSums(is.na(spotify))
##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
  3. Determine whether to omit the missing values or replace them with the column mean. We did not consider names in our research due to their subjective nature; names are also not numerical, so they are difficult to graph and cannot be used in calculations. In addition, there are very few missing values in the data. For these reasons, we decided to omit the missing values in “spotify” using na.omit().
spotify <- na.omit(spotify)
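
For contrast, if a numeric column had contained missing values, we could have replaced them with the column mean instead of dropping rows. Below is a minimal, hypothetical sketch; our data needs no such step, since only the name columns had NAs:

# Hypothetical: fill NAs in a numeric column with the column mean
# (valence has no NAs in our data; this only illustrates the alternative)
spotify$valence[is.na(spotify$valence)] <- mean(spotify$valence, na.rm = TRUE)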

Below is the cleaned data frame, showing the top observations with regard to valence:

spotify %>%
  # Arranges the valence from highest to lowest (in descending order)
  arrange(desc(valence)) %>%
  # Keeps the top 10 rows by valence (ties are retained, so 11 rows appear)
  top_n(n = 10, wt = valence) %>%
  # Selects only the variables listed below to be outputted
  select(valence, playlist_genre, loudness, duration_ms, energy, track_popularity, 
         danceability, speechiness, acousticness, instrumentalness, liveness, tempo)
##    valence playlist_genre loudness duration_ms energy track_popularity
## 1    0.991           rock   -9.680      191333  0.676               36
## 2    0.990            r&b  -10.989      191560  0.647               59
## 3    0.985           rock  -15.308      223867  0.378               68
## 4    0.984            r&b  -13.575      291000  0.725               25
## 5    0.983            edm   -6.415      412973  0.973               52
## 6    0.981            pop  -12.677      240707  0.630               54
## 7    0.981            pop   -8.848      173195  0.740               49
## 8    0.980            edm   -5.689      196360  0.724                0
## 9    0.979            r&b   -7.327      140182  0.940               60
## 10   0.979            r&b   -7.327      140182  0.940               60
## 11   0.979            edm  -10.053      105234  0.773               21
##    danceability speechiness acousticness instrumentalness liveness   tempo
## 1         0.815      0.0470     8.25e-02         6.62e-01   0.0508 139.764
## 2         0.811      0.0498     8.23e-02         6.81e-01   0.0572 139.787
## 3         0.758      0.0449     2.84e-01         0.00e+00   0.0490 120.736
## 4         0.717      0.0295     3.82e-01         2.28e-01   0.3320 132.014
## 5         0.747      0.0370     3.62e-04         5.65e-01   0.3080 128.127
## 6         0.780      0.0321     2.61e-01         1.59e-05   0.0550 136.719
## 7         0.880      0.0327     2.01e-01         3.57e-05   0.0772 119.985
## 8         0.837      0.0524     7.06e-02         3.41e-04   0.0715 119.983
## 9         0.784      0.0540     7.16e-02         1.71e-04   0.1330 137.003
## 10        0.784      0.0540     7.16e-02         1.71e-04   0.1330 137.003
## 11        0.742      0.0617     1.38e-05         8.25e-01   0.0517 128.006

Below are summaries of the variables above, produced using summary().

For example:

# Summarises all categories/variables in the spotify data frame
summary(spotify)

# Summarises the selected category/variable from the spotify data frame
# This would summarise only the valence from spotify
summary(spotify$valence)

Valence:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3310  0.5120  0.5106  0.6930  0.9910

Valence is the measure of musical positivity in a song. Most songs appear to have a moderate valence, meaning they evoke some positivity in listeners but not strongly. Most interesting is that there is at least one song with 0 valence, so that song is likely to evoke only sadness in its listeners.
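As a quick base R check, we can count how many songs sit at exactly 0 valence:

# Counts the songs with a valence of exactly 0
sum(spotify$valence == 0)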

Playlist Genre:

##    Length     Class      Mode 
##     32828 character character

The length is 32828 for all character variables. This only reflects how many observations remain after cleaning, which does not provide any insight for our research.

Loudness:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -46.448  -8.171  -6.166  -6.720  -4.645   1.275

The most interesting takeaway is that loudness ranges from as low as -46.45 dB up to 1.275 dB, slightly above the 0 dB ceiling of the -60 dB to 0 dB range that songs typically occupy. The higher the dB, the louder the sound. On Spotify, most songs sit around -6 dB due to various factors, such as volume normalisation and playback systems.

Duration:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4000  187804  216000  225797  253581  517810

The duration is measured in milliseconds (ms). Most songs are around 3.76 minutes long, with the longest at 8.63 minutes. Remarkably, there is a song that is only 4 seconds long.
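The minute figures above come from dividing the millisecond values by 60,000 (1 minute = 60,000 ms). For example:

# Converts the mean and maximum durations from milliseconds to minutes
mean(spotify$duration_ms) / 60000
max(spotify$duration_ms) / 60000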

Energy:

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000175 0.581000 0.721000 0.698603 0.840000 1.000000

Energy represents the intensity perceived in a song: the faster and more driving the song, the higher its energy. The mean is approximately 0.70, so most songs are fast-paced. Pop music would fall around the mean. The maximum is 1, so high-intensity songs like metal would sit at the high end of the scale, while slow, low-energy music would sit at the low end.

Track Popularity:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   24.00   45.00   42.48   62.00  100.00

Interestingly enough, the mean song popularity is 42.48 on a 0-100 scale, where higher numbers mean more popular songs. The mean being near the middle suggests the data is not biased towards songs of either high or low popularity, so the distribution of songs provided is fairly balanced.
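As a quick check of this claim, a histogram of popularity would reveal whether songs pile up at either end of the scale. A minimal sketch:

# Plots the distribution of track popularity as a quick balance check
ggplot(spotify, aes(track_popularity)) +
  geom_histogram(binwidth = 5)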

Danceability:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.5630  0.6720  0.6549  0.7610  0.9830

Danceability is based on various musical elements, such as tempo and rhythm stability. Most songs are quite danceable, with the mean above 0.65. Songs with 0 danceability are likely to be associated with less energy and a lower tone.

Speechiness:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0410  0.0625  0.1071  0.1320  0.9180

Speechiness is the measure of how much spoken word is present in a song. With a mean of 0.11, most songs contain little spoken word, and high speechiness does not necessarily associate with high or low valence. Songs with 0 speechiness could be high-beat music without spoken words, such as instrumentals. Those with high speechiness could have little-to-no music and be full of spoken words, like an audiobook or podcast, although at that point they are not really songs.
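For reference, Spotify’s documentation describes speechiness values above roughly 0.66 as tracks probably made entirely of spoken words. Taking that threshold as given, we can count how many such tracks appear:

# Counts tracks that are probably mostly spoken word
# (the 0.66 cut-off comes from Spotify's documentation, not our analysis)
sum(spotify$speechiness > 0.66)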

Acousticness:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0151  0.0804  0.1754  0.2550  0.9940

The higher the acousticness of a song, the greater the presence of acoustic sounds as opposed to electronic amplification. With a mean of 0.18, most songs rely on sounds produced electronically rather than traditionally through instruments. Songs with high acousticness appear in small numbers compared to those with low-to-no acousticness, suggesting a trend away from the use of acoustic instruments in music.

Instrumentalness:

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000000 0.0000000 0.0000161 0.0847599 0.0048300 0.9940000

Instrumentalness refers to the absence of vocals: the fewer vocals there are in a song, the higher its instrumentalness. With the mean at 0.08 and the median at 0.0000161, most songs include some-to-many vocals throughout their duration. Thus, there are very few songs with high instrumentalness (little-to-no vocals) in the data.

Liveness:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0927  0.1270  0.1902  0.2480  0.9960

Liveness measures the presence of an audience in a song: the stronger the audience presence, the higher the liveness. The mean is 0.19, so most songs have little audience presence. These songs were likely produced in a studio, with no audience beyond background vocals. Songs with high liveness are likely recordings of live performances.

Tempo:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   99.96  121.98  120.88  133.92  239.44

Tempo is the pacing of a song in beats per minute (BPM): the higher the BPM, the faster the tempo. The mean is about 121 BPM, which implies most songs are medium-paced. What is interesting is that there is a song with 0 BPM, which should not be possible unless the song has no music or an extremely short duration, like the 4-second song mentioned earlier.
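We can inspect these 0 BPM entries directly to see whether they coincide with extremely short durations, as suspected:

# Lists the songs reporting 0 BPM alongside their durations
spotify %>%
  filter(tempo == 0) %>%
  select(track_name, duration_ms, tempo)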

Proposed Exploratory Data Analysis

We propose to slice the data, create new variables, and use plotting functions to visualise and analyse the data. To summarise our findings, we may build new data frames to hold the summarised information, especially for the new variables.

We will use the slice functions slice_max() and slice_min() to select specific rows: the songs with the highest valence and the songs with the lowest valence.

# Selects 1500 rows of the highest valence 
spotify_top_valence <- spotify %>%
  slice_max(valence, n = 1500)

# Selects 1500 rows of the lowest valence 
spotify_bottom_valence <- spotify %>%
  slice_min(valence, n = 1500)

We will use the new variables to plot bar charts and scatter plots with other musical elements.

For the song genre, we will plot a bar chart, since track genre is a character variable and is not suitable for scatter plots. We will use barplot(table()) to achieve this: the song genres from the high-valence songs must first be transformed into a table, which arranges the data into a form that is easy to graph as a barplot. If this is not done, barplot() will output an error because it is not given a vector or matrix.

# Graphs a barplot about the count of top valence songs in each genre
barplot(table(spotify_top_valence$playlist_genre), 
        main = "Genre Count Top Valence")

From the bar chart, we can see that Latin songs make up the largest share of the top-valence songs among the genres.
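The same counts can be checked numerically by sorting the table used for the bar chart:

# Counts the top-valence songs per genre, from most to least common
sort(table(spotify_top_valence$playlist_genre), decreasing = TRUE)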

All other variables, which are numeric, will be graphed in scatter plots using ggplot(). However, before we can do that, we need to learn how to use the function properly.

For now, we have created an example scatter plot for loudness below. We referred to source #1 and source #2 for guidance on how to plot using ggplot().

# Plots scatter plot relationship between valence and loudness from songs with top valence
ggplot(spotify_top_valence, aes(loudness, valence)) +
  # Set points to be a noticeable size
  geom_point(size = 2) +
  ggtitle("Top Valence Values Compared to Loudness") +
  # Adjust title to the centre of the graph
  theme(plot.title.position = 'plot', 
        plot.title = element_text(hjust = .5))

With the scatter plot shown above, we can determine if there is a pattern in the relationship.
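One simple way to quantify such a pattern, as a rough first check, is the correlation coefficient between the two variables:

# Measures the linear association between loudness and valence (from -1 to 1)
cor(spotify_top_valence$loudness, spotify_top_valence$valence)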

We may also use linear regression to see if the results would contribute to our research, but we do not yet have the knowledge to build one.
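
For reference, below is a minimal sketch of what such a model might look like, fitting valence against loudness. This is only an illustration, not an approach we have verified:

# Fits a simple linear model predicting valence from loudness
valence_model <- lm(valence ~ loudness, data = spotify)
# Shows coefficients, significance, and R-squared for the fit
summary(valence_model)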