By: Nuranissa and Branden

Introduction

Our report explores what contributes to a song’s valence (positivity). We want to learn whether there are any patterns in what produces valence. We used the R programming language with a data set of songs from Spotify, and graphed each song’s valence against its other attributes using bar plots and scatter plots. If this project produces reliable results, the information may help psychologists and therapists learn which attributes of a song are likely to improve a listener’s happiness. This may help in recommending or producing songs that lift a patient’s mood to help relieve their anxiety or depression.

Packages Required

For this project, we loaded the following standard packages:

library(tidyverse)
library(modelr)
library('corrplot')
library(knitr) # Optional
  • Tidyverse is a collection of packages used to transform and visualise data, so it is required to graph our data.
  • Modelr is a package that provides modelling functions designed to work within data manipulation pipelines.
  • 'corrplot' is a package that plots correlation matrices, so it is required to graph our correlations.
  • Knitr is a package that allows R code to be integrated into an RMarkdown file or HTML. It is not necessary to reproduce the results of this project, but it allows greater flexibility when creating a report.

Data Preparation

Originally, the data contained a total of 32,833 observations with 15 missing values. Of the 15 missing values, 5 were track names, 5 were artist names, and 5 were album names. They are marked with NA. We found these missing values peculiar because they are all names; however, our research does not consider names, since names are subjective and we focused on the musical elements relevant to valence.

We imported and cleaned the data with the following steps:

  1. Import the Spotify data file called “spotify_songs.csv” using read.csv() into a new data frame called “spotify”. The function reads a .csv file and imports it into a data frame in R.
spotify <- read.csv("spotify_songs.csv")
  2. Check for missing observations in “spotify” using colSums(is.na()). First, is.na() checks each value in the data frame and marks whether it is missing. colSums() returns the sum of each column. Combined, colSums(is.na()) produces a named list of the columns in the data frame with the number of missing values in each. 0 means there are no missing values in that column. As mentioned earlier, we found that track names, artist names, and album names each have 5 missing values (15 in total).
colSums(is.na(spotify))
##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
  3. Determine whether to omit the missing values or replace them. The missing values are all names, which we did not consider in our research due to their subjective nature; names are also not numeric, so they cannot be replaced with a mean and are difficult to graph. In addition, there are very few missing values in the data. For these reasons, we decided to omit the rows with missing values in “spotify” using na.omit().
spotify <- na.omit(spotify)
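To illustrate what the checking and omitting steps do, here is a minimal sketch on a small made-up data frame. The `toy` data frame and its values are hypothetical and are not taken from the Spotify file.

```r
# Hypothetical toy data with two missing names
toy <- data.frame(
  track_name = c("A", NA, "C", NA),
  valence    = c(0.9, 0.4, 0.1, 0.7)
)

# is.na() marks each cell as TRUE (missing) or FALSE;
# colSums() then counts the TRUE values in each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# na.omit() drops every row that has at least one missing value
toy_clean <- na.omit(toy)
print(nrow(toy_clean))
```

Note that na.omit() removes whole rows, not single cells, which is consistent with the row count in this report dropping from 32,833 to 32,828.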

Below is the clean data frame with the top observations with regard to valence:

clean_data <- spotify %>%
  # Arranges the valence from highest to lowest (in descending order)
  arrange(desc(valence)) %>%
  # Selects the top 10 rows by valence (slice_max() replaces the superseded top_n())
  slice_max(valence, n = 10) %>%
  # Selects only the variables listed below to be outputted
  select(valence, playlist_genre, loudness, duration_ms, energy, track_popularity, 
         danceability, speechiness, acousticness, instrumentalness, liveness, tempo)

The summaries of the variables above can be produced using summary().

For example:

# Summarises all categories/variables in the spotify data frame
summary(spotify)

# Summarises the selected category/variable from the spotify data frame
# This would summarise only the valence from spotify
summary(spotify$valence)

Valence:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3310  0.5120  0.5106  0.6930  0.9910

Valence is the measure of musical positivity in a song. With a median and a mean of about 0.51, most songs sit near the middle of the observed range, meaning that they convey a moderate amount of positivity rather than a strong one. Most interestingly, there is at least one song with a valence of 0, so that song is likely to sound sad to its listeners.

Playlist Genre:

##    Length     Class      Mode 
##     32828 character character

The length is 32828 for all character variables. This only reflects how many observations there are, which does not provide any insight into our research.
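For a character variable, table() is usually more informative than summary(). The sketch below uses a short made-up vector of genre labels (hypothetical values standing in for spotify$playlist_genre).

```r
# Hypothetical genre labels standing in for spotify$playlist_genre
genre <- c("pop", "rock", "pop", "latin", "rock", "pop")

# summary() on a character vector only reports length, class, and mode
print(summary(genre))

# table() counts how often each distinct value occurs
genre_counts <- table(genre)
print(genre_counts)
```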

Loudness:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -46.448  -8.171  -6.166  -6.720  -4.645   1.275

What stands out is that the loudness ranges from as low as -46.45 dB to above 1 dB, even though loudness values for songs typically range from -60 dB to 0 dB. The higher the value, the louder the song. These values are relative measurements of a recording’s level rather than the sound pressure a listener hears, so a song at 1 dB is not comparable to a quiet everyday sound such as a whisper; it sits at the loud end of the scale. On Spotify, most songs are around -6 dB due to various factors, such as volume normalisation and playback systems.

Duration:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4000  187804  216000  225797  253581  517810

The duration is measured in milliseconds (ms). The mean is about 3.76 minutes, and the longest song is 8.63 minutes long. Surprisingly, there is a song that is only 4 seconds long.
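The conversion behind these figures can be checked with a short calculation. The values are copied from the duration summary above; the vector name `duration_ms` is only used for this illustration.

```r
# Summary values copied from the duration output above (milliseconds)
duration_ms <- c(min = 4000, mean = 225797, max = 517810)

# 1 second = 1000 ms and 1 minute = 60 seconds,
# so dividing by 60000 converts milliseconds to minutes
duration_min <- duration_ms / 60000
print(round(duration_min, 2))

# The shortest track in seconds
print(duration_ms[["min"]] / 1000)
```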

Energy:

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000175 0.581000 0.721000 0.698603 0.840000 1.000000

Energy represents the intensity perceived in a song: the faster and more intense a song is, the higher its energy. The mean is approximately 0.70, so most songs are fairly fast-paced. Pop music would fall around the mean. The highest value is 1, so high-intensity songs, such as metal music, would be near the high end of the scale. Low-bass music would be near the low end of the scale.

Track Popularity:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   24.00   45.00   42.48   62.00  100.00

The mean track popularity is 42.48 on a scale of 0 to 100, where a higher number means a more popular song. With the mean and the median (45) both near the middle of the scale, the data does not appear to be heavily skewed towards songs with either high or low popularity.

Danceability:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.5630  0.6720  0.6549  0.7610  0.9830

Danceability is based on various musical elements, such as tempo and rhythm stability. Most songs seem to be quite danceable, with the mean being over 0.65. Songs with 0 danceability are likely to be associated with less energy and a lower tone.

Speechiness:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0410  0.0625  0.1071  0.1320  0.9180

Speechiness is the measure of how much of a song consists of spoken words. With a mean of 0.11, most songs contain few spoken words, and high speechiness is not necessarily associated with high or low valence. Songs with 0 speechiness could be music without spoken words, such as instrumentals. Those with high speechiness could have little-to-no music and be full of spoken words, like an audiobook or podcast, although at that point they would not be considered songs.

Acousticness:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0151  0.0804  0.1754  0.2550  0.9940

The higher a song’s acousticness, the greater the presence of acoustic instrument sounds as opposed to electronically produced or amplified sounds. With a mean of 0.18, most songs rely on sounds produced electronically rather than traditionally through instruments. Songs with high acousticness appear in small numbers compared to those with low-to-no acousticness, which may suggest a trend away from the use of acoustic instruments in music.

Instrumentalness:

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000000 0.0000000 0.0000161 0.0847599 0.0048300 0.9940000

Instrumentalness refers to the absence of vocals. The fewer vocals there are in a song, the higher its instrumentalness. With the mean being 0.08 and the median being 0.0000161, most songs include vocals through much of their duration. Thus, there are very few songs with high instrumentalness (little-to-no vocals) in the data.

Liveness:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0927  0.1270  0.1902  0.2480  0.9960

Liveness measures the presence of an audience in a song. The greater the presence of an audience, the higher the liveness. The mean is 0.19, so most songs have little-to-no audience present. These songs are likely to have been produced in a studio, so there is no audience apart from background vocals. Songs with high liveness are likely to have been recorded at a live performance.

Tempo:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   99.96  121.98  120.88  133.92  239.44

Tempo is the pace of a song in beats per minute (BPM). The higher the BPM, the higher the tempo. The mean is 121 BPM, which implies most songs are medium-paced. Interestingly, there is a song with 0 BPM, which should not be possible unless the song has no music or has an extremely short duration, like the 4-second song mentioned earlier.
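Unusual rows such as a 0 BPM tempo or a very short duration could be listed for inspection before analysis. The following is a minimal sketch on made-up rows; the `toy` data frame and the 30-second cut-off are assumptions chosen for illustration, not steps from our analysis.

```r
# Hypothetical rows; the 30-second threshold is an assumption for illustration
toy <- data.frame(
  track       = c("a", "b", "c", "d"),
  tempo       = c(120.0, 0.0, 98.5, 140.2),
  duration_ms = c(210000, 4000, 185000, 25000)
)

# Keep rows whose tempo is zero or whose duration is under 30 seconds
suspicious <- toy[toy$tempo == 0 | toy$duration_ms < 30000, ]
print(suspicious)
```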

Exploratory Data Analysis

We sliced the data, created new variables, and used plotting functions to visualise and analyse the data. Where helpful, we stored the new variables in new data frames to summarise the results.

We will use the slice functions to select specific rows to determine songs with top valence and songs with low valence.

  • slice_max() - Selects rows with the highest value
  • slice_min() - Selects rows with the lowest value
# Selects 1500 rows of the highest valence 
spotify_top_valence <- spotify %>%
  slice_max(valence, n = 1500)

# Selects 1500 rows of the lowest valence 
spotify_bottom_valence <- spotify %>%
  slice_min(valence, n = 1500)

# Selects the top 10 rows of valence 
top_ten_valence <- spotify %>%
  slice_max(valence, n = 10)

# Selects the bottom 10 rows of valence 
bottom_ten_valence <- spotify %>%
  slice_min(valence, n = 10)
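One detail worth knowing is how slice_max() and slice_min() treat tied values. The sketch below uses a made-up tibble (hypothetical values) to show that, by default, ties at the cut-off are all kept, so the subsets above may contain slightly more rows than the requested n if several songs share the same valence.

```r
library(dplyr)

# Hypothetical valence scores with a three-way tie at 0.9
toy <- tibble(track   = c("a", "b", "c", "d"),
              valence = c(0.9, 0.9, 0.9, 0.2))

# By default, ties at the cut-off are all kept,
# so n = 2 can return more than 2 rows
with_ties <- toy %>% slice_max(valence, n = 2)

# with_ties = FALSE returns exactly n rows
without_ties <- toy %>% slice_max(valence, n = 2, with_ties = FALSE)

print(nrow(with_ties))
print(nrow(without_ties))
```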

We used new variables to plot bar charts and scatter plots with other musical elements.

For the song genre, we plotted a bar chart since the track genre is a character variable, which is not suitable for scatter plots. We used barplot(table()) to achieve this. We first needed to transform the genres of the high-valence songs into a table, which counts the songs in each genre and makes the bar plot easy to graph. If this is not done, barplot() will output an error because it was not given a vector or matrix.

# Graphs a barplot about the count of top valence songs in each genre
barplot(table(spotify_top_valence$playlist_genre), 
        main = "Genre Count Top Valence")

This bar chart shows how the songs in the top valence data frame are distributed across genres. Latin songs have the highest count of the genres, while the others are fairly close in distribution. The only other genres that stand out are Rock, in second place, and EDM, with the lowest count.
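Raw counts can also reflect how many songs each genre has in the full data set. A possible complement, sketched here on made-up values, is the share of each genre's songs that fall in the top-valence group. The vectors `genre` and `is_top` are hypothetical and only illustrate the calculation.

```r
# Hypothetical data: genre labels and a flag marking top-valence songs
genre  <- c("latin", "latin", "latin", "edm", "edm", "edm", "edm",
            "rock", "rock", "rock")
is_top <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE,
            TRUE, FALSE, FALSE)

# Count of top-valence songs per genre (what the bar chart shows)
top_counts <- table(genre[is_top])

# Share of each genre's songs that are in the top-valence group
# (the mean of a logical vector is the proportion of TRUE values)
top_share <- tapply(is_top, genre, mean)

print(top_counts)
print(round(top_share, 2))
```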

All other variables, which are numeric, will be graphed in scatter plots using ggplot().

With scatter plots, we can determine if there is a pattern in the relationship.

# Plots scatter plot relationship between valence and energy from songs with top valence
ggplot(spotify_top_valence, aes(energy, valence)) +
  # Set points to be a noticeable size
  geom_point(size = 2) +
  geom_smooth(method = "lm") +
  # Set the top ten valence songs to be coloured red
  geom_point(data = top_ten_valence, color = "red") +
  geom_smooth(data = top_ten_valence, color = "red", method = "lm") +
  ggtitle("Top Valence Values Compared to Energy") +
  # Adjust title to the centre of the graph
  theme(plot.title.position = 'plot', 
        plot.title = element_text(hjust = .5))

This graph compares energy and valence within the top valence scores to see if there is a correlation. It also highlights the top ten valence tracks in red with a separate trend line, while the blue line is fitted to all tracks in the top valence data frame. The blue line shows no correlation, while there might be a weak negative correlation among the top ten valence tracks. This may imply that the top valence songs tend towards less energy, although a trend based on ten points is tentative. Maybe positivity does not require the highest beats, and more people are finding comfort in tamer beats and energy.
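One way to put numbers on a weak or absent correlation is to compute it directly with cor() and cor.test(). The sketch below uses simulated data (hypothetical values, not the Spotify data) to show how a correlation over ten points comes with a wide confidence interval.

```r
set.seed(1)  # Hypothetical simulated data, not the Spotify values

# 1500 unrelated energy/valence pairs standing in for the top-valence subset
energy  <- runif(1500)
valence <- runif(1500, min = 0.9, max = 1)

# Correlation over the whole subset
r_all <- cor(energy, valence)

# Correlation over only the ten highest-valence points
top_idx <- order(valence, decreasing = TRUE)[1:10]
r_top10 <- cor(energy[top_idx], valence[top_idx])

# cor.test() adds a 95% confidence interval for the correlation
ci_top10 <- cor.test(energy[top_idx], valence[top_idx])$conf.int

print(c(all = r_all, top10 = r_top10))
print(ci_top10)
```

If the interval for the ten-point correlation includes 0, the apparent slope of a ten-point trend line may be due to chance.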

# Plots scatter plot relationship between valence and popularity from songs with top valence
ggplot(spotify_top_valence, aes(track_popularity, valence)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm") +
  # Set the top ten valence songs to be coloured red
  geom_point(data = top_ten_valence, color = "red") +
  geom_smooth(data = top_ten_valence, color = "red", method = "lm") +
  ggtitle("Top Valence Values Compared to Popularity") +
  # Adjust title to the centre of the graph
  theme(plot.title.position = 'plot', 
        plot.title = element_text(hjust = .5))

While there is no correlation across the top valence tracks as a whole, there is a weak positive correlation among the top ten valence tracks. High valence songs may not be as popular as they used to be, and more people may find low valence songs more to their taste. This does not necessarily mean that these people are unhappy with more positive songs. It implies that people have different tastes, and so some may find more joy in listening to more sombre songs.

# Plots scatter plot relationship between valence and instrumentalness from songs with top valence
ggplot(spotify_top_valence, aes(instrumentalness, valence)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm") +
  # Set the top ten valence songs to be coloured red
  geom_point(data = top_ten_valence, color = "red") +
  geom_smooth(data = top_ten_valence, color = "red", method = "lm") +
  ggtitle("Top Valence Values Compared to Instrumentalness") +
  # Adjust title to the centre of the graph
  theme(plot.title.position = 'plot', 
        plot.title = element_text(hjust = .5))

While there is no correlation across the top valence tracks as a whole, there is a weak positive correlation among the top ten valence tracks. This may imply that more positive songs include more instrumentals and fewer vocals. Songs like this can be disco and club songs, which are played at large gatherings, such as a club. At these gatherings, many people are excited and happy in general.

ggplot(spotify_bottom_valence, aes(speechiness, valence)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm") +
  # Set the bottom ten valence songs to be coloured red
  geom_point(data = bottom_ten_valence, color = "red") +
  geom_smooth(data = bottom_ten_valence, method = "lm", color = "red") +
  ggtitle("Bottom Valence Values Compared to Speechiness") +
  # Adjust title to the centre of the graph
  theme(plot.title.position = 'plot', 
        plot.title = element_text(hjust = .5))

This scatter plot shows the bottom 1500 valence tracks to see if there is any correlation between speechiness and valence. The red points and the separate red trend line highlight the bottom ten valence tracks, while the blue line is fitted to all tracks in the bottom valence data frame. There is no correlation across the bottom valence tracks as a whole, but there is a weak positive correlation among the bottom ten tracks. This may imply that low valence songs are trending towards including more words. The trend seems to be moving away from songs without words, such as orchestral music, so more people may prefer some words in songs. Some people may find comfort or positivity when there are words in songs, perhaps because they find the lyrics relatable to themselves or to someone they know who makes them happy.

We summarised the correlations using cor(), which computes the correlation between each pair of numeric variables in the data set, and displayed them with corrplot(). As shown below, we used coloured circles to show how strong or weak the correlation is between two variables.

# Displays a correlation graph in the top valence songs with simple, circle visuals
top_cor <- cor(spotify_top_valence[12:23])
corrplot(top_cor, type = "upper", method = "circle", order = "alphabet", 
         title = "Top Valence Correlations",
         # Makes the title completely above the graph
         mar=c(0,0,1,0)) # Source: http://stackoverflow.com/a/14754408/54964

# Displays a correlation graph in the bottom valence songs with simple, circle visuals
bottom_cor <- cor(spotify_bottom_valence[12:23])
corrplot(bottom_cor, type = "upper", method = "circle", order = "alphabet",
         title = "Bottom Valence Correlations",
         # Makes the title completely above the graph
         mar=c(0,0,1,0)) # Source: http://stackoverflow.com/a/14754408/54964

As seen in the correlation plots above, there is a lack of strong correlation with regard to valence, although there are some weak positive and negative correlations. This may indicate that our current graphs and data set do not provide any evidence of whether valence makes a person happier.
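To read the valence correlations as numbers and not only as circle sizes, the valence row can be pulled out of the correlation matrix and sorted. This is a minimal sketch on simulated data (hypothetical values); it also selects numeric columns by type, which avoids relying on column positions such as [12:23].

```r
set.seed(2)  # Hypothetical simulated data, not the Spotify values

toy <- data.frame(
  genre   = sample(c("pop", "rock"), 50, replace = TRUE),
  energy  = runif(50),
  tempo   = runif(50, 60, 180),
  valence = runif(50)
)

# Keep numeric columns by type instead of by position
numeric_cols <- toy[sapply(toy, is.numeric)]
cor_matrix   <- cor(numeric_cols)

# Pull out the valence row, drop valence vs itself,
# and sort by absolute strength of the correlation
valence_cor <- cor_matrix["valence", colnames(cor_matrix) != "valence"]
valence_cor <- valence_cor[order(abs(valence_cor), decreasing = TRUE)]
print(round(valence_cor, 3))
```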

We may also use regression plots to see whether the results would contribute to our research, but we currently do not have much knowledge about regression.
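As a starting point, a linear regression can be fitted in base R with lm(). The following is only a sketch on simulated data: the `toy` data frame, the choice of predictors, and the simulated relationship are all assumptions for illustration and are not results from the Spotify data.

```r
set.seed(3)  # Hypothetical simulated data, not the Spotify values

toy <- data.frame(energy = runif(100), danceability = runif(100))
# Simulate valence that depends on danceability, plus random noise
toy$valence <- 0.3 + 0.4 * toy$danceability + rnorm(100, sd = 0.1)

# Fit valence as a linear function of two predictors
fit <- lm(valence ~ energy + danceability, data = toy)

# Each coefficient estimates the change in valence
# per unit change in that predictor, holding the other fixed
print(coef(fit))

# R-squared is the share of the variance in valence explained by the model
print(summary(fit)$r.squared)
```

The same formula could be applied to the real data by replacing `toy` with the spotify data frame, since those column names exist in it.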

Summary

While there might not be any strong correlation between what makes a song feel positive (valence) and the other variables in this data set, some graphs do point to a possible correlation among the top tracks. Whether it is positive or negative, it is possible that an unseen relationship among the other variables affected the valence of those tracks and that those tracks created a trend. Most of the other tracks point to the idea that there is an extremely weak correlation between valence and the other variables.