By: Nuranissa and Branden
Our report explores what contributes to a song’s valence (positivity). We want to learn whether there are any patterns in what produces valence. We used the R programming language with a data set of songs from Spotify, and graphed each song’s valence against its other attributes using bar plots and scatter plots. If this project produces substantial results, the information may help psychologists and therapists learn which factors of a song are likely to improve a listener’s happiness. This may help in recommending or producing songs that boost a patient’s emotions and help relieve their anxiety or depression.
For this project, we loaded the following standard packages:
library(tidyverse)
library(modelr)
library('corrplot')
library(knitr) # Optional
Tidyverse is a collection of packages used to transform and visualise data, so it is required to graph our data.
Modelr is a package that provides modelling functions that fit into data-manipulation pipelines, so we loaded it to support our graphs.
'corrplot' is a package that plots correlations between variables, so it is required for our correlation plots.
Knitr is a package that allows R code to be integrated into an R Markdown file or HTML. It is not necessary to reproduce the results of this project, but it allows for greater flexibility when creating a report.
Originally, the data contained a total of 32,833 observations with 15 missing values. Of the 15 missing values, 5 were track names, 5 were artist names, and 5 were album names. They are marked with NA. We found it peculiar that all of the missing values are names; however, our research does not consider names, since names are subjective, and we focused on the musical elements that are relevant to valence.
We imported and cleaned the data with the following steps:
Import the data with read.csv() into a new data frame called “spotify”. The function reads a .csv file and imports it into a data frame in R.
spotify <- read.csv("spotify_songs.csv")
Check for missing values with colSums(is.na()). First, is.na() marks every missing value in the data frame. colSums() then returns the sum of each column/category. Combined, colSums(is.na()) produces a named list of the columns in the data frame with the number of missing values in each. 0 means there are no missing values in that column. As mentioned earlier, we found that track names, artist names, and album names each have 5 missing values (totalling 15 missing values).
colSums(is.na(spotify))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
Remove the rows that contain missing values with na.omit().
spotify <- na.omit(spotify)
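The cleaning steps above can be tried out on a small, made-up data frame before running them on the full file. This is only a minimal sketch: the `toy` data frame and its values are hypothetical and are not rows from spotify_songs.csv.

```r
# Hypothetical toy data standing in for the Spotify data frame
toy <- data.frame(
  track_name   = c("Song A", NA, "Song C"),
  track_artist = c("Artist X", NA, "Artist Z"),
  valence      = c(0.91, 0.45, 0.12),
  stringsAsFactors = FALSE
)

# Count the missing values in each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Drop every row that contains at least one NA
toy_clean <- na.omit(toy)
print(nrow(toy_clean))
```

Note that na.omit() removes whole rows, not individual cells. In our data the row count drops from 32,833 to 32,828, which suggests that the 15 missing values sit in the same 5 rows.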
Below is the clean data frame with the top observations with regard to valence:
clean_data <- spotify %>%
  # Arranges the valence from highest to lowest (in descending order)
  arrange(desc(valence)) %>%
  # Selects the top 10 rows by valence (tied values are kept)
  slice_max(valence, n = 10) %>%
  # Selects only the variables listed below to be outputted
  select(valence, playlist_genre, loudness, duration_ms, energy, track_popularity,
         danceability, speechiness, acousticness, instrumentalness, liveness, tempo)
Below are summaries of the variables above, which can be produced with summary().
For example:
# Summarises all categories/variables in the spotify data frame
summary(spotify)
# Summarises the selected category/variable from the spotify data frame
# This would summarise only the valence from spotify
summary(spotify$valence)
Valence:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3310 0.5120 0.5106 0.6930 0.9910
Valence is the measure of musical positivity in a song. With a mean and median of about 0.51, most songs have a moderate valence, meaning that they evoke some positivity in listeners, but not strongly. Most interesting is that at least one song has a valence of 0, so that song is likely to sound very sad to its listeners.
Playlist Genre:
## Length Class Mode
## 32828 character character
The length is 32828 for all character variables. This only reflects how many observations there are, which does not provide any insight into our research.
Loudness:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -46.448 -8.171 -6.166 -6.720 -4.645 1.275
The most interesting takeaway is that loudness goes as high as 1.275 dB and as low as -46.45 dB, even though songs typically range from -60 dB to 0 dB. The higher the dB value, the louder the song. These values appear to be measured relative to a digital reference level rather than as sound pressure in a room, so a song at about 1 dB should not be compared with everyday sounds such as a whisper. On Spotify, most songs are around -6 dB due to various factors, such as volume normalisation and playback systems.
Duration:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4000 187804 216000 225797 253581 517810
The duration is measured in milliseconds (ms). The mean is about 3.76 minutes, and the longest song is 8.63 minutes long. What is surprising is that there is a song that is only 4 seconds long.
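The conversions behind these figures can be checked directly. The short sketch below only reuses the numbers printed in the summary() output above; the variable names are ours.

```r
# Values copied from the summary() output for duration_ms
mean_ms <- 225797
max_ms  <- 517810
min_ms  <- 4000

# There are 60,000 ms in a minute and 1,000 ms in a second
mean_min <- mean_ms / 60000
max_min  <- max_ms / 60000
min_sec  <- min_ms / 1000

print(round(c(mean_min = mean_min, max_min = max_min, min_sec = min_sec), 2))
```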
Energy:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000175 0.581000 0.721000 0.698603 0.840000 1.000000
Energy represents the intensity perceived in a song. The faster and more intense the beat, the higher the energy. The mean is approximately 0.70, so most songs are fairly fast-paced. Pop music would fall around the mean. The highest value is 1, so high-intensity songs, like metal music, would sit at the high end of the scale. Low-bass music would sit at the low end of the scale.
Track Popularity:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 24.00 45.00 42.48 62.00 100.00
Interestingly enough, the mean song popularity is 42.48 on a scale of 0-100. The higher the number, the more popular the song is. The mean and the median (45) both sit near the middle of the scale, which suggests that the data is not heavily biased towards songs with either high or low popularity, so the distribution of songs is fairly balanced.
Danceability:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.5630 0.6720 0.6549 0.7610 0.9830
Danceability is based on various musical elements, such as tempo and rhythm stability. Most songs seem to have a lot of danceability, with the mean being over 0.65. Songs with 0 danceability are likely to be associated with less energy and a lower tone.
Speechiness:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0410 0.0625 0.1071 0.1320 0.9180
Speechiness is the measure of how much spoken word there is in a song. With a mean of 0.11, most songs contain few spoken words, and the summary alone does not tell us whether speechiness is associated with high or low valence. Songs with 0 speechiness could be music without spoken words, such as instrumentals. Those with high speechiness could have little-to-no music and be full of spoken words, like an audiobook or podcast, although at that point they are not really considered songs.
Acousticness:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0151 0.0804 0.1754 0.2550 0.9940
The higher the acousticness of a song, the greater the presence of acoustic instruments as opposed to electronic amplification. With a mean of 0.18, most songs rely on sounds produced electronically rather than traditionally through instruments. Songs with high acousticness appear in small numbers compared to those with low-to-no acousticness, which may suggest a trend away from the use of acoustic instruments in music.
Instrumentalness:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000000 0.0000000 0.0000161 0.0847599 0.0048300 0.9940000
Instrumentalness refers to the absence of vocals. The fewer vocals there are in a song, the higher its instrumentalness. With the mean being 0.08 and the median being 0.0000161, most songs include some-to-many vocals throughout their duration. Thus, there are very few songs with high instrumentalness (little-to-no vocals) in the data.
Liveness:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0927 0.1270 0.1902 0.2480 0.9960
Liveness measures the presence of an audience in a song. The greater the presence of an audience, the higher the liveness. The mean is 0.19, so most songs have little-to-no audience present. These songs are likely to be produced in a studio, so there is no audience apart from background vocals. Songs with high liveness are likely to be recorded from a live performance.
Tempo:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 99.96 121.98 120.88 133.92 239.44
Tempo is the pace of a song, measured in beats per minute (BPM). The higher the BPM, the faster the tempo. The mean is 121 BPM, which implies most songs are medium-paced. What is interesting is that there is a song with 0 BPM, which should not be possible unless the song has no music or has an extremely short duration, like the 4-second song mentioned earlier.
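One way to check whether the 0 BPM song is also the 4-second song would be to filter for suspicious rows. The sketch below uses a hypothetical `toy` data frame, and the 30,000 ms cut-off for "extremely short" is our own assumption, not a value from the data set.

```r
# Hypothetical rows standing in for the Spotify data frame
toy <- data.frame(
  track_id    = c("a", "b", "c", "d"),
  tempo       = c(0, 121.98, 99.96, 0),
  duration_ms = c(4000, 216000, 187804, 250000),
  stringsAsFactors = FALSE
)

# Assumed cut-off: anything under 30 seconds counts as extremely short
short_ms <- 30000

# Keep rows with no detected tempo or a very short duration
suspicious <- subset(toy, tempo == 0 | duration_ms < short_ms)
print(suspicious)
```

Running the same subset() call on the real data frame would show whether the zero-tempo and very short tracks are the same rows.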
We sliced the data, created new variables, and used plotting functions to visualise and analyse our graphs. Each slice is stored in a new data frame, which lets us summarise the new information separately from the full data.
We will use the slice functions to select specific rows to determine songs with top valence and songs with low valence.
slice_max() - Selects rows with the highest value
slice_min() - Selects rows with the lowest value
# Selects 1500 rows of the highest valence
spotify_top_valence <- spotify %>%
  slice_max(valence, n = 1500)
# Selects 1500 rows of the lowest valence
spotify_bottom_valence <- spotify %>%
  slice_min(valence, n = 1500)
# Selects the top 10 rows of valence
top_ten_valence <- spotify %>%
  slice_max(valence, n = 10)
# Selects the bottom 10 rows of valence
bottom_ten_valence <- spotify %>%
  slice_min(valence, n = 10)
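One detail worth knowing about slice_max() and slice_min() is that they keep tied values by default, so the result can contain more rows than n. The sketch below shows this on a hypothetical four-row data frame; the `with_ties` argument is part of dplyr, while the data is made up.

```r
library(dplyr)

# Hypothetical data with two songs tied for the highest valence
toy <- data.frame(
  track   = c("a", "b", "c", "d"),
  valence = c(0.9, 0.9, 0.8, 0.5)
)

# Default behaviour: both tied rows are returned even though n = 1
with_ties <- toy %>% slice_max(valence, n = 1)

# with_ties = FALSE returns exactly n rows
no_ties <- toy %>% slice_max(valence, n = 1, with_ties = FALSE)

print(nrow(with_ties))
print(nrow(no_ties))
```

If an exact count of 1500 or 10 rows matters, `with_ties = FALSE` would enforce it, at the cost of dropping some tied songs arbitrarily.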
We used the new variables to plot bar charts and scatter plots against other musical elements.
For the song genre, we plotted a bar chart since the genre is a character variable, which is not suitable for scatter plots. We used barplot(table()) to achieve this. We first needed to transform the genres of the high-valence songs into a table. This counts how many songs fall into each genre, which makes it easy to graph the bar plot. If this is not done, barplot() will output an error because it was not given a vector or matrix.
# Graphs a barplot about the count of top valence songs in each genre
barplot(table(spotify_top_valence$playlist_genre),
main = "Genre Count Top Valence")
This bar chart shows how many songs from each genre are in the top valence data frame. We can see that Latin songs appear most often among the top-valence songs, while the other genres are fairly close to each other in count. The only other genres that stand out are Rock, in second place, and EDM, with the lowest count.
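To see what table() hands to barplot(), it can help to look at the counts on their own. This is a small sketch with a made-up vector of genres, not the actual genre column.

```r
# Hypothetical genre labels standing in for playlist_genre
genres <- c("latin", "rock", "latin", "edm", "pop", "latin", "rock")

# table() counts how many times each label appears
genre_counts <- table(genres)
print(genre_counts)

# Sorting the counts puts the most frequent genre first
sorted_counts <- sort(genre_counts, decreasing = TRUE)
print(sorted_counts)
```

Passing `sorted_counts` to barplot() instead of the unsorted table would draw the bars from the most to the least common genre, which can make the ranking easier to read.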
All other variables, which are numeric, will be graphed in scatter
plots using ggplot().
With scatter plots, we can determine if there is a pattern in the relationship.
# Plots scatter plot relationship between valence and energy from songs with top valence
ggplot(spotify_top_valence, aes(energy, valence)) +
  # Set points to be a noticeable size
  geom_point(size = 2) +
  geom_smooth(method = "lm") +
  # Set the top ten valence songs to be coloured red
  geom_point(data = top_ten_valence, color = "red") +
  geom_smooth(data = top_ten_valence, color = "red", method = "lm") +
  ggtitle("Top Valence Values Compared to Energy") +
  # Adjust title to the centre of the graph
  theme(plot.title.position = 'plot',
        plot.title = element_text(hjust = .5))
This graph compares energy and valence within the top valence songs to see if there is a correlation. It highlights the top ten valence tracks in red with a separate trend line, while the blue line is fitted to all of the tracks in the top valence data frame. The blue line shows no correlation, while there might be a weak negative correlation among the top ten valence tracks. This may imply that the top valence songs are leaning towards less energy due to various factors. Maybe positivity does not need the highest beats, and more people are finding comfort in tamer beats and energy.
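The trend lines drawn by geom_smooth(method = "lm") can also be put into numbers with lm(), cor(), and cor.test(). The following is a minimal sketch on randomly generated stand-in data (ten hypothetical tracks with nearly identical valence), so its output says nothing about the real songs; it only illustrates how weak a ten-point trend can be checked.

```r
set.seed(1)

# Hypothetical stand-in for ten top tracks: valence barely varies
energy  <- runif(10, min = 0.3, max = 1)
valence <- 0.98 + rnorm(10, sd = 0.005)

# Slope of the straight line that geom_smooth(method = "lm") would draw
fit   <- lm(valence ~ energy)
slope <- unname(coef(fit)["energy"])

# Pearson correlation and a test of whether it differs from zero
r        <- cor(energy, valence)
cor_test <- cor.test(energy, valence)

print(c(slope = slope, r = r, p_value = cor_test$p.value))
```

The slope and the correlation always share the same sign, since the slope equals r multiplied by sd(valence) / sd(energy). With only ten points, the p-value from cor.test() is a useful reminder of how uncertain such a trend is.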
# Plots scatter plot relationship between valence and popularity from songs with top valence
ggplot(spotify_top_valence, aes(track_popularity, valence)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm") +
  # Set the top ten valence songs to be coloured red
  geom_point(data = top_ten_valence, color = "red") +
  geom_smooth(data = top_ten_valence, color = "red", method = "lm") +
  ggtitle("Top Valence Values Compared to Popularity") +
  # Adjust title to the centre of the graph
  theme(plot.title.position = 'plot',
        plot.title = element_text(hjust = .5))
While there is no correlation across the top valence tracks as a whole, there is a weak positive correlation among the top ten valence tracks. High valence songs may not be as popular as they used to be. More people may be finding low valence songs to be more to their taste. This does not necessarily mean that these people are unhappy with more positive songs. It implies that people have different tastes, and so some may find more joy in listening to more sombre songs.
# Plots scatter plot relationship between valence and instrumentalness from songs with top valence
ggplot(spotify_top_valence, aes(instrumentalness, valence)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm") +
  # Set the top ten valence songs to be coloured red
  geom_point(data = top_ten_valence, color = "red") +
  geom_smooth(data = top_ten_valence, color = "red", method = "lm") +
  ggtitle("Top Valence Values Compared to Instrumentalness") +
  # Adjust title to the centre of the graph
  theme(plot.title.position = 'plot',
        plot.title = element_text(hjust = .5))
While there is no correlation across the top valence tracks as a whole, there is a weak positive correlation among the top ten valence tracks. This implies that the most positive songs include more instrumentals and fewer vocals. Songs like this can be disco and club songs, which are played at large gatherings, such as a club. At these gatherings, many people are excited and happy in general.
# Plots scatter plot relationship between valence and speechiness from songs with bottom valence
ggplot(spotify_bottom_valence, aes(speechiness, valence)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm") +
  # Set the bottom ten valence songs to be coloured red
  geom_point(data = bottom_ten_valence, color = "red") +
  geom_smooth(data = bottom_ten_valence, method = "lm", color = "red") +
  ggtitle("Bottom Valence Values Compared to Speechiness") +
  # Adjust title to the centre of the graph
  theme(plot.title.position = 'plot',
        plot.title = element_text(hjust = .5))
This is a scatter plot of the bottom 1500 valence tracks, used to see if there is any correlation between speechiness and valence. The red points and the separate red trend line highlight the bottom ten valence tracks, while the blue line is fitted to all of the tracks in the bottom valence data frame. There is no correlation across the bottom valence tracks as a whole, but there is a weak positive correlation among the bottom ten tracks. This implies that low valence songs may be trending towards including more words. The trend seems to be straying away from wordless music, such as orchestral pieces, so more people may prefer some words in songs. It may be that some people find comfort or positivity when there are words in songs, perhaps because they find the lyrics relatable to themselves or to someone they know who makes them happy.
We summarised the correlations using cor(), which computes the correlation between each pair of numeric variables in the spotify data set, and displayed them with corrplot(). As below, we used coloured circles to show how strong or weak the correlation is between two variables.
# Displays a correlation graph in the top valence songs with simple, circle visuals
top_cor <- cor(spotify_top_valence[12:23])
corrplot(top_cor, type = "upper", method = "circle", order = "alphabet",
title = "Top Valence Correlations",
# Makes the title completely above the graph
mar=c(0,0,1,0)) # Source: http://stackoverflow.com/a/14754408/54964
# Displays a correlation graph in the bottom valence songs with simple, circle visuals
bottom_cor <- cor(spotify_bottom_valence[12:23])
corrplot(bottom_cor, type = "upper", method = "circle", order = "alphabet",
title = "Bottom Valence Correlations",
# Makes the title completely above the graph
mar=c(0,0,1,0)) # Source: http://stackoverflow.com/a/14754408/54964
As seen in the correlation plots above, there is a lack of strong correlation with regard to valence. However, there are some weak positive and negative correlations. This may indicate that our current graphs and data set do not provide any evidence as to whether valence makes a person happier.
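To read the valence correlations as numbers rather than circle sizes, the valence row can be pulled out of the matrix returned by cor() and ranked by strength. The sketch below runs on randomly generated stand-in data in which valence is deliberately tied to danceability, so the result is illustrative only. Selecting columns with is.numeric instead of by position (12:23) is our own variation, not the method used above.

```r
set.seed(42)
n <- 200

# Hypothetical stand-in data; valence is built to depend on danceability
toy <- data.frame(
  playlist_genre = sample(c("pop", "rock"), n, replace = TRUE),
  energy         = runif(n),
  danceability   = runif(n),
  stringsAsFactors = FALSE
)
toy$valence <- 0.5 * toy$danceability + rnorm(n, sd = 0.1)

# Keep only numeric columns, then compute the correlation matrix
numeric_cols <- sapply(toy, is.numeric)
cor_matrix   <- cor(toy[, numeric_cols])

# Valence's correlation with every other variable, strongest first
valence_cor <- cor_matrix["valence", colnames(cor_matrix) != "valence"]
ranked      <- valence_cor[order(abs(valence_cor), decreasing = TRUE)]
print(round(ranked, 2))
```

Applied to `top_cor` or `bottom_cor`, the same indexing and ordering would list which variables are most strongly related to valence in each slice.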
We may also use more regression graphs to see if the results would contribute to our research, but we currently do not have much knowledge about regression.
While there might not be any strong correlation between what makes a song feel positive (valence) and the other variables in this data set, some of the graphs do point to a possible correlation among the top tracks. Whether it is positive or negative, it is possible that an unseen relationship with other variables affected the valence of those tracks and that those tracks created a trend. Most of the other tracks point to the idea that there is an extremely weak correlation between valence and the other variables.