By: Nuranissa and Branden
Our report explores what contributes to a song’s valence (positivity). We want to learn if there are any patterns in producing valence. We used the R programming language to produce relevant graphs. We used data containing songs from Spotify and graphed their valence against their other song factors using bar plots and scatter plots. This information will help psychologists and therapists learn which factors of a song are likely to improve a listener’s happiness. This can help in recommending or producing songs for the purpose of boosting a patient’s emotions to help relieve their anxiety or depression.
We used the following package:
library(tidyverse)
Tidyverse is a package used to transform and visualise data, so it is required to graph our data.
Originally, the data contained a total of 32,833 observations with 15 missing values. Out of the 15 missing values, 5 were track names, 5 were artist names, and 5 were album names; they are marked with NA. We found these missing values peculiar as they are all names. However, our research does not consider names, since names are subjective, and we focused on the musical elements relevant to valence.
We imported and cleaned the data with the following steps:
First, we imported the data with read.csv() into a new data frame called “spotify”. The read.csv() function reads and imports .csv files into a data frame in R.

spotify <- read.csv("spotify_songs.csv")
Next, we checked for missing values with colSums(is.na()). First of all, is.na() checks whether each value in the data frame is missing. colSums() returns the sum for each column/category. With these two functions combined, colSums(is.na()) produces a list of the categories/columns in the data frame along with how many missing values each one has; 0 means there are no missing values in that category. As mentioned earlier, we found that track names, artist names, and album names each have 5 missing values (totaling 15 missing values).

colSums(is.na(spotify))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
Lastly, we removed the rows containing missing values with na.omit().

spotify <- na.omit(spotify)
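To confirm the cleaning worked, we can re-run the missing-value check and count the remaining rows. Since the 15 NAs sit on 5 rows, 32,828 of the original 32,833 rows should remain, matching the length reported for the character variables below:

# Confirms that no missing values remain after cleaning
colSums(is.na(spotify))
# Counts the remaining rows: 32828 after dropping the 5 incomplete rows
nrow(spotify)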
Below is the cleaned data frame showing the top observations in regard to valence:
spotify %>%
# Arranges the valence from highest to lowest (in descending order)
arrange(desc(valence)) %>%
# Selects the top 10 rows by valence (ties are kept, which is why 11 rows appear below)
top_n(n = 10, wt = valence) %>%
# Selects only the variables listed below to be outputted
select(valence, playlist_genre, loudness, duration_ms, energy, track_popularity,
danceability, speechiness, acousticness, instrumentalness, liveness, tempo)
## valence playlist_genre loudness duration_ms energy track_popularity
## 1 0.991 rock -9.680 191333 0.676 36
## 2 0.990 r&b -10.989 191560 0.647 59
## 3 0.985 rock -15.308 223867 0.378 68
## 4 0.984 r&b -13.575 291000 0.725 25
## 5 0.983 edm -6.415 412973 0.973 52
## 6 0.981 pop -12.677 240707 0.630 54
## 7 0.981 pop -8.848 173195 0.740 49
## 8 0.980 edm -5.689 196360 0.724 0
## 9 0.979 r&b -7.327 140182 0.940 60
## 10 0.979 r&b -7.327 140182 0.940 60
## 11 0.979 edm -10.053 105234 0.773 21
## danceability speechiness acousticness instrumentalness liveness tempo
## 1 0.815 0.0470 8.25e-02 6.62e-01 0.0508 139.764
## 2 0.811 0.0498 8.23e-02 6.81e-01 0.0572 139.787
## 3 0.758 0.0449 2.84e-01 0.00e+00 0.0490 120.736
## 4 0.717 0.0295 3.82e-01 2.28e-01 0.3320 132.014
## 5 0.747 0.0370 3.62e-04 5.65e-01 0.3080 128.127
## 6 0.780 0.0321 2.61e-01 1.59e-05 0.0550 136.719
## 7 0.880 0.0327 2.01e-01 3.57e-05 0.0772 119.985
## 8 0.837 0.0524 7.06e-02 3.41e-04 0.0715 119.983
## 9 0.784 0.0540 7.16e-02 1.71e-04 0.1330 137.003
## 10 0.784 0.0540 7.16e-02 1.71e-04 0.1330 137.003
## 11 0.742 0.0617 1.38e-05 8.25e-01 0.0517 128.006
Below are summaries of the variables above, which can be produced using summary().
For example:
# Summarises all categories/variables in the spotify data frame
summary(spotify)
# Summarises the selected category/variable from the spotify data frame
# This would summarise only the valence from spotify
summary(spotify$valence)
Valence:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3310 0.5120 0.5106 0.6930 0.9910
Valence is the measure of musical positivity in a song. Most songs cluster around the middle of the scale, meaning that they evoke a moderate amount of positivity in listeners but not strongly. Most interesting is that at least one song has a valence of 0, so that song is likely to bring only sadness to its listeners.
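As a quick sketch (the selected columns are just examples), filter() lets us inspect the zero-valence songs directly:

# Lists the songs with a valence of 0 alongside a few musical elements
spotify %>%
  filter(valence == 0) %>%
  select(playlist_genre, energy, tempo, danceability)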
Playlist Genre:
## Length Class Mode
## 32828 character character
The length is 32828 for all character variables. This only reflects how many observations there are, which does not provide any insight into our research.
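Since summary() tells us little about character variables, a simple frequency table (not part of our original steps, just a sketch) shows how the songs spread across genres:

# Counts how many songs fall into each playlist genre
table(spotify$playlist_genre)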
Loudness:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -46.448 -8.171 -6.166 -6.720 -4.645 1.275
The most interesting thing to take away from this is that loudness ranges from as low as -46.45 dB to just above 1 dB, even though songs typically fall between -60 dB and 0 dB. The higher the dB value, the louder the song; note that these values measure loudness relative to a 0 dB maximum rather than absolute sound pressure, so a positive value means the track pushes past the usual ceiling. On Spotify, most songs sit around -6 dB due to various factors, such as volume normalisation and playback systems.
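As a quick sketch using the columns above, we can count how many tracks actually push past the usual 0 dB ceiling:

# Counts songs louder than 0 dB and reports the loudest value
spotify %>%
  filter(loudness > 0) %>%
  summarise(count = n(), max_loudness = max(loudness))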
Duration:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4000 187804 216000 225797 253581 517810
The duration is measured in milliseconds (ms), so most songs are around 3.76 minutes long, with the longest being 8.63 minutes. What is unbelievable is that there is a song that is only 4 seconds long.
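Because duration_ms is in milliseconds, a small conversion sketch (the duration_min name is ours) makes these figures easy to verify:

# Converts duration to minutes: 225797 ms / 60000 is roughly 3.76 minutes
spotify %>%
  mutate(duration_min = duration_ms / 60000) %>%
  summarise(mean_min = mean(duration_min),
            shortest_min = min(duration_min),
            longest_min = max(duration_min))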
Energy:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000175 0.581000 0.721000 0.698603 0.840000 1.000000
Energy represents the intensity perceived in a song: the faster and more upbeat the song is, the higher its energy. The mean is approximately 0.70, so most songs are fast-paced. Pop music would fall around the mean. The highest is 1, so high-intensity songs like metal music would sit at the high end of the scale, while low-bass music would sit at the low end.
Track Popularity:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 24.00 45.00 42.48 62.00 100.00
Interestingly enough, the mean song popularity is 42.48 on a 0-100 scale, where the higher the number, the more popular the song. The mean shows that the data is not biased towards songs with either high or low popularity, so the distribution of songs provided is fair.
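One way to check this claim is a quick histogram (base R, shown only as a sketch):

# Plots the spread of track popularity to check for bias
hist(spotify$track_popularity,
     main = "Distribution of Track Popularity",
     xlab = "Track Popularity (0-100)")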
Danceability:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.5630 0.6720 0.6549 0.7610 0.9830
Danceability is based on various musical elements, such as tempo and rhythm stability. Most songs seem quite danceable, with the mean being over 0.65. Songs with 0 danceability are likely to be associated with less energy and a lower tone.
Speechiness:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0410 0.0625 0.1071 0.1320 0.9180
Speechiness is the measure of how often there are spoken words in a song. With a mean of 0.11, most songs contain few spoken words, and high speechiness does not necessarily correspond to high or low valence. Songs with 0 speechiness could be upbeat music without spoken words, such as instrumentals. Those with high speechiness could have little-to-no music and be full of spoken words, like an audiobook or podcast, although at that point they are not really songs.
Acousticness:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0151 0.0804 0.1754 0.2550 0.9940
The higher the acousticness of a song, the greater the presence of instrumental sounds as opposed to electronic amplification. With a mean of 0.18, most songs rely on sounds produced electronically rather than traditionally through instruments. Songs with high acousticness appear in small numbers compared to those with low-to-no acousticness, implying a trend away from the use of instruments in music.
Instrumentalness:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000000 0.0000000 0.0000161 0.0847599 0.0048300 0.9940000
Instrumentalness measures the absence of vocals: the fewer vocals there are in a song, the higher its instrumentalness. With the mean at 0.08 and the median at 0.0000161, most songs include some-to-many vocals throughout their duration. Thus, there are very few songs with high instrumentalness (little-to-no vocals) in the data.
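To support this, we could count the songs above a high-instrumentalness cutoff (the 0.5 threshold here is our own assumption, not from the data documentation):

# Counts songs that are mostly instrumental (0.5 cutoff assumed)
spotify %>%
  filter(instrumentalness > 0.5) %>%
  summarise(count = n())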
Liveness:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0927 0.1270 0.1902 0.2480 0.9960
Liveness measures the presence of an audience in a song: the stronger the audience presence, the higher the liveness. The mean is 0.19, so most songs have little-to-no audience present. These songs were likely produced in a studio, so there is no audience beyond background singing vocals. Songs with high liveness are likely recordings of live performances.
Tempo:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 99.96 121.98 120.88 133.92 239.44
Tempo is the pacing of a song measured in beats per minute (BPM): the higher the BPM, the faster the tempo. The mean is 121 BPM, which implies most songs are medium-paced. What is interesting is that there is a song with 0 BPM, which should not be possible unless the song has no music or an extremely short duration, like the 4-second song mentioned earlier.
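A sketch like the following would let us inspect those 0 BPM songs directly and see whether they overlap with the extremely short tracks:

# Lists the songs reporting 0 BPM along with their durations
spotify %>%
  filter(tempo == 0) %>%
  select(tempo, duration_ms, energy, speechiness)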
We propose to slice the data, create new variables, and use plotting functions to visualise and analyse the results. To summarise our data, we may use a new data frame to hold the new information, especially the new variables.

We will use the slice functions to select specific rows, separating the songs with the highest valence from the songs with the lowest valence.
slice_max() - Selects rows with the highest values
slice_min() - Selects rows with the lowest values

# Selects 1500 rows of the highest valence
spotify_top_valence <- spotify %>%
slice_max(valence, n = 1500)
# Selects 1500 rows of the lowest valence
spotify_bottom_valence <- spotify %>%
slice_min(valence, n = 1500)
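Note that slice_max() and slice_min() keep tied values by default, so each result can contain slightly more than 1500 rows (passing with_ties = FALSE would force exactly 1500). A quick check:

# Row counts may exceed 1500 where valence values are tied
nrow(spotify_top_valence)
nrow(spotify_bottom_valence)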
We will use these new data frames to plot bar charts and scatter plots against the other musical elements.
For the song genre, we will plot a bar chart since the track genre is a character variable, which is not suitable for scatter plots. We will use barplot(table()) to achieve this. We first need to transform the song genre from the songs with high valence into a table. This arranges the data into a standard table, which is then easy to graph as a bar plot. If this is not done, barplot() will output an error because it was not given a vector or matrix.
# Graphs a barplot about the count of top valence songs in each genre
barplot(table(spotify_top_valence$playlist_genre),
main = "Genre Count Top Valence")
From the bar chart, we can see that Latin songs make up the largest count of top-valence songs among the genres.
All other variables, which are numeric, will be graphed in scatter plots using ggplot(). However, before we can do that, we need to learn how to use the function properly.

For now, we created a scatter plot example for loudness below. We referred to source #1 and source #2 for guidance on how to plot using ggplot().
# Plots scatter plot relationship between valence and loudness from songs with top valence
ggplot(spotify_top_valence, aes(loudness, valence)) +
# Set points to be a noticeable size
geom_point(size = 2) +
ggtitle("Top Valence Values Compared to Loudness") +
# Adjust title to the centre of the graph
theme(plot.title.position = 'plot',
plot.title = element_text(hjust = .5))
With the scatter plot shown above, we can examine whether there is a pattern in the relationship between loudness and valence.
We may also use linear regression to see if the results would contribute to our research, but we do not currently have the knowledge to create one.
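For future reference, a minimal sketch of such a model in base R might look like the following (a single-predictor example we have not yet validated for our research):

# Fits a simple linear regression of valence on loudness
valence_model <- lm(valence ~ loudness, data = spotify)
# Reports coefficients, significance, and R-squared
summary(valence_model)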