Vincent and I were interested in exploring connections between the biological sex of an artist or band and the characteristics of popular songs. Our focus is not to look for negative bias; rather, we are curious about possible trends in technical aspects of songs such as speechiness, acousticness, instrumentalness, valence, tempo, etc., to determine whether there may be underlying psychological associations tied to cultural perspectives on sex and gender. A complete analysis would require a more global or culturally diverse dataset.
In combination with the Spotify dataset provided in the project description, we have downloaded a music database from MusicBrainz (https://musicbrainz.org/), licensed under CC0 (http://creativecommons.org/publicdomain/zero/1.0/). This will allow us to query by song and artist to add additional elements to our dataset, the most important to us being the gender table. The full schema can be found here: https://musicbrainz.org/doc/MusicBrainz_Database/Schema
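As a minimal sketch of how this augmentation could work, the example below joins a gender lookup onto Spotify-style rows by artist name using dplyr. The data frames `spotify_toy` and `artist_gender`, their column names `artist_name` and `gender`, and all the values in them are made-up placeholders; the real lookup would have to be extracted from the MusicBrainz tables first, and matching on artist name alone is an assumption that may not hold for every artist.

```r
library(dplyr)

# Toy stand-in for the Spotify data (placeholder values, not real tracks)
spotify_toy <- data.frame(
  track_name   = c("Song A", "Song B", "Song C"),
  track_artist = c("Artist One", "Artist Two", "Artist Three"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup of artist name to gender label.
# The column names here are assumptions, not the MusicBrainz schema.
artist_gender <- data.frame(
  artist_name = c("Artist One", "Artist Two"),
  gender      = c("Female", "Male"),
  stringsAsFactors = FALSE
)

# A left join keeps every Spotify row; unmatched artists get NA for gender
spotify_aug <- left_join(
  spotify_toy, artist_gender,
  by = c("track_artist" = "artist_name")
)

# Count the tracks whose artist could not be matched
unmatched <- sum(is.na(spotify_aug$gender))
print(spotify_aug)
print(unmatched)
```

A left join is used here so that unmatched tracks remain visible rather than silently dropping out of the analysis.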
Besides being an interesting academic topic, this work may provide useful information to artists and to those responsible for marketing them. If the public does prefer certain types of songs based on the sex or gender of the artist, then the music industry could use these preferences to guide the distribution of music on an artist’s albums.
library(dplyr) >> for data manipulation
library(fpc) >> used for clustering analysis
We are still exploring and researching additional methodologies beyond those covered in class, in order to gain additional experience, so this list may grow in the next week.
The original data was taken from this GitHub repository. The repository collects weekly data and metadata from Spotify’s API. The data set has the following 23 variables, with types and descriptions included:
The data was imported with R’s read.csv function and assigned to the name “spotify”. Looking at the data, no real cleanup appears to be needed, as there are no missing values or abnormal values outside of what is listed in the variable descriptions above.
spotify <- read.csv("spotify/spotify_songs.csv")
In addition, the first five rows and their values are shown below.
head(spotify,5)
## track_id track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7 Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone You Loved - Future Humans Remix
## track_artist track_popularity track_album_id
## 1 Ed Sheeran 66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2 Maroon 5 67 63rPSO264uRjW1X5E6cWv6
## 3 Zara Larsson 70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers 60 1nqYsOef1yKKuGOVchbsk6
## 5 Lewis Capaldi 69 7m7vv9wlQ4i0LFuJiE2zsQ
## track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2 Memories (Dillon Francis Remix)
## 3 All the Time (Don Diablo Remix)
## 4 Call You Mine - The Remixes
## 5 Someone You Loved (Future Humans Remix)
## track_album_release_date playlist_name playlist_id playlist_genre
## 1 2019-06-14 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 2 2019-12-13 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 3 2019-07-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 4 2019-07-19 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 5 2019-03-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## playlist_subgenre danceability energy key loudness mode speechiness
## 1 dance pop 0.748 0.916 6 -2.634 1 0.0583
## 2 dance pop 0.726 0.815 11 -4.969 1 0.0373
## 3 dance pop 0.675 0.931 1 -3.432 0 0.0742
## 4 dance pop 0.718 0.930 7 -3.778 1 0.1020
## 5 dance pop 0.650 0.833 1 -4.672 1 0.0359
## acousticness instrumentalness liveness valence tempo duration_ms
## 1 0.1020 0.00e+00 0.0653 0.518 122.036 194754
## 2 0.0724 4.21e-03 0.3570 0.693 99.972 162600
## 3 0.0794 2.33e-05 0.1100 0.613 124.008 176616
## 4 0.0287 9.43e-06 0.2040 0.277 121.956 169093
## 5 0.0803 0.00e+00 0.0833 0.725 123.976 189052
The Spotify data set is about as clean as a data set can be, so there was no need to handle NAs or apply the other cleansing techniques we have used in the past. We are still discussing and researching different composite variables we would like to add to the data set, along with the augmentation from the MusicBrainz database. We realize that we are behind where we should be for this submission, but we expect to reach that project milestone within two days of this submission.
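The kind of check behind a "no missing values" claim can be sketched as follows. The toy data frame below borrows a few column names from the data set, but one NA is planted deliberately to show what the check reports; the 0 to 1 range for valence is an assumption about the variable description.

```r
# Toy frame with one deliberately planted NA (not the real data)
toy <- data.frame(
  track_popularity = c(66, 67, NA),
  valence          = c(0.518, 0.693, 0.613),
  tempo            = c(122.036, 99.972, 124.008)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Single TRUE/FALSE answer for the whole data frame
any_missing <- anyNA(toy)

# Range check, assuming valence is documented as lying in [0, 1]
valence_ok <- all(toy$valence >= 0 & toy$valence <= 1, na.rm = TRUE)

print(na_counts)
print(any_missing)
print(valence_ok)
```

Running the same three checks on the full `spotify` data frame would document the claim directly in the report.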
The first activity we undertook, which is still in progress, is looking at the distributions of each of the individual music characteristics. What appealed to us about the data set is that many of the data points are “code snippets” rather than what we classified as “flat stats”. For example, in a sports data set, Passes Caught or Rush Yards are flat in the sense that they are single numbers when external context and relationships are not considered. Music characteristics, though, similar to genetics, are “coded” using notes and other elements from musicology. The implication is that there are different ways to achieve the same numeric value for valence, for example.
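A first look at a single distribution might be sketched like this, using only the five valence values visible in the head() output above. The bin edges are an arbitrary choice for illustration, and five rows say nothing about the full data set.

```r
# Valence values from the five rows printed by head(spotify, 5)
valence <- c(0.518, 0.693, 0.613, 0.277, 0.725)

# Five-number summary plus the mean
valence_summary <- summary(valence)

# Bin the values without drawing a plot; breaks are an illustrative choice
valence_hist <- hist(valence, breaks = seq(0, 1, by = 0.25), plot = FALSE)

print(valence_summary)
print(valence_hist$counts)
```

With the full data, dropping `plot = FALSE` draws the histogram itself, and the same call can be repeated for each audio feature.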
Therefore, digging deeply into the individual distributions of the variables has greater significance for discovering patterns before we explore relationships between variables, and we anticipate it may produce additional variables that we will create from the existing data. An additional outcome will be the creation of composite variables, to explore whether the analysis changes when certain variables are broken down into meaningful tuples. This is where KNN and classification tree analysis will play an important role in uncovering “hidden” relationships that lead to new variables for our team to create.
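To illustrate how a classification tree can surface which variable separates groups, here is a small sketch using the rpart package, which is an assumption on our part since it is not in the package list above. The data is synthetic and constructed so that speechiness separates the two hypothetical genre labels perfectly, while tempo carries no information.

```r
library(rpart)

# Synthetic data: low speechiness for "pop", high for "rap" (made-up values)
toy <- data.frame(
  speechiness = c(seq(0.03, 0.12, length.out = 10),
                  seq(0.25, 0.43, length.out = 10)),
  tempo       = rep(c(100, 120), times = 10),
  genre       = factor(rep(c("pop", "rap"), each = 10))
)

# Fit a classification tree; minsplit is lowered because the data is tiny
fit <- rpart(
  genre ~ speechiness + tempo,
  data = toy,
  method = "class",
  control = rpart.control(minsplit = 2)
)

# Variable chosen for the root split
root_variable <- as.character(fit$frame$var[1])

# Accuracy on the training data (optimistic by construction)
predicted <- predict(fit, toy, type = "class")
accuracy <- mean(predicted == toy$genre)

print(root_variable)
print(accuracy)
```

On real data the split variables and thresholds chosen by the tree are candidates for the composite variables described above, although training accuracy alone would overstate how well they generalize.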
Regarding the types of visualizations, we have already mentioned classification trees as an important part of our analysis. In addition, simple histograms and scatterplots will play a key role in the initial phases (as is true of all data analysis). One of our initial hypotheses (which may prove incorrect, in which case we will adjust accordingly) is that clustering could be a powerful tool for a data set like this. We plan to explore what results we can generate from K-Means and Hierarchical Clustering, with the related plots. Beyond visualizations to drive EDA, there will be a section of plots that shows our progress and decision-making process in variable selection and modification.
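A rough sketch of the two clustering approaches, using base R's `kmeans` and `hclust` rather than the fpc package, is shown below. The six rows are invented to form two obvious groups, and the choice of two clusters and complete linkage is an assumption made only for illustration.

```r
set.seed(42)  # kmeans uses random starting centers

# Invented feature values forming two well-separated groups
features <- data.frame(
  energy       = c(0.90, 0.92, 0.88, 0.20, 0.25, 0.22),
  acousticness = c(0.05, 0.08, 0.03, 0.85, 0.90, 0.80)
)

# Standardize so that no feature dominates the distance calculation
scaled <- scale(features)

# K-means with several random restarts
km <- kmeans(scaled, centers = 2, nstart = 10)

# Hierarchical clustering on Euclidean distances, cut into two groups
hc <- hclust(dist(scaled), method = "complete")
hc_groups <- cutree(hc, k = 2)

# Cross-tabulate the two sets of cluster assignments
agreement <- table(kmeans = km$cluster, hclust = hc_groups)
print(agreement)
```

Calling `plot(hc)` draws the dendrogram, and a scatterplot of the features colored by `km$cluster` gives the corresponding K-Means view.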
Depending on the results of our analysis, and time permitting, we would like to provide an analytic comparison of advanced tree models, such as random forest and boosting, against the basic tree models mentioned above. This will come into play more in the predictive part of our project, when we attempt to make recommendations to artists or music companies on underlying cultural preferences.