Results

Introduction

Vincent and I were interested in exploring connections between the biological sex of an artist or band and the characteristics of their popular songs. Our focus is not to look for negative bias. Rather, we are curious about possible trends in technical aspects of songs, such as speechiness, acousticness, instrumentalness, valence, and tempo, to determine whether there may be underlying psychological associations with cultural perspectives on sex and gender. A complete analysis would require a more global or culturally diverse dataset.

In combination with the Spotify dataset provided in the project description, we have downloaded a music database from MusicBrainz (https://musicbrainz.org/), licensed under CC0 (http://creativecommons.org/publicdomain/zero/1.0/). This will allow us to query by song and artist to add elements to our dataset, the most important to us being the gender table. The full schema can be found here: https://musicbrainz.org/doc/MusicBrainz_Database/Schema
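As a rough sketch of how the gender information might be attached to the Spotify tracks, the example below joins by artist name using base R. The three data frames are invented stand-ins: the column names mirror the published MusicBrainz schema (artist.gender referencing gender.id), but the rows, the object names, and the choice to match on exact artist name are all assumptions made for illustration.

```r
# Hypothetical stand-ins for the Spotify data and two MusicBrainz tables.
# The rows are invented; only the column layout follows the schema.
spotify_sample <- data.frame(
  track_name   = c("Song A", "Song B", "Song C"),
  track_artist = c("Artist One", "Artist Two", "Unknown Act"),
  stringsAsFactors = FALSE
)
mb_artist <- data.frame(
  name   = c("Artist One", "Artist Two"),
  gender = c(1L, 2L),
  stringsAsFactors = FALSE
)
mb_gender <- data.frame(
  id   = c(1L, 2L),
  name = c("Male", "Female"),
  stringsAsFactors = FALSE
)

# Resolve each artist's gender id to its label.
artist_gender <- merge(mb_artist, mb_gender, by.x = "gender", by.y = "id",
                       suffixes = c("_artist", "_gender"))
artist_gender <- artist_gender[, c("name_artist", "name_gender")]
names(artist_gender) <- c("track_artist", "artist_gender")

# all.x = TRUE keeps tracks whose artist has no match (gender becomes NA).
spotify_gender <- merge(spotify_sample, artist_gender,
                        by = "track_artist", all.x = TRUE)
print(spotify_gender)
```

Exact name matching is the simplest option but is likely to miss remixes, collaborations, and spelling variants, so the share of NA values after the join is worth checking before any analysis.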

Besides being an interesting academic topic, this may provide useful information to artists and to those responsible for marketing them. If the public does prefer certain types of songs based on the sex or gender of the artist, then the music industry could use these preferences to guide the distribution of music on an artist’s albums.

Packages Required

library(spotifyr) >> used to load Spotify data
library(ggplot2) >> used for plotting
library(tidyverse) >> used for data manipulation
library(dplyr) >> used for data manipulation
library(rpart) >> used for regression trees in R
library(rpart.plot) >> used for plotting regression trees in R
library(neuralnet) >> used for neural net analysis
library(nnet) >> used for neural net analysis
library(caret) >> used for classification and regression training
library(class) >> used for KNN analysis
library(fpc) >> used for clustering analysis

We are still exploring and researching additional methodologies beyond those covered in class to gain more experience, so this list may grow in the next week.

Data Preparation

Original Data and Variable Description

The original data was taken from this GitHub repository, which collects weekly data and metadata from Spotify’s API. The dataset has the following 23 variables, listed with their types and descriptions:

Track ID (character): The track’s unique ID
Track Name (character): The track’s name
Track Artist (character): The track artist
Track Popularity (double): The track’s popularity between 0 and 100 with higher being better
Track Album Id (character): The track album’s unique ID
Track Album Name (character): The track album’s name
Track Album Release Date (character): The track album release date
Playlist Name (character): Name of the playlist the track resides in
Playlist Id (character): Id of the playlist the track resides in
Playlist Genre (character): Genre of the playlist the track resides in
Playlist Subgenre (character): Subgenre of the playlist the track resides in
Danceability (double): A metric of how suitable the track is for dancing based on a combination of musical elements such as tempo, rhythm stability, beat strength and regularity. 0.0 is the least danceable and 1.0 is most danceable.
Energy (double): A metric between 0.0 and 1.0 that represents a perceptual measure of intensity and activity. Features contributing to this metric include dynamic range, perceived loudness, timbre, onset rate and general entropy
Key (integer): The estimated overall key of the track with integers mapping to pitches using standard pitch class notation. If no key is detected, the value is -1
Loudness (double): The overall loudness of a track in decibels (dB), averaged across the entire track. Values range from -60 to 0 dB.
Mode (integer): The modality (major or minor) of a track, the scale in which the melodic content is derived. Major is represented by 1, Minor by 0
Speechiness (double): An indicator of how speech-like the recording is, with values closer to 1.0 being more speech-like. Values above 0.66 most likely indicate tracks made entirely of spoken words. Between 0.33 and 0.66, tracks may contain a combination of music and speech. Below 0.33, the track is most likely music or other non-speech-like content.
Acousticness (double): A confidence measure between 0.0 and 1.0 of whether the track is Acoustic with 1.0 indicating high confidence
Instrumentalness (double): Predictor of whether a track has vocals or not. The closer the value is to 1.0, the more likely the track contains no vocal content. Values above 0.5 indicate instrumental tracks, with confidence increasing as the value approaches 1.0.
Liveness (double): Detector of the presence of an audience in the recording. Higher values represent an increased probability that the track was performed live with values above 0.8 providing a strong likelihood of the track being live.
Valence (double): A measure between 0.0 and 1.0 of the “positiveness” conveyed by a track with high scores sounding more positive and lower scoring tracks sounding more negative
Tempo (double): The overall estimated tempo or speed in Beats Per Minute (BPM).
Duration (integer): The duration of the song in milliseconds.
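The speechiness thresholds described above could be turned into a categorical variable with cut(). This is only a sketch: the input values and the band labels ("music", "mixed", "spoken") are made up for illustration, and the handling of values exactly at 0.33 or 0.66 is an assumption, since the description does not say which band they belong to.

```r
# Illustrative speechiness values; these are not taken from the dataset.
speechiness <- c(0.0583, 0.0373, 0.0742, 0.45, 0.80)

# Bands: [0, 0.33] music, (0.33, 0.66] mixed, (0.66, 1] spoken.
speech_band <- cut(
  speechiness,
  breaks = c(0, 0.33, 0.66, 1),
  labels = c("music", "mixed", "spoken"),
  include.lowest = TRUE
)

# Count how many tracks fall into each band.
print(table(speech_band))
```

A banded variable like this is easier to use in tables and tree splits than the raw score, at the cost of discarding detail within each band.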

Data Importing and Cleaning

The data was imported with R’s read.csv function and assigned to the name “spotify”. From inspecting the data, no real cleanup appears to be necessary, as there are no missing values or abnormal values outside the ranges listed in the variable descriptions above.

spotify <- read.csv("spotify/spotify_songs.csv")

The first five rows of the data are shown below.

head(spotify, 5)

##                 track_id                                            track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31                       Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l                       All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7                     Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x               Someone You Loved - Future Humans Remix
##       track_artist track_popularity         track_album_id
## 1       Ed Sheeran               66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2         Maroon 5               67 63rPSO264uRjW1X5E6cWv6
## 3     Zara Larsson               70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers               60 1nqYsOef1yKKuGOVchbsk6
## 5    Lewis Capaldi               69 7m7vv9wlQ4i0LFuJiE2zsQ
##                                        track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2                       Memories (Dillon Francis Remix)
## 3                       All the Time (Don Diablo Remix)
## 4                           Call You Mine - The Remixes
## 5               Someone You Loved (Future Humans Remix)
##   track_album_release_date playlist_name            playlist_id playlist_genre
## 1               2019-06-14     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 2               2019-12-13     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 3               2019-07-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 4               2019-07-19     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 5               2019-03-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
##   playlist_subgenre danceability energy key loudness mode speechiness
## 1         dance pop        0.748  0.916   6   -2.634    1      0.0583
## 2         dance pop        0.726  0.815  11   -4.969    1      0.0373
## 3         dance pop        0.675  0.931   1   -3.432    0      0.0742
## 4         dance pop        0.718  0.930   7   -3.778    1      0.1020
## 5         dance pop        0.650  0.833   1   -4.672    1      0.0359
##   acousticness instrumentalness liveness valence   tempo duration_ms
## 1       0.1020         0.00e+00   0.0653   0.518 122.036      194754
## 2       0.0724         4.21e-03   0.3570   0.693  99.972      162600
## 3       0.0794         2.33e-05   0.1100   0.613 124.008      176616
## 4       0.0287         9.43e-06   0.2040   0.277 121.956      169093
## 5       0.0803         0.00e+00   0.0833   0.725 123.976      189052
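A check along the following lines could be used to confirm that the data contain no missing or out-of-range values. It is a minimal sketch that runs on a small invented data frame (spotify_sample, a hypothetical name) with one NA and two out-of-range values planted deliberately. In the project, the same calls would be applied to the spotify object instead.

```r
# Invented stand-in rows with deliberately planted problems.
spotify_sample <- data.frame(
  track_name       = c("Song A", NA, "Song C"),
  track_popularity = c(66, 67, 170),
  loudness         = c(-2.634, -4.969, 1.5),
  stringsAsFactors = FALSE
)

# Count missing values per column.
na_counts <- colSums(is.na(spotify_sample))

# Count values outside the documented ranges
# (popularity 0 to 100, loudness -60 to 0 dB).
out_of_range <- c(
  track_popularity = sum(spotify_sample$track_popularity < 0 |
                           spotify_sample$track_popularity > 100, na.rm = TRUE),
  loudness         = sum(spotify_sample$loudness < -60 |
                           spotify_sample$loudness > 0, na.rm = TRUE)
)

print(na_counts)
print(out_of_range)
```

If every count comes back as zero on the full dataset, the statement that no cleanup is needed is supported. Otherwise, the affected rows can be inspected individually.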

Notes

The Spotify data set is about as clean as a data set can be, so there was no need to handle NAs or apply the other cleansing techniques we have used in the past. We are still discussing and researching composite variables we would like to add to the dataset, along with augmenting it with the MusicBrainz database. We realize that we are behind where we should be for this submission, but we expect to reach that project milestone within two days of this submission.

Final Summary

The first activity we undertook, which is still in progress, is examining the distributions of the individual music characteristics. What appealed to us about the dataset is that many of the data points are “code snippets” rather than what we classify as “flat stats”. For example, in a sports data set, Passes Caught or Rush Yards are flat in the sense that they are single numbers when external context and relationships are not considered. Music characteristics, though, similar to genetics, are “coded” using notes and other elements from musicology. The implication is that there are different ways to achieve the same numeric value for valence, for example.

Therefore, digging deeply into the individual distributions of the variables has greater significance for discovering patterns before we explore relationships between variables, and we anticipate it may produce additional variables that we will create from the existing data. A further outcome will be composite variables, which let us explore whether the analysis changes when certain variables are broken down into meaningful tuples. This is where KNN and classification tree analysis will play an important role in uncovering “hidden” relationships that lead to new variables for our team to create.
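To make the planned KNN and classification tree step more concrete, the sketch below fits both models to simulated data. Everything here is an assumption for illustration: the data are randomly generated rather than taken from Spotify or MusicBrainz, the label artist_group is invented, and the accuracies it prints say nothing about the real dataset.

```r
library(rpart)   # classification trees
library(class)   # k-nearest neighbours

set.seed(7025)
n <- 200

# Simulated stand-in data: two audio features and an invented group label.
sim <- data.frame(
  valence     = runif(n),
  speechiness = runif(n, 0, 0.5)
)
sim$artist_group <- factor(
  ifelse(sim$valence + rnorm(n, sd = 0.15) > 0.5, "A", "B")
)

# Hold out 50 rows for testing.
train_idx <- sample(n, 150)
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Classification tree fitted on the training rows.
tree_fit  <- rpart(artist_group ~ valence + speechiness,
                   data = train, method = "class")
tree_pred <- predict(tree_fit, newdata = test, type = "class")

# KNN with k = 5 on the same two features.
features <- c("valence", "speechiness")
knn_pred <- knn(train[, features], test[, features],
                cl = train$artist_group, k = 5)

# Hold-out accuracy of each model.
accuracy <- c(tree = mean(tree_pred == test$artist_group),
              knn  = mean(knn_pred == test$artist_group))
print(accuracy)
```

With the real data, features on different scales (tempo in BPM versus scores between 0 and 1, for instance) would probably need to be standardized before KNN, because the method relies on distances.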

Regarding the types of visualizations, we have already mentioned classification trees as an important part of our analysis. In addition, simple histograms and scatterplots will play a key role in the initial phases (as is true of all data analysis). One of our initial hypotheses (which may prove incorrect, in which case we will adjust accordingly) is that clustering could be a powerful tool for a data set like this. We plan to explore what results we can generate from K-Means and Hierarchical Clustering, along with the related plots. Beyond visualizations to drive EDA, there will be a section of plots that shows our progress and decision-making process in variable selection and modification.
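The two clustering approaches mentioned above can be sketched with base R as follows. The feature matrix is simulated with two deliberately well-separated groups, so it is not real Spotify data, and choosing two clusters is an assumption made only to keep the example small.

```r
set.seed(7025)

# Simulated stand-in for two audio features; not real Spotify values.
features <- rbind(
  matrix(rnorm(100, mean = 0.2, sd = 0.05), ncol = 2),
  matrix(rnorm(100, mean = 0.8, sd = 0.05), ncol = 2)
)
colnames(features) <- c("energy", "valence")

# Standardize so both features contribute equally to distances.
scaled <- scale(features)

# K-means with two clusters and several random starts.
km <- kmeans(scaled, centers = 2, nstart = 10)

# Hierarchical clustering on Euclidean distances, cut into two groups.
hc <- hclust(dist(scaled), method = "complete")
hc_groups <- cutree(hc, k = 2)

# Cross-tabulate the two assignments to see how far they agree.
print(table(kmeans = km$cluster, hclust = hc_groups))
```

Calling plot(hc) would draw the dendrogram. On the real data the number of clusters would need to be chosen more carefully, for example by comparing within-cluster sums of squares across several values of k.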

Depending on the results of our analysis and time permitting, we would like to provide an analytic comparison of advanced tree models, such as random forest and boosting, with the basic tree models mentioned above. This will come into play more in the predictive part of our project, when we attempt to make recommendations to artists or music companies on underlying cultural preferences.

7025_Project_Midterm

Ethan Hodys & Vincent Chiang

April 3, 2020