Introduction

1.1 Spotify allows its users to listen to a variety of songs ranging from Pop to Soul and everything in between. Within this data set, there are very interesting variables involving song information, intricate measures of music like tempo, and unique measures like danceability.

This data set is interesting because it includes a variety of variables I would not have thought to measure. As such, I would like to explore the relationship between song popularity and the other variables.

1.2 The relationship between popularity and other variables inside this data set will be explored with graphs and statistical analysis.

1.3 This analysis could help Spotify or music companies determine which factors correlate with higher song popularity and use those insights to create popular songs for profit.

Packages Required

The tidyverse package will be loaded to manipulate the data for this analysis. Variables will be modified where doing so makes them easier for readers to understand.

library(tidyverse)

Data Preparation

3.1 First, the Spotify data set will be downloaded from here. The data is read from a csv file and stored as spotify_songs_df.

spotify_songs_df <- read_csv("spotify_songs.csv")

Next, the dimensions, structure, and number of missing values per variable are examined.

dim(spotify_songs_df)
str(spotify_songs_df)
colSums(is.na(spotify_songs_df))

After running this code, spotify_songs_df is shown to contain 32833 observations and 23 variables. There are also 15 missing values in total. The output of colSums(is.na(spotify_songs_df)) shows how these missing values are distributed across the variables.

Observations containing these missing values will be removed for this analysis. The missing values are all in character variables, so replacing them with an average would not make sense given what the variables mean. Removing them is the better option, and with this many observations the loss of a few will not have a significant impact.

spotify_songs_df <- spotify_songs_df[complete.cases(spotify_songs_df), ]
sum(is.na(spotify_songs_df))

There are now 0 missing values.

3.3 After this cleaning process, the first 10 observations in the data set can be viewed with the following code:

head(spotify_songs_df, 10)

3.4 The following table displays each variable with its data type and an explanation based on the provided description found here.

Variable Name | Data Type | Explanation
track_id | character | Unique ID for a song
track_name | character | Song name
track_artist | character | Song artist
track_popularity | double | Song popularity from 0 - 100, where a higher number means more popular.
track_album_id | character | Unique ID for an album
track_album_name | character | Album name
track_album_release_date | character | Album release date
playlist_name | character | Playlist name
playlist_id | character | ID for playlist
playlist_genre | character | Playlist genre
playlist_subgenre | character | Playlist subgenre
danceability | double | 0 - 1.0 scale of how suitable a track is to dance to.
energy | double | 0 - 1.0 scale of a track's perceived activity and intensity, where higher values are more energetic.
key | double | The overall key/pitch of a track. Integers map to pitches using pitch class notation.
loudness | double | The average loudness of a track measured in decibels (dB).
mode | double | Indicates whether a track is in a major or minor key, with major equal to 1 and minor equal to 0.
speechiness | double | 0 - 1.0 scale of how much of a track consists of spoken words, where higher values likely indicate a voice recording.
acousticness | double | 0 - 1.0 scale that measures how likely a track is to be acoustic.
instrumentalness | double | 0 - 1.0 scale that measures how likely a track is to contain no vocals, where 1.0 is a track without vocals.
liveness | double | Likelihood that the track was recorded with an audience present.
valence | double | 0 - 1.0 scale that measures the positiveness conveyed by the track, where 1 is positive and 0 is negative.
tempo | double | Average estimated beats per minute (BPM) for a track.
duration_ms | double | Song duration in milliseconds.

Proposed Exploratory Data Analysis

4.1 Moving forward, I would create new variables, for example one for track duration, as well as binary variables derived from the double values according to each variable's description (e.g., an acousticness above 0.66 counting as acoustic). These changes will make the values easier to interpret and to test against (e.g., graphing popularity against whether a track is acoustic). To answer my question about song popularity, I can collect the numeric variables in a smaller data frame for easier manipulation and graphing.
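As a minimal sketch of the derived variables described above, the following uses dplyr's mutate() on a small made-up data frame. The toy_songs data and its values are hypothetical and not taken from the Spotify data, and the new column names duration_min and is_acoustic are assumptions chosen for illustration. The 0.66 cutoff is the example threshold mentioned above, not a validated value.

```r
library(dplyr)

# Hypothetical toy data standing in for spotify_songs_df (values are made up).
toy_songs <- data.frame(
  track_popularity = c(66, 12, 80),
  acousticness     = c(0.10, 0.90, 0.70),
  duration_ms      = c(180000, 240000, 210000)
)

toy_songs <- toy_songs %>%
  mutate(
    duration_min = duration_ms / 60000,  # convert milliseconds to minutes
    is_acoustic  = acousticness > 0.66   # TRUE when above the assumed cutoff
  )

toy_songs
```

The same mutate() call could be applied to spotify_songs_df once a cutoff has been settled on for each measure.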

4.2 Scatter plots, bar plots, and box plots may prove useful in finding any correlations between track popularity and other variables. Scatter plots in particular will help visualize how popularity varies with other variables and could help in building a linear regression model.
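The plots mentioned above could be sketched with ggplot2 (part of the tidyverse) roughly as follows. The toy_songs data frame and its values are made up for illustration. Only the column names track_popularity, energy, and playlist_genre come from the data set, and energy is an arbitrary choice of numeric measure.

```r
library(ggplot2)

# Hypothetical toy data (made-up values) standing in for spotify_songs_df.
toy_songs <- data.frame(
  track_popularity = c(66, 12, 80, 45, 30, 71),
  energy           = c(0.80, 0.35, 0.72, 0.55, 0.40, 0.90),
  playlist_genre   = c("pop", "rock", "pop", "rap", "rock", "rap")
)

# Scatter plot of popularity against one numeric measure, with a fitted straight line.
scatter_plot <- ggplot(toy_songs, aes(x = energy, y = track_popularity)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Box plot of popularity grouped by genre.
box_plot <- ggplot(toy_songs, aes(x = playlist_genre, y = track_popularity)) +
  geom_boxplot()

print(scatter_plot)
print(box_plot)
```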

4.3 I do not yet know better ways to display data using graphs. I will also need to learn how to create binary variables that represent the different genre types, and how to code a linear regression.
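One possible way to build binary genre variables is base R's model.matrix(), sketched below on made-up data. The toy_songs data frame and its values are hypothetical, and only the column name playlist_genre comes from the data set. Note that lm() creates such indicator columns automatically for factor variables, so this step may only be needed for inspection or custom coding.

```r
# Hypothetical toy data (made-up values); playlist_genre mirrors the column name in the data set.
toy_songs <- data.frame(
  track_popularity = c(66, 12, 80, 45),
  playlist_genre   = factor(c("pop", "rock", "pop", "rap"))
)

# One 0/1 indicator column per genre; "- 1" removes the intercept so
# every genre gets its own column instead of one acting as the baseline.
genre_dummies <- model.matrix(~ playlist_genre - 1, data = toy_songs)

colnames(genre_dummies)
genre_dummies
```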

4.4 I plan to use linear regression models to better understand the relationship between track popularity and the variables relating to its sound along with possibly its genre.
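A linear regression of this kind could be sketched with base R's lm() as shown below. The toy_songs data and its values are made up, and the choice of energy and valence as predictors is an assumption for illustration only. The real model would use whichever sound variables the exploratory analysis suggests.

```r
# Hypothetical toy data (made-up values) standing in for spotify_songs_df.
toy_songs <- data.frame(
  track_popularity = c(66, 12, 80, 45, 30, 71, 55, 20),
  energy           = c(0.80, 0.35, 0.72, 0.55, 0.40, 0.90, 0.60, 0.30),
  valence          = c(0.60, 0.20, 0.75, 0.50, 0.35, 0.65, 0.45, 0.25),
  playlist_genre   = factor(c("pop", "rock", "pop", "rap", "rock", "rap", "pop", "rock"))
)

# Popularity modeled on two sound measures plus genre;
# lm() expands the genre factor into indicator variables automatically.
popularity_model <- lm(track_popularity ~ energy + valence + playlist_genre,
                       data = toy_songs)

summary(popularity_model)
```

With so few made-up rows the estimates themselves carry no meaning. The point is only the shape of the call and the summary() output that would be interpreted.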