1.1 Spotify allows its users to listen to a variety of songs ranging from Pop to Soul and everything in between. This data set contains song information, detailed musical measures like tempo, and more unusual measures like danceability.
The data set is interesting because it includes a variety of variables I would not have thought to measure. As such, I would like to explore the relationship between song popularity and the other variables.
1.2 The relationship between popularity and other variables inside this data set will be explored with graphs and statistical analysis.
1.3 This analysis could help Spotify or music companies determine which factors correlate with higher song popularity and use those insights to create popular songs for profit.
The tidyverse package will be loaded to manipulate the data for this analysis. The data will be modified so that it is easier for readers to understand.
library(tidyverse)
3.1 First, the Spotify data set will be downloaded from here. The data set is read as a CSV file and stored as spotify_songs_df.
spotify_songs_df <- read_csv("spotify_songs.csv")
Next, the dimensions, structure, and number of missing values per variable will be examined.
dim(spotify_songs_df)
str(spotify_songs_df)
colSums(is.na(spotify_songs_df))
After running this code, spotify_songs_df is shown to contain 32833 observations and 23 variables. There is also a total of 15 missing values, distributed as follows:
- track_artist: 5 missing values
- track_name: 5 missing values
- track_album_name: 5 missing values

Observations containing these missing values will be removed for this analysis. All three variables are character variables, so replacing the missing entries with an average would not make sense. Removing them is the better option, and there are enough observations that the loss of a few will not have a significant impact.
spotify_songs_df <- spotify_songs_df[complete.cases(spotify_songs_df), ]
sum(is.na(spotify_songs_df))
There are now 0 missing values.
3.3 After this cleaning process, the first 10 observations in the data set can be viewed with the following code:
head(spotify_songs_df, 10)
3.4 The following table displays the variables with their explanations, based on the provided variable descriptions found here.
| Variable Name | Data Type | Explanation |
|---|---|---|
| track_id | character | Unique ID for a song |
| track_name | character | Song name |
| track_artist | character | Song artist |
| track_popularity | double | Song popularity from 0 to 100, where a larger number means more popular. |
| track_album_id | character | Unique ID for an album |
| track_album_name | character | Album name |
| track_album_release_date | character | Date the album was released |
| playlist_name | character | Playlist name |
| playlist_id | character | ID for playlist |
| playlist_genre | character | Playlist genre |
| playlist_subgenre | character | Playlist subgenre |
| danceability | double | 0 - 1.0 scale of how suitable a track is to dance to. |
| energy | double | 0 - 1.0 scale giving a perceptual measure of a track's activity and intensity, where higher values are more energetic. |
| key | double | The estimated overall key of a track. Integers map to pitches using pitch class notation. |
| loudness | double | The average loudness of a track measured in decibels (dB). |
| mode | double | Indicates whether a track is in a major or minor key, with major equal to 1 and minor equal to 0. |
| speechiness | double | 0 - 1.0 scale of how much of a track consists of spoken words, where a higher value more likely indicates a voice recording. |
| acousticness | double | 0 - 1.0 scale that measures how likely a track is to be acoustic. |
| instrumentalness | double | 0 - 1.0 scale that measures how likely a track is to contain no vocals, where 1.0 is a track without vocals. |
| liveness | double | Likelihood that the track was recorded with an audience present. |
| valence | double | 0 - 1.0 scale that measures the positiveness conveyed by the track, where 1.0 is positive and 0 is negative. |
| tempo | double | Average estimated beats per minute (BPM) for a track. |
| duration_ms | double | Song duration in milliseconds. |
4.1 Moving forward, I would create new variables, such as a track duration in more readable units, or binary variables derived from the double values according to each variable's description (e.g., treating a track as acoustic when acousticness is above 0.66). These changes will make the values easier to interpret and to test against (e.g., graphing popularity by whether a track is acoustic). To answer my question about song popularity, I can collect the numeric variables in a smaller data frame for easier manipulation and graphing.
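As a rough sketch of the derived variables described in 4.1, the following uses a small made-up data frame in place of spotify_songs_df. The names toy_songs, duration_min, is_acoustic, and numeric_songs are hypothetical, and the 0.66 cutoff is only the example threshold mentioned above, not a value taken from the variable descriptions.

```r
library(dplyr)

# Toy stand-in for spotify_songs_df (values are made up for illustration)
toy_songs <- tibble(
  track_name       = c("Song A", "Song B", "Song C", "Song D"),
  track_popularity = c(66, 12, 80, 45),
  acousticness     = c(0.10, 0.72, 0.05, 0.90),
  duration_ms      = c(194754, 240000, 176616, 300000)
)

toy_songs <- toy_songs %>%
  mutate(
    # Convert milliseconds to minutes for easier reading
    duration_min = duration_ms / 60000,
    # Hypothetical binary flag: 1 if acousticness exceeds 0.66, else 0
    is_acoustic  = as.integer(acousticness > 0.66)
  )

# Smaller data frame holding only the numeric columns
numeric_songs <- toy_songs %>% select(where(is.numeric))
```

The same mutate() call should work on the full data set, provided the column names match those in the table in 3.4.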
4.2 Scatter plots, bar plots, and box plots may prove useful in finding any correlations between track popularity and other variables. Scatter plots in particular will help visualize how popularity varies with the other variables and could inform a linear regression model.
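The plots mentioned in 4.2 might look something like the sketch below, which uses ggplot2 (part of the tidyverse). The toy data frame and the choice of energy and playlist_genre as the comparison variables are assumptions made purely for illustration.

```r
library(ggplot2)

# Toy stand-in for spotify_songs_df (values are made up for illustration)
toy_songs <- data.frame(
  track_popularity = c(66, 12, 80, 45, 30, 71),
  energy           = c(0.92, 0.40, 0.83, 0.55, 0.61, 0.78),
  playlist_genre   = c("pop", "rock", "pop", "rap", "rock", "rap")
)

# Scatter plot of popularity against energy with a fitted straight line
scatter_plot <- ggplot(toy_songs, aes(x = energy, y = track_popularity)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Energy", y = "Track popularity")

# Box plot of popularity within each playlist genre
box_plot <- ggplot(toy_songs, aes(x = playlist_genre, y = track_popularity)) +
  geom_boxplot() +
  labs(x = "Playlist genre", y = "Track popularity")

# Print the plot objects (e.g. print(scatter_plot)) to draw them
```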
4.3 Right now, I do not know better ways to display the data using graphs. I will also need to learn how to create binary variables to represent the different genre types, and how to write the code for linear regression.
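One possible way to build binary genre variables in base R is to convert the genre column to a factor and expand it with model.matrix(). This is only a sketch on made-up data, and the name genre_dummies is hypothetical.

```r
# Toy stand-in for spotify_songs_df (values are made up for illustration)
toy_songs <- data.frame(
  track_popularity = c(66, 12, 80, 45),
  playlist_genre   = c("pop", "rock", "pop", "rap")
)

# Treat genre as a categorical variable
toy_songs$playlist_genre <- factor(toy_songs$playlist_genre)

# model.matrix() expands a factor into 0/1 indicator columns;
# "- 1" drops the intercept so every genre gets its own column
genre_dummies <- model.matrix(~ playlist_genre - 1, data = toy_songs)

print(genre_dummies)
```

Note that lm() creates these indicator columns automatically when a predictor is a factor, so building them by hand may only be needed for plotting or inspection.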
4.4 I plan to use linear regression models to better understand the relationship between track popularity and the variables relating to its sound along with possibly its genre.
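A minimal sketch of such a model, using base R's lm() on a made-up data frame, is shown here. The choice of energy, danceability, and playlist_genre as predictors is an assumption for illustration only; the actual model would be chosen after exploring the data.

```r
# Toy stand-in for spotify_songs_df (values are made up for illustration)
toy_songs <- data.frame(
  track_popularity = c(66, 12, 80, 45, 30, 71, 58, 25),
  energy           = c(0.92, 0.40, 0.83, 0.55, 0.61, 0.78, 0.70, 0.35),
  danceability     = c(0.75, 0.52, 0.68, 0.81, 0.49, 0.66, 0.59, 0.73),
  playlist_genre   = factor(c("pop", "rock", "pop", "rap",
                              "rock", "rap", "pop", "rock"))
)

# Regress popularity on two sound measures plus genre;
# lm() turns the genre factor into indicator variables automatically
popularity_model <- lm(
  track_popularity ~ energy + danceability + playlist_genre,
  data = toy_songs
)

# Coefficient estimates, standard errors, and R-squared
summary(popularity_model)
```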