I have reviewed and commented on the midterm submitted by Sharanya Amaravadhi, and I also reviewed and commented on the midterm submitted by Emily Thie and Cynthia Corby. In return, Emily Thie and Sharanya Amaravadhi agreed to review my midterm.
In order to sell as many albums as possible, it is important for record labels to sign contracts with new performers who will create popular songs and/or recordings. Hence, to help record labels strategically select new artists with strong potential for developing popular material, we examine possible relationships between the popularity of recordings and their musical characteristics. We then present our findings to provide record labels with an understanding of the musical characteristics inherent in recordings that tend to become popular.
In order to address the problem statement, we analyze data from Spotify that was originally collected using the spotifyr package. We plan to use regression, residual plots, and graphical displays (e.g., scatter plots and histograms) to investigate the influence of various musical characteristics on the popularity of a recording. In particular, by analyzing the data set, we determine whether genre, sub-genre, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, and/or duration of tracks are statistically significant in predicting the popularity of the track. We hypothesize that these musical characteristics alone may not be enough to establish a highly accurate model for the prediction of track popularity. However, the analysis will still provide insights into potentially statistically significant relationships between musical characteristics and track popularity, and record labels can use these findings to inform their performer contract decisions. Because we are trying to provide record labels with information regarding the potential popularity of track recordings for new performers with whom they are considering initial contracts, the artist name, track id, track name, album name, album id, playlist name, playlist id, and album release date do not play a substantial role in our analysis, as new performers may not necessarily already have produced albums and/or songs appearing in Spotify playlists. However, if the musical characteristics alone are insufficient for developing a highly predictive model, we will explore the potential benefit of incorporating such other possible variables in our regression model.
By creating graphical displays of the data set (such as scatter plots, histograms, etc.), we visualize any potential relationships between musical characteristics and track popularity. If relationships appear to be nonlinear, we transform the data appropriately prior to utilizing regression. By performing regression analysis, we determine whether certain musical characteristics are statistically significant for predicting track popularity. After performing regression analysis, we can use the sign (i.e., positive or negative) of coefficients corresponding to statistically significant covariates to inform our understanding of general relationships between the covariates (i.e., musical characteristics) and the dependent variable (i.e., track popularity). We also use the Bayesian information criterion to determine the musical characteristics which are best-suited for inclusion in our regression model. After developing our model using logistic regression, we examine its predictive capabilities by calculating a confusion matrix and the corresponding misclassification error rate.
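As a minimal sketch of the modelling workflow described above, the following code fits a logistic regression on simulated data, compares two candidate models using the Bayesian information criterion, and computes a confusion matrix with its misclassification error rate. The data frame toy, the binary response popular, the covariate noise, and the 0.5 classification cutoff are all assumptions made for illustration; they are not taken from the Spotify data or from our final model.

```r
# Simulated stand-in for the Spotify data; every value here is made up.
set.seed(1)
n <- 500
toy <- data.frame(
  danceability = runif(n),
  energy       = runif(n),
  noise        = rnorm(n)   # covariate unrelated to the response
)

# Hypothetical binary response: 1 = "popular", 0 = "not popular".
p_true <- plogis(-2 + 4 * toy$danceability)
toy$popular <- rbinom(n, size = 1, prob = p_true)

# Fit two candidate logistic regression models and compare them by BIC
# (a smaller BIC indicates the preferred model).
fit_full  <- glm(popular ~ danceability + energy + noise,
                 data = toy, family = binomial)
fit_small <- glm(popular ~ danceability, data = toy, family = binomial)
bic_values <- c(full = BIC(fit_full), small = BIC(fit_small))

# Classify using an assumed 0.5 cutoff on the fitted probabilities.
pred_class <- as.integer(fitted(fit_small) > 0.5)

# Confusion matrix: rows are actual classes, columns are predicted classes.
conf_mat <- table(
  actual    = factor(toy$popular, levels = c(0, 1)),
  predicted = factor(pred_class, levels = c(0, 1))
)

# Misclassification error rate: share of off-diagonal entries.
misclass_rate <- 1 - sum(diag(conf_mat)) / sum(conf_mat)
```

Fixing the factor levels to 0 and 1 keeps the confusion matrix two-by-two even if the model happens to predict only one class.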
This report is meant primarily to be consumed and read by record labels. Record labels can use the results of this analysis in the strategic planning of new performer contracts. By providing record labels with insights into the musical characteristics that are most significant in determining track popularity, record labels gain an understanding of the musical characteristics that are commonly inherent in popular tracks, and they can use this understanding to inform their contract decisions.
The following four packages are used in this final project report:
tidyverse
leaps
psych
MASS
By clicking on the code button shown directly below, you can see the code for loading these four packages.
# Below, the package tidyverse is loaded. If the user has not yet installed this package, he or she should additionally use the code install.packages("tidyverse")
library(tidyverse)
# Below, the package leaps is loaded. If the user has not yet installed this package, he or she should additionally use the code install.packages("leaps")
library(leaps)
# Below, the package psych is loaded. If the user has not yet installed this package, he or she should additionally use the code install.packages("psych")
library(psych)
# Below, the package MASS is loaded. If the user has not yet installed this package, he or she should additionally use the code install.packages("MASS")
library(MASS)
Note that while loading the packages in Section 2.1, message and warning were both set to FALSE. This suppressed the messages and warnings resulting from loading the four packages. Also, echo was set to TRUE in order to ensure that the reader is able to view the R code for loading the required packages.
Here, we explain the purpose of using each package in our data analysis. The package leaps will help us to identify the best subset of variables in our data set to use in developing our regression model (e.g., subsets can be chosen using this package based on adjusted R squared). In loading the package tidyverse, other packages are automatically loaded that will be helpful in our analysis. When we load the package tidyverse, the package ggplot2 is automatically loaded. The package ggplot2 will allow us to create clear visualizations of our data (i.e., graphs and plots). Two other packages that are automatically loaded with tidyverse are dplyr and tidyr. We will use dplyr to manipulate our data set, and we will use tidyr to tidy our data. The package psych contains functions that are useful for psychological, personality, and psychometric research. We will use a function from this package called pairs.panels() to create a visual display that contains scatter plots, histograms, and correlations among our variables. The package MASS is useful for selecting Box-Cox transformations to use in developing regression models. We will use this package to select the tuning parameter for a Box-Cox transformation which maximizes the log-likelihood of the transformed data.
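The Box-Cox step described above can be sketched as follows, assuming simulated data rather than the Spotify data. The variables x and y, the data frame sim, and the grid of candidate values are hypothetical. We also note, as an assumption, that a Box-Cox transformation requires a strictly positive response, so a variable containing zeros would need to be shifted (for example by adding 1) before this approach could be applied.

```r
library(MASS)

# Simulated strictly positive response whose logarithm is linear in x.
set.seed(2)
sim <- data.frame(x = runif(200, min = 1, max = 10))
sim$y <- exp(0.3 * sim$x + rnorm(200, sd = 0.2))

# Keep the response in the fitted object (y = TRUE) so that boxcox()
# can work directly from the model.
fit <- lm(y ~ x, data = sim, y = TRUE)

# Evaluate the profile log-likelihood on a grid of tuning parameters.
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)

# The selected tuning parameter maximizes the log-likelihood.
lambda_hat <- bc$x[which.max(bc$y)]
```

A selected value close to 0 would suggest a log transformation, while a value close to 1 would suggest that no transformation is needed.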
In Section 1.2, we mentioned that we will analyze data from Spotify that was originally obtained using the spotifyr package. We downloaded this data at the following link: https://www.dropbox.com/sh/qj0ueimxot3ltbf/AACzMOHv7sZCJsj3ErjtOG7ya?dl=1
We will analyze data from Spotify that was originally obtained using the spotifyr package. This package was authored by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff in order to make it easier for individuals to obtain data from Spotify. The data set that we analyze appeared as the 01/21/2020 TidyTuesday data set in the R for Data Science GitHub organization, where a data dictionary for the 23 variables in the original data set is provided (note that the data set contains 32833 observations, each of which is a Spotify track). We also make note of some peculiarities in the data set. For the variable key, a value of -1 is recorded whenever no musical key is detected. We note that this may occur when multiple keys are used throughout the piece, preventing a single standard key from being detected. We also note that the mode variable only allows for the entry of two possible modes (i.e., major or minor). Although these are the two most prevalent modes in today’s music, we note that other modes do exist, and hence this variable is not defined in a manner that allows for the entry of all possible modes. As a consequence, we will need to explicitly analyze and consider potential missing values for the variable mode, as they may indicate that a mode other than major or minor was used. Another possibility is that missing values for the variable mode may indicate that multiple modes were used throughout the musical piece.
The data dictionary for the original data set can be obtained at the following link: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md
For convenience, we provide a description of each of the 23 variables in the original data set here as well:
track_id - the unique ID for a song
track_name - the name of the song
track_artist - the name of the performing artist
track_popularity - the popularity of a song based on an integer scale from 0 to 100, with 100 being the most popular
track_album_id - the unique ID for the album
track_album_name - the name of the album
track_album_release_date - the date when the album was released
playlist_name - the name of the playlist
playlist_id - the unique ID for the playlist
playlist_genre - the genre of the playlist
playlist_subgenre - the sub-genre of the playlist
danceability - a measure of the suitability of a song for dancing based on a scale from 0 to 1, with 1 being the most dance-able
energy - a measure of the intensity of a song on a scale of 0 to 1, with 1 being the most energetic
key - the overall key of the song represented as an integer; here, 0 is the key of C (also called B#), 1 is C# (also referred to as D-flat), 2 is D, 3 is D# (also called E-flat), 4 is E, 5 is F (also called E#), 6 is F# (also called G-flat), 7 is G, 8 is G# (also called A-flat), 9 is A, 10 is A# (also called B-flat), 11 is B (also called C-flat); -1 is used whenever an overall key cannot be detected
loudness - a measure of the loudness of a song in decibels (loudness is generally between -60 and zero decibels)
mode - the value 1 is used to represent pieces that are written in a major key, and the value 0 is used to represent pieces that are written in a minor key; we note here that it is possible for songs to be written in modes that are neither major nor minor, but that is not prevalent in today’s popular music
speechiness - a measure of the use of spoken words on a scale of 0 to 1, with 1 being tracks with the most spoken words
acousticness - a measure of the confidence that a track is acoustic (on a scale of 0 to 1), with 1 used to indicate tracks that are most likely to be acoustic
instrumentalness - a measure of the prediction that instruments are used with no vocals (on a scale of 0 to 1), with 1 representing tracks that are predicted to use the least amount of vocals
liveness - a measure of the prediction that the track was recorded live in front of an audience (on a scale of 0 to 1), with 1 representing tracks that are most likely to have been recorded live
valence - a measure of the positiveness of a recording on a scale from 0 to 1, with 1 representing the tracks that are most positive
tempo - the speed of a track recorded in beats per minute
duration_ms - the length of the track in milliseconds
Based on the descriptions provided in the above data dictionary and also the values that tend to appear in the data set, the type of the variables track_id, track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id, playlist_genre, and playlist_subgenre should all be character. The type of the variables track_popularity, key, mode, and duration_ms should be integer. The type of the variables danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, and tempo should be double. We will check for these variable types in Section 4 as we clean the data.
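The type check described above can be sketched as follows. This is only an illustration on a hypothetical two-row data frame (toy) containing four of the 23 variables; the vector expected simply restates the types listed in the preceding paragraph.

```r
# Hypothetical stand-in with a handful of the 23 columns.
toy <- data.frame(
  track_id         = c("a1", "b2"),
  track_popularity = c(66L, 67L),
  danceability     = c(0.748, 0.726),
  key              = c(6L, 11L),
  stringsAsFactors = FALSE
)

# Expected storage types, restated from the paragraph above.
expected <- c(
  track_id         = "character",
  track_popularity = "integer",
  danceability     = "double",
  key              = "integer"
)

# Compare the actual storage type of each column with the expected type.
actual <- vapply(toy[names(expected)], typeof, character(1))
mismatched <- names(expected)[actual != expected]
```

An empty mismatched vector indicates that every listed column has the expected type.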
Also, the 01/21/2020 TidyTuesday issue mentions that Kaylin Pavlik used the spotifyr package to collect data from Spotify in order to design a model for predicting the genre of specific pieces of music.
We begin this section by importing the data set described in Section 3.2. To do this, we first set our working directory to the folder location containing the data set. We then import the data set, and we name the data set spotify_songs. We also begin to understand the data by viewing the first rows of the data set, which are displayed below. The code for this can be seen by clicking on the code button directly below.
# Using the below code, we set our working directory to the folder location containing the Spotify data.
setwd("C:/Users/richa/Dropbox/My PC (DESKTOP-B9LT0L1)/Documents/Data Wrangling/Possible Data Sets/spotify")
# Using the below code, we import the Spotify data, and we name the data set spotify_songs.
spotify_songs <- read.csv("spotify_songs.csv")
# We view the first rows of the Spotify data using the following code.
head(spotify_songs)
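As a hedged alternative to changing the working directory, a file can be read by passing its full path to read.csv(). The sketch below writes a small made-up CSV to a temporary file so that it is self-contained; csv_path and the file contents are hypothetical. It also illustrates how read.csv() treats empty fields in a character column, which matters when counting missing values.

```r
# Write a tiny made-up CSV to a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "track_id,track_name,track_popularity",
  "a1,Song One,66",
  "b2,,67",
  "c3,NA,70"
), csv_path)

# Reading by full path avoids a call to setwd().
default_read <- read.csv(csv_path, stringsAsFactors = FALSE)

# Optionally treat empty fields as missing as well.
strict_read <- read.csv(csv_path, stringsAsFactors = FALSE,
                        na.strings = c("NA", ""))

# With the defaults, only the literal string NA is treated as missing in a
# character column; an empty field is kept as an empty string.
na_default <- sum(is.na(default_read$track_name))
na_strict  <- sum(is.na(strict_read$track_name))
```

Whether empty strings should count as missing is a judgment call that depends on how the file was produced.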
We also gain an understanding of the data by examining its structure. Output about the data structure obtained from R is shown below, along with the code required to obtain the output. Looking at this output, we see that the data set contains 32833 observations of Spotify tracks and 23 variables. We also see that all variables are properly named in snake_case. Further, as desired, we see that the variables track_id, track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id, playlist_genre, and playlist_subgenre are all of type character. Because these variables store text, we wanted to double-check that they were correctly assigned a character type. Because the variables track_popularity, key, mode, and duration_ms all contain integer values, we must check that they are appropriately assigned an integer type. Looking at the output regarding the data structure shown below, we see that they are indeed of integer type. The variables danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, and tempo all contain decimal numbers, and hence they should be assigned the numeric type double. Looking at the below output regarding the structure of the data set, we see that these variables are indeed appropriately assigned a numeric type. Because all variables are of the appropriate type, we need not change any of the variable types in our data cleaning.
# Using the below code, we examine the structure of the data set spotify_songs. This is helpful in gaining a better initial understanding of the data, and it also allows us to check if the variable names and type are appropriate. By examining the output, we find that they are indeed appropriate and do not need to be changed.
str(spotify_songs)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
We next examine and handle the missing values in the data. We find that most of the variables do not have any missing values. In fact, the only variables with missing values are track_name, track_artist, and track_album_name. Each of these three variables has five missing values. The code and output below display the number of missing values per variable.
# The below code is used to calculate the number of missing values in each variable.
colSums(is.na(spotify_songs))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
The output below indicates the five observations with missing values for the variable track_name. The required code to produce this output can also be seen by clicking on the code button directly below.
# The below code is used to determine the five observations containing missing values for the variable track_name.
which(is.na(spotify_songs$track_name))
## [1] 8152 9283 9284 19569 19812
The output below indicates the five observations with missing values for the variable track_artist. The required code to produce this output can also be seen by clicking on the code button directly below.
# The below code is used to determine the five observations containing missing values for the variable track_artist.
which(is.na(spotify_songs$track_artist))
## [1] 8152 9283 9284 19569 19812
The output below indicates the five observations with missing values for the variable track_album_name. The required code to produce this output can also be seen by clicking on the code button directly below.
# The below code is used to determine the five observations containing missing values for the variable track_album_name.
which(is.na(spotify_songs$track_album_name))
## [1] 8152 9283 9284 19569 19812
Note that the preceding three pieces of output are identical. That is, observations 8152, 9283, 9284, 19569, and 19812 contain the missing values for the variables track_name, track_artist, and track_album_name. Five observations is only a very small proportion of the total 32,833 observations, and so we choose to simply remove these five observations using the following code:
# Using the below code, we remove the five observations that contain missing values.
spotify_songs <- spotify_songs[-c(8152, 9283, 9284, 19569, 19812),]
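The same removal can also be expressed without typing row numbers by hand. The sketch below uses complete.cases() on a hypothetical four-row data frame (toy); it is an alternative formulation, not the code used in this report.

```r
# Hypothetical stand-in: rows 2 and 4 are missing the three name fields.
toy <- data.frame(
  track_id         = c("a1", "b2", "c3", "d4"),
  track_name       = c("Song One", NA, "Song Three", NA),
  track_artist     = c("Artist A", NA, "Artist C", NA),
  track_album_name = c("Album A", NA, "Album C", NA),
  stringsAsFactors = FALSE
)

# Rows that contain at least one missing value.
na_rows <- which(!complete.cases(toy))

# Keep only the rows with no missing values.
toy_clean <- toy[complete.cases(toy), ]
```

Logical indexing of this kind also behaves sensibly when no rows are missing, whereas negative indexing with an empty vector of row numbers would return zero rows.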
The output below indicates the number of missing values that each variable now contains. We see that we have now successfully removed all missing values. You can view the code that was used to produce this output by clicking on the code button directly below.
# Using the below code, we check that there are now zero missing values in each variable.
colSums(is.na(spotify_songs))
## track_id track_name track_artist
## 0 0 0
## track_popularity track_album_id track_album_name
## 0 0 0
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
We note that the purpose of this study is to understand the musical characteristics that tend to most often occur in tracks which become popular. In our attempt to develop a regression model to see if it is possible to predict track popularity, it is important that each track should occur no more than once in our data set. Having the same track occur multiple times could skew our results. Hence we need to examine the number of unique track id’s occurring in the data set, and we need to ensure that this number is equivalent to the total number of observations. The below output indicates that there are 28352 unique track id’s in the data set. You can view the code that was used to produce this output by clicking on the code button directly below.
# Using the below code, we determine the number of unique track id's occurring in the data set.
length(unique(spotify_songs$track_id))
## [1] 28352
We compare the number of unique track id’s to the total number of unique observations in the data set. The below output indicates that there are 32828 unique observations in the data set. By clicking on the code button directly below, you can view the code that was used to create this output.
# Using the below code, we determine the number of unique observations in the data set.
nrow(unique(spotify_songs))
## [1] 32828
We also note that the total number of unique observations in the data set is actually equivalent to the total number of observations. Using the below code and output, we see that there are a total of 32828 observations in the data set.
# Using the below code, we determine the total number of observations in the data set.
nrow(spotify_songs)
## [1] 32828
Because there are more unique observations than unique track id’s, we need to remove some of the rows in the data to ensure that each track occurs only once. After examining the data set, we find that there are multiple playlist sub-genres assigned to the same track_id. Because track_id cannot be uniquely assigned a specific sub-genre to use as a covariate in our regression model, we exclude this variable in the development of our regression model. We create a new data set called spotify_songs_2 which stores all of the variables in spotify_songs except for playlist_subgenre. The below output indicates that the number of unique observations contained in spotify_songs_2 is only 32505. By clicking on the code button directly below, you can view the code used to create the spotify_songs_2 data set and to find the number of unique observations in the data set.
# Using the below code, we create a new data set called spotify_songs_2 which stores all of the variables in spotify_songs except for playlist_subgenre.
spotify_songs_2 <- spotify_songs[, c(1:10, 12:23)]
# Using the below code, we find the number of unique observations in the new data set spotify_songs_2.
nrow(unique(spotify_songs_2))
## [1] 32505
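One way to find out which variables take more than one value for the same track id is sketched below. The data frame toy and its values are made up for illustration, and the object names are hypothetical; the report itself identified these variables by examining the data set.

```r
# Hypothetical stand-in: tracks a1 and c3 each appear twice.
toy <- data.frame(
  track_id          = c("a1", "a1", "b2", "c3", "c3"),
  playlist_id       = c("p1", "p2", "p1", "p3", "p3"),
  playlist_genre    = c("pop", "rock", "pop", "rap", "rap"),
  playlist_subgenre = c("dance pop", "hard rock", "dance pop", "trap", "hip hop"),
  danceability      = c(0.7, 0.7, 0.5, 0.6, 0.6),
  stringsAsFactors  = FALSE
)

# For each column, count the tracks that take more than one distinct value.
other_cols <- setdiff(names(toy), "track_id")
tracks_with_conflict <- vapply(other_cols, function(col) {
  distinct_per_track <- tapply(toy[[col]], toy$track_id,
                               function(v) length(unique(v)))
  sum(distinct_per_track > 1)
}, integer(1))
```

Columns with a count above zero are the ones that prevent each track id from collapsing to a single row.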
Because the number of unique observations in spotify_songs_2 is larger than 28352 (the number of unique track id’s that we found earlier), we need to further alter the data set to ensure that each track id occurs only once. Upon examining the data set further, we see that each track id can be assigned multiple playlist genres. As a result, we need to remove the variable playlist_genre for the same reason that we removed playlist_subgenre. After removing the variable playlist_genre from spotify_songs_2, we produce the following output displaying the number of unique observations in the revised data set. We see that the number of unique observations has now reduced to 32246. By clicking on the code button directly below, you can view the code that was used to modify spotify_songs_2 and count the number of unique observations in the data set.
# Using the below code, we remove the variable playlist_genre from the data set spotify_songs_2.
spotify_songs_2 <- spotify_songs_2[, c(1:9, 11:22)]
# Using the below code, we find the number of unique observations in spotify_songs_2.
nrow(unique(spotify_songs_2))
## [1] 32246
Because the number of unique observations in the revised data set is still greater than 28352, we need to revise the data further to ensure that each track id occurs only once. Upon further examination of the data, we find that each track can occur in multiple playlists. As a consequence, playlist_name and playlist_id are not uniquely valued potential covariates. Hence, we remove playlist_name and playlist_id from the data set spotify_songs_2, and after removing these variables, we produce the output below to see that there are now 28352 unique observations in the data set. The code and output for performing the actions described in this paragraph are shown directly below.
# Using the below code, we remove the variables playlist_name and playlist_id from the data set spotify_songs_2.
spotify_songs_2 <- spotify_songs_2[, c(1:7, 10:21)]
# Using the below code, we determine the number of unique observations in the revised spotify_songs_2.
nrow(unique(spotify_songs_2))
## [1] 28352
We then revise spotify_songs_2 so that it contains only unique observations, ensuring that each track id occurs only once (for a total of 28352 unique observations). We then produce the output displayed below to double-check that there are now a total of 28352 observations in the revised data set. The code and output for the actions described in this paragraph can be found directly below.
# Using the below code, we revise spotify_songs_2 so that it contains only unique observations, ensuring that each track id occurs only once in the data set.
spotify_songs_2 <- unique(spotify_songs_2)
# Using the below code, we determine the total number of observations in spotify_songs_2.
nrow(spotify_songs_2)
## [1] 28352
The below output indicates that there are 28352 unique track id’s in our revised data set, verifying that we have successfully created a data set containing each track only once. You can view the code that was used to create this output by clicking on the code button directly below.
# Using the below code, we find the number of unique track id's in spotify_songs_2.
length(unique(spotify_songs_2$track_id))
## [1] 28352
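The column removals and de-duplication performed above can also be written by variable name rather than by column position, which keeps working if the column order changes. The sketch below uses a hypothetical three-row data frame (toy) and is an alternative formulation, not the code used in this report.

```r
# Hypothetical stand-in: track a1 appears in two playlists.
toy <- data.frame(
  track_id          = c("a1", "a1", "b2"),
  track_popularity  = c(66L, 66L, 70L),
  playlist_name     = c("Pop Remix", "Rock Hits", "Pop Remix"),
  playlist_id       = c("p1", "p2", "p1"),
  playlist_genre    = c("pop", "rock", "pop"),
  playlist_subgenre = c("dance pop", "hard rock", "dance pop"),
  stringsAsFactors  = FALSE
)

# Drop the four playlist-level variables by name, then keep unique rows.
playlist_cols <- c("playlist_name", "playlist_id",
                   "playlist_genre", "playlist_subgenre")
toy_tracks <- unique(toy[, setdiff(names(toy), playlist_cols)])

# Each track id should now appear exactly once.
one_row_per_track <- !anyDuplicated(toy_tracks$track_id)
```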
As mentioned in the introduction, we are trying to help record labels make informed strategic decisions on developing contracts with new performers. Because new performers may not already have created albums or tracks appearing in playlists on Spotify, the variables track_id, track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name, and playlist_id will not play a large role in our analysis. These variables cannot necessarily be used by record labels to predict the popularity of recordings by new performers. Because we do not plan to utilize these variables in our recommendations for record labels, we do not worry about further cleaning of these variables. We also remind the reader that we have removed the variables playlist_genre and playlist_subgenre from consideration as tracks are not always assigned a unique genre and subgenre, and hence inclusion of these two variables would prevent us from having each track occur only once in our data set. The below output provides a numerical summary of the values for track_popularity. Using the below code and output, we see that the values for track_popularity range from 0 to 100 as indicated in the data dictionary. Although the maximum value of 100 seems to be much larger than the 3rd quartile of 58, this may simply indicate that most tracks do not become extremely popular, and so we do not remove the observations having high popularity.
# Using the below code, we see that the values for track_popularity appropriately range from 0 to 100, as indicated in the data dictionary.
summary(spotify_songs_2$track_popularity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 21.00 42.00 39.34 58.00 100.00
The below output provides a numerical summary of the variable danceability. According to the data dictionary, this variable should be measured on a scale between 0 and 1, and we see that the values for danceability all fall appropriately within this scale. We note that the minimum value of 0 is rather far away from the 1st quartile of 0.561. However, this may occur simply because most tracks are at least somewhat dance-able. Because we want to determine if tracks with low danceability tend to be less popular, we do not remove tracks with low danceability from our data set. By clicking on the code button directly below, you can see the code that was used to produce the following output.
# Using the below code, we create a numerical summary for the values of the variable danceability.
summary(spotify_songs_2$danceability)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.5610 0.6700 0.6534 0.7600 0.9830
The below output provides a numerical summary of the variable energy. According to the data dictionary, this variable should also be measured on a scale between 0 and 1, and we see that the values for energy all fall appropriately within this scale. We note that the minimum value of 0.000175 is rather far away from the 1st quartile of 0.579. However, this may occur simply because most tracks are at least somewhat energetic. Because we want to determine if tracks which are less energetic tend to be less popular, we do not remove tracks with low energy from our data set. By clicking on the code button directly below, you can see the code that was used to produce the following output.
# Using the below code, we create a numerical summary for the values of the variable energy.
summary(spotify_songs_2$energy)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000175 0.579000 0.722000 0.698372 0.843000 1.000000
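Since several variables share the same 0-to-1 scale, the range check can be applied to all of them at once. The following is a minimal sketch on a hypothetical data frame (toy) whose entries are a few values borrowed from the summaries in this section, not actual rows of the data set.

```r
# Hypothetical stand-in; the values are illustrative only.
toy <- data.frame(
  danceability = c(0.000, 0.561, 0.983),
  energy       = c(0.000175, 0.579, 1.000),
  speechiness  = c(0.000, 0.041, 0.918)
)

# Variables that the data dictionary describes as lying between 0 and 1.
unit_vars <- c("danceability", "energy", "speechiness")

# One row per variable, holding its minimum and maximum.
ranges <- t(sapply(toy[unit_vars], range))
colnames(ranges) <- c("min", "max")

# TRUE for each variable whose values all lie within [0, 1].
in_unit_interval <- ranges[, "min"] >= 0 & ranges[, "max"] <= 1
```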
The below output displays the number of observations in our data having each value for key. We notice that there do not appear to be any outliers for the variable key. We also notice that because -1 does not occur anywhere in the table, an overall key was identified for each piece of music in our data set. We also notice that the values for key are all integers between 0 and 11, and these values align appropriately with our expectations based on the definition of key provided in the data dictionary. By clicking on the code button directly below, you can see the code that was used to produce the following output.
# Using the below code, we create a table displaying the number of observations for each value of key.
table(spotify_songs_2$key)
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 3001 3436 2478 797 1925 2301 2261 2907 2066 2631 1972 2577
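To make a table of key values easier to read, the integer codes can be mapped to the note names given in the data dictionary. The sketch below uses a hypothetical vector of codes (key_codes); the labels in pitch_names are our own shorthand for the names listed in the data dictionary.

```r
# Illustrative key codes only; not taken from specific rows of the data.
key_codes <- c(6L, 11L, 1L, 7L, 0L)

# Note names for codes 0 through 11, following the data dictionary.
pitch_names <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

# Convert the codes to a labelled factor and tabulate.
key_factor <- factor(key_codes, levels = 0:11, labels = pitch_names)
key_counts <- table(key_factor)
```

With these levels, a code of -1 (no key detected) would be converted to NA, so it would need to be handled separately if it occurred.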
The below output displays the number of observations in our data having each value for mode. We notice that each piece of music is classified as either major or minor, and hence we do not need to worry about the fact that other modes do exist. We also notice that, as we would expect based on the definition provided in the data dictionary, the variable mode only has the values 0 and 1. Additionally, there do not appear to be any outliers for the variable mode. By clicking on the code button below, you can see the code that was used to produce the following output.
# Using the below code, we create a table displaying the number of observations for each value of mode.
table(spotify_songs_2$mode)
##
## 0 1
## 12318 16034
The below output provides a numerical summary of the variable speechiness. According to the data dictionary, this variable should be measured on a scale between 0 and 1, and we see that the values for speechiness all fall appropriately within this scale. We note that the maximum value of 0.918 is rather far away from the 3rd quartile of 0.133. However, this may occur simply because most tracks do not contain many spoken words. Because we want to determine if tracks with many spoken words tend to be less popular, we do not remove tracks with large speechiness from our data set. By clicking on the code button directly below, you can see the code that was used to produce the following output.
# Using the below code, we produce a numerical summary of the variable speechiness.
summary(spotify_songs_2$speechiness)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0410 0.0626 0.1079 0.1330 0.9180
The below output provides a numerical summary of the variable acousticness. According to the data dictionary, this variable should be measured on a scale between 0 and 1, and we see that the values for acousticness all fall within this scale. We note that the maximum value of 0.994 is far above the 3rd quartile of 0.26. However, this may occur simply because most tracks are not extremely acoustic. Because we want to determine if tracks that are mostly acoustic tend to be less popular, we do not remove tracks with large values for acousticness from our data set. By clicking on the code button directly below, you can see the code that was used to produce the following output.
# Using the below code, we produce a numerical summary of the variable acousticness.
summary(spotify_songs_2$acousticness)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0143 0.0797 0.1772 0.2600 0.9940
The below output provides a numerical summary of the variable instrumentalness. According to the data dictionary, this variable should be measured on a scale from 0 to 1, and we see that the values for instrumentalness all fall within this scale. We note that the maximum value of 0.994 is far above the 3rd quartile of 0.00657. However, this may occur simply because most tracks contain a great deal of vocals. Because we want to determine if tracks that do not contain vocals tend to be less popular, we do not remove tracks with large instrumentalness from our data set. By clicking on the code button directly below, you can see the code that was used to produce the following output.
# Using the below code, we produce a numerical summary of the variable instrumentalness.
summary(spotify_songs_2$instrumentalness)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000000 0.0000000 0.0000207 0.0911294 0.0065725 0.9940000
The below output provides a numerical summary of the variable liveness. According to the data dictionary, this variable should be measured on a scale from 0 to 1, and we see that the values for liveness all fall within this scale. We note that the maximum value of 0.996 is far above the 3rd quartile of 0.249. However, this may occur simply because most performers do not record their tracks live. Because we want to determine if popularity is influenced by live recording, we do not remove tracks with large values for liveness from our data set. By clicking on the code button directly below, you can see the code that was used to produce the following output.
# Using the below code, we produce a numerical summary of the variable liveness.
summary(spotify_songs_2$liveness)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0926 0.1270 0.1910 0.2490 0.9960
The below output provides a numerical summary of the variable valence. According to the data dictionary, this variable should be measured on a scale from 0 to 1, and we see that the values for valence all fall appropriately within this scale. Based on this output, there do not appear to be extreme outlying values for valence in the data set. By clicking on the code button directly below, you can see the code that was used to produce the following output.
# Using the below code, we produce a numerical summary of the variable valence.
summary(spotify_songs_2$valence)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3290 0.5120 0.5104 0.6950 0.9910
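The below histogram provides further support for the idea that there are no outlying values for the variable valence. It also indicates that the distribution of valence is roughly symmetric with a single peak, which agrees with the mean (0.5104) and median (0.512) being nearly equal. Because valence is bounded between 0 and 1, we describe it as approximately bell-shaped rather than exactly normally distributed. You can view the code used to produce the below histogram by clicking on the code button directly below.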
# Using the below code, we produce a histogram for the variable valence.
hist(spotify_songs_2$valence, main = "Frequency of Values for Valence", xlab = "Valence (measured on scale from 0 to 1)")
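The range checks performed one variable at a time above could also be carried out in a single step. The following is a minimal sketch under stated assumptions: `example_songs` is a small hypothetical data frame standing in for spotify_songs_2, and its values are taken from the numerical summaries shown above.

```r
# Hypothetical data frame standing in for spotify_songs_2.
example_songs <- data.frame(
  speechiness = c(0.041, 0.0626, 0.918),
  acousticness = c(0.0143, 0.0797, 0.994),
  liveness = c(0.0926, 0.127, 0.996),
  valence = c(0.329, 0.512, 0.991)
)

# Variables that the data dictionary describes as lying between 0 and 1.
unit_interval_variables <- c("speechiness", "acousticness", "liveness", "valence")

# For each variable, count the values that fall outside the interval [0, 1].
out_of_range_counts <- sapply(
  example_songs[unit_interval_variables],
  function(x) sum(x < 0 | x > 1, na.rm = TRUE)
)

print(out_of_range_counts)
```

A count of zero for every variable would correspond to the conclusion reached above that all values fall within the expected scale.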
In looking at the data dictionary, we notice that an ideal range for tempo is not provided. This is quite different from the other variables for which typical, expected ranges were given in their definitions. Because no standard range is provided for the expected values of tempo, we will remove any extreme outliers when examining this variable. The below output displays a numerical summary and box-plot for the variable tempo. Based on this output, we notice that there appears to be an outlier that is unusually small, and there is also an outlier that is abnormally large. The code used to produce this output can be seen by clicking on the code buttons directly below.
# Using the below code, we obtain a numerical summary of the variable tempo.
summary(spotify_songs_2$tempo)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 99.97 121.99 120.96 134.00 239.44
The below code and output are for the creation of the box-plot for the variable tempo.
# Using the below code, we obtain a boxplot for the variable tempo.
boxplot(spotify_songs_2$tempo, main = "Boxplot for Tempo", ylab = "Tempo")
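To give a sense of how a box-plot flags points, the following sketch computes the fences of the conventional 1.5 * IQR rule, which corresponds to the default `range = 1.5` argument of `boxplot`. The quartiles are taken from the numerical summary of tempo shown above; the variable names are illustrative assumptions.

```r
# Quartiles of tempo as reported in the numerical summary above.
tempo_q1 <- 99.97
tempo_q3 <- 134.00

# Interquartile range and the fences of the conventional 1.5 * IQR rule.
tempo_iqr <- tempo_q3 - tempo_q1
lower_fence <- tempo_q1 - 1.5 * tempo_iqr
upper_fence <- tempo_q3 + 1.5 * tempo_iqr

# Values below lower_fence or above upper_fence are drawn as individual points.
print(c(iqr = tempo_iqr, lower = lower_fence, upper = upper_fence))
```

Under this rule, many tracks may be drawn as individual points. The cutoffs of 30 and 230 that we use in this report lie well beyond these fences, so they target only the most extreme values.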
The below output indicates that only observation 10639 has a tempo less than 30. You can view the code used to show this by clicking on the code button directly below.
# Using the below code, we determine that only observation 10639 has a tempo less than 30.
which(spotify_songs_2$tempo < 30)
## [1] 10639
The below output indicates that only observation 17960 has a tempo greater than 230. You can view the code used to show this by clicking on the code button directly below.
# Using the below code, we find that only observation 17960 has a tempo greater than 230.
which(spotify_songs_2$tempo > 230)
## [1] 17960
We create a new data set that contains all of the observations in spotify_songs_2 except for these two outliers (i.e., all observations except for observations 10639 and 17960), and we name this revised data set spotify_songs_3. The code that we use to do this is shown directly below.
# Using the below code, we create a new data set containing all of the observations in spotify_songs_2 which have a value for tempo that is neither less than 30 nor greater than 230. We name this revised data set spotify_songs_3.
spotify_songs_3 <- filter(spotify_songs_2, tempo >= 30, tempo <= 230)
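The `filter` call above is assumed to be the one from the dplyr package. As a minimal sketch of what it does, the following base R code applies the same two conditions to a small hypothetical data frame called `example_songs`; the tempo values are taken from the numerical summary of tempo shown earlier.

```r
# Hypothetical data frame standing in for spotify_songs_2.
example_songs <- data.frame(
  track = c("a", "b", "c", "d"),
  tempo = c(0, 99.97, 121.99, 239.44)
)

# Keep the rows whose tempo lies between 30 and 230 inclusive. This mirrors
# filter(example_songs, tempo >= 30, tempo <= 230) when tempo has no missing values.
kept <- example_songs[example_songs$tempo >= 30 & example_songs$tempo <= 230, ]

# Number of rows removed by the tempo conditions.
rows_removed <- nrow(example_songs) - nrow(kept)

print(kept)
print(rows_removed)
```

Comparing the number of rows before and after filtering is a quick way to confirm that only the intended observations were removed.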
After removing these two outliers, we create a new numerical summary and box-plot for the variable tempo. In the box-plot, we can especially see that there no longer seem to be any extreme outliers. The numerical summary and box-plot are shown directly below, and you can access the code required to produce them by clicking on the code buttons below.
# Using the below code, we produce a numerical summary of the variable tempo.
summary(spotify_songs_3$tempo)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.48 99.97 121.99 120.96 134.00 220.25
The below code and output are for the creation of a box-plot for the variable tempo.
# Using the below code, we produce a boxplot for the variable tempo.
boxplot(spotify_songs_3$tempo, main = "Boxplot for Tempo with Outliers Removed", ylab = "Tempo")
The below output provides a numerical summary and box-plot for the variable duration_ms in the data set spotify_songs_3. In the box-plot especially, we can see that there appear to be no extreme outliers. You can view the code required to produce this output by clicking on the code buttons below.
# Using the below code, we produce a numerical summary for the variable duration_ms.
summary(spotify_songs_3$duration_ms)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29493 187743 216933 226587 254976 517810
The below output and code are for the creation of a box-plot for the variable duration_ms.
# Using the below code, we produce a boxplot for the variable duration_ms.
boxplot(spotify_songs_3$duration_ms, main = "Boxplot of Duration", ylab = "Duration (in milliseconds)")
The below output provides a numerical summary of the variable loudness. In the data dictionary, it is stated that loudness can generally be expected to be between -60 and zero decibels. We note that the minimum observed value of loudness is -46.448, which is quite far from the 1st quartile of -8.309. Although there is a large difference between the minimum and the 1st quartile, we do not remove observations with extremely small values for loudness because they fall within the typical range given in the data dictionary. Keeping these observations will also allow us to determine if tracks with extremely small values for loudness tend to be less popular. You can view the code required to produce the below output by clicking on the code button directly below.
# Using the below code, we produce a numerical summary of the variable loudness.
summary(spotify_songs_3$loudness)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -46.448 -8.309 -6.261 -6.817 -4.708 1.275
Based on the above output, we note that the maximum observed value of 1.275 is outside of the expected normal range for loudness stated in the data dictionary. Using the below code and output, we find that there are only six observations that contain an abnormal value for loudness which is above zero. Because this is only a small proportion of the total number of observations, we will remove these six observations from the data set.
# Using the below code, we find that there are only six observations having a loudness that is greater than zero decibels.
length(which(spotify_songs_3$loudness > 0))
## [1] 6
Because the data dictionary gives a typical range of -60 to zero decibels for loudness, we treat tracks having a loudness above zero as unusual, and we create a new data set with these observations removed. This new data set is called spotify_songs_4, and it contains all of the observations in spotify_songs_3 except for those having a value for loudness that is greater than zero. The code that we use to accomplish this can be viewed by clicking on the code button shown directly below.
# Using the below code, we create a new data set called spotify_songs_4 that contains all of the observations in spotify_songs_3 except for those having a value for loudness that is greater than zero.
spotify_songs_4 <- filter(spotify_songs_3, loudness <= 0)
Recall that we mentioned that the variables track_id, track_name, track_artist, track_album_id, track_album_name, and track_album_release_date are not extremely important for our analysis. New performers may not necessarily have already produced albums or tracks to appear in Spotify playlists, and consequently, record labels cannot always use these variables to strategically select new performers to engage in contracts. Because these variables are not heavily utilized in our analysis, we will unite columns from among these variables which contain similar information. For instance, we will unite track_id and track_name, as these two variables both serve similar purposes for identifying tracks. Similarly, we will unite track_album_id and track_album_name since these two variables serve the same purpose (i.e., identifying the album containing the track). By uniting these columns, we create a tidied data set that is more simplified in the sense that it has a smaller number of variables, and we name this new tidied data set spotify_songs_5. We use underscores to separate the entries of united columns, as indicated in the below code.
# Using the below code, we create a new data set called spotify_songs_5 in which the track_id and track_name columns have been united to form a column titled track_id_and_name
spotify_songs_5 <- unite(spotify_songs_4, track_id_and_name, track_id, track_name, sep = "_")
# Using the below code, we unite the track_album_id and track_album_name columns to form a new column titled track_album_id_and_name
spotify_songs_5 <- unite(spotify_songs_5, track_album_id_and_name, track_album_id, track_album_name, sep = "_")
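To illustrate what uniting two columns produces, the following sketch uses base R's `paste` on the identifier and name of the first track in our data. It also shows one way that the two pieces could be recovered later. This is only a sketch: it assumes that the track identifier itself never contains an underscore, and the variable names are hypothetical.

```r
# Identifier and name of the first track in the data (ASCII apostrophe used here).
track_id <- "6f807x0ima9a1j3VPbc7VN"
track_name <- "I Don't Care (with Justin Bieber) - Loud Luxury Remix"

# Base R equivalent of uniting the two columns with an underscore separator.
track_id_and_name <- paste(track_id, track_name, sep = "_")

# Recover the two pieces by splitting at the first underscore only, which is
# safe under the assumption that the identifier contains no underscore.
recovered_id <- sub("_.*$", "", track_id_and_name)
recovered_name <- sub("^[^_]*_", "", track_id_and_name)

print(track_id_and_name)
```

Splitting at the first underscore rather than at every underscore matters because track names may themselves contain underscores.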
This section is devoted to displaying visual summaries of the data in order to gain a better understanding of the data. We will use this understanding later to inform our exploratory data analysis.
The below code and output are used to create a column chart depicting the frequency of each artist in the data set.
# Using the below code, we produce a column chart displaying the frequency of each artist in the data set.
barplot(table(spotify_songs_5$track_artist), main = "Frequency of Each Artist", xlab = "Artist Name", ylab = "Frequency")
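The below code and output are used to create a histogram of track_popularity. We see that there are many tracks that are not very popular. Indeed, the first quartile of track_popularity in the clean data is 21, so a quarter of the tracks have a popularity of 21 or less on a scale from 0 to 100.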
# Using the below code, a histogram is created for the variable track_popularity.
hist(spotify_songs_5$track_popularity, main = "Histogram of Track Popularity", xlab = "Track Popularity")
The below code and output are used to display a histogram for the variable danceability. Notice that the distribution appears to be left-skewed.
# Using the below code, we produce a histogram for the variable danceability.
hist(spotify_songs_5$danceability, main = "Histogram for Danceability", xlab = "Danceability")
Using the below code and histogram output, we see that the distribution of the variable energy also appears to be left-skewed.
# Using the below code, we produce a histogram for the variable energy.
hist(spotify_songs_5$energy, main = "Histogram for Energy", xlab = "Energy")
The below code and output are used to display a column chart depicting the frequency of each track_album_id_and_name.
# Using the below code, we display a column chart depicting the frequency of each track_album_id_and_name.
barplot(table(spotify_songs_5$track_album_id_and_name), main = "Frequency of Each Album", xlab = "Track Album ID and Name", ylab = "Frequency")
Using the below code and output, we display a box-plot for the variable key. We notice that there do not seem to be any outliers.
# Using the below code, we produce a boxplot for the variable key.
boxplot(spotify_songs_5$key, main = "Box-plot for Key", ylab = "Key")
Using the below code and histogram output, we see that the distribution for the variable loudness is left-skewed.
# Using the below code, we produce a histogram of the variable loudness.
hist(spotify_songs_5$loudness, breaks = 100, main = "Histogram of Loudness", xlab = "Loudness in Decibels")
Using the below code and histogram output, we visually see that the variable mode contains only two values.
# Using the below code, we produce a histogram of the variable mode.
hist(spotify_songs_5$mode, main = "Histogram of Mode", xlab = "Mode")
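Because mode takes only two values, a bar chart of counts is a possible alternative to a histogram. The following is a minimal sketch that uses the counts reported earlier for spotify_songs_2 (that is, before the eight outlying observations were removed), so the exact heights would differ slightly for spotify_songs_5. The labels assume the data dictionary's coding of 0 for minor and 1 for major.

```r
# Counts of minor (0) and major (1) tracks as reported for spotify_songs_2.
mode_counts <- c(minor = 12318, major = 16034)

# Share of tracks in each mode.
mode_proportions <- mode_counts / sum(mode_counts)

# A bar chart draws one bar per category, which suits a two-valued variable.
barplot(mode_counts, main = "Frequency of Each Mode", xlab = "Mode", ylab = "Frequency")

print(round(mode_proportions, 3))
```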
Using the below code and histogram output, we see that the distribution for the variable speechiness is right-skewed.
# Using the below code, we produce a histogram of the variable speechiness.
hist(spotify_songs_5$speechiness, main = "Histogram of Speechiness", xlab = "Speechiness")
Using the below code and output, we display a column chart of the frequency of each value for track_id_and_name. Because each track occurs only once in the data set, we observe that the frequency for each track is 1 in the column chart.
# Using the below code, we create a column chart for the variable track_id_and_name
barplot(table(spotify_songs_5$track_id_and_name), main = "Frequency of Each Track", xlab = "Track ID and Name", ylab = "Frequency")
Using the below code and output, we display a histogram of the variable acousticness, and we see that the distribution of the variable acousticness is right-skewed.
# Using the below code, we produce a histogram of the variable acousticness.
hist(spotify_songs_5$acousticness, main = "Histogram of Acousticness", xlab = "Acousticness")
Using the below code and output, we display a histogram for the variable instrumentalness, and we see that most tracks have very low values for the variable instrumentalness.
# Using the below code, we produce a histogram for the variable instrumentalness.
hist(spotify_songs_5$instrumentalness, main = "Histogram of Instrumentalness", xlab = "Instrumentalness")
Using the below code and output, we display a histogram for the variable liveness. We notice that the distribution for liveness appears to be somewhat bi-modal, though the left-most peak is much more pronounced.
# Using the below code, we produce a histogram for the variable liveness.
hist(spotify_songs_5$liveness, breaks = 100, main = "Histogram of Liveness", xlab = "Liveness")
The below output displays a column chart which indicates the frequency of each release date for albums. You can view the code used to create this output by clicking on the code button directly below.
# Using the below code, we produce a column chart which indicates the frequency of each release date for albums.
barplot(table(spotify_songs_5$track_album_release_date), main = "Frequency of Each Release Date for Albums", xlab = "Album Release Date", ylab = "Frequency")
Using the below code and output, we display a box-plot for the variable valence, and we see that the variable valence does not appear to contain any outliers.
# Using the below code, we produce a boxplot for the variable valence.
boxplot(spotify_songs_5$valence, main = "Box-plot for Valence", ylab = "Valence")
Using the below code and output, we display a histogram for the variable tempo. In the histogram, we see that the distribution of tempo appears to have multiple peaks.
# Using the below code, we produce a histogram for the variable tempo.
hist(spotify_songs_5$tempo, breaks = 100, main = "Histogram of Tempo", xlab = "Tempo")
Using the below code and output, we display a histogram of the variable duration_ms, and we see that the variable’s distribution is right-skewed.
# Using the below code, we produce a histogram for the variable duration_ms.
hist(spotify_songs_5$duration_ms, breaks = 100, main = "Histogram of the Duration of Tracks", xlab = "Duration (in milliseconds)")
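The skewness observations made from the histograms in this section can be cross-checked numerically by comparing each variable's mean with its median. The following sketch does so with values taken from the table of descriptive statistics for the clean data. Comparing the mean and the median is only a rule of thumb and not a formal test of skewness, and the data frame `reported` and its column names are illustrative assumptions.

```r
# Means and medians taken from the descriptive statistics of the clean data.
reported <- data.frame(
  variable = c("danceability", "energy", "loudness",
               "speechiness", "acousticness", "duration_ms"),
  mean_value = c(0.6533830, 0.6983495, -6.818456, 0.1079337, 0.1772138, 226596.4),
  median_value = c(0.670, 0.722, -6.262, 0.0626, 0.0797, 216933)
)

# A mean below the median is suggestive of left skew, and a mean above the
# median is suggestive of right skew.
reported$suggested_skew <- ifelse(reported$mean_value < reported$median_value,
                                  "left", "right")

print(reported)
```

For these six variables, the direction suggested by the rule of thumb agrees with the direction that we observed in the histograms.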
The below table displays the first rows of the clean data set (i.e., spotify_songs_5). You can view the code that was used to produce this output by clicking on the code button directly below.
# Using the below code, we display a table containing the first rows of the clean data.
knitr::kable(
head(spotify_songs_5),
align = "ccccccccccccccccc"
)
| track_id_and_name | track_artist | track_popularity | track_album_id_and_name | track_album_release_date | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6f807x0ima9a1j3VPbc7VN_I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx_I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 |
| 0r7CVbZTWZgbTCYdfa2P31_Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6_Memories (Dillon Francis Remix) | 2019-12-13 | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 |
| 1z1Hg7Vb0AhHDiEmnDE79l_All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4_All the Time (Don Diablo Remix) | 2019-07-05 | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 |
| 75FpbthrwQmzHlBJLuGdC7_Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6_Call You Mine - The Remixes | 2019-07-19 | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 |
| 1e8PAfcKUYoKkxPhrHqw4x_Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ_Someone You Loved (Future Humans Remix) | 2019-03-05 | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 |
| 7fvUMiyapMsRRxr07cU8Ef_Beautiful People (feat. Khalid) - Jack Wins Remix | Ed Sheeran | 67 | 2yiy9cd2QktrNvWC2EUi0k_Beautiful People (feat. Khalid) [Jack Wins Remix] | 2019-07-11 | 0.675 | 0.919 | 8 | -5.385 | 1 | 0.1270 | 0.0799 | 0.00e+00 | 0.1430 | 0.585 | 124.982 | 163049 |
We remind the reader that visual summaries of the clean data were already produced in Section 5.
Using the below code, we find that the clean data set contains 28344 observations and 17 variables. We also find that the variables track_id_and_name, track_artist, track_album_id_and_name, and track_album_release_date all have a character type. The variables track_popularity, key, mode, and duration_ms all have an integer type. Also, the variables danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, and tempo all have a numeric type (in particular, they are of type double). As instructed in the grading rubric, we have removed the output from display in favor of providing the succinct summary contained in this paragraph.
# Using the below code, we examine the structure of the clean data.
str(spotify_songs_5)
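The number of observations reported above can be reconciled with the earlier cleaning steps. The following sketch sums the counts from the table for key (which covers every observation in spotify_songs_2) and subtracts the two tempo outliers and the six loudness outliers that were removed. The variable names are illustrative assumptions.

```r
# Counts per key reported for spotify_songs_2 (keys 0 through 11).
key_counts <- c(3001, 3436, 2478, 797, 1925, 2301, 2261, 2907, 2066, 2631, 1972, 2577)

# Every observation has exactly one key, so the counts sum to the number of rows.
rows_before_cleaning <- sum(key_counts)

# Observations removed when creating spotify_songs_3 and spotify_songs_4.
tempo_outliers_removed <- 2
loudness_outliers_removed <- 6

rows_after_cleaning <- rows_before_cleaning - tempo_outliers_removed - loudness_outliers_removed

print(rows_after_cleaning)
```

The result is 28344, which matches the number of observations in the clean data set.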
Using the below code, we also find that there are no longer any missing values in the data. As instructed in the grading rubric, we have not shown the R output in favor of the more succinct summary provided in the preceding sentence.
# Using the below code, we find the number of missing values for each variable.
colSums(is.na(spotify_songs_5))
We remind the reader that track_id_and_name, track_artist, track_album_id_and_name, and track_album_release_date are not extremely important in our analysis. Since new performers may not already have an album and tracks in Spotify playlists, record labels will not always be able to use these variables in decisions regarding initial contracts with new performers. Hence we focus our numerical summaries around the other variables which we will be studying in our analysis. The below output displays a table of descriptive statistics for these important variables used in our analysis. You can view the code used to produce this output by clicking the code button directly below.
# Using the below code, we create a vector called minimum which stores the minimum value of each of the important variables that we plan to analyze.
minimum <- c(min(spotify_songs_5$track_popularity), min(spotify_songs_5$danceability), min(spotify_songs_5$energy), min(spotify_songs_5$key), min(spotify_songs_5$loudness), min(spotify_songs_5$mode), min(spotify_songs_5$speechiness), min(spotify_songs_5$acousticness), min(spotify_songs_5$instrumentalness), min(spotify_songs_5$liveness), min(spotify_songs_5$valence), min(spotify_songs_5$tempo), min(spotify_songs_5$duration_ms))
# Using the below code, we create a vector called maximum which stores the maximum value of each of the important variables that we plan to analyze.
maximum <- c(max(spotify_songs_5$track_popularity), max(spotify_songs_5$danceability), max(spotify_songs_5$energy), max(spotify_songs_5$key), max(spotify_songs_5$loudness), max(spotify_songs_5$mode), max(spotify_songs_5$speechiness), max(spotify_songs_5$acousticness), max(spotify_songs_5$instrumentalness), max(spotify_songs_5$liveness), max(spotify_songs_5$valence), max(spotify_songs_5$tempo), max(spotify_songs_5$duration_ms))
# Using the below code, we create a vector called first_quantile which stores the first quartile (i.e., the 25th percentile) of each of the important variables that we plan to analyze.
first_quantile <- c(summary(spotify_songs_5$track_popularity)[2], summary(spotify_songs_5$danceability)[2], summary(spotify_songs_5$energy)[2], summary(spotify_songs_5$key)[2], summary(spotify_songs_5$loudness)[2], summary(spotify_songs_5$mode)[2], summary(spotify_songs_5$speechiness)[2], summary(spotify_songs_5$acousticness)[2], summary(spotify_songs_5$instrumentalness)[2], summary(spotify_songs_5$liveness)[2], summary(spotify_songs_5$valence)[2], summary(spotify_songs_5$tempo)[2], summary(spotify_songs_5$duration_ms)[2])
# Using the below code, we create a vector called median which stores the median of each of the important variables that we plan to analyze.
median <- c(summary(spotify_songs_5$track_popularity)[3], summary(spotify_songs_5$danceability)[3], summary(spotify_songs_5$energy)[3], summary(spotify_songs_5$key)[3], summary(spotify_songs_5$loudness)[3], summary(spotify_songs_5$mode)[3], summary(spotify_songs_5$speechiness)[3], summary(spotify_songs_5$acousticness)[3], summary(spotify_songs_5$instrumentalness)[3], summary(spotify_songs_5$liveness)[3], summary(spotify_songs_5$valence)[3], summary(spotify_songs_5$tempo)[3], summary(spotify_songs_5$duration_ms)[3])
# Using the below code, we create a vector called mean which stores the mean of each of the important variables that we plan to analyze.
mean <- c(summary(spotify_songs_5$track_popularity)[4], summary(spotify_songs_5$danceability)[4], summary(spotify_songs_5$energy)[4], summary(spotify_songs_5$key)[4], summary(spotify_songs_5$loudness)[4], summary(spotify_songs_5$mode)[4], summary(spotify_songs_5$speechiness)[4], summary(spotify_songs_5$acousticness)[4], summary(spotify_songs_5$instrumentalness)[4], summary(spotify_songs_5$liveness)[4], summary(spotify_songs_5$valence)[4], summary(spotify_songs_5$tempo)[4], summary(spotify_songs_5$duration_ms)[4])
# Using the below code, we create a vector called third_quantile which stores the third quartile (i.e., the 75th percentile) of each of the important variables that we plan to analyze.
third_quantile <- c(summary(spotify_songs_5$track_popularity)[5], summary(spotify_songs_5$danceability)[5], summary(spotify_songs_5$energy)[5], summary(spotify_songs_5$key)[5], summary(spotify_songs_5$loudness)[5], summary(spotify_songs_5$mode)[5], summary(spotify_songs_5$speechiness)[5], summary(spotify_songs_5$acousticness)[5], summary(spotify_songs_5$instrumentalness)[5], summary(spotify_songs_5$liveness)[5], summary(spotify_songs_5$valence)[5], summary(spotify_songs_5$tempo)[5], summary(spotify_songs_5$duration_ms)[5])
# Using the below code, we create a vector called standard_deviation which stores the standard deviation of each of the important variables that we plan to analyze.
standard_deviation <- c(sd(spotify_songs_5$track_popularity), sd(spotify_songs_5$danceability), sd(spotify_songs_5$energy), sd(spotify_songs_5$key), sd(spotify_songs_5$loudness), sd(spotify_songs_5$mode), sd(spotify_songs_5$speechiness), sd(spotify_songs_5$acousticness), sd(spotify_songs_5$instrumentalness), sd(spotify_songs_5$liveness), sd(spotify_songs_5$valence), sd(spotify_songs_5$tempo), sd(spotify_songs_5$duration_ms))
# Using the below code, we create a data frame called descriptive_statistics that contains the vectors minimum, maximum, first_quantile, median, mean, third_quantile, and standard_deviation.
descriptive_statistics <- data.frame(minimum, maximum, first_quantile, median, mean, third_quantile, standard_deviation)
# Using the below code, we name each row in descriptive_statistics according to the corresponding variable described by those statistics.
row.names(descriptive_statistics) <- c("track_popularity", "danceability", "energy", "key", "loudness", "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "duration_ms")
# Using the below code, we output a table displaying the values in descriptive_statistics.
knitr::kable(
descriptive_statistics,
caption = "Descriptive Statistics and Numerical Summaries for the Clean Data"
)
| | minimum | maximum | first_quantile | median | mean | third_quantile | standard_deviation |
|---|---|---|---|---|---|---|---|
| track_popularity | 0.0000e+00 | 100.000 | 21.00000 | 4.20000e+01 | 3.933887e+01 | 5.800000e+01 | 2.369742e+01 |
| danceability | 7.7100e-02 | 0.983 | 0.56100 | 6.70000e-01 | 6.533830e-01 | 7.600000e-01 | 1.457295e-01 |
| energy | 1.7500e-04 | 1.000 | 0.57900 | 7.22000e-01 | 6.983495e-01 | 8.430000e-01 | 1.834771e-01 |
| key | 0.0000e+00 | 11.000 | 2.00000 | 6.00000e+00 | 5.367556e+00 | 9.000000e+00 | 3.613605e+00 |
| loudness | -4.6448e+01 | -0.046 | -8.31025 | -6.26200e+00 | -6.818456e+00 | -4.709750e+00 | 3.032471e+00 |
| mode | 0.0000e+00 | 1.000 | 0.00000 | 1.00000e+00 | 5.654459e-01 | 1.000000e+00 | 4.957071e-01 |
| speechiness | 2.2400e-02 | 0.918 | 0.04100 | 6.26000e-02 | 1.079337e-01 | 1.330000e-01 | 1.025514e-01 |
| acousticness | 1.4000e-06 | 0.994 | 0.01430 | 7.97000e-02 | 1.772138e-01 | 2.600000e-01 | 2.228397e-01 |
| instrumentalness | 0.0000e+00 | 0.994 | 0.00000 | 2.07000e-05 | 9.115160e-02 | 6.582500e-03 | 2.325908e-01 |
| liveness | 9.3600e-03 | 0.996 | 0.09260 | 1.27000e-01 | 1.909535e-01 | 2.490000e-01 | 1.558725e-01 |
| valence | 1.0000e-05 | 0.991 | 0.32900 | 5.12000e-01 | 5.104299e-01 | 6.950000e-01 | 2.343385e-01 |
| tempo | 3.5477e+01 | 220.252 | 99.97200 | 1.21993e+02 | 1.209551e+02 | 1.339957e+02 | 2.693762e+01 |
| duration_ms | 2.9493e+04 | 517810.000 | 187746.50000 | 2.16933e+05 | 2.265964e+05 | 2.549773e+05 | 6.106305e+04 |
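The same kind of table could also be produced more compactly by applying one summary function to every column at once. The following is a minimal sketch rather than the code that we used: `example_songs` is a hypothetical data frame holding three variables for the six tracks shown in the first rows of the clean data, and `compact_statistics` is an illustrative name.

```r
# Hypothetical data frame holding three variables for the first six tracks.
example_songs <- data.frame(
  track_popularity = c(66, 67, 70, 60, 69, 67),
  danceability = c(0.748, 0.726, 0.675, 0.718, 0.650, 0.675),
  tempo = c(122.036, 99.972, 124.008, 121.956, 123.976, 124.982)
)

# Compute the seven statistics for each column, then transpose so that each
# row corresponds to one variable.
compact_statistics <- t(sapply(example_songs, function(x) {
  c(minimum = min(x),
    maximum = max(x),
    first_quartile = unname(quantile(x, 0.25)),
    median = median(x),
    mean = mean(x),
    third_quartile = unname(quantile(x, 0.75)),
    standard_deviation = sd(x))
}))

print(round(compact_statistics, 3))
```

This approach avoids repeating the name of every variable for every statistic, which may reduce the chance of a copy-and-paste error.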
We remind the reader that the objective of this study is to help record labels identify new artists who are likely to produce popular songs. Before developing contracts, record labels are generally provided with a sample of songs from potential artists. We hypothesize that record labels can use properties of these songs to determine their likelihood of becoming popular, and we aim to help record labels in this determination via the use of regression analysis. We also remind the reader that new artists may not necessarily already have songs and/or albums appearing in Spotify, and hence the Spotify track_id, track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name, and playlist_id are not musical characteristics that can always be used by record labels to identify promising new artists. Hence we do not focus our analysis around these variables. Further as mentioned in previous sections, we remind the reader that playlist_genre and playlist_subgenre are not unique for each track_id. Indeed some track id’s are assigned multiple genres and sub-genres. Because each track_id cannot be uniquely assigned a specific sub-genre and genre to use as covariates in our regression model, we exclude these variables from our analysis. The below output displays scatter plots, histograms, and correlations among the remaining variables that we plan to analyze. The code used to produce this output can be seen by clicking on the code button directly below.
# Using the below code, we display scatter plots, histograms, and correlations for the variables that we plan to analyze.
pairs.panels(spotify_songs_5[,c(3, 6:17)])
In examining the above output, we remind the reader that observations from histograms were already discussed in Section 5. As a consequence, we focus here on the scatter plots and correlations shown in the above output. We note that there is not much correlation between most variables. In fact, we see only one occurrence of a correlation that is greater than 0.5, and only one occurrence of a correlation that is less than -0.5. In particular, there is a correlation of 0.68 between energy and loudness, and there is a correlation of -0.55 between energy and acousticness. This indicates that there is a moderate positive linear relationship between energy and loudness, and a moderate negative linear relationship between energy and acousticness. Because energy appears to be related to both loudness and acousticness, including all three variables in our regression model may introduce multicollinearity, and we may consider excluding loudness and/or acousticness. We will keep this in mind as we move forward in our investigation.
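The correlations discussed above can also be extracted numerically rather than read from the plot. The following is a minimal sketch on simulated data: `example_songs` is a hypothetical stand-in for the numeric columns of spotify_songs_5, and the simulated relationships are assumptions chosen only to resemble the pattern described above. The sketch uses base R's `cor` and lists each pair of variables whose correlation exceeds 0.5 in absolute value.

```r
# Simulated data standing in for the numeric columns of spotify_songs_5.
set.seed(1)
n <- 5000
energy <- runif(n)
example_songs <- data.frame(
  energy = energy,
  loudness = -10 + 8 * energy + rnorm(n, sd = 2.5),
  acousticness = 0.8 - 0.6 * energy + rnorm(n, sd = 0.2),
  valence = runif(n)
)

# Pairwise Pearson correlations.
correlations <- cor(example_songs)

# Keep each pair once (upper triangle) and list the pairs with |r| > 0.5.
strong <- which(abs(correlations) > 0.5 & upper.tri(correlations), arr.ind = TRUE)
strong_pairs <- data.frame(
  first = rownames(correlations)[strong[, "row"]],
  second = colnames(correlations)[strong[, "col"]],
  correlation = round(correlations[strong], 2)
)

print(strong_pairs)
```

Applied to the real data, a listing of this kind would make it straightforward to confirm that no pair involving track_popularity appears among the strongly correlated pairs.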
The fact that we observe no strong correlations involving track_popularity may possibly indicate one of the following two situations:
There is no relationship between track_popularity and the other variables under consideration, or
The relationship between track_popularity and other variables is non-linear in nature.
In order to determine whether there is a non-linear relationship between track_popularity and the other variables under consideration, we examine scatter plots having track_popularity as the y-axis. Although small scatter plots are shown in the above output, we can gain new information more clearly by examining larger graphs. Hence we produce various scatter plots having track_popularity as the y-axis. These scatter plots are shown below. You can view the code that was used to create them by clicking on the code button displayed directly below.
# Using the below code, we create a variable called j and give it the value 6.
j = 6
# Using the below code, we create a variable called i and give it the value 1.
i = 1
# Using the below code, we create a variable containing all of the names for our x-axes that will be used in our scatter plots.
x_axis <- c("Danceability", "Energy", "Key", "Loudness", "Mode", "Speechiness", "Acousticness", "Instrumentalness", "Liveness", "Valence", "Tempo", "Duration (in milliseconds)")
# Using the below code, we create scatter plots to examine the relationships between track popularity and various musical characteristics.
for(k in 6:17){
print(ggplot(data = spotify_songs_5, aes(x = spotify_songs_5[,k], y = track_popularity, color = as.character(key))) + geom_point() + geom_smooth() + labs(x = x_axis[k - 5], y = "Popularity", color = "Key") + ggtitle(paste0("Track Popularity and ", x_axis[k - 5]), subtitle = paste0("Is There a Relationship Between Popularity and ", x_axis[k - 5], "?")) + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)))
}
Each of the above scatter plots contains a smooth curve fitted to the data for each key. These curves indicate the general trends in the data. We notice that in the first scatter plot titled “Track Popularity and Danceability,” the smooth curves appear to be relatively similar, indicating that there is likely not much interaction between the variables danceability and key. We also notice that in this plot, the highest levels of popularity all tend to occur at high levels of danceability. This indicates that there is likely a relationship between danceability and track popularity, and hence we will plan to include the variable danceability in our initial regression model. This observation can also be seen in the below table. The below table displays descriptive statistics for popularity when the data is split into bins based on levels of danceability or loudness. In particular, the first row displays descriptive statistics of popularity for the tracks having a value for danceability larger than 0.2, and the second row displays descriptive statistics of popularity for tracks having a value for danceability that is no more than 0.2. We notice that the maximum value in the first row is 100, but the maximum value in the second row is only 77. This indicates that although popularity is sometimes high for tracks with high danceability, popularity never exceeds 77 when danceability is small (i.e., when danceability is no larger than 0.2). The code used to produce the below table can be seen by clicking on the code button directly below.
# Using the below code, we create a variable containing the minimum values for popularity when the data is split into different bins (where the groupings of bins are based on values for danceability or loudness).
minimum <- c(summary(spotify_songs_5$track_popularity[spotify_songs_5$danceability > 0.2])[1], summary(spotify_songs_5$track_popularity[spotify_songs_5$danceability <= 0.2])[1], summary(spotify_songs_5$track_popularity[spotify_songs_5$loudness > -25])[1], summary(spotify_songs_5$track_popularity[spotify_songs_5$loudness <= -25])[1])
# Using the below code, we create a variable containing the first quartile for popularity when the data is split into different bins (where the groupings of bins are based on values for danceability or loudness).
first_quantile <- c(summary(spotify_songs_5$track_popularity[spotify_songs_5$danceability > 0.2])[2], summary(spotify_songs_5$track_popularity[spotify_songs_5$danceability <= 0.2])[2], summary(spotify_songs_5$track_popularity[spotify_songs_5$loudness > -25])[2], summary(spotify_songs_5$track_popularity[spotify_songs_5$loudness <= -25])[2])
# Using the below code, we create a variable containing the median for popularity when the data is split into different bins (where the groupings of bins are based on values for danceability or loudness).
median <- c(summary(spotify_songs_5$track_popularity[spotify_songs_5$danceability > 0.2])[3], summary(spotify_songs_5$track_popularity[spotify_songs_5$danceability <= 0.2])[3], summary(spotify_songs_5$track_popularity[spotify_songs_5$loudness > -25])[3], summary(spotify_songs_5$track_popularity[spotify_songs_5$loudness <= -25])[3])
# Using the below code, we create a variable containing the third quartile for popularity when the data is split into different bins (where the groupings of bins are based on values for danceability or loudness).
third_quantile <- c(summary(spotify_songs_5$track_popularity[spotify_songs_5$danceability > 0.2])[5], summary(spotify_songs_5$track_popularity[spotify_songs_5$danceability <= 0.2])[5], summary(spotify_songs_5$track_popularity[spotify_songs_5$loudness > -25])[5], summary(spotify_songs_5$track_popularity[spotify_songs_5$loudness <= -25])[5])
# Using the below code, we create a variable containing the maximum for popularity when the data is split into different bins (where the groupings of bins are based on values for danceability or loudness).
maximum <- c(summary(spotify_songs_5$track_popularity[spotify_songs_5$danceability > 0.2])[6], summary(spotify_songs_5$track_popularity[spotify_songs_5$danceability <= 0.2])[6], summary(spotify_songs_5$track_popularity[spotify_songs_5$loudness > -25])[6], summary(spotify_songs_5$track_popularity[spotify_songs_5$loudness <= -25])[6])
# Using the below code, we create a data frame containing the variables that we created in the above portion of this chunk of code.
descriptive_statistics_2 <- data.frame(minimum, first_quantile, median, third_quantile, maximum)
# Using the below code, we name the rows of our data frame based on the groupings related to the value of danceability or loudness.
row.names(descriptive_statistics_2) <- c("Danceability Greater Than 0.2", "Danceability No Larger Than 0.2", "Loudness Greater Than -25", "Loudness No More Than -25")
# Using the below code, we display our data frame in a table.
knitr::kable(
descriptive_statistics_2,
caption = "Descriptive Statistics for Popularity When Data is Grouped"
)
| | minimum | first_quantile | median | third_quantile | maximum |
|---|---|---|---|---|---|
| Danceability Greater Than 0.2 | 0 | 21.0 | 42 | 58.0 | 100 |
| Danceability No Larger Than 0.2 | 0 | 33.0 | 46 | 57.0 | 77 |
| Loudness Greater Than -25 | 0 | 21.0 | 42 | 58.0 | 100 |
| Loudness No More Than -25 | 0 | 44.5 | 50 | 51.5 | 71 |
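The grouped statistics in the table above can also be computed in one pass per grouping variable. The following is a minimal sketch on toy values; the helper `grouped_summary` and the toy vectors are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: five-number summary of a response within each group.
grouped_summary <- function(response, group) {
  stats <- tapply(response, group, quantile, probs = c(0, 0.25, 0.5, 0.75, 1))
  out <- do.call(rbind, stats)
  colnames(out) <- c("minimum", "first_quartile", "median",
                     "third_quartile", "maximum")
  out
}

# Toy values (an assumption, used only so the sketch is self-contained).
popularity <- c(0, 10, 20, 30, 100, 5, 15, 25, 35, 77)
danceability <- c(0.5, 0.6, 0.7, 0.8, 0.9, 0.10, 0.15, 0.20, 0.05, 0.12)
group <- ifelse(danceability > 0.2,
                "Danceability Greater Than 0.2",
                "Danceability No Larger Than 0.2")

result <- grouped_summary(popularity, group)
print(result)
```

Because the maximum of a group tends to grow with the number of observations in that group, it may also be worth reporting the group sizes (for example with `table(group)`) alongside these statistics.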
We next examine the plot titled “Track Popularity and Energy” displayed earlier in this section. In this plot, the smooth curves appear to be relatively similar, indicating that there is likely not much interaction between the variables energy and key. We also notice that in this plot, the highest levels of popularity occur at neither extremely high nor extremely low values for energy. This indicates that there is likely a nonlinear relationship between energy and popularity, and hence we will plan to include the variable energy in our initial regression model.
In looking at the plot titled “Track Popularity and Key,” we do not notice any apparent relationship between key and popularity. Hence we may want to consider excluding this variable from our regression model.
We next consider the scatter plot titled “Track Popularity and Loudness.” In this plot, we notice that the smooth curves are not extremely similar in the far left of the graph. In particular, some curves tend to increase while others tend to decrease in the far left of the plot. This indicates that there may be interaction between the variables loudness and key, and we will explore this interaction further in developing our regression model. In looking at the plot, we also notice that when tracks are incredibly popular, they tend to have a high value for loudness. This observation can also be seen in the table displayed directly above. The third row of the preceding table displays descriptive statistics of popularity for tracks having a value for loudness larger than -25, and the fourth row displays descriptive statistics of popularity for tracks having a value for loudness that is no more than -25. We notice that the maximum value in the third row is 100, but the maximum value in the fourth row is only 71. This indicates that although popularity is sometimes high for tracks with high loudness, popularity never exceeds 71 when loudness is small (i.e., when loudness is no larger than -25). Because there appears to be a relationship between popularity and loudness, we will plan to include the variable loudness in our initial regression model.
In examining the scatter plot titled “Track Popularity and Mode,” we do not seem to observe any interaction between the variables key and mode, and we also do not see a relationship between popularity and mode. However before making these conclusions, we will perform further analysis regarding these variables later in the report.
We next examine the scatter plot titled “Track Popularity and Speechiness.” In this plot, we notice that the smooth curves all appear relatively similar, indicating that there may not be much interaction between key and speechiness. We also notice that the smooth curves appear to be relatively flat. Although this would seem to suggest that there is no relationship between speechiness and popularity, we notice that tracks with incredibly high values of popularity tend to have low values of speechiness. In addition to seeing this relationship between popularity and speechiness in a scatter plot, we can also observe this relationship by comparing the below two frequency plots. The first plot shown directly below displays the frequency of track popularity among tracks with low speechiness (i.e., among tracks whose speechiness is less than 0.5). In this plot, we notice that tracks quite often tend to have the low popularity rating of zero. This speaks to the numerous challenges that artists face in becoming popular. However, we do notice that in the first plot displayed directly below, some tracks do indeed achieve a popularity rating of 100. You can view the code used to create the below frequency plot by clicking on the code button displayed directly below.
# Using the below code, we create a frequency plot for popularity among tracks with low speechiness (i.e., those tracks having a value for speechiness that is less than 0.5).
ggplot(data = spotify_songs_5[spotify_songs_5$speechiness < 0.5,], aes(x = track_popularity)) + geom_freqpoly() + labs(x = "Popularity", y = "Count") + ggtitle("Frequency of Track Popularity for Tracks with Low Speechiness", subtitle = "Examining Tracks with a Value for Speechiness That Is Less Than 0.5") + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
The next plot shown directly below displays the frequency of track popularity among tracks with high speechiness (i.e., among tracks whose speechiness is greater than or equal to 0.5). In this plot, we notice that as in the preceding plot, tracks quite often tend to have the low popularity rating of zero. This again indicates how challenging it is to actually become popular. However, unlike in the preceding plot, we notice that there are no tracks which achieve a popularity rating of 100 in the below plot. This difference between the two plots indicates a relationship between speechiness and popularity, and so we will incorporate the variable speechiness in our initial regression model. You can view the code used to create the below frequency plot by clicking on the code button displayed directly below.
# Using the below code, we create a frequency plot for popularity among tracks with high speechiness (i.e., those tracks having a value for speechiness that is at least 0.5).
ggplot(data = spotify_songs_5[spotify_songs_5$speechiness >= 0.5,], aes(x = track_popularity)) + geom_freqpoly() + labs(x = "Popularity", y = "Count") + ggtitle("Frequency of Track Popularity for Tracks with High Speechiness", subtitle = "Examining Tracks with a Value of Speechiness That Is Greater Than or Equal to 0.5") + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
We next examine the scatter plot displayed earlier in this section titled “Track Popularity and Acousticness.” Although the smooth curves in this graph appear fairly similar, we notice that the light blue smooth curve tends to rise higher than the other curves in the far right of the graph. This indicates that there may potentially be a small amount of interaction between acousticness and key, and we will further investigate this possible interaction later in our report. We also notice that both high and low values of popularity tend to occur throughout various values of acousticness. This leads us to believe that there does not seem to be a relationship between acousticness and popularity. However before arriving at this conclusion, we will perform further analysis of the variable acousticness later in our report. Indeed, we found earlier that acousticness is somewhat correlated with energy, and there is in turn a relationship between energy and popularity that can be observed in the plot titled “Track Popularity and Energy.” These facts suggest that although we do not notice a strong relationship in the plot titled “Track Popularity and Acousticness,” acousticness may still have somewhat of a relationship with popularity.
We next consider the plot displayed earlier in this section titled “Track Popularity and Instrumentalness.” The smooth curves in this graph appear to have ever so slightly different dips in the far left of the plot. This indicates that there may potentially be a slight interaction between instrumentalness and key, and we will further investigate this possible interaction later in our report. We also notice that when tracks are incredibly popular, they often have an instrumentalness rating of zero. This pattern can also be seen by comparing the below two histograms. The first histogram displayed directly below shows the frequency of popularity among tracks having a low value for instrumentalness (i.e., a value for instrumentalness that is less than 0.75). As the preceding frequency plots indicated, the below histogram also demonstrates that the popularity rating for tracks is often zero. However, by looking at the histogram displayed directly below, we notice that some tracks do indeed achieve a popularity greater than 80. You can view the code used to produce the below histogram by clicking on the code button displayed directly below.
# Using the below code, we create a histogram for popularity among tracks with low instrumentalness (i.e., those tracks having a value for instrumentalness that is less than 0.75).
ggplot(data = spotify_songs_5[spotify_songs_5$instrumentalness < 0.75,], aes(x = track_popularity)) + geom_histogram() + labs(x = "Popularity", y = "Count") + ggtitle("Frequency of Track Popularity for Tracks with Low Instrumentalness", subtitle = "Examining Tracks with a Value of Instrumentalness That Is Less Than 0.75") + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
The next histogram displayed directly below shows the frequency of popularity among tracks having a high value for instrumentalness (i.e., a value for instrumentalness that is at least 0.75). The below histogram also indicates that it is challenging to produce a track having a popularity rating greater than zero. Also, unlike in the preceding histogram, there appear to be no tracks in the below histogram which achieve a popularity greater than 80. This difference in histograms indicates a relationship between popularity and instrumentalness, and hence we will include the variable instrumentalness in our initial regression model. You can view the code used to produce the below histogram by clicking on the code button displayed directly below.
# Using the below code, we create a histogram for popularity among tracks with high instrumentalness (i.e., those tracks having a value for instrumentalness that is at least 0.75).
ggplot(data = spotify_songs_5[spotify_songs_5$instrumentalness >= 0.75,], aes(x = track_popularity)) + geom_histogram() + labs(x = "Popularity", y = "Count") + ggtitle("Frequency of Track Popularity for Tracks with High Instrumentalness", subtitle = "Examining Tracks with a Value of Instrumentalness That Is Greater Than or Equal to 0.75") + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
We next examine the plot displayed earlier in this section titled “Track Popularity and Liveness.” In this plot, we see that the smooth curves appear relatively similar to each other, indicating that we likely do not need to consider potential interaction between liveness and key. We also notice that when popularity is incredibly high, the value of liveness is often low. This indicates that extremely popular tracks are often not performed live. Because of this relationship between popularity and liveness, we will plan to include the variable liveness in our initial regression model. We can also observe this relationship in the below two boxplots. The first boxplot displayed directly below provides information about the popularity of tracks having a high value for liveness (i.e., those tracks having a value for liveness that is at least 0.8). In the boxplot directly below, we see that no track appears to achieve a popularity rating of 100. You can view the code used to create the below boxplot by clicking on the code button displayed directly below.
# Using the below code, we create a boxplot for popularity among tracks with high liveness (i.e., those tracks having a value for liveness that is at least 0.8).
ggplot(data = spotify_songs_5[spotify_songs_5$liveness >= 0.8,], aes(x = track_popularity)) + geom_boxplot() + labs(x = "Popularity") + ggtitle("Boxplot of Track Popularity for Tracks with High Liveness", subtitle = "Examining Tracks with a Value of Liveness That Is Greater Than or Equal to 0.8") + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5), axis.text.y = element_blank())
The next boxplot displayed directly below provides information about the popularity of tracks having a low value for liveness (i.e., those tracks having a value for liveness that is less than 0.8). Unlike in the preceding boxplot, we see that there do appear to be some tracks in the below boxplot that achieve a popularity rating of 100. This difference between the two boxplots indicates a relationship between popularity and liveness. You can view the code used to produce the below boxplot by clicking on the code button displayed directly below.
# Using the below code, we create a boxplot for popularity among tracks with low liveness (i.e., those tracks having a value for liveness that is less than 0.8).
ggplot(data = spotify_songs_5[spotify_songs_5$liveness < 0.8,], aes(x = track_popularity)) + geom_boxplot() + labs(x = "Popularity") + ggtitle("Boxplot of Track Popularity for Tracks with Low Liveness", subtitle = "Examining Tracks with a Value of Liveness That Is Less Than 0.8") + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5), axis.text.y = element_blank())
We next consider the plot displayed earlier in this section titled “Track Popularity and Valence.” In this plot, we see that the smooth curves appear relatively similar to each other, indicating that we likely do not need to consider potential interaction between valence and key. We also notice that incredibly popular tracks tend to appear neither in the extreme far left nor extreme far right of the plot. This indicates that there is a nonlinear relationship between popularity and valence, and hence we will plan to include the variable valence in our initial regression model.
Now, we focus on the scatter plot displayed earlier in this section titled “Track Popularity and Tempo.” In this plot, we see that each of the smooth curves is quite distinct. Especially in the far left of the plot, we see that some curves decrease while others increase. This indicates that there is interaction between the variable tempo and the variable key, and we will explore this interaction further later in our report. We also note that extremely high values of popularity tend to occur neither in the far right nor the far left of the plot. This indicates that there is a nonlinear relationship between tempo and popularity, and hence we will include the variable tempo in our initial regression model.
Next, we examine the scatter plot displayed earlier in this section titled “Track Popularity and Duration (in milliseconds).” In this plot, we also see that each of the smooth curves is distinct. Particularly in the far left of the plot, we see that the steepness of the curvature is different for each key. This indicates that there is interaction between duration and key, and we will consider this as we develop our finalized regression model. By looking at the plot, we also notice that many extremely popular songs tend to have a duration of around 2e+05 milliseconds. This indicates that there is a nonlinear relationship between duration and popularity, and so we will include duration as a covariate in our initial regression model.
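Visual differences between smooth curves can be followed up with a formal comparison of nested models. The sketch below uses simulated data in which the slope of a numeric covariate differs across the levels of a categorical variable, and it compares an additive model to a model with an interaction term using a partial F-test. The variable names are hypothetical, and this is one possible approach under these assumptions, not necessarily the procedure used later in this report.

```r
# Simulated stand-in data (an assumption): the slope of x differs by group,
# which is what an interaction between x and group means.
set.seed(7)
n <- 600
group <- factor(sample(c("0", "1", "2"), n, replace = TRUE))
x <- runif(n)
slope <- c("0" = 1, "1" = -1, "2" = 3)[as.character(group)]
y <- slope * x + rnorm(n, sd = 0.5)

# Model without interaction: one common slope for x.
additive_fit <- lm(y ~ x + group)
# Model with interaction: a separate slope for x in each group.
interaction_fit <- lm(y ~ x * group)

# Partial F-test comparing the two nested models.
comparison <- anova(additive_fit, interaction_fit)
interaction_p_value <- comparison[2, "Pr(>F)"]
print(comparison)
```

A small p-value in the second row suggests that allowing the slope to vary by group improves the fit beyond what chance would explain.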
In developing regression models, it is often important to check for interaction with categorical variables. In the preceding section, we closely examined potential interaction involving the categorical variable key. Here, we note that although key contains numerical values, it is actually a categorical variable because its values are codes that label distinct musical keys rather than measurements on a numeric scale. Similarly, mode is a categorical variable because its two values are labels rather than quantities. Given that mode is a categorical variable, we should check for interaction involving mode, and we devote this section to that pursuit. As in the preceding section, we do this by creating scatter plots. However rather than coloring the data based on key, we will instead group the data in colors representing the mode. The below output displays these scatter plots. You can view the code that was used to produce the below plots by clicking on the code button displayed directly below.
# Using the below code, we create a variable called j and give it the value 6.
j = 6
# Using the below code, we create a variable called i and give it the value 1.
i = 1
# Using the below code, we create a variable containing all of the names for our x-axes that will be used in our scatter plots.
x_axis <- c("Danceability", "Energy", "Key", "Loudness", "Mode", "Speechiness", "Acousticness", "Instrumentalness", "Liveness", "Valence", "Tempo", "Duration (in milliseconds)")
# Using the below code, we create scatter plots comparing the relationship between track popularity and various musical characteristics.
for(k in 6:17){
print(ggplot(data = spotify_songs_5, aes(x = spotify_songs_5[,k], y = track_popularity, color = as.character(mode))) + geom_point() + geom_smooth() + labs(x = x_axis[k - 5], y = "Popularity", color = "Mode") + ggtitle(paste0("Track Popularity and ", x_axis[k - 5]), subtitle = paste0("Is There Interaction with Mode?")) + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)))
}
We notice that in the above plot in this section titled “Track Popularity and Danceability,” the two smooth curves appear very similar. Additionally, similarity of smooth curves can also be seen in the above plots in this section titled “Track Popularity and Energy,” “Track Popularity and Loudness,” “Track Popularity and Acousticness,” “Track Popularity and Instrumentalness,” “Track Popularity and Liveness,” “Track Popularity and Valence,” and “Track Popularity and Duration (in milliseconds).” This indicates that there may not be much interaction between danceability and mode, between energy and mode, between loudness and mode, between acousticness and mode, between instrumentalness and mode, between liveness and mode, between valence and mode, and between duration and mode.
However in the above plot in this section titled “Track Popularity and Key,” we notice that the blue smooth curve appears very flat, whereas the red smooth curve appears to almost be a straight line with a slight negative slope. This difference suggests that there may be interaction between the variables mode and key, and we will explore this possible interaction further later in our report.
Similarly in the above plot in this section titled “Track Popularity and Speechiness,” we notice that the blue smooth curve appears very flat, whereas the red curve can be seen to decrease in the right-hand side of the plot. This difference indicates that there may be interaction between the variables mode and speechiness, and hence we will consider this interaction in finalizing our regression model.
In the above plot in this section titled “Track Popularity and Tempo,” we notice that the red curve appears to increase while the blue curve decreases in the far right-hand side of the plot. This difference in the smooth curves indicates potential interaction between the variables tempo and mode, and we will examine this possible interaction further later in our report.
The above plot in this section titled “Track Popularity and Mode” does not provide us any further information about interaction with mode. However it does display a possible relationship between mode and popularity in a less distracting manner than in the preceding section. In the preceding section, it was hard to observe any relationship between mode and popularity due to the many different colors representing various keys. However once we display a simpler scatter plot that contains only two colors, we readily observe that the tracks with a popularity rating of 100 all appear to have the mode 0. This suggests that there may be a relationship between mode and popularity, and hence we choose to include the variable mode in our initial regression model.
After using graphical displays and tables in the preceding two sections to uncover relationships and new information in the data, we begin this section by developing a multiple linear regression model to predict popularity using the covariates under consideration (i.e., danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, and duration_ms). We refer the reader to the first paragraph of Section 7 for a discussion relaying our reasons for considering only these covariates. Since the variables key and mode are categorical variables, we use character versions of these variables in the development of our regression model. The below output provides a summary of our linear regression model. By clicking on the code button displayed directly below, you can view the code that was used to produce this summary and linear regression model.
# Using the below code, we create a variable containing the values for key as characters.
key_char <- as.character(spotify_songs_5$key)
# Using the below code, we create a variable containing the values for mode as characters.
mode_char <- as.character(spotify_songs_5$mode)
# Using the below code, we create a data frame containing the variables key_char, mode_char, and all of the variables in spotify_songs_5.
spotify_songs_6 <- data.frame(spotify_songs_5, key_char, mode_char)
# Using the below code, we create a linear regression model to predict popularity.
linear_model <- lm(track_popularity ~ danceability + energy + key_char + loudness + mode_char + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, data = spotify_songs_6)
# Using the below code, we display a summary of the linear regression model.
summary(linear_model)
##
## Call:
## lm(formula = track_popularity ~ danceability + energy + key_char +
## loudness + mode_char + speechiness + acousticness + instrumentalness +
## liveness + valence + tempo + duration_ms, data = spotify_songs_6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.684 -17.115 2.921 18.085 60.184
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.843e+01 1.743e+00 39.255 < 2e-16 ***
## danceability 3.593e+00 1.077e+00 3.337 0.000847 ***
## energy -2.311e+01 1.220e+00 -18.939 < 2e-16 ***
## key_char1 -5.642e-01 5.782e-01 -0.976 0.329176
## key_char10 -4.028e-01 6.768e-01 -0.595 0.551738
## key_char11 -4.838e-01 6.277e-01 -0.771 0.440907
## key_char2 -1.029e+00 6.253e-01 -1.645 0.100011
## key_char3 -1.878e+00 9.218e-01 -2.038 0.041606 *
## key_char4 -5.089e-01 6.806e-01 -0.748 0.454618
## key_char5 -4.215e-01 6.438e-01 -0.655 0.512717
## key_char6 -6.468e-01 6.468e-01 -1.000 0.317287
## key_char7 -1.391e+00 5.995e-01 -2.320 0.020354 *
## key_char8 6.806e-01 6.591e-01 1.033 0.301772
## key_char9 -6.171e-01 6.177e-01 -0.999 0.317713
## loudness 1.149e+00 6.541e-02 17.561 < 2e-16 ***
## mode_char1 8.764e-01 2.907e-01 3.015 0.002576 **
## speechiness -6.277e+00 1.385e+00 -4.532 5.85e-06 ***
## acousticness 4.287e+00 7.475e-01 5.736 9.82e-09 ***
## instrumentalness -9.317e+00 6.256e-01 -14.894 < 2e-16 ***
## liveness -4.280e+00 9.002e-01 -4.755 1.99e-06 ***
## valence 1.755e+00 6.573e-01 2.671 0.007575 **
## tempo 2.575e-02 5.246e-03 4.909 9.19e-07 ***
## duration_ms -4.341e-05 2.297e-06 -18.901 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23 on 28321 degrees of freedom
## Multiple R-squared: 0.05861, Adjusted R-squared: 0.05788
## F-statistic: 80.15 on 22 and 28321 DF, p-value: < 2.2e-16
Notice that there is a variable called mode_char1 in the above output. This variable is equal to 1 whenever mode is equal to 1, and it is equal to zero otherwise. Because mode is a binary variable containing only the values 0 and 1, every entry of mode_char1 equals the corresponding entry of mode. Hence in the future regression models that we develop, we will simply use the variable mode instead of mode_char.
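The dummy coding that lm() applies to character covariates can be inspected directly with model.matrix(). The following sketch uses toy values (the vectors `mode_num` and `key_num` are hypothetical stand-ins for the columns of the data set) to show two things: the generated column mode_char1 coincides with the numeric variable, and the first level of a character covariate ("0" here) receives no column of its own because it serves as the baseline absorbed into the intercept.

```r
# Toy values standing in for mode and key (assumptions for illustration).
mode_num <- c(0, 1, 1, 0, 1, 0, 1)
key_num <- c(0, 1, 2, 3, 7, 10, 11)
mode_char <- as.character(mode_num)
key_char <- as.character(key_num)

# model.matrix() returns the design matrix that lm() would build.
design <- model.matrix(~ mode_char + key_char)
print(colnames(design))

# The generated dummy column matches the original numeric variable exactly.
dummy_matches_mode <- all(design[, "mode_char1"] == mode_num)
print(dummy_matches_mode)

# Character values are sorted as text, so "10" and "11" come before "2",
# and the first level ("0") is the baseline without its own column.
key_levels <- levels(factor(key_char))
print(key_levels)
```

This text-based ordering is also why key_char10 and key_char11 appear before key_char2 in the regression summary, and why each key coefficient is read as a difference from key 0.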
In the above output, we notice that the p-values corresponding to key_char1, key_char10, key_char11, key_char2, key_char4, key_char5, key_char6, key_char8, and key_char9 are all greater than 0.1. Because each of these coefficients compares one key against the baseline key 0, this indicates that popularity for these keys does not differ from the baseline in a statistically significant way, and so we will exclude these variables from our regression model. Because the coefficients of key_char3 and key_char7 are statistically significant at the 0.05 level, we will create new variables that record the values of key_char3 and key_char7. We will also create a variable that is 1 when key is equal to 0, and is zero otherwise. We add these three new variables to the data frame spotify_songs_6 (see the preceding code for the creation of the data frame spotify_songs_6). We do this using the below code.
# Using the below code, we add three indicator variables to the data frame spotify_songs_6. We set key_char_0 equal to 1 when the key is 0, and zero otherwise. We set key_char_3 equal to 1 when the key is 3, and zero otherwise. We set key_char_7 equal to 1 when the key is 7, and zero otherwise. The key values are read from spotify_songs_5, whose rows correspond to the rows of spotify_songs_6.
for (key_value in c(0, 3, 7)) {
  spotify_songs_6[[paste0("key_char_", key_value)]] <- as.numeric(spotify_songs_5$key == key_value)
}
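One consequence of keeping only the indicators key_char_0, key_char_3, and key_char_7 is that all remaining keys are pooled into the baseline of the model. The following sketch illustrates this on a small hypothetical data set (the data frame toy and its values are made up for illustration): with only these three indicators as regressors, the intercept equals the mean response over the pooled keys, and each coefficient is the difference between a key's mean and that pooled mean.

```r
# Hypothetical toy data: a key column and a made-up response y.
toy <- data.frame(
  key = c(0, 0, 3, 3, 7, 7, 1, 2, 5),
  y   = c(10, 12, 20, 22, 30, 34, 40, 44, 48)
)

# Indicator variables defined like key_char_0, key_char_3, and key_char_7.
toy$key_char_0 <- as.numeric(toy$key == 0)
toy$key_char_3 <- as.numeric(toy$key == 3)
toy$key_char_7 <- as.numeric(toy$key == 7)

fit <- lm(y ~ key_char_0 + key_char_3 + key_char_7, data = toy)

# Mean response over the keys that have no indicator (here 1, 2, and 5).
pooled_mean <- mean(toy$y[!(toy$key %in% c(0, 3, 7))])

# The intercept is the pooled mean; each coefficient is a key mean minus it.
intercept_is_pooled_mean <- isTRUE(all.equal(
  unname(coef(fit)["(Intercept)"]), pooled_mean
))
key0_is_difference <- isTRUE(all.equal(
  unname(coef(fit)["key_char_0"]), mean(toy$y[toy$key == 0]) - pooled_mean
))
print(c(intercept_is_pooled_mean, key0_is_difference))
```

In the same way, the coefficients of key_char_0, key_char_3, and key_char_7 in the Spotify model are comparisons against all other keys combined rather than against a single reference key.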
We create a new regression model that does not include the covariates that were found to be statistically insignificant in the preceding summary of our regression model, and we display a new summary of the revised regression model in the output below. By clicking on the code button displayed directly below, you can view the code that was used to produce this output and revised regression model.
# Using the below code, we create a revised linear regression model that does not contain the covariates key_char1, key_char2, key_char4, key_char5, key_char6, key_char8, key_char9, key_char10, and key_char11.
revised_linear_model <- lm(track_popularity ~ danceability + energy + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms + key_char_0 + key_char_3 + key_char_7, data = spotify_songs_6)
# Using the below code, we display a summary of the revised linear regression model.
summary(revised_linear_model)
##
## Call:
## lm(formula = track_popularity ~ danceability + energy + loudness +
## mode + speechiness + acousticness + instrumentalness + liveness +
## valence + tempo + duration_ms + key_char_0 + key_char_3 +
## key_char_7, data = spotify_songs_6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.730 -17.140 2.946 18.070 60.318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.804e+01 1.694e+00 40.159 < 2e-16 ***
## danceability 3.622e+00 1.074e+00 3.374 0.000742 ***
## energy -2.316e+01 1.220e+00 -18.988 < 2e-16 ***
## loudness 1.154e+00 6.535e-02 17.651 < 2e-16 ***
## mode 8.686e-01 2.808e-01 3.093 0.001982 **
## speechiness -6.226e+00 1.381e+00 -4.509 6.54e-06 ***
## acousticness 4.308e+00 7.468e-01 5.769 8.04e-09 ***
## instrumentalness -9.294e+00 6.254e-01 -14.861 < 2e-16 ***
## liveness -4.317e+00 8.994e-01 -4.799 1.60e-06 ***
## valence 1.753e+00 6.564e-01 2.671 0.007562 **
## tempo 2.575e-02 5.243e-03 4.912 9.08e-07 ***
## duration_ms -4.353e-05 2.296e-06 -18.963 < 2e-16 ***
## key_char_0 4.718e-01 4.527e-01 1.042 0.297382
## key_char_3 -1.411e+00 8.317e-01 -1.697 0.089749 .
## key_char_7 -9.184e-01 4.584e-01 -2.004 0.045122 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23 on 28329 degrees of freedom
## Multiple R-squared: 0.05838, Adjusted R-squared: 0.05791
## F-statistic: 125.5 on 14 and 28329 DF, p-value: < 2.2e-16
In the output displayed above, we see that the p-values corresponding to key_char_0 and key_char_3 (0.297 and 0.090, respectively) are each greater than 0.05. At the 0.05 significance level, this indicates that there is not a statistically significant linear relationship between these covariates and popularity. Hence we exclude these two variables from our regression model. We summarize our new revised regression model in the output below. By clicking on the code button displayed directly below, you can view the code that was used to create this output and revised regression model.
# Using the below code, we create a revised linear regression model that does not contain the covariates key_char_0 and key_char_3.
revised_linear_model <- lm(track_popularity ~ danceability + energy + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms + key_char_7, data = spotify_songs_6)
# Using the below code, we display a summary of the revised linear regression model.
summary(revised_linear_model)
##
## Call:
## lm(formula = track_popularity ~ danceability + energy + loudness +
## mode + speechiness + acousticness + instrumentalness + liveness +
## valence + tempo + duration_ms + key_char_7, data = spotify_songs_6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.752 -17.180 2.932 18.083 60.327
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.794e+01 1.692e+00 40.166 < 2e-16 ***
## danceability 3.689e+00 1.073e+00 3.439 0.000585 ***
## energy -2.316e+01 1.219e+00 -18.993 < 2e-16 ***
## loudness 1.151e+00 6.534e-02 17.620 < 2e-16 ***
## mode 9.256e-01 2.783e-01 3.326 0.000882 ***
## speechiness -6.252e+00 1.379e+00 -4.533 5.85e-06 ***
## acousticness 4.286e+00 7.465e-01 5.741 9.53e-09 ***
## instrumentalness -9.291e+00 6.254e-01 -14.857 < 2e-16 ***
## liveness -4.271e+00 8.992e-01 -4.750 2.05e-06 ***
## valence 1.761e+00 6.563e-01 2.683 0.007294 **
## tempo 2.578e-02 5.243e-03 4.917 8.85e-07 ***
## duration_ms -4.350e-05 2.295e-06 -18.950 < 2e-16 ***
## key_char_7 -9.411e-01 4.536e-01 -2.075 0.038039 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23 on 28331 degrees of freedom
## Multiple R-squared: 0.05824, Adjusted R-squared: 0.05784
## F-statistic: 146 on 12 and 28331 DF, p-value: < 2.2e-16
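The adjusted R-squared in the summary above can be reproduced from the multiple R-squared, the number of regressors, and the residual degrees of freedom. The sketch below applies the standard formula to the rounded values printed in the output; the variable names are chosen here purely for illustration.

```r
# Values read from the summary above (rounded as printed).
r_squared    <- 0.05824
n_regressors <- 12      # numerator degrees of freedom of the F-statistic
df_residual  <- 28331   # residual degrees of freedom

# Number of observations: residual df + regressors + 1 for the intercept.
n_obs <- df_residual + n_regressors + 1

# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1).
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / df_residual
print(round(adj_r_squared, 5))
```

The result agrees with the reported value of 0.05784 up to rounding. With more than 28,000 observations, the penalty for 12 regressors is tiny, which is why the two R-squared values are nearly identical.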
In the output displayed above, we see that the p-values corresponding to the regressors are all less than 0.05. This indicates that there is a statistically significant relationship between each regressor and popularity. However, the multiple R-squared and adjusted R-squared are both only about 0.06, so the model explains only about 6% of the variability in track popularity. Record labels may therefore have difficulty using this model to make highly accurate predictions of track popularity. This is not surprising. Indeed, we found earlier that each of the regressors is only weakly correlated with track_popularity. Further, by examining scatter plots, we also found earlier that many of the relationships appear to be nonlinear rather than linear.
As a consequence, we consider transforming our data, and we begin by examining transformations of the covariates used in the preceding regression model. To select transformations of our regressors, we utilize the Bayesian information criterion (BIC). The below graphs display the BIC for various transformations of the variables tempo and duration_ms. By clicking on the code button displayed directly below, you can view the code that was used to produce the below graphs.
# Using the below code, we create graphical displays of the BIC for various transformations of the variables tempo and duration_ms. We note that because these variables contain only positive numbers, we can consider a log transformation.
for (j in 16:17){
transformations = regsubsets(track_popularity ~ spotify_songs_5[,j] + I(spotify_songs_5[,j]^2) + I(spotify_songs_5[,j]^3) + I(spotify_songs_5[,j]^4) + I(spotify_songs_5[,j]^5) + I(spotify_songs_5[,j]^6) + I(spotify_songs_5[,j]^7) + I(spotify_songs_5[,j]^8) + I(spotify_songs_5[,j]^9) + I(spotify_songs_5[,j]^10) + I(log(spotify_songs_5[,j])), data = spotify_songs_5, nbest = 11)
plot(transformations, scale = "bic", main = "BIC for Transformations", labels = c("Intercept", names(spotify_songs_5)[j], paste0(names(spotify_songs_5)[j], "^2"), paste0(names(spotify_songs_5)[j], "^3"), paste0(names(spotify_songs_5)[j], "^4"), paste0(names(spotify_songs_5)[j], "^5"), paste0(names(spotify_songs_5)[j], "^6"), paste0(names(spotify_songs_5)[j], "^7"), paste0(names(spotify_songs_5)[j], "^8"), paste0(names(spotify_songs_5)[j], "^9"), paste0(names(spotify_songs_5)[j], "^10"), paste0("log(", names(spotify_songs_5)[j], ")")))
}
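The BIC comparison behind these graphs can also be sketched without regsubsets by fitting a few candidate transformations with lm and calling the base R function BIC on each fit. This is a simplified stand-in, not the procedure used above: the data are simulated with a known quadratic relationship, and the names sim and candidates are hypothetical.

```r
set.seed(1)

# Simulated data (hypothetical): y depends on x through a quadratic curve.
n <- 500
x <- runif(n, min = 1, max = 10)
y <- 2 + 0.5 * (x - 5)^2 + rnorm(n)
sim <- data.frame(x = x, y = y)

# Candidate transformations of the single regressor x.
candidates <- list(
  linear    = y ~ x,
  quadratic = y ~ x + I(x^2),
  logarithm = y ~ log(x)
)

# Fit each candidate and record its BIC; smaller values are preferred.
bic_values <- sapply(candidates, function(f) BIC(lm(f, data = sim)))
print(sort(bic_values))
```

Because the simulated data come from a quadratic curve, the quadratic candidate attains a smaller BIC than the linear and logarithmic candidates. Unlike this sketch, regsubsets searches over subsets of all the candidate terms at once.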
The below graphs display the Bayesian information criterion for various transformations of the variables danceability, energy, loudness, mode, speechiness, acousticness, instrumentalness, liveness, and valence. Since these variables may take negative and/or zero values according to the data dictionary, we do not consider log transformations for them. As the variable key_char_7 is binary (taking only the values 0 and 1), we also do not consider transformations of this variable. Indeed, any power of key_char_7 would simply equal the original variable key_char_7. By clicking on the code button displayed directly below, you can view the code that was used to produce the below graphs.
# Using the below code, we create graphical displays of the BIC for various transformations of other potential regressors. Based on the variable dictionary, each of these regressors may possibly contain a non-positive value, and so we do not consider a log transformation for these variables.
for (j in c(6:7, 9:15)){
transformations = regsubsets(track_popularity ~ spotify_songs_5[,j] + I(spotify_songs_5[,j]^2) + I(spotify_songs_5[,j]^3) + I(spotify_songs_5[,j]^4) + I(spotify_songs_5[,j]^5) + I(spotify_songs_5[,j]^6) + I(spotify_songs_5[,j]^7) + I(spotify_songs_5[,j]^8) + I(spotify_songs_5[,j]^9) + I(spotify_songs_5[,j]^10), data = spotify_songs_5, nbest = 10)
plot(transformations, scale = "bic", main = "BIC for Transformations", labels = c("Intercept", names(spotify_songs_5)[j], paste0(names(spotify_songs_5)[j], "^2"), paste0(names(spotify_songs_5)[j], "^3"), paste0(names(spotify_songs_5)[j], "^4"), paste0(names(spotify_songs_5)[j], "^5"), paste0(names(spotify_songs_5)[j], "^6"), paste0(names(spotify_songs_5)[j], "^7"), paste0(names(spotify_songs_5)[j], "^8"), paste0(names(spotify_songs_5)[j], "^9"), paste0(names(spotify_songs_5)[j], "^10")))
}
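Two of the remarks above can be checked directly in R: raising a 0/1 variable to any positive integer power returns the same variable, and a log transformation breaks down for zero or negative values. The vectors below are hypothetical examples, not columns of the data set.

```r
# A hypothetical binary indicator, in the style of key_char_7.
binary <- c(0, 1, 1, 0, 1)

# Every power from 2 through 10 of a 0/1 variable equals the variable itself.
powers_unchanged <- all(sapply(2:10, function(p) all(binary^p == binary)))

# Hypothetical values that include a negative number and zero.
values <- c(-5.2, 0, 0.3)

# log() gives NaN (with a warning) for the negative value and -Inf for zero.
logged <- suppressWarnings(log(values))

print(powers_unchanged)
print(logged)
```

A regression cannot use NaN or infinite regressor values, which is why the log transformation is restricted to strictly positive variables such as tempo and duration_ms.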
In the first BIC graph displayed above, we see the BIC for various possible transformations of tempo. For variable selection, it is generally recommended to choose the model with the lowest BIC. In the graph, we see that the model with the lowest BIC involves tempo^5, tempo^6, tempo^7, tempo^8, tempo^9, and tempo^10. Consequently, we include these transformations of tempo in our regression model. Similarly, in the second BIC graph displayed above, we see that the model with the lowest BIC involves duration_ms, duration_ms^4, and log(duration_ms). As a result, we include these transformations of duration_ms in our regression model. By looking at the other BIC graphs displayed above, we choose transformations of the other regressors in a similar manner. In particular, by selecting the transformations that result in the lowest BIC shown in the above graphs, we also include danceability^7, energy^7, loudness, loudness^2, loudness^3, loudness^4, loudness^5, mode, speechiness, acousticness, acousticness^2, acousticness^3, acousticness^4, acousticness^5, acousticness^6, instrumentalness, instrumentalness^2, instrumentalness^3, instrumentalness^4, instrumentalness^5, instrumentalness^6, instrumentalness^7, instrumentalness^8, liveness, liveness^2, liveness^3, liveness^4, liveness^5, liveness^6, valence, and valence^2 in our regression model. We create a regression model incorporating all of the transformations mentioned in this paragraph, and below, we display a summary of this regression model. By clicking on the code button displayed directly below, you can view the code that was used to create this regression model and summary.
# Using the below code, we create a regression model using transformations of the covariates that we selected based on BIC.
model_transformed_regressors <- lm(track_popularity ~ I(danceability^7) + I(energy^7) + loudness + I(loudness^2) + I(loudness^3) + I(loudness^4) + I(loudness^5) + mode + speechiness + acousticness + I(acousticness^2) + I(acousticness^3) + I(acousticness^4) + I(acousticness^5) + I(acousticness^6) + instrumentalness + I(instrumentalness^2) + I(instrumentalness^3) + I(instrumentalness^4) + I(instrumentalness^5) + I(instrumentalness^6) + I(instrumentalness^7) + I(instrumentalness^8) + liveness + I(liveness^2) + I(liveness^3) + I(liveness^4) + I(liveness^5) + I(liveness^6) + valence + I(valence^2) + I(tempo^5) + I(tempo^6) + I(tempo^7) + I(tempo^8) + I(tempo^9) + I(tempo^10) + duration_ms + I(duration_ms^4) + I(log(duration_ms)) + key_char_7, data = spotify_songs_6)
# Using the below code, we display a summary of our new regression model.
summary(model_transformed_regressors)
##
## Call:
## lm(formula = track_popularity ~ I(danceability^7) + I(energy^7) +
## loudness + I(loudness^2) + I(loudness^3) + I(loudness^4) +
## I(loudness^5) + mode + speechiness + acousticness + I(acousticness^2) +
## I(acousticness^3) + I(acousticness^4) + I(acousticness^5) +
## I(acousticness^6) + instrumentalness + I(instrumentalness^2) +
## I(instrumentalness^3) + I(instrumentalness^4) + I(instrumentalness^5) +
## I(instrumentalness^6) + I(instrumentalness^7) + I(instrumentalness^8) +
## liveness + I(liveness^2) + I(liveness^3) + I(liveness^4) +
## I(liveness^5) + I(liveness^6) + valence + I(valence^2) +
## I(tempo^5) + I(tempo^6) + I(tempo^7) + I(tempo^8) + I(tempo^9) +
## I(tempo^10) + duration_ms + I(duration_ms^4) + I(log(duration_ms)) +
## key_char_7, data = spotify_songs_6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.128 -17.088 2.839 17.993 61.543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.751e+02 4.199e+01 -6.551 5.81e-11 ***
## I(danceability^7) 1.028e+01 1.249e+00 8.225 < 2e-16 ***
## I(energy^7) -1.216e+01 8.605e-01 -14.129 < 2e-16 ***
## loudness -1.748e+00 8.463e-01 -2.065 0.038891 *
## I(loudness^2) -5.307e-01 1.426e-01 -3.721 0.000199 ***
## I(loudness^3) -4.075e-02 1.030e-02 -3.955 7.67e-05 ***
## I(loudness^4) -1.168e-03 3.163e-04 -3.693 0.000222 ***
## I(loudness^5) -1.125e-05 3.348e-06 -3.360 0.000782 ***
## mode 9.563e-01 2.769e-01 3.453 0.000555 ***
## speechiness -1.119e+01 1.469e+00 -7.617 2.68e-14 ***
## acousticness 3.849e+01 1.520e+01 2.532 0.011354 *
## I(acousticness^2) -2.355e+02 1.793e+02 -1.313 0.189054
## I(acousticness^3) 8.956e+02 8.306e+02 1.078 0.280928
## I(acousticness^4) -1.816e+03 1.770e+03 -1.026 0.304936
## I(acousticness^5) 1.770e+03 1.744e+03 1.015 0.310241
## I(acousticness^6) -6.445e+02 6.435e+02 -1.002 0.316550
## instrumentalness -2.222e+02 3.832e+01 -5.798 6.78e-09 ***
## I(instrumentalness^2) 3.996e+03 8.798e+02 4.542 5.60e-06 ***
## I(instrumentalness^3) -3.113e+04 7.559e+03 -4.118 3.84e-05 ***
## I(instrumentalness^4) 1.224e+05 3.168e+04 3.864 0.000112 ***
## I(instrumentalness^5) -2.638e+05 7.181e+04 -3.673 0.000240 ***
## I(instrumentalness^6) 3.162e+05 8.999e+04 3.514 0.000443 ***
## I(instrumentalness^7) -1.979e+05 5.865e+04 -3.375 0.000739 ***
## I(instrumentalness^8) 5.047e+04 1.552e+04 3.252 0.001147 **
## liveness 1.304e+02 4.112e+01 3.171 0.001521 **
## I(liveness^2) -1.047e+03 3.662e+02 -2.858 0.004261 **
## I(liveness^3) 3.753e+03 1.471e+03 2.551 0.010736 *
## I(liveness^4) -6.804e+03 2.886e+03 -2.358 0.018389 *
## I(liveness^5) 6.044e+03 2.696e+03 2.242 0.024991 *
## I(liveness^6) -2.083e+03 9.588e+02 -2.173 0.029820 *
## valence 5.750e+00 2.562e+00 2.245 0.024803 *
## I(valence^2) -6.899e+00 2.427e+00 -2.843 0.004477 **
## I(tempo^5) 5.303e-08 1.603e-08 3.307 0.000943 ***
## I(tempo^6) -1.537e-09 4.705e-10 -3.266 0.001093 **
## I(tempo^7) 1.778e-11 5.580e-12 3.187 0.001437 **
## I(tempo^8) -1.023e-13 3.319e-14 -3.081 0.002063 **
## I(tempo^9) 2.917e-16 9.867e-17 2.956 0.003116 **
## I(tempo^10) -3.299e-19 1.170e-19 -2.820 0.004805 **
## duration_ms -2.018e-04 2.098e-05 -9.617 < 2e-16 ***
## I(duration_ms^4) 4.607e-22 6.641e-23 6.937 4.09e-12 ***
## I(log(duration_ms)) 2.826e+01 3.771e+00 7.494 6.86e-14 ***
## key_char_7 -1.062e+00 4.515e-01 -2.352 0.018685 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.87 on 28302 degrees of freedom
## Multiple R-squared: 0.06959, Adjusted R-squared: 0.06824
## F-statistic: 51.63 on 41 and 28302 DF, p-value: < 2.2e-16
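Individual p-values for powers of the same variable, such as the acousticness terms in the summary above, can be large simply because the powers are highly correlated with one another. One complementary check, which is not part of the analysis in this report, is a nested-model F-test that compares the fit with and without the whole group of terms. The sketch below uses simulated data; the names sim, reduced, and full are hypothetical.

```r
set.seed(2)

# Simulated data (hypothetical): y depends on x only linearly.
n <- 300
sim <- data.frame(x = runif(n))
sim$y <- 1 + 2 * sim$x + rnorm(n)

# Reduced model: linear term only. Full model: adds powers 2 through 4.
reduced <- lm(y ~ x, data = sim)
full    <- lm(y ~ x + I(x^2) + I(x^3) + I(x^4), data = sim)

# anova() on two nested lm fits reports one F-test for the added group of terms.
comparison <- anova(reduced, full)
print(comparison)

# p-value of the joint test for the three added powers.
joint_p_value <- comparison[2, "Pr(>F)"]
```

A large joint p-value supports dropping the group as a whole, while a small one suggests that at least some of the terms should stay in the model.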
In the preceding output, we see that the p-values corresponding to acousticness^2, acousticness^3, acousticness^4, acousticness^5, and acousticness^6 are each greater than 0.1. This indicates that there is not a statistically significant relationship between popularity and each of these terms. Consequently, we exclude these terms from our regression model. Using the below code, we create and summarize a revised regression model that does not contain these covariates. The output shown below is a summary of this revised regression model.
# Using the below code, we create a regression model using transformations of the covariates that we selected based on BIC, while excluding those terms that were found to be statistically insignificant.
model_transformed_regressors <- lm(track_popularity ~ I(danceability^7) + I(energy^7) + loudness + I(loudness^2) + I(loudness^3) + I(loudness^4) + I(loudness^5) + mode + speechiness + acousticness + instrumentalness + I(instrumentalness^2) + I(instrumentalness^3) + I(instrumentalness^4) + I(instrumentalness^5) + I(instrumentalness^6) + I(instrumentalness^7) + I(instrumentalness^8) + liveness + I(liveness^2) + I(liveness^3) + I(liveness^4) + I(liveness^5) + I(liveness^6) + valence + I(valence^2) + I(tempo^5) + I(tempo^6) + I(tempo^7) + I(tempo^8) + I(tempo^9) + I(tempo^10) + duration_ms + I(duration_ms^4) + I(log(duration_ms)) + key_char_7, data = spotify_songs_6)
# Using the below code, we display a summary of our new regression model.
summary(model_transformed_regressors)
##
## Call:
## lm(formula = track_popularity ~ I(danceability^7) + I(energy^7) +
## loudness + I(loudness^2) + I(loudness^3) + I(loudness^4) +
## I(loudness^5) + mode + speechiness + acousticness + instrumentalness +
## I(instrumentalness^2) + I(instrumentalness^3) + I(instrumentalness^4) +
## I(instrumentalness^5) + I(instrumentalness^6) + I(instrumentalness^7) +
## I(instrumentalness^8) + liveness + I(liveness^2) + I(liveness^3) +
## I(liveness^4) + I(liveness^5) + I(liveness^6) + valence +
## I(valence^2) + I(tempo^5) + I(tempo^6) + I(tempo^7) + I(tempo^8) +
## I(tempo^9) + I(tempo^10) + duration_ms + I(duration_ms^4) +
## I(log(duration_ms)) + key_char_7, data = spotify_songs_6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.152 -17.035 2.828 18.013 62.616
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.768e+02 4.186e+01 -6.612 3.85e-11 ***
## I(danceability^7) 1.051e+01 1.248e+00 8.423 < 2e-16 ***
## I(energy^7) -1.278e+01 8.433e-01 -15.156 < 2e-16 ***
## loudness -1.682e+00 8.438e-01 -1.994 0.046212 *
## I(loudness^2) -5.252e-01 1.420e-01 -3.699 0.000217 ***
## I(loudness^3) -4.053e-02 1.025e-02 -3.956 7.64e-05 ***
## I(loudness^4) -1.162e-03 3.147e-04 -3.694 0.000221 ***
## I(loudness^5) -1.118e-05 3.333e-06 -3.356 0.000793 ***
## mode 9.446e-01 2.769e-01 3.411 0.000647 ***
## speechiness -1.082e+01 1.465e+00 -7.383 1.59e-13 ***
## acousticness 6.851e+00 7.029e-01 9.746 < 2e-16 ***
## instrumentalness -2.340e+02 3.820e+01 -6.125 9.21e-10 ***
## I(instrumentalness^2) 4.193e+03 8.785e+02 4.773 1.82e-06 ***
## I(instrumentalness^3) -3.255e+04 7.552e+03 -4.310 1.64e-05 ***
## I(instrumentalness^4) 1.277e+05 3.165e+04 4.035 5.48e-05 ***
## I(instrumentalness^5) -2.748e+05 7.177e+04 -3.830 0.000129 ***
## I(instrumentalness^6) 3.293e+05 8.995e+04 3.661 0.000252 ***
## I(instrumentalness^7) -2.061e+05 5.863e+04 -3.515 0.000440 ***
## I(instrumentalness^8) 5.254e+04 1.551e+04 3.387 0.000707 ***
## liveness 1.364e+02 4.110e+01 3.318 0.000906 ***
## I(liveness^2) -1.090e+03 3.661e+02 -2.977 0.002915 **
## I(liveness^3) 3.894e+03 1.471e+03 2.647 0.008121 **
## I(liveness^4) -7.036e+03 2.886e+03 -2.438 0.014764 *
## I(liveness^5) 6.233e+03 2.697e+03 2.311 0.020814 *
## I(liveness^6) -2.143e+03 9.589e+02 -2.235 0.025441 *
## valence 5.750e+00 2.561e+00 2.245 0.024788 *
## I(valence^2) -6.576e+00 2.426e+00 -2.710 0.006731 **
## I(tempo^5) 5.459e-08 1.603e-08 3.406 0.000660 ***
## I(tempo^6) -1.581e-09 4.704e-10 -3.360 0.000779 ***
## I(tempo^7) 1.828e-11 5.578e-12 3.277 0.001049 **
## I(tempo^8) -1.051e-13 3.319e-14 -3.166 0.001547 **
## I(tempo^9) 2.995e-16 9.866e-17 3.036 0.002401 **
## I(tempo^10) -3.385e-19 1.170e-19 -2.894 0.003801 **
## duration_ms -2.032e-04 2.092e-05 -9.713 < 2e-16 ***
## I(duration_ms^4) 4.637e-22 6.631e-23 6.992 2.76e-12 ***
## I(log(duration_ms)) 2.847e+01 3.758e+00 7.577 3.64e-14 ***
## key_char_7 -1.062e+00 4.516e-01 -2.351 0.018726 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.88 on 28307 degrees of freedom
## Multiple R-squared: 0.06889, Adjusted R-squared: 0.06771
## F-statistic: 58.18 on 36 and 28307 DF, p-value: < 2.2e-16
In the output displayed above, we see that the p-values corresponding to the regressors are all less than 0.05. This indicates that there is a statistically significant relationship between each regressor and popularity. However, the multiple R-squared and adjusted R-squared have both improved only to about 0.07. This indicates that record labels may still have difficulty using this model to make highly accurate predictions of track popularity. As a consequence, we consider incorporating interactions with the categorical variable key. We remind the reader that, based on scatter plots drawn in Section 7, we suspect that there may be interactions between loudness and key, between acousticness and key, between instrumentalness and key, between tempo and key, and between duration and key. We therefore add these possible interactions to our regression model, and we display a summary of the revised model in the output below. By clicking on the code button displayed directly below, you can view the code that is used to create and summarize our revised regression model.
# Using the below code, we create a regression model that includes potential interaction with the categorical variable key.
model_interaction <- lm(track_popularity ~ I(danceability^7) + I(energy^7) + loudness + I(loudness^2) + I(loudness^3) + I(loudness^4) + I(loudness^5) + mode + speechiness + acousticness + instrumentalness + I(instrumentalness^2) + I(instrumentalness^3) + I(instrumentalness^4) + I(instrumentalness^5) + I(instrumentalness^6) + I(instrumentalness^7) + I(instrumentalness^8) + liveness + I(liveness^2) + I(liveness^3) + I(liveness^4) + I(liveness^5) + I(liveness^6) + valence + I(valence^2) + I(tempo^5) + I(tempo^6) + I(tempo^7) + I(tempo^8) + I(tempo^9) + I(tempo^10) + duration_ms + I(duration_ms^4) + I(log(duration_ms)) + key_char_7 + loudness:key_char + acousticness:key_char + instrumentalness:key_char + tempo:key_char + duration_ms:key_char, data = spotify_songs_6)
# Using the below code, we display a summary of our new regression model.
summary(model_interaction)
##
## Call:
## lm(formula = track_popularity ~ I(danceability^7) + I(energy^7) +
## loudness + I(loudness^2) + I(loudness^3) + I(loudness^4) +
## I(loudness^5) + mode + speechiness + acousticness + instrumentalness +
## I(instrumentalness^2) + I(instrumentalness^3) + I(instrumentalness^4) +
## I(instrumentalness^5) + I(instrumentalness^6) + I(instrumentalness^7) +
## I(instrumentalness^8) + liveness + I(liveness^2) + I(liveness^3) +
## I(liveness^4) + I(liveness^5) + I(liveness^6) + valence +
## I(valence^2) + I(tempo^5) + I(tempo^6) + I(tempo^7) + I(tempo^8) +
## I(tempo^9) + I(tempo^10) + duration_ms + I(duration_ms^4) +
## I(log(duration_ms)) + key_char_7 + loudness:key_char + acousticness:key_char +
## instrumentalness:key_char + tempo:key_char + duration_ms:key_char,
## data = spotify_songs_6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.71 -17.05 2.91 17.94 60.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.430e+02 5.020e+01 -4.840 1.30e-06 ***
## I(danceability^7) 1.047e+01 1.252e+00 8.361 < 2e-16 ***
## I(energy^7) -1.267e+01 8.452e-01 -14.989 < 2e-16 ***
## loudness -1.545e+00 8.544e-01 -1.808 0.070618 .
## I(loudness^2) -4.981e-01 1.426e-01 -3.493 0.000479 ***
## I(loudness^3) -3.855e-02 1.031e-02 -3.739 0.000185 ***
## I(loudness^4) -1.105e-03 3.172e-04 -3.483 0.000496 ***
## I(loudness^5) -1.061e-05 3.365e-06 -3.152 0.001623 **
## mode 8.473e-01 2.903e-01 2.919 0.003518 **
## speechiness -1.073e+01 1.474e+00 -7.279 3.45e-13 ***
## acousticness 6.558e+00 1.954e+00 3.356 0.000793 ***
## instrumentalness -2.330e+02 3.823e+01 -6.093 1.12e-09 ***
## I(instrumentalness^2) 4.156e+03 8.787e+02 4.730 2.26e-06 ***
## I(instrumentalness^3) -3.210e+04 7.555e+03 -4.249 2.16e-05 ***
## I(instrumentalness^4) 1.255e+05 3.167e+04 3.961 7.47e-05 ***
## I(instrumentalness^5) -2.692e+05 7.182e+04 -3.749 0.000178 ***
## I(instrumentalness^6) 3.218e+05 9.003e+04 3.574 0.000351 ***
## I(instrumentalness^7) -2.010e+05 5.869e+04 -3.426 0.000614 ***
## I(instrumentalness^8) 5.118e+04 1.553e+04 3.296 0.000983 ***
## liveness 1.378e+02 4.113e+01 3.350 0.000810 ***
## I(liveness^2) -1.106e+03 3.663e+02 -3.019 0.002543 **
## I(liveness^3) 3.966e+03 1.472e+03 2.695 0.007034 **
## I(liveness^4) -7.191e+03 2.887e+03 -2.491 0.012754 *
## I(liveness^5) 6.384e+03 2.698e+03 2.367 0.017961 *
## I(liveness^6) -2.199e+03 9.593e+02 -2.292 0.021919 *
## valence 5.422e+00 2.566e+00 2.113 0.034608 *
## I(valence^2) -6.312e+00 2.429e+00 -2.599 0.009364 **
## I(tempo^5) 1.192e-07 5.527e-08 2.157 0.031026 *
## I(tempo^6) -3.174e-09 1.369e-09 -2.318 0.020442 *
## I(tempo^7) 3.476e-11 1.425e-11 2.439 0.014730 *
## I(tempo^8) -1.927e-13 7.639e-14 -2.522 0.011666 *
## I(tempo^9) 5.364e-16 2.085e-16 2.573 0.010101 *
## I(tempo^10) -5.974e-19 2.302e-19 -2.595 0.009454 **
## duration_ms -1.832e-04 2.171e-05 -8.439 < 2e-16 ***
## I(duration_ms^4) 4.580e-22 6.658e-23 6.879 6.15e-12 ***
## I(log(duration_ms)) 2.834e+01 3.771e+00 7.513 5.95e-14 ***
## key_char_7 -7.190e+00 2.879e+00 -2.498 0.012498 *
## loudness:key_char1 4.022e-01 1.957e-01 2.055 0.039894 *
## loudness:key_char10 1.259e-01 2.307e-01 0.546 0.585184
## loudness:key_char11 1.393e-01 2.133e-01 0.653 0.513697
## loudness:key_char2 1.907e-01 2.175e-01 0.877 0.380722
## loudness:key_char3 -2.891e-01 3.089e-01 -0.936 0.349305
## loudness:key_char4 -4.065e-02 2.320e-01 -0.175 0.860901
## loudness:key_char5 -3.628e-01 2.218e-01 -1.636 0.101929
## loudness:key_char6 -3.078e-02 2.201e-01 -0.140 0.888798
## loudness:key_char7 -1.141e-01 2.097e-01 -0.544 0.586292
## loudness:key_char8 -1.848e-01 2.341e-01 -0.789 0.429897
## loudness:key_char9 -1.359e-01 2.125e-01 -0.640 0.522474
## acousticness:key_char1 4.259e+00 2.805e+00 1.518 0.128980
## acousticness:key_char10 -2.636e+00 3.094e+00 -0.852 0.394125
## acousticness:key_char11 2.207e+00 2.998e+00 0.736 0.461493
## acousticness:key_char2 -2.835e+00 2.965e+00 -0.956 0.339072
## acousticness:key_char3 -1.526e+00 4.162e+00 -0.367 0.713871
## acousticness:key_char4 1.876e+00 3.193e+00 0.588 0.556773
## acousticness:key_char5 -2.701e+00 3.033e+00 -0.890 0.373264
## acousticness:key_char6 4.379e+00 3.166e+00 1.383 0.166710
## acousticness:key_char7 1.382e+00 2.904e+00 0.476 0.634220
## acousticness:key_char8 -2.168e+00 3.122e+00 -0.695 0.487337
## acousticness:key_char9 -1.344e+00 2.841e+00 -0.473 0.636058
## instrumentalness:key_char1 -2.554e+00 2.528e+00 -1.010 0.312467
## instrumentalness:key_char10 4.172e+00 2.855e+00 1.461 0.143923
## instrumentalness:key_char11 3.080e+00 2.723e+00 1.131 0.258008
## instrumentalness:key_char2 -1.269e+00 2.727e+00 -0.465 0.641614
## instrumentalness:key_char3 -6.065e+00 4.184e+00 -1.450 0.147191
## instrumentalness:key_char4 -3.819e+00 2.962e+00 -1.289 0.197296
## instrumentalness:key_char5 -2.033e+00 2.726e+00 -0.746 0.455840
## instrumentalness:key_char6 2.233e+00 2.858e+00 0.781 0.434624
## instrumentalness:key_char7 -2.646e+00 2.522e+00 -1.049 0.293999
## instrumentalness:key_char8 -1.015e+00 2.877e+00 -0.353 0.724189
## instrumentalness:key_char9 -6.453e-01 2.694e+00 -0.240 0.810682
## key_char0:tempo -6.986e-01 5.809e-01 -1.203 0.229077
## key_char1:tempo -6.453e-01 5.808e-01 -1.111 0.266576
## key_char10:tempo -6.370e-01 5.809e-01 -1.097 0.272768
## key_char11:tempo -6.470e-01 5.808e-01 -1.114 0.265361
## key_char2:tempo -6.663e-01 5.808e-01 -1.147 0.251271
## key_char3:tempo -6.861e-01 5.810e-01 -1.181 0.237647
## key_char4:tempo -6.786e-01 5.811e-01 -1.168 0.242859
## key_char5:tempo -6.615e-01 5.808e-01 -1.139 0.254724
## key_char6:tempo -7.002e-01 5.810e-01 -1.205 0.228176
## key_char7:tempo -6.347e-01 5.806e-01 -1.093 0.274359
## key_char8:tempo -6.545e-01 5.810e-01 -1.126 0.259984
## key_char9:tempo -6.723e-01 5.807e-01 -1.158 0.246943
## duration_ms:key_char1 -2.243e-05 8.070e-06 -2.779 0.005454 **
## duration_ms:key_char10 -3.097e-05 8.997e-06 -3.442 0.000578 ***
## duration_ms:key_char11 -2.968e-05 8.662e-06 -3.427 0.000612 ***
## duration_ms:key_char2 -1.391e-05 8.646e-06 -1.608 0.107758
## duration_ms:key_char3 -1.953e-05 1.338e-05 -1.460 0.144377
## duration_ms:key_char4 -1.493e-05 9.445e-06 -1.580 0.114046
## duration_ms:key_char5 -3.034e-05 8.744e-06 -3.470 0.000521 ***
## duration_ms:key_char6 -8.112e-06 8.906e-06 -0.911 0.362350
## duration_ms:key_char7 -1.321e-05 9.339e-06 -1.415 0.157219
## duration_ms:key_char8 -2.480e-05 9.333e-06 -2.657 0.007888 **
## duration_ms:key_char9 -2.043e-05 8.564e-06 -2.385 0.017069 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.87 on 28251 degrees of freedom
## Multiple R-squared: 0.07203, Adjusted R-squared: 0.06901
## F-statistic: 23.84 on 92 and 28251 DF, p-value: < 2.2e-16
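The summary above lists tempo interactions for all twelve keys (key_char0 through key_char11), whereas the other interactions omit key 0. A likely explanation is R's coding rule for factors in interactions: the formula contains tempo only inside I() terms, so there is no plain tempo main effect, and R then gives every level of the factor its own slope. The sketch below shows this behavior with model.matrix on hypothetical toy data (the names toy, x, and f are made up for illustration).

```r
# Hypothetical toy data: a numeric variable x and a three-level factor f.
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  f = factor(c("a", "b", "c", "a", "b", "c"))
)

# With the main effect x present, x:f gets a column for every level
# except the reference level.
with_main <- colnames(model.matrix(~ x + x:f, data = toy))

# Without a plain x term (only a power of x), x:f gets a column for
# every level of f, including the reference level.
without_main <- colnames(model.matrix(~ I(x^2) + x:f, data = toy))

print(with_main)
print(without_main)
```

Under this reading, the twelve tempo coefficients are key-specific tempo slopes rather than departures from a common tempo slope, so their p-values answer a different question than those of the loudness, acousticness, instrumentalness, and duration_ms interactions.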
In the above output, the p-values indicate that most of our interaction terms do not have a statistically significant linear relationship with popularity. However, the p-values corresponding to loudness:key_char1, duration_ms:key_char1, duration_ms:key_char10, duration_ms:key_char11, duration_ms:key_char5, duration_ms:key_char8, and duration_ms:key_char9 are each less than 0.05. This indicates that the influence of these interaction terms is statistically significant. Hence, using the below code, we create and summarize a new regression model containing these statistically significant interaction terms but not the insignificant ones. To do this, we also create new indicator variables in spotify_songs_6 that are equivalent to key_char1, key_char10, key_char11, key_char5, key_char8, and key_char9 (see the below code). The below output displays a summary of our revised regression model.
# Using the below code, we add six indicator variables to the data frame spotify_songs_6, in the order key_char_1, key_char_10, key_char_11, key_char_5, key_char_8, and key_char_9. We set key_char_1 equal to 1 when the key is 1, and zero otherwise, and we define the other five variables in the same way for keys 10, 11, 5, 8, and 9. The key values are read from spotify_songs_5, whose rows correspond to the rows of spotify_songs_6.
for (key_value in c(1, 10, 11, 5, 8, 9)) {
  spotify_songs_6[[paste0("key_char_", key_value)]] <- as.numeric(spotify_songs_5$key == key_value)
}
# Using the below code, we create a regression model that includes the statistically significant interaction terms, but not the insignificant ones.
model_interaction_2 <- lm(track_popularity ~ I(danceability^7) + I(energy^7) + loudness + I(loudness^2) + I(loudness^3) + I(loudness^4) + I(loudness^5) + mode + speechiness + acousticness + instrumentalness + I(instrumentalness^2) + I(instrumentalness^3) + I(instrumentalness^4) + I(instrumentalness^5) + I(instrumentalness^6) + I(instrumentalness^7) + I(instrumentalness^8) + liveness + I(liveness^2) + I(liveness^3) + I(liveness^4) + I(liveness^5) + I(liveness^6) + valence + I(valence^2) + I(tempo^5) + I(tempo^6) + I(tempo^7) + I(tempo^8) + I(tempo^9) + I(tempo^10) + duration_ms + I(duration_ms^4) + I(log(duration_ms)) + key_char_7 + loudness:key_char_1 + duration_ms:key_char_1 + duration_ms:key_char_10 + duration_ms:key_char_11 + duration_ms:key_char_5 + duration_ms:key_char_8 + duration_ms:key_char_9, data = spotify_songs_6)
# Using the below code, we display a summary of our new regression model.
summary(model_interaction_2)
##
## Call:
## lm(formula = track_popularity ~ I(danceability^7) + I(energy^7) +
## loudness + I(loudness^2) + I(loudness^3) + I(loudness^4) +
## I(loudness^5) + mode + speechiness + acousticness + instrumentalness +
## I(instrumentalness^2) + I(instrumentalness^3) + I(instrumentalness^4) +
## I(instrumentalness^5) + I(instrumentalness^6) + I(instrumentalness^7) +
## I(instrumentalness^8) + liveness + I(liveness^2) + I(liveness^3) +
## I(liveness^4) + I(liveness^5) + I(liveness^6) + valence +
## I(valence^2) + I(tempo^5) + I(tempo^6) + I(tempo^7) + I(tempo^8) +
## I(tempo^9) + I(tempo^10) + duration_ms + I(duration_ms^4) +
## I(log(duration_ms)) + key_char_7 + loudness:key_char_1 +
## duration_ms:key_char_1 + duration_ms:key_char_10 + duration_ms:key_char_11 +
## duration_ms:key_char_5 + duration_ms:key_char_8 + duration_ms:key_char_9,
## data = spotify_songs_6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.958 -17.076 2.841 18.002 62.344
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.738e+02 4.188e+01 -6.538 6.35e-11 ***
## I(danceability^7) 1.057e+01 1.251e+00 8.448 < 2e-16 ***
## I(energy^7) -1.276e+01 8.435e-01 -15.126 < 2e-16 ***
## loudness -1.717e+00 8.440e-01 -2.034 0.041948 *
## I(loudness^2) -5.239e-01 1.420e-01 -3.690 0.000225 ***
## I(loudness^3) -4.033e-02 1.025e-02 -3.936 8.32e-05 ***
## I(loudness^4) -1.153e-03 3.147e-04 -3.663 0.000250 ***
## I(loudness^5) -1.105e-05 3.334e-06 -3.316 0.000915 ***
## mode 8.934e-01 2.824e-01 3.164 0.001558 **
## speechiness -1.073e+01 1.469e+00 -7.307 2.80e-13 ***
## acousticness 6.786e+00 7.039e-01 9.641 < 2e-16 ***
## instrumentalness -2.333e+02 3.820e+01 -6.106 1.04e-09 ***
## I(instrumentalness^2) 4.172e+03 8.785e+02 4.749 2.05e-06 ***
## I(instrumentalness^3) -3.237e+04 7.552e+03 -4.286 1.82e-05 ***
## I(instrumentalness^4) 1.270e+05 3.165e+04 4.012 6.02e-05 ***
## I(instrumentalness^5) -2.734e+05 7.177e+04 -3.810 0.000139 ***
## I(instrumentalness^6) 3.277e+05 8.995e+04 3.643 0.000270 ***
## I(instrumentalness^7) -2.052e+05 5.863e+04 -3.499 0.000467 ***
## I(instrumentalness^8) 5.233e+04 1.551e+04 3.373 0.000744 ***
## liveness 1.354e+02 4.111e+01 3.293 0.000993 ***
## I(liveness^2) -1.080e+03 3.662e+02 -2.949 0.003190 **
## I(liveness^3) 3.854e+03 1.471e+03 2.620 0.008806 **
## I(liveness^4) -6.961e+03 2.886e+03 -2.412 0.015883 *
## I(liveness^5) 6.165e+03 2.697e+03 2.286 0.022262 *
## I(liveness^6) -2.120e+03 9.591e+02 -2.210 0.027095 *
## valence 5.743e+00 2.563e+00 2.241 0.025047 *
## I(valence^2) -6.578e+00 2.427e+00 -2.710 0.006732 **
## I(tempo^5) 5.483e-08 1.603e-08 3.420 0.000626 ***
## I(tempo^6) -1.588e-09 4.705e-10 -3.374 0.000741 ***
## I(tempo^7) 1.836e-11 5.580e-12 3.291 0.000999 ***
## I(tempo^8) -1.056e-13 3.319e-14 -3.180 0.001475 **
## I(tempo^9) 3.009e-16 9.868e-17 3.050 0.002293 **
## I(tempo^10) -3.403e-19 1.170e-19 -2.909 0.003634 **
## duration_ms -2.024e-04 2.094e-05 -9.666 < 2e-16 ***
## I(duration_ms^4) 4.618e-22 6.633e-23 6.961 3.45e-12 ***
## I(log(duration_ms)) 2.821e+01 3.760e+00 7.502 6.47e-14 ***
## key_char_7 -1.140e+00 4.806e-01 -2.373 0.017646 *
## loudness:key_char_1 2.568e-01 1.262e-01 2.034 0.041914 *
## duration_ms:key_char_1 5.644e-06 4.065e-06 1.388 0.165059
## duration_ms:key_char_10 -9.691e-07 2.415e-06 -0.401 0.688146
## duration_ms:key_char_11 -1.666e-06 2.150e-06 -0.775 0.438260
## duration_ms:key_char_5 -9.614e-07 2.252e-06 -0.427 0.669468
## duration_ms:key_char_8 4.135e-06 2.397e-06 1.725 0.084572 .
## duration_ms:key_char_9 -1.085e-06 2.096e-06 -0.518 0.604671
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.88 on 28300 degrees of freedom
## Multiple R-squared: 0.06921, Adjusted R-squared: 0.0678
## F-statistic: 48.94 on 43 and 28300 DF, p-value: < 2.2e-16
In the above output, we see that most of the interaction terms have corresponding p-values that are larger than 0.05. This means that although we previously thought these interaction terms were statistically significant, they are not significant at the 0.05 level once they are included together in this model. Hence we exclude all of the interaction terms from our regression model except for loudness:key_char_1, as this is the only interaction term in the above output with a corresponding p-value that is less than 0.05. Using the below code, we create a new regression model excluding the insignificant interaction terms, and the below output displays a summary of this revised regression model.
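Because several interaction terms are dropped at once, a joint test can complement the individual p-values. The following is a minimal sketch on simulated data, not the analysis performed in this report: the data frame toy and its columns x, z, d, and y are hypothetical. It compares a model with two interaction terms against the nested model without them by using a partial F-test.

```r
# Simulated data for illustration only; none of these names come from the Spotify data set.
set.seed(1)
n <- 200
toy <- data.frame(x = rnorm(n), z = rnorm(n), d = rbinom(n, 1, 0.5))
toy$y <- 1 + 2 * toy$x + rnorm(n)

# Full model with two interaction terms, and a reduced model without them.
full_model <- lm(y ~ x + z + d + x:d + z:d, data = toy)
reduced_model <- lm(y ~ x + z + d, data = toy)

# Partial F-test of the null hypothesis that both interaction coefficients are zero.
f_test <- anova(reduced_model, full_model)
print(f_test)
```

A large p-value in the second row of this table would support removing the whole group of interaction terms together, rather than judging each one separately.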
# Using the below code, we create a regression model that includes the statistically significant interaction term, but not the insignificant ones.
model_interaction_3 <- lm(track_popularity ~ I(danceability^7) + I(energy^7) + loudness + I(loudness^2) + I(loudness^3) + I(loudness^4) + I(loudness^5) + mode + speechiness + acousticness + instrumentalness + I(instrumentalness^2) + I(instrumentalness^3) + I(instrumentalness^4) + I(instrumentalness^5) + I(instrumentalness^6) + I(instrumentalness^7) + I(instrumentalness^8) + liveness + I(liveness^2) + I(liveness^3) + I(liveness^4) + I(liveness^5) + I(liveness^6) + valence + I(valence^2) + I(tempo^5) + I(tempo^6) + I(tempo^7) + I(tempo^8) + I(tempo^9) + I(tempo^10) + duration_ms + I(duration_ms^4) + I(log(duration_ms)) + key_char_7 + loudness:key_char_1, data = spotify_songs_6)
# Using the below code, we display a summary of our new regression model.
summary(model_interaction_3)
##
## Call:
## lm(formula = track_popularity ~ I(danceability^7) + I(energy^7) +
## loudness + I(loudness^2) + I(loudness^3) + I(loudness^4) +
## I(loudness^5) + mode + speechiness + acousticness + instrumentalness +
## I(instrumentalness^2) + I(instrumentalness^3) + I(instrumentalness^4) +
## I(instrumentalness^5) + I(instrumentalness^6) + I(instrumentalness^7) +
## I(instrumentalness^8) + liveness + I(liveness^2) + I(liveness^3) +
## I(liveness^4) + I(liveness^5) + I(liveness^6) + valence +
## I(valence^2) + I(tempo^5) + I(tempo^6) + I(tempo^7) + I(tempo^8) +
## I(tempo^9) + I(tempo^10) + duration_ms + I(duration_ms^4) +
## I(log(duration_ms)) + key_char_7 + loudness:key_char_1, data = spotify_songs_6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.188 -17.036 2.849 18.017 62.518
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.760e+02 4.186e+01 -6.594 4.35e-11 ***
## I(danceability^7) 1.061e+01 1.250e+00 8.493 < 2e-16 ***
## I(energy^7) -1.278e+01 8.432e-01 -15.160 < 2e-16 ***
## loudness -1.699e+00 8.438e-01 -2.013 0.044124 *
## I(loudness^2) -5.260e-01 1.420e-01 -3.705 0.000212 ***
## I(loudness^3) -4.057e-02 1.025e-02 -3.959 7.54e-05 ***
## I(loudness^4) -1.162e-03 3.147e-04 -3.693 0.000222 ***
## I(loudness^5) -1.117e-05 3.333e-06 -3.351 0.000807 ***
## mode 9.837e-01 2.779e-01 3.540 0.000401 ***
## speechiness -1.068e+01 1.467e+00 -7.278 3.46e-13 ***
## acousticness 6.797e+00 7.037e-01 9.660 < 2e-16 ***
## instrumentalness -2.338e+02 3.820e+01 -6.121 9.43e-10 ***
## I(instrumentalness^2) 4.186e+03 8.785e+02 4.765 1.90e-06 ***
## I(instrumentalness^3) -3.248e+04 7.552e+03 -4.301 1.71e-05 ***
## I(instrumentalness^4) 1.274e+05 3.165e+04 4.026 5.69e-05 ***
## I(instrumentalness^5) -2.743e+05 7.177e+04 -3.822 0.000133 ***
## I(instrumentalness^6) 3.287e+05 8.994e+04 3.654 0.000259 ***
## I(instrumentalness^7) -2.057e+05 5.862e+04 -3.509 0.000450 ***
## I(instrumentalness^8) 5.246e+04 1.551e+04 3.382 0.000721 ***
## liveness 1.360e+02 4.110e+01 3.309 0.000936 ***
## I(liveness^2) -1.088e+03 3.661e+02 -2.971 0.002967 **
## I(liveness^3) 3.890e+03 1.471e+03 2.645 0.008180 **
## I(liveness^4) -7.036e+03 2.886e+03 -2.438 0.014757 *
## I(liveness^5) 6.239e+03 2.696e+03 2.314 0.020687 *
## I(liveness^6) -2.147e+03 9.588e+02 -2.239 0.025178 *
## valence 5.666e+00 2.562e+00 2.212 0.026993 *
## I(valence^2) -6.524e+00 2.427e+00 -2.689 0.007175 **
## I(tempo^5) 5.510e-08 1.603e-08 3.437 0.000588 ***
## I(tempo^6) -1.596e-09 4.705e-10 -3.392 0.000695 ***
## I(tempo^7) 1.846e-11 5.579e-12 3.309 0.000937 ***
## I(tempo^8) -1.061e-13 3.319e-14 -3.198 0.001386 **
## I(tempo^9) 3.027e-16 9.867e-17 3.068 0.002160 **
## I(tempo^10) -3.423e-19 1.170e-19 -2.926 0.003435 **
## duration_ms -2.029e-04 2.092e-05 -9.698 < 2e-16 ***
## I(duration_ms^4) 4.626e-22 6.631e-23 6.976 3.11e-12 ***
## I(log(duration_ms)) 2.840e+01 3.758e+00 7.558 4.22e-14 ***
## key_char_7 -1.155e+00 4.551e-01 -2.539 0.011138 *
## loudness:key_char_1 9.631e-02 5.796e-02 1.662 0.096620 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.88 on 28306 degrees of freedom
## Multiple R-squared: 0.06898, Adjusted R-squared: 0.06777
## F-statistic: 56.68 on 37 and 28306 DF, p-value: < 2.2e-16
In the above output, we see that the p-values corresponding to the covariates are all less than 0.1. However, the values for the multiple R-squared and adjusted R-squared are still only about 0.07, meaning that the model explains only about 7% of the variation in track_popularity and cannot be used to make reliable predictions. As a consequence, we choose to additionally incorporate the interactions discussed in Section 8 (i.e., the interaction between key and mode, the interaction between mode and speechiness, and the interaction between mode and tempo). Using the below code, we create a revised regression model incorporating these interactions, and the below output displays a summary of this revised regression model.
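As a brief aside on how the two R-squared values in the preceding output relate to each other, the sketch below recomputes the adjusted R-squared from the reported multiple R-squared and degrees of freedom. The variable names are chosen here for illustration; the numbers are taken from the summary of model_interaction_3.

```r
# Values reported in the summary of model_interaction_3 above.
r_squared <- 0.06898
residual_df <- 28306            # n - p - 1
p <- 37                         # number of covariates (numerator DF of the F-statistic)
n <- residual_df + p + 1        # number of tracks used in the fit

# Adjusted R-squared penalizes R-squared for the number of covariates.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / residual_df
adj_r_squared
```

The result agrees with the reported adjusted R-squared of 0.06777 up to rounding of the inputs. With more than 28,000 observations, the penalty for 37 covariates is tiny, which is why the two values are nearly identical.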
# Using the below code, we create a regression model that incorporates interaction with the categorical variable mode.
model_interaction_4 <- lm(track_popularity ~ I(danceability^7) + I(energy^7) + loudness + I(loudness^2) + I(loudness^3) + I(loudness^4) + I(loudness^5) + mode + speechiness + acousticness + instrumentalness + I(instrumentalness^2) + I(instrumentalness^3) + I(instrumentalness^4) + I(instrumentalness^5) + I(instrumentalness^6) + I(instrumentalness^7) + I(instrumentalness^8) + liveness + I(liveness^2) + I(liveness^3) + I(liveness^4) + I(liveness^5) + I(liveness^6) + valence + I(valence^2) + I(tempo^5) + I(tempo^6) + I(tempo^7) + I(tempo^8) + I(tempo^9) + I(tempo^10) + duration_ms + I(duration_ms^4) + I(log(duration_ms)) + key_char_7 + mode:speechiness + mode:tempo + mode:key_char + loudness:key_char_1, data = spotify_songs_6)
# Using the below code, we display a summary of our new regression model.
summary(model_interaction_4)
##
## Call:
## lm(formula = track_popularity ~ I(danceability^7) + I(energy^7) +
## loudness + I(loudness^2) + I(loudness^3) + I(loudness^4) +
## I(loudness^5) + mode + speechiness + acousticness + instrumentalness +
## I(instrumentalness^2) + I(instrumentalness^3) + I(instrumentalness^4) +
## I(instrumentalness^5) + I(instrumentalness^6) + I(instrumentalness^7) +
## I(instrumentalness^8) + liveness + I(liveness^2) + I(liveness^3) +
## I(liveness^4) + I(liveness^5) + I(liveness^6) + valence +
## I(valence^2) + I(tempo^5) + I(tempo^6) + I(tempo^7) + I(tempo^8) +
## I(tempo^9) + I(tempo^10) + duration_ms + I(duration_ms^4) +
## I(log(duration_ms)) + key_char_7 + mode:speechiness + mode:tempo +
## mode:key_char + loudness:key_char_1, data = spotify_songs_6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.606 -17.055 2.894 17.953 62.787
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.759e+02 4.186e+01 -6.590 4.46e-11 ***
## I(danceability^7) 1.074e+01 1.252e+00 8.579 < 2e-16 ***
## I(energy^7) -1.274e+01 8.437e-01 -15.101 < 2e-16 ***
## loudness -1.667e+00 8.441e-01 -1.975 0.048328 *
## I(loudness^2) -5.218e-01 1.420e-01 -3.674 0.000240 ***
## I(loudness^3) -4.028e-02 1.025e-02 -3.930 8.53e-05 ***
## I(loudness^4) -1.153e-03 3.148e-04 -3.663 0.000249 ***
## I(loudness^5) -1.107e-05 3.334e-06 -3.320 0.000903 ***
## mode 1.993e+00 1.368e+00 1.457 0.145179
## speechiness -8.055e+00 2.050e+00 -3.929 8.53e-05 ***
## acousticness 6.742e+00 7.053e-01 9.559 < 2e-16 ***
## instrumentalness -2.321e+02 3.821e+01 -6.074 1.27e-09 ***
## I(instrumentalness^2) 4.151e+03 8.787e+02 4.724 2.33e-06 ***
## I(instrumentalness^3) -3.224e+04 7.553e+03 -4.268 1.97e-05 ***
## I(instrumentalness^4) 1.267e+05 3.166e+04 4.002 6.30e-05 ***
## I(instrumentalness^5) -2.731e+05 7.177e+04 -3.805 0.000142 ***
## I(instrumentalness^6) 3.277e+05 8.995e+04 3.643 0.000270 ***
## I(instrumentalness^7) -2.054e+05 5.863e+04 -3.504 0.000460 ***
## I(instrumentalness^8) 5.245e+04 1.551e+04 3.381 0.000723 ***
## liveness 1.333e+02 4.112e+01 3.241 0.001193 **
## I(liveness^2) -1.068e+03 3.662e+02 -2.916 0.003551 **
## I(liveness^3) 3.830e+03 1.471e+03 2.603 0.009238 **
## I(liveness^4) -6.952e+03 2.886e+03 -2.409 0.016011 *
## I(liveness^5) 6.185e+03 2.697e+03 2.293 0.021840 *
## I(liveness^6) -2.134e+03 9.590e+02 -2.225 0.026077 *
## valence 5.659e+00 2.563e+00 2.208 0.027258 *
## I(valence^2) -6.525e+00 2.427e+00 -2.688 0.007183 **
## I(tempo^5) 5.506e-08 1.605e-08 3.431 0.000601 ***
## I(tempo^6) -1.595e-09 4.709e-10 -3.386 0.000709 ***
## I(tempo^7) 1.845e-11 5.583e-12 3.304 0.000955 ***
## I(tempo^8) -1.060e-13 3.321e-14 -3.193 0.001411 **
## I(tempo^9) 3.024e-16 9.873e-17 3.063 0.002195 **
## I(tempo^10) -3.420e-19 1.170e-19 -2.922 0.003481 **
## duration_ms -2.026e-04 2.093e-05 -9.682 < 2e-16 ***
## I(duration_ms^4) 4.626e-22 6.632e-23 6.975 3.12e-12 ***
## I(log(duration_ms)) 2.836e+01 3.758e+00 7.547 4.60e-14 ***
## key_char_7 5.323e-01 8.503e-01 0.626 0.531324
## mode:speechiness -4.731e+00 2.685e+00 -1.762 0.078118 .
## mode:tempo -9.956e-04 1.022e-02 -0.097 0.922399
## mode:key_char1 -9.007e-01 8.952e-01 -1.006 0.314346
## mode:key_char10 -2.813e-01 9.926e-01 -0.283 0.776870
## mode:key_char11 3.749e-01 8.847e-01 0.424 0.671718
## mode:key_char2 -6.745e-01 7.158e-01 -0.942 0.346006
## mode:key_char3 -2.462e+00 1.318e+00 -1.867 0.061866 .
## mode:key_char4 5.509e-01 9.834e-01 0.560 0.575318
## mode:key_char5 1.331e-01 8.775e-01 0.152 0.879416
## mode:key_char6 -2.509e-02 8.678e-01 -0.029 0.976935
## mode:key_char7 -2.453e+00 1.100e+00 -2.230 0.025774 *
## mode:key_char8 8.611e-01 7.910e-01 1.089 0.276341
## mode:key_char9 -4.112e-01 7.821e-01 -0.526 0.599068
## loudness:key_char_1 2.758e-02 8.504e-02 0.324 0.745672
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.88 on 28293 degrees of freedom
## Multiple R-squared: 0.06958, Adjusted R-squared: 0.06794
## F-statistic: 42.32 on 50 and 28293 DF, p-value: < 2.2e-16
In the above output, we notice that the only interaction terms whose corresponding p-values are less than 0.1 are mode:speechiness, mode:key_char3, and mode:key_char7. As a consequence, we choose to exclude all other interaction terms from our regression model. Since mode:key_char3 and mode:key_char7 correspond to the keys 3 and 7, we represent them in the revised model with the indicator variables key_char_3 and key_char_7. Using the below code, we create this revised regression model, and the below output contains a summary of it.
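The terms mode:key_char1 through mode:key_char11 in the preceding output arise because R expands an interaction with a factor into one coefficient per non-reference level. The following is a minimal sketch on simulated data (the values are made up, and the three-level factor is a simplification) showing this expansion and its equivalence to hand-made 0/1 indicator variables.

```r
# Simulated data for illustration only; the names mirror the report but the values are made up.
set.seed(2)
toy <- data.frame(
  mode = rbinom(60, 1, 0.5),
  key_char = factor(rep(0:2, times = 20))
)
toy$y <- rnorm(60)

# A numeric-by-factor interaction expands to one column per non-reference level.
factor_fit <- lm(y ~ mode + mode:key_char, data = toy)
names(coef(factor_fit))

# Equivalent fit with hand-made 0/1 indicator variables.
toy$key_char_1 <- as.numeric(toy$key_char == "1")
toy$key_char_2 <- as.numeric(toy$key_char == "2")
dummy_fit <- lm(y ~ mode + mode:key_char_1 + mode:key_char_2, data = toy)
all.equal(unname(fitted(factor_fit)), unname(fitted(dummy_fit)))
```

Using indicator variables makes it possible to keep the interaction for a single key while dropping the interactions for the remaining keys, which a single factor term does not allow.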
# Using the below code, we create a revised regression model containing those interaction terms whose p-value is less than 0.1.
model_interaction_5 <- lm(track_popularity ~ I(danceability^7) + I(energy^7) + loudness + I(loudness^2) + I(loudness^3) + I(loudness^4) + I(loudness^5) + mode + speechiness + acousticness + instrumentalness + I(instrumentalness^2) + I(instrumentalness^3) + I(instrumentalness^4) + I(instrumentalness^5) + I(instrumentalness^6) + I(instrumentalness^7) + I(instrumentalness^8) + liveness + I(liveness^2) + I(liveness^3) + I(liveness^4) + I(liveness^5) + I(liveness^6) + valence + I(valence^2) + I(tempo^5) + I(tempo^6) + I(tempo^7) + I(tempo^8) + I(tempo^9) + I(tempo^10) + duration_ms + I(duration_ms^4) + I(log(duration_ms)) + key_char_7 + mode:speechiness + mode:key_char_3 + mode:key_char_7, data = spotify_songs_6)
# Using the below code, we display a summary of our new regression model.
summary(model_interaction_5)
##
## Call:
## lm(formula = track_popularity ~ I(danceability^7) + I(energy^7) +
## loudness + I(loudness^2) + I(loudness^3) + I(loudness^4) +
## I(loudness^5) + mode + speechiness + acousticness + instrumentalness +
## I(instrumentalness^2) + I(instrumentalness^3) + I(instrumentalness^4) +
## I(instrumentalness^5) + I(instrumentalness^6) + I(instrumentalness^7) +
## I(instrumentalness^8) + liveness + I(liveness^2) + I(liveness^3) +
## I(liveness^4) + I(liveness^5) + I(liveness^6) + valence +
## I(valence^2) + I(tempo^5) + I(tempo^6) + I(tempo^7) + I(tempo^8) +
## I(tempo^9) + I(tempo^10) + duration_ms + I(duration_ms^4) +
## I(log(duration_ms)) + key_char_7 + mode:speechiness + mode:key_char_3 +
## mode:key_char_7, data = spotify_songs_6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.42 -17.03 2.90 18.02 62.79
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.766e+02 4.185e+01 -6.610 3.91e-11 ***
## I(danceability^7) 1.058e+01 1.249e+00 8.471 < 2e-16 ***
## I(energy^7) -1.276e+01 8.432e-01 -15.135 < 2e-16 ***
## loudness -1.661e+00 8.437e-01 -1.969 0.048995 *
## I(loudness^2) -5.220e-01 1.420e-01 -3.676 0.000237 ***
## I(loudness^3) -4.031e-02 1.025e-02 -3.934 8.38e-05 ***
## I(loudness^4) -1.154e-03 3.147e-04 -3.669 0.000244 ***
## I(loudness^5) -1.109e-05 3.333e-06 -3.326 0.000882 ***
## mode 1.751e+00 4.131e-01 4.237 2.27e-05 ***
## speechiness -8.031e+00 2.049e+00 -3.920 8.87e-05 ***
## acousticness 6.816e+00 7.035e-01 9.689 < 2e-16 ***
## instrumentalness -2.331e+02 3.820e+01 -6.102 1.06e-09 ***
## I(instrumentalness^2) 4.177e+03 8.785e+02 4.755 1.99e-06 ***
## I(instrumentalness^3) -3.246e+04 7.552e+03 -4.298 1.73e-05 ***
## I(instrumentalness^4) 1.275e+05 3.165e+04 4.029 5.62e-05 ***
## I(instrumentalness^5) -2.748e+05 7.176e+04 -3.829 0.000129 ***
## I(instrumentalness^6) 3.296e+05 8.994e+04 3.664 0.000248 ***
## I(instrumentalness^7) -2.065e+05 5.862e+04 -3.522 0.000429 ***
## I(instrumentalness^8) 5.269e+04 1.551e+04 3.397 0.000681 ***
## liveness 1.355e+02 4.110e+01 3.298 0.000975 ***
## I(liveness^2) -1.085e+03 3.661e+02 -2.964 0.003039 **
## I(liveness^3) 3.885e+03 1.471e+03 2.642 0.008248 **
## I(liveness^4) -7.036e+03 2.885e+03 -2.439 0.014745 *
## I(liveness^5) 6.244e+03 2.696e+03 2.316 0.020565 *
## I(liveness^6) -2.150e+03 9.587e+02 -2.242 0.024957 *
## valence 5.751e+00 2.562e+00 2.245 0.024762 *
## I(valence^2) -6.581e+00 2.426e+00 -2.713 0.006679 **
## I(tempo^5) 5.456e-08 1.603e-08 3.404 0.000664 ***
## I(tempo^6) -1.580e-09 4.704e-10 -3.359 0.000784 ***
## I(tempo^7) 1.827e-11 5.578e-12 3.276 0.001055 **
## I(tempo^8) -1.050e-13 3.319e-14 -3.164 0.001556 **
## I(tempo^9) 2.993e-16 9.866e-17 3.034 0.002415 **
## I(tempo^10) -3.383e-19 1.170e-19 -2.893 0.003822 **
## duration_ms -2.030e-04 2.092e-05 -9.703 < 2e-16 ***
## I(duration_ms^4) 4.632e-22 6.630e-23 6.986 2.90e-12 ***
## I(log(duration_ms)) 2.843e+01 3.757e+00 7.567 3.94e-14 ***
## key_char_7 5.450e-01 8.485e-01 0.642 0.520697
## mode:speechiness -5.171e+00 2.672e+00 -1.936 0.052932 .
## mode:key_char_3 -2.298e+00 1.241e+00 -1.852 0.064062 .
## mode:key_char_7 -2.275e+00 1.003e+00 -2.267 0.023374 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.88 on 28304 degrees of freedom
## Multiple R-squared: 0.06928, Adjusted R-squared: 0.068
## F-statistic: 54.02 on 39 and 28304 DF, p-value: < 2.2e-16
In the above output, we see that the p-value corresponding to key_char_7 is about 0.52. This indicates that key_char_7 is not actually statistically significant in our regression model, and hence we exclude this covariate. Using the below code, we create and summarize a revised regression model that does not contain the variable key_char_7. The below output displays a summary of this revised regression model.
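As an aside, removing a single term does not require retyping the whole formula. The sketch below, on simulated data with hypothetical variable names, uses update() to drop a main effect while keeping an interaction that involves it, which mirrors removing key_char_7 while retaining mode:key_char_7.

```r
# Simulated data for illustration only; none of these names come from the Spotify data set.
set.seed(3)
toy <- data.frame(x = rnorm(100), z = rnorm(100), d = rbinom(100, 1, 0.5))
toy$y <- 1 + toy$x + rnorm(100)

larger_fit <- lm(y ~ x + z + d + x:d, data = toy)

# update() refits the model with the listed term removed; "." means "keep what was there".
smaller_fit <- update(larger_fit, . ~ . - d)
attr(terms(smaller_fit), "term.labels")
```

This approach reduces the risk of accidentally changing another term when a long formula is copied from one model to the next.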
# Using the below code, we create a revised regression model that does not contain the variable key_char_7.
model_interaction_6 <- lm(track_popularity ~ I(danceability^7) + I(energy^7) + loudness + I(loudness^2) + I(loudness^3) + I(loudness^4) + I(loudness^5) + mode + speechiness + acousticness + instrumentalness + I(instrumentalness^2) + I(instrumentalness^3) + I(instrumentalness^4) + I(instrumentalness^5) + I(instrumentalness^6) + I(instrumentalness^7) + I(instrumentalness^8) + liveness + I(liveness^2) + I(liveness^3) + I(liveness^4) + I(liveness^5) + I(liveness^6) + valence + I(valence^2) + I(tempo^5) + I(tempo^6) + I(tempo^7) + I(tempo^8) + I(tempo^9) + I(tempo^10) + duration_ms + I(duration_ms^4) + I(log(duration_ms)) + mode:speechiness + mode:key_char_3 + mode:key_char_7, data = spotify_songs_6)
# Using the below code, we display a summary of our new regression model.
summary(model_interaction_6)
##
## Call:
## lm(formula = track_popularity ~ I(danceability^7) + I(energy^7) +
## loudness + I(loudness^2) + I(loudness^3) + I(loudness^4) +
## I(loudness^5) + mode + speechiness + acousticness + instrumentalness +
## I(instrumentalness^2) + I(instrumentalness^3) + I(instrumentalness^4) +
## I(instrumentalness^5) + I(instrumentalness^6) + I(instrumentalness^7) +
## I(instrumentalness^8) + liveness + I(liveness^2) + I(liveness^3) +
## I(liveness^4) + I(liveness^5) + I(liveness^6) + valence +
## I(valence^2) + I(tempo^5) + I(tempo^6) + I(tempo^7) + I(tempo^8) +
## I(tempo^9) + I(tempo^10) + duration_ms + I(duration_ms^4) +
## I(log(duration_ms)) + mode:speechiness + mode:key_char_3 +
## mode:key_char_7, data = spotify_songs_6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.42 -17.04 2.89 18.01 62.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.767e+02 4.185e+01 -6.613 3.83e-11 ***
## I(danceability^7) 1.057e+01 1.249e+00 8.465 < 2e-16 ***
## I(energy^7) -1.276e+01 8.432e-01 -15.136 < 2e-16 ***
## loudness -1.666e+00 8.436e-01 -1.975 0.048294 *
## I(loudness^2) -5.230e-01 1.420e-01 -3.684 0.000230 ***
## I(loudness^3) -4.039e-02 1.025e-02 -3.942 8.09e-05 ***
## I(loudness^4) -1.157e-03 3.146e-04 -3.678 0.000235 ***
## I(loudness^5) -1.112e-05 3.333e-06 -3.337 0.000848 ***
## mode 1.711e+00 4.086e-01 4.188 2.82e-05 ***
## speechiness -8.067e+00 2.048e+00 -3.939 8.20e-05 ***
## acousticness 6.821e+00 7.034e-01 9.697 < 2e-16 ***
## instrumentalness -2.335e+02 3.820e+01 -6.113 9.94e-10 ***
## I(instrumentalness^2) 4.186e+03 8.784e+02 4.765 1.89e-06 ***
## I(instrumentalness^3) -3.252e+04 7.551e+03 -4.307 1.66e-05 ***
## I(instrumentalness^4) 1.277e+05 3.165e+04 4.037 5.44e-05 ***
## I(instrumentalness^5) -2.752e+05 7.176e+04 -3.836 0.000126 ***
## I(instrumentalness^6) 3.301e+05 8.993e+04 3.670 0.000243 ***
## I(instrumentalness^7) -2.068e+05 5.862e+04 -3.527 0.000420 ***
## I(instrumentalness^8) 5.277e+04 1.551e+04 3.402 0.000670 ***
## liveness 1.358e+02 4.109e+01 3.304 0.000953 ***
## I(liveness^2) -1.087e+03 3.661e+02 -2.968 0.002996 **
## I(liveness^3) 3.889e+03 1.471e+03 2.644 0.008188 **
## I(liveness^4) -7.039e+03 2.885e+03 -2.440 0.014708 *
## I(liveness^5) 6.243e+03 2.696e+03 2.316 0.020585 *
## I(liveness^6) -2.148e+03 9.587e+02 -2.241 0.025045 *
## valence 5.739e+00 2.561e+00 2.241 0.025062 *
## I(valence^2) -6.579e+00 2.426e+00 -2.711 0.006702 **
## I(tempo^5) 5.451e-08 1.603e-08 3.401 0.000672 ***
## I(tempo^6) -1.578e-09 4.704e-10 -3.355 0.000794 ***
## I(tempo^7) 1.825e-11 5.578e-12 3.272 0.001069 **
## I(tempo^8) -1.049e-13 3.319e-14 -3.160 0.001577 **
## I(tempo^9) 2.989e-16 9.865e-17 3.030 0.002447 **
## I(tempo^10) -3.379e-19 1.170e-19 -2.889 0.003872 **
## duration_ms -2.031e-04 2.092e-05 -9.707 < 2e-16 ***
## I(duration_ms^4) 4.632e-22 6.630e-23 6.987 2.88e-12 ***
## I(log(duration_ms)) 2.845e+01 3.757e+00 7.571 3.83e-14 ***
## mode:speechiness -5.130e+00 2.671e+00 -1.921 0.054759 .
## mode:key_char_3 -2.299e+00 1.241e+00 -1.853 0.063929 .
## mode:key_char_7 -1.730e+00 5.347e-01 -3.235 0.001217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.88 on 28305 degrees of freedom
## Multiple R-squared: 0.06927, Adjusted R-squared: 0.06802
## F-statistic: 55.43 on 38 and 28305 DF, p-value: < 2.2e-16
In the above output, we see that the p-values corresponding to most covariates are less than 0.05, indicating statistically significant relationships. However, the variables mode:speechiness and mode:key_char_3 have corresponding p-values that are each slightly above 0.05 (about 0.055 and 0.064, respectively). As these p-values are only slightly above 0.05, we keep mode:speechiness and mode:key_char_3 in our model for now. We notice that the values for multiple R-squared and adjusted R-squared are still only about 0.07, and so we cannot rely on this model to make highly accurate predictions.
As a consequence, we consider possible Box-Cox transformations of track_popularity. In order to select the tuning parameter for the Box-Cox transformation, we maximize the log-likelihood of the transformed data. In order to do this, we need the response variable to be strictly positive. Since track_popularity ranges from 0 to 100 and can equal 0, we achieve this by creating a new response variable equal to 1 + track_popularity. We call this new variable pos_track_popularity, and we create a new regression model whose response variable is pos_track_popularity and whose covariates are those used in the preceding regression model. After doing this, we display a graph marking the value of the optimal tuning parameter with a dashed line. This is shown in the below output. By clicking on the code button directly below, you can view the code used to accomplish everything mentioned in this paragraph.
# Using the below code, we create the new variable called pos_track_popularity which is equivalent to 1 + track_popularity.
pos_track_popularity <- spotify_songs_6$track_popularity + 1
# Using the below code, we create a new regression model whose response variable is pos_track_popularity, and whose covariates are the covariates used in the preceding regression model directly above.
box_cox_selection_model <- lm(pos_track_popularity ~ I(danceability^7) + I(energy^7) + loudness + I(loudness^2) + I(loudness^3) + I(loudness^4) + I(loudness^5) + mode + speechiness + acousticness + instrumentalness + I(instrumentalness^2) + I(instrumentalness^3) + I(instrumentalness^4) + I(instrumentalness^5) + I(instrumentalness^6) + I(instrumentalness^7) + I(instrumentalness^8) + liveness + I(liveness^2) + I(liveness^3) + I(liveness^4) + I(liveness^5) + I(liveness^6) + valence + I(valence^2) + I(tempo^5) + I(tempo^6) + I(tempo^7) + I(tempo^8) + I(tempo^9) + I(tempo^10) + duration_ms + I(duration_ms^4) + I(log(duration_ms)) + mode:speechiness + mode:key_char_3 + mode:key_char_7, data = spotify_songs_6)
# Using the below code, we create a graph that displays the optimal tuning parameter for a Box-Cox transformation.
box_cox <- boxcox(box_cox_selection_model, xlab = "Tuning Parameter")
In the above graph, we notice that the optimal tuning parameter is fairly close to 1, the value that would correspond to leaving the response untransformed. In the output below, we display the value of this optimal tuning parameter. You can view the code used to create this output by clicking on the code button directly below.
# Using the below code, we display the value of the optimal tuning parameter for a Box-Cox transformation.
box_cox$x[which.max(box_cox$y)]
## [1] 0.7878788
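To illustrate what maximizing the Box-Cox log-likelihood involves, the following is a minimal sketch on simulated data rather than the Spotify data. The function box_cox_loglik and the variables x and y are hypothetical names introduced here, and the profile log-likelihood is computed by hand (up to an additive constant) instead of with boxcox.

```r
# Simulated data for illustration only: the square root of y is linear in x,
# so the estimated tuning parameter should come out near 0.5.
set.seed(4)
x <- runif(500, 1, 10)
y <- (2 + 0.5 * x + rnorm(500, sd = 0.3))^2

# Profile log-likelihood of the tuning parameter for the model y ~ x,
# up to an additive constant that does not depend on lambda.
box_cox_loglik <- function(lambda) {
  y_transformed <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  rss <- sum(resid(lm(y_transformed ~ x))^2)
  -length(y) / 2 * log(rss / length(y)) + (lambda - 1) * sum(log(y))
}

# Evaluate on a grid and keep the maximizer, in the same spirit as
# box_cox$x[which.max(box_cox$y)].
lambda_grid <- seq(-2, 2, by = 0.01)
loglik_values <- sapply(lambda_grid, box_cox_loglik)
lambda_hat <- lambda_grid[which.max(loglik_values)]
lambda_hat
```

Because the maximizer is selected from a grid, the reported tuning parameter is the best grid point rather than an exact optimum, and a finer grid gives a more precise value.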
We create a revised regression model using a Box-Cox transformation with the optimal tuning parameter, and we display a summary of this revised regression model in the output below. By clicking on the code button directly below, you can view the code used to create and summarize this revised regression model.
# Using the below code, we create a revised regression model using a Box-Cox transformation with the optimal tuning parameter.
box_cox_model <- lm(I(pos_track_popularity^0.7878788) ~ I(danceability^7) + I(energy^7) + loudness + I(loudness^2) + I(loudness^3) + I(loudness^4) + I(loudness^5) + mode + speechiness + acousticness + instrumentalness + I(instrumentalness^2) + I(instrumentalness^3) + I(instrumentalness^4) + I(instrumentalness^5) + I(instrumentalness^6) + I(instrumentalness^7) + I(instrumentalness^8) + liveness + I(liveness^2) + I(liveness^3) + I(liveness^4) + I(liveness^5) + I(liveness^6) + valence + I(valence^2) + I(tempo^5) + I(tempo^6) + I(tempo^7) + I(tempo^8) + I(tempo^9) + I(tempo^10) + duration_ms + I(duration_ms^4) + I(log(duration_ms)) + mode:speechiness + mode:key_char_3 + mode:key_char_7, data = spotify_songs_6)
# Using the below code, we display a summary of our new regression model.
summary(box_cox_model)
##
## Call:
## lm(formula = I(pos_track_popularity^0.7878788) ~ I(danceability^7) +
## I(energy^7) + loudness + I(loudness^2) + I(loudness^3) +
## I(loudness^4) + I(loudness^5) + mode + speechiness + acousticness +
## instrumentalness + I(instrumentalness^2) + I(instrumentalness^3) +
## I(instrumentalness^4) + I(instrumentalness^5) + I(instrumentalness^6) +
## I(instrumentalness^7) + I(instrumentalness^8) + liveness +
## I(liveness^2) + I(liveness^3) + I(liveness^4) + I(liveness^5) +
## I(liveness^6) + valence + I(valence^2) + I(tempo^5) + I(tempo^6) +
## I(tempo^7) + I(tempo^8) + I(tempo^9) + I(tempo^10) + duration_ms +
## I(duration_ms^4) + I(log(duration_ms)) + mode:speechiness +
## mode:key_char_3 + mode:key_char_7, data = spotify_songs_6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.764 -5.875 1.712 6.926 21.204
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.066e+02 1.623e+01 -6.571 5.10e-11 ***
## I(danceability^7) 3.990e+00 4.844e-01 8.236 < 2e-16 ***
## I(energy^7) -4.747e+00 3.270e-01 -14.518 < 2e-16 ***
## loudness -7.017e-01 3.272e-01 -2.145 0.031978 *
## I(loudness^2) -2.023e-01 5.506e-02 -3.675 0.000239 ***
## I(loudness^3) -1.537e-02 3.973e-03 -3.868 0.000110 ***
## I(loudness^4) -4.375e-04 1.220e-04 -3.586 0.000336 ***
## I(loudness^5) -4.193e-06 1.292e-06 -3.244 0.001180 **
## mode 6.294e-01 1.585e-01 3.972 7.14e-05 ***
## speechiness -3.110e+00 7.942e-01 -3.916 9.02e-05 ***
## acousticness 2.614e+00 2.728e-01 9.582 < 2e-16 ***
## instrumentalness -8.302e+01 1.481e+01 -5.604 2.11e-08 ***
## I(instrumentalness^2) 1.495e+03 3.406e+02 4.388 1.15e-05 ***
## I(instrumentalness^3) -1.170e+04 2.928e+03 -3.996 6.45e-05 ***
## I(instrumentalness^4) 4.620e+04 1.227e+04 3.765 0.000167 ***
## I(instrumentalness^5) -9.982e+04 2.783e+04 -3.587 0.000335 ***
## I(instrumentalness^6) 1.198e+05 3.488e+04 3.436 0.000592 ***
## I(instrumentalness^7) -7.505e+04 2.273e+04 -3.302 0.000963 ***
## I(instrumentalness^8) 1.913e+04 6.015e+03 3.181 0.001469 **
## liveness 5.044e+01 1.594e+01 3.165 0.001552 **
## I(liveness^2) -4.030e+02 1.420e+02 -2.839 0.004529 **
## I(liveness^3) 1.439e+03 5.703e+02 2.524 0.011614 *
## I(liveness^4) -2.603e+03 1.119e+03 -2.327 0.019986 *
## I(liveness^5) 2.311e+03 1.046e+03 2.211 0.027078 *
## I(liveness^6) -7.972e+02 3.718e+02 -2.144 0.032019 *
## valence 2.212e+00 9.934e-01 2.227 0.025936 *
## I(valence^2) -2.626e+00 9.409e-01 -2.791 0.005262 **
## I(tempo^5) 2.035e-08 6.216e-09 3.274 0.001063 **
## I(tempo^6) -5.892e-10 1.824e-10 -3.230 0.001239 **
## I(tempo^7) 6.814e-12 2.163e-12 3.150 0.001634 **
## I(tempo^8) -3.915e-14 1.287e-14 -3.042 0.002351 **
## I(tempo^9) 1.115e-16 3.826e-17 2.915 0.003556 **
## I(tempo^10) -1.260e-19 4.536e-20 -2.777 0.005484 **
## duration_ms -8.095e-05 8.113e-06 -9.978 < 2e-16 ***
## I(duration_ms^4) 1.863e-22 2.571e-23 7.247 4.38e-13 ***
## I(log(duration_ms)) 1.120e+01 1.457e+00 7.687 1.56e-14 ***
## mode:speechiness -1.873e+00 1.036e+00 -1.808 0.070631 .
## mode:key_char_3 -8.962e-01 4.813e-01 -1.862 0.062609 .
## mode:key_char_7 -6.329e-01 2.074e-01 -3.052 0.002275 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.872 on 28305 degrees of freedom
## Multiple R-squared: 0.06683, Adjusted R-squared: 0.06557
## F-statistic: 53.34 on 38 and 28305 DF, p-value: < 2.2e-16
In the preceding output, we notice that the multiple R-squared is about 0.0668, and the adjusted R-squared is about 0.0656. These values are slightly lower than those achieved by the regression models fit prior to considering a Box-Cox transformation (although R-squared values computed on differently transformed responses are not strictly comparable). As a consequence, we choose to not use a Box-Cox transformation to alter the response variable. However, given that our values for multiple R-squared and adjusted R-squared are low in all of the preceding regression models, we note that record labels cannot use these models to make highly accurate predictions.
In helping record labels, we are primarily interested in determining a model that would aid in the decision to offer new artists contracts. Such a decision can be described in a binary manner, where 1 indicates a recommendation to offer a contract, and 0 indicates a recommendation to not offer a contract. Using this binary representation, we aim to improve the accuracy of our model’s recommendations through logistic regression. To create this binary representation, we establish a new response variable called bin_track_popularity, and we add this new variable to the data frame spotify_songs_6. We set bin_track_popularity equal to 1 when track_popularity is greater than 50, and we set bin_track_popularity equal to 0 when track_popularity is no more than 50. We use logistic regression to create a model in which the response variable is bin_track_popularity, and in which the covariates are those used in the preceding regression model. We then summarize this new model, and we display the results of our summary in the output below. By clicking on the code button directly below, you can view the code used to accomplish everything discussed in this paragraph.
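As an aside, the sketch below shows on simulated data how a fitted logistic regression model can be turned into a 0/1 recommendation. It is an illustration under stated assumptions rather than the procedure used in this report: the data frame toy, its columns, and the 0.5 probability cutoff are all hypothetical.

```r
# Simulated data for illustration only; the names and the 0.5 probability cutoff are assumptions.
set.seed(5)
toy <- data.frame(danceability = runif(300))
toy$popularity <- round(pmin(100, pmax(0, 30 + 40 * toy$danceability + rnorm(300, sd = 20))))

# Binary response: 1 when popularity exceeds 50, and 0 otherwise.
toy$bin_popularity <- as.numeric(toy$popularity > 50)

# Logistic regression and fitted probabilities of a popular track.
toy_model <- glm(bin_popularity ~ danceability, data = toy, family = "binomial")
toy$prob_popular <- predict(toy_model, type = "response")

# Hypothetical decision rule: recommend a contract when the fitted probability exceeds 0.5.
toy$recommend <- as.numeric(toy$prob_popular > 0.5)
table(observed = toy$bin_popularity, recommended = toy$recommend)
```

The resulting table compares the observed binary outcome with the recommendation, and a different cutoff could be chosen if one kind of error is more costly than the other.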
# Using the below code, we create a new variable called bin_track_popularity, which equals 1 when track_popularity is greater than 50 and 0 otherwise.
bin_track_popularity <- as.numeric(spotify_songs_6$track_popularity > 50)
# Using the below code, we add the variable bin_track_popularity to the data frame spotify_songs_6.
spotify_songs_6 <- data.frame(spotify_songs_6, bin_track_popularity)
# Using the below code, we use logistic regression with bin_track_popularity as the response variable.
bin_model = glm(bin_track_popularity ~ I(danceability^7) + I(energy^7) + loudness + I(loudness^2) + I(loudness^3) + I(loudness^4) + I(loudness^5) + mode + speechiness + acousticness + instrumentalness + I(instrumentalness^2) + I(instrumentalness^3) + I(instrumentalness^4) + I(instrumentalness^5) + I(instrumentalness^6) + I(instrumentalness^7) + I(instrumentalness^8) + liveness + I(liveness^2) + I(liveness^3) + I(liveness^4) + I(liveness^5) + I(liveness^6) + valence + I(valence^2) + I(tempo^5) + I(tempo^6) + I(tempo^7) + I(tempo^8) + I(tempo^9) + I(tempo^10) + duration_ms + I(duration_ms^4) + I(log(duration_ms)) + mode:speechiness + mode:key_char_3 + mode:key_char_7, data = spotify_songs_6, family = "binomial")
# Using the below code, we summarize our new model.
summary(bin_model)
##
## Call:
## glm(formula = bin_track_popularity ~ I(danceability^7) + I(energy^7) +
## loudness + I(loudness^2) + I(loudness^3) + I(loudness^4) +
## I(loudness^5) + mode + speechiness + acousticness + instrumentalness +
## I(instrumentalness^2) + I(instrumentalness^3) + I(instrumentalness^4) +
## I(instrumentalness^5) + I(instrumentalness^6) + I(instrumentalness^7) +
## I(instrumentalness^8) + liveness + I(liveness^2) + I(liveness^3) +
## I(liveness^4) + I(liveness^5) + I(liveness^6) + valence +
## I(valence^2) + I(tempo^5) + I(tempo^6) + I(tempo^7) + I(tempo^8) +
## I(tempo^9) + I(tempo^10) + duration_ms + I(duration_ms^4) +
## I(log(duration_ms)) + mode:speechiness + mode:key_char_3 +
## mode:key_char_7, family = "binomial", data = spotify_songs_6)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5039 -0.9966 -0.7709 1.2590 2.3817
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.440e+01 4.561e+00 -5.349 8.87e-08 ***
## I(danceability^7) 7.651e-01 1.143e-01 6.696 2.14e-11 ***
## I(energy^7) -1.126e+00 8.278e-02 -13.597 < 2e-16 ***
## loudness -2.051e-01 1.123e-01 -1.827 0.067745 .
## I(loudness^2) -7.165e-02 2.218e-02 -3.230 0.001238 **
## I(loudness^3) -6.722e-03 1.922e-03 -3.497 0.000471 ***
## I(loudness^4) -2.474e-04 7.262e-05 -3.407 0.000656 ***
## I(loudness^5) -3.104e-06 9.512e-07 -3.263 0.001101 **
## mode 1.565e-01 3.812e-02 4.105 4.04e-05 ***
## speechiness -6.734e-01 1.899e-01 -3.546 0.000392 ***
## acousticness 4.700e-01 6.434e-02 7.305 2.76e-13 ***
## instrumentalness -2.331e+01 3.749e+00 -6.217 5.07e-10 ***
## I(instrumentalness^2) 4.003e+02 8.726e+01 4.588 4.48e-06 ***
## I(instrumentalness^3) -3.004e+03 7.581e+02 -3.963 7.42e-05 ***
## I(instrumentalness^4) 1.161e+04 3.202e+03 3.626 0.000288 ***
## I(instrumentalness^5) -2.495e+04 7.299e+03 -3.418 0.000631 ***
## I(instrumentalness^6) 3.007e+04 9.182e+03 3.275 0.001058 **
## I(instrumentalness^7) -1.901e+04 5.999e+03 -3.168 0.001532 **
## I(instrumentalness^8) 4.905e+03 1.590e+03 3.086 0.002031 **
## liveness 1.243e+01 3.947e+00 3.148 0.001643 **
## I(liveness^2) -1.096e+02 3.527e+01 -3.106 0.001896 **
## I(liveness^3) 4.306e+02 1.424e+02 3.023 0.002501 **
## I(liveness^4) -8.338e+02 2.813e+02 -2.964 0.003035 **
## I(liveness^5) 7.719e+02 2.648e+02 2.915 0.003552 **
## I(liveness^6) -2.723e+02 9.488e+01 -2.870 0.004111 **
## valence 5.220e-01 2.437e-01 2.142 0.032206 *
## I(valence^2) -4.682e-01 2.294e-01 -2.041 0.041257 *
## I(tempo^5) 5.566e-09 1.551e-09 3.590 0.000331 ***
## I(tempo^6) -1.656e-10 4.575e-11 -3.620 0.000295 ***
## I(tempo^7) 1.968e-12 5.456e-13 3.606 0.000310 ***
## I(tempo^8) -1.162e-14 3.266e-15 -3.559 0.000372 ***
## I(tempo^9) 3.408e-17 9.771e-18 3.487 0.000488 ***
## I(tempo^10) -3.964e-20 1.166e-20 -3.399 0.000677 ***
## duration_ms -1.277e-05 2.273e-06 -5.619 1.92e-08 ***
## I(duration_ms^4) 2.084e-23 7.300e-24 2.855 0.004304 **
## I(log(duration_ms)) 2.098e+00 4.096e-01 5.121 3.04e-07 ***
## mode:speechiness -5.084e-01 2.488e-01 -2.043 0.041045 *
## mode:key_char_3 -1.713e-01 1.146e-01 -1.495 0.134892
## mode:key_char_7 -1.707e-01 5.058e-02 -3.376 0.000736 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37374 on 28343 degrees of freedom
## Residual deviance: 35903 on 28305 degrees of freedom
## AIC: 35981
##
## Number of Fisher Scoring iterations: 5
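The estimates in the summary above are on the log-odds scale. As a reminder, the standard form of a logistic regression model can be written as follows; the symbols p, x_j, beta_j, and k are notation introduced here for illustration and do not appear in the original analysis.

```latex
\[
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \sum_{j=1}^{k} \beta_j x_j,
\qquad
p = P(\text{bin\_track\_popularity} = 1 \mid x_1, \dots, x_k)
  = \frac{1}{1 + \exp\!\left(-\beta_0 - \sum_{j=1}^{k} \beta_j x_j\right)}.
\]
% For a covariate that enters the model through a single term, such as
% acousticness, a one-unit increase multiplies the odds p/(1-p) by exp(beta_j),
% holding the other terms fixed. Using the estimate reported above:
\[
\exp(0.4700) \approx 1.60 .
\]
```

This interpretation does not carry over directly to covariates that appear in several polynomial or interaction terms (for example, loudness or mode), because changing such a covariate changes several terms at once.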
In the preceding output, we see that the only p-value greater than 0.1 corresponds to mode:key_char_3. Because this interaction term does not appear to be statistically significant in our revised model, we exclude it. We again use logistic regression to create a new model that excludes mode:key_char_3, and we summarize this new model in the below output. The code directly below creates and summarizes this new model (called bin_model_2).
# Using the below code, we use logistic regression with mode:key_char_3 excluded.
bin_model_2 = glm(bin_track_popularity ~ I(danceability^7) + I(energy^7) + loudness + I(loudness^2) + I(loudness^3) + I(loudness^4) + I(loudness^5) + mode + speechiness + acousticness + instrumentalness + I(instrumentalness^2) + I(instrumentalness^3) + I(instrumentalness^4) + I(instrumentalness^5) + I(instrumentalness^6) + I(instrumentalness^7) + I(instrumentalness^8) + liveness + I(liveness^2) + I(liveness^3) + I(liveness^4) + I(liveness^5) + I(liveness^6) + valence + I(valence^2) + I(tempo^5) + I(tempo^6) + I(tempo^7) + I(tempo^8) + I(tempo^9) + I(tempo^10) + duration_ms + I(duration_ms^4) + I(log(duration_ms)) + mode:speechiness + mode:key_char_7, data = spotify_songs_6, family = "binomial")
# Using the below code, we summarize our new model.
summary(bin_model_2)
##
## Call:
## glm(formula = bin_track_popularity ~ I(danceability^7) + I(energy^7) +
## loudness + I(loudness^2) + I(loudness^3) + I(loudness^4) +
## I(loudness^5) + mode + speechiness + acousticness + instrumentalness +
## I(instrumentalness^2) + I(instrumentalness^3) + I(instrumentalness^4) +
## I(instrumentalness^5) + I(instrumentalness^6) + I(instrumentalness^7) +
## I(instrumentalness^8) + liveness + I(liveness^2) + I(liveness^3) +
## I(liveness^4) + I(liveness^5) + I(liveness^6) + valence +
## I(valence^2) + I(tempo^5) + I(tempo^6) + I(tempo^7) + I(tempo^8) +
## I(tempo^9) + I(tempo^10) + duration_ms + I(duration_ms^4) +
## I(log(duration_ms)) + mode:speechiness + mode:key_char_7,
## family = "binomial", data = spotify_songs_6)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5024 -0.9968 -0.7713 1.2591 2.3810
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.439e+01 4.561e+00 -5.349 8.87e-08 ***
## I(danceability^7) 7.680e-01 1.142e-01 6.723 1.78e-11 ***
## I(energy^7) -1.126e+00 8.278e-02 -13.601 < 2e-16 ***
## loudness -2.053e-01 1.123e-01 -1.829 0.067434 .
## I(loudness^2) -7.168e-02 2.219e-02 -3.230 0.001238 **
## I(loudness^3) -6.724e-03 1.923e-03 -3.496 0.000472 ***
## I(loudness^4) -2.476e-04 7.265e-05 -3.407 0.000656 ***
## I(loudness^5) -3.106e-06 9.518e-07 -3.264 0.001099 **
## mode 1.513e-01 3.797e-02 3.986 6.73e-05 ***
## speechiness -6.738e-01 1.899e-01 -3.548 0.000389 ***
## acousticness 4.674e-01 6.431e-02 7.269 3.63e-13 ***
## instrumentalness -2.323e+01 3.748e+00 -6.196 5.78e-10 ***
## I(instrumentalness^2) 3.990e+02 8.725e+01 4.573 4.80e-06 ***
## I(instrumentalness^3) -2.995e+03 7.580e+02 -3.951 7.79e-05 ***
## I(instrumentalness^4) 1.158e+04 3.202e+03 3.616 0.000300 ***
## I(instrumentalness^5) -2.487e+04 7.299e+03 -3.408 0.000655 ***
## I(instrumentalness^6) 2.997e+04 9.181e+03 3.265 0.001095 **
## I(instrumentalness^7) -1.895e+04 5.999e+03 -3.158 0.001586 **
## I(instrumentalness^8) 4.888e+03 1.589e+03 3.075 0.002102 **
## liveness 1.241e+01 3.947e+00 3.145 0.001660 **
## I(liveness^2) -1.094e+02 3.527e+01 -3.102 0.001922 **
## I(liveness^3) 4.299e+02 1.424e+02 3.018 0.002542 **
## I(liveness^4) -8.322e+02 2.813e+02 -2.959 0.003090 **
## I(liveness^5) 7.704e+02 2.648e+02 2.909 0.003621 **
## I(liveness^6) -2.716e+02 9.488e+01 -2.863 0.004195 **
## valence 5.214e-01 2.437e-01 2.139 0.032411 *
## I(valence^2) -4.670e-01 2.294e-01 -2.036 0.041766 *
## I(tempo^5) 5.599e-09 1.551e-09 3.610 0.000306 ***
## I(tempo^6) -1.666e-10 4.576e-11 -3.640 0.000272 ***
## I(tempo^7) 1.980e-12 5.458e-13 3.627 0.000286 ***
## I(tempo^8) -1.170e-14 3.267e-15 -3.580 0.000343 ***
## I(tempo^9) 3.429e-17 9.775e-18 3.508 0.000452 ***
## I(tempo^10) -3.989e-20 1.167e-20 -3.419 0.000629 ***
## duration_ms -1.277e-05 2.273e-06 -5.617 1.94e-08 ***
## I(duration_ms^4) 2.084e-23 7.301e-24 2.855 0.004308 **
## I(log(duration_ms)) 2.097e+00 4.096e-01 5.120 3.06e-07 ***
## mode:speechiness -5.001e-01 2.487e-01 -2.011 0.044366 *
## mode:key_char_7 -1.665e-01 5.050e-02 -3.298 0.000975 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37374 on 28343 degrees of freedom
## Residual deviance: 35905 on 28306 degrees of freedom
## AIC: 35981
##
## Number of Fisher Scoring iterations: 5
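As a complementary check on dropping mode:key_char_3, the two nested models can be compared with a likelihood ratio test. The sketch below is an illustration that was not part of the original analysis: it uses only the rounded residual deviances and degrees of freedom printed in the two summaries above, so the result is approximate, and the variable names are hypothetical.

```r
# Residual deviances and residual degrees of freedom, copied from the two
# model summaries (values are rounded as printed, so this is approximate).
dev_full    <- 35903; df_full    <- 28305  # bin_model (includes mode:key_char_3)
dev_reduced <- 35905; df_reduced <- 28306  # bin_model_2 (excludes it)

# Likelihood ratio statistic and its degrees of freedom.
lrt_stat <- dev_reduced - dev_full
lrt_df   <- df_reduced - df_full

# Upper-tail chi-squared p-value for the test that the dropped term is zero.
lrt_p <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)
lrt_p
```

The resulting p-value is well above 0.05, which agrees with the z-test for mode:key_char_3 in the first summary. With the fitted objects available, `anova(bin_model_2, bin_model, test = "Chisq")` performs the same comparison using the unrounded deviances.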
In the above output, we notice that the p-value corresponding to loudness is the only one that is greater than 0.05. However, since this p-value (about 0.067) is close to 0.05, we choose to leave loudness in our model, which also keeps the linear term alongside the higher-order loudness terms. Because all p-values in the above output are relatively small, we believe that all covariates used in the preceding model have statistically significant relationships with bin_track_popularity. We further examine the validity and prediction performance of the preceding model by calculating a confusion matrix and the misclassification error rate for the tracks in spotify_songs_6, classifying a track as 1 when its predicted probability exceeds 0.5. We display the confusion matrix in the below output. The rows of this confusion matrix are the actual values appearing in bin_track_popularity, and the columns are the values predicted by the preceding model. Each entry of the matrix is the number of tracks with the corresponding actual and predicted values. The code directly below creates this confusion matrix.
# Using the below code, we create a new variable called pred_prob.
pred_prob = predict(bin_model_2, spotify_songs_6, type = "response")
# Using the below code, we create a new variable called pred_value.
pred_value = 1*(pred_prob > 0.5)
# Using the below code, we create a new variable called actual_value.
actual_value = spotify_songs_6$bin_track_popularity
# Using the below code, we create a confusion matrix.
confusion_matrix = table(actual_value, pred_value)
# Using the below code, we display the confusion matrix.
confusion_matrix
## pred_value
## actual_value 0 1
## 0 16562 1277
## 1 8925 1580
The below output displays the misclassification error rate, which is calculated from the values in the preceding confusion matrix. The code directly below calculates and displays the misclassification error rate.
# Using the below code, we calculate the misclassification error rate.
misclassification_error_rate = 1 - sum(diag(confusion_matrix))/sum(confusion_matrix)
# Using the below code, we display the misclassification error rate.
misclassification_error_rate
## [1] 0.3599351
Based on the above output, we see that bin_model_2 misclassifies about 36% of the tracks in our data set. Although the model is not always accurate, it may be accurate enough to be useful to record labels in making contract decisions with new artists. When bin_model_2 predicts a value of 1, the record label may consider signing a contract with the new artist. However, when bin_model_2 predicts a value of 0, the record label should be hesitant about signing a contract with the artist.
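To put the misclassification error rate in context, the following sketch recomputes it from the counts in the confusion matrix reported above and compares it with a naive baseline that predicts 0 for every track. The baseline, sensitivity, specificity, and precision calculations are additions for illustration and were not part of the original analysis; all object names other than confusion_matrix are hypothetical.

```r
# Counts copied from the confusion matrix reported in the output above
# (rows = actual value, columns = predicted value).
confusion_matrix <- matrix(
  c(16562, 1277,
     8925, 1580),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual_value = c("0", "1"), pred_value = c("0", "1"))
)

total <- sum(confusion_matrix)

# Misclassification error rate of the model.
model_error <- 1 - sum(diag(confusion_matrix)) / total

# Error rate of a naive baseline that predicts 0 for every track:
# it misclassifies exactly the tracks whose actual value is 1.
baseline_error <- sum(confusion_matrix["1", ]) / total

# Sensitivity: share of actually popular tracks that the model predicts as 1.
sensitivity <- confusion_matrix["1", "1"] / sum(confusion_matrix["1", ])

# Specificity: share of actually unpopular tracks that the model predicts as 0.
specificity <- confusion_matrix["0", "0"] / sum(confusion_matrix["0", ])

# Precision: share of tracks predicted as 1 that are actually popular.
precision <- confusion_matrix["1", "1"] / sum(confusion_matrix[, "1"])

round(c(model_error = model_error, baseline_error = baseline_error,
        sensitivity = sensitivity, specificity = specificity,
        precision = precision), 3)
```

With these counts, the model's error rate (about 36.0%) is only slightly below the baseline's (about 37.1%). The model predicts 1 for about 15% of the actually popular tracks, and about 55% of its predicted 1s are actually popular, so a predicted value of 0 by itself says little about whether a track will become popular.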
In establishing contracts with new artists, there is risk involved for record labels. If the tracks produced by the new artists do not become popular, the record label may make little or no profit, and it may also lose the time and resources invested in the unsuccessful artists. The objective of this study was to use data analysis to help mitigate the risk that record labels face in signing new artists.
Because of the great risk involved in establishing contracts with new artists, we performed data analysis to find relationships that can aid record labels in such contract decisions. In particular, we analyzed a data set containing information about Spotify tracks, obtained from the following website:
https://www.dropbox.com/sh/qj0ueimxot3ltbf/AACzMOHv7sZCJsj3ErjtOG7ya?dl=1
The data set contains the track id, track name, track artist, track popularity, track album id, track album name, track album release date, playlist name, playlist id, playlist genre, playlist subgenre, and various musical characteristics for 32833 Spotify tracks. In analyzing this data set, we created scatter plots, tables, histograms, frequency graphs, and other graphical displays to find relationships between track popularity and various musical characteristics. We also uncovered new information from the data set by creating new variables, splitting the data, and grouping the data. Record labels can use these relationships and new insights to inform their contract decisions. To further help record labels develop strategic contracts with new artists, we used regression analysis (in particular logistic regression) to create a model that predicts whether a track is likely to become popular.
By examining tables and graphical displays, we found multiple relationships between track popularity and various musical characteristics. In particular, we determined that the most popular tracks tend to be very danceable, tend to have a high value for loudness, tend to contain many vocals (i.e., are not purely instrumental), are not likely to have been performed live, tend to have a duration of around 200,000 milliseconds (about 3 minutes and 20 seconds), and tend to not have many spoken words. Further, we found that songs with a popularity of 100 tend to be written in mode 0. We also found that the highest levels of popularity occur at neither extremely high nor extremely low values for energy, indicating a non-linear relationship between energy and popularity. Similarly, incredibly popular tracks tend to have neither extremely large nor extremely small values for valence, indicating a non-linear relationship between valence and popularity. Through similar reasoning, we also found a non-linear relationship between tempo and popularity, wherein incredibly popular songs tend to have neither extremely fast nor extremely slow tempos. We also found interaction between the variables mode and speechiness, and between the variables mode and key. Finally, we used logistic regression to develop a model to aid record labels in their contract decisions. The model predicts a value of 1 when a track is likely to become popular, and a value of 0 when the track is not likely to become popular. We found that the model misclassified about 36% of the tracks in our data set.
Record labels can examine the musical characteristics of new artists’ tracks and, based on the insights and relationships determined in this report, identify tracks that have a good chance of becoming popular. This identification can then inform their contract decisions with new artists. In addition, we developed a predictive model using logistic regression, which record labels can also use in their contract decisions. The model makes a binary prediction. When the model predicts a value of 1, the record label can expect that a track has a good chance of becoming popular, and when the model predicts a value of 0, the record label should be hesitant about expecting the track to become popular. By using the model to make predictions regarding the popularity of tracks from new artists, record labels can get a sense of the number of tracks that have a good chance of becoming popular, and they can use this information in selecting new artists to receive contract offers.
A major limitation of our research is that our predictive model misclassified about 36% of the tracks in our data set. As a consequence, although our model can provide recommendations, it should not be the only method used by record labels to make contract decisions. Despite this limitation, our model may be accurate enough to be used in conjunction with expert opinion and other decision methods, as a support system to aid record labels in developing strategic contract decisions. The predictive accuracy of our model could be improved by considering additional variables, and further advice could be provided to record labels by determining contract terms that are profitable for both the label and the artist. Additionally, machine learning methods other than regression analysis could be employed to analyze the data, with an eye towards developing a more accurate model. There are multiple avenues for further research related to this topic, and our predictive model provides a foundation to inform future development in this area. The insights and relationships identified in this report can also spur further research and investigation.
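Because the 36% figure was computed on the same tracks used to fit the model, one natural follow-up is to estimate the error rate on held-out data. The sketch below illustrates the idea on simulated data only: the data frame sim_data, the covariates x1 and x2, the response y, and the 80/20 split are all hypothetical stand-ins and not part of the original analysis.

```r
set.seed(1)  # for reproducibility of the simulated data and the split

# Simulated stand-in data: x1 and x2 are hypothetical covariates, and y is a
# hypothetical binary response playing the role of bin_track_popularity.
n <- 2000
sim_data <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim_data$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 0.8 * sim_data$x1))

# Randomly hold out 20% of the rows for testing.
test_rows  <- sample(n, size = round(0.2 * n))
train_data <- sim_data[-test_rows, ]
test_data  <- sim_data[test_rows, ]

# Fit the logistic regression on the training rows only.
holdout_model <- glm(y ~ x1 + x2, data = train_data, family = "binomial")

# Classify the held-out rows using the same 0.5 threshold as in the report.
test_prob  <- predict(holdout_model, newdata = test_data, type = "response")
test_value <- 1 * (test_prob > 0.5)

# Out-of-sample misclassification error rate.
holdout_error <- mean(test_value != test_data$y)
holdout_error
```

Applying the same pattern to spotify_songs_6 would indicate how well the reported error rate carries over to tracks the model has not seen, which is the situation a record label faces with a new artist.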