Reviewers

After this mid-term is submitted, Sharanya Amaravadhi and Emily Thie have agreed to review it. In return, I have also agreed to review the mid-terms submitted by Sharanya Amaravadhi and Emily Thie.

Introduction

The introduction to this report contains four subsections, each of which can be accessed by clicking on the four tabs directly below.

Section 1.1: Explanation of Problem Statement

To sell as many albums as possible, record labels need to sign contracts with new performers who will create popular songs and recordings. To help record labels strategically select new artists with strong potential for developing popular material, we examine possible relationships between the popularity of recordings and their musical characteristics. We then present our findings to give record labels an understanding of the musical characteristics inherent in recordings that tend to become popular.

Section 1.2: Data and Methodology Used to Address the Problem Statement

To address the problem statement, we analyze data from Spotify that was originally collected using the spotifyr package. We plan to use regression, residual plots, and graphical displays (e.g., scatter plots and histograms) to investigate the influence of various musical characteristics on the popularity of a recording. In particular, we determine whether genre, sub-genre, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, and/or duration of tracks are statistically significant in predicting the popularity of the track. We hypothesize that these musical characteristics alone may not be enough to establish a highly accurate model for the prediction of track popularity. However, the analysis will still provide insights into potentially statistically significant relationships between musical characteristics and track popularity, and record labels can use these findings to inform their performer contract decisions. Because we are trying to provide record labels with information regarding the potential popularity of track recordings for new performers with whom they are considering initial contracts, the artist name, track id, track name, album name, album id, playlist name, playlist id, and album release date do not play a substantial role in our analysis, as new performers may not necessarily already have produced albums and/or songs appearing in Spotify playlists. However, if the musical characteristics alone are insufficient for developing a highly predictive model, we will explore the potential benefit of incorporating these other variables in our regression model.

Section 1.3: Proposed Approach and Analytic Techniques

By creating graphical displays of the data set (such as scatter plots and histograms), we visualize any potential relationships between musical characteristics and track popularity. If relationships appear to be nonlinear, we transform the data appropriately prior to utilizing regression. By performing regression analysis and using residual plots, we determine whether certain musical characteristics are statistically significant for predicting track popularity. After performing regression analysis, we record the sign (i.e., positive or negative) of the coefficients corresponding to statistically significant covariates, and we use this sign to inform our understanding of general relationships between the covariates (i.e., musical characteristics) and the dependent variable (i.e., track popularity). We also use unit normal scaling to compare the strength of influence of the various musical characteristics on track popularity. We also use the Bayesian information criterion and adjusted R-squared to determine the musical characteristics which are best-suited for inclusion in our regression model. After developing our regression model, we examine the adequacy of the model using residual plots.
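As a rough illustration of the workflow described in this subsection, the following minimal sketch uses only base R. The data frame toy is simulated and hypothetical (only its column names mirror the Spotify variables), and the two candidate models are arbitrary choices made for illustration, so this is not the model developed in the final report.

```r
# Hypothetical toy data: column names mirror the Spotify variables,
# but all values are simulated purely for illustration.
set.seed(1)
n <- 200
toy <- data.frame(
  danceability = runif(n),
  energy       = runif(n),
  loudness     = runif(n, min = -60, max = 0)
)
toy$track_popularity <- 40 + 20 * toy$danceability - 10 * toy$energy +
  rnorm(n, sd = 5)

# Unit normal scaling: center each variable at 0 with standard deviation 1,
# so that coefficient magnitudes are comparable across covariates.
toy_scaled <- as.data.frame(scale(toy))

# Two candidate models to compare (an arbitrary illustrative choice).
fit_small <- lm(track_popularity ~ danceability + energy, data = toy_scaled)
fit_full  <- lm(track_popularity ~ danceability + energy + loudness,
                data = toy_scaled)

# The sign and size of the standardized coefficients.
coef(fit_full)

# Model comparison: lower BIC and higher adjusted R-squared are preferred.
c(small = BIC(fit_small), full = BIC(fit_full))
c(small = summary(fit_small)$adj.r.squared,
  full  = summary(fit_full)$adj.r.squared)

# Residuals versus fitted values, drawn only in an interactive session.
if (interactive()) {
  plot(fitted(fit_small), resid(fit_small),
       xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)
}
```

In this simulated example, danceability was given a positive effect and energy a negative effect, so the fitted coefficients should carry those signs; with the real data, the signs are of course unknown in advance.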

Section 1.4: Benefits and Significance of Proposed Work

This report is written primarily for record labels, who can use the results of this analysis in the strategic planning of new performer contracts. By identifying the musical characteristics that are most significant in determining track popularity, we give record labels an understanding of the characteristics that are commonly inherent in popular tracks, and they can use this understanding to inform their contract decisions.

Packages Required

Click on each of the three tabs directly below to view the subsections of this portion of the report.

Section 2.1: Loading of Required Packages

To replicate the midterm proposal, the package tidyverse is required. The following two packages will be used in the final project report:

tidyverse
leaps

By clicking on the code button shown directly below, you can see the code for loading these two packages.

# Below, the package tidyverse is loaded. If the package is not yet installed, first run install.packages("tidyverse").
library(tidyverse)

# Below, the package leaps is loaded. If the package is not yet installed, first run install.packages("leaps").
library(leaps)

Section 2.2: Suppressing of Messages and Warnings for Loading Packages

Note that while loading the packages in Section 2.1, message and warning were both set to FALSE. This suppressed the messages and warnings resulting from loading the two packages. Also, echo was set to TRUE in order to ensure that the reader is able to view the R code for loading the required packages.

Section 2.3: Purpose of Each Package

Here, we explain the purpose of using each package in our data analysis. The package leaps will help us to identify the best subset of variables in our data set to use in developing our regression model (e.g., subsets can be chosen using this package based on adjusted R-squared). In loading the package tidyverse, other packages are automatically loaded that will be helpful in our analysis. When we load the package tidyverse, the package ggplot2 is automatically loaded. The package ggplot2 will allow us to create visualizations of our data (i.e., graphs and plots). Two other packages that are automatically loaded with tidyverse are dplyr and tidyr. We will use dplyr to manipulate our data set, and we will use tidyr to tidy our data.

Data Preparation

The data preparation portion of the report contains six subsections, each of which can be viewed by clicking on the six tabs displayed directly below.

Section 3.1: Original Source of Data

In Section 1.2, we mentioned that we will analyze data from Spotify that was originally obtained using the spotifyr package. We downloaded this data at the following link: https://www.dropbox.com/sh/qj0ueimxot3ltbf/AACzMOHv7sZCJsj3ErjtOG7ya?dl=1

Section 3.2: Data Explanation

We will analyze data from Spotify that was originally obtained using the spotifyr package. This package was authored by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff in order to make it easier for individuals to obtain data from Spotify. The data set that we analyze appeared as the 01/21/2020 tidytuesday data set in the rfordatascience GitHub organization, where a data dictionary for the 23 variables in the original data set is provided (note that the data set contains 32833 observations, each of which corresponds to a Spotify track). We also make note of some peculiarities in the data set. For the variable key, a value of -1 is recorded whenever no musical key is detected. This may occur when multiple keys are used throughout the piece, preventing a single standard key from being detected. The mode variable only allows for the entry of two possible modes (i.e., major or minor). Although these are the two most prevalent modes in today’s music, other modes do exist, and hence this variable is not defined in a manner that allows for the entry of all possible modes. As a consequence, we will need to explicitly analyze and consider potential missing values for the variable mode, as they may indicate that a mode other than major or minor was used. Another possibility is that missing values for the variable mode may indicate that multiple modes were used throughout the musical piece.

The data dictionary for the original data set can be obtained at the following link: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md

For convenience, we provide a description of each of the 23 variables in the original data set here as well:

Data Dictionary

track_id - the unique ID for a song
track_name - the name of the song
track_artist - the name of the performing artist
track_popularity - the popularity of a song based on an integer scale from 0 to 100, with 100 being the most popular
track_album_id - the unique ID for the album
track_album_name - the name of the album
track_album_release_date - the date when the album was released
playlist_name - the name of the playlist
playlist_id - the unique ID for the playlist
playlist_genre - the genre of the playlist
playlist_subgenre - the sub-genre of the playlist
danceability - a measure of the suitability of a song for dancing based on a scale from 0 to 1, with 1 being the most dance-able
energy - a measure of the intensity of a song on a scale of 0 to 1, with 1 being the most energetic
key - the overall key of the song represented as an integer; here, 0 is the key of C (also called B#), 1 is C# (also referred to as D-flat), 2 is D, 3 is D# (also called E-flat), 4 is E, 5 is F (also called E#), 6 is F# (also called G-flat), 7 is G, 8 is G# (also called A-flat), 9 is A, 10 is A# (also called B-flat), 11 is B (also called C-flat); -1 is used whenever an overall key cannot be detected
loudness - a measure of the loudness of a song in decibels (loudness is generally between -60 and zero decibels)
mode - the value 1 is used to represent pieces that are written in a major key, and the value 0 is used to represent pieces that are written in a minor key; we note here that it is possible for songs to be written in modes that are neither major nor minor, but that is not prevalent in today’s popular music
speechiness - a measure of the use of spoken words on a scale of 0 to 1, with 1 being tracks with the most spoken words
acousticness - a measure of the confidence that a track is acoustic, with 1 used to indicate tracks that are most likely to be acoustic
instrumentalness - a measure of the prediction that instruments are used with no vocals, with 1 representing tracks that are predicted to use the least amount of vocals
liveness - a measure of the prediction that the track was recorded live in front of an audience, with 1 representing tracks that are most likely to have been recorded live
valence - a measure of the positiveness of a recording on a scale from 0 to 1, with 1 representing the tracks that are most positive
tempo - the speed of a track recorded in beats per minute
duration_ms - the length of the track in milliseconds

Based on the descriptions provided in the above data dictionary and also the values that tend to appear in the data set, the type of the variables track_id, track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id, playlist_genre, and playlist_subgenre should all be character. The type of the variables track_popularity, key, mode, and duration_ms should be integer. The type of the variables danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, and tempo should be double. We will check for these variable types in Section 3.3 as we clean the data.
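The expected types listed above can also be checked programmatically. The following minimal sketch assumes a hypothetical two-row data frame named toy that contains only four of the 23 columns; the vector expected is likewise an illustrative name and not part of the original analysis.

```r
# Hypothetical two-row excerpt with four of the 23 columns.
toy <- data.frame(
  playlist_genre   = c("pop", "pop"),
  track_popularity = c(66L, 67L),
  key              = c(6L, 11L),
  danceability     = c(0.748, 0.726),
  stringsAsFactors = FALSE
)

# The storage type that we expect for each column.
expected <- c(
  playlist_genre   = "character",
  track_popularity = "integer",
  key              = "integer",
  danceability     = "double"
)

# The storage type that R actually assigned to each column.
actual <- vapply(toy, typeof, character(1))

# Names of any columns whose type differs from the expectation;
# an empty result means that no type needs to be changed.
mismatched <- names(expected)[actual[names(expected)] != expected]
mismatched
```

The same comparison could be applied to the full data set by extending expected to all 23 variable names.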

Also, the 01/21/2020 tidytuesday issue mentions that Kaylin Pavlik used the spotifyr package to collect data from Spotify in order to design a model for predicting the genre of specific pieces of music.

How to Continue to Section 3.3

Now that you’ve finished reading Section 3.2, you can continue to Section 3.3 by clicking the tab for Section 3.3 displayed earlier in the report.

Section 3.3: Data Importing and Cleaning

We begin this section by importing the data set described in Section 3.2. To do this, we first set our working directory to the folder location containing the data set. We then import the data set, and we name the data set spotify_songs. We also begin to understand the data by viewing the first rows of the data set, which are displayed below. The code for this can be seen by clicking on the code button directly below.

# Using the below code, we set our working directory to the folder location containing the Spotify data.
setwd("C:/Users/richa/Dropbox/My PC (DESKTOP-B9LT0L1)/Documents/Data Wrangling/Possible Data Sets/spotify")

# Using the below code, we import the Spotify data, and we name the data set spotify_songs.
spotify_songs <- read.csv("spotify_songs.csv")

# We view the first rows of the Spotify data using the following code.
head(spotify_songs)
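Because setwd() relies on a folder path that exists only on one computer, readers who want to replicate the import may prefer to pass a full file path to read.csv(). The following minimal sketch illustrates this idea under stated assumptions: the temporary folder, the file name toy_songs.csv, and its three rows are all hypothetical stand-ins for the real Spotify file.

```r
# A temporary folder stands in for the folder that holds the data (hypothetical).
data_dir <- tempdir()
csv_path <- file.path(data_dir, "toy_songs.csv")

# Write a tiny stand-in file so that this sketch is self-contained.
writeLines(
  c("track_id,track_name,track_popularity",
    "a1,Song A,66",
    "b2,NA,67"),
  csv_path
)

# Import by full path, so that the working directory does not matter.
toy_songs <- read.csv(csv_path, stringsAsFactors = FALSE)

# View the first rows; by default, read.csv() treats the text NA as missing.
head(toy_songs)
```

With this approach, only the value of data_dir needs to change when the analysis is run on a different computer.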

Structure of the Data, Variable Names, and Variable Types

We also gain an understanding of the data by examining its structure. Output about the data structure obtained from R is shown below, along with the code required to obtain the output. Looking at this output, we see that the data set contains 32833 observations of Spotify tracks and 23 variables. We also see that all variables are properly named in snake_case. Further, as desired, we see that the variables track_id, track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id, playlist_genre, and playlist_subgenre are all of type character. Because the variables track_popularity, key, mode, and duration_ms all contain integer values, we must check that they are appropriately assigned an integer type, and the output shows that they are. The variables danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, and tempo all contain decimal numbers, and hence they should be assigned the numeric type double. The output shows that these variables are indeed assigned a numeric type (displayed as num). Because all variables are of the appropriate type, we need not change any of the variable types in our data cleaning.

# Using the below code, we examine the structure of the data set spotify_songs. This is helpful in gaining a better initial understanding of the data, and it also allows us to check if the variable names and type are appropriate. By examining the output, we find that they are indeed appropriate and do not need to be changed.
str(spotify_songs)

## 'data.frame':    32833 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...

Missing Values

We next examine and handle the missing values in the data. We find that most of the variables do not have any missing values. In fact, the only variables with missing values are track_name, track_artist, and track_album_name. Each of these three variables has five missing values. The below code and output display the number of missing values per variable.

# The below code is used to calculate the number of missing values in each variable.
colSums(is.na(spotify_songs))

##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

The output below indicates the five observations with missing values for the variable track_name. The required code to produce this output can also be seen by clicking on the code button directly below.

# The below code is used to determine the five observations containing missing values for the variable track_name.
which(is.na(spotify_songs$track_name))

## [1]  8152  9283  9284 19569 19812

The output below indicates the five observations with missing values for the variable track_artist. The required code to produce this output can also be seen by clicking on the code button directly below.

# The below code is used to determine the five observations containing missing values for the variable track_artist.
which(is.na(spotify_songs$track_artist))

## [1]  8152  9283  9284 19569 19812

The output below indicates the five observations with missing values for the variable track_album_name. The required code to produce this output can also be seen by clicking on the code button directly below.

# The below code is used to determine the five observations containing missing values for the variable track_album_name.
which(is.na(spotify_songs$track_album_name))

## [1]  8152  9283  9284 19569 19812

Note that the preceding three pieces of output are identical. That is, observations 8152, 9283, 9284, 19569, and 19812 contain the missing values for the variables track_name, track_artist, and track_album_name. Five observations is only a very small proportion of the total 32,833 observations, and so we choose to simply remove these five observations using the following code:

# Using the below code, we remove the five observations that contain missing values.
spotify_songs <- spotify_songs[-c(8152, 9283, 9284, 19569, 19812),]
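Removing rows by hard-coded row numbers works for this particular file, but the row numbers would need to be updated by hand if the data changed. As an alternative, the following minimal sketch uses the base R function complete.cases() to locate and drop incomplete rows; the data frame toy and its values are hypothetical.

```r
# Hypothetical toy data in which rows 2 and 4 are missing the track name.
toy <- data.frame(
  track_id         = c("a1", "b2", "c3", "d4"),
  track_name       = c("Song A", NA, "Song C", NA),
  track_popularity = c(66L, 10L, 70L, 0L),
  stringsAsFactors = FALSE
)

# Row numbers that contain at least one missing value.
rows_with_na <- which(!complete.cases(toy))
rows_with_na

# Keep only the rows with no missing values in any column.
toy_clean <- toy[complete.cases(toy), ]

# Confirm that no missing values remain.
colSums(is.na(toy_clean))
```

Note that complete.cases() drops a row when any column is missing, which matches the situation here because the same five rows are missing all three variables.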

The output below indicates the number of missing values that each variable now contains. We see that we have now successfully removed all missing values. You can view the code that was used to produce this output by clicking on the code button directly below.

# Using the below code, we check that there are now zero missing values in each variable.
colSums(is.na(spotify_songs))

##                 track_id               track_name             track_artist 
##                        0                        0                        0 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        0 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

Removing Observations So That Each Track Occurs Only Once

We note that the purpose of this study is to understand the musical characteristics that tend to most often occur in tracks which become popular. In our attempt to develop a regression model to see if it is possible to predict track popularity, it is important that each track should occur no more than once in our data set. Having the same track occur multiple times could skew our results. Hence we need to examine the number of unique track id’s occurring in the data set, and we need to ensure that this number is equivalent to the total number of observations. The below output indicates that there are 28352 unique track id’s in the data set. You can view the code that was used to produce this output by clicking on the code button directly below.

# Using the below code, we determine the number of unique track id's occurring in the data set.
length(unique(spotify_songs$track_id))

## [1] 28352

We compare the number of unique track id’s to the total number of unique observations in the data set. The below output indicates that there are 32828 unique observations in the data set. By clicking on the code button directly below, you can view the code that was used to create this output.

# Using the below code, we determine the number of unique observations in the data set.
nrow(unique(spotify_songs))

## [1] 32828

We also note that the total number of unique observations in the data set is actually equal to the total number of observations, meaning that no row is an exact duplicate of another. Using the below code and output, we see that there are a total of 32828 observations in the data set.

# Using the below code, we determine the total number of observations in the data set.
nrow(spotify_songs)

## [1] 32828

Because there are more unique observations than unique track id’s, we need to remove some of the rows in the data to ensure that each track occurs only once. After examining the data set, we find that there are multiple playlist sub-genres assigned to the same track_id. Because track_id cannot be uniquely assigned a specific sub-genre to use as a covariate in our regression model, we exclude this variable in the development of our regression model. We create a new data set called spotify_songs_2 which stores all of the variables in spotify_songs except for playlist_subgenre. The below output indicates that the number of unique observations contained in spotify_songs_2 is only 32505. By clicking on the code button directly below, you can view the code used to create the spotify_songs_2 data set and to find the number of unique observations in the data set.

# Using the below code, we create a new data set called spotify_songs_2 which stores all of the variables in spotify_songs except for playlist_subgenre.
spotify_songs_2 <- select(spotify_songs, -c(playlist_subgenre))

# Using the below code, we find the number of unique observations in the new data set spotify_songs_2.
nrow(unique(spotify_songs_2))

## [1] 32505

Because the number of unique observations in spotify_songs_2 is larger than 28352 (the number of unique track id’s that we found earlier), we need to further alter the data set to ensure that each track id occurs only once. Upon examining the data set further, we see that each track id can be assigned multiple playlist genres. As a result, we need to remove the variable playlist_genre for the same reason that we removed playlist_subgenre. After removing the variable playlist_genre from spotify_songs_2, we produce the following output displaying the number of unique observations in the revised data set. We see that the number of unique observations has now reduced to 32246. By clicking on the code button directly below, you can view the code that was used to modify spotify_songs_2 and count the number of unique observations in the data set.

# Using the below code, we remove the variable playlist_genre from the data set spotify_songs_2.
spotify_songs_2 <- select(spotify_songs_2, -c(playlist_genre))

# Using the below code, we find the number of unique observations in spotify_songs_2.
nrow(unique(spotify_songs_2))

## [1] 32246

Because the number of unique observations in the revised data set is still greater than 28352, we need to still further revise the data to ensure that each track id occurs only once. Upon further examination of the data, we find that each track can occur in multiple playlists. As a consequence, playlist_name and playlist_id are not uniquely-valued potential covariates. Hence we remove playlist_name and playlist_id from the data set spotify_songs_2, and after removing these variables, we produce the output below to see that there are now 28352 unique observations in the data set. The code and output for performing the actions described in this paragraph are shown directly below.

# Using the below code, we remove the variables playlist_name and playlist_id from the data set spotify_songs_2.
spotify_songs_2 <- select(spotify_songs_2, -c(playlist_name, playlist_id))

# Using the below code, we determine the number of unique observations in the revised spotify_songs_2.
nrow(unique(spotify_songs_2))

## [1] 28352

We then revise spotify_songs_2 so that it contains only unique observations, ensuring that each track id occurs only once (for a total of 28352 unique observations). We then produce the output displayed below to double-check that there are now a total of 28352 observations in the revised data set. The code and output for the actions described in this paragraph can be found directly below.

# Using the below code, we revise spotify_songs_2 so that it contains only unique observations, ensuring that each track id occurs only once in the data set.
spotify_songs_2 <- unique(spotify_songs_2)

# Using the below code, we determine the total number of observations in spotify_songs_2.
nrow(spotify_songs_2)

## [1] 28352

The below output indicates that there are 28352 unique track id’s in our revised data set, verifying that we have successfully created a data set containing each track only once. You can view the code that was used to create this output by clicking on the code button directly below.

# Using the below code, we find the number of unique track id's in spotify_songs_2.
length(unique(spotify_songs_2$track_id))

## [1] 28352
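In total, the steps above reduce the data from 32828 rows to 28352 rows, so 32828 - 28352 = 4476 rows are dropped. A related but different technique is to keep the first row for each track id using the base R function duplicated(). The following minimal sketch illustrates that alternative on a hypothetical data frame named toy; it is not the procedure used above, which removes the playlist-level columns and then calls unique().

```r
# Hypothetical toy data: track a1 appears in two playlists with different
# genres, and track c3 appears twice in playlists of the same genre.
toy <- data.frame(
  track_id       = c("a1", "a1", "b2", "c3", "c3"),
  playlist_genre = c("pop", "edm", "rock", "rap", "rap"),
  energy         = c(0.9, 0.9, 0.5, 0.7, 0.7),
  stringsAsFactors = FALSE
)

# Number of rows recorded for each track id.
table(toy$track_id)

# Keep only the first row encountered for each track id.
toy_one_per_track <- toy[!duplicated(toy$track_id), ]

# Drop the playlist-level column, since it is not unique to a track.
toy_one_per_track$playlist_genre <- NULL

# Each track id should now occur exactly once.
nrow(toy_one_per_track) == length(unique(toy_one_per_track$track_id))
```

One design consideration is that duplicated() silently keeps whichever row comes first, whereas the approach used above makes it explicit which variables prevent a track from being represented by a single row.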

Numerical Summaries, Visual Summaries, and Outliers

As mentioned in the introduction, we are trying to help record labels make informed strategic decisions on developing contracts with new performers. Because new performers may not already have created albums or tracks appearing in playlists on Spotify, the variables track_id, track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name, and playlist_id will not play a large role in our analysis. These variables cannot necessarily be used by record labels to predict the popularity of recordings by new performers. Because we do not plan to utilize these variables in our recommendations for record labels, we do not worry about further cleaning of these variables. We also remind the reader that we have removed the variables playlist_genre and playlist_subgenre from consideration, as tracks are not always assigned a unique genre and subgenre, and hence inclusion of these two variables would prevent us from having each track occur only once in our data set. The below output provides a numerical summary of the values for track_popularity. Using the below code and output, we see that the values for track_popularity range from 0 to 100 as indicated in the data dictionary. Although the maximum value of 100 is much larger than the third quartile of 58, this may simply indicate that most tracks do not become extremely popular, and so we do not remove the observations having high popularity.

# Using the below code, we see that the values for track_popularity appropriately range from 0 to 100, as indicated in the data dictionary.
summary(spotify_songs_2$track_popularity)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   21.00   42.00   39.34   58.00  100.00
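One common convention for judging whether an extreme value is unusually far from the bulk of the data is the 1.5 times interquartile range rule (Tukey's fences). This rule is not used elsewhere in this report; the following minimal sketch simply applies it to the quartiles printed in the summary above, with variable names chosen for illustration.

```r
# Quartiles, minimum, and maximum copied from the summary of track_popularity.
q1 <- 21
q3 <- 58
observed_min <- 0
observed_max <- 100

# Interquartile range and the conventional 1.5 * IQR fences.
iqr         <- q3 - q1          # 37
lower_fence <- q1 - 1.5 * iqr   # -34.5
upper_fence <- q3 + 1.5 * iqr   # 113.5

# TRUE would mean that the extreme value lies beyond its fence.
flags <- c(
  min_flagged = observed_min < lower_fence,
  max_flagged = observed_max > upper_fence
)
flags
```

Under this convention, neither the minimum of 0 nor the maximum of 100 falls outside the fences, which is consistent with the decision to keep the highly popular tracks.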

The below output provides a numerical summary of the variable danceability. According to the data dictionary, this variable should be measured on a scale between 0 and 1, and we see that the values for danceability all fall appropriately within this scale. We note that the minimum value of 0 is rather far away from the first quartile of 0.561. However, this may occur simply because most tracks are at least somewhat dance-able. Because we want to determine if tracks with low danceability tend to be less popular, we do not remove tracks with low danceability from our data set. By clicking on the code button directly below, you can see the code that was used to produce the following output.

# Using the below code, we create a numerical summary for the values of the variable danceability.
summary(spotify_songs_2$danceability)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.5610  0.6700  0.6534  0.7600  0.9830

The below output provides a numerical summary of the variable energy. According to the data dictionary, this variable should also be measured on a scale between 0 and 1, and we see that the values for energy all fall appropriately within this scale. We note that the minimum value of 0.000175 is rather far away from the first quartile of 0.579. However, this may occur simply because most tracks are at least somewhat energetic. Because we want to determine if tracks which are less energetic tend to be less popular, we do not remove tracks with low energy from our data set. By clicking on the code button directly below, you can see the code that was used to produce the following output.

# Using the below code, we create a numerical summary for the values of the variable energy.
summary(spotify_songs_2$energy)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000175 0.579000 0.722000 0.698372 0.843000 1.000000

The below output displays the number of observations in our data having each value for key. We notice that there do not appear to be any outliers for the variable key. We also notice that because -1 does not occur anywhere in the table, an overall key was identified for each piece of music in our data set. We also notice that the values for key are all integers between 0 and 11, and these values align appropriately with our expectations based on the definition of key provided in the data dictionary. By clicking on the code button directly below, you can see the code that was used to produce the following output.

# Using the below code, we create a table displaying the number of observations for each value of key.
table(spotify_songs_2$key)

## 
##    0    1    2    3    4    5    6    7    8    9   10   11 
## 3001 3436 2478  797 1925 2301 2261 2907 2066 2631 1972 2577
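The visual check of the table above can also be expressed as an explicit test. The following is a minimal sketch on a made-up vector (key_values is a hypothetical stand-in for spotify_songs_2$key), assuming, as described above, that -1 is the code used when no key was detected.

```r
# Toy vector standing in for spotify_songs_2$key (values are made up).
key_values <- c(0L, 5L, 11L, 7L, 2L, 9L)

# TRUE only if every value is one of the integers 0, 1, ..., 11.
keys_in_range <- all(key_values %in% 0:11)

# Number of tracks for which no key was detected (coded as -1).
n_missing_key <- sum(key_values == -1)

keys_in_range
n_missing_key
```

A check of this kind is convenient because it fails loudly if the data are later refreshed and an unexpected value appears.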

The below output displays the number of observations in our data having each value for mode. We notice that each piece of music is classified as either major or minor, and hence we do not need to worry about the fact that other modes do exist. We also notice that, as we would expect based on the definition provided in the data dictionary, the variable mode only has the values 0 and 1. Additionally, there do not appear to be any outliers for the variable mode. By clicking on the code button below, you can see the code that was used to produce the following output.

# Using the below code, we create a table displaying the number of observations for each value of mode.
table(spotify_songs_2$mode)

## 
##     0     1 
## 12318 16034
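To make the balance between the two modes easier to read, the counts above can be converted to proportions. The sketch below copies the two counts by hand from the table output; the labeling of 0 as minor and 1 as major follows the Spotify convention and should be confirmed against the data dictionary.

```r
# Counts copied from the table() output above (0 = minor, 1 = major).
mode_counts <- c("0" = 12318, "1" = 16034)

# Convert the counts to proportions of all tracks.
mode_props <- prop.table(mode_counts)
round(mode_props, 3)
```

Roughly 57 percent of the tracks are in a major mode and roughly 43 percent are in a minor mode, so neither category is rare.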

The below output provides a numerical summary of the variable speechiness. According to the data dictionary, this variable should be measured on a scale between 0 and 1, and we see that the values for speechiness all fall appropriately within this scale. We note that the maximum value of 0.918 is rather far away from the 3rd quartile of 0.133. However, this may occur simply because most tracks do not contain many spoken words. Because we want to determine if tracks with many spoken words tend to be less popular, we do not remove tracks with large speechiness from our data set. By clicking on the code button directly below, you can see the code that was used to produce the following output.

# Using the below code, we produce a numerical summary of the variable speechiness.
summary(spotify_songs_2$speechiness)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0410  0.0626  0.1079  0.1330  0.9180

The below output provides a numerical summary of the variable acousticness. According to the data dictionary, this variable should be measured on a scale between 0 and 1, and we see that the values for acousticness all fall appropriately within this scale. We note that the maximum value of 0.994 is rather far away from the 3rd quartile of 0.26. However, this may occur simply because most tracks do not tend to be extremely acoustic. Because we want to determine if tracks that are mostly acoustic tend to be less popular, we do not remove tracks with large values for acousticness from our data set. By clicking on the code button directly below, you can see the code that was used to produce the following output.

# Using the below code, we produce a numerical summary of the variable acousticness.
summary(spotify_songs_2$acousticness)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0143  0.0797  0.1772  0.2600  0.9940

The below output provides a numerical summary of the variable instrumentalness. According to the data dictionary, this variable should be measured on a scale from 0 to 1, and we see that the values for instrumentalness all fall appropriately within this scale. We note that the maximum value of 0.994 is rather far away from the 3rd quartile of 0.00657. However, this may occur simply because most tracks tend to use a great deal of vocals. Because we want to determine if tracks that do not contain vocals tend to be less popular, we do not remove tracks with large instrumentalness from our data set. By clicking on the code button directly below, you can see the code that was used to produce the following output.

# Using the below code, we produce a numerical summary of the variable instrumentalness.
summary(spotify_songs_2$instrumentalness)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000000 0.0000000 0.0000207 0.0911294 0.0065725 0.9940000

The below output provides a numerical summary of the variable liveness. According to the data dictionary, this variable should be measured on a scale from 0 to 1, and we see that the values for liveness all fall appropriately within this scale. We note that the maximum value of 0.996 is rather far away from the 3rd quartile of 0.249. However, this may occur simply because most performers do not record their tracks live. Because we want to determine if popularity is related to live recording, we do not remove tracks with large values for liveness from our data set. By clicking on the code button directly below, you can see the code that was used to produce the following output.

# Using the below code, we produce a numerical summary of the variable liveness.
summary(spotify_songs_2$liveness)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0926  0.1270  0.1910  0.2490  0.9960

The below output provides a numerical summary of the variable valence. According to the data dictionary, this variable should be measured on a scale from 0 to 1, and we see that the values for valence all fall appropriately within this scale. Based on this output, there do not appear to be extreme outlying values for valence in the data set. By clicking on the code button directly below, you can see the code that was used to produce the following output.

# Using the below code, we produce a numerical summary of the variable valence.
summary(spotify_songs_2$valence)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3290  0.5120  0.5104  0.6950  0.9910

The below histogram provides further support for the idea that there are no outlying values for the variable valence. It also suggests that valence is roughly symmetric and approximately normally distributed, although the variable is bounded between 0 and 1 and so cannot be exactly normal. You can view the code used to produce the below histogram by clicking on the code button directly below.

# Using the below code, we produce a histogram for the variable valence. 
hist(spotify_songs_2$valence, main = "Frequency of Values for Valence", xlab = "Valence (measured on scale from 0 to 1)")
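A histogram gives only a coarse impression of distributional shape, so a normal Q-Q plot is a common complement when judging approximate normality. The sketch below uses simulated values bounded between 0 and 1 as a stand-in for valence (valence_sim is hypothetical and is not the real data), so it illustrates the technique rather than any result for this data set.

```r
# Simulated stand-in for valence: values bounded between 0 and 1 (not real data).
set.seed(1)
valence_sim <- rbeta(500, shape1 = 2, shape2 = 2)

# Points close to the reference line suggest an approximately normal shape;
# systematic curvature in the tails suggests a departure from normality.
qqnorm(valence_sim, main = "Normal Q-Q Plot (simulated values)")
qqline(valence_sim)
```

For a bounded variable, some flattening of the points at both ends of the plot is expected, since the tails cannot extend past 0 or 1.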

In looking at the data dictionary, we notice that an ideal range for tempo is not provided. This is quite different from the other variables for which typical, expected ranges were given in their definitions. Because no standard range is provided for the expected values of tempo, we will remove any extreme outliers when examining this variable. The below output displays a numerical summary and box-plot for the variable tempo. Based on this output, we notice that there appears to be an outlier that is unusually small, and there is also an outlier that is abnormally large. The code used to produce this output can be seen by clicking on the code buttons directly below.

# Using the below code, we obtain a numerical summary of the variable tempo.
summary(spotify_songs_2$tempo)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   99.97  121.99  120.96  134.00  239.44

The below code and output create the box-plot for the variable tempo.

# Using the below code, we obtain a boxplot for the variable tempo.
boxplot(spotify_songs_2$tempo)

The below output indicates that only observation 10639 has a tempo less than 30. You can view the code used to show this by clicking on the code button directly below.

# Using the below code, we determine that only observation 10639 has a tempo less than 30.
which(spotify_songs_2$tempo < 30)

## [1] 10639

The below output indicates that only observation 17960 has a tempo greater than 230. You can view the code used to show this by clicking on the code button directly below.

# Using the below code, we find that only observation 17960 has a tempo greater than 230.
which(spotify_songs_2$tempo > 230)

## [1] 17960

We create a new data set that contains all of the observations in spotify_songs_2 except for these two outliers (i.e., all observations except for observations 10639 and 17960), and we name this revised data set spotify_songs_3. The code that we use to do this is shown directly below.

# Using the below code, we create a new data set containing all of the observations in spotify_songs_2 which have a value for tempo that is neither less than 30 nor greater than 230. We name this revised data set spotify_songs_3.
spotify_songs_3 <- filter(spotify_songs_2, tempo >= 30, tempo <= 230)
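The filter() call above relies on the dplyr package being attached; if it is not, R resolves filter() to stats::filter(), which performs an unrelated time-series operation. As a hedged illustration, the same row selection can be written in base R. The data frame songs_toy and its values are made up for this sketch.

```r
# Toy data frame standing in for spotify_songs_2 (values are made up).
songs_toy <- data.frame(
  track = c("a", "b", "c", "d", "e"),
  tempo = c(0, 99.97, 121.99, 134.00, 239.44)
)

# Logical vector marking rows whose tempo lies in the retained range.
keep <- songs_toy$tempo >= 30 & songs_toy$tempo <= 230

# Base-R counterpart of filter(songs_toy, tempo >= 30, tempo <= 230).
songs_toy_kept <- songs_toy[keep, ]

# sum() on a logical vector counts the TRUE values, so this is the number of rows removed.
n_removed <- sum(!keep)
n_removed
nrow(songs_toy_kept)
```

One difference to keep in mind: dplyr's filter() drops rows where the condition evaluates to NA, whereas logical indexing with NA returns rows filled with NA, so the two approaches agree only when the filtered column has no missing values.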

After removing these two outliers, we create a new numerical summary and box-plot for the variable tempo. In the box-plot, we can especially see that there no longer seem to be any extreme outliers. The numerical summary and box-plot are shown directly below, and you can access the code required to produce them by clicking on the code buttons below.

# Using the below code, we produce a numerical summary of the variable tempo.
summary(spotify_songs_3$tempo)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   35.48   99.97  121.99  120.96  134.00  220.25

The below code and output create a box-plot for the variable tempo.

# Using the below code, we produce a boxplot for the variable tempo.
boxplot(spotify_songs_3$tempo)

The below output provides a numerical summary and box-plot for the variable duration_ms in the data set spotify_songs_3. In the box-plot especially, we can see that there appear to be no extreme outliers. You can view the code required to produce this output by clicking on the code buttons below.

# Using the below code, we produce a numerical summary for the variable duration_ms.
summary(spotify_songs_3$duration_ms)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29493  187743  216933  226587  254976  517810

The below code and output create a box-plot for the variable duration_ms.

# Using the below code, we produce a boxplot for the variable duration_ms.
boxplot(spotify_songs_3$duration_ms)

The below output provides a numerical summary of the variable loudness. In the data dictionary, it is stated that the loudness can generally be expected to be between -60 and zero decibels. We note that the minimum observed value of loudness is -46.448, which is quite far from the 1st quartile of -8.309. Although there is a large difference between the minimum and 1st quartile, we do not remove observations with extremely small values for loudness because they fall within the typical range provided in the data dictionary. Keeping these observations will also allow us to determine if tracks with extremely small values for loudness tend to be less popular. You can view the code required to produce the below output by clicking on the code button directly below.

# Using the below code, we produce a numerical summary of the variable loudness.
summary(spotify_songs_3$loudness)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -46.448  -8.309  -6.261  -6.817  -4.708   1.275

Based on the above output, we note that the maximum observed value of 1.275 is outside of the expected normal range for loudness stated in the data dictionary. Using the below code and output, we find that there are only six observations that contain an abnormal value for loudness which is above zero. Because this is only a small proportion of the total number of observations, we will remove these six observations from the data set.

# Using the below code, we find that there are only six observations having a loudness that is greater than zero decibels.
length(which(spotify_songs_3$loudness > 0))

## [1] 6

Because the data dictionary states that tracks having a loudness above zero are considered abnormal, we create a new data set with these unusual observations removed. This new data set is called spotify_songs_4, and it contains all of the observations in spotify_songs_3 except for those having a value for loudness that is greater than zero. The code that we use to accomplish this can be viewed by clicking on the code button shown directly below.

# Using the below code, we create a new data set called spotify_songs_4 that contains all of the observations in spotify_songs_3 except for those having a value for loudness that is greater than zero.
spotify_songs_4 <- filter(spotify_songs_3, loudness <= 0)

Tidying the Data

Recall that the variables track_id, track_name, track_artist, track_album_id, track_album_name, and track_album_release_date are not extremely important for our analysis. New performers may not necessarily have already produced albums or tracks to appear in Spotify playlists, and consequently, record labels cannot always use these variables to strategically select new performers to engage in contracts. Because these variables are not heavily utilized in our analysis, we will unite columns from among these variables which contain similar information. For instance, we will unite track_id and track_name, as these two variables both serve to identify tracks. Similarly, we will unite track_album_id and track_album_name since these two variables serve the same purpose (i.e., identifying the album containing the track). By uniting these columns, we create a tidied data set that is simpler in the sense that it has fewer variables, and we name this new tidied data set spotify_songs_5. We use underscores to separate the entries of united columns, as indicated in the below code.

# Using the below code, we create a new data set called spotify_songs_5 in which the track_id and track_name columns have been united to form a column titled track_id_and_name
spotify_songs_5 <- unite(spotify_songs_4, track_id_and_name, track_id, track_name, sep = "_")

# Using the below code, we unite the track_album_id and track_album_name columns to form a new column titled track_album_id_and_name
spotify_songs_5 <- unite(spotify_songs_5, track_album_id_and_name, track_album_id, track_album_name, sep = "_")
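To make the effect of uniting two columns concrete, the sketch below reproduces the pasting step in base R on a made-up data frame (tracks_toy and its values are hypothetical). It also shows one way to recover the two parts afterwards, under the assumption that the identifier itself never contains an underscore; track names may contain underscores, so the split is made at the first underscore only.

```r
# Toy data frame with made-up identifiers; note that one track name contains "_".
tracks_toy <- data.frame(
  track_id = c("id001", "id002"),
  track_name = c("First Song", "Second_Song"),
  stringsAsFactors = FALSE
)

# Base-R analogue of the pasting done by unite(..., sep = "_").
# Unlike unite(), this sketch keeps the original columns in place.
tracks_toy$track_id_and_name <- paste(tracks_toy$track_id, tracks_toy$track_name, sep = "_")

# Recover the identifier: drop everything from the first underscore onward.
recovered_id <- sub("_.*$", "", tracks_toy$track_id_and_name)

# Recover the name: drop everything up to and including the first underscore.
recovered_name <- sub("^[^_]*_", "", tracks_toy$track_id_and_name)

recovered_id
recovered_name
```

If the original columns might be needed later, tidyr's unite() also accepts remove = FALSE, which keeps them alongside the united column.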

How to Read Section 3.4

Now that you have finished reading Section 3.3, you can view Section 3.4 by clicking on the tab displayed earlier in this report.

Section 3.4: Further Visual Summaries

This section is devoted to displaying visual summaries of the data. In the final project, these visual summaries will be analyzed and studied, and conclusions will be drawn based on these graphics.

The below code and output create a column chart depicting the frequency of each artist in the data set.

# Using the below code, we produce a column chart displaying the frequency of each artist in the data set.
barplot(table(spotify_songs_5$track_artist), main = "Frequency of Each Artist", xlab = "Artist Name", ylab = "Frequency")
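When a categorical variable has a very large number of distinct values, as is likely for artist names, a chart with one column per value can be difficult to read. One possible refinement is to plot only the most frequent values. The sketch below uses a made-up vector (artists_toy) and an arbitrary cutoff of three artists.

```r
# Toy vector standing in for spotify_songs_5$track_artist (names are made up).
artists_toy <- c("A", "B", "A", "C", "A", "B", "D", "E", "B", "A")

# Count tracks per artist and sort from most to least frequent.
artist_counts <- sort(table(artists_toy), decreasing = TRUE)

# Keep only the most frequent artists; 3 is an arbitrary choice for this toy example.
top_artists <- head(artist_counts, 3)

# las = 2 rotates the axis labels so that longer names remain legible.
barplot(top_artists, main = "Most Frequent Artists (toy data)", ylab = "Frequency", las = 2)
```

The same pattern could be applied to the album and release date charts, where the number of distinct values is also large.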

The below code and output create a histogram of track_popularity. We see that there are many tracks that are not very popular.

# Using the below code, a histogram is created for the variable track_popularity.
hist(spotify_songs_5$track_popularity, main = "Histogram of Track Popularity", xlab = "Track Popularity")

The below code and output display a histogram for the variable danceability. Notice that the distribution appears to be left-skewed.

# Using the below code, we produce a histogram for the variable danceability.
hist(spotify_songs_5$danceability, main = "Histogram for Danceability", xlab = "Danceability")

Using the below code and histogram output, we also see that the distribution of the variable energy also appears to be left-skewed.

# Using the below code, we produce a histogram for the variable energy.
hist(spotify_songs_5$energy, main = "Histogram for Energy", xlab = "Energy")

The below code and output display a column chart depicting the frequency of each track_album_id_and_name.

# Using the below code, we display a column chart depicting the frequency of each track_album_id_and_name.
barplot(table(spotify_songs_5$track_album_id_and_name), main = "Frequency of Each Album", xlab = "track_album_id_and_name", ylab = "Frequency")

Using the below code and output, we display a box-plot for the variable key. We notice that there do not seem to be any outliers.

# Using the below code, we produce a boxplot for the variable key.
boxplot(spotify_songs_5$key, main = "Box-plot for Key")

Using the below code and histogram output, we see that the distribution for the variable loudness is left-skewed.

# Using the below code, we produce a histogram of the variable loudness.
hist(spotify_songs_5$loudness, breaks = 100, main = "Histogram of Loudness", xlab = "Loudness in Decibels")

Using the below code and histogram output, we visually see that the variable mode contains only two values.

# Using the below code, we produce a histogram of the variable mode.
hist(spotify_songs_5$mode, main = "Histogram of Mode", xlab = "Mode")

Using the below code and histogram output, we see that the distribution for the variable speechiness is right-skewed.

# Using the below code, we produce a histogram of the variable speechiness.
hist(spotify_songs_5$speechiness, main = "Histogram of Speechiness", xlab = "Speechiness")

Using the below code and output, we display a column chart of the frequency of each value for track_id_and_name. Because each track occurs only once in the data set, we observe that the frequency for each track is 1 in the column chart.

# Using the below code, we create a column chart for the variable track_id_and_name
barplot(table(spotify_songs_5$track_id_and_name), main = "Frequency of track_id_and_name", xlab = "track_id_and_name", ylab = "Frequency")

Using the below code and output, we display a histogram of the variable acousticness, and we see that the distribution of the variable acousticness is right-skewed.

# Using the below code, we produce a histogram of the variable acousticness.
hist(spotify_songs_5$acousticness, main = "Histogram of Acousticness", xlab = "Acousticness")

Using the below code and output, we display a histogram for the variable instrumentalness, and we see that most tracks have very low values for the variable instrumentalness.

# Using the below code, we produce a histogram for the variable instrumentalness.
hist(spotify_songs_5$instrumentalness, main = "Histogram of Instrumentalness", xlab = "Instrumentalness")

Using the below code and output, we display a histogram for the variable liveness. We notice that the distribution for liveness appears to be somewhat bi-modal, though the left-most peak is much more pronounced.

# Using the below code, we produce a histogram for the variable liveness.
hist(spotify_songs_5$liveness, breaks = 100, main = "Histogram of Liveness", xlab = "Liveness")

The below output displays a column chart which indicates the frequency of each release date for albums. You can view the code used to create this output by clicking on the code button directly below.

# Using the below code, we produce a column chart which indicates the frequency of each release date for albums.
barplot(table(spotify_songs_5$track_album_release_date), main = "Frequency of Each Release Date for Albums", xlab = "Album Release Date", ylab = "Frequency")

Using the below code and output, we display a box-plot for the variable valence, and we see that the variable valence does not appear to contain any outliers.

# Using the below code, we produce a boxplot for the variable valence.
boxplot(spotify_songs_5$valence, main = "Box-plot for Valence")

Using the below code and output, we display a histogram for the variable tempo. In the histogram, we see that the distribution of tempo appears to have multiple peaks.

# Using the below code, we produce a histogram for the variable tempo.
hist(spotify_songs_5$tempo, breaks = 100, main = "Histogram of Tempo", xlab = "Tempo")

Using the below code and output, we display a histogram of the variable duration_ms, and we see that the variable’s distribution is right-skewed.

# Using the below code, we produce a histogram for the variable duration_ms.
hist(spotify_songs_5$duration_ms, breaks = 100, main = "Histogram of the Duration of Tracks", xlab = "duration_ms")

How to Read Section 3.5

Now that you’ve finished reading Section 3.4, you can continue to Section 3.5 by clicking on the tab for Section 3.5 displayed earlier in the report.

Section 3.5: The Clean Data

The below table displays the first six rows of the clean data set (i.e., spotify_songs_5). You can view the code that was used to produce this output by clicking on the code button directly below.

# Using the below code, we display a table containing the first rows of the clean data.
knitr::kable(
  head(spotify_songs_5),
  align = "ccccccccccccccccc"
)

track_id_and_name	track_artist	track_popularity	track_album_id_and_name	track_album_release_date	danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	duration_ms
6f807x0ima9a1j3VPbc7VN_I Don’t Care (with Justin Bieber) - Loud Luxury Remix	Ed Sheeran	66	2oCs0DGTsRO98Gh5ZSl2Cx_I Don’t Care (with Justin Bieber) [Loud Luxury Remix]	2019-06-14	0.748	0.916	6	-2.634	1	0.0583	0.1020	0.00e+00	0.0653	0.518	122.036	194754
0r7CVbZTWZgbTCYdfa2P31_Memories - Dillon Francis Remix	Maroon 5	67	63rPSO264uRjW1X5E6cWv6_Memories (Dillon Francis Remix)	2019-12-13	0.726	0.815	11	-4.969	1	0.0373	0.0724	4.21e-03	0.3570	0.693	99.972	162600
1z1Hg7Vb0AhHDiEmnDE79l_All the Time - Don Diablo Remix	Zara Larsson	70	1HoSmj2eLcsrR0vE9gThr4_All the Time (Don Diablo Remix)	2019-07-05	0.675	0.931	1	-3.432	0	0.0742	0.0794	2.33e-05	0.1100	0.613	124.008	176616
75FpbthrwQmzHlBJLuGdC7_Call You Mine - Keanu Silva Remix	The Chainsmokers	60	1nqYsOef1yKKuGOVchbsk6_Call You Mine - The Remixes	2019-07-19	0.718	0.930	7	-3.778	1	0.1020	0.0287	9.40e-06	0.2040	0.277	121.956	169093
1e8PAfcKUYoKkxPhrHqw4x_Someone You Loved - Future Humans Remix	Lewis Capaldi	69	7m7vv9wlQ4i0LFuJiE2zsQ_Someone You Loved (Future Humans Remix)	2019-03-05	0.650	0.833	1	-4.672	1	0.0359	0.0803	0.00e+00	0.0833	0.725	123.976	189052
7fvUMiyapMsRRxr07cU8Ef_Beautiful People (feat. Khalid) - Jack Wins Remix	Ed Sheeran	67	2yiy9cd2QktrNvWC2EUi0k_Beautiful People (feat. Khalid) [Jack Wins Remix]	2019-07-11	0.675	0.919	8	-5.385	1	0.1270	0.0799	0.00e+00	0.1430	0.585	124.982	163049

How to Read Section 3.6

Now that you’ve finished reading Section 3.5, you can continue to Section 3.6 by clicking on the Section 3.6 tab found earlier in the report.

Section 3.6: Descriptive Statistics and Numerical Summaries

Visual Summaries

We remind the reader that visual summaries of the clean data were already produced in Section 3.4.

Data Structure

Using the below code, we find that the clean data set contains 28344 observations and 17 variables. We also find that the variables track_id_and_name, track_artist, track_album_id_and_name, and track_album_release_date all have a character type. The variables track_popularity, key, mode, and duration_ms all have an integer type. Also, the variables danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, and tempo all have a numeric type (in particular, they are of type double). As instructed in the grading rubric, we have removed the output from display in favor of providing the succinct summary contained in this paragraph.

# Using the below code, we examine the structure of the clean data.
str(spotify_songs_5)
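The reported number of observations can be cross-checked against the counts shown earlier in this report. The short calculation below is a sketch that assumes the table for mode covers every row of spotify_songs_2 (i.e., mode has no missing values); the variable names are hypothetical.

```r
# Row count of spotify_songs_2, taken as the sum of the two mode counts reported earlier.
n_before <- 12318 + 16034

# Observations removed during cleaning: two tempo outliers and six tracks with loudness above zero.
n_removed <- 2 + 6

# Expected number of rows in the clean data set.
n_expected <- n_before - n_removed
n_expected
```

The result agrees with the 28344 observations stated above, since uniting columns changes the number of variables but not the number of rows.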

Missing Values

Using the below code, we also find that there are no longer any missing values in the data. As instructed in the grading rubric, we have not shown the R output in favor of the more succinct summary provided in the preceding sentence.

# Using the below code, we find the number of missing values for each variable.
colSums(is.na(spotify_songs_5))

Table of Descriptive Statistics and Numerical Summaries

We remind the reader that track_id_and_name, track_artist, track_album_id_and_name, and track_album_release_date are not extremely important in our analysis. Since new performers may not already have an album and tracks in Spotify playlists, record labels will not always be able to use these variables in decisions regarding initial contracts with new performers. Hence we focus our numerical summaries around the other variables which we will be studying in our analysis. The below output displays a table of descriptive statistics for these important variables used in our analysis. You can view the code used to produce this output by clicking the code button directly below.

# Using the below code, we create a vector called minimum which stores the minimum value of each of the important variables that we plan to analyze.
minimum <- c(min(spotify_songs_5$track_popularity), min(spotify_songs_5$danceability), min(spotify_songs_5$energy), min(spotify_songs_5$key), min(spotify_songs_5$loudness), min(spotify_songs_5$mode), min(spotify_songs_5$speechiness), min(spotify_songs_5$acousticness), min(spotify_songs_5$instrumentalness), min(spotify_songs_5$liveness), min(spotify_songs_5$valence), min(spotify_songs_5$tempo), min(spotify_songs_5$duration_ms))

# Using the below code, we create a vector called maximum which stores the maximum value of each of the important variables that we plan to analyze.
maximum <- c(max(spotify_songs_5$track_popularity), max(spotify_songs_5$danceability), max(spotify_songs_5$energy), max(spotify_songs_5$key), max(spotify_songs_5$loudness), max(spotify_songs_5$mode), max(spotify_songs_5$speechiness), max(spotify_songs_5$acousticness), max(spotify_songs_5$instrumentalness), max(spotify_songs_5$liveness), max(spotify_songs_5$valence), max(spotify_songs_5$tempo), max(spotify_songs_5$duration_ms))

# Using the below code, we create a vector called first_quantile which stores the first quartile of each of the important variables that we plan to analyze.
first_quantile <- c(summary(spotify_songs_5$track_popularity)[2], summary(spotify_songs_5$danceability)[2], summary(spotify_songs_5$energy)[2], summary(spotify_songs_5$key)[2], summary(spotify_songs_5$loudness)[2], summary(spotify_songs_5$mode)[2], summary(spotify_songs_5$speechiness)[2], summary(spotify_songs_5$acousticness)[2], summary(spotify_songs_5$instrumentalness)[2], summary(spotify_songs_5$liveness)[2], summary(spotify_songs_5$valence)[2], summary(spotify_songs_5$tempo)[2], summary(spotify_songs_5$duration_ms)[2])

# Using the below code, we create a vector called median which stores the median of each of the important variables that we plan to analyze.
median <- c(summary(spotify_songs_5$track_popularity)[3], summary(spotify_songs_5$danceability)[3], summary(spotify_songs_5$energy)[3], summary(spotify_songs_5$key)[3], summary(spotify_songs_5$loudness)[3], summary(spotify_songs_5$mode)[3], summary(spotify_songs_5$speechiness)[3], summary(spotify_songs_5$acousticness)[3], summary(spotify_songs_5$instrumentalness)[3], summary(spotify_songs_5$liveness)[3], summary(spotify_songs_5$valence)[3], summary(spotify_songs_5$tempo)[3], summary(spotify_songs_5$duration_ms)[3])

# Using the below code, we create a vector called mean which stores the mean of each of the important variables that we plan to analyze.
mean <- c(summary(spotify_songs_5$track_popularity)[4], summary(spotify_songs_5$danceability)[4], summary(spotify_songs_5$energy)[4], summary(spotify_songs_5$key)[4], summary(spotify_songs_5$loudness)[4], summary(spotify_songs_5$mode)[4], summary(spotify_songs_5$speechiness)[4], summary(spotify_songs_5$acousticness)[4], summary(spotify_songs_5$instrumentalness)[4], summary(spotify_songs_5$liveness)[4], summary(spotify_songs_5$valence)[4], summary(spotify_songs_5$tempo)[4], summary(spotify_songs_5$duration_ms)[4])

# Using the below code, we create a vector called third_quantile which stores the third quartile of each of the important variables that we plan to analyze.
third_quantile <- c(summary(spotify_songs_5$track_popularity)[5], summary(spotify_songs_5$danceability)[5], summary(spotify_songs_5$energy)[5], summary(spotify_songs_5$key)[5], summary(spotify_songs_5$loudness)[5], summary(spotify_songs_5$mode)[5], summary(spotify_songs_5$speechiness)[5], summary(spotify_songs_5$acousticness)[5], summary(spotify_songs_5$instrumentalness)[5], summary(spotify_songs_5$liveness)[5], summary(spotify_songs_5$valence)[5], summary(spotify_songs_5$tempo)[5], summary(spotify_songs_5$duration_ms)[5])

# Using the below code, we create a vector called standard_deviation which stores the standard deviation of each of the important variables that we plan to analyze.
standard_deviation <- c(sd(spotify_songs_5$track_popularity), sd(spotify_songs_5$danceability), sd(spotify_songs_5$energy), sd(spotify_songs_5$key), sd(spotify_songs_5$loudness), sd(spotify_songs_5$mode), sd(spotify_songs_5$speechiness), sd(spotify_songs_5$acousticness), sd(spotify_songs_5$instrumentalness), sd(spotify_songs_5$liveness), sd(spotify_songs_5$valence), sd(spotify_songs_5$tempo), sd(spotify_songs_5$duration_ms))

# Using the below code, we create a dataframe called descriptive_statistics that contains the vectors minimum, maximum, first_quantile, median, mean, third_quantile, and standard_deviation.
descriptive_statistics <- data.frame(minimum, maximum, first_quantile, median, mean, third_quantile, standard_deviation)

# Using the below code, we name each row in descriptive_statistics according to the corresponding variable described by those statistics.
row.names(descriptive_statistics) <- c("track_popularity", "danceability", "energy", "key", "loudness", "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "duration_ms")

# Using the below code, we output a table displaying the values in descriptive_statistics.
knitr::kable(
  descriptive_statistics,
  caption = "Descriptive Statistics and Numerical Summaries for the Clean Data"
)

Descriptive Statistics and Numerical Summaries for the Clean Data
	minimum	maximum	first_quantile	median	mean	third_quantile	standard_deviation
track_popularity	0.0000e+00	100.000	21.00000	4.20000e+01	3.933887e+01	5.800000e+01	2.369742e+01
danceability	7.7100e-02	0.983	0.56100	6.70000e-01	6.533830e-01	7.600000e-01	1.457295e-01
energy	1.7500e-04	1.000	0.57900	7.22000e-01	6.983495e-01	8.430000e-01	1.834771e-01
key	0.0000e+00	11.000	2.00000	6.00000e+00	5.367556e+00	9.000000e+00	3.613605e+00
loudness	-4.6448e+01	-0.046	-8.31025	-6.26200e+00	-6.818456e+00	-4.709750e+00	3.032471e+00
mode	0.0000e+00	1.000	0.00000	1.00000e+00	5.654459e-01	1.000000e+00	4.957071e-01
speechiness	2.2400e-02	0.918	0.04100	6.26000e-02	1.079337e-01	1.330000e-01	1.025514e-01
acousticness	1.4000e-06	0.994	0.01430	7.97000e-02	1.772138e-01	2.600000e-01	2.228397e-01
instrumentalness	0.0000e+00	0.994	0.00000	2.07000e-05	9.115160e-02	6.582500e-03	2.325908e-01
liveness	9.3600e-03	0.996	0.09260	1.27000e-01	1.909535e-01	2.490000e-01	1.558725e-01
valence	1.0000e-05	0.991	0.32900	5.12000e-01	5.104299e-01	6.950000e-01	2.343385e-01
tempo	3.5477e+01	220.252	99.97200	1.21993e+02	1.209551e+02	1.339957e+02	2.693762e+01
duration_ms	2.9493e+04	517810.000	187746.50000	2.16933e+05	2.265964e+05	2.549773e+05	6.106305e+04
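A table of this kind can also be built more compactly by applying each statistic across the columns of a data frame. The following is a minimal sketch on made-up values (songs_toy and descriptive_toy are hypothetical names, and only three columns are shown); it is an alternative formulation, not the code used to produce the table above.

```r
# Toy data frame standing in for the analysis columns of spotify_songs_5 (values are made up).
songs_toy <- data.frame(
  track_popularity = c(10, 42, 58, 66, 100),
  danceability = c(0.35, 0.56, 0.67, 0.76, 0.98),
  tempo = c(80, 100, 122, 134, 180)
)

# Apply each statistic to every column; sapply() returns a vector named by column,
# and data.frame() uses those names as the row names of the result.
descriptive_toy <- data.frame(
  minimum = sapply(songs_toy, min),
  maximum = sapply(songs_toy, max),
  first_quartile = sapply(songs_toy, quantile, probs = 0.25, names = FALSE),
  median = sapply(songs_toy, median),
  mean = sapply(songs_toy, mean),
  third_quartile = sapply(songs_toy, quantile, probs = 0.75, names = FALSE),
  standard_deviation = sapply(songs_toy, sd)
)

descriptive_toy
```

Because the statistics are supplied as named arguments to data.frame(), this version does not create global objects called mean or median, which would otherwise mask the base R functions of the same names in the workspace.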

Proposed Exploratory Data Analysis

The proposed exploratory data analysis contains four subsections, each of which can be accessed by clicking the four tabs below.

Section 4.1: Uncovering New Information in the Data

There are several ways that we could examine this data in order to determine the musical characteristics that are often present in popular tracks. One method is to examine scatter plots which compare track_popularity to other musical characteristics. Another is to develop a regression model relating track_popularity to various musical characteristics. A third is to calculate summary statistics for track_popularity within subsets of the data that are filtered on a particular musical characteristic. For instance, summary statistics could be calculated for track_popularity among only those observations with high liveness values (i.e., tracks that are likely to have been recorded live). By slicing the data in this way and summarizing each slice, we can begin to see patterns in variables that may be related to track_popularity. We also plan to examine potential interaction between the variables. If interaction between two variables is discovered, the two variables will be multiplied together to create a new variable that is used in the development of a regression model. This will account for the interaction of the two variables. Further potential transformations of the data will be examined to see if the transformations would improve the regression results.
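The two ideas described above, an interaction term and summary statistics within a slice of the data, can be sketched as follows. The data are simulated (sim is a hypothetical data frame with arbitrary coefficients), so the output says nothing about the real Spotify data; the sketch only illustrates the mechanics.

```r
# Simulated data (not real Spotify data); the coefficients below are arbitrary.
set.seed(42)
n <- 200
sim <- data.frame(energy = runif(n), valence = runif(n))
sim$track_popularity <- 40 + 10 * sim$energy + 5 * sim$valence +
  15 * sim$energy * sim$valence + rnorm(n, sd = 5)

# In an R formula, energy * valence expands to energy + valence + energy:valence,
# where energy:valence is the product (interaction) term.
fit_interaction <- lm(track_popularity ~ energy * valence, data = sim)
coef(fit_interaction)

# Summary statistics of popularity within a slice of the data:
# here, tracks in the top quarter of energy values.
high_energy <- sim[sim$energy >= quantile(sim$energy, 0.75), ]
summary(high_energy$track_popularity)
```

Using the formula operator rather than creating the product column by hand keeps the two original variables in the model automatically and labels the interaction coefficient clearly in the output.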

Section 4.2: Plots and Tables

We will create scatter plots that compare track_popularity to various musical characteristics. We will also create tables providing summary statistics for track_popularity among filtered slices of the data set. In particular, we will focus on filtering the data to examine track_popularity among observations that have specific musical characteristics (such as observations that are likely to have been recorded live, observations with high valence, observations with high instrumentalness, etc.). We also plan to use residual plots to examine the adequacy of potential regression models.
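As a rough sketch of the two kinds of plots mentioned above, the following R code draws a scatter plot of track_popularity against one characteristic and a residuals-versus-fitted plot for a simple regression. The data are simulated placeholders, the choice of valence and loudness as predictors is an assumption made only for illustration, and base R graphics are used to keep the sketch free of extra dependencies (ggplot2 would serve equally well).

```r
# Hypothetical stand-in for the cleaned Spotify data; all values are simulated.
set.seed(2)
n <- 300
tracks <- data.frame(
  valence  = runif(n),
  loudness = rnorm(n, mean = -7, sd = 3)
)
tracks$track_popularity <-
  45 + 12 * tracks$valence + 1.5 * tracks$loudness + rnorm(n, sd = 9)

# Scatter plot: track_popularity against a single musical characteristic.
plot(tracks$valence, tracks$track_popularity,
     xlab = "valence", ylab = "track_popularity",
     main = "track_popularity vs. valence (simulated data)")

# Candidate regression model (predictors chosen only for illustration).
fit <- lm(track_popularity ~ valence + loudness, data = tracks)

# Residual plot: residuals against fitted values. A patternless band around
# zero suggests the linear form and constant-variance assumptions are adequate.
plot(fitted(fit), resid(fit),
     xlab = "Fitted track_popularity", ylab = "Residual",
     main = "Residuals vs. fitted values")
abline(h = 0, lty = 2)
```

Curvature or a funnel shape in the second plot would be a cue to consider a transformation or additional terms.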

Section 4.3: Topics to Learn

I already have the skills and knowledge necessary to perform the analysis tasks proposed in this work. In class, we have not yet learned how to develop regression models, investigate interaction between variables, use variable selection criteria such as the Bayesian information criterion, use the ggplot2 package to create visualizations, or use the leaps package to aid in variable selection. However, I have already gained these skills in other classes. In developing this midterm report, I learned new skills related to the creation of tabbed sections within R Markdown, and I also learned how to use the knitr::kable() function. In class, I am also learning new functions in the tidyverse. To complete the final project, I plan to use shiny. Because I do not have experience using shiny, I will need to learn how to use it.

Section 4.4: Machine Learning Techniques

To determine whether each numeric variable is a statistically significant predictor of track_popularity, we plan to examine the p-values resulting from regression analysis. To select tuning parameters for potential transformations of the data, we plan to maximize the log-likelihood of the transformed data and inspect residual plots. We will also check for multicollinearity among the variables, and if it is present, we will consider the potential benefits of ridge regression. Finally, we will use cross-validation to estimate the predictive accuracy of potential regression models.
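The steps above might look roughly like the following in R. Several assumptions are made here: the data are simulated, the transformation family is taken to be Box-Cox (the text does not name one, but Box-Cox is a common family whose parameter is chosen by maximizing a log-likelihood), the response is assumed strictly positive, and the MASS package is used for `boxcox()` and `lm.ridge()`. The helper `bc_transform` is a hypothetical name introduced only for this sketch.

```r
library(MASS)  # provides boxcox() and lm.ridge()

# Hypothetical stand-in for the cleaned Spotify data; all values are simulated.
set.seed(3)
n <- 400
energy   <- runif(n)
loudness <- -12 + 8 * energy + rnorm(n, sd = 1.5)  # correlated with energy
tempo    <- rnorm(n, mean = 120, sd = 25)
tracks <- data.frame(
  energy, loudness, tempo,
  track_popularity = exp(3.2 + 0.4 * energy + 0.02 * loudness +
                           rnorm(n, sd = 0.3))
)

# 1. p-values for each predictor from an ordinary regression fit.
fit <- lm(track_popularity ~ energy + loudness + tempo, data = tracks)
p_values <- summary(fit)$coefficients[, "Pr(>|t|)"]

# 2. Box-Cox (assumed family): pick the lambda maximizing the log-likelihood.
bc <- boxcox(track_popularity ~ energy + loudness + tempo, data = tracks,
             lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Hypothetical helper: apply the Box-Cox transform (log when lambda is ~0).
bc_transform <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}
tracks$pop_bc <- bc_transform(tracks$track_popularity, lambda_hat)

# 3. Multicollinearity check: variance inflation factor for energy,
#    computed as 1 / (1 - R^2) from regressing it on the other predictors.
r2_energy  <- summary(lm(energy ~ loudness + tempo, data = tracks))$r.squared
vif_energy <- 1 / (1 - r2_energy)

# 4. Ridge regression over a grid of penalties, chosen by generalized
#    cross-validation (GCV) as reported by lm.ridge().
ridge <- lm.ridge(pop_bc ~ energy + loudness + tempo, data = tracks,
                  lambda = seq(0, 20, by = 0.5))
ridge_lambda <- ridge$lambda[which.min(ridge$GCV)]

# 5. 5-fold cross-validation of the least-squares model on the transformed
#    scale; each fold is held out once and scored by RMSE.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))
cv_rmse <- sapply(seq_len(k), function(i) {
  train <- tracks[fold != i, ]
  test  <- tracks[fold == i, ]
  fit_i <- lm(pop_bc ~ energy + loudness + tempo, data = train)
  sqrt(mean((test$pop_bc - predict(fit_i, newdata = test))^2))
})
mean(cv_rmse)
```

For simplicity, the transformation parameter is chosen once on the full data before cross-validation. A stricter evaluation would re-select it inside each training fold.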

The Influence of Musical Characteristics on Popularity

Abigail Richard

10/30/2020
