Final Project: Popularity of Spotify Audio

For my final project, I chose to explore a Spotify dataset and examine how the features, genres, and release dates of songs influence their popularity. Below are the different elements of the project, broken out on each tab.

Final Project Components

Synopsis

Everyone wants to be famous nowadays. With social media platforms like YouTube, Instagram, and TikTok, almost anyone can become a content creator, gain popularity, and monetize it. Content creators constantly think about how to make their accounts, videos, and streaming channels more attractive to viewers. Music is one of the first things they consider when making videos or streaming, since music sets the mood of their work. For example, when we watch a vlog on YouTube whose background music sounds light-hearted and cheerful, most of us enjoy the video more and are more likely to hit the like button. In other words, the popularity of the music is positively correlated with the popularity of the content. This is one of many everyday examples of why music popularity matters.

In this project, I use the Spotify dataset to explore the factors that contribute to a song's popularity. The dataset contains 32833 songs spanning six broad genres: pop, rap, rock, Latin, EDM, and R&B. It also contains 12 variables for 12 music features: acousticness, liveness, speechiness, energy, loudness, danceability, instrumentalness, valence, duration, tempo, key, and mode. Most importantly, it has a variable called track_popularity, rating the popularity of each song from 0 to 100. My methodology is to conduct univariate analysis on the relevant variables: use summarization (pivot tables) and graphs such as barplots and boxplots to show the correlation between popularity and factors such as the release date, features, and genre of the music. From these methods, we can identify the factors that contribute positively and negatively to the popularity of music, as well as the factors that are uncorrelated with it.

This project can guide customers in their choice of music in order to maximize the popularity of their content. YouTubers and streamers could use the results to choose background music that increases the popularity of their videos. Restaurants and shops can also benefit by choosing popular tracks to play on site, maximizing the satisfaction of their customers. In a nutshell, they can monetize the popularity of the music they choose.

Packages Required

These packages are required to manipulate and visualize the data.

library(dplyr)      ## Manipulating data (also loaded by tidyverse)
library(tidyverse)  ## Tidying data
library(ggplot2)    ## Visualizing data (also loaded by tidyverse)
library(knitr)      ## Showing original data in a readable format
library(DT)         ## Rendering data tables
library(magrittr)   ## Pipe operators
library(ggcorrplot) ## Visualizing correlation matrices
library(plotly)     ## Interactive plots

Data Preparation

The data comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify's API. The Spotify API provides artist, album, and track data, as well as audio features and genres for each song.

The Spotify genre data was downloaded beforehand; I downloaded the csv file from Dropbox on 3/31/2020.
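For readers who want to refresh this data themselves, below is a minimal sketch of pulling audio features directly with spotifyr. It assumes you have registered a Spotify developer app, set the SPOTIFY_CLIENT_ID and SPOTIFY_CLIENT_SECRET environment variables, and replaced the placeholder track id; it is shown commented out since this report works from the pre-downloaded csv.

# Not run: sketch of fetching audio features via the Spotify API
# library(spotifyr)
# token <- get_spotify_access_token()  # reads SPOTIFY_CLIENT_ID / SPOTIFY_CLIENT_SECRET
# feats <- get_track_audio_features("<track_id>", authorization = token)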

song <- read.csv("spotify_songs.csv", stringsAsFactors = FALSE)

There are 32833 observations, each representing a song, and 23 variables.

# Find the dimensions of the dataset.
dim(song)
## [1] 32833    23

Then I check how many missing values exist in each variable. It turns out that 3 of the 23 variables have 5 missing values each.

# See how many NA is in each variable
colSums(is.na(song))
##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

I removed some columns that are not useful for analysis, including the 3 variables with missing values mentioned above: track_name, track_artist, and track_album_name. Since the purpose of this project is to find the factors that influence popularity, variables such as the name and id of the song, album, or artist do not provide much useful analytic information, so I removed them.

dropcol <- c("track_id","track_name","track_artist","track_album_id","track_album_name","playlist_name","playlist_id")
song1<- song %>% select(-all_of(dropcol))

Because I want to explore the correlation between popularity and the year, month, and day of a song's release, I break the release date down into those 3 variables. After checking, however, I found that the variable track_album_release_date contains 1855 values without a month and 1886 values without a day (stored as Date).

# Break down the date to Year, Month, and Date
song1 %>% separate(track_album_release_date,c("Year","Month","Date"),sep = "-") -> song2

# see how many missing values in each variable
colSums(is.na(song2))
##  track_popularity              Year             Month              Date 
##                 0                 0              1855              1886 
##    playlist_genre playlist_subgenre      danceability            energy 
##                 0                 0                 0                 0 
##               key          loudness              mode       speechiness 
##                 0                 0                 0                 0 
##      acousticness  instrumentalness          liveness           valence 
##                 0                 0                 0                 0 
##             tempo       duration_ms 
##                 0                 0

I use boxplots to see whether the release month and day of the songs are correlated with track popularity. If no correlation appears, I will drop these 2 variables, since they would not provide useful information regarding popularity.

song2 %>% 
  ggplot(aes(Month,track_popularity))+
    geom_boxplot()+
    scale_y_continuous(name = "Track Popularity")+
    ggtitle("Relationship between Track Popularity and Month of Track Release")

song2 %>% 
  ggplot(aes(Date,track_popularity))+
  geom_boxplot()+
  scale_y_continuous(name = "Track Popularity")+
  ggtitle("Relationship between Track Popularity and Date of Track Release")

In the boxplots above, the songs have similar popularity medians and overall distributions across both Month and Date; even the boxplots for the missing values (NA) are similar to the rest. Therefore, I conclude that the variables Month and Date have no influence on the response variable track_popularity. I drop those 2 variables and convert playlist_genre to a factor. I also transform duration_ms to duration in minutes so that readers can grasp the length of the songs more easily.

# drop month and date and change duration to minute
song_use <- song2 %>%
  transmute(track_popularity,
            Year,
            playlist_genre = as_factor(playlist_genre),
            danceability,
            energy,
            key,
            loudness,
            mode,
            speechiness,
            acousticness,
            instrumentalness,
            liveness,
            valence,
            tempo,
            duration_min = duration_ms/1000/60
         )

Here is a glimpse of all the variables now that they are cleaned with correct data types.

summary(song_use)
##  track_popularity     Year           playlist_genre  danceability   
##  Min.   :  0.00   Length:32833       pop  :5507     Min.   :0.0000  
##  1st Qu.: 24.00   Class :character   rap  :5746     1st Qu.:0.5630  
##  Median : 45.00   Mode  :character   rock :4951     Median :0.6720  
##  Mean   : 42.48                      latin:5155     Mean   :0.6548  
##  3rd Qu.: 62.00                      r&b  :5431     3rd Qu.:0.7610  
##  Max.   :100.00                      edm  :6043     Max.   :0.9830  
##      energy              key            loudness            mode       
##  Min.   :0.000175   Min.   : 0.000   Min.   :-46.448   Min.   :0.0000  
##  1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171   1st Qu.:0.0000  
##  Median :0.721000   Median : 6.000   Median : -6.166   Median :1.0000  
##  Mean   :0.698619   Mean   : 5.374   Mean   : -6.720   Mean   :0.5657  
##  3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645   3rd Qu.:1.0000  
##  Max.   :1.000000   Max.   :11.000   Max.   :  1.275   Max.   :1.0000  
##   speechiness      acousticness    instrumentalness       liveness     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000   Min.   :0.0000  
##  1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000   1st Qu.:0.0927  
##  Median :0.0625   Median :0.0804   Median :0.0000161   Median :0.1270  
##  Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472   Mean   :0.1902  
##  3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300   3rd Qu.:0.2480  
##  Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000   Max.   :0.9960  
##     valence           tempo         duration_min    
##  Min.   :0.0000   Min.   :  0.00   Min.   :0.06667  
##  1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:3.13032  
##  Median :0.5120   Median :121.98   Median :3.60000  
##  Mean   :0.5106   Mean   :120.88   Mean   :3.76333  
##  3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:4.22642  
##  Max.   :0.9910   Max.   :239.44   Max.   :8.63017
str(song_use)
## 'data.frame':    32833 obs. of  15 variables:
##  $ track_popularity: int  66 67 70 60 69 67 62 69 68 67 ...
##  $ Year            : chr  "2019" "2019" "2019" "2019" ...
##  $ playlist_genre  : Factor w/ 6 levels "pop","rap","rock",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ danceability    : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy          : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key             : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness        : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode            : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness     : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness    : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness: num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness        : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence         : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo           : num  122 100 124 122 124 ...
##  $ duration_min    : num  3.25 2.71 2.94 2.82 3.15 ...

In the summary of the variables, loudness has a maximum of 1.275, while the data description states that loudness typically falls between -60 and 0 dB.

# select records with loudness larger than 1
song %>% 
  filter(loudness > 1) %>% 
  select(track_name,playlist_genre,playlist_subgenre,loudness) %>% 
  arrange(desc(loudness)) %>% 
  as_tibble()
## # A tibble: 2 x 4
##   track_name                     playlist_genre playlist_subgenre loudness
##   <chr>                          <chr>          <chr>                <dbl>
## 1 Raw Power - Iggy Pop Mix       rock           album rock            1.27
## 2 Escape From Love - Curbi Remix edm            electro house         1.14
song_use %>% 
  ggplot(aes(y = loudness))+
  geom_boxplot(color = 'black',fill = 'green', coef = 4) + 
  coord_flip()+
  labs(title = "Loudness")

As shown above, however, the outliers are mostly on the left side with negative values. Since the variable description says loudness ranges from -60 to 0, it makes sense to have values as low as -46.45, so I decide not to remove any outliers on the left side. As for loudness values larger than 1, besides not looking like outliers in the boxplot, the 2 tracks shown above are rock and EDM remixes, which are expected to be extra loud. In conclusion, I decide not to remove those records either.

Below is a preview of cleaned data.

song_use %>% head(100) %>% datatable()

Below is a table of the variable name, data types, and a description for each variable.

Variable Name Data Type Variable Description
track_popularity integer Song Popularity (0-100) where higher is better
Year character Year when song is released
playlist_genre factor Playlist genre
danceability numeric Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy numeric Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key integer The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness numeric The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode integer Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness numeric Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness numeric A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness numeric Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness numeric Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence numeric A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo numeric The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_min numeric Duration of song in minutes

Exploratory Data Analysis

To dive further into the data, I break the EDA down into the following 3 parts:

Correlation between Music features

After reviewing all the music features, I suspected that some of them might be highly correlated with one another. For example, dance music tends to have strong energy, and high-energy music is usually loud. In this section, I visualize the correlations between music features to see whether some of the features are redundant.

song_use %>% 
  select(danceability,energy,key,loudness,speechiness,acousticness,instrumentalness,liveness,
         valence,tempo,mode) %>% 
  cor() %>% 
  round(2) %>% 
  ggcorrplot(hc.order = TRUE, type = "lower", lab = TRUE, title =  "Correlation between music features", 
             ggtheme = theme_gray, colors = c("#6D9EC1", "white", "#E46726"))

As shown above, most of the music features are not correlated, with correlations below 0.2. Given the data, I set an arbitrary threshold of 0.3 to decide whether a correlation exists. Four pairs have an absolute correlation greater than 0.3. Valence and danceability have a positive correlation of 0.33, which makes sense, since dance music tends to be happy music. Acousticness is negatively correlated with energy and loudness, and indeed acoustic music is usually quiet and soothing. Energy and loudness have a positive correlation of 0.68, which can be considered high. I decide to drop loudness, since energy conveys "fast, loud, and noisy" and is therefore more informative and meaningful than loudness.

Popularity and Time

In this section, I explore the correlation between music popularity and time. Below I display the mean of track_popularity for each release year from 1957 to 2020.

song_use %>% 
  mutate(Year = as.factor(Year)) %>%
  group_by(Year) %>% 
  summarise(Popularity = mean(track_popularity,na.rm = TRUE)) %>%
  ggplot() + 
  geom_line(aes(x = Year, y = Popularity, group = 1),
            color = "#09557f", size = 1) + 
  scale_x_discrete(breaks = seq(1957,2020,10),labels = seq(1955,2020,10))+
  ggtitle("Popularity of Songs from 1957 to 2020")

We can see huge variations in popularity during 1957-1965, and popularity gradually decreases with release year until about 2005. Music from 2005 to 2018 trends up in popularity, while music from 2020 seems less popular than music from 2018 and 2019. Since these are current (2020) popularity scores, it makes sense that recently released music enjoys more popularity, while music from 2020 is too new and may not yet have had enough exposure, resulting in lower popularity than music from 2018 and 2019.

I also want to explore whether the number of songs is evenly distributed across the timeline.

song_use %>% 
  transmute(Year = as.numeric(Year),
            playlist_genre) %>% 
  ggplot(aes(x = Year, fill = playlist_genre))+
  geom_density(alpha = 0.5, position = "stack")+
  ggtitle("Density of Songs from 1957 to 2020") +
  scale_x_continuous(breaks = seq(1957,2020,10),labels = seq(1955,2020,10)) +
  labs(fill = "Genre") -> a 

ggplotly(a)

As shown above, the density plot is heavily skewed to the left, which means the dataset contains very few songs released before approximately 2005. This explains the huge variation during 1957-1965, and since there is so little data from 1957 to 2005, the popularity trend over that period may not be representative. In a nutshell, the takeaway from this section is that, for music released from 2005 to the present, popularity increases over time except for a drop around 2010. The most popular songs are from approximately 2018 and 2019.
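As a quick sanity check of this sparsity (a check I added, not part of the original pipeline), the sketch below counts how many tracks were released before 2005 versus from 2005 onward.

# Sketch: count tracks released before vs. from 2005
song_use %>%
  mutate(Year = as.numeric(Year)) %>%
  summarise(before_2005 = sum(Year < 2005),
            from_2005   = sum(Year >= 2005))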

Features of each Genre

Since in the next section I will run a regression of popularity on all the music features, here I want to discover how to describe each genre with these features. That way, this report can provide guidance on which type and genre of music is most popular.

First, I create categorical versions of the features based on the variable descriptions. For example, the description states that "A value above 0.8 provides strong likelihood that the track is live." Based on this, I create a new variable named live_cat, which categorizes music with liveness above 0.8 as "Live" and at or below 0.8 as "Not Live".

For features without a clear classification in the variable description, I split at the median of the variable to create 2 categories. For example, I create valence_cat, which labels songs with valence below the median as "Sad" and those above it as "Happy".

# make categorical variables based on variable description, or the median of the variable.
song_use %>% 
  transmute(playlist_genre,
            mode_cat = as.factor(mode),
            speech_cat = as.factor(case_when(speechiness >= 0 & speechiness <= 0.33 ~ "No speech music",
                                speechiness > 0.33 & speechiness <= 0.66 ~ "Speech like music",
                                TRUE ~ "Speech")),
            instrumental_cat = as.factor(case_when(instrumentalness >= 0 & instrumentalness <= 0.5 ~ "Yes Vocal",
                                      instrumentalness > 0.5 & instrumentalness <= 1 ~ "No Vocal")),
            live_cat = as.factor(case_when(liveness >= 0 & liveness <= 0.8 ~ "Not Live",
                              liveness > 0.8 & liveness <= 1 ~ "Live")), 
            dance_cat = as.factor(case_when(danceability >= 0 & danceability <= median(song_use$danceability) ~ "Not danceable",
                               TRUE ~ "Danceable")),
            energy_cat = as.factor(case_when(energy >= 0 & energy <= median(song_use$energy) ~ "Not energetic",
                                            TRUE ~ "Energetic")), 
            key_cat = as.factor(case_when(key >= 0 & key <= median(song_use$key) ~ "lower key",
                                             TRUE ~ "higher key")), 
            acoustic_cat = as.factor(case_when(acousticness >= 0 & acousticness <= median(song_use$acousticness) ~ "Not Acoustic",
                                          TRUE ~ "Acoustic")), 
            valence_cat = as.factor(case_when(valence >= 0 & valence <= median(song_use$valence) ~ "Sad",
                                               TRUE ~ "Happy")), 
            tempo_cat = as.factor(case_when(tempo >= 0 & tempo <= median(song_use$tempo) ~ "Slow",
                                              TRUE ~ "Fast")),
            duration_cat = as.factor(case_when(duration_min >= 0 & duration_min <= median(song_use$duration_min) ~ "Short",
                                            TRUE ~ "Long")),            
            ) -> song_graph

Then I display one chart for each categorical feature variable to see what percentage of each genre falls into each category. This shows which features characterize each genre.

# Create a function to graph
Genre_graph <- function(feature,title){
  song_graph %>% 
    ggplot(aes(x = playlist_genre, fill = feature)) + 
    geom_bar(position = "fill") + 
    scale_y_continuous(name = "Percent", labels = scales::percent)+
    coord_flip()+
    ggtitle(paste0(title," Distribution for each Genre"))
}

Genre_graph(song_graph$speech_cat,"Speechiness")

Genre_graph(song_graph$instrumental_cat,"Instrumentalness")

Genre_graph(song_graph$live_cat,"Liveness")

Genre_graph(song_graph$dance_cat,"Danceability")

Genre_graph(song_graph$energy_cat,"Energy")

Genre_graph(song_graph$key_cat,"Key")

Genre_graph(song_graph$acoustic_cat,"Acousticness")

Genre_graph(song_graph$valence_cat,"Valence")

Genre_graph(song_graph$tempo_cat,"Tempo")

Genre_graph(song_graph$duration_cat,"Duration")

From the graphs above, I can summarize the features of each genre as follows:

  • EDM: No speech, No vocal, a few live, Energetic, Not acoustic, Sad, Fast, medium danceability and medium duration.
  • R&B: A little speech like, Vocal, No live, not energetic, Acoustic, Slow, Long, medium danceability, and medium valence
  • Latin: No speech, Vocal, No live, Danceable, Acoustic, Happy, Slow, medium energy, medium duration
  • Rock: No speech, Vocal, a few live, Not danceable, Energetic, Not acoustic, Long, medium valence, medium tempo
  • Rap: Speech like, Vocal, No live, Danceable, not energetic, a bit acoustic, medium valence, medium tempo, medium duration
  • Pop: No speech, Vocal, No live, medium danceability and medium energy, medium acoustic, medium valence, medium tempo, and medium duration

All genres have almost the same percentage for Key, so I don’t include Key as part of the features.
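To double-check that claim numerically (my addition), the sketch below computes the share of tracks in each key category by genre; near-identical proportions across genres would justify dropping Key.

# Sketch: proportion of tracks in each key category, by genre
song_graph %>%
  count(playlist_genre, key_cat) %>%
  group_by(playlist_genre) %>%
  mutate(prop = round(n / sum(n), 2))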

Supervised Learning

In this section, I construct a linear regression model to see whether we can predict the popularity score from the given music features. First, I convert the mode variable to a factor, then sample 90% of the data as the training set; the remaining 10% is the test set. The regression model is built on the training set, and I evaluate its performance on the test set.

# Convert mode into factor
song_use %>% 
  mutate(mode = as.factor(mode)) -> song_mod

# Split data to training and testing samples
set.seed(1234)
sample_index <- sample(nrow(song_mod),nrow(song_mod)*0.90)
song_train <- song_mod[sample_index,]
song_test <- song_mod[-sample_index,]

I build the linear model using training data set.

# Model building
m1 <- lm(track_popularity ~ danceability+energy+mode+speechiness+
       acousticness+instrumentalness+liveness+valence+tempo+duration_min+key+loudness, data = song_train)
summary(m1)
## 
## Call:
## lm(formula = track_popularity ~ danceability + energy + mode + 
##     speechiness + acousticness + instrumentalness + liveness + 
##     valence + tempo + duration_min + key + loudness, data = song_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.462 -17.643   2.834  18.909  66.432 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       78.030622   1.764709  44.217  < 2e-16 ***
## danceability       5.020581   1.105794   4.540 5.64e-06 ***
## energy           -29.215194   1.261290 -23.163  < 2e-16 ***
## mode1              0.624701   0.287952   2.169 0.030056 *  
## speechiness       -7.867691   1.434033  -5.486 4.14e-08 ***
## acousticness       3.054412   0.769847   3.968 7.28e-05 ***
## instrumentalness -12.017507   0.665854 -18.048  < 2e-16 ***
## liveness          -4.374367   0.928340  -4.712 2.46e-06 ***
## valence            2.907273   0.677409   4.292 1.78e-05 ***
## tempo              0.018568   0.005385   3.448 0.000565 ***
## duration_min      -2.818576   0.144418 -19.517  < 2e-16 ***
## key                0.057938   0.039423   1.470 0.141664    
## loudness           1.508370   0.067640  22.300  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.09 on 29536 degrees of freedom
## Multiple R-squared:  0.0717, Adjusted R-squared:  0.07133 
## F-statistic: 190.1 on 12 and 29536 DF,  p-value: < 2.2e-16

From the summary we can see that the F-statistic's p-value is less than 2.2e-16, which means the model is significant. But the adjusted R-squared is 0.071, meaning only about 7.1% of the variance in popularity is explained by the music features.
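As an optional diagnostic I added here, a residuals-vs-fitted plot from base R gives a quick visual sense of this weak fit.

# Sketch: residuals vs. fitted values for m1
plot(m1, which = 1)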

Then I use forward selection, backward selection, and stepwise selection to choose the model with the lowest AIC.

# forward selection, backward selection, stepwise selection
nullmodel <- lm(track_popularity~1,data = song_train)
fullmodel <- lm(track_popularity~.-Year-playlist_genre, data = song_train)
model_b <- step(fullmodel,direction='backward')
model_f <- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel), direction='forward')
model_s <- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel), direction='both')

It turns out the 3 methods give me the same model with the same AIC: the full model including all the music feature variables. Therefore, the coefficients should be the same as in my original m1, and this is the model I will choose. A quick check of this claim follows the AIC comparison below.

# compare AIC for all models
AIC(m1)
## [1] 271909.5
AIC(model_b)
## [1] 271909.5
AIC(model_f)
## [1] 271909.5
AIC(model_s)
## [1] 271909.5
song_summary <- summary(model_b)
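To verify that the stepwise model really matches m1 coefficient-for-coefficient (a sanity check I added), we can align the terms by name and compare:

# Sketch: confirm model_b and m1 agree up to term ordering
all.equal(coef(m1)[names(coef(model_b))], coef(model_b))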

The in-sample MSE is 580.33 and the out-of-sample MSE is 570.73. Although the out-of-sample MSE is slightly lower than the in-sample MSE, they are similar and both too high, which is reflected in the low R-squared.

# In sample MSE
song_summary$sigma^2
## [1] 580.3332
# Out of sample MSE
pred <- predict(object = model_b, newdata = song_test)
mean((pred-song_test$track_popularity)^2)
## [1] 570.7285

This suggests the model lacks predictive power: to model the popularity score well, we would need many important variables beyond these music features. However, the coefficients are still useful as guidance on which features increase popularity and which decrease it.

song_summary$coefficients
##                      Estimate  Std. Error    t value      Pr(>|t|)
## (Intercept)       78.03062181 1.764708928  44.217276  0.000000e+00
## danceability       5.02058081 1.105794336   4.540248  5.640973e-06
## energy           -29.21519424 1.261289766 -23.162952 1.206173e-117
## key                0.05793791 0.039422620   1.469662  1.416641e-01
## loudness           1.50837031 0.067639685  22.300079 2.947581e-109
## mode1              0.62470137 0.287952114   2.169463  3.005552e-02
## speechiness       -7.86769079 1.434032978  -5.486409  4.135511e-08
## acousticness       3.05441244 0.769846956   3.967558  7.278424e-05
## instrumentalness -12.01750654 0.665854215 -18.048255  1.996758e-72
## liveness          -4.37436748 0.928340252  -4.712030  2.463769e-06
## valence            2.90727346 0.677409278   4.291753  1.778318e-05
## tempo              0.01856769 0.005384939   3.448078  5.653781e-04
## duration_min      -2.81857632 0.144418017 -19.516791  2.688298e-84

To conclude, music that is more danceable, loud, in major mode, acoustic, happy, and fast tends to be more popular, while music that is more energetic, speech-like, instrumental, live, and long tends to be less popular.
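To make these positive and negative contributions easier to compare at a glance, here is a small visualization sketch I added on top of the fitted model.

# Sketch: bar chart of the fitted coefficients, intercept dropped
coefs <- coef(model_b)[-1]
data.frame(feature = names(coefs), estimate = unname(coefs)) %>%
  ggplot(aes(x = reorder(feature, estimate), y = estimate)) +
  geom_col(fill = "#09557f") +
  coord_flip() +
  labs(x = NULL, y = "Coefficient estimate",
       title = "Estimated Effect of each Music Feature on Popularity")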

Summary

  • Problem statement This study uses the Spotify dataset to analyze the factors and music genres that affect a song's popularity. Specifically, it answers the question of how release year, genre, and music features affect popularity. In the EDA part, I used graphs to demonstrate the correlations between music features, show how popularity varies across the years, and describe each genre in terms of its features. In the supervised learning part, I used linear regression to demonstrate how each music feature affects popularity.

  • Insights and Implications
  1. From 2005 to 2020, music popularity increases as the release year becomes more recent. The most popular songs were released in approximately 2018 and 2019.
  2. Music that is more danceable, loud, in major mode, acoustic, happy, and fast tends to be more popular; customers can pick music with these features. Music that is more energetic, speech-like, instrumental, live, and long tends to be less popular; customers should avoid music with these features.
  3. Combining the second point with the genre feature profiles, we can conclude that Latin and pop music tend to be popular, R&B and rap enjoy medium popularity, and EDM and rock are not popular. Therefore, customers who crave popularity should use genres like Latin and pop, and avoid genres like EDM and rock.
  • Limitation of the Analysis To model the popularity score with music features alone, we are missing important variables, which results in a low R-squared and high MSE. To give the linear regression model more explanatory and predictive power, we would need to identify those missing variables.