Description of Data

I am using a new data set because the biopsy data set offered little to explore beyond the prediction of cancer. This Spotify data set was pulled using the Spotify API and the spotifyr package. It contains 256,789 tracks across 62 countries. It spans roughly two years, from 2017 to September 2018, and each country’s track list consists of the top 50 songs from each week.

I am interested in the relationship between a country and the valence of its popular music. Valence describes the emotional positivity of a track. It ranges from 0 to 1, with 1 being the happiest that a song can be. Here I am using only two countries, the United States and Argentina.

Importing Dataset

Here we import the Spotify data set.

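# Load the packages used below (pacman installs any that are missing)
pacman::p_load(Zelig,pander,texreg,lmtest,visreg,tidyverse,shiny,readr,knitr)
data=read_csv("/Users/gregmaghakian/Documents/Econ 392W/Code/SpotifyDataFiles/tracklist_with_audio_features.csv")
# Keep the two countries of interest and recode danceability to 0/1 (0 to .5 becomes 0, everything else 1)
data=filter(data,Country=="United States"|Country=="Argentina")%>%
  mutate(danceability = sjmisc::rec(danceability, rec = "0:.5=0; else=1"))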
head(data)
## # A tibble: 6 x 21
##   track_uri   Country  Position `Track Name`   Artist URL       Date      
##   <chr>       <chr>       <int> <chr>          <chr>  <chr>     <date>    
## 1 000xQL6tZN… United …       30 Still Got Time ZAYN   https://… 2017-03-30
## 2 000xQL6tZN… Argenti…       38 Still Got Time ZAYN   https://… 2017-03-30
## 3 003eoIwxET… United …       28 Growing Pains  Aless… https://… 2018-06-21
## 4 00B7TZ0Xaw… United …        8 Best Life (fe… Cardi… https://… 2018-04-12
## 5 00c9VpjXk7… United …       33 Lay It On Me … Kasbo  https://… 2017-07-06
## 6 00EPIEnX1J… Argenti…       43 No Me Acuerdo  Thalía https://… 2018-06-14
## # ... with 14 more variables: danceability <dbl>, energy <dbl>, key <chr>,
## #   loudness <dbl>, mode <chr>, speechiness <dbl>, acousticness <dbl>,
## #   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## #   duration_ms <dbl>, time_signature <int>, key_mode <chr>
dim(data)
## [1] 9000   21
length(unique(data$Country))
## [1] 2
range(data$danceability)
## [1] 0 1
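
The recode in the `mutate()` step above turns danceability into a 0/1 indicator split at 0.5. As a minimal sketch, assuming there are no missing values, the same split can be written in base R. The vector `x` below holds made-up scores for illustration and is not taken from the Spotify data.

```r
# Toy danceability scores on a 0-1 scale (made-up values, not from the data set)
x <- c(0.00, 0.25, 0.50, 0.51, 0.90)

# 1 = high danceability (strictly above 0.5), 0 = low danceability (0 to 0.5)
dance_high <- as.integer(x > 0.5)
dance_high
```

A score of exactly 0.5 lands in the low group, which matches the "0:.5=0" rule used above.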

Note: We recode danceability such that values from 0 to .5 are low danceability (0), and everything above that is high danceability (1).

Regression 1:

attach(data)
model1=lm(valence~Country)
model1
## 
## Call:
## lm(formula = valence ~ Country)
## 
## Coefficients:
##          (Intercept)  CountryUnited States  
##              0.53721              -0.09086

Interpretation Model 1:

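On average, tracks on the United States chart have a valence that is .09 lower than tracks on the Argentina chart (about 0.45 versus 0.54). In other words, Argentina’s popular music is happier than the United States’. Because Country is the only predictor, this coefficient is simply the difference in mean valence between the two countries.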
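
To make the reading of Model 1 concrete, here is a small sketch on simulated data (the data frame `toy` and its values are invented for illustration). With a single two-level predictor, the intercept is the mean of the reference group and the intercept plus the country coefficient is the mean of the other group.

```r
set.seed(1)

# Simulated valence scores for two countries (illustrative values only)
toy <- data.frame(
  Country = rep(c("Argentina", "United States"), each = 50),
  valence = c(rnorm(50, mean = 0.54, sd = 0.2),
              rnorm(50, mean = 0.45, sd = 0.2))
)

# Argentina comes first alphabetically, so it is the reference level
fit <- lm(valence ~ Country, data = toy)

# Group means computed directly
mean_arg <- mean(toy$valence[toy$Country == "Argentina"])
mean_us  <- mean(toy$valence[toy$Country == "United States"])

# Intercept reproduces the Argentina mean; intercept + slope the US mean
c(intercept = coef(fit)[["(Intercept)"]], mean_arg = mean_arg)
c(intercept_plus_us = sum(coef(fit)), mean_us = mean_us)
```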

Looking at the other features in this data set, let’s add danceability and energy to our model, since both plausibly help explain valence.

Regression 2:

model2=lm(valence~Country+danceability+energy)
model2
## 
## Call:
## lm(formula = valence ~ Country + danceability + energy)
## 
## Coefficients:
##          (Intercept)  CountryUnited States          danceability  
##              0.11910              -0.07043               0.16609  
##               energy  
##              0.41025

Interpretation Model 2:

Holding danceability and energy fixed, tracks on the United States chart have a valence that is on average .07 lower than tracks on the Argentina chart. United States music is still sadder than Argentina’s, but the country coefficient has shrunk in magnitude (from -.09 to -.07), which suggests that part of the raw gap is accounted for by differences in danceability and energy.

Regression 3:

Let’s add an interaction between country and danceability to better capture how valence differs between countries.

model3=lm(valence~Country+danceability+energy+Country*danceability)
model3
## 
## Call:
## lm(formula = valence ~ Country + danceability + energy + Country * 
##     danceability)
## 
## Coefficients:
##                       (Intercept)               CountryUnited States  
##                           0.09244                           -0.01913  
##                      danceability                             energy  
##                           0.19784                            0.40951  
## CountryUnited States:danceability  
##                          -0.06060
data %>%
  group_by(danceability,Country) %>%
  summarize(Mean_Valence= mean(valence)) %>%
  spread(danceability, Mean_Valence)%>%
  kable()
Country          0          1
Argentina        0.3452137  0.5696802
United States    0.3130535  0.4721134

Interpretation Model 3 and Tabling:

This table shows why an interaction term between country and danceability makes sense for valence. In both countries, danceable songs have a higher average valence, and are therefore happier, than less danceable songs. The size of that jump differs by country, though: about 0.22 in Argentina (0.57 versus 0.35) but only about 0.16 in the United States (0.47 versus 0.31). Model 3 shows the same pattern: the negative coefficient on the interaction term (-0.06) means that danceability raises valence less in the United States than in Argentina.
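
As a worked reading of Model 3, the coefficients printed above can be plugged into the fitted equation (rounded to three decimals). The symbols US, D, and E are shorthand introduced here for the United States dummy, the danceability dummy, and energy.

```latex
\widehat{\text{valence}} = 0.092 - 0.019\,US + 0.198\,D + 0.410\,E - 0.061\,(US \times D)

% Effect of danceability (D: 0 -> 1), holding energy fixed
\text{Argentina } (US = 0):\quad 0.198
\text{United States } (US = 1):\quad 0.198 - 0.061 = 0.137

% United States minus Argentina gap, holding energy fixed
\text{Low danceability } (D = 0):\quad -0.019
\text{High danceability } (D = 1):\quad -0.019 - 0.061 = -0.080
```

So the country gap is small among less danceable songs and noticeably larger among danceable ones.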

#htmlreg(list(model1,model2,model3))
Statistical models
                                   Model 1    Model 2    Model 3
(Intercept)                        0.54***    0.12***    0.09***
                                   (0.00)     (0.01)     (0.01)
CountryUnited States               -0.09***   -0.07***   -0.02
                                   (0.00)     (0.00)     (0.01)
danceability                                  0.17***    0.20***
                                              (0.01)     (0.01)
energy                                        0.41***    0.41***
                                              (0.01)     (0.01)
CountryUnited States:danceability                        -0.06***
                                                         (0.01)
R2                                 0.04       0.22       0.23
Adj. R2                            0.04       0.22       0.23
Num. obs.                          9000       9000       9000
RMSE                               0.23       0.20       0.20
***p < 0.001, **p < 0.01, *p < 0.05

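Looking at these results, we can see that R^2 rises from 0.04 to 0.22 and RMSE falls from 0.23 to 0.20 when danceability and energy are added, which indicates that Model 2 captures valence much better than Model 1. Adding the interaction term improves the fit only slightly (R^2 of 0.23, RMSE unchanged at 0.20).
In addition to this, our interaction term in model 3 is statistically significant at the 0.001 level. The Country coefficient in model 3 (-0.02) is no longer significant, but it now describes the country gap only among low-danceability songs.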
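
In-sample R^2 cannot go down when predictors are added to a linear model, so it helps to look at adjusted R^2 and a nested-model F test as well. The sketch below uses simulated data (`toy`, `group`, `x`, and `y` are invented names) and a hypothetical helper `fit_stats()`. It reports the residual standard error, which I am assuming is what the RMSE row above refers to.

```r
set.seed(7)
n <- 300

# Simulated data: a two-level group and one numeric predictor (illustrative only)
toy <- data.frame(
  group = rep(c("A", "B"), length.out = n),
  x = rnorm(n)
)
toy$y <- 0.5 - 0.1 * (toy$group == "B") + 0.4 * toy$x + rnorm(n, sd = 0.2)

m_small <- lm(y ~ group, data = toy)
m_large <- lm(y ~ group + x, data = toy)

# Hypothetical helper: collect in-sample fit statistics from an lm object
fit_stats <- function(m) {
  s <- summary(m)
  c(r2 = s$r.squared, adj_r2 = s$adj.r.squared, sigma = s$sigma)
}

rbind(small = fit_stats(m_small), large = fit_stats(m_large))

# F test of whether the extra predictor improves on the smaller model
anova(m_small, m_large)
```

The same calls could be pointed at model1, model2, and model3 to check the comparisons in the table.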

Subgroup modeling and tabling

Let’s run a model using two different groups.

#Dataset where danceability is 0, i.e. less danceable songs
undanceable=data%>%
  filter(danceability==0)
#Dataset where danceability is 1, i.e. danceable songs
danceable=data%>%
  filter(danceability==1)
model4=lm(valence~Country*danceability,data=data)
modelUndance=lm(valence~Country,data=undanceable)
modelDance=lm(valence~Country,data=danceable)
#htmlreg(list(modelUndance,modelDance,model4),custom.model.names = c("dance=0","dance=1","dance"))
Statistical models
                                   dance=0    dance=1    dance
(Intercept)                        0.35***    0.57***    0.35***
                                   (0.01)     (0.00)     (0.01)
CountryUnited States               -0.03**    -0.10***   -0.03**
                                   (0.01)     (0.01)     (0.01)
danceability                                             0.22***
                                                         (0.01)
CountryUnited States:danceability                        -0.07***
                                                         (0.01)
R2                                 0.01       0.05       0.13
Adj. R2                            0.01       0.05       0.13
Num. obs.                          1380       7620       9000
RMSE                               0.19       0.22       0.22
***p < 0.001, **p < 0.01, *p < 0.05

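Looking at this output, we can see that the intercept and the Country US coefficient in the dance model are the same as in the dance=0 model (0.35 and -0.03). The coefficient for the interaction term (-0.07) is the difference between the two subgroups’ coefficients for Country US (-0.10 minus -0.03). Likewise, the intercept plus the danceability coefficient (0.35 + 0.22) gives the dance=1 intercept (0.57).

Note: the dance model contains only Country, danceability, and their interaction, not energy, for ease of interpretation.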
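
The link between the pooled interaction model and the two subgroup models can be checked on simulated data. This is a minimal sketch: the data frame `toy` and the coefficients used to generate it are invented for illustration, loosely echoing the table above.

```r
set.seed(42)
n <- 400

# Simulated two-country data with a 0/1 danceability flag (illustrative only)
toy <- data.frame(
  Country = sample(c("Argentina", "United States"), n, replace = TRUE),
  danceability = sample(0:1, n, replace = TRUE)
)
is_us <- toy$Country == "United States"
toy$valence <- 0.35 - 0.03 * is_us + 0.22 * toy$danceability -
  0.07 * is_us * toy$danceability + rnorm(n, sd = 0.2)

# Pooled model with interaction, and one model per danceability subgroup
pooled <- lm(valence ~ Country * danceability, data = toy)
low    <- lm(valence ~ Country, data = subset(toy, danceability == 0))
high   <- lm(valence ~ Country, data = subset(toy, danceability == 1))

us <- "CountryUnited States"
b  <- coef(pooled)

# Pooled Country coefficient equals the Country coefficient in the low subgroup
c(pooled = b[[us]], low_subgroup = coef(low)[[us]])

# Interaction equals the difference between the subgroup Country coefficients
c(interaction = b[["CountryUnited States:danceability"]],
  subgroup_diff = coef(high)[[us]] - coef(low)[[us]])
```

The match is exact because both predictors are binary, so the interaction model fits one mean per country-by-danceability cell, just as the subgroup models do.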

Visualization of Model 3

visreg(model3, "Country",by="danceability",scale="response")

Looking at this graphic, we can see that valence, or the emotion of music, depends not only on country but also on danceability. Argentina’s fitted valence is higher than the United States’ in both danceability groups, and the gap is wider among danceable songs. Also, when a song is danceable, the valence is considerably higher than when the music is not as danceable.

Food for thought

Looking at valence in Argentina and the US with regards to danceability, we can start to ask questions and form hypotheses about why we are getting the results mentioned above. I believe that there could be a cultural aspect to Argentina in which its songs tend to be more danceable and upbeat, perhaps reflecting the role of dance music in Latin American culture. Furthermore, we could ask why people in the US listen to sadder music, even though the US is an economic powerhouse. These models describe associations in chart data, so they cannot answer such questions on their own.