Project Overview

Main Question: Do the most popular songs of 2017 reflect the economic sentiment of the United States in that year?

1. Importing the CSVs

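library(tidyverse)  # read_csv(), dplyr verbs, ggplot2
library(lubridate)  # ymd(), year(), month()
library(scales)     # rescale()
spotifydata2017_raw = read_csv("../louisetnguyen/Desktop/STAT 240/Personal Project/Data/featuresdf.csv")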
## Rows: 100 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): id, name, artists
## dbl (13): danceability, energy, key, loudness, mode, speechiness, acousticne...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(spotifydata2017_raw)
## Rows: 100
## Columns: 16
## $ id               <chr> "7qiZfU4dY1lWllzX7mPBI", "5CtI0qwDJkDQGwXD1H1cL", "4a…
## $ name             <chr> "Shape of You", "Despacito - Remix", "Despacito (Feat…
## $ artists          <chr> "Ed Sheeran", "Luis Fonsi", "Luis Fonsi", "The Chains…
## $ danceability     <dbl> 0.825, 0.694, 0.660, 0.617, 0.609, 0.904, 0.640, 0.72…
## $ energy           <dbl> 0.652, 0.815, 0.786, 0.635, 0.668, 0.611, 0.533, 0.76…
## $ key              <dbl> 1, 2, 2, 11, 7, 1, 0, 6, 1, 0, 11, 2, 5, 3, 2, 6, 1, …
## $ loudness         <dbl> -3.183, -4.328, -4.757, -6.769, -4.284, -6.842, -6.59…
## $ mode             <dbl> 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1,…
## $ speechiness      <dbl> 0.0802, 0.1200, 0.1700, 0.0317, 0.0367, 0.0888, 0.070…
## $ acousticness     <dbl> 0.581000, 0.229000, 0.209000, 0.049800, 0.055200, 0.0…
## $ instrumentalness <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 1.44e-05, 0.00e+00, 2.0…
## $ liveness         <dbl> 0.0931, 0.0924, 0.1120, 0.1640, 0.1670, 0.0976, 0.086…
## $ valence          <dbl> 0.9310, 0.8130, 0.8460, 0.4460, 0.8110, 0.4000, 0.515…
## $ tempo            <dbl> 95.977, 88.931, 177.833, 103.019, 80.924, 150.020, 99…
## $ duration_ms      <dbl> 233713, 228827, 228200, 247160, 288600, 177000, 22078…
## $ time_signature   <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
fred_raw = read_csv("../louisetnguyen/Desktop/STAT 240/Personal Project/Data/UMCSENT.csv")
## Rows: 874 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (1): UMCSENT
## date (1): observation_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(fred_raw)
## Rows: 874
## Columns: 2
## $ observation_date <date> 1952-11-01, 1952-12-01, 1953-01-01, 1953-02-01, 1953…
## $ UMCSENT          <dbl> 86.2, NA, NA, 90.7, NA, NA, NA, NA, NA, 80.8, NA, NA,…

2. Cleaning up the Data

spotifydata = spotifydata2017_raw %>% select(-duration_ms,-time_signature,-speechiness,-id, -instrumentalness,-mode,-key)
glimpse(spotifydata)
## Rows: 100
## Columns: 9
## $ name         <chr> "Shape of You", "Despacito - Remix", "Despacito (Featurin…
## $ artists      <chr> "Ed Sheeran", "Luis Fonsi", "Luis Fonsi", "The Chainsmoke…
## $ danceability <dbl> 0.825, 0.694, 0.660, 0.617, 0.609, 0.904, 0.640, 0.726, 0…
## $ energy       <dbl> 0.652, 0.815, 0.786, 0.635, 0.668, 0.611, 0.533, 0.769, 0…
## $ loudness     <dbl> -3.183, -4.328, -4.757, -6.769, -4.284, -6.842, -6.596, -…
## $ acousticness <dbl> 0.581000, 0.229000, 0.209000, 0.049800, 0.055200, 0.00025…
## $ liveness     <dbl> 0.0931, 0.0924, 0.1120, 0.1640, 0.1670, 0.0976, 0.0864, 0…
## $ valence      <dbl> 0.9310, 0.8130, 0.8460, 0.4460, 0.8110, 0.4000, 0.5150, 0…
## $ tempo        <dbl> 95.977, 88.931, 177.833, 103.019, 80.924, 150.020, 99.968…
release_dates = c("2017-01-06","2017-04-14","2017-01-12","2017-02-22","2017-02-22","2017-03-30","2017-02-16","2017-04-21", "2017-01-30","2017-12-09","2017-2-26","2017-01-13","2017-02-23","2017-04-21","2017-04-18","2017-01-31","2017-02-24","2017-01-06","2016-10-26","2017-02-01","2017-06-30","2017-04-27","2016-09-16","2017-04-20","2016-09-16","2017-01-13","2017-03-17","2017-01-27","2016-07-29","2017-03-17","2016-11-24","2016-11-25","2017-06-16","2017-02-23","2017-07-07","2017-04-27","2017-03-28","2017-09-15","2017-05-19","2017-06-09","2017-09-26","2016-10-13","2017-06-15","2017-05-17","2017-04-21","2016-12-02","2016-05-23","2016-10-28","2016-10-28","2017-09-08","2016-11-04","2016-10-24","2016-08-05","2017-05-27","2016-04-05","2017-02-24","2017-03-03","2017-07-11","2016-11-17","2016-11-18","2017-04-14","2017-01-16","2016-12-09","2017-02-10","2017-03-31","2016-10-11","2016-11-18","2017-05-04","2017-03-31","2016-10-31","2017-04-07","2016-09-02","2017-06-13","2017-05-26","2017-02-14","2017-08-11","2017-09-08","2016-12-10","2017-08-24","2017-06-15","2016-07-22","2016-08-26","2016-09-16","2016-08-30","2016-08-05","2017-05-19","2016-07-29","2016-07-22","2017-05-11","2016-10-14","2016-10-21"
,"2017-05-26","2017-03-17","2017-08-17","2017-07-07","2016-10-14","2016-02-05","2017-02-02","2017-04-04","2017-09-08"            )

release_dates = ymd(release_dates)
spotifydata$release_dates = release_dates
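
Because the release dates are typed by hand and attached to the songs by position, it may be worth checking that there is exactly one date per song and that every string parsed. A minimal sketch of such a check, using a three-element stand-in vector (`example_dates`) and an assumed song count (`n_songs`), both hypothetical and not the project's data:

```r
library(lubridate)

# Stand-in for the hand-entered vector (hypothetical subset, not the full list).
example_dates <- c("2017-01-06", "2017-04-14", "2016-10-26")
n_songs <- 3  # assumed number of rows in the table the dates attach to

parsed <- ymd(example_dates)

# One date per song, and no string failed to parse (ymd() yields NA on failure).
stopifnot(length(parsed) == n_songs)
stopifnot(!anyNA(parsed))

# Count how many of the songs were actually released in 2017.
released_2017 <- sum(year(parsed) == 2017)
print(released_2017)
```

A count like `released_2017` also shows how many songs fall outside the year being compared against.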

fred = fred_raw %>% mutate(year=year(observation_date)) %>% filter(year==2017)
glimpse(fred)
## Rows: 12
## Columns: 3
## $ observation_date <date> 2017-01-01, 2017-02-01, 2017-03-01, 2017-04-01, 2017…
## $ UMCSENT          <dbl> 98.5, 96.3, 96.9, 97.0, 97.1, 95.0, 93.4, 96.8, 95.1,…
## $ year             <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017,…

3. Calculating Average “Happiness”

  • The composite happiness score (CHS) of each song is the unweighted mean of four features: danceability, energy, liveness, and valence. Loudness and tempo are not part of the score; the code instead stores the mean tempo across all 100 songs in avg_tempo. Once every song has a CHS, it can be plotted against the economic sentiment index from FRED.
spotifydata = spotifydata %>% mutate(chs =((danceability + energy + liveness + valence)/4), avg_tempo = mean(tempo))
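
As a worked example of the score, the sketch below applies the same four-feature average to a single song, using the first-row values for "Shape of You" as printed in the `glimpse()` output above. The variable names are just for illustration.

```r
# Feature values for "Shape of You", copied from the glimpse() output.
danceability <- 0.825
energy       <- 0.652
liveness     <- 0.0931
valence      <- 0.9310

# Composite happiness score: unweighted mean of the four features.
chs <- (danceability + energy + liveness + valence) / 4
print(chs)
```

The four features sum to 2.5011, so the score for this song comes out at roughly 0.625.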

4. Plotting

ggplot(spotifydata, aes(x=chs)) + geom_histogram(fill = "red", position = "identity", alpha = 0.5) + labs(x="Composite Happiness Score", y="Count", title = "Distribution of Composite Happiness Scores within Top 100 2017 Songs")
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

  • Most of the top 100 songs have a composite happiness score near 0.5. The histogram shows slight signs of bimodality, but its strongest peak is around 0.5. There is also a slight left skew, with a longer tail toward lower scores, which suggests that the mode sits above the median.
ggplot(fred, aes(x=month(observation_date), y= UMCSENT)) + geom_line(color="blue") + labs(x= "Month (in 2017)", y="Consumer Sentiment Index", title = "Consumer Sentiment Index in 2017")

  • The Consumer Sentiment Index (CSI) stays in the 90s for the majority of 2017. There is a slight drop in the middle of the year, after which the index rises again. In December, however, it drops sharply.
fredmonthly = fred %>% mutate(month = month(observation_date))
spottymonth = spotifydata %>% mutate(month=month(release_dates))

combined2017data = inner_join(spottymonth,fredmonthly, by = "month")

ggplot(combined2017data, aes(x=month)) + geom_point(aes(y=chs, color="CHS")) + geom_point(aes(y=rescale(UMCSENT, to = c(0,1)), color ="UMCSENT")) + labs(x="Month in 2017",y="Consumer Sentiment Index & Composite Happiness Score", title = "Consumer Sentiment Index against Composite Happiness Score of Top 100 Songs in 2017") 
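
The combined plot puts both series on one axis by min-max rescaling UMCSENT to the 0 to 1 range. As a rough illustration of what `rescale(x, to = c(0, 1))` computes with its default settings, the base-R sketch below applies the same formula to the first nine monthly values visible in the `glimpse(fred)` output. The remaining months are truncated there, so this is not the full 2017 series, and `umcsent_partial` is a name introduced only for this example.

```r
# First nine 2017 readings as printed by glimpse(fred); later months are cut off.
umcsent_partial <- c(98.5, 96.3, 96.9, 97.0, 97.1, 95.0, 93.4, 96.8, 95.1)

# Min-max rescaling: the smallest value maps to 0 and the largest to 1.
rescaled <- (umcsent_partial - min(umcsent_partial)) /
  (max(umcsent_partial) - min(umcsent_partial))

print(round(rescaled, 3))
```

Because the rescaled index always spans the full 0 to 1 range, its vertical spread in the plot reflects the rescaling rather than the size of the month-to-month changes.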

5. Conclusion

The plot above shows no obvious pattern between the Composite Happiness Score of the top 100 songs of 2017 and the 2017 monthly Consumer Sentiment Index. A Pearson correlation test on the monthly averages checks this more formally.

monthlyspottymean = spottymonth %>% group_by(month) %>% summarize(chs=mean(chs, na.rm=TRUE))

cor.test(monthlyspottymean$chs,fredmonthly$UMCSENT)
## 
##  Pearson's product-moment correlation
## 
## data:  monthlyspottymean$chs and fredmonthly$UMCSENT
## t = 0.93528, df = 10, p-value = 0.3717
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.34672  0.73748
## sample estimates:
##       cor 
## 0.2836165
  • The correlation coefficient is r = 0.28, a weak positive correlation: in months when consumer sentiment is higher, the average CHS tends to be slightly higher as well. However, the p-value of 0.3717 is well above 0.05 and the 95 percent confidence interval (-0.35 to 0.74) includes zero, so the result is not statistically significant.
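
To see how the reported test statistic follows from the correlation, the sketch below recomputes t and the two-sided p-value from r and the number of monthly pairs, using the standard formula t = r * sqrt(n - 2) / sqrt(1 - r^2). It assumes n = 12, which matches the df = 10 in the output above.

```r
r <- 0.2836165  # sample correlation reported by cor.test()
n <- 12         # one pair per month, so df = n - 2 = 10

# t statistic for testing whether the true correlation is zero.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom.
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(round(t_stat, 5))
print(round(p_value, 4))
```

With only 12 pairs, even a moderate correlation gives a small t statistic, which is one reason a longer time frame could make the test more informative.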

6. Future Improvements

Several improvements could strengthen this analysis:
  • Extend the time frame to 2016 through 2018 to get more data points and see whether a relationship appears.
  • Use a different (ideally larger) Spotify dataset. The one chosen here has no release-date variable, so the dates had to be entered by hand.
  • Explore additional features, for example by working more with tempo.
Overall, a longer time frame, better data quality, and more features would strengthen the analysis and could reveal a stronger correlation.
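
One way to prepare for a longer time frame is to join on year and month together. The release dates already include 2016, and the join in the plotting section matches on month alone, so a song released in October 2016 is paired with the October 2017 reading. The sketch below uses small made-up tables (`songs`, `sentiment`, and the placeholder `index` values are all hypothetical) to show the idea. It is an illustration, not the analysis actually run above.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data standing in for the song and sentiment tables.
songs <- tibble(
  name = c("song_a", "song_b", "song_c"),
  release_date = ymd(c("2016-10-26", "2017-01-06", "2017-10-20")),
  chs = c(0.50, 0.62, 0.55)
)
sentiment <- tibble(
  observation_date = ymd(c("2016-10-01", "2017-01-01", "2017-10-01")),
  index = c(1, 2, 3)  # placeholder values, not real UMCSENT readings
)

# Key both tables on year AND month so that releases from different years
# are matched to the reading from their own year.
songs_keyed <- songs %>%
  mutate(year = year(release_date), month = month(release_date))
sentiment_keyed <- sentiment %>%
  mutate(year = year(observation_date), month = month(observation_date))

joined <- inner_join(songs_keyed, sentiment_keyed, by = c("year", "month"))
print(joined)
```

With the real data, the sentiment table would need to keep every year of interest instead of being filtered to 2017 first.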

Although the correlation between the two variables is not statistically significant, the positive estimate leaves open the possibility of a weak relationship between CHS and CSI in 2017. There is no evidence of a strong correlation, but the result points to directions for further analysis.