Main Question: Do the most popular songs of 2017 reflect the economic sentiment of the United States in that year?
Plan: Take Spotify’s top 100 songs of 2017 and plot them against the economic mood index (the Consumer Sentiment Index, UMCSENT) for 2017. Then determine whether any conclusions can be drawn about song popularity versus economic sentiment.
Hypothesis: In 2017, the most popular songs had higher values of danceability, tempo, and other features that make a song “happy” and “upbeat.”
spotifydata2017_raw = read_csv("../louisetnguyen/Desktop/STAT 240/Personal Project/Data/featuresdf.csv")
## Rows: 100 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): id, name, artists
## dbl (13): danceability, energy, key, loudness, mode, speechiness, acousticne...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(spotifydata2017_raw)
## Rows: 100
## Columns: 16
## $ id <chr> "7qiZfU4dY1lWllzX7mPBI", "5CtI0qwDJkDQGwXD1H1cL", "4a…
## $ name <chr> "Shape of You", "Despacito - Remix", "Despacito (Feat…
## $ artists <chr> "Ed Sheeran", "Luis Fonsi", "Luis Fonsi", "The Chains…
## $ danceability <dbl> 0.825, 0.694, 0.660, 0.617, 0.609, 0.904, 0.640, 0.72…
## $ energy <dbl> 0.652, 0.815, 0.786, 0.635, 0.668, 0.611, 0.533, 0.76…
## $ key <dbl> 1, 2, 2, 11, 7, 1, 0, 6, 1, 0, 11, 2, 5, 3, 2, 6, 1, …
## $ loudness <dbl> -3.183, -4.328, -4.757, -6.769, -4.284, -6.842, -6.59…
## $ mode <dbl> 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1,…
## $ speechiness <dbl> 0.0802, 0.1200, 0.1700, 0.0317, 0.0367, 0.0888, 0.070…
## $ acousticness <dbl> 0.581000, 0.229000, 0.209000, 0.049800, 0.055200, 0.0…
## $ instrumentalness <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 1.44e-05, 0.00e+00, 2.0…
## $ liveness <dbl> 0.0931, 0.0924, 0.1120, 0.1640, 0.1670, 0.0976, 0.086…
## $ valence <dbl> 0.9310, 0.8130, 0.8460, 0.4460, 0.8110, 0.4000, 0.515…
## $ tempo <dbl> 95.977, 88.931, 177.833, 103.019, 80.924, 150.020, 99…
## $ duration_ms <dbl> 233713, 228827, 228200, 247160, 288600, 177000, 22078…
## $ time_signature <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
fred_raw = read_csv("../louisetnguyen/Desktop/STAT 240/Personal Project/Data/UMCSENT.csv")
## Rows: 874 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (1): UMCSENT
## date (1): observation_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(fred_raw)
## Rows: 874
## Columns: 2
## $ observation_date <date> 1952-11-01, 1952-12-01, 1953-01-01, 1953-02-01, 1953…
## $ UMCSENT <dbl> 86.2, NA, NA, 90.7, NA, NA, NA, NA, NA, 80.8, NA, NA,…
spotifydata = spotifydata2017_raw %>% select(-duration_ms,-time_signature,-speechiness,-id, -instrumentalness,-mode,-key)
glimpse(spotifydata)
## Rows: 100
## Columns: 9
## $ name <chr> "Shape of You", "Despacito - Remix", "Despacito (Featurin…
## $ artists <chr> "Ed Sheeran", "Luis Fonsi", "Luis Fonsi", "The Chainsmoke…
## $ danceability <dbl> 0.825, 0.694, 0.660, 0.617, 0.609, 0.904, 0.640, 0.726, 0…
## $ energy <dbl> 0.652, 0.815, 0.786, 0.635, 0.668, 0.611, 0.533, 0.769, 0…
## $ loudness <dbl> -3.183, -4.328, -4.757, -6.769, -4.284, -6.842, -6.596, -…
## $ acousticness <dbl> 0.581000, 0.229000, 0.209000, 0.049800, 0.055200, 0.00025…
## $ liveness <dbl> 0.0931, 0.0924, 0.1120, 0.1640, 0.1670, 0.0976, 0.0864, 0…
## $ valence <dbl> 0.9310, 0.8130, 0.8460, 0.4460, 0.8110, 0.4000, 0.5150, 0…
## $ tempo <dbl> 95.977, 88.931, 177.833, 103.019, 80.924, 150.020, 99.968…
release_dates = c("2017-01-06","2017-04-14","2017-01-12","2017-02-22","2017-02-22","2017-03-30","2017-02-16","2017-04-21", "2017-01-30","2017-12-09","2017-2-26","2017-01-13","2017-02-23","2017-04-21","2017-04-18","2017-01-31","2017-02-24","2017-01-06","2016-10-26","2017-02-01","2017-06-30","2017-04-27","2016-09-16","2017-04-20","2016-09-16","2017-01-13","2017-03-17","2017-01-27","2016-07-29","2017-03-17","2016-11-24","2016-11-25","2017-06-16","2017-02-23","2017-07-07","2017-04-27","2017-03-28","2017-09-15","2017-05-19","2017-06-09","2017-09-26","2016-10-13","2017-06-15","2017-05-17","2017-04-21","2016-12-02","2016-05-23","2016-10-28","2016-10-28","2017-09-08","2016-11-04","2016-10-24","2016-08-05","2017-05-27","2016-04-05","2017-02-24","2017-03-03","2017-07-11","2016-11-17","2016-11-18","2017-04-14","2017-01-16","2016-12-09","2017-02-10","2017-03-31","2016-10-11","2016-11-18","2017-05-04","2017-03-31","2016-10-31","2017-04-07","2016-09-02","2017-06-13","2017-05-26","2017-02-14","2017-08-11","2017-09-08","2016-12-10","2017-08-24","2017-06-15","2016-07-22","2016-08-26","2016-09-16","2016-08-30","2016-08-05","2017-05-19","2016-07-29","2016-07-22","2017-05-11","2016-10-14","2016-10-21"
,"2017-05-26","2017-03-17","2017-08-17","2017-07-07","2016-10-14","2016-02-05","2017-02-02","2017-04-04","2017-09-08" )
release_dates = ymd(release_dates)
spotifydata$release_dates = release_dates
fred = fred_raw %>% mutate(year=year(observation_date)) %>% filter(year==2017)
glimpse(fred)
## Rows: 12
## Columns: 3
## $ observation_date <date> 2017-01-01, 2017-02-01, 2017-03-01, 2017-04-01, 2017…
## $ UMCSENT <dbl> 98.5, 96.3, 96.9, 97.0, 97.1, 95.0, 93.4, 96.8, 95.1,…
## $ year <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017,…
spotifydata = spotifydata %>% mutate(chs =((danceability + energy + liveness + valence)/4), avg_tempo = mean(tempo))
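The composite happiness score (CHS) computed above is an unweighted mean of four audio features. As a quick check, here is a minimal sketch that recomputes it for the first two songs, using the values printed by `glimpse()`. The data frame name `songs` is a placeholder introduced only for this illustration.

```r
library(dplyr)

# First two rows of the Spotify data, copied from the glimpse() output
songs <- tibble(
  name         = c("Shape of You", "Despacito - Remix"),
  danceability = c(0.825, 0.694),
  energy       = c(0.652, 0.815),
  liveness     = c(0.0931, 0.0924),
  valence      = c(0.9310, 0.8130)
)

# CHS = simple average of the four features, each already on a 0-1 scale
songs <- songs %>%
  mutate(chs = (danceability + energy + liveness + valence) / 4)

songs$chs
```

Note that `mean(tempo)` inside an ungrouped `mutate()` returns the same overall average on every row, so `avg_tempo` is a constant column rather than a per-song value.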
ggplot(spotifydata, aes(x=chs)) + geom_histogram(fill="red", alpha = 0.5) + labs(x="Composite Happiness Score", y="Count", title = "Distribution of Composite Happiness Scores within Top 100 2017 Songs")
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
ggplot(fred, aes(x=month(observation_date), y= UMCSENT)) + geom_line(color="blue") + labs(x= "Month (in 2017)", y="Consumer Sentiment Index", title = "Consumer Sentiment Index in 2017")
fredmonthly = fred %>% mutate(month = month(observation_date))
spottymonth = spotifydata %>% mutate(month=month(release_dates))
combined2017data = inner_join(spottymonth,fredmonthly, by = "month")
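One thing to keep in mind with a join on `month` alone: `release_dates` includes songs released in 2016, and those would be matched to the same calendar month of 2017. A possible alternative, sketched here under the assumption that only same-year matches are wanted, is to join on both year and month. The names `songs`, `sentiment`, and `joined` are hypothetical; the dates and index values are copied from the output above.

```r
library(dplyr)
library(lubridate)

# Three release dates from the list above: two from 2017, one from April 2016
songs <- tibble(
  release_dates = ymd(c("2017-01-06", "2017-04-14", "2016-04-05"))
) %>%
  mutate(year = year(release_dates), month = month(release_dates))

# January-April 2017 sentiment values as printed by glimpse(fred)
sentiment <- tibble(
  observation_date = ymd(c("2017-01-01", "2017-02-01", "2017-03-01", "2017-04-01")),
  UMCSENT = c(98.5, 96.3, 96.9, 97.0)
) %>%
  mutate(year = year(observation_date), month = month(observation_date))

# Joining on year AND month keeps the April 2016 release from being
# paired with the April 2017 sentiment value
joined <- inner_join(songs, sentiment, by = c("year", "month"))
joined
```

With a month-only key, the April 2016 song would have been assigned the April 2017 index value; with the two-column key it is dropped, since the sentiment table here covers only 2017.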
ggplot(combined2017data, aes(x=month)) + geom_point(aes(y=chs, color="CHS")) + geom_point(aes(y=rescale(UMCSENT, to = c(0,1)), color ="UMCSENT")) + labs(x="Month in 2017",y="Consumer Sentiment Index & Composite Happiness Score", title = "Consumer Sentiment Index against Composite Happiness Score of Top 100 Songs in 2017")
Based on the plot, the Composite Happiness Score of the top 100 songs of 2017 shows no obvious pattern or relationship with the 2017 monthly Consumer Sentiment Index.
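For reading the plot, it may help to see what the rescaling step does to the sentiment series. By default `rescale(x, to = c(0, 1))` from the scales package performs min-max scaling over the range of `x`. The sketch below reproduces that in base R on the nine monthly values visible in the `glimpse(fred)` output (the remaining months are truncated there, so this is only a partial illustration). The helper name `minmax` is hypothetical.

```r
# First nine monthly UMCSENT values for 2017, as printed by glimpse(fred)
umcsent <- c(98.5, 96.3, 96.9, 97.0, 97.1, 95.0, 93.4, 96.8, 95.1)

# Min-max scaling: the smallest value maps to 0, the largest to 1
minmax <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

scaled <- minmax(umcsent)
round(scaled, 3)
```

Because the rescaled index always spans the full 0 to 1 interval, a change of a few index points can look as large on the plot as the entire spread of CHS values, which is worth remembering when comparing the two sets of points visually.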
monthlyspottymean = spottymonth %>% group_by(month) %>% summarize(chs=mean(chs, na.rm=TRUE))
cor.test(monthlyspottymean$chs,fredmonthly$UMCSENT)
##
## Pearson's product-moment correlation
##
## data: monthlyspottymean$chs and fredmonthly$UMCSENT
## t = 0.93528, df = 10, p-value = 0.3717
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.34672 0.73748
## sample estimates:
## cor
## 0.2836165
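To make the test output easier to interpret, the t statistic and p-value can be recovered from the reported correlation and degrees of freedom using the standard relation t = r * sqrt(df) / sqrt(1 - r^2). The following is a small sketch in base R; the variable names are introduced here for illustration, and the result should agree with the values printed above up to rounding.

```r
r  <- 0.2836165  # sample correlation reported by cor.test()
df <- 10         # degrees of freedom reported by cor.test() (n - 2, with n = 12 months)

# t statistic for testing whether the true correlation is zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with df degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = df)

c(t = t_stat, p = p_value)
```

With only 12 monthly observations, a correlation of this size is well within what chance alone could produce, which is also why the 95 percent confidence interval is so wide.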
Some possible improvements to this analysis:
- Extend the data to cover 2016-2018 to have more observations and see whether that gives any indication of a relationship.
- Internal criticism: The chosen Spotify dataset did not include a release date variable, so the dates had to be added by hand. Finding other (even larger) datasets that include it would be more useful and efficient.
- Overall: Extending the time frame, improving data quality, and exploring additional features (for example, working more with tempo) would strengthen this analysis and could reveal a stronger correlation.
Although the correlation between these two variables is not statistically significant (r ≈ 0.28, p ≈ 0.37), a weak relationship between CHS and the Consumer Sentiment Index (CSI) in 2017 cannot be ruled out. While there is no evidence of a strong correlation, this could open up pathways for more data analysis in the future.