I selected this dataset because of my interest in Spotify. Spotify is an audio streaming and media services provider. It is one of the largest music streaming service providers, with over 527 million monthly active users, including 210 million paying subscribers, as of March 2023. Working with this dataset piqued my interest in learning more about Spotify's processes and algorithms for suggesting songs to subscribers. This is a topic that we have discussed in the context of data and ethics, and it is very important that subscribers know how they are being influenced. Below, I provide a list of the variables in this dataset, which consists of 2000 observations and 18 variables.
Throughout this document, I share the steps for wrangling the data and highlight some of the more notable findings.
Variables in the Spotify Songs Dataset
According to the Spotify website, all of their songs are given a score in each of the following categories (taken from the Spotify API documentation, https://developer.spotify.com/documentation/web-api/reference/):
Mood: Danceability, Valence, Energy, Tempo
Properties: Loudness, Speechiness, Instrumentalness
Context: Liveness, Acousticness
The dataset contains the following variables:
Artist: Name of the Artist.
Song: Name of the Track.
Genre: Genre of the track.
Duration_ms: Duration of the track in milliseconds.
Explicit: The lyrics or content of a song or a music video contain one or more of the criteria which could be considered offensive or unsuitable for children.
Year: Release Year of the track.
Popularity: The higher the value the more popular the song is.
Instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
Mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
Key: The key the track is in. Integers map to pitches using standard Pitch Class notation. If no key was detected, the value is -1.
Danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
Energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.
Tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
Loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
Speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
Liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
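To make the integer encodings of Key and Mode easier to read, here is a minimal sketch that decodes them into labels. The helper names `key_to_pitch()` and `mode_to_label()` are hypothetical (they are not part of the dataset or the Spotify API); the pitch names follow standard pitch class notation (0 = C, 1 = C#/Db, and so on up to 11 = B).

```r
# Pitch names in standard pitch class order (0 = C ... 11 = B).
pitch_classes <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                   "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

# Hypothetical helper: decode a key integer into a pitch name.
# A value of -1 (no key detected) or anything outside 0-11 becomes NA.
key_to_pitch <- function(key) {
  ifelse(key >= 0 & key <= 11,
         pitch_classes[pmax(key, 0) + 1],
         NA_character_)
}

# Hypothetical helper: decode mode (1 = major, 0 = minor).
mode_to_label <- function(mode) {
  ifelse(mode == 1, "major", "minor")
}

key_to_pitch(c(6, 7, 11, -1))   # "F#/Gb" "G" "B" NA
mode_to_label(c(1, 0))          # "major" "minor"
```

These helpers are only for reading the data; the analysis below works with the numeric columns directly.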
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Getting the working directory
getwd()
## [1] "/Users/grayce/Desktop"
Setting the working directory
setwd("/Users/grayce/Desktop")
Reading the dataset, Spotify Songs.
spotifysongs <- read_csv("spotifysongs2.csv")
## Rows: 2000 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): artist, song, genre
## dbl (14): duration_ms, year, popularity, danceability, energy, key, loudness...
## lgl (1): explicit
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Loading the libraries
library(dplyr)
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(tidyr)
library(tidyverse)
library(ggcorrplot)
library(treemap)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(devtools)
## Loading required package: usethis
library(dslabs)
##
## Attaching package: 'dslabs'
## The following object is masked from 'package:highcharter':
##
## stars
library(readr)
I am curious about the scope of the dataset (years and number of releases). It covers 1998 through 2020, with 2000 observations and 18 variables.
table(spotifysongs$year)
##
## 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
## 1 38 74 108 90 97 96 104 95 94 97 84 107 99 115 89
## 2014 2015 2016 2017 2018 2019 2020
## 104 99 99 111 107 89 3
I am removing all data from 2000 and earlier because 1998, 1999, and 2000 each have fewer than 75 releases.
sample1df <- spotifysongs %>%
filter(year>2000)
sample1df
## # A tibble: 1,887 × 18
## artist song duration_ms explicit year popularity danceability energy key
## <chr> <chr> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Modjo Lady… 307153 FALSE 2001 77 0.72 0.808 6
## 2 Gigi D… L'Am… 238759 FALSE 2011 1 0.617 0.728 7
## 3 Aaliyah Try … 284000 FALSE 2002 53 0.797 0.622 6
## 4 Darude Sand… 225493 FALSE 2001 69 0.528 0.965 11
## 5 Chicane Don'… 210786 FALSE 2016 47 0.644 0.72 10
## 6 LeAnn … I Ne… 229826 FALSE 2001 61 0.478 0.736 7
## 7 Samant… Gott… 201946 FALSE 2018 43 0.729 0.632 0
## 8 Next Wifey 243666 FALSE 2004 52 0.829 0.652 7
## 9 Janet … Does… 265026 FALSE 2001 47 0.771 0.796 5
## 10 LeAnn … Can'… 215506 FALSE 2001 65 0.628 0.834 6
## # ℹ 1,877 more rows
## # ℹ 9 more variables: loudness <dbl>, mode <dbl>, speechiness <dbl>,
## # acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## # tempo <dbl>, genre <chr>
I am also removing 2020, the only remaining year with fewer than 75 releases; only three (3) releases were reported for it. The filter below previews the result without saving it.
sample1df %>%
filter(year<2020)
## # A tibble: 1,884 × 18
## artist song duration_ms explicit year popularity danceability energy key
## <chr> <chr> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Modjo Lady… 307153 FALSE 2001 77 0.72 0.808 6
## 2 Gigi D… L'Am… 238759 FALSE 2011 1 0.617 0.728 7
## 3 Aaliyah Try … 284000 FALSE 2002 53 0.797 0.622 6
## 4 Darude Sand… 225493 FALSE 2001 69 0.528 0.965 11
## 5 Chicane Don'… 210786 FALSE 2016 47 0.644 0.72 10
## 6 LeAnn … I Ne… 229826 FALSE 2001 61 0.478 0.736 7
## 7 Samant… Gott… 201946 FALSE 2018 43 0.729 0.632 0
## 8 Next Wifey 243666 FALSE 2004 52 0.829 0.652 7
## 9 Janet … Does… 265026 FALSE 2001 47 0.771 0.796 5
## 10 LeAnn … Can'… 215506 FALSE 2001 65 0.628 0.834 6
## # ℹ 1,874 more rows
## # ℹ 9 more variables: loudness <dbl>, mode <dbl>, speechiness <dbl>,
## # acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## # tempo <dbl>, genre <chr>
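The year cut-offs above were chosen by reading the table of counts. As a hedged alternative, the same rule ("keep only years with enough releases") can be written as a threshold. This is a sketch on toy data, not the code used in this project; `toy` and `min_releases` are hypothetical names, and the threshold is lowered to 3 so the toy data stays small.

```r
library(dplyr)

# Toy stand-in for the real data: one sparse early year, one full year,
# and one sparse late year.
toy <- tibble(
  song = c("a", "b", "c", "d", "e"),
  year = c(1999, 2005, 2005, 2005, 2020)
)

min_releases <- 3  # the write-up uses 75 on the full dataset

kept <- toy %>%
  add_count(year, name = "n_year") %>%   # number of releases in each row's year
  filter(n_year >= min_releases) %>%     # keep rows from well-populated years
  select(-n_year)                        # drop the helper column

kept
```

On the toy data this keeps only the three 2005 rows. A threshold written this way keeps working if the underlying file changes, whereas hard-coded years would need to be checked again.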
I am removing any duplicate rows and then filtering out 2020.
sample1df <- sample1df[!duplicated(sample1df), ]
sample1df <- sample1df %>%
filter(year<2020)
The dataframe now has 1,828 observations and 18 variables.
dim(sample1df)
## [1] 1828 18
The dimensions of the new data frame are 1,828 observations and 18 variables, down from 2,000 observations and 18 variables. I am now looking at the structure of the dataframe.
str(sample1df)
## tibble [1,828 × 18] (S3: tbl_df/tbl/data.frame)
## $ artist : chr [1:1828] "Modjo" "Gigi D'Agostino" "Aaliyah" "Darude" ...
## $ song : chr [1:1828] "Lady - Hear Me Tonight" "L'Amour Toujours" "Try Again" "Sandstorm" ...
## $ duration_ms : num [1:1828] 307153 238759 284000 225493 210786 ...
## $ explicit : logi [1:1828] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ year : num [1:1828] 2001 2011 2002 2001 2016 ...
## $ popularity : num [1:1828] 77 1 53 69 47 61 43 52 47 65 ...
## $ danceability : num [1:1828] 0.72 0.617 0.797 0.528 0.644 0.478 0.729 0.829 0.771 0.628 ...
## $ energy : num [1:1828] 0.808 0.728 0.622 0.965 0.72 0.736 0.632 0.652 0.796 0.834 ...
## $ key : num [1:1828] 6 7 6 11 10 7 0 7 5 6 ...
## $ loudness : num [1:1828] -5.63 -7.93 -5.64 -7.98 -9.63 ...
## $ mode : num [1:1828] 1 1 0 0 0 1 0 0 0 0 ...
## $ speechiness : num [1:1828] 0.0379 0.0292 0.29 0.0465 0.0419 0.0367 0.0279 0.108 0.076 0.0497 ...
## $ acousticness : num [1:1828] 0.00793 0.0328 0.0807 0.141 0.00145 0.02 0.191 0.067 0.0993 0.403 ...
## $ instrumentalness: num [1:1828] 2.93e-02 4.82e-02 0.00 9.85e-01 5.04e-01 9.58e-05 0.00 0.00 2.78e-03 0.00 ...
## $ liveness : num [1:1828] 0.0634 0.36 0.0841 0.0797 0.0839 0.118 0.166 0.0812 0.0981 0.051 ...
## $ valence : num [1:1828] 0.869 0.808 0.731 0.587 0.53 0.564 0.774 0.726 0.801 0.626 ...
## $ tempo : num [1:1828] 126 139 93 136 132 ...
## $ genre : chr [1:1828] "Dance/Electronic" "pop" "hip hop, pop, R&B" "pop, Dance/Electronic" ...
I am looking at the head of the data.
head(sample1df)
## # A tibble: 6 × 18
## artist song duration_ms explicit year popularity danceability energy key
## <chr> <chr> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Modjo Lady… 307153 FALSE 2001 77 0.72 0.808 6
## 2 Gigi D'… L'Am… 238759 FALSE 2011 1 0.617 0.728 7
## 3 Aaliyah Try … 284000 FALSE 2002 53 0.797 0.622 6
## 4 Darude Sand… 225493 FALSE 2001 69 0.528 0.965 11
## 5 Chicane Don'… 210786 FALSE 2016 47 0.644 0.72 10
## 6 LeAnn R… I Ne… 229826 FALSE 2001 61 0.478 0.736 7
## # ℹ 9 more variables: loudness <dbl>, mode <dbl>, speechiness <dbl>,
## # acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## # tempo <dbl>, genre <chr>
I am checking for NAs. There are none in this dataset overall, and none in any individual column (variable).
sum(is.na(sample1df))
## [1] 0
colSums(is.na(sample1df))
## artist song duration_ms explicit
## 0 0 0 0
## year popularity danceability energy
## 0 0 0 0
## key loudness mode speechiness
## 0 0 0 0
## acousticness instrumentalness liveness valence
## 0 0 0 0
## tempo genre
## 0 0
I am renaming some of the longer column names.
names(sample1df)[names(sample1df) == "popularity"] <- "popular"
names(sample1df)[names(sample1df) == "speechiness"] <- "speech"
names(sample1df)[names(sample1df) == "acousticness"] <- "acoustics"
names(sample1df)[names(sample1df) == "instrumentalness"] <- "instrumental"
I am looking at the new names.
names(sample1df)
## [1] "artist" "song" "duration_ms" "explicit" "year"
## [6] "popular" "danceability" "energy" "key" "loudness"
## [11] "mode" "speech" "acoustics" "instrumental" "liveness"
## [16] "valence" "tempo" "genre"
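For comparison, here is a minimal sketch of the same renaming done with `dplyr::rename()`, which takes `new_name = old_name` pairs. The one-row `toy` tibble is a hypothetical stand-in for `sample1df`.

```r
library(dplyr)

# Toy stand-in with the four columns that get renamed.
toy <- tibble(popularity = 77, speechiness = 0.0379,
              acousticness = 0.00793, instrumentalness = 0.0293)

# rename() uses new_name = old_name and leaves other columns untouched.
renamed <- toy %>%
  rename(popular      = popularity,
         speech       = speechiness,
         acoustics    = acousticness,
         instrumental = instrumentalness)

names(renamed)
```

Both approaches give the same column names; `rename()` simply keeps all four changes in one step.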
I am converting the duration of songs from milliseconds to minutes.
sample1df<- sample1df %>% mutate(duration_min = duration_ms/60000)
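As a quick worked check of the conversion, dividing by 60,000 (1,000 milliseconds per second times 60 seconds per minute) gives decimal minutes. The sketch below uses the first two durations in the dataset and adds a hypothetical helper, `ms_to_mmss()`, that formats the same value as minutes and seconds; that helper is an illustration only and is not used elsewhere in this project.

```r
# Hypothetical helper: format milliseconds as "m:ss" (seconds rounded down).
ms_to_mmss <- function(ms) {
  total_sec <- ms %/% 1000
  sprintf("%d:%02d",
          as.integer(total_sec %/% 60),
          as.integer(total_sec %% 60))
}

duration_ms <- c(307153, 238759)  # first two rows of the dataset

duration_ms / 60000               # decimal minutes: about 5.119 and 3.979
ms_to_mmss(duration_ms)           # "5:07" "3:58"
```

Note that 5.119 decimal minutes is 5 minutes 7 seconds, not 5 minutes 11 seconds, which is worth keeping in mind when reading the `duration_min` column.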
I am looking at a summary of the data and then removing the columns that I will not use in the analysis (duration_ms, key, loudness, and mode).
summary(sample1df)
## artist song duration_ms explicit
## Length:1828 Length:1828 Min. :113000 Mode :logical
## Class :character Class :character 1st Qu.:202524 FALSE:1314
## Mode :character Mode :character Median :222418 TRUE :514
## Mean :227237
## 3rd Qu.:245890
## Max. :484146
## year popular danceability energy
## Min. :2001 Min. : 0.00 Min. :0.1290 Min. :0.0549
## 1st Qu.:2005 1st Qu.:56.00 1st Qu.:0.5810 1st Qu.:0.6228
## Median :2010 Median :66.00 Median :0.6755 Median :0.7360
## Mean :2010 Mean :59.61 Mean :0.6667 Mean :0.7202
## 3rd Qu.:2015 3rd Qu.:73.00 3rd Qu.:0.7640 3rd Qu.:0.8380
## Max. :2019 Max. :89.00 Max. :0.9750 Max. :0.9990
## key loudness mode speech
## Min. : 0.000 Min. :-20.514 Min. :0.0000 Min. :0.02320
## 1st Qu.: 2.000 1st Qu.: -6.467 1st Qu.:0.0000 1st Qu.:0.04020
## Median : 6.000 Median : -5.240 Median :1.0000 Median :0.06195
## Mean : 5.389 Mean : -5.477 Mean :0.5542 Mean :0.10541
## 3rd Qu.: 8.000 3rd Qu.: -4.152 3rd Qu.:1.0000 3rd Qu.:0.13200
## Max. :11.000 Max. : -0.276 Max. :1.0000 Max. :0.57600
## acoustics instrumental liveness valence
## Min. :0.0000192 Min. :0.00e+00 Min. :0.02150 Min. :0.0381
## 1st Qu.:0.0132000 1st Qu.:0.00e+00 1st Qu.:0.08897 1st Qu.:0.3850
## Median :0.0553500 Median :0.00e+00 Median :0.12500 Median :0.5540
## Mean :0.1282624 Mean :1.49e-02 Mean :0.18189 Mean :0.5488
## 3rd Qu.:0.1752500 3rd Qu.:6.08e-05 3rd Qu.:0.24000 3rd Qu.:0.7262
## Max. :0.9760000 Max. :9.85e-01 Max. :0.85300 Max. :0.9730
## tempo genre duration_min
## Min. : 60.02 Length:1828 Min. :1.883
## 1st Qu.: 99.01 Class :character 1st Qu.:3.375
## Median :120.05 Mode :character Median :3.707
## Mean :120.40 Mean :3.787
## 3rd Qu.:134.97 3rd Qu.:4.098
## Max. :210.85 Max. :8.069
sample1df <- sample1df[,-c(3,9,10,11 )]
dim(sample1df)
## [1] 1828 15
head(sample1df)
## # A tibble: 6 × 15
## artist song explicit year popular danceability energy speech acoustics
## <chr> <chr> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Modjo Lady… FALSE 2001 77 0.72 0.808 0.0379 0.00793
## 2 Gigi D'Agos… L'Am… FALSE 2011 1 0.617 0.728 0.0292 0.0328
## 3 Aaliyah Try … FALSE 2002 53 0.797 0.622 0.29 0.0807
## 4 Darude Sand… FALSE 2001 69 0.528 0.965 0.0465 0.141
## 5 Chicane Don'… FALSE 2016 47 0.644 0.72 0.0419 0.00145
## 6 LeAnn Rimes I Ne… FALSE 2001 61 0.478 0.736 0.0367 0.02
## # ℹ 6 more variables: instrumental <dbl>, liveness <dbl>, valence <dbl>,
## # tempo <dbl>, genre <chr>, duration_min <dbl>
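Dropping columns by position works, but it depends on the column order staying the same. A hedged alternative is to drop them by name with `dplyr::select()`. The one-row `toy` tibble below is a hypothetical stand-in for `sample1df`.

```r
library(dplyr)

# Toy stand-in with a few of the real column names.
toy <- tibble(
  artist       = "Modjo",
  song         = "Lady - Hear Me Tonight",
  duration_ms  = 307153,
  key          = 6,
  loudness     = -5.63,
  mode         = 1,
  duration_min = 307153 / 60000
)

# Remove the four unused columns by name instead of by index.
trimmed <- toy %>%
  select(-c(duration_ms, key, loudness, mode))

names(trimmed)
```

Selecting by name makes the intent visible in the code and avoids removing the wrong column if a new one is added earlier in the pipeline.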
I am going to take another glimpse at the dataframe (sample1df). There are now 1,828 observations and 15 variables.
glimpse(sample1df)
## Rows: 1,828
## Columns: 15
## $ artist <chr> "Modjo", "Gigi D'Agostino", "Aaliyah", "Darude", "Chicane…
## $ song <chr> "Lady - Hear Me Tonight", "L'Amour Toujours", "Try Again"…
## $ explicit <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ year <dbl> 2001, 2011, 2002, 2001, 2016, 2001, 2018, 2004, 2001, 200…
## $ popular <dbl> 77, 1, 53, 69, 47, 61, 43, 52, 47, 65, 58, 0, 65, 60, 70,…
## $ danceability <dbl> 0.720, 0.617, 0.797, 0.528, 0.644, 0.478, 0.729, 0.829, 0…
## $ energy <dbl> 0.808, 0.728, 0.622, 0.965, 0.720, 0.736, 0.632, 0.652, 0…
## $ speech <dbl> 0.0379, 0.0292, 0.2900, 0.0465, 0.0419, 0.0367, 0.0279, 0…
## $ acoustics <dbl> 0.00793, 0.03280, 0.08070, 0.14100, 0.00145, 0.02000, 0.1…
## $ instrumental <dbl> 2.93e-02, 4.82e-02, 0.00e+00, 9.85e-01, 5.04e-01, 9.58e-0…
## $ liveness <dbl> 0.0634, 0.3600, 0.0841, 0.0797, 0.0839, 0.1180, 0.1660, 0…
## $ valence <dbl> 0.869, 0.808, 0.731, 0.587, 0.530, 0.564, 0.774, 0.726, 0…
## $ tempo <dbl> 126.041, 139.066, 93.020, 136.065, 132.017, 144.705, 109.…
## $ genre <chr> "Dance/Electronic", "pop", "hip hop, pop, R&B", "pop, Dan…
## $ duration_min <dbl> 5.119217, 3.979317, 4.733333, 3.758217, 3.513100, 3.83043…
The new dataframe includes the years of 2001 through 2019.
table(sample1df$year)
##
## 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
## 106 86 91 95 101 93 90 91 82 103 96 113 87 100 93 98
## 2017 2018 2019
## 110 104 89
I am looking at the number of Spotify songs released by year using ggplot in the first chart. Then, in highcharter, I look at the artists with the most songs in the dataset, which I treat as a measure of popularity.
ggplot(sample1df,
aes(factor(year))) +
geom_bar(fill = "coral",
alpha = 0.5) +
theme_classic() +
ggtitle("Spotify Songs Released by Year and Popularity") +
theme(axis.text.x = element_text(angle=25, family = "Georgia"))
sample1df%>%
count(artist) %>%
arrange(artist) %>%
top_n(10) %>%
hchart(type = "column",color = "green", hcaes(x = artist, y=n)) %>%
hc_title(text = "Spotify Songs by Artist and Popularity (2001 - 2019)") %>%
hc_xAxis(title = list(text = "Artist")) %>%
hc_yAxis(title = list(text = "Number of Songs"))
## Selecting by n
For this dataframe, I used a treemap in highcharter to look at the artists with the highest number of songs in the dataset.
sample1df %>%
count(artist)%>%
arrange(artist, desc(n))%>%
top_n(20) %>%
hchart(type = "treemap", hcaes(x = artist, value = n, color = n)) %>%
hc_title(text = "Top 20 Popular Artists by Number of Popular Songs") %>%
hc_subtitle(text= "Spotify Dataset 2001 - 2019") %>%
hc_caption(text= "Source: Dataset Provided by Montgomery College, Data 110")%>%
hc_xAxis(title = list(text = "Artist")) %>%
hc_yAxis(title = list(text = "Popularity"))
## Selecting by n
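The "Selecting by n" message appears because `top_n()` was not told which column to rank by. In current dplyr, `top_n()` is superseded by `slice_max()`, which names the ranking column explicitly and returns rows sorted from highest to lowest. Here is a minimal sketch on toy data; `toy` and `n_songs` are hypothetical names.

```r
library(dplyr)

# Toy stand-in: one row per song, artists appear a different number of times.
toy <- tibble(artist = c("A", "A", "A", "B", "B", "C", "D", "D", "D", "D"))

top_artists <- toy %>%
  count(artist, name = "n_songs") %>%            # songs per artist
  slice_max(n_songs, n = 2, with_ties = TRUE)    # top 2 by count, highest first

top_artists
```

With `with_ties = TRUE`, artists tied at the cut-off are all kept, which matters for a tie like the one between Drake and Rihanna.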
Drake and Rihanna are the artists with the most songs in the dataset from 2001 - 2019, tied at 23 each.
The Most Popular Artists - Rihanna and Drake
Rihanna and Drake tied for the highest number of popular songs from 2001 - 2019 (23 each). Let’s look at the songs and the years they were released on Spotify for each artist. I will also look at potential correlations given the variables provided in the dataset.
Rihanna’s Popularity
I started by filtering by the artist to see the number of songs and popularity ratings. I then created a column chart of Rihanna’s top songs. This chart is interactive.
rihanna <- sample1df %>%
filter(artist =='Rihanna' )
rihanna%>%
hchart('column', hcaes(x = song, y = popular, group = year)) %>%
hc_colorAxis() %>%
hc_chart(style = list(fontFamily = "NewCenturySchoolbook",
fontWeight = "bold")) %>%
hc_xAxis(title = list(text="Song")) %>%
hc_yAxis(title = list(text="Popular"))%>%
hc_title( text = "Rihanna's Top Songs") %>%
hc_subtitle(text = "2005 - 2016") %>%
hc_add_theme(hc_theme_sandsignika()) %>%
hc_tooltip(shared = TRUE)
I then decided to embed the top song, Umbrella, via YouTube. Umbrella was released in 2008 and has a popularity rating of 81 (the highest of all songs by Rihanna).
library("vembedr")
##
## Attaching package: 'vembedr'
## The following object is masked from 'package:lubridate':
##
## hms
suggest_embed("https://www.youtube.com/watch?v=xXD5tltX9Pg")
## embed_youtube("xXD5tltX9Pg")
Feel free to click on the link “watch on YouTube” to watch Rihanna sing Umbrella.
embed_url("https://www.youtube.com/watch?v=xXD5tltX9Pg")
The strongest positive correlations among Rihanna’s songs are in the following categories:
Danceability and Popularity (0.6)
Energy and Valence (0.5)
Energy and Tempo (0.4)
Energy and Duration in Minutes (0.4)
Danceability and Valence (0.4)
Tempo and Liveness (0.3)
Energy is a big variable in Rihanna’s songs, so I wanted to look at the energy variable across her 23 songs.
rihannacorr1 <- rihanna %>% select(danceability, speech, popular, acoustics, tempo, energy, instrumental,valence,liveness, duration_min)
rihannacorr2 <- round(cor(rihannacorr1),1)
head(rihannacorr2[, 1:9])
## danceability speech popular acoustics tempo energy instrumental
## danceability 1.0 -0.2 0.6 -0.1 -0.5 0.0 0.2
## speech -0.2 1.0 0.2 -0.3 0.2 0.0 -0.2
## popular 0.6 0.2 1.0 0.0 -0.2 -0.2 -0.1
## acoustics -0.1 -0.3 0.0 1.0 0.1 -0.6 -0.1
## tempo -0.5 0.2 -0.2 0.1 1.0 0.4 0.1
## energy 0.0 0.0 -0.2 -0.6 0.4 1.0 0.2
## valence liveness
## danceability 0.4 -0.4
## speech -0.1 -0.1
## popular 0.2 -0.6
## acoustics -0.3 -0.1
## tempo 0.2 0.3
## energy 0.5 0.1
ggcorrplot(rihannacorr2, hc.order = TRUE, type = "lower", lab = TRUE) +
labs(title = "Correlation Heat Map - Rihanna Spotify Songs")
rihanna%>%
arrange(desc(energy))%>%
ggplot(aes(song, energy, fill= song)) + geom_col() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),legend.position = "none") +
coord_flip() +
ggtitle("Energy of Rihanna Songs")
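One thing to be aware of: ggplot2 orders a character variable on an axis alphabetically, so sorting the rows with `arrange()` before plotting does not change the order of the bars. A hedged sketch of one way to order bars by value is `reorder()`, shown here on toy data (the song names and energy values are made up).

```r
library(ggplot2)

# Toy stand-in for an artist's songs.
toy <- data.frame(
  song   = c("Song A", "Song B", "Song C"),
  energy = c(0.61, 0.93, 0.47)
)

# reorder() turns song into a factor whose levels follow energy (ascending),
# so after coord_flip() the highest-energy song is drawn at the top.
toy$song_ordered <- reorder(toy$song, toy$energy)

p <- ggplot(toy, aes(song_ordered, energy)) +
  geom_col(fill = "coral") +
  coord_flip() +
  labs(x = "Song", y = "Energy", title = "Energy by song (toy data)")

levels(toy$song_ordered)   # "Song C" "Song A" "Song B"
```

Printing `p` draws the chart; the same `reorder(song, energy)` call could be placed directly inside `aes()`.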
Because energy is a big correlation factor in Rihanna’s songs, I wanted to look at Drake’s songs next.
Drake’s dataframe
Similarly, I filtered for songs by Drake, looking for the 23 songs that tie him with Rihanna. I also created an interactive chart to show the songs, popularity, and year.
drake <- sample1df %>%
filter(artist =='Drake')
drake %>%
hchart('column', hcaes(x = song, y = popular, group = year)) %>%
hc_colorAxis() %>%
hc_chart(style = list(fontFamily = "NewCenturySchoolbook",
fontWeight = "bold")) %>%
hc_xAxis(title = list(text="Song")) %>%
hc_yAxis(title = list(text="Popular"))%>%
hc_title( text = "Drake's Top Songs") %>%
hc_subtitle(text = "2009 - 2019") %>%
hc_add_theme(hc_theme_sandsignika()) %>%
hc_tooltip(shared = TRUE)
I also embedded a YouTube video to share the song with the highest popularity. The top song from Drake was “One Dance,” which had a popularity rating of 84. Now, let’s look at possible correlations.
suggest_embed("https://www.youtube.com/watch?v=iAbnEUA0wpA")
## embed_youtube("iAbnEUA0wpA")
Feel free to click on the link “watch on YouTube” to watch Drake sing One Dance.
embed_url("https://www.youtube.com/watch?v=iAbnEUA0wpA")
drakecorr1 <- drake %>% select(danceability, speech, popular, acoustics, tempo, energy, instrumental,valence,liveness, duration_min)
drakecorr2 <- round(cor(drakecorr1),1)
head(drakecorr2[, 1:9])
## danceability speech popular acoustics tempo energy instrumental
## danceability 1.0 -0.5 -0.1 -0.1 0.2 -0.7 0.1
## speech -0.5 1.0 0.1 0.1 0.4 0.3 -0.3
## popular -0.1 0.1 1.0 -0.1 -0.1 0.0 -0.5
## acoustics -0.1 0.1 -0.1 1.0 0.1 0.1 0.3
## tempo 0.2 0.4 -0.1 0.1 1.0 -0.2 0.0
## energy -0.7 0.3 0.0 0.1 -0.2 1.0 -0.2
## valence liveness
## danceability -0.2 0.1
## speech 0.2 0.0
## popular -0.2 0.1
## acoustics -0.1 -0.2
## tempo 0.1 -0.3
## energy 0.5 0.0
ggcorrplot(drakecorr2, hc.order = TRUE, type = "lower", lab = TRUE) +
labs(title = "Correlation Heat Map - Drake Spotify Songs")
drake%>%
arrange(desc(energy))%>%
ggplot(aes(song, energy)) + geom_point(color = "coral") + # set the color outside aes() so the points are actually coral
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),legend.position = "none") +
coord_flip() +
ggtitle("Energy of Drake's Songs")
The strongest positive correlations among Drake’s songs are in the following categories:
Energy and Valence (0.5)
Speech and Tempo (0.4)
Energy and Duration of minutes (0.4)
Energy and Speech (0.3)
Acoustics and Instrumental (0.3)
Top Genres
For this analysis, I looked at the different genres (56 in total) and mapped them using a treemap. The largest single category was pop, followed by combinations of genres: hip hop, pop; hip hop, pop, R&B; pop, Dance/Electronic; and pop, R&B.
sample1df%>% group_by(genre = genre) %>%
summarise(popular = n()) %>%
arrange(desc(popular))
## # A tibble: 56 × 2
## genre popular
## <chr> <int>
## 1 pop 388
## 2 hip hop, pop 261
## 3 hip hop, pop, R&B 223
## 4 pop, Dance/Electronic 211
## 5 pop, R&B 156
## 6 hip hop 114
## 7 hip hop, pop, Dance/Electronic 75
## 8 rock 55
## 9 Dance/Electronic 39
## 10 rock, pop 36
## # ℹ 46 more rows
top <- sample1df%>% select(genre, popular) %>% group_by(genre) %>% summarise(n = n()) %>% top_n(56, n)
tm <- treemap(top, index = c("genre"), vSize = "n", vColor = 'genre', palette="RdYlBu")
Not surprisingly, Pop and a combination of hip hop and pop represent the most frequent genres.
Correlation
Loading additional packages to run histograms for numerical variables.
library(funModeling)
## Loading required package: Hmisc
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:plotly':
##
## subplot
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
## funModeling v.1.9.4 :)
## Examples and tutorials at livebook.datascienceheroes.com
## / Now in Spanish: librovivodecienciadedatos.ai
library(tidyverse)
library(Hmisc)
Here is a quick way to retrieve one plot containing all the histograms for numerical variables.
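# Keep only the numeric audio features (drops artist, song, explicit, year, popular, genre)
histograms <- sample1df[,-c(1,2,3,4,5,14)]
# Plot the subset created above rather than the full data frame
plot_num(histograms, bins = 10)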
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the funModeling package.
## Please report the issue at <https://github.com/pablo14/funModeling/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Correlation Heat Map
I used a correlation heat map as a tool to display the correlations between multiple variables as a color-coded matrix.
library(ggcorrplot)
samplecorr1 <- sample1df %>% select(danceability, speech, popular, acoustics, tempo, energy, instrumental,valence,liveness, duration_min)
samplecorr2 <- round(cor(samplecorr1),1)
head(samplecorr2[, 1:9])
## danceability speech popular acoustics tempo energy instrumental
## danceability 1.0 0.1 0 -0.1 -0.2 -0.1 0.0
## speech 0.1 1.0 0 0.0 0.1 -0.1 -0.1
## popular 0.0 0.0 1 0.0 0.0 0.0 -0.1
## acoustics -0.1 0.0 0 1.0 -0.1 -0.4 0.0
## tempo -0.2 0.1 0 -0.1 1.0 0.2 0.0
## energy -0.1 -0.1 0 -0.4 0.2 1.0 0.0
## valence liveness
## danceability 0.4 -0.1
## speech 0.1 0.1
## popular 0.0 0.0
## acoustics -0.1 -0.1
## tempo 0.0 0.0
## energy 0.3 0.1
ggcorrplot(samplecorr2, hc.order = TRUE, type = "lower", lab = TRUE) +
labs(title = "Correlation Heat Map - Spotify Songs")
The strongest positive correlations are between valence and danceability (0.40) and between valence and energy (0.33).
Valence and Danceability
Let’s check the correlation.
cor(sample1df$valence, sample1df$danceability)
## [1] 0.3988757
Valence and Energy
Let’s check the correlation quickly to see if something stands out.
cor(sample1df$valence, sample1df$energy)
## [1] 0.3270572
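Reading the strongest pairs off a heat map by eye is easy to get wrong, so here is a hedged sketch of listing them programmatically. It uses made-up toy data rather than `sample1df`, and the names `toy`, `corr`, and `cor_pairs` are hypothetical. Sorting by absolute value also surfaces strong negative correlations (in the matrix above, acoustics and energy are at -0.4).

```r
# Toy data: y tracks x exactly, z runs opposite to x, w is unrelated noise.
set.seed(1)
x <- 1:20
toy <- data.frame(x = x, y = 2 * x + 1, z = -x + rnorm(20), w = rnorm(20))

corr <- cor(toy)

# Keep each pair once (upper triangle only), then sort by absolute correlation.
idx <- which(upper.tri(corr), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = corr[idx]
)
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]

head(cor_pairs, 3)
```

The same steps could be applied to `samplecorr2`, ideally before rounding so that close values are not tied.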
The most popular artists and songs in the 2001 - 2019 Spotify dataset are explored in this section of the project.
most_popular <- sample1df%>%
group_by(song, artist) %>%
summarise(popular) %>%
arrange(desc(popular))%>%
head(22)
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'song', 'artist'. You can override using
## the `.groups` argument.
most_popular
## # A tibble: 22 × 3
## # Groups: song, artist [22]
## song artist popular
## <chr> <chr> <dbl>
## 1 Sweater Weather The Neighbourhood 89
## 2 Another Love Tom Odell 88
## 3 Without Me Eminem 87
## 4 Wait a Minute! WILLOW 86
## 5 lovely (with Khalid) Billie Eilish 86
## 6 'Till I Collapse Eminem 85
## 7 Circles Post Malone 85
## 8 Daddy Issues The Neighbourhood 85
## 9 Locked out of Heaven Bruno Mars 85
## 10 Perfect Ed Sheeran 85
## # ℹ 12 more rows
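The deprecation warning above comes from calling `summarise()` with an expression that returns more than one row per group. Since each row is already one song, a hedged alternative is to skip the grouping and take the top rows directly with `slice_max()`. This is a sketch on toy data (the songs and scores are made up), not the code that produced the table above.

```r
library(dplyr)

# Toy stand-in for sample1df.
toy <- tibble(
  song    = c("Song A", "Song B", "Song C", "Song D"),
  artist  = c("X", "Y", "X", "Z"),
  popular = c(85, 89, 70, 88)
)

most_popular_toy <- toy %>%
  select(song, artist, popular) %>%
  slice_max(popular, n = 3, with_ties = FALSE)  # top 3 by popularity, highest first

most_popular_toy
```

The result is ungrouped, and the order of tied songs may differ from the grouped version, so the two tables would not necessarily match row for row.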
Most Popular Individual Songs on Spotify
I wanted to explore the popular songs in the dataset (2001 - 2019). This interactive chart (highcharter) identifies the song, artist, and popularity score for the most popular individual songs on Spotify. Further, for artists with multiple popular songs, each individual song is listed in the tooltip with the corresponding popularity score.
most_popular%>%
hchart(type = "column",color = "#FF7900", hcaes(x = artist, y= popular, group = song)) %>%
hc_title(text = "Individual Songs by Popularity (2001 - 2019)") %>%
hc_subtitle(text = "Source: Data Supplied by Montgomery College, Data 110") %>%
hc_xAxis(title = list(text = "Artist")) %>%
hc_yAxis(title = list(text = "Popularity")) %>%
hc_add_theme(hc_theme_darkunica()) %>%
hc_tooltip(shared = TRUE)
The winner of most popular song (89) is Sweater Weather by The Neighbourhood.
What the visualization represents, any interesting patterns or surprises that arise within the visualization:
There were several interesting patterns that were made clear through the correlation plots, including the connection between energy and valence in these popular songs. It was not surprising that Pop was a popular genre, but I was surprised to learn about all of the audio features that exist as Spotify variables, including liveness, energy, and danceability. I was also surprised that most songs have a runtime between 3 and 5 minutes.
Anything that could have been shown that you could not get to work or that you wished you could have included.
I attempted to embed videos of the popular songs by Rihanna and Drake, and I also attempted to include images to bring the project to life. I had anticipated exploring more correlations, such as between explicit lyrics and popularity, and comparing how some of the variables (danceability, acoustics, and tempo) differ between songs with low popularity scores and the more popular songs.
Bibliography
For my project, I researched the topic by looking at articles that discussed Spotify and the popularity of certain types of music. The most influential article was “Can big data really predict what makes a song popular?”, published October 10, 2022, by the online periodical The Conversation. I am sharing a link to the article: https://theconversation.com/can-big-data-really-predict-what-makes-a-song-popular-189052