SPOTIFY SONGS - FINAL PROJECT

I selected this dataset because of my interest in Spotify. Spotify is an audio streaming and media services provider. It is one of the largest music streaming service providers, with over 527 million monthly active users, including 210 million paying subscribers, as of March 2023. Working with this dataset piqued my interest in learning more about their processes and algorithms for suggesting songs to subscribers. This is a topic that we have discussed in the context of data and ethics, and it is very important that subscribers know how they are being influenced. Below, I provide a list of the variables in this dataset, which consists of 2000 observations and 18 variables.

Throughout this document, I share the steps for wrangling the data and highlight some of the more notable findings.

Variables in the Spotify Songs Dataset

According to the Spotify website, all of their songs are given a score in each of the following categories (taken from the Spotify API documentation, https://developer.spotify.com/documentation/web-api/reference/):

Mood: Danceability, Valence, Energy, Tempo

Properties: Loudness, Speechiness, Instrumentalness

Context: Liveness, Acousticness

The dataset contains the following variables:

Artist: Name of the Artist.

Song: Name of the Track.

Genre: Genre of the track.

Duration_ms: Duration of the track in milliseconds.

Explicit: Whether the lyrics or content of a song or music video meet one or more criteria that could be considered offensive or unsuitable for children.

Year: Release Year of the track.

Popularity: The higher the value the more popular the song is.

Instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

Mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

Key: The key the track is in. Integers map to pitches using standard Pitch Class notation. If no key was detected, the value is -1.
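
As a minimal sketch of the Pitch Class mapping described above, the integers 0 through 11 can be translated to note names with a lookup vector. The names `pitch_names` and `key_to_pitch()` are hypothetical helpers of my own, not part of the dataset or the Spotify API, and the `keys` vector is a toy example.

```r
# Standard pitch class notation: 0 = C, 1 = C#/Db, ..., 11 = B.
pitch_names <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

# Hypothetical helper: translate key integers to note names.
# A key of -1 (no key detected) becomes NA.
key_to_pitch <- function(key) {
  ifelse(key < 0, NA_character_, pitch_names[pmax(key, 0) + 1])
}

keys <- c(6, 7, 11, 0, -1)   # toy values for illustration
key_to_pitch(keys)
```

The `+ 1` is needed because R vectors are indexed from 1 while pitch classes start at 0.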

Mood

Danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.

Tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

Properties

Loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.

Speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
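
The 0.33 and 0.66 cut-offs above can be turned into labelled bands. This is only a sketch under my own assumptions: the `toy` scores and the band labels are made up for illustration and do not come from the dataset or from Spotify.

```r
library(dplyr)

# Toy speechiness scores (hypothetical, not taken from the dataset).
toy <- tibble(speechiness = c(0.04, 0.29, 0.45, 0.70))

# Band each score using the 0.33 and 0.66 cut-offs described above.
toy <- toy %>%
  mutate(speech_band = case_when(
    speechiness > 0.66  ~ "mostly spoken word",
    speechiness >= 0.33 ~ "music and speech",
    TRUE                ~ "mostly music"
  ))
toy
```

`case_when()` evaluates the conditions in order, so each score gets the first band whose condition it meets.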

Context

Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

Liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Getting the working directory

getwd()
## [1] "/Users/grayce/Desktop"

Setting the working directory

setwd("/Users/grayce/Desktop")

Reading the dataset, Spotify Songs.

spotifysongs <- read_csv("spotifysongs2.csv")
## Rows: 2000 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): artist, song, genre
## dbl (14): duration_ms, year, popularity, danceability, energy, key, loudness...
## lgl  (1): explicit
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Loading the libraries

library(dplyr)
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(tidyr)
library(tidyverse)
library(ggcorrplot)
library(treemap)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(devtools)
## Loading required package: usethis
library(dslabs)
## 
## Attaching package: 'dslabs'
## The following object is masked from 'package:highcharter':
## 
##     stars
library(readr)

I am curious about the scope of the dataset (years and number of releases). It covers 1998 through 2020, with 2000 observations and 18 variables.

table(spotifysongs$year)
## 
## 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 
##    1   38   74  108   90   97   96  104   95   94   97   84  107   99  115   89 
## 2014 2015 2016 2017 2018 2019 2020 
##  104   99   99  111  107   89    3

I am removing all data prior to 2001 due to the limited number of releases in 1998 through 2000.

sample1df <- spotifysongs %>%
  filter(year>2000)
sample1df
## # A tibble: 1,887 × 18
##    artist  song  duration_ms explicit  year popularity danceability energy   key
##    <chr>   <chr>       <dbl> <lgl>    <dbl>      <dbl>        <dbl>  <dbl> <dbl>
##  1 Modjo   Lady…      307153 FALSE     2001         77        0.72   0.808     6
##  2 Gigi D… L'Am…      238759 FALSE     2011          1        0.617  0.728     7
##  3 Aaliyah Try …      284000 FALSE     2002         53        0.797  0.622     6
##  4 Darude  Sand…      225493 FALSE     2001         69        0.528  0.965    11
##  5 Chicane Don'…      210786 FALSE     2016         47        0.644  0.72     10
##  6 LeAnn … I Ne…      229826 FALSE     2001         61        0.478  0.736     7
##  7 Samant… Gott…      201946 FALSE     2018         43        0.729  0.632     0
##  8 Next    Wifey      243666 FALSE     2004         52        0.829  0.652     7
##  9 Janet … Does…      265026 FALSE     2001         47        0.771  0.796     5
## 10 LeAnn … Can'…      215506 FALSE     2001         65        0.628  0.834     6
## # ℹ 1,877 more rows
## # ℹ 9 more variables: loudness <dbl>, mode <dbl>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, genre <chr>

I am removing the years that have fewer than 75 releases. After the previous step, the only such year is 2020, for which only three (3) releases were reported. The code below previews the result without saving it.

sample1df %>%
  filter(year<2020)
## # A tibble: 1,884 × 18
##    artist  song  duration_ms explicit  year popularity danceability energy   key
##    <chr>   <chr>       <dbl> <lgl>    <dbl>      <dbl>        <dbl>  <dbl> <dbl>
##  1 Modjo   Lady…      307153 FALSE     2001         77        0.72   0.808     6
##  2 Gigi D… L'Am…      238759 FALSE     2011          1        0.617  0.728     7
##  3 Aaliyah Try …      284000 FALSE     2002         53        0.797  0.622     6
##  4 Darude  Sand…      225493 FALSE     2001         69        0.528  0.965    11
##  5 Chicane Don'…      210786 FALSE     2016         47        0.644  0.72     10
##  6 LeAnn … I Ne…      229826 FALSE     2001         61        0.478  0.736     7
##  7 Samant… Gott…      201946 FALSE     2018         43        0.729  0.632     0
##  8 Next    Wifey      243666 FALSE     2004         52        0.829  0.652     7
##  9 Janet … Does…      265026 FALSE     2001         47        0.771  0.796     5
## 10 LeAnn … Can'…      215506 FALSE     2001         65        0.628  0.834     6
## # ℹ 1,874 more rows
## # ℹ 9 more variables: loudness <dbl>, mode <dbl>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, genre <chr>

I am removing any duplicates and then applying the year filter again, this time saving the result.

sample1df<- sample1df[!duplicated(sample1df), ]

sample1df <- sample1df %>%
  filter(year<2020)
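
The two year filters above hard-code the cut-off years. As an alternative sketch, the same rule (drop years with too few releases) can be derived from the counts themselves. The `songs` data and the `min_releases` value of 4 are toy stand-ins of my own; in the real data the threshold would be 75.

```r
library(dplyr)

# Toy data: one row per song, with a release year (hypothetical values).
songs <- tibble(year = c(rep(1999, 2), rep(2001, 5), rep(2002, 4), rep(2020, 1)))

min_releases <- 4   # stand-in for the 75-release threshold used in the text

# Keep only the years that have at least `min_releases` songs.
kept <- songs %>%
  group_by(year) %>%
  filter(n() >= min_releases) %>%
  ungroup()

table(kept$year)
```

In the year table shown earlier, the only years with fewer than 75 songs are 1998, 1999, 2000, and 2020, so a threshold of 75 would select the same years as the manual filters.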

The dataframe now has 1,828 observations and 18 variables.

dim(sample1df)
## [1] 1828   18

The new data frame has 1,828 observations and 18 variables, down from 2,000 observations and 18 variables. I am now looking at the structure of the dataframe.

str(sample1df)
## tibble [1,828 × 18] (S3: tbl_df/tbl/data.frame)
##  $ artist          : chr [1:1828] "Modjo" "Gigi D'Agostino" "Aaliyah" "Darude" ...
##  $ song            : chr [1:1828] "Lady - Hear Me Tonight" "L'Amour Toujours" "Try Again" "Sandstorm" ...
##  $ duration_ms     : num [1:1828] 307153 238759 284000 225493 210786 ...
##  $ explicit        : logi [1:1828] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ year            : num [1:1828] 2001 2011 2002 2001 2016 ...
##  $ popularity      : num [1:1828] 77 1 53 69 47 61 43 52 47 65 ...
##  $ danceability    : num [1:1828] 0.72 0.617 0.797 0.528 0.644 0.478 0.729 0.829 0.771 0.628 ...
##  $ energy          : num [1:1828] 0.808 0.728 0.622 0.965 0.72 0.736 0.632 0.652 0.796 0.834 ...
##  $ key             : num [1:1828] 6 7 6 11 10 7 0 7 5 6 ...
##  $ loudness        : num [1:1828] -5.63 -7.93 -5.64 -7.98 -9.63 ...
##  $ mode            : num [1:1828] 1 1 0 0 0 1 0 0 0 0 ...
##  $ speechiness     : num [1:1828] 0.0379 0.0292 0.29 0.0465 0.0419 0.0367 0.0279 0.108 0.076 0.0497 ...
##  $ acousticness    : num [1:1828] 0.00793 0.0328 0.0807 0.141 0.00145 0.02 0.191 0.067 0.0993 0.403 ...
##  $ instrumentalness: num [1:1828] 2.93e-02 4.82e-02 0.00 9.85e-01 5.04e-01 9.58e-05 0.00 0.00 2.78e-03 0.00 ...
##  $ liveness        : num [1:1828] 0.0634 0.36 0.0841 0.0797 0.0839 0.118 0.166 0.0812 0.0981 0.051 ...
##  $ valence         : num [1:1828] 0.869 0.808 0.731 0.587 0.53 0.564 0.774 0.726 0.801 0.626 ...
##  $ tempo           : num [1:1828] 126 139 93 136 132 ...
##  $ genre           : chr [1:1828] "Dance/Electronic" "pop" "hip hop, pop, R&B" "pop, Dance/Electronic" ...

I am looking at the head of the data.

head(sample1df)
## # A tibble: 6 × 18
##   artist   song  duration_ms explicit  year popularity danceability energy   key
##   <chr>    <chr>       <dbl> <lgl>    <dbl>      <dbl>        <dbl>  <dbl> <dbl>
## 1 Modjo    Lady…      307153 FALSE     2001         77        0.72   0.808     6
## 2 Gigi D'… L'Am…      238759 FALSE     2011          1        0.617  0.728     7
## 3 Aaliyah  Try …      284000 FALSE     2002         53        0.797  0.622     6
## 4 Darude   Sand…      225493 FALSE     2001         69        0.528  0.965    11
## 5 Chicane  Don'…      210786 FALSE     2016         47        0.644  0.72     10
## 6 LeAnn R… I Ne…      229826 FALSE     2001         61        0.478  0.736     7
## # ℹ 9 more variables: loudness <dbl>, mode <dbl>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, genre <chr>

I am checking for NAs. There are none in this dataset as a whole, and none in any individual column (variable).

sum(is.na(sample1df))
## [1] 0
colSums(is.na(sample1df))
##           artist             song      duration_ms         explicit 
##                0                0                0                0 
##             year       popularity     danceability           energy 
##                0                0                0                0 
##              key         loudness             mode      speechiness 
##                0                0                0                0 
##     acousticness instrumentalness         liveness          valence 
##                0                0                0                0 
##            tempo            genre 
##                0                0
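
Since this dataset has no NAs, nothing had to be dropped. For reference, here is a minimal sketch of how the count and the removal could look on data that does have missing values. The `toy` data frame is made up for illustration.

```r
library(dplyr)

# Toy data with a few missing values (hypothetical).
toy <- tibble(
  artist  = c("A", "B", NA),
  popular = c(77, NA, 53),
  energy  = c(0.8, 0.7, 0.6)
)

# Count the NAs in every column in one step.
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts

# Keep only the rows that have no NA in any column.
complete <- toy[complete.cases(toy), ]
nrow(complete)
```

`complete.cases()` is from base R (the stats package), so it needs no extra library.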

I am renaming some of the longer column names.

names(sample1df)[names(sample1df) == "popularity"] <- "popular"
names(sample1df)[names(sample1df) == "speechiness"] <- "speech"
names(sample1df)[names(sample1df) == "acousticness"] <- "acoustics"
names(sample1df)[names(sample1df) == "instrumentalness"] <- "instrumental"

I am looking at the new names.

names(sample1df)
##  [1] "artist"       "song"         "duration_ms"  "explicit"     "year"        
##  [6] "popular"      "danceability" "energy"       "key"          "loudness"    
## [11] "mode"         "speech"       "acoustics"    "instrumental" "liveness"    
## [16] "valence"      "tempo"        "genre"
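
The same renaming can also be written with `rename()` from dplyr, which handles all four columns in one call. This is an alternative sketch on a toy one-row data frame with made-up values, not the code that produced the output above.

```r
library(dplyr)

# Toy data frame with the original column names (values are made up).
toy <- tibble(popularity = 77, speechiness = 0.04,
              acousticness = 0.01, instrumentalness = 0.03)

# rename() takes new_name = old_name pairs.
toy <- toy %>%
  rename(popular      = popularity,
         speech       = speechiness,
         acoustics    = acousticness,
         instrumental = instrumentalness)
names(toy)
```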

I am converting the duration of songs from milliseconds to minutes.

sample1df<- sample1df %>% mutate(duration_min = duration_ms/60000)
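
To make the arithmetic concrete: one minute is 60 × 1000 = 60,000 milliseconds, so dividing by 60000 gives decimal minutes. The sketch below checks this on the first two durations shown in the data and adds a hypothetical helper, `ms_to_mmss()`, that I made up to show the same values as minutes and seconds.

```r
# One minute is 60 * 1000 = 60,000 milliseconds.
duration_ms <- c(307153, 238759)          # first two tracks shown above
duration_min <- duration_ms / 60000
round(duration_min, 6)

# Hypothetical helper: format milliseconds as "m:ss" for display.
ms_to_mmss <- function(ms) {
  total_sec <- round(ms / 1000)
  sprintf("%d:%02d", as.integer(total_sec %/% 60), as.integer(total_sec %% 60))
}
ms_to_mmss(duration_ms)
```

Note that 5.119 decimal minutes is about 5 minutes 7 seconds, not 5 minutes 11 seconds.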

I am looking at a summary of the data and then removing the columns that I will not use in the analysis (duration_ms, key, loudness, and mode).

summary(sample1df)
##     artist              song            duration_ms      explicit      
##  Length:1828        Length:1828        Min.   :113000   Mode :logical  
##  Class :character   Class :character   1st Qu.:202524   FALSE:1314     
##  Mode  :character   Mode  :character   Median :222418   TRUE :514      
##                                        Mean   :227237                  
##                                        3rd Qu.:245890                  
##                                        Max.   :484146                  
##       year         popular       danceability        energy      
##  Min.   :2001   Min.   : 0.00   Min.   :0.1290   Min.   :0.0549  
##  1st Qu.:2005   1st Qu.:56.00   1st Qu.:0.5810   1st Qu.:0.6228  
##  Median :2010   Median :66.00   Median :0.6755   Median :0.7360  
##  Mean   :2010   Mean   :59.61   Mean   :0.6667   Mean   :0.7202  
##  3rd Qu.:2015   3rd Qu.:73.00   3rd Qu.:0.7640   3rd Qu.:0.8380  
##  Max.   :2019   Max.   :89.00   Max.   :0.9750   Max.   :0.9990  
##       key            loudness            mode            speech       
##  Min.   : 0.000   Min.   :-20.514   Min.   :0.0000   Min.   :0.02320  
##  1st Qu.: 2.000   1st Qu.: -6.467   1st Qu.:0.0000   1st Qu.:0.04020  
##  Median : 6.000   Median : -5.240   Median :1.0000   Median :0.06195  
##  Mean   : 5.389   Mean   : -5.477   Mean   :0.5542   Mean   :0.10541  
##  3rd Qu.: 8.000   3rd Qu.: -4.152   3rd Qu.:1.0000   3rd Qu.:0.13200  
##  Max.   :11.000   Max.   : -0.276   Max.   :1.0000   Max.   :0.57600  
##    acoustics          instrumental         liveness          valence      
##  Min.   :0.0000192   Min.   :0.00e+00   Min.   :0.02150   Min.   :0.0381  
##  1st Qu.:0.0132000   1st Qu.:0.00e+00   1st Qu.:0.08897   1st Qu.:0.3850  
##  Median :0.0553500   Median :0.00e+00   Median :0.12500   Median :0.5540  
##  Mean   :0.1282624   Mean   :1.49e-02   Mean   :0.18189   Mean   :0.5488  
##  3rd Qu.:0.1752500   3rd Qu.:6.08e-05   3rd Qu.:0.24000   3rd Qu.:0.7262  
##  Max.   :0.9760000   Max.   :9.85e-01   Max.   :0.85300   Max.   :0.9730  
##      tempo           genre            duration_min  
##  Min.   : 60.02   Length:1828        Min.   :1.883  
##  1st Qu.: 99.01   Class :character   1st Qu.:3.375  
##  Median :120.05   Mode  :character   Median :3.707  
##  Mean   :120.40                      Mean   :3.787  
##  3rd Qu.:134.97                      3rd Qu.:4.098  
##  Max.   :210.85                      Max.   :8.069
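# Drop columns by name rather than by position, so the code still works if the column order changes
sample1df <- sample1df %>% select(-duration_ms, -key, -loudness, -mode)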
dim(sample1df)
## [1] 1828   15
head(sample1df)
## # A tibble: 6 × 15
##   artist       song  explicit  year popular danceability energy speech acoustics
##   <chr>        <chr> <lgl>    <dbl>   <dbl>        <dbl>  <dbl>  <dbl>     <dbl>
## 1 Modjo        Lady… FALSE     2001      77        0.72   0.808 0.0379   0.00793
## 2 Gigi D'Agos… L'Am… FALSE     2011       1        0.617  0.728 0.0292   0.0328 
## 3 Aaliyah      Try … FALSE     2002      53        0.797  0.622 0.29     0.0807 
## 4 Darude       Sand… FALSE     2001      69        0.528  0.965 0.0465   0.141  
## 5 Chicane      Don'… FALSE     2016      47        0.644  0.72  0.0419   0.00145
## 6 LeAnn Rimes  I Ne… FALSE     2001      61        0.478  0.736 0.0367   0.02   
## # ℹ 6 more variables: instrumental <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, genre <chr>, duration_min <dbl>

I am going to take another glimpse at the dataframe (sample1df). There are now 1,828 observations and 15 variables.

glimpse(sample1df)
## Rows: 1,828
## Columns: 15
## $ artist       <chr> "Modjo", "Gigi D'Agostino", "Aaliyah", "Darude", "Chicane…
## $ song         <chr> "Lady - Hear Me Tonight", "L'Amour Toujours", "Try Again"…
## $ explicit     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ year         <dbl> 2001, 2011, 2002, 2001, 2016, 2001, 2018, 2004, 2001, 200…
## $ popular      <dbl> 77, 1, 53, 69, 47, 61, 43, 52, 47, 65, 58, 0, 65, 60, 70,…
## $ danceability <dbl> 0.720, 0.617, 0.797, 0.528, 0.644, 0.478, 0.729, 0.829, 0…
## $ energy       <dbl> 0.808, 0.728, 0.622, 0.965, 0.720, 0.736, 0.632, 0.652, 0…
## $ speech       <dbl> 0.0379, 0.0292, 0.2900, 0.0465, 0.0419, 0.0367, 0.0279, 0…
## $ acoustics    <dbl> 0.00793, 0.03280, 0.08070, 0.14100, 0.00145, 0.02000, 0.1…
## $ instrumental <dbl> 2.93e-02, 4.82e-02, 0.00e+00, 9.85e-01, 5.04e-01, 9.58e-0…
## $ liveness     <dbl> 0.0634, 0.3600, 0.0841, 0.0797, 0.0839, 0.1180, 0.1660, 0…
## $ valence      <dbl> 0.869, 0.808, 0.731, 0.587, 0.530, 0.564, 0.774, 0.726, 0…
## $ tempo        <dbl> 126.041, 139.066, 93.020, 136.065, 132.017, 144.705, 109.…
## $ genre        <chr> "Dance/Electronic", "pop", "hip hop, pop, R&B", "pop, Dan…
## $ duration_min <dbl> 5.119217, 3.979317, 4.733333, 3.758217, 3.513100, 3.83043…

The new dataframe includes the years of 2001 through 2019.

table(sample1df$year)
## 
## 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 
##  106   86   91   95  101   93   90   91   82  103   96  113   87  100   93   98 
## 2017 2018 2019 
##  110  104   89

In the first chart, I used ggplot to look at the number of Spotify songs in the dataset by release year. Then, in highcharter, I looked at the ten artists with the most songs in the dataset, which I treat as a measure of their popularity.

ggplot(sample1df, aes(factor(year))) +
  geom_bar(fill = "coral", alpha = 0.5) +
  theme_classic() +
  labs(x = "Year", y = "Number of Songs") +
  ggtitle("Spotify Songs Released by Year") +
  theme(axis.text.x = element_text(angle = 25, family = "Georgia"))

sample1df %>%
  count(artist) %>%
  arrange(artist) %>%
  top_n(10) %>%
  hchart(type = "column", color = "green", hcaes(x = artist, y = n)) %>%
  hc_title(text = "Spotify Songs by Artist and Popularity (2001 - 2019)") %>%
  hc_xAxis(title = list(text = "Artist")) %>%
  hc_yAxis(title = list(text = "Number of Songs"))   # y is the song count n, not the popularity score
## Selecting by n
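
The "Selecting by n" message appears because `top_n()` was called without naming the column to rank by. In current dplyr, `top_n()` is superseded by `slice_max()`, which names the ranking column explicitly. Below is a minimal sketch on toy data; the artist names are placeholders and `n_songs` is a column name I chose for illustration.

```r
library(dplyr)

# Toy data: one row per song (artist names are placeholders).
songs <- tibble(artist = c("A", "A", "A", "B", "B", "C", "D", "D", "D", "D"))

top_artists <- songs %>%
  count(artist, name = "n_songs") %>%   # number of songs per artist
  slice_max(n_songs, n = 2)             # keep the 2 largest counts (ties are kept too)

top_artists
```

Like `top_n()`, `slice_max()` keeps ties by default, so it can return more rows than requested when several artists share the cut-off count.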

For this dataframe, I used a highcharter treemap to look at the 20 artists with the highest number of popular songs.

sample1df %>%
  count(artist)%>%
   arrange(artist, desc(n))%>%
  top_n(20) %>%
  hchart(type = "treemap", hcaes(x = artist, value = n, color = n)) %>%
hc_title(text = "Top 20 Popular Artists by Number of Popular Songs") %>%
  hc_subtitle(text= "Spotify Dataset 2001 - 2019") %>% 
  hc_caption(text= "Source: Dataset Provided by Montgomery College, Data 110")%>% 
 hc_xAxis(title = list(text = "Artist")) %>%
 hc_yAxis(title = list(text = "Popularity"))
## Selecting by n

The treemap shows the artists with the most popular songs from 2001 - 2019. The artists with the most songs are Drake and Rihanna (tied at 23).

The Most Popular Artists - Rihanna and Drake

Rihanna and Drake tied for the highest number of popular songs from 2001 - 2019 (23 each). Let’s look at the songs and the years they were released on Spotify for each artist. I will also look at potential correlations given the variables provided in the dataset.

Rihanna’s Popularity

I started by filtering by the artist to see the number of songs and popularity ratings. I then created a column chart of Rihanna’s top songs. This chart is interactive.

rihanna <- sample1df %>%
  filter(artist == "Rihanna")
rihanna %>%
  hchart("column", hcaes(x = song, y = popular, group = year)) %>%
  hc_colorAxis() %>%
  hc_chart(style = list(fontFamily = "NewCenturySchoolbook",
                        fontWeight = "bold")) %>%
  hc_xAxis(title = list(text = "Song")) %>%
  hc_yAxis(title = list(text = "Popular")) %>%
  hc_title(text = "Rihanna's Top Songs") %>%
  hc_subtitle(text = "2005 - 2016") %>%
  hc_add_theme(hc_theme_sandsignika()) %>%
  hc_tooltip(shared = TRUE)

I then decided to embed the top song, Umbrella, via YouTube. Umbrella was released in 2008 and has a popularity rating of 81 (the highest of all songs by Rihanna).

library("vembedr")
## 
## Attaching package: 'vembedr'
## The following object is masked from 'package:lubridate':
## 
##     hms
suggest_embed("https://www.youtube.com/watch?v=xXD5tltX9Pg") 
## embed_youtube("xXD5tltX9Pg")

Feel free to click on the link “watch on YouTube” to watch Rihanna sing Umbrella.

embed_url("https://www.youtube.com/watch?v=xXD5tltX9Pg")

For Rihanna’s songs, the strongest positive correlations (computed in the code below) are in the following categories:

Danceability and Popularity (0.6)

Energy and Valence (0.5)

Energy and Tempo (0.4)

Energy and Duration in Minutes (0.4)

Danceability and Valence (0.4)

Tempo and Liveness (0.3)

Energy is a big variable in Rihanna’s songs. I wanted to look at the energy variable across her top 23 songs.

rihannacorr1 <- rihanna %>% select(danceability, speech, popular, acoustics, tempo, energy, instrumental,valence,liveness, duration_min)

rihannacorr2 <- round(cor(rihannacorr1),1)
head(rihannacorr2[, 1:9])
##              danceability speech popular acoustics tempo energy instrumental
## danceability          1.0   -0.2     0.6      -0.1  -0.5    0.0          0.2
## speech               -0.2    1.0     0.2      -0.3   0.2    0.0         -0.2
## popular               0.6    0.2     1.0       0.0  -0.2   -0.2         -0.1
## acoustics            -0.1   -0.3     0.0       1.0   0.1   -0.6         -0.1
## tempo                -0.5    0.2    -0.2       0.1   1.0    0.4          0.1
## energy                0.0    0.0    -0.2      -0.6   0.4    1.0          0.2
##              valence liveness
## danceability     0.4     -0.4
## speech          -0.1     -0.1
## popular          0.2     -0.6
## acoustics       -0.3     -0.1
## tempo            0.2      0.3
## energy           0.5      0.1
ggcorrplot(rihannacorr2, hc.order = TRUE, type = "lower", lab = TRUE) +
  labs(title = "Correlation Heat Map - Rihanna Spotify Songs")
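
With only 23 songs per artist, a correlation of 0.3 or 0.4 may not be distinguishable from zero. As a hedged sketch of how that could be checked, `cor.test()` from base R reports a confidence interval and p-value alongside the estimate. The data below are simulated stand-ins that I generated for illustration; they are not Rihanna's actual values.

```r
set.seed(110)   # arbitrary seed so the toy example is reproducible

# Simulated stand-ins for two audio features across 23 songs
# (made-up numbers, not taken from the dataset).
n <- 23
danceability <- runif(n, 0.4, 0.9)
popular <- 60 + 20 * danceability + rnorm(n, sd = 8)

# Pearson correlation with a 95% confidence interval and p-value.
result <- cor.test(danceability, popular)
round(result$estimate, 2)
round(result$conf.int, 2)
result$p.value
```

For 23 observations, the correlation needs to be roughly 0.41 or larger in absolute value to be significant at the 5% level, so the smaller values in the matrix should be read with caution.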

rihanna %>%
  mutate(song = reorder(song, energy)) %>%   # order the bars by energy; arrange() alone does not change plot order
  ggplot(aes(song, energy, fill = song)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1), legend.position = "none") +
  coord_flip() +
  ggtitle("Energy of Rihanna Songs")

Because energy is a big correlation factor in Rihanna’s songs, I wanted to look at Drake’s songs as well.

Drake’s dataframe

Similarly, I filtered for songs by Drake, looking for the magical 23 record that he shares with Rihanna. I also created an interactive chart to show the songs, popularity, and year.

drake <- sample1df %>%
  filter(artist == "Drake")
drake %>%
  hchart("column", hcaes(x = song, y = popular, group = year)) %>%
  hc_colorAxis() %>%
  hc_chart(style = list(fontFamily = "NewCenturySchoolbook",
                        fontWeight = "bold")) %>%
  hc_xAxis(title = list(text = "Song")) %>%
  hc_yAxis(title = list(text = "Popular")) %>%
  hc_title(text = "Drake's Top Songs") %>%
  hc_subtitle(text = "2009 - 2019") %>%
  hc_add_theme(hc_theme_sandsignika()) %>%
  hc_tooltip(shared = TRUE)

I also embedded a YouTube video to share the song with the highest popularity. The top song from Drake was “One Dance”, which had a popularity rating of 84. Now, let’s look at possible correlations.

suggest_embed("https://www.youtube.com/watch?v=iAbnEUA0wpA") 
## embed_youtube("iAbnEUA0wpA")

Feel free to click on the link “watch on YouTube” to watch Drake sing One Dance.

embed_url("https://www.youtube.com/watch?v=iAbnEUA0wpA")
drakecorr1 <- drake %>% select(danceability, speech, popular, acoustics, tempo, energy, instrumental,valence,liveness, duration_min)

drakecorr2 <- round(cor(drakecorr1),1)
head(drakecorr2[, 1:9])
##              danceability speech popular acoustics tempo energy instrumental
## danceability          1.0   -0.5    -0.1      -0.1   0.2   -0.7          0.1
## speech               -0.5    1.0     0.1       0.1   0.4    0.3         -0.3
## popular              -0.1    0.1     1.0      -0.1  -0.1    0.0         -0.5
## acoustics            -0.1    0.1    -0.1       1.0   0.1    0.1          0.3
## tempo                 0.2    0.4    -0.1       0.1   1.0   -0.2          0.0
## energy               -0.7    0.3     0.0       0.1  -0.2    1.0         -0.2
##              valence liveness
## danceability    -0.2      0.1
## speech           0.2      0.0
## popular         -0.2      0.1
## acoustics       -0.1     -0.2
## tempo            0.1     -0.3
## energy           0.5      0.0
ggcorrplot(drakecorr2, hc.order = TRUE, type = "lower", lab = TRUE) +
  labs(title = "Correlation Heat Map - Drake Spotify Songs")

drake %>%
  mutate(song = reorder(song, energy)) %>%   # order the songs by energy; arrange() alone does not change plot order
  ggplot(aes(song, energy)) +
  geom_point(color = "coral") +              # a fixed color belongs outside aes()
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1), legend.position = "none") +
  coord_flip() +
  ggtitle("Energy of Drake's Songs")

For Drake’s songs, the strongest positive correlations are in the following categories:

Energy and Valence (0.5)

Speech and Tempo (0.4)

Energy and Duration in Minutes (0.4)

Energy and Speech (0.3)

Acoustics and Instrumental (0.3)

Top Genres

For this analysis, I looked at the different genres (56 in total) and mapped them using a tree map. The largest single category was pop, followed by combinations of genres. They included: hip hop, pop; hip hop, pop, R&B; pop, Dance/Electronic; and pop, R&B.

sample1df%>% group_by(genre = genre) %>%
  summarise(popular = n()) %>% 
  arrange(desc(popular))
## # A tibble: 56 × 2
##    genre                          popular
##    <chr>                            <int>
##  1 pop                                388
##  2 hip hop, pop                       261
##  3 hip hop, pop, R&B                  223
##  4 pop, Dance/Electronic              211
##  5 pop, R&B                           156
##  6 hip hop                            114
##  7 hip hop, pop, Dance/Electronic      75
##  8 rock                                55
##  9 Dance/Electronic                    39
## 10 rock, pop                           36
## # ℹ 46 more rows
top <- sample1df%>% select(genre, popular) %>% group_by(genre) %>% summarise(n = n()) %>% top_n(56, n)
tm <- treemap(top, index = c("genre"), vSize = "n", vColor = 'genre', palette="RdYlBu")

Not surprisingly, Pop and a combination of hip hop and pop represent the most frequent genres.
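
The 56 genre values are comma-separated combinations such as "hip hop, pop", so a song can belong to several genres at once. As a sketch of one way to count the individual genres instead, each combined label can be split into one row per genre with `separate_rows()` from tidyr. The `songs` rows below are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy data using genre strings of the same comma-separated form (made-up rows).
songs <- tibble(
  song  = c("s1", "s2", "s3", "s4"),
  genre = c("pop", "hip hop, pop", "hip hop, pop, R&B", "rock")
)

# Split each combined label into one row per individual genre, then count.
genre_counts <- songs %>%
  separate_rows(genre, sep = ",\\s*") %>%
  count(genre, sort = TRUE)

genre_counts
```

Counted this way, a song tagged "hip hop, pop" contributes to both hip hop and pop, so the counts add up to more than the number of songs.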

Correlation

Loading additional packages to run histograms for numerical variables.

library(funModeling)
## Loading required package: Hmisc
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:plotly':
## 
##     subplot
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## funModeling v.1.9.4 :)
## Examples and tutorials at livebook.datascienceheroes.com
##  / Now in Spanish: librovivodecienciadedatos.ai
library(tidyverse) 
library(Hmisc)

Here is a quick way to retrieve one plot containing all the histograms for numerical variables.

# Drop artist, song, explicit, year, popular, and genre, then plot what remains
histograms <- sample1df[,-c(1,2,3,4,5,14)]
plot_num(histograms, bins = 10)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the funModeling package.
##   Please report the issue at <https://github.com/pablo14/funModeling/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Correlation Heat Map

I used a correlation heat map as a tool to display the correlation between multiple variables as a color coded matrix.

library(ggcorrplot)
samplecorr1 <- sample1df %>% select(danceability, speech, popular, acoustics, tempo, energy, instrumental,valence,liveness, duration_min)

samplecorr2 <- round(cor(samplecorr1),1)
head(samplecorr2[, 1:9])
##              danceability speech popular acoustics tempo energy instrumental
## danceability          1.0    0.1       0      -0.1  -0.2   -0.1          0.0
## speech                0.1    1.0       0       0.0   0.1   -0.1         -0.1
## popular               0.0    0.0       1       0.0   0.0    0.0         -0.1
## acoustics            -0.1    0.0       0       1.0  -0.1   -0.4          0.0
## tempo                -0.2    0.1       0      -0.1   1.0    0.2          0.0
## energy               -0.1   -0.1       0      -0.4   0.2    1.0          0.0
##              valence liveness
## danceability     0.4     -0.1
## speech           0.1      0.1
## popular          0.0      0.0
## acoustics       -0.1     -0.1
## tempo            0.0      0.0
## energy           0.3      0.1
ggcorrplot(samplecorr2, hc.order = TRUE, type = "lower", lab = TRUE) +
  labs(title = "Correlation Heat Map - Spotify Songs")

The strongest positive correlations are between valence and danceability (0.40) and between valence and energy (0.33).
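
The strongest pairs can also be pulled out of a correlation matrix programmatically rather than read off the heat map. The sketch below uses a toy data frame of random numbers (columns `a`, `b`, and `c` are made up, with `b` built to correlate with `a`) and base R only.

```r
# Toy data frame of numeric features (random values, for illustration only).
set.seed(1)
toy <- data.frame(a = rnorm(50))
toy$b <- toy$a + rnorm(50, sd = 0.5)   # built to correlate with a
toy$c <- rnorm(50)

corr <- cor(toy)

# Keep each pair once (upper triangle), then sort by absolute correlation.
idx <- which(upper.tri(corr), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = corr[idx]
)
pairs <- pairs[order(-abs(pairs$r)), ]
pairs
```

Sorting by absolute value also surfaces strong negative relationships, such as acoustics and energy (-0.4) in the matrix above.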

Valence and Danceability

Let’s check the correlation

cor(sample1df$valence, sample1df$danceability)
## [1] 0.3988757

Valence and Energy

Let’s check the correlation quickly to see if something stands out.

cor(sample1df$valence, sample1df$energy)
## [1] 0.3270572

The most popular artists and songs in the 2001 - 2019 Spotify data set are explored in this section of the project.

# select() avoids the deprecated multi-row summarise(); sorting ties by song and artist keeps a stable order
most_popular <- sample1df %>%
  select(song, artist, popular) %>%
  arrange(desc(popular), song, artist) %>%
  head(22)
most_popular
## # A tibble: 22 × 3
##    song                 artist            popular
##    <chr>                <chr>               <dbl>
##  1 Sweater Weather      The Neighbourhood      89
##  2 Another Love         Tom Odell              88
##  3 Without Me           Eminem                 87
##  4 Wait a Minute!       WILLOW                 86
##  5 lovely (with Khalid) Billie Eilish          86
##  6 'Till I Collapse     Eminem                 85
##  7 Circles              Post Malone            85
##  8 Daddy Issues         The Neighbourhood      85
##  9 Locked out of Heaven Bruno Mars             85
## 10 Perfect              Ed Sheeran             85
## # ℹ 12 more rows

Most Popular Individual Songs on Spotify

I wanted to explore the popular songs encompassed in the data set (2001 - 2019). This interactive chart (highcharter) identifies the song, artist, and popularity score for the most popular individual songs on Spotify. Further, for artists with multiple popular songs, each individual song is listed on the tooltip with the corresponding popularity score.

most_popular%>%
  hchart(type = "column",color = "#FF7900", hcaes(x = artist, y= popular, group = song)) %>%
hc_title(text = "Individual Songs by Popularity (2001 - 2019)") %>%
   hc_subtitle(text = "Source: Data Supplied by Montgomery College, Data 110") %>%
 hc_xAxis(title = list(text = "Artist")) %>%
 hc_yAxis(title = list(text = "Popularity")) %>%
  hc_add_theme(hc_theme_darkunica()) %>%
 hc_tooltip(shared = TRUE)

The winner of most popular song (89) is Sweater Weather by The Neighbourhood.

What the visualization represents, any interesting patterns or surprises that arise within the visualization:

There were several interesting patterns that were made clear through the correlation plots, including the connection between energy and valence in the songs that were the most popular and streamed the most often. It was not surprising that Pop was a popular genre, but I was surprised to learn about all of the audio features that Spotify tracks as variables, including liveness, energy, and danceability. I was also surprised that most songs have a runtime between 3 and 5 minutes.

Anything that could have been shown that you could not get to work or that you wished you could have included.

I attempted to embed videos of the popular songs by Rihanna and Drake. I also attempted to include images to bring the project to life. I anticipated working on more correlations between songs with explicit lyrics and popularity, and on how some of the variables (danceability, acoustics, and tempo) for songs with low popularity scores compare to those of the more popular songs.

Bibliography

For my project, I researched the topic by looking at articles that discussed Spotify and the popularity of certain types of music. The most influential article was “Can big data really predict what makes a song popular?”, published October 10, 2022 by the online periodical The Conversation. I am sharing a link to the article: https://theconversation.com/can-big-data-really-predict-what-makes-a-song-popular-189052