Analysis of Spotify Data 2010-2019

This dataset contains 603 songs that appeared among Billboard's top songs of the year from 2010 to 2019.

Load the packages

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(cluster)

Reading the data

spotify_data<-read_csv("top10s.csv")
## New names:
## Rows: 603 Columns: 15
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (3): title, artist, top genre dbl (12): ...1, year, bpm, nrgy, dnce, dB, live,
## val, dur, acous, spch, pop
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

Summary of Spotify Data

The summary below, produced with summary(spotify_data), shows the range of each variable.

##       ...1          title              artist           top genre        
##  Min.   :  1.0   Length:603         Length:603         Length:603        
##  1st Qu.:151.5   Class :character   Class :character   Class :character  
##  Median :302.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :302.0                                                           
##  3rd Qu.:452.5                                                           
##  Max.   :603.0                                                           
##       year           bpm             nrgy           dnce      
##  Min.   :2010   Min.   :  0.0   Min.   : 0.0   Min.   : 0.00  
##  1st Qu.:2013   1st Qu.:100.0   1st Qu.:61.0   1st Qu.:57.00  
##  Median :2015   Median :120.0   Median :74.0   Median :66.00  
##  Mean   :2015   Mean   :118.5   Mean   :70.5   Mean   :64.38  
##  3rd Qu.:2017   3rd Qu.:129.0   3rd Qu.:82.0   3rd Qu.:73.00  
##  Max.   :2019   Max.   :206.0   Max.   :98.0   Max.   :97.00  
##        dB               live            val             dur       
##  Min.   :-60.000   Min.   : 0.00   Min.   : 0.00   Min.   :134.0  
##  1st Qu.: -6.000   1st Qu.: 9.00   1st Qu.:35.00   1st Qu.:202.0  
##  Median : -5.000   Median :12.00   Median :52.00   Median :221.0  
##  Mean   : -5.579   Mean   :17.77   Mean   :52.23   Mean   :224.7  
##  3rd Qu.: -4.000   3rd Qu.:24.00   3rd Qu.:69.00   3rd Qu.:239.5  
##  Max.   : -2.000   Max.   :74.00   Max.   :98.00   Max.   :424.0  
##      acous            spch             pop       
##  Min.   : 0.00   Min.   : 0.000   Min.   : 0.00  
##  1st Qu.: 2.00   1st Qu.: 4.000   1st Qu.:60.00  
##  Median : 6.00   Median : 5.000   Median :69.00  
##  Mean   :14.33   Mean   : 8.358   Mean   :66.52  
##  3rd Qu.:17.00   3rd Qu.: 9.000   3rd Qu.:76.00  
##  Max.   :99.00   Max.   :48.000   Max.   :99.00

Let’s check the genres:

table(spotify_data$`top genre`)
## 
##              acoustic pop              alaska indie           alternative r&b 
##                         2                         1                         1 
##                   art pop               atl hip hop          australian dance 
##                         8                         5                         6 
##        australian hip hop            australian pop             barbadian pop 
##                         1                         5                        15 
##               baroque pop               belgian edm                  big room 
##                         2                         2                        10 
##                  boy band              british soul                   brostep 
##                        15                        11                         2 
## canadian contemporary r&b          canadian hip hop            canadian latin 
##                         9                         2                         1 
##              canadian pop                 candy pop               celtic rock 
##                        34                         2                         1 
##               chicago rap             colombian pop                complextro 
##                         1                         3                         6 
##      contemporary country                 dance pop                danish pop 
##                         1                       327                         1 
##           detroit hip hop                 downtempo                       edm 
##                         2                         2                         5 
##                   electro             electro house           electronic trap 
##                         2                         1                         2 
##                electropop               escape room                  folk-pop 
##                        13                         2                         2 
##          french indie pop                   hip hop                   hip pop 
##                         1                         4                         6 
##                 hollywood                     house                 indie pop 
##                         1                         1                         2 
##   irish singer-songwriter                     latin              metropopolis 
##                         1                         4                         1 
##              moroccan pop                neo mellow            permanent wave 
##                         1                         9                         4 
##                       pop            tropical house 
##                        60                         3

Since the data has many genres, I grouped them into broader categories to make them easier to understand.

spotify_data <-spotify_data%>%separate((`top genre`),c("variable","genre"),extra='merge')
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 97 rows [5, 12, 24, 55,
## 58, 60, 65, 106, 108, 110, 114, 118, 119, 138, 148, 150, 153, 155, 163, 168,
## ...].
other <-c("downtempo","escape","hollywood","house","permanent","metropolis","Null")
spotify_data$genre[spotify_data$variable %in% other]  <- 'Other'
spotify_data$genre[spotify_data$variable =='hip']  <- 'hip hop'
spotify_data$genre[spotify_data$variable =='irish']  <- 'folk'
spotify_data$genre[spotify_data$variable =='folk']  <- 'folk'
spotify_data$genre[spotify_data$variable =='tropical']  <- 'edm'
spotify_data$genre[spotify_data$variable =='complextro']  <- 'edm'
spotify_data$genre[spotify_data$variable =='electro']  <- 'edm'
spotify_data$genre[spotify_data$variable =='electronic']  <- 'edm'
spotify_data$genre[spotify_data$variable =='brostep']  <- 'edm'
spotify_data$genre[spotify_data$variable =='latin']  <- 'latin'
spotify_data$genre[spotify_data$variable =='boy']  <- 'pop'
spotify_data$genre[spotify_data$variable =='french']  <- 'pop'
spotify_data$genre[spotify_data$variable =='electropop']  <- 'pop'
spotify_data$genre[spotify_data$variable =='pop']  <- 'pop'
spotify_data$genre[spotify_data$variable =='neo']  <- 'alt rock'
spotify_data$genre[spotify_data$genre=='contemporary r&b'] <-'r&b'
spotify_data$genre[spotify_data$genre=='room'] <-'Other'
table(spotify_data$genre)
## 
## alt rock  country    dance      edm     folk  hip hop    indie    latin 
##        9        1        6       18        3       20        1        5 
##    Other      pop      r&b      rap     rock     soul 
##       20      491       10        1        1       11
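As an alternative sketch, the same kind of grouping can be written as one case_when() call that lists every raw label explicitly, so that no label falls through to NA. The toy tibble, the column name top_genre, and the label vectors below are illustrative assumptions that cover only a few of the genres in the table above; this is not the mapping used in this report.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the `top genre` column (a handful of labels from the table above)
toy <- tibble(top_genre = c("dance pop", "big room", "edm", "metropopolis",
                            "neo mellow", "atl hip hop", "tropical house"))

# Hypothetical label groups; extend them to cover the full genre list
edm_labels   <- c("edm", "tropical house", "complextro", "electro",
                  "electro house", "electronic trap", "brostep", "belgian edm")
other_labels <- c("big room", "downtempo", "escape room", "hollywood",
                  "house", "permanent wave", "metropopolis")

toy <- toy %>%
  mutate(genre = case_when(
    top_genre %in% edm_labels                             ~ "edm",
    top_genre %in% other_labels                           ~ "Other",
    top_genre == "hip pop" | grepl("hip hop$", top_genre) ~ "hip hop",
    top_genre == "neo mellow"                             ~ "alt rock",
    grepl("pop$", top_genre)                              ~ "pop",
    TRUE                                                  ~ top_genre  # keep unmatched labels as-is
  ))

toy
```

The rules are checked in order, so the more specific ones (for example "hip pop") need to come before the general "ends in pop" rule.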

Cleaning the data

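Before analyzing the data, check for null or missing values. The check below finds six missing genre values: the five "edm" songs and the one "metropopolis" song are not matched by the recoding above (the list spells it "metropolis"), so na.omit() removes them and 597 songs remain.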

miss<-colSums(is.na(spotify_data))
print(miss)
##     ...1    title   artist variable    genre     year      bpm     nrgy 
##        0        0        0        0        6        0        0        0 
##     dnce       dB     live      val      dur    acous     spch      pop 
##        0        0        0        0        0        0        0        0
spotify_data <-na.omit(spotify_data)
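Before dropping rows, it can help to see which raw labels are behind the missing genres. The following is a minimal sketch on a toy tibble (the data and the names toy, missing_by_label, and toy_kept are made up for illustration); it also shows one possible alternative to na.omit(), which is to assign a fallback category instead of removing the rows.

```r
library(dplyr)
library(tibble)

# Toy stand-in for spotify_data after the recoding step
toy <- tibble(
  title    = c("A", "B", "C", "D"),
  variable = c("dance", "edm", "metropopolis", "edm"),
  genre    = c("pop", NA, NA, NA)
)

# Count the unmatched labels behind the missing genres
missing_by_label <- toy %>%
  filter(is.na(genre)) %>%
  count(variable, name = "n_missing")

# Alternative to dropping: keep the rows and assign a fallback category
toy_kept <- toy %>%
  mutate(genre = coalesce(genre, "Other"))

missing_by_label
```

Whether to drop or relabel such rows is a judgment call; relabeling keeps the sample size but makes the fallback category more mixed.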

Data Exploration and Visualization

I want to analyze the most popular songs in 2010 and 2019, identify the artists with the most popular songs in those years, and compare the popular genres in 2010 and 2019. I also want to find which artists had the most popular songs across 2010 to 2019.

After running the code, I found that the most popular song in 2010 was “Hey, Soul Sister” by Train (alternative rock), and in 2019 it was “Memories” by Maroon 5 (pop). In 2010, the artists with the most popular songs were The Black Eyed Peas, Kesha, and Christina Aguilera, while in 2019 Ed Sheeran alone held that position. Across 2010 to 2019, Katy Perry had the most popular songs, and pop was the most popular genre.
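The code behind these findings is not shown here. One possible way to compute them is sketched below on a made-up toy tibble (all titles, artists, and scores are invented). It reads "most popular song" as the highest pop score within a year and "most popular artist" as the artist with the most songs in the data, which is an assumption about the intended definition.

```r
library(dplyr)
library(tibble)

# Toy stand-in for spotify_data (invented values)
toy <- tibble(
  title  = c("Song A", "Song B", "Song C", "Song D", "Song E"),
  artist = c("Artist X", "Artist Y", "Artist X", "Artist Z", "Artist X"),
  genre  = c("pop", "alt rock", "pop", "pop", "edm"),
  year   = c(2010, 2010, 2019, 2019, 2019),
  pop    = c(70, 83, 99, 80, 60)
)

# Most popular song in each year of interest (ties are kept)
top_song <- toy %>%
  filter(year %in% c(2010, 2019)) %>%
  group_by(year) %>%
  slice_max(pop, n = 1) %>%
  ungroup()

# Artists and genres ranked by number of songs across all years
top_artists <- toy %>% count(artist, sort = TRUE)
top_genres  <- toy %>% count(genre, sort = TRUE)

top_song
```

Adding a filter on year before count() would give the per-year artist and genre rankings.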

Measurement

Here I analyze the songs with the longest and shortest durations and the highest and lowest BPM, assess the energy level of each song, identify the most energetic tracks and genres, find the loudest and quietest songs, highlight dance-friendly songs, and showcase songs with a positive sound (happy, cheerful).

Analyze the genres and songs with the longest and shortest durations

duration <-spotify_data%>%ggplot(aes(x=reorder(genre,dur),y=dur))+geom_boxplot(fill="magenta")+ggtitle("Song Duration")+theme_light()
duration

Songs with the longest duration

max_pop <-spotify_data%>%filter(dur>=400)%>%select(artist,dur,title)
max_pop
## # A tibble: 2 × 3
##   artist                   dur title                                            
##   <chr>                  <dbl> <chr>                                            
## 1 Justin Timberlake        424 "TKO"                                            
## 2 Florence + The Machine   403 "Wish That You Were Here - From “Miss Peregri…

Songs with the shortest duration

min_pop <-spotify_data%>%filter(dur<=150)%>%select(artist,dur,title)
min_pop
## # A tibble: 2 × 3
##   artist          dur title                          
##   <chr>         <dbl> <chr>                          
## 1 Justin Bieber   134 Mark My Words                  
## 2 R3HAB           148 All Around The World (La La La)

Analyze the genres and songs with the fastest and slowest BPM

bpm <- spotify_data%>%ggplot(aes(x=reorder(genre,bpm),y=bpm))+geom_boxplot(fill="magenta")+ggtitle("bpm")+theme_light()
bpm

# Songs at or above 200 BPM, sorted by artist
bpm_high <-spotify_data%>%filter(bpm>=200)%>%select(artist,title,bpm,genre)%>%arrange(artist)
bpm_high
## # A tibble: 3 × 4
##   artist     title                                 bpm genre
##   <chr>      <chr>                               <dbl> <chr>
## 1 Fergie     L.A.LOVE (la la)                      202 pop  
## 2 Little Mix How Ya Doin'? (feat. Missy Elliott)   201 pop  
## 3 Rihanna    FourFiveSeconds                       206 pop
# Songs with a recorded BPM of 0, sorted by artist
bpm_low <-spotify_data%>%filter(bpm<=0)%>%select(artist,title,bpm,genre)%>%arrange(artist)
bpm_low
## # A tibble: 1 × 4
##   artist title               bpm genre
##   <chr>  <chr>             <dbl> <chr>
## 1 Adele  Million Years Ago     0 soul
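The cut-offs used above (200 and 0) depend on already knowing the range of bpm. A minimal sketch of picking the extremes directly with slice_max() and slice_min() follows, using an invented toy tibble. Treating a BPM of 0 as a placeholder and excluding it when looking for the slowest song is an assumption, not something the dataset documents.

```r
library(dplyr)
library(tibble)

# Toy stand-in for spotify_data (invented values)
toy <- tibble(
  artist = c("Artist A", "Artist B", "Artist C", "Artist D"),
  title  = c("Song 1", "Song 2", "Song 3", "Song 4"),
  bpm    = c(0, 65, 120, 206)
)

# Fastest song: the row with the largest bpm (ties are kept)
fastest <- toy %>% slice_max(bpm, n = 1)

# Slowest song with a plausible tempo: drop bpm == 0 first (assumption)
slowest <- toy %>%
  filter(bpm > 0) %>%
  slice_min(bpm, n = 1)

bind_rows(fastest, slowest)
```

The same pattern applies to dur, dB, nrgy, dnce, and val by swapping the column name.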

Analyze the genres and songs with the highest and lowest dB

dB <- spotify_data%>%ggplot(aes(x=reorder(genre,dB),dB))+ geom_boxplot(fill="magenta")+ggtitle("dB")+theme_light()
dB

Song with the lowest dB

min_dB <-spotify_data%>%filter(dB<=-50.000)%>%select(artist,dB,title,dur, genre)
min_dB
## # A tibble: 1 × 5
##   artist    dB title               dur genre
##   <chr>  <dbl> <chr>             <dbl> <chr>
## 1 Adele    -60 Million Years Ago   227 soul

Songs with the highest dB

max_dB <-spotify_data%>%filter(dB>=-2.000)%>%select(artist,dB,title,dur,genre)
max_dB
## # A tibble: 5 × 5
##   artist            dB title                      dur genre
##   <chr>          <dbl> <chr>                    <dbl> <chr>
## 1 Britney Spears    -2 3                          213 pop  
## 2 One Direction     -2 What Makes You Beautiful   200 pop  
## 3 Nicki Minaj       -2 Starships                  211 pop  
## 4 Jonas Brothers    -2 Pom Poms                   198 pop  
## 5 Galantis          -2 Rich Boy                   184 Other

Analyze the most energetic genres and songs

energy <- spotify_data%>%ggplot(aes(x=reorder(genre,nrgy),nrgy))+ geom_boxplot(fill="magenta")+ggtitle("The energy of each genre")+theme_light()
energy

Songs with the highest energy

High_energy <-spotify_data%>%arrange(desc(nrgy))%>%select(artist,nrgy,title,year,genre)
High_energy
## # A tibble: 597 × 5
##    artist               nrgy title                                    year genre
##    <chr>               <dbl> <chr>                                   <dbl> <chr>
##  1 Martin Solveig         98 Hello                                    2010 Other
##  2 Jonas Brothers         98 Pom Poms                                 2013 pop  
##  3 Pitbull                96 Don't Stop the Party (feat. TJR)         2012 pop  
##  4 Avril Lavigne          96 Rock N Roll                              2013 pop  
##  5 OneRepublic            95 All The Right Moves                      2010 pop  
##  6 Tinie Tempah           95 Written in the Stars (feat. Eric Turne…  2010 pop  
##  7 Tinie Tempah           95 Written in the Stars (feat. Eric Turne…  2011 pop  
##  8 Little Mix             95 How Ya Doin'? (feat. Missy Elliott)      2013 pop  
##  9 5 Seconds of Summer    95 She Looks So Perfect                     2014 pop  
## 10 Jennifer Lopez         95 Booty                                    2015 pop  
## # ℹ 587 more rows

Analyze the most danceable genres and songs

dance_genre <- spotify_data%>%ggplot(aes(x=reorder(genre,dnce),y=dnce))+geom_boxplot(fill="magenta")+ggtitle("Genre to dance ")+theme_light()
dance_genre

danceable_song <-spotify_data%>%arrange(desc(dnce))%>%select(artist,dnce,title,genre)
danceable_song
## # A tibble: 597 × 4
##    artist             dnce title                 genre  
##    <chr>             <dbl> <chr>                 <chr>  
##  1 Selena Gomez         97 Bad Liar              pop    
##  2 Cardi B              97 Drip (feat. Migos)    pop    
##  3 Nicki Minaj          96 Anaconda              pop    
##  4 Pharrell Williams    93 Come Get It Bae       pop    
##  5 Meghan Trainor       93 Me Too                pop    
##  6 Missy Elliott        93 WTF (Where They From) pop    
##  7 Cardi B              93 Bodak Yellow          pop    
##  8 N.E.R.D              92 Lemon                 hip hop
##  9 Iggy Azalea          91 Fancy                 hip hop
## 10 Jennifer Hudson      90 Dangerous             pop    
## # ℹ 587 more rows

Analyze the genres and songs with the most positive sound (happy, cheerful)

Pos_sound <- spotify_data%>%ggplot(aes(x=reorder(genre,val),y=val))+geom_boxplot(fill="magenta")+ggtitle("The positive sound of each genre ")+theme_light()
Pos_sound

Positive_Sound <-spotify_data%>%arrange(desc(val))%>%select(artist,val,title,genre)
Positive_Sound
## # A tibble: 597 × 4
##    artist              val title                              genre
##    <chr>             <dbl> <chr>                              <chr>
##  1 Austin Mahone        98 "Mmm Yeah (feat. Pitbull)"         pop  
##  2 Shawn Mendes         97 "There's Nothing Holdin' Me Back"  pop  
##  3 Pharrell Williams    96 "Happy - From \"Despicable Me 2\"" pop  
##  4 Meghan Trainor       96 "All About That Bass"              pop  
##  5 Pitbull              95 "Don't Stop the Party (feat. TJR)" pop  
##  6 Meghan Trainor       95 "Lips Are Movin"                   pop  
##  7 Jonas Brothers       95 "Sucker"                           pop  
##  8 Taylor Swift         94 "Shake It Off"                     pop  
##  9 Bruno Mars           94 "Treasure"                         pop  
## 10 Ed Sheeran           94 "Sing"                             pop  
## # ℹ 587 more rows

Correlations between variables

Energy and Danceability

energy_dance <-spotify_data%>%ggplot(aes(x=nrgy,y=dnce))+geom_point()+ggtitle("Correlated Energy & Danceability")+theme_light()+geom_smooth(se=FALSE)
energy_dance
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

BPM and Danceability

bpm_dance <-spotify_data%>%ggplot(aes(x=bpm,y=dnce))+geom_point()+ggtitle("Correlated BPM & Danceability")+theme_light()+geom_smooth(se=FALSE)
bpm_dance
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Energy and dB

Energy_dB <-spotify_data%>%ggplot(aes(x=nrgy,y=dB))+geom_point()+ggtitle("Correlated Energy & dB")+theme_light()+geom_smooth(se=FALSE)
Energy_dB
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Energy and Positive Sound

energy_Pos <-spotify_data%>%ggplot(aes(x=nrgy,y=val))+geom_point()+ggtitle("Correlated Energy and Positivity")+theme_light()+geom_smooth(se=FALSE)
energy_Pos
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Pos_Energy <-spotify_data%>%ggplot(aes(x=val,y=nrgy))+geom_point()+ggtitle("Positivity/Energy based on Genre")+theme_light()+geom_smooth(se=FALSE)+facet_wrap(~genre)
Pos_Energy
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 73.26
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 23.26
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 529
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : span too small.  fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 13.85
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 10.15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 406.02
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : span too small.  fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 46.825
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 22.175
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 173.58
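To complement the scatter plots with numbers, pairwise Pearson correlation coefficients can be computed with cor(). The sketch below uses an invented toy tibble with the same column names as the dataset; the values are made up, so the coefficients it prints say nothing about the real data.

```r
library(tibble)

# Toy stand-in for the numeric columns of spotify_data (invented values)
toy <- tibble(
  nrgy = c(40, 55, 60, 75, 90),
  dnce = c(50, 60, 58, 70, 72),
  dB   = c(-9, -7, -6, -5, -3),
  val  = c(30, 45, 50, 65, 80)
)

# Pairwise Pearson correlations, rounded for readability
cor_matrix <- round(cor(toy, method = "pearson"), 2)
cor_matrix
```

A coefficient near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship; with the real data, method = "spearman" is an option if outliers such as the 0 BPM track are a concern.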

Summary

Based on the analysis above, the most popular song in 2010 was ‘Hey, Soul Sister’ by Train (alternative rock), while in 2019 it was ‘Memories’ by Maroon 5 (pop). The top artists in 2010 were The Black Eyed Peas, Kesha, and Christina Aguilera, whereas in 2019 Ed Sheeran stood out. Throughout 2010 to 2019, pop remained the most popular genre.

For duration, Justin Timberlake’s ‘TKO’ was the longest track at 424 seconds, while Justin Bieber’s ‘Mark My Words’ was the shortest at 134 seconds. For tempo, Rihanna’s ‘FourFiveSeconds’ (pop) had the fastest BPM at 206, while Adele’s ‘Million Years Ago’ (soul) had the lowest recorded value at 0 BPM. The same track also had the lowest loudness at -60 dB; since 0 BPM is not a real tempo, these values are probably placeholders for missing audio features rather than measurements. The loudest songs, all at -2 dB, were by Britney Spears, One Direction, Nicki Minaj, and Jonas Brothers in the pop genre, and by Galantis in the Other category.

For energy, Martin Solveig’s ‘Hello’ (Other) and Jonas Brothers’ ‘Pom Poms’ (pop) scored highest at 98. For danceability, Selena Gomez’s ‘Bad Liar’ and Cardi B’s ‘Drip (feat. Migos)’, both pop, scored highest at 97. Austin Mahone’s ‘Mmm Yeah (feat. Pitbull)’ (pop) had the highest positivity (val) at 98.

The scatter plots suggest that more energetic songs tend to be easier to dance to, while a higher BPM can make a song harder to dance to. Louder songs tend to be more energetic, and songs with higher positivity values also tend to have higher energy.