What’s On Spotify

Spotify’s slogan says listening is everything, and the company also claims to be all the music you will ever need. Spotify has a huge catalog of songs and podcasts. I wonder what people listened to in 2020-2021, during the pandemic era. Is there a correlation between an artist’s Spotify followers and how their songs stream and chart? Let’s find out.

knitr::include_graphics('Spotify_img')

Music Around The World

People around the world listen to music all the time, and many of them use Spotify as their daily listening app. Spotify is the no. 1 music streaming platform by number of subscribers. Based on company data, Spotify had 299 million unique users in Q2 2020, and the number keeps growing. Spotify is still top of mind for listeners. Other music streaming services include Apple Music, Amazon Music, YouTube, and many more.

knitr::include_graphics("Spotify's Annual Users.PNG")

This Spotify dataset is from Kaggle and covers 2020-2021. We are going to analyze the Top 200 charts on Spotify further.

# Read Data
spotify <- read.csv('spotify_dataset.csv')

# Show the first 3 rows to check what's inside
head(spotify, 3)
# Check the data structure

str(spotify)
#> 'data.frame':    1556 obs. of  23 variables:
#>  $ Index                    : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ Highest.Charting.Position: int  1 2 1 3 5 1 3 2 3 8 ...
#>  $ Number.of.Times.Charted  : int  8 3 11 5 1 18 16 10 8 10 ...
#>  $ Week.of.Highest.Charting : chr  "2021-07-23--2021-07-30" "2021-07-23--2021-07-30" "2021-06-25--2021-07-02" "2021-07-02--2021-07-09" ...
#>  $ Song.Name                : chr  "Beggin'" "STAY (with Justin Bieber)" "good 4 u" "Bad Habits" ...
#>  $ Streams                  : chr  "48,633,449" "47,248,719" "40,162,559" "37,799,456" ...
#>  $ Artist                   : chr  "Måneskin" "The Kid LAROI" "Olivia Rodrigo" "Ed Sheeran" ...
#>  $ Artist.Followers         : int  3377762 2230022 6266514 83293380 5473565 5473565 8640063 6080597 36142273 3377762 ...
#>  $ Song.ID                  : chr  "3Wrjm47oTz2sjIgck11l5e" "5HCyWlXZPP0y6Gqq8TgA20" "4ZtFanR9U6ndgddUvNcjcG" "6PQ88X9TkUIAUIZJHW2upE" ...
#>  $ Genre                    : chr  "['indie rock italiano', 'italian pop']" "['australian hip hop']" "['pop']" "['pop', 'uk pop']" ...
#>  $ Release.Date             : chr  "2017-12-08" "2021-07-09" "2021-05-21" "2021-06-25" ...
#>  $ Weeks.Charted            : chr  "2021-07-23--2021-07-30\n2021-07-16--2021-07-23\n2021-07-09--2021-07-16\n2021-07-02--2021-07-09\n2021-06-25--202"| __truncated__ "2021-07-23--2021-07-30\n2021-07-16--2021-07-23\n2021-07-09--2021-07-16" "2021-07-23--2021-07-30\n2021-07-16--2021-07-23\n2021-07-09--2021-07-16\n2021-07-02--2021-07-09\n2021-06-25--202"| __truncated__ "2021-07-23--2021-07-30\n2021-07-16--2021-07-23\n2021-07-09--2021-07-16\n2021-07-02--2021-07-09\n2021-06-25--2021-07-02" ...
#>  $ Popularity               : int  100 99 99 98 96 97 94 95 96 95 ...
#>  $ Danceability             : num  0.714 0.591 0.563 0.808 0.736 0.61 0.762 0.78 0.644 0.75 ...
#>  $ Energy                   : num  0.8 0.764 0.664 0.897 0.704 0.508 0.701 0.718 0.648 0.608 ...
#>  $ Loudness                 : num  -4.81 -5.48 -5.04 -3.71 -7.41 ...
#>  $ Speechiness              : num  0.0504 0.0483 0.154 0.0348 0.0615 0.152 0.0286 0.0506 0.118 0.0387 ...
#>  $ Acousticness             : num  0.127 0.0383 0.335 0.0469 0.0203 0.297 0.235 0.31 0.276 0.00165 ...
#>  $ Liveness                 : num  0.359 0.103 0.0849 0.364 0.0501 0.384 0.123 0.0932 0.135 0.178 ...
#>  $ Tempo                    : num  134 170 167 126 150 ...
#>  $ Duration..ms.            : int  211560 141806 178147 231041 212000 137876 208867 199604 206710 173347 ...
#>  $ Valence                  : num  0.589 0.478 0.688 0.591 0.894 0.758 0.742 0.342 0.44 0.958 ...
#>  $ Chord                    : chr  "B" "C#/Db" "A" "B" ...

Get Only What Is Needed

The dataset has more columns than I need, so I simplify it by keeping only these:
1. Artist
2. Song Name
3. Streams
4. Artist Followers
5. Highest Charting Position
6. Number of Times Charted
7. Release Date

# Keep only the columns we need
spotify_ds <- spotify[, c('Artist', 'Song.Name', 'Streams', 'Artist.Followers', 'Highest.Charting.Position', 'Number.of.Times.Charted', 'Release.Date')]

# Parse Release.Date from text into Date values
library(lubridate)
spotify_ds$Release.Date <- ymd(spotify_ds$Release.Date)
head(spotify_ds, 3)
# Count the rows in the dataset
nrow(spotify_ds)
#> [1] 1556
# Get Summary of dataset
summary(spotify_ds)
#>     Artist           Song.Name           Streams          Artist.Followers  
#>  Length:1556        Length:1556        Length:1556        Min.   :    4883  
#>  Class :character   Class :character   Class :character   1st Qu.: 2123734  
#>  Mode  :character   Mode  :character   Mode  :character   Median : 6852509  
#>                                                           Mean   :14716903  
#>                                                           3rd Qu.:22698747  
#>                                                           Max.   :83337783  
#>                                                           NA's   :11        
#>  Highest.Charting.Position Number.of.Times.Charted  Release.Date       
#>  Min.   :  1.00            Min.   :  1.00          Min.   :1942-01-01  
#>  1st Qu.: 37.00            1st Qu.:  1.00          1st Qu.:2020-01-17  
#>  Median : 80.00            Median :  4.00          Median :2020-06-19  
#>  Mean   : 87.74            Mean   : 10.67          Mean   :2019-03-08  
#>  3rd Qu.:137.00            3rd Qu.: 12.00          3rd Qu.:2021-01-14  
#>  Max.   :200.00            Max.   :142.00          Max.   :2021-08-13  
#>                                                    NA's   :28

We are going to analyze 1,556 songs from Spotify’s charts. Note that the summary reports missing values: 11 in Artist.Followers and 28 in Release.Date.
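
The summary output reports missing values per column. As a minimal sketch of how those counts can be checked directly, the example below uses a small made-up data frame (`toy` and its values are assumptions for illustration, not rows from the Kaggle file):

```r
# Toy data frame standing in for spotify_ds (values are made up for illustration)
toy <- data.frame(
  Artist = c("A", "B", "C", "D"),
  Artist.Followers = c(100, NA, 300, 400),
  Release.Date = as.Date(c("2020-01-17", NA, NA, "2021-01-14"))
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Count rows that have no missing value in any column
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```

Running the same two calls on `spotify_ds` would show how many rows a later `na.omit()` is going to drop.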

Top Artists and Their Followers

Let’s look at the relationship between a song’s chart position and the artist’s Spotify followers. A common assumption is that the more followers you have, the more people are going to listen. Is that true in this case?
Here is a histogram of artist followers:

# Make Histogram
Artist_Followers <- spotify_ds$Artist.Followers
hist(Artist_Followers, breaks = 200, main = 'Histogram of Spotify Artist Followers',
     xlab = 'Followers',
     cex.main=2, cex.lab=1.5, cex.sub=1.2)

# Make sure Artist.Followers is numeric
spotify_ds$Artist.Followers <- as.numeric(spotify_ds$Artist.Followers)

# Aggregate to get the artist followers ranking (highest follower count per artist)
agg_df <- aggregate(Artist.Followers ~ Artist, data = spotify_ds, FUN = max)
agg_df[order(agg_df$Artist.Followers, decreasing = TRUE), ]


The histogram shows how widely follower counts vary. From the table, the top 3 are Ed Sheeran, Ariana Grande, and Drake. Ed Sheeran has 83 million followers on his account, the most of any artist in this dataset.

# Remove commas from Streams, then convert it to numeric
spotify_ds$Streams <- as.numeric(gsub(",", "", spotify_ds$Streams))

# Aggregate to get the ranking of artists by total song streams
agg_df <- aggregate(Streams ~ Artist, data = spotify_ds, FUN = sum)
agg_df[order(agg_df$Streams, decreasing = TRUE), ]


In 2020-2021, the top 3 artists by streams are Taylor Swift, BTS, and Justin Bieber. There are many possible reasons why the artists with the most followers do not get the most streams in this period. One of them may be that they did not release a new song or album. Is it true that there is no correlation between an artist’s followers and their stream numbers? And what about highest chart position and number of times charted? Let’s find out more.
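
The clean-then-rank steps above can be tried on a tiny example. This is only a sketch on made-up data (the `toy` data frame, artist names, and numbers are assumptions), showing how comma-formatted text becomes numeric before it is summed per artist:

```r
# Toy data (made-up artists and numbers, for illustration only)
toy <- data.frame(
  Artist  = c("A", "B", "A", "C"),
  Streams = c("1,000", "2,500", "3,000", "500"),
  stringsAsFactors = FALSE
)

# Strip thousands separators, then convert the text to numeric
toy$Streams <- as.numeric(gsub(",", "", toy$Streams, fixed = TRUE))

# Total streams per artist, sorted from most to fewest
totals <- aggregate(Streams ~ Artist, data = toy, FUN = sum)
totals <- totals[order(totals$Streams, decreasing = TRUE), ]
print(totals)
```

Without the `gsub()` step, `as.numeric("1,000")` would return `NA` with a warning, so the order of the two operations matters.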

The Correlation

Artist Followers and Streams

# Check missing value
anyNA(spotify_ds)
#> [1] TRUE
# Drop row with missing value 
spotify_ds <- na.omit(spotify_ds)

# Calculate correlation
cor(spotify_ds$Artist.Followers, spotify_ds$Streams)
#> [1] 0.1043629

The correlation is 0.104. What does that mean? Take a look at the table below.
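
For readers wondering what number `cor()` actually returns: by default it is Pearson’s correlation coefficient, the covariance of the two variables divided by the product of their standard deviations. The sketch below checks this on made-up vectors (the values are assumptions for illustration, not taken from the dataset):

```r
# Made-up numbers for illustration only
followers <- c(1, 2, 3, 4, 5)
streams   <- c(2, 1, 4, 3, 5)

# Pearson's r: covariance divided by the product of the standard deviations
r_manual <- cov(followers, streams) / (sd(followers) * sd(streams))

# cor() uses method = "pearson" by default
r_builtin <- cor(followers, streams)

print(c(manual = r_manual, builtin = r_builtin))
```

The result always lies between -1 and 1, and values near 0 indicate little or no linear relationship.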

knitr::include_graphics('Strength of Association.png')


> The conclusion is that the correlation between artist followers and streams is negligible; the two have almost no linear relationship. This suggests that in 2020-2021, people on Spotify listened to the songs they liked, whether or not they followed the artist.

# Plot Artist Followers and Streams
plot(spotify_ds$Artist.Followers, spotify_ds$Streams,
     main='Followers vs Streams',
     xlab = 'Followers',
     ylab = 'Streams',
     cex.main=2, cex.lab=1.5, cex.sub=1.2)

abline(lm(spotify_ds$Streams ~ spotify_ds$Artist.Followers), 
      col = 'red')


The figure below shows what scatter plots look like for different strengths of correlation between two sets of data.

knitr::include_graphics("Correlation.png")

Based on the scatter plot, artist followers and streams have a positive correlation, but it is very weak. The association is negligible.
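
The summary earlier shows that follower counts are right-skewed (the mean is well above the median), and Pearson’s coefficient can be sensitive to a few very large values. One possible cross-check, which is not part of the original analysis, is Spearman’s rank correlation via the `method` argument of `cor()`. A minimal sketch on made-up numbers:

```r
# Made-up, right-skewed numbers for illustration only
followers <- c(1e4, 5e4, 2e5, 1e6, 8e7)
streams   <- c(5e6, 6e6, 5.5e6, 7e6, 9e6)

# Pearson works on the raw values; Spearman works on their ranks
pearson  <- cor(followers, streams, method = "pearson")
spearman <- cor(followers, streams, method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```

If the two coefficients tell the same story on the real data, the conclusion about a negligible association does not depend on a handful of extreme follower counts.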

Spotify’s Account Followers and Highest Charting Position

# Calculate correlation
cor(spotify_ds$Artist.Followers, spotify_ds$Highest.Charting.Position)
#> [1] -0.2332416
# Change Data type to numeric

spotify_ds$Highest.Charting.Position <- as.numeric(spotify_ds$Highest.Charting.Position)

# Make Scatter Plot

plot(spotify_ds$Artist.Followers, spotify_ds$Highest.Charting.Position,
     main='Followers vs Highest Charting Position',
     xlab = 'Followers',
     ylab = 'Highest Charting Position',
     cex.main=2, cex.lab=1.5, cex.sub=1.2)

abline(lm(spotify_ds$Highest.Charting.Position ~ spotify_ds$Artist.Followers), 
      col = 'red')

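In 2020-2021, Spotify followers and highest charting position have a really weak negative correlation. Because position 1 is the top of the chart, a negative sign means artists with more followers tend to peak slightly higher, but the association is negligible.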
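
The sign of this correlation is easy to misread, because a smaller chart position number is a better result. The sketch below uses made-up values (all numbers are assumptions for illustration) to show that flipping the position into a "higher is better" score only flips the sign of the coefficient:

```r
# Made-up numbers for illustration only
followers     <- c(10, 20, 30, 40, 50)
peak_position <- c(150, 120, 90, 100, 20)  # 1 would be the top of the chart

# Correlation with the raw position: negative when more followers go with better peaks
r_position <- cor(followers, peak_position)

# Negating the position gives a "higher is better" score with the opposite sign
r_score <- cor(followers, -peak_position)

print(c(position = r_position, score = r_score))
```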

Spotify’s Account Followers and Number of Times Charted

# Calculate correlation
cor(spotify_ds$Artist.Followers, spotify_ds$Number.of.Times.Charted)
#> [1] 0.02529148
# Make Scatter Plot

plot(spotify_ds$Artist.Followers, spotify_ds$Number.of.Times.Charted,
     main='Followers vs Number of Times Charted',
     xlab = 'Followers',
     ylab = 'Number of Times Charted',
     cex.main=2, cex.lab=1.5, cex.sub=1.2)

abline(lm(spotify_ds$Number.of.Times.Charted ~ spotify_ds$Artist.Followers), 
      col = 'red')

In 2020-2021, Spotify followers and number of times charted have a really weak positive correlation. The association is negligible.

Song Streams and Highest Charting Position

# Calculate Correlation
cor(spotify_ds$Streams, spotify_ds$Highest.Charting.Position)
#> [1] -0.295666
# Make Scatter Plot
plot(spotify_ds$Streams, spotify_ds$Highest.Charting.Position,
     main='Streams vs Highest Charting Position',
     xlab = 'Streams',
     ylab = 'Highest Charting Position',
     cex.main=2, cex.lab=1.5, cex.sub=1.2)

abline(lm(spotify_ds$Highest.Charting.Position ~ spotify_ds$Streams), 
      col = 'red')

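In 2020-2021, stream numbers and highest charting position have a weak negative correlation. Because position 1 is the top of the chart, songs with more streams tend to peak somewhat higher, but the relationship is weak: a lot of streams does not always put a song high on the chart.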
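
The pairwise correlations above were computed one at a time. As a compact alternative, `cor()` also accepts a data frame of numeric columns and returns the full correlation matrix. The sketch below uses a made-up data frame with the same column names (the values are assumptions for illustration only):

```r
# Toy data with the same column names as spotify_ds (values are made up)
toy <- data.frame(
  Artist.Followers          = c(10, 20, 30, 40, 50),
  Streams                   = c(5, 7, 6, 9, 8),
  Highest.Charting.Position = c(150, 120, 90, 100, 20),
  Number.of.Times.Charted   = c(1, 4, 2, 8, 12)
)

# Correlation matrix of every pair of columns
cor_matrix <- cor(toy)
print(round(cor_matrix, 2))
```

On the real data, the same call on the four numeric columns of `spotify_ds` would reproduce the coefficients reported above in a single table.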

Conclusion

Spotify stream numbers and chart positions are not much affected by an artist’s follower count. This opens an opportunity for new or not-so-popular artists to keep producing new music and making great albums. Their fans and the general public will listen to songs that suit their taste, even if they do not follow the artist’s account. As for Spotify itself, it needs to reach out to more new artists and give them a good platform for finding new listeners around the world. In this era of globalization, music can reach everyone, everywhere. After all, that’s the point of a music streaming platform.