Spotify said listening is everything. It also said Spotify is all the music you will ever need. Spotify has a huge catalog of songs and podcasts. I wonder what people listened to in 2020-2021, during the pandemic. And what about the correlation between artists, their Spotify follower counts, and their listeners? Let's find out.
knitr::include_graphics('Spotify_img')
People around the world listen to music all the time, and many of them use Spotify as their daily listening app. Spotify is the no. 1 music streaming platform by number of subscribers. Based on company data, Spotify had 299 million unique users in Q2 2020, and the number keeps growing. Spotify is still top of mind for many people. Other music streaming services include Apple Music, Amazon Music, YouTube, and many more.
knitr::include_graphics("Spotify's Annual Users.PNG")
This Spotify dataset is from Kaggle and covers 2020-2021. We are going to analyze the Top 200 charts on Spotify further.
# Read Data
spotify <- read.csv('spotify_dataset.csv')
# Read 3 rows of data to check what's inside
head(spotify, 3)
# Check the data structure
str(spotify)
#> 'data.frame': 1556 obs. of 23 variables:
#> $ Index : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ Highest.Charting.Position: int 1 2 1 3 5 1 3 2 3 8 ...
#> $ Number.of.Times.Charted : int 8 3 11 5 1 18 16 10 8 10 ...
#> $ Week.of.Highest.Charting : chr "2021-07-23--2021-07-30" "2021-07-23--2021-07-30" "2021-06-25--2021-07-02" "2021-07-02--2021-07-09" ...
#> $ Song.Name : chr "Beggin'" "STAY (with Justin Bieber)" "good 4 u" "Bad Habits" ...
#> $ Streams : chr "48,633,449" "47,248,719" "40,162,559" "37,799,456" ...
#> $ Artist : chr "Måneskin" "The Kid LAROI" "Olivia Rodrigo" "Ed Sheeran" ...
#> $ Artist.Followers : int 3377762 2230022 6266514 83293380 5473565 5473565 8640063 6080597 36142273 3377762 ...
#> $ Song.ID : chr "3Wrjm47oTz2sjIgck11l5e" "5HCyWlXZPP0y6Gqq8TgA20" "4ZtFanR9U6ndgddUvNcjcG" "6PQ88X9TkUIAUIZJHW2upE" ...
#> $ Genre : chr "['indie rock italiano', 'italian pop']" "['australian hip hop']" "['pop']" "['pop', 'uk pop']" ...
#> $ Release.Date : chr "2017-12-08" "2021-07-09" "2021-05-21" "2021-06-25" ...
#> $ Weeks.Charted : chr "2021-07-23--2021-07-30\n2021-07-16--2021-07-23\n2021-07-09--2021-07-16\n2021-07-02--2021-07-09\n2021-06-25--202"| __truncated__ "2021-07-23--2021-07-30\n2021-07-16--2021-07-23\n2021-07-09--2021-07-16" "2021-07-23--2021-07-30\n2021-07-16--2021-07-23\n2021-07-09--2021-07-16\n2021-07-02--2021-07-09\n2021-06-25--202"| __truncated__ "2021-07-23--2021-07-30\n2021-07-16--2021-07-23\n2021-07-09--2021-07-16\n2021-07-02--2021-07-09\n2021-06-25--2021-07-02" ...
#> $ Popularity : int 100 99 99 98 96 97 94 95 96 95 ...
#> $ Danceability : num 0.714 0.591 0.563 0.808 0.736 0.61 0.762 0.78 0.644 0.75 ...
#> $ Energy : num 0.8 0.764 0.664 0.897 0.704 0.508 0.701 0.718 0.648 0.608 ...
#> $ Loudness : num -4.81 -5.48 -5.04 -3.71 -7.41 ...
#> $ Speechiness : num 0.0504 0.0483 0.154 0.0348 0.0615 0.152 0.0286 0.0506 0.118 0.0387 ...
#> $ Acousticness : num 0.127 0.0383 0.335 0.0469 0.0203 0.297 0.235 0.31 0.276 0.00165 ...
#> $ Liveness : num 0.359 0.103 0.0849 0.364 0.0501 0.384 0.123 0.0932 0.135 0.178 ...
#> $ Tempo : num 134 170 167 126 150 ...
#> $ Duration..ms. : int 211560 141806 178147 231041 212000 137876 208867 199604 206710 173347 ...
#> $ Valence : num 0.589 0.478 0.688 0.591 0.894 0.758 0.742 0.342 0.44 0.958 ...
#> $ Chord : chr "B" "C#/Db" "A" "B" ...
The dataset still has more columns than we need, so I simplify it by
keeping only the following columns:
1. Artist
2. Song Name
3. Streams
4. Artist Followers
5. Highest Charting Position
6. Number of Times Charted
7. Release Date
# Keep only the columns we need
spotify_ds <- spotify[,c('Artist','Song.Name','Streams', 'Artist.Followers', 'Highest.Charting.Position','Number.of.Times.Charted','Release.Date')]
library(lubridate)
spotify_ds$Release.Date <- ymd(spotify_ds$Release.Date)
head(spotify_ds, 3)
# Get the number of rows in the dataset
nrow(spotify_ds)
#> [1] 1556
# Get a summary of the dataset
summary(spotify_ds)
#> Artist Song.Name Streams Artist.Followers
#> Length:1556 Length:1556 Length:1556 Min. : 4883
#> Class :character Class :character Class :character 1st Qu.: 2123734
#> Mode :character Mode :character Mode :character Median : 6852509
#> Mean :14716903
#> 3rd Qu.:22698747
#> Max. :83337783
#> NA's :11
#> Highest.Charting.Position Number.of.Times.Charted Release.Date
#> Min. : 1.00 Min. : 1.00 Min. :1942-01-01
#> 1st Qu.: 37.00 1st Qu.: 1.00 1st Qu.:2020-01-17
#> Median : 80.00 Median : 4.00 Median :2020-06-19
#> Mean : 87.74 Mean : 10.67 Mean :2019-03-08
#> 3rd Qu.:137.00 3rd Qu.: 12.00 3rd Qu.:2021-01-14
#> Max. :200.00 Max. :142.00 Max. :2021-08-13
#> NA's :28
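The `NA's` entries in the summary above show that some columns contain missing values. As a minimal sketch, assuming a small made-up data frame (`toy`, with hypothetical values) in place of `spotify_ds`, missing values can be counted per column like this:

```r
# Toy data frame standing in for spotify_ds (values are made up for illustration)
toy <- data.frame(
  Artist = c("A", "B", "C", "D"),
  Artist.Followers = c(100, NA, 300, NA),
  Release.Date = as.Date(c("2020-01-17", NA, "2021-01-14", "2020-06-19"))
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows that have no missing value in any column
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```

Running the same two calls on `spotify_ds` should show which columns are affected and how many rows a later `na.omit()` would keep.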
We are going to analyze 1556 songs on Spotify and find out more about this dataset.
Let’s look at the correlation between a song’s chart position and the
artist’s Spotify followers. A common assumption is that the more followers
you have, the more people are going to listen. Is that true in this case?
Here is a histogram of artist followers:
# Make Histogram
Artist_Followers <- spotify_ds$Artist.Followers
hist(Artist_Followers, breaks = 200, main = 'Histogram of Spotify Artist Followers',
xlab = 'Followers',
cex.main=2, cex.lab=1.5, cex.sub=1.2)
# Change Artist.Followers to numeric
spotify_ds$Artist.Followers <- as.numeric(spotify_ds$Artist.Followers)
# Aggregate to get Artist followers ranking
agg_df <- aggregate(x=Artist.Followers ~ Artist, data= spotify_ds, FUN=max)
agg_df[order(agg_df$Artist.Followers, decreasing = T),]
The histogram shows how widely follower counts vary. From the table,
the top 3 are Ed Sheeran, Ariana Grande, and Drake. Ed Sheeran has about
83 million followers on his account, which makes him the artist with
the most followers in this dataset in 2021.
# Remove comma from Streams then change Streams datatype to numeric
spotify_ds$Streams <- as.numeric(gsub(",","",spotify_ds$Streams))
# Aggregate to get Artist' song streams ranking
agg_df <- aggregate(x=Streams ~ Artist, data= spotify_ds, FUN=sum)
agg_df[order(agg_df$Streams, decreasing = T),]
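The comma-stripping step above can be checked on a few values. This is a small sketch using sample strings in the same format as the `Streams` column; the variable names `raw_streams`, `direct`, and `parsed` are only for illustration:

```r
# Sample values in the same "48,633,449" format as the Streams column
raw_streams <- c("48,633,449", "47,248,719", "1,000")

# Converting directly gives NA because of the thousands separators
direct <- suppressWarnings(as.numeric(raw_streams))

# Remove the commas first, then convert to numeric
parsed <- as.numeric(gsub(",", "", raw_streams, fixed = TRUE))

print(direct)
print(parsed)
```

Without the `gsub()` call every value would become `NA`, and the sum per artist would not be meaningful.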
In 2020-2021, the top 3 artists by streams are Taylor Swift, BTS, and Justin
Bieber. There are many possible reasons why the artists with the most
followers did not get the most streams in this period. One is that they may
not have released a new song or album. Is it true that there is no
correlation between an artist’s followers and their stream numbers? And what
about highest chart position and number of times charted? Let’s find out more.
# Check for missing values
anyNA(spotify_ds)
#> [1] TRUE
# Drop rows with missing values
spotify_ds <- na.omit(spotify_ds)
# Calculate correlation
cor(spotify_ds$Artist.Followers, spotify_ds$Streams)
#> [1] 0.1043629
The correlation is 0.104. What does that mean? Take a look at the table below.
knitr::include_graphics('Strength of Association.png')
> The conclusion is that the correlation between artist followers and
streams is negligible; the two have almost no correlation. In 2020-2021,
Spotify users listened to the songs they liked, even though many of them
did not follow the artist.
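
By default, `cor()` returns Pearson's correlation coefficient. The sketch below, on two short made-up vectors (`x` and `y` are hypothetical, not taken from the dataset), computes the coefficient by hand and compares it with the built-in result:

```r
# Two small made-up vectors, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Pearson's r: covariation of x and y divided by the product of their spreads
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Built-in version (method = "pearson" is the default)
r_builtin <- cor(x, y)

print(r_manual)
print(r_builtin)
```

A value near 1 or -1 means the points lie close to a straight line, while a value near 0, like the 0.104 found here, means knowing one variable says little about the other.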
# Plot Artist Followers and Streams
plot(spotify_ds$Artist.Followers, spotify_ds$Streams,
main='Followers vs Streams',
xlab = 'Followers',
ylab = 'Streams',
cex.main=2, cex.lab=1.5, cex.sub=1.2)
abline(lm(spotify_ds$Streams ~ spotify_ds$Artist.Followers),
col = 'red')
The image below shows what the correlation between two sets of data
looks like on a scatter plot.
knitr::include_graphics("Correlation.png")
Based on the scatter plot, artist followers and streams have a positive correlation, but it is very weak. The association is negligible.
# Calculate correlation
cor(spotify_ds$Artist.Followers, spotify_ds$Highest.Charting.Position)
#> [1] -0.2332416
# Change Data type to numeric
spotify_ds$Highest.Charting.Position <- as.numeric(spotify_ds$Highest.Charting.Position)
# Make Scatter Plot
plot(spotify_ds$Artist.Followers, spotify_ds$Highest.Charting.Position,
main='Followers vs Highest Charting Position',
xlab = 'Followers',
ylab = 'Highest Charting Position',
cex.main=2, cex.lab=1.5, cex.sub=1.2)
abline(lm(spotify_ds$Highest.Charting.Position ~ spotify_ds$Artist.Followers),
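col = 'red')
In 2020-2021, Spotify followers and highest charting position have a very weak negative correlation (about -0.23). Because position 1 is the top of the chart, the negative sign means artists with more followers tend to peak slightly higher, but the association is negligible.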
# Calculate correlation
cor(spotify_ds$Artist.Followers, spotify_ds$Number.of.Times.Charted)
#> [1] 0.02529148
# Make Scatter Plot
plot(spotify_ds$Artist.Followers, spotify_ds$Number.of.Times.Charted,
main='Followers vs Number of Times Charted',
xlab = 'Followers',
ylab = 'Number of Times Charted',
cex.main=2, cex.lab=1.5, cex.sub=1.2)
abline(lm(spotify_ds$Number.of.Times.Charted ~ spotify_ds$Artist.Followers),
col = 'red')
In 2020-2021, Spotify followers and number of times charted have a very weak positive correlation (about 0.03). The association is negligible.
# Calculate Correlation
cor(spotify_ds$Streams, spotify_ds$Highest.Charting.Position)
#> [1] -0.295666
# Make Scatter Plot
plot(spotify_ds$Streams, spotify_ds$Highest.Charting.Position,
main='Streams vs Highest Charting Position',
xlab = 'Streams',
ylab = 'Highest Charting Position',
cex.main=2, cex.lab=1.5, cex.sub=1.2)
abline(lm(spotify_ds$Highest.Charting.Position ~ spotify_ds$Streams),
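col = 'red')
In 2020-2021, stream numbers and highest charting position have a weak negative correlation (about -0.30). Because position 1 is the top of the chart, songs with more streams tend to peak somewhat higher, but the relationship is weak: a lot of streams does not always give a song a high chart position.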
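The pairwise correlations above were computed one call at a time. As a sketch, passing a data frame of numeric columns to `cor()` returns all of them in one matrix. The `toy` data frame below uses made-up values and only borrows the column names from `spotify_ds`:

```r
# Made-up numeric data with the same column names as spotify_ds
toy <- data.frame(
  Artist.Followers = c(10, 20, 30, 40, 50),
  Streams = c(5, 7, 6, 9, 8),
  Highest.Charting.Position = c(150, 90, 120, 30, 60),
  Number.of.Times.Charted = c(1, 3, 2, 8, 4)
)

# Pearson correlation between every pair of columns
cor_matrix <- cor(toy)
print(round(cor_matrix, 2))
```

On the real data this would be something like `cor(spotify_ds[, c('Artist.Followers', 'Streams', 'Highest.Charting.Position', 'Number.of.Times.Charted')])`. If rows with missing values have not been dropped beforehand, the `use = "complete.obs"` argument of `cor()` is one way to ignore them.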
Spotify stream numbers and chart positions are not much affected by an artist’s followers. This opens an opportunity for new or not-so-popular artists to keep producing new music and making great albums. Their fans and the general public will listen to songs that suit their taste, even if they do not follow the artist’s account.
As for Spotify itself, it needs to reach out to more new artists and give them a good platform to gain new listeners around the world. In this era of globalization, music can reach everyone, everywhere. After all, that’s the point of a music streaming platform.