Spotify Data 2020-2021

In this notebook, I explore the Spotify Top 200 charts data for 2020-2021. The dataset includes all the songs that appeared on Spotify's Top 200 Weekly (Global) charts in 2020 and 2021.

Data Preparation and Cleaning

# Libraries
library(lubridate)  # date parsing (ymd)
library(readr)      # read_csv
library(ggplot2)    # plotting
library(dplyr)      # data manipulation and the %>% pipe
library(viridis)    # colour palettes

Load Dataset

spotify <- read_csv("data/spotify_dataset.csv")
head(spotify)
## # A tibble: 6 x 23
##   Index `Highest Chartin~` `Number of Tim~` `Week of Highe~` `Song Name` Streams
##   <dbl>              <dbl>            <dbl> <chr>            <chr>         <dbl>
## 1     1                  1                8 2021-07-23--202~ Beggin'      4.86e7
## 2     2                  2                3 2021-07-23--202~ STAY (with~  4.72e7
## 3     3                  1               11 2021-06-25--202~ good 4 u     4.02e7
## 4     4                  3                5 2021-07-02--202~ Bad Habits   3.78e7
## 5     5                  5                1 2021-07-23--202~ INDUSTRY B~  3.39e7
## 6     6                  1               18 2021-05-07--202~ MONTERO (C~  3.01e7
## # ... with 17 more variables: Artist <chr>, `Artist Followers` <dbl>,
## #   `Song ID` <chr>, Genre <chr>, `Release Date` <chr>, `Weeks Charted` <chr>,
## #   Popularity <dbl>, Danceability <dbl>, Energy <dbl>, Loudness <dbl>,
## #   Speechiness <dbl>, Acousticness <dbl>, Liveness <dbl>, Tempo <dbl>,
## #   `Duration (ms)` <dbl>, Valence <dbl>, Chord <chr>

Dataset Information

Here is a description of the variables in the dataset:

  • Highest Charting Position : The highest position the song reached on the Spotify Top 200 Weekly Global Charts in 2020 & 2021.
  • Number of Times Charted : The number of times the song appeared on the Spotify Top 200 Weekly Global Charts in 2020 & 2021.
  • Week of Highest Charting : The week in which the song reached its highest position on the Spotify Top 200 Weekly Global Charts in 2020 & 2021.
  • Song Name : Name of the song that appeared on the Spotify Top 200 Weekly Global Charts in 2020 & 2021.
  • Song ID : The song ID provided by Spotify (unique to each song).
  • Streams : Approximate number of streams the song has.
  • Artist : The main artist or artists involved in making the song.
  • Artist Followers : The number of followers the main artist has on Spotify.
  • Genre : The genres the song belongs to.
  • Release Date : The initial date that the song was released.
  • Weeks Charted : The weeks during which the song was on the Spotify Top 200 Weekly Global Charts in 2020 & 2021.
  • Popularity : The popularity of the track, from 0 to 100, with 100 being the most popular.
  • Danceability : Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
  • Acousticness : A measure from 0.0 to 1.0 of whether the track is acoustic.
  • Energy : Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
  • Instrumentalness : Predicts whether a track contains no vocals. The closer the instrumentalness value is to 1.0, the greater the likelihood that the track contains no vocal content. (This variable does not appear among the 23 columns loaded above.)
  • Liveness : Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.
  • Loudness : The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track. Values typically range between -60 and 0 dB.
  • Speechiness : Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value.
  • Tempo : The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
  • Valence : A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
  • Chord : The main chord of the song's instrumental.

Check Missing Values

ncol(spotify)
## [1] 23
nrow(spotify)
## [1] 1556
n_distinct(spotify)
## [1] 1556
colSums(is.na(spotify))
##                     Index Highest Charting Position   Number of Times Charted 
##                         0                         0                         0 
##  Week of Highest Charting                 Song Name                   Streams 
##                         0                         0                         0 
##                    Artist          Artist Followers                   Song ID 
##                         0                        11                        11 
##                     Genre              Release Date             Weeks Charted 
##                        11                        11                         0 
##                Popularity              Danceability                    Energy 
##                        11                        11                        11 
##                  Loudness               Speechiness              Acousticness 
##                        11                        11                        11 
##                  Liveness                     Tempo             Duration (ms) 
##                        11                        11                        11 
##                   Valence                     Chord 
##                        11                        11
spotify %>% is.na() %>% sum()
## [1] 165
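# Note: is.null() tests the object as a whole, not its cells, so this is always 0 for a data frame.
spotify %>% is.null() %>% sum()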
## [1] 0
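
The identical count of 11 in every affected column suggests, but does not prove, that the same 11 rows are incomplete. A minimal sketch of how one might check this and set those rows aside is shown below. It uses a small made-up tibble called `toy` as a stand-in for `spotify`; the song names and numbers are invented for illustration only.

```r
library(dplyr)

# Toy stand-in for `spotify`; all values are invented for illustration.
toy <- tibble(
  `Song Name` = c("Song A", "Song B", "Song C"),
  Streams     = c(300, 200, 100),
  Genre       = c("['pop']", NA, "['rock']"),
  Tempo       = c(120, NA, 95)
)

# Rows with at least one missing value.
incomplete <- filter(toy, !complete.cases(toy))

# Rows with no missing values, for summaries that need every column.
toy_complete <- filter(toy, complete.cases(toy))

nrow(incomplete)
nrow(toy_complete)
```

If the number of incomplete rows equals the per-column NA count, the missing values all sit in the same rows.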

Check Data Type

str(spotify)
## spec_tbl_df [1,556 x 23] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Index                    : num [1:1556] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Highest Charting Position: num [1:1556] 1 2 1 3 5 1 3 2 3 8 ...
##  $ Number of Times Charted  : num [1:1556] 8 3 11 5 1 18 16 10 8 10 ...
##  $ Week of Highest Charting : chr [1:1556] "2021-07-23--2021-07-30" "2021-07-23--2021-07-30" "2021-06-25--2021-07-02" "2021-07-02--2021-07-09" ...
##  $ Song Name                : chr [1:1556] "Beggin'" "STAY (with Justin Bieber)" "good 4 u" "Bad Habits" ...
##  $ Streams                  : num [1:1556] 48633449 47248719 40162559 37799456 33948454 ...
##  $ Artist                   : chr [1:1556] "Måneskin" "The Kid LAROI" "Olivia Rodrigo" "Ed Sheeran" ...
##  $ Artist Followers         : num [1:1556] 3377762 2230022 6266514 83293380 5473565 ...
##  $ Song ID                  : chr [1:1556] "3Wrjm47oTz2sjIgck11l5e" "5HCyWlXZPP0y6Gqq8TgA20" "4ZtFanR9U6ndgddUvNcjcG" "6PQ88X9TkUIAUIZJHW2upE" ...
##  $ Genre                    : chr [1:1556] "['indie rock italiano', 'italian pop']" "['australian hip hop']" "['pop']" "['pop', 'uk pop']" ...
##  $ Release Date             : chr [1:1556] "2017-12-08" "2021-07-09" "2021-05-21" "2021-06-25" ...
##  $ Weeks Charted            : chr [1:1556] "2021-07-23--2021-07-30\n2021-07-16--2021-07-23\n2021-07-09--2021-07-16\n2021-07-02--2021-07-09\n2021-06-25--202"| __truncated__ "2021-07-23--2021-07-30\n2021-07-16--2021-07-23\n2021-07-09--2021-07-16" "2021-07-23--2021-07-30\n2021-07-16--2021-07-23\n2021-07-09--2021-07-16\n2021-07-02--2021-07-09\n2021-06-25--202"| __truncated__ "2021-07-23--2021-07-30\n2021-07-16--2021-07-23\n2021-07-09--2021-07-16\n2021-07-02--2021-07-09\n2021-06-25--2021-07-02" ...
##  $ Popularity               : num [1:1556] 100 99 99 98 96 97 94 95 96 95 ...
##  $ Danceability             : num [1:1556] 0.714 0.591 0.563 0.808 0.736 0.61 0.762 0.78 0.644 0.75 ...
##  $ Energy                   : num [1:1556] 0.8 0.764 0.664 0.897 0.704 0.508 0.701 0.718 0.648 0.608 ...
##  $ Loudness                 : num [1:1556] -4.81 -5.48 -5.04 -3.71 -7.41 ...
##  $ Speechiness              : num [1:1556] 0.0504 0.0483 0.154 0.0348 0.0615 0.152 0.0286 0.0506 0.118 0.0387 ...
##  $ Acousticness             : num [1:1556] 0.127 0.0383 0.335 0.0469 0.0203 0.297 0.235 0.31 0.276 0.00165 ...
##  $ Liveness                 : num [1:1556] 0.359 0.103 0.0849 0.364 0.0501 0.384 0.123 0.0932 0.135 0.178 ...
##  $ Tempo                    : num [1:1556] 134 170 167 126 150 ...
##  $ Duration (ms)            : num [1:1556] 211560 141806 178147 231041 212000 ...
##  $ Valence                  : num [1:1556] 0.589 0.478 0.688 0.591 0.894 0.758 0.742 0.342 0.44 0.958 ...
##  $ Chord                    : chr [1:1556] "B" "C#/Db" "A" "B" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Index = col_double(),
##   ..   `Highest Charting Position` = col_double(),
##   ..   `Number of Times Charted` = col_double(),
##   ..   `Week of Highest Charting` = col_character(),
##   ..   `Song Name` = col_character(),
##   ..   Streams = col_number(),
##   ..   Artist = col_character(),
##   ..   `Artist Followers` = col_double(),
##   ..   `Song ID` = col_character(),
##   ..   Genre = col_character(),
##   ..   `Release Date` = col_character(),
##   ..   `Weeks Charted` = col_character(),
##   ..   Popularity = col_double(),
##   ..   Danceability = col_double(),
##   ..   Energy = col_double(),
##   ..   Loudness = col_double(),
##   ..   Speechiness = col_double(),
##   ..   Acousticness = col_double(),
##   ..   Liveness = col_double(),
##   ..   Tempo = col_double(),
##   ..   `Duration (ms)` = col_double(),
##   ..   Valence = col_double(),
##   ..   Chord = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
spotify$`Release Date` <- ymd(spotify$`Release Date`)
## Warning: 17 failed to parse.
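
The warning means some entries could not be read as year-month-day dates and were turned into NA. One way to see which raw values caused it is to parse a copy of the column and compare it with the original strings. The sketch below assumes the raw character values are still available (the assignment above overwrites them, so in practice they would need to be kept or re-read), and the vector `raw_dates` is a made-up example.

```r
library(lubridate)

# Toy release dates; the malformed entry is invented for illustration.
raw_dates <- c("2017-12-08", "2021-07-09", "not a date", NA)

# quiet = TRUE suppresses the "failed to parse" warning.
parsed <- ymd(raw_dates, quiet = TRUE)

# Values that were present in the input but could not be parsed.
failed <- raw_dates[is.na(parsed) & !is.na(raw_dates)]
failed
```

Inspecting `failed` shows whether the problem is a different date format, stray whitespace, or genuinely missing data.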

From the given data, there are three things that I want to explore:

  • Which artist has the most streams?
  • Which song has the most streams?
  • Which genre has the most streams?

Exploratory Analysis

Artist with the Most Streams

head(spotify)
## # A tibble: 6 x 23
##   Index `Highest Chartin~` `Number of Tim~` `Week of Highe~` `Song Name` Streams
##   <dbl>              <dbl>            <dbl> <chr>            <chr>         <dbl>
## 1     1                  1                8 2021-07-23--202~ Beggin'      4.86e7
## 2     2                  2                3 2021-07-23--202~ STAY (with~  4.72e7
## 3     3                  1               11 2021-06-25--202~ good 4 u     4.02e7
## 4     4                  3                5 2021-07-02--202~ Bad Habits   3.78e7
## 5     5                  5                1 2021-07-23--202~ INDUSTRY B~  3.39e7
## 6     6                  1               18 2021-05-07--202~ MONTERO (C~  3.01e7
## # ... with 17 more variables: Artist <chr>, `Artist Followers` <dbl>,
## #   `Song ID` <chr>, Genre <chr>, `Release Date` <date>, `Weeks Charted` <chr>,
## #   Popularity <dbl>, Danceability <dbl>, Energy <dbl>, Loudness <dbl>,
## #   Speechiness <dbl>, Acousticness <dbl>, Liveness <dbl>, Tempo <dbl>,
## #   `Duration (ms)` <dbl>, Valence <dbl>, Chord <chr>
stream <- spotify%>%
  arrange(desc(Streams))%>%
  head(10)

stream
## # A tibble: 10 x 23
##    Index `Highest Charti~` `Number of Tim~` `Week of Highe~` `Song Name` Streams
##    <dbl>             <dbl>            <dbl> <chr>            <chr>         <dbl>
##  1     1                 1                8 2021-07-23--202~ Beggin'      4.86e7
##  2     2                 2                3 2021-07-23--202~ STAY (with~  4.72e7
##  3     3                 1               11 2021-06-25--202~ good 4 u     4.02e7
##  4     4                 3                5 2021-07-02--202~ Bad Habits   3.78e7
##  5     5                 5                1 2021-07-23--202~ INDUSTRY B~  3.39e7
##  6     6                 1               18 2021-05-07--202~ MONTERO (C~  3.01e7
##  7     7                 3               16 2021-05-14--202~ Kiss Me Mo~  2.94e7
##  8  1431                 7                1 2020-02-07--202~ Intentions   2.85e7
##  9     8                 2               10 2021-06-18--202~ Todo De Ti   2.70e7
## 10     9                 3                8 2021-06-18--202~ Yonaguni     2.50e7
## # ... with 17 more variables: Artist <chr>, `Artist Followers` <dbl>,
## #   `Song ID` <chr>, Genre <chr>, `Release Date` <date>, `Weeks Charted` <chr>,
## #   Popularity <dbl>, Danceability <dbl>, Energy <dbl>, Loudness <dbl>,
## #   Speechiness <dbl>, Acousticness <dbl>, Liveness <dbl>, Tempo <dbl>,
## #   `Duration (ms)` <dbl>, Valence <dbl>, Chord <chr>
stream%>%
  ggplot(aes(x = Streams,
             y = reorder(Artist, Streams))) +
  geom_col() +
  labs(title = "Most-Streamed Artists (Top 10 Songs)",
       x = "Streams",
       y = "Artist")

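Among the ten most-streamed songs of 2020-2021, Lil Nas X has the highest combined streams globally on Spotify, and Måneskin comes second. Note that `stream` holds only the top 10 songs, so this ranking reflects those songs rather than each artist's full set of charting songs in the dataset.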
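
To rank artists over every charting song rather than only the top 10 rows, one option is to sum `Streams` per artist first. The following is a minimal sketch on made-up data, not the analysis that produced the chart above; `toy`, `total_streams` and `n_songs` are hypothetical names.

```r
library(dplyr)

# Toy data; artists and stream counts are invented for illustration.
toy <- tibble(
  Artist  = c("Artist A", "Artist B", "Artist A", "Artist C"),
  Streams = c(50, 80, 40, 30)
)

# Total streams and number of songs per artist, largest total first.
artist_totals <- toy %>%
  group_by(Artist) %>%
  summarise(total_streams = sum(Streams), n_songs = n()) %>%
  arrange(desc(total_streams))

artist_totals
```

Here Artist A ranks first on the combined total even though Artist B has the single biggest song. Since the Artist column can list several artists in one value, collaborations would be counted as their own label under this approach.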

The Most-Streamed Song

stream%>%
  ggplot(aes(x = Streams,
             y = reorder(`Song Name`, Streams))) +
  geom_col() +
  labs(title = "Most-Streamed Songs",
       x = "Streams",
       y = "Song")

“Beggin'” by Måneskin is the most-streamed song, with 48,633,449 streams, about 0.5% of all streams counted in this dataset.
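
A song's share of streams can be computed by dividing its `Streams` value by the column total. The sketch below shows the calculation on made-up numbers; `toy` and `share_pct` are hypothetical names, and the result is not the figure for the real dataset.

```r
library(dplyr)

# Toy data; song names and stream counts are invented for illustration.
toy <- tibble(
  `Song Name` = c("Song A", "Song B", "Song C"),
  Streams     = c(50, 30, 20)
)

# Each song's percentage share of all streams in the table.
shares <- toy %>%
  mutate(share_pct = 100 * Streams / sum(Streams)) %>%
  arrange(desc(share_pct))

shares
```

The denominator here is the total of the `Streams` column, so the share is relative to the songs in the table, not to all listening on the platform.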

The Most-Streamed Genre

stream %>%
  ggplot(aes(x = Streams,
             y = reorder(Genre, Streams))) +
  geom_col() +
  labs(title = "Most-Streamed Genres (Top 10 Songs)",
       x = "Streams",
       y = "Genre")

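Among the top ten songs, the most-streamed genre label is “LGBTQ+ Hip Hop, Pop Rap”, with more than 60 million streams. Because the Genre column stores each song's genres as a single list-like string, these two genres are counted together as one label.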
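
To count each genre separately, the list-like strings in `Genre` would first have to be split into individual genres. A minimal sketch with base R string functions follows. The genre strings follow the format seen in the `str()` output, while the stream counts and the names `toy`, `genre_long` and `genre_totals` are made up for illustration.

```r
library(dplyr)

# Genre strings in the dataset's format; stream counts are invented.
toy <- tibble(
  Genre   = c("['pop', 'uk pop']", "['pop']", "['australian hip hop']"),
  Streams = c(40, 30, 20)
)

# Strip brackets and quotes, then split each string on commas.
genre_lists <- strsplit(gsub("\\[|\\]|'", "", toy$Genre), ",\\s*")

# One row per (song, genre) pair; a song's streams are credited to each of its genres.
genre_long <- tibble(
  Genre   = unlist(genre_lists),
  Streams = rep(toy$Streams, lengths(genre_lists))
)

# Total streams per individual genre, largest first.
genre_totals <- genre_long %>%
  group_by(Genre) %>%
  summarise(total_streams = sum(Streams)) %>%
  arrange(desc(total_streams))

genre_totals
```

Two caveats: songs with several genres are counted once per genre, so the totals no longer add up to the overall stream count, and removing every apostrophe would also alter genre names that contain one.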

Conclusion

There are three things that we can conclude from the exploratory data analysis above:

  1. Among the ten most-streamed songs of 2020-2021, Lil Nas X has the highest combined streams globally on Spotify, and Måneskin comes second.
  2. “Beggin'” by Måneskin is the most-streamed song, with 48,633,449 streams, about 0.5% of all streams counted in this dataset.
  3. Among the top ten songs, the most-streamed genre label is “LGBTQ+ Hip Hop, Pop Rap”, with more than 60 million streams.