Introduction

For this project, I examine the popularity and audio features of tracks in the Spotify Top 200 Charts (2020-2021) dataset from Kaggle, to highlight common patterns between a song's popularity and its audio features.

Required Packages

Packages required for this project.

#Provides a visual exploratory tool on correlation matrix that supports automatic variable reordering to help detect hidden patterns among variables. 
library("corrplot")
## corrplot 0.90 loaded
#Grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Provides the basicStats function for basic statistics of numerical variables.
library( "fBasics")
## Loading required package: timeDate
## Loading required package: timeSeries
#A system for declaratively creating graphics, based on The Grammar of Graphics.
library("ggplot2")

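#A set of tools for splitting, applying, and combining data.
#Note: plyr is loaded after dplyr here, so it masks several dplyr verbs (see the messages below).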
library("plyr")
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
#A collection of packages used to clean, visualize, model, and communicate data.
library("tidyverse")
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble  3.1.5     v purrr   0.3.4
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.0.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x plyr::arrange()      masks dplyr::arrange()
## x purrr::compact()     masks plyr::compact()
## x plyr::count()        masks dplyr::count()
## x plyr::failwith()     masks dplyr::failwith()
## x timeSeries::filter() masks dplyr::filter(), stats::filter()
## x plyr::id()           masks dplyr::id()
## x timeSeries::lag()    masks dplyr::lag(), stats::lag()
## x plyr::mutate()       masks dplyr::mutate()
## x plyr::rename()       masks dplyr::rename()
## x plyr::summarise()    masks dplyr::summarise()
## x plyr::summarize()    masks dplyr::summarize()

Data Preparation

This dataset comes from Kaggle: https://www.kaggle.com/sashankpillai/spotify-top-200-charts-20202021. It includes all the songs that appeared on Spotify's Top 200 Weekly (Global) charts in 2020 and 2021.

#Provide a fast and friendly way to read rectangular data
library(readr)
setwd("C:/Users/Jerem/OneDrive/Documents")
#Reading the dataset into spotify
spotify <- read_csv("Montgomery College/Fall 2021/DATA 110/Datasets/spotify_dataset.csv")
## Rows: 1556 Columns: 23
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (8): Week of Highest Charting, Song Name, Artist, Song ID, Genre, Relea...
## dbl (14): Index, Highest Charting Position, Number of Times Charted, Artist ...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(spotify)
##  [1] "Index"                     "Highest Charting Position"
##  [3] "Number of Times Charted"   "Week of Highest Charting" 
##  [5] "Song Name"                 "Streams"                  
##  [7] "Artist"                    "Artist Followers"         
##  [9] "Song ID"                   "Genre"                    
## [11] "Release Date"              "Weeks Charted"            
## [13] "Popularity"                "Danceability"             
## [15] "Energy"                    "Loudness"                 
## [17] "Speechiness"               "Acousticness"             
## [19] "Liveness"                  "Tempo"                    
## [21] "Duration (ms)"             "Valence"                  
## [23] "Chord"

Data Cleaning

Removing Missing Values

colSums(is.na(spotify))
## Warning: One or more parsing issues, see `problems()` for details
##                     Index Highest Charting Position   Number of Times Charted 
##                         0                         0                         0 
##  Week of Highest Charting                 Song Name                   Streams 
##                         0                         0                         0 
##                    Artist          Artist Followers                   Song ID 
##                         0                        11                        11 
##                     Genre              Release Date             Weeks Charted 
##                        11                        11                         0 
##                Popularity              Danceability                    Energy 
##                        11                        11                        11 
##                  Loudness               Speechiness              Acousticness 
##                        11                        11                        11 
##                  Liveness                     Tempo             Duration (ms) 
##                        11                        11                        11 
##                   Valence                     Chord 
##                        11                        11
spotify <- na.omit(spotify)
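
Before dropping rows, it can help to confirm that the missing values sit in the same rows, so that na.omit() removes only those rows and nothing more. The following is a minimal sketch on a toy data frame; the `toy` object and its values are made up for illustration and are not taken from the Spotify data.

```r
# Toy data standing in for the real table (values are made up)
toy <- data.frame(
  song       = c("a", "b", "c", "d"),
  popularity = c(90, NA, 75, NA),
  energy     = c(0.8, NA, 0.5, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Positions of the rows that contain at least one missing value
rows_with_na <- which(!complete.cases(toy))

# Drop the incomplete rows, which is what na.omit() does
toy_clean <- na.omit(toy)

print(na_per_column)
print(rows_with_na)
print(nrow(toy_clean))
```

If the number of rows flagged by complete.cases() matches the per-column NA counts, the missing values are confined to the same few rows.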

Removing Variables

I removed Index, Week of Highest Charting, Song ID, Release Date, and Weeks Charted from the dataset since they are not needed for my analysis.

spotify$`Index` = NULL
spotify$`Week of Highest Charting` = NULL
spotify$`Song ID` = NULL
spotify$`Release Date` = NULL
spotify$`Weeks Charted` = NULL

I converted Duration (ms) from milliseconds to seconds and renamed the column to Duration so it would be easier to read.

spotify$`Duration (ms)` <- round(spotify$`Duration (ms)` / 1000)
#Rename by column name; after the removals above, position 15 is Tempo, not Duration (ms)
names(spotify)[names(spotify) == "Duration (ms)"] <- "Duration"

Data Preparation Summary

spotify %>% head() %>% knitr::kable()
| Highest Charting Position | Number of Times Charted | Song Name | Streams | Artist | Artist Followers | Genre | Popularity | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration | Valence | Chord |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 8 | Beggin’ | 48633449 | Måneskin | 3377762 | [‘indie rock italiano’, ‘italian pop’] | 100 | 0.714 | 0.800 | -4.808 | 0.0504 | 0.1270 | 0.3590 | 134.002 | 212 | 0.589 | B |
| 2 | 3 | STAY (with Justin Bieber) | 47248719 | The Kid LAROI | 2230022 | [‘australian hip hop’] | 99 | 0.591 | 0.764 | -5.484 | 0.0483 | 0.0383 | 0.1030 | 169.928 | 142 | 0.478 | C#/Db |
| 1 | 11 | good 4 u | 40162559 | Olivia Rodrigo | 6266514 | [‘pop’] | 99 | 0.563 | 0.664 | -5.044 | 0.1540 | 0.3350 | 0.0849 | 166.928 | 178 | 0.688 | A |
| 3 | 5 | Bad Habits | 37799456 | Ed Sheeran | 83293380 | [‘pop’, ‘uk pop’] | 98 | 0.808 | 0.897 | -3.712 | 0.0348 | 0.0469 | 0.3640 | 126.026 | 231 | 0.591 | B |
| 5 | 1 | INDUSTRY BABY (feat. Jack Harlow) | 33948454 | Lil Nas X | 5473565 | [‘lgbtq+ hip hop’, ‘pop rap’] | 96 | 0.736 | 0.704 | -7.409 | 0.0615 | 0.0203 | 0.0501 | 149.995 | 212 | 0.894 | D#/Eb |
| 1 | 18 | MONTERO (Call Me By Your Name) | 30071134 | Lil Nas X | 5473565 | [‘lgbtq+ hip hop’, ‘pop rap’] | 97 | 0.610 | 0.508 | -6.682 | 0.1520 | 0.2970 | 0.3840 | 178.818 | 138 | 0.758 | G#/Ab |
str(spotify)
## tibble [1,545 x 18] (S3: tbl_df/tbl/data.frame)
##  $ Highest Charting Position: num [1:1545] 1 2 1 3 5 1 3 2 3 8 ...
##  $ Number of Times Charted  : num [1:1545] 8 3 11 5 1 18 16 10 8 10 ...
##  $ Song Name                : chr [1:1545] "Beggin'" "STAY (with Justin Bieber)" "good 4 u" "Bad Habits" ...
##  $ Streams                  : num [1:1545] 48633449 47248719 40162559 37799456 33948454 ...
##  $ Artist                   : chr [1:1545] "Måneskin" "The Kid LAROI" "Olivia Rodrigo" "Ed Sheeran" ...
##  $ Artist Followers         : num [1:1545] 3377762 2230022 6266514 83293380 5473565 ...
##  $ Genre                    : chr [1:1545] "['indie rock italiano', 'italian pop']" "['australian hip hop']" "['pop']" "['pop', 'uk pop']" ...
##  $ Popularity               : num [1:1545] 100 99 99 98 96 97 94 95 96 95 ...
##  $ Danceability             : num [1:1545] 0.714 0.591 0.563 0.808 0.736 0.61 0.762 0.78 0.644 0.75 ...
##  $ Energy                   : num [1:1545] 0.8 0.764 0.664 0.897 0.704 0.508 0.701 0.718 0.648 0.608 ...
##  $ Loudness                 : num [1:1545] -4.81 -5.48 -5.04 -3.71 -7.41 ...
##  $ Speechiness              : num [1:1545] 0.0504 0.0483 0.154 0.0348 0.0615 0.152 0.0286 0.0506 0.118 0.0387 ...
##  $ Acousticness             : num [1:1545] 0.127 0.0383 0.335 0.0469 0.0203 0.297 0.235 0.31 0.276 0.00165 ...
##  $ Liveness                 : num [1:1545] 0.359 0.103 0.0849 0.364 0.0501 0.384 0.123 0.0932 0.135 0.178 ...
##  $ Tempo                    : num [1:1545] 134 170 167 126 150 ...
##  $ Duration                 : num [1:1545] 212 142 178 231 212 138 209 200 207 173 ...
##  $ Valence                  : num [1:1545] 0.589 0.478 0.688 0.591 0.894 0.758 0.742 0.342 0.44 0.958 ...
##  $ Chord                    : chr [1:1545] "B" "C#/Db" "A" "B" ...
##  - attr(*, "na.action")= 'omit' Named int [1:11] 36 164 465 531 637 655 751 785 877 1141 ...
##   ..- attr(*, "names")= chr [1:11] "36" "164" "465" "531" ...
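
Column renames are generally safer by name than by position, because removing columns earlier in a script shifts every later index. The following is a minimal sketch on a toy data frame; the column names mimic the source, while the `toy` object, the placeholder column `a`, and all values are made up for illustration.

```r
# Toy data frame; check.names = FALSE keeps the space and parentheses in the name
toy <- data.frame(
  a = 1:2,
  Tempo = c(120, 95),
  `Duration (ms)` = c(212000, 142000),
  check.names = FALSE
)

# Dropping a column shifts every later column one position to the left
toy$a <- NULL

# Convert milliseconds to seconds, then rename by name rather than by index
toy$`Duration (ms)` <- round(toy$`Duration (ms)` / 1000)
names(toy)[names(toy) == "Duration (ms)"] <- "Duration"

print(names(toy))
print(toy$Duration)
```

A quick names() or str() check right after a rename confirms that the intended column was changed.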

Data Visualization

Visualizations include:
- Top Genres Bar Graph
- Top Artists Bar Graph
- Correlation Heatmap
- Popular Chords Bar Graph

Top Genre

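#Count songs per Genre string; each string can list several genres
Top_genre <- spotify %>% 
  group_by(Genre) %>% 
  dplyr::summarise(song = n()) %>% 
  ungroup() %>% 
  #Song count divided by 50; this is a scaled count, not a share of all songs
  mutate(song = song/50) %>% 
  arrange(desc(song)) %>% 
  head(10)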

plot_top_genre <- ggplot(data = Top_genre, aes(x = reorder(Genre, song),
                                                 y = song,
                                                 label = song))+
  geom_col(aes(fill = song), show.legend = FALSE)+
  theme_bw()+
  coord_flip()+
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 14, colour = "black"),
        title = element_text(size = 14, colour = "black"))+
  geom_text(aes(label = scales::percent(song)), color = "white", size = 12, fontface = "bold", position = position_stack(0.7))+
  labs(title = "Top 10 Most Popular Genre",
       x = "Genre of Music",
       y = "Rate of Genre",
       caption = "Source : Kaggle Dataset")

plot_top_genre

By song count, pop is the most common genre in the charts.
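
Because the Genre column stores several genres in one string, grouping on it counts genre combinations rather than individual genres. One possible way to split the strings in base R is sketched below. It assumes every value follows the `['a', 'b']` pattern seen in the str() output; the `genre_raw` vector is a small sample copied from the first rows, not the full column.

```r
# Sample of Genre strings in the format shown by str(spotify)
genre_raw <- c(
  "['indie rock italiano', 'italian pop']",
  "['pop']",
  "['pop', 'uk pop']"
)

# Remove the square brackets and the single quotes
cleaned <- gsub("\\[|\\]|'", "", genre_raw)

# Split on commas, flatten to one vector, and trim surrounding spaces
genres <- trimws(unlist(strsplit(cleaned, ",")))

# Count how often each individual genre appears
genre_counts <- sort(table(genres), decreasing = TRUE)

print(genre_counts)
```

On the real data, empty entries (for example a bare `[]`) would need to be filtered out before counting; whether such entries exist is not shown in the source.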

Top Artists

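#n() counts charting songs per artist, so `streams` holds song counts, not stream totals
Top_artist <- spotify %>% 
  group_by(Artist) %>% 
  dplyr::summarise(streams = n()) %>% 
  ungroup() %>% 
  #Song count divided by 50; this is a scaled count, not a share of all songs
  mutate(streams = streams/50) %>% 
  arrange(desc(streams)) %>% 
  head(10)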

plot_top_artist <- ggplot(data = Top_artist, aes(x = reorder(Artist, streams),
                                                 y = streams,
                                                 label = streams))+
  geom_col(aes(fill = streams), show.legend = FALSE)+
  coord_flip() +  
  geom_text(aes(label = scales::percent(streams)), color = "white", size = 12, fontface = "bold", position = position_stack(0.7))+
  labs(title = "Top 10 Most Popular Artist",
       x = "Artist",
       y = "Rate of Streams",
       caption = "Source : Kaggle Dataset")


plot_top_artist

Taylor Swift is the artist with the most charting songs.
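
Dividing a count by a fixed number can produce "percentages" above 100%. If a true share is wanted, each count can be divided by the total number of songs instead. The following is a minimal sketch using prop.table(); the `artists` vector is made up for illustration and does not reflect the real chart counts.

```r
# Made-up artist labels, one entry per charting song
artists <- c("Taylor Swift", "Taylor Swift", "Drake",
             "BTS", "Drake", "Taylor Swift")

# Songs per artist
song_counts <- table(artists)

# Share of all songs, which always sums to 1
song_share <- sort(prop.table(song_counts), decreasing = TRUE)

print(round(song_share, 3))
```

Shares computed this way stay between 0 and 1, so scales::percent() labels would read as real percentages.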

Feature Pattern

correlated_density <- ggplot(spotify) +
    #alpha is a fixed setting, not a mapped variable, so it goes outside aes()
    geom_density(aes(Energy, fill = "Energy"), alpha = 0.1) + 
    geom_density(aes(Danceability, fill = "Danceability"), alpha = 0.1) + 
    geom_density(aes(Valence, fill = "Valence"), alpha = 0.1) + 
    geom_density(aes(Acousticness, fill = "Acousticness"), alpha = 0.1) + 
    geom_density(aes(Speechiness, fill = "Speechiness"), alpha = 0.1) + 
    geom_density(aes(Liveness, fill = "Liveness"), alpha = 0.1) + 
    scale_x_continuous(name = "Energy, Danceability, Valence, Acousticness, Speechiness and Liveness") +
    scale_y_continuous(name = "Density") +
    ggtitle("Density plot of Energy, Danceability, Valence, Acousticness, Speechiness and Liveness") +
    theme_bw() +
    theme(plot.title = element_text(size = 10, face = "bold"),
          text = element_text(size = 10)) +
    theme(legend.title=element_blank()) +
    scale_fill_brewer(palette="Accent")

correlated_density

The graph shows that the distributions of speechiness, acousticness, and liveness are right-skewed, with values concentrated near 0. The density plots of danceability and energy are left-skewed, while valence has a roughly normal, bell-shaped curve.
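
The direction of skew can also be checked numerically: a positive skewness means a long tail to the right (values bunched near the low end), and a negative skewness means a long tail to the left. The following is a minimal sketch using the moment-based formula; the helper `sample_skewness` and both vectors are hypothetical and written for illustration only.

```r
# Moment-based sample skewness: third central moment over sd cubed
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / (mean(centered^2)^1.5)
}

# Made-up values bunched near 0 with one large value (long right tail)
near_zero <- c(0.03, 0.04, 0.05, 0.05, 0.06, 0.10, 0.40)

# Mirror image: values bunched near 1 (long left tail)
near_one <- 1 - near_zero

print(sample_skewness(near_zero))  # positive: right-skewed
print(sample_skewness(near_one))   # negative: left-skewed
```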

Correlation Between Audio Features

To understand the correlation between variables, I used the corrplot() function from the corrplot package, which draws a correlation matrix as a heatmap-style plot.

library(corrplot)
corr_spotify <- select(spotify, "Danceability", "Energy", "Loudness", "Speechiness", "Acousticness", "Liveness", "Duration")
corrplot(cor(corr_spotify), type="lower")

There is a strong negative correlation between acousticness and energy, a relatively high negative correlation between acousticness and loudness, and a relatively high positive correlation between energy and loudness.
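
The pairs read off the plot can also be listed numerically by ranking the entries of the correlation matrix by absolute value. The following is a minimal sketch; the `features` data frame holds made-up values chosen only to mimic the directions described above, so its coefficients are not the real ones.

```r
# Made-up feature values for five songs
features <- data.frame(
  Energy       = c(0.20, 0.40, 0.50, 0.70, 0.90),
  Loudness     = c(-12, -9, -8, -6, -4),
  Acousticness = c(0.90, 0.60, 0.50, 0.20, 0.10)
)

# Pearson correlation matrix
cor_matrix <- cor(features)

# Take the lower triangle so each pair is listed once
idx <- which(lower.tri(cor_matrix), arr.ind = TRUE)
feature_pairs <- data.frame(
  feature_1 = rownames(cor_matrix)[idx[, "row"]],
  feature_2 = colnames(cor_matrix)[idx[, "col"]],
  r         = cor_matrix[idx]
)

# Strongest relationships first, regardless of sign
feature_pairs <- feature_pairs[order(-abs(feature_pairs$r)), ]

print(feature_pairs)
```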

Data Description

In Project 1, I used a Spotify dataset from Kaggle: https://www.kaggle.com/sashankpillai/spotify-top-200-charts-20202021. The dataset includes all the songs that appeared on Spotify's Top 200 Weekly (Global) charts in 2020 and 2021. For this project, I looked into the popularity and audio features of these tracks, with the goal of highlighting common patterns between a song's popularity and its audio features.

Cleaning the data was necessary after viewing the dataset. Checking each column with colSums(is.na()) showed that 11 rows contained missing values, so I removed those rows with na.omit(). Then I removed Index, Week of Highest Charting, Song ID, Release Date, and Weeks Charted from the dataset since they were not needed for my analysis. Lastly, I converted Duration (ms) to seconds and renamed it Duration so it would be easier to read and type in the code.

The visualizations for this project were created in RStudio. For the first visualization, I created a bar graph that shows the top genres in the dataset. The Genre variable holds more than one value per cell, and with my limited knowledge of R, I was not able to separate the genres individually. Therefore, the plotted genre rate does not truly represent the most popular genres. From my data, pop seems to be the most popular genre, followed by Latin music. The second visualization is a bar graph of the most popular artists. The most popular artist is Taylor Swift, with a plotted rate of 104%; this rate is her song count divided by 50, so it is not a true percentage. Taylor Swift produces mostly pop music, which is consistent with pop being the most popular genre. The third visualization is a density plot of the audio features. The distributions of speechiness, acousticness, and liveness are right-skewed, with values concentrated near 0. The density plots of danceability and energy are left-skewed, while valence has a roughly normal, bell-shaped curve. The fourth visualization is a heatmap of the correlations between audio features, created with the corrplot() function from the corrplot package. There is a strong negative correlation between acousticness and energy, a relatively high negative correlation between acousticness and loudness, and a relatively high positive correlation between energy and loudness. The last visualization is a bar graph of the most popular chords. The most common key among the top tracks is C♯/D♭, while D♯/E♭ is the least common.

This project gave some insight into the audio feature patterns of popular songs; however, due to the limitations of the data, the analysis is far from perfect. I would love to explore which audio features make a song more or less popular, and whether having a particular feature increases a song's popularity.
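
One possible starting point for that follow-up question is a linear regression of Popularity on the audio features. The following is a minimal sketch on simulated data, not on the Spotify dataset: the `sim` data frame, the chosen coefficients (20 and -10), and the noise level are all assumptions made up for illustration, so the output says nothing about real songs.

```r
# Simulated data with a known relationship (assumed, not from the source)
set.seed(1)
n <- 200
sim <- data.frame(
  Energy       = runif(n),
  Acousticness = runif(n)
)
sim$Popularity <- 50 + 20 * sim$Energy - 10 * sim$Acousticness +
  rnorm(n, sd = 2)

# Fit a linear model; each coefficient estimates the change in Popularity
# per unit change in that feature, holding the other feature fixed
fit <- lm(Popularity ~ Energy + Acousticness, data = sim)

print(round(coef(fit), 2))
```

On the real data, the same formula could be extended with the other audio features, and summary(fit) would report which coefficients are distinguishable from zero.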