Introduction
Spotify is one of the largest global music streaming services and a market leader. In this project we analyze Spotify’s music library across characteristics such as popularity, genres, and releases to develop an understanding of Spotify’s strategic standing.
The questions we aim to answer in this report are:
The following R packages have been used for the data analysis in this project:
library('tidyverse')
library('dplyr')
library('ggplot2')
library('hrbrthemes')
library('DT')
library('corrplot')
library('funModeling')
| Library | Description |
|---|---|
| ‘tidyverse’ | Used for data manipulation. |
| ‘dplyr’ | Used for data wrangling & manipulation. |
| ‘ggplot2’ | Used for creating data visualizations. |
| ‘hrbrthemes’ | Used to add themes to plots (theme_ipsum). |
| ‘DT’ | Used for creating data tables. |
| ‘corrplot’ | Used to create correlation plots. |
| ‘funModeling’ | Used for data pre-processing and exploratory data analysis. |
The Spotify songs data for analysis has been sourced from this GitHub repository. The data comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API.
The imported Spotify data contains 32833 rows (tracks) and 23 attributes describing track_popularity, danceability, loudness, tempo and other characteristics of songs released between the late 1950s and 2019.
Here we read the data from a CSV file and load it to the spotify_data variable and view the first 6 rows of the data to check the content.
spotify_data <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
print(head(spotify_data))
## # A tibble: 6 × 23
## track_id track_name track_artist track_popularity track_album_id
## <chr> <chr> <chr> <dbl> <chr>
## 1 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran 66 2oCs0DGTsRO98…
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories … Maroon 5 67 63rPSO264uRjW…
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the T… Zara Larsson 70 1HoSmj2eLcsrR…
## 4 75FpbthrwQmzHlBJLuGdC7 Call You … The Chainsm… 60 1nqYsOef1yKKu…
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone Y… Lewis Capal… 69 7m7vv9wlQ4i0L…
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful… Ed Sheeran 67 2yiy9cd2QktrN…
## # … with 18 more variables: track_album_name <chr>,
## # track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## # playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## # energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
## # acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## # tempo <dbl>, duration_ms <dbl>
Here we check the number of rows and columns in the spotify_data which gives us 32833 rows and 23 attributes.
print(paste('The data has',dim(spotify_data)[1],'rows and',dim(spotify_data)[2],'attributes'))
## [1] "The data has 32833 rows and 23 attributes"
The data dictionary for the spotify_data variable is shown below.
spotify_data_dictionary <- read_csv("spotify_data_dictionary.csv")
datatable(spotify_data_dictionary, options = list(
autoWidth = TRUE,
columnDefs = list(list(className = 'dt-center', targets = 3)),
pageLength = 25,
lengthMenu = c(5, 10, 15, 20, 25)
))
The structure of the spotify_data dataset, with data types and column names, is displayed below. The majority of the columns are of numeric or character type.
str(spotify_data)
## spec_tbl_df [32,833 × 23] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ track_id : chr [1:32833] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr [1:32833] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr [1:32833] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr [1:32833] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr [1:32833] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr [1:32833] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr [1:32833] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr [1:32833] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr [1:32833] "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr [1:32833] "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num [1:32833] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num [1:32833] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : num [1:32833] 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num [1:32833] -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : num [1:32833] 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num [1:32833] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num [1:32833] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num [1:32833] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num [1:32833] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num [1:32833] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num [1:32833] 122 100 124 122 124 ...
## $ duration_ms : num [1:32833] 194754 162600 176616 169093 189052 ...
## - attr(*, "spec")=
## .. cols(
## .. track_id = col_character(),
## .. track_name = col_character(),
## .. track_artist = col_character(),
## .. track_popularity = col_double(),
## .. track_album_id = col_character(),
## .. track_album_name = col_character(),
## .. track_album_release_date = col_character(),
## .. playlist_name = col_character(),
## .. playlist_id = col_character(),
## .. playlist_genre = col_character(),
## .. playlist_subgenre = col_character(),
## .. danceability = col_double(),
## .. energy = col_double(),
## .. key = col_double(),
## .. loudness = col_double(),
## .. mode = col_double(),
## .. speechiness = col_double(),
## .. acousticness = col_double(),
## .. instrumentalness = col_double(),
## .. liveness = col_double(),
## .. valence = col_double(),
## .. tempo = col_double(),
## .. duration_ms = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
The summary below reports the length of each character column and, for each numeric column, the minimum, mean, median, quartiles and maximum. It can be seen that the following variables are skewed, as there is a significant difference between their mean and max values:
Upon initial review, further investigation of these variables is needed in the form of outlier analysis using boxplots and histograms, to check whether the outliers should be retained for analysis or treated/removed.
summary(spotify_data)
## track_id track_name track_artist track_popularity
## Length:32833 Length:32833 Length:32833 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 24.00
## Mode :character Mode :character Mode :character Median : 45.00
## Mean : 42.48
## 3rd Qu.: 62.00
## Max. :100.00
## track_album_id track_album_name track_album_release_date
## Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## playlist_name playlist_id playlist_genre playlist_subgenre
## Length:32833 Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171
## Median :0.6720 Median :0.721000 Median : 6.000 Median : -6.166
## Mean :0.6548 Mean :0.698619 Mean : 5.374 Mean : -6.720
## 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median :1.0000 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean :0.5657 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
We are taking a glimpse at the type of data in the spotify_data dataset.
glimpse(spotify_data)
## Rows: 32,833
## Columns: 23
## $ track_id <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdfa…
## $ track_name <chr> "I Don't Care (with Justin Bieber) - Loud Lux…
## $ track_artist <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "Th…
## $ track_popularity <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, 6…
## $ track_album_id <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E6…
## $ track_album_name <chr> "I Don't Care (with Justin Bieber) [Loud Luxu…
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "20…
## $ playlist_name <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop R…
## $ playlist_id <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7cf…
## $ playlist_genre <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po…
## $ playlist_subgenre <chr> "dance pop", "dance pop", "dance pop", "dance…
## $ danceability <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.4…
## $ energy <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.8…
## $ key <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5,…
## $ loudness <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.38…
## $ mode <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
## $ speechiness <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.127…
## $ acousticness <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030, …
## $ instrumentalness <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00e…
## $ liveness <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.143…
## $ valence <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.1…
## $ tempo <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, 1…
## $ duration_ms <dbl> 194754, 162600, 176616, 169093, 189052, 16304…
Here we count the missing values per column to decide whether they need to be dropped, retained or imputed with the mean/median.
colSums(is.na(spotify_data))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
# Show the rows that have a missing value in any column (if_any() supersedes filter_all()).
spotify_data %>%
  filter(if_any(everything(), is.na))
## # A tibble: 5 × 23
## track_id track_name track_artist track_popularity track_album_id
## <chr> <chr> <chr> <dbl> <chr>
## 1 69gRFGOWY9OMpFJgFol1u0 <NA> <NA> 0 717UG2du6utFe…
## 2 5cjecvX0CmC9gK0Laf5EMQ <NA> <NA> 0 3luHJEPw434tv…
## 3 5TTzhRSWQS4Yu8xTgAuq6D <NA> <NA> 0 3luHJEPw434tv…
## 4 3VKFip3OdAvv4OfNTgFWeQ <NA> <NA> 0 717UG2du6utFe…
## 5 69gRFGOWY9OMpFJgFol1u0 <NA> <NA> 0 717UG2du6utFe…
## # … with 18 more variables: track_album_name <chr>,
## # track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## # playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## # energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
## # acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## # tempo <dbl>, duration_ms <dbl>
Only 5 rows have missing values, and in each of them the same three columns are missing: track_name, track_artist and track_album_name. As this is less than 0.1% of the data, we decided to drop the rows with NAs, since doing so will not materially affect our analysis.
spotify_data <- spotify_data %>% drop_na()
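The "less than 0.1%" figure can be checked with a short calculation. The sketch below uses a small made-up table (`toy` is a hypothetical stand-in, not the Spotify data) to show how the share of incomplete rows might be computed in base R, and then repeats the arithmetic with the counts reported above.

```r
# Toy data standing in for spotify_data (values are made up for illustration).
toy <- data.frame(
  track_id   = c("a", "b", "c", "d"),
  track_name = c("Song A", NA, "Song C", "Song D"),
  stringsAsFactors = FALSE
)

# complete.cases() is TRUE for rows that contain no missing values.
rows_with_na  <- sum(!complete.cases(toy))
share_missing <- 100 * rows_with_na / nrow(toy)
share_missing

# Same arithmetic with the counts reported above: 5 incomplete rows out of 32833.
share_reported <- 100 * 5 / 32833
share_reported < 0.1
```

Running the first part against the real data before calling drop_na() is a cheap way to confirm that only a negligible share of rows is removed.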
Next we check for fully duplicated rows to decide whether any need to be dropped. There are none: the dimensions returned by the query below match those of the data after the rows with missing values were removed (32828 rows).
spotify_data %>% distinct() %>% dim()
## [1] 32828 23
The data dictionary describes “track_id” as a unique identifier for each song, so we checked this column for duplicates. It contained 4476 duplicate rows, which were dropped, leaving the cleaned data with 28352 rows and 23 attributes.
spotify_data %>% distinct(track_id, .keep_all=TRUE) %>% dim()
## [1] 28352 23
spotify_data <- spotify_data %>% distinct(track_id,.keep_all=TRUE)
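As a rough illustration of how the duplicate count relates to the row counts, the base R sketch below uses a made-up vector of ids (`toy_ids` is hypothetical). `duplicated()` flags every occurrence after the first, which mirrors how `distinct(track_id, .keep_all = TRUE)` keeps the first row per id.

```r
# Made-up ids: "t1" appears three times, "t2" twice, "t3" once.
toy_ids <- c("t1", "t2", "t1", "t3", "t2", "t1")

# duplicated() is TRUE for each occurrence after the first one.
n_extra  <- sum(duplicated(toy_ids))
n_unique <- length(unique(toy_ids))

# Rows removed plus rows kept must equal the original row count.
n_extra + n_unique == length(toy_ids)

# The same bookkeeping with the dimensions printed above.
rows_removed <- 32828 - 28352
rows_removed
```

Because only the first row per track_id is kept, the playlist and genre attached to a track are those of its first occurrence. If the same track appears in several playlists, which is a plausible but unverified cause of the duplicates, the other genre labels are lost at this step.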
The “duration_ms” column is given in milliseconds, which is not a convenient unit for song duration, so we created a new variable “duration_m” that stores the duration in minutes. We computed it with mutate() using the conversion factor (60000 ms per minute) and then dropped “duration_ms”, as it is no longer required for further analysis.
spotify_data <- spotify_data %>% mutate(duration_m = duration_ms/60000)
spotify_data <- select(spotify_data, -duration_ms)
colnames(spotify_data)
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_m"
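To make the conversion concrete, here is a small worked example using the first three duration values shown in the str() output. The helper `to_min_sec()` is a hypothetical addition for display purposes and is not part of the report's pipeline.

```r
# First three duration_ms values from the str() output above.
duration_ms <- c(194754, 162600, 176616)

# 1 minute = 60 seconds * 1000 milliseconds = 60000 ms.
duration_m <- duration_ms / 60000
round(duration_m, 2)

# Hypothetical helper: format a duration as "m:ss" for labels or tables.
to_min_sec <- function(ms) {
  total_sec <- as.integer(round(ms / 1000))
  sprintf("%d:%02d", total_sec %/% 60L, total_sec %% 60L)
}
to_min_sec(duration_ms)
```

Decimal minutes are convenient for plotting and modelling, while an "m:ss" string is usually easier to read in a table.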
The data suggest that track popularity varies with time and genre, and we would like to explore this relationship further in the exploratory data analysis section. To do so, we extract the year from the “track_album_release_date” column into a new variable, “track_album_release_year”, so that trends can be analysed by year rather than by individual date.
spotify_data$track_album_release_date <- as.Date(spotify_data$track_album_release_date)
spotify_data$track_album_release_year <- as.numeric(format(spotify_data$track_album_release_date, "%Y"))
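One thing worth checking at this step is whether every release date is a full "YYYY-MM-DD" string. The sketch below uses made-up values (`release_date` is hypothetical, and whether the real column contains partial dates is an assumption to verify) to show that partial dates become NA when parsed with a full-date format, while taking the first four characters still recovers the year.

```r
# Made-up mix of a full date, a year-only value and a year-month value.
release_date <- c("2019-06-14", "2012", "1998-03")

# Parsing with an explicit full-date format turns partial dates into NA.
parsed <- as.Date(release_date, format = "%Y-%m-%d")
sum(is.na(parsed))

# Taking the first four characters recovers the year in every case.
release_year <- as.numeric(substr(release_date, 1, 4))
release_year
```

Counting the NAs in track_album_release_year after the conversion above would show whether this matters for the actual data.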
The “track_popularity” column ranges from 0 to 100. Working with the raw score makes it inconvenient to compare popularity trends across attributes such as genre, subgenre and release year, and to fit models that predict the popularity of new tracks.
Therefore, we binned “track_popularity” into the bins below and stored the result in a new column called “track_popularity_tag”. The code defines six tags, but the last one, “(100+]”, is never assigned because popularity does not exceed 100:
track_popularity_uniques <- spotify_data %>% distinct(track_popularity) %>% select(track_popularity)
tags <- c("[0-20]","(20-40]", "(40-60]", "(60-80]", "(80-100]", "(100+]")
spotify_data_binned <- spotify_data %>%
mutate(track_popularity_tag = case_when(
track_popularity <= 20 ~ tags[1],
track_popularity > 20 & track_popularity <= 40 ~ tags[2],
track_popularity > 40 & track_popularity <= 60 ~ tags[3],
track_popularity > 60 & track_popularity <= 80 ~ tags[4],
track_popularity > 80 & track_popularity <= 100 ~ tags[5],
track_popularity > 100 ~ tags[6]
))
spotify_data_binned %>% distinct(track_popularity_tag)
## # A tibble: 5 × 1
## track_popularity_tag
## <chr>
## 1 (60-80]
## 2 (40-60]
## 3 (20-40]
## 4 [0-20]
## 5 (80-100]
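The same binning can also be expressed with base R's `cut()`, which may be easier to maintain than a chain of conditions. This is an alternative sketch rather than the method used in the report. The `popularity` vector is made up, and the labels reuse the first five tags defined above.

```r
# Made-up popularity scores, including values that sit on bin edges.
popularity <- c(0, 20, 21, 66, 80, 81, 100)

# Five labels, matching the tags that actually occur in the data.
tags <- c("[0-20]", "(20-40]", "(40-60]", "(60-80]", "(80-100]")

# Right-closed intervals; include.lowest = TRUE puts 0 into the first bin.
popularity_tag <- cut(
  popularity,
  breaks = seq(0, 100, by = 20),
  labels = tags,
  include.lowest = TRUE
)
as.character(popularity_tag)
table(popularity_tag)
```

`cut()` returns an ordered set of factor levels, so the bins keep their natural order in plots and tables, whereas the character column produced by case_when() sorts alphabetically.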
Next, to decide whether the outliers in the dataset need to be removed, retained or imputed, we plot boxplots for each of the numeric song characteristics.
# Select the audio characteristics by name (columns 12 to 22) and reshape to long format.
spotify_pivot <- spotify_data_binned %>%
  select(danceability:tempo) %>%
  pivot_longer(cols = danceability:tempo, names_to = "Var", values_to = "val")
ggplot(spotify_pivot, aes(y = val, fill = Var))+
geom_boxplot(show.legend = FALSE, width = .6, position = "dodge")+
coord_flip() +
facet_wrap(vars(Var), ncol=3, scales = "free") + scale_fill_grey()
The boxplots show that every characteristic apart from “key”, “mode”, and “valence” has several outlying data points. Without domain expertise on how much information these outliers contribute to the final analysis, we chose to retain them, as they may offer insight into how track popularity varies with the audience, which could in turn be used to increase popularity.
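To put a number on what the boxplots show, the points beyond the whiskers can be counted with the usual 1.5 × IQR rule. The function `count_iqr_outliers()` below is a hypothetical helper, shown on made-up data. It uses `quantile()`, whose quartiles can differ slightly from the hinges that a boxplot draws, so the counts should be read as approximate.

```r
# Hypothetical helper: count values outside the 1.5 * IQR fences.
count_iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

# Made-up example: 100 lies far above the upper fence.
toy <- c(1, 2, 3, 4, 5, 100)
n_out <- count_iqr_outliers(toy)
n_out

# Applied column by column to a small made-up data frame.
toy_df <- data.frame(a = toy, b = c(10, 11, 12, 13, 14, 15))
out_per_col <- sapply(toy_df, count_iqr_outliers)
out_per_col
```

Applying the same `sapply()` call to the numeric audio columns would give a per-attribute outlier count to set against the number of rows.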
To study the skewness of the data set, we plot histograms.
ggplot(spotify_pivot, aes(x = val, fill = Var))+
geom_histogram(show.legend = FALSE, position = "dodge") +
facet_wrap(vars(Var), ncol=3, scales = "free") + scale_fill_grey()
On analysis, we see that only the attribute “valence” is approximately normally distributed, whereas:
* Loudness, Danceability and Energy are left skewed.
* Liveness, Speechiness, Acousticness and Instrumentalness are right skewed.
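The direction of skew read off the histograms can be cross-checked numerically. The sketch below defines a moment-based sample skewness (`sample_skewness()` is a hypothetical helper, not from a package) and applies it to made-up vectors: a positive value indicates a longer right tail, a negative value a longer left tail.

```r
# Hypothetical helper: moment-based skewness, m3 / m2^(3/2).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

right_tailed <- c(1, 1, 1, 2, 10)   # long right tail
left_tailed  <- -right_tailed       # mirror image, long left tail
symmetric    <- c(1, 2, 3, 4, 5)

skew_right <- sample_skewness(right_tailed)
skew_left  <- sample_skewness(left_tailed)
skew_sym   <- sample_skewness(symmetric)
c(skew_right, skew_left, skew_sym)
```

Applied to the audio columns, the sign of this statistic should agree with the left and right skew listed above if the visual reading is correct.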
This helps us take the below insights:
A final preview of the cleaned data is displayed below, after removing missing values and duplicates, adding new variables for use in the exploratory data analysis section, transforming variables, checking outliers and binning data for model predictions.
spotify_data_cleaned <- spotify_data_binned
datatable(head(spotify_data_cleaned, 25), options = list(
scrollCollapse = TRUE,scrollX = TRUE,
autoWidth = TRUE,
columnDefs = list(list(className = 'dt-center', targets = 5)),
pageLength = 5,
lengthMenu = c(5, 10, 15, 20, 25)
))
To start off, we look at the correlation between the song attributes to see whether any variables are strongly correlated. This insight can help us reduce the number of features, either by applying Principal Component Analysis or by some form of feature selection, before fitting any model to the data in the future.
corr_data <-select(spotify_data_cleaned,track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo)
corrplot(cor(corr_data), tl.col = 'black')
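A correlation plot is easy to scan, but for feature selection it can help to list the strongly correlated pairs explicitly. The sketch below does this on simulated data. The data frame `toy`, its columns and the 0.7 cut-off are all assumptions chosen for illustration.

```r
set.seed(1)

# Simulated data: b is almost a linear function of a, c is independent noise.
x <- rnorm(200)
toy <- data.frame(a = x, b = 2 * x + rnorm(200, sd = 0.1), c = rnorm(200))

cor_mat <- cor(toy)

# Blank out the diagonal and upper triangle so that each pair appears once.
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA

# Reshape the matrix into one row per variable pair.
pairs <- as.data.frame(as.table(cor_mat), stringsAsFactors = FALSE)
names(pairs) <- c("var1", "var2", "r")

# Keep pairs whose absolute correlation exceeds the (assumed) threshold.
strong <- pairs[!is.na(pairs$r) & abs(pairs$r) > 0.7, ]
strong
```

Replacing `toy` with corr_data would give a table of candidate redundant features to consider before fitting a model.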
Insights:
audio_characteristics <- select(spotify_data_cleaned,c(12:22))
plot_num(audio_characteristics)
Insights:
This suggests that highly instrumental tracks are uncommon, whereas danceability and energy are high in the majority of the tracks.
Here we visualize the distribution of genres among all the tracks in the data provided.
spotify_genre_pie_data <- spotify_data_cleaned %>%
group_by(playlist_genre) %>%
summarise(Total_number_of_tracks = length(playlist_genre))
ggplot(spotify_genre_pie_data, aes(x="", y=Total_number_of_tracks, fill=playlist_genre)) +
geom_bar(width = 1, stat = "identity") +
coord_polar("y", start=0) +
geom_text(aes(label = paste(round(Total_number_of_tracks / sum(Total_number_of_tracks) * 100, 1), "%")),
position = position_stack(vjust = 0.35))
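The percentage labels on the pie chart can also be produced as a plain table, which is often easier to read and to quote. The base R sketch below uses a made-up `genre` vector in place of playlist_genre.

```r
# Made-up genre labels standing in for playlist_genre.
genre <- c("pop", "pop", "rap", "rock", "rap", "pop", "edm", "rock")

# table() counts tracks per genre; prop.table() converts counts to shares.
share <- round(100 * prop.table(table(genre)), 1)
share

# Sorted from largest to smallest share.
sort(share, decreasing = TRUE)
```

When several genres have similar shares, as slices of similar size are hard to compare by eye, a sorted table or bar chart of these values can complement the pie chart.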
Insights:
plot_list <-
map(names(spotify_data_cleaned %>% select(where(is.numeric)) %>% select(-mode,-key)),
function(colName) {
spotify_data_cleaned %>%
ggplot(aes(x = playlist_genre,
y = !! sym(colName),
fill = playlist_genre)) +
geom_boxplot() +
theme(legend.position = "none") +
labs(title = colName, x = "", y = "")
})
gridExtra::grid.arrange(grobs = plot_list[c(1:6)])
Insights:
Here we plot the number of tracks released per year from 2006 to 2019, broken down by genre, and derive insights.
song_years_genre_df <- spotify_data_cleaned %>%
filter(track_album_release_year> 2005 & track_album_release_year<=2019)%>%
select('track_album_release_year', 'playlist_genre') %>%
group_by(track_album_release_year, playlist_genre) %>%
summarise(songs_released = n()) %>%
ungroup()
ggplot(song_years_genre_df, aes(x = track_album_release_year, y = songs_released)) +
geom_line(aes(color = playlist_genre)) +
ggtitle("Number of songs released over the years for each genre") +
ylab("songs released") +xlab("Release Year")
Insights:
Here we analyse which of the artists are high in popularity and have their songs on the top of the charts more frequently.
# Rank tracks by popularity, keep each artist's most popular track, then take the top 10 artists.
top_10_artist_popularity <- spotify_data_cleaned %>%
  select(track_artist, track_popularity, track_album_release_year) %>%
  filter(track_popularity > 0, track_album_release_year > 2010) %>%
  arrange(desc(track_popularity)) %>%
  distinct(track_artist, .keep_all = TRUE) %>%
  slice_head(n = 10)
# fill must be mapped inside aes(), not inside reorder(), for scale_fill_brewer() to take effect.
ggplot(data = top_10_artist_popularity, mapping = aes(x = reorder(track_artist, track_popularity), weight = track_popularity, fill = track_artist)) +
  geom_bar(show.legend = FALSE) + coord_flip() + scale_fill_brewer(palette = "Spectral") +
  ggtitle("Top 10 Artists (By Popularity) From 2010") +
  xlab("Artist Name") + ylab("Popularity Index")
From the analysis we see that the following are the top 10 artists of recent years, whose tracks garner more popularity than the others.
From all the above analysis, we get a better picture of which features in the dataset add value to our insights and predictions. Further EDA can be done before fitting models to the data and predicting the target variable.
By analyzing the data we have developed the following insights:
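As a roadmap, we can proceed with a more detailed exploratory data analysis and decide which model is best suited to predict our target variable, which could be the popularity of the song. If we predict the binned “track_popularity_tag”, this is a classification task for which a model such as an SVM could be used; if we predict the numeric “track_popularity” directly, it is a regression task for which Linear Regression is a candidate, depending on how well the data fits.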