Problem Statement:
Address the problem:
Approach:
Usefulness:
library(dplyr)
library(corrplot)
library(ggplot2)
library(ggthemes)
library(tidyverse)
library(kableExtra)
library(factoextra)
library(plotly)
library(skimr)
library(DT)
library(class)
library(tseries)
library(lmtest)
library(forecast)
library(TSA)
Source - https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md
Data Information
The data this week comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata arounds songs from Spotify’s API.
Data Dictionary
| variable | class | description |
|---|---|---|
| track_id | character | Song unique ID |
| track_name | character | Song Name |
| track_artist | character | Song Artist |
| track_popularity | double | Song Popularity (0-100) where higher is better |
| track_album_id | character | Album unique ID |
| track_album_name | character | Song album name |
| track_album_release_date | character | Date when album released |
| playlist_name | character | Name of playlist |
| playlist_id | character | Playlist ID |
| playlist_genre | character | Playlist genre |
| playlist_subgenre | character | Playlist subgenre |
| danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db. |
| mode | double | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | double | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | double | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | double | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | double | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | double | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | double | Duration of song in milliseconds |
raw_data <- read.csv("spotify_songs.csv")
str(raw_data)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
Step 1: Convert the features to an appropriate data type.
raw_data$playlist_genre <- as.factor(raw_data$playlist_genre)
raw_data$playlist_subgenre <- as.factor(raw_data$playlist_subgenre)
raw_data$mode <- as.factor(raw_data$mode)
Step 2: Separate and create new variables.
raw_data <- separate(raw_data, track_album_release_date, c("track_album_release_year","track_album_release_month","track_album_release_day"), fill="right")
Step 3: Check for acceptable range of values for numeric variable to identify outliers.
numeric_data <- raw_data %>% dplyr::select(where(is.numeric))
summary(numeric_data)
## track_popularity danceability energy key
## Min. : 0.00 Min. :0.0000 Min. :0.000175 Min. : 0.000
## 1st Qu.: 24.00 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000
## Median : 45.00 Median :0.6720 Median :0.721000 Median : 6.000
## Mean : 42.48 Mean :0.6548 Mean :0.698619 Mean : 5.374
## 3rd Qu.: 62.00 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000
## Max. :100.00 Max. :0.9830 Max. :1.000000 Max. :11.000
## loudness speechiness acousticness instrumentalness
## Min. :-46.448 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.: -8.171 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median : -6.166 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean : -6.720 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.: -4.645 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. : 1.275 Max. :0.9180 Max. :0.9940 Max. :0.9940000
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
boxplot(numeric_data$tempo)
Step 4: Check for the number of missing or NA values in the dataframe
colSums(is.na(raw_data))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_year track_album_release_month track_album_release_day
## 0 1855 1886
## playlist_name playlist_id playlist_genre
## 0 0 0
## playlist_subgenre danceability energy
## 0 0 0
## key loudness mode
## 0 0 0
## speechiness acousticness instrumentalness
## 0 0 0
## liveness valence tempo
## 0 0 0
## duration_ms
## 0
Step 5: Remove duplicate records
duplicate_data <- na.omit(raw_data[raw_data$track_name == "Shape of You",])
duplicate_data
## track_id track_name track_artist track_popularity
## 1219 7qiZfU4dY1lWllzX7mPBI3 Shape of You Ed Sheeran 86
## 2318 7qiZfU4dY1lWllzX7mPBI3 Shape of You Ed Sheeran 86
## 2839 7qiZfU4dY1lWllzX7mPBI3 Shape of You Ed Sheeran 86
## 5308 5ZiGdWZvv9kaWw1UFcmDur Shape of You EZA 2
## 17489 0FE9t6xYkqWXU2ahLh6D8X Shape of You Ed Sheeran 75
## 23868 0FE9t6xYkqWXU2ahLh6D8X Shape of You Ed Sheeran 75
## 23900 7qiZfU4dY1lWllzX7mPBI3 Shape of You Ed Sheeran 86
## 30407 7qiZfU4dY1lWllzX7mPBI3 Shape of You Ed Sheeran 86
## track_album_id track_album_name track_album_release_year
## 1219 3T4tUhGYeRNVUGevb0wThu <c3>· (Deluxe) 2017
## 2318 3T4tUhGYeRNVUGevb0wThu <c3>· (Deluxe) 2017
## 2839 3T4tUhGYeRNVUGevb0wThu <c3>· (Deluxe) 2017
## 5308 5Zm0pvfZXgyVjo8uaCqcan Shape of You 2017
## 17489 7oJa8bPFKVbq4c7NswXHw8 Shape of You 2017
## 23868 7oJa8bPFKVbq4c7NswXHw8 Shape of You 2017
## 23900 3T4tUhGYeRNVUGevb0wThu <c3>· (Deluxe) 2017
## 30407 3T4tUhGYeRNVUGevb0wThu <c3>· (Deluxe) 2017
## track_album_release_month track_album_release_day
## 1219 03 03
## 2318 03 03
## 2839 03 03
## 5308 05 19
## 17489 01 06
## 23868 01 06
## 23900 03 03
## 30407 03 03
## playlist_name
## 1219 Pop Ingl<c3><a9>s (2020 - 2010s)e<9f>’\231 M<c3>osica En Ingl<c3><a9>s 2010s
## 2318 post teen pop
## 2839 ElectroPop 2020
## 5308 Music&Other Drugs
## 17489 Tropical House Run 190 BPM
## 23868 Today's Hits (Clean)
## 23900 Today's Hits (Clean)
## 30407 2010 - 2011 - 2012 - 2013 - 2014 - 2015 - 2016 - 2017 - 2018 - 2019 - 2020 TOP HITS
## playlist_id playlist_genre playlist_subgenre danceability
## 1219 4QAzO1Z92yqqD6Mf3mOANu pop dance pop 0.825
## 2318 6rjxP7GQKoqqgoakzxl3PY pop post-teen pop 0.825
## 2839 4frhr6RQM2fMOm2mpvOVo6 pop electropop 0.825
## 5308 5jROYSZSL7cO0jGAqkPx7C pop indie poptimism 0.596
## 17489 37i9dQZF1DWSTc9FdySHtz latin tropical 0.825
## 23868 7ENISpOJhocpMJVcGb0qcT r&b hip pop 0.825
## 23900 7ENISpOJhocpMJVcGb0qcT r&b hip pop 0.825
## 30407 2DjIfVDXGYDgRxw7IJTKVb edm pop edm 0.825
## energy key loudness mode speechiness acousticness instrumentalness
## 1219 0.652 1 -3.183 0 0.0802 0.581 0.000000
## 2318 0.652 1 -3.183 0 0.0802 0.581 0.000000
## 2839 0.652 1 -3.183 0 0.0802 0.581 0.000000
## 5308 0.390 11 -9.708 0 0.2250 0.596 0.000368
## 17489 0.652 1 -3.183 0 0.0802 0.581 0.000000
## 23868 0.652 1 -3.183 0 0.0802 0.581 0.000000
## 23900 0.652 1 -3.183 0 0.0802 0.581 0.000000
## 30407 0.652 1 -3.183 0 0.0802 0.581 0.000000
## liveness valence tempo duration_ms
## 1219 0.0931 0.931 95.977 233713
## 2318 0.0931 0.931 95.977 233713
## 2839 0.0931 0.931 95.977 233713
## 5308 0.4870 0.652 176.000 225682
## 17489 0.0931 0.931 95.977 233713
## 23868 0.0931 0.931 95.977 233713
## 23900 0.0931 0.931 95.977 233713
## 30407 0.0931 0.931 95.977 233713
raw_data <- raw_data[!duplicated(raw_data$track_id),]
raw_data <- raw_data[!duplicated(raw_data[,c("track_name","track_artist")]),]
raw_data <- subset (raw_data, select = -c(1,5,6,10,11))
clean_data <- raw_data[!is.na(raw_data$track_name),]
colSums(is.na(clean_data))
## track_name track_artist track_popularity
## 0 0 0
## track_album_release_year track_album_release_month track_album_release_day
## 0 1464 1485
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
head(clean_data, 20) %>%
datatable(options = list(scrollCollapse = TRUE,scrollX = TRUE,
columnDefs = list(list(className = 'dt-center', targets = 1:4))
))
skim(clean_data) %>%
datatable(options = list(scrollCollapse = TRUE,scrollX = TRUE,
columnDefs = list(list(className = 'dt-center', targets = 1:4))
))
Correlation
spotify_correlation <- cor(clean_data[,-c(1,2,4,5,6,7,8,13)])
corrplot(spotify_correlation,
type="upper",
method="number", order="alphabet",
addCoef.col="black",
diag=FALSE,
tl.srt=45,
tl.col="black",col = colorRampPalette(c("midnightblue", "white","black"))(100))
Genre and its sub-genre
theme_set(theme_bw())
genre_overview <- ggplot(clean_data, aes(x = factor(playlist_subgenre))) +
geom_bar(width = 1, aes(fill = playlist_genre), colour = "black") +
theme(axis.text.x = element_text(angle =45, vjust = 20,hjust = 0))
genre_overview + coord_polar()
ggplot(clean_data, aes(x=danceability, fill=playlist_genre))+ geom_density(alpha=0.4) +theme_foundation()+ labs(x="Danceability", y="Density") + guides(fill=guide_legend(title="Genres"))+ ggtitle("Danceability distribution per Genre")
ggplot(clean_data, aes(x=energy, fill=playlist_genre))+ geom_density(alpha=0.4) + scale_fill_brewer(palette="Set1") + theme_excel()+ labs(x="Energy", y="Density") + guides(fill=guide_legend(title="Genres"))+ ggtitle("Energy distribution per Genre")
ggplot(clean_data, aes(x=loudness, fill=playlist_genre))+ geom_density(alpha=0.4) + scale_fill_brewer(palette="Set2") + labs(x="Loudness", y="Density") + guides(fill=guide_legend(title="Genres"))+ theme_economist()+ ggtitle("Loudness distribution per Genre")
ggplot(raw_data, aes(x=track_popularity, fill=playlist_genre))+ geom_density(alpha=0.2)+ theme_clean() + scale_fill_brewer(palette="Accent")+ labs(x="Track popularity", y="Density") + guides(fill=guide_legend(title="Genres"))+ ggtitle("Track popularity distribution per Genre")
ggplot(data = clean_data) +
geom_point(mapping = aes(x = duration_ms, y = track_popularity, color = playlist_genre, alpha = 0.12))
theme_set(theme_bw())
clean_data$track_popavg <- round((clean_data$track_popularity - mean(clean_data$track_popularity))/sd(clean_data$track_popularity), 2)
clean_data$tp_type <- ifelse(clean_data$track_popavg > 0, "above", "below")
clean_data <- clean_data[order(clean_data$track_popavg), ]
ggplot(clean_data, aes(x= playlist_subgenre, y=track_popavg, label=track_popavg)) +
geom_bar(stat='identity', aes(fill=tp_type), width=.5) +
scale_fill_manual(name="Popularity",
labels = c("Above Average", "Below Average"),
values = c("above"="purple", "below"="black")) +
labs(title= "Track Subgenre with respect to its popularity") +
coord_flip()
top_10_artists <- clean_data %>% group_by(track_artist, track_popularity) %>% summarise(count = n()) %>% arrange(desc(track_popularity,count))
## `summarise()` has grouped output by 'track_artist'. You can override using the `.groups` argument.
top_10_artists <- top_10_artists %>% filter(track_popularity>=95) %>% select(track_artist, count)
df <- as.data.frame(top_10_artists)
colnames(df) <- c("track_artist", "count")
pie <- ggplot(df, aes(x = "", y=count, fill = track_artist)) + geom_bar(width = 1, stat = "identity") + theme(axis.line = element_blank(),
plot.title = element_text(hjust=0.5)) + labs(fill="track_artist", x=NULL, y=NULL, title="Pie Chart of track artist with track popularity greater than 98")
pie + coord_polar(theta = "y", start=0)
par(mfrow=c(2,4))
pop_data <- clean_data %>% select(track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence) %>% filter(track_popularity > 75) %>% arrange(track_popularity)
plot(data = pop_data, track_popularity ~ danceability, col = "blue")
plot(data = pop_data, track_popularity ~ energy, col = "blue")
plot(data = pop_data, track_popularity ~ loudness, col = "blue")
plot(data = pop_data, track_popularity ~ speechiness, col = "blue")
plot(data = pop_data, track_popularity ~ acousticness, col = "blue")
plot(data = pop_data, track_popularity ~ instrumentalness, col = "blue")
plot(data = pop_data, track_popularity ~ liveness, col = "blue")
plot(data = pop_data, track_popularity ~ valence, col = "blue")
With majority of the songs of popularity values greater than 75, following are observations:
clean_data$genre_num <- as.numeric(factor(clean_data$playlist_genre))
shruthi_fav_artists = c("Ariana Grande", "Dua Lipa", "Taylor Swift", "The Weekend", "Jonas Brothers", "Justin Bieber", "The Chainsmokers", "Imagine Dragons")
shruthi_fav <- clean_data %>% select(track_artist, track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, genre_num, tempo, duration_ms, playlist_genre) %>% filter(trimws(track_artist) %in% shruthi_fav_artists)
shruthi_fav_columns <- names(shruthi_fav[c(-1,-14)])
shruthi_pivot <- shruthi_fav %>%
pivot_longer(cols = all_of(shruthi_fav_columns))
shruthi_pivot %>%
ggplot(aes(x = name, y = value, color= playlist_genre, size = 10)) +
#geom_density() +
geom_jitter(cex = .4, aes(size = 10)) +
facet_wrap(~name, ncol = 3, scales = 'free') +
theme_classic() +
theme(axis.text.x = element_blank()) +
labs(title = 'Shruthi`s - likings are as follows', x = '', y = '')
moun_fav_artists = c("Gloria Estefan", "Major Lazer", "Afrojack", "DJ Snake", "Elvis Presley", "Queen", "J Balvin", "Pink Floyd", "Daniel Powter")
moun_fav <- clean_data %>% select(track_artist, track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, genre_num, tempo, duration_ms, playlist_genre) %>% filter(trimws(track_artist) %in% moun_fav_artists)
moun_fav_columns <- names(moun_fav[c(-1,-14)])
moun_pivot <- shruthi_fav %>%
pivot_longer(cols = all_of(moun_fav_columns))
moun_pivot %>%
ggplot(aes(x = name, y = value, color= playlist_genre, size = 0.5)) +
#geom_density() +
geom_jitter(cex = .4, aes(size = 0.5)) +
facet_wrap(~name, ncol = 3, scales = 'free') +
theme_classic() +
theme(axis.text.x = element_blank()) +
labs(title = 'Mounica`s - likings are as follows', x = '', y = '')
Conclusion:
KNN is a classification technique used to identify the similarities in the data(existing) among various groups which is then used to predict the classification of any new data.
quantile(clean_data$track_popularity, c(0.33,0.66))
## 33% 66%
## 31 52
model_data <- clean_data %>%
mutate(popularity_gp = case_when(
track_popularity >= 0 & track_popularity <= 31 ~ "Least_Popularity",
track_popularity >= 32 & track_popularity <= 52 ~ "Average_Popularity",
TRUE ~ "Highest_Popularity"
)) %>%
select(where(is.numeric), -c(track_popavg, genre_num, track_popularity,duration_ms), popularity_gp)
model_data$popularity_gp = as.factor(model_data$popularity_gp)
str(model_data)
## 'data.frame': 26229 obs. of 11 variables:
## $ danceability : num 0.605 0.716 0.705 0.559 0.498 0.684 0.704 0.555 0.294 0.43 ...
## $ energy : num 0.894 0.981 0.975 0.844 0.723 0.85 0.482 0.712 0.929 0.944 ...
## $ key : int 9 0 6 11 7 5 5 2 9 6 ...
## $ loudness : num -3.88 -3.61 -3.85 -6.4 -6.74 ...
## $ speechiness : num 0.0385 0.031 0.0312 0.0557 0.223 0.109 0.0475 0.0357 0.0425 0.16 ...
## $ acousticness : num 1.19e-02 3.17e-02 3.32e-03 1.86e-04 5.08e-01 1.92e-02 3.02e-02 5.78e-02 1.00e-04 2.66e-05 ...
## $ instrumentalness: num 0.00 7.40e-03 1.82e-04 9.94e-04 5.21e-04 1.37e-04 9.13e-05 4.50e-03 0.00 1.96e-01 ...
## $ liveness : num 0.0887 0.171 0.537 0.169 0.419 0.589 0.173 0.128 0.0773 0.235 ...
## $ valence : num 0.545 0.861 0.969 0.357 0.318 0.925 0.284 0.288 0.26 0.55 ...
## $ tempo : num 122 139 130 128 110 ...
## $ popularity_gp : Factor w/ 3 levels "Average_Popularity",..: 3 3 3 3 3 3 3 3 3 3 ...
table(model_data$popularity_gp)
##
## Average_Popularity Highest_Popularity Least_Popularity
## 8695 8717 8817
set.seed(3245)
gp <- runif(nrow(model_data))
model_data <- model_data[order(gp),]
head(model_data,5)
## danceability energy key loudness speechiness acousticness
## 16056 0.480 0.974 5 -5.531 0.0947 0.00754
## 10783 0.706 0.552 6 -10.447 0.1690 0.30500
## 23342 0.811 0.458 5 -9.688 0.0402 0.01930
## 7231 0.725 0.600 1 -7.852 0.4230 0.01070
## 23500 0.486 0.620 4 -7.115 0.0284 0.05170
## instrumentalness liveness valence tempo popularity_gp
## 16056 5.64e-03 0.8950 0.522 96.382 Least_Popularity
## 10783 6.06e-06 0.0968 0.361 165.030 Average_Popularity
## 23342 1.41e-06 0.0526 0.797 92.940 Highest_Popularity
## 7231 0.00e+00 0.3500 0.454 106.223 Average_Popularity
## 23500 8.17e-03 0.0888 0.417 141.147 Average_Popularity
summary(model_data[,-11])
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5620 1st Qu.:0.577000 1st Qu.: 2.000 1st Qu.: -8.337
## Median :0.6700 Median :0.720000 Median : 6.000 Median : -6.274
## Mean :0.6536 Mean :0.696510 Mean : 5.376 Mean : -6.838
## 3rd Qu.:0.7600 3rd Qu.:0.842000 3rd Qu.: 9.000 3rd Qu.: -4.722
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
## speechiness acousticness instrumentalness liveness
## Min. :0.0000 Min. :0.0000 Min. :0.0000000 Min. :0.0000
## 1st Qu.:0.0410 1st Qu.:0.0143 1st Qu.:0.0000000 1st Qu.:0.0929
## Median :0.0630 Median :0.0815 Median :0.0000221 Median :0.1270
## Mean :0.1087 Mean :0.1808 Mean :0.0951242 Mean :0.1911
## 3rd Qu.:0.1350 3rd Qu.:0.2670 3rd Qu.:0.0075000 3rd Qu.:0.2480
## Max. :0.9180 Max. :0.9940 Max. :0.9940000 Max. :0.9960
## valence tempo
## Min. :0.0000 Min. : 0.00
## 1st Qu.:0.3250 1st Qu.: 99.98
## Median :0.5070 Median :122.01
## Mean :0.5064 Mean :121.00
## 3rd Qu.:0.6900 3rd Qu.:134.03
## Max. :0.9910 Max. :239.44
normalize <- function(x) {
return((x - min(x))/(max(x) - min(x)))
}
model_norm <- model_data
model_norm$popularity_gp <- NULL
model_norm <- as.data.frame(lapply(model_norm,normalize))
summary(model_norm)
## danceability energy key loudness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.5717 1st Qu.:0.5769 1st Qu.:0.1818 1st Qu.:0.7986
## Median :0.6816 Median :0.7200 Median :0.5455 Median :0.8418
## Mean :0.6650 Mean :0.6965 Mean :0.4888 Mean :0.8300
## 3rd Qu.:0.7731 3rd Qu.:0.8420 3rd Qu.:0.8182 3rd Qu.:0.8743
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## speechiness acousticness instrumentalness liveness
## Min. :0.00000 Min. :0.00000 Min. :0.0000000 Min. :0.00000
## 1st Qu.:0.04466 1st Qu.:0.01439 1st Qu.:0.0000000 1st Qu.:0.09327
## Median :0.06863 Median :0.08199 Median :0.0000222 Median :0.12751
## Mean :0.11845 Mean :0.18194 Mean :0.0956984 Mean :0.19189
## 3rd Qu.:0.14706 3rd Qu.:0.26861 3rd Qu.:0.0075453 3rd Qu.:0.24900
## Max. :1.00000 Max. :1.00000 Max. :1.0000000 Max. :1.00000
## valence tempo
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.3280 1st Qu.:0.4175
## Median :0.5116 Median :0.5096
## Mean :0.5110 Mean :0.5054
## 3rd Qu.:0.6963 3rd Qu.:0.5598
## Max. :1.0000 Max. :1.0000
set.seed(123)
train_idx <- sample(nrow(model_norm), .80*nrow(model_norm))
model_train <- model_norm[train_idx,]
model_test <- model_norm[-train_idx,]
model_train_target <- model_data[train_idx,11]
model_test_target <- model_data[-train_idx,11]
sqrt(nrow(model_data))
## [1] 161.9537
Here, the data is split into traning (80%) and testing (20%). The respective poppularity_gp column is taken as target.
Also, the k-values considered here is 161 which is as per the common rule of taking sqrt of number of rows in the data.
m1 <- knn(train = model_train, test = model_test, cl= model_train_target, k = 161)
levels(m1)
## [1] "Average_Popularity" "Highest_Popularity" "Least_Popularity"
cm = as.matrix(table(model_test_target, m1))
cm
## m1
## model_test_target Average_Popularity Highest_Popularity Least_Popularity
## Average_Popularity 425 870 447
## Highest_Popularity 367 1041 277
## Least_Popularity 415 810 594
sum(diag(cm))/length(model_test_target)
## [1] 0.3926801
Conclusion
Will Rock Music Dominate the Industry in the 2020s
Rock Music had its share of decline in popularity in 2010s. Spotify data shows almost similar trend. Here is a time series analysis especially on Rock music’s popularity over the years and a forecast of it for next two years with respect to the songs present in this dataset.
genre_year = clean_data %>%
filter(track_album_release_year > '1985') %>%
group_by(playlist_genre, track_album_release_year) %>%
summarise(avg_popularity = c(mean(track_popularity)), .groups = 'drop')
genre_year$track_album_release_year <- as.numeric(as.character(genre_year$track_album_release_year))
ggplot(genre_year, aes(x = track_album_release_year, y = avg_popularity)) +
geom_line(aes(color = playlist_genre), size = 1) +
theme_minimal()
rock_popularity <- genre_year %>%
filter(playlist_genre == 'rock' & track_album_release_year > '1975') %>%
select(track_album_release_year, avg_popularity)
ts_data <- rock_popularity$avg_popularity
plot.ts(ts_data)
bc <- BoxCox.ar(ts_data)
bc$lambda[which.max(bc$loglike)]
## [1] 0.7
par(mfrow=c(1,2))
plot.ts(sqrt(ts_data))
plot.ts(log(ts_data))
par(mfrow=c(1,2))
acf(log(ts_data))
pacf(log(ts_data))
eacf(log(ts_data), ar.max = 3)
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x x x x o o o o o o o o o o
## 1 x o o o o o o o o o o o o o
## 2 o o o o o o o o o o o o o o
## 3 x o o o o o o o o o o o o o
adf.test(log(ts_data))
##
## Augmented Dickey-Fuller Test
##
## data: log(ts_data)
## Dickey-Fuller = -1.8444, Lag order = 3, p-value = 0.6335
## alternative hypothesis: stationary
plot.ts(diff(log(ts_data)))
par(mfrow=c(1,2))
acf(diff(log(ts_data)))
pacf(diff(log(ts_data)))
eacf(diff(log(ts_data)), ar.max = 3)
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x o o o o o o o o o o o o o
## 1 o o o o o o o o o o o o o o
## 2 x o o o o o o o o o o o o o
## 3 x o o o o o o o o o o o o o
fit1= arima(ts(log(ts_data)), order = c(1,1,1))
coeftest(fit1)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.29502 0.27348 -1.0788 0.2807
## ma1 -0.42549 0.27312 -1.5579 0.1193
fit1
##
## Call:
## arima(x = ts(log(ts_data)), order = c(1, 1, 1))
##
## Coefficients:
## ar1 ma1
## -0.2950 -0.4255
## s.e. 0.2735 0.2731
##
## sigma^2 estimated as 0.02321: log likelihood = 15.47, aic = -26.94
fit2= arima(ts(log(ts_data)), order = c(0,1,1))
coeftest(fit2)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -0.63340 0.12801 -4.948 7.498e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit2
##
## Call:
## arima(x = ts(log(ts_data)), order = c(0, 1, 1))
##
## Coefficients:
## ma1
## -0.6334
## s.e. 0.1280
##
## sigma^2 estimated as 0.02395: log likelihood = 14.94, aic = -27.88
fit3= arima(ts(log(ts_data)), order = c(1,1,0))
coeftest(fit3)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.57779 0.13766 -4.1972 2.702e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit3
##
## Call:
## arima(x = ts(log(ts_data)), order = c(1, 1, 0))
##
## Coefficients:
## ar1
## -0.5778
## s.e. 0.1377
##
## sigma^2 estimated as 0.02436: log likelihood = 14.71, aic = -27.42
checkresiduals(fit2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)
## Q* = 5.6504, df = 6, p-value = 0.4635
##
## Model df: 1. Total lags used: 7
It is seen that residual plot has constant mean and variance and acf and pacf plot resemble white noise.
Ljung-Box test
Null hypothesis: There is no auto-correlation between residuals.
Alternative hypothesis: There is auto-correlation between residuals.
Also, p value is greater than 0.05 and null hypothesis cannot be rejected here. Hence this is in sync with the acf and pacf which shows that the residuals behave like white noise and have no correlation
ARIMA(0,1,1) model is concluded for this time series
fit= Arima(ts(log(ts_data)), order = c(0,1,1))
plot(forecast(fit, h = 5))
K-means is an unsupervised clustering algorithm that takes in input as numeric values and is used to predict numeric or categorical values by grouping the data into similar clusters. Elbow point graph is used to determine the number clusters to use in the k-means function.
clean_data_kmeans <- clean_data %>% select_if(is.numeric)
clean_data_kmeans <- clean_data_kmeans %>% select(-c(track_popavg,genre_num))
clean_data_kmeans_scaled <- scale(clean_data_kmeans)
elbow_point <- function(data, maxCluster = 9) {
SSw <- (nrow(data) - 1) * sum(apply(data, 2, var))
SSw <- vector()
for (i in 2:maxCluster) {
SSw[i] <- sum(kmeans(data, centers = i)$withinss)
}
plot(1:maxCluster, SSw, type = "o", xlab = "Number of Clusters", ylab = "Within groups sum of squares", pch=19)
}
elbow_point(clean_data_kmeans_scaled)
model1 <- kmeans(clean_data_kmeans_scaled, 6)
model1$size
## [1] 3283 7056 2203 3430 5709 4548
model1$centers
## track_popularity danceability energy key loudness
## 1 0.06460638 0.43113893 -0.182839316 0.056245862 -0.06263231
## 2 0.74533319 0.51651937 0.184624152 -0.043263129 0.37248101
## 3 -0.46163585 0.07279951 0.481009448 -0.002018671 -0.06099165
## 4 0.18290604 -0.43989632 -1.573390857 -0.052654929 -1.28671078
## 5 -0.08956485 -0.92656470 0.639581342 -0.025514041 0.53360344
## 6 -1.00488830 0.34701741 -0.003682848 0.099235415 -0.20254257
## speechiness acousticness instrumentalness liveness valence tempo
## 1 2.1231579 0.05544728 -0.3473924 0.09815776 0.1636627 0.02550604
## 2 -0.2725070 -0.12675818 -0.3459895 -0.19867045 0.4918050 -0.15298820
## 3 -0.3630038 -0.49831433 2.7630316 -0.03172469 -0.5123834 0.15399665
## 4 -0.3544612 1.71303417 0.1175860 -0.23430906 -0.5564637 -0.29857333
## 5 -0.2966919 -0.52378768 -0.2788568 0.51181511 -0.5695988 0.48773471
## 6 -0.2942403 -0.23642141 -0.2894667 -0.21302026 0.5017172 -0.24271706
## duration_ms
## 1 -0.18451500
## 2 -0.32192833
## 3 0.42616101
## 4 -0.10319469
## 5 -0.01669151
## 6 0.52500131
clean_data$cluster <- model1$cluster
fviz_cluster(model1, data=clean_data_kmeans_scaled)
clean_data %>%
group_by(cluster) %>%
summarise_all(mean) %>%
select(cluster, acousticness, danceability, energy, instrumentalness, speechiness, valence, liveness)
## # A tibble: 6 x 8
## cluster acousticness danceability energy instrumentalness speechiness valence
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.193 0.716 0.663 0.0126 0.328 0.545
## 2 2 0.152 0.729 0.731 0.0129 0.0806 0.622
## 3 3 0.0681 0.664 0.785 0.752 0.0712 0.386
## 4 4 0.568 0.590 0.406 0.123 0.0721 0.376
## 5 5 0.0623 0.519 0.815 0.0289 0.0781 0.373
## 6 6 0.127 0.704 0.696 0.0263 0.0783 0.624
## # ... with 1 more variable: liveness <dbl>
Song Recommendation
clean_data %>%
filter(track_name == "The Nights", track_artist == "Avicii") %>%
datatable(options = list(scrollCollapse = TRUE,scrollX = TRUE,
columnDefs = list(list(className = 'dt-center', targets = 1:4))
))
set.seed(123)
clean_data %>%
filter(cluster == 2, playlist_genre == "pop", playlist_subgenre=="dance pop", track_popularity >=75 & track_popularity <= 85) %>%
sample_n(5) %>%
datatable(options = list(scrollCollapse = TRUE,scrollX = TRUE,
columnDefs = list(list(className = 'dt-center', targets = 1:4))
))
Proposal :
We implemented KNN algorithm to group similar songs based on their popularity , however it is concluded that a songs popularity cannot be determined or derived from the numerical features like energy, danceability etc. , it also depends on the artists, genre and listener’s demographics.
We have also implemented k-means to cluster the songs based on their similarity of numeric features and predicted the songs that match the interest of favourite songs by filtering through the same cluster.
Also, Spotify recently acquired SoundBetter, a music production marketplace and is venturing into producing music on its own. Such clustering analysis can help understand on what parameters the future production songs can be concentrated on which will in turn help widening its customer base.
Limitations
There are 70 million tracks on Spotify till date. Our cleaned data has only 26229 tracks. Using such less sample data for prediction comes with its drawback, not to mention the lack of other variables that depend on demography, language, culture and history of customers.