Members

HUEY WEN CHEAM (S2176658)
MELANIE KOH (S2178918)
YANG HAO (S2148678)
ALFIAN ROSI SAPUTRA (S2148301)
LEE WAN JIA (S2135972)
PUSHPANJALI NAGARAJAN (S2129915)

INTRODUCTION

Music recommendation systems are computer programs that use data mining and machine learning techniques to suggest music to users. These systems take into account factors such as a user’s listening history, their preferences, and the listening habits of similar users to suggest new songs, albums, or artists.

The importance of music recommendation systems lies in their ability to help users discover music that they might not otherwise have found. With the vast amount of music available on streaming platforms, it can be difficult for users to find new music that they will enjoy. Recommendation systems help users navigate this content and discover music tailored to their tastes. They can also help artists and record labels promote new music to a wider audience. In addition, predicting the popularity of songs based on their genres can help a music platform match its recommendations more precisely to its users’ tastes, which may in turn attract more listeners and make the platform the first choice for its target audience.

For this assignment, a dataset was chosen from Kaggle to predict the popularity of songs based on their genres. The dataset contains various features of songs, such as duration, track ID, artist names, and a popularity score that ranges from 0 to 100, with 100 being the most popular. The dataset is studied closely in order to build and evaluate machine learning models that predict song popularity from these features. The data science methods used include exploring the data, pre-processing it as necessary, and applying appropriate machine learning techniques to train and evaluate the models. Finally, the findings and conclusions are presented in the form of a written report.

OBJECTIVES

  1. To identify the most important features for predicting song popularity.

  2. To explore the characteristics of different song genres and their relationship with popularity, as measured by the Spotify popularity score.

  3. To build and evaluate machine learning models for the prediction of songs’ popularity.

  4. To propose recommendations for music artists and industry professionals based on the insights gained from the analysis.

DATASET DETAILS

Link to Dataset used: https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset/code?resource=download

The “Spotify Tracks Dataset” from Kaggle is a dataset containing a collection of audio features and metadata for a selection of songs from Spotify. The dataset includes information on the artist, track, and album for each song, as well as a variety of audio features such as tempo, key, and loudness. It also includes a popularity score from 0 to 100 for each track.

Overall, the “Spotify Tracks Dataset” provides a rich and diverse dataset for exploring the relationship between audio features and popularity on Spotify. It offers a range of potential avenues for analysis, including examining the characteristics of different genres, identifying trends in audio features over time, and building machine learning models to predict song popularity.

Load required libraries

library(dplyr) #dataframe
library(tidyr) #tidy data
library(ggplot2) #data visualization
library(reshape) #data restructuring and aggregation
library(reshape2) #data transformation
library(rpart) #build classification and regression trees
library(rpart.plot) #scale trees at best fit
library(caret) #train different algorithms
library(tidyverse) #data presentation
library(skimr) #provide summary statistics
library(xgboost) #xg boost modelling
library(caTools) #split train and test sets
library(Metrics) #supervised machine learning metrics

Load Dataset

data<-read.csv("spotify.csv")

The dataset is read from a CSV file and stored in the dataframe data.

Dataset info

dim(data) #dataset size
## [1] 114000     21

The dataset contains 114000 rows and 21 columns.

glimpse(data)  #view data types
## Rows: 114,000
## Columns: 21
## $ X                <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,~
## $ track_id         <chr> "5SuOikwiRyPMVoIQDJUgSV", "4qPNDBW1i3p13qLCt0Ki3A", "~
## $ artists          <chr> "Gen Hoshino", "Ben Woodward", "Ingrid Michaelson;ZAY~
## $ album_name       <chr> "Comedy", "Ghost (Acoustic)", "To Begin Again", "Craz~
## $ track_name       <chr> "Comedy", "Ghost - Acoustic", "To Begin Again", "Can'~
## $ popularity       <int> 73, 55, 57, 71, 82, 58, 74, 80, 74, 56, 74, 69, 52, 6~
## $ duration_ms      <int> 230666, 149610, 210826, 201933, 198853, 214240, 22940~
## $ explicit         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS~
## $ danceability     <dbl> 0.676, 0.420, 0.438, 0.266, 0.618, 0.688, 0.407, 0.70~
## $ energy           <dbl> 0.4610, 0.1660, 0.3590, 0.0596, 0.4430, 0.4810, 0.147~
## $ key              <int> 1, 1, 0, 0, 2, 6, 2, 11, 0, 1, 8, 4, 7, 3, 2, 4, 2, 1~
## $ loudness         <dbl> -6.746, -17.235, -9.734, -18.515, -9.681, -8.807, -8.~
## $ mode             <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0,~
## $ speechiness      <dbl> 0.1430, 0.0763, 0.0557, 0.0363, 0.0526, 0.1050, 0.035~
## $ acousticness     <dbl> 0.0322, 0.9240, 0.2100, 0.9050, 0.4690, 0.2890, 0.857~
## $ instrumentalness <dbl> 1.01e-06, 5.56e-06, 0.00e+00, 7.07e-05, 0.00e+00, 0.0~
## $ liveness         <dbl> 0.3580, 0.1010, 0.1170, 0.1320, 0.0829, 0.1890, 0.091~
## $ valence          <dbl> 0.7150, 0.2670, 0.1200, 0.1430, 0.1670, 0.6660, 0.076~
## $ tempo            <dbl> 87.917, 77.489, 76.332, 181.740, 119.949, 98.017, 141~
## $ time_signature   <int> 4, 4, 4, 3, 4, 4, 3, 4, 4, 4, 4, 3, 4, 4, 4, 3, 4, 4,~
## $ track_genre      <chr> "acoustic", "acoustic", "acoustic", "acoustic", "acou~

Checking out the data types of attributes in the dataset.

head(data)
#tail(data)

Viewing the head of the dataset.

View column names

ls(data)
##  [1] "acousticness"     "album_name"       "artists"          "danceability"    
##  [5] "duration_ms"      "energy"           "explicit"         "instrumentalness"
##  [9] "key"              "liveness"         "loudness"         "mode"            
## [13] "popularity"       "speechiness"      "tempo"            "time_signature"  
## [17] "track_genre"      "track_id"         "track_name"       "valence"         
## [21] "X"

Viewing the list of columns.

All the column names with descriptions are shown below.

DATA CLEANING

Missing value analysis - check which columns have missing values

is.null(data)
## [1] FALSE

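is.null() only reports whether the object data itself is NULL. FALSE here confirms that the dataframe was loaded, not that it is free of missing values.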

Checking for Null values in dataset

sum(is.null(data))
## [1] 0

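sum(is.null(data)) returns 0 because data is not NULL. This does not test for missing entries, which R represents as NA and which are checked with is.na() next.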
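
The difference between is.null() and is.na() can be seen on a small example. The following is a minimal sketch on an invented dataframe (toy and its values are hypothetical and are not taken from the Spotify data).

```r
# Hypothetical toy dataframe with two missing entries (not the Spotify data)
toy <- data.frame(popularity = c(73, NA, 57),
                  mode       = c(0, 1, NA))

# is.null() asks whether the object itself is NULL, so it stays FALSE
# even though the dataframe contains missing values
null_check <- is.null(toy)

# is.na() flags each missing cell; summing counts them overall and per column
na_total  <- sum(is.na(toy))
na_by_col <- colSums(is.na(toy))

print(null_check)
print(na_total)
print(na_by_col)
```

Only the is.na() based checks respond to the missing cells, which is why they are the ones to rely on when looking for missing values.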

Check which columns have NA values

sum(is.na(data))
## [1] 9
colSums(is.na(data))
##                X         track_id          artists       album_name 
##                0                0                0                0 
##       track_name       popularity      duration_ms         explicit 
##                0                3                0                0 
##     danceability           energy              key         loudness 
##                0                0                0                0 
##             mode      speechiness     acousticness instrumentalness 
##                2                0                0                4 
##         liveness          valence            tempo   time_signature 
##                0                0                0                0 
##      track_genre 
##                0
names(which(colSums(is.na(data))>0))
## [1] "popularity"       "mode"             "instrumentalness"

Nine NA values are present in the dataset: 3 in popularity, 2 in mode and 4 in instrumentalness.

Check for duplicated entry

data[duplicated(data), ] #check for duplicated rows
sum(duplicated(data)) #count duplicated rows
## [1] 0

No duplicated values present in the dataset.
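
One caveat may be worth keeping in mind: duplicated() compares whole rows, and the X column looks like a running row index, so two rows can only match if they share the same index. The sketch below uses an invented dataframe (toy is hypothetical) to show how a whole-row check and a check on a single key column such as track_id can differ.

```r
# Hypothetical toy dataframe: the same track_id appears twice,
# but the index column X makes every row unique as a whole
toy <- data.frame(X          = 0:2,
                  track_id   = c("a", "b", "a"),
                  popularity = c(10, 20, 10))

# Whole-row comparison: no row repeats because X always differs
n_row_dups <- sum(duplicated(toy))

# Key-column comparison: the second "a" is flagged as a repeat
n_id_dups <- sum(duplicated(toy$track_id))

# Keep only the first occurrence of each track_id
toy_unique <- toy[!duplicated(toy$track_id), ]

print(n_row_dups)
print(n_id_dups)
```

Whether repeated track IDs should count as duplicates depends on the analysis, for example when the same track is listed under several genres.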

Missing value correlation analysis

data1<- data %>% select(popularity, mode, instrumentalness)
cor(data1, data1,  method = "pearson", use = "complete.obs")
##                   popularity        mode instrumentalness
## popularity        1.00000000 -0.01401124      -0.09511166
## mode             -0.01401124  1.00000000      -0.04990568
## instrumentalness -0.09511166 -0.04990568       1.00000000

Visualization of Missing value correlation analysis

heatmap(cor(data1, data1,  method = "pearson", use = "complete.obs"))

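It can be seen that the missing values are negligible relative to the overall dataset: only 9 values are missing across 114,000 rows, and they are confined to popularity, mode and instrumentalness. The pairwise correlations among these three columns are all weak (absolute value below 0.1).

Hence, the rows with missing values can be deleted directly.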

Delete those rows with missing values

data<- data %>% drop_na()
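
Called without column arguments, drop_na() keeps only the rows that are complete in every column. A base R equivalent using complete.cases() is sketched below on an invented dataframe (toy is hypothetical), together with a count of the rows that were dropped.

```r
# Hypothetical toy dataframe with missing values in two different rows
toy <- data.frame(popularity = c(73, NA, 57, 71),
                  mode       = c(0, 1, NA, 1))

rows_before <- nrow(toy)

# complete.cases() is TRUE for rows without any NA
toy_clean <- toy[complete.cases(toy), ]

# Number of rows removed by the filter
rows_dropped <- rows_before - nrow(toy_clean)

print(rows_dropped)
print(anyNA(toy_clean))
```

Comparing the row count before and after is a quick sanity check that only the expected number of rows was removed.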

View dataset info after cleaning

dim(data)
## [1] 113991     21
tail(data)

The dataset contains 113991 rows and 21 columns after the 9 rows containing NA values were removed.

colSums(is.na(data))
##                X         track_id          artists       album_name 
##                0                0                0                0 
##       track_name       popularity      duration_ms         explicit 
##                0                0                0                0 
##     danceability           energy              key         loudness 
##                0                0                0                0 
##             mode      speechiness     acousticness instrumentalness 
##                0                0                0                0 
##         liveness          valence            tempo   time_signature 
##                0                0                0                0 
##      track_genre 
##                0
sum(is.na(data))
## [1] 0

The dataset is now clean: no column contains NA values, and every row is a complete case.

Transform explicit variable from categorical data into binary data

data$explicitUpdated<-ifelse(data$explicit =="FALSE", "0", "1")
data$explicitUpdated <- as.factor(data$explicitUpdated)
class(data$explicitUpdated)
## [1] "factor"
levels(data$explicitUpdated)
## [1] "0" "1"

The explicit attribute is recoded as 0 when it is FALSE and 1 when it is TRUE.

Transform track_genre to 5 parent_genre

data<-data %>%
  mutate(
    parent_genre= case_when(
      track_genre =="anime" | track_genre =="brazil" | track_genre =="british" | track_genre =="cantopop" | track_genre =="j-pop" | track_genre =="k-pop" | track_genre =="indie-pop" | track_genre =="malay" | track_genre =="mandopop" | track_genre =="power-pop" | track_genre =="pop-film" | track_genre =="pop" | track_genre =="synth-pop" | track_genre =="trip-hop" | track_genre =="j-idol" | track_genre =="dance" | track_genre =="salsa" | track_genre =="samba" | track_genre =="dancehall" | track_genre =="forro" | track_genre =="garage" | track_genre =="j-dance" | track_genre =="party" | track_genre =="tango" | track_genre =="happy" | track_genre =="disco" | track_genre =="afrobeat" | track_genre =="comedy" | track_genre =="hip-hop" | track_genre =="latin" | track_genre =="latino" | track_genre =="world-music" | track_genre =="mpb" ~ "Pop",
      track_genre =="blues" | track_genre =="dub" | track_genre =="dubstep" | track_genre =="r-n-b" | track_genre =="reggae" | track_genre =="reggaeton" | track_genre =="soul" | track_genre =="ska" ~ "R & B",
      track_genre =="alt-rock" | track_genre =="alternative" | track_genre =="emo" | track_genre =="funk" | track_genre =="hard-rock"| track_genre =="psych-rock"| track_genre =="punk-rock"| track_genre =="rock-n-roll" | track_genre =="rock" | track_genre =="rockabilly" | track_genre =="j-rock" | track_genre =="goth" | track_genre =="groove" | track_genre =="grunge" | track_genre =="indie" | track_genre =="iranian" | track_genre =="new-age" | track_genre =="black-metal" | track_genre =="death-metal" | track_genre =="heavy-metal" | track_genre =="metal" | track_genre =="metalcore" | track_genre =="grindcore" | track_genre =="punk" | track_genre =="hardcore" | track_genre =="hardstyle" | track_genre =="industrial" ~ "Rock",
      track_genre =="edm" | track_genre =="electro" | track_genre =="electronic" | track_genre =="chicago-house" | track_genre =="breakbeat" | track_genre =="club" | track_genre =="deep-house" | track_genre =="detroit-techno" | track_genre =="minimal-techno" | track_genre =="techno" | track_genre =="progressive-house" | track_genre =="trance" | track_genre =="drum-and-bass" | track_genre =="house" | track_genre =="idm" ~ "EDM",
      track_genre =="acoustic" | track_genre =="folk" | track_genre =="french" | track_genre =="german" | track_genre =="guitar" | track_genre =="sertanejo" | track_genre =="study" | track_genre =="sad" | track_genre =="singer-songwriter" | track_genre =="songwriter" | track_genre =="swedish" | track_genre =="turkish" | track_genre =="gospel" | track_genre =="children" | track_genre =="ambient" | track_genre =="chill" | track_genre =="classical" | track_genre =="disney" | track_genre =="kids" | track_genre =="opera" | track_genre =="jazz" | track_genre =="piano" | track_genre =="romance" | track_genre =="indian" | track_genre =="show-tunes" | track_genre =="sleep" | track_genre =="spanish" | track_genre =="bluegrass" | track_genre == "country" | track_genre =="honky-tonk" | track_genre =="pagode" ~ "Classical"
            )
    )
    
colSums(is.na(data))
##                X         track_id          artists       album_name 
##                0                0                0                0 
##       track_name       popularity      duration_ms         explicit 
##                0                0                0                0 
##     danceability           energy              key         loudness 
##                0                0                0                0 
##             mode      speechiness     acousticness instrumentalness 
##                0                0                0                0 
##         liveness          valence            tempo   time_signature 
##                0                0                0                0 
##      track_genre  explicitUpdated     parent_genre 
##                0                0                0
data$parent_genre<- as.factor(data$parent_genre)
data$parent_genre_class<- as.numeric(data$parent_genre)

The track_genre attribute is mapped to parent_genre, which has 5 categories (Classical, EDM, Pop, R & B, Rock). The factor is then converted into numeric codes 1 to 5, following the alphabetical order of the levels.
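
A long chain of == comparisons can be hard to maintain. As an alternative, the same kind of mapping can be sketched with a lookup table. The genre lists below are abbreviated and hypothetical, chosen only to illustrate the idea; they are not the full mapping used in this report.

```r
# Hypothetical, abbreviated lookup: parent genre -> member track genres
genre_map <- list(
  Pop       = c("pop", "k-pop", "dance"),
  Rock      = c("rock", "metal", "punk"),
  Classical = c("classical", "piano", "jazz")
)

# Flatten into a named vector: names are track genres, values are parent genres
lookup <- setNames(rep(names(genre_map), lengths(genre_map)),
                   unlist(genre_map, use.names = FALSE))

track_genre <- c("k-pop", "metal", "piano", "pop", "polka")

# Index the lookup by genre name; genres missing from the map become NA
parent_genre <- unname(lookup[track_genre])

print(parent_genre)
```

An unmapped genre shows up as NA, so a colSums(is.na()) check after the mapping still catches any genre that was left out.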

Convert all attributes into numerical data

data$popularity<- as.numeric(data$popularity)
data$duration_ms<- as.numeric(data$duration_ms)
data$explicitUpdated<- as.numeric(data$explicitUpdated) #returns factor codes: 1 = not explicit, 2 = explicit
data$danceability<- as.numeric(data$danceability)
data$energy<- as.numeric(data$energy)
data$key<- as.numeric(data$key)
data$loudness<- as.numeric(data$loudness)
data$mode<- as.numeric(data$mode)
data$speechiness<- as.numeric(data$speechiness)
data$acousticness<- as.numeric(data$acousticness)
data$instrumentalness<- as.numeric(data$instrumentalness)
data$liveness<- as.numeric(data$liveness)
data$valence<- as.numeric(data$valence)
data$tempo<- as.numeric(data$tempo)
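
One detail in the conversion above is that as.numeric() applied to a factor returns the internal level codes rather than the labels, which is why explicitUpdated has a minimum of 1 and a maximum of 2 in the later summary instead of 0 and 1. The sketch below, on invented values, shows the difference and two ways to obtain 0/1 directly.

```r
# Hypothetical logical vector standing in for the explicit column
explicit <- c(FALSE, TRUE, FALSE)

# Same recoding as in the report: "0"/"1" labels stored as a factor
explicit_factor <- as.factor(ifelse(explicit, "1", "0"))

# as.numeric() on a factor gives the level codes (1 for "0", 2 for "1")
codes <- as.numeric(explicit_factor)

# Going through as.character() recovers the label values 0 and 1
values <- as.numeric(as.character(explicit_factor))

# A logical vector can also be converted to 0/1 in one step
direct <- as.integer(explicit)

print(codes)
print(values)
print(direct)
```

Either encoding carries the same information, but a 0/1 column matches the description given for explicitUpdated.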

Transform popularity attribute into 4 categories

#0-24 Not Popular, 25-50 Less Popular, 51-75 Popular, 76-100 Most Popular
data$popularity_class<-ifelse(data$popularity < 25, "Not Popular",
                              ifelse(data$popularity <= 50, "Less Popular",
                                     ifelse(data$popularity <= 75, "Popular", "Most Popular")))

# data$popularity_class<-ifelse(data$popularity <25, "Not Popular", 
#                               ifelse(data$popularity >= 25 & data$popularity <= 50, "Popular", "Most Popular"))

class(data$popularity_class)
## [1] "character"
data$popularity_class <- as.factor(data$popularity_class)
class(data$popularity_class)
## [1] "factor"
levels(data$popularity_class)
## [1] "Less Popular" "Most Popular" "Not Popular"  "Popular"
data$popularity_class<- factor((data$popularity_class), levels = c("Not Popular", "Less Popular", "Popular", "Most Popular"))
levels(data$popularity_class)
## [1] "Not Popular"  "Less Popular" "Popular"      "Most Popular"
data$popularity_class<- as.numeric(data$popularity_class)

The popularity attribute is converted into popularity_class with 4 categories (Not Popular, Less Popular, Popular, Most Popular), and the factor is then converted into numeric codes 1 to 4.
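
The same binning can also be expressed with cut(), which builds an ordered set of levels in one step. This is a minimal sketch on invented scores, assuming popularity takes whole-number values as it does in this dataset.

```r
# Hypothetical popularity scores covering every class boundary
popularity <- c(0, 24, 25, 50, 51, 75, 76, 100)

# Intervals are right-closed by default: (-Inf,24], (24,50], (50,75], (75,Inf]
popularity_class <- cut(popularity,
                        breaks = c(-Inf, 24, 50, 75, Inf),
                        labels = c("Not Popular", "Less Popular",
                                   "Popular", "Most Popular"))

# Levels already follow the label order, so the codes run from 1 to 4
class_codes <- as.integer(popularity_class)

print(levels(popularity_class))
print(class_codes)
```

Because the labels are supplied in order, no separate re-levelling step is needed before converting to numeric codes.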

Display summary of dataset

str(data)
## 'data.frame':    113991 obs. of  25 variables:
##  $ X                 : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ track_id          : chr  "5SuOikwiRyPMVoIQDJUgSV" "4qPNDBW1i3p13qLCt0Ki3A" "1iJBSr7s7jYXzM8EGcbK5b" "6lfxq3CG4xtTiEg7opyCyx" ...
##  $ artists           : chr  "Gen Hoshino" "Ben Woodward" "Ingrid Michaelson;ZAYN" "Kina Grannis" ...
##  $ album_name        : chr  "Comedy" "Ghost (Acoustic)" "To Begin Again" "Crazy Rich Asians (Original Motion Picture Soundtrack)" ...
##  $ track_name        : chr  "Comedy" "Ghost - Acoustic" "To Begin Again" "Can't Help Falling In Love" ...
##  $ popularity        : num  73 55 57 71 82 58 74 80 74 56 ...
##  $ duration_ms       : num  230666 149610 210826 201933 198853 ...
##  $ explicit          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ danceability      : num  0.676 0.42 0.438 0.266 0.618 0.688 0.407 0.703 0.625 0.442 ...
##  $ energy            : num  0.461 0.166 0.359 0.0596 0.443 0.481 0.147 0.444 0.414 0.632 ...
##  $ key               : num  1 1 0 0 2 6 2 11 0 1 ...
##  $ loudness          : num  -6.75 -17.23 -9.73 -18.52 -9.68 ...
##  $ mode              : num  0 1 1 1 1 1 1 1 1 1 ...
##  $ speechiness       : num  0.143 0.0763 0.0557 0.0363 0.0526 0.105 0.0355 0.0417 0.0369 0.0295 ...
##  $ acousticness      : num  0.0322 0.924 0.21 0.905 0.469 0.289 0.857 0.559 0.294 0.426 ...
##  $ instrumentalness  : num  1.01e-06 5.56e-06 0.00 7.07e-05 0.00 0.00 2.89e-06 0.00 0.00 4.19e-03 ...
##  $ liveness          : num  0.358 0.101 0.117 0.132 0.0829 0.189 0.0913 0.0973 0.151 0.0735 ...
##  $ valence           : num  0.715 0.267 0.12 0.143 0.167 0.666 0.0765 0.712 0.669 0.196 ...
##  $ tempo             : num  87.9 77.5 76.3 181.7 119.9 ...
##  $ time_signature    : int  4 4 4 3 4 4 3 4 4 4 ...
##  $ track_genre       : chr  "acoustic" "acoustic" "acoustic" "acoustic" ...
##  $ explicitUpdated   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ parent_genre      : Factor w/ 5 levels "Classical","EDM",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ parent_genre_class: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ popularity_class  : num  3 3 3 3 4 3 3 4 3 3 ...
summary(data)
##        X            track_id           artists           album_name       
##  Min.   :     0   Length:113991      Length:113991      Length:113991     
##  1st Qu.: 28503   Class :character   Class :character   Class :character  
##  Median : 57002   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 57001                                                           
##  3rd Qu.: 85500                                                           
##  Max.   :113999                                                           
##   track_name          popularity      duration_ms       explicit      
##  Length:113991      Min.   :  0.00   Min.   :      0   Mode :logical  
##  Class :character   1st Qu.: 17.00   1st Qu.: 174066   FALSE:104245   
##  Mode  :character   Median : 35.00   Median : 212906   TRUE :9746     
##                     Mean   : 33.24   Mean   : 228030                  
##                     3rd Qu.: 50.00   3rd Qu.: 261506                  
##                     Max.   :100.00   Max.   :5237295                  
##   danceability        energy            key            loudness      
##  Min.   :0.0000   Min.   :0.0000   Min.   : 0.000   Min.   :-49.531  
##  1st Qu.:0.4560   1st Qu.:0.4720   1st Qu.: 2.000   1st Qu.:-10.013  
##  Median :0.5800   Median :0.6850   Median : 5.000   Median : -7.004  
##  Mean   :0.5668   Mean   :0.6414   Mean   : 5.309   Mean   : -8.259  
##  3rd Qu.:0.6950   3rd Qu.:0.8540   3rd Qu.: 8.000   3rd Qu.: -5.003  
##  Max.   :0.9850   Max.   :1.0000   Max.   :11.000   Max.   :  4.532  
##       mode         speechiness       acousticness    instrumentalness  
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00e+00  
##  1st Qu.:0.0000   1st Qu.:0.03590   1st Qu.:0.0169   1st Qu.:0.00e+00  
##  Median :1.0000   Median :0.04890   Median :0.1690   Median :4.16e-05  
##  Mean   :0.6376   Mean   :0.08466   Mean   :0.3149   Mean   :1.56e-01  
##  3rd Qu.:1.0000   3rd Qu.:0.08450   3rd Qu.:0.5980   3rd Qu.:4.90e-02  
##  Max.   :1.0000   Max.   :0.96500   Max.   :0.9960   Max.   :1.00e+00  
##     liveness         valence           tempo        time_signature 
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :0.000  
##  1st Qu.:0.0980   1st Qu.:0.2600   1st Qu.: 99.22   1st Qu.:4.000  
##  Median :0.1320   Median :0.4640   Median :122.02   Median :4.000  
##  Mean   :0.2136   Mean   :0.4741   Mean   :122.15   Mean   :3.904  
##  3rd Qu.:0.2730   3rd Qu.:0.6830   3rd Qu.:140.07   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :0.9950   Max.   :243.37   Max.   :5.000  
##  track_genre        explicitUpdated    parent_genre   parent_genre_class
##  Length:113991      Min.   :1.000   Classical:30995   Min.   :1.000     
##  Class :character   1st Qu.:1.000   EDM      :15000   1st Qu.:1.000     
##  Mode  :character   Median :1.000   Pop      :32998   Median :3.000     
##                     Mean   :1.085   R & B    : 8000   Mean   :2.868     
##                     3rd Qu.:1.000   Rock     :26998   3rd Qu.:4.000     
##                     Max.   :2.000                     Max.   :5.000     
##  popularity_class
##  Min.   :1.000   
##  1st Qu.:1.000   
##  Median :2.000   
##  Mean   :1.889   
##  3rd Qu.:2.000   
##  Max.   :4.000

This provides a statistical summary of the dataset.

DATA NORMALIZATION

Combine the attributes selected for normalization and store them in data_trans

data_trans <- subset(data, select = c(key, loudness, speechiness, instrumentalness, liveness, tempo, duration_ms))
head(data_trans)

Seven attributes were selected and placed into the subset data_trans for normalization.

Normalize the selected attributes using min-max scaling, then combine them with the remaining attributes to form a new dataframe for the cleaned dataset

data_norm <- as.data.frame(apply(data_trans, 2, function(x) (x-min(x))/(max(x)-min(x))))

data_norm$danceability <- data$danceability
data_norm$energy <- data$energy
data_norm$acousticness <- data$acousticness
data_norm$valence <- data$valence
data_norm$time_signature <- data$time_signature
data_norm$parent_genre_class <- data$parent_genre_class
data_norm$popularity_class <- data$popularity_class
head(data_norm)

The key, loudness, speechiness, instrumentalness, liveness, tempo and duration_ms attributes undergo min-max normalization, which rescales each attribute to the range 0 to 1, and are stored in a new dataframe data_norm. The other attributes, which are not normalized, are copied into the new dataframe as well.
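
Min-max scaling maps each value x to (x - min(x)) / (max(x) - min(x)). A small worked sketch follows. The helper min_max and the dataframe toy are hypothetical names introduced for illustration, and the guard for constant columns is an addition that the report's one-liner does not have.

```r
# Hypothetical helper implementing min-max scaling for one numeric vector
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # A constant column would give 0/0, so return zeros instead
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

# Hypothetical toy columns on very different scales
toy <- data.frame(tempo    = c(60, 120, 180),
                  loudness = c(-20, -10, 0))

# Apply the helper column by column and keep a dataframe
toy_norm <- as.data.frame(lapply(toy, min_max))

print(toy_norm)
```

After scaling, both columns run from 0 to 1, so neither dominates purely because of its original units.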

EXPLORATORY DATA ANALYSIS (EDA)

EDA is the process of investigating the selected dataset to gain further insights.

ggplot(data, aes(x = popularity_class)) + geom_histogram(binwidth = 1) #one bar per class

The above histogram visualizes the distribution of tracks across the 4 popularity categories.
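
The class distribution can also be read off as counts and proportions, which makes any imbalance between the categories explicit. A minimal sketch on invented class codes (not the report's data):

```r
# Hypothetical class codes (1 = Not Popular ... 4 = Most Popular)
popularity_class <- c(1, 1, 2, 2, 2, 3, 4, 1)

# Number of tracks per class
counts <- table(popularity_class)

# Share of tracks per class; the shares sum to 1
shares <- prop.table(counts)

print(counts)
print(shares)
```

On the real data the same two calls would be applied to data$popularity_class.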

Generate plots for all attributes in data_norm

for (col in seq_len(ncol(data_norm))) { #start at column 1 so that key is included
    hist(data_norm[,col], main=colnames(data_norm)[col],  xlab=colnames(data_norm)[col])
}

Generate correlation analysis and plot heatmap using data_norm

ht_cor <- cor(data_norm)
heatmap(ht_cor, scale="none", col = terrain.colors(256)) #correlations already share one scale

Here it shows the correlation among the attributes in data_norm.
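
A heatmap gives an overview, but a ranked list makes it easier to see which attributes relate most strongly to the target. The sketch below uses the built-in mtcars data as a stand-in, with mpg playing the role of the target. For this report the matrix would be cor(data_norm) and the target popularity_class.

```r
# Stand-in data: mtcars ships with R; mpg plays the role of the target
cor_mat <- cor(mtcars)
target  <- "mpg"

# Correlation of every other column with the target
others <- setdiff(colnames(cor_mat), target)
cor_with_target <- cor_mat[others, target]

# Rank by absolute correlation, strongest first
ranked <- sort(abs(cor_with_target), decreasing = TRUE)

print(round(ranked, 3))
```

Sorting on the absolute value keeps strongly negative relationships near the top instead of at the bottom of the list.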

Relation between danceability and popularity

ggplot(data, aes(x = danceability, y = popularity)) +
   geom_point(alpha = 0.1) +
   geom_density_2d()

From the scatter plot, higher danceability scores appear to be associated with higher popularity scores.

Relation between energy and popularity

par(mfrow = c(1, 2)) #two base plots side by side

# Draw the first subgraph
plot(data$energy[1:1000], data$popularity[1:1000])

# Draw the second subgraph
plot(data$energy[301:600], data$popularity[301:600])

ggplot(data, aes(x = energy, y = popularity)) +
   geom_point(alpha = 0.1) +
   geom_density_2d()

From the plots, energy appears to be negatively correlated with popularity.
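
A visual impression of this kind can be backed up with a correlation coefficient and a significance test. The sketch below runs on simulated values (energy and popularity here are generated, not taken from the dataset, and the seed is arbitrary), so it only demonstrates the calls.

```r
# Simulated stand-in data with a built-in negative trend (not the Spotify data)
set.seed(1)  # arbitrary seed so the simulation is repeatable
energy     <- runif(200)
popularity <- 50 - 10 * energy + rnorm(200, sd = 15)

# Pearson correlation coefficient between the two variables
pearson <- cor(energy, popularity)

# cor.test() adds a p-value and a confidence interval for the coefficient
result <- cor.test(energy, popularity, method = "pearson")

print(pearson)
print(result$p.value)
print(result$conf.int)
```

On the real data, the two vectors would be data$energy and data$popularity.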

DATA MODELLING - EXTREME GRADIENT BOOSTING (XG BOOST)

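XGBoost is a supervised machine learning technique that can be used to solve both classification and regression problems. In this project, XGBoost is used to build the prediction model. The data is split into a 70% training set and a 30% testing set, and the model is trained with a maximum tree depth of 6 for 500 boosting rounds. Because no objective is specified, xgb.train() uses its default squared-error regression objective on the numeric popularity_class labels, which is why the training log reports RMSE.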
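
A random split gives different rows, and therefore slightly different error values, on every run unless a seed is fixed first. The sketch below shows a seeded 70/30 split in base R that samples within each class. The labels y and the seed value are invented for illustration, and this is a stand-in for the idea rather than a reproduction of what createDataPartition() does internally.

```r
# Arbitrary seed, chosen only so that the split is repeatable
set.seed(42)

# Hypothetical class labels with unequal class sizes
y <- rep(1:4, times = c(40, 30, 20, 10))

# Sample 70% of the row indices within each class
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

# Every remaining row goes to the test set
test_idx <- setdiff(seq_along(y), train_idx)

print(table(y[train_idx]))
print(table(y[test_idx]))
```

Sampling within each class keeps the class proportions of the training and testing sets close to those of the full data.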

The skimr summary below gives another view of the dataframe, with one row of summary statistics per variable.

skim(data_norm) #skim_to_wide() is deprecated in skimr 2.x in favour of skim()
Data summary

Name                    Piped data
Number of rows          113991
Number of columns       14
Column type frequency:
  numeric               14
Group variables         None

Variable type: numeric

skim_variable       n_missing  complete_rate  mean  sd    p0  p25   p50   p75   p100  hist
key                 0          1              0.48  0.32  0   0.18  0.45  0.73  1.00  ▇▃▃▅▆
loudness            0          1              0.76  0.09  0   0.73  0.79  0.82  1.00  ▁▁▁▇▆
speechiness         0          1              0.09  0.11  0   0.04  0.05  0.09  1.00  ▇▁▁▁▁
instrumentalness    0          1              0.16  0.31  0   0.00  0.00  0.05  1.00  ▇▁▁▁▁
liveness            0          1              0.21  0.19  0   0.10  0.13  0.27  1.00  ▇▃▁▁▁
tempo               0          1              0.50  0.12  0   0.41  0.50  0.58  1.00  ▁▃▇▃▁
duration_ms         0          1              0.04  0.02  0   0.03  0.04  0.05  1.00  ▇▁▁▁▁
danceability        0          1              0.57  0.17  0   0.46  0.58  0.70  0.98  ▁▃▇▇▂
energy              0          1              0.64  0.25  0   0.47  0.69  0.85  1.00  ▂▃▅▆▇
acousticness        0          1              0.31  0.33  0   0.02  0.17  0.60  1.00  ▇▂▂▂▂
valence             0          1              0.47  0.26  0   0.26  0.46  0.68  1.00  ▆▇▇▇▅
time_signature      0          1              3.90  0.43  0   4.00  4.00  4.00  5.00  ▁▁▁▇▁
parent_genre_class  0          1              2.87  1.49  1   1.00  3.00  4.00  5.00  ▇▃▇▂▆
popularity_class    0          1              1.89  0.82  1   1.00  2.00  2.00  4.00  ▇▇▁▅▁

index <- createDataPartition(y = data_norm$popularity_class, 
                             p = 0.7, 
                             list = FALSE)
train <- data_norm[index, ]
test <- data_norm[-index, ]
rm(index)

#training predictor and target variables
train_x <- data.matrix(train[, -14])
train_y <- train[, 14]

#testing predictor and target variables
test_x <- data.matrix(test[, -14])
test_y <- test[, 14]

#finalized training and testing sets
xgb_train <- xgb.DMatrix(data = train_x, label = train_y)
xgb_test <- xgb.DMatrix(data = test_x, label = test_y)
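
Since popularity_class is a category, the task could also be framed as multi-class classification rather than regression. The following is a minimal sketch of that alternative, not the configuration used in this report. It uses the built-in iris data as a stand-in (3 classes instead of 4) and assumes an xgboost version in which xgb.train() accepts a params list. For the report's data the label would be popularity_class - 1 with num_class = 4.

```r
library(xgboost)

# Stand-in data: iris has 3 classes; the report's target has 4
x <- data.matrix(iris[, 1:4])

# Multi-class objectives expect labels numbered 0 to num_class - 1
y <- as.integer(iris$Species) - 1

dtrain <- xgb.DMatrix(data = x, label = y)

# multi:softmax makes predict() return a class id for each row
params <- list(objective = "multi:softmax",
               num_class = 3,
               max_depth = 6)

clf <- xgb.train(params = params, data = dtrain, nrounds = 20, verbose = 0)

# Predicted class ids for the training rows and their accuracy
pred <- predict(clf, dtrain)
train_accuracy <- mean(pred == y)

print(train_accuracy)
```

With a classification objective, accuracy or a confusion matrix becomes the natural evaluation measure in place of RMSE.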

Track the training and testing RMSE over the 500 iterations using the watchlist parameter

#define watchlist
list_created <- list(train = xgb_train, test = xgb_test)

#train the model and print the train and test RMSE for each round
model <- xgb.train(data = xgb_train, max.depth = 6, watchlist = list_created, nrounds = 500)
## [1]  train-rmse:1.262060 test-rmse:1.262898 
## [2]  train-rmse:1.045740 test-rmse:1.047999 
## [3]  train-rmse:0.919032 test-rmse:0.922583 
## [4]  train-rmse:0.849011 test-rmse:0.853962 
## [5]  train-rmse:0.810936 test-rmse:0.816887 
## [6]  train-rmse:0.790795 test-rmse:0.797892 
## [7]  train-rmse:0.779530 test-rmse:0.787353 
## [8]  train-rmse:0.771924 test-rmse:0.780657 
## [9]  train-rmse:0.768408 test-rmse:0.777589 
## [10] train-rmse:0.764070 test-rmse:0.774280 
## [11] train-rmse:0.761431 test-rmse:0.772373 
## [12] train-rmse:0.758396 test-rmse:0.770211 
## [13] train-rmse:0.757233 test-rmse:0.769337 
## [14] train-rmse:0.754126 test-rmse:0.767353 
## [15] train-rmse:0.753081 test-rmse:0.766869 
## [16] train-rmse:0.750863 test-rmse:0.765633 
## [17] train-rmse:0.747719 test-rmse:0.763334 
## [18] train-rmse:0.744925 test-rmse:0.761716 
## [19] train-rmse:0.742564 test-rmse:0.760139 
## [20] train-rmse:0.741703 test-rmse:0.759896 
## [21] train-rmse:0.741138 test-rmse:0.759909 
## [22] train-rmse:0.740453 test-rmse:0.759514 
## [23] train-rmse:0.740068 test-rmse:0.759452 
## [24] train-rmse:0.738553 test-rmse:0.758864 
## [25] train-rmse:0.736980 test-rmse:0.758269 
## [26] train-rmse:0.734760 test-rmse:0.756929 
## [27] train-rmse:0.733314 test-rmse:0.755964 
## [28] train-rmse:0.731110 test-rmse:0.754760 
## [29] train-rmse:0.730318 test-rmse:0.754367 
## [30] train-rmse:0.729589 test-rmse:0.754112 
## [31] train-rmse:0.727748 test-rmse:0.753035 
## [32] train-rmse:0.725804 test-rmse:0.751984 
## [33] train-rmse:0.724271 test-rmse:0.751302 
## [34] train-rmse:0.722631 test-rmse:0.750465 
## [35] train-rmse:0.722070 test-rmse:0.750194 
## [36] train-rmse:0.721262 test-rmse:0.749810 
## [37] train-rmse:0.719479 test-rmse:0.748808 
## [38] train-rmse:0.717734 test-rmse:0.747887 
## [39] train-rmse:0.716687 test-rmse:0.747389 
## [40] train-rmse:0.716036 test-rmse:0.747142 
## [41] train-rmse:0.713945 test-rmse:0.745939 
## [42] train-rmse:0.713165 test-rmse:0.745780 
## [43] train-rmse:0.712076 test-rmse:0.745403 
## [44] train-rmse:0.711685 test-rmse:0.745293 
## [45] train-rmse:0.709966 test-rmse:0.744514 
## [46] train-rmse:0.707800 test-rmse:0.743378 
## [47] train-rmse:0.706437 test-rmse:0.742704 
## [48] train-rmse:0.705346 test-rmse:0.742225 
## [49] train-rmse:0.704203 test-rmse:0.741628 
## [50] train-rmse:0.703532 test-rmse:0.741541 
## [51] train-rmse:0.702892 test-rmse:0.741376 
## [52] train-rmse:0.701074 test-rmse:0.740563 
## [53] train-rmse:0.700143 test-rmse:0.740093 
## [54] train-rmse:0.698677 test-rmse:0.739238 
## [55] train-rmse:0.697434 test-rmse:0.739067 
## [56] train-rmse:0.696342 test-rmse:0.738803 
## [57] train-rmse:0.695108 test-rmse:0.738319 
## [58] train-rmse:0.694700 test-rmse:0.738334 
## [59] train-rmse:0.693319 test-rmse:0.737860 
## [60] train-rmse:0.692563 test-rmse:0.737617 
## [61] train-rmse:0.691958 test-rmse:0.737347 
## [62] train-rmse:0.690763 test-rmse:0.736790 
## [63] train-rmse:0.689391 test-rmse:0.736386 
## [64] train-rmse:0.688159 test-rmse:0.735912 
## [65] train-rmse:0.687341 test-rmse:0.735780 
## [66] train-rmse:0.686254 test-rmse:0.735497 
## [67] train-rmse:0.685527 test-rmse:0.735268 
## [68] train-rmse:0.684687 test-rmse:0.734859 
## [69] train-rmse:0.683962 test-rmse:0.734612 
## [70] train-rmse:0.683130 test-rmse:0.734525 
## [71] train-rmse:0.682451 test-rmse:0.734315 
## [72] train-rmse:0.681367 test-rmse:0.733750 
## [73] train-rmse:0.680340 test-rmse:0.733307 
## [74] train-rmse:0.679839 test-rmse:0.733139 
## [75] train-rmse:0.679628 test-rmse:0.733039 
## [76] train-rmse:0.678973 test-rmse:0.732881 
## [77] train-rmse:0.677518 test-rmse:0.732323 
## [78] train-rmse:0.676360 test-rmse:0.731703 
## [79] train-rmse:0.675992 test-rmse:0.731657 
## [80] train-rmse:0.675558 test-rmse:0.731574 
## [81] train-rmse:0.675079 test-rmse:0.731446 
## [82] train-rmse:0.674145 test-rmse:0.731022 
## [83] train-rmse:0.673513 test-rmse:0.730738 
## [84] train-rmse:0.672351 test-rmse:0.730108 
## [85] train-rmse:0.671133 test-rmse:0.729478 
## [86] train-rmse:0.669849 test-rmse:0.729057 
## [87] train-rmse:0.668937 test-rmse:0.728714 
## [88] train-rmse:0.667906 test-rmse:0.728432 
## [89] train-rmse:0.667307 test-rmse:0.728348 
## [90] train-rmse:0.667228 test-rmse:0.728366 
## [91] train-rmse:0.666694 test-rmse:0.728247 
## [92] train-rmse:0.665964 test-rmse:0.728005 
## [93] train-rmse:0.665082 test-rmse:0.727777 
## [94] train-rmse:0.663932 test-rmse:0.727340 
## [95] train-rmse:0.663197 test-rmse:0.727039 
## [96] train-rmse:0.661876 test-rmse:0.726530 
## [97] train-rmse:0.661655 test-rmse:0.726485 
## [98] train-rmse:0.661048 test-rmse:0.726353 
## [99] train-rmse:0.660093 test-rmse:0.725971 
## [100]    train-rmse:0.660022 test-rmse:0.725979 
## [101]    train-rmse:0.659864 test-rmse:0.725954 
## [102]    train-rmse:0.659500 test-rmse:0.725798 
## [103]    train-rmse:0.659351 test-rmse:0.725769 
## [104]    train-rmse:0.659031 test-rmse:0.725744 
## [105]    train-rmse:0.657965 test-rmse:0.725329 
## [106]    train-rmse:0.657515 test-rmse:0.725222 
## [107]    train-rmse:0.657208 test-rmse:0.725166 
## [108]    train-rmse:0.656209 test-rmse:0.724621 
## [109]    train-rmse:0.655392 test-rmse:0.724298 
## [110]    train-rmse:0.654370 test-rmse:0.723858 
## [111]    train-rmse:0.653585 test-rmse:0.723701 
## [112]    train-rmse:0.653234 test-rmse:0.723598 
## [113]    train-rmse:0.652827 test-rmse:0.723566 
## [114]    train-rmse:0.652016 test-rmse:0.723355 
## [115]    train-rmse:0.650913 test-rmse:0.723059 
## [116]    train-rmse:0.650092 test-rmse:0.722689 
## [117]    train-rmse:0.649389 test-rmse:0.722357 
## [118]    train-rmse:0.648145 test-rmse:0.721895 
## [119]    train-rmse:0.647157 test-rmse:0.721635 
## [120]    train-rmse:0.646392 test-rmse:0.721472 
## [121]    train-rmse:0.645202 test-rmse:0.721090 
## [122]    train-rmse:0.644625 test-rmse:0.720816 
## [123]    train-rmse:0.644027 test-rmse:0.720682 
## [124]    train-rmse:0.643377 test-rmse:0.720417 
## [125]    train-rmse:0.642996 test-rmse:0.720315 
## [126]    train-rmse:0.641846 test-rmse:0.719904 
## [127]    train-rmse:0.641467 test-rmse:0.719815 
## [128]    train-rmse:0.640739 test-rmse:0.719518 
## [129]    train-rmse:0.639765 test-rmse:0.719198 
## [130]    train-rmse:0.639560 test-rmse:0.719173 
## [131]    train-rmse:0.639423 test-rmse:0.719165 
## [132]    train-rmse:0.638515 test-rmse:0.718855 
## [133]    train-rmse:0.637967 test-rmse:0.718668 
## [134]    train-rmse:0.637412 test-rmse:0.718484 
## [135]    train-rmse:0.636903 test-rmse:0.718386 
## [136]    train-rmse:0.636344 test-rmse:0.718155 
## [137]    train-rmse:0.635500 test-rmse:0.717783 
## [138]    train-rmse:0.635078 test-rmse:0.717651 
## [139]    train-rmse:0.634944 test-rmse:0.717613 
## [140]    train-rmse:0.633734 test-rmse:0.717177 
## [141]    train-rmse:0.632525 test-rmse:0.717026 
## [142]    train-rmse:0.631459 test-rmse:0.716522 
## [143]    train-rmse:0.630638 test-rmse:0.716251 
## [144]    train-rmse:0.630188 test-rmse:0.716128 
## [145]    train-rmse:0.629866 test-rmse:0.715998 
## [146]    train-rmse:0.628877 test-rmse:0.715905 
## [147]    train-rmse:0.627965 test-rmse:0.715688 
## [148]    train-rmse:0.626747 test-rmse:0.715132 
## [149]    train-rmse:0.626018 test-rmse:0.714787 
## [150]    train-rmse:0.625319 test-rmse:0.714664 
## [151]    train-rmse:0.624586 test-rmse:0.714371 
## [152]    train-rmse:0.624108 test-rmse:0.714249 
## [153]    train-rmse:0.623801 test-rmse:0.714238 
## [154]    train-rmse:0.622761 test-rmse:0.713745 
## [155]    train-rmse:0.622454 test-rmse:0.713608 
## [156]    train-rmse:0.621499 test-rmse:0.713237 
## [157]    train-rmse:0.620997 test-rmse:0.713095 
## [158]    train-rmse:0.620503 test-rmse:0.712943 
## [159]    train-rmse:0.620261 test-rmse:0.712872 
## [160]    train-rmse:0.620196 test-rmse:0.712872 
## [161]    train-rmse:0.619938 test-rmse:0.712751 
## [162]    train-rmse:0.619079 test-rmse:0.712591 
## [163]    train-rmse:0.618154 test-rmse:0.712098 
## [164]    train-rmse:0.617607 test-rmse:0.711953 
## [165]    train-rmse:0.617256 test-rmse:0.711858 
## [166]    train-rmse:0.616071 test-rmse:0.711474 
## [167]    train-rmse:0.615545 test-rmse:0.711263 
## [168]    train-rmse:0.614866 test-rmse:0.710974 
## [169]    train-rmse:0.614304 test-rmse:0.710886 
## [170]    train-rmse:0.613223 test-rmse:0.710325 
## [171]    train-rmse:0.612776 test-rmse:0.710288 
## [172]    train-rmse:0.612197 test-rmse:0.710245 
## [173]    train-rmse:0.611273 test-rmse:0.710046 
## [174]    train-rmse:0.610492 test-rmse:0.709799 
## [175]    train-rmse:0.610330 test-rmse:0.709752 
## [176]    train-rmse:0.609665 test-rmse:0.709710 
## [177]    train-rmse:0.609281 test-rmse:0.709558 
## [178]    train-rmse:0.609256 test-rmse:0.709554 
## [179]    train-rmse:0.608920 test-rmse:0.709432 
## [180]    train-rmse:0.608500 test-rmse:0.709329 
## [181]    train-rmse:0.608089 test-rmse:0.709241 
## [182]    train-rmse:0.607730 test-rmse:0.709155 
## [183]    train-rmse:0.607489 test-rmse:0.709117 
## [184]    train-rmse:0.607024 test-rmse:0.709061 
## [185]    train-rmse:0.606227 test-rmse:0.708792 
## [186]    train-rmse:0.605609 test-rmse:0.708570 
## [187]    train-rmse:0.605420 test-rmse:0.708537 
## [188]    train-rmse:0.604509 test-rmse:0.708371 
## [189]    train-rmse:0.603902 test-rmse:0.708206 
## [190]    train-rmse:0.603319 test-rmse:0.708139 
## [191]    train-rmse:0.602616 test-rmse:0.707881 
## [192]    train-rmse:0.601613 test-rmse:0.707313 
## [193]    train-rmse:0.600968 test-rmse:0.707186 
## [194]    train-rmse:0.599940 test-rmse:0.706765 
## [195]    train-rmse:0.599019 test-rmse:0.706514 
## [196]    train-rmse:0.598361 test-rmse:0.706040 
## [197]    train-rmse:0.597528 test-rmse:0.705879 
## [198]    train-rmse:0.596897 test-rmse:0.705750 
## [199]    train-rmse:0.596785 test-rmse:0.705746 
## [200]    train-rmse:0.596329 test-rmse:0.705680 
## [201]    train-rmse:0.596164 test-rmse:0.705655 
## [202]    train-rmse:0.595270 test-rmse:0.705457 
## [203]    train-rmse:0.594678 test-rmse:0.705205 
## [204]    train-rmse:0.594318 test-rmse:0.705117 
## [205]    train-rmse:0.593944 test-rmse:0.705035 
## [206]    train-rmse:0.593796 test-rmse:0.705006 
## [207]    train-rmse:0.593119 test-rmse:0.704918 
## [208]    train-rmse:0.592293 test-rmse:0.704722 
## [209]    train-rmse:0.591368 test-rmse:0.704495 
## [210]    train-rmse:0.590756 test-rmse:0.704317 
## [211]    train-rmse:0.590145 test-rmse:0.704083 
## [212]    train-rmse:0.589690 test-rmse:0.703982 
## [213]    train-rmse:0.588676 test-rmse:0.703673 
## [214]    train-rmse:0.588003 test-rmse:0.703500 
## [215]    train-rmse:0.587484 test-rmse:0.703327 
## [216]    train-rmse:0.586750 test-rmse:0.703121 
## [217]    train-rmse:0.586485 test-rmse:0.703028 
## [218]    train-rmse:0.585683 test-rmse:0.702674 
## [219]    train-rmse:0.585134 test-rmse:0.702603 
## [220]    train-rmse:0.584482 test-rmse:0.702484 
## [221]    train-rmse:0.583650 test-rmse:0.702274 
## [222]    train-rmse:0.583014 test-rmse:0.702118 
## [223]    train-rmse:0.582767 test-rmse:0.702035 
## [224]    train-rmse:0.582463 test-rmse:0.701917 
## [225]    train-rmse:0.581962 test-rmse:0.701772 
## [226]    train-rmse:0.581839 test-rmse:0.701748 
## [227]    train-rmse:0.581349 test-rmse:0.701642 
## [228]    train-rmse:0.580579 test-rmse:0.701485 
## [229]    train-rmse:0.579992 test-rmse:0.701318 
## [230]    train-rmse:0.579633 test-rmse:0.701252 
## [231]    train-rmse:0.579127 test-rmse:0.701238 
## [232]    train-rmse:0.578695 test-rmse:0.701038 
## [233]    train-rmse:0.578222 test-rmse:0.700879 
## [234]    train-rmse:0.577753 test-rmse:0.700811 
## [235]    train-rmse:0.577299 test-rmse:0.700622 
## [236]    train-rmse:0.576854 test-rmse:0.700570 
## [237]    train-rmse:0.576614 test-rmse:0.700435 
## [238]    train-rmse:0.576204 test-rmse:0.700285 
## [239]    train-rmse:0.575642 test-rmse:0.700003 
## [240]    train-rmse:0.575148 test-rmse:0.699886 
## [241]    train-rmse:0.574380 test-rmse:0.699686 
## [242]    train-rmse:0.573608 test-rmse:0.699477 
## [243]    train-rmse:0.573309 test-rmse:0.699364 
## [244]    train-rmse:0.572626 test-rmse:0.699163 
## [245]    train-rmse:0.571872 test-rmse:0.698976 
## [246]    train-rmse:0.571258 test-rmse:0.698812 
## [247]    train-rmse:0.570699 test-rmse:0.698721 
## [248]    train-rmse:0.569926 test-rmse:0.698567 
## [249]    train-rmse:0.569078 test-rmse:0.698307 
## [250]    train-rmse:0.568283 test-rmse:0.698035 
## [251]    train-rmse:0.567700 test-rmse:0.697959 
## [252]    train-rmse:0.567093 test-rmse:0.697936 
## [253]    train-rmse:0.566954 test-rmse:0.697942 
## [254]    train-rmse:0.566565 test-rmse:0.697821 
## [255]    train-rmse:0.566308 test-rmse:0.697781 
## [256]    train-rmse:0.566140 test-rmse:0.697781 
## [257]    train-rmse:0.565422 test-rmse:0.697596 
## [258]    train-rmse:0.564658 test-rmse:0.697326 
## [259]    train-rmse:0.564028 test-rmse:0.697222 
## [260]    train-rmse:0.563629 test-rmse:0.697203 
## [261]    train-rmse:0.562955 test-rmse:0.697034 
## [262]    train-rmse:0.562433 test-rmse:0.696791 
## [263]    train-rmse:0.561515 test-rmse:0.696428 
## [264]    train-rmse:0.560815 test-rmse:0.696181 
## [265]    train-rmse:0.560408 test-rmse:0.696123 
## [266]    train-rmse:0.559829 test-rmse:0.696060 
## [267]    train-rmse:0.559361 test-rmse:0.696015 
## [268]    train-rmse:0.558748 test-rmse:0.695627 
## [269]    train-rmse:0.558510 test-rmse:0.695579 
## [270]    train-rmse:0.558401 test-rmse:0.695547 
## [271]    train-rmse:0.558273 test-rmse:0.695583 
## [272]    train-rmse:0.557578 test-rmse:0.695327 
## [273]    train-rmse:0.557180 test-rmse:0.695318 
## [274]    train-rmse:0.557056 test-rmse:0.695260 
## [275]    train-rmse:0.556769 test-rmse:0.695239 
## [276]    train-rmse:0.556531 test-rmse:0.695253 
## [277]    train-rmse:0.555810 test-rmse:0.694970 
## [278]    train-rmse:0.555209 test-rmse:0.694752 
## [279]    train-rmse:0.554384 test-rmse:0.694733 
## [280]    train-rmse:0.553942 test-rmse:0.694630 
## [281]    train-rmse:0.553189 test-rmse:0.694353 
## [282]    train-rmse:0.552460 test-rmse:0.694244 
## [283]    train-rmse:0.551902 test-rmse:0.694220 
## [284]    train-rmse:0.551227 test-rmse:0.693975 
## [285]    train-rmse:0.550464 test-rmse:0.693972 
## [286]    train-rmse:0.549861 test-rmse:0.693836 
## [287]    train-rmse:0.549173 test-rmse:0.693645 
## [288]    train-rmse:0.548499 test-rmse:0.693360 
## [289]    train-rmse:0.548129 test-rmse:0.693306 
## [290]    train-rmse:0.547220 test-rmse:0.693066 
## [291]    train-rmse:0.546729 test-rmse:0.692966 
## [292]    train-rmse:0.546236 test-rmse:0.692806 
## [293]    train-rmse:0.545908 test-rmse:0.692669 
## [294]    train-rmse:0.545283 test-rmse:0.692483 
## [295]    train-rmse:0.544913 test-rmse:0.692460 
## [296]    train-rmse:0.544002 test-rmse:0.692145 
## [297]    train-rmse:0.543426 test-rmse:0.691983 
## [298]    train-rmse:0.542986 test-rmse:0.691819 
## [299]    train-rmse:0.542169 test-rmse:0.691402 
## [300]    train-rmse:0.541375 test-rmse:0.691259 
## [301]    train-rmse:0.540634 test-rmse:0.691005 
## [302]    train-rmse:0.539764 test-rmse:0.690667 
## [303]    train-rmse:0.539478 test-rmse:0.690714 
## [304]    train-rmse:0.538802 test-rmse:0.690450 
## [305]    train-rmse:0.537958 test-rmse:0.690285 
## [306]    train-rmse:0.537715 test-rmse:0.690268 
## [307]    train-rmse:0.537523 test-rmse:0.690302 
## [308]    train-rmse:0.537211 test-rmse:0.690269 
## [309]    train-rmse:0.537024 test-rmse:0.690284 
## [310]    train-rmse:0.536445 test-rmse:0.690140 
## [311]    train-rmse:0.535881 test-rmse:0.690002 
## [312]    train-rmse:0.535704 test-rmse:0.689937 
## [313]    train-rmse:0.535138 test-rmse:0.689854 
## [314]    train-rmse:0.534640 test-rmse:0.689714 
## [315]    train-rmse:0.533970 test-rmse:0.689637 
## [316]    train-rmse:0.533542 test-rmse:0.689674 
## [317]    train-rmse:0.533428 test-rmse:0.689650 
## [318]    train-rmse:0.532581 test-rmse:0.689274 
## [319]    train-rmse:0.531954 test-rmse:0.689092 
## [320]    train-rmse:0.531804 test-rmse:0.689094 
## [321]    train-rmse:0.531630 test-rmse:0.689051 
## [322]    train-rmse:0.530811 test-rmse:0.688894 
## [323]    train-rmse:0.530574 test-rmse:0.688866 
## [324]    train-rmse:0.530290 test-rmse:0.688889 
## [325]    train-rmse:0.529590 test-rmse:0.688706 
## [326]    train-rmse:0.528906 test-rmse:0.688665 
## [327]    train-rmse:0.528519 test-rmse:0.688620 
## [328]    train-rmse:0.527934 test-rmse:0.688621 
## [329]    train-rmse:0.527469 test-rmse:0.688549 
## [330]    train-rmse:0.527383 test-rmse:0.688533 
## [331]    train-rmse:0.527179 test-rmse:0.688528 
## [332]    train-rmse:0.526531 test-rmse:0.688265 
## [333]    train-rmse:0.526271 test-rmse:0.688219 
## [334]    train-rmse:0.525686 test-rmse:0.687960 
## [335]    train-rmse:0.525405 test-rmse:0.687881 
## [336]    train-rmse:0.524984 test-rmse:0.687882 
## [337]    train-rmse:0.524505 test-rmse:0.687773 
## [338]    train-rmse:0.523834 test-rmse:0.687656 
## [339]    train-rmse:0.523595 test-rmse:0.687673 
## [340]    train-rmse:0.523052 test-rmse:0.687496 
## [341]    train-rmse:0.522848 test-rmse:0.687545 
## [342]    train-rmse:0.522609 test-rmse:0.687532 
## [343]    train-rmse:0.521905 test-rmse:0.687409 
## [344]    train-rmse:0.521416 test-rmse:0.687258 
## [345]    train-rmse:0.520955 test-rmse:0.687175 
## [346]    train-rmse:0.520441 test-rmse:0.687058 
## [347]    train-rmse:0.519994 test-rmse:0.687022 
## [348]    train-rmse:0.519612 test-rmse:0.686999 
## [349]    train-rmse:0.519174 test-rmse:0.686854 
## [350]    train-rmse:0.518663 test-rmse:0.686790 
## [351]    train-rmse:0.517989 test-rmse:0.686583 
## [352]    train-rmse:0.517594 test-rmse:0.686509 
## [353]    train-rmse:0.517138 test-rmse:0.686368 
## [354]    train-rmse:0.517013 test-rmse:0.686318 
## [355]    train-rmse:0.516478 test-rmse:0.686143 
## [356]    train-rmse:0.515860 test-rmse:0.686033 
## [357]    train-rmse:0.515743 test-rmse:0.686102 
## [358]    train-rmse:0.515513 test-rmse:0.686098 
## [359]    train-rmse:0.515042 test-rmse:0.685986 
## [360]    train-rmse:0.514216 test-rmse:0.685679 
## [361]    train-rmse:0.513916 test-rmse:0.685514 
## [362]    train-rmse:0.513092 test-rmse:0.685188 
## [363]    train-rmse:0.512884 test-rmse:0.685179 
## [364]    train-rmse:0.512280 test-rmse:0.685115 
## [365]    train-rmse:0.512086 test-rmse:0.685105 
## [366]    train-rmse:0.511787 test-rmse:0.685010 
## [367]    train-rmse:0.511080 test-rmse:0.684905 
## [368]    train-rmse:0.510591 test-rmse:0.684908 
## [369]    train-rmse:0.510093 test-rmse:0.684776 
## [370]    train-rmse:0.509614 test-rmse:0.684709 
## [371]    train-rmse:0.509121 test-rmse:0.684572 
## [372]    train-rmse:0.508926 test-rmse:0.684502 
## [373]    train-rmse:0.508170 test-rmse:0.684344 
## [374]    train-rmse:0.507835 test-rmse:0.684293 
## [375]    train-rmse:0.507386 test-rmse:0.684268 
## [376]    train-rmse:0.506870 test-rmse:0.684087 
## [377]    train-rmse:0.506462 test-rmse:0.684048 
## [378]    train-rmse:0.506318 test-rmse:0.684050 
## [379]    train-rmse:0.506035 test-rmse:0.684061 
## [380]    train-rmse:0.505973 test-rmse:0.684029 
## [381]    train-rmse:0.505348 test-rmse:0.683978 
## [382]    train-rmse:0.504820 test-rmse:0.683820 
## [383]    train-rmse:0.504462 test-rmse:0.683821 
## [384]    train-rmse:0.504355 test-rmse:0.683800 
## [385]    train-rmse:0.504018 test-rmse:0.683700 
## [386]    train-rmse:0.503467 test-rmse:0.683584 
## [387]    train-rmse:0.502919 test-rmse:0.683536 
## [388]    train-rmse:0.502718 test-rmse:0.683447 
## [389]    train-rmse:0.502074 test-rmse:0.683370 
## [390]    train-rmse:0.501478 test-rmse:0.683281 
## [391]    train-rmse:0.501098 test-rmse:0.683187 
## [392]    train-rmse:0.500659 test-rmse:0.683135 
## [393]    train-rmse:0.500421 test-rmse:0.683124 
## [394]    train-rmse:0.500203 test-rmse:0.683135 
## [395]    train-rmse:0.499667 test-rmse:0.682972 
## [396]    train-rmse:0.499190 test-rmse:0.682841 
## [397]    train-rmse:0.498866 test-rmse:0.682769 
## [398]    train-rmse:0.498371 test-rmse:0.682600 
## [399]    train-rmse:0.498143 test-rmse:0.682555 
## [400]    train-rmse:0.497772 test-rmse:0.682538 
## [401]    train-rmse:0.497674 test-rmse:0.682500 
## [402]    train-rmse:0.497421 test-rmse:0.682452 
## [403]    train-rmse:0.496681 test-rmse:0.682236 
## [404]    train-rmse:0.496039 test-rmse:0.682086 
## [405]    train-rmse:0.495757 test-rmse:0.682048 
## [406]    train-rmse:0.495333 test-rmse:0.681948 
## [407]    train-rmse:0.495004 test-rmse:0.681887 
## [408]    train-rmse:0.494527 test-rmse:0.681742 
## [409]    train-rmse:0.494099 test-rmse:0.681794 
## [410]    train-rmse:0.493826 test-rmse:0.681757 
## [411]    train-rmse:0.493172 test-rmse:0.681582 
## [412]    train-rmse:0.492626 test-rmse:0.681448 
## [413]    train-rmse:0.492100 test-rmse:0.681328 
## [414]    train-rmse:0.491860 test-rmse:0.681267 
## [415]    train-rmse:0.491430 test-rmse:0.681258 
## [416]    train-rmse:0.491066 test-rmse:0.681160 
## [417]    train-rmse:0.490718 test-rmse:0.681108 
## [418]    train-rmse:0.490509 test-rmse:0.681103 
## [419]    train-rmse:0.490361 test-rmse:0.681070 
## [420]    train-rmse:0.490310 test-rmse:0.681087 
## [421]    train-rmse:0.489719 test-rmse:0.680924 
## [422]    train-rmse:0.489670 test-rmse:0.680931 
## [423]    train-rmse:0.489253 test-rmse:0.680846 
## [424]    train-rmse:0.488777 test-rmse:0.680714 
## [425]    train-rmse:0.488242 test-rmse:0.680595 
## [426]    train-rmse:0.487546 test-rmse:0.680348 
## [427]    train-rmse:0.487047 test-rmse:0.680143 
## [428]    train-rmse:0.486474 test-rmse:0.680063 
## [429]    train-rmse:0.485985 test-rmse:0.679895 
## [430]    train-rmse:0.485462 test-rmse:0.679781 
## [431]    train-rmse:0.485245 test-rmse:0.679737 
## [432]    train-rmse:0.484930 test-rmse:0.679640 
## [433]    train-rmse:0.484486 test-rmse:0.679602 
## [434]    train-rmse:0.484282 test-rmse:0.679588 
## [435]    train-rmse:0.484183 test-rmse:0.679585 
## [436]    train-rmse:0.483759 test-rmse:0.679531 
## [437]    train-rmse:0.483506 test-rmse:0.679459 
## [438]    train-rmse:0.483373 test-rmse:0.679411 
## [439]    train-rmse:0.482758 test-rmse:0.679237 
## [440]    train-rmse:0.482288 test-rmse:0.679182 
## [441]    train-rmse:0.481724 test-rmse:0.679014 
## [442]    train-rmse:0.481339 test-rmse:0.678917 
## [443]    train-rmse:0.480820 test-rmse:0.678835 
## [444]    train-rmse:0.480411 test-rmse:0.678760 
## [445]    train-rmse:0.479922 test-rmse:0.678831 
## [446]    train-rmse:0.479675 test-rmse:0.678762 
## [447]    train-rmse:0.479102 test-rmse:0.678721 
## [448]    train-rmse:0.478640 test-rmse:0.678601 
## [449]    train-rmse:0.478120 test-rmse:0.678542 
## [450]    train-rmse:0.478018 test-rmse:0.678524 
## [451]    train-rmse:0.477780 test-rmse:0.678517 
## [452]    train-rmse:0.477509 test-rmse:0.678453 
## [453]    train-rmse:0.477048 test-rmse:0.678260 
## [454]    train-rmse:0.476656 test-rmse:0.678205 
## [455]    train-rmse:0.476280 test-rmse:0.678020 
## [456]    train-rmse:0.476121 test-rmse:0.678017 
## [457]    train-rmse:0.475793 test-rmse:0.677964 
## [458]    train-rmse:0.475370 test-rmse:0.677880 
## [459]    train-rmse:0.474906 test-rmse:0.677798 
## [460]    train-rmse:0.474524 test-rmse:0.677694 
## [461]    train-rmse:0.474438 test-rmse:0.677678 
## [462]    train-rmse:0.473961 test-rmse:0.677525 
## [463]    train-rmse:0.473905 test-rmse:0.677504 
## [464]    train-rmse:0.473862 test-rmse:0.677519 
## [465]    train-rmse:0.473812 test-rmse:0.677517 
## [466]    train-rmse:0.473516 test-rmse:0.677459 
## [467]    train-rmse:0.473470 test-rmse:0.677423 
## [468]    train-rmse:0.472972 test-rmse:0.677399 
## [469]    train-rmse:0.472815 test-rmse:0.677350 
## [470]    train-rmse:0.472624 test-rmse:0.677361 
## [471]    train-rmse:0.472567 test-rmse:0.677373 
## [472]    train-rmse:0.472500 test-rmse:0.677360 
## [473]    train-rmse:0.471871 test-rmse:0.677197 
## [474]    train-rmse:0.471588 test-rmse:0.677200 
## [475]    train-rmse:0.471304 test-rmse:0.677190 
## [476]    train-rmse:0.471022 test-rmse:0.677182 
## [477]    train-rmse:0.470248 test-rmse:0.676915 
## [478]    train-rmse:0.470012 test-rmse:0.676881 
## [479]    train-rmse:0.469382 test-rmse:0.676721 
## [480]    train-rmse:0.469083 test-rmse:0.676679 
## [481]    train-rmse:0.468693 test-rmse:0.676637 
## [482]    train-rmse:0.468376 test-rmse:0.676535 
## [483]    train-rmse:0.468318 test-rmse:0.676545 
## [484]    train-rmse:0.467961 test-rmse:0.676430 
## [485]    train-rmse:0.467519 test-rmse:0.676291 
## [486]    train-rmse:0.467190 test-rmse:0.676300 
## [487]    train-rmse:0.466935 test-rmse:0.676258 
## [488]    train-rmse:0.466572 test-rmse:0.676248 
## [489]    train-rmse:0.466039 test-rmse:0.676263 
## [490]    train-rmse:0.465809 test-rmse:0.676217 
## [491]    train-rmse:0.465392 test-rmse:0.676073 
## [492]    train-rmse:0.464859 test-rmse:0.675989 
## [493]    train-rmse:0.464716 test-rmse:0.676000 
## [494]    train-rmse:0.464186 test-rmse:0.675883 
## [495]    train-rmse:0.463781 test-rmse:0.675812 
## [496]    train-rmse:0.463399 test-rmse:0.675684 
## [497]    train-rmse:0.463146 test-rmse:0.675621 
## [498]    train-rmse:0.462912 test-rmse:0.675564 
## [499]    train-rmse:0.462535 test-rmse:0.675497 
## [500]    train-rmse:0.462148 test-rmse:0.675416
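
The gap between the training and test columns of the log is easier to judge once it is pulled out of the text. The sketch below is an illustration only: it parses three lines copied from the log above using base R, and the names `log_lines` and `eval_log` are assumptions that are not part of the original analysis.

```r
# Three lines copied from the evaluation log above (rounds 65, 250 and 500)
log_lines <- c(
  "[65] train-rmse:0.687341 test-rmse:0.735780",
  "[250] train-rmse:0.568283 test-rmse:0.698035",
  "[500] train-rmse:0.462148 test-rmse:0.675416"
)

# Extract the round number and both RMSE values with regular expressions
eval_log <- data.frame(
  round = as.integer(sub("^\\[([0-9]+)\\].*$", "\\1", log_lines)),
  train_rmse = as.numeric(sub("^.*train-rmse:([0-9.]+).*$", "\\1", log_lines)),
  test_rmse = as.numeric(sub("^.*test-rmse:([0-9.]+).*$", "\\1", log_lines))
)

# A widening gap means the training error falls faster than the test error
eval_log$gap <- eval_log$test_rmse - eval_log$train_rmse
print(eval_log)
```

On these three rows the gap grows from about 0.05 at round 65 to about 0.21 at round 500, which may indicate that the later rounds fit the training set more than they improve generalisation.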

#finalized parameters for xg boost
final <- xgboost(data = xgb_train, max.depth = 6, nrounds = 500, verbose = 0)

#predict on the test set using the model fitted on the training data
pred_y <- predict(final, xgb_test)

MODEL PERFORMANCE EVALUATION - EXTREME GRADIENT BOOSTING (XG BOOST)

The prediction accuracy is evaluated with Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE); MAE and RMSE are computed with caret.

#evaluate model
mse = mean((test_y - pred_y)^2)
mae = caret::MAE(test_y, pred_y)
rmse = caret::RMSE(test_y, pred_y)

cat("MSE: ", mse, "MAE: ", mae, " RMSE: ", rmse)
## MSE:  0.4561864 MAE:  0.5236731  RMSE:  0.6754157
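
As a rough illustration of what the three metrics measure, they can be computed by hand in base R. The vectors `actual` and `predicted` below are toy values invented for this sketch and are not taken from the dataset.

```r
# Toy vectors for illustration only (not taken from the dataset)
actual <- c(1, 2, 3, 4)
predicted <- c(1.5, 2, 2, 4)

errors <- actual - predicted      # -0.5, 0, 1, 0

mse_toy <- mean(errors^2)         # average squared error
mae_toy <- mean(abs(errors))      # average absolute error
rmse_toy <- sqrt(mse_toy)         # square root of MSE, in the units of the target

cat("MSE:", mse_toy, "MAE:", mae_toy, "RMSE:", rmse_toy, "\n")
```

The reported XGBoost values are consistent with these definitions: the square root of the MSE, 0.4561864, is approximately 0.6754, which matches the reported RMSE.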

DATA MODELLING - MULTIPLE LINEAR REGRESSION

Linear regression is a widely used statistical tool for modelling the relationship between a response variable and a predictor. Because the dataset used here has more than one predictor, multiple linear regression is used to predict the outcome (the target, or dependent variable) from multiple predictor (independent) variables.
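
In symbols, a multiple linear regression with p predictors is usually written as follows. The notation is generic and is not taken from the source.

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i
```

Here y_i would be the popularity_class of track i, the x_ij are its feature values, and the coefficients are estimated by least squares, which is what lm() does.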

Remove target variable from normalized data and create a boxplot

temp_data <- subset(data_norm, select = -c(popularity_class))
melt_data <- melt(temp_data)

boxplot(data = melt_data, value ~ variable)

set.seed(42)

#split training and testing data
partition <- sample.split(Y = data_norm$popularity_class, SplitRatio = 0.7)
train_set <- subset(x = data_norm, partition == TRUE)
test_set <- subset(x = data_norm, partition == FALSE)

#fit linear model and summarize results
model5 <- lm(popularity_class ~ ., data = train_set)
summary(model5)
## 
## Call:
## lm(formula = popularity_class ~ ., data = train_set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.24996 -0.83240  0.03958  0.47766  2.54915 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.823482   0.046404  39.296  < 2e-16 ***
## key                 0.007396   0.008829   0.838 0.402223    
## loudness            0.198463   0.055278   3.590 0.000331 ***
## speechiness        -0.606124   0.027523 -22.022  < 2e-16 ***
## instrumentalness   -0.368601   0.011206 -32.895  < 2e-16 ***
## liveness           -0.055154   0.016011  -3.445 0.000572 ***
## tempo               0.051016   0.024205   2.108 0.035063 *  
## duration_ms        -1.011489   0.137724  -7.344 2.09e-13 ***
## danceability        0.358495   0.020807  17.229  < 2e-16 ***
## energy             -0.157243   0.023397  -6.721 1.82e-11 ***
## acousticness       -0.082462   0.013477  -6.119 9.47e-10 ***
## valence            -0.335789   0.013780 -24.368  < 2e-16 ***
## time_signature      0.032491   0.006824   4.761 1.93e-06 ***
## parent_genre_class  0.002549   0.002120   1.202 0.229276    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.805 on 79780 degrees of freedom
## Multiple R-squared:  0.03367,    Adjusted R-squared:  0.03351 
## F-statistic: 213.8 on 13 and 79780 DF,  p-value: < 2.2e-16

Extract and visualize the residuals with ggplot (histogram)

#get residuals
lm_residuals <- data.frame(residual = residuals(model5))

#visualize residuals
ggplot(lm_residuals, aes(x = residual)) + geom_histogram(fill = "pink", color = "black") + theme_classic() + labs(title = "Residuals Plot")

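The residuals are not symmetric around zero: most of them are concentrated on the left of the histogram, with a longer tail to the right (the residual summary above runs from -1.25 to 2.55), which suggests that the model underpredicts some tracks by a large margin.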

#make prediction
prediction <- predict(model5, test_set)

#convert the results into a dataframe 
evaluation <- cbind(test_set$popularity_class, prediction)
colnames(evaluation) <- c("Actual", "Predicted")
evaluation <- as.data.frame(evaluation)
head(evaluation)

MODEL PERFORMANCE EVALUATION - MULTIPLE LINEAR REGRESSION

The prediction accuracy is evaluated with Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE); MAE is computed with the Metrics package.

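#library(Metrics)
#evaluate model on the test set
mse <- mean((evaluation$Actual - evaluation$Predicted)^2)
#use the test-set predictions; predict(model5) without newdata returns training-set fitted values
mae <- Metrics::mae(evaluation$Actual, evaluation$Predicted)
rmse <- sqrt(mse)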

cat("MSE: ", mse, "MAE: ", mae, " RMSE: ", rmse)
## MSE:  0.6484976 MAE:  0.6760304  RMSE:  0.8052935

DATA MODELLING - DECISION TREE

Split the normalized data into 70% training and 30% testing set using createDataPartition

set.seed(10000)
#split training and testing data
index <- createDataPartition(y = data_norm$popularity_class,
                             p = 0.7,
                             list = FALSE)
train_data <- data_norm[index, ]
test_data <- data_norm[-index, ]
rm(index)

#head(train_data)
#head(test_data)

Create a classification tree with 10-fold cross-validation (xval = 10) and plot it with rpart.plot()

#classification
set.seed(100)
full_classification <- rpart(formula = popularity_class~.,
                             data = train_data,
                             method = "class",
                             xval = 10)
rpart.plot(full_classification)

Display CP table for fitted decision tree with printcp

#show summary
printcp(full_classification)
## 
## Classification tree:
## rpart(formula = popularity_class ~ ., data = train_data, method = "class", 
##     xval = 10)
## 
## Variables actually used in tree construction:
## [1] duration_ms      instrumentalness liveness        
## 
## Root node error: 49443/79795 = 0.61963
## 
## n= 79795 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.060838      0   1.00000 1.00000 0.0027737
## 2 0.011680      1   0.93916 0.93977 0.0028177
## 3 0.010000      3   0.91580 0.92296 0.0028269

Plot the cross-validated error against the cp values with plotcp()

#show graph
plotcp(full_classification)

Prune the tree with prune(), using the cp (cost-complexity parameter) that has the lowest cross-validated error, and plot the pruned tree

#pruning the model
data_classification <- prune(full_classification,
                             cp = full_classification$cptable[which.min(full_classification$cptable[, "xerror"]), "CP"])
rm(full_classification)

#plot results
rpart.plot(data_classification, yesno = TRUE)

Create predictions by applying the model on testing set using predict()

#make prediction
data_classification_pred <- predict(data_classification,
                                    test_data,
                                    type = "class")
plot(test_data$popularity_class, data_classification_pred, main = "Classification - Predict vs Actual", xlab = "Actual", ylab = "Predicted")

MODEL PERFORMANCE EVALUATION - DECISION TREE

A decision tree is a supervised machine learning algorithm that can perform both classification and regression tasks, using nodes and branches (sub-trees) to represent the possible decisions.

Create confusion matrix to evaluate the model accuracy with confusionMatrix

#evaluate model
data_classification_confusion <- confusionMatrix(data = as.factor(data_classification_pred),
                                                 reference = as.factor(test_data$popularity_class))

data_classification_confusion
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4
##          1 9061 7301 4648  405
##          2 3749 5755 2976  301
##          3    0    0    0    0
##          4    0    0    0    0
## 
## Overall Statistics
##                                          
##                Accuracy : 0.4333         
##                  95% CI : (0.428, 0.4385)
##     No Information Rate : 0.3818         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.0899         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.7073   0.4408    0.000  0.00000
## Specificity            0.4223   0.6676    1.000  1.00000
## Pos Pred Value         0.4231   0.4503      NaN      NaN
## Neg Pred Value         0.7067   0.6591    0.777  0.97935
## Prevalence             0.3746   0.3818    0.223  0.02065
## Detection Rate         0.2650   0.1683    0.000  0.00000
## Detection Prevalence   0.6262   0.3738    0.000  0.00000
## Balanced Accuracy      0.5648   0.5542    0.500  0.50000
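
The headline statistics can be recomputed from the counts in the confusion matrix, which makes it clearer where they come from. This is a base R sketch for illustration. The names `cm`, `nir`, `expected` and `cohen_kappa` are assumptions, and the kappa formula is the standard Cohen's kappa definition rather than caret's internal code.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference)
cm <- matrix(
  c(9061, 7301, 4648, 405,
    3749, 5755, 2976, 301,
       0,    0,    0,   0,
       0,    0,    0,   0),
  nrow = 4, byrow = TRUE,
  dimnames = list(Prediction = as.character(1:4), Reference = as.character(1:4))
)

n <- sum(cm)
accuracy <- sum(diag(cm)) / n

# No Information Rate: share of the most frequent reference class
nir <- max(colSums(cm)) / n

# Cohen's kappa: agreement beyond what the row and column totals give by chance
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Sensitivity (recall) per class: correct predictions over reference totals
sensitivity <- diag(cm) / colSums(cm)

print(round(c(accuracy = accuracy, nir = nir, kappa = cohen_kappa), 4))
print(round(sensitivity, 4))
```

The recomputed accuracy, No Information Rate and kappa agree with the reported 0.4333, 0.3818 and 0.0899. The two empty rows show that the tree never predicts classes 3 or 4, which is why their sensitivity is 0.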

DISCUSSION

Based on the output of the linear regression model, which predicts the popularity class of a song from various audio features, several of the audio features are associated with popularity. The p-values are used to determine whether the estimates are statistically significant: a p-value below 0.05 means that the coefficient differs from zero at the 5% significance level. In particular, loudness, speechiness, instrumentalness, liveness, tempo, duration, danceability, energy, acousticness, valence, and time_signature are all associated with popularity, as their p-values are less than 0.05. Key and parent_genre_class do not seem to be associated with popularity, as their p-values are greater than 0.05. With roughly 80,000 training rows, however, even small effects reach significance, and the multiple R-squared of 0.03367 shows that these features together explain little of the variation in popularity.
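
The significance check described above can be reproduced directly from the coefficient table. The following base R sketch re-enters the estimates and p-values by hand. The data frame `coefs` and its column names are assumptions for illustration, and p-values printed as "< 2e-16" are entered as 2e-16.

```r
# Estimates and p-values copied from summary(model5) above;
# "< 2e-16" entries are entered as 2e-16 for illustration
coefs <- data.frame(
  term = c("key", "loudness", "speechiness", "instrumentalness", "liveness",
           "tempo", "duration_ms", "danceability", "energy", "acousticness",
           "valence", "time_signature", "parent_genre_class"),
  estimate = c(0.007396, 0.198463, -0.606124, -0.368601, -0.055154,
               0.051016, -1.011489, 0.358495, -0.157243, -0.082462,
               -0.335789, 0.032491, 0.002549),
  p_value = c(0.402223, 0.000331, 2e-16, 2e-16, 0.000572,
              0.035063, 2.09e-13, 2e-16, 1.82e-11, 9.47e-10,
              2e-16, 1.93e-06, 0.229276)
)

coefs$significant <- coefs$p_value < 0.05
coefs$direction <- ifelse(coefs$estimate > 0, "positive", "negative")

# Significant terms, largest absolute estimate first
sig <- coefs[coefs$significant, ]
sig <- sig[order(-abs(sig$estimate)), ]
print(sig[, c("term", "estimate", "direction")])
```

Eleven of the thirteen terms fall below 0.05, including loudness. Only key and parent_genre_class do not.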

To meet our research objectives, we found that the features below contribute to the prediction task:
(i) Speechiness: The speechiness feature measures the presence of spoken words in a track, with a higher value indicating a higher presence of spoken words. A negative coefficient for speechiness suggests that as the presence of spoken words in a track increases, the popularity of the track decreases. This may suggest that tracks with fewer spoken words are more likely to be popular.
(ii) Instrumentalness: The instrumentalness feature measures the presence of instrumental music in a track, with a higher value indicating a higher presence of instrumental music. A negative coefficient for instrumentalness suggests that as the presence of instrumental music in a track increases, the popularity of the track decreases. This may suggest that tracks with less instrumental music are more likely to be popular.
(iii) Liveness: The liveness feature measures the presence of an audience in a track, with a higher value indicating a higher presence of an audience. A negative coefficient for liveness suggests that as the presence of an audience in a track increases, the popularity of the track decreases. This may suggest that tracks with less audience presence are more likely to be popular.
(iv) Tempo: The tempo feature measures the tempo of the track, with a higher value indicating a higher tempo. A positive coefficient for tempo suggests that as the tempo of the track increases, the popularity of the track increases. This may suggest that tracks with faster tempo are more likely to be popular.
(v) Duration: The duration feature measures the duration of the track, with a higher value indicating a longer duration. A negative coefficient for duration suggests that as the duration of the track increases, the popularity of the track decreases. This may suggest that shorter tracks are more likely to be popular.
(vi) Danceability: The danceability feature measures how easy it is to dance to a track, with a higher value indicating a higher danceability. A positive coefficient for danceability suggests that as the danceability of the track increases, the popularity of the track increases. This may suggest that tracks that are easier to dance to are more likely to be popular.
(vii) Energy: The energy feature measures the intensity of a track, with a higher value indicating a higher intensity. A negative coefficient for energy suggests that as the intensity of the track increases, the popularity of the track decreases. This may suggest that tracks with less intensity are more likely to be popular.
(viii) Acousticness: The acousticness feature measures the presence of acoustic music in a track, with a higher value indicating a higher presence of acoustic music. A negative coefficient for acousticness suggests that as the presence of acoustic music in a track increases, the popularity of the track decreases. This may suggest that tracks with less acoustic music are more likely to be popular.
(ix) Valence: The valence feature measures the positivity of a track, with a higher value indicating a more positive track. A negative coefficient for valence suggests that as the positivity of the track increases, the popularity of the track decreases. This may suggest that tracks with less positive are more likely to be popular.
(x) Time_signature: The time_signature feature measures the time signature of a track, with a higher value indicating a different time signature. A positive coefficient for time_signature suggests that as the time signature of the track increases, the popularity of the track increases. This may suggest that tracks with different time signature are more likely to be popular.
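These interpretations are based on the sign of each coefficient only. The magnitude of the coefficients, the correlation between predictors, and other factors that may influence the popularity class should also be taken into account before drawing firm conclusions.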
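The point about coefficient magnitude can be illustrated with a minimal sketch. It assumes a single predictor and uses a small made-up dataset (the values and the helper names `ols_slope` and `standardized_slope` are hypothetical and are not taken from this report's data or models). It shows that the sign of a coefficient is stable, while its raw magnitude depends on the unit of the feature, so magnitudes are only comparable after standardization.

```python
import math

# Hypothetical toy data, for illustration only (not from the report's dataset).
duration_min = [2.5, 3.0, 3.5, 4.0, 4.5]       # track length in minutes
duration_ms = [m * 60000 for m in duration_min]  # same lengths in milliseconds
popularity = [60, 55, 52, 48, 40]                # made-up popularity scores


def ols_slope(x, y):
    """Least-squares slope b of the line y = a + b*x for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    return cov_xy / var_x


def sample_sd(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def standardized_slope(x, y):
    """Slope after rescaling x and y to unit standard deviation."""
    return ols_slope(x, y) * sample_sd(x) / sample_sd(y)


slope_min = ols_slope(duration_min, popularity)
slope_ms = ols_slope(duration_ms, popularity)

# Same sign in both units, but very different raw magnitudes.
print("slope per minute:     ", slope_min)
print("slope per millisecond:", slope_ms)

# After standardization the two versions agree.
print("standardized (min):", standardized_slope(duration_min, popularity))
print("standardized (ms): ", standardized_slope(duration_ms, popularity))
```

In a model with several predictors, each coefficient additionally depends on the other predictors included, which is why correlated features can change the size or even the sign of a coefficient.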

CONCLUSION

The results of the three models used to predict the popularity of a song are summarized below.
The XGBoost model has a mean squared error (MSE) of 0.4607571, a mean absolute error (MAE) of 0.5241448, and a root mean squared error (RMSE) of 0.6787909. These values indicate that the model makes relatively small errors on average, although the RMSE is noticeably larger than the MAE, which indicates that some predictions have comparatively large errors.
The linear regression model has an MSE of 0.6484976 and an RMSE of 0.8052935. The residual standard error is 0.805 on 79780 degrees of freedom. The multiple R-squared value is 0.03367, and the adjusted R-squared value is 0.03351. The F-statistic is 213.8 on 13 and 79780 DF, with a p-value of less than 2.2e-16. The model is therefore statistically significant overall, but it explains only about 3.4% of the variance in popularity, so the linear relationship between the predictors and the response variable is weak and the model does not fit the data well.
The decision tree model has an accuracy of 0.4333 and a Kappa statistic of 0.0899. Both values are low, which indicates that the model performs only slightly better than chance agreement. The sensitivity, specificity, positive predictive value, negative predictive value, and prevalence are also provided for each class. The balanced accuracy is also low, at 0.5648.
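As a consistency check on the figures above, the following minimal sketch recomputes the RMSE from each reported MSE and the adjusted R-squared from the reported R-squared and degrees of freedom. It also derives the chance agreement implied by the reported accuracy and Kappa. The function names are illustrative, and the calculation assumes that the linear model includes an intercept, so that the number of observations equals the residual degrees of freedom plus the 13 predictors plus one.

```python
import math

# Values as reported in the conclusion.
xgb_mse, xgb_rmse = 0.4607571, 0.6787909
lm_mse, lm_rmse = 0.6484976, 0.8052935
lm_r2, lm_adj_r2 = 0.03367, 0.03351
residual_df, n_predictors = 79780, 13
tree_accuracy, tree_kappa = 0.4333, 0.0899


def rmse_from_mse(mse):
    """RMSE is the square root of MSE."""
    return math.sqrt(mse)


def adjusted_r2(r2, residual_df, n_predictors):
    """Adjusted R-squared, assuming the model has an intercept."""
    n_obs = residual_df + n_predictors + 1
    return 1 - (1 - r2) * (n_obs - 1) / residual_df


def cohen_kappa(observed, expected):
    """Cohen's Kappa from observed and chance (expected) agreement."""
    return (observed - expected) / (1 - expected)


def implied_chance_agreement(observed, kappa):
    """Chance agreement implied by a given accuracy and Kappa."""
    return (observed - kappa) / (1 - kappa)


print("XGBoost RMSE from MSE:", rmse_from_mse(xgb_mse))
print("Linear  RMSE from MSE:", rmse_from_mse(lm_mse))
print("Adjusted R-squared:   ", adjusted_r2(lm_r2, residual_df, n_predictors))

chance = implied_chance_agreement(tree_accuracy, tree_kappa)
print("Implied chance agreement for the decision tree:", chance)
```

The recomputed values agree with the reported ones up to rounding. The implied chance agreement shows why an accuracy of 0.4333 corresponds to such a low Kappa: a large share of that accuracy would be expected from chance alone given the class distribution.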
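Based on the evaluation metrics provided, XGBoost has a lower MSE and RMSE than the linear regression model (an MSE of 0.4607571 against 0.6484976, which is roughly 29% lower), which indicates that XGBoost is better at minimizing the error between the predicted and actual values. The decision tree was evaluated with classification metrics (accuracy and Kappa) rather than error metrics, so its results are not directly comparable with those of the other two models. However, its low accuracy and Kappa values suggest that it does not classify this dataset well. Overall, XGBoost is the best-performing of the three models on this dataset.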