Spotify is an audio streaming platform that provides DRM-restricted music, videos and podcasts from record labels and media companies. It has over 50 million tracks, which users can browse by parameters such as artist, album and genre, or via playlists. It pays artists or rights holders royalties amounting to approximately 70% of its revenue. It is therefore a good platform for musicians not only to showcase their talent but also to earn money.
From a musician's point of view, we have the following problem statements:
To find the most popular singers.
To find the most popular genre.
To find the most danceable genre and subgenre.
To find the average duration of popular songs.
To model the data with popularity as the response variable.
Answering these questions might help a musician find a suitable singer and compose a song with a better chance of becoming popular.
Methodology: We use the Spotify dataset, which was collected using the Spotify API and is available via the spotifyr package. We begin by analysing the data structure, then clean and process the data for further use. Next, we explore the data to find relationships between variables using lists, tables and visualizations.
library(DT)
library(tidyverse)
library(ggplot2)
library(magrittr)
library(DT) - The DT library is used to create a table format of the dataset.
library(tidyverse) - The tidyverse library is used in cleaning and summarizing the dataset.
library(ggplot2) - The ggplot2 library is used in Exploratory Data Analysis (EDA).
library(magrittr) - The magrittr library provides the pipe operator (%>%). It has two aims: to decrease development time and to improve readability and maintainability of code.
The dataset has 23 variables, described below:
Variables | Description | |
---|---|---|
track_id | Song unique ID | |
track_name | Song Name | |
track_artist | Song Artist | |
track_popularity | Song Popularity (0-100) where higher is better | |
track_album_id | Album unique ID | |
track_album_name | Song album name | |
track_album_release_date | Date when album released | |
playlist_name | Name of playlist | |
playlist_id | Playlist ID | |
playlist_genre | Playlist genre | |
playlist_subgenre | Playlist subgenre | |
danceability | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. | |
energy | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. | |
key | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. | |
loudness | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. | |
mode | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. | |
speechiness | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. | |
acousticness | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic | |
instrumentalness | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0 | |
liveness | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live | |
valence | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). | |
tempo | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. | |
duration_ms | Duration of song in milliseconds |
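The integer coding of key and mode is easier to read once mapped to labels. A minimal sketch, assuming sharps-only note names (the table spells out only 0 = C, 1 = C♯/D♭ and 2 = D; the remaining names follow standard pitch-class order). The helpers `key_to_pitch` and `mode_to_label` are hypothetical names, not part of any package:

```r
# Pitch classes 0..11 in standard order; sharps are used for the black keys.
pitch_classes <- c("C", "C#", "D", "D#", "E", "F",
                   "F#", "G", "G#", "A", "A#", "B")

# Hypothetical helper: map the integer key to a note name.
# A key of -1 means no key was detected, so it becomes NA.
key_to_pitch <- function(key) {
  out <- rep(NA_character_, length(key))
  ok <- !is.na(key) & key >= 0 & key <= 11
  out[ok] <- pitch_classes[key[ok] + 1]  # R vectors are 1-indexed
  out
}

# Hypothetical helper: 1 = major, 0 = minor.
mode_to_label <- function(mode) {
  ifelse(mode == 1, "major", "minor")
}

# Example values, including a track with no detected key
key_to_pitch(c(6, 11, 1, -1))
mode_to_label(c(1, 1, 0))
```

Keeping the original integer columns and adding labelled copies only for display leaves the numeric columns available for modelling.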
str(spotify_data)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : Factor w/ 28356 levels "0017A6SJgTbfQVU2EtsPNo",..: 22912 2531 7160 25706 4705 26672 9521 22445 26146 5283 ...
## $ track_name : Factor w/ 23449 levels "'39 - 2011 Mix",..: 9368 12887 944 3111 18360 1968 13859 15785 20934 9823 ...
## $ track_artist : Factor w/ 10692 levels "'Til Tuesday",..: 2848 6185 10633 9373 5530 2848 5000 8320 761 8562 ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : Factor w/ 22545 levels "000f3dTtvpazVzv35NuZmn",..: 7684 17645 4144 4691 21907 8636 21592 17795 21050 13719 ...
## $ track_album_name : Factor w/ 19743 levels "'74 - '75 (feat. Susan Tyler)",..: 7928 10684 981 2869 15185 1882 11515 13093 17788 8155 ...
## $ track_album_release_date: Factor w/ 4530 levels "1957-01-01","1957-03",..: 4316 4493 4336 4349 4221 4341 4356 4389 4316 4321 ...
## $ playlist_name : Factor w/ 449 levels "\"Permanent Wave\"",..: 309 309 309 309 309 309 309 309 309 309 ...
## $ playlist_id : Factor w/ 471 levels "0275i1VNfBnsNbPl0QIBpG",..: 237 237 237 237 237 237 237 237 237 237 ...
## $ playlist_genre : Factor w/ 6 levels "edm","latin",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ playlist_subgenre : Factor w/ 24 levels "album rock","big room",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
There are 32833 observations and 23 variables.
We will only rename a few variables. Otherwise, the data structure and data types are suitable for our analysis.
spotify_data <- spotify_data %>%
  rename(
    genre = playlist_genre,
    sub_genre = playlist_subgenre,
    release_date = track_album_release_date,
    artist = track_artist,
    popularity = track_popularity,
    album_name = track_album_name
  )
df = unique(spotify_data)
dim(df)
## [1] 32833 23
Since the number of rows is the same as before (32833), we can say that there are no fully duplicated rows.
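Note that `unique()` only drops rows that are identical in every column. The `str()` output shows 28356 distinct track_id levels across 32833 rows, so the same track can still appear more than once, presumably because it belongs to several playlists. A minimal sketch on a made-up data frame (the `toy` data and its values are assumptions for illustration) showing how the two checks differ:

```r
toy <- data.frame(
  track_id    = c("a1", "a1", "b2", "c3"),
  playlist_id = c("p1", "p2", "p1", "p1"),
  popularity  = c(66, 66, 70, 60),
  stringsAsFactors = TRUE
)

# Full-row duplicates: what unique(df) would remove
n_full_dups <- sum(duplicated(toy))
n_full_dups

# Rows whose track_id has already been seen, regardless of playlist
n_track_dups <- sum(duplicated(toy$track_id))
n_track_dups

# Keep only the first occurrence of each track
toy_tracks <- toy[!duplicated(toy$track_id), ]
nrow(toy_tracks)
```

Whether repeated tracks should be collapsed depends on the question: per-playlist summaries need every row, while per-track summaries may double count.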
missing_val = colSums(is.na(spotify_data))
missing_val
## track_id track_name artist popularity
## 0 5 5 0
## track_album_id album_name release_date playlist_name
## 0 5 0 0
## playlist_id genre sub_genre danceability
## 0 0 0 0
## energy key loudness mode
## 0 0 0 0
## speechiness acousticness instrumentalness liveness
## 0 0 0 0
## valence tempo duration_ms
## 0 0 0
We see that there are 5 missing values in each of track_name, artist and album_name.
index_track_name = which(is.na(spotify_data$track_name))
index_track_name
## [1] 8152 9283 9284 19569 19812
index_artist = which(is.na(spotify_data$artist))
index_artist
## [1] 8152 9283 9284 19569 19812
index_album_name = which(is.na(spotify_data$album_name))
index_album_name
## [1] 8152 9283 9284 19569 19812
Since the indexes are the same in all three columns, removing the missing values costs us only 5 of the 32833 observations, which is a very small share and will not affect our analysis. We therefore remove the rows with missing values from the data.
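A compact way to confirm that the missing values share the same rows, and to drop only those rows, is sketched below on a made-up data frame (`toy` and its values are assumptions for illustration):

```r
toy <- data.frame(
  track_name = c("Alive", NA, "Stay", NA),
  artist     = c("X", NA, "Y", NA),
  popularity = c(10, 20, 30, 40)
)

# Row positions with a missing value, per column
idx_name   <- which(is.na(toy$track_name))
idx_artist <- which(is.na(toy$artist))

# TRUE when both columns are missing in exactly the same rows
same_rows <- identical(idx_name, idx_artist)
same_rows

# complete.cases() flags rows with no NA in any column;
# keeping them removes the same rows as na.omit()
toy_clean <- toy[complete.cases(toy), ]
nrow(toy_clean)
```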
spotify_data <- na.omit(spotify_data)
dim(spotify_data)
## [1] 32828 23
missing_val1 = colSums(is.na(spotify_data))
missing_val1
## track_id track_name artist popularity
## 0 0 0 0
## track_album_id album_name release_date playlist_name
## 0 0 0 0
## playlist_id genre sub_genre danceability
## 0 0 0 0
## energy key loudness mode
## 0 0 0 0
## speechiness acousticness instrumentalness liveness
## 0 0 0 0
## valence tempo duration_ms
## 0 0 0
As we can see, there are no more missing values.
summary(spotify_data)
## track_id track_name artist
## 7BKLCZ1jbUBVqRi2FVlTVw: 10 Poison : 22 Martin Garrix : 161
## 14sOS5L36385FJ3OL8hew4: 9 Breathe : 21 Queen : 136
## 3eekarcy7kvN4yt5ZFzltW: 9 Alive : 20 The Chainsmokers: 123
## 0nbXyq5TXYPCO7pr3N8S4I: 8 Forever : 20 David Guetta : 110
## 0qaWEvPkts34WF68r8Dzx9: 8 Paradise: 19 Don Omar : 102
## 0rIAC4PXANcKmitJfoqmVm: 8 Stay : 19 Drake : 100
## (Other) :32776 (Other) :32707 (Other) :32096
## popularity track_album_id
## Min. : 0.00 5L1xcowSxwzFUSJzvyMp48: 42
## 1st Qu.: 24.00 5fstCqs5NpIlF42VhPNv23: 29
## Median : 45.00 7CjJb2mikwAWA1V6kewFBF: 28
## Mean : 42.48 4VFG1DOuTeDMBjBLZT7hCK: 26
## 3rd Qu.: 62.00 2HTbQ0RHwukKVXAlTmCZP2: 21
## Max. :100.00 4CzT5ueFBRpbILw34HQYxi: 21
## (Other) :32661
## album_name release_date
## Greatest Hits : 139 2020-01-10: 270
## Ultimate Freestyle Mega Mix: 42 2019-11-22: 244
## Gold : 35 2019-12-06: 235
## Malibu : 30 2019-12-13: 220
## Rock & Rios (Remastered) : 29 2013-01-01: 219
## Appetite For Destruction : 28 2019-11-15: 215
## (Other) :32525 (Other) :31425
## playlist_name
## Indie Poptimism : 308
## 2020 Hits & 2019 Hits – Top Global Tracks 🔥🔥🔥 : 247
## Permanent Wave : 244
## Hard Rock Workout : 219
## Ultimate Indie Presents... Best Indie Tracks of the 2010s : 198
## Fitness Workout Electro | House | Dance | Progressive House: 195
## (Other) :31417
## playlist_id genre sub_genre
## 4JkkvMpVl4lSioqQjeAL0q: 247 edm :6043 progressive electro house: 1809
## 37i9dQZF1DWTHM4kX49UKs: 198 latin:5153 southern hip hop : 1674
## 6KnQDwp0syvhfHOR4lWP7x: 195 pop :5507 indie poptimism : 1672
## 3xMQTDLOIGvj3lWH5e5x6F: 189 r&b :5431 latin hip hop : 1655
## 3Ho3iO0iJykgEQNbjB2sic: 182 rap :5743 neo soul : 1637
## 25ButZrVb1Zj1MJioMs09D: 109 rock :4951 pop edm : 1517
## (Other) :31708 (Other) :22864
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171
## Median :0.6720 Median :0.721000 Median : 6.000 Median : -6.166
## Mean :0.6549 Mean :0.698603 Mean : 5.374 Mean : -6.720
## 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
##
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median :1.0000 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean :0.5657 Mean :0.1071 Mean :0.1754 Mean :0.0847599
## 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
##
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187805
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225797
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253581
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
##
As we observe the summary, we find that Loudness, Instrumentalness and Tempo seem to have outliers. Let us explore them further.
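A common rule of thumb, and the one boxplot whiskers are based on, flags values lying more than 1.5 × IQR beyond the quartiles. A rough sketch applying it to the quartiles printed in the summary above (`iqr_fences` is a hypothetical helper, not part of any package):

```r
# Hypothetical helper: lower and upper k * IQR fences from two quartiles
iqr_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles taken from the summary() output above
loudness_fences <- iqr_fences(q1 = -8.171, q3 = -4.645)
tempo_fences    <- iqr_fences(q1 = 99.96,  q3 = 133.92)

loudness_fences
tempo_fences

# Reported minima compared with the lower fences
loudness_min_flagged <- -46.448 < loudness_fences[["lower"]]
tempo_min_flagged    <- 0 < tempo_fences[["lower"]]
c(loudness = loudness_min_flagged, tempo = tempo_min_flagged)
```

Both minima fall below their lower fences, which is why they stand out in the plots. Whether a flagged value is an error is a separate judgment.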
# Exploring loudness with a histogram and a boxplot
ggplot(spotify_data, aes(x = loudness)) +
  geom_histogram(binwidth = 0.1) +  # binwidth overrides bins, so bins is dropped
  ggtitle("Histogram of Loudness")
ggplot(spotify_data, aes(x = 1, y = loudness)) +
  geom_boxplot() +
  ggtitle("Boxplot of Loudness")
As the two figures above show, the data is left-skewed, with a long tail of quiet tracks (the mean of -6.720 lies below the median of -6.166), and the minimum value looks like an outlier. We plot loudness against all genres to check further.
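The direction of skew can also be checked numerically: when the mean lies below the median, the long tail is on the left. A small sketch with made-up values (`x` and the `skewness` helper are assumptions; this is a simple moment-based estimator, not a package function):

```r
# Hypothetical helper: moment-based sample skewness
# (negative = longer left tail, positive = longer right tail)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up loudness-like values with one very quiet track
x <- c(-46, -12, -8, -7, -6, -6, -5, -5, -4, -3)

mean_below_median <- mean(x) < median(x)  # mean pulled down by the low value
negative_skew     <- skewness(x) < 0      # negative skewness: left-skewed
c(mean_below_median, negative_skew)
```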
ggplot(spotify_data, aes(x = 1, y = loudness)) +
  geom_boxplot() +
  facet_grid(genre ~ ., scales = "free_x") +
  coord_flip() +
  labs(title = "Boxplot of Loudness by genre") +
  theme(axis.text.y = element_blank())
Only the latin genre has a few extreme values, including the minimum, and the minimum is within the acceptable limits (loudness typically ranges between -60 and 0 dB), so it does not look like an outlier. We therefore retain it as it is.
Now, let us check for instrumentalness.
ggplot(spotify_data, aes(x = instrumentalness)) +
  geom_histogram(binwidth = 0.1) +  # binwidth overrides bins, so bins is dropped
  ggtitle("Histogram of Instrumentalness")
The mean is much larger than the median because most observations lie below 0.1, while a minority of tracks have much higher values. Hence, the data seems correct.
Moving on to tempo, which has a minimum value of 0.
ggplot(spotify_data, aes(x = 1, y = tempo)) +
  geom_boxplot() +
  ggtitle("Boxplot of Tempo")
A tempo of 0 would mean that the estimated tempo of the track is 0 beats per minute, which does not make sense. Let us compare tempo across all genres to check.
ggplot(spotify_data, aes(x = 1, y = tempo)) +
  geom_boxplot() +
  facet_grid(genre ~ ., scales = "free_x") +
  coord_flip() +
  labs(title = "Boxplot of Tempo by genre") +
  theme(axis.text.y = element_blank())
The minimum value looks like an outlier, so we remove it from the dataset so that it does not skew our analysis. Since it is only one observation, removing it will not materially affect the results.
# Drop the single row holding the minimum tempo (0 BPM)
spotify_data <- spotify_data[-which(spotify_data$tempo == min(spotify_data$tempo)), ]
dim(spotify_data)
## [1] 32827 23
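Dropping rows with `-which(...)` works here because one row matches the condition. A caveat worth knowing: if no row matched, `which()` would return an empty vector and the negative index would select zero rows. A logical filter avoids that edge case. A minimal sketch on a made-up data frame (`toy` is an assumption, and `tempo > 0` is a variation on the minimum-based condition used above):

```r
toy <- data.frame(track = c("a", "b", "c"), tempo = c(0, 120, 98))

# Logical filter: keep rows whose tempo is strictly positive
kept <- toy[toy$tempo > 0, ]
nrow(kept)

# Edge case: no row has tempo == 500, so which() returns integer(0)
no_match <- which(toy$tempo == 500)
rows_negative_index <- nrow(toy[-no_match, ])         # drops every row
rows_logical_filter <- nrow(toy[toy$tempo != 500, ])  # keeps every row
c(rows_negative_index, rows_logical_filter)
```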
datatable(
  head(spotify_data, 50),
  extensions = 'FixedColumns',
  options = list(
    scrollY = "400px",
    scrollX = TRUE,
    fixedColumns = TRUE
  )
)
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
The main objective of the exploratory data analysis is to find patterns between the variables of interest that help us answer our questions.
We use ggplot2 to make plots that describe the data.
The graph below shows the artists with the greatest total popularity on Spotify.
by_popularity <- spotify_data %>%
  group_by(artist) %>%
  summarise(popularity = sum(popularity)) %>%
  arrange(desc(popularity)) %>%
  top_n(20)
## Selecting by popularity
by_popularity
## # A tibble: 20 x 2
## artist popularity
## <fct> <int>
## 1 Martin Garrix 7600
## 2 The Chainsmokers 7097
## 3 David Guetta 5878
## 4 Queen 5848
## 5 Calvin Harris 5625
## 6 Kygo 5405
## 7 Ed Sheeran 5122
## 8 Drake 4643
## 9 The Weeknd 4304
## 10 Khalid 4292
## 11 Don Omar 4279
## 12 J Balvin 4206
## 13 Bad Bunny 4044
## 14 Dimitri Vegas & Like Mike 3980
## 15 Post Malone 3865
## 16 Maroon 5 3766
## 17 Selena Gomez 3759
## 18 Daddy Yankee 3677
## 19 Avicii 3618
## 20 Billie Eilish 3594
by_popularity %>%
  ggplot(aes(reorder(artist, popularity), y = popularity)) +
  geom_col(fill = "sky blue") +
  #geom_label_repel(aes(label = total), size = 3) +
  coord_flip() +
  labs(title = 'Spotify | Favourite Artist | Martin Garrix',
       x = "Artist Names",
       y = "Total Track Popularity")
The plot shows that Martin Garrix has the highest total track popularity of all the artists, at 7600.
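Because the ranking sums popularity over tracks, artists with many tracks in the dataset are favoured (Martin Garrix has 161 rows in the summary above). A hedged sketch, on made-up data, of how a total and a per-track mean can rank artists differently. The `toy` values are assumptions, and base R `aggregate()` stands in for the dplyr pipeline:

```r
toy <- data.frame(
  artist     = c("A", "A", "A", "A", "B", "B"),
  popularity = c(40, 45, 50, 45, 90, 80)
)

# Total popularity per artist (what sum() in the pipeline computes)
total <- aggregate(popularity ~ artist, data = toy, FUN = sum)

# Mean popularity per track, which does not grow with the track count
average <- aggregate(popularity ~ artist, data = toy, FUN = mean)

total[order(-total$popularity), ]
average[order(-average$popularity), ]

# Top artist under each measure
top_by_total <- as.character(total$artist[which.max(total$popularity)])
top_by_mean  <- as.character(average$artist[which.max(average$popularity)])
c(total = top_by_total, mean = top_by_mean)
```

Which measure is appropriate depends on whether "popular" should reward catalogue size or per-track appeal.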
The graph below shows the genres with the greatest total popularity on Spotify.
genre_popularity <- spotify_data %>%
  group_by(genre) %>%
  summarise(popularity = sum(popularity)) %>%
  arrange(desc(popularity)) %>%
  top_n(20)
## Selecting by popularity
genre_popularity
genre_popularity %>%
  ggplot(aes(reorder(genre, popularity), y = popularity)) +
  geom_col(fill = "Gold") +
  #geom_label_repel(aes(label = total), size = 3) +
  coord_flip() +
  labs(title = 'Spotify | Popular Genre | Pop',
       x = "Genre",
       y = "Total Track Popularity")
We can see that pop is the most popular genre in the Spotify dataset.
The following plot shows the top 5 genres by total danceability.
dancebility <- spotify_data %>%
  group_by(genre) %>%
  summarise(danceability = sum(danceability)) %>%
  arrange(desc(danceability)) %>%
  top_n(5)
## Selecting by danceability
dancebility
## # A tibble: 5 x 2
## genre danceability
## <fct> <dbl>
## 1 rap 4126.
## 2 edm 3958.
## 3 latin 3676.
## 4 r&b 3640.
## 5 pop 3521.
ggplot(data = dancebility, aes(x = genre, y = danceability, group = 1)) +
  geom_line(linetype = "dashed") +
  geom_point() +
  labs(title = 'Spotify | Top 5 Danced Genre',
       x = "Genre",
       y = "Total danceability")
It looks like rap is the most danceable genre, with the highest total danceability.
The following plot shows the top 5 subgenres by total danceability.
dancebility <- spotify_data %>%
  group_by(sub_genre) %>%
  summarise(danceability = sum(danceability)) %>%
  arrange(desc(danceability)) %>%
  top_n(5)
## Selecting by danceability
dancebility
## # A tibble: 5 x 2
## sub_genre danceability
## <fct> <dbl>
## 1 latin hip hop 1197.
## 2 southern hip hop 1196.
## 3 progressive electro house 1168.
## 4 electro house 1060.
## 5 neo soul 1056.
ggplot(data = dancebility, aes(x = sub_genre, y = danceability, group = 1)) +
  geom_line(linetype = "dashed") +
  geom_point() +
  labs(title = 'Spotify | Top 5 Danced Sub-genre',
       x = "Sub-genre",
       y = "Total danceability")
It looks like hip-hop is the most popular subgenre among people who enjoy dancing: Latin hip hop and Southern hip hop top the list.
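The following plot shows the average duration, in minutes, of the 20 tracks with the highest total popularity.
duration <- spotify_data %>%
  group_by(track_name) %>%
  summarise(duration = mean(duration_ms / 60000), popularity = sum(popularity)) %>%
  arrange(desc(popularity)) %>%
  top_n(20)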
## Selecting by popularity
duration %>%
  ggplot(aes(reorder(track_name, duration), y = duration)) +
  geom_col(fill = "green") +
  coord_flip() +
  labs(title = 'Spotify | Most run time track | Closer(feat.Halsey)',
       x = "Track name",
       y = "Average duration (minutes)")
According to the plot above, Closer (feat. Halsey) has the longest average duration among the 20 most popular tracks.
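The division by 60000 in the pipeline converts milliseconds to minutes. As a quick check against the summary above, the median and mean of duration_ms translate as follows (`ms_to_min` is a hypothetical helper name):

```r
# Hypothetical helper: milliseconds to minutes (60,000 ms per minute)
ms_to_min <- function(ms) ms / 60000

# Median and mean of duration_ms from the summary() output
median_min <- ms_to_min(216000)
mean_min   <- round(ms_to_min(225797), 2)
c(median = median_min, mean = mean_min)
```

This gives a median of 3.6 minutes and a mean of about 3.76 minutes over all tracks.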
We model the data to find which parameters act as driving factors for a song's popularity.
model1 <- lm(popularity ~ acousticness + danceability + energy + instrumentalness + loudness + key + (energy * loudness) , data=spotify_data)
summary(model1)
##
## Call:
## lm(formula = popularity ~ acousticness + danceability + energy +
## instrumentalness + loudness + key + (energy * loudness),
## data = spotify_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.749 -17.593 3.196 18.988 74.255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.78721 1.71477 41.281 < 2e-16 ***
## acousticness 4.84515 0.73526 6.590 4.47e-11 ***
## danceability 6.98077 0.94383 7.396 1.43e-13 ***
## energy -30.79479 1.80115 -17.097 < 2e-16 ***
## instrumentalness -12.48794 0.61802 -20.206 < 2e-16 ***
## loudness 1.77181 0.13422 13.201 < 2e-16 ***
## key 0.01000 0.03707 0.270 0.787
## energy:loudness -0.16164 0.19650 -0.823 0.411
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.25 on 32819 degrees of freedom
## Multiple R-squared: 0.05805, Adjusted R-squared: 0.05785
## F-statistic: 288.9 on 7 and 32819 DF, p-value: < 2.2e-16
In the summary of the above model, we can see that key and the energy:loudness interaction are not statistically significant for the response variable, as their p-values (0.787 and 0.411) are greater than 0.05.
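In an R formula, `a * b` is shorthand for `a + b + a:b`, and repeated terms are collapsed, so listing energy and loudness next to `energy * loudness` does not change the model. This can be checked with `terms()` without fitting anything (the response name `y` is a placeholder):

```r
# Term labels R derives from a formula containing energy * loudness
with_star <- attr(terms(y ~ energy + loudness + energy * loudness),
                  "term.labels")
with_star

# The same terms written explicitly with the interaction operator
explicit <- attr(terms(y ~ energy + loudness + energy:loudness),
                 "term.labels")
same_terms <- identical(with_star, explicit)
same_terms

# Repeating main effects adds nothing
repeated <- attr(terms(y ~ energy + loudness + energy + loudness),
                 "term.labels")
repeated
```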
We will drop the interaction and enter loudness and energy as separate terms in the next model. Note that key is still included in the formula below.
model2 <- lm(popularity ~ acousticness + danceability + energy + instrumentalness + loudness + key + energy + loudness, data= spotify_data)
summary(model2)
##
## Call:
## lm(formula = popularity ~ acousticness + danceability + energy +
## instrumentalness + loudness + key + energy + loudness, data = spotify_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.702 -17.602 3.196 18.997 73.325
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.98461 1.41014 49.629 < 2e-16 ***
## acousticness 4.75297 0.72667 6.541 6.21e-11 ***
## danceability 7.08133 0.93587 7.567 3.93e-14 ***
## energy -29.65124 1.14528 -25.890 < 2e-16 ***
## instrumentalness -12.54553 0.61403 -20.431 < 2e-16 ***
## loudness 1.67431 0.06299 26.580 < 2e-16 ***
## key 0.01064 0.03706 0.287 0.774
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.25 on 32820 degrees of freedom
## Multiple R-squared: 0.05803, Adjusted R-squared: 0.05785
## F-statistic: 337 on 6 and 32820 DF, p-value: < 2.2e-16
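In the above model, we can see that loudness and energy entered separately are both significant for the response variable, while key remains insignificant (p = 0.774).
We use this as the final model, with an R-squared of about 5.8%, so these features explain only a small share of the variation in popularity.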
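To compare two nested linear models and read off their fit statistics programmatically, `anova()` and `summary()$r.squared` can be used. A minimal sketch on R's built-in mtcars data (the variables mpg, wt and hp are stand-ins for the Spotify columns and are not part of this analysis):

```r
# Smaller model, and a larger model that adds an interaction
fit_main <- lm(mpg ~ wt + hp, data = mtcars)
fit_int  <- lm(mpg ~ wt * hp, data = mtcars)

# Share of variance explained by each model (between 0 and 1)
r2_main <- summary(fit_main)$r.squared
r2_int  <- summary(fit_int)$r.squared
c(main = r2_main, interaction = r2_int)

# F-test of whether the extra term improves the fit
cmp <- anova(fit_main, fit_int)
p_interaction <- cmp[["Pr(>F)"]][2]  # p-value for adding wt:hp
p_interaction
```

Adding a term never lowers R-squared, so the F-test (or adjusted R-squared) is the better guide to whether a term earns its place in the model.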
Based on the analysis, we have the following results.
The top three popular singers are Martin Garrix, The Chainsmokers and David Guetta.
Pop is the most popular genre, followed by rap, latin, r&b and edm, with rock last.
Rap is the most danceable genre, while hip-hop (Latin and Southern) leads the sub-genres for dancers.
The average duration of the most popular songs is around 3.5 to 4.2 minutes.
Based on the model, the influential parameters for a popular song are acousticness, danceability, energy, instrumentalness and loudness.