Introduction

Spotify is an audio streaming platform that provides DRM-restricted music, videos, and podcasts from record labels and media companies. It hosts over 50 million tracks, which users can browse by parameters such as artist, album, or genre, or via playlists. It pays artists and rights holders royalties amounting to approximately 70% of its revenue. This makes it a good platform for musicians not only to showcase their talent but also to earn money.

Thus, from a musician's point of view, we have the following problem statements:

  1. To find the most popular singers.

  2. To find the most popular genre.

  3. To find the most danced genre and subgenre.

  4. To find the average duration of popular songs.

  5. To model the data with popularity as the response variable.

Answering these questions might help a musician find the right singer to work with and compose songs with a higher probability of becoming popular.

Methodology: We use a Spotify dataset that was collected via the Spotify API and is available through the spotifyr package. We begin by analyzing the data structure, then clean and process the data for further use. Next, we explore the data to find relationships between variables using lists, tables, and visualizations.

Packages Required

library(DT)
library(tidyverse)
library(ggplot2)
library(magrittr) 

library(DT) - The DT library is used to render the dataset as an interactive table.

library(tidyverse) - The tidyverse library is used in cleaning and summarizing the dataset.

library(ggplot2) - The ggplot2 library is used in Exploratory Data Analysis (EDA).

library(magrittr) - The magrittr library provides the pipe operator (%>%); it has two aims: to decrease development time and to improve the readability and maintainability of code.
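
As a brief illustration, the pipe lets us read a transformation left to right instead of inside out (a minimal sketch using base R's built-in mtcars data):
# Nested form: must be read inside out
round(mean(mtcars$mpg), 1)
# Piped form: reads left to right
mtcars$mpg %>% mean() %>% round(1)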

Data Preparation

Data Importing

Spotify Dataset
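
The import step itself is not shown in this report; a minimal sketch, assuming the dataset has been downloaded as a local CSV named spotify_songs.csv (the file name is an assumption, and stringsAsFactors = TRUE is chosen to match the factor columns seen in str() below):
# Read the raw data; the file name is assumed, factors match the structure printed below
spotify_data <- read.csv("spotify_songs.csv", stringsAsFactors = TRUE)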

Data Description

There are 23 variables in the dataset:

  track_id: Song unique ID
  track_name: Song name
  track_artist: Song artist
  track_popularity: Song popularity (0-100), where higher is better
  track_album_id: Album unique ID
  track_album_name: Song album name
  track_album_release_date: Date when the album was released
  playlist_name: Name of the playlist
  playlist_id: Playlist ID
  playlist_genre: Playlist genre
  playlist_subgenre: Playlist subgenre
  danceability: Danceability describes how suitable a track is for dancing, based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
  energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
  key: The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
  loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
  mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor by 0.
  speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
  acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
  instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater the likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
  liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides a strong likelihood that the track is live.
  valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
  tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
  duration_ms: Duration of the song in milliseconds

Data Cleaning

Let us first understand the structure of the data.
str(spotify_data)
## 'data.frame':    32833 obs. of  23 variables:
##  $ track_id                : Factor w/ 28356 levels "0017A6SJgTbfQVU2EtsPNo",..: 22912 2531 7160 25706 4705 26672 9521 22445 26146 5283 ...
##  $ track_name              : Factor w/ 23449 levels "'39 - 2011 Mix",..: 9368 12887 944 3111 18360 1968 13859 15785 20934 9823 ...
##  $ track_artist            : Factor w/ 10692 levels "'Til Tuesday",..: 2848 6185 10633 9373 5530 2848 5000 8320 761 8562 ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : Factor w/ 22545 levels "000f3dTtvpazVzv35NuZmn",..: 7684 17645 4144 4691 21907 8636 21592 17795 21050 13719 ...
##  $ track_album_name        : Factor w/ 19743 levels "'74 - '75 (feat. Susan Tyler)",..: 7928 10684 981 2869 15185 1882 11515 13093 17788 8155 ...
##  $ track_album_release_date: Factor w/ 4530 levels "1957-01-01","1957-03",..: 4316 4493 4336 4349 4221 4341 4356 4389 4316 4321 ...
##  $ playlist_name           : Factor w/ 449 levels "\"Permanent Wave\"",..: 309 309 309 309 309 309 309 309 309 309 ...
##  $ playlist_id             : Factor w/ 471 levels "0275i1VNfBnsNbPl0QIBpG",..: 237 237 237 237 237 237 237 237 237 237 ...
##  $ playlist_genre          : Factor w/ 6 levels "edm","latin",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ playlist_subgenre       : Factor w/ 24 levels "album rock","big room",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...

There are 32833 observations and 23 variables.

We will only be renaming a few variables; otherwise, the data structure, with appropriate data types, is good enough for our analysis.

Renaming the variables for ease of use and understanding.
spotify_data = spotify_data %>% 
  rename(
    genre = playlist_genre,
    sub_genre = playlist_subgenre,
    release_date = track_album_release_date,
    artist = track_artist,
    popularity = track_popularity,
    album_name = track_album_name
  )
Checking for duplicate rows to ensure the data is clean.
df = unique(spotify_data)
dim(df)
## [1] 32833    23

Since the number of rows is the same as before, i.e. 32833, we can say that there are no duplicates.
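
An equivalent, more direct check counts exactly duplicated rows without building a second data frame; it should return 0 here:
# Number of rows that exactly duplicate an earlier row
sum(duplicated(spotify_data))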

Checking for missing values to understand data better.
missing_val = colSums(is.na(spotify_data))
missing_val
##         track_id       track_name           artist       popularity 
##                0                5                5                0 
##   track_album_id       album_name     release_date    playlist_name 
##                0                5                0                0 
##      playlist_id            genre        sub_genre     danceability 
##                0                0                0                0 
##           energy              key         loudness             mode 
##                0                0                0                0 
##      speechiness     acousticness instrumentalness         liveness 
##                0                0                0                0 
##          valence            tempo      duration_ms 
##                0                0                0

We see there are 5 missing values each in track_name, artist, and album_name.

Checking the indexes of the missing values.
index_track_name = which(is.na(spotify_data$track_name))
index_track_name
## [1]  8152  9283  9284 19569 19812
index_artist = which(is.na(spotify_data$artist))
index_artist
## [1]  8152  9283  9284 19569 19812
index_album_name = which(is.na(spotify_data$album_name))
index_album_name
## [1]  8152  9283  9284 19569 19812

Since the indexes are the same, removing the missing values would only lose 5 out of 32833 observations, which is relatively very small and won’t affect our analysis. Therefore, we remove the missing values from the data.

spotify_data <- na.omit(spotify_data)
dim(spotify_data)
## [1] 32828    23
Checking for missing values again.
missing_val1 = colSums(is.na(spotify_data))
missing_val1
##         track_id       track_name           artist       popularity 
##                0                0                0                0 
##   track_album_id       album_name     release_date    playlist_name 
##                0                0                0                0 
##      playlist_id            genre        sub_genre     danceability 
##                0                0                0                0 
##           energy              key         loudness             mode 
##                0                0                0                0 
##      speechiness     acousticness instrumentalness         liveness 
##                0                0                0                0 
##          valence            tempo      duration_ms 
##                0                0                0

As we can see, there are no more missing values.
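
Equivalently, a single call can confirm this; it should now return FALSE:
# TRUE if any NA remains anywhere in the data frame
anyNA(spotify_data)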

Let us check the summary statistics to further explore the data and make it suitable for data analysis.
summary(spotify_data)
##                    track_id        track_name                 artist     
##  7BKLCZ1jbUBVqRi2FVlTVw:   10   Poison  :   22   Martin Garrix   :  161  
##  14sOS5L36385FJ3OL8hew4:    9   Breathe :   21   Queen           :  136  
##  3eekarcy7kvN4yt5ZFzltW:    9   Alive   :   20   The Chainsmokers:  123  
##  0nbXyq5TXYPCO7pr3N8S4I:    8   Forever :   20   David Guetta    :  110  
##  0qaWEvPkts34WF68r8Dzx9:    8   Paradise:   19   Don Omar        :  102  
##  0rIAC4PXANcKmitJfoqmVm:    8   Stay    :   19   Drake           :  100  
##  (Other)               :32776   (Other) :32707   (Other)         :32096  
##    popularity                    track_album_id 
##  Min.   :  0.00   5L1xcowSxwzFUSJzvyMp48:   42  
##  1st Qu.: 24.00   5fstCqs5NpIlF42VhPNv23:   29  
##  Median : 45.00   7CjJb2mikwAWA1V6kewFBF:   28  
##  Mean   : 42.48   4VFG1DOuTeDMBjBLZT7hCK:   26  
##  3rd Qu.: 62.00   2HTbQ0RHwukKVXAlTmCZP2:   21  
##  Max.   :100.00   4CzT5ueFBRpbILw34HQYxi:   21  
##                   (Other)               :32661  
##                        album_name        release_date  
##  Greatest Hits              :  139   2020-01-10:  270  
##  Ultimate Freestyle Mega Mix:   42   2019-11-22:  244  
##  Gold                       :   35   2019-12-06:  235  
##  Malibu                     :   30   2019-12-13:  220  
##  Rock & Rios (Remastered)   :   29   2013-01-01:  219  
##  Appetite For Destruction   :   28   2019-11-15:  215  
##  (Other)                    :32525   (Other)   :31425  
##                                                      playlist_name  
##  Indie Poptimism                                            :  308  
##  2020 Hits & 2019 Hits – Top Global Tracks 🔥🔥🔥        :  247  
##  Permanent Wave                                             :  244  
##  Hard Rock Workout                                          :  219  
##  Ultimate Indie Presents... Best Indie Tracks of the 2010s  :  198  
##  Fitness Workout Electro | House | Dance | Progressive House:  195  
##  (Other)                                                    :31417  
##                  playlist_id      genre                          sub_genre    
##  4JkkvMpVl4lSioqQjeAL0q:  247   edm  :6043   progressive electro house: 1809  
##  37i9dQZF1DWTHM4kX49UKs:  198   latin:5153   southern hip hop         : 1674  
##  6KnQDwp0syvhfHOR4lWP7x:  195   pop  :5507   indie poptimism          : 1672  
##  3xMQTDLOIGvj3lWH5e5x6F:  189   r&b  :5431   latin hip hop            : 1655  
##  3Ho3iO0iJykgEQNbjB2sic:  182   rap  :5743   neo soul                 : 1637  
##  25ButZrVb1Zj1MJioMs09D:  109   rock :4951   pop edm                  : 1517  
##  (Other)               :31708                (Other)                  :22864  
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171  
##  Median :0.6720   Median :0.721000   Median : 6.000   Median : -6.166  
##  Mean   :0.6549   Mean   :0.698603   Mean   : 5.374   Mean   : -6.720  
##  3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##                                                                        
##       mode         speechiness      acousticness    instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   :0.5657   Mean   :0.1071   Mean   :0.1754   Mean   :0.0847599  
##  3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##                                                                        
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187805  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225797  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253581  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810  
## 

Observing the summary, we find that loudness, instrumentalness, and tempo seem to have outliers. Let us explore them further.

# Exploring Loudness with histogram and boxplots
ggplot(spotify_data, aes(x = loudness)) +
  geom_histogram(binwidth = 0.1) +
  ggtitle("Histogram of Loudness")

ggplot(spotify_data, aes(x = 1, y = loudness)) +
  geom_boxplot() +
  ggtitle('Boxplot of Loudness')

As we can see from the above two figures, the data is left skewed, and the minimum value looks like an outlier. We will plot loudness against each genre to check further.

ggplot(spotify_data, aes(x = 1, y = loudness)) +
  geom_boxplot() +
  facet_grid(genre~., scales = "free_x") +
  coord_flip() +
  labs(title = "Boxplot of Loudness by Genre") +
  theme(axis.text.y  = element_blank())

Since only the latin genre has a few extreme values, including the minimum, and since the minimum still lies within the valid loudness range (roughly -60 to 0 dB), it doesn’t appear to be an erroneous value. Hence, we retain it as it is.
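
To quantify how rare these quiet tracks are, one could count the observations below a low-loudness threshold (the -30 dB cutoff is an arbitrary choice for illustration; the exact count is not part of the original output):
# Count tracks quieter than -30 dB
sum(spotify_data$loudness < -30)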

Now, let us check for instrumentalness.

ggplot(spotify_data, aes(x = instrumentalness)) +
  geom_histogram(binwidth = 0.1) +
  ggtitle("Histogram of Instrumentalness")

The gap between the mean and the median arises because most observations do not cross 0.1, while a small minority of highly instrumental tracks pulls the mean upward. Hence, the data seems correct.
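
This can be verified by computing the share of tracks with near-zero instrumentalness (the 0.1 cutoff mirrors the first histogram bin):
# Proportion of tracks with instrumentalness below 0.1
mean(spotify_data$instrumentalness < 0.1)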

Moving on to Tempo which has a minimum value of 0.

ggplot(spotify_data, aes(x = 1, y = tempo)) +
  geom_boxplot() +
  ggtitle('Boxplot of Tempo') 

A tempo of 0 would mean the estimated beats per minute of the track is 0, which does not make sense for a music track. Let us compare tempo across all genres and check.

ggplot(spotify_data, aes(x = 1, y = tempo)) +
  geom_boxplot() +
  facet_grid(genre~., scales = "free_x") +
  coord_flip() +
  labs(title = "Boxplot of Tempo by Genre") +
  theme(axis.text.y  = element_blank())

The minimum value looks like an outlier, so we remove it from the dataset so that it doesn’t skew our analysis. Since it is only one observation, removing it will not affect the results.

spotify_data <- spotify_data[-which(spotify_data$tempo == min(spotify_data$tempo)),]
dim(spotify_data)
## [1] 32827    23
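
For reference, an equivalent and arguably more readable dplyr version of the same removal (not run here, since the row has already been dropped above):
# Keep only tracks with a positive tempo
spotify_data <- spotify_data %>% filter(tempo > 0)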

Final Data Preview

datatable(
  head(spotify_data,50),
  extensions = 'FixedColumns',
  options = list(
    scrollY = "400px",
    scrollX = TRUE,
    fixedColumns = TRUE
  )
)
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

Exploratory Data Analysis

The main objective of the exploratory data analysis will be to determine the patterns between variables of interest to discover interesting insights that would help us answer our questions.

We will use ggplot2 to make plots that illustrate these patterns and describe the information.

Barplot for Artist Popularity

The graph below shows the 20 artists with the greatest total track popularity on Spotify.

by_popularity <- spotify_data %>% 
  group_by(artist) %>%
  summarise(popularity = sum(popularity)) %>% 
  arrange(desc(popularity)) %>%
  top_n(20) 
## Selecting by popularity
by_popularity
## # A tibble: 20 x 2
##    artist                    popularity
##    <fct>                          <int>
##  1 Martin Garrix                   7600
##  2 The Chainsmokers                7097
##  3 David Guetta                    5878
##  4 Queen                           5848
##  5 Calvin Harris                   5625
##  6 Kygo                            5405
##  7 Ed Sheeran                      5122
##  8 Drake                           4643
##  9 The Weeknd                      4304
## 10 Khalid                          4292
## 11 Don Omar                        4279
## 12 J Balvin                        4206
## 13 Bad Bunny                       4044
## 14 Dimitri Vegas & Like Mike       3980
## 15 Post Malone                     3865
## 16 Maroon 5                        3766
## 17 Selena Gomez                    3759
## 18 Daddy Yankee                    3677
## 19 Avicii                          3618
## 20 Billie Eilish                   3594
by_popularity %>%
  ggplot(aes(x = reorder(artist, popularity), y = popularity)) +
  geom_col(fill = "sky blue") +
  coord_flip() +
  labs(title = 'Spotify | Favourite Artist | Martin Garrix',
       x = "Artist Names",
       y = "Total Track Popularity")

The plot shows that Martin Garrix has the highest popularity among all the artists, with a total track popularity of 7600.
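
Note that top_n() has been superseded in recent dplyr releases; with dplyr 1.0.0 or later, the same ranking can be written more explicitly (a sketch, assuming that version is available):
# Top 20 artists by total track popularity, without the "Selecting by" message
spotify_data %>%
  group_by(artist) %>%
  summarise(popularity = sum(popularity)) %>%
  slice_max(popularity, n = 20)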

Line Plot for Most Danced Genre

The following plot shows the top 5 genres that are most danced upon.

danceability_by_genre <- spotify_data %>% 
  group_by(genre) %>%
  summarise(danceability = sum(danceability)) %>% 
  arrange(desc(danceability)) %>%
  top_n(5) 
## Selecting by danceability
danceability_by_genre
## # A tibble: 5 x 2
##   genre danceability
##   <fct>        <dbl>
## 1 rap          4126.
## 2 edm          3958.
## 3 latin        3676.
## 4 r&b          3640.
## 5 pop          3521.
ggplot(data = danceability_by_genre, aes(x = genre, y = danceability, group = 1)) +
  geom_line(linetype = "dashed") +
  geom_point() +
  labs(title = 'Spotify | Top 5 Most Danced Genres',
       x = "Genre",
       y = "Total Danceability")

It looks like rap is the most danced genre.

Line Plot for Most Danced Sub-Genre

The following plot shows the top 5 sub-genres that are most danced upon.

danceability_by_subgenre <- spotify_data %>% 
  group_by(sub_genre) %>%
  summarise(danceability = sum(danceability)) %>% 
  arrange(desc(danceability)) %>%
  top_n(5) 
## Selecting by danceability
danceability_by_subgenre
## # A tibble: 5 x 2
##   sub_genre                 danceability
##   <fct>                            <dbl>
## 1 latin hip hop                    1197.
## 2 southern hip hop                 1196.
## 3 progressive electro house        1168.
## 4 electro house                    1060.
## 5 neo soul                         1056.
ggplot(data = danceability_by_subgenre, aes(x = sub_genre, y = danceability, group = 1)) +
  geom_line(linetype = "dashed") +
  geom_point() +
  labs(title = 'Spotify | Top 5 Most Danced Sub-Genres',
       x = "Sub-Genre",
       y = "Total Danceability")

It looks like hip hop dominates among listeners who enjoy dancing: both latin hip hop and southern hip hop top the list.

Data Modelling

We model the data to find which parameters act as driving factors for a song's popularity.

model1 <- lm(popularity ~ acousticness + danceability + energy + instrumentalness + loudness + key + (energy * loudness) ,  data=spotify_data)
summary(model1)
## 
## Call:
## lm(formula = popularity ~ acousticness + danceability + energy + 
##     instrumentalness + loudness + key + (energy * loudness), 
##     data = spotify_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.749 -17.593   3.196  18.988  74.255 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       70.78721    1.71477  41.281  < 2e-16 ***
## acousticness       4.84515    0.73526   6.590 4.47e-11 ***
## danceability       6.98077    0.94383   7.396 1.43e-13 ***
## energy           -30.79479    1.80115 -17.097  < 2e-16 ***
## instrumentalness -12.48794    0.61802 -20.206  < 2e-16 ***
## loudness           1.77181    0.13422  13.201  < 2e-16 ***
## key                0.01000    0.03707   0.270    0.787    
## energy:loudness   -0.16164    0.19650  -0.823    0.411    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.25 on 32819 degrees of freedom
## Multiple R-squared:  0.05805,    Adjusted R-squared:  0.05785 
## F-statistic: 288.9 on 7 and 32819 DF,  p-value: < 2.2e-16

In the summary of the above model, we can see that key and the energy:loudness interaction are insignificant for the response variable, as their p-values are greater than 0.05.
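
For reference, these p-values can also be extracted programmatically rather than read off the printout:
# Coefficient p-values from the fitted model
summary(model1)$coefficients[, "Pr(>|t|)"]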

We will drop the interaction term and use loudness and energy as separate main effects in the next model.

model2 <- lm(popularity ~ acousticness + danceability + energy + instrumentalness + loudness + key, data = spotify_data)

summary(model2)
## 
## Call:
## lm(formula = popularity ~ acousticness + danceability + energy + 
##     instrumentalness + loudness + key, data = spotify_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.702 -17.602   3.196  18.997  73.325 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       69.98461    1.41014  49.629  < 2e-16 ***
## acousticness       4.75297    0.72667   6.541 6.21e-11 ***
## danceability       7.08133    0.93587   7.567 3.93e-14 ***
## energy           -29.65124    1.14528 -25.890  < 2e-16 ***
## instrumentalness -12.54553    0.61403 -20.431  < 2e-16 ***
## loudness           1.67431    0.06299  26.580  < 2e-16 ***
## key                0.01064    0.03706   0.287    0.774    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.25 on 32820 degrees of freedom
## Multiple R-squared:  0.05803,    Adjusted R-squared:  0.05785 
## F-statistic:   337 on 6 and 32820 DF,  p-value: < 2.2e-16

In the above model, we can see that loudness and energy used as separate terms are significant for the response variable, although key remains insignificant.

So we can use this as the final model, although with an R-squared of only about 5.8% it explains a small share of the variance in popularity.
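
Since model2 is nested within model1 (it differs only by the dropped interaction term), an F-test can formally confirm that the interaction adds nothing; this check is a suggestion beyond the original output:
# Compare the nested models; a large p-value supports the simpler model2
anova(model2, model1)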

Summary

Based on the analysis, we have the following results.

  1. The top three popular singers are Martin Garrix, The Chainsmokers, and David Guetta.

  2. Pop remains the most popular genre, followed by rap, latin, r&b, and edm, with rock apparently last.

  3. Rap is the most danced genre, while hip-hop (both latin and southern) leads the sub-genres for dancers.

  4. The average duration of the most popular songs is around 3.5 to 4.2 minutes (a reproduction sketch follows this list).

  5. Based on the model, the influential parameters for a popular song are: acousticness, danceability, energy, instrumentalness, and loudness.
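
For reference, result 4 can be reproduced with a sketch like the following, where the popularity cutoff of 75 used to define "most popular" is an assumption:
# Average duration (in minutes) of highly popular tracks
spotify_data %>%
  filter(popularity > 75) %>%
  summarise(avg_duration_min = mean(duration_ms) / 60000)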