Synopsis

I will attempt to understand what makes some songs on Spotify popular while others are not. I compare quantitative attributes of the songs, such as energy, valence, and genre, to see whether certain ranges of these attributes could help explain a song's popularity.

To conduct my analysis, I will compare the most popular songs (popularity rating at or above the 75th percentile, i.e. the top 25%) with the least popular songs (at or below the 25th percentile, i.e. the bottom 25%). Each song has quantitative measures associated with it, and I will focus on Valence, Energy, Danceability, and Song Duration. I will also split these two popularity groups by music genre and see whether the chosen metrics differ between the groups within each genre.

By conducting this analysis for Spotify, I hope to identify a pattern of characteristics shared by popular songs that could improve song recommendations for users. More popular songs generate more playbacks, and adding more songs that fit the characteristics of a “popular song” can bring in more users. This does not mean we should exclude unpopular songs in favor of popular ones, but the analysis could give us some insight into what kinds of songs users like, and we could use that information to bring in more.

Packages Required

library(ggplot2) # for visualizing the data
library(dplyr) # for data manipulation
library(tibble) # to create tibble data
library(tidyr) # to create variable data from music genre
library(magrittr) # for simplifying code with pipes
library(DT) # loads visual data table
library(scales) # Scale functions for visualizations
library(knitr) # combining tables on Markdown

Data Preparation

Before we can understand what makes a song on Spotify popular, we must clean our data set.

Loading the Data

raw_data <- read.csv("spotify_songs.csv")

Describing initial data set

str(raw_data)

The data set has 23 variables, and missing values are stored as NA.

Describing the Variables of Choice

  • Valence - This measures how positive a song sounds. Songs with a higher valence (closer to 1) sound happier, more cheerful, or euphoric. Songs with a lower valence (closer to 0) sound more negative and convey sadness, depression, or anger. The range is from 0 to 1.0.

  • Energy - This is a perceptual measure of intensity and activity. Songs with a higher energy feel faster, louder, and noisier than those with a lower energy. Heavy metal would have a higher energy rating (closer to 1) while classical music would have a lower energy rating (closer to 0). The range is from 0 to 1.0.

  • Danceability - This measures how suitable a song is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A song that rates as more danceable will be closer to 1.0 while a song that rates lower will be closer to 0. The range is from 0 to 1.0.

  • duration_ms (song_duration) - This measures the length of the song in milliseconds. I’ve converted this column to represent each song’s length in minutes and seconds. For example, a song with a duration_ms of 200960 (200.96 seconds) will have a length of 3 minutes and 20 seconds (3:20).
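
The millisecond-to-minutes arithmetic described for duration_ms can be sketched in base R. The helper name ms_to_min_sec is hypothetical and only illustrates the calculation; it is not the conversion used in the cleaning code, which relies on POSIXct formatting.

```r
# Hypothetical helper (not part of the original analysis): converts a
# duration in milliseconds to an "M:SS" string, dropping fractional seconds.
ms_to_min_sec <- function(ms) {
  total_seconds <- ms %/% 1000        # whole seconds
  minutes <- total_seconds %/% 60     # whole minutes
  seconds <- total_seconds %% 60      # leftover seconds
  sprintf("%d:%02d", as.integer(minutes), as.integer(seconds))
}

ms_to_min_sec(200960)
```

Because the helper is vectorised, it could be applied to a whole column of durations at once.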

Data Cleaning

First I will count the missing values and remove them to create a data set with no missing values. Then I will remove duplicated songs (rows sharing a track name), and I will also remove remixes because they could skew the results: remixes tend to be hit or miss, and an unpopular remix could pull down an artist's average popularity. I will also drop the identifier columns for tracks, albums, and playlists. Finally, I will convert the duration_ms column from milliseconds to minutes and seconds and rename it song_duration.

colSums(is.na(raw_data)) 
# counts the NA values in each column

complete_data<- na.omit(raw_data) 
# creates new data frame with non NA observations

colSums(is.na(complete_data)) 
#counts the NA values in the new data frame

unique_data <- distinct(complete_data,track_name, .keep_all= TRUE) 
# removes observations with the same song title

data_without_remix <- unique_data[!grepl("Remix",unique_data$track_name),]
# removes the remix songs

clean_data <- select(data_without_remix, -c(track_album_id, playlist_id, track_id))
#removes unique identifier columns for each track and playlist

clean_data$duration_ms <- format(as.POSIXct(Sys.Date()) + clean_data$duration_ms/1000, "%M:%S")
# This changes the format of time used from milliseconds to minutes and seconds for each song

clean_data$duration_ms <- as.POSIXct(clean_data$duration_ms, format = "%M:%S")
# This converts the data type of duration_ms from character to POSIXct. This conversion is necessary in order to scale my axis when working with time data

names(clean_data)[names(clean_data) == "duration_ms"] <- "song_duration"
# This renames the duration_ms column to song_duration (by name rather than by position) since milliseconds will no longer be used
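
The remix filter in the cleaning code matches the literal string "Remix", so titles spelled "remix" or "REMIX" would not be removed. A minimal sketch of a case-insensitive variant, using made-up track names; this is an assumption for illustration only, and switching to it would change the observation counts reported for the clean data set.

```r
# Made-up track names for illustration only (not from the Spotify data).
track_name <- c("Song A", "Song A - Remix", "Song B (remix)",
                "Song C REMIX", "Premix Blues")

# Case-sensitive match, as in the cleaning step: only "Remix" is caught.
is_remix_exact <- grepl("Remix", track_name)

# Case-insensitive match on the whole word "remix" (word boundaries keep
# titles such as "Premix Blues" from being flagged).
is_remix_any_case <- grepl("\\bremix\\b", track_name,
                           ignore.case = TRUE, perl = TRUE)

# Tracks that would be kept under the case-insensitive rule.
track_name[!is_remix_any_case]
```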

Data Cleaning Results

  • The original raw data set had 32833 observations with 23 variables
  • The clean data set has 21777 observations with 20 variables
  • a reduction of 11056 observations or about 34%
  • a reduction of 3 variables

Data Processing

To get my data ready for analysis, I’ll add an extra column called popularity_ranking. Songs at or below the 25th percentile of track_popularity (the bottom 25%) get a rating of Low. Songs at or above the 75th percentile (the top 25%) get a rating of High. Everything in between gets the label “NA”. I will also convert this column to a factor in order to separate the data.

quantile(clean_data$track_popularity, .25)
## 25% 
##  23
quantile(clean_data$track_popularity, .75)
## 75% 
##  58
clean_data <- mutate(clean_data, 
        popularity_ranking = ifelse(track_popularity >= 58, "High", ifelse(track_popularity <= 23, "Low", "NA")))
# creating a new variable that rates the songs on popularity. Songs at or below the 25th percentile are rated Low, songs at or above the 75th percentile are rated High, and everything in between gets the character label "NA" (a string, not a true missing value).


clean_data$popularity_ranking <- as.factor(clean_data$popularity_ranking)
# changing the data type of this new variable to factor
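
The cutoffs 23 and 58 are typed in by hand from the quantile output. A minimal sketch of deriving them from the data and using true missing values for the middle group, shown on toy scores rather than the Spotify data; the variable names low_cut and high_cut are assumptions for illustration.

```r
# Toy popularity scores 0..100 (illustrative, not the Spotify data).
toy <- data.frame(track_popularity = 0:100)

# Derive the cutoffs from the data instead of hard-coding them.
low_cut  <- quantile(toy$track_popularity, 0.25, names = FALSE)
high_cut <- quantile(toy$track_popularity, 0.75, names = FALSE)

# Use a real missing value (NA_character_) for the middle group so that
# is.na() can be used to filter it, rather than the string "NA".
toy$popularity_ranking <- ifelse(
  toy$track_popularity >= high_cut, "High",
  ifelse(toy$track_popularity <= low_cut, "Low", NA_character_)
)
toy$popularity_ranking <- factor(toy$popularity_ranking,
                                 levels = c("High", "Low"))

# Counts per group, including the middle group that would be dropped.
table(toy$popularity_ranking, useNA = "ifany")
```

With this variant, the later subsetting step would filter with !is.na(popularity_ranking) instead of comparing against the string "NA".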

Visualizing The Distribution of High, Low and NA

I will subset my data to show only the songs that rate as either High or Low on the popularity ranking scale. This will go into a new data frame called p_data. It’s important that these two groups are as close as possible in count, so that the comparison between them is more balanced.

p_data <- clean_data %>%
  filter(popularity_ranking != "NA")
ggplot(p_data) + geom_bar(aes(x = popularity_ranking)) + ggtitle("High vs Low song counts") + 
  xlab("Popularity Ranking")

clean_data %>%
  filter(popularity_ranking =="NA") %>%
  summarise(count_of_na_removed = n())
##   count_of_na_removed
## 1               10511
# This counts the songs labeled "NA" that were removed, i.e. songs that fall between the 25th and 75th percentiles of popularity

Data Table

datatable(top_n(p_data,10,track_name), caption = 'Top 10 Observations')

Summarizing Data Based Off Of Playlist Genre (High vs Low)

ggplot(p_data, aes(popularity_ranking, fill = popularity_ranking)) +
  geom_bar() + facet_wrap(ncol = 3,"playlist_genre")

Genre Statistics

## # A tibble: 6 x 5
##   playlist_genre High_count Low_count pct_High pct_Low
##   <chr>               <int>     <int>    <dbl>   <dbl>
## 1 edm                   288      1094     20.8    79.2
## 2 latin                 855       673     56.0    44.0
## 3 pop                  1620       860     65.3    34.7
## 4 r&b                   744      1033     41.9    58.1
## 5 rap                  1268       990     56.2    43.8
## 6 rock                  970       871     52.7    47.3
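
The code that produced the genre statistics is not shown. One way such a table could be computed is sketched below in base R; the data frame songs is a stand-in for p_data rebuilt from the counts in the table, and the object names are assumptions.

```r
genres     <- c("edm", "latin", "pop", "r&b", "rap", "rock")
high_count <- c(288, 855, 1620, 744, 1268, 970)
low_count  <- c(1094, 673, 860, 1033, 990, 871)

# Rebuild one row per song from the reported counts (stand-in for p_data).
songs <- data.frame(
  playlist_genre     = rep(rep(genres, 2), times = c(high_count, low_count)),
  popularity_ranking = rep(c("High", "Low"),
                           times = c(sum(high_count), sum(low_count)))
)

# Cross-tabulate genre by ranking, then convert each row to percentages.
counts <- table(songs$playlist_genre, songs$popularity_ranking)
pct    <- round(100 * prop.table(counts, margin = 1), 1)

genre_stats <- data.frame(
  playlist_genre = rownames(counts),
  High_count     = as.vector(counts[, "High"]),
  Low_count      = as.vector(counts[, "Low"]),
  pct_High       = as.vector(pct[, "High"]),
  pct_Low        = as.vector(pct[, "Low"])
)
genre_stats
```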

When comparing the High and Low groups, the split differs noticeably across genres. edm has the largest gap: 79.2% of edm songs in these two groups are in the Low group, meaning that many more edm songs rank in the bottom 25% on the popularity scale than in the top 25%. The next largest gap is within pop: 65.3% of pop songs in these two groups fall in High, meaning that many more pop songs rank in the top 25% than in the bottom 25%.

More generally, edm and r&b are the only two genres with a higher proportion of Low songs than High songs, while latin, pop, rap and rock all have a higher proportion of High songs than Low songs.

Summarizing Data Based Off Of Playlist Genre (High vs Low) and Quantitative variables

Valence

Valence Distribution Across Genres

Density Plot Of Valence Across Genres

ggplot(p_data, aes(x =valence , fill = popularity_ranking)) +
  geom_density(alpha = 0.4) +
  theme_bw(base_size = 24) +
  labs(fill = "") +
  scale_fill_manual(values = c("yellow", "navyblue")) + facet_wrap(ncol = 3,~playlist_genre)+
  theme(text = element_text(size=20),
        axis.text.x = element_text(angle=90, hjust=1))

Valence Statistics Across Genres

Low

## # A tibble: 6 x 5
##   playlist_genre Low_mean Low_median range_Low sd_Low
##   <chr>             <dbl>      <dbl>     <dbl>  <dbl>
## 1 edm               0.38        0.34      0.95   0.24
## 2 latin             0.63        0.66      0.93   0.22
## 3 pop               0.5         0.49      0.95   0.24
## 4 r&b               0.570       0.59      0.93   0.22
## 5 rap               0.51        0.53      0.94   0.22
## 6 rock              0.54        0.54      0.97   0.25

High

## # A tibble: 6 x 5
##   playlist_genre high_mean high_median range_high sd_high
##   <chr>              <dbl>       <dbl>      <dbl>   <dbl>
## 1 edm                 0.48        0.48       0.93    0.25
## 2 latin               0.63        0.67       0.98    0.21
## 3 pop                 0.51        0.51       0.94    0.22
## 4 r&b                 0.49        0.48       0.94    0.23
## 5 rap                 0.51        0.51       0.93    0.22
## 6 rock                0.54        0.52       0.94    0.23
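
The code behind these summary tables is not shown. A minimal base R sketch of how per-genre mean, median, range, and standard deviation could be computed, on made-up values; the function name summarise_by_genre is hypothetical, and the range is assumed to be the maximum minus the minimum.

```r
# Hypothetical helper: per-genre summary of one numeric column.
summarise_by_genre <- function(df, variable) {
  groups <- split(df[[variable]], df$playlist_genre)
  data.frame(
    playlist_genre = names(groups),
    mean   = round(sapply(groups, mean), 2),
    median = round(sapply(groups, median), 2),
    range  = round(sapply(groups, function(x) max(x) - min(x)), 2),
    sd     = round(sapply(groups, sd), 2),
    row.names = NULL
  )
}

# Made-up valence values for two genres (not the Spotify data).
toy <- data.frame(
  playlist_genre = c("edm", "edm", "edm", "pop", "pop", "pop"),
  valence        = c(0.2, 0.4, 0.6, 0.5, 0.5, 0.8)
)
valence_stats <- summarise_by_genre(toy, "valence")
valence_stats
```

Running the helper once on the Low subset and once on the High subset would give two tables in the same shape as the ones in this report.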

Looking at the statistics and visuals for valence, this variable appears to be related to popularity in some genres. For example, the mean valence for edm in the Low group was 0.38 while the mean valence in the High group was 0.48, which suggests that sadder edm songs tend to be less popular. This is plausible because edm is often played at festivals, where the mood is generally happy and uplifted. On the other hand, songs with a lower valence score (more sad) were more popular within the r&b genre: the mean valence for the High group in r&b was 0.49 while the mean for the Low group was 0.57. One possible explanation is that sad love songs do well in this genre. The other genres showed differences as well, but they were too slight to support a definitive conclusion.

Energy

Energy Distribution Across Genres

Density Plot of Energy Across Genres

ggplot(p_data, aes(x =energy , fill = popularity_ranking)) +
  geom_density(alpha = 0.4) +
  theme_bw(base_size = 24) +
  labs(fill = "") +
  scale_fill_manual(values = c("yellow", "navyblue")) + facet_wrap(ncol = 3,~playlist_genre)+
  theme(text = element_text(size=20),
        axis.text.x = element_text(angle=90, hjust=1))

Energy Statistics Across Genres

Low

# A tibble: 6 x 5
  playlist_genre Low_mean Low_median range_Low sd_Low
  <chr>             <dbl>      <dbl>     <dbl>  <dbl>
1 edm                0.82       0.84      0.72   0.14
2 latin              0.74       0.76      0.87   0.16
3 pop                0.72       0.75      0.92   0.18
4 r&b                0.61       0.62      0.96   0.18
5 rap                0.68       0.69      0.97   0.16
6 rock               0.73       0.77      0.91   0.2 

High

# A tibble: 6 x 5
  playlist_genre high_mean high_median range_high sd_high
  <chr>              <dbl>       <dbl>      <dbl>   <dbl>
1 edm                 0.79       0.82       0.68     0.13
2 latin               0.71       0.73       0.93     0.15
3 pop                 0.69       0.72       0.95     0.17
4 r&b                 0.56       0.580      0.900    0.18
5 rap                 0.64       0.65       0.910    0.15
6 rock                0.71       0.76       0.98     0.2 

Looking at the statistics and visuals for energy, this variable also appears to be related to popularity. In every genre, the mean energy of the Low group is higher than that of the High group, so the more popular songs tend to have slightly lower energy. The gaps are small (0.02 to 0.05 in the rounded means); the biggest was within r&b, at about .05 (0.61 versus 0.56).

Danceability

Danceability Distribution Across Genres

Density Plot of Danceability Across Genres

ggplot(p_data, aes(x =danceability , fill = popularity_ranking)) +
  geom_density(alpha = 0.4) +
  theme_bw(base_size = 24) +
  labs(fill = "") +
  scale_fill_manual(values = c("yellow", "navyblue")) + facet_wrap(ncol = 3,~playlist_genre)+
  theme(text = element_text(size=20),
        axis.text.x = element_text(angle=90, hjust=1))

Danceability Statistics Across Genres

Low

# A tibble: 6 x 5
  playlist_genre Low_mean Low_median range_Low sd_Low
  <chr>             <dbl>      <dbl>     <dbl>  <dbl>
1 edm                0.66       0.66      0.76   0.12
2 latin              0.71       0.72      0.75   0.12
3 pop                0.62       0.63      0.82   0.13
4 r&b                0.67       0.69      0.82   0.14
5 rap                0.7        0.71      0.75   0.14
6 rock               0.51       0.51      0.92   0.14

High

# A tibble: 6 x 5
  playlist_genre high_mean high_median range_high sd_high
  <chr>              <dbl>       <dbl>      <dbl>   <dbl>
1 edm                 0.66        0.66       0.66    0.13
2 latin               0.72        0.74       0.9     0.12
3 pop                 0.65        0.66       0.77    0.13
4 r&b                 0.66        0.68       0.76    0.14
5 rap                 0.75        0.78       0.72    0.12
6 rock                0.52        0.52       0.80    0.14

Looking at the statistics and visuals for danceability, this variable appears to be only weakly related to popularity. The average danceability was slightly higher in the High group in all genres except edm, where the means were equal, and r&b, where it was slightly lower. The genre with the largest difference was rap, with a difference of .05 between the group means. Although this variable didn’t show much of a disparity, a small difference was present in most genres.

Song Duration

Song Duration Distribution Across Genres

Density Plot of Song Duration Across Genres

ggplot(p_data, aes(x =song_duration, fill = popularity_ranking)) +
  scale_x_datetime(breaks=date_breaks("2 min" ), labels = date_format("%M:%S"))+
  geom_density(alpha = 0.4) +
  theme_bw(base_size = 24) +
  labs(fill = "")   +
  scale_fill_manual(values = c("yellow", "navyblue")) + facet_grid(rows= "playlist_genre") +theme(text = element_text(size=20),
        axis.text.x = element_text(angle=90, hjust=1))

Song Duration Statistics Across Genres

Low

# A tibble: 6 x 5
  playlist_genre Low_mean            Low_median          range_Low      sd_Low
  <chr>          <dttm>              <dttm>              <drtn>          <dbl>
1 edm            2020-12-20 00:04:04 2020-12-20 00:03:37 8.066667 mins    89.0
2 latin          2020-12-20 00:03:44 2020-12-20 00:03:38 7.283333 mins    55.7
3 pop            2020-12-20 00:03:51 2020-12-20 00:03:41 6.750000 mins    50.5
4 r&b            2020-12-20 00:04:05 2020-12-20 00:04:00 7.183333 mins    60.2
5 rap            2020-12-20 00:03:44 2020-12-20 00:03:44 7.583333 mins    61.6
6 rock           2020-12-20 00:04:11 2020-12-20 00:04:00 8.450000 mins    73.1

High

# A tibble: 6 x 5
  playlist_genre high_mean           high_median         range_high    sd_high
  <chr>          <dttm>              <dttm>              <drtn>          <dbl>
1 edm            2020-12-20 00:03:18 2020-12-20 00:03:10 4.950000 mins    44.6
2 latin          2020-12-20 00:03:34 2020-12-20 00:03:31 5.900000 mins    41.6
3 pop            2020-12-20 00:03:32 2020-12-20 00:03:28 6.733333 mins    39.5
4 r&b            2020-12-20 00:03:42 2020-12-20 00:03:37 6.566667 mins    51.1
5 rap            2020-12-20 00:03:27 2020-12-20 00:03:25 5.900000 mins    52.2
6 rock           2020-12-20 00:04:07 2020-12-20 00:03:58 6.750000 mins    61.4

Looking at the statistics and visuals for song_duration, the more popular songs tend to be shorter. In every genre, the average length of songs in the High group was shorter than in the Low group. The largest difference was within edm, with an average difference of 46 seconds (4:04 versus 3:18) between the Low and High groups. The smallest difference was within rock, with an average difference of 4 seconds. The range was also narrower in the High group in every genre, although for pop the two ranges were nearly identical, so song lengths in the High group are generally less spread out than in the Low group.
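
The gaps between the group means can be recomputed from the mean durations in the two tables above. A small sketch in base R, with the means typed in as "M:SS" strings; the helper name to_seconds is hypothetical.

```r
# Hypothetical helper: convert "M:SS" strings to seconds.
to_seconds <- function(mm_ss) {
  parts <- strsplit(mm_ss, ":", fixed = TRUE)
  sapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]))
}

# Mean durations per genre, copied from the Low and High tables.
genres    <- c("edm", "latin", "pop", "r&b", "rap", "rock")
low_mean  <- c("4:04", "3:44", "3:51", "4:05", "3:44", "4:11")
high_mean <- c("3:18", "3:34", "3:32", "3:42", "3:27", "4:07")

# Low minus High, in seconds, largest gap first.
gap_seconds <- setNames(to_seconds(low_mean) - to_seconds(high_mean), genres)
sort(gap_seconds, decreasing = TRUE)
```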

Conclusion

The variables I used were helpful in distinguishing songs that are popular from songs that are not. edm music was less likely to be popular in proportion to the total, but when it was, it tended to rate higher on positivity (valence). Another interesting finding was that, in every genre, the popular group had slightly lower average energy than the unpopular group. When it comes to how danceable a song is, popular rap music scored higher on the danceability scale than unpopular rap, while other genres showed minimal differences. Lastly, I found it interesting that popular songs were shorter on average than unpopular songs across all genres. Maybe this has to do with attention spans, or we may simply prefer shorter music.

For future analysis, I think it would be helpful to conduct a correlation analysis (for example, a heat map of pairwise correlations between variables) in order to get a deeper understanding. One of my biggest limitations was the exclusion of a large share of the data: because the songs between the 25th and 75th percentiles of popularity were dropped (10511 of 21777 songs), the amount of data I got to work with was severely reduced. I would also include more groups in order to capture the middle of the distribution, such as a group of songs that fall within a certain distance of the mean popularity score.
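
As a rough illustration of the correlation analysis suggested for future work, the sketch below builds a correlation matrix in base R. The data are randomly generated stand-ins, not the Spotify data, so the values say nothing about real songs; the column names simply mirror the variables used in this report.

```r
set.seed(1)  # arbitrary seed so the toy data are reproducible
n <- 200
energy <- runif(n)

# Randomly generated stand-in data; valence is built to track energy loosely.
toy <- data.frame(
  track_popularity = round(100 * runif(n)),
  energy           = energy,
  valence          = pmin(1, pmax(0, energy + rnorm(n, sd = 0.2))),
  danceability     = runif(n)
)

# Pairwise Pearson correlations between all numeric columns.
cor_matrix <- round(cor(toy), 2)
cor_matrix

# heatmap() from base R's stats package could then draw the matrix,
# e.g. heatmap(cor_matrix, symm = TRUE).
```

Because the correlation uses every song, this approach would not require dropping the songs between the 25th and 75th percentiles.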