In this analysis I try to understand what makes some songs on Spotify popular while others are not. I compare quantitative measures of each song, such as its energy, valence, and genre, to see whether particular ranges of these measures are associated with popularity.
To conduct my analysis, I will compare the most popular songs according to their popularity rating (the top 25%, at or above the 75th percentile) with the least popular songs (the bottom 25%, at or below the 25th percentile). Each song has quantitative measures associated with it, and I have chosen to focus on Valence, Energy, Danceability, and Song Duration. I will also split these two popularity groups by music genre to see whether the chosen metrics differ between the groups within each genre.
Through this analysis for Spotify, I hope to identify a pattern of characteristics shared by popular songs that could improve song recommendations for users. More popular songs generate more playbacks, and adding more songs that fit the characteristics of a “popular song” could bring in more users. This does not mean we should exclude unpopular songs in favor of popular ones, but the analysis can give some insight into what kinds of songs users like, and that insight can be used to bring in more users.
library(ggplot2) # for visualizing the data
library(dplyr) # for data manipulation
library(tibble) # to create tibble data
library(tidyr) # to create variable data from music genre
library(magrittr) # for simplifying code with pipes
library(DT) # for displaying interactive data tables
library(scales) # Scale functions for visualizations
library(knitr) # combining tables on Markdown
Before we can understand what makes a song on Spotify popular, we must clean our data set.
raw_data <- read.csv("spotify_songs.csv")
str(raw_data)
The data set has 23 variables, and missing values are stored as NA.
Valence - This measures how positive a song is. Songs with a higher valence (closer to 1) sound happier, more cheerful, and more euphoric. Songs with a lower valence (closer to 0) sound more negative and convey sadness, depression, or anger. The range is from 0 to 1.0.
Energy - This measures perceptual intensity and activity. Songs with a higher energy feel faster, louder, and noisier than those with a lower energy. Heavy metal would have a higher energy rating (closer to 1) while classical music would have a lower energy rating (closer to 0). The range is from 0 to 1.0.
Danceability - This measures how suitable a song is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A song that rates as more danceable will be closer to 1.0 while a song that rates lower will be closer to 0. The range is from 0 to 1.0.
duration_ms (song_duration) - This measures the length of the song in milliseconds. I’ve converted this column to represent each song’s length in minutes and seconds. For example, a song with a duration_ms of 200960 will have a length of 3 minutes and 20 seconds (3:20).
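The millisecond-to-minutes conversion described for duration_ms can be checked by hand with a small helper. This is only an illustrative sketch: `ms_to_min_sec` is a hypothetical function name that is not part of the analysis, and it drops fractional seconds the same way the `%M:%S` format does.

```r
# Hypothetical helper (not part of the original analysis): convert a duration
# in milliseconds to an "M:SS" label, dropping fractional seconds.
ms_to_min_sec <- function(ms) {
  total_seconds <- ms %/% 1000        # whole seconds, fraction dropped
  minutes <- total_seconds %/% 60     # whole minutes
  seconds <- total_seconds %% 60      # leftover seconds
  sprintf("%d:%02d", as.integer(minutes), as.integer(seconds))
}

ms_to_min_sec(200960)
# [1] "3:20"
```

The function is vectorized, so it could also be applied to a whole column of durations.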
First, I will count the missing values and remove them to create a data set with no missing values. Then I will remove duplicated songs, and I will also remove remixes because they could skew the results. Remixes tend to be hit or miss, and an unpopular remix could pull down the average popularity of an artist. Finally, I will convert the duration_ms column from milliseconds to minutes and seconds and rename it to song_duration.
colSums(is.na(raw_data))
# counts the NA values in each column
complete_data<- na.omit(raw_data)
# creates new data frame with non NA observations
colSums(is.na(complete_data))
#counts the NA values in the new data frame
unique_data <- distinct(complete_data,track_name, .keep_all= TRUE)
# removes observations with the same song title
data_without_remix <- unique_data[!grepl("Remix",unique_data$track_name),]
# removes the remix songs
clean_data <- select(data_without_remix, -c(track_album_id, playlist_id, track_id))
#removes unique identifier columns for each track and playlist
clean_data$duration_ms <- format(as.POSIXct(Sys.Date()) + clean_data$duration_ms/1000, "%M:%S")
# This changes the format of time used from milliseconds to minutes and seconds for each song
clean_data$duration_ms <- as.POSIXct(clean_data$duration_ms, format = "%M:%S")
# This converts the data type of duration_ms from character to POSIXct. This conversion is necessary in order to scale my axis when working with time data
colnames(clean_data)[colnames(clean_data) == "duration_ms"] <- "song_duration"
# This renames the duration_ms column to song_duration since milliseconds will no longer be used (matching by name instead of by position 20)
To get my data ready for analysis, I will add an extra column called popularity_ranking. Songs at or below the 25th percentile on the popularity scale (the bottom 25%) will get a rating of Low. Songs at or above the 75th percentile (the top 25%) will get a rating of High. Everything in between will get a rating of NA. I will also convert this column to a factor in order to separate the data.
quantile(clean_data$track_popularity, .25)
## 25%
## 23
quantile(clean_data$track_popularity, .75)
## 75%
## 58
clean_data <- mutate(clean_data,
  popularity_ranking = ifelse(track_popularity >= 58, "High",
                       ifelse(track_popularity <= 23, "Low", "NA")))
# creates a new variable that rates the songs on popularity. The bottom 25% get a rating of Low, the top 25% (at or above the 75th percentile) get a rating of High, and everything in between is labeled "NA".
clean_data$popularity_ranking <- as.factor(clean_data$popularity_ranking)
# changes the data type of this new variable to factor
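The cut-offs 23 and 58 above are typed in by hand from the quantile output. As a hedged alternative, the sketch below computes them once and reuses them, so the labels stay correct if the data changes. The data frame `toy` is made-up data standing in for clean_data, and using a real missing value instead of the string "NA" is my own variation, not the approach used in this analysis.

```r
library(dplyr)

# Made-up popularity scores standing in for clean_data (assumption).
toy <- data.frame(track_popularity = c(5, 10, 20, 30, 40, 50, 60, 70, 80))

# Compute both cut-offs once instead of typing them by hand.
q <- quantile(toy$track_popularity, probs = c(0.25, 0.75))

toy <- toy %>%
  mutate(popularity_ranking = case_when(
    track_popularity >= q[["75%"]] ~ "High",   # top 25%
    track_popularity <= q[["25%"]] ~ "Low",    # bottom 25%
    TRUE ~ NA_character_                       # middle 50% left missing
  ))

# Keep only the two groups of interest.
toy_p <- filter(toy, !is.na(popularity_ranking))
```

With a real NA, the subsetting step becomes `filter(!is.na(popularity_ranking))` rather than a comparison against the text "NA".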
I will subset my data to keep only the songs rated High or Low on the popularity ranking scale. These go into a new data frame called p_data. It's important that the two groups are as close as possible in size so that the comparison is fair.
p_data <- clean_data %>%
  filter(popularity_ranking != "NA")
ggplot(p_data) + geom_bar(aes(x = popularity_ranking)) + ggtitle("High vs Low song counts") +
  xlab("Popularity Ranking")
clean_data %>%
  filter(popularity_ranking == "NA") %>%
  summarise(count_of_na_removed = n())
## count_of_na_removed
## 1 10511
# This counts the songs removed for being labeled "NA", that is, the songs between the 25th and 75th percentiles of popularity
datatable(top_n(p_data,10,track_name), caption = 'Top 10 Observations')
ggplot(p_data, aes(popularity_ranking, fill = popularity_ranking)) +
  geom_bar() + facet_wrap(~playlist_genre, ncol = 3)
## # A tibble: 6 x 5
## playlist_genre High_count Low_count pct_High pct_Low
## <chr> <int> <int> <dbl> <dbl>
## 1 edm 288 1094 20.8 79.2
## 2 latin 855 673 56.0 44.0
## 3 pop 1620 860 65.3 34.7
## 4 r&b 744 1033 41.9 58.1
## 5 rap 1268 990 56.2 43.8
## 6 rock 970 871 52.7 47.3
Comparing the High and Low groups reveals a discrepancy among the genres. edm shows the largest gap: 79.2% of the edm songs in these two groups are in the Low group, meaning that far more edm songs rank in the bottom 25% on the popularity scale than in the top 25%. The next biggest gap is within the pop genre: 65.3% of the pop songs in these two groups fall in High, meaning that far more pop songs rank in the top 25% than in the bottom 25%.
However, from a more general sense, edm and r&b are the only two genres that have a higher proportion of Low songs than High songs, while latin, pop, rap and rock all have a higher proportion of High songs versus Low songs.
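The code that produced the genre count table is not shown in this report. One way such a table could be built is sketched below, assuming a data frame with the columns playlist_genre and popularity_ranking. The data frame `toy_p` is made-up data, and the column names simply mirror the table above.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for p_data with only the two columns the table needs.
toy_p <- data.frame(
  playlist_genre     = c("edm", "edm", "edm", "edm", "pop", "pop", "pop"),
  popularity_ranking = c("High", "Low", "Low", "Low", "High", "High", "Low")
)

genre_counts <- toy_p %>%
  count(playlist_genre, popularity_ranking) %>%          # one row per genre/group
  pivot_wider(names_from = popularity_ranking,
              values_from = n,
              values_fill = 0,
              names_glue = "{popularity_ranking}_count") %>%
  mutate(pct_High = round(100 * High_count / (High_count + Low_count), 1),
         pct_Low  = round(100 * Low_count  / (High_count + Low_count), 1))

genre_counts
```

In the toy data, edm has 1 High and 3 Low songs, so pct_High is 25 and pct_Low is 75.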
ggplot(p_data, aes(x = valence, fill = popularity_ranking)) +
  geom_density(alpha = 0.4) +
  theme_bw(base_size = 24) +
  labs(fill = "") +
  scale_fill_manual(values = c("yellow", "navyblue")) +
  facet_wrap(~playlist_genre, ncol = 3) +
  theme(text = element_text(size = 20),
        axis.text.x = element_text(angle = 90, hjust = 1))
## # A tibble: 6 x 5
## playlist_genre Low_mean Low_median range_Low sd_Low
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 edm 0.38 0.34 0.95 0.24
## 2 latin 0.63 0.66 0.93 0.22
## 3 pop 0.5 0.49 0.95 0.24
## 4 r&b 0.570 0.59 0.93 0.22
## 5 rap 0.51 0.53 0.94 0.22
## 6 rock 0.54 0.54 0.97 0.25
## # A tibble: 6 x 5
## playlist_genre high_mean high_median range_high sd_high
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 edm 0.48 0.48 0.93 0.25
## 2 latin 0.63 0.67 0.98 0.21
## 3 pop 0.51 0.51 0.94 0.22
## 4 r&b 0.49 0.48 0.94 0.23
## 5 rap 0.51 0.51 0.93 0.22
## 6 rock 0.54 0.52 0.94 0.23
The statistics and density plots suggest that valence is associated with popularity in some genres. For example, the mean valence for edm was 0.38 in the Low group and 0.48 in the High group, so sadder edm songs tend to be less popular. This makes sense because edm songs are often played at festivals, where the mood is generally happy and uplifted. On the other hand, in the r&b genre, songs with a lower valence score (sadder) were more popular: the mean valence was 0.49 in the High group and 0.57 in the Low group. One possible explanation is that sad love songs are more likely to become popular in this genre. The other genres showed differences as well, but they were too slight to support a definitive conclusion.
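The code behind the per-genre summary tables is not shown in this report. A minimal sketch of how they could be computed follows. The helper `summarise_metric` and the data frame `toy_p` are hypothetical, and I assume that the range columns are the maximum minus the minimum.

```r
library(dplyr)

# Hypothetical helper: per-genre summary of one numeric column for one group.
summarise_metric <- function(data, metric, group) {
  data %>%
    filter(popularity_ranking == group) %>%
    group_by(playlist_genre) %>%
    summarise(
      metric_mean   = round(mean({{ metric }}), 2),
      metric_median = round(median({{ metric }}), 2),
      metric_range  = round(max({{ metric }}) - min({{ metric }}), 2),  # assumed: max - min
      metric_sd     = round(sd({{ metric }}), 2),
      .groups = "drop"
    )
}

# Made-up stand-in for p_data.
toy_p <- data.frame(
  playlist_genre     = c("edm", "edm", "edm", "edm"),
  popularity_ranking = c("Low", "Low", "High", "High"),
  valence            = c(0.2, 0.4, 0.5, 0.7)
)

summarise_metric(toy_p, valence, "Low")
summarise_metric(toy_p, valence, "High")
```

The same helper could be reused for energy, danceability, or any other numeric column by changing the second argument.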
ggplot(p_data, aes(x = energy, fill = popularity_ranking)) +
  geom_density(alpha = 0.4) +
  theme_bw(base_size = 24) +
  labs(fill = "") +
  scale_fill_manual(values = c("yellow", "navyblue")) +
  facet_wrap(~playlist_genre, ncol = 3) +
  theme(text = element_text(size = 20),
        axis.text.x = element_text(angle = 90, hjust = 1))
## # A tibble: 6 x 5
## playlist_genre Low_mean Low_median range_Low sd_Low
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 edm 0.82 0.84 0.72 0.14
## 2 latin 0.74 0.76 0.87 0.16
## 3 pop 0.72 0.75 0.92 0.18
## 4 r&b 0.61 0.62 0.96 0.18
## 5 rap 0.68 0.69 0.97 0.16
## 6 rock 0.73 0.77 0.91 0.2
## # A tibble: 6 x 5
## playlist_genre high_mean high_median range_high sd_high
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 edm 0.79 0.82 0.68 0.13
## 2 latin 0.71 0.73 0.93 0.15
## 3 pop 0.69 0.72 0.95 0.17
## 4 r&b 0.56 0.580 0.900 0.18
## 5 rap 0.64 0.65 0.910 0.15
## 6 rock 0.71 0.76 0.98 0.2
The statistics and density plots suggest that energy is also associated with popularity. In every genre, the mean energy of the Low group is higher than that of the High group, so the more popular songs are, on average, slightly less energetic than the less popular ones. The biggest difference between the two groups was within r&b, where the means in the tables differ by 0.05 (0.61 for Low versus 0.56 for High).
ggplot(p_data, aes(x = danceability, fill = popularity_ranking)) +
  geom_density(alpha = 0.4) +
  theme_bw(base_size = 24) +
  labs(fill = "") +
  scale_fill_manual(values = c("yellow", "navyblue")) +
  facet_wrap(~playlist_genre, ncol = 3) +
  theme(text = element_text(size = 20),
        axis.text.x = element_text(angle = 90, hjust = 1))
## # A tibble: 6 x 5
## playlist_genre Low_mean Low_median range_Low sd_Low
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 edm 0.66 0.66 0.76 0.12
## 2 latin 0.71 0.72 0.75 0.12
## 3 pop 0.62 0.63 0.82 0.13
## 4 r&b 0.67 0.69 0.82 0.14
## 5 rap 0.7 0.71 0.75 0.14
## 6 rock 0.51 0.51 0.92 0.14
## # A tibble: 6 x 5
## playlist_genre high_mean high_median range_high sd_high
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 edm 0.66 0.66 0.66 0.13
## 2 latin 0.72 0.74 0.9 0.12
## 3 pop 0.65 0.66 0.77 0.13
## 4 r&b 0.66 0.68 0.76 0.14
## 5 rap 0.75 0.78 0.72 0.12
## 6 rock 0.52 0.52 0.80 0.14
The statistics and density plots for danceability show only a slight association with popularity. The average danceability was slightly higher in the High group in every genre except edm, where the means were equal, and r&b, where it was slightly lower. The genre with the largest difference was rap, with a difference of 0.05 between the means of the two groups. Although this variable did not show much of a disparity, a small difference was present in most genres.
ggplot(p_data, aes(x = song_duration, fill = popularity_ranking)) +
  scale_x_datetime(breaks = date_breaks("2 min"), labels = date_format("%M:%S")) +
  geom_density(alpha = 0.4) +
  theme_bw(base_size = 24) +
  labs(fill = "") +
  scale_fill_manual(values = c("yellow", "navyblue")) +
  facet_grid(rows = "playlist_genre") +
  theme(text = element_text(size = 20),
        axis.text.x = element_text(angle = 90, hjust = 1))
## # A tibble: 6 x 5
## playlist_genre Low_mean Low_median range_Low sd_Low
## <chr> <dttm> <dttm> <drtn> <dbl>
## 1 edm 2020-12-20 00:04:04 2020-12-20 00:03:37 8.066667 mins 89.0
## 2 latin 2020-12-20 00:03:44 2020-12-20 00:03:38 7.283333 mins 55.7
## 3 pop 2020-12-20 00:03:51 2020-12-20 00:03:41 6.750000 mins 50.5
## 4 r&b 2020-12-20 00:04:05 2020-12-20 00:04:00 7.183333 mins 60.2
## 5 rap 2020-12-20 00:03:44 2020-12-20 00:03:44 7.583333 mins 61.6
## 6 rock 2020-12-20 00:04:11 2020-12-20 00:04:00 8.450000 mins 73.1
## # A tibble: 6 x 5
## playlist_genre high_mean high_median range_high sd_high
## <chr> <dttm> <dttm> <drtn> <dbl>
## 1 edm 2020-12-20 00:03:18 2020-12-20 00:03:10 4.950000 mins 44.6
## 2 latin 2020-12-20 00:03:34 2020-12-20 00:03:31 5.900000 mins 41.6
## 3 pop 2020-12-20 00:03:32 2020-12-20 00:03:28 6.733333 mins 39.5
## 4 r&b 2020-12-20 00:03:42 2020-12-20 00:03:37 6.566667 mins 51.1
## 5 rap 2020-12-20 00:03:27 2020-12-20 00:03:25 5.900000 mins 52.2
## 6 rock 2020-12-20 00:04:07 2020-12-20 00:03:58 6.750000 mins 61.4
The statistics and density plots for song_duration show that the more popular songs are shorter on average. In every genre, the average length of songs in the High group was shorter than in the Low group. The largest difference was within the edm genre, with an average difference of 46 seconds (4:04 versus 3:18) between the Low and High groups. The smallest difference was within the rock genre, with an average difference of 4 seconds between the Low and High groups. The range was also narrower in the High group in every genre, although only marginally for pop, meaning that song lengths in the High group are not as spread out as those in the Low group.
The variables I used were helpful in distinguishing popular songs from less popular ones. edm music was less likely to be popular in proportion to the total, but when it was, it tended to rate higher on positivity (valence). Another interesting finding was that, in every genre, the popular songs had slightly lower energy on average than the unpopular ones. When it comes to how danceable a song is, popular rap music scored higher on the danceability scale than unpopular rap, while other genres showed minimal differences. Lastly, I found it interesting that the popular songs were shorter on average than the unpopular songs in every genre. Maybe this has to do with attention spans, or maybe we simply prefer shorter music.
For future analysis, I think it would be helpful to conduct a correlation analysis or build a heat map of pairs of variables in order to get a deeper understanding. One of my biggest limitations was the exclusion of a large share of the data. Because the songs that fell between the 25th and 75th percentiles of popularity were dropped, the amount of data I had to work with was severely reduced. I would also include more groups in order to capture the middle of the distribution, such as a group for songs within a certain distance of the mean popularity score.
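As a rough illustration of the correlation analysis suggested above, the sketch below builds a correlation matrix and a simple heat map. The values in `toy` are made up and stand in for the numeric Spotify columns; none of this was run on the real data.

```r
library(ggplot2)

# Made-up numeric columns standing in for the real data (assumption).
toy <- data.frame(
  track_popularity = c(10, 25, 40, 55, 70, 85),
  valence          = c(0.2, 0.5, 0.3, 0.6, 0.4, 0.8),
  energy           = c(0.9, 0.8, 0.85, 0.6, 0.7, 0.5)
)

# Pairwise Pearson correlations between all columns.
cor_matrix <- round(cor(toy), 2)

# Reshape the matrix to long format: one row per pair of variables.
cor_long <- as.data.frame(as.table(cor_matrix))
names(cor_long) <- c("var_x", "var_y", "correlation")

# Heat map with one tile per pair of variables.
heat_map <- ggplot(cor_long, aes(x = var_x, y = var_y, fill = correlation)) +
  geom_tile() +
  geom_text(aes(label = correlation))
```

On the full data set, all songs could be kept for this step, which would avoid dropping the middle 50%.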