Introduction

The goal of this project was to investigate how Spotify audio features relate to a track’s popularity. The graphs use base R, ggplot2, and trelliscopejs to examine duration, energy, danceability, loudness, instrumentalness, and liveness. Each graph pairs a different set of these variables with popularity to show which patterns appear and which features are associated with how much listeners want to hear a song.

Data

data <- read.csv("spotify_analysis_dataset.csv")

Structure

str(data)
## 'data.frame':    50 obs. of  16 variables:
##  $ track_id        : chr  "TRK1000" "TRK1001" "TRK1002" "TRK1003" ...
##  $ track_name      : chr  "Song 0" "Song 1" "Song 2" "Song 3" ...
##  $ artist          : chr  "Artist 7" "Artist 4" "Artist 13" "Artist 11" ...
##  $ album           : chr  "Album 3" "Album 5" "Album 19" "Album 7" ...
##  $ release_date    : chr  "1/1/2010" "1/2/2010" "1/3/2010" "1/4/2010" ...
##  $ duration_ms     : int  240151 253767 244375 299262 256330 159504 284231 133986 181858 249312 ...
##  $ popularity      : int  70 58 85 27 65 41 44 61 56 5 ...
##  $ danceability    : num  0.037 0.61 0.503 0.051 0.279 0.908 0.24 0.145 0.489 0.986 ...
##  $ energy          : num  0.349 0.726 0.897 0.887 0.78 0.642 0.084 0.162 0.899 0.606 ...
##  $ loudness        : num  -2.87 -22.87 -1.33 -1.46 -5.88 ...
##  $ speechiness     : num  0.522 0.77 0.216 0.623 0.085 0.052 0.531 0.541 0.637 0.726 ...
##  $ acousticness    : num  0.616 0.635 0.045 0.375 0.626 0.503 0.856 0.659 0.163 0.071 ...
##  $ instrumentalness: num  0.931 0.858 0.429 0.751 0.755 0.103 0.903 0.505 0.826 0.32 ...
##  $ liveness        : num  0.947 0.986 0.753 0.376 0.084 0.777 0.558 0.424 0.906 0.111 ...
##  $ valence         : num  0.353 0.584 0.078 0.974 0.986 0.698 0.536 0.31 0.814 0.685 ...
##  $ tempo           : num  110.1 75.9 154 132.8 168.1 ...
cor(data[,6:16])
##                  duration_ms  popularity danceability        energy
## duration_ms       1.00000000 -0.35235010   0.04130791  0.3100317396
## popularity       -0.35235010  1.00000000  -0.16558637 -0.1378591340
## danceability      0.04130791 -0.16558637   1.00000000  0.1734400779
## energy            0.31003174 -0.13785913   0.17344008  1.0000000000
## loudness          0.05959870  0.02001551  -0.29604811 -0.0003322671
## speechiness      -0.03890689 -0.04165473  -0.15031221 -0.3281417512
## acousticness      0.18011688 -0.05997322  -0.14563839 -0.0060286464
## instrumentalness -0.09434179  0.36940555  -0.21024876  0.0440278141
## liveness          0.04788366  0.30537833  -0.20380103  0.2657474178
## valence          -0.08757974 -0.09344429  -0.06777716  0.0196295002
## tempo             0.10549049  0.09851274  -0.27702525 -0.1564879450
##                       loudness speechiness acousticness instrumentalness
## duration_ms       0.0595987032 -0.03890689  0.180116883      -0.09434179
## popularity        0.0200155053 -0.04165473 -0.059973219       0.36940555
## danceability     -0.2960481141 -0.15031221 -0.145638388      -0.21024876
## energy           -0.0003322671 -0.32814175 -0.006028646       0.04402781
## loudness          1.0000000000  0.09725450 -0.152796220      -0.03402803
## speechiness       0.0972545011  1.00000000 -0.187845119       0.18103566
## acousticness     -0.1527962200 -0.18784512  1.000000000      -0.04315315
## instrumentalness -0.0340280301  0.18103566 -0.043153150       1.00000000
## liveness         -0.0933493996 -0.12210790  0.169140974       0.21518757
## valence          -0.2521646576 -0.12355863 -0.014379389      -0.13938162
## tempo            -0.0710125256  0.11424786 -0.249876655       0.19667675
##                     liveness     valence       tempo
## duration_ms       0.04788366 -0.08757974  0.10549049
## popularity        0.30537833 -0.09344429  0.09851274
## danceability     -0.20380103 -0.06777716 -0.27702525
## energy            0.26574742  0.01962950 -0.15648794
## loudness         -0.09334940 -0.25216466 -0.07101253
## speechiness      -0.12210790 -0.12355863  0.11424786
## acousticness      0.16914097 -0.01437939 -0.24987665
## instrumentalness  0.21518757 -0.13938162  0.19667675
## liveness          1.00000000 -0.30293005 -0.03621045
## valence          -0.30293005  1.00000000  0.06147433
## tempo            -0.03621045  0.06147433  1.00000000
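
The popularity column of the matrix is easier to read when the features are sorted by the size of their correlation. The sketch below is one way to do that; `rank_by_correlation` is a hypothetical helper name, not part of the original analysis, and it assumes the data frame contains a numeric column called `popularity`.

```r
# Hypothetical helper: rank numeric columns by |correlation| with a target column.
rank_by_correlation <- function(df, target = "popularity") {
  num <- df[sapply(df, is.numeric)]   # keep numeric columns only
  r <- cor(num)[, target]             # correlations with the target
  r <- r[names(r) != target]          # drop the target's self-correlation
  r[order(abs(r), decreasing = TRUE)] # strongest association first
}

# Usage with the data frame loaded above (assumes columns 6 to 16 are numeric):
# rank_by_correlation(data[, 6:16])
```

Sorting by absolute value keeps strong negative correlations, such as the one with duration_ms, next to strong positive ones.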

Base R Plot Goal

The purpose of this plot created in base R is to investigate whether longer songs have more energy, and whether longer or higher-energy songs tend to be more popular. I decided to use a scatter plot because it shows the relationship between duration and energy directly. Coloring the points by popularity would then reveal clusters if popularity follows any pattern in these two variables.

Base R Plot

# One color per track by popularity tier: Low (<= 33), Medium (34 to 66), High (> 66)
colors <- character(nrow(data))
colors[data$popularity <= 33] <- "lightpink"
colors[data$popularity > 33 & data$popularity <= 66] <- "deeppink"
colors[data$popularity > 66] <- "red"

plot(data$energy, data$duration_ms,
     col = colors, pch = 19,
     xlab = "Energy Rating", ylab = "Duration (Milliseconds)",
     main = "Song Duration vs Energy by Popularity")
# xpd = TRUE with a negative inset places the legend outside the plot region
legend("topleft", legend = c("Low", "Medium", "High"),
       inset = c(-.13, -.34),
       col = c("lightpink", "deeppink", "red"), pch = c(19, 19, 19),
       xpd = TRUE, cex = .7, title = "Popularity")
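
The three tier assignments can also be written as a single lookup with `cut()`, using the same 33 and 66 cut points. This is a sketch of an alternative, not the code used for the plot above; `popularity_colors` is a hypothetical helper name.

```r
# Hypothetical helper: map popularity scores to the three tier colors.
popularity_colors <- function(popularity) {
  # cut() uses right-closed intervals, so 33 is "Low" and 66 is "Medium"
  groups <- cut(popularity,
                breaks = c(-Inf, 33, 66, Inf),
                labels = c("Low", "Medium", "High"))
  tier_colors <- c(Low = "lightpink", Medium = "deeppink", High = "red")
  unname(tier_colors[as.character(groups)])
}

# Usage with the data frame loaded above:
# plot(data$energy, data$duration_ms, col = popularity_colors(data$popularity), pch = 19)
```

Because the top interval is open-ended, a track with a popularity of 100 still receives a color.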

Base R Plot Discussion

The scatterplot shows no strong relationship between a track’s energy level and its duration; the correlation matrix above puts it at about 0.31, a weak-to-moderate positive association. Songs of low, medium, and high popularity are scattered across all energy levels, so both low and high energy tracks can become popular. The popularity levels are also spread across durations, although the correlation matrix reports a moderate negative correlation (about -0.35) between duration and popularity, which suggests that shorter tracks are somewhat more popular in this sample. Overall, the scatterplot shows that energy and duration alone do not strongly predict the popularity of a song.
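
A visual impression like this can be checked against a correlation test. The sketch below wraps base R's `cor.test()` (Pearson by default) in a hypothetical helper, `check_pair`, which is not part of the original analysis; with only 50 tracks the resulting p-values should be read cautiously.

```r
# Hypothetical helper: Pearson correlation and p-value for two columns of a data frame.
check_pair <- function(df, x, y) {
  ct <- cor.test(df[[x]], df[[y]])  # Pearson product-moment test by default
  c(r = unname(ct$estimate), p = ct$p.value)
}

# Usage with the data frame loaded above:
# check_pair(data, "duration_ms", "popularity")
# check_pair(data, "energy", "duration_ms")
```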

GGPlot Graph Goal

The goal of this plot created with ggplot2 is to investigate whether louder and more danceable songs tend to be more popular among listeners. I chose a scatter plot because it shows the relationship between loudness and danceability. Mapping popularity to the color aesthetic in ggplot2 produces a continuous gradient automatically. This is different from the base R plot, where I assigned colors to three categories manually rather than using a gradient.

GGPlot Graph

library(ggplot2)
ggplot(data, aes(x= loudness, y= danceability, color = popularity)) +
  geom_point() +
  labs(title = "Song Danceability Rating vs Loudness Rating by Popularity", x = "Loudness (dB)", y = "Danceability Rating")

GGPlot Graph Discussion

The scatter plot shows that there is no strong linear relationship between loudness and danceability. The tracks span the full range of loudness, from quiet to loud, and are scattered across the danceability scale as well. There is also no distinct pattern for popularity. Lighter blue points, which indicate higher popularity, are scattered across the plot instead of gathering in any specific region. This suggests that loudness and danceability alone are not clearly associated with higher popularity. As in the previous graph, popularity seems to be affected by a combination of many musical factors and not solely by the ones shown here.

Trelliscope Graph Goal

The goal of this trelliscope plot was to investigate whether different artists show different patterns between popularity and instrumentalness. I chose trelliscope with a scatterplot in each panel because there are 14 different artists, and trelliscope let me facet by artist without the display becoming unreadable.

Trelliscope Graph

library(trelliscopejs)

ggplot(data, aes(x=instrumentalness, y = popularity)) +
  geom_point(col = "red")  +
  labs(title = "Song Popularity vs Instrumentalness by Artist") +
  facet_trelliscope(~artist, 
                    name = "Popularity vs Instrumentalness by Artist", 
                    nrow = 2, 
                    ncol = 3,
                    path = "Final_Project", 
                    self_contained = TRUE
  )

Trelliscope Graph Discussion

The trelliscope display shows that the relationship between popularity and instrumentalness varies from artist to artist. There is no uniform pattern across the 14 panels. For example, Artist 8 seems to have a positive correlation between instrumentalness and popularity, while Artist 12 is more scattered, with no distinct relationship. A limitation is the small number of songs per artist: 50 tracks spread over 14 artists gives only about 3 to 4 points per panel on average, which is too few to judge a trend reliably. With the current information, instrumentalness does not appear to be a consistent predictor of popularity within artists, even though the pooled correlation across all tracks is about 0.37. Instead, each artist seems to have their own style that contributes to their success.
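
The per-artist comparison can be summarized in a small table that lists how many tracks each artist has and the within-artist correlation. The sketch below is one way to build it; `per_group_summary` is a hypothetical helper, and its rule of returning `NA` for groups with fewer than three tracks (or with no variation in either variable) is an assumption made here, not something from the original analysis.

```r
# Hypothetical helper: track count and within-group correlation per artist.
per_group_summary <- function(df, group = "artist",
                              x = "instrumentalness", y = "popularity") {
  parts <- split(df, df[[group]])  # one data frame per artist
  safe_cor <- function(p) {
    # Too few points or a constant column: a correlation is not meaningful
    if (nrow(p) < 3 || sd(p[[x]]) == 0 || sd(p[[y]]) == 0) return(NA_real_)
    cor(p[[x]], p[[y]])
  }
  data.frame(
    group = names(parts),
    n = vapply(parts, nrow, integer(1)),
    r = vapply(parts, safe_cor, numeric(1)),
    row.names = NULL
  )
}

# Usage with the data frame loaded above:
# per_group_summary(data)
```

Even where a correlation is reported, a value based on three or four tracks is highly unstable, so the `n` column matters as much as `r`.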

GGPlot 2nd Graph Goal

The goal of this boxplot created with ggplot2 is to see whether liveness (how much the track sounds like it was recorded in front of an audience) differs across popularity levels. I decided to use a boxplot because it separates the three levels of popularity I defined and shows the distribution of liveness for each, which makes the three categories easier to compare.

GGPlot 2nd Graph

data$pop_group <- cut(
  data$popularity,
  breaks = c(-Inf, 33, 66, Inf),
  labels = c("Low", "Medium", "High")
)
ggplot(data, aes(x= pop_group, y = liveness, fill = pop_group)) +
  geom_boxplot() +
  scale_fill_manual(values = c("Low" = "lightpink", 
                               "Medium" = "deeppink",
                               "High" = "red")) +
  labs(title = "Liveness by Popularity Group", x = "Popularity Group", y = "Liveness Rating", fill = "Popularity Group") 

GGPlot 2nd Graph Discussion

The boxplot shows that more popular songs tend to have higher liveness levels. The median of each box increases as popularity increases: the Low popularity median is around 0.375, the Medium popularity median is around 0.5, and the High popularity median is around 0.63. The High popularity tracks are also more concentrated toward the top of the liveness scale, with only a few outliers. Low and Medium popularity songs are more spread out across the liveness scale. Overall, songs with more live characteristics tend to be more popular than less live-sounding songs.
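
Medians read off a boxplot are approximate, so it can help to compute them directly. The sketch below reuses the same 33 and 66 cut points; `group_medians` is a hypothetical helper name and is not part of the original analysis.

```r
# Hypothetical helper: median of `values` within each popularity tier.
group_medians <- function(values, scores,
                          breaks = c(-Inf, 33, 66, Inf),
                          labels = c("Low", "Medium", "High")) {
  groups <- cut(scores, breaks = breaks, labels = labels)  # right-closed intervals
  tapply(values, groups, median)
}

# Usage with the data frame loaded above:
# group_medians(data$liveness, data$popularity)
```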

Conclusion

Across the four visualizations, the analysis shows that no single Spotify audio feature predicts popularity consistently on its own. Popularity varies widely across energy, duration, loudness, danceability, and instrumentalness. This shows that successful songs can emerge from many different forms and that there is no set formula for popularity. There was an association between high popularity and a high liveness rating, but this trend is not uniform across all tracks. In conclusion, popularity appears to be shaped by many musical features combined rather than by any single feature.