2026-03-29

Dataset Overview and Source

Spotify Track Analysis

This analysis examines 6187 unique tracks to explore how genre, duration, and explicit content relate to popularity.

Data Source: Spotify Dataset by Gati Ambiliya on Kaggle

Key Variables:

  • id: Unique identifier for the track on Spotify
  • name: Name of the track
  • genre: Genre of the song
  • artists: Names of the artists who performed the track, separated by commas if there are multiple artists
  • album: Name of the album the track belongs to
  • popularity: Popularity score of the track (0-100, where higher is more popular)
  • duration_ms: Duration of the track in milliseconds
  • explicit: Boolean indicating whether the track contains explicit content

R Code for Data Preparation

Here’s how I load and prepare the data:

# Load required libraries
library(ggplot2)
library(plotly)
library(dplyr)

# Load the data
spotify <- read.csv("spotify_tracks.csv")

# Remove tracks that appear more than once. Deduplicating on id (not name)
# keeps different songs that happen to share a name.
spotify <- distinct(spotify, id, .keep_all = TRUE)

# Convert explicit to a factor. id, name, artists, album, and duration_ms
# are left as-is because each has thousands of distinct values.
spotify$explicit <- factor(spotify$explicit,
                              levels = c("False", "True"), 
                              labels = c("Not Explicit", "Explicit"))
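To see why deduplicating on id rather than name matters, here is a minimal sketch on a made-up data frame (the rows and the name `toy` are invented for illustration). It uses base R's duplicated() as a stand-in for distinct(); both keep the first row for each id.

```r
# Toy data (hypothetical rows, not from the Spotify dataset):
# id "a1" appears twice, while "Intro" is the name of two different tracks.
toy <- data.frame(
  id   = c("a1", "a1", "b2", "c3"),
  name = c("Intro", "Intro", "Intro", "Outro"),
  stringsAsFactors = FALSE
)

# Keep the first row for each id; same idea as distinct(toy, id, .keep_all = TRUE)
toy_unique <- toy[!duplicated(toy$id), ]

nrow(toy_unique)                 # rows left after dropping the repeated id
sum(toy_unique$name == "Intro")  # same-name tracks with different ids survive
```

Deduplicating on name instead would have collapsed the two different "Intro" tracks into one.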

plot_ly 3D Scatter Plot - Popularity, Duration, & Genre

3D Plot Analysis

Key Observations:

  • Duration Pattern: The most popular songs rarely run longer than 400,000 ms (about 6.7 minutes), which is why I capped the duration axis there instead of at the roughly 3.5 million ms maximum in the data set. This suggests that listeners generally favor shorter songs (roughly 1-6 minutes).

  • Popularity Distribution: The songs at the very top of the popularity scale never exceed 300,000 ms (exactly 5 minutes). This suggests that listeners favor short and medium-length songs, and that many consider songs over 5 minutes “too long.”

  • Genre Effect: Some genres have more popular songs than others (techno and house have more than metalcore and acoustic). Genres also differ in length: some of the more popular genres consist mostly of shorter songs, while psych rock, for example, has more long popular songs than industrial.

  • Combined Effect: The 3D view shows that a large majority of the scraped songs are under 6 minutes, which likely reflects the kind of music released to streaming services. Within that range, most popular songs are 5 minutes or shorter.
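The duration cut-offs above can also be checked numerically. The sketch below is an assumption about how one might do that, not the code behind the plot: it runs on a small made-up data frame (`toy`), applies the same 400,000 ms cap, and reports the share of tracks at or under 5 minutes within each popularity band. The band boundaries are arbitrary choices for illustration.

```r
# Toy data (hypothetical values, not from the Spotify dataset)
toy <- data.frame(
  popularity  = c(85, 78, 72, 55, 40, 35, 20, 10, 5, 60),
  duration_ms = c(180000, 210000, 295000, 240000, 330000,
                  150000, 390000, 3500000, 420000, 200000)
)

# Drop very long tracks, mirroring the 400k ms cap used for the plot
capped <- subset(toy, duration_ms <= 400000)

# Convert to minutes (60,000 ms per minute)
capped$duration_min <- capped$duration_ms / 60000

# Popularity bands: [0, 33], (33, 66], (66, 100]
capped$band <- cut(capped$popularity, breaks = c(0, 33, 66, 100),
                   labels = c("low", "mid", "high"), include.lowest = TRUE)

# Share of tracks at or under 5 minutes in each band
share_short <- tapply(capped$duration_min <= 5, capped$band, mean)
share_short
```

On the real data, replacing `toy` with `spotify` would show whether the highest band really contains no tracks over 5 minutes.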

plot_ly Bubble Plot - Genres with the Most Artists

ggplot Faceted Area Charts - Popularity and Explicitness of Top Genres

ggplot Bar Plot - % of Explicit Songs in the Top 15 Genres

Statistical Analysis: Summary Statistics

# Count, mean, median, and standard deviation of popularity and duration, by explicit flag
spotify %>%
  group_by(explicit) %>%
  summarise(
    Count = n(),
    Mean_Popularity = round(mean(popularity), 1),
    Median_Popularity = median(popularity),
    SD_Popularity = round(sd(popularity), 1),
    Mean_Duration = round(mean(duration_ms), 1),
    Median_Duration = median(duration_ms),
    SD_Duration = round(sd(duration_ms), 1)
  )
## # A tibble: 2 × 8
##   explicit   Count Mean_Popularity Median_Popularity SD_Popularity Mean_Duration
##   <fct>      <int>           <dbl>             <dbl>         <dbl>         <dbl>
## 1 Not Expli…  5022            29.9                28          19.7       205808.
## 2 Explicit    1165            33.3                32          20.3       191170.
## # ℹ 2 more variables: Median_Duration <dbl>, SD_Duration <dbl>

Summary Statistics: Interpretation

Detailed Findings:

  • Unbalanced but Usable Groups: The dataset has 5022 non-explicit songs (about 81%) and 1165 explicit songs (about 19%). The groups are unequal in size, but over 1100 explicit tracks is still a large enough sample for comparison.

  • Explicit songs are more popular: Explicit songs are more popular on average, with a mean popularity of 33.3, a median of 32, and a standard deviation of 20.3, compared with a mean of 29.9, a median of 28, and a standard deviation of 19.7 for non-explicit songs.

  • Non-explicit songs are longer: Non-explicit songs are slightly longer on average, with a mean of about 205.8k ms (roughly 3.4 minutes) and a median of 194.9k ms, compared with a mean of about 191.2k ms (roughly 3.2 minutes) and a median of 175.9k ms for explicit songs.

  • Consistent with a preference for shorter songs: Explicit songs are on average both more popular and shorter, while non-explicit songs are both less popular and longer. This is consistent with the idea that more popular songs tend to be shorter, although a comparison of two group averages does not prove it.
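The comparison above is indirect, since it infers a duration effect from two group averages. A more direct check is a rank correlation between duration and popularity. The sketch below uses base R's cor.test() with method = "spearman" on a made-up data frame; on the real data one would pass spotify$duration_ms and spotify$popularity instead. The toy values are invented and say nothing about the actual result.

```r
# Toy data (hypothetical values, not from the Spotify dataset)
toy <- data.frame(
  duration_ms = c(150000, 180000, 200000, 240000, 300000, 360000, 420000),
  popularity  = c(70, 65, 72, 50, 45, 30, 20)
)

# Spearman rank correlation between duration and popularity
res <- cor.test(toy$duration_ms, toy$popularity, method = "spearman")

res$estimate  # Spearman's rho; a negative value means longer tracks rank lower
res$p.value   # p-value for the null hypothesis of no monotonic association
```

A rank correlation is used here because it does not assume a linear relationship and is less sensitive to a few very long tracks.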

Statistical Analysis: T-Test

# Welch two-sample t-tests comparing popularity and duration between explicit and non-explicit tracks
explicit_pop <- spotify$popularity[spotify$explicit == "Explicit"]
nonexplicit_pop <- spotify$popularity[spotify$explicit == "Not Explicit"]

explicit_dur <- spotify$duration_ms[spotify$explicit == "Explicit"]
nonexplicit_dur <- spotify$duration_ms[spotify$explicit == "Not Explicit"]

t_test_pop_result <- t.test(explicit_pop, nonexplicit_pop)
t_test_dur_result <- t.test(explicit_dur, nonexplicit_dur)

t_test_pop_result
## 
##  Welch Two Sample t-test
## 
## data:  explicit_pop and nonexplicit_pop
## t = 5.2912, df = 1707.2, p-value = 1.373e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.189168 4.768142
## sample estimates:
## mean of x mean of y 
##  33.33648  29.85783
t_test_dur_result
## 
##  Welch Two Sample t-test
## 
## data:  explicit_dur and nonexplicit_dur
## t = -3.1249, df = 1491.3, p-value = 0.001813
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -23827.190  -5449.748
## sample estimates:
## mean of x mean of y 
##  191169.7  205808.2

T-Test: Interpretation

Comprehensive Analysis:

  • Statistical Significance: The p-value of 1.373e-07 for popularity is far below 0.05, which is strong evidence that the difference in mean popularity is not due to random chance. The p-value of 0.001813 for duration is larger but also well below 0.05, so the difference in mean duration is statistically significant too.

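  • Effect Size: The 3.48-point difference in mean popularity (33.34 vs. 29.86 on the 0-100 scale) is statistically significant, but it is modest next to the standard deviation of about 20 points, at roughly 0.18 standard deviations.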

  • Confidence Interval: We can be 95% confident that explicit songs score between 2.19 and 4.77 points higher in mean popularity (on the 0-100 scale) than non-explicit songs.

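  • Practical Implication: Explicit songs are on average about 14.6 seconds shorter (95% CI: 5.4 to 23.8 seconds) and more popular, which is consistent with longer songs tending to be less popular. The t-tests compare groups, though, and do not test the duration-popularity relationship directly.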
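One way to put a number on effect size is Cohen's d, the mean difference divided by the pooled standard deviation. The sketch below computes it from the group counts, means, and rounded standard deviations reported in the summary table and t-test output above, so the result is approximate. Cohen's d is an addition here, not part of the original analysis, and the variable names are illustrative.

```r
# Values copied from the summary table and the popularity t-test output
n_exp <- 1165; mean_exp <- 33.33648; sd_exp <- 20.3   # explicit
n_non <- 5022; mean_non <- 29.85783; sd_non <- 19.7   # not explicit

# Pooled standard deviation across the two groups
sd_pooled <- sqrt(((n_exp - 1) * sd_exp^2 + (n_non - 1) * sd_non^2) /
                  (n_exp + n_non - 2))

# Cohen's d: mean difference in units of the pooled standard deviation
cohens_d <- (mean_exp - mean_non) / sd_pooled
round(cohens_d, 2)
```

By the common rule of thumb (0.2 small, 0.5 medium, 0.8 large), a value a little under 0.2 counts as a small effect.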

Key Insights and Conclusions

Major Findings:

First: Song duration and explicitness are both associated with a song’s popularity, in addition to factors such as having a popular artist or belonging to a popular genre.

Second: Some genres tend to produce more long songs than others, and some genres are simply more popular than others; the more popular genres also tend to have shorter songs.

Study Limitations:

This analysis uses a sample that presumably comes from a single point in time, so it does not capture how these songs change over time. It also leaves out other factors, such as how loud a song is.

Future Research Directions:

Tracking these songs over time to see how the popularity of genres, durations, explicit content, albums, and artists changes.

Thank You