ABSTRACT
This analysis evaluates factors that drive global music success using the Most Streamed Spotify Songs of 2024 dataset. An initial look at the data revealed a heavily right-skewed Track Score distribution, indicating that most songs achieve only moderate success (modal range: 25-75). For further analysis, stream and view counts were log-transformed so that conclusions are not drastically skewed by outliers. A t-test for correlation between log(Spotify streams) and log(YouTube views) showed a moderate but statistically significant positive correlation of 0.39. This does not show that one causes the other, but songs with more streams on one platform tend to have more on the other. Data on the top 10 artists by total Spotify streams shows that more tracks does not necessarily mean more total streams. To explore this, Taylor Swift and The Weeknd were compared; the comparison suggests that a top-streamed artist with fewer tracks can have a higher median song-by-song performance. Finally, a two-tailed t-test showed that mean log(Spotify streams) is statistically significantly higher for explicit tracks than for non-explicit tracks.
INTRODUCTION
This project analyzes Spotify songs and examines the relationships between different variables, such as Spotify streams versus performance on other social media platforms, explicit content, and artists.
INTRO TO DATASET
Dataset: Most Streamed Spotify Songs 2024
Source: Kaggle.com
Author: Nidula Elgiriyewithana - Data Scientist
Period: January 1st 2024 - December 31st 2024
Size: 31 columns, 4,600 rows
Each row represents a different song.
DATA CLEANING/PREPARATION
df <- read.csv("spotify.csv")
head(df)
## Track AlbumName Artist
## 1 MILLION DOLLAR BABY Million Dollar Baby - Single Tommy Richman
## 2 Not Like Us Not Like Us Kendrick Lamar
## 3 i like the way you kiss me I like the way you kiss me Artemas
## 4 Flowers Flowers - Single Miley Cyrus
## 5 Houdini Houdini Eminem
## 6 Lovin On Me Lovin On Me Jack Harlow
## ReleaseDate ISRC AllTimeRank TrackScore SpotifyStreams
## 1 4/26/24 QM24S2402528 1 725.4 390470936
## 2 5/4/24 USUG12400910 2 545.9 323703884
## 3 3/19/24 QZJ842400387 3 538.4 601309283
## 4 1/12/23 USSM12209777 4 444.9 2031280633
## 5 5/31/24 USUG12403398 5 423.3 107034922
## 6 11/10/23 USAT22311371 6 410.1 670665438
## SpotifyPlaylistCount SpotifyPlaylistReach Spotify.Popularity YouTubeViews
## 1 30716 196631588 92 84274754
## 2 28113 174597137 92 116347040
## 3 54331 211607669 92 122599116
## 4 269802 136569078 85 1096100899
## 5 7223 151469874 88 77373957
## 6 105892 175421034 83 131148091
## YouTubeLikes TikTokPosts TikTok.Likes TikTok.Views YouTubePlaylistReach
## 1 1713126 5767700 651565900 5332281936 150597040
## 2 3486739 674700 35223547 208339025 156380351
## 3 2228730 3025400 275154237 3369120610 373784955
## 4 10629796 7189811 1078757968 14603725994 3351188582
## 5 3670188 16400 NA NA 112763851
## 6 1392593 4202367 214943489 2938686633 2867222632
## AppleMusicPlaylistCount AirPlaySpins SiriusXMSpins DeezerPlaylistCount
## 1 210 40975 684 62
## 2 188 40778 3 67
## 3 190 74333 536 136
## 4 394 1474799 2182 264
## 5 182 12185 1 82
## 6 138 522042 4654 86
## DeezerPlaylistReach AmazonPlaylistCount PandoraStreams PandoraTrackStations
## 1 17598718 114 18004655 22931
## 2 10422430 111 7780028 28444
## 3 36321847 172 5022621 5639
## 4 24684248 210 190260277 203384
## 5 17660624 105 4493884 7006
## 6 17167254 152 138529362 50982
## SoundcloudStreams ShazamCounts ExplicitTrack X X.1 X.2
## 1 4818457 2669262 0 NA NA
## 2 6623075 1118279 1 NA NA
## 3 7208651 5285340 0 NA NA
## 4 NA 11822942 0 NA NA
## 5 207179 457017 1 NA NA
## 6 9438601 4517131 1 NA NA
# ----------------------------------------------------
# Part 1: Data Exploration (Steps 2 & 3) - T-Test for Correlation
# ----------------------------------------------------
# Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# --- Assuming 'df' is loaded and the column names are clean ---
# 1. Preparation (Filter out NAs/zeros for analysis)
df_analysis <- df %>%
filter(!is.na(TrackScore),
!is.na(SpotifyStreams), !is.na(YouTubeViews),
SpotifyStreams > 0, YouTubeViews > 0)
# ----------------------------------------------------
# 2. Summary Statistics
# ----------------------------------------------------
summary_cols <- c("SpotifyStreams", "YouTubeViews", "TikTokPosts", "AirPlaySpins")
summary_stats <- df_analysis %>%
  select(all_of(summary_cols)) %>%
  summarise(
    across(everything(), list(
      # Anonymous functions pass na.rm directly, as required since dplyr 1.1.0
      Mean = \(x) mean(x, na.rm = TRUE),
      Median = \(x) median(x, na.rm = TRUE),
      SD = \(x) sd(x, na.rm = TRUE),
      Min = \(x) min(x, na.rm = TRUE),
      Max = \(x) max(x, na.rm = TRUE)
    ), .names = "{.col}_{.fn}")
  )
print("--- Summary Statistics for Key Metrics ---")
## [1] "--- Summary Statistics for Key Metrics ---"
print(summary_stats)
## SpotifyStreams_Mean SpotifyStreams_Median SpotifyStreams_SD
## 1 443393484 242064180 524991078
## SpotifyStreams_Min SpotifyStreams_Max YouTubeViews_Mean YouTubeViews_Median
## 1 1071 4281468720 395635678 145326186
## YouTubeViews_SD YouTubeViews_Min YouTubeViews_Max TikTokPosts_Mean
## 1 693966489 913 16322756555 883728.9
## TikTokPosts_Median TikTokPosts_SD TikTokPosts_Min TikTokPosts_Max
## 1 175536 2275383 1 37726462
## AirPlaySpins_Mean AirPlaySpins_Median AirPlaySpins_SD AirPlaySpins_Min
## 1 55149.76 6169 127341.2 1
## AirPlaySpins_Max
## 1 1777811
# ----------------------------------------------------
# 3. Visual Exploration (Part 1, Step 3)
# ----------------------------------------------------
# Define the bin width for consistency
BIN_WIDTH <- 50
# A. Histogram of Track Score with Modal Range Annotation
# -------------------------------------------------------
# 1. Create the base plot object
histogram_track_score <- ggplot(df_analysis, aes(x = TrackScore)) +
geom_histogram(binwidth = BIN_WIDTH, fill = "lightblue", color = "grey") +
labs(title = "Distribution of Track Score (Modal Range Highlighted)",
x = "Track Score",
y = "Count of Songs") +
theme_minimal()
# 2. Extract histogram bin data
g_data <- ggplot_build(histogram_track_score)$data[[1]]
mode_bin_data <- g_data[which.max(g_data$count), ]
# 3. Add the label layer: Position at half height, rotate 90 degrees, and set color to white
histogram_track_score_labeled <- histogram_track_score +
geom_text(data = mode_bin_data,
aes(x = x, y = count / 2,
label = paste("Range:", round(xmin, 1), "-", round(xmax, 1))),
angle = 90,
vjust = 0.5,
color = "white",
size = 3,
fontface = "bold")
# Print the Labeled Histogram
print(histogram_track_score_labeled)
# B. Scatter Plot of Log(Spotify Streams) vs. Log(YouTube Views)
scatter_log_streams <- ggplot(df_analysis, aes(x = log(SpotifyStreams), y = log(YouTubeViews))) +
geom_point(alpha = 0.6, color = "pink") +
geom_smooth(method = "lm", color = "coral", se = FALSE) +
labs(title = "Relationship between Spotify and YouTube Popularity (Log Transformed)",
x = "Log(Spotify Streams)",
y = "Log(YouTube Views)") +
theme_minimal()
# Print the Scatter Plot
print(scatter_log_streams)
## `geom_smooth()` using formula = 'y ~ x'
# C. Calculate correlation and perform t-test using cor.test()
# ----------------------------------------------------------------------
correlation_test_result <- cor.test(log(df_analysis$SpotifyStreams),
log(df_analysis$YouTubeViews),
method = "pearson")
print("--- T-Test for Correlation Significance (Log Streams vs. Log Views) ---")
## [1] "--- T-Test for Correlation Significance (Log Streams vs. Log Views) ---"
# cor.test() prints the t-statistic, df, p-value, and the correlation coefficient (r)
print(correlation_test_result)
##
## Pearson's product-moment correlation
##
## data: log(df_analysis$SpotifyStreams) and log(df_analysis$YouTubeViews)
## t = 27.719, df = 4216, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3667824 0.4178450
## sample estimates:
## cor
## 0.3926162
The first chart shows the distribution of Track Score, which combines various measures of a song's popularity across streaming platforms and social media activity. It is included to show a song's global relevance beyond Spotify streams alone. Each metric that goes into the Track Score can be weighted differently. The histogram is skewed to the right: most songs sit at the lower Track Scores on the left of the graph, meaning most songs are only moderately popular globally. The peak of the histogram marks the most common Track Score range in the dataset; 2,444 songs had a Track Score between 25 and 75. This plot is consistent with a global music market characterized by a few superstars and many other tracks that gain only limited traction.
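The right skew described for Track Score can also be checked numerically for the stream metrics. The sketch below is a minimal illustration, not part of the original analysis: it re-enters the means and medians printed in the summary statistics output and compares them. The names `skew_check` and `mean_to_median` are made up for this example.

```r
# Means and medians copied from the summary statistics output earlier in this report.
skew_check <- data.frame(
  metric = c("SpotifyStreams", "YouTubeViews", "TikTokPosts", "AirPlaySpins"),
  mean   = c(443393484, 395635678, 883728.9, 55149.76),
  median = c(242064180, 145326186, 175536, 6169)
)

# A ratio well above 1 means a few very large values pull the mean
# above the median, which is typical of a right-skewed distribution.
skew_check$mean_to_median <- skew_check$mean / skew_check$median

print(skew_check)
```

Every ratio is above 1, which supports working with log-transformed values for these metrics.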
The second chart shows the relationship between Spotify and YouTube popularity. The data points suggest that Spotify streams and YouTube views are correlated, which the t-test for correlation confirms. The correlation coefficient (r) was calculated to be 0.39 with a p-value below 2.2e-16, which indicates a moderate, statistically significant positive correlation. The points generally trend upward but are not tightly clustered around the line, suggesting that the factors that make songs popular vary across platforms. There are many cases where songs had high YouTube views and low Spotify streams.
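To make the link between the correlation and the t-test explicit, the sketch below recomputes the t-statistic from the values that cor.test() printed, using the standard relationship t = r * sqrt(n - 2) / sqrt(1 - r^2). The variable names (`r`, `deg_free`, `t_stat`, `p_value`, `r_squared`) are illustrative only.

```r
# Values reported by cor.test() in the output above.
r        <- 0.3926162  # Pearson correlation of log(SpotifyStreams) and log(YouTubeViews)
deg_free <- 4216       # degrees of freedom, n - 2

# t-statistic for testing whether the true correlation is zero
t_stat <- r * sqrt(deg_free) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = deg_free)

# Share of variance in one log-metric accounted for by the other
r_squared <- r^2

print(round(t_stat, 3))
print(p_value)
print(round(r_squared, 3))
```

The recomputed t-statistic agrees with the reported 27.719. The squared correlation of roughly 0.15 is a reminder that the relationship, while clearly not zero, leaves most of the variation unexplained.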
# ----------------------------------------------------
# ANLC 320 Midterm Project: Part 2 - Comparative Analysis (Spotify)
# ----------------------------------------------------
# Load Tidyverse library
library(tidyverse)
# Define the file name
FILE_NAME <- "spotify.csv"
# Load the dataset
df_spotify <- read_csv(FILE_NAME)
## New names:
## Rows: 4600 Columns: 31
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (6): Track, AlbumName, Artist, ReleaseDate, ISRC, ...31 dbl (22): TrackScore,
## SpotifyStreams, SpotifyPlaylistCount, SpotifyPlaylistR... num (1): AllTimeRank
## lgl (2): ...29, ...30
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...29`
## • `` -> `...30`
## • `` -> `...31`
# --- Part 2, Step 1 & 2: Identify and Visualize Top Artists ---
# 1. Group by Artist and calculate total Spotify Streams
df_artists <- df_spotify %>%
filter(!is.na(Artist), !is.na(SpotifyStreams)) %>%
group_by(Artist) %>%
summarise(
TotalStreams = sum(SpotifyStreams, na.rm = TRUE),
N_Tracks = n()
) %>%
ungroup() %>%
filter(N_Tracks > 1) %>%
arrange(desc(TotalStreams))
# Identify the Top 10 Artists for visualization
top_10_artists <- df_artists %>%
slice(1:10)
print("Top 10 Artists by Total Spotify Streams")
## [1] "Top 10 Artists by Total Spotify Streams"
print(top_10_artists)
## # A tibble: 10 × 3
## Artist TotalStreams N_Tracks
## <chr> <dbl> <int>
## 1 Bad Bunny 37054834425 60
## 2 The Weeknd 36948540278 30
## 3 Drake 34962157577 62
## 4 Taylor Swift 34470771165 63
## 5 Post Malone 26137472958 22
## 6 Ed Sheeran 24014900390 15
## 7 Ariana Grande 23464991696 26
## 8 MUSIC LAB JPN 22866685573 14
## 9 Olivia Rodrigo 19729219749 20
## 10 Eminem 18878880174 15
# 2. Visualize the top 10 artists with vertical data labels inside the bars
top_artists_plot <- ggplot(top_10_artists,
aes(x = reorder(Artist, -TotalStreams),
y = TotalStreams)) +
geom_col(fill = "pink") + # Pink bars
# --- MODIFIED: ADDED VERTICAL DATA LABELS INSIDE BARS ---
geom_text(aes(label = scales::comma(TotalStreams)),
angle = 90, # Rotate text 90 degrees to be vertical
hjust = 1.1, # Pushes the text down from the top edge (since it's rotated)
vjust = 0.5, # Centers the text horizontally on the bar
color = "navy",
size = 3) +
# ----------------------------------------------------------
scale_y_continuous(labels = scales::comma) +
labs(title = "Top 10 Artists by Total Spotify Stream Volume (Vertical Labels)",
x = "Artist",
y = "Total Spotify Streams") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(top_artists_plot)
The top 10 artists are listed in a table with their total number of
streams and total number of tracks. Within the top 10, some artists
with significantly fewer tracks show a higher mean number of streams
per track. For example, the number one artist, Bad Bunny, has 60
tracks, while the next artist, The Weeknd, has half as many tracks and
is only about 100 million streams behind.
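The per-track comparison can be made concrete by dividing total streams by track count. This is a small sketch using the top four rows of the table above, re-entered by hand; it uses base R rather than the tidyverse so that it runs on its own, and the name `top_artists` and its column names are chosen just for this example.

```r
# Top four artists, with values copied from the top 10 table above.
top_artists <- data.frame(
  artist        = c("Bad Bunny", "The Weeknd", "Drake", "Taylor Swift"),
  total_streams = c(37054834425, 36948540278, 34962157577, 34470771165),
  n_tracks      = c(60, 30, 62, 63)
)

# Mean streams per track for each artist
top_artists$streams_per_track <- top_artists$total_streams / top_artists$n_tracks

# Sort from highest to lowest mean streams per track
top_artists <- top_artists[order(-top_artists$streams_per_track), ]

print(top_artists)
```

Sorted this way, The Weeknd moves to the top, with roughly twice the streams per track of Bad Bunny.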
This bar chart displays the top 10 artists by total Spotify streams. The top 4 artists are within 3 billion streams of each other, the next artist falls more than 8 billion streams behind, and the remaining artists are within a few billion of each other. The chart is in line with the Pareto Principle (the 80/20 rule), the idea that a small number of superstars capture the vast majority of streaming attention, which makes competition at the top tier intense.
# --- Part 2, Step 4: Select and Compare Two Artists ---
# Artists selected for the comparison
ARTIST_A <- "The Weeknd"
ARTIST_B <- "Taylor Swift"
selected_artists <- c(ARTIST_A, ARTIST_B)
# Filter the original data for the two selected artists
df_comparison_spotify <- df_spotify %>%
filter(Artist %in% selected_artists)
# Calculate comparative statistics
comparison_stats_spotify <- df_comparison_spotify %>%
group_by(Artist) %>%
summarise(
N_Tracks = n(), # Counts all rows, including tracks with missing SpotifyStreams
Mean_Streams = mean(SpotifyStreams, na.rm = TRUE),
Median_Streams = median(SpotifyStreams, na.rm = TRUE),
SD_Streams = sd(SpotifyStreams, na.rm = TRUE),
Mean_TikTokPosts = mean(TikTokPosts, na.rm = TRUE),
SD_TrackScore = sd(TrackScore, na.rm = TRUE),
Explicit_Share = mean(ExplicitTrack, na.rm = TRUE)
) %>%
ungroup()
print("--- Comparative Summary Statistics for Selected Artists ---")
## [1] "--- Comparative Summary Statistics for Selected Artists ---"
print(comparison_stats_spotify)
## # A tibble: 2 × 8
## Artist N_Tracks Mean_Streams Median_Streams SD_Streams Mean_TikTokPosts
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Taylor Swift 63 547155098. 395433400 485815993. 343551.
## 2 The Weeknd 31 1231618009. 959195854 965106760. 323156.
## # ℹ 2 more variables: SD_TrackScore <dbl>, Explicit_Share <dbl>
# --- Part 2, Step 5: Visualize Track Score Distribution ---
# Visualize the consistency of TrackScore for the two artists using boxplots
track_score_boxplot <- ggplot(df_comparison_spotify, aes(x = Artist, y = TrackScore)) +
geom_boxplot(aes(fill = Artist)) +
labs(title = paste0("Track Score Distribution: ", ARTIST_A, " vs. ", ARTIST_B),
x = "Artist",
y = "Track Score (Measure of Success)") +
theme_minimal() +
scale_fill_manual(values = c("lightyellow", "lightblue"))
print(track_score_boxplot)
To look at this difference more closely, I created a box plot of Track Score for Taylor Swift (34.47B streams and 63 tracks) and The Weeknd (36.95B streams and 30 tracks). The median line inside each box shows the typical Track Score for a track by that artist. The Weeknd shows a higher level of success when comparing song by song, which matches the summary table, where his median streams per track (about 959 million) are well above Taylor Swift's (about 395 million). Both artists have outliers, but Taylor Swift has one extreme outlier that exceeds all of The Weeknd's top-performing songs for 2024. Although Taylor Swift has about twice as many tracks and many outliers, she is more likely to have some songs underperform. The Weeknd shows more consistent performance per track released.
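One way to put a number on "consistency" is the coefficient of variation (standard deviation divided by mean), which measures spread relative to the typical value. The sketch below applies it to the stream statistics printed in the comparison table. It is only an illustration: it uses Spotify streams rather than Track Score, because the SD_TrackScore column is cut off in the printed output, and the names `artist_stats` and `cv` are made up for this example.

```r
# Mean and SD of Spotify streams per track, copied from the comparison table above.
artist_stats <- data.frame(
  artist       = c("Taylor Swift", "The Weeknd"),
  mean_streams = c(547155098, 1231618009),
  sd_streams   = c(485815993, 965106760)
)

# Coefficient of variation: spread relative to the mean.
# A lower value means per-track results vary less relative to the artist's average.
artist_stats$cv <- artist_stats$sd_streams / artist_stats$mean_streams

print(artist_stats)
```

By this measure The Weeknd's streams are somewhat less variable relative to his mean, even though his raw standard deviation is larger.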
# ----------------------------------------------------
# Statistical Test: T-Test on Log(SpotifyStreams) by Explicit Track Status
# Objective: Determine if the mean log streams differs significantly between
# Explicit (1) and Non-Explicit (0) tracks.
# ----------------------------------------------------
# Load Tidyverse library
library(tidyverse)
# Define the file name
FILE_NAME <- "spotify.csv"
# 1. Data Loading and Cleaning
df_spotify <- read_csv(FILE_NAME) %>%
# Filter out missing or zero stream counts, which cannot be logged
filter(!is.na(SpotifyStreams), SpotifyStreams > 0) %>%
# Ensure ExplicitTrack is treated as a factor for the t-test formula
mutate(ExplicitTrack = factor(ExplicitTrack,
levels = c(0, 1),
labels = c("Non-Explicit", "Explicit")))
## New names:
## Rows: 4600 Columns: 31
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (6): Track, AlbumName, Artist, ReleaseDate, ISRC, ...31 dbl (22): TrackScore,
## SpotifyStreams, SpotifyPlaylistCount, SpotifyPlaylistR... num (1): AllTimeRank
## lgl (2): ...29, ...30
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...29`
## • `` -> `...30`
## • `` -> `...31`
# 2. Log Transformation
df_analysis <- df_spotify %>%
mutate(LogSpotifyStreams = log(SpotifyStreams))
# 3. Calculate Group Means for reporting
group_means <- df_analysis %>%
group_by(ExplicitTrack) %>%
summarise(Mean_LogStreams = mean(LogSpotifyStreams, na.rm = TRUE))
print("--- Mean Log(Spotify Streams) by Track Status ---")
## [1] "--- Mean Log(Spotify Streams) by Track Status ---"
print(group_means)
## # A tibble: 2 × 2
## ExplicitTrack Mean_LogStreams
## <fct> <dbl>
## 1 Non-Explicit 18.6
## 2 Explicit 19.2
# 4. Conduct Two-Sample Independent T-Test (Welch's Test)
# We use the formula notation: Dependent Variable ~ Independent Grouping Variable
# The default t.test() assumes unequal variances (Welch's test), which is recommended.
t_test_result <- t.test(LogSpotifyStreams ~ ExplicitTrack,
data = df_analysis)
print("--- Full T-Test Result ---")
## [1] "--- Full T-Test Result ---"
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: LogSpotifyStreams by ExplicitTrack
## t = -8.7794, df = 4468.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Non-Explicit and group Explicit is not equal to 0
## 95 percent confidence interval:
## -0.6547503 -0.4157101
## sample estimates:
## mean in group Non-Explicit mean in group Explicit
## 18.62750 19.16273
# --- Interpretation Guidance (Narrative Step) ---
# 1. Check the P-Value:
# - If the P-Value is less than 0.05, you reject the null hypothesis (H0) and conclude
# that the mean log streams are significantly different between the two groups.
# - Based on the calculated P-value (which should be extremely small, ~0), the difference
# is highly statistically significant.
# 2. Conclude: Explicit tracks have a statistically different (and based on the means, higher)
# average streaming success compared to Non-Explicit tracks.
Question: Is a song being explicit associated with a difference in song performance?
Null Hypothesis: The mean log(Spotify streams) for explicit tracks is equal to that of non-explicit tracks.
P value: < 2.2e-16
Since the p-value is less than .05, we reject the null hypothesis. Explicit tracks have a higher mean log(Spotify streams) (19.16) than non-explicit tracks (18.63).
Business Implication: On average, a song that contains explicit content tends to achieve a higher level of streaming success than a non-explicit song.
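Because the test was run on log(Spotify streams), the difference in means is easier to interpret after converting it back to the original scale. The sketch below does that with the numbers from the t-test output. Exponentiating a difference in mean logs gives a ratio of geometric means, not of ordinary averages, and the variable names here are illustrative only.

```r
# Group means and confidence interval copied from the t-test output above.
mean_log_non_explicit <- 18.62750
mean_log_explicit     <- 19.16273
ci_log_diff <- c(-0.6547503, -0.4157101)  # Non-Explicit minus Explicit

# Difference on the log scale, expressed as Explicit minus Non-Explicit
log_diff <- mean_log_explicit - mean_log_non_explicit

# Back-transform: ratio of geometric mean streams (Explicit / Non-Explicit)
ratio <- exp(log_diff)

# Flip the sign and order of the interval so it matches Explicit minus Non-Explicit
ratio_ci <- exp(-rev(ci_log_diff))

print(round(ratio, 2))
print(round(ratio_ci, 2))
```

Read this way, the typical explicit track in this dataset has roughly 1.7 times the Spotify streams of the typical non-explicit track, with a 95 percent interval of about 1.5 to 1.9 times.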
CONCLUSION
To compare track performance across platforms, I compared YouTube views with Spotify streams, and the results showed that the two are positively related. I also learned that an artist whose total stream count is close to another artist's can have half as many tracks and a better average performance per song.