Final Project

ABSTRACT

This analysis evaluates factors driving global music success using the Most Streamed Spotify Songs of 2024 dataset. An initial look at the data revealed a heavily right-skewed Track Score distribution, confirming that most songs achieve only moderate success (modal range: 25-75). To go further, stream and view counts were log-transformed so that conclusions are not drastically skewed by outliers. A t-test on the Pearson correlation between log(Spotify streams) and log(YouTube views) showed a moderate but statistically significant positive correlation of 0.39. This does not show that one causes the other, but songs with more streams on one platform tend to have more on the other as well. Data on the 10 artists with the most total Spotify streams shows that a larger number of tracks does not necessarily mean more total streams. To illustrate this, Taylor Swift and The Weeknd were compared, which showed that a top-streamed artist with fewer tracks can have a higher median song-by-song performance. Finally, a two-tailed t-test confirmed that the mean log(Spotify streams) for explicit tracks is statistically significantly higher than that for non-explicit tracks.

INTRODUCTION

This project analyzes Spotify songs and examines the relationships between different variables, such as Spotify streams versus performance on other social media platforms, explicit content, and artists.

INTRO TO DATASET

Most Streamed Spotify Songs 2024
Kaggle.com
Nidula Elgiriyewithana - Data Scientist
January 1st 2024 - December 31st 2024
31 columns, 4,600 rows
Each row represents a different song

DATA CLEANING/PREPARATION

I changed the formatting of the numeric columns to number format (which removes the commas).
I filtered out any values that were blank (0).
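
The two cleaning steps above could also be done directly in R. The following is only a minimal sketch: the `raw` data frame and its values are made up for illustration, and it assumes the counts arrive as text with comma separators.

```r
# Hypothetical raw values as they might look before cleaning (not from the dataset)
raw <- data.frame(
  Track = c("Song A", "Song B", "Song C"),
  SpotifyStreams = c("390,470,936", "", "1,071"),
  stringsAsFactors = FALSE
)

# Strip the comma separators, then convert to numeric; "" becomes NA
raw$SpotifyStreams <- as.numeric(gsub(",", "", raw$SpotifyStreams))

# Keep only rows with a usable, positive stream count
clean <- raw[!is.na(raw$SpotifyStreams) & raw$SpotifyStreams > 0, ]
print(clean)
```

Doing the conversion in code rather than in a spreadsheet keeps the cleaning step reproducible alongside the rest of the analysis.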

df <- read.csv("spotify.csv")
head(df)

##                        Track                    AlbumName         Artist
## 1        MILLION DOLLAR BABY Million Dollar Baby - Single  Tommy Richman
## 2                Not Like Us                  Not Like Us Kendrick Lamar
## 3 i like the way you kiss me   I like the way you kiss me        Artemas
## 4                    Flowers             Flowers - Single    Miley Cyrus
## 5                    Houdini                      Houdini         Eminem
## 6                Lovin On Me                  Lovin On Me    Jack Harlow
##   ReleaseDate         ISRC AllTimeRank TrackScore SpotifyStreams
## 1     4/26/24 QM24S2402528           1      725.4      390470936
## 2      5/4/24 USUG12400910           2      545.9      323703884
## 3     3/19/24 QZJ842400387           3      538.4      601309283
## 4     1/12/23 USSM12209777           4      444.9     2031280633
## 5     5/31/24 USUG12403398           5      423.3      107034922
## 6    11/10/23 USAT22311371           6      410.1      670665438
##   SpotifyPlaylistCount SpotifyPlaylistReach Spotify.Popularity YouTubeViews
## 1                30716            196631588                 92     84274754
## 2                28113            174597137                 92    116347040
## 3                54331            211607669                 92    122599116
## 4               269802            136569078                 85   1096100899
## 5                 7223            151469874                 88     77373957
## 6               105892            175421034                 83    131148091
##   YouTubeLikes TikTokPosts TikTok.Likes TikTok.Views YouTubePlaylistReach
## 1      1713126     5767700    651565900   5332281936            150597040
## 2      3486739      674700     35223547    208339025            156380351
## 3      2228730     3025400    275154237   3369120610            373784955
## 4     10629796     7189811   1078757968  14603725994           3351188582
## 5      3670188       16400           NA           NA            112763851
## 6      1392593     4202367    214943489   2938686633           2867222632
##   AppleMusicPlaylistCount AirPlaySpins SiriusXMSpins DeezerPlaylistCount
## 1                     210        40975           684                  62
## 2                     188        40778             3                  67
## 3                     190        74333           536                 136
## 4                     394      1474799          2182                 264
## 5                     182        12185             1                  82
## 6                     138       522042          4654                  86
##   DeezerPlaylistReach AmazonPlaylistCount PandoraStreams PandoraTrackStations
## 1            17598718                 114       18004655                22931
## 2            10422430                 111        7780028                28444
## 3            36321847                 172        5022621                 5639
## 4            24684248                 210      190260277               203384
## 5            17660624                 105        4493884                 7006
## 6            17167254                 152      138529362                50982
##   SoundcloudStreams ShazamCounts ExplicitTrack  X X.1 X.2
## 1           4818457      2669262             0 NA  NA    
## 2           6623075      1118279             1 NA  NA    
## 3           7208651      5285340             0 NA  NA    
## 4                NA     11822942             0 NA  NA    
## 5            207179       457017             1 NA  NA    
## 6           9438601      4517131             1 NA  NA

# ----------------------------------------------------
# Part 1: Data Exploration (Steps 2 & 3) - T-Test for Correlation
# ----------------------------------------------------

# Load necessary libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# --- Assuming 'df' is loaded and the column names are clean ---

# 1. Preparation (Filter out NAs/zeros for analysis)
df_analysis <- df %>%
  filter(!is.na(TrackScore), 
         !is.na(SpotifyStreams), !is.na(YouTubeViews), 
         SpotifyStreams > 0, YouTubeViews > 0)


# ----------------------------------------------------
# 2. Summary Statistics 
# ----------------------------------------------------

summary_cols <- c("SpotifyStreams", "YouTubeViews", "TikTokPosts", "AirPlaySpins")
summary_stats <- df_analysis %>%
  select(all_of(summary_cols)) %>%
  summarise(
    # Pass na.rm through anonymous functions; supplying it via the `...`
    # argument of across() is deprecated as of dplyr 1.1.0
    across(everything(), list(
      Mean = \(x) mean(x, na.rm = TRUE),
      Median = \(x) median(x, na.rm = TRUE),
      SD = \(x) sd(x, na.rm = TRUE),
      Min = \(x) min(x, na.rm = TRUE),
      Max = \(x) max(x, na.rm = TRUE)
    ), .names = "{.col}_{.fn}")
  )

print("--- Summary Statistics for Key Metrics ---")

## [1] "--- Summary Statistics for Key Metrics ---"

print(summary_stats)

##   SpotifyStreams_Mean SpotifyStreams_Median SpotifyStreams_SD
## 1           443393484             242064180         524991078
##   SpotifyStreams_Min SpotifyStreams_Max YouTubeViews_Mean YouTubeViews_Median
## 1               1071         4281468720         395635678           145326186
##   YouTubeViews_SD YouTubeViews_Min YouTubeViews_Max TikTokPosts_Mean
## 1       693966489              913      16322756555         883728.9
##   TikTokPosts_Median TikTokPosts_SD TikTokPosts_Min TikTokPosts_Max
## 1             175536        2275383               1        37726462
##   AirPlaySpins_Mean AirPlaySpins_Median AirPlaySpins_SD AirPlaySpins_Min
## 1          55149.76                6169        127341.2                1
##   AirPlaySpins_Max
## 1          1777811

# ----------------------------------------------------
# 3. Visual Exploration (Part 1, Step 3)
# ----------------------------------------------------

# Define the bin width for consistency
BIN_WIDTH <- 50

# A. Histogram of Track Score with Modal Range Annotation
# -------------------------------------------------------

# 1. Create the base plot object
histogram_track_score <- ggplot(df_analysis, aes(x = TrackScore)) +
  geom_histogram(binwidth = BIN_WIDTH, fill = "lightblue", color = "grey") +
  labs(title = "Distribution of Track Score (Modal Range Highlighted)",
       x = "Track Score",
       y = "Count of Songs") +
  theme_minimal()

# 2. Extract histogram bin data
g_data <- ggplot_build(histogram_track_score)$data[[1]]
mode_bin_data <- g_data[which.max(g_data$count), ]

# 3. Add the label layer: Position at half height, rotate 90 degrees, and set color to white
histogram_track_score_labeled <- histogram_track_score +
  geom_text(data = mode_bin_data, 
            aes(x = x, y = count / 2, 
                label = paste("Range:", round(xmin, 1), "-", round(xmax, 1))),
            angle = 90,             
            vjust = 0.5,            
            color = "white",        
            size = 3,
            fontface = "bold")

# Print the Labeled Histogram
print(histogram_track_score_labeled)

# B. Scatter Plot of Log(Spotify Streams) vs. Log(YouTube Views)
scatter_log_streams <- ggplot(df_analysis, aes(x = log(SpotifyStreams), y = log(YouTubeViews))) +
  geom_point(alpha = 0.6, color = "pink") +
  geom_smooth(method = "lm", color = "coral", se = FALSE) +
  labs(title = "Relationship between Spotify and YouTube Popularity (Log Transformed)",
       x = "Log(Spotify Streams)",
       y = "Log(YouTube Views)") +
  theme_minimal()

# Print the Scatter Plot
print(scatter_log_streams)

## `geom_smooth()` using formula = 'y ~ x'

# C. Calculate correlation and perform t-test using cor.test()
# ----------------------------------------------------------------------
correlation_test_result <- cor.test(log(df_analysis$SpotifyStreams), 
                                    log(df_analysis$YouTubeViews), 
                                    method = "pearson")

print("--- T-Test for Correlation Significance (Log Streams vs. Log Views) ---")

## [1] "--- T-Test for Correlation Significance (Log Streams vs. Log Views) ---"

# cor.test() prints the t-statistic, df, p-value, and the correlation coefficient (r)
print(correlation_test_result)

## 
##  Pearson's product-moment correlation
## 
## data:  log(df_analysis$SpotifyStreams) and log(df_analysis$YouTubeViews)
## t = 27.719, df = 4216, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3667824 0.4178450
## sample estimates:
##       cor 
## 0.3926162

The first chart is the Distribution of Track Score, a measure that combines various factors of a song's popularity across different streaming platforms and social media activity. It is provided to show a song's global relevance beyond just Spotify streams. Each metric used to calculate the Track Score can be weighted differently. The histogram is skewed to the right, meaning most songs are only moderately popular globally. This is shown by the concentration of songs at the lower track scores on the left of the graph. The peak of the histogram represents the most common Track Score range in the dataset: 2,444 songs had a Track Score between 25 and 75. This plot is consistent with a global music market characterized by a few superstars and many other tracks that gain only limited traction.
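
Right skew of this kind is also why the stream and view counts are analyzed on a log scale. The sketch below uses simulated lognormal data (an assumption made purely for illustration, not the Spotify data) and a small `skewness()` helper defined here to show how the log transform pulls in a long right tail.

```r
set.seed(42)  # synthetic data only; not the Spotify dataset
streams <- rlnorm(1000, meanlog = 19, sdlog = 1.5)

# Simple moment-based skewness (helper defined here for illustration)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2)^1.5)
}

skew_raw <- skewness(streams)       # strongly positive: long right tail
skew_log <- skewness(log(streams))  # close to 0: roughly symmetric

cat("Skewness (raw):", round(skew_raw, 2), "\n")
cat("Skewness (log):", round(skew_log, 2), "\n")
```

A skewness near zero on the log scale means that means and t-tests are much less dominated by a handful of extreme songs.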

The second chart represents the Relationship between Spotify and YouTube Popularity. Looking at the data points, Spotify streams and YouTube views appear to be correlated, which is confirmed by a t-test on the correlation. The correlation coefficient was calculated to be 0.39 (p-value < 2.2e-16), which shows a moderate positive correlation that is statistically significant. The points generally trend upward but are not tightly clustered around the line, suggesting that the factors making songs popular vary across platforms. There are many cases where songs had high YouTube views and low Spotify streams.
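
As a check on how the test statistic relates to the correlation, the t value reported by cor.test() can be recomputed from r and the degrees of freedom using t = r * sqrt(df) / sqrt(1 - r^2). This is a minimal sketch using the numbers printed in the output above; the variable names are chosen here for illustration.

```r
# Values copied from the cor.test() output above
r        <- 0.3926162
deg_free <- 4216   # degrees of freedom = n - 2

# t statistic for testing H0: true correlation = 0
t_stat <- r * sqrt(deg_free) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), deg_free)

# Share of variance in log(YouTube views) associated with log(Spotify streams)
r_squared <- r^2

cat("t =", round(t_stat, 3), " r^2 =", round(r_squared, 3), "\n")
```

The r^2 of roughly 0.15 is a reminder that, although the correlation is clearly not zero, most of the variation in YouTube views is not accounted for by Spotify streams.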

# ----------------------------------------------------
# ANLC 320 Midterm Project: Part 2 - Comparative Analysis (Spotify)
# ----------------------------------------------------

# Load Tidyverse library
library(tidyverse)

# Define the file name
FILE_NAME <- "spotify.csv"

# Load the dataset
df_spotify <- read_csv(FILE_NAME)

## New names:
## Rows: 4600 Columns: 31
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (6): Track, AlbumName, Artist, ReleaseDate, ISRC, ...31 dbl (22): TrackScore,
## SpotifyStreams, SpotifyPlaylistCount, SpotifyPlaylistR... num (1): AllTimeRank
## lgl (2): ...29, ...30
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...29`
## • `` -> `...30`
## • `` -> `...31`

# --- Part 2, Step 1 & 2: Identify and Visualize Top Artists ---

# 1. Group by Artist and calculate total Spotify Streams
df_artists <- df_spotify %>%
  filter(!is.na(Artist), !is.na(SpotifyStreams)) %>%
  
  group_by(Artist) %>%
  summarise(
    TotalStreams = sum(SpotifyStreams, na.rm = TRUE),
    N_Tracks = n() 
  ) %>%
  ungroup() %>%
  
  filter(N_Tracks > 1) %>% 
  arrange(desc(TotalStreams))

# Identify the Top 10 Artists for visualization
top_10_artists <- df_artists %>%
  slice(1:10)

print("Top 10 Artists by Total Spotify Streams")

## [1] "Top 10 Artists by Total Spotify Streams"

print(top_10_artists)

## # A tibble: 10 × 3
##    Artist         TotalStreams N_Tracks
##    <chr>                 <dbl>    <int>
##  1 Bad Bunny       37054834425       60
##  2 The Weeknd      36948540278       30
##  3 Drake           34962157577       62
##  4 Taylor Swift    34470771165       63
##  5 Post Malone     26137472958       22
##  6 Ed Sheeran      24014900390       15
##  7 Ariana Grande   23464991696       26
##  8 MUSIC LAB JPN   22866685573       14
##  9 Olivia Rodrigo  19729219749       20
## 10 Eminem          18878880174       15

# 2. Visualize the top 10 artists with vertical data labels inside the bars
top_artists_plot <- ggplot(top_10_artists, 
                           aes(x = reorder(Artist, -TotalStreams), 
                               y = TotalStreams)) +
  geom_col(fill = "pink") + # Pink bars
  
  # --- MODIFIED: ADDED VERTICAL DATA LABELS INSIDE BARS ---
  geom_text(aes(label = scales::comma(TotalStreams)),
            angle = 90,    # Rotate text 90 degrees to be vertical
            hjust = 1.1,   # Pushes the text down from the top edge (since it's rotated)
            vjust = 0.5,   # Centers the text horizontally on the bar
            color = "navy",
            size = 3) +
  # ----------------------------------------------------------
  
  scale_y_continuous(labels = scales::comma) + 
  labs(title = "Top 10 Artists by Total Spotify Stream Volume (Vertical Labels)",
       x = "Artist",
       y = "Total Spotify Streams") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

print(top_artists_plot)

The top 10 artists are listed in a table with their total number of streams and total number of tracks. Within the top 10, some artists with significantly fewer tracks show a higher mean number of streams per track. For example, the number one artist, Bad Bunny, has 60 tracks, while the next artist, The Weeknd, has half as many tracks and is only about 106 million streams behind.
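
The per-track comparison can be made explicit by dividing total streams by track count. The sketch below re-enters the values from the table above by hand; the `StreamsPerTrack` column and the `ranked` object are names introduced here for illustration.

```r
# Values copied from the top-10 table above
top_10 <- data.frame(
  Artist = c("Bad Bunny", "The Weeknd", "Drake", "Taylor Swift", "Post Malone",
             "Ed Sheeran", "Ariana Grande", "MUSIC LAB JPN", "Olivia Rodrigo", "Eminem"),
  TotalStreams = c(37054834425, 36948540278, 34962157577, 34470771165, 26137472958,
                   24014900390, 23464991696, 22866685573, 19729219749, 18878880174),
  N_Tracks = c(60, 30, 62, 63, 22, 15, 26, 14, 20, 15)
)

# Mean streams per track for each artist
top_10$StreamsPerTrack <- top_10$TotalStreams / top_10$N_Tracks

# Re-rank by per-track average instead of total volume
ranked <- top_10[order(-top_10$StreamsPerTrack), ]
print(ranked[, c("Artist", "N_Tracks", "StreamsPerTrack")], row.names = FALSE)
```

Ranking by the per-track average reorders the list considerably compared with ranking by total volume, which is the point the paragraph above makes.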

This bar chart displays the top 10 artists with the most total Spotify streams. The top 4 artists are within 3 billion streams of each other, the next artist falls more than 8 billion streams behind, and the following artists are within a few billion of each other. This chart is consistent with the Pareto Principle (the 80/20 rule), the idea that a small number of superstars take up the vast majority of streaming attention, and it shows how intense the competition is at the top tier.

# --- Part 2, Step 4: Select and Compare Two Artists ---

# The two artists chosen for comparison
ARTIST_A <- "The Weeknd" 
ARTIST_B <- "Taylor Swift" 

selected_artists <- c(ARTIST_A, ARTIST_B)

# Filter the original data for the two selected artists
df_comparison_spotify <- df_spotify %>%
  filter(Artist %in% selected_artists)

# Calculate comparative statistics
comparison_stats_spotify <- df_comparison_spotify %>%
  group_by(Artist) %>%
  summarise(
    N_Tracks = n(),
    Mean_Streams = mean(SpotifyStreams, na.rm = TRUE),
    Median_Streams = median(SpotifyStreams, na.rm = TRUE),
    SD_Streams = sd(SpotifyStreams, na.rm = TRUE),
    Mean_TikTokPosts = mean(TikTokPosts, na.rm = TRUE),
    SD_TrackScore = sd(TrackScore, na.rm = TRUE), 
    Explicit_Share = mean(ExplicitTrack, na.rm = TRUE)
  ) %>%
  ungroup()

print("--- Comparative Summary Statistics for Selected Artists ---")

## [1] "--- Comparative Summary Statistics for Selected Artists ---"

print(comparison_stats_spotify)

## # A tibble: 2 × 8
##   Artist       N_Tracks Mean_Streams Median_Streams SD_Streams Mean_TikTokPosts
##   <chr>           <int>        <dbl>          <dbl>      <dbl>            <dbl>
## 1 Taylor Swift       63   547155098.      395433400 485815993.          343551.
## 2 The Weeknd         31  1231618009.      959195854 965106760.          323156.
## # ℹ 2 more variables: SD_TrackScore <dbl>, Explicit_Share <dbl>

# --- Part 2, Step 5: Visualize Track Score Distribution ---

# Visualize the consistency of TrackScore for the two artists using boxplots
track_score_boxplot <- ggplot(df_comparison_spotify, aes(x = Artist, y = TrackScore)) +
  geom_boxplot(aes(fill = Artist)) +
  labs(title = paste0("Track Score Distribution: ", ARTIST_A, " vs. ", ARTIST_B),
       x = "Artist",
       y = "Track Score (Measure of Success)") +
  theme_minimal() +
  scale_fill_manual(values = c("lightyellow", "lightblue"))

print(track_score_boxplot)

To look at this difference more closely, I created a box plot of Track Score for Taylor Swift (34.47B streams and 63 tracks) and The Weeknd (36.95B streams and 30 tracks). The median line inside each box shows the typical Track Score for a track by that artist. As the plot shows, The Weeknd has a higher success level when comparing song by song, which agrees with the summary table, where his median streams per track (about 959 million) are more than double Taylor Swift's (about 395 million). Both artists have outliers, but Taylor Swift has one extreme outlier that surpasses all of The Weeknd's top-performing songs for 2024. Although Taylor Swift released about twice as many tracks and has many outliers, she is more likely to have some songs underperform. The Weeknd shows more consistent performance per track released.
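
One way to put a number on "consistency" is the coefficient of variation (standard deviation divided by mean), which compares spread relative to each artist's own average. This is a sketch under the assumption that relative spread is the right notion of consistency here; it uses the Mean_Streams and SD_Streams values from the summary table above, and the `CV` column name is introduced for illustration.

```r
# Values copied from the comparative summary table above
artist_stats <- data.frame(
  Artist       = c("Taylor Swift", "The Weeknd"),
  Mean_Streams = c(547155098, 1231618009),
  SD_Streams   = c(485815993, 965106760)
)

# Coefficient of variation: spread relative to the artist's own mean
artist_stats$CV <- artist_stats$SD_Streams / artist_stats$Mean_Streams

print(artist_stats)
```

In absolute terms The Weeknd's standard deviation is larger, but relative to his much higher mean his streams vary less than Taylor Swift's, so the conclusion depends on which measure of spread is used.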

# ----------------------------------------------------
# Statistical Test: T-Test on Log(SpotifyStreams) by Explicit Track Status
# Objective: Determine if the mean log streams differs significantly between
#            Explicit (1) and Non-Explicit (0) tracks.
# ----------------------------------------------------

# Load Tidyverse library
library(tidyverse)

# Define the file name
FILE_NAME <- "spotify.csv"

# 1. Data Loading and Cleaning
df_spotify <- read_csv(FILE_NAME) %>%
  # Filter out missing or zero stream counts, which cannot be logged
  filter(!is.na(SpotifyStreams), SpotifyStreams > 0) %>%
  # Ensure ExplicitTrack is treated as a factor for the t-test formula
  mutate(ExplicitTrack = factor(ExplicitTrack, 
                                levels = c(0, 1), 
                                labels = c("Non-Explicit", "Explicit")))

## New names:
## Rows: 4600 Columns: 31
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (6): Track, AlbumName, Artist, ReleaseDate, ISRC, ...31 dbl (22): TrackScore,
## SpotifyStreams, SpotifyPlaylistCount, SpotifyPlaylistR... num (1): AllTimeRank
## lgl (2): ...29, ...30
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...29`
## • `` -> `...30`
## • `` -> `...31`

# 2. Log Transformation
df_analysis <- df_spotify %>%
  mutate(LogSpotifyStreams = log(SpotifyStreams))


# 3. Calculate Group Means for reporting
group_means <- df_analysis %>%
  group_by(ExplicitTrack) %>%
  summarise(Mean_LogStreams = mean(LogSpotifyStreams, na.rm = TRUE))

print("--- Mean Log(Spotify Streams) by Track Status ---")

## [1] "--- Mean Log(Spotify Streams) by Track Status ---"

print(group_means)

## # A tibble: 2 × 2
##   ExplicitTrack Mean_LogStreams
##   <fct>                   <dbl>
## 1 Non-Explicit             18.6
## 2 Explicit                 19.2

# 4. Conduct Two-Sample Independent T-Test (Welch's Test)
# We use the formula notation: Dependent Variable ~ Independent Grouping Variable
# The default t.test() assumes unequal variances (Welch's test), which is recommended.
t_test_result <- t.test(LogSpotifyStreams ~ ExplicitTrack, 
                        data = df_analysis)

print("--- Full T-Test Result ---")

## [1] "--- Full T-Test Result ---"

print(t_test_result)

## 
##  Welch Two Sample t-test
## 
## data:  LogSpotifyStreams by ExplicitTrack
## t = -8.7794, df = 4468.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Non-Explicit and group Explicit is not equal to 0
## 95 percent confidence interval:
##  -0.6547503 -0.4157101
## sample estimates:
## mean in group Non-Explicit     mean in group Explicit 
##                   18.62750                   19.16273

# --- Interpretation Guidance (Narrative Step) ---

# 1. Check the P-Value:
#    - If the P-Value is less than 0.05, you reject the null hypothesis (H0) and conclude 
#      that the mean log streams are significantly different between the two groups.
#    - Based on the calculated P-value (which should be extremely small, ~0), the difference
#      is highly statistically significant.

# 2. Conclude: Explicit tracks have a statistically different (and based on the means, higher) 
#    average streaming success compared to Non-Explicit tracks.

Question: Does a song being explicit have an impact on song performance?

Null Hypothesis: The mean log(Spotify streams) for explicit tracks is equal to that of non-explicit tracks.

P-value: < 2.2e-16

Since the p-value is less than .05, we reject the null hypothesis. Explicit tracks have a higher mean log(Spotify streams) (19.16) than non-explicit tracks (18.63).

Business Implication: On average, a song that contains explicit content tends to achieve a higher level of streaming success than a non-explicit song. This is an association, so it does not show that explicit content itself causes the higher streams.
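
Because the test was run on log streams, the difference in means is easier to read after converting back with exp(), which turns it into a ratio of geometric means. The sketch below uses the group means and confidence interval from the Welch t-test output above; the variable names are chosen here for illustration.

```r
# Values copied from the Welch t-test output above
mean_log_non_explicit <- 18.62750
mean_log_explicit     <- 19.16273
ci_log_diff <- c(-0.6547503, -0.4157101)  # Non-Explicit minus Explicit

# Ratio of geometric mean streams: explicit relative to non-explicit
ratio <- exp(mean_log_explicit - mean_log_non_explicit)

# Flip the sign of the interval (Explicit minus Non-Explicit) and back-transform
ratio_ci <- sort(exp(-ci_log_diff))

cat("Ratio:", round(ratio, 2),
    " 95% CI:", round(ratio_ci[1], 2), "to", round(ratio_ci[2], 2), "\n")
```

Read this way, the typical (geometric mean) explicit track has roughly 1.7 times the Spotify streams of the typical non-explicit track, which is a more practical statement than a difference of about 0.54 log units.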

CONCLUSION

To compare track performance across social media platforms, I compared YouTube views with Spotify streams, and the results showed that the two are moderately and significantly correlated. I also learned that an artist can have a total number of streams close to another artist's with only half as many tracks, and therefore a better average performance per song. Finally, explicit tracks had a significantly higher mean log(Spotify streams) than non-explicit tracks.

2025-12-08
