Spotify_2023_statistical

Introduction

The world of music streaming is abuzz with catchy tunes vying for our ears and attention. But what makes a song truly soar to the top of the charts and capture our hearts? In this exploration, we dive into the fascinating world of Spotify’s most streamed songs in 2023, armed with a dataset of nearly 1,000 musical champions. Our mission: to uncover the hidden gems within these songs, the attributes that contribute to their meteoric rise on the streaming platform.

Key Features:

This treasure trove of data holds a wealth of information about each song, like its name, artists, release date, and even its musical pulse and mood.

track_name: Name of the song

artist(s)_name: Name of the artist(s) of the song

artist_count: Number of artists contributing to the song

released_year: Year when the song was released

released_month: Month when the song was released

released_day: Day of the month when the song was released

in_spotify_playlists: Number of Spotify playlists the song is included in

in_spotify_charts: Presence and rank of the song on Spotify charts

streams: Total number of streams on Spotify

in_apple_playlists: Number of Apple Music playlists the song is included in

in_apple_charts: Presence and rank of the song on Apple Music charts

in_deezer_playlists: Number of Deezer playlists the song is included in

in_deezer_charts: Presence and rank of the song on Deezer charts

in_shazam_charts: Presence and rank of the song on Shazam charts

bpm: Beats per minute, a measure of song tempo

key: Key of the song

mode: Mode of the song (major or minor)

danceability_%: Percentage indicating how suitable the song is for dancing

valence_%: Positivity of the song’s musical content

energy_%: Perceived energy level of the song

acousticness_%: Amount of acoustic sound in the song

instrumentalness_%: Amount of instrumental content in the song

liveness_%: Presence of live performance elements

speechiness_%: Amount of spoken words in the song

Libraries

# A comprehensive toolkit for data science in R
library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.3.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# To generate a comprehensive summary of data including mean, median, standard deviation, skewness, and kurtosis.
library(psych)

## Warning: package 'psych' was built under R version 4.3.3

## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

data <- read.csv("spotify-2023.csv")

General understanding of spotify-2023 dataset

nrow(data)

## [1] 953

ncol(data)

## [1] 24

colnames(data)

##  [1] "track_name"           "artist.s._name"       "artist_count"        
##  [4] "released_year"        "released_month"       "released_day"        
##  [7] "in_spotify_playlists" "in_spotify_charts"    "streams"             
## [10] "in_apple_playlists"   "in_apple_charts"      "in_deezer_playlists" 
## [13] "in_deezer_charts"     "in_shazam_charts"     "bpm"                 
## [16] "key"                  "mode"                 "danceability_."      
## [19] "valence_."            "energy_."             "acousticness_."      
## [22] "instrumentalness_."   "liveness_."           "speechiness_."

Descriptive Statistics

The describe() function from the psych library is useful for a first look at the Spotify data because it gives a quick per-variable summary: mean, standard deviation, median, trimmed mean, median absolute deviation (mad), minimum, maximum, range, skew, kurtosis, and standard error. This helps in understanding each variable’s distribution, spotting possible outliers, and selecting relevant variables for further analysis. Variables marked with an asterisk (*) in the output are non-numeric columns that describe() converted to numeric codes, so their statistics should not be read as real quantities.

describe(data)

##                      vars   n         mean           sd    median      trimmed
## track_name*             1 953       471.90       272.14       472       471.96
## artist.s._name*         2 953       325.55       188.29       323       327.09
## artist_count            3 953         1.56         0.89         1         1.38
## released_year           4 953      2018.24        11.12      2022      2020.96
## released_month          5 953         6.03         3.57         6         5.94
## released_day            6 953        13.93         9.20        13        13.64
## in_spotify_playlists    7 953      5200.12      7897.61      2224      3314.33
## in_spotify_charts       8 953        12.01        19.58         3         7.66
## streams                 9 953 513597931.31 566803887.06 290228626 400416420.60
## in_apple_playlists     10 953        67.81        86.44        34        49.80
## in_apple_charts        11 953        51.91        50.63        38        45.63
## in_deezer_playlists*   12 953       166.15       100.64       165       164.75
## in_deezer_charts       13 953         2.67         6.04         0         1.11
## in_shazam_charts*      14 953        52.98        62.90        11        43.85
## bpm                    15 953       122.54        28.06       121       121.01
## key*                   16 953         6.54         3.58         6         6.55
## mode*                  17 953         1.42         0.49         1         1.40
## danceability_.         18 953        66.97        14.63        69        67.70
## valence_.              19 953        51.43        23.48        51        51.35
## energy_.               20 953        64.28        16.55        66        65.11
## acousticness_.         21 953        27.06        26.00        18        23.53
## instrumentalness_.     22 953         1.58         8.41         0         0.00
## liveness_.             23 953        18.21        13.71        12        15.81
## speechiness_.          24 953        10.13         9.91         6         7.96
##                               mad  min        max      range  skew kurtosis
## track_name*                349.89    1        943        942  0.00    -1.20
## artist.s._name*            250.56    1        645        644 -0.02    -1.28
## artist_count                 0.00    1          8          7  2.54    10.28
## released_year                1.48 1930       2023         93 -4.28    20.35
## released_month               4.45    1         12         11  0.18    -1.20
## released_day                11.86    1         31         30  0.16    -1.24
## in_spotify_playlists      2364.75   31      52898      52867  2.92     9.79
## in_spotify_charts            4.45    0        147        147  2.57     8.43
## streams              279393863.23    0 3703895074 3703895074  1.99     4.33
## in_apple_playlists          40.03    0        672        672  2.47     7.85
## in_apple_charts             51.89    0        275        275  1.03     0.87
## in_deezer_playlists*       133.43    1        348        347  0.10    -1.20
## in_deezer_charts             0.00    0         58         58  3.75    18.87
## in_shazam_charts*           14.83    1        199        198  0.85    -0.72
## bpm                         31.13   65        206        141  0.41    -0.41
## key*                         4.45    1         12         11  0.01    -1.29
## mode*                        0.00    1          2          1  0.31    -1.90
## danceability_.              14.83   23         96         73 -0.43    -0.34
## valence_.                   28.17    4         97         93  0.01    -0.95
## energy_.                    17.79    9         97         88 -0.44    -0.27
## acousticness_.              22.24    0         97         97  0.95    -0.20
## instrumentalness_.           0.00    0         91         91  7.10    56.21
## liveness_.                   5.93    3         97         94  2.10     5.66
## speechiness_.                2.97    2         64         62  1.93     3.34
##                               se
## track_name*                 8.82
## artist.s._name*             6.10
## artist_count                0.03
## released_year               0.36
## released_month              0.12
## released_day                0.30
## in_spotify_playlists      255.83
## in_spotify_charts           0.63
## streams              18360578.88
## in_apple_playlists          2.80
## in_apple_charts             1.64
## in_deezer_playlists*        3.26
## in_deezer_charts            0.20
## in_shazam_charts*           2.04
## bpm                         0.91
## key*                        0.12
## mode*                       0.02
## danceability_.              0.47
## valence_.                   0.76
## energy_.                    0.54
## acousticness_.              0.84
## instrumentalness_.          0.27
## liveness_.                  0.44
## speechiness_.               0.32

Popularity Distribution (Distribution, correlation)

The mean of streams (about 513.6 million per track) sits far above the median (about 290.2 million), and the values range from 0 to about 3.7 billion. Together with a skew of 1.99, this points to a right-skewed distribution of popularity. This aligns with the typical “long tail” phenomenon in streaming services, where a few tracks dominate in terms of streams, while many others have significantly fewer.

ggplot(data, aes(x = streams)) +
  geom_histogram(binwidth = 1e7, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Streams", x = "Streams (Total streams per track)", y = "Frequency") +
  theme_minimal()

The positive skew for streams (1.99) and in_spotify_playlists (2.92; the number of Spotify playlists that include a track) further supports this notion. A small number of very popular tracks have streams in the billions and tens of thousands of playlist appearances, while most tracks have considerably less exposure.
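As a rough illustration of why a mean well above the median goes together with positive skew, the sketch below simulates long-tailed values and compares the two. The log-normal shape, its parameters, and the helper `sample_skewness()` are assumptions made for the example; none of this uses the real dataset.

```r
# Synthetic long-tailed "stream counts" (assumption: log-normal, not the real data)
set.seed(1)
streams_sim <- rlnorm(1000, meanlog = 19.5, sdlog = 1)

# Moment-based sample skewness (hypothetical helper, not from psych)
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

mean_streams   <- mean(streams_sim)
median_streams <- median(streams_sim)
skew_streams   <- sample_skewness(streams_sim)

# A long right tail pulls the mean above the median and makes skewness positive
cat("mean / median ratio:", round(mean_streams / median_streams, 2), "\n")
cat("skewness:", round(skew_streams, 2), "\n")
```

The same comparison on the real data can be read directly from the mean, median, and skew columns of the describe() output.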

I began this analysis with the hypothesis that inclusion in playlists significantly influences a song’s stream count. To check it, I explored the relationship between streams (also expressed as ‘streams_in_millions’) and playlist inclusions, both summed across platforms (‘total_playlist_inclusions’) and on Spotify alone (in_spotify_playlists). The goal was to determine whether a song’s popularity, as indicated by its streaming figures, is connected to its presence in playlists. The motivation is that songs featured in playlists could experience a surge in streams. Let’s delve into the findings.

data$in_deezer_playlists <- as.integer(data$in_deezer_playlists)

## Warning: NAs introduced by coercion

data$in_shazam_charts <- as.integer(data$in_shazam_charts)

## Warning: NAs introduced by coercion

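# Sum inclusions across platforms (note: in_shazam_charts is a chart count, not a playlist count).
# Rows where the coercion above produced NA will also have an NA total.
data$total_playlist_inclusions <- data$in_spotify_playlists + data$in_apple_playlists +
  data$in_deezer_playlists + data$in_shazam_charts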

plot(data$total_playlist_inclusions, data$streams, xlab = "Total Playlist Inclusions", ylab = "Streams", main = "Relationship between Total Playlist Inclusions and Streams")
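The "NAs introduced by coercion" warnings above mean that some values in these columns are not plain integers. One common cause, assumed here rather than verified against this file, is large counts stored with thousands separators such as "1,021". The sketch below uses made-up values to show how such strings behave and how stripping the commas first keeps them.

```r
# Hypothetical raw values, assuming counts are stored with thousands separators
raw_counts <- c("1,021", "45", "")

# Plain as.integer() turns "1,021" into NA (this is what triggers the warning)
naive <- suppressWarnings(as.integer(raw_counts))

# Stripping the commas first keeps the value; the empty string still becomes NA
parsed <- as.integer(gsub(",", "", raw_counts, fixed = TRUE))

print(naive)
print(parsed)
```

If the real columns follow this pattern, the same gsub() step could be applied before the as.integer() calls so that the largest counts are not silently dropped from the total.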

data$streams_in_millions <- data$streams / 1000000

ggplot(data, aes(x = in_spotify_playlists, y = streams_in_millions)) +
  geom_point(color = "green", size = 3, alpha = 0.6) +
  labs(x = "Spotify Playlist Inclusions", y = "Streams (millions)",
       title = "Relationship between Spotify Playlist Inclusions and Streams") +
  theme_minimal()

correlation <- cor(data$in_spotify_playlists, data$streams, method = "spearman")
print(correlation)

## [1] 0.8301377

Based on the analysis, we observe a positive relationship between playlist inclusions and streams for the songs in our dataset. To quantify it, I computed the Spearman rank correlation between in_spotify_playlists and streams, which gave a coefficient of 0.8301377, a strong positive monotonic association. Songs that are included in a greater number of playlists tend to accumulate higher numbers of streams. This is consistent with playlist inclusion helping to drive streaming performance, although a correlation alone cannot show the direction of the effect: popular songs may also be added to more playlists.

Furthermore, upon visual examination of the scatter plot, no outliers are immediately evident; no data points deviate markedly from the overall pattern. This suggests that the relationship between playlist inclusions and streams is relatively consistent across the dataset, without extreme or unusual observations that could skew the analysis or interpretation.
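cor() returns only the coefficient. A minimal sketch of how a p-value could be attached to a Spearman correlation is shown below, using cor.test() on simulated data. The variables `playlists_sim` and `streams_sim` and their distributions are assumptions for illustration, not the real columns.

```r
# Simulated playlist counts and streams with a built-in positive relationship
set.seed(42)
n <- 200
playlists_sim <- rlnorm(n, meanlog = 7.5, sdlog = 1.2)
streams_sim   <- playlists_sim * rlnorm(n, meanlog = 11, sdlog = 0.6)

# Spearman rank correlation with a significance test;
# exact = FALSE requests the asymptotic p-value
spearman_test <- cor.test(playlists_sim, streams_sim,
                          method = "spearman", exact = FALSE)

rho     <- unname(spearman_test$estimate)
p_value <- spearman_test$p.value

cat("Spearman rho:", round(rho, 3), "\n")
cat("p-value:", format.pval(p_value), "\n")
```

On the real data, the same call with data$in_spotify_playlists and data$streams would report the coefficient together with its p-value.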

Audio Features (Visualizing Subsample Variations)

Audio features such as bpm, danceability, valence, and energy have relatively small standard deviations compared to their means, suggesting less variability in these features.
Some features, like instrumentalness, exhibit high kurtosis (56.21) and skew (7.10), indicating heavy-tailed distributions in which a few tracks have very high values.

# Set seed for reproducibility
set.seed(123)

# Calculate the size of each subsample (roughly 50% of the original dataset, rounded down to a whole number)
subsample_size <- floor(nrow(data) * 0.5)

# Create 5 random subsamples (drawn with replacement)
df_1 <- data %>% sample_n(size = subsample_size, replace = TRUE)
df_2 <- data %>% sample_n(size = subsample_size, replace = TRUE)
df_3 <- data %>% sample_n(size = subsample_size, replace = TRUE)
df_4 <- data %>% sample_n(size = subsample_size, replace = TRUE)
df_5 <- data %>% sample_n(size = subsample_size, replace = TRUE)

# Function to calculate summary statistics for a dataset
calculate_summary <- function(df) {
  df |>
    summarise(mean_instrumentalness = mean(instrumentalness_.))
}

# Calculate summary statistics for each subsample
summary_df_1 <- calculate_summary(df_1)
summary_df_2 <- calculate_summary(df_2)
summary_df_3 <- calculate_summary(df_3)
summary_df_4 <- calculate_summary(df_4)
summary_df_5 <- calculate_summary(df_5)

# Combine summaries into a single dataframe
combined_summary <- bind_rows(summary_df_1, summary_df_2, summary_df_3, summary_df_4, summary_df_5, .id = "Subsample")

ggplot(combined_summary, aes(x = Subsample, y = mean_instrumentalness, fill = Subsample)) +
  geom_bar(stat = "identity") +
  labs(title = "Mean Instrumentalness Across Subsamples", x = "Subsample", y = "Mean Instrumentalness")

The mean of the instrumentalness column varies noticeably across the subsamples. This is likely because its distribution is heavy-tailed: the median is 0, the mean is only 1.58, and the maximum is 91, so a handful of highly instrumental tracks dominate the mean, and each subsample’s mean depends on how many of those tracks it happens to draw. Only instrumentalness was compared across subsamples here, so the plot does not show whether other features vary more or less.

For example, a subsample made up only of classical music might have a very low standard deviation in instrumentalness scores, as nearly all of its songs would likely score as very instrumental. On the other hand, a subsample made up only of pop music might have a much higher standard deviation, as there is a wider range of possible scores for this genre. The dataset has no genre column, so this example is illustrative only.
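To illustrate how the shape of a distribution affects the stability of subsample means, the sketch below compares a zero-heavy variable with a roughly symmetric one. Both variables are simulated, and the helpers `subsample_means()` and `cv()` are hypothetical names introduced for this example.

```r
set.seed(123)
n <- 953

# Assumption: an instrumentalness-like variable, about 90% zeros plus a few high values
instr_sim  <- ifelse(runif(n) < 0.9, 0, runif(n, 1, 90))
# Assumption: an energy-like variable, roughly symmetric around its mean
energy_sim <- rnorm(n, mean = 64, sd = 16)

# Means of repeated 50% subsamples drawn with replacement
subsample_means <- function(x, n_subsamples = 200, frac = 0.5) {
  size <- floor(length(x) * frac)
  replicate(n_subsamples, mean(sample(x, size = size, replace = TRUE)))
}

# Coefficient of variation: spread of the subsample means relative to their level
cv <- function(m) sd(m) / mean(m)

cv_instr  <- cv(subsample_means(instr_sim))
cv_energy <- cv(subsample_means(energy_sim))

cat("Relative spread, zero-heavy variable:", round(cv_instr, 3), "\n")
cat("Relative spread, symmetric variable: ", round(cv_energy, 3), "\n")
```

Using many subsamples rather than five gives a steadier picture of how much the subsample mean moves around.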

Analyzing the Impact of Musical Modes (Major vs. Minor) on Streaming Counts

The two most common modes in music are “major” and “minor.” Here’s an explanation of each:

Major Mode: The major mode is often associated with a bright, happy, and cheerful mood in music. It is characterized by a specific pattern of whole and half steps between notes and is built around a major scale. Major keys typically sound more positive and uplifting.

Minor Mode: The minor mode, on the other hand, is often associated with a sadder or more melancholic mood. It has a different pattern of whole and half steps and is built around a natural minor scale. Minor keys tend to convey a sense of sadness, seriousness, or introspection.

Now, I am going to investigate whether the mode of a song (major or minor) has any influence on its streaming count. Let’s conduct a two-sample t-test to determine whether the two modes have the same mean streaming count or whether there is a significant difference between them.

# Count the number of occurrences for each mode
mode_counts <- table(data$mode)

# Convert mode_counts to a data frame
mode_counts_df <- as.data.frame(mode_counts)
names(mode_counts_df) <- c("Mode", "Count")

# Plot the distribution of modes using a pie chart
ggplot(mode_counts_df, aes(x = "", y = Count, fill = Mode)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  labs(title = "Distribution of Modes",
       fill = "Mode") +
  theme_void() +
  theme(legend.position = "right")

# Prepare the data
major_music <- data[data$mode == "Major", "streams_in_millions"]
minor_music <- data[data$mode == "Minor", "streams_in_millions"]

# Check for the normality of the data. Plot the histogram to check for normality
par(mfrow = c(1, 2), mar = c(5, 4, 4, 2))
hist(major_music, main = "Normality of Major music streams", xlab = "Streams", col = "lightblue", probability = TRUE)
hist(minor_music, main = "Normality of Minor music streams", xlab = "Streams", col = "lightblue", probability = TRUE)

# Transform the data using log transformation
log_major <- log1p(major_music)
log_minor <- log1p(minor_music)

# Plot histograms after log transformation
par(mfrow = c(1, 2), mar = c(5, 4, 4, 2))
hist(log_major, main = "Normality of log_major music streams", xlab = "Streams", col = "lightblue", probability = TRUE)
hist(log_minor, main = "Normality of log_minor Music streams", xlab = "Streams", col = "lightblue", probability = TRUE)

# Perform Bartlett's test
bartlett_test <- bartlett.test(list(log_major, log_minor))

# Extract test statistic and p-value
bart_statistic <- bartlett_test$statistic
bart_pvalue <- bartlett_test$p.value

# Print the results
print(paste("Bartlett's test statistic:", bart_statistic))

## [1] "Bartlett's test statistic: 1.37789260020338"

print(paste("Bartlett's test p-value:", bart_pvalue))

## [1] "Bartlett's test p-value: 0.24046044055139"

The p-value for Bartlett’s test is 0.2404604, which is greater than the significance level of 0.05, so I fail to reject the null hypothesis of equal variances. The equal-variance assumption of the two-sample t-test is therefore reasonable.
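Bartlett’s test is known to be sensitive to departures from normality, so a rank-based check can serve as a cross-check. The sketch below runs both bartlett.test() and fligner.test() (Fligner-Killeen, from the stats package) on simulated groups. The group sizes, means, and standard deviations are assumptions, not values from the dataset.

```r
set.seed(7)

# Simulated log-streams for two groups with equal spread (assumed values)
log_major_sim <- rnorm(550, mean = 5.7, sd = 1.1)
log_minor_sim <- rnorm(400, mean = 5.6, sd = 1.1)

# Bartlett's test assumes normality within each group
bartlett_res <- bartlett.test(list(log_major_sim, log_minor_sim))

# Fligner-Killeen is rank-based and more robust to non-normal data
fligner_res <- fligner.test(list(log_major_sim, log_minor_sim))

cat("Bartlett p-value:       ", round(bartlett_res$p.value, 4), "\n")
cat("Fligner-Killeen p-value:", round(fligner_res$p.value, 4), "\n")
```

If both tests agree, the choice between the pooled and Welch versions of the t-test is less likely to depend on the normality assumption.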

Now I perform a two-sample t-test to determine whether the mode of a song (major or minor) has any influence on its streaming count.

# Perform two sample t-test to determine if mode of songs (major or minor) has any influence on their streaming counts
t_test_result <- t.test(log_major, log_minor, var.equal = TRUE)
print(t_test_result)

## 
##  Two Sample t-test
## 
## data:  log_major and log_minor
## t = 1.2769, df = 951, p-value = 0.2019
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.0496680  0.2346915
## sample estimates:
## mean of x mean of y 
##  5.736781  5.644269

Since the p-value (0.2019) is greater than the significance level (0.05), we do not have enough statistical evidence to conclude that the mode of a song significantly impacts its streams. In this dataset, songs in major and minor modes do not differ detectably in log-transformed streams; this is an absence of evidence for a difference rather than proof that listeners enjoy both modes equally.
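Because the test was run on log1p(streams in millions), the difference in means is on a log scale. A rough way to read it is to exponentiate, which gives an approximate ratio of typical (geometric-mean) streams between major and minor songs. This treats log1p as approximately log, a reasonable assumption only when most values are much larger than 1. The numbers below are copied from the t-test output above.

```r
# Values copied from the t-test output above
mean_log_major <- 5.736781
mean_log_minor <- 5.644269
ci_log_diff    <- c(-0.0496680, 0.2346915)

# Difference on the log scale and its back-transformed ratio
diff_log       <- mean_log_major - mean_log_minor
ratio_estimate <- exp(diff_log)
ratio_ci       <- exp(ci_log_diff)

cat("Approximate major/minor ratio:", round(ratio_estimate, 2), "\n")
cat("Approximate 95% interval:", round(ratio_ci, 2), "\n")
```

An interval for the ratio that contains 1 tells the same story as the interval for the difference containing 0.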

Poisson GLM

# Fit a Poisson GLM
glm_model <- glm(streams ~ key + mode + bpm + danceability_. + valence_. + energy_. + acousticness_. + instrumentalness_. + speechiness_.,
                 data = data,
                 family = poisson(link = "log"))

# Display summary of the model
summary(glm_model)

## 
## Call:
## glm(formula = streams ~ key + mode + bpm + danceability_. + valence_. + 
##     energy_. + acousticness_. + instrumentalness_. + speechiness_., 
##     family = poisson(link = "log"), data = data)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.098e+01  1.346e-05 1559139   <2e-16 ***
## keyA               -2.315e-01  7.357e-06  -31472   <2e-16 ***
## keyA#               1.430e-01  7.277e-06   19652   <2e-16 ***
## keyB                8.950e-02  6.743e-06   13272   <2e-16 ***
## keyC#               2.356e-01  5.912e-06   39853   <2e-16 ***
## keyD                5.536e-02  6.621e-06    8362   <2e-16 ***
## keyD#               1.349e-01  8.779e-06   15368   <2e-16 ***
## keyE                1.714e-01  7.164e-06   23930   <2e-16 ***
## keyF               -3.905e-02  6.700e-06   -5828   <2e-16 ***
## keyF#               9.198e-02  6.939e-06   13256   <2e-16 ***
## keyG               -1.050e-01  6.589e-06  -15934   <2e-16 ***
## keyG#              -2.239e-02  6.619e-06   -3383   <2e-16 ***
## modeMinor          -6.707e-02  3.127e-06  -21448   <2e-16 ***
## bpm                -5.132e-04  5.209e-08   -9851   <2e-16 ***
## danceability_.     -7.714e-03  1.152e-07  -66942   <2e-16 ***
## valence_.           8.146e-04  7.434e-08   10957   <2e-16 ***
## energy_.           -3.268e-03  1.166e-07  -28027   <2e-16 ***
## acousticness_.     -2.285e-03  7.158e-08  -31918   <2e-16 ***
## instrumentalness_. -9.062e-03  2.150e-07  -42146   <2e-16 ***
## speechiness_.      -1.320e-02  1.694e-07  -77960   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4.6822e+11  on 952  degrees of freedom
## Residual deviance: 4.4474e+11  on 933  degrees of freedom
## AIC: 4.4474e+11
## 
## Number of Fisher Scoring iterations: 5
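The residual deviance above (4.4474e+11 on 933 degrees of freedom) is hundreds of millions of times larger than its degrees of freedom, while a Poisson model assumes a dispersion of 1. This suggests strong overdispersion, in which case the tiny standard errors and huge z values are likely to be overstated. A minimal sketch of one common alternative, a quasi-Poisson fit, is shown below on simulated overdispersed counts. The data, the single predictor, and all parameter values are assumptions for illustration.

```r
set.seed(2023)
n <- 500

# Simulated overdispersed counts (negative binomial) with one predictor
danceability_sim <- runif(n, 20, 95)
mu_sim           <- exp(6 - 0.008 * danceability_sim)
sim <- data.frame(
  streams      = rnbinom(n, size = 1.5, mu = mu_sim),
  danceability = danceability_sim
)

# Same mean model, two variance assumptions
fit_pois  <- glm(streams ~ danceability, data = sim, family = poisson(link = "log"))
fit_quasi <- glm(streams ~ danceability, data = sim, family = quasipoisson(link = "log"))

# Estimated dispersion (1 would match the Poisson assumption)
dispersion <- summary(fit_quasi)$dispersion

# Standard errors of the slope under each assumption
se_pois  <- coef(summary(fit_pois))["danceability", "Std. Error"]
se_quasi <- coef(summary(fit_quasi))["danceability", "Std. Error"]

cat("Estimated dispersion:", round(dispersion, 1), "\n")
cat("Slope SE (Poisson):      ", signif(se_pois, 3), "\n")
cat("Slope SE (quasi-Poisson):", signif(se_quasi, 3), "\n")
```

The coefficient estimates are identical in both fits; only the standard errors change, scaled by the square root of the estimated dispersion. A negative binomial model, or a linear model on log-transformed streams, would be other options to consider.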

1. Embrace the Dancefloor: Prioritize songs with high danceability metrics, on the premise that they correlate with increased streams. Incorporate energetic rhythms and infectious melodies to encourage listeners to move and engage with the music.

Hypothesis:

Null Hypothesis (H₀): No significant difference exists in stream counts between high and low danceability songs.
Alternative Hypothesis (H₁): Songs with higher danceability scores yield significantly more streams.

Rationale: Listeners tend to gravitate towards songs that inspire movement and engagement. The energetic nature of high danceability songs often leads to increased interaction and enjoyment.

# Define the threshold for high danceability (you can adjust this threshold as needed)
danceability_threshold <- 70

# Create two subsets based on danceability threshold
high_danceability <- filter(data, danceability_. >= danceability_threshold)
low_danceability <- filter(data, danceability_. < danceability_threshold)

# Perform two-sample t-test
t_test_result <- t.test(high_danceability$streams, low_danceability$streams, alternative = "greater")

# Output the results
print("Two-sample t-test result:")

## [1] "Two-sample t-test result:"

print(t_test_result)

## 
##  Welch Two Sample t-test
## 
## data:  high_danceability$streams and low_danceability$streams
## t = -2.2281, df = 942.72, p-value = 0.9869
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -141803892        Inf
## sample estimates:
## mean of x mean of y 
## 472354365 553900254

The results of the Welch Two Sample t-test indicate:

Test statistic (t): -2.2281
Degrees of freedom (df): 942.72
p-value: 0.9869

The p-value is greater than the significance level (typically 0.05), indicating weak evidence against the null hypothesis. Therefore, we fail to reject the null hypothesis.

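The one-sided 95% confidence interval for the difference in means (high minus low danceability) is (-141,803,892, Inf). Because the interval includes 0, the data do not support a difference greater than 0. In fact, the sample means point the other way: high-danceability songs average about 472 million streams, versus about 554 million for low-danceability songs.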

In summary, there is insufficient evidence to conclude that songs with higher danceability scores have a significantly higher number of streams compared to songs with lower danceability scores.
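A p-value close to 1 in a one-sided test usually means that the observed difference points in the opposite direction from the alternative. The sketch below recomputes the one-sided p-values from the reported t statistic and degrees of freedom (both copied from the output above) and derives the corresponding two-sided value, which works out to about 0.026.

```r
# Values copied from the Welch t-test output above
t_stat <- -2.2281
df     <- 942.72

# One-sided p-values in each direction
p_greater <- pt(t_stat, df, lower.tail = FALSE)  # H1: high > low (the test that was run)
p_less    <- pt(t_stat, df, lower.tail = TRUE)   # H1: high < low

# Two-sided p-value for "the means differ in either direction"
p_two_sided <- 2 * min(p_greater, p_less)

cat("p (greater):  ", round(p_greater, 4), "\n")
cat("p (less):     ", round(p_less, 4), "\n")
cat("p (two-sided):", round(p_two_sided, 4), "\n")
```

Read this way, the data hint that high-danceability songs in this dataset have fewer streams on average, not more. This is a descriptive observation for this sample, not a confirmed effect, since the direction was not hypothesized in advance.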

2. Keep it Positive, Less Talk: While a positive vibe enhances appeal, minimize spoken word sections. Let the music convey the message, focusing on creating a positive emotional journey for the listener.

Hypothesis:

Null Hypothesis (H₀): Songs with a more positive vibe and minimal spoken word sections do not have a significantly higher overall popularity or streaming count compared to songs with a less positive vibe or with more spoken word sections.
Alternative Hypothesis (H₁): Songs with a more positive vibe and minimal spoken word sections have a significantly higher overall popularity or streaming count compared to songs with a less positive vibe or with more spoken word sections, indicating that positivity and instrumental focus contribute to a better listener experience and higher engagement.

# Define thresholds for positivity and spoken word sections 
positivity_threshold <- 70  
spoken_word_threshold <- 10 

# Create subsets based on positivity and spoken word thresholds
positive_songs <- filter(data, valence_. >= positivity_threshold)
minimal_spoken_word_songs <- filter(data, speechiness_. < spoken_word_threshold)

# Perform Wilcoxon rank-sum test
wilcox_test_result <- wilcox.test(positive_songs$streams, minimal_spoken_word_songs$streams, alternative = "greater")

# Output the results
print("Wilcoxon rank-sum test result:")

## [1] "Wilcoxon rank-sum test result:"

print(wilcox_test_result)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  positive_songs$streams and minimal_spoken_word_songs$streams
## W = 78278, p-value = 0.8859
## alternative hypothesis: true location shift is greater than 0

The results of the Wilcoxon rank-sum test indicate:

Test statistic (W): 78278
p-value: 0.8859

The p-value is greater than the significance level (typically 0.05), indicating weak evidence against the null hypothesis. Therefore, we fail to reject the null hypothesis.

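The alternative hypothesis states a location shift greater than 0, meaning that the streaming counts of the first group tend to be higher than those of the second. Since we fail to reject the null hypothesis, the data provide no evidence for such a shift. One caveat: the test as run compares songs with valence of at least 70 against songs with speechiness below 10, and these two groups overlap (a song can meet both conditions). The rank-sum test assumes independent samples, so the result should be read with caution, and it does not directly compare positive, low-speech songs against all other songs.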
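A comparison that matches the stated hypothesis more closely would split the songs into two non-overlapping groups: those that meet both conditions (high valence and low speechiness) and all the rest. A minimal sketch on simulated data follows; the data frame `songs_sim`, its distributions, and the reuse of the 70 and 10 thresholds are assumptions for illustration.

```r
set.seed(99)
n <- 1000

# Simulated songs (assumed distributions, not the real dataset)
songs_sim <- data.frame(
  valence     = runif(n, 4, 97),
  speechiness = runif(n, 2, 64),
  streams     = rlnorm(n, meanlog = 19.5, sdlog = 1)
)

# Non-overlapping groups: songs meeting both conditions vs. every other song
meets_both     <- songs_sim$valence >= 70 & songs_sim$speechiness < 10
target_streams <- songs_sim$streams[meets_both]
other_streams  <- songs_sim$streams[!meets_both]

# One-sided Wilcoxon rank-sum test: do the target songs tend to have more streams?
wilcox_res <- wilcox.test(target_streams, other_streams, alternative = "greater")

cat("Songs meeting both conditions:", length(target_streams), "\n")
cat("p-value:", round(wilcox_res$p.value, 4), "\n")
```

Because every song falls into exactly one group, the independence assumption of the rank-sum test holds by construction.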

Time Series Analysis

library(tsibble)

## 
## Attaching package: 'tsibble'

## The following object is masked from 'package:lubridate':
## 
##     interval

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union

# Combine year, month, and day columns to create a Date object
data$release_date <- as.Date(paste(data$released_year, data$released_month, data$released_day, sep = "-"))

# Create a unique identifier column using row_number()
data <- mutate(data, unique_id = row_number())

# Create a tsibble object with release_date as index and unique_id as key
streaming_tsibble <- as_tsibble(data, key = unique_id, index = release_date) |>
  select(unique_id, streams)

ggplot(streaming_tsibble, aes(x = release_date, y = streams)) +
  geom_line() +
  labs(title = "Streaming Statistics Over Time",
       x = "Date",
       y = "Streams")

# Re-index the data by half-year and calculate average streams
# (convert to a plain tibble first, because summarise() on a tsibble keeps the release_date index)
spotify_2023_halfyear <- streaming_tsibble %>%
  as_tibble() %>%
  mutate(half_year = floor_date(release_date, '6 months')) %>%
  group_by(half_year) %>%
  summarise(avg_streams = mean(streams))

# Plotting average streams over time with LOESS smoothing
ggplot(spotify_2023_halfyear, aes(x = half_year, y = avg_streams)) +
  geom_line() +
  geom_smooth(span = 0.3, color = 'blue', se = FALSE) +
  labs(title = "Average Streams Over Time",
       subtitle = "(by half-year)",
       x = "Year",
       y = "Average Streams") +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  theme_minimal()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 18809

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 184

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 2.4205e-15

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 32761

Based on this analysis, we observe that songs released in 2022 have the highest streams compared to songs released in 2023 and in previous years. Note that the x-axis is the release date, so the plot describes the tracks released in each period rather than listening activity during that period. As we look at older release years, the number of streams tends to decline. This decline could be attributed to several factors. One possible explanation is that Spotify’s user base consists predominantly of younger listeners who are more inclined towards the latest releases. Older generations, who may have different music preferences, might contribute less to the streaming numbers on Spotify, as they may still rely on CDs, cassettes, or other traditional media for their favorite songs. Songs released in 2023 have also had less time to accumulate streams.
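The half-year averaging step can be sketched with base R alone, which makes the grouping explicit. The helper `half_year_start()` and the four example rows are hypothetical. It maps each date to 1 January or 1 July of its year, which is assumed to match what floor_date(release_date, '6 months') does for these dates.

```r
# Hypothetical helper: map each date to the first day of its half-year
half_year_start <- function(dates) {
  year        <- as.integer(format(dates, "%Y"))
  month       <- as.integer(format(dates, "%m"))
  start_month <- ifelse(month <= 6, 1L, 7L)
  as.Date(sprintf("%d-%02d-01", year, start_month))
}

# Made-up release dates and stream counts
releases_sim <- data.frame(
  release_date = as.Date(c("2022-01-14", "2022-05-20", "2022-09-02", "2023-03-10")),
  streams      = c(400, 600, 300, 100)
)

# One row per half-year, holding the mean streams of tracks released in it
releases_sim$half_year <- half_year_start(releases_sim$release_date)
avg_by_half_year <- aggregate(streams ~ half_year, data = releases_sim, FUN = mean)

print(avg_by_half_year)
```

The result has exactly one row per half-year, which is the shape the line and LOESS layers expect.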

# Create a new column named "rank_streams" containing the rank of streams
data$rank_streams <- rank(-data$streams, ties.method = "min")

# Sort the data frame by the rank of streams in descending order
data <- data[order(-data$streams), ]

# Function to calculate percentage and select top n concentrated release years
calculate_top_n_concentrated <- function(df, n) {
  df %>%
    group_by(released_year) %>%
    summarise(count = n()) %>%
    arrange(desc(count)) %>%
    head(n) %>%
    mutate(percentage = paste0(round((count / sum(count)) * 100, 1), "%"))
}

# Subset the data for top 10, top 50, and top 100
top_10 <- head(data, 10)
top_50 <- head(data, 50)
top_100 <- head(data, 100)

# Calculate top 3 concentrated release years for each subset
top_3_concentrated_top_10 <- calculate_top_n_concentrated(top_10, 3)
top_3_concentrated_top_50 <- calculate_top_n_concentrated(top_50, 3)
top_3_concentrated_top_100 <- calculate_top_n_concentrated(top_100, 3)

# Combine top 3 concentrated release years for plotting
combined_top_3_concentrated <- rbind(
  data.frame(release_year = top_3_concentrated_top_10$released_year, count = top_3_concentrated_top_10$count, percentage = top_3_concentrated_top_10$percentage, group = "Top 10"),
  data.frame(release_year = top_3_concentrated_top_50$released_year, count = top_3_concentrated_top_50$count, percentage = top_3_concentrated_top_50$percentage, group = "Top 50"),
  data.frame(release_year = top_3_concentrated_top_100$released_year, count = top_3_concentrated_top_100$count, percentage = top_3_concentrated_top_100$percentage, group = "Top 100")
)

# Plot the bar graph
ggplot(combined_top_3_concentrated, aes(x = release_year, y = count, fill = group)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = percentage), position = position_dodge(width = 1), vjust = -0.5) +
  labs(title = "Top 3 Concentrated Release Years in Top 10, Top 50, and Top 100 Songs",
       x = "Release Year",
       y = "Count") +
  theme_minimal() +
  theme(legend.position = "top")

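The mean release year is 2018 (standard deviation 11.12), but the median is 2022 and the distribution is strongly left-skewed (skew of -4.28). Most tracks are therefore recent releases, while a small number of much older tracks, going back as far as 1930, pull the mean down. These older tracks remain popular years or even decades after their release.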

# Map fill to the month so that the legend is drawn and can be labelled
ggplot(data, aes(x = factor(released_month), fill = factor(released_month))) +
  geom_bar(width = 1) +
  coord_polar() +
  labs(title = "Distribution of Release Months",
       x = "Release Month",
       y = "Frequency") +
  scale_fill_manual(name = "Month", values = rainbow(12), labels = month.abb) +  # Legend labels as month abbreviations
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

The release month of a song may play a role in its performance. January, May, and September were popular months.

January and May are the most popular release months with 38.5% and 36.8%, respectively.

In the Top 100:
- 56.5% of songs were released in January
- 23.9% of songs were released in September
- 19.6% of songs were released in March

In the Top 50:
- 50% of songs were released in January
- 29.2% of songs were released in September
- 20.9% of songs were released in May

In the Top 10:
- 33.3% of songs were released in January
- 33.3% of songs were released in May
- 33.3% of songs were released in November

The percentages within each group sum to about 100%, so they appear to be shares among the three most common release months in that group (as computed by calculate_top_n_concentrated), not shares of all songs in the group.

Possible explanations for the popular months:
- January: New year, new songs. Spotify users seek out new music.
- May: Warmer days are around the corner, and it is also the perfect time for tours and festivals to promote new music.
- September: People are back at work or school and looking for new songs to help them concentrate.
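One way to report each month’s share of the whole subset, rather than its share among the top three months, is to compute the proportions before keeping only the three largest categories. The sketch below contrasts the two calculations on a made-up "top 10" of release months; the data are hypothetical.

```r
# Hypothetical release months for a "top 10" subset
top_sim <- data.frame(released_month = c(1, 1, 1, 1, 5, 5, 9, 9, 3, 11))

# Count songs per month, keeping all 12 months as labelled levels
month_counts <- table(factor(top_sim$released_month, levels = 1:12, labels = month.abb))

# Share of the whole subset (computed before selecting the top 3)
share_of_subset <- prop.table(month_counts) * 100

# Share among the top 3 months only (computed after selecting the top 3)
top3              <- sort(month_counts, decreasing = TRUE)[1:3]
share_within_top3 <- top3 / sum(top3) * 100

print(round(share_of_subset[names(top3)], 1))
print(round(share_within_top3, 1))
```

In this toy example January accounts for 40% of the subset but 50% of the top-three months, which shows how much the choice of denominator changes the reported figure.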

Summary:

The analysis delves into Spotify’s most streamed songs in 2023, aiming to uncover the factors contributing to their popularity. The dataset, comprising nearly 1,000 songs, offers insights into various attributes such as track features, popularity metrics, and release details.

Key Findings:

Popularity Distribution: The distribution of streams exhibits a skewed pattern, typical of the “long tail” phenomenon in streaming services, indicating a few highly popular tracks and many with lower streams.
Playlist Inclusions: Songs included in a greater number of Spotify playlists tend to accumulate higher streams (Spearman correlation of 0.83), indicating a strong association between playlist placement and a song’s popularity.
Audio Features: Features like danceability, valence, and energy show less variability, while the mean of instrumentalness varies across subsamples, reflecting its heavy-tailed distribution.
Musical Mode: The two-sample t-test found no significant difference in log-transformed streams between major and minor songs (p = 0.2019).
Poisson GLM and Danceability: The Poisson GLM estimates a negative coefficient for danceability, and the one-sided t-test found insufficient evidence that high-danceability songs earn more streams, so the data do not support danceability as a driver of streams.
Time Series Analysis: Average streams by release date are highest for songs released in 2022 and decline for older release years, possibly due to shifting user preferences and demographics.
Release Trends: January, May, and September emerge as popular release months, with January dominating across all analyzed subsets.

Marketing Recommendations:

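Treat Danceability with Care: Energetic rhythms and infectious melodies may engage listeners, but the tests here found no evidence that high-danceability songs earn more streams, so danceability alone should not be relied on to increase streams.
Focus on Positive Vibes: Positivity and fewer spoken word sections may enhance listener experience and engagement, but the Wilcoxon test here found no evidence that they raise streams.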
Strategic Release Timing: Capitalize on peak release months like January, May, and September to maximize visibility and audience engagement.
Playlist Placement Strategy: Aim for inclusion in diverse playlists across platforms to boost exposure and increase streaming performance.
Adapt to Shifting Trends: Stay attuned to evolving user preferences and demographic shifts to tailor marketing strategies effectively.

By aligning marketing efforts with these recommendations, Spotify and its artists can enhance their visibility, engagement, and ultimately, their streaming performance in the dynamic music streaming landscape.

Spotify_2023_statistical_inference

Gagan

2024-04-22