The world of music streaming is abuzz with catchy tunes vying for our ears and attention. But what makes a song truly soar to the top of the charts and capture our hearts? In this exploration, we dive into the fascinating world of Spotify’s most streamed songs in 2023, armed with a dataset of nearly 1,000 musical champions. Our mission: to uncover the hidden gems within these songs, the attributes that contribute to their meteoric rise on the streaming platform.
This treasure trove of data holds a wealth of information about each song, like its name, artists, release date, and even its musical pulse and mood.
track_name: Name of the song
artist(s)_name: Name of the artist(s) of the song
artist_count: Number of artists contributing to the song
released_year: Year when the song was released
released_month: Month when the song was released
released_day: Day of the month when the song was released
in_spotify_playlists: Number of Spotify playlists the song is included in
in_spotify_charts: Presence and rank of the song on Spotify charts
streams: Total number of streams on Spotify
in_apple_playlists: Number of Apple Music playlists the song is included in
in_apple_charts: Presence and rank of the song on Apple Music charts
in_deezer_playlists: Number of Deezer playlists the song is included in
in_deezer_charts: Presence and rank of the song on Deezer charts
in_shazam_charts: Presence and rank of the song on Shazam charts
bpm: Beats per minute, a measure of song tempo
key: Key of the song
mode: Mode of the song (major or minor)
danceability_%: Percentage indicating how suitable the song is for dancing
valence_%: Positivity of the song’s musical content
energy_%: Perceived energy level of the song
acousticness_%: Amount of acoustic sound in the song
instrumentalness_%: Amount of instrumental content in the song
liveness_%: Presence of live performance elements
speechiness_%: Amount of spoken words in the song
# A comprehensive toolkit for data science in R
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# To generate a comprehensive summary of data including mean, median, standard deviation, skewness, and kurtosis.
library(psych)
## Warning: package 'psych' was built under R version 4.3.3
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
data <- read.csv("spotify-2023.csv")
nrow(data)
## [1] 953
ncol(data)
## [1] 24
colnames(data)
## [1] "track_name" "artist.s._name" "artist_count"
## [4] "released_year" "released_month" "released_day"
## [7] "in_spotify_playlists" "in_spotify_charts" "streams"
## [10] "in_apple_playlists" "in_apple_charts" "in_deezer_playlists"
## [13] "in_deezer_charts" "in_shazam_charts" "bpm"
## [16] "key" "mode" "danceability_."
## [19] "valence_." "energy_." "acousticness_."
## [22] "instrumentalness_." "liveness_." "speechiness_."
The describe() function from the psych library gives a quick summary of each variable: mean, standard deviation, median, trimmed mean, median absolute deviation, range, skewness, kurtosis, and standard error. This helps in understanding each variable's distribution, spotting likely outliers, and selecting relevant variables for further analysis. Rows marked with an asterisk (for example track_name* and key*) are non-numeric columns that describe() converted to numeric codes, so their statistics are not meaningful.
describe(data)
## vars n mean sd median trimmed
## track_name* 1 953 471.90 272.14 472 471.96
## artist.s._name* 2 953 325.55 188.29 323 327.09
## artist_count 3 953 1.56 0.89 1 1.38
## released_year 4 953 2018.24 11.12 2022 2020.96
## released_month 5 953 6.03 3.57 6 5.94
## released_day 6 953 13.93 9.20 13 13.64
## in_spotify_playlists 7 953 5200.12 7897.61 2224 3314.33
## in_spotify_charts 8 953 12.01 19.58 3 7.66
## streams 9 953 513597931.31 566803887.06 290228626 400416420.60
## in_apple_playlists 10 953 67.81 86.44 34 49.80
## in_apple_charts 11 953 51.91 50.63 38 45.63
## in_deezer_playlists* 12 953 166.15 100.64 165 164.75
## in_deezer_charts 13 953 2.67 6.04 0 1.11
## in_shazam_charts* 14 953 52.98 62.90 11 43.85
## bpm 15 953 122.54 28.06 121 121.01
## key* 16 953 6.54 3.58 6 6.55
## mode* 17 953 1.42 0.49 1 1.40
## danceability_. 18 953 66.97 14.63 69 67.70
## valence_. 19 953 51.43 23.48 51 51.35
## energy_. 20 953 64.28 16.55 66 65.11
## acousticness_. 21 953 27.06 26.00 18 23.53
## instrumentalness_. 22 953 1.58 8.41 0 0.00
## liveness_. 23 953 18.21 13.71 12 15.81
## speechiness_. 24 953 10.13 9.91 6 7.96
## mad min max range skew kurtosis
## track_name* 349.89 1 943 942 0.00 -1.20
## artist.s._name* 250.56 1 645 644 -0.02 -1.28
## artist_count 0.00 1 8 7 2.54 10.28
## released_year 1.48 1930 2023 93 -4.28 20.35
## released_month 4.45 1 12 11 0.18 -1.20
## released_day 11.86 1 31 30 0.16 -1.24
## in_spotify_playlists 2364.75 31 52898 52867 2.92 9.79
## in_spotify_charts 4.45 0 147 147 2.57 8.43
## streams 279393863.23 0 3703895074 3703895074 1.99 4.33
## in_apple_playlists 40.03 0 672 672 2.47 7.85
## in_apple_charts 51.89 0 275 275 1.03 0.87
## in_deezer_playlists* 133.43 1 348 347 0.10 -1.20
## in_deezer_charts 0.00 0 58 58 3.75 18.87
## in_shazam_charts* 14.83 1 199 198 0.85 -0.72
## bpm 31.13 65 206 141 0.41 -0.41
## key* 4.45 1 12 11 0.01 -1.29
## mode* 0.00 1 2 1 0.31 -1.90
## danceability_. 14.83 23 96 73 -0.43 -0.34
## valence_. 28.17 4 97 93 0.01 -0.95
## energy_. 17.79 9 97 88 -0.44 -0.27
## acousticness_. 22.24 0 97 97 0.95 -0.20
## instrumentalness_. 0.00 0 91 91 7.10 56.21
## liveness_. 5.93 3 97 94 2.10 5.66
## speechiness_. 2.97 2 64 62 1.93 3.34
## se
## track_name* 8.82
## artist.s._name* 6.10
## artist_count 0.03
## released_year 0.36
## released_month 0.12
## released_day 0.30
## in_spotify_playlists 255.83
## in_spotify_charts 0.63
## streams 18360578.88
## in_apple_playlists 2.80
## in_apple_charts 1.64
## in_deezer_playlists* 3.26
## in_deezer_charts 0.20
## in_shazam_charts* 2.04
## bpm 0.91
## key* 0.12
## mode* 0.02
## danceability_. 0.47
## valence_. 0.76
## energy_. 0.54
## acousticness_. 0.84
## instrumentalness_. 0.27
## liveness_. 0.44
## speechiness_. 0.32
The high mean relative to the median and the large range for streams (total streams per track) confirm a skewed distribution of popularity. This aligns with the typical “long tail” phenomenon in streaming services, where a few tracks dominate in terms of streams, while many others have significantly fewer.
ggplot(data, aes(x = streams)) +
geom_histogram(binwidth = 1e7, fill = "skyblue", color = "black") +
labs(title = "Distribution of Streams", x = "Streams (Total streams per track)", y = "Frequency") +
theme_minimal()
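The skew in a distribution like this can also be quantified directly. The following is a minimal sketch on simulated right-skewed values, not on the Spotify data; `sample_skewness` is a hypothetical helper implementing the plain moment-based estimator, so its values may differ slightly from the skew column reported by describe().

```r
# Simulated right-skewed "stream counts" (illustrative only, not the real data)
set.seed(42)
toy_streams <- rlnorm(1000, meanlog = 19, sdlog = 1)

# Moment-based sample skewness: mean cubed deviation divided by
# the 1.5 power of the mean squared deviation
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# With a long right tail the mean is pulled above the median
mean(toy_streams) > median(toy_streams)

# Positive values indicate a right-skewed distribution
sample_skewness(toy_streams)
```

A mean above the median together with positive skewness is the same pattern the summary table shows for streams.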
The positive skew for streams and in_spotify_playlists (number of Spotify playlists a track appears in) further supports this notion. There are likely a small number of very popular tracks with stream counts in the billions and tens of thousands of playlist appearances, while most tracks have considerably less exposure.
I began this analysis with the hypothesis that inclusion in playlists might significantly influence a song's stream count. To test it, I explored the relationship between streams and playlist inclusions, first across platforms (‘total_playlist_inclusions’) and then on Spotify alone (‘streams_in_millions’ against ‘in_spotify_playlists’). My goal was to determine whether there is a connection between a song’s popularity, as indicated by its streaming figures, and its presence in playlists. This inquiry is motivated by the notion that songs featured in playlists could experience a surge in streams. Let’s delve into the findings.
data$in_deezer_playlists <- as.integer(data$in_deezer_playlists)
## Warning: NAs introduced by coercion
data$in_shazam_charts <- as.integer(data$in_shazam_charts)
## Warning: NAs introduced by coercion
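# Note: any NA produced by the coercions above propagates into this sum,
# and in_shazam_charts is a chart column rather than a playlist count
data$total_playlist_inclusions <- data$in_spotify_playlists + data$in_apple_playlists + data$in_deezer_playlists + data$in_shazam_charts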
plot(data$total_playlist_inclusions, data$streams, xlab = "Total Playlist Inclusions", ylab = "Streams", main = "Relationship between Total Playlist Inclusions and Streams")
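The coercion warnings indicate that some entries in these columns are not plain integers. One common cause, assumed here rather than confirmed from the file, is thousands separators such as "1,281". A minimal sketch of a more forgiving conversion, where `parse_count` is a hypothetical helper and the input values are made up:

```r
# Hypothetical helper: strip thousands separators before converting to integer
parse_count <- function(x) {
  as.integer(gsub(",", "", x, fixed = TRUE))
}

# Made-up values for illustration; an empty string still becomes NA
raw <- c("45", "1,281", "", "2,445")
parse_count(raw)
```

Values that remain NA after such a conversion would still need to be handled explicitly, for example with `na.rm = TRUE` in later summaries.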
data$streams_in_millions <- data$streams / 1000000
ggplot(data, aes(x = in_spotify_playlists, y = streams_in_millions)) +
geom_point(color = "green", size = 3, alpha = 0.6) +
labs(x = "Spotify Playlist Inclusions", y = "Streams (millions)",
title = "Relationship between Spotify Playlist Inclusions and Streams") +
theme_minimal()
correlation <- cor(data$in_spotify_playlists, data$streams, method = "spearman")
print(correlation)
## [1] 0.8301377
The scatter plot shows a positive relationship between Spotify playlist inclusions and streams for the songs in our dataset. To quantify it, I computed the Spearman rank correlation, which came out at 0.83, a strong positive association. Songs that are included in more playlists tend to accumulate more streams. Correlation alone does not establish direction, however: playlist inclusion may drive streams, but popular songs are also more likely to be added to playlists.
Furthermore, a visual examination of the graph shows no points that deviate sharply from the overall pattern. Because the Spearman coefficient is based on ranks, it is also not sensitive to the extreme values that both of these right-skewed variables contain. The relationship between playlist inclusions and streams therefore appears consistent across the dataset.
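The cor() call returns only the coefficient. To attach a p-value to a Spearman correlation, cor.test() can be used instead. This is a minimal sketch on simulated data; the variable names and the way the toy values are generated are assumptions for illustration only.

```r
# Simulated playlist counts and stream counts with a positive relationship
set.seed(1)
toy_playlists <- rexp(200, rate = 1 / 3000)
toy_streams <- toy_playlists * 1e5 * exp(rnorm(200, sd = 0.5))

# Spearman rank correlation with a significance test
# exact = FALSE requests the large-sample approximation for the p-value
spearman_result <- cor.test(toy_playlists, toy_streams,
                            method = "spearman", exact = FALSE)

spearman_result$estimate  # Spearman's rho
spearman_result$p.value   # p-value for the null hypothesis rho = 0
```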
Audio features such as bpm, danceability, and energy have relatively small standard deviations compared to their means (roughly a fifth to a quarter of the mean), suggesting limited variability in these features. Valence is more spread out, with a standard deviation of 23.48 against a mean of 51.43.
Some features, like instrumentalness, exhibit high kurtosis (56.21), indicating heavy-tailed distributions with a few tracks having very high values.
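One way to compare variability across features measured on different scales is the coefficient of variation (standard deviation divided by mean). The sketch below uses the means and standard deviations copied from the describe() output; the data frame and its column names are introduced here only for illustration.

```r
# Means and standard deviations copied from the describe() table above
feature_stats <- data.frame(
  feature = c("bpm", "danceability", "valence", "energy", "instrumentalness"),
  mean = c(122.54, 66.97, 51.43, 64.28, 1.58),
  sd = c(28.06, 14.63, 23.48, 16.55, 8.41)
)

# Coefficient of variation: spread relative to the mean
feature_stats$cv <- round(feature_stats$sd / feature_stats$mean, 2)

# Sort from least to most variable
feature_stats[order(feature_stats$cv), ]
```

Instrumentalness stands out because its standard deviation is several times its mean, which is typical of a variable that is zero for most observations.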
# Set seed for reproducibility
set.seed(123)
# Calculate the size of each subsample (roughly 50% of the original dataset), rounded down to a whole number
subsample_size <- floor(nrow(data) * 0.5)
# Create 5 random subsamples
df_1 <- data %>% sample_n(size = subsample_size, replace = TRUE)
df_2 <- data %>% sample_n(size = subsample_size, replace = TRUE)
df_3 <- data %>% sample_n(size = subsample_size, replace = TRUE)
df_4 <- data %>% sample_n(size = subsample_size, replace = TRUE)
df_5 <- data %>% sample_n(size = subsample_size, replace = TRUE)
# Function to calculate summary statistics for a dataset
calculate_summary <- function(df) {
summary <- df |>
summarise(
mean_instrumentalness = mean(instrumentalness_.),
)
return(summary)
}
# Calculate summary statistics for each subsample
summary_df_1 <- calculate_summary(df_1)
summary_df_2 <- calculate_summary(df_2)
summary_df_3 <- calculate_summary(df_3)
summary_df_4 <- calculate_summary(df_4)
summary_df_5 <- calculate_summary(df_5)
# Combine summaries into a single dataframe
combined_summary <- bind_rows(summary_df_1, summary_df_2, summary_df_3, summary_df_4, summary_df_5, .id = "Subsample")
ggplot(combined_summary, aes(x = Subsample, y = mean_instrumentalness, fill = Subsample)) +
geom_bar(stat = "identity") +
labs(title = "Mean Instrumentalness Across Subsamples", x = "Subsample", y = "Mean Instrumentalness")
The mean of the instrumentalness column varies noticeably between subsamples. The likely reason is the shape of its distribution: the median is 0 and the skew (7.10) and kurtosis (56.21) are very high, so most songs score at or near zero and a handful score very high. Each subsample mean therefore depends heavily on how many of those few highly instrumental tracks happen to be drawn.
Genre could add to this effect, although the dataset has no genre column to check it. For example, a subsample made up entirely of classical music would probably show a very low standard deviation in instrumentalness, since nearly every track would score as highly instrumental, whereas a subsample of pop music could show a much wider spread of scores.
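The subsampling procedure can also be written compactly with replicate(). This is a minimal sketch in base R on a made-up vector that mimics a mostly-zero variable; it is not the instrumentalness column itself.

```r
set.seed(123)

# Toy stand-in: mostly zeros with a few large values
toy_instrumentalness <- c(rep(0, 900), sample(1:91, 53, replace = TRUE))

# Draw 5 subsamples of half the data (with replacement) and record each mean
subsample_means <- replicate(5, {
  idx <- sample(length(toy_instrumentalness),
                size = floor(length(toy_instrumentalness) / 2),
                replace = TRUE)
  mean(toy_instrumentalness[idx])
})

subsample_means
range(subsample_means)  # spread of the means across subsamples
```

Increasing the number of replicates gives a clearer picture of how much the mean moves from one subsample to the next.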
The two most common modes in music are “major” and “minor.” Here’s an explanation of each:
Major Mode: The major mode is often associated with a bright, happy, and cheerful mood in music. It is characterized by a specific pattern of whole and half steps between notes and is built around a major scale. Major keys typically sound more positive and uplifting.
Minor Mode: The minor mode, on the other hand, is often associated with a sadder or more melancholic mood. It has a different pattern of whole and half steps and is built around a natural minor scale. Minor keys tend to convey a sense of sadness, seriousness, or introspection.
Now, I am going to investigate whether the mode of a song (major or minor) has any influence on its streaming count. Let’s conduct a two-sample t-test to determine whether the two modes have the same mean streaming count or whether there is a significant difference between them.
# Count the number of occurrences for each mode
mode_counts <- table(data$mode)
# Convert mode_counts to a data frame
mode_counts_df <- as.data.frame(mode_counts)
names(mode_counts_df) <- c("Mode", "Count")
# Plot the distribution of modes using a pie chart
ggplot(mode_counts_df, aes(x = "", y = Count, fill = Mode)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
labs(title = "Distribution of Modes",
fill = "Mode") +
theme_void() +
theme(legend.position = "right")
# Prepare the data
major_music <- data[data$mode == "Major", "streams_in_millions"]
minor_music <- data[data$mode == "Minor", "streams_in_millions"]
# Check for the normality of the data. Plot the histogram to check for normality
par(mfrow = c(1, 2), mar = c(5, 4, 4, 2))
hist(major_music, main = "Normality of Major music streams", xlab = "Streams", col = "lightblue", probability = TRUE)
hist(minor_music, main = "Normality of Minor music streams", xlab = "Streams", col = "lightblue", probability = TRUE)
# Transform the data using log transformation
log_major <- log1p(major_music)
log_minor <- log1p(minor_music)
# Plot histograms after log transformation
par(mfrow = c(1, 2), mar = c(5, 4, 4, 2))
hist(log_major, main = "Normality of log_major music streams", xlab = "Streams", col = "lightblue", probability = TRUE)
hist(log_minor, main = "Normality of log_minor Music streams", xlab = "Streams", col = "lightblue", probability = TRUE)
# Perform Bartlett's test
bartlett_test <- bartlett.test(list(log_major, log_minor))
# Extract test statistics and p-value
bart_statistic <- bartlett_test$statistic
bart_pvalue <- bartlett_test$p.value
# Print the results
print(paste("Bartlett's test statistic:", bart_statistic))
## [1] "Bartlett's test statistic: 1.37789260020338"
print(paste("Bartlett's test p-value:", bart_pvalue))
## [1] "Bartlett's test p-value: 0.24046044055139"
The p-value for Bartlett’s test is 0.2405, which is greater than the significance level of 0.05, so I fail to reject the null hypothesis. The data are consistent with equal variances in the two groups.
Next, I perform a two-sample t-test to determine whether the mode of a song (major or minor) has any influence on its streaming count.
# Perform two sample t-test to determine if mode of songs (major or minor) has any influence on their streaming counts
t_test_result <- t.test(log_major, log_minor, var.equal = TRUE)
print(t_test_result)
##
## Two Sample t-test
##
## data: log_major and log_minor
## t = 1.2769, df = 951, p-value = 0.2019
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.0496680 0.2346915
## sample estimates:
## mean of x mean of y
## 5.736781 5.644269
Since the p-value (0.2019) is greater than the significance level (0.05), we do not have enough statistical evidence to conclude that mode affects streams. This does not prove that major and minor songs are streamed equally; it only means that this sample shows no detectable difference in mean log-transformed streams.
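The group means in the t-test output are on the log1p scale of streams in millions, which makes them hard to read. As a rough aid, they can be back-transformed with expm1(). The result behaves like a geometric mean rather than an arithmetic mean, so it should be read as a "typical" value only. The variable names below are introduced for illustration.

```r
# Group means on the log1p scale, copied from the t-test output
log_mean_major <- 5.736781
log_mean_minor <- 5.644269

# expm1() inverts log1p(); results are in millions of streams
typical_major <- expm1(log_mean_major)
typical_minor <- expm1(log_mean_minor)

c(major = typical_major, minor = typical_minor)
```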
# Fit a Poisson GLM
glm_model <- glm(streams ~ key + mode + bpm + danceability_. + valence_. + energy_. + acousticness_. + instrumentalness_. + speechiness_.,
data = data,
family = poisson(link = "log"))
# Display summary of the model
summary(glm_model)
##
## Call:
## glm(formula = streams ~ key + mode + bpm + danceability_. + valence_. +
## energy_. + acousticness_. + instrumentalness_. + speechiness_.,
## family = poisson(link = "log"), data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.098e+01 1.346e-05 1559139 <2e-16 ***
## keyA -2.315e-01 7.357e-06 -31472 <2e-16 ***
## keyA# 1.430e-01 7.277e-06 19652 <2e-16 ***
## keyB 8.950e-02 6.743e-06 13272 <2e-16 ***
## keyC# 2.356e-01 5.912e-06 39853 <2e-16 ***
## keyD 5.536e-02 6.621e-06 8362 <2e-16 ***
## keyD# 1.349e-01 8.779e-06 15368 <2e-16 ***
## keyE 1.714e-01 7.164e-06 23930 <2e-16 ***
## keyF -3.905e-02 6.700e-06 -5828 <2e-16 ***
## keyF# 9.198e-02 6.939e-06 13256 <2e-16 ***
## keyG -1.050e-01 6.589e-06 -15934 <2e-16 ***
## keyG# -2.239e-02 6.619e-06 -3383 <2e-16 ***
## modeMinor -6.707e-02 3.127e-06 -21448 <2e-16 ***
## bpm -5.132e-04 5.209e-08 -9851 <2e-16 ***
## danceability_. -7.714e-03 1.152e-07 -66942 <2e-16 ***
## valence_. 8.146e-04 7.434e-08 10957 <2e-16 ***
## energy_. -3.268e-03 1.166e-07 -28027 <2e-16 ***
## acousticness_. -2.285e-03 7.158e-08 -31918 <2e-16 ***
## instrumentalness_. -9.062e-03 2.150e-07 -42146 <2e-16 ***
## speechiness_. -1.320e-02 1.694e-07 -77960 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 4.6822e+11 on 952 degrees of freedom
## Residual deviance: 4.4474e+11 on 933 degrees of freedom
## AIC: 4.4474e+11
##
## Number of Fisher Scoring iterations: 5
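In the fit above, the residual deviance (4.4474e+11) is hundreds of millions of times larger than its degrees of freedom (933), whereas a well-specified Poisson model would give a ratio near 1. This points to severe overdispersion, so the very small standard errors and p-values are probably far too optimistic. The following is a minimal sketch, on simulated data rather than the Spotify data, of how a dispersion check and a quasi-Poisson refit might look; it is one possible remedy, not the method used in this analysis.

```r
set.seed(7)
n <- 500
x <- rnorm(n)

# Overdispersed counts: negative binomial draws around a log-linear mean
y <- rnbinom(n, mu = exp(3 + 0.3 * x), size = 0.5)

pois_fit <- glm(y ~ x, family = poisson(link = "log"))

# Pearson dispersion estimate; values far above 1 signal overdispersion
dispersion <- sum(residuals(pois_fit, type = "pearson")^2) / df.residual(pois_fit)
dispersion

# Quasi-Poisson keeps the same coefficients but rescales the standard errors
quasi_fit <- glm(y ~ x, family = quasipoisson(link = "log"))
summary(quasi_fit)$coefficients
```

A negative binomial model is another common choice for overdispersed counts.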
Null Hypothesis (H₀): Songs with high danceability scores do not have a higher mean stream count than songs with low danceability scores.
Alternative Hypothesis (H₁): Songs with high danceability scores have a higher mean stream count.
Rationale: Listeners tend to gravitate towards songs that inspire movement and engagement. The energetic nature of high danceability songs often leads to increased interaction and enjoyment.
# Define the threshold for high danceability (you can adjust this threshold as needed)
danceability_threshold <- 70
# Create two subsets based on danceability threshold
high_danceability <- filter(data, danceability_. >= danceability_threshold)
low_danceability <- filter(data, danceability_. < danceability_threshold)
# Perform two-sample t-test
t_test_result <- t.test(high_danceability$streams, low_danceability$streams, alternative = "greater")
# Output the results
print("Two-sample t-test result:")
## [1] "Two-sample t-test result:"
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: high_danceability$streams and low_danceability$streams
## t = -2.2281, df = 942.72, p-value = 0.9869
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -141803892 Inf
## sample estimates:
## mean of x mean of y
## 472354365 553900254
The results of the Welch Two Sample t-test indicate:
Test statistic (t): -2.2281
Degrees of freedom (df): 942.72
p-value: 0.9869
The p-value is far greater than the significance level (0.05), so we fail to reject the null hypothesis. In fact, the observed difference runs in the opposite direction to the one hypothesized: the high danceability group has the lower sample mean (about 472 million streams against 554 million).
The one-sided 95% confidence interval for the difference in means is (-141,803,892, Inf). The upper bound is infinite by construction for a one-sided test; what matters is that the interval includes 0 and negative values, so the data do not support a difference greater than 0.
In summary, there is insufficient evidence to conclude that songs with higher danceability scores have a significantly higher number of streams compared to songs with lower danceability scores.
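The one-sided p-value can be reproduced from the reported test statistic and degrees of freedom, which also shows why it is so close to 1. The sketch below uses only the numbers printed in the output; the "less" direction is shown purely to explain the arithmetic and was not a hypothesis of this analysis.

```r
# Values copied from the Welch t-test output
t_stat <- -2.2281
df <- 942.72

# alternative = "greater": probability of a t value at least this large
p_greater <- pt(t_stat, df, lower.tail = FALSE)

# The opposite one-sided direction, shown only for comparison
p_less <- pt(t_stat, df, lower.tail = TRUE)

c(greater = p_greater, less = p_less)
```

Because the t statistic is negative, almost all of the distribution lies above it, so the "greater" p-value is large.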
Hypothesis: Songs with a more positive vibe and minimal spoken word sections attract more streams.
# Define thresholds for positivity and spoken word sections
positivity_threshold <- 70
spoken_word_threshold <- 10
# Create subsets based on positivity and spoken word thresholds
positive_songs <- filter(data, valence_. >= positivity_threshold)
minimal_spoken_word_songs <- filter(data, speechiness_. < spoken_word_threshold)
# Perform Wilcoxon rank-sum test
wilcox_test_result <- wilcox.test(positive_songs$streams, minimal_spoken_word_songs$streams, alternative = "greater")
# Output the results
print("Wilcoxon rank-sum test result:")
## [1] "Wilcoxon rank-sum test result:"
print(wilcox_test_result)
##
## Wilcoxon rank sum test with continuity correction
##
## data: positive_songs$streams and minimal_spoken_word_songs$streams
## W = 78278, p-value = 0.8859
## alternative hypothesis: true location shift is greater than 0
The results of the Wilcoxon rank-sum test indicate:
Test statistic (W): 78278
p-value: 0.8859
The p-value is greater than the significance level (0.05), so we fail to reject the null hypothesis.
The alternative hypothesis was a location shift greater than 0, meaning that streaming counts for high-valence songs tend to be higher than those for songs with minimal spoken word sections. The data do not support it. Note also what this test actually compares: the high-valence group against the low-speechiness group. A song can belong to both groups, so the two samples are not independent, and the result should be read with caution.
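A rank-sum test assumes two independent samples. One way to satisfy that, sketched here on simulated data with assumed column names, is to place every song in exactly one group: songs that are both positive and low in spoken word, against all other songs. This is an alternative design for illustration, not the comparison run above.

```r
set.seed(99)

# Simulated stand-in for the dataset (values are made up)
toy <- data.frame(
  valence = sample(4:97, 300, replace = TRUE),
  speechiness = sample(2:64, 300, replace = TRUE),
  streams = round(rlnorm(300, meanlog = 19, sdlog = 1))
)

# Each song falls into exactly one of the two groups
is_target <- toy$valence >= 70 & toy$speechiness < 10
target <- toy$streams[is_target]
others <- toy$streams[!is_target]

# One-sided test: do target songs tend to have higher streams?
result <- wilcox.test(target, others, alternative = "greater")
result$p.value
```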
library(tsibble)
##
## Attaching package: 'tsibble'
## The following object is masked from 'package:lubridate':
##
## interval
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
# Combine year, month, and day columns to create a Date object
data$release_date <- as.Date(paste(data$released_year, data$released_month, data$released_day, sep = "-"))
# Create a unique identifier column using row_number()
data <- mutate(data, unique_id = row_number())
# Create a tsibble object with release_date as index and unique_id as key
streaming_tsibble <- as_tsibble(data, key = unique_id, index = release_date) |>
select(unique_id, streams)
ggplot(streaming_tsibble, aes(x = release_date, y = streams)) +
geom_line() +
labs(title = "Streaming Statistics Over Time",
x = "Date",
y = "Streams")
# Re-index the data by half-year and calculate average streams
spotify_2023_halfyear <- streaming_tsibble %>%
mutate(half_year = floor_date(release_date, '6 months')) %>%
group_by(half_year) %>%
summarise(avg_streams = mean(streams))
# Plotting average streams over time with LOESS smoothing
ggplot(spotify_2023_halfyear, aes(x = half_year, y = avg_streams)) +
geom_line() +
geom_smooth(span = 0.3, color = 'blue', se = FALSE) +
labs(title = "Average Streams Over Time",
subtitle = "(by half-year)",
x = "Year",
y = "Average Streams") +
scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 18809
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 184
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 2.4205e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 32761
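The LOESS warnings about near singularities often appear when many points share the same x value. It may therefore be worth checking that the aggregation yields exactly one average per half-year; a tsibble may keep its time index when summarising, which would leave more rows than expected. A minimal base R sketch of half-year averaging on a plain data frame, with made-up values and a hypothetical helper `half_year_start`:

```r
# Made-up release dates and stream counts, for illustration only
toy <- data.frame(
  release_date = as.Date(c("2022-01-15", "2022-03-01", "2022-08-20", "2023-02-02")),
  streams = c(100, 300, 50, 10)
)

# Hypothetical helper: first day of the half-year containing each date
half_year_start <- function(d) {
  yr <- as.integer(format(d, "%Y"))
  mo <- as.integer(format(d, "%m"))
  as.Date(sprintf("%d-%02d-01", yr, ifelse(mo <= 6, 1L, 7L)))
}

toy$half_year <- half_year_start(toy$release_date)

# One average per half-year, named by the half-year start date
avg_by_half_year <- tapply(toy$streams, as.character(toy$half_year), mean)
avg_by_half_year
```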
Based on this analysis, we observe that songs released in 2022 show the highest streams, ahead of both 2023 releases and releases from earlier years. Note that the time axis is the release date, not the date of listening. As we look further back in time, the number of streams tends to decline. This decline could be attributed to several factors. One possible explanation is that Spotify’s user base consists predominantly of younger listeners who are more inclined towards the latest releases. Older generations, who may have different music preferences, might not contribute as much to the streaming numbers on Spotify, as they may still rely on CDs, cassettes, or other traditional media for listening to their favorite songs. Songs released in 2023 may also simply have had less time to accumulate streams.
# Create a new column named "rank_streams" containing the rank of streams
data$rank_streams <- rank(-data$streams, ties.method = "min")
# Sort the data frame by streams in descending order (highest first)
data <- data[order(-data$streams), ]
# Function to calculate percentage and select top n concentrated release years
calculate_top_n_concentrated <- function(df, n) {
df %>%
group_by(released_year) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
head(n) %>%
mutate(percentage = paste0(round((count / sum(count)) * 100, 1), "%"))
}
# Subset the data for top 10, top 50, and top 100
top_10 <- head(data, 10)
top_50 <- head(data, 50)
top_100 <- head(data, 100)
# Calculate top 3 concentrated release years for each subset
top_3_concentrated_top_10 <- calculate_top_n_concentrated(top_10, 3)
top_3_concentrated_top_50 <- calculate_top_n_concentrated(top_50, 3)
top_3_concentrated_top_100 <- calculate_top_n_concentrated(top_100, 3)
# Combine top 3 concentrated release years for plotting
combined_top_3_concentrated <- rbind(
data.frame(release_year = top_3_concentrated_top_10$released_year, count = top_3_concentrated_top_10$count, percentage = top_3_concentrated_top_10$percentage, group = "Top 10"),
data.frame(release_year = top_3_concentrated_top_50$released_year, count = top_3_concentrated_top_50$count, percentage = top_3_concentrated_top_50$percentage, group = "Top 50"),
data.frame(release_year = top_3_concentrated_top_100$released_year, count = top_3_concentrated_top_100$count, percentage = top_3_concentrated_top_100$percentage, group = "Top 100")
)
# Plot the bar graph
ggplot(combined_top_3_concentrated, aes(x = release_year, y = count, fill = group)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = percentage), position = position_dodge(width = 1), vjust = -0.5) +
labs(title = "Top 3 Concentrated Release Years in Top 10, Top 50, and Top 100 Songs",
x = "Release Year",
y = "Count") +
theme_minimal() +
theme(legend.position = "top")
# Map fill to the month so that the legend is actually drawn
# (the month labels below assume all 12 months occur in the data)
ggplot(data, aes(x = factor(released_month), fill = factor(released_month))) +
geom_bar(width = 1) +
coord_polar() +
labs(title = "Distribution of Release Months",
x = "Release Month",
y = "Frequency") +
scale_x_discrete(labels = month.abb) + # Month abbreviations on the axis
scale_fill_discrete(name = "Month", labels = month.abb) + # Month abbreviations in the legend
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
The release month of a song may play a role in its performance. January, May, and September were popular months.
January and May are the most popular release months, with 38.5% and 36.8%, respectively.
The percentages below sum to 100% within each subset, so they are shares among the three most common release months rather than shares of all songs in the subset.
In the Top 100:
56.5% of songs were released in January
23.9% of songs were released in September
19.6% of songs were released in March
In the Top 50:
50% of songs were released in January
29.2% of songs were released in September
20.9% of songs were released in May
In the Top 10:
33.3% of songs were released in January
33.3% of songs were released in May
33.3% of songs were released in November
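If shares of the whole subset are wanted instead, the percentages can be computed before keeping the top three months. A minimal base R sketch on made-up release months; `toy_top` and its values are assumptions for illustration.

```r
# Made-up release months for a hypothetical top-10 subset
toy_top <- data.frame(released_month = c(1, 1, 1, 5, 5, 9, 11, 3, 1, 5))

# Share of every month across the whole subset (all 12 months kept as levels)
month_share <- prop.table(table(
  factor(toy_top$released_month, levels = 1:12, labels = month.abb)
))

# Keep the three most common months after the shares are computed
top3 <- sort(month_share, decreasing = TRUE)[1:3]
round(100 * top3, 1)
```

Computed this way, the three percentages no longer need to sum to 100%, because the remaining months keep their share.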
January: New year, new songs. Spotify users seek out new music.
May: Warmer days are around the corner, it’s also the perfect time for tours and festivals to promote new music.
September: People are back at work or school and looking for new songs to help them concentrate.
The analysis delves into Spotify’s most streamed songs in 2023, aiming to uncover the factors contributing to their popularity. The dataset, comprising nearly 1,000 songs, offers insights into various attributes such as track features, popularity metrics, and release details.
Popularity Distribution: The distribution of streams exhibits a skewed pattern, typical of the “long tail” phenomenon in streaming services, indicating a few highly popular tracks and many with lower streams.
Playlist Inclusions: Songs included in a greater number of Spotify playlists tend to accumulate higher streams (Spearman correlation of 0.83), suggesting that playlist placement and popularity go hand in hand.
Audio Features: Features like bpm, danceability, and energy show limited variability, while mean instrumentalness varies across subsamples because its distribution is heavily skewed, with most songs at or near zero.
Poisson GLM: The fitted danceability coefficient is negative (-0.0077), so the model does not show danceability increasing streams. The two-sample t-test likewise found no evidence that more danceable songs are streamed more.
Time Series Analysis: Streams by release date decline for older releases, possibly due to shifting user preferences and demographics.
Release Trends: January, May, and September emerge as popular release months, with January dominating across all analyzed subsets.
Weigh Danceability Carefully: Neither the t-test nor the Poisson GLM showed more danceable songs earning more streams, so energetic rhythms alone should not be expected to increase streams.
Focus on Positive Vibes: Minimize spoken word sections and emphasize positivity in song content to enhance listener experience and engagement.
Strategic Release Timing: Capitalize on peak release months like January, May, and September to maximize visibility and audience engagement.
Playlist Placement Strategy: Aim for inclusion in diverse playlists across platforms to boost exposure and increase streaming performance.
Adapt to Shifting Trends: Stay attuned to evolving user preferences and demographic shifts to tailor marketing strategies effectively.
By aligning marketing efforts with these recommendations, Spotify and its artists can enhance their visibility, engagement, and ultimately, their streaming performance in the dynamic music streaming landscape.