2025-02-13

Objective

For this project I use a dataset from Kaggle covering the top songs streamed on Spotify during 2023. There are 943 unique entries in this analysis. The goal of the project is to see whether there is any correlation between the release date of a song and the number of streams it receives.

Data Cleaning/Munging

## Cleaning data
dataset <- read.csv("spotify-2023.csv")
str(dataset)
summary(dataset)

## Omit rows with missing values
dataset <- na.omit(dataset)

## Drop duplicate rows
dataset <- dataset[!duplicated(dataset), ]

## Convert streams to numeric (strip any non-digit characters first)
dataset$streams <- gsub("[^0-9]", "", dataset$streams)
dataset$streams <- as.numeric(dataset$streams)

## Remove the row(s) holding the maximum streams value
dataset <- dataset[dataset$streams != max(dataset$streams), ]

We begin by loading and cleaning the data, dropping rows with missing values and duplicate rows. We also convert the streams variable to numeric for later use and remove the row with the largest streams value.
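The cleaning steps can be tried on a small made-up data frame to see what each one removes. This is a minimal sketch: the `toy` data frame and its values are hypothetical and only stand in for the Kaggle file.

```r
# Toy data frame standing in for spotify-2023.csv (all values are made up).
toy <- data.frame(
  track_name = c("A", "B", "B", "C", "D"),
  streams    = c("1,200", "350", "350", "98x7", NA),
  stringsAsFactors = FALSE
)

toy <- na.omit(toy)                             # drops row D (missing streams)
toy <- toy[!duplicated(toy), ]                  # drops the second copy of row B
toy$streams <- as.numeric(gsub("[^0-9]", "", toy$streams))  # keep digits only
toy <- toy[toy$streams != max(toy$streams), ]   # drops the largest value (row A)

print(toy)
```

Note that the malformed entry "98x7" silently becomes 987 once non-digit characters are stripped, so it may be worth inspecting any values that contained unexpected characters before converting them.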

Summary of Dataset

summary(dataset$released_year)
summary(dataset$released_month)
summary(dataset$released_day)

We follow up by summarizing the release-date variables used in this analysis. Although this is streaming data from 2023, the mean release year is 2018. For months, the mean is just over 6, which is close to the middle of the year. For days, the mean is just under 14, so songs in the data skew slightly toward release dates in the first half of the month.

\[
\begin{aligned}
\text{Mean of Year} &= 2018 \\
\text{Mean of Month} &= 6.034 \\
\text{Mean of Day} &= 13.93
\end{aligned}
\]

Correlation Analysis

cor_matrix <- cor(dataset[, c("released_year",
                              "released_month",
                              "released_day",
                              "streams")])
print(cor_matrix)
##                released_year released_month released_day     streams
## released_year     1.00000000     0.07105497   0.16973315 -0.23080298
## released_month    0.07105497     1.00000000   0.07839121 -0.02493793
## released_day      0.16973315     0.07839121   1.00000000  0.01059794
## streams          -0.23080298    -0.02493793   0.01059794  1.00000000

Correlation Analysis Continued

We can see that the correlations between streams and the other variables are mostly negative. The correlation between streams and released year (-0.23) indicates that older songs tend to have more streams than newer songs. The correlation between streams and released month (-0.02) points toward songs released earlier in the year having more streams, but it is very close to zero. The only positive value, the correlation between streams and released day (0.01), points toward songs released later in the month having more streams, and it is also very close to zero.
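As a rough check on how much weight each correlation can bear, the usual t statistic for a Pearson correlation can be worked out by hand. This assumes n = 952 rows, a figure inferred from the 948 residual degrees of freedom plus the 4 estimated coefficients in the regression output later in this report; it is not stated directly.

\[
\begin{aligned}
t &= \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}} \\
t_{\text{year}} &= \frac{-0.2308\sqrt{950}}{\sqrt{1-0.2308^{2}}} \approx -7.31 \\
t_{\text{month}} &= \frac{-0.0249\sqrt{950}}{\sqrt{1-0.0249^{2}}} \approx -0.77 \\
t_{\text{day}} &= \frac{0.0106\sqrt{950}}{\sqrt{1-0.0106^{2}}} \approx 0.33
\end{aligned}
\]

Only the release-year value exceeds the approximate two-sided 5% cutoff of 1.96 in absolute value, so under this assumption the month and day correlations are not distinguishable from zero.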

Release Year vs. Streams

This plot shows us that the majority of the top streamed songs were released in the few years before 2023, namely 2019-2021. So while most of the songs are from recent years, 2023 itself had far fewer highly streamed songs than the years just before it.

Release Month vs. Streams

This shows that there is not a huge difference between the two halves of the year. We can see, however, that January is a huge month for releasing popular songs, which may be why the correlation for this relationship was negative.

Release Day vs. Streams

This seems to demonstrate something very similar to our last plot: the first day of the month seems to be important for songs with large numbers of streams.
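The plotting code is not shown in this report. As a minimal base-R sketch of one way such a chart could be drawn, the example below totals streams per release month and plots the totals as bars. The `toy` data frame is made up, and the choice of a bar chart of totals is an assumption, not necessarily what the original plots used.

```r
# Toy stand-in for the cleaned dataset (all values are made up).
set.seed(1)
toy <- data.frame(
  released_month = sample(1:12, 200, replace = TRUE),
  streams        = round(runif(200, 1e7, 2e9))
)

# Total streams per release month, then a bar chart of those totals.
by_month <- aggregate(streams ~ released_month, data = toy, FUN = sum)
barplot(by_month$streams,
        names.arg = by_month$released_month,
        xlab = "Release month", ylab = "Total streams",
        main = "Release month vs. streams (toy data)")
```

Swapping `released_month` for `released_year` or `released_day` gives the corresponding chart for the other two variables.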

Regression Analysis

dataset$released_year <- as.numeric(as.character(dataset$released_year))
lm_model <- lm(streams ~ released_year +
                 released_month + released_day, data = dataset)

Regression Results

## 
## Call:
## lm(formula = streams ~ released_year + released_month + released_day, 
##     data = dataset)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.477e+09 -3.388e+08 -1.981e+08  1.540e+08  3.160e+09 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.529e+10  3.327e+09   7.602 6.98e-14 ***
## released_year  -1.229e+07  1.651e+06  -7.444 2.19e-13 ***
## released_month -1.917e+06  5.043e+06  -0.380    0.704    
## released_day    3.209e+06  1.978e+06   1.622    0.105    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 551600000 on 948 degrees of freedom
## Multiple R-squared:  0.05596,    Adjusted R-squared:  0.05298 
## F-statistic: 18.73 on 3 and 948 DF,  p-value: 8.252e-12

Regression Analysis Continued

This regression analysis shows that each one-year increase in release year is associated with about 12,290,000 fewer streams, holding release month and release day constant. Released year also has a p-value with \(p < 0.05\), so we can say that the relationship between released year and streams is statistically significant. However, we can’t say the same for released month and released day, which both have p-values with \(p > 0.05\).
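To put the slope in more concrete terms, it can be scaled to a ten-year gap and given a rough uncertainty range using the estimate and standard error from the output above. The interval uses the normal approximation (estimate plus or minus 1.96 standard errors), which is an assumption that seems reasonable with 948 residual degrees of freedom.

\[
\begin{aligned}
\Delta\,\text{streams over 10 years} &\approx 10 \times (-1.229\times10^{7}) = -1.229\times10^{8} \\
\text{approx. 95\% CI for the slope} &\approx -1.229\times10^{7} \pm 1.96 \times 1.651\times10^{6} \\
&\approx \left(-1.55\times10^{7},\ -9.05\times10^{6}\right)
\end{aligned}
\]

So, with month and day held fixed, the model associates a song released ten years later with roughly 123 million fewer streams, and the per-year slope is plausibly somewhere between about 9 and 15.5 million fewer streams.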

More Regression

lm_model_year <- lm(streams ~ released_year, data = dataset)
summary(lm_model_year)

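Because the multiple R-squared value was so low and the p-values for release month and release day were so high, I ran the regression with just released_year. The p-value comes in under 0.05, so the relationship between streams and release year remains statistically significant. For a one-predictor model, R-squared equals the squared correlation between the two variables, so here it is about \((-0.2308)^2 \approx 0.053\), slightly below the 0.05596 of the three-predictor model.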
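One check worth running on any single-predictor model is that its R-squared matches the squared Pearson correlation between the predictor and the response. The sketch below demonstrates the identity on simulated data; the variable values and the simulated relationship are made up for illustration.

```r
# Simulated data: streams fall with release year, plus a lot of noise.
set.seed(42)
n <- 500
released_year <- sample(1990:2023, n, replace = TRUE)
streams <- 5e8 - 1e7 * (released_year - 2000) + rnorm(n, sd = 5e8)

fit <- lm(streams ~ released_year)

r2_from_lm  <- summary(fit)$r.squared         # R-squared reported by lm
r2_from_cor <- cor(released_year, streams)^2  # squared Pearson correlation

# The two values agree up to floating-point error.
print(c(lm = r2_from_lm, cor_squared = r2_from_cor))
```

With the correlation of -0.2308 reported in the matrix above, this identity gives an R-squared of roughly 0.053 for the year-only model, which is a useful figure to compare against the `summary(lm_model_year)` output.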

Insights/Interpretations

The regression results show us that release year has a statistically significant, negative association with streams. Release month and day, however, do not have a significant association with streams. The multiple R-squared value is 0.05596, which indicates that around 5.6% of the variance in streams is explained by release year, release month, and release day combined. With release year as the only predictor, R-squared is about 0.053, so release year alone explains roughly 5.3% of the variance in streams, nearly all of what the three-predictor model explains.
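A more formal way to ask whether month and day add anything beyond year is an F-test comparing the two nested models with `anova()`. The sketch below runs that comparison on simulated data in which only release year matters; the `toy` data frame and its values are made up, and this test was not part of the original analysis.

```r
# Simulated data in which streams depend on release year only.
set.seed(7)
n <- 300
toy <- data.frame(
  released_year  = sample(1990:2023, n, replace = TRUE),
  released_month = sample(1:12, n, replace = TRUE),
  released_day   = sample(1:28, n, replace = TRUE)
)
toy$streams <- 5e8 - 1e7 * (toy$released_year - 2000) + rnorm(n, sd = 5e8)

fit_year <- lm(streams ~ released_year, data = toy)
fit_full <- lm(streams ~ released_year + released_month + released_day,
               data = toy)

# F-test for the two extra predictors (month and day) in the full model.
comparison <- anova(fit_year, fit_full)
print(comparison)
```

A large p-value in the second row of the table would suggest that the year-only model fits about as well as the full one. The same call on `lm_model_year` and `lm_model` would apply the test to the real data.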

Conclusion

I originally set out to look at every component of the release date: day, month, and year. During regression analysis I quickly found that release day and release month had little to do with the number of streams. Release year, however, had a statistically significant, though modest, association with the number of streams a song receives.

Since release year alone explains only about 5.3% of the variance in streams, other factors clearly come into play. For example, I’m sure many songs from the 1930s and 1940s were never uploaded to Spotify to begin with. Songs featuring multiple major artists may also draw more streams by bringing together several fan bases.

Overall, we found a statistically significant relationship between release year and the number of streams a song receives on Spotify. However, many other variables could be examined in further analysis to help explain the large differences in streams between songs.