To begin the analysis, we load the dataset and build a proper Date column in R. The columns “released_year”, “released_month”, and “released_day” together give the release date of each song, so we combine them into a single Date object.
# Load required libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load the dataset
data <- read.csv("spotify-2023.csv")
# Combine year, month, and day columns to create a Date object
data$release_date <- as.Date(paste(data$released_year, data$released_month, data$released_day, sep = "-"))
# View the first few rows of the dataset with the new release_date column
head(data)
## track_name artist.s._name artist_count
## 1 Seven (feat. Latto) (Explicit Ver.) Latto, Jung Kook 2
## 2 LALA Myke Towers 1
## 3 vampire Olivia Rodrigo 1
## 4 Cruel Summer Taylor Swift 1
## 5 WHERE SHE GOES Bad Bunny 1
## 6 Sprinter Dave, Central Cee 2
## released_year released_month released_day in_spotify_playlists
## 1 2023 7 14 553
## 2 2023 3 23 1474
## 3 2023 6 30 1397
## 4 2019 8 23 7858
## 5 2023 5 18 3133
## 6 2023 6 1 2186
## in_spotify_charts streams in_apple_playlists in_apple_charts
## 1 147 141381703 43 263
## 2 48 133716286 48 126
## 3 113 140003974 94 207
## 4 100 800840817 116 207
## 5 50 303236322 84 133
## 6 91 183706234 67 213
## in_deezer_playlists in_deezer_charts in_shazam_charts bpm key mode
## 1 45 10 826 125 B Major
## 2 58 14 382 92 C# Major
## 3 91 14 949 138 F Major
## 4 125 12 548 170 A Major
## 5 87 15 425 144 A Minor
## 6 88 17 946 141 C# Major
## danceability_. valence_. energy_. acousticness_. instrumentalness_.
## 1 80 89 83 31 0
## 2 71 61 74 7 0
## 3 51 32 53 17 0
## 4 55 58 72 11 0
## 5 65 23 80 14 63
## 6 92 66 58 19 0
## liveness_. speechiness_. release_date
## 1 8 4 2023-07-14
## 2 10 4 2023-03-23
## 3 31 6 2023-06-30
## 4 11 15 2019-08-23
## 5 11 6 2023-05-18
## 6 8 24 2023-06-01
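Before relying on the new column, it is worth checking that every year/month/day combination formed a valid calendar date. The following is a minimal sketch on toy data, not the dataset itself: the `parts` data frame and its values are made up for illustration. It zero-pads the components, parses them with an explicit format, and lists the rows that came back as `NA`.

```r
# Toy stand-in for the released_year / released_month / released_day columns
# (hypothetical values; the third row is deliberately invalid)
parts <- data.frame(
  released_year  = c(2023, 2019, 2023),
  released_month = c(7, 8, 2),
  released_day   = c(14, 23, 30)   # 30 February does not exist
)

# Zero-pad month and day, then parse with an explicit format
date_strings <- sprintf("%04d-%02d-%02d",
                        as.integer(parts$released_year),
                        as.integer(parts$released_month),
                        as.integer(parts$released_day))
parts$release_date <- as.Date(date_strings, format = "%Y-%m-%d")

# Rows whose components do not form a valid calendar date are NA
invalid_rows <- which(is.na(parts$release_date))
print(parts)
print(invalid_rows)
```

If `invalid_rows` is empty on the real data, the combined column can be used as a time index without further cleaning.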
Now, let’s proceed with the analysis using the release_date column as our time-based variable.
We will select a response variable to analyze over time. Let’s choose the total number of streams as our response variable of interest.
# Load packages for time series tibbles and plotting
library(tsibble)
##
## Attaching package: 'tsibble'
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
# Create a unique identifier column using row_number()
data <- mutate(data, unique_id = row_number())
# Create a tsibble object with release_date as index and unique_id as key
spotify_ts <- as_tsibble(data, key = unique_id, index = release_date) |>
  select(unique_id, streams)
# Plotting total streams over time
ggplot(spotify_ts, aes(x = release_date, y = streams)) +
  geom_line() +
  labs(title = "Total Streams Over Time",
       x = "Release Date",
       y = "Total Streams")
Most of the releases are concentrated at the recent end of the time axis, so I zoom in on 2015 to 2023 to look for a clearer pattern.
# Plot streams over time, restricted to release dates from 2015 to 2023
ggplot(spotify_ts, aes(x = release_date, y = streams)) +
  geom_line() +
  labs(title = "Total Streams Over Time (2015-2023)",
       x = "Release Date",
       y = "Total Streams") +
  scale_x_date(limits = as.Date(c("2015-01-01", "2023-12-31")), date_labels = "%Y")
## Warning: Removed 122 rows containing missing values or values outside the scale range
## (`geom_line()`).
# Total number of songs before 2015
total_songs_before_2015 <- sum(data$released_year < 2015)
# Total number of songs from 2015 onwards
total_songs_from_2015 <- sum(data$released_year >= 2015)
# Oldest recorded release year for a song
oldest_release_year <- min(data$released_year)
# Print the results
cat("Total number of songs before 2015:", total_songs_before_2015, "\n")
## Total number of songs before 2015: 122
cat("Total number of songs from 2015 onwards:", total_songs_from_2015, "\n")
## Total number of songs from 2015 onwards: 831
cat("Oldest recorded release year for a song:", oldest_release_year)
## Oldest recorded release year for a song: 1930
# Calculate the count of songs for each year
song_count_per_year <- data %>%
  group_by(released_year) %>%
  summarise(song_count = n())
# Print the last ten years
tail(song_count_per_year, 10)
## # A tibble: 10 × 2
## released_year song_count
## <int> <int>
## 1 2014 13
## 2 2015 11
## 3 2016 18
## 4 2017 23
## 5 2018 10
## 6 2019 36
## 7 2020 37
## 8 2021 119
## 9 2022 402
## 10 2023 175
# Create a bar graph of song count per year
ggplot(song_count_per_year, aes(x = factor(released_year), y = song_count)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Number of Songs Released Each Year",
       x = "Year",
       y = "Number of Songs") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Extract year and month from release_date
spotify_ts <- spotify_ts %>%
  mutate(year = lubridate::year(release_date),
         month = lubridate::month(release_date, label = TRUE))
# Filter data for the years 2020 to 2023
spotify_ts_filtered <- spotify_ts %>%
  filter(year >= 2020 & year <= 2023)
# Plotting total streams for each month and year
ggplot(spotify_ts_filtered, aes(x = month, y = streams, group = year, color = factor(year))) +
  geom_line() +
  labs(title = "Total Streams for Each Month (2020-2023)",
       x = "Month",
       y = "Total Streams",
       color = "Year") +
  scale_x_discrete(labels = month.abb) +
  theme(legend.position = "top")
#install.packages("reshape2")
# Load required library
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.3.3
# Extract year and month from release_date
spotify_ts <- spotify_ts %>%
  mutate(year = lubridate::year(release_date),
         month = lubridate::month(release_date, label = TRUE))
# Filter data for the years 2015 to 2023
spotify_ts_filtered <- spotify_ts %>%
  filter(year >= 2015 & year <= 2023)
# Create a matrix of total streams for each month and year
streams_matrix <- dcast(spotify_ts_filtered, month ~ year, value.var = "streams", fun.aggregate = sum)
# Print the matrix
print(streams_matrix)
## month 2015 2016 2017 2018 2019 2020
## 1 Jan 2391337035 1522708285 9406217866 0 1825208467 3115378988
## 2 Feb 1410088830 0 3124126410 0 0 3811764039
## 3 Mar 0 0 3549285187 3955847936 991336132 7245058746
## 4 Apr 571386359 2870580716 1116995633 0 1065580332 1180094974
## 5 May 2771792003 2591224264 0 1374581173 5286949797 403097450
## 6 Jun 165484133 380319238 5074259994 0 4624539670 2596387718
## 7 Jul 370068639 0 1047101291 0 726837877 3947585427
## 8 Aug 0 1227532697 3377978123 0 4085161792 1692897992
## 9 Sep 0 6823245701 0 3281711063 2146648182 3869160737
## 10 Oct 1127468248 0 683666898 2808096550 5104223531 3355401039
## 11 Nov 2123309722 4827329563 2710197180 3610285668 8745052442 1739157851
## 12 Dec 0 0 1367810478 0 3269394359 1747746896
## 2021 2022 2023
## 1 3442726321 7254514460 5465073812
## 2 0 6673025355 5317407386
## 3 5418303643 10222740305 6293413679
## 4 6246562311 9238399475 3166914154
## 5 7167588076 24605957716 2738201645
## 6 8765449364 8650038759 2246408110
## 7 8653782106 10133593143 581065318
## 8 5548916147 6570256277 0
## 9 11309054926 5232918178 0
## 10 6603312096 11248081403 0
## 11 6837053899 5862709274 0
## 12 3815354150 10710143617 0
# Plotting total streams over time
ggplot(spotify_ts, aes(x = release_date, y = streams)) +
  geom_line() +
  labs(title = "Total Streams Over Time (2020-2023)",
       x = "Release Date",
       y = "Total Streams") +
  scale_x_date(limits = as.Date(c("2020-01-01", "2023-12-31")), date_labels = "%Y")
## Warning: Removed 220 rows containing missing values or values outside the scale range
## (`geom_line()`).
We’ll perform linear regression to detect any upward or downward trends in the total number of streams.
# Linear regression to detect trends
lm_model <- lm(streams ~ release_date, data = spotify_ts)
# Summary of linear regression
summary(lm_model)
##
## Call:
## lm(formula = streams ~ release_date, data = spotify_ts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.434e+09 -3.387e+08 -1.981e+08 1.426e+08 3.204e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1068744101 80261931 13.316 < 2e-16 ***
## release_date -31216 4399 -7.095 2.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 552700000 on 951 degrees of freedom
## Multiple R-squared: 0.05028, Adjusted R-squared: 0.04928
## F-statistic: 50.35 on 1 and 951 DF, p-value: 2.518e-12
Based on the provided linear regression model:
\[ \text{streams} = 1.069 \times 10^9 - 31{,}216 \times \text{release\_date} \]
Interpretation of Coefficients:
The intercept (1.069 * 10^9) is the estimated number of streams when release_date equals zero. Because R stores a Date as the number of days since 1970-01-01, zero corresponds to a release date of 1 January 1970, so the intercept is simply the fitted value for a song released on that day rather than a baseline of general interest.
The coefficient for release_date (-31,216) is the average change in streams per one-day increase in the release date. It is negative: a song released one day later is associated with approximately 31,216 fewer streams, so more recent releases tend to have fewer streams than older ones in this dataset.
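The units behind these two coefficients can be checked with a short sketch. The intercept and slope are copied from the regression summary; the helper `fitted_streams()`, the per-year rescaling (using an average of 365.25 days per year), and the example date are assumptions added for illustration.

```r
# R stores a Date as the number of days since 1970-01-01
origin_value <- as.numeric(as.Date("1970-01-01"))
one_day_step <- as.numeric(as.Date("2023-07-15")) -
  as.numeric(as.Date("2023-07-14"))

# Coefficients copied from the summary(lm_model) output
intercept     <- 1068744101
slope_per_day <- -31216

# Rescale the slope from "per day" to "per year" (assumes 365.25 days per year)
slope_per_year <- slope_per_day * 365.25

# Hypothetical helper: fitted streams for a given release date
fitted_streams <- function(date) {
  intercept + slope_per_day * as.numeric(as.Date(date))
}

print(origin_value)
print(one_day_step)
print(slope_per_year)
print(fitted_streams("2023-01-01"))
```

Expressed per year, the slope is roughly 11.4 million fewer streams for each year later that a song was released, which is easier to relate to the yearly plots than a per-day figure.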
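Residuals:
The residuals range from about -1.43 * 10^9 to 3.20 * 10^9, with a median of about -1.98 * 10^8. The negative median combined with the much larger positive maximum points to a right-skewed distribution: a small number of songs have far more streams than the fitted line predicts.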
Model Fit:
The adjusted R-squared value (0.04928) suggests that the model explains approximately 4.9% of the variability in the number of streams. While statistically significant (p-value < 0.05), the model has limited explanatory power.
The F-statistic tests the overall significance of the model. With a p-value of 2.518e-12 (very close to zero), it indicates that the model is statistically significant in predicting the number of streams.
Overall, while the model is statistically significant, its practical significance may be limited given the low R-squared value.
We’ll use smoothing techniques to detect seasonality in the total number of streams over time.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:tsibble':
##
## interval
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
# Re-index the data by half-year and calculate average streams
spotify_2023_halfyear <- spotify_ts %>%
  mutate(half_year = floor_date(release_date, '6 months')) %>%
  group_by(half_year) %>%
  summarise(avg_streams = mean(streams))
# Plotting average streams over time with LOESS smoothing
ggplot(spotify_2023_halfyear, aes(x = half_year, y = avg_streams)) +
  geom_line() +
  geom_smooth(span = 0.3, color = 'blue', se = FALSE) +
  labs(title = "Average Streams Over Time",
       subtitle = "(by half-year)",
       x = "Year",
       y = "Average Streams") +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 18809
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 184
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 2.4205e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 32761
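The half-year averaging step can be sketched in base R on toy data. The values and the helper name `floor_half_year()` are made up for illustration; the point is that grouping on the floored date yields exactly one row per half-year. If `summarise()` on a tsibble retains the original `release_date` index (which is my understanding of its behaviour, not something the output above confirms), the tsibble version may return more than one row per half-year. In that case, converting to a plain data frame first or using `tsibble::index_by()` would be the alternatives to check.

```r
# Toy data (made-up values): release dates and stream counts
songs <- data.frame(
  release_date = as.Date(c("2022-01-15", "2022-05-20", "2022-08-01",
                           "2022-12-31", "2023-03-10")),
  streams = c(100, 300, 200, 400, 500)
)

# Hypothetical helper: floor each date to the start of its half-year
# (1 January or 1 July)
floor_half_year <- function(d) {
  lt <- as.POSIXlt(d)
  first_month <- ifelse(lt$mon < 6, 1, 7)   # lt$mon is 0-based
  as.Date(sprintf("%04d-%02d-01", lt$year + 1900, first_month))
}
songs$half_year <- floor_half_year(songs$release_date)

# One row per half-year: mean streams of the songs released in it
halfyear_avg <- aggregate(streams ~ half_year, data = songs, FUN = mean)
print(halfyear_avg)
```

With one observation per half-year, the x values passed to LOESS are distinct and evenly spaced, which is the situation the smoother is designed for.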
Based on this analysis, we observe that the year 2022 had the highest number of streams compared to the previous years. However, as we look further back in time, the number of streams tends to decline. This decline could be attributed to several factors. One possible explanation is that Spotify’s user base predominantly consists of younger generations who are more inclined towards listening to the latest music releases. Older generations, who may have different music preferences, might not contribute as much to the streaming numbers on Spotify, as they may still rely on CDs, cassettes, or other traditional media for listening to their favorite songs.
Overall, there doesn’t seem to be a clear cyclic pattern present in the data that repeats at regular intervals. Instead, the trends in streaming numbers reflect the evolving landscape of music consumption and preferences among listeners.
# Load required library
library(forecast)
## Warning: package 'forecast' was built under R version 4.3.3
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
# Plot ACF for streams
ggAcf(spotify_ts$streams)
# Plot PACF for streams
ggPacf(spotify_ts$streams)
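The ACF and PACF are defined for observations taken at equally spaced time steps, whereas `spotify_ts$streams` holds one value per song. A minimal sketch of one way to obtain a regular series first is shown below: it sums streams per release month, fills months without releases with zero, and computes the autocorrelation with `stats::acf()`. The data are simulated and every name in the block is hypothetical; treating a month without releases as zero is also an assumption that may not suit every analysis.

```r
# Toy data (simulated values): songs with release dates and stream counts
set.seed(1)
all_days <- seq(as.Date("2021-01-01"), as.Date("2022-12-31"), by = "day")
songs <- data.frame(
  release_date = sample(all_days, size = 300, replace = TRUE),
  streams = round(runif(300, min = 1e6, max = 1e9))
)

# Total streams per release month; months with no releases get 0
songs$month <- as.Date(format(songs$release_date, "%Y-%m-01"))
all_months <- seq(min(songs$month), max(songs$month), by = "month")
monthly_total <- vapply(
  seq_along(all_months),
  function(i) sum(songs$streams[songs$month == all_months[i]]),
  numeric(1)
)

# Regular monthly series: 12 observations per year
start_year  <- as.integer(format(all_months[1], "%Y"))
start_month <- as.integer(format(all_months[1], "%m"))
monthly_ts <- ts(monthly_total, start = c(start_year, start_month), frequency = 12)

# Autocorrelation at lags 0 to 12 months, computed without drawing the plot
acf_values <- acf(monthly_ts, lag.max = 12, plot = FALSE)
print(acf_values)
```

On a series built this way, a spike at lag 12 would point to yearly seasonality, which is the kind of pattern the smoothing step was looking for.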