Week 12 | Data Dive — Time-based Data

To begin with the data analysis, let’s first load the dataset and convert the relevant time-based columns into Date format in R. In this dataset, the columns “released_year”, “released_month”, and “released_day” together represent the release date of each song. We will combine these columns to create a Date object.

# Load required libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Load the dataset
data <- read.csv("spotify-2023.csv")

# Combine year, month, and day columns to create a Date object
data$release_date <- as.Date(paste(data$released_year, data$released_month, data$released_day, sep = "-"))

# View the first few rows of the dataset with the new release_date column
head(data)

##                            track_name    artist.s._name artist_count
## 1 Seven (feat. Latto) (Explicit Ver.)  Latto, Jung Kook            2
## 2                                LALA       Myke Towers            1
## 3                             vampire    Olivia Rodrigo            1
## 4                        Cruel Summer      Taylor Swift            1
## 5                      WHERE SHE GOES         Bad Bunny            1
## 6                            Sprinter Dave, Central Cee            2
##   released_year released_month released_day in_spotify_playlists
## 1          2023              7           14                  553
## 2          2023              3           23                 1474
## 3          2023              6           30                 1397
## 4          2019              8           23                 7858
## 5          2023              5           18                 3133
## 6          2023              6            1                 2186
##   in_spotify_charts   streams in_apple_playlists in_apple_charts
## 1               147 141381703                 43             263
## 2                48 133716286                 48             126
## 3               113 140003974                 94             207
## 4               100 800840817                116             207
## 5                50 303236322                 84             133
## 6                91 183706234                 67             213
##   in_deezer_playlists in_deezer_charts in_shazam_charts bpm key  mode
## 1                  45               10              826 125   B Major
## 2                  58               14              382  92  C# Major
## 3                  91               14              949 138   F Major
## 4                 125               12              548 170   A Major
## 5                  87               15              425 144   A Minor
## 6                  88               17              946 141  C# Major
##   danceability_. valence_. energy_. acousticness_. instrumentalness_.
## 1             80        89       83             31                  0
## 2             71        61       74              7                  0
## 3             51        32       53             17                  0
## 4             55        58       72             11                  0
## 5             65        23       80             14                 63
## 6             92        66       58             19                  0
##   liveness_. speechiness_. release_date
## 1          8             4   2023-07-14
## 2         10             4   2023-03-23
## 3         31             6   2023-06-30
## 4         11            15   2019-08-23
## 5         11             6   2023-05-18
## 6          8            24   2023-06-01
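
As a small illustration of the date-building step, the sketch below applies the same paste() and as.Date() pattern to a hypothetical three-row data frame. The toy values are assumptions for illustration, not rows from spotify-2023.csv. Passing an explicit format makes any combination that is not a real calendar date come back as NA, which is worth checking before the column is used as a time index.

```r
# Hypothetical toy rows standing in for released_year / released_month / released_day
toy <- data.frame(
  released_year  = c(2023, 2019, 2023),
  released_month = c(7, 8, 2),
  released_day   = c(14, 23, 30)   # 30 February is not a real date
)

# Same pattern as above, with the format stated explicitly
toy$release_date <- as.Date(
  paste(toy$released_year, toy$released_month, toy$released_day, sep = "-"),
  format = "%Y-%m-%d"
)
print(toy)

# Count rows whose components did not form a valid calendar date
n_invalid <- sum(is.na(toy$release_date))
print(n_invalid)
```

Running the same is.na() check on data$release_date would show whether any row of the real dataset failed to parse.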

Now, let’s proceed with the analysis using the release_date column as our time-based variable.

Analyzing Spotify Data Over Time

Plotting Data Over Time

We will select a response variable to analyze over time. Let’s choose the total number of streams as our response variable of interest.

# Load tsibble (time series tibbles) and ggplot2 (plotting)
library(tsibble)

## 
## Attaching package: 'tsibble'

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.3

# Create a unique identifier column using row_number()
data <- mutate(data, unique_id = row_number())

# Create a tsibble object with release_date as index and unique_id as key
spotify_ts <- as_tsibble(data, key = unique_id, index = release_date) |>
  select(unique_id, streams)

# Plotting total streams over time
ggplot(spotify_ts, aes(x = release_date, y = streams)) +
  geom_line() +
  labs(title = "Total Streams Over Time",
       x = "Release Date",
       y = "Total Streams")

Most of the data is concentrated at the recent end of the time axis, so let's zoom in on that range to look for a clearer pattern.

# Same plot, restricted to release dates from 2015 through 2023
ggplot(spotify_ts, aes(x = release_date, y = streams)) +
  geom_line() +
  labs(title = "Total Streams Over Time (2015-2023)",
       x = "Release Date",
       y = "Total Streams") +
  scale_x_date(limits = as.Date(c("2015-01-01", "2023-12-31")), date_labels = "%Y")

## Warning: Removed 122 rows containing missing values or values outside the scale range
## (`geom_line()`).

# Total number of songs before 2015
total_songs_before_2015 <- sum(data$released_year < 2015)

# Total number of songs from 2015 onwards
total_songs_from_2015 <- sum(data$released_year >= 2015)

# Oldest recorded release year for a song
oldest_release_year <- min(data$released_year)

# Print the results
cat("Total number of songs before 2015:", total_songs_before_2015, "\n")

## Total number of songs before 2015: 122

cat("Total number of songs from 2015 onwards:", total_songs_from_2015, "\n")

## Total number of songs from 2015 onwards: 831

cat("Oldest recorded release year for a song:", oldest_release_year)

## Oldest recorded release year for a song: 1930

# Calculate the count of songs for each year
song_count_per_year <- data %>%
  group_by(released_year) %>%
  summarise(song_count = n())

# Print the result
tail(song_count_per_year,10)

## # A tibble: 10 × 2
##    released_year song_count
##            <int>      <int>
##  1          2014         13
##  2          2015         11
##  3          2016         18
##  4          2017         23
##  5          2018         10
##  6          2019         36
##  7          2020         37
##  8          2021        119
##  9          2022        402
## 10          2023        175

# Create a bar graph of song count per year
ggplot(song_count_per_year, aes(x = factor(released_year), y = song_count)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Number of Songs Released Each Year",
       x = "Year",
       y = "Number of Songs") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Extract year and month from release_date
spotify_ts <- spotify_ts %>%
  mutate(year = lubridate::year(release_date),
         month = lubridate::month(release_date, label = TRUE))

# Filter data for the years 2020 to 2023
spotify_ts_filtered <- spotify_ts %>%
  filter(year >= 2020 & year <= 2023)

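# Sum streams per release month and year so that each year is a single line
# (as_tibble() drops the tsibble index before aggregating)
monthly_totals <- spotify_ts_filtered %>%
  as_tibble() %>%
  group_by(year, month) %>%
  summarise(streams = sum(streams), .groups = "drop")

# Plotting total streams for each month and year
ggplot(monthly_totals, aes(x = month, y = streams, group = year, color = factor(year))) +
  geom_line() +
  labs(title = "Total Streams for Each Month (2020-2023)",
       x = "Month",
       y = "Total Streams",
       color = "Year") +
  theme(legend.position = "top")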

#install.packages("reshape2")

# Load required library
library(reshape2)

## Warning: package 'reshape2' was built under R version 4.3.3

# Filter data for the years 2015 to 2023
spotify_ts_filtered <- spotify_ts %>%
  filter(year >= 2015 & year <= 2023)

# Create a matrix of total streams for each month and year
streams_matrix <- dcast(spotify_ts_filtered, month ~ year, value.var = "streams", fun.aggregate = sum)

# Print the matrix
print(streams_matrix)

##    month       2015       2016       2017       2018       2019       2020
## 1    Jan 2391337035 1522708285 9406217866          0 1825208467 3115378988
## 2    Feb 1410088830          0 3124126410          0          0 3811764039
## 3    Mar          0          0 3549285187 3955847936  991336132 7245058746
## 4    Apr  571386359 2870580716 1116995633          0 1065580332 1180094974
## 5    May 2771792003 2591224264          0 1374581173 5286949797  403097450
## 6    Jun  165484133  380319238 5074259994          0 4624539670 2596387718
## 7    Jul  370068639          0 1047101291          0  726837877 3947585427
## 8    Aug          0 1227532697 3377978123          0 4085161792 1692897992
## 9    Sep          0 6823245701          0 3281711063 2146648182 3869160737
## 10   Oct 1127468248          0  683666898 2808096550 5104223531 3355401039
## 11   Nov 2123309722 4827329563 2710197180 3610285668 8745052442 1739157851
## 12   Dec          0          0 1367810478          0 3269394359 1747746896
##           2021        2022       2023
## 1   3442726321  7254514460 5465073812
## 2            0  6673025355 5317407386
## 3   5418303643 10222740305 6293413679
## 4   6246562311  9238399475 3166914154
## 5   7167588076 24605957716 2738201645
## 6   8765449364  8650038759 2246408110
## 7   8653782106 10133593143  581065318
## 8   5548916147  6570256277          0
## 9  11309054926  5232918178          0
## 10  6603312096 11248081403          0
## 11  6837053899  5862709274          0
## 12  3815354150 10710143617          0
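
In the matrix above, a 0 means that no song in the dataset was released in that month, because the sum of an empty group is 0. It does not mean that songs released then had zero streams. A base-R sketch on hypothetical toy data shows one way to keep that distinction visible: tapply() returns NA for empty month/year cells instead of 0. The toy data frame and its values are assumptions for illustration.

```r
# Hypothetical toy data: no song released in Feb 2021
toy <- data.frame(
  year    = c(2021, 2021, 2022, 2022, 2022),
  month   = factor(c("Jan", "Jan", "Jan", "Feb", "Feb"), levels = c("Jan", "Feb")),
  streams = c(100, 50, 200, 30, 70)
)

# Month-by-year totals; cells with no songs stay NA rather than 0
totals <- tapply(toy$streams, list(toy$month, toy$year), sum)
print(totals)
```

Whether NA or 0 is the better fill depends on the next step: 0 is convenient for stacked totals, while NA avoids drawing a misleading dip to zero in a line chart.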

# Same plot, restricted to release dates from 2020 through 2023
ggplot(spotify_ts, aes(x = release_date, y = streams)) +
  geom_line() +
  labs(title = "Total Streams Over Time (2020-2023)",
       x = "Release Date",
       y = "Total Streams") +
  scale_x_date(limits = as.Date(c("2020-01-01", "2023-12-31")), date_labels = "%Y")

## Warning: Removed 220 rows containing missing values or values outside the scale range
## (`geom_line()`).

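Insights:

The dataset is dominated by recent releases: 831 of the 953 songs were released in 2015 or later, and 402 of them in 2022 alone. In the monthly matrix, May 2022 has the largest total (about 24.6 billion streams), and the zeros from August 2023 onward mean that the dataset contains no songs released in those months.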

Linear Regression to Detect Trends

We’ll perform linear regression to detect any upward or downward trends in the total number of streams.

# Linear regression to detect trends
lm_model <- lm(streams ~ release_date, data = spotify_ts)

# Summary of linear regression
summary(lm_model)

## 
## Call:
## lm(formula = streams ~ release_date, data = spotify_ts)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.434e+09 -3.387e+08 -1.981e+08  1.426e+08  3.204e+09 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1068744101   80261931  13.316  < 2e-16 ***
## release_date     -31216       4399  -7.095 2.52e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 552700000 on 951 degrees of freedom
## Multiple R-squared:  0.05028,    Adjusted R-squared:  0.04928 
## F-statistic: 50.35 on 1 and 951 DF,  p-value: 2.518e-12

Based on the provided linear regression model:

\[ \text{streams} = 1.069 \times 10^{9} - 31{,}216 \times \text{release\_date} \]

Interpretation of Coefficients:

- The intercept (1.069 * 10^9) is the predicted number of streams when release_date equals zero. R stores a Date as the number of days since 1970-01-01, so this is the prediction for a song released on that day. It anchors the fitted line but is of little interest on its own.
- The coefficient for release_date (-31,216) is the average change in streams per one-day increase in the release date. A song released one day later is predicted to have about 31,216 fewer streams, so newer releases in this dataset tend to have fewer total streams than older ones. This compares songs with different release dates; it does not describe how a single song's streams change after its release.

Residuals:

- Residuals are the differences between the observed and predicted values of the response variable (streams), and they indicate how well the model fits the data. Here they range from -1.434e+09 to 3.204e+09 with a median of -1.981e+08, so individual songs sit far from the fitted line and the distribution is skewed toward large positive residuals.

Model Fit:

- The adjusted R-squared value (0.04928) means that the model explains approximately 4.9% of the variability in the number of streams. Although the slope is statistically significant (p-value < 0.05), the model has limited explanatory power.
- The F-statistic (50.35 on 1 and 951 degrees of freedom) tests the overall significance of the model. Its p-value of 2.518e-12 indicates a statistically significant linear association between release date and streams. With a single predictor, this is the same test as the t-test on the slope.

Overall, while the model is statistically significant, its practical significance may be limited given the low R-squared value.
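
To make the units of the slope concrete, the sketch below first shows how R represents a Date numerically, then fits the same kind of model to hypothetical toy data built so that streams fall by exactly 2 per day of later release, and finally rescales the slope reported above to a per-year figure. The toy data frame is an assumption for illustration. The factor of 365.25 days per year is also an assumption, chosen for the conversion.

```r
# R stores a Date as the number of days since 1970-01-01
as.numeric(as.Date("1970-01-01"))   # 0
as.numeric(as.Date("1970-01-02"))   # 1

# Hypothetical toy data: streams fall by exactly 2 per day of later release
toy <- data.frame(release_date = as.Date("2020-01-01") + c(0, 10, 20, 30, 40))
toy$streams <- 100000 - 2 * as.numeric(toy$release_date)

# The fitted slope is expressed in streams per day
fit <- lm(streams ~ release_date, data = toy)
slope_per_day <- unname(coef(fit)["release_date"])
print(slope_per_day)                 # approximately -2

# Rescale the slope reported in the summary above to a per-year figure
reported_slope_per_day  <- -31216
reported_slope_per_year <- reported_slope_per_day * 365.25
print(reported_slope_per_year)       # about -11.4 million streams per year
```

Read this way, the model predicts roughly 11.4 million fewer streams for each year later that a song was released.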

Smoothing to Detect Seasonality

We’ll use smoothing techniques to detect seasonality in the total number of streams over time.

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:tsibble':
## 
##     interval

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

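# Re-index the data by half-year and calculate average streams.
# as_tibble() drops the tsibble index first; summarise() on a tsibble keeps the
# index, which would leave one row per release_date instead of one per half-year.
spotify_2023_halfyear <- spotify_ts %>%
  as_tibble() %>%
  mutate(half_year = floor_date(release_date, '6 months')) %>%
  group_by(half_year) %>%
  summarise(avg_streams = mean(streams))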

# Plotting average streams over time with LOESS smoothing
ggplot(spotify_2023_halfyear, aes(x = half_year, y = avg_streams)) +
  geom_line() +
  geom_smooth(span = 0.3, color = 'blue', se = FALSE) +
  labs(title = "Average Streams Over Time",
       subtitle = "(by half-year)",
       x = "Year",
       y = "Average Streams") +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  theme_minimal()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 18809

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 184

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 2.4205e-15

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 32761

Based on this analysis, songs released in 2022 account for the most total streams, and the totals shrink for earlier release years. Much of that decline reflects the composition of the dataset rather than listening behavior: it contains 402 songs released in 2022 but only 10 to 37 per year for 2015 through 2020, so earlier years have fewer songs to sum over. On a per-song basis the regression above points the other way, since its negative slope means that older releases in this dataset tend to have more streams each. One possible contributing explanation for the recent-release totals is that Spotify’s user base skews toward younger listeners who favor the latest music, while listeners who prefer older music may rely more on CDs, cassettes, or other traditional media. The data here cannot confirm this.

Overall, there doesn’t seem to be a clear cyclic pattern present in the data that repeats at regular intervals. Instead, the trends in streaming numbers reflect the evolving landscape of music consumption and preferences among listeners.
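
The half-year averaging used above can be sketched in base R as follows. The helper half_year_start() and the toy data frame are hypothetical names and values introduced for illustration. The point of the sketch is that the aggregated table should contain exactly one row per half-year before it is passed to a smoother.

```r
# Hypothetical helper: first day of the half-year (1 Jan or 1 Jul) containing d
half_year_start <- function(d) {
  lt <- as.POSIXlt(d)
  start_month <- ifelse(lt$mon < 6, 1, 7)   # lt$mon is 0-based (0 = January)
  as.Date(sprintf("%d-%02d-01", lt$year + 1900, start_month))
}

# Hypothetical toy data: two songs share the first half of 2022
toy <- data.frame(
  release_date = as.Date(c("2022-01-10", "2022-03-05", "2022-09-15", "2023-02-01")),
  streams      = c(100, 300, 500, 700)
)
toy$half_year <- half_year_start(toy$release_date)

# One row per half-year, holding the mean streams of the songs released in it
half_year_avg <- aggregate(streams ~ half_year, data = toy, FUN = mean)
print(half_year_avg)
```

If the aggregated table has more rows than distinct half-years, the smoother sees many repeated x values, which is one common cause of LOESS singularity warnings.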

# Load required library
library(forecast)

## Warning: package 'forecast' was built under R version 4.3.3

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

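# Plot ACF for streams
# Note: ggAcf() treats the vector as equally spaced observations in row order,
# so the lags here count rows (songs), not days or months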
ggAcf(spotify_ts$streams)

# Plot PACF for streams
ggPacf(spotify_ts$streams)
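
For comparison, a seasonal series sampled at regular intervals leaves a recognizable signature in the ACF: the autocorrelation is strongly positive at the seasonal lag and negative half a cycle earlier. The sketch below uses base R's acf() on a simulated monthly series with a 12-month cycle. The simulated data, the noise level, and the seed are assumptions for illustration, not Spotify data.

```r
set.seed(1)                                   # arbitrary seed for reproducibility
n_months    <- 120
month_index <- seq_len(n_months)

# Simulated monthly series: 12-month sine cycle plus noise (not Spotify data)
simulated <- sin(2 * pi * month_index / 12) + rnorm(n_months, sd = 0.2)

# Autocorrelations for lags 0 to 24, without drawing the plot
r <- acf(simulated, lag.max = 24, plot = FALSE)$acf[, 1, 1]

# r[1] is lag 0, so lag k is stored at r[k + 1]
print(round(r[c(1, 7, 13)], 2))               # lags 0, 6 and 12
```

A similar check on the Spotify data would first require aggregating streams to a regular interval, such as one value per release month.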