A. INTRODUCTION

B. ANALYSIS

C. CONCLUSION




A. INTRODUCTION

“We can get creative: we can start to send out messages in a bottle: we can sing, write poetry, produce books and blogs, activities stemming from the realisation that people around us won’t ever fully get us but that others – separated across time and space – might just. The history of art is the record of people who couldn’t find anyone in the vicinity to talk to.” —Alain de Botton

The quote above might represent what I felt when I started producing music, among other things. But this time, I’m making myself face the dire state of my musical project in the form of data. Specifically, the data I’ve got from my Spotify for Artists account. Maybe I can learn something from it, perhaps by answering questions such as “On what day of the week do I have the most streams?” or “How many days after a release date do song streams plummet?”

There are two CSV files. The first captures a timeline of streams, listeners, and followers, while the other offers similar metrics grouped by song title. They’re pretty bare-bones, so I might need to do some feature engineering.

1. Identify Tasks

Almost halfway through writing this notebook, I realized I was going nowhere because I didn’t have any specific direction. So here I am, writing a list of questions I want to investigate:

  1. On what day of the week do I have the most streams, or listeners, or increase in followers?

  2. How many days after a release date do song streams plummet?

  3. Does releasing music as an EP or as a Single make a difference in streams?

  4. Which metric has the most effect on the increase in followers?

2. Data Preparation

First, let’s load some tools to look into the data.
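The loading step itself isn’t shown, so here is a minimal sketch of it in base R. The file is a stand-in (a temporary CSV with made-up rows), and the column names date, listeners, streams, and followers are assumed from how timelines is used later; the real export may differ.

```r
# Hypothetical stand-in for the Spotify for Artists timeline export:
# a few made-up rows written to a temporary CSV file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("date,listeners,streams,followers",
             "2020-01-09,2,5,0",
             "2020-01-10,1,3,1",
             "2020-01-11,0,0,1"),
           csv_path)

# Read the file and parse the date column up front.
timelines_demo <- read.csv(csv_path, stringsAsFactors = FALSE)
timelines_demo$date <- as.Date(timelines_demo$date)

str(timelines_demo)
```

The chunks that follow also need dplyr, tidyr, ggplot2, and lubridate attached with library(), since they use the pipe, replace_na(), ggplot(), and yday().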

2.a. Timelines

I started releasing music on Spotify in 2020. Since 2022 hasn’t ended yet, let’s cut the range off at 2020 to 2021.

# I started releasing Syh project in 2020
# filter the date
timelines <- timelines %>% 
  mutate(date = as.Date(date)) %>% 
  filter(date >= "2020-01-01",
         date <= "2021-12-31") %>%
  arrange(date)

Let’s visualize this data.

ggplot(timelines, aes(x = date, y = followers))+
  geom_line(aes(color="Total Followers"))+
  geom_point(aes(y=streams, 
                 color="Streams"
                 )
             )+
  geom_point(aes(y=listeners, 
                 color="Listeners"
                 ),
             )+
  theme(axis.title.y = element_blank(),
        legend.position = "top")+
  scale_color_manual(name="Spotify Metrics",
                     breaks=c("Total Followers","Streams","Listeners"),
                     values=c("Total Followers"="red","Streams"="green", "Listeners"="blue"))


As we can see, the red line that represents followers rises cumulatively, while the green and blue dots that represent streams and listeners are not cumulative. I need to create cumulative versions of them.



timelines_edit <- timelines %>% 
  mutate(cum_listeners = cumsum(listeners),
         cum_streams = cumsum(streams),
         day_of_year = yday(date))
head(timelines_edit, 3)

Wait. I don’t remember releasing music on the first day of the year. Using day_of_year is a bit off here, though it may still be useful. I think I need to add a column that counts the number of days since the release date of the first song, not since the first day of the year. To do that, I first need to find the date from which to start counting.

# to count day since the first day of release
start_count_date <- recordings_last5years %>%
  filter(release_date == min(release_date)) %>%  pull(release_date) %>% unique()
start_count_date 
[1] "2020-01-09"

Then use that date to work out how many rows to lag by when creating count_days.

n_for_lag <- timelines_edit %>% 
  filter(date == start_count_date) %>%
  pull(day_of_year) %>%
  as.integer() - 1L  # number of days before the first release day
n_for_lag
[1] 8

Then we use lag() to start counting. Don’t forget to replace the NA with zero.

timelines_edit <- timelines_edit %>% 
  mutate(count_days = lag(seq_along(date), n=n_for_lag)) %>%
  replace_na(list(count_days =0))

head(timelines_edit, 20)
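The same day counter can be cross-checked with plain date arithmetic instead of lag(). This is a base R sketch on a made-up twelve-day range, not the notebook’s own code; only start_count_date is taken from above.

```r
# Made-up date range that covers the first release
dates <- seq(as.Date("2020-01-01"), as.Date("2020-01-12"), by = "day")
start_count_date <- as.Date("2020-01-09")  # first release date found above

# Days elapsed since the first release, counting the release day as day 1.
# Days before the release are clamped to 0, like the replace_na() step.
count_days <- pmax(as.integer(dates - start_count_date) + 1L, 0L)

data.frame(date = dates, count_days = count_days)
```

If the dates in the real timeline have no gaps, this should agree with the lagged sequence, and it keeps working even if a day is missing from the export.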

Let’s visualise the new features.

  ggplot(timelines_edit, aes(x = date, y = followers))+
  geom_line(aes(color="Total Followers"))+
  geom_point(aes(y=cum_streams, 
                 color="Cumulative Streams"
                 )
             )+
  geom_point(aes(y=cum_listeners, 
                 color="Cumulative Listeners"
                 ),
             )+
  geom_line(aes(y=count_days,
                color="Cumulative Day Counts"),
            )+
  theme(axis.title.y = element_blank(),
        legend.position = "right")+
  scale_color_manual(name="Cumulative Spotify Metrics",
                     breaks=c("Total Followers",
                              "Cumulative Streams",
                              "Cumulative Listeners",
                              "Cumulative Day Counts"),
                     values=c("Total Followers"="red",
                              "Cumulative Streams"="green", 
                              "Cumulative Listeners"="blue",
                              "Cumulative Day Counts"="black"))

Before getting too carried away into what cumulative metrics might say, let’s take a step back and extract some more features from the date column.

timelines_edit <- timelines_edit %>%
  mutate(day_of_week = wday(date, label = TRUE, abbr = FALSE),
         )

I’m a bit bothered that the followers feature is cumulative only. I need a new feature that shows when the number of followers increased, which I can get by subtracting the lagged version of the followers column from itself using lag(). Because lag() creates an NA, I also need to impute it with 0 immediately.

timelines_edit <- timelines_edit %>%
  mutate(followers_diff = followers - lag(followers)) %>%
  mutate_if(is.numeric, ~replace(., is.na(.),0))
head(timelines_edit)
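As a quick check on the subtraction, base R’s diff() gives the same day-over-day change. The follower counts here are made up for illustration.

```r
followers <- c(0, 0, 1, 1, 3)   # made-up cumulative follower counts

# diff() returns one value fewer than its input, so pad the first day with 0,
# which plays the same role as replacing the NA that lag() leaves behind
followers_diff <- c(0, diff(followers))
followers_diff
```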

If I want to see how some features in timelines_edit relate to the songs I released, I need to prepare the recordings_last5years dataframe first.

2.b. Recordings

Some caveats about this recording data:

  1. On the Spotify for Artists website where I downloaded the CSV for this particular data, the only options for adjusting the cut-off date were: Last 24 Hours, Last 7 Days, Last 28 Days, Since 2015, and All time. So the streams and listeners metrics are only as up to date as the day I downloaded the CSV, which was 14 April 2022. I cannot limit the date to the end of 2021 to match the timeline CSV. It’s better to treat the data below as a look from a different perspective.

  2. This recordings_last5years dataframe consists of 17 songs, and I want to explain something about 2 of them. First, the song Short Left Drive is not mine. Someone somewhere had the unfortunate fate of releasing music under my Spotify Artist ID tag. I don’t know how to remove it, and I hope whoever they are might do something about it. The other song I need to explain is Froth and Foam. It was previously released in 2019 under my other Artist ID tag, and only in 2021 did I decide to move it under my Syh Artist ID tag. I need to decide: should I remove them both from the dataframe? Well, I will remove Short Left Drive but not Froth and Foam.

recordings <- recordings_last5years %>%
  filter(song != "Short Left Drive")

Let’s take a look.

print(recordings)

Because this dataframe lacks information, I want to add how the song was released, whether as EP or as Single.

Before that, I have to arrange it by date.

recordings <- recordings %>% 
  arrange(release_date)
recordings
discography_title <- 
  as.factor(c("Manusial", 
    "Manusial", 
    "Manusial", 
    "Manusial", 
    "Antinatal in C", 
    "Biru Memar", 
    "Biru Memar", 
    "Biru Memar", 
    "Biru Memar", 
    "Biru Memar", 
    "Biru Memar",
    "Nyamuk Anjing",
    "Halulintas",
    "Within an Inch of Losing It",
    "Tuhan Cabut Nyawaku",
    "Froth and Foam"
    ))

discography_form <-
 as.factor(c("EP",
    "EP",
    "EP",
    "EP",
    "Single",
    "EP",
    "EP",
    "EP",
    "EP",
    "EP",
    "EP",
    "Single",
    "Single",
    "Single",
    "Single",
    "Single"))
recordings_edit <- recordings %>%
  mutate(disc_title = discography_title,
         disc_form = discography_form
         ) %>%
  select(-saves) # since it's zero, let's just drop it

recordings_edit
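Typing sixteen labels by hand is easy to get wrong. A shorter way to build the same two vectors is rep() with per-release song counts. The counts below are read off the vectors above, and the rule "more than one song means EP" is an assumption that happens to hold for this discography. The final length check is left as a comment because it needs the real data.

```r
titles <- c("Manusial", "Antinatal in C", "Biru Memar", "Nyamuk Anjing",
            "Halulintas", "Within an Inch of Losing It",
            "Tuhan Cabut Nyawaku", "Froth and Foam")
n_songs <- c(4, 1, 6, 1, 1, 1, 1, 1)   # songs per release, in release order

# Repeat each title once per song; levels are kept in release order,
# unlike as.factor(), which sorts them alphabetically
discography_title_alt <- factor(rep(titles, times = n_songs), levels = titles)

# Assumption for this data: a release with more than one song is an EP
discography_form_alt <- factor(rep(ifelse(n_songs > 1, "EP", "Single"),
                                   times = n_songs))

table(discography_form_alt)
# With the real data: stopifnot(length(discography_title_alt) == nrow(recordings))
```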

B. ANALYSIS

1. Aggregations and Discoveries

1.a. Task1: “On what day of the week do I have the most streams, or listeners, or increase in followers?”

We can answer this by aggregating the data with group by and summarize.

metrics_per_dow <- timelines_edit %>%
  group_by(day_of_week) %>%
  summarize(total_streams = sum(streams),
            total_listeners = sum(listeners),
            total_followers_diff = sum(followers_diff))
metrics_per_dow
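To make the aggregation concrete, here is the same idea in base R on two made-up weeks of data. It uses format(date, "%u"), the ISO weekday number (1 = Monday), rather than wday(), so it doesn’t depend on lubridate or on the locale.

```r
# Two made-up weeks starting on Monday 2020-01-06, with streams 1..14
demo <- data.frame(date = seq(as.Date("2020-01-06"), by = "day", length.out = 14),
                   streams = 1:14)
demo$dow <- format(demo$date, "%u")   # "1" = Monday ... "7" = Sunday

# Sum streams per weekday, then pick the weekday with the largest total
totals <- tapply(demo$streams, demo$dow, sum)
best_dow <- names(which.max(totals))

totals
best_dow
```

Because the made-up streams simply grow day by day, Sunday ("7") ends up with the largest total here; with real data the winner depends on the listening pattern.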

The table above shows the metrics summarized by day of the week. How these metrics spread out will be easier to digest if we pivot them longer so they’re better prepared for visualization.

# reshape to long format: one row per day of week and metric
metrics_per_dow_longer <- pivot_longer(data = metrics_per_dow,
                                       cols = c("total_streams","total_listeners","total_followers_diff"),
                                       names_to = "metrics",
                                       values_to = "total")

# for each metric, keep the day of week with the highest total
answer_1longer <- metrics_per_dow_longer %>%
  group_by(metrics) %>%
  slice_max(total, n = 1) %>%
  ungroup() %>%
  select(metrics, total, best_day_of_week = day_of_week)

metrics_per_dow_longer
answer_1longer

Let’s visualize how the metrics are distributed across the days of the week.

ggplot(metrics_per_dow_longer, aes(x=reorder(day_of_week, total), y=total))+
  geom_col(aes(fill=metrics), position = "dodge")+
  scale_fill_manual(name = "Metrics", 
                    labels = c("Total Followers Increase","Total Listeners","Total Streams"), 
                    values = c("total_followers_diff"="red",
                                "total_listeners"="blue",
                                "total_streams"="green"))+
  labs(title = "Metric Totals by Day of Week")+
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank())+
  coord_flip()


ggplot(answer_1longer, aes(x=reorder(best_day_of_week, -total), y=total))+
  geom_col(aes(y=total, fill=metrics), position = "dodge")+
  scale_fill_manual(name = "Metrics", 
                    labels = c("Total Followers Increase","Total Listeners","Total Streams"), 
                    values = c("total_followers_diff"="red",
                                "total_listeners"="blue",
                                "total_streams"="green"))+
  labs(title = "What Day of Week has the Highest Total Metrics?")+
    theme(axis.title.x = element_blank(),
        axis.title.y = element_blank())

So, to answer the task, “On what day of the week do I have the most streams, or listeners, or increase in followers?”: the data shows that I have the most streams on Fridays, the most listeners on Tuesdays, and the biggest increase in followers on Thursdays.

1.b. Task2: “How many days after a release date do song streams plummet?”

Before we go on, let’s make something clear. Streams that happened after the release date of a song were not exclusively streams of that song. The release, though, might have triggered streams of other songs that were already available.

To answer the task, I have to mark timelines_edit with the release dates. I’ll create recordings_marker first. Then I’ll split timelines_marked into segments that run from each release date up to the day before the next release, using recordings_marker as a guide.

recordings_marker <- recordings_edit %>% select(release_date, song, disc_title, disc_form)

Left join them to create timelines_marked

timelines_marked <- timelines_edit %>%
  left_join(recordings_marker,
            by = c("date"="release_date"))
timelines_marked[!is.na(timelines_marked$song),c("date","streams","song","disc_title")]
# an EP contributes one joined row per song, so keep each release date only once
date_manusial <- timelines_marked %>% filter(disc_title == "Manusial") %>% pull(date) %>% unique()
date_antinatal_in_c <- timelines_marked %>% filter(disc_title == "Antinatal in C") %>% pull(date) %>% unique()
date_biru_memar <- timelines_marked %>% filter(disc_title == "Biru Memar") %>% pull(date) %>% unique()
date_nyamuk_anjing <- timelines_marked %>% filter(disc_title == "Nyamuk Anjing") %>% pull(date) %>% unique()
date_halulintas <- timelines_marked %>% filter(disc_title == "Halulintas") %>% pull(date) %>% unique()
date_waioli <- timelines_marked %>% filter(disc_title == "Within an Inch of Losing It") %>% pull(date) %>% unique()
date_tc_nyawaku <- timelines_marked %>% filter(disc_title == "Tuhan Cabut Nyawaku") %>% pull(date) %>% unique()
date_froth_n_foam <- timelines_marked %>% filter(disc_title == "Froth and Foam") %>% pull(date) %>% unique()
date_end <- timelines_marked %>% filter(date == max(date)) %>% pull(date) %>% unique()

streams_post_manusial <- timelines_marked %>% 
  filter(date >= date_manusial, 
         date < date_antinatal_in_c) %>% 
  mutate(disc_title = as.character(disc_title)) %>% 
  select(date,streams,disc_title)
streams_post_manusial[,"disc_title"] <- replace_na(data = streams_post_manusial$disc_title,
                                    replace = "Manusial")

streams_post_manusial
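One thing to watch with this join: an EP puts several songs on the same release date, so joining on the date repeats that day’s timeline row once per song. A small base R sketch with made-up rows shows the effect and how unique() collapses the repeated dates. The song and EP names are placeholders.

```r
timeline <- data.frame(date = as.Date(c("2020-01-09", "2020-01-10", "2020-01-11")),
                       streams = c(5, 3, 0))

# Three placeholder songs released together as one EP
releases <- data.frame(release_date = as.Date(rep("2020-01-09", 3)),
                       song = c("Song A", "Song B", "Song C"),
                       disc_title = "Demo EP")

# Left join: keep every timeline row, attach release info where dates match
marked <- merge(timeline, releases,
                by.x = "date", by.y = "release_date", all.x = TRUE)

nrow(marked)                                              # 5 rows, not 3
ep_dates <- marked$date[which(marked$disc_title == "Demo EP")]
length(ep_dates)                                          # 3 copies of one date
unique(ep_dates)                                          # a single release date
```

The repeated rows also repeat that day’s stream count, which is worth keeping in mind when summing or counting days on the joined table.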

Repeating this code is getting tedious. Maybe I should start writing my own function.

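# Chop timelines_marked into the segment from xdate up to (not including) ydate.
# streams_xname is only a label for the reader; it is not used inside the function.
func_tlChopper <- function(xdate, ydate, streams_xname)
{
  chopped <- timelines_marked %>% 
    filter(date >= xdate, 
           date < ydate) %>% 
    mutate(disc_title = as.character(disc_title)) %>% 
    select(date,streams,disc_title)

  # the release title sits on the release-date row(s); take the first one
  disc_varname <- chopped %>% select(disc_title) %>% filter(!is.na(disc_title)) %>% slice_head() %>% pull()

  # label every row of the segment with that title
  chopped[,"disc_title"] <- replace_na(data = chopped$disc_title,
                                       replace = disc_varname)

  chopped
}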

Let’s chop all the song timelines!

streams_post_antinatalinc <- 
  func_tlChopper(date_antinatal_in_c, date_biru_memar, "streams_post_antinatalinc")

streams_post_birumemar <- 
  func_tlChopper(date_biru_memar, date_nyamuk_anjing, "streams_post_birumemar")

streams_post_nyamukanjing <- 
  func_tlChopper(date_nyamuk_anjing, date_halulintas, "streams_post_nyamukanjing")

streams_post_halulintas <- 
  func_tlChopper(date_halulintas, date_waioli, "streams_post_halulintas")

streams_post_waioli <- 
  func_tlChopper(date_waioli, date_tc_nyawaku, "streams_post_waioli")

streams_post_tcnyawaku <- 
  func_tlChopper(date_tc_nyawaku, date_froth_n_foam, "streams_post_tcnyawaku")

streams_post_fnf <- 
  func_tlChopper(date_froth_n_foam, date_end, "streams_post_fnf")

streams_merged <- rbind(streams_post_manusial, 
                     streams_post_antinatalinc,
                     streams_post_birumemar,
                     streams_post_nyamukanjing,
                     streams_post_halulintas,
                     streams_post_waioli,
                     streams_post_tcnyawaku,
                     streams_post_fnf)

range <-  c(as.Date("2019-12-05"), as.Date("2022-02-01"))

ggplot(streams_merged, aes(x=date,y=streams))+
  geom_line(aes(col=disc_title))+
  theme(plot.subtitle = element_text(face = "italic", size = 7),
        axis.title = element_blank(),
        legend.position = "right")+
  geom_vline(data=recordings_marker, 
             aes(xintercept=release_date),
             linetype="dashed")+
  geom_text(data=recordings_marker,
            aes(x=release_date,
                label=song,
                y=50, 
                col=disc_title),
            position = position_jitter(height = 40, width = 3, seed = 9),
            size = 5
            )+
  labs(title = "Timelines of Streams", 
       col="Release Title")+
  scale_x_date(date_labels = "%F", 
               limits = range)

For future use, I think it’s better if I gather the streams_post data frames into a list.

streams_list <- list(streams_post_manusial, 
                     streams_post_antinatalinc,
                     streams_post_birumemar,
                     streams_post_nyamukanjing,
                     streams_post_halulintas,
                     streams_post_waioli,
                     streams_post_tcnyawaku,
                     streams_post_fnf
                     )
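The chopping and gathering could also be done in one pass. This is a base R sketch, assuming one unique, sorted release date per title; the dates, titles, and stream counts are made up. findInterval() tells each day which release it follows, and split() turns that into a list shaped like streams_list.

```r
release_dates <- as.Date(c("2020-01-01", "2020-01-04", "2020-01-10"))  # made up, sorted
titles <- c("Release A", "Release B", "Release C")                     # placeholders

timeline <- data.frame(date = seq(as.Date("2020-01-01"), as.Date("2020-01-12"), by = "day"),
                       streams = c(4, 2, 0, 6, 3, 1, 0, 0, 0, 5, 0, 0))

# Index of the most recent release on or before each day
idx <- findInterval(as.numeric(timeline$date), as.numeric(release_dates))
timeline$disc_title <- factor(titles[idx], levels = titles)

# One data frame per release, in release order
segments <- split(timeline, timeline$disc_title)
sapply(segments, nrow)
```

Unlike the hand-written calls, this version includes the final day of the timeline in the last segment, so the counts for the last release can differ by one day.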

Now let’s count how many days there were still some streams before the streams stopped completely.

streams_post_manusial_dummy <- streams_post_manusial %>% 
  mutate(streams = na_if(streams, 0)) %>% # replace 0 with NA
  fill(streams, .direction = "up") 
day_count_stream_post_manusial_dummy <- streams_post_manusial_dummy  %>% 
  count(is.na(streams))
day_count_stream_post_manusial_dummy

How do we interpret the table above? After the release of the Manusial EP, streams kept coming for 44 days, and then there were no streams at all for the 19 days before the next release. If we want a graph that illustrates this with the actual number of streams, I can create markers for the original stream counts, then draw a line plot.

markers_post_manusial_dummy_plot <-streams_post_manusial_dummy %>% 
  mutate(markers = case_when(is.na(streams) ~ "No One Listened",
                           TRUE ~ "Some Streams Happened"))

Bind the markers column to the original stream chopped data.

streams_post_manusial_complete <- streams_post_manusial %>% 
  cbind(markers = markers_post_manusial_dummy_plot$markers)

Visualize the streams timeline, with markers.

ggplot(streams_post_manusial_complete, aes(x=date,y=streams, col=markers))+
  geom_line()+
  labs(title = "Streams Count after Manusial EP release",
        subtitle = "and before Antinatal in C release")
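The counting step can be cross-checked without the NA-and-fill trick. The sketch below uses a hypothetical helper, count_active_days(), which finds the last day with any streams and treats everything after it as the silent tail. The stream counts are made up.

```r
# Hypothetical helper: split a segment into the days up to the last stream
# and the run of zero-stream days at the end
count_active_days <- function(streams) {
  last_active <- max(which(streams > 0), 0)  # 0 if there were never any streams
  c(active = last_active, silent = length(streams) - last_active)
}

count_active_days(c(3, 0, 2, 0, 0))
# active = 3, silent = 2: the zero on day 2 is followed by a later stream,
# so it counts as active, just as fill(.direction = "up") treats it
```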

Now let’s do this for all the other releases. I created the 2 functions below to help me count days and prepare data for plotting.

# for count

func_dayCountStreams <- function(streams_post_df){

  # 1 create dummy with NA, then fill upward
  streams_post_df_dummy <- streams_post_df %>% 
    mutate(streams = na_if(streams, 0)) %>% # replace 0 with NA
    fill(streams, .direction = "up") 

  # 2 count how many rows are not NA and how many are NA
  streams_post_df_dummy %>% 
    count(is.na(streams))
}

# for plot

func_streamsCompletePlot <- function(streams_post_df){

  # 1 create dummy with NA, then fill upward    
  streams_post_df_dummy <- streams_post_df %>% 
    mutate(streams = na_if(streams, 0)) %>% # replace 0 with NA
    fill(streams, .direction = "up") 

  # 2 create markers 
  markers_post_dummy_plot <- streams_post_df_dummy %>% 
    mutate(markers = case_when(is.na(streams) ~ "No One Listened",
                               TRUE ~ "Some Streams Happened"))

  # 3 bind markers to the original stream data
  streams_post_complete <- streams_post_df %>% 
    cbind(markers = markers_post_dummy_plot$markers)

  # 4 create title
  title <- streams_post_df %>% select(disc_title) %>% slice_head() %>% pull()
  plot_title <- paste("Streams Count of", title)

  # 5 draw the line plot; breaks fixes the legend order so each colour keeps its own label
  ggplot(streams_post_complete, aes(x=date,y=streams))+
    geom_line(aes(col=markers))+
    labs(title = plot_title,
         subtitle = "until the date before subsequent release was available or the end of 2021")+
    scale_x_date(date_labels = "%F")+
    scale_color_manual(name = "Stream Status", 
                       breaks = c("Some Streams Happened","No One Listened"),
                       values = c("Some Streams Happened"="blue",
                                  "No One Listened"="red"))+
    theme(plot.subtitle = element_text(face = "italic", size = 7),
          axis.title = element_blank())

}

Now let’s apply these functions to streams_list:

lapply(X = streams_list,FUN = func_dayCountStreams)
lapply(X = streams_list,FUN = func_streamsCompletePlot)

Some notes about the results above:

  1. The release dates of Tuhan Cabut Nyawaku and Froth and Foam are too close together, and the release date of Froth and Foam is also too close to the end-of-2021 cut-off. The resulting counts and plots are outliers, so it’s probably wise to exclude them from the conclusion.

  2. To answer the task, “How many days after a release date do song streams plummet?”: it could be as short as 40 days or as long as 242 days.

Let’s proceed to the next task.

1.c. Task3: “Does releasing music as an EP or as a Single make a difference in streams?”

Take a look.

print(recordings_edit)

Split the data into EPs and Singles

rec_ep <- recordings_edit %>% filter(disc_form == "EP")
rec_single <- recordings_edit %>% filter(disc_form == "Single")
ggplot(recordings_edit, aes(x=reorder(song, streams), y=streams))+
  geom_col(aes(y= streams, fill="Streams"))+
  geom_col(aes(y=listeners, fill="Listeners"))+
    theme(axis.title.x = element_blank(),
          axis.title.y = element_blank(),
        legend.position = "left")+
   coord_flip()+
  scale_fill_manual(name="Song Stats",
                     breaks=c("Streams","Listeners"),
                     values=c("Streams"="green","Listeners"="blue"))+
  labs(title = "Recordings Metrics")


ggplot(rec_ep, aes(x=reorder(song, streams), y=streams))+
  geom_col(aes(y= streams, fill="Streams"))+
  geom_col(aes(y=listeners, fill="Listeners"))+
    theme(axis.title.x = element_blank(),
          axis.title.y = element_blank(),
        legend.position = "left")+
   coord_flip()+
  scale_fill_manual(name="Song Stats",
                     breaks=c("Streams","Listeners"),
                     values=c("Streams"="green","Listeners"="blue"))+
  labs(title = "EP Metrics")


ggplot(rec_single, aes(x=reorder(song, streams), y=streams))+
  geom_col(aes(y= streams, fill="Streams"))+
  geom_col(aes(y=listeners, fill="Listeners"))+
    theme(axis.title.x = element_blank(),
          axis.title.y = element_blank(),
        legend.position = "left")+
   coord_flip()+
  scale_fill_manual(name="Song Stats",
                     breaks=c("Streams","Listeners"),
                     values=c("Streams"="green","Listeners"="blue"))+
  labs(title = "Single Metrics")

To see whether or not there is a difference between the two groups, we need to perform a statistical test. First I need to check the normality of the data to decide whether to use a parametric or a nonparametric test.

recordings_edit %>% 
  group_by(disc_form) %>% 
  normality(streams)
recordings_edit %>% 
  group_by(disc_form) %>% 
  plot_normality(streams)

recordings_edit[,c(6,3)]  
ggqqplot(recordings_edit[,c(6,3)], x = "streams")
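For reference, a per-group normality check can also be run with base R’s shapiro.test(). The stream counts below are made up, so the p-values say nothing about the real recordings; only the shape of the check carries over.

```r
# Made-up stream counts for two release forms
demo <- data.frame(disc_form = rep(c("EP", "Single"), times = c(10, 6)),
                   streams = c(12, 30, 25, 18, 40, 22, 35, 28, 15, 33,
                               60, 95, 20, 130, 75, 48))

# Shapiro-Wilk p-value per group; a small p-value argues against normality
p_by_form <- tapply(demo$streams, demo$disc_form,
                    function(x) shapiro.test(x)$p.value)
p_by_form
```

With groups this small the test has little power, so a Q-Q plot like the one above remains a useful companion.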

As shown by the table and plot above, the streams in both groups are roughly normally distributed, so we can proceed with a parametric test. But which one? To compare two groups, I can use either Student’s t-test or Welch’s t-test. If the groups have equal variance, we’ll proceed with Student’s t-test; otherwise we go with Welch’s t-test.

If the p-value from leveneTest() is lower than 0.05, it means the variances of EP and Single are not equal.

leveneTest(streams ~ disc_form, recordings_edit[,c(6,3)] )
Levene's Test for Homogeneity of Variance (center = median)
      Df F value  Pr(>F)  
group  1  7.8457 0.01415 *
      14                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

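The p-value of 0.014 is below 0.05, so the variances are not equal and we go with Welch’s t-test.

Null Hypothesis = There is no difference between the stream counts of releases in the form of Single or EP.

Alternative Hypothesis = There is a difference between the stream counts of releases in the form of Single or EP.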

# load the package 
library(ggstatsplot)
ggbetweenstats(
  data = recordings_edit,
  x = disc_form,
  y = streams,
  type = "parametric",
  var.equal = FALSE
)
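With type = "parametric" and var.equal = FALSE, the comparison behind this plot is Welch’s t-test, which base R exposes through t.test(). A minimal sketch on made-up stream counts, to show the call shape only; the numbers are not the real recordings.

```r
# Made-up stream counts for two release forms
demo <- data.frame(disc_form = rep(c("EP", "Single"), times = c(10, 6)),
                   streams = c(12, 30, 25, 18, 40, 22, 35, 28, 15, 33,
                               60, 95, 20, 130, 75, 48))

# var.equal = FALSE requests Welch's correction for unequal variances
welch <- t.test(streams ~ disc_form, data = demo, var.equal = FALSE)

welch$method
welch$p.value
```

Welch’s version adjusts the degrees of freedom instead of pooling the two variances, which is why it fits the unequal-variance result from Levene’s test.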

Despite the Hedges’ g of 0.81, which implies a large effect size, and a posterior difference of 22.8 streams, the p-value of 0.10 shows that there is not enough evidence to reject the null hypothesis. Also, the natural log of the Bayes factor, 0.07, shows that the evidence against the null hypothesis is not worth more than a bare mention.

I conclude that the data show no reliable difference in stream counts between releasing my music as an EP or as a single.

On to the next task.

1.d. Task4: “Which metric has the most effect on the increase in followers?”

I intend to investigate this relationship using regression. But, as mentioned in the data preparation phase, followers is a cumulative metric. To compare evenly with followers, I can use the cumulative versions of the other metrics. If I want to use the non-cumulative metrics, I can use the followers_diff metric. Maybe I’ll just do both.

tl_cum <- timelines_edit %>% select(followers, cum_listeners, cum_streams, count_days)
tl_noncum <- timelines_edit %>% select(followers_diff, listeners, streams, day_of_year, day_of_week)

First let’s check the assumption that the relationship between the Response Variable and the Explanatory Variables is linear.

plot(tl_cum)

plot(tl_noncum)

They’re pretty linear in the cumulative data, but not so in the noncumulative data.

Next, we observe whether or not they’re normally distributed.

plot_normality(tl_cum)

plot_normality(tl_noncum)

Looking at the plot, I would say that they’re not normally distributed.

I should stop.

Not just because the data fail to meet the assumptions, but more so because, at the time of writing, I’ve received advice from a mentoring group that data such as this is better approached with Time Series than with Linear Regression.

So here I stop. But! Let me take a tiny peek at the model!

tlcum_model <- lm(formula = followers ~ cum_listeners * cum_streams + count_days + 0, 
                  data = tl_cum)
tlnoncum_model <- lm(formula = followers_diff ~ listeners * streams + day_of_week + 0, 
                  data = tl_noncum)

summary(tlcum_model)

Call:
lm(formula = followers ~ cum_listeners * cum_streams + count_days + 
    0, data = tl_cum)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.75922 -0.55944 -0.02806  0.54210  1.39665 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
cum_listeners             -1.352e-02  9.654e-03  -1.401    0.162    
cum_streams                6.091e-03  1.137e-03   5.360 1.12e-07 ***
count_days                 7.847e-03  5.404e-04  14.521  < 2e-16 ***
cum_listeners:cum_streams  4.847e-06  4.976e-06   0.974    0.330    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7095 on 727 degrees of freedom
Multiple R-squared:  0.9866,    Adjusted R-squared:  0.9865 
F-statistic: 1.339e+04 on 4 and 727 DF,  p-value: < 2.2e-16
summary(tlnoncum_model)

Call:
lm(formula = followers_diff ~ listeners * streams + day_of_week + 
    0, data = tl_noncum)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.03531 -0.01757 -0.00583  0.01225  0.98243 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
listeners             0.0489391  0.0081377   6.014 2.88e-09 ***
streams              -0.0004790  0.0010408  -0.460 0.645494    
day_of_weekSunday     0.0058341  0.0119242   0.489 0.624804    
day_of_weekMonday     0.0048250  0.0119238   0.405 0.685854    
day_of_weekTuesday   -0.0140734  0.0119526  -1.177 0.239413    
day_of_weekWednesday -0.0125728  0.0118150  -1.064 0.287623    
day_of_weekThursday   0.0175744  0.0118126   1.488 0.137249    
day_of_weekFriday     0.0060414  0.0119152   0.507 0.612287    
day_of_weekSaturday  -0.0122483  0.0118711  -1.032 0.302522    
listeners:streams    -0.0009008  0.0002551  -3.532 0.000439 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1192 on 721 degrees of freedom
Multiple R-squared:  0.0683,    Adjusted R-squared:  0.05538 
F-statistic: 5.286 on 10 and 721 DF,  p-value: 1.544e-07
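
A side note on the + 0 in both formulas: it removes the intercept, which is why the noncumulative summary lists a coefficient for every day of the week instead of dropping one as the baseline. A tiny sketch with made-up numbers:

```r
# Made-up data: a numeric predictor and a two-level factor
demo <- data.frame(y = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
                   x = c(1, 2, 3, 4, 5, 6),
                   g = factor(c("a", "b", "a", "b", "a", "b")))

with_intercept <- lm(y ~ x + g, data = demo)
no_intercept   <- lm(y ~ x + g + 0, data = demo)

names(coef(with_intercept))  # "(Intercept)" "x" "gb"
names(coef(no_intercept))    # "x" "ga" "gb"
```

Without an intercept, each factor level gets its own coefficient, so the day-of-week estimates read as per-day offsets rather than differences from a reference day.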

Cool. The cumulative model’s estimates say that the count of days since my music was first released on Spotify has had more impact than whatever cumulative streams and/or listeners I’ve scraped together along the way.

Meanwhile, the noncumulative model’s estimates say that listeners is the only term with a significant positive effect. Thursday has the largest positive estimate among the days, but it isn’t significant. And streams has a negative estimate! It isn’t significant on its own, though the listeners:streams interaction is both negative and significant. Bravo! The data doesn’t lie. Perhaps this is because most streams come from me, so they don’t really have any impact on followers, while the number of listeners actually broadens my reach and increases the chance of gaining new followers.

Well, since this linear regression thing isn’t the right approach, I’m gonna take it with a grain–NO! A spoonful of salt and wash it down with alc–Coffee!

Even though I was only going to look at the estimates, now I’m tempted to see a little prediction. Here we go, blaming the temptation. “Darn, temptation!”

explanatory_data1_cum <- timelines_edit %>% select(cum_listeners, cum_streams, count_days)
explanatory_data2_cum <- data.frame(cum_listeners = round(seq(242,484,242/731)),
         cum_streams = round(seq(1031,2062 ,1031/731)),
         count_days = 723:(723+731))
explanatory_data_cum <- rbind(explanatory_data1_cum, explanatory_data2_cum)

explanatory_data1_noncum <- timelines_edit %>% select(listeners, streams, day_of_week)
explanatory_data2_noncum <- explanatory_data1_noncum %>% mutate(
  listeners = listeners*2,
  streams = streams*2
)
explanatory_data_noncum <- rbind(explanatory_data1_noncum, explanatory_data2_noncum)
prediction_data_cum <- explanatory_data_cum %>% 
  mutate(followers = predict(tlcum_model, explanatory_data_cum))

prediction_data_noncum <- explanatory_data_noncum %>% 
  mutate(followers_diff = predict(tlnoncum_model, explanatory_data_noncum))
tail(prediction_data_cum, 3)
sum(prediction_data_noncum$followers_diff)
[1] 23.70941
plot(prediction_data_noncum$followers_diff)

cum_plot <- ggplot(tl_cum, aes(x= count_days, y=followers, col=cum_streams, size=cum_listeners))+
  geom_point(shape=1)+
  scale_color_viridis_b(option="D")+
  geom_jitter(data=prediction_data_cum, shape = 2, width = 6, height = 2, alpha=0.3)+
  labs(col="Cumulative Streams",
       size="Cumulative Listeners",
       x = "Count of Days Since First Release Date",
       y = "Number of Followers")

cum_plot


noncum_plot <- ggplot(tl_noncum, aes(x= listeners, y=followers_diff, col=streams))+
  geom_point(shape=1)+
  scale_color_viridis_c(option="D")+
  geom_jitter(data=prediction_data_noncum, height = 0.05, width = 3, alpha=0.3, shape = 2)+
  labs(col="Streams",
       x = "Number of Listeners",
       y = "Increase/Decrease in Followers")+
  facet_wrap(vars(day_of_week), scales = "free")


noncum_plot

Perhaps that was not the right approach, but did I have fun practicing data science along the way? I did.

Let’s look at the plots. The cumulative one is quite sane and linear, but the noncumulative one is a bit funky. What a wild Wednesday in that plot! Still, both results basically produced twice the Response Variable of the original data set, since I merely doubled the numeric Explanatory Variables in both of them.

Welp, I don’t yet know how to do Time Series, and let’s say the regression above isn’t the right approach. Maybe we can at least look at the good old correlation below. We know the popular saying that “correlation does not imply causation”, but we also need to acknowledge that “there is no correlation without causation somewhere”: there’s probably a common cause somewhere upstream.

plot_correlation(tl_cum)

plot_correlation(tl_noncum)

Followers correlates most strongly with cumulative streams and count of days, both at 0.96, while followers difference correlates most strongly with listeners. That’s not so different from the coefficient estimates I got when I called summary on the regression models before.

2. Results

2.a. “On what day of week I have the most streams, or listeners, or increase in followers?”

The data shows that I have the most streams on Friday, the most listeners on Tuesday, and the biggest increase in followers on Thursday.

| Metrics | Day of Week |
|---|---|
| Streams | Friday |
| Listeners | Tuesday |
| Increase in Followers | Thursday |

2.b. “How many days since the date released that song streams plummeted?”

It could be as short as 40 days, or could be as long as 242 days.

| Count of Days Streaming Still Happened |
|---|
| ≥ 40 and ≤ 242 |

2.c. “Does releasing music in EP or Single have differences in streams?”

There is a lack of evidence to reject the null hypothesis that there’s no difference.

| Statistic | Value |
|---|---|
| P-value | 0.10 |
| Natural log of Bayes Factor | 0.07 |

2.d. “What metric has more effect to increase in followers?”

Total followers and Followers Difference correlate most with:

| Cumulative Metrics | Correlation | Noncumulative Metrics | Correlation |
|---|---|---|---|
| Count of Days | 0.96 | Number of Listeners per day | 0.18 |
| Cumulative Streams | 0.96 | - | - |

C. CONCLUSION

How would I conclude all this? Well, according to Cassie Kozyrkov, who works as Chief Decision Scientist at Google, data analytics serves to find good questions and probably should conclude nothing outside the data itself, while statistics may serve to find appropriate answers by inferring beyond the data.

The results for tasks a and b might lead me to ask further questions such as: “Why do streams increase on Friday? Should I try to A/B test releasing on Friday versus not on Friday? Or are there any earlier studies that already investigated the importance of Friday for music streaming?” “Are these streams mostly of the recently released song? Or might any release date offer a starting point for streaming other songs? So is it better to release frequently at shorter intervals? Which songs are short-lived, and what might be the cause? Which songs are generating many days of streams, and is it because there are people other than me who actually stream them?”

Meanwhile, the result for task c led me to conclude that, when it comes to stream counts, it doesn’t matter whether I release my music as an EP or as a Single. So when I want to release some music in the future, I can weigh other considerations more than brooding over EP or Single. Maybe practicalities such as the cost of releasing an EP or a Single should matter more to me.

Lastly, for the result of task d, let’s just say that it can be a starting point for better statistical investigations when I’m more capable of it later.

---
title: "Syh Spotify Data Exploration"
author: "S.Y. Husada"
date: '2022-05-01'
output:
  html_notebook
---

## A. [INTRODUCTION](#intro)

#### 1.[Identify Tasks](#tasks)

#### 2.[Data Preparation](#prep)

##### 2.a.[Timelines](#tl)

##### 2.b.[Recordings](#rec)

## B. [ANALYSIS](#analysis)

#### 1. [Aggregations and Discoveries](#agg)

##### 1.a. [Task 1: "On what day of week I have the most streams, or listeners, or increase in followers?"](#t1)

##### 1.b. [Task 2: "How many days since the date released that song streams plummeted?"](#t2)

##### 1.c. [Task 3: "Does releasing music in EP or Single have differences in streams?"](#t3)

##### 1.d. [Task 4: "What metric has more effect to increase in followers?"](#t4)

#### 2. [Results](#res)

##### 2.a. [Result 1: "On what day of week I have the most streams, or listeners, or increase in followers?"](#2a) 

##### 2.b. [Result 2: "How many days since the date released that song streams plummeted?"](#2b)

##### 2.c. [Result 3: "Does releasing music in EP or Single have differences in streams?"](#2c)

##### 2.d. [Result 4: "What metric has more effect to increase in followers?"](#2d) 

## C. [CONCLUSION](#concl)
  
---  
  
---  

---  
  

# **A. INTRODUCTION** <a id="intro"></a>

>*"We can get creative: we can start to send out messages in a bottle: we can sing, write poetry, produce books and blogs, activities stemming from the realisation that people around us won't ever fully get us but that others -- separated across time and space -- might just. The history of art is the record of people who couldn't find anyone in the vicinity to talk to."* ---Alain de Botton

The quote above might represent what I felt when I started producing music, among other things.
But this time, I hereby hold myself to face the dire state of my musical project in the form of data.
Specifically, the data I've got from my Spotify for Artists account.
Maybe, I can learn something from it.
Perhaps by answering questions such as "On what day of week I have the most streams?" or "How long since the date released that song streams plummeted?"

There are two csv files.
The first one captures a timeline of streams, listeners, and followers; while another offers similar metrics but is grouped by song titles.
They're pretty bare bones.
I might need to perform some feature engineering.

## **1. Identify Tasks** <a id="tasks"></a>

Almost halfway in writing this notebook, I realized I was going nowhere since I didn't really have any specific direction.
So here I am, writing lists of questions I want to investigate:

1.  On what day of week I have the most streams, or listeners, or increase in followers?

2.  How many days since the date released that song streams plummeted?

3.  Does releasing music in EP or Single have differences in streams?

4.  What metric has more effect to increase in followers?

## **2. Data Preparation** <a id="prep"></a>

First, let's load some tools to look into the data.
```{r message=FALSE, warning=FALSE, include=FALSE}
library(tidyverse)
library(readr)
library(dlookr)
library(DataExplorer)
library(lubridate)
library(ggstatsplot)
library(effectsize)
library(ggpubr)
library(car)
recordings_last5years <- read_csv("recordings-last5years.csv")
timelines <- read_csv("timelines.csv")
```

### *2.a. Timelines* <a id="tl"></a>

I started releasing music on Spotify in 2020.
2022 hasn't ended yet, so let's set the cutoff range from 2020 to 2021.
```{r}
# I started releasing Syh project in 2020
# filter the date
timelines <- timelines %>% 
  mutate(date = as.Date(date)) %>% 
  filter(date >= "2020-01-01",
         date <= "2021-12-31") %>%
  arrange(date)
```

Let's visualize this data.
```{r}
ggplot(timelines, aes(x = date, y = followers))+
  geom_line(aes(color="Total Followers"))+
  geom_point(aes(y=streams, 
                 color="Streams"
                 )
             )+
  geom_point(aes(y=listeners, 
                 color="Listeners"
                 ),
             )+
  theme(axis.title.y = element_blank(),
        legend.position = "top")+
  scale_color_manual(name="Spotify Metrics",
                     breaks=c("Total Followers","Streams","Listeners"),
                     values=c("Total Followers"="red","Streams"="green", "Listeners"="blue"))
 
  
```

As we can see, the red line that represents followers rises cumulatively, while the green and blue dots that represent streams and listeners are not cumulative.
I need to create cumulative versions of them.
```{r}


timelines_edit <- timelines %>% 
  mutate(cum_listeners = cumsum(listeners),
         cum_streams = cumsum(streams),
         day_of_year = yday(date))
```

```{r}
head(timelines_edit, 3)
```

Wait.
I don't remember releasing music on the first day of the year.
Maybe using day_of_year is a bit off, though it may still be useful.
I think I need to add a column that counts the number of days since the release date of the first song, not from the first day of the year. To do that, I first need to find the date from which I should start counting.
```{r}
# to count day since the first day of release
start_count_date <- recordings_last5years %>%
  filter(release_date == min(release_date)) %>%  pull(release_date) %>% unique()
start_count_date 
```

Then use the date we found to work out how many rows the day count needs to be lagged by.
```{r}
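# Rows to skip: day of year of the first release minus one (timeline starts on 1 January)
n_for_lag <- timelines_edit %>% 
  filter(date == start_count_date) %>% pull(day_of_year)
n_for_lag <- as.integer(n_for_lag - 1)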
n_for_lag
```

Then we use `lag()` to start counting, and don't forget to replace the NAs with zero.
```{r}
timelines_edit <- timelines_edit %>% 
  mutate(count_days = lag(seq_along(date), n=n_for_lag)) %>%
  replace_na(list(count_days =0))

head(timelines_edit, 20)
```
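
As a cross-check on the lag-based count above, the same column can be sketched from plain date arithmetic. The data and the `toy_` names below are made up for illustration; the only thing carried over from the notebook is the convention that the release day itself counts as day 1.

```{r}
library(dplyr)

# Toy timeline: 10 consecutive days starting on 1 January (hypothetical data)
toy_tl <- tibble(date = seq(as.Date("2020-01-01"), by = "day", length.out = 10))
toy_start <- as.Date("2020-01-04")  # hypothetical first release date

toy_tl <- toy_tl %>%
  mutate(
    # days since the release, with the release day as day 1 and earlier days as 0
    count_days_diff = pmax(as.integer(date - toy_start) + 1L, 0L),
    # the lag-based version: skip the 3 rows before the release date
    count_days_lag = coalesce(lag(seq_along(date), n = 3), 0L)
  )

toy_tl
```

Both columns agree on this toy data. The date-difference version does not depend on the timeline starting on 1 January, which the lag-based version quietly relies on.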

Let's visualise the new features.
```{r}
  ggplot(timelines_edit, aes(x = date, y = followers))+
  geom_line(aes(color="Total Followers"))+
  geom_point(aes(y=cum_streams, 
                 color="Cumulative Streams"
                 )
             )+
  geom_point(aes(y=cum_listeners, 
                 color="Cumulative Listeners"
                 ),
             )+
  geom_line(aes(y=count_days,
                color="Cumulative Day Counts"),
            )+
  theme(axis.title.y = element_blank(),
        legend.position = "right")+
  scale_color_manual(name="Cumulative Spotify Metrics",
                     breaks=c("Total Followers",
                              "Cumulative Streams",
                              "Cumulative Listeners",
                              "Cumulative Day Counts"),
                     values=c("Total Followers"="red",
                              "Cumulative Streams"="green", 
                              "Cumulative Listeners"="blue",
                              "Cumulative Day Counts"="black"))
```

Before getting too carried away into what cumulative metrics might say, let's take a step back and extract some more features from the date column.
```{r}
timelines_edit <- timelines_edit %>%
  mutate(day_of_week = wday(date, label = TRUE, abbr = FALSE),
         )
```

I'm a bit bothered that the followers feature is cumulative only.
I need a new feature that lets me see when the number of followers increased, which I can get by subtracting the lagged followers column from the followers column using lag().
Because lag() creates an NA, I also need to replace it with 0 immediately.
```{r}
timelines_edit <- timelines_edit %>%
  mutate(followers_diff = followers - lag(followers)) %>%
  mutate_if(is.numeric, ~replace(., is.na(.),0))
```
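
To make the lag-and-subtract step concrete, here is a minimal sketch on made-up follower counts (the `toy_followers` data is hypothetical). It uses the `default` argument of `lag()` so the first difference comes out as 0 without a separate NA replacement.

```{r}
library(dplyr)

# Hypothetical cumulative follower counts over five days
toy_followers <- tibble(followers = c(3, 3, 4, 6, 6))

toy_followers <- toy_followers %>%
  # default = first(followers) makes the first difference 0 instead of NA
  mutate(followers_diff = followers - lag(followers, default = first(followers)))

toy_followers$followers_diff
```

This variant only touches the one new column, whereas the `mutate_if()` call above zero-fills NAs in every numeric column.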

```{r}
head(timelines_edit)
```

If I want to see how some features in `timelines_edit` relate to the songs I released, I need to prepare the `recordings_last5years` dataframe first.

### **2.b. Recordings** <a id="rec"></a>

Some caveats about this recording data:

1.  On the Spotify for Artists website where I downloaded the csv for this particular data, the only options for adjusting the cutoff date were: Last 24 Hours, Last 7 Days, Last 28 Days, Since 2015, and All time. So the streams and listeners metrics are only as up to date as the day I downloaded the csv, which was 14 April 2022. I cannot limit the date to the end of 2021 to match the timeline csv. It's better to treat the data below as a look from a different perspective.

2.  This `recordings_last5years` dataframe consists of 17 songs, and I want to explain something about 2 of them. First, the song **Short Left Drive** is not mine. Someone somewhere had the unfortunate fate of releasing music under my Spotify Artist ID tag, and I don't know how to remove it, and I hope whoever they are might do something about it. The other song I need to explain is **Froth and Foam**. It was previously released in 2019 under my other Artist ID tag, and only in 2021 did I decide to move it under my [Syh Artist ID tag](https://open.spotify.com/artist/34LYpPINVzGOwWeMAZ7mWz?si=3h9_phc7RKuPa8XcqRYQ-g). I need to decide: should I remove them both from the dataframe? Well, I will remove **Short Left Drive** but not **Froth and Foam**.

```{r}
recordings <- recordings_last5years %>%
  filter(song != "Short Left Drive")
```

Let's take a look.
```{r}
print(recordings)
```

Because this dataframe lacks information, I want to add how the song was released, whether as EP or as Single.

Before that, I have to arrange it by date.
```{r}
recordings <- recordings %>% 
  arrange(release_date)
recordings
```
```{r}
discography_title <- 
  as.factor(c("Manusial", 
    "Manusial", 
    "Manusial", 
    "Manusial", 
    "Antinatal in C", 
    "Biru Memar", 
    "Biru Memar", 
    "Biru Memar", 
    "Biru Memar", 
    "Biru Memar", 
    "Biru Memar",
    "Nyamuk Anjing",
    "Halulintas",
    "Within an Inch of Losing It",
    "Tuhan Cabut Nyawaku",
    "Froth and Foam"
    ))

discography_form <-
 as.factor(c("EP",
    "EP",
    "EP",
    "EP",
    "Single",
    "EP",
    "EP",
    "EP",
    "EP",
    "EP",
    "EP",
    "Single",
    "Single",
    "Single",
    "Single",
    "Single"))
```
```{r}
recordings_edit <- recordings %>%
  mutate(disc_title = discography_title,
         disc_form = discography_form
         ) %>%
  select(-saves) # since it's zero, let's just drop it

recordings_edit
```


# **B. ANALYSIS** <a id="analysis"></a>

## **1. Aggregations and Discoveries** <a id="agg"></a>

### **1.a. Task1: "On what day of week I have the most streams, or listeners, or increase in followers?"** <a id="t1"></a>

We can answer this by aggregating the data with group by and summarize.

```{r}
metrics_per_dow <- timelines_edit %>%
  group_by(day_of_week) %>%
  summarize(total_streams = sum(streams),
            total_listeners = sum(listeners),
            total_followers_diff = sum(followers_diff))
metrics_per_dow
```

The table above shows the metrics summarized by day of week.
How these metrics spread out will be easier to digest if we pivot them longer, so they're better prepared for visualization.

```{r}
dowtop_streams <- max(metrics_per_dow$total_streams)
dowtop_listeners <- max(metrics_per_dow$total_listeners)
dowtop_folldiff <- max(metrics_per_dow$total_followers_diff)

best_dow_stream <- as.data.frame(metrics_per_dow[metrics_per_dow$total_streams == dowtop_streams, c("day_of_week","total_streams")])
best_dow_listeners <- as.data.frame(metrics_per_dow[metrics_per_dow$total_listeners == dowtop_listeners, c("day_of_week","total_listeners")])
best_dow_folldiff <- as.data.frame(metrics_per_dow[metrics_per_dow$total_followers_diff == dowtop_folldiff, c("day_of_week","total_followers_diff")])

answer_1 <- cbind(best_dow_stream, best_dow_listeners, best_dow_folldiff)
```

```{r}
metrics_per_dow_longer <- pivot_longer(data = metrics_per_dow,
                                       cols = c("total_streams","total_listeners","total_followers_diff"),
                                       names_to = "metrics",
                                       values_to = "total")



# pivot day
pivot_day <- pivot_longer(data = answer_1,
                          cols = "day_of_week",
                          values_to = "best_day_of_week") %>% 
  select(best_day_of_week)

# pivot stats
pivot_stats <- pivot_longer(data = answer_1, 
                            cols = c("total_streams","total_listeners","total_followers_diff"), 
                            names_to = "metrics", 
                            values_to = "total") %>% 
  select(metrics, total)

# join
answer_1longer<- bind_cols(pivot_stats, pivot_day)

metrics_per_dow_longer
answer_1longer
```
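
The "best day per metric" lookup can also be sketched in one pipeline: pivot longer first, then keep the top row within each metric. The totals below are invented and `toy_dow` only mimics the shape of `metrics_per_dow`; this is an alternative sketch, not the approach the notebook itself uses.

```{r}
library(dplyr)
library(tidyr)

# Hypothetical day-of-week totals, shaped like metrics_per_dow
toy_dow <- tibble(
  day_of_week = c("Monday", "Tuesday", "Wednesday"),
  total_streams = c(10, 25, 40),
  total_listeners = c(7, 12, 9)
)

toy_best <- toy_dow %>%
  pivot_longer(cols = c(total_streams, total_listeners),
               names_to = "metrics",
               values_to = "total") %>%
  group_by(metrics) %>%
  slice_max(total, n = 1) %>%   # keeps ties, if any
  ungroup()

toy_best
```

Because the day and the total stay on the same row throughout, there is no need to bind separately pivoted pieces back together.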

Let's visualize how the metrics are distributed across the days of the week.

```{r}
ggplot(metrics_per_dow_longer, aes(x=reorder(day_of_week, total), y=total))+
  geom_col(aes(fill=metrics), position = "dodge")+
  scale_fill_manual(name = "Metrics", 
                    labels = c("Total Followers Increase","Total Listeners","Total Streams"), 
                    values = c("total_followers_diff"="red",
                                "total_listeners"="blue",
                                "total_streams"="green"))+
  labs(title = "Metrics Total in Day of Weeks")+
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank())+
  coord_flip()

ggplot(answer_1longer, aes(x=reorder(best_day_of_week, -total), y=total))+
  geom_col(aes(y=total, fill=metrics), position = "dodge")+
  scale_fill_manual(name = "Metrics", 
                    labels = c("Total Followers Increase","Total Listeners","Total Streams"), 
                    values = c("total_followers_diff"="red",
                                "total_listeners"="blue",
                                "total_streams"="green"))+
  labs(title = "What Day of Week has the Highest Total Metrics?")+
    theme(axis.title.x = element_blank(),
        axis.title.y = element_blank())
```

So, to answer the task, on what day of week I have the most streams, or listeners, or increase in followers?
The data shows that I have the most streams on Friday, the most listeners on Tuesday, and the biggest increase in followers on Thursday.

### **1.b. Task2: "How many days since the date released that song streams plummeted?"** <a id="t2"></a>

Before we go, let's make something clear.
Streams that happened after the release date of a song were not exclusively streams of that song.
The release, though, might have triggered streams of other songs that were already available.

To answer the task, I have to mark `timelines_edit` with the release dates.
I'll create `recordings_marker` for that.
Then, I'll split `timelines_marked` into windows running from each release date up until the day before the next release, using `recordings_marker` as a guide.

```{r}
recordings_marker <- recordings_edit %>% select(release_date, song, disc_title, disc_form)
```

Left join them to create `timelines_marked`.

```{r}
timelines_marked <- timelines_edit %>%
  left_join(recordings_marker,
            by = c("date"="release_date"))

```
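
One thing worth checking with this kind of join: when several songs share a release date, as the tracks of an EP do, a left join repeats that timeline row once per song. A minimal sketch with made-up data (all `toy_` names are hypothetical) shows the effect and one way to keep one row per day.

```{r}
library(dplyr)

toy_days <- tibble(date = as.Date("2020-03-01") + 0:2, streams = c(5, 2, 0))
# Two songs from one hypothetical EP share a release date
toy_songs <- tibble(
  release_date = as.Date(c("2020-03-01", "2020-03-01")),
  song = c("Song A", "Song B"),
  disc_title = c("Toy EP", "Toy EP")
)

toy_joined <- left_join(toy_days, toy_songs, by = c("date" = "release_date"))
nrow(toy_joined)   # one more row than toy_days: the release date appears twice

# Joining one row per release keeps the timeline at one row per day
toy_releases <- distinct(toy_songs, release_date, disc_title)
toy_joined_once <- left_join(toy_days, toy_releases, by = c("date" = "release_date"))
nrow(toy_joined_once)
```

Whether the repeated rows matter depends on what is done next; for day counts they add a few extra rows on EP release dates.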

```{r}
timelines_marked[!is.na(timelines_marked$song),c("date","streams","song","disc_title")]
```

```{r}
# unique() keeps one date per release, since every song on an EP shares its release date
date_manusial <- timelines_marked %>% filter(disc_title == "Manusial") %>% pull(date) %>% unique()
date_antinatal_in_c <- timelines_marked %>% filter(disc_title == "Antinatal in C") %>% pull(date) %>% unique()
date_biru_memar <- timelines_marked %>% filter(disc_title == "Biru Memar") %>% pull(date) %>% unique()
date_nyamuk_anjing <- timelines_marked %>% filter(disc_title == "Nyamuk Anjing") %>% pull(date) %>% unique()
date_halulintas <- timelines_marked %>% filter(disc_title == "Halulintas") %>% pull(date) %>% unique()
date_waioli <- timelines_marked %>% filter(disc_title == "Within an Inch of Losing It") %>% pull(date) %>% unique()
date_tc_nyawaku <- timelines_marked %>% filter(disc_title == "Tuhan Cabut Nyawaku") %>% pull(date) %>% unique()
date_froth_n_foam <- timelines_marked %>% filter(disc_title == "Froth and Foam") %>% pull(date) %>% unique()
date_end <- timelines_marked %>% filter(date == max(date)) %>% pull(date)
```

```{r}

streams_post_manusial <- timelines_marked %>% 
  filter(date >= date_manusial, 
         date < date_antinatal_in_c) %>% 
  mutate(disc_title = as.character(disc_title)) %>% 
  select(date,streams,disc_title)

streams_post_manusial[,"disc_title"] <- replace_na(data = streams_post_manusial$disc_title,
                                    replace = "Manusial")

streams_post_manusial
```

Repeating this code is getting tedious. Maybe I should start trying to write my own function.

```{r}
# Chop `timelines_marked` into the window from `xdate` (inclusive) to `ydate` (exclusive).
# `streams_xname` is kept for the calls below, but the function does not use it.
func_tlChopper <- function(xdate, ydate, streams_xname)
{
  chopped <- timelines_marked %>% 
    filter(date >= xdate, 
           date < ydate) %>% 
    mutate(disc_title = as.character(disc_title)) %>% 
    select(date, streams, disc_title)

  # the first non-missing title in the window is the release that opened it
  disc_varname <- chopped %>% 
    filter(!is.na(disc_title)) %>% slice_head() %>% pull(disc_title)

  chopped$disc_title <- replace_na(chopped$disc_title, disc_varname)

  chopped
}
```

Let's chop all the song timelines!

```{r}
streams_post_antinatalinc <- 
  func_tlChopper(date_antinatal_in_c, date_biru_memar, "streams_post_antinatalinc")

streams_post_birumemar <- 
  func_tlChopper(date_biru_memar, date_nyamuk_anjing, "streams_post_birumemar")

streams_post_nyamukanjing <- 
  func_tlChopper(date_nyamuk_anjing, date_halulintas, "streams_post_nyamukanjing")

streams_post_halulintas <- 
  func_tlChopper(date_halulintas, date_waioli, "streams_post_halulintas")

streams_post_waioli <- 
  func_tlChopper(date_waioli, date_tc_nyawaku, "streams_post_waioli")

streams_post_tcnyawaku <- 
  func_tlChopper(date_tc_nyawaku, date_froth_n_foam, "streams_post_tcnyawaku")

streams_post_fnf <- 
  func_tlChopper(date_froth_n_foam, date_end, "streams_post_fnf")
```
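
The repeated chopping calls above could probably be collapsed into a single step by carrying each release title downward until the next release. The sketch below uses made-up data and hypothetical `toy_` names, and it is an alternative approach using `tidyr::fill()`, not what the notebook does.

```{r}
library(dplyr)
library(tidyr)

# Hypothetical daily streams and two hypothetical releases
toy_timeline <- tibble(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 8),
  streams = c(0, 5, 3, 0, 7, 2, 0, 0)
)
toy_release_dates <- tibble(
  release_date = as.Date(c("2020-01-02", "2020-01-05")),
  disc_title = c("Release A", "Release B")
)

toy_marked <- toy_timeline %>%
  left_join(toy_release_dates, by = c("date" = "release_date")) %>%
  fill(disc_title, .direction = "down") %>%  # each day inherits the latest release
  filter(!is.na(disc_title))                 # drop days before the first release

count(toy_marked, disc_title)
```

Each day ends up labelled with the most recent release, so a later `group_by(disc_title)` can stand in for the separate `streams_post_*` data frames.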
```{r tls, fig.height=10, fig.width=20}
streams_merged <- rbind(streams_post_manusial, 
                     streams_post_antinatalinc,
                     streams_post_birumemar,
                     streams_post_nyamukanjing,
                     streams_post_halulintas,
                     streams_post_waioli,
                     streams_post_tcnyawaku,
                     streams_post_fnf)

range <-  c(as.Date("2019-12-05"), as.Date("2022-02-01"))

ggplot(streams_merged, aes(x=date,y=streams))+
  geom_line(aes(col=disc_title))+
  theme(plot.subtitle = element_text(face = "italic", size = 7),
        axis.title = element_blank(),
        legend.position = "right")+
  geom_vline(data=recordings_marker, 
             aes(xintercept=release_date),
             linetype="dashed")+
  geom_text(data=recordings_marker,
            aes(x=release_date,
                label=song,
                y=50, 
                col=disc_title),
            position = position_jitter(height = 40, width = 3, seed = 9),
            size = 5
            )+
  labs(title = "Timelines of Streams", 
       col="Release Title")+
  scale_x_date(date_labels = "%F", 
               limits = range)
```
For future use, I think it's better if I gather the `streams_post_*` data frames into a list.

```{r}
streams_list <- list(streams_post_manusial, 
                     streams_post_antinatalinc,
                     streams_post_birumemar,
                     streams_post_nyamukanjing,
                     streams_post_halulintas,
                     streams_post_waioli,
                     streams_post_tcnyawaku,
                     streams_post_fnf
                     )
```

Now let's count how many days still had some streams before the streams stopped altogether.

```{r}
streams_post_manusial_dummy <- streams_post_manusial %>% 
  mutate(streams = na_if(streams, 0)) %>% # replace 0 with NA
  fill(streams, .direction = "up") 
```

```{r}
day_count_stream_post_manusial_dummy <- streams_post_manusial_dummy  %>% 
  count(is.na(streams))
day_count_stream_post_manusial_dummy
```

How to interpret the table above?
After the release of the Manusial EP, it took 44 days until the streams stopped completely, and that silence lasted for 19 days before the next release.
If we want a graph that illustrates this with the actual number of streams, I can create markers for the original stream counts, then draw a line plot.

```{r}
markers_post_manusial_dummy_plot <-streams_post_manusial_dummy %>% 
  mutate(markers = case_when(is.na(streams) ~ "No One Listened",
                           TRUE ~ "Some Streams Happened"))
```

Bind the markers column to the original stream chopped data.

```{r}
streams_post_manusial_complete <- streams_post_manusial %>% 
  cbind(markers = markers_post_manusial_dummy_plot$markers)
```

Visualize the streams timeline, with markers.

```{r}
ggplot(streams_post_manusial_complete, aes(x=date,y=streams, col=markers))+
  geom_line()+
  labs(title = "Streams Count after Manusial EP release",
        subtitle = "and before Antinatal in C release")
```

Now let's do this for all the other release timelines.
I create 2 functions below to help me count days and prepare data for the plots.

```{r}
# for count

func_dayCountStreams <- function(streams_post_df){

  # 1 create dummy with NA, then fill upward
  streams_post_df_dummy <- streams_post_df %>% 
    mutate(streams = na_if(streams, 0)) %>% # replace 0 with NA
    fill(streams, .direction = "up") 

  # 2 create table: how many not NA, how many NA
  day_count_stream_post_df_dummy <- streams_post_df_dummy %>% 
    count(is.na(streams))

  day_count_stream_post_df_dummy
}

```
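
The counting trick inside `func_dayCountStreams` is easier to see on a tiny vector. The sketch below uses made-up stream counts (`toy_streams` is hypothetical): zeros become NA, an upward fill patches any gap that is followed by a later stream, and only the trailing run of silent days stays NA.

```{r}
library(dplyr)
library(tidyr)

# Hypothetical daily streams after a release: a one-day gap, then silence
toy_streams <- tibble(streams = c(4, 0, 2, 1, 0, 0, 0))

toy_counts <- toy_streams %>%
  mutate(streams = na_if(streams, 0)) %>%   # zeros become NA
  fill(streams, .direction = "up") %>%      # gaps followed by a later stream get filled
  count(is_silent = is.na(streams))         # only the trailing zeros stay NA

toy_counts
```

Here the single zero on day 2 is filled from day 3, so the count is 4 days with streams still happening and 3 silent days at the end.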

```{r}
# for plot

func_streamsCompletePlot <- function(streams_post_df){

  # 1 create dummy with NA, then fill upward
  streams_post_df_dummy <- streams_post_df %>% 
    mutate(streams = na_if(streams, 0)) %>% # replace 0 with NA
    fill(streams, .direction = "up") 

  # 2 create markers: days still NA after the fill are the trailing silent days
  markers_post_dummy_plot <- streams_post_df_dummy %>% 
    mutate(markers = case_when(is.na(streams) ~ "No One Listened",
                               TRUE ~ "Some Streams Happened"))

  # 3 bind markers to the original (unfilled) stream data
  streams_post_complete <- streams_post_df %>% 
    cbind(markers = markers_post_dummy_plot$markers)

  # 4 create title
  title <- streams_post_df %>% select(disc_title) %>% slice_head() %>% pull()
  plot_title <- paste("Streams Count of", title)

  # 5 plot to line plot
  ggplot(streams_post_complete, aes(x=date,y=streams))+
    geom_line(aes(col=markers))+
    labs(title = plot_title,
         subtitle = "until the date before subsequent release was available or the end of 2021")+
    scale_x_date(date_labels = "%F")+
    # breaks (not labels) so each legend entry stays matched to its own colour
    scale_color_manual(name = "Stream Status", 
                       breaks = c("Some Streams Happened","No One Listened"),
                       values = c("Some Streams Happened"="blue",
                                  "No One Listened"="red"))+
    theme(plot.subtitle = element_text(face = "italic", size = 7),
          axis.title = element_blank())

}


```

Now let's apply these functions to streams_list:

```{r}
lapply(X = streams_list,FUN = func_dayCountStreams)
```

```{r}
lapply(X = streams_list,FUN = func_streamsCompletePlot)
```



Some notes about the results above:

1.  The release dates of Tuhan Cabut Nyawaku and Froth and Foam are too close together, and the release date of Froth and Foam is also too close to the end of 2021, the cutoff of the range. The resulting counts and plots are a bit of an outlier, so it's probably wise to exclude them from the conclusion.

2.  To answer the task, how many days since the date released that song streams plummeted? It could be as short as 40 days, or as long as 242 days.

Let's proceed to the next task.

### **1.c. Task3: "Does releasing music in EP or Single have differences in streams?"** <a id="t3"></a>

Take a look.

```{r}
print(recordings_edit)
```

Split the data into EPs and Singles.

```{r}
rec_ep <- recordings_edit %>% filter(disc_form == "EP")
rec_single <- recordings_edit %>% filter(disc_form == "Single")
```

```{r}
ggplot(recordings_edit, aes(x=reorder(song, streams), y=streams))+
  geom_col(aes(y= streams, fill="Streams"))+
  geom_col(aes(y=listeners, fill="Listeners"))+
    theme(axis.title.x = element_blank(),
          axis.title.y = element_blank(),
        legend.position = "left")+
   coord_flip()+
  scale_fill_manual(name="Song Stats",
                     breaks=c("Streams","Listeners"),
                     values=c("Streams"="green","Listeners"="blue"))+
  labs(title = "Recordings Metrics")

ggplot(rec_ep, aes(x=reorder(song, streams), y=streams))+
  geom_col(aes(y= streams, fill="Streams"))+
  geom_col(aes(y=listeners, fill="Listeners"))+
    theme(axis.title.x = element_blank(),
          axis.title.y = element_blank(),
        legend.position = "left")+
   coord_flip()+
  scale_fill_manual(name="Song Stats",
                     breaks=c("Streams","Listeners"),
                     values=c("Streams"="green","Listeners"="blue"))+
  labs(title = "EP Metrics")

ggplot(rec_single, aes(x=reorder(song, streams), y=streams))+
  geom_col(aes(y= streams, fill="Streams"))+
  geom_col(aes(y=listeners, fill="Listeners"))+
    theme(axis.title.x = element_blank(),
          axis.title.y = element_blank(),
        legend.position = "left")+
   coord_flip()+
  scale_fill_manual(name="Song Stats",
                     breaks=c("Streams","Listeners"),
                     values=c("Streams"="green","Listeners"="blue"))+
  labs(title = "Single Metrics")

```

To see whether or not there is a difference between the two groups, we need to perform a statistical test.
First I need to check the normality of the data, to decide whether to use a parametric or a nonparametric test.

```{r}
recordings_edit %>% 
  group_by(disc_form) %>% 
  normality(streams)
```

```{r}
recordings_edit %>% 
  group_by(disc_form) %>% 
  plot_normality(streams)
```

```{r}
recordings_edit[,c(6,3)]  
ggqqplot(recordings_edit[,c(6,3)], x = "streams")
```

As shown by the table and plots above, the streams in both groups are fairly close to normally distributed.
We can proceed with a parametric test.
But which one?
To compare two groups, I can use either Student's t-test or Welch's t-test.
If the groups have equal variance, we'll proceed with Student's t-test; otherwise we go with Welch's t-test.

If the p-value from Levene's test (`leveneTest()`) is lower than 0.05, it means the variances of EP and Single are not equal.

```{r}
leveneTest(streams ~ disc_form, recordings_edit[,c(6,3)] )
```

Null Hypothesis = There is no difference between the streams count of releases in form of Single or EP. 

Alternative Hypothesis = There is a difference between the streams count of releases in form of Single or EP.

```{r}
# load the package 
library(ggstatsplot)
```

```{r}
ggbetweenstats(
  data = recordings_edit,
  x = disc_form,
  y = streams,
  type = "parametric",
  var.equal = FALSE
)
```

The Hedges' g of 0.81 implies a large effect size, and the posterior difference is 22.8 streams.
Despite that, the p-value of 0.10 shows a lack of evidence to reject the null hypothesis.
Also, the Bayes factor of 0.07 shows that the evidence against the null hypothesis is not worth more than a bare mention.

I conclude that the data show no clear difference in stream counts between releasing my music as an EP or as a Single.
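
For a plain-number cross-check of a comparison like this, Welch's test is also available in base R through `t.test()` with `var.equal = FALSE`. The stream counts below are invented purely for illustration (`toy_rec` is hypothetical), so the output says nothing about my actual songs.

```{r}
# Hypothetical stream counts for 10 EP tracks and 6 singles
toy_rec <- data.frame(
  disc_form = factor(rep(c("EP", "Single"), times = c(10, 6))),
  streams = c(12, 30, 25, 41, 18, 22, 35, 27, 15, 33,
              40, 55, 21, 60, 38, 47)
)

# var.equal = FALSE requests Welch's t-test rather than Student's
toy_welch <- t.test(streams ~ disc_form, data = toy_rec, var.equal = FALSE)

toy_welch$method
toy_welch$p.value
```

Swapping in `recordings_edit` for `toy_rec` should give a p-value to compare against the one printed on the `ggbetweenstats()` plot.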

On to the next task.

### **1.d. Task4: "What metric has more effect to increase in followers?"** <a id="t4"></a>

I intend to investigate these relationships using regression.
But, as mentioned in the data preparation phase, followers is a cumulative metric, so to compare like with like I can use the cumulative versions of the other metrics.
If I want to use the non-cumulative metrics instead, I can use the followers_diff metric.
Maybe I'll just do both.

```{r}
tl_cum <- timelines_edit %>% select(followers, cum_listeners, cum_streams, count_days)
tl_noncum <- timelines_edit %>% select(followers_diff, listeners, streams, day_of_year, day_of_week)
```

First, let's check the assumption that the relationship between the Response Variable and the Explanatory Variables is linear.
```{r}
plot(tl_cum)
plot(tl_noncum)
```

They're pretty linear, all right, in the cumulative data, but not so much in the noncumulative data.

Next, we observe whether or not they're normally distributed.
```{r}
plot_normality(tl_cum)
plot_normality(tl_noncum)
```
Looking at the plot, I would say that they're not normally distributed. 

I should stop. 

Not just because the data fail to meet the assumptions, but more so because, by the time I write this, I've received counsel from a mentoring group: when dealing with data such as this, it's better to approach it with Time Series, not with Linear Regression.

So here I stop.
But! 
Let me take a tiny peek at the model! 
```{r}
tlcum_model <- lm(formula = followers ~ cum_listeners * cum_streams + count_days + 0, 
                  data = tl_cum)
tlnoncum_model <- lm(formula = followers_diff ~ listeners * streams + day_of_week + 0, 
                  data = tl_noncum)
```
```{r}

summary(tlcum_model)
summary(tlnoncum_model)
```
Cool. 
The cumulative model's estimates say that the count of days since my music was first released on Spotify has had more impact than the cumulative streams and/or listeners I've scraped together along the way.

Meanwhile, the noncumulative model's estimates say that listeners and Thursday are the only things that matter and have a positive impact. And streams has a negative estimate! Bravo! The data doesn't lie. Perhaps that's because most streams come from my own listening, so they don't really have any impact on followers, while a larger number of listeners actually broadens my reach and increases the chance of gaining new followers.
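
When reading those estimates, it helps to recall how R expands the model formula. The sketch below only inspects the formula of the noncumulative model and fits nothing. `listeners * streams` is shorthand for both main effects plus their interaction, and `+ 0` drops the intercept.

```{r}
# Same shape as the noncumulative model formula above
f  <- followers_diff ~ listeners * streams + day_of_week + 0
tt <- terms(f)

attr(tt, "term.labels")  # main effects plus the listeners:streams interaction
attr(tt, "intercept")    # 0 means the intercept was dropped by `+ 0`
```

Because of the interaction term, the `streams` estimate describes the effect of streams when listeners is zero, so its sign alone should be read with care. And, assuming `day_of_week` is a factor, dropping the intercept gives every weekday its own coefficient instead of comparing each day against a baseline day.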

Well, since this linear regression thing isn't the right approach, I'm gonna take it with a grain--NO! A spoonful of salt and wash it down with alc--Coffee! 

Even though I was only going to see the estimates, now I'm tempted to see a little prediction.
Here we go, blaming the temptation.
>"Darn, temptation!"

```{r}
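# Observed explanatory values from the original data
explanatory_data1_cum <- timelines_edit %>% select(cum_listeners, cum_streams, count_days)
# Hypothetical future values: listeners and streams grow linearly until they double
# (242 -> 484 and 1031 -> 2062) while count_days continues from 723
explanatory_data2_cum <- data.frame(cum_listeners = round(seq(242,484,242/731)),
         cum_streams = round(seq(1031,2062 ,1031/731)),
         count_days = 723:(723+731))
explanatory_data_cum <- rbind(explanatory_data1_cum, explanatory_data2_cum)

# Same idea for the noncumulative data: a copy with listeners and streams doubled
explanatory_data1_noncum <- timelines_edit %>% select(listeners, streams, day_of_week)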
explanatory_data2_noncum <- explanatory_data1_noncum %>% mutate(
  listeners = listeners*2,
  streams = streams*2
)
explanatory_data_noncum <- rbind(explanatory_data1_noncum, explanatory_data2_noncum)
```
```{r}
prediction_data_cum <- explanatory_data_cum %>% 
  mutate(followers = predict(tlcum_model, explanatory_data_cum))

prediction_data_noncum <- explanatory_data_noncum %>% 
  mutate(followers_diff = predict(tlnoncum_model, explanatory_data_noncum))
```
```{r}
tail(prediction_data_cum, 3)
sum(prediction_data_noncum$followers_diff)
```
```{r}
plot(prediction_data_noncum$followers_diff)
```

```{r}
cum_plot <- ggplot(tl_cum, aes(x= count_days, y=followers, col=cum_streams, size=cum_listeners))+
  geom_point(shape=1)+
  scale_color_viridis_b(option="D")+
  geom_jitter(data=prediction_data_cum, shape = 2, width = 6, height = 2, alpha=0.3)+
  labs(col="Cumulative Streams",
       size="Cumulative Listeners",
       x = "Count of Days Since First Release Date",
       y = "Number of Followers")

cum_plot

noncum_plot <- ggplot(tl_noncum, aes(x= listeners, y=followers_diff, col=streams))+
  geom_point(shape=1)+
  scale_color_viridis_c(option="D")+
  geom_jitter(data=prediction_data_noncum, height = 0.05, width = 3, alpha=0.3, shape = 2)+
  labs(col="Streams",
       x = "Number of Listeners",
       y = "Increase/Decrease in Followers")+
  facet_wrap(vars(day_of_week), scales = "free")


noncum_plot
```
Perhaps that was not the right approach, but did I have fun practicing data science along the way? I did.

Let's look at the plots. The cumulative one is quite sane and linear, but the noncumulative one is a bit funky--what a wild Wednesday in that plot! Still, both results basically produced twice the Response Variable of the original data set, since I merely doubled the numeric Explanatory Variables in both of them.

Welp, I don't yet know how to do Time Series, and let's say the regression above isn't the right approach. Maybe we can at least look at the good old correlation below. We know the popular saying, "correlation does not imply causation", but we also need to acknowledge that "there is no correlation without causation somewhere"--there's probably a common cause somewhere upstream.


```{r}
plot_correlation(tl_cum)
plot_correlation(tl_noncum)
```
Followers correlates most with cumulative streams and count of days, equally at 0.96, while followers difference correlates most with listeners. That's not so different from the estimated coefficients I saw when I called summary on the regression models before.
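
A word of caution on that 0.96: running totals tend to correlate strongly with each other, and with the passage of time, simply because they all keep growing. The sketch below illustrates this with two simulated daily series that are independent by construction. The data and names are assumptions for illustration only.

```{r}
set.seed(7)
n_days <- 700
# Two unrelated simulated daily series (independent by construction)
daily_a <- rpois(n_days, lambda = 1.5)
daily_b <- rpois(n_days, lambda = 0.3)

cor_daily <- cor(daily_a, daily_b)                  # close to 0
cor_cum   <- cor(cumsum(daily_a), cumsum(daily_b))  # close to 1
cor_time  <- cor(cumsum(daily_b), seq_len(n_days))  # close to 1

c(daily = cor_daily, cumulative = cor_cum, with_time = cor_time)
```

Here time itself plays the role of the common cause upstream, which is one reason the much weaker noncumulative correlations are probably the more honest ones to read.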


## **2. Results** <a id="res"></a>

### **2.a. "On what day of week I have the most streams, or listeners, or increase in followers?"** <a id="2a"></a>
The data shows that I have the most streams on Friday, the most listeners on Tuesday, and the biggest increase in followers on Thursday.

Metrics               | Day of Week
----------------------| -------------
Streams               | Friday
Listeners             | Tuesday
Increase in Followers | Thursday


### **2.b. "How many days since the date released that song streams plummeted?"** <a id="2b"></a>
It could be as short as 40 days, or could be as long as 242 days. 

Count of Days Streaming Still Happened | Value
---------------------------------------|-------
Minimum                                | 40
Maximum                                | 242



### **2.c. "Does releasing music in EP or Single have differences in streams?"** <a id="2c"></a>
There is not enough evidence to reject the null hypothesis that there's no difference.

Statistic                   | Value
----------------------------|-------
P-value                     | 0.10
Natural log of Bayes factor | 0.07



### **2.d. "What metric has more effect to increase in followers?"** <a id="2d"></a>
Total followers and Followers Difference correlate most with:  

Cumulative Metrics         | Noncumulative Metrics
---------------------------| --------------------------------
Count of Days 0.96         | Number of Listeners per day 0.18
Cumulative Streams 0.96    | -


  
# **C. CONCLUSION** <a id="concl"></a>
How would I conclude all this? Well, [according to Cassie Kozyrkov](https://towardsdatascience.com/whats-the-difference-between-analytics-and-statistics-cd35d457e17), who works as Chief Decision Scientist at Google, data analytics serves to find good questions and probably should conclude nothing outside the data itself, while statistics may serve to find appropriate answers by inferring beyond the data.

The results for tasks a and b might lead me to ask further questions such as:

"Why do streams increase on Friday? Should I try to A/B test releasing on Friday versus not on Friday? Or are there any previous studies that already investigated the importance of Friday for music streaming?"

"Are these streams mostly of the most recently released song? Or might any release date offer a starting point for streaming other songs? So is it better to release frequently, at shorter intervals? Which songs are short-lived, and what might be the cause? Which songs generate many days of streams, and is it because other people actually stream them, other than me?"

Meanwhile, the result for task c led me to conclude that it doesn't matter whether I release my music as an EP or a Single--when it comes to stream counts. So when I want to release music in the future, I can weigh other considerations rather than brooding over EP or Single. Maybe practicalities such as the cost of releasing an EP or a Single should matter to me more.

Lastly, for the result of task d, let's just say that it can be a starting point for better statistical investigations when I'm more capable of it later. 

