library(ggplot2)
library(dplyr)
Warning: package ‘dplyr’ was built under R version 4.4.3
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.4     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.1
✔ readr     2.1.5     
── Conflicts ────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tsibble)
Registered S3 method overwritten by 'tsibble':
  method               from 
  as_tibble.grouped_df dplyr

Attaching package: ‘tsibble’

The following object is masked from ‘package:lubridate’:

    interval

The following objects are masked from ‘package:base’:

    intersect, setdiff, union
library(lubridate)
library(ggrepel)
library(xts)
Loading required package: zoo

Attaching package: ‘zoo’

The following object is masked from ‘package:tsibble’:

    index

The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric


######################### Warning from 'xts' package ##########################
#                                                                             #
# The dplyr lag() function breaks how base R's lag() function is supposed to  #
# work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
# source() into this session won't work correctly.                            #
#                                                                             #
# Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
# conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
# dplyr from breaking base R's lag() function.                                #
#                                                                             #
# Code in packages is not affected. It's protected by R's namespace mechanism #
# Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
#                                                                             #
###############################################################################

Attaching package: ‘xts’

The following objects are masked from ‘package:dplyr’:

    first, last
# Install the pageviews package only if it is not already available
if (!requireNamespace("pageviews", quietly = TRUE)) install.packages("pageviews")
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:

https://cran.rstudio.com/bin/windows/Rtools/
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.4/pageviews_0.6.0.zip'
Content type 'application/zip' length 37973 bytes (37 KB)
downloaded 37 KB
package ‘pageviews’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\dsjja\AppData\Local\Temp\Rtmps9iDdF\downloaded_packages
# Load the library
library(pageviews)
Warning: package ‘pageviews’ was built under R version 4.4.3
#Reading the data set
data <- read.csv("dataset.csv")
conflicted::conflicts_prefer(dplyr::filter)
[conflicted] Will prefer dplyr::filter over any other package.
# Keep explicit tracks and draw a reproducible sample of up to 9,000 rows
set.seed(123)  # arbitrary seed so the sample can be reproduced
sample_data <- data |> filter(explicit == "True") |> slice_sample(n = 9000)
data <- sample_data
data
names(data)
 [1] "X"                "track_id"         "artists"          "album_name"       "track_name"       "popularity"      
 [7] "duration_ms"      "explicit"         "danceability"     "energy"           "key"              "loudness"        
[13] "mode"             "speechiness"      "acousticness"     "instrumentalness" "liveness"         "valence"         
[19] "tempo"            "time_signature"   "track_genre"     

Choose a column of data to analyze over time. This should be a “response-like” variable that is of particular interest.

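I chose popularity. Because the dataset has no date column, daily Wikipedia page views for the "Spotify" article serve as a proxy for it.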

Create a tsibble object of just the date and response variable. Then, plot your data over time. Consider different windows of time.

library(tsibble)

# Ensure the date column is in Date format
spotify_views$date <- as.Date(spotify_views$date)

# Create a tsibble object with date as the index and views as the response variable
spotify_ts <- spotify_views |>
  select(date, views) |>
  rename(popularity = views) |>
  as_tsibble(index = date)

# View the time series object
head(spotify_ts)
# Plot popularity over time
ggplot(spotify_ts, aes(x = date, y = popularity)) +
  geom_line(color = "steelblue", linewidth = 1) +
  labs(
    title = "Spotify Track Popularity Over Time (Based on Wikipedia Views)",
    x = "Date",
    y = "Popularity (Views)"
  ) + theme_minimal()

# Aggregate by week
weekly_ts <- spotify_ts |>
  index_by(week = ~ lubridate::floor_date(., "week")) |>
  summarise(mean_popularity = mean(popularity, na.rm = TRUE))

# Plot
ggplot(weekly_ts, aes(x = week, y = mean_popularity)) +
  geom_line(color = "darkgreen") +
  labs(title = "Weekly Average Popularity", x = "Week", y = "Avg Views") +
  theme_minimal()

# By month
monthly_ts <- spotify_ts |>
  index_by(month = ~ lubridate::floor_date(., "month")) |>
  summarise(avg_popularity = mean(popularity, na.rm = TRUE))

ggplot(monthly_ts, aes(x = month, y = avg_popularity)) +
  geom_line(color = "orange") +
  labs(title = "Monthly Trends in Popularity", x = "Month", y = "Average Views") +
  theme_minimal()

What stands out immediately?

Consistent decline: The downward trend appears at every aggregation level (daily, weekly, monthly) and is clearest in the monthly series, where averaging removes most of the day-to-day noise. This consistency suggests the decline is real rather than an artifact of noise, although it is not a formal significance test.

Practical Implications: Wikipedia page views for the Spotify article, used here as a proxy for popularity, dropped sharply in early 2023 and then levelled off. Possible explanations include a major change in user behavior, platform algorithm changes, competition from other music services, or an early-year spike in interest that did not repeat.

Use linear regression to detect any upwards or downwards trends.

# Fit the linear regression model
model <- lm(popularity ~ date, data = spotify_ts)
summary(model)

Call:
lm(formula = popularity ~ date, data = spotify_ts)

Residuals:
    Min      1Q  Median      3Q     Max 
-2778.0  -842.5    -9.7   553.9  6952.8 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.410e+05  1.249e+04   19.29   <2e-16 ***
date        -1.200e+01  6.393e-01  -18.77   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1287 on 363 degrees of freedom
Multiple R-squared:  0.4927,    Adjusted R-squared:  0.4913 
F-statistic: 352.5 on 1 and 363 DF,  p-value: < 2.2e-16
ggplot(spotify_ts, aes(x = date, y = popularity)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE, color = "blue") +
  labs(title = "Trend of Popularity over Date", 
       x = "Date", y = "Popularity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The “Trend of Popularity over Date” plot covers January to December 2023.

The fitted regression line shows a clear overall downward trend, falling from approximately 8,500 views at the start of January 2023 to around 4,500 by the end of December 2023, a decrease of nearly 50% over the year. This agrees with the model summary: the estimated slope is about -12 views per day (p < 2e-16), and the model explains roughly 49% of the variance (R-squared = 0.4927).

Use smoothing to detect at least one season in your data, and interpret your results.

spotify_ts |> ggplot(mapping = aes(x = date, y = popularity)) + 
  geom_point(size = 1, alpha = 0.4) + 
  geom_smooth(span = 0.2, color = 'blue', se = FALSE)+theme_classic()

Interpretation of the Smoothing Result

The plot uses LOESS smoothing (with span = 0.2) to visualize the trend in the “popularity” variable over time (January through December 2023).

Key Observations

  1. Seasonal Pattern Detected:
    • There is a clear pattern over the year, consistent with a seasonal effect.
    • Early 2023: Popularity starts high, peaking around January.
    • Spring 2023: There is a sharp decline in popularity from January to around April.
    • Mid-2023: Popularity remains relatively low and stable, with minor fluctuations.
    • Late 2023: There are small increases and decreases, but no major spikes until a slight uptick at the end of the year.
  2. Possible Seasonality:
    • The initial peak and subsequent drop suggest a seasonal effect—possibly related to an event or release that caused a spike in popularity at the start of the year.
    • The smaller oscillations throughout the rest of the year may indicate minor seasonal or periodic effects, but they are less pronounced than the initial drop.
  3. Noise and Outliers:
    • The scatterplot shows some outliers (especially high values) that the smoother does not follow closely, which is expected since the smoother is designed to capture the general trend, not every fluctuation.

Can you illustrate the seasonality using ACF or PACF?

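# Pass the numeric column only; the whole tsibble would also include the date index
acf(spotify_ts$popularity, ci = 0.95, na.action = na.pass, main = "ACF for Spotify pageviews")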

This ACF plot for spotify_ts suggests the series is not strongly seasonal but may have a trend or be non-stationary.

# Pass the numeric column only; the whole tsibble would also include the date index
pacf(spotify_ts$popularity, na.action = na.pass, xlab = 'lag', main = "PACF for Spotify pageviews")

This PACF plot for Spotify pageviews shows a strong correlation at lag 1 and little to no significant correlation at higher lags. This suggests that an AR(1) model could be a good starting point for modeling this time series.

To detect seasonality in my data, I used LOESS smoothing with a span of 0.2, as shown in the plot. Here’s what I observed:

Clear Seasonality: At the beginning of 2023, there is a noticeable peak in popularity, with values above 10,000. This suggests a strong seasonal effect or a specific event that drove popularity up during this period.

Sharp Decline and Stabilization: After this initial peak, popularity drops sharply through the first quarter of the year, reaching a low around April 2023. From that point onward, the popularity remains relatively stable, fluctuating between 5,000 and 7,000, with only minor ups and downs.

Minor Fluctuations: Throughout the rest of the year, I noticed some smaller oscillations, but none are as dramatic as the initial drop. This indicates that while there may be some minor seasonal effects, the main seasonality is concentrated at the start of the year.

Outliers: There are a few outlier points, especially high values, that the smoother doesn’t follow closely. This is expected, as the smoothing method is designed to capture the overall trend rather than every individual fluctuation.

Conclusion:

From this analysis, I can conclude that there is at least one strong season in my data, with a major peak in popularity at the start of the year, followed by a sharp decline and a stable period. This suggests that timing plays a significant role in popularity, and it may be beneficial to align major releases or promotions with the period where the peak occurs.

If I want to understand the causes behind these trends more deeply, I could further investigate what happened during the peak period or use more advanced time series analysis techniques.

---
title: "Week - 12"
output: html_notebook
---

```{r}
library(ggplot2)
library(dplyr)
library(tidyverse)
library(lubridate)
library(ggrepel)
library(xts)
```


```{r}
# Install the pageviews package only if it is not already available
if (!requireNamespace("pageviews", quietly = TRUE)) install.packages("pageviews")

# Load the library
library(pageviews)
```

```{r}
#Reading the data set
data <- read.csv("dataset.csv")
conflicted::conflicts_prefer(dplyr::filter)
# Keep explicit tracks and draw a reproducible sample of up to 9,000 rows
set.seed(123)  # arbitrary seed so the sample can be reproduced
sample_data <- data |> filter(explicit == "True") |> slice_sample(n = 9000)
data <- sample_data
data
```
```{r}
names(data)
```
### Select a column of your data that encodes time (e.g., "date", "timestamp", "year", etc.). Convert this into a Date in R. If you do not have a time-based column of data: find a Wikipedia page that is related to your dataset. Then, extract a time series of page views for that page using the Wikipedia page views website or the R package used in this week's lab.

```{r}
# Extract daily page views for "Spotify" Wikipedia page
spotify_views <- article_pageviews(
  project = "en.wikipedia",
  article = "Spotify",
  start = as.Date("2023-01-01"),
  end = as.Date("2023-12-31"),
  user_type = "user",     # Only user traffic 
  platform = "all"        # All platforms: desktop, mobile-web, etc.
)

# View the first few rows
head(spotify_views)
```


```{r}
# Plot the time series of page views
ggplot(spotify_views, aes(x = date, y = views)) +
  geom_line(color = "steelblue", linewidth = 1) +
  labs(title = "Daily Wikipedia Page Views for 'Spotify' in 2023",
       x = "Date",
       y = "Views") +
  theme_minimal()
```


```{r}
nrow(data)
nrow(spotify_views)
```

```{r}
# View the first few rows to understand the structure of the data
head(spotify_views)
```
### Choose a column of data to analyze over time. This should be a "response-like" variable that is of particular interest.
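I chose **popularity**. Because the dataset has no date column, daily Wikipedia page views for the "Spotify" article serve as a proxy for it.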

### Create a tsibble object of just the date and response variable. Then, plot your data over time. Consider different windows of time.
```{r}
library(tsibble)

# Ensure the date column is in Date format
spotify_views$date <- as.Date(spotify_views$date)

# Create a tsibble object with date as the index and views as the response variable
spotify_ts <- spotify_views |>
  select(date, views) |>
  rename(popularity = views) |>
  as_tsibble(index = date)

# View the time series object
head(spotify_ts)

```



```{r}
# Plot popularity over time
ggplot(spotify_ts, aes(x = date, y = popularity)) +
  geom_line(color = "steelblue", linewidth = 1) +
  labs(
    title = "Spotify Track Popularity Over Time (Based on Wikipedia Views)",
    x = "Date",
    y = "Popularity (Views)"
  ) + theme_minimal()
```


```{r}
# Aggregate by week
weekly_ts <- spotify_ts |>
  index_by(week = ~ lubridate::floor_date(., "week")) |>
  summarise(mean_popularity = mean(popularity, na.rm = TRUE))

# Plot
ggplot(weekly_ts, aes(x = week, y = mean_popularity)) +
  geom_line(color = "darkgreen") +
  labs(title = "Weekly Average Popularity", x = "Week", y = "Avg Views") +
  theme_minimal()

```
```{r}
# By month
monthly_ts <- spotify_ts |>
  index_by(month = ~ lubridate::floor_date(., "month")) |>
  summarise(avg_popularity = mean(popularity, na.rm = TRUE))

ggplot(monthly_ts, aes(x = month, y = avg_popularity)) +
  geom_line(color = "orange") +
  labs(title = "Monthly Trends in Popularity", x = "Month", y = "Average Views") +
  theme_minimal()

```

# What stands out immediately?
Consistent decline: The downward trend appears at every aggregation level (daily, weekly, monthly) and is clearest in the monthly series, where averaging removes most of the day-to-day noise. This consistency suggests the decline is real rather than an artifact of noise, although it is not a formal significance test.

Practical Implications: Wikipedia page views for the Spotify article, used here as a proxy for popularity, dropped sharply in early 2023 and then levelled off. Possible explanations include a major change in user behavior, platform algorithm changes, competition from other music services, or an early-year spike in interest that did not repeat.
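
Between the daily and weekly views, a rolling window is another way to look at the series at a different time scale. The block below is a minimal sketch on simulated data (the values and the 7-day window are assumptions, not the Spotify page views), using base R's `stats::filter()` for a centred moving average.

```r
# Simulated daily series (hypothetical values, not the Spotify page views)
set.seed(1)
dates <- seq(as.Date("2023-01-01"), as.Date("2023-12-31"), by = "day")
views <- 8000 - 10 * seq_along(dates) + rnorm(length(dates), sd = 800)

# Centred 7-day moving average; the first and last 3 days have no full window
roll7 <- stats::filter(views, rep(1 / 7, 7), sides = 2)

# Monthly means, for comparison with the rolling average
monthly_mean <- tapply(views, format(dates, "%Y-%m"), mean)

# Draw the plot only in an interactive session
if (interactive()) {
  plot(dates, views, type = "l", col = "grey70", xlab = "Date", ylab = "Views")
  lines(dates, as.numeric(roll7), col = "steelblue", lwd = 2)
}
```

The `stats::` prefix is deliberate: with dplyr attached, a bare `filter()` would call `dplyr::filter()` instead.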


# Use linear regression to detect any upwards or downwards trends.

```{r}
# Fit the linear regression model
model <- lm(popularity ~ date, data = spotify_ts)
summary(model)
```

```{r}
ggplot(spotify_ts, aes(x = date, y = popularity)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE, color = "blue") +
  labs(title = "Trend of Popularity over Date", 
       x = "Date", y = "Popularity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

```
The "Trend of Popularity over Date" plot covers January to December 2023.

The fitted regression line shows a clear overall downward trend, falling from approximately 8,500 views at the start of January 2023 to around 4,500 by the end of December 2023, a decrease of nearly 50% over the year. This agrees with the model summary: the estimated slope is about -12 views per day (p < 2e-16), and the model explains roughly 49% of the variance (R-squared = 0.4927).
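
The start and end values of a fitted line can be computed rather than read off the plot. The following sketch does this on simulated data; the data frame `sim` and its parameters are assumptions chosen to loosely resemble the series, not the actual page views.

```r
# Simulated daily series with a linear decline (hypothetical values)
set.seed(42)
dates <- seq(as.Date("2023-01-01"), as.Date("2023-12-31"), by = "day")
views <- 8500 - 12 * (seq_along(dates) - 1) + rnorm(length(dates), sd = 1200)
sim <- data.frame(date = dates, popularity = views)

fit <- lm(popularity ~ date, data = sim)

# Fitted values on the first and last day of the series
endpoints <- data.frame(date = range(sim$date))
fitted_ends <- predict(fit, newdata = endpoints)

# The slope is in views per day because a Date is stored as days since 1970-01-01
slope_per_day <- unname(coef(fit)["date"])

# Percentage change along the fitted line over the whole period
pct_change <- unname(100 * (fitted_ends[2] - fitted_ends[1]) / fitted_ends[1])
```

Applied to the real model, the same steps would use `model` and `spotify_ts` in place of `fit` and `sim`.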


# Do you need to subset the data for multiple trends?

Yes, subsetting the data would be beneficial for several reasons:

- There appear to be distinct clusters in the data, suggesting different patterns within the overall trend.
- The first quarter of 2023 shows significantly higher popularity values (in the 10,000-13,000 range) than other periods.
- Variability increases and some higher values reappear in late 2023.
- A single linear trend line (formula `y ~ x`) cannot capture these potential cyclical or seasonal patterns.
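
One way to check for multiple trends is to fit a separate regression to each subset and compare the slopes. This is a sketch on simulated data; the April 1 cutoff is an assumed break point chosen by eye, and the series shape is hypothetical.

```r
# Simulated series: steep decline until April, then roughly flat (hypothetical)
set.seed(7)
dates <- seq(as.Date("2023-01-01"), as.Date("2023-12-31"), by = "day")
cutoff <- as.Date("2023-04-01")   # assumed break point
level <- ifelse(dates < cutoff, 11000 - 50 * as.numeric(dates - dates[1]), 6000)
sim <- data.frame(date = dates,
                  popularity = level + rnorm(length(dates), sd = 500))

# Fit one linear trend per subset
fit_early <- lm(popularity ~ date, data = subset(sim, date <  cutoff))
fit_late  <- lm(popularity ~ date, data = subset(sim, date >= cutoff))

# Compare the daily slopes of the two periods
slopes <- c(early = unname(coef(fit_early)["date"]),
            late  = unname(coef(fit_late)["date"]))
```

If the two slopes differ sharply, as they do in this simulated case, a single line over the full year averages two different regimes.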

# How strong are these trends?

The overall downward trend appears moderately strong:

- The blue trend line has a consistent negative slope across the entire period.
- The confidence band (blue shaded area) around the trend line is relatively narrow, which is consistent with a statistically significant slope.
- However, the data points scatter substantially around the trend line, indicating high variability.

Many points fall far from the trend line, so while the downward trend is clear, it explains only part of the variation in popularity: the model's R-squared is about 0.49. Other factors beyond time are likely influencing popularity.
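
Trend strength can be summarised with two numbers: R-squared and a confidence interval for the slope. The sketch below uses simulated data with assumed parameters. Note that the interval from `confint()` assumes independent residuals, which daily page views may not satisfy.

```r
# Simulated daily series with a linear decline (hypothetical values)
set.seed(42)
day <- 0:364
popularity <- 8500 - 12 * day + rnorm(length(day), sd = 1300)
fit <- lm(popularity ~ day)

# Share of the variance in popularity explained by time alone
r_squared <- summary(fit)$r.squared

# 95% confidence interval for the daily slope
slope_ci <- confint(fit, "day", level = 0.95)
```

An interval that excludes zero indicates a detectable trend, while an R-squared well below 1 indicates that much of the variation remains unexplained.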


# Use smoothing to detect at least one season in your data, and interpret your results.

```{r}
spotify_ts |> ggplot(mapping = aes(x = date, y = popularity)) + 
  geom_point(size = 1, alpha = 0.4) + 
  geom_smooth(span = 0.2, color = 'blue', se = FALSE)+theme_classic()
```
## Interpretation of the Smoothing Result

The plot uses **LOESS smoothing** (with `span = 0.2`) to visualize the trend in the "popularity" variable over time (January through December 2023).

### Key Observations

1. **Seasonal Pattern Detected:**
   - There is a **clear pattern** over the year, consistent with a seasonal effect.
   - **Early 2023:** Popularity starts high, peaking around January.
   - **Spring 2023:** There is a **sharp decline** in popularity from January to around April.
   - **Mid-2023:** Popularity remains relatively low and stable, with minor fluctuations.
   - **Late 2023:** There are small increases and decreases, but no major spikes until a slight uptick at the end of the year.

2. **Possible Seasonality:**
   - The **initial peak** and subsequent drop suggest a **seasonal effect**—possibly related to an event or release that caused a spike in popularity at the start of the year.
   - The **smaller oscillations** throughout the rest of the year may indicate **minor seasonal or periodic effects**, but they are less pronounced than the initial drop.

3. **Noise and Outliers:**
   - The scatterplot shows some **outliers** (especially high values) that the smoother does not follow closely, which is expected since the smoother is designed to capture the general trend, not every fluctuation.
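
The choice of span affects how much detail the smoother keeps. As a rough illustration, the sketch below fits `stats::loess()` with two spans to a simulated series; the series shape and the comparison span of 0.75 are assumptions, not part of the analysis above.

```r
# Simulated series: early peak that decays, plus noise (hypothetical values)
set.seed(3)
day <- 1:365
y <- 6000 + 5000 * exp(-day / 40) + rnorm(length(day), sd = 600)

# span is the fraction of points used in each local fit
fit_narrow <- loess(y ~ day, span = 0.2)
fit_wide   <- loess(y ~ day, span = 0.75)

smooth_narrow <- predict(fit_narrow)
smooth_wide   <- predict(fit_wide)

# Mean absolute day-to-day change in each smoothed curve
roughness <- c(narrow = mean(abs(diff(smooth_narrow))),
               wide   = mean(abs(diff(smooth_wide))))

# Draw the plot only in an interactive session
if (interactive()) {
  plot(day, y, col = "grey60", pch = 16, cex = 0.5)
  lines(day, smooth_narrow, col = "blue", lwd = 2)
  lines(day, smooth_wide, col = "red", lwd = 2)
}
```

A small span follows short-lived bumps, while a larger span gives a flatter curve that is closer to the overall trend.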


# Can you illustrate the seasonality using ACF or PACF?

```{r}
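# Pass the numeric column only; the whole tsibble would also include the date index
acf(spotify_ts$popularity, ci = 0.95, na.action = na.pass, main = "ACF for Spotify pageviews")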
```
This ACF plot for spotify_ts suggests the series is not strongly seasonal but may have a trend or be non-stationary. 
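
A trend can mask seasonality in an ACF plot, so it can help to difference the series first. The sketch below shows the idea on simulated data with an assumed weekly cycle; it is an illustration, not a result for the Spotify series.

```r
# Simulated series: linear decline plus a weekend bump (hypothetical values)
set.seed(11)
n <- 364
weekly <- rep(c(0, 0, 0, 0, 0, 400, 600), length.out = n)
x <- 8000 - 10 * seq_len(n) + weekly + rnorm(n, sd = 150)

# With a trend present, the raw ACF stays high and decays slowly
acf_raw <- acf(x, lag.max = 28, plot = FALSE)

# First differences remove the linear trend and leave the weekly cycle
acf_diff <- acf(diff(x), lag.max = 28, plot = FALSE)

# Element 1 of $acf is lag 0, so lag k sits at index k + 1
lag1_raw  <- acf_raw$acf[2]
lag7_diff <- acf_diff$acf[8]
```

In the differenced series, spikes at lags 7, 14, 21, and 28 would point to a weekly season. One year of daily data can reveal cycles of this kind but cannot confirm an annual one.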

```{r}
# Pass the numeric column only; the whole tsibble would also include the date index
pacf(spotify_ts$popularity, na.action = na.pass, xlab = 'lag', main = "PACF for Spotify pageviews")
```
This PACF plot for Spotify pageviews shows a strong correlation at lag 1 and little to no significant correlation at higher lags. This suggests that an AR(1) model could be a good starting point for modeling this time series.
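
To show what fitting an AR(1) model looks like, the sketch below simulates a series with a known coefficient and recovers it with base R's `arima()`. The coefficient 0.7 and the series itself are assumptions for illustration only.

```r
# Simulate a stationary AR(1) series with a known coefficient (hypothetical)
set.seed(5)
x <- arima.sim(model = list(ar = 0.7), n = 365)

# PACF values start at lag 1; an AR(1) process has a large first value
pacf_vals <- pacf(x, lag.max = 10, plot = FALSE)$acf

# Fit AR(1) (order = c(p, d, q)) and read back the estimated coefficient
fit_ar1 <- arima(x, order = c(1, 0, 0))
phi_hat <- unname(coef(fit_ar1)["ar1"])
```

For the real series, the trend would likely need to be removed first (for example by differencing), since `arima()` with `d = 0` assumes a stationary series.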




To detect seasonality in my data, I used LOESS smoothing with a span of 0.2, as shown in the plot. Here’s what I observed:

Clear Seasonality:
At the beginning of 2023, there is a noticeable peak in popularity, with values above 10,000. This suggests a strong seasonal effect or a specific event that drove popularity up during this period.

Sharp Decline and Stabilization:
After this initial peak, popularity drops sharply through the first quarter of the year, reaching a low around April 2023. From that point onward, the popularity remains relatively stable, fluctuating between 5,000 and 7,000, with only minor ups and downs.

Minor Fluctuations:
Throughout the rest of the year, I noticed some smaller oscillations, but none are as dramatic as the initial drop. This indicates that while there may be some minor seasonal effects, the main seasonality is concentrated at the start of the year.

Outliers:
There are a few outlier points, especially high values, that the smoother doesn’t follow closely. This is expected, as the smoothing method is designed to capture the overall trend rather than every individual fluctuation.

### Conclusion:

From this analysis, I can conclude that there is at least one strong season in my data, with a major peak in popularity at the start of the year, followed by a sharp decline and a stable period. This suggests that timing plays a significant role in popularity, and it may be beneficial to align major releases or promotions with the period where the peak occurs.

If I want to understand the causes behind these trends more deeply, I could further investigate what happened during the peak period or use more advanced time series analysis techniques.
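
One such technique is STL decomposition, which splits a series into seasonal, trend, and remainder components. This is a minimal sketch on simulated data with an assumed weekly cycle, using base R's `stl()`; it is not a result for the Spotify series.

```r
# Simulated series: linear decline plus a weekend bump (hypothetical values)
set.seed(21)
n <- 364
weekly <- rep(c(0, 0, 0, 0, 0, 400, 600), length.out = n)
x <- 8000 - 10 * seq_len(n) + weekly + rnorm(n, sd = 150)

# frequency = 7 means one seasonal cycle spans seven daily observations
x_ts <- ts(x, frequency = 7)

# Decompose; s.window = "periodic" forces the same weekly shape every week
dec <- stl(x_ts, s.window = "periodic")
components <- dec$time.series   # columns: seasonal, trend, remainder

# Size of the estimated weekly swing
seasonal_range <- diff(range(components[, "seasonal"]))

# Draw the plot only in an interactive session
if (interactive()) plot(dec)
```

A seasonal component whose range is large relative to the remainder would support a genuine weekly season.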
















