library(ggplot2)
library(dplyr)
Warning: package ‘dplyr’ was built under R version 4.4.3
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ lubridate 1.9.4 ✔ tibble 3.2.1
✔ purrr 1.0.2 ✔ tidyr 1.3.1
✔ readr 2.1.5
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tsibble)
Registered S3 method overwritten by 'tsibble':
method from
as_tibble.grouped_df dplyr
Attaching package: ‘tsibble’
The following object is masked from ‘package:lubridate’:
interval
The following objects are masked from ‘package:base’:
intersect, setdiff, union
library(lubridate)
library(ggrepel)
library(xts)
Loading required package: zoo
Attaching package: ‘zoo’
The following object is masked from ‘package:tsibble’:
index
The following objects are masked from ‘package:base’:
as.Date, as.Date.numeric
######################### Warning from 'xts' package ##########################
# #
# The dplyr lag() function breaks how base R's lag() function is supposed to #
# work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
# source() into this session won't work correctly. #
# #
# Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
# conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
# dplyr from breaking base R's lag() function. #
# #
# Code in packages is not affected. It's protected by R's namespace mechanism #
# Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
# #
###############################################################################
Attaching package: ‘xts’
The following objects are masked from ‘package:dplyr’:
first, last
# Install the pageviews package
install.packages("pageviews")
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:
https://cran.rstudio.com/bin/windows/Rtools/
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.4/pageviews_0.6.0.zip'
Content type 'application/zip' length 37973 bytes (37 KB)
downloaded 37 KB
package ‘pageviews’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\dsjja\AppData\Local\Temp\Rtmps9iDdF\downloaded_packages
# Load the library
library(pageviews)
Warning: package ‘pageviews’ was built under R version 4.4.3
#Reading the data set
data <- read.csv("dataset.csv")
conflicted::conflicts_prefer(dplyr::filter)
[conflicted] Will prefer dplyr::filter over any other package.
# Filtering dataset where explicit is "True" and taking a sample of 9,000 rows
sample_data <- data |> filter(explicit == "True") |> sample_n(9000)
data <- sample_data
data
names(data)
[1] "X" "track_id" "artists" "album_name" "track_name" "popularity"
[7] "duration_ms" "explicit" "danceability" "energy" "key" "loudness"
[13] "mode" "speechiness" "acousticness" "instrumentalness" "liveness" "valence"
[19] "tempo" "time_signature" "track_genre"
Select a column of your data that encodes time (e.g., “date”,
“timestamp”, “year”, etc.). Convert this into a Date in R. If you do not
have a time-based column of data: find a Wikipedia page that is related
to your dataset. Then, extract a time series of page views for that page
using the Wikipedia page views website or the
R package used in this week’s lab.
# Extract daily page views for "Spotify" Wikipedia page
spotify_views <- article_pageviews(
project = "en.wikipedia",
article = "Spotify",
start = as.Date("2023-01-01"),
end = as.Date("2023-12-31"),
user_type = "user", # Only user traffic
platform = "all" # All platforms: desktop, mobile-web, etc.
)
# View the first few rows
head(spotify_views)
# Plot the time series of page views
ggplot(spotify_views, aes(x = date, y = views)) +
geom_line(color = "steelblue", size = 1) +
labs(title = "Daily Wikipedia Page Views for 'Spotify' in 2023",
x = "Date",
y = "Views") +
theme_minimal()
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
Please use `linewidth` instead.

nrow(data)
[1] 9000
nrow(spotify_views)
[1] 365
# View the first few rows to understand the structure of the data
head(spotify_views)
Choose a column of data to analyze over time. This should be a
“response-like” variable that is of particular interest.
I chose popularity, using daily Wikipedia page views for the “Spotify” article as a stand-in for it.
Create a tsibble object of just the date and response variable.
Then, plot your data over time. Consider different windows of time.
library(tsibble)
# Ensure the date column is in Date format
spotify_views$date <- as.Date(spotify_views$date)
# Create a tsibble object with date as the index and views as the response variable
spotify_ts <- spotify_views |>
select(date, views) |>
rename(popularity = views) |>
as_tsibble(index = date)
# View the time series object
head(spotify_ts)
# Plot popularity over time
ggplot(spotify_ts, aes(x = date, y = popularity)) +
geom_line(color = "steelblue", linewidth = 1) +
labs(
title = "Spotify Track Popularity Over Time (Based on Wikipedia Views)",
x = "Date",
y = "Popularity (Views)"
) + theme_minimal()

# Aggregate by week
weekly_ts <- spotify_ts |>
index_by(week = ~ lubridate::floor_date(., "week")) |>
summarise(mean_popularity = mean(popularity, na.rm = TRUE))
# Plot
ggplot(weekly_ts, aes(x = week, y = mean_popularity)) +
geom_line(color = "darkgreen") +
labs(title = "Weekly Average Popularity", x = "Week", y = "Avg Views") +
theme_minimal()

# By month
monthly_ts <- spotify_ts |>
index_by(month = ~ lubridate::floor_date(., "month")) |>
summarise(avg_popularity = mean(popularity, na.rm = TRUE))
ggplot(monthly_ts, aes(x = month, y = avg_popularity)) +
geom_line(color = "orange") +
labs(title = "Monthly Trends in Popularity", x = "Month", y = "Average Views") +
theme_minimal()

Use linear regression to detect any upwards or downwards
trends.
# Fit the linear regression model
model <- lm(popularity ~ date, data = spotify_ts)
summary(model)
Call:
lm(formula = popularity ~ date, data = spotify_ts)
Residuals:
Min 1Q Median 3Q Max
-2778.0 -842.5 -9.7 553.9 6952.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.410e+05 1.249e+04 19.29 <2e-16 ***
date -1.200e+01 6.393e-01 -18.77 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1287 on 363 degrees of freedom
Multiple R-squared: 0.4927, Adjusted R-squared: 0.4913
F-statistic: 352.5 on 1 and 363 DF, p-value: < 2.2e-16
ggplot(spotify_ts, aes(x = date, y = popularity)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE, color = "blue") +
labs(title = "Trend of Popularity over Date",
x = "Date", y = "Popularity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

Trend analysis, based on the “Trend of Popularity over Date” graph for
January to December 2023:
The graph shows a clear overall downward trend in popularity over the
one-year period. The blue regression line declines from approximately
8,500 in early January 2023 to around 4,500 by the end of December 2023,
a decrease of nearly 50% over the year. The fitted slope is about -12
views per day and is highly significant (p < 2e-16).
Do you need to subset the data for multiple trends?
Yes, subsetting the data would be beneficial for several reasons:
There appear to be distinct clusters in the data that suggest
different patterns within the overall trend
The first quarter of 2023 shows significantly higher popularity
values (10,000-13,000 range) compared to other periods
Variability visibly increases again in late 2023, with some higher
points
The simple linear trend line (using the formula ‘y ~ x’) doesn’t
capture these potential cyclical or seasonal patterns
How strong are these trends?
The overall downward trend appears moderately strong, evidenced
by:
The consistent negative slope of the blue trend line throughout the
entire period
The relatively narrow confidence interval (blue shaded area) around
the trend line, suggesting statistical significance
However, the data points scatter widely around the trend line, with many
falling far from it. While the downward trend is clear, it explains only
about half of the variation in popularity (R-squared = 0.49). The
remaining scatter indicates that factors other than time are likely
influencing popularity.
Use smoothing to detect at least one season in your data, and
interpret your results.
spotify_ts |> ggplot(mapping = aes(x = date, y = popularity)) +
geom_point(size = 1, alpha = 0.4) +
geom_smooth(span = 0.2, color = 'blue', se = FALSE)+theme_classic()

Interpretation of the Smoothing Result
The plot uses LOESS smoothing (with span = 0.2) to visualize the trend in
the “popularity” variable over time (January to December 2023).
Key Observations
- Seasonal Pattern Detected:
- There is a clear seasonal trend in the data.
- Early 2023: Popularity starts high, peaking around
January.
- Spring 2023: There is a sharp
decline in popularity from January to around April.
- Mid-2023: Popularity remains relatively low and
stable, with minor fluctuations.
- Late 2023: There are small increases and decreases,
but no major spikes until a slight uptick at the end of the year.
- Possible Seasonality:
- The initial peak and subsequent drop suggest a
seasonal effect—possibly related to an event or release
that caused a spike in popularity at the start of the year.
- The smaller oscillations throughout the rest of the
year may indicate minor seasonal or periodic effects,
but they are less pronounced than the initial drop.
- Noise and Outliers:
- The scatterplot shows some outliers (especially
high values) that the smoother does not follow closely, which is
expected since the smoother is designed to capture the general trend,
not every fluctuation.
Can you illustrate the seasonality using ACF or PACF?
acf(spotify_ts, ci = 0.95, na.action = na.exclude)

This ACF plot for spotify_ts suggests the series is not strongly
seasonal but may have a trend or be non-stationary.
pacf(spotify_ts, na.action = na.exclude, xlab = 'lag', main = "PACF for Spotify pageviews" )

This PACF plot for Spotify pageviews shows a strong correlation at
lag 1 and little to no significant correlation at higher lags. This
suggests that an AR(1) model could be a good starting point for modeling
this time series.
To detect seasonality in my data, I used LOESS smoothing with a span
of 0.2, as shown in the plot. Here’s what I observed:
Clear Seasonality: At the beginning of 2023, there is a noticeable
peak in popularity, with values above 10,000. This suggests a strong
seasonal effect or a specific event that drove popularity up during this
period.
Sharp Decline and Stabilization: After this initial peak, popularity
drops sharply through the first quarter of the year, reaching a low
around April 2023. From that point onward, the popularity remains
relatively stable, fluctuating between 5,000 and 7,000, with only minor
ups and downs.
Minor Fluctuations: Throughout the rest of the year, I noticed some
smaller oscillations, but none are as dramatic as the initial drop. This
indicates that while there may be some minor seasonal effects, the main
seasonality is concentrated at the start of the year.
Outliers: There are a few outlier points, especially high values,
that the smoother doesn’t follow closely. This is expected, as the
smoothing method is designed to capture the overall trend rather than
every individual fluctuation.
Conclusion:
From this analysis, I can conclude that there is at least one strong
season in my data, with a major peak in popularity at the start of the
year, followed by a sharp decline and a stable period. This suggests
that timing plays a significant role in popularity, and it may be
beneficial to align major releases or promotions with the period where
the peak occurs.
If I want to understand the causes behind these trends more deeply, I
could further investigate what happened during the peak period or use
more advanced time series analysis techniques.
---
title: "Week - 12"
output: html_notebook
---

```{r}
library(ggplot2)
library(dplyr)
library(tidyverse)
library(lubridate)
library(ggrepel)
library(xts)
```


```{r}
# Install the pageviews package only if it is not already available
if (!requireNamespace("pageviews", quietly = TRUE)) {
  install.packages("pageviews")
}

# Load the library
library(pageviews)
```

```{r}
#Reading the data set
data <- read.csv("dataset.csv")
conflicted::conflicts_prefer(dplyr::filter)
# Keep only rows where explicit is "True" and draw a reproducible sample of 9,000 rows
set.seed(123)  # arbitrary seed so the sample is the same on every run
sample_data <- data |> filter(explicit == "True") |> slice_sample(n = 9000)
data <- sample_data
data
```
```{r}
names(data)
```
### Select a column of your data that encodes time (e.g., "date", "timestamp", "year", etc.). Convert this into a Date in R. If you do not have a time-based column of data: find a Wikipedia page that is related to your dataset. Then, extract a time series of page views for that page using the Wikipedia page views website or the R package used in this week's lab.

```{r}
# Extract daily page views for "Spotify" Wikipedia page
spotify_views <- article_pageviews(
  project = "en.wikipedia",
  article = "Spotify",
  start = as.Date("2023-01-01"),
  end = as.Date("2023-12-31"),
  user_type = "user",     # Only user traffic 
  platform = "all"        # All platforms: desktop, mobile-web, etc.
)

# View the first few rows
head(spotify_views)
```


```{r}
# Plot the time series of page views
ggplot(spotify_views, aes(x = date, y = views)) +
  geom_line(color = "steelblue", linewidth = 1) +
  labs(title = "Daily Wikipedia Page Views for 'Spotify' in 2023",
       x = "Date",
       y = "Views") +
  theme_minimal()
```


```{r}
nrow(data)
nrow(spotify_views)
```

```{r}
# View the first few rows to understand the structure of the data
head(spotify_views)
```
### Choose a column of data to analyze over time. This should be a "response-like" variable that is of particular interest.
I chose **popularity**, using daily Wikipedia page views for the "Spotify" article as a stand-in for it.

### Create a tsibble object of just the date and response variable. Then, plot your data over time. Consider different windows of time.
```{r}
library(tsibble)

# Ensure the date column is in Date format
spotify_views$date <- as.Date(spotify_views$date)

# Create a tsibble object with date as the index and views as the response variable
spotify_ts <- spotify_views |>
  select(date, views) |>
  rename(popularity = views) |>
  as_tsibble(index = date)

# View the time series object
head(spotify_ts)

```
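
A daily time series is easiest to work with when every calendar day appears exactly once. The sketch below shows one way to check this in base R. The `views_df` data frame is synthetic and only mimics the `date`/`views` layout, so it is an assumption and not the real page view data.

```r
# Synthetic stand-in for the page view data (hypothetical values)
views_df <- data.frame(
  date  = as.Date("2023-01-01") + c(0:9, 11:19),  # one day deliberately left out
  views = 100 + seq_len(19)
)

# Every day expected between the first and last observation
expected_days <- seq(min(views_df$date), max(views_df$date), by = "day")

# Days that are expected but absent, and dates that occur more than once
missing_days   <- expected_days[!expected_days %in% views_df$date]
duplicate_days <- views_df$date[duplicated(views_df$date)]

missing_days
duplicate_days
```

With a tsibble, `has_gaps()` and `fill_gaps()` serve a similar purpose.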



```{r}
# Plot popularity over time
ggplot(spotify_ts, aes(x = date, y = popularity)) +
  geom_line(color = "steelblue", linewidth = 1) +
  labs(
    title = "Spotify Track Popularity Over Time (Based on Wikipedia Views)",
    x = "Date",
    y = "Popularity (Views)"
  ) + theme_minimal()
```


```{r}
# Aggregate by week
weekly_ts <- spotify_ts |>
  index_by(week = ~ lubridate::floor_date(., "week")) |>
  summarise(mean_popularity = mean(popularity, na.rm = TRUE))

# Plot
ggplot(weekly_ts, aes(x = week, y = mean_popularity)) +
  geom_line(color = "darkgreen") +
  labs(title = "Weekly Average Popularity", x = "Week", y = "Avg Views") +
  theme_minimal()

```
```{r}
# By month
monthly_ts <- spotify_ts |>
  index_by(month = ~ lubridate::floor_date(., "month")) |>
  summarise(avg_popularity = mean(popularity, na.rm = TRUE))

ggplot(monthly_ts, aes(x = month, y = avg_popularity)) +
  geom_line(color = "orange") +
  labs(title = "Monthly Trends in Popularity", x = "Month", y = "Average Views") +
  theme_minimal()

```
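
Calendar weeks and months are one kind of window; a rolling window is another. The following is a minimal sketch of a centered 7-day moving average in base R. The `dates` and `views` vectors are synthetic, with a hypothetical weekday pattern, and are not taken from the page view data.

```r
set.seed(1)  # synthetic data only, for a reproducible illustration
dates <- seq(as.Date("2023-01-01"), by = "day", length.out = 28)
views <- 1000 + rep(c(50, 40, 30, 20, 10, -80, -70), times = 4) + rnorm(28, sd = 5)

# Centered 7-day moving average; stats::filter is named explicitly
# because dplyr::filter masks it once dplyr is attached
roll7 <- as.numeric(stats::filter(views, rep(1 / 7, 7), sides = 2))

# The first and last three days have no complete window and are NA
head(data.frame(date = dates, views = round(views), roll7 = round(roll7, 1)), 10)
```

A 7-day window averages over exactly one full week, so a weekday effect like the one simulated here is largely removed from the smoothed series.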

# What stands out immediately?
Statistical significance: The downward trend is consistent across all three time aggregations (daily, weekly, monthly), which suggests the decline is real and not an artifact of day-to-day noise. It is clearest in the monthly data, where aggregation removes most of that noise. This consistency is not a formal test; the linear regression in the next section tests the trend directly.

Practical implications: Interest in Spotify, as measured by views of its Wikipedia page, shifted downward in early 2023. Possible explanations include:

- a major change in user behavior
- platform algorithm changes
- competition from other music services
- seasonal effects early in the year that did not repeat


# Use linear regression to detect any upwards or downwards trends.

```{r}
# Fit the linear regression model
model <- lm(popularity ~ date, data = spotify_ts)
summary(model)
```

```{r}
ggplot(spotify_ts, aes(x = date, y = popularity)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE, color = "blue") +
  labs(title = "Trend of Popularity over Date", 
       x = "Date", y = "Popularity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

```
Trend analysis, based on the "Trend of Popularity over Date" graph for January to December 2023:

The graph shows a clear overall downward trend in popularity over the one-year period. The blue regression line declines from approximately 8,500 in early January 2023 to around 4,500 by the end of December 2023, a decrease of nearly 50% over the year. The fitted slope is about -12 views per day and is highly significant (p < 2e-16).
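
As a rough cross-check of the figures read off the graph, the fitted values at the start and end of the year can be computed from the coefficients printed by `summary(model)`. The coefficients below are the rounded printed values (intercept 2.410e+05, slope -1.200e+01), so the results are only approximate.

```r
# Coefficients as printed (rounded) in the regression summary
intercept <- 2.410e5
slope     <- -12.00   # views per day

# lm() treats a Date predictor as the number of days since 1970-01-01
start_day <- as.numeric(as.Date("2023-01-01"))
end_day   <- as.numeric(as.Date("2023-12-31"))

fit_start  <- intercept + slope * start_day
fit_end    <- intercept + slope * end_day
pct_change <- 100 * (fit_end - fit_start) / fit_start

c(fit_start = fit_start, fit_end = fit_end, pct_change = round(pct_change, 1))
```

With these rounded inputs the line falls from roughly 8,700 to roughly 4,300 views, a drop of about 50%, which is in line with the values estimated from the plot.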


# Do you need to subset the data for multiple trends?

Yes, subsetting the data would be beneficial for several reasons:

- There appear to be distinct clusters in the data, which suggests different patterns within the overall trend.
- The first quarter of 2023 shows much higher popularity values (in the 10,000-13,000 range) than other periods.
- Variability visibly increases again in late 2023, with some higher points.
- The simple linear trend line (formula `y ~ x`) does not capture these potential cyclical or seasonal patterns.
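
One way to subset is to fit a separate linear trend on each side of a break date. The sketch below does this on synthetic data. The series `toy` and the break date of 2023-04-01 are assumptions for illustration; the break is chosen by eye to match the low around April described in this notebook and is not estimated from the data.

```r
set.seed(42)  # synthetic data only
dates <- seq(as.Date("2023-01-01"), as.Date("2023-12-31"), by = "day")
day_index <- seq_along(dates)

# Hypothetical series: steep decline until the end of March, flat afterwards
views <- ifelse(dates < as.Date("2023-04-01"),
                10000 - 40 * day_index,
                6000) + rnorm(length(dates), sd = 300)
toy <- data.frame(date = dates, views = views)

# Assumed break point, chosen by eye
break_date <- as.Date("2023-04-01")
early <- subset(toy, date <  break_date)
late  <- subset(toy, date >= break_date)

# Slope (views per day) of a separate linear fit on each subset
slope_early <- coef(lm(views ~ date, data = early))[["date"]]
slope_late  <- coef(lm(views ~ date, data = late))[["date"]]

c(early = slope_early, late = slope_late)
```

If the two slopes differ as sharply as they do in this toy series, a single straight line over the whole year is averaging two different regimes.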

# How strong are these trends?

The overall downward trend appears moderately strong, as evidenced by:

- The negative slope of the blue trend line, about -12 views per day.
- The relatively narrow confidence interval (blue shaded area) around the trend line, consistent with the statistically significant slope (p < 2e-16).

However, the data points scatter widely around the trend line, with many falling far from it. While the downward trend is clear, it explains only about half of the variation in popularity (R-squared = 0.49). The remaining scatter indicates that factors other than time are likely influencing popularity.
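
Trend strength can also be read directly from a fitted model without relying on the plot. This is a minimal sketch on synthetic data (the `toy` data frame is an assumption, loosely shaped like a noisy downward trend) showing how to extract R-squared and a confidence interval for the slope.

```r
set.seed(7)  # synthetic data only
toy <- data.frame(date = seq(as.Date("2023-01-01"), by = "day", length.out = 365))
toy$views <- 8500 - 12 * seq_len(365) + rnorm(365, sd = 1300)

fit <- lm(views ~ date, data = toy)

r_squared <- summary(fit)$r.squared              # share of variance explained by time
slope_ci  <- confint(fit, "date", level = 0.95)  # 95% interval for the daily slope

r_squared
slope_ci
```

A slope interval that excludes zero indicates a detectable trend, while R-squared indicates how much of the day-to-day variation that trend accounts for. The two can disagree: a trend can be clearly non-zero and still leave much of the variation unexplained.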


# Use smoothing to detect at least one season in your data, and interpret your results.

```{r}
spotify_ts |> ggplot(mapping = aes(x = date, y = popularity)) + 
  geom_point(size = 1, alpha = 0.4) + 
  geom_smooth(span = 0.2, color = 'blue', se = FALSE)+theme_classic()
```
## Interpretation of the Smoothing Result

The plot uses **LOESS smoothing** (with `span = 0.2`) to visualize the trend in the "popularity" variable over time (January to December 2023).

### Key Observations

1. **Seasonal Pattern Detected:**
   - There is a **clear seasonal trend** in the data.
   - **Early 2023:** Popularity starts high, peaking around January.
   - **Spring 2023:** There is a **sharp decline** in popularity from January to around April.
   - **Mid-2023:** Popularity remains relatively low and stable, with minor fluctuations.
   - **Late 2023:** There are small increases and decreases, but no major spikes until a slight uptick at the end of the year.

2. **Possible Seasonality:**
   - The **initial peak** and subsequent drop suggest a **seasonal effect**—possibly related to an event or release that caused a spike in popularity at the start of the year.
   - The **smaller oscillations** throughout the rest of the year may indicate **minor seasonal or periodic effects**, but they are less pronounced than the initial drop.

3. **Noise and Outliers:**
   - The scatterplot shows some **outliers** (especially high values) that the smoother does not follow closely, which is expected since the smoother is designed to capture the general trend, not every fluctuation.
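
The observations above are read off the smoothed curve by eye. As a sketch of how the same curve could be inspected numerically, the block below fits `stats::loess()` with `span = 0.2` to synthetic data and locates the highest and lowest points of the smooth. The `toy` series and its shape are assumptions, and a numeric day index is used as the predictor.

```r
set.seed(3)  # synthetic data only
toy <- data.frame(day = 1:365)
# Hypothetical shape: high start, rapid decline, then roughly flat
toy$views <- 6000 + 5000 * exp(-toy$day / 30) + rnorm(365, sd = 400)

# Local regression where each fit uses the nearest 20% of the points
lo <- loess(views ~ day, data = toy, span = 0.2)
toy$smooth <- predict(lo)

# Days on which the smoothed curve is highest and lowest
peak_day   <- toy$day[which.max(toy$smooth)]
trough_day <- toy$day[which.min(toy$smooth)]
c(peak_day = peak_day, trough_day = trough_day)
```

A smaller `span` follows local bumps more closely, and a larger one gives a flatter curve, so it is worth trying a few values before interpreting small oscillations.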


# Can you illustrate the seasonality using ACF or PACF?

```{r}
acf(spotify_ts, ci = 0.95, na.action = na.exclude)
```
This ACF plot for spotify_ts suggests the series is not strongly seasonal but may have a trend or be non-stationary. 
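
For comparison, the sketch below shows what a seasonal signature looks like in an ACF. It simulates a series with a hypothetical weekly pattern (the `weekly_shape` values are invented) and reads the autocorrelation at a few lags.

```r
set.seed(11)  # synthetic data only
weekly_shape <- c(10, 12, 11, 13, 12, 6, 5)   # hypothetical weekday effect
x <- rep(weekly_shape, times = 52) + rnorm(364, sd = 1)

# plot = FALSE returns the autocorrelations instead of drawing them
a <- acf(x, lag.max = 21, plot = FALSE)

# a$acf[1] is lag 0, so lag k sits at position k + 1
round(a$acf[c(3, 7, 14) + 1], 2)
```

In a series with a weekly cycle, the autocorrelation spikes at lags 7, 14, and so on. A slowly decaying ACF without such repeating spikes points to trend more than to seasonality.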

```{r}
pacf(spotify_ts, na.action = na.exclude, xlab = 'lag', main = "PACF for Spotify pageviews" )
```
This PACF plot for Spotify pageviews shows a strong correlation at lag 1 and little to no significant correlation at higher lags. This suggests that an AR(1) model could be a good starting point for modeling this time series.
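
A minimal sketch of what fitting such an AR(1) model could look like with base R's `arima()` is given below. It uses a simulated series with a known coefficient of 0.7 (an assumption chosen for illustration), not the page view data.

```r
set.seed(5)  # synthetic data only
# Simulate an AR(1) process with a known coefficient of 0.7
x <- arima.sim(model = list(ar = 0.7), n = 365)

# order = c(p, d, q): one autoregressive term, no differencing, no MA terms
fit <- arima(x, order = c(1, 0, 0))
phi_hat <- coef(fit)[["ar1"]]
phi_hat
```

If the series does turn out to be non-stationary, as the ACF hints, adding a differencing term (`d = 1`) would be one option to try before settling on a model.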




To detect seasonality in my data, I used LOESS smoothing with a span of 0.2, as shown in the plot. Here’s what I observed:

Clear Seasonality:
At the beginning of 2023, there is a noticeable peak in popularity, with values above 10,000. This suggests a strong seasonal effect or a specific event that drove popularity up during this period.

Sharp Decline and Stabilization:
After this initial peak, popularity drops sharply through the first quarter of the year, reaching a low around April 2023. From that point onward, the popularity remains relatively stable, fluctuating between 5,000 and 7,000, with only minor ups and downs.

Minor Fluctuations:
Throughout the rest of the year, I noticed some smaller oscillations, but none are as dramatic as the initial drop. This indicates that while there may be some minor seasonal effects, the main seasonality is concentrated at the start of the year.

Outliers:
There are a few outlier points, especially high values, that the smoother doesn’t follow closely. This is expected, as the smoothing method is designed to capture the overall trend rather than every individual fluctuation.

### Conclusion:

From this analysis, I can conclude that there is at least one strong season in my data, with a major peak in popularity at the start of the year, followed by a sharp decline and a stable period. This suggests that timing plays a significant role in popularity, and it may be beneficial to align major releases or promotions with the period where the peak occurs.

If I want to understand the causes behind these trends more deeply, I could further investigate what happened during the peak period or use more advanced time series analysis techniques.
















