Week 12 Assignment

Importing libraries and dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(xts)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'xts'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
library(tsibble)
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:zoo':
## 
##     index
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union
theme_set(theme_minimal())
options(scipen = 6)

My dataset doesn’t have a time column, so I am using the Wikipedia page view data for Cars24, the company my dataset is about.

data <- read.csv('cars24_wiki.csv')
head(data)
##         Date Page_views
## 1 01/07/2020          0
## 2 02/07/2020          0
## 3 03/07/2020          0
## 4 04/07/2020          0
## 5 05/07/2020          0
## 6 06/07/2020          0

Choosing a response variable

I am choosing Page_views as my response variable.

Creating a tsibble

Since my data has only two columns, the date and the response variable (Page_views), I’m going to leave it as is. Before plotting the trend, I convert the date from a character column to a Date object.

data$Date <- as.Date(data$Date, format = "%d/%m/%Y") 

data_ts <- as_tsibble(data, index=Date)

data_xts <- xts(x = data$Page_views, 
                  order.by = data$Date)

data_xts <- setNames(data_xts, "views")
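
The date conversion above only works if the format string matches the dd/mm/yyyy layout of the file. As a minimal sketch, assuming a small hypothetical vector of dates (not the actual Cars24 file), the checks below show how a mismatched format surfaces as NA and how missing days in a daily series can be spotted:

```r
# Hypothetical sample in the same dd/mm/yyyy layout as the Date column
raw_dates <- c("01/07/2020", "02/07/2020", "03/07/2020", "13/07/2020")

# Correct format: day first, then month
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")
n_failed <- sum(is.na(parsed))

# Wrong format (month first): "13" is not a valid month, so that entry becomes NA
parsed_wrong <- as.Date(raw_dates, format = "%m/%d/%Y")
n_failed_wrong <- sum(is.na(parsed_wrong))

# Spacing between consecutive dates in days; values above 1 mark missing days
spacing <- as.numeric(diff(parsed))
has_gap <- any(spacing > 1)
```

Running the same two checks on the real Date column before building the tsibble would confirm that every row parsed and whether the daily series has gaps.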

Plotting data

data |>
  ggplot() +
  geom_line(mapping = aes(x=Date, y=Page_views)) +
  theme_hc()

Since this plot is quite noisy, I will smooth it using the two methods discussed in class.

Rolling average

data_xts |>
  rollapply(width = 30, \(x) mean(x, na.rm = TRUE), fill = FALSE) |>
  ggplot(mapping = aes(x = Index, y = views)) +
  geom_line() +
  labs(title = "Page views of Cars24 Wiki page",
       subtitle = "30-day Rolling Average") +
  theme_hc()
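
To make explicit what a rolling average computes, here is a minimal base R sketch on a toy vector. The vector, the window of 3, and the trailing alignment are illustrative assumptions, not the 30-day window on the real data. It calls stats::filter() by its full name because dplyr::filter() masks it in this session:

```r
x <- c(2, 4, 6, 8, 10, 12)   # toy series, illustration only
w <- 3                       # window width

# sides = 1 gives a trailing window: mean of the current and previous w - 1 values
roll <- as.numeric(stats::filter(x, rep(1 / w, w), sides = 1))

# The first w - 1 positions have no complete window and are NA
n_incomplete <- sum(is.na(roll))
```

A wider window averages over more days, so short-lived spikes are flattened more strongly.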

LO(W)ESS

data_ts |>
  ggplot(mapping = aes(x = Date, y = Page_views)) +
  geom_point(size=1, shape='O') +
  geom_smooth(method = 'loess', span=0.2, color = 'blue', se=FALSE) +
  labs(title = "Page views of Cars24 Wiki page") +
  theme_hc()
## `geom_smooth()` using formula = 'y ~ x'

The most striking feature is the sharp spike in page views around late 2023 and early 2024, followed by an equally steep decline. This pattern suggests a period of heightened interest, possibly due to a marketing campaign or a major event related to Cars24. Outside this peak, the page views are relatively stable, with minor fluctuations. A single spike like this points to an event-driven pattern rather than a recurring seasonal one.
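
In LOESS, the span sets the fraction of points used for each local fit, so it controls how closely the curve follows the data. The sketch below uses stats::loess() on synthetic data (the signal, noise level, and the two span values are assumptions chosen for illustration) to compare a narrow and a wide span:

```r
set.seed(42)
x <- 1:200
y <- sin(x / 20) + rnorm(200, sd = 0.3)   # synthetic noisy signal

fit_narrow <- loess(y ~ x, span = 0.2)    # local fit, follows the data closely
fit_wide   <- loess(y ~ x, span = 0.8)    # broad fit, much smoother

# Roughness measured as the sum of squared second differences of the fitted curve
roughness <- function(fit) sum(diff(fitted(fit), differences = 2)^2)

rough_narrow <- roughness(fit_narrow)
rough_wide   <- roughness(fit_wide)
```

The narrow span gives the rougher curve. On the page view data, a small span keeps the late 2023 spike visible, while a large span would flatten it.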

Identifying trend

Let's use linear regression to find the trend over the entire dataset.

data_ts |>
  ggplot(mapping = aes(x = Date, y = Page_views)) +
  geom_line() +
  geom_smooth(method = 'lm', color = 'blue', se=FALSE) +
  labs(title = "Overall page views") +
  theme_hc()
## `geom_smooth()` using formula = 'y ~ x'

Let's look at the trend in two separate windows: before mid 2023 and after mid 2023.

Trend before mid 2023

data_ts |>
  filter_index("2020-01" ~ "2023-06") |>
  ggplot(mapping = aes(x = Date, y = Page_views)) +
  geom_line() +
  geom_smooth(method = 'lm', color = 'blue', se=FALSE) +
  labs(title = "Page views between 2020 and mid 2023") +
  theme_hc()
## `geom_smooth()` using formula = 'y ~ x'

This shows that the trend is increasing from the start of the wiki page until mid 2023. Let's look at the trend for the rest of the data.

Trend after mid 2023

data_ts |>
  filter_index("2023-07" ~ "2025-12") |>
  ggplot(mapping = aes(x = Date, y = Page_views)) +
  geom_line() +
  geom_smooth(method = 'lm', color = 'blue', se=FALSE) +
  labs(title = "Page views from mid 2023 onward") +
  theme_hc()
## `geom_smooth()` using formula = 'y ~ x'

Here, even though there is a spike in page views, the overall trend for this window is decreasing.
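
The direction of each trend line can also be read off as a number: the slope of a linear fit, in page views per day. This is a minimal sketch on a synthetic series that rises and then falls. The data, the split point, and the helper trend_slope() are hypothetical and only mimic the before/after comparison above:

```r
# Synthetic series: rises for 100 days, then falls for 100 days
dates <- seq(as.Date("2023-01-01"), by = "day", length.out = 200)
views <- c(seq(10, 109, length.out = 100), seq(109, 10, length.out = 100))
toy <- data.frame(Date = dates, Page_views = views)

cutoff <- dates[100]   # hypothetical split point

# Slope of a linear fit, in page views per day
trend_slope <- function(df) unname(coef(lm(Page_views ~ Date, data = df))[2])

slope_before <- trend_slope(toy[toy$Date <= cutoff, ])
slope_after  <- trend_slope(toy[toy$Date > cutoff, ])
```

A positive slope in the first window and a negative slope in the second correspond to the increasing and decreasing trend lines in the two plots.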

Detecting Seasonality

data_ts |>
  index_by(year = floor_date(Date, 'halfyear')) |>
  summarise(avg_views = mean(Page_views, na.rm = TRUE)) |>
  ggplot(mapping = aes(x = year, y = avg_views)) +
  geom_line() +
  geom_smooth(span = 0.3, color = 'blue', se=FALSE) +
  labs(title = "Average Page Views Over Time",
       subtitle = "(by half year)") +
  scale_x_date(breaks = "1 year", labels = \(x) year(x)) +
  theme_hc()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : span too small.  fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 18437
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 191.31
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 35836

This plot does not show clear evidence of recurring seasonality. While there is a noticeable spike in page views starting in late 2023 and declining sharply by mid-2024, this behavior appears to be driven by a one-time event or trend rather than a repeating seasonal pattern.

Seasonality typically involves regular, periodic fluctuations (e.g., yearly or quarterly), which are absent here. The data prior to 2023 shows relatively stable and gradual growth without distinct recurring peaks, further suggesting that the observed trend is not seasonal but rather event-driven.
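
One simple way to check whether a peak recurs is to average by calendar month and see whether the same month stands out in every year. The sketch below does this in base R on synthetic data with a built-in December bump. The dates, the size of the bump, and the noise are all assumptions for illustration:

```r
dates <- seq(as.Date("2020-01-01"), as.Date("2022-12-31"), by = "day")
set.seed(1)
# Synthetic views with a bump every December
views <- 100 + ifelse(format(dates, "%m") == "12", 50, 0) +
  rnorm(length(dates), sd = 5)

# Average by calendar month, pooling all years
month_means <- tapply(views, format(dates, "%m"), mean)
peak_month <- names(which.max(month_means))

# Average by year and month, to confirm the bump repeats each year
by_year_month <- tapply(views, list(format(dates, "%Y"), format(dates, "%m")), mean)
repeats_each_year <- all(apply(by_year_month, 1, which.max) == 12)
```

In a seasonal series the peak falls in the same month each year. A one-time spike, as seen in the Cars24 data, would stand out in a single year only.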

data_dc <- as.ts(data_ts) |>
  decompose()           

plot(data_dc)
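
decompose() splits a series into trend, seasonal, and random components, and the seasonal part depends entirely on the frequency of the ts object. As a sketch, assuming a weekly cycle (frequency = 7) and synthetic daily data with a weekend bump, the seasonal effects can be inspected directly:

```r
set.seed(7)
n <- 7 * 20                                          # 20 weeks of daily values
weekly <- rep(c(0, 0, 0, 0, 0, 5, 5), times = 20)    # weekend bump
x <- ts(0.1 * seq_len(n) + weekly + rnorm(n, sd = 0.5), frequency = 7)

dc <- decompose(x)   # additive decomposition by default

# dc$figure holds one seasonal effect per position in the cycle
seasonal_effects <- dc$figure
```

The seasonal effects are centered so that they sum to zero over one cycle. If they are small relative to the trend and random components, the series has little seasonality at that frequency.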

Seasonality using ACF

acf(data_ts, ci = 0.95, na.action = na.exclude)

In the ACF plot, all lags show a high autocorrelation close to 1. This pattern typically indicates non-stationarity or a strong trend in the data rather than seasonality. For seasonal patterns, we would expect periodic peaks at lags corresponding to the seasonal cycle (e.g., weekly, monthly, or yearly cycles in daily data).

Since this plot shows no distinct periodic drop-offs or cyclical behavior, it suggests that any potential seasonality in the Cars24 page views might be overshadowed by a strong trend or other dominant structure in the data. Transformations or differencing may be needed to isolate and detect seasonality.
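
The effect of differencing can be sketched on synthetic data. The series below is an assumption for illustration: a linear trend, a weekly bump, and noise. Its raw ACF is dominated by the trend, while the ACF of the first differences shows the weekly cycle at lag 7:

```r
set.seed(3)
n <- 700
weekly <- rep(c(0, 0, 0, 0, 0, 5, 5), length.out = n)   # weekend bump
x <- seq(0, 100, length.out = n) + weekly + rnorm(n)     # trend + season + noise

# ACF of the raw series: dominated by the trend, slow decay
acf_raw <- acf(x, lag.max = 14, plot = FALSE)$acf

# ACF after first differencing: trend removed, weekly cycle visible at lag 7
acf_diff <- acf(diff(x), lag.max = 14, plot = FALSE)$acf

# Element 1 is lag 0, so lag k sits at position k + 1
raw_lag1  <- acf_raw[2]
diff_lag7 <- acf_diff[8]
diff_lag3 <- acf_diff[4]
```

Applying the same step to the page views, for example with acf(diff(data$Page_views)), would show whether any periodic peaks remain once the trend is removed.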

Seasonality using PACF

pacf(data_ts, na.action = na.exclude, 
     xlab = "Lag", main = "PACF for Cars24 page views")

The partial autocorrelation drops off sharply after lag 1, indicating that the series likely exhibits a strong trend or dependence on its immediate past value rather than seasonality. Seasonality typically manifests as significant partial autocorrelations at lags corresponding to the seasonal period, such as weekly or monthly intervals. Since this PACF does not show such recurring peaks at higher lags, it suggests that the data does not exhibit strong seasonal patterns.
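
For comparison, a series that depends only on its previous value produces exactly this kind of PACF. The sketch below simulates an AR(1) process with arima.sim(). The coefficient of 0.9 and the series length are assumptions for illustration, not estimates from the Cars24 data:

```r
set.seed(11)
# Synthetic AR(1) series: each value depends only on the previous one
ar1 <- arima.sim(model = list(ar = 0.9), n = 500)

# pacf() starts at lag 1, so position k is lag k
p <- as.numeric(pacf(ar1, lag.max = 10, plot = FALSE)$acf)

pacf_lag1 <- p[1]
pacf_rest_max <- max(abs(p[-1]))
```

The partial autocorrelation is large at lag 1 and close to zero afterwards. A seasonal series would instead show an additional significant value at the seasonal lag.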