This data dive focuses on understanding time-series data. Unfortunately, there are no good candidate variables within the Bank Marketing data set, so I cannot use the traditional data set I have been using for this data dive. It does contain month and day-of-week last-contacted variables, but I don't see any way to translate them such that they would meet the requirements of time-series data. So instead, we will use the Wikipedia page views on the 'Certificate of Deposit' page, which relates to the product that is (I think) being advertised in the bank marketing campaign data set. The requirements of this lab will be met when we:
Construct a time-series data set (a tsibble object) of the date index and response variable from the chosen data by exploring functions like index_by and fill_na.
Use linear regression to detect any upward or downward trends in the data, determine whether there may be multiple trends that need to be subset (e.g., weekly, monthly, quarterly, annually), and assess the strength of these trends.
Use smoothing to detect at least one season in the data (can this be illustrated using ACF or PACF?) and interpret the results.
# Declare libraries
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(tsibble)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
##
## Attaching package: 'tsibble'
##
## The following object is masked from 'package:lubridate':
##
## interval
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
library(ggthemes)
library(ggrepel)
library(xts)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following object is masked from 'package:tsibble':
##
## index
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
setwd("C:/Users/chris/OneDrive - Indiana University/Graduate School/MIS/INFO-H 510/Project Data")
# Read in dataframe
cd_page_views <- read_delim("cd-page-views.csv", delim = ",")
## Rows: 3960 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (1): count
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
For a dataset to be properly classified as a time series that analysis can be run on, there are three key requirements that must be met:
The response variable (e.g., count of daily page views) must be clearly defined and calculated for each unit of time in the series.
The time series index must be consistent and regular, with only one record per time index.
There must be no implicit missing rows, i.e., the time series index must not 'skip' values.
Within the page view dataset, we have data ranging from 2015-07-01 to 2026-05-03. Let's go ahead and write that dataset into a tsibble and check whether we meet those requirements. As the checks below show, there are no gaps and no NA values, so this is a valid time series object.
# Change Date column from a character class into a date field
cd_page_views <- cd_page_views |>
mutate(Date_index = as.Date(Date, format = "%m/%d/%Y")) |>
select(-Date) |>
distinct()
summary(cd_page_views)
## count Date_index
## Min. : 50.0 Min. :2015-07-01
## 1st Qu.: 196.0 1st Qu.:2018-03-16
## Median : 365.0 Median :2020-11-30
## Mean : 409.1 Mean :2020-11-30
## 3rd Qu.: 610.2 3rd Qu.:2023-08-17
## Max. :1452.0 Max. :2026-05-03
# Construct tsibble
page_views_ts <- as_tsibble(cd_page_views, index = Date_index)
# Check for any gaps
count_gaps(page_views_ts)
## # A tibble: 0 × 3
## # ℹ 3 variables: .from <date>, .to <date>, .n <int>
# Check for any null values
sum(is.na(page_views_ts$count))
## [1] 0
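Had either check come back positive, tsibble provides tools to repair the index before any analysis. Below is a minimal sketch of what that could look like, assuming the same cd_page_views and page_views_ts objects; since the series is already complete, these steps are effectively no-ops here.
# duplicates() would list any rows that share the same Date_index
duplicates(cd_page_views, index = Date_index)
# fill_gaps() turns implicit missing dates into explicit rows with count = NA,
# which tidyr::fill() can then impute by carrying the last observed value forward
page_views_filled <- page_views_ts |>
  fill_gaps() |>
  tidyr::fill(count, .direction = "down")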
We'll begin by creating a trend chart using geom_smooth(), which fits a regression line to show the overall pattern in page view counts over time. Here we can see a decreasing linear trend in the data set, where the number of visits to the Certificate of Deposit Wikipedia page appears to be declining over time.
# Construct Trend Chart
page_views_ts |>
ggplot(aes(x = Date_index, y = count)) +
geom_line() +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Page Views Over Time",
x = "Date",
y = "Page Views"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
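To put a number on that downward trend, we can fit the same straight line that geom_smooth() draws and read off its slope. This is a quick sketch under the assumption that a plain linear fit against the date index is an adequate first pass; the yearly figure is just the per-day slope scaled by 365.
# Quantify the overall linear trend in page views
trend_fit <- lm(count ~ Date_index, data = page_views_ts)
coef(trend_fit)[["Date_index"]] * 365  # estimated change in daily views per year
summary(trend_fit)$r.squared           # variance explained by the straight line alone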
Another interesting observation is that the series appears to have two regimes: page views were fairly consistent up through the start of 2021 and then settled into a lower but relatively consistent baseline after the shift. Let's visualize these periods separately to get a better understanding of possible seasonality in the data.
# Faceting Based on Sudden Drop in Page Views
page_views_ts |>
mutate(
Date_period = if_else(Date_index < as.Date("2021-01-01"),
"2015–2020",
"2021+")
) |>
ggplot(aes(x = Date_index, y = count)) +
geom_line() +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~ Date_period, scales = "free_x") +
labs(
title = "Page Views Trend by Time Period",
subtitle = "Linear regression trend split into 2015–2020 and 2021+ periods",
x = "Date",
y = "Page Views"
) +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
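To check whether the two periods really behave differently, we can also fit a separate straight line within each facet and compare their slopes. A small sketch, assuming the same 2021-01-01 cut point used above:
# Estimate the linear slope (views per day) separately for each period
page_views_ts |>
  as_tibble() |>
  mutate(
    Date_period = if_else(Date_index < as.Date("2021-01-01"), "2015-2020", "2021+")
  ) |>
  group_by(Date_period) |>
  group_modify(~ tibble(
    slope_per_day = coef(lm(count ~ as.numeric(Date_index), data = .x))[2]
  )) |>
  ungroup()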
Let's go a layer deeper and look only at the period from 2024 to 2026 to see if that helps us spot what type of seasonality this is, e.g., daily, weekly, monthly, quarterly, or annual.
# Construct Trend Chart
page_views_ts |>
filter(Date_index > as.Date("2024-01-01")) |>
ggplot(aes(x = Date_index, y = count)) +
geom_line() +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Page Views Over Time",
subtitle = "For 2024 through today",
x = "Date",
y = "Page Views"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
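Before smoothing, a quick way to test whether the short cycle visible above is weekly is to average page views by day of the week; a pronounced weekday vs. weekend gap would point to a 7-day season. A small sketch using lubridate's wday():
# Average page views by day of week; a clear high/low pattern suggests weekly seasonality
page_views_ts |>
  as_tibble() |>
  mutate(day_of_week = wday(Date_index, label = TRUE, week_start = 1)) |>
  group_by(day_of_week) |>
  summarise(avg_views = mean(count, na.rm = TRUE))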
Now that we've identified a possible daily / weekly seasonal pattern, along with recurring spikes in page views, we'll construct a LOESS curve to better understand the underlying trend while smoothing out short-term fluctuations across various candidate seasonal periods.
# Construct multiple LOESS curves for each candidate season and facet them
weekly <- page_views_ts |>
index_by(period_date = floor_date(Date_index, "week")) |>
summarise(avg_views = mean(count, na.rm = TRUE)) |>
as_tibble() |>
mutate(period = "Weekly")
monthly <- page_views_ts |>
index_by(period_date = floor_date(Date_index, "month")) |>
summarise(avg_views = mean(count, na.rm = TRUE)) |>
as_tibble() |>
mutate(period = "Monthly")
quarterly <- page_views_ts |>
index_by(period_date = floor_date(Date_index, "quarter")) |>
summarise(avg_views = mean(count, na.rm = TRUE)) |>
as_tibble() |>
mutate(period = "Quarterly")
semi_annually <- page_views_ts |>
index_by(period_date = floor_date(Date_index, "halfyear")) |>
summarise(avg_views = mean(count, na.rm = TRUE)) |>
as_tibble() |>
mutate(period = "Semi-Annually")
annually <- page_views_ts |>
index_by(period_date = floor_date(Date_index, "year")) |>
summarise(avg_views = mean(count, na.rm = TRUE)) |>
as_tibble() |>
mutate(period = "Annually")
bind_rows(
weekly,
monthly,
quarterly,
semi_annually,
annually
) |>
ggplot(aes(x = period_date, y = avg_views)) +
geom_line() +
geom_smooth(span = 0.3, color = "blue", se = FALSE) +
facet_wrap(~ period, scales = "free_y") +
labs(
title = "Average Page Views Over Time",
subtitle = "Comparing weekly, monthly, quarterly, semi-annual, and annual seasonality",
x = "Date",
y = "Average Page Views"
) +
scale_x_date(breaks = "1 year", labels = \(x) year(x)) +
theme_hc()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : span too small. fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 16416
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 751.09
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 5.6414e+05
(The LOESS warnings above most likely come from the annual facet, which has only about a dozen data points, too few for a span of 0.3.)
Since time series are composed of three distinct parts, (1) trend, (2) seasonality, and (3) residuals, formally decomposing the time series (as opposed to just visualizing it) may let us surface the seasonal effect better.
We will assume an additive model, since the recurring spikes and the possible weekly seasonal pattern we observed appear to sit on top of the broader long-term trend rather than scale with it.
Based on prior findings, we will only look at weekly, monthly, and annual seasons.
# Weekly Additive Model
page_views_weekly_dc <- page_views_ts$count |>
ts(frequency = 7) |>
decompose(type = "additive")
plot(page_views_weekly_dc)
# Monthly Additive Model
page_views_monthly_dc <- page_views_ts$count |>
ts(frequency = 30) |>
decompose(type = "additive")
plot(page_views_monthly_dc)
# Annual Additive Model
page_views_annual_dc <- page_views_ts$count |>
ts(frequency = 365) |>
decompose(type = "additive")
plot(page_views_annual_dc)
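As a side note before interpreting these plots: classical decompose() estimates the trend with a centered moving average, while stl() fits the seasonal component with LOESS and tends to be more robust on long daily series. A sketch of that alternative, assuming a weekly period:
# STL decomposition with a weekly period as an alternative to classical decompose()
page_views_stl <- page_views_ts$count |>
  ts(frequency = 7) |>
  stl(s.window = "periodic", robust = TRUE)
plot(page_views_stl)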
We can see that it is hard to 'observe' the weekly pattern because the panel is so dense; it basically looks like a chocolate bar, but the values all seem pretty consistent.
A clear monthly pattern also emerges in the underlying time series.
Finally, there also appears to be some recurring annual pattern based on the season.
This leads me to conclude that we may be able to nest all three of these season types into a model to get a more holistic picture of how the data are moving. We'll verify / build a stronger case by checking an ACF plot.
acf(
page_views_ts$count,
ci = 0.95,
na.action = na.exclude,
lag.max = 400,
main = "ACF Plot for Page Views"
)
Here we can very clearly see evidence of autocorrelation in the data, where page view values appear highly dependent on prior values. This suggests seasonality, as noted above, with the slow decay across lags reflecting the long-term trend we also observed in the data.
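The lab prompt also mentions PACF. A partial autocorrelation plot removes the influence of intermediate lags, so if the weekly season is real we would expect isolated spikes near lag 7 rather than the slow decay seen in the ACF. A short sketch on the same series:
# Partial ACF isolates the direct contribution of each lag
pacf(
  page_views_ts$count,
  na.action = na.exclude,
  lag.max = 30,
  main = "PACF Plot for Page Views"
)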
Let's try running a simple multiple regression based on these findings.
Here we can see an adjusted R-squared of 0.8243, with the overall model significant at p < 0.01.
This means our simple time series analysis does find evidence of seasonality, where month and day of week each influence page views.
We also account for the negative long-term trend with a time index term.
# Page Views Dataframe for Modeling
page_views_model_df <- page_views_ts |>
as_tibble() |>
arrange(Date_index) |>
mutate(
time_index = row_number(),
day_of_week = wday(Date_index, label = TRUE),
month = month(Date_index, label = TRUE),
year = year(Date_index)
)
page_views_lm <- lm(
count ~ time_index + day_of_week + month,
data = page_views_model_df
)
summary(page_views_lm)
##
## Call:
## lm(formula = count ~ time_index + day_of_week + month, data = page_views_model_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -336.89 -61.30 -0.60 60.52 915.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.750e+02 3.254e+00 238.169 < 2e-16 ***
## time_index -1.847e-01 1.423e-03 -129.816 < 2e-16 ***
## day_of_week.L -1.551e+01 4.293e+00 -3.614 0.000306 ***
## day_of_week.Q -1.656e+02 4.293e+00 -38.583 < 2e-16 ***
## day_of_week.C 1.397e+01 4.294e+00 3.254 0.001149 **
## day_of_week^4 -2.670e+01 4.294e+00 -6.219 5.54e-10 ***
## day_of_week^5 6.656e+00 4.294e+00 1.550 0.121194
## day_of_week^6 1.097e+00 4.293e+00 0.255 0.798386
## month.L -6.681e+01 5.597e+00 -11.936 < 2e-16 ***
## month.Q 5.306e+01 5.610e+00 9.459 < 2e-16 ***
## month.C 8.417e+00 5.578e+00 1.509 0.131376
## month^4 -5.749e+01 5.608e+00 -10.251 < 2e-16 ***
## month^5 -7.842e+00 5.652e+00 -1.388 0.165333
## month^6 2.570e+01 5.666e+00 4.536 5.91e-06 ***
## month^7 9.928e+00 5.630e+00 1.764 0.077874 .
## month^8 -4.145e+00 5.611e+00 -0.739 0.460129
## month^9 -1.731e+00 5.623e+00 -0.308 0.758292
## month^10 2.348e+00 5.640e+00 0.416 0.677262
## month^11 -6.369e+00 5.688e+00 -1.120 0.262965
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 102.1 on 3941 degrees of freedom
## Multiple R-squared: 0.8251, Adjusted R-squared: 0.8243
## F-statistic: 1033 on 18 and 3941 DF, p-value: < 2.2e-16
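One caveat worth flagging: an ordinary lm() fit to a time series usually leaves autocorrelated residuals, which makes the reported standard errors and p-values look more favorable than they should. A quick residual check, sketched against the model above:
# If the residual ACF still shows strong spikes, the lm p-values overstate certainty
acf(residuals(page_views_lm), lag.max = 60, main = "ACF of Regression Residuals")
# Residuals over time; visible waves would suggest structure the model missed
plot(
  page_views_model_df$Date_index, residuals(page_views_lm),
  type = "l", xlab = "Date", ylab = "Residual"
)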