00 - Introduction

This data dive focuses on understanding time-series data. Unfortunately, there are no good candidate variables within the Bank Marketing data set, so I cannot use the data set I have traditionally been using for these data dives.

So instead, we will use the Wikipedia page views for the ‘Certificate of Deposit’ page, which relates to the product that (I believe) is being advertised in the bank marketing campaign data set. The requirements of this lab can still be met with this series.

# Declare libraries
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(tsibble)
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union
library(ggthemes)
library(ggrepel)
library(xts)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following object is masked from 'package:tsibble':
## 
##     index
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'xts'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last

setwd("C:/Users/chris/OneDrive - Indiana University/Graduate School/MIS/INFO-H 510/Project Data")

# Read in dataframe
cd_page_views <- read_delim("cd-page-views.csv", delim = ",")
## Rows: 3960 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (1): count
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

01 - Time Series Requirements

For a dataset to be properly classified as a time series that analysis can be run on, three key requirements must be met.

  1. The response variable (e.g., the count of daily page views) must be clearly defined and calculated for each unit of time in the series.

  2. The time series index must be consistent and regular, with only one record per time index.

  3. There must be no implicit missing rows – i.e., the time series index must not ‘skip’ values.

Within the page view dataset, we have data ranging from 2015-07-01 to 2026-05-03. Let’s go ahead and write that dataset into a tsibble and check whether it meets those requirements.

# Convert the Date column from character to a proper Date class,
# drop the original character column, and remove any duplicate rows
cd_page_views <- cd_page_views |>
  mutate(Date_index = as.Date(Date, format = "%m/%d/%Y")) |>
  select(-Date) |>
  distinct()

summary(cd_page_views)
##      count          Date_index        
##  Min.   :  50.0   Min.   :2015-07-01  
##  1st Qu.: 196.0   1st Qu.:2018-03-16  
##  Median : 365.0   Median :2020-11-30  
##  Mean   : 409.1   Mean   :2020-11-30  
##  3rd Qu.: 610.2   3rd Qu.:2023-08-17  
##  Max.   :1452.0   Max.   :2026-05-03
# Construct a tsibble indexed by date
page_views_ts <- as_tsibble(cd_page_views, index = Date_index) 

# Check for any gaps
count_gaps(page_views_ts)
## # A tibble: 0 × 3
## # ℹ 3 variables: .from <date>, .to <date>, .n <int>
# Check for any missing (NA) values
sum(is.na(page_views_ts$count))
## [1] 0
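
Both checks come back clean: there are no gaps in the index and no missing counts, so the requirements above are satisfied. If gaps had turned up, tsibble can make the implicit missing rows explicit; the following is a minimal sketch (not needed for this dataset) of how that could look, with the forward-fill strategy being an assumption for illustration only:

# Sketch only - count_gaps() returned zero rows, so this is not required here.
# fill_gaps() inserts a row for every missing date in the index, leaving the
# measurement as NA so that gaps become explicit and easy to handle.
page_views_filled <- page_views_ts |>
  fill_gaps()

# One simple (assumed) strategy: carry the last observed count forward
page_views_filled <- page_views_filled |>
  tidyr::fill(count, .direction = "down")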

03 - Decomposing the Time Series

Since a time series is composed of three distinct parts – (1) trend, (2) seasonality, and (3) residuals – if we formally decompose the time series (as opposed to just visualizing it), we may be able to surface the seasonal effect better.
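
A formal decomposition could look like the following minimal sketch, assuming the feasts package is installed; it works directly on tsibble objects and attaches fabletools, which supplies model() and components():

library(feasts)  # assumed available; provides STL() for tsibble objects

# Decompose daily page views into trend, seasonal, and remainder components
page_views_stl <- page_views_ts |>
  model(STL(count ~ trend() + season(window = "periodic"))) |>
  components()

autoplot(page_views_stl)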

This leads me to conclude that we may be able to nest all three of these seasonal patterns into a model to get a more holistic picture of how the data are moving. We’ll verify and build a stronger case by checking an ACF plot.

acf(
  page_views_ts$count,
  ci = 0.95,
  na.action = na.exclude,
  lag.max = 400,
  main = "ACF Plot for Page Views"
)

Here we can clearly see evidence of autocorrelation in the data, where page view values appear highly dependent on prior values. As noted above, this suggests seasonality, and the slow decline across lags points to the long-term trend we also observed in the data.
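
As a complementary check, a partial autocorrelation plot isolates the direct dependence on each lag after controlling for all shorter lags, which can help separate trend carry-over from genuine weekly structure; this is an optional sketch rather than part of the lab requirements:

# Sketch: PACF for the same series - pronounced spikes at or near lag 7
# would reinforce the weekly seasonality suggested by the ACF plot above.
pacf(
  page_views_ts$count,
  na.action = na.exclude,
  lag.max = 60,
  main = "PACF Plot for Page Views"
)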

Let’s try fitting a multiple linear regression based on these findings:

# Page Views Dataframe for Modeling
page_views_model_df <- page_views_ts |>
  as_tibble() |>
  arrange(Date_index) |>
  mutate(
    time_index = row_number(),
    day_of_week = wday(Date_index, label = TRUE),
    month = month(Date_index, label = TRUE),
    year = year(Date_index)
  )

page_views_lm <- lm(
  count ~ time_index + day_of_week + month,
  data = page_views_model_df
)

summary(page_views_lm)
## 
## Call:
## lm(formula = count ~ time_index + day_of_week + month, data = page_views_model_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -336.89  -61.30   -0.60   60.52  915.27 
## 
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)    7.750e+02  3.254e+00  238.169  < 2e-16 ***
## time_index    -1.847e-01  1.423e-03 -129.816  < 2e-16 ***
## day_of_week.L -1.551e+01  4.293e+00   -3.614 0.000306 ***
## day_of_week.Q -1.656e+02  4.293e+00  -38.583  < 2e-16 ***
## day_of_week.C  1.397e+01  4.294e+00    3.254 0.001149 ** 
## day_of_week^4 -2.670e+01  4.294e+00   -6.219 5.54e-10 ***
## day_of_week^5  6.656e+00  4.294e+00    1.550 0.121194    
## day_of_week^6  1.097e+00  4.293e+00    0.255 0.798386    
## month.L       -6.681e+01  5.597e+00  -11.936  < 2e-16 ***
## month.Q        5.306e+01  5.610e+00    9.459  < 2e-16 ***
## month.C        8.417e+00  5.578e+00    1.509 0.131376    
## month^4       -5.749e+01  5.608e+00  -10.251  < 2e-16 ***
## month^5       -7.842e+00  5.652e+00   -1.388 0.165333    
## month^6        2.570e+01  5.666e+00    4.536 5.91e-06 ***
## month^7        9.928e+00  5.630e+00    1.764 0.077874 .  
## month^8       -4.145e+00  5.611e+00   -0.739 0.460129    
## month^9       -1.731e+00  5.623e+00   -0.308 0.758292    
## month^10       2.348e+00  5.640e+00    0.416 0.677262    
## month^11      -6.369e+00  5.688e+00   -1.120 0.262965    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 102.1 on 3941 degrees of freedom
## Multiple R-squared:  0.8251, Adjusted R-squared:  0.8243 
## F-statistic:  1033 on 18 and 3941 DF,  p-value: < 2.2e-16
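
One quick way to gauge how much temporal structure this regression leaves behind (a sanity-check sketch, not a formal test) is to look at the autocorrelation of its residuals:

# If the trend and seasonal terms captured most of the structure, the
# residual ACF should be much weaker than the ACF of the raw series.
acf(
  residuals(page_views_lm),
  lag.max = 60,
  main = "ACF of Regression Residuals"
)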