Time Series Data Dive

Libraries

library(tidyverse)
library(ggthemes)
library(ggrepel)
library(xts)
library(tsibble)

Note: my dataset only has five years that are aggregated. So it’s only five unique data points and it wouldn’t make sense to convert them to date format.

So I’m using a Wikipedia page related to my dataset and then do time series analysis on page views.

Our data is from mid 2015 to end of 2019.

data <- read_delim("./page_views_sports.csv", delim = ',')
## Rows: 1646 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (1): List of college athletic programs in Indiana
## date (1): Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data <- data |>
  select(Date, `List of college athletic programs in Indiana`) |>
  rename(date = Date,
         page_views = `List of college athletic programs in Indiana`)
data_ts <- as_tsibble(data, index = date)
data_ts |>
  ggplot() +
  geom_line(mapping = aes(x = date, y = page_views), linewidth=0.3) +
  labs(title = 'Sports page views on Wikipedia') +
  theme_hc()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

It looks like it trends upwards until about 2018 and then begins to trend downwards. Let’s try rolling average to see the data more clearly.

# xts object
data_xts <- xts(x = data_ts$page_views, 
                  order.by = data_ts$date)

data_xts <- setNames(data_xts, "page_views")
#rolling average
data_xts |>
  rollapply(width = 30, \(x) mean(x, na.rm = TRUE), fill = FALSE) |>
  ggplot(mapping = aes(x = Index, y = page_views)) +
  geom_line() +
  labs(title = "Page Views Over Time",
       subtitle = "Monthly Rolling Average") +
  theme_hc()

In this averaged out graph, we can already see some seasonality. Views seem to go up in the beginning of year and then dip in the middle of a year.

Seasonality (Initial Observation)

data_ts |>
  filter_index("2016" ~ "2018") |> #subsetting data
  drop_na() |>
  ggplot(mapping = aes(x = date, y = page_views)) +
  geom_point(size=1, shape='O') +
  geom_smooth(span=0.2, color = 'blue', se=FALSE) +
  labs(title = "page views from 2016 to 2018") +
  theme_hc()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Okay now we’re seeing seasonality more clearly. It looks like in the beginning of a year, page views are higher and then they dip in the middle. For instance, in start of 2017 we can see a trough and then a valley in mid-2017. And this seems to keep repeating.

Now let’s explore overall trend or trends in subset of data using linear regression.

data_ts |>
  filter_index("2016" ~ "2018-04") |>
  # filter_index("2018" ~ "2019-12") |>
  ggplot(mapping = aes(x = date, y = page_views)) +
  geom_line() +
  geom_smooth(method = 'lm', color = 'red', se=FALSE) +
  labs(title = "page views from 2016-2018") +
  theme_hc()
## `geom_smooth()` using formula = 'y ~ x'

Okay we’re definitely seeing a positive trend from 2016 to 2018 in page views as I thought initially. Now let’s subset to 2018 and later to see if there is a downtrend.

After sub-setting to 2018 and later, we’re seeing a downtrend. We saw these trends when we first looked at the data and now linear regression line is making it clear.

Linear Regression (Trend)

data_ts1 <- data_ts |>
  filter_index("2016" ~ "2018-04")

data_ts2 <- data_ts |>
  filter_index("2018" ~ "2019-12")
  
  
lm_model <- lm(page_views ~ date, data = data_ts2)
summary(lm_model)
## 
## Call:
## lm(formula = page_views ~ date, data = data_ts2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.054 -15.777  -2.721  12.707 244.407 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 503.59539   80.71496   6.239 7.46e-10 ***
## date         -0.02351    0.00451  -5.214 2.42e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.68 on 728 degrees of freedom
## Multiple R-squared:  0.03599,    Adjusted R-squared:  0.03467 
## F-statistic: 27.18 on 1 and 728 DF,  p-value: 2.416e-07

Slope for 2016-2018: 0.02366

Slope for 2018-2020 : -0.02351

So absolute slope for first group is slightly stronger.

We could already see a clear seasonality in our earlier graph, but we can try looking at our data using quarter year grouping.

Seasonality (Continued)

data_ts |>
  index_by(year = floor_date(date, 'quarter')) |>
  summarise(avg_page_views = mean(page_views, na.rm = TRUE)) |>
  ggplot(mapping = aes(x = year, y = avg_page_views)) +
  geom_line() +
  # geom_smooth(span = 0.3, color = 'blue', se=FALSE, ) +
  labs(title = "Average Page Views Over Time",
       subtitle = "(by quarter year)") +
  scale_x_date(breaks = "1 year", labels = \(x) year(x)) +
  theme_hc()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

We can again see the pattern closely resembling the sine curve. Page views keep rising and falling in a 5-6 month time frame.

Autocorrelation

acf(data_ts, ci = 0.95, na.action = na.exclude)

As the lag increases, correlation goes down. In other words, each value is more correlated to closer past than distant past.

Partial Correlation

pacf(data_xts, na.action = na.exclude, 
     xlab = "Lag", main = "PACF for Page Views")

From this plot, we can infer that each value is significantly correlated with its closer past, and correlation becomes insignificant later.

Overall:

The first thing we noticed about our data was an uptrend from 2016 to 2018 and then a downtrend from 2018 and onward. Then we used rolling average to detect seasonality. We observed repeating pattern in that plot, page views seemed to go higher in the beginning of the year and drop in the middle. We also looked at the trends using linear regression to clear our initial skepticism. We saw both positive (2016-2018) and negative trends (2018-2020). Then we clarified seasonality again by plotting quarterly mean data. And lastly even though seasonality wasn’t clear in ACF and PACF, we still saw statistically significant correlations in ACF throughout the time frame while PACF only showed significant correlations in the beginning.

It’s difficult to connect these findings to my sports dataset I’ve been using in the class because these were just page views, however I observed that revenue of some universities like Ball State followed similar patterns here. But that could be coincidence. That’s all!