Data Dive Week 12 - Time Series

Start by setting up the packages to manipulate data.

suppressPackageStartupMessages({
  library(tidyverse)
  library(rio)
  library(tsibble)
  source("aptheme.R") #Code that helps format graphs
  })

Read in the data (this time page views for Concussions wikipedia page)

data <- import("pageviews-Concussion.csv")

Transform into tsibble

#No need to backfill, since we have a record for every date already in the data 
data_ts <- as_tsibble(data, index=Date) %>%
  index_by(date = date(Date)) %>%
  rename(views = "Concussion") %>%
  mutate(in_season = ifelse(months(date) %in% c("September", "October", "November", "December", "January", "February"), "In Season", "Out of Season"))

Plot the data

ggplot(data = data_ts, aes(x = date, y = views)) + 
  geom_point(aes(color = in_season)) + 
  labs(title = "Page Views of Concussion Wiki",
       x = "Date",
       y = "Views") + 
  theme_ap(family = "sans") + 
  theme(legend.position = "right")

It doesn’t look like there is an overall trend of the data (ie overall trending up or down in page views), but there is an interesting relationship looking at the data in season and out of football season. Looking here at the data limited to just 2020.

data_ts %>%
  filter(date <= ymd("2021-01-01")) %>%
  ggplot( aes(x = date, y = views)) + 
  geom_point(aes(color = in_season)) + 
  labs(title = "Page Views of Concussion Wiki",
       x = "Date",
       y = "Views") + 
  theme_ap(family = "sans") + 
  theme(legend.position = "right")

Here we can see the cycle a little clearer. During the off season, the page views for concussions goes down but it picks back up once the football season starts again. Now we fit a regression.

model1 <- lm(views ~ date, data = data_ts)
model2 <- lm(views ~ date + in_season, data = data_ts)

summary(model1)
## 
## Call:
## lm(formula = views ~ date, data = data_ts)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -232.0  -69.1  -15.3   40.1 3224.6 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 870.785057  83.660257   10.41  < 2e-16 ***
## date         -0.016889   0.004309   -3.92 9.13e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 136.1 on 2286 degrees of freedom
## Multiple R-squared:  0.006676,   Adjusted R-squared:  0.006242 
## F-statistic: 15.36 on 1 and 2286 DF,  p-value: 9.127e-05
summary(model2)
## 
## Call:
## lm(formula = views ~ date + in_season, data = data_ts)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -209.4  -65.1  -15.5   43.2 3189.2 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            968.117502  81.225540  11.919  < 2e-16 ***
## date                    -0.020103   0.004172  -4.818 1.54e-06 ***
## in_seasonOut of Season -70.110052   5.511571 -12.721  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 131.6 on 2285 degrees of freedom
## Multiple R-squared:  0.07237,    Adjusted R-squared:  0.07155 
## F-statistic: 89.13 on 2 and 2285 DF,  p-value: < 2.2e-16

Here we can see a couple of interesting things. There is a very slight negative trend of views over time. However, the more interesting piece is that out of season, the Concussion wikipedia page has 70 fewer page views than when its in season.

month_data <- data_ts %>%
  as.data.frame() %>%
  group_by(month_year = format(ymd(date), "%Y-%m")) %>%
  summarise(views = sum(views)) %>%
  distinct()
month_data
## # A tibble: 76 × 2
##    month_year views
##    <chr>      <int>
##  1 2020-01    25104
##  2 2020-02    19024
##  3 2020-03    16047
##  4 2020-04    15995
##  5 2020-05    14587
##  6 2020-06    13115
##  7 2020-07    13340
##  8 2020-08    15046
##  9 2020-09    15696
## 10 2020-10    15463
## # ℹ 66 more rows
acf(month_data, ci = 0.95, na.action = na.exclude)

Looking at this, we can see a little bit of a season, but it’s not entierly clear. Because the football season is a very specific part of the year, there non-obvious monthly period makes sense. If the season was 6 months, then you might see a spike at lag 6, but that is not the case. This all suggests that the football season is a potential driver of page views on the concussions wikipedia page.

This is a very premilinary analysis, and it’s possible there is another driving force here, however these initial results should give us pause. It looks like people are looking at the concussion page more when football is being played. If an injury is so tied to football that it drives people to read about concussions, the sport needs to do quite a bit of work to try and prevent concussions in the game.