Start by setting up the packages to manipulate data.
suppressPackageStartupMessages({
library(tidyverse)
library(rio)
library(tsibble)
source("aptheme.R") #Code that helps format graphs
})
Read in the data (this time page views for Concussions wikipedia page)
data <- import("pageviews-Concussion.csv")
Transform into tsibble
#No need to backfill, since we have a record for every date already in the data
data_ts <- as_tsibble(data, index=Date) %>%
index_by(date = date(Date)) %>%
rename(views = "Concussion") %>%
mutate(in_season = ifelse(months(date) %in% c("September", "October", "November", "December", "January", "February"), "In Season", "Out of Season"))
Plot the data
ggplot(data = data_ts, aes(x = date, y = views)) +
geom_point(aes(color = in_season)) +
labs(title = "Page Views of Concussion Wiki",
x = "Date",
y = "Views") +
theme_ap(family = "sans") +
theme(legend.position = "right")
It doesn’t look like there is an overall trend of the data (ie overall trending up or down in page views), but there is an interesting relationship looking at the data in season and out of football season. Looking here at the data limited to just 2020.
data_ts %>%
filter(date <= ymd("2021-01-01")) %>%
ggplot( aes(x = date, y = views)) +
geom_point(aes(color = in_season)) +
labs(title = "Page Views of Concussion Wiki",
x = "Date",
y = "Views") +
theme_ap(family = "sans") +
theme(legend.position = "right")
Here we can see the cycle a little clearer. During the off season, the page views for concussions goes down but it picks back up once the football season starts again. Now we fit a regression.
model1 <- lm(views ~ date, data = data_ts)
model2 <- lm(views ~ date + in_season, data = data_ts)
summary(model1)
##
## Call:
## lm(formula = views ~ date, data = data_ts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -232.0 -69.1 -15.3 40.1 3224.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 870.785057 83.660257 10.41 < 2e-16 ***
## date -0.016889 0.004309 -3.92 9.13e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 136.1 on 2286 degrees of freedom
## Multiple R-squared: 0.006676, Adjusted R-squared: 0.006242
## F-statistic: 15.36 on 1 and 2286 DF, p-value: 9.127e-05
summary(model2)
##
## Call:
## lm(formula = views ~ date + in_season, data = data_ts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -209.4 -65.1 -15.5 43.2 3189.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 968.117502 81.225540 11.919 < 2e-16 ***
## date -0.020103 0.004172 -4.818 1.54e-06 ***
## in_seasonOut of Season -70.110052 5.511571 -12.721 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 131.6 on 2285 degrees of freedom
## Multiple R-squared: 0.07237, Adjusted R-squared: 0.07155
## F-statistic: 89.13 on 2 and 2285 DF, p-value: < 2.2e-16
Here we can see a couple of interesting things. There is a very slight negative trend of views over time. However, the more interesting piece is that out of season, the Concussion wikipedia page has 70 fewer page views than when its in season.
month_data <- data_ts %>%
as.data.frame() %>%
group_by(month_year = format(ymd(date), "%Y-%m")) %>%
summarise(views = sum(views)) %>%
distinct()
month_data
## # A tibble: 76 × 2
## month_year views
## <chr> <int>
## 1 2020-01 25104
## 2 2020-02 19024
## 3 2020-03 16047
## 4 2020-04 15995
## 5 2020-05 14587
## 6 2020-06 13115
## 7 2020-07 13340
## 8 2020-08 15046
## 9 2020-09 15696
## 10 2020-10 15463
## # ℹ 66 more rows
acf(month_data, ci = 0.95, na.action = na.exclude)
Looking at this, we can see a little bit of a season, but it’s not entierly clear. Because the football season is a very specific part of the year, there non-obvious monthly period makes sense. If the season was 6 months, then you might see a spike at lag 6, but that is not the case. This all suggests that the football season is a potential driver of page views on the concussions wikipedia page.
This is a very premilinary analysis, and it’s possible there is another driving force here, however these initial results should give us pause. It looks like people are looking at the concussion page more when football is being played. If an injury is so tied to football that it drives people to read about concussions, the sport needs to do quite a bit of work to try and prevent concussions in the game.