Topic Choice

For this assignment, I could not use my original dataset because it has no time component. I had a hard time finding a Wikipedia page that was directly relevant to my dataset topic, so I ended up choosing “SAT” (Scholastic Aptitude Test). My data spans roughly 7 years (2019-2026).

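library(tidyverse)  # dplyr, ggplot2, lubridate (assumes tidyverse >= 2.0)
library(tsibble)    # as_tsibble(), fill_gaps(), index_by()
library(xts)        # xts()
SAT = read.csv("SAT Page Views.csv") |>
  rename(date = 1, views = 2)
SAT$date = as.Date(SAT$date, format = "%m/%d/%Y")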

Create the tsibble object

SAT_ts = as_tsibble(SAT, index = date)
SAT_ts = fill_gaps(SAT_ts)

Create the xts object

SAT_xts = xts(SAT_ts$views,
              order.by = SAT_ts$date,
              frequency = 7)
SAT_xts = setNames(SAT_xts, "views")

Visualize the raw time series

ggplot(SAT_ts, aes(x = date, y = views)) +
  geom_line(linewidth = 0.3) +
  labs(title = "SAT Wikipedia Page Views Over Time",
       x = "Date",
       y = "Page Views")

Plotting the SAT page views over time shows a few large spikes. Most notably, there is a spike near 15,000 just a few months before the COVID pandemic started in March of 2020, with a few spikes above 5,000 around it. There are also some fairly regular spikes in the early part of the series that don’t show up later on. Even without a linear regression, a general decreasing trend is visible, but we’ll fit one to double check.

ggplot(SAT_ts, aes(x = date, y = views)) +
  geom_line(linewidth = 0.3) +
  geom_smooth(method = "lm", color = "cyan", se = FALSE, linewidth = 1) +
  labs(title = "SAT Wikipedia Page Views Over Time",
       x = "Date",
       y = "Page Views")

summary(lm(views ~ date, data = SAT_ts))
## 
## Call:
## lm(formula = views ~ date, data = SAT_ts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -795.3  -208.7   -53.3   115.3 12040.6 
## 
## Coefficients:
##                Estimate  Std. Error t value Pr(>|t|)    
## (Intercept) 15559.60937   261.70598   59.45   <2e-16 ***
## date           -0.72228     0.01357  -53.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 506.7 on 2555 degrees of freedom
## Multiple R-squared:  0.5256, Adjusted R-squared:  0.5254 
## F-statistic:  2831 on 1 and 2555 DF,  p-value: < 2.2e-16

Sure enough, when plotting a linear regression on top of the raw data, we can see that there is a clear decreasing trend with time. According to the regression line, daily SAT page views have decreased from about 2,500 to less than 1,000 over the roughly 7-year period.
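To put the slope in more familiar units, the fitted coefficient can be converted to a yearly change and a change over the whole period. This is a quick arithmetic sketch that uses only numbers copied from the summary output above. The variable names are made up for illustration, and the span is taken as the number of observations minus one day, which assumes one row per day with no gaps.

```r
# Values copied from the lm() summary above
slope_per_day = -0.72228   # estimated change in daily views per calendar day
n_obs = 2555 + 2           # residual degrees of freedom + 2 coefficients

# Convert the daily slope to a yearly change and a whole-period change
slope_per_year = slope_per_day * 365.25
total_change = slope_per_day * (n_obs - 1)

round(slope_per_year)  # -264 views per year
round(total_change)    # -1846 views over the full span
```

A fitted drop of roughly 1,850 daily views is consistent with the line falling from about 2,500 to well under 1,000.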

We could consider subsetting the data to fit multiple trends, for example one for the noisier (left/earlier) section and another for the calmer (right/later) section. However, the single trend doesn’t do badly, and fitting multiple trends might introduce complexity that isn’t necessary. A quick look at the linear regression summary shows an R^2 of about 0.53, which is quite high for a noisy time series like this. I want to note that, although the regression performs well over the whole series, this may partly reflect the change from high day-to-day variation early on to much lower variation later.
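One way to check the difference in variation is to compare the spread of the regression residuals in the first and second halves of the series. The sketch below assumes a data frame sorted by date with `date` and `views` columns and no missing views. `spread_by_half` is a hypothetical helper, and splitting at the midpoint is an arbitrary choice rather than a fitted breakpoint.

```r
# Hypothetical helper: standard deviation of the linear-trend residuals
# in the first half ("early") and second half ("late") of the series.
spread_by_half = function(df) {
  fit = lm(views ~ date, data = df)
  res = residuals(fit)
  # Label each residual by its position relative to the midpoint
  half = ifelse(seq_along(res) <= length(res) / 2, "early", "late")
  tapply(res, half, sd)
}

# Possible usage with the object built earlier (assumes no NA in views):
# spread_by_half(SAT_ts)
```

If the "early" value is much larger than the "late" value, that supports the impression that the first part of the series is noisier.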

Checking for seasonality

SAT_ts |>
  index_by(month = floor_date(date, "month")) |>
  summarise(avg_views = mean(views)) |>
  ggplot(aes(x = month, y = avg_views)) +
  geom_line() +
  geom_smooth(method = "loess", span = 0.1, color = "cyan", se = FALSE) +
  labs(title = "Average Monthly Page Views Over Time")

Use ACF

acf(SAT_xts,
    lag.max = 30,
    main = "ACF for SAT Page Views (lag = 30)")

acf(SAT_xts,
    lag.max = 365,
    main = "ACF for SAT Page Views (lag = 365)")

The monthly line plot reveals some sort of seasonal pattern roughly every 6 months. We can see this in the waviness within the overall downward trend. This roughly twice-yearly cycle may line up with the fall and spring SAT dates. The ACF plot with lag.max = 365 loosely supports this, with a slight spike around 210 days (about 7 months). There is also a small spike at about 35 days (about 5 weeks), which may line up with registration deadlines.

We can also look at the ACF plot with lag.max = 30 to look for shorter seasonal patterns. It’s a little challenging to see, but there may be a consistent rhythm at multiples of 7 days. This could indicate that there are a few more page views on weekdays than on weekends (or vice versa).
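A direct way to see which direction a weekly pattern runs is to average views by day of the week. The sketch below is only an illustration: `mean_views_by_weekday` is a hypothetical helper, and it uses the `%u` date format (ISO weekday number, 1 = Monday through 7 = Sunday) so the result does not depend on the locale.

```r
# Hypothetical helper: mean page views for each ISO weekday
# (names "1" = Monday ... "7" = Sunday).
mean_views_by_weekday = function(date, views) {
  dow = as.integer(format(date, "%u"))
  tapply(views, dow, mean, na.rm = TRUE)
}

# Possible usage with the object built earlier:
# mean_views_by_weekday(SAT_ts$date, SAT_ts$views)
```

If the values for days 6 and 7 sit clearly below the others, the 7-day rhythm would reflect higher weekday traffic; the opposite pattern would point to weekends.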

Overall, there could be three different seasonal patterns within this time series, but none of them is very strong. The strong downward trend in the data could be making the signals in the ACF plots harder to detect.
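One way to reduce the influence of the trend is to compute the ACF on the residuals of a linear fit instead of on the raw views. This was not part of the analysis above. It is a minimal sketch, assuming a numeric vector of daily views with no missing values; `detrended_acf` is a hypothetical helper name, and a straight line is only one of several possible ways to detrend.

```r
# Hypothetical helper: remove a straight-line trend, then compute the ACF
# of what is left. Returns the acf object without drawing a plot.
detrended_acf = function(views, lag.max = 30) {
  day_index = seq_along(views)
  res = residuals(lm(views ~ day_index))
  acf(res, lag.max = lag.max, plot = FALSE)
}

# Possible usage with the object built earlier (assumes no NA in views):
# plot(detrended_acf(SAT_ts$views, lag.max = 365))
```

In the returned object, `$acf[1]` is lag 0, so the value for a 7-day lag is `$acf[8]`.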

These results suggest that student interest in academics may be higher on weekdays than on weekends (the 7-day lag). They may also show that student interest in standardized testing has decreased considerably since COVID, which might align with general changes in student sentiment towards these exams.