For this assignment, I was unable to use my original dataset because it has no time element. I had a hard time finding a Wikipedia page that was directly relevant to my dataset topic, but I ended up choosing “SAT” (Scholastic Aptitude Test). My data covers daily page views over roughly 7 years (2019-2026).
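library(dplyr)      # rename(), summarise()
library(tsibble)    # as_tsibble(), fill_gaps(), index_by()
library(xts)        # xts()
library(ggplot2)    # plotting
library(lubridate)  # floor_date()
# Read the page view export and name the first two columns by position
SAT = read.csv("SAT Page Views.csv") |>
  rename(date = 1, views = 2)
SAT$date = as.Date(SAT$date, format = "%m/%d/%Y")
# Daily tsibble; fill_gaps() adds an NA row for any missing day
SAT_ts = as_tsibble(SAT, index = date)
SAT_ts = fill_gaps(SAT_ts)
# xts copy of the series, used for the ACF plots
SAT_xts = xts(SAT_ts$views,
              order.by = SAT_ts$date,
              frequency = 7)
SAT_xts = setNames(SAT_xts, "views")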
ggplot(SAT_ts, aes(x = date, y = views)) +
  geom_line(linewidth = 0.3) +
  labs(title = "SAT Wikipedia Page Views Over Time",
       x = "Date",
       y = "Page Views")
When plotting the SAT page views over time, we can see a few significant spikes. Most notably, there was a spike near 15,000 views just a few months before the COVID pandemic started in March of 2020, with a few smaller spikes above 5,000 around it. There were also fairly regular spikes in the early part of the time series that don’t show up later on. Even without a linear regression, a general decreasing trend is visible, but we’ll fit one to make sure.
ggplot(SAT_ts, aes(x = date, y = views)) +
  geom_line(linewidth = 0.3) +
  # Overlay the least-squares trend line
  geom_smooth(method = "lm", color = "cyan", se = FALSE, linewidth = 1) +
  labs(title = "SAT Wikipedia Page Views Over Time",
       x = "Date",
       y = "Page Views")
summary(lm(views ~ date, data = SAT_ts))
##
## Call:
## lm(formula = views ~ date, data = SAT_ts)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
##  -795.3  -208.7   -53.3   115.3 12040.6
##
## Coefficients:
##                Estimate  Std. Error t value Pr(>|t|)
## (Intercept) 15559.60937   261.70598   59.45   <2e-16 ***
## date           -0.72228     0.01357  -53.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 506.7 on 2555 degrees of freedom
## Multiple R-squared:  0.5256, Adjusted R-squared:  0.5254
## F-statistic:  2831 on 1 and 2555 DF,  p-value: < 2.2e-16
Sure enough, when plotting a linear regression on top of the raw data, we can see that there is a clear decreasing trend with time. The estimated slope is about -0.72 views per day, or roughly 264 fewer daily views each year. According to the regression line, daily SAT page views have decreased from about 2,500 to less than 1,000 over the roughly 7 year time period.
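As a rough check on these numbers, the fitted values can be recomputed from the coefficients in the summary. In lm(), a Date predictor is treated as the number of days since 1970-01-01, which is why the intercept (about 15,560) is not itself a realistic page view count. This is only a sketch: the start date below is an assumption (the exact first date isn't stated here), and the series length is taken from the 2,555 residual degrees of freedom plus 2 coefficients.

```r
# Coefficients copied from the regression summary
intercept <- 15559.60937
slope     <- -0.72228          # change in views per day

# ASSUMPTION: the series starts on 2019-01-01 (hypothetical start date)
start <- as.Date("2019-01-01")
n_obs <- 2555 + 2              # residual df + 2 estimated coefficients
end   <- start + (n_obs - 1)

# A Date is stored as days since 1970-01-01, which is the scale lm() used
fitted_start <- intercept + slope * as.numeric(start)
fitted_end   <- intercept + slope * as.numeric(end)

# Slope expressed per year instead of per day
slope_per_year <- slope * 365.25

print(c(start = fitted_start, end = fitted_end, per_year = slope_per_year))
```

With a different start date the endpoints shift a little, but the per-year slope does not depend on that assumption.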
We could consider subsetting the data for multiple trends. An option could be to try and capture the noisier (left/earlier) section in one trend and the calmer (right/later) section in another. The issue is that the single trend doesn’t really do that badly. If we did multiple trends, we might be introducing complexity that isn’t necessary. Additionally, a quick look at the linear regression summary shows that the R^2 value is about 0.53, which is quite high for a noisy time series like this. I want to note one caveat: the fit is summarized over the whole series, but the day-to-day variation is high early on and much lower later, so the regression likely describes the later period more tightly than the earlier one.
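One way to weigh the single trend against two trends is to fit both and compare them. The sketch below uses a synthetic stand-in series, not the SAT data, and the midpoint breakpoint is a hypothetical choice for illustration. It also splits the single-trend residuals by segment to show the uneven spread.

```r
set.seed(1)
n   <- 1000
day <- 1:n

# Synthetic stand-in (NOT the SAT data): one linear trend, noisier early on
noise_sd <- ifelse(day <= n / 2, 400, 100)
toy <- data.frame(
  day     = day,
  views   = 2500 - 1.8 * day + rnorm(n, sd = noise_sd),
  # ASSUMPTION: hypothetical breakpoint at the midpoint of the series
  segment = factor(ifelse(day <= n / 2, "early", "late"))
)

# One trend for the whole series
single <- lm(views ~ day, data = toy)
# Separate intercept and slope for each segment
two <- lm(views ~ day * segment, data = toy)

# Adjusted R-squared penalizes the extra parameters of the two-trend model
adj_r2 <- c(single = summary(single)$adj.r.squared,
            two    = summary(two)$adj.r.squared)
print(round(adj_r2, 3))

# Residual spread of the single-trend fit within each segment
resid_sd <- tapply(residuals(single), toy$segment, sd)
print(round(resid_sd, 1))
```

If the adjusted R^2 barely moves between the two models, the extra trend is probably not worth the added complexity, even when the residual spread clearly differs between segments.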
SAT_ts |>
  # Group days into calendar months and average the views within each month
  index_by(month = floor_date(date, "month")) |>
  summarise(avg_views = mean(views, na.rm = TRUE)) |>  # skip any days filled in as NA
  ggplot(aes(x = month, y = avg_views)) +
  geom_line() +
  geom_smooth(span = 0.1, color = "cyan", se = FALSE) +
  labs(title = "Average Monthly Page Views Over Time")
acf(SAT_xts,
    lag.max = 30,
    main = "ACF for SAT Page Views (lag = 30)")
acf(SAT_xts,
    lag.max = 365,
    main = "ACF for SAT Page Views (lag = 365)")
The monthly line plot shows a seasonal pattern that repeats roughly every 6 months, visible as waviness within the overall downward trend. This roughly twice-a-year cycle may line up with the fall and spring SAT dates. The ACF plot out to lag 365 loosely supports this, with a slight spike around 210 days (about 7 months). There is also a small spike at about 35 days (about 5 weeks), which may line up with registration deadlines.
We can also look at the ACF plot out to lag 30 to look for shorter seasons. It’s a little bit challenging to see, but there may be a consistent rhythm at lags that are multiples of 7 days. This could indicate that there are a few more page views on weekdays versus weekends (or vice versa).
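A direct way to see which days are higher is to average the views by day of the week. This is a minimal sketch on synthetic data (not the SAT series), where I build in a weekday bump on purpose; the dates, the bump size, and the variable names are all assumptions for illustration.

```r
set.seed(2)

# Synthetic stand-in (NOT the SAT data): one year of daily dates
date <- seq(as.Date("2019-01-07"), by = "day", length.out = 364)

# Day of week from POSIXlt: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
wday       <- as.POSIXlt(date)$wday
is_weekend <- wday %in% c(0, 6)

# ASSUMPTION: weekdays get about 200 extra views in this toy series
views <- 1500 + ifelse(is_weekend, 0, 200) + rnorm(length(date), sd = 50)

# Mean views for each day of the week
dow_means <- tapply(views, wday, mean)
print(round(dow_means))

# Weekday versus weekend averages
weekday_mean <- mean(views[!is_weekend])
weekend_mean <- mean(views[is_weekend])
print(c(weekday = weekday_mean, weekend = weekend_mean))
```

On the real series the same grouping would show the direction of the weekly pattern, which the ACF by itself does not.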
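Overall, there could be three different seasons within this time series (about 7 days, about 5 weeks, and about 6 to 7 months), but none of them are very strong. The strong downward trend of the data could be making the seasonal signals in the ACF plots harder to detect, since a trend keeps the autocorrelation high at every lag.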
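One possible follow-up is to remove the linear trend first and then look at the ACF of what is left. The sketch below shows the idea on a synthetic series (not the SAT data) with a downward trend and a built-in 7 day cycle; all of the numbers in it are assumptions for illustration.

```r
set.seed(3)
n   <- 700
day <- 1:n

# Synthetic stand-in (NOT the SAT data): downward trend + weekly cycle + noise
views <- 2500 - 2 * day + 100 * sin(2 * pi * day / 7) + rnorm(n, sd = 50)

# ACF of the raw series; plot = FALSE returns the values without drawing
acf_raw <- acf(views, lag.max = 30, plot = FALSE)$acf[, 1, 1]

# Remove the linear trend, then recompute the ACF on the residuals
detrended <- residuals(lm(views ~ day))
acf_det   <- acf(detrended, lag.max = 30, plot = FALSE)$acf[, 1, 1]

# Element 1 is lag 0, so lag k is stored in element k + 1
lags <- c(3, 7)
comparison <- rbind(raw = acf_raw[lags + 1], detrended = acf_det[lags + 1])
colnames(comparison) <- paste0("lag", lags)
print(round(comparison, 2))
```

In the raw series the trend keeps every lag strongly positive, while the detrended series lets the weekly cycle stand out. Differencing would be another option, but I have not tried either on the SAT data here.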
These results may suggest that student interest in academics is higher on weekdays than on weekends (the 7 day lag), although the ACF alone cannot tell us which days are higher. They may also show that student interest in standardized testing has greatly decreased since COVID, which might align with general changes in student sentiment towards these exams.