Introduction

This week, we explore time-based trends in page views related to our dataset. Since our original dataset doesn’t include a time column, we use Wikipedia page views for “Sleep deprivation” as a proxy for public interest in this topic over time.

Goals:

  • Convert Wikipedia data into a time series format.
  • Detect any long-term trends using regression.
  • Explore seasonal patterns using smoothing and ACF.
  • Compare patterns with related topics (e.g., stress, screen time).
  • Interpret how time-based interest reflects broader behavioral patterns.


Step 1: Collect Time-Based Data

We pull daily Wikipedia page view data for the article “Sleep_deprivation” and for three comparison articles, using wp_trend() from the wikipediatrend package.

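# Packages this analysis appears to rely on (the original setup chunk is not shown)
library(wikipediatrend)  # wp_trend()
library(dplyr)
library(tsibble)         # as_tsibble()
library(fable)           # TSLM(), ETS()
library(feasts)          # ACF()
library(ggplot2)
library(scales)          # comma
library(lubridate)       # year()
library(vars)            # VAR(); loaded last so VAR() resolves to vars, not fable.
                         # vars attaches MASS, hence the explicit dplyr::select() below.

# Primary page: Sleep Deprivation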
views_sd <- wp_trend(
  page = "Sleep_deprivation",
  from = "2020-01-01", 
  to = "2023-12-31", 
  lang = "en"
)

# Additional terms for comparison
views_stress <- wp_trend("Stress_(biology)", from = "2020-01-01", to = "2023-12-31", lang = "en")
views_screen <- wp_trend("Screen_time", from = "2020-01-01", to = "2023-12-31", lang = "en")
views_insomnia <- wp_trend("Insomnia", from = "2020-01-01", to = "2023-12-31", lang = "en")

# Clean and combine
views_all <- views_sd %>%
  dplyr::select(date, views) %>%
  dplyr::rename(SleepDeprivation = views) %>%
  left_join(views_stress %>% dplyr::select(date, views) %>% dplyr::rename(Stress = views), by = "date") %>%
  left_join(views_screen %>% dplyr::select(date, views) %>% dplyr::rename(ScreenTime = views), by = "date") %>%
  left_join(views_insomnia %>% dplyr::select(date, views) %>% dplyr::rename(Insomnia = views), by = "date") %>%
  dplyr::mutate(date = as.Date(date)) %>%
  as_tsibble(index = date)

head(views_all)
## # A tsibble: 6 x 5 [1D]
##   date       SleepDeprivation Stress ScreenTime Insomnia
##   <date>                <dbl>  <dbl>      <dbl>    <dbl>
## 1 2020-01-01             1127    400        203     2092
## 2 2020-01-02             1267    424        212     2737
## 3 2020-01-03             1348    469        233     2830
## 4 2020-01-04             1230    438        207     2723
## 5 2020-01-05             1234    428        271     2900
## 6 2020-01-06             1464    552        258     3225
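
Because the comparison series are attached with left_join(), a day missing from one of them shows up as NA rather than as a dropped row, and a day missing from the primary series would leave a gap in the date index. A minimal base R sketch of such a check follows; the toy data frame, its dates, and its counts are illustrative assumptions, not real page views.

```r
# Toy daily series with one day deliberately missing (illustrative values only)
all_days <- seq(as.Date("2020-01-01"), as.Date("2020-01-10"), by = "day")
toy_views <- data.frame(
  date  = all_days[-4],                          # drop 2020-01-04
  views = c(10, 12, 11, 15, 14, 13, 16, 18, 17)
)

# Build the complete calendar and left-join the observed views onto it
calendar <- data.frame(
  date = seq(min(toy_views$date), max(toy_views$date), by = "day")
)
checked <- merge(calendar, toy_views, by = "date", all.x = TRUE)

# Days with no observation appear as NA after the join
missing_dates <- checked$date[is.na(checked$views)]
print(missing_dates)
```

With tsibble loaded, has_gaps() and fill_gaps() offer a comparable check directly on views_all.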

Step 2: Plot Time Series of Page Views

views_all %>%
  ggplot(aes(x = date)) +
  geom_line(aes(y = SleepDeprivation, color = "Sleep Deprivation")) +
  geom_line(aes(y = Stress, color = "Stress")) +
  geom_line(aes(y = ScreenTime, color = "Screen Time")) +
  geom_line(aes(y = Insomnia, color = "Insomnia")) +
  labs(title = "Wikipedia Page Views: Sleep-Related Topics",
       x = "Date", y = "Page Views", color = "Topic") +
  scale_y_continuous(labels = comma)

Interpretation:

  • Sleep Deprivation is the noisiest series, with large isolated spikes. It is not the highest in volume, though: in the sample rows above, Insomnia draws roughly twice as many daily views.
  • Stress and Insomnia show relatively stable interest, and Insomnia appears to track SleepDeprivation, suggesting a link between the two topics.
  • Screen Time appears to mirror peaks in SleepDeprivation, which hints at possible behavioral drivers but does not establish them.
  • Collective movement across topics suggests common external influences (e.g., exams, campaigns).

Step 3: Fit Linear Model for Trend Detection

lm_fit <- views_all %>%
  model(Trend = TSLM(SleepDeprivation ~ trend()))

report(lm_fit)
## Series: SleepDeprivation 
## Model: TSLM 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -551.63 -206.52  -27.88  140.08 2494.20 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1166.51264   14.15543  82.407  < 2e-16 ***
## trend()       -0.04812    0.01677  -2.869  0.00418 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 270.4 on 1459 degrees of freedom
## Multiple R-squared: 0.00561, Adjusted R-squared: 0.004928
## F-statistic: 8.231 on 1 and 1459 DF, p-value: 0.0041782

Interpretation:

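  • There is a statistically significant decline in interest over time (p < 0.01), but the slope is shallow: about 0.05 fewer views per day, or roughly 18 fewer per year, against a baseline of about 1,170 daily views.
  • The trend explains under 1% of the variance (R-squared = 0.0056). The p-value also assumes independent residuals, which daily page views rarely satisfy, so the significance is likely overstated.
  • Sleep Deprivation is not a growing concern: periodic spikes occur, but public interest may be waning slowly over time.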
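
To put the slope in more tangible units, the arithmetic below converts the reported per-day coefficient into a per-year change and a total change over the sample. The coefficients are copied from the report above; the day count of 1461 is inferred from the residual degrees of freedom (1459 plus two estimated parameters), and the variable names are illustrative.

```r
# Coefficients copied from the TSLM report above
intercept <- 1166.51264   # fitted views at trend = 0
slope     <- -0.04812     # change in views per day

n_days     <- 1461                    # 2020-01-01 to 2023-12-31, inclusive
per_year   <- slope * 365             # average change per calendar year
total_drop <- slope * (n_days - 1)    # fitted change from first to last day
pct_drop   <- 100 * total_drop / (intercept + slope)  # relative to the day-1 fit

round(c(per_year = per_year, total = total_drop, percent = pct_drop), 2)
```

This works out to roughly 18 fewer views per year and about 70 fewer views (around 6%) across the four years, which is small next to the residual standard error of 270.4.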

Step 4: Plot with Trend Line

We overlay the fitted linear trend from Step 3 on the observed series.

views_all %>%
  model(TSLM(SleepDeprivation ~ trend())) %>%
  augment() %>%
  ggplot(aes(x = date)) +
  geom_line(aes(y = SleepDeprivation), color = "gray60") +
  geom_line(aes(y = .fitted), color = "red") +
  labs(title = "Trend in 'Sleep Deprivation' Page Views",
       y = "Page Views", x = "Date")

Interpretation:

  • The red line shows that, despite noise and spikes, the fitted trend slopes downward, consistent with the model’s coefficient.
  • Strong spikes in 2021–2023 suggest episodic events temporarily override the downward slope.

Step 5: Smoothing to Explore Seasonality

views_all %>%
  model(Smooth = ETS(SleepDeprivation)) %>%
  components() %>%
  autoplot() +
  labs(title = "Smoothed Components of Sleep Deprivation Interest")

Interpretation:

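  • ETS handles only short seasonal periods, so for daily data the seasonal component it reports by default is a day-of-week cycle (period 7). A strong seasonal component therefore indicates a consistent weekly rhythm in interest, not an annual calendar pattern.
  • The remainder shows deviations from that pattern, likely driven by media coverage or public events.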
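
A complementary way to separate trend, seasonal, and remainder components is an STL decomposition. This is a different technique from the ETS fit above, shown here only as a sketch with base R's stl() on a synthetic series; the weekly pattern, the slope, and the noise level are illustrative assumptions.

```r
set.seed(42)

# Synthetic daily series: weekly pattern + slow decline + noise (illustrative)
n_weeks <- 52
weekly  <- rep(c(0, 5, 8, 6, 3, -10, -12), times = n_weeks)
n       <- length(weekly)
toy     <- 1000 - 0.05 * seq_len(n) + weekly + rnorm(n, sd = 2)

# frequency = 7 tells stl() that one seasonal cycle spans seven observations
toy_ts <- ts(toy, frequency = 7)
fit    <- stl(toy_ts, s.window = "periodic")

# The result holds three columns: seasonal, trend, remainder
head(fit$time.series, 7)
seasonal <- as.numeric(fit$time.series[, "seasonal"])
```

With s.window = "periodic" the seasonal component repeats identically every seven days, so the first seven values summarize the whole weekly profile.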

Step 6: Use ACF to Detect Seasonality

views_all %>%
  ACF(SleepDeprivation) %>%
  autoplot() +
  labs(title = "ACF of Page Views: Sleep Deprivation")

Interpretation:

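  • Autocorrelation remains high at lags of 7–30 days. Peaks at multiples of 7 would indicate a weekly cycle, whereas a slow, smooth decay reflects general persistence rather than seasonality.
  • Lags this short cannot confirm annual drivers such as academic routines, DST changes, or awareness months. Checking those requires extending the lag window to a year or more (e.g., lag_max = 400).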
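
To see what a weekly cycle looks like in an ACF, the sketch below builds a synthetic series with a weekend bump and reads off the autocorrelation at a few lags using base R's acf(). The series and the helper acf_at() are illustrative assumptions.

```r
set.seed(7)

# Synthetic daily series with a weekend bump (illustrative values only)
weekly <- rep(c(0, 0, 0, 0, 0, 5, 5), times = 52)
toy    <- weekly + rnorm(length(weekly), sd = 1)

# plot = FALSE returns the autocorrelations instead of drawing them
toy_acf <- acf(toy, lag.max = 21, plot = FALSE)

# Element 1 is lag 0, so lag k sits at position k + 1
acf_at <- function(k) toy_acf$acf[k + 1]
round(sapply(c(3, 7, 14, 21), acf_at), 2)
```

A weekly pattern produces clear peaks at lags 7, 14, and 21 with lower or negative values in between, which is the signature to look for in the plot above.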

Step 7: Compare with External Cycles (Academic & Media)

# Align each year on day-of-year so the annual patterns overlay one another
views_all %>%
  mutate(Year = year(date), DayOfYear = yday(date)) %>%
  ggplot(aes(x = DayOfYear, y = SleepDeprivation, color = factor(Year))) +
  geom_line(alpha = 0.8) +
  labs(title = "Sleep Deprivation Page Views: Year-over-Year Pattern",
       x = "Day of Year", y = "Page Views", color = "Year") +
  scale_y_continuous(labels = comma)

Interpretation:

  • Annual plots show recurring spring and fall spikes, suggesting ties to semester timelines or clock changes.
  • Similar shapes across years point to seasonal regularity.

Step 8: Plan for Multivariate Time Series

# Prep wide tsibble to VAR-compatible data frame
multi_ts <- views_all %>%
  dplyr::select(date, SleepDeprivation, Stress, ScreenTime, Insomnia) %>%
  as_tibble() %>%
  na.omit()

# Convert to time-series matrix
ts_data <- ts(multi_ts[, -1], start = c(2020, 1), frequency = 365)

# Fit simple VAR model
var_fit <- VAR(ts_data, p = 2, type = "const")

summary(var_fit)
## 
## VAR Estimation Results:
## ========================= 
## Endogenous variables: SleepDeprivation, Stress, ScreenTime, Insomnia 
## Deterministic variables: const 
## Sample size: 1459 
## Log Likelihood: -35492.996 
## Roots of the characteristic polynomial:
## 0.9103 0.7982 0.7715 0.5131 0.3245 0.3245 0.2981 0.1925
## Call:
## VAR(y = ts_data, p = 2, type = "const")
## 
## 
## Estimation results for equation SleepDeprivation: 
## ================================================= 
## SleepDeprivation = SleepDeprivation.l1 + Stress.l1 + ScreenTime.l1 + Insomnia.l1 + SleepDeprivation.l2 + Stress.l2 + ScreenTime.l2 + Insomnia.l2 + const 
## 
##                       Estimate Std. Error t value Pr(>|t|)    
## SleepDeprivation.l1   0.394894   0.028648  13.784  < 2e-16 ***
## Stress.l1             0.003355   0.051509   0.065 0.948081    
## ScreenTime.l1         0.356825   0.097056   3.676 0.000245 ***
## Insomnia.l1           0.145350   0.031305   4.643 3.74e-06 ***
## SleepDeprivation.l2   0.332689   0.028636  11.618  < 2e-16 ***
## Stress.l2            -0.062988   0.051108  -1.232 0.217982    
## ScreenTime.l2        -0.303923   0.097667  -3.112 0.001896 ** 
## Insomnia.l2          -0.050933   0.031564  -1.614 0.106815    
## const               160.767909  34.590587   4.648 3.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 184.9 on 1450 degrees of freedom
## Multiple R-Squared: 0.538,   Adjusted R-squared: 0.5354 
## F-statistic:   211 on 8 and 1450 DF,  p-value: < 2.2e-16 
## 
## 
## Estimation results for equation Stress: 
## ======================================= 
## Stress = SleepDeprivation.l1 + Stress.l1 + ScreenTime.l1 + Insomnia.l1 + SleepDeprivation.l2 + Stress.l2 + ScreenTime.l2 + Insomnia.l2 + const 
## 
##                      Estimate Std. Error t value Pr(>|t|)    
## SleepDeprivation.l1   0.02652    0.01546   1.715  0.08655 .  
## Stress.l1             0.27230    0.02780   9.796  < 2e-16 ***
## ScreenTime.l1         0.34818    0.05238   6.647 4.21e-11 ***
## Insomnia.l1           0.03562    0.01689   2.108  0.03519 *  
## SleepDeprivation.l2   0.03382    0.01545   2.189  0.02878 *  
## Stress.l2             0.11354    0.02758   4.117 4.06e-05 ***
## ScreenTime.l2        -0.13830    0.05271  -2.624  0.00878 ** 
## Insomnia.l2          -0.01752    0.01703  -1.028  0.30400    
## const               128.03650   18.66734   6.859 1.02e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 99.77 on 1450 degrees of freedom
## Multiple R-Squared: 0.2645,  Adjusted R-squared: 0.2604 
## F-statistic: 65.17 on 8 and 1450 DF,  p-value: < 2.2e-16 
## 
## 
## Estimation results for equation ScreenTime: 
## =========================================== 
## ScreenTime = SleepDeprivation.l1 + Stress.l1 + ScreenTime.l1 + Insomnia.l1 + SleepDeprivation.l2 + Stress.l2 + ScreenTime.l2 + Insomnia.l2 + const 
## 
##                      Estimate Std. Error t value Pr(>|t|)    
## SleepDeprivation.l1 -0.022997   0.008162  -2.818   0.0049 ** 
## Stress.l1           -0.027795   0.014675  -1.894   0.0584 .  
## ScreenTime.l1        0.551463   0.027651  19.944  < 2e-16 ***
## Insomnia.l1          0.019746   0.008919   2.214   0.0270 *  
## SleepDeprivation.l2  0.019137   0.008158   2.346   0.0191 *  
## Stress.l2            0.016624   0.014560   1.142   0.2538    
## ScreenTime.l2        0.187737   0.027825   6.747 2.17e-11 ***
## Insomnia.l2         -0.023487   0.008992  -2.612   0.0091 ** 
## const               96.549752   9.854600   9.797  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 52.67 on 1450 degrees of freedom
## Multiple R-Squared: 0.4581,  Adjusted R-squared: 0.4551 
## F-statistic: 153.2 on 8 and 1450 DF,  p-value: < 2.2e-16 
## 
## 
## Estimation results for equation Insomnia: 
## ========================================= 
## Insomnia = SleepDeprivation.l1 + Stress.l1 + ScreenTime.l1 + Insomnia.l1 + SleepDeprivation.l2 + Stress.l2 + ScreenTime.l2 + Insomnia.l2 + const 
## 
##                      Estimate Std. Error t value Pr(>|t|)    
## SleepDeprivation.l1  -0.04064    0.02572  -1.580 0.114315    
## Stress.l1            -0.02938    0.04624  -0.635 0.525364    
## ScreenTime.l1         0.30070    0.08713   3.451 0.000575 ***
## Insomnia.l1           0.63733    0.02810  22.677  < 2e-16 ***
## SleepDeprivation.l2   0.07292    0.02571   2.836 0.004628 ** 
## Stress.l2            -0.21250    0.04588  -4.631 3.96e-06 ***
## ScreenTime.l2        -0.36100    0.08768  -4.117 4.05e-05 ***
## Insomnia.l2           0.24340    0.02834   8.590  < 2e-16 ***
## const               296.83895   31.05452   9.559  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 166 on 1450 degrees of freedom
## Multiple R-Squared: 0.7137,  Adjusted R-squared: 0.7121 
## F-statistic: 451.8 on 8 and 1450 DF,  p-value: < 2.2e-16 
## 
## 
## 
## Covariance matrix of residuals:
##                  SleepDeprivation Stress ScreenTime Insomnia
## SleepDeprivation            34175   5232       3252    12222
## Stress                       5232   9953       1165     4827
## ScreenTime                   3252   1165       2774     2685
## Insomnia                    12222   4827       2685    27545
## 
## Correlation matrix of residuals:
##                  SleepDeprivation Stress ScreenTime Insomnia
## SleepDeprivation           1.0000 0.2837     0.3340   0.3984
## Stress                     0.2837 1.0000     0.2218   0.2916
## ScreenTime                 0.3340 0.2218     1.0000   0.3072
## Insomnia                   0.3984 0.2916     0.3072   1.0000

Interpretation:

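  • In the SleepDeprivation equation, ScreenTime (lag 1, p < 0.001; lag 2, p < 0.01) and Insomnia lag 1 (p < 0.001) are significant predictors. Neither Stress lag is significant (p = 0.95 and p = 0.22).
  • Insomnia predicts next-day SleepDeprivation, and the residual correlation between the two (0.40) points to same-day co-movement that the lags do not capture.
  • SleepDeprivation in turn helps predict ScreenTime (both lags) and Insomnia (lag 2), although the coefficients are small, which suggests a weak feedback loop.
  • These dynamics are consistent with interest in screen time and insomnia preceding interest in sleep deprivation. They describe predictive ordering in page views, not causal effects on behavior.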
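
One way to probe lead-lag claims like these is a Granger-style test: does adding lagged values of one series improve the prediction of another beyond its own lags? The sketch below does this by hand with lm() and an F-test on synthetic data in which x is constructed to lead y. The series, coefficients, and variable names are illustrative assumptions, not the page view data.

```r
set.seed(1)
n <- 300

# Synthetic pair: x is an AR(1) process, and y depends on yesterday's x
x <- as.numeric(arima.sim(list(ar = 0.5), n = n))
e <- rnorm(n)
y <- numeric(n)
for (t in 2:n) y[t] <- 0.4 * y[t - 1] + 0.5 * x[t - 1] + e[t]

# Align each observation with its one-day lags
d <- data.frame(y = y[-1], y_l1 = y[-n], x_l1 = x[-n])

# Restricted model uses only y's own lag; unrestricted adds lagged x
restricted   <- lm(y ~ y_l1, data = d)
unrestricted <- lm(y ~ y_l1 + x_l1, data = d)

# F-test: does lagged x add predictive power?
granger <- anova(restricted, unrestricted)
p_value <- granger[["Pr(>F)"]][2]
print(granger)
```

A small p-value says that lagged x improves the forecast of y, which is predictive precedence rather than proof of causation. For a fitted model such as var_fit, causality() in the vars package runs a packaged version of this test.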

Final Insights and Next Steps

Key Findings:

  • Sleep deprivation interest is declining long-term but spikes cyclically.
  • Seasonal rhythms align with real-world anchors (semesters, daylight time, campaigns).
  • Multivariate lags show that screen time and insomnia views precede sleep deprivation views; stress adds no significant predictive power in this model.

What this means:

  • Sleep attention appears reactionary rather than persistent—triggered by external or internal stressors.
  • Campaigns, research drops, and product launches should align with predicted seasonal peaks for maximum impact.

Next Steps:

  • Compare patterns with known academic calendars and mental health campaigns.
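  • Run Granger causality tests and impulse response analysis (e.g., causality() and irf() in vars) to check whether the lead-lag relationships hold up. These measure predictive precedence, not true causation.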
  • Track future spikes and explore lag-based interventions (e.g., app nudges, alerts).
  • Build a trend dashboard using Wikipedia/pageview APIs to monitor behavioral warning signals in real time.