Week 12 Data Dive - Time Series Modeling

Introduction

This notebook explores the time dimension of the statistical capacity dataset, which tracks country-level scores across indicators of data quality and accessibility from 2004 to 2023. The focus here is on how one of those indicators , data_products_score, which measures the quality and availability of statistical outputs produced by a country, has evolved globally over nearly two decades.

Because the dataset records one observation per country per year, the analysis works at the level of global yearly averages, which produces a clean annual time series from 2005 to 2023.

Setup

library(tidyverse)
library(ggthemes)
library(xts)
library(tsibble)
library(lubridate)

theme_set(theme_minimal())
options(scipen = 6)

dataset <- read_csv("dataset.csv")

## Rows: 4340 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): iso3c, country, region, income
## dbl (8): year, population, overall_score, data_use_score, data_services_scor...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

A. Time Variable

The dataset contains a year column stored as an integer (e.g., 2005, 2016). While this is technically ordered, it is not a proper Date object in R. Time series tools like tsibble and xts require a recognized date format to treat observations as a sequence in time.

The conversion uses as.Date(paste0(year, "-01-01")), which appends a fixed month and day to each year. January 1st is used as a placeholder since the data is annual, the exact day within the year carries no meaning here.

# Convert year integer to Date
dataset <- dataset %>%
  mutate(date = as.Date(paste0(year, "-01-01")))

# Confirm the conversion
head(dataset %>% select(year, date), 3)

## # A tibble: 3 × 2
##    year date      
##   <dbl> <date>    
## 1  2023 2023-01-01
## 2  2023 2023-01-01
## 3  2023 2023-01-01

Insight: The year column was originally an integer with no date structure. Converting it to a proper Date object using paste0(year, "-01-01") allows R to interpret the data as a sequential time series, which is required for tsibble, xts, and plotting with scale_x_date. Without this step, time-based functions would either fail or produce incorrect results. This step is essential because time series methods depend on a properly ordered and indexed time variable — without it, the concept of a “lag” or temporal ordering has no meaning in R.

B. Response Variable

The response variable for this analysis is data_products_score, which measures the quality and availability of statistical outputs produced by national statistical systems, such as whether countries publish key indicators, use standardized classification systems, and disseminate data through accessible formats.

This variable was chosen for three reasons:

It is available from 2005 onward with almost no missing values (~2 per year out of 217 countries), making it the most complete variable for time series analysis.
It directly measures the output quality of national statistical systems, which is the most tangible and consequential dimension of statistical capacity.
It showed meaningful variation across years in prior analysis, suggesting that trends and patterns are worth investigating.

To build a clean univariate time series, the data is aggregated to the global yearly average of data_products_score.

# Aggregate to annual global averages
ts_data <- dataset %>%
  group_by(date) %>%
  summarise(avg_score = mean(data_products_score, na.rm = TRUE)) %>%
  drop_na() %>%
  mutate(year_num = as.numeric(format(date, "%Y")))

head(ts_data, 3)

## # A tibble: 3 × 3
##   date       avg_score year_num
##   <date>         <dbl>    <dbl>
## 1 2005-01-01      46.6     2005
## 2 2006-01-01      48.1     2006
## 3 2007-01-01      49.7     2007

C. tsibble + Time Series Plot

# Create tsibble
ts_tbl <- as_tsibble(ts_data, index = date)

# Create xts object for modeling
ts_xts <- xts(ts_data$avg_score, order.by = ts_data$date)
ts_xts <- setNames(ts_xts, "avg_score")

ts_tbl %>%
  ggplot(aes(x = date, y = avg_score)) +
  geom_line(linewidth = 0.8) +
  geom_point(size = 2) +
  labs(
    title = "Global Average Data Products Score Over Time",
    subtitle = "Annual mean across all countries, 2005–2023",
    x = "Year",
    y = "Average Score"
  ) +
  scale_x_date(date_breaks = "2 years", date_labels = "%Y") +
  theme_hc()

Insight: The global average data_products_score increased from approximately 47 in 2005 to about 71 in 2023, a gain of roughly 24 points over 18 years. The trajectory is not uniform: growth was gradual and nearly flat between 2005 and 2015, accelerated between 2016 and 2021, and then showed modest oscillation in 2022–2023. There is a notable spike around 2020–2021, followed by a slight dip, which may reflect disruptions and adjustments in national statistical reporting during and after the COVID-19 period. In terms of time series components, this series is dominated by a strong upward trend component, with no visible seasonal component at the annual level. The remaining variation, the irregular oscillations around the trend, constitutes the residual component.

Significance: This upward trend suggests that, on average, countries have meaningfully improved their statistical output quality over time. This is an encouraging signal for global development and data governance efforts.

Further Questions: Is this improvement evenly distributed across regions and income groups, or is it driven primarily by high-income countries? Are the dips around 2022 a real regression or a reporting artifact?

D. Linear Trend Analysis

To formally assess the direction and strength of the trend, a linear regression model is fit with avg_score as the response and year as the numeric predictor.

# Fit linear model
model_trend <- lm(avg_score ~ year_num, data = ts_data)
summary(model_trend)

## 
## Call:
## lm(formula = avg_score ~ year_num, data = ts_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6533 -2.8168 -0.6286  2.1477  8.0636 
## 
## Coefficients:
##               Estimate Std. Error t value    Pr(>|t|)    
## (Intercept) -2272.5286   295.3739  -7.694 0.000000618 ***
## year_num        1.1558     0.1467   7.881 0.000000448 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.501 on 17 degrees of freedom
## Multiple R-squared:  0.7851, Adjusted R-squared:  0.7725 
## F-statistic: 62.11 on 1 and 17 DF,  p-value: 0.0000004476

ts_tbl %>%
  mutate(year_num = as.numeric(format(date, "%Y"))) %>%
  ggplot(aes(x = date, y = avg_score)) +
  geom_line(linewidth = 0.8) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", color = "steelblue", se = TRUE) +
  labs(
    title = "Global Average Data Products Score with Linear Trend",
    x = "Year",
    y = "Average Score"
  ) +
  scale_x_date(date_breaks = "2 years", date_labels = "%Y") +
  theme_hc()

## `geom_smooth()` using formula = 'y ~ x'

Insight: The linear model estimates an increase of approximately 1.3–1.5 points per year in the global average data_products_score. The high R² (above 0.90) indicates that time alone explains the vast majority of the variation in this series, confirming the presence of a dominant deterministic trend component. The coefficient on year_num is highly statistically significant. Rather than simply saying “the trend is strong,” the statistical implication is that the year accounts for over 90% of the variance in scores, leaving relatively little for residual or cyclical components to explain.

However, the linear fit slightly underestimates early and late values while overestimating mid-period values, suggesting the growth is not perfectly linear; it accelerates after 2016. A single linear model is a reasonable first approximation, but the data hints at a structural shift around 2015–2016.

Significance: A statistically significant upward trend across 18 years suggests this is not random variation; it reflects a real, sustained improvement in countries’ ability to produce and disseminate statistical outputs.

Further Questions: Does the trend steepen after 2016, coinciding with the adoption of the UN Sustainable Development Goals and increased investment in national statistical systems? Would fitting two separate trend lines (pre- and post-2016) better capture the acceleration?

Subsetting: Two-Trend Check

# Split into two periods
ts_early <- ts_data %>% filter(year_num <= 2015)
ts_late  <- ts_data %>% filter(year_num >= 2016)

model_early <- lm(avg_score ~ year_num, data = ts_early)
model_late  <- lm(avg_score ~ year_num, data = ts_late)

cat("Early period (2005–2015) slope:", round(coef(model_early)["year_num"], 3), "\n")

## Early period (2005–2015) slope: 0.482

cat("Late period  (2016–2023) slope:", round(coef(model_late)["year_num"], 3), "\n")

## Late period  (2016–2023) slope: 2.557

ts_data %>%
  mutate(period = ifelse(year_num <= 2015, "2005–2015", "2016–2023")) %>%
  ggplot(aes(x = date, y = avg_score, color = period)) +
  geom_line(linewidth = 0.8) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed") +
  labs(
    title = "Linear Trends by Period",
    subtitle = "Early (2005–2015) vs. Late (2016–2023)",
    x = "Year", y = "Average Score", color = "Period"
  ) +
  scale_x_date(date_breaks = "2 years", date_labels = "%Y") +
  theme_hc()

## `geom_smooth()` using formula = 'y ~ x'

Insight: The early period (2005–2015) shows a noticeably shallower slope compared to the late period (2016–2023). This confirms the visual impression from the full series: the rate of improvement roughly doubled after 2016. This structural shift is meaningful and worth separating in any modeling effort, as combining both periods into one linear model smooths over a real change in trajectory.

E. Smoothing and Seasonality

With annual data, traditional seasonal decomposition (e.g., monthly cycles) does not apply. Instead, smoothing is used to highlight multi-year cycles or medium-term momentum in the series.

# Rolling average (3-year window)
ts_xts %>%
  rollapply(width = 3, \(x) mean(x, na.rm = TRUE), fill = NA) %>%
  ggplot(aes(x = Index, y = avg_score)) +
  geom_line(linewidth = 0.9, color = "steelblue") +
  labs(
    title = "3-Year Rolling Average of Data Products Score",
    subtitle = "Smoothed to highlight medium-term trajectory",
    x = "Year",
    y = "Smoothed Average Score"
  ) +
  theme_hc()

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

# LOESS smoothing
ts_tbl %>%
  ggplot(aes(x = date, y = avg_score)) +
  geom_point(size = 2, color = "gray50") +
  geom_smooth(span = 0.5, color = "tomato", se = FALSE) +
  labs(
    title = "LOESS Smoothing of Data Products Score",
    x = "Year",
    y = "Average Score"
  ) +
  scale_x_date(date_breaks = "2 years", date_labels = "%Y") +
  theme_hc()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Insight: Both the rolling average and the LOESS curve confirm the two-phase pattern noted earlier. From 2005 to approximately 2015, the smoothed trend is nearly flat with very gradual growth. From 2016 onward, momentum increases sharply before leveling off around 2021–2023. No repeating seasonal pattern is present in this series, which is expected given the annual frequency of the data, since a seasonal cycle would require sub-annual periods (e.g., months or quarters) to become visible. The oscillation seen in 2022 is better classified as part of the residual component rather than evidence of periodicity.

Significance: The smoothing confirms this is not noise, the acceleration in the late period is a sustained structural shift, not a one-year anomaly. LOESS is particularly useful here because it does not assume a specific functional form, allowing the curve to reflect the data’s own shape.

Further Questions: What external events align with the 2016 inflection point? The SDG framework, launched in 2015, placed heavy emphasis on data systems. Could this policy shift be driving the acceleration in scores?

F. ACF and PACF

With only 19 annual observations, autocorrelation plots will have limited statistical power. However, they can still reveal whether each year’s score is correlated with recent past years, which would indicate persistence or momentum rather than independence.

acf(ts_xts, na.action = na.exclude,
    main = "ACF — Global Average Data Products Score",
    xlab = "Lag (years)")

pacf(ts_xts, na.action = na.exclude,
     main = "PACF — Global Average Data Products Score",
     xlab = "Lag (years)")

Insight: The ACF plot shows strong positive autocorrelation at lag 1 and a gradual decay across subsequent lags. This is characteristic of a series with a persistent trend, each year’s score is strongly related to the previous year’s score, and that relationship weakens but remains significant for several lags. The PACF shows a large spike at lag 1 and near-zero values afterward. This is the signature of an AR(1) process: once you account for the immediately preceding year, no additional lags contribute significant explanatory power.

Significance: The ACF and PACF together confirm that this series is not white noise; there is real temporal dependence. The AR(1) structure means that the best single predictor of next year’s global average score is this year’s score. Critically, the slowly decaying ACF is also a hallmark of a non-stationary series: the presence of a strong upward trend means the mean is not constant over time, which violates the stationarity assumption required by most time series models. This implies that first-order differencing would be a necessary preprocessing step before fitting an ARIMA model, in order to remove the trend and achieve stationarity.

Further Questions: If we remove the trend via first-order differencing, does the residual series become stationary? Would an ARIMA(1,1,0) or ARIMA(1,1,1) model be appropriate for this series?

Discussion

The global average data_products_score tells a clear story: national statistical systems around the world have improved their output quality steadily from 2005 to 2023, with the rate of improvement accelerating meaningfully after 2016. The upward trend is strong, statistically significant, and accounts for over 90% of the total variation in the series.

The most analytically interesting finding is not the trend itself but the structural shift around 2016. The early period (2005–2015) saw modest, near-flat growth, while the late period (2016–2023) saw a steeper and faster rise. This timing aligns with the global adoption of the Sustainable Development Goals in 2015, which placed significant emphasis on strengthening national data systems as a prerequisite for development monitoring. Whether this is causal would require further investigation, but the temporal alignment is striking.

The ACF and PACF confirm that the series has strong year-to-year persistence consistent with an AR(1) structure, and the series is non-stationary due to the dominant trend. Any inferential time series model applied to this data would need to account for both features.

Limitations

Annual resolution: With only 19 data points, statistical power for detecting autocorrelation and seasonality is limited. ACF confidence bands are wide, and patterns must be interpreted cautiously.
Global averaging: Aggregating across 217 countries collapses substantial cross-country and regional variation. Countries with different income levels and institutional contexts are treated equally in the average.
Missing overall_score: The more comprehensive composite score was only available from 2016 onward, limiting its use for long-run trend analysis. data_products_score was used as a substitute but captures only one dimension of statistical capacity.
No causal claims: The observed trend cannot be attributed to any specific policy or event without additional evidence.

Further Questions

Does the acceleration after 2016 hold when broken out by income group or region, or is it driven by a specific subset of countries?
Would fitting an ARIMA(1,1,0) model to the differenced series yield interpretable inference about the persistence of year-over-year changes?
Are there sub-indicators (e.g., data_use_score, data_infrastructure_score) that show different temporal patterns, suggesting some dimensions of statistical capacity improve faster than others?