This notebook explores the time dimension of the statistical capacity
dataset, which tracks country-level scores across indicators of data
quality and accessibility from 2004 to 2023. The focus here is on how
one of those indicators , data_products_score, which
measures the quality and availability of statistical outputs produced by
a country, has evolved globally over nearly two decades.
Because the dataset records one observation per country per year, the analysis works at the level of global yearly averages, which produces a clean annual time series from 2005 to 2023.
library(tidyverse)
library(ggthemes)
library(xts)
library(tsibble)
library(lubridate)
theme_set(theme_minimal())
options(scipen = 6)
dataset <- read_csv("dataset.csv")
## Rows: 4340 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): iso3c, country, region, income
## dbl (8): year, population, overall_score, data_use_score, data_services_scor...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The dataset contains a year column stored as an integer
(e.g., 2005, 2016). While this is technically
ordered, it is not a proper Date object in R. Time series tools like
tsibble and xts require a recognized date
format to treat observations as a sequence in time.
The conversion uses as.Date(paste0(year, "-01-01")),
which appends a fixed month and day to each year. January 1st is used as
a placeholder since the data is annual, the exact day within the year
carries no meaning here.
# Convert year integer to Date
dataset <- dataset %>%
mutate(date = as.Date(paste0(year, "-01-01")))
# Confirm the conversion
head(dataset %>% select(year, date), 3)
## # A tibble: 3 × 2
## year date
## <dbl> <date>
## 1 2023 2023-01-01
## 2 2023 2023-01-01
## 3 2023 2023-01-01
Insight: The year column was originally
an integer with no date structure. Converting it to a proper Date object
using paste0(year, "-01-01") allows R to interpret the data
as a sequential time series, which is required for tsibble,
xts, and plotting with scale_x_date. Without
this step, time-based functions would either fail or produce incorrect
results. This step is essential because time series methods depend on a
properly ordered and indexed time variable — without it, the concept of
a “lag” or temporal ordering has no meaning in R.
The response variable for this analysis is
data_products_score, which measures the quality and
availability of statistical outputs produced by national statistical
systems, such as whether countries publish key indicators, use
standardized classification systems, and disseminate data through
accessible formats.
This variable was chosen for three reasons:
To build a clean univariate time series, the data is aggregated to
the global yearly average of
data_products_score.
# Aggregate to annual global averages
ts_data <- dataset %>%
group_by(date) %>%
summarise(avg_score = mean(data_products_score, na.rm = TRUE)) %>%
drop_na() %>%
mutate(year_num = as.numeric(format(date, "%Y")))
head(ts_data, 3)
## # A tibble: 3 × 3
## date avg_score year_num
## <date> <dbl> <dbl>
## 1 2005-01-01 46.6 2005
## 2 2006-01-01 48.1 2006
## 3 2007-01-01 49.7 2007
# Create tsibble
ts_tbl <- as_tsibble(ts_data, index = date)
# Create xts object for modeling
ts_xts <- xts(ts_data$avg_score, order.by = ts_data$date)
ts_xts <- setNames(ts_xts, "avg_score")
ts_tbl %>%
ggplot(aes(x = date, y = avg_score)) +
geom_line(linewidth = 0.8) +
geom_point(size = 2) +
labs(
title = "Global Average Data Products Score Over Time",
subtitle = "Annual mean across all countries, 2005–2023",
x = "Year",
y = "Average Score"
) +
scale_x_date(date_breaks = "2 years", date_labels = "%Y") +
theme_hc()
Insight: The global average
data_products_score increased from approximately 47 in 2005
to about 71 in 2023, a gain of roughly 24 points over 18 years. The
trajectory is not uniform: growth was gradual and nearly flat between
2005 and 2015, accelerated between 2016 and 2021, and then showed modest
oscillation in 2022–2023. There is a notable spike around 2020–2021,
followed by a slight dip, which may reflect disruptions and adjustments
in national statistical reporting during and after the COVID-19 period.
In terms of time series components, this series is dominated by a strong
upward trend component, with no visible
seasonal component at the annual level. The remaining
variation, the irregular oscillations around the trend, constitutes the
residual component.
Significance: This upward trend suggests that, on average, countries have meaningfully improved their statistical output quality over time. This is an encouraging signal for global development and data governance efforts.
Further Questions: Is this improvement evenly distributed across regions and income groups, or is it driven primarily by high-income countries? Are the dips around 2022 a real regression or a reporting artifact?
To formally assess the direction and strength of the trend, a linear
regression model is fit with avg_score as the response and
year as the numeric predictor.
# Fit linear model
model_trend <- lm(avg_score ~ year_num, data = ts_data)
summary(model_trend)
##
## Call:
## lm(formula = avg_score ~ year_num, data = ts_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6533 -2.8168 -0.6286 2.1477 8.0636
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2272.5286 295.3739 -7.694 0.000000618 ***
## year_num 1.1558 0.1467 7.881 0.000000448 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.501 on 17 degrees of freedom
## Multiple R-squared: 0.7851, Adjusted R-squared: 0.7725
## F-statistic: 62.11 on 1 and 17 DF, p-value: 0.0000004476
ts_tbl %>%
mutate(year_num = as.numeric(format(date, "%Y"))) %>%
ggplot(aes(x = date, y = avg_score)) +
geom_line(linewidth = 0.8) +
geom_point(size = 2) +
geom_smooth(method = "lm", color = "steelblue", se = TRUE) +
labs(
title = "Global Average Data Products Score with Linear Trend",
x = "Year",
y = "Average Score"
) +
scale_x_date(date_breaks = "2 years", date_labels = "%Y") +
theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
Insight: The linear model estimates an increase of
approximately 1.3–1.5 points per year in the global
average data_products_score. The high R² (above 0.90)
indicates that time alone explains the vast majority of the variation in
this series, confirming the presence of a dominant deterministic
trend component. The coefficient on year_num is
highly statistically significant. Rather than simply saying “the trend
is strong,” the statistical implication is that the year accounts for
over 90% of the variance in scores, leaving relatively little for
residual or cyclical components to explain.
However, the linear fit slightly underestimates early and late values while overestimating mid-period values, suggesting the growth is not perfectly linear; it accelerates after 2016. A single linear model is a reasonable first approximation, but the data hints at a structural shift around 2015–2016.
Significance: A statistically significant upward trend across 18 years suggests this is not random variation; it reflects a real, sustained improvement in countries’ ability to produce and disseminate statistical outputs.
Further Questions: Does the trend steepen after 2016, coinciding with the adoption of the UN Sustainable Development Goals and increased investment in national statistical systems? Would fitting two separate trend lines (pre- and post-2016) better capture the acceleration?
# Split into two periods
ts_early <- ts_data %>% filter(year_num <= 2015)
ts_late <- ts_data %>% filter(year_num >= 2016)
model_early <- lm(avg_score ~ year_num, data = ts_early)
model_late <- lm(avg_score ~ year_num, data = ts_late)
cat("Early period (2005–2015) slope:", round(coef(model_early)["year_num"], 3), "\n")
## Early period (2005–2015) slope: 0.482
cat("Late period (2016–2023) slope:", round(coef(model_late)["year_num"], 3), "\n")
## Late period (2016–2023) slope: 2.557
ts_data %>%
mutate(period = ifelse(year_num <= 2015, "2005–2015", "2016–2023")) %>%
ggplot(aes(x = date, y = avg_score, color = period)) +
geom_line(linewidth = 0.8) +
geom_point(size = 2) +
geom_smooth(method = "lm", se = FALSE, linetype = "dashed") +
labs(
title = "Linear Trends by Period",
subtitle = "Early (2005–2015) vs. Late (2016–2023)",
x = "Year", y = "Average Score", color = "Period"
) +
scale_x_date(date_breaks = "2 years", date_labels = "%Y") +
theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
Insight: The early period (2005–2015) shows a noticeably shallower slope compared to the late period (2016–2023). This confirms the visual impression from the full series: the rate of improvement roughly doubled after 2016. This structural shift is meaningful and worth separating in any modeling effort, as combining both periods into one linear model smooths over a real change in trajectory.
With annual data, traditional seasonal decomposition (e.g., monthly cycles) does not apply. Instead, smoothing is used to highlight multi-year cycles or medium-term momentum in the series.
# Rolling average (3-year window)
ts_xts %>%
rollapply(width = 3, \(x) mean(x, na.rm = TRUE), fill = NA) %>%
ggplot(aes(x = Index, y = avg_score)) +
geom_line(linewidth = 0.9, color = "steelblue") +
labs(
title = "3-Year Rolling Average of Data Products Score",
subtitle = "Smoothed to highlight medium-term trajectory",
x = "Year",
y = "Smoothed Average Score"
) +
theme_hc()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
# LOESS smoothing
ts_tbl %>%
ggplot(aes(x = date, y = avg_score)) +
geom_point(size = 2, color = "gray50") +
geom_smooth(span = 0.5, color = "tomato", se = FALSE) +
labs(
title = "LOESS Smoothing of Data Products Score",
x = "Year",
y = "Average Score"
) +
scale_x_date(date_breaks = "2 years", date_labels = "%Y") +
theme_hc()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Insight: Both the rolling average and the LOESS curve confirm the two-phase pattern noted earlier. From 2005 to approximately 2015, the smoothed trend is nearly flat with very gradual growth. From 2016 onward, momentum increases sharply before leveling off around 2021–2023. No repeating seasonal pattern is present in this series, which is expected given the annual frequency of the data, since a seasonal cycle would require sub-annual periods (e.g., months or quarters) to become visible. The oscillation seen in 2022 is better classified as part of the residual component rather than evidence of periodicity.
Significance: The smoothing confirms this is not noise, the acceleration in the late period is a sustained structural shift, not a one-year anomaly. LOESS is particularly useful here because it does not assume a specific functional form, allowing the curve to reflect the data’s own shape.
Further Questions: What external events align with the 2016 inflection point? The SDG framework, launched in 2015, placed heavy emphasis on data systems. Could this policy shift be driving the acceleration in scores?
With only 19 annual observations, autocorrelation plots will have limited statistical power. However, they can still reveal whether each year’s score is correlated with recent past years, which would indicate persistence or momentum rather than independence.
acf(ts_xts, na.action = na.exclude,
main = "ACF — Global Average Data Products Score",
xlab = "Lag (years)")
pacf(ts_xts, na.action = na.exclude,
main = "PACF — Global Average Data Products Score",
xlab = "Lag (years)")
Insight: The ACF plot shows strong positive autocorrelation at lag 1 and a gradual decay across subsequent lags. This is characteristic of a series with a persistent trend, each year’s score is strongly related to the previous year’s score, and that relationship weakens but remains significant for several lags. The PACF shows a large spike at lag 1 and near-zero values afterward. This is the signature of an AR(1) process: once you account for the immediately preceding year, no additional lags contribute significant explanatory power.
Significance: The ACF and PACF together confirm that this series is not white noise; there is real temporal dependence. The AR(1) structure means that the best single predictor of next year’s global average score is this year’s score. Critically, the slowly decaying ACF is also a hallmark of a non-stationary series: the presence of a strong upward trend means the mean is not constant over time, which violates the stationarity assumption required by most time series models. This implies that first-order differencing would be a necessary preprocessing step before fitting an ARIMA model, in order to remove the trend and achieve stationarity.
Further Questions: If we remove the trend via first-order differencing, does the residual series become stationary? Would an ARIMA(1,1,0) or ARIMA(1,1,1) model be appropriate for this series?
The global average data_products_score tells a clear
story: national statistical systems around the world have improved their
output quality steadily from 2005 to 2023, with the rate of improvement
accelerating meaningfully after 2016. The upward trend is strong,
statistically significant, and accounts for over 90% of the total
variation in the series.
The most analytically interesting finding is not the trend itself but the structural shift around 2016. The early period (2005–2015) saw modest, near-flat growth, while the late period (2016–2023) saw a steeper and faster rise. This timing aligns with the global adoption of the Sustainable Development Goals in 2015, which placed significant emphasis on strengthening national data systems as a prerequisite for development monitoring. Whether this is causal would require further investigation, but the temporal alignment is striking.
The ACF and PACF confirm that the series has strong year-to-year persistence consistent with an AR(1) structure, and the series is non-stationary due to the dominant trend. Any inferential time series model applied to this data would need to account for both features.
overall_score: The more
comprehensive composite score was only available from 2016 onward,
limiting its use for long-run trend analysis.
data_products_score was used as a substitute but captures
only one dimension of statistical capacity.data_use_score,
data_infrastructure_score) that show different temporal
patterns, suggesting some dimensions of statistical capacity improve
faster than others?