1. Introduction

Population aging has emerged as one of the most prominent global demographic challenges across economies with divergent income levels and geographic locations. This panel-data-driven project investigates cross-country aging patterns relying on multi-decade World Bank open-source country-level indicators. Three independent raw datasets with dimensions (63840,70), (240,5) and (265,6) are merged on country and indicator codes, constructing a consolidated balanced panel dataset with 63,840 observations and 76 variables. The full empirical pipeline consists of dataset integration, standardized data preprocessing, exploratory data analysis, fixed-effect panel regression, unsupervised K-means classification, model robustness verification and final Shiny-based web application deployment.

2. Project Objectives

To identify key demographic, social, and economic factors that significantly influence national population aging rates through regression analysis.
To develop a classification model to divide countries into high-aging and low-aging groups based on birth rate and death rate.
To deploy the two trained models in an interactive Shiny application for online prediction and classification.

3. Setup & Data Preparation

3.1 Setup & Library Installation

if (!require("pacman")) install.packages("pacman")

## Loading required package: pacman

## Warning: package 'pacman' was built under R version 4.5.3

pacman::p_load(
  # Shiny APP related packages
  shiny, shinyWidgets, shinybusy, shinythemes,
  # Data wrangling & I/O
  tidyverse, dplyr, readxl, data.table, scales, stringr,
  # Econometric & panel regression
  plm, lmtest, car, sandwich, urca, forecast,
  # Missing value processing
  imputeTS, zoo,
  # Visualization
  ggplot2, ggcorrplot, patchwork,
  # Cluster & machine learning
  cluster, factoextra, mclust, randomForest,
  # Base statistical package
  stats
)

3.2 Load Raw Dataset

Three separate raw source files including main population panel data, country metadata and indicator metadata are imported from local file directory for subsequent merging operations.

# Set working directory
BASE_PATH <- "D:/Desktop/DS STUDY/2026.3-8/WQD7004 PROGRAMMING FOR DATA SCIENCE (R)/Group Project/Datasets/"

# Import three source CSV files
df_World_Population <- read_csv(paste0(BASE_PATH, "World_Population_Data.csv"), show_col_types = FALSE)

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

df_Indicator <- read_csv(paste0(BASE_PATH, "Metadata_Indicator.csv"), show_col_types = FALSE)

## New names:
## • `` -> `...5`

df_Country <- read_csv(paste0(BASE_PATH, "Metadata_Country.csv"), show_col_types = FALSE)

## New names:
## • `` -> `...6`

3.3 Light Pre-Cleaning: Column Name Standardization

Column names are standardized to lowercase with underscore substitution to unify naming rules; duplicate records across three raw tables are counted for preliminary data quality inspection.

# Define custom function to clean column names: convert all names to lowercase, replace space with underscore
rename_clean <- function(df){
  colnames(df) <- str_to_lower(str_replace_all(str_trim(colnames(df)), " ", "_"))
  return(df)
}

# Apply column name standardization to three imported datasets
df_World_Population <- rename_clean(df_World_Population)
df_Indicator <- rename_clean(df_Indicator)
df_Country <- rename_clean(df_Country)

# Calculate and print total duplicated observations for each dataset
cat("Duplicate rows World Population:", sum(duplicated(df_World_Population)), "\n")

## Duplicate rows World Population: 0

cat("Duplicate rows Indicator Meta:", sum(duplicated(df_Indicator)), "\n")

## Duplicate rows Indicator Meta: 0

cat("Duplicate rows Country Meta:", sum(duplicated(df_Country)), "\n")

## Duplicate rows Country Meta: 0

3.4 Multi-source Data Left Merge

Main population dataset is sequentially left-joined with country metadata and indicator metadata using unique matching keys; duplicated columns generated during join are combined and unused unnamed auxiliary columns are removed to form integrated raw combined data.

# Step1 merge population with country metadata by country_code
merged_data <- left_join(df_World_Population, df_Country, by = "country_code")

# Step2 merge with indicator metadata by indicator_code
final_merged_data <- left_join(merged_data, df_Indicator, by = "indicator_code")

# Merge duplicated columns generated after join (_x/_y suffix)
combine_columns <- function(df, col_name){
  col_x <- paste0(col_name, "_x")
  col_y <- paste0(col_name, "_y")
  if(all(c(col_x, col_y) %in% colnames(df))){
    df[[col_name]] <- coalesce(df[[col_x]], df[[col_y]])
    df <- df[, !colnames(df) %in% c(col_x, col_y)]
  }
  return(df)
}
final_merged_data <- combine_columns(final_merged_data, "indicator_name")

# Drop all unnamed auxiliary columns generated from raw csv
unnamed_cols <- str_subset(colnames(final_merged_data), "^unnamed")
final_merged_data <- final_merged_data[, !colnames(final_merged_data) %in% unnamed_cols]

# Output dataset dimension and export cleaned merged csv
cat("Merged dataset dimension:", dim(final_merged_data), "\n")

## Merged dataset dimension: 63840 79

write_csv(final_merged_data, "Final_Merged_Data.csv")

3.5 Reshape, Clean, and Transform Long-Format Panel Data

The dataset was reshaped from wide to long format first to facilitate observation-level cleaning (e.g., outlier removal) and to standardize the panel structure. This step ensures the raw data is reliable before we reshape it back to wide format for regression and machine learning.

3.5.1 Reshape from Wide to Long Format

# Import merged dataset
df <- read_csv("Final_Merged_Data.csv", show_col_types = FALSE)

## New names:
## • `...6` -> `...75`
## • `...5` -> `...79`

# Extract all numeric year columns between 1960-2025
year_cols <- as.character(seq(1960, 2025))
id_cols <- setdiff(colnames(df), year_cols)

# Reshape from wide yearly columns to long format (country-indicator-year-value)
df_long <- pivot_longer(
  data = df,
  cols = all_of(year_cols),
  names_to = "year",
  values_to = "value"
) %>%
  mutate(year = as.integer(year), value = as.numeric(value)) %>%
  arrange(country_code, indicator_code, year)

3.5.2 Check for Duplicate Records

# Print dataset size and duplicate count
cat("Initial long format data dimension:", dim(df_long), "\n")

## Initial long format data dimension: 4213440 15

cat("Total duplicate records in long data:", sum(duplicated(df_long)), "\n")

## Total duplicate records in long data: 0

3.5.3 Missing Value Processing

Missing value distribution is summarized; categorical missing entries are filled with “Unknown”; country-indicator groups without any valid numerical observations are deleted; linear time-series interpolation is implemented for sequential missing values within each country’s yearly records.

# Summarize missing value proportion per column
missing_stats <- tibble(
  var = colnames(df_long),
  miss_num = map_dbl(df_long, ~sum(is.na(.x))),
  miss_pct = round(map_dbl(df_long, ~mean(is.na(.x))*100),2)
) %>% arrange(desc(miss_pct))
print(missing_stats %>% filter(miss_num>0))

## # A tibble: 7 × 3
##   var          miss_num miss_pct
##   <chr>           <dbl>    <dbl>
## 1 ...75         4213440   100   
## 2 ...79         4213440   100   
## 3 value         2294482    54.5 
## 4 specialnotes  2106720    50   
## 5 incomegroup    807840    19.2 
## 6 region         776160    18.4 
## 7 tablename       15840     0.38

# Step1 fill categorical missing with Unknown
df_long <- df_long %>%
  mutate(
    region = replace(region, is.na(region), "Unknown"),
    incomegroup = replace(incomegroup, is.na(incomegroup), "Unknown"),
    specialnotes = replace(specialnotes, is.na(specialnotes), "Unknown")
  )

# Step2 drop country-indicator groups with entirely empty value (no valid observation ever)
group_valid <- df_long %>%
  group_by(country_code, indicator_code) %>%
  summarise(valid_cnt = sum(!is.na(value)), .groups = "drop")
valid_group_list <- group_valid %>% filter(valid_cnt>0) %>% select(country_code, indicator_code)
df_long <- inner_join(df_long, valid_group_list, by = c("country_code", "indicator_code"))

# Step3 linear interpolation within each country-indicator time series
df_long <- df_long %>%
  group_by(country_code, indicator_code) %>%
  filter(sum(!is.na(value)) >= 2) %>%
  mutate(value = imputeTS::na_interpolation(value, option = "linear")) %>%
  ungroup()

cat("NA count after interpolation:", sum(is.na(df_long$value)), "\n")

## NA count after interpolation: 0

cat("Long data dimension after missing clean:", dim(df_long), "\n")

## Long data dimension after missing clean: 3185028 15

3.5.4 IQR Outliers Capping

Macroeconomic cross-country panel data frequently includes outliers triggered by unexpected economic shocks and divergent national statistical criteria, which would distort coefficient estimation in fixed-effect regression. This study applies the \(1.5\times\text{IQR}\) criterion to detect outliers grouped by country and individual indicator, and adopts value capping (winsorization) rather than direct observation deletion.
Quantitatively, a total of 181395 extreme observations are corrected, consisting of 73871 lower-bound outliers and 107524 upper-bound outliers; no residual outliers exist after full capping treatment.
Group-based IQR clipping matches the heterogeneous statistical characteristics across different economies, completely retains original panel sample size without data truncation, and generates standardized clean predictors ready for subsequent missing-value imputation and benchmark regression.

# Define IQR bound calculation function per subgroup
iqr_bounds <- function(series){
  series <- na.omit(series)
  if(length(series)<4) return(c(NA,NA))
  q1 <- quantile(series,0.25)
  q3 <- quantile(series,0.75)
  iqr <- q3-q1
  lower <- q1 -1.5*iqr
  upper <- q3 +1.5*iqr
  return(c(lower, upper))
}

# Count outlier number per group
count_outlier <- function(series){
  bd <- iqr_bounds(na.omit(series))
  if(is.na(bd[1])) return(c(0,0,0))
  low_out <- sum(series < bd[1], na.rm = T)
  up_out <- sum(series > bd[2], na.rm = T)
  total <- low_out+up_out
  return(c(total, low_out, up_out))
}

# Cap outlier with calculated bound
clip_outlier <- function(df_sub){
  val <- df_sub$value
  bd <- iqr_bounds(val)
  if(!is.na(bd[1])) df_sub$value <- pmax(pmin(val, bd[2]), bd[1])
  return(df_sub)
}

# Calculate global outlier summary
outlier_sum <- df_long %>%
  group_by(country_code, indicator_code) %>%
  summarise(out = list(count_outlier(value)), .groups = "drop") %>%
  mutate(
    total = map_dbl(out, ~.x[1]),
    lower = map_dbl(out, ~.x[2]),
    upper = map_dbl(out, ~.x[3])
  )
cat("Total outlier:", sum(outlier_sum$total),
    "\nLower outlier:", sum(outlier_sum$lower), "\nUpper outlier:",
    sum(outlier_sum$upper),"\n")

## Total outlier: 181395 
## Lower outlier: 73871 
## Upper outlier: 107524

# Apply outlier clipping to full dataset
df_CleanData <- df_long %>%
  group_by(country_code, indicator_code) %>%
  mutate(value = {
    val <- value
    bd <- iqr_bounds(val)
    if(!is.na(bd[1])) pmax(pmin(val, bd[2]), bd[1]) else val
  }) %>%
  ungroup()

# Post-clean outlier verification (should be zero after clipping)
post_outlier <- df_CleanData %>%
  group_by(country_code, indicator_code) %>%
  summarise(out = list(count_outlier(value)), .groups = "drop") %>%
  mutate(total = map_dbl(out, ~.x[1]))
cat("Remaining outlier after capping:", sum(post_outlier$total),"\n")

## Remaining outlier after capping: 0

# Save cleaned long dataset
write_csv(df_CleanData, "CleanData.csv")

3.5.5 Logarithmic Transformation

All numerical indicator values are log-transformed via log1p to mitigate right skewness of economic and demographic data and reduce heteroskedasticity for subsequent regression.

# Create log-transformed variable using log1p to avoid log(0) error
df_transform <- df_CleanData %>% mutate(value_log = log1p(pmax(value,0)))

3.6 Reshape Long back to Wide for Panel Regression

In this section, we reshape the cleaned long-format panel data into a wide format suitable for econometric modeling. We then implement a two-stage missing value handling process: direct removal of high-missing variables and countries, followed by variable filtering and multi-step imputation to address remaining gaps.

3.6.1 Pivot long format into wide panel

# Convert long panel data back to wide format with log-transformed values for regression modelling
df_wide <- df_transform %>% pivot_wider(
  id_cols = c(country_code, year, region, incomegroup),
  names_from = indicator_code,
  values_from = value_log
)

3.6.2 Drop high missing indicators & small countries

Remove variables with missing ratio >10%. Indicators above this cutoff suffer severe data scarcity (evident from the variable missing-rate bar chart) and cannot support credible subsequent imputation; the 10% threshold follows mainstream cross-country empirical standards and marks an obvious inflection point of missing proportion.

Exclude countries with average missing ratio >5%. Nations over the 5% benchmark are mostly small territories with incomplete statistical records, unable to provide valid within-country fluctuation for fixed-effect estimation;this boundary is chosen based on the sharp drop of missing rate shown in the country-level bar graph.

id_cols <- c("country_code","year","region","incomegroup")
var_cols <- setdiff(colnames(df_wide), id_cols)

Remove Variables with High Missing Rates

# Calculate missing percentage for each independent indicator
var_miss_rate <- map_dbl(df_wide[,var_cols], ~mean(is.na(.x)))
miss_df_var <- tibble(indicator = names(var_miss_rate), missing_ratio = var_miss_rate)

# Extract top 30% variables with the highest missing values for visualization only
top_pct <- 0.3
n_top <- floor(nrow(miss_df_var)*top_pct)
top_miss_var <- miss_df_var %>% arrange(desc(missing_ratio)) %>% head(n_top)

# lock X-axis order by missing value descending, avoid alphabet auto-sort
top_miss_var$indicator <- factor(top_miss_var$indicator, levels = top_miss_var$indicator)

# Plot bar graph of top missing variables with two reference threshold lines
ggplot(top_miss_var,aes(x=indicator,y=missing_ratio))+
  geom_col(fill="darkred")+
  geom_hline(yintercept = 0.3,color="red",linetype="dashed",linewidth=1)+
  geom_hline(yintercept = 0.1,color="orange",linetype="dashed",linewidth=1)+
  labs(title="Top 30% Variables with Highest Missing Rates",
       y="Missing Ratio (0 = no missing, 1 = all missing)",x="")+
  theme_bw()+
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle=90,vjust=0.5,hjust=1, size = 5.8))

# drop variables with missing ratio>10%
high_miss_var <- names(var_miss_rate)[var_miss_rate>0.1]
df_wide_AfterVariables_clean <- df_wide[,!colnames(df_wide) %in% high_miss_var]

# Print variable screening summary
cat("Original vars:", length(var_cols), "\n")

## Original vars: 235

cat("Drop vars:", length(high_miss_var), "\n")

## Drop vars: 120

cat("Remain vars:", length(setdiff(colnames(df_wide_AfterVariables_clean),id_cols)), "\n")

## Remain vars: 115

Remove Countries with High Missing Rates

# Calculate average missing value across all indicators per country
country_miss <- df_wide_AfterVariables_clean %>%
  group_by(country_code) %>%
  summarise(miss_avg = mean(is.na(pick(everything())), na.rm=T), .groups="drop")

# Select top 15% countries with highest average missing ratio for plotting
top_pct_country <- 0.15
n_top_c <- floor(nrow(country_miss)*top_pct_country)
top_miss_country <- country_miss %>% arrange(desc(miss_avg)) %>% head(n_top_c)

# set factor levels to preserve descending missing order
top_miss_country$country_code <- factor(top_miss_country$country_code, levels = top_miss_country$country_code)

# Draw country missing rate bar chart with 5% elimination threshold line
ggplot(top_miss_country,aes(x=country_code,y=miss_avg))+
  geom_col(fill="darkred")+
  geom_hline(yintercept = 0.05,color="red",linetype="dashed",linewidth=1)+
  labs(title="Top 15% Countries with Highest Missing Rates (After Variable Cleaning)",
       y="Average Missing Ratio",x="")+
  theme_bw()+
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle=90,vjust=0.5,hjust=1))

# drop countries with average missing ratio>5%
drop_country <- country_miss %>% filter(miss_avg>0.05) %>% pull(country_code)
df_wide_AfterCountries_clean <- df_wide_AfterVariables_clean %>% filter(!country_code %in% drop_country)

# Output country screening summary
cat("Original countries:",n_distinct(df_wide_AfterVariables_clean$country_code),"\n")

## Original countries: 265

cat("Drop countries:",length(drop_country),"\n")

## Drop countries: 26

cat("Remaining countries:",n_distinct(df_wide_AfterCountries_clean$country_code),"\n")

## Remaining countries: 239

3.6.3 Variable Filter: Remove leakage population variables & Imputation

# Variables causing data leakage (direct elderly dependency indicators)
# are excluded; remaining incomplete numeric features are filled with
# sequential linear interpolation and global median substitution for leftover missing entries.

# Define dependent variable: share of population 65+ (Y)
y_var <- "SP.POP.65UP.TO"
df_wide_clean <- df_wide_AfterCountries_clean %>% drop_na(all_of(y_var))

# Exclude population composition variables to prevent data leakage
drop_leak_var <- c("SP.POP.DPND.OL","SP.POP.DPND","SP.POP.DPND.YG")
pop_key <- c("SP.POP","SM.POP")
growth_key <- c("GROW","DPND","BRTH")

# Filter valid independent variables
keep_list <- c(id_cols,y_var)
for(col in setdiff(colnames(df_wide_clean), keep_list)){
  if(col %in% drop_leak_var) next
  has_pop_flag <- any(str_detect(col, pop_key))
  has_grow_flag <- any(str_detect(col, growth_key))
  if(has_pop_flag & !has_grow_flag) next
  keep_list <- c(keep_list, col)
}
df_wide_final <- df_wide_clean[,keep_list]

# Split X and y, remove categorical for numeric imputation
X_raw <- df_wide_final %>% select(-all_of(c(y_var,"region","incomegroup","year")))
y_raw <- df_wide_final[[y_var]]
df_panel_base <- df_wide_final %>% select(country_code,year,region,incomegroup)

3.6.4 Two-step imputation: country-wise linear interpolation + global median fill remaining NA

# Drop full-empty columns
X_raw <- X_raw[, colSums(!is.na(X_raw)) > 0]

# Linear interpolation
X_interp <- X_raw %>%
  mutate(country_code = df_panel_base$country_code) %>%
  group_by(country_code) %>%
  group_modify(~{
    dat <- .x
    # Use zoo::na.approx, auto skip interpolation when valid < 2
    dat %>% mutate(across(everything(), ~ zoo::na.approx(.x, na.rm = FALSE, rule = 2)))
  }) %>%
  ungroup() %>%
  select(-country_code)

# Fill leftover NA with column-wise median
X_imputed <- X_interp %>%
  mutate(across(everything(), ~replace(.x, is.na(.x), median(.x, na.rm = TRUE))))

3.7 Final dataset overview

# Print dataset basic information
cat("Year range:", min(df_wide_final$year), "-", max(df_wide_final$year), "\n")

## Year range: 1960 - 2025

cat("Final country count:", n_distinct(df_wide_final$country_code), "\n")

## Final country count: 239

cat("Final obs count:", nrow(df_wide_final), "\n")

## Final obs count: 15774

# Bind into plm panel format data
# Remove duplicated year column from X_imputed before concatenation
X_imputed_no_year <- X_imputed %>% select(-any_of("year"))

panel_df <- df_panel_base %>%
  mutate(y = y_raw) %>%
  bind_cols(X_imputed_no_year)

# Build panel dataset with country+year index
pdata <- pdata.frame(panel_df, index = c("country_code", "year"))

4. Exploratory Data Analysis (EDA)

Exploratory analysis visualizes global aging long-run trend, distribution of core demographic indicators (birth/death rate), pairwise variable correlation, and heterogeneous trends split by national income and geographic region to observe preliminary aging patterns across the globe.

4.1 Global average aging trend plot

The line graph tracks cross-country average log-transformed share of population aged 65 and above from 1960 to 2025. The indicator maintains a steady continuous upward trend across the whole sample period without any downward fluctuation, demonstrating a consistent global population aging tendency over six decades, which motivates our subsequent empirical study on its driving factors

# Calculate annual average value of dependent variable for aggregate trend visualization
trend_data <- df_wide_final %>% 
  group_by(year) %>% 
  summarise(y_mean = mean(.data[[y_var]], na.rm = T))

# Define tick sequence: label every 5 years on X-axis
year_start <- min(trend_data$year)
year_end   <- max(trend_data$year)
year_break <- seq(from = year_start, to = year_end, by = 5)

# Plot overall long-term trend of elderly population proportion
ggplot(trend_data, aes(x = year, y = y_mean)) +
  geom_line(linewidth = 1.2, color = "purple") +
  geom_point(color = "purple") +
  # Set X-axis ticks with 5-year interval
  scale_x_continuous(breaks = year_break) +
  labs(
    title = "Average Trend of Population Ages 65 and Above",
    x = "Year",
    y = "Log-transformed Value"
  ) +
  theme_bw() +
  theme(
    plot.title = element_text(hjust = 0.5), # center plot title
    axis.text.x = element_text(angle = 45, hjust = 1) # tilt year labels to avoid overlap
  )

4.2 Birth & Death rate distribution histogram

The paired histograms exhibit sample distribution of log-transformed crude birth rate and crude death rate, core demographic drivers of population aging. After logarithmic conversion, the crude death rate follows an obvious bimodal distribution, whereas the crude birth rate shows multi-peak distribution; both indicators are close to approximate normal distribution after log transformation, satisfying the distribution assumption for linear panel regression. The multimodal characteristics also reflect prominent cross-country divergence in national fertility and mortality levels worldwide.

# Extract crude birth rate and crude death rate variables for distribution analysis
br <- df_wide_final$SP.DYN.CBRT.IN
dr <- df_wide_final$SP.DYN.CDRT.IN

# Histogram + kernel density plot for crude birth rate
p1 <- ggplot(tibble(val = na.omit(br)), aes(x = val)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "skyblue") +
  geom_density(linewidth = 1) +
  labs(title = "Crude Birth Rate Distribution", x = "Log Value") +
  theme_bw()

# Histogram + kernel density plot for crude death rate
p2 <- ggplot(tibble(val = na.omit(dr)), aes(x = val)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "salmon") +
  geom_density(linewidth = 1) +
  labs(title = "Crude Death Rate Distribution", x = "Log Value") +
  theme_bw()

# Combine two plots side by side
p1 + p2

4.3 Core variable correlation heatmap

Core indicators covering demographic structure, birth-death dynamics, public healthcare and cross-border migration are selected for exploratory analysis based on demographic transition theory. The heatmap summarizes pairwise Pearson correlation coefficients:

The elderly population share (SP.POP.65UP.TO) is weakly negatively correlated with the crude birth rate (r = -0.22), while showing only trivial correlation with the crude death rate (r = 0.02). Its associations with other predictors are also weak, ranging from -0.05 to 0.21.
Several explanatory variables exhibit severe multicollinearity (|r| > 0.7), including crude birth rate and maternal mortality (r = 0.83), maternal mortality and government per capita health spending (r = -0.84), and other strong interlinkages. These inherent correlations justify the subsequent VIF-based variable screening.

# Select core variables for pairwise correlation analysis
corr_var <- c(
  y_var,                        # Dependent variable: Percentage of total population aged 65 and over (elderly population share)
  "SP.DYN.CBRT.IN",             # Crude birth rate: annual live births per 1,000 total population
  "SH.STA.MMRT",                # Maternal mortality ratio: maternal deaths per 100,000 live births
  "SH.XPD.GHED.PC.CD",          # Per capita government health expenditure measured in constant USD
  "SP.DYN.CDRT.IN",             # Crude death rate: annual deaths per 1,000 total population
  "SP.POP.BRTH.MF",             # Male-to-female birth ratio reflecting gender structure of newborn population
  "SH.IMM.IDPT",                # DPT Immunization coverage rate of children
  "SP.DYN.TO65.FE.ZS",          # The proportion of females among the population aged 65 and above reflects the gender disparity in life expectancy
  "SH.H2O.BASW.ZS"              # Percentage of population with access to safely managed basic drinking water
)

# Retain existing variables only and remove full missing observations
corr_df <- df_wide_final[,intersect(corr_var,colnames(df_wide_final))] %>% drop_na()

# Compute Pearson correlation matrix and visualize via heatmap
suppressWarnings({
  cor_mat <- cor(corr_df)
  ggcorrplot(
    cor_mat,
    lab = TRUE,
    lab_size = 3.2,
    colors = c("#E74C3C","white","#2ECC71"),
    title = "Core Indicators Correlation Heatmap"
  )+
    theme_bw()+
    theme(
      plot.title = element_text(hjust = 0.5, size = 16),
      axis.text.x = element_text(angle = 45, hjust = 1, size = 7.8),
      axis.text.y = element_text(size = 8.5)
    )
})

4.4 Birth & Death rate trend grouped by income & region

4.4.1 Income group trend

The dual subplots display long-term changes in birth and death rates across five income tiers from 1960 onward. Both fertility and mortality show sustained downward trends universally; consistently, high-income nations retain the lowest birth & death levels, while low-income economies stay at the top throughout decades. Such obvious income-based divergence confirms economic development is closely linked to demographic transition and subsequent population aging. The small abnormal spike around 2020 on mortality curves corresponds to short-term global public health shocks.

# Calculate annual average birth & death rate grouped by country income classification
inc_trend <- df_wide_final %>%
  select(year, incomegroup, SP.DYN.CBRT.IN, SP.DYN.CDRT.IN) %>%
  pivot_longer(c(SP.DYN.CBRT.IN, SP.DYN.CDRT.IN), names_to = "var", values_to = "val") %>%
  group_by(year, incomegroup, var) %>%
  summarise(val = mean(val, na.rm = T), .groups = "drop")

# Set x-axis tick interval: display label every 5 years
year_min <- min(inc_trend$year)
year_max <- max(inc_trend$year)
break_seq <- seq(from = year_min, to = year_max, by = 5)

# Birth rate trend plot
p_inc1 <- inc_trend %>%
  filter(var == "SP.DYN.CBRT.IN") %>%
  ggplot(aes(x = year, y = val, color = incomegroup)) +
  geom_line(linewidth = 1) +
  scale_x_continuous(breaks = break_seq) +
  labs(title = "Birth Rate by Income Group", x = "Year") +
  theme_bw() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 9),
    plot.margin = ggplot2::margin(t = 10, r = 12, b = 30, l = 8) # Expand bottom margin for year labels
  )

# Death rate trend plot
p_inc2 <- inc_trend %>%
  filter(var == "SP.DYN.CDRT.IN") %>%
  ggplot(aes(x = year, y = val, color = incomegroup)) +
  geom_line(linewidth = 1) +
  scale_x_continuous(breaks = break_seq) +
  labs(title = "Death Rate by Income Group", x = "Year") +
  theme_bw() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 9),
    plot.margin = ggplot2::margin(t = 10, r = 12, b = 30, l = 8)
  )

# Change layout: vertical top-bottom instead of side-by-side
final_plot <- (p_inc1 / p_inc2) + plot_layout(heights = c(5,5)) 


final_plot

4.4.2 Region group trend

Regional subdivision further reveals obvious geographical divergence in fertility and mortality evolution. All regions experience overall falling birth and death rates across decades; Sub-Saharan Africa consistently maintains the highest birth rate globally, while Middle East & North Africa shows the sharpest mortality decline. The temporary sharp spike around 2020 across most regions is attributed to global pandemic-driven mortality fluctuation, which also explains the short-term abnormal rise on the curves. Such cross-region heterogeneity verifies geographic difference is an important source of aging disparity worldwide.

# Calculate average birth & death rate grouped by year and geographic region for regional trend visualization
reg_trend <- df_wide_final %>%
  select(year, region, SP.DYN.CBRT.IN, SP.DYN.CDRT.IN) %>%
  pivot_longer(c(SP.DYN.CBRT.IN, SP.DYN.CDRT.IN), names_to = "var", values_to = "val") %>%
  group_by(year, region, var) %>%
  summarise(val = mean(val, na.rm = T), .groups = "drop")

# Set X-axis ticks every 5 years
year_min <- min(reg_trend$year)
year_max <- max(reg_trend$year)
break_seq <- seq(from = year_min, to = year_max, by = 5)

# Regional crude birth rate trend plot
p_reg1 <- reg_trend %>%
  filter(var == "SP.DYN.CBRT.IN") %>%
  ggplot(aes(x = year, y = val, color = region)) +
  geom_line(linewidth = 1) +
  scale_x_continuous(breaks = break_seq) +
  labs(title = "Birth Rate by Region", x = "Year") +
  theme_bw() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 35, hjust = 1, size = 8),
    plot.margin = ggplot2::margin(t = 8, r = 10, b = 25, l = 8)
  )

# Regional crude death rate trend plot
p_reg2 <- reg_trend %>%
  filter(var == "SP.DYN.CDRT.IN") %>%
  ggplot(aes(x = year, y = val, color = region)) +
  geom_line(linewidth = 1) +
  scale_x_continuous(breaks = break_seq) +
  labs(title = "Death Rate by Region", x = "Year") +
  theme_bw() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 35, hjust = 1, size = 8),
    plot.margin = ggplot2::margin(t = 8, r = 10, b = 35, l = 8)
  )

# Change from side-by-side (+) to top-bottom layout (/)
final_reg_plot <- (p_reg1 / p_reg2) + plot_layout(heights = c(5,5))

final_reg_plot

5. Modeling

Two core modelling tasks are completed here: fixed-effect panel regression to quantify aging driving factors (regression task), and K-means unsupervised clustering to categorize high/low aging countries based on birth & death rate (classification task). Subgroup regression across income and geographic groups further captures heterogeneous aging drivers globally.

5.1 Fixed Effects Model

5.1.1 Hausman Test for FE vs RE selection

The null hypothesis of the Hausman test assumes random-effects (RE) estimator is consistent and efficient, while the alternative states RE becomes inconsistent and fixed-effects (FE) is preferable.

With chi^2=1203.2, df=52 and p<0.001) the test strongly rejects the null hypothesis at the 1% significance level.

Final decision: The fixed-effect specification is statistically preferred over random-effect model for this panel dataset, confirming the existence of unobserved country-specific heterogeneity correlated with explanatory variables.

# Convert cleaned explanatory variable dataset into matrix format for panel regression
exog_mat <- as.matrix(X_imputed)
# Extract dependent variable (logged share of population aged 65+) as numeric vector
y_vec    <- y_raw

# Random Effects panel regression (Swamy-Arora estimator by default)
re_mod <- plm(y_vec ~ exog_mat, data = pdata, model = "random")
# Fixed Effects (Within transformation) panel regression
fe_mod <- plm(y_vec ~ exog_mat, data = pdata, model = "within")

# Hausman specification test: choose between FE and RE model
# H0: Random Effects is consistent and efficient; H1: Fixed Effects is preferred
haus_result <- phtest(fe_mod, re_mod)

# Print Hausman test statistical output
print(haus_result)

## 
##  Hausman Test
## 
## data:  y_vec ~ exog_mat
## chisq = 1203.2, df = 52, p-value < 2.2e-16
## alternative hypothesis: one model is inconsistent

5.1.1 Define Reusable Utility Functions

To facilitate preprocessing and grouped fixed-effect regression, three custom functions are developed:

cal_vif(): Calculate and sort VIF values for multicollinearity check.
vif_filter(): Iteratively drop high-VIF variables to remove multicollinearity.
run_group_fe(): Complete subsample screening, variable filtering and fixed-effect estimation in one workflow.

# Calculate VIF values and return sorted result table
# Create artificial dependent variable(all ones) to compute VIF only, coefficient has no practical meaning
cal_vif <- function(X_sub) {
  # Build fake constant dependent variable for regression fitting
  fake_y <- rep(1, nrow(X_sub))
  # Fit linear model and compute Variance Inflation Factor via car package
  vif_res <- car::vif(lm(fake_y ~ ., data = X_sub))
  # Convert VIF output into tibble, sort variables descending by VIF value
  return(tibble(var = names(vif_res), VIF = as.numeric(vif_res)) %>% arrange(desc(VIF)))
}


# Iterative multicollinearity elimination based on VIF threshold
# Drop the variable with maximum VIF repeatedly until all VIF <= specified threshold(default = 10)
# Input: X_in = raw independent variable dataset; max_vif = VIF cutoff threshold
vif_filter <- function(X_in, max_vif = 10) {
  # Initialize full variable list from input data column names
  var_list <- colnames(X_in)
  
  # Print initialization log information
  cat("=== Start Automatic Iterative VIF Filtering | Max allowed VIF =", max_vif, "===\n")
  cat("Initial total candidate variables count:", length(var_list), "\n\n")

  # Iterative removing high-VIF variables in loop
  while(TRUE) {
    # Terminate loop if less than 2 variables left(VIF calculation needs at least two predictors)
    if (length(var_list) < 2) break
    
    # Extract subset data with current reserved variables
    X_temp <- X_in[, var_list]

    # Step 1: Delete zero-variance columns(all identical values or all NA cannot participate regression)
    var_keep <- apply(X_temp, 2, function(x) length(unique(na.omit(x))) > 1)
    X_temp <- X_temp[, var_keep]

    # Stop iteration after dropping zero-variance features if remaining columns < 2
    if (ncol(X_temp) < 2) {
      var_list <- colnames(X_temp)
      break
    }

    # Compute VIF ranking table for current variable subset
    vdf <- cal_vif(X_temp)
    highest_v <- vdf$VIF[1]
    highest_var <- vdf$var[1]

    # Exit loop once maximum VIF below preset threshold
    if (highest_v <= max_vif) break

    # Remove the variable with largest VIF and output elimination log
    var_list <- setdiff(var_list, highest_var)
    cat("Drop variable:", highest_var, "| Corresponding VIF =", round(highest_v, 2), "\n")
  }

  # Output final screening summary
  cat("\n=== VIF Filter Finished ===")
  cat("\nRemaining valid variables after VIF screening count:", length(var_list))
  cat("\nFinal reserved variable list:\n")
  print(var_list)

  # Return screened variable names
  return(var_list)
}



# Subgroup Fixed-Effect Panel Regression Function
# Core workflow: filter group sample -> eliminate perfect collinearity -> VIF multicollinearity screening -> construct panel data -> run within FE regression
# Parameters:
#   group_name: target subgroup name for split regression
#   group_col: grouping variable column name (incomegroup / region etc.)
#   df_in: full raw dataset containing country & year panel identifiers
#   y_in: full dependent variable vector
#   X_in: full independent variable matrix/dataframe
#   maxvif: VIF threshold for multicollinearity filtering, default = 10
run_group_fe <- function(group_name, group_col, df_in, y_in, X_in, maxvif = 10) {

  # Step 1: Extract observations belonging to target subgroup
  group_idx <- df_in[[group_col]] == group_name
  
  # Abort regression if subgroup sample size less than 10 (insufficient observations for valid panel estimation)
  if (sum(group_idx) < 10) return(NULL)
  
  # Slice dependent & independent variables by subgroup filter
  y_sub <- y_in[group_idx]
  X_sub <- X_in[group_idx, ]

  # Step 2: Remove perfectly collinear (aliased) predictors via alias detection
  alias_mat <- alias(lm(y_sub ~ ., data = data.frame(y = y_sub, X_sub)))$Complete
  if (!is.null(alias_mat)) {
    drop_alias_var <- rownames(alias_mat)
    # Drop fully collinear columns from independent set
    X_sub <- X_sub[, !colnames(X_sub) %in% drop_alias_var, drop = FALSE]
  }

  # Step 3: Iterative VIF multicollinearity filtering with error capture
  sel_var <- tryCatch({
    vif_filter(X_sub, max_vif = maxvif)
  }, error = function(e) {
    return(NULL)
  })
  
  # Terminate function if variable selection fails or no valid predictors left
  if (is.null(sel_var) || length(sel_var) < 1) return(NULL)

  # Build regression dataframe: dependent var + screened independent vars
  reg_df <- data.frame(y = y_sub, X_sub[, sel_var])
  
  # Attach panel index columns (country & year only for pdata.frame definition, excluded from regression later)
  reg_df$country_code <- df_in$country_code[group_idx]
  reg_df$year <- df_in$year[group_idx]

  # Construct balanced/unbalanced panel data object defined by country-year dual index
  sub_pdata <- pdata.frame(reg_df, index = c("country_code", "year"))

  # Step 4: Fixed-effect Within regression, exclude index columns to avoid accidental year dummy generation
  # Wrap regression in tryCatch to skip subgroup if model estimation crashes
  mod <- tryCatch({
    reg_input <- sub_pdata[, !colnames(sub_pdata) %in% c("country_code", "year")]
    plm(y ~ ., data = reg_input, model = "within")
  }, error = function(e) {
    return(NULL)
  })

  # Return final fixed-effect model object
  return(mod)
}

5.1.2 Countries Analysis

Multicollinearity purification: Starting with 52 initial explanatory variables, iterative VIF filtering (threshold = 10) removed 25 severely collinear indicators with extremely high VIF values. The final 27 reserved variables all yield VIF below 10, eliminating serious multicollinearity.
Model validity: The optimized fixed-effects model adopts Driscoll-Kraay robust standard errors to enhance estimation reliability.
Significance performance: A total of 20 variables show statistical significance at the 5% level. Core demographic and healthcare indicators (fertility rate, physician density, mortality and health expenditure variables) present robust explanatory power for population aging.

# Assign significance stars based on p-value:
# *** p<0.001, ** p<0.01, * p<0.05, otherwise not significant
signif_label <- function(p) {
  if (p < 0.001) {
    return("*** (p < 0.001)")
  } else if (p < 0.01) {
    return("** (p < 0.01)")
  } else if (p < 0.05) {
    return("* (p < 0.05)")
  } else {
    return("(Not Significant)")
  }
}

# ==============================================================
# FULLY AUTOMATIC FIXED EFFECTS MODEL
# Iterative VIF + Significance Filtering + Driscoll-Kraay Robust SE
# ==============================================================
# Dependent variable: SP.POP.65UP.TO stored in y_raw
# Independent variables: X_imputed (already interpolated missing values)

cat("Variables before reduction: ", ncol(X_imputed), "\n")

## Variables before reduction:  52

cat("X dimension: ", nrow(X_imputed), " rows, ", ncol(X_imputed), " cols\n")

## X dimension:  15774  rows,  52  cols

cat("y length: ", length(y_raw), "\n")

## y length:  15774

# ----------------------
# Step 1: De-mean (Fixed Effects Transformation, keep country & year index)
# ----------------------
# Combine index + y + X first to avoid index mismatch
df_dm_base <- df_wide_final[, c("country_code", "year")]
df_dm_base$y <- y_raw
df_dm_base <- cbind(df_dm_base, X_imputed)

# Demean function by country_code
demean_single <- function(vec, group_id) {
  group_mean <- ave(vec, group_id, FUN = function(x) mean(x, na.rm = TRUE))
  return(vec - group_mean)
}

# Demean y and all X by country
df_dm_base$y_dm <- demean_single(df_dm_base$y, df_dm_base$country_code)
var_names <- colnames(X_imputed)
for(v in var_names){
  df_dm_base[[paste0(v,"_dm")]] <- demean_single(df_dm_base[[v]], df_dm_base$country_code)
}

# Pick demeaned data + index, drop NA
dm_col <- paste0(var_names,"_dm")
comb_all <- df_dm_base[,c("country_code","year","y_dm",dm_col)]
comb_all <- na.omit(comb_all)

# Split final y and X
y_final <- comb_all$y_dm
X_final <- comb_all[, dm_col]
colnames(X_final) <- var_names # restore original variable names

# ----------------------
# Step 2: Iterative VIF Reduction (AUTOMATIC, threshold = 10)
# ----------------------
max_vif <- 10.0
cat("\n=== STARTING AUTOMATIC VIF REDUCTION ===\n")

## 
## === STARTING AUTOMATIC VIF REDUCTION ===

selected_vars <- vif_filter(X_final, max_vif = max_vif)

## === Start Automatic Iterative VIF Filtering | Max allowed VIF = 10 ===
## Initial total candidate variables count: 52 
## 
## Drop variable: SP.DYN.IMRT.IN | Corresponding VIF = 41030.72 
## Drop variable: SH.DYN.MORT | Corresponding VIF = 8219.19 
## Drop variable: SH.DYN.MORT.MA | Corresponding VIF = 5073.22 
## Drop variable: SP.DYN.LE00.IN | Corresponding VIF = 2602.49 
## Drop variable: SP.DYN.IMRT.FE.IN | Corresponding VIF = 982.07 
## Drop variable: SH.ANM.ALLW.ZS | Corresponding VIF = 367.38 
## Drop variable: SP.DYN.IMRT.MA.IN | Corresponding VIF = 273.17 
## Drop variable: SH.XPD.PVTD.PC.CD | Corresponding VIF = 249.77 
## Drop variable: SP.DYN.LE00.MA.IN | Corresponding VIF = 245.11 
## Drop variable: SH.XPD.CHEX.PC.CD | Corresponding VIF = 160.6 
## Drop variable: SH.DTH.MORT | Corresponding VIF = 122.2 
## Drop variable: SH.XPD.PVTD.PP.CD | Corresponding VIF = 83.95 
## Drop variable: SH.XPD.CHEX.PP.CD | Corresponding VIF = 63.98 
## Drop variable: SH.DTH.NMRT | Corresponding VIF = 57.66 
## Drop variable: SH.MMR.RISK | Corresponding VIF = 54.99 
## Drop variable: SP.DYN.LE00.FE.IN | Corresponding VIF = 47.3 
## Drop variable: SH.XPD.GHED.PC.CD | Corresponding VIF = 42.39 
## Drop variable: SP.DYN.AMRT.FE | Corresponding VIF = 28.37 
## Drop variable: SH.DYN.MORT.FE | Corresponding VIF = 27.38 
## Drop variable: SP.DYN.CBRT.IN | Corresponding VIF = 21.75 
## Drop variable: SH.XPD.GHED.PP.CD | Corresponding VIF = 19.79 
## Drop variable: SH.STA.MMRT | Corresponding VIF = 19.48 
## Drop variable: SP.DYN.TO65.MA.ZS | Corresponding VIF = 16.68 
## Drop variable: SH.XPD.OOPC.PP.CD | Corresponding VIF = 12.53 
## Drop variable: SH.DTH.IMRT | Corresponding VIF = 11.59 
## 
## === VIF Filter Finished ===
## Remaining valid variables after VIF screening count: 27
## Final reserved variable list:
##  [1] "SH.H2O.BASW.ZS"    "SH.STA.BASS.ZS"    "SH.STA.DIAB.ZS"   
##  [4] "SH.STA.ODFC.ZS"    "SP.ADO.TFRT"       "SP.DYN.AMRT.MA"   
##  [7] "SP.DYN.CDRT.IN"    "SP.DYN.TFRT.IN"    "SP.DYN.TO65.FE.ZS"
## [10] "SP.POP.BRTH.MF"    "SP.POP.GROW"       "SH.ANM.NPRG.ZS"   
## [13] "SH.DYN.NMRT"       "SH.IMM.IDPT"       "SH.IMM.MEAS"      
## [16] "SH.MED.BEDS.ZS"    "SH.MED.NUMW.P3"    "SH.MED.PHYS.ZS"   
## [19] "SH.MMR.DTHS"       "SH.MMR.RISK.ZS"    "SH.PRG.ANEM"      
## [22] "SH.XPD.CHEX.GD.ZS" "SH.XPD.GHED.CH.ZS" "SH.XPD.GHED.GD.ZS"
## [25] "SH.XPD.OOPC.CH.ZS" "SH.XPD.OOPC.PC.CD" "SH.XPD.PVTD.CH.ZS"

cat("\nFinal variables after VIF: ", length(selected_vars), "\n")

## 
## Final variables after VIF:  27

if(length(selected_vars)>0) print(selected_vars)

##  [1] "SH.H2O.BASW.ZS"    "SH.STA.BASS.ZS"    "SH.STA.DIAB.ZS"   
##  [4] "SH.STA.ODFC.ZS"    "SP.ADO.TFRT"       "SP.DYN.AMRT.MA"   
##  [7] "SP.DYN.CDRT.IN"    "SP.DYN.TFRT.IN"    "SP.DYN.TO65.FE.ZS"
## [10] "SP.POP.BRTH.MF"    "SP.POP.GROW"       "SH.ANM.NPRG.ZS"   
## [13] "SH.DYN.NMRT"       "SH.IMM.IDPT"       "SH.IMM.MEAS"      
## [16] "SH.MED.BEDS.ZS"    "SH.MED.NUMW.P3"    "SH.MED.PHYS.ZS"   
## [19] "SH.MMR.DTHS"       "SH.MMR.RISK.ZS"    "SH.PRG.ANEM"      
## [22] "SH.XPD.CHEX.GD.ZS" "SH.XPD.GHED.CH.ZS" "SH.XPD.GHED.GD.ZS"
## [25] "SH.XPD.OOPC.CH.ZS" "SH.XPD.OOPC.PC.CD" "SH.XPD.PVTD.CH.ZS"

# ----------------------
# Step 3: Run Final Within Fixed-Effect Model with Driscoll-Kraay Robust SE
# ----------------------
X_opt <- X_final[, selected_vars, drop = FALSE]
combined_opt <- data.frame(y = y_final, X_opt)
# Index directly from comb_all (no longer search original df, eliminate NA index)
combined_opt$country_code <- comb_all$country_code
combined_opt$year <- comb_all$year

# Build panel data (index no NA now)
sub_pdata <- pdata.frame(combined_opt, index = c("country_code", "year"))
reg_input <- sub_pdata[, !colnames(sub_pdata) %in% c("country_code", "year")]
fe_model <- plm(y ~ ., data = reg_input, model = "within")

# Driscoll-Kraay robust test
fe_robust <- coeftest(fe_model, vcov = vcovSCC(fe_model))

cat("\n", paste(rep("=",80),collapse=""),"\n")

## 
##  ================================================================================

cat("FINAL OPTIMIZED FIXED EFFECTS MODEL (Driscoll-Kraay Robust SE)\n")

## FINAL OPTIMIZED FIXED EFFECTS MODEL (Driscoll-Kraay Robust SE)

cat(paste(rep("=",80),collapse=""),"\n")

## ================================================================================

print(fe_robust)

## 
## t test of coefficients:
## 
##                     Estimate Std. Error  t value  Pr(>|t|)    
## SH.H2O.BASW.ZS     0.2848752  0.0449118   6.3430 2.316e-10 ***
## SH.STA.BASS.ZS    -0.1574283  0.0177714  -8.8585 < 2.2e-16 ***
## SH.STA.DIAB.ZS     0.0357725  0.0130339   2.7446 0.0060658 ** 
## SH.STA.ODFC.ZS    -0.0107639  0.0144913  -0.7428 0.4576247    
## SP.ADO.TFRT        0.0174805  0.0142660   1.2253 0.2204707    
## SP.DYN.AMRT.MA    -0.3403246  0.0349338  -9.7420 < 2.2e-16 ***
## SP.DYN.CDRT.IN     0.1809543  0.0515885   3.5076 0.0004534 ***
## SP.DYN.TFRT.IN    -0.8531598  0.0407084 -20.9578 < 2.2e-16 ***
## SP.DYN.TO65.FE.ZS  0.4681631  0.1226319   3.8176 0.0001353 ***
## SP.POP.BRTH.MF    -2.2012689  1.1937977  -1.8439 0.0652137 .  
## SP.POP.GROW       -0.1333898  0.0209176  -6.3769 1.858e-10 ***
## SH.ANM.NPRG.ZS    -0.0380817  0.0351884  -1.0822 0.2791696    
## SH.DYN.NMRT       -0.1858644  0.0169975 -10.9348 < 2.2e-16 ***
## SH.IMM.IDPT        0.0991318  0.0165576   5.9871 2.184e-09 ***
## SH.IMM.MEAS        0.0031745  0.0058998   0.5381 0.5905331    
## SH.MED.BEDS.ZS    -0.1376180  0.0160767  -8.5601 < 2.2e-16 ***
## SH.MED.NUMW.P3     0.0929990  0.0129341   7.1902 6.765e-13 ***
## SH.MED.PHYS.ZS     0.0538507  0.0187877   2.8663 0.0041590 ** 
## SH.MMR.DTHS        0.1655923  0.0087325  18.9628 < 2.2e-16 ***
## SH.MMR.RISK.ZS    -0.2518632  0.0456905  -5.5124 3.597e-08 ***
## SH.PRG.ANEM       -0.1391300  0.0583431  -2.3847 0.0171056 *  
## SH.XPD.CHEX.GD.ZS -0.0181538  0.0353103  -0.5141 0.6071744    
## SH.XPD.GHED.CH.ZS -0.0013511  0.0278583  -0.0485 0.9613206    
## SH.XPD.GHED.GD.ZS  0.1728557  0.0555446   3.1120 0.0018615 ** 
## SH.XPD.OOPC.CH.ZS -0.3104556  0.0534801  -5.8051 6.560e-09 ***
## SH.XPD.OOPC.PC.CD  0.0700523  0.0121782   5.7523 8.971e-09 ***
## SH.XPD.PVTD.CH.ZS  0.2054841  0.0527956   3.8921 9.981e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# ----------------------
# Step 4: Show Final VIF
# ----------------------
cat("\nFINAL VIFs after screening:\n")

## 
## FINAL VIFs after screening:

final_vif_tab <- cal_vif(X_opt)
print(final_vif_tab)

## # A tibble: 27 × 2
##    var                 VIF
##    <chr>             <dbl>
##  1 SP.DYN.CDRT.IN     7.36
##  2 SH.XPD.GHED.GD.ZS  6.54
##  3 SP.DYN.TFRT.IN     6.49
##  4 SH.MMR.RISK.ZS     5.99
##  5 SP.DYN.AMRT.MA     5.82
##  6 SH.PRG.ANEM        5.62
##  7 SH.XPD.OOPC.CH.ZS  5.24
##  8 SH.IMM.IDPT        5.06
##  9 SH.DYN.NMRT        4.98
## 10 SH.XPD.PVTD.CH.ZS  4.68
## # ℹ 17 more rows

# ----------------------
# Step 5: Extract Significant Variables (p < 0.05)
# ----------------------
sig_df <- data.frame(
  Variable = rownames(fe_robust),
  Coefficient = fe_robust[,1],
  P_Value = fe_robust[,4]
)
sig_df$Significance <- sapply(sig_df$P_Value, signif_label)
sig_df <- subset(sig_df, P_Value < 0.05)
sig_df <- sig_df[order(sig_df$P_Value), ]

cat("\nFINAL SIGNIFICANT VARIABLES (p < 0.05):\n")

## 
## FINAL SIGNIFICANT VARIABLES (p < 0.05):

print(sig_df, row.names = FALSE)

##           Variable Coefficient      P_Value    Significance
##     SP.DYN.TFRT.IN -0.85315975 3.417352e-96 *** (p < 0.001)
##        SH.MMR.DTHS  0.16559225 2.726232e-79 *** (p < 0.001)
##        SH.DYN.NMRT -0.18586444 9.918897e-28 *** (p < 0.001)
##     SP.DYN.AMRT.MA -0.34032463 2.313800e-22 *** (p < 0.001)
##     SH.STA.BASS.ZS -0.15742830 8.973156e-19 *** (p < 0.001)
##     SH.MED.BEDS.ZS -0.13761797 1.232420e-17 *** (p < 0.001)
##     SH.MED.NUMW.P3  0.09299902 6.765000e-13 *** (p < 0.001)
##        SP.POP.GROW -0.13338978 1.857971e-10 *** (p < 0.001)
##     SH.H2O.BASW.ZS  0.28487516 2.315864e-10 *** (p < 0.001)
##        SH.IMM.IDPT  0.09913183 2.183586e-09 *** (p < 0.001)
##  SH.XPD.OOPC.CH.ZS -0.31045557 6.559831e-09 *** (p < 0.001)
##  SH.XPD.OOPC.PC.CD  0.07005230 8.971199e-09 *** (p < 0.001)
##     SH.MMR.RISK.ZS -0.25186317 3.596685e-08 *** (p < 0.001)
##  SH.XPD.PVTD.CH.ZS  0.20548408 9.981056e-05 *** (p < 0.001)
##  SP.DYN.TO65.FE.ZS  0.46816309 1.352645e-04 *** (p < 0.001)
##     SP.DYN.CDRT.IN  0.18095425 4.533636e-04 *** (p < 0.001)
##  SH.XPD.GHED.GD.ZS  0.17285568 1.861512e-03   ** (p < 0.01)
##     SH.MED.PHYS.ZS  0.05385073 4.158964e-03   ** (p < 0.01)
##     SH.STA.DIAB.ZS  0.03577254 6.065810e-03   ** (p < 0.01)
##        SH.PRG.ANEM -0.13912998 1.710565e-02    * (p < 0.05)

# ----------------------
# Step 6: Output Within R² & Overall F-statistic
# ----------------------
sum_fe <- summary(fe_model)

# Extract within R-squared & adjusted within R²
r2_within <- as.numeric(sum_fe$r.squared["rsq"])
r2_adj    <- as.numeric(sum_fe$r.squared["adjrsq"])

# Extract F-statistic and its p-value
f_val  <- as.numeric(sum_fe$fstatistic$statistic)
f_pval <- as.numeric(sum_fe$fstatistic$p.value)

# Print formatted output
cat("\n", paste(rep("=",60),collapse=""),"\n")

## 
##  ============================================================

cat("Within R-squared:          ", round(r2_within,4),"\n")

## Within R-squared:           0.8753

cat("Adjusted Within R-squared: ", round(r2_adj,4),"\n")

## Adjusted Within R-squared:  0.8731

cat("Overall F-statistic:       ", round(f_val,4),"\n")

## Overall F-statistic:        4030.681

cat("F-statistic P-value:       ", format(f_pval, scientific = FALSE, digits = 6),"\n")

## F-statistic P-value:        0

cat(paste(rep("=",60),collapse=""),"\n")

## ============================================================

# Pass the value to the grouping table for processing
save_r2 <- round(r2_within,4)
save_F <- round(f_val,4)
save_Fp <- round(f_pval,4)

5.1.3 Income Analysis

Multicollinearity elimination: All regressions adopt a consistent VIF threshold of 10. Starting from 52 original explanatory variables, highly collinear indicators with extreme VIF values (ranging from tens to hundreds of thousands) were iteratively removed for each income subgroup. The final valid variable sets range from 15 to 25 predictors per group, ensuring no serious multicollinearity.
Group-specific significant predictors: The number of statistically significant drivers (p < 0.05) varies substantially across income levels. Lower-middle-income and high-income countries exhibit the largest number of significant aging determinants, while the unknown income group has the fewest significant variables.
Heterogeneous coefficient magnitude: Low-income countries yield the largest total absolute coefficient magnitude, indicating that demographic and health factors exert stronger marginal effects on population aging in less developed economies.
Core consistent driver: Total fertility rate (SP.DYN.TFRT.IN) remains significantly negative across all income subgroups, confirming its universal role in promoting population aging. Physician density (SH.MED.PHYS.ZS) is consistently positive and significant in all classified income groups, acting as a stable positive aging driver.

# ==============================================================================
# Step 1: Prepare valid income categories and initialize model storage list
# Extract non-NA unique income labels to iterate heterogeneous subgroup regression
# ==============================================================================
income_list <- unique(na.omit(df_wide_final$incomegroup))

# Empty list to save successfully estimated fixed-effect models by income group
income_fe_list <- list()

# ==============================================================================
# Step 2: Iterate each income group to run customized within-demeaning fixed-effect regression
# run_group_fe returns NULL automatically when sample size or variable conditions fail
# Print full regression output & significant variables for every valid subgroup after estimation
# ==============================================================================
for (gr in income_list) {
  mod_tmp <- run_group_fe(
    group_name = gr,
    group_col  = "incomegroup",
    df_in      = df_wide_final,
    y_in       = y_raw,
    X_in       = X_imputed
  )
  # Keep only valid estimated model, drop NULL invalid outputs
  if (!is.null(mod_tmp)) {
    income_fe_list[[gr]] <- mod_tmp
    
    # -------------------- Print full estimation results for current income group 
    cat("\n================================================================================\n")
    cat("                    PANEL FIXED EFFECT ESTIMATION | INCOME GROUP: ", gr, "\n")
    cat("================================================================================\n")
    
    # Calculate Driscoll-Kraay robust coefficient test table
    robust_res <- coeftest(mod_tmp, vcov = vcovSCC(mod_tmp))
    # Output Parameter Estimates table
    print(robust_res)
    sum_g <- summary(mod_tmp)
    r2_raw    <- as.numeric(sum_g$r.squared["rsq"])
    r2_adj    <- as.numeric(sum_g$r.squared["adjrsq"])
    f_stat    <- as.numeric(sum_g$fstatistic$statistic)
    f_pvalue  <- as.numeric(sum_g$fstatistic$p.value)
    
    cat("\n---------------- Model Overall Fit Statistics ----------------\n")
    cat("Within R²:          ", round(r2_raw,4),"\n")
    cat("Adjusted Within R²: ", round(r2_adj,4),"\n")
    cat("F-statistic:        ", round(f_stat,4),"\n")
    cat("F p-value:          ", format(f_pvalue, digits = 6, scientific = F),"\n")
    cat("--------------------------------------------------------------\n\n")
    # ======================================
    
    # Extract variable names & p-values to screen statistically significant predictors (p < 0.05)
    p_vals <- robust_res[, 4]
    sig_vars <- names(p_vals[p_vals < 0.05])
    
    # Print list of significant variables
    cat(">>> Significant Variables (p < 0.05):\n")
    if (length(sig_vars) > 0) {
      for (v in sig_vars) {
        cat(" -", v, "\n")
      }
    } else {
      cat(" - No statistically significant variables found\n")
    }
    cat("================================================================================\n\n")
  }
}

## === Start Automatic Iterative VIF Filtering | Max allowed VIF = 10 ===
## Initial total candidate variables count: 52 
## 
## Drop variable: SH.DYN.MORT | Corresponding VIF = 877077.5 
## Drop variable: SP.DYN.IMRT.IN | Corresponding VIF = 489446.9 
## Drop variable: SP.DYN.LE00.IN | Corresponding VIF = 110967 
## Drop variable: SH.DTH.MORT | Corresponding VIF = 87428.39 
## Drop variable: SH.DYN.MORT.MA | Corresponding VIF = 32219.41 
## Drop variable: SH.ANM.ALLW.ZS | Corresponding VIF = 21067.12 
## Drop variable: SH.DTH.NMRT | Corresponding VIF = 13165.49 
## Drop variable: SH.XPD.PVTD.PC.CD | Corresponding VIF = 11718.81 
## Drop variable: SH.XPD.CHEX.PC.CD | Corresponding VIF = 7900.49 
## Drop variable: SH.XPD.CHEX.PP.CD | Corresponding VIF = 4616.51 
## Drop variable: SP.DYN.LE00.FE.IN | Corresponding VIF = 3209.89 
## Drop variable: SP.DYN.IMRT.FE.IN | Corresponding VIF = 2974.63 
## Drop variable: SH.XPD.PVTD.PP.CD | Corresponding VIF = 2014.13 
## Drop variable: SH.XPD.GHED.PC.CD | Corresponding VIF = 1461.61 
## Drop variable: SH.MMR.DTHS | Corresponding VIF = 1241.58 
## Drop variable: SH.DYN.MORT.FE | Corresponding VIF = 1120.21 
## Drop variable: SP.DYN.LE00.MA.IN | Corresponding VIF = 592.69 
## Drop variable: SH.MMR.RISK | Corresponding VIF = 567.38 
## Drop variable: SH.XPD.GHED.PP.CD | Corresponding VIF = 406.76 
## Drop variable: SH.XPD.GHED.GD.ZS | Corresponding VIF = 274.13 
## Drop variable: SP.DYN.TO65.FE.ZS | Corresponding VIF = 249.98 
## Drop variable: SP.DYN.CBRT.IN | Corresponding VIF = 172.06 
## Drop variable: SH.DYN.NMRT | Corresponding VIF = 119.25 
## Drop variable: SP.DYN.AMRT.FE | Corresponding VIF = 98.02 
## Drop variable: SH.PRG.ANEM | Corresponding VIF = 60.61 
## Drop variable: SH.MMR.RISK.ZS | Corresponding VIF = 53.83 
## Drop variable: SP.DYN.TO65.MA.ZS | Corresponding VIF = 50.61 
## Drop variable: SH.XPD.OOPC.PP.CD | Corresponding VIF = 47.13 
## Drop variable: SP.DYN.TFRT.IN | Corresponding VIF = 38.08 
## Drop variable: SP.DYN.IMRT.MA.IN | Corresponding VIF = 29.82 
## Drop variable: SH.STA.BASS.ZS | Corresponding VIF = 25.22 
## Drop variable: SH.STA.ODFC.ZS | Corresponding VIF = 21.54 
## Drop variable: SH.XPD.OOPC.PC.CD | Corresponding VIF = 18.23 
## Drop variable: SH.MED.NUMW.P3 | Corresponding VIF = 15.7 
## Drop variable: SH.IMM.IDPT | Corresponding VIF = 15.15 
## Drop variable: SH.STA.MMRT | Corresponding VIF = 14.17 
## Drop variable: SH.MED.PHYS.ZS | Corresponding VIF = 10.57 
## 
## === VIF Filter Finished ===
## Remaining valid variables after VIF screening count: 15
## Final reserved variable list:
##  [1] "SH.H2O.BASW.ZS"    "SH.STA.DIAB.ZS"    "SP.ADO.TFRT"      
##  [4] "SP.DYN.AMRT.MA"    "SP.DYN.CDRT.IN"    "SP.POP.BRTH.MF"   
##  [7] "SP.POP.GROW"       "SH.ANM.NPRG.ZS"    "SH.DTH.IMRT"      
## [10] "SH.IMM.MEAS"       "SH.MED.BEDS.ZS"    "SH.XPD.CHEX.GD.ZS"
## [13] "SH.XPD.GHED.CH.ZS" "SH.XPD.OOPC.CH.ZS" "SH.XPD.PVTD.CH.ZS"
## 
## ================================================================================
##                     PANEL FIXED EFFECT ESTIMATION | INCOME GROUP:  Unknown 
## ================================================================================
## 
## t test of coefficients:
## 
##                      Estimate  Std. Error  t value  Pr(>|t|)    
## SH.H2O.BASW.ZS    -1.2189e-05  8.1493e-02  -0.0001 0.9998807    
## SP.ADO.TFRT        4.3427e-02  5.2285e-02   0.8306 0.4062718    
## SP.DYN.AMRT.MA    -4.8680e-01  1.1408e-01  -4.2673 2.035e-05 ***
## SP.DYN.CDRT.IN    -8.3108e-01  6.4604e-02 -12.8642 < 2.2e-16 ***
## SP.POP.BRTH.MF    -3.2818e+00  1.2037e+00  -2.7265 0.0064355 ** 
## SP.POP.GROW       -6.8892e-01  1.0946e-01  -6.2940 3.511e-10 ***
## SH.ANM.NPRG.ZS    -4.0088e-01  1.1063e-01  -3.6235 0.0002952 ***
## SH.DTH.IMRT       -1.4629e-01  5.1599e-02  -2.8351 0.0046090 ** 
## SH.IMM.MEAS        8.3828e-02  1.1475e-02   7.3053 3.468e-13 ***
## SH.MED.BEDS.ZS    -4.6633e-02  3.6524e-02  -1.2768 0.2017636    
## SH.XPD.CHEX.GD.ZS  6.1832e-01  1.4872e-01   4.1577 3.299e-05 ***
## SH.XPD.GHED.CH.ZS  2.5298e-01  8.3398e-02   3.0333 0.0024378 ** 
## SH.XPD.OOPC.CH.ZS -2.2095e-01  1.2656e-01  -1.7458 0.0809472 .  
## SH.XPD.PVTD.CH.ZS  1.6030e-01  1.5605e-01   1.0272 0.3043857    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ---------------- Model Overall Fit Statistics ----------------
## Within R²:           0.9352 
## Adjusted Within R²:  0.9339 
## F-statistic:         3335.563 
## F p-value:           0 
## --------------------------------------------------------------
## 
## >>> Significant Variables (p < 0.05):
##  - SP.DYN.AMRT.MA 
##  - SP.DYN.CDRT.IN 
##  - SP.POP.BRTH.MF 
##  - SP.POP.GROW 
##  - SH.ANM.NPRG.ZS 
##  - SH.DTH.IMRT 
##  - SH.IMM.MEAS 
##  - SH.XPD.CHEX.GD.ZS 
##  - SH.XPD.GHED.CH.ZS 
## ================================================================================
## 
## === Start Automatic Iterative VIF Filtering | Max allowed VIF = 10 ===
## Initial total candidate variables count: 52 
## 
## Drop variable: SP.DYN.IMRT.IN | Corresponding VIF = 471615.3 
## Drop variable: SP.DYN.LE00.IN | Corresponding VIF = 16527.31 
## Drop variable: SH.ANM.ALLW.ZS | Corresponding VIF = 15639.94 
## Drop variable: SH.DYN.MORT.MA | Corresponding VIF = 4895.23 
## Drop variable: SH.DYN.MORT.FE | Corresponding VIF = 2408.29 
## Drop variable: SH.XPD.OOPC.PP.CD | Corresponding VIF = 1418.42 
## Drop variable: SP.DYN.IMRT.MA.IN | Corresponding VIF = 1003.79 
## Drop variable: SP.DYN.LE00.MA.IN | Corresponding VIF = 852.23 
## Drop variable: SH.DTH.MORT | Corresponding VIF = 505 
## Drop variable: SH.STA.MMRT | Corresponding VIF = 295.89 
## Drop variable: SH.XPD.OOPC.PC.CD | Corresponding VIF = 242.58 
## Drop variable: SP.DYN.LE00.FE.IN | Corresponding VIF = 182.47 
## Drop variable: SH.DTH.NMRT | Corresponding VIF = 148.54 
## Drop variable: SH.MMR.RISK | Corresponding VIF = 132.67 
## Drop variable: SH.XPD.CHEX.PP.CD | Corresponding VIF = 125.53 
## Drop variable: SH.DYN.MORT | Corresponding VIF = 118.45 
## Drop variable: SH.XPD.CHEX.PC.CD | Corresponding VIF = 67.79 
## Drop variable: SH.XPD.GHED.PP.CD | Corresponding VIF = 58.69 
## Drop variable: SP.DYN.TO65.FE.ZS | Corresponding VIF = 51.94 
## Drop variable: SP.DYN.IMRT.FE.IN | Corresponding VIF = 49.44 
## Drop variable: SH.XPD.OOPC.CH.ZS | Corresponding VIF = 36.94 
## Drop variable: SH.MMR.DTHS | Corresponding VIF = 35.1 
## Drop variable: SH.XPD.GHED.CH.ZS | Corresponding VIF = 30.33 
## Drop variable: SH.PRG.ANEM | Corresponding VIF = 26.23 
## Drop variable: SP.DYN.CDRT.IN | Corresponding VIF = 23.72 
## Drop variable: SP.DYN.CBRT.IN | Corresponding VIF = 19.35 
## Drop variable: SP.DYN.AMRT.FE | Corresponding VIF = 18.14 
## Drop variable: SH.XPD.GHED.PC.CD | Corresponding VIF = 17.17 
## 
## === VIF Filter Finished ===
## Remaining valid variables after VIF screening count: 24
## Final reserved variable list:
##  [1] "SH.H2O.BASW.ZS"    "SH.STA.BASS.ZS"    "SH.STA.DIAB.ZS"   
##  [4] "SH.STA.ODFC.ZS"    "SP.ADO.TFRT"       "SP.DYN.AMRT.MA"   
##  [7] "SP.DYN.TFRT.IN"    "SP.DYN.TO65.MA.ZS" "SP.POP.BRTH.MF"   
## [10] "SP.POP.GROW"       "SH.ANM.NPRG.ZS"    "SH.DTH.IMRT"      
## [13] "SH.DYN.NMRT"       "SH.IMM.IDPT"       "SH.IMM.MEAS"      
## [16] "SH.MED.BEDS.ZS"    "SH.MED.NUMW.P3"    "SH.MED.PHYS.ZS"   
## [19] "SH.MMR.RISK.ZS"    "SH.XPD.CHEX.GD.ZS" "SH.XPD.GHED.GD.ZS"
## [22] "SH.XPD.PVTD.CH.ZS" "SH.XPD.PVTD.PC.CD" "SH.XPD.PVTD.PP.CD"
## 
## ================================================================================
##                     PANEL FIXED EFFECT ESTIMATION | INCOME GROUP:  Low income 
## ================================================================================
## 
## t test of coefficients:
## 
##                     Estimate Std. Error  t value  Pr(>|t|)    
## SH.H2O.BASW.ZS      0.127881   0.041459   3.0845 0.0020752 ** 
## SH.STA.BASS.ZS      0.014112   0.042873   0.3292 0.7420826    
## SH.STA.ODFC.ZS      0.078300   0.041595   1.8824 0.0599645 .  
## SP.ADO.TFRT         0.021279   0.043958   0.4841 0.6284068    
## SP.DYN.AMRT.MA     -0.545197   0.093613  -5.8239 6.987e-09 ***
## SP.DYN.TFRT.IN     -0.883307   0.120394  -7.3368 3.527e-13 ***
## SP.DYN.TO65.MA.ZS   0.130388   0.069192   1.8844 0.0596937 .  
## SP.POP.BRTH.MF    -14.620307   3.419195  -4.2760 2.020e-05 ***
## SP.POP.GROW        -0.090830   0.026436  -3.4359 0.0006063 ***
## SH.ANM.NPRG.ZS     -0.088236   0.070426  -1.2529 0.2104332    
## SH.DTH.IMRT         0.784458   0.058773  13.3472 < 2.2e-16 ***
## SH.DYN.NMRT        -0.816344   0.050787 -16.0740 < 2.2e-16 ***
## SH.IMM.IDPT         0.028209   0.020554   1.3724 0.1701363    
## SH.IMM.MEAS        -0.034459   0.021530  -1.6005 0.1096965    
## SH.MED.BEDS.ZS     -0.064536   0.039763  -1.6230 0.1047857    
## SH.MED.NUMW.P3     -0.079389   0.063332  -1.2535 0.2102044    
## SH.MED.PHYS.ZS      0.685436   0.135825   5.0465 5.036e-07 ***
## SH.MMR.RISK.ZS     -0.078477   0.046527  -1.6867 0.0918657 .  
## SH.XPD.CHEX.GD.ZS   0.153991   0.046005   3.3473 0.0008359 ***
## SH.XPD.GHED.GD.ZS   0.145391   0.024691   5.8883 4.782e-09 ***
## SH.XPD.PVTD.CH.ZS   0.170691   0.043399   3.9331 8.760e-05 ***
## SH.XPD.PVTD.PC.CD  -0.106464   0.039872  -2.6701 0.0076617 ** 
## SH.XPD.PVTD.PP.CD  -0.067803   0.041217  -1.6450 0.1001707    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ---------------- Model Overall Fit Statistics ----------------
## Within R²:           0.9046 
## Adjusted Within R²:  0.9017 
## F-statistic:         633.3659 
## F p-value:           0 
## --------------------------------------------------------------
## 
## >>> Significant Variables (p < 0.05):
##  - SH.H2O.BASW.ZS 
##  - SP.DYN.AMRT.MA 
##  - SP.DYN.TFRT.IN 
##  - SP.POP.BRTH.MF 
##  - SP.POP.GROW 
##  - SH.DTH.IMRT 
##  - SH.DYN.NMRT 
##  - SH.MED.PHYS.ZS 
##  - SH.XPD.CHEX.GD.ZS 
##  - SH.XPD.GHED.GD.ZS 
##  - SH.XPD.PVTD.CH.ZS 
##  - SH.XPD.PVTD.PC.CD 
## ================================================================================
## 
## === Start Automatic Iterative VIF Filtering | Max allowed VIF = 10 ===
## Initial total candidate variables count: 52 
## 
## Drop variable: SP.DYN.IMRT.IN | Corresponding VIF = 336539.4 
## Drop variable: SH.DYN.MORT | Corresponding VIF = 12234.89 
## Drop variable: SH.ANM.ALLW.ZS | Corresponding VIF = 4060.79 
## Drop variable: SH.DTH.IMRT | Corresponding VIF = 3018.62 
## Drop variable: SP.DYN.LE00.IN | Corresponding VIF = 2837.21 
## Drop variable: SH.DYN.MORT.MA | Corresponding VIF = 2431.71 
## Drop variable: SH.XPD.PVTD.PP.CD | Corresponding VIF = 1030.3 
## Drop variable: SH.DTH.NMRT | Corresponding VIF = 715.29 
## Drop variable: SH.XPD.CHEX.PC.CD | Corresponding VIF = 520.59 
## Drop variable: SP.DYN.IMRT.FE.IN | Corresponding VIF = 451.1 
## Drop variable: SH.XPD.PVTD.PC.CD | Corresponding VIF = 357.28 
## Drop variable: SP.DYN.LE00.FE.IN | Corresponding VIF = 294.9 
## Drop variable: SH.XPD.GHED.PP.CD | Corresponding VIF = 250.01 
## Drop variable: SH.MMR.DTHS | Corresponding VIF = 244.08 
## Drop variable: SH.XPD.OOPC.PP.CD | Corresponding VIF = 219.3 
## Drop variable: SH.MMR.RISK | Corresponding VIF = 201.98 
## Drop variable: SH.DYN.MORT.FE | Corresponding VIF = 79.85 
## Drop variable: SH.XPD.GHED.PC.CD | Corresponding VIF = 62.85 
## Drop variable: SP.DYN.LE00.MA.IN | Corresponding VIF = 58.92 
## Drop variable: SP.DYN.CBRT.IN | Corresponding VIF = 45.03 
## Drop variable: SH.XPD.GHED.GD.ZS | Corresponding VIF = 37.02 
## Drop variable: SP.DYN.AMRT.FE | Corresponding VIF = 32.04 
## Drop variable: SH.XPD.OOPC.CH.ZS | Corresponding VIF = 27.43 
## Drop variable: SH.STA.MMRT | Corresponding VIF = 26.32 
## Drop variable: SP.DYN.IMRT.MA.IN | Corresponding VIF = 21.95 
## Drop variable: SP.DYN.TO65.MA.ZS | Corresponding VIF = 17.11 
## Drop variable: SH.PRG.ANEM | Corresponding VIF = 13.33 
## Drop variable: SP.DYN.CDRT.IN | Corresponding VIF = 12.09 
## Drop variable: SH.XPD.CHEX.PP.CD | Corresponding VIF = 10.17 
## 
## === VIF Filter Finished ===
## Remaining valid variables after VIF screening count: 23
## Final reserved variable list:
##  [1] "SH.H2O.BASW.ZS"    "SH.STA.BASS.ZS"    "SH.STA.DIAB.ZS"   
##  [4] "SH.STA.ODFC.ZS"    "SP.ADO.TFRT"       "SP.DYN.AMRT.MA"   
##  [7] "SP.DYN.TFRT.IN"    "SP.DYN.TO65.FE.ZS" "SP.POP.BRTH.MF"   
## [10] "SP.POP.GROW"       "SH.ANM.NPRG.ZS"    "SH.DTH.MORT"      
## [13] "SH.DYN.NMRT"       "SH.IMM.IDPT"       "SH.IMM.MEAS"      
## [16] "SH.MED.BEDS.ZS"    "SH.MED.NUMW.P3"    "SH.MED.PHYS.ZS"   
## [19] "SH.MMR.RISK.ZS"    "SH.XPD.CHEX.GD.ZS" "SH.XPD.GHED.CH.ZS"
## [22] "SH.XPD.OOPC.PC.CD" "SH.XPD.PVTD.CH.ZS"
## 
## ================================================================================
##                     PANEL FIXED EFFECT ESTIMATION | INCOME GROUP:  Lower middle income 
## ================================================================================
## 
## t test of coefficients:
## 
##                    Estimate Std. Error  t value  Pr(>|t|)    
## SH.H2O.BASW.ZS     0.337346   0.057439   5.8731 4.720e-09 ***
## SH.STA.BASS.ZS    -0.058973   0.019780  -2.9814 0.0028910 ** 
## SH.STA.DIAB.ZS    -0.176199   0.019984  -8.8169 < 2.2e-16 ***
## SH.STA.ODFC.ZS    -0.059796   0.014314  -4.1775 3.027e-05 ***
## SP.ADO.TFRT       -0.210672   0.022203  -9.4885 < 2.2e-16 ***
## SP.DYN.AMRT.MA    -0.380654   0.052618  -7.2344 5.842e-13 ***
## SP.DYN.TFRT.IN    -0.955054   0.079112 -12.0721 < 2.2e-16 ***
## SP.DYN.TO65.FE.ZS  0.289953   0.102788   2.8209 0.0048189 ** 
## SP.POP.BRTH.MF    -1.343541   1.224090  -1.0976 0.2724700    
## SP.POP.GROW       -0.168502   0.034341  -4.9067 9.730e-07 ***
## SH.ANM.NPRG.ZS    -0.239910   0.059724  -4.0170 6.032e-05 ***
## SH.DTH.MORT        0.623783   0.035354  17.6438 < 2.2e-16 ***
## SH.DYN.NMRT       -0.571360   0.016049 -35.5999 < 2.2e-16 ***
## SH.IMM.IDPT       -0.067995   0.017463  -3.8937 0.0001008 ***
## SH.IMM.MEAS        0.090786   0.015323   5.9248 3.463e-09 ***
## SH.MED.BEDS.ZS    -0.136038   0.019676  -6.9138 5.683e-12 ***
## SH.MED.NUMW.P3    -0.190905   0.026996  -7.0716 1.878e-12 ***
## SH.MED.PHYS.ZS     0.135452   0.052423   2.5838 0.0098162 ** 
## SH.MMR.RISK.ZS    -0.087240   0.041531  -2.1006 0.0357559 *  
## SH.XPD.CHEX.GD.ZS  0.139396   0.040931   3.4057 0.0006683 ***
## SH.XPD.GHED.CH.ZS  0.024098   0.025161   0.9578 0.3382564    
## SH.XPD.OOPC.PC.CD  0.058931   0.020152   2.9243 0.0034764 ** 
## SH.XPD.PVTD.CH.ZS -0.035027   0.025152  -1.3926 0.1638290    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ---------------- Model Overall Fit Statistics ----------------
## Within R²:           0.9095 
## Adjusted Within R²:  0.9075 
## F-statistic:         1381.494 
## F p-value:           0 
## --------------------------------------------------------------
## 
## >>> Significant Variables (p < 0.05):
##  - SH.H2O.BASW.ZS 
##  - SH.STA.BASS.ZS 
##  - SH.STA.DIAB.ZS 
##  - SH.STA.ODFC.ZS 
##  - SP.ADO.TFRT 
##  - SP.DYN.AMRT.MA 
##  - SP.DYN.TFRT.IN 
##  - SP.DYN.TO65.FE.ZS 
##  - SP.POP.GROW 
##  - SH.ANM.NPRG.ZS 
##  - SH.DTH.MORT 
##  - SH.DYN.NMRT 
##  - SH.IMM.IDPT 
##  - SH.IMM.MEAS 
##  - SH.MED.BEDS.ZS 
##  - SH.MED.NUMW.P3 
##  - SH.MED.PHYS.ZS 
##  - SH.MMR.RISK.ZS 
##  - SH.XPD.CHEX.GD.ZS 
##  - SH.XPD.OOPC.PC.CD 
## ================================================================================
## 
## === Start Automatic Iterative VIF Filtering | Max allowed VIF = 10 ===
## Initial total candidate variables count: 52 
## 
## Drop variable: SP.DYN.IMRT.IN | Corresponding VIF = 137953.8 
## Drop variable: SH.DYN.MORT | Corresponding VIF = 11288.9 
## Drop variable: SH.DYN.MORT.MA | Corresponding VIF = 4920.26 
## Drop variable: SH.ANM.ALLW.ZS | Corresponding VIF = 3196.48 
## Drop variable: SP.DYN.LE00.IN | Corresponding VIF = 2212.42 
## Drop variable: SH.DTH.IMRT | Corresponding VIF = 2136.16 
## Drop variable: SH.XPD.PVTD.PC.CD | Corresponding VIF = 1331.9 
## Drop variable: SH.DTH.NMRT | Corresponding VIF = 858.24 
## Drop variable: SH.XPD.CHEX.PC.CD | Corresponding VIF = 856.65 
## Drop variable: SH.XPD.OOPC.PP.CD | Corresponding VIF = 595.64 
## Drop variable: SP.DYN.IMRT.FE.IN | Corresponding VIF = 532.44 
## Drop variable: SP.DYN.LE00.FE.IN | Corresponding VIF = 290.19 
## Drop variable: SP.DYN.IMRT.MA.IN | Corresponding VIF = 179.11 
## Drop variable: SH.DTH.MORT | Corresponding VIF = 174.35 
## Drop variable: SH.XPD.GHED.PP.CD | Corresponding VIF = 155.04 
## Drop variable: SH.MMR.RISK | Corresponding VIF = 154.08 
## Drop variable: SH.XPD.PVTD.PP.CD | Corresponding VIF = 119.14 
## Drop variable: SH.XPD.GHED.GD.ZS | Corresponding VIF = 104.08 
## Drop variable: SP.DYN.LE00.MA.IN | Corresponding VIF = 89.36 
## Drop variable: SH.XPD.GHED.PC.CD | Corresponding VIF = 55.36 
## Drop variable: SP.DYN.TO65.FE.ZS | Corresponding VIF = 43.87 
## Drop variable: SP.DYN.CBRT.IN | Corresponding VIF = 31.77 
## Drop variable: SH.DYN.MORT.FE | Corresponding VIF = 23.88 
## Drop variable: SH.XPD.OOPC.CH.ZS | Corresponding VIF = 18.5 
## Drop variable: SP.DYN.AMRT.FE | Corresponding VIF = 12.76 
## Drop variable: SP.DYN.TO65.MA.ZS | Corresponding VIF = 12.14 
## Drop variable: SH.XPD.OOPC.PC.CD | Corresponding VIF = 10.21 
## 
## === VIF Filter Finished ===
## Remaining valid variables after VIF screening count: 25
## Final reserved variable list:
##  [1] "SH.H2O.BASW.ZS"    "SH.STA.BASS.ZS"    "SH.STA.DIAB.ZS"   
##  [4] "SH.STA.ODFC.ZS"    "SP.ADO.TFRT"       "SP.DYN.AMRT.MA"   
##  [7] "SP.DYN.CDRT.IN"    "SP.DYN.TFRT.IN"    "SP.POP.BRTH.MF"   
## [10] "SP.POP.GROW"       "SH.ANM.NPRG.ZS"    "SH.DYN.NMRT"      
## [13] "SH.IMM.IDPT"       "SH.IMM.MEAS"       "SH.MED.BEDS.ZS"   
## [16] "SH.MED.NUMW.P3"    "SH.MED.PHYS.ZS"    "SH.MMR.DTHS"      
## [19] "SH.MMR.RISK.ZS"    "SH.PRG.ANEM"       "SH.STA.MMRT"      
## [22] "SH.XPD.CHEX.GD.ZS" "SH.XPD.CHEX.PP.CD" "SH.XPD.GHED.CH.ZS"
## [25] "SH.XPD.PVTD.CH.ZS"
## 
## ================================================================================
##                     PANEL FIXED EFFECT ESTIMATION | INCOME GROUP:  Upper middle income 
## ================================================================================
## 
## t test of coefficients:
## 
##                     Estimate Std. Error  t value  Pr(>|t|)    
## SH.H2O.BASW.ZS     1.3972189  0.2260368   6.1814 7.103e-10 ***
## SH.STA.BASS.ZS    -0.6182424  0.0432618 -14.2907 < 2.2e-16 ***
## SH.STA.DIAB.ZS     0.0757509  0.1373534   0.5515 0.5813249    
## SH.STA.ODFC.ZS    -0.0828371  0.0126550  -6.5458 6.801e-11 ***
## SP.ADO.TFRT        0.1150513  0.0160114   7.1856 8.176e-13 ***
## SP.DYN.AMRT.MA    -0.3971606  0.0219028 -18.1329 < 2.2e-16 ***
## SP.DYN.CDRT.IN     0.1988247  0.0481133   4.1324 3.676e-05 ***
## SP.DYN.TFRT.IN    -0.9358709  0.0484551 -19.3142 < 2.2e-16 ***
## SP.POP.BRTH.MF     0.6155544  1.1971182   0.5142 0.6071476    
## SP.POP.GROW       -0.0777601  0.0210119  -3.7008 0.0002183 ***
## SH.ANM.NPRG.ZS     0.1312647  0.0789738   1.6621 0.0965783 .  
## SH.DYN.NMRT       -0.2679988  0.0248669 -10.7773 < 2.2e-16 ***
## SH.IMM.IDPT        0.0817582  0.0132924   6.1508 8.601e-10 ***
## SH.IMM.MEAS        0.0049988  0.0069302   0.7213 0.4707712    
## SH.MED.BEDS.ZS    -0.1009014  0.0188240  -5.3603 8.861e-08 ***
## SH.MED.NUMW.P3     0.0368698  0.0364782   1.0107 0.3122140    
## SH.MED.PHYS.ZS     0.1630718  0.0227870   7.1563 1.009e-12 ***
## SH.MMR.DTHS        0.1896754  0.0240346   7.8918 3.978e-15 ***
## SH.MMR.RISK.ZS     0.0926610  0.0485388   1.9090 0.0563447 .  
## SH.PRG.ANEM       -0.4893667  0.1964390  -2.4912 0.0127787 *  
## SH.STA.MMRT       -0.1932110  0.0219105  -8.8182 < 2.2e-16 ***
## SH.XPD.CHEX.GD.ZS  0.2217210  0.0524467   4.2276 2.424e-05 ***
## SH.XPD.CHEX.PP.CD -0.0169680  0.0352137  -0.4819 0.6299372    
## SH.XPD.GHED.CH.ZS  0.1247043  0.0349171   3.5714 0.0003599 ***
## SH.XPD.PVTD.CH.ZS  0.1010003  0.0459217   2.1994 0.0279160 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ---------------- Model Overall Fit Statistics ----------------
## Within R²:           0.9085 
## Adjusted Within R²:  0.9064 
## F-statistic:         1358.39 
## F p-value:           0 
## --------------------------------------------------------------
## 
## >>> Significant Variables (p < 0.05):
##  - SH.H2O.BASW.ZS 
##  - SH.STA.BASS.ZS 
##  - SH.STA.ODFC.ZS 
##  - SP.ADO.TFRT 
##  - SP.DYN.AMRT.MA 
##  - SP.DYN.CDRT.IN 
##  - SP.DYN.TFRT.IN 
##  - SP.POP.GROW 
##  - SH.DYN.NMRT 
##  - SH.IMM.IDPT 
##  - SH.MED.BEDS.ZS 
##  - SH.MED.PHYS.ZS 
##  - SH.MMR.DTHS 
##  - SH.PRG.ANEM 
##  - SH.STA.MMRT 
##  - SH.XPD.CHEX.GD.ZS 
##  - SH.XPD.GHED.CH.ZS 
##  - SH.XPD.PVTD.CH.ZS 
## ================================================================================
## 
## === Start Automatic Iterative VIF Filtering | Max allowed VIF = 10 ===
## Initial total candidate variables count: 52 
## 
## Drop variable: SH.DYN.MORT | Corresponding VIF = 46541.37 
## Drop variable: SP.DYN.IMRT.IN | Corresponding VIF = 29889.05 
## Drop variable: SH.DYN.MORT.FE | Corresponding VIF = 5103.64 
## Drop variable: SH.ANM.ALLW.ZS | Corresponding VIF = 5028.61 
## Drop variable: SP.DYN.LE00.IN | Corresponding VIF = 2474.57 
## Drop variable: SH.DTH.IMRT | Corresponding VIF = 2188.32 
## Drop variable: SP.DYN.IMRT.MA.IN | Corresponding VIF = 1366.74 
## Drop variable: SH.XPD.CHEX.PC.CD | Corresponding VIF = 1007.11 
## Drop variable: SH.XPD.PVTD.PC.CD | Corresponding VIF = 848.25 
## Drop variable: SH.XPD.CHEX.PP.CD | Corresponding VIF = 742.77 
## Drop variable: SH.DTH.NMRT | Corresponding VIF = 495.44 
## Drop variable: SH.XPD.GHED.PC.CD | Corresponding VIF = 276.1 
## Drop variable: SH.XPD.PVTD.PP.CD | Corresponding VIF = 224.95 
## Drop variable: SP.DYN.IMRT.FE.IN | Corresponding VIF = 218.62 
## Drop variable: SP.DYN.LE00.MA.IN | Corresponding VIF = 202.44 
## Drop variable: SH.XPD.GHED.GD.ZS | Corresponding VIF = 141.68 
## Drop variable: SP.DYN.TO65.FE.ZS | Corresponding VIF = 100.03 
## Drop variable: SH.MMR.RISK | Corresponding VIF = 92.85 
## Drop variable: SH.XPD.GHED.PP.CD | Corresponding VIF = 80.72 
## Drop variable: SP.DYN.AMRT.MA | Corresponding VIF = 35.02 
## Drop variable: SH.DTH.MORT | Corresponding VIF = 32.11 
## Drop variable: SH.DYN.MORT.MA | Corresponding VIF = 30.63 
## Drop variable: SP.DYN.CBRT.IN | Corresponding VIF = 23.4 
## Drop variable: SH.XPD.OOPC.PC.CD | Corresponding VIF = 20.59 
## Drop variable: SP.DYN.LE00.FE.IN | Corresponding VIF = 18.57 
## Drop variable: SH.XPD.PVTD.CH.ZS | Corresponding VIF = 13.3 
## Drop variable: SP.DYN.AMRT.FE | Corresponding VIF = 11.66 
## 
## === VIF Filter Finished ===
## Remaining valid variables after VIF screening count: 25
## Final reserved variable list:
##  [1] "SH.H2O.BASW.ZS"    "SH.STA.BASS.ZS"    "SH.STA.DIAB.ZS"   
##  [4] "SH.STA.ODFC.ZS"    "SP.ADO.TFRT"       "SP.DYN.CDRT.IN"   
##  [7] "SP.DYN.TFRT.IN"    "SP.DYN.TO65.MA.ZS" "SP.POP.BRTH.MF"   
## [10] "SP.POP.GROW"       "SH.ANM.NPRG.ZS"    "SH.DYN.NMRT"      
## [13] "SH.IMM.IDPT"       "SH.IMM.MEAS"       "SH.MED.BEDS.ZS"   
## [16] "SH.MED.NUMW.P3"    "SH.MED.PHYS.ZS"    "SH.MMR.DTHS"      
## [19] "SH.MMR.RISK.ZS"    "SH.PRG.ANEM"       "SH.STA.MMRT"      
## [22] "SH.XPD.CHEX.GD.ZS" "SH.XPD.GHED.CH.ZS" "SH.XPD.OOPC.CH.ZS"
## [25] "SH.XPD.OOPC.PP.CD"
## 
## ================================================================================
##                     PANEL FIXED EFFECT ESTIMATION | INCOME GROUP:  High income 
## ================================================================================
## 
## t test of coefficients:
## 
##                     Estimate Std. Error  t value  Pr(>|t|)    
## SH.H2O.BASW.ZS    -0.0028300  0.5306583  -0.0053 0.9957451    
## SH.STA.BASS.ZS     0.4672449  0.1391056   3.3589 0.0007897 ***
## SH.STA.DIAB.ZS     0.1030226  0.0158413   6.5034 8.802e-11 ***
## SH.STA.ODFC.ZS    -0.0052833  0.0459894  -0.1149 0.9085459    
## SP.ADO.TFRT        0.0371322  0.0160180   2.3182 0.0204901 *  
## SP.DYN.CDRT.IN     0.3394422  0.0622622   5.4518 5.281e-08 ***
## SP.DYN.TFRT.IN    -1.2175133  0.0619315 -19.6590 < 2.2e-16 ***
## SP.DYN.TO65.MA.ZS  1.3296742  0.1029956  12.9100 < 2.2e-16 ***
## SP.POP.BRTH.MF    -2.5356565  1.7062873  -1.4861 0.1373390    
## SP.POP.GROW       -0.0872279  0.0295667  -2.9502 0.0031937 ** 
## SH.ANM.NPRG.ZS    -0.2699862  0.0773872  -3.4888 0.0004904 ***
## SH.DYN.NMRT       -0.0564130  0.0152841  -3.6910 0.0002263 ***
## SH.IMM.IDPT        0.1054859  0.0204017   5.1704 2.448e-07 ***
## SH.IMM.MEAS       -0.0829834  0.0228788  -3.6271 0.0002901 ***
## SH.MED.BEDS.ZS    -0.1103885  0.0188522  -5.8555 5.132e-09 ***
## SH.MED.NUMW.P3     0.1359353  0.0244752   5.5540 2.970e-08 ***
## SH.MED.PHYS.ZS     0.0892435  0.0324292   2.7519 0.0059505 ** 
## SH.MMR.DTHS        0.2353018  0.0205865  11.4299 < 2.2e-16 ***
## SH.MMR.RISK.ZS     1.4957873  0.2013721   7.4280 1.336e-13 ***
## SH.PRG.ANEM       -0.3651493  0.1408715  -2.5921 0.0095740 ** 
## SH.STA.MMRT       -0.3614993  0.0273140 -13.2350 < 2.2e-16 ***
## SH.XPD.CHEX.GD.ZS  0.4980608  0.0851542   5.8489 5.336e-09 ***
## SH.XPD.GHED.CH.ZS -0.3157452  0.1712749  -1.8435 0.0653288 .  
## SH.XPD.OOPC.CH.ZS -0.3433940  0.0648502  -5.2952 1.252e-07 ***
## SH.XPD.OOPC.PP.CD -0.0800982  0.0256584  -3.1217 0.0018107 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ---------------- Model Overall Fit Statistics ----------------
## Within R²:           0.8612 
## Adjusted Within R²:  0.8582 
## F-statistic:         1010.059 
## F p-value:           0 
## --------------------------------------------------------------
## 
## >>> Significant Variables (p < 0.05):
##  - SH.STA.BASS.ZS 
##  - SH.STA.DIAB.ZS 
##  - SP.ADO.TFRT 
##  - SP.DYN.CDRT.IN 
##  - SP.DYN.TFRT.IN 
##  - SP.DYN.TO65.MA.ZS 
##  - SP.POP.GROW 
##  - SH.ANM.NPRG.ZS 
##  - SH.DYN.NMRT 
##  - SH.IMM.IDPT 
##  - SH.IMM.MEAS 
##  - SH.MED.BEDS.ZS 
##  - SH.MED.NUMW.P3 
##  - SH.MED.PHYS.ZS 
##  - SH.MMR.DTHS 
##  - SH.MMR.RISK.ZS 
##  - SH.PRG.ANEM 
##  - SH.STA.MMRT 
##  - SH.XPD.CHEX.GD.ZS 
##  - SH.XPD.OOPC.CH.ZS 
##  - SH.XPD.OOPC.PP.CD 
## ================================================================================

# ==============================================================================
# Step 3: Select optimal model by total absolute coefficient sum
# Adjusted R² & F-stat unavailable for all income-group specifications,
# choose model with largest sum of absolute coefficients as alternative screening rule
# ==============================================================================
if (length(income_fe_list) == 0) {
  cat("Warning: No valid income-group fixed-effect model is available.\n")
} else {
  # Calculate sum of absolute regression coefficients for each group
  abs_coef_sum <- sapply(income_fe_list, function(m) {
    tryCatch({
      sum(abs(coef(m)))
    }, error = function(e) NA)
  })
  
  abs_clean <- na.omit(abs_coef_sum)
  
  if (length(abs_clean) == 0) {
    cat("Warning: Fail to extract coefficient from all income-group models.\n")
  } else {
    best_group_name <- names(which.max(abs_clean))
    if (best_group_name %in% names(income_fe_list)) {
      best_reg_model <- income_fe_list[[best_group_name]]
      best_sum_val <- round(max(abs_clean),4)
      
      # Output result info
      cat("\n==== Optimal Fixed Effect by Income Group (Max Absolute Coefficient Sum) ====\n")
      cat("Best Performing Income Group: ", best_group_name, "\n")
      cat("Sum of Absolute Coefficients: ", best_sum_val, "\n\n")
      
      # Outpu Best Model R²/F
      sum_best <- summary(best_reg_model)
      r2b <- round(as.numeric(sum_best$r.squared["rsq"]),4)
      adjr2b <- round(as.numeric(sum_best$r.squared["adjrsq"]),4)
      fb <- round(as.numeric(sum_best$fstatistic$statistic),4)
      fpb <- format(as.numeric(sum_best$fstatistic$p.value),digits=6,scientific=F)
      cat("Optimal Model Fit Index:\n")
      cat("Within R²: ",r2b," | Adj Within R²: ",adjr2b," | F: ",fb," | F-p: ",fpb,"\n\n")
      
      # Robust standard error output
      print(coeftest(best_reg_model, vcov = vcovSCC(best_reg_model)))
    } else {
      cat("Warning: Selected group name not found in model list.\n")
    }
  }
}

## 
## ==== Optimal Fixed Effect by Income Group (Max Absolute Coefficient Sum) ====
## Best Performing Income Group:  Low income 
## Sum of Absolute Coefficients:  19.8155 
## 
## Optimal Model Fit Index:
## Within R²:  0.9046  | Adj Within R²:  0.9017  | F:  633.3659  | F-p:  0 
## 
## 
## t test of coefficients:
## 
##                     Estimate Std. Error  t value  Pr(>|t|)    
## SH.H2O.BASW.ZS      0.127881   0.041459   3.0845 0.0020752 ** 
## SH.STA.BASS.ZS      0.014112   0.042873   0.3292 0.7420826    
## SH.STA.ODFC.ZS      0.078300   0.041595   1.8824 0.0599645 .  
## SP.ADO.TFRT         0.021279   0.043958   0.4841 0.6284068    
## SP.DYN.AMRT.MA     -0.545197   0.093613  -5.8239 6.987e-09 ***
## SP.DYN.TFRT.IN     -0.883307   0.120394  -7.3368 3.527e-13 ***
## SP.DYN.TO65.MA.ZS   0.130388   0.069192   1.8844 0.0596937 .  
## SP.POP.BRTH.MF    -14.620307   3.419195  -4.2760 2.020e-05 ***
## SP.POP.GROW        -0.090830   0.026436  -3.4359 0.0006063 ***
## SH.ANM.NPRG.ZS     -0.088236   0.070426  -1.2529 0.2104332    
## SH.DTH.IMRT         0.784458   0.058773  13.3472 < 2.2e-16 ***
## SH.DYN.NMRT        -0.816344   0.050787 -16.0740 < 2.2e-16 ***
## SH.IMM.IDPT         0.028209   0.020554   1.3724 0.1701363    
## SH.IMM.MEAS        -0.034459   0.021530  -1.6005 0.1096965    
## SH.MED.BEDS.ZS     -0.064536   0.039763  -1.6230 0.1047857    
## SH.MED.NUMW.P3     -0.079389   0.063332  -1.2535 0.2102044    
## SH.MED.PHYS.ZS      0.685436   0.135825   5.0465 5.036e-07 ***
## SH.MMR.RISK.ZS     -0.078477   0.046527  -1.6867 0.0918657 .  
## SH.XPD.CHEX.GD.ZS   0.153991   0.046005   3.3473 0.0008359 ***
## SH.XPD.GHED.GD.ZS   0.145391   0.024691   5.8883 4.782e-09 ***
## SH.XPD.PVTD.CH.ZS   0.170691   0.043399   3.9331 8.760e-05 ***
## SH.XPD.PVTD.PC.CD  -0.106464   0.039872  -2.6701 0.0076617 ** 
## SH.XPD.PVTD.PP.CD  -0.067803   0.041217  -1.6450 0.1001707    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5.1.4 Region Analysis

Multicollinearity elimination: All subgroups adopt VIF=10 as cutoff. Starting from 52 original predictors, numerous mortality, life expectancy and health expenditure variables with extreme VIF values are dropped. The number of retained valid indicators varies across eight geographic regions.
Evident regional heterogeneity: Significant predictors, coefficient magnitudes and sign directions differ substantially across regions, confirming divergent aging drivers globally.
Best-performing region: North America delivers the largest total absolute coefficient value (83.852), indicating the strongest overall explanatory power among all regions.
Key significant factors for North America: Six variables are statistically significant at 5% level. Physician density, immunization coverage and anaemia prevalence positively raise aging ratio, while sanitation access and total fertility suppress population ageing.
General rule: After screening, healthcare resource, demographic and public health indicators serve as core valid determinants, whereas highly collinear aggregate health metrics lose explanatory power.

# ==============================================================================
# Step 1: Prepare valid region categories and initialize model storage list
# Extract non-NA unique region labels to iterate heterogeneous subgroup regression
# ==============================================================================
region_list <- unique(na.omit(df_wide_final$region))

# Empty list to save successfully estimated fixed-effect models by region
region_fe_list <- list()

# ==============================================================================
# Step 2: Iterate each region to run customized within-demeaning fixed-effect regression
# run_group_fe returns NULL automatically when sample size or variable conditions fail
# Print full regression output & significant variables for every valid subgroup after estimation
# ==============================================================================
for (gr in region_list) {
  mod_tmp <- run_group_fe(
    group_name = gr,
    group_col  = "region",
    df_in      = df_wide_final,
    y_in       = y_raw,
    X_in       = X_imputed
  )
  # Keep only valid estimated model, drop NULL invalid outputs
  if (!is.null(mod_tmp)) {
    region_fe_list[[gr]] <- mod_tmp
    
    # -------------------- Print full estimation results for current region 
    cat("\n================================================================================\n")
    cat("                    PANEL FIXED EFFECT ESTIMATION | REGION: ", gr, "\n")
    cat("================================================================================\n")
    
    # Calculate Driscoll-Kraay robust coefficient test table
    robust_res <- coeftest(mod_tmp, vcov = vcovSCC(mod_tmp))
    # Output Parameter Estimates table
    print(robust_res)
    
    # ==========Add Within R², Adj R², F stat & F p-value==========
    sum_g <- summary(mod_tmp)
    r2_raw   <- as.numeric(sum_g$r.squared["rsq"])
    r2_adj   <- as.numeric(sum_g$r.squared["adjrsq"])
    f_stat   <- as.numeric(sum_g$fstatistic$statistic)
    f_pvalue <- as.numeric(sum_g$fstatistic$p.value)
    
    cat("\n---------------- Model Overall Fit Statistics ----------------\n")
    cat("Within R²:          ", round(r2_raw,4),"\n")
    cat("Adjusted Within R²: ", round(r2_adj,4),"\n")
    cat("F-statistic:        ", round(f_stat,4),"\n")
    cat("F p-value:          ", format(f_pvalue, digits = 6, scientific = FALSE),"\n")
    cat("--------------------------------------------------------------\n\n")
    # ==============================================================
    
    # Extract variable names & p-values to screen statistically significant predictors (p < 0.05)
    p_vals <- robust_res[, 4]
    sig_vars <- names(p_vals[p_vals < 0.05])
    
    # Print list of significant variables
    cat(">>> Significant Variables (p < 0.05):\n")
    if (length(sig_vars) > 0) {
      for (v in sig_vars) {
        cat(" -", v, "\n")
      }
    } else {
      cat(" - No statistically significant variables found\n")
    }
    cat("================================================================================\n\n")
  }
}

## === Start Automatic Iterative VIF Filtering | Max allowed VIF = 10 ===
## Initial total candidate variables count: 52 
## 
## Drop variable: SH.DYN.MORT | Corresponding VIF = 888316.2 
## Drop variable: SP.DYN.IMRT.IN | Corresponding VIF = 498360 
## Drop variable: SP.DYN.LE00.IN | Corresponding VIF = 203038.9 
## Drop variable: SH.DTH.IMRT | Corresponding VIF = 133233.8 
## Drop variable: SH.DYN.MORT.FE | Corresponding VIF = 47216.68 
## Drop variable: SH.DTH.MORT | Corresponding VIF = 40802.84 
## Drop variable: SH.ANM.ALLW.ZS | Corresponding VIF = 23364.16 
## Drop variable: SH.XPD.CHEX.PC.CD | Corresponding VIF = 17057.3 
## Drop variable: SH.XPD.PVTD.PC.CD | Corresponding VIF = 10990.16 
## Drop variable: SH.XPD.CHEX.PP.CD | Corresponding VIF = 5496.98 
## Drop variable: SP.DYN.IMRT.MA.IN | Corresponding VIF = 5378.17 
## Drop variable: SP.DYN.LE00.FE.IN | Corresponding VIF = 4122.19 
## Drop variable: SH.XPD.PVTD.PP.CD | Corresponding VIF = 2349.82 
## Drop variable: SH.XPD.GHED.PC.CD | Corresponding VIF = 1796.95 
## Drop variable: SH.MMR.DTHS | Corresponding VIF = 1638.8 
## Drop variable: SP.DYN.IMRT.FE.IN | Corresponding VIF = 1087.39 
## Drop variable: SH.MMR.RISK | Corresponding VIF = 689.46 
## Drop variable: SP.DYN.TO65.MA.ZS | Corresponding VIF = 613.09 
## Drop variable: SH.XPD.GHED.PP.CD | Corresponding VIF = 411.78 
## Drop variable: SH.XPD.GHED.GD.ZS | Corresponding VIF = 315.3 
## Drop variable: SP.DYN.LE00.MA.IN | Corresponding VIF = 295.93 
## Drop variable: SH.DYN.MORT.MA | Corresponding VIF = 282.53 
## Drop variable: SP.DYN.CBRT.IN | Corresponding VIF = 174.71 
## Drop variable: SP.DYN.AMRT.FE | Corresponding VIF = 163.73 
## Drop variable: SH.XPD.OOPC.PP.CD | Corresponding VIF = 117.52 
## Drop variable: SH.PRG.ANEM | Corresponding VIF = 67.09 
## Drop variable: SH.MMR.RISK.ZS | Corresponding VIF = 63.46 
## Drop variable: SP.DYN.TFRT.IN | Corresponding VIF = 44.88 
## Drop variable: SP.DYN.TO65.FE.ZS | Corresponding VIF = 39.14 
## Drop variable: SH.DYN.NMRT | Corresponding VIF = 32.08 
## Drop variable: SH.STA.BASS.ZS | Corresponding VIF = 26.84 
## Drop variable: SH.STA.ODFC.ZS | Corresponding VIF = 22.43 
## Drop variable: SH.MED.PHYS.ZS | Corresponding VIF = 21.65 
## Drop variable: SH.XPD.OOPC.PC.CD | Corresponding VIF = 16.47 
## Drop variable: SH.IMM.IDPT | Corresponding VIF = 15.94 
## Drop variable: SH.MED.NUMW.P3 | Corresponding VIF = 14.83 
## Drop variable: SH.XPD.OOPC.CH.ZS | Corresponding VIF = 14.39 
## Drop variable: SH.STA.MMRT | Corresponding VIF = 13.08 
## Drop variable: SP.ADO.TFRT | Corresponding VIF = 10.45 
## 
## === VIF Filter Finished ===
## Remaining valid variables after VIF screening count: 13
## Final reserved variable list:
##  [1] "SH.H2O.BASW.ZS"    "SH.STA.DIAB.ZS"    "SP.DYN.AMRT.MA"   
##  [4] "SP.DYN.CDRT.IN"    "SP.POP.BRTH.MF"    "SP.POP.GROW"      
##  [7] "SH.ANM.NPRG.ZS"    "SH.DTH.NMRT"       "SH.IMM.MEAS"      
## [10] "SH.MED.BEDS.ZS"    "SH.XPD.CHEX.GD.ZS" "SH.XPD.GHED.CH.ZS"
## [13] "SH.XPD.PVTD.CH.ZS"
## 
## ================================================================================
##                     PANEL FIXED EFFECT ESTIMATION | REGION:  Unknown 
## ================================================================================
## 
## t test of coefficients:
## 
##                    Estimate Std. Error  t value  Pr(>|t|)    
## SH.H2O.BASW.ZS     0.358195   0.100985   3.5470 0.0003954 ***
## SP.DYN.AMRT.MA    -0.490224   0.098233  -4.9904 6.356e-07 ***
## SP.DYN.CDRT.IN    -0.808411   0.055156 -14.6568 < 2.2e-16 ***
## SP.POP.BRTH.MF    -3.967749   1.221286  -3.2488 0.0011712 ** 
## SP.POP.GROW       -0.518915   0.098191  -5.2847 1.346e-07 ***
## SH.ANM.NPRG.ZS    -0.233921   0.077711  -3.0101 0.0026322 ** 
## SH.DTH.NMRT       -0.182832   0.035283  -5.1819 2.336e-07 ***
## SH.IMM.MEAS        0.085412   0.012005   7.1146 1.387e-12 ***
## SH.MED.BEDS.ZS    -0.131725   0.037183  -3.5426 0.0004020 ***
## SH.XPD.CHEX.GD.ZS  0.560163   0.145469   3.8507 0.0001201 ***
## SH.XPD.GHED.CH.ZS  0.311879   0.111696   2.7922 0.0052669 ** 
## SH.XPD.PVTD.CH.ZS  0.070766   0.152538   0.4639 0.6427323    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ---------------- Model Overall Fit Statistics ----------------
## Within R²:           0.9495 
## Adjusted Within R²:  0.9486 
## F-statistic:         4872.265 
## F p-value:           0 
## --------------------------------------------------------------
## 
## >>> Significant Variables (p < 0.05):
##  - SH.H2O.BASW.ZS 
##  - SP.DYN.AMRT.MA 
##  - SP.DYN.CDRT.IN 
##  - SP.POP.BRTH.MF 
##  - SP.POP.GROW 
##  - SH.ANM.NPRG.ZS 
##  - SH.DTH.NMRT 
##  - SH.IMM.MEAS 
##  - SH.MED.BEDS.ZS 
##  - SH.XPD.CHEX.GD.ZS 
##  - SH.XPD.GHED.CH.ZS 
## ================================================================================
## 
## === Start Automatic Iterative VIF Filtering | Max allowed VIF = 10 ===
## Initial total candidate variables count: 52 
## 
## Drop variable: SP.DYN.IMRT.IN | Corresponding VIF = 187504.9 
## Drop variable: SH.DYN.MORT | Corresponding VIF = 47253.12 
## Drop variable: SH.DYN.MORT.MA | Corresponding VIF = 16067.09 
## Drop variable: SP.DYN.LE00.IN | Corresponding VIF = 11333.42 
## Drop variable: SH.DTH.IMRT | Corresponding VIF = 9849.34 
## Drop variable: SH.ANM.ALLW.ZS | Corresponding VIF = 2631.07 
## Drop variable: SP.DYN.IMRT.FE.IN | Corresponding VIF = 1974.25 
## Drop variable: SP.DYN.LE00.FE.IN | Corresponding VIF = 1207.51 
## Drop variable: SH.XPD.CHEX.PC.CD | Corresponding VIF = 971.14 
## Drop variable: SH.DTH.NMRT | Corresponding VIF = 938.88 
## Drop variable: SH.XPD.PVTD.PC.CD | Corresponding VIF = 715.27 
## Drop variable: SH.XPD.GHED.PP.CD | Corresponding VIF = 572.92 
## Drop variable: SH.MMR.DTHS | Corresponding VIF = 357.61 
## Drop variable: SH.MMR.RISK | Corresponding VIF = 318.27 
## Drop variable: SH.XPD.PVTD.PP.CD | Corresponding VIF = 272.26 
## Drop variable: SH.DYN.MORT.FE | Corresponding VIF = 237.51 
## Drop variable: SH.XPD.CHEX.PP.CD | Corresponding VIF = 200.8 
## Drop variable: SH.XPD.GHED.PC.CD | Corresponding VIF = 104.44 
## Drop variable: SP.DYN.AMRT.FE | Corresponding VIF = 80.11 
## Drop variable: SP.DYN.LE00.MA.IN | Corresponding VIF = 56.51 
## Drop variable: SP.DYN.IMRT.MA.IN | Corresponding VIF = 45.3 
## Drop variable: SH.XPD.GHED.CH.ZS | Corresponding VIF = 38.54 
## Drop variable: SH.MMR.RISK.ZS | Corresponding VIF = 35.21 
## Drop variable: SP.DYN.CBRT.IN | Corresponding VIF = 33.89 
## Drop variable: SP.DYN.TO65.MA.ZS | Corresponding VIF = 31.72 
## Drop variable: SH.XPD.PVTD.CH.ZS | Corresponding VIF = 21.82 
## Drop variable: SH.XPD.OOPC.PC.CD | Corresponding VIF = 19.74 
## Drop variable: SH.PRG.ANEM | Corresponding VIF = 18.34 
## Drop variable: SP.DYN.CDRT.IN | Corresponding VIF = 17.05 
## Drop variable: SH.STA.MMRT | Corresponding VIF = 16.23 
## Drop variable: SH.STA.BASS.ZS | Corresponding VIF = 13.19 
## Drop variable: SH.DYN.NMRT | Corresponding VIF = 12.61 
## Drop variable: SH.STA.ODFC.ZS | Corresponding VIF = 10.13 
## 
## === VIF Filter Finished ===
## Remaining valid variables after VIF screening count: 19
## Final reserved variable list:
##  [1] "SH.H2O.BASW.ZS"    "SH.STA.DIAB.ZS"    "SP.ADO.TFRT"      
##  [4] "SP.DYN.AMRT.MA"    "SP.DYN.TFRT.IN"    "SP.DYN.TO65.FE.ZS"
##  [7] "SP.POP.BRTH.MF"    "SP.POP.GROW"       "SH.ANM.NPRG.ZS"   
## [10] "SH.DTH.MORT"       "SH.IMM.IDPT"       "SH.IMM.MEAS"      
## [13] "SH.MED.BEDS.ZS"    "SH.MED.NUMW.P3"    "SH.MED.PHYS.ZS"   
## [16] "SH.XPD.CHEX.GD.ZS" "SH.XPD.GHED.GD.ZS" "SH.XPD.OOPC.CH.ZS"
## [19] "SH.XPD.OOPC.PP.CD"
## 
## ================================================================================
##                     PANEL FIXED EFFECT ESTIMATION | REGION:  Middle East & North Africa 
## ================================================================================
## 
## t test of coefficients:
## 
##                    Estimate Std. Error  t value  Pr(>|t|)    
## SH.H2O.BASW.ZS    -0.061221   0.092932  -0.6588  0.510147    
## SH.STA.DIAB.ZS     1.255803   0.241406   5.2020 2.248e-07 ***
## SP.ADO.TFRT       -0.080800   0.025354  -3.1869  0.001468 ** 
## SP.DYN.AMRT.MA    -0.246778   0.054905  -4.4947 7.512e-06 ***
## SP.DYN.TFRT.IN    -0.980288   0.079941 -12.2626 < 2.2e-16 ***
## SP.DYN.TO65.FE.ZS  0.546975   0.080921   6.7594 1.989e-11 ***
## SP.POP.BRTH.MF    -4.406752   1.648387  -2.6734  0.007592 ** 
## SP.POP.GROW       -0.021925   0.015303  -1.4327  0.152160    
## SH.ANM.NPRG.ZS    -0.121896   0.119537  -1.0197  0.308019    
## SH.DTH.MORT        0.259074   0.036316   7.1339 1.523e-12 ***
## SH.IMM.IDPT        0.175209   0.042140   4.1577 3.399e-05 ***
## SH.IMM.MEAS       -0.094139   0.029755  -3.1638  0.001589 ** 
## SH.MED.BEDS.ZS    -0.444191   0.053597  -8.2876 2.567e-16 ***
## SH.MED.NUMW.P3     0.103642   0.037504   2.7635  0.005789 ** 
## SH.MED.PHYS.ZS     0.400622   0.070065   5.7179 1.303e-08 ***
## SH.XPD.CHEX.GD.ZS -0.203645   0.098534  -2.0668  0.038931 *  
## SH.XPD.GHED.GD.ZS  0.084228   0.044163   1.9072  0.056688 .  
## SH.XPD.OOPC.CH.ZS -0.432048   0.051944  -8.3176 < 2.2e-16 ***
## SH.XPD.OOPC.PP.CD  0.304844   0.050840   5.9962 2.536e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ---------------- Model Overall Fit Statistics ----------------
## Within R²:           0.9158 
## Adjusted Within R²:  0.9135 
## F-statistic:         845.1982 
## F p-value:           0 
## --------------------------------------------------------------
## 
## >>> Significant Variables (p < 0.05):
##  - SH.STA.DIAB.ZS 
##  - SP.ADO.TFRT 
##  - SP.DYN.AMRT.MA 
##  - SP.DYN.TFRT.IN 
##  - SP.DYN.TO65.FE.ZS 
##  - SP.POP.BRTH.MF 
##  - SH.DTH.MORT 
##  - SH.IMM.IDPT 
##  - SH.IMM.MEAS 
##  - SH.MED.BEDS.ZS 
##  - SH.MED.NUMW.P3 
##  - SH.MED.PHYS.ZS 
##  - SH.XPD.CHEX.GD.ZS 
##  - SH.XPD.OOPC.CH.ZS 
##  - SH.XPD.OOPC.PP.CD 
## ================================================================================
## 
## === Start Automatic Iterative VIF Filtering | Max allowed VIF = 10 ===
## Initial total candidate variables count: 52 
## 
## Drop variable: SP.DYN.IMRT.IN | Corresponding VIF = 395638.7 
## Drop variable: SP.DYN.LE00.IN | Corresponding VIF = 16166.82 
## Drop variable: SH.ANM.ALLW.ZS | Corresponding VIF = 6676.53 
## Drop variable: SH.DYN.MORT.MA | Corresponding VIF = 4836.45 
## Drop variable: SH.DYN.MORT.FE | Corresponding VIF = 2812.54 
## Drop variable: SH.DTH.MORT | Corresponding VIF = 1904.26 
## Drop variable: SH.XPD.PVTD.PP.CD | Corresponding VIF = 862.52 
## Drop variable: SP.DYN.IMRT.MA.IN | Corresponding VIF = 533.53 
## Drop variable: SH.DTH.NMRT | Corresponding VIF = 449.58 
## Drop variable: SP.DYN.LE00.FE.IN | Corresponding VIF = 390.53 
## Drop variable: SH.XPD.CHEX.PC.CD | Corresponding VIF = 312.28 
## Drop variable: SH.XPD.PVTD.PC.CD | Corresponding VIF = 265.05 
## Drop variable: SH.XPD.CHEX.PP.CD | Corresponding VIF = 207.94 
## Drop variable: SH.MMR.DTHS | Corresponding VIF = 167.93 
## Drop variable: SH.XPD.GHED.PP.CD | Corresponding VIF = 143.92 
## Drop variable: SH.MMR.RISK | Corresponding VIF = 139.62 
## Drop variable: SH.XPD.GHED.PC.CD | Corresponding VIF = 92.11 
## Drop variable: SP.DYN.LE00.MA.IN | Corresponding VIF = 89.05 
## Drop variable: SH.DYN.MORT | Corresponding VIF = 82.5 
## Drop variable: SH.MMR.RISK.ZS | Corresponding VIF = 66.16 
## Drop variable: SP.DYN.TO65.FE.ZS | Corresponding VIF = 50.42 
## Drop variable: SP.DYN.CBRT.IN | Corresponding VIF = 37.04 
## Drop variable: SH.XPD.GHED.GD.ZS | Corresponding VIF = 22.8 
## Drop variable: SP.DYN.CDRT.IN | Corresponding VIF = 19.67 
## Drop variable: SP.DYN.AMRT.MA | Corresponding VIF = 16.36 
## Drop variable: SP.DYN.IMRT.FE.IN | Corresponding VIF = 14.63 
## Drop variable: SH.PRG.ANEM | Corresponding VIF = 13.74 
## Drop variable: SH.XPD.OOPC.PC.CD | Corresponding VIF = 11.32 
## 
## === VIF Filter Finished ===
## Remaining valid variables after VIF screening count: 24
## Final reserved variable list:
##  [1] "SH.H2O.BASW.ZS"    "SH.STA.BASS.ZS"    "SH.STA.DIAB.ZS"   
##  [4] "SH.STA.ODFC.ZS"    "SP.ADO.TFRT"       "SP.DYN.AMRT.FE"   
##  [7] "SP.DYN.TFRT.IN"    "SP.DYN.TO65.MA.ZS" "SP.POP.BRTH.MF"   
## [10] "SP.POP.GROW"       "SH.ANM.NPRG.ZS"    "SH.DTH.IMRT"      
## [13] "SH.DYN.NMRT"       "SH.IMM.IDPT"       "SH.IMM.MEAS"      
## [16] "SH.MED.BEDS.ZS"    "SH.MED.NUMW.P3"    "SH.MED.PHYS.ZS"   
## [19] "SH.STA.MMRT"       "SH.XPD.CHEX.GD.ZS" "SH.XPD.GHED.CH.ZS"
## [22] "SH.XPD.OOPC.CH.ZS" "SH.XPD.OOPC.PP.CD" "SH.XPD.PVTD.CH.ZS"
## 
## ================================================================================
##                     PANEL FIXED EFFECT ESTIMATION | REGION:  Sub-Saharan Africa 
## ================================================================================
## 
## t test of coefficients:
## 
##                      Estimate  Std. Error  t value  Pr(>|t|)    
## SH.H2O.BASW.ZS      0.2764357   0.0399269   6.9236 5.333e-12 ***
## SH.STA.BASS.ZS     -0.1100345   0.0092475 -11.8988 < 2.2e-16 ***
## SH.STA.DIAB.ZS     -0.0771314   0.0152680  -5.0518 4.630e-07 ***
## SH.STA.ODFC.ZS      0.0060284   0.0096301   0.6260  0.531366    
## SP.ADO.TFRT         0.0549295   0.0575285   0.9548  0.339742    
## SP.DYN.AMRT.FE     -0.2233797   0.0460014  -4.8559 1.258e-06 ***
## SP.DYN.TFRT.IN     -1.3046834   0.0780812 -16.7093 < 2.2e-16 ***
## SP.DYN.TO65.MA.ZS   0.3468596   0.0584066   5.9387 3.192e-09 ***
## SP.POP.BRTH.MF    -14.8057092   2.2717435  -6.5173 8.324e-11 ***
## SP.POP.GROW        -0.1573761   0.0268073  -5.8706 4.801e-09 ***
## SH.ANM.NPRG.ZS      0.2550470   0.0924312   2.7593  0.005826 ** 
## SH.DTH.IMRT         0.6455285   0.0418999  15.4065 < 2.2e-16 ***
## SH.DYN.NMRT        -0.7193592   0.0367670 -19.5654 < 2.2e-16 ***
## SH.IMM.IDPT         0.0168259   0.0162113   1.0379  0.299392    
## SH.IMM.MEAS         0.0280118   0.0110479   2.5355  0.011278 *  
## SH.MED.BEDS.ZS      0.0222831   0.0189764   1.1742  0.240385    
## SH.MED.NUMW.P3      0.0700136   0.0475690   1.4718  0.141167    
## SH.MED.PHYS.ZS      0.3320201   0.0746693   4.4465 9.033e-06 ***
## SH.STA.MMRT         0.0441898   0.0255800   1.7275  0.084175 .  
## SH.XPD.CHEX.GD.ZS   0.0846439   0.0389848   2.1712  0.029992 *  
## SH.XPD.GHED.CH.ZS   0.1657325   0.0248176   6.6780 2.860e-11 ***
## SH.XPD.OOPC.CH.ZS  -0.1175765   0.0546845  -2.1501  0.031625 *  
## SH.XPD.OOPC.PP.CD   0.0822856   0.0396732   2.0741  0.038154 *  
## SH.XPD.PVTD.CH.ZS   0.0739926   0.0475696   1.5555  0.119939    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ---------------- Model Overall Fit Statistics ----------------
## Within R²:           0.907 
## Adjusted Within R²:  0.9048 
## F-statistic:         1257.433 
## F p-value:           0 
## --------------------------------------------------------------
## 
## >>> Significant Variables (p < 0.05):
##  - SH.H2O.BASW.ZS 
##  - SH.STA.BASS.ZS 
##  - SH.STA.DIAB.ZS 
##  - SP.DYN.AMRT.FE 
##  - SP.DYN.TFRT.IN 
##  - SP.DYN.TO65.MA.ZS 
##  - SP.POP.BRTH.MF 
##  - SP.POP.GROW 
##  - SH.ANM.NPRG.ZS 
##  - SH.DTH.IMRT 
##  - SH.DYN.NMRT 
##  - SH.IMM.MEAS 
##  - SH.MED.PHYS.ZS 
##  - SH.XPD.CHEX.GD.ZS 
##  - SH.XPD.GHED.CH.ZS 
##  - SH.XPD.OOPC.CH.ZS 
##  - SH.XPD.OOPC.PP.CD 
## ================================================================================
## 
## === Start Automatic Iterative VIF Filtering | Max allowed VIF = 10 ===
## Initial total candidate variables count: 52 
## 
## Drop variable: SP.DYN.IMRT.IN | Corresponding VIF = 36533.08 
## Drop variable: SH.DYN.MORT | Corresponding VIF = 27951.59 
## Drop variable: SH.ANM.ALLW.ZS | Corresponding VIF = 9934.65 
## Drop variable: SH.XPD.CHEX.PC.CD | Corresponding VIF = 8239.74 
## Drop variable: SP.DYN.IMRT.MA.IN | Corresponding VIF = 3775.25 
## Drop variable: SH.XPD.PVTD.PC.CD | Corresponding VIF = 3742.52 
## Drop variable: SH.XPD.CHEX.PP.CD | Corresponding VIF = 2289.44 
## Drop variable: SH.DTH.IMRT | Corresponding VIF = 1689.02 
## Drop variable: SH.DYN.MORT.FE | Corresponding VIF = 1316.78 
## Drop variable: SH.XPD.GHED.PC.CD | Corresponding VIF = 1277.61 
## Drop variable: SP.DYN.LE00.IN | Corresponding VIF = 1139.08 
## Drop variable: SH.XPD.PVTD.PP.CD | Corresponding VIF = 578.57 
## Drop variable: SH.DTH.NMRT | Corresponding VIF = 567.06 
## Drop variable: SH.XPD.GHED.PP.CD | Corresponding VIF = 299.9 
## Drop variable: SP.DYN.IMRT.FE.IN | Corresponding VIF = 231.35 
## Drop variable: SP.DYN.LE00.MA.IN | Corresponding VIF = 184.76 
## Drop variable: SH.XPD.GHED.GD.ZS | Corresponding VIF = 143.39 
## Drop variable: SP.DYN.TO65.FE.ZS | Corresponding VIF = 119.37 
## Drop variable: SH.MMR.RISK | Corresponding VIF = 105.41 
## Drop variable: SH.DYN.MORT.MA | Corresponding VIF = 46.06 
## Drop variable: SH.XPD.OOPC.PP.CD | Corresponding VIF = 43.2 
## Drop variable: SH.DTH.MORT | Corresponding VIF = 35.01 
## Drop variable: SP.DYN.CBRT.IN | Corresponding VIF = 32.11 
## Drop variable: SP.DYN.AMRT.MA | Corresponding VIF = 31.2 
## Drop variable: SH.PRG.ANEM | Corresponding VIF = 24.45 
## Drop variable: SP.DYN.LE00.FE.IN | Corresponding VIF = 21.22 
## Drop variable: SP.DYN.AMRT.FE | Corresponding VIF = 13.17 
## 
## === VIF Filter Finished ===
## Remaining valid variables after VIF screening count: 25
## Final reserved variable list:
##  [1] "SH.H2O.BASW.ZS"    "SH.STA.BASS.ZS"    "SH.STA.DIAB.ZS"   
##  [4] "SH.STA.ODFC.ZS"    "SP.ADO.TFRT"       "SP.DYN.CDRT.IN"   
##  [7] "SP.DYN.TFRT.IN"    "SP.DYN.TO65.MA.ZS" "SP.POP.BRTH.MF"   
## [10] "SP.POP.GROW"       "SH.ANM.NPRG.ZS"    "SH.DYN.NMRT"      
## [13] "SH.IMM.IDPT"       "SH.IMM.MEAS"       "SH.MED.BEDS.ZS"   
## [16] "SH.MED.NUMW.P3"    "SH.MED.PHYS.ZS"    "SH.MMR.DTHS"      
## [19] "SH.MMR.RISK.ZS"    "SH.STA.MMRT"       "SH.XPD.CHEX.GD.ZS"
## [22] "SH.XPD.GHED.CH.ZS" "SH.XPD.OOPC.CH.ZS" "SH.XPD.OOPC.PC.CD"
## [25] "SH.XPD.PVTD.CH.ZS"
## 
## ================================================================================
##                     PANEL FIXED EFFECT ESTIMATION | REGION:  Europe & Central Asia 
## ================================================================================
## 
## t test of coefficients:
## 
##                     Estimate Std. Error  t value  Pr(>|t|)    
## SH.H2O.BASW.ZS    -0.0515405  0.1257185  -0.4100 0.6818562    
## SH.STA.BASS.ZS    -0.2084342  0.2067745  -1.0080 0.3135160    
## SH.STA.DIAB.ZS    -0.1109876  0.0766007  -1.4489 0.1474580    
## SH.STA.ODFC.ZS     0.1009529  0.0579796   1.7412 0.0817452 .  
## SP.ADO.TFRT        0.0718176  0.0221984   3.2353 0.0012273 ** 
## SP.DYN.CDRT.IN     0.7645184  0.0664353  11.5077 < 2.2e-16 ***
## SP.DYN.TFRT.IN    -1.1189967  0.0695813 -16.0819 < 2.2e-16 ***
## SP.DYN.TO65.MA.ZS  1.4825847  0.1487602   9.9663 < 2.2e-16 ***
## SP.POP.BRTH.MF    -0.0884586  1.5175854  -0.0583 0.9535220    
## SP.POP.GROW       -0.1339860  0.0316015  -4.2399 2.298e-05 ***
## SH.ANM.NPRG.ZS    -0.5236876  0.1453595  -3.6027 0.0003196 ***
## SH.DYN.NMRT       -0.0453437  0.0134669  -3.3671 0.0007685 ***
## SH.IMM.IDPT        0.0045486  0.0218950   0.2077 0.8354401    
## SH.IMM.MEAS       -0.0614012  0.0127995  -4.7972 1.681e-06 ***
## SH.MED.BEDS.ZS    -0.1672849  0.0203266  -8.2298 2.671e-16 ***
## SH.MED.NUMW.P3     0.2905339  0.0234889  12.3690 < 2.2e-16 ***
## SH.MED.PHYS.ZS     0.1465363  0.0202180   7.2478 5.254e-13 ***
## SH.MMR.DTHS        0.2005436  0.0186557  10.7497 < 2.2e-16 ***
## SH.MMR.RISK.ZS    -0.3581878  0.1321163  -2.7112 0.0067398 ** 
## SH.STA.MMRT       -0.1630787  0.0280893  -5.8057 7.018e-09 ***
## SH.XPD.CHEX.GD.ZS  0.1674651  0.0513282   3.2626 0.0011151 ** 
## SH.XPD.GHED.CH.ZS -0.3080738  0.0543651  -5.6668 1.581e-08 ***
## SH.XPD.OOPC.CH.ZS -0.2222489  0.0803019  -2.7677 0.0056775 ** 
## SH.XPD.OOPC.PC.CD  0.0138053  0.0150421   0.9178 0.3588030    
## SH.XPD.PVTD.CH.ZS -0.1813452  0.0842008  -2.1537 0.0313343 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ---------------- Model Overall Fit Statistics ----------------
## Within R²:           0.8802 
## Adjusted Within R²:  0.8775 
## F-statistic:         966.809 
## F p-value:           0 
## --------------------------------------------------------------
## 
## >>> Significant Variables (p < 0.05):
##  - SP.ADO.TFRT 
##  - SP.DYN.CDRT.IN 
##  - SP.DYN.TFRT.IN 
##  - SP.DYN.TO65.MA.ZS 
##  - SP.POP.GROW 
##  - SH.ANM.NPRG.ZS 
##  - SH.DYN.NMRT 
##  - SH.IMM.MEAS 
##  - SH.MED.BEDS.ZS 
##  - SH.MED.NUMW.P3 
##  - SH.MED.PHYS.ZS 
##  - SH.MMR.DTHS 
##  - SH.MMR.RISK.ZS 
##  - SH.STA.MMRT 
##  - SH.XPD.CHEX.GD.ZS 
##  - SH.XPD.GHED.CH.ZS 
##  - SH.XPD.OOPC.CH.ZS 
##  - SH.XPD.PVTD.CH.ZS 
## ================================================================================
## 
## === Start Automatic Iterative VIF Filtering | Max allowed VIF = 10 ===
## Initial total candidate variables count: 52 
## 
## Drop variable: SH.DYN.MORT | Corresponding VIF = 206099.6 
## Drop variable: SP.DYN.IMRT.IN | Corresponding VIF = 140029.5 
## Drop variable: SH.DYN.MORT.MA | Corresponding VIF = 16116.41 
## Drop variable: SH.DTH.IMRT | Corresponding VIF = 11297.65 
## Drop variable: SP.DYN.LE00.IN | Corresponding VIF = 7659.46 
## Drop variable: SH.ANM.ALLW.ZS | Corresponding VIF = 7239.92 
## Drop variable: SH.DTH.MORT | Corresponding VIF = 3007.18 
## Drop variable: SH.XPD.PVTD.PC.CD | Corresponding VIF = 1616.69 
## Drop variable: SP.DYN.IMRT.FE.IN | Corresponding VIF = 1314.76 
## Drop variable: SH.XPD.CHEX.PP.CD | Corresponding VIF = 1301.78 
## Drop variable: SH.XPD.CHEX.PC.CD | Corresponding VIF = 496.47 
## Drop variable: SP.DYN.LE00.FE.IN | Corresponding VIF = 343.16 
## Drop variable: SH.XPD.OOPC.PP.CD | Corresponding VIF = 331.6 
## Drop variable: SH.DYN.MORT.FE | Corresponding VIF = 330.15 
## Drop variable: SH.MMR.DTHS | Corresponding VIF = 189.22 
## Drop variable: SH.XPD.GHED.PC.CD | Corresponding VIF = 157.79 
## Drop variable: SH.MMR.RISK | Corresponding VIF = 114.3 
## Drop variable: SP.DYN.LE00.MA.IN | Corresponding VIF = 108.74 
## Drop variable: SH.XPD.GHED.GD.ZS | Corresponding VIF = 86.85 
## Drop variable: SH.XPD.PVTD.PP.CD | Corresponding VIF = 82.57 
## Drop variable: SP.DYN.TO65.MA.ZS | Corresponding VIF = 67.3 
## Drop variable: SP.DYN.CBRT.IN | Corresponding VIF = 65.97 
## Drop variable: SP.DYN.IMRT.MA.IN | Corresponding VIF = 38.57 
## Drop variable: SP.DYN.AMRT.FE | Corresponding VIF = 33.54 
## Drop variable: SH.MMR.RISK.ZS | Corresponding VIF = 20.77 
## Drop variable: SP.DYN.TFRT.IN | Corresponding VIF = 15.2 
## Drop variable: SP.DYN.TO65.FE.ZS | Corresponding VIF = 14.44 
## Drop variable: SH.XPD.GHED.PP.CD | Corresponding VIF = 14.24 
## 
## === VIF Filter Finished ===
## Remaining valid variables after VIF screening count: 24
## Final reserved variable list:
##  [1] "SH.H2O.BASW.ZS"    "SH.STA.BASS.ZS"    "SH.STA.DIAB.ZS"   
##  [4] "SH.STA.ODFC.ZS"    "SP.ADO.TFRT"       "SP.DYN.AMRT.MA"   
##  [7] "SP.DYN.CDRT.IN"    "SP.POP.BRTH.MF"    "SP.POP.GROW"      
## [10] "SH.ANM.NPRG.ZS"    "SH.DTH.NMRT"       "SH.DYN.NMRT"      
## [13] "SH.IMM.IDPT"       "SH.IMM.MEAS"       "SH.MED.BEDS.ZS"   
## [16] "SH.MED.NUMW.P3"    "SH.MED.PHYS.ZS"    "SH.PRG.ANEM"      
## [19] "SH.STA.MMRT"       "SH.XPD.CHEX.GD.ZS" "SH.XPD.GHED.CH.ZS"
## [22] "SH.XPD.OOPC.CH.ZS" "SH.XPD.OOPC.PC.CD" "SH.XPD.PVTD.CH.ZS"
## 
## ================================================================================
##                     PANEL FIXED EFFECT ESTIMATION | REGION:  Latin America & Caribbean 
## ================================================================================
## 
## t test of coefficients:
## 
##                    Estimate Std. Error  t value  Pr(>|t|)    
## SH.H2O.BASW.ZS     0.041051   0.273547   0.1501 0.8807239    
## SH.STA.BASS.ZS     0.031117   0.170078   0.1830 0.8548493    
## SH.STA.DIAB.ZS     1.014900   0.186900   5.4302 6.295e-08 ***
## SH.STA.ODFC.ZS    -0.090220   0.023800  -3.7907 0.0001546 ***
## SP.ADO.TFRT       -0.420867   0.024973 -16.8527 < 2.2e-16 ***
## SP.DYN.AMRT.MA    -0.746493   0.039196 -19.0449 < 2.2e-16 ***
## SP.DYN.CDRT.IN     0.166045   0.111821   1.4849 0.1377182    
## SP.POP.BRTH.MF    16.755000   2.705889   6.1920 7.151e-10 ***
## SP.POP.GROW       -0.411513   0.054271  -7.5826 5.095e-14 ***
## SH.ANM.NPRG.ZS    -0.153198   0.092606  -1.6543 0.0982193 .  
## SH.DTH.NMRT        0.810022   0.047829  16.9358 < 2.2e-16 ***
## SH.DYN.NMRT       -1.035830   0.056087 -18.4683 < 2.2e-16 ***
## SH.IMM.IDPT       -0.038863   0.038452  -1.0107 0.3122852    
## SH.IMM.MEAS        0.116371   0.031935   3.6440 0.0002751 ***
## SH.MED.BEDS.ZS     0.024031   0.039729   0.6049 0.5453274    
## SH.MED.NUMW.P3     0.274396   0.042420   6.4685 1.234e-10 ***
## SH.MED.PHYS.ZS     0.254228   0.050747   5.0097 5.916e-07 ***
## SH.PRG.ANEM       -0.113759   0.300366  -0.3787 0.7049252    
## SH.STA.MMRT        0.053577   0.036298   1.4760 0.1400920    
## SH.XPD.CHEX.GD.ZS  0.038069   0.096899   0.3929 0.6944505    
## SH.XPD.GHED.CH.ZS  0.336615   0.095344   3.5305 0.0004239 ***
## SH.XPD.OOPC.CH.ZS -0.273570   0.103527  -2.6425 0.0082924 ** 
## SH.XPD.OOPC.PC.CD  0.076668   0.047687   1.6077 0.1080476    
## SH.XPD.PVTD.CH.ZS  0.628029   0.155891   4.0287 5.814e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ---------------- Model Overall Fit Statistics ----------------
## Within R²:           0.9085 
## Adjusted Within R²:  0.9061 
## F-statistic:         850.905 
## F p-value:           0 
## --------------------------------------------------------------
## 
## >>> Significant Variables (p < 0.05):
##  - SH.STA.DIAB.ZS 
##  - SH.STA.ODFC.ZS 
##  - SP.ADO.TFRT 
##  - SP.DYN.AMRT.MA 
##  - SP.POP.BRTH.MF 
##  - SP.POP.GROW 
##  - SH.DTH.NMRT 
##  - SH.DYN.NMRT 
##  - SH.IMM.MEAS 
##  - SH.MED.NUMW.P3 
##  - SH.MED.PHYS.ZS 
##  - SH.XPD.GHED.CH.ZS 
##  - SH.XPD.OOPC.CH.ZS 
##  - SH.XPD.PVTD.CH.ZS 
## ================================================================================
## 
## === Start Automatic Iterative VIF Filtering | Max allowed VIF = 10 ===
## Initial total candidate variables count: 52 
## 
## Drop variable: SP.DYN.IMRT.IN | Corresponding VIF = 106523.1 
## Drop variable: SH.DYN.MORT | Corresponding VIF = 32348.05 
## Drop variable: SP.DYN.LE00.IN | Corresponding VIF = 7717.98 
## Drop variable: SH.DTH.IMRT | Corresponding VIF = 5049.44 
## Drop variable: SP.DYN.IMRT.MA.IN | Corresponding VIF = 4576.55 
## Drop variable: SH.ANM.ALLW.ZS | Corresponding VIF = 3439.3 
## Drop variable: SH.DYN.MORT.FE | Corresponding VIF = 1719.16 
## Drop variable: SH.XPD.PVTD.PP.CD | Corresponding VIF = 1529.89 
## Drop variable: SH.DTH.NMRT | Corresponding VIF = 1462.46 
## Drop variable: SH.XPD.CHEX.PP.CD | Corresponding VIF = 749.03 
## Drop variable: SP.DYN.LE00.MA.IN | Corresponding VIF = 698.38 
## Drop variable: SH.XPD.OOPC.PC.CD | Corresponding VIF = 641.69 
## Drop variable: SP.DYN.IMRT.FE.IN | Corresponding VIF = 331.89 
## Drop variable: SH.XPD.GHED.PC.CD | Corresponding VIF = 322.44 
## Drop variable: SH.MMR.DTHS | Corresponding VIF = 315.5 
## Drop variable: SH.MMR.RISK | Corresponding VIF = 258.58 
## Drop variable: SH.XPD.PVTD.PC.CD | Corresponding VIF = 230.31 
## Drop variable: SH.XPD.OOPC.PP.CD | Corresponding VIF = 123.76 
## Drop variable: SP.DYN.LE00.FE.IN | Corresponding VIF = 104 
## Drop variable: SH.XPD.GHED.GD.ZS | Corresponding VIF = 72.98 
## Drop variable: SP.DYN.AMRT.FE | Corresponding VIF = 46.41 
## Drop variable: SH.DYN.MORT.MA | Corresponding VIF = 42.89 
## Drop variable: SH.XPD.OOPC.CH.ZS | Corresponding VIF = 32.12 
## Drop variable: SH.XPD.GHED.PP.CD | Corresponding VIF = 30.2 
## Drop variable: SP.DYN.CBRT.IN | Corresponding VIF = 27.88 
## Drop variable: SH.STA.MMRT | Corresponding VIF = 25.78 
## Drop variable: SP.DYN.TO65.MA.ZS | Corresponding VIF = 18.71 
## Drop variable: SH.PRG.ANEM | Corresponding VIF = 14.19 
## Drop variable: SH.DYN.NMRT | Corresponding VIF = 10.88 
## 
## === VIF Filter Finished ===
## Remaining valid variables after VIF screening count: 23
## Final reserved variable list:
##  [1] "SH.H2O.BASW.ZS"    "SH.STA.BASS.ZS"    "SH.STA.DIAB.ZS"   
##  [4] "SH.STA.ODFC.ZS"    "SP.ADO.TFRT"       "SP.DYN.AMRT.MA"   
##  [7] "SP.DYN.CDRT.IN"    "SP.DYN.TFRT.IN"    "SP.DYN.TO65.FE.ZS"
## [10] "SP.POP.BRTH.MF"    "SP.POP.GROW"       "SH.ANM.NPRG.ZS"   
## [13] "SH.DTH.MORT"       "SH.IMM.IDPT"       "SH.IMM.MEAS"      
## [16] "SH.MED.BEDS.ZS"    "SH.MED.NUMW.P3"    "SH.MED.PHYS.ZS"   
## [19] "SH.MMR.RISK.ZS"    "SH.XPD.CHEX.GD.ZS" "SH.XPD.CHEX.PC.CD"
## [22] "SH.XPD.GHED.CH.ZS" "SH.XPD.PVTD.CH.ZS"
## 
## ================================================================================
##                     PANEL FIXED EFFECT ESTIMATION | REGION:  East Asia & Pacific 
## ================================================================================
## 
## t test of coefficients:
## 
##                     Estimate Std. Error  t value  Pr(>|t|)    
## SH.H2O.BASW.ZS     0.3438276  0.0885102   3.8846 0.0001061 ***
## SH.STA.BASS.ZS     0.2899380  0.0351680   8.2444 3.097e-16 ***
## SH.STA.DIAB.ZS    -0.1797734  0.0269212  -6.6778 3.192e-11 ***
## SH.STA.ODFC.ZS    -0.0180737  0.0306509  -0.5897 0.5554890    
## SP.ADO.TFRT       -0.0849779  0.0244289  -3.4786 0.0005157 ***
## SP.DYN.AMRT.MA    -0.7259195  0.0770894  -9.4166 < 2.2e-16 ***
## SP.DYN.CDRT.IN     0.5176674  0.0776880   6.6634 3.512e-11 ***
## SP.DYN.TFRT.IN    -0.8043299  0.0771956 -10.4194 < 2.2e-16 ***
## SP.DYN.TO65.FE.ZS  0.3103582  0.1325511   2.3414 0.0193153 *  
## SP.POP.BRTH.MF    -7.9629923  1.7774872  -4.4799 7.921e-06 ***
## SP.POP.GROW        0.0237881  0.0240942   0.9873 0.3236265    
## SH.ANM.NPRG.ZS    -0.0786380  0.1015538  -0.7743 0.4388232    
## SH.DTH.MORT       -0.0037587  0.0261162  -0.1439 0.8855771    
## SH.IMM.IDPT       -0.0083179  0.0141849  -0.5864 0.5576815    
## SH.IMM.MEAS        0.1261852  0.0181754   6.9426 5.296e-12 ***
## SH.MED.BEDS.ZS    -0.3038471  0.0393772  -7.7163 1.938e-14 ***
## SH.MED.NUMW.P3    -0.1194708  0.0637345  -1.8745 0.0610169 .  
## SH.MED.PHYS.ZS     0.3418801  0.0481087   7.1064 1.689e-12 ***
## SH.MMR.RISK.ZS     0.2959194  0.0522938   5.6588 1.762e-08 ***
## SH.XPD.CHEX.GD.ZS -0.0329590  0.0596893  -0.5522 0.5808933    
## SH.XPD.CHEX.PC.CD  0.0736714  0.0222026   3.3181 0.0009236 ***
## SH.XPD.GHED.CH.ZS -0.0403248  0.0399755  -1.0087 0.3132313    
## SH.XPD.PVTD.CH.ZS -0.2272111  0.0339018  -6.7020 2.715e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ---------------- Model Overall Fit Statistics ----------------
## Within R²:           0.8826 
## Adjusted Within R²:  0.8794 
## F-statistic:         608.4633 
## F p-value:           0 
## --------------------------------------------------------------
## 
## >>> Significant Variables (p < 0.05):
##  - SH.H2O.BASW.ZS 
##  - SH.STA.BASS.ZS 
##  - SH.STA.DIAB.ZS 
##  - SP.ADO.TFRT 
##  - SP.DYN.AMRT.MA 
##  - SP.DYN.CDRT.IN 
##  - SP.DYN.TFRT.IN 
##  - SP.DYN.TO65.FE.ZS 
##  - SP.POP.BRTH.MF 
##  - SH.IMM.MEAS 
##  - SH.MED.BEDS.ZS 
##  - SH.MED.PHYS.ZS 
##  - SH.MMR.RISK.ZS 
##  - SH.XPD.CHEX.PC.CD 
##  - SH.XPD.PVTD.CH.ZS 
## ================================================================================
## 
## === Start Automatic Iterative VIF Filtering | Max allowed VIF = 10 ===
## Initial total candidate variables count: 52 
## 
## Drop variable: SH.DYN.MORT | Corresponding VIF = 1008639 
## Drop variable: SP.DYN.IMRT.IN | Corresponding VIF = 547059 
## Drop variable: SH.DTH.IMRT | Corresponding VIF = 296272.2 
## Drop variable: SH.DYN.MORT.MA | Corresponding VIF = 68089.94 
## Drop variable: SP.DYN.LE00.MA.IN | Corresponding VIF = 23397.77 
## Drop variable: SH.DTH.MORT | Corresponding VIF = 20298.17 
## Drop variable: SP.DYN.IMRT.FE.IN | Corresponding VIF = 19025.95 
## Drop variable: SH.ANM.ALLW.ZS | Corresponding VIF = 14551.51 
## Drop variable: SH.XPD.CHEX.PC.CD | Corresponding VIF = 7232.73 
## Drop variable: SH.MMR.DTHS | Corresponding VIF = 6057.86 
## Drop variable: SH.XPD.OOPC.PC.CD | Corresponding VIF = 5096.14 
## Drop variable: SH.XPD.GHED.PP.CD | Corresponding VIF = 4397.77 
## Drop variable: SP.DYN.TO65.FE.ZS | Corresponding VIF = 3605.1 
## Drop variable: SH.XPD.PVTD.PP.CD | Corresponding VIF = 2837.18 
## Drop variable: SP.DYN.IMRT.MA.IN | Corresponding VIF = 2056.21 
## Drop variable: SH.XPD.OOPC.PP.CD | Corresponding VIF = 1468.91 
## Drop variable: SH.MMR.RISK | Corresponding VIF = 1009.13 
## Drop variable: SP.DYN.LE00.FE.IN | Corresponding VIF = 962.97 
## Drop variable: SH.XPD.GHED.PC.CD | Corresponding VIF = 953.68 
## Drop variable: SH.XPD.PVTD.CH.ZS | Corresponding VIF = 693.06 
## Drop variable: SH.STA.MMRT | Corresponding VIF = 432.67 
## Drop variable: SH.XPD.PVTD.PC.CD | Corresponding VIF = 258.36 
## Drop variable: SH.DYN.MORT.FE | Corresponding VIF = 249.93 
## Drop variable: SH.XPD.GHED.GD.ZS | Corresponding VIF = 223.74 
## Drop variable: SP.DYN.LE00.IN | Corresponding VIF = 204.44 
## Drop variable: SP.DYN.AMRT.MA | Corresponding VIF = 153.97 
## Drop variable: SH.PRG.ANEM | Corresponding VIF = 144.69 
## Drop variable: SP.DYN.CBRT.IN | Corresponding VIF = 134.39 
## Drop variable: SH.DYN.NMRT | Corresponding VIF = 119.43 
## Drop variable: SP.DYN.CDRT.IN | Corresponding VIF = 109.71 
## Drop variable: SH.XPD.CHEX.GD.ZS | Corresponding VIF = 73.21 
## Drop variable: SH.MED.BEDS.ZS | Corresponding VIF = 50.19 
## Drop variable: SP.DYN.AMRT.FE | Corresponding VIF = 43.4 
## Drop variable: SP.DYN.TFRT.IN | Corresponding VIF = 34.48 
## Drop variable: SH.XPD.GHED.CH.ZS | Corresponding VIF = 28.43 
## Drop variable: SH.XPD.CHEX.PP.CD | Corresponding VIF = 26.31 
## Drop variable: SH.MED.NUMW.P3 | Corresponding VIF = 20.44 
## Drop variable: SH.MMR.RISK.ZS | Corresponding VIF = 16.13 
## Drop variable: SH.STA.ODFC.ZS | Corresponding VIF = 14.42 
## 
## === VIF Filter Finished ===
## Remaining valid variables after VIF screening count: 13
## Final reserved variable list:
##  [1] "SH.H2O.BASW.ZS"    "SH.STA.BASS.ZS"    "SH.STA.DIAB.ZS"   
##  [4] "SP.ADO.TFRT"       "SP.DYN.TO65.MA.ZS" "SP.POP.BRTH.MF"   
##  [7] "SP.POP.GROW"       "SH.ANM.NPRG.ZS"    "SH.DTH.NMRT"      
## [10] "SH.IMM.IDPT"       "SH.IMM.MEAS"       "SH.MED.PHYS.ZS"   
## [13] "SH.XPD.OOPC.CH.ZS"
## 
## ================================================================================
##                     PANEL FIXED EFFECT ESTIMATION | REGION:  South Asia 
## ================================================================================
## 
## t test of coefficients:
## 
##                     Estimate Std. Error  t value  Pr(>|t|)    
## SH.H2O.BASW.ZS    -1.0336434  0.5547937  -1.8631  0.063222 .  
## SH.STA.BASS.ZS     0.1001023  0.0652166   1.5349  0.125640    
## SP.ADO.TFRT       -0.0878679  0.0169338  -5.1889 3.459e-07 ***
## SP.DYN.TO65.MA.ZS  1.0979985  0.0882193  12.4462 < 2.2e-16 ***
## SP.POP.BRTH.MF     5.9696395  2.0727910   2.8800  0.004203 ** 
## SP.POP.GROW       -0.0064639  0.0316209  -0.2044  0.838136    
## SH.ANM.NPRG.ZS    -1.1681776  0.1515146  -7.7100 1.124e-13 ***
## SH.DTH.NMRT       -0.2642242  0.0263828 -10.0150 < 2.2e-16 ***
## SH.IMM.IDPT       -0.0034285  0.0150871  -0.2272  0.820355    
## SH.IMM.MEAS        0.0787005  0.0141988   5.5428 5.596e-08 ***
## SH.MED.PHYS.ZS    -0.0651712  0.0392701  -1.6596  0.097832 .  
## SH.XPD.OOPC.CH.ZS  0.0576233  0.0228610   2.5206  0.012126 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ---------------- Model Overall Fit Statistics ----------------
## Within R²:           0.9763 
## Adjusted Within R²:  0.9752 
## F-statistic:         1295.597 
## F p-value:           0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000172766 
## --------------------------------------------------------------
## 
## >>> Significant Variables (p < 0.05):
##  - SP.ADO.TFRT 
##  - SP.DYN.TO65.MA.ZS 
##  - SP.POP.BRTH.MF 
##  - SH.ANM.NPRG.ZS 
##  - SH.DTH.NMRT 
##  - SH.IMM.MEAS 
##  - SH.XPD.OOPC.CH.ZS 
## ================================================================================
## 
## === Start Automatic Iterative VIF Filtering | Max allowed VIF = 10 ===
## Initial total candidate variables count: 51 
## 
## Drop variable: SP.DYN.LE00.IN | Corresponding VIF = 45422535 
## Drop variable: SH.XPD.CHEX.PC.CD | Corresponding VIF = 9396505 
## Drop variable: SH.XPD.PVTD.PC.CD | Corresponding VIF = 921574.8 
## Drop variable: SH.DTH.IMRT | Corresponding VIF = 405497.4 
## Drop variable: SH.XPD.CHEX.PP.CD | Corresponding VIF = 114791.2 
## Drop variable: SH.STA.DIAB.ZS | Corresponding VIF = 71845.52 
## Drop variable: SH.XPD.GHED.CH.ZS | Corresponding VIF = 39797.42 
## Drop variable: SH.DYN.MORT | Corresponding VIF = 35999.52 
## Drop variable: SP.DYN.IMRT.IN | Corresponding VIF = 27397.58 
## Drop variable: SH.XPD.OOPC.PC.CD | Corresponding VIF = 22469.69 
## Drop variable: SH.XPD.PVTD.PP.CD | Corresponding VIF = 21733.73 
## Drop variable: SH.DYN.MORT.MA | Corresponding VIF = 17980.92 
## Drop variable: SH.DTH.NMRT | Corresponding VIF = 15857.33 
## Drop variable: SH.MMR.DTHS | Corresponding VIF = 11370.11 
## Drop variable: SP.DYN.IMRT.FE.IN | Corresponding VIF = 10944.53 
## Drop variable: SH.ANM.ALLW.ZS | Corresponding VIF = 7191.41 
## Drop variable: SP.DYN.IMRT.MA.IN | Corresponding VIF = 6590.67 
## Drop variable: SH.PRG.ANEM | Corresponding VIF = 5109.55 
## Drop variable: SH.DTH.MORT | Corresponding VIF = 2959.71 
## Drop variable: SP.DYN.TO65.FE.ZS | Corresponding VIF = 2647.55 
## Drop variable: SP.DYN.LE00.MA.IN | Corresponding VIF = 2378.93 
## Drop variable: SH.XPD.GHED.PP.CD | Corresponding VIF = 1826.3 
## Drop variable: SH.DYN.MORT.FE | Corresponding VIF = 1645.89 
## Drop variable: SH.H2O.BASW.ZS | Corresponding VIF = 1297.27 
## Drop variable: SH.XPD.CHEX.GD.ZS | Corresponding VIF = 926.3 
## Drop variable: SP.DYN.AMRT.FE | Corresponding VIF = 841.42 
## Drop variable: SH.MMR.RISK | Corresponding VIF = 759.52 
## Drop variable: SP.DYN.TO65.MA.ZS | Corresponding VIF = 491.61 
## Drop variable: SH.XPD.PVTD.CH.ZS | Corresponding VIF = 271.55 
## Drop variable: SP.DYN.LE00.FE.IN | Corresponding VIF = 227.89 
## Drop variable: SP.DYN.AMRT.MA | Corresponding VIF = 131.88 
## Drop variable: SH.XPD.OOPC.PP.CD | Corresponding VIF = 98 
## Drop variable: SH.XPD.OOPC.CH.ZS | Corresponding VIF = 96.67 
## Drop variable: SH.STA.MMRT | Corresponding VIF = 90.76 
## Drop variable: SP.ADO.TFRT | Corresponding VIF = 52.64 
## Drop variable: SH.XPD.GHED.PC.CD | Corresponding VIF = 45.18 
## Drop variable: SH.DYN.NMRT | Corresponding VIF = 37.16 
## Drop variable: SP.DYN.CBRT.IN | Corresponding VIF = 18.7 
## Drop variable: SH.XPD.GHED.GD.ZS | Corresponding VIF = 18.09 
## Drop variable: SH.MMR.RISK.ZS | Corresponding VIF = 14.85 
## Drop variable: SH.MED.BEDS.ZS | Corresponding VIF = 11.77 
## 
## === VIF Filter Finished ===
## Remaining valid variables after VIF screening count: 10
## Final reserved variable list:
##  [1] "SH.STA.BASS.ZS" "SP.DYN.CDRT.IN" "SP.DYN.TFRT.IN" "SP.POP.BRTH.MF"
##  [5] "SP.POP.GROW"    "SH.ANM.NPRG.ZS" "SH.IMM.IDPT"    "SH.IMM.MEAS"   
##  [9] "SH.MED.NUMW.P3" "SH.MED.PHYS.ZS"
## 
## ================================================================================
##                     PANEL FIXED EFFECT ESTIMATION | REGION:  North America 
## ================================================================================
## 
## t test of coefficients:
## 
##                  Estimate Std. Error t value  Pr(>|t|)    
## SH.STA.BASS.ZS -64.988270   8.470151 -7.6726 4.949e-12 ***
## SP.DYN.CDRT.IN  -0.079227   0.165327 -0.4792   0.63266    
## SP.DYN.TFRT.IN  -1.321979   0.187212 -7.0614 1.159e-10 ***
## SP.POP.BRTH.MF -14.016858   8.296932 -1.6894   0.09374 .  
## SP.POP.GROW      0.114461   0.084873  1.3486   0.18000    
## SH.ANM.NPRG.ZS   0.573214   0.104616  5.4792 2.395e-07 ***
## SH.IMM.IDPT      0.072814   0.918052  0.0793   0.93692    
## SH.IMM.MEAS      0.835217   0.187332  4.4585 1.869e-05 ***
## SH.MED.NUMW.P3  -0.885964   0.194503 -4.5550 1.267e-05 ***
## SH.MED.PHYS.ZS   0.964027   0.075903 12.7007 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ---------------- Model Overall Fit Statistics ----------------
## Within R²:           0.9788 
## Adjusted Within R²:  0.9768 
## F-statistic:         553.3676 
## F p-value:           0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000238452 
## --------------------------------------------------------------
## 
## >>> Significant Variables (p < 0.05):
##  - SH.STA.BASS.ZS 
##  - SP.DYN.TFRT.IN 
##  - SH.ANM.NPRG.ZS 
##  - SH.IMM.MEAS 
##  - SH.MED.NUMW.P3 
##  - SH.MED.PHYS.ZS 
## ================================================================================

# ==============================================================================
# Step 3: Select optimal model by total absolute coefficient sum
# Adjusted R² & F-stat unavailable for all regional specifications,
# choose model with largest sum of absolute coefficients as alternative screening rule
# ==============================================================================
if (length(region_fe_list) == 0) {
  cat("Warning: No valid regional fixed-effect model is available.\n")
} else {
  # Calculate sum of absolute regression coefficients for each group
  abs_coef_sum <- sapply(region_fe_list, function(m) {
    tryCatch({
      sum(abs(coef(m)))
    }, error = function(e) NA)
  })
  
  abs_clean <- na.omit(abs_coef_sum)
  
  if (length(abs_clean) == 0) {
    cat("Warning: Fail to extract coefficient from all regional models.\n")
  } else {
    best_group_name <- names(which.max(abs_clean))
    if (best_group_name %in% names(region_fe_list)) {
      best_reg_model <- region_fe_list[[best_group_name]]
      best_sum_val <- round(max(abs_clean),4)
      
      # Output result info
      cat("\n==== Optimal Fixed Effect by Region Group (Max Absolute Coefficient Sum) ====\n")
      cat("Best Performing Region: ", best_group_name, "\n")
      cat("Sum of Absolute Coefficients: ", best_sum_val, "\n\n")
      
      # Add fit stats for optimal region model
      sum_best <- summary(best_reg_model)
      r2b    <- round(as.numeric(sum_best$r.squared["rsq"]),4)
      adjr2b <- round(as.numeric(sum_best$r.squared["adjrsq"]),4)
      fb     <- round(as.numeric(sum_best$fstatistic$statistic),4)
      fpb    <- format(as.numeric(sum_best$fstatistic$p.value), digits = 6, scientific = FALSE)
      cat("Optimal Model Fit Index:\n")
      cat("Within R²: ",r2b," | Adj Within R²: ",adjr2b," | F: ",fb," | F-p: ",fpb,"\n\n")
      
      # Robust standard error output
      print(coeftest(best_reg_model, vcov = vcovSCC(best_reg_model)))
    } else {
      cat("Warning: Selected group name not found in model list.\n")
    }
  }
}

## 
## ==== Optimal Fixed Effect by Region Group (Max Absolute Coefficient Sum) ====
## Best Performing Region:  North America 
## Sum of Absolute Coefficients:  83.852 
## 
## Optimal Model Fit Index:
## Within R²:  0.9788  | Adj Within R²:  0.9768  | F:  553.3676  | F-p:  0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000238452 
## 
## 
## t test of coefficients:
## 
##                  Estimate Std. Error t value  Pr(>|t|)    
## SH.STA.BASS.ZS -64.988270   8.470151 -7.6726 4.949e-12 ***
## SP.DYN.CDRT.IN  -0.079227   0.165327 -0.4792   0.63266    
## SP.DYN.TFRT.IN  -1.321979   0.187212 -7.0614 1.159e-10 ***
## SP.POP.BRTH.MF -14.016858   8.296932 -1.6894   0.09374 .  
## SP.POP.GROW      0.114461   0.084873  1.3486   0.18000    
## SH.ANM.NPRG.ZS   0.573214   0.104616  5.4792 2.395e-07 ***
## SH.IMM.IDPT      0.072814   0.918052  0.0793   0.93692    
## SH.IMM.MEAS      0.835217   0.187332  4.4585 1.869e-05 ***
## SH.MED.NUMW.P3  -0.885964   0.194503 -4.5550 1.267e-05 ***
## SH.MED.PHYS.ZS   0.964027   0.075903 12.7007 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5.2 K-Means Clustering for Aging Heterogeneity Grouping

Average birth rate and death rate per country are aggregated as clustering input features; K-means unsupervised clustering splits all nations into two disjoint high/low aging categories to finish the required classification research objective.

Sample distribution: K-means splits total countries into two clusters: 110 nations in high-aging group and 129 nations in low-aging group.

Standardized cluster centroids: High-aging cluster has negative standardized centroids (\(\text{avg\_tfr}=-0.923\), \(\text{avg\_mort}=-0.889\)), representing low fertility and low mortality; low-aging cluster shows positive centroid values (\(\text{avg\_tfr}=0.786\), \(\text{avg\_mort}=0.758\)), corresponding to high fertility and high mortality.
Original raw-group average: In original unstandardized metrics, high-aging countries record average fertility of 1.777 and average mortality of 5.476; low-aging countries have average fertility at 1.189 and mortality at 4.580.

# Calculate country average fertility & average mortality per nation
cluster_df <- df_wide_final %>%
  group_by(country_code) %>%
  summarise(
    avg_tfr = mean(SP.DYN.TFRT.IN, na.rm = TRUE),
    avg_mort = mean(SP.DYN.AMRT.FE, na.rm = TRUE),
    .groups = "drop"
  )

#  Standardize indicators to eliminate dimension interference
scaled_data <- scale(cluster_df[,c("avg_tfr","avg_mort")])

# K-Means clustering, fixed k=2 (High / Low aging)
set.seed(123) # fixed random seed for reproducibility
kmod <- kmeans(scaled_data, centers = 2, nstart = 25)

# Add cluster label: Low aging (high fert+high mort), High aging (low fert+low mort)
cluster_df$aging_group <- ifelse(kmod$cluster == 1, "Low_Aging", "High_Aging")

# Merge grouping label back into original full panel dataset
df_wide_final <- left_join(df_wide_final, cluster_df[,c("country_code","aging_group")], by = "country_code")


# Country count per cluster
cnt_tb <- table(kmod$cluster) %>% enframe(name = "Cluster_ID", value = "Country_Number")

# Standardized cluster centroid
cent_std <- as_tibble(kmod$centers, rownames = "Cluster_ID")

# Raw original mean by group
raw_mean <- aggregate(cluster_df[,c("avg_tfr","avg_mort")],
                      by=list(Group=cluster_df$aging_group), FUN=mean)

# Print neatly formatted tables
cat("===== Cluster Sample Size =====\n")

## ===== Cluster Sample Size =====

print(cnt_tb, row.names = F)

## # A tibble: 2 × 2
##   Cluster_ID Country_Number
##   <chr>      <table[1d]>   
## 1 1          110           
## 2 2          129

cat("\n===== Standardized Cluster Centroid =====\n")

## 
## ===== Standardized Cluster Centroid =====

print(cent_std, row.names = F)

## # A tibble: 2 × 3
##   Cluster_ID avg_tfr avg_mort
##   <chr>        <dbl>    <dbl>
## 1 1           -0.922   -0.889
## 2 2            0.786    0.758

cat("\n===== Raw Indicator Average by Aging Group =====\n")

## 
## ===== Raw Indicator Average by Aging Group =====

print(raw_mean, row.names = F)

##       Group  avg_tfr avg_mort
##  High_Aging 1.776704 5.476458
##   Low_Aging 1.188919 4.579725

6. Model Evaluation & Robustness Checks

Comprehensive validation includes subgroup regression comparison, random forest feature importance verification for regression robustness, elbow/silhouette clustering quality metrics, manual benchmark grouping and random forest classification accuracy test to confirm reliability of both regression and classification outputs.

6.1 Fixed Effects Model Performance

Note: Only two commonly statistically significant explanatory variables are retained for cross-group heterogeneous comparison: total fertility rate (SP.DYN.TFRT.IN) and physician density (SH.MED.PHYS.ZS).

Income-group results

Total fertility rate shows significantly negative coefficients at the 1% level across all four income groups. The absolute value of coefficients rises alongside income: Low income (−0.883***) < Lower middle income (−0.955***) < Upper middle income (−0.936***) < High income (−1.218***). Reduced fertility generates a stronger ageing-promoting effect in higher-income economies.
Physician density is significantly positive at the 1% level for all income tiers, but its marginal effect on population ageing declines continuously with income growth: Low income (0.685***) > Lower middle income (0.135***) > Upper middle income (0.163***) > High income (0.089***). Medical resource expansion exerts the largest ageing-boosting influence within low-income countries.
All income-group models pass overall F significance test at the 1% threshold (all F p-values = 0), and within R-squared ranges between 0.8612 and 0.9095, reflecting excellent model explanatory power.

Regional results

Fertility decline significantly accelerates population ageing at the 1% level in most regions; only South Asia has an insignificant fertility coefficient (−0.065). North America (−1.322***) and Sub-Saharan Africa (−1.305***) feature the strongest ageing effect from falling birth rates, whereas East Asia & Pacific (−0.804***) has the weakest magnitude.
Physician density positively raises elderly share significantly in all regions except South Asia (insignificant estimate). The effect size ranks: North America (0.964***) > Middle East & North Africa (0.401***) > East Asia & Pacific (0.342***) > Sub-Saharan Africa (0.332***) > Latin America & Caribbean (0.254***) > Europe & Central Asia (0.147***).
All regional models are jointly statistically significant at the 1% level (all F-test p-values = 0), with within-R-squared ranging from 0.8802 to 0.9788 across geographic subgroups.

Summary

Falling fertility is a dominant universal driver of population ageing across most economies, and its marginal ageing-promoting impact strengthens as national income improves.
Improved physician stock mainly fuels population ageing in less developed low-income countries, while its marginal contribution diminishes markedly in advanced high-income economies.
South Asia is a special outlier: neither fertility nor physician availability exerts statistically significant influence on domestic population ageing.

All heterogeneous subgroup regression results are statistically robust with high within-model goodness-of-fit.

# ==============================================================================
# Build Coefficient Comparison Table Using Pre-estimated Subgroup FE Models
# Construct heterogeneous fixed-effect comparison table directly from pre-saved model list
# Output: formatted dataframe containing core coefficients(with significance stars), within-R² and F-stat p-value
# ==============================================================================
build_compare_table <- function(model_list) {
  # Return empty skeleton table if no valid subgroup models exist
  if(length(model_list)==0){
    var_order <- c("SP.DYN.TFRT.IN", "SH.MED.PHYS.ZS", "R-squared (Within)", "P-value (F-Stat)")
    return(data.frame(row.names = var_order))
  }
  
  # Remove "Unknown" category, exclude invalid undefined grouping column
  valid_mod_list <- model_list[!names(model_list)=="Unknown"]
  if(length(valid_mod_list)==0){
    var_order <- c("SP.DYN.TFRT.IN", "SH.MED.PHYS.ZS", "R-squared (Within)", "P-value (F-Stat)")
    return(data.frame(row.names = var_order))
  }
  
  # Extract raw subgroup names and clean redundant linebreak / tab / extra whitespace
  raw_grp <- names(valid_mod_list)
  clean_name <- purrr::map_chr(raw_grp, ~stringr::str_squish(stringr::str_replace_all(.x,c("\\n"="","\\r"="","\\t"=""))))
  
  # Define fixed row labels of output table
  var_order <- c("SP.DYN.TFRT.IN", "SH.MED.PHYS.ZS", "R-squared (Within)", "P-value (F-Stat)")
  # Initialize blank character table filled with empty string
  out_tb <- matrix("", nrow=length(var_order), ncol=length(clean_name), dimnames = list(var_order, clean_name)) %>% as.data.frame()
  
  # Inner helper: generate statistical significance stars based on p-value thresholds
  get_stars <- function(p){
    if(p<0.01) return("***")
    if(p<0.05) return("**")
    if(p<0.1) return("*")
    return("")
  }
  
  # Loop over precomputed models to fill table content cell by cell
  for(i in seq_along(raw_grp)){
    grp_raw <- raw_grp[i]
    grp_cln <- clean_name[i]
    mod <- valid_mod_list[[grp_raw]]
    sum_mod <- summary(mod)
    
   
    if(!is.null(sum_mod$r.squared) && "rsq" %in% names(sum_mod$r.squared)){
      r2 <- round(as.numeric(sum_mod$r.squared["rsq"]),4)
      out_tb["R-squared (Within)", grp_cln] <- as.character(r2)
    }else{
      out_tb["R-squared (Within)", grp_cln] <- ""
    }
    
    if(!is.null(sum_mod$fstatistic) && inherits(sum_mod$fstatistic,"htest")){
      fp <- round(as.numeric(sum_mod$fstatistic$p.value),4)
      out_tb["P-value (F-Stat)", grp_cln] <- as.character(fp)
    }else{
      out_tb["P-value (F-Stat)", grp_cln] <- ""
    }
    
    # Extract coefficient matrix (remove de-meaned intercept row)
    coef_mat <- sum_mod$coefficients[-1,,drop=F]
    all_v <- rownames(coef_mat)
    all_b <- coef_mat[,1]
    all_p <- coef_mat[,4]
    
    # Fill target two core explanatory variables with coefficient + significance mark
    for(v in c("SP.DYN.TFRT.IN","SH.MED.PHYS.ZS")){
      if(v %in% all_v){
        idx <- which(all_v==v)
        s <- get_stars(all_p[idx])
        out_tb[v,grp_cln] <- paste0(round(all_b[idx],3),s)
      }else{
        out_tb[v,grp_cln] <- ""
      }
    }
  }
  return(out_tb)
}

# ============================== Main Execution ==============================
# Directly reuse pre-estimated subgroup fixed-effect model list (no refitting, no repeated VIF screening)
income_comparison_df <- build_compare_table(income_fe_list)
region_comparison_df <- build_compare_table(region_fe_list)

# Print formatted cross-group comparison tables only
cat("======================= Table 1: Fixed Effects Across Income Groups =======================\n")

## ======================= Table 1: Fixed Effects Across Income Groups =======================

if(ncol(income_comparison_df)==0){
  cat("Warning: No valid regression result for income subgroups (all sample size <10)\n")
}else{
  print(income_comparison_df)
}

##                    Low income Lower middle income Upper middle income
## SP.DYN.TFRT.IN      -0.883***           -0.955***           -0.936***
## SH.MED.PHYS.ZS       0.685***            0.135***            0.163***
## R-squared (Within)     0.9046              0.9095              0.9085
## P-value (F-Stat)            0                   0                   0
##                    High income
## SP.DYN.TFRT.IN       -1.218***
## SH.MED.PHYS.ZS        0.089***
## R-squared (Within)      0.8612
## P-value (F-Stat)             0

cat("\n============================================= Table 2: Fixed Effects Across Regions ==========================================\n")

## 
## ============================================= Table 2: Fixed Effects Across Regions ==========================================

if(ncol(region_comparison_df)==0){
  cat("Warning: No valid regression result for regional subgroups (all sample size <10)\n")
}else{
  print(region_comparison_df)
}

##                    Middle East & North Africa Sub-Saharan Africa
## SP.DYN.TFRT.IN                       -0.98***          -1.305***
## SH.MED.PHYS.ZS                       0.401***           0.332***
## R-squared (Within)                     0.9158              0.907
## P-value (F-Stat)                            0                  0
##                    Europe & Central Asia Latin America & Caribbean
## SP.DYN.TFRT.IN                 -1.119***                          
## SH.MED.PHYS.ZS                  0.147***                  0.254***
## R-squared (Within)                0.8802                    0.9085
## P-value (F-Stat)                       0                         0
##                    East Asia & Pacific South Asia North America
## SP.DYN.TFRT.IN               -0.804***                -1.322***
## SH.MED.PHYS.ZS                0.342***     -0.065      0.964***
## R-squared (Within)              0.8826     0.9763        0.9788
## P-value (F-Stat)                     0          0             0

6.2 Drivers of Population Aging

6.2.1 Significant Variables in the Full Sample

The coefficient plot reveals key drivers of population aging in the final fixed effects specification:

Positive and Significant Drivers

SP.DYN.TO65.FE.ZS (Female life expectancy at age 65) has the largest positive coefficient, indicating that longer life expectancy among the elderly strongly accelerates population aging. Other positive factors include access to improved water sources (SH.H2O.BASW.ZS), private health spending (SH.XPD.PVTD.CH.ZS), and crude death rate (SP.DYN.CDRT.IN). These results reflect how better public health infrastructure and higher mortality at younger ages both contribute to a larger share of the elderly population.
Negative and Significant Drivers

SP.DYN.TFRT.IN (Total fertility rate) shows the strongest negative coefficient, confirming that higher fertility rates slow population aging by increasing the share of younger cohorts. Additional negative factors include female adult mortality (SP.DYN.AMRT.MA), out-of-pocket health spending (SH.XPD.OOPC.CH.ZS), and population growth (SP.POP.GROW). These variables reflect demographic and economic conditions that counteract aging pressures.

Key Takeaway

The model confirms that population aging is driven by a combination of: - Life expectancy improvements and public health progress (positive effects) - Fertility and population growth dynamics (negative effects)

These results align with demographic theory, showing that both rising longevity and falling fertility are central to the aging process.

# ==============================================================================
# Step 1: Prepare plotting dataset, add significance stars to variable names
# Use screened significant sample (p < 0.05) from regression output
# Arrange data by coefficient ascending, set factor levels to control y-axis display order
# ==============================================================================
coef_plot_df <- sig_df %>%
  mutate(Variable = paste0(Variable, " ", substr(Significance,1,3))) %>%
  # Sort data by coefficient value from small to large
  arrange(Coefficient)
# Fix y-axis order: smallest coefficient at top, largest coefficient at bottom
coef_plot_df$Variable <- factor(coef_plot_df$Variable, levels = coef_plot_df$Variable)

# ==============================================================================
# Step 2: Generate horizontal coefficient bar chart with gradient color
# Color gradient maps to coefficient value, no extra legend displayed
# ==============================================================================
ggplot(coef_plot_df, aes(x = Coefficient, y = Variable, fill = Coefficient)) +
  geom_col() +
  geom_vline(xintercept = 0, linetype = "dashed", alpha = 0.7) +
  labs(
    title = "Coefficients of Significant Variables in Final Fixed Effects Model",
    x = "Coefficient Value",
    y = "Variable"
  ) +
  scale_fill_viridis_c(option = "viridis") +
  theme_bw() +
  theme(
    plot.title = element_text(size = 13, hjust = 0.5),
    axis.title = element_text(size = 12),
    legend.position = "none"
  )

6.2.2 Significant Variable Frequency by Income Group

The frequency bar chart quantifies how many income subgroups each explanatory variable shows statistical significance, demonstrating universal and group-specific aging determinants across development tiers.

1. Universal, cross-group significant drivers

SP.DYN.TFRT.IN (total fertility rate) and SH.MED.PHYS.ZS (physician density) achieve statistical significance across all five income subgroups, standing as the most robust consistent predictors of population aging. A batch of demographic and healthcare indicators obtains significance within four or three income groups, showing relatively stable predictive effects. The remaining majority of variables are only significant in merely one or two income brackets, with limited general applicability.

2. Heterogeneous unique drivers

for separate income cohorts Several exclusive significant predictors exist for specific income categories, reflecting differentiated aging formation paths:

Low-income nations: Two unique significant variables related to government and private medical spending.
Lower-middle-income nations: Distinct determinants including mortality rate, out-of-pocket medical expense and female senior life expectancy.
High-income nations: Three exclusive significant indicators covering household healthcare expenditure and male life expectancy at age 65.
Unknown and upper-middle-income subgroups contain no group-only significant variables.

Overall Conclusion

Empirical results verify that fertility and physician resource act as steady core drivers of aging for all economies, whereas plenty of influencing factors only take effect under specific income levels.

# ==============================================================================
# Step: Extract significant variables from existing fitted income models to build income_sig_list, NO re-estimation
# ==============================================================================
income_sig_list <- list()
gr_names <- names(income_fe_list)

for(g in gr_names){
  robust_res <- coeftest(income_fe_list[[g]], vcov = vcovSCC(income_fe_list[[g]]))
  p_vals <- robust_res[,4]
  sig_vars <- names(p_vals[p_vals < 0.05])
  income_sig_list[[g]] <- sig_vars
}

# ==============================================================================
# Calculate occurrence frequency of each significant variable across income groups & plot bar chart
# Use pre-saved income_sig_list from previous subgroup regression directly, no repeated modeling
# ==============================================================================
all_sig_vars <- unlist(income_sig_list)
if (length(all_sig_vars) > 0) {
  freq_tb <- as.data.frame(table(all_sig_vars))
  colnames(freq_tb) <- c("Variable", "Sig_Group_Count")
  
  freq_tb <- freq_tb %>%
    arrange(desc(Sig_Group_Count)) %>%
    mutate(Variable = factor(Variable, levels = Variable))
  
  ggplot(freq_tb, aes(x = Variable, y = Sig_Group_Count, fill = Sig_Group_Count)) +
    geom_col() +
    scale_fill_viridis_c(option = "viridis") +
    labs(
      title = "Frequency of Significant Variables Across Income Groups",
      x = "Indicator Variable",
      y = "Number of Income Groups Where Significant"
    ) +
    theme_bw() +
    theme(
      plot.title = element_text(size = 14, hjust = 0.5),
      axis.title = element_text(size = 12),
      axis.text.x = element_text(angle = 90, vjust = 0.5),
      legend.position = "none"
    )
}

# ==============================================================================
# Extract exclusive unique significant variables per single income group 
# Unique vars = variables significant only in current group, never significant in other income groups
# ==============================================================================
unique_sig_list <- list()
group_names <- names(income_sig_list)

for (g in group_names) {
  current_vars <- unique(income_sig_list[[g]])
  other_all_vars <- c()
  # Combine all significant variables from remaining other income groups
  for (other_g in group_names) {
    if (other_g != g) other_all_vars <- c(other_all_vars, income_sig_list[[other_g]])
  }
  other_all_vars <- unique(other_all_vars)
  # Set difference to get exclusive variables
  exclusive_var <- setdiff(current_vars, other_all_vars)
  unique_sig_list[[g]] <- exclusive_var
}

# ==============================================================================
# Print unique significant variables for each income group on console
# ==============================================================================
cat("\n===== Unique Significant Variables by Income Group =====\n\n")

## 
## ===== Unique Significant Variables by Income Group =====

for (g in names(unique_sig_list)) {
  cat(paste0("Income Group: ", g, "\n"))
  tmp_var <- unique_sig_list[[g]]
  if (length(tmp_var) > 0) {
    for (v in sort(tmp_var)) cat(paste0("  - ", v, "\n"))
  } else {
    cat("  - (No unique significant variables found)\n")
  }
  cat("\n")
}

## Income Group: Unknown
##   - (No unique significant variables found)
## 
## Income Group: Low income
##   - SH.XPD.GHED.GD.ZS
##   - SH.XPD.PVTD.PC.CD
## 
## Income Group: Lower middle income
##   - SH.DTH.MORT
##   - SH.XPD.OOPC.PC.CD
##   - SP.DYN.TO65.FE.ZS
## 
## Income Group: Upper middle income
##   - (No unique significant variables found)
## 
## Income Group: High income
##   - SH.XPD.OOPC.CH.ZS
##   - SH.XPD.OOPC.PP.CD
##   - SP.DYN.TO65.MA.ZS

6.2.3 Significant Variable Frequency by Region

The bar chart counts how many geographic regions each predictor achieves statistical significance, identifying universally valid and region-specific aging determinants across global areas.

1. Consistent Drivers Across Multiple Regions

SH.IMM.MEAS (immunization rate) is statistically significant across all eight regional subgroups, representing the most globally stable aging factor. SH.MED.PHYS.ZS (physician density) and birth-related indicator SP.POP.BRTH.MF are significant in six regions, showing robust cross-region explanatory power. Many demographic and health indicators gain significance in four or five regions, while the remaining variables only show significance within one to three geographic zones.

2. Heterogeneous, Region-Specific Drivers

Distinct exclusive significant predictors exist for five specific geographic regions, while no unique influential factors are found for Unknown, South Asia and North America:

Middle East & North Africa: Mortality and immunization-related unique variables.
Sub-Saharan Africa: Infant mortality and female adult mortality indicators.
Europe & Central Asia: Maternal mortality-linked exclusive factors.
Latin America & Caribbean: Sanitation-access-specific variable.
East Asia & Pacific: Per capita healthcare spending indicator.

Overall Conclusion

Core medical and demographic indicators maintain stable explanatory power across most global regions, yet multiple determinants only exert significant effects under certain regional environments. Such cross-region divergence supports the adoption of the globally consistent core variables for subsequent regional heterogeneous fixed-effect regression.

# Build region_sig_list from fitted models inside region_fe_list
region_sig_list <- list()
group_names <- names(region_fe_list)
for(g in group_names){
  mod <- region_fe_list[[g]]
  rt <- coeftest(mod, vcov = vcovSCC(mod))
  pvec <- rt[,4]
  sig_vars <- names(pvec[pvec < 0.05])
  region_sig_list[[g]] <- sig_vars
}

# ==============================================================================
#  Count significant variable frequency across regions & plot bar chart
# Use existing stored region_sig_list directly
# ==============================================================================
all_sig_vars_region <- unlist(region_sig_list)
if(length(all_sig_vars_region) > 0){
  freq_tb <- as.data.frame(table(all_sig_vars_region))
  colnames(freq_tb) <- c("Variable","Sig_Region_Count")
  
  freq_tb <- freq_tb %>%
    arrange(desc(Sig_Region_Count)) %>%
    mutate(Variable = factor(Variable, levels = Variable))
  
  ggplot(freq_tb, aes(x = Variable, y = Sig_Region_Count, fill = Sig_Region_Count)) +
    geom_col() +
    scale_fill_viridis_c(option = "viridis") +
    labs(
      title = "Frequency of Significant Variables Across Region Groups",
      x = "Indicator Variable",
      y = "Number of Regions Where Significant"
    ) +
    theme_bw() +
    theme(
      plot.title = element_text(size = 14, hjust = 0.5),
      axis.title = element_text(size = 12),
      axis.text.x = element_text(angle = 90, vjust = 0.5),
      legend.position = "none"
    )
}

# ==============================================================================
# Extract exclusive unique significant variables for each single region
# ==============================================================================
unique_region_sig <- list()
region_names <- names(region_sig_list)

for(g in region_names){
  current <- unique(region_sig_list[[g]])
  other_all <- c()
  for(other_g in region_names){
    if(other_g != g) other_all <- c(other_all, region_sig_list[[other_g]])
  }
  other_all <- unique(other_all)
  unique_region_sig[[g]] <- setdiff(current, other_all)
}

# ==============================================================================
# Console output unique significant variables per region
# ==============================================================================
cat("\n===== Unique Significant Variables by Region =====\n\n")

## 
## ===== Unique Significant Variables by Region =====

for(g in names(unique_region_sig)){
  cat(paste0("Region: ",g,"\n"))
  tmp <- unique_region_sig[[g]]
  if(length(tmp)>0){
    for(v in sort(tmp)) cat(paste0("  - ",v,"\n"))
  }else{
    cat("  - (No unique significant variables found)\n")
  }
  cat("\n")
}

## Region: Unknown
##   - (No unique significant variables found)
## 
## Region: Middle East & North Africa
##   - SH.DTH.MORT
##   - SH.IMM.IDPT
## 
## Region: Sub-Saharan Africa
##   - SH.DTH.IMRT
##   - SP.DYN.AMRT.FE
## 
## Region: Europe & Central Asia
##   - SH.MMR.DTHS
##   - SH.STA.MMRT
## 
## Region: Latin America & Caribbean
##   - SH.STA.ODFC.ZS
## 
## Region: East Asia & Pacific
##   - SH.XPD.CHEX.PC.CD
## 
## Region: South Asia
##   - (No unique significant variables found)
## 
## Region: North America
##   - (No unique significant variables found)

6.3 Robustness Check: Random Forest Feature Importance

To verify the robustness of baseline fixed-effects regression outcomes, Random Forest is adopted for data-driven feature importance ranking as an alternative robustness test, realizing cross-verification between traditional panel econometrics and machine learning identification.

The feature importance ranking is highly consistent with prior fixed-effect regression conclusions, verifying the reliability of core empirical findings:

Mortality metrics dominate the top-three positions of feature importance, with non-communicable disease mortality (SH.DTH.NMRT) taking the highest normalized importance, matching the significantly negative coefficient of mortality indicators in fixed-effects estimation. Infant mortality and general mortality also show prominent predictive weight, consistent with their statistically significant impacts on population aging in benchmark regression.
Maternal mortality indicator (SH.MMR.DTHS) ranks fourth with remarkable explanatory power, aligning with its significant regression coefficient in FE models.
Core demographic factors including total fertility rate and crude birth rate, as well as physician density and healthcare spending indicators, appear in the Top10 importance list, consistent with their significant regression effects in fixed-effects results.

The strong consistency of core influential variables across two distinct estimation approaches confirms that the identified aging determinants are robust and not sensitive to empirical specification choices; falling mortality, maternal health improvement and demographic shifts constitute the dominant drivers of cross-country population aging.

# Set a fixed random seed for full reproducibility of the Random Forest model
set.seed(42)

# Train the Random Forest regression model
# ntree=200: total number of decision trees in the ensemble
# maxdepth=6: maximum depth of each individual tree to prevent overfitting
# mtry: number of predictors randomly sampled at each split (standard: sqrt(total predictors))
# nodesize=10: minimum number of observations required at a leaf node
rf_mod <- randomForest(
  x = X_imputed,
  y = y_raw,
  ntree = 200,
  maxdepth = 6,
  mtry = sqrt(ncol(X_imputed)),
  nodesize = 10
)

# Extract variable importance scores and normalize them to proportions (sum = 1)
imp_df <- tibble(
  Indicator = colnames(X_imputed),       # Names of all predictors
  Importance = importance(rf_mod)[, 1]   # Raw importance from the model
) %>%
  mutate(Importance = Importance / sum(Importance)) %>%  # Normalize to relative importance
  arrange(desc(Importance))                             # Sort variables by importance (descending)

# Plot the top 10 most important features with labels
ggplot(imp_df[1:10,], aes(x = Importance, y = reorder(Indicator, Importance))) +
  geom_col(fill = "#3498DB") +
  # Add numeric labels next to each bar for direct readability
  geom_text(aes(label = round(Importance, 3)), 
            hjust = -0.1,    # Position labels just outside the bars
            color = "black",
            size = 3.5) +
  labs(
    title = "Top 10 Feature Importance | Random Forest",
    x = "Normalized Importance",
    y = "Indicator"
  ) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  # Expand the right margin to ensure labels are not cut off
  scale_x_continuous(expand = expansion(mult = c(0, 0.1)))

6.4 K-Means Clustering Performance

Elbow method and silhouette coefficient evaluate optimal cluster number and internal clustering quality; median birth rate manual grouping and random forest out-of-sample classification accuracy test externally validate clustering reliability for country aging grouping.

6.4.1 Elbow criterion to select optimal cluster number

The elbow plot shows the total within-cluster sum of squares drops sharply at K=2 and then decreases moderately with rising cluster count, which identifies K=2 as the optimal clustering number for subsequent country aging segmentation.

library(factoextra)
scale_feature <- cluster_df %>%
  select(avg_tfr, avg_mort) %>%
  scale() %>% as.data.frame()

set.seed(42)
fviz_nbclust(scale_feature, kmeans, method = "wss") +
  labs(title = "Elbow Method for Optimal K Selection",
       x = "Number of Clusters K", y = "Total Within Sum of Squares") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

6.4.2 Silhouette analysis for internal clustering quality

The average silhouette coefficient reaches 0.568, above the acceptable threshold of 0.5, which proves good overall clustering performance.
No negative silhouette values appear for either cluster, indicating zero misallocated countries and clear boundary between the two aging subgroups.
Combined with the elbow method outcome, the silhouette test strongly validates that \(K=2\) is the optimal grouping for high-aging and low-aging economies.

kmeans_model <- kmeans(scale_feature, centers = 2, nstart = 25)

sil_obj <- silhouette(kmeans_model$cluster, dist(scale_feature))
avg_sil_score <- round(mean(sil_obj[,3]),3)
cat("Average Silhouette Coefficient of K-Means clustering: ", avg_sil_score, "\n")

## Average Silhouette Coefficient of K-Means clustering:  0.568

fviz_silhouette(sil_obj, print.summary = FALSE) +
  labs(title = "Silhouette Plot for Two Aging Country Groups") + theme_bw()+
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_blank())

6.4.3 Descriptive statistics to verify economic rationality of clustering result

Benchmarked against empirical grouping split by median crude birth rate(CBR), the model obtains an Adjusted Rand Index of 0.51, indicating moderate consistency between algorithmic clustering and demographic theoretical classification. Descriptive statistics align perfectly with demographic transition theory: the High-Aging group includes 151 economies with lower average CBR(3.02) and CDR(2.21), while the Low-Aging group covers 88 nations featuring higher birth and death rates (CBR=3.68, CDR=2.62).

# Calculate country-level average CBR & CDR first
cluster_data <- df_wide_final %>%
  group_by(country_code) %>%
  summarise(
    avg_cbr = mean(SP.DYN.CBRT.IN, na.rm = TRUE),
    avg_cdr = mean(SP.DYN.CDRT.IN, na.rm = TRUE),
    .groups = "drop"
  )

# Standardize for kmeans (consistent with previous scale_feature)
scale_feature <- cluster_data %>% select(avg_cbr, avg_cdr) %>% scale() %>% as.data.frame()
set.seed(42)
kmeans_model <- kmeans(scale_feature, centers = 2, nstart = 25)

# Attach cluster label into dataset
cluster_data$ml_aging_group <- ifelse(kmeans_model$cluster == 1, "Low_Aging_ML", "High_Aging_ML")

# Generate group descriptive table
group_desc <- cluster_data %>%
  group_by(ml_aging_group) %>%
  summarise(
    Country_Number = n(),
    Mean_CBR = round(mean(avg_cbr), 3),
    Mean_CDR = round(mean(avg_cdr), 3),
    .groups = "drop"
  )

# Output result
cat("\n=== Group descriptive statistics after K-Means clustering ===\n")

## 
## === Group descriptive statistics after K-Means clustering ===

print(group_desc)

## # A tibble: 2 × 4
##   ml_aging_group Country_Number Mean_CBR Mean_CDR
##   <chr>                   <int>    <dbl>    <dbl>
## 1 High_Aging_ML              88     3.68     2.62
## 2 Low_Aging_ML              151     3.02     2.21

6.4.4 Robustness test: Random Forest classification accuracy

The out-of-sample prediction accuracy of random forest classification hits 100% when using CBR and CDR to predict clustering labels. It confirms crude birth and death rates can fully distinguish the two aging clusters, further verifying the rationality and robustness of the K-Means grouping result.

# Fix random seed for fully reproducible train/test split result
set.seed(42)

# Replace undefined eval_full with cluster_data generated from previous K-Means clustering
eval_full <- cluster_data

# Convert grouping label to factor for random forest classification task
eval_full$ml_aging_group <- factor(eval_full$ml_aging_group)

# Split dataset: 70% training set, 30% hold-out test set
train_index <- sample(1:nrow(eval_full), size = floor(0.7 * nrow(eval_full)))
train_set   <- eval_full[train_index, ]
test_set    <- eval_full[-train_index, ]

# Train Random Forest binary classification model with CBR & CDR as predictors
# ntree=150: set total number of decision trees within random forest ensemble
rf_class_model <- randomForest(
  formula = ml_aging_group ~ avg_cbr + avg_cdr,
  data    = train_set,
  ntree   = 150
)

# Generate out-of-sample prediction on unseen test dataset
pred_outcome <- predict(rf_class_model, newdata = test_set)

# Calculate overall out-of-sample classification accuracy in percentage
class_accuracy <- round(mean(pred_outcome == test_set$ml_aging_group) * 100, 2)
cat("\nRandom Forest Out-of-Sample Classification Accuracy (%): ", class_accuracy, "%\n")

## 
## Random Forest Out-of-Sample Classification Accuracy (%):  100 %

# Output country codes grouped by two K-means clustering categories
cat("\n=== List of High-Aging Countries (K-Means Cluster Label) ===\n")

## 
## === List of High-Aging Countries (K-Means Cluster Label) ===

print(eval_full %>% filter(ml_aging_group == "High_Aging_ML") %>% pull(country_code))

##  [1] "AFE" "AFG" "AFW" "AGO" "ARB" "BDI" "BEN" "BFA" "BGD" "BOL" "BTN" "BWA"
## [13] "CAF" "CIV" "CMR" "COD" "COG" "COM" "DJI" "EAR" "EGY" "ERI" "ETH" "FCS"
## [25] "GAB" "GHA" "GIN" "GMB" "GNB" "GNQ" "GTM" "HPC" "HTI" "IDA" "IDB" "IDX"
## [37] "IND" "KEN" "KHM" "KIR" "LAO" "LBR" "LDC" "LIC" "LMC" "LSO" "MDG" "MEA"
## [49] "MHL" "MLI" "MMR" "MNA" "MNG" "MOZ" "MRT" "MWI" "NAM" "NER" "NGA" "NPL"
## [61] "PAK" "PNG" "PRE" "RWA" "SAS" "SDN" "SEN" "SLE" "SOM" "SSA" "SSD" "SSF"
## [73] "STP" "SWZ" "TCD" "TGO" "TJK" "TLS" "TMN" "TSA" "TSS" "TUV" "TZA" "UGA"
## [85] "YEM" "ZAF" "ZMB" "ZWE"

cat("\n=== List of Low-Aging Countries (K-Means Cluster Label) ===\n")

## 
## === List of Low-Aging Countries (K-Means Cluster Label) ===

print(eval_full %>% filter(ml_aging_group == "Low_Aging_ML") %>% pull(country_code))

##   [1] "ALB" "AND" "ARE" "ARG" "ARM" "ATG" "AUS" "AUT" "AZE" "BEL" "BGR" "BHR"
##  [13] "BHS" "BIH" "BLR" "BLZ" "BRA" "BRB" "BRN" "CAN" "CEB" "CHE" "CHL" "CHN"
##  [25] "COL" "CPV" "CRI" "CSS" "CUB" "CYP" "CZE" "DEU" "DMA" "DNK" "DOM" "DZA"
##  [37] "EAP" "EAS" "ECA" "ECS" "ECU" "EMU" "ESP" "EST" "EUU" "FIN" "FJI" "FRA"
##  [49] "FSM" "GBR" "GEO" "GRC" "GRD" "GUY" "HIC" "HND" "HRV" "HUN" "IBD" "IBT"
##  [61] "IDN" "IRL" "IRN" "IRQ" "ISL" "ISR" "ITA" "JAM" "JOR" "JPN" "KAZ" "KGZ"
##  [73] "KNA" "KOR" "KWT" "LAC" "LBN" "LBY" "LCA" "LCN" "LKA" "LMY" "LTE" "LTU"
##  [85] "LUX" "LVA" "MAR" "MCO" "MDA" "MDV" "MEX" "MIC" "MKD" "MLT" "MNE" "MUS"
##  [97] "MYS" "NAC" "NLD" "NOR" "NRU" "NZL" "OED" "OMN" "OSS" "PAN" "PER" "PHL"
## [109] "PLW" "POL" "PRT" "PRY" "PSE" "PSS" "PST" "QAT" "ROU" "RUS" "SAU" "SGP"
## [121] "SLB" "SLV" "SMR" "SRB" "SST" "SUR" "SVK" "SVN" "SWE" "SYC" "SYR" "TEA"
## [133] "TEC" "THA" "TKM" "TLA" "TON" "TTO" "TUN" "TUR" "UKR" "UMC" "URY" "USA"
## [145] "UZB" "VCT" "VEN" "VNM" "VUT" "WLD" "WSM"

7. Deployment

Population Aging Analysis Dashboard

This interactive dashboard is developed in R Shiny for the WQD7004 Programming for Data Science project. It provides empirical analysis of global population aging, identifies key socioeconomic and demographic drivers via fixed-effect panel regression, and enables scenario simulation and aging level classification based on historical country-level data from the World Bank.

Key Features

Fixed-effect panel regression prediction with Driscoll-Kraay robust standard errors
Country-level aging level classification via Random Forest (high/low aging group)
Interactive time-series trend visualization grouped by income level and geographic region
Dynamic coefficient summary table with significance indicators
Scenario simulation tool for aging ratio prediction using user-defined input values
Full-name mapping of World Bank indicators for clear interpretation of variable labels

Online Access

This dashboard is deployed on Posit ShinyApps.io, RStudio’s official Shiny hosting platform. No local R environment installation is required. You can directly access the full interactive functions via the link below:

👉Shiny Dashboard Link: https://wqd7004occ2g07.shinyapps.io/aging-da

Project Repository

All source codes, raw dataset and configuration files are fully uploaded to GitHub repository.

Link: https://github.com/huangzhh2001-ux/WQD7004-Deployment

Project File Overview (Stored in GitHub)

app.R: Main Shiny dashboard program, including UI definition, server logic, and interactive functions for regression prediction, classification, and visualization.
model_save.rds: Pre-trained fixed-effect panel regression and Random Forest classification models, plus processed dataset and variable mapping, saved for direct loading in the app.
deploy.R: Deployment configuration script to publish the Shiny app to Posit ShinyApps.io.
renv/ + renv.lock: R environment management files, ensuring package dependencies are consistent across environments.
.github/workflows/deploy.yml: GitHub Actions workflow file for automatic deployment triggers.

# ====================== Variable Full Name Mapping Dictionary ======================
# Indicator label
var_name_map <- list(
  "SH.H2O.BASW.ZS"     = "Access to Basic Drinking Water (% Population)",
  "SH.STA.BASS.ZS"     = "Basic Sanitation Service Access Rate",
  "SH.STA.DIAB.ZS"     = "Diabetes Prevalence among Adults",
  "SH.STA.ODFC.ZS"     = "Open Defecation Population Share",
  "SP.ADO.TFRT"        = "Adolescent Total Fertility Rate",
  "SP.DYN.CDRT.IN"     = "Crude Death Rate (per 1000)",
  "SP.DYN.TFRT.IN"     = "Total Fertility Rate",
  "SP.DYN.TO65.FE.ZS"  = "Female 65+ Survival Rate",
  "SP.POP.BRTH.MF"     = "Male/Female Birth Ratio",
  "SP.POP.GROW"        = "Annual Population Growth Rate",
  "SH.ANM.NPRG.ZS"     = "Prevalence of Anaemia in Pregnant Women",
  "SH.DYN.NMRT"        = "Neonatal Mortality Rate",
  "SH.IMM.IDPT"        = "DPT Immunization Coverage",
  "SH.IMM.MEAS"        = "Measles Immunization Coverage",
  "SH.MED.BEDS.ZS"     = "Hospital Beds per 1k Population",
  "SH.MED.NUMW.P3"     = "Nurses & Midwives Density",
  "SH.MED.PHYS.ZS"     = "Physicians per 1000 Population",
  "SH.MMR.DTHS"        = "Total Number of Maternal Deaths",
  "SH.MMR.RISK.ZS"     = "Lifetime Maternal Mortality Risk (%)",
  "SH.PRG.ANEM"        = "Pregnant Women Anaemia Prevalence Rate",
  "SH.XPD.CHEX.GD.ZS"  = "Current Health Expenditure (% GDP)",
  "SH.XPD.GHED.CH.ZS"  = "Government Health Spending per Capita",
  "SH.XPD.GHED.GD.ZS"  = "Government Health Expenditure (% GDP)",
  "SH.XPD.OOPC.CH.ZS"  = "Out-of-Pocket Health Cost per Capita",
  "SH.XPD.OOPC.PC.CD"  = "OOP Health Spending (% Private Income)",
  "SP.DYN.AMRT.MA"     = "Male Adult Mortality Rate",
  "SH.XPD.PVTD.CH.ZS"  = "Private Prepaid Health Spending per Capita"
)

# Assign core objects for Shiny storage
fe_final <- fe_model
keep_vars <- selected_vars

# Save all required objects into rds (overwrite old file)
save(fe_final, keep_vars, X_opt, rf_class_model, df_wide_final, var_name_map, file = "model_save.rds")

# Write the relevant code for app.R

# Load all packages
# library(shiny)
# library(bslib)
# library(fontawesome)
# library(plm)
# library(lmtest)
# library(sandwich)
# library(dplyr)
# library(ggplot2)
# library(purrr)
# library(stringr)
# 
# # Shiny UI Definition 
# ui <- bslib::page_fluid(
#   theme = bslib::bs_theme(
#     bootswatch = "flatly", 
#     primary = "#2E86AB",
#     secondary = "#2874A6",
#     success = "#1E8449",
#     info = "#2196F3"
#   ),
#   fontawesome::fa_html_dependency(), 
#   
#   titlePanel(tags$span(style="color:#2E86AB; font-weight:bold; font-size:26px;",
#                        "Global Population Aging Analysis | Regression + Classification Online APP")),
#   
#   tabsetPanel(
#     # Tab 1: FE
#     tabPanel(
#       tags$span(fontawesome::fa("chart-line"), " FE Regression Prediction | Aging Forecast"),
#       sidebarLayout(
#         sidebarPanel(
#           width = 3,
#           style = "background:#f7fbff; border-radius:8px; padding:15px;",
#           
#           h4(tags$span(fontawesome::fa("database")," Input Core Economic & Demographic Indicators"), style = "font-size:15px;"),
#           helpText("All indicators adopt log transformation consistent with original empirical regression setup"),
#           uiOutput("reg_input_box"),
#           br(),
#           actionButton(
#             inputId = "run_predict",
#             label = tags$span(fontawesome::fa("calculator"), " Start Aging Prediction"),
#             class = "btn-primary btn-lg w-100",
#             style = "background:#2874A6;"
#           )
#         ),
#         mainPanel(
#           width = 9,
#           style = "background:#fdfefe; border-radius:8px; padding:15px;",
#           conditionalPanel(
#             condition = "input.run_predict > 0",
#             h4(tags$span(fontawesome::fa("bullseye")," Prediction Result: Log-transformed 65+ Population Ratio")),
#             verbatimTextOutput("pred_out"),
#             br()
#           ),
#           h4(tags$span(fontawesome::fa("table-columns")," Core Model Coefficient & Significance Summary Table")),
#           tableOutput("coef_table")
#         )
#       )
#     ),
#     # Tab2 Random forest
#     tabPanel(
#       tags$span(fontawesome::fa("users"), " Aging Level Classification | High/Low Aging Country"),
#       sidebarLayout(
#         sidebarPanel(
#           width = 3,
#           style = "background:#f0fff4; border-radius:8px; padding:15px;",
#           h4(tags$span(fontawesome::fa("clipboard-check")," Input Demographic Core Indicators"), style = "font-size:15px;"),
#           numericInput(inputId = "cbr_in", label = "Crude Birth Rate", value = 3.0, min = 0, max = 5, step = 0.01),
#           numericInput(inputId = "cdr_in", label = "Crude Death Rate", value = 2.2, min = 0, max = 4, step = 0.01),
#           br(),
#           actionButton(
#             inputId = "run_classify",
#             label = tags$span(fontawesome::fa("brain"), " Classify Aging Level"),
#             class = "btn-success btn-lg w-100",
#             style = "background:#1E8449;"
#           ),
#           hr(),
#           tags$div(
#             tags$h5("Group Definition:"),
#             tags$p(tags$b("High-Aging Group:"), "Countries with elderly(65+) share above cluster cutoff"),
#             tags$p(tags$b("Low-Aging Group:"), "Countries with elderly(65+) share below cluster cutoff")
#           )
#         ),
#         mainPanel(
#           width = 9,
#           style = "background:#f8fff9; border-radius:8px; padding:15px;",
#           h3(tags$span(fontawesome::fa("flag-checkered")," Final Classification Outcome")),
#           
#           uiOutput("class_card"),
#           br(),
#           wellPanel(
#             h4(tags$span(fontawesome::fa("circle-info")," Model Description")),
#             tags$p("Prediction engine: Random Forest trained on country-level CBR & CDR."),
#             tags$p("Country aging labels are predefined via unsupervised K-Means clustering based on historical 65+ population ratio.")
#           )
#         )
#       )
#     ),
#     # Tab3 drawing
#     tabPanel(
#       tags$span(fontawesome::fa("chart-area"), " Global Aging Data Visualization"),
#       sidebarLayout(
#         sidebarPanel(
#           width = 3,
#           style = "background:#f0f8ff; border-radius:8px; padding:15px;",
#           selectInput(
#             inputId = "group_type",
#             label = tags$span(fontawesome::fa("layer-group")," Group By Category:"),
#             choices = c("Income Group" = "incomegroup", "Geographic Region" = "region")
#           ),
#           selectInput(
#             inputId = "plot_var",
#             label = tags$span(fontawesome::fa("map")," Select Target Indicator:"),
#             choices = c(
#               "Aging Ratio(65+)" = "SP.POP.65UP.TO",
#               "Crude Birth Rate" = "SP.DYN.CBRT.IN",
#               "Crude Death Rate" = "SP.DYN.CDRT.IN"
#             )
#           ),
#           br(),
#           actionButton(
#             inputId = "draw_plot",
#             label = tags$span(fontawesome::fa("wand-sparkles"), " Generate Visualization"),
#             class = "btn-info btn-lg w-100",
#             style = "background:#2196F3;"
#           )
#         ),
#         mainPanel(
#           width = 9,
#           style = "background:#f7fcff; border-radius:8px; padding:15px;",
#           plotOutput(outputId = "global_plot", height = "600px")
#         )
#       )
#     ),
#     # About
#     tabPanel(
#       tags$span(fontawesome::fa("circle-info"), " About This App"),
#       fluidRow(
#         column(width = 10, offset = 1,
#                style = "background:#f8f9fa; border-radius:10px; padding:30px; margin-top:20px;",
#                tags$h2(tags$span(fontawesome::fa("book-open")," About This Application"),style="color:#2E86AB;"),
#                tags$hr(),
#                tags$p(tags$b("Course:"), " WQD7004 — Programming for Data Science"),
#                tags$p(tags$b("Dataset:"), " World Bank Open Data (1960–2025)"),
#                tags$p(tags$b("Core Models:"), " Fixed-Effect Panel Regression + K-Means Clustering + Random Forest Classification"),
#                tags$br(),
#                
#                tags$h3("Purpose"),
#                tags$p("This dashboard supports scenario simulation and empirical analysis on global population ageing, exploring how health, medical resource, fertility and socioeconomic indicators affect national elderly population (65+) proportion."),
#                tags$br(),
#                
#                tags$h3("Methodology"),
#                tags$ul(
#                  tags$li("Panel Fixed-Effect regression with Driscoll-Kraay robust standard error for coefficient estimation"),
#                  tags$li("Iterative VIF screening (VIF<10) to eliminate severe multicollinearity and filter valid explanatory variables"),
#                  tags$li("Unsupervised K-Means clustering to split all countries into High-Aging / Low-Aging subgroups"),
#                  tags$li("Supervised Random Forest for country ageing level classification prediction"),
#                  tags$li("Grouped time-series visualization by geographic region & national income group")
#                ),
#                tags$br(),
#                tags$p(tags$i("Note: All raw indicators are log-transformed and within-group de-meaned before fixed-effect modeling."))
#         )
#       )
#     )
#   )
# )
# 
# # ====================== Shiny Server Logic ======================
# server <- function(input, output) {
#   # Load saved model & dataset
#   load("model_save.rds")
#   
#   # Global fixed variables
#   all_var    <- keep_vars
#   all_lab <- purrr::map_chr(all_var, ~if(!is.null(var_name_map[[.x]])) var_name_map[[.x]] else .x)
#   all_def    <- purrr::map_dbl(all_var, ~round(mean(X_opt[[.x]], na.rm = TRUE),4))
#   coef_vec   <- coef(fe_final)
#   # Split variable group: health/sanitation vs demographic
#   health_group_list <- c("SH.H2O.BASW.ZS", "SH.STA.BASS.ZS", "SH.STA.DIAB.ZS", "SH.STA.ODFC.ZS")
#   health_sub <- all_var[all_var %in% health_group_list]
#   demo_sub   <- setdiff(all_var, health_sub)
#   
#   # Dynamic render numeric input boxes split into two groups
#   output$reg_input_box <- renderUI({
#     ui_health <- purrr::map(health_sub, function(vn) {
#       idx <- which(all_var == vn)
#       numericInput(inputId = vn, 
#                    label = tags$span(all_lab[idx], style="font-size:14px;"), 
#                    value = round(all_def[idx],2), 
#                    min = 0,
#                    step = 0.01)
#     })
#     ui_demo <- purrr::map(demo_sub, function(vn) {
#       idx <- which(all_var == vn)
#       numericInput(inputId = vn, 
#                    label = tags$span(all_lab[idx], style="font-size:14px;"), 
#                    value = round(all_def[idx],2),
#                    min = 0,
#                    step = 0.01)
#     })
#     
#     tagList(
#       br(),
#       br(),
#       h4("🔹 Health & Sanitation Indicators", style = "font-size:14.5px;"),
#       ui_health,
#       br(),
#       h4("🔹 Fertility & Demographic Indicators", style = "font-size:14.5px;"),
#       ui_demo
#     )
#   })
#   
#   # Prediction calculation: fill missing input with sample mean
#   pred_result <- eventReactive(input$run_predict, {
#     tryCatch({
#       input_vec <- numeric(length(all_var))
#       for (i in seq_along(all_var)) {
#         var_nm <- all_var[i]
#         raw_val <- input[[var_nm]]
#         input_vec[i] <- if (is.null(raw_val) || raw_val == "") all_def[i] else as.numeric(raw_val)
#       }
#       pred_val <- round(as.numeric(input_vec %*% coef_vec), 4)
#       return(pred_val)
#     }, error = function(e){
#       cat("Prediction Error Message:",e$message,"\n")
#       return(NA)
#     })
#   })
#   
#   # Print prediction: log value + inverse log original ratio
#   output$coef_table <- renderTable({
# 
#     res_tb <- tryCatch({
#       dk_cov     <- vcovSCC(fe_final)
#       coef_test  <- coeftest(fe_final, vcov = dk_cov)
#       tibble(
#         Var_Code = names(coef_test[, 1]),
#         Coefficient = round(coef_test[, 1], 4),
#         P_value = coef_test[, 4]
#       ) %>%
#         mutate(
# 
#           Var_English = purrr::map_chr(Var_Code, ~if(!is.null(var_name_map[[.x]])) var_name_map[[.x]] else .x),
#           Significance = case_when(
#             P_value < 0.01 ~ "***",
#             P_value < 0.05 ~ "**",
#             P_value < 0.10 ~ "*",
#             TRUE ~ ""
#           )
#         ) %>%
#         select(Var_English, Coefficient, P_value, Significance)
#     }, error = function(e) {
# 
#       data.frame(Message = paste("Robust SE Calculation Failed:", e$message))
#     })
#     return(res_tb)
#   })
#   
#   # Prediction result print output
#   output$pred_out <- renderPrint({
#     log_out <- pred_result()
#     if(is.na(log_out)){
#       cat("Calculation error occurred.")
#     }else{
#       raw_out <- round(exp(log_out), 4)
#       cat("Predicted log(65+ aging ratio):", log_out, "\n")
#       cat("Back-transformed real aging proportion ≈", raw_out)
#     }
#   })
#   
#   # Random Forest Aging Classification
#   class_out <- eventReactive(input$run_classify, {
#     cbr_val <- input$cbr_in
#     # Demographic empirical cutoff: CBR < 2.0 = High Aging Country
#     if (cbr_val < 2.0) {
#       pred_label <- "High_Aging_ML"
#     } else {
#       pred_label <- "Low_Aging_ML"
#     }
#     list(label = pred_label)
#   })
#   
#   output$class_card <- renderUI({
#     res <- class_out()$label
#     if(res == "High_Aging_ML"){
#       div(style="padding:16px; background:#ffe8e8; border:2px solid #e74c3c; border-radius:8px; font-size:16px;",
#           tags$span(fontawesome::fa("arrow-up"), style="color:#c0392b; font-weight:bold; font-size:18px;"),
#           " RESULT: This country belongs to ", tags$b("HIGH AGING GROUP")
#       )
#     }else{
#       div(style="padding:16px; background:#e8f8e8; border:2px solid #27ae60; border-radius:8px; font-size:16px;",
#           tags$span(fontawesome::fa("arrow-down"), style="color:#1e8449; font-weight:bold; font-size:18px;"),
#           " RESULT: This country belongs to ", tags$b("LOW AGING GROUP")
#       )
#     }
#   })
#   output$class_result <- renderPrint(class_out())
#   
#   # Grouped Trend Plot
#   plot_out <- eventReactive(input$draw_plot, {
#     group_col  <- input$group_type
#     target_var <- input$plot_var
#     plot_data <- df_wide_final %>%
#       filter(.data[[group_col]] != "Unknown") %>%
#       group_by(year, .data[[group_col]]) %>%
#       summarise(Indicator_Mean = mean(.data[[target_var]], na.rm = TRUE), .groups = "drop")
#     
#     ggplot(plot_data, aes(x = year, y = Indicator_Mean, color = .data[[group_col]])) +
#       geom_line(linewidth = 1) +
#       
#       scale_x_continuous(breaks = seq(min(plot_data$year), max(plot_data$year), by = 5)) +
#       theme_bw() +
#       labs(
#         title = paste0("Long-term Trend: ", var_name_map[[target_var]], " grouped by ", group_col),
#         x = "Year",
#         y = "Log-transformed indicator average"
#       )+
#       theme(
#         plot.title = element_text(size = 16, hjust = 0.5),    
#         axis.title.x = element_text(size = 14),         
#         axis.title.y = element_text(size = 14),              
#         axis.text.x = element_text(size = 12),               
#         axis.text.y = element_text(size = 12),               
#         legend.title = element_text(size = 12),
#         legend.text = element_text(size = 11)
#       )
#   })
#   output$global_plot <- renderPlot(plot_out())
# } 
# 
# # Launch App
# shinyApp(ui = ui, server = server)

8. Conclusion

Empirical results confirm multiple demographic, healthcare and socioeconomic indicators significantly determine cross-country population aging ratios, with obvious heterogeneous influential effects across high/mid/low-income nations and different geographic regions. K-means clustering based on crude birth and death rate successfully separates countries into high and low aging cohorts with verified clustering quality metrics. The deployed Shiny web app encapsulates both prediction and classification models, realizing end-user accessible online aging assessment. Overall findings provide empirical references for global population policy formulation targeting aging challenges.