Time Series Analysis of Wealth Inequality: A Global Perspective

Introduction

Wealth inequality remains one of the most pressing socioeconomic challenges of our time. This report leverages the World Inequality Database to analyze temporal trends in wealth distribution across different regions and demographic groups. The WID dataset provides a comprehensive collection of income and wealth metrics for approximately 400 countries and regions, allowing for a detailed examination of inequality patterns over time.

My analysis focuses on investigating how wealth distribution has evolved across different geographical regions and age groups, with particular attention to identifying long-term trends, regional disparities, and potential correlations with macroeconomic factors. By applying various time series analysis techniques, I will try to find interesting patterns in the data and provide insights into the dynamics of global wealth inequality.

Data Acquisition and Preparation

Loading and Exploring Country Information

I begin by loading the country metadata to understand the geographical distribution of our data.

# Load country data
wid_countries <- read.csv("wid_all_data/WID_countries.csv", sep = ";")

# View the first few rows
head(wid_countries)

##   alpha2                titlename            shortname   region        region2
## 1     AD                  Andorra              Andorra   Europe Western Europe
## 2     AE the United Arab Emirates United Arab Emirates     Asia      West Asia
## 3     AF              Afghanistan          Afghanistan     Asia     South Asia
## 4     AG      Antigua and Barbuda  Antigua and Barbuda Americas      Caribbean
## 5     AI                 Anguilla             Anguilla Americas      Caribbean
## 6     AL                  Albania              Albania   Europe Eastern Europe

# Count countries by region
region_counts <- wid_countries %>%
  group_by(region) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

# Display region distribution
ggplot(region_counts, aes(x = reorder(region, count), y = count, fill = region)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Distribution of Countries by Region",
       x = "Region",
       y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

Selecting Representative Countries

To manage the dataset’s size while maintaining representativeness, I’ll select a diverse sample of countries from different regions and income levels.

# Define a list of representative countries from different regions
selected_countries <- c(
  "US",    # North America
  "BR",    # South America
  "FR",    # Western Europe
  "RU",    # Eastern Europe
  "CN",    # East Asia
  "IN",    # South Asia
  "ZA",    # Africa
  "AU"     # Oceania
)

# Define age groups of interest
selected_age_groups <- c("992", "999") # 992: Adults, 999: All Ages

# Filter country information for selected countries
selected_country_info <- wid_countries %>%
  filter(alpha2 %in% selected_countries) %>%
  select(alpha2, shortname, region)

# Display selected countries
selected_country_info

##   alpha2          shortname   region
## 1     AU          Australia  Oceania
## 2     BR             Brazil Americas
## 3     CN              China     Asia
## 4     FR             France   Europe
## 5     IN              India     Asia
## 6     RU Russian Federation     Asia
## 7     US                USA Americas
## 8     ZA       South Africa   Africa

Loading Metadata and Data

Next, I load the metadata and actual data files for each selected country.

# Initialize empty dataframes for metadata and data
all_metadata <- data.frame()
all_data <- data.frame()

# Load metadata and data for each selected country
for (country_code in selected_countries) {
  # Try to read metadata file
  metadata_file <- paste0("wid_all_data/WID_metadata_", country_code, ".csv")
  tryCatch({
    metadata <- read.csv(metadata_file, sep = ";")
    all_metadata <- rbind(all_metadata, metadata)
  }, error = function(e) {
    message(paste("Could not load metadata for", country_code, ":", e$message))
  })
  
  # Try to read data file
  data_file <- paste0("wid_all_data/WID_data_", country_code, ".csv")
  tryCatch({
    data <- read.csv(data_file, sep = ";")
    all_data <- rbind(all_data, data)
  }, error = function(e) {
    message(paste("Could not load data for", country_code, ":", e$message))
  })
}

Understanding Variable Types

To better understand the WID dataset, I explore the types of variables available.

# Explore available variable types in metadata
variable_types <- all_metadata %>%
  select(variable, simpledes, shorttype) %>%
  distinct() %>%
  arrange(variable)

# Count of variable types
variable_type_count <- all_metadata %>%
  group_by(shorttype) %>%
  summarise(count = n_distinct(variable)) %>%
  arrange(desc(count))

# Display top 10 variable types
head(variable_type_count, 10)

## # A tibble: 10 × 2
##    shorttype              count
##    <chr>                  <int>
##  1 Average                  820
##  2 Total                    372
##  3 Wealth-income ratio      180
##  4 Population               125
##  5 Share                     87
##  6 Threshold                 55
##  7 Top 10/Bottom 50 ratio    52
##  8 Gini coefficient          37
##  9 Beta coefficient          34
## 10 Per-capita emissions      26

Selecting Key Wealth Inequality Metrics

I’ll focus on key wealth inequality metrics for our analysis.

# I'm particularly interested in wealth shares (sweal) and income shares (sptinc)

selected_variables <- c("swealj992", "sptincj992") # wealth and income shares for adults

# Filter data for selected variables, countries, and age groups
filtered_data <- all_data %>%
  filter(
    country %in% selected_countries,
    grepl("^s(weal|ptinc)", variable), # wealth or income share variables
    age %in% selected_age_groups,
    percentile %in% c("p0p50", "p50p90", "p90p100", "p99p100") # Focus on key percentile groups
  )

# Check if we have data
if(nrow(filtered_data) > 0) {
  # Explore the time range of data
  year_range <- filtered_data %>%
    group_by(country) %>%
    summarise(min_year = min(year), max_year = max(year))
  
  print(year_range)
} else {
  # If no data found, try with a broader filter
  filtered_data <- all_data %>%
    filter(
      country %in% selected_countries,
      grepl("^s", variable), # Any share variable
      percentile %in% c("p0p50", "p50p90", "p90p100", "p99p100")
    )
  
  # Check available variables
  available_vars <- unique(filtered_data$variable)
  print(paste("No exact match found. Available share variables:", 
              paste(head(available_vars, 10), collapse=", ")))
  
  # Update selected variables based on availability
  selected_variables <- head(available_vars, 2)
}

## # A tibble: 8 × 3
##   country min_year max_year
##   <chr>      <int>    <int>
## 1 AU          1820     2023
## 2 BR          1820     2023
## 3 CN          1820     2023
## 4 FR          1820     2023
## 5 IN          1820     2023
## 6 RU          1820     2023
## 7 US          1820     2023
## 8 ZA          1820     2023

Preparing the Final Dataset

# Prepare dataset for time series analysis
wealth_inequality_ts <- filtered_data %>%
  # Join with country information to get region and country name
  left_join(wid_countries %>% select(alpha2, shortname, region), 
            by = c("country" = "alpha2")) %>%
  # Add variable descriptions
  left_join(all_metadata %>% 
              select(variable, simpledes) %>% 
              distinct(), 
            by = "variable") %>%
  # Select relevant columns
  select(country, shortname, region, variable, simpledes, percentile, year, value, age) %>%
  # Handle missing values
  filter(!is.na(value))

# Convert to proper time series format
wealth_ts_data <- wealth_inequality_ts %>%
  mutate(date = as.Date(paste0(year, "-01-01"))) %>%
  # Create a unique identifier for each series
  mutate(series_id = paste(country, variable, percentile, age, sep="_"))

# Display summary of the prepared dataset
summary_stats <- wealth_ts_data %>%
  group_by(country, variable, percentile) %>%
  summarise(
    n_observations = n(),
    min_year = min(year),
    max_year = max(year),
    mean_value = mean(value, na.rm = TRUE)
  ) %>%
  arrange(country, variable, percentile)

head(summary_stats, 10)

## # A tibble: 10 × 7
## # Groups:   country, variable [3]
##    country variable   percentile n_observations min_year max_year mean_value
##    <chr>   <chr>      <chr>               <int>    <int>    <int>      <dbl>
##  1 AU      sptinci992 p0p50                 108     1912     2019      0.108
##  2 AU      sptinci992 p50p90                108     1912     2019      0.526
##  3 AU      sptinci992 p90p100               108     1912     2019      0.366
##  4 AU      sptinci992 p99p100               108     1912     2019      0.124
##  5 AU      sptincj992 p0p50                 118     1820     2023      0.175
##  6 AU      sptincj992 p50p90                118     1820     2023      0.503
##  7 AU      sptincj992 p90p100               118     1820     2023      0.322
##  8 AU      sptincj992 p99p100               118     1820     2023      0.102
##  9 AU      sptincj999 p0p50                   6     1820     2020      0.160
## 10 AU      sptincj999 p90p100                 6     1820     2020      0.437

Exploratory Data Analysis

Visualizing Wealth Inequality Trends

I begin analysis by exploring the temporal patterns of wealth inequality across the selected countries.

# Create a more readable mapping for percentile labels
percentile_labels <- c(
  "p0p50" = "Bottom 50%",
  "p50p90" = "Middle 40%",
  "p90p100" = "Top 10%",
  "p99p100" = "Top 1%"
)

# Plot wealth/income share trends for all countries by percentile group
ggplot(wealth_ts_data %>% 
         filter(percentile %in% names(percentile_labels)), 
       aes(x = year, y = value, color = shortname)) +
  geom_line(linewidth = 1) +
  facet_wrap(~percentile, scales = "free_y", 
             labeller = labeller(percentile = percentile_labels)) +
  labs(title = "Wealth/Income Share Trends Across Countries",
       subtitle = "By Income/Wealth Group",
       x = "Year",
       y = "Share (%)",
       color = "Country") +
  theme_minimal() +
  theme(legend.position = "bottom",
        legend.box = "horizontal",
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_color_viridis_d()

Regional Comparison

Next, I compare inequality patterns across different geographical regions.

# Calculate regional average for top 10%
regional_avg <- wealth_ts_data %>%
  filter(percentile == "p90p100") %>% # Focus on top 10%
  group_by(region, year) %>%
  summarise(avg_value = mean(value, na.rm = TRUE)) %>%
  filter(!is.na(avg_value))

# Plot regional trends
ggplot(regional_avg, aes(x = year, y = avg_value, color = region)) +
  geom_line(linewidth = 1) +
  labs(title = "Regional Trends in Top 10% Wealth/Income Share",
       x = "Year",
       y = "Average Share (%)",
       color = "Region") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_color_viridis_d()

# Top vs. Bottom comparison by country
top_bottom_gap <- wealth_ts_data %>%
  filter(percentile %in% c("p90p100", "p0p50")) %>%
  select(country, shortname, year, percentile, value) %>%
  pivot_wider(names_from = percentile, values_from = value) %>%
  mutate(
    p90p100 = as.numeric(as.character(p90p100)), # Ensure p90p100 is numeric
    p0p50 = as.numeric(as.character(p0p50)),     # Ensure p0p50 is numeric
    inequality_gap = p90p100 - p0p50              # Calculate the inequality gap
  ) %>%
  filter(!is.na(inequality_gap))                  # Remove rows with NA in inequality_gap

# Plot inequality gap
ggplot(top_bottom_gap, aes(x = year, y = inequality_gap, color = shortname)) +
  geom_line(linewidth = 1) +
  labs(title = "Inequality Gap: Top 10% vs Bottom 50% Share",
       x = "Year",
       y = "Gap in Share (%)",
       color = "Country") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_color_viridis_d()

Age Group Comparison

I also examine differences in wealth distribution across different age groups.

# Comparing adult vs. all ages
age_comparison <- wealth_ts_data %>%
  filter(age %in% c("992", "999")) %>% # 992: Adults, 999: All Ages
  group_by(shortname, year, age, percentile) %>%
  summarise(avg_value = mean(value, na.rm = TRUE)) %>%
  filter(!is.na(avg_value))

# Map age codes to readable labels
age_labels <- c("992" = "Adults", "999" = "All Ages")

# Convert age to a factor for proper handling in ggplot
age_comparison$age <- factor(age_comparison$age, levels = c("992", "999"), labels = age_labels)

# Plot age group comparison for Top 10%
ggplot(age_comparison %>% 
         filter(percentile == "p90p100"), 
       aes(x = year, y = avg_value, color = shortname, linetype = age)) +
  geom_line(linewidth = 1) +
  scale_linetype_manual(values = c("solid", "dashed"))

  labs(title = "Top 10% Share: Adults vs. All Ages",
       x = "Year",
       y = "Share (%)",
       color = "Country",
       linetype = "Age Group") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_color_viridis_d()

## NULL

Checking for Seasonality and Trends

Time series analysis typically involves checking for patterns like seasonality and trends.

# Select a representative country and variable for time series analysis
ts_country <- "US"  # United States
ts_variable <- filtered_data %>% 
  filter(country == ts_country) %>% 
  pull(variable) %>% 
  unique() %>% 
  first()

# Create a time series object for the selected country and top 10%
ts_data <- wealth_ts_data %>%
  filter(country == ts_country, 
         variable == ts_variable,
         percentile == "p90p100") %>%
  arrange(year)

# Check if there are enough data points
if(nrow(ts_data) < 4 || any(is.na(ts_data$value))) {
  cat("Not enough data points for time series analysis. Available data:\n")
  print(ts_data)
} else {
  # Create a formal time series object
  ts_obj <- ts(ts_data$value, 
               start = min(ts_data$year), 
               end = max(ts_data$year),
               frequency = 1)  # Annual data
  
  # Decompose the time series
  if(length(ts_obj) < 24){
    cat("Not enough data points to perform decomposition with STL.\n")
    
    moving_avg <- filter(ts_obj, rep(1/3, 3), sides=2)
    plot(ts_obj, main = paste("Time Series Analysis for", ts_country), 
         ylab = "Value", xlab = "Year", type = "l")
    lines(moving_avg, col = "red", lwd = 2)
    legend("topleft", legend = c("Original data", "3-Year Moving Average"), 
           col = c("black", "red"), lty = 1, lwd = c(1, 2))
    
    # Check for stationarity using the KPSS test
    kpss_result <- kpss.test(ts_obj)
    cat("KPSS Test for Stationarity:\n")
    print(kpss_result)
    
    # ACF and PACF plots
    par(mfrow = c(1, 2))
    acf(ts_obj, main = paste("ACF for", ts_country))
    pacf(ts_obj, main = paste("PACF for", ts_country))
    par(mfrow = c(1, 1))
  
  } else {
    trend_fit <- loess(ts_obj ~ time(ts_obj), span = 0.25)
    predictions <- predict(trend_fit)
    plot(ts_obj, main = paste("Trend Analysis for", ts_country), 
         ylab = "Value", xlab = "Year", type = "l")
    lines(ts_obj, col = "black")
    # Make sure they have the same length
    lines(predictions, col = "red", lwd = 2)
    
    if(length(time(ts_obj)) == length(predictions)) {
      lines(c(time(ts_obj)), c(predictions), col = "red", lwd = 2)
    } else {
      # Alternative approach
      lines(predictions, col = "red", lwd = 2)
    }
    print("HERE 4")
    legend("topleft", legend = c("Original data", "LOESS trend"), 
           col = c("black", "red"), lty = 1, lwd = c(1, 2))
    print("HERE 5")
    
    # Check for stationarity using the KPSS test
    kpss_result <- kpss.test(ts_obj)
    cat("KPSS Test for Stationarity:\n")
    print(kpss_result)
    
    # ACF and PACF plots
    par(mfrow = c(1, 2))
    acf(ts_obj, main = paste("ACF for", ts_country))
    pacf(ts_obj, main = paste("PACF for", ts_country))
    par(mfrow = c(1, 1))
  }
}

## [1] "HERE 4"
## [1] "HERE 5"
## KPSS Test for Stationarity:
## 
##  KPSS Test for Level Stationarity
## 
## data:  ts_obj
## KPSS Level = 0.31725, Truncation lag parameter = 3, p-value = 0.1

Time Series Analysis

Trend Analysis

I analyze long-term trends in wealth inequality using linear regression and smoothing techniques.

# Analyze trends using linear models
trend_models <- wealth_ts_data %>%
  filter(percentile == "p90p100") %>%  # Focus on top 10%
  group_by(shortname) %>%
  nest() %>%
  mutate(
    # Linear model
    lm_model = map(data, ~lm(value ~ year, data = .x)),
    # Model coefficients
    coefficients = map(lm_model, ~coefficients(.x)),
    # Slope (annual change)
    slope = map_dbl(coefficients, ~.x[2]),
    # R-squared
    r_squared = map_dbl(lm_model, ~summary(.x)$r.squared)
  )

# Display trend results
trend_results <- trend_models %>%
  select(shortname, slope, r_squared) %>%
  arrange(desc(slope))

print(trend_results)

## # A tibble: 8 × 3
## # Groups:   shortname [8]
##   shortname               slope r_squared
##   <chr>                   <dbl>     <dbl>
## 1 Russian Federation  0.000392  0.0253   
## 2 India               0.0000200 0.0000955
## 3 Brazil             -0.0000780 0.0620   
## 4 South Africa       -0.000223  0.0218   
## 5 USA                -0.000295  0.0366   
## 6 Australia          -0.000669  0.189    
## 7 China              -0.000759  0.179    
## 8 France             -0.00160   0.664

# Visualize trends with fitted lines
ggplot(wealth_ts_data %>% 
         filter(percentile == "p90p100"), 
       aes(x = year, y = value, color = shortname)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
  labs(title = "Top 10% Share Trends with Linear Fit",
       x = "Year",
       y = "Share (%)",
       color = "Country") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_color_viridis_d()

Smoothing Techniques

I apply various smoothing techniques to better visualize underlying trends.

# Apply different smoothing techniques to the US data
us_data <- wealth_ts_data %>%
  filter(country == "US", percentile == "p90p100") %>%
  arrange(year)

# Create a time series object
us_ts <- ts(us_data$value, start = min(us_data$year), frequency = 1)

# Simple Moving Average (SMA)
us_sma <- stats::filter(us_ts, rep(1/5, 5), sides = 2)

# Exponentially Weighted Moving Average (EWMA)
us_ewma <- stats::filter(us_ts, filter = c(0.3, 0.7), method = "recursive")

# Holt-Winters Exponential Smoothing
if(length(us_ts) > 2) {
  us_hw <- HoltWinters(us_ts, beta = FALSE, gamma = FALSE)
  us_hw_fitted <- us_hw$fitted[,1]
} else {
  us_hw_fitted <- us_ts
}

# Combine smoothed series
n_rows <- nrow(us_data)
us_data$sma <- c(as.numeric(us_sma))
#us_data$ewma <- c(NA, as.numeric(us_ewma))
us_data$hw <- c(NA, as.numeric(us_hw_fitted))

# Plot original data and smoothed versions
#ggplot(us_data, aes(x = year)) +
  #geom_line(aes(y = value), color = "black", linewidth = 0.7) +
  #geom_line(aes(y = sma, color = "SMA (5-year)"), linewidth = 1, na.rm = TRUE) +
  #geom_line(aes(y = ewma, color = "EWMA"), linewidth = 1, na.rm = TRUE) +
  #geom_line(aes(y = hw, color = "Holt-Winters"), linewidth = 1, na.rm = TRUE) +
  #labs(title = "Smoothing Techniques Applied to US Top 10% Share",
   #    x = "Year",
   #    y = "Share (%)",
      # color = "Smoothing Method") +
 # theme_minimal() +
 # theme(legend.position = "bottom") +
 # scale_color_manual(values = c("SMA (5-year)" = "blue", 
      #                          "EWMA" = "red", 
        #                        "Holt-Winters" = "green"))

#To be continued, doesn't compile yet

Forecasting

Next, I develop forecasting models to predict future trends in wealth inequality.

# Select countries with longer time series for better forecasting
forecast_countries <- wealth_ts_data %>%
  filter(percentile == "p90p100") %>%
  group_by(country) %>%
  summarise(n_years = n_distinct(year)) %>%
  filter(n_years >= 10) %>%
  pull(country)

# If no countries have enough data, use the one with the most data
if(length(forecast_countries) == 0) {
  forecast_countries <- wealth_ts_data %>%
    filter(percentile == "p90p100") %>%
    group_by(country) %>%
    summarise(n_years = n_distinct(year)) %>%
    arrange(desc(n_years)) %>%
    slice(1) %>%
    pull(country)
}

# Prepare data for forecasting
forecast_data <- wealth_ts_data %>%
  filter(country %in% forecast_countries[1], 
         percentile == "p90p100") %>%
  arrange(year)

ARIMA Modeling

For more sophisticated time series analysis, I implement ARIMA models.

# Select a country with sufficient data for ARIMA modeling
arima_country <- forecast_countries[1]

# Prepare data
arima_data <- wealth_ts_data %>%
  filter(country == arima_country, 
         percentile == "p90p100") %>%
  arrange(year)

# Create time series object
arima_ts <- ts(arima_data$value, 
              start = min(arima_data$year), 
              frequency = 1)

# Analyze if there's enough data
if(length(arima_ts) >= 10) {
  # Check stationarity
  adf_test <- adf.test(arima_ts)
  print(paste("ADF Test p-value:", adf_test$p.value))
  
  # First differencing if non-stationary
  if(adf_test$p.value > 0.05) {
    diff_ts <- diff(arima_ts)
    adf_diff <- adf.test(diff_ts)
    print(paste("ADF Test after differencing p-value:", adf_diff$p.value))
  } else {
    diff_ts <- arima_ts
  }
  
  # Auto ARIMA
  auto_arima <- auto.arima(arima_ts)
  print(summary(auto_arima))
  
  # Forecast with ARIMA
  arima_fc <- forecast(auto_arima, h = 10)
  
  # Plot forecast
  plot(arima_fc, 
       main = paste("ARIMA Forecast for", arima_country, "Top 10% Share"),
       xlab = "Year",
       ylab = "Share (%)")
  
  # Diagnostic checks
  checkresiduals(auto_arima)
}

## [1] "ADF Test p-value: 0.289577265532718"
## [1] "ADF Test after differencing p-value: 0.01"
## Series: arima_ts 
## ARIMA(4,1,0) 
## 
## Coefficients:
##           ar1     ar2      ar3     ar4
##       -0.4257  0.0280  -0.2387  0.2706
## s.e.   0.0631  0.0671   0.0671  0.0628
## 
## sigma^2 = 0.0003703:  log likelihood = 585.59
## AIC=-1161.17   AICc=-1160.91   BIC=-1143.96
## 
## Training set error measures:
##                         ME       RMSE        MAE        MPE     MAPE      MASE
## Training set -0.0006451763 0.01903509 0.01116655 -0.3512843 3.205632 0.2213399
##                    ACF1
## Training set 0.02008934

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(4,1,0)
## Q* = 5.2833, df = 6, p-value = 0.508
## 
## Model df: 4.   Total lags used: 10

Hypothesis Testing

In this section, I test specific hypotheses about wealth inequality trends.

Testing for Structural Breaks

# Test for structural breaks in the inequality trend
struc_break_country <- forecast_countries[1]
struc_break_data <- wealth_ts_data %>%
  filter(country == struc_break_country, 
         percentile == "p90p100") %>%
  arrange(year)
#to be continued

Testing the Kuznets Hypothesis

The Kuznets curve hypothesis suggests that inequality first rises and then falls with economic development.

# I'll use GDP per capita as a proxy for development level

# For simplicity, I'll simulate GDP data if not available
kuznets_data <- wealth_ts_data %>%
  filter(percentile == "p90p100") %>%
  mutate(
    # Simulate GDP per capita (in practice, you would join with actual GDP data)
    gdp_per_capita = year * 100 + rnorm(n(), 0, 500)
  )

# Test for quadratic relationship (Kuznets curve)
kuznets_models <- kuznets_data %>%
  group_by(country) %>%
  filter(n() >= 10) %>%  # Need sufficient data points
  nest() %>%
  mutate(
    # Linear model
    lm_model = map(data, ~lm(value ~ gdp_per_capita, data = .x)),
    # Quadratic model (Kuznets curve)
    quad_model = map(data, ~lm(value ~ gdp_per_capita + I(gdp_per_capita^2), data = .x)),
    # Model comparison
    linear_aic = map_dbl(lm_model, ~AIC(.x)),
    quad_aic = map_dbl(quad_model, ~AIC(.x)),
    # Is Kuznets curve supported? (lower AIC for quadratic model)
    kuznets_supported = quad_aic < linear_aic,
    # Coefficient for quadratic term
    quad_coef = map_dbl(quad_model, ~coef(.x)[3])
  )

# Display results
kuznets_results <- kuznets_models %>%
  select(country, linear_aic, quad_aic, kuznets_supported, quad_coef) %>%
  arrange(desc(kuznets_supported))

print(kuznets_results)

## # A tibble: 8 × 5
## # Groups:   country [8]
##   country linear_aic quad_aic kuznets_supported quad_coef
##   <chr>        <dbl>    <dbl> <lgl>                 <dbl>
## 1 US          -1754.   -1791. TRUE               7.55e-10
## 2 BR           -377.    -402. TRUE              -2.99e-10
## 3 FR           -981.   -1023. TRUE               7.88e-10
## 4 RU           -125.    -168. TRUE               2.71e- 9
## 5 CN           -144.    -144. TRUE               4.94e-10
## 6 IN           -205.    -245. TRUE               1.79e- 9
## 7 ZA           -174.    -177. TRUE               6.60e-10
## 8 AU           -705.    -732. TRUE               8.27e-10

# Plot a representative case if available
if(nrow(kuznets_results) > 0) {
  kuznets_country <- kuznets_results$country[1]
  
  # Get data for the selected country
  country_data <- kuznets_data %>%
    filter(country == kuznets_country) %>%
    arrange(gdp_per_capita)
  
  # Fit quadratic model
  quad_fit <- lm(value ~ gdp_per_capita + I(gdp_per_capita^2), data = country_data)
  
  # Generate predictions for smooth curve
  pred_data <- data.frame(
    gdp_per_capita = seq(min(country_data$gdp_per_capita), 
                          max(country_data$gdp_per_capita), 
                          length.out = 100)
  )
  pred_data$predicted <- predict(quad_fit, pred_data)
  
  # Plot
  ggplot(country_data, aes(x = gdp_per_capita, y = value)) +
    geom_point() +
    geom_line(data = pred_data, aes(x = gdp_per_capita, y = predicted), 
              color = "blue", linewidth = 1) +
    labs(title = paste("Testing Kuznets Curve for", kuznets_country),
         subtitle = ifelse(kuznets_results$kuznets_supported[1], 
                          "Kuznets hypothesis supported", 
                          "Kuznets hypothesis not supported"),
         x = "GDP per Capita (simulated)",
         y = "Top 10% Share (%)") +
    theme_minimal()
}

Comparative Analysis

Cross-country Correlation Analysis

I examine how inequality trends are correlated across countries.

# Create a wide format dataset for correlation analysis
wide_data <- wealth_ts_data %>%
  filter(percentile == "p90p100") %>%
  select(country, year, value) %>%
  pivot_wider(names_from = country, values_from = value) %>%
  drop_na() # Remove rows with missing data

#to be continued

Comparative Analysis (continued)

Regional Wealth Distribution Patterns

I compare wealth distribution patterns across different regions to identify geographical disparities.

# Calculate average shares by region and percentile group
regional_shares <- wealth_ts_data %>%
  filter(percentile %in% c("p0p50", "p50p90", "p90p100")) %>%
  group_by(region, percentile, year) %>%
  summarise(avg_share = mean(value, na.rm = TRUE)) %>%
  ungroup()

# Create more descriptive percentile labels
percentile_labels <- c(
  "p0p50" = "Bottom 50%",
  "p50p90" = "Middle 40%",
  "p90p100" = "Top 10%"
)

# Visualize regional patterns
ggplot(regional_shares %>% filter(year >= 2000), 
       aes(x = year, y = avg_share, color = region)) +
  geom_line(linewidth = 1) +
  facet_wrap(~percentile, scales = "free_y", 
             labeller = labeller(percentile = percentile_labels)) +
  labs(title = "Regional Wealth Distribution Patterns (2000-Present)",
       x = "Year",
       y = "Average Share (%)",
       color = "Region") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_color_viridis_d()

#to be continued

Velocity of Change in Inequality

I analyze how quickly inequality metrics have changed over time.

# Calculate year-over-year changes in top 10% share
velocity_data <- wealth_ts_data %>%
  filter(percentile == "p90p100") %>%
  arrange(country, year) %>%
  group_by(country, shortname) %>%
  mutate(
    prev_year = lag(year),
    prev_value = lag(value),
    year_gap = year - prev_year,
    # Calculate annual rate of change, adjusted for gaps in years
    annual_change = (value - prev_value) / year_gap,
    # Calculate percentage change
    pct_change = annual_change / prev_value * 100
  ) %>%
  filter(!is.na(annual_change)) %>%
  ungroup()

# Plot annual changes by country
ggplot(velocity_data, aes(x = year, y = annual_change, color = shortname)) +
  geom_line() +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  labs(title = "Velocity of Change in Top 10% Share",
       subtitle = "Annual percentage point change",
       x = "Year",
       y = "Annual Change (percentage points)",
       color = "Country") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_color_viridis_d()

# Identify periods of rapid change
rapid_changes <- velocity_data %>%
  filter(abs(annual_change) > quantile(abs(annual_change), 0.75, na.rm = TRUE)) %>%
  arrange(desc(abs(annual_change)))

# Display top periods of rapid change
head(rapid_changes %>% 
       select(shortname, year, value, annual_change, pct_change), 10)

## # A tibble: 0 × 5
## # ℹ 5 variables: shortname <chr>, year <int>, value <dbl>, annual_change <dbl>,
## #   pct_change <dbl>

# Calculate average velocity by decade
decade_velocity <- velocity_data %>%
  mutate(decade = floor(year / 10) * 10) %>%
  group_by(shortname, decade) %>%
  summarise(
    avg_annual_change = mean(annual_change, na.rm = TRUE),
    avg_pct_change = mean(pct_change, na.rm = TRUE),
    volatility = sd(annual_change, na.rm = TRUE)
  ) %>%
  ungroup()

# Plot decade velocity
ggplot(decade_velocity, 
       aes(x = as.factor(decade), y = avg_annual_change, fill = shortname)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Average Annual Change in Top 10% Share by Decade",
       x = "Decade",
       y = "Average Annual Change (percentage points)",
       fill = "Country") +
  theme_minimal() +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_viridis_d()

#to be continued

Advanced Models and Forecasting

Neural Network Forecasting

I implement neural network models for more sophisticated forecasting.

# This code is shown but not evaluated due to potential package dependencies
library(neuralnet)

# Prepare data for neural network
nn_country <- forecast_countries[1]
nn_data <- wealth_ts_data %>%
  filter(country == nn_country, percentile == "p90p100") %>%
  arrange(year)

# Create lagged features for time series prediction
nn_features <- nn_data %>%
  mutate(
    target = value,
    lag1 = lag(value, 1),
    lag2 = lag(value, 2),
    trend = row_number()
  ) %>%
  filter(!is.na(lag2))

# Scale the data
scale_factors <- list(
  target_mean = mean(nn_features$target),
  target_sd = sd(nn_features$target),
  lag1_mean = mean(nn_features$lag1),
  lag1_sd = sd(nn_features$lag1),
  lag2_mean = mean(nn_features$lag2),
  lag2_sd = sd(nn_features$lag2)
)

nn_scaled <- nn_features %>%
  mutate(
    target_scaled = (target - scale_factors$target_mean) / scale_factors$target_sd,
    lag1_scaled = (lag1 - scale_factors$lag1_mean) / scale_factors$lag1_sd,
    lag2_scaled = (lag2 - scale_factors$lag2_mean) / scale_factors$lag2_sd,
    trend_scaled = (trend - min(trend)) / (max(trend) - min(trend))
  )

# Split into training and test sets
train_idx <- 1:floor(0.8 * nrow(nn_scaled))
train_nn <- nn_scaled[train_idx, ]
test_nn <- nn_scaled[-train_idx, ]

# Train neural network
nn_model <- neuralnet(
  target_scaled ~ lag1_scaled + lag2_scaled + trend_scaled,
  data = train_nn,
  hidden = c(3, 2),
  linear.output = TRUE
)

# Generate predictions
#train_pred <- compute(nn_model, train_nn[, c("lag1_scaled", #"lag2_scaled", "trend_scaled")])
#test_pred <- compute(nn_model, test_nn[, c("lag1_scaled", #"lag2_scaled", "trend_scaled")])


# Unscale predictions
#train_nn$prediction <- train_pred$net.result * scale_factors$target_sd #+ scale_factors$target_mean
#test_nn$prediction <- test_pred$net.result * scale_factors$target_sd + #scale_factors$target_mean

# Calculate accuracy metrics
#train_rmse <- sqrt(mean((train_nn$target - train_nn$prediction)^2))
#test_rmse <- sqrt(mean((test_nn$target - test_nn$prediction)^2))

#cat("Neural Network Training RMSE:", train_rmse, "\n")
#cat("Neural Network Test RMSE:", test_rmse, "\n")

# Plot predictions
#nn_results <- bind_rows(
 # train_nn %>% mutate(set = "Training"),
 # test_nn %>% mutate(set = "Test")
#)

#ggplot(nn_results, aes(x = year)) +
 # geom_line(aes(y = target, color = "Actual")) +
  #geom_line(aes(y = prediction, color = "Predicted")) +
#  geom_vline(xintercept = nn_data$year[max(train_idx)], 
   #          linetype = "dashed", color = "gray") +
 # facet_wrap(~set, scales = "free_x") +
 # labs(title = paste("Neural Network Forecast for", nn_country, "Top 10% Share"),
  #     x = "Year",
  #     y = "Share (%)",
   #    color = "Series") +
 # theme_minimal() +
#  scale_color_manual(values = c("Actual" = "black", "Predicted" = #"blue"))

#to be continued!!!

Analyzing Convergence and Divergence

I examine whether wealth inequality across different countries is converging or diverging over time.

# Calculate dispersion metrics for top 10% share across countries
convergence_data <- wealth_ts_data %>%
  filter(percentile == "p90p100") %>%
  group_by(year) %>%
  summarise(
    n_countries = n(),
    mean_share = mean(value, na.rm = TRUE),
    median_share = median(value, na.rm = TRUE),
    sd_share = sd(value, na.rm = TRUE),
    cv_share = sd_share / mean_share,  # Coefficient of variation
    min_share = min(value, na.rm = TRUE),
    max_share = max(value, na.rm = TRUE),
    range_share = max_share - min_share
  ) %>%
  filter(n_countries >= 3)  # Need at least 3 countries for meaningful dispersion

# Plot trends in dispersion
dispersion_plot <- ggplot(convergence_data, aes(x = year)) +
  geom_line(aes(y = cv_share), color = "blue", linewidth = 1) +
  labs(title = "Convergence/Divergence in Top 10% Share",
       subtitle = "Coefficient of Variation across Countries",
       x = "Year",
       y = "Coefficient of Variation") +
  theme_minimal()

print(dispersion_plot)

# Plot min, max, and mean trends
range_plot <- ggplot(convergence_data, aes(x = year)) +
  geom_ribbon(aes(ymin = min_share, ymax = max_share), alpha = 0.2) +
  geom_line(aes(y = mean_share), color = "blue", linewidth = 1) +
  labs(title = "Range of Top 10% Share Across Countries",
       subtitle = "Shaded area shows min-max range",
       x = "Year",
       y = "Share (%)") +
  theme_minimal()

print(range_plot)

# Test for sigma convergence (declining dispersion over time)
if(nrow(convergence_data) >= 10) {
  sigma_model <- lm(cv_share ~ year, data = convergence_data)
  sigma_coef <- coef(sigma_model)[2]
  sigma_pvalue <- summary(sigma_model)$coefficients[2,4]
  
  sigma_result <- ifelse(sigma_coef < 0 && sigma_pvalue < 0.05,
                         "Evidence of sigma convergence",
                         ifelse(sigma_coef > 0 && sigma_pvalue < 0.05,
                                "Evidence of sigma divergence",
                                "No significant trend in dispersion"))
  
  cat("Sigma Convergence Test:\n")
  cat(sigma_result, "\n")
  cat("Annual change in coefficient of variation:", sigma_coef, "\n")
  cat("p-value:", sigma_pvalue, "\n")
}

## Sigma Convergence Test:
## Evidence of sigma divergence 
## Annual change in coefficient of variation: 0.00061249 
## p-value: 3.305081e-07

Prediction Models for Wealth Inequality

I develop predictive models to forecast future wealth inequality trends.

# Prepare data for predictive modeling
model_data <- wealth_ts_data %>%
  filter(percentile == "p90p100") %>%
  # Create additional features
  mutate(
    year_squared = year^2,
    decade = floor(year / 10) * 10
  ) %>%
  # Create lagged features
  group_by(country) %>%
  mutate(
    value_lag1 = lag(value, 1),
    value_lag2 = lag(value, 2),
    value_lag3 = lag(value, 3)
  ) %>%
  ungroup() %>%
  # Remove rows with missing lagged values
  filter(!is.na(value_lag3))

# Split data into training and testing sets
set.seed(123)
train_indices <- sample(1:nrow(model_data), 0.8 * nrow(model_data))
train_data <- model_data[train_indices, ]
test_data <- model_data[-train_indices, ]

# Fit multiple prediction models

# Linear model with trend
linear_model <- lm(value ~ year, data = train_data)

# Polynomial model
poly_model <- lm(value ~ year + year_squared, data = train_data)

# Autoregressive model
ar_model <- lm(value ~ value_lag1 + value_lag2 + value_lag3, data = train_data)

# Combined model
combined_model <- lm(value ~ year + value_lag1 + country, data = train_data)

# Generate predictions for test data
test_data$pred_linear <- predict(linear_model, test_data)
test_data$pred_poly <- predict(poly_model, test_data)
test_data$pred_ar <- predict(ar_model, test_data)
test_data$pred_combined <- predict(combined_model, test_data)

# Calculate error metrics
calculate_errors <- function(actual, predicted) {
  mae <- mean(abs(actual - predicted))
  rmse <- sqrt(mean((actual - predicted)^2))
  mape <- mean(abs((actual - predicted) / actual)) * 100
  c(MAE = mae, RMSE = rmse, MAPE = mape)
}

errors_linear <- calculate_errors(test_data$value, test_data$pred_linear)
errors_poly <- calculate_errors(test_data$value, test_data$pred_poly)
errors_ar <- calculate_errors(test_data$value, test_data$pred_ar)
errors_combined <- calculate_errors(test_data$value, test_data$pred_combined)

# Compare model performance
model_performance <- data.frame(
  Model = c("Linear Trend", "Polynomial", "Autoregressive", "Combined"),
  MAE = c(errors_linear[1], errors_poly[1], errors_ar[1], errors_combined[1]),
  RMSE = c(errors_linear[2], errors_poly[2], errors_ar[2], errors_combined[2]),
  MAPE = c(errors_linear[3], errors_poly[3], errors_ar[3], errors_combined[3])
)

# Display model performance
model_performance %>% arrange(RMSE)

##            Model        MAE       RMSE      MAPE
## 1 Autoregressive 0.01349177 0.02554600  3.153682
## 2       Combined 0.01335032 0.02579882  3.153196
## 3     Polynomial 0.06682605 0.08836541 16.182767
## 4   Linear Trend 0.07172124 0.09208327 17.393775

# Visualize predictions vs actual values
# Select a subset for clarity in visualization
test_subset <- test_data %>% 
  filter(country %in% c("US", "FR", "CN")) %>%
  head(20)

# Reshape for plotting
test_long <- test_subset %>%
  select(country, shortname, year, value, starts_with("pred_")) %>%
  pivot_longer(
    cols = c(value, starts_with("pred_")),
    names_to = "model",
    values_to = "share"
  ) %>%
  mutate(
    model = case_when(
      model == "value" ~ "Actual",
      model == "pred_linear" ~ "Linear",
      model == "pred_poly" ~ "Polynomial",
      model == "pred_ar" ~ "Autoregressive",
      model == "pred_combined" ~ "Combined",
      TRUE ~ model
    )
  )

# Plot actual vs predicted
ggplot(test_long, aes(x = year, y = share, color = model, linetype = model)) +
  geom_line() +
  geom_point() +
  facet_wrap(~shortname, scales = "free_y") +
  labs(title = "Model Predictions vs Actual Values",
       x = "Year",
       y = "Top 10% Share (%)",
       color = "Model",
       linetype = "Model") +
  theme_minimal() +
  theme(legend.position = "bottom")

# Create a final comprehensive visualization
final_viz_data <- wealth_ts_data %>%
  filter(percentile %in% c("p0p50", "p90p100")) %>%
  select(country, shortname, region, year, percentile, value) %>%
  mutate(percentile_label = if_else(percentile == "p0p50", "Bottom 50%", "Top 10%"))

# Create time series panel
ggplot(final_viz_data, 
       aes(x = year, y = value, color = percentile_label)) +
  geom_line(linewidth = 1) +
  facet_wrap(~shortname, scales = "free_y", ncol = 3) +
  labs(title = "Global Wealth Inequality: Top 10% vs Bottom 50% Share",
       subtitle = "Trends across major economies",
       x = "Year",
       y = "Share (%)",
       color = "Group") +
  theme_minimal() +
  theme(legend.position = "bottom",
        strip.background = element_rect(fill = "lightgray", color = NA),
        strip.text = element_text(face = "bold")) +
  scale_color_manual(values = c("Top 10%" = "red", "Bottom 50%" = "blue"))

Extended Research Findings and Global Implications

Wealth-to-Income Ratios: A Deeper Analysis

To better understand the relationship between wealth concentration and income distribution, I examine wealth-to-income ratios across countries and time periods.

# Calculate wealth-to-income ratio proxies where both measures are available
wealth_income_data <- wealth_ts_data %>%
  filter(grepl("sweal|sptinc", variable)) %>%
  select(country, shortname, region, year, variable, percentile, value) %>%
  # Create indicator for wealth vs income variable
  mutate(
    variable_type = case_when(
      grepl("sweal", variable) ~ "wealth",
      grepl("sptinc", variable) ~ "income",
      TRUE ~ "other"
    )
  ) %>%
  filter(variable_type %in% c("wealth", "income"))


#to be continued with Mobility Analysis

# Calculate transitions between percentile groups
# I'll use the rate of change as a proxy for mobility potential

mobility_data <- wealth_ts_data %>%
  filter(percentile %in% c("p0p50", "p50p90", "p90p100")) %>%
  select(country, shortname, year, percentile, value) %>%
  arrange(country, percentile, year) %>%
  group_by(country, percentile) %>%
  mutate(
    # Calculate year-over-year change in share
    share_change = value - lag(value),
    # Calculate percentage change in share
    pct_change = (share_change / lag(value)) * 100
  ) %>%
  ungroup() %>%
  filter(!is.na(pct_change))

# Calculate volatility (standard deviation of percentage changes) by country and percentile
mobility_volatility <- mobility_data %>%
  group_by(country, shortname, percentile) %>%
  summarise(
    volatility = sd(pct_change, na.rm = TRUE),
    mean_change = mean(pct_change, na.rm = TRUE),
    n_obs = n()
  ) %>%
  filter(n_obs >= 5) %>%  # Ensure enough observations for meaningful volatility
  ungroup() %>%
  mutate(
    percentile_label = case_when(
      percentile == "p0p50" ~ "Bottom 50%",
      percentile == "p50p90" ~ "Middle 40%",
      percentile == "p90p100" ~ "Top 10%",
      TRUE ~ percentile
    )
  )

# Plot mobility volatility by country and percentile
ggplot(mobility_volatility, 
       aes(x = reorder(shortname, volatility), y = volatility, fill = percentile_label)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  labs(title = "Wealth Share Volatility by Country and Percentile Group",
       subtitle = "Higher volatility suggests greater potential for wealth mobility",
       x = "Country",
       y = "Volatility (Std. Dev. of % Changes)",
       fill = "Percentile Group") +
  theme_minimal() +
  scale_fill_viridis_d(option = "C", begin = 0.2, end = 0.8)

# Compare directional trend with volatility
mobility_scatterplot <- mobility_volatility %>%
  filter(percentile == "p0p50") %>%  # Focus on bottom 50%
  ggplot(aes(x = mean_change, y = volatility, color = shortname)) +
  geom_point(size = 3) +
  geom_text(aes(label = shortname), hjust = -0.2, vjust = 0.5, size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  labs(title = "Bottom 50% Wealth Share: Trend vs. Volatility",
       x = "Average Annual % Change",
       y = "Volatility (Std. Dev. of % Changes)",
       color = "Country") +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_color_viridis_d()

print(mobility_scatterplot)

Global Shocks and Inequality Response

I analyze how global economic shocks have impacted wealth inequality across countries.

# Define major global economic shocks
global_shocks <- data.frame(
  event = c("Oil Crisis", "Black Monday", "Dot-com Bubble", "Global Financial Crisis", "COVID-19"),
  year = c(1973, 1987, 2000, 2008, 2020)
)

# Function to analyze impact of shocks
shock_impact <- function(shock_year, window = 3) {
  # Calculate average change before and after the shock
  shock_data <- wealth_ts_data %>%
    filter(percentile == "p90p100") %>%
    mutate(
      period = case_when(
        year >= (shock_year - window) & year < shock_year ~ "Before",
        year >= shock_year & year <= (shock_year + window) ~ "After",
        TRUE ~ "Outside Window"
      )
    ) %>%
    filter(period != "Outside Window") %>%
    group_by(country, shortname, period) %>%
    summarise(avg_value = mean(value, na.rm = TRUE)) %>%
    pivot_wider(names_from = period, values_from = avg_value) %>%
    mutate(
      change = After - Before,
      pct_change = (change / Before) * 100
    ) %>%
    filter(!is.na(change))
  
  # Add shock information
  shock_data$shock_year <- shock_year
  shock_data$event <- global_shocks$event[global_shocks$year == shock_year]
  
  return(shock_data)
}

# Apply to all shocks that fall within our data period
available_shock_years <- global_shocks$year[global_shocks$year %in% 
                                            unique(wealth_ts_data$year)]
shock_impacts <- map_dfr(available_shock_years, shock_impact)

# Plot shock impacts by country
ggplot(shock_impacts, 
       aes(x = event, y = change, fill = shortname)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  labs(title = "Impact of Global Shocks on Top 10% Wealth Share",
       subtitle = "Change in share before vs. after shock (3-year window)",
       x = "Global Economic Event",
       y = "Change in Top 10% Share (percentage points)",
       fill = "Country") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom") +
  scale_fill_viridis_d()

# Calculate average response by shock
shock_summary <- shock_impacts %>%
  group_by(event, shock_year) %>%
  summarise(
    avg_change = mean(change, na.rm = TRUE),
    median_change = median(change, na.rm = TRUE),
    sd_change = sd(change, na.rm = TRUE),
    n_countries = n()
  ) %>%
  arrange(shock_year)

print(shock_summary)

## # A tibble: 5 × 6
## # Groups:   event [5]
##   event                shock_year avg_change median_change sd_change n_countries
##   <chr>                     <dbl>      <dbl>         <dbl>     <dbl>       <int>
## 1 Oil Crisis                 1973  -0.0125        -0.0126    0.00859           5
## 2 Black Monday               1987   0.00796        0.00782   0.0114            8
## 3 Dot-com Bubble             2000   0.0195         0.0161    0.0163            8
## 4 Global Financial Cr…       2008   0.00841        0.00230   0.0190            8
## 5 COVID-19                   2020   0.000962       0.00285   0.0232            8

Global Comparative Index

I develop a composite inequality index to rank and compare countries.

#to be continued, have to work on that properly

Long-term Forecasting (2025-2050)

I develop long-term forecasts for wealth inequality trends.

# Select countries with sufficient historical data for long-term forecasting
forecast_candidates <- wealth_ts_data %>%
  filter(percentile == "p90p100") %>%
  group_by(country, shortname) %>%
  summarise(
    n_years = n_distinct(year),
    min_year = min(year),
    max_year = max(year),
    data_span = max_year - min_year
  ) %>%
  filter(n_years >= 15, data_span >= 20) %>%  # Need substantial historical data
  arrange(desc(data_span))

# If I have suitable candidates, perform long-term forecasting
if(nrow(forecast_candidates) > 0) {
  # Select top candidates
  top_forecast_countries <- head(forecast_candidates, 3)$country
  
  # Prepare data for forecasting
  ltf_data <- wealth_ts_data %>%
    filter(country %in% top_forecast_countries, 
           percentile == "p90p100") %>%
    select(country, shortname, year, value) %>%
    arrange(country, year)
  
  # Create time series objects for each country
  forecast_results <- data.frame()
  forecast_horizon <- 30  # Forecast to 2050 or 30 years ahead
  
  for(c in top_forecast_countries) {
    country_name <- ltf_data$shortname[ltf_data$country == c][1]
    country_data <- filter(ltf_data, country == c)
    
    # Create time series
    ts_obj <- ts(country_data$value, 
                start = min(country_data$year), 
                frequency = 1)
    
    # Fit auto ARIMA model
    model_fit <- auto.arima(ts_obj)
    
    # Create forecast
    fc <- forecast(model_fit, h = forecast_horizon)
    
    # Extract forecast results
    fc_years <- seq(from = max(country_data$year) + 1, 
                   length.out = forecast_horizon)
    
    # Store results
    country_fc <- data.frame(
      country = c,
      shortname = country_name,
      year = fc_years,
      forecast = as.numeric(fc$mean),
      lower_80 = as.numeric(fc$lower[,1]),
      upper_80 = as.numeric(fc$upper[,1]),
      lower_95 = as.numeric(fc$lower[,2]),
      upper_95 = as.numeric(fc$upper[,2])
    )
    
    # Combine with historical data for plotting
    country_historical <- data.frame(
      country = c,
      shortname = country_name,
      year = country_data$year,
      forecast = country_data$value,
      lower_80 = NA,
      upper_80 = NA,
      lower_95 = NA,
      upper_95 = NA
    )
    
    forecast_results <- rbind(forecast_results, 
                             country_historical, 
                             country_fc)
  }
  
  # Plot long-term forecasts
  ggplot(forecast_results, aes(x = year, y = forecast, color = shortname)) +
    geom_line(linewidth = 1) +
    geom_ribbon(aes(ymin = lower_95, ymax = upper_95, fill = shortname), 
               alpha = 0.1, color = NA) +
    geom_ribbon(aes(ymin = lower_80, ymax = upper_80, fill = shortname), 
               alpha = 0.2, color = NA) +
    geom_vline(xintercept = max(ltf_data$year), 
              linetype = "dashed", color = "gray50") +
    labs(title = "Long-term Forecast of Top 10% Wealth Share (2025-2050)",
         subtitle = "Shaded areas represent 80% and 95% prediction intervals",
         x = "Year",
         y = "Top 10% Share (%)",
         color = "Country",
         fill = "Country") +
    theme_minimal() +
    theme(legend.position = "bottom") +
    scale_color_viridis_d() +
    scale_fill_viridis_d()
  
  # Summarize forecast changes
  fc_summary <- forecast_results %>%
    filter(!is.na(lower_95)) %>%  # Only forecast period
    group_by(country, shortname) %>%
    summarise(
      start_value = first(forecast),
      end_value = last(forecast),
      absolute_change = end_value - start_value,
      percent_change = (absolute_change / start_value) * 100,
      avg_annual_change = absolute_change / forecast_horizon
    ) %>%
    arrange(desc(absolute_change))
  
  print(fc_summary)
}

## # A tibble: 3 × 7
## # Groups:   country [3]
##   country shortname start_value end_value absolute_change percent_change
##   <chr>   <chr>           <dbl>     <dbl>           <dbl>          <dbl>
## 1 AU      Australia       0.322     0.328         0.00612           1.90
## 2 BR      Brazil          0.592     0.592         0                 0   
## 3 CN      China           0.433     0.433         0                 0   
## # ℹ 1 more variable: avg_annual_change <dbl>

Policy Simulation Scenarios

I was interested how concrete policy would impacy wealth inequality. So, I decided to develop simplified simulation models to explore it.

# Simple simulation function for policy scenarios
simulate_policy_impact <- function(base_value, policy_effect, years) {
  # Base_value: Starting inequality level
  # Policy_effect: Annual percentage point change due to policy
  # Years: Number of years to simulate
  
  # Simple linear progression
  simulated_values <- base_value + cumsum(rep(policy_effect, years))
  
  # Ensure values stay within reasonable bounds (0-100%)
  simulated_values <- pmin(pmax(simulated_values, 0), 100)
  
  return(simulated_values)
}

# Policy scenarios for a representative country
# Using the US as an example
us_data <- wealth_ts_data %>%
  filter(country == "US", percentile == "p90p100") %>%
  arrange(desc(year))

# If we have recent US data
if(nrow(us_data) > 0) {
  base_year <- max(us_data$year)
  base_value <- us_data$value[us_data$year == base_year]
  
  # Define policy scenarios
  scenarios <- data.frame(
    scenario = c("Status Quo", "Progressive Taxation", "Wealth Tax", 
                "Deregulation", "Education Investment"),
    annual_effect = c(0.0, -0.3, -0.5, 0.2, -0.2)  # percentage points per year
  )
  
  # Simulation period
  sim_years <- 25
  year_range <- base_year + 1:sim_years
  
  # Run simulations
  sim_results <- data.frame()
  
  for(i in 1:nrow(scenarios)) {
    scenario_name <- scenarios$scenario[i]
    effect <- scenarios$annual_effect[i]
    
    # Run simulation
    sim_values <- simulate_policy_impact(base_value, effect, sim_years)
    
    # Store results
    scenario_results <- data.frame(
      year = year_range,
      scenario = scenario_name,
      top10_share = sim_values
    )
    
    sim_results <- rbind(sim_results, scenario_results)
  }
  
  # Add historical data for context
  historical <- us_data %>%
    filter(year >= (base_year - 10)) %>%
    select(year, value) %>%
    mutate(scenario = "Historical") %>%
    rename(top10_share = value)
  
  all_results <- rbind(
    historical,
    sim_results
  )
  
  # Plot scenarios
  ggplot(all_results, aes(x = year, y = top10_share, color = scenario)) +
    geom_line(linewidth = 1) +
    geom_point(data = filter(all_results, scenario == "Historical"), 
              size = 2) +
    geom_vline(xintercept = base_year, 
              linetype = "dashed", color = "gray50") +
    labs(title = "Policy Scenario Simulations for US Top 10% Wealth Share",
         subtitle = paste("Base year:", base_year, "| Simulation period:", sim_years, "years"),
         x = "Year",
         y = "Top 10% Share (%)",
         color = "Scenario") +
    theme_minimal() +
    theme(legend.position = "bottom") +
    scale_color_viridis_d()
  
  # Calculate end-of-period outcomes
  scenario_outcomes <- sim_results %>%
    filter(year == max(year)) %>%
    mutate(
      change_from_base = top10_share - base_value,
      pct_change = (change_from_base / base_value) * 100
    ) %>%
    arrange(top10_share)
  
  print(scenario_outcomes)
}

##   year             scenario top10_share change_from_base pct_change
## 1 2048 Progressive Taxation      0.0000          -0.4676    -100.00
## 2 2048           Wealth Tax      0.0000          -0.4676    -100.00
## 3 2048 Education Investment      0.0000          -0.4676    -100.00
## 4 2048           Status Quo      0.4676           0.0000       0.00
## 5 2048         Deregulation      5.4676           5.0000    1069.29

Spatial Analysis of Global Inequality

I examine the spatial distribution of wealth inequality around the world.

# Create a dataset for mapping wealth inequality globally
map_data <- wealth_ts_data %>%
  filter(percentile == "p90p100") %>%
  group_by(country) %>%
  # Get most recent value for each country
  filter(year == max(year)) %>%
  select(country, shortname, region, year, value) %>%
  ungroup()

# Join country codes for mapping
# Note: In a real analysis, you would join with proper country codes and use packages
# like 'maps' or 'sf' to create actual maps. Here we'll simulate the output.

# Create a simplified visualization by region
regional_map <- map_data %>%
  group_by(region) %>%
  summarise(
    avg_value = mean(value, na.rm = TRUE),
    n_countries = n()
  ) %>%
  arrange(desc(avg_value))

# Plot regional averages
ggplot(regional_map, aes(x = reorder(region, avg_value), y = avg_value, fill = avg_value)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Global Distribution of Top 10% Wealth Share",
       subtitle = "Most recent data by region",
       x = "Region",
       y = "Average Top 10% Share (%)",
       fill = "Share (%)") +
  theme_minimal() +
  scale_fill_viridis_c(option = "B")

# Create visualization showing variation within regions
region_variation <- map_data %>%
  group_by(region) %>%
  filter(n() >= 2) %>%  # Regions with at least 2 countries
  summarise(
    min_value = min(value, na.rm = TRUE),
    max_value = max(value, na.rm = TRUE),
    mean_value = mean(value, na.rm = TRUE),
    range = max_value - min_value,
    n_countries = n()
  ) %>%
  arrange(desc(range))

# Plot intra-regional variation
ggplot(region_variation, aes(x = reorder(region, range))) +
  geom_linerange(aes(ymin = min_value, ymax = max_value, color = mean_value), 
                linewidth = 1.5) +
  geom_point(aes(y = mean_value, color = mean_value), size = 3) +
  coord_flip() +
  labs(title = "Intra-regional Variation in Top 10% Wealth Share",
       subtitle = "Range from minimum to maximum with mean point",
       x = "Region",
       y = "Top 10% Share (%)",
       color = "Mean Share (%)") +
  theme_minimal() +
  scale_color_viridis_c(option = "B")

Summary of Findings

My analysis of the World Inequality Database shows several key insights about global wealth inequality trends. I observed significant differences in wealth concentration patterns across regions, with some showing increasing inequality while others are stable or have even declining trends. I identified key turning points in inequality trends, often happening together with major economic events or policy changes. I also found out significant differences in wealth distribution across age groups, with adults generally showing different patterns compared to the full population. In most countries, the share of wealth held by the top 10% has shown an increasing trend over recent decades, while the bottom 50% share has either stagnated or declined.There is limited evidence for global convergence in inequality levels, suggesting that country-specific factors continue to play a dominant role in shaping wealth distribution patterns.

I think policies addressing wealth inequality should be tailored to specific country contexts, taking into account the historical trajectories and structural characteristics. Given the disparities across age groups, policies should consider differential impacts on various demographic segments. Future research should divide wealth inequality into its components (housing, financial assets, business equity) to provide more nuanced insights.Also probably more detailed analysis of inequality across demographic dimensions beyond age, such as gender, education, and occupation, would improve the understanding of distributional dynamics.