Introduction

Economic and social development depends on many connected factors. Traditional methods like regression analysis and clustering help understand these relationships, but they often miss frequent patterns that show deeper connections. To address this, Association Rule Mining (ARM) was used on the World Development Indicators (WDI) dataset to find meaningful links between key economic and social factors.

This study looks at how different development indicators such as GDP per capita, health spending, education, infrastructure, and environmental factors are related. By using ARM, we can discover patterns that show how countries’ development characteristics are connected.

By applying ARM to global development data, this study helps identify important connections between key indicators and provides useful insights for decision-makers.

This paper follows a structured approach. It begins with a review of relevant studies, then explains the data processing steps, and finally applies the Apriori and Eclat algorithms to find association rules. The results highlight important development patterns, leading to policy suggestions and future research directions.

Literature Review

Association Rule Mining in Economic and Social Analysis

Association Rule Mining (ARM) has been widely applied in market basket analysis and consumer behavior studies (Agrawal & Srikant, 1994). Recent studies have demonstrated the potential of ARM in uncovering hidden relationships in development data. For instance, Hahsler, Grün, and Hornik (2005) highlight how ARM techniques can identify significant connections in socio-economic datasets.

Education, Health, and Economic Growth

Several studies have examined the link between education, health, and economic performance. Barro and Lee (2013) argue that higher literacy rates correlate with stronger economic growth due to a more skilled labor force. Similarly, Bloom, Canning, and Sevilla (2004) emphasize the role of health improvements in boosting productivity and long-term GDP growth. By applying ARM, this study aims to explore whether these relationships hold across different regions and income levels.

Infrastructure and Digital Access

Access to infrastructure, particularly internet penetration, plays a crucial role in modern economic development. Studies show that countries with widespread digital connectivity experience higher levels of innovation and economic expansion (Qiang et al., 2009). Additionally, the relationship between electricity access and development has been extensively studied, with findings suggesting a strong correlation between energy consumption and GDP per capita (IEA, 2018).

By leveraging ARM techniques, this study builds upon these foundations to identify recurring patterns in the World Development Indicators dataset. The next section details the methodology used to prepare the data and extract meaningful association rules.

Data Preparation

To examine global development trends, data was obtained from the World Development Indicators (WDI) database for all available countries between 2007 and 2017. The year 2017 was chosen as the endpoint because it was the last year with a high level of data completeness, ensuring a reliable analysis. A ten-year period was selected to provide a broad view of development patterns across different regions.

1. Selection of Indicators

Ten key indicators were chosen to represent economic performance, health, infrastructure, and environmental factors:

  • GDP per capita (economic growth)

  • Unemployment rate (labor market conditions)

  • Inflation rate (price stability)

  • Life expectancy (health and longevity)

  • Health expenditure (% of GDP) (investment in healthcare)

  • Infant mortality rate (child health outcomes)

  • Internet users (% of population) (digital access)

  • Access to electricity (basic infrastructure)

  • CO₂ emissions (environmental impact)

  • Urban population (% of total) (urbanization level)

2. Data Collection and Cleaning

The indicators were retrieved using the WDI package, and variable names were standardized for readability (e.g., NY.GDP.PCAP.CD was renamed to GDP (US$ per capita)). The following steps were taken to ensure data quality:

  • Grouping by country and year to maintain consistency.

  • Averaging numeric values within each country-year group while ignoring missing entries; rows that remain incomplete are removed at the cleaning stage.

  • Removing unnecessary columns such as iso2c and iso3c, which contain country codes but are not relevant for analysis.

# Load the packages used throughout the analysis
library(WDI)        # access to the World Development Indicators
library(dplyr)      # data manipulation
library(arules)     # association rule mining
library(arulesViz)  # rule visualization

# Using all available countries for 2007–2017

indicators <- c(
  "NY.GDP.PCAP.CD",   # GDP per capita (current US$)
  "SL.UEM.TOTL.ZS",   # Unemployment rate (%)
  "FP.CPI.TOTL.ZG",   # Inflation rate (annual %)
  "SP.DYN.LE00.IN",   # Life expectancy at birth (years)
  "SH.XPD.CHEX.GD.ZS",# Health expenditure (% of GDP)
  "SP.DYN.IMRT.IN",   # Infant mortality rate (per 1,000 live births)
  "IT.NET.USER.ZS",   # Internet users (% of population)
  "EG.ELC.ACCS.ZS",   # Access to electricity (% of population)
  "CC.CO2.EMSE.EL",   # CO2 emissions (excl. LUCF)
  "SP.URB.TOTL.IN.ZS" # Urban population (% of total)
)
# Fetch Data
wdi_data <- WDI(
  country  = "all", 
  indicator = indicators,
  start     = 2007,
  end       = 2017,
  extra     = FALSE
)
# Rename for easier readability
name_map <- c(
  "GDP (US$ per capita)" = "NY.GDP.PCAP.CD",
  "Unemployment Rate (%)" = "SL.UEM.TOTL.ZS",
  "Inflation Rate (%)" = "FP.CPI.TOTL.ZG",
  "Life Expectancy (Years)" = "SP.DYN.LE00.IN",
  "Health Expenditure (% of GDP)" = "SH.XPD.CHEX.GD.ZS",
  "Infant Mortality (per 1,000 live births)" = "SP.DYN.IMRT.IN",
  "Internet Users (% of Population)" = "IT.NET.USER.ZS",
  "Electricity Access (% of Population)" = "EG.ELC.ACCS.ZS",
  "CO2 Emissions (kt per capita)" = "CC.CO2.EMSE.EL",
  "Urban Population (% of Total)" = "SP.URB.TOTL.IN.ZS"
)


wdi_data <- wdi_data %>%
  rename(all_of(name_map))

# Collapse to one row per country and year
wdi_data_refined <- wdi_data %>%
  group_by(country, year) %>%
  summarize(
    across(everything(), ~ {
      if (is.numeric(.x)) {
        mean(.x, na.rm = TRUE)  # average numeric values, ignoring NAs (NaN if all are missing)
      } else {
        first(.x)  # keep the first value for character/factor columns
      }
    }),
    .groups = "drop"
  )

# Drop 'iso2c' and 'iso3c' since the country codes are not needed
wdi_data_refined <- wdi_data_refined %>%
  select(-iso2c, -iso3c)

# Keep 'country' for labeling (aligned with the rows only until incomplete rows are removed)
country_names <- wdi_data_refined$country

# Drop 'country' and 'year' so only the indicator columns remain
wdi_data_refined <- wdi_data_refined %>%
  select(-country, -year)

Data Cleaning and Discretization

To prepare the dataset for Association Rule Mining, several cleaning and transformation steps were applied to ensure that the data was complete and structured for categorical analysis. Since ARM works best with categorical data, numerical indicators were converted into meaningful categories using a combination of domain-specific thresholds and statistical methods.

An important step in this process was handling missing values. Any rows containing incomplete data were removed so that only complete observations entered the analysis. This improved the reliability of the discovered patterns and prevented missing data from distorting the results.
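The effect of this filter can be illustrated on a small example. The sketch below uses a hypothetical three-column data frame (the values are invented, not taken from the WDI extract) and base R only. It also shows that averaging a group with no observed values yields NaN, which complete.cases() treats as missing in the same way as NA.

```r
# Hypothetical toy data; values are invented for illustration
toy <- data.frame(
  gdp      = c(900, 4200, NA, 15000),
  life_exp = c(61.2, NaN, 70.1, 80.4),
  internet = c(12, 45, 50, 88)
)

# Averaging a group in which every value is missing returns NaN, not NA
all_missing_mean <- mean(c(NA_real_, NA_real_), na.rm = TRUE)

# complete.cases() is TRUE only for rows with neither NA nor NaN
keep <- complete.cases(toy)
toy_complete <- toy[keep, ]

print(all_missing_mean)
print(keep)
print(nrow(toy_complete))
```

Only the first and last rows survive here, which mirrors how country-year observations with any missing indicator drop out of the rule mining step.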

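The World Bank’s 2017 income classification was used to split GDP per capita into four categories. These thresholds are officially defined on GNI per capita, so applying them to GDP per capita is an approximation. The categories are: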

  • Low Income: Below $1,005

  • Lower Middle Income: Between $1,005 and $3,955

  • Upper Middle Income: Between $3,955 and $12,235

  • High Income: Above $12,235

This classification made it possible to compare economic trends with established economic groupings.
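How values on the boundaries are assigned follows from the way cut() builds its intervals. The sketch below is a base R illustration with hypothetical GDP values chosen to sit on and around the cut points. It assumes the default right-closed intervals, so a value exactly equal to a threshold falls into the lower category.

```r
# Breaks and labels matching the four income categories listed above
income_breaks <- c(-Inf, 1005, 3955, 12235, Inf)
income_labels <- c("Low Income", "Lower Middle Income",
                   "Upper Middle Income", "High Income")

# Hypothetical GDP per capita values on and around the thresholds
gdp_values <- c(650, 1005, 1006, 3955, 8000, 12235, 12236)

# cut() uses right-closed intervals by default: (-Inf, 1005], (1005, 3955], ...
income_group <- cut(gdp_values, breaks = income_breaks, labels = income_labels)

print(data.frame(gdp_values, income_group))
```

A value of exactly 1,005 is therefore labelled Low Income, and 12,235 is labelled Upper Middle Income.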

Other numerical indicators, such as unemployment rate, health expenditure, and life expectancy, were categorized using quartile-based binning to create four groups: Very Low, Low, Medium, and High. This method ensured that each category contained a balanced number of observations, preventing any single category from dominating the analysis. If a variable had fewer than four unique values, a binary categorization into High and Low was applied instead, based on the median value.

The final step involved applying these transformations to all relevant indicators. GDP was processed separately using economic classifications, while other indicators were discretized based on statistical distribution. These steps structured the dataset for ARM, making it possible to uncover meaningful development patterns.

# Remove any row containing at least one missing (NA) value
wdi_data_refined <- wdi_data_refined %>%
  filter(complete.cases(.))

# GDP Discretization (World Bank classification, 2017)
discretize_gdp <- function(x) {
  cut(x, breaks = c(-Inf, 1005, 3955, 12235, Inf), 
      labels = c("Low Income", "Lower Middle Income", "Upper Middle Income", "High Income"), 
      include.lowest = TRUE)
}

# Safe discretization function for other numerical variables
safe_discretize <- function(x) {
  if (!is.numeric(x)) return(x)  # Ensure only numeric columns are discretized
  
  # Count distinct non-missing values without shortening x, so the result keeps the column's length inside mutate()
  unique_values <- length(unique(x[!is.na(x)]))
  
  if (unique_values < 4) {
    return(as.factor(ifelse(x > median(x, na.rm = TRUE), "High", "Low")))  # Binary split
  } else {
    unique_breaks <- unique(quantile(x, probs = seq(0, 1, 0.25), na.rm = TRUE))
    
    if (length(unique_breaks) != 5) {  
      return(as.factor(ifelse(x > median(x, na.rm = TRUE), "High", "Low")))  # Binary fallback
    } else {
      return(cut(x, breaks = unique_breaks, labels = c("Very Low", "Low", "Medium", "High"), include.lowest = TRUE))
    }
  }
}

# Identify numeric columns (avoid factors)
numeric_cols <- sapply(wdi_data_refined, is.numeric)

# Apply safe discretization **only** to numeric variables, excluding GDP
wdi_data_refined <- wdi_data_refined %>%
  mutate(across(names(wdi_data_refined)[numeric_cols & names(wdi_data_refined) != "GDP (US$ per capita)"], safe_discretize))

# Apply GDP discretization separately
wdi_data_refined <- wdi_data_refined %>%
  mutate(`GDP (US$ per capita)` = discretize_gdp(`GDP (US$ per capita)`))

Transforming Data and Applying Association Rule Mining

The dataset was converted into a transaction format for association rule mining. The Apriori algorithm was applied with 5% support and 60% confidence to find frequent and reliable patterns. The discovered rules highlight key relationships between economic, social, and technological factors.
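To make the two thresholds concrete, the three standard rule metrics can be computed by hand. For a rule X ⇒ Y, support is the share of transactions containing both X and Y, confidence is that share divided by the support of X, and lift is confidence divided by the support of Y. The sketch below uses ten hypothetical transactions and base R only; the item names are illustrative.

```r
# Hypothetical transactions: one row per country-year, TRUE if the item is present
toy <- data.frame(
  low_income     = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
  high_mortality = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
)

n          <- nrow(toy)
count_lhs  <- sum(toy$low_income)                       # transactions with the antecedent
count_rhs  <- sum(toy$high_mortality)                   # transactions with the consequent
count_both <- sum(toy$low_income & toy$high_mortality)  # transactions with both

# Metrics for the rule {low_income} => {high_mortality}
support    <- count_both / n
confidence <- count_both / count_lhs
lift       <- confidence / (count_rhs / n)

print(c(support = support, confidence = confidence, lift = lift))

# Would this rule pass the thresholds used in this study?
passes <- support >= 0.05 && confidence >= 0.6
print(passes)
```

Here the rule has support 0.3, confidence 0.75, and lift 1.875, so it would be retained under the 5% support and 60% confidence cutoffs.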

#Convert Data into Transactions for Association Rule Mining
wdi_trans <- as(wdi_data_refined, "transactions")


#Apply Apriori Algorithm for Association Rule Mining
rules <- apriori(
  wdi_trans, 
  parameter = list(supp = 0.05, conf = 0.6, minlen = 2)
)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5    0.05      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 90 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[38 item(s), 1800 transaction(s)] done [0.00s].
## sorting and recoding items ... [38 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 done [0.00s].
## writing ... [3484 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
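The introduction names Eclat alongside Apriori, while the call shown above uses Apriori only. As a rough sketch of how Eclat could be applied with the same package, the example below mines frequent itemsets with eclat() and then derives rules with ruleInduction(). It runs on a small hypothetical data frame rather than the WDI transactions, and the support value of 0.25 is chosen only to suit eight toy rows.

```r
library(arules)

# Hypothetical discretized indicators; not the WDI data
toy <- data.frame(
  income      = factor(c("Low", "Low", "Low", "High", "High", "High", "Low", "High")),
  electricity = factor(c("Low", "Low", "Low", "High", "High", "High", "High", "High")),
  mortality   = factor(c("High", "High", "High", "Low", "Low", "Low", "High", "Low"))
)
trans <- as(toy, "transactions")  # items are named "column=level"

# Step 1: Eclat finds frequent itemsets only; no confidence threshold applies yet
itemsets <- eclat(trans,
                  parameter = list(supp = 0.25, minlen = 2),
                  control   = list(verbose = FALSE))

# Step 2: turn the itemsets into rules and apply the confidence cutoff
rules_eclat <- ruleInduction(itemsets, trans, confidence = 0.6)

inspect(head(sort(rules_eclat, by = "confidence"), 3))
```

Because Eclat and Apriori search for the same frequent itemsets, the two approaches are expected to agree on which rules pass given thresholds; the difference lies mainly in how the search is carried out.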

Plotting Results

Scatter Plot of Confidence vs. Support

This plot visualizes the 3,484 association rules, showing their support (x-axis) and confidence (y-axis). Each dot represents a rule, while its color intensity reflects the lift value, with darker shades indicating stronger relationships.

We can see that most rules have low support (<0.1), meaning they apply to a smaller subset of countries but may still be significant. Higher lift values (darker red points) suggest stronger-than-random associations, making these rules more impactful.

Rules with both high confidence and high lift are the most valuable for understanding economic and social trends.

# Scatter Plot of Confidence vs. Support
plot(rules, method = "scatterplot", measure = c("support", "confidence"))
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Interactive Graph of Association Rules

This interactive network visualization displays the top 10 association rules ranked by lift. Each node represents a development indicator, while edges show strong relationships between them.

Thicker and darker edges indicate stronger associations. The hover feature allows detailed inspection of individual rules, showing which factors are commonly linked.

This visualization highlights key development patterns, such as the relationship between low electricity access, high infant mortality, and low GDP. These insights help in understanding how multiple indicators interact, offering useful information for policy and economic planning.

# Filter top rules for clarity
top_rules <- sort(rules, by = "lift")[1:10]

# Generate interactive graph
plot(top_rules, method = "graph", engine = "htmlwidget")

Parallel Coordinates Plot

This parallel coordinates plot represents the top 10 association rules by lift (the same top_rules object used for the network graph), showing how different antecedents (LHS) connect to the consequent (RHS). Each line represents a rule, with the thickness and color intensity indicating rule strength.

This visualization helps identify which antecedents (LHS) frequently lead to specific consequents (RHS), highlighting key relationships between our selected development indicators.

#Parallel Coordinates Plot
plot(top_rules, method = "paracoord", control = list(reorder = TRUE))

Analysis of Rule Quality Distributions

These histograms provide insights into the quality metrics of the discovered association rules:

  • Rule Support Distribution (blue)
    The majority of rules have low support (mostly below 0.1), meaning they occur in a small proportion of the dataset. However, they can still be meaningful if confidence and lift are high.

  • Rule Confidence Distribution (green)
    Confidence values are mostly high, with a significant portion above 0.9. This suggests that when antecedents occur, the consequents are highly likely to follow, making these rules reliable.

  • Rule Lift Distribution (red)
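    Most rules have lift values above 2 or even 3. Because a lift of 1 corresponds to statistical independence, these values mean that the antecedent and consequent co-occur two to three times more often than chance alone would predict. Higher lift values indicate that the rules could provide meaningful insights for policy and economic planning.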

#Histogram of Rule Support, Confidence, and Lift
hist(quality(rules)$support, breaks = 20, main = "Distribution of Rule Support", col = "blue")

hist(quality(rules)$confidence, breaks = 20, main = "Distribution of Rule Confidence", col = "green")

hist(quality(rules)$lift, breaks = 20, main = "Distribution of Rule Lift", col = "red")

Conclusion and Future Recommendations

This study used Association Rule Mining (ARM) on World Development Indicators (WDI) data to uncover relationships between economic, health, and infrastructure variables. The results highlight important connections, such as the link between low GDP, high infant mortality, and limited electricity access, which may inform policy decisions.

To improve the analysis, alternative discretization methods could be explored. Quartile binning is itself an equal-frequency scheme: it keeps categories balanced, but the cut points depend entirely on the sample. Equal-width bins would instead provide fixed intervals that are easier to interpret, at the cost of unbalanced categories for skewed indicators. Additionally, k-means clustering could identify natural groupings within each indicator, potentially leading to more meaningful rule patterns.
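The difference between these schemes is easiest to see on a skewed variable. The sketch below is a base R illustration on twelve hypothetical values; it compares equal-width bins, quartile bins, and a one-dimensional k-means grouping. The choice of four groups and the label names follow the categories used earlier, while the k-means relabelling step is an assumption about how clusters might be ordered.

```r
# Hypothetical right-skewed indicator values
x <- c(0.1, 0.2, 0.3, 0.5, 0.8, 1.2, 2.0, 3.5, 7.0, 15.0, 16.0, 18.0)
labels4 <- c("Very Low", "Low", "Medium", "High")

# Equal-width: four intervals of identical length across the range of x
equal_width <- cut(x, breaks = 4, labels = labels4)

# Equal-frequency (quartiles): about the same number of observations per bin
equal_freq <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)),
                  labels = labels4, include.lowest = TRUE)

# k-means on the 1-D values; clusters are relabelled by the rank of their centres
set.seed(1)
km <- kmeans(x, centers = 4, nstart = 10)
cluster_rank <- rank(km$centers[, 1])
kmeans_bins <- factor(labels4[cluster_rank[km$cluster]], levels = labels4)

print(table(equal_width))
print(table(equal_freq))
print(table(kmeans_bins))
```

On this toy vector the equal-width scheme places eight of the twelve values in the lowest bin and none in the third, whereas the quartile scheme assigns three values to each bin. Which trade-off is preferable depends on whether interpretability of the cut points or balance of the categories matters more for the rules being sought.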

Future research should consider extending the time period, revisiting the choice of variables, testing other ARM algorithms, and integrating additional statistical techniques to strengthen insights. Refining these methods would further improve our understanding of global development patterns and their policy implications.

References

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Proceedings of the 20th International Conference on Very Large Data Bases, 487–499. https://www.vldb.org/conf/1994/P487.PDF

Barro, R. J., & Lee, J. W. (2013). A new data set of educational attainment in the world, 1950–2010. Journal of Development Economics, 104, 184–198. https://doi.org/10.1016/j.jdeveco.2012.10.001

Bloom, D. E., Canning, D., & Sevilla, J. (2004). The effect of health on economic growth: A production function approach. World Development, 32(1), 1–13. https://doi.org/10.1016/j.worlddev.2003.07.002

Hahsler, M., Grün, B., & Hornik, K. (2005). Introduction to arules – A computational environment for mining association rules and frequent item sets. Journal of Statistical Software, 14(15), 1–25. https://doi.org/10.18637/jss.v014.i15

International Energy Agency (IEA). (2018). World Energy Outlook 2018. IEA Publications. https://www.iea.org/reports/world-energy-outlook-2018

Qiang, C. Z., Rossotto, C. M., & Kimura, K. (2009). Economic impacts of broadband. Information and Communications for Development, 2009, 35–50. https://documents1.worldbank.org/curated/ru/645821468337815208/pdf/487910PUB0EPI1101Official0Use0Only1.pdf