Machine Learning for Peace: Infrastructure and Applications

PDRI-DevLab

University of Pennsylvania

September 2, 2024

Jeremy Springman

Research Assistant Professor

Erik Wibbels, Serkant Adiguzel, Mateo Villamizar Chaparro, Zung-Ru Lin, Donald Moratz, Diego Romero, Hanling Su, Mahda Soltani

Overview


  1. Introducing Machine Learning for Peace
  2. Comparing MLP with other media corpora
  3. Application 1: Forecasting Travel Warnings
  4. Application 2: Studying Russian Influence

Introducing MLP Infrastructure

Background

  • Event data are key for understanding political dynamics
  • Media reports are the most comprehensive documentation
  • Positive: AI/ML provides new tools to extract information from text
  • Negative: Existing media corpora have poor coverage of domestic media outlets in developing countries
  • MLP provides high-quality corpus and flexible text processing infrastructure

High Quality Corpus

Input: Online news

  • 400+ news sources
  • 40 languages
  • 120 million articles

Data quality

  • Focus on high-quality local sources (medium data)
  • Direct, human monitored scraping
  • Much better coverage than extant archivers/aggregators (GDELT, Wayback, Lexis Nexis, etc.)



Output: Monthly data

  • 62 countries
  • 2012 - last month

Detecting Civic Space Events in Text

  • Robustly Optimized BERT Pretraining Approach (RoBERTa)
  • Pre-trained on enormous corpora of data + transfer learning
  • Fine-tuned on double human-coded training data (n=9,875)
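Double coding means each training article is labeled by two human coders. The slides do not say how agreement between coders was measured or reconciled; a minimal sketch, assuming Cohen's kappa is used as the agreement check, with labels invented purely for illustration:

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(coder_a)
    # Share of items on which the two coders agree
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's label frequencies
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels, not drawn from the MLP training data
coder_a = ["protest", "arrest", "protest", "none", "arrest", "none"]
coder_b = ["protest", "arrest", "none", "none", "arrest", "none"]
kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 2))
```

Items on which the coders disagree would typically be adjudicated before fine-tuning; that step is not shown here.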

Data Processing Pipeline


Comparing MLP with Big Data Media Corpora

Findings

  • International media sources have limited, skewed coverage of events in developing countries
    • Event datasets that rely on global language sources will have major biases
  • Accurate data collection from national news sources requires careful human curation
    • GDELT, Common Crawl, Internet Archive include major errors

Importance of Domestic News

::: notes
- Most of our news comes from national sources
- Fundamental differences in the type of events covered by domestic and international sources
:::

Lexis Nexis vs MLP

Countries where LN has zero local sources: 6/56

  • Albania, Belarus, Kosovo, Jamaica, Angola, South Sudan

Comparing LN on other metrics:

  • Fewer languages: 17 vs 34
  • Slightly more local sources per country: 5.5 vs 5 (median; excluding MLP’s regional sources)
  • Shorter, more sporadic coverage over time

Challenges of Scraping

Scraping Case Study

Bangladesh: easiest case for automated scrapers

  • Massive volume
  • Good web architecture
  • 2/5 sources are in English

Scraping Domestic Outlets is Tough

GDELT

  • MLP: 2013 and 2015
  • GDELT: 2019 forward
  • GDELT’s best-covered source: 2,100 articles per month, compared to 2,500 per month from MLP
  • Broken links, redirects, duplicate articles, and advertising
  • Restricts requests to one search every 5 seconds, so that scraping even a single source for the full time-period can take several days
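The rate limit translates directly into wall-clock time. A back-of-the-envelope sketch, where the 5-second spacing comes from the slide and the query count is a hypothetical figure chosen only to illustrate the arithmetic:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_scrape(n_queries, seconds_per_query=5.0):
    """Days of continuous querying needed at a fixed spacing between requests."""
    return n_queries * seconds_per_query / SECONDS_PER_DAY


# At one search every 5 seconds, the ceiling is 17,280 searches per day
queries_per_day = SECONDS_PER_DAY / 5.0

# Hypothetical: 50,000 searches to cover one source over the full period
estimate = days_to_scrape(50_000)
print(f"{queries_per_day:.0f} searches/day, {estimate:.1f} days")
```

The estimate ignores retries, failed requests, and the time needed to download the articles themselves, so real collection would take longer.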

Scraping Domestic Outlets is Tough

Internet Archive

  • Took nearly two weeks to collect URLs from a single source from 2019-2023
  • Numerous irrelevant, broken, and duplicate links
  • Fewer than half of the URLs were usable
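Filtering irrelevant and duplicate links can be sketched as below. This is an illustration under stated assumptions, not the MLP pipeline: it assumes a list of original article URLs has already been recovered from the archive, that one outlet corresponds to one host name, and that query strings carry no article identity (not true for every site). The example URLs are invented.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop trailing slash, query string, and fragment."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def clean_urls(urls, allowed_host):
    """Keep one normalized copy of each http(s) URL on the outlet's own host."""
    seen = set()
    kept = []
    for url in urls:
        normalized = normalize_url(url)
        parts = urlsplit(normalized)
        if parts.scheme not in ("http", "https"):
            continue  # e.g. javascript: or mailto: links
        if parts.netloc != allowed_host:
            continue  # advertising or other off-site links
        if normalized in seen:
            continue  # duplicate after normalization
        seen.add(normalized)
        kept.append(normalized)
    return kept


# Hypothetical input
raw = [
    "https://news.example.com/politics/story-1",
    "https://news.example.com/politics/story-1/",
    "HTTPS://News.Example.com/politics/story-1?utm_source=x",
    "https://ads.example.net/banner",
    "javascript:void(0)",
]
cleaned = clean_urls(raw, "news.example.com")
print(len(cleaned), "of", len(raw), "URLs kept")
```

Broken links cannot be detected from the URL string alone; they require a request to each address, which is where rate limits and human monitoring come in.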

Random Audit of 5 MLP Countries

Task

  • Use algorithm to identify major events
  • Use GPT to summarize 5 most important events
  • Human check of location and event classification accuracy

Results

  • 40 events detected from April - June 2024 (300 possible)
  • Correct country: 34/40
  • Correct event: 38/40
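With only 40 audited events, the accuracy shares carry real sampling uncertainty. A small sketch of the arithmetic, using the counts from the slide; the Wilson interval is an added illustration and was not part of the audit as presented:

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


country_share = 34 / 40   # correct country
event_share = 38 / 40     # correct event
detected_share = 40 / 300  # events detected out of those possible

country_low, country_high = wilson_interval(34, 40)
event_low, event_high = wilson_interval(38, 40)
print(f"country: {country_share:.0%} ({country_low:.0%} to {country_high:.0%})")
print(f"event:   {event_share:.0%} ({event_low:.0%} to {event_high:.0%})")
```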

Forecasting Travel Advisories

Why Travel Advisories?

  • Request from US State Department
    • High-level travel advisories trigger deployment of resources
    • Anticipating location, timing of warnings can help smooth budgets
  • Travel advisories cover political instability, natural disasters, health risks, etc.

Data

  • Target: onset of a serious travel advisory
  • Predictors: MLP data, indicator for continued advisories, years, time trend, Bayesian country encoding
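The onset target and the continued-advisory indicator can both be derived from a monthly series of advisory levels. A minimal sketch, assuming "serious" means level 3 or above; the slides do not define the cutoff, so `threshold` is a hypothetical parameter and the series below is invented:

```python
def onset_labels(levels, threshold=3):
    """1 in months where a serious advisory begins, 0 otherwise."""
    labels = []
    previous_serious = False
    for level in levels:
        serious = level >= threshold
        labels.append(1 if serious and not previous_serious else 0)
        previous_serious = serious
    return labels


def continued_labels(levels, threshold=3):
    """1 in months where a serious advisory carries over from the prior month."""
    labels = []
    previous_serious = False
    for level in levels:
        serious = level >= threshold
        labels.append(1 if serious and previous_serious else 0)
        previous_serious = serious
    return labels


# Hypothetical monthly advisory levels for one country
levels = [1, 2, 3, 3, 4, 2, 1, 3]
onsets = onset_labels(levels)
continued = continued_labels(levels)
print(onsets)
print(continued)
```

A series that is already at a serious level in its first month is counted as an onset here; in practice such left-censored spells might be dropped instead.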

Modeling

  • Forecast Horizon: 3 and 6 months
  • Model: LightGBM + Temporal CV
  • Hyperparameters: Wide grid search for learning rate, proportion of features, depth of trees
  • Evaluation Metrics:
    • ROC-AUC: ranking months in test set
    • AUC-PR: more informative than ROC-AUC when positive cases are rare
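Temporal cross-validation and the two metrics can be sketched in plain Python. This is an illustration only: the LightGBM model itself is not shown, the fold layout and `gap` parameter are assumptions about how a forecast horizon might be respected, the average-precision formula assumes no tied scores, and the labels and scores are invented.

```python
def expanding_window_splits(n_periods, n_folds, test_size, gap=0):
    """Yield (train, test) index lists; training always ends `gap` periods before the test block."""
    first_test_start = n_periods - n_folds * test_size
    if first_test_start - gap < 1:
        raise ValueError("not enough periods for the requested folds and gap")
    for fold in range(n_folds):
        start = first_test_start + fold * test_size
        yield list(range(start - gap)), list(range(start, start + test_size))


def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive case."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# 12 months, 3 folds of 2 test months each, 3-month gap before each test block
splits = list(expanding_window_splits(12, 3, 2, gap=3))

# Hypothetical imbalanced test set: 2 onsets in 10 country-months
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.15, 0.6, 0.05, 0.25, 0.12, 0.7, 0.4]
auc = roc_auc(y_true, scores)
ap = average_precision(y_true, scores)
print(f"ROC-AUC {auc:.3f}, average precision {ap:.3f}")
```

In this toy example a single high-scoring negative costs far more in average precision than in ROC-AUC, which is why the precision-recall view is stricter when onsets are rare.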

Performance

Feature Importance

Tracking Foreign Authoritarian Influence

Resurgent Authoritarian Influence?


  1. Introduce data on Russian and Chinese influence on developing countries
  2. Describe influence cross-nationally and over time
  3. Examine Russia’s pre-invasion behavior

Resurgent Authoritarian Influence?

Background:

  • Collapse of Soviet Union: less influence by autocracies
  • Recently, powerful autocracies becoming more assertive in foreign policy

Foreign Influence:

  • Actions by the government of an influencing country to affect the policies, capacity, or behavior of a target country to advance its own national interests

Influence Tools

Example of Hard Power

Describing RAI: Spheres


  1. Russian influence is more concentrated in a geographic sphere of influence
  2. Spheres of influence are shifting over time
  3. Russia has dramatically expanded its sphere in recent years, challenging China’s dominance

Describing RAI: Tools


  • Economic Power is the most prevalent theme
  • Beginning in 2022, Diplomacy increases dramatically in places where Russia’s influence grew

Did Russia Signal Invasion?

Hard Power

Diplomacy

                         Accuracy
Period       Statistic   False Positive   True Positive   Total
Aug-Jan      Count       1                7               8
             Row pct     12.5%            87.5%
Feb (Pre)    Count       6                4               10
             Row pct     60.0%            40.0%
Feb (Post)   Count       10               3               13
             Row pct     76.9%            23.1%
Missing      Count       6                8               14
             Row pct     42.9%            57.1%
Total        Count       23               22              45

Diplomacy

  • Countries with change points:
    • 2022: 30
    • 2021: 5
    • Previous high of 4 in 2014
    • 2017-2020: average of 1
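The slides do not describe how change points are detected. As a generic illustration of the idea only, not the method used here, a single mean-shift change point in a monthly article-count series can be located by finding the split that most reduces squared error. The series and the `min_segment` parameter are hypothetical.

```python
def best_mean_shift(series, min_segment=3):
    """Return (index, gain) for the split that most reduces squared error, or (None, 0.0)."""

    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    n = len(series)
    if n < 2 * min_segment:
        raise ValueError("series too short for the requested minimum segment")
    total = sse(series)
    best_index, best_gain = None, 0.0
    for k in range(min_segment, n - min_segment + 1):
        # Reduction in squared error from fitting separate means before and after k
        gain = total - sse(series[:k]) - sse(series[k:])
        if gain > best_gain:
            best_index, best_gain = k, gain
    return best_index, best_gain


# Hypothetical monthly counts of articles on Russian diplomacy in one country
counts = [2, 3, 2, 3, 2, 3, 9, 10, 9, 10, 9, 10]
index, gain = best_mean_shift(counts)
print(index, gain)
```

A real detector would also need a rule for deciding whether the best split is large enough to count as a change point, for example a penalty term or a significance threshold.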

Diplomacy

  • True Positives:
    • Aug-Jan: Antagonism (Kosovo), celebration (Bangladesh), military intervention (Kazakhstan)
    • Feb 1-23: High profile meetings (Serbia, Belarus, Nicaragua, Hungary, Turkey), major statements (Albania, India)
  • False Positives:
    • Aug-Jan: domestic reporting on geopolitics (El Salvador, Peru)
    • Feb 1-23: domestic reporting on/criticism of Russian build-up (Cameroon, Jamaica, Nepal, Philippines)

Hard Power

                         Accuracy
Period       Statistic   False Positive   True Positive   Total
Aug-Jan      Count       5                6               11
             Row pct     45.5%            54.5%
Feb (Pre)    Count       2                                2
             Row pct     100.0%
Mar-Dec      Count       1                                1
             Row pct     100.0%
Total        Count       8                6               14

Hard Power

  • Countries with change points:
    • 2022: 18
    • 2021: 8
    • Previous high of 5 in 2015
    • 2016-2020: average of 2.2

Hard Power

  • True Positives:
    • Aug-Jan: direct security cooperation (Mali, Pakistan, Burkina Faso, Belarus), security antagonism (Kosovo, Georgia)
    • Feb 1-23: direct security engagement (Ethiopia)
  • False Positives:
    • Aug-Jan: security response over Russia’s influence on neighbors (Angola, Niger, Colombia), domestic reporting on geopolitics (Cameroon, Peru)
    • Feb 1-23: security response over Russia’s influence on neighbors (Hungary), domestic reporting on geopolitics (Bangladesh, India)

Explaining Increased Influence

  • Explain variation in:
    • Countries with largest increase in Russian influence
  • Potential causes of targeting:
    • Geopolitical alignment (UN Ideal Point Distance)
    • Strategic value (imports and exports)

Diplomacy - Change in Articles

Hard Power - Change in Articles

Appendix

Accuracy

Russian Import Reliance

Russian Export Reliance

Exporters of Russian Imports

Importers of Russian Exports