Descriptive statistics (Scharpf)

We start by plotting the distribution of soaGEN, the number of SOA courses attended per country-year. The distribution has many zeros and a long right tail.

library(ggplot2)  # plotting functions used throughout

ggplot(df, aes(x = soaGEN)) +
  geom_histogram(binwidth = 20, fill = "black") +
  labs(x = "Number of SOA courses attended (country-year)",
       y = "Frequency",
       title = "Distribution of the dependent variable") +
  theme_minimal()
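To put numbers on what the histogram shows, a minimal sketch along the following lines could report the share of zeros and a few quantiles. It runs on a simulated stand-in (`toy`, a hypothetical data frame with arbitrary parameters) rather than on the real `df`, so the printed values say nothing about the SOA data.

```r
# Simulated stand-in for df; the distribution parameters are arbitrary assumptions.
set.seed(42)
toy <- data.frame(soaGEN = rnbinom(1000, mu = 38, size = 0.3))

# Share of observations with zero courses
zero_share <- mean(toy$soaGEN == 0, na.rm = TRUE)

# Median, upper quantiles, and maximum describe the right tail
tail_summary <- quantile(toy$soaGEN, probs = c(0.5, 0.9, 0.99, 1), na.rm = TRUE)

zero_share
tail_summary
```

Replacing `toy` with `df` would give the corresponding figures for the actual data.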

The mean of soaGEN across all country-years:

mean(df$soaGEN, na.rm = TRUE)

## [1] 37.98802

The variance of soaGEN:

var(df$soaGEN, na.rm = TRUE)

## [1] 6167.665

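The variance-to-mean ratio gives a quick check for overdispersion. Under a Poisson distribution the variance equals the mean, so the ratio would be close to 1. Here it is about 162, which indicates strong overdispersion.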

var(df$soaGEN, na.rm = TRUE) / mean(df$soaGEN, na.rm = TRUE)

## [1] 162.3582
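As an illustration of how this ratio behaves, the sketch below compares simulated Poisson counts with simulated negative binomial counts of the same mean. The helper `dispersion_ratio()` and all simulation parameters are hypothetical choices made for the example, not part of the analysis above. It also compares the zero share a Poisson distribution with the same mean would predict against the simulated one.

```r
# Hypothetical helper: variance-to-mean ratio of a count vector
dispersion_ratio <- function(x) {
  var(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}

set.seed(1)
pois_counts <- rpois(2000, lambda = 38)            # equidispersed benchmark
nb_counts   <- rnbinom(2000, mu = 38, size = 0.3)  # overdispersed, arbitrary parameters

ratio_pois <- dispersion_ratio(pois_counts)  # close to 1
ratio_nb   <- dispersion_ratio(nb_counts)    # far above 1

# Zero share predicted by a Poisson with the same mean, versus the simulated share
expected_zero_pois <- dpois(0, lambda = mean(nb_counts))
observed_zero_nb   <- mean(nb_counts == 0)

c(ratio_pois = ratio_pois, ratio_nb = ratio_nb)
c(expected = expected_zero_pois, observed = observed_zero_nb)
```

With a mean near 38, a Poisson distribution predicts essentially no zeros, so a visible spike at zero is a second sign that a plain Poisson model would fit poorly.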

The following code builds a country-level summary table with each country’s total, mean, and maximum number of SOA courses, the share of country-years with zero courses, and the number of rows, sorted by total in descending order.

library(dplyr)  # provides %>%, group_by(), summarise(), arrange()

country_stats <- df %>%
  group_by(countryname) %>%
  summarise(
    total_soa  = sum(soaGEN, na.rm = TRUE),
    mean_soa   = mean(soaGEN, na.rm = TRUE),
    max_soa    = max(soaGEN, na.rm = TRUE),
    zero_share = mean(soaGEN == 0, na.rm = TRUE),
    n_obs      = n()
  ) %>%
  arrange(desc(total_soa))

The resulting country-level summary table:

country_stats

## # A tibble: 33 × 6
##    countryname total_soa mean_soa max_soa zero_share n_obs
##    <chr>           <dbl>    <dbl>   <dbl>      <dbl> <int>
##  1 Colombia        10427    177.      552     0.102     59
##  2 El Salvador      6808    115.      786     0.0678    59
##  3 Peru             4520     76.6     323     0.0508    59
##  4 Nicaragua        4517     76.6     392     0.407     59
##  5 Bolivia          4203     71.2     366     0.136     59
##  6 Chile            3798     64.4     492     0.254     59
##  7 Ecuador          3603     61.1     326     0.102     59
##  8 Venezuela        3583     60.7     372     0.0847    59
##  9 Panama           3565     60.4     302     0.305     59
## 10 Honduras         3563     60.4     234     0.0508    59
## # ℹ 23 more rows
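One detail of this summary is easy to overlook: `n()` counts all rows in a group, including rows where soaGEN is missing, while the statistics computed with `na.rm = TRUE` use only the non-missing rows. The sketch below shows the difference on a small made-up panel. The data frame `toy` and the column `n_nonmiss` are hypothetical and only serve the illustration.

```r
library(dplyr)

# Made-up panel: B has one missing year, C has no observed values at all
toy <- data.frame(
  countryname = c("A", "A", "A", "B", "B", "B", "C", "C"),
  soaGEN      = c(0, 5, 10, 0, NA, 4, NA, NA)
)

toy_stats <- toy %>%
  group_by(countryname) %>%
  summarise(
    n_rows     = n(),                  # all rows, including missing soaGEN
    n_nonmiss  = sum(!is.na(soaGEN)),  # rows that actually enter the statistics
    zero_share = mean(soaGEN == 0, na.rm = TRUE)
  )

toy_stats
```

For a group with no observed values, `zero_share` is `NaN`, and `max(soaGEN, na.rm = TRUE)` would return `-Inf` with a warning, so a non-missing count next to `n_obs` can help when reading the table.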

The bar chart below ranks countries by their total number of SOA courses, which highlights the heaviest users of SOA training.

ggplot(country_stats, aes(x = reorder(countryname, total_soa), y = total_soa)) +
  geom_col() +
  coord_flip() +
  labs(x = "Country", y = "Total SOA courses 1946-2004",
       title = "Heavy users of SOA training") +
  theme_minimal()

The next chart ranks countries by the share of country-years with zero SOA courses. Countries with a high share are potential never-takers.

ggplot(country_stats, aes(x = reorder(countryname, zero_share), y = zero_share)) +
  geom_col(fill = "darkred") +
  coord_flip() +
  labs(x = "Country", y = "Share of country-years with zero SOA courses",
       title = "Potential never-takers") +
  theme_minimal()

The heatmap below shows how many SOA courses each Latin American country attended in each year between 1946 and 2004. Each tile is one country-year, and the fill colour encodes the number of courses on a logarithmic scale: dark blue means few or none, light teal means many. Grey tiles mark missing data, so grey is not the same as zero. Countries are sorted from the heaviest user (top) to the lightest (bottom).

ggplot(df, aes(x = year, 
               y = reorder(countryname, soaGEN, sum, na.rm = TRUE),
               fill = log1p(soaGEN))) +  # log(1 + x): zeros stay finite and keep a colour
  geom_tile() +
  scale_fill_viridis_c(option = "mako", na.value = "grey80") +
  labs(x = "Year",
       y = "Country",
       fill = "log(1 + SOA courses)",
       title = "SOA course attendance by country and year, 1946-2004") +
  theme_minimal()
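A point worth checking with any logarithmic fill is how zeros are handled. In R, `log(0)` evaluates to `-Inf`, and, to our understanding of ggplot2's continuous colour scales, non-finite values are drawn in the `na.value` colour, which would make zero counts look like missing data. The base R sketch below compares `log()` with `log1p()`, which computes log(1 + x) and keeps zeros finite. The vector `counts` is made up for the example.

```r
# Made-up counts, including a zero and a missing value
counts <- c(0, 1, 10, 100, NA)

log_fill   <- log(counts)    # 0 becomes -Inf
log1p_fill <- log1p(counts)  # log(1 + x): 0 stays 0

data.frame(counts, log_fill, log1p_fill)

# Non-finite but not missing: these are the zero counts under log()
is.infinite(log_fill)
```

Both transforms leave `NA` as `NA`, so only truly missing country-years would remain grey when `log1p()` is used for the fill.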

David

2026-05-07