We start by plotting the distribution of soaGEN, the number of SOA courses attended per country-year. The histogram shows many zeros and a long right tail.
ggplot(df, aes(x = soaGEN)) +
  geom_histogram(binwidth = 20, fill = "black") +
  labs(x = "Number of SOA courses attended (country-year)",
       y = "Frequency",
       title = "Distribution of the dependent variable") +
  theme_minimal()
Calculating the mean.
mean(df$soaGEN, na.rm = TRUE)
## [1] 37.98802
Calculating the variance.
var(df$soaGEN, na.rm = TRUE)
## [1] 6167.665
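Calculating the variance-to-mean ratio to check for overdispersion. A Poisson variable has a variance equal to its mean, so a ratio far above 1 signals overdispersion.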
var(df$soaGEN, na.rm = TRUE) / mean(df$soaGEN, na.rm = TRUE)
## [1] 162.3582
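To put a ratio of this size in context, the sketch below compares the variance-to-mean ratio of simulated Poisson draws with that of simulated negative binomial draws. It is only an illustration: the helper `dispersion_ratio()`, the sample size, and the `size = 0.3` dispersion parameter are assumptions chosen for the example, not values estimated from the SOA data.

```r
# Illustrative simulation only; these draws are synthetic, not the SOA data.
set.seed(1)

# Hypothetical helper: variance-to-mean ratio, ignoring missing values.
dispersion_ratio <- function(x) {
  var(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}

n <- 10000
poisson_draws <- rpois(n, lambda = 38)            # variance equals the mean
negbin_draws  <- rnbinom(n, size = 0.3, mu = 38)  # variance = mu + mu^2 / size

dispersion_ratio(poisson_draws)  # close to 1
dispersion_ratio(negbin_draws)   # far above 1
```

The same helper could be applied to `df$soaGEN` to reproduce the ratio reported above.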
This code creates a country-level summary table showing each country’s total, average, and maximum number of SOA courses, along with the share of zero-years (among non-missing observations) and the number of country-year rows, sorted by total usage.
country_stats <- df %>%
  group_by(countryname) %>%
  summarise(
    total_soa  = sum(soaGEN, na.rm = TRUE),
    mean_soa   = mean(soaGEN, na.rm = TRUE),
    max_soa    = max(soaGEN, na.rm = TRUE),
    zero_share = mean(soaGEN == 0, na.rm = TRUE),
    n_obs      = n()
  ) %>%
  arrange(desc(total_soa))
Our country-level summary table (the first 10 of 33 countries are printed).
country_stats
## # A tibble: 33 × 6
##    countryname total_soa mean_soa max_soa zero_share n_obs
##    <chr>           <dbl>    <dbl>   <dbl>      <dbl> <int>
##  1 Colombia        10427    177.      552     0.102     59
##  2 El Salvador      6808    115.      786     0.0678    59
##  3 Peru             4520     76.6     323     0.0508    59
##  4 Nicaragua        4517     76.6     392     0.407     59
##  5 Bolivia          4203     71.2     366     0.136     59
##  6 Chile            3798     64.4     492     0.254     59
##  7 Ecuador          3603     61.1     326     0.102     59
##  8 Venezuela        3583     60.7     372     0.0847    59
##  9 Panama           3565     60.4     302     0.305     59
## 10 Honduras         3563     60.4     234     0.0508    59
## # ℹ 23 more rows
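The zero_share column relies on the fact that the mean of a logical vector is the proportion of TRUE values. A minimal sketch on a made-up vector (not the real data) shows how missing years are dropped from the share but would still be counted by a row count such as n_obs:

```r
# Toy vector, not the real data: five country-years, one of them missing.
courses <- c(0, 12, 0, NA, 30)

# Comparing to zero gives TRUE/FALSE/NA; the mean of a logical vector
# is the proportion of TRUEs.
is_zero <- courses == 0

zero_share <- mean(is_zero, na.rm = TRUE)  # 2 zeros out of 4 observed values
n_rows     <- length(courses)              # counts the missing year as well

zero_share  # 0.5
n_rows      # 5
```

So for a country with missing years, zero_share is a share of observed years, not of all rows counted in n_obs.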
A bar chart of total SOA courses per country, which shows the heaviest users of SOA training at the top.
ggplot(country_stats, aes(x = reorder(countryname, total_soa), y = total_soa)) +
  geom_col() +
  coord_flip() +
  labs(x = "Country", y = "Total SOA courses 1946-2004",
       title = "Heavy users of SOA training") +
  theme_minimal()
A bar chart of each country's share of zero-years, which flags potential never-takers.
ggplot(country_stats, aes(x = reorder(countryname, zero_share), y = zero_share)) +
  geom_col(fill = "darkred") +
  coord_flip() +
  labs(x = "Country", y = "Share of country-years with zero SOA courses",
       title = "Potential never-takers") +
  theme_minimal()
The heatmap shows how many SOA courses each Latin American country attended in each year between 1946 and 2004. Each tile is one country in one year, and its colour shows the number of courses attended on a logarithmic scale: dark blue means few or zero, light teal means many. Grey tiles mean we have no data, so grey is not the same as zero. Countries are sorted from heaviest user (top) to lightest (bottom).
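# log1p(x) = log(1 + x) keeps zero-years finite (log1p(0) = 0), so they get the
# darkest colour; log(0) is -Inf and would fall off the colour gradient.
ggplot(df, aes(x = year,
               y = reorder(countryname, soaGEN, sum, na.rm = TRUE),
               fill = log1p(soaGEN))) +
  geom_tile() +
  scale_fill_viridis_c(option = "mako", na.value = "grey80") +
  labs(x = "Year",
       y = "Country",
       fill = "log(1 + SOA courses)",
       title = "SOA course attendance by country and year, 1946-2004") +
  theme_minimal()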
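Because the data contain many zeros, it can help to check how a log transform treats zeros and missing values before reading the colour scale. The sketch below uses made-up values (not the real data) and only base R:

```r
# Toy values, not the real data: a zero, two positive counts, and a missing year.
courses <- c(0, 5, 500, NA)

log(courses)    # -Inf for the zero, NA for the missing year
log1p(courses)  # log(1 + x): 0 for the zero, NA for the missing year
```

With log1p(), a zero maps to the bottom of the scale and stays distinct from a missing value, which is what the reading "zero is dark blue, grey means no information" requires.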