Is there a correlation between national wealth and R&D spending as a share of GDP?
This project uses two datasets from Our World in Data to examine a possible correlation between a country’s GDP per capita and the amount it spends on research and development (basic research, applied research, and experimental development) as a percentage of its GDP. The first dataset, “Research & development spending as a share of GDP”, contains 2605 rows, each identified by entity (country, region, or economic category) and year. The second dataset, “GDP per capita”, contains 7311 observations.
| Column names | Dataset # | Description |
|---|---|---|
| Entity | 1 & 2 | Categorical: Country/territory/region/category name |
| Code | 1 & 2 | Categorical: Country/territory ISO 3166-1 alpha-3 code |
| Year | 1 & 2 | Quantitative, discrete: year of observation |
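| Research and development expenditure (% of GDP) | 1 | Quantitative, continuous: total R&D expenditure divided by total GDP, as a percentage |
| GDP per capita, PPP (constant 2021 international $) | 2 | Quantitative, continuous: total GDP divided by population, adjusted for purchasing power parity |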
Dataset References:
Dataset 1: “Data Page: Research & development spending as a share of GDP”. Our World in Data (2025). Data adapted from UNESCO Institute for Statistics (UIS) Bulk Data Service, via World Bank. Retrieved from https://archive.ourworldindata.org/20250916-102301/grapher/research-spending-gdp.html [online resource] (archived on September 16, 2025).
Dataset 2: “Data Page: GDP per capita”, part of the following publication: Max Roser, Bertha Rohenkohl, Pablo Arriagada, Joe Hasell, Hannah Ritchie, and Esteban Ortiz-Ospina (2023) - “Economic Growth”. Data adapted from Eurostat, OECD, IMF, and World Bank. Retrieved from https://archive.ourworldindata.org/20251017-101247/grapher/gdp-per-capita-worldbank.html [online resource] (archived on October 17, 2025).
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
spending_data_raw <- read_csv("D:/DATA 101/Datasets/research-spending-gdp.csv")
## Rows: 2605 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Entity, Code
## dbl (2): Year, Research and development expenditure (% of GDP)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gdp_data_raw <- read_csv("D:/DATA 101/Datasets/gdp-per-capita-worldbank.csv")
## Rows: 7311 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Entity, Code, World regions according to OWID
## dbl (2): Year, GDP per capita, PPP (constant 2021 international $)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(spending_data_raw)
## # A tibble: 6 × 4
## Entity Code Year `Research and development expenditure (% of GDP)`
## <chr> <chr> <dbl> <dbl>
## 1 Albania ALB 2007 0.0876
## 2 Albania ALB 2008 0.154
## 3 Algeria DZA 2001 0.212
## 4 Algeria DZA 2002 0.338
## 5 Algeria DZA 2003 0.181
## 6 Algeria DZA 2004 0.152
head(gdp_data_raw)
## # A tibble: 6 × 5
## Entity Code Year GDP per capita, PPP (constant…¹ World regions accord…²
## <chr> <chr> <dbl> <dbl> <chr>
## 1 Afghanistan AFG 2000 1618. <NA>
## 2 Afghanistan AFG 2001 1454. <NA>
## 3 Afghanistan AFG 2002 1774. <NA>
## 4 Afghanistan AFG 2003 1816. <NA>
## 5 Afghanistan AFG 2004 1777. <NA>
## 6 Afghanistan AFG 2005 1908. <NA>
## # ℹ abbreviated names: ¹`GDP per capita, PPP (constant 2021 international $)`,
## # ²`World regions according to OWID`
data <- spending_data_raw |>
  left_join(gdp_data_raw, by = c("Entity", "Year"))
names(data) <- tolower(names(data))
data <- data |>
  select(-c("code.y", "world regions according to owid")) |>
  rename(
    "code" = "code.x",
    "rd_percent" = "research and development expenditure (% of gdp)",
    "gdp_capita" = "gdp per capita, ppp (constant 2021 international $)",
    "name" = "entity"
  ) |>
  filter(code != "OWID_WRL" | is.na(code)) ## keeping NAs until regions are separated
colSums(is.na(data)) ## not worried about 50 NAs in gdp_capita in a dataset of 2365 observations
## name code year rd_percent gdp_capita
## 0 213 0 0 50
data_regions <- data |> ## extract regional data before deleting it from main dataset
  filter(is.na(code)) |>
  select(-code)
data <- data |>
  filter(!is.na(code))
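To see which rows account for the missing `gdp_capita` values after a join like the one above, an anti-join lists the spending rows that have no GDP match. The following is a minimal sketch on made-up toy data; `spending_toy`, `gdp_toy`, and their values are hypothetical stand-ins, not the real datasets.

```r
library(dplyr)

# Hypothetical toy tables standing in for the two raw datasets
spending_toy <- data.frame(
  Entity = c("A", "A", "B"),
  Year   = c(2000, 2001, 2000),
  rd     = c(1.0, 1.1, 0.5)
)
gdp_toy <- data.frame(
  Entity = c("A", "B"),
  Year   = c(2000, 2000),
  gdp    = c(100, 50)
)

# Rows of spending_toy with no matching Entity/Year in gdp_toy;
# these are the rows that would get NA for GDP after a left_join()
unmatched <- anti_join(spending_toy, gdp_toy, by = c("Entity", "Year"))
print(unmatched)
```

Applied to the real tables, the same call with `spending_data_raw` and `gdp_data_raw` should show whether the unmatched rows cluster in a few entities or years.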
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
data_yearly <- data |>
  group_by(year) |>
  summarise(
    rd_percent = mean(rd_percent, na.rm = TRUE),
    gdp_capita = mean(gdp_capita, na.rm = TRUE)
  ) |>
  filter(year != 2023) ## minimal data for 2023 is throwing off plot
highchart() |>
  hc_chart(type = "line", backgroundColor = "black") |>
  hc_title(text = "World Average GDP: per capita and R&D %",
           style = list(color = "aliceblue")) |>
  hc_xAxis(categories = data_yearly$year,
           title = list(text = "Year",
                        style = list(color = "aliceblue")),
           labels = list(style = list(color = "aliceblue"))) |>
  hc_yAxis_multiples(
    list(title = list(text = "R&D % of GDP",
                      style = list(color = "aliceblue")),
         labels = list(style = list(color = "aliceblue"))),
    list(title = list(text = "GDP per Capita (PPP, constant 2021 international $)",
                      style = list(color = "aliceblue")),
         opposite = TRUE,
         labels = list(style = list(color = "aliceblue")))
  ) |>
  hc_add_series(
    name = "R&D %",
    data = data_yearly$rd_percent,
    yAxis = 0,
    color = "#C090D0"
  ) |>
  hc_add_series(
    name = "GDP per capita",
    data = data_yearly$gdp_capita,
    yAxis = 1,
    color = "#39FF14"
  ) |>
  hc_tooltip(shared = TRUE) |>
  hc_legend(
    backgroundColor = "aliceblue"
  )
data_regions2 <- data_regions |>
  filter(str_detect(name, "WB")) |>
  mutate(name = if_else(
    name == "Middle East, North Africa, Afghanistan and Pakistan (WB)",
    "Middle East & North Africa",
    name
  ))
legend_order <- data_regions2 |>
  group_by(name) |>
  filter(year == max(year)) |>
  arrange(desc(gdp_capita)) |>
  pull(name)
data_regions2 <- data_regions2 |>
  mutate(name = factor(name, levels = legend_order))
highchart() |>
  hc_chart(type = "line") |>
  hc_add_series(
    data = data_regions2,
    type = "line",
    hcaes(x = year, y = gdp_capita, group = name)
  ) |>
  hc_title(text = "GDP per Capita by Region (1996–2022)") |>
  hc_xAxis(
    title = list(text = "Year"),
    tickInterval = 4
  ) |>
  hc_yAxis(
    title = list(text = "GDP per Capita (PPP, constant 2021 international $)")
  ) |>
  hc_tooltip(
    shared = TRUE,
    valuePrefix = "$"
  ) |>
  hc_legend(
    layout = "vertical",
    align = "right",
    verticalAlign = "middle"
  )
anova_data <- data_regions |>
  filter(str_detect(name, "income"))
anova_result <- aov(gdp_capita ~ name, data=anova_data)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## name 2 1.753e+10 8.767e+09 392.1 <2e-16 ***
## Residuals 53 1.185e+09 2.236e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(anova_result)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = gdp_capita ~ name, data = anova_data)
##
## $name
## diff
## Lower-middle-income countries-High-income countries -41344.94
## Upper-middle-income countries-High-income countries -33487.92
## Upper-middle-income countries-Lower-middle-income countries 7857.02
## lwr
## Lower-middle-income countries-High-income countries -46490.519
## Upper-middle-income countries-High-income countries -36722.915
## Upper-middle-income countries-Lower-middle-income countries 2630.717
## upr p adj
## Lower-middle-income countries-High-income countries -36199.35 0.0000000
## Upper-middle-income countries-High-income countries -30252.91 0.0000000
## Upper-middle-income countries-Lower-middle-income countries 13083.32 0.0018484
This first test shows that the income categories, restricted in this project to lower-middle, upper-middle, and high income, differ sharply in GDP per capita: a country’s income categorization is strongly associated with, and possibly dependent on, its GDP per capita. The overall F-test and every pairwise Tukey comparison have p-values far below the significance level \(\alpha=0.05\) (corresponding to the 95% confidence level). We can therefore use income categorization in place of GDP per capita in a second ANOVA test, this time on R&D spending. Going forward, we’ll refer to GDP/income categorization as ‘national wealth’.
\[ H_0: \mu_1 = \mu_2 = \mu_3 \]
\[ H_A: \text{at least one } \mu_i \text{ differs from the others} \]
anova_result2 <- aov(rd_percent ~ name, data=anova_data)
summary(anova_result2)
## Df Sum Sq Mean Sq F value Pr(>F)
## name 2 25.069 12.534 118.1 <2e-16 ***
## Residuals 53 5.625 0.106
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(anova_result2)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = rd_percent ~ name, data = anova_data)
##
## $name
## diff
## Lower-middle-income countries-High-income countries -1.8273202
## Upper-middle-income countries-High-income countries -1.1323496
## Upper-middle-income countries-Lower-middle-income countries 0.6949706
## lwr
## Lower-middle-income countries-High-income countries -2.181850
## Upper-middle-income countries-High-income countries -1.355241
## Upper-middle-income countries-Lower-middle-income countries 0.334879
## upr p adj
## Lower-middle-income countries-High-income countries -1.4727902 0.00e+00
## Upper-middle-income countries-High-income countries -0.9094585 0.00e+00
## Upper-middle-income countries-Lower-middle-income countries 1.0550621 6.51e-05
Frankly, I was not expecting such clear results. The question “is there a correlation” is answered clearly by a p-value \(< 2 \times10^{-16}\), well below the 0.05 significance level that corresponds to the 95% confidence level. The Tukey HSD test gives more insight into how the income categories differ from one another. The largest differences, with adjusted p-values that round to 0, are between the middle-income groups and high-income countries, with mean differences of \(\approx-1.83\) and \(\approx-1.13\) percentage points for lower-middle and upper-middle respectively. The difference in R&D spending as a share of GDP between upper-middle-income and lower-middle-income countries is smaller (\(\approx0.69\)) yet still statistically significant (adjusted \(p \approx 6.5 \times 10^{-5}\)). These results indicate that it is appropriate to reject the null hypothesis in favor of the alternative: mean R&D spending as a share of GDP differs across income categories, which means there is an association between national wealth (GDP per capita/income categorization) and the share of GDP spent on research and development.
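As a check on the second ANOVA table, the F statistic can be recomputed by hand from the reported sums of squares. The sketch below uses only the numbers printed by `summary(anova_result2)`. The effect size (eta-squared) is an addition that the original analysis does not compute, and the variable names are illustrative.

```r
# Values copied from the summary(anova_result2) output (rounded as printed)
ss_between <- 25.069  # Sum Sq for `name`
df_between <- 2
ss_within  <- 5.625   # Sum Sq for Residuals
df_within  <- 53

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of between-group to within-group mean squares
f_stat <- ms_between / ms_within  # about 118.1, matching the table

# Eta-squared: share of total variation explained by income category
eta_sq <- ss_between / (ss_between + ss_within)  # about 0.82

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

print(c(F = f_stat, eta_squared = eta_sq, p = p_value))
```

Under these numbers, income category accounts for roughly four fifths of the variation in the regional R&D shares, which puts a size on the difference in addition to its significance.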
The answer to my question is essentially yes. My ANOVA model and the preceding analyses provide strong evidence of a significant association between GDP per capita and national expenditure on research and development as a share of GDP. I was able to reject my null hypothesis, that the different economic categories have equal mean R&D spending as a share of GDP, because the p-value from my model is below my chosen significance level and any other commonly chosen \(\alpha\).
In the future, I’d be interested in investigating causation rather than correlation. Does increased GDP per capita tend to precipitate higher shares towards R&D or vice versa? Or, is there a third (or more) factor that causes both of these initial variables to change simultaneously? I might accomplish this by taking the entities with the most annual data and calculating the rate of GDP growth, then graphing it with R&D expenditures and seeing where the changes start.
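One possible first step for that follow-up is computing year-over-year GDP growth per entity alongside the change in R&D share. This is a minimal sketch on made-up toy data: the `toy` table and its values are hypothetical, and the column names simply mirror the cleaned dataset. Growth is left as `NA` whenever the previous observation is not the immediately preceding year, since the real data has gaps.

```r
library(dplyr)

# Hypothetical toy data shaped like the cleaned `data` tibble
toy <- data.frame(
  name       = rep(c("A", "B"), each = 4),
  year       = rep(2000:2003, times = 2),
  gdp_capita = c(100, 110, 121, 133.1, 200, 210, 220.5, 231.525),
  rd_percent = c(1.0, 1.1, 1.3, 1.2, 2.0, 2.0, 2.1, 2.3)
)

growth <- toy |>
  arrange(name, year) |>
  group_by(name) |>
  mutate(
    # TRUE only when the previous row is exactly one year earlier
    consecutive = year - dplyr::lag(year) == 1,
    # Percent change in GDP per capita from the previous year
    gdp_growth = if_else(
      consecutive,
      (gdp_capita / dplyr::lag(gdp_capita) - 1) * 100,
      NA_real_
    ),
    # Change in R&D share, in percentage points
    rd_change = if_else(
      consecutive,
      rd_percent - dplyr::lag(rd_percent),
      NA_real_
    )
  ) |>
  ungroup()

print(growth)
```

Plotting `gdp_growth` and `rd_change` against `year` for each entity would then show which series tends to move first, though timing alone would not establish causation.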
Dataset 1: “Data Page: Research & development spending as a share of GDP”. Our World in Data (2025). Data adapted from UNESCO Institute for Statistics (UIS) Bulk Data Service, via World Bank. Retrieved from https://archive.ourworldindata.org/20250916-102301/grapher/research-spending-gdp.html [online resource] (archived on September 16, 2025).
Dataset 2: “Data Page: GDP per capita”, part of the following publication: Max Roser, Bertha Rohenkohl, Pablo Arriagada, Joe Hasell, Hannah Ritchie, and Esteban Ortiz-Ospina (2023) - “Economic Growth”. Data adapted from Eurostat, OECD, IMF, and World Bank. Retrieved from https://archive.ourworldindata.org/20251017-101247/grapher/gdp-per-capita-worldbank.html [online resource] (archived on October 17, 2025).
Wickham H (2025). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.6.0, https://stringr.tidyverse.org. (This is in tidyverse, but we didn’t cover it in this class.)