Correlating GDP with % put towards R&D

Introduction

Is there a correlation between national wealth and R&D spending as a share of GDP?

This project uses two datasets from Our World in Data to study a possible correlation between a country’s GDP per capita and the amount it spends on research and development (basic research, applied research, and experimental development) as a percentage of its GDP. The first dataset, “Research & development spending as a share of GDP”, contains 2605 rows, each identified by an entity (country, region, or economic category) and a year. The second dataset, “GDP per capita”, consists of 7311 observations.

Included Columns
Column names	Dataset #	Description
Entity	1 & 2	Categorical: Country/territory/region/category name
Code	1 & 2	Categorical: Country/territory ISO 3166-1 alpha-3 code
Year	1 & 2	Quantitative, discrete: year of observation
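Research and development expenditure (% of GDP)	1	Quantitative, continuous: total R&D expenditure as a percentage of total GDP
GDP per capita, PPP (constant 2021 international $)	2	Quantitative, continuous: total GDP divided by population, adjusted for purchasing power parity (PPP)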

Dataset References:

Dataset 1: “Data Page: Research & development spending as a share of GDP”. Our World in Data (2025). Data adapted from UNESCO Institute for Statistics (UIS) Bulk Data Service, via World Bank. Retrieved from https://archive.ourworldindata.org/20250916-102301/grapher/research-spending-gdp.html [online resource] (archived on September 16, 2025).

Dataset 2: “Data Page: GDP per capita”, part of the following publication: Max Roser, Bertha Rohenkohl, Pablo Arriagada, Joe Hasell, Hannah Ritchie, and Esteban Ortiz-Ospina (2023) - “Economic Growth”. Data adapted from Eurostat, OECD, IMF, and World Bank. Retrieved from https://archive.ourworldindata.org/20251017-101247/grapher/gdp-per-capita-worldbank.html [online resource] (archived on October 17, 2025).

Data Analysis

library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.4.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

spending_data_raw <- read_csv("D:/DATA 101/Datasets/research-spending-gdp.csv")

## Rows: 2605 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Entity, Code
## dbl (2): Year, Research and development expenditure (% of GDP)
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

gdp_data_raw <- read_csv("D:/DATA 101/Datasets/gdp-per-capita-worldbank.csv")

## Rows: 7311 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Entity, Code, World regions according to OWID
## dbl (2): Year, GDP per capita, PPP (constant 2021 international $)
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(spending_data_raw)

## # A tibble: 6 × 4
##   Entity  Code   Year `Research and development expenditure (% of GDP)`
##   <chr>   <chr> <dbl>                                             <dbl>
## 1 Albania ALB    2007                                            0.0876
## 2 Albania ALB    2008                                            0.154 
## 3 Algeria DZA    2001                                            0.212 
## 4 Algeria DZA    2002                                            0.338 
## 5 Algeria DZA    2003                                            0.181 
## 6 Algeria DZA    2004                                            0.152

head(gdp_data_raw)

## # A tibble: 6 × 5
##   Entity      Code   Year GDP per capita, PPP (constant…¹ World regions accord…²
##   <chr>       <chr> <dbl>                           <dbl> <chr>                 
## 1 Afghanistan AFG    2000                           1618. <NA>                  
## 2 Afghanistan AFG    2001                           1454. <NA>                  
## 3 Afghanistan AFG    2002                           1774. <NA>                  
## 4 Afghanistan AFG    2003                           1816. <NA>                  
## 5 Afghanistan AFG    2004                           1777. <NA>                  
## 6 Afghanistan AFG    2005                           1908. <NA>                  
## # ℹ abbreviated names: ¹`GDP per capita, PPP (constant 2021 international $)`,
## #   ²`World regions according to OWID`

data <- spending_data_raw |>
  left_join(gdp_data_raw, by=c("Entity", "Year"))

names(data) <- tolower(names(data))

data <- data |>
  select(-c("code.y","world regions according to owid")) |>
  rename(
    "code" = "code.x",
    "rd_percent" = "research and development expenditure (% of gdp)",
    "gdp_capita" = "gdp per capita, ppp (constant 2021 international $)",
    "name" = "entity"
  ) |>
  filter(code != "OWID_WRL" | is.na(code)) ## keeping NAs until regions are separated

colSums(is.na(data)) ## not worried about 50 NAs in gdp_capita in a dataset of 2365 observations

##       name       code       year rd_percent gdp_capita 
##          0        213          0          0         50

data_regions <- data |> ## extract regional data before deleting it from main dataset
  filter(is.na(code)) |>
  select(-code)

data <- data |>
  filter(!is.na(code))
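
The 50 missing `gdp_capita` values reported by `colSums()` above most likely come from R&D rows that found no partner in the GDP table. Here is a minimal base-R sketch of how one might list those unmatched entity/year pairs. The two small data frames are invented stand-ins for the real files, and `merge(..., all.x = TRUE)` is used as the base-R counterpart of `left_join()`.

```r
# Toy stand-ins for the two raw tables (values are invented for illustration)
spending_toy <- data.frame(
  Entity     = c("Country A", "Country A", "Country B"),
  Year       = c(2000, 2001, 2000),
  rd_percent = c(1.0, 1.1, 0.5)
)
gdp_toy <- data.frame(
  Entity     = c("Country A", "Country A"),
  Year       = c(2000, 2001),
  gdp_capita = c(30000, 31000)
)

# Left join: keep every spending row; gdp_capita is NA when no match exists
joined_toy <- merge(spending_toy, gdp_toy, by = c("Entity", "Year"), all.x = TRUE)

# Rows whose GDP value is missing after the join
unmatched <- joined_toy[is.na(joined_toy$gdp_capita), c("Entity", "Year")]

# Count missing rows per entity to see whether a few entities dominate
table(unmatched$Entity)
```

If a handful of entities account for most of the gaps, dropping them is easier to justify than if the gaps are spread evenly across the data.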

library(highcharter)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## Highcharts (www.highcharts.com) is a Highsoft software product which is

## not free for commercial and Governmental use

data_yearly <- data |>
  group_by(year) |>
  summarise(
    rd_percent = mean(rd_percent, na.rm = TRUE),
    gdp_capita = mean(gdp_capita, na.rm = TRUE)
  ) |>
  filter(year != 2023) ## minimal data for 2023 is throwing off plot

highchart() |>
  hc_chart(type = "line", backgroundColor="black") |>
  hc_title(text = "World Average GDP: per capita and R&D %", 
           style = list(color="aliceblue")) |>
  
  hc_xAxis(categories = data_yearly$year, 
           title = list(text = "Year",
                        style = list(color="aliceblue")),
           labels=list(style = list(color="aliceblue")))|>
  
  hc_yAxis_multiples(
    list(title = list(text = "R&D % of GDP", 
                      style = list(color = "aliceblue")),
         labels = list(style = list(color = "aliceblue"))),
    list(title = list(text = "GDP per Capita (PPP, constant 2021 international $)", 
                      style = list(color = "aliceblue")),
         opposite = TRUE,
         labels = list(style = list(color = "aliceblue")))
  ) |>

  hc_add_series(
    name = "R&D %",
    data = data_yearly$rd_percent,
    yAxis = 0,
    color = "#C090D0"
  ) |>
  hc_add_series(
    name = "GDP per capita",
    data = data_yearly$gdp_capita,
    yAxis = 1,
    color = "#39FF14"
  ) |>
  
  hc_tooltip(shared = TRUE) |>
  
  hc_legend(
    backgroundColor="aliceblue"
  )

data_regions2 <- data_regions |>
  filter(str_detect(name, "WB")) |>
  mutate(name = if_else(name=="Middle East, North Africa, Afghanistan and Pakistan (WB)", "Middle East & North Africa", name))

legend_order <- data_regions2 |>
  group_by(name) |>
  filter(year == max(year)) |> 
  arrange(desc(gdp_capita)) |> 
  pull(name)

data_regions2 <- data_regions2 |> 
  mutate(name = factor(name, levels = legend_order))

highchart() |>
  hc_chart(type = "line") |>
  
  hc_add_series(
    data = data_regions2,
    type = "line",
    hcaes(x = year, y = gdp_capita, group = name)
  ) |>
  
  hc_title(text = "GDP per Capita by Region (1996–2022)") |>
  
  hc_xAxis(
    title = list(text = "Year"),
    tickInterval = 4
  ) |>
  hc_yAxis(
    title = list(text = "GDP per Capita (PPP, constant 2021 international $)")
  ) |>
  
  hc_tooltip(
    shared = TRUE,
    valuePrefix = "$"
  ) |>
  
  hc_legend(
    layout = "vertical",
    align = "right",
    verticalAlign = "middle"
  )

Statistical Analysis

anova_data <- data_regions |>
  filter(str_detect(name,"income"))

anova_result <- aov(gdp_capita ~ name, data=anova_data)

summary(anova_result)

##             Df    Sum Sq   Mean Sq F value Pr(>F)    
## name         2 1.753e+10 8.767e+09   392.1 <2e-16 ***
## Residuals   53 1.185e+09 2.236e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

TukeyHSD(anova_result)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = gdp_capita ~ name, data = anova_data)
## 
## $name
##                                                                  diff
## Lower-middle-income countries-High-income countries         -41344.94
## Upper-middle-income countries-High-income countries         -33487.92
## Upper-middle-income countries-Lower-middle-income countries   7857.02
##                                                                    lwr
## Lower-middle-income countries-High-income countries         -46490.519
## Upper-middle-income countries-High-income countries         -36722.915
## Upper-middle-income countries-Lower-middle-income countries   2630.717
##                                                                   upr     p adj
## Lower-middle-income countries-High-income countries         -36199.35 0.0000000
## Upper-middle-income countries-High-income countries         -30252.91 0.0000000
## Upper-middle-income countries-Lower-middle-income countries  13083.32 0.0018484

This test simply confirms that income categorization, restricted for this project to the lower-middle, upper-middle, and high income levels, is strongly associated with GDP per capita: the three groups have clearly different mean GDP per capita. Every adjusted p-value is well below the significance level ($\alpha=0.05$, corresponding to the 95% confidence level). We can therefore use the income categorization in place of GDP per capita in a second ANOVA test. Going forward, we’ll refer to GDP per capita/income categorization as ‘national wealth’.

The real statistical question: is there a correlation between national wealth and R&D spending as a share of GDP?

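\[ H_0: \mu_1 = \mu_2 = \mu_3 \]

\[ H_A: \text{at least one } \mu_i \text{ differs from the others} \]

Here each $\mu_i$ is the mean R&D share of GDP for one of the three income groups.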

anova_result2 <- aov(rd_percent ~ name, data=anova_data)

summary(anova_result2)

##             Df Sum Sq Mean Sq F value Pr(>F)    
## name         2 25.069  12.534   118.1 <2e-16 ***
## Residuals   53  5.625   0.106                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

TukeyHSD(anova_result2)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = rd_percent ~ name, data = anova_data)
## 
## $name
##                                                                   diff
## Lower-middle-income countries-High-income countries         -1.8273202
## Upper-middle-income countries-High-income countries         -1.1323496
## Upper-middle-income countries-Lower-middle-income countries  0.6949706
##                                                                   lwr
## Lower-middle-income countries-High-income countries         -2.181850
## Upper-middle-income countries-High-income countries         -1.355241
## Upper-middle-income countries-Lower-middle-income countries  0.334879
##                                                                    upr    p adj
## Lower-middle-income countries-High-income countries         -1.4727902 0.00e+00
## Upper-middle-income countries-High-income countries         -0.9094585 0.00e+00
## Upper-middle-income countries-Lower-middle-income countries  1.0550621 6.51e-05

Frankly, I was not expecting such clear results. The ANOVA gives a p-value $< 2 \times10^{-16}$, far below the 0.05 significance level, so mean R&D spending as a share of GDP differs across the income groups. The Tukey HSD test gives more insight into which groups differ. The largest differences, with adjusted p-values that round to 0, are between the middle-income groups and high-income countries, with mean differences of $\approx-1.83$ and $\approx-1.13$ percentage points for lower-middle and upper-middle respectively. There is a smaller yet still statistically significant difference ($\approx0.69$ percentage points, adjusted p-value $\approx 6.5 \times10^{-5}$) between upper-middle-income and lower-middle-income countries. These results support rejecting the null hypothesis in favor of the alternative, namely, that there is a correlation, in the sense of differing group means, between national wealth (GDP per capita/income categorization) and the share of GDP spent on research and development.
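
A p-value says the group means differ but not by how much. One common complement is eta squared, the share of the total variance in `rd_percent` that is explained by income group. The sketch below computes it by hand from the sums of squares printed in the ANOVA table above. The variable names are illustrative and not part of the original analysis.

```r
# Sums of squares copied from summary(anova_result2) above
ss_group    <- 25.069  # "name" row
ss_residual <- 5.625   # "Residuals" row

# Eta squared = between-group sum of squares / total sum of squares
eta_squared <- ss_group / (ss_group + ss_residual)
round(eta_squared, 3)  # 0.817
```

By this measure, roughly 82% of the variation in these observations is accounted for by income group. The figure should be read with some caution, since the observations are repeated yearly values for the same three groups and are therefore not independent.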

Conclusion

The answer to my question is essentially yes. My ANOVA model and the previous analyses provide strong evidence of a significant correlation between GDP per capita and national expenditure on research and development as a share of GDP. I was able to reject my null hypothesis, that different economic categorizations would have equal R&D spending as a share of their GDP, because the p-value from my model is less than my chosen significance level and any other commonly chosen $\alpha$.
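
Because the ANOVA compares group means rather than measuring correlation directly, a rank correlation at the country level is one way to check the claim more literally. The sketch below uses invented numbers in place of per-country averages. The column names mirror `gdp_capita` and `rd_percent`, and Spearman's method is chosen here on the assumption that GDP per capita is right-skewed.

```r
# Invented per-country values, for illustration only
country_toy <- data.frame(
  gdp_capita = c(2000, 5000, 12000, 30000, 55000, 80000),
  rd_percent = c(0.3, 0.1, 0.6, 1.2, 2.4, 3.1)
)

# Spearman rank correlation: robust to skew and to a non-linear monotone trend
rho_test <- cor.test(country_toy$gdp_capita, country_toy$rd_percent,
                     method = "spearman")

rho_test$estimate  # rank correlation coefficient (rho)
rho_test$p.value   # p-value for H0: rho = 0
```

Averaging each country over its available years before running the test would keep repeated observations of the same country from being counted as independent.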

In the future, I’d be interested in investigating causation rather than correlation. Does increased GDP per capita tend to precipitate a higher share of spending on R&D, or vice versa? Or is there a third factor (or several) that causes both variables to change together? I might approach this by taking the entities with the most annual data, calculating their rate of GDP growth, then graphing it alongside R&D expenditure to see where the changes start.
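
The growth-rate idea above could start from something like the following sketch, which computes year-over-year growth in GDP per capita within each entity using base R. The data frame is invented, and the calculation assumes each entity has one row per consecutive year, so gaps in the real data would need handling first.

```r
# Invented panel: two entities observed over four consecutive years
panel_toy <- data.frame(
  name       = rep(c("Country A", "Country B"), each = 4),
  year       = rep(2000:2003, times = 2),
  gdp_capita = c(100, 110, 121, 133.1, 200, 210, 220.5, 231.525)
)

# Sort so that each entity's rows are in chronological order
panel_toy <- panel_toy[order(panel_toy$name, panel_toy$year), ]

# Growth rate = (this year - last year) / last year, computed within each entity;
# the first year of each entity has no previous value, so it is NA
panel_toy$gdp_growth <- ave(
  panel_toy$gdp_capita, panel_toy$name,
  FUN = function(x) c(NA, diff(x) / head(x, -1))
)

panel_toy
```

Plotting `gdp_growth` against the change in `rd_percent` over the same years, possibly with a lag of a few years, would be one way to see which series tends to move first.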

References

Wickham H (2025). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.6.0, https://stringr.tidyverse.org.
Note: stringr is part of the tidyverse, but we didn’t cover it in this class.