# install.packages("tidyverse", "data.table") # remove comment if not already installed
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(data.table)
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
##
## The following object is masked from 'package:purrr':
##
## transpose
library(dplyr)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(patchwork)
library(corrplot)
## corrplot 0.95 loaded
library(RColorBrewer)
Title: Impact of Patents on National Growth
The primary question in this study is how do patents applications relate to national economic growth? In relation to that, how do inceptives for scientific and technologic growth as measured by fraction of GDP invested in R&D impact further growth as measured by GDP? The underlying concept is that building and protecting intellectual property is a source of national impact and a foundation for future growth.
We have seen a fast increase in patent filing over the last 3 decades and we would like to better understand which countries are promoting that growth and how it is affecting their economy.
Patent production is enabled by supporting infrastructure that enables their development. Isolated and eccentric inventors of the past are no longer the primary source of innovative ideas. The working model is an economy that places emphasis on technical or technological innovation as a basis of national advancement for its people and well-being. Such emphasis is evidenced by the support of a strong educational foundation that promotes creativity and independent thinking among its youngest citizens, the source of future inventors. But technical interest and education must be complemented with a powerful manufacturing and scientific infrastructure that allows inventors to discover where innovation is needed the most, and what’s feasible and realizable in practice. Since the low-hanging fruit has already been collected, the current generation of inventors are aiming for technological marvels that seemed science fiction just a few years ago. Thus, from the practical inventions to facilitate daily human living, mostly mechanical and electrical in nature, inventors are now turning to sophisticated medical, pharmaceutical, and chemical solutions to prolong life; electronic and digital solutions to robotize routine tasks as well to achieve tasks not easily feasible for humans, or instruments and communication and manufacturing technology for the fast growing space industry. I mentioned two of the necessary legs to competitive patent production: technical education, and manufacturing and scientific infrastructure. The third leg is a favorable legal, regulatory and financial system that protects invention and makes it attractive and profitable. In the past century, only a handful of countries had all three legs in place. Over the last 4 decades, however, more and more countries are reaching the state of development that enable them to enter the “race”.
In this initial entry into this topic, I am aiming to start scoping the land of patents and see how they are influencing the economy of countries. One basic question is: do countries that invest more in R&D produce more patents? I hope to answer that question here. The second question is: are more patents leading to more exports (or trade) for countries? If I can collect sufficient data, I may be able to answer that here. Is the growth of R&D in general impacting patent production? On the basis of considerations raised in the previous paragraph, other questions are: is gender diversity playing a role in patent production? Are different patent sectors and types growing differently? The last question may lead to very valuable strategic guidance, but a more comprehensive dataset and much analysis time are needed, thus I don’t expect to answer that question here.
The project required assembling a dataset with the necessary information by country and year, to answer the questions posed above: (1) tables of individual patent filings from USPTO,(2) tables with various indicators from World Bank: GDP, patent applications (total), R&D spending, exports, manufacturing, researchers, population, (3) tables from OurWorldinData on scientific publications, on trade, and country classification by income, (4) table from IBAN to convert country codes from Alpha-2 to Alpha-3 (ISO 3166 international standard.)
USPTO (US Patent and Trademark Office)
PatentsView (annualized tables): https://patentsview.org/data/annualized; downloaded 10 tables 1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015, 2020 and 2024.
World Bank (World Development Indicators):
GDP(current US$): https://data.worldbank.org/indicator/NY.GDP.MKTP.CD
Patent applications (residents): https://data.worldbank.org/indicator/IP.PAT.RESD
Exports of goods and services (%GDP): https://data.worldbank.org/indicator/NE.EXP.GNFS.ZS
Manufacturing, value added (%GDP): https://data.worldbank.org/indicator/NV.IND.MANF.ZS
Population (total): https://data.worldbank.org/indicator/SP.POP.TOTL
Researchers in R&D (per million population): https://data.worldbank.org/indicator/SP.POP.SCIE.RD.P6
Research and Development expenditure (%GDP): https://data.worldbank.org/indicator/GB.XPD.RSDV.GD.ZS
Our World in Data:
Trade (%GDP): https://ourworldindata.org/grapher/trade-as-share-of-gdp
Publications, science (per million population): https://ourworldindata.org/grapher/scientific-publications-per-million
Countries by income classification, https://ourworldindata.org/grapher/countries-by-income-classification
IBAN: https://www.iban.com/country-codes; conversion between Alpha-2 and Alpha-3 codes used by WorldBank and OurWorldinData
These datasets were bound and then cleaned and wrangled as described below.
setwd(getwd()) # ensure we're at the directory where the RMD is running at and all patent datasets are in subdir PATENTSVIEW
dt <- fread("PATENTSVIEW/1980.csv")
str(dt)
## Classes 'data.table' and 'data.frame': 67133 obs. of 60 variables:
## $ patent_number : chr "4180867" "4180868" "4180869" "4180870" ...
## $ grant_year : int 1980 1980 1980 1980 1980 1980 1980 1980 1980 1980 ...
## $ application_number : int 5881758 5851785 5812057 5676879 5847841 5855370 5748233 5885934 5906415 5957799 ...
## $ application_year : int 1978 1977 1977 1976 1977 1977 1976 1978 1978 1978 ...
## $ d_inventor : int 1 1 1 1 1 1 1 1 1 1 ...
## $ d_assignee : int 0 0 1 1 0 0 0 1 0 1 ...
## $ d_location : int 0 0 1 1 0 0 0 1 0 1 ...
## $ d_application : int 1 1 1 1 1 1 1 1 1 1 ...
## $ d_cpc : int 1 1 1 1 1 1 1 1 1 1 ...
## $ d_ipc : int 1 1 1 1 1 1 1 1 1 1 ...
## $ d_wipo : int 1 1 1 1 1 1 1 1 1 1 ...
## $ assignee : chr "" "" "COLGATE-PALMOLIVE COMPANY" "FA WILH. JUL. TEUFEL" ...
## $ assignee_sequence : int NA NA 0 0 NA NA NA 0 NA 0 ...
## $ assignee_ind : int 0 0 0 0 0 0 0 0 0 0 ...
## $ country : chr "" "" "US" "DE" ...
## $ city : chr "" "" "New York" "Stuttgart" ...
## $ state : chr "" "" "NY" "" ...
## $ county : chr "" "" "New York" "" ...
## $ cpc_sections : chr "A" "A" "A" "B A" ...
## $ n_cpc : int 3 1 1 4 3 1 6 1 5 1 ...
## $ n_ipc : int 2 1 1 1 1 1 1 1 1 2 ...
## $ ipc_sections : chr "A E" "A" "A" "A" ...
## $ n_wipo : int 1 1 1 2 1 1 1 1 1 1 ...
## $ wipo_field_ids : chr "34" "34" "13" "13 25" ...
## $ first_wipo_field_title : chr "Other consumer goods" "Other consumer goods" "Medical technology" "Medical technology" ...
## $ first_wipo_sector_title: chr "Other fields" "Other fields" "Instruments" "Instruments" ...
## $ team_size : int 1 1 2 3 1 1 1 2 1 5 ...
## $ inventors : int 1 1 2 3 1 1 1 2 1 5 ...
## $ men_inventors : int 0 1 0 0 1 0 1 2 1 4 ...
## $ women_inventors : int 0 0 1 0 0 0 0 0 0 1 ...
## $ inventor_id1 : chr "fl:ma_ln:ridgeway-3" "fl:ch_ln:snow-1" "oqd48eh5qnqj4ghh6gl8noz33" "fl:ra_ln:radulovic-2" ...
## $ inventor_name1 : chr "Marcus L. Ridgeway, Jr." "Charles C. Snow" "John E. Pedergrass" "Radoje Radulovic" ...
## $ male_flag1 : int 0 1 0 0 1 0 1 1 1 1 ...
## $ inventor_id2 : chr "" "" "fl:cy_ln:wichman-1" "bj0zl4z4ab7cppmiw4oh8m5vx" ...
## $ inventor_name2 : chr "" "" "Cynthia A. Wichman" "Mustafa Karisik" ...
## $ male_flag2 : int NA NA 0 0 NA NA NA 1 NA 1 ...
## $ inventor_id3 : chr "" "" "" "fl:kl_ln:wolfer-2" ...
## $ inventor_name3 : chr "" "" "" "Klaus R. Wolfer" ...
## $ male_flag3 : int NA NA NA 0 NA NA NA NA NA 1 ...
## $ inventor_id4 : chr "" "" "" "" ...
## $ inventor_name4 : chr "" "" "" "" ...
## $ male_flag4 : int NA NA NA NA NA NA NA NA NA 1 ...
## $ inventor_id5 : chr "" "" "" "" ...
## $ inventor_name5 : chr "" "" "" "" ...
## $ male_flag5 : int NA NA NA NA NA NA NA NA NA 0 ...
## $ inventor_id6 : chr "" "" "" "" ...
## $ inventor_name6 : chr "" "" "" "" ...
## $ male_flag6 : int NA NA NA NA NA NA NA NA NA NA ...
## $ inventor_id7 : chr "" "" "" "" ...
## $ inventor_name7 : chr "" "" "" "" ...
## $ male_flag7 : int NA NA NA NA NA NA NA NA NA NA ...
## $ inventor_id8 : chr "" "" "" "" ...
## $ inventor_name8 : chr "" "" "" "" ...
## $ male_flag8 : int NA NA NA NA NA NA NA NA NA NA ...
## $ inventor_id9 : chr "" "" "" "" ...
## $ inventor_name9 : chr "" "" "" "" ...
## $ male_flag9 : int NA NA NA NA NA NA NA NA NA NA ...
## $ inventor_id10 : chr "" "" "" "" ...
## $ inventor_name10 : chr "" "" "" "" ...
## $ male_flag10 : int NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, ".internal.selfref")=<externalptr>
head(dt)
## patent_number grant_year application_number application_year d_inventor
## <char> <int> <int> <int> <int>
## 1: 4180867 1980 5881758 1978 1
## 2: 4180868 1980 5851785 1977 1
## 3: 4180869 1980 5812057 1977 1
## 4: 4180870 1980 5676879 1976 1
## 5: 4180871 1980 5847841 1977 1
## 6: 4180872 1980 5855370 1977 1
## d_assignee d_location d_application d_cpc d_ipc d_wipo
## <int> <int> <int> <int> <int> <int>
## 1: 0 0 1 1 1 1
## 2: 0 0 1 1 1 1
## 3: 1 1 1 1 1 1
## 4: 1 1 1 1 1 1
## 5: 0 0 1 1 1 1
## 6: 0 0 1 1 1 1
## assignee assignee_sequence assignee_ind country city
## <char> <int> <int> <char> <char>
## 1: NA 0
## 2: NA 0
## 3: COLGATE-PALMOLIVE COMPANY 0 0 US New York
## 4: FA WILH. JUL. TEUFEL 0 0 DE Stuttgart
## 5: NA 0
## 6: NA 0
## state county cpc_sections n_cpc n_ipc ipc_sections n_wipo wipo_field_ids
## <char> <char> <char> <int> <int> <char> <int> <char>
## 1: A 3 2 A E 1 34
## 2: A 1 1 A 1 34
## 3: NY New York A 1 1 A 1 13
## 4: B A 4 1 A 2 13 25
## 5: A 3 1 A 1 13
## 6: A 1 1 A 1 13
## first_wipo_field_title first_wipo_sector_title team_size inventors
## <char> <char> <int> <int>
## 1: Other consumer goods Other fields 1 1
## 2: Other consumer goods Other fields 1 1
## 3: Medical technology Instruments 2 2
## 4: Medical technology Instruments 3 3
## 5: Medical technology Instruments 1 1
## 6: Medical technology Instruments 1 1
## men_inventors women_inventors inventor_id1
## <int> <int> <char>
## 1: 0 0 fl:ma_ln:ridgeway-3
## 2: 1 0 fl:ch_ln:snow-1
## 3: 0 1 oqd48eh5qnqj4ghh6gl8noz33
## 4: 0 0 fl:ra_ln:radulovic-2
## 5: 1 0 fl:ro_ln:hamas-1
## 6: 0 0 0804gl64thfu0vjgwfmo9v18m
## inventor_name1 male_flag1 inventor_id2
## <char> <int> <char>
## 1: Marcus L. Ridgeway, Jr. 0
## 2: Charles C. Snow 1
## 3: John E. Pedergrass 0 fl:cy_ln:wichman-1
## 4: Radoje Radulovic 0 bj0zl4z4ab7cppmiw4oh8m5vx
## 5: Robert S. Hamas 1
## 6: Neal S. Chaikin 0
## inventor_name2 male_flag2 inventor_id3 inventor_name3 male_flag3
## <char> <int> <char> <char> <int>
## 1: NA NA
## 2: NA NA
## 3: Cynthia A. Wichman 0 NA
## 4: Mustafa Karisik 0 fl:kl_ln:wolfer-2 Klaus R. Wolfer 0
## 5: NA NA
## 6: NA NA
## inventor_id4 inventor_name4 male_flag4 inventor_id5 inventor_name5
## <char> <char> <int> <char> <char>
## 1: NA
## 2: NA
## 3: NA
## 4: NA
## 5: NA
## 6: NA
## male_flag5 inventor_id6 inventor_name6 male_flag6 inventor_id7
## <int> <char> <char> <int> <char>
## 1: NA NA
## 2: NA NA
## 3: NA NA
## 4: NA NA
## 5: NA NA
## 6: NA NA
## inventor_name7 male_flag7 inventor_id8 inventor_name8 male_flag8
## <char> <int> <char> <char> <int>
## 1: NA NA
## 2: NA NA
## 3: NA NA
## 4: NA NA
## 5: NA NA
## 6: NA NA
## inventor_id9 inventor_name9 male_flag9 inventor_id10 inventor_name10
## <char> <char> <int> <char> <char>
## 1: NA
## 2: NA
## 3: NA
## 4: NA
## 5: NA
## 6: NA
## male_flag10
## <int>
## 1: NA
## 2: NA
## 3: NA
## 4: NA
## 5: NA
## 6: NA
# list all the csv files stored in the project directory/PATENTS subdirectory
file_list <- list.files(path = "PATENTSVIEW", pattern = "*.csv", full.names = TRUE)
file_list
## [1] "PATENTSVIEW/1980.csv" "PATENTSVIEW/1985.csv" "PATENTSVIEW/1990.csv"
## [4] "PATENTSVIEW/1995.csv" "PATENTSVIEW/2000.csv" "PATENTSVIEW/2005.csv"
## [7] "PATENTSVIEW/2010.csv" "PATENTSVIEW/2015.csv" "PATENTSVIEW/2020.csv"
## [10] "PATENTSVIEW/2024.csv"
# select a specific set of columns in the datasets
selected_columns <- c("grant_year","application_year","country","first_wipo_field_title","first_wipo_sector_title","men_inventors","women_inventors")
# bind all the datasets directly into a data.table format; note: I'm using a base-R for loop for clarity, but there must be a more sophisticated function for this.
combined_data <- data.table()
for (file in file_list) {
temp_data <- fread(file, select = selected_columns)
combined_data <- rbind(combined_data, temp_data)
}
str(combined_data) # show number of rows and list columns
## Classes 'data.table' and 'data.frame': 2096504 obs. of 7 variables:
## $ grant_year : int 1980 1980 1980 1980 1980 1980 1980 1980 1980 1980 ...
## $ application_year : int 1978 1977 1977 1976 1977 1977 1976 1978 1978 1978 ...
## $ country : chr "" "" "US" "DE" ...
## $ first_wipo_field_title : chr "Other consumer goods" "Other consumer goods" "Medical technology" "Medical technology" ...
## $ first_wipo_sector_title: chr "Other fields" "Other fields" "Instruments" "Instruments" ...
## $ men_inventors : int 0 1 0 0 1 0 1 2 1 4 ...
## $ women_inventors : int 0 0 1 0 0 0 0 0 0 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
head(combined_data) # show the top rows
## grant_year application_year country first_wipo_field_title
## <int> <int> <char> <char>
## 1: 1980 1978 Other consumer goods
## 2: 1980 1977 Other consumer goods
## 3: 1980 1977 US Medical technology
## 4: 1980 1976 DE Medical technology
## 5: 1980 1977 Medical technology
## 6: 1980 1977 Medical technology
## first_wipo_sector_title men_inventors women_inventors
## <char> <int> <int>
## 1: Other fields 0 0
## 2: Other fields 1 0
## 3: Instruments 0 1
## 4: Instruments 0 0
## 5: Instruments 1 0
## 6: Instruments 0 0
# commented to avoid memory overrun - un-comment to check NAs
# colSums(is.na(combined_data)) # show the number of NAs in each column
# colSums(combined_data == "" | combined_data == " ", na.rm=TRUE) # and the number of empty or blank spaces
combined_data1 <- mutate(combined_data, application_year= ifelse(is.na(application_year), grant_year-round(mean(grant_year-application_year, na.rm = TRUE)), application_year))
combined_data1$country[is.na(combined_data1$country) | combined_data1$country == ""] <- "Other"
combined_data1$first_wipo_field_title [combined_data1$first_wipo_field_title == ""] <- "Unspecified"
combined_data1$first_wipo_sector_title [combined_data1$first_wipo_sector_title == ""] <- "Unspecified"
country_codes <- read_csv("country_codes-iban.csv", show_col_types = FALSE)
population <- read_csv("population-worldbank.csv", show_col_types = FALSE)
income_class <- read_csv("income-groups-ourworldindata.csv", show_col_types = FALSE)
gdp <- read_csv("gdp-worldbank-reduced.csv", show_col_types = FALSE)
researchers <- read_csv("researchers-per-million-worldbank.csv", show_col_types = FALSE)
research_spending <- read_csv("research-spending-gdp-worldbank.csv", show_col_types = FALSE)
patents_apps <- read_csv("annual-patent-applications-residents-worldbank.csv", show_col_types = FALSE)
exports <- read_csv("exports-gdp-worldbank-reduced.csv", show_col_types = FALSE)
manufacturing <- read_csv("manufacturing-gdp-worldbank.csv", show_col_types = FALSE)
publications <- read_csv("sci-pubs-per-million-ourworldindata.csv", show_col_types = FALSE)
trade <- read_csv("trade-share-gdp-ourworldindata.csv", show_col_types = FALSE)
combined_data2 <- left_join(combined_data1, country_codes, by= c("country" = "country"))
combined_data2 <- left_join(combined_data2, income_class, by= c("code" = "code", "grant_year"="year"))
combined_data2 <- left_join(combined_data2, population, by= c("code" = "code", "grant_year"="year"))
combined_data2 <- left_join(combined_data2, gdp, by= c("code" = "code", "grant_year"="year"))
combined_data2 <- left_join(combined_data2, researchers, by= c("code" = "code", "grant_year"="year"))
combined_data2 <- left_join(combined_data2, research_spending, by= c("code" = "code", "grant_year"="year"))
combined_data2 <- left_join(combined_data2, patents_apps, by= c("code" = "code", "grant_year"="year"))
combined_data2 <- left_join(combined_data2, exports, by= c("code" = "code", "grant_year"="year"))
combined_data2 <- left_join(combined_data2, manufacturing, by= c("code" = "code", "grant_year"="year"))
combined_data2 <- left_join(combined_data2, publications, by= c("code" = "code", "grant_year"="year"))
combined_data2 <- left_join(combined_data2, trade, by= c("code" = "code", "grant_year"="year"))
combined_data2
## grant_year application_year country first_wipo_field_title
## <num> <num> <char> <char>
## 1: 1980 1978 Other Other consumer goods
## 2: 1980 1977 Other Other consumer goods
## 3: 1980 1977 US Medical technology
## 4: 1980 1976 DE Medical technology
## 5: 1980 1977 Other Medical technology
## ---
## 2096500: 2024 2021 KR Unspecified
## 2096501: 2024 2021 KR Unspecified
## 2096502: 2024 2021 JP Unspecified
## 2096503: 2024 2022 US Unspecified
## 2096504: 2024 2022 TW Unspecified
## first_wipo_sector_title men_inventors women_inventors
## <char> <int> <int>
## 1: Other fields 0 0
## 2: Other fields 1 0
## 3: Instruments 0 1
## 4: Instruments 0 0
## 5: Instruments 1 0
## ---
## 2096500: Unspecified 1 3
## 2096501: Unspecified 0 0
## 2096502: Unspecified 2 0
## 2096503: Unspecified 2 1
## 2096504: Unspecified 3 1
## country_name code income_class population
## <char> <char> <char> <num>
## 1: Other Other <NA> NA
## 2: Other Other <NA> NA
## 3: United States of America USA <NA> 227225000
## 4: Germany DEU <NA> 78288576
## 5: Other Other <NA> NA
## ---
## 2096500: Korea (Republic) KOR High-income countries 51751065
## 2096501: Korea (Republic) KOR High-income countries 51751065
## 2096502: Japan JPN High-income countries 123975371
## 2096503: United States of America USA High-income countries 340110988
## 2096504: Taiwan (China) TWN High-income countries NA
## gdp researchers rd_spending_gdp patent_apps exports_gdp
## <num> <num> <num> <num> <num>
## 1: NA NA NA NA NA
## 2: NA NA NA NA NA
## 3: 2.857307e+12 NA NA 62098 9.826455
## 4: 9.537725e+11 NA NA 28683 18.543411
## 5: NA NA NA NA NA
## ---
## 2096500: 0.000000e+00 NA NA NA 0.000000
## 2096501: 0.000000e+00 NA NA NA 0.000000
## 2096502: 4.026211e+12 NA NA NA 0.000000
## 2096503: 2.918489e+13 NA NA NA 10.896875
## 2096504: NA NA NA NA NA
## manufacture_gdp pubs_millpop trade_gdp
## <num> <num> <num>
## 1: NA NA NA
## 2: NA NA NA
## 3: 0.000000 NA 20.10984
## 4: 0.000000 NA 41.69634
## 5: NA NA NA
## ---
## 2096500: 0.000000 NA NA
## 2096501: 0.000000 NA NA
## 2096502: 0.000000 NA NA
## 2096503: 9.981583 NA 24.88799
## 2096504: NA NA NA
Income_class: The World Bank relabeled the country classification according to GDP several times prior to 1989, with the most recent official classification starting in 1989 (used as early as 1987) and containing four income groups: Low, Lower Middle, Upper Middle, and High. Moreover, countries may move between classes over time, and some countries were not labeled in the original patents nor received an ISO-3 code. Thus, for proper comparison, the statistics will be restricted to rows with assigned income_class in years 1987-2024. When unavailable, income_class is imputed to “Unspecified”.
mutate(combined_data2, income_class = ifelse(is.na(income_class), "Unspecified", income_class))
Population: has remaining NAs for the rows without assigned country of issuance nor ISO-3 code. In addition, Taiwan (province of China) did not report a population, as it is officially added to China. Since the total aggregated population is 0, the mean yields NaN and cannot be used to impute population. Those NAs will not be imputed and those rows will be excluded from the calculations with na.rm.
Application_year: Summary shows two abnormal entries for application_year>2024, a filter was used to find them, and they are imputed to grant_year minus the mean(delay), where delay = grant_year - application_year.
filter(combined_data2, application_year>2024) # (to find and show the 2 abnormal entries)
mutate(combined_data2, application_year= ifelse(application_year > 2024, grant_year-round(mean(grant_year-application_year, na.rm = TRUE)),application_year))
Men_inventor and women_inventors: data summary shows for men a Max=111, which is not an error. It occurs for 3 separate applications from Sweden with 111 inventors; it appears to be the same or related patent application submitted at various times. In women_inventors, there is one application with 20 women. It is not an error.
GDP: not available for rows without assigned country not ISO-3 code. Left as NA for now. Later these rows will be filtered out when appropriate or by the functions used.
Researchers: sparse information from World Bank, available from 1996 until 2022 for some of the countries. Imputed with the mean over the period 1996 until 2022 for each country. To be used only for that period. NA outside that range.
combined_data2 |>
group_by(code) |> mutate(
mean_res = mean(researchers[code != "" & grant_year >= 1996 & grant_year <=2022], na.rm = TRUE), # mean for countries with code
researchers = ifelse(is.na(researchers) & code != "", mean_res, researchers)) |>
ungroup() |> select (-mean_res) # remove temp col mean_res
Rd_spending_gdp: sparse information from World Bank, available from 1996 until 2022 for some of the countries. Imputed with the mean over the period 1996 until 2022 for each country. To be used only for that period. NA outside that range.
combined_data2 |>
group_by(code) |> mutate(
mean_res_spend = mean(rd_spending_gdp[grant_year >= 1996 & grant_year <=2022], na.rm = TRUE),
researchers = ifelse(is.na(rd_spending_gdp) & code != "", mean_res_spend, rd_spending_gdp)) |>
ungroup() |> select (-mean_res_spend) # remove temp col mean_res_spend
Patent_apps: partial information from World Bank, available from 1980 until 2021 for most countries. The analysis involving patent_apps will only be done in that time range. NAs are imputed with the mean for each country.
combined_data2 |>
group_by(code) |> mutate(
mean_patent_apps = mean(patent_apps[grant_year >= 1980 & grant_year <=2021], na.rm = TRUE),
patent_apps = ifelse(is.na(patent_apps) & code != "", mean_patent_apps, patent_apps)) |>
ungroup() |> select (-mean_patent_apps) # remove temp col mean_patent_apps
Exports_gdp: mostly complete since 1980-2024 but not for all countries. Not available for rows without assigned country not ISO-3 code. Left as NA.
Manufacture_gdp: mostly complete since 1980-2024 but not for all countries. Not available for countries without ISO-3 code. Left as NA.
Publications per million: mostly complete from Our World in Data since 1980-2024 but not for all countries. Not available for countries without ISO-3 code. Left as NA.
Trade-gdp: mostly complete since 1980-2024 but not for all countries. Not available for countries without ISO-3 code. Left as NA.
Carry out these operations in a single code chunk:
combined_data3 <- combined_data2 |> dplyr::mutate(
income_class = ifelse(is.na(income_class), "Unspecified", income_class),
application_year= ifelse(application_year > 2024, grant_year-round(mean(grant_year-application_year, na.rm = TRUE)),application_year)
)
combined_data4 <- combined_data3 |>
group_by(code) |> dplyr::mutate(
mean_res = mean(researchers[code != "" & grant_year >= 1996 & grant_year <=2022], na.rm = TRUE),
researchers = ifelse(is.na(researchers) & code != "", mean_res, researchers),
mean_res_spend = mean(rd_spending_gdp[grant_year >= 1996 & grant_year <=2022], na.rm = TRUE),
rd_spending_gdp = ifelse(is.na(rd_spending_gdp) & code != "", mean_res_spend, rd_spending_gdp),
mean_patent_apps = mean(patent_apps[grant_year >= 1980 & grant_year <=2021], na.rm = TRUE),
patent_apps = ifelse(is.na(patent_apps) & code != "", mean_patent_apps, patent_apps)) |>
ungroup() |> select (-mean_res, -mean_res_spend, -mean_patent_apps)
combined_data5 <- combined_data4 |>
filter (grant_year >= 2000 & grant_year <= 2020 & code != "Other" & code != "TWN" & population != 0 & !is.na(population) & !is.nan(population))
# Add a column for gdp per capita:
combined_data5 <- dplyr::mutate(combined_data5, gdp_capita = gdp / population) # to avoid error, I had to specify dplyr::mutate
# Attempt to remove remaining NAs by imputing with the mean for each country, for researchers, rd_spending_gdp, patent_apps, pubs_millpop, trade_gdp
combined_data6 <- combined_data5 |>
dplyr::mutate (
across(where(is.numeric), ~ replace(., is.nan(.), NA))) |> #making sure all NaN are NA so mean() works well
group_by(code) |>
dplyr::mutate (
across(where(is.numeric), ~ replace(., is.na(.), mean(., na.rm = TRUE)))) |> #replace all NAs by country means
ungroup() # remove grouping
combined_data7 <- drop_na(combined_data6) # clean up all rows with any NA and NaN.
summary(combined_data7)
## grant_year application_year country first_wipo_field_title
## Min. :2000 Min. :1933 Length:1165301 Length:1165301
## 1st Qu.:2010 1st Qu.:2004 Class :character Class :character
## Median :2015 Median :2011 Mode :character Mode :character
## Mean :2013 Mean :2010
## 3rd Qu.:2020 3rd Qu.:2016
## Max. :2020 Max. :2020
## first_wipo_sector_title men_inventors women_inventors country_name
## Length:1165301 Min. : 0.00 Min. : 0.0000 Length:1165301
## Class :character 1st Qu.: 1.00 1st Qu.: 0.0000 Class :character
## Mode :character Median : 2.00 Median : 0.0000 Mode :character
## Mean : 2.33 Mean : 0.2819
## 3rd Qu.: 3.00 3rd Qu.: 0.0000
## Max. :111.00 Max. :20.0000
## code income_class population gdp
## Length:1165301 Length:1165301 Min. :8.286e+04 Min. :0.000e+00
## Class :character Class :character 1st Qu.:1.211e+08 1st Qu.:3.940e+12
## Mode :character Mode :character Median :2.822e+08 Median :1.025e+13
## Mean :2.479e+08 Mean :1.053e+13
## 3rd Qu.:3.218e+08 3rd Qu.:1.830e+13
## Max. :1.411e+09 Max. :2.135e+13
## researchers rd_spending_gdp patent_apps exports_gdp
## Min. : 17.3 Min. :0.01942 Min. : 1 Min. : 0.00
## 1st Qu.:3550.1 1st Qu.:2.61983 1st Qu.: 164795 1st Qu.: 10.69
## Median :4464.1 Median :2.77328 Median : 241977 Median : 12.41
## Mean :4368.2 Mean :2.88831 Mean : 235560 Mean : 20.56
## 3rd Qu.:5117.2 3rd Qu.:3.26458 3rd Qu.: 288335 3rd Qu.: 21.70
## Max. :8620.0 Max. :4.79571 Max. :1344817 Max. :225.16
## manufacture_gdp pubs_millpop trade_gdp gdp_capita
## Min. : 0.00 Min. : 1.842 Min. : 15.68 Min. : 0
## 1st Qu.:11.32 1st Qu.: 878.568 1st Qu.: 25.10 1st Qu.: 36330
## Median :12.98 Median :1337.213 Median : 28.22 Median : 44123
## Mean :15.53 Mean :1206.992 Mean : 41.66 Mean : 45493
## 3rd Qu.:20.28 3rd Qu.:1380.029 3rd Qu.: 43.98 3rd Qu.: 56849
## Max. :34.86 Max. :2738.632 Max. :420.43 Max. :116860
patent_apps_histg <- ggplot(combined_data7, aes(x = patent_apps)) +
geom_histogram(binwidth = 30, fill = "lightgreen", color = "blue") +
labs(title = "Patents 2000-2020, World", x = "Patents/year", y = "Frequency") +
theme_minimal()
rd_spend_histg <- ggplot(combined_data7, aes(x = rd_spending_gdp)) +
geom_histogram(binwidth = 0.03, fill = "yellow", color = "red") +
labs(title = "R&D Spending 2000-2020, World", x = "R&D Spending, %GDP", y = "Frequency") +
theme_minimal()
patent_apps_histg + rd_spend_histg
combined_grp_yr <- combined_data7 |>
group_by(grant_year) |>
summarize(patent_tot = n(), rd_spending_avg = mean(rd_spending_gdp), manufacture_avg = mean(manufacture_gdp), exports_avg = mean(exports_gdp), trade_avg = mean(trade_gdp), pubs_mill_avg = mean(pubs_millpop), researchers_avg = mean(researchers), na.rm=TRUE) |>
ungroup()
y_scale = 8*10^-6 # guessed scaling factor for right-y axis
ggplot(combined_grp_yr, aes(x=grant_year)) +
# plot the total patent line
geom_point(aes(y= patent_tot, color= "Patents")) + geom_line(linewidth=1, aes(y= patent_tot, color= "Patents"))+
# plot the R&D spending line
geom_point(aes(y= rd_spending_avg/y_scale, color= "R&D")) + geom_line(linewidth=1, aes(y= rd_spending_avg/y_scale, color= "R&D"))+
# draw the y axes
scale_y_continuous(name = "Patents/year",
sec.axis = sec_axis(~. * y_scale, name = "R&D Spending, %GDP")) +
scale_x_continuous(name = "Year") +
labs(title = "World Patents and R&D Spending in the 2000-2020 Period") +
scale_color_manual(values = c("Patents" = "blue", "R&D" = "red")) +
theme( axis.title.y.left = element_text(color = "blue"),
axis.title.y.right = element_text(color = "red"),
legend.position = "bottom",
legend.title = element_blank())
ggplot(combined_grp_yr, aes(x = rd_spending_avg, y = patent_tot, label = grant_year)) +
geom_point() +
geom_line() + # connect rd_spending_avg vs patent_tot in order of the 'year' variable
geom_text(vjust = -1, hjust = 0.5) + # add year labels
labs(
title = "Patents vs. R&D Investment",
x = "R&D Investment (per year)",
y = "Patents (per year)"
) +
theme_bw()
ggplot(combined_grp_yr, aes(x = manufacture_avg, y = patent_tot, label = grant_year)) +
geom_point() +
geom_line() + # line connecting y and x for each year group
geom_text(vjust = -1, hjust = 0.5) +
labs(
title = "Patents vs. Manufaturing",
x = "Manufacturing (fraction of GDP)",
y = "Patents (per year)"
) +
theme_bw()
ggplot(combined_grp_yr, aes(x = exports_avg, y = patent_tot, label = grant_year)) +
geom_point() +
geom_line() + # line connecting y and x for each year group
geom_text(vjust = -1, hjust = 0.5) +
labs(
title = "Patents vs. Exports",
x = "Exports (fraction of GDP)",
y = "Patents (per year)"
) +
theme_bw()
ggplot(combined_grp_yr, aes(x = pubs_mill_avg, y = patent_tot, label = grant_year)) +
geom_point() +
geom_line() + # line connecting y and x for each year group
geom_text(vjust = -1, hjust = 0.5) +
labs(
title = "Patents vs. Publications",
x = "Publications (per million people)",
y = "Patents (per year)"
) +
theme_bw()
ggplot(combined_grp_yr, aes(x = researchers_avg, y = patent_tot, label = grant_year)) +
geom_point() +
geom_line() + # line connecting y and x for each year group
geom_text(vjust = -1, hjust = 0.5) +
labs(
title = "Patents vs. Researchers",
x = "Reserchers (per million people)",
y = "Patents (per year)"
) +
theme_bw()
combined_grp_yr_class <- combined_data7 |>
group_by(grant_year, income_class) |>
summarize(malepatents = sum(men_inventors), femalepatents = sum(women_inventors), patent_tot = n(), na.rm=TRUE) |> ungroup()
## `summarise()` has grouped output by 'grant_year'. You can override using the
## `.groups` argument.
ggplot(combined_grp_yr_class, aes(x=grant_year, y=patent_tot, color=income_class)) +
geom_point() + geom_line(linewidth=1)+
labs(title = "Total Patents per Year by Income Class",
x = "Year", y = "Total Number of Patents",
color = "Income Class") #+ theme_minimal() + scale_color_brewer(palette = "Set1")
ggplot(combined_grp_yr_class, aes(x=grant_year, y=patent_tot, color=income_class)) +
geom_point() + geom_line(linewidth=1)+
labs(title = "Total Patents per Year by Income Class",
x = "Year", y = "Total Number of Patents",
color = "Income Class") + scale_y_log10()
y_scale= 3.5 # guess the scaling factor for right-y
ggplot(combined_grp_yr_class, aes(x=factor(grant_year))) +
geom_col(aes(y = malepatents, fill = "Male"), position = "dodge", width = 0.4) +
geom_col(aes(y = femalepatents * y_scale, fill = "Female"), position = position_dodge(width = 0.4), width = 0.4) +
scale_y_continuous(name = "Male coauthors", sec.axis = sec_axis(~ . / y_scale, name = "Female coauthors")) +
scale_fill_manual(name = "Patent authorship",
values = c("Male" = "blue", "Female" = "red")) +
labs(title = "Male and Female applications in the 2000-2020 Period", x="Year")
To examine the pairwise correlations among the variables, we use the cor() function.
rel_matrix <- cor( combined_data7 |> select(gdp, gdp_capita, rd_spending_gdp, researchers, exports_gdp, manufacture_gdp, trade_gdp, pubs_millpop, patent_apps))
# rel_matrix #uncomment to see the matrix
# allow plotting in the margin
corrplot(rel_matrix, method = "color", type = "upper", order = "hclust",
tl.col = "black", tl.srt = 40, addCoef.col = "black", col = brewer.pal(n = 8, name = "PiYG"),
title="\nCorrelations")
Conclusion
The correlations among the various variables indicate a complex relationship that can only be better understood utilizing multilinear regression. A logical dependent variable is GDP, as an end point of increased scientific, technological, manufacturing and trade activities of a country.
The logical progression is from increased promotion of innovation (R&D investment) towards scientific and technological capability demonstrated by researchers and scientific publications, towards increased patent applications, towards tangible outcomes (represented by manufacturing, trade and exports), towards increased GDP.
Thus in the following analysis, I will focus on GDP as the dependent variable.
mlrmodel <- lm (gdp ~ rd_spending_gdp + researchers + exports_gdp + manufacture_gdp + trade_gdp + pubs_millpop + gdp_capita + patent_apps, data= combined_data7) # by default ignores NAs, but no NA present in current data
summary(mlrmodel)
##
## Call:
## lm(formula = gdp ~ rd_spending_gdp + researchers + exports_gdp +
## manufacture_gdp + trade_gdp + pubs_millpop + gdp_capita +
## patent_apps, data = combined_data7)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.496e+13 -1.237e+12 3.595e+11 5.758e+11 1.804e+13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.460e+11 2.355e+10 -31.67 <2e-16 ***
## rd_spending_gdp 6.610e+12 7.395e+09 893.91 <2e-16 ***
## researchers -3.200e+09 3.608e+06 -886.85 <2e-16 ***
## exports_gdp -8.950e+11 2.540e+09 -352.31 <2e-16 ***
## manufacture_gdp -4.092e+11 8.909e+08 -459.26 <2e-16 ***
## trade_gdp 4.832e+11 1.391e+09 347.28 <2e-16 ***
## pubs_millpop -1.926e+09 1.036e+07 -185.92 <2e-16 ***
## gdp_capita 2.271e+08 2.832e+05 801.88 <2e-16 ***
## patent_apps 1.182e+07 1.399e+04 844.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.198e+12 on 1165292 degrees of freedom
## Multiple R-squared: 0.912, Adjusted R-squared: 0.912
## F-statistic: 1.509e+06 on 8 and 1165292 DF, p-value: < 2.2e-16
Conclusion
The p-value less than 0.05, and close to zero for the intercept, each coefficient and the entire model means that the model is statistically significant and that each independent variable is a meaningful predictor of the dependent variable. Thus we can reject the null hypothesis that the predictors have no effect. The large F value also indicates the overall goodness of fit. The adjusted R-square value of 91% means that the model explains 91% of the variance of the dependent variable. The residuals seem to follow a normal distribution.
The negative intercept shows the prediction with all the independent variables set to zero, and has only mathematical meaning. The regression coefficients show how each independent variable affects the dependent one. For example, A unit change of rd_spending (1% of GDP) increases GDP by 6.6x10^12 dollars. In the context of this project, such has represented a very meaningful return for the investment for the average country in the 2000-2020 period. The total GDP in that period has averaged 10^13 dollars, so the return has been 6.6x1012/1011=$66 per dollar invested, i.e., a profit of $65 per dollar, or 6500% ROI.
But before we get too excited, let’s examine how accurate this linear model is.
We will visually check for linearity of the predicted GDP with respect to some of the independent variables, expecting the data points to appear to follow a linear trend. Warning: however, I expect the predicted line (BLUE line) produced by lm() in the ggplot to be correct only with respect to slope, but not in the y-values, because it assumes that only one independent variable is varying while the others are held constant. But it serves to illustrate the point.
combined_data7 |>
ggplot( aes (y = gdp, x = rd_spending_gdp))+
geom_point() + geom_smooth(method = 'lm', se = FALSE) +
labs (title = 'GDP against R&D Investment')
## `geom_smooth()` using formula = 'y ~ x'
combined_data7 |>
ggplot( aes (y = gdp, x = patent_apps))+
geom_point() + geom_smooth(method = 'lm', se = FALSE) +
labs (title = 'GDP against Patents')
## `geom_smooth()` using formula = 'y ~ x'
In general, the behavior with respect to the other variables is similar. Below I provide the code for each of the other independent variables, but I will not run it because the memory of my small PC was exhausted and R crashed several times.
combined_data7 |> ggplot( aes (y = gdp, x = researchers))+ geom_point() + geom_smooth(method = ‘lm’, se = FALSE) + labs (title = ‘GDP against Active Researchers’)
combined_data7 |> ggplot( aes (y = gdp, x = trade_gdp))+ geom_point() + geom_smooth(method = ‘lm’, se = FALSE) + labs (title = ‘GDP against Trade’)
combined_data7 |> ggplot( aes (y = gdp, x = exports_gdp))+ geom_point() + geom_smooth(method = ‘lm’, se = FALSE) + labs (title = ‘GDP against Exports’)
combined_data7 |> ggplot( aes (y = gdp, x = pubs_millpop))+ geom_point() + geom_smooth(method = ‘lm’, se = FALSE) + labs (title = ‘GDP against Publications per million pop.’)
combined_data7 |> ggplot( aes (y = gdp, x = manufacture_gdp))+ geom_point() + geom_smooth(method = ‘lm’, se = FALSE) + labs (title = ‘GDP against Manufacturing’)
combined_data7 |> ggplot( aes (y = gdp, x = gdp_capita))+ geom_point() + geom_smooth(method = ‘lm’, se = FALSE) + labs (title = ‘GDP against GDP/capita’)
#```
Conclusions
Although the R-squared correlation factor of 91% for the GDP multinear model is high, visually one can appeciate that the assumptions of normal distribution and linearity of GDP with respect to most of the variables are not well satisfied. In general, one can observe a strong right skew, with clumping of the data towards the smaller GDPs.
Thus, we proceed to transform some of the variables to amplify the influence of the smaller GDP over the model. To do that, we do a log transformation of GDP. Upon some test runs, I saw that doing log transformation of some of the independent variables also improves the model, increasing the R-squared from 86% with the original variable values, to 96% when we use the log(patent applications) and log(R&D spending). We did not proceed to transform the other independent variables at this point. This is just to illustrate the point that the model can be improved.
# Convert to log(gdp), log(patent_apps), log(rd_spending_gdp)
combined_data7$gdp_log <- log(combined_data7$gdp) # transform gdp
combined_data7$patent_log <- log(combined_data7$patent_apps) # transform patent_apps
combined_data7$rd_spending_log <- log(combined_data7$rd_spending_gdp) # transform rd_spending_gdp
# The log transformations introduce infinite values (log of very small numbers) and prevents lm() from running. Thus I had to convert the infinites to NA, so that lm() can ignore them and run ok.
combined_data7$gdp_log[is.infinite(combined_data7$gdp_log)] <- NA
combined_data7$patent_log[is.infinite(combined_data7$patent_log)] <- NA
combined_data7$rd_spending_log[is.infinite(combined_data7$rd_spending_log)] <- NA
mlrmodel_log <- lm (gdp_log ~ patent_log + rd_spending_log + researchers + exports_gdp + manufacture_gdp + trade_gdp + pubs_millpop + gdp_capita, data= combined_data7) # by default ignores NAs
summary(mlrmodel_log)
##
## Call:
## lm(formula = gdp_log ~ patent_log + rd_spending_log + researchers +
## exports_gdp + manufacture_gdp + trade_gdp + pubs_millpop +
## gdp_capita, data = combined_data7)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7972 -0.0666 0.0206 0.0696 2.1476
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.351e+01 4.023e-03 5843.69 <2e-16 ***
## patent_log 5.720e-01 3.179e-04 1799.34 <2e-16 ***
## rd_spending_log 7.903e-02 2.066e-03 38.25 <2e-16 ***
## researchers -2.486e-04 3.271e-07 -760.12 <2e-16 ***
## exports_gdp -1.654e-02 2.645e-04 -62.54 <2e-16 ***
## manufacture_gdp -5.171e-02 9.566e-05 -540.62 <2e-16 ***
## trade_gdp 8.194e-03 1.440e-04 56.92 <2e-16 ***
## pubs_millpop 2.230e-04 1.206e-06 184.85 <2e-16 ***
## gdp_capita 1.837e-05 3.086e-08 595.24 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.23 on 1165281 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.9602, Adjusted R-squared: 0.9602
## F-statistic: 3.516e+06 on 8 and 1165281 DF, p-value: < 2.2e-16
Conclusion
The new model displays very high R-squared value of 96% and rather normally distributed residuals. The p values much smaller than 0.05 for all of the regression coefficients, the intercept, and the model, indicates that they are statistically significant with higher than 95% confidence. The corresponding F-value is also huge,in accordance. The R-squared means that the model explains 96% of the variance of the log of GDP.
It would be interesting to calculate the ROI of GDP allocated to R&D predicted by this better model. Using the elasticity concept, a 1% increase in rd_spending results in a 0.079% increase in gdp. Since the average GDP is about $10^13, 1% invested is $10^11. The return is about $7.9 per dollar invested. The ROI is about 690%.
But we must check the assumptions behind the linear model. A visual examination of fit is a very fast approach.
combined_data7 |>
ggplot( aes (y = gdp_log, x = patent_log))+
geom_point() + geom_smooth(method = 'lm', se = TRUE, aes(group = 1)) +
labs (title = 'log(GDP) against log(Patent Applications)')
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_point()`).
This result is much more pleasing visually. The points show more balanced distribution around a straight line. Again, the slope is the significant aspect, since the y-axis value is approximate because ggplot lm() is predicting a line from a single variable (patent_log) while keeping the other independent variables fixed.
combined_data7 |>
ggplot( aes (y = gdp_log, x = rd_spending_log))+
geom_point() + geom_smooth(method = 'lm') +
labs (title = 'log(GDP) against log(R&D Spending)',
x = "log(R&D Spending)", y = "log(GDP)")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_point()`).
The result is more visually pleasing than the original model. The points in the log model follow better a straight line (with large residuals around the mean values).
combined_data7$predicted_log_gdp <- predict(mlrmodel_log, newdata = combined_data7) # predicted log(gdp)
combined_data7$predicted_gdp <- exp(combined_data7$predicted_log_gdp) # predicted gdp
# Plot again
combined_data7 |>
ggplot( aes (y = gdp, x = patent_apps))+
geom_point() + # original points
geom_line (data= combined_data7, aes(y= predicted_gdp, x= patent_apps, color = "Predicted GDP")) +
scale_color_manual(name = "Legend", values = c("Predicted GDP" = "red")) +
labs (title = 'GDP against Patent Applications per year',
x = "Patent Applications", y = "GDP, U$S") +
theme( plot.background = element_rect(fill = "lightyellow"), # Outer plot background
panel.background = element_rect(fill = "lightblue")) # Inner panel background)
combined_data7 |>
ggplot( aes (y = gdp, x = rd_spending_gdp))+
geom_point() + # original points
geom_line (data= combined_data7, aes(y= predicted_gdp, x= rd_spending_gdp, color = "Predicted GDP")) +
scale_color_manual(name = "Legend", values = c("Predicted GDP" = "purple")) +
labs (title = 'GDP against R&D Spending',
x = "R&D Spending, %GDP", y = "GDP, U$S") +
theme( plot.background = element_rect(fill = "lightblue"), # Outer plot background
panel.background = element_rect(fill = "lightyellow")) # Inner panel background)
Conclusion
The predictions (red and purple line) are amazingly close to the original data points.
With such good results, it is interesting to see whether the model can now incorporate categorical variables.
combined_data7$income_class <- factor (combined_data7$income_class,
levels= c("Low-income countries","Lower-middle-income countries","Upper-middle-income countries","High-income countries"),
labels= c("Low", "Lower-middle", "Upper-middle", "High"))
mlrmodel_log <- lm (gdp_log ~ patent_log + rd_spending_log + income_class + researchers + exports_gdp + manufacture_gdp + trade_gdp + pubs_millpop + gdp_capita, data= combined_data7) # by default ignores NAs
summary(mlrmodel_log)
##
## Call:
## lm(formula = gdp_log ~ patent_log + rd_spending_log + income_class +
## researchers + exports_gdp + manufacture_gdp + trade_gdp +
## pubs_millpop + gdp_capita, data = combined_data7)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6158 -0.0659 0.0233 0.0668 2.1689
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.322e+01 1.184e-02 1962.02 <2e-16 ***
## patent_log 5.689e-01 3.199e-04 1778.20 <2e-16 ***
## rd_spending_log 4.993e-02 2.120e-03 23.55 <2e-16 ***
## income_classLower-middle 2.699e-01 1.190e-02 22.67 <2e-16 ***
## income_classUpper-middle 4.028e-01 1.149e-02 35.07 <2e-16 ***
## income_classHigh 2.629e-01 1.143e-02 23.01 <2e-16 ***
## researchers -2.328e-04 3.760e-07 -619.09 <2e-16 ***
## exports_gdp -2.284e-02 2.749e-04 -83.06 <2e-16 ***
## manufacture_gdp -5.229e-02 9.593e-05 -545.07 <2e-16 ***
## trade_gdp 1.144e-02 1.491e-04 76.76 <2e-16 ***
## pubs_millpop 2.091e-04 1.214e-06 172.18 <2e-16 ***
## gdp_capita 1.913e-05 3.215e-08 594.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2292 on 1165278 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.9605, Adjusted R-squared: 0.9605
## F-statistic: 2.575e+06 on 11 and 1165278 DF, p-value: < 2.2e-16
The results are excellent still. All p values are almost zero, the F is very high, the residual distribution seems almost normal around 0. The goodness of fit (R-squared value) is 0.96, meaning that the model describes 96% of the variance of the log GDP. It is interesting to note that among the 4 income classes, the Upper-middle class has the largest effect (slope=0.4028) on the growth of log(GDP). The reference is Low income class. Researchers, Exports and Manufacture have negative slopes, effects that come from the actual data also seen in the linear model of gdp. We must investigate these negative effects in more depth to understand them.
combined_data7 |>
ggplot( aes (y = gdp_log, x = patent_log, color= income_class))+
geom_point() + geom_smooth(method = 'lm',se = FALSE) +
# scale_color_manual(name= "Income Class", values = c("Low" = #"red","Lower-Middle"="orange","Upper-Middle"="green","High" = "blue"),
#labels = c("Low", "Lower-Middle", "Upper-Middle","High")) +
labs (title = 'log(GDP) against log(Patent Applications)',
x = "log(Patent Applications)", y = "log(GDP)",
color = "Income Class") +
theme( plot.background = element_rect(fill = "lightblue"), # Outer plot background
panel.background = element_rect(fill = "lightyellow")) # Inner panel background)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_point()`).
Result is pleasing. The 4 classes of income follow similar trends, with the dispersion or variance around predicted lines roughly balanced; the low-income countries show the largest variance. The high and middle-high countries file the most patents, as incentives (%GDP allocated to R&D) is higher than in middle-low and low countries. But let’s examine that point further now.
combined_data7 |>
ggplot( aes (y = gdp_log, x = rd_spending_log, color= income_class))+
geom_point() + geom_smooth(method = 'lm', se = FALSE) +
labs (title = 'log(GDP) against log(R&D spending)',
x = "log(R&D spending)", y = "log(GDP)",
color = "Income Class") +
theme( plot.background = element_rect(fill = "lightblue"), # Outer plot background
panel.background = element_rect(fill = "lightyellow")) # Inner panel background)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_point()`).
The result is not as pleasing in terms of visual distribution of points but the slopes are meaningful. Interestingly they show that Upper-middle income countries GDP appear to benefit the most from investment, although the High income countries allocate the highest percentages of GDP to R&D.
Let’s now examine some statistical aspects to ascertain the validity of the multilinear log models. We’ll examine normality and independence of residuals, homoscedasticity, and multicollinearity of the independent variables. I’ll just use simple visual examination.
# do a Q-Q plot to visualize normality
plot(mlrmodel_log, which = 2) # to generate the Normal Q-Q plot
hist(residuals(mlrmodel_log))
plot(mlrmodel_log, which = 1) # this generates the residuals vs fitted
plot(1:length(residuals(mlrmodel_log)), residuals(mlrmodel_log), xlab = "Order, by year", ylab = "Residuals")
cor_matrix <- cor( combined_data7 |> select(rd_spending_log,patent_log,researchers,manufacture_gdp,trade_gdp,pubs_millpop,exports_gdp))
cor_matrix
## rd_spending_log patent_log researchers manufacture_gdp
## rd_spending_log 1.0000000 0.4816980 0.6548849 0.1430838
## patent_log 0.4816980 1.0000000 -0.1323853 0.1203137
## researchers 0.6548849 -0.1323853 1.0000000 0.2846661
## manufacture_gdp 0.1430838 0.1203137 0.2846661 1.0000000
## trade_gdp -0.2262144 -0.7310600 0.2835756 0.2264085
## pubs_millpop 0.1430034 -0.4718066 0.2801434 -0.5530510
## exports_gdp -0.2121138 -0.7185020 0.2991440 0.2646476
## trade_gdp pubs_millpop exports_gdp
## rd_spending_log -0.2262144 0.1430034 -0.2121138
## patent_log -0.7310600 -0.4718066 -0.7185020
## researchers 0.2835756 0.2801434 0.2991440
## manufacture_gdp 0.2264085 -0.5530510 0.2646476
## trade_gdp 1.0000000 0.3341954 0.9981307
## pubs_millpop 0.3341954 1.0000000 0.3099796
## exports_gdp 0.9981307 0.3099796 1.0000000
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
tl.col = "black", tl.srt = 40, addCoef.col = "black", col = brewer.pal(n = 8, name = "PiYG"),
title = "/nCorrelations")
Conclusion
The deep green diagonal is the expected autocorrelation. The green in trade and export (1.0), researchers and rd_spending (.65) and patents and rd_spending (.48) show collinearity among those pairs; these are expected. The dark purple shows anticorrelation between manufacture and publications (0.55), patents and exports (.72), and patents and trade (.73). These cases of multicollinearity are a concern for the robustness of the model.
The initial goal of the project was to investigate the assumption that R&D investment by nations leads to increased production of intellectual property patents, which lastly impact economic growth as measured by GDP. As the project progressed, other assumptions were made, with respect to the effect of gender on patent production, and the influence of income status of nations on their ability to produce patents and thus increase their GDP.
In our mental model, the basic framework supporting patent (IP) production has three legs: (1) a strong investment in scientific or technical R&D promotion (measured by GDP% invested in R&D, number of researchers and number of scientific publications), (2) manufacturing capability to enable technicians and scientists to test, develop and commercialize their concepts (measured by manufacturing as fraction of GDP), and (3) a strong legal, IP protection, and commercialization policy (measure by exports and trade as fractions of GDP).
The initial linear model related GDP to the various variables yielded interesting and statistically significant results supporting the alternative hypothesis that increased investment in R&D leads to GDP growth (many-fold ROI), and that increase in patent applications leads to GDP growth. So does supporting trade. On the other hand, the computed model predicts negative effects on GDP from the other variables (export, manufacturing, researchers and publications.) This goes against our assumed patent-supporting ecosystem. Further statistical examination of the linear model shows that despite having an excellent R-squared of 91%, the basic assumption of linearity and normality of the prediction and the residual distribution cannot be proven, withdrawing confidence from the predicted outcome.
A modified multilinear model was constructed, relating log-transformed GDP (dependent variable) to log-transformed R&D investments, log-transformed patents, and the other linear independent variables. This model is also statistically significant and with goodness of fit R-square of 96%. Moreover it displays linearity and normality of predictions and homoscedasticity of residuals much better than the original model. The prediction again shows that increased R&D investment leads to GDP growth, with a ROI approximating $690/$1 invested. Increased patent applications, trade and scientific publications also lead to GDP growth. However, this model still predicts negative effects on GDP from increases in researchers, exports, and manufacturing, which defies a simple explanation. Nevertheless, the model shows that countries with Middle-high income contribute to GDP growth at the fastest rate compared to Low, Middle-high, and High income countries. The model didn’t include gender as a variable, but in the exploratory analysis, it was shown that the female/male-inventors ratio is about 25% and increasing faster in recent years.
The major obstacle in this project was compiling the dataset because it required multiple sources, followed by extensive data cleaning and wrangling in order to arrive at a database wihout NA without introduction of bias. The final dataset covers 20 years and contains 1.2M records. This also originated complications with crashes when R was computing some of the graphics.
The modeling itself didn’t present many obstacles (other than plenty of time dedicated to it). In the end, as said above, the modeling can be improved with further effort.
After examining the collinearities, I can improve the log(GDP) model by removing one or two of these independent variables: trade and researchers. Trade in practice includes both exports and imports. For the purpose of this project, productivity is better measured by exports than imports. Thus, trade is the logical variable to be dropped. Researchers is definitely linked to rd_spending, so researchers could be dropped. The largest concern is the unexpected anti correlation between patents and exports and trade as well as anticorrelation between manufacture and publications. I ran out of time to investigate how to explain these anticorrelations and then logically improve the model.
One other aspect left for future work is possibly most valuable, which is to investigate the type or class of patents that provide the largest returns for countries. In this dataset I compiled data on WIPO field and WIPO sector which are high level classifications of patents. Initially one could analyze the impact of these classes on either GDP or exports or manufacture. Further work for the future is to include in this dataset finer details on patents available in USPTO, EPO, WIPO and other sources.