Do the necessary library installations

# install.packages("tidyverse", "data.table") # remove comment if not already installed

    library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
    library(data.table)
## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose
    library(dplyr)
    library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
    library(patchwork)
    library(corrplot)
## corrplot 0.95 loaded
    library(RColorBrewer)

Title: Impact of Patents on National Growth

a. Introduction

The primary question in this study is how do patents applications relate to national economic growth? In relation to that, how do inceptives for scientific and technologic growth as measured by fraction of GDP invested in R&D impact further growth as measured by GDP? The underlying concept is that building and protecting intellectual property is a source of national impact and a foundation for future growth.

We have seen a fast increase in patent filing over the last 3 decades and we would like to better understand which countries are promoting that growth and how it is affecting their economy.

Patent production is enabled by supporting infrastructure that enables their development. Isolated and eccentric inventors of the past are no longer the primary source of innovative ideas. The working model is an economy that places emphasis on technical or technological innovation as a basis of national advancement for its people and well-being. Such emphasis is evidenced by the support of a strong educational foundation that promotes creativity and independent thinking among its youngest citizens, the source of future inventors. But technical interest and education must be complemented with a powerful manufacturing and scientific infrastructure that allows inventors to discover where innovation is needed the most, and what’s feasible and realizable in practice. Since the low-hanging fruit has already been collected, the current generation of inventors are aiming for technological marvels that seemed science fiction just a few years ago. Thus, from the practical inventions to facilitate daily human living, mostly mechanical and electrical in nature, inventors are now turning to sophisticated medical, pharmaceutical, and chemical solutions to prolong life; electronic and digital solutions to robotize routine tasks as well to achieve tasks not easily feasible for humans, or instruments and communication and manufacturing technology for the fast growing space industry. I mentioned two of the necessary legs to competitive patent production: technical education, and manufacturing and scientific infrastructure. The third leg is a favorable legal, regulatory and financial system that protects invention and makes it attractive and profitable. In the past century, only a handful of countries had all three legs in place. Over the last 4 decades, however, more and more countries are reaching the state of development that enable them to enter the “race”.

In this initial entry into this topic, I am aiming to start scoping the land of patents and see how they are influencing the economy of countries. One basic question is: do countries that invest more in R&D produce more patents? I hope to answer that question here. The second question is: are more patents leading to more exports (or trade) for countries? If I can collect sufficient data, I may be able to answer that here. Is the growth of R&D in general impacting patent production? On the basis of considerations raised in the previous paragraph, other questions are: is gender diversity playing a role in patent production? Are different patent sectors and types growing differently? The last question may lead to very valuable strategic guidance, but a more comprehensive dataset and much analysis time are needed, thus I don’t expect to answer that question here.

b. Source databases for this project

The project required assembling a dataset with the necessary information by country and year, to answer the questions posed above: (1) tables of individual patent filings from USPTO,(2) tables with various indicators from World Bank: GDP, patent applications (total), R&D spending, exports, manufacturing, researchers, population, (3) tables from OurWorldinData on scientific publications, on trade, and country classification by income, (4) table from IBAN to convert country codes from Alpha-2 to Alpha-3 (ISO 3166 international standard.)

USPTO (US Patent and Trademark Office)

PatentsView (annualized tables): https://patentsview.org/data/annualized; downloaded 10 tables 1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015, 2020 and 2024.

World Bank (World Development Indicators):

GDP(current US$): https://data.worldbank.org/indicator/NY.GDP.MKTP.CD

Patent applications (residents): https://data.worldbank.org/indicator/IP.PAT.RESD

Exports of goods and services (%GDP): https://data.worldbank.org/indicator/NE.EXP.GNFS.ZS

Manufacturing, value added (%GDP): https://data.worldbank.org/indicator/NV.IND.MANF.ZS

Population (total): https://data.worldbank.org/indicator/SP.POP.TOTL

Researchers in R&D (per million population): https://data.worldbank.org/indicator/SP.POP.SCIE.RD.P6

Research and Development expenditure (%GDP): https://data.worldbank.org/indicator/GB.XPD.RSDV.GD.ZS

Our World in Data:

Trade (%GDP): https://ourworldindata.org/grapher/trade-as-share-of-gdp

Publications, science (per million population): https://ourworldindata.org/grapher/scientific-publications-per-million

Countries by income classification, https://ourworldindata.org/grapher/countries-by-income-classification

IBAN: https://www.iban.com/country-codes; conversion between Alpha-2 and Alpha-3 codes used by WorldBank and OurWorldinData

These datasets were bound and then cleaned and wrangled as described below.

c. Dataset preparation

Download to my project folder a selected number of annual datasets from USPTO, at https://patentsview.org/data/annualized (1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015, 2020, 2024). Now, read one of them with data.table’s fread for efficient handling of large files. Examine its top rows to take a peek at its variable structure.

setwd(getwd()) # ensure we're at the directory where the RMD is running at and all patent datasets are in subdir PATENTSVIEW
dt <- fread("PATENTSVIEW/1980.csv")
str(dt)
## Classes 'data.table' and 'data.frame':   67133 obs. of  60 variables:
##  $ patent_number          : chr  "4180867" "4180868" "4180869" "4180870" ...
##  $ grant_year             : int  1980 1980 1980 1980 1980 1980 1980 1980 1980 1980 ...
##  $ application_number     : int  5881758 5851785 5812057 5676879 5847841 5855370 5748233 5885934 5906415 5957799 ...
##  $ application_year       : int  1978 1977 1977 1976 1977 1977 1976 1978 1978 1978 ...
##  $ d_inventor             : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ d_assignee             : int  0 0 1 1 0 0 0 1 0 1 ...
##  $ d_location             : int  0 0 1 1 0 0 0 1 0 1 ...
##  $ d_application          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ d_cpc                  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ d_ipc                  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ d_wipo                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ assignee               : chr  "" "" "COLGATE-PALMOLIVE COMPANY" "FA WILH. JUL. TEUFEL" ...
##  $ assignee_sequence      : int  NA NA 0 0 NA NA NA 0 NA 0 ...
##  $ assignee_ind           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ country                : chr  "" "" "US" "DE" ...
##  $ city                   : chr  "" "" "New York" "Stuttgart" ...
##  $ state                  : chr  "" "" "NY" "" ...
##  $ county                 : chr  "" "" "New York" "" ...
##  $ cpc_sections           : chr  "A" "A" "A" "B  A" ...
##  $ n_cpc                  : int  3 1 1 4 3 1 6 1 5 1 ...
##  $ n_ipc                  : int  2 1 1 1 1 1 1 1 1 2 ...
##  $ ipc_sections           : chr  "A  E" "A" "A" "A" ...
##  $ n_wipo                 : int  1 1 1 2 1 1 1 1 1 1 ...
##  $ wipo_field_ids         : chr  "34" "34" "13" "13  25" ...
##  $ first_wipo_field_title : chr  "Other consumer goods" "Other consumer goods" "Medical technology" "Medical technology" ...
##  $ first_wipo_sector_title: chr  "Other fields" "Other fields" "Instruments" "Instruments" ...
##  $ team_size              : int  1 1 2 3 1 1 1 2 1 5 ...
##  $ inventors              : int  1 1 2 3 1 1 1 2 1 5 ...
##  $ men_inventors          : int  0 1 0 0 1 0 1 2 1 4 ...
##  $ women_inventors        : int  0 0 1 0 0 0 0 0 0 1 ...
##  $ inventor_id1           : chr  "fl:ma_ln:ridgeway-3" "fl:ch_ln:snow-1" "oqd48eh5qnqj4ghh6gl8noz33" "fl:ra_ln:radulovic-2" ...
##  $ inventor_name1         : chr  "Marcus L. Ridgeway, Jr." "Charles C. Snow" "John E. Pedergrass" "Radoje Radulovic" ...
##  $ male_flag1             : int  0 1 0 0 1 0 1 1 1 1 ...
##  $ inventor_id2           : chr  "" "" "fl:cy_ln:wichman-1" "bj0zl4z4ab7cppmiw4oh8m5vx" ...
##  $ inventor_name2         : chr  "" "" "Cynthia A. Wichman" "Mustafa Karisik" ...
##  $ male_flag2             : int  NA NA 0 0 NA NA NA 1 NA 1 ...
##  $ inventor_id3           : chr  "" "" "" "fl:kl_ln:wolfer-2" ...
##  $ inventor_name3         : chr  "" "" "" "Klaus R. Wolfer" ...
##  $ male_flag3             : int  NA NA NA 0 NA NA NA NA NA 1 ...
##  $ inventor_id4           : chr  "" "" "" "" ...
##  $ inventor_name4         : chr  "" "" "" "" ...
##  $ male_flag4             : int  NA NA NA NA NA NA NA NA NA 1 ...
##  $ inventor_id5           : chr  "" "" "" "" ...
##  $ inventor_name5         : chr  "" "" "" "" ...
##  $ male_flag5             : int  NA NA NA NA NA NA NA NA NA 0 ...
##  $ inventor_id6           : chr  "" "" "" "" ...
##  $ inventor_name6         : chr  "" "" "" "" ...
##  $ male_flag6             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ inventor_id7           : chr  "" "" "" "" ...
##  $ inventor_name7         : chr  "" "" "" "" ...
##  $ male_flag7             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ inventor_id8           : chr  "" "" "" "" ...
##  $ inventor_name8         : chr  "" "" "" "" ...
##  $ male_flag8             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ inventor_id9           : chr  "" "" "" "" ...
##  $ inventor_name9         : chr  "" "" "" "" ...
##  $ male_flag9             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ inventor_id10          : chr  "" "" "" "" ...
##  $ inventor_name10        : chr  "" "" "" "" ...
##  $ male_flag10            : int  NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, ".internal.selfref")=<externalptr>
head(dt)
##    patent_number grant_year application_number application_year d_inventor
##           <char>      <int>              <int>            <int>      <int>
## 1:       4180867       1980            5881758             1978          1
## 2:       4180868       1980            5851785             1977          1
## 3:       4180869       1980            5812057             1977          1
## 4:       4180870       1980            5676879             1976          1
## 5:       4180871       1980            5847841             1977          1
## 6:       4180872       1980            5855370             1977          1
##    d_assignee d_location d_application d_cpc d_ipc d_wipo
##         <int>      <int>         <int> <int> <int>  <int>
## 1:          0          0             1     1     1      1
## 2:          0          0             1     1     1      1
## 3:          1          1             1     1     1      1
## 4:          1          1             1     1     1      1
## 5:          0          0             1     1     1      1
## 6:          0          0             1     1     1      1
##                     assignee assignee_sequence assignee_ind country      city
##                       <char>             <int>        <int>  <char>    <char>
## 1:                                          NA            0                  
## 2:                                          NA            0                  
## 3: COLGATE-PALMOLIVE COMPANY                 0            0      US  New York
## 4:      FA WILH. JUL. TEUFEL                 0            0      DE Stuttgart
## 5:                                          NA            0                  
## 6:                                          NA            0                  
##     state   county cpc_sections n_cpc n_ipc ipc_sections n_wipo wipo_field_ids
##    <char>   <char>       <char> <int> <int>       <char>  <int>         <char>
## 1:                            A     3     2         A  E      1             34
## 2:                            A     1     1            A      1             34
## 3:     NY New York            A     1     1            A      1             13
## 4:                         B  A     4     1            A      2         13  25
## 5:                            A     3     1            A      1             13
## 6:                            A     1     1            A      1             13
##    first_wipo_field_title first_wipo_sector_title team_size inventors
##                    <char>                  <char>     <int>     <int>
## 1:   Other consumer goods            Other fields         1         1
## 2:   Other consumer goods            Other fields         1         1
## 3:     Medical technology             Instruments         2         2
## 4:     Medical technology             Instruments         3         3
## 5:     Medical technology             Instruments         1         1
## 6:     Medical technology             Instruments         1         1
##    men_inventors women_inventors              inventor_id1
##            <int>           <int>                    <char>
## 1:             0               0       fl:ma_ln:ridgeway-3
## 2:             1               0           fl:ch_ln:snow-1
## 3:             0               1 oqd48eh5qnqj4ghh6gl8noz33
## 4:             0               0      fl:ra_ln:radulovic-2
## 5:             1               0          fl:ro_ln:hamas-1
## 6:             0               0 0804gl64thfu0vjgwfmo9v18m
##             inventor_name1 male_flag1              inventor_id2
##                     <char>      <int>                    <char>
## 1: Marcus L. Ridgeway, Jr.          0                          
## 2:         Charles C. Snow          1                          
## 3:      John E. Pedergrass          0        fl:cy_ln:wichman-1
## 4:        Radoje Radulovic          0 bj0zl4z4ab7cppmiw4oh8m5vx
## 5:         Robert S. Hamas          1                          
## 6:         Neal S. Chaikin          0                          
##        inventor_name2 male_flag2      inventor_id3  inventor_name3 male_flag3
##                <char>      <int>            <char>          <char>      <int>
## 1:                            NA                                           NA
## 2:                            NA                                           NA
## 3: Cynthia A. Wichman          0                                           NA
## 4:    Mustafa Karisik          0 fl:kl_ln:wolfer-2 Klaus R. Wolfer          0
## 5:                            NA                                           NA
## 6:                            NA                                           NA
##    inventor_id4 inventor_name4 male_flag4 inventor_id5 inventor_name5
##          <char>         <char>      <int>       <char>         <char>
## 1:                                     NA                            
## 2:                                     NA                            
## 3:                                     NA                            
## 4:                                     NA                            
## 5:                                     NA                            
## 6:                                     NA                            
##    male_flag5 inventor_id6 inventor_name6 male_flag6 inventor_id7
##         <int>       <char>         <char>      <int>       <char>
## 1:         NA                                     NA             
## 2:         NA                                     NA             
## 3:         NA                                     NA             
## 4:         NA                                     NA             
## 5:         NA                                     NA             
## 6:         NA                                     NA             
##    inventor_name7 male_flag7 inventor_id8 inventor_name8 male_flag8
##            <char>      <int>       <char>         <char>      <int>
## 1:                        NA                                     NA
## 2:                        NA                                     NA
## 3:                        NA                                     NA
## 4:                        NA                                     NA
## 5:                        NA                                     NA
## 6:                        NA                                     NA
##    inventor_id9 inventor_name9 male_flag9 inventor_id10 inventor_name10
##          <char>         <char>      <int>        <char>          <char>
## 1:                                     NA                              
## 2:                                     NA                              
## 3:                                     NA                              
## 4:                                     NA                              
## 5:                                     NA                              
## 6:                                     NA                              
##    male_flag10
##          <int>
## 1:          NA
## 2:          NA
## 3:          NA
## 4:          NA
## 5:          NA
## 6:          NA

Now bind all of the datasets into a single data.table, but to avoid memory overrun, fread only a selected number of columns from each of them, for efficient use of RAM.

# list all the csv files stored in the project directory/PATENTS subdirectory

file_list <- list.files(path =  "PATENTSVIEW", pattern = "*.csv", full.names = TRUE)
file_list
##  [1] "PATENTSVIEW/1980.csv" "PATENTSVIEW/1985.csv" "PATENTSVIEW/1990.csv"
##  [4] "PATENTSVIEW/1995.csv" "PATENTSVIEW/2000.csv" "PATENTSVIEW/2005.csv"
##  [7] "PATENTSVIEW/2010.csv" "PATENTSVIEW/2015.csv" "PATENTSVIEW/2020.csv"
## [10] "PATENTSVIEW/2024.csv"
# select a specific set of columns in the datasets

selected_columns <- c("grant_year","application_year","country","first_wipo_field_title","first_wipo_sector_title","men_inventors","women_inventors") 

# bind all the datasets directly into a data.table format; note: I'm using a base-R for loop for clarity, but there must be a more sophisticated function for this.

combined_data <- data.table()
 for (file in file_list) {
     temp_data <- fread(file, select = selected_columns)
     combined_data <- rbind(combined_data, temp_data)
 }

str(combined_data)  # show number of rows and list columns
## Classes 'data.table' and 'data.frame':   2096504 obs. of  7 variables:
##  $ grant_year             : int  1980 1980 1980 1980 1980 1980 1980 1980 1980 1980 ...
##  $ application_year       : int  1978 1977 1977 1976 1977 1977 1976 1978 1978 1978 ...
##  $ country                : chr  "" "" "US" "DE" ...
##  $ first_wipo_field_title : chr  "Other consumer goods" "Other consumer goods" "Medical technology" "Medical technology" ...
##  $ first_wipo_sector_title: chr  "Other fields" "Other fields" "Instruments" "Instruments" ...
##  $ men_inventors          : int  0 1 0 0 1 0 1 2 1 4 ...
##  $ women_inventors        : int  0 0 1 0 0 0 0 0 0 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
head(combined_data) # show the top rows
##    grant_year application_year country first_wipo_field_title
##         <int>            <int>  <char>                 <char>
## 1:       1980             1978           Other consumer goods
## 2:       1980             1977           Other consumer goods
## 3:       1980             1977      US     Medical technology
## 4:       1980             1976      DE     Medical technology
## 5:       1980             1977             Medical technology
## 6:       1980             1977             Medical technology
##    first_wipo_sector_title men_inventors women_inventors
##                     <char>         <int>           <int>
## 1:            Other fields             0               0
## 2:            Other fields             1               0
## 3:             Instruments             0               1
## 4:             Instruments             0               0
## 5:             Instruments             1               0
## 6:             Instruments             0               0

Cleaning: check for NAs and cells empty or with blanks and impute them as explained below

# commented to avoid memory overrun - un-comment to check NAs
# colSums(is.na(combined_data)) # show the number of NAs in each column
# colSums(combined_data == "" | combined_data == " ", na.rm=TRUE)   # and the number of empty or blank spaces

The NAs and blanks in “country” are critical, since country is a major independent categorical variable in this study. That information cannot be extracted or inferred from any of the other columns. Thus NA and blank country values will be turned to a new category: “Other”. Application_year has NA values; they will be imputed to be equal to grant_year - mean(delay), where delay=grant_year minus application_year. The variables first_wipo_field_title and first_wipo_sector_title have missing values, thus will be turned to “Unspecified”.

combined_data1 <- mutate(combined_data, application_year= ifelse(is.na(application_year), grant_year-round(mean(grant_year-application_year, na.rm = TRUE)), application_year))

Impute the NA country with Other, wipo field or sector with Unspecified

combined_data1$country[is.na(combined_data1$country) | combined_data1$country == ""] <- "Other"
combined_data1$first_wipo_field_title [combined_data1$first_wipo_field_title == ""] <- "Unspecified"
combined_data1$first_wipo_sector_title [combined_data1$first_wipo_sector_title == ""] <- "Unspecified"

Read datasets from outside into the workind directory of the rmd: country code conversion table, income class table, population, gdp, researchers, research spending, patent applications, exports, manufacturing, scientific publications, and trade.

country_codes <- read_csv("country_codes-iban.csv", show_col_types = FALSE)
population <- read_csv("population-worldbank.csv", show_col_types = FALSE)
income_class <- read_csv("income-groups-ourworldindata.csv", show_col_types = FALSE)
gdp <- read_csv("gdp-worldbank-reduced.csv", show_col_types = FALSE)
researchers <- read_csv("researchers-per-million-worldbank.csv", show_col_types = FALSE)
research_spending <- read_csv("research-spending-gdp-worldbank.csv", show_col_types = FALSE)
patents_apps <- read_csv("annual-patent-applications-residents-worldbank.csv", show_col_types = FALSE)
exports <- read_csv("exports-gdp-worldbank-reduced.csv", show_col_types = FALSE)
manufacturing <- read_csv("manufacturing-gdp-worldbank.csv", show_col_types = FALSE)
publications <- read_csv("sci-pubs-per-million-ourworldindata.csv", show_col_types = FALSE)
trade <- read_csv("trade-share-gdp-ourworldindata.csv", show_col_types = FALSE)

Join datasets, creating columns for country codes, income classification, population, GDP, researchers per million, research spending, all patent apps, exports, manufacturing, scientific publications and trade.

combined_data2 <- left_join(combined_data1, country_codes, by= c("country" = "country"))
combined_data2 <- left_join(combined_data2, income_class, by= c("code" = "code", "grant_year"="year"))
combined_data2 <- left_join(combined_data2, population, by= c("code" = "code", "grant_year"="year"))
combined_data2 <- left_join(combined_data2, gdp, by= c("code" = "code", "grant_year"="year"))
combined_data2 <- left_join(combined_data2, researchers, by= c("code" = "code", "grant_year"="year"))
combined_data2 <- left_join(combined_data2, research_spending, by= c("code" = "code", "grant_year"="year"))
combined_data2 <- left_join(combined_data2, patents_apps, by= c("code" = "code", "grant_year"="year"))
combined_data2 <- left_join(combined_data2, exports, by= c("code" = "code", "grant_year"="year"))
combined_data2 <- left_join(combined_data2, manufacturing, by= c("code" = "code", "grant_year"="year"))
combined_data2 <- left_join(combined_data2, publications, by= c("code" = "code", "grant_year"="year"))
combined_data2 <- left_join(combined_data2, trade, by= c("code" = "code", "grant_year"="year"))
combined_data2
##          grant_year application_year country first_wipo_field_title
##               <num>            <num>  <char>                 <char>
##       1:       1980             1978   Other   Other consumer goods
##       2:       1980             1977   Other   Other consumer goods
##       3:       1980             1977      US     Medical technology
##       4:       1980             1976      DE     Medical technology
##       5:       1980             1977   Other     Medical technology
##      ---                                                           
## 2096500:       2024             2021      KR            Unspecified
## 2096501:       2024             2021      KR            Unspecified
## 2096502:       2024             2021      JP            Unspecified
## 2096503:       2024             2022      US            Unspecified
## 2096504:       2024             2022      TW            Unspecified
##          first_wipo_sector_title men_inventors women_inventors
##                           <char>         <int>           <int>
##       1:            Other fields             0               0
##       2:            Other fields             1               0
##       3:             Instruments             0               1
##       4:             Instruments             0               0
##       5:             Instruments             1               0
##      ---                                                      
## 2096500:             Unspecified             1               3
## 2096501:             Unspecified             0               0
## 2096502:             Unspecified             2               0
## 2096503:             Unspecified             2               1
## 2096504:             Unspecified             3               1
##                      country_name   code          income_class population
##                            <char> <char>                <char>      <num>
##       1:                    Other  Other                  <NA>         NA
##       2:                    Other  Other                  <NA>         NA
##       3: United States of America    USA                  <NA>  227225000
##       4:                  Germany    DEU                  <NA>   78288576
##       5:                    Other  Other                  <NA>         NA
##      ---                                                                 
## 2096500:         Korea (Republic)    KOR High-income countries   51751065
## 2096501:         Korea (Republic)    KOR High-income countries   51751065
## 2096502:                    Japan    JPN High-income countries  123975371
## 2096503: United States of America    USA High-income countries  340110988
## 2096504:           Taiwan (China)    TWN High-income countries         NA
##                   gdp researchers rd_spending_gdp patent_apps exports_gdp
##                 <num>       <num>           <num>       <num>       <num>
##       1:           NA          NA              NA          NA          NA
##       2:           NA          NA              NA          NA          NA
##       3: 2.857307e+12          NA              NA       62098    9.826455
##       4: 9.537725e+11          NA              NA       28683   18.543411
##       5:           NA          NA              NA          NA          NA
##      ---                                                                 
## 2096500: 0.000000e+00          NA              NA          NA    0.000000
## 2096501: 0.000000e+00          NA              NA          NA    0.000000
## 2096502: 4.026211e+12          NA              NA          NA    0.000000
## 2096503: 2.918489e+13          NA              NA          NA   10.896875
## 2096504:           NA          NA              NA          NA          NA
##          manufacture_gdp pubs_millpop trade_gdp
##                    <num>        <num>     <num>
##       1:              NA           NA        NA
##       2:              NA           NA        NA
##       3:        0.000000           NA  20.10984
##       4:        0.000000           NA  41.69634
##       5:              NA           NA        NA
##      ---                                       
## 2096500:        0.000000           NA        NA
## 2096501:        0.000000           NA        NA
## 2096502:        0.000000           NA        NA
## 2096503:        9.981583           NA  24.88799
## 2096504:              NA           NA        NA

d. Description of the Variables to be used - Data cleaning and wrangling

Income_class: The World Bank relabeled the country classification according to GDP several times prior to 1989, with the most recent official classification starting in 1989 (used as early as 1987) and containing four income groups: Low, Lower Middle, Upper Middle, and High. Moreover, countries may move between classes over time, and some countries were not labeled in the original patents nor received an ISO-3 code. Thus, for proper comparison, the statistics will be restricted to rows with assigned income_class in years 1987-2024. When unavailable, income_class is imputed to “Unspecified”.

  mutate(combined_data2,  income_class = ifelse(is.na(income_class), "Unspecified", income_class))

Population: has remaining NAs for the rows without assigned country of issuance nor ISO-3 code. In addition, Taiwan (province of China) did not report a population, as it is officially added to China. Since the total aggregated population is 0, the mean yields NaN and cannot be used to impute population. Those NAs will not be imputed and those rows will be excluded from the calculations with na.rm.

Application_year: Summary shows two abnormal entries for application_year>2024, a filter was used to find them, and they are imputed to grant_year minus the mean(delay), where delay = grant_year - application_year.

 filter(combined_data2, application_year>2024)  # (to find and show the 2 abnormal entries)
 mutate(combined_data2, application_year= ifelse(application_year > 2024, grant_year-round(mean(grant_year-application_year, na.rm = TRUE)),application_year))
 

Men_inventor and women_inventors: data summary shows for men a Max=111, which is not an error. It occurs for 3 separate applications from Sweden with 111 inventors; it appears to be the same or related patent application submitted at various times. In women_inventors, there is one application with 20 women. It is not an error.

GDP: not available for rows without assigned country not ISO-3 code. Left as NA for now. Later these rows will be filtered out when appropriate or by the functions used.

Researchers: sparse information from World Bank, available from 1996 until 2022 for some of the countries. Imputed with the mean over the period 1996 until 2022 for each country. To be used only for that period. NA outside that range.

 combined_data2 |> 
 group_by(code) |> mutate(
 mean_res = mean(researchers[code != "" & grant_year >= 1996 & grant_year <=2022], na.rm = TRUE),   # mean for countries with code
 researchers = ifelse(is.na(researchers) & code != "", mean_res, researchers)) |>
 ungroup() |> select (-mean_res)   # remove temp col mean_res
 

Rd_spending_gdp: sparse information from World Bank, available from 1996 until 2022 for some of the countries. Imputed with the mean over the period 1996 until 2022 for each country. To be used only for that period. NA outside that range.

  combined_data2 |> 
  group_by(code) |> mutate(
  mean_res_spend = mean(rd_spending_gdp[grant_year >= 1996 & grant_year <=2022], na.rm = TRUE),
  researchers = ifelse(is.na(rd_spending_gdp) & code != "", mean_res_spend, rd_spending_gdp)) |>
  ungroup() |> select (-mean_res_spend)   # remove temp col mean_res_spend
  

Patent_apps: partial information from World Bank, available from 1980 until 2021 for most countries. The analysis involving patent_apps will only be done in that time range. NAs are imputed with the mean for each country.

  combined_data2 |> 
  group_by(code) |> mutate(
  mean_patent_apps = mean(patent_apps[grant_year >= 1980 & grant_year <=2021], na.rm = TRUE),
  patent_apps = ifelse(is.na(patent_apps) & code != "", mean_patent_apps, patent_apps)) |>
  ungroup() |> select (-mean_patent_apps)   # remove temp col mean_patent_apps
  

Exports_gdp: mostly complete since 1980-2024 but not for all countries. Not available for rows without assigned country not ISO-3 code. Left as NA.

Manufacture_gdp: mostly complete since 1980-2024 but not for all countries. Not available for countries without ISO-3 code. Left as NA.

Publications per million: mostly complete from Our World in Data since 1980-2024 but not for all countries. Not available for countries without ISO-3 code. Left as NA.

Trade-gdp: mostly complete since 1980-2024 but not for all countries. Not available for countries without ISO-3 code. Left as NA.

Carry out these operations in a single code chunk:

combined_data3 <- combined_data2 |> dplyr::mutate(  
       income_class = ifelse(is.na(income_class), "Unspecified", income_class),
       application_year= ifelse(application_year > 2024, grant_year-round(mean(grant_year-application_year, na.rm = TRUE)),application_year)
       )

combined_data4 <- combined_data3 |> 
     group_by(code) |> dplyr::mutate(
     mean_res = mean(researchers[code != "" & grant_year >= 1996 & grant_year <=2022], na.rm = TRUE),
     researchers = ifelse(is.na(researchers) & code != "", mean_res, researchers),
     mean_res_spend = mean(rd_spending_gdp[grant_year >= 1996 & grant_year <=2022], na.rm = TRUE),
      rd_spending_gdp = ifelse(is.na(rd_spending_gdp) & code != "", mean_res_spend, rd_spending_gdp),
     mean_patent_apps = mean(patent_apps[grant_year >= 1980 & grant_year <=2021], na.rm = TRUE),
      patent_apps = ifelse(is.na(patent_apps) & code != "", mean_patent_apps, patent_apps)) |>
     ungroup() |> select (-mean_res, -mean_res_spend, -mean_patent_apps)

Reduced dataset for a full regression study: combined_data5 is a subset that covers patents from 2000 until 2020, in order to exclude years where multiple variables have sparse data. combined_data5 also excludes all rows where country (code) information is Other or TWN. Taiwan is excluded because it does not provide most of the critical information compiled in this dataset, such as population, gdp, r&d spending, trade, etc, as it is considered a province of China. It also excludes about 50 lines missing population, gdp, exports and manufacture, from small islands. The remaining dataset contains 1.2M records.

combined_data5 <- combined_data4 |>
      filter (grant_year >= 2000 & grant_year <= 2020 & code != "Other" & code != "TWN" & population != 0 & !is.na(population) & !is.nan(population))

# Add a column for gdp per capita:

combined_data5 <- dplyr::mutate(combined_data5, gdp_capita = gdp / population)   # to avoid error, I had to specify dplyr::mutate

# Attempt to remove remaining NAs by imputing with the mean for each country, for researchers, rd_spending_gdp, patent_apps, pubs_millpop, trade_gdp

combined_data6 <- combined_data5 |> 
     dplyr::mutate (
       across(where(is.numeric), ~ replace(., is.nan(.), NA))) |>               #making sure all NaN are NA so mean() works well
     group_by(code) |> 
     dplyr::mutate (
       across(where(is.numeric), ~ replace(., is.na(.), mean(., na.rm = TRUE)))) |>   #replace all NAs by country means
       ungroup()   # remove grouping

The remaining NAs appear in smaller nations or nations that do not report all patent information. For those, the mean value calculation yields NaN (div by 0), and can’t be replaced. This number of NAs is acceptable for the 1.2M records and won’t affect the statistics significantly. However, we will create a further reduced dataset (combined_data7) by eliminating the rows that contain NAs or NaN to allow for plotting or correlation functions that do not accept NaN. The set contains 1,165,301 records.

combined_data7 <- drop_na(combined_data6)    # clean up all rows with any NA and NaN.

summary(combined_data7)
##    grant_year   application_year   country          first_wipo_field_title
##  Min.   :2000   Min.   :1933     Length:1165301     Length:1165301        
##  1st Qu.:2010   1st Qu.:2004     Class :character   Class :character      
##  Median :2015   Median :2011     Mode  :character   Mode  :character      
##  Mean   :2013   Mean   :2010                                              
##  3rd Qu.:2020   3rd Qu.:2016                                              
##  Max.   :2020   Max.   :2020                                              
##  first_wipo_sector_title men_inventors    women_inventors   country_name      
##  Length:1165301          Min.   :  0.00   Min.   : 0.0000   Length:1165301    
##  Class :character        1st Qu.:  1.00   1st Qu.: 0.0000   Class :character  
##  Mode  :character        Median :  2.00   Median : 0.0000   Mode  :character  
##                          Mean   :  2.33   Mean   : 0.2819                     
##                          3rd Qu.:  3.00   3rd Qu.: 0.0000                     
##                          Max.   :111.00   Max.   :20.0000                     
##      code           income_class         population             gdp           
##  Length:1165301     Length:1165301     Min.   :8.286e+04   Min.   :0.000e+00  
##  Class :character   Class :character   1st Qu.:1.211e+08   1st Qu.:3.940e+12  
##  Mode  :character   Mode  :character   Median :2.822e+08   Median :1.025e+13  
##                                        Mean   :2.479e+08   Mean   :1.053e+13  
##                                        3rd Qu.:3.218e+08   3rd Qu.:1.830e+13  
##                                        Max.   :1.411e+09   Max.   :2.135e+13  
##   researchers     rd_spending_gdp    patent_apps       exports_gdp    
##  Min.   :  17.3   Min.   :0.01942   Min.   :      1   Min.   :  0.00  
##  1st Qu.:3550.1   1st Qu.:2.61983   1st Qu.: 164795   1st Qu.: 10.69  
##  Median :4464.1   Median :2.77328   Median : 241977   Median : 12.41  
##  Mean   :4368.2   Mean   :2.88831   Mean   : 235560   Mean   : 20.56  
##  3rd Qu.:5117.2   3rd Qu.:3.26458   3rd Qu.: 288335   3rd Qu.: 21.70  
##  Max.   :8620.0   Max.   :4.79571   Max.   :1344817   Max.   :225.16  
##  manufacture_gdp  pubs_millpop        trade_gdp        gdp_capita    
##  Min.   : 0.00   Min.   :   1.842   Min.   : 15.68   Min.   :     0  
##  1st Qu.:11.32   1st Qu.: 878.568   1st Qu.: 25.10   1st Qu.: 36330  
##  Median :12.98   Median :1337.213   Median : 28.22   Median : 44123  
##  Mean   :15.53   Mean   :1206.992   Mean   : 41.66   Mean   : 45493  
##  3rd Qu.:20.28   3rd Qu.:1380.029   3rd Qu.: 43.98   3rd Qu.: 56849  
##  Max.   :34.86   Max.   :2738.632   Max.   :420.43   Max.   :116860

e. Exploratory data analysis

What has been the frequency of patent_apps in 2000-2020 and the corresponding R&D spending – for the world

patent_apps_histg <- ggplot(combined_data7, aes(x = patent_apps)) +
  geom_histogram(binwidth = 30, fill = "lightgreen", color = "blue") +
  labs(title = "Patents 2000-2020, World", x = "Patents/year", y = "Frequency") +
  theme_minimal()

rd_spend_histg <- ggplot(combined_data7, aes(x = rd_spending_gdp)) +
  geom_histogram(binwidth = 0.03, fill = "yellow", color = "red") +
  labs(title = "R&D Spending 2000-2020, World", x = "R&D Spending, %GDP", y = "Frequency") +
  theme_minimal()

patent_apps_histg + rd_spend_histg

How do total patent applications and R&D spending evolve over time

combined_grp_yr <- combined_data7 |>
  group_by(grant_year) |>
  summarize(patent_tot = n(), rd_spending_avg = mean(rd_spending_gdp), manufacture_avg = mean(manufacture_gdp), exports_avg = mean(exports_gdp), trade_avg = mean(trade_gdp), pubs_mill_avg = mean(pubs_millpop), researchers_avg = mean(researchers), na.rm=TRUE)  |>
ungroup()

y_scale = 8*10^-6 # guessed scaling factor for right-y axis
ggplot(combined_grp_yr, aes(x=grant_year)) + 
# plot the total patent line
      geom_point(aes(y= patent_tot, color= "Patents")) + geom_line(linewidth=1, aes(y= patent_tot, color= "Patents"))+
# plot the R&D spending line
      geom_point(aes(y= rd_spending_avg/y_scale, color= "R&D")) + geom_line(linewidth=1, aes(y= rd_spending_avg/y_scale, color= "R&D"))+
# draw the y axes  
   scale_y_continuous(name = "Patents/year",
  sec.axis = sec_axis(~. * y_scale, name = "R&D Spending, %GDP")) +
   scale_x_continuous(name = "Year") +
  labs(title = "World Patents and R&D Spending in the 2000-2020 Period") +
  scale_color_manual(values = c("Patents" = "blue", "R&D" = "red")) +
  theme( axis.title.y.left = element_text(color = "blue"),
         axis.title.y.right = element_text(color = "red"),
    legend.position = "bottom",
    legend.title = element_blank())

explore the connected scatterplot of patents vs rd investment:

ggplot(combined_grp_yr, aes(x = rd_spending_avg, y = patent_tot, label = grant_year)) +
  geom_point() +
  geom_line() + # connect rd_spending_avg vs patent_tot in order of the 'year' variable
  geom_text(vjust = -1, hjust = 0.5) +  # add year labels
  labs(
    title = "Patents vs. R&D Investment",
    x = "R&D Investment (per year)",
    y = "Patents (per year)"
  ) +
  theme_bw()

explore plot of patents vs manufacture:

ggplot(combined_grp_yr, aes(x = manufacture_avg, y = patent_tot, label = grant_year)) +
  geom_point() +
  geom_line() + # line connecting y and x for each year group 
  geom_text(vjust = -1, hjust = 0.5) + 
  labs(
    title = "Patents vs. Manufaturing",
    x = "Manufacturing (fraction of GDP)",
    y = "Patents (per year)"
  ) +
  theme_bw()

explore the connected scatterplot of patents and exports:

ggplot(combined_grp_yr, aes(x = exports_avg, y = patent_tot, label = grant_year)) +
  geom_point() +
  geom_line() + # line connecting y and x for each year group 
  geom_text(vjust = -1, hjust = 0.5) + 
  labs(
    title = "Patents vs. Exports",
    x = "Exports (fraction of GDP)",
    y = "Patents (per year)"
  ) +
  theme_bw()

explore the connected scatterplot of patents and publications per million:

ggplot(combined_grp_yr, aes(x = pubs_mill_avg, y = patent_tot, label = grant_year)) +
  geom_point() +
  geom_line() + # line connecting y and x for each year group 
  geom_text(vjust = -1, hjust = 0.5) + 
  labs(
    title = "Patents vs. Publications",
    x = "Publications (per million people)",
    y = "Patents (per year)"
  ) +
  theme_bw()

explore the connected scatterplot of patents and researchers:

ggplot(combined_grp_yr, aes(x = researchers_avg, y = patent_tot, label = grant_year)) +
  geom_point() +
  geom_line() + # line connecting y and x for each year group 
  geom_text(vjust = -1, hjust = 0.5) + 
  labs(
    title = "Patents vs. Researchers",
    x = "Reserchers (per million people)",
    y = "Patents (per year)"
  ) +
  theme_bw()

Let’s see on a yearly basis by income_class and by gender

combined_grp_yr_class <- combined_data7 |>
  group_by(grant_year, income_class) |>
  summarize(malepatents = sum(men_inventors), femalepatents = sum(women_inventors), patent_tot = n(), na.rm=TRUE) |> ungroup()
## `summarise()` has grouped output by 'grant_year'. You can override using the
## `.groups` argument.
ggplot(combined_grp_yr_class, aes(x=grant_year, y=patent_tot, color=income_class)) + 
    geom_point() + geom_line(linewidth=1)+ 
    labs(title = "Total Patents per Year by Income Class",
    x = "Year", y = "Total Number of Patents",
    color = "Income Class") #+  theme_minimal() + scale_color_brewer(palette = "Set1")

Let’s transform the y axis so that lower income classes are obvious

ggplot(combined_grp_yr_class, aes(x=grant_year, y=patent_tot, color=income_class)) + 
   geom_point() +  geom_line(linewidth=1)+ 
    labs(title = "Total Patents per Year by Income Class",
    x = "Year", y = "Total Number of Patents",
    color = "Income Class")  + scale_y_log10()

Let’s see how gender has faired over the 2 decades

y_scale= 3.5 # guess the scaling factor for right-y

ggplot(combined_grp_yr_class, aes(x=factor(grant_year))) +
  geom_col(aes(y = malepatents, fill = "Male"), position = "dodge", width = 0.4) +
  geom_col(aes(y = femalepatents * y_scale, fill = "Female"), position = position_dodge(width = 0.4), width = 0.4) +
  scale_y_continuous(name = "Male coauthors", sec.axis = sec_axis(~ . / y_scale, name = "Female coauthors")) +
  scale_fill_manual(name = "Patent authorship",
    values = c("Male" = "blue", "Female" = "red")) +
  labs(title = "Male and Female applications in the 2000-2020 Period", x="Year")

Correlations among the variables

To examine the pairwise correlations among the variables, we use the cor() function.

rel_matrix <- cor( combined_data7 |> select(gdp, gdp_capita, rd_spending_gdp, researchers, exports_gdp, manufacture_gdp, trade_gdp, pubs_millpop, patent_apps))
# rel_matrix   #uncomment to see the matrix
 # allow plotting in the margin 
corrplot(rel_matrix, method = "color", type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 40, addCoef.col = "black", col = brewer.pal(n = 8, name = "PiYG"),
        title="\nCorrelations")

Conclusion

The correlations among the various variables indicate a complex relationship that can only be better understood utilizing multilinear regression. A logical dependent variable is GDP, as an end point of increased scientific, technological, manufacturing and trade activities of a country.

The logical progression is from increased promotion of innovation (R&D investment) towards scientific and technological capability demonstrated by researchers and scientific publications, towards increased patent applications, towards tangible outcomes (represented by manufacturing, trade and exports), towards increased GDP.

Thus in the following analysis, I will focus on GDP as the dependent variable.

f. Statistical Analysis

Multiple linear regression model:

mlrmodel <- lm (gdp ~ rd_spending_gdp + researchers + exports_gdp + manufacture_gdp + trade_gdp + pubs_millpop + gdp_capita + patent_apps, data= combined_data7) # by default ignores NAs, but no NA present in current data
summary(mlrmodel)
## 
## Call:
## lm(formula = gdp ~ rd_spending_gdp + researchers + exports_gdp + 
##     manufacture_gdp + trade_gdp + pubs_millpop + gdp_capita + 
##     patent_apps, data = combined_data7)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.496e+13 -1.237e+12  3.595e+11  5.758e+11  1.804e+13 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -7.460e+11  2.355e+10  -31.67   <2e-16 ***
## rd_spending_gdp  6.610e+12  7.395e+09  893.91   <2e-16 ***
## researchers     -3.200e+09  3.608e+06 -886.85   <2e-16 ***
## exports_gdp     -8.950e+11  2.540e+09 -352.31   <2e-16 ***
## manufacture_gdp -4.092e+11  8.909e+08 -459.26   <2e-16 ***
## trade_gdp        4.832e+11  1.391e+09  347.28   <2e-16 ***
## pubs_millpop    -1.926e+09  1.036e+07 -185.92   <2e-16 ***
## gdp_capita       2.271e+08  2.832e+05  801.88   <2e-16 ***
## patent_apps      1.182e+07  1.399e+04  844.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.198e+12 on 1165292 degrees of freedom
## Multiple R-squared:  0.912,  Adjusted R-squared:  0.912 
## F-statistic: 1.509e+06 on 8 and 1165292 DF,  p-value: < 2.2e-16

Conclusion

The p-value less than 0.05, and close to zero for the intercept, each coefficient and the entire model means that the model is statistically significant and that each independent variable is a meaningful predictor of the dependent variable. Thus we can reject the null hypothesis that the predictors have no effect. The large F value also indicates the overall goodness of fit. The adjusted R-square value of 91% means that the model explains 91% of the variance of the dependent variable. The residuals seem to follow a normal distribution.

The negative intercept shows the prediction with all the independent variables set to zero, and has only mathematical meaning. The regression coefficients show how each independent variable affects the dependent one. For example, A unit change of rd_spending (1% of GDP) increases GDP by 6.6x10^12 dollars. In the context of this project, such has represented a very meaningful return for the investment for the average country in the 2000-2020 period. The total GDP in that period has averaged 10^13 dollars, so the return has been 6.6x1012/1011=$66 per dollar invested, i.e., a profit of $65 per dollar, or 6500% ROI.

But before we get too excited, let’s examine how accurate this linear model is.

Are model assumptions correct?

We will visually check for linearity of the predicted GDP with respect to some of the independent variables, expecting the data points to appear to follow a linear trend. Warning: however, I expect the predicted line (BLUE line) produced by lm() in the ggplot to be correct only with respect to slope, but not in the y-values, because it assumes that only one independent variable is varying while the others are held constant. But it serves to illustrate the point.

Linearity of GDP against Patents:

combined_data7 |>
ggplot( aes (y = gdp, x = rd_spending_gdp))+
geom_point() + geom_smooth(method = 'lm', se = FALSE) +
labs (title = 'GDP against R&D Investment')
## `geom_smooth()` using formula = 'y ~ x'

combined_data7 |>
ggplot( aes (y = gdp, x = patent_apps))+
geom_point() + geom_smooth(method = 'lm', se = FALSE) +
labs (title = 'GDP against Patents')
## `geom_smooth()` using formula = 'y ~ x'

code below can be run by removing the comments

```{r}

combined_data7 |> ggplot( aes (y = gdp, x = researchers))+ geom_point() + geom_smooth(method = ‘lm’, se = FALSE) + labs (title = ‘GDP against Active Researchers’)

combined_data7 |> ggplot( aes (y = gdp, x = trade_gdp))+ geom_point() + geom_smooth(method = ‘lm’, se = FALSE) + labs (title = ‘GDP against Trade’)

combined_data7 |> ggplot( aes (y = gdp, x = exports_gdp))+ geom_point() + geom_smooth(method = ‘lm’, se = FALSE) + labs (title = ‘GDP against Exports’)

combined_data7 |> ggplot( aes (y = gdp, x = pubs_millpop))+ geom_point() + geom_smooth(method = ‘lm’, se = FALSE) + labs (title = ‘GDP against Publications per million pop.’)

combined_data7 |> ggplot( aes (y = gdp, x = manufacture_gdp))+ geom_point() + geom_smooth(method = ‘lm’, se = FALSE) + labs (title = ‘GDP against Manufacturing’)

combined_data7 |> ggplot( aes (y = gdp, x = gdp_capita))+ geom_point() + geom_smooth(method = ‘lm’, se = FALSE) + labs (title = ‘GDP against GDP/capita’)

#```

Conclusions

Although the R-squared correlation factor of 91% for the GDP multinear model is high, visually one can appeciate that the assumptions of normal distribution and linearity of GDP with respect to most of the variables are not well satisfied. In general, one can observe a strong right skew, with clumping of the data towards the smaller GDPs.

Modified model

Thus, we proceed to transform some of the variables to amplify the influence of the smaller GDP over the model. To do that, we do a log transformation of GDP. Upon some test runs, I saw that doing log transformation of some of the independent variables also improves the model, increasing the R-squared from 86% with the original variable values, to 96% when we use the log(patent applications) and log(R&D spending). We did not proceed to transform the other independent variables at this point. This is just to illustrate the point that the model can be improved.

#  Convert to log(gdp), log(patent_apps), log(rd_spending_gdp)

combined_data7$gdp_log <- log(combined_data7$gdp)   # transform gdp
combined_data7$patent_log <- log(combined_data7$patent_apps) # transform patent_apps
combined_data7$rd_spending_log <- log(combined_data7$rd_spending_gdp) # transform rd_spending_gdp

#  The log transformations introduce infinite values (log of very small numbers) and prevents lm() from running. Thus I had to convert the infinites to NA, so that lm() can ignore them and run ok.

combined_data7$gdp_log[is.infinite(combined_data7$gdp_log)] <- NA   
combined_data7$patent_log[is.infinite(combined_data7$patent_log)] <- NA
combined_data7$rd_spending_log[is.infinite(combined_data7$rd_spending_log)] <- NA

Now do the new multilinear regression of the modified model

mlrmodel_log <- lm (gdp_log ~  patent_log + rd_spending_log + researchers + exports_gdp + manufacture_gdp + trade_gdp + pubs_millpop + gdp_capita, data= combined_data7) # by default ignores NAs
summary(mlrmodel_log)
## 
## Call:
## lm(formula = gdp_log ~ patent_log + rd_spending_log + researchers + 
##     exports_gdp + manufacture_gdp + trade_gdp + pubs_millpop + 
##     gdp_capita, data = combined_data7)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7972 -0.0666  0.0206  0.0696  2.1476 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.351e+01  4.023e-03 5843.69   <2e-16 ***
## patent_log       5.720e-01  3.179e-04 1799.34   <2e-16 ***
## rd_spending_log  7.903e-02  2.066e-03   38.25   <2e-16 ***
## researchers     -2.486e-04  3.271e-07 -760.12   <2e-16 ***
## exports_gdp     -1.654e-02  2.645e-04  -62.54   <2e-16 ***
## manufacture_gdp -5.171e-02  9.566e-05 -540.62   <2e-16 ***
## trade_gdp        8.194e-03  1.440e-04   56.92   <2e-16 ***
## pubs_millpop     2.230e-04  1.206e-06  184.85   <2e-16 ***
## gdp_capita       1.837e-05  3.086e-08  595.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.23 on 1165281 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.9602, Adjusted R-squared:  0.9602 
## F-statistic: 3.516e+06 on 8 and 1165281 DF,  p-value: < 2.2e-16

Conclusion

The new model displays very high R-squared value of 96% and rather normally distributed residuals. The p values much smaller than 0.05 for all of the regression coefficients, the intercept, and the model, indicates that they are statistically significant with higher than 95% confidence. The corresponding F-value is also huge,in accordance. The R-squared means that the model explains 96% of the variance of the log of GDP.

It would be interesting to calculate the ROI of GDP allocated to R&D predicted by this better model. Using the elasticity concept, a 1% increase in rd_spending results in a 0.079% increase in gdp. Since the average GDP is about $10^13, 1% invested is $10^11. The return is about $7.9 per dollar invested. The ROI is about 690%.

But we must check the assumptions behind the linear model. A visual examination of fit is a very fast approach.

Plot the log (GDP) vs log (Patents) curve

combined_data7 |>
ggplot( aes (y = gdp_log, x = patent_log))+
geom_point() + geom_smooth(method = 'lm', se = TRUE, aes(group = 1)) +
labs (title = 'log(GDP) against log(Patent Applications)')
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_point()`).

This result is much more pleasing visually. The points show more balanced distribution around a straight line. Again, the slope is the significant aspect, since the y-axis value is approximate because ggplot lm() is predicting a line from a single variable (patent_log) while keeping the other independent variables fixed.

Plot the log (GDP) vs log (R&D investment) curve

combined_data7 |>
ggplot( aes (y = gdp_log, x = rd_spending_log))+
geom_point() + geom_smooth(method = 'lm') +
labs (title = 'log(GDP) against log(R&D Spending)',
      x = "log(R&D Spending)", y = "log(GDP)")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_point()`).

The result is more visually pleasing than the original model. The points in the log model follow better a straight line (with large residuals around the mean values).

Retrieve back the predicted values of GDP from the log-log model and plot them. Note: here we are not using ggplot lm() to calculate the straight line. But we are using the log-log model to calculate the prediction line which is no longer a straight line in the original units (GDP in $, patents per year, R&D as %GDP).

combined_data7$predicted_log_gdp <- predict(mlrmodel_log, newdata = combined_data7)  # predicted log(gdp)

combined_data7$predicted_gdp <- exp(combined_data7$predicted_log_gdp)  # predicted gdp


# Plot again

combined_data7 |>
ggplot( aes (y = gdp, x = patent_apps))+
geom_point() +     # original points
geom_line (data= combined_data7, aes(y= predicted_gdp, x= patent_apps, color = "Predicted GDP")) +
      scale_color_manual(name = "Legend", values = c("Predicted GDP" = "red")) +  
labs (title = 'GDP against Patent Applications per year',
      x = "Patent Applications", y = "GDP, U$S") +
theme( plot.background = element_rect(fill = "lightyellow"), # Outer plot background
       panel.background = element_rect(fill = "lightblue")) # Inner panel background)

combined_data7 |>
ggplot( aes (y = gdp, x = rd_spending_gdp))+
geom_point() +     # original points
geom_line (data= combined_data7, aes(y= predicted_gdp, x= rd_spending_gdp, color = "Predicted GDP")) +
      scale_color_manual(name = "Legend", values = c("Predicted GDP" = "purple")) +  
labs (title = 'GDP against R&D Spending',
      x = "R&D Spending, %GDP", y = "GDP, U$S") +
theme( plot.background = element_rect(fill = "lightblue"), # Outer plot background
       panel.background = element_rect(fill = "lightyellow")) # Inner panel background)

Conclusion

The predictions (red and purple line) are amazingly close to the original data points.

With such good results, it is interesting to see whether the model can now incorporate categorical variables.

Adding income-level as a factor to the model

The reference level is “Low” income countries; the 4 levels are: Low, Lower-middle, Upper-middle, and High

combined_data7$income_class <- factor (combined_data7$income_class,
             levels= c("Low-income countries","Lower-middle-income countries","Upper-middle-income countries","High-income countries"),
             labels= c("Low", "Lower-middle", "Upper-middle", "High"))

mlrmodel_log <- lm (gdp_log ~  patent_log + rd_spending_log + income_class + researchers + exports_gdp + manufacture_gdp + trade_gdp + pubs_millpop + gdp_capita, data= combined_data7) # by default ignores NAs

summary(mlrmodel_log)
## 
## Call:
## lm(formula = gdp_log ~ patent_log + rd_spending_log + income_class + 
##     researchers + exports_gdp + manufacture_gdp + trade_gdp + 
##     pubs_millpop + gdp_capita, data = combined_data7)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6158 -0.0659  0.0233  0.0668  2.1689 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               2.322e+01  1.184e-02 1962.02   <2e-16 ***
## patent_log                5.689e-01  3.199e-04 1778.20   <2e-16 ***
## rd_spending_log           4.993e-02  2.120e-03   23.55   <2e-16 ***
## income_classLower-middle  2.699e-01  1.190e-02   22.67   <2e-16 ***
## income_classUpper-middle  4.028e-01  1.149e-02   35.07   <2e-16 ***
## income_classHigh          2.629e-01  1.143e-02   23.01   <2e-16 ***
## researchers              -2.328e-04  3.760e-07 -619.09   <2e-16 ***
## exports_gdp              -2.284e-02  2.749e-04  -83.06   <2e-16 ***
## manufacture_gdp          -5.229e-02  9.593e-05 -545.07   <2e-16 ***
## trade_gdp                 1.144e-02  1.491e-04   76.76   <2e-16 ***
## pubs_millpop              2.091e-04  1.214e-06  172.18   <2e-16 ***
## gdp_capita                1.913e-05  3.215e-08  594.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2292 on 1165278 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.9605, Adjusted R-squared:  0.9605 
## F-statistic: 2.575e+06 on 11 and 1165278 DF,  p-value: < 2.2e-16

The results are excellent still. All p values are almost zero, the F is very high, the residual distribution seems almost normal around 0. The goodness of fit (R-squared value) is 0.96, meaning that the model describes 96% of the variance of the log GDP. It is interesting to note that among the 4 income classes, the Upper-middle class has the largest effect (slope=0.4028) on the growth of log(GDP). The reference is Low income class. Researchers, Exports and Manufacture have negative slopes, effects that come from the actual data also seen in the linear model of gdp. We must investigate these negative effects in more depth to understand them.

Visualize the effect of income_class of countries on the effect of patents on GDP

combined_data7 |>
ggplot( aes (y = gdp_log, x = patent_log, color= income_class))+
geom_point() + geom_smooth(method = 'lm',se = FALSE) +
# scale_color_manual(name= "Income Class", values = c("Low" = #"red","Lower-Middle"="orange","Upper-Middle"="green","High" = "blue"),
#labels = c("Low", "Lower-Middle", "Upper-Middle","High")) +
labs (title = 'log(GDP) against log(Patent Applications)',
      x = "log(Patent Applications)", y = "log(GDP)",
      color = "Income Class") +
theme( plot.background = element_rect(fill = "lightblue"), # Outer plot background
       panel.background = element_rect(fill = "lightyellow")) # Inner panel background)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_point()`).

Result is pleasing. The 4 classes of income follow similar trends, with the dispersion or variance around predicted lines roughly balanced; the low-income countries show the largest variance. The high and middle-high countries file the most patents, as incentives (%GDP allocated to R&D) is higher than in middle-low and low countries. But let’s examine that point further now.

Visualize the effect of income_class of countries on the effect of R&D funding on GDP

combined_data7 |>
ggplot( aes (y = gdp_log, x = rd_spending_log, color= income_class))+
geom_point() + geom_smooth(method = 'lm', se = FALSE) +
labs (title = 'log(GDP) against log(R&D spending)',
      x = "log(R&D spending)", y = "log(GDP)",
      color = "Income Class") +
theme( plot.background = element_rect(fill = "lightblue"), # Outer plot background
       panel.background = element_rect(fill = "lightyellow")) # Inner panel background)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_point()`).

The result is not as pleasing in terms of visual distribution of points but the slopes are meaningful. Interestingly they show that Upper-middle income countries GDP appear to benefit the most from investment, although the High income countries allocate the highest percentages of GDP to R&D.

Are the statistical assumptions satisfied by the log model?

Let’s now examine some statistical aspects to ascertain the validity of the multilinear log models. We’ll examine normality and independence of residuals, homoscedasticity, and multicollinearity of the independent variables. I’ll just use simple visual examination.

Residuals: independence and normality

# do a Q-Q plot to visualize normality

        plot(mlrmodel_log, which = 2) # to generate the Normal Q-Q plot

The residuals fall along the diagonal from -2 to 3 Q, but the deviations from the diagonal line at the tails imply non-normality. To reassess, try histogram:

hist(residuals(mlrmodel_log))

This looks a lot better, not exactly symmetrical but left skewed.

Homoscedasticity: let’s check for residual independence, by plotting residuals against fitted gpd_avg plot

        plot(mlrmodel_log, which = 1) # this generates the residuals vs fitted 

This looks like residuals are independent without specific pattern. The spread seems uniform across the range of fitted values, i.e., shows homoscedasticity. But let’s recheck with a plot of residuals vs order by year.

plot(1:length(residuals(mlrmodel_log)), residuals(mlrmodel_log), xlab = "Order, by year", ylab = "Residuals")

The 1.2M data points result in a “blob” but no specific symmetric pattern (no autocorrelation) that I can see, except for some “streams”.

Multicollinearity of the independent variables

let’s use again the cor() function but with the independent variables that include two of the log-transformed variables.

cor_matrix <- cor( combined_data7 |> select(rd_spending_log,patent_log,researchers,manufacture_gdp,trade_gdp,pubs_millpop,exports_gdp))
cor_matrix
##                 rd_spending_log patent_log researchers manufacture_gdp
## rd_spending_log       1.0000000  0.4816980   0.6548849       0.1430838
## patent_log            0.4816980  1.0000000  -0.1323853       0.1203137
## researchers           0.6548849 -0.1323853   1.0000000       0.2846661
## manufacture_gdp       0.1430838  0.1203137   0.2846661       1.0000000
## trade_gdp            -0.2262144 -0.7310600   0.2835756       0.2264085
## pubs_millpop          0.1430034 -0.4718066   0.2801434      -0.5530510
## exports_gdp          -0.2121138 -0.7185020   0.2991440       0.2646476
##                  trade_gdp pubs_millpop exports_gdp
## rd_spending_log -0.2262144    0.1430034  -0.2121138
## patent_log      -0.7310600   -0.4718066  -0.7185020
## researchers      0.2835756    0.2801434   0.2991440
## manufacture_gdp  0.2264085   -0.5530510   0.2646476
## trade_gdp        1.0000000    0.3341954   0.9981307
## pubs_millpop     0.3341954    1.0000000   0.3099796
## exports_gdp      0.9981307    0.3099796   1.0000000
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 40, addCoef.col = "black", col = brewer.pal(n = 8, name = "PiYG"),
         title = "/nCorrelations")

Conclusion

The deep green diagonal is the expected autocorrelation. The green in trade and export (1.0), researchers and rd_spending (.65) and patents and rd_spending (.48) show collinearity among those pairs; these are expected. The dark purple shows anticorrelation between manufacture and publications (0.55), patents and exports (.72), and patents and trade (.73). These cases of multicollinearity are a concern for the robustness of the model.

g.Overall Conclusions

The initial goal of the project was to investigate the assumption that R&D investment by nations leads to increased production of intellectual property patents, which lastly impact economic growth as measured by GDP. As the project progressed, other assumptions were made, with respect to the effect of gender on patent production, and the influence of income status of nations on their ability to produce patents and thus increase their GDP.

In our mental model, the basic framework supporting patent (IP) production has three legs: (1) a strong investment in scientific or technical R&D promotion (measured by GDP% invested in R&D, number of researchers and number of scientific publications), (2) manufacturing capability to enable technicians and scientists to test, develop and commercialize their concepts (measured by manufacturing as fraction of GDP), and (3) a strong legal, IP protection, and commercialization policy (measure by exports and trade as fractions of GDP).

The initial linear model related GDP to the various variables yielded interesting and statistically significant results supporting the alternative hypothesis that increased investment in R&D leads to GDP growth (many-fold ROI), and that increase in patent applications leads to GDP growth. So does supporting trade. On the other hand, the computed model predicts negative effects on GDP from the other variables (export, manufacturing, researchers and publications.) This goes against our assumed patent-supporting ecosystem. Further statistical examination of the linear model shows that despite having an excellent R-squared of 91%, the basic assumption of linearity and normality of the prediction and the residual distribution cannot be proven, withdrawing confidence from the predicted outcome.

A modified multilinear model was constructed, relating log-transformed GDP (dependent variable) to log-transformed R&D investments, log-transformed patents, and the other linear independent variables. This model is also statistically significant and with goodness of fit R-square of 96%. Moreover it displays linearity and normality of predictions and homoscedasticity of residuals much better than the original model. The prediction again shows that increased R&D investment leads to GDP growth, with a ROI approximating $690/$1 invested. Increased patent applications, trade and scientific publications also lead to GDP growth. However, this model still predicts negative effects on GDP from increases in researchers, exports, and manufacturing, which defies a simple explanation. Nevertheless, the model shows that countries with Middle-high income contribute to GDP growth at the fastest rate compared to Low, Middle-high, and High income countries. The model didn’t include gender as a variable, but in the exploratory analysis, it was shown that the female/male-inventors ratio is about 25% and increasing faster in recent years.

e.Obstacles

The major obstacle in this project was compiling the dataset because it required multiple sources, followed by extensive data cleaning and wrangling in order to arrive at a database wihout NA without introduction of bias. The final dataset covers 20 years and contains 1.2M records. This also originated complications with crashes when R was computing some of the graphics.

The modeling itself didn’t present many obstacles (other than plenty of time dedicated to it). In the end, as said above, the modeling can be improved with further effort.

h.Future steps

After examining the collinearities, I can improve the log(GDP) model by removing one or two of these independent variables: trade and researchers. Trade in practice includes both exports and imports. For the purpose of this project, productivity is better measured by exports than imports. Thus, trade is the logical variable to be dropped. Researchers is definitely linked to rd_spending, so researchers could be dropped. The largest concern is the unexpected anti correlation between patents and exports and trade as well as anticorrelation between manufacture and publications. I ran out of time to investigate how to explain these anticorrelations and then logically improve the model.

One other aspect left for future work is possibly most valuable, which is to investigate the type or class of patents that provide the largest returns for countries. In this dataset I compiled data on WIPO field and WIPO sector which are high level classifications of patents. Initially one could analyze the impact of these classes on either GDP or exports or manufacture. Further work for the future is to include in this dataset finer details on patents available in USPTO, EPO, WIPO and other sources.

i.References

j. Published