Aims and Motivation

Expenditure on public education is a subject of debate in many places around the world, and is prioritized in different ways. The usefulness or otherwise of this expenditure is one of the cornerstones of this debate. In the United States, much of the debate is centered on whether that expenditure is better utilized when allocated to the private sector.

In today’s world the importance of so-called intellectual capital, which consists largely of technological research and development, cannot be overstated. In technology, there is a worldwide race among nations to establish a dominant position in research and applications in emerging fields such as artificial intelligence, autonomous systems and alternative energy. The filing of patents in technical fields is naturally perceived to be a key indicator of technological prowess. Most economic decision-makers would probably view the filing of large numbers of patents by their private sector companies and universities as a desirable outcome of increasing expenditure on public education, if such an outcome could be obtained.

The objective of this study is to investigate whether a correlation exists between the amount of per-capita expenditure that a country allocates to public education and the number of patents filed per capita by that country. For this, data from several publicly available data sources are used, along with regression modeling techniques.

This project focuses on the relationship between a country’s per-capita investment in public higher education and its number of patents filed per capita, and examines whether the data show the two to be correlated.

Depending on its results, this study could be used to support making greater investment in public education in the United States.

A linear regression model is used to determine correlation among the variables.

Data sources:

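The following data sources were used:

  1. Wikipedia, for the ranking of countries by patent applications per million population and for population estimates by country.
  2. The World Intellectual Property Organization (WIPO), for patent data by country.
  3. The World Bank Group, for public education expenditure as a percentage of GDP.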

# Libraries for data manipulations, analysis and web scraping.
library(dplyr)
library(ggplot2)
library(knitr)
library(RCurl)
library(rvest)
library(stringr)
library(tidyr)
library(utils)
library(XML)
library(stats)
library(grid)
library(gridExtra)

Step 1: Select the countries to include in this study.

For this, we select the top 6 countries with the highest number of patent applications per million of population. The source of this information is Wikipedia: https://en.wikipedia.org/wiki/World_Intellectual_Property_Indicators

wiki_url = "https://en.wikipedia.org/wiki/World_Intellectual_Property_Indicators"
temp1 <- wiki_url %>% read_html %>% html_nodes("table")

# List of the countries with highest patent applications for 2012
# Patent applications per million population for the top 10 origins, 2012
clist = html_table(temp1[6])
Countries = data.frame(clist)
Countries = Countries[1:6,]
# Countries = Countries %>% select(Country)
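kable(Countries)

| Rank | Country       | Patent.applications.per.million.population |
|-----:|:--------------|:-------------------------------------------|
|    1 | South Korea   | 2,962                                      |
|    2 | Japan         | 2,250                                      |
|    3 | Switzerland   | 1,013                                      |
|    4 | Germany       | 902                                        |
|    5 | United States | 856                                        |
|    6 | Finland       | 665                                        |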
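
The scraped rates arrive as text with thousands separators ("2,962"), so they cannot be sorted or used in arithmetic until they are converted. A minimal sketch of that conversion, using the values from the table above; the data frame name `rates` and the column name `Per.Million` are chosen here for illustration and are not part of the original analysis:

```r
# Values copied from the table above; the rates are character strings
# because of the thousands separators.
rates <- data.frame(
  Country = c("South Korea", "Japan", "Switzerland",
              "Germany", "United States", "Finland"),
  Per.Million = c("2,962", "2,250", "1,013", "902", "856", "665"),
  stringsAsFactors = FALSE
)

# Strip the commas, then convert to integer.
rates$Per.Million <- as.integer(gsub(",", "", rates$Per.Million, fixed = TRUE))

# The column is now numeric and can be ordered or summarized.
print(rates[order(-rates$Per.Million), ])
```

The same `gsub()` and `as.integer()` pattern is applied to the WIPO and population tables later in this report.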

Patents by Country.

This data is obtained from the World Intellectual Property Organization (WIPO). The country profiles are at: http://www.wipo.int/ipstats/en/statistics/country_profile/

Ccodes = list("South Korea", "Japan", "Switzerland", "Germany", "United States", "Finland")
Ccodes2 = list("KR", "JP", "CH", "DE", "US", "FI")
Ccodes3 = list("KOR", "JPN", "CHE", "DEU", "USA", "FIN")
ccount = length(Ccodes)
wipo_url = "http://www.wipo.int/ipstats/en/statistics/country_profile/profile.jsp?code="

# Patent applications by year, one data frame per two-letter country code.
Patents = list()
for (ccode in Ccodes2) {
    url2 = paste(wipo_url, ccode, sep="")
    temp2 <- url2 %>% read_html %>% html_nodes("table")
    grants = html_table(temp2[5])
    df = data.frame(grants)
    df = df[-c(1),] %>% select(-X4)
    df$Applications = as.integer(gsub(",", "", df$X2)) + as.integer(gsub(",", "", df$X3))
    df = df %>% select(-c(X2, X3))
    
    # Drop the first 3 rows, 2007-2009.
    df = df[4:10,]
    names(df) = c("Year", "Applications")
    Patents[[ccode]] = df
}

Display charts of the raw patent counts (not adjusted for population).

par(mfrow=c(2, 3))
for (i in 1:ccount) {
    df = data.frame(Patents[i])
    names(df) = c("Year", "Total Applications")
    plot(df, main=Ccodes[[i]])
}

Public Education Expenditures

Now we read data on public education expenditures. The data source is the World Bank Group. The data is available as a zipped CSV file downloaded from https://data.worldbank.org/indicator/SE.XPD.TOTL.GD.ZS

Education expenditure is shown as a percentage of GDP.

par(mfrow=c(1,1))
newdf = data.frame()
Expenditures = data.frame()
filename = "API_SE.XPD.TOTL.GD.ZS_DS2_en_csv_v2_9908693.csv"
expend2 <- read.csv(filename, skip=4)

for (cty in Ccodes3) {
    newdf = filter(expend2, Country.Code == cty)
    newdf = newdf %>% select(Country.Code, X2010:X2016)
    Expenditures = rbind(Expenditures, newdf)
}

# Education expenditure is shown as a percentage of GDP.
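kable(Expenditures)

| Country.Code |   X2010 |   X2011 |   X2012 |   X2013 |   X2014 |   X2015 | X2016 |
|:-------------|--------:|--------:|--------:|--------:|--------:|--------:|:------|
| KOR          |      NA |      NA | 4.61823 | 4.93070 | 5.05672 | 5.06544 | NA    |
| JPN          | 3.63950 | 3.64258 | 3.69226 | 3.66538 | 3.59059 |      NA | NA    |
| CHE          | 4.92605 | 4.96986 | 5.03337 | 5.04048 | 5.05123 |      NA | NA    |
| DEU          | 4.91368 | 4.80780 | 4.93331 | 4.93497 | 4.93113 |      NA | NA    |
| USA          | 5.42001 | 5.22390 | 5.19486 | 4.94379 | 4.98948 |      NA | NA    |
| FIN          | 6.54071 | 6.48201 | 7.19254 | 7.15848 | 7.15156 |      NA | NA    |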
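
Because of the missing values, each country contributes a different number of usable years to any later model. A small sketch in base R that counts them, using the values from the table above; the names `expend` and `Years.Available` are introduced here for illustration only:

```r
# Expenditure values (% of GDP) copied from the table above.
expend <- data.frame(
  Country.Code = c("KOR", "JPN", "CHE", "DEU", "USA", "FIN"),
  X2010 = c(NA, 3.63950, 4.92605, 4.91368, 5.42001, 6.54071),
  X2011 = c(NA, 3.64258, 4.96986, 4.80780, 5.22390, 6.48201),
  X2012 = c(4.61823, 3.69226, 5.03337, 4.93331, 5.19486, 7.19254),
  X2013 = c(4.93070, 3.66538, 5.04048, 4.93497, 4.94379, 7.15848),
  X2014 = c(5.05672, 3.59059, 5.05123, 4.93113, 4.98948, 7.15156),
  X2015 = c(5.06544, NA, NA, NA, NA, NA),
  X2016 = c(NA, NA, NA, NA, NA, NA)
)

# Select the year columns and count the non-missing entries in each row.
year_cols <- grep("^X[0-9]{4}$", names(expend), value = TRUE)
expend$Years.Available <- rowSums(!is.na(expend[year_cols]))

print(expend[c("Country.Code", "Years.Available")])
```

These counts are the number of observations that `lm()` can use for each country, since rows with a missing predictor are dropped.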

Population Growth Data

Next we read population data for the selected countries. This data is required to calculate the per-capita patent applications filed. Without it we would be using raw patent counts, which are not comparable between countries with large and small populations. This information is obtained from Wikipedia.

wiki_url3 = "https://en.wikipedia.org/wiki/List_of_countries_by_past_population_(United_Nations,_estimates)"
temp3 <- wiki_url3 %>% read_html %>% html_nodes("table")

# Country Population data by Year.
clist = html_table(temp3[1])
pop.all = data.frame(clist)
names(pop.all)[1] = "Country"
pop.all = pop.all %>% select(Country, X2010, X2015)
Populations = data.frame()

for (cy in Ccodes) {
    Populations = rbind(Populations, subset(pop.all, pop.all$Country == cy))
}

Populations$Mean = (as.integer(gsub(",", "", Populations$X2010)) +
                    as.integer(gsub(",", "", Populations$X2015)))/2
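kable(Populations)

|     | Country       | X2010   | X2015   |     Mean |
|:----|:--------------|:--------|:--------|---------:|
| 193 | South Korea   | 49,090  | 50,293  |  49691.5 |
| 104 | Japan         | 127,320 | 126,573 | 126946.5 |
| 201 | Switzerland   | 7,831   | 8,299   |   8065.0 |
| 77  | Germany       | 80,435  | 80,689  |  80562.0 |
| 220 | United States | 309,876 | 321,774 | 315825.0 |
| 71  | Finland       | 5,368   | 5,503   |   5435.5 |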
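
The normalization described above amounts to dividing a count by a population. A minimal sketch, assuming populations are given in thousands as in the table above; the function name `per_million` and the application count are hypothetical and serve only as an illustration:

```r
# Convert a raw count to a rate per million inhabitants.
# pop_thousands: population in thousands, as in the table above.
per_million <- function(count, pop_thousands) {
  count / (pop_thousands / 1000)
}

# Mean of the 2010 and 2015 population of South Korea, in thousands
# (taken from the table above).
kor_pop <- 49691.5

# Hypothetical application count, for illustration only.
example_count <- 150000

print(per_million(example_count, kor_pop))
```

Dividing by the population in thousands without the extra factor gives a rate per 1,000 inhabitants instead, which is the scale that results when the counts are divided directly by the `Mean` column.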

Data Visualization

# Add Patents columns to expenditures
Spending = list()

patents = data.frame()
i = 1
for (cty in Ccodes2) {
    cty2 = unlist(Ccodes2[i])
    cty3 = unlist(Ccodes3[i])
    
    # Divide patent counts by mean population to get per-capita patents.
    # Populations are in thousands, so the result is applications per 1,000 inhabitants.
    Patents[[cty2]]$Applications = Patents[[cty2]]$Applications / Populations[i,]$Mean
    df = Patents[[cty2]]
    
    spending = filter(Expenditures, Country.Code == cty3)
    # Rename the year columns X2010..X2016 to plain years, then reshape to long form.
    names(spending) = c("Country.Code", as.character(2010:2016))
    spending = spending %>%
               gather(Year, Spent, -Country.Code) %>%
               select(-Country.Code)
    Spending[[cty2]] = spending
    i = i + 1
}


# Make a combined Expenditure vs Patents-per-capita DF.
Models = list()
for (i in 1:ccount) {
    cty = unlist(Ccodes3[i])
    df = inner_join(Spending[[i]], Patents[[i]], by = "Year")
    df$Year = as.numeric(df$Year)
    Models[[cty]] = df
}

Display the charts of Patents vs Educational Expenditure for all countries.

for (i in 1:ccount) {
    cty3 = unlist(Ccodes3[i])
    cty = unlist(Ccodes[i])
    df = Models[[cty3]]
    
    p1 = ggplot(data=df, aes(x=Year)) +
         geom_line(aes(y=Spent)) +
         ggtitle(paste(cty, ": Public Education", sep="")) +
         ylab("Public Education Expenditure / GDP")
    p2 = ggplot(data=df, aes(x=Year)) +
         geom_line(aes(y=Applications)) + 
         ggtitle(paste("Patent Applications per Capita")) +
         ylab("Patents per capita")
    grid.arrange(p1, p2, ncol=2)
}

Regression Model

There seems to be no particular pattern in the charts showing Expenditures (as percentage of GDP) against Patent Applications per capita. A linear regression model was attempted for a few countries in the data set.

kor = Models[["KOR"]]
usa = Models[["USA"]]

m.kor = lm(Applications ~ Spent, data = kor)
summary(m.kor)
## 
## Call:
## lm(formula = Applications ~ Spent, data = kor)
## 
## Residuals:
##         3         4         5         6 
##  0.003296 -0.009210 -0.026698  0.032613 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -1.04829    0.41597   -2.52  0.12793   
## Spent        1.04948    0.08453   12.42  0.00642 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03059 on 2 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.9872, Adjusted R-squared:  0.9808 
## F-statistic: 154.2 on 1 and 2 DF,  p-value: 0.006425

m.usa = lm(Applications ~ Spent, data = usa)
summary(m.usa)
## 
## Call:
## lm(formula = Applications ~ Spent, data = usa)
## 
## Residuals:
##        1        2        3        4        5 
##  0.01303 -0.06461  0.04186 -0.02050  0.03021 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   4.8537     0.6658   7.290  0.00533 **
## Spent        -0.6115     0.1291  -4.737  0.01784 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04976 on 3 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.8821, Adjusted R-squared:  0.8428 
## F-statistic: 22.44 on 1 and 3 DF,  p-value: 0.01784

par(mfrow=c(1,2))
plot(fitted(m.kor), resid(m.kor))
plot(fitted(m.usa), resid(m.usa))

Conclusions

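For the two countries for which linear regression models were built, the adjusted R-squared values are high (0.98 for South Korea and 0.84 for the United States), but each model rests on only four or five observations, so the fit should be interpreted with caution. Moreover, the coefficients are opposite in sign: the slope is positive for South Korea (1.05) and negative for the United States (-0.61), so the data suggest a positive correlation in one country and a negative correlation in the other.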
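
To make the uncertainty in the two slopes concrete, 95% confidence intervals can be computed from the estimates, standard errors and residual degrees of freedom reported in the `summary()` output above. This is a sketch under that assumption; the data frame name `slopes` is introduced here for illustration:

```r
# Slope estimates, standard errors and residual degrees of freedom,
# copied from the summary() output above.
slopes <- data.frame(
  country  = c("KOR", "USA"),
  estimate = c(1.04948, -0.6115),
  se       = c(0.08453, 0.1291),
  df       = c(2, 3)
)

# Two-sided 95% interval: estimate +/- t quantile * standard error.
t_crit <- qt(0.975, df = slopes$df)
slopes$lower <- slopes$estimate - t_crit * slopes$se
slopes$upper <- slopes$estimate + t_crit * slopes$se

print(slopes)
```

With the fitted model objects available, `confint(m.kor)` and `confint(m.usa)` should give the same intervals up to rounding. With so few degrees of freedom the t quantiles are large, so the intervals are wide relative to the estimates.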

From the above charts and models, the available data for the countries selected in this study do not show a consistent relationship between the level of expenditure on public education, measured as a percentage of GDP, and the number of patents filed per capita. The proposed model is probably over-simplistic: other factors, such as regional and global economic growth rates, which affect private industry investment in research and development, are likely to influence the measured output and would need to be included as explanatory variables.

Challenges

  1. Population data is not available for countries on a per-year basis, but only at 5-year intervals. To address this limitation I used the mean of the 2010 and 2015 estimates as each country's population for every year in the study. The data showed that, for the countries selected, population growth over this period was not very high.
  2. For some years in the period 2010-2016, educational expenditures were not available for some of the selected countries. To address this limitation I restricted the data to those years for which the data were available.
  3. Parsing the data in tables in Wikipedia took some time to implement correctly. Initially I started with the WikipediR package (https://cran.r-project.org/web/packages/WikipediR/index.html) but switched to the rvest package when I realized that the latter was easier to use.
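
Regarding the first challenge, one possible alternative to a single population figure per country is to interpolate linearly between the two available estimates. The sketch below is not what this study did; it uses the South Korea figures from the population table and base R's `approx()`, and the names `known_years`, `known_pop` and `pop_by_year` are illustrative:

```r
# UN population estimates for South Korea, in thousands
# (from the population table in this report).
known_years <- c(2010, 2015)
known_pop   <- c(49090, 50293)

# Linear interpolation for each year between the two estimates.
yearly <- approx(x = known_years, y = known_pop, xout = 2010:2015)
pop_by_year <- data.frame(Year = yearly$x, Population = yearly$y)

print(pop_by_year)
```

By default `approx()` returns `NA` outside the range of the known years, so a value for 2016 would require an additional estimate or an explicit extrapolation rule.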