Expenditure on public education is a subject of debate in many places around the world, and is prioritized in different ways. The usefulness or otherwise of this expenditure is one of the cornerstones of this debate. In the United States, much of the debate is centered on whether that expenditure is better utilized when allocated to the private sector.
In today’s world, the importance of intellectual capital, which largely comprises technological research and development, cannot be overstated. Nations are racing to establish a dominant position in research and applications in emerging fields such as artificial intelligence, autonomous systems and alternative energy. The filing of patents in technical fields is naturally perceived as a key indicator of technological prowess. Most economic decision-makers would probably view the filing of large numbers of patents by their private-sector companies and universities as a desirable outcome of increased expenditure on public education, if such an outcome could be obtained.
The objective of this study is to investigate whether a correlation exists between a country’s expenditure on public education (measured here as a percentage of GDP) and the number of patents that country files per capita. Several publicly available data sources are used for this purpose, along with regression modeling techniques.
This project focuses on the relationship between a country’s investment in public education and the number of patents it files per capita. I examine whether the available data show the two to be correlated.
Depending on its results, this study could be used to support making greater investment in public education in the United States.
A linear regression model is used to determine correlation among the variables.
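As a rough illustration of how a regression relates to correlation, the sketch below fits a simple linear model to a handful of made-up numbers and checks that, with a single predictor, the model's R-squared equals the squared Pearson correlation. The data frame `toy_xy` and its values are invented for this example and are not the study's data.

```r
# Made-up values for illustration only; not the study's data.
toy_xy <- data.frame(
  Spent = c(4.6, 4.9, 5.0, 5.1, 5.3),
  Applications = c(3.8, 4.1, 4.2, 4.4, 4.3)
)

# Pearson correlation between the two variables.
r <- cor(toy_xy$Spent, toy_xy$Applications)

# Simple linear regression of Applications on Spent.
fit_xy <- lm(Applications ~ Spent, data = toy_xy)

# With one predictor, R-squared is the square of the Pearson correlation.
all.equal(r^2, summary(fit_xy)$r.squared)
```

The sign of the fitted slope matches the sign of the correlation, which is why the slope estimates are inspected later in the analysis.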
The following data sources were used:
Wikipedia https://en.wikipedia.org/wiki/World_Intellectual_Property_Indicators
World Intellectual Property Organization (WIPO). http://www.wipo.int/ipstats/en/statistics/country_profile/
World Bank Group. https://data.worldbank.org/indicator/SE.XPD.TOTL.GD.ZS
# Libraries for data manipulations, analysis and web scraping.
library(dplyr)
library(ggplot2)
library(knitr)
library(RCurl)
library(rvest)
library(stringr)
library(tidyr)
library(utils)
library(XML)
library(stats)
library(grid)
library(gridExtra)
The analysis covers the six countries with the highest number of patent applications per million population in 2012. The source of this information is Wikipedia: https://en.wikipedia.org/wiki/World_Intellectual_Property_Indicators
wiki_url = "https://en.wikipedia.org/wiki/World_Intellectual_Property_Indicators"
temp1 <- wiki_url %>% read_html %>% html_nodes("table")
# List of the countries with highest patent applications for 2012
# Patent applications per million population for the top 10 origins, 2012
clist = html_table(temp1[6])
Countries = data.frame(clist)
Countries = Countries[1:6,]
# Countries = Countries %>% select(Country)
kable(Countries)
Rank | Country | Patent.applications.per.million.population |
---|---|---|
1 | South Korea | 2,962 |
2 | Japan | 2,250 |
3 | Switzerland | 1,013 |
4 | Germany | 902 |
5 | United States | 856 |
6 | Finland | 665 |
Yearly patent application counts for each selected country are obtained from the World Intellectual Property Organization (WIPO) country profiles. The list of countries is at: http://www.wipo.int/ipstats/en/statistics/country_profile/
Ccodes = list("South Korea", "Japan", "Switzerland", "Germany", "United States", "Finland")
Ccodes2 = list("KR", "JP", "CH", "DE", "US", "FI")
Ccodes3 = list("KOR", "JPN", "CHE", "DEU", "USA", "FIN")
ccount = length(Ccodes)
wipo_url = "http://www.wipo.int/ipstats/en/statistics/country_profile/profile.jsp?code="
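Patents = list()
for (ccode in Ccodes2) {
  # Build the country profile URL from the two-letter country code.
  url2 = paste(wipo_url, ccode, sep="")
  temp2 <- url2 %>% read_html %>% html_nodes("table")
  # The fifth table on the profile page is the one used for patent counts.
  grants = html_table(temp2[5])
  df = data.frame(grants)
  # Drop the header row and the unused fourth column.
  df = df[-c(1),] %>% select(-X4)
  # Strip thousands separators and add the two count columns.
  df$Applications = as.integer(gsub(",", "", df$X2)) + as.integer(gsub(",", "", df$X3))
  df = df %>% select(-c(X2, X3))
  # Drop the first 3 rows, 2007-2009.
  df = df[4:10,]
  names(df) = c("Year", "Applications")
  Patents[[ccode]] = df
}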
Display the charts of Raw Patents Counts (not adjusted for population).
par(mfrow=c(2, 3))
for (i in 1:ccount) {
  df = data.frame(Patents[i])
  names(df) = c("Year", "Total Applications")
  plot(df, main=Ccodes[i])
}
Next we read data on public education expenditure from the World Bank Group. The data are available as a zipped CSV file that can be downloaded from https://data.worldbank.org/indicator/SE.XPD.TOTL.GD.ZS
Education expenditure is shown as a percentage of GDP.
par(mfrow=c(1,1))
newdf = data.frame()
Expenditures = data.frame()
filename = "API_SE.XPD.TOTL.GD.ZS_DS2_en_csv_v2_9908693.csv"
expend2 <- read.csv(filename, skip=4)
for (cty in Ccodes3) {
  newdf = filter(expend2, Country.Code == cty)
  newdf = newdf %>% select(Country.Code, X2010:X2016)
  Expenditures = rbind(Expenditures, newdf)
}
# Education expenditure is shown as a percentage of GDP.
kable(Expenditures)
Country.Code | X2010 | X2011 | X2012 | X2013 | X2014 | X2015 | X2016 |
---|---|---|---|---|---|---|---|
KOR | NA | NA | 4.61823 | 4.93070 | 5.05672 | 5.06544 | NA |
JPN | 3.63950 | 3.64258 | 3.69226 | 3.66538 | 3.59059 | NA | NA |
CHE | 4.92605 | 4.96986 | 5.03337 | 5.04048 | 5.05123 | NA | NA |
DEU | 4.91368 | 4.80780 | 4.93331 | 4.93497 | 4.93113 | NA | NA |
USA | 5.42001 | 5.22390 | 5.19486 | 4.94379 | 4.98948 | NA | NA |
FIN | 6.54071 | 6.48201 | 7.19254 | 7.15848 | 7.15156 | NA | NA |
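The wide layout above (one column per year) has to be reshaped into one row per country and year before it can be joined with the patent data. A minimal sketch of that reshaping, using `tidyr::pivot_longer` as one possible alternative to `gather`, is shown below. The data frame `wide` holds only a few values copied from the table above, and the names `wide` and `long` are chosen for this example.

```r
library(tidyr)

# A few values copied from the expenditure table above (percent of GDP).
wide <- data.frame(
  Country.Code = c("KOR", "JPN"),
  X2012 = c(4.61823, 3.69226),
  X2013 = c(4.93070, 3.66538)
)

# One row per country and year; the "X" prefix that read.csv adds
# to numeric column names is stripped from the year labels.
long <- pivot_longer(
  wide,
  cols = starts_with("X"),
  names_to = "Year",
  names_prefix = "X",
  values_to = "Spent"
)
long$Year <- as.integer(long$Year)
long
```

Keeping `Country.Code` as a column in the long table would allow all countries to be handled in one data frame rather than in a per-country list.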
Next we read population data by year for the selected countries. These data are required to calculate the number of patent applications filed per capita. Without them we would be comparing raw patent counts, which are not comparable between countries with large and small populations. This information is obtained from Wikipedia.
wiki_url3 = "https://en.wikipedia.org/wiki/List_of_countries_by_past_population_(United_Nations,_estimates)"
temp3 <- wiki_url3 %>% read_html %>% html_nodes("table")
# Country Population data by Year.
clist = html_table(temp3[1])
pop.all = data.frame(clist)
names(pop.all)[1] = "Country"
pop.all = pop.all %>% select(Country, X2010, X2015)
Populations = data.frame()
for (cy in Ccodes) {
  Populations = rbind(Populations, subset(pop.all, pop.all$Country == cy))
}
Populations$Mean = (as.integer(gsub(",", "", Populations$X2010)) +
                    as.integer(gsub(",", "", Populations$X2015)))/2
kable(Populations)
Row | Country | X2010 | X2015 | Mean |
---|---|---|---|---|
193 | South Korea | 49,090 | 50,293 | 49691.5 |
104 | Japan | 127,320 | 126,573 | 126946.5 |
201 | Switzerland | 7,831 | 8,299 | 8065.0 |
77 | Germany | 80,435 | 80,689 | 80562.0 |
220 | United States | 309,876 | 321,774 | 315825.0 |
71 | Finland | 5,368 | 5,503 | 5435.5 |
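The population figures in the table above are in thousands, so dividing a raw application count by them yields applications per thousand inhabitants. The sketch below shows the arithmetic and a conversion to a per-million rate. The helpers `parse_count` and `per_million` are hypothetical names introduced for this example, and the application count is a made-up round number, not WIPO data.

```r
# Hypothetical helpers for illustration; not part of the original analysis.

# Convert a formatted count such as "49,090" to an integer.
parse_count <- function(x) as.integer(gsub(",", "", x))

# Applications per million inhabitants, given population in thousands.
per_million <- function(applications, population_thousands) {
  applications / population_thousands * 1000
}

# Mean of the 2010 and 2015 South Korea populations from the table above.
pop_kor <- (parse_count("49,090") + parse_count("50,293")) / 2

# The application count is a made-up round number, not WIPO data.
per_million(applications = 100000, population_thousands = pop_kor)
```

Stating the unit explicitly (per thousand or per million) makes the resulting rates easier to compare with published figures such as the Wikipedia ranking used earlier.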
# Build per-country spending tables and scale patent counts by population.
Spending = list()
for (i in 1:ccount) {
  cty2 = unlist(Ccodes2[i])
  cty3 = unlist(Ccodes3[i])
  # Divide patent counts by population means to get per-capita patents.
  # Populations are in thousands, so the result is applications per thousand inhabitants.
  # This overwrites Patents in place, so the loop should be run only once.
  Patents[[cty2]]$Applications = Patents[[cty2]]$Applications / Populations[i,]$Mean
  # Reshape this country's expenditure row from wide (one column per year) to long.
  spending = filter(Expenditures, Country.Code == cty3)
  names(spending) = c("Country.Code", "2010":"2016")
  spending = spending %>%
    gather(Year, Spent, "2010":"2016") %>%
    select(-c(Country.Code))
  Spending[[cty2]] = spending
}
# Make a combined Expenditure vs Patents-per-capita DF.
Models = list()
for (i in 1:ccount) {
  cty = unlist(Ccodes3[i])
  df = inner_join(Spending[[i]], Patents[[i]], by = "Year")
  df$Year = as.numeric(df$Year)
  Models[[cty]] = df
}
Display the charts of Patents vs Educational Expenditure for all countries.
for (i in 1:ccount) {
  cty3 = unlist(Ccodes3[i])
  cty = unlist(Ccodes[i])
  df = Models[[cty3]]
  p1 = ggplot(data=df, aes(x=Year)) +
    geom_line(aes(y=Spent)) +
    ggtitle(paste(cty, ": Public Education", sep="")) +
    ylab("Public Education Expenditure / GDP")
  p2 = ggplot(data=df, aes(x=Year)) +
    geom_line(aes(y=Applications)) +
    ggtitle(paste("Patent Applications per Capita")) +
    ylab("Patents per capita")
  grid.arrange(p1, p2, ncol=2)
}
There seems to be no particular pattern in the charts showing expenditure (as a percentage of GDP) against patent applications per capita. A linear regression model was fitted for two of the countries in the data set, South Korea and the United States.
kor = Models[["KOR"]]
usa = Models[["USA"]]
m.kor = lm(Applications ~ Spent, data = kor)
summary(m.kor)
##
## Call:
## lm(formula = Applications ~ Spent, data = kor)
##
## Residuals:
## 3 4 5 6
## 0.003296 -0.009210 -0.026698 0.032613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.04829 0.41597 -2.52 0.12793
## Spent 1.04948 0.08453 12.42 0.00642 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03059 on 2 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.9872, Adjusted R-squared: 0.9808
## F-statistic: 154.2 on 1 and 2 DF, p-value: 0.006425
m.usa = lm(Applications ~ Spent, data = usa)
summary(m.usa)
##
## Call:
## lm(formula = Applications ~ Spent, data = usa)
##
## Residuals:
## 1 2 3 4 5
## 0.01303 -0.06461 0.04186 -0.02050 0.03021
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.8537 0.6658 7.290 0.00533 **
## Spent -0.6115 0.1291 -4.737 0.01784 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04976 on 3 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.8821, Adjusted R-squared: 0.8428
## F-statistic: 22.44 on 1 and 3 DF, p-value: 0.01784
par(mfrow=c(1,2))
plot(fitted(m.kor), resid(m.kor))
plot(fitted(m.usa), resid(m.usa))
For the two countries modeled, the adjusted R-squared values are high (0.98 for South Korea and 0.84 for the United States). However, each model is fitted to only four or five yearly observations, so these values should be read with caution. The slope coefficients are also opposite in sign: positive for South Korea and negative for the United States, where the data suggest a negative correlation.
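Because each model rests on only four or five yearly observations (see the residual degrees of freedom in the summaries above), a high R-squared alone says little about how precisely the slope is estimated. One way to look at this, sketched below with made-up numbers, is to inspect the confidence interval of the slope with `confint`. The data frame `toy_small` is invented for this example and is not the study's data.

```r
# Made-up values for illustration only; not the study's data.
toy_small <- data.frame(
  Spent = c(4.6, 4.9, 5.0, 5.1),
  Applications = c(3.8, 4.1, 4.2, 4.4)
)

fit_small <- lm(Applications ~ Spent, data = toy_small)

# Four observations and two estimated coefficients leave 2 residual degrees of freedom.
df.residual(fit_small)

# Adjusted R-squared and a 95% confidence interval for the slope.
summary(fit_small)$adj.r.squared
confint(fit_small, "Spent", level = 0.95)
```

With so few degrees of freedom the interval is typically wide relative to the point estimate, which is a reason to treat per-country fits of this size as exploratory.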
From the charts and models above, the available data for the selected countries do not show a consistent relationship between the level of expenditure on public education, measured as a percentage of GDP, and the number of patents filed per capita. The proposed model is likely too simple. Other factors, such as regional and global economic growth rates, which affect private industry investment in research and development, probably influence the measured output and would need to be included as additional variables.
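One possible starting point for a richer model, offered here only as a sketch under stated assumptions, is to pool the countries into a single data frame and let each country have its own intercept while sharing one expenditure slope. The panel below is made up, with two hypothetical countries "A" and "B", and the choice of a country factor is an assumption of this example rather than something the study carried out.

```r
# Made-up panel for two hypothetical countries; not the study's data.
panel <- data.frame(
  Country = rep(c("A", "B"), each = 4),
  Year = rep(2010:2013, times = 2),
  Spent = c(4.6, 4.9, 5.0, 5.1, 5.4, 5.2, 5.2, 4.9),
  Applications = c(3.8, 4.1, 4.2, 4.4, 1.5, 1.6, 1.7, 1.8)
)

# Country enters as a factor, so each country gets its own intercept
# while the Spent slope is shared across countries.
pooled <- lm(Applications ~ Spent + factor(Country), data = panel)

coef(pooled)
df.residual(pooled)
```

Additional covariates, such as a growth-rate column, could be added to the formula in the same way once the corresponding data are available.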