DATA 607 Final Project

Aims and Motivation

Expenditure on public education is a subject of debate in many places around the world, and is prioritized in different ways. The usefulness or otherwise of this expenditure is one of the cornerstones of this debate. In the United States, much of the debate is centered on whether that expenditure is better utilized when allocated to the private sector.

In today’s world the importance of the so-called intellectual capital, largely comprising of technological research and development, cannot be overstated. In technology, there is a worldwide race among nations to establish a dominant position in research and applications of emerging areas, in fields such as artifical intelligence, autonomous systems and alternative energy. The filing of patents in technical fields in general is naturally perceived to be a key indicator of technological prowess. Most economic decision-makers would probably view the filing of large numbers of patents by their private sector companies and universities to be a desirable outcome of increasing expenditure on public education, if such an outcome could be obtained.

The objective of this study is to investigate whether a correlation exists between the amount of per-capita expenditure that a country allocates to public education, and the number of patents filed per-capita by that country. For this, data from several publically available data sources are used, along with regression modeling techniques.

This project focuses on the relationship between a country’s per-capita investment in public higher education, and its number of patents filed per capita. Depending on the data, I shall examine whether the two are correlated.

Depending on its results, this study could be used to support making greater investment in public education in the United States.

A linear regression model is used to determine correlation among the variables.

Data sources:

The following data sources were used:

Wikipedia https://en.wikipedia.org/wiki/World_Intellectual_Property_Indicators
World Intellectual Property Organization (WIPO). http://www.wipo.int/ipstats/en/statistics/country_profile/
World Bank Group. https://data.worldbank.org/indicator/SE.XPD.TOTL.GD.ZS

# Libraries for data manipulations, analysis and web scraping.
library(dplyr)
library(ggplot2)
library(knitr)
library(RCurl)
library(rvest)
library(stringr)
library(tidyr)
library(utils)
library(XML)
library(stats)
library(grid)
library(gridExtra)

Step 1: Select the countries to include in this study.

For this, we select the top 6 countries with the highest number of patent applications per million of population. The source of this information is Wikipedia: https://en.wikipedia.org/wiki/World_Intellectual_Property_Indicators

wiki_url = "https://en.wikipedia.org/wiki/World_Intellectual_Property_Indicators"
temp1 <- wiki_url %>% read_html %>% html_nodes("table")

# List of the countries with highest patent applications for 2012
# Patent applications per million population for the top 10 origins, 2012
clist = html_table(temp1[6])
Countries = data.frame(clist)
Countries = Countries[1:6,]
# Countries = Countries %>% select(Country)
kable(Countries)

Rank	Country	Patent.applications.per.million.population
1	South Korea	2,962
2	Japan	2,250
3	Switzerland	1,013
4	Germany	902
5	United States	856
6	Finland	665

Patents by Country.

This data is obtained from World Intellectual Property Organization (WIPO). The list of countries is from: http://www.wipo.int/ipstats/en/statistics/country_profile/

Ccodes = list("South Korea", "Japan", "Switzerland", "Germany", "United States", "Finland")
Ccodes2 = list("KR", "JP", "CH", "DE", "US", "FI")
Ccodes3 = list("KOR", "JPN", "CHE", "DEU", "USA", "FIN")
ccount = length(Ccodes)
wipo_url = "http://www.wipo.int/ipstats/en/statistics/country_profile/profile.jsp?code="

Patents = list()
i = 0
for (ccode in Ccodes2) {
    i = i + 1
    url2 = paste(wipo_url, ccode, sep="")
    temp2 <- url2 %>% read_html %>% html_nodes("table")
    grants = html_table(temp2[5])
    df = data.frame(grants)
    df = df[-c(1),] %>% select(-X4)
    df$Applications = as.integer(gsub(",", "", df$X2)) + as.integer(gsub(",", "", df$X3))
    df = df %>% select(-c(X2, X3))
    
    # Drop the first 3 rows, 2007-2009.
    df = df[4:10,]
    names(df) = c("Year", "Applications")
    Patents[[ccode]] = df
}

Display the charts of Raw Patents Counts (not adjusted for population).

par(mfrow=c(2, 3))
for (i in 1:ccount) {
    df = data.frame(Patents[i])
    names(df) = c("Year", "Total Applications")
    plot(df, main=Ccodes[i])
}

Public Education Expenditures

Now we read data on public education expenditures. Data source is World Bank Group. This data is available as a zipped CSV file downloaded from https://data.worldbank.org/indicator/SE.XPD.TOTL.GD.ZS

Education expenditure is shown as a percentage of GDP.

par(mfrow=c(1,1))
newdf = data.frame()
Expenditures = data.frame()
filename = "API_SE.XPD.TOTL.GD.ZS_DS2_en_csv_v2_9908693.csv"
expend2 <- read.csv(filename, skip=4)

for (cty in Ccodes3) {
    newdf = filter(expend2, Country.Code == cty)
    newdf = newdf %>% select(Country.Code, X2010:X2016)
    Expenditures = rbind(Expenditures, newdf)
}

# Education expenditure is shown as a percentage of GDP.
kable(Expenditures)

Country.Code	X2010	X2011	X2012	X2013	X2014	X2015	X2016
KOR	NA	NA	4.61823	4.93070	5.05672	5.06544	NA
JPN	3.63950	3.64258	3.69226	3.66538	3.59059	NA	NA
CHE	4.92605	4.96986	5.03337	5.04048	5.05123	NA	NA
DEU	4.91368	4.80780	4.93331	4.93497	4.93113	NA	NA
USA	5.42001	5.22390	5.19486	4.94379	4.98948	NA	NA
FIN	6.54071	6.48201	7.19254	7.15848	7.15156	NA	NA

Population Growth Data

Next we read data on population growth by year of the selected countries. This data is required to calculate the per-capita patent applications filed. Without this information we would be using raw patent counts which would not normalize for countries with large and small populations. This information is obtained from Wikipedia.

wiki_url3 = "https://en.wikipedia.org/wiki/List_of_countries_by_past_population_(United_Nations,_estimates)"
temp3 <- wiki_url3 %>% read_html %>% html_nodes("table")

# Country Population data by Year.
clist = html_table(temp3[1])
pop.all = data.frame(clist)
names(pop.all)[1] = "Country"
pop.all = pop.all %>% select(Country, X2010, X2015)
Populations = data.frame()

for (cy in Ccodes) {
    Populations = rbind(Populations, subset(pop.all, pop.all$Country == cy))
}

Populations$Mean = (as.integer(gsub(",", "", Populations$X2010)) +
                    as.integer(gsub(",", "", Populations$X2015)))/2
kable(Populations)

	Country	X2010	X2015	Mean
193	South Korea	49,090	50,293	49691.5
104	Japan	127,320	126,573	126946.5
201	Switzerland	7,831	8,299	8065.0
77	Germany	80,435	80,689	80562.0
220	United States	309,876	321,774	315825.0
71	Finland	5,368	5,503	5435.5

Data Visualization

# Add Patents columns to expenditures
Spending = list()

patents = data.frame()
i = 1
for (cty in Ccodes2) {
    cty2 = unlist(Ccodes2[i])
    cty3 = unlist(Ccodes3[i])
    
    # Divide patents counts by population means to get per-capita patents.
    Patents[[cty2]]$Applications = Patents[[cty2]]$Applications / Populations[i,]$Mean
    df = Patents[[cty2]]
    
    spending = filter(Expenditures, Country.Code == cty3)
    names(spending) = c("Country.Code", "2010":"2016")
    spending = spending %>%
               gather(Year, Spent, "2010":"2016") %>%
               select(-c(Country.Code))
    Spending[[cty2]] = spending
    i = i + 1
}


# Make a combined Expenditure vs Patents-per-capita DF.
Models = list()
for (i in 1:ccount) {
    cty = unlist(Ccodes3[i])
    df = inner_join(Spending[[i]], Patents[[i]], by = "Year")
    df$Year = as.numeric(df$Year)
    Models[[cty]] = df
}

Display the charts of Patents vs Educational Expenditure for all countries.

for (i in 1:ccount) {
    cty3 = unlist(Ccodes3[i])
    cty = unlist(Ccodes[i])
    df = Models[[cty3]]
    
    p1 = ggplot(data=df, aes(x=Year)) +
         geom_line(aes(y=Spent)) +
         ggtitle(paste(cty, ": Public Education", sep="")) +
         ylab("Public Education Expenditure / GDP")
    p2 = ggplot(data=df, aes(x=Year)) +
         geom_line(aes(y=Applications)) + 
         ggtitle(paste("Patent Applications per Capita")) +
         ylab("Patents per capita")
    grid.arrange(p1, p2, ncol=2)
}

Regression Model

There seems to be no particular pattern in the charts showing Expenditures (as percentage of GDP) against Patent Applications per capita. A linear regression model was attempted for a few countries in the data set.

kor = Models[["KOR"]]
usa = Models[["USA"]]

m.kor = lm(Applications ~ Spent, data = kor)
summary(m.kor)

## 
## Call:
## lm(formula = Applications ~ Spent, data = kor)
## 
## Residuals:
##         3         4         5         6 
##  0.003296 -0.009210 -0.026698  0.032613 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -1.04829    0.41597   -2.52  0.12793   
## Spent        1.04948    0.08453   12.42  0.00642 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03059 on 2 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.9872, Adjusted R-squared:  0.9808 
## F-statistic: 154.2 on 1 and 2 DF,  p-value: 0.006425

m.usa = lm(Applications ~ Spent, data = usa)
summary(m.usa)

## 
## Call:
## lm(formula = Applications ~ Spent, data = usa)
## 
## Residuals:
##        1        2        3        4        5 
##  0.01303 -0.06461  0.04186 -0.02050  0.03021 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   4.8537     0.6658   7.290  0.00533 **
## Spent        -0.6115     0.1291  -4.737  0.01784 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04976 on 3 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.8821, Adjusted R-squared:  0.8428 
## F-statistic: 22.44 on 1 and 3 DF,  p-value: 0.01784

par(mfrow=c(1,2))
plot(fitted(m.kor), resid(m.kor))
plot(fitted(m.usa), resid(m.usa))

Conclusions

In the two countries for which linear regression models were built, the Adjusted R-squared values are relatively high, indicating a good fit. However, the coefficients are opposite in sign, reflecting the fact that for one of the two (USA) the data suggest a negative correlation.

From the above charts it seems clear that, for the countries selected in this study, the available data do not show a clear relationship between the level of expenditure on public education, measured as a percentage of GDP, and the number of patents filed per-capita. It is likely that other macro-economic or global factors affect the measured output, and the proposed model is over-simplistic. It is likely that more factors, such as regional and global economic growth rates, which affect private industry investment in research and development, influence the measured output and need to be included as significant variables.

Challenges

Population growth data is not available for countries on a per-year basis, but only on 5-year intervals. To address this limitation I used the data from the closest year to estimate the population on a given year. Data showed that for the countries selected, population growth rates in the 6-year interval used were not very high.
For some the period 2010-2016, educational expenditures were not available for some of the selected countries. To address this limitation I restricted the data to those years for which the data were available.
Parsing the data in tables in Wikipedia took some time to implement correctly. Initially I started with the WikepediR package (https://cran.r-project.org/web/packages/WikipediR/index.html) but switched to the rvest package when I realized that the latter was easier to use.